Module: network (SpatialGraph)

File: src/spatial_graph_algorithms/network.py Status: Stable — core schema. Breaking changes require a major version bump.

Purpose

SpatialGraph is the single shared data object for the entire package. Every pipeline function takes one as input and returns one as output. Its job is to hold adjacency, positions, metadata, and reconstructed coordinates in one place, with a small set of invariants enforced at construction time.

Loading Real Data

Use :func:spatial_graph_algorithms.load_edge_list to load an observational edge list. It accepts a CSV path or a pandas DataFrame — whichever you already have. No ground-truth positions, false-edge labels, or simulation metadata are set, so has_ground_truth is False and positions, edge_metadata, and node_metadata are all None.

From a CSV file:

from spatial_graph_algorithms import load_edge_list

sg = load_edge_list("my_experiment.csv")                           # default column names
sg = load_edge_list("edges.csv", source_col="cell_a", target_col="cell_b")  # custom names

From a pandas DataFrame (in-memory edge list):

Node identifiers must be integers — a ValueError is raised otherwise. If your IDs are strings or other types, convert them before loading:

# encode string cell IDs to integers first
df["source"] = df["source"].astype("category").cat.codes
df["target"] = df["target"].astype("category").cat.codes

import pandas as pd
from spatial_graph_algorithms import load_edge_list

df = pd.DataFrame({
    "source": [0, 1, 2],
    "target": [1, 2, 0],
})
sg = load_edge_list(df)

# Custom column names work the same way
df2 = pd.DataFrame({"u": [0, 1], "v": [1, 2]})
sg2 = load_edge_list(df2, source_col="u", target_col="v")

print(sg)
# SpatialGraph(nodes=3, edges=3, has_ground_truth=False, reconstructed=False)

print(sg.has_ground_truth)   # False
print(sg.positions)          # None
print(sg.edge_metadata)      # None — no is_false column
print(sg.node_metadata)      # None — no simulation params
print(sg.node_id_map)        # {0: 0, 1: 1, 2: 2}

Once loaded, pass sg to reconstruct() and metrics.graph_properties() as normal. Functions that require ground truth (e.g. evaluate() comparing reconstructed vs. true positions) will raise a clear error if has_ground_truth is False.

Design Decisions

Why a dataclass, not a class? A dataclass makes all fields visible at a glance and avoids boilerplate __init__. The eq=False override prevents accidental equality comparisons on large sparse matrices.

Why is adjacency stored as CSR? CSR is the standard format for row-wise sparse operations (shortest paths, degree sums). The setter normalises any input format to CSR automatically so callers don't need to know.

Why does the constructor extract the LCC by default? Disconnected graphs cause undefined behaviour in shortest-path reconstruction. keep_lcc=True is a safe default that silently discards isolated nodes with a warning. Set keep_lcc=False when you know the graph is connected or want to preserve all nodes.

Why is nothing mutated in place? Functions that transform a graph (e.g. reconstruct) return a .copy() with the new field set. This prevents silent state aliasing bugs in pipeline code.

Invariants

These hold after __post_init__ and must be preserved by any code that modifies the object:

adjacency_matrix is symmetric, CSR, with zero diagonal and no duplicate entries.
If positions is set, positions.shape[0] == adjacency_matrix.shape[0].
node_id_map maps every original ID to a unique integer in 0..n-1.

How to Extend

Adding a new field: Add it to the dataclass with Optional[...] = None. Update copy() to deep-copy it. If it should survive LCC extraction, add the slice logic to _extract_lcc.

Adding a new constructor: Follow the pattern of from_edge_list and from_positions — they are @classmethod methods that build the adjacency matrix and call cls(...). Always pass keep_lcc through **kwargs so callers can control LCC behaviour.

Do not: - Add algorithm logic to network.py. It is a data container only. - Change the adjacency setter to accept non-square inputs. - Store mutable defaults in field declarations.

Computed Properties

The following are always available as read-only properties — no imports needed:

Property	Type	Notes
`n_nodes`	`int`	`adjacency_matrix.shape[0]`
`n_edges`	`int`	`adjacency_matrix.nnz // 2`
`edge_density`	`float`	`2m / (n(n-1))`
`degree_sequence`	`np.ndarray`	cached, invalidated on adjacency change
`mean_degree`	`float`	mean of `degree_sequence`
`degree_distribution`	`np.ndarray`	counts per degree value
`n_connected_components`	`int`	lazy, cached; always 1 for default `keep_lcc=True` graphs
`largest_component_fraction`	`float`	lazy, cached alongside `n_connected_components`
`false_edge_fraction`	`float \\| None`	reads `edge_metadata["is_false"]`; `None` when absent
`has_ground_truth`	`bool`	`positions is not None`
`graph`	`nx.Graph`	lazy NetworkX view, cached

n_connected_components and largest_component_fraction are computed together in one scipy call and both cached. For graphs built with keep_lcc=True (the default), the call happens during construction and costs nothing at property access time.

Common Mistakes

Mistake	Why it breaks	Fix
`sg.adjacency_matrix = new_adj` without going through the setter	Bypasses normalisation	Use the property setter
`sg2 = sg; sg2.positions = ...`	Aliases the object, mutates the original	Use `sg.copy()`
Storing a dense array in `adjacency_matrix`	Breaks `nnz`-based edge count	Convert to sparse first

Tests

tests/test_network.py — construction, LCC extraction, from_edge_list, from_positions, to_edge_dataframe, to_positions_dataframe, copy.