refnd.core¶
Graph construction¶
- class refnd.core.EdgeStore(node_count: int, edges: Sequence[tuple[int, int, float]])¶
Bases:
objectA compact, flat list of weighted directed edges between integer node IDs.
EdgeStoreis the central data carrier in refnd: it is produced byexact_edgesandHNSWState.edges, consumed byCsrGraph, and can be persisted to disk for later reuse.Each edge is a triple
(src, dst, weight)wheresrcanddstare zero-based node indices in[0, node_count)andweightis afloat32similarity score (higher = more similar, unless the graph is built withis_weight_distance=True, in such case it’s a real distance[0, ∞)that will be converted to a similarity score).Example:
from refnd.core import EdgeStore store = EdgeStore(node_count=3, edges=[(0, 1, 0.9), (1, 2, 0.7)]) print(len(store)) # 2 print(store[0]) # (0, 1, 0.9) for src, dst, w in store: print(src, dst, w)
- edges() list[tuple[int, int, float]]¶
Return all edges as a list of
(src, dst, weight)triples.
- graph(weighted: bool = True, is_weight_distance: bool = True) CsrGraph¶
Build a
CsrGraphfrom this edge store.- Parameters:
use_weight – If
True, edge weights are used for graph operations (e.g. strength). IfFalse, all edges are treated as unweighted (weight = 1.0).is_weight_distance – If
True, edges weights are normalized to have a maximal bound of 1 using this formula:1.0 / (1.0 + w)since the Leiden algorithm works on similarity graphs.
- Returns:
A
CsrGraphbacked by this edge list.
- static load(path: str) EdgeStore¶
Load an EdgeStore that was previously saved with
EdgeStore.save. It infers the format from the extension of the path,.edgelistor.edgestr.- Parameters:
path – Path to the file produced by
save.- Returns:
The deserialized
EdgeStore.- Raises:
IOError – If the file cannot be read or the format is invalid.
- node_count() int¶
Return the total number of nodes this store was created with.
- save(path: str) None¶
Serialize this EdgeStore to disk. It supports two file formats:
textwith.edgelistextension orbinarywith.edgestrextension. Binary is usually 2x more space efficient at the cost of not being human-readable.- Parameters:
path – Destination file path.
- Raises:
IOError – On any I/O failure.
Example:
from refnd.core import EdgeStore store = EdgeStore(node_count=3, edges=[(0, 1, 0.9), (1, 2, 0.7)]) store.save("my/path/myedges.edgelist") # Text format store.save("my/path/myedges.edgestr") # Bin format
- class refnd.core.CsrGraph(edges: EdgeStore, use_weight: bool = True, is_weight_distance: bool = True)¶
Bases:
objectCompressed Sparse Row graph built from an
EdgeStore. A CSR graph is compact, and highly efficient for traversal, exploiting cache structure and maximizing cache hits. However, it is immutable, so adding a new node or edge require building the full graph again.CsrGraphis an immutable adjacency structure. Because of its property of being very efficient for traversal, it is used by graph algorithms such aspartitionandconnected_components.- Properties:
n: Number of nodes.m: Sum of all edge weights (or total edge count whenuse_weight=False).
Example:
from refnd.core import EdgeStore, CsrGraph store = EdgeStore(4, [(0,1,0.9),(1,2,0.8),(2,3,0.6)]) g = CsrGraph(store, use_weight=True) print(g.n) # 4 print(g.neighbors(1)) # [(0, 0.9), (2, 0.8)] print(g.strength(1)) # 1.7
- m¶
Total weight (each edge counted once)
- n¶
Number of nodes
- neighbors(v: int) list[tuple[int, float]]¶
Return the neighbors of node
vas a list of(node_id, weight)pairs.- Parameters:
v – Zero-based node index.
- strength(v: int) float¶
Return the strength (weighted degree) of node
v.When the graph is unweighted this equals the degree (number of neighbors).
- Parameters:
v – Zero-based node index.
HNSW approximate nearest neighbours¶
- class refnd.core.HNSWConfig(proximity_threshold: float = 0.5, ef_construction: int = 64, m: int = 16, m_max: int = 16, m_max0: int = 32, m_l: float = 0.36, ef_init: int = 1, extend_candidates: bool = False, keep_pruned_connections: bool = True, cache_capacity: int = 2000000, cache_shards: int = 64, n_threads: int = 0, shuffle: bool = False, use_heuristic: bool = True, strict_ef: bool = False, threshold_based_neighbourhood: bool = False)¶
Bases:
objectConfiguration for the HNSW approximate nearest-neighbour index.
All parameters have sensible defaults; in most cases you only need to
ef_construction, andproximity_threshold.Parameters:
proximity_threshold(float) — Distance threshold under which a connection is established in the proximity graph.ef_construction(int) — Candidate list size during build. Default64.m(int) — Bidirectional links per node at layers > 0. Higher = better recall, more memory, slower build. Typical: 8–64.m_max(int) — Maximum connections per node at layers > 0. Usually equal tom.m_max0(int) — Maximum connections at layer 0. Usually2 * m.m_l(float) — Level generation factor. Usually0.36 ≈ 1 / ln(m).ef_init(int) — Candidate list size during initial-layer insertion. Almost always 1 for a greedy search.extend_candidates(bool) — Use neighbors of neighbors as candidates, making search more exhaustive.keep_pruned_connections(bool) — Retain discarded candidates to fill up tomconnections when not enough connections are found.cache_capacity(int) — Maximum cached kernel scores. Increasing it increase the memory footprint, but also cache hits, which can improve runtime performances for computationally expensive kernels.cache_shards(int) — Number of cache shards (reduces lock contention).n_threads(int) — Threads used during build.0= all available cores.shuffle(bool) — Shuffle insertion order before building. Can create a less biased graph, but reduces cache hits, which increases runtime.use_heuristic(bool) — Use the heuristic neighbour-selection from the paper (recommended).strict_ef(bool) — IfTrue, enforces the result set size to exactlyefduring search. Empirically, setting this toFalsecan improve runtime performance, as it allows halvingef_constructionwithout sacrificing accuracy.threshold_based_neighbourhood(bool) — Select a minimum ofmneighbors like the classic algorithm, but doesn’t bound the neighbourhood size as all candidates that are closer than the threshold are kept.
- cache_capacity¶
- cache_shards¶
- dict() dict¶
Return the configuration as a plain Python dict.
- ef_construction¶
- ef_init¶
- extend_candidates¶
- keep_pruned_connections¶
- m¶
- m_l¶
- m_max¶
- m_max0¶
- n_threads¶
- proximity_threshold¶
- shuffle¶
- strict_ef¶
- threshold_based_neighbourhood¶
- use_heuristic¶
- class refnd.core.HNSWState(variant: kernels.KernelVariant, data: Any, *args: Any, proximity_threshold: float = 0.5, ef_construction: int = 64, m: int = 16, m_max: int = 16, m_max0: int = 32, m_l: float = 0.36, ef_init: int = 1, extend_candidates: bool = False, keep_pruned_connections: bool = True, cache_capacity: int = 2000000, cache_shards: int = 64, n_threads: int = 0, shuffle: bool = False, use_heuristic: bool = True, strict_ef: bool = False, threshold_based_neighbourhood: bool = False, **kwargs: Any)¶
Bases:
objectStateful Hierarchical Navigable Small World (HNSW) approximate nearest-neighbour index. This class is used to build and search in the index, whereas
HNSWIndexis a read-only internal representation of the index used for saving / loading / structure exploration.HNSWStatewraps the dataset and the graph index together. The typical workflow is:Construct with the data and a
KernelVariant.Call
build()to insert all items and form the graph.Call
search()to query, oredges()to extract the proximity graph for downstream clustering / splitting.
HNSW parameters (
proximity_threshold,ef_construction, …) are the same asHNSWConfigand can be passed directly to the constructor as keyword arguments.- Parameters:
variant – Kernel to use (e.g.
KernelVariant.ProteinGlobal,KernelVariant.ProteinLocal,KernelVariant.TanimotoBit, etc).data – The dataset — a list of items matching the kernel type (e.g.
list[str]orlist[np.ndarray]).proximity_threshold, ef_construction, m, m_max, m_max0, m_l, ef_init, extend_candidates, – keep_pruned_connections, cache_capacity, cache_shards, n_threads, shuffle, use_heuristic, strict_ef, threshold_based_neighbourhood: See
HNSWConfigfor descriptions.
- Properties:
config (HNSWConfig): The config in use. index (HNSWIndex): A snapshot of the current graph structure.
Example:
from refnd import HNSWState, KernelVariant seqs = ["MKTAYIAK", "MKTAYIAKQR", "ACDEFGHIKLM", "MKTAYIAKQRQIS"] state = HNSWState(KernelVariant.ProteinGlobal, seqs, proximity_threshold=0.3, ef_construction=64) state.build() results = state.search(["MKTAYIAK"], k=2) # results[0] -> [(0, 1.0), (1, 0.88)] store = state.edges() # EdgeStore for graph-based splitting state.save("index.hnsw") state2 = HNSWState.load(KernelVariant.ProteinGlobal, "index.hnsw", seqs)
- build(progress: bool = True) None¶
Build the HNSW index by inserting all data items.
Must be called before
searchoredges. Callingbuilda second time raises aRuntimeError; construct a newHNSWStateinstead.- Parameters:
progress – Display a progress bar. Defaults to
True.- Raises:
RuntimeError – If the index has already been built.
- config¶
Config used to initiate this object.
- edges() EdgeStore¶
Extract the proximity graph as an
EdgeStore.Returns all edges found during
build. All distance measurements below theproximity_threshold.- Returns:
An
EdgeStorewithnode_count = dataset_size.
- get_layer(layer_idx: int) list[list[int]]¶
Return the adjacency lists for a specific HNSW layer.
- Parameters:
layer_idx – Zero-based layer index (0 = base layer with most nodes).
- Returns:
A list of length
dataset_sizewhere elementiis the list of neighbour IDs of nodeiat this layer. Node IDs are their index in the original dataset.- Raises:
IndexError – If
layer_idxis out of range.
- index¶
HNSWIndex snapshot.
- is_built¶
Whether
buildhas been called on this index.
- static load(variant: kernels.KernelVariant, path: str, data: Any, *args: Any, **kwargs: Any) HNSWState¶
Load an HNSWState form a binary file saved with
HNSWState.save.- Parameters:
variant – Must match the kernel used during the original build.
path – Path to the saved file.
data – The original dataset (required to re-attach the kernel).
- Returns:
The restored
HNSWState, ready to callsearchoredgesif the index was built.- Raises:
RuntimeError – If the file cannot be read or the format is invalid.
- save(path: str) None¶
Serialize the full state (index + config) to a binary file.
The saved file can be loaded back with
HNSWState.load. The original data must be provided again at load time (it is not embedded in the file).- Parameters:
path – Destination file path.
- Raises:
RuntimeError – On serialisation or I/O failure.
- search(queries: Any, k: int = 1, ef: int = 64, threads: int = 0, progress: bool = True) list[list[tuple[int, float]]]¶
Search the index for approximate nearest neighbours.
For each query item, returns the
kmost similar items in the dataset, sorted by descending similarity. The quality of the approximation is controlled byef: larger values explore more candidates and improve recall at the cost of speed.- Parameters:
queries – List of query items (same type as the indexed data).
k – Number of nearest neighbours to return per query. Defaults to
1.ef – Size of the dynamic candidate list during search. Must be ≥
k. Defaults to64.threads – Number of parallel threads.
0= all available cores.progress – Display a progress bar. Defaults to
True.
- Returns:
A list of length
len(queries). Each element is a sorted list of up toktuples(dataset_index, similarity_score).- Raises:
RuntimeError – If
buildhas not been called yet.
- class refnd.core.HNSWIndex¶
Bases:
objectRead-only snapshot of the HNSW graph structure after a build.
Obtained via
HNSWState.index. Primarily useful for inspection, serialisation, and debugging; normal users interact withHNSWStateinstead.- Properties:
dataset_size (int): Number of items indexed. layers (list): Nested multi-layer adjacency list
layers[layer][node] = [neighbor_ids]. entry_point (tuple | None):(layer, node_id)of the global entry point, orNoneif empty. max_layers (int): Number of layers in the hierarchy. proximity_edges (list): List of((src, dst), score)for proximity-threshold edges. config (HNSWConfig): The config used to build this index.
- config¶
- dataset_size¶
- entry_point¶
- layers¶
- static load(path: str) HNSWIndex¶
Load an index previously saved with
HNSWIndex.save.- Parameters:
path – Path to the saved file.
- Returns:
The deserialised
HNSWIndex.- Raises:
RuntimeError – If the file cannot be read or the format is invalid.
- max_layers¶
- proximity_edges¶
- save(path: str) None¶
Save the object to a binary representation (e.g. .hnsw file).
- Parameters:
path – Destination file path (recommended with a .hnsw extension)
- Raises:
RuntimeError – On serialisation or I/O failure.
Graph algorithms¶
- class refnd.core.LeidenObjective¶
Bases:
objectObjective function used by the Leiden community-detection algorithm.
Modularity— Maximise Newman-Girvan modularity. Good default for most graphs.CPM— Constant Potts Model. Finds communities of a fixed internal density.
- CPM = LeidenObjective.CPM¶
- Modularity = LeidenObjective.Modularity¶
- refnd.core.find_communities(graph: CsrGraph, gamma: float = 1.0, beta: float = 0.01, n_iterations: int = 2, objective: LeidenObjective = LeidenObjective.Modularity) list[int]¶
Detect communities in a graph with the Leiden algorithm.
The Leiden algorithm is an improvement over Louvain that guarantees well-connected communities. Use the returned cluster labels with
partitionto produce train/test splits that consider community boundaries.- Parameters:
graph – The graph to partition into communities.
gamma – Resolution parameter. Higher values produce more, smaller communities. For
Modularity, typical range is 0.5–2.0; default1.0. ForCPM, it represents the minimum internal edge density.beta – Randomness parameter controlling the refinement phase. Smaller values yield more deterministic results. Default
0.01.n_iterations – Number of optimisation passes. More iterations improve quality at the cost of runtime.
objective –
LeidenObjective.Modularity(default) orLeidenObjective.CPM.
- Returns:
A list of length
graph.nwhere elementiis the cluster ID of nodei. Cluster IDs are arbitrary non-negative integers.
Example:
from refnd.core import CsrGraph, EdgeStore, find_communities, LeidenObjective store = EdgeStore(4, [(0,1,0.9),(1,2,0.8),(2,3,0.6)]) g = CsrGraph(store, use_weight=True, is_weight_distance=False) clusters = find_communities(g, gamma=1.0, n_iterations=20) # e.g. [0, 1, 2, 3]
- refnd.core.find_components(graph: CsrGraph) list[int]¶
Finds the connected components of a graph.
Runs a standard BFS/union-find over the graph and assigns each node to its component.
- Parameters:
graph – The graph to find components.
- Returns:
A list of length
graph.nwhere elementiis the component ID of nodei. IDs are arbitrary non-negative integers.
Example:
from refnd.core import EdgeStore, find_components store = EdgeStore(6, [(0,1,0.9),(1,2,0.8),(3,4,0.7),(4,5,0.6)]) g = store.graph() clusters = find_components(g) # [0, 0, 0, 1, 1, 1]
- refnd.core.partition(clusters: Sequence[int], graph: CsrGraph, test_ratio: float = 0.2, seed: Optional[int] = None, post_filtering: bool = False) tuple[list[int], list[int]]¶
Split a clustered dataset into train and test index sets along clusters
Clusters are kept intact — no cluster is split across train and test. The function greedily and randomly assigns whole clusters to the test set until the test size is larger than
test_ratio, then assigns the rest to train.This is the key function for leakage-free dataset splitting: run
find_communities,find_components, or your own clustering, then callpartitionto obtain indices you can use to slice your data arrays.- Parameters:
clusters – A list of cluster IDs, one per data point (as returned by
find_communitiesorconnected_components). Length must equal the number of nodes ingraph.graph – The graph used to derive community structure. Its node count must match
len(clusters).test_ratio – Target fraction of data points to place in the test set. The actual fraction may differ slightly because clusters are indivisible. Defaults to
0.2.seed – Optional integer seed for reproducible cluster shuffling. Pass
Nonefor a random assignment.post_filtering – If
True, remove train nodes that share an edge with any test node (stricter leakage prevention at the cost of a smaller train set). Use only when clusters are found from communities (find_communities). It won’t have any effect when used withconnected_componentsas this prevents inter-cluster connections.
- Returns:
A tuple
(train_indices, test_indices)of zero-based data-point indices.- Raises:
ValueError – If
clusterslength does not matchgraph.n, or iftest_ratiois not in(0, 1).
Example:
from refnd.core import ( EdgeStore, CsrGraph, find_communities, partition ) store = EdgeStore(6, [(0,1,0.9),(1,2,0.8),(3,4,0.7),(4,5,0.6)]) g = CsrGraph(store, use_weight=True, is_weight_distance=False) clusters = find_communities(g) # or connected_components(g) train_idx, test_idx = partition(clusters, g, test_ratio=0.3, seed=42)
Exact search¶
- refnd.core.exact_edges(variant: kernels.KernelVariant, data: Any, proximity_threshold: float = 0.5, threads: int = 0, progress: bool = True, *args: Any, **kwargs: Any) EdgeStore¶
Compute all pairs of data points whose similarity exceeds a threshold (exact, brute-force).
Evaluates every unordered pair
(i, j)withi < jand records an edge whenkernel(data[i], data[j]) <= proximity_threshold. This is O(n²) in the number of data points; prefer the approximate``HNSWState`` for large datasets.Extra positional and keyword arguments are forwarded to the kernel constructor.
- Parameters:
variant – Which kernel to use (
KernelVariant.ProteinGlobalorKernelVariant.ProteinLocal).data – Sequence of data items (e.g.
list[str]for protein sequences).proximity_threshold – Maximum distance for an edge to be recorded.
threads – Number of parallel threads.
0uses all available cores.progress – Show a progress bar. Defaults to
True.
- Returns:
An
EdgeStorecontaining all edges whose distance ≤proximity_threshold.
Example:
from refnd import KernelVariant, exact_edges seqs = ["MKTAYIAK", "MKTAYIAKQR", "ACDEFGHIKLM"] store = exact_edges(KernelVariant.ProteinGlobal, seqs, proximity_threshold=0.5) print(len(store)) # number of similar pairs
- refnd.core.exact_nearest_neighbors(variant: kernels.KernelVariant, queries: Any, references: Any, k: int, threads: int = 0, progress: bool = True, *args: Any, **kwargs: Any) list[list[tuple[int, float]]]¶
Find the k nearest neighbors of each query in a reference set (exact, brute-force).
For every query item, scores it against every reference item using the chosen kernel and returns the top-k references sorted by ascending distance. Complexity is O(n_queries × n_references); prefer
HNSWState.searchfor approximate nearest neighbors with large datasets.Extra positional and keyword arguments are forwarded to the kernel constructor.
- Parameters:
variant – Which kernel to use (e.g.
KernelVariant.ProteinGlobalorKernelVariant.TanimotoBit).queries – Sequence of query items.
references – Sequence of reference items to search over. List of items of the same type of the selected kernel variant.
k – Number of nearest neighbors to return per query. List of items of the same type of the selected kernel variant.
threads – Number of parallel threads.
0uses all available cores.progress – Show a progress bar. Defaults to
True.
- Returns:
A list of length
len(queries). Each element is a list of up toktuples(reference_index, similarity_score)sorted by ascending distance.
Example:
from refnd import KernelVariant, exact_nearest_neighbors queries = ["MKTAYIAK"] refs = ["MKTAYIAKQR", "ACDEFGHIKLM", "MKTAYIAKQRQ"] results = exact_nearest_neighbors( KernelVariant.ProteinGlobal, queries, refs, k=2 ) # results[0] -> [(0, 0.20), (2, 0.27)]