refnd.utils

class refnd.utils.BitFingerprint(fp: ExplicitBitVect)

Bases: object

Binary fingerprint backed by a dense bitset.

Each bit represents the presence or absence of a structural feature. The primary source is an RDKit ExplicitBitVect (from GetFingerprint), but plain Python lists and numpy arrays are also accepted through from_list and from_np.

count caches the popcount so Tanimoto computation avoids re-counting.

Example:

from rdkit.Chem import rdFingerprintGenerator, MolFromSmiles
from refnd.utils import BitFingerprint

mfpgen = rdFingerprintGenerator.GetMorganGenerator(fpSize=1024, radius=2)
mol = MolFromSmiles("c1ccccc1")
fp = BitFingerprint(mfpgen.GetFingerprint(mol))
print(fp.count(), len(fp))   # set bits, total bits

count() int

Number of on bits (popcount).
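
The benefit of a cached popcount can be illustrated with a small pure-Python sketch. It does not use refnd: CachedBits and tanimoto are hypothetical names, the bitset is modelled as a Python int, and returning 0.0 for two empty fingerprints is an assumed convention rather than documented behaviour.

```python
class CachedBits:
    """Hypothetical stand-in: an int used as a bitset plus a cached popcount."""

    def __init__(self, bits):
        # Pack an iterable of booleans (or 0/1 ints) into a single int.
        self.mask = 0
        for i, b in enumerate(bits):
            if b:
                self.mask |= 1 << i
        # Count the on bits once, at construction time.
        self.count = bin(self.mask).count("1")


def tanimoto(a, b):
    # Only the intersection needs a fresh popcount; |A| and |B| are cached.
    common = bin(a.mask & b.mask).count("1")
    union = a.count + b.count - common
    # Assumed convention: two empty fingerprints have similarity 0.0.
    return common / union if union else 0.0


x = CachedBits([1, 1, 0, 1, 0, 0, 1, 0])
y = CachedBits([1, 0, 0, 1, 1, 0, 1, 0])
print(tanimoto(x, y))  # 3 shared bits over a union of 5
```

With the counts stored on each object, every pairwise comparison costs one AND plus one popcount instead of three popcounts.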

static from_list(values: Sequence[bool]) BitFingerprint

Construct from a list of booleans (or 0/1 ints).

static from_np(arr: numpy.typing.NDArray[numpy.bool_]) BitFingerprint

Construct from a numpy boolean or uint8 array.

to_list() list[bool]

Export as a list of booleans.

to_np() numpy.typing.NDArray[numpy.bool_]

Export as a numpy bool array.

to_rdkit() ExplicitBitVect

Export as an RDKit ExplicitBitVect.

class refnd.utils.RealFingerprint(fp: UIntSparseIntVect)

Bases: object

Dense real-valued fingerprint backed by a Vec<f32>.

Each element represents a feature count or continuous value. The primary source is an RDKit UIntSparseIntVect (from GetCountFingerprint), but plain Python lists and numpy arrays are also accepted through from_list and from_np.

norm_sq caches ||x||² so Tanimoto computation avoids recomputing it.

Example:

from rdkit.Chem import rdFingerprintGenerator, MolFromSmiles
from refnd.utils import RealFingerprint

mfpgen = rdFingerprintGenerator.GetMorganGenerator(fpSize=1024, radius=2)
mol = MolFromSmiles("c1ccccc1")
fp = RealFingerprint(mfpgen.GetCountFingerprint(mol))
print(fp.norm_sq(), len(fp))

static from_list(values: Sequence[float]) RealFingerprint

Construct from a list of floats.

static from_np(arr: numpy.typing.NDArray[numpy.float32]) RealFingerprint

Construct from a numpy float32 or float64 array.

norm_sq() float

Squared Euclidean norm of the feature vector (||x||²).
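
As a rough illustration of why caching ||x||² helps, the sketch below assumes the usual generalized Tanimoto form, ⟨x, y⟩ / (||x||² + ||y||² − ⟨x, y⟩). CachedVec and tanimoto_real are hypothetical names, not part of refnd, and the 0.0 result for two all-zero vectors is an assumed convention.

```python
class CachedVec:
    """Hypothetical stand-in: a float vector with its squared norm cached."""

    def __init__(self, values):
        self.values = [float(v) for v in values]
        # Compute ||x||^2 once so comparisons do not repeat the work.
        self.norm_sq = sum(v * v for v in self.values)


def tanimoto_real(a, b):
    # The dot product is the only term that depends on both vectors.
    dot = sum(p * q for p, q in zip(a.values, b.values))
    denom = a.norm_sq + b.norm_sq - dot
    # Assumed convention: two all-zero vectors have similarity 0.0.
    return dot / denom if denom else 0.0


x = CachedVec([2, 0, 1, 3])
y = CachedVec([1, 1, 0, 3])
print(tanimoto_real(x, y))  # 11 / (14 + 11 - 11)
```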

to_list() list[float]

Export as a list of floats.

to_np() numpy.typing.NDArray[numpy.float32]

Export as a numpy float32 array.

to_rdkit() UIntSparseIntVect

Export as an RDKit UIntSparseIntVect, compatible with GetCountFingerprint. Only non-zero elements are stored; the length equals len(self).
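
The "only non-zero elements are stored" layout can be pictured with a plain dictionary. This is a sketch only: to_sparse is a hypothetical helper, it assumes whole-number counts, and it says nothing about how refnd or RDKit convert non-integer values.

```python
def to_sparse(dense):
    """Return (length, {index: value}) keeping only non-zero entries."""
    # The length is kept separately because zero entries are not stored.
    nonzero = {i: int(v) for i, v in enumerate(dense) if v != 0}
    return len(dense), nonzero


length, entries = to_sparse([0.0, 2.0, 0.0, 0.0, 1.0, 0.0])
print(length, entries)  # total length, then index -> count for non-zero slots
```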

refnd.utils.read_fasta(path: str) list[tuple[str, str]]

Parse a FASTA file and return all records as a list of (header, sequence) pairs.

The header string is the full description line without the leading >. The sequence is the concatenation of all continuation lines for that record, with whitespace stripped.

Parameters:

path – Path to the FASTA file.

Returns:

A list of (header, sequence) tuples, one per FASTA record.

Raises:

IOError – If the file cannot be opened or is not valid UTF-8.
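
The header and sequence rules described above can be sketched in pure Python. This is not refnd's implementation: parse_fasta_text is a hypothetical helper that works on an in-memory string, and ignoring any lines that appear before the first header is an assumption.

```python
def parse_fasta_text(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            # Continuation line: remove all whitespace, then accumulate.
            chunks.append("".join(line.split()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = ">sp|P12345|MYPR_HUMAN example\nMKTAY\nIAKQR\n>seq2\nACGT  \n"
print(parse_fasta_text(sample))
```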

Example:

from refnd.utils import read_fasta

records = read_fasta("proteins.fasta")
header, seq = records[0]
print(header)  # "sp|P12345|MYPR_HUMAN ..."
print(seq)     # "MKTAYIAKQRQISFVKSHFSRQ..."

refnd.utils.largest_cluster(clusters: Sequence[int]) tuple[int, int]

Return the ID and size of the largest cluster.

Convenience helper that iterates over a cluster-label vector and finds the most populous label.

Parameters:

clusters – A list of cluster IDs (e.g. from connected_components or find_communities).

Returns:

A tuple (cluster_id, size) for the largest cluster.
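
The counting step can be sketched with collections.Counter. The function below, largest_cluster_sketch, is a hypothetical pure-Python equivalent and not the library's code. Two behaviours are assumptions because the reference does not specify them: ties go to the label seen first, and an empty input raises ValueError.

```python
from collections import Counter


def largest_cluster_sketch(clusters):
    """Return (cluster_id, size) for the most frequent label."""
    counts = Counter(clusters)
    if not counts:
        # Assumption: the documented function's empty-input behaviour is unknown.
        raise ValueError("clusters is empty")
    # max() keeps the first maximal item, so ties go to the earliest label.
    return max(counts.items(), key=lambda item: item[1])


labels = [0, 1, 1, 2, 1, 0]
print(largest_cluster_sketch(labels))  # label 1 appears three times
```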