refnd.utils
- class refnd.utils.BitFingerprint(fp: ExplicitBitVect)
Bases: object
Dense binary fingerprint backed by a dense bitset.
Each bit represents the presence or absence of a structural feature. The primary source is an RDKit ExplicitBitVect (from GetFingerprint), but plain Python lists and numpy arrays are also accepted (via from_list and from_np). count caches the popcount so that Tanimoto computation avoids re-counting.
Example:
    from rdkit.Chem import rdFingerprintGenerator, MolFromSmiles
    from refnd.utils import BitFingerprint

    mfpgen = rdFingerprintGenerator.GetMorganGenerator(fpSize=1024, radius=2)
    mol = MolFromSmiles("c1ccccc1")
    fp = BitFingerprint(mfpgen.GetFingerprint(mol))
    print(fp.count(), len(fp))  # set bits, total bits
- count() -> int
Number of on bits (popcount).
- static from_list(values: Sequence[bool]) -> BitFingerprint
Construct from a list of booleans (or 0/1 ints).
- static from_np(arr: numpy.typing.NDArray[numpy.bool_]) -> BitFingerprint
Construct from a numpy boolean or uint8 array.
- to_list() -> list[bool]
Export as a list of booleans.
- to_np() -> numpy.typing.NDArray[numpy.bool_]
Export as a numpy bool array.
- to_rdkit() -> ExplicitBitVect
Export as an RDKit ExplicitBitVect.
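To illustrate why a cached popcount helps, here is a minimal pure-Python sketch of binary Tanimoto similarity. It does not use refnd: the names CachedBits and tanimoto_bits are hypothetical stand-ins, and the library's actual implementation may differ. With |A| and |B| cached, only the intersection count has to be computed for each pair.

```python
from typing import Sequence


class CachedBits:
    """Hypothetical stand-in for a bit fingerprint with a cached popcount."""

    def __init__(self, values: Sequence[bool]) -> None:
        self.n_bits = len(values)
        # Pack the bits into one Python int: bit i is set when values[i] is truthy.
        self.bits = sum(1 << i for i, v in enumerate(values) if v)
        # Count the on bits once, at construction time.
        self._count = bin(self.bits).count("1")

    def count(self) -> int:
        return self._count

    def __len__(self) -> int:
        return self.n_bits


def tanimoto_bits(a: CachedBits, b: CachedBits) -> float:
    """Tanimoto similarity |A & B| / (|A| + |B| - |A & B|)."""
    common = bin(a.bits & b.bits).count("1")
    union = a.count() + b.count() - common
    # Assumed convention: two empty fingerprints have similarity 0.0.
    return common / union if union else 0.0


a = CachedBits([1, 1, 0, 1, 0, 0, 1, 0])
b = CachedBits([1, 0, 0, 1, 1, 0, 1, 0])
print(a.count(), len(a))    # 4 8
print(tanimoto_bits(a, b))  # 0.6
```

The two fingerprints share three on bits out of five in their union, hence 3 / 5.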
- class refnd.utils.RealFingerprint(fp: UIntSparseIntVect)
Bases: object
Dense real-valued fingerprint backed by a Vec<f32>.
Each element represents a feature count or continuous value. The primary source is an RDKit UIntSparseIntVect (from GetCountFingerprint), but plain Python lists and numpy arrays are also accepted (via from_list and from_np). norm_sq caches ||x||² so that Tanimoto computation avoids recomputing it.
Example:
    from rdkit.Chem import rdFingerprintGenerator, MolFromSmiles
    from refnd.utils import RealFingerprint

    mfpgen = rdFingerprintGenerator.GetMorganGenerator(fpSize=1024, radius=2)
    mol = MolFromSmiles("c1ccccc1")
    fp = RealFingerprint(mfpgen.GetCountFingerprint(mol))
    print(fp.norm_sq(), len(fp))  # squared norm, total elements
- static from_list(values: Sequence[float]) -> RealFingerprint
Construct from a list of floats.
- static from_np(arr: numpy.typing.NDArray[numpy.float32]) -> RealFingerprint
Construct from a numpy float32 or float64 array.
- norm_sq() -> float
Squared Euclidean norm of the feature vector (||x||²).
- to_list() -> list[float]
Export as a list of floats.
- to_np() -> numpy.typing.NDArray[numpy.float32]
Export as a numpy float32 array.
- to_rdkit() -> UIntSparseIntVect
Export as an RDKit UIntSparseIntVect, compatible with GetCountFingerprint. Only non-zero elements are stored; the length equals len(self).
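The role of the cached squared norm can be sketched in pure Python. This assumes the common dot-product form of real-valued Tanimoto, ⟨x, y⟩ / (||x||² + ||y||² − ⟨x, y⟩); whether refnd uses exactly this form is not stated above. The names CachedVector, nonzero and tanimoto_real are hypothetical and not part of the library. The nonzero helper mirrors the idea that a sparse export only stores non-zero elements.

```python
from typing import Dict, Sequence


class CachedVector:
    """Hypothetical stand-in for a dense real-valued fingerprint."""

    def __init__(self, values: Sequence[float]) -> None:
        self.values = [float(v) for v in values]
        # Squared Euclidean norm, computed once at construction time.
        self._norm_sq = sum(v * v for v in self.values)

    def norm_sq(self) -> float:
        return self._norm_sq

    def __len__(self) -> int:
        return len(self.values)

    def nonzero(self) -> Dict[int, float]:
        """Index -> value map of the non-zero elements (a sparse view)."""
        return {i: v for i, v in enumerate(self.values) if v != 0.0}


def tanimoto_real(x: CachedVector, y: CachedVector) -> float:
    """Assumed form: <x, y> / (||x||^2 + ||y||^2 - <x, y>)."""
    dot = sum(p * q for p, q in zip(x.values, y.values))
    denom = x.norm_sq() + y.norm_sq() - dot
    # Assumed convention: two all-zero vectors have similarity 0.0.
    return dot / denom if denom else 0.0


x = CachedVector([2, 0, 1, 0])
y = CachedVector([1, 0, 1, 3])
print(x.norm_sq(), len(x))  # 5.0 4
print(x.nonzero())          # {0: 2.0, 2: 1.0}
print(tanimoto_real(x, y))  # 3 / 13
```

Per pair, only the dot product is recomputed; both norms come from the cache.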
- refnd.utils.read_fasta(path: str) -> list[tuple[str, str]]
Parse a FASTA file and return all records as a list of (header, sequence) pairs.
The header string is the full description line without the leading >. The sequence is the concatenation of all continuation lines for that record, with whitespace stripped.
- Parameters:
path – Path to the FASTA file.
- Returns:
A list of (header, sequence) tuples, one per FASTA record.
- Raises:
IOError – If the file cannot be opened or is not valid UTF-8.
Example:
    from refnd.utils import read_fasta

    records = read_fasta("proteins.fasta")
    header, seq = records[0]
    print(header)  # "sp|P12345|MYPR_HUMAN ..."
    print(seq)     # "MKTAYIAKQRQISFVKSHFSRQ..."
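The parsing rules described above (header without the leading >, sequence lines concatenated with whitespace removed) can be sketched in pure Python. The function parse_fasta_text is a hypothetical illustration that works on a string rather than a file path, and it is not the library's implementation. Its handling of blank lines and of text before the first header is an assumption.

```python
from typing import List, Optional, Tuple


def parse_fasta_text(text: str) -> List[Tuple[str, str]]:
    """Hypothetical helper: parse FASTA-formatted text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumed: blank lines are skipped
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]  # full description line without the leading '>'
            chunks = []
        elif header is not None:
            # Sequence line: remove internal whitespace and accumulate.
            chunks.append("".join(line.split()))
        # assumed: lines before the first header are ignored
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">seq1 first record\nMKTA YIAK\nQRQ\n\n>seq2\nACGT\n"
print(parse_fasta_text(example))
# [('seq1 first record', 'MKTAYIAKQRQ'), ('seq2', 'ACGT')]
```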
- refnd.utils.largest_cluster(clusters: Sequence[int]) -> tuple[int, int]
Return the ID and size of the largest cluster.
Convenience helper that iterates over a cluster-label vector and finds the most populous label.
- Parameters:
clusters – A list of cluster IDs (e.g. from connected_components or find_communities).
- Returns:
A tuple (cluster_id, size) for the largest cluster.
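A minimal sketch of the same idea using only the standard library is shown here. The name largest_cluster_sketch is hypothetical. The documentation above does not say how ties or empty input are handled, so the choices made in this sketch are assumptions.

```python
from collections import Counter
from typing import Sequence, Tuple


def largest_cluster_sketch(clusters: Sequence[int]) -> Tuple[int, int]:
    """Hypothetical re-implementation: (cluster_id, size) of the most populous label."""
    if not clusters:
        # Assumed behaviour; the real function may handle empty input differently.
        raise ValueError("clusters must not be empty")
    # most_common(1) returns [(label, count)] for the most frequent label;
    # among equal counts, Counter keeps the label encountered first.
    cluster_id, size = Counter(clusters).most_common(1)[0]
    return cluster_id, size


labels = [0, 1, 1, 2, 1, 0]
print(largest_cluster_sketch(labels))  # (1, 3)
```

Here label 1 occurs three times, label 0 twice and label 2 once, so the result is (1, 3).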