refnd.kernels

Kernel variant selector

class refnd.kernels.KernelVariant

Bases: object

ProteinGlobal = KernelVariant.ProteinGlobal
ProteinLocal = KernelVariant.ProteinLocal
TanimotoBit = KernelVariant.TanimotoBit
TanimotoReal = KernelVariant.TanimotoReal

refnd.kernels.protein.sequence

Protein sequence aligners and their configuration enums.

class refnd.kernels.protein.sequence.GlobalAligner(gap_open: int = 11, gap_extend: int = 1, matrix: ScoringMatrix = ScoringMatrix.Blosum62, identity_mode: GlobalIdentityMode = GlobalIdentityMode.MaxLength, vectorization: VectorizationStrategy = VectorizationStrategy.Scan, width: DatatypeWidth = DatatypeWidth.Sat)

Bases: object

Needleman–Wunsch global sequence aligner returning a normalised identity score.

Wraps the parasail SIMD alignment library. The identity score is computed as the number of identical aligned positions divided by the denominator selected by identity_mode.

GlobalAligner is used as a kernel in HNSWState, exact_edges, and exact_nearest_neighbors via KernelVariant.ProteinGlobal.

Parameters:
  • gap_open – Affine gap-open penalty (positive integer, subtracted). Default 11.

  • gap_extend – Affine gap-extend penalty (positive integer, subtracted). Default 1.

  • matrix – Amino-acid substitution matrix. Default ScoringMatrix.Blosum62.

  • identity_mode – Normalisation denominator. Default GlobalIdentityMode.MaxLength.

  • vectorization – SIMD layout. Default VectorizationStrategy.Scan.

  • width – Integer precision. Default DatatypeWidth.Sat.

Example:

from refnd.kernels.protein.sequence import GlobalAligner

aligner = GlobalAligner(gap_open=11, gap_extend=1)
score = aligner.call("MKTAYIAK", "MKTAYIAKQR")
score = aligner("MKTAYIAK", "MKTAYIAKQR")  # equivalent: the aligner object is callable
# score is in [0.0, 1.0]

call(ref_sample: str, query: str) → float

Compute the global alignment identity between two sequences.

Parameters:
  • ref_sample – Reference protein sequence (single-letter amino acid codes).

  • query – Query protein sequence.

Returns:

Identity score in [0.0, 1.0].
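
The following pure-Python sketch illustrates the idea behind this score; it is not the refnd or parasail implementation. To stay short it assumes a linear gap penalty and a flat match/mismatch score in place of affine gaps and a substitution matrix, and it always divides by the longer sequence length (the MaxLength behaviour). The name global_identity and its scoring parameters are hypothetical.

```python
def global_identity(ref: str, query: str,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> float:
    """Needleman-Wunsch (linear gaps) followed by identity / max(len)."""
    n, m = len(ref), len(query)
    if max(n, m) == 0:
        return 0.0
    # score[i][j] = best score for aligning ref[:i] with query[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == query[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align ref[i-1] with query[j-1]
                score[i - 1][j] + gap,      # ref[i-1] against a gap
                score[i][j - 1] + gap,      # query[j-1] against a gap
            )
    # Trace back from the bottom-right corner, counting identical columns.
    i, j, identical = n, m, 0
    while i > 0 and j > 0:
        sub = match if ref[i - 1] == query[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            identical += ref[i - 1] == query[j - 1]
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return identical / max(n, m)


print(global_identity("MKTAYIAK", "MKTAYIAKQR"))  # 8 identical columns / 10 = 0.8
```

Scores returned by the library depend on its substitution matrix, gap penalties, and identity_mode, so they need not match this sketch.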

class refnd.kernels.protein.sequence.LocalAligner(gap_open: int = 11, gap_extend: int = 1, min_coverage: float = 0.800000011920929, cov_mode: CoverageMode = CoverageMode.BothQueryTarget, matrix: ScoringMatrix = ScoringMatrix.Blosum62, identity_mode: LocalIdentityMode = LocalIdentityMode.AlignmentLength, vectorization: VectorizationStrategy = VectorizationStrategy.Striped, width: DatatypeWidth = DatatypeWidth.Sat)

Bases: object

Smith–Waterman local sequence aligner returning a normalised identity score.

Like GlobalAligner but aligns only the most similar sub-region of each sequence. Pairs that do not meet the min_coverage criterion after alignment receive a score of 0.0.

LocalAligner is used as a kernel via KernelVariant.ProteinLocal.

Parameters:
  • gap_open – Affine gap-open penalty. Default 11.

  • gap_extend – Affine gap-extend penalty. Default 1.

  • min_coverage – Minimum fraction of sequence covered by the local alignment (per cov_mode) for the pair to be accepted. Default 0.8.

  • cov_mode – Which sequence(s) must meet min_coverage. Default CoverageMode.BothQueryTarget.

  • matrix – Substitution matrix. Default ScoringMatrix.Blosum62.

  • identity_mode – Normalisation denominator. Default LocalIdentityMode.AlignmentLength.

  • vectorization – SIMD layout. Default VectorizationStrategy.Striped.

  • width – Integer precision. Default DatatypeWidth.Sat.

Example:

from refnd.kernels.protein.sequence import LocalAligner, CoverageMode

aligner = LocalAligner(min_coverage=0.5, cov_mode=CoverageMode.Query)
score = aligner.call("ACDEFGHIKLM", "CDEFGHI")
score = aligner("ACDEFGHIKLM", "CDEFGHI")  # equivalent: the aligner object is callable

call(ref_sample: str, query: str) → float

Compute the local alignment identity between two sequences.

Parameters:
  • ref_sample – Reference protein sequence (single-letter amino acid codes).

  • query – Query protein sequence.

Returns:

Identity score in [0.0, 1.0], or 0.0 if the coverage filter fails.
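
As a rough illustration of how a local alignment and a coverage filter can combine into one score, here is a pure-Python sketch. It is not the refnd or parasail implementation: it assumes a linear gap penalty, a flat match/mismatch score, a coverage check on both sequences, and normalisation by alignment length. The name local_identity and its scoring parameters are hypothetical.

```python
def local_identity(ref: str, query: str, min_coverage: float = 0.8,
                   match: int = 2, mismatch: int = -1, gap: int = -2) -> float:
    """Smith-Waterman (linear gaps), coverage filter, then identity."""
    n, m = len(ref), len(query)
    # score[i][j] = best local score ending at ref[i-1], query[j-1]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, end_i, end_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == query[j - 1] else mismatch
            score[i][j] = max(
                0,                          # start a new local alignment
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, end_i, end_j = score[i][j], i, j
    if best == 0:
        return 0.0  # no positive-scoring local alignment exists
    # Trace back from the best cell until the score drops to zero.
    i, j, identical, columns = end_i, end_j, 0, 0
    while score[i][j] > 0:
        sub = match if ref[i - 1] == query[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            identical += ref[i - 1] == query[j - 1]
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    # Fraction of each sequence spanned by the local alignment.
    ref_cov = (end_i - i) / n
    query_cov = (end_j - j) / m
    if ref_cov < min_coverage or query_cov < min_coverage:
        return 0.0  # coverage filter fails
    return identical / columns


# The aligned region CDEFGHI spans 7/11 of the reference and 7/7 of the query.
print(local_identity("ACDEFGHIKLM", "CDEFGHI", min_coverage=0.8))  # 0.0
print(local_identity("ACDEFGHIKLM", "CDEFGHI", min_coverage=0.5))  # 1.0
```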

class refnd.kernels.protein.sequence.ScoringMatrix

Bases: object

Blosum100 = ScoringMatrix.Blosum100
Blosum30 = ScoringMatrix.Blosum30
Blosum35 = ScoringMatrix.Blosum35
Blosum40 = ScoringMatrix.Blosum40
Blosum45 = ScoringMatrix.Blosum45
Blosum50 = ScoringMatrix.Blosum50
Blosum55 = ScoringMatrix.Blosum55
Blosum60 = ScoringMatrix.Blosum60
Blosum62 = ScoringMatrix.Blosum62
Blosum65 = ScoringMatrix.Blosum65
Blosum70 = ScoringMatrix.Blosum70
Blosum75 = ScoringMatrix.Blosum75
Blosum80 = ScoringMatrix.Blosum80
Blosum85 = ScoringMatrix.Blosum85
Blosum90 = ScoringMatrix.Blosum90
Blosum95 = ScoringMatrix.Blosum95
Pam10 = ScoringMatrix.Pam10
Pam100 = ScoringMatrix.Pam100
Pam110 = ScoringMatrix.Pam110
Pam120 = ScoringMatrix.Pam120
Pam130 = ScoringMatrix.Pam130
Pam140 = ScoringMatrix.Pam140
Pam150 = ScoringMatrix.Pam150
Pam160 = ScoringMatrix.Pam160
Pam170 = ScoringMatrix.Pam170
Pam180 = ScoringMatrix.Pam180
Pam190 = ScoringMatrix.Pam190
Pam20 = ScoringMatrix.Pam20
Pam200 = ScoringMatrix.Pam200
Pam210 = ScoringMatrix.Pam210
Pam220 = ScoringMatrix.Pam220
Pam230 = ScoringMatrix.Pam230
Pam240 = ScoringMatrix.Pam240
Pam250 = ScoringMatrix.Pam250
Pam260 = ScoringMatrix.Pam260
Pam270 = ScoringMatrix.Pam270
Pam280 = ScoringMatrix.Pam280
Pam290 = ScoringMatrix.Pam290
Pam30 = ScoringMatrix.Pam30
Pam300 = ScoringMatrix.Pam300
Pam310 = ScoringMatrix.Pam310
Pam320 = ScoringMatrix.Pam320
Pam330 = ScoringMatrix.Pam330
Pam340 = ScoringMatrix.Pam340
Pam350 = ScoringMatrix.Pam350
Pam360 = ScoringMatrix.Pam360
Pam370 = ScoringMatrix.Pam370
Pam380 = ScoringMatrix.Pam380
Pam390 = ScoringMatrix.Pam390
Pam40 = ScoringMatrix.Pam40
Pam400 = ScoringMatrix.Pam400
Pam410 = ScoringMatrix.Pam410
Pam420 = ScoringMatrix.Pam420
Pam430 = ScoringMatrix.Pam430
Pam440 = ScoringMatrix.Pam440
Pam450 = ScoringMatrix.Pam450
Pam460 = ScoringMatrix.Pam460
Pam470 = ScoringMatrix.Pam470
Pam480 = ScoringMatrix.Pam480
Pam490 = ScoringMatrix.Pam490
Pam50 = ScoringMatrix.Pam50
Pam500 = ScoringMatrix.Pam500
Pam60 = ScoringMatrix.Pam60
Pam70 = ScoringMatrix.Pam70
Pam80 = ScoringMatrix.Pam80
Pam90 = ScoringMatrix.Pam90

class refnd.kernels.protein.sequence.GlobalIdentityMode

Bases: object

Denominator used to normalise a global-alignment identity score.

After counting identical aligned positions, the raw count is divided by:

  • AlignmentLength: the total length of the alignment (including gaps).

  • MaxSeqLength: the length of the longer of the two sequences.

  • MinSeqLength: the length of the shorter of the two sequences.

  • MaxLength (default): same as MaxSeqLength — recommended for RGP datasets.

AlignmentLength = GlobalIdentityMode.AlignmentLength
MaxLength = GlobalIdentityMode.MaxLength
MaxSeqLength = GlobalIdentityMode.MaxSeqLength
MinSeqLength = GlobalIdentityMode.MinSeqLength
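
To make the effect of each denominator concrete, the sketch below computes all three from a gapped alignment given as two equal-length strings. The helper identity_by_mode, the "-" gap character, and the example alignment are illustrative assumptions and are not part of refnd.

```python
def identity_by_mode(aligned_ref: str, aligned_query: str, gap: str = "-") -> dict:
    """Identity of one gapped alignment under each normalisation denominator."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned rows must have equal length")
    # A column counts as identical only if both residues match and neither is a gap.
    identical = sum(a == b and a != gap for a, b in zip(aligned_ref, aligned_query))
    ref_len = len(aligned_ref) - aligned_ref.count(gap)
    query_len = len(aligned_query) - aligned_query.count(gap)
    return {
        "AlignmentLength": identical / len(aligned_ref),       # columns, gaps included
        "MaxSeqLength": identical / max(ref_len, query_len),   # longer sequence
        "MinSeqLength": identical / min(ref_len, query_len),   # shorter sequence
    }


# 4 identical columns; alignment length 7; sequence lengths 6 and 5.
scores = identity_by_mode("ACDE-GH", "AC-EFG-")
for mode, value in scores.items():
    print(f"{mode}: {value:.3f}")
# AlignmentLength: 0.571
# MaxSeqLength: 0.667
# MinSeqLength: 0.800
```

The stricter the denominator (longer sequence, or full alignment length), the lower the score for the same alignment.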

class refnd.kernels.protein.sequence.VectorizationStrategy

Bases: object

SIMD vectorization layout used by the parasail alignment engine.

  • Striped (default for local): interleaved layout, best for short sequences.

  • Scan (default for global): sequential scan layout, often faster for long sequences or global alignment.

  • Diag: diagonal layout; niche use-case, rarely needed.

In practice the default per-aligner is a good choice; change only if profiling shows a bottleneck.

Diag = VectorizationStrategy.Diag
Scan = VectorizationStrategy.Scan
Striped = VectorizationStrategy.Striped

class refnd.kernels.protein.sequence.DatatypeWidth

Bases: object

Integer precision used for alignment score accumulation.

  • Short (8-bit), Half (16-bit), Full (32-bit), Long (64-bit): fixed-width integers — lower width is faster but can overflow on long sequences.

  • Sat (default): 8-bit saturating arithmetic; if the score saturates, the alignment is silently rerun with 16-bit integers.

Full = DatatypeWidth.Full
Half = DatatypeWidth.Half
Long = DatatypeWidth.Long
Sat = DatatypeWidth.Sat
Short = DatatypeWidth.Short
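
The saturate-then-retry behaviour described for Sat can be pictured with a toy accumulator. This is a sketch of the general technique only, assuming signed 8-bit and 16-bit limits and treating a value that reaches the limit as saturated; the function names are hypothetical and the real alignment engine works on SIMD lanes, not Python integers.

```python
INT8_MAX = 2**7 - 1    # 127
INT16_MAX = 2**15 - 1  # 32767


def accumulate_saturating(increments, limit):
    """Sum increments, clamping at `limit`; report whether clamping occurred."""
    total, saturated = 0, False
    for inc in increments:
        total += inc
        if total >= limit:
            total, saturated = limit, True  # clamp instead of wrapping around
    return total, saturated


def score_with_fallback(increments):
    """Try the narrow width first; rerun with the wider one on saturation."""
    total, saturated = accumulate_saturating(increments, INT8_MAX)
    if not saturated:
        return total, 8
    total, saturated = accumulate_saturating(increments, INT16_MAX)
    if saturated:
        raise OverflowError("score does not fit in 16 bits either")
    return total, 16


print(score_with_fallback([11] * 5))   # (55, 8): fits in 8 bits
print(score_with_fallback([11] * 20))  # (220, 16): saturated at 127, rerun in 16 bits
```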

class refnd.kernels.protein.sequence.CoverageMode

Bases: object

Coverage filter applied before accepting a local alignment as valid.

A pair is scored only when the alignment covers enough of the sequences as specified by the mode and min_coverage threshold:

  • BothQueryTarget (default): both query and target must meet min_coverage.

  • Target: only the target must meet min_coverage.

  • Query: only the query must meet min_coverage.

  • LengthRatio: the shorter / longer length ratio must meet min_coverage.

  • ShorterSeq: coverage computed relative to the shorter sequence.

BothQueryTarget = CoverageMode.BothQueryTarget
LengthRatio = CoverageMode.LengthRatio
Query = CoverageMode.Query
ShorterSeq = CoverageMode.ShorterSeq
Target = CoverageMode.Target
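
One possible reading of the five modes, written as a plain function. This is a sketch under stated assumptions and not the library's code: mode names are passed as strings instead of enum members, coverage is taken as aligned residues divided by sequence length, and the threshold comparison is assumed to be inclusive (>=). The helper passes_coverage is hypothetical.

```python
def passes_coverage(mode: str, query_len: int, target_len: int,
                    query_aligned: int, target_aligned: int,
                    min_coverage: float = 0.8) -> bool:
    """Return True if a local alignment satisfies the coverage filter."""
    query_cov = query_aligned / query_len      # fraction of the query that is aligned
    target_cov = target_aligned / target_len   # fraction of the target that is aligned
    if mode == "BothQueryTarget":
        return query_cov >= min_coverage and target_cov >= min_coverage
    if mode == "Target":
        return target_cov >= min_coverage
    if mode == "Query":
        return query_cov >= min_coverage
    if mode == "LengthRatio":
        ratio = min(query_len, target_len) / max(query_len, target_len)
        return ratio >= min_coverage
    if mode == "ShorterSeq":
        shorter_cov = query_cov if query_len <= target_len else target_cov
        return shorter_cov >= min_coverage
    raise ValueError(f"unknown coverage mode: {mode}")


# A 7-residue query fully aligned against 7 residues of an 11-residue target.
for mode in ("BothQueryTarget", "Target", "Query", "LengthRatio", "ShorterSeq"):
    print(mode, passes_coverage(mode, 7, 11, 7, 7, min_coverage=0.8))
# BothQueryTarget False
# Target False
# Query True
# LengthRatio False
# ShorterSeq True
```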

class refnd.kernels.protein.sequence.LocalIdentityMode

Bases: object

Denominator used to normalise a local-alignment identity score.

  • AlignmentLength (default): divide by the length of the local alignment.

  • MinSeqLength: divide by the shorter sequence length.

AlignmentLength = LocalIdentityMode.AlignmentLength
MinSeqLength = LocalIdentityMode.MinSeqLength

refnd.kernels.molecules

Tanimoto kernels for molecular fingerprints.

class refnd.kernels.molecules.TanimotoBit

Bases: object

Tanimoto distance kernel for binary (bit) molecular fingerprints.

Measures structural dissimilarity between two molecules as the complement of the Jaccard index over their feature sets. A distance of 0 means the two fingerprints are identical; 1 means they share no features at all.

Formula: 1 - |A ∩ B| / |A ∪ B|

This kernel is the standard choice when working with ExplicitBitVect fingerprints such as Morgan, RDKit, or MACCS keys.

Example:

from rdkit.Chem import MolFromSmiles, rdFingerprintGenerator
from refnd.kernels.molecules import BitFingerprint, TanimotoBit

mfpgen = rdFingerprintGenerator.GetMorganGenerator(fpSize=1024, radius=2)
benzene     = BitFingerprint(mfpgen.GetFingerprint(MolFromSmiles("c1ccccc1")))
naphthalene = BitFingerprint(mfpgen.GetFingerprint(MolFromSmiles("c1ccc2ccccc2c1")))
acetic_acid = BitFingerprint(mfpgen.GetFingerprint(MolFromSmiles("CC(=O)O")))

k = TanimotoBit()
print(k(benzene, naphthalene))  # lower distance: structurally similar
print(k(benzene, acetic_acid))  # higher distance: structurally dissimilar

call(a: utils.BitFingerprint, b: utils.BitFingerprint) → float

Compute the Tanimoto distance between two BitFingerprint objects.

Parameters:
  • a – First fingerprint.

  • b – Second fingerprint.

Returns:

Distance in [0.0, 1.0]. 0.0 means identical feature sets, 1.0 means fully disjoint.

Example:

from refnd.kernels.molecules import BitFingerprint, TanimotoBit

fp1 = BitFingerprint.from_list([True, False, True, True])
fp2 = BitFingerprint.from_list([True, True,  True, False])
k = TanimotoBit()
assert k.call(fp1, fp2) == k(fp1, fp2)  # both forms are equivalent
# |intersection| = |{0, 2}| = 2, |union| = |{0, 1, 2, 3}| = 4 → distance = 1 - 2/4 = 0.5
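
The arithmetic above can be checked without the library. The sketch below applies the same formula to plain Python lists of booleans; tanimoto_bit_distance is a hypothetical helper, and returning 0.0 for two all-zero fingerprints is an assumption, since the formula is undefined when the union is empty.

```python
def tanimoto_bit_distance(a, b) -> float:
    """1 - |A ∩ B| / |A ∪ B| over two equal-length sequences of booleans."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    intersection = sum(x and y for x, y in zip(a, b))  # bits set in both
    union = sum(x or y for x, y in zip(a, b))          # bits set in either
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as identical
    return 1.0 - intersection / union


print(tanimoto_bit_distance([True, False, True, True],
                            [True, True, True, False]))  # 0.5
```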

class refnd.kernels.molecules.TanimotoReal

Bases: object

Tanimoto distance kernel for real-valued (count) molecular fingerprints.

Generalises the binary Tanimoto to continuous feature vectors using the dot-product formulation. A distance of 0 means the two fingerprints are identical; values approach 1 as the vectors become orthogonal.

Formula: 1 - dot(a, b) / (||a||² + ||b||² - dot(a, b))

This kernel is the standard choice when working with count fingerprints such as those returned by GetCountFingerprint (Morgan counts, etc.).

Example:

from rdkit.Chem import MolFromSmiles, rdFingerprintGenerator
from refnd.kernels.molecules import RealFingerprint, TanimotoReal

mfpgen = rdFingerprintGenerator.GetMorganGenerator(fpSize=1024, radius=2)
benzene     = RealFingerprint(mfpgen.GetCountFingerprint(MolFromSmiles("c1ccccc1")))
naphthalene = RealFingerprint(mfpgen.GetCountFingerprint(MolFromSmiles("c1ccc2ccccc2c1")))
acetic_acid = RealFingerprint(mfpgen.GetCountFingerprint(MolFromSmiles("CC(=O)O")))

k = TanimotoReal()
print(k(benzene, naphthalene))  # lower distance: structurally similar
print(k(benzene, acetic_acid))  # higher distance: structurally dissimilar

call(a: utils.RealFingerprint, b: utils.RealFingerprint) → float

Compute the Tanimoto distance between two RealFingerprint objects.

Parameters:
  • a – First fingerprint.

  • b – Second fingerprint.

Returns:

Distance in [0.0, 1.0]. 0.0 means identical feature vectors, 1.0 means fully orthogonal.

Example:

from refnd.kernels.molecules import RealFingerprint, TanimotoReal

fp1 = RealFingerprint.from_list([1.0, 0.0, 1.0])
fp2 = RealFingerprint.from_list([0.0, 1.0, 1.0])
k = TanimotoReal()
assert k.call(fp1, fp2) == k(fp1, fp2)  # both forms are equivalent
# dot = 1, ||fp1||² = 2, ||fp2||² = 2 → distance = 1 - 1/(2 + 2 - 1) ≈ 0.667
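
The same formula can be evaluated on plain Python lists to verify the arithmetic. tanimoto_real_distance is a hypothetical helper, not a refnd function, and returning 0.0 for two all-zero vectors is an assumption, since the formula is undefined there.

```python
def tanimoto_real_distance(a, b) -> float:
    """1 - dot(a, b) / (||a||² + ||b||² - dot(a, b)) for equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)  # squared Euclidean norm of a
    norm_b = sum(y * y for y in b)  # squared Euclidean norm of b
    denominator = norm_a + norm_b - dot
    if denominator == 0:
        return 0.0  # assumption: two all-zero vectors count as identical
    return 1.0 - dot / denominator


print(tanimoto_real_distance([1.0, 0.0, 1.0], [0.0, 1.0, 1.0]))  # 1 - 1/3 ≈ 0.667
print(tanimoto_real_distance([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # 0.0
print(tanimoto_real_distance([1.0, 0.0, 1.0], [2.0, 0.0, 2.0]))  # 1 - 4/(2 + 8 - 4) ≈ 0.333
```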