refnd.kernels¶
Kernel variant selector¶
refnd.kernels.protein.sequence¶
Protein sequence aligners and their configuration enums.
- class refnd.kernels.protein.sequence.GlobalAligner(gap_open: int = 11, gap_extend: int = 1, matrix: ScoringMatrix = ScoringMatrix.Blosum62, identity_mode: GlobalIdentityMode = GlobalIdentityMode.MaxLength, vectorization: VectorizationStrategy = VectorizationStrategy.Scan, width: DatatypeWidth = DatatypeWidth.Sat)¶
Bases: object
Needleman–Wunsch global sequence aligner returning a normalised identity score.
Wraps the parasail SIMD alignment library. The identity score is computed as the number of identical aligned positions divided by the denominator selected by identity_mode.
GlobalAligner is used as a kernel in HNSWState, exact_edges, and exact_nearest_neighbors via KernelVariant.ProteinGlobal.
- Parameters:
gap_open – Affine gap-open penalty (positive integer, subtracted). Default 11.
gap_extend – Affine gap-extend penalty (positive integer, subtracted). Default 1.
matrix – Amino-acid substitution matrix. Default ScoringMatrix.Blosum62.
identity_mode – Normalisation denominator. Default GlobalIdentityMode.MaxLength.
vectorization – SIMD layout. Default VectorizationStrategy.Scan.
width – Integer precision. Default DatatypeWidth.Sat.
Example:
    from refnd.kernels.protein.sequence import GlobalAligner

    aligner = GlobalAligner(gap_open=11, gap_extend=1)
    score = aligner.call("MKTAYIAK", "MKTAYIAKQR")
    score = aligner("MKTAYIAK", "MKTAYIAKQR")  # Alternative
    # score in [0.0, 1.0]
- call(ref_sample: str, query: str) float¶
Compute the global alignment identity between two sequences.
- Parameters:
ref_sample – Reference protein sequence (single-letter amino acid codes).
query – Query protein sequence.
- Returns:
Identity score in [0.0, 1.0].
- class refnd.kernels.protein.sequence.LocalAligner(gap_open: int = 11, gap_extend: int = 1, min_coverage: float = 0.800000011920929, cov_mode: CoverageMode = CoverageMode.BothQueryTarget, matrix: ScoringMatrix = ScoringMatrix.Blosum62, identity_mode: LocalIdentityMode = LocalIdentityMode.AlignmentLength, vectorization: VectorizationStrategy = VectorizationStrategy.Striped, width: DatatypeWidth = DatatypeWidth.Sat)¶
Bases: object
Smith–Waterman local sequence aligner returning a normalised identity score.
Like GlobalAligner but aligns only the most similar sub-region of each sequence. Pairs that do not meet the min_coverage criterion after alignment receive a score of 0.0.
LocalAligner is used as a kernel via KernelVariant.ProteinLocal.
- Parameters:
gap_open – Affine gap-open penalty. Default 11.
gap_extend – Affine gap-extend penalty. Default 1.
min_coverage – Minimum fraction of sequence covered by the local alignment (per cov_mode) for the pair to be accepted. Default 0.8.
cov_mode – Which sequence(s) must meet min_coverage. Default CoverageMode.BothQueryTarget.
matrix – Substitution matrix. Default ScoringMatrix.Blosum62.
identity_mode – Normalisation denominator. Default LocalIdentityMode.AlignmentLength.
vectorization – SIMD layout. Default VectorizationStrategy.Striped.
width – Integer precision. Default DatatypeWidth.Sat.
Example:
    from refnd.kernels.protein.sequence import LocalAligner, CoverageMode

    aligner = LocalAligner(min_coverage=0.5, cov_mode=CoverageMode.Query)
    score = aligner.call("ACDEFGHIKLM", "CDEFGHI")
    score = aligner("ACDEFGHIKLM", "CDEFGHI")  # Alternative
- call(ref_sample: str, query: str) float¶
Compute the local alignment identity between two sequences.
- Parameters:
ref_sample – Reference protein sequence (single-letter amino acid codes).
query – Query protein sequence.
- Returns:
Identity score in [0.0, 1.0], or 0.0 if the coverage filter fails.
- class refnd.kernels.protein.sequence.ScoringMatrix¶
Bases: object
- Blosum100 = ScoringMatrix.Blosum100¶
- Blosum30 = ScoringMatrix.Blosum30¶
- Blosum35 = ScoringMatrix.Blosum35¶
- Blosum40 = ScoringMatrix.Blosum40¶
- Blosum45 = ScoringMatrix.Blosum45¶
- Blosum50 = ScoringMatrix.Blosum50¶
- Blosum55 = ScoringMatrix.Blosum55¶
- Blosum60 = ScoringMatrix.Blosum60¶
- Blosum62 = ScoringMatrix.Blosum62¶
- Blosum65 = ScoringMatrix.Blosum65¶
- Blosum70 = ScoringMatrix.Blosum70¶
- Blosum75 = ScoringMatrix.Blosum75¶
- Blosum80 = ScoringMatrix.Blosum80¶
- Blosum85 = ScoringMatrix.Blosum85¶
- Blosum90 = ScoringMatrix.Blosum90¶
- Blosum95 = ScoringMatrix.Blosum95¶
- Pam10 = ScoringMatrix.Pam10¶
- Pam100 = ScoringMatrix.Pam100¶
- Pam110 = ScoringMatrix.Pam110¶
- Pam120 = ScoringMatrix.Pam120¶
- Pam130 = ScoringMatrix.Pam130¶
- Pam140 = ScoringMatrix.Pam140¶
- Pam150 = ScoringMatrix.Pam150¶
- Pam160 = ScoringMatrix.Pam160¶
- Pam170 = ScoringMatrix.Pam170¶
- Pam180 = ScoringMatrix.Pam180¶
- Pam190 = ScoringMatrix.Pam190¶
- Pam20 = ScoringMatrix.Pam20¶
- Pam200 = ScoringMatrix.Pam200¶
- Pam210 = ScoringMatrix.Pam210¶
- Pam220 = ScoringMatrix.Pam220¶
- Pam230 = ScoringMatrix.Pam230¶
- Pam240 = ScoringMatrix.Pam240¶
- Pam250 = ScoringMatrix.Pam250¶
- Pam260 = ScoringMatrix.Pam260¶
- Pam270 = ScoringMatrix.Pam270¶
- Pam280 = ScoringMatrix.Pam280¶
- Pam290 = ScoringMatrix.Pam290¶
- Pam30 = ScoringMatrix.Pam30¶
- Pam300 = ScoringMatrix.Pam300¶
- Pam310 = ScoringMatrix.Pam310¶
- Pam320 = ScoringMatrix.Pam320¶
- Pam330 = ScoringMatrix.Pam330¶
- Pam340 = ScoringMatrix.Pam340¶
- Pam350 = ScoringMatrix.Pam350¶
- Pam360 = ScoringMatrix.Pam360¶
- Pam370 = ScoringMatrix.Pam370¶
- Pam380 = ScoringMatrix.Pam380¶
- Pam390 = ScoringMatrix.Pam390¶
- Pam40 = ScoringMatrix.Pam40¶
- Pam400 = ScoringMatrix.Pam400¶
- Pam410 = ScoringMatrix.Pam410¶
- Pam420 = ScoringMatrix.Pam420¶
- Pam430 = ScoringMatrix.Pam430¶
- Pam440 = ScoringMatrix.Pam440¶
- Pam450 = ScoringMatrix.Pam450¶
- Pam460 = ScoringMatrix.Pam460¶
- Pam470 = ScoringMatrix.Pam470¶
- Pam480 = ScoringMatrix.Pam480¶
- Pam490 = ScoringMatrix.Pam490¶
- Pam50 = ScoringMatrix.Pam50¶
- Pam500 = ScoringMatrix.Pam500¶
- Pam60 = ScoringMatrix.Pam60¶
- Pam70 = ScoringMatrix.Pam70¶
- Pam80 = ScoringMatrix.Pam80¶
- Pam90 = ScoringMatrix.Pam90¶
- class refnd.kernels.protein.sequence.GlobalIdentityMode¶
Bases: object
Denominator used to normalise a global-alignment identity score.
After counting identical aligned positions the raw count is divided by:
AlignmentLength: the total length of the alignment (including gaps).
MaxSeqLength: the length of the longer of the two sequences.
MinSeqLength: the length of the shorter of the two sequences.
MaxLength (default): same as MaxSeqLength; recommended for RGP datasets.
- AlignmentLength = GlobalIdentityMode.AlignmentLength¶
- MaxLength = GlobalIdentityMode.MaxLength¶
- MaxSeqLength = GlobalIdentityMode.MaxSeqLength¶
- MinSeqLength = GlobalIdentityMode.MinSeqLength¶
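The effect of each denominator can be seen on a hand-written gapped alignment. The sketch below is plain Python and does not call refnd; the function name `identity` and the string mode names are hypothetical stand-ins for the enum members, and the way gap columns are counted is an assumption based on the description above.

```python
def identity(aligned_ref: str, aligned_query: str, mode: str = "MaxLength") -> float:
    """Identity of a gapped pairwise alignment ('-' marks a gap)."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned strings must have equal length")
    # Count columns where both sequences carry the same residue.
    matches = sum(
        1 for r, q in zip(aligned_ref, aligned_query) if r == q and r != "-"
    )
    ref_len = len(aligned_ref.replace("-", ""))
    query_len = len(aligned_query.replace("-", ""))
    denominators = {
        "AlignmentLength": len(aligned_ref),  # alignment columns, gaps included
        "MaxSeqLength": max(ref_len, query_len),
        "MinSeqLength": min(ref_len, query_len),
        "MaxLength": max(ref_len, query_len),  # same as MaxSeqLength
    }
    denom = denominators[mode]
    return matches / denom if denom else 0.0


# "MKTAYIAK" against "MKTAYIAKQR": 8 matching columns, 2 trailing gap columns.
ref, query = "MKTAYIAK--", "MKTAYIAKQR"
scores = {
    mode: identity(ref, query, mode)
    for mode in ("AlignmentLength", "MaxSeqLength", "MinSeqLength", "MaxLength")
}
print(scores)
```

For this pair only MinSeqLength yields 1.0, because the shorter sequence is fully matched; the other three modes divide by 10 and give 0.8.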
- class refnd.kernels.protein.sequence.VectorizationStrategy¶
Bases: object
SIMD vectorization layout used by the parasail alignment engine.
Striped (default for local): interleaved layout, best for short sequences.
Scan: sequential scan layout, often faster for long sequences or global alignment.
Diag: diagonal layout; niche use-case, rarely needed.
In practice the default per-aligner is a good choice; change only if profiling shows a bottleneck.
- Diag = VectorizationStrategy.Diag¶
- Scan = VectorizationStrategy.Scan¶
- Striped = VectorizationStrategy.Striped¶
- class refnd.kernels.protein.sequence.DatatypeWidth¶
Bases: object
Integer precision used for alignment score accumulation.
Short (8-bit), Half (16-bit), Full (32-bit), Long (64-bit): fixed-width integers; a lower width is faster but can overflow on long sequences.
Sat (default): 8-bit saturating arithmetic; if the score saturates, the alignment is silently rerun with 16-bit integers.
- Full = DatatypeWidth.Full¶
- Half = DatatypeWidth.Half¶
- Long = DatatypeWidth.Long¶
- Sat = DatatypeWidth.Sat¶
- Short = DatatypeWidth.Short¶
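The retry behaviour described for Sat can be pictured with a toy accumulator. This is only an illustrative sketch in plain Python, not parasail's or refnd's code: the helper names `accumulate` and `accumulate_sat` are hypothetical, only the upper bound is clamped, and reaching the cap is treated as saturation by assumption.

```python
def accumulate(scores, bits):
    """Sum scores with a saturating upper bound; report whether the cap was hit."""
    cap = 2 ** (bits - 1) - 1  # largest signed value for this width
    total, saturated = 0, False
    for s in scores:
        total += s
        if total >= cap:  # assumption: reaching the cap counts as saturated
            total, saturated = cap, True
    return total, saturated


def accumulate_sat(scores):
    """Try 8-bit first; rerun with 16-bit if the 8-bit run saturated."""
    total, saturated = accumulate(scores, bits=8)
    if saturated:
        total, saturated = accumulate(scores, bits=16)
    return total, saturated


short_run = [5] * 20  # sums to 100, fits in 8 bits
long_run = [5] * 40   # sums to 200, exceeds the 8-bit cap of 127

print(accumulate(long_run, bits=8))  # clamped result, flagged as saturated
print(accumulate_sat(short_run))
print(accumulate_sat(long_run))      # rerun at 16 bits recovers the true sum
```

The point of the design is that most pairs finish in the fast narrow pass, and only the pairs whose score hits the cap pay for a second, wider pass.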
- class refnd.kernels.protein.sequence.CoverageMode¶
Bases: object
Coverage filter applied before accepting a local alignment as valid.
A pair is scored only when the alignment covers enough of the sequences as specified by the mode and min_coverage threshold:
BothQueryTarget (default): both query and target must meet min_coverage.
Target: only the target must meet min_coverage.
Query: only the query must meet min_coverage.
LengthRatio: the shorter / longer length ratio must meet min_coverage.
ShorterSeq: coverage computed relative to the shorter sequence.
- BothQueryTarget = CoverageMode.BothQueryTarget¶
- LengthRatio = CoverageMode.LengthRatio¶
- Query = CoverageMode.Query¶
- ShorterSeq = CoverageMode.ShorterSeq¶
- Target = CoverageMode.Target¶
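One way to read the five modes is as a predicate over sequence lengths and the number of residues each sequence contributes to the local alignment. The sketch below is plain Python with a hypothetical function name and string mode names; treating ref_sample as the target, and the exact definition of ShorterSeq, are assumptions rather than documented behaviour.

```python
def passes_coverage(query_len, target_len, query_aligned, target_aligned,
                    min_coverage=0.8, mode="BothQueryTarget"):
    """Return True if a local alignment meets the coverage criterion."""
    query_cov = query_aligned / query_len      # fraction of query inside the alignment
    target_cov = target_aligned / target_len   # fraction of target inside the alignment
    if mode == "BothQueryTarget":
        return query_cov >= min_coverage and target_cov >= min_coverage
    if mode == "Query":
        return query_cov >= min_coverage
    if mode == "Target":
        return target_cov >= min_coverage
    if mode == "LengthRatio":
        return min(query_len, target_len) / max(query_len, target_len) >= min_coverage
    if mode == "ShorterSeq":
        # Assumption: use the coverage of whichever sequence is shorter.
        shorter_cov = query_cov if query_len <= target_len else target_cov
        return shorter_cov >= min_coverage
    raise ValueError(f"unknown coverage mode: {mode}")


# Query "CDEFGHI" (7 residues) matches a 7-residue stretch of
# target "ACDEFGHIKLM" (11 residues): query coverage 1.0, target coverage 7/11.
args = dict(query_len=7, target_len=11, query_aligned=7, target_aligned=7)
print(passes_coverage(**args, min_coverage=0.5, mode="Query"))
print(passes_coverage(**args, min_coverage=0.8, mode="BothQueryTarget"))
print(passes_coverage(**args, min_coverage=0.8, mode="LengthRatio"))
```

Under these assumptions the pair is accepted in Query mode at 0.5 but rejected by the default BothQueryTarget mode at 0.8, since the target is only about 64% covered.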
- class refnd.kernels.protein.sequence.LocalIdentityMode¶
Bases: object
Denominator used to normalise a local-alignment identity score.
AlignmentLength (default): divide by the length of the local alignment.
MinSeqLength: divide by the shorter sequence length.
- AlignmentLength = LocalIdentityMode.AlignmentLength¶
- MinSeqLength = LocalIdentityMode.MinSeqLength¶
refnd.kernels.molecules¶
Tanimoto distance kernels for molecular fingerprints.
- class refnd.kernels.molecules.TanimotoBit¶
Bases: object
Tanimoto distance kernel for binary (bit) molecular fingerprints.
Measures structural dissimilarity between two molecules as the complement of the Jaccard index over their feature sets. A distance of 0 means the two fingerprints are identical; 1 means they share no features at all.
Formula:
    1 - |A ∩ B| / |A ∪ B|
This kernel is the standard choice when working with ExplicitBitVect fingerprints such as Morgan, RDKit, or MACCS keys.
Example:
    from rdkit.Chem import MolFromSmiles, rdFingerprintGenerator
    from refnd.kernels.molecules import BitFingerprint, TanimotoBit

    mfpgen = rdFingerprintGenerator.GetMorganGenerator(fpSize=1024, radius=2)
    benzene = BitFingerprint(mfpgen.GetFingerprint(MolFromSmiles("c1ccccc1")))
    naphthalene = BitFingerprint(mfpgen.GetFingerprint(MolFromSmiles("c1ccc2ccccc2c1")))
    acetic_acid = BitFingerprint(mfpgen.GetFingerprint(MolFromSmiles("CC(=O)O")))

    k = TanimotoBit()
    print(k(benzene, naphthalene))  # lower: structurally more similar
    print(k(benzene, acetic_acid))  # higher: structurally dissimilar
- call(a: utils.BitFingerprint, b: utils.BitFingerprint) float¶
Compute the Tanimoto distance between two BitFingerprint objects.
- Parameters:
a – First fingerprint.
b – Second fingerprint.
- Returns:
Distance in [0.0, 1.0]. 0.0 means identical feature sets, 1.0 means fully disjoint.
Example:
    fp1 = BitFingerprint.from_list([True, False, True, True])
    fp2 = BitFingerprint.from_list([True, True, True, False])
    k = TanimotoBit()
    assert k.call(fp1, fp2) == k(fp1, fp2)  # both forms are equivalent
    # |intersection| = |{0, 2}| = 2, |union| = |{0, 1, 2, 3}| = 4 → distance = 1 - 2/4 = 0.5
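The arithmetic in the example above can be checked without refnd or RDKit. The following is a minimal reference sketch of the formula over plain Python lists of booleans; the function name `tanimoto_bit_distance` is hypothetical, and returning 0.0 for two all-zero fingerprints is an assumption.

```python
def tanimoto_bit_distance(a, b):
    """Tanimoto (Jaccard) distance between two equal-length bit lists."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # |A ∩ B|
    either = sum(1 for x, y in zip(a, b) if x or y)   # |A ∪ B|
    # Assumption: two empty fingerprints are treated as identical.
    return 0.0 if either == 0 else 1.0 - both / either


fp1 = [True, False, True, True]
fp2 = [True, True, True, False]
print(tanimoto_bit_distance(fp1, fp2))  # 1 - 2/4
```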
- class refnd.kernels.molecules.TanimotoReal¶
Bases:
objectTanimoto distance kernel for real-valued (count) molecular fingerprints.
Generalises the binary Tanimoto to continuous feature vectors using the dot-product formulation. A distance of 0 means the two fingerprints are proportional; values approach 1 as the vectors become orthogonal.
Formula:
1 - dot(a, b) / (||a||² + ||b||² - dot(a, b))This kernel is the standard choice when working with count fingerprints such as those returned by
GetCountFingerprint(Morgan counts, etc.).Example:
from rdkit.Chem import MolFromSmiles, rdFingerprintGenerator from refnd.kernels.molecules import RealFingerprint, TanimotoReal mfpgen = rdFingerprintGenerator.GetMorganGenerator(fpSize=1024, radius=2) benzene = RealFingerprint(mfpgen.GetCountFingerprint(MolFromSmiles("c1ccccc1"))) naphthalene = RealFingerprint(mfpgen.GetCountFingerprint(MolFromSmiles("c1ccc2ccccc2c1"))) acetic_acid = RealFingerprint(mfpgen.GetCountFingerprint(MolFromSmiles("CC(=O)O"))) k = TanimotoReal() print(k(benzene, naphthalene)) # low — structurally similar print(k(benzene, acetic_acid)) # high — structurally dissimilar
- call(a: utils.RealFingerprint, b: utils.RealFingerprint) float¶
Compute the Tanimoto distance between two RealFingerprint objects.
- Parameters:
a – First fingerprint.
b – Second fingerprint.
- Returns:
Distance in [0.0, 1.0]. 0.0 means identical feature vectors, 1.0 means fully orthogonal.
Example:
    fp1 = RealFingerprint.from_list([1.0, 0.0, 1.0])
    fp2 = RealFingerprint.from_list([0.0, 1.0, 1.0])
    k = TanimotoReal()
    assert k.call(fp1, fp2) == k(fp1, fp2)  # both forms are equivalent
    # dot=1, ||fp1||²=2, ||fp2||²=2 → distance = 1 - 1/(2+2-1) ≈ 0.667
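The worked numbers above follow directly from the dot-product formula and can be reproduced in plain Python. This is a reference sketch only, not the refnd implementation; the function name `tanimoto_real_distance` is hypothetical, and returning 0.0 for two all-zero vectors is an assumption.

```python
def tanimoto_real_distance(a, b):
    """Tanimoto distance between two equal-length real-valued vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)  # ||a||²
    norm_b = sum(y * y for y in b)  # ||b||²
    denom = norm_a + norm_b - dot
    # Assumption: two all-zero vectors are treated as identical.
    return 0.0 if denom == 0 else 1.0 - dot / denom


fp1 = [1.0, 0.0, 1.0]
fp2 = [0.0, 1.0, 1.0]
print(round(tanimoto_real_distance(fp1, fp2), 3))  # 1 - 1/(2 + 2 - 1)
```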