mod kmer

K-mer based marker deduplication.

Groups markers by shared canonical k-mer signatures to collapse sequencing-error variants. This reduces the number of markers tested, which increases statistical power for sex detection.

Functions

fn canonical_kmer_hash(seq: &[u8], k: usize) -> u64

Computes a canonical k-mer min-hash for a DNA sequence. Each length-k window is taken in its canonical form, the lexicographically smaller of the k-mer and its reverse complement, and the representative for the sequence is the minimum hash among its canonical k-mers. This is a locality-sensitive hashing (LSH) heuristic for grouping similar sequences, such as sequencing-error variants. Two sequences that differ by a single base are not guaranteed to share a group (see test_group_single_base_error), so use it for approximate collapse only.
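The description above can be sketched roughly as follows. This is an illustration under stated assumptions, not the crate's implementation: the helper names (complement, revcomp, fnv1a, min_canonical_hash_sketch) are hypothetical, FNV-1a is a stand-in for whatever hash the module actually uses, and returning None for inputs shorter than k is an assumed policy that the documentation does not specify.

```rust
/// Complement of one nucleotide. Passing non-ACGT bytes through unchanged
/// is an assumption made for this sketch.
fn complement(b: u8) -> u8 {
    match b {
        b'A' => b'T',
        b'C' => b'G',
        b'G' => b'C',
        b'T' => b'A',
        other => other,
    }
}

/// Reverse complement of a k-mer.
fn revcomp(kmer: &[u8]) -> Vec<u8> {
    kmer.iter().rev().map(|&b| complement(b)).collect()
}

/// 64-bit FNV-1a, used here only as a simple deterministic stand-in hash.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Minimum hash over the canonical form of every length-k window.
/// Returns None when k is 0 or the sequence is shorter than k (assumed policy).
fn min_canonical_hash_sketch(seq: &[u8], k: usize) -> Option<u64> {
    if k == 0 || seq.len() < k {
        return None;
    }
    seq.windows(k)
        .map(|w| {
            let rc = revcomp(w);
            // Canonical form: the lexicographically smaller of the k-mer and its reverse complement.
            let canonical: &[u8] = if rc.as_slice() < w { &rc } else { w };
            fnv1a(canonical)
        })
        .min()
}
```

Because each window is canonicalized before hashing, a sequence and its reverse complement yield the same signature in this sketch. A single-base substitution changes every window that overlaps it, so the minimum can shift, which is why a shared group is not guaranteed.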

fn group_by_kmer(sequences: &[Vec<u8>], k: usize) -> ahash::AHashMap<u64, Vec<usize>>

Groups markers by canonical k-mer signature. Returns a map from each group hash to the indices of the markers (positions in sequences) that share it.
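A minimal sketch of the grouping step, assuming the signature is supplied as a closure so the example stays self-contained. The names group_by_signature_sketch and representatives_sketch are hypothetical, std::collections::HashMap stands in for ahash::AHashMap to avoid an external dependency, and keeping the first index of each group is only one possible way to choose a representative; the documentation does not say how the real module does it.

```rust
use std::collections::HashMap;

/// Bucket sequence indices by a caller-supplied 64-bit signature.
fn group_by_signature_sketch<F>(sequences: &[Vec<u8>], signature: F) -> HashMap<u64, Vec<usize>>
where
    F: Fn(&[u8]) -> u64,
{
    let mut groups: HashMap<u64, Vec<usize>> = HashMap::new();
    for (idx, seq) in sequences.iter().enumerate() {
        // Indices are pushed in input order, so each group's list is ascending.
        groups.entry(signature(seq.as_slice())).or_default().push(idx);
    }
    groups
}

/// One representative per group: the lowest marker index (assumed policy).
fn representatives_sketch(groups: &HashMap<u64, Vec<usize>>) -> Vec<usize> {
    let mut reps: Vec<usize> = groups
        .values()
        .filter_map(|members| members.first().copied())
        .collect();
    // HashMap iteration order is unspecified, so sort for a stable result.
    reps.sort_unstable();
    reps
}
```

In practice the closure would wrap the canonical k-mer signature for a fixed k. The number of keys in the returned map is the number of markers remaining after collapse.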