Command Reference ¶

All commands match the original C++ RADSex CLI interface exactly.

1 process ¶

Build a marker depth table from demultiplexed reads.

rsx process -i <input_dir> -o <output.tsv> [-T threads] [-d min_depth] [-k kmer_size]

Flag	Default	Description
`-i`		Directory of FASTQ/FASTA files (gzip supported)
`-o`		Output markers depth table
`-T`	1	Number of threads
`-d`	1	Minimum depth to retain a marker
`-k`		Optional canonical k-mer deduplication before table output

Supported extensions: .fq, .fastq, .fasta, .fa, .fna (optionally .gz). Individual names are derived from filenames.

Optional -k / --kmer-dedup groups markers by min-hash of canonical k-mers of the given size (LSH heuristic for near-duplicate collapse — not a majority vote and not guaranteed for single-base errors). Within each group the sequence with the highest total depth is retained. The header #Number of markers counts only rows that pass -d / --min-depth.

2 distrib ¶

Compute marker distribution between two groups.

rsx distrib -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-d 5] [-G M,F] [-S 0.05] \
  [--correction bonferroni|fdr|none] [--test chisq|fisher|gtest] [--bayes] [-C]

Flag	Default	Description
`-t`		Markers depth table from `process`
`-p`		Population map (individualtgroup)
`-o`		Output distribution table
`-d`	1	Minimum depth
`-G`	auto	Groups to compare (comma-separated)
`-S`	0.05	Significance threshold
`--correction`	bonferroni	Multiple testing (see note below)
`--test`	chisq	Statistical test: chisq, fisher, gtest
`--bayes`		Add Bayes Factor + posterior columns per cell
`-C`	false	Legacy alias for `--correction none`

Correction note:

bonferroni follows RADSex: each cell’s p is compared to S / n_markers (marker-table size), preserving byte-compatible significance flags with C++ RADSex when groups are explicit.
fdr applies Benjamini–Hochberg across markers. The implementation computes one p-value per presence-count cell and weights it by the cell’s Markers count, which is equivalent to applying the procedure to the expanded list of marker p-values without materializing that list.
none uses the raw threshold S on each cell p-value.

The metadata header reports n_markers and n_tests from this marker-level testing universe. Empty cells have zero multiplicity and cannot be significant.

3 signif ¶

Extract markers significantly associated with a group.

rsx signif -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-d 5] [-G M,F] [-S 0.05] \
  [--correction bonferroni|fdr|none] [--test chisq|fisher|gtest] \
  [--backend cpu|cuda] [--bayes] [-a]

Flag	Default	Description
`--correction`	bonferroni	Multiple testing: bonferroni, fdr, none
`--test`	chisq	Statistical test: chisq, fisher, gtest
`--backend`	cpu	P-value backend: cpu or cuda
`--bayes`		Add Bayes Factor + posterior P(sex-linked) cols
`-a`		Output FASTA instead of table

Two-pass streaming for Bonferroni / none: pass 1 counts markers, pass 2 writes directly with O(n_individuals) working memory.

FDR correction (--correction fdr) stores all marker p-values (O(n_markers) floats) for Benjamini–Hochberg, then re-streams the table to write hits — it does not materialize full sequences/depths for every marker. Still not suitable for panels where even a float per marker exceeds RAM. Combining --output-fasta with FDR is rejected with an error (use table output or Bonferroni/none for FASTA).

--bayes is table output. It adds Bayes_Factor and Posterior_SexLinked columns, where the posterior treats either compared group as the enriched sex-linked group. FASTA output with -a emits marker sequences and does not include table-only Bayesian columns.

The CUDA backend is available in builds made with --features cuda and is restricted to --test chisq. It sends two u32 counts per marker to the device and returns one f64 p-value (8 bytes in each direction per marker). Backend selection never falls back silently. The host holds the count and p-value vectors (16 bytes per marker); FDR additionally holds the q-value vector. An NVIDIA driver and NVRTC are required at runtime.

4 triage ¶

Write strict and posterior-supported sex-linked marker evidence for each called marker.

rsx triage -t <table.tsv> -p <popmap.tsv> -o <triage.tsv> [-d 5] [-G M,F] \
  [--posterior-threshold 0.9] [--bayes-factor-threshold 10]

The output joins RADSex-style Bonferroni evidence and Bayesian evidence in one marker-level table. Columns include group penetrance, bias direction, p-value, Bonferroni-corrected p-value, Bayes factor, posterior P(sex-linked), and Candidate_Class. Classes are strict+posterior, strict_only, posterior_only, and bayes_factor_only. If no marker meets any configured call threshold, the table retains the highest-posterior marker as exploratory. This row is a ranked lead, not a sex-linked marker call.

Interpretation is directional. A male-biased high-evidence marker is a putative Y-linked RAD marker; a female-biased high-evidence marker is a putative W-linked RAD marker. Bayes-factor-only rows are not sex-linked marker calls unless the posterior and QC outputs also support them.

5 freq ¶

Compute marker frequency distribution.

rsx freq -t <table.tsv> -o <output.tsv> [-d 5]

6 depth ¶

Compute retained read statistics per individual.

rsx depth -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-f 0.75]

Flag	Default	Description
`-f`	0.75	Min fraction of individuals for a marker to count

Auto-detects large files (> 2GB) and switches to streaming mode with external sort for the exact mathematical median. Even sample counts preserve half-integer medians by averaging the two middle depths. Sparse optimization: only non-zero depths are sorted (3.3x reduction for typical RAD-seq data). See scripts/sympy/sparse_median_proof.py for the mathematical proof.

7 map ¶

Align markers to a reference genome.

rsx map -t <table.tsv> -p <popmap.tsv> -g <genome.fa> -o <output.tsv> \
  [-d 5] [-G M,F] [-q 20] [-Q 0.1] [-S 0.05] [-C]

Flag	Default	Description
`-g`		Reference genome (FASTA)
`-q`	20	Minimum mapping quality
`-Q`	0.1	Minimum marker frequency

Uses minimap2 with sr (short-read) preset. Two-pass streaming: pass 1 counts markers, pass 2 aligns and writes directly. Only available when built with the map feature (default on, not on Windows).

Note: the minimap2 genome index can be large (e.g. 21GB for salmon genomes). This is intrinsic to the aligner, not rsx.

8 subset ¶

Extract a filtered subset of markers.

rsx subset -t <table.tsv> -p <popmap.tsv> -o <output.tsv> \
  [-d 5] [-G M,F] [-m min_g1] [-M max_g1] [-n min_g2] [-N max_g2] \
  [-i min_ind] [-I max_ind] [-S 0.05] [-C] [-a]

Two-pass streaming like signif. O(n_individuals) memory.

9 merge ¶

Merge multiple marker depth tables by sequence identity.

rsx merge -o <output.tsv> <table1.tsv> <table2.tsv> [table3.tsv ...] [-B 2000000]

Flag	Default	Description
`-o`		Output merged markers table
`-B`	2,000,000	Entries to buffer before flushing to disk

Uses external sort-merge with bounded memory (~500MB). Input files are positional arguments (supports shell glob). Sequences are deduplicated across files using 2-bit DNA encoding. Depths from the same individual appearing in multiple files are summed.

The algorithm:

Reads all files streaming, buffers entries in RAM
When buffer full: sorts by packed sequence, writes lz4 temp file
K-way merges sorted temp files, coalesces equal sequences
Writes merged TSV output

Handles 75M+ unique sequences across hundreds of samples without OOM. Adjust -B to trade memory for fewer temp files (higher) or less RAM (lower).

Optional Parquet output (requires --features parquet-io):

rsx merge -o <output.parquet> --output-parquet <table1.tsv> <table2.tsv>

Parquet uses ZSTD compression with nullable u16 depth columns. Sparse zeros stored as null for efficient RLE encoding.

10 pca ¶

Streaming sample PCA of the marker-depth matrix.

rsx pca -t <table.tsv> -o <output_dir/> [-d 5] [-r 10]

Flag	Default	Description
`-t`		Markers depth table
`-o`		Output directory
`-d`	1	Minimum depth
`-r`	all	Number of principal components to output

Computes sample principal components by streaming the centered Gram matrix X^TX (n_individualsx n_individuals) and eigendecomposing it. The per-individual columns are the same right singular vectors (sample-space axes / scores) that full-matrix PCA would produce; the marker table does not need to be materialized. These are not classical marker feature loadings.

Memory: O(n_individuals²) = 320KB for 200 individuals, regardless of the number of markers. Works on arbitrarily large tables.

Output files:

eigenvalues.tsv: component, eigenvalue, variance_fraction, cumulative
loadings.tsv: individual × component sample scores (legacy filename)
sample_scores.tsv: same matrix under the clearer name
summary.txt: total variance, top components

Use cases:

Sex signal detection: tests whether PC1 separates M/F. On real-world RAD-seq data, the answer is no – sex-linked markers are only 0.03% of the depth matrix, drowned by library size and population effects. This negative finding justifies targeted methods (FST, chi-squared).
Sample QC: outlier individuals on PC2-3 indicate sequencing problems or batch effects.
Population structure: on combined multi-population tables, PC1 separates populations, not sexes. Confirms per-population analysis is the correct design for sex detection.
Rank estimation: eigenvalue spectrum reveals effective dimensionality. For RAD-seq depth data, typically 3-5 components explain > 90% of variance (low-rank, driven by restriction site conservation patterns).

See scripts/sympy/tucker_covariance_proof.py for the mathematical proof that this gives exact Tucker mode-2 factors.