Command Reference

Author:

Rohit Goswami

1 Command Reference

All commands match the original C++ RADSex CLI interface exactly.

1.1 process

Build a marker depth table from demultiplexed reads.

rsx process -i <input_dir> -o <output.tsv> [-T threads] [-d min_depth] [-k kmer_size]

Flag

Default

Description

-i

Directory of FASTQ/FASTA files (gzip supported)

-o

Output markers depth table

-T

1

Number of threads

-d

1

Minimum depth to retain a marker

-k

Optional canonical k-mer deduplication before table output

Supported extensions: .fq, .fastq, .fasta, .fa, .fna (optionally .gz). Individual names are derived from filenames.

Optional -k / --kmer-dedup groups markers by canonical k-mer of the given size before output. The representative marker from each k-mer group is the sequence with the highest total depth.

1.2 distrib

Compute marker distribution between two groups.

rsx distrib -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-d 5] [-G M,F] [-S 0.05] [-C]

Flag

Default

Description

-t

Markers depth table from process

-p

Population map (individualtgroup)

-o

Output distribution table

-d

1

Minimum depth

-G

auto

Groups to compare (comma-separated)

-S

0.05

Significance threshold

-C

false

Disable Bonferroni correction

1.3 signif

Extract markers significantly associated with a group.

rsx signif -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-d 5] [-G M,F] [-S 0.05] \
  [--correction bonferroni|fdr|none] [--test chisq|fisher|gtest] [--bayes] [-a]

Flag

Default

Description

--correction

bonferroni

Multiple testing: bonferroni, fdr, none

--test

chisq

Statistical test: chisq, fisher, gtest

--bayes

Add Bayes Factor + posterior P(sex-linked) cols

-a

Output FASTA instead of table

Two-pass streaming: pass 1 counts markers, pass 2 writes directly. O(nindividuals) memory. FDR correction collects p-values first (uses more memory but still bounded).

--bayes is table output. It adds Bayes_Factor and Posterior_SexLinked columns, where the posterior treats either compared group as the enriched sex-linked group. FASTA output with -a emits marker sequences and does not include table-only Bayesian columns.

1.4 triage

Write strict and posterior-supported sex-linked marker evidence for each called marker.

rsx triage -t <table.tsv> -p <popmap.tsv> -o <triage.tsv> [-d 5] [-G M,F] \
  [--posterior-threshold 0.9] [--bayes-factor-threshold 10]

The output joins RADSex-style Bonferroni evidence and Bayesian evidence in one marker-level table. Columns include group penetrance, bias direction, p-value, Bonferroni-corrected p-value, Bayes factor, posterior P(sex-linked), and Candidate_Class. Classes are strict+posterior, strict_only, posterior_only, and bayes_factor_only.

Interpretation is directional. A male-biased high-evidence marker is a putative Y-linked RAD marker; a female-biased high-evidence marker is a putative W-linked RAD marker. Bayes-factor-only rows are not sex-linked marker calls unless the posterior and QC outputs also support them.

1.5 freq

Compute marker frequency distribution.

rsx freq -t <table.tsv> -o <output.tsv> [-d 5]

1.6 depth

Compute retained read statistics per individual.

rsx depth -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-f 0.75]

Flag

Default

Description

-f

0.75

Min fraction of individuals for a marker to count

Auto-detects large files (> 2GB) and switches to streaming mode with external sort for exact median. Sparse optimization: only non-zero depths are sorted (3.3x reduction for typical RAD-seq data). See scripts/sympy/sparse_median_proof.py for the mathematical proof.

1.7 map

Align markers to a reference genome.

rsx map -t <table.tsv> -p <popmap.tsv> -g <genome.fa> -o <output.tsv> \
  [-d 5] [-G M,F] [-q 20] [-Q 0.1] [-S 0.05] [-C]

Flag

Default

Description

-g

Reference genome (FASTA)

-q

20

Minimum mapping quality

-Q

0.1

Minimum marker frequency

Uses minimap2 with sr (short-read) preset. Two-pass streaming: pass 1 counts markers, pass 2 aligns and writes directly. Only available when built with the map feature (default on, not on Windows).

Note: the minimap2 genome index can be large (e.g. 21GB for salmon genomes). This is intrinsic to the aligner, not rsx.

1.8 subset

Extract a filtered subset of markers.

rsx subset -t <table.tsv> -p <popmap.tsv> -o <output.tsv> \
  [-d 5] [-G M,F] [-m min_g1] [-M max_g1] [-n min_g2] [-N max_g2] \
  [-i min_ind] [-I max_ind] [-S 0.05] [-C] [-a]

Two-pass streaming like signif. O(nindividuals) memory.

1.9 merge

Merge multiple marker depth tables by sequence identity.

rsx merge -o <output.tsv> <table1.tsv> <table2.tsv> [table3.tsv ...] [-B 2000000]

Flag

Default

Description

-o

Output merged markers table

-B

2,000,000

Entries to buffer before flushing to disk

Uses external sort-merge with bounded memory (~500MB). Input files are positional arguments (supports shell glob). Sequences are deduplicated across files using 2-bit DNA encoding. Depths from the same individual appearing in multiple files are summed.

The algorithm:

  1. Reads all files streaming, buffers entries in RAM

  2. When buffer full: sorts by packed sequence, writes lz4 temp file

  3. K-way merges sorted temp files, coalesces equal sequences

  4. Writes merged TSV output

Handles 75M+ unique sequences across hundreds of samples without OOM. Adjust -B to trade memory for fewer temp files (higher) or less RAM (lower).

Optional Parquet output (requires --features parquet-io):

rsx merge -o <output.parquet> --output-parquet <table1.tsv> <table2.tsv>

Parquet uses ZSTD compression with nullable u16 depth columns. Sparse zeros stored as null for efficient RLE encoding.

1.10 pca

Streaming sample PCA of the marker-depth matrix.

rsx pca -t <table.tsv> -o <output_dir/> [-d 5] [-r 10]

Flag

Default

Description

-t

Markers depth table

-o

Output directory

-d

1

Minimum depth

-r

all

Number of principal components to output

Computes sample principal components by streaming the centered Gram matrix XTX (nindividualsx nindividuals) and eigendecomposing it. The loadings are the same right singular vectors that full-matrix PCA would produce, but the marker table does not need to be materialized.

Memory: O(nindividuals2) = 320KB for 200 individuals, regardless of the number of markers. Works on arbitrarily large tables.

Output files:

  • eigenvalues.tsv: component, eigenvalue, variancefraction, cumulative

  • loadings.tsv: individual x component loading matrix

  • summary.txt: total variance, top components

Use cases:

  • Sex signal detection: tests whether PC1 separates M/F. On real-world RAD-seq data, the answer is no – sex-linked markers are only 0.03% of the depth matrix, drowned by library size and population effects. This negative finding justifies targeted methods (FST, chi-squared).

  • Sample QC: outlier individuals on PC2-3 indicate sequencing problems or batch effects.

  • Population structure: on combined multi-population tables, PC1 separates populations, not sexes. Confirms per-population analysis is the correct design for sex detection.

  • Rank estimation: eigenvalue spectrum reveals effective dimensionality. For RAD-seq depth data, typically 3-5 components explain > 90% of variance (low-rank, driven by restriction site conservation patterns).

See scripts/sympy/tucker_covariance_proof.py for the mathematical proof that this gives exact Tucker mode-2 factors.