Quick Start

1 Installation

1.1 From source (recommended)

git clone https://github.com/HaoZeke/rsx-rs.git
cd rsx-rs
cargo build --release
# Binary at target/release/rsx

1.2 From GitHub releases

Pre-built binaries are available for Linux (x86_64, aarch64) and macOS (x86_64, arm64) from the Releases page. On Windows, build from source without the map feature (minimap2 is not available there).

1.3 Via pixi

cd rsx-rs
pixi run build

2 Example workflow

Given demultiplexed RAD-seq reads in reads/ and a population map:

# popmap.tsv
ind1    M
ind2    M
ind3    F
ind4    F
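
Before running the steps below, it can help to confirm that the population map parses cleanly and that both groups are present. The following is a minimal sketch and not part of rsx: it assumes whitespace-separated `individual group` pairs as in the example above and treats lines starting with `#` as comments. The helper name `read_popmap` is hypothetical.

```python
from collections import Counter
import io


def read_popmap(handle):
    """Return {individual: group} from a two-column popmap (hypothetical helper)."""
    popmap = {}
    for line in handle:
        line = line.strip()
        # Skip blank lines and comment lines such as "# popmap.tsv".
        if not line or line.startswith("#"):
            continue
        individual, group = line.split()[:2]
        popmap[individual] = group
    return popmap


# In-memory stand-in for popmap.tsv so the sketch runs without any files.
example = io.StringIO("# popmap.tsv\nind1\tM\nind2\tM\nind3\tF\nind4\tF\n")
popmap = read_popmap(example)
group_sizes = Counter(popmap.values())
print(dict(group_sizes))
```

For a real file, pass `open("popmap.tsv")` in place of the `io.StringIO` object.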

2.1 Step 1: Build markers table

rsx process -i reads/ -o markers.tsv -T 4 -d 5

2.2 Step 2: Check marker frequencies

rsx freq -t markers.tsv -o freq.tsv -d 5

2.3 Step 3: Compute sex-bias distribution

rsx distrib -t markers.tsv -p popmap.tsv -o distrib.tsv -d 5 -G M,F

2.4 Step 4: Extract significant markers

rsx signif -t markers.tsv -p popmap.tsv -o signif.tsv -d 5 -G M,F

2.5 Step 5: Map to reference genome

rsx map -t markers.tsv -p popmap.tsv -g genome.fa -o aligned.tsv -d 5 -G M,F

2.6 Step 6: Merge multiple tables

rsx merge -o combined.tsv pop1_markers.tsv pop2_markers.tsv pop3_markers.tsv

The merge uses a bounded-memory external sort (~500 MB), so it scales to arbitrarily large datasets.
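
To illustrate the general idea behind a bounded-memory external sort, here is a minimal Python sketch of the textbook technique: sort fixed-size chunks in memory, spill each sorted run to a temporary file, then k-way merge the runs. It is not rsx's implementation. The sort key, buffer size, and on-disk format rsx uses are not described here, and `external_sort` and `max_lines_in_memory` are hypothetical names.

```python
import heapq
import os
import tempfile


def external_sort(lines, max_lines_in_memory=100_000):
    """Yield `lines` in sorted order, holding at most one chunk in memory."""
    run_paths = []
    chunk = []

    def flush():
        # Sort the current chunk and spill it to disk as one sorted run.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as fh:
            fh.writelines(s + "\n" for s in chunk)
        run_paths.append(path)
        chunk.clear()

    for line in lines:
        chunk.append(line.rstrip("\n"))
        if len(chunk) >= max_lines_in_memory:
            flush()
    if chunk:
        flush()

    handles = [open(p) for p in run_paths]
    try:
        # Strip newlines before merging so the merge compares the same
        # strings that were used to sort each run.
        streams = [(l.rstrip("\n") for l in fh) for fh in handles]
        yield from heapq.merge(*streams)
    finally:
        for fh in handles:
            fh.close()
        for p in run_paths:
            os.remove(p)


print(list(external_sort(["b", "a", "c", "a"], max_lines_in_memory=2)))
```

Peak memory is governed by the chunk size plus one buffered line per run, which is why the footprint stays flat as the input grows.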

2.7 Step 7: Streaming PCA

rsx pca -t combined.tsv -o pca_results/ -d 5 -r 10

Writes eigenvalues, per-individual sample scores (sample_scores.tsv, also written under the legacy name loadings.tsv), and a summary to the output directory. The scores are sample-space principal axes, not marker feature loadings. On real RAD panels, PC1 often reflects library size or population structure rather than sex.
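
The sketch below shows one way a sample-space PCA can be streamed: accumulate an n_individuals × n_individuals Gram matrix one marker row at a time, then extract its leading axis. It is an illustration under stated assumptions, not rsx's algorithm. It applies no centering or depth normalization, uses plain power iteration, and returns only the first component. The function names are hypothetical.

```python
import math
import random


def gram_from_rows(rows, n_individuals):
    """Accumulate G = sum of x x^T over marker rows; memory is O(n_individuals^2)."""
    gram = [[0.0] * n_individuals for _ in range(n_individuals)]
    for x in rows:
        for i in range(n_individuals):
            xi = x[i]
            if xi == 0:
                continue  # zero depth contributes nothing to row i
            for j in range(n_individuals):
                gram[i][j] += xi * x[j]
    return gram


def top_component(gram, iterations=200, seed=0):
    """Power iteration: leading eigenvalue and eigenvector of a symmetric PSD matrix."""
    n = len(gram)
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # all-zero matrix: nothing to converge to
        v = [c / norm for c in w]
    w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * w[i] for i in range(n))  # Rayleigh quotient, |v| = 1
    return eigenvalue, v


# Toy depth rows (markers x individuals); any iterable of rows works,
# so a real table could be fed lazily without loading it whole.
rows = [[2, 2, 0, 0], [2, 2, 0, 0], [0, 0, 1, 1]]
gram = gram_from_rows(rows, n_individuals=4)
eigenvalue, scores = top_component(gram)
print(round(eigenvalue, 6))
```

Because only the Gram matrix is retained, memory depends on the number of individuals and not on the number of markers.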

3 Output format

All outputs are tab-separated with an optional #source: comment line. The format is identical to the original C++ RADSex tool, so existing R scripts work without modification.
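
For readers who work in Python instead of R, a table in this layout can be read with the standard library alone. This is a minimal sketch that assumes the first non-comment line is a header row. The column names and the `#source:` value in the example are placeholders, not real rsx output, and `read_rsx_table` is a hypothetical helper.

```python
import csv
import io


def read_rsx_table(handle):
    """Yield one dict per data row, skipping lines that start with '#'."""
    data_lines = (line for line in handle if not line.startswith("#"))
    yield from csv.DictReader(data_lines, delimiter="\t")


# Placeholder content standing in for a real output file.
example = io.StringIO("#source:example\ncol_a\tcol_b\n1\tx\n2\ty\n")
table_rows = list(read_rsx_table(example))
print(table_rows)
```

All values come back as strings; convert numeric columns explicitly once the actual column names are known.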

4 Memory guarantees

Default analysis paths stream with memory independent of marker count; some modes intentionally accumulate more:

Command	Memory
distrib, freq	O(n_individuals) or O(n² cells)
signif (bonferroni / none)	O(n_individuals)
signif (FDR)	O(n_markers) p/q-values + re-stream
subset, triage	O(n_individuals)
map	O(genome_index)
depth (small, < 2 GB file)	O(n_markers × n_ind) depths
depth (> 2 GB)	O(buffer_size) external sort
merge	O(buffer_size)
pca	O(n_individuals²) Gram
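
The FDR row in the table needs memory proportional to the number of markers because a false-discovery-rate adjustment ranks all p-values together before any q-value can be reported. The sketch below uses the Benjamini-Hochberg procedure to show why. That choice is an assumption, since which FDR procedure rsx applies is not stated here, and `benjamini_hochberg` is a hypothetical helper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(p_values)
    # Ranking requires every p-value at once: this is the O(n_markers) part.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum
    # of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


q = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print([round(v, 4) for v in q])
```

A Bonferroni correction, by contrast, only multiplies each p-value by a constant, so it can be applied row by row while streaming.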

For 200 individuals and 75M markers, typical peak memory is under 500 MB (except for map, which loads the minimap2 genome index).
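
A back-of-envelope calculation shows why streaming matters at this scale. The element sizes below are assumptions chosen for illustration (8-byte floats for a Gram matrix, 2-byte integers for depths), not rsx's actual storage layout.

```python
n_individuals = 200
n_markers = 75_000_000

# Assumed element sizes, for illustration only.
BYTES_PER_FLOAT = 8
BYTES_PER_DEPTH = 2

# O(n_individuals^2): the Gram matrix kept by a sample-space PCA.
gram_bytes = n_individuals ** 2 * BYTES_PER_FLOAT

# O(n_markers x n_individuals): what a fully materialized depth table would need.
dense_depth_bytes = n_markers * n_individuals * BYTES_PER_DEPTH

print(f"Gram matrix:       {gram_bytes / 1e6:.2f} MB")
print(f"Dense depth table: {dense_depth_bytes / 1e9:.1f} GB")
```

Under these assumptions the Gram matrix takes 0.32 MB while a dense depth table would take 30 GB, which is the gap the streaming paths avoid.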

5 Language bindings (Python and R)

See Language bindings for the full matrix and install notes. Short form below.

5.1 Python bindings (high-level API)

Install with pip install pyrsx (or pixi run -e python build-python from source).

import pyrsx

# Process reads → marker depth table (Arrow-backed under the hood)
pyrsx.process("reads/", "markers.tsv", threads=8, min_depth=5)

# Distribution + significance with Bayesian evidence
pyrsx.distrib("markers.tsv", "popmap.tsv", "distrib.tsv", groups=["M", "F"])
pyrsx.signif("markers.tsv", "popmap.tsv", "signif.tsv",
             groups=["M", "F"], test="fisher", correction="fdr", bayes=True)

# High-level ergonomic objects + narwhals / plotting
tbl = pyrsx.MarkerTable.from_path("markers.tsv")
print("n_markers:", len(tbl))

# Streaming PCA (Tucker mode-2) for sex signal QC
pyrsx.pca("markers.tsv", "pca_out/", n_components=5)

# Merge multiple tables (bounded memory)
pyrsx.merge(["run1.tsv", "run2.tsv"], "merged.tsv")

See the Python README for the full surface (including from_dataframe, to_arrow, custom triage, etc.).

6 Reproducing the paper

Every benchmark, figure, and biological result reported for rsx is reproducible from a single deposited archive. The Zenodo deposit doi:10.5281/zenodo.20531539 bundles the pinned workflow (v0.2.6), the downloaded literature inputs, the result tables, and a one-command pixi + Snakemake pipeline:

# from the extracted reproducibility archive
pixi install
pixi run bench        # regenerates results/ and results/figures/

The archive clones rsx at the pinned tag, builds it alongside the C++ RADSex v1.2.0 reference, and regenerates the four-panel literature benchmark (the 8.38x geometric-mean speedup across 56 paired timings), the Bayesian evidence grades, and the sex-linked marker calls. Timings scale with the host hardware; the biological results (marker counts and evidence grades) do not.

6.1 R bindings (`rsxr`)

# pak::pak("HaoZeke/rsx-rs/rsx-r")  # needs cargo
library(rsxr)
rsx_version()
mt <- marker_table("markers.tsv")
triaged <- triage(mt, popmap = "popmap.tsv", min_depth = 10L)

Details: rsx-r/README.md and vignette("rsxr") after install.

Full R notes: R integration.