Performance Benchmarks¶
- Author:
Rohit Goswami
- Date:
2026-04-06
1 Performance Benchmarks¶
Comparison of rsx-rs (Rust) vs the original RADSex (C++11) across three dataset scales. All benchmarks are median of 3 runs on the same machine.
1.1 Test environment¶
CPU: AMD Threadripper (single-socket, benchmarks use 4 threads for
process)OS: Arch Linux 6.19.11
C++ compiler: GCC 15.2.1,
-O2 -std=c++11Rust compiler: rustc 1.85+,
--release(LTO thin)C++ radsex: v1.2.0 (with
<cstdint>fix for GCC 15)Rust rsx-rs: v0.1.0
1.2 Dataset scales¶
Scale |
Individuals |
Markers |
Reads/ind |
Description |
|---|---|---|---|---|
Small |
10 |
1,000 |
500 |
Quick validation |
Medium |
20 |
10,000 |
2,000 |
Typical small study |
Large |
40 |
100,000 |
5,000 |
Realistic RAD-seq size |
Synthetic data generated with benchmarks/generate_data.py: 10% male-biased
markers, 10% female-biased, 80% common. FASTQ files gzip-compressed.
1.3 Results¶
1.3.1 Per-command speedup (Rust / C++)¶
Command |
Small (1K) |
Medium (10K) |
Large (100K) |
Average |
|---|---|---|---|---|
process |
1.8x |
1.5x |
– |
1.7x |
freq |
4.0x |
2.5x |
2.6x |
3.0x |
depth |
5.3x |
5.0x |
3.6x |
4.6x |
distrib |
5.0x |
2.7x |
2.4x |
3.4x |
signif |
5.0x |
1.6x |
1.4x |
2.7x |
subset |
5.3x |
1.9x |
1.4x |
2.9x |
map |
2.0x |
1.8x |
– |
1.9x |
signif/subset large-scale slowdown vs previous: two-pass streaming reads the file twice for Bonferroni correction. The trade-off is bounded O(nindividuals) memory instead of O(nmarkers) accumulation.
1.3.2 Per-scale average speedup¶
Scale |
Average speedup |
|---|---|
Small |
4.1x |
Medium |
2.4x |
Large |
2.3x |
1.3.3 Overall¶
Rust is 2.0x faster across all 19 benchmarks (1.558s vs 0.780s total). All commands operate in bounded memory (< 500MB for any input size).
1.3.4 Microbenchmarks (criterion)¶
Operation |
Time |
Notes |
|---|---|---|
chisquaredyates |
4.6 ns |
Pure f64 arithmetic |
passociation(erfc) |
108 ns |
libm::erfc, was 562 ns (statrs) |
bitset popcount (40) |
2.3 ns |
1 u64 word |
bitset popcount (200) |
5.0 ns |
4 u64 words |
bitset popcount (1000) |
18 ns |
16 u64 words |
fastparseu16 |
4.5 ns |
Integer field parsing |
Cg float format |
193 ns |
C++ %g compatible output |
1.4 Literature-derived benchmark runs¶
The checked-in synthetic suite is small enough for repeatable repository
verification. Paper-facing biological benchmark runs download the public
datasets used by the RADSex literature workflow and run rsx against those
FASTQ files.
The tracked manifest benchmarks/literature_datasets.tsv contains:
15 public RADSex workflow datasets, BioProject
PRJNA548074, with the published RADSex command set run at minimum depths 1, 2, 5, and 10.The public
oryzias_latipesmedaka example, BioProjectPRJNA253959, with the RADSex example genome and medaka-specific thresholds.
The RADSex paper workflow describes the literature panel and its
parameters in its public data/info.tsv and config.yaml. The medaka
example gives a published benchmark target for process, distrib,
signif, and map on a public dataset. The executable benchmark task
queries NCBI SRA metadata, downloads ENA FASTQ files with MD5 validation
(with --download-method fastq-dump available as a fallback), runs the
command set, and writes measured rows to
benchmarks/results/literature_benchmark_results.csv:
pixi run -e benchmark run-literature-benchmarks
For a single public literature dataset:
pixi run -e benchmark run-literature-medaka
1.4.1 Completed biological benchmark panel¶
The tracked paper-facing panel contains four completed datasets downloaded from the RADSex literature BioProject. These are full FASTQ runs, not synthetic marker tables.
Dataset |
Samples |
Bases |
Markers |
Total runtime (s) |
|---|---|---|---|---|
|
58 |
10,200,185,246 |
29,357,812 |
1,240.187 |
|
42 |
10,468,745,314 |
11,337,736 |
1,104.048 |
|
65 |
9,924,715,896 |
7,338,124 |
1,045.383 |
|
43 |
11,257,335,754 |
6,932,387 |
1,020.880 |
The paper figures are generated from the tracked CSV files. The runtime
figure reports compute time only (process plus downstream analysis);
download time is retained in the raw CSV for reproducibility accounting but
is not used as a performance claim.
The panel is biologically anchored by the original RADSex results:
plecoglossus_altivelis is a previously described XX/XY case with 47
source-paper sex-associated markers at d=10, tinca_tinca is an XX/XY case
reported without a previously described system and six markers at d=10, and
danio_albolineatus and notothenia_rossii are source-paper null cases
with no SD identified at d=10. The Bayesian plots therefore compare
strict positive controls, strict negative calls, and posterior-supported
sex-linked marker hypotheses under the same published workflow.
pixi run -e benchmark plot-literature-benchmarks
The generated figure and table assets are written to docs/figures/:
literature_runtime_breakdown.svgliterature_compute_breakdown.svgliterature_marker_throughput.svgliterature_downstream_times.svgliterature_bayesian_evidence.svgliterature_prior_sensitivity_heatmap.svgliterature_candidate_recovery.svgliterature_bio_unlocks.svgliterature_mode_candidate_counts.svgliterature_mode_qc.svgliterature_depth_stability.svgliterature_radsex_speedups.svgliterature_benchmark_summary.tsvliterature_speedup_summary.tsvliterature_speedup_pairs.tsv
1.4.2 Literature speed comparison¶
The same downloaded FASTQ inputs were processed by rsx-rs and by C++ RADSex
v1.2.0 built in the Pixi benchmark environment. The C++ checkout requires
the standard <cstdint> include fix for modern GCC, but the RADSex command
behavior is otherwise unchanged.
Across 56 paired command/dataset/depth timings, on a single AMD Ryzen
Threadripper PRO 3955WX workstation (16 threads), rsx-rs is faster in 53
pairs. The geometric-mean speedup is 8.38x when speedup is defined as C++
wall time divided by Rust wall time. The process-stage geometric-mean
speedup is 2.77x across the four literature datasets (2.44x to 2.96x).
depth, freq, and distrib favor rsx-rs in every paired run, with
geometric-mean speedups of 6.91x, 22.5x, and 29.0x, respectively.
signif is the exception: rsx-rs is faster in 13 of 16 paired runs, with a
1.25x geometric-mean speedup, because both programs stream the complete
table. These are the numbers reported in the paper; the deposited
reproducibility archive (https://doi.org/10.5281/zenodo.20531539) regenerates every
figure and number, and they scale with the host hardware.
On Slurm clusters, submit per-dataset compute-only comparisons as an array job after the FASTQ files and marker tables have been prepared:
sbatch --export=ALL,RSX_REPO=$PWD benchmarks/slurm/rsx_release_build.sbatch
sbatch --export=ALL,RSX_REPO=$PWD benchmarks/slurm/radsex_cpp_build.sbatch
sbatch --export=ALL,RSX_REPO=$PWD benchmarks/slurm/literature_prepare_dataset.sbatch
sbatch --export=ALL,RSX_REPO=$PWD,RADSEX_BIN=/path/to/radsex \
benchmarks/slurm/literature_radsex_compare.sbatch
pixi run -e benchmark merge-literature-slurm-results
pixi run -e benchmark plot-literature-benchmarks
1.4.3 Bayesian biological evidence¶
Bayesian evidence is summarized from the aggregated distrib tables without
materializing marker-level multi-GB signif --bayes outputs. The summary is
tracked in benchmarks/results/literature_bayesian_evidence.csv and plotted
in docs/figures/literature_bayesian_evidence.svg.
The strongest Bayesian signal in the current panel is
danio_albolineatus at minimum depth 1: 61,191 markers have posterior
P(sex-linked) > 0.5 and 4,297 markers have posterior > 0.9. The same
Bonferroni FASTA extraction reports zero significant markers for that
dataset, so the Bayesian summary adds sex-linked marker probabilities that
are not available from strict thresholded FASTA output alone.
The candidate-recovery figure compares strict Bonferroni FASTA marker counts with posterior-supported markers at the same dataset/depth settings. This grounds the biological-value claim in the published RADSex workflow datasets: the Bayesian layer does not change the null-hypothesis test, but it exposes a sex-linked marker hypothesis set when all-or-nothing FASTA extraction is empty.
1.4.4 Prior sensitivity of the Bayesian triage layer¶
The hybrid triage layer carries two free parameters: the prior probability
pi that any given marker is sex-linked, and p_sex, the expected
prevalence in the linked sex. The sensitivity sweep runs rsx triage on
every (pi in {0.001, 0.005, 0.01, 0.02, 0.05, 0.1}) x (psexin {0.8, 0.85,
0.9, 0.95}) combination across the four literature datasets, producing 96
per-marker triage tables in benchmarks/results/slurm/. The aggregated
counts go to benchmarks/results/prior_sensitivity_from_triage.csv and the
heatmap to docs/figures/literature_prior_sensitivity_heatmap.svg.
sbatch --export=ALL,RSX_REPO=$PWD \
benchmarks/slurm/literature_biology_prior_sensitivity.sbatch
pixi run -e benchmark python benchmarks/plot_prior_sensitivity_heatmap.py
The per-dataset spread of the qualifying-marker count (posterior > 0.9) across the 24-cell grid speaks directly to the robustness of the Bayesian call:
Dataset |
Min posterior>0.9 |
Max posterior>0.9 |
Spread factor |
|---|---|---|---|
|
1 |
10,189 |
10,189 |
|
0 |
44 |
N/A |
|
46 |
87 |
1.9x |
|
6 |
169 |
28x |
The two positive controls behave differently. plecoglossus_altivelis
holds within a factor of two across the entire grid, so the Bayesian call
is essentially prior-insensitive on that dataset. tinca_tinca varies
more (6 to 169 markers across the grid), but the strict-call core remains.
danio_albolineatus shows extreme prior sensitivity, with the count
jumping from a single marker at (pi=0.001, psex=0.95) to ten thousand
markers at (pi>=0.05, psex<=0.85). notothenia_rossii peaks at 44
markers but is dominated by zeros, consistent with its singleton-heavy
retained-marker structure described below.
The figure makes this concrete cell by cell. For paper claims we report
all rsx counts at the documented default prior (pi=0.01, p_sex=0.9),
note the per-dataset spread, and forward the prior-sensitivity heatmap
as supporting evidence that the calls are stable for the
plecoglossus_altivelis positive control while explicitly prior-dependent
for danio_albolineatus.
1.4.5 Mode effects on published marker tables¶
The mode-effects analysis runs additional rsx statistical and QC modes through the Python bindings on the four downloaded marker tables at minimum depth 10:
pixi run -e paper analyze-literature-modes
pixi run -e benchmark plot-literature-benchmarks
The tracked result file is
benchmarks/results/literature_mode_effects.csv. It records the command, wall
time, output rows, marker counts, Bayesian thresholds, frequency-QC summaries,
and PCA variance/loadings for each dataset. The generated figures are
docs/figures/literature_mode_candidate_counts.svg and
docs/figures/literature_mode_qc.svg.
At d=10 the mode-effects run reproduces the source-paper strict calls and shows what the additional modes contribute:
Dataset |
Strict chisq/Bonf. |
Fisher/FDR |
G-test/FDR |
Bayes posterior > 0.9 |
PC1 variance |
Singleton fraction |
|---|---|---|---|---|---|---|
|
0 |
0 |
0 |
30 |
79.72% |
19.96% |
|
0 |
0 |
0 |
0 |
94.93% |
52.78% |
|
47 |
49 |
49 |
49 |
93.09% |
37.94% |
|
6 |
6 |
7 |
11 |
98.14% |
7.74% |
This table is the paper-facing demonstration of the added analysis modes. The
strict path preserves the RADSex interpretation: plecoglossus_altivelis and
tinca_tinca are positive controls, while danio_albolineatus and
notothenia_rossii remain strict null cases at d=10. Fisher/FDR and G-test/FDR
show the sensitivity of count-cell testing to exact and likelihood tests on the
same marker table. The Bayesian table gives posterior-supported marker
evidence: it recovers 49 high-posterior markers for
plecoglossus_altivelis, expands the tinca_tinca follow-up set to 11,
and distinguishes the two strict negative datasets
because danio_albolineatus has 30 high-posterior candidates whereas
notothenia_rossii has none at the same threshold.
The PCA and frequency columns are QC outputs rather than marker-discovery
claims. PC1 explains 79.7–98.1% of depth variation across the four datasets,
while the mean sex-loading deltas are small relative to the dominant depth
axis. This indicates that the leading sample-level variation is library/depth
structure, not a clean sex-separation axis. The singleton fraction separates
the retained marker tables as well: tinca_tinca is low-singleton at d=10
(7.74%), whereas notothenia_rossii remains dominated by singleton retained
markers (52.78%), which is a biological-QC warning for candidate follow-up.
1.4.6 Depth stability of the call layers¶
The mode-effects panel runs at min_depth=10 so it can be compared with the
RADSex literature workflow directly. The depth-stability sweep extends the
same eight modes to min_depth in {3, 5, 8, 10}, giving 128 rows tracked in
benchmarks/results/literature_mode_effects_sweep.csv and summarised per
(dataset, depth) in benchmarks/results/literature_depth_stability.csv:
sbatch --export=ALL,RSX_REPO=$PWD \
benchmarks/slurm/literature_biology_low_depth.sbatch
pixi run -e benchmark python benchmarks/collect_lowdepth_sweep.py
pixi run -e paper python benchmarks/summarise_depth_stability.py
Dataset |
mindepth |
strict |
posterior > 0.9 |
Bayes factor > 10 |
|---|---|---|---|---|
|
3 |
0 |
690 |
1,316 |
|
5 |
0 |
143 |
782 |
|
8 |
0 |
45 |
301 |
|
10 |
0 |
30 |
174 |
|
3 |
0 |
2 |
527 |
|
5 |
0 |
1 |
486 |
|
8 |
0 |
0 |
434 |
|
10 |
0 |
0 |
400 |
|
3 |
56 |
137 |
5,965 |
|
5 |
47 |
62 |
2,512 |
|
8 |
47 |
50 |
905 |
|
10 |
47 |
49 |
572 |
|
3 |
7 |
16 |
505 |
|
5 |
6 |
13 |
475 |
|
8 |
6 |
9 |
438 |
|
10 |
6 |
11 |
443 |
The strict call layer is depth-stable on the positive controls: the
plecoglossus_altivelis strict count holds at 47 from d=5 onward, and
tinca_tinca drops by a single marker between d=3 and d=5. The
Bayesian layers are far more depth-sensitive. The plecoglossus_altivelis
Bayes-factor > 10 count contracts ten-fold (5,965 to 572) as min_depth
moves from 3 to 10, and the posterior > 0.9 count contracts almost three-fold
(137 to 49). The figure docs/figures/literature_depth_stability.svg
plots the same data on a faceted log axis.
The two strict-null datasets behave differently across depths.
notothenia_rossii stays strict-zero at every depth and its posterior
> 0.9 count never exceeds 2, which corroborates the negative call from the
d=10 mode-effects table. danio_albolineatus is strict-zero everywhere
but its Bayesian counts contract sharply from d=3 to d=10, retaining
45 high-posterior and 301 Bayes-factor markers at d=8 and 30
high-posterior plus 174 Bayes-factor markers at d=10. This is the same
dataset that shows the highest prior sensitivity in the heatmap above, so
the paper should present it as a prior- and depth-sensitive hypothesis set
rather than a stable sex-determination call.
The depth-stability table is what motivates citing the
plecoglossus_altivelis strict numbers at the standard RADSex d=10
threshold while pairing the Bayesian numbers with their depth and prior
context.
1.4.7 Biological unlock summary¶
The marker-level unlock summary joins strict signif output with
signif --bayes output from the same d=10 marker tables. It is tracked in
benchmarks/results/literature_bio_unlocks.csv, and the top-ranked marker
rows are tracked in benchmarks/results/literature_candidate_triage.csv.
Dataset-level sex-system calls are tracked in
benchmarks/results/literature_sex_system_inference.csv.
The generated figure is docs/figures/literature_bio_unlocks.svg.
Dataset |
Strict + posterior |
Strict only |
Posterior only |
Bayes-factor only |
Interpretation |
|---|---|---|---|---|---|
|
0 |
0 |
30 |
158 |
posterior-supported markers below Bonferroni |
|
0 |
0 |
0 |
400 |
Bayes-factor evidence rejected by posterior threshold |
|
47 |
0 |
2 |
523 |
RADSex signal with posterior-supported additions |
|
6 |
0 |
5 |
432 |
RADSex signal with posterior-supported additions |
This is the clearest biological distinction unlocked by rsx beyond the
original RADSex thresholded FASTA endpoint. In the two positive controls,
posterior evidence recovers every strict d=10 marker and adds two
plecoglossus_altivelis and five tinca_tinca marker hypotheses. In
danio_albolineatus, rsx exposes 30 high-posterior markers while
preserving the Bonferroni-negative call. In notothenia_rossii, 400 rows have
Bayes factor > 10 but none have posterior > 0.9, and the singleton fraction is
52.78%; the posterior/QC layer therefore prevents raw Bayes-factor counts from
being over-read as biological positives.
The generated sex-system inference artifact makes the biological call explicit:
Dataset |
Evidence class |
rsx inference |
Marker basis |
|---|---|---|---|
|
confirmatory |
XX/XY-supported |
47 male-biased strict markers; 49 male-biased high-posterior markers |
|
confirmatory |
XX/XY-supported |
6 male-biased strict markers; 11 high-posterior markers for validation |
|
hypothesis |
putative W-linked marker hypothesis |
30 high-posterior markers without Bonferroni reclassification; 29 are female-biased |
|
negative |
no sex-linked marker call |
400 Bayes-factor-only rows, 0 high-posterior rows, 52.78% singleton retained markers |
1.4.8 Biological inference role of extended features¶
The extended rsx features map to biological interpretation tasks rather than standing alone as software claims:
Inference task |
rsx feature |
Biological use |
|---|---|---|
Sequencing QC |
|
choose depth thresholds and flag outlier libraries |
Marker discovery |
|
test sex-biased marker presence/absence |
Sex-linkage posterior |
FDR, Bayes posterior |
identify putative Y-linked/W-linked marker hypotheses below strict Bonferroni calls |
Candidate validation |
|
inspect marker depths in individuals and outliers |
Locus localization |
|
test whether candidates cluster on a sex chromosome |
Cross-run synthesis |
|
combine lanes/panels and reduce redundant error variants |
Reproducible analysis |
Python bindings, C API |
integrate outputs into notebooks and workflow languages |
The bounded-memory implementation supports inference by allowing the full marker table to be analyzed without downsampling. That matters for RAD-seq sex-determination studies because rare candidate markers are sensitive to minimum-depth thresholds, sequencing-error duplication, and sample-level library effects.
1.4.9 Novelty claim¶
For the software paper, the novelty claim is the combination of:
Contribution |
Biological/inference value |
|---|---|
Drop-in RADSex compatibility |
compare directly with prior RADSex biological results |
Bounded-memory full-table algorithms |
avoid downsampling rare marker candidates |
Additional statistical modes |
compare strict tests, FDR, and posterior sex-linkage calls on the same counts |
Formal mathematical provenance |
make erfc p-values, sparse medians, and streaming PCA auditable |
Workflow bindings |
keep downstream biological analysis in Python/R-style workflows |
DataFrame-native MarkerTableSource |
run any analysis command directly on an in-memory DataFrame without exporting a markers TSV; spill to Parquet only when a Beissinger-style working-set estimate predicts the data will not fit in RAM |
Prior-sensitivity sweep |
report Bayesian call counts across the full (pi, psex) grid per dataset so the prior choice becomes auditable |
1.4.10 Validation layers¶
Paper-readiness validation is split by claim:
Claim |
Validation artifact |
|---|---|
Mathematical identities |
|
Numerical/statistical routines |
|
Command behavior |
|
Literature speedup |
|
Bayesian sex-linkage evidence |
|
Prior sensitivity of the Bayesian layer |
|
Extended mode effects |
|
Depth stability of the call layers |
|
Biological unlock summary |
|
Python workflow surface |
|
|
|
|
|
The math scripts verify the erfc form of the one-degree chi-squared p-value, sparse median indexing, and streaming Gram/PCA equivalence. The precision tests cover known chi-squared values, p-value monotonicity and symmetry, group-count consistency, Fisher’s exact test, the G-test, Benjamini-Hochberg adjustment, Bayes factors, posterior symmetry, and empirical-Bayes EM convergence.
Recent validation runs:
pixi run deriveinscripts/: passed the chi-squared erfc derivation, critical-value checks, and f32/f64 relative-error checks.pixi run python sympy/sparse_median_proof.py: passed sparse-median examples and complexity check.pixi run python sympy/tucker_covariance_proof.py: passed streaming Gram/SVD equivalence on a sparse matrix.Remote
cargo test --workspaceonelja: 60 core unit tests, 11 pipeline integration tests, and 13 precision tests passed.Remote
pixi run -e python test-pythononelja: 18 Python binding tests passed after a release extension build.
1.4.11 Python binding feature evidence¶
The pyrsx bindings expose the same Rust core through Python calls for
workflow notebooks and downstream analysis code. The literature binding run
uses the downloaded RADSex marker tables and popmaps rather than synthetic
test fixtures, and now covers all four panel datasets rather than the
single representative case shipped earlier:
pixi run -e paper run-literature-bindings
On Slurm:
sbatch --export=ALL,RSX_REPO=$PWD \
--array=0-3 benchmarks/slurm/literature_bindings_features.sbatch
pixi run -e benchmark merge-literature-binding-slurm-results
The Pixi task writes benchmarks/results/literature_binding_results.csv
aggregated across the four datasets. The Slurm array writes per-dataset shards
under benchmarks/results/slurm/, and the merge task refreshes the same
aggregated CSV.
It exercises pyrsx.freq, pyrsx.depth,
pyrsx.distrib(test’fisher’, correction=’fdr’)=, pyrsx.pca, and
pyrsx.signif(bayes=True) on each published RADSex workflow dataset,
giving the paper an ease-of-use claim backed by real inputs measured
end-to-end on four datasets, not one.
Aggregated tracked timings per dataset at min_depth=10 (latest tracked
literature_binding_results.csv run):
Dataset |
|
|
|
|
|
|---|---|---|---|---|---|
|
1.42 s |
19.42 s |
1.46 s |
15.42 s |
32.60 s |
|
0.79 s |
5.61 s |
0.45 s |
4.66 s |
9.10 s |
|
0.42 s |
0.65 s |
0.39 s |
4.83 s |
8.78 s |
|
0.29 s |
0.36 s |
0.30 s |
2.92 s |
5.65 s |
The pyrsx surface is the same Rust core; the per-dataset spread mirrors the marker-table sizes (about 3.0M, 0.4M, 0.2M, and 0.07M retained markers respectively).
1.4.11.1 From-DataFrame Arrow path: the MarkerTableSource layer¶
The high-level MarkerTable class accepts both a file path and any
narwhals-compatible DataFrame. The DataFrame entry points (MarkerTable. from_dataframe and the .freq=/.depth=/=.distrib=/=.signif=/=.triage=/
.pca methods on it) route through a new MarkerTableSource layer in
radsex-core/src/source/ rather than materialising a markers TSV behind
the user’s back.
A short specification record is tracked at
docs/orgmode/reference/arrow-input-for-high-level-api.org. The
implementation has three variants:
Variant |
Where the markers live |
When it is chosen |
|---|---|---|
|
mmap’d TSV on disk |
the CLI and any |
|
|
|
|
a hidden temp file written in Parquet |
|
The three variants share the same MarkerStream trait, and every analysis
command is generic over S: MarkerStream, so the spill choice is purely
a memory question; the per-marker results are bytes-identical regardless
of which variant the caller picked. The parity test in
radsex-core/src/source/tests.rs asserts this for freq on a shared
fixture and the Python pytest test_marker_table_from_arrow.py asserts
the per-dataset results match the file path through MarkerTable.
1.4.11.2 Beissinger-style spill estimator¶
The spill decision uses a working-set estimator drawn from the RAD-seq
literature (Beissinger et al. 2013 on marker density and read depth for
GBS-style genotyping, plus empirical overhead factors reported by
TASSEL-GBS, ipyrad, and Torkamaneh et al. 2016). The prediction lives in
radsex-core/src/source/estimator.rs and combines four factors:
estimated_bytes = n_samples
* m_markers_observed
* bytes_per_cell (2 bytes per u16 depth)
* overhead_factor (default 6.0)
* command_multiplier (1.3 freq, 1.5 distrib/pca,
1.8 depth, 2.0 signif/triage)
The available-RAM probe reads /proc/meminfo for MemAvailable and the
default policy spills when the prediction exceeds 55% of that figure.
Operators can override the threshold through RSX_SPILL_BYTES=<N> and
the variant choice through RSX_FORCE_SPILL=1. The 6x overhead floor and
the per-command multipliers come from the rsx working-set audit reported
in the source-module docstring; the paper claim is therefore a defensible
heuristic backed by the cited literature rather than a magic constant.
1.4.11.3 Architectural cost vs the legacy TSV streamer¶
The MarkerTableSource benchmark times the same three commands (freq,
depth, triage) over each literature dataset via three constructions:
MarkerTable.from_path(...) (the legacy mmap’d TSV), from_dataframe(...)
in-memory Arrow, and from_dataframe(...) with RSX_FORCE_SPILL=1 to
exercise the Parquet path.
pixi run -e paper python benchmarks/bench_marker_source.py
Tracked wall-time in benchmarks/results/marker_source_paths.csv (4 datasets
x 3 sources x 3 commands, min_depth=10):
Dataset |
Command |
Path (s) |
In-memory Arrow (s) |
Spilled Parquet (s) |
Arrow slowdown |
Parquet slowdown |
|---|---|---|---|---|---|---|
|
freq |
4.54 |
38.12 |
68.09 |
8.4x |
15.0x |
|
depth |
17.26 |
44.16 |
67.55 |
2.6x |
3.9x |
|
triage |
30.95 |
67.13 |
111.14 |
2.2x |
3.6x |
|
freq |
1.62 |
19.28 |
29.08 |
11.9x |
17.9x |
|
depth |
5.13 |
20.80 |
30.14 |
4.1x |
5.9x |
|
triage |
9.22 |
29.11 |
53.88 |
3.2x |
5.8x |
|
freq |
1.71 |
20.10 |
26.85 |
11.8x |
15.7x |
|
depth |
0.90 |
21.36 |
26.94 |
23.7x |
29.9x |
|
triage |
13.41 |
30.75 |
50.27 |
2.3x |
3.7x |
|
freq |
1.29 |
13.03 |
19.71 |
10.1x |
15.3x |
|
depth |
0.56 |
13.87 |
20.45 |
24.7x |
36.4x |
|
triage |
8.84 |
18.95 |
34.82 |
2.1x |
3.9x |
The qualitative shape:
The path-based mmap streamer keeps its existing speed; nothing in the refactor regressed the CLI path.
The in-memory Arrow path is 2-25x slower than the file path for the same command on the same data. The biggest gaps appear on the smallest-output commands (
freq,depth) where the per-marker IPC decoding overhead dominates; ontriagethe gap shrinks to ~2-3x because the per-marker statistical work amortises the format cost.The Parquet spill path adds another ~1.5-2x over the in-memory variant (the spill-then-stream pass). The Beissinger estimator only chooses it when the working-set prediction exceeds the available-RAM threshold, so the user pays for the spill only when the alternative is exceeding RAM.
The numbers support a paper claim that the DataFrame surface gives an
ergonomic notebook/workflow API for moderate-sized data without
regressing the CLI’s streaming path, and that the spill-vs-in-memory
choice is automatic but explicitly overridable via
RSX_FORCE_SPILL / RSX_SPILL_BYTES. Heavy production analysis on the
largest datasets is still cheapest through the path-based CLI.
1.5 Raw timing data (seconds, median of 3 runs)¶
Scale |
Command |
C++ (s) |
Rust (s) |
Speedup |
|---|---|---|---|---|
small |
freq |
0.016 |
0.004 |
4.0x |
small |
distrib |
0.015 |
0.003 |
5.0x |
small |
signif |
0.015 |
0.003 |
5.0x |
small |
depth |
0.016 |
0.003 |
5.3x |
small |
subset |
0.016 |
0.003 |
5.3x |
small |
map |
0.042 |
0.021 |
2.0x |
small |
process |
0.016 |
0.009 |
1.8x |
medium |
freq |
0.015 |
0.006 |
2.5x |
medium |
distrib |
0.019 |
0.007 |
2.7x |
medium |
signif |
0.021 |
0.013 |
1.6x |
medium |
depth |
0.025 |
0.005 |
5.0x |
medium |
subset |
0.026 |
0.014 |
1.9x |
medium |
map |
0.407 |
0.228 |
1.8x |
medium |
process |
0.058 |
0.038 |
1.5x |
large |
freq |
0.114 |
0.044 |
2.6x |
large |
distrib |
0.138 |
0.057 |
2.4x |
large |
signif |
0.189 |
0.137 |
1.4x |
large |
depth |
0.240 |
0.067 |
3.6x |
large |
subset |
0.170 |
0.118 |
1.4x |
1.6 Why Rust is faster¶
1.6.1 Inline for_each parser (biggest win)¶
The C++ version uses a producer-consumer pattern: one thread parses the
markers table into a std::queue<Marker> protected by a mutex, another
thread processes markers from the queue with busy-wait polling
(sleep(10us) when empty).
rsx-rs originally copied this pattern using crossbeam channels, but
profiling showed ~50% of CPU time was spent on channel overhead and marker
cloning. rsx-rs uses an inline for_each callback that reuses a single
Marker struct with zero allocation per marker. The marker is passed by
reference to the callback, reset in-place between iterations.
1.6.2 Bitset popcount for group counts (eliminated HashMap entirely)¶
The C++ stores group counts in std::unordered_map<string, uint>, hashing
group name strings for every marker field. rsx-rs replaces this with a
BitsetRow (1 bit per individual) and pre-computed GroupMask bitmasks.
Group count = popcount(marker_bits & group_mask) – a single CPU
instruction per 64 individuals, with zero hashing.
For 200 individuals: 5.0 ns per group count vs ~200 ns for a HashMap lookup.
1.6.3 erfc identity for chi-squared CDF (SymPy-derived, 5.2x faster)¶
The chi-squared p-value for df=1 simplifies exactly to
p = erfc(sqrt(chi2/2)). This is proven via the identity
P(1/2, x) = erf(sqrt(x)) (DLMF 8.2.1). The derivation script is in
scripts/sympy/chi2_cdf_derivation.py.
This replaces the full regularized gamma function (statrs crate, 562 ns)
with a single libm::erfc call (108 ns). A Sollya-generated minimax
polynomial (scripts/sollya/erfc_minimax.sollya) can further replace
libm::erfc for GPU kernels.
1.6.4 mmap I/O (zero-copy file access)¶
The markers table is memory-mapped via memmap2, eliminating buffered
read syscalls. The kernel manages page faults transparently. Combined
with the inline for_each parser, this achieves near-memcpy throughput.
1.6.5 Rayon for process command¶
The process command reads all FASTQ files and counts sequences. C++ uses
a manual thread pool with mutex-protected work-stealing. rsx-rs uses
rayon’s work-stealing thread pool with per-thread AHashMap accumulators,
merged at the end. This avoids mutex contention during file processing.
1.6.6 minimap2 for map command¶
rsx-rs uses minimap2 (via Rust bindings) instead of BWA-MEM. minimap2 is
the modern successor with better index construction and alignment speed for
short reads. The sr (short-read) preset matches BWA-MEM behavior.
1.6.7 Zero-copy field parsing¶
The table parser uses push_str into pre-allocated strings rather than
creating new String objects per field. Combined with fast_parse_u16 for
integer fields (matching the C++ fast_stoi approach), this eliminates
allocation in the inner parsing loop.
1.7 Reproducing¶
# Use the benchmark environment
pixi install -e benchmark
# Optional: regenerate equivalent synthetic data without overwriting
# the tracked manuscript reference data.
pixi run -e benchmark generate-data
# Build C++ radsex (requires patching for GCC 15)
cd /path/to/radsex && make -j4
# Run benchmarks against the tracked reference data.
cd /path/to/rsx-rs
CPP_RADSEX=/path/to/radsex/bin/radsex pixi run -e benchmark run-benchmarks
# Print summary
pixi run -e benchmark plot-benchmarks
# Validate and prepare the DOI-deposition package.
pixi run -e benchmark test-package
pixi run -e benchmark package-figshare
By default the benchmark harness reads benchmarks/data/ and writes
benchmarks/results/benchmark_results.csv. Set BENCH_DATA or
BENCH_RESULTS only when using alternate data or result locations. The
tracked CSV and repro/benchmarks.org are the reference inputs for the
manuscript numbers.
1.8 Golden file compatibility¶
Despite the performance differences, rsx-rs produces byte-identical output to C++ radsex for all downstream commands (freq, depth, distrib,
signif, subset) when groups are specified explicitly with -G M,F. The
Cg float formatter matches C++’s default %g output (6 significant
digits). The map command uses minimap2 instead of BWA-MEM so alignment
results differ, but the output format is identical.