Performance Benchmarks ¶

Author:

Rohit Goswami

Date:

2026-04-06

1 Performance Benchmarks ¶

Comparison of rsx-rs (Rust) vs the original RADSex (C++11) across three dataset scales. All benchmarks are median of 3 runs on the same machine.

1.1 Test environment ¶

CPU: AMD Threadripper (single-socket, benchmarks use 4 threads for process)
OS: Arch Linux 6.19.11
C++ compiler: GCC 15.2.1, -O2 -std=c++11
Rust compiler: rustc 1.85+, --release (LTO thin)
C++ radsex: v1.2.0 (with <cstdint> fix for GCC 15)
Rust rsx-rs: v0.1.0

1.2 Dataset scales ¶

Scale	Individuals	Markers	Reads/ind	Description
Small	10	1,000	500	Quick validation
Medium	20	10,000	2,000	Typical small study
Large	40	100,000	5,000	Realistic RAD-seq size

Synthetic data generated with benchmarks/generate_data.py: 10% male-biased markers, 10% female-biased, 80% common. FASTQ files gzip-compressed.

1.3 Results ¶

1.3.1 Per-command speedup (Rust / C++)¶

Command	Small (1K)	Medium (10K)	Large (100K)	Average
process	1.8x	1.5x	–	1.7x
freq	4.0x	2.5x	2.6x	3.0x
depth	5.3x	5.0x	3.6x	4.6x
distrib	5.0x	2.7x	2.4x	3.4x
signif	5.0x	1.6x	1.4x	2.7x
subset	5.3x	1.9x	1.4x	2.9x
map	2.0x	1.8x	–	1.9x

signif/subset large-scale slowdown vs previous: two-pass streaming reads the file twice for Bonferroni correction. The trade-off is bounded O(n_individuals) memory instead of O(n_markers) accumulation.

1.3.2 Per-scale average speedup ¶

Scale	Average speedup
Small	4.1x
Medium	2.4x
Large	2.3x

1.3.3 Overall ¶

Rust is 2.0x faster across all 19 benchmarks (1.558s vs 0.780s total). All commands operate in bounded memory (< 500MB for any input size).

1.3.4 Microbenchmarks (criterion)¶

Operation	Time	Notes
chi_squared_yates	4.6 ns	Pure f64 arithmetic
p_association(erfc)	108 ns	libm::erfc, was 562 ns (statrs)
bitset popcount (40)	2.3 ns	1 u64 word
bitset popcount (200)	5.0 ns	4 u64 words
bitset popcount (1000)	18 ns	16 u64 words
fast_parse_u16	4.5 ns	Integer field parsing
Cg float format	193 ns	C++ %g compatible output

1.4 Literature-derived benchmark runs ¶

The checked-in synthetic suite is small enough for repeatable repository verification. Paper-facing biological benchmark runs download the public datasets used by the RADSex literature workflow and run rsx against those FASTQ files.

The tracked manifest benchmarks/literature_datasets.tsv contains:

15 public RADSex workflow datasets, BioProject PRJNA548074, with the published RADSex command set run at minimum depths 1, 2, 5, and 10.
The public oryzias_latipes medaka example, BioProject PRJNA253959, with the RADSex example genome and medaka-specific thresholds.

The RADSex paper workflow describes the literature panel and its parameters in its public data/info.tsv and config.yaml. The medaka example gives a published benchmark target for process, distrib, signif, and map on a public dataset. The executable benchmark task queries NCBI SRA metadata, downloads ENA FASTQ files with MD5 validation (with --download-method fastq-dump available as a fallback), runs the command set, and writes measured rows to benchmarks/results/literature_benchmark_results.csv:

pixi run -e benchmark run-literature-benchmarks

For a single public literature dataset:

pixi run -e benchmark run-literature-medaka

1.4.1 Completed biological benchmark panel ¶

The tracked paper-facing panel contains four completed datasets downloaded from the RADSex literature BioProject. These are full FASTQ runs, not synthetic marker tables.

Dataset	Samples	Bases	Markers	Total runtime (s)
`danio_albolineatus`	58	10,200,185,246	29,357,812	1,240.187
`notothenia_rossii`	42	10,468,745,314	11,337,736	1,104.048
`plecoglossus_altivelis`	65	9,924,715,896	7,338,124	1,045.383
`tinca_tinca`	43	11,257,335,754	6,932,387	1,020.880

The paper figures are generated from the tracked CSV files. The runtime figure reports compute time only (process plus downstream analysis); download time is retained in the raw CSV for reproducibility accounting but is not used as a performance claim.

The panel is biologically anchored by the original RADSex results: plecoglossus_altivelis is a previously described XX/XY case with 47 source-paper sex-associated markers at d=10, tinca_tinca is an XX/XY case reported without a previously described system and six markers at d=10, and danio_albolineatus and notothenia_rossii are source-paper null cases with no SD identified at d=10. The Bayesian plots therefore compare strict positive controls, strict negative calls, and posterior-supported sex-linked marker hypotheses under the same published workflow.

pixi run -e benchmark plot-literature-benchmarks

The generated figure and table assets are written to docs/figures/:

literature_runtime_breakdown.svg
literature_compute_breakdown.svg
literature_marker_throughput.svg
literature_downstream_times.svg
literature_bayesian_evidence.svg
literature_prior_sensitivity_heatmap.svg
literature_candidate_recovery.svg
literature_bio_unlocks.svg
literature_mode_candidate_counts.svg
literature_mode_qc.svg
literature_depth_stability.svg
literature_radsex_speedups.svg
literature_benchmark_summary.tsv
literature_speedup_summary.tsv
literature_speedup_pairs.tsv

1.4.2 Literature speed comparison ¶

The same downloaded FASTQ inputs were processed by rsx-rs and by C++ RADSex v1.2.0 built in the Pixi benchmark environment. The C++ checkout requires the standard <cstdint> include fix for modern GCC, but the RADSex command behavior is otherwise unchanged.

Across 56 paired command/dataset/depth timings, on a single AMD Ryzen Threadripper PRO 3955WX workstation (16 threads), rsx-rs is faster in 53 pairs. The geometric-mean speedup is 8.38x when speedup is defined as C++ wall time divided by Rust wall time. The process-stage geometric-mean speedup is 2.77x across the four literature datasets (2.44x to 2.96x). depth, freq, and distrib favor rsx-rs in every paired run, with geometric-mean speedups of 6.91x, 22.5x, and 29.0x, respectively. signif is the exception: rsx-rs is faster in 13 of 16 paired runs, with a 1.25x geometric-mean speedup, because both programs stream the complete table. These are the numbers reported in the paper; the deposited reproducibility archive (https://doi.org/10.5281/zenodo.20531539) regenerates every figure and number, and they scale with the host hardware.

On Slurm clusters, submit per-dataset compute-only comparisons as an array job after the FASTQ files and marker tables have been prepared:

sbatch --export=ALL,RSX_REPO=$PWD benchmarks/slurm/rsx_release_build.sbatch
sbatch --export=ALL,RSX_REPO=$PWD benchmarks/slurm/radsex_cpp_build.sbatch
sbatch --export=ALL,RSX_REPO=$PWD benchmarks/slurm/literature_prepare_dataset.sbatch
sbatch --export=ALL,RSX_REPO=$PWD,RADSEX_BIN=/path/to/radsex \
  benchmarks/slurm/literature_radsex_compare.sbatch
pixi run -e benchmark merge-literature-slurm-results
pixi run -e benchmark plot-literature-benchmarks

1.4.3 Bayesian biological evidence ¶

Bayesian evidence is summarized from the aggregated distrib tables without materializing marker-level multi-GB signif --bayes outputs. The summary is tracked in benchmarks/results/literature_bayesian_evidence.csv and plotted in docs/figures/literature_bayesian_evidence.svg.

The strongest Bayesian signal in the current panel is danio_albolineatus at minimum depth 1: 61,191 markers have posterior P(sex-linked) > 0.5 and 4,297 markers have posterior > 0.9. The same Bonferroni FASTA extraction reports zero significant markers for that dataset, so the Bayesian summary adds sex-linked marker probabilities that are not available from strict thresholded FASTA output alone.

The candidate-recovery figure compares strict Bonferroni FASTA marker counts with posterior-supported markers at the same dataset/depth settings. This grounds the biological-value claim in the published RADSex workflow datasets: the Bayesian layer does not change the null-hypothesis test, but it exposes a sex-linked marker hypothesis set when all-or-nothing FASTA extraction is empty.

1.4.4 Prior sensitivity of the Bayesian triage layer ¶

The hybrid triage layer carries two free parameters: the prior probability pi that any given marker is sex-linked, and p_sex, the expected prevalence in the linked sex. The sensitivity sweep runs rsx triage on every (pi in {0.001, 0.005, 0.01, 0.02, 0.05, 0.1}) x (p_sexin {0.8, 0.85, 0.9, 0.95}) combination across the four literature datasets, producing 96 per-marker triage tables in benchmarks/results/slurm/. The aggregated counts go to benchmarks/results/prior_sensitivity_from_triage.csv and the heatmap to docs/figures/literature_prior_sensitivity_heatmap.svg.

sbatch --export=ALL,RSX_REPO=$PWD \
  benchmarks/slurm/literature_biology_prior_sensitivity.sbatch
pixi run -e benchmark python benchmarks/plot_prior_sensitivity_heatmap.py

The per-dataset spread of the qualifying-marker count (posterior > 0.9) across the 24-cell grid speaks directly to the robustness of the Bayesian call:

Dataset	Min posterior>0.9	Max posterior>0.9	Spread factor
`danio_albolineatus`	1	10,189	10,189
`notothenia_rossii`	0	44	N/A
`plecoglossus_altivelis`	46	87	1.9x
`tinca_tinca`	6	169	28x

The two positive controls behave differently. plecoglossus_altivelis holds within a factor of two across the entire grid, so the Bayesian call is essentially prior-insensitive on that dataset. tinca_tinca varies more (6 to 169 markers across the grid), but the strict-call core remains. danio_albolineatus shows extreme prior sensitivity, with the count jumping from a single marker at (pi=0.001, p_sex=0.95) to ten thousand markers at (pi>=0.05, p_sex<=0.85). notothenia_rossii peaks at 44 markers but is dominated by zeros, consistent with its singleton-heavy retained-marker structure described below.

The figure makes this concrete cell by cell. For paper claims we report all rsx counts at the documented default prior (pi=0.01, p_sex=0.9), note the per-dataset spread, and forward the prior-sensitivity heatmap as supporting evidence that the calls are stable for the plecoglossus_altivelis positive control while explicitly prior-dependent for danio_albolineatus.

1.4.5 Mode effects on published marker tables ¶

The mode-effects analysis runs additional rsx statistical and QC modes through the Python bindings on the four downloaded marker tables at minimum depth 10:

pixi run -e paper analyze-literature-modes
pixi run -e benchmark plot-literature-benchmarks

The tracked result file is benchmarks/results/literature_mode_effects.csv. It records the command, wall time, output rows, marker counts, Bayesian thresholds, frequency-QC summaries, and PCA variance/loadings for each dataset. The generated figures are docs/figures/literature_mode_candidate_counts.svg and docs/figures/literature_mode_qc.svg.

At d=10 the mode-effects run reproduces the source-paper strict calls and shows what the additional modes contribute:

Dataset	Strict chisq/Bonf.	Fisher/FDR	G-test/FDR	Bayes posterior > 0.9	PC1 variance	Singleton fraction
`danio_albolineatus`	0	0	0	30	79.72%	19.96%
`notothenia_rossii`	0	0	0	0	94.93%	52.78%
`plecoglossus_altivelis`	47	49	49	49	93.09%	37.94%
`tinca_tinca`	6	6	7	11	98.14%	7.74%

This table is the paper-facing demonstration of the added analysis modes. The strict path preserves the RADSex interpretation: plecoglossus_altivelis and tinca_tinca are positive controls, while danio_albolineatus and notothenia_rossii remain strict null cases at d=10. Fisher/FDR and G-test/FDR show the sensitivity of count-cell testing to exact and likelihood tests on the same marker table. The Bayesian table gives posterior-supported marker evidence: it recovers 49 high-posterior markers for plecoglossus_altivelis, expands the tinca_tinca follow-up set to 11, and distinguishes the two strict negative datasets because danio_albolineatus has 30 high-posterior candidates whereas notothenia_rossii has none at the same threshold.

The PCA and frequency columns are QC outputs rather than marker-discovery claims. PC1 explains 79.7–98.1% of depth variation across the four datasets, while the mean sex-loading deltas are small relative to the dominant depth axis. This indicates that the leading sample-level variation is library/depth structure, not a clean sex-separation axis. The singleton fraction separates the retained marker tables as well: tinca_tinca is low-singleton at d=10 (7.74%), whereas notothenia_rossii remains dominated by singleton retained markers (52.78%), which is a biological-QC warning for candidate follow-up.

1.4.6 Depth stability of the call layers ¶

The mode-effects panel runs at min_depth=10 so it can be compared with the RADSex literature workflow directly. The depth-stability sweep extends the same eight modes to min_depth in {3, 5, 8, 10}, giving 128 rows tracked in benchmarks/results/literature_mode_effects_sweep.csv and summarised per (dataset, depth) in benchmarks/results/literature_depth_stability.csv:

sbatch --export=ALL,RSX_REPO=$PWD \
  benchmarks/slurm/literature_biology_low_depth.sbatch
pixi run -e benchmark python benchmarks/collect_lowdepth_sweep.py
pixi run -e paper python benchmarks/summarise_depth_stability.py

Dataset	min_depth	strict	posterior > 0.9	Bayes factor > 10
`danio_albolineatus`	3	0	690	1,316
`danio_albolineatus`	5	0	143	782
`danio_albolineatus`	8	0	45	301
`danio_albolineatus`	10	0	30	174
`notothenia_rossii`	3	0	2	527
`notothenia_rossii`	5	0	1	486
`notothenia_rossii`	8	0	0	434
`notothenia_rossii`	10	0	0	400
`plecoglossus_altivelis`	3	56	137	5,965
`plecoglossus_altivelis`	5	47	62	2,512
`plecoglossus_altivelis`	8	47	50	905
`plecoglossus_altivelis`	10	47	49	572
`tinca_tinca`	3	7	16	505
`tinca_tinca`	5	6	13	475
`tinca_tinca`	8	6	9	438
`tinca_tinca`	10	6	11	443

The strict call layer is depth-stable on the positive controls: the plecoglossus_altivelis strict count holds at 47 from d=5 onward, and tinca_tinca drops by a single marker between d=3 and d=5. The Bayesian layers are far more depth-sensitive. The plecoglossus_altivelis Bayes-factor > 10 count contracts ten-fold (5,965 to 572) as min_depth moves from 3 to 10, and the posterior > 0.9 count contracts almost three-fold (137 to 49). The figure docs/figures/literature_depth_stability.svg plots the same data on a faceted log axis.

The two strict-null datasets behave differently across depths. notothenia_rossii stays strict-zero at every depth and its posterior > 0.9 count never exceeds 2, which corroborates the negative call from the d=10 mode-effects table. danio_albolineatus is strict-zero everywhere but its Bayesian counts contract sharply from d=3 to d=10, retaining 45 high-posterior and 301 Bayes-factor markers at d=8 and 30 high-posterior plus 174 Bayes-factor markers at d=10. This is the same dataset that shows the highest prior sensitivity in the heatmap above, so the paper should present it as a prior- and depth-sensitive hypothesis set rather than a stable sex-determination call.

The depth-stability table is what motivates citing the plecoglossus_altivelis strict numbers at the standard RADSex d=10 threshold while pairing the Bayesian numbers with their depth and prior context.

1.4.7 Biological unlock summary ¶

The marker-level unlock summary joins strict signif output with signif --bayes output from the same d=10 marker tables. It is tracked in benchmarks/results/literature_bio_unlocks.csv, and the top-ranked marker rows are tracked in benchmarks/results/literature_candidate_triage.csv. Dataset-level sex-system calls are tracked in benchmarks/results/literature_sex_system_inference.csv. The generated figure is docs/figures/literature_bio_unlocks.svg.

Dataset	Strict + posterior	Posterior only	Bayes-factor only	Interpretation
`danio_albolineatus`	0	30	158	posterior-supported markers below Bonferroni
`notothenia_rossii`	0	0	400	Bayes-factor evidence rejected by posterior threshold
`plecoglossus_altivelis`	47	2	523	RADSex signal with posterior-supported additions
`tinca_tinca`	6	5	432	RADSex signal with posterior-supported additions

This is the clearest biological distinction unlocked by rsx beyond the original RADSex thresholded FASTA endpoint. In the two positive controls, posterior evidence recovers every strict d=10 marker and adds two plecoglossus_altivelis and five tinca_tinca marker hypotheses. In danio_albolineatus, rsx exposes 30 high-posterior markers while preserving the Bonferroni-negative call. In notothenia_rossii, 400 rows have Bayes factor > 10 but none have posterior > 0.9, and the singleton fraction is 52.78%; the posterior/QC layer therefore prevents raw Bayes-factor counts from being over-read as biological positives.

The generated sex-system inference artifact makes the biological call explicit:

Dataset	Evidence class	rsx inference	Marker basis
`plecoglossus_altivelis`	confirmatory	XX/XY-supported	47 male-biased strict markers; 49 male-biased high-posterior markers
`tinca_tinca`	confirmatory	XX/XY-supported	6 male-biased strict markers; 11 high-posterior markers for validation
`danio_albolineatus`	hypothesis	putative W-linked marker hypothesis	30 high-posterior markers without Bonferroni reclassification; 29 are female-biased
`notothenia_rossii`	negative	no sex-linked marker call	400 Bayes-factor-only rows, 0 high-posterior rows, 52.78% singleton retained markers

1.4.8 Biological inference role of extended features ¶

The extended rsx features map to biological interpretation tasks rather than standing alone as software claims:

Inference task	rsx feature	Biological use
Sequencing QC	`depth`, `freq`, `pca`	choose depth thresholds and flag outlier libraries
Marker discovery	`distrib`, `signif`	test sex-biased marker presence/absence
Sex-linkage posterior	FDR, Bayes posterior	identify putative Y-linked/W-linked marker hypotheses below strict Bonferroni calls
Candidate validation	`subset`	inspect marker depths in individuals and outliers
Locus localization	`map`	test whether candidates cluster on a sex chromosome
Cross-run synthesis	`merge`, k-mer deduplication	combine lanes/panels and reduce redundant error variants
Reproducible analysis	Python bindings, C API	integrate outputs into notebooks and workflow languages

The bounded-memory implementation supports inference by allowing the full marker table to be analyzed without downsampling. That matters for RAD-seq sex-determination studies because rare candidate markers are sensitive to minimum-depth thresholds, sequencing-error duplication, and sample-level library effects.

1.4.9 Novelty claim ¶

For the software paper, the novelty claim is the combination of:

Contribution	Biological/inference value
Drop-in RADSex compatibility	compare directly with prior RADSex biological results
Bounded-memory full-table algorithms	avoid downsampling rare marker candidates
Additional statistical modes	compare strict tests, FDR, and posterior sex-linkage calls on the same counts
Formal mathematical provenance	make erfc p-values, sparse medians, and streaming PCA auditable
Workflow bindings	keep downstream biological analysis in Python/R-style workflows
DataFrame-native MarkerTableSource	run any analysis command directly on an in-memory DataFrame without exporting a markers TSV; spill to Parquet only when a Beissinger-style working-set estimate predicts the data will not fit in RAM
Prior-sensitivity sweep	report Bayesian call counts across the full (pi, p_sex) grid per dataset so the prior choice becomes auditable

1.4.10 Validation layers ¶

Paper-readiness validation is split by claim:

Claim	Validation artifact
Mathematical identities	`scripts/sympy/chi2_cdf_derivation.py`, `sparse_median_proof.py`, `tucker_covariance_proof.py`
Numerical/statistical routines	`radsex-core/tests/test_precision.rs` and `radsex-core/src/stats.rs` unit tests
Command behavior	`radsex-core/tests/test_pipeline.rs` integration tests
Literature speedup	`benchmarks/results/literature_speed_comparison.csv` and Slurm scripts
Bayesian sex-linkage evidence	`benchmarks/results/literature_bayesian_evidence.csv` and candidate-recovery figures
Prior sensitivity of the Bayesian layer	`benchmarks/results/slurm/triage__pi_psex*.tsv` (96 cells), `benchmarks/results/prior_sensitivity_from_triage.csv`, `docs/figures/literature_prior_sensitivity_heatmap.svg`
Extended mode effects	`benchmarks/results/literature_mode_effects.csv` and mode-effect figures
Depth stability of the call layers	`benchmarks/results/literature_mode_effects_sweep.csv`, `benchmarks/results/literature_depth_stability.csv`, `docs/figures/literature_depth_stability.svg`
Biological unlock summary	`benchmarks/results/literature_bio_unlocks.csv`, `literature_candidate_triage.csv`, `literature_sex_system_inference.csv`, and unlock figure
Python workflow surface	`benchmarks/results/literature_binding_results.csv` (4 datasets) and `rsx-python/tests/test_pyrsx.py`
`MarkerTableSource` parity (file / Arrow / Parquet)	`radsex-core/src/source/tests.rs` and `rsx-python/tests/test_marker_table_from_arrow.py` (including the `RSX_FORCE_SPILL=1` toggle test)
`MarkerTableSource` architectural cost	`benchmarks/results/marker_source_paths.csv` via `benchmarks/bench_marker_source.py`

The math scripts verify the erfc form of the one-degree chi-squared p-value, sparse median indexing, and streaming Gram/PCA equivalence. The precision tests cover known chi-squared values, p-value monotonicity and symmetry, group-count consistency, Fisher’s exact test, the G-test, Benjamini-Hochberg adjustment, Bayes factors, posterior symmetry, and empirical-Bayes EM convergence.

Recent validation runs:

pixi run derive in scripts/: passed the chi-squared erfc derivation, critical-value checks, and f32/f64 relative-error checks.
pixi run python sympy/sparse_median_proof.py: passed sparse-median examples and complexity check.
pixi run python sympy/tucker_covariance_proof.py: passed streaming Gram/SVD equivalence on a sparse matrix.
Remote cargo test --workspace on elja: 60 core unit tests, 11 pipeline integration tests, and 13 precision tests passed.
Remote pixi run -e python test-python on elja: 18 Python binding tests passed after a release extension build.

1.4.11 Python binding feature evidence ¶

The pyrsx bindings expose the same Rust core through Python calls for workflow notebooks and downstream analysis code. The literature binding run uses the downloaded RADSex marker tables and popmaps rather than synthetic test fixtures, and now covers all four panel datasets rather than the single representative case shipped earlier:

pixi run -e paper run-literature-bindings

On Slurm:

sbatch --export=ALL,RSX_REPO=$PWD \
  --array=0-3 benchmarks/slurm/literature_bindings_features.sbatch
pixi run -e benchmark merge-literature-binding-slurm-results

The Pixi task writes benchmarks/results/literature_binding_results.csv aggregated across the four datasets. The Slurm array writes per-dataset shards under benchmarks/results/slurm/, and the merge task refreshes the same aggregated CSV. It exercises pyrsx.freq, pyrsx.depth, pyrsx.distrib(test’fisher’, correction=’fdr’)=, pyrsx.pca, and pyrsx.signif(bayes=True) on each published RADSex workflow dataset, giving the paper an ease-of-use claim backed by real inputs measured end-to-end on four datasets, not one.

Aggregated tracked timings per dataset at min_depth=10 (latest tracked literature_binding_results.csv run):

Dataset	`pyrsx.freq`	`pyrsx.depth`	`pyrsx.distrib` (Fisher/FDR)	`pyrsx.pca`	`pyrsx.signif(bayes=True)`
`danio_albolineatus`	1.42 s	19.42 s	1.46 s	15.42 s	32.60 s
`notothenia_rossii`	0.79 s	5.61 s	0.45 s	4.66 s	9.10 s
`plecoglossus_altivelis`	0.42 s	0.65 s	0.39 s	4.83 s	8.78 s
`tinca_tinca`	0.29 s	0.36 s	0.30 s	2.92 s	5.65 s

The pyrsx surface is the same Rust core; the per-dataset spread mirrors the marker-table sizes (about 3.0M, 0.4M, 0.2M, and 0.07M retained markers respectively).

1.4.11.1 From-DataFrame Arrow path: the `MarkerTableSource` layer ¶

The high-level MarkerTable class accepts both a file path and any narwhals-compatible DataFrame. The DataFrame entry points (MarkerTable. from_dataframe and the .freq=/.depth=/=.distrib=/=.signif=/=.triage=/ .pca methods on it) route through a new MarkerTableSource layer in radsex-core/src/source/ rather than materialising a markers TSV behind the user’s back.

A short specification record is tracked at docs/orgmode/reference/arrow-input-for-high-level-api.org. The implementation has three variants:

Variant	Where the markers live	When it is chosen
`MarkersTableStream` (file)	mmap’d TSV on disk	the CLI and any `MarkerTable.from_path(...)` call
`ArrowMarkerSource` (in-memory)	`Vec<RecordBatch>` held in RAM	`MarkerTable.from_dataframe(...)` when the Beissinger working-set estimator says the data comfortably fits
`ParquetMarkerSource` (spilled)	a hidden temp file written in Parquet	`MarkerTable.from_dataframe(...)` when the estimate exceeds the spill threshold, or `RSX_FORCE_SPILL=1`

The three variants share the same MarkerStream trait, and every analysis command is generic over S: MarkerStream, so the spill choice is purely a memory question; the per-marker results are bytes-identical regardless of which variant the caller picked. The parity test in radsex-core/src/source/tests.rs asserts this for freq on a shared fixture and the Python pytest test_marker_table_from_arrow.py asserts the per-dataset results match the file path through MarkerTable.

1.4.11.2 Beissinger-style spill estimator ¶

The spill decision uses a working-set estimator drawn from the RAD-seq literature (Beissinger et al. 2013 on marker density and read depth for GBS-style genotyping, plus empirical overhead factors reported by TASSEL-GBS, ipyrad, and Torkamaneh et al. 2016). The prediction lives in radsex-core/src/source/estimator.rs and combines four factors:

estimated_bytes = n_samples
                * m_markers_observed
                * bytes_per_cell       (2 bytes per u16 depth)
                * overhead_factor      (default 6.0)
                * command_multiplier   (1.3 freq, 1.5 distrib/pca,
                                        1.8 depth, 2.0 signif/triage)

The available-RAM probe reads /proc/meminfo for MemAvailable and the default policy spills when the prediction exceeds 55% of that figure. Operators can override the threshold through RSX_SPILL_BYTES=<N> and the variant choice through RSX_FORCE_SPILL=1. The 6x overhead floor and the per-command multipliers come from the rsx working-set audit reported in the source-module docstring; the paper claim is therefore a defensible heuristic backed by the cited literature rather than a magic constant.

1.4.11.3 Architectural cost vs the legacy TSV streamer ¶

The MarkerTableSource benchmark times the same three commands (freq, depth, triage) over each literature dataset via three constructions: MarkerTable.from_path(...) (the legacy mmap’d TSV), from_dataframe(...) in-memory Arrow, and from_dataframe(...) with RSX_FORCE_SPILL=1 to exercise the Parquet path.

pixi run -e paper python benchmarks/bench_marker_source.py

Tracked wall-time in benchmarks/results/marker_source_paths.csv (4 datasets x 3 sources x 3 commands, min_depth=10):

Dataset	Command	Path (s)	In-memory Arrow (s)	Spilled Parquet (s)	Arrow slowdown	Parquet slowdown
`danio_albolineatus`	freq	4.54	38.12	68.09	8.4x	15.0x
`danio_albolineatus`	depth	17.26	44.16	67.55	2.6x	3.9x
`danio_albolineatus`	triage	30.95	67.13	111.14	2.2x	3.6x
`notothenia_rossii`	freq	1.62	19.28	29.08	11.9x	17.9x
`notothenia_rossii`	depth	5.13	20.80	30.14	4.1x	5.9x
`notothenia_rossii`	triage	9.22	29.11	53.88	3.2x	5.8x
`plecoglossus_altivelis`	freq	1.71	20.10	26.85	11.8x	15.7x
`plecoglossus_altivelis`	depth	0.90	21.36	26.94	23.7x	29.9x
`plecoglossus_altivelis`	triage	13.41	30.75	50.27	2.3x	3.7x
`tinca_tinca`	freq	1.29	13.03	19.71	10.1x	15.3x
`tinca_tinca`	depth	0.56	13.87	20.45	24.7x	36.4x
`tinca_tinca`	triage	8.84	18.95	34.82	2.1x	3.9x

The qualitative shape:

The path-based mmap streamer keeps its existing speed; nothing in the refactor regressed the CLI path.
The in-memory Arrow path is 2-25x slower than the file path for the same command on the same data. The biggest gaps appear on the smallest-output commands (freq, depth) where the per-marker IPC decoding overhead dominates; on triage the gap shrinks to ~2-3x because the per-marker statistical work amortises the format cost.
The Parquet spill path adds another ~1.5-2x over the in-memory variant (the spill-then-stream pass). The Beissinger estimator only chooses it when the working-set prediction exceeds the available-RAM threshold, so the user pays for the spill only when the alternative is exceeding RAM.

The numbers support a paper claim that the DataFrame surface gives an ergonomic notebook/workflow API for moderate-sized data without regressing the CLI’s streaming path, and that the spill-vs-in-memory choice is automatic but explicitly overridable via RSX_FORCE_SPILL / RSX_SPILL_BYTES. Heavy production analysis on the largest datasets is still cheapest through the path-based CLI.

1.5 Raw timing data (seconds, median of 3 runs)¶

Scale	Command	C++ (s)	Rust (s)	Speedup
small	freq	0.016	0.004	4.0x
small	distrib	0.015	0.003	5.0x
small	signif	0.015	0.003	5.0x
small	depth	0.016	0.003	5.3x
small	subset	0.016	0.003	5.3x
small	map	0.042	0.021	2.0x
small	process	0.016	0.009	1.8x
medium	freq	0.015	0.006	2.5x
medium	distrib	0.019	0.007	2.7x
medium	signif	0.021	0.013	1.6x
medium	depth	0.025	0.005	5.0x
medium	subset	0.026	0.014	1.9x
medium	map	0.407	0.228	1.8x
medium	process	0.058	0.038	1.5x
large	freq	0.114	0.044	2.6x
large	distrib	0.138	0.057	2.4x
large	signif	0.189	0.137	1.4x
large	depth	0.240	0.067	3.6x
large	subset	0.170	0.118	1.4x

1.6 Why Rust is faster ¶

1.6.1 Inline `for_each` parser (biggest win)¶

The C++ version uses a producer-consumer pattern: one thread parses the markers table into a std::queue<Marker> protected by a mutex, another thread processes markers from the queue with busy-wait polling (sleep(10us) when empty).

rsx-rs originally copied this pattern using crossbeam channels, but profiling showed ~50% of CPU time was spent on channel overhead and marker cloning. rsx-rs uses an inline for_each callback that reuses a single Marker struct with zero allocation per marker. The marker is passed by reference to the callback, reset in-place between iterations.

1.6.2 Bitset popcount for group counts (eliminated HashMap entirely)¶

The C++ stores group counts in std::unordered_map<string, uint>, hashing group name strings for every marker field. rsx-rs replaces this with a BitsetRow (1 bit per individual) and pre-computed GroupMask bitmasks. Group count = popcount(marker_bits & group_mask) – a single CPU instruction per 64 individuals, with zero hashing.

For 200 individuals: 5.0 ns per group count vs ~200 ns for a HashMap lookup.

1.6.3 erfc identity for chi-squared CDF (SymPy-derived, 5.2x faster)¶

The chi-squared p-value for df=1 simplifies exactly to p = erfc(sqrt(chi2/2)). This is proven via the identity P(1/2, x) = erf(sqrt(x)) (DLMF 8.2.1). The derivation script is in scripts/sympy/chi2_cdf_derivation.py.

This replaces the full regularized gamma function (statrs crate, 562 ns) with a single libm::erfc call (108 ns). A Sollya-generated minimax polynomial (scripts/sollya/erfc_minimax.sollya) can further replace libm::erfc for GPU kernels.

1.6.4 mmap I/O (zero-copy file access)¶

The markers table is memory-mapped via memmap2, eliminating buffered read syscalls. The kernel manages page faults transparently. Combined with the inline for_each parser, this achieves near-memcpy throughput.

1.6.5 Rayon for `process` command ¶

The process command reads all FASTQ files and counts sequences. C++ uses a manual thread pool with mutex-protected work-stealing. rsx-rs uses rayon’s work-stealing thread pool with per-thread AHashMap accumulators, merged at the end. This avoids mutex contention during file processing.

1.6.6 minimap2 for `map` command ¶

rsx-rs uses minimap2 (via Rust bindings) instead of BWA-MEM. minimap2 is the modern successor with better index construction and alignment speed for short reads. The sr (short-read) preset matches BWA-MEM behavior.

1.6.7 Zero-copy field parsing ¶

The table parser uses push_str into pre-allocated strings rather than creating new String objects per field. Combined with fast_parse_u16 for integer fields (matching the C++ fast_stoi approach), this eliminates allocation in the inner parsing loop.

1.7 Reproducing ¶

# Use the benchmark environment
pixi install -e benchmark

# Optional: regenerate equivalent synthetic data without overwriting
# the tracked manuscript reference data.
pixi run -e benchmark generate-data

# Build C++ radsex (requires patching for GCC 15)
cd /path/to/radsex && make -j4

# Run benchmarks against the tracked reference data.
cd /path/to/rsx-rs
CPP_RADSEX=/path/to/radsex/bin/radsex pixi run -e benchmark run-benchmarks

# Print summary
pixi run -e benchmark plot-benchmarks

# Validate and prepare the DOI-deposition package.
pixi run -e benchmark test-package
pixi run -e benchmark package-figshare

By default the benchmark harness reads benchmarks/data/ and writes benchmarks/results/benchmark_results.csv. Set BENCH_DATA or BENCH_RESULTS only when using alternate data or result locations. The tracked CSV and repro/benchmarks.org are the reference inputs for the manuscript numbers.

1.8 Golden file compatibility ¶

Despite the performance differences, rsx-rs produces byte-identical output to C++ radsex for all downstream commands (freq, depth, distrib, signif, subset) when groups are specified explicitly with -G M,F. The Cg float formatter matches C++’s default %g output (6 significant digits). The map command uses minimap2 instead of BWA-MEM so alignment results differ, but the output format is identical.