Performance Benchmarks

Author:

Rohit Goswami

Date:

2026-04-06

1 Performance Benchmarks

Comparison of rsx-rs (Rust) vs the original RADSex (C++11) across three dataset scales. All benchmarks are median of 3 runs on the same machine.

1.1 Test environment

  • CPU: AMD Threadripper (single-socket, benchmarks use 4 threads for process)

  • OS: Arch Linux 6.19.11

  • C++ compiler: GCC 15.2.1, -O2 -std=c++11

  • Rust compiler: rustc 1.85+, --release (LTO thin)

  • C++ radsex: v1.2.0 (with <cstdint> fix for GCC 15)

  • Rust rsx-rs: v0.1.0

1.2 Dataset scales

Scale

Individuals

Markers

Reads/ind

Description

Small

10

1,000

500

Quick validation

Medium

20

10,000

2,000

Typical small study

Large

40

100,000

5,000

Realistic RAD-seq size

Synthetic data generated with benchmarks/generate_data.py: 10% male-biased markers, 10% female-biased, 80% common. FASTQ files gzip-compressed.

1.3 Results

1.3.1 Per-command speedup (Rust / C++)

Command

Small (1K)

Medium (10K)

Large (100K)

Average

process

1.8x

1.5x

1.7x

freq

4.0x

2.5x

2.6x

3.0x

depth

5.3x

5.0x

3.6x

4.6x

distrib

5.0x

2.7x

2.4x

3.4x

signif

5.0x

1.6x

1.4x

2.7x

subset

5.3x

1.9x

1.4x

2.9x

map

2.0x

1.8x

1.9x

signif/subset large-scale slowdown vs previous: two-pass streaming reads the file twice for Bonferroni correction. The trade-off is bounded O(nindividuals) memory instead of O(nmarkers) accumulation.

1.3.2 Per-scale average speedup

Scale

Average speedup

Small

4.1x

Medium

2.4x

Large

2.3x

1.3.3 Overall

Rust is 2.0x faster across all 19 benchmarks (1.558s vs 0.780s total). All commands operate in bounded memory (< 500MB for any input size).

1.3.4 Microbenchmarks (criterion)

Operation

Time

Notes

chisquaredyates

4.6 ns

Pure f64 arithmetic

passociation(erfc)

108 ns

libm::erfc, was 562 ns (statrs)

bitset popcount (40)

2.3 ns

1 u64 word

bitset popcount (200)

5.0 ns

4 u64 words

bitset popcount (1000)

18 ns

16 u64 words

fastparseu16

4.5 ns

Integer field parsing

Cg float format

193 ns

C++ %g compatible output

1.4 Literature-derived benchmark runs

The checked-in synthetic suite is small enough for repeatable repository verification. Paper-facing biological benchmark runs download the public datasets used by the RADSex literature workflow and run rsx against those FASTQ files.

The tracked manifest benchmarks/literature_datasets.tsv contains:

  • 15 public RADSex workflow datasets, BioProject PRJNA548074, with the published RADSex command set run at minimum depths 1, 2, 5, and 10.

  • The public oryzias_latipes medaka example, BioProject PRJNA253959, with the RADSex example genome and medaka-specific thresholds.

The RADSex paper workflow describes the literature panel and its parameters in its public data/info.tsv and config.yaml. The medaka example gives a published benchmark target for process, distrib, signif, and map on a public dataset. The executable benchmark task queries NCBI SRA metadata, downloads ENA FASTQ files with MD5 validation (with --download-method fastq-dump available as a fallback), runs the command set, and writes measured rows to benchmarks/results/literature_benchmark_results.csv:

pixi run -e benchmark run-literature-benchmarks

For a single public literature dataset:

pixi run -e benchmark run-literature-medaka

1.4.1 Completed biological benchmark panel

The tracked paper-facing panel contains four completed datasets downloaded from the RADSex literature BioProject. These are full FASTQ runs, not synthetic marker tables.

Dataset

Samples

Bases

Markers

Total runtime (s)

danio_albolineatus

58

10,200,185,246

29,357,812

1,240.187

notothenia_rossii

42

10,468,745,314

11,337,736

1,104.048

plecoglossus_altivelis

65

9,924,715,896

7,338,124

1,045.383

tinca_tinca

43

11,257,335,754

6,932,387

1,020.880

The paper figures are generated from the tracked CSV files. The runtime figure reports compute time only (process plus downstream analysis); download time is retained in the raw CSV for reproducibility accounting but is not used as a performance claim.

The panel is biologically anchored by the original RADSex results: plecoglossus_altivelis is a previously described XX/XY case with 47 source-paper sex-associated markers at d=10, tinca_tinca is an XX/XY case reported without a previously described system and six markers at d=10, and danio_albolineatus and notothenia_rossii are source-paper null cases with no SD identified at d=10. The Bayesian plots therefore compare strict positive controls, strict negative calls, and posterior-supported sex-linked marker hypotheses under the same published workflow.

pixi run -e benchmark plot-literature-benchmarks

The generated figure and table assets are written to docs/figures/:

  • literature_runtime_breakdown.svg

  • literature_compute_breakdown.svg

  • literature_marker_throughput.svg

  • literature_downstream_times.svg

  • literature_bayesian_evidence.svg

  • literature_prior_sensitivity_heatmap.svg

  • literature_candidate_recovery.svg

  • literature_bio_unlocks.svg

  • literature_mode_candidate_counts.svg

  • literature_mode_qc.svg

  • literature_depth_stability.svg

  • literature_radsex_speedups.svg

  • literature_benchmark_summary.tsv

  • literature_speedup_summary.tsv

  • literature_speedup_pairs.tsv

1.4.2 Literature speed comparison

The same downloaded FASTQ inputs were processed by rsx-rs and by C++ RADSex v1.2.0 built in the Pixi benchmark environment. The C++ checkout requires the standard <cstdint> include fix for modern GCC, but the RADSex command behavior is otherwise unchanged.

Across 56 paired command/dataset/depth timings, on a single AMD Ryzen Threadripper PRO 3955WX workstation (16 threads), rsx-rs is faster in 53 pairs. The geometric-mean speedup is 8.38x when speedup is defined as C++ wall time divided by Rust wall time. The process-stage geometric-mean speedup is 2.77x across the four literature datasets (2.44x to 2.96x). depth, freq, and distrib favor rsx-rs in every paired run, with geometric-mean speedups of 6.91x, 22.5x, and 29.0x, respectively. signif is the exception: rsx-rs is faster in 13 of 16 paired runs, with a 1.25x geometric-mean speedup, because both programs stream the complete table. These are the numbers reported in the paper; the deposited reproducibility archive (https://doi.org/10.5281/zenodo.20531539) regenerates every figure and number, and they scale with the host hardware.

On Slurm clusters, submit per-dataset compute-only comparisons as an array job after the FASTQ files and marker tables have been prepared:

sbatch --export=ALL,RSX_REPO=$PWD benchmarks/slurm/rsx_release_build.sbatch
sbatch --export=ALL,RSX_REPO=$PWD benchmarks/slurm/radsex_cpp_build.sbatch
sbatch --export=ALL,RSX_REPO=$PWD benchmarks/slurm/literature_prepare_dataset.sbatch
sbatch --export=ALL,RSX_REPO=$PWD,RADSEX_BIN=/path/to/radsex \
  benchmarks/slurm/literature_radsex_compare.sbatch
pixi run -e benchmark merge-literature-slurm-results
pixi run -e benchmark plot-literature-benchmarks

1.4.3 Bayesian biological evidence

Bayesian evidence is summarized from the aggregated distrib tables without materializing marker-level multi-GB signif --bayes outputs. The summary is tracked in benchmarks/results/literature_bayesian_evidence.csv and plotted in docs/figures/literature_bayesian_evidence.svg.

The strongest Bayesian signal in the current panel is danio_albolineatus at minimum depth 1: 61,191 markers have posterior P(sex-linked) > 0.5 and 4,297 markers have posterior > 0.9. The same Bonferroni FASTA extraction reports zero significant markers for that dataset, so the Bayesian summary adds sex-linked marker probabilities that are not available from strict thresholded FASTA output alone.

The candidate-recovery figure compares strict Bonferroni FASTA marker counts with posterior-supported markers at the same dataset/depth settings. This grounds the biological-value claim in the published RADSex workflow datasets: the Bayesian layer does not change the null-hypothesis test, but it exposes a sex-linked marker hypothesis set when all-or-nothing FASTA extraction is empty.

1.4.4 Prior sensitivity of the Bayesian triage layer

The hybrid triage layer carries two free parameters: the prior probability pi that any given marker is sex-linked, and p_sex, the expected prevalence in the linked sex. The sensitivity sweep runs rsx triage on every (pi in {0.001, 0.005, 0.01, 0.02, 0.05, 0.1}) x (psexin {0.8, 0.85, 0.9, 0.95}) combination across the four literature datasets, producing 96 per-marker triage tables in benchmarks/results/slurm/. The aggregated counts go to benchmarks/results/prior_sensitivity_from_triage.csv and the heatmap to docs/figures/literature_prior_sensitivity_heatmap.svg.

sbatch --export=ALL,RSX_REPO=$PWD \
  benchmarks/slurm/literature_biology_prior_sensitivity.sbatch
pixi run -e benchmark python benchmarks/plot_prior_sensitivity_heatmap.py

The per-dataset spread of the qualifying-marker count (posterior > 0.9) across the 24-cell grid speaks directly to the robustness of the Bayesian call:

Dataset

Min posterior>0.9

Max posterior>0.9

Spread factor

danio_albolineatus

1

10,189

10,189

notothenia_rossii

0

44

N/A

plecoglossus_altivelis

46

87

1.9x

tinca_tinca

6

169

28x

The two positive controls behave differently. plecoglossus_altivelis holds within a factor of two across the entire grid, so the Bayesian call is essentially prior-insensitive on that dataset. tinca_tinca varies more (6 to 169 markers across the grid), but the strict-call core remains. danio_albolineatus shows extreme prior sensitivity, with the count jumping from a single marker at (pi=0.001, psex=0.95) to ten thousand markers at (pi>=0.05, psex<=0.85). notothenia_rossii peaks at 44 markers but is dominated by zeros, consistent with its singleton-heavy retained-marker structure described below.

The figure makes this concrete cell by cell. For paper claims we report all rsx counts at the documented default prior (pi=0.01, p_sex=0.9), note the per-dataset spread, and forward the prior-sensitivity heatmap as supporting evidence that the calls are stable for the plecoglossus_altivelis positive control while explicitly prior-dependent for danio_albolineatus.

1.4.5 Mode effects on published marker tables

The mode-effects analysis runs additional rsx statistical and QC modes through the Python bindings on the four downloaded marker tables at minimum depth 10:

pixi run -e paper analyze-literature-modes
pixi run -e benchmark plot-literature-benchmarks

The tracked result file is benchmarks/results/literature_mode_effects.csv. It records the command, wall time, output rows, marker counts, Bayesian thresholds, frequency-QC summaries, and PCA variance/loadings for each dataset. The generated figures are docs/figures/literature_mode_candidate_counts.svg and docs/figures/literature_mode_qc.svg.

At d=10 the mode-effects run reproduces the source-paper strict calls and shows what the additional modes contribute:

Dataset

Strict chisq/Bonf.

Fisher/FDR

G-test/FDR

Bayes posterior > 0.9

PC1 variance

Singleton fraction

danio_albolineatus

0

0

0

30

79.72%

19.96%

notothenia_rossii

0

0

0

0

94.93%

52.78%

plecoglossus_altivelis

47

49

49

49

93.09%

37.94%

tinca_tinca

6

6

7

11

98.14%

7.74%

This table is the paper-facing demonstration of the added analysis modes. The strict path preserves the RADSex interpretation: plecoglossus_altivelis and tinca_tinca are positive controls, while danio_albolineatus and notothenia_rossii remain strict null cases at d=10. Fisher/FDR and G-test/FDR show the sensitivity of count-cell testing to exact and likelihood tests on the same marker table. The Bayesian table gives posterior-supported marker evidence: it recovers 49 high-posterior markers for plecoglossus_altivelis, expands the tinca_tinca follow-up set to 11, and distinguishes the two strict negative datasets because danio_albolineatus has 30 high-posterior candidates whereas notothenia_rossii has none at the same threshold.

The PCA and frequency columns are QC outputs rather than marker-discovery claims. PC1 explains 79.7–98.1% of depth variation across the four datasets, while the mean sex-loading deltas are small relative to the dominant depth axis. This indicates that the leading sample-level variation is library/depth structure, not a clean sex-separation axis. The singleton fraction separates the retained marker tables as well: tinca_tinca is low-singleton at d=10 (7.74%), whereas notothenia_rossii remains dominated by singleton retained markers (52.78%), which is a biological-QC warning for candidate follow-up.

1.4.6 Depth stability of the call layers

The mode-effects panel runs at min_depth=10 so it can be compared with the RADSex literature workflow directly. The depth-stability sweep extends the same eight modes to min_depth in {3, 5, 8, 10}, giving 128 rows tracked in benchmarks/results/literature_mode_effects_sweep.csv and summarised per (dataset, depth) in benchmarks/results/literature_depth_stability.csv:

sbatch --export=ALL,RSX_REPO=$PWD \
  benchmarks/slurm/literature_biology_low_depth.sbatch
pixi run -e benchmark python benchmarks/collect_lowdepth_sweep.py
pixi run -e paper python benchmarks/summarise_depth_stability.py

Dataset

mindepth

strict

posterior > 0.9

Bayes factor > 10

danio_albolineatus

3

0

690

1,316

danio_albolineatus

5

0

143

782

danio_albolineatus

8

0

45

301

danio_albolineatus

10

0

30

174

notothenia_rossii

3

0

2

527

notothenia_rossii

5

0

1

486

notothenia_rossii

8

0

0

434

notothenia_rossii

10

0

0

400

plecoglossus_altivelis

3

56

137

5,965

plecoglossus_altivelis

5

47

62

2,512

plecoglossus_altivelis

8

47

50

905

plecoglossus_altivelis

10

47

49

572

tinca_tinca

3

7

16

505

tinca_tinca

5

6

13

475

tinca_tinca

8

6

9

438

tinca_tinca

10

6

11

443

The strict call layer is depth-stable on the positive controls: the plecoglossus_altivelis strict count holds at 47 from d=5 onward, and tinca_tinca drops by a single marker between d=3 and d=5. The Bayesian layers are far more depth-sensitive. The plecoglossus_altivelis Bayes-factor > 10 count contracts ten-fold (5,965 to 572) as min_depth moves from 3 to 10, and the posterior > 0.9 count contracts almost three-fold (137 to 49). The figure docs/figures/literature_depth_stability.svg plots the same data on a faceted log axis.

The two strict-null datasets behave differently across depths. notothenia_rossii stays strict-zero at every depth and its posterior > 0.9 count never exceeds 2, which corroborates the negative call from the d=10 mode-effects table. danio_albolineatus is strict-zero everywhere but its Bayesian counts contract sharply from d=3 to d=10, retaining 45 high-posterior and 301 Bayes-factor markers at d=8 and 30 high-posterior plus 174 Bayes-factor markers at d=10. This is the same dataset that shows the highest prior sensitivity in the heatmap above, so the paper should present it as a prior- and depth-sensitive hypothesis set rather than a stable sex-determination call.

The depth-stability table is what motivates citing the plecoglossus_altivelis strict numbers at the standard RADSex d=10 threshold while pairing the Bayesian numbers with their depth and prior context.

1.4.7 Biological unlock summary

The marker-level unlock summary joins strict signif output with signif --bayes output from the same d=10 marker tables. It is tracked in benchmarks/results/literature_bio_unlocks.csv, and the top-ranked marker rows are tracked in benchmarks/results/literature_candidate_triage.csv. Dataset-level sex-system calls are tracked in benchmarks/results/literature_sex_system_inference.csv. The generated figure is docs/figures/literature_bio_unlocks.svg.

Dataset

Strict + posterior

Strict only

Posterior only

Bayes-factor only

Interpretation

danio_albolineatus

0

0

30

158

posterior-supported markers below Bonferroni

notothenia_rossii

0

0

0

400

Bayes-factor evidence rejected by posterior threshold

plecoglossus_altivelis

47

0

2

523

RADSex signal with posterior-supported additions

tinca_tinca

6

0

5

432

RADSex signal with posterior-supported additions

This is the clearest biological distinction unlocked by rsx beyond the original RADSex thresholded FASTA endpoint. In the two positive controls, posterior evidence recovers every strict d=10 marker and adds two plecoglossus_altivelis and five tinca_tinca marker hypotheses. In danio_albolineatus, rsx exposes 30 high-posterior markers while preserving the Bonferroni-negative call. In notothenia_rossii, 400 rows have Bayes factor > 10 but none have posterior > 0.9, and the singleton fraction is 52.78%; the posterior/QC layer therefore prevents raw Bayes-factor counts from being over-read as biological positives.

The generated sex-system inference artifact makes the biological call explicit:

Dataset

Evidence class

rsx inference

Marker basis

plecoglossus_altivelis

confirmatory

XX/XY-supported

47 male-biased strict markers; 49 male-biased high-posterior markers

tinca_tinca

confirmatory

XX/XY-supported

6 male-biased strict markers; 11 high-posterior markers for validation

danio_albolineatus

hypothesis

putative W-linked marker hypothesis

30 high-posterior markers without Bonferroni reclassification; 29 are female-biased

notothenia_rossii

negative

no sex-linked marker call

400 Bayes-factor-only rows, 0 high-posterior rows, 52.78% singleton retained markers

1.4.8 Biological inference role of extended features

The extended rsx features map to biological interpretation tasks rather than standing alone as software claims:

Inference task

rsx feature

Biological use

Sequencing QC

depth, freq, pca

choose depth thresholds and flag outlier libraries

Marker discovery

distrib, signif

test sex-biased marker presence/absence

Sex-linkage posterior

FDR, Bayes posterior

identify putative Y-linked/W-linked marker hypotheses below strict Bonferroni calls

Candidate validation

subset

inspect marker depths in individuals and outliers

Locus localization

map

test whether candidates cluster on a sex chromosome

Cross-run synthesis

merge, k-mer deduplication

combine lanes/panels and reduce redundant error variants

Reproducible analysis

Python bindings, C API

integrate outputs into notebooks and workflow languages

The bounded-memory implementation supports inference by allowing the full marker table to be analyzed without downsampling. That matters for RAD-seq sex-determination studies because rare candidate markers are sensitive to minimum-depth thresholds, sequencing-error duplication, and sample-level library effects.

1.4.9 Novelty claim

For the software paper, the novelty claim is the combination of:

Contribution

Biological/inference value

Drop-in RADSex compatibility

compare directly with prior RADSex biological results

Bounded-memory full-table algorithms

avoid downsampling rare marker candidates

Additional statistical modes

compare strict tests, FDR, and posterior sex-linkage calls on the same counts

Formal mathematical provenance

make erfc p-values, sparse medians, and streaming PCA auditable

Workflow bindings

keep downstream biological analysis in Python/R-style workflows

DataFrame-native MarkerTableSource

run any analysis command directly on an in-memory DataFrame without exporting a markers TSV; spill to Parquet only when a Beissinger-style working-set estimate predicts the data will not fit in RAM

Prior-sensitivity sweep

report Bayesian call counts across the full (pi, psex) grid per dataset so the prior choice becomes auditable

1.4.10 Validation layers

Paper-readiness validation is split by claim:

Claim

Validation artifact

Mathematical identities

scripts/sympy/chi2_cdf_derivation.py, sparse_median_proof.py, tucker_covariance_proof.py

Numerical/statistical routines

radsex-core/tests/test_precision.rs and radsex-core/src/stats.rs unit tests

Command behavior

radsex-core/tests/test_pipeline.rs integration tests

Literature speedup

benchmarks/results/literature_speed_comparison.csv and Slurm scripts

Bayesian sex-linkage evidence

benchmarks/results/literature_bayesian_evidence.csv and candidate-recovery figures

Prior sensitivity of the Bayesian layer

benchmarks/results/slurm/triage_*_pi*_psex*.tsv (96 cells), benchmarks/results/prior_sensitivity_from_triage.csv, docs/figures/literature_prior_sensitivity_heatmap.svg

Extended mode effects

benchmarks/results/literature_mode_effects.csv and mode-effect figures

Depth stability of the call layers

benchmarks/results/literature_mode_effects_sweep.csv, benchmarks/results/literature_depth_stability.csv, docs/figures/literature_depth_stability.svg

Biological unlock summary

benchmarks/results/literature_bio_unlocks.csv, literature_candidate_triage.csv, literature_sex_system_inference.csv, and unlock figure

Python workflow surface

benchmarks/results/literature_binding_results.csv (4 datasets) and rsx-python/tests/test_pyrsx.py

MarkerTableSource parity (file / Arrow / Parquet)

radsex-core/src/source/tests.rs and rsx-python/tests/test_marker_table_from_arrow.py (including the RSX_FORCE_SPILL=1 toggle test)

MarkerTableSource architectural cost

benchmarks/results/marker_source_paths.csv via benchmarks/bench_marker_source.py

The math scripts verify the erfc form of the one-degree chi-squared p-value, sparse median indexing, and streaming Gram/PCA equivalence. The precision tests cover known chi-squared values, p-value monotonicity and symmetry, group-count consistency, Fisher’s exact test, the G-test, Benjamini-Hochberg adjustment, Bayes factors, posterior symmetry, and empirical-Bayes EM convergence.

Recent validation runs:

  • pixi run derive in scripts/: passed the chi-squared erfc derivation, critical-value checks, and f32/f64 relative-error checks.

  • pixi run python sympy/sparse_median_proof.py: passed sparse-median examples and complexity check.

  • pixi run python sympy/tucker_covariance_proof.py: passed streaming Gram/SVD equivalence on a sparse matrix.

  • Remote cargo test --workspace on elja: 60 core unit tests, 11 pipeline integration tests, and 13 precision tests passed.

  • Remote pixi run -e python test-python on elja: 18 Python binding tests passed after a release extension build.

1.4.11 Python binding feature evidence

The pyrsx bindings expose the same Rust core through Python calls for workflow notebooks and downstream analysis code. The literature binding run uses the downloaded RADSex marker tables and popmaps rather than synthetic test fixtures, and now covers all four panel datasets rather than the single representative case shipped earlier:

pixi run -e paper run-literature-bindings

On Slurm:

sbatch --export=ALL,RSX_REPO=$PWD \
  --array=0-3 benchmarks/slurm/literature_bindings_features.sbatch
pixi run -e benchmark merge-literature-binding-slurm-results

The Pixi task writes benchmarks/results/literature_binding_results.csv aggregated across the four datasets. The Slurm array writes per-dataset shards under benchmarks/results/slurm/, and the merge task refreshes the same aggregated CSV. It exercises pyrsx.freq, pyrsx.depth, pyrsx.distrib(test’fisher’, correction=’fdr’)=, pyrsx.pca, and pyrsx.signif(bayes=True) on each published RADSex workflow dataset, giving the paper an ease-of-use claim backed by real inputs measured end-to-end on four datasets, not one.

Aggregated tracked timings per dataset at min_depth=10 (latest tracked literature_binding_results.csv run):

Dataset

pyrsx.freq

pyrsx.depth

pyrsx.distrib (Fisher/FDR)

pyrsx.pca

pyrsx.signif(bayes=True)

danio_albolineatus

1.42 s

19.42 s

1.46 s

15.42 s

32.60 s

notothenia_rossii

0.79 s

5.61 s

0.45 s

4.66 s

9.10 s

plecoglossus_altivelis

0.42 s

0.65 s

0.39 s

4.83 s

8.78 s

tinca_tinca

0.29 s

0.36 s

0.30 s

2.92 s

5.65 s

The pyrsx surface is the same Rust core; the per-dataset spread mirrors the marker-table sizes (about 3.0M, 0.4M, 0.2M, and 0.07M retained markers respectively).

1.4.11.1 From-DataFrame Arrow path: the MarkerTableSource layer

The high-level MarkerTable class accepts both a file path and any narwhals-compatible DataFrame. The DataFrame entry points (MarkerTable. from_dataframe and the .freq=/.depth=/=.distrib=/=.signif=/=.triage=/ .pca methods on it) route through a new MarkerTableSource layer in radsex-core/src/source/ rather than materialising a markers TSV behind the user’s back.

A short specification record is tracked at docs/orgmode/reference/arrow-input-for-high-level-api.org. The implementation has three variants:

Variant

Where the markers live

When it is chosen

MarkersTableStream (file)

mmap’d TSV on disk

the CLI and any MarkerTable.from_path(...) call

ArrowMarkerSource (in-memory)

Vec<RecordBatch> held in RAM

MarkerTable.from_dataframe(...) when the Beissinger working-set estimator says the data comfortably fits

ParquetMarkerSource (spilled)

a hidden temp file written in Parquet

MarkerTable.from_dataframe(...) when the estimate exceeds the spill threshold, or RSX_FORCE_SPILL=1

The three variants share the same MarkerStream trait, and every analysis command is generic over S: MarkerStream, so the spill choice is purely a memory question; the per-marker results are bytes-identical regardless of which variant the caller picked. The parity test in radsex-core/src/source/tests.rs asserts this for freq on a shared fixture and the Python pytest test_marker_table_from_arrow.py asserts the per-dataset results match the file path through MarkerTable.

1.4.11.2 Beissinger-style spill estimator

The spill decision uses a working-set estimator drawn from the RAD-seq literature (Beissinger et al. 2013 on marker density and read depth for GBS-style genotyping, plus empirical overhead factors reported by TASSEL-GBS, ipyrad, and Torkamaneh et al. 2016). The prediction lives in radsex-core/src/source/estimator.rs and combines four factors:

estimated_bytes = n_samples
                * m_markers_observed
                * bytes_per_cell       (2 bytes per u16 depth)
                * overhead_factor      (default 6.0)
                * command_multiplier   (1.3 freq, 1.5 distrib/pca,
                                        1.8 depth, 2.0 signif/triage)

The available-RAM probe reads /proc/meminfo for MemAvailable and the default policy spills when the prediction exceeds 55% of that figure. Operators can override the threshold through RSX_SPILL_BYTES=<N> and the variant choice through RSX_FORCE_SPILL=1. The 6x overhead floor and the per-command multipliers come from the rsx working-set audit reported in the source-module docstring; the paper claim is therefore a defensible heuristic backed by the cited literature rather than a magic constant.

1.4.11.3 Architectural cost vs the legacy TSV streamer

The MarkerTableSource benchmark times the same three commands (freq, depth, triage) over each literature dataset via three constructions: MarkerTable.from_path(...) (the legacy mmap’d TSV), from_dataframe(...) in-memory Arrow, and from_dataframe(...) with RSX_FORCE_SPILL=1 to exercise the Parquet path.

pixi run -e paper python benchmarks/bench_marker_source.py

Tracked wall-time in benchmarks/results/marker_source_paths.csv (4 datasets x 3 sources x 3 commands, min_depth=10):

Dataset

Command

Path (s)

In-memory Arrow (s)

Spilled Parquet (s)

Arrow slowdown

Parquet slowdown

danio_albolineatus

freq

4.54

38.12

68.09

8.4x

15.0x

danio_albolineatus

depth

17.26

44.16

67.55

2.6x

3.9x

danio_albolineatus

triage

30.95

67.13

111.14

2.2x

3.6x

notothenia_rossii

freq

1.62

19.28

29.08

11.9x

17.9x

notothenia_rossii

depth

5.13

20.80

30.14

4.1x

5.9x

notothenia_rossii

triage

9.22

29.11

53.88

3.2x

5.8x

plecoglossus_altivelis

freq

1.71

20.10

26.85

11.8x

15.7x

plecoglossus_altivelis

depth

0.90

21.36

26.94

23.7x

29.9x

plecoglossus_altivelis

triage

13.41

30.75

50.27

2.3x

3.7x

tinca_tinca

freq

1.29

13.03

19.71

10.1x

15.3x

tinca_tinca

depth

0.56

13.87

20.45

24.7x

36.4x

tinca_tinca

triage

8.84

18.95

34.82

2.1x

3.9x

The qualitative shape:

  • The path-based mmap streamer keeps its existing speed; nothing in the refactor regressed the CLI path.

  • The in-memory Arrow path is 2-25x slower than the file path for the same command on the same data. The biggest gaps appear on the smallest-output commands (freq, depth) where the per-marker IPC decoding overhead dominates; on triage the gap shrinks to ~2-3x because the per-marker statistical work amortises the format cost.

  • The Parquet spill path adds another ~1.5-2x over the in-memory variant (the spill-then-stream pass). The Beissinger estimator only chooses it when the working-set prediction exceeds the available-RAM threshold, so the user pays for the spill only when the alternative is exceeding RAM.

The numbers support a paper claim that the DataFrame surface gives an ergonomic notebook/workflow API for moderate-sized data without regressing the CLI’s streaming path, and that the spill-vs-in-memory choice is automatic but explicitly overridable via RSX_FORCE_SPILL / RSX_SPILL_BYTES. Heavy production analysis on the largest datasets is still cheapest through the path-based CLI.

1.5 Raw timing data (seconds, median of 3 runs)

Scale

Command

C++ (s)

Rust (s)

Speedup

small

freq

0.016

0.004

4.0x

small

distrib

0.015

0.003

5.0x

small

signif

0.015

0.003

5.0x

small

depth

0.016

0.003

5.3x

small

subset

0.016

0.003

5.3x

small

map

0.042

0.021

2.0x

small

process

0.016

0.009

1.8x

medium

freq

0.015

0.006

2.5x

medium

distrib

0.019

0.007

2.7x

medium

signif

0.021

0.013

1.6x

medium

depth

0.025

0.005

5.0x

medium

subset

0.026

0.014

1.9x

medium

map

0.407

0.228

1.8x

medium

process

0.058

0.038

1.5x

large

freq

0.114

0.044

2.6x

large

distrib

0.138

0.057

2.4x

large

signif

0.189

0.137

1.4x

large

depth

0.240

0.067

3.6x

large

subset

0.170

0.118

1.4x

1.6 Why Rust is faster

1.6.1 Inline for_each parser (biggest win)

The C++ version uses a producer-consumer pattern: one thread parses the markers table into a std::queue<Marker> protected by a mutex, another thread processes markers from the queue with busy-wait polling (sleep(10us) when empty).

rsx-rs originally copied this pattern using crossbeam channels, but profiling showed ~50% of CPU time was spent on channel overhead and marker cloning. rsx-rs uses an inline for_each callback that reuses a single Marker struct with zero allocation per marker. The marker is passed by reference to the callback, reset in-place between iterations.

1.6.2 Bitset popcount for group counts (eliminated HashMap entirely)

The C++ stores group counts in std::unordered_map<string, uint>, hashing group name strings for every marker field. rsx-rs replaces this with a BitsetRow (1 bit per individual) and pre-computed GroupMask bitmasks. Group count = popcount(marker_bits & group_mask) – a single CPU instruction per 64 individuals, with zero hashing.

For 200 individuals: 5.0 ns per group count vs ~200 ns for a HashMap lookup.

1.6.3 erfc identity for chi-squared CDF (SymPy-derived, 5.2x faster)

The chi-squared p-value for df=1 simplifies exactly to p = erfc(sqrt(chi2/2)). This is proven via the identity P(1/2, x) = erf(sqrt(x)) (DLMF 8.2.1). The derivation script is in scripts/sympy/chi2_cdf_derivation.py.

This replaces the full regularized gamma function (statrs crate, 562 ns) with a single libm::erfc call (108 ns). A Sollya-generated minimax polynomial (scripts/sollya/erfc_minimax.sollya) can further replace libm::erfc for GPU kernels.

1.6.4 mmap I/O (zero-copy file access)

The markers table is memory-mapped via memmap2, eliminating buffered read syscalls. The kernel manages page faults transparently. Combined with the inline for_each parser, this achieves near-memcpy throughput.

1.6.5 Rayon for process command

The process command reads all FASTQ files and counts sequences. C++ uses a manual thread pool with mutex-protected work-stealing. rsx-rs uses rayon’s work-stealing thread pool with per-thread AHashMap accumulators, merged at the end. This avoids mutex contention during file processing.

1.6.6 minimap2 for map command

rsx-rs uses minimap2 (via Rust bindings) instead of BWA-MEM. minimap2 is the modern successor with better index construction and alignment speed for short reads. The sr (short-read) preset matches BWA-MEM behavior.

1.6.7 Zero-copy field parsing

The table parser uses push_str into pre-allocated strings rather than creating new String objects per field. Combined with fast_parse_u16 for integer fields (matching the C++ fast_stoi approach), this eliminates allocation in the inner parsing loop.

1.7 Reproducing

# Use the benchmark environment
pixi install -e benchmark

# Optional: regenerate equivalent synthetic data without overwriting
# the tracked manuscript reference data.
pixi run -e benchmark generate-data

# Build C++ radsex (requires patching for GCC 15)
cd /path/to/radsex && make -j4

# Run benchmarks against the tracked reference data.
cd /path/to/rsx-rs
CPP_RADSEX=/path/to/radsex/bin/radsex pixi run -e benchmark run-benchmarks

# Print summary
pixi run -e benchmark plot-benchmarks

# Validate and prepare the DOI-deposition package.
pixi run -e benchmark test-package
pixi run -e benchmark package-figshare

By default the benchmark harness reads benchmarks/data/ and writes benchmarks/results/benchmark_results.csv. Set BENCH_DATA or BENCH_RESULTS only when using alternate data or result locations. The tracked CSV and repro/benchmarks.org are the reference inputs for the manuscript numbers.

1.8 Golden file compatibility

Despite the performance differences, rsx-rs produces byte-identical output to C++ radsex for all downstream commands (freq, depth, distrib, signif, subset) when groups are specified explicitly with -G M,F. The Cg float formatter matches C++’s default %g output (6 significant digits). The map command uses minimap2 instead of BWA-MEM so alignment results differ, but the output format is identical.