module estimator

Working-set size estimator and spill decision.

The estimator combines the observed marker count from the inbound Arrow IPC payload with a conservative overhead multiplier drawn from the RAD-seq literature (Beissinger 2013, TASSEL-GBS, ipyrad).

MarkerTableSource::from_arrow_ipc decodes the bytes once, asks the estimator whether the implied working set fits, and either keeps the batches in RAM (InMemory(ArrowMarkerSource)) or spills them to a Parquet temp file (Spilled(ParquetMarkerSource)).

Variables

const BYTES_PER_CELL: u64

Bytes per depth cell. We always store u16 in the marker buffer regardless of the inbound Arrow type, so this is a fixed 2 bytes.

const DEFAULT_OVERHEAD: f64

Default overhead multiplier covering Arrow validity buffers, group masks, per-marker accumulators, and intermediate Vecs. 6x is conservative for the largest commands (signif FDR, triage Bayesian, depth exact).

const DEFAULT_SPILL_FRACTION: f64

Default fraction of available RAM we are willing to use before switching to the spill path.

Functions

fn estimate_working_set_bytes(n_samples: usize, m_observed_or_predicted: usize, bytes_per_cell: u64, overhead_factor: f64, command_specific_multiplier: f64) -> SizeEstimate

Compute the predicted working-set size in bytes.

command_specific_multiplier lets the caller widen the prediction for the heavier commands (e.g. 2.0 for triage / signif with FDR, 1.3 for freq / depth, which mostly stream).
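The page does not spell out the arithmetic behind the estimate, so the following is only a minimal sketch. It assumes the estimate is the plain product of cell count, bytes per cell, overhead factor, and command multiplier. The struct mirrors the documented SizeEstimate fields, but the function body is an illustration, not the crate's actual implementation.

```rust
/// Mirror of the documented `SizeEstimate` fields (illustrative copy).
#[derive(Debug, Clone, PartialEq)]
pub struct SizeEstimate {
    pub n_samples: usize,
    pub m_markers: usize,
    pub bytes_per_cell: u64,
    pub overhead_factor: f64,
    pub command_multiplier: f64,
    pub estimated_bytes: u64,
}

/// ASSUMED formula: samples * markers * bytes_per_cell * overhead * multiplier.
pub fn estimate_working_set_bytes(
    n_samples: usize,
    m_observed_or_predicted: usize,
    bytes_per_cell: u64,
    overhead_factor: f64,
    command_specific_multiplier: f64,
) -> SizeEstimate {
    // Saturating integer math so an absurd marker count cannot wrap around.
    let cells = (n_samples as u64).saturating_mul(m_observed_or_predicted as u64);
    let raw_bytes = cells.saturating_mul(bytes_per_cell);

    // Float-to-integer `as` casts saturate in Rust, so a huge product
    // clamps to u64::MAX instead of overflowing.
    let estimated_bytes =
        (raw_bytes as f64 * overhead_factor * command_specific_multiplier).ceil() as u64;

    SizeEstimate {
        n_samples,
        m_markers: m_observed_or_predicted,
        bytes_per_cell,
        overhead_factor,
        command_multiplier: command_specific_multiplier,
        estimated_bytes,
    }
}
```

Under this assumed formula, 96 samples and 1,000,000 markers at 2 bytes per cell, with a 6x overhead and a 2.0 command multiplier, come to 2,304,000,000 bytes (about 2.3 GB).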

fn spill_threshold_bytes() -> u64

Bytes above which we should spill rather than keep the source in RAM.
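The page states neither the value of DEFAULT_SPILL_FRACTION nor how available RAM is measured, so this sketch assumes both. ASSUMED_SPILL_FRACTION is a placeholder value and spill_threshold_from is a hypothetical helper that takes the available-RAM figure as a parameter. The real spill_threshold_bytes takes no arguments and presumably queries the system itself.

```rust
/// ASSUMED placeholder; the documented DEFAULT_SPILL_FRACTION value is not shown.
pub const ASSUMED_SPILL_FRACTION: f64 = 0.5;

/// Hypothetical helper: derive the spill threshold from a caller-supplied
/// available-RAM figure rather than probing the operating system.
pub fn spill_threshold_from(available_bytes: u64, spill_fraction: f64) -> u64 {
    // Keep the fraction within [0, 1] so the threshold never exceeds
    // the memory that is actually available.
    let fraction = spill_fraction.clamp(0.0, 1.0);
    (available_bytes as f64 * fraction) as u64
}
```

With 16 GiB available and the assumed fraction of 0.5, the threshold is 8 GiB. Any working-set estimate above that would take the spill path.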

Enums

enum MarkerSourceError

Combined error for from_arrow_ipc.

Arrow(ArrowSourceError)
Parquet(ParquetSourceError)

Traits implemented

impl std::fmt::Display for MarkerSourceError
impl std::error::Error for MarkerSourceError

enum MarkerTableSource

Resolved marker source: either in-memory Arrow or a Parquet spill.

Wraps the underlying source so the analysis commands can stay generic over MarkerStream without caring which physical backing they got.

InMemory(ArrowMarkerSource)
Spilled(ParquetMarkerSource)

Implementations

impl MarkerTableSource

Functions

fn from_arrow_ipc(bytes: &[u8], popmap: Option<&Popmap>, min_depth: u16, command_multiplier: f64) -> Result<Self, MarkerSourceError>

Decode the inbound IPC bytes, consult the estimator, and produce either an in-memory or a spilled source. The popmap argument is optional in the signature, but the multi-group commands (distrib/signif/triage/depth) require it.

fn is_spilled(&self) -> bool

Convenience: was this source materialised to disk?
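The decision step and the is_spilled accessor can be sketched as follows. The two source types are empty stand-ins and choose is a hypothetical helper that isolates the comparison. The real from_arrow_ipc also decodes the IPC bytes and writes the Parquet temp file, neither of which is modelled here.

```rust
/// Empty stand-ins for the real source types (definitions not shown here).
pub struct ArrowMarkerSource;
pub struct ParquetMarkerSource;

/// The two documented variants of the resolved marker source.
pub enum MarkerTableSource {
    InMemory(ArrowMarkerSource),
    Spilled(ParquetMarkerSource),
}

impl MarkerTableSource {
    /// Hypothetical helper: pick a backing from an estimate and a threshold.
    /// Spills only when the estimate is strictly above the threshold.
    pub fn choose(estimated_bytes: u64, threshold_bytes: u64) -> Self {
        if estimated_bytes > threshold_bytes {
            Self::Spilled(ParquetMarkerSource)
        } else {
            Self::InMemory(ArrowMarkerSource)
        }
    }

    /// True when the source was materialised to disk.
    pub fn is_spilled(&self) -> bool {
        matches!(self, Self::Spilled(_))
    }
}
```

Callers that only need to log or report the backing can branch on is_spilled() and otherwise treat both variants the same way.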

Structs and Unions

struct SizeEstimate

Working-set estimate.

n_samples: usize
m_markers: usize
bytes_per_cell: u64
overhead_factor: f64
command_multiplier: f64
estimated_bytes: u64