OpenMS
ParquetWriteConfig Struct Reference

Configuration for Parquet file writing. More...

#include <OpenMS/FORMAT/MSExperimentArrowExport.h>


Public Types

enum class  Compression {
  NONE , SNAPPY , GZIP , LZ4 ,
  ZSTD
}
 Compression algorithm. More...
 

Public Attributes

Compression compression = Compression::ZSTD
 Compression algorithm (default: ZSTD for best ratio/speed)
 
int compression_level = 3
 Compression level (interpretation depends on algorithm).
 
int64_t row_group_size = 128 * 1024 * 1024
 Target row group size in bytes (default: 128MB).
 
bool write_statistics = true
 Write column statistics (min/max/null_count) for each row group.
 
int64_t data_page_size = 1024 * 1024
 Data page size in bytes (default: 1MB).
 

Detailed Description

Configuration for Parquet file writing.

Controls compression, row group size, and other Parquet-specific settings. These settings affect file size, read performance, and memory usage.

Parquet automatically applies run-length encoding (RLE) and dictionary encoding where beneficial during write. Repetitive values (like ms_level repeated per peak in long format) are compressed efficiently without explicit configuration.

Performance guidelines for MS data:

  • ZSTD compression typically achieves 3-5x compression on peak data
  • 128MB row groups balance parallelism and compression efficiency
  • Statistics enable predicate pushdown for efficient m/z and RT range queries

Member Enumeration Documentation

◆ Compression

enum class Compression (strong)

Compression algorithm.

Enumerator
NONE 

No compression (fastest write, largest files)

SNAPPY 

Fast compression, moderate ratio (widely compatible)

GZIP 

Good compression, slower (universal compatibility)

LZ4 

Very fast compression, lower ratio.

ZSTD 

Best ratio/speed tradeoff (recommended, default)

Member Data Documentation

◆ compression

Compression compression = Compression::ZSTD

Compression algorithm (default: ZSTD for best ratio/speed)

◆ compression_level

int compression_level = 3

Compression level (interpretation depends on algorithm):

  • ZSTD: 1-22 (default 3; higher = better ratio, slower)
  • GZIP: 1-9 (default 6)
  • LZ4/SNAPPY: ignored

◆ data_page_size

int64_t data_page_size = 1024 * 1024

Data page size in bytes (default: 1MB). Affects the granularity of reads within a row group.

◆ row_group_size

int64_t row_group_size = 128 * 1024 * 1024

Target row group size in bytes (default: 128MB). Smaller groups give readers more parallelism; larger groups compress better. For MS data with millions of peaks, 128MB typically yields 1-2M rows per group.

◆ write_statistics

bool write_statistics = true

Write column statistics (min/max/null_count) for each row group. Enables predicate pushdown for efficient m/z and RT range queries, at a small overhead (~1% file size increase).