OpenMS
ParquetFile Class Reference

Shared utilities for reading, writing, and packaging Parquet-based file formats.

#include <OpenMS/FORMAT/ParquetFile.h>

Static Public Member Functions

Arrow builder helpers
static void appendOrThrow (const arrow::Status &status, const std::string &column)
 Check the status returned by an Arrow builder append call, throwing on failure.
 
template<typename BuilderT >
static std::shared_ptr< arrow::Array > finishArray (BuilderT &builder, const std::string &name)
 Finish an Arrow builder and return the resulting Array.
 
Parquet file I/O
static void writeTable (const std::shared_ptr< arrow::Table > &table, const String &filename, int64_t row_group_size=262144)
 Write an Arrow Table to a Parquet file.
 
static std::shared_ptr< arrow::Table > readTable (const String &filename)
 Read a Parquet file into an Arrow Table.
 
static std::shared_ptr< arrow::Table > readTable (const std::shared_ptr< arrow::io::RandomAccessFile > &infile)
 Read a Parquet file from an Arrow RandomAccessFile into an Arrow Table.
 
Column accessors
static std::shared_ptr< arrow::Array > getColumn (const std::shared_ptr< arrow::Table > &table, const std::string &name)
 Get a required column from an Arrow Table by name.
 
static std::shared_ptr< arrow::Array > getOptionalColumn (const std::shared_ptr< arrow::Table > &table, const std::string &name)
 Get an optional column from an Arrow Table by name.
 
Type-safe value accessors
static int64_t getInt64 (const std::shared_ptr< arrow::Array > &array, int64_t row, int64_t default_value, bool allow_null)
 Read an integer value from an Arrow Array with type coercion.
 
static double getDouble (const std::shared_ptr< arrow::Array > &array, int64_t row, double default_value, bool allow_null)
 Read a floating-point value from an Arrow Array with type coercion.
 
static bool getBool (const std::shared_ptr< arrow::Array > &array, int64_t row, bool default_value, bool allow_null)
 Read a boolean value from an Arrow Array with type coercion.
 
static std::string getString (const std::shared_ptr< arrow::Array > &array, int64_t row)
 Read a string value from an Arrow Array.
 
static std::vector< std::string > getStringList (const std::shared_ptr< arrow::Array > &array, int64_t row)
 Read a list of strings from an Arrow Array.
 

Misc helpers

static std::string jsonEscape (const String &input)
 Escape a string for safe embedding into JSON values.
 
static int64_t rowCount (const String &filename)
 Return the number of rows in a Parquet file, read from the file metadata via the low-level Parquet reader. Returns 0 if the file does not exist.
 
static void throw_finish_error_ (const std::string &name, const std::string &error)
 Internal helper to throw a consistent error from finishArray.
 

Detailed Description

Shared utilities for reading, writing, and packaging Parquet-based file formats.

This class provides static helpers used by multiple OpenMS Parquet-backed I/O classes (e.g. TransitionParquetFile, OpenSwathOSWParquetWriter, XICParquetFile).

Capabilities include:

  • Reading / writing Arrow Tables to individual Parquet files
  • Type-safe column accessors with flexible type coercion
  • Zip / unzip of Parquet directory bundles (e.g. .oswpq archives)
  • Common Arrow builder helpers (append, finish)

All Parquet zip archives use store-only compression (-0) because Parquet files are already internally compressed; re-compressing with deflate wastes CPU for negligible size reduction.

Member Function Documentation

◆ appendOrThrow()

static void appendOrThrow ( const arrow::Status &  status,
const std::string &  column 
)
static

Check the status returned by an Arrow builder append call, throwing on failure.

Parameters
[in]  status  Arrow status returned by the append call
[in]  column  Column name (used in error messages)
Exceptions
Exception::InvalidValue  if the status is not OK
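
The check-and-throw pattern described above can be sketched without Arrow. This is only an illustration, not the OpenMS implementation: StatusSketch is a hypothetical stand-in for arrow::Status, std::runtime_error stands in for Exception::InvalidValue, and the column name "PRECURSOR_MZ" and the error message wording are made up for the example.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for arrow::Status: an ok flag plus an error message.
struct StatusSketch
{
  bool ok;
  std::string message;
};

// Throws if the status reports failure, naming the affected column.
// std::runtime_error stands in for Exception::InvalidValue.
inline void appendOrThrowSketch(const StatusSketch& status, const std::string& column)
{
  if (!status.ok)
  {
    throw std::runtime_error("Failed to append to column '" + column + "': " + status.message);
  }
}

// Returns true if the check threw for the given status.
// "PRECURSOR_MZ" is a made-up column name used only for the error message.
inline bool throwsFor(const StatusSketch& status)
{
  try
  {
    appendOrThrowSketch(status, "PRECURSOR_MZ");
  }
  catch (const std::runtime_error&)
  {
    return true;
  }
  return false;
}
```

Passing the column name alongside the status lets the error message identify which builder failed, which is presumably why the documented signature takes both.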

◆ finishArray()

template<typename BuilderT >
static std::shared_ptr< arrow::Array > finishArray ( BuilderT &  builder,
const std::string &  name 
)
inline static

Finish an Arrow builder and return the resulting Array.

Parameters
[in,out]  builder  Any Arrow ArrayBuilder subclass
[in]  name  Column name (used in error messages)
Returns
Shared pointer to the finished Arrow Array
Exceptions
Exception::InvalidValue  if finishing fails

◆ getBool()

static bool getBool ( const std::shared_ptr< arrow::Array > &  array,
int64_t  row,
bool  default_value,
bool  allow_null 
)
static

Read a boolean value from an Arrow Array with type coercion.

Supports Boolean and all integer types (non-zero = true).

Parameters
[in]  array  The Arrow Array (may be nullptr if allow_null is true)
[in]  row  Row index
[in]  default_value  Value returned when array is nullptr or the value is null
[in]  allow_null  If true, null values return default_value instead of throwing
Returns
The boolean value
Exceptions
Exception::MissingInformation  if the value is null and allow_null is false
Exception::InvalidValue  for unsupported column types
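
The default_value / allow_null contract, which getDouble() and getInt64() document with the same parameters, might be modelled as follows. This is a sketch under stated assumptions, not the actual implementation: CellSketch and ColumnSketch are hypothetical stand-ins for a nullable Arrow integer column, std::runtime_error stands in for Exception::MissingInformation, and treating an absent column like a null value when allow_null is false is an assumption the page does not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for one cell of a nullable integer column.
struct CellSketch
{
  bool is_null;
  std::int64_t value;
};

// A null pointer to ColumnSketch models an absent (optional) column.
using ColumnSketch = std::vector<CellSketch>;

// Models the documented getBool contract for integer input (non-zero = true).
// std::runtime_error stands in for Exception::MissingInformation.
inline bool getBoolSketch(const ColumnSketch* column, std::size_t row,
                          bool default_value, bool allow_null)
{
  const bool is_null = (column == nullptr) || column->at(row).is_null;
  if (is_null)
  {
    if (allow_null) return default_value;
    // Assumption: an absent column is handled like a null value here.
    throw std::runtime_error("null value where a value is required");
  }
  return column->at(row).value != 0;
}
```

In this model, a column that is missing entirely and a cell that is null both fall back to default_value when allow_null is true, so a caller can pass the result of an optional column lookup straight through.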

◆ getColumn()

static std::shared_ptr< arrow::Array > getColumn ( const std::shared_ptr< arrow::Table > &  table,
const std::string &  name 
)
static

Get a required column from an Arrow Table by name.

Parameters
[in]  table  The Arrow Table
[in]  name  Column name
Returns
The first chunk of the column as an Arrow Array
Exceptions
Exception::MissingInformation  if the column is not found
Exception::InvalidValue  if the column has no chunks

◆ getDouble()

static double getDouble ( const std::shared_ptr< arrow::Array > &  array,
int64_t  row,
double  default_value,
bool  allow_null 
)
static

Read a floating-point value from an Arrow Array with type coercion.

Supports Float, Double, and all integer types (coerced to double).

Parameters
[in]  array  The Arrow Array (may be nullptr if allow_null is true)
[in]  row  Row index
[in]  default_value  Value returned when array is nullptr or the value is null
[in]  allow_null  If true, null values return default_value instead of throwing
Returns
The double value
Exceptions
Exception::MissingInformation  if the value is null and allow_null is false
Exception::InvalidValue  for unsupported column types

◆ getInt64()

static int64_t getInt64 ( const std::shared_ptr< arrow::Array > &  array,
int64_t  row,
int64_t  default_value,
bool  allow_null 
)
static

Read an integer value from an Arrow Array with type coercion.

Supports Int8–Int64, UInt8–UInt64 types.

Parameters
[in]  array  The Arrow Array (may be nullptr if allow_null is true)
[in]  row  Row index
[in]  default_value  Value returned when array is nullptr or the value is null
[in]  allow_null  If true, null values return default_value instead of throwing
Returns
The integer value
Exceptions
Exception::MissingInformation  if the value is null and allow_null is false
Exception::InvalidValue  for unsupported column types

◆ getOptionalColumn()

static std::shared_ptr< arrow::Array > getOptionalColumn ( const std::shared_ptr< arrow::Table > &  table,
const std::string &  name 
)
static

Get an optional column from an Arrow Table by name.

Parameters
[in]  table  The Arrow Table
[in]  name  Column name
Returns
The first chunk of the column, or nullptr if not present
Exceptions
Exception::InvalidValue  if the column exists but has no chunks

◆ getString()

static std::string getString ( const std::shared_ptr< arrow::Array > &  array,
int64_t  row 
)
static

Read a string value from an Arrow Array.

Supports String and LargeString types.

Parameters
[in]  array  The Arrow Array (may be nullptr)
[in]  row  Row index
Returns
The string value, or an empty string if array is nullptr or the value is null
Exceptions
Exception::InvalidValue  for unsupported column types

◆ getStringList()

static std::vector< std::string > getStringList ( const std::shared_ptr< arrow::Array > &  array,
int64_t  row 
)
static

Read a list of strings from an Arrow Array.

Supports String/LargeString (semicolon-delimited) and List/LargeList of strings.

Parameters
[in]  array  The Arrow Array (may be nullptr)
[in]  row  Row index
Returns
Vector of strings (empty if array is nullptr or the value is null)
Exceptions
Exception::InvalidValue  for unsupported column types
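
For String/LargeString columns, the semicolon-delimited case might look like the splitting routine below. This is a minimal sketch, not the OpenMS code: splitSemicolonSketch is a hypothetical name, and the handling of edge cases (an empty cell yields an empty vector, empty items between consecutive semicolons are kept) is an assumption that this page does not specify.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits a semicolon-delimited cell into its items.
// Assumptions: an empty cell yields an empty vector; empty items between
// consecutive semicolons (or after a trailing one) are kept.
inline std::vector<std::string> splitSemicolonSketch(const std::string& cell)
{
  std::vector<std::string> items;
  if (cell.empty()) return items;

  std::string::size_type start = 0;
  while (true)
  {
    const std::string::size_type pos = cell.find(';', start);
    if (pos == std::string::npos)
    {
      // Last item: everything after the final delimiter.
      items.push_back(cell.substr(start));
      break;
    }
    items.push_back(cell.substr(start, pos - start));
    start = pos + 1;
  }
  return items;
}
```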

◆ jsonEscape()

static std::string jsonEscape ( const String &  input )
static

Escape a string for safe embedding into JSON values.

Mirrors the ad-hoc jsonEscape_ implementations used in several Parquet-writing sources so callers can reuse a single canonical implementation.
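
As an illustration of what escaping for JSON typically involves, the sketch below follows the string rules of RFC 8259: quotes, backslashes, and control characters below 0x20 are escaped. The exact set of characters that jsonEscape() handles is not stated on this page, so jsonEscapeSketch is a hypothetical stand-in rather than a description of the OpenMS function.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Escapes a string for use inside a JSON string literal (RFC 8259):
// quotes, backslashes, and control characters below 0x20.
inline std::string jsonEscapeSketch(const std::string& input)
{
  std::string out;
  out.reserve(input.size());
  for (const char ch : input)
  {
    switch (ch)
    {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\b': out += "\\b";  break;
      case '\f': out += "\\f";  break;
      case '\n': out += "\\n";  break;
      case '\r': out += "\\r";  break;
      case '\t': out += "\\t";  break;
      default:
        if (static_cast<unsigned char>(ch) < 0x20)
        {
          // Remaining control characters use the \u00XX form.
          char buf[7];
          std::snprintf(buf, sizeof(buf), "\\u%04x",
                        static_cast<unsigned>(static_cast<unsigned char>(ch)));
          out += buf;
        }
        else
        {
          out += ch;
        }
    }
  }
  return out;
}
```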

◆ readTable() [1/2]

static std::shared_ptr< arrow::Table > readTable ( const std::shared_ptr< arrow::io::RandomAccessFile > &  infile)
static

Read a Parquet file from an Arrow RandomAccessFile into an Arrow Table.

Allows reading Parquet data directly from an in-archive RandomAccessFile (e.g. libzip-backed).

◆ readTable() [2/2]

static std::shared_ptr< arrow::Table > readTable ( const String &  filename )
static

Read a Parquet file into an Arrow Table.

The table is returned with chunks combined into single arrays per column.

Parameters
[in]  filename  Input Parquet file path
Returns
Shared pointer to the Arrow Table
Exceptions
Exception::InvalidValue  if reading fails

◆ rowCount()

static int64_t rowCount ( const String &  filename )
static

Return the number of rows in a Parquet file, read from the file metadata via the low-level Parquet reader. Returns 0 if the file does not exist.

◆ throw_finish_error_()

static void throw_finish_error_ ( const std::string &  name,
const std::string &  error 
)
static private

Internal helper to throw a consistent error from finishArray.

◆ writeTable()

static void writeTable ( const std::shared_ptr< arrow::Table > &  table,
const String &  filename,
int64_t  row_group_size = 262144 
)
static

Write an Arrow Table to a Parquet file.

Parameters
[in]  table  The Arrow Table to write
[in]  filename  Output file path
[in]  row_group_size  Number of rows per row group (default: 262144)
Exceptions
Exception::FileNotWritable  if the file cannot be opened
Exception::InvalidValue  if writing fails
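
To get a feel for row_group_size, the arithmetic below estimates how many row groups a table would occupy. It assumes that row_group_size acts as an upper bound on the rows in each group, with only the last group allowed to be smaller; this page does not state that explicitly. rowGroupCountSketch is a hypothetical helper, not part of ParquetFile.

```cpp
#include <cassert>
#include <cstdint>

// Number of row groups needed for num_rows rows when each group holds at most
// row_group_size rows (ceiling division). Requires row_group_size > 0.
inline std::int64_t rowGroupCountSketch(std::int64_t num_rows,
                                        std::int64_t row_group_size = 262144)
{
  assert(row_group_size > 0);
  if (num_rows <= 0) return 0;
  return (num_rows + row_group_size - 1) / row_group_size;
}
```

Under this assumption, a table of 1,000,000 rows written with the default of 262144 would span 4 row groups: three full groups and one of 213,568 rows.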