OpenMS
ProteinIdentificationArrowIO Class Reference

Import and export ProteinIdentification data to/from Apache Arrow format.

#include <OpenMS/FORMAT/ProteinIdentificationArrowIO.h>

Static Public Member Functions

static std::shared_ptr< arrow::Table > exportProteinsToArrow (const std::vector< ProteinIdentification > &protein_identifications)
 Export protein hits to Apache Arrow Table.
 
static bool exportProteinsToParquet (const std::vector< ProteinIdentification > &protein_identifications, const std::string &filename, const ParquetWriteConfig &config=ParquetWriteConfig{})
 Export protein hits to Parquet file.
 
static std::shared_ptr< arrow::Table > exportProteinGroupsToArrow (const std::vector< ProteinIdentification > &protein_identifications)
 Export protein groups to Apache Arrow Table.
 
static bool exportProteinGroupsToParquet (const std::vector< ProteinIdentification > &protein_identifications, const std::string &filename, const ParquetWriteConfig &config=ParquetWriteConfig{})
 Export protein groups to Parquet file.
 
static std::shared_ptr< arrow::Table > exportSearchParamsToArrow (const std::vector< ProteinIdentification > &protein_identifications)
 Export search parameters to Apache Arrow Table.
 
static bool exportSearchParamsToParquet (const std::vector< ProteinIdentification > &protein_identifications, const std::string &filename, const ParquetWriteConfig &config=ParquetWriteConfig{})
 Export search parameters to Parquet file.
 
static bool importFromParquet (const std::string &proteins_filename, const std::string &protein_groups_filename, const std::string &search_params_filename, std::vector< ProteinIdentification > &protein_identifications)
 Import all three Parquet files and reconstruct ProteinIdentifications.
 
static bool importSearchParamsFromArrow (const std::shared_ptr< arrow::Table > &table, std::vector< ProteinIdentification > &protein_identifications)
 Import search parameters from Arrow Table.
 
static bool importProteinsFromArrow (const std::shared_ptr< arrow::Table > &table, std::vector< ProteinIdentification > &protein_identifications)
 Import protein hits from Arrow Table.
 
static bool importProteinGroupsFromArrow (const std::shared_ptr< arrow::Table > &table, std::vector< ProteinIdentification > &protein_identifications)
 Import protein groups from Arrow Table.
 
static bool importSearchParamsFromParquet (const std::string &filename, std::vector< ProteinIdentification > &protein_identifications)
 Import search parameters from Parquet file.
 
static bool importProteinsFromParquet (const std::string &filename, std::vector< ProteinIdentification > &protein_identifications)
 Import protein hits from Parquet file.
 
static bool importProteinGroupsFromParquet (const std::string &filename, std::vector< ProteinIdentification > &protein_identifications)
 Import protein groups from Parquet file.
 
static std::map< std::string, std::string > synthesizeRunIdentifiers (std::vector< ProteinIdentification > &protein_identifications)
 Synthesize fresh run identifiers per ProteinIdentification, mirroring IdXMLFile.
 
static void applyRunIdentifierRename (const std::map< std::string, std::string > &rename, PeptideIdentificationList &pep_ids)
 Apply a stored->synthesized identifier rename to a PeptideIdentification collection.
 
static void checkUniqueIdentifiers (const std::vector< ProteinIdentification > &protein_identifications)
 Reject a ProteinIdentification vector with duplicate identifiers (store-side check).
 

Detailed Description

Import and export ProteinIdentification data to/from Apache Arrow format.

This class provides static methods to export and import ProteinIdentification data to/from Apache Arrow Tables and Parquet files. Separate tables are provided for protein hits, protein groups, and search parameters.

Experimental classes:
This API is experimental and may change in future versions.

Member Function Documentation

◆ applyRunIdentifierRename()

static void applyRunIdentifierRename ( const std::map< std::string, std::string > &  rename,
PeptideIdentificationList &  pep_ids 
)
static

Apply a stored->synthesized identifier rename to a PeptideIdentification collection.

PeptideIdentifications whose stored identifier isn't present in rename are left untouched. Mirrors the orphan-pep_id semantics of the XML lane (where pep_ids that don't match any ProtID retain their stale identifier).

Parameters
[in]  rename  Map produced by synthesizeRunIdentifiers
[in,out]  pep_ids  PeptideIdentification collection to re-stamp
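
The orphan semantics described above can be illustrated with a minimal sketch. It does not use OpenMS types: PepIdStub and applyRenameSketch are hypothetical stand-ins introduced only for this example, and the real function operates on a PeptideIdentificationList.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a PeptideIdentification; only the run
// identifier matters for this sketch.
struct PepIdStub
{
  std::string identifier;
};

// Re-stamp every entry whose stored identifier appears in `rename`.
// Entries without a match (orphans) keep their stale identifier.
void applyRenameSketch(const std::map<std::string, std::string>& rename,
                       std::vector<PepIdStub>& pep_ids)
{
  for (PepIdStub& pep : pep_ids)
  {
    const auto it = rename.find(pep.identifier);
    if (it != rename.end())
    {
      pep.identifier = it->second;
    }
  }
}
```

Each entry is looked up once against its stored identifier, so a synthesized identifier that happens to equal another stored key is not renamed a second time.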

◆ checkUniqueIdentifiers()

static void checkUniqueIdentifiers ( const std::vector< ProteinIdentification > &  protein_identifications)
static

Reject a ProteinIdentification vector with duplicate identifiers (store-side check).

Mirrors XMLHandler::checkUniqueIdentifiers_ and throws Exception::InvalidValue with the same message text used by the XML lane's fatalError. Every Parquet store entry point calls it before any Arrow builder is allocated, so no partial file is created on rejection.

Parameters
[in]  protein_identifications  ProtID vector to check
Exceptions
Exception::InvalidValue  when duplicate identifiers are present
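
The validate-before-write ordering can be sketched as follows. This is an illustration under stated assumptions, not the OpenMS implementation: std::invalid_argument stands in for Exception::InvalidValue, plain strings stand in for ProteinIdentification objects, and checkUniqueSketch and storeSketch are hypothetical names.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Throws std::invalid_argument (stand-in for Exception::InvalidValue)
// on the first repeated identifier.
void checkUniqueSketch(const std::vector<std::string>& identifiers)
{
  std::set<std::string> seen;
  for (const std::string& id : identifiers)
  {
    if (!seen.insert(id).second)
    {
      throw std::invalid_argument("duplicate identifier: " + id);
    }
  }
}

// Hypothetical store entry point: validate first, write afterwards.
bool storeSketch(const std::vector<std::string>& identifiers,
                 std::vector<std::string>& written)
{
  checkUniqueSketch(identifiers); // may throw; `written` is still untouched
  written = identifiers;          // stands in for building and writing the table
  return true;
}
```

Because the check runs before any output is produced, a rejected input leaves the destination exactly as it was.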

◆ exportProteinGroupsToArrow()

static std::shared_ptr< arrow::Table > exportProteinGroupsToArrow ( const std::vector< ProteinIdentification > &  protein_identifications)
static

Export protein groups to Apache Arrow Table.

Each ProteinGroup becomes one row with group probability and member accessions.

Parameters
[in]  protein_identifications  Vector of protein identifications
Returns
Shared pointer to Arrow Table, or nullptr on error
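
The one-row-per-group layout can be pictured with a small sketch. The struct and column names below (GroupStub, ProtIdGroupsStub, GroupRow, run_identifier) are assumptions made for illustration; the actual Arrow schema is not specified on this page.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for a protein group and its owning identification run.
struct GroupStub
{
  double probability;
  std::vector<std::string> accessions;
};

struct ProtIdGroupsStub
{
  std::string identifier;
  std::vector<GroupStub> groups;
};

// One flattened row per group, tagged with the run it came from.
struct GroupRow
{
  std::string run_identifier;
  double probability;
  std::vector<std::string> accessions;
};

std::vector<GroupRow> flattenGroupsSketch(const std::vector<ProtIdGroupsStub>& prot_ids)
{
  std::vector<GroupRow> rows;
  for (const ProtIdGroupsStub& prot_id : prot_ids)
  {
    for (const GroupStub& group : prot_id.groups)
    {
      rows.push_back(GroupRow{prot_id.identifier, group.probability, group.accessions});
    }
  }
  return rows;
}
```

A run without groups contributes no rows, so the row count equals the total number of groups across all runs.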

◆ exportProteinGroupsToParquet()

static bool exportProteinGroupsToParquet ( const std::vector< ProteinIdentification > &  protein_identifications,
const std::string &  filename,
const ParquetWriteConfig &  config = ParquetWriteConfig{} 
)
static

Export protein groups to Parquet file.

Parameters
[in]  protein_identifications  Vector of protein identifications
[in]  filename  Output file path
[in]  config  Parquet writing options
Returns
true on success, false on error

◆ exportProteinsToArrow()

static std::shared_ptr< arrow::Table > exportProteinsToArrow ( const std::vector< ProteinIdentification > &  protein_identifications)
static

Export protein hits to Apache Arrow Table.

Each ProteinHit becomes one row with identification, score, and metadata.

Parameters
[in]  protein_identifications  Vector of protein identifications
Returns
Shared pointer to Arrow Table, or nullptr on error

◆ exportProteinsToParquet()

static bool exportProteinsToParquet ( const std::vector< ProteinIdentification > &  protein_identifications,
const std::string &  filename,
const ParquetWriteConfig &  config = ParquetWriteConfig{} 
)
static

Export protein hits to Parquet file.

Parameters
[in]  protein_identifications  Vector of protein identifications
[in]  filename  Output file path
[in]  config  Parquet writing options
Returns
true on success, false on error

◆ exportSearchParamsToArrow()

static std::shared_ptr< arrow::Table > exportSearchParamsToArrow ( const std::vector< ProteinIdentification > &  protein_identifications)
static

Export search parameters to Apache Arrow Table.

Each ProteinIdentification's SearchParameters becomes one row.

Parameters
[in]  protein_identifications  Vector of protein identifications
Returns
Shared pointer to Arrow Table, or nullptr on error

◆ exportSearchParamsToParquet()

static bool exportSearchParamsToParquet ( const std::vector< ProteinIdentification > &  protein_identifications,
const std::string &  filename,
const ParquetWriteConfig &  config = ParquetWriteConfig{} 
)
static

Export search parameters to Parquet file.

Parameters
[in]  protein_identifications  Vector of protein identifications
[in]  filename  Output file path
[in]  config  Parquet writing options
Returns
true on success, false on error

◆ importFromParquet()

static bool importFromParquet ( const std::string &  proteins_filename,
const std::string &  protein_groups_filename,
const std::string &  search_params_filename,
std::vector< ProteinIdentification > &  protein_identifications 
)
static

Import all three Parquet files and reconstruct ProteinIdentifications.

Reads the three Parquet files and reconstructs a vector of ProteinIdentification objects with hits, groups, and search parameters.

Parameters
[in]  proteins_filename  Path to proteins Parquet file
[in]  protein_groups_filename  Path to protein groups Parquet file
[in]  search_params_filename  Path to search params Parquet file
[out]  protein_identifications  Reconstructed protein identifications
Returns
true on success, false on error

◆ importProteinGroupsFromArrow()

static bool importProteinGroupsFromArrow ( const std::shared_ptr< arrow::Table > &  table,
std::vector< ProteinIdentification > &  protein_identifications 
)
static

Import protein groups from Arrow Table.

Adds ProteinGroups and IndistinguishableProteins to matching ProteinIdentifications by run_identifier.

Parameters
[in]  table  Arrow Table with protein groups
[out]  protein_identifications  Protein identifications to populate
Returns
true on success, false on error

◆ importProteinGroupsFromParquet()

static bool importProteinGroupsFromParquet ( const std::string &  filename,
std::vector< ProteinIdentification > &  protein_identifications 
)
static

Import protein groups from Parquet file.

Parameters
[in]  filename  Path to Parquet file
[out]  protein_identifications  Protein identifications to populate
Returns
true on success, false on error

◆ importProteinsFromArrow()

static bool importProteinsFromArrow ( const std::shared_ptr< arrow::Table > &  table,
std::vector< ProteinIdentification > &  protein_identifications 
)
static

Import protein hits from Arrow Table.

Adds ProteinHits to the ProteinIdentification whose run_identifier matches. If no matching ProteinIdentification exists, a new one is created.

Parameters
[in]  table  Arrow Table with protein hits
[out]  protein_identifications  Protein identifications to populate
Returns
true on success, false on error
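
The match-or-create behaviour can be sketched without Arrow or OpenMS types. HitRow, RunHitsStub, and attachHitsSketch are hypothetical names, and the two-column row is an assumption; the real table carries more columns than shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one row of the protein hits table.
struct HitRow
{
  std::string run_identifier;
  std::string accession;
};

// Hypothetical stand-in for a ProteinIdentification and its hits.
struct RunHitsStub
{
  std::string identifier;
  std::vector<std::string> accessions;
};

// Attach each row to the run with the same identifier, creating the run
// on first sight if it does not exist yet.
void attachHitsSketch(const std::vector<HitRow>& rows, std::vector<RunHitsStub>& runs)
{
  // Index by position rather than by pointer: push_back may reallocate `runs`.
  std::unordered_map<std::string, std::size_t> index;
  for (std::size_t i = 0; i < runs.size(); ++i)
  {
    index.emplace(runs[i].identifier, i);
  }
  for (const HitRow& row : rows)
  {
    auto it = index.find(row.run_identifier);
    if (it == index.end())
    {
      runs.push_back(RunHitsStub{row.run_identifier, {}});
      it = index.emplace(row.run_identifier, runs.size() - 1).first;
    }
    runs[it->second].accessions.push_back(row.accession);
  }
}
```

Under this scheme, runs already present in the vector keep their position, and newly created runs are appended in the order their identifiers first appear in the rows.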

◆ importProteinsFromParquet()

static bool importProteinsFromParquet ( const std::string &  filename,
std::vector< ProteinIdentification > &  protein_identifications 
)
static

Import protein hits from Parquet file.

Parameters
[in]  filename  Path to Parquet file
[out]  protein_identifications  Protein identifications to populate
Returns
true on success, false on error

◆ importSearchParamsFromArrow()

static bool importSearchParamsFromArrow ( const std::shared_ptr< arrow::Table > &  table,
std::vector< ProteinIdentification > &  protein_identifications 
)
static

Import search parameters from Arrow Table.

Each row becomes a ProteinIdentification shell with run-level metadata and SearchParameters populated.

Parameters
[in]  table  Arrow Table with search parameters
[out]  protein_identifications  Reconstructed protein identifications
Returns
true on success, false on error

◆ importSearchParamsFromParquet()

static bool importSearchParamsFromParquet ( const std::string &  filename,
std::vector< ProteinIdentification > &  protein_identifications 
)
static

Import search parameters from Parquet file.

Parameters
[in]  filename  Path to Parquet file
[out]  protein_identifications  Reconstructed protein identifications
Returns
true on success, false on error

◆ synthesizeRunIdentifiers()

static std::map< std::string, std::string > synthesizeRunIdentifiers ( std::vector< ProteinIdentification > &  protein_identifications)
static

Synthesize fresh run identifiers per ProteinIdentification, mirroring IdXMLFile.

Mirrors IdXMLFile.cpp:530: every load assigns each ProteinIdentification a fresh identifier <search_engine>_<date>_<UniqueIdGenerator>. The stored identifier on disk is informational; the in-memory identifier is regenerated. This is the same defense FeatureXMLHandler / ConsensusXMLHandler / IdXMLFile apply on load against downstream-collision-after-rip-and-merge scenarios.

The function mutates protein_identifications in place and returns the map { stored_id -> synthesized_id } so the caller can apply the same rename to each PeptideIdentification collection it owns (FeatureMap has 2: per-feature and unassigned; ConsensusMap has 2; PSMArrowIO has 1).

Edge cases:

  • empty getSearchEngine() falls back to literal "unknown"
  • invalid getDateTime() uses placeholder "1900-01-01T00:00:00"
  • multiple ProtIDs sharing one stored identifier each receive their own distinct synthesized identifier; the returned map collapses to the last-seen entry. An OPENMS_LOG_WARN is emitted once per such collision.
Parameters
[in,out]  protein_identifications  ProtID vector whose identifiers will be replaced
Returns
Map from each stored identifier to its synthesized replacement.
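
The edge cases listed above can be traced in a minimal sketch. It makes several assumptions: ProtIdStub is a hypothetical stand-in for ProteinIdentification, a static counter stands in for UniqueIdGenerator, an empty date string stands in for an invalid getDateTime(), and std::cerr stands in for OPENMS_LOG_WARN.

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a ProteinIdentification.
struct ProtIdStub
{
  std::string identifier;
  std::string search_engine;
  std::string date; // empty means "invalid" in this sketch
};

// Replace each identifier in place and return { stored_id -> synthesized_id }.
std::map<std::string, std::string> synthesizeSketch(std::vector<ProtIdStub>& prot_ids)
{
  static unsigned long counter = 0; // stand-in for UniqueIdGenerator
  std::map<std::string, std::string> rename;
  for (ProtIdStub& prot_id : prot_ids)
  {
    const std::string engine = prot_id.search_engine.empty() ? "unknown" : prot_id.search_engine;
    const std::string date = prot_id.date.empty() ? "1900-01-01T00:00:00" : prot_id.date;
    const std::string fresh = engine + "_" + date + "_" + std::to_string(++counter);
    if (rename.count(prot_id.identifier) > 0)
    {
      // Stand-in for the warning emitted when stored identifiers collide.
      std::cerr << "warning: stored identifier '" << prot_id.identifier
                << "' is shared by several runs\n";
    }
    rename[prot_id.identifier] = fresh; // the last-seen entry wins on collision
    prot_id.identifier = fresh;
  }
  return rename;
}
```

Two runs that shared a stored identifier end up with distinct in-memory identifiers, while the returned map holds a single entry for that stored identifier, pointing at the run seen last.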