1. GC-MS Raw Data Model

1.1. Introduction

PyMassSpec can read gas chromatography-mass spectrometry (GC-MS) data stored in the Analytical Data Interchange for Mass Spectrometry (ANDI-MS) [1] and Joint Committee on Atomic and Molecular Physical Data (JCAMP-DX) [2] formats. The information contained in the data files can vary significantly depending on the instrument, the vendor’s software, or the conversion utility. PyMassSpec makes the following assumptions about the information contained in the data file:

  • The data contain m/z and intensity value pairs for each scan.

  • Each scan has a retention time.

Internally, PyMassSpec stores the raw data from ANDI files or JCAMP files as a GCMS_data object.
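
The two assumptions above can be pictured as a small data structure. The following is only a sketch: SketchScan and SketchRawData are hypothetical names made up for this illustration and are not the PyMassSpec classes, and the values in the second scan are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchScan:
    """Hypothetical stand-in for one scan: paired m/z and intensity values."""

    mass_list: List[float]
    intensity_list: List[float]

    def __post_init__(self) -> None:
        # Assumption 1: every m/z value is paired with exactly one intensity
        if len(self.mass_list) != len(self.intensity_list):
            raise ValueError("each m/z value needs exactly one intensity")


@dataclass
class SketchRawData:
    """Hypothetical stand-in for a raw data set."""

    time_list: List[float]  # retention times in seconds
    scan_list: List[SketchScan]

    def __post_init__(self) -> None:
        # Assumption 2: every scan has exactly one retention time
        if len(self.time_list) != len(self.scan_list):
            raise ValueError("each scan needs exactly one retention time")


raw = SketchRawData(
    time_list=[305.582, 305.958],
    scan_list=[
        SketchScan([50.1, 51.1, 53.1], [22128.0, 10221.0, 31400.0]),
        SketchScan([50.2, 52.1], [1500.0, 870.0]),  # invented values
    ],
)
print(len(raw.scan_list), "scans;", len(raw.scan_list[0].mass_list), "m/z values in the first scan")
```

Note that the scans need not have the same number of m/z values, which is why the raw data are kept as a list of scans rather than as a rectangular matrix.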

1.2. Example: Reading JCAMP GC-MS data

The PyMassSpec module pyms.GCMS.IO.JCAMP provides capabilities to read raw GC-MS data stored in the JCAMP-DX format.

First, set up the paths to the data file and the output directory, then import JCAMP_reader.

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.GCMS.IO.JCAMP import JCAMP_reader

Read the raw JCAMP-DX data.

In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
data
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'
<GCMS_data(305.582 - 4007.722 seconds, time step 0.3753183292781833, 9865 scans)>

1.2.1. A GCMS_data Object

The object data (from the previous example) stores the raw data as a pyms.GCMS.Class.GCMS_data object. Within the GCMS_data object, raw data are stored as a list of pyms.Spectrum.Scan objects and a list of retention times. There are several methods available to access data and attributes of the GCMS_data and Scan objects.

The GCMS_data object’s methods relate to the raw data. The main properties relate to the masses, retention times and scans. For example, the minimum and maximum mass from all of the raw data can be returned by the following:

In [3]:
data.min_mass
50.0
In [4]:
data.max_mass
599.9

A list of the first 10 retention times can be returned with:

In [5]:
data.time_list[:10]
[305.582,
 305.958,
 306.333,
 306.708,
 307.084,
 307.459,
 307.834,
 308.21,
 308.585,
 308.96]

The index of a specific retention time (in seconds) can be returned with:

In [6]:
data.get_index_at_time(400.0)
252

Note that this returns the index of the retention time in the data closest to the given retention time of 400.0 seconds.
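
The idea of a nearest-time lookup can be sketched in plain Python. This is not how PyMassSpec implements get_index_at_time(); nearest_time_index is a hypothetical helper, and the retention times are synthetic, using the same start time as the example and an assumed constant step of 0.375 s.

```python
from bisect import bisect_left
from typing import Sequence


def nearest_time_index(time_list: Sequence[float], target: float) -> int:
    """Return the index of the entry in a sorted time_list closest to target."""
    if not time_list:
        raise ValueError("time_list is empty")
    pos = bisect_left(time_list, target)  # first index with time >= target
    if pos == 0:
        return 0
    if pos == len(time_list):
        return len(time_list) - 1
    before, after = time_list[pos - 1], time_list[pos]
    # Pick whichever neighbour is closer; ties go to the earlier scan
    return pos if (after - target) < (target - before) else pos - 1


# Synthetic retention times in seconds (assumed constant 0.375 s step)
times = [305.582 + 0.375 * i for i in range(1000)]
index = nearest_time_index(times, 400.0)
print(index, round(times[index], 3))
```

With these synthetic times the lookup also lands on index 252, because the neighbouring scan at about 399.707 s is further from 400.0 s than the scan at about 400.082 s.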

The GCMS_data.tic attribute returns a total ion chromatogram (TIC) of the data as an IonChromatogram object:

In [7]:
data.tic
<pyms.IonChromatogram.IonChromatogram at 0x7f6b22ff9d68>

The IonChromatogram object is explained in a later example.
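
A total ion chromatogram holds, for each scan, the sum of all intensities recorded in that scan. The sketch below shows that calculation on plain lists; total_ion_current is a hypothetical helper written for this illustration, not a PyMassSpec function, and the scans are shortened to a few values.

```python
from typing import List, Sequence


def total_ion_current(intensity_lists: Sequence[Sequence[float]]) -> List[float]:
    """Sum the intensities of each scan to get one TIC value per scan."""
    return [float(sum(intensities)) for intensities in intensity_lists]


# Illustrative values: three short scans of different lengths
scan_intensities = [
    [22128.0, 10221.0, 31400.0],
    [27352.0, 65688.0],
    [],  # a scan with no recorded ions contributes zero
]
tic_values = total_ion_current(scan_intensities)
print(tic_values)  # [63749.0, 93040.0, 0.0]
```

Plotting these values against the retention times gives the familiar TIC trace.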

1.2.2. A Scan Object

A pyms.Spectrum.Scan object contains a list of masses and a corresponding list of intensity values from a single mass-spectrum scan in the raw data. Typically only non-zero (or non-threshold) intensities and corresponding masses are stored in the raw data.

A list of the first 10 pyms.Spectrum.Scan objects can be returned with:

In [8]:
scans = data.scan_list
scans[:10]
[<pyms.Spectrum.Scan at 0x7f6b4117a518>,
 <pyms.Spectrum.Scan at 0x7f6b22ff9400>,
 <pyms.Spectrum.Scan at 0x7f6b22ff9dd8>,
 <pyms.Spectrum.Scan at 0x7f6b22ff9e80>,
 <pyms.Spectrum.Scan at 0x7f6b22ff9f28>,
 <pyms.Spectrum.Scan at 0x7f6b22ff9fd0>,
 <pyms.Spectrum.Scan at 0x7f6b22ff9e48>,
 <pyms.Spectrum.Scan at 0x7f6b22ff9668>,
 <pyms.Spectrum.Scan at 0x7f6b22ff9d30>,
 <pyms.Spectrum.Scan at 0x7f6b22ff9cf8>]

A list of the first 10 masses in a scan (e.g. the 1st scan) is returned with:

In [9]:
scans[0].mass_list[:10]
[50.1, 51.1, 53.1, 54.2, 55.1, 56.2, 57.2, 58.2, 59.1, 60.1]

A list of the first 10 corresponding intensities in a scan is returned with:

In [10]:
scans[0].intensity_list[:10]
[22128.0,
 10221.0,
 31400.0,
 27352.0,
 65688.0,
 55416.0,
 75192.0,
 112688.0,
 152256.0,
 21896.0]

The minimum and maximum mass in an individual scan (e.g. the 1st scan) are returned with:

In [11]:
scans[0].min_mass
50.1
In [12]:
scans[0].max_mass
599.4

1.2.3. Exporting data and obtaining information about a data set

It is often useful to find out some basic information about the data set, e.g. the number of scans, the retention time range, the m/z range, and so on. The GCMS_data class provides a method info() that can be used for this purpose.

In [13]:
data.info()
Data retention time range: 5.093 min -- 66.795 min
Time step: 0.375 s (std=0.000 s)
Number of scans: 9865
Minimum m/z measured: 50.000
Maximum m/z measured: 599.900
Mean number of m/z values per scan: 56
Median number of m/z values per scan: 40
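
Figures of this kind can be derived from the retention times and the per-scan mass lists alone. The following sketch uses the standard library; summarise_raw_data is a hypothetical helper, the mass values are invented, and the choice of the population standard deviation for the time step is an assumption rather than a statement about what info() does internally.

```python
import statistics
from typing import Dict, Sequence


def summarise_raw_data(
    time_list: Sequence[float],
    mass_lists: Sequence[Sequence[float]],
) -> Dict[str, float]:
    """Compute summary figures similar to those printed in the example above."""
    if len(time_list) < 2 or len(time_list) != len(mass_lists):
        raise ValueError("need at least two scans and one retention time per scan")
    all_masses = [mass for masses in mass_lists for mass in masses]
    if not all_masses:
        raise ValueError("no m/z values recorded")
    # Differences between consecutive retention times, in seconds
    steps = [later - earlier for earlier, later in zip(time_list, time_list[1:])]
    counts = [len(masses) for masses in mass_lists]
    return {
        "start_min": time_list[0] / 60.0,
        "end_min": time_list[-1] / 60.0,
        "time_step_s": statistics.mean(steps),
        "time_step_std_s": statistics.pstdev(steps),  # assumed: population std
        "n_scans": len(mass_lists),
        "min_mass": min(all_masses),
        "max_mass": max(all_masses),
        "mean_values_per_scan": statistics.mean(counts),
        "median_values_per_scan": statistics.median(counts),
    }


# First four retention times from the example; mass values are invented
times = [305.582, 305.958, 306.333, 306.708]
mass_lists = [
    [50.1, 51.1, 53.1],
    [50.0, 52.1],
    [50.2, 51.0, 55.1, 599.9],
    [51.1, 53.2, 60.1],
]
summary = summarise_raw_data(times, mass_lists)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The conversion from seconds to minutes explains why the 305.582 s start time of the example appears as 5.093 min in the output of info().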

The entire raw data of a GCMS_data object can be exported to a file with the method write():

In [14]:
data.write(output_directory / "data")
-> Writing intensities to '/home/vagrant/PyMassSpec/pyms-demo/jupyter/output/data.I.csv'
-> Writing m/z values to '/home/vagrant/PyMassSpec/pyms-demo/jupyter/output/data.mz.csv'

This method takes the filename (“output/data”, in this example) and writes two CSV files. One has the extension “.I.csv” and contains the intensities (“output/data.I.csv” in this example), and the other has the extension “.mz.csv” and contains the corresponding table of m/z values (“output/data.mz.csv” in this example). In general, these are not two-dimensional matrices, because different scans may have different numbers of m/z values recorded.
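
To illustrate why such files are not rectangular, the sketch below writes one CSV row per scan with the standard csv module. It only mimics the general idea; rows_to_csv is a hypothetical helper, the values are partly invented, and the exact layout produced by write() may differ.

```python
import csv
import io
from typing import Sequence


def rows_to_csv(rows: Sequence[Sequence[float]]) -> str:
    """Serialise one row per scan; rows may have different lengths."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Two scans with different numbers of recorded m/z values (illustrative)
mass_rows = [[50.1, 51.1, 53.1], [50.2, 52.1]]
intensity_rows = [[22128.0, 10221.0, 31400.0], [1500.0, 870.0]]

mz_text = rows_to_csv(mass_rows)
intensity_text = rows_to_csv(intensity_rows)
print(mz_text)
print(intensity_text)
```

The first row has three fields and the second only two, so a reader of these files should parse them row by row rather than load them as a fixed-width array.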

Note

This example is in pyms-demo/jupyter/reading_jcamp.ipynb. There is also an example in that directory for reading ANDI-MS files.

1.3. Example: Comparing two GC-MS data sets

Occasionally it is useful to compare two data sets. For example, one may want to check the consistency between a data set exported in netCDF format from the manufacturer’s software and the same data set exported in JCAMP format from third-party software.

First, set up the paths to the data files and the output directory, then import JCAMP_reader and ANDI_reader.

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.GCMS.IO.JCAMP import JCAMP_reader
from pyms.GCMS.IO.ANDI import ANDI_reader

Then read the raw data as before.

In [2]:
andi_file = data_directory / "gc01_0812_066.cdf"
data1 = ANDI_reader(andi_file)
data1
-> Reading netCDF file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.cdf'
<GCMS_data(305.582 - 4007.721 seconds, time step 0.37531822789943226, 9865 scans)>
In [3]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data2 = JCAMP_reader(jcamp_file)
data2
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'
<GCMS_data(305.582 - 4007.722 seconds, time step 0.3753183292781833, 9865 scans)>

To compare the two data sets, use the function diff():

In [4]:
from pyms.GCMS.Function import diff

diff(data1, data2)
Data sets have the same number of time points.
  Time RMSD: 3.54e-04
Checking for consistency in scan lengths ...OK
Calculating maximum RMSD for m/z values and intensities ...
  Max m/z RMSD: 1.03e-05
  Max intensity RMSD: 0.00e+00

If the data cannot be compared, for example because the two data sets have different numbers of scans, or because corresponding scans have different numbers of m/z values, diff() will report the difference. For example:

In [5]:
data2.trim(begin=1000, end=2000)
Trimming data to between 1000 and 2001 scans
In [6]:
diff(data1, data2)
The number of retention time points differ.
   First data set: 9865 time points
   Second data set: 1002 time points
Data sets are different.
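
The comparison logic can be sketched as a length check followed by a root-mean-square deviation (RMSD). This is an illustration under stated assumptions, not the implementation of diff(): rmsd and compare_times are hypothetical helpers, and the standard RMSD definition used here is assumed to match what diff() reports.

```python
import math
from typing import Optional, Sequence


def rmsd(first: Sequence[float], second: Sequence[float]) -> float:
    """Root-mean-square deviation between two equally long sequences."""
    if len(first) != len(second):
        raise ValueError("sequences must have the same length")
    if not first:
        raise ValueError("sequences must not be empty")
    squared = sum((a - b) ** 2 for a, b in zip(first, second))
    return math.sqrt(squared / len(first))


def compare_times(times1: Sequence[float], times2: Sequence[float]) -> Optional[float]:
    """Return the time RMSD, or None if the number of time points differs."""
    if len(times1) != len(times2):
        print("The number of retention time points differ.")
        print(f"   First data set: {len(times1)} time points")
        print(f"   Second data set: {len(times2)} time points")
        return None
    return rmsd(times1, times2)


# Illustrative retention times: the second set is shifted by 1 ms in one scan
times_a = [305.582, 305.958, 306.333]
times_b = [305.582, 305.958, 306.334]
print(f"Time RMSD: {compare_times(times_a, times_b):.2e}")
print(compare_times(times_a, times_b[:2]))
```

An RMSD close to zero indicates that the two lists agree to within rounding, while the length check catches cases such as the trimmed data set above before any numerical comparison is attempted.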

Note

This example is in pyms-demo/jupyter/comparing_datasets.ipynb.

Footnotes

1

ANDI-MS was developed by the Analytical Instrument Association.

2

JCAMP-DX is maintained by the International Union of Pure and Applied Chemistry.