1. GC-MS Raw Data Model
1.1. Introduction
PyMassSpec can read gas chromatography-mass spectrometry (GC-MS) data stored in the Analytical Data Interchange for Mass Spectrometry (ANDI-MS) [1] and Joint Committee on Atomic and Molecular Physical Data (JCAMP-DX) [2] formats. The information contained in the data files can vary significantly depending on the instrument, the vendor’s software, or the conversion utility. PyMassSpec makes the following assumptions about the information contained in the data file:
The data contain the m/z and intensity value pairs across a scan.
Each scan has a retention time.
Internally, PyMassSpec stores the raw data from ANDI files or JCAMP files as a GCMS_data object.
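The two assumptions above can be pictured with a minimal sketch. The classes SketchScan and SketchGCMSData are hypothetical stand-ins written for illustration only; they are not PyMassSpec's own Scan and GCMS_data classes. The values in the second scan are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchScan:
    """Hypothetical stand-in for one scan: paired m/z and intensity values."""

    mass_list: List[float]
    intensity_list: List[float]

    def __post_init__(self) -> None:
        # Assumption 1: every m/z value is paired with one intensity value.
        if len(self.mass_list) != len(self.intensity_list):
            raise ValueError("mass_list and intensity_list must have equal length")


@dataclass
class SketchGCMSData:
    """Hypothetical stand-in for the raw data: one retention time per scan."""

    time_list: List[float]
    scan_list: List[SketchScan]

    def __post_init__(self) -> None:
        # Assumption 2: each scan has a retention time.
        if len(self.time_list) != len(self.scan_list):
            raise ValueError("time_list and scan_list must have equal length")


# Two scans with different numbers of m/z values (second scan is made up).
sketch_data = SketchGCMSData(
    time_list=[305.582, 305.958],
    scan_list=[
        SketchScan([50.1, 51.1, 53.1], [22128.0, 10221.0, 31400.0]),
        SketchScan([50.2, 52.1], [1500.0, 870.0]),
    ],
)
```

Scans are kept as separate objects rather than rows of a matrix because, as noted later in this section, different scans may record different numbers of m/z values.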
1.2. Example: Reading JCAMP GC-MS data
The PyMassSpec package pyms.GCMS.IO.JCAMP provides capabilities to read the raw GC-MS data stored in the JCAMP-DX format.
First, set up the paths to the data file and the output directory, then import JCAMP_reader.
In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location
output_directory = pathlib.Path(".").resolve() / "output"
from pyms.GCMS.IO.JCAMP import JCAMP_reader
Read the raw JCAMP-DX data.
In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
data
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'
<GCMS_data(305.582 - 4007.722 seconds, time step 0.3753183292781833, 9865 scans)>
1.2.1. A GCMS_data Object
The object data (from the previous example) stores the raw data as a pyms.GCMS.Class.GCMS_data object. Within the GCMS_data object, raw data are stored as a list of pyms.Spectrum.Scan objects and a list of retention times. Several methods and properties are available to access the data and attributes of the GCMS_data and Scan objects.
The GCMS_data object’s methods relate to the raw data. The main properties relate to the masses, retention times and scans. For example, the minimum and maximum mass across all of the raw data can be returned with the following:
In [3]:
data.min_mass
50.0
In [4]:
data.max_mass
599.9
A list of the first 10 retention times can be returned with:
In [5]:
data.time_list[:10]
[305.582,
305.958,
306.333,
306.708,
307.084,
307.459,
307.834,
308.21,
308.585,
308.96]
The index of a specific retention time (in seconds) can be returned with:
In [6]:
data.get_index_at_time(400.0)
252
Note that this returns the index of the retention time in the data closest to the given retention time of 400.0 seconds.
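The closest-time lookup described above can be sketched in plain Python. This is an illustration of the idea only, not PyMassSpec's implementation; the function nearest_time_index is hypothetical, and the retention times are synthetic, generated from a start time and step rounded from the example data.

```python
from bisect import bisect_left
from typing import Sequence


def nearest_time_index(time_list: Sequence[float], time: float) -> int:
    """Return the index of the entry in a sorted time_list closest to time."""
    if not time_list:
        raise ValueError("time_list is empty")
    pos = bisect_left(time_list, time)
    if pos == 0:
        return 0
    if pos == len(time_list):
        return len(time_list) - 1
    # Pick whichever neighbour is closer to the requested time.
    before, after = time_list[pos - 1], time_list[pos]
    return pos if (after - time) < (time - before) else pos - 1


# Synthetic retention times (assumed start of 305.582 s, step of 0.3753 s).
times = [305.582 + 0.3753 * i for i in range(9865)]
index = nearest_time_index(times, 400.0)
```

With these synthetic times, index is 252, which agrees with the result shown above. Times outside the recorded range map to the first or last index.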
The GCMS_data.tic attribute returns a total ion chromatogram (TIC) of the data as an IonChromatogram object:
In [7]:
data.tic
<pyms.IonChromatogram.IonChromatogram at 0x7f6b22ff9d68>
The IonChromatogram object is explained in a later example.
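As a reminder of what a TIC represents, each TIC value is the sum of all intensities recorded in one scan. The sketch below shows this with plain lists; the function total_ion_chromatogram is hypothetical and is not how PyMassSpec builds its IonChromatogram object. The second scan's intensities are invented.

```python
from typing import List, Sequence


def total_ion_chromatogram(intensity_lists: Sequence[Sequence[float]]) -> List[float]:
    """Sum the intensities of each scan to get one TIC value per scan."""
    return [sum(intensities) for intensities in intensity_lists]


# Two scans with different numbers of recorded intensities.
example_intensities = [
    [22128.0, 10221.0, 31400.0],
    [1500.0, 870.0],
]
tic_values = total_ion_chromatogram(example_intensities)
```

Here tic_values holds one number per scan, so it can be plotted directly against the list of retention times.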
1.2.2. A Scan Object
A pyms.Spectrum.Scan object contains a list of masses and a corresponding list of intensity values from a single mass-spectrum scan in the raw data. Typically only non-zero (or non-threshold) intensities and corresponding masses are stored in the raw data.
A list of the first 10 pyms.Spectrum.Scan objects can be returned with:
In [8]:
scans = data.scan_list
scans[:10]
[<pyms.Spectrum.Scan at 0x7f6b4117a518>,
<pyms.Spectrum.Scan at 0x7f6b22ff9400>,
<pyms.Spectrum.Scan at 0x7f6b22ff9dd8>,
<pyms.Spectrum.Scan at 0x7f6b22ff9e80>,
<pyms.Spectrum.Scan at 0x7f6b22ff9f28>,
<pyms.Spectrum.Scan at 0x7f6b22ff9fd0>,
<pyms.Spectrum.Scan at 0x7f6b22ff9e48>,
<pyms.Spectrum.Scan at 0x7f6b22ff9668>,
<pyms.Spectrum.Scan at 0x7f6b22ff9d30>,
<pyms.Spectrum.Scan at 0x7f6b22ff9cf8>]
A list of the first 10 masses in a scan (e.g. the 1st scan) is returned with:
In [9]:
scans[0].mass_list[:10]
[50.1, 51.1, 53.1, 54.2, 55.1, 56.2, 57.2, 58.2, 59.1, 60.1]
A list of the first 10 corresponding intensities in a scan is returned with:
In [10]:
scans[0].intensity_list[:10]
[22128.0,
10221.0,
31400.0,
27352.0,
65688.0,
55416.0,
75192.0,
112688.0,
152256.0,
21896.0]
The minimum and maximum mass in an individual scan (e.g. the 1st scan) are returned with:
In [11]:
scans[0].min_mass
50.1
In [12]:
scans[0].max_mass
599.4
1.2.3. Exporting data and obtaining information about a data set
Often it is of interest to find out some basic information about the data set, e.g. the number of scans, the retention time range, and the m/z range. The GCMS_data class provides a method info() that can be used for this purpose.
In [13]:
data.info()
Data retention time range: 5.093 min -- 66.795 min
Time step: 0.375 s (std=0.000 s)
Number of scans: 9865
Minimum m/z measured: 50.000
Maximum m/z measured: 599.900
Mean number of m/z values per scan: 56
Median number of m/z values per scan: 40
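The figures in this report can be derived from the retention times and the per-scan mass lists alone. The following sketch shows one way to compute comparable numbers; the function summarise and its dictionary keys are hypothetical, the input data are invented, and the use of the population standard deviation for the time step is an assumption rather than something the source states.

```python
from statistics import mean, median, pstdev
from typing import Dict, Sequence


def summarise(
    time_list: Sequence[float],
    mass_lists: Sequence[Sequence[float]],
) -> Dict[str, float]:
    """Compute summary figures similar to those printed by info()."""
    # Differences between consecutive retention times, in seconds.
    steps = [later - earlier for earlier, later in zip(time_list, time_list[1:])]
    # Number of m/z values recorded in each scan.
    counts = [len(masses) for masses in mass_lists]
    return {
        "start_min": time_list[0] / 60.0,
        "end_min": time_list[-1] / 60.0,
        "time_step_s": mean(steps),
        "time_step_std_s": pstdev(steps),  # assumption: population std
        "n_scans": len(mass_lists),
        "min_mz": min(min(masses) for masses in mass_lists if masses),
        "max_mz": max(max(masses) for masses in mass_lists if masses),
        "mean_n_mz": mean(counts),
        "median_n_mz": median(counts),
    }


# Three invented scans with 3, 1 and 2 m/z values respectively.
summary = summarise(
    [300.0, 300.5, 301.0],
    [[50.0, 51.0, 52.0], [50.5], [60.0, 61.0]],
)
```

Retention times are stored in seconds, so the range is divided by 60 to express it in minutes, as in the report above.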
The entire raw data of a GCMS_data object can be exported to a file with the method write():
In [14]:
data.write(output_directory / "data")
-> Writing intensities to '/home/vagrant/PyMassSpec/pyms-demo/jupyter/output/data.I.csv'
-> Writing m/z values to '/home/vagrant/PyMassSpec/pyms-demo/jupyter/output/data.mz.csv'
This method takes the filename (“output/data” in this example) and writes two CSV files. One has the extension “.I.csv” and contains the intensities (“output/data.I.csv” in this example); the other has the extension “.mz.csv” and contains the corresponding table of m/z values (“output/data.mz.csv” in this example). In general, these are not two-dimensional matrices, because different scans may have different numbers of m/z values recorded.
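Because the rows can differ in length, such files are easier to handle line by line than as a rectangular array. The sketch below writes and reads back a pair of ragged CSV files using only the standard library. It assumes one scan per line, which the text above does not state explicitly; the helper functions are hypothetical and the data are invented.

```python
import csv
import tempfile
from pathlib import Path
from typing import List, Sequence


def write_ragged_csv(path: Path, rows: Sequence[Sequence[float]]) -> None:
    """Write one scan per line; lines may differ in length."""
    with path.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_ragged_csv(path: Path) -> List[List[float]]:
    """Read the file back as a list of per-scan lists of floats."""
    with path.open(newline="") as handle:
        return [[float(value) for value in row] for row in csv.reader(handle)]


mass_rows = [[50.1, 51.1, 53.1], [50.2, 52.1]]
intensity_rows = [[22128.0, 10221.0, 31400.0], [1500.0, 870.0]]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    # Mirror the two extensions described above.
    mz_path = Path(f"{root}.mz.csv")
    i_path = Path(f"{root}.I.csv")
    write_ragged_csv(mz_path, mass_rows)
    write_ragged_csv(i_path, intensity_rows)
    masses_back = read_ragged_csv(mz_path)
    intensities_back = read_ragged_csv(i_path)
```

Reading each line into its own list preserves the pairing between the two files: the n-th line of the m/z file corresponds to the n-th line of the intensity file.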
Note
This example is in pyms-demo/jupyter/reading_jcamp.ipynb. There is also an example in that directory for reading ANDI-MS files.
1.3. Example: Comparing two GC-MS data sets
Occasionally it is useful to compare two data sets. For example, one may want to check the consistency between the data set exported in netCDF format from the manufacturer’s software, and the JCAMP format exported from third-party software.
First, set up the paths to the data files and the output directory, then import JCAMP_reader and ANDI_reader.
In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location
output_directory = pathlib.Path(".").resolve() / "output"
from pyms.GCMS.IO.JCAMP import JCAMP_reader
from pyms.GCMS.IO.ANDI import ANDI_reader
Then the raw data is read as before.
In [2]:
andi_file = data_directory / "gc01_0812_066.cdf"
data1 = ANDI_reader(andi_file)
data1
-> Reading netCDF file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.cdf'
<GCMS_data(305.582 - 4007.721 seconds, time step 0.37531822789943226, 9865 scans)>
In [3]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data2 = JCAMP_reader(jcamp_file)
data2
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'
<GCMS_data(305.582 - 4007.722 seconds, time step 0.3753183292781833, 9865 scans)>
To compare the two data sets, use the function diff():
In [4]:
from pyms.GCMS.Function import diff
diff(data1, data2)
Data sets have the same number of time points.
Time RMSD: 3.54e-04
Checking for consistency in scan lengths ...OK
Calculating maximum RMSD for m/z values and intensities ...
Max m/z RMSD: 1.03e-05
Max intensity RMSD: 0.00e+00
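The RMSD values in this report are root-mean-square deviations between corresponding values in the two data sets. The sketch below shows the calculation for two short lists of retention times. The function rmsd is a hypothetical helper, not the code used inside diff(), and the second list of times is invented to differ slightly from the first.

```python
from math import sqrt
from typing import Sequence


def rmsd(first: Sequence[float], second: Sequence[float]) -> float:
    """Root-mean-square deviation between two equal-length sequences."""
    if len(first) != len(second):
        raise ValueError("sequences must have the same length")
    if not first:
        raise ValueError("sequences must not be empty")
    squared = [(a - b) ** 2 for a, b in zip(first, second)]
    return sqrt(sum(squared) / len(squared))


times_a = [305.582, 305.958, 306.333]
times_b = [305.582, 305.957, 306.334]  # invented, off by 1 ms in two places
time_rmsd = rmsd(times_a, times_b)
```

An RMSD of zero, as reported for the intensities above, means the two sets of values are identical; a small non-zero value, as for the times and m/z values, is consistent with rounding differences between file formats.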
If the data sets cannot be compared, for example because they contain different numbers of scans or because corresponding scans contain different numbers of m/z values, diff() will report the difference. For example:
In [5]:
data2.trim(begin=1000, end=2000)
Trimming data to between 1000 and 2001 scans
In [6]:
diff(data1, data2)
The number of retention time points differ.
First data set: 9865 time points
Second data set: 1002 time points
Data sets are different.
Note
This example is in pyms-demo/jupyter/comparing_datasets.ipynb.
Footnotes