2. GC-MS data derived objects

In the raw GC-MS data, consecutive scans do not necessarily contain the same mass per charge (mass) values. For data processing, it is often necessary to convert the data to a matrix with a set number of masses and scans. In PyMassSpec the resulting object is called an intensity matrix. In this chapter the methods for converting the raw GC-MS data to an intensity matrix object are illustrated.

2.1. IntensityMatrix Object

The general scheme for converting raw mass values is to bin intensity values based on the interval the corresponding mass belongs to. The general procedure is as follows:

  • Set the interval between bins, lower and upper bin boundaries.

  • Calculate the number of bins to cover the range of all masses.

  • Centre the first bin at the minimum mass found for all the raw data.

  • Sum intensities whose masses are in a given bin.

A mass, \(m\), is considered to belong to a bin when \(c - l \le m < c + u\), where \(c\) is the centre of the bin, \(l\) is the lower boundary and \(u\) is the upper boundary of the bin. The default bin interval is one with a lower and upper boundary of \(\pm0.5\).

A function to bin masses to the nearest integer is also available. The default bin interval is one with a lower boundary of \(-0.3\) and upper boundary of \(+0.7\) (as per the NIST library).

2.1.1. Discussion of Binning Boundaries

For any chemical element \(X\), let \(w(x)\) be the atomic weight of \(X\), and

\(\delta(X) = \frac{w(X) - \{w(X)\}}{w(X)}\)

where \(\{a\}\) is the integer value of \(a\) (rounded to the nearest integer).

For example, for hydrogen \(\delta(^1\rm{H}) = \frac{1.007825032 - 1}{1.007825032} = 0.0076\). Similarly \(\delta(^{12}\rm{C}) = 0\), \(\delta(^{14}\rm{N}) = 0.00022\), \(\delta(^{16}\rm{O}) = -0.00032\), etc.

Let also \(\Delta(X) = w(X) - \{w(x)\}\). Then \(-0.023 <\Delta(^{31}\rm{P}), \Delta(^{28}\rm{Si}) < 0\).

Let a compound undergo GC-MS and let Y be one of it’s fragments. If Y consists of \(k_{1}\), \(k_{2}\) atoms of type \(X_{2}\),….., \(k_{r}\) atoms of type \(X_{r}\), then \(\Delta(Y) = k_{1}*\Delta(X_{1}) + k_{2}*\Delta(X_{2}) + ....+ k_{r}*\Delta(X_{r})\).

The fragment will usually not contain more than 2 or 3 P or Si atoms and if it’s molecular weight is less than 550 it may not contain more than 35 O atoms, so \(\Delta(Y) \geq -0.023*5 - 0.00051*35 = -0.133\).

On the other hand, of Y contains \(k\) H atoms and \(m\) N atoms, then \(\Delta(Y) \leq k*0.00783 + m*0.00051\). Since for each two hydrogen atoms at least one carbon (or heavier) atom is needed, giving the limit of no more than 80 hydrogen atoms. Therefore in this case (i.e. H and C atoms only) \(\Delta(Y) \leq 80*0.00783 = 0.63\). If carbon is replaced by any heavier atom, at least 2 hydrogen atoms will be eliminated and \(\Delta(Y)\) will become even smaller.

If the molecular weight of \(Y\) does not exceed 550 (typically the largest mass scanned for in a GC-MS setup) then \(\mathbf{-0.133 \leq \Delta(Y) \leq 0.63}\). This means that if we set our binning boundaries to \((-0.3, 0.7)\) or \((-0.2, 0.8)\) the opportunity for having a fragment whose molecular weight is very close to the boundary is minimised.

Since the resolution of MS is at least 0.1 dalton, we may assume that it’s error does not exceed 0.05, and MS accuracy will not cause additional problems.

2.1.2. Example: Building an Intensity Matrix

First, setup the paths to the datafiles and the output directory, then import JCAMP_reader.

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.GCMS.IO.JCAMP import JCAMP_reader

Read the raw data files.

In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
data
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'
<GCMS_data(305.582 - 4007.722 seconds, time step 0.3753183292781833, 9865 scans)>

Then the data can be converted to an IntensityMatrix using the function build_intensity_matrix() from pyms.IntensityMatrix.

The default operation of build_intensity_matrix() is to use a bin interval of one and treat the masses as floating point numbers. The default intensity matrix can be built as follows:

In [3]:
from pyms.IntensityMatrix import build_intensity_matrix

im = build_intensity_matrix(data)

im
<pyms.IntensityMatrix.IntensityMatrix at 0x7f31d8b12860>

The size as the number of scans and the number of bins can be returned with:

In [4]:
im.size
(9865, 551)

There are 9865 scans and 551 bins in this example.

The raw masses have been binned into new mass units based on the minimum mass in the raw data and the bin size. A list of the first ten new masses can be obtained as follows:

In [5]:
im.mass_list[:10]
[50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0]

The attributes im.min_mass and im.max_mass return the minimum and maximum mass:

In [6]:
im.min_mass
50.0
In [7]:
im.max_mass
600.0

It is also possible to search for a particular mass, by finding the index of the binned mass closest to the desired mass. For example, the index of the closest binned mass to a mass of 73.3 \(m/z\) can be found by using the methods im.get_index_of_mass():

In [8]:
index = im.get_index_of_mass(73.3)

index
23

The value of the closest mass can be returned by the method im.get_mass_at_index():

In [9]:
im.get_mass_at_index(index)
73.0

A mass of 73.0 is returned in this example.

2.1.3. Build intensity matrix parameters

The bin interval can be set to values other than one, and binning boundaries can also be adjusted. In the example below, to fit the 0.5 bin interval, the upper and lower boundaries are set to ± 0.25.

In [10]:
im = build_intensity_matrix(data, 0.5, 0.25, 0.25)

im
<pyms.IntensityMatrix.IntensityMatrix at 0x7f31d8b8d710>

The size of the intensity matrix will reflect the change in the number of bins:

In [11]:
im.size
(9865, 1101)
In [12]:
im.mass_list[:10]
[50.0, 50.5, 51.0, 51.5, 52.0, 52.5, 53.0, 53.5, 54.0, 54.5]

In this example there are 9865 scans (as before), but 1101 bins.

The index and binned mass of the mass closest to 73.3 should also reflect the different binning.

In [13]:
index = im.get_index_of_mass(73.3)

index
47
In [14]:
im.get_mass_at_index(index)
73.5

2.1.4. Build integer mass intensity matrix

It is also possible to build an intensity matrix with integer masses and a bin interval of one using build_intensity_matrix_i(). The default range for the binning is -0.3 and +0.7 mass units. The function is imported from pyms.IntensityMatrix:

In [15]:
from pyms.IntensityMatrix import build_intensity_matrix_i

im = build_intensity_matrix_i(data)
im
<pyms.IntensityMatrix.IntensityMatrix at 0x7f31d8b121d0>
In [16]:
im.size
(9865, 551)
In [17]:
im.mass_list[:10]
[50, 51, 52, 53, 54, 55, 56, 57, 58, 59]

The masses are now integers.

In [18]:
index = im.get_index_of_mass(73.3)
index
23
In [19]:
im.get_mass_at_index(index)
73

The lower and upper bounds can be adjusted with build_intensity_matrix_i(data, lower, upper).

Note

This example is in pyms-demo/jupyter/IntensityMatrix.ipynb.

2.2. Example: MassSpectrum Objects

First, setup the paths to the datafiles and the output directory, then import JCAMP_reader and build_intensity_matrix.

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.GCMS.IO.JCAMP import JCAMP_reader
from pyms.IntensityMatrix import build_intensity_matrix

Read the raw data files and create the IntensityMatrix.

In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
im = build_intensity_matrix(data)
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'

A MassSpectrum object contains two attributes, mass_list and intensity_list, a list of mass values and corresponding intensities, respectively. A MassSpectrum is returned by the IntensityMatrix method get_ms_at_index(index).

For example, the properties of the first MassSpectrum object can be obtained as follows:

In [3]:
ms = im.get_ms_at_index(0)

ms
<pyms.Spectrum.MassSpectrum at 0x7ff678cfe080>
In [4]:
len(ms)
551
In [5]:
len(ms.mass_list)
551
In [6]:
len(ms.intensity_list)
551

The length of all attributes should be the same.

Note

This example is in pyms-demo/jupyter/MassSpectrum.ipynb.

2.3. Example: IonChromatogram Objects

First, setup the paths to the datafiles and the output directory, then import JCAMP_reader and build_intensity_matrix.

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.GCMS.IO.JCAMP import JCAMP_reader
from pyms.IntensityMatrix import build_intensity_matrix

Read the raw data files and create the IntensityMatrix.

In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
im = build_intensity_matrix(data)
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'

An IonChromatogram object is a one dimensional vector containing mass intensities as a function of retention time. This can can be either \(m/z\) channel intensities (for example, the ion chromatogram at 73 \(m/z\)), or cumulative intensities over all measured \(m/z\) (TIC).

An IonChromatogram object for the TIC can be obtained as follows:

In [3]:
data.tic
<pyms.IonChromatogram.IonChromatogram at 0x7f698cbb9e80>

The IonChromatogram at index 0 can be obtained with:

In [4]:
im.get_ic_at_index(0)
<pyms.IonChromatogram.IonChromatogram at 0x7f69ac4e9198>

The IonChromatogram for the closest mass to 73 can be obtained with:

In [5]:
im.get_ic_at_mass(73)
<pyms.IonChromatogram.IonChromatogram at 0x7f69ac4e95f8>

An ion chromatogram object has a method is_tic() which returns True if the ion chromatogram is a TIC, False otherwise.

In [6]:
data.tic.is_tic()
True
In [7]:
im.get_ic_at_mass(73).is_tic()
False

Note

This example is in pyms-demo/jupyter/IonChromatogram.ipynb.

2.3.1. Writing IonChromatogram object to a file

Note

This example is in pyms-demo/31

The method write() of an IonChromatogram object allows the ion chromatogram to be saved to a file:

>>> tic.write("output/tic.dat", minutes=True)
>>> im.get_ic_at_mass(73).write("output/ic.dat", minutes=True)

The flag minutes=True indicates that retention time will be saved in minutes. The ion chromatogram object saved with with the write() method is a plain ASCII file which contains a pair of (retention time, intensity) per line.

$ head tic.dat
5.0930 2.222021e+07
5.0993 2.212489e+07
5.1056 2.208650e+07
5.1118 2.208815e+07
5.1181 2.200635e+07
5.1243 2.200326e+07
5.1306 2.202363e+07
5.1368 2.198357e+07
5.1431 2.197408e+07
5.1493 2.193351e+07

2.4. Saving data

Note

This example is in pyms-demo/32

A matrix of intensity values can be saved to a file with the function save_data() from pyms.Utils.IO. A matrix of intensity values can be returned from an IntensityMatrix with the method intensity_array. For example,

>>> from pyms.Utils.IO import save_data
>>> mat = im.intensity_array
array([[22128.,     0., 10221., ...,     0.,   470.,     0.],
       [22040.,     0., 10335., ...,   408.,     0.,   404.],
       [21320.,     0., 10133., ...,   492.,     0.,   422.],
       ...,
       [    0.,     0.,     0., ...,     0.,     0.,     0.],
       [    0.,     0.,     0., ...,     0.,     0.,     0.],
       [    0.,     0.,     0., ...,     0.,     0.,     0.]])
>>> save_data("output/im.dat", mat)

It is also possible to save the list of masses (from im.mass_list and the list of retention times (from im.time_list using the save_data() function. For convenience, the intensity values, mass list and time list, can be saved with the method export_ascii(). For example,

>>> im.export_ascii("output/data")

will create data.im.dat, data.rt.dat and data.mz.dat, where these are the intensity matrix, retention time vector, and \(m/z\) vector. By default the data is saved as space separated data with a “.dat” extension. It is also possible to save the data as comma separated data with a “.csv” extension with the command:

>>> im.export_ascii("output/data", "csv")

Additionally, the entire IntensityMatrix can be exported to LECO CSV format.

>>> im.export_leco_csv("output/data_leco.csv")

This facility is useful for import into other analytical software packages. The format has a header line specifying the column heading information as:

scan, retention time, mass1, mass2, ...

and then each row as the intensity data.

2.5. Importing ASCII data

Note

This example is in pyms-demo/32

The LECO CSV format data can be imported directly into an IntensityMatrix object. The data must follow the format outlined above. For example, the file saved above can be read and compared to the original:

>>> from pyms.IntensityMatrix import IntensityMatrix
>>> iim = IntensityMatrix([0],[0],[[0]])
>>> iim.import_leco_csv("output/data_leco.csv")
>>> im.size
>>> iim.size

The line IntensityMatrix([0],[0],[[0]]) is required to create an empty IntensityMatrix object.