2. GC-MS data derived objects¶
In the raw GC-MS data, consecutive scans do not necessarily contain the same
mass per charge (mass) values. For data processing, it is often necessary to
convert the data to a matrix with a set number of masses and scans.
In PyMassSpec
the resulting object is called an intensity matrix.
In this chapter the methods for converting the raw GC-MS data to an
intensity matrix object are illustrated.
2.1. IntensityMatrix Object¶
The general scheme for converting raw mass values is to bin intensity values based on the interval the corresponding mass belongs to. The general procedure is as follows:
Set the interval between bins, lower and upper bin boundaries.
Calculate the number of bins to cover the range of all masses.
Centre the first bin at the minimum mass found for all the raw data.
Sum intensities whose masses are in a given bin.
A mass, \(m\), is considered to belong to a bin when \(c - l \le m < c + u\), where \(c\) is the centre of the bin, \(l\) is the lower boundary and \(u\) is the upper boundary of the bin. The default bin interval is one with a lower and upper boundary of \(\pm0.5\).
A function to bin masses to the nearest integer is also available. The default bin interval is one with a lower boundary of \(-0.3\) and upper boundary of \(+0.7\) (as per the NIST library).
2.1.1. Discussion of Binning Boundaries¶
For any chemical element \(X\), let \(w(x)\) be the atomic weight of \(X\), and
\(\delta(X) = \frac{w(X) - \{w(X)\}}{w(X)}\)
where \(\{a\}\) is the integer value of \(a\) (rounded to the nearest integer).
For example, for hydrogen \(\delta(^1\rm{H}) = \frac{1.007825032 - 1}{1.007825032} = 0.0076\). Similarly \(\delta(^{12}\rm{C}) = 0\), \(\delta(^{14}\rm{N}) = 0.00022\), \(\delta(^{16}\rm{O}) = -0.00032\), etc.
Let also \(\Delta(X) = w(X) - \{w(x)\}\). Then \(-0.023 <\Delta(^{31}\rm{P}), \Delta(^{28}\rm{Si}) < 0\).
Let a compound undergo GC-MS and let Y be one of it’s fragments. If Y consists of \(k_{1}`atoms of type :math:`X_{1}\), \(k_{2}\) atoms of type \(X_{2}\),….., \(k_{r}\) atoms of type \(X_{r}\), then \(\Delta(Y) = k_{1}*\Delta(X_{1}) + k_{2}*\Delta(X_{2}) + ....+ k_{r}*\Delta(X_{r})\).
The fragment will usually not contain more than 2 or 3 P or Si atoms and if it’s molecular weight is less than 550 it may not contain more than 35 O atoms, so \(\Delta(Y) \geq -0.023*5 - 0.00051*35 = -0.133\).
On the other hand, of Y contains \(k\) H atoms and \(m\) N atoms, then \(\Delta(Y) \leq k*0.00783 + m*0.00051\). Since for each two hydrogen atoms at least one carbon (or heavier) atom is needed, giving the limit of no more than 80 hydrogen atoms. Therefore in this case (i.e. H and C atoms only) \(\Delta(Y) \leq 80*0.00783 = 0.63\). If carbon is replaced by any heavier atom, at least 2 hydrogen atoms will be eliminated and \(\Delta(Y)\) will become even smaller.
If the molecular weight of \(Y\) does not exceed 550 (typically the largest mass scanned for in a GC-MS setup) then \(\mathbf{-0.133 \leq \Delta(Y) \leq 0.63}\). This means that if we set our binning boundaries to \((-0.3, 0.7)\) or \((-0.2, 0.8)\) the opportunity for having a fragment whose molecular weight is very close to the boundary is minimised.
Since the resolution of MS is at least 0.1 dalton, we may assume that it’s error does not exceed 0.05, and MS accuracy will not cause additional problems.
2.1.2. Example: Building an Intensity Matrix¶
First, setup the paths to the datafiles and the output directory, then import JCAMP_reader.
In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location
output_directory = pathlib.Path(".").resolve() / "output"
from pyms.GCMS.IO.JCAMP import JCAMP_reader
Read the raw data files.
In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
data
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'
<GCMS_data(305.582 - 4007.722 seconds, time step 0.3753183292781833, 9865 scans)>
Then the data can be converted to an IntensityMatrix
using the
function build_intensity_matrix()
from pyms.IntensityMatrix
.
The default operation of build_intensity_matrix()
is to use a bin
interval of one and treat the masses as floating point numbers. The
default intensity matrix can be built as follows:
In [3]:
from pyms.IntensityMatrix import build_intensity_matrix
im = build_intensity_matrix(data)
im
<pyms.IntensityMatrix.IntensityMatrix at 0x7f31d8b12860>
The size as the number of scans and the number of bins can be returned with:
In [4]:
im.size
(9865, 551)
There are 9865 scans and 551 bins in this example.
The raw masses have been binned into new mass units based on the minimum mass in the raw data and the bin size. A list of the first ten new masses can be obtained as follows:
In [5]:
im.mass_list[:10]
[50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0]
The attributes im.min_mass
and im.max_mass
return the minimum
and maximum mass:
In [6]:
im.min_mass
50.0
In [7]:
im.max_mass
600.0
It is also possible to search for a particular mass, by finding the
index of the binned mass closest to the desired mass. For example, the
index of the closest binned mass to a mass of 73.3 \(m/z\) can be found
by using the methods im.get_index_of_mass()
:
In [8]:
index = im.get_index_of_mass(73.3)
index
23
The value of the closest mass can be returned by the method
im.get_mass_at_index()
:
In [9]:
im.get_mass_at_index(index)
73.0
A mass of 73.0 is returned in this example.
2.1.3. Build intensity matrix parameters¶
The bin interval can be set to values other than one, and binning boundaries can also be adjusted. In the example below, to fit the 0.5 bin interval, the upper and lower boundaries are set to ± 0.25.
In [10]:
im = build_intensity_matrix(data, 0.5, 0.25, 0.25)
im
<pyms.IntensityMatrix.IntensityMatrix at 0x7f31d8b8d710>
The size of the intensity matrix will reflect the change in the number of bins:
In [11]:
im.size
(9865, 1101)
In [12]:
im.mass_list[:10]
[50.0, 50.5, 51.0, 51.5, 52.0, 52.5, 53.0, 53.5, 54.0, 54.5]
In this example there are 9865 scans (as before), but 1101 bins.
The index and binned mass of the mass closest to 73.3 should also reflect the different binning.
In [13]:
index = im.get_index_of_mass(73.3)
index
47
In [14]:
im.get_mass_at_index(index)
73.5
2.1.4. Build integer mass intensity matrix¶
It is also possible to build an intensity matrix with integer masses and
a bin interval of one using build_intensity_matrix_i()
. The default
range for the binning is -0.3 and +0.7 mass units. The function is
imported from pyms.IntensityMatrix
:
In [15]:
from pyms.IntensityMatrix import build_intensity_matrix_i
im = build_intensity_matrix_i(data)
im
<pyms.IntensityMatrix.IntensityMatrix at 0x7f31d8b121d0>
In [16]:
im.size
(9865, 551)
In [17]:
im.mass_list[:10]
[50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
The masses are now integers.
In [18]:
index = im.get_index_of_mass(73.3)
index
23
In [19]:
im.get_mass_at_index(index)
73
The lower and upper bounds can be adjusted with
build_intensity_matrix_i(data, lower, upper)
.
Note
This example is in pyms-demo/jupyter/IntensityMatrix.ipynb
.
2.2. Example: MassSpectrum Objects¶
First, setup the paths to the datafiles and the output directory, then import JCAMP_reader and build_intensity_matrix.
In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location
output_directory = pathlib.Path(".").resolve() / "output"
from pyms.GCMS.IO.JCAMP import JCAMP_reader
from pyms.IntensityMatrix import build_intensity_matrix
Read the raw data files and create the IntensityMatrix.
In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
im = build_intensity_matrix(data)
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'
A MassSpectrum
object contains two attributes, mass_list
and
intensity_list
, a list of mass values and corresponding intensities,
respectively. A MassSpectrum
is returned by the IntensityMatrix
method get_ms_at_index(index)
.
For example, the properties of the first MassSpectrum
object can be
obtained as follows:
In [3]:
ms = im.get_ms_at_index(0)
ms
<pyms.Spectrum.MassSpectrum at 0x7ff678cfe080>
In [4]:
len(ms)
551
In [5]:
len(ms.mass_list)
551
In [6]:
len(ms.intensity_list)
551
The length of all attributes should be the same.
Note
This example is in pyms-demo/jupyter/MassSpectrum.ipynb
.
2.3. Example: IonChromatogram Objects¶
First, setup the paths to the datafiles and the output directory, then import JCAMP_reader and build_intensity_matrix.
In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location
output_directory = pathlib.Path(".").resolve() / "output"
from pyms.GCMS.IO.JCAMP import JCAMP_reader
from pyms.IntensityMatrix import build_intensity_matrix
Read the raw data files and create the IntensityMatrix.
In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
im = build_intensity_matrix(data)
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'
An IonChromatogram
object is a one dimensional vector containing
mass intensities as a function of retention time. This can can be either
\(m/z\) channel intensities (for example, the ion chromatogram at 73
\(m/z\)), or cumulative intensities over all measured \(m/z\) (TIC).
An IonChromatogram
object for the TIC can be obtained as follows:
In [3]:
data.tic
<pyms.IonChromatogram.IonChromatogram at 0x7f698cbb9e80>
The IonChromatogram
at index 0 can be obtained with:
In [4]:
im.get_ic_at_index(0)
<pyms.IonChromatogram.IonChromatogram at 0x7f69ac4e9198>
The IonChromatogram
for the closest mass to 73 can be obtained with:
In [5]:
im.get_ic_at_mass(73)
<pyms.IonChromatogram.IonChromatogram at 0x7f69ac4e95f8>
An ion chromatogram object has a method is_tic()
which returns
True
if the ion chromatogram is a TIC, False
otherwise.
In [6]:
data.tic.is_tic()
True
In [7]:
im.get_ic_at_mass(73).is_tic()
False
Note
This example is in pyms-demo/jupyter/IonChromatogram.ipynb
.
2.3.1. Writing IonChromatogram object to a file¶
Note
This example is in pyms-demo/31
The method write()
of an IonChromatogram
object allows the ion chromatogram to be saved to a file:
>>> tic.write("output/tic.dat", minutes=True)
>>> im.get_ic_at_mass(73).write("output/ic.dat", minutes=True)
The flag minutes=True
indicates that retention time will be saved in minutes.
The ion chromatogram object saved with with the
write()
method is a plain ASCII file which contains a pair of
(retention time, intensity) per line.
$ head tic.dat
5.0930 2.222021e+07
5.0993 2.212489e+07
5.1056 2.208650e+07
5.1118 2.208815e+07
5.1181 2.200635e+07
5.1243 2.200326e+07
5.1306 2.202363e+07
5.1368 2.198357e+07
5.1431 2.197408e+07
5.1493 2.193351e+07
2.4. Saving data¶
Note
This example is in pyms-demo/32
A matrix of intensity values can be saved to a file with the function
save_data()
from pyms.Utils.IO
. A matrix of intensity values can
be returned from an IntensityMatrix
with the method
intensity_array
.
For example,
>>> from pyms.Utils.IO import save_data
>>> mat = im.intensity_array
array([[22128., 0., 10221., ..., 0., 470., 0.],
[22040., 0., 10335., ..., 408., 0., 404.],
[21320., 0., 10133., ..., 492., 0., 422.],
...,
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.]])
>>> save_data("output/im.dat", mat)
It is also possible to save the list of masses (from
im.mass_list
and the list of retention times (from
im.time_list
using the save_data()
function.
For convenience, the intensity values, mass list and time list,
can be saved with the method
export_ascii()
.
For example,
>>> im.export_ascii("output/data")
will create data.im.dat
, data.rt.dat
and data.mz.dat
, where these
are the intensity matrix, retention time vector, and \(m/z\) vector. By default
the data is saved as space separated data with a “.dat” extension. It is
also possible to save the data as comma separated data with a “.csv”
extension with the command:
>>> im.export_ascii("output/data", "csv")
Additionally, the entire IntensityMatrix
can be exported to LECO CSV format.
>>> im.export_leco_csv("output/data_leco.csv")
This facility is useful for import into other analytical software packages. The format has a header line specifying the column heading information as:
scan, retention time, mass1, mass2, ...
and then each row as the intensity data.
2.5. Importing ASCII data¶
Note
This example is in pyms-demo/32
The LECO CSV format data can be imported directly into an IntensityMatrix
object. The data must follow the format outlined above. For example, the file saved above can be read and compared to the original:
>>> from pyms.IntensityMatrix import IntensityMatrix
>>> iim = IntensityMatrix([0],[0],[[0]])
>>> iim.import_leco_csv("output/data_leco.csv")
>>> im.size
>>> iim.size
The line IntensityMatrix([0],[0],[[0]])
is required to create an empty IntensityMatrix
object.