4. Peak detection and representation

4.1. Example: Peak Objects

Fundamental to GC-MS analysis is the identification of individual components of the sample mix. The basic component unit is represented as a signal peak. In PyMassSpec a signal peak is represented as Peak object. PyMassSpec provides functions to detect peaks and create peaks (discussed at the end of the chapter).

A peak object stores a minimal set of information about a signal peak, namely, the retention time at which the peak apex occurs and the mass spectra at the apex. Additional information, such as, peak width, TIC and individual ion areas can be filtered from the GC-MS data and added to the Peak object information.

4.1.1. Creating a Peak Object

A peak object can be created for a scan at a given retention time by providing the retention time (in minutes or seconds) and the MassSpectrum object of the scan.

First, setup the paths to the datafiles and the output directory, then import JCAMP_reader.

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.GCMS.IO.JCAMP import JCAMP_reader

Read the raw data files.

In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'

Build the IntensityMatrix.

In [3]:
from pyms.IntensityMatrix import build_intensity_matrix_i

im = build_intensity_matrix_i(data)

Extract the MassSpectrum at 31.17 minutes in this example.

In [4]:
index = im.get_index_at_time(31.17*60.0)
ms = im.get_ms_at_index(index)

Create a Peak object for the given retention time.

In [5]:
from pyms.Peak.Class import Peak
peak = Peak(31.17, ms, minutes=True)

By default the retention time is assumed to be in seconds. The parameter minutes can be set to True if the retention time is given in minutes. Internally, PyMassSpec stores retention times in seconds, so the minutes parameter ensures the input and output of the retention time are in the same units.

4.1.2. Peak Object properties

The retention time of the peak, in seconds, can be returned with pyms.Peak.Class.Peak.rt. The mass spectrum can be returned with pyms.Peak.Class.Peak.mass_spectrum.

The Peak object constructs a unique identification (UID) based on the spectrum and retention time. This helps in managing lists of peaks (covered in the next chapter). The UID can be returned with pyms.Peak.Class.Peak.UID. The format of the UID is the masses of the two most abundant ions in the spectrum, the ratio of the abundances of the two ions, and the retention time (in the same units as given when the Peak object was created). The format is:

Mass1-Mass2-Ratio-RT

For example:

In [6]:
peak.rt
1870.2
In [7]:
peak.UID
'319-73-74-1870.20'
In [8]:
index = im.get_index_of_mass(73.3)

index
23

4.1.3. Modifying a Peak Object

The Peak object has methods for modifying the mass spectrum. The mass range can be cropped to a smaller range with crop_mass(), and the intensity values for a single ion can be set to zero with null_mass(). For example, the mass range can be set from 60 to 450 \(m/z\), and the ions related to sample preparation can be ignored by setting their intensities to zero as follows:

In [9]:
peak.crop_mass(60, 450)
peak.null_mass(73)
peak.null_mass(147)

The UID is automatically updated to reflect the changes:

In [10]:
peak.UID
'319-205-54-1870.20'

It is also possible to change the peak mass spectrum by setting the attribute pyms.Peak.Class.Peak.mass_spectrum.

Note

This example is in pyms-demo/jupyter/Peak.ipynb.

4.2. Peak Detection

The general use of a Peak object is to extract them from the GC-MS data and build a list of peaks. In PyMassSpec, the function for peak detection is based on the method of Biller and Biemann (1974) 1. The basic process is to find all maximising ions in a pre-set window of scans, for a given scan. The ions that maximise at a given scan are taken to belong to the same peak.

The function is BillerBiemann(). in pyms.BillerBiemann. The function has parameters for the window width for detecting the local maxima (points), and the number of scans across which neighbouring, apexing, ions are combined and considered as belonging to the same peak. The number of neighbouring scans to combine is related to the likelihood of detecting a peak apex at a single scan or several neighbouring scans. This is more likely when there are many scans across the peak. It is also possible, however, when there are very few scans across the peak. The scans are combined by taking all apexing ions to have occurred at the scan that had to greatest TIC prior to combining scans.

4.2.1. Example: Peak Detection

First, setup the paths to the datafiles and the output directory, then import JCAMP_reader and build_intensity_matrix.

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.GCMS.IO.JCAMP import JCAMP_reader
from pyms.IntensityMatrix import build_intensity_matrix

Read the raw data file and build the IntensityMatrix.

In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
im = build_intensity_matrix(data)
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'

Preprocess the data (Savitzky-Golay smoothing and Tophat baseline detection).

In [3]:
from pyms.Noise.SavitzkyGolay import savitzky_golay
from pyms.TopHat import tophat

n_scan, n_mz = im.size

for ii in range(n_mz):
    ic = im.get_ic_at_index(ii)
    ic_smooth = savitzky_golay(ic)
    ic_bc = tophat(ic_smooth, struct="1.5m")
    im.set_ic_at_index(ii, ic_bc)

Now the Biller and Biemann based technique can be applied to detect peaks.

In [4]:
from pyms.BillerBiemann import BillerBiemann
peak_list = BillerBiemann(im)
peak_list[:10]
[<pyms.Peak.Class.Peak at 0x7f8deca882b0>,
 <pyms.Peak.Class.Peak at 0x7f8deca88550>,
 <pyms.Peak.Class.Peak at 0x7f8dce908160>,
 <pyms.Peak.Class.Peak at 0x7f8df3bc5c50>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a043c8>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a04588>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a045f8>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a04668>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a046d8>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a04748>]
In [5]:
len(peak_list)
9845

Note that this is nearly as many peaks as there are scans in the data (9865 scans). This is due to noise and the simplicity of the technique.

The number of detected peaks can be constrained by the selection of better parameters. Parameters can be determined by counting the number of points across a peak, and examining where peaks are found. For example, the peak list can be found with the parameters of a window of 9 points and by combining 2 neighbouring scans if they apex next to each other:

In [6]:
peak_list = BillerBiemann(im, points=9, scans=2)
peak_list[:10]
[<pyms.Peak.Class.Peak at 0x7f8dae545be0>,
 <pyms.Peak.Class.Peak at 0x7f8dae545c18>,
 <pyms.Peak.Class.Peak at 0x7f8dae545c88>,
 <pyms.Peak.Class.Peak at 0x7f8dae545cf8>,
 <pyms.Peak.Class.Peak at 0x7f8dae545d68>,
 <pyms.Peak.Class.Peak at 0x7f8dae545dd8>,
 <pyms.Peak.Class.Peak at 0x7f8dae545e48>,
 <pyms.Peak.Class.Peak at 0x7f8dae545eb8>,
 <pyms.Peak.Class.Peak at 0x7f8dae545f28>,
 <pyms.Peak.Class.Peak at 0x7f8dae545f98>]
In [7]:
len(peak_list)
3695

The number of detected peaks has been reduced, but there are still many more than would be expected from the sample. Functions to filter the peak list are covered in the next example.

4.2.2. Example: Peak List Filtering

There are two functions to filter the list of Peak objects.

The first, rel_threshold() modifies the mass spectrum stored in each peak so any intensity that is less than a given percentage of the maximum intensity for the peak is removed.

The second, num_ions_threshold(), removes any peak that has less than a given number of ions above a given threshold.

Once the peak list has been constructed, the filters can be applied as follows:

In [8]:
from pyms.BillerBiemann import rel_threshold, num_ions_threshold
pl = rel_threshold(peak_list, percent=2)
pl[:10]
[<pyms.Peak.Class.Peak at 0x7f8dc3a045f8>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a04630>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a04748>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a047b8>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a048d0>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a049e8>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a04a20>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a04b38>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a04b70>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a04c88>]
In [9]:
new_peak_list = num_ions_threshold(pl, n=3, cutoff=10000)
new_peak_list[:10]
[<pyms.Peak.Class.Peak at 0x7f8deca8e128>,
 <pyms.Peak.Class.Peak at 0x7f8deca8e1d0>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a04780>,
 <pyms.Peak.Class.Peak at 0x7f8dc3a04550>,
 <pyms.Peak.Class.Peak at 0x7f8dbb3cf3c8>,
 <pyms.Peak.Class.Peak at 0x7f8dbb3cf048>,
 <pyms.Peak.Class.Peak at 0x7f8dbb3cf4a8>,
 <pyms.Peak.Class.Peak at 0x7f8dbb3cf550>,
 <pyms.Peak.Class.Peak at 0x7f8dbb3cf5f8>,
 <pyms.Peak.Class.Peak at 0x7f8dbb3cf6a0>]
In [10]:
len(new_peak_list)
146

The number of detected peaks is now more realistic of what would be expected in the test sample.

Note

This example is in pyms-demo/jupyter/Peak_Detection.ipynb.

4.3. Noise analysis for peak filtering

In the previous example the cutoff parameter for peak filtering was set by the user. This can work well for individual data files, but can cause problems when applied to large experiments with many individual data files. Where experimental conditions have changed slightly between experimental runs, the ion intensity over the GC-MS run may also change. This means that an inflexible cutoff value can work for some data files, while excluding too many, or including too many peaks in other files.

An alternative to manually setting the value for cutoff is to use the window_analyzer() function. This function examines a Total Ion Chromatogram (TIC) and computes a value for the median absolute deviation in troughs between peaks. This gives an approximate threshold value above which false peaks from noise should be filtered out.

First, build the Peak list as before

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.GCMS.IO.JCAMP import JCAMP_reader
from pyms.IntensityMatrix import build_intensity_matrix
from pyms.Noise.SavitzkyGolay import savitzky_golay
from pyms.TopHat import tophat
from pyms.BillerBiemann import BillerBiemann

jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
im = build_intensity_matrix(data)

n_scan, n_mz = im.size

for ii in range(n_mz):
    ic = im.get_ic_at_index(ii)
    ic_smooth = savitzky_golay(ic)
    ic_bc = tophat(ic_smooth, struct="1.5m")
    im.set_ic_at_index(ii, ic_bc)

peak_list = BillerBiemann(im, points=9, scans=2)
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'

Compute the noise value.

In [2]:
from pyms.Noise.Analysis import window_analyzer

tic = data.tic

noise_level = window_analyzer(tic)
noise_level
432.1719792438844

Filter the Peak List using this noise value as the cutoff.

In [3]:
from pyms.BillerBiemann import num_ions_threshold
filtered_peak_list = num_ions_threshold(peak_list, n=3, cutoff=noise_level)
filtered_peak_list[:10]
[<pyms.Peak.Class.Peak at 0x7f4a9864f128>,
 <pyms.Peak.Class.Peak at 0x7f4a9864f2e8>,
 <pyms.Peak.Class.Peak at 0x7f4a9864f320>,
 <pyms.Peak.Class.Peak at 0x7f4a9864f3c8>,
 <pyms.Peak.Class.Peak at 0x7f4a9864f518>,
 <pyms.Peak.Class.Peak at 0x7f4a9864f4a8>,
 <pyms.Peak.Class.Peak at 0x7f4a9864f6a0>,
 <pyms.Peak.Class.Peak at 0x7f4a9864f748>,
 <pyms.Peak.Class.Peak at 0x7f4a9864f7f0>,
 <pyms.Peak.Class.Peak at 0x7f4a9864f898>]
In [4]:
len(filtered_peak_list)
612

Note

This example is in pyms-demo/jupyter/Peak_Filtering_Noise_Analysis.ipynb.

4.4. Peak Area Estimation

The Peak object does not contain any information about the width or area of the peak when it is first created. This information can be added after the instantiation of a Peak object. The area of the peak can be set with the attribute area.

The total peak area can by obtained by the peak_sum_area() function in pyms.Peak.Function. The function determines the total area as the sum of the ion intensities for all masses that apex at the given peak. To calculate the peak area of a single mass, the intensities are added from the apex of the mass peak outwards.

Edge values are added until the following conditions are met:

  • the added intensity adds less than 0.5% to the accumulated area; or

  • the added intensity starts increasing (i.e. when the ion is common to co-eluting compounds).

To avoid noise effects, the edge value is taken at the midpoint of three consecutive edge values.

First, build the Peak list as before

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.GCMS.IO.JCAMP import JCAMP_reader
from pyms.IntensityMatrix import build_intensity_matrix
from pyms.Noise.SavitzkyGolay import savitzky_golay
from pyms.TopHat import tophat
from pyms.BillerBiemann import BillerBiemann

jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
im = build_intensity_matrix(data)

n_scan, n_mz = im.size

for ii in range(n_mz):
    ic = im.get_ic_at_index(ii)
    ic_smooth = savitzky_golay(ic)
    ic_bc = tophat(ic_smooth, struct="1.5m")
    im.set_ic_at_index(ii, ic_bc)

peak_list = BillerBiemann(im, points=9, scans=2)

from pyms.Noise.Analysis import window_analyzer
tic = data.tic
noise_level = window_analyzer(tic)

from pyms.BillerBiemann import num_ions_threshold
filtered_peak_list = num_ions_threshold(peak_list, n=3, cutoff=noise_level)
filtered_peak_list[:10]
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'
[<pyms.Peak.Class.Peak at 0x7fa8eae80198>,
 <pyms.Peak.Class.Peak at 0x7fa8eae80208>,
 <pyms.Peak.Class.Peak at 0x7fa8eae802b0>,
 <pyms.Peak.Class.Peak at 0x7fa8eae80358>,
 <pyms.Peak.Class.Peak at 0x7fa8eae80400>,
 <pyms.Peak.Class.Peak at 0x7fa8eae804a8>,
 <pyms.Peak.Class.Peak at 0x7fa8eae80550>,
 <pyms.Peak.Class.Peak at 0x7fa8eae805f8>,
 <pyms.Peak.Class.Peak at 0x7fa8eae806a0>,
 <pyms.Peak.Class.Peak at 0x7fa8eae80748>]

Given a list of peaks, areas can be determined and added as follows:

In [2]:
from pyms.Peak.Function import peak_sum_area
for peak in peak_list:
    area = peak_sum_area(im, peak)
    peak.area = area

Note

This example is in pyms-demo/jupyter/Peak_Area_Estimation.ipynb.

4.4.1. Individual Ion Areas

Note

This example is in pyms-demo/56

While the previous approach uses the sum of all areas in the peak to estimate the peak area, the user may also choose to record the area of each individual ion in each peak.

This can be useful when the intention is to later perform quantitation based on the area of a single characteristic ion for a particular compound. It is also essential if using the Common Ion Algorithm for quantitation, outlined in the section Common Ion Area Quantitation.

To set the area of each ion for each peak, the following code is used:

>>> from pyms.Peak.Function import peak_top_ion_areas
>>> for peak in peak_list:
...     area_dict = peak_top_ions_areas(intensity_matrix, peak)
...     peak.set_ion_areas(area_dict)
...

This will set the areas of the 5 most abundant ions in each peak. If it is desired to record more than the top five ions, the argument num_ions=x should be supplied, where x is the number of most abundant ions to be recorded. For example:

...     area_dict = peak_top_ions_areas(intensity_matrix, peak, num_ions=10)

will record the 10 most abundant ions for each peak.

The individual ion areas can be set instead of, or in addition to the total area for each peak.

4.4.2. Reading the area of a single ion in a peak

If the individual ion areas have been set for a peak, it is possible to read the area of an individual ion for the peak. For example:

>>> peak.get_ion_area(101)

will return the area of the \(m/z\) value 101 for the peak. If the area of that ion has not been set (i.e. it was not one of the most abundant ions), the function will return None.

4.5. References

1

Biller JE and Biemann K. Reconstructed mass spectra, a novel approach for the utilization of gas chromatograph–mass spectrometer data. Anal. Lett., 7:515–528, 1974