3. Data Filtering

3.1. Introduction

In this chapter filtering techniques that allow pre-processing of GC-MS data for analysis and comparison to other pre-processed GC-MS data are covered.

3.2. Time strings

Before considering the filtering techniques, the mechanism for representing retention times is outlined here.

A time string is the specification of a time interval, that takes the format NUMBERs or NUMBERm for time interval in seconds or minutes. For example, these are valid time strings: 10s (10 seconds) and 0.2m (0.2 minutes).

3.3. Example: IntensityMatrix Resizing

Once an IntensityMatrix has been constructed from the raw GC-MS data, the entries of the matrix can be modified. These modifications can operate on the entire matrix, or individual masses or scans.

First, setup the paths to the datafiles and the output directory, then import JCAMP_reader and build_intensity_matrix.

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.GCMS.IO.JCAMP import JCAMP_reader
from pyms.IntensityMatrix import build_intensity_matrix

Read the raw data files and create the IntensityMatrix.

In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
im = build_intensity_matrix(data)
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'

3.3.1. Retention time range

A basic operation on the GC-MS data is to select a specific time range for processing. In PyMassSpec, any data outside the chosen time range is discarded. The trim() method operates on the raw data, so any subsequent processing only refers to the trimmed data.

The data can be trimmed to specific scans:

In [3]:
data.trim(1000, 2000)
data.info()
Trimming data to between 1000 and 2001 scans
 Data retention time range: 11.342 min -- 17.604 min
 Time step: 0.375 s (std=0.000 s)
 Number of scans: 1002
 Minimum m/z measured: 50.100
 Maximum m/z measured: 467.100
 Mean number of m/z values per scan: 57
 Median number of m/z values per scan: 44

or specific retention times (in seconds or minutes):

In [4]:
data.trim("700s", "15m")
data.info()
Trimming data to between 54 and 587 scans
 Data retention time range: 11.674 min -- 15.008 min
 Time step: 0.375 s (std=0.000 s)
 Number of scans: 534
 Minimum m/z measured: 50.100
 Maximum m/z measured: 395.200
 Mean number of m/z values per scan: 59
 Median number of m/z values per scan: 47

3.3.2. Mass Spectrum range and entries

An IntensityMatrix object has a set mass range and interval that is derived from the data at the time of building the intensity matrix. The range of mass values can be cropped. This is done, primarily, to ensure that the range of masses used are consistent when comparing samples.

The mass range of the intensity matrix can be “cropped” to a new (smaller) range as follows:

In [5]:
im.crop_mass(60, 400)
im.min_mass
60.0
In [6]:
im.max_mass
400.0

It is also possible to set all intensities for a given mass to zero. This is useful for ignoring masses associated with sample preparation. The mass can be “nulled” with:

In [7]:
im.null_mass(73)
sum(im.get_ic_at_mass(73).intensity_array)
0.0

As expected, the sum of the intensity array is 0

Note

This example is in pyms-demo/jupyter/IntensityMatrix_Resizing.ipynb.


3.4. Noise smoothing

The purpose of noise smoothing is to remove high-frequency noise from data, and thereby increase the contribution of the signal relative to the contribution of the noise.

First, setup the paths to the datafiles and the output directory, then import JCAMP_reader.

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.GCMS.IO.JCAMP import JCAMP_reader

Read the raw data files and extract the TIC.

In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
tic = data.tic
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'

3.4.1. Window averaging

A simple approach to noise smoothing is moving average window smoothing. In this approach the window of a fixed size (:math:2N+1 points) is moved across the ion chromatogram, and the intensity value at each point is replaced with the mean intensity calculated over the window size. The example below illustrates smoothing of TIC by window averaging.

To apply mean window smoothing with a 5-point window:

In [3]:
from pyms.Noise.Window import window_smooth
tic1 = window_smooth(tic, window=5)

To apply median window smoothing with a 5-point window:

In [4]:
tic2 = window_smooth(tic, window=5, use_median=True)

To apply the mean windows smoothing, but specifying the window as a time string (in this example, 7 seconds):

In [5]:
tic3 = window_smooth(tic, window='7s')

Write the original TIC and the smoothed TICs to disk:

In [6]:
tic.write(output_directory / "noise_smoothing_tic.dat",minutes=True)
tic1.write(output_directory / "noise_smoothing_tic1.dat",minutes=True)
tic2.write(output_directory / "noise_smoothing_tic2.dat",minutes=True)

3.4.2. Window Averaging on Intensity Matrix

In the previous section, window averaging was applied to an Ion Chromatogram object (in that case a TIC). Where filtering is to be performed on all Ion Chromatograms, the window_smooth_im() function may be used instead.

The use of this function is identical to the Ion Chromatogram window_smooth() function, except that an Intensity Matrix is passed to it.

For example, to perform window smoothing on an IntensityMatrix object with a 5 point window and mean window smoothing:

In [7]:
from pyms.IntensityMatrix import build_intensity_matrix
from pyms.Noise.Window import window_smooth_im
im = build_intensity_matrix(data)
im_smooth1 = window_smooth_im(im, window=5, use_median=False)

Write the IC for mass 73 to disk for both the original and smoothed IntensityMatrix:

In [8]:
ic = im.get_ic_at_index(73)
ic_smooth1 = im_smooth1.get_ic_at_index(73)

ic.write(output_directory/"noise_smoothing_ic.dat", minutes=True)
ic_smooth1.write(output_directory/"noise_smoothing_ic_smooth1.dat", minutes=True)

3.4.3. Savitzky–Golay noise filter

A more sophisticated noise filter is the Savitzky-Golay filter. Given the data loaded as above, this filter can be applied as follows:

In [9]:
from pyms.Noise.SavitzkyGolay import savitzky_golay
tic4 = savitzky_golay(tic)

Write the smoothed TIC to disk:

In [10]:
tic4.write(output_directory / "noise_smoothing_tic4.dat",minutes=True)

In this example the default parameters were used.

3.4.3.1. Savitzky-Golay Noise filtering of Intensity Matrix Object

The savitzky_golay() function described above acts on a single IonChromatogram. Where it is desired to perform Savitzky Golay filtering on the whole IntensityMatrix the function savitzky_golay_im() may be used as follows:

In [11]:
from pyms.Noise.SavitzkyGolay import savitzky_golay_im
im_smooth2 = savitzky_golay_im(im)

Write the IC for mass 73 in the smoothed IntensityMatrix to disk:

In [12]:
ic_smooth2 = im_smooth2.get_ic_at_index(73)
ic_smooth2.write(output_directory/"noise_smoothing_ic_smooth2.dat",minutes=True)

Note

This example is in pyms-demo/jupyter/NoiseSmoothing.ipynb.

3.5. Baseline Correction

Baseline distortion originating from instrument imperfections and experimental setup is often observed in mass spectrometry data, and off-line baseline correction is often an important step in data pre-processing. There are many approaches for baseline correction. One advanced approach is based on the top-hat transform developed in mathematical morphology 1, and used extensively in digital image processing for tasks such as image enhancement. Top-hat baseline correction was previously applied in proteomics based mass spectrometry 2. PyMS currently implements only the top-hat baseline corrector, using the SciPy package ndimage.

Application of the top-hat baseline corrector requires the size of the structural element to be specified. The structural element needs to be larger than the features one wants to retain in the spectrum after the top-hat transform. In the example below, the top-hat baseline corrector is applied to the TIC of the data set gc01_0812_066.cdf, with the structural element of 1.5 minutes:

The purpose of noise smoothing is to remove high-frequency noise from data, and thereby increase the contribution of the signal relative to the contribution of the noise.

First, setup the paths to the datafiles and the output directory, then import ANDI_reader, savitzky_golay and tophat.

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.GCMS.IO.ANDI import ANDI_reader
from pyms.Noise.SavitzkyGolay import savitzky_golay
from pyms.TopHat import tophat

Read the raw data files and extract the TIC.

In [2]:
andi_file = data_directory / "gc01_0812_066.cdf"
data = ANDI_reader(andi_file)
tic = data.tic
-> Reading netCDF file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.cdf'

Perform Savitzky-Golay smoothing

In [3]:
tic1 = savitzky_golay(tic)

Perform Tophat baseline correction

In [4]:
tic2 = tophat(tic1, struct="1.5m")

Save the output to disk

In [5]:
tic.write(output_directory / "baseline_correction_tic.dat",minutes=True)
tic1.write(output_directory / "baseline_correction_tic_smooth.dat",minutes=True)
tic2.write(output_directory / "baseline_correction_tic_smooth_bc.dat",minutes=True)

3.5.1. Tophat Baseline correction on an Intensity Matrix object

The tophat() function acts on a single IonChromatogram. To perform baseline correction on an IntensityMatrix object (i.e. on all Ion Chromatograms) the tophat_im() function may be used.

Using the same value for struct as above, tophat_im() is used as follows:

In [6]:
from pyms.TopHat import tophat_im
from pyms.IntensityMatrix import build_intensity_matrix
im = build_intensity_matrix(data)
im_base_corr = tophat_im(im, struct="1.5m")

Write the IC for mass 73 to disk for both the original and smoothed IntensityMatrix:

In [7]:
ic = im.get_ic_at_index(73)
ic_base_corr = im_base_corr.get_ic_at_index(73)

ic.write(output_directory/"baseline_correction_ic.dat",minutes=True)
ic_base_corr.write(output_directory/"baseline_correction_ic_base_corr.dat",minutes=True)

Note

This example is in pyms-demo/jupyter/BaselineCorrection.ipynb.

3.6. Pre-processing the IntensityMatrix

Noise smoothing and baseline correction can be applied to each IonChromatogram in an IntensityMatrix.

First, setup the paths to the datafiles and the output directory, then import the required functions.

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.GCMS.IO.JCAMP import JCAMP_reader
from pyms.IntensityMatrix import build_intensity_matrix
from pyms.Noise.SavitzkyGolay import savitzky_golay
from pyms.TopHat import tophat

Read the raw data files and build the IntensityMatrix:

In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
im = build_intensity_matrix(data)
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'

Perform Savitzky-Golay smoothing and Tophat baseline correction

In [3]:
n_scan, n_mz = im.size

for ii in range(n_mz):
    # print("Working on IC#", ii+1)
    ic = im.get_ic_at_index(ii)
    ic_smooth = savitzky_golay(ic)
    ic_bc = tophat(ic_smooth, struct="1.5m")
    im.set_ic_at_index(ii, ic_bc)

Alternatively, the filtering may be performed on the IntensityMatrix without using a for loop, as outlined in previous examples. However filtering by IonChromatogram in a for loop as described here is much faster.

The resulting IntensityMatrix object can be “dumped” to a file for later retrieval. There are general perpose object file handling methods in pyms.Utils.IO. For example;

>>> from pyms.Utils.IO import dump_object
>>> dump_object(im, "output/im-proc.dump")

Note

This example is in pyms-demo/jupyter/IntensityMatrix_Preprocessing.ipynb.

3.7. References

1

Serra J. Image Analysis and Mathematical Morphology. Academic Press, Inc, Orlando, 1983. ISBN 0126372403

2

Sauve AC and Speed TP. Normalization, baseline correction and alignment of high-throughput mass spectrometry data. Procedings Gensips, 2004