3. Data Filtering¶
3.1. Introduction¶
In this chapter filtering techniques that allow pre-processing of GC-MS data for analysis and comparison to other pre-processed GC-MS data are covered.
3.2. Time strings¶
Before considering the filtering techniques, the mechanism for representing retention times is outlined here.
A time string is the specification of a time interval, that takes the format NUMBERs
or NUMBERm
for time interval in seconds or minutes. For example, these are valid time strings: 10s
(10 seconds) and 0.2m
(0.2 minutes).
3.3. Example: IntensityMatrix Resizing¶
Once an IntensityMatrix has been constructed from the raw GC-MS data, the entries of the matrix can be modified. These modifications can operate on the entire matrix, or individual masses or scans.
First, setup the paths to the datafiles and the output directory, then import JCAMP_reader and build_intensity_matrix.
In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location
output_directory = pathlib.Path(".").resolve() / "output"
from pyms.GCMS.IO.JCAMP import JCAMP_reader
from pyms.IntensityMatrix import build_intensity_matrix
Read the raw data files and create the IntensityMatrix.
In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
im = build_intensity_matrix(data)
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'
3.3.1. Retention time range¶
A basic operation on the GC-MS data is to select a specific time range
for processing. In PyMassSpec, any data outside the chosen time range is
discarded. The trim()
method operates on the raw data, so any
subsequent processing only refers to the trimmed data.
The data can be trimmed to specific scans:
In [3]:
data.trim(1000, 2000)
data.info()
Trimming data to between 1000 and 2001 scans
Data retention time range: 11.342 min -- 17.604 min
Time step: 0.375 s (std=0.000 s)
Number of scans: 1002
Minimum m/z measured: 50.100
Maximum m/z measured: 467.100
Mean number of m/z values per scan: 57
Median number of m/z values per scan: 44
or specific retention times (in seconds
or minutes
):
In [4]:
data.trim("700s", "15m")
data.info()
Trimming data to between 54 and 587 scans
Data retention time range: 11.674 min -- 15.008 min
Time step: 0.375 s (std=0.000 s)
Number of scans: 534
Minimum m/z measured: 50.100
Maximum m/z measured: 395.200
Mean number of m/z values per scan: 59
Median number of m/z values per scan: 47
3.3.2. Mass Spectrum range and entries¶
An IntensityMatrix
object has a set mass range and interval that is
derived from the data at the time of building the intensity matrix. The
range of mass values can be cropped. This is done, primarily, to ensure
that the range of masses used are consistent when comparing samples.
The mass range of the intensity matrix can be “cropped” to a new (smaller) range as follows:
In [5]:
im.crop_mass(60, 400)
im.min_mass
60.0
In [6]:
im.max_mass
400.0
It is also possible to set all intensities for a given mass to zero. This is useful for ignoring masses associated with sample preparation. The mass can be “nulled” with:
In [7]:
im.null_mass(73)
sum(im.get_ic_at_mass(73).intensity_array)
0.0
As expected, the sum of the intensity array is 0
Note
This example is in pyms-demo/jupyter/IntensityMatrix_Resizing.ipynb
.
3.4. Noise smoothing¶
The purpose of noise smoothing is to remove high-frequency noise from data, and thereby increase the contribution of the signal relative to the contribution of the noise.
First, setup the paths to the datafiles and the output directory, then import JCAMP_reader.
In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location
output_directory = pathlib.Path(".").resolve() / "output"
from pyms.GCMS.IO.JCAMP import JCAMP_reader
Read the raw data files and extract the TIC.
In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
tic = data.tic
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'
3.4.1. Window averaging¶
A simple approach to noise smoothing is moving average window smoothing.
In this approach the window of a fixed size (:math:2N+1
points) is
moved across the ion chromatogram, and the intensity value at each point
is replaced with the mean intensity calculated over the window size. The
example below illustrates smoothing of TIC by window averaging.
To apply mean window smoothing with a 5-point window:
In [3]:
from pyms.Noise.Window import window_smooth
tic1 = window_smooth(tic, window=5)
To apply median window smoothing with a 5-point window:
In [4]:
tic2 = window_smooth(tic, window=5, use_median=True)
To apply the mean windows smoothing, but specifying the window as a time string (in this example, 7 seconds):
In [5]:
tic3 = window_smooth(tic, window='7s')
Write the original TIC and the smoothed TICs to disk:
In [6]:
tic.write(output_directory / "noise_smoothing_tic.dat",minutes=True)
tic1.write(output_directory / "noise_smoothing_tic1.dat",minutes=True)
tic2.write(output_directory / "noise_smoothing_tic2.dat",minutes=True)
3.4.2. Window Averaging on Intensity Matrix¶
In the previous section, window averaging was applied to an Ion
Chromatogram object (in that case a TIC). Where filtering is to be
performed on all Ion Chromatograms, the window_smooth_im()
function
may be used instead.
The use of this function is identical to the Ion Chromatogram
window_smooth()
function, except that an Intensity Matrix is passed
to it.
For example, to perform window smoothing on an IntensityMatrix
object with a 5 point window and mean window smoothing:
In [7]:
from pyms.IntensityMatrix import build_intensity_matrix
from pyms.Noise.Window import window_smooth_im
im = build_intensity_matrix(data)
im_smooth1 = window_smooth_im(im, window=5, use_median=False)
Write the IC for mass 73 to disk for both the original and smoothed
IntensityMatrix
:
In [8]:
ic = im.get_ic_at_index(73)
ic_smooth1 = im_smooth1.get_ic_at_index(73)
ic.write(output_directory/"noise_smoothing_ic.dat", minutes=True)
ic_smooth1.write(output_directory/"noise_smoothing_ic_smooth1.dat", minutes=True)
3.4.3. Savitzky–Golay noise filter¶
A more sophisticated noise filter is the Savitzky-Golay filter. Given the data loaded as above, this filter can be applied as follows:
In [9]:
from pyms.Noise.SavitzkyGolay import savitzky_golay
tic4 = savitzky_golay(tic)
Write the smoothed TIC to disk:
In [10]:
tic4.write(output_directory / "noise_smoothing_tic4.dat",minutes=True)
In this example the default parameters were used.
3.4.3.1. Savitzky-Golay Noise filtering of Intensity Matrix Object¶
The savitzky_golay()
function described above acts on a single
IonChromatogram
. Where it is desired to perform Savitzky Golay
filtering on the whole IntensityMatrix
the function
savitzky_golay_im()
may be used as follows:
In [11]:
from pyms.Noise.SavitzkyGolay import savitzky_golay_im
im_smooth2 = savitzky_golay_im(im)
Write the IC for mass 73 in the smoothed IntensityMatrix
to disk:
In [12]:
ic_smooth2 = im_smooth2.get_ic_at_index(73)
ic_smooth2.write(output_directory/"noise_smoothing_ic_smooth2.dat",minutes=True)
Note
This example is in pyms-demo/jupyter/NoiseSmoothing.ipynb
.
3.5. Baseline Correction¶
Baseline distortion originating from instrument imperfections and
experimental setup is often observed in mass spectrometry data, and
off-line baseline correction is often an important step in data
pre-processing. There are many approaches for baseline correction. One
advanced approach is based on the top-hat transform developed in mathematical morphology [1], and used extensively in digital image
processing for tasks such as image enhancement. Top-hat baseline
correction was previously applied in proteomics based mass spectrometry [2].
PyMS currently implements only the top-hat baseline corrector, using the
SciPy package ndimage
.
Application of the top-hat baseline corrector requires the size of the
structural element to be specified. The structural element needs to be
larger than the features one wants to retain in the spectrum after the
top-hat transform. In the example below, the top-hat baseline corrector
is applied to the TIC of the data set gc01_0812_066.cdf
, with the
structural element of 1.5 minutes:
The purpose of noise smoothing is to remove high-frequency noise from data, and thereby increase the contribution of the signal relative to the contribution of the noise.
First, setup the paths to the datafiles and the output directory, then import ANDI_reader, savitzky_golay and tophat.
In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location
output_directory = pathlib.Path(".").resolve() / "output"
from pyms.GCMS.IO.ANDI import ANDI_reader
from pyms.Noise.SavitzkyGolay import savitzky_golay
from pyms.TopHat import tophat
Read the raw data files and extract the TIC.
In [2]:
andi_file = data_directory / "gc01_0812_066.cdf"
data = ANDI_reader(andi_file)
tic = data.tic
-> Reading netCDF file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.cdf'
Perform Savitzky-Golay smoothing
In [3]:
tic1 = savitzky_golay(tic)
Perform Tophat baseline correction
In [4]:
tic2 = tophat(tic1, struct="1.5m")
Save the output to disk
In [5]:
tic.write(output_directory / "baseline_correction_tic.dat",minutes=True)
tic1.write(output_directory / "baseline_correction_tic_smooth.dat",minutes=True)
tic2.write(output_directory / "baseline_correction_tic_smooth_bc.dat",minutes=True)
3.5.1. Tophat Baseline correction on an Intensity Matrix object¶
The tophat()
function acts on a single IonChromatogram
. To
perform baseline correction on an IntensityMatrix
object (i.e. on
all Ion Chromatograms
) the tophat_im()
function may be used.
Using the same value for struct
as above, tophat_im()
is used as
follows:
In [6]:
from pyms.TopHat import tophat_im
from pyms.IntensityMatrix import build_intensity_matrix
im = build_intensity_matrix(data)
im_base_corr = tophat_im(im, struct="1.5m")
Write the IC for mass 73 to disk for both the original and smoothed
IntensityMatrix
:
In [7]:
ic = im.get_ic_at_index(73)
ic_base_corr = im_base_corr.get_ic_at_index(73)
ic.write(output_directory/"baseline_correction_ic.dat",minutes=True)
ic_base_corr.write(output_directory/"baseline_correction_ic_base_corr.dat",minutes=True)
Note
This example is in pyms-demo/jupyter/BaselineCorrection.ipynb
.
3.6. Pre-processing the IntensityMatrix¶
Noise smoothing and baseline correction can be applied to each
IonChromatogram
in an IntensityMatrix
.
First, setup the paths to the datafiles and the output directory, then import the required functions.
In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location
output_directory = pathlib.Path(".").resolve() / "output"
from pyms.GCMS.IO.JCAMP import JCAMP_reader
from pyms.IntensityMatrix import build_intensity_matrix
from pyms.Noise.SavitzkyGolay import savitzky_golay
from pyms.TopHat import tophat
Read the raw data files and build the IntensityMatrix
:
In [2]:
jcamp_file = data_directory / "gc01_0812_066.jdx"
data = JCAMP_reader(jcamp_file)
im = build_intensity_matrix(data)
-> Reading JCAMP file '/home/vagrant/PyMassSpec/pyms-data/gc01_0812_066.jdx'
Perform Savitzky-Golay smoothing and Tophat baseline correction
In [3]:
n_scan, n_mz = im.size
for ii in range(n_mz):
# print("Working on IC#", ii+1)
ic = im.get_ic_at_index(ii)
ic_smooth = savitzky_golay(ic)
ic_bc = tophat(ic_smooth, struct="1.5m")
im.set_ic_at_index(ii, ic_bc)
Alternatively, the filtering may be performed on the IntensityMatrix
without using a for
loop, as outlined in previous examples. However
filtering by IonChromatogram
in a for
loop as described here is
much faster.
The resulting IntensityMatrix
object can be “dumped” to a file for later
retrieval. There are general perpose object file handling methods in
pyms.Utils.IO
. For example;
>>> from pyms.Utils.IO import dump_object
>>> dump_object(im, "output/im-proc.dump")
Note
This example is in pyms-demo/jupyter/IntensityMatrix_Preprocessing.ipynb
.