5. Peak alignment by dynamic programming¶
PyMS provides functions to align GC-MS peaks by dynamic programming [1]. The peak alignment by dynamic programming uses both peak apex retention time and mass spectra. This information is determined from the raw GC-MS data by applying a series of processing steps to produce data that can then be aligned and used for statistical analysis. The details are described in this chapter.
5.1. Preparation of multiple experiments for peak alignment by dynamic programming¶
5.1.1. Example: Creating an Experiment¶
Before aligning peaks from multiple experiments, the peak objects need to be created and encapsulated into Experiment objects. During this process it is often useful to pre-process the peaks in some way, for example to null certain m/z channels and/or to select a certain retention time range.
The procedure starts the same as in the previous examples, namely:
read a file,
bin the data into fixed mass values,
smooth the data,
remove the baseline,
deconvolute peaks,
filter the peaks,
set the mass range,
remove uninformative ions, and
estimate peak areas.
First, set up the paths to the data files and the output directory, then import ANDI_reader and build_intensity_matrix_i.
In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location
output_directory = pathlib.Path(".").resolve() / "output"
from pyms.GCMS.IO.ANDI import ANDI_reader
from pyms.IntensityMatrix import build_intensity_matrix_i
Read the raw data file and build the IntensityMatrix.
In [2]:
andi_file = data_directory / "a0806_077.cdf"
data = ANDI_reader(andi_file)
im = build_intensity_matrix_i(data)
-> Reading netCDF file '/home/vagrant/PyMassSpec/pyms-data/a0806_077.cdf'
Preprocess the data (Savitzky-Golay smoothing and Tophat baseline detection)
In [3]:
from pyms.Noise.SavitzkyGolay import savitzky_golay
from pyms.TopHat import tophat
n_scan, n_mz = im.size
for ii in range(n_mz):
    ic = im.get_ic_at_index(ii)
    ic1 = savitzky_golay(ic)
    ic_smooth = savitzky_golay(ic1)  # Why the second pass here?
    ic_bc = tophat(ic_smooth, struct="1.5m")
    im.set_ic_at_index(ii, ic_bc)
Now the Biller and Biemann-based technique can be applied to detect peaks.
In [4]:
from pyms.BillerBiemann import BillerBiemann
pl = BillerBiemann(im, points=9, scans=2)
len(pl)
1191
Trim the peak list by relative intensity
In [5]:
from pyms.BillerBiemann import rel_threshold, num_ions_threshold
apl = rel_threshold(pl, percent=2)
len(apl)
1191
Trim the peak list by noise threshold
In [6]:
peak_list = num_ions_threshold(apl, n=3, cutoff=3000)
len(peak_list)
225
Set the mass range, remove unwanted ions and estimate the peak area
In [7]:
from pyms.Peak.Function import peak_sum_area
for peak in peak_list:
    peak.crop_mass(51, 540)
    peak.null_mass(73)
    peak.null_mass(147)
    area = peak_sum_area(im, peak)
    peak.area = area
Create an Experiment.
In [8]:
from pyms.Experiment import Experiment
expr = Experiment("a0806_077", peak_list)
Set the time range for all Experiments
In [9]:
expr.sele_rt_range(["6.5m", "21m"])
Save the experiment to disk.
In [10]:
expr.dump(output_directory / "experiments" / "a0806_077.expr")
Note
This example is in pyms-demo/jupyter/Experiment.ipynb.
5.1.2. Example: Creating Multiple Experiments¶
In this example, three GC-MS experiments are prepared for peak alignment. The experiments are named a0806_077, a0806_078, and a0806_079, and represent separate GC-MS sample runs from the same biological sample.
The procedure is the same as for the previous example, but is repeated three times.
First, set up the paths to the data files and the output directory, then import the required functions.
In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location
output_directory = pathlib.Path(".").resolve() / "output"
from pyms.BillerBiemann import BillerBiemann, num_ions_threshold, rel_threshold
from pyms.Experiment import Experiment
from pyms.GCMS.IO.ANDI import ANDI_reader
from pyms.IntensityMatrix import build_intensity_matrix_i
from pyms.Noise.SavitzkyGolay import savitzky_golay
from pyms.Peak.Function import peak_sum_area, peak_top_ion_areas
from pyms.TopHat import tophat
Define the data files to process
In [2]:
expr_codes = ["a0806_077", "a0806_078", "a0806_079"]
# expr_codes = ["a0806_140", "a0806_141", "a0806_142"]
Loop over the experiments and perform the processing.
In [3]:
for expr_code in expr_codes:
    print(f" -> Processing experiment '{expr_code}'")
    andi_file = data_directory / f"{expr_code}.cdf"
    data = ANDI_reader(andi_file)
    im = build_intensity_matrix_i(data)
    n_scan, n_mz = im.size
    # Preprocess the data (Savitzky-Golay smoothing and Tophat baseline detection)
    for ii in range(n_mz):
        ic = im.get_ic_at_index(ii)
        ic1 = savitzky_golay(ic)
        ic_smooth = savitzky_golay(ic1)  # Why the second pass here?
        ic_bc = tophat(ic_smooth, struct="1.5m")
        im.set_ic_at_index(ii, ic_bc)
    # Peak detection
    pl = BillerBiemann(im, points=9, scans=2)
    # Trim the peak list by relative intensity
    apl = rel_threshold(pl, percent=2)
    # Trim the peak list by noise threshold
    peak_list = num_ions_threshold(apl, n=3, cutoff=3000)
    print("\t -> Number of Peaks found:", len(peak_list))
    print("\t -> Executing peak post-processing and quantification...")
    # Set the mass range, remove unwanted ions and estimate the peak area
    # For peak alignment, all experiments must have the same mass range
    for peak in peak_list:
        peak.crop_mass(51, 540)
        peak.null_mass(73)
        peak.null_mass(147)
        area = peak_sum_area(im, peak)
        peak.area = area
        area_dict = peak_top_ion_areas(im, peak)
        peak.ion_areas = area_dict
    # Create an Experiment
    expr = Experiment(expr_code, peak_list)
    # Use the same retention time range for all experiments
    lo_rt_limit = "6.5m"
    hi_rt_limit = "21m"
    print(f"\t -> Selecting retention time range between '{lo_rt_limit}' and '{hi_rt_limit}'")
    expr.sele_rt_range([lo_rt_limit, hi_rt_limit])
    # Save the experiment to disk.
    output_file = output_directory / "experiments" / f"{expr_code}.expr"
    print(f"\t -> Saving the result as '{output_file}'")
    expr.dump(output_file)
-> Processing experiment 'a0806_077'
-> Reading netCDF file '/home/vagrant/PyMassSpec/pyms-data/a0806_077.cdf'
-> Number of Peaks found: 225
-> Executing peak post-processing and quantification...
-> Selecting retention time range between '6.5m' and '21m'
-> Saving the result as '/home/vagrant/PyMassSpec/pyms-demo/jupyter/output/experiments/a0806_077.expr'
-> Processing experiment 'a0806_078'
-> Reading netCDF file '/home/vagrant/PyMassSpec/pyms-data/a0806_078.cdf'
-> Number of Peaks found: 238
-> Executing peak post-processing and quantification...
-> Selecting retention time range between '6.5m' and '21m'
-> Saving the result as '/home/vagrant/PyMassSpec/pyms-demo/jupyter/output/experiments/a0806_078.expr'
-> Processing experiment 'a0806_079'
-> Reading netCDF file '/home/vagrant/PyMassSpec/pyms-data/a0806_079.cdf'
-> Number of Peaks found: 268
-> Executing peak post-processing and quantification...
-> Selecting retention time range between '6.5m' and '21m'
-> Saving the result as '/home/vagrant/PyMassSpec/pyms-demo/jupyter/output/experiments/a0806_079.expr'
The previous set of data files all belong to the same experimental condition. That is, they represent one group, and any comparison between them is a within-group comparison. For the original experiment, another set of GC-MS data was collected for a different experimental condition. This group must also be stored as a set of experiments, and can be used for between-group comparisons.
The second set of data files are named a0806_140, a0806_141, and a0806_142, and are processed and stored as above.
In the example notebook, you can uncomment the line in code cell 2 and run the notebook again to process the second set of data files.
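For reference, processing cell state B only requires swapping in the second set of experiment codes (the commented-out line in code cell 2); everything else in the processing loop stays the same:

expr_codes = ["a0806_140", "a0806_141", "a0806_142"]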
Note
This example is in pyms-demo/jupyter/Multiple_Experiments.ipynb.
5.2. Dynamic Programming Alignment¶
5.2.1. Example: Within-state alignment of peak lists from multiple experiments¶
In this example the experiments a0806_077, a0806_078, and a0806_079 prepared in the previous example will be aligned, and therefore the notebook Multiple_Experiments.ipynb must be run first to create the files a0806_077.expr, a0806_078.expr, and a0806_079.expr. These files contain the post-processed peak lists from the three experiments.
First, define the path to the directory containing the experiment files and import the required functions.
In [1]:
import pathlib
output_directory = pathlib.Path(".").resolve() / "output"
from pyms.DPA.PairwiseAlignment import PairwiseAlignment, align_with_tree
from pyms.DPA.Alignment import exprl2alignment
from pyms.Experiment import load_expr
Define the input experiments list.
In [2]:
exprA_codes = ["a0806_077", "a0806_078", "a0806_079"]
Read the experiment files from disk and create a list of the loaded Experiment objects.
In [3]:
expr_list = []
for expr_code in exprA_codes:
    file_name = output_directory / "experiments" / f"{expr_code}.expr"
    expr = load_expr(file_name)
    expr_list.append(expr)
Define the within-state alignment parameters.
In [4]:
Dw = 2.5 # rt modulation [s]
Gw = 0.30 # gap penalty
Convert each Experiment object into an Alignment object with the function exprl2alignment().
In [5]:
F1 = exprl2alignment(expr_list)
In this example, there is only one experimental condition, so the alignment object is only for within-group alignment (this special case is called 1-alignment). The variable F1 is a Python list containing three alignment objects.
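A quick sanity check (a minimal sketch; the Alignment class is imported here from pyms.DPA.Alignment, the same module that provides exprl2alignment()):

from pyms.DPA.Alignment import Alignment

print(len(F1))                       # 3 -- one Alignment object per Experiment
print(isinstance(F1[0], Alignment))  # True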
Perform the pairwise alignment. The class PairwiseAlignment calculates the similarity between all peaks in one sample and those of another sample. This is done for all possible pairwise alignments (2-alignments).
In [6]:
T1 = PairwiseAlignment(F1, Dw, Gw)
Calculating pairwise alignments for 3 alignments (D=2.50, gap=0.30)
-> 2 pairs remaining
-> 1 pairs remaining
-> 0 pairs remaining
-> Clustering 6 pairwise alignments.Done
The parameters for the alignment by dynamic programming are: Dw, the retention time modulation in seconds; and Gw, the gap penalty. These parameters are explained in detail in [1].
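As a rough illustration of the role of Dw, the peak-by-peak similarity underlying the alignment combines mass spectral similarity with a penalty on the difference in apex retention times. The following is a conceptual sketch only (following the general form described in [1], not the exact PyMS implementation); numpy is assumed to be available:

import numpy as np

def similarity_sketch(spec1, spec2, rt1, rt2, D=2.5):
    # Conceptual sketch only -- not the PyMS implementation.
    spec1 = np.asarray(spec1, dtype=float)
    spec2 = np.asarray(spec2, dtype=float)
    # Spectral similarity: cosine of the angle between the two mass spectra.
    cos = spec1 @ spec2 / (np.linalg.norm(spec1) * np.linalg.norm(spec2))
    # Retention time modulation: a Gaussian penalty controlled by D (seconds),
    # so the score decays as the peak apexes drift apart in time.
    return cos * np.exp(-0.5 * ((rt1 - rt2) / D) ** 2)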
The output of PairwiseAlignment (T1) is an object which contains the dendrogram tree that maps the similarity relationship between the input 1-alignments, and also the 1-alignments themselves.
The function align_with_tree() then takes the object T1 and aligns the individual alignment objects according to the guide tree.
In [7]:
A1 = align_with_tree(T1, min_peaks=2)
Aligning 3 items with guide tree (D=2.50, gap=0.30)
-> 1 item(s) remaining
-> 0 item(s) remaining
In this example, the individual alignments are three 1-alignments, and the function align_with_tree() first creates a 2-alignment from the two most similar 1-alignments and then adds the third 1-alignment to this to create a 3-alignment.
The parameter min_peaks=2 specifies that any peak column of the data matrix that has fewer than two peaks in the final alignment will be dropped. This is useful to clean up the data matrix of accidental peaks that are not truly observed over the set of replicates.
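To make the effect of min_peaks concrete, here is a conceptual sketch (not the PyMS implementation) of dropping alignment columns that contain fewer than min_peaks matched peaks:

def filter_columns(matrix, min_peaks=2):
    # matrix: one row per experiment, one column per aligned peak position;
    # entries are peak areas, or None where no peak was matched.
    n_cols = len(matrix[0])
    keep = [
        col for col in range(n_cols)
        if sum(row[col] is not None for row in matrix) >= min_peaks
    ]
    return [[row[col] for col in keep] for row in matrix]

# The middle column contains only one matched peak, so it is dropped.
print(filter_columns([[1.0, None, 2.0], [1.1, None, 2.2], [0.9, 5.0, None]]))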
Finally, the resulting 3-alignment is saved by writing alignment tables containing the peak retention times (a_rt.csv) and the corresponding peak areas (a_area.csv). These are plain ASCII files in CSV format.
In [8]:
A1.write_csv(
output_directory / "within_state_alignment" / 'a_rt.csv',
output_directory / "within_state_alignment" / 'a_area.csv',
)
The file a_area.csv contains the data matrix where the corresponding peaks are aligned in the columns and each row corresponds to an experiment. The file a_rt.csv is useful for manually inspecting the alignment.
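The written tables can be inspected with any CSV-aware tool. A minimal sketch using only the Python standard library (the file path matches the write_csv() call above):

import csv

with open(output_directory / "within_state_alignment" / "a_area.csv", newline="") as fh:
    rows = list(csv.reader(fh))

print(f"{len(rows)} rows x {len(rows[0])} fields")
for row in rows:
    print(row[:5])  # first few fields of each row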
5.2.2. Example: Between-state alignment of peak lists from multiple experiments¶
In the previous example the peak lists were aligned within a single experimental state comprising multiple replicate experiments (“within-state alignment”). In practice, it is of more interest to compare the two experimental states.
In a typical experimental setup there can be multiple replicate experiments for each experimental state or condition. To analyze the results of such an experiment statistically, the peak lists need to be aligned within each experimental state and also between the states. The result of such an alignment is a data matrix of integrated peak areas. The data matrix contains a row for each sample, and the number of columns is determined by the number of unique peaks (metabolites) detected across all the experiments.
In principle, all experiments could be aligned across conditions and replicates in a single process. However, a more robust approach is to first align experiments within each set of replicates (within-state alignment), and then to align the resulting alignments (between-state alignment) [1].
This example demonstrates how the peak lists from two cell states are aligned:
Cell state A, consisting of three aligned experiments (a0806_077, a0806_078, and a0806_079), and
Cell state B, consisting of three aligned experiments (a0806_140, a0806_141, and a0806_142).
These experiments were created in the notebook Multiple_Experiments.ipynb.
First, perform within-state alignment for cell state B.
In [9]:
exprB_codes = ["a0806_140", "a0806_141", "a0806_142"]
expr_list = []
for expr_code in exprB_codes:
    file_name = output_directory / "experiments" / f"{expr_code}.expr"
    expr = load_expr(file_name)
    expr_list.append(expr)
F2 = exprl2alignment(expr_list)
T2 = PairwiseAlignment(F2, Dw, Gw)
A2 = align_with_tree(T2, min_peaks=2)
A2.write_csv(
output_directory / "within_state_alignment" / 'b_rt.csv',
output_directory / "within_state_alignment" / 'b_area.csv',
)
Calculating pairwise alignments for 3 alignments (D=2.50, gap=0.30)
-> 2 pairs remaining
-> 1 pairs remaining
-> 0 pairs remaining
-> Clustering 6 pairwise alignments.Done
Aligning 3 items with guide tree (D=2.50, gap=0.30)
-> 1 item(s) remaining
-> 0 item(s) remaining
A1 and A2 are the results of the within-group alignments for cell states A and B, respectively. The between-state alignment can now be performed with the following commands:
In [10]:
# Define the between-state alignment parameters.
Db = 10.0 # rt modulation [s]
Gb = 0.30 # gap penalty
T9 = PairwiseAlignment([A1, A2], Db, Gb)
A9 = align_with_tree(T9)
A9.write_csv(
    output_directory / "between_state_alignment" / 'rt.csv',
    output_directory / "between_state_alignment" / 'area.csv')
Calculating pairwise alignments for 2 alignments (D=10.00, gap=0.30)
-> 0 pairs remaining
-> Clustering 2 pairwise alignments.Done
Aligning 2 items with guide tree (D=10.00, gap=0.30)
-> 0 item(s) remaining
Store the aligned peaks to disk.
In [11]:
from pyms.Peak.List.IO import store_peaks
aligned_peaks = A9.aligned_peaks()
store_peaks(aligned_peaks, output_directory / "between_state_alignment" / 'peaks.bin')
In this example the retention time tolerance for the between-state alignment is greater than that used for the within-state alignment, since retention times are expected to be less consistent between the states than within each state. The same functions are used for the within-state and between-state alignments. The result of the alignment is saved to a file as the area and retention time matrices (described above).
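The stored peak list can be re-loaded later, for example for compound identification, using load_peaks() from the same pyms.Peak.List.IO module (a minimal sketch):

from pyms.Peak.List.IO import load_peaks

aligned_peaks = load_peaks(output_directory / "between_state_alignment" / 'peaks.bin')
print(len(aligned_peaks), "aligned peaks re-loaded")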
Note
These examples are in pyms-demo/jupyter/DPA.ipynb.
5.3. Common Ion Area Quantitation¶
Note
This example is in pyms-demo/64
The area.csv file produced in the preceding section lists the total area of each peak in the alignment.
The total area is the sum of the areas of each of the individual ions in the peak.
While this approach produces broadly accurate results, it can result in errors where neighbouring peaks or unfiltered noise add to the peak in some way.
One alternative to this approach is to pick a single ion which is common to a particular peak (compound), and to report only the area of this ion for each occurrence of that peak in the alignment.
Using the method common_ion() of the class Alignment, PyMassSpec can select an ion for each aligned peak which is both abundant and occurs most often for that peak.
We call this the ‘Common Ion Algorithm’ (CIA).
To use this method it is essential that the individual ion areas have been set (see section Individual Ion Areas).
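As a reminder, the ion areas were set during peak post-processing in the Multiple Experiments example above, along the lines of:

>>> from pyms.Peak.Function import peak_top_ion_areas
>>> for peak in peak_list:
...     peak.ion_areas = peak_top_ion_areas(im, peak)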
5.3.1. Using the Common Ion Algorithm¶
When using the CIA for area quantitation, a different method of the class Alignment is used to write the area matrix: write_common_ion_csv(). This requires a list of the common ions for each peak in the alignment. This list is generated using the Alignment class method common_ion().
Continuing from the previous example, the following invokes common ion filtering on the previously created alignment object A9:
>>> common_ion_list = A9.common_ion()
The variable ‘common_ion_list’ is a list of the common ion for each peak in the alignment. This list is the same length as the alignment. To write peak areas using common ion quantitation:
>>> A9.write_common_ion_csv('output/area_common_ion.csv', common_ion_list)