5. Peak alignment by dynamic programming

PyMS provides functions to align GC-MS peaks by dynamic programming 1. The peak alignment by dynamic programming uses both peak apex retention time and mass spectra. This information is determined from the raw GC-MS data by applying a series of processing steps to produce data that can then be aligned and used for statistical analysis. The details are described in this chapter.

5.1. Preparation of multiple experiments for peak alignment by dynamic programming

5.1.1. Example: Creating an Experiment

Before aligning peaks from multiple experiments, the peak objects need to be created and encapsulated into Experiment objects. During this process it is often useful to pre-process the peaks in some way, for example to null certain m/z channels and/or to select a certain retention time range.

The procedure starts the same as in the previous examples, namely:

  1. read a file,

  2. bin the data into fixed mass values,

  3. smooth the data,

  4. remove the baseline,

  5. deconvolute peaks,

  6. filter the peaks,

  7. set the mass range,

  8. remove uninformative ions, and

  9. estimate peak areas.

First, setup the paths to the datafiles and the output directory, then import ANDI_reader and build_intensity_matrix_i.

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.GCMS.IO.ANDI import ANDI_reader
from pyms.IntensityMatrix import build_intensity_matrix_i

Read the raw data file and build the IntensityMatrix.

In [2]:
andi_file = data_directory / "a0806_077.cdf"
data = ANDI_reader(andi_file)
im = build_intensity_matrix_i(data)
-> Reading netCDF file '/home/vagrant/PyMassSpec/pyms-data/a0806_077.cdf'

Preprocess the data (Savitzky-Golay smoothing and Tophat baseline detection)

In [3]:
from pyms.Noise.SavitzkyGolay import savitzky_golay
from pyms.TopHat import tophat

n_scan, n_mz = im.size

for ii in range(n_mz):
    ic = im.get_ic_at_index(ii)
    ic1 = savitzky_golay(ic)
    ic_smooth = savitzky_golay(ic1)  # Why the second pass here?
    ic_bc = tophat(ic_smooth, struct="1.5m")
    im.set_ic_at_index(ii, ic_bc)

Now the Biller and Biemann based technique can be applied to detect peaks.

In [4]:
from pyms.BillerBiemann import BillerBiemann
pl = BillerBiemann(im, points=9, scans=2)
len(pl)
1191

Trim the peak list by relative intensity

In [5]:
from pyms.BillerBiemann import rel_threshold, num_ions_threshold
apl = rel_threshold(pl, percent=2)
len(apl)
1191

Trim the peak list by noise threshold

In [6]:
peak_list = num_ions_threshold(apl, n=3, cutoff=3000)
len(peak_list)
225

Set the mass range, remove unwanted ions and estimate the peak area

In [7]:
from pyms.Peak.Function import peak_sum_area

for peak in peak_list:
    peak.crop_mass(51, 540)

    peak.null_mass(73)
    peak.null_mass(147)

    area = peak_sum_area(im, peak)
    peak.area = area

Create an Experiment.

In [8]:
from pyms.Experiment import Experiment

expr = Experiment("a0806_077", peak_list)

Set the time range for all Experiments

In [9]:
expr.sele_rt_range(["6.5m", "21m"])

Save the experiment to disk.

In [10]:
expr.dump(output_directory / "experiments" / "a0806_077.expr")

Note

This example is in pyms-demo/jupyter/Experiment.ipynb.

5.1.2. Example: Creating Multiple Experiments

In example three GC-MS experiments are prepared for peak alignment. The experiments are named a0806_077, a0806_078, a0806_079, and represent separate GC-MS sample runs from the same biological sample.

The procedure is the same as for the previous example, but is repeated three times.

First, setup the paths to the datafiles and the output directory, then import the required functions.

In [1]:
import pathlib
data_directory = pathlib.Path(".").resolve().parent.parent / "pyms-data"
# Change this if the data files are stored in a different location

output_directory = pathlib.Path(".").resolve() / "output"

from pyms.BillerBiemann import BillerBiemann, num_ions_threshold, rel_threshold
from pyms.Experiment import Experiment
from pyms.GCMS.IO.ANDI import ANDI_reader
from pyms.IntensityMatrix import build_intensity_matrix_i
from pyms.Noise.SavitzkyGolay import savitzky_golay
from pyms.Peak.Function import peak_sum_area, peak_top_ion_areas
from pyms.TopHat import tophat

Define the data files to process

In [2]:
expr_codes = ["a0806_077", "a0806_078", "a0806_079"]
# expr_codes = ["a0806_140", "a0806_141", "a0806_142"]

Loop over the experiments and perform the processing.

In [3]:
for expr_code in expr_codes:

    print(f" -> Processing experiment '{expr_code}'")

    andi_file = data_directory / f"{expr_code}.cdf"

    data = ANDI_reader(andi_file)

    im = build_intensity_matrix_i(data)

    n_scan, n_mz = im.size

    # Preprocess the data (Savitzky-Golay smoothing and Tophat baseline detection)

    for ii in range(n_mz):
        ic = im.get_ic_at_index(ii)
        ic1 = savitzky_golay(ic)
        ic_smooth = savitzky_golay(ic1)  # Why the second pass here?
        ic_bc = tophat(ic_smooth, struct="1.5m")
        im.set_ic_at_index(ii, ic_bc)

    # Peak detection
    pl = BillerBiemann(im, points=9, scans=2)

    # Trim the peak list by relative intensity
    apl = rel_threshold(pl, percent=2)

    # Trim the peak list by noise threshold
    peak_list = num_ions_threshold(apl, n=3, cutoff=3000)

    print("\t -> Number of Peaks found:", len(peak_list))

    print("\t -> Executing peak post-processing and quantification...")

    # Set the mass range, remove unwanted ions and estimate the peak area
    # For peak alignment, all experiments must have the same mass range

    for peak in peak_list:
        peak.crop_mass(51, 540)

        peak.null_mass(73)
        peak.null_mass(147)

        area = peak_sum_area(im, peak)
        peak.area = area
        area_dict = peak_top_ion_areas(im, peak)
        peak.ion_areas = area_dict

    # Create an Experiment
    expr = Experiment(expr_code, peak_list)

    # Use the same retention time range for all experiments
    lo_rt_limit = "6.5m"
    hi_rt_limit = "21m"

    print(f"\t -> Selecting retention time range between '{lo_rt_limit}' and '{hi_rt_limit}'")

    expr.sele_rt_range([lo_rt_limit, hi_rt_limit])

    # Save the experiment to disk.
    output_file = output_directory / "experiments" / f"{expr_code}.expr"
    print(f"\t -> Saving the result as '{output_file}'")

    expr.dump(output_file)
-> Processing experiment 'a0806_077'
-> Reading netCDF file '/home/vagrant/PyMassSpec/pyms-data/a0806_077.cdf'
    -> Number of Peaks found: 225
    -> Executing peak post-processing and quantification...
    -> Selecting retention time range between '6.5m' and '21m'
    -> Saving the result as '/home/vagrant/PyMassSpec/pyms-demo/jupyter/output/experiments/a0806_077.expr'
-> Processing experiment 'a0806_078'
-> Reading netCDF file '/home/vagrant/PyMassSpec/pyms-data/a0806_078.cdf'
    -> Number of Peaks found: 238
    -> Executing peak post-processing and quantification...
    -> Selecting retention time range between '6.5m' and '21m'
    -> Saving the result as '/home/vagrant/PyMassSpec/pyms-demo/jupyter/output/experiments/a0806_078.expr'
-> Processing experiment 'a0806_079'
-> Reading netCDF file '/home/vagrant/PyMassSpec/pyms-data/a0806_079.cdf'
    -> Number of Peaks found: 268
    -> Executing peak post-processing and quantification...
    -> Selecting retention time range between '6.5m' and '21m'
    -> Saving the result as '/home/vagrant/PyMassSpec/pyms-demo/jupyter/output/experiments/a0806_079.expr'

The previous set of data all belong to the same experimental condition. That is, they represent one group and any comparison between the data is a within group comparison. For the original experiment, another set of GC-MS data was collected for a different experimental condition. This group must also be stored as a set of experiments, and can be used for between group comparison.

The second set of data files are named a0806_140, a0806_141, and a0806_142, and are processed and stored as above.

In the example notebook, you can uncomment the line in code cell 2 and run the notebook again to process the second set of data files.

Note

This example is in pyms-demo/jupyter/Multiple_Experiments.ipynb.

5.2. Dynamic Programming Alignment

5.2.1. Example: Within-state alignment of peak lists from multiple experiments

In this example the experiments a0806_077, a0806_078, and a0806_079 prepared in the previous example will be aligned, and therefore the notebook Multiple_Experiments.ipynb must be run first to create the files a0806_077.expr, a0806_078.expr, a0806_079.expr. These files contain the post-processed peak lists from the three experiments.

First, determine the directory to the experiment files and import the required functions.

In [1]:
import pathlib
output_directory = pathlib.Path(".").resolve() / "output"

from pyms.DPA.PairwiseAlignment import PairwiseAlignment, align_with_tree
from pyms.DPA.Alignment import exprl2alignment
from pyms.Experiment import load_expr

Define the input experiments list.

In [2]:
exprA_codes = ["a0806_077", "a0806_078", "a0806_079"]

Read the experiment files from disk and create a list of the loaded Experiment objects.

In [3]:
expr_list = []

for expr_code in exprA_codes:
    file_name = output_directory / "experiments" / f"{expr_code}.expr"
    expr = load_expr(file_name)
    expr_list.append(expr)

Define the within-state alignment parameters.

In [4]:
Dw = 2.5  # rt modulation [s]
Gw = 0.30 # gap penalty

Convert each Experiment object is converted into an Alignment object with the function exprl2alignment()..

In [5]:
F1 = exprl2alignment(expr_list)

In this example, there is only one experimental condition so the alignment object is only for within group alignment (this special case is called 1-alignment). The variable F1 is a Python list containing three alignment objects.

Perform pairwise alignment. The class |pyms.DPA.Class.PairwiseAlignment| calculates the similarity between all peaks in one sample with those of another sample. This is done for all possible pairwise alignments (2-alignments).

In [6]:
T1 = PairwiseAlignment(F1, Dw, Gw)
Calculating pairwise alignments for 3 alignments (D=2.50, gap=0.30)
-> 2 pairs remaining
-> 1 pairs remaining
-> 0 pairs remaining
-> Clustering 6 pairwise alignments.Done

The parameters for the alignment by dynamic programming are: Dw, the retention time modulation in seconds; and Gw, the gap penalty. These parameters are explained in detail in 1.

The output of PairwiseAlignment (T1) is an object which contains the dendrogram tree that maps the similarity relationship between the input 1-alignments, and also 1-alignments themselves.

The function align_with_tree() then takes the object T1 and aligns the individual alignment objects according to the guide tree.

In [7]:
A1 = align_with_tree(T1, min_peaks=2)
Aligning 3 items with guide tree (D=2.50, gap=0.30)
-> 1 item(s) remaining
-> 0 item(s) remaining

In this example, the individual alignments are three 1-alignments, and the function align_with_tree() first creates a 2-alignment from the two most similar 1-alignments and then adds the third 1-alignment to this to create a 3-alignment.

The parameter min_peaks=2 specifies that any peak column of the data matrix that has fewer than two peaks in the final alignment will be dropped. This is useful to clean up the data matrix of accidental peaks that are not truly observed over the set of replicates.

Finally, the resulting 3-alignment is saved by writing alignment tables containing peak retention times (rt.csv) and the corresponding peak areas (area.csv). These are plain ASCII files in CSV format.

In [8]:
A1.write_csv(
        output_directory / "within_state_alignment" / 'a_rt.csv',
        output_directory / "within_state_alignment" / 'a_area.csv',
        )

The file area1.csv contains the data matrix where the corresponding peaks are aligned in the columns and each row corresponds to an experiment. The file rt1.csv is useful for manually inspecting the alignment.

5.2.2. Example: Between-state alignment of peak lists from multiple experiments

In the previous example the list of peaks were aligned within a single experiment with multiple replicates (“within-state alignment”). In practice, it is of more interest to compare the two experimental states.

In a typical experimental setup there can be multiple replicate experiments on each experimental state or condition. To analyze the results of such an experiment statistically, the list of peaks need to be aligned within each experimental state and also between the states. The result of such an alignment would be the data matrix of integrated peak areas. The data matrix contains a row for each sample and the number of columns is determined by the number of unique peaks (metabolites) detected in all the experiments.

In principle, all experiments could be aligned across conditions and replicates in the one process. However, a more robust approach is to first align experiments within each set of replicates (within-state alignment), and then to align the resulting alignments (between-state alignment) 1.

This example demonstrates how the peak lists from two cell states are aligned.

  • Cell state A, consisting of three aligned experiments (a0806_077, a0806_078, and a0806_079), and

  • Cell state B, consisting of three aligned experiments (a0806_140, a0806_141, and a0806_142).

These experiments were created in the notebook Multiple_Experiments.ipynb.

First, perform within-state alignment for cell state B.

In [9]:
exprB_codes = ["a0806_140", "a0806_141", "a0806_142"]

expr_list = []

for expr_code in exprB_codes:
    file_name = output_directory / "experiments" / f"{expr_code}.expr"
    expr = load_expr(file_name)
    expr_list.append(expr)

F2 = exprl2alignment(expr_list)
T2 = PairwiseAlignment(F2, Dw, Gw)
A2 = align_with_tree(T2, min_peaks=2)

A2.write_csv(
        output_directory / "within_state_alignment" / 'b_rt.csv',
        output_directory / "within_state_alignment" / 'b_area.csv',
        )
Calculating pairwise alignments for 3 alignments (D=2.50, gap=0.30)
-> 2 pairs remaining
-> 1 pairs remaining
-> 0 pairs remaining
-> Clustering 6 pairwise alignments.Done
Aligning 3 items with guide tree (D=2.50, gap=0.30)
-> 1 item(s) remaining
-> 0 item(s) remaining

A1 and A2 are the results of the within group alignments for cell state A and B, respectively. The between-state alignment can be performed as follows alignment commands:

In [10]:
# Define the within-state alignment parameters.
Db = 10.0 # rt modulation
Gb = 0.30 # gap penalty

T9 = PairwiseAlignment([A1,A2], Db, Gb)
A9 = align_with_tree(T9)

A9.write_csv(
        output_directory / "between_state_alignment" / 'rt.csv',
        output_directory / "between_state_alignment" / 'area.csv')
Calculating pairwise alignments for 2 alignments (D=10.00, gap=0.30)
-> 0 pairs remaining
-> Clustering 2 pairwise alignments.Done
Aligning 2 items with guide tree (D=10.00, gap=0.30)
-> 0 item(s) remaining

Store the aligned peaks to disk.

In [11]:
from pyms.Peak.List.IO import store_peaks

aligned_peaks = A9.aligned_peaks()
store_peaks(aligned_peaks, output_directory / "between_state_alignment" / 'peaks.bin')

In this example the retention time tolerance for between-state alignment is greater compared to the retention time tolerance for the within-state alignment as we expect less fidelity in retention times between them. The same functions are used for the within-state and between-state alignment. The result of the alignment is saved to a file as the area and retention time matrices (described above).

Note

These examples are in pyms-demo/jupyter/DPA.ipynb.

5.3. Common Ion Area Quantitation

Note

This example is in pyms-demo/64

The area.csv file produced in the preceding section lists the total area of each peak in the alignment. The total area is the sum of the areas of each of the individual ions in the peak. While this approach produces broadly accurate results, it can result in errors where neighbouring peaks or unfiltered noise add to the peak in some way.

One alternative to this approach is to pick a single ion which is common to a particular peak (compound), and to report only the area of this ion for each occurrence of that peak in the alignment. Using the method common_ion() of the class Alignment, PyMassSpec can select an ion for each aligned peak which is both abundant and occurs most often for that peak. We call this the ‘Common Ion Algorithm’ (CIA).

To use this method it is essential that the individual ion areas have been set (see section Individual Ion Areas).

5.3.1. Using the Common Ion Algorithm

When using the CIA for area quantitation, a different method of the class Alignment is used to write the area matrix; write_common_ion_csv(). This requires a list of the common ions for each peak in the alignment. This list is generated using the Alignment class method common_ion().

Continuing from the previous example, the following invokes common ion filtering on previously created alignment object ‘A9’:

>>> common_ion_list = A9.common_ion()

The variable ‘common_ion_list’ is a list of the common ion for each peak in the alignment. This list is the same length as the alignment. To write peak areas using common ion quantitation:

>>> A9.write_common_ion_csv('output/area_common_ion.csv',common_ion_list)

5.4. References

1(1,2,3)

Robinson MD, De Souza DP, Keen WW, Saunders EC, McConville MJ, Speed TP, and Likic VA. A dynamic programming approach for the alignment of signal peaks in multiple gas chromatography-mass spectrometry experiments. BMC Bioinformatics, 8:419, 2007