Recent advancements in atmospheric mass spectrometry provide huge amounts of new information but at the same time present considerable challenges for data analysts. High-resolution (HR) peak identification and separation can be laborious and time-consuming, yet still difficult and inaccurate, due to the complexity of overlapping peaks, especially at larger mass-to-charge ratios. This study presents a simple and novel method, mass spectral binning combined with positive matrix factorization (binPMF), to address these problems. Unlike unit mass resolution (UMR) analysis or HR peak fitting, which are the routine data analysis approaches for mass spectrometry datasets, binPMF divides the mass spectra into small bins and takes advantage of the strength of positive matrix factorization (PMF) in separating different sources or processes based on their different temporal patterns. In this study, we applied the novel approach to both ambient and synthetic datasets to evaluate its performance. It not only succeeded in separating overlapping ions but was also found to be sensitive to subtle variations. Being fast and reliable, binPMF requires no a priori peak information and can save much time and effort compared with conventional HR peak fitting, while still utilizing nearly the full potential of HR mass spectra. In addition, we identify several future improvements and applications for binPMF and believe it will become a powerful approach in the data analysis of mass spectra.
Volatile organic compounds (VOCs) are emitted to the atmosphere from both biogenic and anthropogenic sources (Guenther et al., 1995; Wei et al., 2008). After oxidation, these gaseous species can partition to the particle phase and contribute to atmospheric organic aerosol (OA), a major component of tropospheric particulate matter (Zhang et al., 2007). The chemical components, in both the particulate (OA) and gaseous phases (VOCs and their oxidation products), play important roles in many atmospheric physical and chemical processes. They can degrade air quality and cause adverse health effects, and aerosol particles can decrease visibility and influence Earth's climate by altering the radiative balance (Stocker et al., 2013; Zhang et al., 2016; Pope III et al., 2009; Shiraiwa et al., 2017).
Recent instrumental advances in mass spectrometry have greatly enhanced our
capability to investigate the chemical composition and evolution of aerosol
particles and their precursors. The Aerodyne aerosol mass spectrometer (AMS)
is widely applied in atmospheric research (Canagaratna et al., 2007),
measuring the bulk composition and temporal behavior of the nonrefractory
aerosol, and has successfully identified different/unique OA sources
utilizing factor analysis (Jimenez et al., 2009; Zhang et al., 2011). With
the development of gas-phase chemical ionization mass spectrometry (CIMS)
(Huey, 2007) and the commercially available time-of-flight (TOF)-CIMS (Bertram et al., 2011)
and CI-APi-TOF (chemical ionization atmospheric pressure interface
time-of-flight mass spectrometer; Jokinen et al., 2012), these instruments
are becoming more popular in atmospheric chemistry research. Due to these
new advances, the detection methods for aerosol precursor vapors and the
understanding of their formation mechanisms have been greatly improved. For
example, the discovery of highly oxygenated molecules (HOM) by the
CI-APi-TOF has led to increased knowledge regarding atmospheric oxidation
pathways, with large implications for secondary organic aerosol (SOA) and new
particle formation (Ehn et al., 2014; Jokinen et al., 2015; Kirkby et al.,
2016; Yan et al., 2016). In particular, biogenic VOCs such as monoterpenes
(
While a mass spectrum can contain large amounts of information representing
the highly complex nature of the atmospheric sample, it also presents
considerable challenges for the analysis and interpretation of the data. One
example of such a challenge is the identification and separation of peaks
with similar but not identical masses. A single integer mass can contain
tens of distinct ions, with mass-to-charge ratios (
Figure 1 depicts a concrete example, measured by a nitrate-based CI-APi-TOF,
where peak separation is not large enough to allow unambiguous fitting of
all the ions, and the final result will depend on which ions the analyst
chooses to include. As the
Example of traditional HR peak fitting. Potential peak fitting at
Another typical analysis approach is to utilize only the unit mass resolution, or UMR, data. As opposed to high-resolution fitting, where the signals of individual ions are separated from the total measured signal, in UMR analysis all signals at a given integer mass are integrated and treated together. This approach is more straightforward and less subjective than HR fitting but loses all possible high-resolution details in the spectrum (see Fig. 2).
Conceptual comparison of traditional methods (UMR and HR) and binned mass spectra for PMF analysis. The raw data signal is shown on the left and contains eight ions. In UMR analysis, the information on the eight individual ions is lost entirely. Using an analyst-determined peak list, HR analysis attempts to separate signals at this mass by fitting selected ions. By binning the spectra, we utilize the HR information without requiring any a priori information.
Even with perfect high-resolution peak fits, a spectrum typically contains information on hundreds, if not thousands, of ions, many of which come from similar sources. This wealth of data itself presents a challenge for data analysis. Factor analysis enables the reduction of data dimensions and can help to apportion the signals to factors. These factors may correspond to different sources or formation processes. Positive matrix factorization (PMF) (Paatero and Tapper, 1994) has been widely utilized in environmental sciences and has been applied to UMR and HR AMS data, succeeding in identifying multiple OA sources (Lanz et al., 2008; Ulbrich et al., 2009; Sun et al., 2011; Zhang et al., 2011). Compared to AMS data, PMF has been applied to CIMS data much less frequently. To our knowledge, only Yan et al. (2016) and Massoli et al. (2018) have reported PMF analysis of nitrate-based CI-APi-TOF data, utilizing UMR and HR data, respectively.
UMR-PMF cannot utilize the full information content provided by HR mass
spectrometers but is more straightforward to apply. In contrast, accurate
HR peak fitting can better preserve the information content of the raw data
than UMR and thus provide more information to PMF, resulting in more
interpretable results. However, incorrectly fitted peaks can severely
disturb the PMF modeling and the factor interpretation. In addition, mass
spectra from iodide-adduct TOF-CIMS (Lee et al., 2014) often contain more
peaks per mass than the
In this study, a novel, yet simple and reliable, data analysis method, binned mass spectra combined with PMF (binPMF), is proposed to try to tackle the abovementioned problems in both HR and UMR PMF. Instead of using traditional UMR or HR fitting techniques for the mass spectra, we binned the mass spectra prior to PMF analysis (Fig. 2). We applied binPMF to both ambient and synthetic datasets, succeeding in separating the key components of different sources/processes. Compared to UMR PMF, binPMF preserves more of the high-resolution information content of the mass spectra, without the immense effort and subjectivity associated with high-resolution peak fitting. As a result, this novel method can improve our understanding of sources/formation processes governing the particulate and gaseous phases in more detail and in a less subjective manner.
We divided the mass spectra into narrow bins as presented in Fig. 2 and
carried out PMF analysis to extract more information from the dataset.
Details on the data preparation (binning the mass spectra) and error
estimation for the PMF input are discussed in Sect. 2.2 and 2.3. To
test the performance of binPMF under different scenarios, we first
constructed synthetic datasets, using a simple one-/two-mass system (Sect. 2.4.1). In the second step, we applied binPMF to an ambient dataset measured
with a
The PMF model was developed by Paatero and Tapper (1994) and has been widely applied in the analysis of various types of environmental data ever since (Zhang et al., 2017; Yan et al., 2016; Ulbrich et al., 2009; Song et al., 2007). By decomposing the observed dataset into different factors, PMF helps to simplify the complex data matrix and extract useful information contained within it. Compared to other common source apportionment tools, like chemical mass balance (CMB) (Schauer et al., 1996), PMF requires no prior knowledge of source information as essential input. Nevertheless, as a statistical method, PMF does require more data as input, which is typically not a problem for environmental mass spectrometry datasets. The main distinction of PMF from other factor analysis techniques is that PMF utilizes a least squares minimization scheme weighted with data uncertainties, as well as nonnegativity constraints, to minimize the ambiguity caused by rotation of the factors (Huang et al., 1999; Paatero and Tapper, 1994).
In PMF modeling, a measurement of chemical species is assumed to be a sum
of contributions from several relatively fixed sources/processes. The
measured data matrix is broken down to two smaller matrices and a residual
term as follows:
To find the solution, the PMF model utilizes uncertainty estimates for each
element in the data matrix
One of the problems in any factorization analysis is rotational ambiguity, i.e., the existence of an infinite number of similar solutions that PMF can generate (Paatero et al., 2002; Henry, 1987). Generally, the nonnegativity constraint alone is not sufficient for solution uniqueness. Rotating a given solution and assessing the rotated results is one possible way to determine the most physically reasonable solution. Known source profiles or source contributions can also serve as constraints. In addition, if there is a sufficient number of time points when the contribution of a source is nearly zero, independent of other sources, rotational uniqueness of solutions can be achieved (Paatero et al., 2002). The same is true if specific variables in the profiles go to zero. Otherwise, the correct solution (correct rotation) may only be obtained by skillful use of rotational tools. Ambient measurement data can often contain zero values in most sources/processes, greatly reducing rotational ambiguity of the PMF results. The issue of rotational ambiguity is not explored in detail in this paper, as it is common to all PMF approaches, and the main purpose here is to illustrate the new methodology of binPMF. All the solutions shown in this study were obtained without considering their rotational uniqueness. Finally, we note that, in addition to rotational ambiguity, binPMF also inherits all other fundamental limitations and strengths of the underlying PMF method.
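The bilinear model and the uncertainty-weighted least squares objective described in this section can be illustrated with a minimal sketch. It assumes the standard PMF notation, in which the data matrix X is approximated by the product of a time series matrix G and a profile matrix F, and the objective is Q, the sum over all elements of ((X - GF) / sigma)^2 with G and F nonnegative. The function name `weighted_nmf`, the multiplicative-update solver and all parameter values are illustrative assumptions; the solver is a simple stand-in chosen for brevity and is not one of the dedicated PMF solvers.

```python
import numpy as np

def weighted_nmf(X, sigma, n_factors, n_iter=500, seed=0):
    """Minimise Q = sum(((X - G @ F) / sigma) ** 2) subject to G, F >= 0.

    Multiplicative updates for weighted nonnegative matrix factorization,
    used here only as a stand-in for a dedicated PMF solver.
    X and sigma are (time points x bins) arrays; X must be nonnegative.
    """
    rng = np.random.default_rng(seed)
    n_time, n_bins = X.shape
    W = 1.0 / sigma ** 2                          # per-element weights
    G = rng.random((n_time, n_factors)) + 0.1     # factor time series
    F = rng.random((n_factors, n_bins)) + 0.1     # factor profiles
    eps = 1e-12                                   # avoids division by zero
    for _ in range(n_iter):
        G *= ((W * X) @ F.T) / ((W * (G @ F)) @ F.T + eps)
        F *= (G.T @ (W * X)) / (G.T @ (W * (G @ F)) + eps)
    Q = float(np.sum(((X - G @ F) / sigma) ** 2))
    return G, F, Q

# Hypothetical usage on a small noise-free two-factor matrix
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.random((200, 2)) @ demo_rng.random((2, 30))
sigma_demo = np.full_like(X_demo, 0.05)           # assumed constant uncertainty
G_demo, F_demo, Q_demo = weighted_nmf(X_demo, sigma_demo, n_factors=2)
```

Because different random starting points can converge to rotated but similarly good solutions, rerunning the sketch with different `seed` values is a simple way to see the rotational ambiguity discussed above.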
Instead of UMR or HR fitting of the mass spectra, the mass spectra were
divided into small bins after mass calibration (Figs. 2 and S1 in
the Supplement). Data were first linearly interpolated to a mass interval of
0.001 Th and then divided into bins of 0.02 Th width. At an integer mass
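
The binning step described above can be sketched as follows. The 0.001 Th interpolation step and the 0.02 Th bin width are taken from the text; the function name `bin_spectrum`, the mass window around the integer mass and the choice to sum (rather than average) the interpolated points within each bin are assumptions made for illustration only.

```python
import numpy as np

def bin_spectrum(mz, signal, unit_mass, lo=-0.2, hi=0.3,
                 step=0.001, bin_width=0.02):
    """Interpolate one spectrum to a regular grid and sum it into narrow bins.

    step and bin_width follow the values given in the text. The window
    [unit_mass + lo, unit_mass + hi) is a hypothetical choice.
    """
    points_per_bin = int(round(bin_width / step))
    n_bins = int(round((hi - lo) / bin_width))
    # Regular grid holding exactly n_bins * points_per_bin points
    grid = unit_mass + lo + step * np.arange(n_bins * points_per_bin)
    fine = np.interp(grid, mz, signal)            # linear interpolation
    # Summing within a bin is an assumption; averaging would also work
    binned = fine.reshape(n_bins, points_per_bin).sum(axis=1)
    centers = unit_mass + lo + bin_width * (np.arange(n_bins) + 0.5)
    return centers, binned
```

Applying such a function to every spectrum in a time series and stacking the results row by row gives a (time points x bins) matrix of the kind used as PMF input.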
Besides the data matrix, an error matrix describing the expected uncertainty
for each element in the data matrix is also required as input in PMF
analysis. Here, the error matrix (Polissar et al., 1998) is estimated as
This study utilized both ambient and synthetic datasets to test the
performance of binPMF. The ambient data were collected at the SMEAR II
station (Station for Measuring Ecosystem–Atmosphere Relations; Hari and
Kulmala, 2005) in the boreal forest in Hyytiälä, southern Finland.
Located in a rural forest area, the station has a wide range of continuous
measurements of meteorology, aerosol and gas-phase properties year-round.
There are no strong anthropogenic sources close to the site but two
sawmills 5
As a first test of the performance of binPMF, we generated a series of
synthetic datasets based on two distinct sources. Each synthetic dataset
Conceptual schematic diagram for the synthetic datasets. Panels
As shown in Fig. 3, each source profile (
Peaks in the synthetic
Twenty-one synthetic experiments were designed, varying the mass difference
between peaks (
With this approach of only using two masses, we purposefully provide a challenging dataset for binPMF, as in most real datasets there would be many more masses to help constrain the final solutions. Nevertheless, as we will show, this simple synthetic dataset already provided a wealth of useful information in the results attainable with binPMF and provided a good comparison to the traditional HR fitting approach.
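A synthetic dataset of this general kind can be sketched in a few lines. The example below builds a one-mass system in which two sources each contribute one Gaussian peak, with independent random time series. All names and numeric values (the mass region, the FWHM, the number of time points and the absence of noise) are hypothetical and are not the settings used for the experiments in this study.

```python
import numpy as np

def gaussian(mz, center, fwhm):
    """Unit-height Gaussian peak with the given full width at half maximum."""
    s = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    return np.exp(-0.5 * ((mz - center) / s) ** 2)

def make_synthetic(delta_mz, n_time=100, fwhm=0.06, seed=0):
    """Two sources, each with one Gaussian peak at the same integer mass.

    delta_mz is the mass difference between the two peaks in Th.
    All numeric defaults are hypothetical illustration values.
    """
    rng = np.random.default_rng(seed)
    mz = 338.8 + 0.02 * (np.arange(25) + 0.5)     # bin centres, 0.02 Th wide
    profile_a = gaussian(mz, 339.05, fwhm)
    profile_b = gaussian(mz, 339.05 + delta_mz, fwhm)
    ts_a = rng.random(n_time)                     # independent random series
    ts_b = rng.random(n_time)
    X = np.outer(ts_a, profile_a) + np.outer(ts_b, profile_b)
    return mz, X, (ts_a, ts_b), (profile_a, profile_b)

mz_syn, X_syn, ts_syn, prof_syn = make_synthetic(delta_mz=0.03)
```

Varying `delta_mz` across runs mimics the idea of testing how closely spaced two peaks can be before they can no longer be separated.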
The ambient dataset was measured at ground level during the Influence of
Biosphere-Atmosphere Interactions on the Reactive Nitrogen budget (IBAIRN)
campaign (Zha et al., 2018) in September 2016. The measurements were
conducted using a
As introduced in Sect. 2.4.1, the synthetic datasets were constructed to
assess the response of binPMF to varied
The analysis procedure of the synthetic dataset is briefly described here.
In all cases, the question of interest is how well binPMF is able to
deconvolve the adjacent peaks A1 and B1 at
In addition to applying binPMF to the synthetic datasets, traditional HR peak fitting was also conducted for comparison (using tofTools in our study). For the tofTools fits, we constrained the peak locations and widths to those originally used for generating the data (Table S1). Peaks fitted by tofTools and peaks fitted to the binPMF factors were compared, as was the correlation of the retrieved time series with the original datasets. More details are presented and discussed in the following sections.
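The two comparison metrics mentioned above, the position of a peak fitted to a factor profile and the correlation of a retrieved time series with the original one, can be sketched as follows. The relative mass error is expressed in parts per million. The function names, the threshold and the log-parabola fitting method are assumptions made for illustration; the fit is a simple stand-in and does not reproduce the tofTools fitting routine.

```python
import numpy as np

def fit_gaussian_center(mz, profile, rel_threshold=0.2):
    """Estimate a Gaussian peak centre by fitting a parabola to log(signal).

    Only bins above rel_threshold * max are used. This assumes a single,
    positive, roughly Gaussian peak; nonlinear least squares would be a
    common alternative.
    """
    keep = profile > rel_threshold * profile.max()
    x0 = mz[keep].mean()                          # centre x for stability
    c2, c1, _ = np.polyfit(mz[keep] - x0, np.log(profile[keep]), 2)
    return x0 - c1 / (2.0 * c2)                   # vertex of the parabola

def mass_accuracy_ppm(fitted_mass, true_mass):
    """Relative mass error in parts per million."""
    return (fitted_mass - true_mass) / true_mass * 1e6

def time_series_r(retrieved, original):
    """Pearson correlation between a retrieved and an original time series."""
    return float(np.corrcoef(retrieved, original)[0, 1])
```

For a factor profile containing several overlapping peaks, a single-peak fit like this one is not sufficient, and a multi-peak fit would be needed instead.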
Peak separation results by a traditional HR fitting method (dashed
lines) and binPMF (solid lines), at the 79th time point
We examined the performance of traditional HR fitting and binPMF by
comparing their results to the original input data. In Fig. 4, the shaded
areas depict the original data, the dashed lines the traditional HR peak
fitting result and the solid lines the binPMF factors. Red and blue
represent source/factor
Characteristics of peaks fitted to binPMF factors. Panels
Figure 5 shows an overview of all the results of peaks fitted with binPMF.
Experiments 1–10 for the one-mass system are shown with green lines and
experiments 11–20 for the two-mass system in yellow. Mass accuracy was
calculated as the difference between fitted peak center mass and the
original mass, divided by the original mass, in parts per million. When the
In addition to the peak positions, we also compared the temporal behavior of
both the binPMF factors and the time series obtained through traditional
fitting to the original time series. When the
Comparison of time series of binPMF and HR fitting. Panels
Based on the results shown above, binPMF was found to be as capable of
separating different peaks as traditional peak fitting techniques when the
two peaks were separated by more than the mass calibration uncertainty (yet
still in all cases by less than the FWHM of the peaks). As the
The peak fitting principles of the traditional method and binPMF are very
different. For example, tofTools fits peaks based on predetermined
instrument parameters (e.g., peak shape and peak width), as well as the peak
location, either as a numeric value or a chemical composition from which
the location is calculated (Junninen et al., 2010). HR peak fitting by
tofTools can be effective if the majority of the components (peaks) are
known and provided in a peak list, which is valuable information for peak
separation that was not provided to binPMF in this study. However, this
information can be hard to obtain due to unknown numbers and/or identities
of all the ions at a given mass, in combination with the limited mass
resolving power of the mass spectrometer. HR peak fitting is also sensitive
to mass calibration error, increasingly so when many ions in close proximity
to each other need to be fit. On the contrary, in binPMF, peaks are
separated based on the temporal variation of masses, which is an inherent
advantage of PMF, even though no information about the peaks is provided beforehand.
To be more specific, a conceptual illustration is shown in Fig. S3. The red peaks belong to source A and the blue peaks to source B.
As mentioned before, the time series of sources A and B were totally
independent and random. The shaded areas (the tails of the peaks), e.g., red
shaded area in Fig. S3a, contained masses that only had significant signal
from peak A1 (left red peak). Similarly, the blue shaded area in Fig. S3a
was mostly from peak B1. The different temporal behaviors of the red and
blue shaded areas helped the separation and correct attribution also in the
regions with overlapped signals. When the
When peaks A2 and B2 (
We note once more that the results of binPMF and traditional HR peak fitting are not totally comparable. Information about the peaks, like the exact peak centroid position, peak width (resolution) and number of peaks, was provided to the traditional fitting method. For binPMF, no prior information about the peaks was given, except for the optimal number of factors, i.e., two.
With the success of binPMF for the synthetic datasets, we applied the new
method to a real ambient dataset. Here we used data collected in September 2016, from Hyytiälä in Finland. The SMEAR II station is a forest
site dominated by monoterpene (
As mentioned above, no prior knowledge was provided to PMF before the analysis. To determine the number of factors for further analysis, we conducted runs with two to eight factors. As the number of factors increased, more information could be extracted from the raw data. However, beyond the optimal number of factors, additional factors may split the physically reasonable factors into meaningless fragments. Many studies have addressed the evaluation of PMF runs and the selection of the number of factors (Zhang et al., 2011; Craven et al., 2012). This is an inherent challenge in any PMF analysis, and not specific to binPMF, and therefore we do not put emphasis on this here. In this study, based on commonly used mathematical parameters and physical interpretation, we chose the seven-factor result, as presented below. Our main aim with this work is to present a proof of concept for the binPMF methodology, and we will therefore not provide a detailed interpretation of all the factors (though several of the factors are easily validated based on earlier studies). The factor evolution from two to eight factors is briefly discussed below.
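One mathematical diagnostic that is widely used in the PMF literature when comparing runs with different numbers of factors is the ratio of the weighted residual sum Q to its expected value. The text does not state which parameters were used here, so the sketch below is only an assumed example of such a diagnostic. It approximates the expected value by the degrees of freedom of the fit, and the function name `q_over_qexp` is hypothetical.

```python
import numpy as np

def q_over_qexp(X, G, F, sigma):
    """Ratio of the weighted residual sum Q to its expected value.

    Q_exp is approximated by the degrees of freedom, n*m - p*(n + m),
    for an n x m data matrix and p factors.
    """
    n, m = X.shape
    p = G.shape[1]
    Q = np.sum(((X - G @ F) / sigma) ** 2)
    return float(Q / (n * m - p * (n + m)))
```

Values far above 1 suggest that the model or the uncertainties underestimate the variability in the data, while the way the ratio changes as factors are added is usually read together with the physical interpretability of the factors.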
From two to six factors,
Comparison of binPMF and UMR-PMF for factor mass spectral profiles
Comparison of binPMF and UMR-PMF for
Figure 7 shows the mass spectral profiles and factor time series for the seven-factor result, while Fig. 8 displays the diurnal trends and factor contributions to the total signal. As shown in Fig. 8a, the seven factors separated by binPMF consist of one nighttime factor (Factor 1), five daytime factors (Factors 2, 3, 4, 5 and 7) and a sawtooth-pattern factor (Factor 6). The same dataset was also analyzed by UMR-PMF, and the corresponding seven-factor results are included in Figs. 7 and 8 for comparison.
Overall, the results of UMR-PMF and binPMF are very similar. UMR-PMF
also resolved one clear nighttime factor and additionally six daytime
factors. For the nighttime factor, both binPMF and UMR-PMF showed comparable
temporal behavior, diurnal trend (peak at 17:00; all times are given in Finnish winter time, UTC+2), mass spectral profiles
(peaks at 340, 308, 325, 342 Th) and factor contribution
(
Despite the similarities, there also existed distinct differences between
the results from binPMF and UMR-PMF. As the most distinctive dissimilarity,
binPMF Factor 6 revealed a contamination factor. This factor was found
to be related to automated instrument zeroing every 3
binPMF factor profiles at
In addition to better resolving certain factors from the data, the binPMF
mass spectral profiles still contain more information than is visible in
Fig. 7, due to the multiple bins at each unit mass. As an example, binPMF
Factor 6 showed masses with clear negative mass defects, e.g., at 324 and
339 Th (Fig. 9). We identified many ions in this factor as different
fluorinated carboxylic acids, which are common interference signals in negative ion CIMS, outgassing from, e.g., Teflon tubing (Brown et al., 2015; Ehn et al.,
2012; Heinritzi et al., 2016). The exact source of these products in our
setup was not established, but it is not surprising that the additional
valves, filters and/or tubing in the zeroing line could have caused this
type of signal to be introduced to the instrument with the zero air. In
general, this finding highlights the usefulness of the binPMF approach,
where factor separation can be performed first, and the specific factor
profiles can be utilized in interpreting the physical meaning of the
different factors. This is in complete contrast to the more traditional
approach, where all ions need to be identified first, and only then can HR
PMF be attempted. As not all ions are going to be observable at all times,
many ions may remain unidentified. For example, if peak identification had
only been done during periods when the HOM signals were high, as in the
case shown in Fig. 9a, the fluorinated ion at 339 Th would not have been
found (contributing only 0.45 % to the total signal at this time point),
even though it on average contributes nearly 10 % of the signal at this
mass over the entire campaign. binPMF, on the other hand, utilized the full
dataset for the identification and was able to separate several ions at 339
Th. By fitting Gaussian signals to the factor profiles, similar to the
synthetic data in Sect. 3.1.2, we see that the two major peaks were fitted
with decent resolution (Fig. 9). Also the contamination factor (Factor 6)
was clearly separated and fitted, and the resolution (3136
The new technique for mass spectra analysis, binPMF, as presented above,
shows clear promise in utilizing HR information while saving time and
effort, as well as decreasing ambiguity related to conventional HR peak
fitting. It is also more sensitive to subtle variation than standard UMR
analysis. We consider this study a successful proof of concept and note that
several future improvements and applications are still foreseeable. We list
some of these below.
Most likely several other improvements to the approach will be identified in
future studies, and simplicity of the analysis remains a critical
consideration. We propose that binPMF is a good tool for initial exploration
of new datasets, at which stage optimizing all parameters is not necessarily
crucial, if the results can help guide further analysis directions. However,
for maximizing the information content that can be extracted from a given
dataset, optimized routines are important.
While recent advances in mass spectrometry have greatly enhanced our understanding of atmospheric chemistry, the increased information content in mass spectra also brings difficulties and challenges to the data analysis. Peak identification and separation can be challenging and ambiguous, as well as extremely time-consuming, and can involve large uncertainties. Constructing peak lists, i.e., deciding which ions to fit to the mass spectra, and validating the results are becoming some of the most labor-intensive parts of the entire work. In this study, we propose a simple and reliable method, binPMF, to try to avoid many of these problems, while still being able to distinguish different chemical pathways/sources in the atmosphere.
Unlike traditional analysis, binned positive matrix factorization (binPMF) divides the mass spectra into small bins before applying PMF to distinguish different types of factors and behavior in the data. This method utilizes more of the available information than classical UMR-PMF and, unlike traditional HR-PMF, requires no prior peak information. We applied binPMF successfully to both ambient and synthetic datasets to test its usefulness under different circumstances.
Traditional HR analysis fits peaks to each mass according to a predefined list and is not able to utilize any information across masses or time. In our analysis of a simple synthetic dataset with two overlapping ions at a single integer mass, we found that binPMF was able to separate the contributions of each ion even in cases where the HR analysis failed completely. This was the case for overlapping ions where binPMF had help in constraining the time series from another integer mass. When applied to an ambient dataset of HOM measured by a CI-APi-TOF, binPMF identified more physically meaningful factors than UMR-PMF. Additionally, for factors where the two PMF approaches agreed, binPMF still contained more mass spectral information for ion identification, as compared to UMR-PMF.
We provide a proof of concept for the utility of binPMF, showing that it can outperform the two traditional analysis approaches, UMR and HR. We identify several future improvements and applications for binPMF, including an approach to greatly facilitate the time-consuming process of peak list construction. We expect binPMF to become a powerful tool in the data exploration and analysis of mass spectra.
The data used in this study are available from the first author upon request: please contact Yanjun Zhang (yanjun.zhang@helsinki.fi).
The supplement related to this article is available online at:
ME, YZ, OP and CY designed the study. QZ and MR collected the data; data analysis was done by YZ and OP. YZ wrote the paper. All coauthors discussed the results and commented on the manuscript.
The authors declare that they have no conflict of interest.
This research was supported by the European Research Council (grant 638703-COALA); the Academy of Finland (grants 317380 and 320094); and the Vilho, Yrjö and Kalle Väisälä Foundation. Kaspar R. Daellenbach acknowledges support by the Swiss National Science postdoc mobility grant P2EZP2_181599. We thank the tofTools team for providing tools for mass spectrometry data analysis. The personnel of the Hyytiälä forestry field station are acknowledged for help during field measurements.
Open access funding provided by Helsinki University Library.
This paper was edited by Jonathan Abbatt and reviewed by two anonymous referees.