Introduction
Airborne particles are made of inorganic salts, organic compounds, mineral
dust, black carbon (BC), trace elements, and water. While
regulatory limits on airborne particulate matter (PM) concentrations are set
by gravimetric mass determination, analysis of chemical composition is
desired as it provides insight into source contributions, facilitates
evaluation of chemical simulations, and strengthens links between particle
constituents and health and environmental impacts. However, the diversity of
molecular constituents poses challenges for characterization as no single
instrument can measure all relevant properties; an amalgam of analytical
techniques is often required for comprehensive measurement.
Fourier transform infrared (FT-IR) spectroscopy is one analytical technique
that captures the signature of a multitude of PM constituents that give rise
to feature-rich spectral patterns over the mid-infrared (mid-IR) wavelengths. In the past decade, mid-IR spectra have been used for
quantification of various substances in atmospheric PM and for apportionment
of organic matter (OM) into source classes including biomass burning,
biogenic aerosol, fossil fuel combustion, and marine aerosol. The quantitative information regarding the abundance of
substances in each spectrum is limited only by the calibration models that
can be built for it.
In principle, the extent of frequency-dependent absorption in the mid-IR
range accompanying induced changes in the dipole moment of molecular bonds can be
used to estimate the quantity of sample constituents in any medium. Based on this principle, FT-IR spectroscopy has a long history in
remote and ground-based measurement of chemical composition in the
atmospheric vapor phase. For ground-based measurement, gases are measured by FT-IR spectroscopy in an open-path in situ configuration or via extractive sampling into a closed multi-pass cell. These techniques have been used to sample urban smog, smog chambers, biomass burning emissions, volcanoes, and fugitive gases, and to measure emission fluxes, greenhouse gases, and isotopic composition. For these applications, quantitative analysis has been conducted using various regression algorithms with standard gases or synthetic calibration spectra, with absolute accuracies on the order of 1 %–5 %. Synthetic spectra for calibration are generated from a database of absorption line parameters together with simulation of pressure and Doppler broadening and instrumental effects.
Limits of conventional approaches to calibration
Analysis of FT-IR spectra of condensed-phase systems is more challenging. PM
can be found in crystalline solid, amorphous solid, liquid, and semisolid
phase states. Solid- and liquid-phase
spectra do not have the same rotational line shapes present in the vapor
phase, but inhomogeneous broadening occurs due to a multitude of local
interactions of bonds within the liquid or solid environment. Line shapes are particularly
broad in complex mixtures of atmospheric PM since the resulting spectrum is
the superposition of varying resonances for a given type of bond. FT-IR spectroscopy has
enjoyed a long history of qualitative analysis of molecular characteristics
in multicomponent PM based on visible peaks in the spectrum, and study of relative composition or changes to composition under controlled conditions (e.g., humidification, oxidation) has provided insight into atmospherically relevant aerosol processes. Quantitative prediction of substances in
collected PM represents a separate task and is conventionally pursued by
generating laboratory standards and relating observed features to known
concentrations. This calibration approach has been predominantly used to
characterize ambient and atmospherically relevant particles collected on
filters or optical disks. The bulk of past work in aerosol studies has
focused on using laboratory standards to build semiempirical calibration
models for individual vibrational modes belonging to one of many functional
groups present in the mixture. In this approach, the observed absorption is
related to a reference measurement (typically gravimetric mass) of the
compounds on the substrate. In this way, calibration of nitrate and sulfate salts, silica dust, and organic functional groups has been studied. The reconstructed organic carbon and organic aerosol mass has typically ranged between 70 % and 100 % when compared with collocated evolved-gas analysis or mass spectrometry measurements, though many model uncertainties remain. One is that
unmeasured, non-functionalized skeletal carbon can lead to less than full
mass recovery, and the second is the estimation of the detectable fraction
due to the multiplicity of carbon atoms associated with each type of
functional group. The challenge in this type
of calibration is in the problem of extrapolating from the reference
composition, which is necessarily kept simple, to that of the chemically
complex PM. Spectroscopically, this difference can lead to shifts in
absorption intensity or peak locations and a general broadening of
absorption peaks on account of the same functional group appearing in many
different molecules and in different condensed-phase environments.
Synthetic spectra for condensed-phase systems can be generated by mechanistic
and statistical means, but are not readily available for quantitative
calibration. Absolute intensities are typically even more difficult to
simulate accurately than peak frequencies.
Computational models that predict vibrational motion of molecules in
isolation using quantum mechanical models or by harmonic
approximation for larger molecules suffer from two
shortcomings: poor treatment of anharmonicity and lack of solvent effects in
liquid solutions. Quantum mechanical simulations can
parameterize interactions with an implicitly modeled solvent through a
polarizable continuum model framework but do not adequately represent specific interactions such as hydrogen bonding. Microsolvation can be a better technique to describe the hydrogen bonding environment, but the high computational cost prevents application to large systems. Gaussian dispersion
analysis has provided accurate spectrum reconstruction in pure liquids
(water–ethanol mixtures) from their calculated dielectric functions but has not been applied to more complex systems.
Molecular dynamics (MD) provides a general framework for addressing
interactions with the solvent, large-amplitude motions in flexible molecules,
and anharmonicities. Electronic structure
calculations relevant for predicting vibrational spectra can be incorporated
by ab initio MD and path integral MD
methods such as centroid or ring polymer MD
that additionally consider nuclear quantum effects (at higher computational
cost). Ab initio MD is widely used for simulating the spectra of water
and a range of small organic and biological molecules in isolation. Such calculations generally reproduce the shape of the spectrum
well with respect to experimental ones at very high dilution, although C–H
stretching peaks are known to be shifted towards higher wavenumbers due to
the lack of improper hydrogen bonding in vacuum simulations. Bulk liquid-phase simulations are limited to a few tens of molecules (a few hundred atoms) and have been performed for liquids including methanol, water, and aqueous solutions of biomolecules. These simulations
reproduce peak positions and relative intensities sufficiently well when
compared to experimental spectra, albeit with lower accuracy in peak position
at wavenumbers higher than 2000 cm-1. These methods have also been
shown to reproduce the main features of vibrational spectra in solid (crystalline
ice and naphthalene) systems. Nuclear quantum effects not explicitly accounted for by ab initio calculations become more important for hydrogen-containing systems and have been investigated in liquid water and methane for vibrational spectra simulation.
A recent approach improves upon the accuracy and speed of ab initio MD by
combining a dipole moment model and potentials
derived from machine learning. Trained on only several
hundred reference electronic structure calculations, spectra of several
alkanes and small peptides were simulated with accuracy reflecting improved
treatment of anharmonicities and proton transfer, with reductions in
computational cost by 3 orders of magnitude.
However, this machine-learned method still inherits some common limitations
of ab initio calculations upon which models are trained. One example is the
apparent blue shift of the C–H stretching peak, likely due to an insufficient
treatment of improper hydrogen bonding or the deficiency of the electron
exchange functional. While such methods may be useful in aiding interpretation of environmental spectra, they are not yet mature for reproducing spectra of suitable
quality for quantitative calibration or (white-box) inverse modeling.
Early applications of artificial intelligence to mid-IR spectra
interpretation also included efforts to generate synthetic spectra of
individual compounds. Mid-IR spectra of new compounds were simulated from
neural networks trained on three-dimensional molecular descriptors (radial
distribution functions) paired with corresponding mid-IR spectra, matched by a similarity (nearest neighbor) search in a structural database, or generated from spectra–structure correlation databases. Drawing upon internal or commercial libraries, predictions
were made for compounds in the condensed phase with a diverse set of
substructures including methanol, amino acids, ring-structured
acids, and substituted benzene derivatives. Many structural features
including peak location, relative peak heights, and peak widths were
reproduced, provided that relevant training samples were available in the
library. Much of the work was motivated by pattern matching and
classification of spectra for unknown samples, and automated band assignment and identification of the underlying fragments was typically performed by trained spectroscopists. This approach has been
able to generate spectra for more complex molecules than mechanistic modeling
relying on ab initio calculations. However, the extent of evaluation has been
limited; extension to multicomponent mixtures and usefulness for quantitative
calibration is currently not known. While these research fields remain an
active part of cheminformatics, we propose another approach for calibration
model development that can be used for atmospheric PM analysis.
Use of collocated measurements
As an alternative to laboratory-generated mixtures and simulated spectra,
collocated measurements of substances for which there are IR-active
vibrational modes can be used as reference values for calibration (also
referred to as in situ calibration). This data-driven approach permits
the complexity of atmospheric PM spectra with overlapping absorbances from
both analytes and interferences to be included in a calibration model. For
instance, prior work demonstrated the use of collocated ammonium
sulfate measurements by ion chromatography to quantify the abundance of this
substance from FT-IR spectra, though some uncertainties arose from differences in time resolution among the sampling instruments.
The benefit of building data-driven calibration models to reproduce
concentrations reported by available measurements is twofold. One is to
provide equivalent measurements when the reference measurements are expensive
or difficult to obtain. For example, FT-IR spectra can be acquired rapidly,
nondestructively, and at low cost from polytetrafluoroethylene (PTFE)
filters commonly used for gravimetric mass analysis in compliance monitoring
and health studies. That vibrational spectra contain many signatures of
chemical constituents of PM (which also gives rise to challenges in
spectroscopic interpretation) provides the basis for quantitative calibration
of a multitude of substances. This capability for multi-analyte analysis is
beneficial when a single filter may be relied upon during short-term
campaigns, or at network sites for which installation of the full suite of
instruments is prohibitive. The second benefit is the ability to gain a
better understanding of atmospheric constituents measured by other techniques
by associating them with important vibrational modes and structural elements of
molecules identified in the FT-IR calibration model. Such an application can
be enlightening for studying aggregated metrics such as carbon content or
functional group composition in atmospheric PM quantified by techniques
requiring more sample mass and user labor: ultraviolet–visible spectrometry
or nuclear magnetic resonance spectroscopy.
In this paper, we demonstrate an extensive application of this approach in
the statistical calibration of FT-IR spectra to collocated measurements of
carbonaceous aerosol content – organic carbon (OC) and elemental carbon (EC)
– characterized by a particular type of evolved gas analysis (EGA). EGA
includes thermal–optical reflectance (TOR) and thermal–optical transmittance
(TOT), which apportions total carbon into OC and EC fractions according to
different criteria applied to the changing optical properties of the filter
under stepwise heating. EGA OC and EC are widely measured in monitoring networks, with historical significance
in regulatory monitoring, source apportionment, and epidemiological studies.
While EC is formally defined as sp2-bonded carbon bonded only to other
carbon atoms, EC measured by EGA is an operationally defined quantity that
is likely associated with low-volatility organic compounds. EGA OC comprises a larger fraction of the total
carbon and therefore is less influenced by pyrolysis artifacts that affect
quantification of EGA EC. In addition to OC estimates independently
constructed from laboratory calibrations of functional groups, prediction of
EGA OC and EC from FT-IR spectra will provide values for which strong
precedent in atmospheric studies exists. Thus, use of collocated measurements
complements conventional approaches in expanding the capabilities of FT-IR
spectroscopy to extract useful information contained in vibrational spectra.
We review the current state of the art for quantitative prediction of OC and
EC as reported by TOR using FT-IR spectroscopy at selected sites of the Interagency
Monitoring of Protected Visual Environments (IMPROVE) monitoring network
and the Chemical Speciation Network (CSN). This work is placed within the context of overseeing the
life cycle of a statistical calibration model more generally: reporting
further developments in anticipating errors due to precision and bias in new
samples and describing a road map for future work. While partial least squares (PLS) regression and its variants figure heavily in the calibration
approach taken thus far, related developments in the fields of machine
learning, chemometrics, and statistical process monitoring are mentioned to
indicate the range of possibilities yet available to overcome future
challenges in interpreting complex mid-IR spectra of PM. We expect that
many concepts described here will also be relevant for the emerging field of
statistical calibration and deployment of measurements in a broader
environmental and atmospheric context. In the following sections, we describe the experimental
methods for collecting data (Sect. ), the calibration
process (Sect. ), assessing suitability of existing models for
new samples (Sect. ), and maintaining calibration
models (Sect. ). Finally, we conclude with a summary and
outlook (Sect. ). A list of recurring abbreviations can be found in
Appendix .
Background
First, we review the basic principles of FT-IR spectroscopy and how the measured
absorbances can be related to underlying constituents, including carbonaceous
species (Sect. ). We then describe the samples used
for calibration and evaluation (Sect. ). We
then conclude the section with discussion regarding quality assurance and
quality control (QA/QC) of the FT-IR hardware performance
(Sect. ). Under the assumption that these hardware
QA/QC criteria are met, we dedicate the remainder of the paper to outlining
model evaluation, such that performance in prediction can be attributed to differences in sample composition.
Fourier transform infrared spectroscopy
In this section, we cover the background necessary to understand FT-IR
spectroscopy in the analysis of PM collected onto PTFE filter media, which is
optically thin and permits an absorbance spectrum to be obtained by
transmission without additional sample preparation. The wavelengths of IR radiation are longer than those of visible light (400–800 nm), and FT-IR spectroscopy refers to a nondispersive analytical technique
probing the mid-IR range, which is radiation from 2500 to 25 000 nm or in the
vibrational frequency units used by spectroscopists, wavenumbers, 4000 to
400 cm-1. Molecular bonds absorb mid-IR radiation at characteristic
frequencies of their vibrational modes when interactions between the electric dipole and the electric field induce transitions among vibrational energy states. Based on this principle, the spectrum
obtained by FT-IR spectroscopy represents the underlying composition of organic and
inorganic functional groups containing molecular bonds with a dipole moment.
In transmission-mode analysis in which the IR beam is directed through the
sample, absorbance (A) can be obtained by ratioing the measured extinction
of radiation through the sample (I) by a reference value (I0), also
called the “background”, and taking the negative value of their decadic
logarithm (first relation of Eq. ).
A(ν̃) = −log10[I(ν̃)/I0(ν̃)] = ε(ν̃) n(a)
The sample is the PTFE filter (with or without PM) and the background is
taken as the empty sample compartment. The quality of the absorbance spectrum
depends on how accurately the background reflects the conditions of the
sample scan, and the background is therefore acquired regularly as discussed
in Sect. .
When absorption is the dominant mode of extinction, the measured absorbance
(A) is proportional to the areal density of molecules (n(a))
in the beam in the sample (Eq. ). The superscript “(a)” is used to denote the area-normalized quantity. ε is the proportionality constant and
is called the molar absorption coefficient. Although scattering off of
surfaces present in the sample can generate a significant contribution to the
absorbance spectrum, its effects can be modeled as a sum of incremental
absorbances by a linear calibration model or minimized through spectral
preprocessing procedures (baseline correction) as discussed in
Sect. .
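As a simple illustration of the absorbance relation above, the spectrum can be computed directly from the single-beam intensities of the sample and background scans. The sketch below assumes the two scans are stored as NumPy arrays on a common wavenumber grid; the variable names and the synthetic band are illustrative and do not correspond to any existing processing code.

```python
import numpy as np

def absorbance(sample_intensity: np.ndarray, background_intensity: np.ndarray) -> np.ndarray:
    """Decadic absorbance A = -log10(I / I0) on a common wavenumber grid."""
    ratio = np.clip(sample_intensity / background_intensity, 1e-12, None)  # guard against zeros
    return -np.log10(ratio)

# Example with synthetic single-beam intensities (arbitrary units).
wavenumbers = np.linspace(4000.0, 400.0, 1800)                   # cm-1
I0 = np.ones_like(wavenumbers)                                   # empty-chamber reference
I = 1.0 - 0.05 * np.exp(-((wavenumbers - 2900.0) / 40.0) ** 2)   # weak C-H-like band
A = absorbance(I, I0)
print(f"Peak absorbance: {A.max():.4f}")
```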
A composite metric of PM such as carbon content presumably results from
contributions by a myriad of substances. The abundances of these underlying
molecules concurrently give rise to the apparent mass of carbon
(mC) (Eq. ) measured by evolved gas analysis
and the absorbance spectrum (A) (Eq. ) measured by FT-IR spectroscopy
:
mC(a) = 12.01 ⋅ Σk fC,k nk(a),    A(ν̃) = Σk εk(ν̃) nk(a) + Σk′ εk′(ν̃) nk′(a) + {…}.
fC,k denotes the number of (organic or elemental)
carbon atoms in molecule k, and 12.01 is the atomic mass of carbon.
Non-carbonaceous substances (e.g., inorganic compounds) that give rise to
additional (possibly interfering) absorbance are indexed by k′. “{…}”
indicates contributions from instrumental noise, ambient background, and
additional factors such as scattering. Using TOR measurements from collocated
quartz fiber filters, our objective is to develop a calibration model for
estimating the abundance of carbonaceous material
(mC(a)) in the PTFE sample that may have led to the
observed pattern of mid-IR absorbances (A(ν̃)). A common approach
is to explore the relationship between response and absorbance spectra
through a class of models that take on a multivariate linear form
:
mC,i(a) = Σj bj Ai(ν̃j) + ei.
The set of wavelength-dependent regression coefficients bj comprises a vector operator that effectively extracts the necessary information from the spectrum for calibration. These coefficients (bj) presumably represent a
weighted combination of coefficients expressed in Eqs. ()
and () (also correcting for non-carbonaceous interferences).
The remaining term, ei, characterizes the model residual (in regression
fitting) or prediction error (in application to new samples). The
relationship with underlying substances (k) that comprise OC and EC is
implicit, though some efforts to interpret these constituents have been made
through examination of latent (or hidden) variables obtained from the
calibration model (discussed in Sect. ).
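To make the linear form above concrete, prediction for a new sample reduces to an inner product between its spectrum and the regression vector. The snippet below is a schematic sketch that assumes spectra are stored row-wise in a matrix and that a coefficient vector (here b) has already been estimated by some calibration procedure; the names and sizes are placeholders rather than those of the IMPROVE or CSN models.

```python
import numpy as np

def predict_loading(spectra: np.ndarray, b: np.ndarray, intercept: float = 0.0) -> np.ndarray:
    """Predict areal carbon loadings from absorbance spectra.

    spectra : (n_samples, n_wavenumbers) absorbance matrix
    b       : (n_wavenumbers,) regression coefficient vector
    """
    return spectra @ b + intercept

# Toy example: three spectra on an 1800-point wavenumber grid with hypothetical coefficients.
rng = np.random.default_rng(0)
A = rng.random((3, 1800)) * 0.01
b = rng.normal(scale=0.1, size=1800)
print(predict_loading(A, b))
```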
Using complex, operationally defined TOR measurements as reference for
calibration, some caution in interpretation and application is warranted. For
instance, these coefficients may not necessarily capture the true
relationship expressed by Eqs. () and () but instead rely on correlated rather than causal variables for quantification.
Particles and the PTFE substrate itself can confer a large scattering
contribution to the extinction spectrum (Eq. ), and
additional sample matrix interactions among analytes may challenge
assumptions regarding the linear relationship (Eq. ) underlying
the model for quantification (Eq. )
. Furthermore, the relationship between spectra and
concentrations embodied by the regression coefficients is specific to the chemical composition of PM at the geographic locations of the calibration samples and to sampling artifacts arising from composition and sample handling protocols. To
address these concerns, extensive evaluation regarding model performance in
various extrapolation contexts is necessary to investigate the limits of our
calibration models, and methods for anticipating prediction errors provide
some guidance on their general applicability in new domains. Regression
coefficients and underlying model parameters are inspected to determine
important vibrational modes that provide insight into the infrared absorption
bands that drive the predictive capability of our regression models.
Sample collection (IMPROVE and CSN)
Map of IMPROVE and CSN sites used for this work. The Sac and Fox,
KS, IMPROVE site was only operational for the first half of 2011. Samples
from Fresno, CA, and South Korea were additionally used for a separate
calibration.
The IMPROVE network consists of approximately 170 sites in rural and pristine
locations in the United States, primarily national parks and wilderness areas. Data from the IMPROVE network are used to monitor trends in
particulate matter concentrations and visibility. IMPROVE collects ambient
samples midnight to midnight every third day by pulling air at
22.8 L min-1 through filters.
PTFE filters (25 mm, Pall Corp.), more commonly referred to as Teflon, are routinely used for gravimetric, elemental, and light-absorption measurements and are used in this work for FT-IR analysis. Quartz
filters are used for TOR measurements to obtain
OC and EC. Nylon filters are used to measure inorganic
ions, primarily sulfate and nitrate.
The CSN consists of about 140 sites located in urban and suburban areas, and the data are used to evaluate trends and sources of particulate matter. Ambient samples are collected in the CSN on a midnight-to-midnight schedule once every third or sixth day. Quartz filters for
TOR analysis are collected with a flow rate of 22.8 L min-1. PTFE
filters (Whatman PM2.5 membranes, 47 mm, used through late 2015; Measurement Technology Laboratories (MTL) membranes, 47 mm, used thereafter) and nylon filters are collected at a flow rate of 6.7 L min-1.
All sites in CSN have used TOR for carbon analysis since 2010.
PTFE filters are used for gravimetric analysis on account of their low vapor
absorption (especially water) and standardization in compliance monitoring,
while quartz fiber filters are separately collected on account of their thermal
stability. TOR
analysis consists of heating a portion of the quartz filter with the
IMPROVE_A temperature ramp and measuring the evolved carbon. The initial heating is performed in an inert environment,
and the material that is removed is ascribed to OC. Oxygen
is added at the higher temperatures and the measured material is ascribed to
EC. Charring of ambient particulate carbon is corrected
using a laser that reflects off the surface of the sample (hence reflectance). The evolved carbon is converted to methane and measured
with a flame ionization detector. Organic carbon data are corrected for
gas-phase adsorption using a monthly median blank value specific to each
network.
For this work, we examine a subset of these sites in which PTFE filters were
analyzed for FT-IR spectra (Fig. ). For model building and
evaluation (Sect. ), we use seven sites consisting of 794 samples
for IMPROVE in 2011 and 10 sites consisting of 1035 samples for CSN in 2013.
Two of the 2011 IMPROVE sites are collocated samplers at the same urban location in Phoenix, AZ, and one site (Sac and Fox) was discontinued midyear.
Additional IMPROVE samples were analyzed by FT-IR spectroscopy during sample year 2013,
which included six of the same sites and 11 additional sites. This data set is
used for evaluation of the operational phase of the model
(Sect. ).
Given the different sampling protocols that result in different spectroscopic
interferences from PTFE (due to different filter types) and range of mass
loadings (due to flow rates), and the difference in expected chemical composition
(due to site types), calibrations for the CSN and IMPROVE networks have been
developed separately. Advantages of building such
specialized models in favor of larger, all-inclusive models are discussed in
Sect. . Therefore, TOR-equivalent carbon
predictions for 2011 and 2013 IMPROVE samples discussed for this paper are
made with a calibration model using a subset of samples from 2011 IMPROVE,
and TOR predictions for 2013 CSN samples are made with a calibration model
using a subset of samples from 2013 CSN. One exception is a special model
constructed to illustrate how new samples can improve model prediction
(Sect. ); a subset of samples from two sites – Fresno,
CA (FRES), and Baengnyeong Island, S. Korea (BYIS) – in 2013 IMPROVE are used
to make predictions for the remaining samples at those sites. In all cases,
analytical figures of merit for model evaluation are calculated for samples
that are not used in calibration.
Laboratory operations and quality control of analysis
IMPROVE and CSN PTFE sample and blank filters are analyzed without
pretreatment on either Tensor 27 or Tensor II FT-IR spectroscopy instruments (Bruker
Optics, Billerica, MA) equipped with a liquid nitrogen-cooled detector.
Filters are placed in a small custom-built sample chamber, which reliably
places each filter the same distance from the source. IR-active water vapor
and CO2 are purged from the sample compartment and instrument optics
to minimize absorption bands of gas-phase compounds in the aerosol spectra.
Samples are measured in transmission mode and absorbance spectra, which are
used for calibration and prediction, are calculated using the most recent
empty chamber spectrum as a reference (collected hourly). The total
measurement time for one filter is 5 min. Additional details on the FT-IR
analysis are described in previous work.
Diagram of the model building, evaluation, and monitoring process.
Sections and subsections covering the illustrated topics are denoted in
parentheses. Note that any of the calibrations {1,2,…,N} can be
a multilevel model (Sect. ) consisting
of an ensemble of models.
Daily and weekly quality control checks are performed to monitor the
comparability, precision, and stability of the FT-IR spectroscopy instruments. Duplicate
spectra are collected every 50 filters (once or twice per day) per
instrument in order to evaluate measurement precision. Measured precision
values are low and smaller than the 95th percentile of the standard deviation
of the blanks for both TOR OC and EC, indicating that instrument error has a
relatively minor influence on the prediction of TOR OC and EC and is smaller
than the variability observed among PTFE filters. Quality control filters
– blank filters and ambient samples – are analyzed weekly to monitor
instrument stability. Analyses of these quality control filters indicate that predictions of TOR OC and EC remain relatively stable over a 2.5-year period and that observed changes are small.
These data enable us to track instrumental changes that will require
recalibration (Sect. ). A subset of ambient filters are
analyzed on all FT-IR spectroscopy instruments to evaluate spectral dissimilarities and
differences in prediction. These samples show that differences in spectral
response among instruments are small and due mainly to variability in PTFE.
In addition, these samples indicate that careful control of laboratory conditions – detector temperature, sample position, relative humidity (RH), and CO2 levels in the FT-IR spectroscopy instrument – enables instrument-agnostic
calibrations that predict accurate concentrations independent of the
instrument on which a spectrum is collected. The quality control data show
that the TOR OC and EC measurements obtained from multiple FT-IR spectroscopy instruments
in one laboratory are precise, stable (over the 2.5-year period evaluated), and agnostic to the instrument used for analysis.
Model building, evaluation, and interpretation
In this section, we describe the model building process for quantitative
calibration. The relationship between spectra and reference values to be
exploited for prediction can be discovered using any number of algorithms,
the
method of spectra pretreatment, and the calibration set of samples
to be used for model training and validation. As the best choices for each of
these categories are not known a priori, the typical strategy is to generate
a large set of candidate models and select one that scores well across a
suite of performance criteria against a test set of samples reserved
for independent evaluation. The process of building and evaluating a model
conceptualized in the framework of statistical process control is depicted in
Fig. . In the first stage, various pathways to model
construction are evaluated, and expectations for model performance are
determined. The second stage involves continued application and monitoring of
model suitability for new samples (prediction set), which is
discussed in Sect. . Where applicable, each data set should include several types of samples. For instance, the
calibration set can include blank samples in which analyte (but not
necessarily interferent) concentrations are absent. Test and prediction set
samples can include both analytical and field blank samples. Collocated
measurements can be used for providing replicates for calibration or used as
separate evaluation of precision. Immediately below, we describe the
procedure for model specification, algorithms for parameter estimation, and
model selection in Sect. . Methods for spectra
processing are described in Sect. and sample
selection in Sect. . In each section, the broader
concept will be introduced and then its application to TOR will be reviewed.
Model estimation
Many algorithms in the domain of statistical learning, machine learning, and
chemometrics have demonstrated utility in building calibration models with
spectra measurements: neural networks, Gaussian process regression, support vector regression, principal component regression, ridge regression, wavelet regression, functional regression, and PLS, among others. There is no lack of algorithms for supervised learning with continuous response variables that can potentially be adapted for such an application. Each of these techniques maps relationships between
spectral features and reference concentrations using different similarity
measures, manifolds, and projections, largely in metric spaces where the
notion of distances among real-valued data points is well-defined. The best mathematical
representation for any new data set is difficult to ascertain a priori, but
models can be compared by their fundamental assumptions and their
formulation: e.g., linear or nonlinear in form; globally parametric, locally
parametric, or distribution free (random forest, nearest neighbor); feature
transformations; objective function and constraints; and expected residual
distributions. Approaches that incorporate randomized sampling can return
slightly different numerical results, but reproducibility of any particular
result can be ensured by providing seed values for the pseudo-random number
generator. A typical procedure for model development is to select candidate
methods that have enjoyed success in similar applications and empirically
investigate which techniques provide meaningful performance and
interpretability for the current task, after which implementation measures
are then pursued. In lieu of selecting a single model,
ensemble learning and Bayesian model averaging approaches combine predictions
from multiple models.
For FT-IR calibration targeting prediction of TOR-equivalent concentrations,
we focus on finding solutions to the linear model introduced in
Sect. . Letting y = [mC,i(a)], X = [Ai(ν̃j)], b = [bj], and e = [ei], we re-express Eq. () in array notation to facilitate further discussions of linear operations:
y = Xb + e.
Equation () is an ill-posed inverse problem;
therefore, it is desirable to introduce some form of regularization (method
of introducing additional information or assumptions) to find suitable
candidates for b. In this paper, we summarize the application of PLS for obtaining solutions to this equation, with which good results have been obtained for our application and FT-IR spectra more generally. This technique has been a classic
workhorse of chemometrics for many decades and is particularly well-suited
for characteristics of FT-IR analysis, for which data are collinear
(neighboring absorbances are often related to one another) and
high-dimensional (more variables than measurements in many scenarios). These
issues are addressed by projection of spectra onto an orthogonal basis of
latent variables (LVs) that take a combination of spectral features, and
regularization by LV selection. Furthermore, PLS is
agnostic with respect to assumption of residual structure (e.g., normality)
for obtaining b, which circumvents the need to explicitly account for
covariance or error distribution models to characterize the residuals. PLS is also used as a preliminary dimension reduction technique prior to application of nonlinear methods. Therefore, it is sensible that PLS should be selected as
a canonical approach for solving Eq. ().
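As a minimal sketch of obtaining such a PLS solution, the example below uses the scikit-learn implementation of PLS regression on simulated spectra; the number of LVs, array sizes, and variable names are placeholders rather than the settings used for the IMPROVE or CSN calibrations.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(42)

# Simulated calibration data: 200 spectra x 1800 wavenumbers, one response (e.g., OC loading).
n_samples, n_wavenumbers = 200, 1800
X = rng.random((n_samples, n_wavenumbers))
true_b = np.zeros(n_wavenumbers)
true_b[1200:1300] = 0.05                      # a single "absorption band" drives the response
y = X @ true_b + rng.normal(scale=0.01, size=n_samples)

# Fit a PLS model with a placeholder number of latent variables.
pls = PLSRegression(n_components=10)
pls.fit(X, y)

# Predict for new spectra and inspect the estimated regression vector.
X_new = rng.random((5, n_wavenumbers))
y_pred = pls.predict(X_new).ravel()
print("Predictions:", np.round(y_pred, 3))
print("Regression vector shape:", pls.coef_.shape)
```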
Mathematically, classical PLS represents a bilinear decomposition of a
multivariate model in which both X and y are projected
onto basis sets (“loadings”) P and q, respectively
:
X=TPT+EXy=TqT+e.
T is the orthogonal score matrix and EX denotes the
residuals in the reconstruction of the spectra matrix. Common solution
methods search for a set of loading weight vectors (represented in a column
matrix W) such that covariance of scores (T) with
respect to the response variable (y) is maximized. The weight matrix
can be viewed as a linear operator that changes the basis between the feature
space and FT-IR measurement space. These weights and their relationship to
the score matrix and regression vector are expressed below:
R = W(PᵀW)⁻¹,    T = XR,    b = Rqᵀ.
For univariate y as written in Eq. (), a
number of commonly used algorithms – nonlinear iterative partial least squares (NIPALS), SIMPLS, and kernel PLS with a linear kernel – can be used to arrive at the same solution (while varying in numerical efficiency). Kernel PLS can be
further extended into modeling nonlinear interactions by projecting the
spectra onto a high-dimensional space and applying linear algebraic
operations akin to classical PLS, with comparative performance to support
vector regression and other commonly used nonlinear modeling approaches. However, likely due to the linear nature of the underlying relationship (Eq. ), linear PLS has typically performed better than nonlinear algorithms for FT-IR calibration. In addition, the linearity of classical PLS regression has yielded more interpretable models than nonlinear ones. Therefore, past applications of PLS to FT-IR calibration of atmospheric aerosol constituents have focused on its linear variants, which will be the focus of this paper.
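For concreteness, a compact NIPALS-style implementation for a single response, following the bilinear decomposition above, is sketched below. It is didactic rather than numerically optimized, and deflation details vary among published variants.

```python
import numpy as np

def nipals_pls1(X: np.ndarray, y: np.ndarray, n_components: int):
    """Didactic NIPALS for a single response: returns weights W, loadings P, q, and b."""
    X = X - X.mean(axis=0)               # center spectra
    y = y - y.mean()                      # center response
    n, p = X.shape
    W, P, q = np.zeros((p, n_components)), np.zeros((p, n_components)), np.zeros(n_components)
    for a in range(n_components):
        w = X.T @ y                       # weight vector maximizing covariance with y
        w /= np.linalg.norm(w)
        t = X @ w                         # scores
        p_a = X.T @ t / (t @ t)           # X loadings
        q_a = y @ t / (t @ t)             # y loading
        X = X - np.outer(t, p_a)          # deflate X
        y = y - t * q_a                   # deflate y
        W[:, a], P[:, a], q[a] = w, p_a, q_a
    R = W @ np.linalg.inv(P.T @ W)        # change of basis from weights to regression space
    b = R @ q                             # regression vector for centered data
    return W, P, q, b

# Usage sketch with random data (10 latent variables).
rng = np.random.default_rng(1)
X, y = rng.random((50, 300)), rng.random(50)
_, _, _, b = nipals_pls1(X, y, n_components=10)
print(b.shape)   # (300,)
```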
An optimal number of LVs must be selected to arrive at the best predictive
model. A larger number of LVs are increasingly able to capture the variations
in the spectra, leading to reduction in model bias. Some of the finer
variations in the spectra are not part of the analyte signal that we wish to
model; including LVs that model these terms leads to increased variance in the
predictions. A universal problem in statistical modeling is to find a method
for characterizing model bias and variance such that one with the lowest
apparent error can be chosen. There is no shortage of methods devised to
capture this bias–variance tradeoff, and their implications for model
selection continue to be an active area of development.
With no immediate consensus on the single best approach for all cases, the
approach often taken is to select and use one based on prior experience until
found to be inadequate (as with model specification).
One class of methods characterizes the bias and variance using the information
obtained from fitting of the data. For instance, the Akaike information criterion
(AIC) and the Bayesian information criterion (BIC) consider the balance between model fidelity (fitting
error, which monotonically decreases with number of parameters) and
penalties incurred for increasing model complexity (which serves as a form of
regularization). The fitting error may be characterized by residual sum of
squares or maximum likelihood estimate, and the
penalty may be a scaled form of the number of parameters or norms of the
regression coefficient vector. An effective degrees of freedom (EDF) or
generalized EDF parameter aims to characterize the resolvable dimensionality
as apparent from the model fit to data, though the EDF may not always correspond to the desired model complexity.
Another class of methods relies on assessment of the bias and variance
contributions implicitly present in prediction errors, which are obtained by
application of regression coefficients estimated using a training data set
and evaluated against a separate set of (“validation”) data withheld from
model construction to fix its parameters. To maximize the data available for
both training and validation, modern statistical algorithms such as cross-validation (CV) and the bootstrap
method allow the use of the same samples for both training and
validation, which comprise what we collectively refer to as the calibration
set. The essential principle is to partition the same calibration set
multiple times such that the model is trained and then validated on different
samples over a repeated number of trials. In this way, a distribution of
performance metrics for models containing different subsets of the data can
be aggregated to determine a suitable estimate of a parameter (number of
LVs). The number and arrangement of partitions vary by method, with
CV using each sample exactly once for validation and the bootstrap resampling with replacement. Both have been reported to give usable results. For an increasingly smaller number of samples, leave-one-out (LOO) CV or bootstrap may be favored as it reserves a larger number of samples to train each model, though it is generally appreciated that LOO leads to suboptimal estimates of prediction error. Evaluation metrics are calculated on samples that have not been involved in the model-building process.
Examples of metrics include the minimum root-mean-square error of cross-validation (RMSECV; one of the most widely used metrics), 1 standard deviation above RMSECV, Wold's R criterion, the coefficient of determination (R2), and the randomization p value, among others. A suite of these metrics can also be considered simultaneously. The final model is obtained by refitting the model to all
of the available samples in the calibration set and using the number of
parameters selected in the CV process. Other strategies and general
discussions on the topic of performance metrics and statistical sampling are
covered in many textbooks.
Past work on TOR and FT-IR spectroscopy measurements has used V-fold CV, with either minimum RMSECV or Wold's R criterion used for performance evaluation. In V-fold CV, the data are partitioned into V groups, and V-1 subsets are used to train a model to be evaluated on the remaining subset (repeated for V arrangements). Previous work found that V=2, 5, and 10 selected a different number of LVs but led to similar overall performance. To keep the
solution deterministic (i.e., no random sampling) and representative (i.e.,
the composition of training sets and validation sets is representative of the
overall calibration sets across permutations), samples in the calibration set
are ordered according to a strategy amenable for stratification. For
instance, samples are arranged by sampling site and date (used as a surrogate
for source emissions, atmospheric processing, and composition, which often
vary by geography and season), or with respect to increasing target analyte
concentration, and samples separated by interval V are used to create each
partition in a method referred to as Venetian blinds (also referred to as
interleaved or striped) CV. An illustration of
RMSECV compared to the fitting errors represented by the root-mean-square
error calibration (RMSEC) for TOR OC is shown in Fig. .
Other strategies for arranging CV include maximizing differences among
samples in each fold to reduce chances of overfitting but
have not been explored in this application.
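A minimal sketch of the Venetian blinds (interleaved) fold assignment, combined with an RMSECV curve for choosing the number of LVs, is given below. The sorting key, fold count, and maximum number of components are illustrative choices and not the settings used for the reported calibrations.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def venetian_blinds_folds(n_samples, v):
    """Assign interleaved (striped) folds: sample i goes to fold i mod v."""
    idx = np.arange(n_samples)
    return [idx[idx % v == k] for k in range(v)]

def rmsecv_curve(X, y, v=10, max_lv=20):
    """RMSECV as a function of the number of latent variables using Venetian blinds CV."""
    folds = venetian_blinds_folds(len(y), v)
    rmsecv = []
    for n_lv in range(1, max_lv + 1):
        sq_err = []
        for val_idx in folds:
            train_idx = np.setdiff1d(np.arange(len(y)), val_idx)
            model = PLSRegression(n_components=n_lv).fit(X[train_idx], y[train_idx])
            pred = model.predict(X[val_idx]).ravel()
            sq_err.append((pred - y[val_idx]) ** 2)
        rmsecv.append(np.sqrt(np.mean(np.concatenate(sq_err))))
    return np.array(rmsecv)

# Usage: sort samples (e.g., by site and date, or by concentration) before assigning folds,
# then pick the number of LVs at or near the RMSECV minimum.
rng = np.random.default_rng(0)
X, y = rng.random((120, 400)), rng.random(120)
curve = rmsecv_curve(X, y, v=10, max_lv=15)
print("Selected LVs (min RMSECV):", int(np.argmin(curve)) + 1)
```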
Even with specification of model and approach for parameter selection fixed,
spectral processing and sample selection can lead to differences in overall
model performance. We first discuss how different models can be generated
from the same set of samples according to these decisions before proceeding
to protocols for model evaluation using the test set reserved for
independent assessment (Sect. ). The test set is used to
compare the merits of models built in different ways and establish control
limits for the operational phase (Sect. ).
Illustration of RMSEC, which represents the fitting errors, and
RMSECV, which represents the prediction error, calculated for TOR OC using
the same calibration set. A 10-fold Venetian blinds CV was used for this
calculation.
Model evaluation
Statistical models can be evaluated using many of the same techniques also
used by mechanistic models. In this
section, we describe methods for evaluating overall performance
(Sect. ) and occurrence of systematic errors
(Sect. ).
Illustration of model fits (“Calibration set”, column a)
and predictions (“Test set”, column b) for the 2011 IMPROVE and 2013 CSN networks. Open
circles for CSN EC indicate anomalous samples (discussed in
Sect. ).
Note that units are in areal mass density on the filter.
Overall performance
Predictions for a set of selected models for 2011 IMPROVE and 2013 CSN are
shown in Fig. . Details of sample selection for calibration are
provided in Sect. , but here we present results
for the “base case” models which contain representations of all sites and
seasons for each network. There are many aspects of each model that we wish
to evaluate by comparing predictions against known reference values. These
aspects include the bias and magnitude of dispersion, but also our capability
to distinguish ambient samples from blank samples at the low end of observed
concentrations. Metrics that capture these effects can effectively be
derived from the term e in the multivariate regression equation
(Eq. ) when predictions and observations are
compared in the test set spectra. e is referred to as the residual
when describing deviations from observations in fitted values and prediction
error when describing deviations from observed values when the model is used
for prediction in new samples. However, by convention we often resort to the
negative of the residual such that deviation in prediction is calculated with
respect to the observation, rather than the other way around. Example
distributions for residuals and prediction errors for TOR OC in 2011 IMPROVE
are shown in Fig. .
While the use of the minimum root-mean-square error (RMSE) is pervasive in
chemometrics and machine learning as a formal parameter tuning or model
selection criterion, another family of metrics is more commonly used in the
air quality community (Table ). For instance, the mean bias
and mean error and their normalized quantities are often used for
model–measurement evaluation of mechanistic (chemical transport) models. R2 is commonly used in intercomparisons of
analytical techniques. Many of the statistical estimators in
Table converge to a known distribution from which
confidence intervals can be calculated, or otherwise estimated numerically
(e.g., by bootstrap). In addition to conventional metrics, alternatives
drawing upon robust statistics are also useful when undue
influence from a few extreme values may lead to misrepresentation of
overall model performance. For instance, the mean bias is
replaced by the median bias, and mean absolute error is replaced by median
absolute deviation. Even if a robust estimator is unbiased, it may not have
the same variance properties as its non-robust counterpart; therefore, comparison against a reference distribution
for statistical inference may be less straightforward.
For TOR-equivalent values predicted by FT-IR spectroscopy, the median bias and errors have
been typically preferred for characterizing overall model performance,
together with R2 and the minimum detection limit (MDL). Mean errors have
been examined primarily to make specific comparisons among models. Having
derived these metrics, we place them in context by comparing them to those
reported by the reference (TOR) measurement, which include collocated
measurement precision and percent of samples below MDL
(Table ).
Residuals (red symbols) and prediction errors (blue symbols) from
2011 IMPROVE OC (baseline corrected, base case) predictions. The
corresponding kernel density estimate of the distribution is shown on the right.
Time series chart of TOR-equivalent OC residuals (for calibration
samples) and prediction errors (for test set samples) separated by site. Each
value (residual: open circle; prediction error: filled circle) is mapped to a
median-centered inverse hyperbolic sine function using 175 values
(approximately 20 % of the 2011 IMPROVE set) from neighboring TOR OC
concentrations to derive distribution parameters so that values are defined
within a normal distribution (p value > 0.2). Dotted horizontal lines
indicate ±3 standard deviations of the standard normal variate
(Z score).
Definitions of figures of merit for overall assessment of prediction error, the samples to which they are applied, and their reference distribution (if available) used for significance testing. y is the (mean-centered) response vector (i.e., TOR OC or EC mass loadings), and ŷ is the predicted response (Eq. ). 〈⋅〉 is the sample mean, Med[⋅] is the sample median, and Var[⋅] is the unbiased sample variance. Nc is the number of paired collocated samples.

Metric | Samples | Estimate | Ref. dist.
Root-mean-square error (RMSE) | all | √〈(ŷ − y)²〉 | χ2
Mean bias | all | 〈ŷ − y〉 | –
Median bias | all | Med[ŷ − y] | –
Mean absolute error | all | 〈|ŷ − y|〉 | t
Median absolute deviation | all | Med[|(ŷ − y) − Med[ŷ − y]|] | –
Coefficient of determination (R2) | all | 1 − (ŷ − y)ᵀ(ŷ − y)/yᵀy | F
Minimum detection limit (MDL) | blank | 3√Var[ŷ − y] | χ2
Collocated precision | collocated | ‖ŷ1 − ŷ2‖/√(2Nc) | t
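The figures of merit defined in the table map directly onto a few lines of array arithmetic. The sketch below is one possible implementation assuming paired vectors of predictions and reference values, plus blank-filter prediction errors for the MDL; it follows the definitions as reconstructed here and is not any official IMPROVE or CSN processing code.

```python
import numpy as np

def figures_of_merit(y_pred, y_ref, blank_errors=None):
    """Overall performance metrics for predictions against reference values."""
    err = y_pred - y_ref
    y_c = y_ref - y_ref.mean()                       # mean-centered response, as in the table caption
    metrics = {
        "rmse": np.sqrt(np.mean(err ** 2)),
        "mean_bias": np.mean(err),
        "median_bias": np.median(err),
        "mean_abs_error": np.mean(np.abs(err)),
        "median_abs_dev": np.median(np.abs(err - np.median(err))),
        "r_squared": 1.0 - (err @ err) / (y_c @ y_c),
    }
    if blank_errors is not None:
        # MDL taken as 3 times the standard deviation of blank-filter prediction errors.
        metrics["mdl"] = 3.0 * np.std(blank_errors, ddof=1)
    return metrics

# Example with synthetic data (units of areal mass loading).
rng = np.random.default_rng(3)
y_ref = rng.gamma(2.0, 1.0, size=200)
y_pred = y_ref + rng.normal(scale=0.1, size=200)
blank_err = rng.normal(scale=0.03, size=50)
print(figures_of_merit(y_pred, y_ref, blank_err))
```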
Systematic errors
In addition to the aggregate metrics discussed above, we evaluate whether
essential effects appear to be accounted for in the regression by examining
errors across different classes of samples. Systematic patterns or lack of
randomness can be evaluated by examining the independence of the individual
prediction errors with respect to composition or using time and location of
sample collection as surrogates for composition. For instance, high
prediction errors elevated over multiple days may be associated with aerosols
of unusual composition transported under synoptic-scale meteorology that is
not well-represented in the calibration samples. A special exception is made
for concentration, as errors can be heteroscedastic (i.e., nonconstant variance) on account of the wide range of atmospheric concentrations that may be addressed by a single calibration model. This
heteroscedasticity leads to a distribution that is leptokurtic (i.e.,
heavy tailed) compared to a normal distribution, as shown in
Fig. . As solution algorithms for PLS are agnostic with respect to such residual structure, they are well-suited to this type of problem.
Given the propensity of prediction error distributions to be long-tailed,
error and residual values are transformed to standard-normal variates using
inverse hyperbolic sine (IHS) functions using parameters derived from samples with similar analyte (TOR)
concentrations. Such a transformation aids identification of systematic
errors in prediction related to sample collection time and location; a
control chart is displayed for TOR-equivalent OC in
Fig. . Each prediction error is then characterized by
its Z score, which gives an immediate indication of its relation to other
prediction errors for samples with similar concentrations. Because of the IHS
transformation, the magnitude of errors does not scale linearly in vertical
distance on the chart but conveys its centrality, sign, and bounds of the
error (e.g., three units from the mean encompasses 99 % of errors in samples
similar in concentration). In this data set, we can see that prediction
errors for Sac and Fox (SAFO) in each concentration regime are biased
positively during the winter but systematically trend toward the mean approaching the summer months. Other high-error samples near the 99th percentile
(±3 probits) occur in the urban environment of Phoenix, where the TOR OC
concentrations are also highest. However, the prevalence of higher errors in
only one of the two Phoenix measurements (PHOE5) may be indicative of sampler
differences, rather than unusual atmospheric composition. Errors are
negatively biased during the summer months in Trapper Creek, when TOR OC
concentrations are typically low.
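One way to standardize errors across concentration regimes, in the spirit of the IHS control chart described above, is sketched below: each prediction error is converted to a Z score using an inverse hyperbolic sine transform whose center and scale are estimated from samples with similar reference concentrations. The windowing scheme and parameters are simplified illustrations, not the exact procedure used to produce the published control charts.

```python
import numpy as np

def ihs_z_scores(errors, concentrations, window=175):
    """Map prediction errors to Z scores via an inverse hyperbolic sine (IHS) transform.

    For each sample, the center and scale of the transform are estimated from the
    `window` samples with the most similar reference (e.g., TOR OC) concentrations.
    """
    order = np.argsort(concentrations)
    z = np.empty_like(errors, dtype=float)
    for rank, i in enumerate(order):
        lo = max(0, min(rank - window // 2, len(order) - window))
        neighbors = errors[order[lo:lo + window]]
        center = np.median(neighbors)
        scale = np.median(np.abs(neighbors - center)) or 1.0   # robust scale estimate
        t_neighbors = np.arcsinh((neighbors - center) / scale)
        t_i = np.arcsinh((errors[i] - center) / scale)
        z[i] = (t_i - t_neighbors.mean()) / t_neighbors.std()
    return z

# Example: flag samples beyond +/- 3 standard deviations.
rng = np.random.default_rng(7)
conc = rng.gamma(2.0, 1.0, 800)
errs = rng.standard_t(df=3, size=800) * 0.05 * conc    # heavy-tailed, heteroscedastic errors
z = ihs_z_scores(errs, conc)
print("Samples with |Z| > 3:", int(np.sum(np.abs(z) > 3)))
```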
Description and figures of merit for “base case” models. “Predictors” describes the number of wavenumbers and “Components” the number of LVs. Bias and errors are estimated by ensemble medians.

Model description:
Network | FT-IR | Baseline correction | Wavenumber selection | Predictors | Components
IMPROVE 2011 | OC | Spline | None | 1563 | 15
IMPROVE 2011 | EC | Raw | None | 2784 | Multilevel
CSN 2013 | OC | Second derivative | BMCUVE | 375 | 3
CSN 2013 | EC | Spline | BMCUVE | Multilevel | Multilevel

FT-IR figures of merit:
Network | FT-IR | R2 | Bias (µg m-3) | Error (µg m-3) | MDL (µg m-3) | Below MDL (%) | Precision (µg m-3)
IMPROVE 2011 | OC | 0.97 | 0.01 | 0.08 | 0.11 | 0.7 | 0.21
IMPROVE 2011 | EC | 0.96 | 0.00 | 0.03 | 0.01 | 2 | 0.06
CSN 2013 | OC | 0.95 | 0.04 | 0.15 | 0.49 | 3.0 | 0.19
CSN 2013 | EC | 0.88 | 0.02 | 0.11 | 0.17 | 4.8 | 0.04

TOR reference measurements:
Network | TOR | MDL (µg m-3) | Below MDL (%) | Precision (µg m-3)
IMPROVE 2011 | OC | 0.05 | 1.5 | 0.14
IMPROVE 2011 | EC | 0.01 | 3 | 0.11
CSN 2013 | OC | 0.51 | 2.7 | 0.23
CSN 2013 | EC | 0.03 | 16.7 | 0.09
Systematic errors arising from underrepresentation of concentration or
composition range in the calibration set of IMPROVE were investigated through deliberate permutations of calibration and test set samples. This study is discussed together with model interpretation (Sect. ).
Systematic errors with respect to OC / EC ratios were found when predicting TOR-equivalent EC concentrations in the CSN network. These samples were found to originate from Elizabeth, NJ (ELLA), which differed from the nine other examined sites on account of high contributions from diesel PM and a reduced extent of charring compared to other samples. The
solution was to build a separate calibration model
(Sect. ).
Spectral preparation
Mid-IR spectra can be processed in many different ways for use in
calibration. The primary reasons for spectral processing are to remove
influences from scattering such that calibration models follow the principles
of the linear relation outlined in Eq. () and to
remove unnecessary wavenumbers or spectral regions that degrade prediction
quality or interpretability. Scattering of particles manifests itself in a
broad contribution to the signal that is present in the measured spectrum by
FT-IR spectroscopy and is addressed by a class of statistical methods referred to as
baseline correction (Sect. ). It is even
possible to model nonlinear relationships such as the scattering contribution
to the signal using a linear model with additional LVs, but these phenomena should not be mixed together with the noise.
Elimination of unnecessary wavenumbers can reduce noise in the predictions
and confer interpretation on the important absorption bands used for
prediction; the class of procedures used for this purpose is referred to as variable selection or uninformative variable elimination, among other names
(Sect. ). Some algorithms can separate the
influence of the background and select variables in the process of finding
the optimal set of coefficients b in
Eq. (). In each of the following sections, the relevant topic in spectral processing will be introduced before describing its application to TOR calibrations.
Baseline correction
Baseline correction can be fundamental to the way spectra are analyzed
quantitatively. Significant challenges exist in separating the analyte signal
from the baseline of mid-IR spectra, which include the superposition of broad
analyte absorption bands (O–H stretches in particular) on the broadly varying background contributions from scattering. The algorithm for baseline
correction may therefore depend on the type of analyte and the broadness of
its profile; optimization of the correction becomes more important as
concentrations decrease such that they become difficult to distinguish from
the baseline. Approaches can be categorized as reference dependent or
reference independent and can be handled within or
outside of the regression step. Reference-dependent methods define the
baseline with respect to an external measurement, which may be a reference
spectrum or concentrations of an analyte. For instance,
orthogonal signal correction (OSC) isolates contributions to
the spectrum that are uncorrelated with the analyte, and can be
conceptualized as containing baseline effects. OSC can be incorporated into
PLS, in which the orthogonal contribution would be represented by underlying
LVs. Even without explicit specification of orthogonal components, the influence of baseline effects is accounted for by multiple LVs in the standard PLS model. Reference-independent
baseline correction methods remove baseline contributions based on the
structure of the signal without invocation of reference values. Two examples
described below include interpolation and derivative correction methods. A
more comprehensive discussion on this topic is available in the literature.
While theories for absorption peak profiles are abundant, the lack of
corollaries for baselines leads to semiempirical approaches
for modeling their effects. If we conceptualize the broad baseline as an
Nth-order polynomial, we can approximate this expression with an
analytical function or algorithm. Models can be considered to be (globally)
parametric (e.g., polynomial, exponential) across a defined region of a
spectrum, or nonparametric (e.g., spline or convex hull), in which case local features of the spectrum are
considered with more importance. These approaches typically determine the
form of the curve by training a model on regions without significant analyte
absorption and interpolating through the analyte region. The modeled baseline
is then subtracted from the raw spectrum such that the analyte contribution
remains. Model parameters are selected such that processed spectra conform to
physical expectations – namely, that blank absorbances are close to zero and
analyte absorbances are nonnegative. In general, these approaches aim to
isolate the absorption contribution to the spectra that are visually
recognizable and therefore most closely conform to traditional approaches
for manual baseline removal used by spectroscopists. In addition to
quantitative calibration or factor analytic applications (e.g., multivariate
curve resolution; ), these spectra are more amenable for
spectral matching.
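The interpolation-based corrections described above can be sketched in a few lines: a smooth curve is fit only to wavenumber regions assumed to be free of analyte absorption and is then subtracted from the full spectrum. The anchor regions, polynomial order, and variable names below are illustrative assumptions rather than the settings of the published polynomial or spline corrections.

```python
import numpy as np

def polynomial_baseline(wavenumbers, absorbance, anchor_regions, degree=3):
    """Fit a polynomial baseline over analyte-free regions and subtract it.

    anchor_regions : list of (low, high) wavenumber bounds assumed free of analyte bands.
    """
    mask = np.zeros_like(wavenumbers, dtype=bool)
    for low, high in anchor_regions:
        mask |= (wavenumbers >= low) & (wavenumbers <= high)
    coeffs = np.polyfit(wavenumbers[mask], absorbance[mask], degree)
    baseline = np.polyval(coeffs, wavenumbers)
    corrected = absorbance - baseline
    return np.clip(corrected, 0.0, None), baseline   # enforce nonnegative analyte absorbance

# Example: synthetic spectrum = curved scattering background + one analyte band.
nu = np.linspace(4000.0, 1500.0, 1200)
background = 1e-8 * (nu - 1500.0) ** 2 + 0.02
band = 0.05 * np.exp(-((nu - 2920.0) / 35.0) ** 2)
corrected, baseline = polynomial_baseline(
    nu, background + band, anchor_regions=[(4000, 3600), (2600, 1900)], degree=2)
print(f"Recovered band height: {corrected.max():.3f}")
```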
Alternatively, taking the nth derivative of the spectrum will remove the first n terms of the Nth-order polynomial and transform the rest of the signal. Since Gaussian (and most absorption) bands
are not well approximated by low-order polynomials, they are not eliminated;
i.e., their relative amplitudes and half-widths (ideally) remain unaffected
by the transformation. This ensures that their value is retained for
multivariate FT-IR calibrations . Moreover,
derivative-based methods can improve resolution of absorption bands after
transformation (illustrated in Fig. ). Derivative
transformations can affect the signal-to-noise (S/N) ratio, however,
inflating the relative contribution of small perturbations. Therefore,
smoothed derivative methods such as the three-parameter Savitzky–Golay filter
are favored in order to minimize this effect and, in
practice, only first and second derivatives are generally used with
vibrational spectra to maintain a reasonable S/N ratio .
In complex aerosol spectra caution must be exercised when interpreting the bands
resolved by smoothed derivative filters since the filter parameters (i.e.,
bandwidth, kernel) all influence the outcome of the transformation. A major
disadvantage of derivative filtering, in addition to the reduced visual
connection to the original spectrum, relates to the inadvertent removal of
broad absorption bands . Tuning filter parameters by
trial and error may limit this type of band suppression to some extent. As a
rule of thumb, the broad O–H stretches of alcohols
(3650–3200 cm-1), carboxylic acids (3400–2400 cm-1),
and N–H stretches of amines (3500–3100 cm-1) are likely to be
sacrificed as a result of derivative filtering . A
willingness to balance this type of information loss against the simplicity
and rapidity afforded by derivative methods must be considered in practice.
Different approaches have been used for processing of spectra for TOR
calibration, including two interpolation and one derivative approach.
Spectral processing is useful for spectra of PM collected on PTFE filters due
to the significant contribution of scattering from the PTFE
. Small differences in filter characteristics lead to
high variation in its contribution to each spectrum; a simple blank
subtraction of similar blank filters or the same filter prior to PM loading
is not adequate to obtain spectra amenable for calibration
. As the magnitude of this variability is typically
greater than the analyte absorbances, baseline correction models trained on a
set of blank filters typically do not perform adequately in isolating the
nonnegative absorption profile of a new spectrum. Accurate predictions made
by PLS without explicit baseline correction suggest that the calibration
model is able to incorporate its interferences effectively within its feature
space if trained on both ambient samples and blank samples together, though
visually interpretable spectra for general use are not necessarily retrievable
from this model. For this purpose, models based on interpolation from the
sample spectrum itself have been preferred. described
semiautomated polynomial and linear fitting to remove PTFE residuals
remaining from blank-subtracted spectra, which was based on prior work for
manual baseline correction by and .
This correction method had been used for spectral peak fitting, cluster
analysis, and factor analysis previously,
and was used for 2011 IMPROVE TOR OC and EC calibration shown in
Table .
introduced a smoothing spline method that produced baseline-corrected spectra
(both visually and with respect to clustering and calibration) in ambient
samples similar to those from the polynomial method, without the need for PTFE
blank subtraction. While the non-analyte regions of the
spectra are implicitly assumed, the flexibility of the local splines combined
with an iterative method for readjusting the non-analyte region effectively
reduced the number of tuning parameters from four (in the global polynomial
approach) to one. The spline baseline method was used for TOR EC prediction
in 2013 CSN . The second derivative baseline correction method
was applied to 2013 CSN TOR OC calibration .
Overall, differences in calibration model performance in TOR prediction
between spline-corrected and raw spectra models were minor for the samples
evaluated in 2011 IMPROVE (results were comparable to metrics in
Table ). However, wavenumbers remaining after uninformative
ones were eliminated (Sect. ) differed
when using baseline-corrected and raw spectra – even while the two
maintained similar prediction performance. and
used the Savitzky–Golay method and spline correction
method for TOR OC and EC, respectively, in the 2013 CSN network but did not
systematically investigate the isolated effect of baseline correction on
predictions without additional processing. A formal comparison between the
derivative method against raw and spline-corrected spectra has not been
performed, but this is an area warranting further investigation.
Standardizing a protocol for spectra correction based on targeted analyte is
a sensible strategy, as spectral derivatives are associated with enhancement
in specific regions of the spectra. The selection of baseline correction
method may also consider the areal density of the sample since the S/N is
reduced with derivative methods. However, the success of derivative methods
demonstrated for TOR OC in CSN samples (with systematically lower areal
loadings than IMPROVE samples) indicates that the reduction in S/N is not
likely a limiting factor for quantification in this application.
The derivative method appears to have a significant advantage in reducing the
number of LVs as demonstrated for TOR OC (Table ). The
derivative-corrected spectra model for 2013 CSN resulted in only four components
in contrast to the 35 selected by the raw spectra model. While wavenumber
selection and a different model selection criterion were simultaneously
applied to the derivative-corrected model, a large reason for the
simplification is likely due to the baseline correction. For reference,
reduced-wavenumber raw spectra models for 2011 IMPROVE TOR OC and EC still
required seven to nine components (the full-wavenumber model required 15–28,
depending on spectral baseline correction) . A
parsimonious model is desirable in that it facilitates physical
interpretation of individual LVs as further discussed in
Sect. .
The effect of baseline correction on reducing the scattering is illustrated
by revisiting the TOR-equivalent OC predictions for the 2013 IMPROVE data
set. found that the raw spectra 2011 IMPROVE calibration
model performed poorly in extrapolation to two new sites in 2013,
particularly FRES and BYIS. When using baseline-corrected spectra, the median
bias and errors are reduced from 0.28 and 0.43 µgm-3 to 0.19 and
0.28 µgm-3, respectively, and R2 increases from 0.79 to 0.91 for
samples from these sites (figure for baseline-corrected predictions shown in
Sect. ). As the filter type
remained the same, this improvement in prediction accuracy is likely due to
the removal of scattering contributions in PM2.5 particles in the new
set that differs from the calibration set. Spectral signatures of nitrate and
dust suggested the presence of coarse particles different from those in the
2011 calibration (and test) set samples (Sect. ).
Three synthetic absorption spectra constructed with varying
contributions from a polynomial baseline and two unresolved Gaussian
peaks (a) and their second-order, five-point, second-derivative
Savitzky–Golay filter transformations (b). Absorption spectra were
constructed such that the additive, linear, and polynomial components of the
baseline scale with the amplitude of the absorption bands.
Wavenumber selection
Wavenumber or variable selection techniques aim to improve PLS calibrations
by identifying and using only germane predictor variables . Typically, such techniques remove variables
deemed excessively redundant, enhance the precision of PLS calibration,
reduce collinearity in the variables (and therefore model complexity)
, and possibly improve interpretability of the regression.
The simplest variable selection method based on physical insight rather than
algorithmic reduction is truncation, in which regions where absorbances are
not expected, or are expected to be uninformative, are removed a priori.
Algorithmic variable selection techniques fall into three categories: filter,
wrapper, and embedded methods .
Filter methods provide a one-time (single-pass) measure of variable
importance, with important and redundant variables distinguished according to
a reliability threshold. Variables above such a threshold are retained and
used for PLS calibration. Often, thresholds are either arbitrary or
heuristically determined . In general, filter
methods are limited by their need to choose an appropriate threshold prior to
calibration, potentially leading to a suboptimal subset of variables.
The essential principle of wrapper methods is to apply variable filters
successively or iteratively to sample data until only a desirable subset of
quintessential variables remain for PLS modeling . Wrappers operate under the implicit assumption that
single-pass filters are inadequate, requiring a guided approach to
comprehensively search for the optimal subset of modeling variables. Since
searching all 2^p - 1 combinations of wavenumbers is not tractable for
multivariate FT-IR calibration problems (p > 10^3), model inputs (or
importance weights) are generally randomized at each pass of the algorithm to
develop importance criteria, foregoing an exhaustive variable search. Genetic
algorithms and backward Monte Carlo unimportant variable elimination (BMCUVE)
are examples of two randomized wrapper methods . Wrapper methods generally perform better than simple filter
methods and have an additional benefit of considering both variables and PLS
components simultaneously during optimization. The major drawback to wrapper
methods is generally longer run times (which may be on the order of hours for
large-scale problems) than filter methods.
As their name implies, embedded methods nest variable selection directly into
the main body of the regression algorithm. For example, sparse PLS (SPLS) methods
eliminate variables from the PLS loading weights (w), which
reduce the number of nonzero regression coefficients (b) when
reconstructed through Eq. () .
The zero-valued coefficients obtained for each LV can possibly confer
component-specific interpretation of important wavenumbers but lead to a
set of regression coefficients which are overall not as sparse as methods
imposing sparsity directly on the regression coefficients
.
Many methods select informative variables individually, but for spectroscopic
applications it is often desirable to select a group of variables associated
with the same absorption band. Elastic net (EN) regularization
adds an L2 penalty to the regression coefficient vector
in addition to the L1 penalty imposed by the least absolute shrinkage and
selection operator (LASSO) , thereby imparting a
grouping effect in selection. Interval variable selection methods
draw upon methods discussed previously but employ additional
constraints or windowing methods to target selection of contiguous variables
(i.e., an algorithmic approach to truncation).
evaluated two embedded (sparse PLS) algorithms and one
hyphenated method in which EN was used as a filtering method prior to PLS
calibration (EN-PLS, ) for TOR OC and EC calibration in the
IMPROVE network. A suite of reduced-wavenumber models were considered by
varying model parameters that controlled the sparsity and evaluated using
CV and separate test set samples. Since full-wavenumber
calibration models (both raw and baseline corrected) for TOR OC and EC in the
IMPROVE networks already performed well (Sect. ),
wavenumber selection did not improve model predictions but served mostly to
aid interpretation of the most important absorption bands.
found that these methods could use as few as
4 %–9 % of the original wavenumbers (2784 for raw and 1563 for
spline corrected) to predict TOR-equivalent OC and EC. EN-PLS consistently
achieved the sparsest solution (by more than a factor of 2 in almost all
cases) on account of the LASSO penalty applied directly to the regression
vector. While all variable selection methods generally performed well for
TOR-equivalent OC and EC prediction in 2011 IMPROVE samples, calibrations for
organic functional groups built using SPLS algorithms appeared to be
less robust in extrapolation to ambient sample spectra. While also being the
most sparse, EN-PLS yielded predictions similar to the original PLS (full
wavenumber) models that led to OC reconstruction from
summed functional group contributions having better agreement with TOR OC
than other sparse calibration algorithms, including EN without PLS. This
finding suggests that variables eliminated for being uninformative in the
calibration set samples may lead to undesirable oversimplification of a model
that may be used with samples with potentially different composition, though
this hypothesis has yet to be tested with calibrations developed with ambient
measurements as reference, for which the extent of extrapolation may not be so
severe as with calibrations developed with laboratory standards.
applied BMCUVE to second-derivative or
spline-corrected spectra in the CSN network. Improved MDL but otherwise
similar performance metrics to the raw (full wavenumber) calibration model
was obtained using the reduced model for TOR OC (performance described in
Sect. ), though the individual contributions of
baseline correction and wavenumber selection to improvement in MDL were not
investigated. The impact of wavenumber selection on model performance was not
investigated for TOR EC, but the reduced-wavenumber model predicted EC within
TOR precision (Sect. ). Interpretation of the
selected wavenumbers is discussed in Sect. .
Interpretation of important variables and their interrelationships
Interpreting the relationships among variables being used by a statistical
model to make predictions is a challenging topic on account of
its
semiempirical basis. In particular, it is possible to exploit statistical
correlations among the variables to make predictions, which can be
detrimental if the correlation changes or model is applied in a different
context. Therefore, model interpretation is strongly related to anticipation
of model applicability and a priori identification of samples with
potentially high prediction errors
(Sect. ). Inspection of how LVs and
absorption bands are used by a model can give an indication of their
importance and possibly establish a physical basis between analyte
concentrations and their relevant vibrational modes. Existence of potentially influential subgroups can initiate
identification
of relevant sample characteristics that have a disproportionate role in
prediction. To some extent, discussions in Sect.
and focusing on eliminating uninformative
variables (LVs or wavenumbers) during the model selection process are also
relevant in this context (some of the same techniques are applicable to both
tasks), but the focus will be on understanding the importance of the
remaining variables. The importance of samples and specific attributes
(concentration or composition) associated with them is addressed in
Sect. .
As with complex mechanistic models, a general investigation can be carried
out through sensitivity analyses . One of
the advantages of a PLS regression approach is that the contribution of each
LV to the response (y) or spectra matrix (X) can be
characterized by the explained sum of squares (SS) and its normalized
surrogate, explained variation (EV) . The
emphasis placed by a model on particular wavenumbers can be examined through
its regression coefficients b, selectivity ratio (SR)
, or the variable importance in projection (VIP) metric
. These quantities can be written using j and
k as indices for wavenumber and LV (with J as the total number of
wavenumbers), respectively.
\begin{aligned}
\mathrm{SS}_{y,k} &= q_k^2\, \mathbf{t}_k^{\mathsf{T}}\mathbf{t}_k, \qquad
\mathrm{SS}_{X,k} = (\mathbf{p}_k^{\mathsf{T}}\mathbf{p}_k)\,(\mathbf{t}_k^{\mathsf{T}}\mathbf{t}_k), \qquad
\mathrm{SS}_{X,j} = \mathbf{p}_j\,\mathbf{T}^{\mathsf{T}}\mathbf{T}\,\mathbf{p}_j^{\mathsf{T}}, \\
\mathrm{EV}_{y,k} &= \frac{\mathrm{SS}_{y,k}}{\mathbf{y}^{\mathsf{T}}\mathbf{y}} \times 100\,\%, \qquad
\mathrm{EV}_{X,k} = \frac{\mathrm{SS}_{X,k}}{\mathrm{tr}(\mathbf{X}^{\mathsf{T}}\mathbf{X})} \times 100\,\%, \qquad
\mathrm{SR}_j = \frac{\mathrm{SS}_{X,j}}{\mathbf{e}_{X,j}^{\mathsf{T}}\mathbf{e}_{X,j}}, \\
\mathrm{VIP}_{jk} &= \left[\, J\, \frac{\sum_{\ell=1}^{k} \mathrm{SS}_{y,\ell}\,\big(w_{\ell j}/\lVert\mathbf{w}_{\ell}\rVert\big)^2}{\sum_{\ell=1}^{k} \mathrm{SS}_{y,\ell}} \,\right]^{1/2}
\end{aligned}
Note that for new samples, the loadings (q and p),
SS (T^T T), and the means used for centering of
each array (y and X) are fixed according to the
calibration set. For PLS, the EVX is not as commonly examined as
for other factor analysis techniques as the primary objective is in
explaining the variation in y. In addition to metrics characterizing
the overall importance of latent and physical variables, the (normalized
Euclidean) distance of individual samples from the center of the calibration
space can be indicated by its leverage h. For mean-centered PLS, h is
computed for the row vector of new scores t corresponding to sample i
weighted by the inverse SS of the calibration set
:
h_i = \mathbf{t}_i\,(\mathbf{T}^{\mathsf{T}}\mathbf{T})^{-1}\,\mathbf{t}_i^{\mathsf{T}}.
The sample leverage is used to assess influential points in the model,
identify outliers, and estimate prediction variance (prediction intervals).
Further discussion of leverage used in the last two objectives is discussed
in Sect. . Regression coefficients can oscillate
between positive and negative values as a higher number of LVs is used,
and their magnitude must be considered together with that
of the absorbance (i.e., large regression coefficients coupled with small
absorbances may not have a large impact on the modeled outcome); metrics such
as SR or VIP can be more useful to assess their relative importance (the two
vary in ease of interpretability for different types of data and data sets;
).
For TOR analysis, VIP scores have been used to interpret wavenumber
importance . VIP
scores can also be used as a filtering method
(Sect. ) for wavenumber selection
e.g.,, but here they have been used
only for post hoc interpretation in this work. The main principle is that
the mean VIP score across all wavenumbers is unity, so those with more
influence in explaining y carry values above and those with less
influence fall below. However, found that the actual
importance threshold can be data specific, with dependence on the proportion
of uninformative predictors, predictor correlation, and the actual values of
the regression coefficients. Meaningful threshold values varied between 0.8
and 1.2 in their work. VIP scores for TOR models are summarized in
Fig. . Wavenumbers associated with TOR OC, not
surprisingly, span a range of functional group structures. Common functional
groups interpreted for both 2011 IMPROVE and 2013 CSN include aliphatic C–H
and carbonyls (carboxyl, ketone, ester, aldehyde), with possible
contributions from various nitrogenated (amine, amide, nitro) groups
. Other candidate bonds are described but
assigned with less certainty on account of strong overlap of absorption bands
in some spectral regions. based their interpretation on
the selected wavenumbers and VIP scores for both raw and baseline-corrected
models under a “common bond” assumption that the two models base their
predictions on the same set of functional groups rather than different
ones. Based on this assumption, it appeared that the two models were using
different vibrational modes (stretching or bending) for aliphatic C–H and
alcohol O–H, though bending modes typically exhibit weaker absorption
signatures. The capability to accurately predict TOR-equivalent OC
concentrations in samples with different OM/OC ratios (determined by
functional group calibration models with FT-IR spectroscopy) as discovered through
permutation analysis (Sect. ) suggests
that on average there is some insensitivity to weighting of functional
groups that determine the degree of functionalization in the sample.
For TOR EC, among other functional groups, wavenumbers selected between
1600 and 1500 cm-1 were attributed to C–C and C=C stretching
in skeletal ring structures of aromatic or graphitic carbon
. While this absorption band corresponds to
lattice vibrations in graphitic carbon and is commonly used
in Raman spectroscopy for characterization of soot particles
, a peak has been observed in mid-IR spectra
only after crystalline structure is broken down through mechanical stress
. Nonetheless, a peak of
moderate to broad width in this region is observed in soot , soil BC ,
and coal . In constructing a PLS model to predict BC in
soil from mid-IR spectra, further removed the
potential influence of correlation between EC and OC in soil samples by
predicting the BC content normalized by OC with an R2 of 0.81. This
analysis encouraged their interpretation that the aromatic structures visible
in their first PLS loading weight vector were specific to BC, which
potentially supports the same interpretation for atmospheric samples.
However, found that a calibration model for ELLA did not
require aromatic structures for prediction of TOR-equivalent EC. This site
was located in close proximity to a toll station on the New Jersey Turnpike
and was characterized by high diesel PM loading, a low OC/EC ratio,
and a
low degree of charring compared to samples from other CSN sites in the 2013
data set. The calibration model was able to predict TOR-equivalent EC
concentrations primarily using absorption bands associated with aliphatic C–H
(also selected in the calibration model for the other 2013 CSN sites) and
nitrogenated groups believed to be markers for diesel PM. A standard method
for quantification of soot recommends the use of
scattering characterized at 2000 cm-1 (without baseline
correction) on the assumption that there is no absorption usable for
quantification. Given that baseline-corrected spectra (in which scattering at
2200–1900 cm-1 in addition to other wavenumbers with negligible
absorption is forced to zero) are able to predict TOR-equivalent EC
concentrations in both 2011 IMPROVE and 2013 CSN – and most relevant
wavenumbers are in regions associated with visible absorption peaks – the
predictions do not appear to be based on scattering in this application.
Early work by reported a calibration for collocated EGA
EC using a peak located at 666–650 cm-1 in mid-IR spectra of PM
collected onto PTFE filters at Glendora, CA. However, what vibrational mode
this peak corresponds to is unclear, as there is also IR interference from
the PTFE substrate in this region . The true nature of
operationally defined TOR EC and a definitive reason that its concentration
can be predicted from mid-IR spectra are ongoing topics of investigation.
Surface functionalization of graphitic combustion particles
is estimated to be a small fraction of the
functional groups from organic aerosol in the same sample and therefore
considered to be unlikely to be useful for calibration. Soot emissions
comprise both light-absorbing BC and OC
, and it is possible that both fractions
exhibit mid-IR activity (some structures co-absorbing in the same region)
that can be used for quantification. Whether the functional groups used for
prediction of TOR-equivalent EC are due to the organic fraction associated
with incomplete combustion or other indirect markers warrants further
investigation in controlled studies.
While the large number of LVs used by the IMPROVE calibration models
precluded attempts at identification of individual components,
were able to do this for 2013 CSN TOR OC calibration
models on account of their low complexity. Application of second-derivative
baseline correction, BMCUVE wavenumber selection, and model selection by
Wold's R criterion resulted in a four-LV model for TOR OC. Further nuanced
interpretation was aided by re-projection of LVs onto principal component analysis (PCA) space, which modeled
much of the same variance as PLS scores but were formulated and arranged
according to their capability to explain the remaining variance in the
spectra instead of the covariance with respect to TOR OC. By visualizing the
sample spectra in two dimensions of this space using a conventional biplot,
identified a subset of samples with extraneous variance
in 2013 CSN spectra attributed to water vapor in the beam path present during
spectra acquisition in the laboratory. While the water vapor conferred
minimal prediction error, loading this spectral interference onto one
dimension and excluding it in the final calibration model improved
interpretability with a more parsimonious model using only the three remaining
components. Surprisingly, a single component representing an organic mixture
explained close to 90 % of the TOR OC variance, with the remaining two
components attributed to interferents: PTFE substrate and ammonium nitrate
(explained variation of 3 %–4 % each).
Model interpretation is a continual challenge but a necessary aspect of
statistical modeling from a chemometrics perspective and remains an active
area of investigation for TOR analysis. While the LVs are not constrained to
be nonnegative as are the factors of multivariate curve resolution, positive
matrix factorization, or nonnegative matrix factorization , the relative
variation in scores can be analyzed
alongside auxiliary measurements to identify their importance toward specific
PM samples. This association can be made in a correlative capacity
or through more sophisticated means such as
target transformation factor analysis . In
addition, the way of obtaining LVs can be modified to accommodate features
from TOR OC and EC simultaneously. A variant of PLS that can potentially aid
in this endeavor is “PLS2”, which uses a shared representation of LVs for
multiple response variables . Shared representations are
commonly used in multitask learning to build models that
generalize from fewer diverse training instances and may additionally
confer benefit in this context for understanding the interrelationship
between these two substances and their thermal fractions. The
univariate-response formulation of PLS (“PLS1”) as described in
Sect. has been the focus of past work with TOR
calibrations as it typically achieves the same or better accuracy as PLS2
with fewer LVs , but the potential for PLS2 in improved
interpretation and robustness in a wider range of contexts is an area that
can be further explored.
Selected wavenumbers (blue points) overlayed on mean of calibration
spectra (black lines). The 2011 IMPROVE spectra remain unprocessed (“raw”) or
baseline corrected using smoothing splines (“baseline corrected”), while
the 2013 CSN spectra are baseline corrected using the Savitzky–Golay second-derivative approach. “Atypical” and “typical” categories for 2013 CSN EC
refer to samples for Elizabeth, NJ, and the remaining nine sites,
respectively.
Sample selection
To design a campaign to collect both FT-IR spectra and reference measurements
or to select among available collocated measurements in a database to
construct a new calibration model, it is necessary to address the question of
how many samples, and of which type, are needed. Provided that the form
of a data set can be fit by several models, it is possible for the simpler
ones with more training data to outperform more complex ones with fewer
training data for new predictions . This argument can be
rationalized in a chemometric context by conceptualizing an ideal calibration
model as one built upon samples of identical composition and concentration
(with replicates) for every sample in the prediction set. Especially for
complex PM components such as TOR OC and EC that have a multitude of
absorption bands in the IR from both target and interfering substances,
enough samples must be included in the calibration set to span the range of
multiple attributes. For each unique sample removed from the calibration set,
the corresponding composition in the prediction set must be estimated by
mathematical interpolation or extrapolation from the remaining samples.
Reducing the number of calibration samples increases the dependence of the
predictions on the functional form or weighting scheme (with respect to
variables and samples) of the selected model with possible consequences for
prediction accuracy. Lacking mechanistic constraints, predictions from
data-driven models may exceed physical limits with increasing reliance on the
underlying algorithm over measurements. The obvious importance of chemical
similarity in calibration can be related back to physical principles that
give rise to the observed mid-IR spectrum. First, for any given wavenumber,
the absorption scales with analyte abundance – simpler calibration models in
analytical chemistry built on this principle dictate that the concentration
range covered by calibration samples should bound the concentrations in the
new samples so that values are interpolated rather than extrapolated to
minimize prediction error. Second, complex absorption profiles arise from
inhomogeneous broadening of absorption bands in the condensed phase.
Therefore, samples with a chemical composition similar to that of new samples are
likely to have similar patterns of absorbance and interferences that can be
accounted for by the calibration model.
A basic premise follows that calibration models built with samples having
similar spectroscopic profiles, specifically near the most relevant
absorption bands, are likely to yield better prediction results for new
samples. For analysis of simple mixtures, one common strategy pursued in
experimental design is to prepare samples that populate the chemical
coordinates (e.g., molar concentrations of its constituent species) of
anticipated composition according to Euclidean distance .
However, this procedure does not guarantee that the training and prediction
data will have similar distributions in the feature space of an effective
calibration model (i.e., similarity may not be best characterized by
Euclidean distances). This task is further complicated by the fact that
chemical similarity is not easy to define for composite substances (TOR OC)
or chemically ambiguous quantities (TOR EC). Moreover, the samples for
calibration at the level of chemical complexity of atmospheric mixtures are
typically limited by the availability of collocated measurements (e.g., TOR
reference measurements together with sample spectra from PTFE filters).
In the context of these challenges, the canonical (base case) strategy
for TOR OC and EC calibration has been to use space and time as a proxy for
composition. A stratified selection approach – in which selected samples are
evenly spaced out over a full year at each measurement site – is used to
construct the calibration set, as there is reasonable expectation that an
adequate representation of emission sources and extent of atmospheric
processing can be captured. Blank PTFE filter spectra are added to the
calibration set and their corresponding reference concentrations are set to
zero, as this value is as valid as the TOR-determined concentration for
below-MDL samples. Excluding irregular events (e.g., wildfires), this
approach can be effective in building a general calibration model for
atmospheric samples and has demonstrated good performance
(Sect. ). However, samples from the same site and season
are not strictly required for successful prediction of each new sample.
demonstrate accurate prediction for a full year of TOR
OC and EC concentrations at sites not included in the calibration (also
revisited in Sect. ). The extent to which site, season,
local emission, or meteorological regime of a new sample affects prediction
depends on how these factors contribute to deviation in chemical composition
from calibration samples. We further summarize our efforts in understanding
which types of samples are important
(Sect. ) and how many samples are
needed (Sect. ) for calibration. Lastly,
we describe how specialized calibration models can better serve a specific
set of samples that are not well-represented in the feature space of all
calibration samples (Sect. ).
Important attributes
Our findings indicate that many, though not all, methods for sample selection
can lead to an acceptable calibration model as determined by evaluation
criteria described in Sect. . To investigate which
aspects of similarity are important in this regard, performed permutation analyses on the available set of samples
to study how differences between calibration and test set samples influenced
prediction errors. Samples were grouped according to values of descriptors
chosen to capture the effect of analyte concentration (TOR OC, EC), source
and degree of functionalization (OC/EC and OM/OC), and
inorganic interferences (ammonium/OC, ammonium/EC). Predictions
were evaluated when the distribution of these descriptors represented in the
calibration set was selected to be either similar or different to those in
the test set. To construct calibration and test sets according to these
specifications, samples were arranged in order of a particular attribute. For
similar calibration and test set distributions, every third was reserved for
the test set while the remainder was used for calibration. To examine the
effect of extrapolation with respect to any attribute, the calibration set
was constructed from samples with the lowest two-thirds or highest two-thirds
of attribute values, and the remainder were used for the test set. To examine the
effect of interpolation, the highest third and lowest third were used for
calibration and predictions made on the middle third of samples. Inadequate
representation of any of these variables in the calibration set led to
increased errors in model predictions, but with typically low bias in
interpolation. TOR OC could be predicted with only a marginal increase in bias
(median absolute bias of 0.1 µgm-3) and no increase in
normalized error (∼10 %) even when extrapolating to concentrations on
average 3 times higher than those in the calibration set, indicating a calibration that was effectively
linear over the range tested (0–8 µgm-3). For samples
varying in OM/OC ratio between 1.4 and 2.5, normalized error in predicted
TOR OC increased from ∼10 % when the calibration and test sets were
similar to 14 %–17 % when they were forced to diverge according to
the segmentation described above, but the predictions remained unbiased. The
largest increase in prediction error came when using calibration samples with
low ammonium interference (low ammonium/OC ratio) to predict samples with high
ammonium content, with an increase in normalized error from ∼10 % to
24 %. For TOR EC, almost every extrapolation scenario resulted in an
increase in either bias or normalized error (by 10 to 60 percentage points),
suggesting its sensitivity to a large number of sample attributes.
Such permutation analyses permit independent evaluation of attribute
importance only to the extent that they are not correlated in the samples.
For instance, for 2011 IMPROVE, much of the variability across the entire
data set was driven by the two collocated urban sites in Phoenix, AZ, which
contained higher concentrations of less functionalized PM in general than the
remaining rural sites. However, normalization strategies – e.g., of ammonium
by OC or EC – reduced confounding effects.
only tested each univariate case in turn, but multidimensional permutation
analysis in which samples are partitioned according to differences across
multiple variables for model building and testing may be possible with a
large number of samples. Computational resources permitting, bootstrap
sampling combined with post analysis may provide another means of testing the
importance of particular attributes in such instances.
Number of samples
The minimum number of samples required by a model depends on the capacity of
its calibration samples to collectively represent the diversity of
composition in new samples and on the ability of the algorithm to effectively
interpolate or extrapolate into unpopulated regions of the composition space. To
illustrate this notion, we present the change in prediction metrics for
TOR-equivalent OC as a function of the number of ambient samples in the
calibration set (Fig. ). Beginning with samples selected
according to the base case strategy (stratifying by space and time) as the
initial reference, the number of ambient samples in the calibration set is
reduced while the number of blank samples is held constant. The set of test
samples is also fixed for all evaluations. While the conclusions are not
strikingly obvious, some overall trends can be noted.
Figure shows a general decrease in prediction accuracy
with fewer number of ambient samples, especially below ∼150 samples,
though individual differences among most models are not statistically
significant. The gradual degradation in prediction accuracy is attributed to
difficulty in maintaining representativeness of important attributes with a
small number of samples. Figure shows the increasing
difference in empirical probability distributions of attributes in the
calibration and test set samples as a function of the number of ambient
samples using the Kolmogorov–Smirnov test statistic (higher values indicate
higher dissimilarity between the calibration and test set distributions). The
increase in differences between the distributions in TOR OC, but particularly
the ammonium/OC ratio, is the primary cause as it was determined to be
a critical attribute for TOR OC prediction
(Sect. ). Due to the diminishing
statistical power with fewer calibration samples, statistical significance is
not established in this regime; we therefore interpret these results
qualitatively. The MDL is generally maintained or improved with decreasing
number of ambient samples, which is sensible as the number of blank samples
grows in proportion. Conversely, varying the number of blank samples
between 0 and 36 when they were included with 501 ambient samples in the
calibration set did not have a large effect on the MDL.
The prediction accuracy for TOR OC as a function of the number of
ambient samples in the calibration set (the number of blanks was kept
constant at 36). Using the 2011 IMPROVE base case calibration model, every
nth sample was removed (which leaves spatial and temporal representation of
samples close to the original set). The performance metrics are computed on
the same 286 test set samples for all calibration models.
We might conclude that larger calibration sets, which are more likely to cover
the range of attributes in new samples, lead to better model performance.
show an example for raw spectra. Without baseline
correction, TOR OC concentrations for two sites – FRES and BYIS – in 2013
IMPROVE were not predicted well by the original model. Predictions were shown
to improve when samples from these sites were included .
In this case, the calibration set without FRES and BYIS was too small in that
it did not contain the appropriate representation of specific sample
characteristics. However, as with wavenumbers, populating the calibration set
with an increasing number of unrelated or uninformative samples with respect to
a targeted class of samples may lead to added noise or bias from unfavorable
model weighting. In such instances, smaller, dedicated models may be better
for specific classes of samples provided that it is possible to distinguish
which model is best suited for each sample. In the next section, we describe
cases in which a smaller subset of samples for calibration have been found to
be appropriate for improving specific performance targets.
Kolmogorov–Smirnov (KS) test statistic for different numbers of
calibration samples used in Fig. . The KS statistic
characterizes the difference between two empirical distribution functions, in
this case determined for probability distributions of each variable between
the calibration and test set samples.
Smaller specialized models
While a large, monolithic model may be most capable of accommodating diverse
composition in prediction set samples, models that assume underlying
structure of the chemical domain for interpolation or extrapolation may be
susceptible to undue influence by one or more groups of (high leverage)
samples and return biased predictions for a specific set of underrepresented
samples. Statistical localization is the process by which calibration models
are built with samples that are closest in composition to samples for which
predictions are desired. While the overall number of samples used for
training in each localized model is reduced, the distribution of the
calibration model better reflects that of the subset of samples for which new
predictions are to be made. Together with a classifier capable of selecting
the appropriate localized model for each new spectrum, several models can
collectively function as a single multilevel model to provide a best estimate
of the targeted concentration.
This approach has been applied to TOR EC calibration in both networks studied
(Fig. ).
constructed a multilevel model consisting of
calibrations for two different concentration regimes for 2011 IMPROVE. A
calibration model using only a third of the lowest concentration samples
(areal density <0.68 µgcm-2) led to an MDL of
0.01–0.02 µgm-3, while using the full range of areal
loadings for calibration led to an MDL of 0.03–0.08 µgm-3.
Overall prediction errors for low-concentration samples were also reduced with a dedicated
model, but to a lesser extent than the MDL. The full range model served as a
classifier; predictions that fell below the areal loading threshold according
to this model were refined with the low-concentration calibration model. As
discussed in Sect. , ELLA was believed to be
influenced by diesel emission sources that led to different PM composition
and spectral characteristics from the remaining nine CSN sites. Therefore,
predicted concentrations for ELLA were systematically biased low compared to
observations. trained a partial least squares
discriminant analysis (PLS-DA) model on geographical location to segregate
typical samples from atypical ones that resembled ELLA spectra. Spectra
classified as being atypical were predicted using a model trained solely on
ELLA samples, while the ones classified as typical were predicted using a
model trained on the rest of the samples. Considering the overall model
performance for all samples, using this multilevel approach led to an
improvement in R2 from 0.76 to 0.88 and a decrease in bias from 5.2 %
to 2.7 % (with corresponding improvements in MDL, precision, and other
figures of merit). The difference in metrics was largely due to improvement
in ELLA predictions, as the predictions for non-ELLA samples were similar in
both approaches (mean errors of 0.15 and 0.16 µgm-3 and
R2 of 0.83 and 0.85 for the monolithic and multilevel models,
respectively).
Multilevel modeling strategies used for TOR EC in the IMPROVE and
CSN network. In the left figure, “full range” denotes the calibration model
using the full range of TOR EC concentrations, while the “low conc.”
denotes the model using only the lowest third. In the right figure,
“atypical” samples were taken from a particular site (ELLA) while the
“typical” samples comprised the rest.
Operational phase of a calibration model
The operational phase of the model marks a departure from the building and
evaluation phases (Fig. ) in that reference
measurements may no longer be available on a regular basis. However, this is
the eventual use case for such calibration models – for instance, to enable
FT-IR spectroscopy to provide TOR-equivalent carbon values from a PTFE filter at new
monitoring sites or measurement campaigns where TOR analysis from a separate
filter is not available. Without reference measurements, it is important to
evaluate the appropriateness of available calibration models for new samples,
continually monitor the performance of the model by introspective means, and
update the calibration as necessary. To this end, we describe methods for
anticipating prediction errors arising from precision and bias
(Sect. ) and strategies for calibration maintenance
(Sect. ).
Anticipating prediction errors for new samples
We dedicate this section to describing ways for anticipating prediction errors
in new samples during the operational phase of a calibration model. Higher
prediction errors may arise from a decrease in precision or additional
biases incurred for samples that are not well-represented by the calibration
samples. The former can be approximated from the measurement noise
characterized from the calibration set, while the latter is assessed on a
more qualitative scale based on similarity of new samples to those in the
calibration set. Anticipating these errors is imperative for reporting
estimated precision for new samples, monitoring systematic changes in model
performance, and selecting an alternate calibration model for new samples
when prediction quality is questionable. For this task, we assume the
unavailability of reference measurements for which evaluation methods in
Sect. and would otherwise
apply, and we primarily rely on spectral characteristics. To this end,
Sect. discusses the construction
of prediction intervals around point estimates,
Sect. covers the strategy for
outlier detection, and Sect.
illustrates the use of sample similarity assessment for comparing suitability
of models. The raw spectra TOR EC calibration model for IMPROVE 2011
introduced by and evaluated for 2013 by
is revisited on account of its high prediction error and
difficulty anticipating prediction errors compared to TOR OC.
Sample-specific prediction intervals
In Sect. , discussions focused around providing and evaluating
point estimates of prediction. Additionally, interval estimates for each
sample can be obtained to determine prediction uncertainty under a fixed
relationship between model and data assumed under conditions of the
calibration. In effect, prediction intervals describe the magnitude of errors
that are similar to those in the calibration set and can be obtained from
error propagation or resampling (bootstrap or jackknife)
or by employing a Bayesian framework . We will restrict
our discussion to estimating prediction intervals as they pertain to
multivariate linear regression (including PLS). Provided that sufficient data
exist, numerically resampled intervals can be generated free of assumptions
regarding underlying distributions, but the error propagation approach is
favored on account of its connection to the fundamental processes
contributing to the errors. The standard error of prediction has two primary
contributions: the model contribution from calibration and the measurement
contribution from the prediction sample. These contribute nonlinearly to the
prediction error, but an approximate expression can be derived through local
linearization (i.e., neglecting higher-order terms, as is typically assumed in error
propagation) . This
approximation results in a tractable expression for the prediction standard
error σy^,i similar to that used by ordinary least squares
regression but considers heteroscedastic errors :
\sigma_{\hat{y},i} = s\,(1 + h_i)^{1/2}.
The point estimate of prediction can then be bounded by an interval defined
as ±t_{α,ν} σ_{ŷ,i}, where t_{α,ν} denotes
a t distribution with significance level α and degrees of freedom
ν. s is estimated from the fitting error – the mean squared error of
calibration (MSEC, squared error normalized by the degrees of freedom). While
a common assumption is that s captures only the prediction variance, the
MSEC can implicitly include the prediction bias if present in the fit of the
calibration set. h is the leverage introduced in Eq. (),
and its role can be rationalized by the fact that samples closer to the
“average” calibration sample are more precisely estimated than those which
are further away. The approximations made for Eq. ()
result in a method that is most applicable for a small noise and small range
of FT-IR absorbances . Furthermore, prediction
standard error can be refined by subtracting the precision of the reference
measurement , but it is not considered here.
The prediction intervals given by Eq. () calculated
for TOR-equivalent OC and EC are shown in Fig. .
Low standard errors of predictions anticipate low prediction errors, but
prediction errors for higher concentrations (3–85 µgcm-2)
are more variable than indicated by the precision error. While deviations
from observations in calibration are mostly explained by
Eq. (), and
found that actual prediction errors do not always scale with computed
leverage. This phenomenon is also reported in other applications
and indicates the possible role of bias due to
differences in composition that are not well-captured by this metric.
Point estimates and prediction intervals for the TOR-equivalent
concentrations in the 2013 IMPROVE prediction set. Gray shades indicate the
extent of areal mass densities in the calibration samples. Triangles
represent samples associated with burning (scaled spectra shown in the right
column). Red samples correspond to those for which the difference between
predicted and observed concentrations exceeds the combined uncertainties at
the α=0.05 significance level.
It is also relevant to consider the standard errors of prediction for the TOR
measurements . Naïve propagation of reported errors across
the relevant thermal fractions (including pyrolyzed carbon) leads to
estimates of relative precision that approach 7 % and 14 % for TOR OC
and EC, respectively, for the highest concentrations observed for this
IMPROVE data set. As the errors are not truly independent for each sample, a
simple summation of prediction variances may lead to an underestimation.
However, these calculated errors are close in magnitude to the average
collocated precision error estimated for 2011 IMPROVE (15 % and 23 %
for TOR OC and EC, respectively; Table ) and the combined
uncertainty estimated from analytical, cross-laboratory, and cross-sampler
effects . The relative precision estimated for their
respective calibration models using Eq. () converges
toward values that are approximately 3 times lower for both variables. The
standard errors of prediction of a multivariate model can be lower than the
reference measurements from which it is derived, as random errors from the
latter are averaged out in the calibration process – especially when a large
number of calibration samples are used . However, given
that the apparent collocated precision for model predictions is on a par
with TOR (Table ), it is likely that model uncertainties
calculated from Eq. () are underestimated on account
of unaccounted-for variations. Nonetheless, a general conclusion can still be
drawn that many samples are predicted within uncertainty. There remain
samples (167 for TOR OC and 126 for TOR EC, out of 2177 total) that can be
identified (in red, Fig. ) as having prediction
errors that fall outside the anticipated range of uncertainty of both model
and measurement. We describe procedures for algorithmically detecting these
samples in the absence of reference measurements in
Sect. .
Outlier detection
As described in Sect. , a calibration model that
is likely to be suitable for a new sample is that which is trained on samples
with similar concentration and composition. Therefore, identifying samples
that are different from those in the calibration set of a particular model
is closely tied to anticipation of potentially high prediction errors due to
incurred bias. We first review possible categorizations of samples in a Venn
diagram (Fig. ). Within a multivariate space
encompassing all samples, some will lie at the edge of the domain
(extreme values), while others will lie in sparsely populated
regions of the interior (inliers). Some of these extreme values and
inliers will be statistically surprising given the rest of the points and
are typically labeled as outliers or anomalous samples
. We note that
“inliers”
is sometimes used to refer to statistically different samples that lie
within the composition domain, but we reserve the word “outlier” for all
statistically significant samples in this paper. New samples furthest
from calibration samples in this composition space require
aggressive extrapolation or interpolation (i.e., they are least constrained
by data) and are most likely to suffer in prediction performance. However,
the actual increase in prediction error (if any) will depend on the
functional relationship among variables and how well they are represented by
the model – e.g., a linear relationship modeled by a linear mapping may
perform adequately in interpolation and extrapolation. For instance, samples
with OM/OC and OC/EC composition and TOR OC concentrations out
of range with respect to calibration samples were predicted without
substantial increase in errors
(Sect. ). Therefore, not all outliers
may be associated with high prediction errors.
Dissimilarity can be expressed as a measure of distance or a discrete label
of normal or anomalous resulting from a unary (one-class) classification
. Identification of dissimilar observations is the
subject of many disciplines including chemometrics, machine learning, and
statistical process control and is referred to under various names: anomaly
detection, fault detection, novelty detection, and outlier detection
e.g.,. Together with
knowledge regarding “prediction outliers” (samples with surprisingly high
prediction errors), decisions can be grouped into the following outcomes
(Fig. ): true negative (TN; samples are classified as
being similar and prediction error is low), true positive (TP; samples are
classified as being dissimilar and prediction error is high), false negative
(FN; samples are classified as being similar while prediction error is high),
and false positive (FP; samples are classified as being dissimilar while
prediction error is low). The realization of these outcomes by a classifier
can be used to judge its performance. We note that in contrast to the
multilevel modeling strategy described in
Sect. , the problem of error
anticipation is to build a classifier that identifies all samples not similar
to those in the training set (i.e., outliers, some of which may have
anomalously high magnitude of prediction error) without exhaustive knowledge
or separate training sets comprising the new sample types.
Venn diagram (not to scale, a) and confusion
matrix (b)
depicting the relationship between detected outliers and magnitude of
prediction errors.
Without reference measurements, many external indicators might be used to
characterize differences between new samples and those in the calibration
set, especially with respect to attributes identified to be important
(Sect. ). For instance, the fraction
of inorganic to total PM may give an indication of the ammonium-to-OC ratio, or
NOx may be a valid surrogate for EC in many urban
situations. However, our primary objective is to rely on indicators of
composition and concentration that can be extracted directly from the FT-IR
spectrum to determine the appropriateness of an existing calibration model
for
the new samples. Baseline-corrected spectra have been used in the past to
characterize similarity among ambient aerosol spectra through cluster
analysis e.g., and can also be used
for classification . For instance, many
samples with large deviations in predictions of TOR-equivalent OC from
observed values are spectroscopically similar
(Fig. ) and exhibit sharp methylene peaks and
large carbonyl absorbances present in spectra of biomass burning samples
. Locations and dates of some of these samples
are consistent with known periods of wildfires and will be the topic of
future investigation. The underrepresentation of these types of samples in
the 2011 IMPROVE calibration (and test) sets, or simply the higher
concentrations beyond the calibration range, may explain the proportionally
high prediction errors incurred for these samples. The highest TOR EC
concentrations in 2013 are associated with FRES, an urban site, and BYIS, an
international site, both of which were not part of the 2011 calibration set.
Spectral matching combined with model interpretation
(Sect. ) can identify particular sample types
that may be problematic for a calibration model a priori. However, as sparse
calibration modeling has shown (Sect. ),
not all spectral features are likely to be relevant for prediction of TOR OC
or EC concentrations. Therefore, transformations specific for the target
analyte (which can include but are not limited to spectral processing
techniques described in Sect. ) are likely to reveal
the discriminating spectral features for distinguishing samples that are
different from those in the calibration set.
Projection of the spectra in the feature space of the calibration model
(i.e., factor scores and residuals of PLS or PCA, kernel distances, latent
encoding in Gaussian process) after appropriate spectra processing and
wavenumber selection can provide spectral comparisons that are specifically
meaningful for prediction of the response variable . For PLS regression, the
feature vectors (scores) can be combined into a single metric called the
Mahalanobis distance or Hotelling's T2 statistic
, which are both proportional to the leverage introduced
in Eq. (). The two terms are often used synonymously
e.g., but can also be defined differently
according to rank approximation of X or a coefficient making the
T2 comparable to the F distribution e.g.,. We will adopt the convention of defining T2≡DM2, but reserve Hotelling's T2 statistic for use with its
eponymous test to determine out-of-limit samples (e.g., in statistical
process control) and DM2 for a general distance measure (which is also
used in classification methods built upon different criteria). Outside of
this feature space, the Q(X)-statistic estimated using residuals
E of spectra reconstructed from its latent variables
(Eq. ) can additionally indicate
variations orthogonal to the feature space and hence variations that are
orthogonal to the modeled portion of the response variable
. Therefore, Q(X) is typically monitored
over time alongside T2. The two metrics for mean-centered PLS can be
written as follows:
T_i^2 = D_{M,i}^2 = (N - 1) · h,
Q_i(X) = e_{X,i} e_{X,i}^T = x_i (I - P P^T) x_i^T.
N is the number of samples in the calibration and h is the leverage from
Eq. (). P is the matrix of loadings
(Eq. ) and eX denotes the row vector of residuals
associated with each sample (Eq. ), equivalent to
the product of latent variables unused for calibration. In an analytical
chemistry context, high values of T2 result from extreme values or unusual
combinations of the same chemical components as those in the calibration set,
whereas introduction of new analytes or interferences that result in
spectroscopic response lying outside of the modeled domain would be carried
in the residuals . In practice, the separation of unfamiliar
contributions to the spectra is likely not as clean, particularly with
respect to nonlinear phenomena (e.g., scattering), which can be spread over
multiple factors, and the portion of the spectroscopic signal associated with
new substances may not be entirely apportioned to the residuals.
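A minimal sketch of how these two metrics might be computed for new spectra from a fitted PLS model using scikit-learn is shown below; the arrays, the synthetic data, and the choice of 10 latent variables are placeholders, and the residuals are obtained by reconstructing the spectra from the retained loadings as in the equations above.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Hypothetical calibration spectra X_cal, reference concentrations y_cal,
# and new prediction spectra X_new (synthetic stand-ins)
rng = np.random.default_rng(1)
X_cal, y_cal = rng.random((200, 400)), rng.random(200)
X_new = rng.random((30, 400))

A = 10                                   # number of latent variables (placeholder)
pls = PLSRegression(n_components=A).fit(X_cal, y_cal)

T_cal = pls.transform(X_cal)             # calibration scores
T_new = pls.transform(X_new)             # scores of new samples

# T^2 (squared Mahalanobis distance in the score space):
# T2_i = (N - 1) * t_i (T'T)^-1 t_i'
N = X_cal.shape[0]
S_inv = np.linalg.inv(T_cal.T @ T_cal / (N - 1))
T2 = np.einsum("ij,jk,ik->i", T_new, S_inv, T_new)

# Q(X): squared norm of the spectral residual after reconstruction
# from the retained latent variables
E = X_new - pls.inverse_transform(T_new)
Q = np.sum(E**2, axis=1)
```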
For classification purposes, thresholds for T2 and Q(X) are
determined from the F distribution and the chi-square (χ2) distribution,
respectively, at different significance levels .
Classification and dissimilarity characterization by T2 for a given data
set perform best when the points converge toward a multivariate normal
distribution. Such a distribution becomes less representative of the data set
as the dimensionality of the problem grows, because points become sparsely
dispersed throughout the vast composition space rather than clustered around a
single centroid . To
alleviate this problem, it is useful to conceptualize different relationships
of training data in the column space of T and E against
which new samples are compared. This task can be fulfilled by unary
classifiers that learn patterns from the data without imposition of global
structure (e.g., normality). These approaches may employ superposition of
local potential or kernel density functions , kernel methods , or recursive
partitioning of the chemical space for detection of points
separated from the remainder of the samples.
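The following sketch illustrates one common convention for deriving T2 and Q(X) limits from the F and chi-square distributions, alongside two of the unary classifiers mentioned above (one-class SVM and isolation forest) trained on calibration scores; the specific limit formulas, parameter values, and synthetic arrays are assumptions for illustration, not the settings of the cited implementations.

```python
import numpy as np
from scipy import stats
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

# T_cal, T_new: PLS scores of calibration and new samples; Q_cal, Q_new:
# residual sums of squares (as in the previous sketch); synthetic stand-ins here
rng = np.random.default_rng(1)
T_cal, T_new = rng.standard_normal((200, 10)), rng.standard_normal((30, 10))
Q_cal, Q_new = rng.chisquare(5, 200), rng.chisquare(5, 30)

alpha = 0.05
N, A = T_cal.shape

# Hotelling T^2 limit from the F distribution (one common convention)
T2_lim = A * (N - 1) / (N - A) * stats.f.ppf(1 - alpha, A, N - A)

# Q(X) limit from a moment-matched (scaled) chi-square approximation
g = Q_cal.var() / (2 * Q_cal.mean())
h = 2 * Q_cal.mean() ** 2 / Q_cal.var()
Q_lim = g * stats.chi2.ppf(1 - alpha, h)

S_inv = np.linalg.inv(T_cal.T @ T_cal / (N - 1))
T2_new = np.einsum("ij,jk,ik->i", T_new, S_inv, T_new)
t2_outlier = T2_new > T2_lim
q_outlier = Q_new > Q_lim

# Unary (one-class) classifiers trained on the calibration scores only;
# a prediction of -1 flags a sample as dissimilar to the calibration set
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(T_cal)
iforest = IsolationForest(n_estimators=200, random_state=0).fit(T_cal)
svm_outlier = ocsvm.predict(T_new) == -1
if_outlier = iforest.predict(T_new) == -1
```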
Receiver operating characteristic (ROC) curves for the 2013 IMPROVE
data set. Symbol colors are grouped according to the data used for detection
(either scores T or residuals E). Symbol shapes
indicate method of estimation. “Reference” denotes Hotelling's T2
statistic for scores and the Q(X) statistic for residuals, for which
three open circles are shown for the α={0.1,0.05,0.01}
significance levels. The filled purple symbol indicates the performance
determined by the maximum T2 of the 2011 IMPROVE test set, as originally
used by . For other methods, two symbols are drawn and
connected by dotted lines to indicate the solution with the highest accuracy
(fraction classified correctly) and the solution that lies close to the
coordinate (0,1).
For the 2013 IMPROVE data set, used the 2011 IMPROVE
calibration models developed by and applied
the Mahalanobis distance metric. Heuristic thresholds for DM2 and the
prediction error were determined as their respective maximum values in the
2011 IMPROVE test set for purposes of classification. The number of samples
in 2013 that had prediction errors greater than the selected threshold was
small for both TOR OC and EC – for paired samples above detection limit
across 17 sites, only 36 out of 2189 (TOR OC) and 22 out of 2177 (TOR EC)
samples (1 %–2 % of total) were determined as having high errors
according to this criterion. The overall accuracy (fraction of TN and TP out
of total) was high, with 98 % for both TOR OC and EC. These numbers are
enviable for any classifier but were largely aided by the low number of
high-error samples, which resulted in high overall accuracy from a permissive
DM2 threshold and a limited number of FP classifications. When
considering prediction intervals of both prediction and reference
measurement, some of these high prediction errors are within anticipated
uncertainties of the samples, while a few anomalous samples with errors
outside of the range of uncertainties occur with lower absolute prediction
errors (Sect. and
Fig. ). Therefore, we first correlate the
results of outlier analysis to samples with prediction errors that lie
outside of expected agreement (i.e., prediction outliers). We then revisit
the topic of using these classification algorithms to identify samples with
the highest magnitude of prediction errors.
For this discussion, it is useful to define two additional metrics: true
positive rate (TPR) is the fraction of samples with high error correctly
identified as such, and the false positive rate (FPR) is the fraction of
samples with low errors that are incorrectly identified as having high error.
In a coordinate space with TPR as the ordinate and FPR as the abscissa
(Fig. ), the perfect model lies at (0, 1). For detecting
new or anomalous spectra, we explore classifiers introduced above (potential
function method, one-class support vector machine (SVM), and isolation forest) and consider their
tradeoffs in TPR, FPR, and overall accuracy. For the potential function
method, the radial basis function (RBF) is selected; the free parameters are
the number of nearest neighbors used to determine the kernel width parameter
and the confidence level for the thresholds. For one-class SVM, the RBF
kernel is also used with the kernel coefficient and effective thresholding
parameter varied. For isolation forest, the randomization seed and number of
iterations is varied. For any given model, parameters or effective thresholds
determine an approximate envelope in the space of TPR and FPR referred to as
a receiver operating characteristic (ROC) curve . For
simplicity, the solutions with the highest accuracy (fewest false
classifications) and nearest proximity to the (0,1) coordinate are shown in
Fig. , alongside T2 and Q(X) for the
α={0.01,0.05,0.1} significance levels. For reference, the
heuristic threshold for T2 from is also shown.
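A brief sketch of how TPR, FPR, overall accuracy, and proximity to the (0, 1) coordinate could be tabulated from a continuous outlier score against reference high-error labels, using scikit-learn, is given below; the data are synthetic placeholders.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve

# high_error: reference labels (True where prediction error exceeds the expected
# uncertainty); score: continuous outlier score from any classifier above.
# Both arrays are hypothetical stand-ins for illustration.
rng = np.random.default_rng(2)
high_error = rng.random(500) < 0.05
score = rng.random(500) + high_error * 0.5

# ROC curve over all thresholds of the outlier score
fpr, tpr, thr = roc_curve(high_error, score)

# Pick the threshold closest to the perfect classifier at (FPR, TPR) = (0, 1)
best = np.argmin(np.hypot(fpr, 1 - tpr))
flagged = score >= thr[best]

tn, fp, fn, tp = confusion_matrix(high_error, flagged,
                                  labels=[False, True]).ravel()
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(f"TPR={tp/(tp+fn):.2f}  FPR={fp/(fp+tn):.2f}  accuracy={accuracy:.2f}")
```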
For TOR OC, classification performance using residuals (E) is
slightly but consistently better than using LVs (T): TPR ranges between
10 % and 88 % and FPR between 1 % and 36 % using T, whereas TPR
ranges between 36 % and 87 % and FPR between 4 % and 28 % using
E. For TOR EC, the selected results are
clustered together with a few exceptions; TPRs and FPRs are typically higher
(56 %–85 % and 8 %–38 %, respectively). Regarding
systematic differences among methods over parameters studied, the potential
function and SVM methods can span a wide range of solutions in the ROC space
that follows the arc delineated by the selected points shown (up to TPR and
FPR of 100 %), while all isolation forest solutions remained in close
proximity to the points depicted in Fig. . Both T2 and
Q(X) metrics with the significance levels explored are restricted to the
upper left corner of the ROC space as depicted.
Prediction error distribution (row a) and classification
results using the Q(X) classifier with an α=0.05 significance level
applied to model residuals (row b) for the 2013 IMPROVE data set. A value of 1 corresponds to outliers
and 0 to those not classified as outliers. Triangles and red samples
correspond to the same sample specification as
Fig. ; the rest of the individual prediction errors
are symbolized with open circles.
The tradeoff in TPR and FPR is in part determined by what are designated as
prediction outliers. The stratification of prediction errors by
classification is illustrated in Fig. . A classifier
that is able to identify all samples with prediction errors greater than
expected uncertainties would result in segregation by color in this figure.
However, we see that the prediction outliers are only partially correlated
with the absolute magnitude of prediction error (especially for TOR EC, for
which the pyrolyzed fraction adds a variable contribution to precision error
across samples), while samples flagged as spectroscopic outliers are more
closely aligned with the error magnitude. Furthermore, samples with the lowest
prediction errors are
also not flagged as outliers. That spectral outliers are primarily correlated
with magnitude of prediction errors (more than deviation outside of expected
precision) is sensible. Greater prediction errors are anticipated by sample
leverage (Eqs. and ) used
explicitly or implicitly by classification algorithms, and high leverage can
be related to extreme concentrations for which heteroscedastic measurement
errors are also greater. Biomass burning samples previously mentioned can be
identified visually (and by spectral matching), but they are not necessarily
flagged as outliers with respect to the calibration models. This is not
surprising as prediction errors for burning samples are not systematically
higher, except for the few samples with the highest TOR OC loadings. Revisiting
the classification problem posed by and considering only
the samples with the highest prediction errors exceeding those of the 2011
IMPROVE test set as prediction outliers, it is possible to achieve TPR of
81 % and FPR of 12 % for TOR OC and TPR of 91 % and FPR of
8 % for TOR EC (both with the potential function method) as the solutions
closest to (0,1) on the ROC curve. Outlier detection for TOR EC is better
served by methods alternative to T2 on account of the strong non-normality
in the multivariate feature space . For this scenario,
selecting a classifier with high TPR comes at a cost of lowering the overall
accuracy significantly because of the small proportion of high-error samples.
For instance, moving from the max DM2 classifier of
to the potential function solution for TOR EC as described above, an increase
in TPR from 59 % to 91 % (a difference of seven samples) accompanied by
an increase in FPR from 1 % to 8 % (a difference of 142 samples)
drops the overall accuracy from 98 % to 92 % on account of the large
number of low-error samples that would be detected as being different. The
desired criterion for the optimal classifier may depend on the purpose of
classification. For the purposes of flagging suspicious samples during
routine application of a calibration model, it may be desirable to select a
classifier with high overall accuracy to keep the total number of FNs and FPs
to a minimum. A conservative classifier that favors a high TPR over a low FPR
is, however, likely to be more useful for model selection against a specific
sample (Sect. ).
Model selection without reference measurements
Methods for error anticipation may also be used for evaluating among a set of
candidate models when reference measurements are not available to provide a
full evaluation. To illustrate such an application, we revisit the apparent
increase in mean prediction error shown for decreasing number of ambient
samples in the calibration set displayed in Fig. . The
corresponding increase in mean squared Mahalanobis distance between the fixed
set of 253 test set spectra and those of the changing calibration set is
shown in Fig. . As DM2 increases linearly with the
number of components, only the first 10 LVs are considered in each model for
the purpose of a fair comparison. This example provides indication that the
loss in representativeness of composition or concentration between the 253
predicted samples and calibration samples as the latter numbers are
diminished (Fig. ) is reflected in the FT-IR spectra
and can be appropriately extracted after projecting them onto factor scores
of their respective PLS models.
Mean squared Mahalanobis distance (DM2) between spectra of the
fixed test set and changing calibration set, constructed as described in the
caption of Fig. . Error bars span ± 1 standard
deviation. The first 10 latent variables are used for estimation of DM2
in this example to reduce the dimensionality of the factor space
.
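The comparison described above could be reproduced along the following lines, computing the mean squared Mahalanobis distance of a fixed test set in the score space of PLS models trained on progressively smaller calibration subsets; the array names, subset sizes, and synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def mean_squared_mahalanobis(X_cal, y_cal, X_test, n_lv=10):
    """Mean squared Mahalanobis distance of test spectra to a calibration set,
    evaluated in the score space of that calibration's PLS model."""
    pls = PLSRegression(n_components=n_lv).fit(X_cal, y_cal)
    T_cal, T_test = pls.transform(X_cal), pls.transform(X_test)
    S_inv = np.linalg.inv(np.cov(T_cal, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", T_test, S_inv, T_test)
    return d2.mean(), d2.std()

# Compare candidate calibrations of decreasing size against a fixed test set
# (X_full, y_full, and X_test are hypothetical placeholder arrays)
rng = np.random.default_rng(3)
X_full, y_full = rng.random((501, 400)), rng.random(501)
X_test = rng.random((253, 400))
for n in (501, 400, 300, 200, 150):
    idx = rng.choice(501, size=n, replace=False)
    m, s = mean_squared_mahalanobis(X_full[idx], y_full[idx], X_test)
    print(f"n_cal={n:3d}  mean DM2={m:.2f} ± {s:.2f}")
```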
While we have demonstrated use of DM2 to provide a qualitative comparison
among several models, in principle it would be possible to use the
classifiers introduced in Sect. to
find a set of models for which a new sample is not determined to be
dissimilar. As mentioned in Sect. , a
conservative classifier that favors a high TPR over a low FPR is likely to be
more useful for model selection for any specific sample. A sample-specific
calibration model in which individual compounds are available from a database
for each new prediction sample is in principle possible using concepts
described in this section. However, without a priori knowledge, the most
relevant features and measure of similarity among individual samples is
necessarily defined through the process of calibrating a model. Therefore, it
is at present time necessary to hypothesize or propose several candidate
models and select among them for any new prediction sample or set of samples
for possible improvements in prediction.
Updating the calibration model
Calibration maintenance and transfer learning addresses the problem of
updating a calibration model developed under one set of conditions to
continue providing accurate predictions for samples measured under new
conditions . This topic has
not yet been addressed for TOR OC and EC calibrations using FT-IR spectroscopy, but we can
nonetheless make a few remarks for future research needs. Differences in
sampling or measurement conditions can arise from changes in hardware, changes
in (PTFE filter) substrate, or changes in atmospheric aerosol composition, and
they imply a possible shift in the distributions of training and prediction
data in the feature space of the model. During the operational
phase of the calibration, it is therefore necessary to continuously monitor
model performance and appropriateness for new samples using protocols
described in Sects. and . Notable
changes may be registered by trends in the magnitude of prediction errors
compared against available reference measurements, or increasing instances of
spectral outliers. The role of hardware performance in these changes can be
assessed separately using the analytical protocols summarized in
Sect. – specifically, through the repeated
analysis of laboratory check standards.
Figures of merit for selected FRES (Fresno, CA) and BYIS
(Baengnyeong Island, S. Korea) samples using base case 2011 IMPROVE
calibration and a dedicated model built only using samples from FRES and
BYIS.
Model          Variable   Samples       Bias (µg m-3)   Error (µg m-3)   R2
2011 IMPROVE   OC         FRES, BYIS    0.28            0.43             0.79
Dedicated      OC         FRES, BYIS    -0.03           0.16             0.96
2011 IMPROVE   EC         FRES          0.05            0.10             0.85
Dedicated      EC         FRES          0               0.06             0.93
2011 IMPROVE   EC         BYIS          0.13            0.17             0.60
Dedicated      EC         BYIS          -0.07           0.11             0.66/0.84*
* One outlier removed.
The strategy for model updating can be different according to the cause and
nature of the change, but a basic premise is that the original condition
still holds useful information that can be transferred to the new condition
such that an entirely new calibration is not warranted. In this way, a
significant investment of resources required by model building (consisting of
data collection and evaluation) may be avoided. For changes in instrument
performance or installation of a separate spectrometer, commonly applied
modifications range from simple linear corrections of predictions to
calibration transfer algorithms to convert spectra to resemble those that may
have been acquired from the primary instrument in its original state so that
the original model remains applicable .
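As one concrete example of such a calibration transfer algorithm, direct standardization maps spectra from a secondary instrument into the domain of the primary instrument using a small set of transfer samples measured on both; the sketch below is illustrative, with hypothetical arrays, and is not the specific algorithm evaluated in the cited work.

```python
import numpy as np

# X_primary and X_secondary: the same transfer samples measured on the original
# and the new instrument, respectively (hypothetical placeholder arrays)
rng = np.random.default_rng(4)
X_primary = rng.random((30, 400))
X_secondary = X_primary + 0.02 * rng.standard_normal((30, 400))

# Estimate a transformation F such that X_secondary @ F ≈ X_primary
# (least squares via the Moore-Penrose pseudoinverse)
F = np.linalg.pinv(X_secondary) @ X_primary

# Map spectra from the new instrument into the primary domain, after which the
# original calibration model can be applied unchanged
X_new = rng.random((5, 400))
X_mapped = X_new @ F
```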
The contribution from PTFE can presumably be removed with the appropriate
baseline correction technique (Sect. ).
Though not tested extensively across various filter types, successful
prediction has been reported between two PTFE filter types
. If the PTFE signal is treated as an interferent, training
the model with additional blank (zero-analyte) samples from different filter
types may be an effective approach , though this also requires evaluation. Changing atmospheric composition
can be addressed by updating the calibration set with new samples that
contain new analytes or different regimes in concentration. While there are
recursive algorithms for online updating (reweighting) of models with new
samples , recalibration with the appropriate proportion of old and new
samples will recreate a feature space that accommodates both groups of
samples. When new samples are needed, active learning strategies seek the
potentially most informative samples and minimize the requirement of new
calibration samples .
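A minimal sketch of recalibration with a mixture of old and new samples is shown below; the duplication factor used to upweight the new samples and the placeholder arrays are arbitrary illustrative choices, not a recommended proportion.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Recalibration by augmenting the original calibration set with a limited
# number of newly referenced samples (arrays are hypothetical placeholders)
rng = np.random.default_rng(5)
X_old, y_old = rng.random((500, 400)), rng.random(500)
X_new, y_new = rng.random((60, 400)), rng.random(60)

# Optional duplication ("reweighting") of new samples so they are not swamped
# by the much larger original set; the factor of 3 is arbitrary
X_aug = np.vstack([X_old] + [X_new] * 3)
y_aug = np.concatenate([y_old] + [y_new] * 3)

pls_updated = PLSRegression(n_components=10).fit(X_aug, y_aug)
```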
Additional strategies from transductive learning aim to avoid the
requirement of obtaining new samples for recalibration but rather search for
common feature representations between calibration and prediction set
(“unlabeled”) samples . While these methods are more
typically based on non-PLS-based algorithms and applied to classification
problems , some
results in multivariate calibration tasks give an indication of their
applicability. One approach is to reweight the calibration samples so that
their feature distribution most closely matches that of the new samples . New
estimates weighted by their uncertainty can furthermore be used for
re-estimation of model parameters in an iterative fashion . Another approach
is to re-estimate a feature representation in which the calibration and
prediction samples lie in closer proximity . Limited studies with PLS regression report mixed
results regarding the value of incorporating unlabeled data into the
calibration over simply using the original model . The benefit of such efforts not surprisingly depends on
both the specific characteristics of the calibration model and unlabeled data
.
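One way such reweighting could be realized is by estimating the density ratio between prediction-set and calibration-set spectra with a probabilistic classifier and resampling the calibration set accordingly, as sketched below; this is an illustrative stand-in with synthetic arrays, not the specific algorithm of the cited studies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cross_decomposition import PLSRegression

# Hypothetical calibration data and unlabeled prediction-set spectra
rng = np.random.default_rng(6)
X_cal, y_cal = rng.random((500, 400)), rng.random(500)
X_unlabeled = rng.random((200, 400)) + 0.1      # shifted prediction-set spectra

# Classifier discriminating calibration (0) from prediction-set (1) spectra
Z = np.vstack([X_cal, X_unlabeled])
membership = np.r_[np.zeros(len(X_cal)), np.ones(len(X_unlabeled))]
clf = LogisticRegression(max_iter=1000).fit(Z, membership)

# Density-ratio estimate p_new / p_cal for each calibration sample
p_new = clf.predict_proba(X_cal)[:, 1]
weights = p_new / np.clip(1.0 - p_new, 1e-6, None)
weights /= weights.sum()

# Resample the calibration set according to the weights and refit the model
idx = rng.choice(len(X_cal), size=len(X_cal), replace=True, p=weights)
pls_reweighted = PLSRegression(n_components=10).fit(X_cal[idx], y_cal[idx])
```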
In the context of FT-IR measurements, TOR reference measurements may not be
available for short-term campaigns at new sites and some aspects of transfer
learning and transductive learning strategies (sample reweighting or
basis-set rederivation) may be the only option for improvement if prediction
errors from existing calibration models are expected to be high
(Sect. ). For long-term operation at
a fixed site, collecting a limited number of reference samples for
recalibration initially or periodically can be a viable strategy if sample
characteristics substantially differ from those available for calibration.
For instance, showed that a recalibration strategy can
improve predictions for new types of samples for the IMPROVE network. TOR
predictions for samples collected in 2013 from the FRES and BYIS sites
exhibited not only a high incidence of prediction errors, but also systematic
biases when
using the 2011 IMPROVE model. A dedicated calibration model built with
two-thirds of the available data set at the two new sites improved prediction
performance for samples reserved for testing (Table ).
Whether to incorporate new types of samples into the original calibration set
to build a monolithic model or to unify the calibrations through a
multilevel modeling framework may depend on the number and leverage of new
samples. A model derived from including new samples with old may cease to
perform adequately for the original types of samples. In a case study with the
2013 CSN data set , including ELLA samples in the calibration did not appear
to affect predictions for the non-ELLA samples, but the ELLA samples were also
found to have little leverage within the scope of all samples. When updating an
existing model, it is necessary to re-evaluate the model for old as well as
new types of samples.
Conclusions
The FT-IR spectra of PM are rich in chemical information, and quantitative
information such as TOR-equivalent OC and EC can be extracted from them
provided that we can find the appropriate combination of training samples and
algorithms for extraction. In this paper, we review procedures for
spectral processing and data-driven calibration, in which the data are taken
from collocated measurements of TOR OC and EC. In this effort, procedures for
initial steps for model building and evaluation, and later steps for
monitoring of model behavior during the operational phase of a calibration
model, are described.
The number and types of samples required for calibration are determined by the
diversity of composition in the prediction set. When samples are selected
from the same sites as the prediction set, FT-IR calibration models could
predict with virtually no bias and errors within 0.15 µg m-3
for TOR OC and 0.11 µg m-3 for TOR EC for areal loadings in
the 2011 IMPROVE and 2013 CSN networks. Less than 5 % of samples fell
below the estimated detection limit. These metrics are on a par with the
reference measurement evaluated for the same year. For the 2011 IMPROVE data
set, the number of ambient calibration samples can be reduced from the
canonical number of 501 down to approximately 150 samples and maintain
similar prediction performance for the diversity in composition represented
by 237 samples. To the extent that we have experimented (virtually) for TOR
OC, the limitation likely lies in the difficulty of preserving the same
distribution of the ammonium-to-OC ratio in the calibration set as in the test
set when fewer samples are retained by the temporal and spatial stratified
sample reduction approach illustrated.
As evaluated for the IMPROVE network, TOR-equivalent concentrations in new
samples collected for a later year (2013) and more sites (11 additional ones)
have similar performance metrics overall, with the exception of samples from two
new sites (FRES and BYIS) not in the calibration set. Higher prediction
errors for TOR OC occur largely due to specific types of samples not
well-represented in the calibration year. While these samples are predicted
without bias, their errors are higher on account of the higher areal loadings
of TOR OC beyond the range of original calibration. Estimates of prediction
intervals for both TOR and model predictions suggest that more than 92 %
of samples are predicted within anticipated precision errors. Outlier
detection methods can be used to detect samples that are different with
respect to the modeled domain to provide some indication of the magnitude of
prediction errors. However, accurate detection of high-error samples comes
with a tradeoff of increased FP rates; the outlier detection
method can be selected based on the application and desired tolerance for
each type of detection error (FP or FN). An obvious
solution for reducing prediction errors in different samples is to acquire
new samples for recalibration, though judicious calibration maintenance
strategies (e.g., sample reweighting) can potentially minimize the number of
new samples needed.
The procedure for quantitative prediction of TOR-equivalent OC and EC is a
statistical one and depends on the ability of an algorithm to resolve the
overlapping absorption bands in the mid-IR range and relate relevant features to
the concentration of the target analyte. Given the evolving diversity in
aerosol composition, it is not clear that arriving at an invariant, universal
calibration model applicable for every new sample is practical. However, in
describing the broader context of chemometrics and machine learning
algorithms that are available for addressing each stage of the model life
cycle, challenges for calibrating complex spectra are not insurmountable
provided that they are systematically handled as described in this paper. We
can use a wide range of statistical quality control procedures at our
disposal to assess similarity of relevant features among spectra to
continually monitor model performance, to anticipate appropriateness of
existing calibration models, and to propose revisions. Construction of
calibration models specific to individual or groups of samples may be
envisioned provided that we are further able to identify the most important
spectral features to assess similarities relevant for TOR OC and EC
estimation.
In parallel to ensuring numerical accuracy of a calibration, understanding
how the calibration relates spectral absorbances to TOR concentrations is
critical for anticipating model applicability. Identification of important
vibrational modes used in the calibration facilitates understanding of how
the model relates absorbances to concentrations of the target analyte.
Moreover, this association can be used to gain a better understanding of
molecular structure in complex substances underlying the OC and EC
concentrations reported by TOR. For TOR-equivalent OC, functional groups
typically associated with atmospheric organic matter were found: aliphatic
CH, carbonyls, and nitrogenated functional groups. For TOR-equivalent EC
prediction, the vibrational mode associated with C–C stretch of aromatic
rings typically observed in mid-IR spectra of soot appears to be an important
absorption band, but a model for Elizabeth, NJ, was able to predict
TOR-equivalent EC concentrations accurately without use of this spectroscopic
region. While attempts to understand model LVs have thus far been limited,
some work by indicates that 2013 CSN aerosols could be
modeled with surprisingly few LVs, with nearly 90 % of the variation in
TOR OC explained by one variable. Further analysis of constituent samples
using source apportionment techniques and analysis of chemical composition
(e.g., using functional groups) is bound to benefit overall model
interpretation.
In summary, this paper outlines a general perspective and specific
practices for model building, encompassing judicious specification of
algorithm, spectral processing procedure, and sample selection. Taking a
systematic approach toward calibration with a diverse set of reference
measurements allows us to expand the suite of information extractable from
FT-IR spectra to complement functional group analysis from laboratory
calibrations, which has long been the focus. Given the demonstrated
simplicity and nondestructive nature of acquiring spectra from PTFE filters,
this technique can expand TOR-equivalent OC and EC measurements (which have a
long history) to new campaigns and new locations in which only PTFE samples
are collected for gravimetric reference measurements. Therefore, we
anticipate that the procedure outlined in this paper can complement existing
methods for PM monitoring with TOR-equivalent OC and EC and provide guidance
in extracting composition of substances from FT-IR spectra of atmospheric PM.
Given that a wide range of inorganic and organic substances display mid-IR
activity, further exploration of data sources and algorithms for quantitative
analysis can continue to expand the cost-effective application of FT-IR in
chemical speciation measurements.