Predicting ambient aerosol thermal–optical reflectance (TOR) measurements from infrared spectra: extending the predictions to different years and different sites

Organic carbon (OC) and elemental carbon (EC) are major components of atmospheric particulate matter (PM), which has been associated with increased morbidity and mortality, climate change, and reduced visibility. Typically OC and EC concentrations are measured using thermal–optical methods such as thermal–optical reflectance (TOR) from samples collected on quartz filters. In this work, we estimate TOR OC and EC using Fourier transform infrared (FT-IR) absorbance spectra from polytetrafluoroethylene (PTFE, Teflon) filters using partial least squares regression (PLSR) calibrated to TOR OC and EC measurements for a wide range of samples. The proposed method can be integrated with analysis of routinely collected PTFE filter samples that, in addition to OC and EC concentrations, can concurrently provide information regarding the functional group composition of the organic aerosol. We have used the FT-IR absorbance spectra and TOR OC and EC concentrations collected in the Interagency Monitoring of PROtected Visual Environments (IMPROVE) network (USA). We used 526 samples collected in 2011 at seven sites to calibrate the models, and more than 2000 samples collected in 2013 at 17 sites to test the models. Samples from six sites are present both in the calibration and test sets. The calibrations produce accurate predictions both for samples collected at the same six sites present in the calibration set ($R^2 = 0.97$ and $R^2 = 0.95$ for OC and EC respectively), and for samples from 9 of the 11 sites not included in the calibration set ($R^2 = 0.96$ and $R^2 = 0.91$ for OC and EC respectively). Samples collected at the other two sites require a different calibration model to achieve accurate predictions. We also propose a method to anticipate the prediction error; we calculate the squared Mahalanobis distance in the feature space (scores determined by PLSR) between new spectra and spectra in the calibration set. The squared Mahalanobis distance provides a crude method for assessing the magnitude of mean error when applying a calibration model to a new set of samples.


Introduction
Organic carbon (OC) and elemental carbon (EC) are major components of atmospheric particulate matter (PM), which has been associated with increased morbidity and mortality (Janssen et al., 2011; Anderson et al., 2012), climate change (Yu et al., 2006; Bond et al., 2013), and reduced visibility (Watson, 2002; Hand et al., 2012). The major sources of EC are incomplete burning of fossil fuels and pyrolysis of biological material during combustion (Bond et al., 2007; Szidat et al., 2009). OC may be either directly emitted from sources (e.g., incomplete combustion of organic materials) or produced from chemical reactions involving organic carbon gases (Jacobson et al., 2000). Therefore, OC and EC are monitored over long periods of time by large monitoring networks such as the Interagency Monitoring of PROtected Visual Environments network (IMPROVE, Hand et al., 2012; Malm et al., 1994) in rural areas in the USA; the Chemical Speciation Network/Speciation Trends Network (CSN/STN, Flanagan et al., 2006) in urban and suburban areas in the USA; and the European Monitoring and Evaluation Programme (EMEP, Tørseth et al., 2012) in Europe.
Published by Copernicus Publications on behalf of the European Geosciences Union.
Typically OC and EC concentrations are measured using thermal methods such as thermal-optical reflectance (TOR, Chow et al., 2007; NIOSH 5040, Birch and Cary, 1996; EUSAAR-2, Cavalli et al., 2010) from samples collected on quartz filters. OC and EC are operationally defined by the temperature gradient and gaseous environment in which carbon evolves from the sample. An optical method such as reflectance or transmittance is used to correct for pyrolysis of the organic material (Cavalli et al., 2010; Chow et al., 2007). However, TOR measurements are destructive and relatively expensive.
To reduce the operating costs of large air quality monitoring networks, Dillner and Takahama (2015a, b) proposed using Fourier transform infrared spectroscopy (FT-IR) as an alternative for quantification of OC and EC. This analysis technique is inexpensive, non-destructive, and rapid. It uses PTFE samples, which are commonly used in PM monitoring networks for gravimetric mass and elemental analysis. Moreover, many quantities of interest (e.g., organic functional groups, OM and OM/OC) can be quantified from the same FT-IR spectra (Ruthenburg et al., 2014; Takahama et al., 2013; Russell, 2003).
In this work, we further evaluate the use of FT-IR as an alternative method for quantification of TOR OC and EC by extending the work of Dillner and Takahama (2015a, b) to a different year and different sites. We used FT-IR absorbance spectra and TOR OC and TOR EC measurements collected in the IMPROVE network. In the previous works, the authors used PTFE samples collected at seven sites in 2011. The filters were collected every third day; two-thirds of the samples were used to calibrate the models and the remaining third to test the calibration models. We used the same calibration data set (collected in 2011) as Dillner and Takahama (2015a, b), and we extended the previous analysis by (i) evaluating the models using test samples collected at the same sites as the calibration data set but in a different year (2013), (ii) evaluating the models using test samples collected at 11 sites that are not part of the calibration data set and in a different year (2013), and (iii) proposing a statistical method to anticipate the prediction errors for each site.
To predict OC and EC values, we used partial least squares regression (PLSR, Sect. 2.3). PLSR is a common method to develop calibration models for different compounds (Madari et al., 2005; Weakley et al., 2014; Vongsvivut et al., 2012). Moreover, we also propose a statistical modeling technique to anticipate the estimation error and the goodness of the estimation (Sect. 2.4). We use the squared Mahalanobis distance (Mahalanobis, 1936; Cios et al., 1998) in the feature space (scores of the PLSR; Barker and Rayens, 2003) to identify sites whose predictions are likely to fall outside the predefined error range (presumably because their features are not well modeled with the available data set).

IMPROVE network samples
In this work, we used samples from the IMPROVE network collected in 2011 and 2013 (Table 1 and Fig. 1). We divided these samples into four data sets (Table 1). Calibration 2011 is composed of 526 PTFE ambient samples collected at seven sites in the USA in 2011 (filled and empty square symbols in Fig. 1) plus 36 laboratory blank samples. This data set is the same calibration set used by Dillner and Takahama (2015a, b), including Sect. S3 in the Supplement of Dillner and Takahama (2015b). Test 2011 is composed of 269 PTFE ambient samples (plus 18 PTFE laboratory blank samples) collected at the same sites and in the same year (2011) as the calibration data set (filled and empty square symbols in Fig. 1). This data set is the same test set used by Dillner and Takahama (2015a, b). Test 2013 is composed of 949 PTFE ambient samples (plus 50 PTFE laboratory blank samples) collected at six of the calibration sites but in a different year (2013) (filled square symbols in Fig. 1). Test 2013 additional (Addl) is composed of 1290 PTFE ambient samples (plus the same 50 PTFE laboratory blank samples of the Test 2013 data set) collected at 11 sites that are not part of the calibration data set and in a different year (2013) (black triangles in Fig. 1). Four of the eleven additional sites are urban sites, some of the samples experienced significant smoke impact, and some were collected at an IMPROVE site in South Korea (with high sample loadings).
The IMPROVE network collects samples every third day from midnight to midnight (local time) at a nominal flow rate of 22.8 L min⁻¹, which yields a nominal volume of 32.8 m³ and produces samples of particles smaller than 2.5 µm in diameter (PM$_{2.5}$). The FT-IR analysis is applied to 25 mm PTFE samples (Teflon, Pall Gelman; 3.53 cm² sample area) that are analyzed for gravimetric mass, elements, and light absorption in the IMPROVE network. Quartz filters collected in parallel to the PTFE samples are analyzed by TOR using the IMPROVE_A protocol to obtain OC and EC mass in the IMPROVE network (Chow et al., 2007). The OC and EC values are also adjusted to account for flow differences between the quartz and PTFE samples. IMPROVE samples lacking either flow records for PTFE filters or TOR measurements are excluded.
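As a quick check of the nominal sampling volume quoted above, the arithmetic for a 24 h (1440 min) sample at the nominal flow rate can be written out. The conversion of a filter mass loading $m$ to an air concentration $C$ is shown only as an illustration; it assumes the nominal volume, whereas the actual processing uses the recorded flow of each sample.

$$V = 22.8\ \mathrm{L\,min^{-1}} \times 1440\ \mathrm{min} = 32\,832\ \mathrm{L} \approx 32.8\ \mathrm{m^{3}}, \qquad C = \frac{m}{V}.$$

For example, a loading of 2.4 µg corresponds to roughly $2.4 / 32.8 \approx 0.07$ µg m⁻³ under this assumption.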

FT-IR analysis: spectra acquisition
We analyzed a total of 3034 PTFE ambient samples and 104 PTFE laboratory blank samples using a Tensor 27 Fourier transform infrared (FT-IR) spectrometer (Bruker Optics, Billerica, MA) equipped with a liquid-nitrogen-cooled wideband mercury cadmium telluride detector. The samples are analyzed using transmission FT-IR in air that has low levels of water vapor and CO$_2$, using an empty sample compartment as the reference (a more detailed description is in Dillner and Takahama, 2015a). The filter samples are not treated prior to FT-IR analysis except that values interpolated during the zero-filling process are removed. These spectra contain 2784 wavenumbers.

Building the calibration models
We used the calibration models developed by Dillner and Takahama (2015a, b) to predict TOR OC and EC in the 2013 data sets. For each site, the samples collected in 2011 are ordered by date and every third sample is removed and included in the Test 2011 data set. The remaining samples are placed in the Calibration 2011 data set. Briefly, the calibration models were developed using partial least squares regression (PLSR, Wold et al., 1983; Geladi and Kowalski, 1986) using the kernel PLS algorithm, implemented in the pls library (Mevik and Wehrens, 2007) of the R statistical package (R Core Team, 2015). The goal is to estimate a set of coefficients b from a matrix of spectra X for observation y (OC or EC), with residuals e:

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}. \tag{1}$$

PLSR circumvents issues that arise when collinearity exists among variables in X (strong correlation of absorbances across wavenumbers) and when the number of variables (wavenumbers) in the spectra matrix X exceeds the number of observations (rows of X). PLSR performs a bilinear decomposition of both X and y: the matrix of spectra (X) is decomposed into a product of orthogonal factors (loadings, P) and their respective contributions (scores, T). Observed variations in the OC or EC mass are reconstructed through a combination of these factors (T) and a set of weights (q) developed simultaneously to relate features to the dependent and independent variables; scores and loadings describe the covariance between X and y:

$$\mathbf{X} = \mathbf{T}\mathbf{P}^{\top} + \mathbf{E}, \tag{2}$$

$$\mathbf{y} = \mathbf{T}\mathbf{q}^{\top} + \mathbf{f}, \tag{3}$$

where E and f are the residuals. The two sets of factors are related through a weighting matrix W to reconstruct the set of regression coefficients b:

$$\mathbf{b} = \mathbf{W}\left(\mathbf{P}^{\top}\mathbf{W}\right)^{-1}\mathbf{q}^{\top}. \tag{4}$$

Candidate models for calibration are generated by varying the number of factors used to represent the matrix of spectra.
To select the number of factors, we used K = 10-fold cross-validation (CV, Hastie et al., 2009; Arlot and Celisse, 2010) on the Calibration 2011 data set (Sect. 2.1). We selected the model with the minimum root mean square error of prediction (RMSEP).
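The calibration procedure described above (fit a PLS model with a given number of factors, then pick the number of factors with the lowest cross-validated RMSEP) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the NIPALS form of single-response PLS in NumPy rather than the kernel algorithm of the R pls package, it assumes mean-centered X and y, and the function names and the synthetic data are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """Fit single-response PLS (NIPALS); return coefficients b and centering terms."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean              # centering is an assumption of this sketch
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)                # weight vector
        t = Xk @ w                               # scores for this factor
        tt = t @ t
        p = Xk.T @ t / tt                        # loading vector
        qk = yk @ t / tt                         # weight relating scores to y
        Xk = Xk - np.outer(t, p)                 # deflate X
        yk = yk - qk * t                         # deflate y
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)          # b = W (P^T W)^{-1} q
    return b, x_mean, y_mean

def pls1_predict(X, model):
    b, x_mean, y_mean = model
    return (X - x_mean) @ b + y_mean

def cv_rmsep(X, y, n_factors, k=10, seed=0):
    """Root mean square error of prediction from K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    sq_err = 0.0
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        model = pls1_fit(X[train], y[train], n_factors)
        sq_err += np.sum((pls1_predict(X[held_out], model) - y[held_out]) ** 2)
    return np.sqrt(sq_err / len(y))

def select_n_factors(X, y, max_factors, k=10):
    """Return the number of factors with minimum CV RMSEP and the RMSEP curve."""
    rmsep = [cv_rmsep(X, y, a, k) for a in range(1, max_factors + 1)]
    return int(np.argmin(rmsep)) + 1, rmsep

# Synthetic "spectra": more variables (100) than observations (60), three latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(60, 3)) * np.array([5.0, 2.0, 1.0])
X_demo = latent @ rng.normal(size=(3, 100)) + 0.05 * rng.normal(size=(60, 100))
y_demo = latent @ np.array([1.0, -2.0, 3.0]) + 0.05 * rng.normal(size=60)
best, rmsep = select_n_factors(X_demo, y_demo, max_factors=6)
```

The same fold assignment is reused for every candidate number of factors (fixed seed), so the RMSEP values are directly comparable across candidates.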
For the calibration of the FT-IR spectra to TOR EC, in order to eliminate bias in the calibration and improve prediction capability for low EC samples (EC < 2.4 µg), we used the hybrid calibration approach described in Dillner and Takahama (2015b). We used two calibration models: the first uses samples in the calibration set that are in the lowest one-third of the EC mass range to predict samples in the test set that are also in the lowest one-third of the EC mass range; the second one is utilized for the remaining samples. Localization of the calibration (with respect to concentration) is a commonly used method to improve the performance of the calibration, often at the more difficult to measure low end of the range.

Anticipating prediction errors
For samples collected during different periods or at different locations, it is useful to know whether the present calibration model is appropriate for a new sample or a new set of samples. Using features present in FT-IR spectra, we propose a method to anticipate the prediction error in OC or EC concentrations prior to applying the calibration model. The purpose of such an approach is to determine whether a particular calibration model is suitable for a new set of samples without requiring an a posteriori assessment of prediction accuracy using TOR OC and EC measurements. The feature space is a low-dimensional projection of the absorbance spectra that has been associated with prediction capability for TOR OC and EC, and is defined by the factor scores determined by PLSR (Eqs. 2 and 3). We calculate the centroid (µ) and the covariance (Σ) of the calibration samples projected in the feature space:

$$\boldsymbol{\mu} = \left(\bar{t}_1, \ldots, \bar{t}_k\right)^{\top}, \qquad \bar{t}_j = \frac{1}{n}\sum_{i=1}^{n} t_{ij}, \tag{5}$$

$$\boldsymbol{\Sigma} = \left[\operatorname{cov}\left(\mathbf{t}_j, \mathbf{t}_l\right)\right]_{j,l = 1, \ldots, k}, \tag{6}$$

where k is the number of factors used to represent the matrix of spectra; n is the number of PTFE samples included in the Calibration 2011 data set; and $\mathbf{t}_j$ are the columns of the scores matrix T, with elements $t_{ij}$.
For each calibration sample, we calculate the squared Mahalanobis distance ($D_M^2$) between the sample itself and the centroid of the calibration set, taking into account the covariance matrix to normalize the distance according to the magnitude of dispersion in each dimension:

$$D_M^2 = \left(\mathbf{t} - \boldsymbol{\mu}\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\mathbf{t} - \boldsymbol{\mu}\right), \tag{7}$$

where $\mathbf{t}$ (without subscript) denotes the vector of the k scores of a single sample (a row of T).
We then project the absorbance spectra of the test set (Table 1) into the feature space, and calculate the $D_M^2$ between each test sample and the centroid (µ) of the calibration set.
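The distance calculation described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the scores are taken as already computed (one row per sample, one column per factor), the covariance uses the usual n − 1 denominator (the normalization is not specified in the text), and the function name and synthetic scores are hypothetical.

```python
import numpy as np

def squared_mahalanobis(scores, calib_scores):
    """Squared Mahalanobis distance of each row of `scores` to the calibration centroid."""
    mu = calib_scores.mean(axis=0)                    # centroid of the calibration scores
    sigma = np.cov(calib_scores, rowvar=False)        # sample covariance (n - 1 denominator, assumed)
    diff = scores - mu
    # (t - mu)^T Sigma^{-1} (t - mu) for every row, without forming the inverse explicitly
    return np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)

# Synthetic scores: 200 calibration samples in a 3-factor space, and 50 shifted "new" samples.
rng = np.random.default_rng(0)
calib_scores = rng.normal(size=(200, 3))
shifted_scores = rng.normal(size=(50, 3)) + 4.0
d2_calib = squared_mahalanobis(calib_scores, calib_scores)
d2_new = squared_mahalanobis(shifted_scores, calib_scores)
```

With this choice of covariance, the mean distance of the calibration samples to their own centroid equals k(n − 1)/n by construction, which is one reason distances in the calibration set are smaller than those of new samples.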
We present our evaluation for two cases. In the first case, we calculate the $D_M^2$ and errors for the calibration and test sets according to the sampling site, and calculate the mean $D_M^2$ and mean absolute error for each site. In the second case, we consider the $D_M^2$ and absolute error for each sample, without aggregation. We consider acceptable predictions to be the ones that have errors within the magnitude of those of the 2011 data set, and unacceptable predictions those for which the errors are greater. As boundaries for the discrimination, in the first case, we use the greatest mean $D_M^2$ plus 1 standard error (SE) and the greatest mean absolute error plus 1 SE found in the Test 2011 set. In the second case, we use the greatest $D_M^2$ and the greatest absolute error found in the Test 2011 set (except for one sample that we considered an outlier, Fig. S3 in the Supplement).
The samples collected in 2013 are divided into four classifications, which can be visualized by membership in one of four quadrants in a plot of absolute error vs. $D_M^2$ (Figs. 5 and 11). True negative (TN) samples (or sites) are those for which the $D_M^2$ (or mean $D_M^2$) and absolute error (or mean absolute error) fall in the third (bottom left) quadrant. For TN samples, the $D_M^2$ gives a reliable indication that the prediction error is within the magnitude of the errors of the 2011 samples. True positive (TP) samples (or sites) are those that lie in the first (upper right) quadrant; the $D_M^2$ provides a reliable indication that the prediction error is greater than the magnitude of the errors of the 2011 samples. False negative (FN) samples (or sites) lie in the second (upper left) quadrant; the $D_M^2$ does not indicate that the errors exceed those of the 2011 predictions. False positive (FP) samples (or sites) are those that lie in the fourth (bottom right) quadrant; sample spectra have significantly higher $D_M^2$ values but do not have increased prediction errors over the 2011 predictions.
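The four-quadrant classification described above can be sketched as a small function. The treatment of values that fall exactly on a boundary (counted here as acceptable) is an assumption, and the function names and example values are hypothetical.

```python
from collections import Counter

def classify(d2m, abs_error, d2m_bound, error_bound):
    """Assign a sample (or site) to the TN, TP, FN, or FP sector."""
    flagged = d2m > d2m_bound            # distance anticipates an unacceptable prediction
    high_error = abs_error > error_bound # prediction error actually exceeds the 2011 magnitude
    if flagged and high_error:
        return "TP"                      # upper right quadrant
    if flagged:
        return "FP"                      # bottom right quadrant
    if high_error:
        return "FN"                      # upper left quadrant
    return "TN"                          # bottom left quadrant

def sector_percentages(d2m_values, abs_errors, d2m_bound, error_bound):
    """Percentage of samples falling in each sector."""
    labels = [classify(d, e, d2m_bound, error_bound)
              for d, e in zip(d2m_values, abs_errors)]
    counts = Counter(labels)
    return {s: 100.0 * counts[s] / len(labels) for s in ("TN", "TP", "FN", "FP")}

# Hypothetical boundaries and four samples, one per quadrant.
shares = sector_percentages([1.0, 10.0, 1.0, 10.0], [0.1, 0.9, 0.9, 0.1],
                            d2m_bound=5.0, error_bound=0.5)
```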

Model evaluation
We evaluated the calibration models (trained with the Calibration 2011 data set, Sect. 2.3) on the three test data sets described in Sect. 2.1. The quality of each calibration is evaluated by calculating four performance metrics: bias, error, normalized error, and the coefficient of determination ($R^2$) of the linear regression fit of the predicted FT-IR OC and FT-IR EC to measured TOR OC and EC. FT-IR OC and EC are the OC and EC predicted from the FT-IR spectra and the PLSR calibration model. The bias is the median difference between measured (TOR) and predicted (FT-IR) values for the test set. The error is the median absolute difference. The normalized error for a single prediction is the absolute error divided by the TOR value. The median normalized error is reported to provide an aggregated estimate. The performance metrics are also calculated for the collocated TOR observations and compared to those of the FT-IR OC and FT-IR EC to TOR OC and TOR EC regression. The minimum detection limit (MDL) and precision of the FT-IR and TOR methods are calculated and compared. The MDL of the FT-IR method is 3 times the standard deviation of the laboratory blank samples in the test sets (Sect. 2.1). The MDL for the TOR method is 3 times the standard deviation of 514 blank samples (Desert Research Institute, 2012). Precision for both FT-IR and TOR is calculated using 14 parallel samples in the data set Test 2011 at the Phoenix site, 240 parallel samples in the data set Test 2013 at the Phoenix and Proctor Maple R. F. sites, and 115 parallel samples in the data set Test 2013 Addl at the Yosemite site. For the comparison with collocated TOR observations, we used 621 measurements collected in 2013 from seven IMPROVE sites with collocated TOR measurements (Everglades, Florida; Hercules Glade, Missouri; Hoover, California; Medicine Lake, Montana; Phoenix, Arizona; Saguaro West, Arizona; Seney, Michigan).
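The performance metrics and the MDL defined above can be sketched with the Python standard library. Several details are assumptions of this sketch rather than statements from the text: the sign convention of the bias (predicted minus measured), the use of the sample standard deviation for the MDL, and the function names and example values, which are hypothetical.

```python
from statistics import mean, median, stdev

def performance_metrics(tor, ftir):
    """Bias, error, normalized error (all medians), and R^2 of the linear fit."""
    diffs = [p - m for m, p in zip(tor, ftir)]        # sign convention assumed: FT-IR minus TOR
    bias = median(diffs)
    error = median(abs(d) for d in diffs)
    # TOR values must be nonzero for the normalized error to be defined
    normalized_error = median(abs(d) / m for d, m in zip(diffs, tor))
    # For a simple linear regression, R^2 equals the squared Pearson correlation
    mt, mp = mean(tor), mean(ftir)
    sxy = sum((m - mt) * (p - mp) for m, p in zip(tor, ftir))
    sxx = sum((m - mt) ** 2 for m in tor)
    syy = sum((p - mp) ** 2 for p in ftir)
    r2 = sxy ** 2 / (sxx * syy)
    return {"bias": bias, "error": error,
            "normalized_error": normalized_error, "r2": r2}

def minimum_detection_limit(blank_values):
    """MDL as 3 times the standard deviation of the blank predictions."""
    return 3 * stdev(blank_values)                    # sample standard deviation (assumed)

# Hypothetical values in µg m^-3
metrics = performance_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 3.1, 4.1])
mdl = minimum_detection_limit([0.0, 0.1, 0.2])
```

Medians are used for the aggregated metrics so that a few samples with large errors do not dominate the summary.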

Results and discussion
In this section, we first evaluate and discuss the performance of the models by comparing the predicted with the observed TOR OC (Sect. 3.1) and TOR EC (Sect. 3.3) measurements for ambient samples collected in the four data sets described in Sect. 2.1. Then, in Sects. 3.2 and 3.4 we describe the results of the prediction error anticipation for the FT-IR OC and FT-IR EC respectively.

Prediction of TOR OC from FT-IR spectra
The comparison between predicted FT-IR OC and measured TOR OC for the data sets described in Sect. 2.1 is shown in Fig. 2. The first row refers to the Calibration 2011 and Test 2011 data sets. In this work, the results obtained with the Test 2011 data set are used for comparison with the results obtained with the Test 2013 and Test 2013 Addl data sets. The detailed description and discussion of the Test 2011 results can be found in Dillner and Takahama (2015a).
In the case of predictions based on ambient samples collected at the same sites as the calibration data set but in a different year (Test 2013 data set, bottom left panel in Fig. 2), we observe that the performance metrics show good agreement between measured (TOR OC) and predicted OC values (FT-IR OC). Moreover, comparing these metrics with the ones obtained with the Test 2011 data set, we note that the $R^2$ is better for Test 2013, whereas the bias, error, and normalized error are slightly worse. However, these differences are not substantial, and we can conclude that the predictions made with the Test 2011 and Test 2013 data sets are similar. A t test at a confidence level of 95 % between the Test 2011 and Test 2013 predictions, and one between their absolute errors, confirmed this observation. From Fig. 3 (performance of the collocated TOR OC measurements), we observe that the metrics are similar to the ones obtained with the Test 2011 data set. The bias (0.09 µg m⁻³), error (0.12 µg m⁻³), and normalized error (18 %) of the Test 2013 data set are slightly worse than the ones obtained with the collocated TOR OC (0 µg m⁻³, 0.06 µg m⁻³, and 11 % respectively). However, these differences are not substantial as shown in Table 2, which compares the MDLs and precisions of FT-IR OC predictions and TOR OC measurements.
The bottom right panel in Fig. 2 shows the performance of the calibration model for samples collected at different sites and in a different year (Test 2013 Addl data set) from the samples used for calibration. The bias, error, and normalized error (0.06 µg m⁻³, 0.13 µg m⁻³, and 14 %) are in between the ones found for the Test 2011 and Test 2013 data sets. However, the scatter plot and the $R^2$ metric show worse agreement between measured TOR OC and predicted FT-IR OC values ($R^2 = 0.89$) than the ones obtained with the Test 2011 and Test 2013 data sets ($R^2 = 0.94$ and 0.97 respectively). The scatter plot also shows that the model tends to overestimate the OC values. A t test at a confidence level of 95 % between the Test 2013 and Test 2013 Addl predictions, and one between their absolute errors, shows a statistically significant difference between the results obtained with the two data sets. These results can be explained by the consideration that the sites where the samples of the Test 2013 Addl data set were collected may have different composition and loadings of OC than the ones used in the calibration data set. Therefore, the calibration model does not have enough information to model the different features of these sites. However, the MDL and precision are similar to the ones obtained with the collocated TOR data set, indicating that low concentrations are more likely to be precise than high concentrations. This could be explained by the fact that the majority of the ambient samples used for calibration were collected at rural sites (76 % from six sites), and the ones collected at the urban site (Phoenix) may have different features (composition and mass range) from the ones collected in the Test 2013 Addl data set, many of which are urban or highly polluted (Birmingham (Alabama), Fresno (California), Puget Sound (Seattle, Washington), and South Korea).
This observation is supported by the results in Fig. 4. In this case, the calibration and the test data sets are composed of the ambient samples collected at the rural sites of each data set. The performance metrics show close agreement between measured (TOR OC) and predicted OC values (FT-IR OC) for all the data sets. Moreover, it is worth noting that in the Test 2013 and Test 2013 Addl data sets there are samples that are above 200 µg (probably due to fire events). Even though these samples are outside the calibration range, the model is still able to predict them with reasonable accuracy (with a tendency towards overestimation). On the basis of these results, we can conclude that (i) it is possible to use PTFE ambient samples to calibrate a model that accurately predicts TOR OC values from PTFE ambient samples collected at the same sites (both rural and urban) as the ones used for the calibration, in the same or in different years; and (ii) it is possible to use PTFE ambient samples collected at rural sites to calibrate a model that accurately predicts TOR OC values from PTFE ambient samples collected at different rural sites and in different years from the ones used for the calibration.

Anticipation of the prediction error: FT-IR OC
As shown in Fig. 5, the $D_M^2$ values are smaller in the calibration set than in the test set because the latent variables comprising the feature space were defined using the former. Moreover, the $D_M^2$ values increase for the test samples that have different features from the ones used in calibration. Figure 5 shows that the Test 2013 Addl data set contains two sites (10 represents the South Korea site and 11 represents the Fresno site, Table 1) which have both mean $D_M^2$ and mean absolute errors above the boundaries used for discrimination of unacceptable prediction errors (TP sector). This diagnostic indicates that the two sites may contain different sources and chemical composition that are not well represented in the Calibration 2011 data set.
Figure 5. Mean $D_M^2$ against mean absolute error (between TOR OC and FT-IR OC). The measurements are aggregated per site. Each site is denoted with the site ID used in Table 1.
The scatter plot and the performance metrics of the Test 2013 Addl data set without the samples collected at the two sites anticipated (and confirmed) to have high errors are shown in Fig. 6. The $R^2$ metric notably improves from 0.89 to 0.96, and the remaining evaluation statistics are also improved. Moreover, a t test at a confidence level of 95 % between the predictions, and one between their absolute errors, shows a statistically significant difference between the results obtained with the two data sets.
The evaluation of predictions using a calibration model constructed from only the Korea and Fresno sites is shown in Fig. 7. The calibration set uses two-thirds of the ambient samples collected in 2013 at the two sites, and the remaining one-third is used for the test (we followed the same methodology that we used to prepare the Calibration 2011 data set). The results show that with the appropriate calibration samples, we can also achieve accurate predictions of TOR OC values (bias = −0.03 µg m⁻³, error = 0.16 µg m⁻³, normalized error = 10 %, and $R^2 = 0.96$) at these two sites. For comparison, Fig. S1 shows the evaluation of predictions at the Korean and Fresno sites using the Calibration 2011 data set (bias = 0.28 µg m⁻³, error = 0.43 µg m⁻³, normalized error = 25 %, and $R^2 = 0.79$). A t test at a confidence level of 95 % between the predictions, and one between their absolute errors, shows a statistically significant reduction in mean errors for predictions using the new calibration (Fig. 7) relative to those using the base case calibration (Fig. S1). Therefore, we can conclude that sites with samples that are on average dissimilar to those in the calibration benefit from the construction of a separate calibration model.
The results of the $D_M^2$ against the absolute error for each sample (without any aggregation) are reported in Table 3. In the Supplement, we show the plots of the $D_M^2$ against the absolute errors for each site (Figs. S2-S5). For each data set, Table 3 reports both the percentage of samples falling in the FN, FP, TN, and TP sectors and the performance metrics of the models that exclude unacceptable predictions (samples falling in the FP and TP sectors). The Test 2013 data set presents a low percentage (0.4 %) of erroneous classifications (samples falling in the FN and FP sectors), and the performance metrics are in line with what we found in Fig. 2.
The Test 2013 Addl data set contains 3.2 % of well-classified unacceptable predictions (TP). Most of these unacceptable predictions are due to samples collected at the Korean (1.9 %) and Fresno (0.9 %) sites (0.4 % from three other sites, Fig. S5). Moreover, the Test 2013 Addl data set contains 2.3 % of erroneous classifications falling in either the FP or FN sectors (1.1 % from the Korean samples, 0.4 % from the Fresno samples, and 0.8 % from seven other sites, Fig. S5). The $R^2$ metric improves from 0.89 to 0.92 with respect to the results found in Fig. 2. The bias, error, and normalized error are similar. However, 1.7 % of the samples are in the FN sector (and we consider them acceptable), which explains the lower prediction performance in comparison to the case in which all the Korean and Fresno samples are excluded (Fig. 6).
This analysis suggests that spectral signals projected into the feature space of a particular PLS calibration model contain useful information for anticipating the magnitude of prediction errors. The ability to anticipate the quality of predictions based on spectral features is relevant for strategic collection of calibration samples and selection of available samples from which a calibration model can be constructed. We expect that the capability for discrimination at the level of individual samples can be improved in future studies. The $D_M^2$ is a strong classifier in the case that the scores T are normally distributed in the multivariate (MV, in this case 47 dimensions) feature space. Indeed, the $D_M^2$ (Eq. 7) appears in the exponent of the MV normal (MVN) density and represents the distance between each point and the mean of the MVN distribution. Assessing the assumption of multivariate normality is a challenging process because different methods may provide different results under different assumptions and conditions (Mecklin and Mundfrom, 2005). Both the Henze-Zirkler (Henze and Zirkler, 1990) and Mardia MV normality tests (Mardia, 1970) on the calibration scores lead to a rejection of the null hypothesis (at 95 % confidence level) that the scores are normally distributed. It is plausible that spectral preprocessing methods could bring the scores closer to the MVN assumption, or that different similarity metrics that do not require a specific distribution of points in the feature space could be invoked. However, even though the assumption of MVN is not fulfilled, we report that the mean $D_M^2$ provides an indication of samples dissimilar to those comprising the calibration set when aggregated at the site level.
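The link between the squared Mahalanobis distance and the MVN distribution mentioned above can be made explicit. For a k-dimensional score vector $\mathbf{t}$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, the MVN density is

$$f(\mathbf{t}) = (2\pi)^{-k/2}\,\lvert\boldsymbol{\Sigma}\rvert^{-1/2}\exp\!\left(-\tfrac{1}{2}D_M^2\right), \qquad D_M^2 = (\mathbf{t}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{t}-\boldsymbol{\mu}),$$

so surfaces of constant $D_M^2$ are surfaces of constant probability density. As a standard result (stated here for known $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, and only approximate when they are estimated from the calibration set), $D_M^2$ then follows a chi-squared distribution with k degrees of freedom, whose mean is k. When the MVN assumption is rejected, as it is for the calibration scores here, such distributional thresholds should be read as rough guides only.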

Prediction of TOR EC from FT-IR spectra
In this section, we extend the analysis done in Sects. 3.1 and 3.2 to the case of TOR EC measurements. For this case, as described in Sect. 2.3, we use the hybrid model proposed by Dillner and Takahama (2015b).
The comparison between predicted FT-IR EC and measured TOR EC for the data sets described in Sect. 2.1 is shown in Fig. 8. The first row refers to the Calibration 2011 and Test 2011 data sets. In this work, the results obtained with the Test 2011 data set are used for comparison with the results obtained with the Test 2013 and Test 2013 Addl data sets. The detailed description and discussion of the results obtained with the Test 2011 data set can be found in Dillner and Takahama (2015b).
In the case of predictions based on ambient samples collected at the same sites as the calibration data set but in a different year (Test 2013 data set, bottom left panel in Fig. 8), we observe that the performance metrics show good agreement between measured (TOR EC) and predicted EC values (FT-IR EC). Moreover, the performance is similar to that of the predictions based on ambient samples collected at the same sites and in the same year as the calibration (Test 2011). A t test at a confidence level of 95 % between the Test 2011 and Test 2013 predictions, and one between their absolute errors, confirmed this observation. The results are similar to those for FT-IR OC (Sect. 3.1).
From Fig. 9 (performance of the collocated TOR EC measurements), we observe that the metrics are similar to the ones obtained with the Test 2011 and Test 2013 data sets. The normalized error is higher for FT-IR EC (21 and 24 % for Test 2011 and Test 2013 respectively) than for collocated TOR EC measurements (14 %). The comparison of the MDLs and precisions of FT-IR EC predictions and TOR EC measurements shows that the FT-IR EC MDL is very similar to that of TOR EC and that the precision is better than (less than half) that of TOR EC.
The bottom right panel in Fig. 8 shows the performance of the calibration model for samples collected at different sites and in a different year (Test 2013 Addl data set) from the samples used for calibration. We observe that the performance metrics show worse agreement between measured TOR EC and predicted EC values (FT-IR EC) than the ones obtained with the Test 2011 and Test 2013 data sets. A t test at a confidence level of 95 % between the Test 2013 and Test 2013 Addl predictions, and one between the absolute errors, shows a statistically significant difference between the results obtained with the two data sets.
For the case of OC, we found that the model based on rural sites only accurately predicts TOR OC values from PTFE ambient samples collected at different rural sites and in different years from the ones used for the calibration (Sect. 3.1). For the case of EC, the models trained only with the rural sites (Fig. 10) have lower $R^2$ (0.88, 0.90, and 0.87 respectively for the three test sets) than the ones that use the entire calibration set (0.96, 0.95, and 0.87, Fig. 8). However, looking at the plots in Fig. 10, we observe that the predicted FT-IR EC values do not differ substantially from the TOR EC values, and the worse performance metric is explained by the low concentrations measured at the rural sites: $R^2$ decreases (the total sum of squares tends to zero) when the measurements are close to zero.
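The sensitivity of $R^2$ to the range of the measurements can be illustrated with a small sketch. It assumes the form $R^2 = 1 - SS_{res}/SS_{tot}$, and the values are hypothetical: the same absolute errors are applied to a wide and to a narrow (near-zero) range of measured values.

```python
def r_squared(measured, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_measured = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Identical absolute errors in both cases (hypothetical, µg m^-3)
errors = [0.05, -0.05, 0.05, -0.05, 0.05, -0.05]

wide = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]        # broad concentration range
narrow = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]  # concentrations close to zero

r2_wide = r_squared(wide, [m + e for m, e in zip(wide, errors)])
r2_narrow = r_squared(narrow, [m + e for m, e in zip(narrow, errors)])
```

The residual sum of squares is the same in both cases, but the total sum of squares is 100 times smaller for the narrow range, so `r2_narrow` is much lower than `r2_wide` even though the predictions are equally close to the measurements in absolute terms.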

Anticipation of the prediction error: FT-IR EC
The aggregated (per site) mean D²_M against the mean absolute error for each data set is shown in Fig. 11. Similar to the results shown in Sect. 3.3, the Test 2011 (except for site 5) and Test 2013 sites present D²_M values similar to the ones found in the calibration data set (all the sites lie in the TN sector). Unlike the OC case (Sect. 3.2), the Birmingham site (number 8) is misclassified (it lies in the FN sector). Moreover, site 11 (Fresno) has a mean D²_M close to the boundary, but it lies in the TP sector. The Korean site (10) has both mean D²_M and mean absolute error above the boundaries used for discrimination between acceptable and unacceptable predictions (TP sector). Figure 12 shows the scatter plot and the performance metrics of the Test 2013 Addl data set without the samples collected at the two sites we have classified as not acceptable (Fresno and Korea). We note that the predictions are less spread and the R² metric improves from 0.87 to 0.91. The remaining metrics are almost identical (Fig. 8 for comparison). Moreover, a t test at a confidence level of 95 % between the predictions, and one between their absolute errors, shows a statistically significant difference between the results obtained with the two data sets.
The evaluation of predictions using a calibration model constructed from only the Korea and Fresno sites is shown in Fig. 13. The calibration set uses two-thirds of the ambient samples collected in 2013 at the two sites, and the remaining one-third is used for the test (we followed the same methodology that was used to prepare the Calibration 2011 data set, proposed by Dillner and Takahama, 2015a, b). The results show that, with the appropriate data set, accurate predictions of EC values can also be achieved in Fresno (R² = 0.93, bias = 0 µg m⁻³, error = 0.06 µg m⁻³, and normalized error = 11 %). The performance metrics at the Korean site (R² = 0.66, bias = −0.07 µg m⁻³, error = 0.11 µg m⁻³, and normalized error = 18 %) are not as good as at Fresno, mostly because of one sample. The predictions for all other samples are more accurate. By removing the one erroneous sample, R² increases from 0.66 to 0.84. For comparison, Fig. S7 shows the evaluation of predictions at the Korean and Fresno sites using the Calibration 2011 data set (R² = 0.85, bias = 0.05 µg m⁻³, error = 0.10 µg m⁻³, and normalized error = 22 % at Fresno; R² = 0.60, bias = 0.13 µg m⁻³, error = 0.17 µg m⁻³, and normalized error = 33 % at the Korean site). A t test at a confidence level of 95 % between the predictions, and one between their absolute errors, shows a statistically significant reduction in mean errors for predictions using the new calibration (Fig. 13) relative to those using the base case calibration (Fig. S7). Therefore, we conclude that sites with samples that are on average dissimilar to those in the calibration set benefit from the construction of a separate calibration model.
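The two-thirds/one-third partition used for the dedicated calibration can be sketched as follows. The exact selection procedure is the one described by Dillner and Takahama (2015a, b) and is not restated here; this sketch assumes, for illustration only, that samples are ordered by their TOR EC value and every third sample is held out for testing. The function name, sample IDs, and concentrations are hypothetical.

```python
def split_two_thirds(samples, key):
    """Order samples by `key`; keep two of every three for calibration
    and hold out the third for testing (assumed selection scheme)."""
    ordered = sorted(samples, key=key)
    calibration = [s for i, s in enumerate(ordered) if i % 3 != 2]
    test = [s for i, s in enumerate(ordered) if i % 3 == 2]
    return calibration, test


# Hypothetical (sample ID, TOR EC in ug m^-3) pairs.
samples = [
    ("s1", 0.42), ("s2", 0.10), ("s3", 0.95),
    ("s4", 0.31), ("s5", 0.67), ("s6", 0.18),
]

calibration, test = split_two_thirds(samples, key=lambda s: s[1])
print("calibration:", [sample_id for sample_id, _ in calibration])
print("test:       ", [sample_id for sample_id, _ in test])
```

Ordering before selecting spreads both subsets across the full concentration range, so the test set is not dominated by either low or high values.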
The results of the D²_M (and absolute error) for each sample (without any aggregation) are reported in Table 5, and the corresponding figures are shown in the Supplement (Figs. S8-S11). For each data set, Table 5 reports both the percentage of samples falling in the FN, FP, TN, and TP sectors and the performance metrics of the models that exclude unacceptable samples (those falling in the FP and TP sectors). The Test 2013 data set presents a low percentage (0.5 %) of erroneous classifications (samples falling in the FN and FP sectors), and the performance metrics for the test sets with unacceptable samples excluded are in line with those found when all samples are included (Fig. 8). The Test 2013 Addl data set contains 1 % of well-classified unacceptable predictions (TP). Most of these unacceptable predictions are due to samples collected at the Korean site (Korea, 0.7 %; Fresno, 0.2 %; Hoover, 0.1 %; Fig. S11). Moreover, the Test 2013 Addl data set presents 2.8 % of erroneous classifications (Korea, 1.2 %; Fresno, 0.7 %; four sites, 0.9 %; Fig. S11). The R² metric for the test sets with unacceptable samples excluded improves from 0.87 to 0.92 with respect to the results found when all samples are included (Fig. 8). The bias, error, and normalized error are similar.
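The per-sample classification summarized in Table 5 can be sketched as below. The reading of the sector names is inferred from the text: a sample is "positive" when its D²_M exceeds the distance boundary, and the classification is "true" when the absolute error agrees with that flag. The function names, boundary values, and sample values are hypothetical.

```python
from collections import Counter


def sector(d2m, abs_error, d2m_max, error_max):
    """Assign a sample to the TN, FN, FP, or TP sector."""
    flagged = d2m > d2m_max        # distance anticipates an unacceptable prediction
    bad = abs_error > error_max    # the prediction error is actually unacceptable
    if flagged:
        return "TP" if bad else "FP"
    return "FN" if bad else "TN"


def sector_percentages(samples, d2m_max, error_max):
    """Percentage of samples falling in each sector."""
    counts = Counter(sector(d, e, d2m_max, error_max) for d, e in samples)
    return {name: 100.0 * counts[name] / len(samples)
            for name in ("TN", "FN", "FP", "TP")}


# Hypothetical (D2M, absolute error in ug m^-3) pairs and boundaries.
samples = [(1.2, 0.03), (2.5, 0.08), (9.0, 0.40), (8.1, 0.05),
           (1.9, 0.35), (0.7, 0.02), (3.3, 0.10), (2.0, 0.04)]
D2M_MAX, ERROR_MAX = 6.0, 0.30

shares = sector_percentages(samples, D2M_MAX, ERROR_MAX)

# Samples in the FP and TP sectors are excluded before recomputing the metrics.
kept = [(d, e) for d, e in samples
        if sector(d, e, D2M_MAX, ERROR_MAX) in ("TN", "FN")]
print(shares, len(kept))
```

Only the distance is available for a new sample without a TOR measurement, which is why exclusion is based on the FP and TP sectors rather than on the error itself.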
We note that for this work the boundaries are chosen heuristically (maximum D²_M and absolute error found in the 2011 data set, Sect. 2.4), and they tend to classify the great majority of the samples (96.3 %) in the Test 2013 Addl data set as well predicted (TN). The choice of different boundaries, spectral preprocessing, or distance metric (as discussed for OC in Sect. 3.2) may lead to a less generous classification and an improved discrimination between acceptable and unacceptable predictions, with particular care taken to minimize FN classifications. However, as for OC, the analysis of EC suggests that spectral signals projected into the feature space of a particular PLS calibration model contain useful information for anticipating the magnitude of prediction errors. Using the D²_M, we still obtain useful results, as shown by the performance metrics in Fig. 12.
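The distance computation and the heuristic boundary can be sketched as follows. This is a simplified illustration, not the procedure of Sect. 2.4 in full: it assumes only two PLS components, so that the inverse of the score covariance matrix can be written in closed form, and it measures the distance to the centroid of the calibration scores. The function name and all score values are hypothetical.

```python
def squared_mahalanobis_2d(point, calibration_scores):
    """Squared Mahalanobis distance between a 2-D score vector and the
    centroid of the calibration scores (two-component simplification)."""
    n = len(calibration_scores)
    mx = sum(x for x, _ in calibration_scores) / n
    my = sum(y for _, y in calibration_scores) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] of the calibration scores.
    sxx = sum((x - mx) ** 2 for x, _ in calibration_scores) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in calibration_scores) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in calibration_scores) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # d^T S^-1 d, with the 2x2 inverse written out explicitly.
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


# Hypothetical PLS scores of the calibration samples.
calibration_scores = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]

# Heuristic boundary: the maximum distance found in the calibration set.
d2m_max = max(squared_mahalanobis_2d(s, calibration_scores)
              for s in calibration_scores)

# A new spectrum projected into the same score space (hypothetical scores).
d2m_new = squared_mahalanobis_2d((2.0, 0.0), calibration_scores)
flagged = d2m_new > d2m_max
print(d2m_max, d2m_new, flagged)
```

With more than two components the same quantity would be computed with a general matrix inverse. A different boundary, such as a percentile of the calibration distances, is one of the alternative choices mentioned above.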

Figure 2. Scatter plots and performance metrics between FT-IR OC and TOR OC for the four data sets described in Sect. 2.1. Concentration units of µg m⁻³ for bias and error are based on the IMPROVE nominal volume of 32.8 m³.

Figure 3. Scatter plot and performance metrics between the measurements of the collocated TOR OC. Concentration units of µg m⁻³ for bias and error are based on the IMPROVE nominal volume of 32.8 m³.
shows the aggregated (per site) mean D²_M against the mean absolute error for each data set. In accordance with the results shown in Sect. 3.1, the Test 2011 and Test 2013 sites present D²_M values similar to the ones found in the calibration data set (all sites lie in the TN sector). On the other hand, the Test 2013 Addl data

Figure 4. Scatter plots and performance metrics between FT-IR OC and TOR OC for the rural sites of the four data sets described in Sect. 2.1. Concentration units of µg m⁻³ for bias and error are based on the IMPROVE nominal volume of 32.8 m³.

Figure. Scatter plot and performance metrics between FT-IR OC and TOR OC for the Test 2013 Addl data set (Sect. 2.1) without the ambient samples collected at the Fresno and Korean sites. Concentration units of µg m⁻³ for bias and error are based on the IMPROVE nominal volume of 32.8 m³.

Figure 7. Scatter plot and performance metrics between FT-IR OC and TOR OC of the Korea and Fresno sites (site ID 10 and 11 respectively). A new calibration set, based on two-thirds of the ambient samples collected at these two sites in 2013, is used for a dedicated model. Concentration units of µg m⁻³ for bias and error are based on the IMPROVE nominal volume of 32.8 m³.

Figure 8. Scatter plots and performance metrics between FT-IR EC and TOR EC for the four data sets described in Sect. 2.1. Concentration units of µg m⁻³ for bias and error are based on the IMPROVE nominal volume of 32.8 m³.

Figure 9. Scatter plot and performance metrics between the measurements of the collocated TOR EC. Concentration units of µg m⁻³ for bias and error are based on the IMPROVE nominal volume of 32.8 m³.

Figure 10. Scatter plots and performance metrics between FT-IR EC and TOR EC for the rural sites of the four data sets described in Sect. 2.1. Concentration units of µg m⁻³ for bias and error are based on the IMPROVE nominal volume of 32.8 m³.

Figure 11. FT-IR EC. Anticipation of the prediction error. Mean squared Mahalanobis distance (between the scores of each data set described in Sect. 2.1 and the centroid of the calibration data set) against mean absolute error (between TOR EC and FT-IR EC). The measurements are aggregated per site. Each site is denoted with the site ID used in Table 1.

Figure 12. Scatter plot and performance metrics between FT-IR EC and TOR EC for the Test 2013 Addl data set (Sect. 2.1) without the ambient samples collected at the Fresno and Korean sites. Concentration units of µg m⁻³ for bias and error are based on the IMPROVE nominal volume of 32.8 m³.

Figure 1. IMPROVE network sites. Sites labeled on the map: Sac and Fox, KS; North Birmingham, AL; Bliss SP, CA; Fresno, CA; Great Smoky Mountains, TN; Hoover, CA; Mesa Verde, CO; Okefenokee, GA; Olympic, WA; Phoenix, AZ; Proctor Maple R.F., VT; Puget Sound, WA; Cape Romain, SC; St. Marks, FL; Tallgrass, KS; Yosemite, CA; Trapper Creek, AK; South Korea. Legend categories: Calibration 2011, Test 2011; Calibration 2011, Test 2011, Test 2013; Test 2013 Addl.

Table 1. IMPROVE network sites and number of samples in each data set described in Sect. 2.1.
Table 2 shows that both the MDL and precision of the Test 2013 data set are in the same range as those of TOR OC and slightly better than the ones obtained with the Test 2011 data set. It is also interesting to note that in the Test 2013 data set there are two samples notably above the maximum values used in the calibration data set, and the model is still able to predict them accurately. This observation agrees with the results found by Dillner and Takahama (2015a) (non-uniform A case), in which the authors used samples with TOR OC in the lowest two-thirds of the TOR OC range to predict samples with TOR OC in the highest one-third of the range, and the highest one-third of samples were well predicted.

Table 2. MDL and precision for FT-IR OC and TOR OC. Concentration units of µg m⁻³ for MDL and precision are based on the IMPROVE volume of 32.8 m³.
a b Not reported.

Table 3. FT-IR OC, anticipation of the prediction error per sample. Percentage of ambient samples falling in the FN (false negative), FP (false positive), TN (true negative), and TP (true positive) sectors. Performance metrics of the models that exclude predictions of samples that fall in the FP and TP sectors. Concentration units of µg m⁻³ for bias and error are based on the IMPROVE volume of 32.8 m³.

Table 4. MDL and precision for FT-IR EC and TOR EC. Concentration units of µg m⁻³ for MDL and precision are based on the IMPROVE volume of 32.8 m³.
b Value reported for network (0.44 µg) in concentration units.
c Not reported.

Table 5. FT-IR EC, anticipation of the prediction error per sample. Percentage of ambient samples falling in the FN (false negative), FP (false positive), TN (true negative), and TP (true positive) sectors. Performance metrics of the models that exclude predictions of samples that fall in the FP and TP sectors. Concentration units of µg m⁻³ for bias and error are based on the IMPROVE volume of 32.8 m³.