Closing the gap on lower cost air quality monitoring: machine learning calibration models to improve low-cost sensor performance

Low-cost sensing strategies hold the promise of denser air quality monitoring networks, which could significantly improve our understanding of personal air pollution exposure. Additionally, low-cost air quality sensors could be deployed to areas where limited monitoring exists. However, low-cost sensors are frequently sensitive to environmental conditions and pollutant cross-sensitivities, which have historically been poorly addressed by laboratory calibrations, limiting their utility for monitoring. In this study, we investigated different calibration models for the Real-time Affordable Multi-Pollutant (RAMP) sensor package, which measures CO, NO2, O3, and CO2. We explored three methods: 1) laboratory univariate linear regression, 2) empirical multivariate linear regression and 3) machine-learning based calibration models using random forests (RF). Calibration models were developed for 19 RAMP monitors using training and testing windows spanning August 2016 through February 2017 in Pittsburgh, PA. The random forest models matched (CO) or significantly outperformed (NO2, CO2, O3) the other calibration models, and their accuracy and precision were robust over time for testing windows of up to 16 weeks. Following calibration, average mean absolute error on the testing dataset from the random forest models was 38 ppb for CO (14% relative error), 10 ppm for CO2 (2% relative error), 3.5 ppb for NO2 (29% relative error) and 3.4 ppb for O3 (15% relative error), and Pearson r versus the reference monitors exceeded 0.8 for most units. Model performance is explored in detail, including a quantification of model variable importance, accuracy across different concentration ranges, and performance in a range of monitoring contexts including the National Ambient Air Quality Standards (NAAQS), and the US EPA Air Sensors Guidebook recommendations of minimum data quality for personal exposure measurement.
A key strength of the RF approach is that it accounts for pollutant cross-sensitivities. This highlights the importance of developing multipollutant sensor packages (as opposed to single pollutant monitors); we determined this is especially critical for NO2 and CO2. The evaluation reveals that only the RF-calibrated sensors meet the US EPA Air Sensors Guidebook recommendations of minimum data quality for personal exposure measurement. We also demonstrate that the RF model calibrated sensors could detect differences in NO2 concentrations between a near-road site and a suburban site less than 1.5 km away. From this study, we conclude that combining RF models with the RAMP monitors appears to be a very promising approach to address the poor performance that has plagued low-cost air quality sensors.

Atmos. Meas. Tech. Discuss., https://doi.org/10.5194/amt-2017-260. Manuscript under review for journal Atmos. Meas. Tech. Discussion started: 9 August 2017. © Author(s) 2017. CC BY 4.0 License.

Introduction
Historically, spatial coverage of air quality monitoring stations has been limited by the high cost of instrumentation; urban areas typically rely on a few reference-grade monitors to assess population scale exposure. However, air pollutant concentrations often exhibit significant spatial variability depending on local sources and features of the built environment (Marshall et al., 2008; Nazelle et al., 2009; Pugh et al., 2012; Tan et al., 2014), which may not be well captured by the existing monitoring networks. In the past several years, there has been a significant increase in the development and applications of low-cost sensor-based air quality monitoring technology (Lewis and Edwards, 2016; McKercher et al., 2017; Moltchanov et al., 2015; Snyder et al., 2013). The use of low-cost air quality sensors for monitoring ambient air pollution could enable much denser air quality monitoring networks at a comparable cost to the existing regime. Increasing the spatial density of air quality monitoring would help quantify and characterize exposure gradients within urban areas and support better epidemiological models. Additionally, more highly resolved air quality information can assist regulators with future policy planning, with identification of hot spots or potential areas of concern (e.g., fracking in rural areas) where more detailed characterization is needed, and with risk mitigation for noncompliant zones. Furthermore, low-cost air quality sensors are generally characterized by their compact size and low power demand. These features enable low-cost sensors to be moved with relative ease to rural areas or developing regions where limited monitoring exists.

A primary challenge with low-cost air quality sensors is calibration at typical ambient pollutant concentrations and environmental conditions. These sensors are prone to cross-sensitivities with other ambient pollutants (Bart et al., 2014; Cross et al., 2017; Masson et al., 2015b; Mead et al., 2013).
The most common example is for ozone electrochemical sensors, which also undergo redox reactions in the presence of NO2. Additionally, NO has been observed to interfere with NO2 sensors, and CO sensors have exhibited some cross-sensitivity to molecular hydrogen in urban environments (Mead et al., 2013). Furthermore, low-cost sensors can be affected by meteorology (Levy, 2014; Masson et al., 2015b; Pang et al., 2017; Williams et al., 2013).
Most electrochemical sensors are configured such that the reactions are diffusion-limited, and the diffusion coefficient can be affected by temperature (Hitchman et al., 1997); Masson et al. (2015b) have shown that at relative humidity exceeding 75% there is significant error, possibly due to condensation on potentiostat electronics. Lastly, the stability of low-cost sensors is known to degrade over time (Jiao et al., 2016; Masson et al., 2015a). For example, in electrochemical cells, the reagents are consumed over time and have a typical lifetime of 1-2 years.
Deconvolving the effects of cross-sensitivity and stability on sensor performance is complex. Linear calibration models developed in the laboratory perform poorly on ambient data. Attempts to build calibration models from first principles have shown some success, but the models are difficult to construct and their transferability to new environments remains unknown (Masson et al., 2015b). Accurate and precise calibration models are particularly critical to the success of dense sensor networks deployed in urban areas of developed countries where concentrations are on the low end of the spectrum of global pollutant concentrations, as poor signal-to-noise ratios may hamper their ability to distinguish between intra-urban sites. As such, there has been increasing interest in more sophisticated algorithms (e.g., machine learning) for low-cost sensor calibration. To date, there have been published studies using high-dimensional multi-response models (Cross et al., 2017) and neural networks (Esposito et al., 2016; Spinelle et al., 2015). Spinelle et al. (2015) showed that artificial neural network calibration models could meet European data quality objectives for measuring ozone (uncertainty < 18 ppb); however, meeting these objectives for NO2 remained a challenge. Cross et al. (2017) built high-dimensional multi-response calibration models for CO, NO, NO2 and O3 which had good agreement with reference monitors (slopes 0.6-0.96, R² 0.51-0.96). Esposito et al. (2016) demonstrated excellent performance with dynamic neural network calibrations of NO2 sensors (mean absolute error < 2 ppb); however, the same performance for O3 was not observed. Furthermore, these calibrations have only been tested on a small number of sensor packages. For example, Cross et al. (2017) tested two sensor packages, each containing one sensor per pollutant, over a four-month period, of which 35% was used as training data. Spinelle et al. (2015) tested a cluster of sensors in a single enclosure, testing 22 individual sensors in total over a period of 5 months, of which 15% was used as training data. Esposito et al. (2016) reported calibration performance on a single sensor package (5 gas sensors per package for measuring NO, NO2 and O3) and the model was tested on four weeks of data.

In this study, we aim to improve the calibration strategies of low-cost sensors using a random-forest-based machine learning algorithm, which, to our knowledge, has not been previously applied to low-cost air quality monitor calibrations. To ensure robustness, the calibration models were developed and validated for 19 sensor packages, with each package containing one sensor per species (CO, CO2, NO2, SO2 and O3) for a total of 95 individual sensors. Furthermore, the model training and testing was conducted over a six-month period (August 2016 to February 2017) spanning multiple seasons and a wide range of meteorological conditions, providing one of the most comprehensive low-cost air quality sensor calibration investigations to date. The fitting of the machine learning algorithms is discussed in detail to determine ideal calibration datasets to maximize performance and minimize overtraining. The performance of the random forest models is compared to traditional laboratory univariate linear models, multiple linear regression models, and EPA performance guidelines. The performance of a given model over time is also discussed.

Experimental methods

Measurement site
Measurements were made from August 3, 2016 to February 7, 2017 on the Carnegie Mellon University campus in the Oakland neighbourhood of Pittsburgh, PA. The measurement site (40°26'31.5"N, 79°56'33"W) is located within a small (< 100 vehicles), limited access, open air parking lot near the center of campus. It consisted of a mobile laboratory equipped with reference-grade instrumentation (Section 2.3) and adjacent lawn space where the RAMP monitors were mounted on tripods (Section 2.2). The dominant local source at the site is vehicle emissions when vehicles enter and exit the parking lot during the morning and evening rush hours. The small size of the parking lot (< 100 cars) and few other local sources mean that for most of the day the location is essentially an urban background site. During the measurement period, the site mean (range) ambient temperature and relative humidity were 13 °C (-15 to 34 °C) and 71% (27 to 98%), respectively.

Real-time Affordable Multi-Pollutant (RAMP) monitor
The study uses the Real-time Affordable Multi-Pollutant (RAMP) monitor, which was developed in a collaboration between Carnegie Mellon University and SenSevere. The RAMP monitor incorporates widely-used Alphasense electrochemical sensors to measure gaseous pollutants (CO, NO2, SO2, O3) and a non-dispersive infrared (NDIR) sensor to measure CO2. The latter sensor also includes modules to measure temperature and relative humidity. The RAMP is paired with a Met-One Neighborhood PM monitor to measure optical PM2.5. The RAMP uses the following commercially-available electrochemical sensors from Alphasense Ltd: carbon monoxide (CO, Alphasense ID: CO-B41), nitrogen dioxide (NO2, Alphasense ID: NO2-B43F), sulfur dioxide (SO2, Alphasense ID: SO2-B4), and total oxidants (Ox, Alphasense ID: Ox-B431). The unit also includes a non-dispersive infrared (NDIR) CO2 sensor (SST CO2S-A) which contains built-in T (method: bandgap) and RH (method: capacitive) measurement. The experiments involved 95 individual pollutant sensors mounted in 19 unique RAMP monitors.
The electrochemical sensor outputs were measured using electronic circuitry custom designed by SenSevere and optimized for signal stability. The circuitry includes custom electronics to drive the device, multiple stages of filtering circuitry for specific noise signatures, and an analog-to-digital converter for measurement of the conditioned signal. The RAMP monitors are housed in a NEMA-rated weatherproof enclosure (Figure 1A) and equipped with GSM cards to transmit data using cellular networks to an online server. The RAMP monitors also log data to an SD card as a fail-safe in case of wireless data transfer issues. The sensors sample passively from the bottom of the unit (Figure 1B), with screens installed to protect the sensors. If operated with the PM2.5 monitor, the RAMP monitors require 120-240 V AC power; however, roughly 3 weeks of measurements of gaseous species, T, and RH are possible on a single charge of a built-in 30 amp-hour NiMH battery. The RAMP monitors are either mounted to a steel plate for easy pole mounting or are deployed on tripods approximately 1.5 m above the ground (Figure 1C). In this study, all the RAMP monitors were tripod-mounted at a consistent height.

In their simplest configuration, electrochemical sensors function based on a redox reaction within an electrochemical cell in which the target analyte oxidizes the anode and the cathode is proportionally reduced (or vice versa, depending on target analyte). The subsequent movement of charge between the electrodes produces a current which is proportional to the analyte reaction rate, which can be used to determine the analyte concentration. The Alphasense electrochemical sensors utilize a more complex configuration by using four electrodes (working, reference, counter and auxiliary) to account for zero current changes. Essentially, the auxiliary electrode, which is not exposed to the target analyte, accounts for changes in the sensor baseline signal under different meteorological conditions. Additional details on the theory of operation for electrochemical sensors can be found in Mead et al. (2013).
The RAMP monitors log two output signals from each of the Alphasense sensors: one from the auxiliary electrode and the other from the working electrode. The net sensor response is determined by subtracting the auxiliary electrode signal from that of the working electrode. In theory, for a target analyte a linear relationship should exist between the net sensor signal for that analyte and ambient analyte concentrations, and this expectation forms the basis of univariate linear regression models built from laboratory calibrations. However, as noted in the introduction, even with an auxiliary electrode, electrochemical sensors may insufficiently account for the impacts of temperature (which affects the rate of diffusion) and relative humidity under high humidity conditions where condensation is possible. This has motivated researchers to construct multivariate linear regression models (MLR) to account for these temperature and humidity effects (Jiao et al., 2016). While these calibration models typically improve performance relative to univariate linear models (Spinelle et al., 2015), they typically do not incorporate any cross-sensitivities to other pollutants or any non-linearities in the response. In this study, we attempt to build a calibration model for each analyte with no underlying assumptions regarding the calibration model structure and allow the models to consider directly the full suite of data being reported by the RAMP monitors using a machine learning approach.

Reference instrumentation
Reference measurements were made on ambient air continuously drawn through an inlet on the roof of the supersite located approximately 2.5 m above ground. Gaseous pollutants were drawn through approximately 4 m of 0.953 cm outer diameter Teflon fluorinated ethylene propylene (FEP) tubing with a six-port stainless steel manifold for flow distribution to the gas analyzers. Measurements were made using direct absorbance at 405 nm for NO2 (2B Technologies Model 405 nm), a gas filter correlation infrared analyzer for CO (Teledyne T300U), a non-dispersive infrared analyzer for CO2 (LICOR 820), UV absorption for O3 (Teledyne T400 Photometric Ozone Analyzer) and UV fluorescence for SO2 (Teledyne T100A UV Fluorescence SO2 Analyzer). The time resolution for all reference measurements was 1 s.
The reference gas analyzers were checked and calibrated weekly using calibration gas mixtures, except for O3, which was calibrated biannually at a nearby regulatory monitoring site. The CO and NO2 analyzers experienced modest baseline drift between weekly calibrations, on the order of approximately 40 ppb for CO and 2 ppb for NO2. Hence, baseline pollutant concentrations were normalized to a nearby regulatory monitoring site (Allegheny County Health Department, Air Quality Division, Pittsburgh, PA). The gas analyzers at the regulatory monitoring site are checked daily and thus this normalization helped correct for any baseline drift during the days between calibrations. No significant drift was observed for CO2 or O3.

Calibration methods
Three calibration methods were evaluated: (1) a laboratory-based univariate linear regression based on net sensor response when exposed to calibration gases, (2) an empirical multivariate linear regression of net sensor response, T and RH regressed against reference monitor concentrations, and (3) a random forest machine learning model using net responses from all sensors, T, and RH to predict reference monitor concentrations. Calibration models were constructed for the CO, NO2, CO2 and O3 sensors in each RAMP monitor. In this study, no calibration models were built for SO2 because SO2 concentrations measured with the reference instrumentation were below the instrument detection limit (< 0.4 ppbv) for most of the campaign (no nearby sources of SO2). While lab calibrations were conducted for the SO2 sensors, these data will be the subject of a future publication on air quality in industrial areas where SO2 is more commonly detected.

Laboratory-based univariate linear regression (LAB)
Prior to outdoor collocation, the sensors inside the RAMP monitors were calibrated in a laboratory environment using a custom manufactured sensor bed and calibration gas mixtures. The sensors were exposed to each step in the calibration window (Table 1) for 20 minutes, with the gas flowing perpendicular to the sensor surface at a flow rate of 9 LPM. The sensor response at each calibration step was averaged once the signal had stabilized. Temperature and relative humidity were not controlled during the calibration. The temperature was at levels typical of indoor laboratory environments (approx. 20 °C), and the dry calibration gas provided very little humidity (RH < 5%). Calibrations were built for CO, NO2 and CO2. Laboratory calibrations for O3 were not performed.

The laboratory calibration follows a standard univariate linear regression model, regressing the net (CO, NO2) or raw (CO2) signal against the reference gas concentration (Eq. 1).
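As an illustration of the univariate approach, the following minimal Python sketch fits a straight line of net signal against reference concentration and then inverts it to estimate a concentration from a new signal. The exact form of Eq. 1 is not reproduced here, so the parameterization (`b0`, `b1`), the function names, and all numerical values are assumptions chosen for illustration; they are not the calibration steps of Table 1 or measured RAMP data.

```python
from statistics import mean


def fit_univariate(concentration, signal):
    """Least-squares fit of signal = b0 + b1 * concentration."""
    c_bar, s_bar = mean(concentration), mean(signal)
    sxx = sum((c - c_bar) ** 2 for c in concentration)
    sxy = sum((c - c_bar) * (s - s_bar) for c, s in zip(concentration, signal))
    b1 = sxy / sxx
    b0 = s_bar - b1 * c_bar
    return b0, b1


def signal_to_concentration(signal, b0, b1):
    """Invert the fitted line to convert a sensor signal to a concentration."""
    return (signal - b0) / b1


# Hypothetical calibration steps (ppb) and electrode signals (arbitrary units).
step_ppb = [0.0, 250.0, 500.0, 1000.0]
working = [120.0, 183.5, 244.0, 370.0]
auxiliary = [100.0, 101.0, 99.0, 100.0]

# Net response: working electrode signal minus auxiliary electrode signal.
net = [w - a for w, a in zip(working, auxiliary)]

b0, b1 = fit_univariate(step_ppb, net)
estimate_ppb = signal_to_concentration(145.0, b0, b1)
```

Because the synthetic net signals lie exactly on a line, the fit recovers the intercept and slope used to generate them; real laboratory data would scatter around the fitted line.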

Empirical multivariate linear regression (MLR)
Following laboratory calibration, the individual sensors were mounted in the RAMP monitors and deployed adjacent to the Carnegie Mellon University supersite. The collocation period varied by RAMP, with a minimum collocation period of 6 weeks and a maximum collocation period of the entire 6-month study period. The collocation window varied due to intermittent deployment of some RAMP monitors for ongoing air quality monitoring campaigns in the Pittsburgh area. To build calibration models, the collocation period was separated into a training and testing period identical to that used for the random forest calibration (see Section 3.3). Due to the previously established influence of T and RH on sensor response (Jiao et al., 2016; Masson et al., 2015b; Spinelle et al., 2015), a multiple linear regression (MLR) model was used to calibrate the output from each sensor using net sensor response to the target analyte (e.g. CO for the CO sensor), T and RH as explanatory variables (Eq. 2), similar to the approach described in a recent European Union report on protocols for evaluating and calibrating low-cost sensors (Spinelle et al., 2013).

The training data were used to calculate the model coefficients (β0 through β4) and the model performance was evaluated on withheld testing data. Separate multivariate linear regression models were developed for each sensor (95 individual models). We refer to these models as MLR.
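The multiple linear regression step can be sketched in a few lines of Python. This sketch assumes a model with an intercept and three explanatory variables (net signal, T and RH); Eq. 2 is not reproduced in this text and may contain additional terms. The helper names and the synthetic data are hypothetical and serve only to show the mechanics of solving the least-squares normal equations.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_mlr(rows, y):
    """Least-squares coefficients [intercept, b_net, b_T, b_RH]."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict_mlr(beta, row):
    """Predicted concentration for one (net signal, T, RH) observation."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


# Synthetic training rows: (net signal in a.u., T in deg C, RH in %).
rows = [(100.0, 10.0, 40.0), (150.0, 25.0, 60.0), (120.0, 5.0, 80.0),
        (200.0, 30.0, 30.0), (90.0, 15.0, 70.0), (170.0, 20.0, 50.0)]
true_beta = [50.0, 2.0, -1.5, 0.5]  # hypothetical coefficients
reference = [predict_mlr(true_beta, r) for r in rows]

beta = fit_mlr(rows, reference)
```

In practice the coefficients would be fitted on the training period and `predict_mlr` applied to the withheld testing period.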

Random forest model (RF)
A random forest (RF) model is a machine learning algorithm for solving regression or classification problems (Breiman, 2001). It works by constructing an ensemble of decision trees using a training data set; the mean value from that ensemble of decision trees is then used to predict the value for new input data. Briefly, to develop a random forest model, the user specifies the maximum number of trees that make up the forest, and each tree is constructed using a bootstrapped random sample from the training data set. The origin node of the decision tree is split into sub-nodes by considering a random subset of the possible explanatory variables. The training algorithm splits the tree based on which variable in the random subset is the strongest predictor of the response. The number of random explanatory variables considered at each node (denoted mtry) is tuned by the user. This process of node splitting is repeated until a terminal node is reached; the user can specify the maximum number of sub-nodes or the minimum number of data points in the node as the indication to terminate the tree. For our random forest models, the terminal node was specified using a minimum node size of 5 data points per node.
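The node-splitting step described above can be illustrated with a short Python sketch. It assumes the common regression-tree criterion of minimizing the summed squared error of the two child nodes, and it treats the minimum node size of 5 as a constraint on each child; both are assumptions about the details, not a description of the implementation used in this study. Variable names and data are synthetic.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, y, candidate_vars, min_node_size=5):
    """Return (score, variable, threshold) of the best split, or None."""
    best = None
    for var in candidate_vars:
        xs = sorted(set(r[var] for r in rows))
        for lo, hi in zip(xs, xs[1:]):
            threshold = (lo + hi) / 2
            left = [yi for r, yi in zip(rows, y) if r[var] <= threshold]
            right = [yi for r, yi in zip(rows, y) if r[var] > threshold]
            if len(left) < min_node_size or len(right) < min_node_size:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, var, threshold)
    return best


# Synthetic data: reference CO steps up when the net CO signal exceeds 150.
co_net = [100.0 + 10.0 * i for i in range(12)]
temp = [10.0, 20.0, 15.0, 25.0, 12.0, 22.0, 11.0, 21.0, 16.0, 26.0, 13.0, 23.0]
rows = [{"co_net": c, "temp": t} for c, t in zip(co_net, temp)]
reference_co = [200.0 if c <= 150.0 else 400.0 for c in co_net]

# A forest would draw mtry variables at random for each node.
rng = random.Random(0)
mtry = 1
random_candidates = rng.sample(sorted(rows[0]), mtry)

# Here both variables are offered so the outcome is deterministic.
score, variable, threshold = best_split(rows, reference_co, ["co_net", "temp"])
```

With both variables offered, the net CO signal is selected because splitting on it separates the two concentration levels perfectly.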

To illustrate the method, consider building a random forest model for one RAMP monitor using a single decision tree and a subset of 100 training data points to build a CO calibration model (Figure 2). In this highly simplified example, at the first node, the net CO sensor signal is the strongest predictor of the CO reference monitor concentration, with a natural split in the data at a net CO sensor voltage of 255.9 a.u. If the net CO sensor voltage exceeds 255.9, a cluster of 7 data points from the training data predicts an average CO concentration of 357 ppb; if the net CO sensor voltage is ≤255.9, the data go to the next decision node, in which the net CO sensor signal is again the strongest predictor of the CO reference monitor concentration, with a natural break in the data at a net CO sensor voltage of 167.3 a.u. The splitting proceeds until all the training data are assigned to a terminal node. The prediction value for each terminal node is the average reference monitor concentration of training points assigned to that node. To apply the algorithm (i.e. predict the CO concentration from a set of measured inputs), the user takes the measured T and the net CO, NO2 and O3 signals and follows the path through the tree to the appropriate terminal node. The predicted CO concentration for that tree is then the average training value associated with that terminal node. This process is then repeated through multiple trees (Figure 2 shows only one simple tree) and the predictions from each tree are averaged to determine the final output from the entire random forest model. In this simple example, there are only six possible CO concentrations the random forest model will output. In practice, each tree has hundreds of terminal nodes and the forest typically comprises hundreds of trees, which means that there are thousands of possible answers. The model prediction for a given set of inputs is the average prediction across all the hundreds of trees that comprise the forest.

The random forest model's main limitation is that its ability to predict new outcomes is limited to the range of the training data set; in other words, it will not predict data with variable parameters outside the training range. Therefore, a larger and more variable training data set should create a better final model. To maximize utilization of the training data set and avoid missing any spikes during the training window, a k-fold cross-validation approach was used. A k-fold cross-validation divides the data into k equal sized groups (where k is specified by the user) and k repeats are used to tune the model. Consider an example where k is equal to 5 (a 5-fold cross-validated random forest model). With a 5-fold validation, five unique random forest models are constructed, one for each fold. In building the first random forest, the first 20% (1/k) of the data will be the testing data, and the remaining 80% [(k-1)/k] of the data will be used as training. In building the second random forest, the next 20% of the data will be used as test data, and the first 20% and remaining 60% will be used to train. This is repeated until the data are fully covered, at which point the random forest model is created by combining the five (k) individual models into one large random forest model. This helps to minimize bias in training data selection when predicting new data, and ensures that every point in the training window is used to build the model.
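The fold construction described above can be sketched as follows. The sketch assumes contiguous folds taken in time order, as in the 5-fold example, and it shows the final averaging over trees as a plain mean; the function names are hypothetical and no model is actually trained here.

```python
def contiguous_folds(n_points, k):
    """Return k (train_indices, test_indices) pairs using contiguous blocks."""
    bounds = [i * n_points // k for i in range(k + 1)]
    folds = []
    for i in range(k):
        test = list(range(bounds[i], bounds[i + 1]))
        train = list(range(0, bounds[i])) + list(range(bounds[i + 1], n_points))
        folds.append((train, test))
    return folds


def average_prediction(tree_predictions):
    """Forest output: the mean of the predictions from all trees."""
    return sum(tree_predictions) / len(tree_predictions)


# Twenty training points split into five folds of four points each.
folds = contiguous_folds(20, 5)
first_train, first_test = folds[0]
second_train, second_test = folds[1]

# Each point is withheld exactly once across the five folds.
withheld = sorted(i for _, test in folds for i in test)
```

For the second fold, the training indices are the first 20% plus the last 60% of the points, matching the description in the text.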
In this study, reference gas data, RAMP net sensor data for CO, NO2, SO2, O3, and RAMP raw sensor data for CO2, T, and RH were collected at 15 second resolution, time-matched, and down-averaged to 15 min intervals (IGOR Pro v6.34), which is a higher temporal resolution than the 1 h intervals at which typical regulatory monitoring information is reported. The down-sampled data were then imported into R (ver. 3.3.3, "Another Canoe") for random forest model building. R is an open-source statistical computing environment that supports tuning and cross-validating many classes of statistical models, including random forest models. The cross-validated random forest models were compiled using the open-source "caret" package (Kuhn, 2017). The model considered all RAMP data (net voltage outputs from the five gas sensors plus T and RH, 7 possible variables total) as potential explanatory variables to predict the reference monitor gas concentration. The number of trees was capped at 100 per fold, and a five-fold cross-validation was used for a total of 500 trees. Therefore, the predicted value for a given set of measured inputs is the average value from this set of 500 trees (each tree provides one prediction). When fitting the random forest models with the training data, the main tuning parameter is the number of explanatory variables to consider at each decision node (mtry). To determine the optimal mtry, the root mean square error (RMSE, equation in Supplemental Information) and the coefficient of determination (R²) were calculated on the withheld folds of the training data (Figure 3, step 2) for mtry equal to 2, 4 or 7 to span the complete variable range. The size of the random subset of explanatory variables considered at each node was set to the value of mtry that minimized RMSE. The cross-validation and the tuning of mtry were carried out using the caret package in R (Kuhn, 2017).
Following random forest model generation and tuning, the five 100-tree models were combined to create a final model with 500 trees. This process was repeated for each sensor to create 95 separate random forest models. The final models convert the RAMP output signals into calibrated concentrations. The model conversion was done within R, where each model exists as a standalone object compatible with the standard R configuration.
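The time-matching and down-averaging of 15 s data to 15 min intervals mentioned in this section can be sketched in Python as shown here. This is not the IGOR Pro procedure used in the study; it assumes timestamps in seconds, simple arithmetic means within fixed 900 s windows, and matching of sensor and reference series on shared window start times. All names and values are hypothetical.

```python
from collections import defaultdict


def down_average(timestamps_s, values, window_s=900):
    """Average values within fixed windows; keys are window start times (s)."""
    sums = defaultdict(lambda: [0.0, 0])
    for t, v in zip(timestamps_s, values):
        start = (t // window_s) * window_s
        sums[start][0] += v
        sums[start][1] += 1
    return {start: total / count for start, (total, count) in sorted(sums.items())}


def time_match(sensor_avg, reference_avg):
    """Keep only windows present in both series, as (sensor, reference) pairs."""
    return {start: (sensor_avg[start], reference_avg[start])
            for start in sensor_avg if start in reference_avg}


# Thirty minutes of synthetic 15 s samples: 120 points in two windows.
timestamps = list(range(0, 1800, 15))
sensor_signal = [1.0] * 60 + [3.0] * 60
reference_ppb = [100.0] * 60 + [300.0] * 60

sensor_15min = down_average(timestamps, sensor_signal)
reference_15min = down_average(timestamps, reference_ppb)
matched = time_match(sensor_15min, reference_15min)
```

Each 15 min window here contains 60 samples, so four weeks of continuous data would yield 2688 averaged points per variable.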
Data from three RAMP monitors (15 individual gas sensors) were used to investigate the optimal training period, which was determined by comparing the training data size to mean absolute error (MAE, the average of the absolute value of the residuals). The optimal training period was the period beyond which increases in the length of the training window (and therefore size of the training dataset) no longer resulted in significant reductions in the MAE. The initial training window evaluated was 1 week, and 1 week increments in training period duration were considered until MAE was minimized. The optimal collocation window was determined to be 4 weeks (or 2688 data points at 15-minute resolution). This was evaluated for a consecutive collocation window and for 8 collocation windows equally distributed throughout the whole collocation period (August 2016 to February 2017) in half week increments. Details of this evaluation are provided in the Supplemental Information, but the intermittently distributed collocations generally performed slightly better, with reductions in MAE of 12 ppb (4% relative error) for CO, 2 ppm for CO2 (0.4% relative error), 0.4 ppb for NO2 (4% relative error), and 1.6 ppb for O3 (7% relative error) compared to the consecutive four-week collocation. The motivation for exploring intermittent collocation windows dispersed throughout the study period was to ensure that the training period covered a complete range of gas species concentrations, temperatures and relative humidity. In practice, the degree of collocation utilized in this study is equivalent to collocating the RAMP monitors with reference monitors for 3-4 days every 1-2 months. However, if the MAE using the initial consecutive collocation is satisfactory for the application, this calibration strategy was not substantially less accurate than the distributed collocations.
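The training-window selection can be illustrated with a small Python sketch. The point count follows directly from the 15-minute resolution stated above. The stopping rule is an assumption: the text does not give a numerical definition of a "significant" reduction in MAE, so the 2% relative-improvement threshold and the MAE values below are purely illustrative and are not results from this study.

```python
def points_in_window(weeks, resolution_min=15):
    """Number of averaged data points in a continuous window of given length."""
    return weeks * 7 * 24 * 60 // resolution_min


def optimal_training_weeks(mae_by_weeks, min_rel_improvement=0.02):
    """Shortest window after which adding a week improves MAE by < threshold."""
    weeks = sorted(mae_by_weeks)
    for current, longer in zip(weeks, weeks[1:]):
        gain = (mae_by_weeks[current] - mae_by_weeks[longer]) / mae_by_weeks[current]
        if gain < min_rel_improvement:
            return current
    return weeks[-1]


n_points = points_in_window(4)

# Illustrative (made-up, unitless) MAE values for 1 to 6 week training windows.
synthetic_mae = {1: 10.0, 2: 7.0, 3: 5.5, 4: 5.0, 5: 4.98, 6: 4.97}
chosen_weeks = optimal_training_weeks(synthetic_mae)
```

A stricter or looser threshold would shift the chosen window, which is why the criterion should be matched to the accuracy the application requires.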

Metrics for performance evaluation
The evaluation of the different models was conducted on 15-minute averaged testing data (i.e., data withheld entirely from model building). Metrics to quantitatively compare the LAB, MLR and RF model output to the reference monitor concentrations included Pearson r, which is a measure of the strength and direction of a linear relationship, and the coefficient of variation of the mean absolute error (CvMAE, Eq. 3). For comparing the RF model performance to other published studies, we also evaluated mean bias error, mean absolute error, slope of the linear regression of RF model calibrated RAMP data and reference data, and coefficient of determination (R2).
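As an illustration of these metrics, the sketch below computes Pearson r, MAE, mean bias error and CvMAE for paired reference and calibrated values. It assumes CvMAE is the MAE divided by the mean reference concentration, which matches the description of the normalization later in the text; the function names and the sample values are hypothetical.

```python
import math
from statistics import mean


def pearson_r(ref, pred):
    """Pearson correlation between reference and calibrated values."""
    ref_mean, pred_mean = mean(ref), mean(pred)
    cov = sum((r - ref_mean) * (p - pred_mean) for r, p in zip(ref, pred))
    ref_ss = sum((r - ref_mean) ** 2 for r in ref)
    pred_ss = sum((p - pred_mean) ** 2 for p in pred)
    return cov / math.sqrt(ref_ss * pred_ss)


def mae(ref, pred):
    """Mean absolute error: average absolute residual."""
    return mean(abs(p - r) for r, p in zip(ref, pred))


def mbe(ref, pred):
    """Mean bias error: positive when the model overestimates on average."""
    return mean(p - r for r, p in zip(ref, pred))


def cvmae(ref, pred):
    """MAE normalized by the mean reference concentration (assumed definition)."""
    return mae(ref, pred) / mean(ref)


# Hypothetical CO values in ppb, for illustration only.
reference = [100.0, 200.0, 300.0, 400.0]
calibrated = [110.0, 190.0, 310.0, 380.0]
print(pearson_r(reference, calibrated), mae(reference, calibrated),
      mbe(reference, calibrated), cvmae(reference, calibrated))
```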
Another useful tool for visually comparing competing models is a target diagram (Jolliff et al., 2009). A target diagram illustrates the contributions of the centered root mean square error (CRMSE, which is RMSE corrected for bias) and the mean bias error (MBE) towards total RMSE. In a target diagram, the x-axis is the CRMSE, the y-axis is the MBE and the vector distance to the origin is the RMSE. Since CRMSE is always positive, a further dimension is added: if the standard deviation of the model exceeds the standard deviation of the measurements, the CRMSE is plotted in the right quadrants and vice versa.
To match previously constructed target diagrams (Borrego et al., 2016; Spinelle et al., 2015), the CRMSE and MBE were normalized by the standard deviation of the reference measurements, and thus the vector distance in our diagrams is RMSE/σ_reference (nRMSE). The resulting diagram enables visualization of four diagnostic measures: (1) whether the model tends to overestimate (MBE > 0) or underestimate (MBE < 0), (2) whether the standard deviation of the model is larger (right plane) or smaller (left plane) than the standard deviation of the measurements, (3) whether the variance of the residuals is smaller than the variance of the reference measurements (inside circle of radius 1) or larger than the variance of the reference measurements (outside circle), and (4) the error (nRMSE), the vector distance between the coordinate and the origin. Details of equations required to build a target diagram are provided in the Supplemental Information. Model performance metrics were calculated in R (ver. 3.3.3, "Another Canoe") using the "tdr" package (Perpinan Lamigueiro, 2015).
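The coordinates of one point on a normalized target diagram can be sketched as below. The exact equations used in the study are in the Supplemental Information and are not reproduced here; this sketch assumes the usual decomposition RMSE² = CRMSE² + MBE², uses population standard deviations, and all names and sample values are hypothetical. It does not use the "tdr" package.

```python
import math
from statistics import mean, pstdev


def target_coordinates(ref, pred):
    """Return (x, y, nRMSE) for a normalized target diagram.

    x is the signed CRMSE and y is the MBE, both divided by the standard
    deviation of the reference values; nRMSE is the distance to the origin.
    """
    ref_mean, pred_mean = mean(ref), mean(pred)
    sigma_ref = pstdev(ref)
    mbe = pred_mean - ref_mean
    crmse = math.sqrt(mean(((p - pred_mean) - (r - ref_mean)) ** 2
                           for r, p in zip(ref, pred)))
    # Right half-plane when the model varies more than the reference.
    sign = 1.0 if pstdev(pred) > sigma_ref else -1.0
    x = sign * crmse / sigma_ref
    y = mbe / sigma_ref
    return x, y, math.hypot(x, y)


# Hypothetical values, for illustration only.
reference = [10.0, 20.0, 30.0, 40.0]
calibrated = [14.0, 22.0, 30.0, 38.0]
x, y, nrmse = target_coordinates(reference, calibrated)
print(x, y, nrmse)
```

In this example the point falls in the upper-left quadrant (positive bias, model less variable than the reference) and well inside the circle of radius 1.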

Calibration model goodness of fit: comparing model predictions to training data
Following model building, the goodness of fit between the model output concentrations and the reference monitor concentrations during the training window (i.e., the data used to build the model) was evaluated for all three calibration model approaches (laboratory univariate linear regression "LAB", field-based multiple linear regression "MLR" and field-based random forest "RF"). For the training period, the calibrated CO and O3 concentrations were all highly correlated (Pearson r > 0.8) with the reference monitor concentrations for all the calibration model approaches (Table 2). However, only the RF model achieved strong correlations between the reference monitor and the RAMPs for NO2 and CO2. Furthermore, CvMAE for each species was ≤5% during the training window for the RF models, substantially outperforming the other models.

Regression plots for all 19 RAMPs and all four gas species illustrating the goodness of fit of the RF model are provided in the Supplemental Information. For the RF models, Table 2 also provides the number of explanatory variables randomly sampled as split candidates at each decision node (mtry) that achieved the lowest model RMSE. In general, the larger the mtry, the simpler the underlying structure of the model. The advantage of a lower mtry is that subtle relationships between explanatory variables and the response can be probed. For example, if there is one dominant variable but the model is permitted to consider all 7 explanatory variables at each decision node, then the model will most frequently split the data based on the dominant variable, potentially masking the effect of other variables on the response. If the goodness of fit of the calibration model is improved by decreasing mtry, this suggests more complex variable interactions (Strobl et al., 2008).
Using the mtry metric, we observed that the underlying RF model structure is the simplest for CO, that some model explanatory variable complexities exist for the O3 and NO2 models, and that the CO2 model is the most complex and relies on subtle relationships between the explanatory variables to best fit the data (lowest mtry had the best results). This finding matches our expectations based on the LAB and MLR models; these simpler models performed best for CO and worst for CO2. The trends in the mtry metric highlight the value of the RF model approach, which directly accounts for multiple pollutants. This appears to be critical for O3, NO2 and CO2 sensors because they are cross-sensitive to other pollutants. Cross-sensitivities have been shown to have a minimal impact on CO sensors, with the only notable cross-sensitivity being to molecular hydrogen (Mead et al., 2013). The poor performance of linear models at predicting CO2 concentration is not surprising, as the sensor was observed to report high concentrations during periods of high relative humidity (e.g., during rain). In some cases during heavy rain the sensor saturates at 2000 ppm, its upper limit, and it is then reset to 400 ppm daily, as per manufacturer recommendations. The increase in CO2 under high humidity conditions is likely due to the interference of water with CO2 in the NDIR signal. Linear models are poorly suited to describe this behaviour.

Evaluation of models using testing data
To test the performance of the three different calibration models, the models were applied to the testing data that were not used for model fitting. The RAMP monitor concentrations after correction using the calibration models were compared to the actual measured reference concentrations (Figure 2, step 5). To illustrate the approach, in Figure 4, we show a very short time-series of the testing data (~48-hour window) for RAMP #1. This RAMP monitor's performance is representative of the average model performance across the 19 RAMP monitors and therefore illustrates the quality of an average model. Figure 4 also shows the calibrated RAMP #1 output regressed against the reference monitor concentration for the entire testing period for all three calibration models (LAB, MLR, and RF). For this period, the RF model clearly outperformed the LAB and MLR models. Differences between the different models were smallest for CO and O3 and largest for CO2 and NO2; the LAB models essentially did not reproduce the reference concentrations for CO2 and NO2. To illustrate the consistency of the RF model-calibrated RAMP monitors across the entire suite of monitors, regressions for all the RAMP monitors for O3 are shown in a separate figure.
To assess the overall model performance, two performance metrics (Pearson r and CvMAE) were calculated for each RAMP monitor using the entire testing dataset (Figure 6). The size of the testing dataset varied from 1.4 to 15 weeks, with a median value of 5 weeks. This aggregate assessment shows that the MLR and RF models are interchangeable for CO, as both models achieved Pearson r >0.9 and CvMAE <15%. The LAB model achieved a similar Pearson r, but CvMAE doubled to ~30%. For CO2, NO2, and O3, the RF model substantially outperforms the LAB and MLR calibration models on the testing data. On average, Pearson r exceeded 0.8 for the RF model for CO2 and NO2 versus < 0.6 for the LAB and MLR calibration models.
Furthermore, the RF model performance was more consistent across the RAMP monitors than the MLR and LAB models. To compare the LAB, MLR and RF models, target diagrams were constructed for the four gases using all three calibration models for each RAMP (Figure 7). The target diagrams show that, on average, across the RAMP monitors the random sensor error (distance to origin) was smaller for RF models and the RF models showed the least RAMP-to-RAMP variability (less dispersed). This contrasts with the MLR models, whose bias and extent of model standard deviation varied much more widely between RAMPs, especially for CO2. For the LAB models, the error for CO2 and NO2 was approximately an order of magnitude larger than for the RF and MLR models and had to be plotted on a separate inset due to their poor performance. Across all gases, the RF models on average were biased slightly lower than the reference. Thus, we conclude that the low CvMAE, high Pearson r correlations, lowest bias and lowest absolute error characteristics of the RF models for all four gases are significant improvements compared to conventional calibration approaches (LAB and MLR).

Detailed assessment of RF model performance
To investigate the performance of the RF models in greater detail, we assessed the effect of amount of testing data on model performance, the relative importance of the seven explanatory variables, the performance of the models across the different concentration ranges, and the number of data points needed in each concentration range to optimize the fit.

4.3.1 Drift over amount of testing data
The first assessment was of the amount of testing data. In this study, any data remaining after training were used to test model performance, provided there were at least 48 hours of testing data (192 data points). Again, all the data have 15 min temporal resolution. The number of points used to test the model performance varied by RAMP monitor and by pollutant, as reference monitors were occasionally offline for maintenance and calibration, and some RAMP monitors were intermittently deployed for concurrent air quality monitoring campaigns in Pittsburgh. To assess the effect of number of testing points on conclusions regarding RF model performance, we compared the MAE to the number of points in the testing window (Figure 8). For all the gas species, the MAE was essentially flat across the RAMP monitors; RAMP monitors with more testing data did not have substantially higher (worse) MAE, suggesting the RF models are robust over time. For NO2, the most data available for testing was approximately 8 weeks due to instrument maintenance and repair taking the NO2 reference monitor offline for 6 weeks of the study. Figure 8 also shows MAE over time from one RAMP, RAMP #4, which remained at the Carnegie Mellon supersite for the entirety of the six-month study. MAE was calculated for an increasing cumulative number of weeks forward in time, and again, MAE was consistent (and in some weeks improved) over time.

RF model explanatory variable importance
While RF models are non-parametric, some sense of the model structure can be gained by examining the relative importance of the explanatory variables. The importance of each variable was quantified by comparing the percent increase in mean square error (MSE) if the explanatory variable signal is permuted (randomly shuffled). If an explanatory variable strongly affects the model performance, permuting that variable results in a large increase in MSE. Conversely, if a variable is not a strong predictor of the response, then permuting the variable does not significantly increase the MSE. Figure 9 shows for each of the gases (CO, CO2, NO2 and O3) the increase in MSE when the explanatory variables were permuted. For both CO and O3, the signal from the sensor measuring the target analyte (CO or O3) is the most important explanatory variable, as expected. For O3, the second most important variable was the NO2 signal, an expected cross-sensitivity, as the ozone sensor measures total oxidants (O3 + NO2) (Spinelle et al., 2015).
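The permutation idea can be sketched in a few lines. This is a simplified illustration and not the calculation performed by the R random forest implementation used in the study: the stand-in model is a fixed linear function rather than a random forest, the importance is computed on a single dataset, and every name and value below is hypothetical.

```python
import random
from statistics import mean


def mse(model, rows, targets):
    """Mean square error of a model (a callable taking one row of features)."""
    return mean((model(row) - y) ** 2 for row, y in zip(rows, targets))


def permutation_importance(model, rows, targets, n_features, seed=0):
    """Percent increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    increases = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # break the link between feature j and the response
        permuted = [row[:j] + [value] + row[j + 1:]
                    for row, value in zip(rows, column)]
        increases.append(100.0 * (mse(model, permuted, targets) - baseline) / baseline)
    return increases


def stand_in_model(row):
    """Hypothetical calibration model that only uses feature 0."""
    return 2.0 * row[0] + 1.0


# Feature 0 drives the response; feature 1 is ignored by the stand-in model.
rows = [[float(i), float((7 * i) % 20)] for i in range(20)]
targets = [2.0 * i + 1.0 + (0.5 if i % 2 else -0.5) for i in range(20)]
importance = permutation_importance(stand_in_model, rows, targets, n_features=2)
print(importance)
```

Shuffling the feature the model depends on inflates the MSE, while shuffling the ignored feature leaves it unchanged, which is the pattern interpreted in Figure 9.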
The explanatory variable importance is more complex for CO2 and NO2. For CO2, all variables are roughly equally important, with CO being the most important. This is likely due to the strong meteorological effect of humidity on the measured CO2 concentration; the model must rely on other primary pollutants to predict the CO2 signal when the measured CO2 has reached full-scale, and short-term fluctuations of CO2 are likely from combustion sources (e.g., vehicular traffic in urban areas) which also emit CO. This highlights the value of having sensors for multiple pollutants in the same monitor. Including measurements of additional pollutants helps the RF model correct for cross-sensitivities. For the NO2 model, RH was the most important explanatory variable followed by the NO2 sensor signal, highlighting again the importance of including meteorological data within sensor packages. The NO2 model was also more strongly affected by temperature than the other pollutants. We hypothesize that the sensitivity of the NO2 sensor to ambient NO2 is suppressed in Pittsburgh, which has low ambient NO2 concentrations compared to other cities where these sensors have been evaluated (see Table 3). NO2 is lowest when ozone is highest in the summer, and thus the NO2 RF model effectively uses T and RH as indicators for seasonality when NO2 is low and the sensor response is suppressed. Furthermore, the relatively equal variable importance of several of the explanatory variables within a model suggests that a cluster of sensors measuring many different species is critically important to build robust calibration models. The only sensor channel that did not contribute significantly to any model performance was the SO2 sensor, thus this sensor could be replaced with a more relevant sensor, such as NO, in future iterations of the RAMP monitor.
These findings highlight the value of bundling sensors for measuring a suite of pollutants together, as the different sensors can capture (at least to some extent) cross-sensitivities to other pollutants and improve the model performance for other sensors.

RF model performance as a function of ambient concentration
In Section 4.2, predicted concentrations were normalized to the average reference monitor concentration to quantitatively compare the different calibration models (CvMAE). To evaluate the RF model performance at different reference concentrations, the testing data were divided into deciles for which the median reference monitor concentration, the absolute residual, and the residual normalized to the reference monitor concentration were calculated (Figure 10). For all species, the RF models tended to overestimate at lower concentrations, and underestimate at the highest concentrations. For the CO RF model, the normalized residual is within 10% of the reference monitor concentration by the 20th percentile of the data (>100 ppb), and continues to improve until the 50th percentile when it plateaus at a normalized residual of about 5%. The US EPA requires a limit of detection of 100 ppb for CO instruments used for regulatory monitoring (United States Environmental Protection Agency, 2014), thus our performance meets that goal. In the top decile, the average absolute CO residual for the RF models approximately doubles but the relative error is still around 5%. However, the top decile spans the broadest concentration range due to the lognormal shape of the CO concentration distribution, and these points are difficult to capture in training data sets.
For the CO2 RF model, agreement with the reference monitor data is within a few percent up to the 90th percentile, when agreement drops to within 5%. This is possibly due to the RF model actively suppressing high CO2 sensor signals, as the sensor is prone to reading erroneously high concentrations during rain events. Additionally, the top decile of the data spans a wide range of CO2 concentrations due to the lognormal shape of the CO2 distribution. As with CO, the NO2 RF model agreement with the reference monitor plateaus around the 50th percentile mark; however, the NO2 RF-model error exceeds 100% for the lowest decile (<5 ppb), suggesting an effective sensitivity of the sensor of 5 ppb. For the O3 RF model, the effective sensitivity is also around 5 ppb; when the average reference monitor concentration increased from 5 ppb to 10 ppb (from first to second decile), the normalized residual decreased from over 100% to about 20%. The US EPA limit of detection for federal regulatory monitors is 10 ppb for both NO2 and O3, suggesting that as with CO, the RF model performance is within 20% of regulatory standards (United States Environmental Protection Agency, 2014).
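A decile breakdown of this kind can be sketched as follows. The sketch assumes equal-count bins ordered by reference concentration and reports the mean absolute residual per bin; the function name and the sample data (a constant 3 ppb offset) are hypothetical and chosen only to show how a fixed absolute error becomes a large relative error at low concentrations.

```python
from statistics import mean, median


def decile_summary(ref, pred):
    """Per-decile (median reference, mean absolute residual, normalized residual).

    Pairs are sorted by reference value and split into ten equal-count bins.
    Reference medians are assumed to be positive so the ratio is defined.
    """
    pairs = sorted(zip(ref, pred))
    n = len(pairs)
    summary = []
    for d in range(10):
        chunk = pairs[d * n // 10:(d + 1) * n // 10]
        if not chunk:
            continue
        ref_median = median(r for r, _ in chunk)
        abs_residual = mean(abs(p - r) for r, p in chunk)
        summary.append((ref_median, abs_residual, abs_residual / ref_median))
    return summary


# Hypothetical data: reference 1..100 ppb, calibrated values 3 ppb too high.
reference = [float(v) for v in range(1, 101)]
calibrated = [v + 3.0 for v in reference]
deciles = decile_summary(reference, calibrated)
for ref_median, abs_residual, normalized in deciles:
    print(ref_median, abs_residual, round(normalized, 3))
```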
Systematic underprediction at the highest concentrations was also observed and is a consequence of the training dataset used to fit the RF model. Unless the range of concentrations in the training data encompasses the range of concentrations during model testing, there will be underpredictions for concentrations in exceedance of the training range. Additionally, the performance of the RF model is sensitive to the number of data points at a given concentration.
To build a robust model, many data points are required at a given concentration to probe the extent of the ambient air pollutant matrix. In this study, the training windows were dispersed throughout the collocation period to ensure good agreement of gas species and meteorological conditions during both the training and testing windows (see Supplemental Information).

To illustrate the impact of number of data points, we binned the data for the representative RAMP (RAMP #1) by concentration and the average concentration measured by the reference monitors was plotted against the average concentration from the calibrated RAMP (Figure 11). The uncertainty in the random forest model was plotted as the standard deviation of the model solutions from the 500 trees and the bins were colour coded by the number of data points within each bin. Figure 11 illustrates that for every pollutant, agreement with the reference monitor was poorer and uncertainty in the model prediction was larger for concentration bins containing fewer than 10 data points. This disproportionately impacted the upper end of the pollutant distribution where fewer data points were collected due to the intermittent and variable nature of high pollutant episodes. This suggests that a minimum of 10 data points at a given concentration are needed to adequately train the RF model, which may inform future RF model building. At NO2 concentrations below 5 ppb, deviations from the 1:1 line were also observed despite the training dataset containing more than 100 data points at these concentrations. As was concluded from Figure 10, 5 ppbv appears to be the sensitivity limit of these low-cost sensors for NO2.
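The binning step can be sketched as below: data are grouped into fixed-width concentration bins, and bins holding fewer than 10 points are flagged. The bin width, names and sample values are hypothetical, and the per-bin spread across the 500 trees shown in Figure 11 is not reproduced here.

```python
from collections import defaultdict
from statistics import mean


def bin_by_concentration(ref, pred, bin_width, min_points=10):
    """Average reference and calibrated values per concentration bin.

    Bins are defined on the reference value; a bin is marked sparse when it
    holds fewer than min_points pairs.
    """
    bins = defaultdict(list)
    for r, p in zip(ref, pred):
        bins[int(r // bin_width)].append((r, p))
    rows = []
    for index in sorted(bins):
        members = bins[index]
        rows.append({
            "ref_mean": mean(r for r, _ in members),
            "pred_mean": mean(p for _, p in members),
            "n": len(members),
            "sparse": len(members) < min_points,
        })
    return rows


# Hypothetical data: 12 points in the 0-5 ppb bin, only 3 in the 5-10 ppb bin.
reference = [1.0] * 12 + [7.0] * 3
calibrated = [1.5] * 12 + [9.0] * 3
binned = bin_by_concentration(reference, calibrated, bin_width=5.0)
for row in binned:
    print(row)
```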

Comparison of results to other published studies
In this section, we compare the performance of our RF models to results from other recent studies including the EuNetAir project in Italy (Borrego et al., 2016) and EPA Community Air Sensor Network (CAIRSENSE) project (Jiao et al., 2016). Additionally, a handful of studies have tested the field performance of low-cost sensors both 'out of the box' with factory calibrations (Duvall et al., 2016), and after a machine-learning-based calibration (Cross et al., 2017; Esposito et al., 2016; Spinelle et al., 2015). The number of sensors and length of deployment used here are generally greater than in those previous studies. We compare the performance of our RF models to these studies in Table 3. While several low-cost sensor calibration studies have investigated calibration models within laboratory environments (Masson et al., 2015a; Mead et al., 2013; Piedrahita et al., 2014; Williams et al., 2013), we have elected to limit our comparison to field data.
There was not a substantial difference in performance of the RF model calibrated vs. LAB calibrated RAMP for CO, and performance was best for this pollutant on the 'out-of-the-box' factory calibrated performance assessments in EuNetAir and CAIRSENSE, suggesting that rigorous calibration models may not be critical for CO. However, the RAMP CO RF model did provide improved performance (smallest MAE, 38 ppb) at lower average concentrations compared to the EuNetAir study.
Similarly, the 'out-of-the-box' performance of the CO sensors tested as part of CAIRSENSE and by the 24 AQMesh sensors tested in Castell et al. (2017) was poorer than the RF model calibrated RAMP. Of those studies that used an advanced algorithm to calibrate the sensors (Cross et al., 2017), the CO RF model resulted in greater than or equivalent R2 values and slightly lower slopes. While the R2 of the CO HDMR model of Cross et al. (2017) is highest, it is difficult to estimate its true predictive performance because its statistical metrics were calculated over the whole collocation period, of which 35% of the data were used for training. Therefore, it blends goodness of fit and predictions.
For NO2, the performance of 'out-of-the-box' low-cost sensors varied widely and half the sensors in the EuNetAir study (Borrego et al., 2016) reported errors larger than the average ambient concentrations. Therefore, advanced calibration models, such as those using machine learning, are critical to accurate measurements of ambient NO2. Furthermore, sensor performance was correlated with average ambient concentration; studies in areas with higher NO2 concentrations had the best performance, consistent with our observations (Figure 10). For studies using advanced NO2 sensor calibration models (Cross et al., 2017; Esposito et al., 2016; Spinelle et al., 2015), Esposito et al. (2016) had the best performance, with a MAE of < 2 ppb; however, this evaluation was done in a location with high NO2 concentrations, 45 ppbv (Air Quality England, 2015), more than three times higher than the 12 ppbv in Pittsburgh. In addition, they only evaluated one sensor array so the robustness of the approach is unknown. In our study, the MAEs across the NO2 RF model RAMPs ranged from 2.6-3.8 ppb, which is almost as good as Esposito et al. (2016), but at less than one third the ambient concentrations. The slope and R2 of the HDMR model for NO2 of Cross et al. (2017) do exceed that of the RAMP, but again their performance metrics appear to be calculated over the entire collocation period, which includes 35% training data. Similarly, the annual average NO2 concentrations in 2015 were 15 ppb at the Massachusetts regulatory site used as a reference in Cross et al. (2017) (Massachusetts Department of Environmental Protection, 2016), 3 ppb higher than the average concentration observed in our study. As shown in Figure 10, an increase of a few ppb of NO2 can result in almost 100% reductions in relative residuals in our model, thus this effect is not surprising. Furthermore, for identical factory calibrated sensors out of the box, such as the Cairclip and AQMesh, a 5 ppb increase in average NO2 concentration results in R2 values more than doubling. As such, the excellent performance of the RF model for NO2 at average ambient concentrations of 12 ppbv shows promise.
For O3, the RF model, the calibrated data from Spinelle et al. (2015), and the measurements from the Aeroqual SM50 (Jiao et al., 2016) performed the best. Good performance from the Aeroqual when measuring NO2 has also been previously observed (Delgado-Saborit, 2012). However, the results were the most consistent across the RAMP monitors calibrated with RF models, with relative standard deviations of <20% across the 19 RAMPs for all markers of statistical performance. This performance consistency also holds for the CO and NO2 RF models. The O3 RF models were built in Pittsburgh, PA, which has historically had issues with NAAQS ozone compliance, thus while our model was seemingly one of the most accurate and robust, some of this performance may be attributed to the higher ambient O3 concentrations. From this comparison, we conclude that the RAMP monitor calibrated with a RF model is unique in that it is more accurate when considering the combined suite of pollutants (i.e., all pollutants were accurately measured), it is consistent between many units (<20% relative standard deviation in performance metrics across 19 monitors), and is precise even at lower ambient concentrations.

RF model calibrated RAMP performance in a monitoring context
We further assess the RAMP monitor performance against two metrics: 1) for NAAQS compliance, and 2) for suitability for exposure measurements as per the US EPA Air Sensor Guidebook (Williams et al., 2014). We also demonstrate the benefit of improved performance of the RF models in a real-world deployment at two nearby sites in Pittsburgh, PA.

In this study, the time resolution and methods used to assess the effectiveness of the RF models (15 min) do not match the metrics used by regulators when considering compliance to National Ambient Air Quality Standards (NAAQS). For example, the NAAQS standard for O3 is based on the maximum daily maximum 8-hour average, and compliance for NO2 is based on the 98th percentile of the daily maximum 1-hour averages. While acknowledging that the RAMP monitor collocation period was shorter than typical NAAQS compliance periods (e.g., annually for O3 and across 3 years for NO2), it is still worth characterizing the RAMP performance using the LAB, MLR and RF models (Figure 12). For the representative RAMP monitor used previously (RAMP #1), daily maximum 8-hour O3 was in good agreement between the RF calibrated RAMP and the reference monitor, with all data points falling roughly along the 1:1 line, while for the MLR model, concentrations were skewed slightly low (slope of 0.65 for MLR, 0.82 for RF). For NO2, the 98th percentile of the daily maximum 1-hour averages was 34 ppb for the RF model versus 35 ppb measured using a reference monitor compared to 25 ppb for the MLR model and 51 ppb for the LAB model. The RF model was substantially closer to the reference monitor estimate and the underestimation was only by 1 ppb. Other RF model calibrated RAMP monitors performed similarly, all agreeing within 5 ppb.
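Converting 15-minute data into these regulatory-style statistics can be sketched as follows. This is a simplified illustration only: it ignores the data completeness rules and the multi-day handling of the official NAAQS procedures, the nearest-rank percentile is an assumption rather than the regulatory definition, and all names and values are hypothetical.

```python
import math
from statistics import mean


def hourly_means(values_15min):
    """Average consecutive groups of four 15-minute values into hourly values."""
    return [mean(values_15min[i:i + 4]) for i in range(0, len(values_15min) - 3, 4)]


def daily_max_8h(hourly):
    """Largest running 8-hour mean within one day of hourly values."""
    return max(mean(hourly[i:i + 8]) for i in range(len(hourly) - 7))


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (assumed method, not the regulatory one)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical day: each 15-minute value equals the hour of day (0..23).
one_day = [float(hour) for hour in range(24) for _ in range(4)]
hourly = hourly_means(one_day)
print(max(hourly))           # daily maximum 1-hour average
print(daily_max_8h(hourly))  # daily maximum 8-hour average

# Hypothetical daily maximum 1-hour values for 100 days.
daily_maxima = [float(v) for v in range(1, 101)]
print(percentile_nearest_rank(daily_maxima, 98))
```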

To demonstrate the improved performance of the RF models in a real-world context, two of the RAMPs used in the evaluation study were deployed for a 6-week period at two nearby sites in Pittsburgh, PA. One RAMP monitor was located on the roof of a building at the Pittsburgh Zoo in a residential urban area, and another was placed approximately 1.5 km away at a near-road site located within 15 m of Highway 28 in Pittsburgh (Figure 13). NO2 concentrations are known to be elevated up to 200 m away from a major roadway compared to urban backgrounds due to the reaction of fresh NO in vehicle exhaust with ambient O3 (Zhou and Levy, 2007). Figure 13 shows the diurnal profiles of the RAMPs at the two locations evaluated using the RF and MLR models. The RF model indicates an NO2 enhancement of approximately 6 ppb at the near-road site (Figure 13, red trace) compared to the nearby urban residential site (Figure 13, blue trace) and there are notable increases in NO2 during morning and evening rush hour periods, as expected. The concentrations reported by the RF model calibrated RAMPs were further verified with measurements using a mobile van equipped with reference instrumentation at different periods throughout the day. However, applying the MLR model to the RAMP data reveals no significant difference between the two sites (Figure 13, bottom diurnal). In fact, the MLR model predicts negative concentrations during the day. The results of this preliminary deployment suggest that the RF model calibrated RAMPs could be suitable for quantification of intra-urban pollutant gradients.
The US EPA Air Sensor Guidebook (Williams et al., 2014) provides air sensor performance goals by application area. The performance criteria include maximum precision and bias error rates for applications ranging from education and information (Tier I) to regulatory monitoring (Tier V). The precision estimator is the upper bound of a 90% confidence interval of the coefficient of variation (CV) and the bias estimator is the upper bound of a 95% confidence interval of the mean absolute percent difference between the sensors and the reference (full equations in the Supplemental Information). An overarching goal of RAMP monitor deployments is to use low-cost sensor networks to quantify intra-urban exposure gradients, thus our benchmark performance was Tier IV (Personal Exposure), which recommends that low-cost sensors have precision and bias error rates of less than 30%. For the testing (withheld) periods, we compared the performance of the RF, MLR and LAB models for all the RAMP monitors used in this study to the precision and bias estimators recommended by the US EPA (Figure 14). The performance across the RAMP monitors was summarized using box plots, and only the RF model calibrated RAMPs are suitably precise and accurate for Tier IV (personal exposure) monitoring across CO, NO2 and O3. Furthermore, both RF model calibrated CO and O3 RAMP monitor measurements were below the even more stringent Tier III (Supplemental Monitoring) standards, which recommend precision and bias error rates of <20%. The RF model NO2 RAMP measurements may reach Tier III in locations with higher NO2 concentrations.

Conclusions
This study demonstrates that the RF model applied to the RAMP low-cost sensor package can accurately characterize air pollution concentrations at the low levels typical of many urban areas in the United States and Europe. The fractional error of the models at a 15-minute time resolution was <5% for CO 2 , approximately 10-15% for CO and O 3 and approximately 30% for NO 2 , corresponding to mean absolute errors of 10 ppm, 38 ppb, 3.4 ppb and 3.5 ppb, respectively. This performance meets 5 the recommended precision and accuracy error metrics from the US EPA Air Sensor Guidebook for Personal Exposure (Tier IV) monitoring. We demonstrate that degree of sensitivity allows quantification of intra-urban gradients. Furthermore, the calibration models were well-constrained across 19 RAMP units (all performance metrics <20% relative standard deviation), and showed minimal degradation over the duration of the collocation study (August 2016 -February 2017),

10
While the iteration of the RAMP used in this study was equipped with an SO2 sensor, no calibration model was possible due to SO 2 concentrations at our supersite being below reference instrument detection limits. One feature of the RAMP monitor is that the sensors are modular and can be readily replaced. The assessment of explanatory variable importance combined with the sub-detection limit levels SO 2 during the study suggests that the RAMP monitor did not benefit from the presence of the SO 2 sensor in this urban background environment. Future iterations of the RAMP will be equipped with NO sensors, which 15 may be more relevant in an urban context. The RF-models described here were built on four weeks of training data equally distributed in 3.5 day periods throughout the entire collocation. This is nominally equivalent to 3-4 days of calibration every 2 months. As previously mentioned, the lowcost sensor modules within the RAMP monitors can be readily replaced, and as such, we recommend for a large urban 20 deployment to prepare a set of sensors at a regulatory monitoring site and to exchange sensors as they malfunction or as calibration models drift. Since the completion of this study, the sensors have been deployed in Pittsburgh for over 4 months, and changes in the calibration models over longer periods of deployment (1 year or more) will be discussed in a future work.
Additionally, the sensors were first opened in July 2016, and characterized over the first 7 months of exposure to ambient environments. During this period, no significant temporal drift or sensor degradation was observed, but longer observational studies are likely needed to characterize sensor decay and end-of-life.
The calibration models were developed in Pittsburgh, which had higher O3 and lower NO2 compared to several published field-based calibrations and measurements with low-cost sensors. Our results and those of other studies demonstrate that low-cost sensor performance generally improves with increasing ambient concentration; despite this, the RF models for NO2 had the second-lowest mean absolute error (<4 ppbv) even at low NO2 concentrations. The good performance of the RF models across all pollutants can likely be attributed to the ability of the RF models to account for pollutant and meteorological cross-sensitivities, highlighting the importance of building multipollutant sensor packages.
Atmos. Meas. Tech. Discuss., https://doi.org/10.5194/amt-2017-260 Manuscript under review for journal Atmos. Meas. Tech. Discussion started: 9 August 2017 © Author(s) 2017. CC BY 4.0 License.
Overall, we conclude that, with careful data management and calibration using advanced machine learning models, low-cost sensing with the RAMP monitors may significantly improve our ability to resolve spatial heterogeneity in air pollutant concentrations. Developing highly resolved air pollutant maps will assist researchers, policymakers and communities in developing new policies or mitigation strategies to enhance human health. Going forward, a random-forest-calibrated RAMP network of up to 50 nodes will be deployed in Pittsburgh, PA. This robustly calibrated network will help support better epidemiological models, aid in policy planning, and identify areas where more assessment is needed.

Competing interests
Author J. Gu is the CEO of SenSevere, the developer and manufacturer of the RAMP hardware. J. Gu's involvement was limited to the development, management, and improvement of the hardware in the RAMP monitors, and did not extend to data analysis. Authors N. Zimmerman and R. Subramanian may in the future act as consultants for SenSevere on low-cost sensor calibration. The data output from the SenSevere hardware, in conjunction with the calibration algorithms presented in this paper, yields significantly more accurate measurements than previously reported and is the subject of a provisional patent application. The authors declare no other competing interests.

Figure 2: Simplified illustration of one potential CO random forest tree for one RAMP using 100 data points (the trees within the actual models are significantly more complex and 500 such trees are included in the final models). Tree nodes are coloured by splitting variable and the split point is overlaid on the branch (e.g., at the first split, points with CO sensor signal >255.9 are sent to a terminal node, and the remaining points go to the next splitting node). The value shown in each terminal node is the average CO reference monitor concentration (ppb); n = number of data points in each terminal node.

... (Table 3) were performed with much larger testing datasets.

The y-axis is the bias relative to the reference and the x-axis is the bias-adjusted RMSE (CRMSE) normalized by the reference monitor standard deviation; the vector distance between any given point and the origin is the RMSE normalized by the standard deviation of the reference measurements. The CRMSE is in the left plane if the model standard deviation is smaller than the standard deviation of the reference observations, and vice versa. If data fall within the circle, then the variance of the residuals is smaller than the variance of the reference measurements. The target diagram for the LAB model for CO2 and NO2 is shown in the inset figure because of the order of magnitude difference in MBE and CRMSE compared to the MLR and RF models.

Figure 9: Importance of the explanatory variables to each of the RF models. For each model, the explanatory variables are rank ordered from most to least important, and the sensor response corresponding to the target analyte is marked with a yellow star. The box plots represent the range of importance across the 19 RAMPs (whiskers: 10th and 90th percentile, box edges: 25th and 75th percentile). The relative importance is determined by calculating the increase in mean square error if the explanatory variable is permuted (i.e., randomly shuffled).
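The two diagnostics described in the captions above, the target diagram coordinates and the permutation-based variable importance, can be sketched as follows. This is a generic illustration rather than the study's implementation. It assumes that the bias is normalized by the reference standard deviation (as implied by the stated distance to the origin) and that importance is evaluated on a held-out array, whereas random forest software may instead use out-of-bag samples. All function names and the toy model are hypothetical.

```python
import numpy as np

def target_diagram_point(pred, ref):
    """Return (x, y): signed normalized CRMSE and normalized mean bias error."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    ref_std = np.std(ref)
    mbe = np.mean(pred - ref)
    # Bias-adjusted RMSE: RMSE after removing each series' mean.
    crmse = np.sqrt(np.mean(((pred - pred.mean()) - (ref - ref.mean())) ** 2))
    # Negative x (left plane) when the model varies less than the reference.
    sign = np.sign(np.std(pred) - ref_std)
    return float(sign * crmse / ref_std), float(mbe / ref_std)

def permutation_importance_mse(predict, X, y, n_repeats=5, seed=0):
    """Average increase in mean square error when each column of X is shuffled."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    baseline = np.mean((predict(X) - y) ** 2)
    increases = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X[:, j])  # break the link to y
            increases[j] += np.mean((predict(X_perm) - y) ** 2) - baseline
    return increases / n_repeats

def toy_model(X):
    """Hypothetical stand-in for a fitted model; it ignores the third column."""
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]

X_demo = np.random.default_rng(1).normal(size=(500, 3))
y_demo = toy_model(X_demo)
print(permutation_importance_mse(toy_model, X_demo, y_demo))
print(target_diagram_point([0.5, 1.0, 1.5, 2.0], [1.0, 2.0, 3.0, 4.0]))
```

In the toy example the third column receives zero importance because the stand-in model never uses it, which mirrors how an uninformative sensor signal would rank last in such a plot.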