Interactive comment on "Calibration and assessment of electrochemical air quality sensors by co-location with reference-grade instruments"

This work presents a detailed analysis of the performance of one type of electrochemical air quality sensor for SO2 detection. The authors use data from multiple sensors deployed for approximately 21 weeks and compare them with co-located reference-grade SO2 instruments on the island of Hawaii. The performance of multiple regression methods used to calibrate the electrochemical sensors and correct for known temperature responses is evaluated. Given the growing availability of, and interest in, low-cost sensor technologies over recent years, comprehensive evaluations of their performance and of possible sampling methodologies such as this are essential. As acknowledged by the authors, the choice of Hawaii as a sampling location provides the best possible scenario for sensor performance, due to the large dynamic range of SO2 mixing ratios experienced and the


Introduction
The last several years have seen an explosion in the use of low-cost sensor technologies for air pollution monitoring efforts (Snyder et al., 2013). The low cost, small size, and low power consumption of these sensors offer the promise of distributed measurements over wide geographical areas, with potential applications for topics such as air quality (AQ) monitoring, source attribution, human exposure and epidemiology, and atmospheric chemistry. However, because of questions associated with their sensitivity, calibration, and long-term reliability, there is a critical need to establish a cohesive approach for evaluation and performance assessment of low-cost sensors prior to their large-scale adoption (Lewis and Edwards, 2016).
One of the most commonly used technologies for low-cost AQ sensing is the electrochemical sensor, in which a pollutant of interest reacts electrochemically within a cell, drawing a current that is proportional to the analyte concentration (Cao et al., 1992). Modern electrochemical sensors have sensitivities in the parts per billion by volume (ppb) range (Hodgson et al., 1999), enabling sensitive, real-time pollutant measurements. However, accurate calibration of such sensors poses a major challenge. Even setting aside the logistical difficulties associated with calibrating a large number of sensors distributed throughout a network, there are specific technical challenges that can limit the accuracy of any calibration; these include the sensitivity of sensors to environmental conditions (temperature and relative humidity, RH) (Cross et al., 2017; Masson et al., 2015; Mead et al., 2013; Popoola et al., 2016), cross sensitivities to other (sometimes unknown or unmeasured) atmospheric species (Lewis et al., 2015; Mueller et al., 2017; Spinelle et al., 2015; Zimmerman et al., 2017), and long-term sensitivity decay (drift) associated with the evaporation of electrolyte solution (Mead et al., 2013; Smith et al., 2017).
Published by Copernicus Publications on behalf of the European Geosciences Union.
Thus far, two general approaches have been applied for the calibration of electrochemical (and other low-cost) AQ sensors: laboratory calibration and co-location with reference instruments. The first involves calibrating the sensor in a laboratory over a controlled and well-defined range of conditions (Castell et al., 2017; Mead et al., 2013; Piedrahita et al., 2014), as is standard for calibration of high-fidelity atmospheric chemistry and AQ instrumentation. However, because electrochemical sensors tend to be less selective and more prone to interferences than such higher-fidelity instruments (Lewis et al., 2015), identifying and calibrating over the full range of relevant measurement conditions in the laboratory can be challenging, and the presence of additional interfering components cannot always be anticipated. In addition, this approach requires high-quality analytical instruments and standard gas mixtures and so is generally not an option for anyone who is not affiliated with a research institution (e.g., community organizations, citizen scientists) or is conducting research in resource-limited environments (e.g., developing countries).
The second approach for calibrating low-cost sensors is by co-location with reference instruments, typically government-run AQ stations equipped with regulatory-grade monitors. There are multiple advantages to this approach: the reference instruments are regularly calibrated, the reference measurement data are generally made publicly available (e.g., EPA AirNow, US EPA, 2017; OpenAQ, Hasenkopf, 2017), and the calibrations are carried out under ambient conditions that are (at least partially) representative of the sensor measurements to be made. Indeed, the effectiveness of co-location has been demonstrated in several recent studies, with sensor outputs (voltages) and other environmental parameters (e.g., temperature) related to the true concentration values (from the reference instruments) via some form of regression from either parametric models (Jiao et al., 2016; Lewis et al., 2015; Masson et al., 2015; Mueller et al., 2017; Popoola et al., 2016; Sadighi et al., 2017; Smith et al., 2017) or machine-learning/nonparametric methods (Cross et al., 2017; Spinelle et al., 2015; Zimmerman et al., 2017).
While this previous work has demonstrated the effectiveness of sensor calibration by co-location, this general approach has not yet been systematically explored or optimized for realistic deployment conditions. Important open topics include ideal calibration algorithms (regression techniques), criteria for an acceptable calibration (range of conditions sampled, length of calibration time) prior to sensor deployment, and performance of calibration algorithms when faced with conditions outside the training set. In fact, to our knowledge it has never been demonstrated whether a sensor can be calibrated at one ambient location and collect accurate data at another, which is a fundamental requirement of any sensor deployment. Here, we attempt to address such questions by collecting an extensive co-location dataset and using it to assess various calibration algorithms. Central to this work is the development of models that are accurate, robust, repeatable, and predictive.
All measurements in the present study are made on the island of Hawai'i (USA); due to the ongoing eruption of Kīlauea, local levels of SO2 can be extremely high (even exceeding 1 ppm) (Kroll et al., 2015), constituting serious AQ and human health concerns (Longo, 2009, 2013; Longo et al., 2010; Longo and Yang, 2008; Mannino et al., 1996; Tam et al., 2016). The SO2 is emitted from just two point sources (the Halema'uma'u and Pu'u 'Ō'ō craters; see Fig. 1) into an otherwise clean environment, leading to large spatial and temporal variability in SO2 levels throughout the island. Accurate AQ measurements and estimates of human exposure to volcanic pollution ("vog") thus require a relatively dense monitoring network; in fact, the present calibration study is part of a planned island-wide AQ sensor network. Moreover, this location represents an ideal test bed for sensor characterization and validation, since air pollution is dominated by SO2, with no interfering gas-phase co-pollutants (H2S emissions from Kīlauea are generally quite low; Edmonds et al., 2013), and the dynamic range in SO2 can be very large (varying from < 1 ppb to > 1 ppm). This is in contrast to environments targeted in most other AQ sensor studies (e.g., polluted urban areas), which tend to have more pollutants, typically present at lower concentrations. This location is thus an ideal environment for the detailed characterization of the sensor response to a single target analyte, the focus of the present study. At the same time, because of the unique features of this environment, not all results from this work (such as accuracy of the calibration) will necessarily translate directly to other pollutants and environments. However, the general calibration and characterization approaches described here should be suitable for use in a wide range of sensor applications.
In this study, we install a set of low-cost, autonomous SO2 sensor nodes at AQ stations on the island for a period of 5 months. This provides a large dataset for testing, validating, and optimizing this in-field co-location approach to calibration. We evaluate a number of sensor calibration algorithms (both parametric and nonparametric), with a particular focus on the temperature dependence of the baseline. Further, we investigate the performance of the calibrations given practical constraints (e.g., the possibility that measurement conditions may be different from those of the calibration period) and examine how sensitivity changes over a period of several months.

Sensor node design
Measurements were made using a custom sensor node for continuous, real-time monitoring of ambient SO2 and environmental variables (temperature, RH) at a fixed-site location. Each node is powered by a small solar panel and is internet-connected via a 3G cellular module to allow bidirectional communication between a server and the sensor node. The nodes are weatherproofed (housed in a UL-certified weatherproof enclosure) and low power (∼ 1 W), with a total component cost of ∼ USD 400. Major components of the design are shown in Fig. 2. SO2 is measured using an Alphasense SO2-B4 electrochemical sensor (purchased December 2016, opened January 2017) in conjunction with the Alphasense potentiostat circuitry. This four-electrode sensor includes a working electrode (WE), at which the electrochemical reaction (oxidation of SO2) takes place, as well as an auxiliary electrode (AE), which is isolated from the gas phase but responds to changes in the signal associated with changing environmental variables. In particular, it has been shown that the AE response to changes in ambient temperature and relative humidity is nonlinear (Cross et al., 2017; Lewis et al., 2015; Masson et al., 2015; Mead et al., 2013) and can depend on not only these parameters but also their derivatives (Masson et al., 2015; Pang et al., 2017). The SO2 sensor and adjacent relative humidity and temperature (RHT) sensor (HIH6130, Honeywell) are embedded in a 3-D-printed flow chamber, with a small direct current (DC) fan used to pull air perpendicular to the surface of the sensors. This design is improved from an earlier prototype that used a passive external sensor, which was susceptible to large temperature variations caused by direct irradiation by sunlight and may have exhibited poorer sensitivity (Masson et al., 2015). The inlet and outlet are protected from the elements by 3-D-printed awnings that are epoxied in place.
The analog signals are sampled at 20 Hz using a 16 bit analog-to-digital converter (ADC; Texas Instruments ADS1115), before being averaged and saved locally as a 1 Hz measurement on a micro-SD card. The 1 Hz measurements are then averaged over a user-defined interval (1 min in the present study) and transmitted to a remote server where data are stored in a MySQL database and visualized in real time. Flags were set to mark the first 4 h after a node was turned on to indicate a sensor warmup period (Roberts et al., 2012; Smith et al., 2017). In addition, flags were set whenever the ADC or RHT sensor reported a failure. The node is operated using a 3G-enabled, ARM-based microcontroller (Particle Electron), allowing for two-way communication between the node and the server. Each node is powered continuously using a 9 W solar panel (Voltaic Systems), with a 4000 mAh battery (Voltaic Systems V15) serving as the power supply when the solar panel is not supplying enough power. In areas with less sunlight, two 6 W panels in parallel are used rather than a single 9 W panel. At full charge the battery can supply continuous backup power for 20 h, allowing the nodes to run overnight without loss of power.

Site description and reference data
Sensor nodes were first deployed on the island of Hawai'i beginning 15 January 2017 and most are still active as of August 2017. The Hawaii Department of Health (DOH) operates six AQ monitoring stations that continuously monitor SO2 and supporting meteorological variables including wind speed, wind direction, relative humidity, and temperature (Hawaii Department of Health, 2017). Continuous SO2 measurements are made by a pulsed-fluorescence analyzer (Thermo Scientific 43i), which provides data as 1 min averages and is calibrated at least once every 2 weeks. The data are continuous except during periods of calibration, which are excluded from the dataset. The AQ stations are spread across the island; the two primary sites used in this work are Pahala and Hilo (see Fig. 1). Pahala (population ∼ 1300; location: 19°12′9″ N, 155°28′38″ W) is located 37 km southwest of the main volcanic vent (Halema'uma'u) and so is subjected to the volcanic plume when the trade winds (the prevailing winds, from the northeast) are dominant. The mean 1 h SO2 level is 39 ppb, though levels can exceed 1 ppm during direct plume hits (typically in the morning, when the boundary layer is low) (Kroll et al., 2015). Hilo (population ∼ 43 300; location: 19°42′20″ N, 155°5′9″ W) is located 50 km northeast of the volcanic vent and is characterized by much lower SO2 values, with a mean 1 h level of 6 ppb and a yearly maximum of ∼ 500 ppb (during southwesterly "Kona winds").

Co-location of nodes
Nine sensor nodes were installed at the Pahala AQ station for no less than 48 h each over a 4-day period (15-19 January 2017) for initial calibration. (Two additional nodes lost power for some fraction of this calibration period and thus are not included in this study.) At the end of this calibration period, two nodes were relocated to the Hilo AQ station (23 January 2017, ongoing as of August 2017), and three nodes remained at Pahala (still operating as of August 2017). The remaining four nodes were distributed to elementary and middle schools across the island; due to the lack of co-location data, measurements taken at the schools will not be discussed here. All co-located nodes were mounted on the roof of the AQ monitoring station, within 2 m of the reference instrument's inlet. In this work, we focus on the data collection period of 15 January-22 May 2017. Power loss due to lack of sufficient sunlight impacted several nodes (mostly during early morning periods), though the two nodes located at Hilo and one node located at Pahala suffered no power loss. Beginning 25 April, the RHT sensor on one of the Pahala nodes (SO2-02) began to behave erratically for hours at a time, making it difficult to assess the data beyond that date.

Data preparation
A time delay between the sensor data and AQ station reference data, caused by differences in clock times and inlet residence times, was corrected by shifting one time series to the lag that maximized the cross correlation between the two (typically ∼ 3 min) (Knapp and Carter, 1976). Measurements marked by flags (indicating calibration of the reference instrument, sensor warmup time after power-on, etc.) were removed from both data streams, after which all sensor data for which no reference data were available were also removed. This process led to the exclusion of less than 1 % of all sensor data collected.
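The lag-alignment step can be illustrated with a short sketch. This is not the authors' code: the function name `estimate_lag`, the search window `max_lag`, and the synthetic series are assumptions made for illustration, and the sketch uses only NumPy.

```python
import numpy as np

def estimate_lag(sensor, reference, max_lag=10):
    """Return the shift (in samples) of `sensor` relative to `reference`
    that maximizes their Pearson correlation.
    A positive value means the sensor series lags the reference."""
    sensor = np.asarray(sensor, dtype=float)
    reference = np.asarray(reference, dtype=float)
    best_lag, best_r = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            s, r = sensor[lag:], reference[:len(reference) - lag]
        else:
            s, r = sensor[:len(sensor) + lag], reference[-lag:]
        corr = np.corrcoef(s, r)[0, 1]
        if corr > best_r:
            best_lag, best_r = lag, corr
    return best_lag

# Synthetic 1 min series (hypothetical data): the "sensor" is the
# reference delayed by 3 samples plus a little noise.
rng = np.random.default_rng(0)
reference = rng.normal(size=500)
sensor = np.roll(reference, 3) + 0.05 * rng.normal(size=500)
lag = estimate_lag(sensor, reference)
```

Once the lag is known, the sensor series can be shifted by that many samples before the two data streams are merged.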

Sensor calibration approaches
Calibration of sensor response based on the AQ station data was attempted using several techniques, including both a parametric method (linear regression, LR) and several nonparametric methods. All algorithms were implemented using the scikit-learn Python library (Pedregosa et al., 2012), which is open source and available under a BSD license. Several other open-source Python packages were also used in this work for data analysis and visualization, including seaborn (Waskom et al., 2017), pandas (McKinney, 2010), and numpy (Van Der Walt et al., 2011).

Linear regression
A multivariate LR using ordinary least squares (OLS) was constructed using the WE voltage (V_WE), AE voltage (V_AE), and temperature (T) as inputs. When considering the full dynamic range of SO2 concentrations (most cases), RH was not included as an input parameter since no unique contribution to the variance in our signal could be attributed to it, as per the results of a commonality analysis (Seibold and McPhee, 1979). As discussed in a later section, RH does appear to uniquely affect the sensor response at low SO2 levels; however, for the full range of measurements its contribution was negligible and so it was not included. This does not mean that sensor response is independent of RH but rather that in the present dataset RH does not contribute to the signal uniquely, as RH inversely tracks T in this environment. The form of the regression used is thus

$$[\mathrm{SO_2}] = c_0 + c_1 V_{\mathrm{WE}} + c_2 V_{\mathrm{AE}} + c_3 T \qquad (1)$$

where c_0 is the intercept and c_1 to c_3 are the fitted coefficients (c_1 being the sensor sensitivity). To reduce instability and uncertainty in our model caused by outliers, we used an ensemble meta-estimator, built using a bootstrap process (Kohavi, 1995), rather than a single linear model. This involves the construction of many individual linear models on random subsets of the original training data, followed by their combination based on the median of the individual parameters.
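A minimal sketch of a bootstrapped linear calibration of this kind is given below. It is an illustration under stated assumptions rather than the implementation used in the study: the function name, the number of models, the subset fraction, and the synthetic data are hypothetical, and NumPy's least-squares solver stands in for the scikit-learn estimators.

```python
import numpy as np

def fit_bootstrap_ols(X, y, n_models=200, sample_frac=0.8, seed=0):
    """Fit many OLS models on random resamples of the training data and
    combine them by taking the median of each coefficient.
    X columns: V_WE, V_AE, T. Returns [intercept, c_VWE, c_VAE, c_T]."""
    rng = np.random.default_rng(seed)
    A = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    n = len(y)
    m = int(sample_frac * n)                    # assumed subset size
    coefs = np.empty((n_models, A.shape[1]))
    for i in range(n_models):
        idx = rng.choice(n, size=m, replace=True)   # bootstrap resample
        coefs[i] = np.linalg.lstsq(A[idx], y[idx], rcond=None)[0]
    return np.median(coefs, axis=0)

# Synthetic demonstration with made-up coefficients (not values from the paper).
rng = np.random.default_rng(1)
n = 2000
V_WE = rng.uniform(0.3, 1.0, n)
V_AE = rng.uniform(0.25, 0.40, n)
T = rng.uniform(15.0, 35.0, n)
X_demo = np.column_stack([V_WE, V_AE, T])
y_demo = 5.0 + 400.0 * V_WE - 150.0 * V_AE - 0.8 * T + rng.normal(0.0, 1.0, n)
coef = fit_bootstrap_ols(X_demo, y_demo)
```

Taking the median rather than the mean of the bootstrap coefficients limits the influence of any resample that happens to be dominated by outliers.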

Nonparametric calibration approaches
Because of concerns associated with the nonlinear dependence of sensor response on environmental variables (namely T), various nonparametric (machine-learning) regression techniques were also explored. These algorithms were chosen based on their potential ability to determine the relationship between inputs (V_WE, V_AE, T) and outputs ([SO2]) without needing to know the functional form of the relationship itself. The methods examined were ridge regression (RR), which attempts to reduce standard error by introducing bias to reduce multicollinearity among independent variables (Rifkin, 2007); least absolute shrinkage and selection operator (LASSO) regression, which similarly reduces covariance and overfitting by eliminating similar features and imposing an absolute limit on the sum of the coefficients (Tibshirani, 1996); classification and regression trees (CART), which form a collection of rules in a recursive fashion by selecting data that differentiate observations based on the dependent variable (Breiman et al., 1984); and k nearest neighbors regression (kNN), which estimates the regression curve without making assumptions about the structure of the model (Altman, 1992). The kNN approach, which was found to have the best performance (see Results, below), involves mapping input variables from the training data (V_WE, V_AE, T) to the output variable (SO2 mixing ratio) in an n-dimensional vector space. Determination of SO2 concentration using new sensor data involves mapping those data to the k nearest points in the same vector space and computing the predicted value by taking the weighted average.
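The kNN prediction step can be sketched in a few lines. The sketch below is illustrative only: the text does not state which weights were used, so inverse-distance weighting is an assumption, as are the feature standardization, the function name, and the synthetic data.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=5):
    """Predict y for each row of X_new as the inverse-distance-weighted
    average of the k nearest training points (features standardized
    with the training mean and standard deviation)."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    Z_train = (X_train - mu) / sd
    Z_new = (X_new - mu) / sd
    # Euclidean distances, shape (n_new, n_train)
    d = np.linalg.norm(Z_new[:, None, :] - Z_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    d_k = np.take_along_axis(d, nearest, axis=1)
    w = 1.0 / (d_k + 1e-12)          # small constant avoids division by zero
    return (w * y_train[nearest]).sum(axis=1) / w.sum(axis=1)

# Hypothetical training data: columns are V_WE, V_AE, T.
rng = np.random.default_rng(1)
y_demo = rng.uniform(0.0, 150.0, 300)          # reference SO2, ppb
T_demo = rng.uniform(15.0, 35.0, 300)
X_demo = np.column_stack([0.30 + 0.001 * y_demo, 0.30 + 0.001 * T_demo, T_demo])

pred_seen = knn_predict(X_demo, y_demo, X_demo[:5])
# A query far above the training range cannot be extrapolated:
pred_far = knn_predict(X_demo, y_demo, np.array([[1.10, 0.325, 25.0]]))
```

Because every prediction is an average of training targets, the output can never exceed the largest value in the training set, which is why the last query is capped near the top of the training range.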

Results and discussion

SO2 sensor response
The 1 min time series for one sensor (SO2-02) located at the Pahala AQ station is shown in Fig. 3 for a 4-day period at the beginning of the co-location campaign. The working electrode voltage (V_WE) is generally well correlated with the reference SO2 measurement, except during periods of high temperature, in which they clearly diverge. The auxiliary electrode voltage (V_AE) peaks with an increase in T and appears to follow the divergence between V_WE and SO2. As described above, RH does not provide any additional information because it is inversely correlated with T in this environment.

Algorithm selection
The performance of each calibration algorithm (LR, RR, LASSO, CART, and kNN) was evaluated using the data from sensor node SO2-02, located at Pahala from 15 January to 25 April (for a total of 145 467 1 min data points). Each algorithm was assessed by 10-fold cross validation: the data were randomly split into 10 subsets, and the algorithm was trained on 9 of the subsets and evaluated on the remaining one. This process was repeated so that each of the 10 subsets was used once for evaluation. Each algorithm was scored using the negative mean squared error, on both normalized (scaled) and raw (unscaled) data.
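The cross-validation procedure can be written out explicitly as follows. This is a generic sketch, not the scikit-learn routine used in the study; the helper names (`kfold_neg_mse`, `fit_ols`, `predict_ols`) and the synthetic data are assumptions for illustration.

```python
import numpy as np

def kfold_neg_mse(fit, predict, X, y, n_splits=10, seed=0):
    """Shuffle the data, split it into n_splits folds, and return the
    negative mean squared error obtained on each held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        err = predict(model, X[test_idx]) - y[test_idx]
        scores.append(-np.mean(err ** 2))
    return np.array(scores)

def fit_ols(X, y):
    """Ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_ols(coef, X):
    return coef[0] + X @ coef[1:]

# Hypothetical data: one feature, noise standard deviation 0.5.
rng = np.random.default_rng(2)
X_cv = rng.uniform(0.0, 10.0, size=(1000, 1))
y_cv = 2.0 + 3.0 * X_cv[:, 0] + rng.normal(0.0, 0.5, 1000)
scores = kfold_neg_mse(fit_ols, predict_ols, X_cv, y_cv)
```

Scores closer to zero indicate better performance; comparing the distribution of the 10 scores across algorithms gives the kind of box-plot comparison described for Fig. 4.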
Performance of each algorithm is shown in Fig. 4. While all techniques show generally strong performance, kNN (scaled) gives the most accurate results. The LR performs at least as well as the remaining nonparametric algorithms. We thus focus on the results from these two regression algorithms, for all co-located sensor nodes. Parameters were tuned through a grid-search process (iterating over each possible parameter value) to determine the optimum settings. Finally, an ensemble meta-estimator was built using a bootstrap process (Breiman, 1996) in which subsets of the data were pulled with replacement to be trained and voted into the final algorithm. For the kNN method, the optimized number of neighbors was found to be between 3 and 15, depending on sensor node.

Algorithm validation
These two approaches (LR and kNN) are evaluated for all sensors by splitting the data into training and validation subsets: 70 % of the data were randomly selected for training, and the remaining 30 % for validation throughout the entire data collection period (which varies for each sensor). Predictive power of the models is described by their correlation coefficient (r²), mean absolute error (MAE), and root mean square error (RMSE), evaluated only on the previously unseen validation data (and not on the training dataset itself). Results for node SO2-02 are presented in Fig. 5.
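For reference, the split and the three metrics can be computed as in the sketch below. The function names and example values are hypothetical, and r² is taken here to be the squared Pearson correlation between reference and predicted values, which is an assumption about how it was evaluated.

```python
import numpy as np

def train_validation_split(n, train_frac=0.7, seed=0):
    """Return index arrays for a random train/validation split."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(round(train_frac * n))
    return idx[:n_train], idx[n_train:]

def validation_metrics(y_true, y_pred):
    """r2 (squared Pearson correlation, an assumed definition), MAE, RMSE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "r2": np.corrcoef(y_true, y_pred)[0, 1] ** 2,
        "mae": np.mean(np.abs(err)),
        "rmse": np.sqrt(np.mean(err ** 2)),
    }

train_idx, val_idx = train_validation_split(1000)
# Made-up reference and predicted values (ppb) for illustration only.
metrics = validation_metrics([0.0, 10.0, 20.0, 30.0], [1.0, 9.0, 22.0, 28.0])
```

In this small example the errors are 1, -1, 2, and -2 ppb, so the MAE is 1.5 ppb and the RMSE is the square root of 2.5, about 1.58 ppb.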
Figure 5a and c show results from the multivariate linear regression (Eq. 1). SO2 mixing ratios measured by the electrochemical sensor are well correlated with the reference SO2 monitor (r² = 0.987; 95 % CI: 0.986-0.988) and are reasonably accurate (RMSE = 9.7 ppb; 95 % CI: 9.6-9.9 ppb). The relative error (as a percentage of absolute concentration) decreases as the concentration of SO2 increases, dropping below 20 % around 50 ppb and below 5 % at 100 ppb. At the same time, the RMSE increases as the [SO2] range increases, since small fractional errors lead to large absolute errors at high concentrations. This model performs well at high concentrations because the V_WE response (which is linear with concentration) dominates the signal and is large relative to any shifts in the baseline. However, the LR calibration performs less well at low SO2 concentrations, overestimating SO2 levels when the temperature is highest. Under these conditions the temperature response dominates the sensor signal and, since it is apparently nonlinear, is not captured well by the LR.
[Figure 4 caption, partial] Each algorithm was run on data that were as-is ("unscaled", blue boxes) and normalized by removing the mean and scaling to unit variance ("scaled", green boxes). Results shown are for a single sensor (SO2-02) covering 145 467 1 min data points.
Figure 5b and d show results for the kNN model, which offers improved performance over the LR model: r² is 0.995 (95 % CI: 0.994-0.995) and the RMSE is 6.3 ppb (95 % CI: 6.2-6.5 ppb). kNN outperforms the linear model at lower SO2 concentrations, while performing similarly at higher concentrations, with the relative error dropping below 20 % at 20 ppb and below 5 % at 100 ppb. Unlike in the LR case, there is no clear relationship between T and measurement bias, indicating that kNN successfully captures the nonlinear temperature response of the sensor. However, kNN cannot infer the derivative of any feature (T, RH), which may be a limitation in cases where environmental conditions shift rapidly or for other types of sensors for which derivatives are more important (Masson et al., 2015; Pang et al., 2017).
The results shown in Fig. 5 are for a single sensor node co-located with the Pahala AQ station for the entire study period, but applying these algorithms to results from the other sensors (over the time they were located at Pahala) gives qualitatively similar results. A complete statistical summary of results for all nine sensors can be found in Table 1. Regardless of the algorithm used, results from the calibrated sensors are well correlated with the reference measurements. The few previous studies in which ambient SO2 was measured using electrochemical sensors (Jiao et al., 2016; Lewis et al., 2015) have found little to no correlation with reference data, which is likely due to exceedingly low ambient SO2 levels in the study regions and the cross sensitivities of the sensor to more abundant pollutant species (Lewis et al., 2015). For context, co-location studies of different electrochemical sensors targeting more abundant pollutants have found correlations with reference instruments (r²) ranging between 0.7 and 0.96 for O3, NO2, NO, and CO (Cross et al., 2017; Jiao et al., 2016; Mead et al., 2013; Popoola et al., 2016; Zimmerman et al., 2017), with estimates of RMSE spanning 4-60 ppb for O3 (Cross et al., 2017; Sadighi et al., 2017; Spinelle et al., 2015), 4-22 ppb for NO (Cross et al., 2017; Masson et al., 2015), 39 ppb for CO (Cross et al., 2017), and 4.5 ppb for NO2 (Cross et al., 2017); values of 38 ppb for CO, 3.5 ppb for NO2, and 3.4 ppb for O3 have also been reported (Zimmerman et al., 2017). However, it is difficult to directly compare performance metrics (r², RMSE) obtained from the different calibration algorithms used in these different studies, given the differences not only in sensor types but also in environmental conditions (T, RH, range of pollutant concentrations, and interferences by other pollutants).
[Table 1 notes] (a) Total number of 1 min data points, covering only the period during which the sensor was located at the Pahala AQ station for calibration. (b) Using the methods described in the text (and shown in Fig. 5) for evaluating node SO2-02. (c) See "Practical calibration considerations" subsection for details.

Practical calibration considerations
The results in Fig. 5 and Table 1 show that the kNN regression performs well when the full range of measurement conditions (pollutant levels, T) is covered in the training set. However, training and validating sensors in the same physical location, under similar environmental conditions, is in many ways a best-case scenario and is often not possible in practical calibration efforts. Because calibration (co-location) periods are generally limited in time, they likely will not cover the full range of environmental conditions; for example, they might not cover the highest levels of pollutants, or the full range of temperatures at a given site (which can require months to years of co-located measurements). It is therefore important to understand how such real-world constraints may affect the accuracy of sensor calibrations.
Figure 6 shows results from the LR and kNN algorithms, trained on subsets of our data to mimic such real-world calibration scenarios. Each row in the diagram represents a different calibration scenario: models were trained on data ranging from 0 to 50 ppb SO2 in row one, 0 to 150 ppb SO2 in row two, and 0 to 500 ppb SO2 in row three. After being trained on the truncated datasets, they were evaluated using the entire previously withheld validation dataset (with the full dynamic range in SO2).
In such limited training-set cases, the LR performs about the same as in the full training-set case (Fig. 5). The only exception is the ≤ 50 ppb condition (row 1), whose calibration lacks the dynamic range for an accurate determination of sensitivity (c_1 in Eq. 1). In all cases, performance of LR at low SO2 concentrations is relatively poor, again due to the importance of nonlinear temperature effects under these conditions. By contrast, when the SO2 levels in the training sets are limited, kNN performs poorly when SO2 levels in the validation set are high. This is because kNN cannot extrapolate outside the range of data with which it was trained. This is problematic in an area such as Hawai'i where it is difficult to know the upper bounds of SO2 concentrations; similar scenarios may occur in polluted urban areas, where plumes could be intercepted or new sources emerge. Thus, when the full range of pollutant concentrations is not accessed during calibration, each regression technique has a strength and a weakness: LR can extrapolate to higher concentrations, whereas kNN cannot; but LR does not correct for the temperature dependence of the signal, whereas kNN can.
To preserve the best feature of each approach, we propose a hybrid regression approach using both algorithms in a piecewise fashion. This hybrid approach entails using kNN when the concentration is below some threshold (here, 50 ppb) and LR when it is above this threshold. Because the concentration is the quantity we are trying to predict, the determination of whether this threshold is crossed must be made using the sensor measurements (V_WE, V_AE, T) only. We use a kNN classifier to make this determination using a method similar to that suggested by Kuncheva (2000), thereby classifying each measurement as either "above threshold" or "below threshold". The threshold was chosen by performing a grid search using target SO2 concentrations that are included within the boundaries of our training dataset; the threshold that produced the lowest RMSE was then chosen as the target threshold moving forward (here, 50 ppb).
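The piecewise logic can be sketched as follows. This is a simplified illustration, not the authors' implementation: the classifier here is a plain majority vote among the k nearest neighbors (a stand-in for the Kuncheva-style method cited above), the kNN regressor uses an unweighted mean, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def hybrid_predict(X_train, y_train, X_new, threshold=50.0, k=5):
    """Use kNN regression where a kNN classifier labels the point
    'below threshold', and linear regression where it labels it 'above'."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    Z_train, Z_new = (X_train - mu) / sd, (X_new - mu) / sd
    d = np.linalg.norm(Z_new[:, None, :] - Z_train[None, :, :], axis=2)
    neighbor_y = y_train[np.argsort(d, axis=1)[:, :k]]
    # Classifier: majority vote of the neighbors (assumed simplification).
    above = (neighbor_y > threshold).mean(axis=1) > 0.5
    # kNN regressor: unweighted mean of the neighbors.
    knn_pred = neighbor_y.mean(axis=1)
    # Linear regression on the raw features, with intercept.
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef = np.linalg.lstsq(A, y_train, rcond=None)[0]
    lr_pred = coef[0] + X_new @ coef[1:]
    return np.where(above, lr_pred, knn_pred)

# Hypothetical calibration data limited to 0-150 ppb (columns: V_WE, V_AE, T).
rng = np.random.default_rng(3)
n = 500
y_cal = rng.uniform(0.0, 150.0, n)
T_cal = rng.uniform(15.0, 35.0, n)
X_cal = np.column_stack([
    0.30 + 0.001 * y_cal + 0.0005 * rng.normal(size=n),
    0.30 + 0.001 * T_cal + 0.0005 * rng.normal(size=n),
    T_cal,
])
# Two new measurements: one near 5 ppb, one near 800 ppb (outside training range).
X_query = np.array([[0.305, 0.325, 25.0], [1.10, 0.325, 25.0]])
pred_low, pred_high = hybrid_predict(X_cal, y_cal, X_query)
```

In this synthetic case the high-concentration query is routed to the linear model and is extrapolated well beyond the 150 ppb training maximum, while the low-concentration query is handled by kNN.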
Results from the hybrid regression are shown in the rightmost column of Fig. 6. It generally performs better (with a lower RMSE) than either of the two regression approaches, as it can correct for the nonlinear temperature dependence at low concentrations, while performing well across the entire dynamic range, even when calibrated under lower-SO2 conditions. The hybrid algorithm offers an approach for accurately extrapolating to pollutant levels higher than were covered during the calibration period.

Multiple site validation
The performance of the hybrid regressor provides confidence in the ability to calibrate sensors via co-location and then deploy them at a different physical location, an essential step in building any distributed network of sensors. This was tested directly on two nodes (SO2-04, SO2-13) via calibration by co-location at the Pahala AQ station for a period of 48 h (results in Table 1) followed by relocation to the Hilo AQ station (80 km to the northeast; Fig. 1), where they remain in operation as of August 2017. The data collected at Hilo (118 days, n = 115 343) were then evaluated using the hybrid regressor trained using data from Pahala.
The results of this evaluation for one of the nodes (SO2-04) are shown in Fig. 7. The calibration carried out at Pahala performs well at Hilo (r² = 0.892, RMSE = 6.9 ppb); this is only somewhat worse than the performance of a sensor (SO2-02) trained at Pahala over the same 2 days and then kept at Pahala for validation on the subsequent 118 days (r² = 0.986, RMSE = 9.6 ppb; see Fig. S1).
Measurement error is higher for the sensor relocated to Hilo than for the one left at Pahala, largely due to the differences in the training and test environments. As seen in the two probability distribution plots (right side of Fig. 7), the calibration data were from a colder and higher-SO2 environment (Pahala) than was used in the evaluation (Hilo). Specifically, Pahala did not experience any clean air (SO2 < 1 ppb) and experienced cooler temperatures, whereas Hilo was most often clean (due to influence from marine air), leading to a mismatch between the conditions the model was trained on and those it was asked to predict. Nonetheless, the performance of the sensor and the robustness of its calibration at the new site are encouraging. The other node (SO2-13), calibrated in a similar fashion, performed comparably (r² = 0.880, RMSE = 8.5 ppb). These sensors compare reasonably well to the sensor (SO2-02) that was kept at Pahala and calibrated for only 2 days (Fig. S1); to our knowledge, this is the first demonstration of an electrochemical sensor being trained in one environment and validated in another.

Measuring SO2 at low(er) concentrations
Because of the intensity of the volcanic plume, with SO2 levels regularly reaching hundreds (and even thousands) of ppb, the dynamic range of the present measurements is extremely high, with upper-limit concentrations much greater than are typically found for SO2 (and other pollutants) in most environments. Assessing sensor performance at lower SO2 concentrations is thus important for understanding the potential for sensors and calibration algorithms to be used under a wider range of conditions. Sensor performance under lower-SO2 conditions can be evaluated using the present dataset by removing all points in which the reference value was greater than some threshold value (chosen here to be 25 ppb SO2, a reasonable value for cities in India and China; Meng et al., 2010; O'Shea et al., 2016).
Figure 8 shows kNN regression results for node SO2-02 under lower-SO2 conditions only; these were generated using the same technique used to generate Fig. 5, but with the training and validation sets limited to reference SO2 measurements < 25 ppb. In addition, we found the marginal variance caused by relative humidity was non-negligible when considering measurements at these low concentrations, and thus RH was added as an input to the kNN model. Sensor performance remains good in this case, with an RMSE of 2.9 ppb and r² of 0.788; even between 3 and 25 ppb, relative errors are ∼ 20 %. Across all nine sensors, performance is similar, with a median RMSE of 2.2 (±0.6) ppb and an r² of 0.864 (±0.061) (see Table S1).
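The low-concentration procedure described above (threshold the data on the reference value, add RH as an input, and apply kNN regression) can be sketched as follows. This is a minimal standard-library illustration and not the code used in this work: the record layout, the function names, the choice of k, and all sample values are hypothetical. Scaling the inputs to zero mean and unit variance is an assumption made here so that no single variable dominates the distance metric.

```python
import math

THRESHOLD_PPB = 25.0  # low-concentration cutoff used in the text
FEATURES = ("v_we", "v_ae", "temp_c", "rh")  # RH included as an input


def select_low_concentration(records, threshold=THRESHOLD_PPB):
    """Keep only records whose reference SO2 is below the threshold."""
    return [r for r in records if r["so2_ref"] < threshold]


def fit_scaler(rows):
    """Per-column (mean, standard deviation) of the feature rows."""
    stats = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        stats.append((mean, std if std > 0 else 1.0))
    return stats


def scale(row, stats):
    """Standardize one feature row with previously fitted statistics."""
    return [(v - mean) / std for v, (mean, std) in zip(row, stats)]


def knn_predict(train_x, train_y, query, k=5):
    """Average the targets of the k training points closest to the query."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical records, for illustration only; not data from this study.
records = [
    {"v_we": 0.301, "v_ae": 0.290, "temp_c": 18.0, "rh": 85.0, "so2_ref": 1.0},
    {"v_we": 0.305, "v_ae": 0.291, "temp_c": 19.0, "rh": 82.0, "so2_ref": 8.0},
    {"v_we": 0.310, "v_ae": 0.293, "temp_c": 24.0, "rh": 65.0, "so2_ref": 15.0},
    {"v_we": 0.316, "v_ae": 0.296, "temp_c": 27.0, "rh": 55.0, "so2_ref": 22.0},
    {"v_we": 0.420, "v_ae": 0.295, "temp_c": 26.0, "rh": 58.0, "so2_ref": 310.0},
]

low = select_low_concentration(records)
raw_x = [[r[name] for name in FEATURES] for r in low]
stats = fit_scaler(raw_x)
train_x = [scale(row, stats) for row in raw_x]
train_y = [r["so2_ref"] for r in low]

query = scale([0.308, 0.292, 21.0, 75.0], stats)
estimate = knn_predict(train_x, train_y, query, k=2)
print(f"Estimated SO2: {estimate:.1f} ppb")
```

Because a kNN estimate is an average of training targets, it can never fall outside the range of the training data, which is why restricting the training set to < 25 ppb also restricts the predictions to that range.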
The kNN approach works well when trained on lower-concentration data because it can sufficiently map the nonlinear temperature and relative humidity dependence without needing to determine the functional form of the equation. The improved performance (lower RMSE) of this assessment compared to that of the full dataset (Fig. 5, Table 1) is a result of removing the highest-SO2 points, which contribute substantially to absolute error. Overall, this robust sensor calibration at lower SO2 levels suggests that the sensor calibration approach described here is not limited just to the present environment (which is characterized by very high SO2 levels) and could be applied to a wider range of conditions (e.g., polluted urban areas) as well.

Drift in sensitivity over time
The rate of drift in sensor sensitivity (change in gain) over time is a crucial parameter in sensor characterization, as it determines the interval of calibration, as well as the overall usable lifetime of the sensors. Recent work has shown varying rates of drift, ranging from several days (Smith et al., 2017) to many months (Mead et al., 2013; Popoola et al., 2016). We expect to observe a gradual degradation in sensitivity over time as the electrolyte evaporates; the manufacturer (Alphasense Ltd.) quotes a 50 % decay over 2 years. The long duration of the data collection period (4.5 months) enables us to determine the SO2 sensor drift in the present dataset.

Figure 9. Sensitivity decay for a single SO2-B4 sensor (SO2-02) across 18 weeks. After being trained on data from weeks 2 to 3, the sensor was evaluated using the hybrid regression approach for each successive week of data and fit using ordinary least-squares regression. Slope indicates whether the model was underpredicting (m < 1) or overpredicting (m > 1) SO2 values. A decrease in sensitivity would be seen as a gradual decline in the slope, which is not seen here. Light blue points (weeks 17-21) denote periods during which the RHT sensor was behaving erratically, limiting the amount of useful trusted data.
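To put the manufacturer's quoted figure in context, the sketch below estimates how much sensitivity would remain after 18 weeks if the quoted 50 % decay over 2 years applied uniformly. The functional form of the decay is not stated in the text, so both a linear and an exponential form are shown as assumptions; 2 years is approximated as 104 weeks, and the function names are hypothetical.

```python
QUOTED_DECAY_FRACTION = 0.5   # manufacturer-quoted loss of sensitivity
QUOTED_PERIOD_WEEKS = 104.0   # 2 years, approximated as 104 weeks
ELAPSED_WEEKS = 18.0          # length of the evaluation period


def remaining_linear(weeks):
    """Fraction of sensitivity left, assuming a constant loss per week."""
    return 1.0 - QUOTED_DECAY_FRACTION * weeks / QUOTED_PERIOD_WEEKS


def remaining_exponential(weeks):
    """Fraction of sensitivity left, assuming a constant fractional loss rate."""
    return (1.0 - QUOTED_DECAY_FRACTION) ** (weeks / QUOTED_PERIOD_WEEKS)


for label, model in (("linear", remaining_linear),
                     ("exponential", remaining_exponential)):
    print(f"{label}: {model(ELAPSED_WEEKS):.3f} of initial sensitivity")
```

Under either assumption the expected loss over 18 weeks is roughly 9 % to 11 %, which gives a sense of the size of downward trend in the calibration slope that such a decay would produce.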
To determine the time-dependent change in sensitivity of the electrochemical sensor to its target gas, we perform a linear regression (LR) of the predicted mixing ratios (using the hybrid regression method) against the reference data collected at the AQ station. The hybrid regressor was trained using data from weeks 2 to 3 (10 days total) and then evaluated on all subsequent data. Figure 9 shows the comparison between the calibrated sensor measurements and reference values of SO2 for each week. The slope is ∼ 1 throughout the first 18 weeks after deployment (weeks 4-21) without significant degradation in sensitivity to SO2. The last 5 weeks of data (shown in light blue) should be treated with caution, as the temperature sensor used in the device began to behave erratically, including periods where the temperature sensor reported anomalously high values (> 40 °C; all such data points were excluded from this analysis). Over this 4-month period, there is no evidence of the gradual decay in sensitivity that would indicate evaporation of the electrolyte solution. This indicates that under the present environmental conditions, the SO2 sensor calibration remains stable over a period of at least 4 months, with no need for re-calibration over this time.
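The weekly drift check described above can be sketched as follows: predictions are grouped by week, and an ordinary least-squares line is fit with the reference value on the x axis and the calibrated sensor value on the y axis, so that a slope below 1 indicates underprediction. This is a minimal standard-library illustration; the function names, the tuple layout, and the sample values are hypothetical and are not data from this study.

```python
from collections import defaultdict


def ols_slope_intercept(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx  # requires at least two distinct reference values
    return slope, mean_y - slope * mean_x


def weekly_slopes(samples):
    """Map week -> slope, given (week, reference_ppb, predicted_ppb) tuples."""
    by_week = defaultdict(list)
    for week, ref, pred in samples:
        by_week[week].append((ref, pred))
    slopes = {}
    for week in sorted(by_week):
        refs, preds = zip(*by_week[week])
        slope, _intercept = ols_slope_intercept(refs, preds)
        slopes[week] = slope
    return slopes


# Hypothetical samples, for illustration only.
samples = [
    (4, 0.0, 0.5), (4, 10.0, 10.2), (4, 100.0, 99.0),
    (5, 0.0, 0.2), (5, 10.0, 9.8), (5, 100.0, 101.0),
]
for week, slope in weekly_slopes(samples).items():
    print(f"week {week}: slope = {slope:.3f}")
```

A loss of sensitivity would show up as a sequence of weekly slopes that declines steadily, rather than one that scatters around 1.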

Implications and future work
In this work, we have laid out a general calibration approach for electrochemical sensors based on co-location with reference (regulatory-grade) monitors. This work shows that the complex temperature dependence of electrochemical sensors can be accounted for using nonparametric regression techniques. To overcome the limitations of nonparametric methods, we introduced a new hybrid linear-nonparametric regression scheme that provides the benefits of multiple regression techniques simultaneously and allows for the use of electrochemical sensors in environments for which they have not been previously calibrated. This hybrid approach enables reliable long-term measurements of SO2 across a dynamic range of 1 ppb to 2 ppm with good accuracy (RMSE < 7 ppb) and correlation (r² > 0.99) with the reference monitor. Additionally, we have shown that low-cost electrochemical SO2 sensors can provide acceptable results in lower-SO2 environments, extending their utility to other locales, and that they exhibit little to no sign of sensitivity decay through the first 18 weeks of deployment, suggesting the necessary recalibration interval is on the order of several months (as opposed to weeks or days).
Ideally, calibration by co-location with reference monitors will cover the entire range of conditions (e.g., pollutant levels, temperature) expected to be encountered; however, this is not always possible, especially when using sensors in previously unmeasured conditions and geographic areas. When deciding how large a training dataset is needed, the key quantity to consider is the fraction of total feature space mapped, rather than the total number of measurements taken (or time calibrated). In the present study (which uses the Alphasense SO2-B4 sensor on the island of Hawai'i), this means completely covering the 2-D vector space of SO2 concentration and temperature; for other sensors in other environments, the feature space likely also should include concentrations of relevant cross-sensitive species, such as nitrogen dioxide in the case of ozone sensors (Mueller et al., 2017; Spinelle et al., 2015). Where environmental conditions (T or RH) change very rapidly, the feature space may include the time derivatives of these as well (Masson et al., 2015; Pang et al., 2017).
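One simple way to quantify the fraction of feature space mapped is to divide the (SO2, T) plane into a grid and count the cells that contain at least one training point. The sketch below illustrates this idea only; the bin edges, function names, and sample points are hypothetical choices and are not taken from this work.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Index of the bin [edges[i], edges[i+1]) holding value, else None."""
    if value < edges[0] or value >= edges[-1]:
        return None
    return bisect_right(edges, value) - 1


def coverage_fraction(points, so2_edges, temp_edges):
    """Fraction of (SO2, T) grid cells containing at least one point."""
    occupied = set()
    for so2, temp in points:
        i = bin_index(so2, so2_edges)
        j = bin_index(temp, temp_edges)
        if i is not None and j is not None:
            occupied.add((i, j))
    total_cells = (len(so2_edges) - 1) * (len(temp_edges) - 1)
    return len(occupied) / total_cells


# Hypothetical bin edges: SO2 in ppb (roughly logarithmic), T in degrees C.
SO2_EDGES = [0.0, 1.0, 10.0, 100.0, 1000.0, 2000.0]
TEMP_EDGES = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0]

# Many measurements under one condition versus a few spread across conditions.
clustered = [(5.0, 22.0)] * 1000
spread = [(0.5, 12.0), (5.0, 17.0), (50.0, 22.0), (500.0, 27.0), (1500.0, 32.0)]

print(f"clustered: {coverage_fraction(clustered, SO2_EDGES, TEMP_EDGES):.2f}")
print(f"spread:    {coverage_fraction(spread, SO2_EDGES, TEMP_EDGES):.2f}")
```

In this toy comparison, a thousand measurements taken under one condition cover less of the grid than five measurements spread across conditions, which is the sense in which coverage matters more than the number of points.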
The scope of this work is limited to the measurement of a single pollutant (SO2) by a single make of sensor (Alphasense SO2-B4) in a single environment (characterized by a very wide range in SO2 concentrations, low levels of other pollutants, and relatively little variability in T). It is therefore difficult to generalize the specific results of this work to other pollutants, sensors, and environments. However, the general approaches discussed here (the use of a hybrid linear-nonparametric regression algorithm, the examination of calibrations by limiting the environmental conditions of the training set, and the testing of sensors and algorithms by calibration at one reference site and validation at another) could be applied to other sensor systems as well; sensor characterization in these other conditions is an important area of future research. Such characterization efforts, covering a full range of pollutants (e.g., CO, O3, NO, NO2) and environments (with different pollutant levels, temperature and humidity conditions, etc.), will improve our understanding of the performance and applicability of low-cost AQ sensors for a range of studies in AQ, human health, and atmospheric chemistry.
Data availability. All data can be provided by the author upon request.
Competing interests. The authors declare that they have no conflict of interest.
Disclaimer. The views expressed in this document are solely those of the authors and do not necessarily reflect those of the Agency. EPA does not endorse any products or commercial services mentioned in this work.

Figure 2. Primary components of the custom sensor node used in this work. Each node includes an Alphasense SO2-B4 electrochemical sensor and an RHT sensor embedded in a flow cell with active flow provided by a DC computer fan. Power is provided by a 9 W solar panel coupled to a 4000 mAh battery, and the node communicates with the remote server via the 3G network. Dimensions are 20 cm (L) × 16 cm (W) × 11 cm (H).

Figure 3. Time series of raw 1 min reference SO2 and Alphasense SO2-B4 data for a 4-day period at the beginning of the field campaign at the Hawaii Department of Health AQ station in Pahala. (a) T and RH exhibit a strong inverse correlation in this environment. (b) Working electrode voltage (V_WE, yellow) and reference SO2 mixing ratios (green) correlate strongly. (Data above 500 ppb were excluded to enable a simple visual comparison.) (c) Auxiliary electrode voltage (V_AE), which increases when T is high; under these conditions V_WE and SO2 diverge.

Figure 4. Box-and-whisker plots showing results from the initial spot-check of various algorithms, namely linear regression (LR), least absolute shrinkage and selection operator regression (LASSO), ridge regression (RR), classification and regression tree regression (CART), and k nearest neighbors regression (kNN), to determine their ability to quantify SO2 using three independent variables (V_WE, V_AE, and T). Each box represents the interquartile range, with the whiskers describing the minimum and maximum values. Each algorithm was run on data that were as-is ("unscaled", blue boxes) and normalized by removing the mean and scaling to unit variance ("scaled", green boxes). Results shown are for a single sensor (SO2-02) covering 145 467 data points at 1 min resolution.

Figure 5. Validation results using multivariate linear regression (a, c) and k nearest neighbors regression (b, d). Data are shown as the SO2 measurement by a single sensor (SO2-02) vs. the reference measurement from the AQ station, colored by T. Relative error (a, b) is shown as a function of observed SO2 concentration (the interquartile range is shown as the shaded region). Data shown are for the test set only, made up of 30 % of data collected over the entire measurement period (15 January to 25 April 2017, n = 125 258). Results for other sensors are given in Table 1.

Figure 6. Comparing linear regression, k nearest neighbors regression, and hybrid regression on various subsets of the training data by splitting on arbitrary SO2 thresholds (row 1: < 50 ppb; row 2: < 150 ppb; row 3: < 500 ppb). All models were validated using the entire SO2 and T ranges of the previously withheld validation dataset.

Figure 7. Hybrid regression results for node SO2-04 when trained using data from the Pahala AQ station (2 days) and validated using data from the Hilo AQ station (4 months). Right panel: kernel density estimates showing the distribution of temperature and SO2 used both in the training (Pahala) and validation (Hilo) datasets. The difference in the environmental conditions is likely the cause of the somewhat decreased performance of the sensor calibration at the new site. For comparison, a plot comparing a different sensor (SO2-02), which was trained during these same 2 days and then re-evaluated at the same physical site, is shown in Fig. S1 in the Supplement.

Figure 8. Performance of the sensor (node SO2-02) at lower (< 25 ppb) levels of SO2, evaluated using kNN regression. Data shown are from the validation dataset and yield r² = 0.788 with RMSE = 2.9 ppb. The top plot shows the relative error as a percentage of concentration, where the dark line is the median value and the shaded region is the interquartile range. Other sensors evaluated at 0-25 ppb showed even better performance (median r²: 0.864; median RMSE: 2.1 ppb); results are given in Table S1.

Table 1. Summary of calibration results for all sensors deployed in this study.