Wireless low-cost particulate matter sensor networks
(WLPMSNs) are transforming air quality monitoring by providing particulate matter (PM)
information at finer spatial and temporal resolutions. However, large-scale WLPMSN calibration and maintenance remain a challenge. The manual labor involved in initial calibration by collocation and routine recalibration is intensive. The transferability of the calibration models determined from initial collocation to new deployment sites is questionable, as calibration factors typically vary with the
urban heterogeneity of operating conditions and aerosol optical properties. Furthermore, low-cost sensor performance can drift or degrade over time. This study presents a simultaneous
Gaussian process regression (GPR) and simple linear regression pipeline to
calibrate and monitor dense WLPMSNs on the fly by leveraging all available
reference monitors across an area without resorting to pre-deployment
collocation calibration. We evaluated our method for Delhi, where the
PM

Low-cost air quality (AQ) sensors that report high time resolution data
(e.g.,

On the downside, researchers have been plagued by calibration-related issues since the emergence of low-cost AQ sensors. One common brute-force solution is initial calibration by collocation with reference analyzers before field deployment, followed by routine recalibration. Yet the transferability of these pre-determined calibrations from collocation sites to new deployment sites is questionable, as calibration factors typically vary with operating conditions such as PM mass concentrations, relative humidity (RH), temperature, and aerosol optical properties (Holstius et al., 2014; Austin et al., 2015; Wang et al., 2015; Lewis and Edwards, 2016; Crilley et al., 2018; Jayaratne et al., 2018; Zheng et al., 2018). Complicating this further, the pre-generated calibration curves may only apply for a short term, as low-cost sensors can drift or degrade over time (Lewis and Edwards, 2016; Jiao et al., 2016; Hagler et al., 2018). Routine recalibrations, which require frequent transit of the deployed sensors between the field and the reference sites, are not only too labor intensive for a large-scale network but also still cannot address the impact of urban heterogeneity of ambient conditions on calibration models (Kizel et al., 2018).

As such, calibrating sensors on the fly while they are deployed in the field is highly desirable. Takruri et al. (2009) showed that the interacting multiple model (IMM) algorithm combined with the support vector regression (SVR)–unscented Kalman filter (UKF) can automatically and successfully detect and correct low-cost sensor measurement errors in the field; however, the implementation of this algorithm still requires pre-deployment calibrations. Fishbain and Moreno-Centeno (2016) designed a self-calibration strategy for low-cost nodes with no need for collocation by exploiting the raw signal differences between all possible pairs of nodes. The learned calibrated measurements are the vectors whose pairwise differences are closest, in the normalized projected Cook–Kress (NPCK) distance, to the corresponding pairwise raw signal differences given all possible pairs over all time steps. However, this strategy did not include reference measurements in the self-calibration procedure, and therefore the tuned measurements were still essentially raw signals (although instrument noise was dampened). An alternative method involves chain calibration of the low-cost nodes in the field, with only the first node calibrated by collocation with reference analyzers and the remaining nodes calibrated sequentially, each by its previous node along the chain (Kizel et al., 2018). While this node-to-node calibration procedure proved its merits in reducing collocation burden and data loss during calibration, relocation, and recalibration, and in accommodating the influence of urban heterogeneity on calibration models, it is only suitable for relatively small networks because calibration errors propagate through the chain and can inflate toward the end of a long chain (Kizel et al., 2018).

In this paper, we introduce a simultaneous Gaussian process regression (GPR)
and simple linear regression pipeline to calibrate PM

quantifying experimentally the daily performance of our dynamic calibration model in Delhi during the winter season, based on the model's prediction accuracy on holdout reference nodes during leave-one-out cross-validations (CVs) and on the low-cost node calibration accuracy;

revealing the potential pitfalls of employing a dynamic calibration algorithm;

examining the sensitivity of our algorithm to the training data size and its feasibility for dynamic calibration;

demonstrating the ability of our algorithm to auto-detect faulty nodes and auto-correct the drift of nodes within a network via computational simulation, and therefore the practicality of adapting our algorithm for automated large-scale sensor network monitoring; and

studying computationally the optimal number of reference stations across Delhi to support our technique and the usefulness of low-cost sensors for improving the spatial precision of a sensor network.

The low-cost packages used in the present study (dubbed “Atmos”) shown in
Fig. 1a were developed by Respirer Living Sciences (

The Atmos network's server architecture was also developed by Respirer Living Sciences and built on the following open-source components: KairosDB as the primary fast, scalable time series database built on Apache Cassandra; custom-made Java libraries for ingesting data and for providing XML-, JSON-, and CSV-based access to aggregated time series data; HTML5 and JavaScript for the front-end dashboard; and LeafletJS for visualizing Atmos networks on maps.

Delhi PM sensor network sites along with the 1 h percentage data
completeness with respect to the entire sampling period (i.e., from 1 January 2018 00:00 to 31 March 2018 23:59, Indian standard time, IST; in total
90 d, 2160 h) before and after 1 h missing-data imputation for each
individual site. Note that a 10 % increase in the percentage data
completeness after 1 h missing-data imputation is equivalent to

Locations of the 22 reference nodes (triangles with italic text) and 10 low-cost nodes (circles) that form the Delhi PM sensor network. ©OpenStreetMap contributors 2019. Distributed under a Creative Commons BY-SA License.

Hourly ground-level PM

Hourly uncalibrated PM

The flow diagram illustrating the simultaneous GPR and simple
linear regression calibration algorithm. In step one, for each of the
22-fold leave-one-out CVs, one of the 22 reference nodes is held out of
modeling for the model predictive performance evaluation in step seven. In
step two, fit a simple linear regression model between each low-cost node

The simultaneous GPR and simple linear regression calibration algorithm is introduced here as Algorithm 1. The critical steps of the algorithm are linked to the subsections where the respective details can be found. Complementing Algorithm 1, a flow diagram illustrating the algorithm is given in Fig. 3.

Because the true calibration factors for the low-cost nodes are not known
beforehand, a leave-one-out CV approach (i.e., holding one of the 22
reference nodes out of modeling each run for model predictive performance
evaluation) was adopted as a surrogate to estimate our proposed model
accuracy of calibrating the low-cost nodes. For each of the 22 CV folds, 31
node locations (denoted
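The leave-one-out scheme over the reference nodes can be sketched as follows; the array shapes and the fit/predict helpers are illustrative placeholders, not the study's actual code:

```python
import numpy as np

def loo_cv_errors(ref_values, ref_coords, fit_fn, predict_fn):
    """Hold out each reference node in turn; return per-node percent errors.

    ref_values: (n_nodes, n_days) daily PM averages at the reference nodes
    ref_coords: (n_nodes, 2) node coordinates (e.g., longitude/latitude)
    """
    n_nodes = ref_values.shape[0]
    errors = np.empty(n_nodes)
    for k in range(n_nodes):                     # hold out node k
        train = np.delete(np.arange(n_nodes), k)
        model = fit_fn(ref_coords[train], ref_values[train])
        pred = predict_fn(model, ref_coords[k])  # interpolate at held-out site
        truth = ref_values[k]
        errors[k] = 100 * np.nanmean(np.abs(pred - truth) / truth)
    return errors
```

With 22 reference nodes this yields one percent-error score per fold, which is then averaged to summarize model performance.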

For simplicity's sake, the kernel function was set to a squared exponential
(SE) covariance term to capture the spatially correlated signals coupled
with another component to constrain the independent noise (Rasmussen and
Williams, 2006):
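As a hedged illustration, the SE-plus-independent-noise covariance described above can be written as follows; the hyperparameter names (signal variance, length scale, noise variance) follow the standard parameterization in Rasmussen and Williams (2006), though the paper's exact form may differ:

```python
import numpy as np

def se_kernel(X1, X2, signal_var, length_scale):
    """k(x, x') = sigma_f^2 * exp(-||x - x'||^2 / (2 * l^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def covariance(X, signal_var, length_scale, noise_var):
    """SE covariance plus an independent-noise term on the diagonal."""
    return se_kernel(X, X, signal_var, length_scale) + noise_var * np.eye(len(X))
```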

What separates our method from standard GP applications is the simultaneous
incorporation of calibration for the low-cost nodes using a simple linear
regression model into the spatial model. Linear regression has previously
been shown to be effective at calibrating PM sensors (Zheng et al., 2018).
Linear regression was first used to initialize low-cost nodes' calibrations
(step two in Fig. 3). In this step, each low-cost node
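A minimal sketch of fitting a node's intercept and slope by ordinary least squares; the regression targets here are generic placeholders, and the actual initialization targets are those described in the text:

```python
import numpy as np

def fit_linear_calibration(raw, target):
    """Least-squares intercept and slope so that target ~ intercept + slope * raw."""
    A = np.column_stack([np.ones_like(raw), raw])   # design matrix [1, raw]
    intercept, slope = np.linalg.lstsq(A, target, rcond=None)[0]
    return intercept, slope
```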

In the next step (step three in Fig. 3), a GPR model was fit to each day

Once the optimum

Iterative optimizations alternated between the GPR hyperparameters and the
low-cost node calibrations using the approaches described in Sect. 2.3.3 and
2.3.4, respectively (Fig. 3 steps five and six, respectively), until the GPR
parameters
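The alternating optimization can be sketched schematically as below; the scalar parameters and update functions are stand-ins for the full GPR hyperparameter fit and the low-cost node calibration fit:

```python
def alternate_until_converged(theta, calib, update_theta, update_calib,
                              tol=1e-3, max_iter=50):
    """Alternate two coupled updates until both stop changing (within tol)."""
    for _ in range(max_iter):
        theta_new = update_theta(calib)        # refit GPR given calibrations
        calib_new = update_calib(theta_new)    # refit calibrations given GPR
        if abs(theta_new - theta) < tol and abs(calib_new - calib) < tol:
            return theta_new, calib_new
        theta, calib = theta_new, calib_new
    return theta, calib
```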

The final GPR was used to predict the 59 d PM

Figure 4a presents the box plot of the daily averaged PM

The optimum values of the GPR model parameters including the signal variance
(

Summary of the GPR model 24 h performance scores (including RMSE
and percent error) for predicting the measurements of the 22 holdout
reference nodes across the 22-fold leave-one-out CV when the full sensor
network is used. The mean of the true ambient PM

We start by showing the accuracy of model prediction on the 22 reference
nodes using leave-one-out CV (when the low-cost node measurements were
included in our spatial prediction). Without any prior knowledge of the true
calibration factors for the low-cost nodes, the holdout reference node
prediction accuracy is a statistically sound proxy for estimating how well
our technique can calibrate the low-cost nodes. The performance scores
(including RMSE and percent error) for each reference station sorted by the
3-month mean PM

In this paper, we interpolated the missing 1 h PM
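As a hedged illustration of 1 h missing-data imputation, simple linear-in-time interpolation with pandas might look like this; the study's exact imputation scheme may differ:

```python
import numpy as np
import pandas as pd

# Hourly series with two missing values between valid observations
idx = pd.date_range("2018-01-01 00:00", periods=5, freq="h")
pm = pd.Series([80.0, np.nan, np.nan, 95.0, 100.0], index=idx)

# Linear-in-time interpolation of the internal gaps
filled = pm.interpolate(method="time")
```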

Box plots of the GPR model 24 h performance scores (including RMSE and percent error) for predicting the measurements of the 22 holdout reference nodes across the 22-fold leave-one-out CV under two scenarios: using the full sensor network by including both reference and low-cost nodes, and using only the reference nodes for the model construction. Note that both scenarios were given the initial parameter values and bounds that maximize the model performance.

It is of particular interest to validate the value of establishing a relatively dense wireless sensor network in Delhi by examining whether the addition of the low-cost nodes can truly lend a performance boost to the spatial interpolation among sensor locations. We juxtapose the interpolation performance using the full sensor network (including both the reference and low-cost nodes) with that using only the reference nodes in Fig. 5. In this context, the unnormalized RMSE is less representative than the percent error of the model interpolation performance because of the unequal numbers of overlapping 24 h observations for all the nodes (59 data points) and for only the reference nodes (87 data points). The comparison revealed that the inclusion of the 10 low-cost devices on top of the regulatory grade monitors can reduce the mean and median interpolation error by roughly 2 %. While this is only a marginal improvement with 10 low-cost nodes in the network, the outcome hints that densely deployed low-cost nodes hold great promise for significantly decreasing the amount of pure interpolation among sensor locations, thereby benefiting the spatial precision of a network. We explore the significance of the low-cost nodes for network performance further in Sect. 3.3.3.

Next we describe the technique's accuracy of low-cost node calibration. The model-produced calibration factors are shown in Fig. 6. The intercepts and slopes for each unique low-cost device varied little across the 22 CV folds, underscoring the stability of the GPR model. The values of these calibration factors resemble those obtained in previous field work, with slopes comparable to the South Coast Air Quality Management District's evaluations of the Plantower PMS models (SCAQMD, 2017a–c) and intercepts comparable to our post-monsoon study in Kanpur, India (Zheng et al., 2018).

Box plots of the learned calibration factors (i.e., intercept and slope) for each individual low-cost node from the 22 optimized GPR models across the 22-fold leave-one-out CV.

Correlation plots comparing the GPR model-calibrated low-cost node
PM

Two low-cost nodes (i.e., MRU and IITD) were collocated with two E-BAMs
throughout the entire study. This allows us to take their model-derived
calibration factors and calibrate the corresponding raw values of the
low-cost nodes before computing the calibration accuracy based on the ground
truth (i.e., E-BAM measurements). Figure 7a and b show the scatterplots of
the collocated E-BAM measurements against the model-calibrated low-cost
nodes at the MRU and the IITD sites, respectively. The two sites had
similarly large calibration errors (

So far, the optimization of both GPR model hyperparameters and the linear
regression calibration factors for the low-cost nodes has been carried out
over the entire sampling period using all 59 available daily averaged data
points. It is of critical importance to examine the effect of time history
on the algorithm, by analyzing how sensitive the model performance is to
training window size. We tracked the change in model performance as
training data were added in increments of 2 d. The model
performance was measured by the mean accuracy of model prediction on the 22
reference nodes (within the time period of the training window) using
leave-one-out CV, as described in Sect. 3.2.1. Figure 8 illustrates that,
throughout the 59 d, the error rate and the standard error of the mean
(SEM) remained surprisingly consistent at

The mean percent error rate of GPR model prediction on the 22 reference nodes using leave-one-out CV (see Sect. 3.2.1) as a function of training window size in increments of 2 d. The error bars represent the standard error of the mean (SEM) of the GPR prediction errors of the 22 reference nodes.

The stationary model performance in response to the increase of training
data hints that using our method for dynamic calibration or prediction is
feasible. We assessed the algorithm's 1-week-ahead prediction performance
by using simple linear regression calibration factors and GPR
hyperparameters that were optimized from one week to calibrate the 10
low-cost nodes and predict each of the 22 reference nodes, respectively, in
the next week. For example, the first, second, third…, and seventh weeks' data
were used as training data to build GPR models and simple linear regression
models. These simple linear regression models were then used to calibrate
the low-cost nodes in the second, third, fourth…, and eighth weeks, followed
by the GPR models to predict each of the 22 reference nodes in that week.
The performance was still measured by the mean accuracy of model prediction
on the 22 reference nodes using leave-one-out CV, as described in Sect. 3.2.1. We found similarly stable 26 %–34 % dynamic calibration error rates
and
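The rolling 1-week-ahead evaluation can be sketched as follows; fit_week and apply_week are placeholders for the GPR/linear-regression training step and the next-week calibration-and-prediction step:

```python
def week_ahead_errors(weekly_data, fit_week, apply_week):
    """Train on week w, evaluate on week w+1, for every consecutive pair."""
    errors = []
    for w in range(len(weekly_data) - 1):
        params = fit_week(weekly_data[w])                      # train on week w
        errors.append(apply_week(params, weekly_data[w + 1]))  # test on week w+1
    return errors
```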

We attempted RH adjustment to the algorithm by incorporating an RH term in
the linear regression models, where the RH values were the measurements from
each corresponding low-cost sensor package's embedded Adafruit DHT22 RH and
temperature sensor. However, there was no improvement in the algorithm's
accuracy after RH correction. A plausible explanation involves the
infrequent high-RH conditions during the winter months in Delhi and the
stronger smoothing effects at longer averaging intervals (i.e., 24 h).
Our previous work (Zheng et al., 2018) suggested that the PMS3003 PM

Additionally, while our algorithm was analyzed over the 59 available days in
this study, the daily averaged temperature and RH measurements for the
entire sampling period (i.e., from 1 January to 31 March 2018, 90 d)
were statistically the same as those for the 59 d. To support this
statement, we conducted the Wilcoxon rank-sum test, also called Mann–Whitney
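The rank-sum comparison can be reproduced in outline with SciPy; the series below are synthetic stand-ins for the daily temperature and RH data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
full_period = rng.normal(20, 3, 90)   # e.g., 90 daily mean temperatures
subset = full_period[:59]             # the 59 analyzed days

# Wilcoxon rank-sum / Mann-Whitney U test of the two samples;
# a large p-value indicates no evidence the distributions differ
stat, p = mannwhitneyu(full_period, subset, alternative="two-sided")
```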

While the exact values of the calibration factors derived from the GPR model
fell short of faithfully recovering the original picture of PM

Learned calibration factors for each individual low-cost node from
the optimized GPR models by replacing measurements of all

One way to simulate low-cost node failure, or heavy influence of local
sources, is to replace the nodes' true signals with values from
random number generators so that the inherent spatial correlations are
corrupted. In this study, we simulated how the model-produced calibration
factors change when all (10), nine, seven, three, and one of the low-cost
nodes within the network malfunction or are subject to strong local
disturbance. We have three major observations from evaluating the simulation
results (Figs. 9 and S5). First, the normal calibration factors are
quite distinct from those of the low-cost nodes with random signals.
Compared to the normal values (see Fig. 9f), those of
the low-cost nodes with random signals have slopes close to 0 and intercepts
close to the Delhi-wide mean of true PM
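The failure simulation described above might be sketched as follows, with illustrative uniform-noise bounds; replacing selected nodes' signals with random draws destroys their spatial correlation with the rest of the network:

```python
import numpy as np

def corrupt_nodes(signals, node_idx, low, high, seed=0):
    """Return a copy of signals with the given nodes replaced by uniform noise.

    signals: (n_nodes, n_days) raw low-cost node time series
    node_idx: indices of the nodes to 'fail'
    """
    rng = np.random.default_rng(seed)
    out = signals.copy()
    out[node_idx] = rng.uniform(low, high, size=(len(node_idx), signals.shape[1]))
    return out
```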

We further investigated the feasibility of applying the GPR model to track
the drift of low-cost nodes accurately over time. We simulated drift
conditions by first setting random percentages of intercept and slope drift,
respectively, for each individual low-cost node and for each simulation run.
Next, we adjusted the signals of each low-cost node over the entire study
period given these randomly selected percentages using Eq. (11). Then, we
rebuilt a GPR model based on these drift-adjusted signals and evaluated if
the new model-generated calibration factors matched our expected
predetermined percentage drift relative to the true (baseline) calibration
factors.
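An illustrative version of the drift-injection step (cf. Eq. 11; the paper's exact formulation may differ): each node's raw signal is re-expressed so that its effective intercept and slope drift by the chosen fractions.

```python
import numpy as np

def apply_drift(raw, intercept, slope, d_int, d_slope):
    """Adjust raw so the effective calibration drifts by fractions d_int, d_slope."""
    drifted_int = intercept * (1 + d_int)
    drifted_slope = slope * (1 + d_slope)
    true = intercept + slope * raw               # truth under original calibration
    return (true - drifted_int) / drifted_slope  # raw signal under drifted calibration
```

Applying the drifted calibration factors to the drifted raw signal recovers the original truth, which is what lets the rebuilt GPR model estimate the injected drift percentages.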

Comparison of predetermined percentages of drift to those estimated from the GPR model for intercept and slope, respectively, for each individual low-cost node, assuming all (10), six, and two of the low-cost nodes developed various degrees of drift: significant (11 %–99 %), marginal (1 %–10 %), and a balanced mixture of significant and marginal. Note that the sensors that drifted, the percentages of drift, and which sensors drifted significantly or marginally were randomly chosen. The results reported under each scenario are based on averages of 10 simulation runs.

The performance of the model for predicting the drift was examined under a variety of scenarios, assuming that all (10), eight, six, four, and two of the low-cost nodes developed various degrees of drift: significant (11 %–99 %), marginal (1 %–10 %), and a balanced mixture of significant and marginal. The testing results for 10, six, and two low-cost nodes are displayed in Table 3, and those for eight and four nodes are in Table S2. Overall, the model demonstrates excellent drift predictive power, with less than 4 % error for all the simulation scenarios. The model proves to be most accurate (within 1 % error) when the low-cost nodes drifted only marginally, regardless of the number of drifting nodes. In contrast, significant drifts, and particularly a mixture of significant and marginal drifts, might lead to marginally larger errors. We also notice that the intercept drifts are slightly harder to capture accurately than the slope drifts. As in the simulation of low-cost node failure or strong local impact described in Sect. 3.3.1, the performance of the model for predicting the measurements of the 22 holdout reference nodes across the 22-fold leave-one-out CV was unaffected by the drift conditions (see Fig. S6). This unaltered performance is attributable to the fact that the drift simulations only involve simple linear transformations, as shown in Eq. (11). The high-quality drift estimation therefore presents another convincing case for applying our algorithm to dynamically monitor dense sensor networks, as a by-product of calibrating the low-cost nodes.

It should be noted that the mode of drift (linear or random) will not significantly affect our simulation results. As we demonstrated in Sect. 3.2.3, the performance of our algorithm is insensitive to the training data size, and we believe that models with similar prediction accuracy should have similar drift detection power. For example, if the prediction accuracy of the model trained on 59 d of data is virtually the same as that of the model trained on 2 d of data, and if the model trained on 59 d is able to detect the simulated drift, then so should the model trained on 2 d. If we reasonably assume that the drift rate remains roughly unchanged within a 2 d window, then the drift mode (linear or random), which only dictates how the drift rate changes (usually smoothly) between adjacent discrete 2 d windows, no longer matters. All that matters is tracking that one fixed drift rate reasonably well within those 2 d, which is virtually what we already did with the entire 59 d of data.

Average 24 h percent errors of the GPR model for predicting the
holdout reference nodes in the network as a function of the number of
reference stations used for the model construction under two scenarios –
using the full sensor network information by including both reference and
low-cost nodes and using only the reference nodes for the model
construction. Note each data point (mean value) is derived from 100
simulation runs. The error bars indicating 95 % confidence interval (CI) of the means are based
on 1000 bootstrap iterations. All scenarios were given the initial parameter
values and bounds that maximize the model performance. The

Two points remain unaddressed: (1) what the optimum or minimum number of reference instruments is to sustain this technique, and (2) whether the inclusion of low-cost nodes can effectively help lower the technique's calibration or mapping inaccuracy. It is interesting to note that optimizing the model's calibration accuracy not only directly fulfills the fundamental calibration task but also enhances the sensor network's monitoring capability as an added bonus. To address these two outstanding issues, we randomly sampled subsets of the 22 reference nodes within the network in increments of one node (i.e., from 1 to 21 nodes) and implemented our algorithm with and without incorporating the low-cost nodes, before finally computing the mean percent errors in predicting all the holdout reference nodes. To estimate the performance scores as accurately as possible without incurring excessive computational cost, the sampling was repeated 100 times for each subset size. The calibration error in this section was defined as the mean percent error in predicting all the holdout reference nodes, further averaged over the 100 simulation runs for each subset size.
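The subset-sampling experiment can be sketched as follows; evaluate_fn is a placeholder for one full model fit-and-holdout-evaluation run on the chosen reference nodes:

```python
import numpy as np

def subset_error_curve(n_ref, evaluate_fn, n_repeats=100, seed=0):
    """Mean holdout error as a function of reference-subset size (1..n_ref-1)."""
    rng = np.random.default_rng(seed)
    curve = []
    for size in range(1, n_ref):
        errs = [evaluate_fn(rng.choice(n_ref, size, replace=False))
                for _ in range(n_repeats)]       # resample subsets of this size
        curve.append(np.mean(errs))
    return np.array(curve)
```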

Figure 10 describes the 24 h calibration percent error rate of the model as
a function of the number of reference stations used for modeling with and
without involving the low-cost nodes. The error rates generally decrease as
the number of reference instruments increases (full network: from

Lastly, we used the Wilcoxon rank-sum test (Mann–Whitney

This study introduced a simultaneous GPR and simple linear regression
pipeline to calibrate wireless low-cost PM sensor networks of any scale
on the fly in the field by capitalizing on all available reference monitors
across an area without the requirement of pre-deployment collocation
calibration. We evaluated our method for Delhi, where 22 reference and 10
low-cost nodes were available from 1 January to 31 March 2018
(Delhi-wide average of the 3-month mean PM

Two directions are possible for our future work. The first one is to expand
both the longitudinal and the cross-sectional scopes of field studies and
examine how well our solution works for more extensive networks in a larger
geographical area over longer periods of deployment (when sensors are
expected to actually drift, degrade, or malfunction). This will enable us to
validate the practical use of our method for calibration and surveillance
more confidently. The second is to explore the infusion of information about
urban PM

The data are available upon request to Tongshu Zheng (tongshu.zheng@duke.edu).

The supplement related to this article is available online at:

TZ, MHB, and DEC designed the study. DEC and TZ participated in the algorithm development. TZ wrote the paper, coded the algorithm, and performed the analyses and simulations. MHB and DEC provided guidance on analyses and simulations and assisted in writing and revising the paper. RS and SNT established, maintained, and collected data from the low-cost sensor network and the two E-BAM sites. TZ collected data from all the regulatory air quality monitoring stations in Delhi. RC provided funding and technical support for the project.

Author Ronak Sutaria is the founder of Respirer Living Sciences Pvt. Ltd, a start-up based in Mumbai, India, which is the developer of the Atmos low-cost AQ monitor. Ronak Sutaria was involved in developing and refining the Atmos hardware, its server, and its dashboard, and in deploying the sensors, but was not involved in the data analysis. Author Robert Caldow is the director of engineering at TSI and is responsible for the funding and technical support but was not involved in the data analysis.

The authors would like to thank CPCB, DPCC, IMD, SPCBs, and AirNow DOS (Department of State) for providing the Delhi 1 h reference PM

This research has been supported under the Research Initiative for Real-time River Water and Air Quality Monitoring program funded by the Department of Science and Technology, Government of India and Intel^{®}.

This paper was edited by Francis Pope and reviewed by three anonymous referees.