Evaluation of linear regression techniques for 1 atmospheric applications : The importance of 2 appropriate weighting 3

Abstract. Linear regression techniques are widely used in atmospheric science, but they are
often improperly applied due to lack of consideration or inappropriate
handling of measurement uncertainty. In this work, numerical experiments are
performed to evaluate the performance of five linear regression techniques,
significantly extending previous works by Chu and Saylor. The five techniques
are ordinary least squares (OLS), Deming regression (DR), orthogonal distance
regression (ODR), weighted ODR (WODR), and York regression (YR). We first
introduce a new data generation scheme that employs the Mersenne twister (MT)
pseudorandom number generator. The numerical simulations are also improved
by (a) refining the parameterization of nonlinear measurement
uncertainties, (b) inclusion of a linear measurement uncertainty, and (c) inclusion of WODR for comparison. Results show that DR, WODR and YR produce
an accurate slope, but the intercept by WODR and YR is overestimated and the
degree of bias is more pronounced with a low R2 XY dataset. The
importance of a properly weighting parameter λ in DR is investigated
by sensitivity tests, and it is found that an improper λ in DR can
lead to a bias in both the slope and intercept estimation. Because the
 λ calculation depends on the actual form of the measurement error,
it is essential to determine the exact form of measurement error in the XY 
data during the measurement stage. If a priori error in one of the variables
is unknown, or the measurement error described cannot be trusted, DR, WODR
and YR can provide the least biases in slope and intercept among all tested
regression techniques. For these reasons, DR, WODR and YR are recommended for
atmospheric studies when both X and Y data have measurement errors. An Igor
Pro-based program (Scatter Plot) was developed to facilitate the
implementation of error-in-variables regressions.


1234
C. Wu and J. Z. Yu: Evaluation of linear regression techniques parison between modeling and measurement (Petäjä et al., 2009), emission factor study (Janhäll et al., 2010), retrieval of shortwave cloud forcing (Cess et al., 1995), calculation of pollutant growth rate (Richter et al., 2005), estimation of ground-level PM 2.5 from MODIS data (Wang and Christopher, 2003), distinguishing OC origin from biomass burning using K + as a tracer (Duan et al., 2004) and emission type identification by the EC / CO ratio (Chen et al., 2001).
Ordinary least squares (OLS) regression is the most widely used method due to its simplicity.In OLS, it is assumed that independent variables are error-free.This is the case for certain applications, such as determining a calibration curve of an instrument in analytical chemistry.For example, a known amount of analyte (e.g., through weighing) can be used to calibrate the instrument output response (e.g., voltage).However, in many other applications, such as inter-instrument comparison, X and Y (from two instruments) may have comparable degrees of uncertainty.This deviation from the underlying assumption in OLS would produce biased slope and intercept when OLS is applied to the dataset.
To overcome the drawback of OLS, a number of error-invariables regression models (also known as bivariate fittings; Cantrell, 2008) or total least-squares methods (Markovsky and Van Huffel, 2007) arise.Deming (1943) proposed an approach by minimizing sum of squares of X and Y residuals.A closed-form solution of Deming regression (DR) was provided by York (1966).Method comparison work of various regression techniques by Cornbleet and Gochman (1979) found significant error in OLS slope estimation when the relative standard deviation (RSD) of measurement error in "X" exceeded 20 %, while DR was found to reach a more accurate slope estimation.In an early application of the EC tracer method, Turpin and Huntzicker (1995) realized the limitation of OLS since OC and EC have comparable measurement uncertainty and thus recommended the use of DR for (OC / EC) pri (primary OC to EC ratio) estimation.Ayers (2001) conducted a simple numerical experiment and concluded that reduced major axis regression (RMA) is more suitable for air quality data regression analysis.Linnet (1999) pointed out that when applying DR for inter-method (or interinstrument) comparison, special attention should be paid to the sample size.If the range ratio (max / min) is relatively small (e.g., less than 2), more samples are needed to obtain statistically significant results.
In principle, a best-fit regression line should have greater dependence on the more precise data points rather than the less reliable ones.Chu (2005) performed a comparison study of OLS and DR specifically focusing on the EC tracer method application and found that the slope estimated by DR is closer to the correct value than OLS but may still overestimate the ideal value.Saylor et al. (2006) extended the comparison work of Chu (2005) by including a regression technique developed by York et al. (2004).They found that the slope overestimation by DR in the study of Chu (2005) was due to improper configuration of the weighting parame-ter, λ.This λ value is the key to handling the uneven errors between data points for the best-fit line calculation.This example demonstrates the importance of appropriate weighting in the calculation of best-fit line for error-in-variables regression model, which is overlooked in many studies.
In this study, we extend the work by Saylor et al. (2006) to achieve four objectives.The first is to propose a new data generation scheme by applying the Mersenne twister (MT) pseudorandom number generator for evaluation of linear regression techniques.In the study of Chu (2005), data generation is achieved by a varietal sine function, which has limitations in sample size, sample distribution, and nonadjustable correlation (R 2 ) between X and Y .In comparison, the MT data generation provides more flexibility, permitting adjustable sample size, XY correlation and distribution.The second is to develop a nonlinear measurement error parameterization scheme for use in the regression method.The third is to incorporate linear measurement errors in the regression methods.In the work by Chu (2005) and Saylor et al. (2006), the relative measurement uncertainty (γ Unc ) is nonlinear with concentration, but a constant γ Unc is often applied on atmospheric instruments due to its simplicity.The fourth is to include weighted orthogonal distance regression (WODR) for comparison.Abbreviations and symbols used in this study are summarized in Table B1 for quick reference.

Description of regression techniques compared in this study
Ordinary least squares (OLS) method OLS only considers the errors in dependent variables (Y ).OLS regression is achieved by minimizing the sum of squares (S) in the Y residuals (i.e., distance of AB in Fig. S1 in the Supplement): where Y i are observed Y data points, while y i are regressed Y data points of the regression line.N represents the number of data points that is used for regression.

Orthogonal distance regression (ODR)
ODR minimizes the sum of the squared orthogonal distances from all data points to the regressed line and considers equal error variances (i.e., distance of AC in Fig. S1): (2)

Weighted orthogonal distance regression (WODR)
Unlike ODR, which considers even error in X and Y , weightings based on measurement errors in both X and Y are con- sidered in WODR when minimizing the sum of squared orthogonal distance from the data points to the regression line (Carroll and Ruppert, 1996) as shown by AD in Fig. S1: where η is the error variance ratio, which determines the angle θ shown in Fig. S1.Implementation of ODR and WODR in Igor Pro (WaveMetrics, Inc.Lake Oswego, OR, USA) was done by the computer routine ODRPACK95 (Boggs et al., 1989;Zwolak et al., 2007).

Deming regression (DR)
Deming (1943) proposed the following function to minimize both the X and Y residuals as shown by AD in Fig. S1, where X i and Y i are observed data points and x i and y i are regressed data points.Individual data points are weighted based on errors in X i and Y i , where σ X i and σ Y i are the standard deviation of the error in measurement of X i and Y i , respectively.The closed-form solutions for slope and intercept of DR are shown in Appendix A.

York regression (YR)
The York method (York et al., 2004) introduces the correlation coefficient of errors in X and Y into the minimization function.
where r i is the correlation coefficient between measurement errors in X i and Y i .The slope and intercept of YR are calculated iteratively through the formulas in Appendix A. Summary of the five regression techniques is given in Table S1 in the Supplement.It is worth noting that OLS and DR have closed-form expressions for calculating slope and intercept.In contrast, ODR, WODR and YR need to be solved iteratively.This need to be taken into consideration when choosing regression algorithm for handling huge numbers of data.
A computer program (Scatter Plot; Wu, 2017a) with a graphical user interface (GUI) in Igor Pro (WaveMetrics, Inc.Lake Oswego, OR, USA) was developed to facilitate the implementation of error-in-variables regression (including DR, WODR and YR).Two other Igor Pro-based computer programs, Histbox (Wu, 2017b) and Aethalometer data processor (Wu, 2017c), are used for data analysis and visualization in this study.

Data description
Two types of data are used for regression comparison.The first type is synthetic data generated by computer programs, which can be used in the EC tracer method (Turpin and Huntzicker, 1995) to demonstrate the regression application.The true "slope" and "intercept" are assigned during data generation, allowing quantitative comparison of the bias of each regression scheme.The second type of data comes from ambient measurement of light absorption, OC and EC in Guangzhou for demonstration in a real-world application.

Synthetic XY data generation
In this study, numerical simulations are conducted in Igor Pro (WaveMetrics, Inc.Lake Oswego, OR, USA) through custom codes.Two types of generation schemes are employed: one is based on the MT pseudorandom number generator (Matsumoto and Nishimura, 1998) and the other is based on the sine function described by Chu (2005).
The general form of linear regression on XY data can be written as where k is the regressed slope and b is the intercept.The underlying meaning is that, Y can be decomposed into two parts.One part is correlated with X, and the ratio is defined by k.The other part of Y is constant and independent of X and regarded as b.
To make the discussion easier to follow, we intentionally avoid discussion using the abstract general form and instead opt to use a real-world application case in atmospheric science.Linear regression had been heavily applied on OC and EC data, here we use OC and EC data as an example to demonstrate the regression application in atmospheric science.In the EC tracer method, OC (mixture) is Y and EC (tracer) is X.OC can be decomposed into three components based on their formation pathway: where POC comb is primary OC from combustion.POC non-comb is primary OC emitted from non-combustion activities.SOC is secondary OC formed during atmospheric aging.Since POC comb is co-emitted with EC and well correlated with each other, their relationship can be parameterized as

Parameterization of synthesized measurement uncertainty
Weighting of variables is a crucial input for errors-invariables linear regression methods such as DR, YR and WODR.In practice, the weights are usually defined as the inverse of the measurement error variance (Eq.5).When measurement errors are considered, measured concentrations (Conc.measured ) are simulated by adding measurement uncertainties (ε Conc. ) to the true concentrations (Conc.true ): where ε Conc. is the random error following an even distribution with an average of 0, the range of which is constrained by The γ Unc is a dimensionless factor that describes the fractional measurement uncertainty relative to the true concentration (Conc.true ).γ Unc could be a function of Conc.true (Thompson, 1988) or a constant.The term γ Unc × Conc.true defines the boundary of random measurement errors.Two types of measurement error are considered in this study.The first type is γ Unc−nonlinear .In the data generation scheme of Chu (2005) for the measurement uncertainties (ε POC and ε EC ), γ Unc−nonlinear is nonlinearly related to Conc.true : and thus Eq. ( 13) for POC and EC becomes In Eq. ( 14), the γ Unc decreases as concentration increases, since low concentrations are usually more challenging to measure.As a result, the γ Unc−nonlinear defined in Eq. ( 14) is more realistic than the constant approach, but there are two limitations.First, the physical meaning of the uncertainty unit is lost.If the unit of OC is µg m −3 , then the unit of ε OC becomes µg m −3 .Second, the concentration is not normalized by a consistent relative value, making it sensitive to the X and Y units used.For example, if POC true = 0.9 µg m −3 , then ε POC = ± 0.95 µg m −3 and γ Unc = 105 %, but by changing the concentration unit to POC true = 900 ng m −3 , ε OC = ± 30 ng m −3 and γ Unc = 3 %.
To overcome these deficiencies, we propose to modify Eq. ( 14) to where LOD (limit of detection) is introduced to generate a dimensionless γ Unc .α is a dimensionless adjustable factor to control the position of γ Unc curve on the concentration axis, which is indicated by the value of γ Unc at LOD level.As shown in Fig. 1a, at different values of α (α = 1, 0.5 and 0.3), the corresponding γ Unc at the same LOD level would be 100, 50 and 30 %, respectively.By changing α, the location of the γ Unc curve on x axis direction can be set, using the γ Unc at LOD as the reference point.Then Eq. ( 7) for POC and EC becomes With the modified γ Unc−nonlinear parameterization, concentrations of POC and EC are normalized by a corresponding LOD, which maintains unit consistency between POC true and ε POC and EC true and ε EC and eliminates dependency on the concentration unit.Uniform distribution has been used in previous studies (Cox et al., 2003;Chu, 2005;Saylor et al., 2006) and is adopted in this study to parameterize measurement error.For  18) and ( 19), the weights in DR and YR (inverse of variance) become The parameter λ in Deming regression is then determined: Besides the γ Unc−nonlinear discussed above, a second type measurement uncertainty parameterized by a constant proportional factor, γ Unc−linear , is very common in atmospheric applications: where γ POCunc and γ ECunc are the relative measurement uncertainties, e.g., for relative measurement uncertainty of 10 %, γ Unc = 0.1.As a result, the measurement error is linearly proportional to the concentration.An example comparison of γ Unc−nonlinear and γ Unc−linear is shown in Fig. 1b.For γ Unc−linear , the weights become and λ for Deming regression can be determined:

XY data generation by the Mersenne twister generator following a specific distribution
The Mersenne twister (MT) is a pseudorandom number generator (PRNG) developed by Matsumoto and Nishimura (1998).MT has been widely adopted by mainstream numerical analysis software (e.g., MATLAB, SPSS, SAS and Igor Pro) as well as popular programing languages (e.g., R, Python, IDL, C++ and PHP).Data generation using MT provides a few advantages: (1) frequency distribution can be easily assigned during the data generation process, allowing straightforward simulation of the frequency distribution characteristics (e.g., Gaussian or lognormal) observed in ambient measurements; (2) the inputs for data generation are simply the mean and standard deviation of the data series and can be changed easily by the user; (3) the correlation (R 2 ) between X and Y can be manipulated easily during the data generation to satisfy various purposes; and (4) unlike the sine function described by Chu (2005), which has a sample size limitation of 120, the sample size in MT data generation is highly flexible.
In this section, we will use POC as Y and EC as X as an example to explain the data generation.Procedure of applying MT to simulate ambient POC and EC data can be found in our previous study (Wu and Yu, 2016).Details of the data generation steps are shown in Fig. 2 and described below.The first step is generation of EC true by MT.In our previous study, it was found that ambient POC and EC data follow a lognormal distribution in various locations of the Pearl River Delta (PRD) region.Therefore, lognormal distributions are adopted during EC true generation.A range of average concentration and relative standard deviation (RSD) from ambient samples is considered in formulating the lognormal Besides MT, inclusion of the sine function data generation scheme in this study mainly serves two purposes.First, the sine function scheme was adopted in two previous studies (Chu, 2005;Saylor et al., 2006), the inclusion of this scheme can help to verify whether the codes in Igor for various regression approaches yield the same results from the two previous studies.Second, the crosscheck between results from sine function and MT provides circumstantial evidence that the MT scheme works as expected.
In this section, XY data generation by sine functions is demonstrated using POC as Y and EC as X.There are four steps in POC and EC data generation as shown by the flowchart in Fig. S2.Details are explained as follows.(1) The first step is to generate POC and EC (Chu, 2005): EC true = 3.5 where x is the elapsed hour (x = 1, 2, 3. ..n; n ≤ 120), τ is used to adjust the width of each peak, and φ is used to adjust the phase of the sine wave.The constants 14 and 3.5 are used to lift the sine wave to the positive range of the y axis.An example of data generation by the sine functions of Chu (2005) is shown in Fig. 3. Dividing Eq. ( 28) by Eq. ( 29) yields a value of 4. In this way the exact relation between POC and EC is defined clearly as (OC / EC) pri = 4.
(2) With POC comb and EC true generated, the second step is to add POC non-comb to POC comb to compute POC true .As for POC non-comb , a single value is assigned and added to all POC following Eq.( 10).Then the goodness of the regression intercept can be evaluated by comparing the regressed intercept with preset POC non-comb .
(3) The third step is to compute ε POC and ε EC , considering both γ Unc−nonlinear and γ Unc−linear .(4) The last step is to apply measurement errors on POC true and EC true following Eq.( 12).Then POC measured and EC measured can be used as Y and X, respectively, to evaluate the performance of various regression techniques.(Wu et al., 2018), was developed to perform the data correction and detailed descriptions can be found in https://sites.google.com/site/wuchengust.More details of the measurements can be found in Wu et al. (2018).

Comparison study using synthetic data
In the following comparisons, six regression approaches are compared using two data generation schemes (Chu sine function and MT) separately, as illustrated in Fig. 4.Each data generation scheme considers both γ Unc−nonlinear and γ Unc−linear in measurement error parameterization.In total, 18 cases are tested with different combination of data generation schemes, measurement error parameterization schemes, true slope and intercept settings.In each case, six regression approaches are tested, i.e., OLS, DR (λ = 1), DR λ = ω(X i ) ω(Y i ) , ODR, WODR and YR.In commercial software (e.g., OriginPro ® , SigmaPlot ® , GraphPad Prism ® ), λ in DR is set to 1 by default if not specified.As indicated by Saylor et al. (2006), the bias observed in the study of Chu (2005) is likely due to λ = 1 in DR.The purpose of including DR (λ = 1) in this study is to examine the potential bias using the default input in many software products.The six regression approaches are considered to examine the sensitivity of regression results to various parameters used in data generation.For each case, 5000 runs are performed to obtain statistically significant results, as recommended by Saylor et al. (2006).The mean slope and intercept from 5000 runs is compared with the true value assigned during data generation.If the difference is < 5 %, the result is considered unbiased.Chu (2005) In this section, the scheme of Chu ( 2005) is adopted for data generation to obtain a benchmark of six regression approaches.With different setup of slope, intercept and γ Unc , six cases (Cases 1-6) are studied and the results are discussed below.and 19) is summarized in Table 1.LOD POC , LOD EC , α POC and α EC are all set to 1 to reproduce the data studied by Chu (2005) and Saylor et al. (2006).Two sets of true slope and intercept are considered (Case 1: slope = 4, intercept = 0; Case 2: slope = 4, intercept = 3) to examine if any results are sensitive to the nonzero intercept.The R 2 (POC, EC) from 5000 runs for both Case 1 and 2 are 0.67 ± 0.03.

Comparison results using the dataset of
As shown in Fig. 5, for the zero-intercept case (Case 1), OLS significantly underestimates the slope (2.95 ± 0.14), while it overestimates the intercept (5.84 ± 0.78).This result indicates that OLS is not suitable for errors-in-variables linear regression, consistent with similar analysis results from Chu (2005) and Saylor et al. (2006).With DR, if the λ is properly calculated by weights λ = ω(X i ) ω(Y i ) , unbiased slope (4.01 ± 0.25) and intercept (−0.04 ± 1.28) are obtained; however, results from DR with λ = 1 show obvious bias in the slope (4.27 ± 0.27) and intercept (−1.45 ± 1.36).ODR also produces biased slope (4.27 ± 0.27) and intercept (−1.45 ± 1.36), which are identical to results of DR when λ = 1.With WODR, unbiased slope (3.98 ± 0.22) is observed, but the intercept is overestimated (1.12 ± 1.02).Results of YR are identical to WODR.For Case 2 (slope = 4, intercept = 3), slopes from all six regression approaches are consistent with Case 1 (Table 1).The Case 2 intercepts are equal to the Case 1 intercepts plus 3, implying that all the regression methods are not sensitive to a nonzero intercept.
For Case 3, LOD POC = 0.5, LOD EC = 0.5, α POC = 0.5, α EC = 0.5 are adopted (Table 1), leading to an offset to the left of γ Unc−nonlinear (blue curve) compared to Case 1 and 2 (black curve) in Fig. 1.As a result, for the same concentration of EC and OC in Case 3, the γ Unc−nonlinear is smaller than in Cases 1 and 2 as indicated by a higher R 2 (0.95 ± 0.01 for Case 3, Table 1).With a smaller measurement uncertainty, the degree of bias in Case 3 is smaller than in Case 1.For example, OLS slope is less biased in Case 3 (3.83± 0.08) compared to Case 1 (2.94 ± 0.14).Similarly, the slope (4.03 ± 0.09) and intercept (−0.18 ± 0.44) of DR (λ = 1) exhibit a much smaller bias with a smaller measurement uncertainty, implying that the degree of bias by improperly weighting in DR, WODR and YR is associated with the degree of measurement uncertainty.A higher measurement uncertainty results in larger bias in slope and intercept.
An uneven LOD POC and LOD EC is tested in Case 4 with LOD POC = 1, LOD EC = 0.5, α POC = 0.5, α EC = 0.5, which yield an R 2 (POC, EC) of 0.78 ± 0.02.The results are similar to Case 1.For DR λ = ω(X i ) ω(Y i ) unbiased slope and intercept are obtained.For WODR and YR, unbiased slopes are reported with a small bias in the intercepts.Large bias values are observed in both the slopes and intercepts in Case 4 using OLS, DR (λ = 1) and ODR.

Comparison results using data generated by MT
In this section, MT is adopted for data generation to obtain a benchmark of six regression approaches.Both γ Unc−nonlinear and γ Unc−linear are considered.With different configuration of slope, intercept and γ Unc , 12 cases (Cases 7-18) are studied and the results are discussed below.
These results imply that if the true slope is less than 1, the improper weighting (λ = 1) in Deming regression and ODR without weighting tends to underestimate slope.If the true slope is 1, these two estimators can provide unbiased results.If the true slope is larger than 1, the improper weighting (λ = 1) in Deming regression and ODR without weighting tends to overestimate slope.
is the only regression approach reporting unbiased intercept (3.08 ± 0.23).Cases 15 and 16 are tested to investigate whether the results are different if the true slope is smaller than 1.As shown in Table 1, the results are similar to Cases 13 and 14, i.e., that DR λ = ω(X i ) ω(Y i ) can provide unbiased slope and intercept while WODR and YR can provide unbiased slopes but biased intercepts.Cases 17 and 18 are tested to see if the results are the same for a special case when the true slope is 1.As shown in Table 1, the results are similar to Cases 13 and 14, implying that these results are not sensitive to the special case when the true slope is 1.

The importance of appropriate λ input for Deming regression
As discussed above, inappropriate λ assignment in the Deming regression (e.g., λ = 1 by default for much commercial software) leads to biased slope and intercept.Besides λ = 1, inappropriate λ input due to improper handling of measurement uncertainty can also result in bias for Deming regression.An example is shown in Fig. S3.Data are generated by MT with following parameters: slope = 4, intercept = 0, and γ Unc−linear (30 %). Figure S2a and b demonstrate that when an appropriate λ is provided (following γ Unc−linear , λ = POC 2 EC 2 ), unbiased slopes and intercepts are obtained.If an improper λ is used due to a mismatched measurement uncertainty assumption γ Unc−nonlinear , λ = POC EC , the slopes are overestimated (Fig. S3c, 4.37 ± 0.05) and intercepts are underestimated (Fig. S3, −2.01 ± 0.24).This result emphasizes the importance of determining the correct form of measurement uncertainty in ambient samples, since λ is a crucial parameter in Deming regression.
In the λ calculation, different representations for POC and EC, including mean, median and mode, are tested as shown in Fig. S4.The results show that when X and Y have a similar distribution (e.g., both are lognormal), any of mean, median or mode can be used for the λ calculation.

Caveats of regressions with unknown X and Y uncertainties
In atmospheric applications, there are scenarios in which a priori error in one of the variables is unknown, or the measurement error described cannot be trusted.For example, in the case of comparing model prediction and measurement data, the uncertainty of model prediction data is unknown.
A second example is the case in which measurement uncertainty cannot be determined due to the lack of duplicated or collocated measurements and as a result, an arbitrarily assumed uncertainty is used.Such a case was illustrated in the study by Flanagan et al. (2006).They found that in the Speciation Trends Network (STN), the whole-system uncertainty retrieved by data from collocated samplers was different from the arbitrarily assumed 5 % uncertainty.Additionally, the discrepancy between the actual uncertainty obtained through collocated samplers and the arbitrarily assumed uncertainty varied by chemical species.To investigate the performance of different regression approaches in these cases, two tests (A and B) are conducted.
In Test A, the actual measurement error for X is fixed at 30 %, while γ Unc for Y varies from 1 to 50 %.The assumed measurement error for regression is 10 % for both X and Y .Results of Test A are shown in Fig. 6a and b.For OLS, the slopes are underestimated (−14 to −12 %) and intercepts are overestimated (90-103 %) and the biases are independent of variations in γ Unc_Y .ODR and DR (λ = 1) yield similar results with overestimated slopes (0-44 %) and underestimated intercepts (−330-0 %).The degree of bias in slopes and intercepts depends on the γ Unc_Y .WODR, DR λ = ω(X i ) ω(Y i ) and YR perform much better than other regression approaches in Test A, with a smaller bias in both slopes (−8-12 %) and intercepts −98-55 %).
In Test B, γ Unc_Y is fixed at 30 % and γ Unc_X varies between 1 and 50 %.The results of Test B are shown in Fig. 6c  and d.The assumed measurement error for regression is 10 % for both X and Y .OLS underestimates the slopes (−29 to −0.2 %) and overestimates the intercepts (2-209 %).In contrast to Test A in which slope and intercept biases are independent of variations in γ Unc_Y , the slope and intercept biases in Test B exhibit dependency on γ Unc_X .The reason for this is that OLS only considers errors in Y and X is assumed to be error-free.ODR and DR (λ = 1) yield similar results with overestimated slopes (11-18 %) and underestimated intercepts (−144 to −87 %).The degree of bias in slopes and intercepts is relatively independent on the γ Unc_X .WODR, DR and YR performed much better than the other regression approaches in Test B, with a smaller bias in both slopes (−14-8 %) and intercepts (−59-106 %).
The results from these two tests suggest that, if one of the measurement errors described cannot be trusted or a priori error in one of the variables is unknown, WODR, DR λ = ω(X i ) ω(Y i ) and YR should be used instead of ODR and DR (λ = 1) and OLS.This conclusion is consistent with results presented in Sect.4.1 and 4.2.This analysis, albeit crude, also suggests that, in general, the magnitude of bias in slope estimation by these regression approaches is smaller than those for intercept.In other words, slope is a more reliable quantity compared to intercept when extracting quantitative information from linear regressions.

Regression applications to ambient data
This section demonstrates the application of the six regression approaches on a light absorption coefficient and EC dataset collected in a suburban site in Guangzhou.As mentioned in Sect.4.4, measurement uncertainties are crucial inputs for DR, YR and WODR.The measurement precision of Aethalometer is 5 % (Hansen, 2005), while EC by the RT-ECOC analyzer is 24 % (Bauer et al., 2009).These measurement uncertainties are used in DR, YR and WODR calculation.The dataset contains 6926 data points with an R 2 of 0.92.
As shown in Fig. 7, the y axis is light absorption at 520 nm (σ abs520 ) and the x axis is EC mass concentration.The regressed slopes represent the mass absorption efficiency (MAE) of EC at 520 nm, ranging from 13.66 to 15.94 m 2 g −1 by the six regression approaches.OLS yields the lowest slope (13.66 as shown in Fig. 7a) among all six regression approaches, consistent with the results using synthetic data.This implies that OLS tends to underestimate regression slope when mean Y to X ratio is larger than 1.DR (λ = 1) and ODR report the same slope (14.88) and intercept (5.54); this equivalency is also observed for the synthetic data.Similarly, WODR and YR yield identical slope (14.88) and intercept (5.54), in line with the synthetic data results.The regressed slope by DR (λ = 1) is higher than DR λ = ω(X i ) ω(Y i ) , and this relationship agrees well with the synthetic data results.
Regression comparison is also performed on hourly OC and EC data.Regression on OC / EC percentile subset is a widely used empirical approach for primary OC / EC ratio determination.Figure S5 shows the regression slopes as a function of OC / EC percentile.OC / EC percentile ranges from 0.5 to 100 %, with an interval of 0.5 %.As the percentile increases, SOC contribution in OC increases as well, resulting in decreased R 2 between OC and EC.The deviations between six regression approaches exhibit a dependency on R 2 .When percentile is relatively small (e.g., < 10 %), the differences between the six regression approaches are also small due to the high R 2 (0.98).The deviations between the six regression approaches become more pronounced as R 2 decreases (e.g., < 0.9).The deviations are expected to be even larger when R 2 is less than 0.8.These results emphasize the importance of applying error-in-variables regression, since ambient XY data more likely has an R 2 less than 0.9 in most cases.
As discussed in this section, the ambient data confirm the results obtained in comparing methods with the synthetic data.The advantage of using the synthetic data for regression approaches evaluation is that the ideal slope and intercept are known values during the data generation, so the bias of each regression approach can be quantified.

Recommendations and conclusions
This study aims to provide a benchmark of commonly used linear regression algorithms using a new data generation scheme (MT).Six regression approaches are tested, i.e., OLS, DR (λ = 1), DR λ = ω(X i ) ω(Y i ) , ODR, WODR and YR.The results show that OLS fails to estimate the correct slope and intercept when both X and Y have measurement errors.This result is consistent with previous studies.For ambient data with R 2 less than 0.9, error-in-variables regression is needed to minimize the biases in slope and intercept.If measurement uncertainties in X and Y are determined during the measurement, measurement uncertainties should be used for regression.With appropriate weighting, DR, WODR and YR can provide the best results among all tested regression techniques.Sensitivity tests also reveal the importance of the weighting parameter λ in DR.An improper λ could lead to biased slope and intercept.Since the λ estimation depends on the form of the measurement errors, it is important to determine the measurement errors during the experimentation stage rather than making assumptions.If measurement errors are not available from the measurement and assumptions are made on measurement errors, DR, WODR and YR are still the best option that can provide the least bias in slope and intercept among all tested regression techniques.For these reasons, DR, WODR and YR are recommended for atmospheric studies when both X and Y data have measurement errors.
Application of error-in-variables regression is often overlooked in atmospheric studies, partly due to the lack of a specified tool for the regression implementation.To facilitate the implementation of error-in-variables regression (including DR, WODR and YR), a computer program (Scatter Plot) with a GUI in Igor Pro (WaveMetrics, Inc.Lake Oswego, OR, USA) was developed (Fig. 8).It is packed with many useful features for data analysis and plotting, including batch plotting, data masking via GUI, color coding in the z axis, data filtering and grouping by numerical values and strings.The Scatter Plot program and user manual are available from https://sites.google.com/site/wuchengustand https://doi.org/10.5281/zenodo.832417.

Figure 2 .
Figure 2. Flowchart of data generation steps using MT.

3. 2 Figure 3 .
Figure 3. POC comb and EC true data generated by the sine functions of Chu (2005).(a) Time series of the 120 data points for POC comb and EC true .(b) Scatter plot of POC comb vs. EC true .

Figure 4 .
Figure 4. Overview of the comparison study design.
scheme Slope Intercept (X, Y ) error

Figure 5 .
Figure 5. Regression results on synthetic data, Case 1 (Slope = 4, Intercept = 0, LOD POC = 1, LOD EC = 1, a POC = 1, a EC = 1, R 2 (POC, EC) = 0.67 ± 0.03).The scatter plots demonstrate regression examples from a single run.The box plots show the distribution of regressed slopes and intercepts from 5000 runs of six regression approaches.The dashed lines in orange and pink represent true slope and intercept, respectively.

Figure 6 .
Figure6.Slope and intercept biases by different regression schemes in two test scenarios (A and B) in which the assumed error for one of the regression variables deviates from the actual measurement error.In Test A data generation, γ Unc_X is fixed at 30 % and γ Unc_Y is varied between 1 and 50 %.In Test B, γ Unc_X varies between 1 and 50 % and γ Unc_Y is fixed at 30 %.The assumed measurement error for regression is 10 % for both X and Y .(a) Slope biases as a function of γ Unc_Y in Test A. (b) Intercept biases as a function of γ Unc_Y in Test A. (c) Slope biases as a function of γ Unc_X in Test B. (d) Intercept biases as a function of γ Unc_X in Test B.

Figure 7 .
Figure 7. Regression results using ambient σ abs520 and EC data from a suburban site in Guangzhou, China.

Figure 8 .
Figure 8.The user interface of the Scatter Plot Igor program.The program and its operation manual are available from https://doi.org/10.5281/zenodo.832417.
Data availability.OC, EC and σ abs data used in this study are available from the corresponding authors upon request.The computer programs used for data analysis and visualization in this study are available inWu (2017a-c).
The second step is to generate POC comb .As shown in Fig.2, POC comb is generated by multiplying EC true with (OC / EC) pri .Instead of having a Gaussian distribution, (OC / EC) pri in this study is a single value, which favors direct comparison between the true value of (OC / EC) pri and (OC / EC) pri estimated from the regression slope.The third step is generation of POC true by adding POC non-comb onto POC comb .Instead of having a distribution, POC non-comb in this study is a single value, which favors direct comparison between the true value of POC non-comb and POC non-comb estimated from the regression intercept.The fourth step is to compute ε POC and ε EC .As discussed in Sect.3.1.1,twotypes of measurement errors are considered for ε POC and ε EC calculation: γ Unc−nonlinear and γ Unc−linear .In the last step, POC measured and EC measured are calculated following Eq.(12), i.e., applying measurement errors on POC true and EC true .Then POC measured and EC measured can be used as Y and X, respectively, to test the performance of various regression techniques.An Igor Pro-based program with a GUI was developed to facilitate the MT data generation for OC and EC.A brief introduction is given in the Supplement.

Table 1 .
Summary of six regression approaches comparison with 5000 runs for 18 cases.
Appendix B: Summary of abbreviations and symbolsAbbreviation/symbol Definition α a dimensionless adjustable factor to control the position of γ Unc curve on the concentration axis b intercept in linear regressionβ i , U i , V i , W i intermediates in York regression calculations γ POC true + ε POC ) σ X i , σ Y ithe standard deviation of the error in measurement of X i and Y i r i correlation coefficient between errors in X i and Y i in YR