Consistent evaluation of GOSAT, SCIAMACHY, CarbonTracker, and MACC through comparisons to TCCON

Consistent validation of satellite CO2 estimates is a prerequisite for using multiple satellite CO2 measurements for joint flux inversion, and for establishing an accurate long-term atmospheric CO2 data record. We focus on validating model and satellite observation attributes that impact flux estimates and CO2 assimilation, including accurate error estimates, correlated and random errors, overall biases, biases by season and latitude, the impact of coincidence criteria, validation of seasonal cycle phase and amplitude, yearly growth, and daily variability. We evaluate dry-air mole fraction (XCO2) for GOSAT (ACOS b3.5) and SCIAMACHY (BESD v2.00.08) as well as the CarbonTracker (CT2013b) simulated CO2 mole fraction fields and the MACC CO2 inversion system (v13.1) and compare these to TCCON observations (GGG2014). We find standard deviations of 0.9, 0.9, 1.7, and 2.1 ppm versus TCCON for CT2013b, MACC, GOSAT, and SCIAMACHY, respectively, with the single target errors 1.9 and 0.9 times the predicted errors for GOSAT and SCIAMACHY, respectively. When satellite data are averaged and interpreted according to $\mathrm{error}^2 = a^2 + b^2/n$ (where n is the number of observations averaged, a are the systematic (correlated) errors, and b are the random (uncorrelated) [...] and Four Corners, which are highly influenced by local effects. We compare the variability within one day between TCCON and models in JJA; there is correlation between 0.2 and 0.8 in the NH, with models showing 10–100 % of the variability of TCCON at different stations (except Bremen and Four Corners, which have no variability compared to TCCON) and CT2013b showing more variability than MACC. This paper highlights findings that provide inputs to estimate flux errors in model assimilations, and places where models and satellites need further investigation, e.g. the SH for models and 45–67° N for GOSAT.


Introduction
Carbon-climate feedbacks are a major uncertainty in predicting the climate response to anthropogenic forcing (Friedlingstein et al., 2006). Currently, about 9 gigatons (Gt) of carbon are emitted per year from human activity (e.g. fossil fuel burning, deforestation), of which about 5 Gt stays in the atmosphere, causing an annual CO2 increase of approximately 2 ppm yr⁻¹. The yearly increase is quite variable, estimated at 1.99 ± 0.43 ppm yr⁻¹ (http://www.esrl.noaa.gov/gmd/ccgg/trends/global.html), but always positive (Houghton et al., 2007). The remaining 4 Gt of carbon is taken up by the ocean and the terrestrial biosphere; however, there are uncertainties in the location and mechanism of these sinks, e.g. the distribution of land sinks between the Northern Hemisphere and the tropics (e.g. Stephens et al., 2007), and the localization of sources and sinks on regional scales (Canadell et al., 2011; Baker et al., 2006). The uncertainties in top-down source and sink estimates are a consequence of uncertainties in model transport and dynamics (e.g. Prather et al., 2008; Stephens et al., 2007) and sparseness of available surface-based CO2 observations (Hungershoefer et al., [...]

[...] in the retrievals. V3.5 has corrections of GOSAT High (H) and Medium (M) gain data over land, as well as glint-mode data over the ocean, by using not only the "Southern Hemisphere Approximation", but also TCCON observations and comparisons to an ensemble mean of multiple transport model output. Details of the post-retrieval filter and the bias-correction scheme can be found in the ACOS v3.5 user's guide, which will soon be at https://co2.jpl.nasa.gov/.

SCIAMACHY CO2
The following description of the SCanning Imaging Absorption SpectroMeter for Atmospheric ChartographY (SCIAMACHY) CO2 retrieval algorithm summarizes important aspects of Reuter et al. (2010, 2011) and is adopted in parts from the algorithm theoretical basis document (Reuter et al., 2012b).
The Bremen Optimal Estimation DOAS (BESD) algorithm is designed to analyze SCIAMACHY sun-normalized radiance measurements to retrieve the column-average dry-air mole fraction of atmospheric carbon dioxide (XCO2). BESD is a so-called full physics algorithm, which uses measurements in the O2-A absorption band to retrieve scattering information of clouds and aerosols. This information is transferred to the CO2 absorption band at 1580 nm by simultaneously fitting the spectra measured in both spectral regions. Similar to the ACOS three-band retrieval for GOSAT, consideration of scattering by this approach reduces potential systematic biases due to clouds or aerosols. The retrieved 26-element state vector consists of a second-order polynomial of the surface spectral albedo in both fit windows, two instrument parameters (spectral shift and slit function full width at half maximum (FWHM) in both fit windows, described in Reuter et al., 2010), a temperature profile shift, a scaling of the H2O profile and of a default aerosol profile, cloud water/ice path, cloud top height, surface pressure, and a ten-layer CO2 mixing ratio profile. Even though the number of state vector elements (26) is smaller than the number of measurement vector elements (134), the inversion problem is generally under-determined, especially for the CO2 profile. For this reason BESD uses a priori knowledge as a side-constraint. However, for most of the state vector elements the a priori knowledge gives only a weak constraint and therefore does not dominate the retrieval results. The degrees of freedom for XCO2 typically lie between 0.9 and 1.1.
A post-processor adjusts the retrieved XCO2 to a priori CO2 profiles generated with the simple empirical CO2 model (SECM) described by Reuter et al. (2012a). Additionally, the post-processor performs quality filtering and bias correction. The bias correction is based on, e.g., convergence, fit residuals, error reduction, etc. The bias correction follows the idea of using TCCON as a reference to derive an empirical bias model depending on solar zenith angle, retrieved albedo, etc. The theoretical predicted errors have been scaled by 0.22 to agree with the errors versus TCCON (Reuter et al., 2011). More details can be found in BESD's algorithm theoretical basis document (Reuter et al., 2012b).

The TCCON
The Total Carbon Column Observing Network (TCCON) consists of ground-based Fourier transform spectrometers (FTS) that measure high spectral (0.02 cm⁻¹) and temporal (∼90 s) resolution spectra of the direct sun in the near infrared (Wunch et al., 2011a). [...] from their absorption signatures in the solar spectra using the GGG software package, which employs a nonlinear least squares spectral fitting algorithm to scale an a priori volume mixing ratio profile. Absorption of CO2 is measured in the weak CO2 bands centered on 6220 and 6339 cm⁻¹, and of O2 in the band centered on 7885 cm⁻¹.
The total column dry-air mole fractions of CO2 (XCO2) are computed by ratioing the column abundances of CO2 and O2. The resulting dry-air mole fractions have been calibrated against profiles of CO2 measured by WMO-scale instrumentation aboard aircraft (Wunch et al., 2010; Messerschmidt et al., 2011). The precision and accuracy of the TCCON XCO2 product are ∼0.8 ppm (2-sigma) after calibration (Wunch et al., 2010). The TCCON data used in this paper are from the GGG2012 release, available from http://tccon.ipac.caltech.edu/. We use 18 TCCON stations, distributed globally (see Fig. 1), and these data have been used extensively for satellite validation (e.g., Butz et al., 2011; Morino et al., 2011; Wunch et al., 2011b; Reuter et al., 2011; Schneising et al., 2012; Oshchepkov et al., 2012), in flux inversions (Chevallier et al., 2011), and in model comparisons (Basu et al., 2011; Saito et al., 2012). We use the GGG2014 data when available, and the GGG2012 data from sites Four Corners, Tsukuba, and Bremen. The GGG2012 sites have corrections based on the instructions from the TCCON partners, listed on the TCCON website (https://tccon-wiki.caltech.edu/Network_Policy/Data_Use_Policy/Data_Description_GGG2012#Laser_Sampling_Errors). We also apply a 0.9972 factor to Four Corners, as indicated here: https://tccon-wiki.caltech.edu/Network_Policy/Data_Use_Policy/Data_Description_GGG2012. Two instruments have been operated at the Lauder site. We identify them using 120HR (for the 20 June 2004 through 28 February 2011 period) and 125HR (for 2 February 2010 through the present) when results are instrument specific.

In order to explicitly quantify the impact of transport uncertainty and prior flux model bias on inverse flux estimates from CarbonTracker, the CT2013b release is composed of a suite of inversions, each using a different combination of prior flux models and parent meteorological model. Sixteen independent inversions were conducted, using two terrestrial biosphere flux priors, two air-sea CO2 exchange flux priors, two estimates of imposed fossil fuel emissions, and two transport estimates in a factorial design. CT2013b results are presented as the performance-weighted mean of the inversion suite, with uncertainties including a component of across-model differences. All CarbonTracker results and complete documentation can be accessed online at http://carbontracker.noaa.gov.

For model-data comparisons at selected sites, CT2013b is sampled at 90-minute intervals on the model's native vertical grid of 34 levels. Quantities are laterally interpolated from grid points to the location of the site using the sub-grid tracer distribution model of the Russell and Lerner (1981) advection scheme. This "column" output includes CO2 tracers and meteorological conditions, and is available online at ftp://aftp.cmdl.noaa.gov/products/carbontracker/co2/CT2013/column/.

MACC
Monitoring Atmospheric Composition and Climate (MACC, http://www.gmes-atmosphere.eu/) is the European Union-funded project responsible for the development of the pre-operational Copernicus atmosphere monitoring service. MACC monitors the global distributions of greenhouse gases, aerosols, and reactive gases, and estimates some of their sources and sinks. Since 2010, it has delivered every year an analysis of the carbon dioxide in the atmosphere and of its surface fluxes, based on the assimilation of air sample mole fraction measurements (Chevallier et al., 2010). It relies on a variational inversion formulation, developed by LSCE, that estimates 8-day grid-point daytime/nighttime CO2 fluxes and the grid-point total columns of CO2 at the initial time step of the inversion window. The Bayesian error statistics of the estimate are computed by a robust randomization approach. The MACC inversion scheme relies on the global tracer transport model LMDZ (Hourdin et al., 2006), driven by the wind analyses from the ECMWF. For release v13.1 of the MACC inversion, used here, LMDZ was run at a horizontal resolution 3.75 [...]

[...] Fig. 2. These plots show matches using the geometric coincidence criteria described in Table 2 below (for satellites) and give an idea of the number of coincidences for each dataset using these criteria. These sites were chosen as they have the most coincidences in the Northern and Southern Hemisphere, respectively, for satellites. All sets compare well; the 30-day moving averages show differences most easily, such as a repeating blip in CT2013b comparisons at the summer drawdown at Lamont and seasonal mismatches in CT2013b comparisons to Lauder, which will be discussed later in the paper.

Coincidence criteria and other matching details
The SCIAMACHY and GOSAT comparisons in this paper are based on two different definitions of coincidence criteria between TCCON and satellite data. Satellite measurements that satisfy the so-called geometric criteria are within ±1 h of the mean time of a 90-min TCCON average and within ±5° latitude and longitude. The dynamical criteria (Keppel-Aleks et al., 2011) are designed to exploit information about the dynamical origin of an air parcel through a constraint on the free-tropospheric temperature. This allows us to relax the geometric constraints and find more coincident satellite soundings per TCCON measurement. Briefly, a match is found when the measurements are within 5 days and the following is satisfied: [...] where ΔTemperature is the co-located NCEP temperature difference at 700 hPa (Kalnay et al., 1996).

Bias and standard deviation for individual matches

[...] (Wunch et al., 2010). When the measured biases are larger than the gray box, they are considered significantly different from TCCON. For GOSAT, biases larger than the TCCON bias uncertainty occur at stations north of 67° N (Eureka, Sodankyla), Garmisch, Four Corners, Tsukuba, and Lauder. Several stations have special circumstances regarding validation. Garmisch is in the midst of complicated terrain, where local atmospheric transport is difficult to model and measurements from space are difficult. Four Corners (4C) is located in the vicinity of two power plants with large CO2 emissions (Lindenmaier et al., 2014). The meteorology is such that 4C regularly samples large localized plumes with column CO2 increases of several ppm that last hours in the late morning. Therefore, the low bias in models and satellite data relative to the 4C TCCON is attributed to the smaller-scale enhancements from the power plants measured by TCCON, which are significantly diluted in the model and satellite results. Bremen is also affected by local urban sources, so satellites and models would be expected to be biased low, which is what we find, though the bias is similar to that at adjacent stations. JPL is in a megacity with complex adjacent terrain. SCIAMACHY has the same outliers as GOSAT with an additional low bias at Karlsruhe. [...] occur for models. The standard deviations show some variability from station to station, which is investigated below. The effects of averaging and coincidence criteria are investigated in Sect. 3.3.

Figure 4 shows the biases and standard deviations grouped globally and over the Northern and Southern Hemispheres. To estimate the overall bias and standard deviations for single observations, we take out the outliers as follows. For the models, we take out JPL, Four Corners, Bremen, and Garmisch, with the caveat that models are unable to resolve variations at sites with complex orography (Garmisch) or strongly influenced by local sources (JPL, Four Corners, Bremen) due to resolution, and Tsukuba for the standard deviation, as the TCCON instrument at Tsukuba has a higher standard deviation. For satellites, we remove the above plus Tsukuba and Lauder due to limited numbers of comparisons for SCIAMACHY. For the bias we take out stations poleward of 60° N, which have large positive biases for GOSAT and SCIAMACHY, which we note as an issue. There is an overall bias versus TCCON on the order of 0.7 ppm for CT2013b, and 0.2-0.3 ppm for the other 3 sets. The overall bias is less of a concern than the bias variability in satellite data, which indicates regional errors that will translate to regional errors in flux estimates. [...] 2.1 ppm (BESD) and 2.3 ppm (ACOS) for the single sounding precision and 0.9 ppm for the station-to-station biases. Their findings for BESD are consistent with the findings of Dils et al. (2014). The station-to-station biases are lower in our analysis due to corrections in TCCON, improvements in satellite estimates, and removal of several stations from the estimates.

We test whether the biases seen in Figs. 3 and 4 are persistent from year to year. When at least two full-year averages exist for a station, the standard deviation of the yearly bias is calculated. The average over all stations of the yearly bias standard deviation is 0.3 ppm for all sets (CT2013b, MACC, SCIAMACHY, GOSAT). The year-to-year variability in the bias could be partly attributed to the seasonal distribution of the data.

Another important comparison is of the predicted and actual errors. The predicted error (also referred to as the a posteriori error) is reported for each satellite product, and we take the actual error to be the standard deviation of the satellite observation versus TCCON. These two quantities should agree if the TCCON error is much smaller than the a posteriori error and the coincidence criteria do not degrade the agreement.

The predicted and actual errors vary from site to site, e.g. from variations in albedo, aerosol composition, solar zenith angle, etc. We calculate the correlation between the standard deviation vs. TCCON and the predicted error as follows: the standard deviation of the satellite vs. TCCON is calculated at each TCCON station, and the correlation between the vectors of standard deviations and predicted errors by station is then calculated. ACOS-GOSAT has a 0.6 correlation and BESD-SCIAMACHY has a 0.5 correlation. This indicates that the predicted error should be utilized, e.g. when assimilating ACOS-GOSAT, as the variability in the predicted error represents variability in the actual error, though not perfectly. A scale factor should also be applied to the predicted errors. For ACOS-GOSAT the predicted error averaged over all TCCON sites is 0.9 ppm, as compared to the actual error of 1.7 ppm, and can be corrected by applying a factor of 1.9 to the reported GOSAT errors. For BESD-SCIAMACHY, the predicted error of 2.3 ppm multiplied by 0.9 agrees with the 2.1 ppm actual error.

We now directly compare the performance of geometric and dynamic coincidence criteria and averaging in terms of error. Figure 5 shows SCIAMACHY and GOSAT standard deviations versus TCCON for geometric and dynamical coincidence criteria. The stations used were those that had entries for all comparisons, listed in the Fig. 5 caption. For n = 1 no averaging is done and the dynamic coincidence criteria perform similarly to the geometric criteria, though the dynamic error is ∼0.2 ppm higher for SCIAMACHY. For n = 2, exactly two satellite observations were averaged for each coincidence. The error drops substantially, but not as 1/√2, which would be expected if the errors were uncorrelated. For n = 4, the error again drops but it is not half the n = 1 error, which is shown by the dotted line. At n = 4, the dynamic coincidence error is the same as the geometric error, likely because dynamic coincidence involves averaging observations farther apart in location and time, which are less likely to have correlated errors. The last bar is the maximum n, which has results for all stations included. The dynamic criteria allow far more coincidences, resulting in significantly lower average errors. The dynamic criteria are used for the remainder of the paper, but with checks using the geometric criteria to ensure that artifacts are not added by the dynamic criteria. Note that all averaged satellite observations match to one particular TCCON observation.
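The geometric criteria described above (±1 h, ±5° latitude and longitude) can be illustrated with a minimal Python sketch. The function name, the argument layout, and the convention that times are given in hours are assumptions made for illustration; this is not the matching code used in the paper.

```python
import numpy as np

def geometric_matches(t_tccon, lat_tccon, lon_tccon,
                      t_sat, lat_sat, lon_sat,
                      max_dt_hours=1.0, max_dlat=5.0, max_dlon=5.0):
    """Indices of satellite soundings coincident with one TCCON average.

    Times are in hours on a common time axis (an assumption of this sketch);
    latitudes and longitudes are in degrees.
    """
    t_sat = np.asarray(t_sat, dtype=float)
    lat_sat = np.asarray(lat_sat, dtype=float)
    lon_sat = np.asarray(lon_sat, dtype=float)

    # Wrap longitude differences into [-180, 180) so that matches
    # across the date line are handled correctly.
    dlon = (lon_sat - lon_tccon + 180.0) % 360.0 - 180.0

    keep = ((np.abs(t_sat - t_tccon) <= max_dt_hours)
            & (np.abs(lat_sat - lat_tccon) <= max_dlat)
            & (np.abs(dlon) <= max_dlon))
    return np.flatnonzero(keep)
```

All soundings returned for one TCCON average can then be averaged (or subsampled to a fixed n) before differencing against that single TCCON value, consistent with the note that all averaged satellite observations match to one particular TCCON observation.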

Errors versus averaging: random and correlated error
To test the effects of spatial averaging, we calculate station-by-station standard deviations of satellite − TCCON matched pairs as a function of n, where n is the number of satellite observations that are averaged, which are chosen randomly from available matches (so there should be no difference in the characteristics of chosen points for larger vs. smaller n). Figure 6 shows plots from Lamont for SCIAMACHY and GOSAT of the standard deviation of the difference to TCCON versus n. Initially the error drops rapidly with n; however, the decrease slows with larger n. The curve fits well to the theoretically expected form:

$$\mathrm{error}^2 = a^2 + \frac{b^2}{n}$$

where a represents correlated errors which do not decrease with averaging for similar cases (including smoothing errors, errors from interferents such as aerosols, TCCON error, and co-location error), b represents uncorrelated errors which decrease with averaging, and n represents the number of satellite observations that are averaged.

The purple dashed line represents the standard deviation of CT2013b at the satellite time and location vs. CT2013b at the TCCON time and location. It represents the spatio-temporal mismatch error and, as expected, this value is much smaller for geometric than for dynamic coincidence criteria. We calculate a and b by station in Table 3, with average values for Northern Hemisphere stations of a = 1.5 ± 0.3 ppm, b = 1.6 ± 0.2 ppm for SCIAMACHY geometric, a = 1.1 ± 0.2 ppm, b = 1.4 ± 0.4 ppm for SCIAMACHY dynamic, and a = 0.9 ± 0.2 ppm, b = 1.7 ± 0.3 ppm for GOSAT dynamic. These values indicate the expected error when averaging GOSAT or SCIAMACHY observations matching to a single TCCON observation. There is more correlated error, a, for SCIAMACHY geometric versus dynamic matches in 4 of 7 stations in the Northern Hemisphere, indicating that averaging is more effective when it is over a larger spatial/temporal area, probably due to variability in the source of the correlated errors. GOSAT has only two stations, Lamont and Park Falls, which have enough co-locations to directly compare dynamic and geometric coincidence criteria, but these stations have smaller correlated error for geometric matches, which is true in all seasons. This could be due to the smaller GOSAT footprint allowing more variability from observation to observation. [...] Table 3) and TCCON error (Appendix A) results in a = 1.4, 1.0, and 0.6 ppm for SCIAMACHY geometric, SCIAMACHY dynamic, and GOSAT dynamic, respectively. The correlated errors indicate a likely regional bias for the specified spatio-temporal scale, e.g. one would expect an error of 0.6 ppm for 5° × 5° × 1 h GOSAT averages. The larger a value for SCIAMACHY could be a result of the larger SCIAMACHY footprint size and closer footprints, which likely have more correlations in clouds, aerosols, and other sources of systematic errors, versus the smaller, more separated GOSAT footprints, which would likely have more variability in interferents from observation to observation.

The green dashed line in Fig. 6 shows the standard deviation of the satellite prior versus TCCON. Although using an optimal constraint will result in an error lower than the prior error in the absence of systematic errors, these satellite retrievals of CO2 have been set up to prioritize averaged results over single observations, so the error increases from the prior for a single observation, but averaged results have both less error and minimal prior influence.
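The correlated and random error terms can be estimated from a curve of standard deviation versus n with a simple least-squares fit, because error² = a² + b²/n is linear in 1/n. The sketch below is one possible way to do this; the function names are hypothetical and the fitting procedure actually used for Table 3 is not specified in this section.

```python
import numpy as np

def fit_correlated_random(n, sigma):
    """Fit sigma(n)^2 = a^2 + b^2 / n by linear least squares in 1/n.

    n     : number of satellite observations averaged (array-like)
    sigma : standard deviation of satellite minus TCCON at each n (ppm)
    Returns (a, b) in the units of sigma.
    """
    n = np.asarray(n, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    # Columns multiply the unknowns a^2 (constant term) and b^2 (1/n term).
    design = np.column_stack([np.ones_like(n), 1.0 / n])
    coef, *_ = np.linalg.lstsq(design, sigma ** 2, rcond=None)

    # Noise can push a fitted variance slightly below zero; clip before sqrt.
    a2, b2 = np.clip(coef, 0.0, None)
    return np.sqrt(a2), np.sqrt(b2)

def expected_error(a, b, n):
    """Expected error of an n-observation average for given a and b."""
    return np.sqrt(a ** 2 + b ** 2 / np.asarray(n, dtype=float))
```

With the GOSAT dynamic values quoted above (a = 0.9 ppm, b = 1.7 ppm), `expected_error(0.9, 1.7, n)` approaches a as n grows, which is why averaging alone cannot remove the correlated component.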

Seasonal-dependent biases
It is important to determine whether there are seasonally-dependent biases, as these will impact flux distributions. We look at 3-month periods (DJF, MAM, JJA, SON), with the overall yearly bias at each site subtracted out to isolate the seasonal biases. To get enough comparisons, we use the dynamical criteria for satellite coincidences, as using the geometric criteria cuts down the comparisons with sufficient seasonal coverage to 3 stations (Park Falls, Lamont, and Wollongong). Figure 7 shows the biases for stations that have at least 20 matches in each season, and Fig. 8 [...] for different years and compared the error bars to errors calculated using the bootstrap method (Rubin, 1981), the standard deviation of results within one bin, and differences in results when using dynamic vs. geometric coincidence criteria. Where results are available, we compared to geometric coincidence criteria.

[...] needed to distinguish the seasonal cycle from the yearly increase. The errors are calculated using the bootstrap method (Rubin, 1981) and the standard deviation of results within one bin. Two datasets at a time are matched, using the dynamic criteria, with the satellite averaging 4 observations for SCIAMACHY and 2 for GOSAT, which reduces the fit errors. Since different datasets will have different data gaps and time ranges, the TCCON results will be somewhat different for each comparison. Plots are individually examined to ensure that there is adequate data (e.g. see Fig. 9). Stations that are removed are Tsukuba, Four Corners, and JPL2007, which do not have more than 2 years of data, and Izaña, Lauder, and Darwin for SCIAMACHY. Izaña is removed because it is an ocean station and SCIAMACHY only retrieves over land. The ocean/land behavior is very different near Izaña (see Fig. 10), and although the dynamic coincidence criteria do remarkably well with SCIAMACHY at Izaña, with characteristics relative to TCCON similar to those at Lamont, it does not seem correct to include it.

The seasonal cycle amplitude is taken to be the difference between the maximum and minimum of the sampled harmonic fit. This provides the best average seasonal amplitude over the time range.
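As an illustration of how a seasonal amplitude and a yearly increase can be separated, the sketch below fits a linear trend plus annual harmonics by least squares and reports the peak-to-peak range of the harmonic part. It is a simplified stand-in, not the fitting program used in the paper: the linear trend, the choice of two harmonics, and the evaluation of the fit on a dense one-year grid (rather than at the sample times) are all assumptions of this sketch.

```python
import numpy as np

def harmonic_fit(t_years, xco2, n_harmonics=2):
    """Fit XCO2 = offset + growth * t + sum of annual harmonics.

    t_years : decimal years (e.g. 2010.37)
    xco2    : XCO2 values in ppm
    Returns (growth in ppm per year, peak-to-peak seasonal amplitude in ppm).
    """
    t = np.asarray(t_years, dtype=float)
    y = np.asarray(xco2, dtype=float)

    # Design matrix: constant, centered linear trend, then sin/cos pairs.
    cols = [np.ones_like(t), t - t.mean()]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(2.0 * np.pi * k * t))
        cols.append(np.cos(2.0 * np.pi * k * t))
    coef, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    growth = coef[1]

    # Evaluate only the harmonic part over one year on a dense grid.
    phase = np.linspace(0.0, 1.0, 366)
    seasonal = np.zeros_like(phase)
    for k in range(1, n_harmonics + 1):
        seasonal += coef[2 * k] * np.sin(2.0 * np.pi * k * phase)
        seasonal += coef[2 * k + 1] * np.cos(2.0 * np.pi * k * phase)

    return growth, seasonal.max() - seasonal.min()
```

Fitting the trend and the harmonics together matters for short records: with only one or two years of data, part of the seasonal drawdown can otherwise be absorbed into the growth term, which is one reason stations with short time series are excluded.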

Seasonal cycle amplitude
The seasonal cycle amplitude is important for source and sink estimates and global distributions. Table 4 shows the seasonal cycle amplitudes grouped by latitude. The errors represent the maximum of the predicted errors using the bootstrap method or the standard deviation of all results in that bin divided by the square root of the number of entries in that bin (n). All datasets show a similar pattern with respect to NH vs. SH, with amplitudes increasing poleward in the Northern Hemisphere. Specific places where differences are at least as large as the estimated errors are: in the far north (46-53° N), GOSAT underestimates the seasonal cycle by 0.9 ppm. This [...]

CO2 yearly growth rate
The same fitting program used in the above section, CCGCRV, also calculates a yearly increase. In Table 5 we compare the fitted yearly increase for TCCON to each of the datasets. Comparisons to TCCON are within the predicted error except in the SH, where SCIAMACHY is low compared to TCCON, and in the 46-53° N range, where GOSAT is low compared to TCCON. The yearly increase for TCCON varies from 1.9 to 2.3 ppm yr⁻¹ for the different locations and time ranges. To see how much of the observed variability in the growth rate is temporal vs. spatial, we compare to the global annual increase (growth rate) from surface measurements (http://www.esrl.noaa.gov/gmd/ccgg/trends/global.html) shown in Table 6. The average global yearly increase predicted from Table 6 using the time periods in Table 5 is shown in the last column of Table 5. The correlation r value between the "Yearly incr. TCCON" and "global" (Table 6) columns is 0.84 (similarly, the correlation to the Mauna Loa calculated average annual increase is 0.82), whereas the correlation r value between the "Yearly incr. TCCON" and "Yearly incr." columns is 0.60. Therefore, the variability of the growth rate seen in Table 5 is primarily explained by the time range of the comparisons. Reuter et al. (2011, JGR, [...]

[...] all offset times. Correlations are fit to a 2nd order polynomial to determine the phase minimum difference. As TCCON is moved forward or backward in time, different points will match up, particularly when there are data gaps in either dataset. This can cause difficulties in interpretation. The maximum correlation is limited by the ratio of the error to the variability. It follows from the definition of correlation that: [...] where $\mathrm{corr}_o$ is the noise-free correlation, $\varepsilon_x$ is the error on x and $\sigma_x$ is the true variability for x, and $\varepsilon_y$ is the error on y and $\sigma_y$ is the true variability for y. Because we are estimating $\sigma$ and $\varepsilon$, there is uncertainty on the correlation maximum. In our case $\sigma_y$ is taken to be the TCCON variability and $\varepsilon_y$ is estimated using [...] (Fisher, 1915, 1921). Figure 11 shows SCIAMACHY and GOSAT results at Park Falls. Although the prior performs well in regard to the standard deviation vs. TCCON, Fig. 11 shows that the prior has a clear seasonal cycle phase error, which is corrected by the satellite retrievals for both SCIAMACHY and GOSAT at Park Falls.
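The limit that noise places on the achievable correlation can be illustrated with the textbook attenuation relation, which holds when the errors are zero-mean and independent of the signals and of each other: the observed correlation equals the noise-free correlation divided by sqrt((1 + ε_x²/σ_x²)(1 + ε_y²/σ_y²)). The sketch below states this relation and checks it against a simulation; it may differ in form or notation from the expression used in the paper, and both function names are hypothetical.

```python
import numpy as np

def max_observable_correlation(sigma_x, eps_x, sigma_y, eps_y, corr_true=1.0):
    """Correlation expected between two noisy series.

    sigma_* : true variability of each series
    eps_*   : standard deviation of the independent error on each series
    Assumes zero-mean errors independent of the signals and of each other.
    """
    attenuation = np.sqrt((1.0 + (eps_x / sigma_x) ** 2)
                          * (1.0 + (eps_y / sigma_y) ** 2))
    return corr_true / attenuation

def simulated_correlation(sigma_x, eps_x, sigma_y, eps_y,
                          n_samples=200_000, seed=0):
    """Monte Carlo check for perfectly correlated true signals."""
    rng = np.random.default_rng(seed)
    signal = rng.standard_normal(n_samples)  # shared signal, so corr_true = 1
    x = sigma_x * signal + eps_x * rng.standard_normal(n_samples)
    y = sigma_y * signal + eps_y * rng.standard_normal(n_samples)
    return np.corrcoef(x, y)[0, 1]
```

When the error on each series equals its true variability, even perfectly correlated signals yield a correlation of only 0.5, which is why stations with a weak seasonal cycle relative to the noise give little useful phase information.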

Results of the seasonal cycle phase error are tabulated in Table 7, columns "GOSAT prior" and "GOSAT retrieved", "SCIA prior" and "SCIA retrieved". Stations not shown have either too few match-ups (e.g. Sodankyla) or too little variability compared to the noise (e.g. Wollongong) to have useful comparisons. The GOSAT retrieval markedly improves the seasonal cycle phase versus TCCON at all stations where there is adequate data. The SCIAMACHY retrieval clearly improves over the prior for Park Falls and Four Corners, mildly improves in 3 cases, and stays the same in 2 cases. Mismatches in SCIAMACHY phase could be from mismatches in vertical sensitivity (as higher altitudes have lagged seasonal cycles), effects of coincidence criteria, or seasonal-dependent biases. To check the coincidence criteria, cross-correlations were done for the geometric coincidence criteria, which had significantly fewer match-ups. Similar results for geometric coincidence criteria were found for GOSAT and SCIAMACHY for Lamont and Park Falls; the other stations are too noisy to draw conclusions.

Table 7 also shows the phase differences for the models, which have closer spatial/temporal matches and lower single-matchup errors. Model-TCCON phase differences could result from errors in model flux distributions, seasonal timing, or transport errors. The phase differences vary from −20 to +10 days. Phase differences of more than 10 days are noticeable by eye and occur in the NH at Bremen and Four Corners (negative; these stations are influenced by local effects), and at Orleans and Izaña (positive, CT2013b only). Larger phase differences occur at some stations in the Southern Hemisphere. Although the seasonal cycle is weaker in the Southern Hemisphere, it can be clearly seen in, e.g., the Lauder 125HR data in Fig. 2. The correlations versus offset days show a phase difference of −20 days for CT2013b and 0 days for MACC at Lauder 125HR. As seen in Fig. 12, the seasonal cycles in the SH display more complexity, such as multiple local maxima, than those in the NH, and "phase lag" could also be an indication of an issue with the fit shape.

Figure 12 shows correlations and standard deviations versus day offset for 3 stations that have the seasonal cycle peak within ±10 days for CT2013b and MACC (top panels), and for stations which have a larger phase lag compared to TCCON (bottom panels). There is often a small peak within ±3 days, which indicates the models' capability of picking up variations that occur from day to day (i.e., synoptic-scale variability); its size reflects the strength of synoptic activity and the quality of the matching between models and TCCON. This peak is not seen in satellite data, likely due to matching for the dynamic coincidence criteria and due to noise for the geometric coincidence criteria. Note that this synoptic peak occurs at 0 even when the seasonal cycle has a phase lag (e.g. MACC model at Bremen, in the lower right panel, or the Lauder 125HR comparisons). The synoptic-scale correlation varies between 0 and 0.17, as seen in Table 7.

We briefly discuss Izaña. The TCCON station is on Tenerife, a small island (about 50 × 90 km) with complex topography located about 300 km west of southern Morocco. The TCCON station is located at 2.37 km (about 770 hPa). The MACC and CT2013b models at ∼2° × 3° resolution do not resolve topography at these scales and consequently have a mean surface pressure at sea level, about 1000 hPa at this location. Our standard treatment is to interpolate the model to the TCCON pressure grid, then calculate XCO2 using the TCCON pressure weighting function. At Izaña this has the effect of chopping off the lower atmosphere. The CT2013b result for this treatment has a ∼+10 day seasonal cycle phase difference at Izaña, whereas MACC has no phase difference at Izaña. If, however, the model surface pressure is used to calculate XCO2, MACC goes from a 0 to a −10 day phase lag, and CT2013b has 0 phase difference. An argument for using the model surface pressure would be if the upslope winds at Izaña (Sancho et al., 1991; Bergamaschi et al., 2000) shifted the profile upwards rather than chopping it off, which would occur if the air instead deviated around the island. This finding has important implications for the choice of the comparison methodology and the ideal location for validation sites. Validation sites within complex geographical terrain have to be treated as special cases as (a) the atmospheric models usually do not resolve these variations and (b) satellite measurements rarely have a perfect co-location with the ground-based site, meaning that they could sample a substantially different altitude level. This holds for both mountains (e.g. Izaña) and valleys (e.g. Garmisch). This highlights one of the many choices that are made when comparing two products (e.g. whether to apply the averaging kernel, whether to use interpolation, how to treat the surface pressure, or what coincidence criteria to use).

Another finding worth noting is the comparisons at Lauder. In 2010 the Lauder 125HR instrument began routine operation, while the Lauder 120HR instrument continued to take TCCON data through the end of 2010. Both MACC and CT2013b show no [...] At Bremen and Four Corners, local effects that do not affect CO2 at 2° × 3° are likely dominating, particularly since Bremen is clustered with Orleans, Garmisch, and Karlsruhe, which all compare well, and because the correlation of daily variability, as seen in the next section, is also very low at these two stations.
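The phase comparison used in this section (correlation versus time offset, with a 2nd order polynomial locating the peak) can be sketched as follows. The interpolation of the reference series, the sign convention (a positive result means the second series lags the reference), and fitting the parabola only to points near the maximum are assumptions of this sketch rather than details confirmed by the text.

```python
import numpy as np

def phase_offset_days(t_ref, y_ref, t_other, y_other, offsets):
    """Estimate the lag (days) of y_other relative to y_ref.

    t_ref, t_other : times in days (t_ref must be increasing)
    offsets        : candidate lags in days, evenly spaced
    Returns (lag at the fitted correlation peak, correlation at each offset).
    """
    t_ref = np.asarray(t_ref, dtype=float)
    y_ref = np.asarray(y_ref, dtype=float)
    t_other = np.asarray(t_other, dtype=float)
    y_other = np.asarray(y_other, dtype=float)
    offsets = np.asarray(offsets, dtype=float)

    corr = np.empty_like(offsets)
    for i, lag in enumerate(offsets):
        # Reference series delayed by `lag`, sampled at the other series' times.
        # np.interp clamps outside t_ref, so t_ref should cover t_other - lag.
        shifted = np.interp(t_other - lag, t_ref, y_ref)
        corr[i] = np.corrcoef(shifted, y_other)[0, 1]

    # Fit a parabola to the points around the maximum and take its vertex.
    i_max = int(np.argmax(corr))
    lo, hi = max(i_max - 2, 0), min(i_max + 3, len(offsets))
    a, b, _ = np.polyfit(offsets[lo:hi], corr[lo:hi], 2)
    return -b / (2.0 * a), corr
```

Because different points pair up at each offset when either record has gaps, the correlation curve from real data is noisier than in an idealized case, so the fitted peak should be read together with the attainable maximum correlation discussed above.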

Daily variability (models vs. TCCON)
At the surface, CO2 shows a strong diurnal cycle in areas with active vegetation, e.g. Park Falls during summer, and synoptic trends based on regional dynamics. Even though the diurnal cycle is markedly smaller in the total column (Olsen and Randerson, 2004), it can be observed both by TCCON and in models, in our case CT2013b and MACC; both diurnal variations and synoptic trends can be seen in Fig. 13. Validating the amplitude of the diurnal variability in the column is important, as the column diurnal variability better represents the amount of CO2 emitted or absorbed by surface processes than surface measurements do, since the latter are impacted by boundary layer height. To our knowledge this is the first comparison of the diurnal cycle between model fields and TCCON. As TCCON itself has not been validated at multiple times in one day, this is considered a comparison, not a validation. We compare the difference between morning and afternoon in models and TCCON. To minimize potential TCCON biases that depend on the solar zenith angle (through the air mass factor), we compare at two points in each day separated by the largest time with the same solar zenith angle (SZA). The methodology is to (1) identify two points, t1 and t2, from the same day with the largest time difference but with the same SZA.

As the TCCON data used in this paper have been averaged over 90 min, t1 or t2 may be interpolated between two time points.
(2) We compare TCCON at t2 minus TCCON at t1 and the same times for each model. We look at the variability within one day for one season (JJA). Looking at different seasons for the Northern Hemisphere at the bottom of Table 8, both models showed clearly higher correlations and slopes in MAM
Table 8 show correlations between CT2013b or MACC vs. TCCON in the daily variability. The correlations are about 2/3 as large as could be expected, given the relative sizes of the variability and errors (see text around Eq. 3). When correlations are present, the models have about 1/3 to 1/2 the variability of TCCON (as seen from the smaller slopes). In the far north (Ny-Ålesund, Sodankylä), the correlations indicate agreement but the model daily variability is less than 1/4 that of TCCON. The mid-latitudes have the highest correlations (∼0.3 to ∼0.7), with model daily variability ∼0.2-0.6 that of TCCON. At Bremen and Four Corners, the models do not show the diurnal variability that is seen by TCCON. These are also the two stations in the mid-latitude NH which showed a seasonal cycle phase lag in both models. These sites are expected to be strongly influenced by local sources: a power plant for Four Corners, and urban sources for Bremen. In the Southern Hemisphere, correlation is seen at Wollongong and Lauder (125HR), again with less variability seen in models. Because of the smaller variability in the SH sites, the best correlation that could be achieved is about 0.3 for Darwin and Lauder 120HR, 0.4 for Lauder 125HR, and 0.5 for Darwin.
The CT2013b model in general shows more daily variability and higher correlations, and is therefore in better agreement with TCCON. Since the satellite observations are coincident ∼once per day, the diurnal pattern will not be constrained by satellite observations, except as preserved in transported air coincident with satellite measurements downwind. Model Observing System Simulation Experiments (OSSEs) can determine the impact of the diurnal cycle strength on flux estimates, and hence the importance of independently verifying the diurnal cycle in models.
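Steps (1) and (2) of the morning/afternoon comparison can be sketched as follows. This is a minimal illustration rather than the processing used for Table 8: it assumes one day of time-ordered samples with SZA decreasing before local noon and increasing after it, uses linear interpolation, and all function names and sample values are hypothetical.

```python
def interp_at_sza(sza, values, target):
    """Linearly interpolate values at SZA == target along one monotonic branch."""
    for i in range(len(sza) - 1):
        lo, hi = sza[i], sza[i + 1]
        if lo != hi and min(lo, hi) <= target <= max(lo, hi):
            w = (target - lo) / (hi - lo)
            return values[i] + w * (values[i + 1] - values[i])
    raise ValueError("target SZA is not bracketed by the samples")


def matched_sza_difference(times, sza, xco2):
    """Return (t1, t2, xco2(t2) - xco2(t1)) for the widest same-SZA pair."""
    noon = min(range(len(sza)), key=lambda i: sza[i])  # index of minimum SZA
    if noon == 0 or noon == len(sza) - 1:
        raise ValueError("need samples on both sides of local noon")
    am = slice(0, noon + 1)          # morning branch, SZA decreasing
    pm = slice(noon, len(sza))       # afternoon branch, SZA increasing
    # The largest SZA reached on both branches gives the largest time gap.
    target = min(max(sza[am]), max(sza[pm]))
    t1 = interp_at_sza(sza[am], times[am], target)
    t2 = interp_at_sza(sza[pm], times[pm], target)
    x1 = interp_at_sza(sza[am], xco2[am], target)
    x2 = interp_at_sza(sza[pm], xco2[pm], target)
    return t1, t2, x2 - x1


# Hypothetical 90-min averages for one day: local hour, SZA (deg), XCO2 (ppm).
times = [8.0, 9.5, 11.0, 12.5, 14.0, 15.5, 17.0]
sza = [70.0, 55.0, 42.0, 38.0, 42.0, 55.0, 72.0]
xco2 = [395.0, 394.8, 394.6, 394.5, 394.4, 394.2, 394.0]

t1, t2, diff = matched_sza_difference(times, sza, xco2)
print(f"t1={t1:.2f} h, t2={t2:.2f} h, afternoon minus morning={diff:.2f} ppm")
```

The same t1 and t2 would then be used to sample each model, so that the model and TCCON differences refer to identical times.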

Discussion and conclusions
We focus on validating aspects of model and satellite data which may be important for accurate flux estimates and CO2 assimilation, including accurate error estimates, overall biases, biases by season and latitude, impact of coincidence criteria, validation of seasonal cycle phase and amplitude, yearly growth, and daily variability. Our findings can be used for correcting data (e.g. Basu et al., 2013, accounted for global land/sea biases; Nassar et al., 2011, corrected for hemispheric gradients), or their impact can be mitigated by the assimilation method (e.g. the inversion method of Reuter et al., 2014, which is set up to be insensitive to seasonal and regional biases outside the investigated region). To determine the importance of the findings of this paper for flux estimates, each type of bias found in this paper (seasonal biases, location-dependent biases, seasonal cycle differences, seasonal cycle phase differences, and diurnal cycle differences) should be tested using an OSSE to determine its effect on flux biases and flux distribution errors. For example, Kulawik et al. (2013) tested the effect of a NH bias of 0.3-0.5 ppm in JJA, finding flux biases comparable to GOSAT updates in some regions.
We find standard deviations of 0.9, 0.9, 1.7, and 2.1 ppm versus TCCON for CarbonTracker, MACC, GOSAT, and SCIAMACHY, respectively. The GOSAT predicted error should be multiplied by 1.9 and the SCIAMACHY predicted error should be multiplied by 0.9 to represent the actual single target error. The correlation (r) between the actual and predicted errors grouped by station is 0.5 for SCIAMACHY and 0.6 for GOSAT. Equation (2) and Table 3 show how errors decrease when satellite results are averaged and estimate the magnitude of the correlated and random error components for averaged satellite results, where random error components decrease with increasing number of averaged observations. When satellite data are averaged and interpreted according to the model error² = a² + b²/n (where n is the number of observations averaged, a is the systematic (correlated) error, and b is the random (uncorrelated) error), a = 0.6 ± 0.3 ppm and b = 1.7 ± 0.3 ppm for GOSAT, and a = 1.0 ± 0.3 ppm and b = 1.4 ± 0.4 ppm for SCIAMACHY regional averages (dynamic coincidence criteria) in the Northern Hemisphere, correcting for coincidence errors and TCCON errors. SCIAMACHY averaging results in the lowest correlated errors when using dynamic coincidence criteria, where values are averaged from a larger spatiotemporal region, whereas for GOSAT, at the two stations where sufficient data exist (Lamont, Park Falls), geometric coincidence criteria perform better than dynamic coincidence criteria. These data represent averaging of satellite data which matches a single TCCON value.
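As a worked illustration of the error model error² = a² + b²/n, the sketch below evaluates the expected error of an n-observation average using the central values of a and b quoted for GOSAT and SCIAMACHY (uncertainties on a and b are ignored). The function name and the chosen values of n are assumptions for illustration only.

```python
import math


def averaged_error(a, b, n):
    """Expected error (ppm) of an n-observation average: sqrt(a^2 + b^2 / n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.sqrt(a ** 2 + b ** 2 / n)


# Central values (a, b) in ppm for NH regional averages, as quoted in the text.
PARAMS = {"GOSAT": (0.6, 1.7), "SCIAMACHY": (1.0, 1.4)}

for name, (a, b) in PARAMS.items():
    for n in (1, 4, 16, 100):
        print(f"{name:10s} n={n:3d}  error={averaged_error(a, b, n):.2f} ppm")
```

As n grows, the random term b²/n vanishes and the error approaches the correlated component a, which is why averaging alone cannot reduce the error below a.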
The above error model should help in assigning realistic retrieval error correlations in assimilation systems in place of current ad hoc hypotheses (see, e.g., Sect.
preliminary study of how a seasonal bias in JJA in GOSAT of 0.5 ppm in the NH would affect fluxes using a global assimilation showed that the effect was not minor (Kulawik et al., 2013).
The seasonal cycle phase can detect seasonally dependent biases in satellite data and issues with model fluxes or transport errors. We investigate the alignment of the seasonal cycles by offsetting each CO2 set versus TCCON by −60 to +60 days. For satellites, the following stations had adequate data and a high enough signal-to-error ratio to estimate a result: Bialystok, Karlsruhe, Orleans, Garmisch, Park Falls, Four Corners, Lamont, and Izaña (GOSAT only). The GOSAT r.m.s. phase difference versus TCCON is 16.9 days for the prior and 4.7 days for the GOSAT retrieved XCO2, a marked improvement
dropoff (e.g. see Fig. 12), with the peak correlation near 0 days, and an additional spike within ±3 days indicating the capture of synoptic variability. Stations that showed phase differences larger than 10 days are Four Corners (both models), Bremen (both models), Izaña (CT2013b only), Darwin (MACC only), and Lauder 125HR (CT2013b only).
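The offset scan from −60 to +60 days can be sketched as a lagged correlation. This is a minimal illustration under stated assumptions, not the fitting procedure used in the paper: both series are hypothetical dictionaries keyed by integer day, the phase is taken as the offset with the highest Pearson correlation, and a positive offset means the comparison series lags TCCON.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_offset(tccon, other, max_offset=60):
    """Scan day offsets and return (offset, r) with the highest correlation.

    tccon, other: dicts mapping integer day -> XCO2 (ppm). For each offset k,
    TCCON on day d is paired with the other series on day d + k.
    """
    best = None
    for k in range(-max_offset, max_offset + 1):
        days = [d for d in tccon if d + k in other]
        if len(days) < 3:
            continue  # too few coincident days to correlate
        r = pearson([tccon[d] for d in days], [other[d + k] for d in days])
        if best is None or r > best[1]:
            best = (k, r)
    return best


# Synthetic check: an annual sine wave, with the second series lagging by 10 days.
tccon = {d: math.sin(2 * math.pi * d / 365.0) for d in range(730)}
model = {d: math.sin(2 * math.pi * (d - 10) / 365.0) for d in range(-100, 900)}

offset, r = best_offset(tccon, model)
print(f"best offset = {offset} days, r = {r:.3f}")
```

With real data, gaps and noise broaden the correlation peak, so the recovered offset is only as reliable as the signal-to-error ratio at the station allows.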

In studying the variability through a single day, both models correlate with TCCON in their within-day variability, with correlations on the order of 0.2-0.8 for NH stations, about 2/3 of the possible correlation given the errors (except at Bremen and Four Corners, which had little correlation and no slope). The amplitude of the variability is higher in TCCON than in the models, with CT2013b closer to TCCON than MACC.
However, TCCON daily variability has not been validated (there are plans to validate TCCON throughout the day in the near future). The diurnal pattern will not be constrained by satellite observations, except as preserved in transported air coincident with satellite measurements downwind, and it may therefore be important to independently verify the diurnal cycle in models to ensure accurate satellite assimilation results. The importance of the diurnal cycle for flux estimates would need to be tested.
In our analysis a clear picture has emerged of two TCCON stations (Bremen, Four Corners) most influenced by local sources, seen in phase differences versus models, daily variability, and large overall biases. Caution should be used when using these stations for validation. Spatial and seasonal-dependent biases are obstacles to accurate and better resolved CO2 flux estimates. This paper highlights findings that provide inputs to estimate flux errors in model assimilations, and places where models and satellites need additional validation or improvement. Some of the issues which need further investigation are: the GOSAT seasonal cycle in the 46–53° N latitude range (which is 0.9 ppm smaller than TCCON), SCIAMACHY over-predicting the seasonal cycle at Lamont, both models with seasonal cycle differences at the different SH stations, differences in the diurnal cycle amplitude between models and TCCON, and high biases for GOSAT and SCIAMACHY north of 67° N.
