Consistent evaluation of ACOS-GOSAT, BESD-SCIAMACHY, CarbonTracker, and MACC through comparisons to TCCON

Consistent validation of satellite CO2 estimates is a prerequisite for using multiple satellite CO2 measurements for joint flux inversion, and for establishing an accurate long-term atmospheric CO2 data record. Harmonizing satellite CO2 measurements is particularly important since the differences in instruments, observing geometries, sampling strategies, etc. imbue different measurement characteristics in the various satellite CO2 data products. We focus on validating model and satellite observation attributes that impact flux estimates and CO2 assimilation, including accurate error estimates, correlated and random errors, overall biases, biases by season and latitude, the impact of coincidence criteria, validation of seasonal cycle phase and amplitude, yearly growth, and daily variability. We evaluate dryair mole fraction (XCO2) for Greenhouse gases Observing SATellite (GOSAT) (Atmospheric CO2 Observations from Space, ACOS b3.5) and SCanning Imaging Absorption spectroMeter for Atmospheric CHartographY (SCIAMACHY) (Bremen Optimal Estimation DOAS, BESD v2.00.08) as well as the CarbonTracker (CT2013b) simulated CO2 mole fraction fields and the Monitoring Atmospheric Composition and Climate (MACC) CO2 inversion system (v13.1) and Published by Copernicus Publications on behalf of the European Geosciences Union. 684 S. Kulawik et al.: Evaluation of ACOS-GOSAT, BESD-SCIAMACHY, CarbonTracker, and MACC compare these to Total Carbon Column Observing Network (TCCON) observations (GGG2012/2014). We find standard deviations of 0.9, 0.9, 1.7, and 2.1 ppm vs. TCCON for CT2013b, MACC, GOSAT, and SCIAMACHY, respectively, with the single observation errors 1.9 and 0.9 times the predicted errors for GOSAT and SCIAMACHY, respectively. We quantify how satellite error drops with data averaging by interpreting according to error = a+ b/n (with n being the number of observations averaged, a the systematic (correlated) errors, and b the random (uncorrelated) errors). a and b are estimated by satellites, coincidence criteria, and hemisphere. Biases at individual stations have year-to-year variability of∼ 0.3 ppm, with biases larger than the TCCONpredicted bias uncertainty of 0.4 ppm at many stations. We find that GOSAT and CT2013b underpredict the seasonal cycle amplitude in the Northern Hemisphere (NH) between 46 and 53 N, MACC overpredicts between 26 and 37 N, and CT2013b underpredicts the seasonal cycle amplitude in the Southern Hemisphere (SH). The seasonal cycle phase indicates whether a data set or model lags another data set in time. We find that the GOSAT measurements improve the seasonal cycle phase substantially over the prior while SCIAMACHY measurements improve the phase significantly for just two of seven sites. The models reproduce the measured seasonal cycle phase well except for at Lauder_125HR (CT2013b) and Darwin (MACC). We compare the variability within 1 day between TCCON and models in JJA; there is correlation between 0.2 and 0.8 in the NH, with models showing 10–50 % the variability of TCCON at different stations and CT2013b showing more variability than MACC. This paper highlights findings that provide inputs to estimate flux errors in model assimilations, and places where models and satellites need further investigation, e.g., the SH for models and 45–67 N for GOSAT and CT2013b.


Introduction
Carbon-climate feedbacks are a major uncertainty in predicting the climate response to anthropogenic forcing (Friedlingstein et al., 2006).Currently, about 10 Gigatons (Gt) of carbon are emitted per year from human activity (e.g., fossil fuel burning, deforestation), of which about 5 Gt stays in the atmosphere, causing an annual CO 2 increase of approximately 2 ppm yr −1 .The yearly increase is quite variable, estimated at 1.99 ± 0.43 ppm yr −1 (http://www.esrl.noaa.gov/gmd/ccgg/trends/global.html),however always positive (Houghton et al., 2007).The remaining 5 Gt of carbon is taken up by the ocean and the terrestrial biosphere; however, there are uncertainties in the location and mechanism of these sinks, e.g., the distribution of land sinks between the Northern Hemisphere and the tropics (e.g., Stephens et al., 2007), and the localization of sources and sinks on regional scales (Canadell et al., 2011;Baker et al., 2006).The un-certainties in top-down source and sink estimates are a consequence of uncertainties in model transport and dynamics (e.g., Prather et al., 2008;Stephens et al., 2007) and sparseness of available surface-based CO 2 observations (Hungershoefer et al., 2010;Chevallier et al., 2010).Satellites offer a much denser and spatially contiguous data set for top-down estimates, but are much more susceptible to biases as compared to ground-based measurements (e.g., see summary in Sect.3.3.2 of Ciais et al., 2014).
This paper tests different characteristics of model and satellite CO 2 (e.g., seasonal cycle amplitude and phase, regional and seasonal biases, effects of averaging, and diurnal variations) through a series of specialized comparisons to the Total Carbon Column Observing Network (TCCON).The findings from this work can be propagated into assimilation systems to determine the influence of various findings on topdown flux estimates (e.g., see Miller et al., 2007;Deng et al., 2014;Chevallier and O'Dell, 2013;Chevallier et al., 2014).For example, this paper characterizes biases by latitude and season; these biases can be assimilated to determine their effects on flux estimates (e.g., Kulawik et al., 2013).This paper also shows a set of comparisons and tests that may be useful for evaluating bottom-up flux estimates or transport schemes in models.
The remainder of the paper is organized as follows: Sect. 2 describes the satellite X CO 2 data and models used.The various satellite and model X CO 2 data are compared against TC-CON observations in Sect.3. Temporal characteristics of the different X CO 2 data and models, including the seasonal cycle amplitude and phase, are compared in Sect. 4. Diurnal behavior is evaluated in Sect. 5. Section 6 summarizes our findings and discusses their impact on evaluating terrestrial carbon fluxes.

Data and models used
The characteristics of the sets of carbon dioxide that will be compared to TCCON are summarized in Table 1.The following sections contain detailed descriptions of the data set versions and characteristics.

The TCCON
The TCCON consists of ground-based Fourier transform spectrometers (FTSs) that measure high spectral (0.02 cm −1 ) and temporal (∼ 90 s) resolution spectra of the direct sun in the near infrared spectrum (Wunch et al., 2011a).Column abundances of CO 2 , O 2 , and other atmospheric gases are determined from their absorption signatures in the solar spectra using the GGG software package, which employs a nonlinear least squares spectral fitting algorithm to scale an a priori volume mixing ratio profile.Absorption of CO 2 is measured in the weak CO 2 band centered on 6220 and 6339 cm −1 , and of O 2 in the band centered on 7885 cm −1 .The total column dry-air mole fractions of CO 2 (X CO 2 ) are computed by ratioing the column abundances of CO 2 and O 2 .The resulting dry-air mole fractions have been calibrated against profiles of CO 2 measured by WMO-scale instrumentation aboard aircraft (Wunch et al., 2010;Messerschmidt et al., 2011).The precision and accuracy of the TCCON X CO 2 product is ∼ 0.8 ppm (2σ ) after calibration (Wunch et al., 2010).The TCCON data used in this paper are from the GGG2012 and GGG2014 releases, available from http://tccon.ornl.gov/.In this paper we use 90 min average TCCON values which have 1σ precision of 0.4 ppm (see Appendix B).
We use 20 TCCON stations, distributed globally (see Fig. 1), and these data have been used extensively for satellite validation and bias correction (e.g., Butz et al., 2011;Morino et al., 2011;Wunch et al., 2011b;Reuter et al., 2011;Schneising et al., 2012;Oshchepkov et al.;2012), in flux inversions (Chevallier et al., 2011), and in model comparisons (Basu et al., 2011;Saito et al., 2012).We use the GGG2014 data when available, and the GGG2012 data from sites Four Corners and Tsukuba_120HR.Note that the overall bias between the GGG2012 and GGG2014 X CO 2 is 0.3 ppm (with the GGG2014 results smaller) (Wunch et al., 2015).The GGG2012 sites have corrections based on the instructions from the TC-CON partners, listed on the TCCON website (https: //tccon-wiki.caltech.edu/Network_Policy/Data_Use_Policy/Data_Description_GGG2012#Laser_Sampling_Errors).We also apply a 0.9972 factor to Four Corners, as indicated here: https://tccon-wiki.caltech.edu/Network_Policy/Data_Use_Policy/Data_Description_GGG2012.Two instruments have been operated at the Lauder site.We identify them using 120HR (for the period of 20 June 2004 through 28 February 2011) and 125HR (for 2 February 2010 through to the present) when results are instrument-specific.We find biases in the Tsukuba_125HR for the time ranges in this paper (possibly from high humidity combined with solar tracker issues during this time period that were later corrected, and/or bias updates that are needed) and do not use Tsukuba_125HR in this paper.
Stations which have special circumstances regarding validation and are considered locally influenced are Garmisch which is in the midst of complicated terrain, for which local atmospheric transport is difficult to model and to measure from space; Four Corners, which is located in the vicinity of two power plants with large CO 2 emissions (Lindenmaier et al., 2014).The meteorology is such that Four Corners regularly samples large localized plumes with column CO 2 increases of several ppm that last hours in the late morning.Therefore, the low bias in models and satellite data relative to the Four Corners TCCON is attributed to the smaller scale enhancements from the power plants measured in TCCON which are significantly diluted in the model and satellite results; Bremen is also affected by local urban sources, and satellites and models would be expected to be biased low, though it is similar to adjacent stations; and JPL, Pasadena, and Edwards are in or adjacent to a megacity with complex adjacent terrain.The Izaña TCCON station is on Tenerife, a small island (about 50 × 90 km) with complex topography located about 300 km west of southern Morocco.The TC-CON station is located at 2.37 km above sea level (about 770 hPa), whereas models and satellite observations are of the surrounding ocean at about 1013 hPa.These stations are not included in averages (e.g., average bias, average seasonal cycle amplitude differences).In the future, either targeted observations or more spatially resolved models could make better use of these TCCON stations.Although we do not use these stations in averaging, we show results from all TCCON stations in this paper.

GOSAT CO 2
The Greenhouse gases Observing SATellite (GOSAT) takes measurements of reflected sunlight in three shortwave bands with a circular footprint of approximately 10.5 km diameter at nadir (Kuze et al., 2009;Yokota et al., 2009;Crisp et al., 2012).The first useable science measurements were made in April 2009, but due to changing observational modes in the early months, we use data beginning in July 2009.In this work, we use column-averaged dry-air mole fraction (X CO 2 ) retrievals produced by NASA's Atmospheric CO 2 Observations from Space (ACOS) project, version 3.5 (see O'Dell et al. (2012) for retrieval details).For each sounding, the retrieval produces an estimate of X CO 2 , the vertical sensitivity of the measurement (i.e., the averaging kernel), and the posterior uncertainty in X CO 2 .It also produces a number of other retrieval variables, such as surface pressure and aerosols, which are used in both filtering and bias correction.Postretrieval filter is employed based on a number of variables associated with the retrieval.In addition to filtering, a revised bias correction scheme has been developed for the v3.5 retrievals.This scheme is similar to the approach described in Wunch et al. (2011b), which characterized the errors in earlier versions of the ACOS retrieval using a simple spatial uniformity assumption of X CO 2 in the Southern Hemisphere     MACC 13.1 1979MACC 13.1 -2013MACC 13.1 (used 2007MACC 13.1 -2013) ) 3 h 1.9 × 3.75 • (sometimes referred to as the Southern Hemisphere Approximation) to assess errors and biases in the retrievals.V3.5 has corrections of GOSAT high (H ) and medium (M) gain data over land, as well as glint-mode data over the ocean, by using not only the Southern Hemisphere Approximation, but also TCCON observations, and comparisons to an ensemble mean of multiple transport model output.Details of the post-retrieval filter and the bias correction scheme can be found in the ACOS v3.5 user's guide which will soon be at https://co2.jpl.nasa.gov/.

SCIAMACHY CO 2
The following description of the SCanning Imaging Absorption spectroMeter for Atmospheric CHartographY (SCIA-MACHY) CO 2 retrieval algorithm summarizes important aspects of Reuter et al. (2010 and2011) and is adopted in parts from the algorithm theoretical basis document (Reuter et al., 2012b).
The Bremen Optimal Estimation DOAS (BESD) algorithm is designed to analyze SCIAMACHY sun normalized radiance measurements to retrieve the column-averaged dryair mole fraction of atmospheric carbon dioxide (X CO 2 ).BESD is a so-called full physics algorithm, which uses a two-band retrieval, with the O 2 A absorption band used to retrieve scattering information of clouds and aerosols, while the 1580 nm band additionally contains CO 2 information.Similar to the ACOS three-band retrieval for GOSAT, the explicit consideration of scattering by this approach reduces potential systematic biases due to clouds or aerosols.
The retrieved 26-elements state vector consists of a second-order polynomial of the surface spectral albedo in both fit windows, two instrument parameters (spectral shift and slit functions full width at half maximum in both fit windows, described in Reuter et al., 2010), a temperature profile shift, a scaling of the H 2 O profile and a default aerosol profile, cloud water/ice path, cloud top height, surface pressure, and a 10-layer CO 2 mixing ratio profile.Even though the number of state vector elements ( 26) is smaller than the number of measurement vector elements (134), the inversion problem is generally underdetermined, especially for the CO 2 profile.For this reason BESD uses a priori knowledge as a side constraint.However, for most of the state vector elements the a priori knowledge gives only a weak constraint and therefore does not dominate the retrieval results.The degree of freedom for X CO 2 typically lies within an interval between 0.9 and 1.1.
A post-processor adjusts the retrieved X CO 2 to a priori CO 2 profiles generated with the simple empirical CO 2 model described by Reuter et al. (2012a).Additionally the postprocessor performs quality filtering and bias correction.The bias correction is based on, e.g., convergence, fit residuals, and error reduction.The bias correction follows the idea of Wunch et al. (2011b) using TCCON as a reference to derive an empirical bias model depending on solar zenith angle, retrieved albedo, etc.The theoretical predicted errors have been scaled to agree with the errors vs. TCCON (Reuter et al., 2011).More details can be found in BESD's algorithm theoretical basis document (Reuter et al., 2012b).

CarbonTracker
CarbonTracker (CT) is an annually updated analysis of atmospheric carbon dioxide distributions and the surface fluxes that create them (Peters et al., 2007).CarbonTracker uses the Transport Model 5 (TM5) offline atmospheric tracer transport model (Krol et al., 2005) driven by meteorology from the European Centre for Medium-Range Weather Forecasts (ECMWF) operational forecast model and from the ERA-Interim reanalysis (Dee et al., 2011) to propagate surface emissions.TM5 runs at a global 3 • × 2 • resolution and at a 1 • × 1 • resolution over North America.CarbonTracker separately propagates signals from fossil fuel emissions, air-sea CO 2 exchange, and terrestrial fluxes from wildfire emissions and non-fire net ecosystem exchange.Similar to other existing CO 2 inverse models, oceanic and terrestrial biosphere surface fluxes are optimized to agree with atmospheric CO 2 observations, while fossil fuel and wildfire emissions are specified.First-guess fluxes from terrestrial biosphere models and surface ocean carbon analyses are modified by applying weekly multiplicative scaling factors estimated for 126 land and 30 ocean regions using an ensemble Kalman fil-ter optimization method.The CT2013b release of Carbon-Tracker assimilates in situ observations between 2000 and 2012 from 103 data sets around the world, including time series from NOAA observatories, tall towers, and flasks sampled by the NOAA Cooperative Air Sampling Network, and flask and continuous measurements from partners including Environment Canada, the Australian Commonwealth Scientific and Industrial Research Organization, the National Center for Atmospheric Research, the Lawrence Berkeley National Laboratory, and the Brazilian Instituto de Pesquisas Energéticas e Nucleares.
In order to explicitly quantify the impact of transport uncertainty and prior flux model bias on inverse flux estimates from CarbonTracker, the CT2013b release is composed of a suite of inversions, each using a different combination of prior flux models and parent meteorological model.Sixteen independent inversions were conducted, using two terrestrial biosphere flux priors, two air-sea CO 2 exchange flux priors, two estimates of imposed fossil fuel emissions, and two transport estimates in a factorial design.CT2013b results are presented as the performance-weighted mean of the inversion suite, with uncertainties including a component of across-model differences.All CarbonTracker results and complete documentation can be accessed online at http://carbontracker.noaa.gov.

MACC
Monitoring Atmospheric Composition and Climate (MACC, http://www.copernicus-atmosphere.eu/) was the European Union-funded project responsible for the development of the pre-operational Copernicus atmosphere monitoring service.MACC monitors the global distributions of greenhouse gases, aerosols, and reactive gases, and estimates some of their sources and sinks.Since 2010, it has been delivering an analysis of the carbon dioxide in the atmosphere and of its surface fluxes every year, based on the assimilation of air sample mole fraction measurements (Chevallier et al., 2010).It relies on a variational inversion formulation, developed by LSCE, that estimates 8-day grid point daytime/nighttime CO 2 fluxes and the grid point total columns of CO 2 at the initial time step of the inversion window.The Bayesian error statistics of the estimate are computed by a robust randomization approach.The MACC inversion scheme relies on the global tracer transport model Laboratoire de Météorologie Dynamique (LMDZ; Hourdin et al., 2006), driven by the wind analyses from the ECMWF.For release v13.1 of the MACC inversion, used here, LMDZ was run at a horizontal resolution 3.75 • longitude × 1.9 • latitude with 39 vertical layers.The other elements of the inversion configuration follow Chevallier et al. (2011), with climatological (i.e., not interannually varying), terrestrial, and ocean prior fluxes and interannually varying fossil fuel and biomass burning emissions.The variational formulation of the inversion allowed the 1979-2013 period to be processed in a single assimilation window, therefore ensuring the physical and statistical consistency of the inversion over the full 35-year period.Mole fraction records from 131 measurement sites have been used from the NOAA Earth System Research Laboratory archive, the World Data Centre for Greenhouse Gases archive, and the Réseau Atmosphérique de Mesure des Composés à Effet de Serre (RAMCES) database (see the list in the Supplement of Peylin et al., 2013).

Direct comparisons to TCCON
We show comparisons between satellite X CO 2 , modelsimulated mole fraction fields, and TCCON X CO 2 at 25 different TCCON sites, shown in Fig. 1.These sites span the northern and southern hemispheres and cover a wide range of latitudes and longitudes.In our analysis we mainly use stations with start dates in or before 2012 which have coverage in all seasons so that seasonal cycle fits, which require 2 years of data, can be made.Newer sites are not used for SCIA-MACHY comparisons, as this data set ends in mid-2012 and our version of CarbonTracker (CT2013b) ends at the end of 2013.We show representative time series for CT2013b, MACC, GOSAT, and SCIAMACHY for a Northern Hemisphere site (Lamont, OK, US at 37 • N) and two Southern Hemisphere sites (Lauder, NZ at 45 • S or Wollongong, New South Wales, Australia at 35 • S) in Fig. 2. Lauder is shown for the models to show the phase lag for CT2013b seen at this site in Table 5.Since there are not enough coincidences for the satellites at Lauder, we show Wollongong for satellites.These plots show matches using the geometric coincidence criteria described in Table 2 below (for satellites) and give an idea of the number of coincidences for each data set using these criterion.As only TCCON/satellite-matched pairs are shown, different subsets of TCCON are included for models and the two satellites.These sites were chosen as they have the most coincidences in the northern and southern hemispheres, respectively, for satellites.All sets compare well; the 30-day moving averages show differences most easily, such as a repeating blip in CT2013b comparisons at the summer drawdown at Lamont and a seasonal mismatch in CT2013b comparisons to Lauder, which will be discussed later in the paper.

Coincidence criteria and other matching details
The SCIAMACHY and GOSAT comparisons in this paper are based on two different definitions of coincidence criteria between TCCON and satellite data.Satellite measurements, which satisfy the so-called geometric criteria, are within ±1 h, ±5 • latitude and longitude of an unaveraged TCCON observation.Following the match-ups, all TCCON observations matching one satellite observation which are within 90 min are averaged, reducing the TCCON random error.The dynamical criteria (Wunch et al., 2011b;Keppel-Aleks et al., 2011, 2012) are designed to exploit informa- tion about the dynamical origin of an air parcel through a constraint on the free-tropospheric temperature.This allows us to relax the geometric constraints and find more coincident satellite soundings per TCCON measurement.Briefly, a match is found when the measurements are within 5 days and the following is satisfied: where Temperature is the co-located NCEP temperature difference at 700 hPa (Kalnay et al., 1996).Matches are found with unaveraged TCCON data; TCCON observations matching a single satellite observation are averaged within 90 min intervals.Table A2 summarizes the coincidence criteria and data versions that are used.Other matching schemes The choices used in this paper regarding model/TCCON match-ups are linearly interpolating to the TCCON latitude, longitude, and time for the models, and using the TCCON surface pressure for calculating X CO 2 .In the cases where the TCCON surface pressure is greater than the model surface pressure, the model surface CO 2 value is replicated to the missing pressure values.When comparing models vs. TC-  2), with stations having data out to at least n = 50 for dynamic coincidence criteria, n = 10 (GOSAT) or n = 40 (SCIAMACHY) for geometric coincidence criteria.The CT-CT column describes the CT2013b coincidence error, providing a lower bound on the satellite/TCCON differences.The subtr.co-location error row estimates the correlated error for satellites only, subtracting the quadrature TCCON error (0.35 ppm) and co-location error from the CT-CT column.
Mean SH: subtr.co-loc.error 0.9 ± 0.  CON, the TCCON averaging kernel is applied to the model.When using the model to assess satellite coincidence error, the satellite averaging kernel is applied to the model at the satellite and TCCON coincidences.Note that the TCCON averaging kernel cannot be applied to satellite data and vice versa because a profile of higher resolution than the comparison observation is needed to apply an averaging kernel.In the satellite/TCCON comparison, both products have ∼ 1 • of freedom.(The satellites do initially retrieve a profile with ∼ 1.6 • of freedom for GOSAT, but the profile is not what we are validating.)Because of the earth's curvature, high-latitude sites could have relaxed coincidence in longitude, particularly for geometric coincidence.However stations north of 60 • N have gaps in the winter months in the satellite record such that relaxed criteria do not add additional stations to the analysis.

Bias and standard deviation for individual matches
Figure 3 shows a summary of the comparisons for geometric criteria where satellite matches are not averaged.Averaging and the effects of coincidence criteria and satellite averaging are addressed in Sect.3.4.The black box shows five European stations which are very close, geographically, yet have different biases.The gray bars labeled TCCON bias uncertainty in Fig. 3 signify the overall calibration uncertainty in TCCON which is estimated to be 0.4 ppm (Wunch et al., 2010(Wunch et al., , 2011a)).The significance of the bias vs. TCCON is estimated by the 5 % t test (looking up bias/standard deviation times the square root of the number of comparisons in a t test table).The gray box is the uncertainty in the TCCON station-to-station bias and all comparisons with |bias| > 0.4 were found by the t test to be significant.Therefore, from the above two pieces of information, all biases differing from TCCON more than 0.4 ppm are significantly different than TCCON.For GOSAT, biases larger than the TCCON bias uncertainty occur at Sodankylä(+), Karlsruhe(+), Lamont(−), Tsukuba(+), and Wollongong(+).SCIAMACHY has the same outliers as GOSAT with additional biases at Bialystok(+), Orléans(+), Park Falls(+), and Lauder(+).The sites with identified local influences and the general bias with respect to these sites are Bremen(+), Four Corners(−), JPL2007(−), and JPL2010(−).The stations which have complex conditions are Garmisch and Izaña, which show similar biases to surrounding stations.For unaveraged results, model standard deviations are lower than either satellite as satellite differences result from both systematic and random measurement error, the latter which does not occur for models.The standard deviations show some variability from station to station which are investigated below.The effects of averaging and coincident criteria are investigated in Sect.3.3.
Figure 4 shows the biases and standard deviations grouped globally and over the northern and southern hemispheres.To estimate the overall bias and standard deviations for single observations, we take out the outliers as follows.
As described in Sect.2.1, we take out JPL, Four Corners, Bremen, Garmisch, and Izaña for averaging.For satellites, we remove the above plus Tsukuba and Lauder due to limited numbers of comparisons for SCIAMACHY.For the mean NH bias, we take out stations poleward of 60 • N, which show a large positive bias for GOSAT and SCIA-MACHY.There is an overall bias vs. TCCON on the order of 0.7 ppm for CT2013b, and 0.2-0.3ppm for the other three sets.The overall bias is less of a concern than the bias variability in satellite data which indicates regional errors that will translate to regional errors in flux estimates.The bias variability is 0.4, 0.4, 0.5, and 0.3 ppm for CT2013b, MACC, SCIAMACHY non-polar, and GOSAT non-polar respectively.Note SCIAMACHY data are corrected to have an average zero bias with respect to TCCON GGG2012 which is 0.3 ppm higher than GGG2014 (https://tccon-wiki.caltech.edu/Network_Policy/Data_Use_Policy/Data_Description# CORRECTIONS_AND_CALIBRATIONS).The overall standard deviations are 0.9 ppm for CT2013b, 0.9 ppm for MACC, 2.1 ppm for SCIAMACHY (all latitudes), and 1.7 ppm for GOSAT (all latitudes).These values represent the overall performance of CT2013b, MACC, and single soundings from SCIAMACHY and GOSAT.The standard deviations are somewhat lower for the SH stations used, with values of 0.8, 0.8, 2.0, and 1.6 for CT2013b, MACC, SCIA-MACHY, and GOSAT, respectively.Reuter et al. (2013) validated earlier retrieval versions of BESD-SCIAMACHY and ACOS-GOSAT with TCCON and found 2.1 ppm (BESD) and 2.3 ppm (ACOS) for the single sounding precision and 0.9 ppm for the station-to-station biases.Their findings for BESD are consistent with the findings of Dils et al. (2014).The station-to-station biases are lower in our analysis due to corrections in TCCON, improvements in satellite estimates, and removal of several stations from the estimates.The polar station Sondankyla north of 67 • N is compared to GOSAT and SCIAMACHY.A similar standard deviation is found vs. other TCCON stations, but a higher bias, 1.6 ppm for GOSAT and SCIAMACHY.Most of the remaining analysis in the paper does not have enough coincidences at high latitudes so it is important to note this result.
We test whether the biases seen in Figs. 3 and 4 are persistent from year to year.When at least two full-year averages exist for a station, the standard deviation of the yearly bias is calculated.The average over all stations of the yearly bias standard deviation is 0.3 ppm for all sets (CT2013b, MACC, SCIAMACHY, GOSAT).
Another important comparison is of the predicted and actual errors.The predicted error (also referred to as the a posteriori error) is reported for each satellite product and the actual error we take to be the standard deviation of the satellite observation vs. TCCON.These two quantities should agree if the TCCON error is much smaller than the a posteriori error and the coincidence criteria do not degrade the agreement.The predicted and actual errors vary from site to site, e.g., from variations in albedo, aerosol composition, and solar zenith angle.We calculate the correlation between the standard deviation vs. TCCON and the predicted error for each site as follows: the standard deviation of the satellite vs. TCCON is calculated at each TCCON station.The correlation of the standard deviation and predicted errors by station are calculated.ACOS-GOSAT has a 0.6 correlation and BESD-SCIAMACHY has a 0.5 correlation.This indicates that the predicted error should be utilized, e.g., when assimilating ACOS-GOSAT, as the variability in the predicted error represents variability in the actual error, though not perfectly.A scale factor should also be applied to the predicted errors.For ACOS-GOSAT the predicted error averaged over all TCCON sites is 0.9 ppm, as compared to the actual error of 1.7 ppm and can be corrected by applying a factor 1.9 to the reported GOSAT errors.For BESD-SCIAMACHY, the prediction error of 2.3 ppm multiplied by 0.9 agrees with the 2.1 ppm actual error.

Errors as a function of coincidence criteria and averaging
We now directly compare performance of geometric and dynamic coincidence criteria and averaging in terms of error.
Figure 5 shows SCIAMACHY and GOSAT standard deviations vs. TCCON for geometric and dynamical coincidence criteria in the Northern Hemisphere.The stations used were those that had entries for all comparisons, listed in the Fig. 5 caption.For n = 1 no averaging is done and the dynamic coincidence criteria perform similarly to the geometric criteria, though the dynamic error is ∼ 0.1 ppm higher for both satellites.For n = 2, exactly two satellite observations were averaged for each coincidence.The error drops substantially, but not as 1/ √ 2, which would be expected if the error were uncorrelated.For n = 4, the error again drops but it is not half the n = 1 error, which is shown by the dotted line.At n = 4, the dynamic coincidence error is the same as the geometric error for SCIAMACHY, likely because dynamic coincidence involves averaging observations farther apart in location and time, which are less likely to have correlated errors.The last bar is the maximum n, which has results for all stations included.The dynamic criteria allow far more coincidences, resulting in significantly lower average errors.Note that all averaged satellite observations match one particular TCCON observation.

Errors vs. averaging: random and correlated error
To test the effects of spatial averaging, we calculate station by station standard deviations of satellite-TCCON matched pairs as a function of n, where n is the number of satellite observations that are averaged, which are chosen randomly from available matches (so there should be no difference in the characteristics of chosen points for larger vs. smaller n). Figure 6 shows plots from Lamont for SCIAMACHY and GOSAT for standard deviation difference to TCCON vs. n.Initially the error drops down rapidly with n, however the decrease slows with larger n.The curve fits well to the theoretically expected form: where a represents correlated errors which do not decrease with averaging for similar cases (including smoothing errors, errors from interferents such as aerosols, TCCON error, and co-location error), b represents uncorrelated errors which decrease with averaging, and n represents the number of satellite observations that are averaged.The purple (2) (black).For GOSAT the uncorrected data are also fit (black dashed).The initial guess minus TCCON standard deviation is shown as a green dashed line.We see that in this case, for GOSAT at Lamont, averaging more than about four observations improves over the initial guess.
dashed line represents the standard deviation of CT2013b at the satellite time and location vs. CT2013b at the TCCON time and location and represents a lower bound of the spatiotemporal mismatch error (co-location error).As expected, this value is much smaller for geometric than for our dynamic coincidence criteria (other dynamic coincidence criteria may do better; in our case the dynamic criteria consider points ±30 • longitude, ±10 • latitude, ±5 days, and ±2 K temperature vs. 1 h, 5 • for geometric criteria).The co-location error for large n is shown in the CT-CT columns of Table 2, with Northern Hemisphere averages of 0.3, 0.6, 0.4, and 0.7 ppm, for SCIAMACHY geometric, SCIAMACHY dynamic, GOSAT geometric, and GOSAT dynamic coincidences, respectively.There is not much difference between the estimated co-location error at North American sites of Lamont and Park Falls where the CT2013b model is at 1 × 1 • vs. other Northern Hemisphere sites where the CT2013b model is at 3 × 2 • .The co-location error is subtracted from the a value in quadrature to estimate results without co-location error.
We calculate a and b by station in Table 2, splitting out geometric and dynamic coincidence criteria.The a term, which does not reduce with averaging, is the correlated error, and the b term, which reduces with averaging, is the uncorrelated error.There is more correlated error, a, for SCIA-MACHY geometric vs. dynamic matches in 5/6 stations in the Northern Hemisphere, indicating that averaging is more effective when it is over a larger spatial/temporal area, probably due to variability in the source of the correlated errors.GOSAT geometric vs. dynamic averages show larger cor-related error for 4/8 stations.The GOSAT dynamic correlated error of 0.9 ppm is significantly reduced by subtraction of the 0.7 ppm co-location error, whereas the 0.9 ppm geometric error is not strongly changed by the subtraction of 0.4 ppm co-location error.Subtracting (in quadrature) the colocation error (CT-CT column in Table 2) and TCCON error of 0.35 ppm (Appendix B) results in the corrected correlated error, a, shown in the mean NH: subtr.co-location error row of Table 2.
The green dashed line in Fig. 6 shows the standard deviation of the satellite prior vs. TCCON.Although using an optimal constraint will result in an error lower than the prior error in the absence of systematic errors, these satellite retrievals of CO 2 have been set up to value average results over single observations; while the error increases from the prior for a single observation, average results have both less error and minimal prior influence.

Seasonally dependent biases
It is important to determine whether there are seasonally dependent biases, as these will impact flux distributions.We look at 3-month periods (DJF, MAM, JJA, SON), with the overall yearly bias at each site subtracted out to isolate the seasonal biases.To get enough comparisons, we use the dynamical criteria for satellite coincidences, as using the geometric criteria cuts down the comparisons with sufficient seasonal coverage to three stations (Park Falls, Lamont, and Wollongong).This is a simple averaging method which will later be compared to seasonal cycle amplitude fit results.
Figure 7 shows the biases for stations that have at least 20 matches in each season, and Fig. 8 shows the results averaged by SH, 0-45 • N (which includes Tsukuba and Lamont), > 45 • N. The error bars shown in Fig. 8 are the standard deviation of the results in each bin, once the station-dependent average biases are subtracted.Significance was tested using the t test with biases larger at least 0.2 ppm.For the north midlatitudes, the seasonal cycle peak is in MAM and the minimum is in JJA, so the seasonal cycle error should be approximately bias (MAM) minus bias (JJA).For Lamont and Tsukuba in JJA, SCIAMACHY is biased low vs. TC-CON by about 1.4 ppm (the low bias is 0.8 ppm for Lamont).In 45-53 • N, SCIAMACHY is biased low vs. TCCON in MAM and high in JJA, with about a 0.9 ppm spread.GOSAT has the same pattern with a 0.5 ppm spread.SCIAMACHY is biased low vs. TCCON in DJF by 0.7 ppm, but this does not affect the seasonal cycle amplitude.GOSAT is similarly low but has more station-dependent variability so is not considered significant.CT2013b is biased high vs.TCCON in JJA by about 0.2 ppm and biased low vs. TCCON in DJF by about 0.3 ppm.MACC is biased low vs. TCCON in MAM by 0.2 ppm in 28-37 • N. In the Southern Hemisphere, the seasonal cycle maximum is in the October to January time frame, and the minimum is in the April time frame.The significant findings in the SH are that CT2013b is biased about 0.3 ppm low vs. TCCON in DJF and biased 0.2 ppm high in MAM, which should lead to a 0.5 ppm underestimate of the SH seasonal cycle.Geometric results, which including fewer stations due to the cutoff of at least 20 observations per season, corroborate the low bias for in MAM, high bias in JJA, and low bias in DJF in 45-53 • N for SCIAMACHY, and the low bias in MAM and high bias in JJA in 45-53 • N for GOSAT.

Comparisons of seasonal cycle amplitude, phase, and yearly increase
We compare model and satellite X CO 2 to TCCON using the NOAA fitting software CCGCRV (Thoning et al., 1989) to calculate seasonal cycle amplitudes and yearly increases.At least 2 years are needed to distinguish the seasonal cycle from the yearly increase.The errors are calculated using the bootstrap method (Rubin, 1981;Efron, 1979), the standard deviation of differences within one bin divided by the square root of the number of stations minus 1 (n > 1), the variability of results when choosing different averaging (for satellites) where one, two or four satellite observations are averaged for GOSAT and SCIAMACHY, coincidence errors estimated by the mean difference between CT2013b matched to the satellite and CT2013b matched to TCCON, and the variability resulting from small changes in the time range.Two data sets at a time are matched, using the dynamic criteria, with the satellite averaging four observations for SCIAMACHY and two for GOSAT, which reduces the fit errors.Since different data sets will have different data gaps and time ranges, the TC-CON results will be somewhat different for each comparison.Plots are individually examined to ensure that there are adequate data.Stations are removed when there are large errors for any of the above errors, and locally influenced stations are also removed (see Sect. 2.1).The stations that were excluded based on variability of results for changes in the time range on the order of a few months are Eureka for the models and Sodankylä, Bialystok, Saga, and Lauder_125HR for the satellites.At Bialystok, for example, the GOSAT-TCCON seasonal cycle was 0.3 or −0.9 ppm depending on whether the time series ended in November 2013 or on 1 January 2014, respectively.The time series shows large gaps in the data in the winter and a gap around the seasonal cycle minimum in 2012.The findings show that the GOSAT seasonal cycle amplitude at Bialystok may be influenced by a small subset of the GOSAT points.

Seasonal cycle amplitude
The seasonal cycle amplitude is important for estimating source and sink estimates and global distributions.Table 3 shows the seasonal cycle amplitudes grouped by latitude.As described in Sect.4, the error is calculated by several different methods.The error in The significant findings from Table 3 are as follows.
(1) In northern latitudes (46-53 • N), GOSAT underestimates the seasonal cycle by 0.4 ppm.This latitudinal range is composed of two European sites and Park Falls, all of which follows this pattern, so it is not just in Europe but also North America at this latitude.This finding is consistent with Lindqvist et al. (2015).The simple bias calculation from Sect.3.5 estimates that GOSAT should underestimate the seasonal cycle by 0.5 ppm, with half the issue being a low bias in MAM and half the issue being a high bias in JJA.The co-location error is estimated as +0.1 ppm, so removing the co-location error should increase the GOSAT-TCCON discrepancy.The geometric criteria bias for this same bin is −0.7 ± 0.4 ppm, which is consistent with the dynamic results, though with larger error bars and significantly sparser time series plots.(2) There is an underestimate in the CT2013b seasonal cycle for 45-53 • N compared to TCCON by 0.2 ppm.This is a small discrepancy but the error bars are also small.Similarly in Sect.3.5, a 0.2 ppm low bias is predicted.(3) There is an overestimate in the MACC seasonal cycle in the 28-37 • N range by 0.6 ppm.A smaller 0.2 ppm low bias is seen in MAM for MACC.(4) CT2013b underestimates the SH seasonal cycle by 0.5 ppm.This is partially from a low bias in DJF and partly from a high bias in MAM.
Findings from Sect.3.5 which did not reach significance in Table 3 are as follows.(1) SCIAMACHY should underestimate the seasonal cycle in 45-53 • N by 2.5 ppm, which it does not.There is a 0.4 ppm high bias with large error bars.Comparisons vs. TCCON may be different than Table 4, since Table 4 has very close criteria for models and dynamic coincidence criteria for satellite data.The models are sampled at GOSAT observations.
(2) SCIAMACHY should overestimate the seasonal cycle for Lamont by 1.6 ppm.In Table 3, SCIAMACHY overestimates by 0.8 ppm but statistical significance is not calculated for one station.
(3) MACC should underestimate the seasonal cycle by 0.2 ppm in 28-37 • N. In Table 3 MACC overestimates the seasonal cycle by 0.6 ppm.The seasonal biases are generally consistent with the seasonal cycle amplitude differences vs. TCCON and can be used to pinpoint which months are the cause of the seasonal cycle amplitude differences vs. TCCON; however, the mean seasonal biases find more significant differences than the seasonal cycle fits.
Figure 9 shows a global map of fits of the seasonal cycle amplitude of SCIAMACHY, GOSAT, CT2013b, and MACC, with TCCON having at least 2 years of matches shown by circles.This map shows how the results of Table 4 fit into the global pattern (with the model fields matched to GOSAT locations and times).Interestingly, the seasonal cycle amplitude varies longitudinally; this pattern is seen in both satellite data sets and both models.Since the amplitude is taken from the sampled harmonic there is no extrapolation, although the seasonal cycle will be underpredicted at high latitudes where there are significant data gaps.The model data in Fig. 9 is colocated with GOSAT, so the same gaps will occur in GOSAT and the two models, other than fit errors larger than 10 % of the amplitude, were screened out.This map is consistent with Lindqvist et al. (2015), Fig. 8, which also finds high values in the 45-50 • N, 120-180 • E range.

CO 2 yearly growth rate
The same fitting program in the above section, CCGCRV, also calculates a yearly increase.In Table 4 we compare the fitted yearly increase for TCCON, which ranges from 1.92 to 2.55 ppm yr −1 for the different stations and time ranges, to each of the data sets.None of the TCCON/satellite or TCCON/model differences reach significance for the t test, which is done when there is more than one result per bin.However, the GOSAT Lamont station is low at −0.3 ppm yr −1 , with −0.1 of this estimated to result from co-location error.creases 2006-2014, and 1.76, 2.22, 1.60, 1.89, 2.42, 1.86, 2.63, 2.06, 2.17 ppm yr −1 for Mauna Loa yearly increases 2006-2014, with error bars 0.05-0.09ppm yr −1 for global and 0.11 ppm yr −1 for Mauna Loa.The average global yearly increase predicted from the above using the time periods in Table 4 is shown in the last column of Table 4.The correlation r value between the yearly increase in TCCON and the above ESRL global yearly increase is 0.61, and the correlation with the Mauna Loa yearly increase is 0.67, whereas the correlation r value between the yearly increase in TC-CON and the yearly increase in columns is 0.82.Therefore, the variability seen in Table 4 is partly from the time range used but also partly from variations due to geographic location (and sampling).Reuter et al. (2011, JGR, Table 2) found agreement within the calculated errors at Park Falls and Darwin for BESD-SCIAMACHY and CT2009 vs. TCCON.However, older data sets were used for this result.Looking specifically at Park Falls, we see 1.80 ± 0.14 and 2.10 ± 0.22 for SCIAMACHY and TCCON, respectively and at Darwin 1.67 ± 0.08 and 2.16 ± 0.05 for SCIAMACHY and TCCON, respectively, where the errors represent the standard deviation of SCIAMACHY fits for similar latitudes.

Seasonal cycle phase
This section looks at the time offset correlation and standard deviation between the test data sets and TCCON.This checks whether, for example, a seasonal cycle is delayed or ahead of the TCCON seasonal cycle, which has important implications for flux estimates (Keppel-Aleks et al., 2012), whether there are seasonally dependent biases that are affecting the seasonal cycle, and whether the data sets are seeing the same seasonal cycle.
To compare seasonal cycle amplitudes, all data sets have 2 ppm yr −1 subtracted to approximately remove secular increases (over the ±60 days' offset this has a very small effect).For a 0 day offset, the data sets are matched as usual.For a 1 day offset, TCCON is moved forward by 1 day and compared to the data set.This is repeated for all offset times.Correlations are fit to a second-order polynomial to determine the phase minimum difference.As TCCON is moved forward or backward in time, different points will match up, particularly when there are data gaps in either data set.This can cause difficulties in interpretation.The maximum correlation is limited by the ratio of the error to the variability.It follows from the definition of correlation that where corr o is the noise-free correlation, ε x is the error on x and σ x is the true variability for x, ε y is the error on y and σ y is the true variability for y.Because we are estimating σ and ε, there is uncertainty on the correlation maximum.In our case σ y is taken to be the TCCON variability and ε y is esti-   mated using Table 2 with 10 SCIAMACHY and 4 GOSAT averages.The error bars on the correlations are calculated from Fisher's z test.(Fisher, 1915(Fisher, , 1921)).Figure 10 shows SCIAMACHY and GOSAT results at Park Falls.Although the prior performs well in regards to the standard deviation vs. TCCON, Fig. 10 shows the prior has a clear seasonal cycle phase error which is corrected by the satellite retrievals for both SCIAMACHY and GOSAT at Park Falls.
Results of the seasonal cycle phase error are tabulated in Table 5. Stations not shown have either too few match-ups (e.g., Sodankylä) or too little variability compared to the noise (e.g., Wollongong) to have useful comparisons.The GOSAT retrieval markedly improves over the prior seasonal cycle phase vs. TCCON at 12 out of 13 stations.For the six stations that are not locally influenced (with no a by the station name), GOSAT improves over the prior at all six stations, and additionally has a smaller phase difference than one or more of the models at four of the six stations, is the same at one station, and is worse at one station.SCIA-MACHY improves over its prior for Park Falls and Four Corners, mildly improves at Karlsruhe, and stays the same or gets slightly worse in four cases.Mismatches in SCIA-MACHY phase could be from mismatches in vertical sensitivity (as higher altitudes have lagged seasonal cycles), effects of coincidence criteria, or seasonal-dependent biases.
To check the coincidence criteria, cross-correlations were calculated for the geometric coincidence criteria which had significantly fewer match-ups.Similar results for geometric coincidence criteria were found for GOSAT and SCIA-MACHY for Lamont and Park Falls; the other stations are too noisy to draw conclusions.
Table 5 also shows the phase differences for the models, which have closer spatial/temporal matches and lower singlematch-up errors.Model-TCCON phase differences could result from errors in model flux distributions, seasonal timing, or transport errors.Table 5 shows the phase differences, which vary from −20 to +10 days.Phase differences more than 10 days are noticeable by eye when looking at time series data and occur in the NH at Bremen, Four Corners (negative), Orléans, and Izaña (positive, CT2013b only), although Bremen, Four Corners, and Izaña are locally influenced.Large phase differences also occur at some stations Table 6.Diurnal variability of CT2013b and MACC13.1 vs. TCCON in JJA arranged by latitude.TCCON variability and maximum theoretical correlation are shown, as well as actual correlation and slope for both models.The slope is the mode vs. TCCON fit to a straight line.a indicates sites expected to be strongly influenced by local sources.Max correlation is calculated using Eq. ( 3), TCCON SD, and the lower of the model-TCCON standard deviations from Appendix B. TCCON range, SD, shows the range of the values seen for the daily variability for TCCON, as well as the standard deviation of the TCCON daily differences.The average row gives averages for all entries above it.The average rows, e.g., DJF, are for stations Bialystok, Orléans, Park Falls, and Lamont.Eureka and Karlsruhe are not included because fewer than five matches were found.in the Southern Hemisphere.Although the seasonal cycle is weaker in the Southern Hemisphere, the phase offset can be seen in the CT2013b Lauder_125HR plots in Fig. 2. The correlations vs. offset days show a phase difference of −20 days for CT2013b and +0 days for MACC at Lauder_125HR, as seen in Fig. 11.Note that the fits of the seasonal cycle in the SH display more complexity, such as multiple local maximum, than fits in the NH and "phase lag" could easily be an indication of an issue with the fit shape.Figure 11 shows correlations and standard deviations vs. day offset for three stations that have the seasonal cycle peak within ±10 days for CT2013b and MACC (top panels), and for stations which have a larger phase lag compared to TCCON (bottom panels).There is often a small peak within ±3 days, which indicates the models' capability of picking up variations that occur day to day (i.e., synoptic-scale variability), which indicates both the strength of synoptic activity and matching between models and TCCON.This peak is not seen in satellite data for dynamic coincidence criteria likely due to matching, or geometric coincidence criteria likely due to the noise -not that this synoptic peak occurs at 0, even when the seasonal cycle has a phase lag (e.g., MACC model at Bremen, in the lower right panel, or Lauder_125HR comparisons).The synoptic-scale correlation varies between 0 and 0.17, as seen in Table 5. Izaña will be briefly discussed.As noted in Sect.2.1, the TCCON station is located on a small island at 2.37 km above sea level (about 770 hPa), whereas the MACC and CT2013b models at ∼ 2 • × 3 • resolution do not resolve topography and consequently have mean surface pressure at sea level, about 1013 hPa at this location.Our standard treatment is to interpolate the model to the TCCON pressure grid, then calculate X CO 2 using the TCCON pressure weighting function.At Izaña this has the effect of chopping off the lower atmosphere.The CT2013b result for this treatment has a +10 day seasonal cycle phase difference at Izaña; whereas MACC has no phase difference at Izaña.If, however, the model surface pressure is used to calculate X CO 2 , MACC goes from a 0 to a −10 day phase lag, and CT2013b has 0 phase difference.An argument for using the model surface pressure would be if the upslope winds at Izaña (Sancho et al., 1991;Bergamaschi et al., 2000) shifted the profile upwards rather than chopping it off, which would occur if the air deviated around the island instead.This finding has important implications for the choice of the comparison methodology and the ideal location for validation sites.Validation sites within complex geographical terrain have to be treated as special cases as (a) the atmospheric models usually do not resolve these variations and (b) satellite measurements rarely have a perfect co-location with the ground-based site, meaning that they could sample a substantially different altitude level.This holds for both mountains (e.g., Izaña) and valleys (e.g., Garmisch).This highlights one of the many choices that are made when comparing two products (e.g., whether to apply the averaging kernel, whether to use interpolation, how to treat the surface pressure, or what coincidence criteria to use).

TCCON
Another finding worth noting is the comparisons at Lauder.In 2010 the Lauder_125HR instrument began routine operation, while the Lauder_120HR instrument continued to take TCCON data through to the end of 2010.Both MACC and CT2013b show no seasonal cycle correlation with the 120HR time series at Lauder, but do show correlation with Lauder_125HR time series.We attribute this to the improved precision of the 125HR data, and an increase of the seasonal cycle amplitude in 2011 and 2012 as compared to other years (e.g., compare 2011 vs. 2007).The phasing error found in the CT2013b comparison with the Lauder_125HR may be due to CT2013b not modeling the drivers of the seasonal cycle amplification in 2011 and 2012.
At Bremen and Four Corners, local effects that are not resolved at 3 × 2 • likely dominate, particularly since Bremen is clustered with Orléans, Garmisch, and Karlsruhe, which all compare fine, and because the correlation of daily variability, as seen in the next section, is also very low at these two stations.

Daily variability (models vs TCCON)
At the surface, CO 2 shows a strong diurnal cycle in areas with active vegetation, e.g., Park Falls, during summer, and synoptic trends based on regional dynamics.Even though the diurnal cycle is markedly smaller in the total column (Olsen and Randerson, 2004), it can be observed both by TCCON and also in models, in our case CT2013b and MACC, as seen in Fig. 12.Both diurnal variations and synoptic trends can be seen in Fig. 12. Validating the amplitude of the diurnal variability in the column is important as the column diurnal variability better represents the amount of CO 2 emitted or absorbed by surface processes as compared to surface measurements, which are more impacted by boundary layer height.To our knowledge this is the first comparison of model fields to TCCON to compare the diurnal cycle.As TCCON itself has not been validated at multiple times in 1 day, this is considered a comparison, not a validation.We compare the difference between morning and afternoon in models and TCCON.To minimize potential TCCON biases that depend on the solar zenith angle (through the air mass factor), we compare at two points in each day separated by the largest time with the same solar zenith angle (SZA).The methodology is to (1) identify two points, t1 and t2, from the same day with the largest time difference but with the same SZA.As the TCCON data used in this paper have been averaged over 90 min, t1 or t2 may be interpolated between two time points.(2) We compare TCCON at t2 minus TC-CON at t1 and the same times for each model.We look at the variability within 1 day for one season (JJA).Looking at different seasons for the Northern Hemisphere at the bottom of Table 6, both models showed clearly higher correlations and slopes in MAM and JJA vs. DJF and SON, with SON showing the lowest correlation and slope.In the SH, correlations are highest in DJF, second highest in SON, and lowest in MAM.However in the SH, the slopes are on the order of 15 % of the slope of TCCON.For CT2013b, at Darwin, correlations are highest in DJF, less in SON, with slopes on the order of 15 % of the slope of TCCON.Table 6 shows correlations between CT2013b or MACC vs. TCCON in the daily variability.In the NH, on average, the correlations are about two thirds as large as could be expected, given the relative sizes of the variability and errors (see text around Eq. 3).Additionally, the two models have about one third to one half the daily variability of TCCON (as seen from the smaller slopes).In the far north (Eureka, Sodankylä), the correlations indicate agreement but the model daily variability is less than 1/4 TCCON.In the midlatitudes (excluding locally influenced stations) there is the highest correlation (∼ 0.3-∼ 0.7) with model daily variability 20 to 60 % of the variability of TCCON.Sites influenced by local sources, Bremen, Four Corners, and JPL2011, the models do not show the diurnal variability that is seen by TCCON, as evidenced by very small or zero slopes.Sites which have complicated terrain, Garmisch and Izaña, do show correlations in the daily variability similar to other nearby stations.
The CT2013b model in general shows more daily variability and higher correlations, which are in better agreement with TCCON.Since the satellite observations are coincident ∼ once per day, the diurnal pattern will not be well constrained by satellite observations, except as preserved in transported air coincident with satellite measurements downwind.Model Observing System Simulation Experiments (OSSEs) can determine the impact of the diurnal cycle strength on flux estimates to determine the importance of independently verifying the diurnal cycle in models.

Discussion and conclusions
We find standard deviations of 0.9, 0.9, 1.7, and 2.1 ppm vs. TCCON for CT2013b, MACC, GOSAT, and SCIAMACHY, respectively, with the single target errors 1.9 and 0.9 times the predicted errors for GOSAT and SCIAMACHY, respec- tively.There is a correlation r value of 0.5 for SCIAMACHY and 0.6 for GOSAT for the actual and predicted errors grouped by station.Equation (2) and Table 2 show how errors decrease when satellite results are averaged and estimate the magnitude of the correlated and random errors components for averaged satellite results, where random error components decrease with an increasing number of averaged observations.When satellite data are averaged and interpreted according to the model error 2 = a 2 + b 2 /n (where n is the number of observations averaged, a is the systematic (correlated) errors, and b is the random (uncorrelated) errors).The northern hemispheric average values from Table 2 are a = 1.4 ± 0.3 ppm, b = 1.7 ± 0.2 ppm for SCIAMACHY geometric; a = 1.0 ± 0.3 ppm, b = 2.1 ± 0.1 ppm for SCIA-MACHY dynamic; a = 0.8 ± 0.2 ppm, b = 1.6 ± 0.1 ppm for GOSAT geometric; a = 0.5 ± 0.2 ppm, b = 1.9 ± 0.1 ppm for GOSAT dynamic.The lowest correlated errors are found when using dynamic coincidence criteria where values are averaged from a larger spatiotemporal region, but the lower value for GOSAT is only after the estimated co-location error is subtracted.The Southern Hemisphere errors are uniformly smaller, both in correlated and uncorrelated errors.These data represent averaging of satellite data which match one averaged TCCON value.The above error model should help in assigning realistic retrieval error correlations in assimilation systems in place of current ad hoc hypotheses.For example, in Basu et al. (2013) observations within 500 km and 1 h are assumed to have 100 % correlated errors, and are inflated by a factor such that when observations are later treated as if the errors were random, the final error of the average is the same as the error of one observation.This can be improved by setting the inflation factor so that the average observation error is a 2 + b 2 /n, with a and b set by the geometric values from Table 2, which should result in a lower error than assuming 100 % correlation.
Regarding the quality of the dynamic relative to the geometric coincidence criteria, the coincidence error estimated by models is larger for dynamic coincidences by about 0.3 ppm, as seen in Table 2.However, the coincidence error (0.3 to 0.4 ppm for geometric criteria and 0.6 to 0.7 ppm for dynamic criteria) is not the dominant error.As seen in Fig. 5, the dynamic coincidence criteria average 0.1 ppm higher error for unaveraged satellite comparisons.This is small compared to a total error of 2.0 and 2.2 ppm, respectively, for stations in the Northern Hemisphere.With maximum averaging, as seen in Fig. 5, the errors are lower for dynamic vs. geometric because the dynamic criteria finds more observations to average.Figure 6 shows that at Lamont the average difference between geometric and dynamic observations is 0.4 ppm for unaveraged satellite observations, which is higher than average.This error difference reduces to less than 0.2 ppm when all available observations are averaged, also seen in Fig. 6.While coincidence error is an important error source, it is not the dominant error source.Although dynamic coincidence criteria allow the inclusion of more stations in analyses because of the larger number of coincidences, comparisons to geometric coincidence results are done when possible.Biases at individual stations have a year-to-year variability of ∼ 0.3 ppm, with biases larger than the TCCON predicted bias uncertainty of 0.4 ppm at many stations.Using fitting software, we find that GOSAT and CT2013b underpredict the seasonal cycle amplitude in the Northern Hemisphere between 46 and 53 • N, MACC overpredicts in 26-37 • N, and CT2013b underpredicts the seasonal cycle amplitude in the Southern Hemisphere.The seasonal cycle phase indicates whether a data set or model lags another data set in time.We find that the GOSAT phase improves substantially over the prior and the SCIAMACHY retrieved phase improves substantially for two of seven sites.The models reproduce the measured seasonal cycle phase well except for at Lauder_125HR (CT2013b) and Darwin (MACC).We compare the variability within 1 day between TCCON and models in JJA; there is correlation between 0.2 and 0.8 in the NH, with models showing 10 to 50 % the variability of TCCON at different stations and CT2013b showing more variability than MACC.This paper highlights findings that provide inputs to estimate flux errors in model assimilations, and places where models and satellites need further investigation, e.g., the SH for models and 45-67 • N for GOSAT and CT2013b.
We focus on validating aspects of model and satellite data which may be important for accurate flux estimates and CO 2 assimilation, including accurate error estimates, overall biases, biases by season and latitude, impact of coincidence criteria, validation of seasonal cycle phase and amplitude, yearly growth, and daily variability.Our methodologies can be used to correct known biases and data deficiencies (e.g., Basu et al. (2013) accounted for global land/sea biases; Nassar et al. (2011) corrected for hemispheric gradients).Alternatively, biases can be mitigated through data assimilation, such as the inversion method of Reuter et al. (2014) which is insensitive to seasonal and regional biases outside a targeted region.The bias evaluation performed in the present study has been restricted to various satellite and model CO 2 data products.To quantify the importance of each bias (seasonal biases, location-dependent biases, seasonal cycle differences, seasonal cycle phase differences, and diurnal cycle differences) on carbon flux estimates requires detailed Observing System Simulation Experiments (OSSEs).For example, Kulawik et al. (2013) tested the effect of a NH bias of 0.3-0.5 ppm in JJA, finding flux biases were not insignificant, and were comparable to flux updates from GOSAT in some regions.
Biases vary by station (See Fig. 3); these stationdependent biases have a standard deviation of ∼ 0.3 ppm from year to year.Biases also vary by season as seen in Figs.7-8.Seasonal biases affect the seasonal cycle amplitudes, which are important for biospheric flux attribution.
The seasonal cycle phase is a sensitive indicator of seasonally dependent biases in satellite data as well as issues with model fluxes or transport errors.The GOSAT root mean square (RMS) phase difference vs. TCCON across all sites is 16.1 days for the prior; this improves to 6.8 days for the GOSAT retrieved XCO2.The SCIAMACHY RMS phase difference vs. TCCON across all sites is 16.4 days for the prior; this improves to 13.2 days for the SCIAMACHY retrieved XCO2, reflecting the fact that SCIAMACHY data significantly improved the seasonal cycle phase at just two of the seven TCCON sites.
Model comparisons to TCCON are much less noisy as there are many more matches.Most NH stations show the expected seasonal drop-off (e.g., see Fig. 11), with the peak correlation near 0 days, and an additional spike within ±3 days indicating the capture of synoptic variability.Stations that showed phase differences larger than 10 days are Darwin (Macc only), and Lauder_125HR (CT2013b only), as well as all of the locally influenced stations.
In comparing CO 2 diurnal variability between TCCON and models, both models show up average 0.5 correlation to the TCCON CO 2 change between afternoon and morning for select stations in the NH.This is on the order of about 2/3 of the maximum possible correlation given the error vs. variability (except the locally influenced JPL2011, Bremen, and Four Corners which had little correlation and no slope).The amplitude of the variability is higher in TCCON vs. the models, with CT2013b closer to TCCON than MACC.However, note that TCCON daily variability has not been validated.Diurnal pattern will not be constrained by satellite observations, except as preserved in transported air coincident with satellite measurements downwind, and therefore may be important to independently verify the diurnal cycle in models to ensure accurate flux attribution.The importance of the diurnal cycle on flux estimates needs to be tested.
Spatial and seasonal-dependent biases are obstacles to accurate and better resolved CO 2 flux estimates.This paper highlights findings that provide inputs to estimate flux errors in model assimilations, and places where models and satellites need additional validation or improvement.Some of the issues which need further investigation are the GOSAT and CT2013b seasonal cycles in the 46-53 • N latitude range (which are 0.4 and 0.2 ppm smaller than TCCON, respectively), the MACC seasonal cycle in 28-37 • N latitude range (which is 0.6 ppm larger than TCCON), seasonal cycle amplitude and phase differences at SH stations, differences in the diurnal cycle amplitude between models and TCCON, and high biases for GOSAT and SCIAMACHY north of 67

1Figure 1 .
Figure 1.TCCON site locations used for this work.The color indicates the year when each station started collecting data.

Figure 2 .
Figure 2. Time series for matches of CT2013b, MACC, SCIAMACHY, and GOSAT vs. TCCON at Lamont (a-d) and Lauder (for models) or Wollongong (for satellites) (e-h).The top plot of each set (a1, b1, c1, . . . ) shows a time series of all geometric matching pairs.The middle panel of each set (a2, b2, c2, . . . ) shows the difference vs. TCCON, with the blue line denoting the 30-day average difference.The bottom panel of each set (a3, b3, c3, . . . ) shows a histogram of the differences, indicating an approximate error and bias.
not included in this paper include a method implemented by S. Basu described in Guerlet et al. (2013), which utilizes model CO 2 fields to determine coincidences, and Nguyen et al. (2014) which uses a weighted average of distance, time, and mid-tropospheric temperature.Dynamic and geometric coincidence criteria are compared in Sect.3.3 and geometric coincidence criteria are used to spot-check dynamic coincidence criteria results.
Garmisch, Four Corners, JPL, and Izaña are influenced by local effects or complex terrain and are not included in averages.b Ascension data start in 2013 after CT2013 stops.

Figure 3 .
Figure 3. Bias (left panel) and standard deviation (right panel) for CT2013b, MACC, BESD-SCIAMACHY, and ACOS-GOSAT vs. TCCON stations, arranged from high to low latitude.Comparisons which have a particularly low number of matches are TSUKUBA and Lauder for SCIAMACHY and Lauder for GOSAT.

Figure 4 .
Figure 4. Overall bias (left panel, with error bars showing the standard deviation of the bias) and standard deviation (right panel, with stars showing the predicted error for satellites) for most stations (some stations removed, see text).

Figure 5 .
Figure 5.Standard deviation of SCIAMACHY and GOSAT minus TCCON for different coincidence criteria and number of satellite observations averaged, n, in the Northern Hemisphere.The following consistent set of stations was used for all comparisons: Bialystok, Karlsruhe, Orléans, Garmisch, Park Falls, and Lamont.The dotted line shows the error if it scaled as the inverse square root of the number of averaged observations.The far right case for each of the categories contains the maximum n that has results for all stations.

Figure 6 .
Figure 6.Averaging matches of satellite data vs.TCCON at Lamont.As the number averaged increases, the standard deviation vs. TCCON decreases.CT2013 at the satellite vs. at CT2013 at TCCON (purple) is used to quantitate spatiotemporal mismatch error.The points are fit to Eq. (2) (black).For GOSAT the uncorrected data are also fit (black dashed).The initial guess minus TCCON standard deviation is shown as a green dashed line.We see that in this case, for GOSAT at Lamont, averaging more than about four observations improves over the initial guess.

Figure 7 .
Figure 7. Bias for 3-month groups for each station, where each station is normalized to have 0 yearly bias.For satellites, stations are included when at least 20 matches are found in each season.Dynamic coincidence criteria are used.The station colors are coded by location: far NH gray, European and Park Falls red/yellow, midlatitude green, SH blue.

Figure 8 .
Figure 8. Bias for 3-month groups for Southern Hemisphere (left panel), 0-45 • N (middle panel), and poleward of 45 • N (right panel).Each station composing each group is normalized to have zero average over the year.The Southern Hemisphere (left panel, Lauder (except SCIAMACHY), Wollongong, and Darwin) has relatively small biases.0-45 • N includes Tsukuba and Lamont.> 45 • N includes Karlsruhe, Park Falls, and Orléans for satellites, and those plus Sodankylä and Bialystok for models.The results that are significant according to the t test are shown with a * in the bar plots.

Figure 9 .
Figure 9. Seasonal cycle amplitude.The TCCON values are shown by the circles.The averaging is done over 10 × 10 • bins every 5 • .Comparisons vs. TCCON may be different than Table4, since Table4has very close criteria for models and dynamic coincidence criteria for satellite data.The models are sampled at GOSAT observations.

Figure 10 .
Figure 10.Top: cross correlation between TCCON and SCIAMACHY (top left) and GOSAT (top right) with matches using dynamic criteria at Park Falls.The x axis shows results when satellite data are offset in days vs. TCCON.The dashed line shows the expected maximum correlation based on the error (see Eqs. 2 and 3).The gray line is the correlation for the satellite priors, which are each out of phase by at least 10+ days.The * is the peak of a polynomial fit of the correlation between −25 and +25 days.Bottom: standard deviation between TCCON and SCIAMACHY (bottom left) and GOSAT (bottom right).The dashed line shows the Eq.(2) predicted error.

Figure 11 .
Figure 11.Top: cross correlation examples between TCCON and CT2013 (left) or MACC (right).Each panel shows the correlation and second-order polynomial fits (top) and standard deviation (bottom) vs. offset in days of TCCON vs. satellite data.The correlation should be at a maximum and standard deviation at a minimum at days offset = 0.The top panels show examples of stations with a phase difference of less than 10 days.The bottom set shows example stations which have phase differences of at least 10 days.

Figure 12 .
Figure 12.(a) Time series for models and TCCON from 1 to 8 August 2011 at Bialystok.The end points of the solid lines show the time points used for comparing daily variability; both diurnal and synoptic variations are seen.(b) Change throughout the day at Bialystok for 2 August 2011.The large diamonds at 06:00 and 18:00 show the two times with largest difference that have the same solar zenith angle that are used in the analysis.The difference between these times are −2, −1.7, and −0.7 ppm for TCCON, CT2013, and MACC, respectively.Right: CarbonTracker (c) and MACC (d) daily trends vs. TCCON daily trends for BIALYSTOK in JJA, from compiling differences as shown in (b).Correlation is seen in the daily trends as compared to TCCON with the daily amplitude for the models smaller than TCCON.

Table 1 .
Summary of the CO 2 data sets and models we used, showing the coverage for several different CO 2 products.The obs/day are the approximate number of CO 2 observations which passed quality screening.

Table 2 .
Fit of a and b in Eq. ( Table3is root mean square of the end date choice, bootstrap error, averaging choice, and bin standard deviation.The co-location error (only relevant for satellites) is calculated as an average bias in the bin; this bias should be subtracted from the satellite-TCCON differences to remove the effects of co-location error.When n > 1, results are considered significant using the t test of all the results in the bin (comparison of the mean difference/standard error difference to a standard look up table).The calculated seasonal cycle amplitudes are shown in Table3, and results are compared for consistency to the simpler seasonal cycle biases calculated in Sect.3.5 to understand what season the discrepancy occurs.

Table 4 .
Yearly increases.Each comparison uses matched pairs with TCCON using locations which have at least 2 years of data for comparisons.See Table3for stations included.The start date and end date are averaged for the stations in each bin and are shown in the second to last column.Bold text shows one difference larger than predicted errors.The last column shows the average global yearly increase for the time period using Table6.