Validation of MOPITT carbon monoxide using ground-based Fourier transform infrared spectrometer data from NDACC

The Measurements of Pollution in the Troposphere (MOPITT) satellite instrument provides the longest continuous dataset of carbon monoxide (CO) from space. We perform the first validation of MOPITT version 6 retrievals using total column CO measurements from ground-based remote sensing Fourier transform infrared spectrometers (FTSs). Validation uses data recorded at 14 stations in the Network for the Detection of Atmospheric Composition Change (NDACC) that span a wide range of latitudes (80° N to 78° S). MOPITT measurements are spatially co-located with each station and different vertical sensitivities between instruments are accounted for by using MOPITT averaging kernels. All three MOPITT retrieval types are analyzed: thermal infrared (TIR-only), joint thermal and near infrared (TIR-NIR), and near infrared (NIR-only). Generally, MOPITT measurements overestimate CO relative to FTS measurements, but the bias is typically [...] satellite error prior to data assimilation, allowing for inclusion of data over a wider spatial range than is currently used. The MOPITT long-term bias drift has been bounded to within ±0.5% yr⁻¹ or lower at almost all locations. Variable drift in the Northern Hemisphere implies an uncharacterized retrieval parameter such as uncertainty in cloud detection or sea-ice representation. We recommend that trend analysis should not be performed above 60° N. Overall, this study extends the geographical and temporal analysis of MOPITT validation results.


Introduction
Atmospheric carbon monoxide (CO) is useful for studying both transported and local sources of pollution. CO is directly emitted from incomplete combustion, such as from biomass burning and fossil fuel use. CO is also chemically produced from the oxidation of methane and volatile organic compounds. The approximate global lifetime of two months makes CO an ideal atmospheric constituent to track atmospheric physical and chemical processes over a range of spatial scales (Edwards et al., 2006; Duncan et al., 2007).
Measurements of Pollution in the Troposphere (MOPITT) is the longest running satellite sensor measuring atmospheric CO globally, operating since 2000 aboard the satellite Terra from low-Earth orbit using thermal infrared (TIR). MOPITT is the only satellite instrument measuring CO in both TIR and near infrared (NIR). A long record presents an opportunity to analyse temporal changes in atmospheric CO. For example, long-term CO trends from satellite records were compared in Worden et al. (2013). However, continued validation of the instrument is necessary to ensure that observed temporal changes are due to changes in the atmospheric state, rather than changes in the instrument. Validation is performed against an independent measure of atmospheric CO over a long time period, to help determine any instrument drift.
MOPITT has been extensively validated with in situ measurements at the ground and by aircraft (Deeter et al., 2010, 2014; Emmons et al., 2004, 2009). Validation and intercomparison were also completed using other satellite products (Martínez-

Comparison methodology
We analyse all three MOPITT retrieval products: TIR-only, TIR-NIR, and NIR-only (Deeter et al., 2014). Although MOPITT has been measuring CO since March 2000, one of the two optical benches became nonoperational in May 2001 as a result of a cooler failure. The period before optical bench loss is known as Phase 1 and the period from August 2001 onward is known as Phase 2 (Deeter et al., 2004; Emmons et al., 2004). While instrument changes between phases are accounted for in the forward model and retrieval algorithm, a small step-change remains between Phase 1 and 2 retrievals. Consequently, we focus on the validation of Phase 2.
Recently, a calibration issue was found with the NIR radiances that affects retrievals after February 2012. Therefore, TIR-NIR and NIR-only are validated between August 2001 and February 2012. In contrast, TIR-only retrievals are validated between August 2001 and station-specific end dates, determined by the available FTS data at each station (see Table 1).

Validation is performed for a range of conditions to assess whether parameters that are known to affect the MOPITT sensitivity and AKs will affect the validation results. Specifically, separate validation is performed for the four MOPITT detector elements; for land-scenes versus water-scenes; and over a range of latitudes.
In order to accurately compare measurements between instruments, equivalent air masses must be compared. This involves co-locating measurements in time and space, and accounting for the relative sensitivity of each instrument.

Co-location criteria
Temporal co-location is defined as comparing daytime measurements from MOPITT with FTS measurements retrieved within the same day as the MOPITT overpass. All FTS measurements within the same day as the MOPITT overpass time of ∼10:30 a.m. (local time) are considered. While the MOPITT overpass also occurs at ∼10:30 p.m., the daytime-only MOPITT measurements are used in order to include enhanced information from the reflected solar NIR. Another constraint is that ground-based instruments only measure during daytime clear-sky conditions. MOPITT retrievals are spatially co-located with the FTS by selecting MOPITT data within a one degree radius around each ground station, a distance criterion that is suggested by Sparling and Bacmeister (2001). One degree has been found adequate in other satellite validation studies (Yurganov et al., 2008; Kerzenmacher et al., 2012), and falls within the range of previous validation of the V6 MOPITT product, which used radii of 0.5° (against National Oceanic and Atmospheric Administration, NOAA, profiles) and 2° (against HIAPER Pole-to-Pole Observations, HIPPO) (Deeter et al., 2014).
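The co-location criteria above can be sketched as a simple pixel mask. This is a minimal illustration only: it assumes that the "one degree radius" is a great-circle angular distance, and that each MOPITT pixel and FTS measurement carries a day label. The function and argument names are hypothetical and are not taken from the MOPITT or NDACC processing software.

```python
import numpy as np

def angular_distance_deg(lat1, lon1, lat2, lon2):
    """Great-circle separation in degrees of arc (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    h = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return np.degrees(2.0 * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0))))

def colocate(pix_lat, pix_lon, pix_day, stn_lat, stn_lon, fts_day, radius_deg=1.0):
    """Boolean mask of daytime MOPITT pixels matched to one FTS measurement day.

    Assumption: the radius is measured as a great-circle angle, and pix_day /
    fts_day are comparable day labels (for example integer day numbers).
    """
    pix_lat = np.asarray(pix_lat, dtype=float)
    pix_lon = np.asarray(pix_lon, dtype=float)
    near = angular_distance_deg(pix_lat, pix_lon, stn_lat, stn_lon) <= radius_deg
    same_day = np.asarray(pix_day) == fts_day
    return near & same_day
```

The mask would then select the pixels that enter the spatial average for that day; filtering for daytime, surface type or detector element is left to the caller.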

Averaging
Prior to validation, MOPITT retrievals that are co-located with FTS measurements are averaged, inversely weighted by the square of relative retrieval measurement error. Thus, one MOPITT average is compared with several FTS measurements.
There are several advantages of averaging the MOPITT data. Combining satellite measurements within the one degree radius criterion satisfies a compromise between reducing the effects of random retrieval noise, and minimizing spatial dilution of measurements through using a small radius. Error-weighted averaging improves the signal-to-noise ratio, and reduces the random uncertainty in the MOPITT data, while any systematic bias remains, allowing diagnosis of the MOPITT bias. Averaging will also reduce the co-location errors associated with non-coincidence of air masses, thereby reducing the sampling bias. Additionally, averaging improves computational efficiency, reducing the number of comparisons at some stations from ∼40,000 to ∼5,000. Error-weighted averaging is also performed for the corresponding MOPITT AK matrices and a priori.
The instrument error is combined in quadrature. Depending upon the experiment, averages are restricted to include land-only scenes, water-only scenes, or pixels from specific detector elements.
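The error-weighted averaging described above can be sketched as follows. The weights follow the text (inverse square of the relative retrieval error). How the combined error is formed is an assumption here: the sketch uses standard quadrature propagation for a weighted mean, which may differ in detail from the authors' processing. All function names are hypothetical.

```python
import numpy as np

def relative_error_weights(values, errors):
    """Normalized weights, inversely proportional to (error / value) squared."""
    values = np.asarray(values, dtype=float)
    errors = np.asarray(errors, dtype=float)
    w = 1.0 / (errors / values) ** 2
    return w / w.sum()

def error_weighted_mean(values, errors):
    """Weighted mean column and its error combined in quadrature (assumed form)."""
    w = relative_error_weights(values, errors)
    mean = np.sum(w * np.asarray(values, dtype=float))
    err = np.sqrt(np.sum((w * np.asarray(errors, dtype=float)) ** 2))
    return mean, err

def weighted_stack_mean(weights, stack):
    """Apply the same weights to a stack of AK matrices or a priori profiles.

    stack has shape (n_pixels, ...); the result drops the first axis.
    """
    return np.tensordot(np.asarray(weights, dtype=float),
                        np.asarray(stack, dtype=float), axes=1)
```

Reusing one set of weights for the columns, AK matrices and a priori keeps the averaged quantities mutually consistent, which matters when the averaged AK is later applied to the FTS profile.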
A comparison between the validation of averaged and raw MOPITT values against HIPPO and ACE-FTS measurements is available in the supplementary material of Martínez-Alonso et al. (2014). These authors concluded that the two methods produced equivalent results, and found that averaged MOPITT values produced biases that exceeded those of raw MOPITT values by <1.2% against HIPPO and <0.8% against ACE-FTS. Therefore, averaging MOPITT values here could overestimate MOPITT biases by approximately 1%.
Temporal averaging is not performed on the FTS measurements, which enables a qualitative assessment of the influence of diurnal changes in CO. In summary, the MOPITT spatial averages are compared separately with each FTS measurement in the same day.

Vertical grids between instruments are different in terms of resolution as well as the retrieved surface altitude. Specifically, FTS measurements are retrieved on a finer vertical grid than MOPITT. The FTS measurements must be re-gridded to the MOPITT vertical levels for two reasons: (1) measurements must describe the total column over the same altitude range in order to compare equivalent atmospheric amounts; (2) the FTS profile will be smoothed by the MOPITT AKs, which are reported on MOPITT vertical layers.

MOPITT CO profile values describe the average VMR within the layer above the reported level (Deeter et al., 2013). In contrast, FTS values are reported on layer mid-points and describe the average VMR within that layer (using SFIT), or are level values (when using PROFFIT). FTS profiles are re-gridded in a manner that is independent of the FTS profile definition, assuming hydrostatic equilibrium. We first interpolate the FTS profile in logP space to an ultrafine grid of 100 levels per MOPITT layer. The VMR values are then averaged over each set of 100 ultrafine levels to produce an average within each MOPITT layer. Resulting FTS averages are associated with levels, the same definition as for the MOPITT profile.
During re-gridding, two situations are accounted for: either the reported FTS surface pressure is larger than MOPITT's; or the FTS surface pressure is smaller than MOPITT's. In the first case, where the FTS surface pressure is larger than MOPITT's, any FTS layers below the MOPITT surface layer are not used when averaging the 100 ultrafine levels. This process is visualized in Fig. 2 (a). If the MOPITT surface layer occurs at an altitude above 900 hPa, the MOPITT profile will have fewer than 10 vertical layers, as will the re-gridded FTS profile.
Alternatively, where FTS surface pressure is smaller than the MOPITT surface pressure, the FTS values are not extrapolated outside the FTS surface pressure. These situations occur mainly for stations located near highly varying terrain or at high altitude. One possible method is to replace the missing lower values of the interpolated FTS profile with a scaled version of the MOPITT a priori, such as in Kerzenmacher et al. (2012). However, seeing as land exists below the altitude of the FTS station, we choose an alternative method that uses the lowest level of the re-gridded FTS profile to define the lower bound of the comparison altitude range. In these cases, a new MOPITT column is calculated from a truncated profile to compare with the re-gridded FTS values. Figure 2 (b) shows the schematic of this process.
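The re-gridding procedure can be sketched as below. This is a minimal illustration under stated assumptions: MOPITT layers are passed as an array of pressure edges (surface first), the 100 ultrafine levels are evenly spaced in log-pressure and include both layer edges, and any MOPITT layer that extends outside the FTS pressure range is returned as NaN so that the comparison can be truncated rather than extrapolated. The treatment of partially covered layers in the original analysis may differ, and the function name is hypothetical.

```python
import numpy as np

def regrid_fts_to_mopitt(fts_p, fts_vmr, mop_edges, n_fine=100):
    """Average an FTS VMR profile over MOPITT layers in log-pressure space.

    fts_p, fts_vmr : FTS pressures (hPa) and VMR values, in any order.
    mop_edges      : MOPITT layer edges (hPa), surface first, n_layers + 1 values.
    Returns one mean VMR per MOPITT layer (NaN where the FTS does not cover it).
    """
    logp = np.log(np.asarray(fts_p, dtype=float))
    vmr = np.asarray(fts_vmr, dtype=float)
    order = np.argsort(logp)              # np.interp needs increasing x values
    logp, vmr = logp[order], vmr[order]
    edges = np.log(np.asarray(mop_edges, dtype=float))
    out = np.full(len(edges) - 1, np.nan)
    for i in range(len(edges) - 1):
        lo, hi = sorted((edges[i], edges[i + 1]))
        if lo < logp[0] or hi > logp[-1]:
            continue                      # no extrapolation beyond the FTS profile
        fine = np.linspace(lo, hi, n_fine)  # ultrafine grid within this layer
        out[i] = np.interp(fine, logp, vmr).mean()
    return out
```

FTS levels below the MOPITT surface are ignored automatically because no MOPITT layer edge reaches them, and layers returned as NaN mark where a truncated MOPITT column would be needed.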
Averaging kernel smoothing
MOPITT AKs are used to smooth the re-gridded FTS profiles in order to account for sensitivity differences between instruments. The total column AKs of FTS are near unity over the altitude range covered by MOPITT (e.g. Toronto and Wollongong in Fig. 3; column 4), indicating relatively uniform sensitivity to the true atmospheric state and little inclusion of the a priori in the retrieved values. In contrast, the column AKs of MOPITT peak in the free troposphere and show overall less sensitivity than FTS, including more of the a priori in the retrievals, particularly for the lower altitude levels. Rodgers and Connor (2003) show that when intercomparing instruments, if one instrument possesses less dependence on the a priori and more information than the other, it can be used as a closer representation of the true atmospheric state. Therefore, we take the FTS retrievals to be 'atmospheric truth' and smooth to MOPITT retrieval space. Specifically, the vertically re-gridded FTS profile is smoothed by MOPITT AK matrices.
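The MOPITT AK matrices are applied following Rodgers and Connor (2003), modified for log(VMR):

$$\log_{10}(\mathbf{x}_{\mathrm{smooth}}) = \log_{10}(\mathbf{x}_{\mathrm{ap}}) + \mathbf{A}\left(\log_{10}(\mathbf{x}_{\mathrm{FTS}}) - \log_{10}(\mathbf{x}_{\mathrm{ap}})\right) \qquad (1)$$

where $n$ is the number of vertical layers; $\mathbf{x}_{\mathrm{smooth}}$ is the smoothed FTS VMR profile; $\mathbf{x}_{\mathrm{ap}}$ is the MOPITT a priori VMR profile; $\mathbf{A} = \{A_{ij};\ i = 1 \ldots n,\ j = 1 \ldots n\}$ is the MOPITT AK matrix; and $\mathbf{x}_{\mathrm{FTS}} = \{x_{\mathrm{FTS},j};\ j = 1 \ldots n\}$ is the FTS regridded VMR profile. The matrix $\mathbf{A}$ has been calculated based on $\log_{10}(\mathrm{VMR})$, and therefore must be applied to a profile of $\log_{10}(\mathrm{VMR})$. Differences between AK matrices for the three MOPITT products are visualized in Fig. 4. AKs are further discussed in Sect. 4.1, 4.2 and 4.3.
The resulting smoothed profile of $\log_{10}(\mathrm{VMR})$ is converted to a VMR profile. Equations 2 and 3 describe the relationship between the smoothed VMR values and the terms in Eq. 1.

$$\log_{10}(x_{\mathrm{smooth},i}) = \log_{10}(x_{\mathrm{ap},i}) + \sum_{j=1}^{n} A_{ij}\left(\log_{10}(x_{\mathrm{FTS},j}) - \log_{10}(x_{\mathrm{ap},j})\right) \qquad (2)$$

$$x_{\mathrm{smooth},i} = x_{\mathrm{ap},i} \cdot 10^{\sum_{j=1}^{n} A_{ij}\left(\log_{10}(x_{\mathrm{FTS},j}) - \log_{10}(x_{\mathrm{ap},j})\right)} \qquad (3)$$

Smoothed FTS total column values ($c_{\mathrm{smooth}}$) are calculated from $\mathbf{x}_{\mathrm{smooth}}$ using pressure weighted integration with MOPITT retrieval pressure widths (Eq. 4). The smoothed FTS column value is calculated over the same altitude range as MOPITT and represents what MOPITT would have retrieved, had the FTS measurement described the true atmosphere.

$$c_{\mathrm{smooth}} = \alpha \sum_{i=1}^{n} \Delta p_i \, x_{\mathrm{smooth},i} \qquad (4)$$

where $\alpha$ is the conversion factor between VMR and column amount and $\Delta p_i$ is the pressure width of layer $i$.
The MOPITT retrieved column ($c_{M}$) is then validated against the smoothed FTS column, for example by calculating the mean bias (Eq. 5).

$$\mathrm{bias} = \frac{1}{m} \sum_{k=1}^{m} \left(c_{M,k} - c_{\mathrm{smooth},k}\right) \qquad (5)$$

where $m$ is the number of comparisons at an NDACC station.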
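The smoothing, column integration and mean bias steps can be sketched in a few lines. This is an illustration only, assuming VMR in ppb and pressure widths in hPa. The numerical value of the conversion factor is not given in the text; the value used below (about 2.12 × 10¹³ molec. cm⁻² hPa⁻¹ ppb⁻¹, which follows from Avogadro's number divided by gravity and the molar mass of dry air) is an assumption, as are all function names.

```python
import numpy as np

# Assumed VMR-to-column conversion factor (molec cm^-2 hPa^-1 ppb^-1);
# not stated in the text, supply the value appropriate to the data product.
ALPHA = 2.12e13

def smooth_fts_profile(x_fts, x_ap, ak):
    """Smooth a re-gridded FTS VMR profile with a MOPITT AK in log10(VMR) space."""
    x_fts = np.asarray(x_fts, dtype=float)
    x_ap = np.asarray(x_ap, dtype=float)
    ak = np.asarray(ak, dtype=float)
    log_smooth = np.log10(x_ap) + ak @ (np.log10(x_fts) - np.log10(x_ap))
    return 10.0 ** log_smooth

def total_column(x, dp, alpha=ALPHA):
    """Pressure-weighted total column from layer VMRs and layer pressure widths."""
    return alpha * np.sum(np.asarray(dp, dtype=float) * np.asarray(x, dtype=float))

def mean_bias(c_mopitt, c_smooth):
    """Mean difference between MOPITT and smoothed FTS columns."""
    return float(np.mean(np.asarray(c_mopitt, dtype=float)
                         - np.asarray(c_smooth, dtype=float)))
```

Two limiting cases are a useful check: an identity AK returns the FTS profile unchanged, while a zero AK returns the a priori, which is the behaviour expected of a retrieval with full or no sensitivity respectively.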

Information content
The information content of each instrument is described by Degrees of Freedom for Signal (DFS). DFS are determined from the trace of the AK matrix, which is influenced by instrumental and geophysical parameters (Rodgers, 2000; Deeter et al.). DFS are provided for each MOPITT retrieval and are calculated for FTS measurements from the AK matrices.
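As a small illustration of the definition above, DFS can be computed directly from a square AK matrix. The function name is hypothetical; the calculation itself is just the matrix trace.

```python
import numpy as np

def degrees_of_freedom(ak):
    """Degrees of Freedom for Signal: the trace of a square AK matrix."""
    ak = np.asarray(ak, dtype=float)
    if ak.ndim != 2 or ak.shape[0] != ak.shape[1]:
        raise ValueError("AK matrix must be square")
    return float(np.trace(ak))
```

A median over all retrievals at a station would then give a station-level summary of the kind reported in Table 2.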
The theoretical maximum DFS of each instrument may be higher than reported values because AKs depend on both the retrieval methods and the choice of a priori, which are different between instruments. However, we aim to validate the operational MOPITT products rather than perform instrument intercomparison, which means a comparison of retrieved DFS between instruments is indicative of the information content differences between retrieved values.

Median DFS for each instrument at each station is recorded in Table 2. MOPITT median DFS are below 2 at all stations.
The joint TIR-NIR product consistently retrieves more information than the TIR-only product and the NIR-only product shows very low DFS, although some information is still present. In comparison, the FTS measurements retrieve more information than MOPITT at all stations. Median DFS for FTS is generally above 2 (except at La Réunion, Zugspitze and Ny-Ålesund).
Higher information content in the ground-based measurements relative to MOPITT supports our choice to smooth the FTS measurements by the MOPITT AK.

MOPITT is generally biased high relative to the FTS by a few percent. Overall, the TIR-only product performs the best, followed by the joint TIR-NIR and then NIR-only. Mean station biases are always less than 10% for TIR-only with an average [...] producing bias of 5.4%. NIR-only biases are less than 10% (except at Ny-Ålesund) with a bias of 7.0%. Standard deviation is always larger than bias, except for the TIR-NIR product at Lauder and the NIR-only product at Lauder and Arrival Heights.

Total column validation at each station
Correlation values are generally the highest for the TIR-only product (r: 0.85) compared to TIR-NIR (r: 0.80) or NIR-only [...].

Instrument sensitivity varies with season, which is reflected in the column AK seasonal variability. An example of the range of AK variability is shown by the normalized MOPITT column AKs at Toronto and Wollongong (Fig. 3). The question arises whether seasonal sensitivity differences are significant enough to affect validation results. We conducted seasonal validation at each station and found the maximum difference in station-wise bias between seasons was on average 4.8%, 4.5% and 3.2% for TIR-only, TIR-NIR and NIR-only respectively. Seasonal variation in bias is below the all-station average standard deviation for each product: 7.6% (TIR-only), 10.3% (TIR-NIR) and 9.4% (NIR-only). We conclude there is no significant seasonally dependent bias for MOPITT.
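The seasonal check described above can be sketched as follows. The sketch assumes that bias is expressed as a percentage relative to the smoothed FTS column and that seasons are the usual three-month groups (DJF, MAM, JJA, SON); neither choice is spelled out in the text, and the function name is hypothetical.

```python
import numpy as np

# Assumed season definition: meteorological three-month groups.
SEASON_OF_MONTH = {12: "DJF", 1: "DJF", 2: "DJF", 3: "MAM", 4: "MAM", 5: "MAM",
                   6: "JJA", 7: "JJA", 8: "JJA", 9: "SON", 10: "SON", 11: "SON"}

def seasonal_bias_range(months, c_mopitt, c_smooth):
    """Per-season mean relative bias (%) and the largest between-season difference."""
    c_mopitt = np.asarray(c_mopitt, dtype=float)
    c_smooth = np.asarray(c_smooth, dtype=float)
    rel_bias = 100.0 * (c_mopitt - c_smooth) / c_smooth
    labels = [SEASON_OF_MONTH[int(m)] for m in months]
    biases = {}
    for season in sorted(set(labels)):
        in_season = np.array([lab == season for lab in labels])
        biases[season] = float(rel_bias[in_season].mean())
    spread = max(biases.values()) - min(biases.values())
    return biases, spread
```

Comparing the returned spread with the standard deviation of the relative differences at the same station mirrors the significance argument made in the text.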

Surface-type specific validation
MOPITT classifies pixel surface-type as land, water, or mixed. Different surface-types have the potential to affect validation results by influencing MOPITT retrievals. Larger variability in surface height over land, combined with emissivity and albedo differences, results in greater geophysical noise relative to water scenes (Deeter et al., 2011). Also, thermal contrast between skin surface and the overlying air can affect MOPITT sensitivity to CO. For instance, water scenes have lower thermal contrast, where skin surface and overlying air temperatures are similar. MOPITT has difficulty viewing the surface in low thermal contrast scenes and in these cases has better sensitivity to CO in the free troposphere. Consequently, water scenes tend to be sensitive to the free troposphere, while land scenes include more information from the lower troposphere (Worden et al., 2010).
AKs reflect the retrieval differences between surface-types (Fig. 4). For example, when comparing the mean MOPITT AK matrices at Toronto, the TIR-only land AK shows increased sensitivity around 900 hPa relative to the water AK, as a result of improved thermal contrast. The TIR-NIR land AK shows even greater sensitivity at around 900 hPa relative to both the TIR-only land AK and the TIR-NIR water AK, due to the combination of improved thermal contrast with extra information from the NIR signal. While the TIR-NIR water product does not include reflected solar information, AKs are different between the TIR-only and TIR-NIR over water scenes, due to retrieval differences. Specifically, the joint product attributes less weight to the a priori in the retrieval process, with a cost of higher variability (Deeter et al., 2011).
To assess the effect of different surface-types in the MOPITT retrievals, validation is performed separately for land or water [...] and Lauder are only represented by land pixels. Consequently, comparison between land or water pixels is completed where stations are represented approximately equally by water and land surface-types. At each station, the error-weighted average within a one degree radius is calculated with either all land or all water pixels. Mixed surface-type pixels are discarded. Table 4 summarizes validation results over water scenes for TIR-only and joint TIR-NIR products. Land scene validation results were presented in Tables 3a, 3b and 3c. While the TIR-NIR product over water scenes does not include NIR information, differences arise compared to the TIR-only water scenes due to differences in the retrieval algorithm as discussed above.

Validation statistics over water show a pattern consistent with validation over land, i.e. lower correlation, higher bias and higher standard deviation occur for the joint product compared to TIR-only. Overall, the choice of surface-type has very little effect on validation statistics for the sites investigated here.

Pixel-wise validation
The MOPITT detectors comprise four detector elements, resulting in four pixels, each with a nadir ground size of 22 × 22 km. Instrument-only noise is determined for each pixel from a periodic view of space. Pixel noise, combined with the response to geophysical variability, has been demonstrated to be highly variable between pixels (Deeter et al., 2015). We investigate the impact of pixel-specific variability on validation. At each station, the error-weighted average for each pixel is calculated within a one degree radius to be validated against FTS measurements. Analysis is for daytime-only and land-only retrievals (except for water-only at IZA, MLO and LRN).

Validation results differ between pixels. Most noticeably, pixel 1 provides consistently poorer correlations and larger standard deviations than the other three pixels (e.g. summarized for Lauder in Table 5). To visualize results at all stations, correlation is plotted against bias in Fig. 6. Perfect validation occurs at the intersection of the zero bias and unity correlation lines. All stations generally produce similar results to Lauder, with pixel 1 showing the poorest correlation in all three products. Figure [...] pixel 1 suggests first removing pixel 1 from the average. A more restrictive average would include the two best pixels for each dataset: pixels 3 and 4 for TIR-only and pixels 2 and 3 for TIR-NIR and NIR-only. The resulting average would include the satellite values that perform best against the FTS measurements.

Satellite retrievals over colder surfaces at higher latitudes are challenging mainly due to low thermal contrast, resulting in a higher weighting to the a priori and consequent lower information content. Information content of satellite retrievals is therefore dependent upon latitude. The latitudinal dependence of MOPITT DFS at these stations of interest is depicted in Fig. 7 (top row), which shows how DFS decreases moving closer to the poles, in the TIR-only and TIR-NIR products. The relationship of DFS with bias and correlation is assessed through latitudinal dependence (Fig. 7).

The latitudinal dependence of MOPITT total column retrieval biases is consistent with Deeter et al. (2014), who show V6 TIR-only biases relative to HIPPO are generally within ±2 × 10¹⁷ molec. cm⁻² (or approximately ±10%). Results here show the latitudinal dependence is similarly bounded for the TIR-NIR and NIR-only products (Fig. 7, [...]). [...] (2015) suggested that MOPITT bias may be related to DFS. We find that although DFS vary strongly with latitude, the MOPITT bias does not depend upon latitude. There is also no latitudinal dependence in the DFS, bias or Pearson's R for the NIR-only product, reflecting that this product is not as affected by thermal contrast difficulties.
In contrast, correlations are weakly determined by latitude, which suggests that DFS are related to correlation values (Fig. 7 [...]).

To help understand the driver of bias variability, we investigate the influence of altitude and find a larger range in bias at lower altitudes, for the TIR-only and TIR-NIR products (Fig. 8). High biases in the TIR-only and TIR-NIR products (defined as >5%) all occur at low altitudes. High bias is most likely due to values from the single overpass time of MOPITT at 10:30 a.m. being compared with all daytime measurements from FTS at these stations. There is more variability throughout the day in the FTS column due to changes in lower tropospheric CO, which is not captured in the MOPITT measurements. For example, the FTS will capture diurnal variation due to greater atmospheric mixing throughout the day, frontal systems bringing variable CO amounts, and/or rapid changes in nearby emissions and transport. Biases may be improved by temporally restricting comparisons closer to the 10:30 a.m. overpass. Further investigation would be necessary to determine the effect of temporal restriction at stations with high bias. The NIR-only product does not show bias dependence on altitude. Comparisons with NIR-only include a large amount of a priori (mean DFS of 0.32), which masks the variability in FTS columns.
Satellite bias can introduce inaccuracies for data assimilation and inverse modeling studies, particularly at high latitudes (Hooghiemstra et al., 2012; Jiang et al., 2015; Gaubert et al., 2016). Significantly high emissions were attributed to high MOPITT bias in Hooghiemstra et al. (2012), who suggest the need for satellite bias correction. Therefore, rather than restricting data to be assimilated to within ±40° (Jiang et al., 2015), we suggest that FTS could be used to either correct or account for MOPITT retrieval biases, particularly at high latitude stations, prior to data assimilation.

The geographical relationship of the bias drift is shown in Fig. 10. Drift in the Southern Hemisphere is small. In contrast, Northern Hemisphere drift is highly variable. Instrument degradation would be expected to produce consistent drift across stations. However, the variable drift implies the cause of drift is due to input parameters to the MOPITT retrieval process rather [...] occur at high latitudes in the Northern Hemisphere, where potential uncharacterized surface errors contribute to retrieval drift. Ho et al. (2005) found large standard deviation in a priori emissivity due to cloud detection uncertainties. Additionally, sea-ice may not be correctly accounted for in the satellite retrievals because sea-ice scenes are retrieved with the same parameters as water, despite having different emissivity properties. Consequently, a trend in cloudiness or sea-ice extent could produce a trend in MOPITT retrievals. As a result, we recommend avoiding the use of MOPITT retrievals above 60° N when assessing the temporal evolution of CO.

Conclusions
The first systematic validation of MOPITT version 6 retrievals with ground-based FTS at 14 NDACC stations has demonstrated low bias of the MOPITT instrument (generally <10%) and has highlighted some important considerations for using the satellite data in scientific analysis. While values have been calculated for an average of MOPITT values within a 1° radius, any systematic bias in MOPITT remains and is evaluated. MOPITT is generally biased high relative to FTS and bias was consistently higher for joint and NIR-only products than for the TIR-only product. Mean bias is 2.8% for TIR-only, 5.4% for TIR-NIR and 7.0% for NIR-only. MOPITT retrieves with equivalent skill over land or water, although the most information is present in the joint TIR-NIR land product as indicated by largest DFS. Pixel-wise validation revealed the poor performance of pixel 1. Some applications that require data thinning techniques (for example data assimilation) may remove pixel 1 from weighted averages, as this pixel has the lowest correlation and most variability. The poor performance of pixel 1 also suggests that processing of the level 3 product may need to be revised. We find no dependence of bias on latitude, suggesting no relationship to DFS. In contrast, latitude-dependent information content had a weak relationship to correlation results. Variability in lower tropospheric CO influences MOPITT bias, which is probably due to sampling/sensitivity differences between instruments. MOPITT bias found here may be used to account for satellite error prior to data assimilation, allowing for inclusion of data over a wider spatial range than is currently used. The MOPITT long-term bias drift has been bounded to within ±0.5% yr⁻¹ or lower at almost all locations. Variable drift in the Northern Hemisphere implies an uncharacterized retrieval parameter such as uncertainty in cloud detection or sea-ice representation. We recommend that trend analysis should not be performed above 60° N.
Overall, this study extends the geographical and temporal analysis of MOPITT validation results.

Figure 1. Location of the 14 NDACC ground-based remote-sensing FTS sites used in this study. Three-letter acronyms correspond to the information in Table 1.