Structural uncertainty in air mass factor calculation for NO 2 and HCHO satellite retrievals

Air mass factor (AMF) calculation is the largest source of uncertainty in NO2 and HCHO satellite retrievals in situations with enhanced trace gas concentrations in the lower troposphere. Structural uncertainty arises when different retrieval methodologies are applied in the scientific community to the same satellite observations. Here, we address the issue of AMF structural uncertainty via a detailed comparison of AMF calculation methods that are structurally different between seven retrieval groups for measurements from the Ozone Monitoring Instrument (OMI). We estimate the escalation of structural un5 certainty in every sub-step of the AMF calculation process. This goes beyond the algorithm uncertainty estimates provided in state-of-the-art retrievals, which address the theoretical propagation of uncertainties for one particular retrieval algorithm only. We find that top-of-atmosphere reflectances simulated by four radiative transfer models (RTMs) (DAK, McArtim, SCIATRAN and VLIDORT) agree within 1.5%. We find that different retrieval groups agree well in the calculations of altitude resolved AMFs from different RTMs (to within 3%), and in the tropospheric AMFs (to within 6%) as long as identical ancillary data 10 (surface albedo, terrain height, cloud parameters and trace gas profile) and cloud and aerosol correction procedures are being used. Structural uncertainty increases sharply when retrieval groups use their preference for ancillary data, cloud and aerosol correction. On average, we estimate the AMF structural uncertainty to be 42% over polluted regions and 31% over unpolluted regions, mostly driven by substantial differences in the a priori trace gas profiles, surface albedo and cloud parameters. Sensitivity studies for one particular algorithm indicate that different cloud correction approaches result in substantial AMF 15 differences in polluted situations (5 to 40% depending on cloud fraction and cloud pressure, and 11% on average) even for 1 Atmos. Meas. Tech. Discuss., doi:10.5194/amt-2016-306, 2016 Manuscript under review for journal Atmos. Meas. Tech. Published: 2 November 2016 c © Author(s) 2016. CC-BY 3.0 License.


Introduction
Satellite observations in the UV and visible spectral range are widely used to monitor trace gases such as nitrogen dioxide (NO 2 ) and formaldehyde (HCHO).These gases are relevant for air quality and climate change, because they are involved in the formation of tropospheric ozone and aerosols, which have an important influence on atmospheric radiative forcing (IPCC, 2013).Ozone and aerosols are defined as "essential climate variables" (ECVs) by the Global Climate Observing System (GCOS).These ECVs and their precursors (NO 2 and HCHO among others) are included in the ECV framework because they contribute to characterisation of the Earth's climate and they can be monitored from existing observation systems (Bojinski et al., 2014).Currently a wide range of ECV products are available, but they rarely have reliable and fully traceable quality information.To address this need, the Quality Assurance for Essential Climate Variables project (QA4ECV, www.qa4ecv.eu)aims to harmonise, improve and assure the quality of retrieval methods for the ECV precursors NO 2 and HCHO.Here, we focus on retrievals of tropospheric NO 2 and HCHO vertical column densities (VCDs) from space-borne UV/Vis spectrometers.Retrievals from these instruments have been used for a wide range of applications.These notably include estimating anthropogenic emissions of NO x and HCHO (e.g.Boersma et al., 2015;Marbach et al., 2009), natural isoprene emissions (e.g.Marais et al., 2014;Barkley et al., 2013) and NO x production from lightning (e.g.Lin, 2012;Beirle et al., 2010), data assimilation (e.g.Miyazaki et al., 2012) and trend detection (e.g.Krotkov et al., 2016;Richter et al., 2005;De Smedt et al., 2010).
Although trace gas satellite retrievals have improved over the last decades (e.g. C. Li et al., 2015;Richter et al., 2011;De Smedt et al., 2012;Bucsela et al., 2013), there is still a need for a more complete understanding of the uncertainties involved in each retrieval step.The retrieval of NO 2 and HCHO columns consists of three successive steps.First a spectral fitting is performed to obtain the trace gas concentration integrated along the average atmospheric light path (slant column density, SCD) from backscattered radiance spectra.For NO 2 , the stratospheric contribution to the SCD is removed to obtain the tropospheric SCD.Finally, the SCD is converted into the VCD using an air mass factor (AMF).Previous studies indicated that the AMF calculation is the largest source of uncertainty (contributing up to half of the typical VCD uncertainties of 40-60 %) in the NO 2 and HCHO retrievals in scenarios with a substantial tropospheric contribution to the total column (e.g.Boersma et al., 2004;De Smedt et al., 2008;Barkley et al., 2012).These studies arrived at such theoretical uncertainty estimates based on error propagation for one specific retrieval algorithm.
Theoretical uncertainty (also known as parametric uncertainty) is the uncertainty arising within one particular retrieval method.Structural uncertainty is the uncertainty that arises when different retrieval methodologies are applied to the same data (Thorne et al., 2005).To represent the state of the atmosphere, several choices and assumptions are made in the retrieval algorithm, in particular within the AMF calculation.Even though these choices are physically robust and valid, when different retrieval algorithms based on different choices are applied to the same satellite observations, this usually leads to different results.The structural uncertainty is intrinsic to the retrieval algorithm formulation and it is considered to be a source of systematic uncertainty (Povey and Grainger, 2015).In principle, theoretical and structural uncertainties should be considered independently of each other.However, in the calculation of the theoretical uncertainty, the contribution of the ancillary data is often calculated by comparing different databases (e.g. to estimate surface albedo uncertainty as in Boersma et al., 2004) rather than using the uncertainty of the database itself.Consequently, some components are shared in the structural and theoretical uncertainty calculations.However, for a full structural uncertainty estimate, all sources of methodological differences need to be considered.In the framework of AMF calculations addressed here, this implies e.g. the selection of radiative transfer model, vertical discretisation and interpolation schemes, the method for cloud and aerosol correction and the selection of (external or ancillary) data on the atmospheric state (surface reflectivity, cloud cover, terrain height and a priori trace gas profile).The problem of structural uncertainty has been addressed in other fields of atmospheric sciences, e.g. in satellite retrievals for atmospheric variables (Fangohr and Kent, 2012) and in numerical models for climate studies (Tebaldi and Knutti, 2007).
There are few studies addressing structural uncertainty for trace gas retrievals.van Noije et al. (2006) compared NO 2 tropospheric columns retrieved from GOME data by three different groups.In that study, the discrepancies inherent to differences and assumptions in the retrieval methods were identified as a major source of systematic uncertainty.However, the causes of discrepancies between retrievals were not addressed but were targeted for a more detailed investigation.In this study we focus on AMF structural uncertainty by comparing the AMF calculation approaches used by seven different retrieval groups and providing a traceable analysis of all components of the AMF calculation.Ensemble techniques to estimate structural uncertainty have already been applied in different atmospheric disciplines (e.g.Steiner et al., 2013;Liu et al., 2015).The groups that partici- pated in this study are Belgian Institute for Space Aeronomy (IASB-BIRA; abbreviated as BIRA), Institute of Environmental Physics, University of Bremen (IUP-UB), Wageningen University (WUR) and Royal Netherlands Meteorological Institute (KNMI) (calculations made by WUR following the KNMI approach, abbreviated as WUR), University of Leicester (UoL), Max Planck Institute for Chemistry (MPI-C), NASA Goddard Space Flight Center (NASA-GSFC; abbreviated as NASA) and Peking University.
We start with a comparison of top-of-atmosphere (TOA) reflectances simulated by radiative transfer models (RTMs), the main tool for any AMF calculation (Sect.3.1).The RTMs DAK, McArtim, SCIATRAN and VLIDORT solve the radiative transfer equation differently, and have different degrees of sophistication to account for the Earth's sphericity and multiple scattering.Next we compare altitude-dependent (or box-) AMFs for NO 2 and HCHO computed with the four RTMs (Sect.3.2).This is followed by a comparison of tropospheric AMFs (for NO 2 ) calculated by four groups for measurements by the Ozone Monitoring Instrument (OMI) based on identical settings (same ancillary data and same approach for cloud and temperature correction) (Sect.3.3.1).We interpret the resulting spread between the tropospheric AMFs as the AMF structural uncertainty associated with using different RTMs, vertical discretisation and interpolation schemes.Then, we investigate how the choice of cloud correction affects the AMF structural uncertainty (Sect.3.3.2).For the overall structural uncertainty estimate, we perform a round-robin exercise (Sect.3.3.3) in which seven different groups calculate NO 2 AMFs using their own preferred methods for cloud and aerosol correction and sources of ancillary data.Here we assess the effect of the different choices in the AMF structural uncertainty.Finally, we investigate how stratospheric AMFs are affected by the selection of RTM and their physical description of photon transport through a spherical atmosphere.The complete chain of uncertainties associated with each phase provides traceable quality as-surance for the AMF calculation.Recommendations on best practices are given for this particular algorithm step and they will be applied in a community best practice retrieval algorithm for ECV precursors, under development in the framework of the QA4ECV project.

AMF calculation process
The concept of a traceability chain (here in the form of a flow diagram) for the AMF calculation process and uncertainty assessment used in this study is illustrated in Fig. 1.Structural uncertainty estimated in each step is based on the standard deviation (1σ ) of relative differences of the compared elements.Modelled reflectance (R) at TOA is the starting point for air mass factor calculations using radiative transfer models.A RTM solves the radiative transfer equation, which describes the transport of radiation through the atmosphere to the observer (in our case the satellite) and the physical processes that affect the intensity of the radiation (absorption, scattering, refraction and reflection) (first box in the diagram in Fig. 1).Reflectance (unitless) is calculated from fundamental radiation quantities, and it is defined as the ratio of modelled Earth radiance (I ) (times π ) and the solar irradiance at TOA perpendicular to the solar beam (E 0 ) multiplied by the cosine of the solar zenith angle (µ 0 ): Different models use different methods to solve the radiative transfer equation and to describe the sphericity of the Earth's atmosphere.Differences in modelled TOA reflectances between RTMs provide an estimate for the reflectance structural uncertainty (σ R ).This uncertainty due to the choice of the RTM propagates to the next step in the AMF calculation.
A. Lorente et al.: Structural uncertainty in AMF calculation process for NO 2 and HCHO satellite retrievals Altitude-dependent AMFs (box-AMFs, equivalent to scattering weights) characterise the vertical sensitivity of the measurement to a trace gas (e.g.Palmer et al., 2001).They are directly related to how the measured radiance at TOA changes with a change of the optical depth of the atmosphere (related to the presence of a trace gas in a certain atmospheric layer), with the requirement that the absorber is optically thin (optical thickness τ gas 1).In the context of the AMF calculation (second box in diagram of Fig. 1), box-AMFs for each layer can be calculated and stored in a look-up table (LUT) as a function of the forward model parameters (b) such as satellite viewing geometry, pressure level, surface pressure and surface reflectivity.There is also the possibility of online radiative transfer calculations for determining box-AMFs, i.e. bypassing the calculation of a LUT (e.g.Lin et al., 2014Lin et al., , 2015;;Hewson et al., 2015).Different RTMs use different vertical discretisations of the atmosphere, and calculate box-AMFs in different ways (see Sect. 2.2).A comparison of the box-AMF LUTs calculated with different RTMs provides a measure for the box-AMF structural uncertainty (σ m ), which can be considered to be the reproducibility of the box-AMFs from different RTMs when the same settings and input data are used.
The air mass factor (M) represents the relative (dimensionless) length of the mean light path at a certain wavelength for photons interacting with a certain absorber in the atmosphere relative to the vertical path.The AMFs are used to convert the SCD obtained from the reflectance spectra to a VCD.To calculate the tropospheric VCD, a tropospheric AMF is used (VCD tr = SCD tr /M tr ).For species that have a stratospheric contribution to the total slant column, the stratospheric SCD first needs to be estimated and subtracted from the total SCD.For this purpose, a stratospheric AMF is often used together with an independent estimate of the stratospheric VCD (e.g. from a chemistry transport model, a climatology or independent measurements) (SCD strat = VCD strat • M strat ).
If the trace gas is optically thin, the total air mass factor can be written as the sum of the box-AMFs of each layer weighted by the partial vertical column (e.g.Palmer et al., 2001;Boersma et al., 2004): In Eq. (2) m l is the box-AMF and x a,l is the trace gas sub-column in layer l.However, as the actual profile of subcolumns is unknown, an a priori profile has to be used in the AMF calculation.The summation is done over the atmospheric layers (l) of the a priori trace gas profile.In this step of the AMF calculation, apart from the profile shape of the trace gas, it is also necessary to have the best estimates for other forward model parameters ( b) such as satellite viewing geometry, surface pressure and surface reflectivity.Surface reflectivity depends on the surface properties and the geometry of the incident and reflected light.This anisotropy is described by the bidirectional reflectance distribution function (BRDF).In practice, surface reflectivity is often approximated by an isotropic Lambertian equivalent reflector (LER).There are different sources from which the a priori information can be obtained.It is desirable to use as many forward model parameters as possible retrieved from the satellite instrument itself.This practice gives consistency to the trace gas retrieval regarding the forward model parameters.
The NO 2 and HCHO absorption cross sections used in the SCD fit and box-AMF calculation are representative of one fixed temperature.However, these cross sections vary with temperature, so it is necessary to apply a temperature correction.This correction accounts for the change in the absorption cross section spectrum at a specific layer as a function of the effective temperature (see Eq. S1 in the Supplement), based on temperature and trace gas profiles from model data or climatologies.The correction is commonly done by applying a correction factor (c l ) for each layer in the AMF calculation.
Most of the studies in which the temperature effect on the NO 2 cross section is analysed assume a simple dependency of the correction factor to temperature (Vandaele et al., 2002) (see Eqs. S2 and S3 for typically used correction factors).For satellite applications, the change of the absorption cross section in case of NO 2 has been reported to be approximately −0.3 % per K in the visible (Bucsela et al., 2013;Boersma et al., 2002) and −0.05 % per K for HCHO (De Smedt, 2011).Satellite retrievals also need to consider the presence of clouds.In the AMF calculation, residual clouds can be accounted for in several ways.The independent pixel approximation (IPA) consists of calculating the AMF for a partly cloudy scene as a linear combination of cloudy (M cl ) and clear (M cr ) components of the AMF, weighted by the cloud radiance fraction w (i.e. the fraction of radiance that originates from the cloudy part of the pixel) (Martin et al., 2002;Boersma et al., 2004): In Eq. (4) w is wavelength dependent through radiation intensity, so it will be different for NO 2 and HCHO (see Eq. S4 in the Supplement).AMFs for cloudy scenes are calculated using Eq.(3) with a specific cloud albedo and cloud pressure, with m l = 0 below the cloud.In line with assumptions made in current cloud retrievals, the cloud is considered to be a Lambertian reflector with a fixed cloud albedo.This simple cloud model is, in most cases, suitable to be used in trace gas retrieval algorithms (Acarreta et al., 2004).As an alternative, the radiative  , 1998).
The atmosphere can also be assumed to be cloud-free for cloud fractions below a certain threshold (e.g.0.1 or 0.2, see Table 3).In that case, a clear-sky AMF is used and Eq. ( 4) reduces to M = M cr .For cloud fractions larger than the clear-sky threshold but below a cloudy-sky threshold, IPA is sometimes applied.Generally, measurements with cloud fractions higher than the cloudy-sky threshold are discarded or flagged.There is also the possibility to account for cloudaerosol mixtures; in this case the threshold for using either clear-sky AMF or IPA can depend on both cloud fraction and cloud altitude (see Sect.S1.3).In all approaches accurate information is needed on the cloud radiance fraction and cloud height.
Different retrieval groups use different sources for the ancillary data, as well as different methods to account for the temperature dependence and the presence of clouds and aerosols (e.g.van Noije et al., 2006).In our study, each of the groups first calculated tropospheric AMFs using harmonised settings, i.e. using the same forward model parameters, temperature correction and cloud correction.In order to calculate the total AMF using Eq. ( 3), an interpolation from the LUT needs to be done to obtain the box-AMFs at the specific values of the forward model parameters.Furthermore, a vertical interpolation is required to adjust the vertical discretisation of the a priori absorber profile to the one of the LUT.From the comparison of the tropospheric AMFs calculated using harmonised settings, we can thus obtain a relative AMF structural uncertainty, which is determined by different approaches in interpolation and vertical discretisation of the box-AMFs, assuming that the selected forward model parameters are the true values.
Next, each of the groups used their preferred settings to calculate tropospheric AMFs.In this round-robin exercise, a comparison of state-of-the-art retrieval algorithms, the dif-ferences between AMFs not only arise from differences between the RTMs, vertical discretisation and interpolation but also from differences in the selection of forward model parameter values and the different corrections for clouds, aerosols and surface reflectivity.Thus the differences in the AMFs using the preferred settings can be interpreted as the overall structural uncertainty of the AMF calculation (Thorne et al., 2005).

Participating models
Four RTMs from different research groups participated in the comparison.Some differences between models are highlighted in Table 1.A brief summary for each model is listed alphabetically in this section and more detailed information about the models can be found in the references.

DAK
DAK (Doubling-Adding KNMI) was developed at the Royal Netherlands Meteorological Institute (Stammes, 2001).DAK uses the doubling-adding method for solving the radiative transfer equation (Stammes et al., 1989;de Haan et al., 1987).The method consists of first calculating the reflection and transmission properties of a homogeneous layer by repeated doubling, starting with a very thin layer, and then adding homogeneous layers on top of each other, which then yields the reflection and transmission of the combined layers.The internal radiation field is computed at the interface of all layers and the radiation emerging at the top of the atmosphere and at the surface is calculated.DAK accounts for multiple scattering and polarisation.It is also possible to account for the Earth's sphericity using the pseudo-spherical option, which corrects for sphericity in the light path of the direct solar beam, but not in the scattered beam.
Box-AMFs are calculated with DAK in this study by WUR/KNMI by differencing the logarithm of reflectances at TOA with and without the trace gas in atmospheric layer l divided by the gas absorption optical thickness of the layer m l = − lnR(τ gas,l ) − lnR(τ gas,l = 0) τ gas,l . (5)

McArtim
McArtim (Monte Carlo Atmospheric Radiative Transfer Inversion Model) (Deutschmann et al., 2011) was developed at University of Heidelberg and Max-Planck Institute for Chemistry (MPI-C, Mainz).It is based on the backward Monte Carlo method: a photon emerges from a detector in an arbitrary line-of-sight direction and is followed in the backward direction along the path until the photon leaves the top of the atmosphere.The various events which may happen to the photon at various altitudes are defined by suitable probability distributions.At each scattering event the probability that the photon is scattered into the direction of the sun is calculated and the intensity of the photon is weighted by the sum of the probabilities of all scattering events (local estimation method).In this RTM, the integro-differential equation for radiative transfer is deduced and solved using Neumann series, the summands of which are linked with the contributions of multiple scattering orders to the radiation field.
McArtim is a 3-D-model and uses full spherical geometry, which means that sphericity is accounted for incoming, single scattered and multiple scattered photons.The model is capable of including polarisation and rotational Raman scattering (which are included in the simulations shown in this study).Box-AMFs calculated by MPI-C are obtained from Jacobians (derived by W = ∂lnI ∂β , with β (km −1 ) the absorption coefficient) for each grid box according to the formula: In Eq. ( 6) W refers to the Jacobian (km), I is the simulated radiance at TOA normalised by the solar spectrum (unitless) and h is the grid box thickness (km).(Rozanov et al., 2014) was developed at the Institute of Environmental Physics at the University of Bremen (IUP-UB) in Germany.It models radiative transfer processes in the atmosphere from the UV to the thermal infrared, in both scalar and vector mode, i.e. with the option to account for polarisation.The simulations can be done for a plane parallel, pseudo-spherical or fully spherical atmosphere.In the fully spherical approach, the integral radiative transfer equation is solved by accounting for single scattering in spherical mode, and multiple scattering is approximated with a solution of the differential-integral radiative transfer equation in the plane parallel mode.SCIATRAN calculates the Jacobians or weighting functions, which are the derivatives of the simulated radiance with respect to atmospheric and surface parameters (air number density in this case).These quantities are related to the box-AMFs calculated by IUP-UB as follows:

SCIATRAN
) is the weighting function at atmospheric level l, I (W m −2 nm −2 sr −1 ) is the TOA radiance, σ l (cm 2 molec −1 ) is the absorber cross section and h l (cm) is the thickness of the layer.

VLIDORT
VLIDORT (Vector-LInearized Discrete Ordinate Radiative Transfer) was developed by Rob Spurr at RT SOLUTIONS, Inc.The model is based on the discrete ordinate approach to solve the radiative transfer equation in a multi-layered atmosphere, reducing the RTE to a set of coupled linear first order differential equations.Then, perturbation theory is applied to the discrete ordinate solution (Spurr et al., 2001).Intensity and partial derivatives of intensity with respect to atmospheric parameters and surface parameters (i.e.weighting functions) are determined for upwelling direction at TOA, for arbitrary angular direction.The pseudo-spherical formulation in VLIDORT corrects for the curved atmosphere in the solar and scattered beam (for single scattering, not for multiple scattering).Box-AMFs are derived from the altitude-dependent weighting functions determined by VLIDORT: I (W m −2 nm −2 sr −1 ) is the TOA radiance, τ gas is the trace gas absorption optical thickness of the layer and the term (τ gas,l • ∂I ∂τ gas,l ) is the altitude-dependent weighting function.

TOA reflectances
As a first exercise, a base case calculation and comparison of TOA reflectances was made to assess the performance of the four RTMs and to obtain the structural uncertainty in TOA reflectance modelling.The base case comparison allowed us to establish the best possible level of agreement between RTMs by identifying differences in the RTMs performance that in more complex settings would be difficult to recognise.Furthermore, total and ozone optical thickness were compared to evaluate how the models agreed in their treatment of scattering and absorption processes and whether differences in scattering and absorption can explain possible differences between the TOA reflectances.Basic model parameters were established as input in all RTMs (details can be found in Table S1 in the Supplement).The basic atmospheric profile was a 33-layer midlatitude summer atmosphere (Anderson et al., 1986), and every group performed their own vertical discretisation of this profile.
In the RT modelling, we considered a clear-sky atmosphere, so clouds and aerosols were not included.Rayleigh scattering and O 3 absorption were included, but Raman scattering was not included.The temperature dependence of the ozone cross section was neglected in the reflectance calculation.TOA reflectances were calculated at seven wavelengths, including 440 and 340 nm which are relevant for the retrievals of NO 2 and HCHO, respectively.Both scalar (i.e.without polarisation) and vector (i.e. with polarisation) calculations were performed in most of the cases.All models applied their particular sphericity treatments to the calculations.The surface was considered to be a Lambertian reflector by all the RTMs.This approximation assumes that surface reflectivity is isotropic (i.e it does not consider the directionality of the surface reflectance distribution).The selected geometries covered a wide range of values for solar zenith angle (SZA, θ 0 ), viewing zenith angle (VZA, θ ) and relative azimuth angle (RAA, ϕ = 180 • − |φ − φ 0 |, where φ − φ 0 is the viewing direction minus solar direction).All the angles are specified with respect to the surface.The values for SZA span the typical range of what UV/Vis sensors are encountering in orbit, and the maximum value of VZA is related to the higher possible values of this parameter for the future TROPOMI instrument (72.5 • ) (van Geffen et al., 2016).
All models calculate the same spectral dependency of TOA reflectance, as shown in Fig. 2 (solid line).TOA reflectance increases towards shorter wavelengths due to stronger Rayleigh scattering.TOA reflectance simulated by the different models agrees within 1.3 % for the geometries included in Fig. 2. The dashed line in Fig. 2 shows the total optical thickness as a function of wavelength for DAK, SCI-ATRAN and VLIDORT (McArtim does not provide this output), and is generally consistent within 0.15 % for all wavelengths except 340 nm, where the differences are 0.5 %.
Figure 3 shows the distribution of relative differences (defined as (100(a−b)/a)) between TOA reflectances simulated by the four RTMs at 340 and 440 nm.The distribution is determined by the relative differences between all combinations of model differences, including all simulated geometry scenarios for a surface albedo of 0 and terrain pressure of 1013 hPa.According to the standard deviation in both distributions (dashed lines in Fig. 3), the relative differences are below 1.5 % at 340 nm and 1.1 % at 440 nm in most geometry configurations (80 % of the samples of the distribution), including the most common retrieval scenarios.The tails of the distributions at both wavelengths correspond to extreme viewing geometries, i.e. for scenarios in which solar and viewing zenith angles are both large.Mean relative differences over all RTM pairs are at most 6.4 % for extreme geometries (θ 0 = 87 • , θ = 72.5 • ), and for shorter wavelengths.For nadir view (θ = 0 • ) relative differences are on average two times smaller than for larger VZA (θ ≥ 60 • ) at both 340 and 440 nm.
The results show strong consistency of TOA reflectance calculations for the most common moderate viewing geometry retrieval scenarios.Relative differences are somewhat higher for larger VZA, SZA and shorter wavelengths.For the more extreme geometries, the light path through the atmosphere is generally longer and photons have a higher probability of undergoing interactions (scattering, absorption) with the atmosphere.Furthermore, differences in the treatment of the Earth's sphericity for the extreme geometries have a stronger influence than in close to nadir viewing geometries.These differences will still be present in the box-AMF comparison in Sect.3.2.Rayleigh scattering also affects the effective photon path and it is stronger at 340 nm than at 440 nm.Thus, small differences in the description of Rayleigh scattering in the RTMs are more likely to lead to differences for the extreme geometries and shorter wavelengths.The standard deviation of differences between modelled TOA reflectances of 1.5 % (at 340 nm) and 1.1 % (at 440 nm) in this comparison can be considered to be the reflectance structural uncertainty.The agreement in this study is better than in previous RTM comparisons like Wagner et al. (2007) and Stammes (2001) which reported differences of 5 %.The detailed RTM comparison will serve as a test bed to analyse the performance of other RTMs.

NO 2 and HCHO altitude-dependent (box-) air mass factors
To calculate box-AMFs, a common vertical grid was agreed between the groups in order to reduce the sources that might cause differences between the RTMs.The common profile  resolution was 0.1 km from the surface up to 10 km, 1 km resolution from 10 to 60 and 2 km resolution from 60 to 100 km.NO 2 box-AMFs were calculated at 440 nm and HCHO box-AMFs were calculated at 338, 341 and 344 nm to investigate the wavelength dependency (not shown).Box-AMFs were calculated accounting for the polarisation of light and the Earth's sphericity.The number of reference points for surface albedo was increased and several surface pressures were added relative to the TOA reflectance simulations in the pre-vious section to cover a wider range of scenarios.All settings are detailed in Table S2.
Figure 4a shows that the four participating groups generally agree well on the vertical profile shape of NO 2 and HCHO box-AMFs in the troposphere.Measurement sensitivity decreases towards the surface due to the increase of light scattering in the lower troposphere.Measurement sensitivity to HCHO is substantially lower than to NO 2 , because of stronger Rayleigh scattering at shorter wavelengths.
McArtim box-AMFs have lower values in the stratosphere (pink line), presumably reflecting the more realistic description of atmosphere's sphericity in McArtim relative to the other models (see Sect. 3.4 for specific sphericity effect on AMFs).The vertical profile of McArtim shows a wavering line due to the statistical noise in the Monte Carlo simulations (which can be reduced by increasing the number of simulations).Figure 4b, d-f shows the NO 2 and HCHO box-AMF dependency on forward model parameters (surface albedo, surface pressure, SZA, VZA and RAA) in the lower troposphere at 950 hPa.This pressure level (close to the surface) is especially relevant because this is where trace gas concentrations are enhanced in polluted conditions.The sensitivity to surface albedo at 950 hPa (Fig. 4b) is similar for all four RTMs.Box-AMFs increase with surface albedo due to a stronger reflection of light at the surface.This increase is particularly strong for low values of surface albedo.For an albedo of 0.05, an increase of 0.01 in the surface albedo results in an increase of 11 % in the NO 2 box-AMF at 440 nm and of 9 % in the case of HCHO at 338 nm.The increase in the box-AMFs is less steep for higher values of surface albedo.Thus, an accurate knowledge of surface albedo is required, especially for low albedo values.For surface pressure (Fig. 4c), the box-AMF (at 797 hPa) decreases with decreasing surface pressure.For increasing terrain height, the amount of light scattered and reflected from below 797 hPa decreases.In a more elevated terrain, the photons undergo fewer scattering events, which tends to reduce box-AMFs at a specific level.Models agree well in representing this sensitivity.An error in the surface pressure of 10 hPa leads to ±2 % errors in the lower tropospheric box-AMF values, which indicates the importance of accurate surface pressure information that is representative of the entire pixel area.Box-AMFs at 950 hPa show relatively weak dependency on VZA (Fig. 4e) and RAA (Fig. 4f) and stronger dependency on high values of SZA (Fig. 4d), but all RTMs agree well on measurement sensitivity to geometry parameters.
Figure 5 shows the vertical profile of mean relative differences in NO 2 (left panel) and HCHO (right panel) box-AMFs between all the models, for a specific surface albedo and surface height and a wide range of solar and viewing geometries.Generally, models reproduce box-AMFs to within 2 % for NO 2 and 2.6 % for HCHO.Mean relative differences are higher at the lowest layers and around 300 hPa.This is due to unavoidable slight differences in vertical discretisation of the surface-atmosphere boundary and where the resolution changes from 0.1 to 1 km at 10 km altitude in the different models.Specific differences were also found in the mid-troposphere to upper troposphere and stratosphere, where McArtim is on average lower than the other RTMs.Those differences illustrate the different treatments of multiple scattering within the models.McArtim accounts for multiple scattering in a fully spherical atmosphere, whereas DAK, VLIDORT and SCIATRAN simulate multiple scattering in a plane parallel atmosphere.In a spherical atmosphere, less light is horizontally scattered into the line of sight of the instrument than in a plane parallel atmosphere (see Fig. S2), which is one of the reasons for lower box-AMFs by McArtim in the stratosphere (visible in Fig. 4a between 200 and 0 hPa).
Relative differences for 950 hPa box-AMFs are below 1.1 % for NO 2 and below 2.6 % for HCHO in most geometry configurations (according to the standard deviation of relative differences distribution for 950 hPa box-AMFs, not shown).Higher relative differences mainly occur between McArtim and the other models.The highest relative differences occur for scenarios with high VZAs (θ = 72.5 • ) (not shown), again indicating that different Rayleigh scattering descriptions and sphericity treatments in the radiative transfer modelling of the atmosphere are important.
This comparison indicates a good agreement between box-AMF LUTs computed using different RTMs.The structural uncertainty in the AMF calculation due to the choice of RTM and different interpolation schemes is 2 % for NO 2 and 2.6 % for HCHO.These results suggest that a correct treatment of the processes affecting the effective light path in the atmosphere is important for box-AMF calculation.The vertical discretisation is also relevant in box-AMF calculations, as demonstrated by the differences at specific altitudes (Fig. 5) and by the box-AMF sensitivity to altitude (Fig. 4a).Therefore, the vertical sampling of the LUT should have a fine resolution, especially in the lower troposphere where strong gradients in NO 2 and HCHO concentrations occur.The dependencies of the box-AMFs at low surface albedo values (Fig. 4b) and to surface pressure (Fig. 4c), suggest that the number of reference points in the LUT for these parameters should be large.

Tropospheric air mass factors
In order to compute tropospheric AMFs via Eq.(3) we need to interpolate the box-AMFs from the LUT for the best estimate of the forward model parameters b.Generally a 6-D linear interpolation (or 5-D if the vertical resolution of the LUT and the a priori profile vertical grid are equal) is done over all the parameters on which the box-AMF depend.For each dimension, the two closest values to the exact pixel parameters are used to obtain the interpolated box-AMF (m l in Eq. 3).This approach will introduce systematic errors in case of non-linear dependencies of the parameters in the LUT.Pixel-by-pixel online calculations of box-AMFs would  2015) estimated the differences between online and LUT-derived AMFs to be on average less than 1 %, for individual measurements less than 8 %, with an upper bound of the difference of 20 % over South America.Lin et al. (2014) found 1-5 % differences on retrieved VCDs with and without LUT over China.

Harmonised settings
Four groups used the same settings (forward model parameters, a priori profiles, temperature and cloud correction) to calculate clear-sky and total tropospheric NO 2 AMFs for one specific OMI orbit over Australia and East Asia on 2 February 2005 (see Fig. 6).The selected harmonised settings were those from KNMI/WUR (see Table 3).All groups applied the same temperature correction (from Boersma et al., 2004, see Eq. S1) and cloud correction via the independent pixel approximation.The aim of this comparison was to obtain an estimate of the structural AMF uncertainty introduced by different vertical discretisations and by different interpolation schemes, assuming that the values of the selected forward model parameters are true.
All groups calculate similar AMF spatial patterns for the selected orbit.Figure 6 (upper panels) shows total tropospheric NO 2 AMFs calculated by each group.The distribution of the AMF values along the orbit is determined by the different parameters on which AMFs depend.Lower panels in Fig. 6 show NO 2 (a priori) model vertical column, surface albedo and cloud fraction in the orbit.At high latitudes, where surface albedo is high, AMFs are up to 3-5.Surfaces with high albedo (usually covered by snow or ice) reflect more radiation than surfaces with lower surface albedo, and this increases the AMF values.The effect of clouds and the a priori profile is also visible: AMFs are generally low in cloudy regions and over polluted regions in eastern China (∼ 30 • N), indicative of reduced sensitivity to NO 2 in the lowest layers of the atmosphere.
The correlation between AMFs calculated by the different retrieval groups is excellent (R 2 > 0.99).Overall, tropospheric AMFs calculated by each of the groups agree within 6.5 % in polluted areas and within 2.5 % in clean remote areas for most retrieval scenarios, in line with the results from the box-AMF LUT comparison.BIRA AMFs are on average higher than AMFs calculated by the other groups, generally by a few percent, and IUP-UB AMFs are on average lower for polluted and unpolluted conditions.Table 2 summarises the results of the comparison.
The largest differences are found at the edges of the OMI orbit, where viewing zenith angles are large and light paths are long.This can be seen in the lower right panel of Fig. 6, where the relative differences of tropospheric NO 2 AMFs between MPI-C and WUR are clearly visible at the edges of the orbit.These differences are consistent with the higher sensitivity to tropospheric trace gases for extreme viewing zenith angles (also shown in Fig. 4e) in McArtim compared to DAK. Figure S1 in the Supplement shows the relative differences between all AMFs calculated by the groups.Relative difference distributions show patterns that reflect the spatial distribution of surface albedo, clouds and NO 2 (e.g. over south-eastern Australia, eastern China and Korea).Large differences between the groups are found in cloudy conditions.These effects reflect the uncertainties arising from the use of different RTM as well as from the interpolation and the vertical discretisation of the LUT when calculating the AMFs.
These results demonstrate that, even when similar RTMs, box-AMFs and identical forward model parameters are used to calculate the AMFs, there is structural uncertainty that is introduced by the specific implementation of different groups.First, the choice of a RTM introduces uncertainty in the box-AMF calculation.Second, there are interpolation errors that are intrinsic to the calculation method using Eq.(3), i.e. interpolation errors in finding the AMF value from the 6-D LUT and the vertical discretisation of the a priori profile.Table 2. Statistical parameters for the comparison of total tropospheric NO 2 AMFs for polluted and unpolluted pixels (pixels with model NO 2 vertical column higher or lower than 1 × 10 15 molec cm −2 respectively) between the different retrieval groups for one complete orbit from 2 February 2005 (2005m0202t0339-o02940 v003).Only pixels with effective (i.e.radiometrically equivalent) cloud fraction ≤ 0.2 are considered.Mean, median and sigma are relative differences in % (100(a − b)/a).
Polluted pixels (#1983) Diff. between Mean (rel.diff.)Median (rel.diff.)σ (rel.diff.) R  Overall, the average differences between the AMFs (always below 6.5 % for cloud fractions less than 0.2) are somewhat higher than the differences from the LUT comparison (2 %).This means that in successive steps of the AMF calculation sources of systematic uncertainty are added that propagate throughout the AMF calculation process.These sources directly affect the agreement between the AMF calculated by different groups and hence affect the AMF structural uncertainty.6.5 % represents an upper limit value for the differences that different RTMs and LUTs may introduce to the final AMF calculation.

Cloud correction: IPA vs. clear-sky AMF
It is important to account for the effect of clouds on the photon path lengths in the troposphere when calculating tropospheric AMFs.There are various approaches that are commonly used to calculate AMFs in (partly) cloudy conditions.The independent pixel approximation (IPA), introduced in Eq. ( 4) (e.g.Martin et al., 2002), is motivated by the fact that few pixels are completely cloud-free.Many pixels still have some degree of cloud cover, and even small cloud fractions strongly affect the sensitivity to the trace gas.The relevant physical effect of clouds (reduced sensitivity to trace gas below the cloud and enhanced sensitivity to trace gas above and in the top layer of the cloud) is explicitly taken into account in the IPA.Another approach is to consider clearsky AMF for scenes with a sufficiently small cloud fraction (e.g.Richter and Burrows, 2002).The motivation for using clear-sky AMFs instead of IPA is that for scenes with small cloud fractions (e.g.< 0.2), retrieved cloud parameters (cloud fraction and cloud pressure) have relatively high uncertainty.This inhibits the reliable modelling of the effect of clouds on photon path lengths and, consequently, a clear-sky AMF is used.
To quantify the differences between the two approaches, here we compare tropospheric NO 2 AMFs calculated by WUR (see Table 3) with the IPA approach and the clearsky AMFs, for two complete days of OMI measurements (2 February and 16 August 2005).In polluted conditions, IPA AMFs are smaller than clear-sky AMFs on average, with differences as large as −40 % for cloud fractions approaching the threshold value of 0.2 (left panel of Fig. 7).The negative differences between IPA and clear-sky AMFs are largest for the highest clouds, illustrating the reduced sensitivity to tropospheric NO 2 below the cloud in the IPA.IPA AMFs are larger than clear-sky AMFs for clouds situated in the lower troposphere (cloud pressure > 900 hPa), where most NO 2 pollution resides.These positive differences can be understood from the albedo effect of residual clouds.Low, bright clouds lead to enhanced photon scattering through the NO 2 layers above the cloud level and also inside the cloud top layer, and this increases the sensitivity to NO 2 .For polluted conditions, IPA AMFs are on average smaller than clear-sky AMFs by 20 % for cloud fractions of 0.05-0.2,and smaller by 11 % for cloud fractions between 0.0 and 0.2.
In unpolluted conditions, IPA and clear-sky AMFs are generally quite similar, with average relative differences within 5 %.Still, there are important differences between the two approaches.In unpolluted conditions with clouds in the free and upper troposphere (cloud pressure < 600 hPa), IPA AMFs are smaller because of reduced sensitivity to NO 2 (right panel of Fig. 7).For clouds in the lower troposphere, IPA AMFs are larger because of the albedo effect.The change of sign in the differences between IPA and clearsky AMFs now occurs near 700 hPa (instead of near 900 hPa for polluted scenes), reflecting the more even vertical distribution of NO 2 in pristine conditions compared to polluted scenes in which most NO 2 resides in the polluted boundary layer.
These results indicate that the differences between using IPA or clear-sky AMFs are especially substantial for polluted conditions and small residual cloud fractions.Selecting a particular cloud correction approach implies that AMF values that will be systematically different from values obtained with the other method.In polluted conditions, the mean differences are 20-40 % for cloud fractions between 0.1 and 0.2, with cloud pressure largely explaining the magnitude and sign of the differences.Note that the a priori profiles used to calculate the AMFs in this section have been obtained from a specific CTM.If a different CTM were used, the values for the differences between IPA and clear-sky AMFs would be different, in line with the structural uncertainty that is being discussed in this study (See Sect. 3.3.3).A previous study by van Noije et al. (2006) reported 30 % higher GOME tropospheric NO 2 columns retrieved using the IPA compared to retrievals using clear-sky AMFs.Such differences are in line with the systematically lower IPA AMFs found here.However, like the study by van Noije et al. (2006), we cannot clearly recommend one AMF approach over the other.In order to make such a recommendation, a more detailed analysis of the cloud parameter uncertainties is needed, along with a validation of tropospheric NO 2 retrievals using different AMF approaches against independent reference data.Such a validation exercise should preferably focus on polluted conditions with small (0.05-0.2) residual cloud fractions.

Round-robin comparison
For the round-robin comparison, each group calculated tropospheric NO 2 AMFs using their preferred settings (i.e.their own preference for source of forward model parameters, cloud and aerosol correction).We extended the comparison and included other leading international retrieval groups (University of Leicester, NASA and Peking University).We now have a wider range of approaches and assumptions to better evaluate the impact that the calculation methods and choices of forward model parameters have on the structural uncertainty.
Table 3 summarises the AMF algorithms included in this comparison.There are several differences with the harmonised settings used in the previous section.IUP-UB and BIRA now apply IPA only when cloud fraction exceeds 0.1 and 0.2, respectively, motivated by the high uncertainty of cloud parameters for scenes with small cloud fractions (see Sect. 3.3.2).Peking University accounts for the surface reflectance anisotropy and does pixel-by-pixel online radiative transfer calculations.They also include an explicit aerosol correction, motivated by the fact that the implicit aerosol correction breaks down under conditions of high aerosol optical thickness and strongly absorbing particles (Castellanos et al., 2015;Chimot et al., 2016), which is particularly significant in East China.MPIC applies IPA cloud correction for clouds higher than 3 km and clear-sky AMFs for clouds between 2 and 3 km when cloud fraction is less than 0.1.For clouds be-low 2 km they include a parameterised aerosol-cloud layer in order to account for the possibility of cloud aerosol mixtures, which might be especially relevant for AMF calculation in scenarios where trace gas is most abundant in the lowest part of the troposphere.Among all the groups, five different chemistry transport models for the a priori NO 2 profiles are used.
Different groups use different LUTs for their AMF calculations, and POMINO uses pixel-by-pixel online radiative transfer calculations.The LUTs are different in several aspects, such as the RTMs used to create them and the number of reference points for each dimension.All these differences affect the AMF structural uncertainty.Based on the discussion in previous sections we consider that the use of different LUTs introduces a structural uncertainty of the order of 6.5 %.
Most of the surface albedo values used in the retrievals come from the Kleipool et al. (2008) database, which is based on OMI surface reflectance climatology.However, due to the different representations of surface reflectance within this database (mode and min LER), only three retrieval groups use the exact same albedo values.We investigated whether this could bias the estimation of the AMF structural uncertainty.We recalculated the AMF structural uncertainty with two retrievals that use the exact same albedo values and with three that use different albedo values.These estimated that AMF structural uncertainties were of similar magnitude and therefore we can conclude that the fact that the surface albedo values come from the Kleipool et al. (2008) database is not a clear driver of the overall structural uncertainty calculation.
The agreement of AMFs from this round-robin exercise quantifies the overall AMF structural uncertainty.The comparison with seven groups allowed us to calculate a mean AMF as a reference (which is not necessarily the true AMF) value which can be considered a state-of-the-art AMF value.For a representative ensemble mean AMF, we required all groups to have a valid (unflagged) AMF value at a pixel location.We selected two different days (2 February and 16 August 2005) in winter and summer to identify possible seasonality effects in the agreement of the AMFs.

Round robin: identical cloud parameters
First we compare the six groups that use the same cloud parameters.In contrast to what we found in the harmonised settings comparison, the global maps of tropospheric AMF calculated by each group using their preferred settings (Fig. 8) show pronounced differences in several regions.For example, over the Sahara desert, where surface albedo is high (see lower panel on Fig. 8), AMFs differ by up to 15 %.Small differences in the albedo values can lead to high differences in the AMFs, especially for surface albedos lower than 0.3 (see Fig. 4b).Over central Africa, AMFs differ in situations where cloud fraction is close to the typically applied threshold of 0.2 (left lower panel in Fig. 8).
We compared global AMF calculations from all individual groups against the pixel mean AMF from six groups (Peking University only calculates AMFs over China).Figure 9 shows the average ratio of the AMF by each group to the ensemble mean AMF (bars) and the correlation (crosses) for polluted conditions (NO 2 > 1 15 molec cm −2 , left panel) and unpolluted conditions (NO 2 < 1 15 molec cm −2 , right panel).Over polluted regions (for pixels with SZA < 60 • and effective cloud fraction < 0.2), the agreement between the six groups is within (minimum-maximum) 12-42 % in February and within 10-31 % in August.BIRA AMFs are 14 % higher than the ensemble mean, and WUR AMFs are 18 % lower, suggesting considerable structural uncertainty.
Over unpolluted regions the agreement is better: AMFs from the different groups agree within 8.5-18 % in both February and August, which implies a smaller structural uncertainty (Table S6 provides a detailed summary of the comparison).
In order to assess which forward model parameters explain most of the AMF structural uncertainty, we analysed AMF differences from groups that use identical cloud parameters and implicit aerosol correction (BIRA, University of Leicester, NASA and WUR).Between these four groups, the only different forward model parameters are surface albedo, a priori NO 2 profile and surface pressure.To investigate which of these parameters best explains the AMF variability, we correlated differences between a particular parameter ( A s , NO 2 and P s ) with the corresponding AMF differences ( AMF).For each particular parameter, we required the differences in the other parameters to be small (surface albedo within ±0.02, surface pressure within ±50 hPa and a priori NO 2 vertical columns within ±0.2×10 15 molec cm −2 ) so we could isolate the effect of one parameter only while keeping sufficient pixels for statistical significance.
We focus on explaining the differences between BIRA and WUR here, since these were of the order of 30 % (Fig. 9).We explored the correlations between BIRA-WUR AMF differences and differences between assumed surface pressures, albedos and NO 2 vertical columns and profile shapes; results are shown in Fig. S3 and Table S3.We find that surface pressure differences do not explain the large systematic AMF differences, and that surface albedo differences explain WUR and BIRA AMF differences, especially in winter when NO 2 is found close to the surface and AMFs are more sensitive to albedo variations than in summer.In our ensemble, the WUR-BIRA AMF differences are highly sensitive to the differences between the a priori NO 2 profiles used, especially in summer.NO 2 profiles are vertically more elevated in TM5 (used by BIRA) than in TM4 (used by WUR) (right panel of Fig. S3), as diagnosed by their 20 hPa lower effective NO 2 pressures (pressure levels weighted by NO 2 sub-column in that level).The confinement of the trace gas to lower atmospheric layers and the higher concentrations explains the systematically lower AMF values for WUR compared to BIRA.Selecting a specific chemistry transport model thus influences the AMF structural uncertainty via differences in the profile shape.These differences in the profile shape depend on the different characteristics of the models (e.g.spatial and temporal resolution and parameterisation of different processes in the atmosphere).Previous studies analysed how using different CTMs influences the NO 2 retrievals due to the change in the profile shapes used to calculate the AMF values.Heckel et al. (2011) compared retrievals using fine-and coarse-resolution models and concluded that using one AMF value for a large heterogeneous scene can lead to a 50 % bias in the retrieved NO 2 columns.Vinken et al. ( 2014) reported much smaller average differences of 10 % in retrieved NO 2 columns, mainly due to different emission inventories used in TM4 (3 • × 2 • ) and GEOS-Chem (0.5 • × 0.67 • ).According to Laughner et al. (2016), temporal resolution also influences a priori profile shapes; they found differences in the retrieved NO 2 column for individual days up to 40 % that were mostly explained by day-to-day wind direction variations that were not captured in the monthly averages.
All these aspects influence the estimation of retrieval (and AMF) theoretical uncertainties.In order to quantitatively estimate the effect of one model characteristic alone (e.g. the spatial resolution) on the AMF structural uncertainty it would be necessary to compare AMF calculated with the same approach but with just that specific characteristic being different in the profile shapes generated by the CTM.Such a specific sensitivity analysis has not been done in this study but should be considered in future AMF comparisons.To test the robustness of our structural uncertainty estimate, we did some experiments by simulating the effect of high-resolution a priori profiles on AMF values.Kuhlmann et al. (2015), McLinden et al. (2014) and Heckel et al. (2011) reported that AMFs calculated using coarse-resolution a priori profiles are overestimated over polluted areas by approximately 50 %.Over remote locations, there is little spatial variability in NO 2 distributions, and the a priori profile spatial resolution is less important in the AMF calculation.When including synthetic AMF emulating the use of high-resolution a priori profiles over polluted areas, the estimated AMF structural uncertainty is not strongly affected (increases by 3-6 %).This indicates that with the ensemble of retrievals used in our comparison the estimate of the structural uncertainty in the AMF calculation may be considered a robust estimate.
The findings in this subsection indicate that quality assurance efforts for Kleipool et al. (2008) retrievals should not focus just on column validation, but also target the validation of the a priori NO 2 profiles used in the AMF calculations.It is worth noting that using averaging kernels in satellite applications (e.g. when comparing retrieved NO 2 columns with modelled NO 2 distributions or observed NO 2 profiles) will reduce the representativeness errors in the comparisons associated with the a priori trace gas profile used in the retrieval scheme (e.g.Boersma et al., 2016).

Round robin: different cloud parameters
In the previous section, we found that differences between a priori NO 2 profiles and surface albedo values are the main cause of AMF structural uncertainty when cloud parameters are identical in AMF calculation approaches.Here we extend our round-robin experiment by including AMF calculations from Peking University (J.-T.Lin et al., 2015;Lin et al., 2014;J.-T. Lin et al., 2015) that were done with different cloud parameters (Table 3) to the O 2 -O 2 cloud parameters used by all other groups.The comparison of Peking Univer- sity and WUR AMFs thus allowed us to investigate the relative importance of differences in cloud parameters in driving AMF structural uncertainty.Our comparison of AMFs is confined to China, since Peking University calculations are only available over that region.
All groups calculate similar spatial patterns for the AMFs over China (Fig. 10).In the polluted north-east (Beijing area) AMFs are lower due to the reduced sensitivity to NO 2 in the lower troposphere.In the western part over the Tibet region, AMFs are higher due to the presence of ice and snow in February.Figure 11 shows the average ratio of each group's AMF to the ensemble mean AMF (bars) and the correlation (crosses) for polluted conditions (left panel) and unpolluted conditions (right panel).In polluted regions, AMFs generally agree within 37 % in February and within 20 % in August, and correlations are 0.7-0.9.Peking University AMFs are higher than the ensemble mean AMF, especially in August when they are 25 % higher.WUR and MPI-C AMFs are lower than the mean AMF, especially in August (20 % lower).In unpolluted regions the agreement is better: within 26 % in February and within 16 % in August, with a correlation of 0.8-0.95(see Table S7).To estimate the effect of differences in cloud parameters on AMF structural uncertainty, we analysed differences in AMF calculated by WUR and Peking University.The Peking University AMF calculations (and the cloud parameters) were based on a version of the POMINO retrieval using clouds retrieved with an implicit aerosol treatment (i.e.similar to KNMI/WUR).We explored the correlations between Peking University and WUR AMFs differences and differences in cloud pressure (P c ) and NO 2 vertical columns by requiring the differences in other forward model parameters to be relatively small.Results are shown in Fig. S4 and Table S4.AMF differences are partly explained by differences in the effective cloud pressures (Table S4): the O 2 -O 2 cloud pressures used by WUR are systematically lower (by 100 hPa) than those by Peking University, in line with Veefkind et al. (2016).This results in stronger screening of below-cloud NO 2 pollution, and consequently lower AMFs by WUR compared to Peking University AMFs.Peking University uses NO 2 profiles from GEOS-Chem.These profiles tend to peak at higher vertical levels than those from TM4 (Lin et al., 2014;Boersma et al., 2016), thus contributing to higher AMFs by Peking University compared to WUR AMFs.In summary, the more elevated NO 2 profiles in combination with less elevated clouds explain the substantially higher AMF by Peking University than WUR AMFs.

Round robin: explicit aerosol correction
The POMINO retrieval by Peking University explicitly corrects for the presence of aerosols in the atmosphere by including profiles of aerosol optical properties simulated by the GEOS-Chem model (and constrained by MODIS AOD on a monthly basis) in the radiative transfer model and in the cloud retrieval (Lin et al., 2014;J.-T. Lin et al., 2015).All the other groups except MPIC-C (see Table 3 and Sect.S1.3) assume that the aerosol effects are implicitly accounted for in the cloud retrievals (Boersma et al., 2011;Castellanos et al., 2015).Including an explicit aerosol correction influences AMF values indirectly through changes in cloud fraction and cloud pressure and directly in the radiative transfer simulations.We quantify the effect of the choice of aerosol correction in AMF structural uncertainty by comparing AMFs calculated by Peking University with (abbreviated AMF aer hereafter) and without (AMF) explicit aerosol correction.
In conditions with substantial aerosol pollution (AOD > 0.5), the selection of one aerosol correction approach over another can result in an AMF structural uncertainty of 45 % over China.The sign of the AMF differences depends mainly on the altitude of the aerosol layer relative to the NO 2 profile (see e.g.Leitao et al., 2010).We find that AMF aer are on average 55 % smaller in situations when aerosols are located above the NO 2 layer, mainly because cloud pressures are lower on average (more than 350 hPa), resulting in stronger screening of NO 2 (upper panel of Fig. S5; Table S5).When the aerosol vertical distribution is similar to that of NO 2 , AMF aer are on average 45 % higher, mostly because of much smaller cloud fractions, resulting in reduced screening of below-cloud NO 2 (lower panel in Fig. S5; Table S5).An additional factor is that when aerosols are mixed with NO 2 , they increase the optical light path and enhance AMF values.These results are in line with J.-T.Lin et al. (2015) where an evaluation of the influence of the aerosols in the NO 2 retrieval is analysed for 2012.

Stratospheric air mass factors
We pointed out in Sect.3.2 that differences in the description of the atmosphere's sphericity could lead to differences in stratospheric AMFs, especially for extreme geometries.
Here we investigate the differences between stratospheric NO 2 AMFs calculated with DAK and McArtim radiative transfer models.The McArtim model simulates the radiative transfer in an atmosphere that is spherical for incoming, single-scattered and multiple-scattered light.DAK's atmosphere is spherical for incoming sunlight, but plane-parallel for scattered sunlight.Based on these differences, we may expect the average photon paths at high altitudes in McArtim to be shorter than in DAK, as diffuse photon contributions (from near-horizontal directions) in McArtim are bound to finite spherical atmosphere (as illustrated in Fig. S2).Consequently, stratospheric AMFs in McArtim are smaller (Fig. 4a).Figure 12 shows that McArtim box-AMFs (at 25 hPa) are systematically lower than those from DAK by 1-2 % for moderate viewing geometries, with more significant differences (up to −5 to −10 %) when solar zenith and viewing angles are large.
A direct validation of stratospheric NO 2 AMFs is difficult, but comparing simulated stratospheric slant column densities against observed NO 2 SCDs constitutes a test of the radiative transfer models.Here we use OMI-observed (un-destriped) SCDs over the Pacific from the OMNO2A v1 product (van Geffen et al., 2015;Boersma et al., 2011) as a benchmark.The NO 2 columns over the Pacific Ocean are dominated by stratospheric NO 2 , so we expect simulated stratospheric SCD values to be similar or somewhat smaller than the observed, total SCDs.Simulated SCDs are the product of modelled VCDs (from data assimilation in TM4) and the stratospheric AMFs calculated with DAK and McArtim.Figure 13 (left panel) indicates (for high solar and viewing zenith angles) that stratospheric SCDs simulated with McArtim are close to, or slightly below the OMI SCDs.In contrast, the stratospheric SCDs simulated with DAK overtop the OMI SCDs, because of the higher stratospheric AMFs from that model.This inevitably leads to negative values for SCD-SCD strat , and consequently to reduced or even negative tropospheric NO 2 VCDs at high latitudes.Indeed, DOMINO v2 retrievals (using DAK stratospheric AMFs) are known to suffer from negative tropospheric VCDs at high latitudes especially in the summer hemisphere (Beirle et al., 2016) when solar zenith angles are largest.For small solar zenith angles in the tropics, the differences between DAK and McArtim stratospheric slant columns are smaller, but still appreciable at the edges of the swath (Fig. 13, right panel).
We tested whether possible errors in the diurnal cycle of stratospheric NO 2 could explain the overestimated slant columns for extreme viewing geometries.We did so by imposing stratospheric NO 2 vertical columns that are either constant with OMI row number (i.e. with local time) or increase (as N 2 O 5 photolysis, NO 2 concentrations build up) at a rate of approximately 0.15 × 10 15 molec cm −2 h −1 , i.e. by 1 × 10 15 molec cm −2 from the left to the right side of the orbit (Fig. S6a).These estimates correspond to the range of increase rates at high latitudes in summer reported in the literature (e.g.Vaughan et al., 2006;Celarier et al., 2008;Dirksen et al., 2011).Our tests show that, for these scenarios, simulated SCDs based on McArtim generally stay within the observational constraints of the OMI SCD patterns but that the simulated SCDs based on DAK are still exceeding the observed SCDs (Fig. S6b-c).McArtim provides a better physical description of photon transport in the stratosphere.The results above are not yet fully conclusive; a complete test would require the implementation of McArtim (instead of DAK) in the data assimilation scheme, or a dedicated validation of NO 2 columns with independent reference data in situations with extreme viewing geometries.Nevertheless, our results clearly hint at McArtim as the RTM providing the more realistic stratospheric AMFs, and we will test this assumption further in the remainder of the QA4ECV project.

Conclusions and recommendations
We have analysed the AMF calculation process for NO 2 and HCHO satellite retrievals from seven different retrieval groups in detail.By comparing approaches for every step of the AMF calculation process we have identified the main sources of structural uncertainty and we have traced back these uncertainties to their underlying causes.We have estimated the structural uncertainty in the NO 2 AMF calculation, which results from methodological choices and from preferences and assumptions made in the calculation process.Structural uncertainty is relevant beyond theoretical algorithm uncertainty, which typically only addresses the propagation of errors within the context of one particular retrieval algorithm.
The choice of RTM for TOA reflectance and box-AMF calculation introduces an average uncertainty of 2-3 %.The detailed comparison showed that state-of-the-art RTMs are in good agreement.Particularly for DAK, this is the first time that box-AMF calculations are extensively tested against those calculated with other RTMs.The McArtim model simulates systematically lower box-AMFs in the stratosphere, which we attribute to the model's geometrically more realistic description of photon scattering in a spherical atmosphere.The four European retrieval groups agree within 6 % in their calculation of NO 2 tropospheric AMFs when identical ancillary data (surface albedo, terrain height, cloud parameters and a priori trace gas profile) and cloud correction are used.This demonstrates that the selection of RTM and the interpolation operations lead to modest uncertainty, which is intrinsic to the calculation method chosen and therefore cannot be avoided.
Table 4. Average relative structural uncertainty for every step of the AMF calculation following the comparison process shown in Fig. 1.This includes the modelling of TOA reflectance (σ R ), calculation of box-AMF LUT (σ m ), tropospheric AMFs using harmonised settings (σ M ) and the overall structural uncertainty from AMF using preferred settings (σ M ).
NO 2 1.1 % 2.6 % 6 % 31-42 % HCHO 1.5 % 2.6 % When retrieval groups use their preference for ancillary data along with their preferred cloud and aerosol correction, we find that the structural uncertainty of the AMF calculation is 42 % over polluted regions and 31 % over unpolluted regions.Table 4 shows the escalation of the structural uncertainty with every step of the AMF calculation.The steep increase from 6 to 42 % strongly suggests that it is not the models or the calculation method but the assumptions and choices made to represent the state of the atmosphere that introduce most structural uncertainty in the AMF calculation.The structural uncertainty is of similar magnitude as the theoretical uncertainties found in algorithm error propagation studies which confirms that there is a substantial systematic component in trace gas satellite retrieval uncertainties.
Sensitivity studies for one particular algorithm indicate that the choice for cloud correction (IPA or clear-sky AMF for small cloud fractions) is a strong source of structural uncertainty especially for polluted conditions with residual cloud fractions of 0.05-0.2(on average an structural uncertainty of 20 %).The choice for aerosol correction (explicitly or implicitly via the cloud correction) introduces an av-erage uncertainty of 50 %, especially when aerosol loading is substantial.Selecting trace gas a priori profiles from different chemistry transport models, surface albedo from different data sets and cloud parameters from different cloud retrievals contributes substantially to structural uncertainty in the AMFs.These findings point to the need for detailed validation experiments designed to specifically test cloud and aerosol correction methods under relevant conditions (strong pollution, residual cloud fractions of 0.1-0.2).Not only should the retrieved NO 2 column itself be validated, but the a priori vertical NO 2 profile, the cloud and aerosol distributions and the surface albedo values should also be compared in detail to independent reference measurements.
The magnitude of the structural uncertainty in AMF calculations is significant, and is caused mainly by methodological differences and particular preferences for ancillary data between different retrieval groups.This study provides evidence of the need for improvement of the different ancillary data sets, including uncertainties of the forward model parameters used in the retrievals for a better agreement in the AMF calculation.This will significantly decrease AMF structural uncertainty towards the levels desired in user requirement studies (±10 %).As there is no "true" AMF value to be used as reference, it is difficult to decide which approach and which ancillary data are best.For this reason, future research should include a thorough validation against independent reference data, specifically in the situations where AMF structural uncertainty has the highest impact.
The Supplement related to this article is available online at doi:10.5194/amt-10-759-2017-supplement.

Figure 1 .
Figure 1.Flow chart of AMF calculation and comparison process followed in the study.In the third step forward model parameters (b: surface albedo, surface pressure, a priori profile, temperature, cloud fraction and cloud pressure) are selected for the harmonised settings comparison (upper part) and preferred settings comparison (lower part).In each step the main differences between the compared elements are highlighted.The compared parameters and their structural uncertainty (σ ) in each step are TOA reflectance (R, σ R ), box-AMFs (m, σ m ), and tropospheric AMFs (M, σ M ).

Figure 7 .
Figure 7. Mean relative differences between IPA and clear-sky NO 2 tropospheric AMFs for different cloud fraction intervals at different cloud pressures ranges (different colours) for a complete day of OMI measurements (2 February 2005).Left panel is for polluted conditions and right panel for unpolluted conditions (pixels with model NO 2 vertical column higher or lower than 1 × 10 15 molec cm −2 respectively).The stars with the black dashed lines show the average difference for all the cloud pressures.Pixels with surface albedo less than 0.3 and SZA < 70 • are considered.

Figure 8 .
Figure 8. Tropospheric NO 2 AMFs calculated by each of the groups for a complete day of OMI measurements (2 February 2005).Lower panels show an example of cloud fraction and surface albedo used by KNMI/WUR (showed as example; see Table3) to calculate the AMFs.Groups apply different filters to the measurements which explains the different gaps (grey).

Figure 9 .
Figure 9. Ratio of tropospheric NO 2 AMFs by each group to the ensemble mean (left axis, bars) and the correlation coefficient (right axis, cross) for two complete days of OMI measurements on 2 February 2005 (blue) and 16 August 2005 (green) over the globe for polluted (left panel) and unpolluted (right panel) pixels.The error bars correspond to the standard deviation.Only pixels for SZA < 60 • and cloud fraction < 0.2 are considered in the analysis.

Figure 11 .
Figure 11.Ratio of tropospheric NO 2 AMFs by each group to the ensemble mean (left axis, bars) and the correlation coefficient (right axis, cross) for two complete days of OMI measurements on 2 February 2005 (blue) and 16 August 2005 (green) for polluted (left panel) and unpolluted (right panel) pixels over China.The error bars correspond to the standard deviation.Only pixels for SZA < 60 • and cloud fraction < 0.2 are considered in the analysis.

Figure 12 .
Figure 12.Box-AMFs at 25 hPa as a function of cosine of SZA (left panel) and as a function of cosine of VZA (right panel).In the left panel, VZA is constant at 37 • (µ = 0.8), and in the right panel, SZA is constant at 37 • (µ 0 = 0.8).

Figure 13 .
Figure 13.Averaged OMI total NO 2 SCD (black line) as a function of viewing zenith angle for solar zenith angles between 70 and 80 • (left panel) and 20-30 • (right panel) (OMI orbit 02940 on 2 February 2005).The blue line indicates the estimated stratospheric SCDs based on DOMINO v2 stratospheric VCDs and DAK stratospheric AMFs, and the purple line represents the stratospheric SCDs based on DOMINO v2 stratospheric VCDs and McArtim stratospheric AMFs.The only difference between the DAK and McArtim-based stratospheric slant columns is the use of the radiative transfer model; all other relevant parameters (TM4 assimilated stratospheric column, cloud parameters, albedo, NO 2 profile shape) are identical.

Table 1 .
Overview of radiative transfer models that participated in the top-of-atmosphere reflectance comparison and their main characteristics.

Table 3 .
Overview of AMF calculation methods and ancillary data used in the round-robin experiment by various research groups.