Ground-based assessment of the bias and long-term stability of fourteen limb and occultation ozone profile data records.

The ozone profile records of a large number of limb and occultation satellite instruments are widely used to address several key questions in ozone research. Further progress in some domains depends on a more detailed understanding of these data sets, especially of their long-term stability and their mutual consistency. To this end, we made a systematic assessment of fourteen limb and occultation sounders that, together, provide more than three decades of global ozone profile measurements. In particular, we considered the latest operational Level-2 records by SAGE II, SAGE III, HALOE, UARS MLS, Aura MLS, POAM II, POAM III, OSIRIS, SMR, GOMOS, MIPAS, SCIAMACHY, ACE-FTS and MAESTRO. Central to our work is a consistent and robust analysis of the comparisons against the ground-based ozonesonde and stratospheric ozone lidar networks. It allowed us to investigate, from the troposphere up to the stratopause, the following main aspects of satellite data quality: long-term stability, overall bias, and short-term variability, together with their dependence on geophysical parameters and profile representation. In addition, it permitted us to quantify the overall consistency between the ozone profilers. Generally, we found that between 20 and 40 km the satellite ozone measurement biases are smaller than ±5 %, the short-term variabilities are less than 5-12 % and the drifts are at most ±5 % decade⁻¹ (or even ±3 % decade⁻¹ for a few records). The agreement with ground-based data degrades somewhat towards the stratopause and especially towards the tropopause where natural variability and low ozone abundances impede a more precise analysis. In part of the stratosphere a few records deviate from the preceding general conclusions; we identified biases of 10 % and more (POAM II and SCIAMACHY), markedly higher single-profile variability (SMR and SCIAMACHY), and significant long-term drifts (SCIAMACHY, OSIRIS, HALOE, and possibly GOMOS and SMR as well).
Furthermore, we reflected on the repercussions of our findings for the construction, analysis and interpretation of merged data records. Most notably, the discrepancies between several recent ozone profile trend assessments can be mostly explained by instrumental drift. This clearly demonstrates the need for systematic comprehensive multi-instrument comparison analyses.


Introduction
Long-term global observations of the distribution and evolution of ozone are vital to improve our current understanding of atmospheric processes, and thereby to allow more robust projections of the recovery of the ozone layer and climate change. Measurements of the vertical profile of ozone have been carried out over the last few decades by a large number of instruments, operating in situ or from remote vantage points, on the ground and in space (for an overview, see Hassler et al., 2014). These indisputably show globally declining ozone levels during the 1980s and a large part of the 1990s in the lower and upper stratosphere (∼ 5-7 % decade⁻¹), and to a lesser extent also in the middle stratosphere (1-2 % decade⁻¹) (WMO, 2014; Harris et al., 2015). Furthermore, the observed loss rates are in excellent agreement with expectations for the chemical destruction of ozone by man-made halocarbons (WMO, 2014). The abundances of these substances have decreased significantly over the past 15-20 years (WMO, 2011), as a result of the Montreal Protocol and its subsequent adjustments and amendments. It is therefore generally expected that the ozone layer is currently recovering from the effects of ozone depleting substances, albeit in an atmosphere with concomitant increases in greenhouse gas concentrations and changes in residual circulation (Waugh et al., 2009; Oman et al., 2010). While observations provide substantial evidence for the levelling off of the downward trend around 1997 at most latitudes and altitudes, i.e. the first phase of recovery, it is less clear whether they support an upward trend in recent years. Whether the onset of the second stage has been detected (or not) is one of the key questions in current ozone research, a debate that is hampered by two factors. The first is the small magnitude of the increases in ozone (a few percent) when compared to its natural variability. This can only be remedied by longer time series. The second is the lack of appropriate knowledge of the uncertainties in the observational records. Shedding more light on the latter issue is the main objective of this paper.
Atmos. Meas. Tech., 9, 2497-2534, 2016 www.atmos-meas-tech.net/9/2497/2016/
Limb and occultation sounders are of prime interest for ozone profile trend assessments, as they provide near-global coverage at reasonably high vertical resolution. However, satellite instruments are rarely operational for much more than a decade, so their records are generally combined for long-term studies. The uncertainties (overall bias, short-term variability and long-term stability) in the resulting combined data set are an intricate combination of the uncertainties inherited from the contributing data sets and those introduced by the merging algorithm. Tummon et al. (2015) recently noted that the former source of error tends to dominate over the latter, thereby demonstrating the need for a detailed characterization of each individual record and especially of their mutual consistency. Numerous validation studies have been published in recent years (for an overview, see Hubert et al., 2016), but some important gaps remain. First of all, there are no comprehensive multi-instrument assessments of most limb/occultation sounders using ground-based data as a reference. Also, satellite intercomparison studies rarely cover more than a handful of records (exceptions are, e.g. Dupuy et al., 2009; Jones et al., 2009; Laeng et al., 2014; Rahpoe et al., 2015). Tegtmeier et al. (2013) conducted perhaps the most complete assessment so far, of the ozone climatologies from 18 sounders. Like most works, it was dedicated to the quantification of bias patterns and shorter-term variability, but not to a detailed assessment of the stability on decadal time scales. However, precise estimates of instrumental drift are crucial for a sound determination of the significance of trend results. Just a few (in some cases indirect) drift estimates are available from ground-based comparisons (e.g. Terao and Logan, 2007; Nair et al., 2012) or from satellite intercomparisons (e.g. Jones et al., 2009; Mieruch et al., 2012; Adams et al., 2014; Eckert et al., 2014; Rahpoe et al., 2015). Moreover, no works comprise all the records considered in the recent trend assessments, by, e.g. the World Meteorological Organisation (WMO) (WMO, 2014) or within the SPARC/IO3C/IGACO-O3/NDACC (SI2N) initiative (for an overview, see Harris et al., 2015). Finally, the quality of auxiliary pressure and temperature profiles plays a role too, as it unavoidably affects the quality of ozone data when used to convert the ozone profiles to another vertical coordinate (altitude ↔ pressure) or ozone quantity (number density ↔ volume mixing ratio), a common step in the merging process. At the moment, very little information on this latter aspect of data quality is available.
Our objective is to shed more light on these three missing pieces of information. We therefore perform an exhaustive assessment, from the ground up to the stratopause, of the latest releases of the operational Level-2 ozone profile data sets collected by 14 limb/occultation instruments over the period 1984-2013: SAGE II (v7), SAGE III (v4), HALOE (v19), UARS MLS (v5), Aura MLS (v3.3), POAM II (v6), POAM III (v4), OSIRIS (v5.07), SMR (v2.1), GOMOS (IPF 6), MIPAS (ML2PP 6), SCIAMACHY (SGP 5), ACE-FTS, and MAESTRO (v1.2). Each satellite data set is compared to the observations by the ground-based ozonesonde and stratospheric ozone lidar networks, which act as a pseudo-global, independent and well-characterised transfer standard. The robust analysis of co-located satellite-ground profile pairs allows us to quantify overall bias, short-term variability and long-term stability of the satellite records, and their dependence on altitude, latitude and season. Methodology and results for the native profile representation of each record are described in Sects. 3-5. In Sect. 6 we investigate whether the accompanying ancillary meteorological data impact ozone data quality when the original profiles are converted to another vertical coordinate or ozone quantity.
The adoption of a consistent analysis framework permits us to bring all single-instrument results together, and examine the mutual consistency between instruments of each quality indicator (Sect. 7). We report the tendencies and several peculiarities, most notably a few instruments that drift significantly at some altitudes. Finally, we frame our findings within the broader context (Sect. 8), by commenting on current challenges related to verifying user requirements, and by highlighting the implications of our results for the design of merging schemes. Perhaps the most tangible outcome of our study is the successful interpretation of discrepancies in recent trend studies in terms of instrumental drift. It demonstrates that our work can contribute to a better exploitation of the limb and occultation ozone profile data sets. This should, in the end, be beneficial not only for trend assessments and the related merging activities, but also for other applications, such as trend attribution studies or model evaluations.

Ozone profile data records
Our assessment covers the period between October 1984 and May 2013 and considers 14 satellite missions and two types of ground-based instruments. We first present the ozone profile data records that play a central role in our analyses: those gathered by ozonesonde and stratospheric lidar instruments. Then, we introduce the limb and occultation sounders that are the subject of this work. We limit ourselves to brief descriptions since all space- and ground-based ozone profile measurement techniques were reviewed exhaustively by Hassler et al. (2014). The technical details most relevant to our assessment are summarised in Tables 1-3.

Ozonesondes
Balloon-borne ozonesondes are launched around the world, at many sites at least once a week. These electrochemical instruments record ozone partial pressure in situ at high vertical resolution (100-150 m) from the surface to the middle stratosphere (∼ 30-35 km). An interfaced radiosonde provides the pressure (p), temperature (T) and GPS data necessary to geolocate each measurement, and to convert ozone partial pressure to other quantities. The data quality depends on various factors such as sonde type and manufacturer, the preflight characterisation and post-flight processing (see, e.g. Tarasick et al., 2016; Van Malderen et al., 2016). However, when standard operating procedures are followed, the three most commonly used sonde types¹ produce consistent results between the tropopause and ∼ 28 km, with biases smaller than ±5 % and precisions better than ∼ 3 % (Smit and ASOPOS panel, 2014). At higher and lower altitudes the data quality degrades somewhat, and the differences between the sonde types become more clear. Overall, ECC-type sondes perform best with a bias of ±5-7 % and a precision of 3-5 % in the troposphere.
¹ Nowadays more than 80 % of the stations launch an electrochemical concentration cell (ECC) sonde (Komhyr, 1969). The Brewer-Mast sonde has mostly been used by the early sounding stations with long data records (Brewer and Milford, 1960), while the Japanese stations fly a carbon iodine cell sonde (Kobayashi and Toyama, 1966).
Table 1. Overview of the 72 ozonesonde stations considered in this work, their location and the archive the data were taken from. Time range and profile statistics reflect the total, screened sample straddling the analysis period (10/1984-5/2013), not the co-located sample (which differs per satellite instrument). All listed stations were used in the analyses of bias and comparison spread; those indicated in the last column were also used for the drift analysis.
We use the ozonesonde data acquired by the Network for the Detection of Atmospheric Composition Change (NDACC, http://www.ndacc.org), WMO's Global Atmosphere Watch (GAW, data distributed by the World Ozone and Ultraviolet Data Centre, http://www.woudc.org) and the Southern Hemisphere Additional Ozonesondes network (SHADOZ, http://croc.gsfc.nasa.gov/shadoz; Thompson et al., 2012). The stations considered in this work are listed in Table 1, together with the total number of screened profiles over the analysis period. The screening procedure is outlined in Sect. 3.

Stratospheric ozone lidars
Differential absorption lidars are laser-based active remote sensing systems that operate mostly during clear-sky nights. Profiles of ozone number density vs. geometric altitude are retrieved between the tropopause and 45-50 km from backscattered signals at two wavelengths (Mégie et al., 1977). While instrument and retrieval set-up differs from one site to another, the NDACC ozone lidar network can be considered as homogeneous within 2 % between 20 and 35 km. In this altitude range both bias and precision are estimated at ∼ 2 % and worsen to 5-10 % at other altitudes due to, e.g. lower signal-to-noise ratios or the saturation of the detectors (Keckhut et al., 2004). The vertical resolution degrades from 0.3 km around the tropopause to 3-5 km in the upper stratosphere (Godin et al., 1999). The retrieval algorithms used at the different sites were extensively intercompared and the profile measurements validated against a mobile lidar reference, ozonesondes and microwave radiometers (McGee et al., 1991; Keckhut et al., 2004). Furthermore, comparisons to space-based observations over the range 20-40 km showed biases less than ±5 % and a decadal stability better than ±5 % decade⁻¹ (Nair et al., 2012). We use data from 13 stratospheric ozone lidars in the NDACC network. Geographical location, measurement period and number of screened profiles over the analysis period are listed in Table 2. The screening procedure is outlined in Sect. 3. Whenever lidar data are converted to non-native profile representations, we do so in this work using the p/T information extracted at the time and location of the lidar measurement from the ERA-Interim reanalysis fields (Dee et al., 2011) produced by the European Centre for Medium-Range Weather Forecasts (ECMWF).

Satellite observations
Over the past few decades numerous instruments were deployed in space to monitor atmospheric ozone. Detailed intercomparison studies of monthly zonal mean ozone profile data (i.e. Level-3) were published for nadir-viewing and limb/occultation-viewing instruments. Here, we focus on a ground-based validation of the Level-2 ozone profile records from 14 limb/occultation sounders that had (have) prime sensitivity in the stratosphere and were (are) operational for more than 3 years, see Table 3. Most instruments were launched only once: HALOE (Halogen Occultation Experiment), OSIRIS (Optical Spectrograph and InfraRed Imaging System), SMR (Sub-Millimetre Radiometer), GOMOS (Global Ozone Monitoring by Occultation of Stars), MIPAS (Michelson Interferometer for Passive Atmospheric Sounding), SCIAMACHY (SCanning Imaging Absorption spectroMeter for Atmospheric CHartographY), ACE-FTS (Atmospheric Chemistry Experiment Fourier Transform Spectrometer) and MAESTRO (Measurements of Aerosol Extinction in the Stratosphere and Troposphere Retrieved by Occultation). Some were deployed more than once, with improved design: SAGE (Stratospheric Aerosol and Gas Experiment, II and III), MLS (Microwave Limb Sounder, on the UARS and EOS-Aura platforms) and POAM (Polar Ozone and Aerosol Measurement, II and III). Five instruments (OSIRIS, SMR, ACE-FTS, MAESTRO and Aura MLS) remain operational until the present; nine ceased operations before the end of the analysis period (May 2013).
Table 3. Overview of satellite ozone profile data records. For more details on the instrument and the retrieval technique we refer to the review by Hassler et al. (2014). Some instrument teams recommend discarding a considerable part of their ozone record for long-term studies. The asterisk in the analysis period columns denotes whether the early or late part of the mission is cropped (see text).
For each instrument we consider the latest data release of the operational Level-2 product (Table 3), which typically comprises not one but several data sets. Our focus is on the observations that are best suited for long-term studies of stratospheric ozone. We therefore choose the 205 GHz profiles for UARS MLS rather than the 183 GHz retrievals (Livesey et al., 2003). The standard Aura MLS product, considered here, is based on observations by the 240 GHz radiometer. We take the 501.8 GHz retrievals for SMR since these are less biased (although more noisy) than the 544.6 GHz data (Urban et al., 2005). MAESTRO retrievals in the visible range perform better in the upper stratosphere (US) than the ultraviolet product and are therefore used here. For SAGE III, we consider the profiles retrieved with the multiple linear regression technique rather than the SAGE II type method used for v6.2 (Wang et al., 2006). We further select MIPAS data from the nominal measurement mode (70 % of total number of observations) which is most suitable for long-term stratospheric studies (Raspollini et al., 2013). The ACE-FTS team provides ozone data sets on both a variable and a fixed altitude grid; we pick the latter product. A number of alternative data sets for these instruments were not included in this assessment. For instance, the retrievals by scientific prototype Level-2 processors (MIPAS, SCIAMACHY) were not considered here.
Their bias structure is often comparable to that of the operational ozone data set, especially when contrasted to that of other instruments (e.g. Rozanov et al., 2007; Laeng et al., 2015), due to the use of the same calibrated Level-1 radiance data and a common sensitivity to retrieval parameters (e.g. spectroscopic data). Profile data from alternative viewing geometries (e.g. lunar occultations for SAGE III and SCIAMACHY, solar occultation data from SCIAMACHY or bright limb measurements by GOMOS) were not investigated either, and their quality may well be different from the findings presented in the following.
Table 3 summarises host platform, observation geometry and time, spectral region and spatial coverage. Vertical resolution and sampling in space and time are mainly determined by the observation geometry, the orbit and the spectral range. Solar occultation observations yield 30 profiles per day at ∼ 1 km vertical resolution. Limb instruments, on the other hand, easily provide 1000 profiles per day but with a poorer vertical resolution of ∼ 3 km and a larger uncertainty in the altitude registration as well. The latter changes in some cases with time, e.g. the UARS MLS team noticed an upward drift of the geopotential height (GPH) of the 100 hPa reference level by 600 m between 1991 and 1997 (Livesey et al., 2003, Fig. 1). A downward drift of 100 m in GPH was also found in Aura MLS v3.3 data from 2005 to 2009, but it stabilised thereafter. Fortunately, the pressure information retrieved by limb emission instruments (including UARS and Aura MLS) is typically more reliable and therefore used as native vertical scale instead of altitude.
We screen the satellite profiles according to the prescriptions of the data provider (Table 3). In some cases this implies the removal of a considerable part of the data record, e.g. periods during which the product stability is not guaranteed. In particular, we remove the UARS MLS data after the 15 June 1997 switch-off of the 63 GHz radiometer (Livesey et al., 2003). We also reject MIPAS observations before January 2005 since these are potentially biased relative to those from the second phase of the mission due to a different set of retrieval microwindows (Ceccherini et al., 2013). Finally, from September 2010 onwards the ACE-FTS and MAESTRO retrievals are affected by problems with auxiliary input data and therefore rejected from the analysis. These issues were fixed in the v2.5/v3.5 data release of ACE-FTS, which extends the mission's record to the present. Data providers generally recommend a vertical range for their ozone product in addition to the standard screening prescriptions, see Table 3. Here, we keep all grid levels in order to verify at what point the data quality starts to degrade. Each record is provided in its native ozone profile representation (Fig. 1) defined by the vertical coordinate (altitude or pressure), the vertical grid levels and the quantity in which ozone is expressed (volume mixing ratio, VMR, or number density). The vertical grid of some records varies with the changing tangent heights of the measurements. Throughout this work the difference between geometric altitude and geopotential height is neglected. Satellite data providers typically include the pressure and/or temperature data required to convert the native ozone profiles to another representation. These auxiliary data are sometimes retrieved by the same processor but in general taken from an external source (see Table 3). This assessment focuses primarily on data quality in the satellite's native profile representation. But given its importance in, e.g. the data merging context, we complement the analysis with tests of the impact of auxiliary data on the profile quality in other representations. We will see in Sect. 6 that this should indeed not be ignored.

Analysis approach and data preprocessing
A careful design of the analysis allows us not only to obtain robust estimates of the data quality of the individual satellite records but also, and this is one of our primary objectives, to assess their mutual consistency. Prerequisite to achieving these goals is a good understanding of the metrological aspects of the comparison analysis. Our analysis approach is therefore based on three principles that reduce confounding methodological biases. First of all, we use a single analysis and software framework. Second, all satellite records are compared to the same reference data, from ground-based observations. And finally, the manipulation of satellite data is kept to a strict minimum. In this section we describe the general aspects of the analysis. A detailed account of how decadal stability, bias and short-term variability are estimated follows in Sects. 4 and 5.
The ozonesonde and lidar networks provide vertical ozone profiles of well-documented quality and serve as suitable transfer standards on a pseudo-global scale and from the troposphere to the stratopause. We compare the satellite profiles to co-located ground-based measurements in relative units:
$$\Delta x_{ij}(l) = 100 \times \frac{x_{ij,\mathrm{sat}}(l) - x_{ij,\mathrm{gnd}}(l)}{x_{ij,\mathrm{gnd}}(l)} \qquad (1)$$
Here, $x_{ij,\mathrm{sat}}(l)$ and $x_{ij,\mathrm{gnd}}(l)$ represent respectively satellite and (vertically smoothed and representation-transformed) ground-based ozone at grid level $l$ of co-location pair $i$ for correlative instrument $j$. If the satellite bias is of multiplicative nature then any time dependence in ozone levels (e.g. seasonal, interannual, solar cycle) is divided out in the relative differences. Another advantage is that it allows for a direct comparison between the results in different ozone quantities. A disadvantage, however, is that relative differences are sensitive to low ozone values, leading to larger values in and below the UTLS (upper troposphere lower stratosphere) and in the upper stratosphere.
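The relative-difference metric can be illustrated with a minimal Python sketch. It assumes the percentage form 100 × (satellite − ground) / ground; the function name and the sample values are hypothetical and are not taken from the analysis software used in this work.

```python
def relative_difference(x_sat, x_gnd):
    """Percent difference of satellite relative to ground-based ozone, per level.

    Levels where either value is missing, or where the reference is not
    positive, yield None (the relative difference is undefined there).
    """
    out = []
    for s, g in zip(x_sat, x_gnd):
        if s is None or g is None or g <= 0:
            out.append(None)
        else:
            out.append(100.0 * (s - g) / g)
    return out


# Hypothetical ozone values on three grid levels (same units for both).
sat = [4.2, 5.1, 0.30]
gnd = [4.0, 5.0, 0.25]
print(relative_difference(sat, gnd))
```

In this toy example the third level has the smallest absolute difference but the largest relative difference, which mirrors the sensitivity to low ozone values noted in the text.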
$\Delta x$ is determined by several factors besides pure measurement and retrieval uncertainties ($S_{x_\mathrm{sat}}$, $S_{x_\mathrm{gnd}}$) because satellite and ground-based instruments have different perceptions of a variable atmosphere. Vertical and horizontal resolutions differ and the probed air masses rarely coincide perfectly in space and time. In addition, the comparison can only be done when both profiles are expressed in the same representation. As a result, the total comparison error budget contains terms related to the differences in smoothing, the spatiotemporal mismatch of the co-locations and the auxiliary data used to transform between profile representations. When correlations between the terms are disregarded, the total uncertainty covariance matrix (including systematic and random components) becomes $S_{\Delta x} = S_{x_\mathrm{sat}} + S_{x_\mathrm{gnd}} + S_\mathrm{smoothing} + S_\mathrm{mismatch} + S_\mathrm{auxiliary}$ (von Clarmann, 2006). Furthermore, when $\Delta x$ data are averaged or regressed the co-located profile sample may not be sufficiently representative of the actual state of the studied parameter (ozone differences). Toohey et al. (2013) recently showed the importance of $S_\mathrm{sampling}$ for trace gas climatologies and Damadeo et al. (2014) for time series analyses. Estimating sampling uncertainty for validation purposes is an analysis in its own right and outside the scope of this paper.
The next few paragraphs describe the data preprocessing scheme in which the mitigation of the uncertainties due to differences in smoothing, geolocation and auxiliary data plays a central role. Preprocessing starts off by removing the unreliable measurements following the guidelines of the data providers. Table 3 lists the recommended screening procedure references for the satellite records. Ground-based data are filtered using general criteria, removing measurements with larger uncertainties: altitudes above the 5 hPa level (∼ 33 km) for ozonesondes and outside the 15-47 km range for lidars. In addition, we reject measurement levels with clearly unphysical readings (O₃ < 0, p < 0 hPa, T < 0 K or T > 400 K) or during unrealistic jumps in pressure (dp/dt > 0 and dz > 0.1 km). Entire profiles are discarded from further analysis when (a) more than half of the levels are tagged bad, or (b) fewer than 30 levels are tagged good.
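The ozonesonde screening rules above can be sketched as follows. This is a simplified illustration, not the screening code used in this work: the function names and the tuple layout are hypothetical, the pressure-jump check is omitted, and levels above the 5 hPa level are assumed to count as bad levels in the profile-level test.

```python
def level_is_good(o3, p_hpa, t_k, min_p_hpa=5.0):
    """Return True when a single ozonesonde level passes the basic checks."""
    # Clearly unphysical readings.
    if o3 < 0 or p_hpa < 0 or t_k < 0 or t_k > 400:
        return False
    # Levels above the 5 hPa level (i.e. at lower pressure) are removed.
    return p_hpa >= min_p_hpa


def screen_profile(levels, min_good=30):
    """Return the good levels of a profile, or None if the profile is discarded.

    `levels` is a list of (o3, p_hpa, t_k) tuples.
    """
    flags = [level_is_good(o3, p, t) for (o3, p, t) in levels]
    n_good = sum(flags)
    n_bad = len(flags) - n_good
    # Discard when (a) more than half of the levels are bad,
    # or (b) fewer than `min_good` levels are good.
    if n_bad > len(flags) / 2 or n_good < min_good:
        return None
    return [lev for lev, ok in zip(levels, flags) if ok]


# Hypothetical 40-level profile: (ozone partial pressure, pressure in hPa, T in K).
profile = [(5.0, 1000.0 - 20.0 * i, 250.0) for i in range(40)]
kept = screen_profile(profile)
print(None if kept is None else len(kept))
```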
The choice of a co-location window is a trade-off between mismatch uncertainties and a sufficiently large sample size to obtain robust statistical estimates. We found that a maximum horizontal distance $\Delta r$ of 500 km between the profiles is optimal, given the typical horizontal resolution of the order of a few hundred km of the satellite and ground-based measurements. The maximal temporal separation $\Delta t$ is 6 h for MIPAS and Aura MLS, and 12 h for the other instruments. When multiple satellite profiles are present in the co-location window around a ground-based profile, only the pair closest in space and time is retained, defined by $\Delta r^2 + V_\mathrm{wind}^2\,\Delta t^2$ with $V_\mathrm{wind} = 100$ km h⁻¹ as a rough estimate of horizontal wind speed in the stratosphere. Multiple co-locations occur mostly between polar orbiting instruments and high latitude stations. Figure S1 in the Supplement shows the latitude-time cross-section of the co-location samples. Mismatch uncertainties $S_\mathrm{mismatch}$ increase when and where atmospheric inhomogeneities are larger. Diurnal variations in ozone contribute a systematic component since the local time of ground-based observations (ozonesonde mostly around noon, lidar during night) and satellite measurements (Table 3) is generally constant. Biases due to the diurnal cycle are negligible below 30 km, but not at higher altitudes where ozone reaches minimal levels after dawn and maximal values in the afternoon (Schanz et al., 2014; Parrish et al., 2014; Sakazaki et al., 2015). The largest effect on our bias estimates is expected in the middle (< 2-3 %) and upper stratosphere (< 4 %) for the comparisons of lidar to sunset occultation profiles and, to a lesser extent, to the evening observations by SMR and OSIRIS. The random component of mismatch uncertainty is typically 5 % but can reach 20 % at, e.g. Antarctic stations dropping in and out of the polar vortex (Cortesi et al., 2007; De Clercq, 2009).
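The selection of the closest pair can be sketched in a few lines of Python. The sketch assumes that the distance is a great-circle distance on a spherical Earth and that the pair minimising the squared distance plus the squared wind-scaled time separation is retained; the function names and the (latitude, longitude, time in hours) tuple layout are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean Earth radius (spherical approximation)
V_WIND_KM_PER_H = 100.0    # rough stratospheric wind speed quoted in the text


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def closest_colocation(ground, satellites, max_dr_km=500.0, max_dt_h=12.0):
    """Return the satellite profile closest in space and time, or None.

    `ground` and each entry of `satellites` are (lat_deg, lon_deg, time_h).
    """
    best, best_metric = None, None
    for sat in satellites:
        dr = great_circle_km(ground[0], ground[1], sat[0], sat[1])
        dt = abs(sat[2] - ground[2])
        if dr > max_dr_km or dt > max_dt_h:
            continue  # outside the co-location window
        metric = dr ** 2 + (V_WIND_KM_PER_H * dt) ** 2
        if best_metric is None or metric < best_metric:
            best, best_metric = sat, metric
    return best


ground = (50.0, 0.0, 12.0)
candidates = [(50.0, 1.0, 20.0), (52.0, 0.0, 13.0), (60.0, 0.0, 12.0)]
print(closest_colocation(ground, candidates))
```

With this weighting, one hour of time separation counts as much as 100 km of horizontal distance, so a nearby profile measured 8 h later can lose out to a more distant profile measured 1 h later. Passing `max_dt_h=6.0` would mimic the tighter window quoted for MIPAS and Aura MLS.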
There is no well-established method in the community to remove the horizontal component of the smoothing error. Instead we refer to the model-based estimates for the specific case of MIPAS comparisons (Cortesi et al., 2007; De Clercq, 2009), which indicated that the horizontal smoothing uncertainty mainly has a random nature and is of similar magnitude as the mismatch uncertainty. The vertical component on the other hand can be mostly removed by smoothing the ground-based profiles. We use a triangular response function with a base width that follows the altitude-dependent satellite resolution (Table 3). The exception is the MIPAS analysis, for which we smoothed with the vertical averaging kernel (AK) and a priori of the co-located MIPAS profile. Such an AK smoothing was initially also tried for the SCIAMACHY analysis. Unfortunately it introduced peculiar and unexpected vertical oscillations in the comparisons, so we resorted to the triangular method for SCIAMACHY. The comparison results, especially observed spreads, differ slightly when another shape of the smoothing function is chosen (we tried rectangular and Gaussian windows), but most of the vertical smoothing error is removed. We estimate that the residual vertical smoothing uncertainty is less than a few percent.
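A triangular vertical smoothing of a finely resolved ground-based profile might look like the sketch below. It assumes a regularly spaced fine grid (so that no layer-thickness weighting is needed) and a normalised triangular weight that falls to zero at half the base width from the target level; the function name and the test profile are hypothetical.

```python
def triangular_smooth(z_fine, x_fine, z_target, base_width_km):
    """Smooth a high-resolution profile to one target level.

    Weights decrease linearly from 1 at `z_target` to 0 at a distance of
    half the base width, and are normalised to unit sum.
    Returns None when no fine-grid level falls inside the window.
    """
    half = base_width_km / 2.0
    num = 0.0
    den = 0.0
    for z, x in zip(z_fine, x_fine):
        w = max(0.0, 1.0 - abs(z - z_target) / half)
        num += w * x
        den += w
    return num / den if den > 0 else None


# Hypothetical 100 m grid with a narrow ozone layer centred at 20 km.
z = [0.1 * i for i in range(401)]
x = [1.0 if abs(zi - 20.0) < 0.25 else 0.0 for zi in z]
print(triangular_smooth(z, x, 20.0, base_width_km=3.0))
```

In practice the base width would be taken per level from the altitude-dependent satellite resolution, and the function would be evaluated at every level of the comparison grid.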
In a final preprocessing step the data are transformed to the same profile representation, defined by the ozone quantity (number density or VMR), the vertical coordinate (altitude or pressure) and the levels of the vertical grid. Differences between geometric and geopotential height are neglected. We focus on the satellite instrument's native representation, see Fig. 1, mainly because it is closest to the retrieved information, but also because users will use it as a starting point to convert to another representation if their application requires that. They can use the auxiliary pressure and/or temperature profiles provided along with the ozone profiles in many satellite records for this purpose. As we have seen in Sect. 2.2, these auxiliary data originate from different sources which may lead to a representation-dependence of the mutual consistency of the satellite data quality. This is discussed further in Sect. 6. Until then, only the correlative data are converted when needed. Ozonesonde data are transformed with the help of p and T measurements from the attached radiosonde, and lidar data using ERA-Interim fields. The quality of these ancillary data has been investigated by various authors (e.g. Sun et al., 2013; Stauffer et al., 2014; Simmons et al., 2014; Inai et al., 2015). The regridding to the satellite's vertical grid is based on a pseudo-inverse interpolation method (Calisesi et al., 2005). Since the ground-based grid is more finely resolved than the satellite grid, the associated regridding uncertainties are generally negligible. We note that the SMR, GOMOS, MIPAS and MAESTRO profiles are inevitably regridded as well because the grid is variable. In these cases the levels of the comparison grid are selected to reflect the average spacing between two lines of sight.
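To make the role of the auxiliary p and T data concrete, the sketch below converts an ozonesonde partial pressure to volume mixing ratio and to number density using the ideal gas law. The function names, units and sample values are illustrative assumptions and do not reproduce the conversion code of this work.

```python
BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant (exact SI value)


def partial_pressure_to_vmr_ppmv(p_o3_mpa, p_hpa):
    """Ozone volume mixing ratio (ppmv) from partial pressure (mPa) and air pressure (hPa)."""
    return (p_o3_mpa * 1e-3) / (p_hpa * 100.0) * 1e6


def partial_pressure_to_number_density_cm3(p_o3_mpa, t_k):
    """Ozone number density (molecules per cm^3) from partial pressure (mPa) and temperature (K)."""
    n_per_m3 = (p_o3_mpa * 1e-3) / (BOLTZMANN_J_PER_K * t_k)
    return n_per_m3 * 1e-6


# Hypothetical lower-stratospheric level: 15 mPa ozone at 50 hPa and 215 K.
print(partial_pressure_to_vmr_ppmv(15.0, 50.0))
print(partial_pressure_to_number_density_cm3(15.0, 215.0))
```

Because pressure and temperature enter these conversions as simple divisors, a relative error in the auxiliary p or T maps almost one-to-one onto the converted ozone value, which is why the quality of the auxiliary data matters for non-native representations.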
To conclude this section we reiterate the importance of using a single analysis and code framework. Apart from some unavoidable preprocessing steps, the data and analysis flow is identical for all 14 satellite comparison studies. In this way, the methodological biases are mostly identical and, hence, unlikely to be responsible for any differences observed between the satellite records. This approach will be exploited in Sect. 7. The next two sections present a detailed assessment of the bias, the short-term variability and the decadal stability of each individual satellite record.

Decadal stability
We estimate the decadal stability of satellite data through a robust analysis of the time series of the satellite-ground differences. This is a two-step process, in which the linear drift is first estimated at each ground station and subsequently averaged over the ozonesonde and lidar networks. The focus of this section is on the decadal stability of the individual satellite records, in their native profile representation. Later on we expand the discussion to the consistency of drift between profile representations (Sect. 6) and between satellite records (Sect. 7).

Time series analysis at individual stations
We first estimate the drift of the satellite data at each ground station. The comparison time series can contain large gaps and/or outliers; see, e.g. the GOMOS comparisons in Fig. 2 (top panel). Hence robust techniques are needed to estimate not only the drift but also its uncertainty (Muhlbauer et al., 2009; Croux et al., 2004). To this end we use an iterative Tukey-bisquare reweighted least-squares procedure to fit the daily averaged relative difference time series to a linear regression model
$\Delta x_{ij}(l) = \alpha_j(l)\,(t_i - t_0) + \beta_j(l) + \epsilon_{ij}(l)$,   (2)
with $\Delta x_{ij}(l)$ as in Eq. (1) at time $t_i$ and grid level $l$, and the fit residual $\epsilon_{ij}(l)$. In this model, the fit parameter $\alpha_j(l)$ represents the linear drift of the satellite data relative to the ground-based record $j$, whereas $\beta_j(l)$ is the bias between both records at reference time $t_0$. Time series with less than 10 data points are not regressed. The significance of the estimated $\hat{\alpha}(l)$ is tested using a robust estimate of its standard deviation $\hat{\sigma}_{\alpha}(l)$ proposed by Street et al. (1988), a slightly modified version of the ordinary least-squares expression. Figure 2 illustrates three time series with superimposed regression results (left panels, blue line) and the corresponding 95 % confidence intervals for $\hat{\alpha}(l)$ (right panels, vertical dashed blue lines).
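As an illustration only, the reweighting scheme described above might be sketched in Python as follows. The function name `bisquare_drift`, the tuning constant `c = 4.685` and the MAD-based residual scale are common textbook defaults that we assume here; they are not taken from this work, and the uncertainty estimate of Street et al. (1988) is not reproduced.

```python
import numpy as np


def bisquare_drift(t, dx, t0=0.0, c=4.685, n_iter=50, tol=1e-8, min_points=10):
    """Fit dx = alpha * (t - t0) + beta with Tukey-bisquare reweighting.

    Hypothetical helper: returns (alpha, beta, weights), or None when the
    time series has fewer than `min_points` samples.
    """
    t = np.asarray(t, dtype=float)
    dx = np.asarray(dx, dtype=float)
    if t.size < min_points:
        return None

    # Design matrix: drift term and bias term.
    X = np.column_stack([t - t0, np.ones_like(t)])
    w = np.ones_like(dx)          # first pass is ordinary least squares
    coef = np.zeros(2)

    for _ in range(n_iter):
        sw = np.sqrt(w)
        new_coef, *_ = np.linalg.lstsq(X * sw[:, None], dx * sw, rcond=None)
        resid = dx - X @ new_coef

        # Robust residual scale from the median absolute deviation (assumed).
        scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
        if scale == 0.0:
            coef = new_coef
            break

        # Bisquare weights: smooth down-weighting, zero beyond c * scale.
        u = resid / (c * scale)
        w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)

        converged = np.max(np.abs(new_coef - coef)) < tol
        coef = new_coef
        if converged:
            break

    return coef[0], coef[1], w
```

With time expressed in decades and differences in percent, the first returned value is a drift in % per decade; points with zero weight are those treated as outliers by the fit.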

Aggregation into ground network average
In a second step, the drift estimates $\hat{\alpha}_j(l)$ are averaged over various ground stations $j = \{1, \ldots, N\}$. None of the satellite records exhibit a clear latitudinal structure of drift (see, for instance, HALOE and Aura MLS at 25 km in Fig. 3). Therefore, we average the results over the entire sonde network and over the entire lidar network. Since there is a clear variability in the regression uncertainty across the network, each station estimate is weighted by the inverse of its variance, $w_j(l) = \hat{\sigma}_{\alpha,j}^{-2}(l)$. The network-averaged drift has a standard deviation $\sigma_{\bar{\alpha}}(l) = 1/\sqrt{\sum_j w_j(l)}$. The single-site drift uncertainties alone do not always explain the observed variability of the drift estimates over the network. When the number of stations is large enough ($N \gtrsim 20$) the distribution of normalised residuals $\nu_j(l) = (\hat{\alpha}_j - \bar{\alpha})/\hat{\sigma}_{\alpha,j}$ should have unit variance for realistic estimates $\hat{\sigma}_{\alpha}$ of the variance of $\hat{\alpha}$. That is typically not the case for the dense samplers, which tend to have larger variance as illustrated, for instance, for Aura MLS in Fig. 3 (right). This suggests an unaccounted-for source of uncertainty, likely related to differences in sampling or inhomogeneities across the ground-based network. We follow an ad hoc approach to incorporate this unknown component, by scaling the uncertainty up so that the reduced $\chi^2(l) = \frac{1}{N-1} \sum_j \nu_j(l)^2$ becomes unity. We also assume, conservatively, that the original regression uncertainty does not overestimate the true uncertainty; hence $\kappa(l) = \max\left(\sqrt{\chi^2(l)},\, 1\right)$. In the following, this adjusted standard deviation, $\sigma^{*}_{\alpha}(l) = \kappa(l)\,\sigma_{\bar{\alpha}}(l)$, is used to test the significance of the drift averages at the 5 % level. Figure S2 shows the κ adjustment factor for each satellite record.
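The aggregation step can be sketched as below, under stated assumptions: the function name `network_average` is hypothetical, the scaling factor is taken as the square root of the reduced χ² (so that the rescaled residuals have unit reduced χ²), and significance is tested against a 2σ threshold as an approximation of the 5 % level.

```python
import numpy as np


def network_average(alpha, sigma, n_sigma=2.0):
    """Inverse-variance weighted mean drift with chi-square adjusted uncertainty.

    alpha, sigma: per-station drift estimates and their 1-sigma uncertainties
    at one grid level. Returns (mean, adjusted_sigma, kappa, significant).
    """
    alpha = np.asarray(alpha, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    w = sigma ** -2.0                          # inverse-variance weights
    mean = np.sum(w * alpha) / np.sum(w)       # weighted network mean
    sigma_mean = 1.0 / np.sqrt(np.sum(w))      # its formal standard deviation

    nu = (alpha - mean) / sigma                # normalised residuals
    chi2 = np.sum(nu ** 2) / (alpha.size - 1)  # reduced chi-square

    # Never scale the uncertainty down (conservative assumption in the text).
    kappa = max(float(np.sqrt(chi2)), 1.0)
    sigma_adj = kappa * sigma_mean

    significant = bool(abs(mean) > n_sigma * sigma_adj)
    return mean, sigma_adj, kappa, significant
```

For two stations with drifts of 1 and 3 % per decade and unit uncertainties, the sketch gives a mean of 2, a reduced χ² of 2 and hence κ = √2, which inflates the formal uncertainty of 1/√2 to 1 % per decade.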

Sensitivity to analysis parameters
The importance of correct single-station uncertainties $\hat{\sigma}_{\alpha,j}(l)$ is evident for the calculation of both the weighted mean and its uncertainty. The possible presence of data gaps, outliers and auto-correlation in the time series led us to cross-check the analytic expression of Street et al. (1988) with a bootstrapping technique (Efron and Tibshirani, 1986). Each comparison time series was resampled 2500 times, drawing single data points with replacement, and subsequently regressed to reconstruct the distribution of $\hat{\alpha}_j(l)$ (Fig. 2, right). The 2.5 and 97.5 % quantiles define the 95 % confidence interval (light red area), which is in good agreement with the analytic expression (vertical dashed blue lines). Replacing the analytic by the bootstrap-derived uncertainties in Eqs. (3) and (4) changes ᾱ and $\sigma^{*}_{\alpha}$ typically by less than ∼ 0.5 % decade⁻¹ (Fig. S3). Figure 2 (left) also illustrates the outcome of other sensitivity checks, such as changing the temporal resolution of the time series prior to regression (from daily to monthly, green curve) or adding a 1-year harmonic component to the regression model (orange curve). Again, the results are very consistent, changing ᾱ and $\sigma^{*}_{\alpha}$ typically by less than ∼ 1 % decade⁻¹ (Fig. S3). These cross-checks demonstrate the robustness of the results to changes in the analysis parameters.
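A minimal sketch of such a bootstrap cross-check is given below. For brevity it regresses each resample with ordinary least squares (`numpy.polyfit`) rather than with the robust procedure used in this work; the function name and the fixed random seed are our own illustrative choices.

```python
import numpy as np


def bootstrap_drift_interval(t, dx, n_boot=2500, seed=0):
    """Percentile bootstrap 95 % confidence interval for a linear drift.

    Data points (t_i, dx_i) are resampled with replacement; each resample is
    fitted with ordinary least squares (a simplification of the robust fit).
    """
    t = np.asarray(t, dtype=float)
    dx = np.asarray(dx, dtype=float)
    rng = np.random.default_rng(seed)
    n = t.size

    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)              # draw n indices with replacement
        slopes[b] = np.polyfit(t[idx], dx[idx], 1)[0]  # slope of the linear fit

    # The 2.5 and 97.5 % quantiles bound the 95 % confidence interval.
    lo, hi = np.percentile(slopes, [2.5, 97.5])
    return lo, hi
```

Resampling individual points ignores auto-correlation in the time series; a block bootstrap would be one possible alternative if that were a concern.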

Selection of ground sites
Several ground sites were discarded from the drift analysis because of poor sampling, spurious features in the reference data, or peculiarities in the satellite data. Drift estimates at stations with small co-location samples have large uncertainty and, hence, in principle a negligible influence on the network-averaged estimates. Nevertheless, a few ozonesonde stations with a short data record or with episodic observations collected during field campaigns are not retained for the regression analyses. Figure 4 shows the vertical drift profiles for seven limb/occultation records at nine NDACC lidar sites, six of which were also studied by Nair et al. (2012). The common vertical drift structure of the sounders noted at Andøya and Tsukuba is indicative of features in the lidar time series, which may influence the network-averaged satellite drift analyses. Both lidar sites are therefore rejected from the stability analysis. Also the Dumont d'Urville comparison time series are not considered, for two reasons. First, the lidar system was entirely redesigned in 2002 (David et al., 2012), which possibly introduces inhomogeneities in the time series. And secondly, the station is located close to the edge of the polar vortex, which can induce spurious biases due to mismatches in the air parcel sampled by lidar and satellite. The latter challenge could be overcome, e.g. by co-locating in equivalent-latitude space (Bergeret, 1999), but this was outside the scope of this work. The drift results at Hohenpeißenberg for all recent sounders are significantly negative above about 25 km, while the results scatter around zero for two historic occultation instruments. Inspection of the time series indeed showed that the Hohenpeißenberg lidar reported more ozone for a few years after 2007 (Nair et al., 2012). This station is hence discarded from the drift analyses of all satellite sounders operational during and after 2007 (Table 3).
Similarly, the Table Mountain lidar (McDermid et al., 1990) measured higher ozone relative to satellite instruments during 2007-2008. This bias disappeared in later years, leaving the satellite drift estimates nearly unchanged (Nair et al., 2012). One exception is Aura MLS, since the temporary lidar bias occurred close to the start of the mission. Nonetheless, we keep the Table Mountain lidar data for our analyses. A similar procedure was followed to discard about 20 ozonesonde records. For one satellite instrument we deviate from the standard selection of ground sites described above. SCIAMACHY drift results in the Arctic are very different from those in the rest of the atmosphere, especially for lidar. We believe this is a combined result of sampling and the seasonal cycle observed in the difference time series (Sect. 5). Therefore, all Arctic stations are excluded from the drift analysis of SCIAMACHY. Tables 1 and 2 list the stations used for the drift analysis (last column). Thanks to the pseudo-global coverage of the ozonesonde network, the network average should be a reasonably robust representation of the global satellite drift. Lidar network averages, on the other hand, are less representative of the global state and they are somewhat more sensitive to the station selection as well. Figure S4 shows how ᾱ and $\sigma^{*}_{\alpha}$ change when the discarded ground stations are included in the averaging procedure. Ozonesonde network-averaged drift and uncertainty change by less than 0.2 % decade⁻¹. The impact is a bit larger (< 0.5-1 % decade⁻¹) for SCIAMACHY due to its peculiar data characteristics in the Arctic. Lidar network averages are more sensitive to the selection of sites, especially for the recent satellite records. They differ by 1-2 % decade⁻¹ above 25 km, mainly as a result of the inclusion of the Hohenpeißenberg data, which systematically pulls the vertical drift profile towards more negative values. The impact of lidar site selection is much smaller for older records, less than 0.5 % decade⁻¹. Also the estimates of drift uncertainty are somewhat affected, but not as much as the actual drift values. Typically, the difference in uncertainty is less than 0.5 % decade⁻¹. Later on, we describe the remarkable agreement between the ozonesonde and lidar-derived drift results, strengthening the confidence in the stability of these ground networks (Fig. 5).
Figure 4. Comparison of the vertical structure of the drift α of two historic (dashed lines) and five recent (solid) satellite records relative to stratospheric lidar observations at nine NDACC stations. The shaded area represents the unadjusted 68 % confidence interval, which does not include possible uncertainties from differences in sampling or from inhomogeneities in the lidar network. The analysis is performed in the native profile representation of each satellite record.

Results
Below we report on the vertical structure of the network-averaged drift estimates and their significance for each satellite record. We also mention some indicators of the performance of the ground networks for this type of analysis: (a) the smallest value of the 1σ regression uncertainty found across the network, (b) the typically found uncertainty and (c) the adjustment factor κ. Main results are presented in Fig. 5 and summarised in Table 4.

SAGE II
The very long record of SAGE II, spanning 21 years, allows for a detailed analysis of its stability. The smallest 1σ uncertainty derived at single sites in the ozonesonde and lidar networks is, respectively, 0.8 % decade⁻¹ and 1.6 % decade⁻¹. The average drift uncertainty over the ensemble of stations is ∼ 4 % decade⁻¹. The drift results are furthermore very consistent from one station to another, with a spread of 2-3 % decade⁻¹ at 25 km (Fig. 4). The sonde and lidar derived estimates are statistically consistent as well. When aggregated over the entire ground network a significant SAGE II drift should be detectable at the 1-2 % decade⁻¹ level, depending on altitude.
Table 4. Overview of the drift of satellite ozone profile records relative to ozonesonde and lidar, in the lower, middle and upper stratosphere. For each altitude region we present the range of the network average of the drift (ᾱ) and its adjusted one sigma uncertainty ($\sigma^{*}_{\alpha}$). Bold values indicate results with more than 2σ significance. Columns: Drift SAT-GND; 10-20 km; 20-30 km; 30-45 km; Remark.
Atmos. Meas. Tech., 9, 2497-2534, 2016 www.atmos-meas-tech.net/9/2497/2016/
In the middle and upper stratosphere, between 20 and 40 km, the average drift is slightly negative except around ∼ 33 km (Fig. 5). The negative drift remains smaller than 1-2 % decade⁻¹ and is not significant. At lower altitudes the drift becomes gradually more pronounced, but is never significant either, as a result of the increased atmospheric variability or noise in the SAGE II record. We therefore conclude that the SAGE II record is stable relative to the ground measurements, at least within 2 % decade⁻¹.

SAGE III
SAGE III collected data for only 3.5 years, which excludes the upper stratosphere from our study as no lidar sites provide sufficient statistics. Between 20 and 30 km the minimal drift uncertainty is 6 % decade⁻¹, while that of most stations is easily twice as high.
SAGE III ozone decreases relative to ground measurements, by 2-6 % decade⁻¹ in the middle stratosphere (MS) and more than 10 % decade⁻¹ at lower altitudes (Fig. 5). The significance is, however, far from sufficient for a 2σ detection. The detection limit for the network-averaged drift is at best 6 % decade⁻¹ between 20 and 30 km. In the lower stratosphere (LS) the threshold rapidly worsens to 10-30 % decade⁻¹ due to the increased contribution of natural variability and instrumental noise. We therefore conclude that SAGE III is stable within about ±10 % decade⁻¹, which is consistent with an earlier report by Wang et al. (2006).

HALOE
The 14-year HALOE record allows for a quite detailed study of the stability as well. The typical uncertainty at single stations is 5 % decade⁻¹, which is comparable to the variability of the spread between stations (Fig. 3, top left, light grey band). The 2σ detection threshold for the network average is 2-3 % decade⁻¹ or more.
For altitudes above 100 hPa we observe a negative drift of about 1-7 % decade⁻¹ (Fig. 5). The result is significant between 10 and 40 hPa for both the ozonesonde and the lidar comparisons. Figure 3 demonstrates that negative drifts are found across the entire ground network (left panel), all centred around the network-averaged value (right). At altitudes above 10 hPa and below 40 hPa the drift is less than ±5 % decade⁻¹ with an uncertainty of 1.5-6 % decade⁻¹ and hence not significant. No dependence on vertical coordinate or ozone quantity was found for the HALOE drift results (Fig. 9), so these cannot be explained by drifting auxiliary data of the correlative records (Sect. 6).
Two earlier studies concluded that HALOE does not drift significantly relative to SAGE II, at least not more than ±10-15 % decade⁻¹ (Morris et al., 2002; Nazaryan et al., 2005). Due to the longer data record considered here, the more frequent sampling and the stability of the ground networks we obtain a significant result already at the 2-3 % decade⁻¹ level between 10 and 50 hPa. Our result is consistent with the earlier reports, although a direct comparison is not straightforward due to the different timespan and vertical coordinate (we come back to this in Sect. 6). From Fig. 5 we infer that the middle stratospheric drift of HALOE relative to SAGE II must range between 0 and −5 % decade⁻¹, which is comparable in sign and in magnitude with the −(0-10) % decade⁻¹ reported by Morris et al. (2002, Fig. 4a) and −(2-4) % decade⁻¹ by Nazaryan et al. (2005, Fig. 8).

UARS MLS
The UARS MLS record is somewhat short (less than 6 years), which limits the drift study especially at low altitudes and relative to the lidar instruments. Between 5 and 50 hPa (20-35 km) the single-station drift uncertainty is 5 % decade⁻¹ at best, but typically twice as large. When the results are averaged over the ground network the 2σ detection threshold is 4-8 % decade⁻¹. At other altitudes the threshold increases rapidly, by a factor of at least 2.
For altitudes below 10 hPa the ozonesonde comparisons show a non-significant positive drift of 0-3 % decade⁻¹ (Fig. 5). The drift relative to lidar, on the other hand, is negative but it is also not well constrained. As a result, the difference between the sonde- and lidar-derived results is not significant. In fact, it is difficult to conclude anything from the lidar results; the results at different sites tend to be somewhat inconsistent, especially at altitudes above the 10 hPa level. While the upper stratospheric drift of UARS MLS goes up to +10 % decade⁻¹ relative to the Observatoire de Haute-Provence (OHP) and Table Mountain lidars, it goes down to −10 % decade⁻¹ relative to the Mauna Loa and Lauder lidars. This necessitates a large χ² adjustment of κ ≈ 2.5 for lidar (Eq. (4) and Fig. S2) and results in a final uncertainty of about 10 % decade⁻¹. We conclude that between 10 and 50 hPa the UARS MLS instrument is stable within about ±5-10 % decade⁻¹, perhaps slightly worse. In the upper stratosphere the discrepancy between the lidar results is too large to assess the stability of UARS MLS.
We also note a dependence of the UARS MLS ozone drift results on profile representation due to an ascending drift in the accompanying GPH profile products (Fig. 9). More details and a recommendation to avoid such representation dependences follow in Sect. 6.

Aura MLS
The stability of the Aura MLS instrument can be studied in great detail, thanks to its excellent temporal and spatial sampling. Single-site drift uncertainty is 0.6 % decade⁻¹ at best and 2 % decade⁻¹ on average. Regression uncertainties are substantially smaller than the observed standard deviation of the drifts over the network, which is about 4-6 % decade⁻¹ at altitudes above 50-100 hPa (Fig. 3, bottom). This leads to a considerable χ² adjustment (Fig. S2) of κ ≈ 2.5 in the middle stratosphere (sonde) and κ ≈ 3 in the upper stratosphere (lidar). The resulting 2σ detection limit for network averages is 1-3 % decade⁻¹ at altitudes below 5 hPa, and increases rapidly in the uppermost stratosphere.
In the upper and middle stratosphere the average drift is slightly positive, but generally not more than 1.5-2 % decade⁻¹ (Fig. 5). Sonde and lidar derived results are very consistent. A significant negative drift seems to develop at altitudes below 100 hPa, which we think is due to an underestimation of the uncertainty. Indeed, obtaining realistic uncertainties at the level of a few % decade⁻¹ in the UTLS is a daunting task. We therefore conclude that Aura MLS v3.3 is stable in the entire stratosphere, certainly within 1.5 % decade⁻¹ (MS) and 2 % decade⁻¹ (US). Our ground-based estimates are consistent with earlier intercomparisons of Aura MLS, MIPAS (Eckert et al., 2014) and OSIRIS, indicating drifts between the instruments less than ±3-5 % decade⁻¹.
We will see later on that the above drift results differ from those in non-native vertical coordinate representations, due to an overall descending drift of the Aura MLS GPH profiles (Fig. 9). This issue and a possible solution will be discussed in Sect. 6.

POAM II
The analysis of POAM II is extremely limited due to its infrequent sampling and short record, merely 3 years. The regression requirement of at least 10 data points was met at just 7 polar ozonesonde stations. There were not enough co-locations with lidar instruments to study the upper stratosphere. Drift uncertainty is about 30 % decade⁻¹ at most sites and 20 % decade⁻¹ in the best case. The resulting 2σ detection threshold for the network average is 20-40 % decade⁻¹ in the middle stratosphere. This is much larger than the observed drifts, which range from −15 % decade⁻¹ at 20 km and 30 km to +15 % decade⁻¹ at 25 km (Fig. 5). We conclude that the stability of POAM II is better than ±25 % decade⁻¹ in the middle stratosphere.

POAM III
The POAM III data record spans 7.5 years and can therefore be studied in greater detail than that of its predecessor. In addition to the seven polar stations in the POAM II drift analysis, five ozonesonde sites at northern mid-latitudes provide a sufficiently sampled time series. Again, the regression was not feasible for lidar comparisons, limiting the altitude of our analysis to 30 km. The single-station uncertainty is 4 % decade⁻¹ at best and about 6 % decade⁻¹ on average. When the results are averaged, the 2σ drift uncertainty becomes 4-8 % decade⁻¹ in the middle stratosphere and rapidly grows to 10 % decade⁻¹ at 15 km. Overall, POAM III seems to drift to lower ozone values between 20 and 30 km, at a rate of −(2-8) % decade⁻¹ (Fig. 5). At lower altitudes the drift changes sign. None of our results are statistically significant. We conclude that POAM III is stable within, respectively, ±5 and 15 % decade⁻¹ in the middle and lower stratosphere.

OSIRIS
The OSIRIS time series are densely sampled at many ground stations. In the middle and upper stratosphere the minimum drift uncertainty is 1.3 % decade⁻¹ and typically amounts to 3-4 % decade⁻¹. The regression uncertainties do not fully explain the observed variability of 5-6 % decade⁻¹ between stations above 20 km. The corresponding χ² adjustment factor κ is ∼ 1.5-2 for the sonde network and mostly less than 1.5 for the lidar network. The 2σ detection limit for the network average is 3 % decade⁻¹ at 15 km, 1.6 % decade⁻¹ between 20 and 30 km and 5 % decade⁻¹ at 45 km.
In the lowermost stratosphere the OSIRIS drift relative to correlative measurements is negative, at most −5 % decade⁻¹ and not significant (Fig. 5). There are clear indications of a positive drift between 15 and 35 km, of about 1-3 % decade⁻¹. While the sonde-derived result is significant (> 22 km), that is generally not the case for the lidar results (except between 28 and 34 km). In the upper stratosphere the positive drift becomes more pronounced and very significant above 37 km. Its presence is easily visible in the comparison time series, e.g. at the OHP lidar (Fig. 2). Around 42 km we find a > 2σ drift of +8 % decade⁻¹ at three of the four best sampled lidar stations (Fig. 4). Adams et al. (2014) reported a +(3-6) % decade⁻¹ drift of OSIRIS relative to Aura MLS in the US, depending on how the Aura MLS data (pressure-VMR) are converted to the native OSIRIS system (altitude-number density). This is consistent with the 5 % decade⁻¹ difference that we find between our lidar-based drift estimates for these two instruments. Also Rahpoe et al. (2015) obtained positive drift estimates of OSIRIS relative to five satellite instruments above 40 km, though the results are not significant for most instrument pairs.
In summary, OSIRIS ozone drifts very likely to higher values above 20 km. The drift is quite small up to 35 km and close to the 5 % significance threshold. In the upper stratosphere the presence of a +(5-8) % decade⁻¹ drift is evident. The OSIRIS team has found that the drift in ozone may be caused by a positive drift in the altitude registration. Efforts are under way to correct for this in the next data release.

SMR
Even though the SMR record spans 12 years and has good sampling properties, the ability to assess its stability is limited by the noise of the profiles. In Sect. 5 we show that the single SMR profile noise exceeds 20 % in the tropics and 30 % at higher latitudes. This is substantially larger than for any other satellite record in this study. As a result, the drift uncertainty is at best 5-6 % decade⁻¹ and typically ∼ 10 % decade⁻¹ at individual ground sites. The regression uncertainties cover the observed drift variability across the ground network, so the χ² adjustment factor is close to one. In the end, the 2σ threshold to detect averaged drifts ranges from 3 to 10 % decade⁻¹ between 25 and 40 km.
The SMR profile drifts slightly to higher values in the middle stratosphere, although by no more than +5 % decade⁻¹, which is insignificant (Fig. 5). Above 30 km the drift changes sign and increases rapidly in magnitude, reaching −12 % decade⁻¹ around 40 km. Due to the large single-profile noise the negative drift is only significant at the 2σ level between 40 and 43 km. A recent six-satellite intercomparison study pointed to a negative drift of SMR upper stratospheric ozone as well, though the estimates were generally not considered significant (Rahpoe et al., 2015). These results contrast with satellite intercomparisons by Jones et al. (2009), which indicated an insignificant positive drift of SMR relative to a multi-satellite average in the upper stratosphere. The difference may be due to a shorter period (2001-2007) or due to the different data versions, and deserves further study. Meanwhile, we conclude that SMR is stable within ±6-8 % decade⁻¹ over most of the stratosphere. SMR ozone trends in the uppermost stratosphere, however, should be interpreted cautiously as they possibly underestimate the actual trend by more than 10 % decade⁻¹.

GOMOS
The constraints on the stability of GOMOS are weaker than for its contemporary limb sounders, due to its sparser sampling and, below ∼ 20 km, its larger noise. These limitations are clear from the comparison time series at the Payerne ozonesonde station (Fig. 2). In the middle stratosphere, the drift variability between stations is about 10 % decade⁻¹, which is larger than the uncertainties at individual sites, about 3 % decade⁻¹ at best and 7 % decade⁻¹ in general. The χ² adjustment increases the uncertainty of the network averages by κ ≈ 1.5. The resulting 2σ detection threshold is 3-5 % decade⁻¹ between 20 and 40 km and rises rapidly in the lower stratosphere, e.g. to ∼ 12 % decade⁻¹ at 15 km.
In the upper stratosphere the lidar results are scattered, but they point on average to a small, positive drift of GOMOS retrievals above 35 km (Fig. 4). The maximum drift is only +3 % decade⁻¹ at 45 km, well below the 2σ threshold. However, below 25-30 km a pronounced negative drift develops with decreasing altitude, from −1 % decade⁻¹ at 30 km to −4 % decade⁻¹ at 20 km (Fig. 5). The results for the ozonesonde and lidar networks are qualitatively and quantitatively consistent, the latter being less significant below 22 km. GOMOS drift estimates are close to the 2σ threshold between 15 and 25 km. At lower altitudes, the significance decreases due to markedly increased noise. Various other studies corroborate our observation of a negative drift in the lower stratosphere. Nair et al. (2011) reported a drift of up to −18 ± 8 % decade⁻¹ (1σ uncertainty) near 20 km relative to the OHP lidar (43.9° N, 5.7° E). Similarly, intercomparisons pointed to a negative drift of GOMOS lower stratospheric ozone relative to all of its contemporary limb sounders (Rahpoe et al., 2015).

MIPAS
We only consider profiles from 2005-2012 (optimised resolution period, OR) in the nominal observation mode, since other MIPAS data are less recommended for use in long-term studies. Nevertheless, the stability can still be studied down to several % decade⁻¹ thanks to the good sampling properties of the instrument. In the middle and upper stratosphere, the smallest single-site regression uncertainty is 1.5 % decade⁻¹ and typically ∼ 3 % decade⁻¹. These errors do not fully cover the observed variability between sonde stations. They are therefore scaled by a factor of κ ≈ 2 between 3 and 50 hPa and 1 at altitudes below 100 hPa. The resulting 2σ detection limit for the network average is 2-4 % decade⁻¹ between 10 and 100 hPa and 5-9 % decade⁻¹ in the upper stratosphere.
No significant drift is observed in the MIPAS OR profiles; they are stable relative to the ground-based networks over the entire considered altitude range. Drift estimates are less than ±2 % decade⁻¹ in the middle and upper stratosphere, and less than ±4 % decade⁻¹ at lower altitudes (Fig. 5). Eckert et al. (2014), on the other hand, noted clear negative drifts in the upper stratosphere between MIPAS data retrieved by the Level-2 processor at Karlsruhe Institute of Technology and Aura MLS (0.2-0.3 ppmv decade⁻¹, or ∼ 3-5 % decade⁻¹) or OSIRIS (0.3-0.6 ppmv decade⁻¹, or ∼ 5-10 % decade⁻¹). The seemingly contrasting results from both analyses are nevertheless in good agreement. We deduce from the lidar-based drift estimates that the relative drift between MIPAS and Aura MLS or OSIRIS would be, respectively, −(2-5) and −(3-10) % decade⁻¹ for altitudes above 5 hPa (Fig. 5).
Our drift results are generally not applicable for trend analyses which include MIPAS data prior to 2005 (full resolution period, FR). The FR data are biased relative to the OR profiles (Ceccherini et al., 2013), which will introduce an (altitude-dependent) systematic uncertainty in trend analyses if not accounted for. Eckert et al. (2014) overcome this issue by including the FR-OR bias as a free parameter in the regression model.
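One generic way to treat an inter-period bias as a free parameter is to add a step term to the linear model. The sketch below illustrates that idea with ordinary least squares; it is not the implementation of Eckert et al. (2014), and the function name and the `t_break` argument are hypothetical.

```python
import numpy as np


def fit_drift_with_offset(t, dx, t_break, t0=0.0):
    """Least-squares fit of dx = alpha * (t - t0) + beta + delta * H(t - t_break).

    H is a step function that is 0 before `t_break` and 1 from `t_break` on,
    so `delta` absorbs a constant bias between the two periods.
    Returns the array [alpha, beta, delta].
    """
    t = np.asarray(t, dtype=float)
    dx = np.asarray(dx, dtype=float)

    step = (t >= t_break).astype(float)
    # Columns: drift term, bias term, inter-period offset term.
    X = np.column_stack([t - t0, np.ones_like(t), step])
    coef, *_ = np.linalg.lstsq(X, dx, rcond=None)
    return coef
```

Without the step term, an unmodelled offset between the two periods would be partly absorbed by the drift parameter, which is the systematic effect referred to above.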

SCIAMACHY
The excellent sampling of SCIAMACHY allows us to probe its stability down to 0.8 % decade⁻¹ at some ground sites, and on average down to ∼ 2 % decade⁻¹. Again, these statistical uncertainties do not cover the variability of 6 % decade⁻¹ observed between the stations, leading to a κ ≈ 2-2.5 adjustment over most of the middle stratosphere.
The drift averages become significant when they cross the 2-6 % decade⁻¹ bar in the middle and upper stratosphere.
SCIAMACHY data below 30 km drift to higher values relative to sondes and lidars (Fig. 5). The drift is nearly independent of altitude and amounts to about +2 % decade⁻¹. The sonde results surpass the 2σ threshold, but those derived from lidar observations do not. The drift has the opposite sign above 30 km and rapidly becomes highly significant at all lidar sites (Fig. 4). It reaches maximal significance, more than 5σ, around 38 km with a magnitude of −9 % decade⁻¹. These results clearly show that SCIAMACHY trend results should be interpreted very cautiously in the upper stratosphere, and likely at lower altitudes as well. For instance, the large negative drift in SCIAMACHY US ozone explains, at least partially, the more negative trends derived from the IUP Bremen v2.5 data set than those found for Aura MLS and OSIRIS (Gebhardt et al., 2014). While the latter authors consider a different SCIAMACHY Level-2 processor than us, there have been reports of a negative drift of 5 % decade⁻¹ at 30-40 km for the IUP Bremen processor as well (Lambert et al., 2014; Rahpoe et al., 2015).
The drift in SCIAMACHY data is not well understood and several possible causes are being explored. The SGP 5.02 limb ozone retrieval does not use UV wavelengths, so little information is retrieved in the upper stratosphere and the resulting data will be weighted towards the a priori. Since the latter is taken from an annually repeating climatology, a negative drift in the US can be expected provided that the actual ozone trend is positive in this part of the atmosphere. However, this seems to provide only a partial explanation as the magnitude of the positive trend (about +3-4 % decade⁻¹ between 30 and 40 km) is not nearly as large as the negative drift in SGP 5.02 ozone data (−9 % decade⁻¹). The IUP Bremen data record should be less prone to this effect, since more information is extracted in the US by exploiting the Hartley band. Nonetheless, a negative drift in IUP Bremen data is observed in the US as well, but of smaller magnitude. A second possibility is that the retrieved ozone values change as a result of changes over time in the sensitivity to limb polarisation. The polarisation is currently not well determined but is expected to be in a future operational data release (version 7). Meanwhile, further investigations are ongoing.

ACE-FTS
The solar occultation instruments onboard SCISAT sample mainly high latitudes. We limit our stability study of ACE-FTS to the lower and middle stratosphere, since there is only one lidar site with a sufficient number of co-locations. The best single-site drift uncertainty is 3 % decade⁻¹, whereas it amounts to about 10 % decade⁻¹ in general, close to the observed variability between stations. The observed drift is mostly negative, less than 5 % decade⁻¹, which is consistent with the no-drift hypothesis (Fig. 5). The ACE-FTS data record can be considered stable to within about 5 % decade⁻¹. A more precise analysis will be possible once the ACE-FTS profiles taken after September 2010 are included in the analysis.

MAESTRO
The uncertainty on the stability of the MAESTRO record is slightly poorer than that of ACE-FTS. The larger single-station uncertainties, at least 5 % decade⁻¹ and typically 12-14 % decade⁻¹, lead to a 2σ detection threshold at 6-8 % decade⁻¹ and 6-25 % decade⁻¹ in the middle and lower stratosphere, respectively. The results never cross these thresholds: below 20 km we find a drift between −7 and +10 % decade⁻¹, above 20 km the drift is mainly positive and about 2-3 % decade⁻¹ (Fig. 5). Hence, the MAESTRO record is considered stable within ±6-10 % decade⁻¹. Again, as for ACE-FTS, the uncertainty will decrease once the post-September 2010 profiles are added to the analysis.

Bias and short-term variability
After studying decadal stability, we address the overall bias and short-term variability and search for patterns in altitude, latitude and season. As in the previous section, we focus here on the individual satellite records in their native profile representation. Later on we expand the discussion to the consistency between profile representations (Sect. 6) and between satellite records (Sect. 7).

Methodology
Again, robust statistics are adopted that protect against outliers. We define the bias b(l) as the median of the difference distribution at grid level l, $b(l) = Q_{50}(\Delta x_i(l))$, where i runs over the pairs in the comparison sample. The 68 % interpercentile of the difference distribution is referred to as comparison spread s. We stress that s should not be confused with an estimate of the precision of the satellite data, as other, non-negligible terms enter the comparison error budget. These include the precision of the ground-based data and random uncertainties in the metrology of the comparison related to the difference in sampled air masses (Sect. 3), but also any long-term time dependence of the bias. In principle a similar remark is also valid for the bias b, but systematic uncertainties in the metrology of the comparison are expected to play a smaller role, except perhaps in the UTLS due to low ozone abundances and above 30 km due to the different sampling by lidar and a few satellite instruments of the diurnal cycle (Sect. 3).
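The two robust estimators defined above can be sketched in a few lines. The function name `bias_and_spread` is hypothetical, and the sketch assumes that the 68 % interpercentile spread is taken as the half-width of the 16th to 84th percentile interval; the exact convention is not spelled out in this section.

```python
import numpy as np


def bias_and_spread(rel_diff_percent):
    """Median bias b and 68 % interpercentile spread s at one grid level."""
    d = np.asarray(rel_diff_percent, dtype=float)
    d = d[np.isfinite(d)]  # drop missing comparison pairs
    q16, q50, q84 = np.percentile(d, [16.0, 50.0, 84.0])
    # Assumption: s is the half-width of the central 68 % interval, so that
    # it equals one standard deviation for a Gaussian distribution.
    return q50, 0.5 * (q84 - q16)


# Symmetric toy sample of relative differences, in percent.
sample = np.arange(-50.0, 51.0)
b, s = bias_and_spread(sample)

# A single gross outlier barely moves the robust estimates.
b_out, s_out = bias_and_spread(np.append(sample, 1000.0))
print(b, s, b_out, s_out)
```

The comparison of the two calls shows why percentile-based estimators are preferred here: a mean and standard deviation would be shifted strongly by the outlier, whereas the median and the interpercentile spread change by less than one percentage point.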

Results
The vertical and meridional structure of bias and comparison spread relative to ozonesonde measurements is shown in Figs. 6 and 7. Since there is more resemblance between the instruments, we only show a few typical cases for the comparison spread. Table 5 summarizes the bias estimates in four layers of the atmosphere. In the Supplement we provide vertical profiles of bias and spread from comparisons to ozonesonde and lidar observations in five latitude bands (Figs. S5-S18). In addition, for selected instruments, there are supplementary figures for the dependence of data quality on solar occultation type (Fig. S19) and month (Fig. S20).

SAGE II
Between 20 and 40 km SAGE II ozone remains mostly within ±3 % of the correlative measurements. Above 30-35 km, however, sunrise profiles have a ∼ 4 % more negative bias relative to lidar than sunset profiles (Fig. S19). This confirms, qualitatively, earlier reports of 8-10 % smaller sunrise concentrations than at sunset in the middle and upper stratosphere (Damadeo et al., 2014; Sakazaki et al., 2015). In the lowermost stratosphere, and below, ozone is underestimated by up to 10-15 %. The spread in the comparisons is lowest between 25 and 40 km and increases poleward, from 5 % at the Equator to ∼ 10 % at high latitudes. Below 20 km the observed spread increases rapidly to 20-30 %, especially under Antarctic ozone hole conditions.

SAGE III
The stratospheric bias of SAGE III is mostly less than ±3 %, comparable to that of its predecessor. Ozone is generally slightly overestimated except in the Arctic between 10 and 35 km and below ∼ 15 km at mid latitudes. The latter contrasts with a high bias up to 10 % seen at 13 km by Wang et al. (2006) for an earlier version of the data set. It is not clear whether the SAGE III sunrise and sunset profiles are biased relative to each other. Figure S19 shows that the bias relative to lidar is ∼ 5 % more positive for sunrise measurements above 30 km. However, it is not possible to attribute this to diurnal variation since the type of occultation depends on the hemisphere (sunset in North, sunrise in South) and there may be a meridional structure in the instrument bias field (Fig. S6). The short-term variability seems a few percent better than that of SAGE II, i.e. about 5 % at mid-latitudes and 8 % in the Arctic. Below 20 km and above 35-40 km the variability in the comparisons increases markedly.

HALOE
In the upper stratosphere and tropical middle stratosphere HALOE overestimates ozone by up to 3 %. In contrast, a negative bias is noted over the rest of the atmosphere. In the middle stratosphere it is not more than 5 % but it grows rapidly in magnitude at altitudes below the 50 hPa level, reaching at least 25 % at 200 hPa. The variability in the comparisons is similar to that from the SAGE instruments, ranging from 5-10 % in the middle and upper stratosphere, and peaking at 30-40 % around the tropopause. During the Antarctic ozone hole season, the volume mixing ratios are overestimated by 25 % and the spread increases to 35 %. Our results are consistent with earlier satellite and ground-based studies (Morris et al., 2002; Nazaryan et al., 2005). Sakazaki et al. (2015) reported a 2-5 % positive bias of sunset relative to sunrise occultations above 40 km. The lidar-based analysis seems to confirm this: differences between both occultation types are less than 2 % below 40 km and somewhat higher in the uppermost stratosphere (Fig. S19).

UARS MLS
Our findings corroborate most of those by Livesey et al. (2003): (a) at altitudes above 50 hPa UARS MLS slightly overestimates ozone by up to 5 %, (b) the 68 hPa and 100 hPa levels exhibit larger biases up to 10 %, and (c) the bias peaks at 68 hPa. However, negative biases up to 5 % are seen between 10 and 50 hPa relative to southern ozonesondes and at altitudes above 5 hPa relative to northern lidars. The short-term variability is similar to the previous records, but reaches the 5-10 % range somewhat higher up in the middle stratosphere, around 20 hPa. At lower altitudes the comparison spread increases fast, maximizing at more than 40-50 % at the tropopause. We noted furthermore that the UARS MLS bias depends on the profile representation if one uses the GPH and temperature data included in the MLS product to perform conversions. This will be discussed in more detail in Sect. 6.

Aura MLS
Aura MLS ozone remains within ±3 % of correlative measurements between 5 and 50 hPa, except in the Arctic where a negative bias of 5 % is noted. The most striking bias characteristics are the stationary vertical oscillations found in the finer vertical retrieval grid results (version 3.3/3.4 data, see Livesey et al., 2013a). They are very pronounced in the tropical UTLS where the amplitude reaches 10-15 %, but also extend to higher latitudes and altitudes, with amplitudes of 3-5 %. The previous data release, v2.2, has a coarser grid in the UTLS and displays fewer oscillations. The recent new release of Aura MLS data (version 4.2) mitigates these oscillations to some extent. The comparison spread shows that the single-profile precision is better than 4-7 % in the middle and upper stratosphere, and starts to degrade for altitudes below 50 hPa. Furthermore, the Aura MLS bias depends on the profile representation if one uses the GPH and temperature data included in the MLS product to perform conversions. We come back to this in Sect. 6.

Table 5. Overview of the bias of satellite ozone profile records relative to ozonesonde and lidar, in the upper troposphere and stratosphere. We present the range of the median relative difference (bias b) in each altitude bin, and whether there are any dependences on latitude and season that depart from the general tendency.

POAM II
We observe a negative bias of about 5-10 % between 20 and 30 km, which becomes rapidly more pronounced at lower altitudes in the Antarctic. This is consistent with earlier satellite and ground-based studies (Deniel et al., 1997; Danilin et al., 2002). In the northern lower stratosphere, however, the negative bias remains less than 5 %. Above 30 km, there is a positive bias of 5-10 % or more relative to the polar lidars. The small lidar comparison sample did not allow us to study sunrise vs. sunset results. As for SAGE III, the observed differences (Fig. S19) could also be due to a meridional dependence of the instrument bias since the occultation type changes with hemisphere. The comparison spread is 5-10 % in the middle and upper stratosphere, and increases below 20 km.

POAM III
The POAM III bias is less than 5 % in the middle stratosphere and upper stratosphere, and has a negative sign between 18 and 30 km and positive elsewhere. Here, the spread in the comparisons is also similar to its predecessor, ranging between 5 and 10 %. In the lower stratosphere there is an overestimation of at least 10 %, and, again, the spread is more pronounced. Our results corroborate the findings of Randall et al. (2003). Unfortunately, the small comparison sample does not allow us to verify their report of a negative bias of up to 5 % of MS and US sunrise data (taken in the Northern Hemisphere) relative to sunset profiles (in the SH).

OSIRIS
Our ground-based bias results are very consistent with those of satellite intercomparisons. OSIRIS ozone remains mostly within ±4 % of correlative measurements above 20 km, but two features stand out. First and foremost, a marked peak in bias around 22 km is seen at all latitudes, which is possibly related to biases in the aerosol retrieval preceding the ozone retrieval. The comparison to lidars in the tropics and the Southern Hemisphere shows a second jump towards a persistent 5 % positive bias, occurring between 30 and 35 km. Such a feature is not seen in the Northern Hemisphere. In the lower stratosphere, below 20 km, OSIRIS underestimates ozone by 5-10 % at mid and high latitudes and by more than 15 % in the tropics. Comparison spreads range from 6 to 11 % between 20 and 35-40 km. In the UTLS these increase to 20-40 % at 15 km, depending on latitude.

SMR
Our analysis confirms earlier reports (by, e.g. Urban et al., 2005; Jones et al., 2007; Jégou et al., 2008) of a systematic underestimation by 5-10 % in the upper and (most of the) middle stratosphere. The bias changes sign at lower altitudes and peaks at +5 to +10 % around 20 km. The most notable characteristic is the high comparison spread. It increases in the middle stratosphere from 20 % to 30 % between the tropics and the polar regions, and becomes even larger at other altitudes (Fig. 7). The poor single-profile precision is caused by the low signal-to-noise ratio for the 501.8 GHz line used for the ozone retrievals. Better precision can be obtained by averaging the profiles in the logarithmic VMR domain (Urban et al., 2005). Alternatively, one could use the SMR ozone products from the stronger 544.6 GHz band; these are clearly less noisy though exhibit larger biases (Hassler et al., 2014).

GOMOS
The GOMOS ozone bias above 20 km is generally less than ±3 %. The exception is the Arctic where a −7 % bias is found relative to ozonesonde and lidar at 25 km and again at 40 km. This is in agreement with earlier analyses by van Gijsel et al. (2010). Another notable feature is that the sign of the bias in the extratropical UTLS is opposite in both hemispheres. It reaches −20 % in the North and +20 % in the South at 10 km. The larger biases below 20 km are due to the interference of ozone and aerosol retrievals with aerosol models. In the middle and upper stratosphere the comparison spread ranges from 6 % in the tropics to 11 % at high latitudes (Fig. 7). Below 20-25 km, GOMOS data become notably noisier; at 15 km the observed spreads amount to 25-50 %, as a result of the increasing opacity of the atmosphere. Theoretically it is expected that profile quality depends on star properties such as magnitude and temperature. However, our analysis confirms (not shown here) an earlier claim by van Gijsel et al. (2010) that this is not the case when the recommended screening procedure is applied. The illumination condition of the occultation is clearly a more determining factor, with dark limb profiles offering the best data quality.

MIPAS
Due to changes in instrument and retrieval set-up there is an altitude-dependent bias between the first (2002-2004) and later years of the mission of up to 5 % (Ceccherini et al., 2013; Eckert et al., 2014). Our analysis covers the 2005-2012 period only and corroborates earlier findings for the operational and several alternative MIPAS Level-2 processors (e.g. Cortesi et al., 2007; Laeng et al., 2014, 2015). MIPAS OR profiles overestimate ozone systematically over most of the stratosphere, except in the Arctic. At mid and low latitudes there are two bias peaks of +(5-10) % around 50 hPa and 5 hPa. At other pressure levels the bias remains below 5 %. At the bottom of the profile, for p > 200 hPa, ozone is overestimated by at least 20 %. In the tropics, the bias briefly flips sign between 50 and 200 hPa, where a very negative bias is found. Between 2 and 50 hPa, the observed spread ranges from 4 % in the tropics to 8 % at higher latitudes. Again, in the UTLS a sharp increase is observed (Fig. 7). We also noted a dependence of the MIPAS bias on ozone quantity representation when the pressure and temperature data retrieved by the operational ML2PP 6.0 processor are used to perform conversions. More details follow in Sect. 6.

SCIAMACHY
The SCIAMACHY bias is clearly positive over most of the atmosphere and manifests an intricate structure in altitude, latitude and season. The agreement with ozonesonde and lidar is better than 10 %, and best at northern mid-latitudes (between 0 and +5 % over 15-40 km). However, the bias easily reaches +10-15 % over a large part of the stratosphere, stretching from 30° N-60° S (> 25 km) to 60-90° S (> 30 km). Similar results were obtained by Tegtmeier et al. (2013) for an alternative Level-2 processor developed by IUP Bremen. Arctic profile data quality is particularly peculiar (Fig. S20). There is a clear vertical dependence of the bias, peaking at +10 % around 20 km and −10 % at 15 and 30 km. Also, both bias and comparison spread vary strongly with season. The bias at 20 km reaches a maximum of +25 % during boreal winter and a minimum of 0 % in summer. Similarly, the mean comparison spread is about 20 %, but it peaks at 30 % in winter and shrinks to 10 % in summer. At other latitudes the observed spread is never below ∼ 10 %. Furthermore, in Sect. 6 we will show that the SCIAMACHY bias depends on the profile representation if the pressure and temperature data included in the SCIAMACHY product are used to perform conversions.

ACE-FTS
ACE-FTS ozone remains generally within about ±3 % of ground-based measurements over the entire stratosphere. The bias is negative relative to Arctic ozonesondes; everywhere else ozone mixing ratios are overestimated. Above 30 km, the comparisons to mid-northern and high-southern lidars indicate a slightly larger positive bias, but not more than 5 %. The relative bias only exceeds 5 % a few km above the tropopause. These observations are in line with other studies (Dupuy et al., 2009; Waymark et al., 2013). Sakazaki et al. (2015) recently reported sunrise-sunset biases in the upper stratosphere of 2-5 % above 40 km. Figure S19 shows differences of similar magnitude between the lidar bias results for both occultation types, but with the opposite sign and penetrating deep into the middle and lower stratosphere. These results are clearly due to statistical fluctuations in the small co-location sample. The ACE-FTS record also performs well in terms of short-term variability. The comparison spread is at most 7 % (10 %) at high latitudes in the middle (upper) stratosphere. It increases strongly below 20 km.

MAESTRO
The MAESTRO profiles exhibit typically a negative bias in the Northern Hemisphere (3 to 6 %) and a positive bias in the Southern Hemisphere (0 to 10 %). Ozone is clearly underestimated below 15 km, by at least 20 % at all latitudes. These findings confirm those by other authors (Dupuy et al., 2009). Earlier reports of a negative bias of up to 20 % between sunrise and sunset measurements above 35 km cannot be confirmed or excluded. Our lidar-based bias results for sunset and sunrise data differ by less than 5 % in the upper stratosphere (Fig. S19). Nonetheless, the comparison sample is quite small so it does not necessarily provide a representative picture. The observed comparison spreads range from 7 % at mid-latitudes to 10 % in the Arctic.

[Figure: bias b [%], axis range −15 to +15, with panels for the (altitude, number density), (altitude, VMR), (pressure, number density) and (pressure, VMR) representations.]
6 Impact of auxiliary data on non-native representations

So far, we have considered the quality of the satellite records in their native profile representation. But a user may actually desire another representation depending on his/her application (e.g. model comparisons, merging or assimilation of different records). In this case coincident altitude, pressure and/or temperature profile data are necessary for the conversion between ozone VMR and number density or between altitude and pressure. Users may prefer measurements, climatologies or reanalysis fields, all of which bring along uncertainties (e.g. Thorne et al., 2011; Seidel et al., 2011; Stauffer et al., 2014; Simmons et al., 2014). These ultimately add uncertainty $S_{\mathrm{auxiliary}}$ to the transformed ozone profile, which may have structure in space (altitude, latitude) and time (short- and long-term). Moreover, the currently observed negative trend of 1 K decade −1 in upper stratospheric temperature data already leads to representation-dependent differences of up to 1 % decade −1 in the ozone trends (McLinden and Fioletov, 2011). It is important to realise that a drift in temperature data will, in a similar fashion, introduce extra (altitude-dependent) drift in non-native ozone representations. Here, we consider the auxiliary data provided in the satellite data files, see Table 3. The auxiliary profiles for ground-based data are taken either from actual measurements (ozonesonde: interfaced radiosonde), or from reanalysis fields (lidar: ERA-Interim).
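The role of the auxiliary data in the conversion between ozone VMR and number density can be made concrete with the ideal gas law. The sketch below is illustrative only: the function names and the sample values (3 ppmv at 1 hPa and 250 K) are assumptions, and it treats the conversion at fixed pressure without the additional altitude-pressure transformation.

```python
BOLTZMANN = 1.380649e-23  # J K^-1, exact in the 2019 SI


def vmr_to_number_density(vmr, pressure_pa, temperature_k):
    """Ozone number density (molecules m^-3) from VMR (mol/mol), p and T."""
    return vmr * pressure_pa / (BOLTZMANN * temperature_k)


def number_density_to_vmr(number_density, pressure_pa, temperature_k):
    """Inverse conversion: VMR (mol/mol) from number density, p and T."""
    return number_density * BOLTZMANN * temperature_k / pressure_pa


# Hypothetical upper-stratospheric sample: 3 ppmv ozone at 1 hPa and 250 K.
vmr, p, temp = 3.0e-6, 100.0, 250.0
n = vmr_to_number_density(vmr, p, temp)

# Effect of a +1 K error in the auxiliary temperature at fixed pressure.
n_warm = vmr_to_number_density(vmr, p, temp + 1.0)
rel_change_percent = 100.0 * (n_warm / n - 1.0)
print(f"n = {n:.3e} m^-3, change for +1 K: {rel_change_percent:.2f} %")
```

In this simplified setting a 1 K temperature error changes the converted number density by roughly 0.4 % at 250 K, so a temperature drift maps directly onto a drift of the converted ozone quantity. Any altitude-pressure conversion adds its own sensitivity, which is not modelled here.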
At altitudes below about 35 km (∼ 5 hPa) there is generally no clear change in bias or comparison spread (both < 1 %) and drift (< 1 % decade −1 ) after the conversion to another representation. There is, therefore, no considerable difference in bias, short-term variability or long-term stability of the auxiliary data for most satellite and ground-based profiles. Examples are shown for SAGE II bias (Fig. 8) and HALOE and OSIRIS drift (Fig. 9). Complete information for all sounders can be found in the Supplement (Figs. S5-S18). Observations of upper stratospheric temperature are generally less consistent (Simmons et al., 2014), so it is not surprising to find considerable changes in ozone bias (up to ∼ 5 %) or drift (up to ∼ 5 % decade −1 ) around 45 km (∼ 1 hPa).
For a few records we find clear indications that the accompanying auxiliary data have a more important impact than the numbers stated before. MIPAS bias changes by about 3 % when switching between VMR and number density, except between 25 and 30 km (∼ 10-20 hPa), see Fig. 8; the change is seen in the tropics and is slightly smaller in the polar regions. Interestingly, transforming the vertical coordinate does not influence the ozone bias even though there is a ∼ 200 m negative bias in MIPAS altitude. This indicates that the averaging kernel smooths out the effect of altitude offsets. The observed dependence on ozone quantity is not caused by the conversion procedure of the averaging kernel, since a similar dependence is seen in MIPAS comparisons to non-smoothed correlative data (not shown here). The SCIAMACHY bias depends on both vertical coordinate and ozone quantity over the entire stratosphere, by 3-5 % (Fig. 8 and Fig. S16), likely as a result of uncertainties in the McLinden p/T climatology. Both MLS records on the other hand exhibit clear representation dependences of the drift (Fig. 9 and Figs. S8-S9), and to a lesser extent also of the bias (although the ozonesonde and lidar results are somewhat discrepant, Fig. 8). The Aura MLS drift changes by about 3 % decade −1 , similar to earlier reports. The dependence has the opposite sign for UARS MLS and is more pronounced, up to ∼ 10 % decade −1 . These observations are consistent with the known drifts in absolute pointing of the MLS records. Whereas UARS MLS geopotential height profiles drift upwards, by ∼ 1000 m decade −1 (Livesey et al., 2003), the Aura MLS v3.3/v2.2 GPH data drift downwards by ∼ 120 m decade −1 , especially between 2005 and 2009. Obviously, if a more stable and less biased source of auxiliary data were used for the conversion, the reported issues for MIPAS, SCIAMACHY and the MLS records could be easily avoided.
Our results suggest that radiosonde data, reanalysis fields by ERA-Interim (lidar) and MERRA (SAGE II), and ECMWF operational data (SMR, OSIRIS, GOMOS) allow for consistent conversions.

Consistency between satellite records
Until now we discussed each satellite Level-2 data set individually. Here, we take advantage of the specific design of the analysis to compare the satellite records directly. What follows is an evaluation of their mutual consistency in terms of bias (Fig. 10), short-term variability (Fig. 11), decadal stability (Fig. 12) and auxiliary data. In the context of the SI2N initiative an extensive literature review was performed of the ground-based validation and satellite intercomparison studies. We refer the reader to this work for a more in-depth discussion of the global picture that emerges from the different studies.

Figure 10 shows a superimposed view of the vertical structure of satellite bias in five latitude bands, in the native representation of each satellite record. The smallest biases and best mutual consistency are found between 20 and 40 km (∼ 2-50 hPa). Here, satellite and ground-based measurements mostly agree within 5 % or better (grey shaded area). Furthermore, the inter-satellite bias is not more than about 5 %. This illustrates the excellent consistency of all satellite and ground-based records in this part of the atmosphere. The consistency appears slightly poorer in the uppermost stratosphere (above 40 km/∼ 2 hPa), perhaps due to the lower ozone abundances or due to larger systematic uncertainties in the lidar measurements. In the lower stratosphere and below the tropopause there is a clear degradation of the percentage bias and consistency, due to declining ozone levels and increasing interference by clouds and aerosols (Wang et al., 1999, 2002; Randall et al., 2003). The bias relative to sondes easily reaches 15 % and more, and the inter-satellite biases can be more than twice as large. Exceptions to this general picture are POAM II (dashed green) and especially SCIAMACHY (solid yellow). POAM II ozone is systematically low by about 5-10 % in the middle stratosphere, except in the Arctic.
The SCIAMACHY bias reaches 10 % and more over a large part of the stratosphere, with a peculiar meridional structure, and a seasonal dependence that is very pronounced in the Arctic (Fig. S20). Section 5 presented noteworthy bias features also for other records, but of smaller magnitude and at smaller atmospheric scales: SMR (crosses the −5 % threshold above 35 km or ∼ 5 hPa), Aura MLS (distinctive vertical oscillations in the UTLS), MIPAS (persistent positive bias of ∼ 5 %), OSIRIS (sudden anomaly in bias around 22 km) and GOMOS (larger negative bias in Arctic). Some bias features in Fig. 10 are common to the satellite measurements and, hence, possibly relate to the ground-based data quality. Perhaps the most striking, and not understood at the moment, is that the Arctic middle stratospheric bias is negative for most satellite records, relative to both sonde (eight stations) and lidar (three sites). This may indicate that the Arctic ground-based ozone values are too high, although co-location mismatch uncertainties could play an important role too in the proximity of the edge of the polar vortex. Secondly, there is a systematic positive upper stratospheric bias at tropical and southern mid-latitudes, possibly caused by a small negative bias of the dominating lidar record (Mauna Loa and Lauder). These ∼ 3 % biases remain within the systematic uncertainty due to uncertainties on the absorption cross-sections used for the lidar retrievals. Thirdly, a ∼ 10-15 % negative bias seems present in the early Dumont d'Urville lidar record (1991-1998) (Godin et al., 2001). Indeed, all satellite records that started before 1999 (dashed lines) are biased high with similar magnitude in the Antarctic middle and upper stratosphere, while that is not the case for the more recent comparisons at this lidar site (2008-2013, solid).
And finally, we systematically note a curved vertical structure of the bias relative to ozonesondes: sonde ozone values are decreasing by up to 5 % between 25 km and the top of the profile. This may be related to an incomplete sonde correction scheme for the decrease in pump performance or for the increase in vertical registration error due to biases in the pressure readings (Stauffer et al., 2014). Apart from these differences, the ozonesonde and lidar results are highly consistent, highlighting the suitability of these ground-based networks as a transfer standard.

Figure 11. Similar as Fig. 10, but for the comparison spread s. The spread is mostly between 5 and 12 % (grey shaded area) in the middle and upper stratosphere.

Comparison spread
The comparison spread results in Fig. 11 are more straightforward than those of the bias. There is a consistent dependence on latitude and altitude for all records. Between 20 and 35/40 km (∼ 50-2/5 hPa) the spread ranges between 5 and 12 % (grey shaded area) and increases slightly from the tropics towards the poles, qualitatively consistent with a larger co-location mismatch uncertainty due to higher natural variability at high latitudes. Above 35-40 km (∼ 2-5 hPa) ozone levels decrease and the precision of the lidar measurements degrades, leading to a ∼ 5 % and more increase in comparison spread. Below 20 km (∼ 50 hPa) the spread increases rapidly, easily more than 40 % at the tropopause, due to the higher natural variability. But here the lower signal-to-noise ratio (clouds, aerosols) plays a role as well and differences in comparison spread between records become obvious. GOMOS and UARS MLS appear less sensitive to ozone in the lower stratosphere. The most precise measurements over the entire stratosphere, on the other hand, are made by ACE-FTS, Aura MLS and MIPAS, although the comparison spread results for the latter two records may include a smaller co-location mismatch component due to the tighter time window (6 h instead of 12 h). SCIAMACHY and SMR are clearly different. The single-profile variability in the SMR comparisons is more elevated over the entire stratosphere (20-30 %). For SCIAMACHY this is seen (Fig. S16) in the upper stratosphere (10-15 %), and particularly in the Arctic (25-40 %) where a clear anomaly is discerned around 25 km, together with a very strong seasonal dependence (10 % in boreal summer, more than 30 % in winter). During the Antarctic ozone hole season, the extremely low ozone conditions inflate the comparison spread of all records to 40 % or more around 20 km (not shown here in detail). The low signal-to-noise ratios thus pose a real challenge for all limb and occultation sounders.
Figure 12 presents a superimposed view of the vertical structure of the ground-network averaged decadal stability of all satellite records, in their native representation. The drift relative to ground observations is generally not significant and less than 5 % decade −1 in the middle and upper stratosphere, for some records even better than 3 % decade −1 over a large part of the stratosphere. The relative drift between satellite records can be twice as large, however. A few records deviate from this general tendency, either seemingly so because of large drift uncertainty (UARS MLS, SAGE III, POAM II), or because of the presence of a significant drift (HALOE between 20 and 30 km, SCIAMACHY between 32 and 42 km, OSIRIS between 36 and 44 km). The GOMOS (below 25 km) and SMR (above 35 km) records may also drift, although the results are close to the detection threshold for these instruments. Another peculiarity is the possible presence of a common, weak vertical dependence of the drifts in the middle stratosphere. These tend to become gradually more positive with increasing altitude, by 1-2 % decade −1 between 20 and 30 km (see SAGE II, Aura MLS, OSIRIS, GOMOS). This unexplained feature is observed independently of satellite record, or of type of correlative instrument, and deserves further study.

Impact of auxiliary data
Satellite ozone profile data quality is generally not affected by the conversion to another representation with the help of the accompanying pressure and temperature profiles. Bias, spread or decadal stability typically change, respectively, by less than 1 %, 1 % or 1 % decade −1 in the lower and middle stratosphere, and somewhat more in the upper stratosphere. This demonstrates the good mutual consistency of the meteorological data by ozonesonde, MERRA and ERA-Interim. The exceptions are MIPAS and SCIAMACHY (3-5 % change in bias) and the Aura MLS and UARS MLS records (respectively 3 and 10 % decade −1 change in drift). Obviously, the introduction of these artificial effects can be avoided by using less biased or more stable sources of auxiliary data.

Discussion
The patterns in bias, short-term variability and decadal stability of the Level-2 ozone profile records identified in the preceding sections will affect higher-level products if not properly accounted for. Many studies within the community are based on gridded Level-3 data (e.g. monthly zonal means from single or a combination of instruments) or assimilated Level-4 fields. In this section we discuss the relevance of our Level-2 assessment for the construction and analysis of such derived records, and focus in particular on implications for recent ozone profile trend assessments.

Can end-user requirements be verified?
We start the discussion by reflecting on the requirements of end users. Naturally, these depend on the envisaged application, so various sets of requirements have been drafted by the community. We focus here on climate applications which rely on stable data sets spanning multiple decades on a global scale. The Global Climate Observing System (GCOS), for instance, requests an accuracy (Joint Committee for Guides in Metrology, 2012) better than 10 % in the UTLS and 5-20 % above, and a stability better than 1 % decade −1 (GCOS, 2011). Within ESA's Climate Change Initiative program (Ozone_cci) similar requirements were set for accuracy (< 8-15 %) and somewhat looser targets for stability (< 1-3 % decade −1 ) (van der A et al., 2011). In practice, the accuracy and stability of a particular record can of course only be tested to a level determined by the accuracy and stability of the reference data and by constraints from the metrology of the comparison. From Figs. 10 and 11 we conclude that ground-based studies are indeed able to verify an accuracy of 5-10 %, and resolve altitude-latitude-season patterns, in the middle and upper stratosphere. This is much more challenging in the UTLS, where uncertainties in the metrology of the comparison become important due to increased natural variability and imperfect co-locations or differences in smoothing. Model data can help to reduce these; for example, Verhoelst et al. (2015) showed recently that MACC (IFS-MOZART) and MERRA reanalysis fields allowed them to close the error budget for total ozone column validation studies. However, further work is needed in the context of vertical profile validation.

Table 6. Decadal stability of merged ozone profile records (Level-3) estimated from our ground-based assessment of the stability of the contributing Level-2 records, in two time periods and three layers of the stratosphere. These drift values could serve as 1 σ systematic uncertainty in trend studies. Nevertheless, in expectation of more rigorous analyses, the estimates below should be considered with care, as they may overestimate the actual drift.

It is even more challenging to verify the GCOS requirements for stability. Figure 12 (right panels) shows that the verification of a 1 % decade −1 target with 95 % confidence is possible for just a few records (SAGE II, Aura MLS) and only in the middle stratosphere. In general, the analysis is not sensitive to network-averaged drifts below 2-3 % decade −1 in the middle stratosphere. In the upper and lower stratosphere, focus regions for current trend studies, the 2 σ uncertainty on the drift is 3-4 % decade −1 or worse. In addition, a ground-based assessment of the meridional structure of satellite drift is currently infeasible. This is due to a lack of stations (with a long data record) in certain latitude bands and the considerable observed scatter in the single-station drift estimates. However, there is some room for improvement. The best sampled comparison time series yield 1 σ drift uncertainties as low as 0.7 % decade −1 at individual sites. But the dominant contribution to the network-averaged drift uncertainty of some recent satellite records comes from the scatter in the drift estimates across individual sites (Fig. 3). More homogeneity across the network will surely be beneficial, and this is one of the aims of the Ozonesonde Data Quality Assessment initiative (O3S-DQA). New correction schemes are being developed for the few percent biases introduced by (station- and time-dependent) changes in instrumental and post-processing set-ups, which may, ultimately, lead to more homogeneous sonde time series in time and space (Smit et al., 2012; Tarasick et al., 2016; Van Malderen et al., 2016). When successful, this may perhaps also allow an exploration of meridional drift structure.
Longer time series will also help, but not to the full extent of what is actually desired. And finally, with the help of current models part of the comparison spread could be removed statistically, which should, at least in the UTLS, lead to reduced drift uncertainties. Nevertheless, we consider it improbable that in the next few years sufficient progress can be made to demonstrate that single satellite records are stable within 1 % decade −1 relative to ground-based network observations. At the moment, 2-3 % decade −1 seems a more realistic target.
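The role of station-to-station scatter in the network-averaged drift uncertainty can be illustrated with a minimal sketch. It is not the estimator used in this work: it assumes an inverse-variance weighted mean whose formal uncertainty is inflated by the square root of the reduced chi-square whenever the scatter across sites exceeds what the single-station uncertainties predict. The function names and the station values are hypothetical.

```python
import math


def network_average_drift(drifts, sigmas):
    """Combine single-station drifts (% per decade) into a network average.

    Returns the inverse-variance weighted mean and a 1-sigma uncertainty
    that is inflated when the scatter between stations is larger than
    expected from the single-station uncertainties (assumed approach).
    """
    if len(drifts) != len(sigmas) or len(drifts) < 2:
        raise ValueError("need at least two stations with matching inputs")
    weights = [1.0 / s ** 2 for s in sigmas]
    wsum = sum(weights)
    mean = sum(w * d for w, d in zip(weights, drifts)) / wsum
    # Formal uncertainty if the stations only differed by random noise.
    formal_sigma = math.sqrt(1.0 / wsum)
    # Reduced chi-square: observed scatter relative to the stated uncertainties.
    chi2_red = sum(w * (d - mean) ** 2 for w, d in zip(weights, drifts))
    chi2_red /= len(drifts) - 1
    # Never shrink the uncertainty below the formal value.
    sigma = formal_sigma * math.sqrt(max(1.0, chi2_red))
    return mean, sigma


def is_detectable(drift, sigma, k=2.0):
    """True if the drift exceeds k times its 1-sigma uncertainty."""
    return abs(drift) > k * sigma


# Hypothetical single-station drift estimates (% per decade).
station_drifts = [0.0, 4.0]
station_sigmas = [1.0, 1.0]
mean, sigma = network_average_drift(station_drifts, station_sigmas)
print(mean, sigma, is_detectable(mean, sigma))
```

In this toy case the two sites disagree by far more than their individual uncertainties, so the scatter term, rather than the single-station precision, sets the uncertainty of the network average.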

Implications for merging schemes
Space-based instruments are rarely operational for much more than a decade. Various groups have therefore produced multi-decade data sets from a series of individual records. The longest record, spanning 42 years, is based on measurements by nine SBUV nadir-viewing instruments and was validated by Kramarova et al. (2013). Merged records based on limb/occultation instruments include SAGE-GOMOS, SAGE-OSIRIS (Sioris et al., 2014), GOZCARDS, SWOOSH (Davis et al., 2016) and Ozone_cci, all listed in Table 6. These Level-3 data are typically reported as monthly averaged ozone over 5-10° latitude bins. A recent intercomparison by Tummon et al. (2015) showed that the differences between the merged limb/occultation data sets are dominated by the differences between the underlying data sets and to a lesser extent by differences between the merging algorithms. This shows the importance of a detailed understanding of the consistency between the Level-2 records in order to understand the merged product. In addition, comprehensive intercomparison studies (such as Jones et al. (2009), Rahpoe et al. (2015) and this work) can guide the design of the merging algorithms so as to reduce the impact of unfavourable Level-2 characteristics.
Although it is well known that the bias correction scheme should be altitude-latitude dependent, further improvements could be made. The inclusion of a diurnal and seasonal component may be pertinent, as we found sunrise-sunset bias differences for a few solar occultation instruments and a pronounced seasonal dependence of the bias and short-term variability of, e.g., Arctic SCIAMACHY data. We also reported that the single Level-2 profile noise of SMR and SCIAMACHY is considerably higher than that of other records. Averaged profiles will be sufficiently precise over large bins (monthly, 5° latitude) since both instruments are dense samplers, but this may not be the case at finer spatiotemporal resolutions. Our assessment of stability furthermore demonstrates the potential of drift correction schemes, especially when HALOE, OSIRIS or SCIAMACHY data are involved (and likely GOMOS and SMR as well). Eckert et al. (2014) have recently explored this approach, by correcting MIPAS trends for a drift relative to Aura MLS. In practice, however, the drift estimate between two satellite records is not sufficiently well constrained, especially for a short overlap period, which makes it very challenging to obtain robust corrections. Finally, the impact of the auxiliary data should not be forgotten, since profile representation conversions are typically required. We observed considerable changes in bias (MIPAS, SCIAMACHY and, to a lesser extent, UARS/Aura MLS) and stability (UARS/Aura MLS) due to the auxiliary data provided along with the ozone data sets. The use of a common source of stable auxiliary profiles eliminates additional discrepancies between the contributing records. Our results suggest that ECMWF (operational and ERA-Interim) and MERRA fields impact ozone trends in a consistent way over the entire stratosphere.
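To make the dependence on auxiliary data concrete, the sketch below converts between volume mixing ratio and number density with the ideal-gas relation n = x p / (k_B T). It is an illustration only and not the conversion code of any of the data sets discussed; the function names and the sample values (10 hPa, 230 K, 8 ppmv) are assumptions chosen for the example.

```python
K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def vmr_to_number_density(vmr, pressure_hpa, temperature_k):
    """Ozone number density (molecules per cm^3) from volume mixing ratio.

    Uses the ideal-gas law for the air number density; pressure and
    temperature are the auxiliary data needed for the conversion.
    """
    pressure_pa = pressure_hpa * 100.0
    air_density_m3 = pressure_pa / (K_B * temperature_k)
    return vmr * air_density_m3 * 1.0e-6  # m^-3 to cm^-3


def number_density_to_vmr(density_cm3, pressure_hpa, temperature_k):
    """Inverse conversion: volume mixing ratio from number density."""
    pressure_pa = pressure_hpa * 100.0
    air_density_m3 = pressure_pa / (K_B * temperature_k)
    return density_cm3 * 1.0e6 / air_density_m3


# Hypothetical mid-stratospheric values.
vmr = 8.0e-6           # 8 ppmv
pressure = 10.0        # hPa
temperature = 230.0    # K

density = vmr_to_number_density(vmr, pressure, temperature)
# The same profile converted with a temperature that is 1 % too high.
density_biased = vmr_to_number_density(vmr, pressure, temperature * 1.01)
relative_change_percent = 100.0 * (density_biased / density - 1.0)
print(density, relative_change_percent)
```

Because the converted quantity scales with p/T, a relative error in the auxiliary temperature or pressure maps almost one-to-one onto the ozone value in the non-native representation, and a slow change in the auxiliary data would appear as an apparent ozone drift.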

Are observed trend differences due to drift?
Recently, a number of regression analyses were carried out on gridded ozone profile data from a variety of limb and occultation instruments. A few studies considered single records (Eckert et al., 2014; Gebhardt et al., 2014), others a combination of two (Laine et al., 2014; Bourassa et al., 2014; Sioris et al., 2014) or more data sets (Tummon et al., 2015; WMO, 2014; Harris et al., 2015). The resulting profile trends are generally in reasonable agreement, but notable differences are observed in some parts of the stratosphere. The SCIAMACHY data set retrieved by the IUP Bremen Level-2 processor, for instance, suggests a 2004-2012 trend in the tropics around 35 km that is 4-6 % decade⁻¹ more negative than OSIRIS and Aura MLS data (Gebhardt et al., 2014). A combined SAGE-OSIRIS record, on the other hand, produces more positive post-1998 trends in the uppermost stratosphere, by 3-4 % decade⁻¹ at mid northern latitudes (Tummon et al., 2015; Harris et al., 2015). Two records that combine SAGE and GOMOS data lead to considerably more negative trends than other data sets in the lower stratosphere (Tummon et al., 2015; Harris et al., 2015).
Many of the ozone trend differences cannot be explained by statistical uncertainty. Our ground-based assessment of decadal stability suggests that these may be interpreted, at least for the better part, in terms of instrumental drift. Indeed, we noted a +8 % decade⁻¹ drift above 40 km for OSIRIS and a −9 % decade⁻¹ drift for SCIAMACHY around 35 km. Additionally, we found indications of a −5 % decade⁻¹ drift of GOMOS below 20 km. These quite successful interpretations of some recent ozone trend differences build additional confidence in our single-instrument drift estimates, which could therefore be employed as 1σ systematic uncertainty for long-term trend results for the corresponding records.
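How a drift estimate might enter a trend comparison as a 1σ systematic uncertainty is sketched below. The sketch assumes that statistical and systematic contributions are independent and can be added in quadrature, which is a common convention rather than a procedure prescribed in this work. The function names and all numerical values are hypothetical.

```python
import math


def total_trend_uncertainty(stat_sigma, drift_sigma):
    """Quadrature sum of statistical and systematic (drift) 1-sigma terms."""
    return math.hypot(stat_sigma, drift_sigma)


def trends_differ(trend_a, sigma_a, trend_b, sigma_b, k=2.0):
    """True if two trends differ by more than k times the combined sigma."""
    combined = math.hypot(sigma_a, sigma_b)
    return abs(trend_a - trend_b) > k * combined


# Hypothetical trends and uncertainties (% per decade).
trend_a, stat_a = -5.0, 1.0
trend_b, stat_b = 0.0, 1.0
drift_sigma_a = 4.0  # assumed 1-sigma drift uncertainty for record A only

# Statistical uncertainties only.
print(trends_differ(trend_a, stat_a, trend_b, stat_b))

# Drift of record A included as a systematic term.
sigma_a = total_trend_uncertainty(stat_a, drift_sigma_a)
print(trends_differ(trend_a, sigma_a, trend_b, stat_b))
```

With these illustrative numbers the two trends disagree when only statistical uncertainty is considered, and become mutually consistent once the drift term is included.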
No studies have been performed so far of the decadal stability of the merged data sets. Yet, there is also a clear need for realistic drift estimates for such data sets. We therefore make a first attempt to provide these for the merged records used by the recent WMO and SI2N assessments (WMO, 2014; Harris et al., 2015). Table 6 presents drift estimates for three stratospheric layers and for two time periods typically differentiated in trend analyses. Before 1997 all merged records rely on SAGE II observations, which are stable to within 1-1.5 % decade⁻¹ depending on altitude. Since GOZCARDS and SWOOSH include HALOE data, the drift is possibly somewhat larger in the middle stratosphere. Producing post-1998 estimates is a more intricate problem, due to the increasing number of contributing instruments, and due to the fact that none of these cover the entire period. We are inclined toward a conservative approach, giving figures that should be considered upper limits to the actual drift. The SAGE-GOMOS record will be impacted by negative GOMOS drifts in the lower stratosphere. The SAGE-OSIRIS trends should be considered more uncertain in the upper stratosphere due to drifting OSIRIS data. Records that use Aura MLS as backbone (GOZCARDS, SWOOSH) should not be more unstable than about 2 % decade⁻¹ in the stratosphere. A merged Ozone_cci data set is likely also prone to larger uncertainty in the upper stratosphere (drifting OSIRIS, SCIAMACHY, SMR) and to some extent in the lower stratosphere as well (GOMOS). We stress that a more rigorous assessment is needed, since the estimates in Table 6 may well overestimate the actual drift. This work is currently on-going, following an approach similar to that by, e.g. Mears et al. (2011) and.

Conclusions
Ground-based network observations by ozonesonde and stratospheric lidar instruments allowed us to assess the quality of 14 records of the vertical distribution of ozone, collected by limb and occultation instruments over the past three decades. We considered three aspects of satellite data quality: the stability at decadal time scale (or drift), the overall bias, and the short-term variability. Further investigation of the vertical, meridional and seasonal structure of these parameters, together with their dependence on auxiliary data, revealed common and distinguishing features between satellite instruments. Such a comprehensive analysis serves two main objectives. First, to verify whether the spatiotemporal patterns of atmospheric ozone are correctly reproduced by the individual instruments at different scales. Second, to assess the consistency between satellite records, which is vital for their synergistic exploitation, a topic that has received increased interest in recent years.
We start our concluding remarks by distilling the general tendencies, saving some prominent exceptions for the following paragraph. Typically, we found a satellite bias better than ±5 % between 20 and 40 km (∼ 2-50 hPa), increasing slowly towards the stratopause (±10 %) and quite rapidly towards the tropopause (±15 % and more). A similar vertical dependence was observed for the comparison spread. It generally ranges from 5 to 12 % between 20 and 40 km and increases towards the stratopause (15-20 %) and tropopause (40 % and more). The precision of the records is actually better than suggested by the observed spread in the comparisons, since the latter also includes the precision of the ground-based record and, especially in the UTLS, the random uncertainties due to differences in co-location and horizontal smoothing. Nevertheless, the altitude at which the quality of UTLS observations starts to degrade rapidly is clearly not only determined by the tropopause. It also depends on the measurement technique and instrument (e.g. UTLS observations of UV-visible star occultations being less sensitive than those of infrared emissions at the limb). There were furthermore no evident signs of seasonal patterns, except for the Arctic SCIAMACHY data, which exhibit a 10 % increase in bias and spread in boreal winter and a 10 % decrease in bias and spread in boreal summer. We found no significant drifts at decadal time scales; most records are stable within about ±5 % decade⁻¹ in the middle and upper stratosphere and, for some records, even within ±3 % decade⁻¹ (SAGE II, Aura MLS and MIPAS). However, the drift uncertainty should not be neglected, as our analysis is typically not sensitive (at 2σ) to drifts smaller than 2-3 % decade⁻¹ in the middle stratosphere and 3-4 % decade⁻¹ at lower and higher altitudes.
The pressure and/or temperature data that accompany the satellite ozone data sets are generally well suited for the conversion between ozone quantities or vertical coordinates. Bias, spread and drift in non-native ozone profile representations differ by no more than about 1 %, 1 % and 1 % decade⁻¹, respectively, and by somewhat more in the uppermost stratosphere.
There are of course exceptions to these general observations. We noted more pronounced biases (∼ 10 %) over much of the stratosphere for POAM II and SCIAMACHY; the latter also exhibits a clear hemispheric asymmetry. Two records show markedly poorer single-profile precision: SMR (entire atmosphere) and SCIAMACHY (upper stratosphere and Arctic). And three records drift significantly: HALOE in the middle stratosphere (−5 % decade⁻¹) and, in the upper stratosphere, SCIAMACHY (−9 % decade⁻¹) and OSIRIS (+8 % decade⁻¹). There are also indications of a drift of −5 % decade⁻¹ or more in the lower stratosphere for GOMOS, and in the upper stratosphere for SMR. Further confirmation is needed, however, for the latter two data sets. In the meantime, we advise caution when using GOMOS and SMR measurements at these altitudes. Finally, we observed for a few records a considerable impact of the accompanying auxiliary data (e.g. GPH retrievals) on ozone quality in non-native profile representations. The ozone bias changes by 3-5 % for MIPAS and SCIAMACHY; both MLS records (UARS and Aura) show a dependence of the drift (by 3 % decade⁻¹ or more) on vertical coordinate and/or ozone quantity, and perhaps of the overall bias as well. We stress that these representation-dependent quality issues are unrelated to the satellite ozone retrievals themselves, and can be avoided by using another, external source of auxiliary information for any necessary conversions.
Overall, the observing system of limb and occultation instruments produces ozone profiles that meet the ∼ 10-15 % accuracy requirements of climate users, most certainly over 20-40 km, and perhaps also in the lower stratosphere. However, it remains unclear whether the current Level-2 records comply with the 1-3 % decade⁻¹ target on decadal stability. The combination of different data sets has received widespread interest in recent years, but also poses several challenges. Our results show that the merging schemes should be sufficiently refined to temper additional artefacts in the Level-3 data sets. Even then, the characteristics of merged records remain mostly defined by those of their contributors (Tummon et al., 2015). Multi-instrument comparison studies are therefore crucial to establish observational evidence. Indeed, we could relate the most notable differences between recent ozone profile trend studies to instrumental drift (WMO, 2014; Harris et al., 2015). This led us to a conservative estimate of the decadal stability of several merged records, which, until more rigorous analyses are performed, provides essential information for the recent trend assessments by WMO and SI2N.
Covering most limb and occultation ozone profilers of the past three decades, the ground-based networks of sonde and lidar instruments, and all major data quality indicators, this assessment is arguably the most comprehensive ground-based analysis so far. While bias and short-term variability of satellite records are well documented in the literature, this is much less the case for their long-term stability, the impact of auxiliary data and their mutual consistency. We therefore believe that this work will contribute to an improved interpretation of observation-based studies of the long-term evolution of ozone and its link to climate change. However, our results represent a snapshot of the current versions of the data sets. In the near future, improved (and for some instruments longer) ozone profile time series will be released by the satellite teams and by the ground-based observers. Their efforts may lead to more stable records, which, in turn, would increase the sensitivity to even smaller drifts. In addition, the inclusion of microwave radiometer measurements and model data should help to evaluate the stability in the mesosphere and improve current estimates in the UTLS, especially in the tropics.
The Supplement related to this article is available online at doi:10.5194/amt-9-2497-2016-supplement.
contribution via BIRA-IASB. Sweden's Odin satellite carries the atmospheric and astronomical missions OSIRIS and SMR, developed and funded jointly by the space agencies of Sweden, Canada, Finland and France. This work is dedicated to our much appreciated colleague J. Urban, who regrettably passed away.