Quality assessment of the Ozone_cci Climate Research Data Package (release 2017) - Part 1: Ground-based validation of total ozone column data products


Abstract. The GOME-type Total Ozone Essential Climate Variable (GTO-ECV) is a level-3 data record, which combines individual sensor products into one single cohesive record covering the 22-year period from 1995 to 2016, generated in the frame of the European Space Agency's Climate Change Initiative Phase II. It is based on level-2 total ozone data produced by the GODFIT (GOME-type Direct FITting) v4 algorithm as applied to the GOME/ERS-2, OMI/Aura, SCIAMACHY/Envisat and GOME-2/Metop-A and Metop-B observations. In this paper we examine whether GTO-ECV meets the specific requirements set by the international climate-chemistry modelling community for decadal stability and for long-term and short-term accuracy. In the following, we present the validation of the 2017 release of the Climate Research Data Package Total Ozone Column (CRDP TOC) at both level 2 and level 3. The individual level-2 data sets are mutually consistent, with mean differences generally within 0.5 % at moderate latitudes (±50°), whereas the level-3 data sets show mean differences with respect to the OMI reference data record that span between −0.2 ± 0.9 % (for GOME-2B) and 1.0 ± 1.4 % (for SCIAMACHY). Very similar findings are reported for the level-2 validation against independent ground-based TOC observations reported by Brewer, Dobson and SAOZ instruments: the mean bias between GODFIT v4 satellite TOC and the ground instrument is well within 1.0 ± 1.0 % for all sensors, the drift per decade spans between −0.5 % and 1.0 ± 1.0 % depending on the sensor, and the peak-to-peak seasonality of the differences ranges from ∼ 1 % for GOME and OMI to ∼ 2 % for SCIAMACHY. For the level-3 validation, our first goal was to show that the level-3 CRDP produces findings consistent with the level-2 individual sensor comparisons. We show a very good agreement with 0.5 to 2 % peak-to-peak amplitude for the monthly mean difference time series and a negligible drift per decade of the differences in the Northern Hemisphere of −0.11 ± 0.10 % per decade for Dobson and +0.22 ± 0.08 % per decade for Brewer collocations. The temporal stability of the level-3 GTO-ECV v3 TOC record satisfies the requirements for total ozone measurement decadal stability of 1-3 % and the short-term and long-term accuracy requirements of 2 and 3 %, respectively. The record shows a remarkable inter-sensor consistency, both in the level-2 GODFIT v4 and in the level-3 GTO-ECV v3 data sets, and can thus be used for longer-term analysis of the ozone layer, such as decadal trend studies, chemistry-climate model evaluation and data assimilation applications.

Introduction
The European Space Agency's Climate Change Initiative (ESA-CCI) phases I and II focused on building consolidated climate-relevant ozone data sets as essential climate variables (ECVs). During Phase I, the Ozone_cci mostly concentrated on developing and demonstrating improved algorithms and methods, with the aim to define new baselines for the generation of consistent, state-of-the-art and fully characterized long-term ozone data products derived from a complete suite of European nadir and limb-type sensors. For the first time, Earth observation science teams consisting of leading experts from European ozone sensing communities were gathered in a single project working towards common objectives defined against requirements formulated by the scientific user community. This resulted in new synergies, exchanges of ideas and overall significant progress in terms of data harmonization and understanding of quality issues at level 1, level 2 and level 3. Three lines of multi-sensor ozone data products were hence developed: (i) total ozone columns (TOCs) from ultraviolet (UV) nadir instruments, (ii) low-resolution ozone profiles from nadir sensors and (iii) stratospheric and upper-tropospheric ozone profiles from limb and occultation types of sensors. During Phase II, existing state-of-the-art ozone retrieval algorithms were further developed and applied to long time series of observations from all relevant ESA atmospheric chemistry sensors, with the aim to generate well-characterized and validated ozone data products that meet as closely as possible the requirements formulated by the Global Climate Observing System (GCOS) and the Climate Modelling User Group (CMUG) climate modelling community for ozone column and profile ECVs. The most important user requirements were identified as (i) homogenized multi-decadal records, (ii) records with good vertical resolution in the (lower) stratosphere and (iii) records with good horizontal resolution in the troposphere, the main gap being the lack of multi-decadal high-vertical resolution ozone profile data sets that cover the full ozone depletion time period (1980-present) and provide a potential to cover the upcoming ozone recovery time period.
This work addresses the first of these requirements, the level-2 and level-3 homogenized multi-decadal total ozone Climate Research Data Package (CRDP), with upcoming companion papers expanding on the limb (TBA) and nadir (Keppens et al., 2018) ozone profile CRDPs. With respect to total ozone, 22 years of harmonized level-2 data records from GOME/ERS-2 (Global Ozone Monitoring Experiment instrument on board the second European Remote Sensing satellite), OMI/Aura (Ozone Monitoring Instrument on board Aura satellite), SCIAMACHY/Envisat (Scanning Imaging Absorption Spectrometer for Atmospheric Cartography on board Envisat) and GOME-2/Metop-A and Metop-B (Global Ozone Monitoring Experiment-2 on board Metop-A and Metop-B satellites) sensors have been produced using an advanced version of the direct-fitting GODFIT (GOME-type Direct FITting) v4 algorithm. The ESA-CCI total ozone CRDP includes the level-2 products for each instrument (over the entire instrument lifetime) and a level-3 merged monthly mean gridded data set using GOME and OMI as long-term stability reference.
In the following section, we briefly present the GODFIT v4 algorithm that creates the level-2 CRDPs, followed by the validation against the Brewer, Dobson and SAOZ (Système d'Analyse par Observation Zénitale; Pommereau and Goutail, 1988) ground-based instruments and the comparison to the independent solar backscatter ultraviolet measurements (SBUV) v8.6 long-term TOC record. Thereafter, the algorithm that merges the individual level-2 TOC records to create the level-3 data set is presented, followed by the validation against the ground-based records and intercomparison with the individual level-2 validation findings. Summary and conclusions are given in the last section.
Level-2 total ozone columns

Satellite total ozone column records
GODFIT is an algorithm jointly developed by BIRA-IASB (Royal Belgian Institute for Space Aeronomy), RT Solutions and DLR (German Aerospace Center) to retrieve TOC from satellite-borne nadir-viewing hyperspectral spectrometers, such as GOME(-2), SCIAMACHY and OMI. It relies on a non-linear least-squares minimization procedure, during which sun-normalized radiances simulated in the Huggins bands (325-335 nm) with the radiative transfer model LIDORT (Linearized Discrete Ordinate Radiative Transfer; Spurr et al., 2013) are adjusted to the level-1 measurements. As part of Phase I of the ESA Ozone_cci project, version 3 of GODFIT has been successfully transferred to other nadir sensors and is comprehensively described in Lerot et al. (2014) and validated in Koukouli et al. (2015). During the second phase of this project, a number of algorithmic improvements have been realized and the full time series of GOME, OMI, SCIAMACHY and GOME-2A/B have been entirely reprocessed with the latest version (version 4) of GODFIT. The most important update is the adaptation of the level-1 soft-calibration scheme in order to restore the full independence of the satellite observations with respect to the ground-based measurements. This algorithm, described in detail in Rahpoe et al. (2017), is also the future baseline for generating the offline operational total ozone from the TROPOMI/S5-P (TROPOspheric Monitoring Instrument on board the Copernicus Sentinel-5 Precursor satellite) instrument, launched in October 2017.
The radiance simulations require that the atmosphere is properly defined at each iteration within the retrieval and so a series of auxiliary data is also required. Ozone vertical profiles are prescribed by the total ozone classified climatology recently released by Labow et al. (2015) using MLS (Microwave Limb Sounder) and sondes data, combined with the tropospheric column database constructed by Ziemke et al. (2011). The ozone absorption is modelled using the temperature-dependent cross sections measured by Serdyuchenko et al. (2014). The temperature in each atmospheric layer is prescribed by a priori profiles, allowed to be shifted by a constant offset, determined simultaneously with the total column. All cross sections are pre-convolved at the respective instrumental resolution and an improved correction for the so-called solar I0 effect (Aliwell et al., 2002) has been applied (Rahpoe et al., 2017). GODFIT has the capability to characterize the instrumental slit function on an orbit basis by fitting pre-determined functions such as (super-)Gaussian shapes (Beirle et al., 2017) or by stretching slit functions pre-measured on the ground. To account for contamination by clouds and/or aerosols, an effective scene approach is used (Coldewey-Egbers et al., 2005) in which the effective albedo of a scene located in between the cloud top height and the ground surface is fitted during the retrieval. The altitude of this effective scene depends on both the effective cloud fraction and the cloud top altitude provided by independent cloud algorithms (FRESCO v7 by Wang et al., 2008, or the O2-O2 product by Veefkind et al., 2016). Radiances are simulated on the fly with the scalar radiative transfer model LIDORT for GOME, SCIAMACHY and GOME-2. Because of the heavy computational burden of those simulations, the radiances may alternatively be extracted from a pre-computed look-up table, of which the granularity has been cautiously defined in order to limit interpolation errors while keeping a reasonable size (Rahpoe et al., 2017). Once simulated, correction terms are applied to the radiances to correct for the impact of atmospheric polarization and inelastic scattering processes (Lerot et al., 2014).
When a common retrieval algorithm is applied to various instruments, systematic differences may remain due to calibration deficiencies or instrumental degradation effects affecting the level-1 reflectance data. To generate the CCI total ozone data sets with the high inter-sensor consistency required for climate studies, an original soft-calibration scheme had been incorporated within GODFIT v3. This procedure, extensively described in Lerot et al. (2014), relied on reference total column measurements at selected northern mid-latitude Brewer stations. Although it was shown to work well, this approach had the disadvantage of introducing a link between the satellite and ground-based measurements. As illustrated in Fig. 1, experience has shown that the GOME and OMI sensors perform in an extremely stable way and do not require any spectral soft-calibration procedure. Therefore it was decided to use these two instruments to soft-calibrate the spectra measured by SCIAMACHY and GOME-2A/B. In practice, for every cloud-free satellite pixel falling into a reference sector between 40° S-50° N and 175-145° W, the closest reference clear-sky OMI (or GOME before 2005) column is used to simulate a radiance (using the GODFIT forward model), which is then compared to the level-1 spectrum recorded by the sensor to be soft-calibrated. Such comparisons are done systematically for a large number of pixels (e.g. several hundreds of thousands for GOME-2A) spanning most of the observation geometries and the full time series, which allows the identification and correction of systematic issues in the level-1 data. See Lerot et al. (2014) for more details on the soft-calibration approach.
Using this new GODFIT v4 baseline, the time series of GOME, SCIAMACHY, GOME-2A/B and OMI have been entirely reprocessed. Figure 2 illustrates the excellent consistency between the individual level-2 data sets with mean differences generally within 0.5 % at moderate latitudes (±50°). The level-2 data sets are publicly available on the Ozone_cci website (http://www.esa-ozone-cci.org) and the time series are also regularly extended as part of the Copernicus Climate Change Service (C3S).
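The pixel selection for the soft calibration can be illustrated with a minimal sketch. It only covers the geographical and cloud screening step described above, not the radiance simulation. The function names, the dictionary keys and the cloud-fraction threshold used to define "cloud-free" are assumptions made for this illustration and are not taken from GODFIT.

```python
def normalize_longitude(lon_deg):
    """Map any longitude in degrees east onto the interval [-180, 180)."""
    return (lon_deg + 180.0) % 360.0 - 180.0


def in_reference_sector(lat_deg, lon_deg):
    """True if a pixel centre lies within 40° S-50° N and 175-145° W."""
    lon = normalize_longitude(lon_deg)
    return -40.0 <= lat_deg <= 50.0 and -175.0 <= lon <= -145.0


def select_calibration_pixels(pixels, max_cloud_fraction=0.0):
    """Keep cloud-free pixels inside the reference sector.

    `pixels` is a list of dicts with the hypothetical keys 'lat', 'lon'
    and 'cloud_fraction'. The threshold defining a cloud-free pixel is
    an assumption; the source does not state the value used.
    """
    return [
        px for px in pixels
        if px["cloud_fraction"] <= max_cloud_fraction
        and in_reference_sector(px["lat"], px["lon"])
    ]
```

Longitudes are normalized first so that the same test works for data sets that store longitudes on a 0-360° grid.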

Ground-based total ozone column records
For the purposes of this work, both direct-sun measurements (from Dobson and Brewer UV spectrophotometers) and zenith-sky scattered-light (ZSL-DOAS) measurements were used as ground-based reference data.
Total ozone column measurements from Dobson and Brewer UV spectrophotometers were downloaded from the WOUDC (World Ozone and Ultraviolet Radiation Data Centre) archive (http://www.woudc.org); see Tables S2 and S3 in the Supplement for a complete list. The measurement techniques and the data analysis methodology are extensively analysed in Koukouli et al. (2015) and in references therein. It is important to point out that according to Van Roozendael et al. (1998), the estimated total uncertainty for the Dobson spectrophotometer is about 1 % for cloud-free direct-sun observations and 2-3 % for zenith-sky or cloudy observations, while the error of individual total ozone measurements for a well-maintained Brewer instrument is about 1 % (e.g. Kerr et al., 1988).
The main issues that have to be taken into account during the validation process with these direct-sun instruments are as follows: (a) TOC measurements from Dobson spectrometers depend on the stratospheric effective temperature, which is manifested in the comparisons as a seasonality effect (Kerr et al., 1988; Kerr, 2002; Bernhard et al., 2005; Scarnato et al., 2009; Koukouli et al., 2016). (b) Even though the principles of operation between Dobson and Brewer spectrometers do not differ significantly, TOC measurements from the two types of instruments show small differences in the range of ±0.6 % due to the use of different wavelengths and the different temperature dependence for the ozone absorption coefficients (Staehelin et al., 2003). (c) Due to the limited number and poor spatial distribution of stations with Brewer instruments in the Southern Hemisphere (SH) (all of them located in the Antarctic), the Dobson network is considered much more suitable to investigate spatial homogeneity of satellite products below the Equator.
TOC ground-based measurements from the abovementioned instruments have been extensively used in past publications for the purpose of analysis and validation of satellite data (see e.g. Balis et al., 2007a, b; Antón et al., 2009; Loyola et al., 2011; Koukouli et al., 2012, 2015; Labow et al., 2013; Bak et al., 2015). The ground-based stations were selected in accordance with the criteria discussed in detail in Balis et al. (2007a, b). Their measurements are thoroughly inspected once a year, in terms of quality assurance and stability, following the principles described in Fioletov et al. (1999), Vanicek (2006) and Fioletov et al. (2008), among others.
The GODFIT v4 TOCs were also compared against twilight zenith-sky measurements obtained with ZSL-DOAS (zenith scattered-light differential optical absorption spectroscopy) instruments. Most of these instruments form part of the SAOZ network (Pommereau and Goutail, 1988) of the Network for the Detection of Atmospheric Composition Change (NDACC). In NDACC, four slightly different ZSL-DOAS instruments are also routinely reporting data (see Table S1 for complete list of instruments used). To avoid confusion in the paper, hereafter they will all be referred to as "SAOZ measurements".
The total accuracy of SAOZ measurements is of the order of 6 % (Hendrick et al., 2011), including a 3 % systematic uncertainty of the absorption cross sections. However, since all NDACC SAOZ/ZSL-DOAS instruments are using the same cross sections, there is no systematic error between them. The random error of SAOZ spectral analysis is less than 2 %, to which one should add the random error on the air mass factor, mainly impacted by clouds (up to 3.3 %). Thus, significantly better performance, of the order of 2 %, can be expected in differential analyses of cloud-free data.
These twilight zenith-sky measurements are complementary to the Brewer and Dobson measurements for several reasons: (a) they use spectral features of the visible Chappuis band, where the ozone differential absorption cross sections are temperature insensitive; (b) the long horizontal stratospheric optical path allows measurements of the column above cloudy scenes; and (c) measurements are always performed in the same small solar zenith angle (SZA) range (86-91°). For further details on the measurement procedures and on the specific collocation approach, taking into account the actual area of measurement sensitivity, we refer to Balis et al. (2007a), Koukouli et al. (2015) and references therein. After quality control and the application of thresholds on the minimum number of collocated measurements, data from about 20 instruments were used, covering both the Northern Hemisphere (NH) and SH up to high latitudes and leaving only the equatorial region poorly sampled (see Fig. S1 in the Supplement for the locations of all three types of instruments). In spite of the dedicated collocation method, some residual errors due to collocation mismatch may persist and must be kept in mind, in particular at high latitudes, as shown by Verhoelst et al. (2015).

Level-2 validation results and discussion
As a basis for the validation process of the satellite TOC measurements, pairs of collocated satellite and daily-mean ground-based measurements are formed and their percentage difference is calculated. Specific criteria are applied to minimize the noise of the comparison:
i. For the Dobson and Brewer instruments, (a) the maximum search radius between the ground-based stations and the centre coordinates of the satellite pixel is set to 150 km and the spatially closest satellite observations are paired with the ground-based station's daily-mean measurement and (b) only direct-sun ground-based measurements are used for the validation process, since they are deemed to be most accurate.
ii. For the SAOZ measurements, the large displacement (with respect to the instrument location) of the actual measurement sensitivity is taken into account by requiring satellite pixels to intersect with a 2-D (lat, long) polygon describing the true area of measurement sensitivity; see Balis et al. (2007a) and Verhoelst et al. (2015) for full details.
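Criterion (i) for the Dobson and Brewer collocations can be sketched as follows. This is an illustration under stated assumptions, not the code used for the validation: the haversine distance on a spherical Earth, the dictionary keys and the sign convention of the percentage difference (satellite minus ground, relative to ground) are assumed here, since the text does not spell them out.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_pixel(station, pixels, max_radius_km=150.0):
    """Return the pixel whose centre is closest to the station, or None
    if no pixel centre lies within the search radius."""
    best, best_dist = None, max_radius_km
    for px in pixels:
        dist = great_circle_km(station["lat"], station["lon"],
                               px["lat"], px["lon"])
        if dist <= best_dist:
            best, best_dist = px, dist
    return best


def percent_difference(sat_toc, ground_toc):
    """Assumed convention: 100 * (satellite - ground) / ground."""
    return 100.0 * (sat_toc - ground_toc) / ground_toc
```

In such a scheme, each station-day would contribute at most one pair: the daily-mean direct-sun ground value and the TOC of the pixel returned by `closest_pixel`.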
Following those criteria, three time series (one for each type of ground-based instrument) of the percentage differences are formed. Hereupon, a statistical analysis of the time series is performed, separately for each type of instrument, so as to study a variety of possible dependences on geospatial parameters such as the season, latitude and observation geometry.
The results of the analysis are shown in the following graphs and are summed up in Table 1. In the figures presented in this section, the dependency of the percentage difference between satellite and ground-based TOC measurements on parameters such as the ones mentioned above is displayed (the line colours used for Figs. 3-6 are black for GOME, blue for SCIAMACHY, cyan for OMI, green for GOME-2A and orange for GOME-2B). It should be noted that SH GOME measurements are only shown before 2003, when the instrument encountered downlink telemetry problems. In Fig. 3 the time series of the percentage difference between the TOC measurements from the five satellite sensors and the collocated Dobson, Brewer and SAOZ ground-based measurements are shown. In all panels the entire available time series from each satellite instrument is displayed (except for GOME for the SH, as mentioned above) in the form of monthly mean difference (in %). The monthly means for each sensor were calculated using the percentage differences of all the available collocations from all stations for each month, without any weighting. The comparison with the Dobson measurements is presented in panel (a), which corresponds to the NH stations, and panel (b), which presents the SH percentage differences. It is shown that the NH time series are highly consistent and stable for all five satellites, with an amplitude of ∼ 2 % for all sensors apart from SCIAMACHY, which shows a slightly increased variability, with certain months underestimating the ground-based mean (differences reaching −1 %). Part of the seasonality observed in Fig. 3a and b is due to the known Dobson dependency on the effective temperature of the stratosphere (Koukouli et al., 2016). The ∼ 1.5 % bias of the satellite TOCs compared to the Dobson TOCs is in agreement with the bias of ±2 % found by the Absorption Cross Sections of Ozone (ACSO) committee (Orphal et al., 2016) and might be related to systematic uncertainties in the different ozone absorption cross sections used to retrieve satellite and ground-based measurements. Dobson and Brewer TOC data records are based on Bass and Paur (1985) ozone absorption cross sections, whereas, as mentioned in the previous section, the respective satellite TOCs are produced using the cross sections measured by Serdyuchenko et al. (2014).
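The unweighted monthly averaging used for the time series of Fig. 3 can be sketched in a few lines. The input layout, a flat list of `(year, month, percent_difference)` tuples pooled over all stations, is an assumption made for this example.

```python
from collections import defaultdict


def monthly_mean_differences(collocations):
    """Average all collocated percentage differences per calendar month.

    `collocations` is an iterable of (year, month, percent_difference)
    tuples pooled over all stations (hypothetical layout). Every
    collocation counts equally, so stations with more collocations
    weigh more in the monthly mean.
    """
    buckets = defaultdict(list)
    for year, month, diff in collocations:
        buckets[(year, month)].append(diff)
    return {key: sum(vals) / len(vals)
            for key, vals in sorted(buckets.items())}
```

Because no station weighting is applied, a month dominated by a few densely sampled stations reflects mostly those stations, which is worth keeping in mind when reading the SH panels with few contributing sites.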
The comparison for the SH Dobson measurements (Fig. 3b) shows higher variability due to the fact that the number of available stations in this part of the globe is limited and their measurements are greatly affected by the vigorous phenomena developing over the Antarctic. However, all time series present a rather consistent and stable behaviour, similar to that shown in the NH, with a bias of the order of 1-1.5 % for OMI, GOME-2A and GOME-2B.
In Fig. 3c, the same plot of the percentage differences between the satellites and Brewer ground-based measurements performed at stations located in the NH is shown. Due to the extremely limited number of stations with Brewer spectrophotometers in the SH, positioned exclusively in the Antarctic, it was decided not to present the respective plot. The consistency and the stability of the satellite measurements are evident for the whole time period of available data and for the whole set of five sensors: the overall bias of the comparison is up to 1 % for GOME, 0 % for SCIAMACHY and 1.5 % for the rest of the instruments, with peak-to-peak amplitude of the order of 1-2.5 %.
Panels (d) and (e) of Fig. 3 depict the time series of the comparison to the SAOZ network for the NH and SH, respectively. The known seasonality effect, which is present in comparisons between SAOZ and direct-sun measurements (Hendrick et al., 2011), is clearly stronger in these figures than in the other three panels. Aside from the cross sections' stratospheric effective temperature dependence, which affects Dobson and, to a lesser extent, Brewer and satellite measurements, the SAOZ seasonality observed in panels (d) and (e) comes from the comparison being performed up to high latitudes in winter; this is in contrast to the Dobson and Brewer instruments, which are "blind" at those latitudes in winter. In addition, SAOZ comparisons at high latitudes are known to be affected by collocation mismatch (Verhoelst et al., 2015). Finally, the overall bias of the SAOZ comparison is fairly stable at 1.5 % in the NH but rather variable for the SH, which can be attributed to the large number of high-latitude stations contributing to the statistics.
The dependence of the percentage differences of the five satellites measurements to the ground-based TOC measurements on SZA was investigated, as shown in Fig. 4, where panels  as daily means from the WOUDC database. First, as seen in Fig. 4, all curves in each plot have highly consistent dependencies on SZA, which indicates that, irrespective of its magnitude, the dependence can be attributed mainly to the ground-based measurements of each kind. Specifically, in panel (a) the NH comparison is shown, and there is a strong but very consistent dependence on SZA for all five satellite instruments, whereas in the SH (panel b) almost no dependency is seen for SZAs < 80°. The first reason for this dissimilar behaviour is the fact that in the NH most Dobson ground-based stations are located in the middle latitudes, contrary to the SH stations that are much more homogeneously distributed. Additionally, since the measurements of the Dobson stations are affected by the variation of the stratospheric effective temperature, the data provided by NOAA/National Weather Service (http://www.cpc.ncep.noaa.gov/products/stratosphere/temperature/) were investigated to see whether there is a difference in the stratospheric temperature between the mid-latitudes of the two hemispheres. The results are very consistent with the two plots of panels (a) and (b): the peak-to-peak amplitude of the stratospheric effective temperature annual variation above the mid-latitudes of the NH is about 3-4 °C greater than the variation above the respective latitudes of the SH, which results in the stronger variability of the NH Dobson measurements seen in panel (a). Of course, further investigation on this issue is needed, but it is beyond the scope of this work.
Figure 4c shows that the percentage difference of the measurements is almost constant for the Brewer comparison and increases for SZAs larger than 70°. SCIAMACHY, however, shows a slightly stronger dependence on SZA starting from low angles. Comparisons performed at SZAs over 75° and below 25° are affected by the limited number of observations and the uncertainties of the ground-based measurements themselves. Hence, it is difficult to assess their significance level.
In Fig. 4d-e, we show that the SZA dependence between satellite and SAOZ ground measurements was up to 4 % at the highest satellite-viewed SZAs (> 80°) at all high-latitude stations, irrespective of season. There was also some minor dependence at very small SZAs in the northern tropics, but this is based on only a few tropical stations with limited data, and it is not confirmed by the Brewer comparisons. There are also some systematic inter-hemispheric differences for SAOZ measurements, which is obvious when comparing panels (d) and (e) of Fig. 4, in particular due to comparisons at some northern high-latitude stations being biased high (up to 5 %), and those at southern high-latitude stations being biased low (of the order of 2 %), as shown in Fig. 5c, which will be commented on below.
Additionally, the dependency of the satellite and ground-based measurements percentage difference on latitude is presented. In Fig. 5a, the ground-based measurements are performed by Dobson spectrophotometers, in panel (b) Brewer data are used, while in panel (c) the comparison with the SAOZ data record is shown. It is obvious in this exercise that all five satellite sensors appear to be very consistent, regardless of the ground-based instrument type, which is the main concern of this work. It is also noticeable that, mainly for Brewer and Dobson ground-based measurements, the dependency on latitude is less evident for the NH due to the much higher number of collocations found there. Specifically, the comparisons with Dobson measurements show differences between 0 and 2 % for latitudes between −40 and 0° as well as for the entire NH, similar to the Brewer comparisons. In the SH, especially southwards of −40°, the comparisons show differences ranging between −2 and 4 %, depending on the satellite sensor, attributable partially to the small number of stations located in that part of the Earth and partially to the higher variability of the TOCs within the southern polar vortex (see also Verhoelst et al., 2015). In Fig. 5c, where the comparison with the SAOZ measurements is shown, a higher dependency on latitude is evident even for the NH, where the other two ground-based instruments have completely different performances. Nevertheless, the inter-sensor consistency is very satisfactory in this comparison, too, except for the high-altitude Izaña station located at 28° N (near the NH tropics), for which the differences were adjusted to take into account the missing column in the ground-based measurement but some residual effect due to different satellite pixel sizes is probably still present. The correction for the station's altitude is described in Verhoelst et al. (2015) and uses an ERA-Interim based estimate of the column below the instrument altitude in the immediate vicinity of the island and/or mountain, at the resolution of the reanalysis, and not taking into account the exact satellite pixel size and location. For the SAOZ/ZSL-DOAS network, Izaña and Jungfraujoch are the only stations for which a significant missing column was derived with this methodology (about 2.8 and 3.2 %, respectively, with some seasonal variation) due to their isolated mountain-top locations. Any pixel-size dependence at Jungfraujoch is less evident in Fig. 5c as that latitude bin contains three other stations not located on mountain tops. Moreover, the measurements performed by the stations located in the 70-80° N belt show larger differences between sensors, but these discrepancies are not confirmed by the Brewer or the Dobson networks and they are most probably related to the larger (and pixel-size-dependent) horizontal smoothing difference errors between SAOZ and the satellite measurements.
According to the guidelines given in the Ozone_cci project's User Requirement Document (version 2.1) (Van Weele et al., 2016), Table 5, the stability of the total ozone column measurements must be between 1 and 3 % per decade, the evolution of the ozone layer (radiative forcing) has to be less than 2 % and the seasonal cycle and inter-annual (short-term) variability should be less than 3 %. To investigate whether the five satellite data records are compliant with those requirements, a statistical analysis of the percentage deviation between satellite and ground-based measurements was performed, with the statistics presented in Table 1. The first column enumerates the physical quantity studied, the second column differentiates between Brewer, Dobson and SAOZ collocations, the third column shows the results of the statistical analysis for GOME/ERS-2, the fourth column for SCIAMACHY/Envisat, the fifth for OMI/Aura, the sixth for GOME-2/Metop-A and the seventh for GOME-2/Metop-B sensor. The rows of Table 1 depict (1) the mean bias and standard deviation (1σ), computed from the monthly mean differences of the entire record for each sensor, shown in Fig. 3; (2) the monthly mean variability, i.e. the variability of the standard deviations of differences in individual months, calculated by the root mean square; (3) the drift per decade, i.e. the decadal drift and associated standard deviation; (4) the seasonality, i.e. the peak-to-peak amplitude of the seasonal variability; (5) the latitudinal mean bias, i.e. the mean bias and standard deviation as calculated by the latitudinal variability plots (Fig. 5) on a global scale; and (6) the SZA mean bias, i.e. the mean bias and standard deviation as calculated from the SZA ranges shown in Fig. 4, on a global scale. The values of the table are all measured in percent and all the quantities for the Brewer measurements, as well as quantities (1), (2) and (3) for the Dobson measurements, are calculated for the NH only.
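Three of the Table 1 quantities, the mean bias with its standard deviation, the drift per decade and the peak-to-peak seasonality, can be estimated from a monthly mean difference series along the following lines. This is a minimal sketch: the ordinary least-squares fit for the drift and the use of a multi-year monthly climatology for the seasonality are assumptions, as the text does not specify the estimators used, and the function names are hypothetical.

```python
import math


def mean_and_std(values):
    """Mean and sample standard deviation (1 sigma) of a sequence."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var)


def drift_per_decade(series):
    """Least-squares slope of the differences, in % per decade.

    `series` maps (year, month) to the monthly mean difference in %.
    Each month is placed at mid-month on a time axis in decades.
    """
    times = [(year + (month - 0.5) / 12.0) / 10.0 for year, month in series]
    diffs = list(series.values())
    t_mean = sum(times) / len(times)
    d_mean = sum(diffs) / len(diffs)
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (d - d_mean) for t, d in zip(times, diffs))
    return sxy / sxx


def seasonality_peak_to_peak(series):
    """Peak-to-peak amplitude of the multi-year monthly climatology."""
    by_month = {}
    for (_, month), diff in series.items():
        by_month.setdefault(month, []).append(diff)
    climatology = [sum(v) / len(v) for v in by_month.values()]
    return max(climatology) - min(climatology)
```

The resulting drift and seasonality values can then be compared against the 1-3 % per decade stability and the short-term variability thresholds quoted above. A short record, such as a few years of data, yields a slope with a large uncertainty, so the fit alone does not establish significance.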
The percentages listed in Table 1 show that the products of the GODFIT v4 algorithm for all five sensors fulfill the requirements set by the European Space Agency's Ozone_cci project (Lambert et al., 2018), since the amplitude of the short-term variability (seasonality) is less than 2 % and the maximum drift per decade is equal to −1.37 ± 1.60 % decade⁻¹ for GOME-2/Metop-B. Because the time series of that sensor is only 3.5 years long, its drift per decade cannot be considered statistically significant. For the rest of the sensors the maximum drift per decade is less than ±1 %. In conclusion, the statistics presented in Table 1 indicate that the data sets produced by the Ozone_cci GODFIT v4 algorithm for all five sensors under validation are reliable, homogeneous and consistent.
In order to further demonstrate the long-term inter-sensor consistency of the GODFIT v4 level-2 TOCs, comparisons to the SBUV data products are shown. Daily level-2 overpass files of total ozone column measurements produced by the SBUV v8.6 algorithm for the locations of the ground-based stations were downloaded from https://acd-ext.gsfc.nasa.gov/Data_services/merged/ and are described by McPeters et al. (2013) and Frith et al. (2014). In Labow et al. (2013), their measurements were also validated against Brewer and Dobson ground-based measurements, showing an agreement of the order of ±1 %. In Fig. 6 the percentage deviation of NH SBUV and GODFIT v4 satellite data sets from the respective ground-based measurements performed by the Dobson instrument is displayed. In Fig. 6a, the time period 1995 to 2012 is shown, encompassing the available data sets from NOAA 14 SBUV/2, NOAA 16 SBUV/2, NOAA 17 SBUV/2, GOME and SCIAMACHY. In Fig. 6b, the time series of NOAA 18 SBUV/2, NOAA 19 SBUV/2, OMI, GOME-2A and GOME-2B for the years 2005 to 2017 are shown. The purpose of these plots is to investigate the consistency, the stability and the homogeneity of 10 completely different time series generated with two different algorithms. The plots show that, for the two time periods under consideration, all sensors are in very good agreement, with very similar seasonality amplitudes and biases, further testifying to the homogeneity and stability of the GODFIT v4 products.

The level-3 GOME-type Total Ozone Essential Climate Variable (GTO-ECV) data record
One of the main aims of the ESA Ozone_cci project is to construct the homogeneous global long-term GOME-type total ozone climate data record, hereafter termed GTO-ECV v3. The individual level-2 observations (presented and validated above in Sect. 2) are converted into a level-3 product and then combined into one single cohesive record spanning the entire 22-year period, from 1995 to 2016. This section summarizes the main characteristics of the merging methodology as well as the latest improvements and extensions implemented within the second phase of the Ozone_cci project.
The predecessor of GTO-ECV v3 has been described in detail and validated in Loyola et al. (2009) and Coldewey-Egbers et al. (2015).
In short, the individual level-2 measurements processed with the GODFIT v4 retrieval algorithm are first mapped onto a regular global grid of 1° × 1° in latitude and longitude to construct daily averages for each sensor. Before combining the individual gridded data, adjustments are made in order to account for possible biases and drifts between the instruments. In the previous algorithm version, which spanned the 15-year period between March 1996 and June 2011 (Coldewey-Egbers et al., 2015), the GOME TOCs were used as a reference for the other sensors; in this version the OMI measurements serve as a baseline for the inter-sensor calibration. Their long-term stability with respect to ground-based observation data is noteworthy (see Fig. 3a and c and Table 1) and the periods of overlap with the other sensors are sufficiently long, at least 4 years.
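The gridding step described above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, each level-2 pixel is assigned to the single cell containing its centre, and a plain unweighted mean is taken per cell, which is an assumption since the text does not detail how pixel footprints are treated.

```python
import math
from collections import defaultdict


def grid_index(lat, lon):
    """Return (row, col) of the 1-degree cell containing (lat, lon).

    Rows run from 90 S (0) to 90 N (179), columns from 180 W (0) to 180 E (359).
    """
    row = min(int(math.floor(lat + 90.0)), 179)
    col = min(int(math.floor((lon + 180.0) % 360.0)), 359)
    return row, col


def daily_grid_average(pixels):
    """Average one day of level-2 pixels (lat, lon, toc) per grid cell.

    Returns {cell: (mean_toc, n_pixels)}; the count is kept because it can
    later serve as a weight when several sensors are combined.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lat, lon, toc in pixels:
        cell = grid_index(lat, lon)
        sums[cell] += toc
        counts[cell] += 1
    return {cell: (sums[cell] / counts[cell], counts[cell]) for cell in sums}
```

For example, two pixels at (10.2° N, 20.3° E) and (10.8° N, 20.9° E) fall into the same cell and are averaged, while a pixel at (45.5° S, 120.5° W) populates a separate cell.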
Figure 7 shows the percentage differences between OMI and the other four sensors for 1° zonal monthly mean ozone columns during overlap periods. These zonal means were computed for collocated daily gridded data in order to minimize the impact of differences in the sampling pattern for OMI and the corresponding second sensor. In general, the inter-sensor consistency is very good; mean differences are between −0.2 ± 0.9 % (for GOME-2B, panel d) and 1.0 ± 1.4 % (for SCIAMACHY, panel b). In the inner tropics the bias is slightly negative for all sensors and it increases toward higher latitudes. The differences between OMI and GOME show slightly larger scatter in the SH due to significantly reduced spatial coverage of GOME as a consequence of the tape recorder failure in June 2003. The differences between OMI and SCIAMACHY indicate a positive bias for most parts of the globe, with a maximum in the SH around the polar night. For both GOME and SCIAMACHY we apply a static correction that depends on latitude and month of the year, using the seasonal mean differences with respect to OMI, averaged over all available years, as a function of latitude. The differences between OMI and GOME-2A indicate a positive drift of ∼ 0.15 % per annum in the middle latitudes of both hemispheres, which we take into account during the adjustment. For both GOME-2A and GOME-2B, the correction factors with respect to OMI depend on time (month) and latitude. The adjustment is then applied to the daily gridded data for each individual sensor. In doing so, the monthly correction factors are linearly interpolated in time.
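The linear interpolation of monthly correction factors to individual days can be sketched as below. The function names are hypothetical, and two points are assumptions made for illustration: that each monthly factor is anchored at a single node day (e.g. mid-month) and that the correction is applied multiplicatively to the gridded TOC. Neither detail is specified in the text.

```python
def interpolate_factor(day, nodes):
    """Linearly interpolate a correction factor to `day`.

    `nodes` is a list of (node_day, factor) pairs sorted by day, one per
    month for a given latitude. Outside the node range the nearest factor
    is held constant (an assumption of this sketch).
    """
    if day <= nodes[0][0]:
        return nodes[0][1]
    if day >= nodes[-1][0]:
        return nodes[-1][1]
    for (d0, f0), (d1, f1) in zip(nodes, nodes[1:]):
        if d0 <= day <= d1:
            w = (day - d0) / (d1 - d0)
            return (1.0 - w) * f0 + w * f1


def adjust_to_reference(toc, day, nodes):
    """Scale a daily gridded TOC value by the interpolated factor."""
    return toc * interpolate_factor(day, nodes)
```

With nodes at day 15 (factor 1.00) and day 45 (factor 1.02), a measurement on day 30 receives a factor halfway between the two.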
Figure 8 shows the percentage differences between OMI and the other sensors without (orange-red curves) and with (green curves) the adjustment to OMI for the near-global (60° N-60° S) mean ozone column as a function of time during the periods of overlap. The comparison with GOME is omitted in this plot because we use these data only until June 2003 in the final product. After the application of the correction the mean biases are almost completely removed, the scatter (standard deviation) is decreased by 15-40 % and the drift in the differences between GOME-2A and OMI is eliminated.
Subsequently, the individual (adjusted) data sets are combined into one single record. In contrast to the previous version (Coldewey-Egbers et al., 2015), where we used only one instrument at any given time, in GTO-ECV v3 we now average all available daily measurements (weighted by the number of measurements per day and grid box for the corresponding sensor), which improves the representativeness of the monthly averages. GOME data are used only until June 2003. As the ground-based validation of SCIAMACHY level-2 data indicates some lingering issues with the level-2 TOCs (see Sect. 2.3), we use SCIAMACHY only until October 2004 in order to fill the data gap between the GOME loss of global coverage and the launch date of OMI. For the calculation of monthly means we apply the same latitudinal constraints as defined in Coldewey-Egbers et al. (2015; see their Table 2), in order to provide representative averages that contain a sufficient number of measurements equally distributed over time. The complete merged GTO-ECV v3 data
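The count-weighted averaging of the adjusted sensors for a single day and grid box can be sketched as follows. The function name and the input layout are hypothetical; the sketch only illustrates the weighting by the number of measurements mentioned in the text.

```python
def merge_sensors(cells):
    """Combine per-sensor values for one day and one grid box.

    `cells` is a list of (toc_mean, n_measurements) pairs, one per sensor
    available on that day. Returns the count-weighted mean, or None when
    no sensor contributes any measurement.
    """
    total_n = sum(n for _, n in cells if n > 0)
    if total_n == 0:
        return None
    return sum(toc * n for toc, n in cells if n > 0) / total_n
```

A sensor contributing three measurements with a mean of 300 DU and a second sensor contributing one measurement of 310 DU would thus yield a merged value of 302.5 DU.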

Level-3 validation results and discussion
The validation of the new level-3 GTO-ECV v3 merged product was performed using as ground truth the Brewer and Dobson spectrophotometer network described in Sect. 2.2, as was done in the validation of the previous level-3 record (Coldewey-Egbers et al., 2015). In order to create the level-3 TOC field based on the WOUDC ground-based stations, the reported TOCs were gridded onto the same 1° × 1° grid as the GTO-ECV v3 data, on a monthly basis, with most grid points being represented by only one reporting station. Specifically, direct-sun measurements were considered for the gridding of the ground-based TOCs into level-3 grid points, even though in some cases this choice severely decreases the number of measurements. As in Coldewey-Egbers et al. (2015), the threshold on the number of measurements available before the computation of the associated monthly mean was investigated. As a compromise between obtaining the highest global coverage possible and the most representative monthly means, especially at high latitudes, a lower limit of 10 measurements per month and per grid box was enforced so that the temporal representativeness errors are minimized. We note here that restricting the monthly collocated measurements with respect to their mean effective day, which is a measure of the temporal distribution of the daily measurements within a month, did not significantly alter the findings, but it excluded entire zones and months from the comparison, so we opted not to apply such a restriction here.
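The monthly gridding of the ground-based data with the lower limit of 10 measurements can be sketched as follows. The function name and the record layout are hypothetical, and the sketch assumes the records have already been restricted to direct-sun observations and assigned to a grid cell.

```python
from collections import defaultdict

MIN_MEASUREMENTS = 10  # lower limit per month and grid box stated in the text


def monthly_grid_means(records, min_count=MIN_MEASUREMENTS):
    """Monthly mean TOC per grid box, keeping only well-sampled months.

    `records` is an iterable of (year, month, cell, toc) tuples. Months with
    fewer than `min_count` measurements in a grid box are dropped.
    """
    groups = defaultdict(list)
    for year, month, cell, toc in records:
        groups[(year, month, cell)].append(toc)
    return {
        key: sum(values) / len(values)
        for key, values in groups.items()
        if len(values) >= min_count
    }
```

Raising `min_count` trades global coverage for more representative monthly means, which is the compromise discussed above.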
Figure 10 shows the percentage difference between the satellite and the Brewer (panel a) and Dobson (panel b) TOC records as a function of latitude. The agreement between the five datasets and the ground-based measurements is outstanding, with 0.5 to 1.5 % peak-to-peak amplitude. For the entire time series of the level-3 data record the mean difference remains mainly positive for all time-series comparisons shown in Fig. 11. Concerning the level-3 comparisons in the NH, the drift per decade of the differences with respect to ground-based data is negligible: −0.11 ± 0.10 % decade⁻¹ for Dobson and +0.22 ± 0.08 % decade⁻¹ for Brewer collocations. Similarly to level 3, no long-term drift in the differences of the individual level-2 data sets was found for either Dobson or Brewer comparisons, with OMI showing the smallest drift per decade (+0.05 ± 0.12 % for Brewer and −0.39 ± 0.19 % for Dobson in the NH; −0.15 ± 0.15 % for Dobson measurements in the SH). The temporal stability of the GTO-ECV v3 level-3 TOC record thus satisfies well the requirements for the long-term stability of total ozone measurements of between 1 and 3 % decade⁻¹ (Van Weele et al., 2016).
In order to assess and ensure the quality of the new level-3 GTO-ECV v3 dataset, comparisons are performed against the SBUV merged data product, also shown above in the level-2 TOC validation section and recently quality assured in Frith et al. (2017). In Fig. 12, the time-series comparison between GTO-ECV v3 and SBUV merged is presented for the NH and Dobson (panel a), the SH and Dobson (panel b) and the NH and Brewer (panel c) instrument types. The level-3 GTO-ECV v3 (red line) and SBUV merged (black line) datasets agree to within ±1.5 %, considering their individual instrumental and algorithm differences, and show a very similar seasonal variability with a peak-to-peak amplitude between −1 and +2 % in Dobson and −0.5 and +1 % in Brewer cases over the entire time period. Furthermore, the two datasets show almost the same negligible drift per decade in the NH for both ground-based instrument networks, whereas in the SH for Dobson collocations the drift per decade is +0.23 ± 0.09 % and −0.09 ± 0.07 % for the level-3 GTO-ECV v3 and the SBUV merged TOCs, respectively.

Summary and conclusions
In this work, the Essential Climate Variable (ECV) Climate Research Data Package Total Ozone Column (CRDP TOC), refined and updated via the ESA-CCI Phase II, is presented and validated against independent ground-based TOC observations. Level-2 TOCs, produced by the GODFIT v4 algorithm as applied to the GOME/ERS-2, OMI/Aura, SCIAMACHY/Envisat and GOME-2/Metop-A and Metop-B observations, form the basis for a 22-year-long consistent, smooth and homogeneous CRDP. In addition, the individual sensor products have been combined and merged into one single cohesive level-3 data record, GTO-ECV v3. Detailed quality control and assurance against specific requirements from the international climate-chemistry modelling community showed that the product more than meets the official user requirements, i.e. that the stability of the TOC measurements has to be between 1 and 3 % decade⁻¹, that the radiative forcing introduced by the evolution of the ozone layer has to be less than 2 % and that the short-term variability has to be less than 3 %.
The individual level-2 data sets show excellent inter-sensor consistency with mean differences within 1.0 % at moderate latitudes (±50°), whereas the level-3 data sets show mean differences with respect to the OMI reference data record that span between −0.2 ± 0.9 % (for GOME-2B) and 1.0 ± 1.4 % (for SCIAMACHY).
For the level-2 validation against ground-based measurements, the mean bias between GODFIT v4 satellite and Brewer, Dobson and SAOZ-reported TOCs is well within 1.5 ± 1.0 % for all sensors, the drift per decade spans between 0 % and 1.4 ± 1.0 % depending on the sensor and the peak-to-peak seasonality ranges from ∼ 1 % for GOME and OMI to ∼ 2 % for SCIAMACHY.
For the level-3 validation against ground-based measurements a remarkable agreement is shown with 0.5 to 1.5 % peak-to-peak amplitude for the monthly mean time series, as well as a negligible drift in the NH with differences at −0.11 ± 0.10 % decade⁻¹ for Dobson and +0.22 ± 0.08 % decade⁻¹ for Brewer collocations.
We hence conclude that the quality of the GTO-ECV v3 level-3 TOC record temporal stability satisfies well the requirements of 1-3 % decade⁻¹. The prominent inter-sensor consistency renders both the level-2 GODFIT v4 and the level-3 GTO-ECV v3 datasets suitable and useful for longer-term analysis of the ozone layer, such as decadal trend studies, the evaluation of model simulations and data assimilation applications.
The Ozone_cci CRDP includes data products for TOCs, ozone profiles from nadir sensors and stratospheric ozone profiles from limb and occultation sensors. All data sets are reported in netCDF-CF format following CCI and GCOS standards and are freely available on the Ozone_cci web site (http://www.esa-ozone-cci.org/?q=node/160).
Competing interests. The authors declare that they have no conflict of interest.
Special issue statement. This article is part of the special issue "Quadrennial Ozone Symposium 2016 - Status and trends of atmospheric ozone (ACP/AMT inter-journal SI)". It is a result of the Quadrennial Ozone Symposium 2016, Edinburgh, United Kingdom, 4-9 September 2016.
Acknowledgements. The authors are grateful to ESA's Climate Change Initiative Ozone project, Phase II, for providing the support and funding necessary for this work. The ground-based data used in this publication were obtained as part of WMO's Global Atmosphere Watch (GAW) and the Network for the Detection of Atmospheric Composition Change (NDACC). They are publicly available via the World Ozone and UV Data Centre (WOUDC) and the NDACC Data Host Facility. We would like to acknowledge and warmly thank all the investigators that provide data to these repositories on a timely basis, as well as the handlers of these databases

Figure 1. Time series of the relative differences between the total ozone columns retrieved from the GOME and OMI sensors for different latitude bands. Retrievals have been performed without any soft calibration of the reflectances for both instruments.

Figure 2. Time series of the relative differences between the total ozone columns retrieved from GOME, SCIAMACHY and GOME-2A/B with respect to OMI.
(a) and (b) depict Dobson NH and SH comparisons, panel (c) shows the Brewer NH-only comparisons and panels (d) and (e) show the SAOZ NH and SH comparisons, respectively.It should be noted that the SZA values used for the grouping in the plots are the SZAs of the satellite and not the ground-based measurements, which are downloaded

Figure 3. The time series of the monthly mean percentage differences between the five satellite instruments and the co-located ground-based TOC measurements performed by Dobson (a Northern Hemisphere and b Southern Hemisphere), Brewer (c Northern Hemisphere) and SAOZ (d Northern Hemisphere and e Southern Hemisphere) instruments.

Figure 5. The percentage difference between the five satellite TOC measurements and ground-based measurements from Dobson (a), Brewer (b) and SAOZ (c) instruments, as a function of latitude.

Figure 8. Percentage differences between SCIAMACHY and OMI (circles), GOME-2A and OMI (squares), and GOME-2B and OMI (diamonds) as a function of time for the periods of overlap. Orange-red curves denote the differences without adjustment to OMI, and green curves denote the differences after the adjustment to OMI.

Figure 9. GTO-ECV total ozone column data record as a function of latitude and time from July 1995 to March 2017.Blue horizontal lines indicate the period for each sensor included in the merged product.

Figure 10. Latitudinal variability of the percentage difference between satellite observations and ground-based measurements (a) for the Brewer network and (b) for the Dobson network. The light blue line is GOME level-2 comparison, the green line is SCIAMACHY level-2 comparison, the red line is GOME-2A level-2 comparison, the black line is OMI level-2 comparison, the orange line is GOME-2B level-2 comparison and the purple line is level-3 GTO-ECV v3 comparison. The 1σ standard deviation of the average is displayed only for the level-3 lines.
Panels (a) and (b) of Fig. 10 refer to the Brewer and Dobson TOC records, respectively, as a function of latitude. The five individual satellite TOCs are very consistent with each other for all latitudes and in very close agreement with the ground-based data. The level-3 comparisons (purple line) show very good agreement with the individual level-2 lines. In particular, over the NH, all level-2 comparisons (apart from SCIAMACHY, in green) show a slight positive deviation of 0-2 % from the ground-based data for both ground-based instrument types. In the SH the level-3 comparisons show a near-perfect agreement with the level-2 comparisons apart from the 70-80° S belt, where the spread in comparisons reaches the 3.0 % level, which may be attributed to sampling differences between the level-2 and level-3 data (see Coldewey-Egbers et al., 2015, for a more in-depth discussion of this issue). In Fig. 11 the NH and SH time-series comparisons of the level-2 and level-3 data records with the Dobson and Brewer measurements are shown. The Dobson comparisons for SH (panel a) and NH (panel b) show very good agreement between level-3 and individual level-2 lines, within the 1 % difference level for most of the 22-year data record, except for a small number of outliers. The Brewer comparison in the NH (panel c) shows less amplitude than the Dobson comparisons throughout the full time series, for reasons discussed already in Sect. 2.2 and 2.3.

Figure 11. Time series of the percentage difference between satellite observations and ground-based measurements for the Dobson network in the NH (a) and in the SH (b) and for the Brewer network, NH only (c). The light blue line is GOME level-2 comparison, the green line is SCIAMACHY level-2 comparison, the red line is GOME-2A level-2 comparison, the black line is OMI level-2 comparison, the orange line is GOME-2B level-2 comparison and the purple line is level-3 GTO-ECV v3 comparison.

Figure 12. Same as in Fig. 11. The black line is SBUV merged comparison and the red line is level-3 GTO-ECV v3 comparison.

Table 1. Statistics of the comparison between satellite and ground-based TOC measurements.
* NH only; n/a means not applicable.