Known and Unknown Unknowns: Uncertainty Estimation in Satellite Remote Sensing

This paper discusses a best-practice representation of uncertainty in satellite remote sensing data. An estimate of uncertainty is necessary to make appropriate use of the information conveyed by a measurement. Traditional error propagation quantifies the uncertainty in a measurement due to well-understood perturbations in a measurement and in auxiliary data – known, quantified "unknowns". The under-constrained nature of most satellite remote sensing observations requires the use of various approximations and assumptions that produce non-linear systematic errors that are not readily assessed – known, unquantifiable "unknowns". Additional errors result from the inability to resolve all scales of variation in the measured quantity – unknown "unknowns". The latter two categories of error are dominant in under-constrained remote sensing retrievals, and the difficulty of their quantification limits the utility of existing uncertainty estimates, degrading confidence in such data. This paper proposes the use of ensemble techniques to present multiple self-consistent realisations of a data set as a means of depicting unquantified uncertainties. These are generated using various systems (different algorithms or forward models) believed to be appropriate to the conditions observed. Benefiting from the experience of the climate modelling community, an ensemble provides a user with a more complete representation of the uncertainty as understood by the data producer and greater freedom to consider different realisations of the data.


Introduction
All measurements are subject to error, the difference between the value obtained and the theoretical true value (or measurand). Errors are traditionally classified as "random" or "systematic" depending on whether they would have zero or non-zero mean (respectively) when considering an infinite number of measurements of the same circumstances. The uncertainty on a measurement describes the expected magnitude of the error by characterising the distribution of error that would be found if the measurement were infinitely repeated. These concepts are sketched in Fig. 1.
Uncertainty is a vital component of data as it provides
- a means of efficiently and consistently communicating the strengths and limitations of data to users, and
- a metric with which to compare and consolidate different estimates of a measurand.
The importance of quoting the uncertainty on any measurement and the thorough validation of both are well accepted, being essential for data assimilation (one of the primary uses of satellite data products). However, the terms "uncertainty" and "validation" are used inconsistently. This paper aims to present a succinct outline of uncertainty and validation and their best-practice application to satellite remote sensing of the environment. Satellite remote sensing is a sequence of processes that estimate a geophysical quantity from a measurement of the current or voltage produced by a space-based detector in response to the radiation incident upon it. Each step in processing, formally described in Table 1, is subject to various sources of error. This formalisation was applied as early as 1970 for Nimbus 4 data process-
Published by Copernicus Publications on behalf of the European Geosciences Union.
Figure 1. An illustration of error and uncertainty. The error in a measurement (purple arrow) is the difference between the true value of the measurand (solid blue) and the value measured (dashed red). The black line shows the frequency distribution of values that would be obtained if the measurement were infinitely repeated, referred to as the distribution of error. (a) A conventional random error. The uncertainty (green arrow) characterises the distribution of error by its width. (b) An error with a systematic component. This cannot be characterised with a single value.
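The distinction drawn in Fig. 1 can be sketched numerically. The following is a minimal illustration rather than anything taken from the paper: the function `simulate_errors`, the bias of 0.5, the unit noise level, and the Gaussian noise model are all assumptions chosen for the example. It approximates the "infinitely repeated" measurement with a large finite sample.

```python
import random
import statistics


def simulate_errors(bias, noise_sd, n=20000, seed=0):
    """Errors (measured minus true value) for n repeated measurements."""
    rng = random.Random(seed)
    return [bias + rng.gauss(0.0, noise_sd) for _ in range(n)]


# (a) Purely random error: the mean error tends to zero.
random_only = simulate_errors(bias=0.0, noise_sd=1.0)
# (b) Random error plus a constant systematic offset (hypothetical value).
with_bias = simulate_errors(bias=0.5, noise_sd=1.0, seed=1)

for label, errors in (("random", random_only), ("random + systematic", with_bias)):
    mean_error = statistics.mean(errors)  # estimates the systematic component
    spread = statistics.stdev(errors)     # estimates the width of the distribution
    print(f"{label}: mean error = {mean_error:.2f}, spread = {spread:.2f}")
```

In case (b) the width alone no longer describes the distribution of error, which is why a single value is insufficient there.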
Standardised methods for uncertainty estimation can be insufficient for satellite remote sensing data as they assume a well-constrained measurement where the sources of error are established: known, quantifiable unknowns. The dominance of systematic errors in satellite remote sensing data introduces known, unquantified unknowns (such as the impact of cloud filtering) and unknown unknowns (such as variability on scales smaller than that observed).
Ensemble techniques, a method widely used in the weather and climate communities, provide multiple self-consistent realisations of a data set as a means of representing nonlinear error propagation and variations resulting from ambiguous representations of natural processes. This paper argues that such techniques provide an effective means to represent and communicate the uncertainty resulting from the latter two categories of "unknowns" affecting satellite remote sensing data.
The discussions to follow aim to be accessible to both users and producers of satellite remote sensing data, and the issues considered apply (theoretically) to all satellite-based instruments. The relative importance of each point will depend on the precise technique considered, and the concepts will not be considered for all possible measurements. Illustrative examples will primarily draw from the characterisation of aerosol, cloud, and the surface with a hypothetical nadir-viewing radiometer in a low Earth orbit (∼800 km) with a spatial resolution of ∼1 km having bands in the visible and infrared. This specification is typical of a number of past and existing instruments such as the Along Track Scanning Radiometer (ATSR) series, the Advanced Very High Resolution Radiometer (AVHRR) series, and the Moderate Resolution Imaging Spectroradiometer (MODIS) on the Aqua and Terra platforms.

Level 0: Reconstructed, unprocessed instrument data at full resolution.
Level 1A: Reconstructed, unprocessed instrument data, time-referenced and annotated with ancillary information such as radiometric and geometric calibration coefficients and geolocation parameters. Data may be at full resolution or an average over some retrieval area.
Level 1B: Level 1A data that have been converted to physical units (e.g. brightness temperature rather than voltage). Not all instruments will have a Level 1B equivalent.
Level 2: Derived environmental variables (e.g. ocean wave height, soil moisture) at the same resolution and location as the Level 1 source data.
Level 3: Variables mapped onto uniform space-time grid scales, usually with some corrections for completeness and consistency (e.g. interpolation of missing points, interlacing multiple orbits).
Section 2 outlines the accepted definition of uncertainty, and the use of ensemble techniques in characterising the distribution of systematic errors in satellite remote sensing data. These are discussed with respect to specific sources of error in Sect. 3. Retrieval validation is considered in Sect. 4. Section 5 discusses the importance of qualitative information in the communication of uncertainty to data users, while Sect. 6 summarises some conclusions and recommendations.

Within retrieval theory
A generalised description of a retrieval technique is that it uses observations y and auxiliary information b to find some quantities of interest x that satisfy

$$y = F(x, b) + \epsilon, \qquad (1)$$

which is practically performed by evaluating

$$x = G(y, b), \qquad (2)$$

where the forward model F approximates the process by which the instrument and the environment translate the desired quantities x into the observation y and whose formulation will depend on the choice of basis x. The error in the measurements and forward model is denoted $\epsilon$, and the inverse function G is some statistical or approximate inversion of the forward model, for which many schemes exist (e.g. Rodgers, 2000; Twomey, 1997).
If a hat denotes the theoretical true value of a quantity or function, the error in the retrieval is given by $\varepsilon = x - \hat{x}$. It is affected by sources that fall between the following three extremes.
- Random fluctuations in the measurement, such as thermal fluctuations and shot noise. These are unavoidable but generally linear and (at least approximately) normally distributed such that the uncertainty can be represented by the standard deviation of their distribution. When using Eq. (2), the uncertainty resulting from random errors in multiple measurements can be calculated using the standard "propagation of errors" (Clause 5.1.2 of Working Group 1, 2008),

$$\sigma_{x_j}^2 = \sum_{i=1}^{N} \left( \frac{\partial G_j}{\partial y_i} \right)^2 \sigma_{y_i}^2, \qquad (3)$$

where $\sigma_{x_j}$ is the uncertainty in the jth element of x, $\sigma_{y_i}$ is the uncertainty in the ith observation, and N observations were considered, which are assumed to have uncorrelated errors.
- Simplifications and approximations made in the technique. These errors are systematic and are unlikely to be quantified (as they would have been included in the forward model if they were). Such errors are commonly characterised through validation.
- The degree to which the observation is representative of the situation it is proposed to describe. These are especially important for satellite observations, where measurements are averaged over some volume of the atmosphere that does not necessarily correspond to the scale of physical perturbations, such as turbulent mixing or cloud contamination.
These considerations compound when considering the uncertainty resulting from the use of auxiliary parameters, b. If the uncertainty on the auxiliary parameters is well known, it is straightforward to propagate it into the retrieval using Eq. (3) with the substitution y → b. However, the data may not map directly onto the defined state (e.g. observations at a different spatial resolution taken at a different sub-solar time), introducing additional error. If an auxiliary parameter is very poorly known, it may be preferable to retrieve it as an additional element of x, though in doing so the problem may become under-constrained (if it was not already). Even where it is possible to make additional measurements, it is often necessary to input an independently retrieved quantity rather than work from raw data.
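The standard propagation of errors referred to as Eq. (3) can be sketched as follows. This is a minimal illustration under stated assumptions: the function `propagate_uncertainty`, the finite-difference estimate of the derivatives, and the band-ratio retrieval with its input values are all hypothetical, and the errors are taken to be uncorrelated as in the text. The same function could be applied to auxiliary parameters b in place of y.

```python
import math


def propagate_uncertainty(retrieval, y, sigma_y, rel_step=1e-6):
    """First-order propagation of uncorrelated uncertainties sigma_y through
    a scalar-valued retrieval(y), using central finite differences."""
    variance = 0.0
    for i, (yi, si) in enumerate(zip(y, sigma_y)):
        step = rel_step * max(abs(yi), 1.0)
        upper = list(y)
        lower = list(y)
        upper[i] = yi + step
        lower[i] = yi - step
        derivative = (retrieval(upper) - retrieval(lower)) / (2.0 * step)
        variance += (derivative * si) ** 2
    return math.sqrt(variance)


def band_ratio(y):
    """Hypothetical retrieval: the ratio of two channel measurements."""
    return y[0] / y[1]


sigma_x = propagate_uncertainty(band_ratio, [0.30, 0.20], [0.003, 0.004])
print(f"retrieved ratio = {band_ratio([0.30, 0.20]):.3f} +/- {sigma_x:.3f}")
```

The finite-difference derivative is a convenience for the sketch; where an analytical Jacobian of the retrieval is available it would normally be preferred.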

Formal definition
The metrological community has prepared an extensive summary of best practice in the assessment of uncertainty in measurements: the Guide to the expression of uncertainty in measurement (Working Group 1, 2008, known hereafter as the GUM). It defines uncertainty as a "parameter, associated with the result of a measurement, that characterises the dispersion of the values that could reasonably be attributed to the measurand." This definition has been adopted by the European Space Agency's (ESA) Climate Change Initiative (CCI project teams, 2010).
In clause 0.4, the GUM states that an ideal method for evaluating uncertainty should be universal, in that it is applicable to all types of data. The reported uncertainty should then be internally consistent, being directly derivable from the information that was used in its calculation, and transferable, such that it can be input to subsequent calculations. These are achieved by assuming that any probability distribution from which errors are sampled can be accurately described by a single variance. If a series of N observations $x_i$ are made, the mean is

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i,$$

and, following clause 4.2 of the GUM, the variance is estimated by

$$\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2.$$

Clause 4.3 provides guidelines for determining a pseudovariance when observations are not repeated, such as where the measurand is known to fall between two limits. With that, Eq. (3) can be evaluated for the equations used to derive the measurement (outlined in clause 5).

Application to satellite remote sensing
These conventions apply equally to satellite remote sensing data but represent an impractical ideal that does not help an analyst fully represent their understanding of the uncertainty in their data. This is due to the simplistic treatment of systematic errors. Clause 3.2.4 of the GUM states that, "It is assumed that the result of a measurement has been corrected for all recognized significant systematic effects and that every effort has been made to identify such effects." While data producers put significant effort into identifying systematic errors, their quantification can be a difficult and occasionally impossible task. For such errors, it is unclear that their distribution is symmetric, such that the emphasis on traditional error propagation contributes to many analysts neglecting important systematic errors as they cannot be quantified with confidence (Li et al., 2009; Kokhanovsky et al., 2010). This applies primarily to highly under-constrained observations. A few measurements of the radiation at the top of atmosphere (TOA) cannot be used to deduce the intricate state of the atmosphere and surface in the observed column without substantial simplification of the physics and/or additional information on the variation of the state. Systematic errors are produced where these assumptions break down (e.g. using an inaccurate water vapour profile when evaluating measurements affected by water absorption).
The magnitude and nature of the systematic errors experienced are a function of the state observed. A common example is the differing treatment of land and sea surfaces. Averaging adjacent retrievals will not necessarily combine errors sampled from the same distribution. As the uncertainty of a retrieval is a function of the environment observed, it must be ascertained on a pixel-by-pixel basis to be meaningful.
The basis chosen to describe a system also impacts the expression of uncertainty. Consider the retrieval of cloud top temperature or pressure from measurements by a nadir-viewing infrared radiometer (for a more detailed description, see King, 1992; Fischer and Grassl, 1991; Schiffer and Rossow, 1983). The observed signal is the radiance at TOA, which is converted (using the Planck function) into the radiating temperature of the droplets at the top of the cloud. As that transform is non-linear, a symmetric distribution of random error in the radiance will not be symmetric when considering temperature, as sketched in Fig. 2. Similarly, the cloud top pressure is calculated from the temperature by interpolating a meteorological profile. As temperature varies approximately linearly with height while pressure varies approximately exponentially, the distribution will be further distorted in pressure space, in addition to the uncertainty introduced by the meteorological profile.
If errors are expected to be small (as in the radiance to temperature transform), the non-linearity will be minimal and a variance-based representation of error is sensible. Otherwise, the distribution of error may be skewed or asymmetric such that one value is insufficient to describe it. Ensemble techniques can provide the additional information required to characterise the distribution of error properly.
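The distortion of a symmetric radiance error by the Planck function can be illustrated with a small Monte Carlo experiment. This is a sketch under assumptions that are not from the paper: an 11 µm channel, a 220 K cloud top, black-body emission, and a deliberately exaggerated 10 % Gaussian radiance noise chosen to make the asymmetry visible.

```python
import math
import random

H = 6.62607015e-34  # Planck constant (J s)
C = 2.99792458e8    # speed of light (m s^-1)
K = 1.380649e-23    # Boltzmann constant (J K^-1)


def planck_radiance(temperature, wavelength):
    """Spectral radiance (W m^-2 sr^-1 m^-1) of a black body."""
    scale = 2.0 * H * C**2 / wavelength**5
    return scale / math.expm1(H * C / (wavelength * K * temperature))


def brightness_temperature(radiance, wavelength):
    """Invert the Planck function for temperature (K)."""
    scale = 2.0 * H * C**2 / wavelength**5
    return H * C / (wavelength * K) / math.log1p(scale / radiance)


WAVELENGTH = 11e-6        # hypothetical 11 micron channel (m)
TRUE_TEMPERATURE = 220.0  # hypothetical cold cloud top (K)
RELATIVE_NOISE = 0.10     # deliberately large to exaggerate the effect

rng = random.Random(0)
true_radiance = planck_radiance(TRUE_TEMPERATURE, WAVELENGTH)
radiances = [true_radiance * (1.0 + rng.gauss(0.0, RELATIVE_NOISE))
             for _ in range(50000)]
temperatures = [brightness_temperature(r, WAVELENGTH) for r in radiances]

n = len(temperatures)
mean_t = sum(temperatures) / n
sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in temperatures) / n)
# Skewness is zero for a symmetric distribution.
skewness = sum(((t - mean_t) / sd_t) ** 3 for t in temperatures) / n
print(f"mean = {mean_t:.2f} K, spread = {sd_t:.2f} K, skewness = {skewness:.3f}")
```

With noise this large the temperature distribution acquires a tail towards low temperatures, so its mean falls below the temperature of the mean radiance; reducing `RELATIVE_NOISE` shrinks both effects, consistent with the statement that small errors remain approximately linear.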

Ensemble techniques
As illustrated above, the standard error propagation techniques do not properly represent the distribution of nonlinear errors. In such situations, the uncertainty can be approximated by the variation in an ensemble of individually self-consistent predictions. An example is numerical weather prediction (NWP). Rather than predict the weather from the output of a single model run, multiple runs are performed (Buizza et al., 2005) with each initialised by a perturbed version of the initial state (the perturbations being consistent with the uncertainty in the observations used). The weather is chaotic, such that small changes in the input data produce significant and non-linear changes in the result (Lorenz, 1965). The ensemble of forecasts captures the variability as an approximation of the uncertainty in a forecast (Houtekamer and Lefaivre, 1997), such as the fraction of model runs in which a given feature is observed, in a way that standard error propagation cannot.
Non-linear error propagation in satellite remote sensing observations can be characterised via ensembles. Each member of the ensemble adds a random perturbation to the measurements y and ancillary parameters b (in accordance with their respective error distributions). The feasibility of doing this in large-scale processing is limited by computational cost so it is primarily useful as a method to validate the calculated uncertainties (commonly referred to as a sensitivity study).
Ensembles are also widely used in the climate modelling community (for example, Flato et al., 2013; Crucifix et al., 2005; Meehl et al., 2000). Many processes cannot be accurately modelled at the coarse resolutions practical for climate modelling. These are parametrised, but there are many possible schemes and each has associated unquantifiable systematic errors. The diversity in an ensemble of models (using different assumptions and simplifications) approximates the uncertainty in those models. This approximation is limited (as it cannot sample uncertainty related to features that are neglected from all of the models) but can still be useful (Knutti, 2010).
Such ensembles could be useful to assess the impact of a priori assumptions in poorly constrained retrievals (such as the selection of aerosol microphysical properties). To illustrate the concept, consider estimating the volume of an aluminium bucket knowing only its mass. As the density of aluminium is known and the thickness of metal used to make the bucket is assumed, the mass can be converted into a surface area. The volume is then determined from the surface area by assuming the shape and height of the bucket. That choice of shape (i.e. the forward model) will greatly affect how the retrieval interprets the mass measurement. This is portrayed in Fig. 3. Each line represents a different forward model for converting mass into volume. A slice (lines of the same colour) shows the impact of shape on the form of the forward model. Looking through the slices (different colours of the same line style) shows the impact of the assumed height. Note the following.
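A perturbed-input ensemble of this kind might be sketched as below. Everything specific here is an assumption made for illustration: the logarithmic retrieval `retrieve_optical_depth` stands in for G(y, b), the input values and their Gaussian error distributions are invented, and the percentile summary is only one of several ways to report the spread.

```python
import math
import random


def retrieve_optical_depth(signal, reference):
    """Hypothetical retrieval G(y, b): optical depth from a transmitted signal."""
    return math.log(reference / signal)


def perturbed_ensemble(y, sigma_y, b, sigma_b, members=5000, seed=0):
    """Retrieve from inputs perturbed according to assumed Gaussian errors."""
    rng = random.Random(seed)
    return [
        retrieve_optical_depth(rng.gauss(y, sigma_y), rng.gauss(b, sigma_b))
        for _ in range(members)
    ]


def percentile(values, fraction):
    """Simple rank-based percentile of a list (fraction between 0 and 1)."""
    ordered = sorted(values)
    index = min(int(fraction * len(ordered)), len(ordered) - 1)
    return ordered[index]


ensemble = perturbed_ensemble(y=0.2, sigma_y=0.02, b=1.0, sigma_b=0.01)
low, median, high = (percentile(ensemble, f) for f in (0.16, 0.50, 0.84))
print(f"optical depth = {median:.3f} (-{median - low:.3f}, +{high - median:.3f})")
```

Reporting separate lower and upper intervals allows any asymmetry to be seen, and comparing them with the linearly propagated uncertainty is the sensitivity study described above.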
- When the bucket is assumed to have a height of 12 cm (purple), the three different models produce consistent results between 0.15 and 0.3 kg. The error due to using an inappropriate model there will be small, but increases for masses > 0.3 kg. The error is a function of the true state.
- For a height of 24 cm (red) the models diverge greatly; a 0.32 kg bucket could have a volume between 0.10 and 11 L. Thus, the use of an incorrect model will introduce substantial error. The error is a function of the forward model's parameters.
- In this example the actual shape of the bucket is not known, so it is not possible to rigorously quantify the error resulting from the choice of forward model. Without additional information, the results for a hemispherical bucket are just as valid as a conical one despite their significantly different interpretations of the data (e.g. a hemispherical bucket has a minimum mass for a given height while a conical one does not).
Figure 3. An ensemble of forward models for the volume of a bucket (x axis) as a function of its mass (y axis). A third parameter, the bucket's height, is not measured and so must be assumed. Its impact is shown over five slices of the z axis. Solid, dotted, and dashed lines denote cylindrical, hemispherical, and conical buckets respectively. The material is assumed to have thickness 1 mm and density 2.7 g cm⁻³.
The form of the ensemble will depend on its intended use and a priori knowledge. In this example, the ensemble would be three estimates of the volume (one for each shape). The uncertainty resulting from errors in the mass, density, and thickness would be given separately for each ensemble member. If genuinely nothing was known about the height, the ensemble could be extended to represent a range of heights. In reality, some auxiliary information will exist that should constrain the values.
The standard deviation across ensemble members may be a useful proxy where the models are consistent, as in the 12 cm slice, but not generally. Non-linear errors can be most meaningfully described through an ensemble, with which many users already have extensive experience (Rayner et al., 2014). Ensemble techniques are universal, being a generalisation of the GUM's techniques to a poorly constrained problem (i.e. a well-constrained problem has a one-member ensemble). Each realisation of the data is internally consistent, and the ensemble presents a more complete understanding of the data, as ambiguities are explicitly highlighted. The information is transferable using the well-established techniques of the modelling community.
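The bucket ensemble can be sketched in a few lines. The geometry here is an assumption, since the text does not specify it: the cylinder is taken to be open-topped with a flat base, the cone to be an open-topped lateral surface, and the hemispherical model is omitted because its construction is not described. The numbers produced therefore need not match those read from Fig. 3.

```python
import math

DENSITY_G_CM3 = 2.7  # aluminium, as stated for Fig. 3
THICKNESS_CM = 0.1   # 1 mm wall thickness, as stated for Fig. 3


def surface_area_cm2(mass_kg):
    """Surface area of metal implied by the mass."""
    return 1000.0 * mass_kg / (DENSITY_G_CM3 * THICKNESS_CM)


def cylinder_volume_l(mass_kg, height_cm):
    """Open-topped cylinder with a flat base (assumed geometry)."""
    a = surface_area_cm2(mass_kg) / math.pi
    radius = -height_cm + math.sqrt(height_cm**2 + a)  # root of r^2 + 2hr = a
    return math.pi * radius**2 * height_cm / 1000.0


def cone_volume_l(mass_kg, height_cm):
    """Open-topped cone, lateral surface only (assumed geometry)."""
    a = surface_area_cm2(mass_kg) / math.pi
    # Root of r^4 + h^2 r^2 = a^2, solved for r^2.
    radius_sq = 0.5 * (-height_cm**2 + math.sqrt(height_cm**4 + 4.0 * a**2))
    return math.pi * radius_sq * height_cm / 3000.0


MODELS = {"cylinder": cylinder_volume_l, "cone": cone_volume_l}


def volume_ensemble(mass_kg, height_cm):
    """One volume estimate (litres) per assumed shape."""
    return {name: model(mass_kg, height_cm) for name, model in MODELS.items()}


for height in (12.0, 24.0):
    print(height, volume_ensemble(0.32, height))
```

Each dictionary entry is one ensemble member; looping over several assumed heights extends the ensemble in the way the paragraph above describes.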
This example is artificial but illustrates the utility of ensemble techniques to satellite remote sensing data.
- Retrievals of aerosol optical depth are strongly affected by the choice of aerosol microphysical properties. Analogous to the choice of bucket shape, these properties alter the form of the forward model and introduce unquantifiable errors. An ensemble can be produced by evaluating the observations with various models, as currently performed by the MISR (Multi-angle Imaging Spectroradiometer, Liu et al., 2009) and ORAC (Optimal Retrieval of Aerosol and Cloud, Thomas et al., 2009) algorithms.
- A variety of techniques can be used to merge multiple satellite sensors into a single, long-term product, such as the Jason-1 and Jason-2 mean sea-level missions (Ablain et al., 2015) or the SeaWiFS (Sea-Viewing Wide Field-of-View Sensor) and MODIS Terra and Aqua ocean colour data (Maritorena and Siegel, 2005). These correspond to the choice of bucket height: a poorly constrained retrieval parameter.
- Retrieval parameters and auxiliary data have associated uncertainties. Where the propagation of these is highly non-linear, they can be estimated via ensemble techniques analogous to the NWP approach, as done by Liu et al. (2015). Rather than present an ensemble of retrievals, Mears et al. (2011) produced an ensemble of estimated errors (as perturbations about the measured value presume it is the mean of the true distribution).
- Errors that are correlated over large temporal and/or spatial scales are impractical to calculate and represent with traditional covariance matrices. Ensembles have been used to represent these in sea surface temperature (SST) products (Kennedy et al., 2011a, b), with less problematic errors represented by separate uncertainty estimates.
In essence, the ensemble approach is useful for characterising the error resulting from an incomplete description of the situation observed. At the expense of increased data volume, an ensemble provides the user with
1. a more appropriate representation of the uncertainty resulting from the realisation of the problem, and
2. the freedom to select the portrayal(s) of the data most appropriate to their purposes.
An ensemble also facilitates the intercomparison of different methodologies, through which techniques can be refined or rejected.

Evaluating errors in a satellite observation
Despite their extensive use in the community (and this paper), the classification of errors as random or systematic is limited. A random error can appear to introduce a systematic bias after propagation through a non-linear equation due to its asymmetric distribution, and the distribution of a systematic error has finite width. The use of these terms is better understood as synonyms for the non-technical meanings of noise and bias, respectively. The GUM chose to eschew classification of error altogether, instead classifying uncertainties as type A and B depending on whether they were calculated from an observed frequency distribution (i.e. traditional statistical techniques) or an assumed probability density function. This provides an important focus on the different techniques through which uncertainty is calculated, but does not address the interest of data users in understanding the cause of errors in a measurement. The source of an error affects how it is realised and its relative importance in the eyes of data producers and users. Five classifications of error by source are proposed, which will be discussed in turn.

Measurement errors
Measurement errors result from statistical variation in the measurand or random fluctuations in the detector and electronics. To assess these accurately, it is important that a measurement is traceable to a well-documented standard. This requires the straightforward (if not simple) comparison of an instrument to a thoroughly characterised reference. Further, the response of any instrument will evolve over time, necessitating the periodic repeat of calibration procedures.
Satellite radiometers are characterised prior to launch (e.g. Hickey and Karoli, 1974; Barnes et al., 1998; Tanelli et al., 2008), to varying levels of accuracy, providing a traceable assessment of uncertainty. However, the stresses of launch can irrevocably and unpredictably alter the behaviour of an instrument, such that this assessment merely provides a first guess of the performance in practice (e.g. Kummerow et al., 2000). It is impossible to perform calibration in orbit analogous to the laboratory-based format. Some instruments carry calibration sources to provide continual, in situ evaluation (e.g. Smith et al., 2012). Though designed to be more robust than the instrument itself, these have been shown to have stability issues (Xiong et al., 2010). Hence, it is unreasonable to expect a traceable assessment of uncertainty for a satellite-borne sensor analogous to any ground-based instrument.
Vicarious methods of calibration can be used, whereby the response of the instrument to a known stimulus is considered (e.g. Slater et al., 1996; Fougnie et al., 2007; Powell et al., 2009; Kuze et al., 2014). For example, radiometers have been calibrated by observing an area of the Libyan desert known to have a very stable surface reflectance over time (Smith et al., 2002) or the Moon (Eplee et al., 2011). This can complement pre-launch calibration or may be the only direct calibration possible (Heidinger et al., 2003). Calibrations are periodically re-evaluated and new data sets released (e.g. the recent ATSR V1.2 or MODIS L1B Collection 6). For such calibrations to be traceable, it is necessary to establish international standard reference sites that are independently and regularly monitored.

Parameter errors
Retrievals using satellite observations virtually always require auxiliary information as there is insufficient information available to retrieve all parameters of the atmosphere and the surface simultaneously. For example, the accuracy of line-by-line radiative transfer calculations depends upon the spectroscopic data used (see, for example, Fischer et al., 2008). Parameters will be produced by an independent retrieval and have associated uncertainties. If uncertainty is reported via a standard deviation, it can be propagated using Eq. (3). More complex uncertainties can be represented through an ensemble.

Approximation errors
It is not always practical to evaluate the most precise formulation of a forward model. For example, the atmosphere may be approximated as plane parallel to simplify the equations or look-up tables (LUTs) may be used rather than solving the equations of radiative transfer. Such approximations will introduce error. Often known as "forward model error" (Rodgers, 2000), it can be assessed by comparing the performance of the rigorous and simplified forward models through simulated data. These errors can be highly state-dependent but should also be small (as otherwise the approximation was misguided), such that it should be appropriate to quantify the maximum error and convert that into an effective standard deviation (GUM Clause 4.3). To continue the analogy of Sect. 2.4, an approximation error would result from assuming the bucket is perfectly cylindrical when it is actually slightly tapered.
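The comparison of a rigorous and a simplified forward model can be sketched as follows. The exponential `exact_model` is a hypothetical stand-in for an expensive calculation, the 11-node table and the test grid are arbitrary, and treating the error as uniformly distributed within the maximum (giving the factor of the square root of three for a rectangular distribution in GUM Clause 4.3) is an assumption of this example.

```python
import math


def exact_model(x):
    """Hypothetical stand-in for a rigorous forward model."""
    return math.exp(-x)


def make_lut(model, lower, upper, n_nodes):
    """Tabulate the model on an evenly spaced grid."""
    step = (upper - lower) / (n_nodes - 1)
    nodes = [lower + i * step for i in range(n_nodes)]
    return nodes, [model(x) for x in nodes]


def lut_model(x, nodes, values):
    """Simplified forward model: linear interpolation in the look-up table."""
    step = nodes[1] - nodes[0]
    i = min(int((x - nodes[0]) / step), len(nodes) - 2)
    weight = (x - nodes[i]) / step
    return (1.0 - weight) * values[i] + weight * values[i + 1]


nodes, values = make_lut(exact_model, 0.0, 2.0, 11)
test_points = [2.0 * k / 1000 for k in range(1001)]
max_error = max(abs(lut_model(x, nodes, values) - exact_model(x))
                for x in test_points)
# Assume the error is uniformly distributed within +/- max_error.
effective_sd = max_error / math.sqrt(3.0)
print(f"maximum error = {max_error:.5f}, effective sd = {effective_sd:.5f}")
```

Because approximation errors can be state-dependent, the test points should span the range of states the retrieval is expected to encounter.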

Definition of the measurand
How a measurand is defined affects which errors are relevant. Summarising clause D.3 of the GUM, consider the use of a micrometer to measure the thickness of a sheet of paper. As the sheet will not be uniform, the true value depends on the precise location of the measurement. Hence, when measuring "the thickness of this sheet of paper", the variation of thickness across the sheet is an additional source of error to be considered when estimating the uncertainty. This error can be neglected by defining the measurand as "the thickness of this sheet of paper at this point", but that is of little practical use. Similarly, "the thickness of a sheet of paper from this supplier" is a more useful measurand, for which the error due to variations between different sheets would also need to be considered.
A datum in a satellite product is understood to represent an average of some physical quantity over the observed pixel at a specified time. Compared to the situations considered in the GUM, these suffer a number of important limitations.
1. It is not possible to redefine the scope of the measurand (i.e. changing from "this sheet of paper" to "a sheet from this supplier") as that is prescribed by the optics of the instrument. What will be called the resolution error derives from the inability of the measurement to resolve the desired measurand. This generally results from variations in the quantity on scales smaller than a pixel, analogous to the variations in thickness over a sheet of paper.
2. The perturbations are not necessarily independent. For example, in the open ocean it is reasonable to expect that mixing will homogenise SST over a pixel, but in coastal waters variations in depth and sediment concentration introduce spatially correlated perturbations that will not average to zero.
3. Unlike the thickness example, it is not possible to repeat the observation. Atmospheric states evolve over minutes to hours and influence (to some extent) any environmental observation such that two instruments can never strictly observe the same state. This contrasts with laboratory-based measurements, where experiments generally accumulate statistical confidence through repeated measurement of equivalent circumstances.
The last point can be addressed by averaging adjacent pixels from the same sensor. When done with Level 1 data, this is known as superpixeling (Munechika et al., 1993). It is commonly used in aerosol retrievals to reduce measurement error (e.g. Sayer et al., 2010a), as aerosols are assumed to vary over scales much larger than a pixel (order 50 km, Anderson et al., 2003). Such averaging is not valid in the presence of cloud, which is fundamentally a stochastic feature with an extended region of influence (Grandey and Stier, 2010).
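Superpixeling amounts to block-averaging the image before retrieval. The sketch below assumes a homogeneous scene with independent Gaussian noise on each pixel, which is the favourable case; the function name `superpixel`, the 40 × 40 scene, and the 4 × 4 block size are all illustrative choices.

```python
import random
import statistics


def superpixel(image, factor):
    """Average non-overlapping factor x factor blocks of a 2-D list of values."""
    rows = len(image) - len(image) % factor
    cols = len(image[0]) - len(image[0]) % factor
    averaged = []
    for r in range(0, rows, factor):
        averaged.append([
            sum(image[r + i][c + j]
                for i in range(factor) for j in range(factor)) / factor**2
            for c in range(0, cols, factor)
        ])
    return averaged


# Hypothetical homogeneous scene with independent noise of unit standard deviation.
rng = random.Random(0)
scene = [[100.0 + rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(40)]
coarse = superpixel(scene, 4)

noise_before = statistics.stdev(v for row in scene for v in row)
noise_after = statistics.stdev(v for row in coarse for v in row)
print(f"noise before: {noise_before:.2f}, after 4x4 averaging: {noise_after:.2f}")
```

For independent noise the spread falls roughly as the square root of the number of pixels averaged; correlated perturbations or genuine sub-block variability, such as cloud, would not average away in this manner.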
When Level 2 data are aggregated onto a regular grid, the result is Level 3 data.Averages over hundreds of kilometres and days to weeks are similar to the scales evaluated by climate models, and the volume of data is vastly more manageable.Such data are susceptible to additional limitations.
-The definition of the measurand is even more important.It may appear sufficient to describe a product as (for example) "average SST in March 2005 over 30-31 • N and 10-11 • W", but the satellite's spatial sampling will greatly affect the value.Comparison of satellite products to model outputs can only be successful if the model is sampled as if observed by that satellite (so called "instrument simulators", e.g.Sayer et al., 2010b).
-Satellite products are only representative of the time they observe (Privette et al., 1995).If the quantity has a diurnal cycle, the measurand should be described as an average at a specific time.That time may evolve through a record due to satellite drift, such that data from the beginning of such a record may not be directly comparable to those at the end.
-Resolution errors are a function of the pixel size and the variability of the measured quantity.A satellite datum is interpreted as a spatial average over the footprint of the pixel.This presumes that the value retrieved is equal to the average of retrievals from infinitely high spatial resolution data (i.e. the derivative of the product with respect to the measurement is linear for variations within the pixel).While this approximation holds in many circumstances, it is not universally true and certainly breaks down as pixels are aggregated to represent a larger spatial scale.
- For retrievals that use an a priori constraint, each retrieved value contains a contribution from the a priori. When averaging, if the a priori is not "removed" from each value, it will contribute repeatedly to the average, biasing it. Neglecting covariance between state vector elements, this can be done via [...]. To account for covariance, see Eq. (10.47) of Rodgers (2000). The values $x_i$ can then be averaged as desired, explicitly including the a priori value once.
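The removal of the a priori before averaging can be illustrated for a single state element. The sketch below is not the expression used in this paper; it assumes the simplest scalar case, in which each retrieved value is the inverse-variance weighted mean of a measurement-only estimate and a common prior. The function names and numbers are hypothetical.

```python
def combine(x_m, var_m, x_a, var_a):
    """Build the example: fold the prior (x_a, var_a) into a measurement-only estimate."""
    var_hat = 1.0 / (1.0 / var_m + 1.0 / var_a)
    return var_hat * (x_m / var_m + x_a / var_a), var_hat


def remove_prior(x_hat, var_hat, x_a, var_a):
    """Invert combine(): recover the measurement-only estimate and its variance."""
    w_m = 1.0 / var_hat - 1.0 / var_a  # weight carried by the data alone
    return (x_hat / var_hat - x_a / var_a) / w_m, 1.0 / w_m


def average_with_single_prior(retrievals, x_a, var_a):
    """Weighted average of retrievals that share one prior, counting that prior once."""
    num, den = x_a / var_a, 1.0 / var_a
    for x_hat, var_hat in retrievals:
        x_m, var_m = remove_prior(x_hat, var_hat, x_a, var_a)
        num += x_m / var_m
        den += 1.0 / var_m
    return num / den, 1.0 / den


# Hypothetical values: a common prior and two measurement-only estimates
x_a, var_a = 0.10, 0.04
measurements = [(0.30, 0.01), (0.34, 0.02)]
retrievals = [combine(x_m, var_m, x_a, var_a) for x_m, var_m in measurements]
mean, var = average_with_single_prior(retrievals, x_a, var_a)
# Simple mean of the retrieved values, in which the prior contributes to every term
naive = sum(x_hat for x_hat, _ in retrievals) / len(retrievals)
```

In this example the simple mean of the retrieved values is pulled towards the prior by every term, whereas `average_with_single_prior` weights the prior as one piece of information among three.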
Level 2 data can also be averaged while remaining on the satellite grid (for example, Hsu et al., 2013), which could be referred to as Level 2.5 data.

Impact of sampling
The interaction of cloud with the radiation field is sufficiently complex and variable that it is not generally possible to retrieve its properties simultaneously with the surface and/or other atmospheric constituents. Hence, most atmospheric measurements are pre-filtered for the presence of cloud via one of a plethora of empirical techniques (e.g. Ackerman et al., 1998; Stowe et al., 1999; Pavolonis and Heidinger, 2004; Curier et al., 2009). This constrains the retrieval to observations believed to be appropriate to the forward model used.
The filtering process impacts the sampling of the product, as regions with persistent cloud cover will be neglected.
Level 3 products are particularly susceptible to these sampling effects. The concept is also known as "fair-weather bias" as the exclusively clear-sky conditions considered are not necessarily representative of the long-term average conditions that the measurand purports to describe (an example can be found in Levy et al., 2009). Ensemble techniques can be used to characterise this error either by demonstrating the changes in coverage as a function of the cloud filter used or by explicitly considering cloudy conditions as an alternative realisation of the system (for which the state vector will likely be different).
Filtering can remove exceptional events. Aerosol retrievals often assume all data with optical thickness above some threshold are cloud contaminated, but it is possible for dust or volcanic ash to achieve an optical thickness above any useful threshold. This systematically removes high optical depths from long-term averages, producing a low bias in average products and failing to characterise the largest (and potentially most important) events. Such limits should be stated within the product definition to make this distinction clear.
Sampling is also affected by the instrument swath. As examined in Sayer et al. (2015), there is often a distortion of pixel size, shape, and overlap near the edges of a swath (e.g. the MODIS "bowtie effect"). The local solar time of pixels is variable across any swath. These effects complicate the definition of the measurand and raise important questions for the production of Level 3 data: Should overlapping data from different swaths be combined despite differences in local time? When combining pixels, should they be weighted by their area? Should distorted pixels be excluded from such averages entirely?
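The low bias introduced by a threshold filter can be seen with a few numbers. The following is a minimal sketch with invented optical depths and an arbitrary threshold of 1.0; neither is taken from any real product.

```python
def mean_after_threshold(values, threshold):
    """Mean of the values kept by a 'reject above threshold' filter, and the fraction rejected."""
    kept = [v for v in values if v <= threshold]
    rejected_fraction = 1.0 - len(kept) / len(values)
    return sum(kept) / len(kept), rejected_fraction


# Hypothetical optical depths: background aerosol plus two dust or ash events
aod = [0.10, 0.20, 0.15, 0.30, 2.50, 0.25, 3.10, 0.12]
true_mean = sum(aod) / len(aod)
filtered_mean, rejected = mean_after_threshold(aod, threshold=1.0)
```

Only a quarter of the values are rejected here, yet the filtered mean is several times smaller than the mean of all values, because the rejected values are exactly the largest ones.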

System errors
The stochastic change in TOA radiance due to the presence of cloud (or other optically thick layer such as smoke or volcanic ash) is a long-standing problem in satellite remote sensing. The issue is that the forward model, $F$ in Eq. (1), has a significantly different form for each stochastic realisation of the environment. One realisation will be referred to as a system.
If there is no a priori knowledge of which system is appropriate, the forward model could be formed from the linear sum of all possible systems; e.g.
$$F(x) = a\,F_a(x_a) + b\,F_b(x_b) + c\,F_c(x_c) + \ldots,$$
where $a, b, c, \ldots$ are the weightings of each system, which sum to unity. Each system is represented by a unique state $x_a, x_b, x_c, \ldots$, and there may be degeneracies between them (e.g. each state may quantify the surface reflectance). While this approach may be successful for some multispectral observation systems, in most cases it makes an underconstrained problem worse.
Another technique is to assume the measurements are of a specific system (i.e. one of the weights is unity and the others are zero). The choice of system is based on prior knowledge, usually relative values of radiances or their spatial variability (e.g. the cloud flagging discussed in Sect. 3.4.2). However, the choice of thresholds is often application dependent, leading to gross error (e.g. Sect. 3.2 of Holzer-Popp et al., 2015), as there is a substantial difference between asking "Is this an observation of X?" and "Is this observation suitable for analysis with my model of X?" The former desires an appraisal of the state based on data; the latter seeks to minimise forward model errors.
An alternative approach is to perform a retrieval with each relevant system in turn and choose a posteriori the best system (e.g. Levy et al., 2013). Ideally, the fit to the measurements would indicate a best choice of system, shown schematically in Fig. 4. Difficulty emerges when multiple systems produce values with indistinguishable fits to the measurements (e.g. the measurements can be fit equally well by a water cloud or thick aerosol haze). In either case, analogous to the 24 cm slice of Fig. 3, an unquantified error may be present due to deviations between the forward model and reality. Reporting an ensemble of all the systems evaluated allows the error to be at least sampled.
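The a posteriori choice of system can be sketched as a loop over candidate forward models. Everything in the example below is hypothetical: the two linear "forward models", the two-channel measurement, and the brute-force search that stands in for a real retrieval. It only illustrates keeping the result of every system rather than the best fit alone.

```python
def chi2(y, y_model, sigma):
    """Sum of squared, uncertainty-normalised residuals."""
    return sum(((a - b) / s) ** 2 for a, b, s in zip(y, y_model, sigma))


def retrieve(forward, y, sigma, candidates):
    """Brute-force retrieval: the candidate state with the smallest cost, and that cost."""
    best = min(candidates, key=lambda x: chi2(y, forward(x), sigma))
    return best, chi2(y, forward(best), sigma)


# Two hypothetical systems mapping a scalar state onto two channels
def water_cloud(x):
    return [0.50 * x, 0.40 * x]


def aerosol_haze(x):
    return [0.50 * x, 0.20 * x]


systems = {"water_cloud": water_cloud, "aerosol_haze": aerosol_haze}
y, sigma = [0.25, 0.20], [0.01, 0.01]
candidates = [i / 100 for i in range(101)]

# The ensemble keeps every system's solution and cost, not only the winner
ensemble = {name: retrieve(f, y, sigma, candidates) for name, f in systems.items()}
best_system = min(ensemble, key=lambda name: ensemble[name][1])
```

When the costs of two systems are comparable, `best_system` is an arbitrary choice and the spread of states in `ensemble` is the more informative product.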

Existing terminology
The combined impact of approximation, resolution, and system errors was defined as "structural uncertainty" by Thorne et al. (2005). Their emphasis was that the choices made by different investigators in the analysis of the same data can produce discrepancies. The terminology proposed above clarifies the type of choices which introduce such errors to an analysis and delineates them by the manner in which they would be assessed. Regardless, this paper would prefer "structural error" as it is the error that is structural, not its uncertainty. The term "structural uncertainty" is used by Draper (1995) to describe system errors, though with respect to statistical rather than physical models.

Summary
Measurement and parameter errors are both intrinsic sources of uncertainty in a retrieval. Measurement errors affect the quantities measured and analysed by the retrieval. Parameter errors are propagated from auxiliary inputs, such as meteorological data or empirical constants. Resolution errors result from finite sampling of a constantly varying system. These can be especially important as satellites do not sample the environment randomly but with a systematic bias due to the satellite's orbit and quality control or filtering.
Approximation errors represent aspects of the analysis that could have been done more precisely but do not affect the fundamental measurand. A plane parallel atmosphere is a simplification of the real world; it would not be observed. System errors express choices in the analysis that alter the measurand. An assumed aerosol optical model will represent a possible state of particulates in the atmosphere; it may be unlikely but still possible. The system error results from the difference between the assumed system and reality.

Retrieval validation
Validation is a vital step in the production of any data set, confirming that the data and methodology are fit for their purpose. Often thought of as the conclusion of data generation, it provides guidance for future development of the algorithm and so is better considered a step in the cycle of retrieval development (see Fig. 5). Validation should be traceable and repeatable and can take two forms that will be discussed in this section:
- Internal validation: the comparison of measurements from a single instrument;
- External validation: the comparison of measurements with correlative measurements made by a different instrument.
These can be thought of as assessing the precision and accuracy of the retrieval, respectively, and can establish that the methodology produces physically consistent results. The process should demonstrate that new data are consistent with independent results, estimate the relative error between the techniques considered, and show that the predicted uncertainties accurately describe the distribution of that error. This paper construes a validation as a comparison against real data only. There is use in evaluating the performance of an algorithm against simulated data, but that is considered a step in retrieval refinement (confirming it behaves as expected in controlled conditions) rather than a validation.

Figure 5 (caption; opening words not recovered). [...] where data are generated and disseminated. The application and critique of the data by the scientific community then feeds into further refinement of the algorithm (or entirely new algorithms). The development and operational cycles continue independent of the larger cycle but over time operations will increasingly dominate resources as the product becomes increasingly fit for purpose.

External validation
Users will be most familiar with external validation: the comparison of observations from two or more instruments. This focuses on quantifying the correlation and difference between data sets. While such validation activities are fundamental to the characterisation and minimisation of systematic errors, they should not be confused with a quantification of uncertainty. Validation techniques are neither universal (being dependent on the collocation criteria), internally consistent (as external data are used), nor transferable (being representative of only the conditions considered).

Weighting functions
When comparing two data sets, neither quantifies "the truth" (even when one is substantially more precise than the other). Both have associated errors, random and systematic, such that all that can be said is whether the products are consistent with each other. Also, simply because two measurements purport to quantify the same measurand does not mean they actually do. Weighting functions illustrate the difference in sensitivity between instruments.
As an illustration, consider cloud top height (CTH). The entire cloud emits thermal radiation, much of which will be scattered or absorbed within the cloud. Radiation from the cloud observed by a satellite corresponds to photons that found an unimpeded path to TOA. Hence, a radiometer quantifies an average of the cloud's temperature profile weighted by the probability that a photon from that level can arrive at TOA. The distribution of the weight is known as the weighting function, and is sketched in red in Fig. 6. Given the lack of information about the vertical extent of the cloud, it is common to assume the cloud is infinitely thin (e.g. Poulsen et al., 2012), and the measurand would be more accurately described as the "effective cloud radiating height". A very simple model of this situation assumes that radiation increases linearly with optical path $\tau$ measured in the direction away from the observer. That radiance is attenuated with the exponential of $\tau$, so the observed radiance $R$ can be approximated as
$$R \approx a\,\tau\,e^{-\tau},$$
where $a$ is some constant. This function has a maximum at $\tau = 1$. This result approximately holds in more detailed calculations, such that a useful rule of thumb is that a radiance can be thought of as emanating from the level of the atmosphere at unit optical path.
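The rule of thumb can be checked numerically. The sketch below assumes the simple model described in the text, a contribution that grows linearly with optical path and is attenuated exponentially, written here as $a\,\tau\,e^{-\tau}$ with an arbitrary constant $a$. The grid spacing is an arbitrary choice.

```python
import math


def radiance(tau, a=1.0):
    """Contribution from optical path tau: linear growth attenuated by exp(-tau)."""
    return a * tau * math.exp(-tau)


# Search a fine grid of optical paths between 0.001 and 5 for the peak contribution
taus = [i / 1000 for i in range(1, 5001)]
tau_peak = max(taus, key=radiance)
```

The same result follows analytically, since the derivative $a(1 - \tau)e^{-\tau}$ vanishes only at $\tau = 1$.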
The Cloud-Aerosol Lidar with Orthogonal Polarization (CALIOP) is commonly used to validate CTH (e.g. Holz et al., 2008; Stengel et al., 2013). CALIOP measures the backscatter from a pulsed laser beam as a function of height, which is predominantly a function of the number of particles in the beam. CTH is identified by the rapid increase in signal at the edge of the cloud as particle density increases. This results in a weighting function that is substantially sharper and peaked at the physical top of the cloud (black in Fig. 6).
A direct comparison of these two products will find that radiometer-retrieved CTH are consistently lower than those from the lidar. To validate the satellite against the lidar properly, it is necessary to use the satellite's weighting function to calculate an "effective cloud radiating height" from the lidar profiles (see, for example, Sayer et al., 2011). All variables retrieved may have a weighting function, such as cloud effective radius (Platnick, 2000). When measurements are compared, it must be done on a common basis.
More formally, a weighting function describes the dependence of a measurement on the underlying state. When the state chosen to describe a measurement is not an orthogonal basis of the observed state, a variable in the state vector will not uniquely determine an element of the true state. The relationship between the retrieved state $\hat{x}$ and true state $x$ is expressed by the averaging kernel $A = \partial\hat{x}/\partial x$, which satisfies
$$\hat{x} - x_a = A\,(x - x_a) + \epsilon_x, \qquad (8)$$
where $x_a$ is the a priori state and $\epsilon_x$ represents the action of $G$ on $\epsilon$.
Consider where $x$ has two elements: the CTH and total optical thickness. In the lidar retrieval, these two variables are independent; $A_\mathrm{lidar}$ is a unit matrix. In the radiometer retrieval, the CTH retrieved is a function of the optical depth profile and $A_\mathrm{rad}$ contains off-diagonal elements. To illustrate, consider when an optically thin cloud ($\tau \ll 1$) lies above a thicker cloud (Fig. 6b). The lidar will identify CTH as the physical top of the thin cloud, but the radiometer will retrieve a CTH between the clouds. As the upper cloud's thickness increases, the weighting function is increasingly dominated by the upper cloud. The retrieved CTH is dependent on the upper cloud's optical thickness. The averaging kernel would be [...]. The off-diagonal elements of the averaging kernel represent aspects of the state that cannot be resolved by the chosen basis and forward model. Here, a two-layer cloud cannot be properly represented when the basis only describes the properties of a single-layer cloud. The characterisation of an averaging kernel may require the use of an extended state vector and simulations with a more detailed model. (If the retrieval had been posed over that extended state vector, the averaging kernel would have been diagonal.)

Comparing retrieved quantities
Retrievals will be compared over some collection of observations representing only a subset of the realisable state vectors (e.g. an SST product compared to ship-based measurements will only encapsulate the variation in SST over major shipping lanes rather than globally). As systematic errors are circumstantial, this collection represents only a sample of the complete distribution; just as the definition of a measurand frames how its value can be understood and used, the scope of a validation frames the understanding of systematic errors.
Towards the aim of repeatability, validation should be performed in a manner such that, if an additional source of data were introduced (e.g. a new instrument site or satellite orbit), the conclusions would not be expected to change. In the highly common case that there are insufficient data to achieve this, the scope of the validation should be clearly outlined.
One would naïvely judge if two retrievals are consistent by considering
$$\chi^2 = (\hat{x}_1 - \hat{x}_2)^T\,(S_1 + S_2)^{-1}\,(\hat{x}_1 - \hat{x}_2), \qquad (10)$$
where $S_i$ is the covariance of a retrieved solution. Rodgers and Connor (2003) noted that this does not apply for retrievals with differing averaging kernels. If the averaging kernel is not calculated, it is not possible to compare the data from different sensors rigorously, even from the same algorithm.
Different algorithms have distinct sensitivities to the same input information. Products from different sensors consider distinct inputs and so react differently to the same atmospheric state. Even where channels with similar wavelengths are used, they will have different band passes which subtly affect their sensitivity (weighting functions). For example, the scattering properties of smaller droplets change more rapidly with wavelength than those of larger droplets. In Fig. 6c a second radiometer with a wider band pass has a broader weighting function (which will vary with droplet size because the cloud's transmission varies at the edges of the band).
When independent observations are not available to externally validate data, one can compare a product to model output provided the model is sampled as if viewed by a satellite. The retrieval's averaging kernel and weighting functions are necessary to translate the physical variables quantified by the model (e.g. particle number density) into the observed measurand. Further, a method for estimating the random error variance of a geophysical variable from three collocated data sets was proposed by Stoffelen (1998) and has become an important evaluation method in Earth observation.
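The idea behind a three-way collocation can be sketched in a few lines. This is a simplified covariance form, not the full method of Stoffelen (1998): it assumes the three data sets observe the same truth with the same calibration and that their errors are zero-mean and mutually independent. The function name and synthetic data are hypothetical.

```python
import random


def triple_collocation(x, y, z):
    """Estimate the random error variance of each of three collocated data sets."""
    n = len(x)

    def centred(v):
        m = sum(v) / n
        return [vi - m for vi in v]

    x, y, z = centred(x), centred(y), centred(z)

    def cross(a, b, c):
        # mean of (a - b)(a - c): the truth cancels in each difference and, for
        # independent errors, only the error variance of `a` survives
        return sum((ai - bi) * (ai - ci) for ai, bi, ci in zip(a, b, c)) / n

    return cross(x, y, z), cross(y, x, z), cross(z, x, y)


# Synthetic example: one truth observed with error standard deviations 0.1, 0.2, 0.3
random.seed(1)
truth = [random.gauss(0.0, 1.0) for _ in range(20000)]
x = [t + random.gauss(0.0, 0.1) for t in truth]
y = [t + random.gauss(0.0, 0.2) for t in truth]
z = [t + random.gauss(0.0, 0.3) for t in truth]
var_x, var_y, var_z = triple_collocation(x, y, z)
```

The estimates approach 0.01, 0.04, and 0.09 as the number of collocations grows. Correlated errors or differing sensitivities between the data sets, as discussed above, would violate the assumptions of this sketch.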

Formalism for comparison
The formalism of Rodgers and Connor (2003) is widely used in the trace gas community (e.g. Froidevaux et al., 2008; Wunch et al., 2010). It is less straightforward but equally important in any comparison of data products and will be briefly summarised. The collection of states compared is assumed to have a mean state $x_c$ with covariance $S_c$. This could be the mean of one of the data sets considered, or represent prior information, such as a climatology from a previous measurement campaign.
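Equation (8) linearises the retrieved state about the a priori state. The two retrievals are unlikely to share an a priori. Hence, to consider compatible averaging kernels it is necessary to translate both data sets to a common linearisation point, for which $x_c$ and $S_c$ are suitable. The necessary translation is
$$\hat{x}'_i = \hat{x}_i + (A_i - I)(x_{a,i} - x_c), \qquad (11)$$
where $A_i$ and $x_{a,i}$ are the averaging kernel and a priori state of retrieval $i$. The difference between retrievals is then
$$\hat{x}'_1 - \hat{x}'_2 = (A_1 - A_2)(x - x_c) + \epsilon_{x,1} - \epsilon_{x,2},$$
which has covariance
$$S_\delta = (A_1 - A_2)\,S_c\,(A_1 - A_2)^T + S_{x,1} + S_{x,2},$$
where $S_{x,i}$ is the covariance of the retrieval noise $\epsilon_{x,i}$. Thus, rather than Eq. (10), an appropriate comparison metric is
$$\chi^2 = (\hat{x}'_1 - \hat{x}'_2)^T\,S_\delta^{-1}\,(\hat{x}'_1 - \hat{x}'_2).$$
When one product is of much higher resolution, such as the comparison against CALIOP described in Sect. 4.1.1, it may be possible to transform it onto the basis of the other via
$$\hat{x}'_{12} = x_c + A_1(\hat{x}'_2 - x_c),$$
for which
$$\hat{x}'_1 - \hat{x}'_{12} = A_1(I - A_2)(x - x_c) + \epsilon_{x,1} - A_1\epsilon_{x,2},$$
which has covariance
$$S_{\delta 12} = A_1(I - A_2)\,S_c\,(I - A_2)^T A_1^T + S_{x,1} + A_1 S_{x,2} A_1^T.$$
As Eq. (11) casts each observation on the same linearisation point, these techniques can be directly applied to the comparison of more than two instruments.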
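The effect of moving two retrievals to a common linearisation point can be seen in a scalar example. The sketch assumes each retrieval is linear about its own a priori, $\hat{x} = x_a + A(x - x_a)$, with noise omitted for clarity. All values are invented and `to_common_point` is a hypothetical helper.

```python
def to_common_point(x_hat, A, x_a, x_c):
    """Scalar translation of a retrieval from its prior x_a to the common point x_c."""
    return x_hat + (A - 1.0) * (x_a - x_c)


# Hypothetical noise-free retrievals of the same true state
x_true, x_c = 5.0, 4.0
A1, xa1 = 0.9, 3.0
A2, xa2 = 0.6, 6.0
x1 = xa1 + A1 * (x_true - xa1)  # 4.8
x2 = xa2 + A2 * (x_true - xa2)  # 5.4

x1c = to_common_point(x1, A1, xa1, x_c)  # 4.9
x2c = to_common_point(x2, A2, xa2, x_c)  # 4.6
raw_difference = x1 - x2
common_difference = x1c - x2c
expected_difference = (A1 - A2) * (x_true - x_c)
```

Before translation the two values differ by -0.6, a number dominated by the two unrelated priors. After translation the difference is 0.3, which is entirely explained by the differing sensitivities $A_1 - A_2$ acting on the departure of the truth from $x_c$.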

Expected error envelopes
Expected error envelopes are a common means of presenting the result of a validation of, for example, aerosol optical depth $\tau$ (e.g. Kahn et al., 2005; Levy et al., 2010). The difference between the retrieved value and that reported by the Aerosol Robotic Network (AERONET) approximates the "error" in the retrieval. The "expected error envelope" is the width of the observed distribution of "error" and is described like an uncertainty. The value is an "envelope" because the distribution widens with increasing retrieved optical depth, such that the final value is reported as $\pm(a + b\tau)$, where $a$ represents the minimum width of the "error" distribution and $b$ represents the rate at which it widens with increasing optical depth. Envelopes can be stratified according to the observed conditions and retrieval assumptions. This is an efficient means of communicating the results of the validation against AERONET and conveys a quantitative measure of the degree of certainty the data producer has in their product. It is not, strictly, an estimation of uncertainty. Such validation techniques are neither universal (being dependent on the collocation criteria), internally consistent (as external data are used), nor transferable (being representative of only the conditions considered). Though envelopes provide a diagnostic approximation of the uncertainty, additional correction is necessary to use them as prognostic uncertainties (Hyer et al., 2011). Treating envelopes as a transferable uncertainty has led to significant difficulty integrating data from different sensors as global and local sources of error are disconnected (Holzer-Popp et al., 2014).
This application of envelopes conveys an incorrect appreciation of the uncertainty to users as it implies well-constrained random and systematic components. Though stratification by relevant circumstances (e.g. over desert, high aerosol loading) indicates that the error depends on the state observed, a simple expression cannot usefully communicate the distribution of error in any particular measurement. Only pixel-level estimates provide an uncertainty consistent with its widely accepted definition, and the presentation of ensembles, already used in the calculation of these envelopes, can better represent the distribution of errors not quantified in that estimate.
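How an envelope of the form $\pm(a + b\tau)$ is evaluated against reference data can be sketched as below. The coefficients and the four collocated pairs are invented for illustration, and the envelope is evaluated at the retrieved optical depth, following the description above; other conventions exist.

```python
def fraction_within_envelope(retrieved, reference, a, b):
    """Fraction of retrievals whose difference from the reference lies within ±(a + b*tau)."""
    inside = sum(
        1 for r, t in zip(retrieved, reference) if abs(r - t) <= a + b * r
    )
    return inside / len(retrieved)


# Hypothetical collocated optical depths (retrieved, reference)
retrieved = [0.10, 0.30, 0.80, 1.50]
reference = [0.12, 0.20, 0.75, 1.10]
fraction = fraction_within_envelope(retrieved, reference, a=0.05, b=0.15)
```

The single number returned summarises this particular set of collocations. It carries no information about which individual retrievals fall outside the envelope or why, which is the limitation discussed above.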

Internal validation
Internal validation is a less frequently discussed means to assess the precision and consistency of measurements.

Self consistency
Repeated observations of an unchanged target should sample the distribution of error, such that a histogram of the observations should be Gaussian with a standard deviation equivalent to the uncertainty. An opportunity for this type of repeated observation is rare with satellite instruments. More common is the sampling of the same point in successive orbits (often near the poles), assembling pairs of measurements of similar (if not identical) atmospheric states (e.g. Lambert et al., 1996). If the first observation is $x_1$ with uncertainty $\sigma_1$ and the second $x_2$ with $\sigma_2$, then a histogram of
$$\frac{x_1 - x_2}{\sqrt{\sigma_1^2 + \sigma_2^2}}$$
should have a mean of zero and a standard deviation of unity. The covariance of simultaneously retrieved quantities can be considered by evaluating Eq. (10) instead. Atmospheric variation may increase the observed variability so a larger standard deviation is not questionable. A variance less than one usually indicates an overestimation of the uncertainty. Significant departure from a Gaussian distribution is indicative of unidentified systematic errors. If the variable is expected to be homogeneous across a region, all observations there can be used to validate the uncertainty directly, as the variance of the observations should be greater than the average of the uncertainties.
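The behaviour of this statistic can be explored with synthetic pairs. The sketch assumes an unchanged target observed twice with independent Gaussian errors, and normalises each difference by the quadrature sum of the two reported uncertainties. The target value and uncertainties are invented.

```python
import math
import random
import statistics


def normalised_differences(pairs):
    """(x1 - x2) / sqrt(s1^2 + s2^2) for each (x1, s1, x2, s2) pair."""
    return [(x1 - x2) / math.sqrt(s1 ** 2 + s2 ** 2) for x1, s1, x2, s2 in pairs]


random.seed(2)
truth, s1, s2 = 290.0, 0.3, 0.4
pairs = [
    (random.gauss(truth, s1), s1, random.gauss(truth, s2), s2)
    for _ in range(20000)
]

z = normalised_differences(pairs)
mean_z = statistics.mean(z)
std_z = statistics.pstdev(z)

# The same observations with every reported uncertainty doubled
inflated = [(x1, 2 * u1, x2, 2 * u2) for x1, u1, x2, u2 in pairs]
std_inflated = statistics.pstdev(normalised_differences(inflated))
```

With correctly reported uncertainties the normalised differences have a mean near zero and a spread near one. Doubling the reported uncertainties halves the spread, while any real change in the target between the two observations would increase it.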

Against other algorithms
Using different forward model assumptions, statistical techniques, and/or filtering methods can produce results that may be consistent with themselves and external validation but not with each other. Differences between retrievals, in the absence of external validation data or a programming error, indicate variations in the state within the unconstrained state space. They form an ensemble that illuminates where the formulation of the problem is most relevant, highlighting where future research could be concentrated to represent the observations more carefully (Holzer-Popp et al., 2013). Belief that one representation is "better" than others independent of external validation is an expression of a priori knowledge. Such knowledge can be very useful in identifying "unknown unknowns" in a retrieval, but it is important to appreciate that any constraint not made by the data is an expression of a priori data, be it as formal as knowing that surface temperatures are generally within 40 degrees of 10 °C or as simple as believing surface pressure should not vary across a land-sea boundary.

Communication with users
Confidence in data is communicated to users through uncertainty estimates and quality assurance statements. The quantification of uncertainty illustrates how new data relate to the existing body of knowledge, but there is also the user's qualitative sense of the "worth" of data. To what extent does it constrain the variables they are investigating? When and where are the data most robust and when and where do they effectively convey no information? What do they quantify that was not already known? The aims of the user frame these questions. A detailed case study requires reliable uncertainty estimates to incorporate varied measurements and understand the limitations of the information provided, but it is impractical for a 20-year model climatology to consider a single measurement, its uncertainty even more so. Further, the "unknown unknowns" affecting satellite remote sensing data are not completely indescribable. Information such as "results are often unreliable over deserts" is still important to users, even if the uncertainty cannot be quantified. A dialogue with users is important in improving the understanding of data and receiving feedback on those data for future improvement.

Table 2 (fragment; only the final row was recovered). Total uncertainty: add above values in quadrature; add above values.

Error budget
The aim of an error budget is to classify the contributions to the uncertainty by their source. At its simplest this may be in the form of a table, as suggested in Table 2. The total uncertainty estimated in this way can be compared with that found through validation activities. Discrepancy between the two can potentially indicate that an error source has been overlooked.
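A tabulated budget can be totalled in one line. The sketch below assumes the listed sources are independent, so that their standard uncertainties combine in quadrature; the source names and values are hypothetical.

```python
import math


def total_uncertainty(budget):
    """Quadrature sum of independent standard uncertainties."""
    return math.sqrt(sum(u ** 2 for u in budget.values()))


# Hypothetical error budget for a retrieved optical depth
budget = {
    "measurement noise": 0.03,
    "surface reflectance": 0.04,
    "aerosol model": 0.12,
}
total = total_uncertainty(budget)  # sqrt(0.0009 + 0.0016 + 0.0144) = 0.13
largest = max(budget, key=budget.get)
```

Because the terms add as squares, the total is dominated by the largest source. A scatter found in validation that clearly exceeds `total` would suggest that a source is missing from the table or that the sources are not independent.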

Quality assurance
Quality assurance (or flagging) is a qualitative judgement of the performance of a retrieval and the suitability of that technique for processing the data. This complements the uncertainty, whose calculation assumes that the forward model is appropriate to the observed circumstances. Statistical distributions are unsuited to show when an algorithm fails to converge, converges to an unphysical state, encounters incomprehensible data, or observes circumstances beyond the ability of its model to describe. Provided it is described in the language of a statement of confidence, quality assurance provides useful information.
The difficulty is that a simple flag is a coarse means of communication. For example, MODIS Collection 5 aerosol products provided a data quality flag of value 0, 1, 2, or 3 to describe increasing confidence in the retrieval method (Sect. 2.5, Remer et al., 2006). This is widely used as a simple filter, rejecting data below some level. The level selected varies widely and it neglects, for example, that all low magnitude retrievals have confidence 1 due to the small signal. This will bias analyses to circumstances ideal for the chosen formulation, which are not necessarily representative of the environment (Sect. 3.4.2).
However, such filtering is a logical response to this presentation of information. A more useful scheme would provide multiple separate flags (e.g. presence of cloud, challenging surface conditions, failure to converge, etc.) in a bit mask. When these are properly documented they allow an attentive user to evaluate the impact of using data degraded by a specific feature, and the uninterested user may be inspired to consider, if only briefly, the most appropriate flags for their purposes.
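A bit mask of separate flags might look like the following sketch. The flag names and bit assignments are hypothetical and do not correspond to any existing product.

```python
from enum import IntFlag


class QualityFlag(IntFlag):
    """Hypothetical per-pixel quality flags, one bit each."""
    CLOUD_SUSPECTED = 1
    DIFFICULT_SURFACE = 2
    NOT_CONVERGED = 4
    HIGH_COST = 8


def usable(mask, reject):
    """True when none of the flags in `reject` are set in `mask`."""
    return (mask & reject) == 0


# Stored masks for five pixels, as they might be read from a product file
pixels = [0, 1, 2, 5, 8]

# One user's choice: tolerate difficult surfaces and high cost,
# but reject suspected cloud and failed convergence
reject = QualityFlag.CLOUD_SUSPECTED | QualityFlag.NOT_CONVERGED
kept = [m for m in pixels if usable(m, reject)]
```

A different study could pass a different `reject` combination to the same data, which a single ordinal confidence value does not allow.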

Distinction between maturity and uncertainty
Satellite remote sensing data have existed for several decades, but the retrieved geophysical quantities evolve as additional auxiliary data become available and new scientific problems appear. For example, AVHRR measurements from 1978 are still reprocessed for climate studies (Stengel et al., 2013; Heidinger et al., 2014). Figure 5 outlines the interlinking cycles of algorithm and operational development. Figure 7 illustrates how the repeated refinement and validation of data is a fundamental expression of the scientific method in data analysis. The cycle describes the ongoing conversation through which measurements and algorithms are improved in response to their use until a consensus is built that either:
1. the data set sufficiently addresses the needs of its users; or
2. the maximal amount of information has been extracted from the measurement and additional information is required to meet the needs of users.
The progress of a data set from initial conception to the achievement of one of these goals is known as its maturity. Bates and Barkstrom (2006) and Bates and Privette (2012) have outlined the system maturity matrix as a standardised metric to quantify the maturity of a product, briefly summarised in Table 3. It provides a means to track the development of an algorithm and data set from initial concept to an

Figure 7. The sequence of scientific output needed to underpin satellite observations. The instrument, calibration, and algorithm descriptions may be contained in one or more publications. Significant iterations of the retrieval algorithm are usually described in a new publication. Panel text recovered from the figure:
- Calibration: prelaunch characterisation of instrument radiometric response referenced to international standard; post-launch evaluation of instrument performance against onboard reference and/or vicarious targets.
- Algorithm description: description of how measurements are converted into geophysical quantities; quantification of the uncertainty budget; evaluation of theoretical performance for reference atmospheric states.
- Validation: characterisation of the retrieved geophysical quantities over observation space; a description of the uncertainty as a function of state and its stability over time.
- Application: use of geophysical results to characterise or describe the state of the atmosphere or processes within it.
- Algorithm development cycle (label).
Table 3. Levels of system maturity, as defined in Bates and Barkstrom (2006).
The appropriate presentation of data with thorough documentation and metadata produced using a publicly available, consistently realised computer code is a desirable aim. Such features should be included in any algorithm from inception to minimise simple mistakes and the misunderstanding of data by users. However, the presence of such features does not address the scientific quality or importance of the data.
The proposed metric simply counts the citations the data have received, disregarding the variety of applications and their impact upon scientific understanding.Participation in international data assessments works towards this aim, but only when there are multiple means of observing or evaluating a measurand.These are not available for many environmental variables, and they should not be considered immature if they make the best use of the information available (goal 2).
It is important that an inexperienced user should not misinterpret data with a high maturity index as being more accurate or suited to a particular study.A mature data set is one which is near the end of its development cycle in that it is agreed to be fit for purpose by the scientific community.This must not be confused with a data set that fully constrains the measurand.
With specific regard to the evaluation of uncertainty:
- As discussed in Sect. 3.1, SI traceability is not possible for a satellite instrument in the traditional meaning of that phrase. The environmental science community as a whole must develop ground-based, traceable standards for satellite instruments, such as well-characterised and monitored surfaces. The current metric penalises products that have no such standard to reference.
- The spatial covariance of error in a product can only be quantified through validation against spatially distributed, independent data. Satellite remote sensing is used for many environmental products because they are impractical to measure from the ground. In such cases it is not possible to assess covariance errors independently. Ensemble techniques may be useful there.
- A distinction must be made between internal and external validation activities. An international assessment of multiple, independent products from different measurement techniques that quantify equivalent measurands represents the external validation of a mature research area. An internal validation of differing algorithms from the same sensor evaluates the relative properties of the algorithms, not their suitability for quantifying the measurand.
Monitoring the progress of algorithm development must be done in a manner which encourages researchers to follow the fundamental scientific method (Fig. 7) whereby the interpretation of geophysical properties or processes is underpinned by a description of instrument calibration, the retrieval algorithm, and product validation.Maturity is an ex-

Conclusions
An appreciation of the range of values consistent with a measurement is necessary to apply and to contextualise data. Three qualities were identified by the Guide to the Expression of Uncertainty in Measurement (Working Group 1, 2008) as necessary for an expression of uncertainty to be useful:
- universality: all manners of observation can apply the techniques to calculate their uncertainty;
- internal consistency: the calculation of uncertainty requires no information beyond that used in the analysis;
- transferability: the uncertainty must be of use to a data user.
This paper classifies errors affecting satellite remote sensing data into five groups:
- measurement: intrinsic variability in the observation;
- parameter: errors propagated from auxiliary data;
- approximation: explicit simplifications in the formulation of the forward model;
- system: differences between the chosen description of the environment and reality;
- resolution: variability at scales smaller than that observed.
In the terminology of Thorne et al. (2005), the first two result in parametric errors and the remainder in structural errors. Measurement and parameter errors are generally well represented by the traditional propagation of random perturbations. These are useful but only describe one aspect of the uncertainty: the "unknowns" that are known and quantifiable. Approximation and system errors represent the inability of the analysis to describe the environment observed and are the dominant source of error in most passive satellite remote sensing data (as it is not possible to constrain the complex behaviour of the environment with a few TOA radiances). Data producers are aware of these additional "unknowns", such as the representation of the surface's bi-directional reflectance, but cannot quantify them in the manner required for traditional error propagation (i.e. they are known, unquantifiable unknowns). Even well-constrained analyses will be affected by system errors resulting from quality control, cloud filtering being the most common. Resolution errors describe the disconnect between what occurs in nature and the means by which it is observed, primarily resulting from the instrument's sampling.
The difficulty with the last three categories of error is that they can be highly non-linear: their magnitude and nature depend upon the state observed and the ability of the forward model to describe it. Propagation of errors assumes that the equations used are accurate and that errors affect them linearly. Uncertainties currently reported with satellite remote sensing data represent neither the actual (non-linear) distribution of errors nor the full range of information known about the errors.
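As a rough illustration of why linear propagation can misrepresent a non-linear error distribution, the sketch below pushes Gaussian radiance noise through the inverse Planck function (the transformation of Fig. 2b) and compares first-order propagation with a Monte Carlo sample. The 11 µm wavelength, 220 K scene, and 20 % noise level are illustrative assumptions rather than values from this paper, and the noise is deliberately exaggerated so that the asymmetry is visible.

```python
import numpy as np

H = 6.62607015e-34   # Planck constant, J s
C = 2.99792458e8     # speed of light, m s^-1
K = 1.380649e-23     # Boltzmann constant, J K^-1
WAVELENGTH = 11e-6   # assumed channel wavelength, m (illustrative)

def planck(temp):
    """Black-body spectral radiance (W m^-2 sr^-1 m^-1) at temp (K)."""
    return 2 * H * C**2 / WAVELENGTH**5 / np.expm1(H * C / (WAVELENGTH * K * temp))

def brightness_temperature(radiance):
    """Invert the Planck function at the assumed wavelength."""
    return H * C / (WAVELENGTH * K) / np.log1p(2 * H * C**2 / (WAVELENGTH**5 * radiance))

def linear_sigma(radiance, sigma_radiance, step=1e-6):
    """First-order propagation: sigma_T = |dT/dL| * sigma_L (central difference)."""
    d_rad = step * radiance
    slope = (brightness_temperature(radiance + d_rad)
             - brightness_temperature(radiance - d_rad)) / (2 * d_rad)
    return abs(slope) * sigma_radiance

def monte_carlo_percentiles(radiance, sigma_radiance, n=200_000, seed=0):
    """16th, 50th and 84th percentiles of temperature under Gaussian radiance noise."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(radiance, sigma_radiance, n)
    samples = samples[samples > 0]          # discard unphysical radiances
    return np.percentile(brightness_temperature(samples), [15.865, 50.0, 84.135])

radiance = planck(220.0)                    # assumed cold cloud-top scene
sigma_radiance = 0.2 * radiance             # exaggerated noise, for illustration only

sigma_lin = linear_sigma(radiance, sigma_radiance)
p16, p50, p84 = monte_carlo_percentiles(radiance, sigma_radiance)
print(f"linear: +/-{sigma_lin:.2f} K; Monte Carlo: -{p50 - p16:.2f} / +{p84 - p50:.2f} K")
```

The single symmetric interval from linear propagation sits between the two unequal Monte Carlo half-widths, so it understates the error on one side and overstates it on the other. This addresses only a known, quantified error; it says nothing about approximation, system, or resolution errors.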
This can be addressed in various ways. Firstly, uncertainty estimates in satellite remote sensing data must be presented at pixel level. Pervasive quantifications misrepresent the dependence of error upon state and rely on external information. While pixel-level estimates will not represent the impact of unquantified unknowns, it is important that uncertainty be presented in a context that represents the data producer's confidence in and understanding of their data.
Ensemble techniques can be used to represent unquantifiable unknowns. The under-constrained nature of many satellite observations means that multiple realisations of a data set that are consistent with measurements can be derived by using conflicting descriptions of the environment, such as assumptions of particle microphysical properties or differing calibration coefficients. In the absence of a priori constraints, each of these realisations is feasible and should be presented together. This is common practice in the climate modelling community, and the satellite remote sensing community should capitalise on users' experience to improve communication of the uncertainty in products.
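A minimal sketch of this idea follows, assuming a toy forward model y = 1 − exp(−kx) in which the coefficient k stands in for an assumed particle property. The model, the three system names, and their k values are hypothetical and chosen only for illustration. Each system inverts the same measurement exactly, so the measurement alone cannot discriminate between the members.

```python
import numpy as np

# Hypothetical systems: each assumes a different (unmeasured) coefficient k.
SYSTEMS = {"small_particles": 0.8, "medium_particles": 1.0, "large_particles": 1.3}

def forward(x, k):
    """Toy forward model mapping state x to a measurement in [0, 1)."""
    return 1.0 - np.exp(-k * x)

def retrieve(y, k):
    """Analytic inverse of the toy forward model for an assumed k."""
    return -np.log1p(-y) / k

def ensemble(y):
    """One realisation of the retrieved state per assumed system."""
    return {name: retrieve(y, k) for name, k in SYSTEMS.items()}

measurement = 0.45
members = ensemble(measurement)

# Every member reproduces the measurement, so none can be rejected by it.
residuals = {name: forward(x, SYSTEMS[name]) - measurement for name, x in members.items()}

values = np.array(list(members.values()))
ensemble_range = (values.min(), values.max())

for name, x in members.items():
    print(f"{name}: x = {x:.3f}")
print(f"range across ensemble: {ensemble_range[0]:.3f} to {ensemble_range[1]:.3f}")
```

Reporting every member, rather than a single value with a propagated error bar, shows the user how far the answer depends on a choice that the measurement does not constrain.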
The manner in which a measurand is defined affects both the sources of error that must be considered (e.g. resolution errors) and the manner in which the data must be compared with other measurements. In an under-constrained problem, it is often not possible to report a value that is uniquely constrained by those conditions (i.e. the state vector elements do not form a basis of the observed conditions). This can result in the retrieved value being sensitive to multiple features of the environment, as quantified by the averaging kernel. When comparing data sets, it is important to ensure that equivalent quantities are being compared or biases will be observed that are a function of the system definition rather than an error in the retrieval. The necessary transforms were outlined in Rodgers and Connor (2003).
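One widely used form of such a transform smooths a better-resolved profile with the averaging kernel A and a priori x_a of the coarser retrieval, x_s = x_a + A(x_h − x_a), before the two are compared. The sketch below assumes that this is the relevant transform here. The function name and the four-level profile, a priori, and kernel are made-up values for illustration.

```python
import numpy as np

def smooth_to_retrieval(x_high, x_apriori, averaging_kernel):
    """Express a better-resolved profile as the coarser retrieval would see it."""
    x_high = np.asarray(x_high, dtype=float)
    x_apriori = np.asarray(x_apriori, dtype=float)
    return x_apriori + np.asarray(averaging_kernel, dtype=float) @ (x_high - x_apriori)

# Illustrative (made-up) four-level profile, a priori and averaging kernel.
x_high = np.array([0.5, 2.0, 1.5, 1.0])
x_apriori = np.ones(4)
kernel = np.array([
    [0.6, 0.2, 0.0, 0.0],
    [0.2, 0.5, 0.2, 0.0],
    [0.0, 0.2, 0.5, 0.2],
    [0.0, 0.0, 0.2, 0.3],
])

x_smoothed = smooth_to_retrieval(x_high, x_apriori, kernel)
print(x_smoothed)
```

With an identity kernel the profile is returned unchanged, and with a zero kernel the result collapses to the a priori. Between these limits, any difference between x_smoothed and x_high reflects the definition of the system rather than an error in the retrieval.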
As not all errors can be quantified, there is also qualitative information necessary to appreciate the applicability of data and, as a data set evolves, it is important to assess both the degree to which it represents a scientific advancement and to which it satisfies the needs of its users. This information can be conveyed through product user guides, validation studies, quality assurance flags, and/or measures of a retrieval system's maturity. It is both important that this information is readily available to users and that it is communicated in the language of a statement of confidence. Continuous interaction with users will be necessary to improve these reports to ensure they communicate the desired information. Of particular importance are the following:
- an error budget outlining the quantified sources of error;
- a description of the available quality control information and its physical meaning to enable users to apply it in an educated fashion;
- known weaknesses of the data that are not represented by the uncertainty.
This paper concentrated on passive remote sensing, but the clear communication of uncertainty to users is still important in active remote sensing. The different definitions of active and passive measurands must be appreciated if they are to be compared. Active data are generally better constrained than passive and are often analysed with analytical equations, where approximations and system choices are substantially less important but still present (for example, the Ångström coefficient, the lidar ratio, and multiple scattering). These errors are minimised, in part, by selecting measurands closely aligned with the measurement (e.g. backscatter, extinction, reflectivity, depolarisation). Approximation and system errors can become important when calculating more poorly constrained, physical parameters such as particle size or number. Resolution errors are more obvious with active sensing owing to the narrow swath of active instruments.
Evaluating the quality of an algorithm using existing metrics limits the ability of the satellite remote sensing community to communicate their understanding of the uncertainties in their products to users in an efficient or effective manner. Without that dialogue, users cannot appropriately use data and cannot feed back to data producers to improve them. The hope is that by representing uncertainties in satellite remote sensing data through ensembles, understanding of the limitations of the data will increase, highlighting areas for future research. Through continual communication among the entire scientific community, unknown unknowns can become known and, eventually, make the use of ensembles unnecessary as understanding of the environment converges upon the truth.

Figure 2. Distortion of the distribution of error for different selections of measurand when observing a cloud. (Non-linearities exaggerated for illustration.) (a) Measured TOA radiance suffers random errors, which have a symmetric distribution. (b) Transformation with the Planck function warps the distribution when reporting cloud top temperature. (c) These are further distorted when cloud top pressure is calculated. An additional error (grey; not to scale) is introduced by the auxiliary data used in that calculation, giving an irregular total distribution (black).
Figure 3. An ensemble of forward models for the volume of a bucket (x axis) as a function of its mass (y axis). A third parameter, the bucket's height, is not measured and so must be assumed. Its impact is shown over five slices of the z axis. Solid, dotted, and dashed lines denote cylindrical, hemispherical, and conical buckets respectively. The material is assumed to have thickness 1 mm and density 2.7 g cm⁻³.

$$y = a\,F_{\text{clear sky}}(x_a, b_a) + b\,F_{\text{cloud}}(x_b, b_b)$$

Figure 4. One-dimensional representation of a retrieval considering multiple systems (realisations of the forward model that do not necessarily retrieve the same variable). For a system, the retrieved state is the minimum of its cost function (indicated by a circle). The state with globally minimal cost (across all systems) is a posteriori taken as the best representation of the observed environment.

Figure 5. The cycle of retrieval development. The initial formulation and algorithm are repeatedly revised in light of internal validation activities. When consistent results are achieved, an external validation is performed (and published) to begin the operational cycle, where data are generated and disseminated. The application and critique of the data by the scientific community then feeds into further refinement of the algorithm (or entirely new algorithms). The development and operational cycles continue independent of the larger cycle but over time operations will increasingly dominate resources as the product becomes increasingly fit for purpose.

Figure 6. Schematic of the weighting functions for CTH for an infrared radiometer (red) and lidar (black), with dashed lines denoting the value retrieved. (a) For a thick cloud, the radiometer is most sensitive to the region one optical depth into the cloud while the lidar detects the physical cloud top. (b) The lidar's sensitivity is unchanged when a thin cloud lies over a thicker one, but the radiometer observes both clouds, resulting in an unphysical CTH somewhere between the two. (c) Compares (b) with the weighting function for a wider radiometer band (blue, exaggerated).

Table 2. Example of an error budget.