Data-driven clustering of rain events: microphysics information derived from macro-scale observations

Rain time series records are generally studied using rainfall rate or accumulation parameters, which are estimated for a fixed duration (typically 1 min, 1 h or 1 day). In this study we use the concept of “rain events”. The aim of the first part of this paper is to establish a parsimonious characterization of rain events, using a minimal set of variables selected among those normally used for the characterization of these events. A methodology is proposed, based on the combined use of a genetic algorithm (GA) and self-organizing maps (SOMs). It can be advantageous to use an SOM, since it allows a high-dimensional data space to be mapped onto a two-dimensional space while preserving, in an unsupervised manner, most of the information contained in the initial space topology. The 2-D maps obtained in this way allow the relationships between variables to be determined and redundant variables to be removed, thus leading to a minimal subset of variables. We verify that such 2-D maps make it possible to determine the characteristics of all events, on the basis of only five features (the event duration, the peak rain rate, the rain event depth, the standard deviation of the rain rate event and the absolute rain rate variation of the order of 0.5). From this minimal subset of variables, hierarchical cluster analyses were carried out. We show that clustering into two classes allows the conventional convective and stratiform classes to be determined, whereas classification into five classes


Introduction
The analysis of "precipitation events" or "rain events" can be used to obtain information concerning the characteristics of precipitation at a particular location and for a specific application. This is a convenient way to summarize precipitation time series in the form of a small number of characteristics that make sense for particular applications.
The concept of a precipitation event is not new and has been used for many years (Eagleson, 1970; Brown et al., 1984). A wide variety of definitions, varying according to the context of each study, have been reported in the literature (Larsen and Teves, 2015). Moreover, when a rain rate time series (generally based on rain gauge records) is broken down into individual rainfall events, a wide variety of their characteristics, such as average rainfall rate, rain event duration and rainfall event distribution (known as hydrological information), can be computed for each event. Our analysis of the literature has led to the identification of 17 features used to characterize rainfall, which makes it quite difficult to compare different studies. The first goal of the present study is to select a reduced set of features characterizing rainfall events, through the use of a data-driven approach, without taking a priori knowledge of the field of application into account, thereby characterizing rainfall events in the most parsimonious and efficient manner.
The second goal is to assess, without using any a priori criteria, whether the rain events are still correctly clustered by the most relevant observed features. Indeed, atmospheric process specialists distinguish between stratiform and convective events, arguing that the physical processes involved in their evolution are different. The goal here is to check that a small sample of variables, derived from spot measurements to describe rain events, can allow this distinction to be made and ultimately be used to refine it.
Published by Copernicus Publications on behalf of the European Geosciences Union.
Hydrological (hereafter referred to as "macrophysical") information makes use of rain gauge measurements to characterize rain events. This information is defined in order to characterize the features of global events but not to provide any information concerning the raindrop microphysics of the event. Nevertheless, in many applications such as remote sensing, knowledge of the microphysics is essential. One key parameter in remote sensing is the raindrop size distribution, noted as N(D), which is defined by the number of raindrops per unit volume and per unit raindrop diameter (D). Information related to the raindrop size distribution is often derived from its proxies, as explained in Sect. 5. Such features are not currently accessible through rain gauge measurements, which provide macrophysical information only. However, more expensive devices referred to as disdrometers can provide both hydrological and microphysical information. There are currently several tens of thousands of rain gauges operated throughout the world, in locations equipped with a far smaller number (if any) of disdrometers. As described later in this paper, it is possible to retrieve some microphysical information from the hydrological data. As a consequence, rain gauge data could provide valuable information in microphysics studies through the use of a statistical approach to indirectly infer the missing microphysics information. In the following, the terms "macrophysical" or "hydrological" information are associated with characteristics related to rain rates or rain accumulation, whereas the term "microphysical" is associated with the characteristics of the raindrop size distribution.
In the present study, we use a data-driven approach to study the relationships between different rain properties. As disdrometers provide drop size distributions, they allow 1-minute (or shorter) rain rates to be estimated, and in the present study these can be used to derive the hydrological information of interest, which is coherent with the data that would be provided by standard rain gauges. Through the combined interpretation of microphysical and hydrological information, we are also able to analyze the microphysical properties of the rain event clusters provided by our algorithm. This makes it possible to retrieve (unobservable) microphysical information from rain gauge measurements.
From a single rain rate time series, observed with a 1-minute time resolution, we seek to answer the following questions:
- Among the large number of hydrological variables described in the literature, which are the most significant?
- Does the resulting description of rain events allow different types of rain event to be discriminated?
- What (unobserved) microphysical properties of an event, or type of rain event, can be inferred from its macrophysical description?
Our paper is structured as follows. Section 2 presents the data used in our study and lists various hydrological parameters that are commonly found in the literature. Seventeen macrophysical variables are identified, requiring appropriate normalization. Section 3 presents our methodology, which is based on the use of a genetic algorithm (GA) implementing a self-organizing map (SOM, also referred to as a topological map). This unsupervised approach is used to select a small subset of variables from the 17 identified variables, allowing a parsimonious characterization of rainfall events to be applied. An exploratory statistical analysis of rainfall events is provided. In Sect. 4, the rainfall events are grouped in clusters and are divided into two classes. It is then shown that this grouping of the data set corresponds to the standard convective-stratiform classification. We then propose a five-subclass classification, which corresponds to a refinement of the two initial groups. In Sect. 5 we include some additional microphysical features of rainfall events, allowing the microphysical properties of the five previously defined event classes to be studied. Our conclusions are presented in Sect. 6.

The disdrometer data sets - data processing methodology
This research relies on the analysis of raindrop measurements obtained with a dual-beam spectropluviometer (DBS) disdrometer, first described by Delahaye et al. (2006). This instrument allows the arrival time, diameter and fall velocity of incoming drops to be recorded. As the capture area of the sensor is 100 cm², its observations can be considered, in spatial terms, to be "point-like". In the present study the integration time T_int was set to 1 min, and the raindrop measurements were used to estimate the corresponding 1-minute rain rate time series RR_t(t). In order to eliminate false raindrop detections that could be generated by dust or insects, a threshold T_0 = 0.1 mm h⁻¹ was applied. Rain rates lower than T_0 are thus set to zero. This conventional threshold is also chosen to ensure coherency with previous studies (Verrier et al., 2013; Llasat et al., 2001). In the present study, we worked with two data sets recorded during the period between July 2008 and July 2014, at the Site Instrumental de Recherche par Télédétection Atmosphérique (SIRTA) in Palaiseau, France.

Rain event definition
In everyday life, it is common knowledge that rain starts at a certain moment and stops some time later. However, due to its discrete nature, rain (which generally consists of a very large number of raindrops) is not an easy concept to define. Indeed, the exact definition of a rain event will depend on the sensor's characteristics (specific surface capture, detection threshold, instrumental noise) as well as the spatial or temporal resolution chosen for the study. This definition may also depend on the purpose of the study and thus on the scientific community behind it. There is thus a wide range of criteria used to break down precipitation records into rain events. For this reason, it is important to adopt an unambiguous definition of a "rain event". In this study, the pattern produced by the 1-minute rain rate time series RR_t(t) can be simplified by grouping non-null rain rates into a set of separate "primitive events" (Brown et al., 1985). On the basis of an assigned minimum inter-event time (MIT; Coutinho et al., 2014), each rain rate value, corresponding to a specific 1-minute period of observation, is assigned to a given rainfall event, i.e., either the rainfall event in progress or a subsequent event that is considered to be independent and "new". The MIT could also be defined as the duration of a dry period D_dry following which the next occurrence of non-null rainfall marks the beginning of a new event (Driscoll et al., 1989). For dry periods shorter than the MIT, rain rates from either side of this period are considered to belong to the same "composite event". Various authors have proposed different values of MIT that ensure event independence. Llasat (2001) noted that "The definition of an episode is quite subjective. In this case it was felt possible to distinguish between two different episodes, when the time which elapses between them without rainfall exceeds 1 h, which ensures that the two episodes come from different 'clouds'." Moussa and Bocquillon (1991) wrote, "the constant rain observations on less than 30 min represent only 5 % of all the rainy periods. The representative threshold of the discretization of the data is 30 min to an hour." Dunkerley (2008a, b) carried out an analysis of the inter-event time (IET) in order to check the influence of this variable on the definition of rainfall events and its influence on the average rainfall rate. As emphasized in this study, when determining a value for the MIT, it is crucial to find an appropriate compromise between the independence of rain events and the intra-event variability of rain rates. The choice of MIT thus has a direct impact on the macrophysical characteristics that are ultimately determined by the analysis. Other researchers have proposed to use MIT values of 20 min, 1 h or even 1 day (see Dunkerley, 2008a, for a detailed list). In the present study it was decided to set the MIT to 30 min. This is in agreement with the value used by Coutinho et al. (2014), Haile et al. (2011), Dunkerley (2008a, b), Balme et al. (2006) and Cosgrove and Garstang (1995).
When applied to our data set, this choice leads to the identification of 545 rain events, which can be divided up into two subsets, i.e., one for learning and the other for testing (Table 1).The learning data set is composed of observations collected over a 2-year period between 2013 and 2014, with an availability of 96.4 %, whereas the test data set collected during the 2008-2012 period contains periods with missing data due to a malfunction of the recording device.
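The event definition above can be sketched in a few lines of Python. This is only an illustration, not the processing chain used in the study: the function name `split_into_events` is hypothetical, and it assumes that a dry gap of at least MIT minutes (rather than strictly more) separates two events.

```python
import numpy as np

def split_into_events(rain_rate, threshold=0.1, mit=30):
    """Return (start, end) index pairs (end exclusive) of rain events.

    rain_rate: 1-minute rain rates in mm/h.
    threshold: rates below this value are treated as zero (T_0 in the text).
    mit: minimum inter-event time, in minutes.
    """
    rr = np.asarray(rain_rate, dtype=float)
    wet = np.flatnonzero(rr >= threshold)      # indices of rainy minutes
    if wet.size == 0:
        return []
    gaps = np.diff(wet) - 1                    # dry minutes between rainy minutes
    breaks = np.flatnonzero(gaps >= mit)       # assumed rule: gap >= MIT starts a new event
    starts = np.concatenate(([wet[0]], wet[breaks + 1]))
    ends = np.concatenate((wet[breaks], [wet[-1]])) + 1
    return list(zip(starts.tolist(), ends.tolist()))
```

Shorter dry spells stay inside a "composite event", so an event may contain minutes with zero rain rate.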

Macrophysical description of rain events
Rain events contain a wealth of information, which generally needs to be condensed into a limited set of well-chosen features. However, there is no conventional or commonly accepted list or specific set of macrophysical features that can be used to accurately describe and summarize an event. In the present study it was thus decided to consider a large number of features, allowing the macrophysical rain event information described in the literature to be correctly represented. Seventeen characteristics were selected and identified (Llasat, 2001; Moussa and Bocquillon, 1991) and are listed in Table 2. Some of these are parameter dependent, such as P_c, which uses three values of the parameter c. These three values lead to three P_c indices, namely P_c1, P_c2 and P_c3. Finally, a total of 23 descriptors were defined and numbered from 1 to 23 (column 1 in Table 2).
Among the 23 indicators (hereafter referred to as variables) corresponding to the previously defined features, some are very well known. These include the event duration (D_e), the quartiles (Q_i), the mean event rain rate (R_m) and the standard rain rate deviation (σ_R), as well as other, less traditional parameters such as the parameter β_L (indicator for the convective nature of the rain; see Llasat, 2001), the absolute rain rate variation of order c (P_c) or the absolute rain rate variation (P_s,c). Some variables that are usually used to describe time series, such as the fractal dimension, multi-fractal parameters, trend, seasonality and autocorrelation, require a long series of data and are not well suited to an event-by-event analysis. This set of 17 features is not exhaustive, and some other features could also be included, depending on the application. One example is the case of hydrology, for which the positions of the intensity peaks inside the event could be a relevant feature. Although the computation of some indicators (σ_R, Q_i) is questionable for events comprising a very small number of samples (very low value of variable D_e), in the present study the 23 variables were computed for each of the 545 rain events.
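As a rough illustration of how the best-known descriptors follow from a 1-minute rain rate series, the sketch below computes a few of them for one event. Table 2 is not reproduced here, so the exact definitions are assumptions (population standard deviation, duration counted in minutes), and the function name `event_features` is hypothetical.

```python
import numpy as np

def event_features(rr_event):
    """Basic descriptors of one event from its 1-minute rain rates (mm/h)."""
    rr = np.asarray(rr_event, dtype=float)
    duration_min = rr.size                 # D_e, in minutes
    depth_mm = rr.sum() / 60.0             # R_d: each minute contributes RR / 60 mm
    return {
        "D_e": duration_min,
        "R_max": rr.max(),                              # peak rain rate (mm/h)
        "R_d": depth_mm,                                # rain event depth (mm)
        "R_m": depth_mm / (duration_min / 60.0),        # mean rate = depth / duration (mm/h)
        "sigma_R": rr.std(),                            # assumed: population standard deviation
    }
```

For a three-minute event with rates 6, 0 and 12 mm/h, this gives a depth of 0.3 mm and a mean rate of 6 mm/h.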

Principal component analysis (PCA) and normalization step
It is important to note that very few of these 23 variables are compatible with the probabilistic assumptions generally associated with exploratory statistical methods. PCA should therefore not be applied directly to these data, as it may lead to misleading interpretations (Daumas, 1982). It is thus necessary to introduce an additional step in order to transform the original distributions into quasi-normal distributions. The most suitable type of normalizing transformation for each of these variables was selected empirically by testing seven different possible transformations (Table 3). For each variable, the retained transformation is that leading to a distribution with the strongest similarity to a normal distribution, i.e., with a kurtosis close to 3 and a skewness close to 0. For each indicator, the selected transformation is provided in the last column of Table 2.
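The selection rule (kurtosis close to 3, skewness close to 0) can be illustrated as follows. This is a sketch under stated assumptions: the candidate set below is hypothetical apart from log(x + 0.1), which appears in Table 3, and the way the two criteria are combined into a single score is an illustrative choice, since the selection in the study was made empirically.

```python
import numpy as np

def skewness(x):
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

def kurtosis(x):
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4)          # equals 3 for a normal distribution

# Hypothetical candidate set; only log10(x + 0.1) is taken from Table 3.
CANDIDATES = {
    "identity": lambda x: x,
    "sqrt": np.sqrt,
    "log10(x + 0.1)": lambda x: np.log10(x + 0.1),
}

def best_transformation(x, candidates=CANDIDATES):
    """Pick the candidate whose output has skewness nearest 0 and kurtosis nearest 3."""
    x = np.asarray(x, dtype=float)

    def score(name):
        y = candidates[name](x)
        return abs(skewness(y)) + abs(kurtosis(y) - 3.0)   # assumed combined criterion

    return min(candidates, key=score)
```

On a strongly right-skewed positive variable, such as a peak rain rate, a rule of this kind would typically favor the logarithmic transformation.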
Following the normalization step, PCA was carried out on the learning data set (see end of Sect. 2.1 and Table 1). The results show that the first two principal axes contain 73 % of the total information, whereas the first five principal axes are needed to represent 90 % of the total information. The IET_p variable (no. 7) is very well correlated with axis 5, whereas the other variables are not. This means that there is no linear relationship between IET_p and the other variables. For this reason, this variable was not considered as a possible candidate in the variable selection process during the remainder of the study. The results obtained in Sect. 4.2 confirm that there is no relationship between this variable and the other 22 variables. The correlation circle on axes 1 and 2 (Fig. 1a) shows that among the 23 variables, 16 are well correlated with these axes (close to the unit circle) and are distributed in approximately five groups (hereafter referred to as PCA groups). The first PCA group (G_1) can be identified by the variables that are grouped close to the first axis and are well correlated with it.
As an example, this is the case for the variables σ_R (no. 9), P_CN (nos. 17-18) and β_L (nos. 21 to 23). A second PCA group (G_2) comprises the variables R_max (no. 11) and P_c3 (no. 16), just above axis 1. The third PCA group (G_3) is formed by the variable P_c2 (no. 15) only. The fourth PCA group (G_4) comprises the variables P_c1 (no. 14) and P_s,c (no. 20) and is well correlated with axis 2. The last PCA group (G_5) is formed by the variable D_e (no. 1). The correlation circle on axes 1 and 3 (Fig. 1b) shows that the variables Q_1 (no. 4), Q_2 (no. 5) and M_0 (no. 10) are quite well represented by these two axes. A similar remark can be made for variables D_d (no. 3) and β_L1 (no. 21) on axes 1 and 4 (not shown).
Finally, the PCA clearly shows that, within each PCA group, many variables are highly intercorrelated, i.e., linearly dependent on each other. This means that several variables could be removed with no substantial loss of information. This leads to the following question: which variables can be removed in order to retain the most parsimonious subset of variables representative of the full data set? The PCA extracts summary variables, which are linear combinations of the original variables, but does not allow for the selection of variables. To answer this question, we propose a method for the global selection of variables that seeks to identify the relevant variables in a data set. As it appears to be intuitively more advantageous to select variables with a physical sense, rather than using dimension reduction methods (e.g., PCA, which is more suitable for the detection of linear relationships), the proposed method is based on the use of a GA.
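For readers who want to reproduce this kind of diagnostic, the sketch below computes the cumulative share of variance per principal axis and the coordinates of each variable on the correlation circle. It is a generic PCA on the correlation matrix written with NumPy, not the code used in the study; the function name `pca_summary` is hypothetical.

```python
import numpy as np

def pca_summary(X):
    """X: (n_events, n_variables) array of normalized variables."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)         # standardize each variable
    corr = (Z.T @ Z) / Z.shape[0]                    # correlation matrix
    eigval, eigvec = np.linalg.eigh(corr)            # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]                 # sort axes by decreasing variance
    eigval, eigvec = eigval[order], eigvec[:, order]
    explained = np.cumsum(eigval) / eigval.sum()     # cumulative share of total variance
    # Correlation of each variable (row) with each principal axis (column):
    # the first two columns are the coordinates on the correlation circle.
    loadings = eigvec * np.sqrt(np.clip(eigval, 0.0, None))
    return explained, loadings
```

Variables whose points lie close together and near the unit circle are strongly intercorrelated, which is the redundancy that the variable selection step aims to remove.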
The following section provides a brief introduction to the concept of GAs and shows how they can be advantageously used for the selection of variables in the context of the present study.

Variable selection using a genetic algorithm
Computer-assisted variable selection is important for several reasons. Indeed, the selection of a subset of variables in a high-dimensional space can improve the performance of the model or its statistical properties, but it also provides more robust models and reduces their complexity. In practice, it is not generally possible to try all potential combinations of variables and to select the best of these, as a consequence of the enormous computational cost associated with such an approach. Among the many different variable-selection techniques described in the literature (Guyon and Elisseeff, 2003), we chose to develop a model based on the use of GAs to search for an optimal subset of variables. GAs (Holland, 1975) are stochastic optimization algorithms based on the mechanics of natural selection and the genetics described by Charles Darwin. In our study, a chromosome is defined as a subset made up from our 23 variables. A first generation composed of a population of 60 potential chromosomes is arbitrarily chosen. The performance of each chromosome (i.e., of the corresponding subset of variables) is evaluated through a fitness function f. This fitness function is defined in such a way that the higher its value, the greater the chromosome's ability to represent the full data set (of dimension 23), using the smallest possible number of variables. On the basis of the performance of these 60 chromosomes, we create a new generation of 60 potential-solution chromosomes, using classical evolutionary operators: selection, crossover and mutation. The performance of this new generation is then evaluated. This cycle is repeated until a predefined stop criterion is satisfied. The best chromosome from the current generation then provides the optimal subset of variables.
www.atmos-meas-tech.net/10/1557/2017/ Atmos. Meas. Tech., 10, 1557-1574, 2017
Table 3. Transformations used to normalize the variables listed in Table 2.
Transformation number | Transformation name | Formula f(x) | Notes
0 | Standardization | | Data are between 0 and 1
6 | Decimal logarithm | log(x + c) | c = 0.1

Methodology
We denote by x_k the chromosome number k: x_k is a binary vector in {0, 1}^23 such that each component x_k(i) has the following meaning:
$$ x_k(i) = \begin{cases} 1 & \text{if variable } i \text{ is selected} \\ 0 & \text{otherwise} \end{cases}, \qquad i = 1, \dots, 23. \tag{1} $$
The word "selected" in Eq. (1) means that the corresponding variable will be used, both in the learning step described in "Step 2" below and for performance evaluation. Otherwise, if the corresponding variable is not selected, it will be used only for performance evaluation.
As previously stated, the fitness function measures how well a minimal subset of variables can represent the entire data space (in dimension 23). The fitness function f is thus defined (Eq. 2) in terms of the following quantities: x_k is chromosome k, n(x_k) is the number of selected variables in chromosome x_k and te(x_k) is the topological error associated with chromosome x_k.
As the aim of this approach is to minimize the number of selected variables n and the topological error te, we seek to maximize the fitness function. The estimation of the topological error made from an SOM is somewhat complicated and requires some explanation. The notion of an SOM, introduced by Kohonen (1982, 2001), makes use of a popular clustering and visualization algorithm. SOM is a neural network algorithm based on unsupervised learning, derived from the technique of competitive learning (Kohonen, 1982, 2001; Vesanto and Alhoniemi, 2000). It may be considered as a nonlinear generalization, which has many advantages over conventional feature extraction techniques such as empirical orthogonal functions (EOF) or PCA (e.g., Liu et al., 2006). SOM applications are becoming increasingly useful in geosciences (e.g., Liu and Weisberg, 2011). As stated by Uriarte and Martín (2008): "The SOM provides a nonlinear, ordered, smooth mapping of high-dimensional input data manifolds onto the elements of a regular, low-dimensional array. The main characteristic of the projection provided by this algorithm is the preservation of neighborhood relationships; as far as possible, nearby data vectors in the input space are mapped onto neighboring locations in the output space." This property makes it straightforward to compute a topological error (see Uriarte and Martín, 2008, Eq. 2). For each of the x_k chromosomes, an SOM M(x_k) is learned on the learning data set. Only the selected variables are used during the learning process. Finally, for each map M(x_k), the topological error te(x_k) can be computed in accordance with Eq. (2) in Uriarte and Martín (2008). Section 4 provides additional information concerning SOMs.
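To make the notion of topological error concrete, the sketch below computes the commonly used topographic error: the share of data vectors whose best and second-best matching neurons are not adjacent on the map. Whether this matches Eq. (2) of Uriarte and Martín (2008) term by term is an assumption, and the function name and arguments are hypothetical.

```python
import numpy as np

def topographic_error(data, codebook, grid_xy, neighbor_dist=1.0):
    """Share of samples whose two best matching units are not adjacent on the map.

    data:     (n_samples, n_features) input vectors
    codebook: (n_neurons, n_features) reference vectors
    grid_xy:  (n_neurons, 2) neuron positions on the map, with adjacent
              neurons exactly `neighbor_dist` apart (as on a hexagonal lattice)
    """
    # Squared distance between every sample and every reference vector.
    d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    two_best = np.argsort(d2, axis=1)[:, :2]          # first and second BMU per sample
    map_dist = np.linalg.norm(
        grid_xy[two_best[:, 0]] - grid_xy[two_best[:, 1]], axis=1
    )
    return float(np.mean(map_dist > neighbor_dist + 1e-9))
```

A value of 0 means that every sample falls between two neighboring neurons, i.e., the map preserves the local topology of the data; larger values indicate folds in the map.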
The genetic algorithm is based on the following five steps (Fig. 2):
1. In the first step, initialization, an initial population {x_k, k = 1, ..., 60} of 60 chromosomes of dimension 23 is randomly generated.
2. In the second step, evaluation, for each of the 60 chromosomes x_k an SOM M(x_k) is learned and used to estimate the topological error te(x_k), allowing their fitness score f(x_k) to be computed.
3. In the third step, the best chromosome x_Best is selected from the full set of 60 chromosomes according to the fitness score previously computed with the test data set. If x_Best remains unchanged over a period of 50 generations, the procedure is stopped and the most relevant variables are selected, i.e., those for which the corresponding components are equal to 1 in x_Best. Otherwise, go to step 4.
4. In the fourth step, selection, a new population of 60 chromosomes is created from the current population by randomly sampling chromosomes, with replacement, according to their selection probabilities (Eq. 3).
5. The fifth step, reproduction, uses mutation and crossover possibilities in the new population. Mutation consists in modifying (or not) certain components of the chromosomes. The probability of mutation is in general very low and is commonly set to p = 10⁻⁷. In the present case, the number of generations needed to reach the objective is less than a few hundred, such that the probability of a mutation is very low. For crossover, in an initial step 60/2 = 30 pairs of chromosomes are randomly drawn from the population. Then, for each pair (x_k, x_l) (called parents) one crossover point, noted I_c, is randomly drawn over the range [1, 23], using a discrete uniform law. Two new chromosomes (x'_k, x'_l) are created as follows:
$$ x'_k = \big(x_k(1), \dots, x_k(I_c), x_l(I_c + 1), \dots, x_l(23)\big), \qquad x'_l = \big(x_l(1), \dots, x_l(I_c), x_k(I_c + 1), \dots, x_k(23)\big). \tag{4} $$
Thus, from two parents, two children are generated, allowing a new generation to be produced with the same number of chromosomes. Finally, the algorithm returns to step 2.
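The five steps above can be sketched as a short loop. This is a minimal illustration rather than the implementation used in the study: `run_ga` and its arguments are hypothetical names, the caller supplies a strictly positive `fitness` function (in the study it would involve training an SOM for each chromosome), selection is assumed to be proportional to fitness, and the stop rule tracks the best chromosome found so far.

```python
import numpy as np

def run_ga(fitness, n_vars=23, pop_size=60, p_mut=1e-7, patience=50, seed=0):
    """Return (best chromosome, best score). pop_size must be even; fitness must be > 0."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_vars))        # step 1: initialization
    best, best_score, stale = None, -np.inf, 0
    while stale < patience:
        scores = np.array([fitness(x) for x in pop])          # step 2: evaluation
        k = int(np.argmax(scores))                            # step 3: best chromosome
        if scores[k] > best_score:
            best, best_score, stale = pop[k].copy(), scores[k], 0
        else:
            stale += 1                                        # no improvement this generation
        # Step 4: sampling with replacement, probability proportional to fitness (assumed).
        parents = pop[rng.choice(pop_size, size=pop_size, p=scores / scores.sum())]
        # Step 5a: single-point crossover on pop_size / 2 randomly formed pairs.
        parents = parents[rng.permutation(pop_size)]
        children = parents.copy()
        for a in range(0, pop_size, 2):
            cut = rng.integers(1, n_vars + 1)                 # crossover point in [1, n_vars]
            children[a, cut:] = parents[a + 1, cut:]
            children[a + 1, cut:] = parents[a, cut:]
        # Step 5b: flip each component independently with probability p_mut.
        flips = rng.random(children.shape) < p_mut
        pop = np.where(flips, 1 - children, children)
    return best, best_score
```

With a mutation probability as low as 10⁻⁷, almost all of the exploration comes from crossover between the chromosomes retained by the selection step.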

Parsimonious description of a rain event
The GA is applied to our data sets in order to obtain an optimal subset of variables forming a subspace, which can (in a certain sense) provide relatively accurate information concerning the global space, whilst having the particularity of containing non-redundant information. At the 187th generation the algorithm produces a subspace comprising five variables, namely, event duration D_e (no. 1), standard deviation σ_R (no. 9), maximum rain rate during event R_max (no. 11), rain event depth R_d (no. 13) and absolute rain rate variation P_c1 (no. 14).
The three variables (D_e, R_max, R_d) selected using this data-driven approach are commonly used in the study of hydrological processes (Haile et al., 2011). Moreover, it should be noted that the commonly used variable R_m, which is computed simply by dividing the rain event depth (R_d) by the duration (D_e), was not selected by the algorithm. This result could be expected, since it is correlated with the latter variables, and the algorithm provides a parsimonious description. Concerning the absolute rain rate variation (P_c1), this variable was proposed by Moussa and Bocquillon (1991). It tends to provide information on the structure of the events, more specifically related to smooth events with a small number of sharp peaks. In fact, this variable promotes low variations of RR_t because P_c1 is in a certain sense a structure function of order c_1 of the variable RR_t (see no. 14, column 4 in Table 2), with a low value for the exponent (c_1 = 0.5). Finally, the standard deviation variable (σ_R), which is a second-order moment, is the most commonly used indicator to describe the variability of the precipitation rate within the rain event.

SOM learned with the five selected variables
An SOM is a topological map composed of neurons. In the present case, a neuron is a vector of dimension 23 containing the 23 variables defined in Table 2. Each neuron has six neighboring neurons. SOM is an unsupervised neural network trained by a competitive learning strategy that performs two tasks: vector quantization and vector projection. Unlike k-means, the SOM uses the neighborhood interaction set to learn the topological structure hidden in the data. In addition to the best matching referent vector (neuron), its neighbors on the map are also updated, leading to the generation of regions in which neurons located in the same neighborhood are very similar. The SOM can thus be considered as an algorithm that maps a high-dimensional data space onto a two-dimensional space called a map. A map can be used both to reduce the amount of data by means of clustering and to project the data in a nonlinear manner onto a regular grid (the map grid).
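The two tasks described above (finding the best matching neuron, then pulling it and its map neighbors toward the sample) can be illustrated with a very small online SOM on a hexagonal lattice. This is an independent sketch, not the algorithm of the toolbox used in the study; the function names, the Gaussian neighborhood and the linear decay schedules are all assumptions.

```python
import numpy as np

def hex_grid(rows, cols):
    """Neuron positions on a hexagonal lattice (unit distance between neighbors)."""
    xy = [(c + 0.5 * (r % 2), r * np.sqrt(3) / 2)
          for r in range(rows) for c in range(cols)]
    return np.array(xy)

def train_som(data, rows=8, cols=8, n_iter=5000, lr0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    grid = hex_grid(rows, cols)
    # Initialize reference vectors from randomly chosen samples.
    codebook = data[rng.integers(0, len(data), size=rows * cols)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                        # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 0.5 * frac     # shrinking neighborhood radius
        x = data[rng.integers(0, len(data))]
        bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))      # vector quantization
        # Gaussian neighborhood centered on the BMU, measured on the map grid.
        h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        codebook += lr * h[:, None] * (x - codebook)            # update BMU and neighbors
    return codebook, grid
```

Because neighbors are updated together, nearby neurons end up with similar reference vectors, which is what distinguishes the result from a plain k-means partition.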
In the present study we used the toolbox developed by the SOM Toolbox Team, which is available at the following site: http://www.cis.hut.fi/somtoolbox/. An SOM with 8 × 8 = 64 neurons is considered here. This choice corresponds to a compromise, since a smaller map would not be able to distinguish fine details whereas, in view of the number of observations, a larger map would not be meaningful.
After learning by the GA described in the previous section, the resulting map M(x_Best) can be used to assign to any event the best matching reference vector (neuron), in accordance with the five selected variables associated with the chromosome x_Best. The M(x_Best) map obtained with this procedure can be considered as an optimal representation of the initial data set.
Figure 3 shows the distance matrix. For each neuron, the color indicates the mean distance between a neuron and its neighbors. The value at the center of each neuron represents the number of rain events of the learning data set captured by the corresponding neuron. All neurons capture rain events and slightly more than half of these capture between three and five rain events, which is close to the value that would be obtained (234/64 ≈ 4) if the rain events were uniformly distributed over the map.

Projection of the selected and unlearned variables onto the SOM
The five variables D_e, σ_R, R_max, R_d and P_c1 used for learning are referred to as "selected" variables, whereas the remaining 18 variables are referred to as "unlearned" variables.
In order to study the relationship between these variables, Fig. 4 shows the projections for each of the variables in the M(x_Best) map obtained with the aforementioned GA selection algorithm. The variables are discussed individually, by considering their structure, as well as the relationships between them. We note that the map is well structured for the majority of variables. This clear structure for most of the variables confirms the ability of the selected variables to summarize all of the significant characteristics of rain events. Only a small number of characteristics are not adequately represented. It should be noted that almost all variables are structured according to the first or second diagonal. Among these, one may consider an initial subset comprising variables that are more or less structured according to the first diagonal. This is the case for the unlearned variable D_d, as well as for the selected variables P_c1 and D_e. A second subset comprising variables that are structured in approximate accordance with the second diagonal can be identified. This is the case for the unlearned variables R_m,r, R_m and P_CNi, which are very similar to the selected variable σ_R. The unlearned variables Q_3, P_c3 and P_s,c also belong to the second subset and have a structure close to that of the selected variable R_max.
The map can be related to the previously implemented PCA (Fig. 1). As can be seen in Fig. 4, the variables P_c3 (no. 16) and R_max (no. 11), which have a similar structure, also belong to the same PCA group, namely group G_2 (see Sect. 2.3, Fig. 1a). It is interesting to note that the variables P_s,c (no. 20) and R_max (no. 11), which also have a similar structure, do not belong to the same PCA group (groups G_4 and G_2 respectively) and are uncorrelated (they are orthogonal in Fig. 1a). This remark means that the topological map reveals a relationship that cannot be detected using PCA. As the rain event depth (R_d) depends on both the duration and the intensity of the events, the corresponding map has a top-down structure. Two distinct situations thus occur:
- Those events which contribute the greatest quantities of water (Fig. 4, brown neuron at the bottom right of R_d) are among the longest (see corresponding neuron of D_e), but they do not have an extremely high peak rain rate (see corresponding neuron of R_max) and are quite smooth (see corresponding neuron of P_c1 and σ_R).
- Other events which contribute large amounts of water (but less than previously; Fig. 4, red neuron at the bottom left of R_d) have short durations (see corresponding neuron of D_e), but they are violent (see corresponding neuron of R_max) and less smooth (see corresponding neuron of P_c1 and σ_R). The latter case reflects situations that are typical of convective storms.
The resulting map confirms the dependence structure of the two hydrological variables, R d and D e , studied by Gargouri and Chebchoub (2010).
Concerning the variable IET p (previous IET), the map is not structured, reflecting the independence of the characteristics of a rain event with respect to the drought period preceding the event. This corroborates the results of several previous studies (Lavergnat and Gole, 1998, 2006; Akrour et al., 2015; de Montera et al., 2009) dealing with rain support simulations. When studying temperate midlatitudes for relatively short periods, these authors noticed that successive rain and no-rain periods are uncorrelated, such that a rain time series could be considered as an independently drawn, alternating series of rain events and periods without rain. This is equivalent to an IET that does not characterize the rain events. The same effects are not necessarily observed at other locations and under different climatological conditions. Brown et al. (1983) also investigated a possible correlation between IETs and the intra-event characteristics and concluded that their data provided no evidence of this.
As the variable β L (Llasat, 2001) is considered to represent a measure of the convective nature of the rain, it makes sense that the three variables β L1 , β L2 and β L3 are structured similarly to the peak rain rate variable R max . This relationship is clearly visible on the maps.
Several other relationships, which are not described in detail here, can be observed. These include the correlation between the normalized absolute rain rate variation (P C Ni ) and the standard deviation of the intensity (σ R ). We conclude that the combination of the five selected variables provides a relatively accurate summary of the information needed to describe the rain events. The poor structuring of some variables is explained by the independence of these variables with respect to the properties of the rain events; this is the case for the variable dry percentage in event D d or the variable IET p .

Representation of rain events on SOM
In an effort to provide additional information for validation of the map, we compared each of the 23 variables with their corresponding value given by the SOM, for the learning data set and the test data set. For each of the 311 events of the test data set, the best matching unit of the SOM, i.e., the neuron that is the closest to the event, is determined with respect to the five selected variables. As an example, for each event Fig. 5 shows the actual value of the unlearned variable β L3 as a function of the corresponding value given by the best matching unit of the event. A spread can be seen, in particular in the central zone, whereas the spread is relatively small for values located near to the edges (which are more numerous). A linear regression leads to a relatively good determination coefficient (R 2 = 0.96 and 0.89, respectively, for the learning and test data sets). Table 4 lists the value of R 2 for the 23 variables obtained with the learning and test data sets. As expected, the coefficient of determination of the variable IET p is very poor (0.31/0.26), since this variable is not related to the five selected variables and as a consequence cannot be well represented by the SOM (Fig. 4). The selected variables have good determination coefficients, with both the learning and the test data sets; this confirms the quality of the learning and the generalization ability of the SOM. The quality of the learning step is confirmed by the fact that the R 2 values of the selected variables obtained on the test set are close to those obtained on the learning set. The R 2 values of the unlearned variables obtained on the learning data set emphasize the ability of the selected variables to provide the information contained in the unlearned variables; in the case of the test data set they denote the ability of the SOM to derive all event characteristics from the selected variables only.
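The validation step described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used in the study: the function names `best_matching_unit` and `r_squared`, the four-neuron codebook and the four events are all hypothetical. Each event is matched to its closest neuron using the selected variables only; the unlearned variable is then read from that neuron and compared with the observed value through the coefficient of determination of a simple linear regression (the squared Pearson correlation).

```python
import math

def best_matching_unit(event, codebook, selected):
    """Index of the neuron closest to the event, using the selected variables only."""
    def dist(neuron):
        return math.sqrt(sum((event[i] - neuron[i]) ** 2 for i in selected))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))

def r_squared(observed, predicted):
    """R^2 of the linear regression of observed on predicted (squared Pearson correlation)."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)

# Hypothetical codebook: columns 0 and 1 stand for selected (learned) variables,
# column 2 stands for an unlearned variable carried by each neuron.
codebook = [
    [0.0, 0.0, 0.1],
    [1.0, 0.0, 0.4],
    [0.0, 1.0, 0.6],
    [1.0, 1.0, 0.9],
]
# Hypothetical events with the same three columns.
events = [
    [0.1, 0.0, 0.15],
    [0.9, 0.2, 0.35],
    [0.2, 0.8, 0.70],
    [1.0, 0.9, 0.85],
]
selected, target = [0, 1], 2

bmus = [best_matching_unit(e, codebook, selected) for e in events]
observed = [e[target] for e in events]
from_som = [codebook[k][target] for k in bmus]
r2 = r_squared(observed, from_som)
print(bmus, round(r2, 2))
```

A value of `r2` close to 1 for an unlearned variable indicates that the selected variables carry most of its information, which is the reading given to Table 4 in the text.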

Hierarchical clustering of rain events
We have shown that the distance matrix (Fig. 3) confirms the successful deployment of the map. Based on the distance between neurons, it appears that neurons can be grouped to obtain a limited number of classes, each with its own characteristics. In order to group the 64 neurons into a small number of classes, a hierarchical cluster analysis was carried out (Everitt, 1974). Only the five selected variables were used for the classification, and a Euclidean distance was selected for the hierarchical algorithm. Figure 6 shows the resulting dendrogram, applied to the 64 neurons.
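As an illustration of this step, the sketch below groups a handful of reference vectors by agglomerative (ascending) hierarchical clustering with a Euclidean distance. Several points are assumptions made for the example: the linkage criterion (average linkage is used here, since the text does not state which one was applied), the function name `agglomerate`, and the toy reference vectors, which have two components instead of five.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points, n_classes):
    """Naive agglomerative clustering with average linkage (assumed linkage)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_classes:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the members of the two clusters
                d = sum(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the two closest clusters
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical reference vectors (neurons) in a two-variable space
neurons = [
    [0.10, 0.20], [0.20, 0.10], [0.15, 0.25],  # low intensity, smooth
    [0.90, 0.80], [0.85, 0.95],                # high intensity, variable
]
classes = agglomerate(neurons, 2)
print(classes)
```

Stopping the merging at `n_classes` clusters is equivalent to cutting the dendrogram at the corresponding height; keeping the full merge history instead would give the dendrogram itself.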
Depending on the physical processes involved, experts tend to separate rain events into two different classes: stratiform and convective events. Although this classification is relatively crude, since stratiform and convective events can sometimes exist inside the same rain event, it is very commonly used. Concerning the time series, most authors use a very simple scheme to distinguish between stratiform and convective rain types. For reasons of simplicity, rain classification is sometimes defined using the instantaneous rain rate and the standard deviation estimated over consecutive samples. As an example, Bringi et al. (2003) defined stratiform rain samples as those for which the standard deviation of the rain rate, taken over five consecutive 2 min samples, is less than 1.5 mm h −1 ; convective rain samples are defined as those for which the rain rate is greater than or equal to 5 mm h −1 and the standard deviation of the rain rate over five consecutive 2 min samples is greater than 1.5 mm h −1 .
Firstly, we separate the dendrogram into two classes. The first class contains 51 neurons and 79 % of the observations, whereas the second class contains 13 neurons and 21 % of the observations. The solid black line in Fig. 3 corresponds to the dividing line between these two classes. The first class, containing the greatest number of neurons, is in most cases characterized by relatively low rain rates. This can be seen by examining the structure of the map, according to the mean rain rate variable (R m ). Moreover, analysis of the standard deviation (small values of σ R ) and of the absolute rain rate variations P c (high values of P c1 and low values of P c3 ) shows that this class is more or less characterized by quiet, homogeneous events. Our analysis of event durations (D e ) shows that this class contains both short and long durations but is dominated by the latter. These characteristics are relatively well matched to a description involving stratiform and stable precipitation, which is often the consequence of the slow, large-scale uplift of a large mass of moist air which then condenses uniformly.
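For comparison with the event-based approach, the sample-based scheme attributed above to Bringi et al. (2003) can be sketched as follows. The thresholds (1.5 mm h −1 and 5 mm h −1 , five consecutive 2 min samples) are those quoted in the text; the remaining choices are assumptions made for this example: the window is centered on the sample, the population standard deviation is used, and the function name and the labels "undefined" and "unclassified" are hypothetical.

```python
import statistics

def classify_samples(rain_rate, window=5, sigma_thr=1.5, rate_thr=5.0):
    """Label each 2 min sample as stratiform or convective (rain rates in mm/h)."""
    labels = []
    half = window // 2
    for k in range(len(rain_rate)):
        lo, hi = k - half, k + half + 1
        if lo < 0 or hi > len(rain_rate):
            labels.append("undefined")      # incomplete window at the series edges
            continue
        sigma = statistics.pstdev(rain_rate[lo:hi])
        if sigma < sigma_thr:
            labels.append("stratiform")
        elif sigma > sigma_thr and rain_rate[k] >= rate_thr:
            labels.append("convective")
        else:
            labels.append("unclassified")   # variable but weak rain
    return labels

# Hypothetical 2 min rain rate series (mm/h): a quiet period followed by a burst
series = [1.0, 1.2, 1.1, 0.9, 1.0, 2.0, 12.0, 25.0, 18.0, 6.0, 2.0]
labels = classify_samples(series)
print(list(zip(series, labels)))
```

Such a scheme labels individual samples, so a single rain event can receive several labels; the clustering used in this study instead assigns one class to each whole event.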
The second group is characterized by a smaller number of neurons. This corresponds to the higher values of the mean rain rates (R m ) and peak rain rates (R max ). The variables σ R and P c have the opposite values with respect to those of the previous group. Most of the event durations (D e ) in this group are short, with the exception of neuron no. 64 (bottom right on the maps). This group fits well with the definition of convective events resulting from the rapid, buoyancy-driven rise of air masses loaded with moisture. This convective moist air can lead to the development of cumulus clouds up to an altitude in excess of 10 km and to heavy rain.
Our analysis of the structure of the variables β L1 , β L2 and β L3 in Fig. 4 confirms the previous interpretation of the two groups. These three variables, which are representative of convective rain, have high values for the neurons belonging to this group.
Figure 7a and b show the neurons in the R m , β L3 and P c2 subspace. These three variables were not used in the learning step. Nevertheless, the two classes are well separated, although an overlap does occur in Fig. 7a due to neuron no. 64 (bottom right on the map, Fig. 4). Although it belongs to the convective class, this neuron nevertheless has some characteristics of the stratiform class.
The hypothesis that the two categories of precipitation events corresponding to different dynamic regimes can be identified solely on the basis of hydro-meteorological variables is in agreement with the findings of Molini et al. (2011). These authors have shown that there is a strong agreement between the hydro-meteorological classification (based on the duration and extent of events from rain gauge network data) and dynamic classifications (the convective adjustment timescale identified to distinguish between equilibrium and non-equilibrium convection derived from ECMWF analysis). We conclude that this unsupervised automatic clustering, based on the five selected variables, makes it possible to correctly implement a classification with these two well-known classes (stratiform and convective). It should be noted that, unlike other classifications described in the literature, this was established without making use of a priori information, since it is produced by an unsupervised process.

Classification of events into several classes
Starting from the stratiform and convective classification described above, it is interesting to refine the two classes into a set of subclasses. The synoptic rainfall associated with midlatitude depressions provides an example of stratiform precipitation, which forms in depressions in the vicinity of warm and cold fronts. The very light type of rainfall (drizzle) associated with stratus or stratocumulus is included in the class of stratiform precipitation. This can occur under anticyclonic conditions, or in the warm region of a depression. The associated rain depths (R d ) are minimal and usually have no hydrological impact other than superficial wetting. In order to identify relevant subclasses, our classification was broken down into an unknown number n of subclasses, such that n > 2.
Figure caption (fragment): The colors represent the subclass numbers: subclass 1 is dark blue, subclass 2 is blue, subclass 3 is green, subclass 4 is orange, and subclass 5 is red.
An important step in hierarchical clustering is the selection of an optimal number of partitions (n opt ) in the data set (Grazioli et al., 2015). Many indices can be used to evaluate each partition, from the point of view of data similarity only. Most of these evaluate the scattering inside each cluster, with respect to the distance between clusters, and assign relatively favorable scores to partitions with compact and well-separated clusters. Although different indices were tested, these did not provide the same number of subclasses (between 2 and 32 with the indices tested in this study). It should be noted that these did not take the physical meaning of each class into account. Finally, we chose n opt = 5, since higher values led to classes with the same physical sense. The new classification based on the use of five subclasses is shown in Fig. 8.
Of these five subclasses, two belong to the stratiform class and the other three belong to the convective class. In the learning data set, subclasses 1 to 5 represent 12, 68, 1.2, 6.8 and 12 % of all events respectively. The characteristics of these five subclasses are summarized below and in Table 5. The five selected variables are remarkably heterogeneous between classes, which indicates the relevance of these variables for clustering:
- Subclass 1 (drizzle and very light rain): the main feature of this class is the very low mean value (R m ) and standard deviation σ R of the event rain rate, in addition to the features of the superclass. The mean event rain rates lie in the range [0, 0.5] mm h −1 , with a mean value of 0.36 mm h −1 and σ R in the range [0, 3] mm h −1 , with a mean value equal to 0.1 mm h −1 . Although these events have a significant duration, the corresponding subclass, which corresponds to drizzle, involves only small quantities of water. It can also be noted that a low value of β L3 (< 0.01) is a good indicator for drizzle.
- Subclass 2 ("normal" events): this is a relatively broad class containing 68 % of all events, with a mean event rain rate (R m ) in the range [0.5, 6] mm h −1 and a mean value of 1.48 mm h −1 . The standard deviation σ R lies in the range [1, 10] mm h −1 with a mean value of 2. This subclass is characterized by a significant relative variation of some parameters (D e , R m and P c1 , for instance), together with dry periods (D d ), which can be relatively long.
The three remaining subclasses correspond to convective classes of events, which are characterized by a strong temporal heterogeneity and significant intensities. Depending on the depth of rain events, this convective class is subdivided into three subclasses.
- Subclass 3 contains relatively long events (D e ) with high values for the rain event depth (R d ) variable and P c1 . This class represents events with a very small likelihood of occurrence (1.2 %).
- Subclass 4 contains relatively short events (D e ) with peak rain rate R max > 50 mm h −1 , in addition to strong heterogeneities (σ R , P c2 and P c3 are high) and large values for the convective indicator (β L3 ).
- Subclass 5 contains events that are characterized by relatively low values for the rain event depth (R d ). This is due to the short duration of the events (D e ). The variables σ R and P c3 remain high. Another feature of this subclass is that it includes continuous events only, with no short, embedded dry periods (low values of D d in Fig. 4 and Table 5).
To conclude this section, this new classification allows the conventional stratiform and convective classification to be refined: it is subdivided into five different subclasses (two stratiform and three convective), each of which is homogeneous. This classification is obtained for midlatitude climates. As the data set used in this study is representative of only one specific region and topography (i.e., the temperate climate encountered in the Île-de-France region, France), its analysis cannot reveal information related to different processes, i.e., those which are not sampled in the data set. Such processes could lead to the identification of additional specific clusters of events. In particular, there are no orographic rainfall events or oceanic observations. The final step in this study involves assessing whether the homogeneous character of each class is preserved at the microphysics scale and attempting to identify any relationships between the information present at the scale of both the microphysics and the macrophysics of these events (hydrological information).

Microphysical point of view
Our study of the microphysical properties of rain is based on a comprehensive analysis of its drop size distribution N(D), corresponding to the number of raindrops per unit volume and per interval of diameter D. The shape of N(D) reflects the microphysical processes involved. The identification of various features of the drop size distribution, as well as the type of precipitation, is very useful for many applications.
As an example, this information is used in the calculation of heating profiles in the precipitation parameterization of atmospheric models to gain a more detailed understanding of microphysical processes, as well as for the development of rain retrieval algorithms applied to remote sensing observations. The microphysical characteristics of rainfall act as hidden variables that affect the relationship between microwave remote sensing measurements and the volume of water in a rainfall event (Ulaby et al., 1981; Iguchi et al., 2009). It can thus be very useful to use conventional rain gauges to determine the microphysical characteristics of rainfall events, thereby improving the quality of active or passive remote sensing observations, and the spatial properties of rainfall events in particular.
A general expression for the drop size distribution defined by Testud et al. (2001) is commonly used in the literature. This allows a distinction to be made between the stable shape function f and the variability induced by rain. This variability is represented by two microphysical parameters, namely the mass-weighted volume diameter (D m ) and the parameter N * 0 . In some studies, the term N w is used rather than N * 0 . Not all authors use exactly the same units; in particular, Bringi et al. (2003) and Suh et al. (2016) use mm −1 m −3 for the units of N w rather than the unit m −4 , which is used in this study for N * 0 . Following Testud et al. (2001), the general expression is

$$N(D) = N_0^{*} \, f\!\left(\frac{D}{D_m}\right),$$

where D m and N * 0 are defined as

$$D_m = \frac{M_4}{M_3}, \qquad N_0^{*} = \frac{4^4}{\Gamma(4)} \, \frac{M_3^{5}}{M_4^{4}},$$

and M i is the ith-order moment of the drop size distribution N(D):

$$M_i = \int_0^{\infty} N(D) \, D^{i} \, \mathrm{d}D.$$

Rain samples are usually analyzed by computing the microphysical parameters (D m and N * 0 ) for each rain sample obtained over a given timescale. In the present study, N(D) is obtained by considering the entire raindrop collection corresponding to each rain event of (variable) duration D e . This approach leads to one pair (D m , N * 0 ) of microphysics variables per rain event, whereas most other authors rely on values computed over a fixed timescale.
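The computation of one (D m , N * 0 ) pair from a binned drop size distribution can be sketched as follows. This is a minimal example that assumes the usual normalization of Testud et al. (2001), i.e., D m = M 4 /M 3 and N * 0 = (4^4/Γ(4)) M 3^5/M 4^4, with the moments approximated by a sum over diameter bins. The function names and the synthetic exponential distribution (N0 = 8000 mm −1 m −3 , Λ = 2 mm −1 ) are hypothetical and are not taken from the data set of this study.

```python
import math

def moment(diameters, concentrations, bin_widths, order):
    """Approximate M_i = integral of N(D) D^i dD by a sum over diameter bins."""
    return sum(n * d ** order * w
               for d, n, w in zip(diameters, concentrations, bin_widths))

def dm_and_n0_star(diameters, concentrations, bin_widths):
    """Return (D_m, N0*) computed from the 3rd and 4th moments of N(D)."""
    m3 = moment(diameters, concentrations, bin_widths, 3)
    m4 = moment(diameters, concentrations, bin_widths, 4)
    dm = m4 / m3
    n0_star = (4 ** 4 / math.gamma(4)) * m3 ** 5 / m4 ** 4
    return dm, n0_star

# Synthetic exponential DSD N(D) = N0 * exp(-lam * D), hypothetical values
n0, lam = 8000.0, 2.0                                # mm^-1 m^-3 and mm^-1
width = 0.01                                         # bin width in mm
centers = [(k + 0.5) * width for k in range(2000)]   # bin centers, 0 to 20 mm
conc = [n0 * math.exp(-lam * d) for d in centers]

dm, n0_star = dm_and_n0_star(centers, conc, [width] * len(centers))
print(round(dm, 3), round(n0_star, 1))
```

For an exponential distribution the normalization gives D m = 4/Λ and N * 0 = N0, which provides a simple check of the implementation. With D in mm and N(D) in mm −1 m −3 , the result is in mm −1 m −3 ; multiplying by 10^3 converts it to m −4 .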
Projections of the learned map, according to D m and N * 0 , are shown in Fig. 4 (bottom right). It can be seen that the two maps are well structured and that these two parameters have opposite influences on the map projection. Although these two microphysical parameters were not learned, the relationship between them is clearly accounted for by the information used to structure the map (the five selected variables). Moreover, the existence of a relationship between the microphysical and macrophysical features of the rainfall is also confirmed in this figure, since two of the macrophysical variables used in the learning of the SOM, namely σ R and R max , have patterns similar to those revealed on the D m map.
Many authors, including Atlas et al. (1999), Bringi et al. (2003), Marzuki et al. (2013) and Suh et al. (2016), have endeavored to associate specific microphysical properties with each type of precipitation (convective or stratiform). In view of the maps shown in Fig. 4 and the convective-stratiform classification developed in Sect. 4.3, we are able to confirm that precipitation events classified as stratiform exhibit small values for D m and large values for N * 0 . In the case of the convective class, the opposite trend is observed (i.e., larger values for D m and smaller values for N * 0 ). Similar observations have been reported by Testud et al. (2001). It can also be noticed that the two microphysical variables are relatively homogeneous in the convective class, whereas in the stratiform class they are characterized by a higher level of variability.
In order to improve our analysis of the microphysical information embedded in the data set, we analyzed the relationship between the two microphysical parameters using the reference vectors (neurons) from the map, which include information related to the original rain events.
Figure 9 shows the variable D m as a function of N * 0 for the 64 neurons on the map. This relationship is indicated through the use of distinct markers to identify the five subclasses defined in Sect. 4.4, thus facilitating the discussion of the microphysics associated with stratiform and convective rain. The two solid lines show the linear regressions computed for these two classes.
In the case of the stratiform subclasses (1 and 2) a clear relationship can be observed between the two variables. The microphysical characteristics of these two subclasses are clearly distinct. Indeed, subclass 1 (drizzle and light rain) has the smallest D m and the highest N * 0 and varies over just a small range. Conversely, as in the case of the macrophysical variables (see Sect. 4.4), the microphysical characteristics of subclass 2 (normal events) are considerably more heterogeneous. Knowledge of D m makes it straightforward to identify the corresponding subclass. For example, an event with D m lying in the range [0.5, 1] mm belongs to subclass 1. Similarly, it is very likely that an event with D m lying in the range [1, 1.7] mm belongs to subclass 2.
For the convective events (subclasses 3, 4, 5), small differences can be noticed with respect to N * 0 . In the range [1.7, 2.5] mm, two neurons belonging to subclass 4 are close to a neuron belonging to subclass 5, and they therefore have similar microphysics. Three isolated neurons belonging to subclass 2 (stratiform), located far from all other subclass 2 neurons, can also be noted. These are characterized by relatively high values of D m (2 mm) and low values of N * 0 . The corresponding events are a mixture of stratiform and convective rain. A typical case is given by convective rain associated with strong rain rates occurring at the beginning of an event, whereas the remainder of the event is stratiform with low rain rates and small variations.
Following our classification, Fig. 9 indicates that there are real relationships between the macrophysical and microphysical variables. Nevertheless, knowledge of the variables (D m , N * 0 ) does not allow the correct subclass to be determined in all cases.
Researchers who study microphysical features and their association with specific types of precipitation use simple schemes, based on rain rate estimations over a fixed period of integration (a few minutes), in order to separate stratiform and convective rain types. They also use these simple schemes to label D m and N * 0 as stratiform or convective (Testud et al., 2001). This approach is significantly different from the method presented here, which assumes that all of the samples in a given event belong to the same class. Our values for N * 0 and D m are thus computed for the timescale of a given event rather than for a fixed integration time. Thus, although in the present study a good agreement is found for the range of values covered by D m , those determined for N * 0 do not cover the same range as in the case of the previously cited studies.
Many previous authors have observed that the drop size distribution is closely related to processes controlling rainfall development mechanisms. In the case of stratiform rainfall, the residence time of the drops is relatively long and the raindrops grow by the accretion mechanism. In convective rainfall, raindrops grow by the collision-coalescence mechanism, associated with relatively strong vertical wind speeds. Numerous studies have been published concerning the variability of N * 0 and D m : Bringi et al. (2003) studied rain samples from diverse climates and analyzed their variability in stratiform and convective rainfall; Marzuki et al. (2013) investigated the variability of the raindrop size distribution through a network of Parsivel disdrometers in Indonesia; and Suh et al. (2016) investigated the raindrop size distribution in Korea using a POSS disdrometer. In the case of stratiform rain, all of these authors observe that N * 0 and D m are nearly log-linearly related, with a negative slope. This is consistent with the trend shown in Fig. 9 for the two stratiform subclasses (1 and 2). Even the three distinct neurons, which are isolated from the others, appear to be governed by the same relationship.
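A log-linear relationship of this kind can be estimated by an ordinary least-squares fit of log10(N * 0 ) against D m , for example over the neurons of one class. The sketch below is only illustrative: the function name `fit_log_linear` is hypothetical, and the (D m , N * 0 ) pairs are synthetic values constructed to lie exactly on an arbitrary line (intercept 8, slope −1.5); they are not results from this study.

```python
import math

def fit_log_linear(dm, n0_star):
    """Least-squares fit of log10(N0*) = a + b * Dm; returns (a, b)."""
    y = [math.log10(v) for v in n0_star]
    n = len(dm)
    mx, my = sum(dm) / n, sum(y) / n
    b = (sum((x - mx) * (v - my) for x, v in zip(dm, y))
         / sum((x - mx) ** 2 for x in dm))
    a = my - b * mx
    return a, b

# Synthetic pairs (Dm in mm, N0* in m^-4) on an arbitrary line, for illustration only
dm = [0.6, 0.9, 1.2, 1.5]
n0_star = [10 ** (8.0 - 1.5 * d) for d in dm]

a, b = fit_log_linear(dm, n0_star)
print(round(a, 3), round(b, 3))
```

Fitting the stratiform and convective neurons separately and comparing the two slopes `b` is one way of expressing the "flatter slope" comparison made in the literature cited above.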
Marzuki et al. (2013) noted that during convective rain the increase in value of N * 0 with decreasing D m is nearly log linear, with a flatter slope. In the present case, the dependence is also log linear, with a slope that is slightly flatter for convective events than for stratiform events. In the aforementioned studies, the data were aggregated over time, campaign or site, on the basis of a criterion computed over a fixed period of time. We believe that this process is poorly suited to determining the properties of convective events, as a consequence of their strong variability and shorter characteristic time. In this study we were able to retrieve the log-linear relationship between N * 0 and D m without having to learn it directly. When applying our algorithm to the various macroscopic properties by rain event, we also take into account the variability of rain within an individual rain event. Figure 9 clearly shows that the spreading of parameters N * 0 and D m inside each subclass has the same magnitude as the distance between subclasses. This remark confirms the hypothesis of Tapiador et al. (2010): the intra-event variability can exceed the inter-event variability due to events arising from different precipitation systems. It is thus preferable to examine the properties of events with a more general approach rather than using individual samples to study the distinction between stratiform and convective processes. The three isolated neurons in subclass 2 described above (circled in Fig. 9) have the same properties as the other events of their subclass (i.e., the same slope for the log-linear relationship between N * 0 and D m ). This example confirms the ability of our methodology to preserve the macroscopic information needed to cluster rain events, thus allowing the intra-event variability as well as microphysical information to be (partially) retrieved.
Suh et al. (2016) also compare log(N * 0 ) and D m probability density functions (pdf), for the case of stratiform and convective samples over a 4-year period. On the basis of the D m pdf of both stratiform and convective classes, they compute a threshold value for D m , such that when D m > 1.66 mm the rainfall samples are mainly convective and when D m < 1.66 mm they are mainly stratiform. This finding is consistent with the results of Atlas et al. (1999), who also found a threshold value for D m , distinguishing between convective and stratiform rainfall. In Fig. 9, it can be seen that this threshold is confirmed (vertical solid line), with D m smaller than 1.6 mm corresponding to stratiform events, whereas higher values correspond to mainly convective events. When we consider the events analyzed in the present study, there are also three neurons corresponding to a "mixed event" beyond this threshold. Suh et al. (2016) show in Fig. 4c of their study that the pdf for convective rainfall is higher than that corresponding to stratiform rainfall, when log(N * 0 ) > 6.2 (N w = 3.2 in their figure). As described above, by considering the data corresponding to rain events, rather than to samples recorded over fixed periods of time, our range of values for N * 0 is smaller than that used in other publications. In addition, log(N * 0 ) < 6.15 for all neurons labeled as convective in our study, which is very close to the value of 6.2 determined by Suh et al. (2016).
In view of the generally satisfactory retrieval of microphysical information from macrophysical parameters, we are of the opinion that the topological map successfully restores some of the information implicitly embedded in the data set. It is thus interesting to note that the macrophysical parameters of rainfall are related to its microphysical properties. The map groups similar events together, whilst ensuring, through the minimization of topological errors, that the unfolding of the map is correct. A neuron is thus closer to its neighbors than to any other neuron on the map. This criterion ensures that the data space is optimally partitioned into connected subparts, such that the neurons on the map can be related to the underlying processes governing rainfall.

Conclusions
Although the definition of a rain event is relatively subjective, this study underlines the advantages of using event analysis rather than sample analysis.This data-driven analysis of events shows that rain events exhibit coherent features.As a consequence of the discrete and intermittent nature of rainfall, some of the features commonly used to describe rain processes are inadequate, in particular when they defined for a fixed duration.Excessively long integration times (hours or days) can lead to the mixing of observations that correspond to distinct physical processes as well as to the mixing of rainy and clear air periods, within the same sample.An excessively short integration time (seconds, minutes) leads to noisy data, which are sensitive to the sensor's characteristics (sensor area, detection threshold and noise).By analyzing entire rain events, rather than short individual samples of fixed duration, it is possible to clearly identify certain relationships between the different features of rain events, in particular the influence of the microphysical properties of rain on its macrophysical characteristics.This approach allows the intra-event variability caused by measurement uncertainties to be reduced, thus improving the accuracy with which physical processes can be identified.
Once an event has been clearly identified, it is possible to choose a small number of variables to describe it.We present a new data-driven approach, which can be used to select the most relevant variables for this characterization.This approach has generic properties and can be adapted to many multivariate applications.A GA, when combined with SOM clustering, can allow the unsupervised selection of an optimal subset of five macrophysical variables.This is achieved by minimizing a score function, which depends on the topology error of the SOM and the number of variables.This score provides a parsimonious description of the event, whilst preserving as much as possible the topology of the initial space.
Numerous variables derived mainly from rain rate recordings are used to describe precipitation in the context of rain time series studies and a wide variety of topics of interest, including hydrology, meteorology, climate and weather forecasting.The algorithm proposed in this study produces a subspace formed by only 5 of the 23 rain features described in the literature.We show that these five features can be selected by the algorithm in an unsupervised manner and, from the macrophysical point of view, can provide an adequate description of the main characteristics of rainfall events.These characteristics are the event duration, the peak rain rate, the rain event depth, the standard deviation of the event rain rate and the absolute rain rate variation of order 0.5.
In order to confirm the relevance of the five selected features, we analyze the corresponding SOM and are able to clearly reveal the presence of relationships between these features.This approach also reveals the independence of the inter-event time (IET p ) characteristic and the weak dependence of the dry percentage in event (D d%e ) characteristic, thus confirming that a rain time series can be considered as an alternating series of independent rain events, interrupted by periods without rain.Hierarchical clustering allows the wellknown separation between stratiform and convective events to be clearly identified.This dual classification is then refined into a set of five relatively homogeneous subclasses.The stratiform class is divided into two subclasses: a drizzle/very light rain subclass and a normal event subclass.The convective class is divided into three subclasses, characterized by a strong temporal heterogeneity and significant rain rates.
As this research was based on the analysis of observations made in midlatitude plains in France, the relevance of this classification remains to be confirmed through the analysis of data sets recorded in different climatic zones and under different meteorological conditions, such as those encountered in mountainous or coastal areas.If the SOM described in the present study were learned with a more exhaustive data set, a larger map would be produced, and this could reveal new types of rainfall behavior, which remained undetected in the current data set.This point will be addressed in future studies.
The data-driven analysis of entire rain events (rather than the analysis of fixed-length samples) is relevant to the study of interactions between the macrophysical (based on the rain rate) and microphysical (based on raindrops) properties of rain. In the present study, several strong relationships were identified between these microphysical and macrophysical characteristics, and we show that some of the five subclasses identified in this analysis have specific microphysical characteristics. A relationship between the microphysical and macrophysical properties of rain can have many practical implications, especially for remote sensing. In the context of weather radar applications, the microphysical properties of rain are needed in order to estimate rain rates through the use of Z-R relationships. The estimation of microphysical rain characteristics, based on easily observable rain gauge measurements, could therefore play a significant role in the development of quantitative precipitation estimation (QPE).
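The sensitivity of radar rain rate estimates to the Z-R coefficients can be illustrated with a short numerical sketch. It assumes the usual power law Z = a R^b, with Z in mm^6 m^-3 and R in mm/h. The two coefficient pairs used here, the Marshall-Palmer values (a = 200, b = 1.6) and the convective values commonly used with WSR-88D radars (a = 300, b = 1.4), are standard values from the radar literature and are not results of the present study; the function name is hypothetical.

```python
import math

def rain_rate_from_dbz(dbz, a=200.0, b=1.6):
    """Invert Z = a * R**b to obtain the rain rate R (mm/h).

    dbz is the reflectivity in dBZ; the defaults are the Marshall-Palmer
    coefficients, used here as an assumption for illustration.
    """
    z_linear = 10.0 ** (dbz / 10.0)  # convert dBZ to mm^6 m^-3
    return (z_linear / a) ** (1.0 / b)

# The same reflectivity yields different rain rates under different Z-R laws
for dbz in (20.0, 30.0, 40.0):
    r_mp = rain_rate_from_dbz(dbz)                  # Marshall-Palmer
    r_conv = rain_rate_from_dbz(dbz, a=300.0, b=1.4)  # convective law
    print(f"{dbz:.0f} dBZ: {r_mp:.2f} mm/h (a=200, b=1.6), "
          f"{r_conv:.2f} mm/h (a=300, b=1.4)")
```

Since a and b depend on the drop size distribution, knowing the microphysical characteristics of an event subclass would, in principle, help in choosing a suitable pair of coefficients.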

Figure 1. PCA on the learning data set based on the 23 variables described in Table 3. (a) Correlation circle on axes 1 and 2. (b) Correlation circle on axes 1 and 3. All of the variables are normalized according to the last column of Table 2.

Figure 2. Diagram for the selection of variables based on a genetic algorithm associated with Kohonen maps.

Figure 3. Distance matrix for the M(x_Best) map: the color of each neuron represents the average distance between itself and its neighboring neurons. The value inside each neuron indicates the number of rain events that it has captured inside the learning data set. The black line separates the neurons into two classes, using the hierarchical ascendant classification (see Sect. 4.1). The arrows represent the gradients of the variables R_max, σ_R and D_e.

Figure 4. Projection of the M(x_Best) map according to the 23 variables. The red-framed variables are those selected by the GA. The last two variables, D_m and N_0^*, are defined in Sect. 5.

Figure 5. The variable β_L3 versus its corresponding value, given by the best matching unit: from the learning data set (circles) and the test data set (stars). The solid line corresponds to the first diagonal.

Figure 6. Dendrogram obtained from the hierarchical cluster analysis of the 64 neurons in the SOM. The horizontal dashed line represents the threshold between the two classes.

Figure 7. Representation of the neurons in the (R_m, β_L3) and (R_m, P_c2) subspaces. The stars represent neurons from group 1 (stratiform), and the squares correspond to neurons from group 2 (convective). Dashed lines indicate neuron no. 64.

Figure 8. Hierarchical clustering of the map into five subclasses. The colors represent the subclass numbers: subclass 1 is dark blue, subclass 2 is blue, subclass 3 is green, subclass 4 is orange, and subclass 5 is red.

Figure 9. Microphysical variable N_0^* versus D_m for the five rain event subclasses. The three neurons corresponding to mixed events are circled. The dashed lines correspond to the borders D_m > 1.66 and log(N_0^*) > 6.15.

Table 1. Observation periods, availability of DBS observations and numbers of rain events for the learning and test data sets.

Table 2. The 23 variables identified in the literature and used for the characterization of rain events.

Table 4. Coefficient of determination obtained on the learning and test data sets. The values in bold correspond to the five selected variables.

Table 5. Summary of the rain event subclasses computed with the learning data set.