Machine learning for improved data analysis of biological aerosol using the WIBS

Ruske, Simon; Topping, David O.; Foot, Virginia E.; Morse, Andrew P.; Gallagher, Martin W.

doi:https://doi.org/10.5194/amt-11-6203-2018

Articles | Volume 11, issue 11

Research article

19 Nov 2018

Abstract

Primary biological aerosol, including bacteria, fungal spores and pollen, has important implications for public health and the environment. Such particles may have different concentrations of chemical fluorophores and will respond differently in the presence of ultraviolet light, potentially allowing for different types of biological aerosol to be discriminated. Development of ultraviolet light induced fluorescence (UV-LIF) instruments such as the Wideband Integrated Bioaerosol Sensor (WIBS) has allowed for size, morphology and fluorescence measurements to be collected in real time. However, without studying instrument responses in the laboratory, it is unclear to what extent different types of particles can be discriminated. Collection of laboratory data is vital to validate any approach used to analyse data and to ensure that the available data are utilized as effectively as possible.

In this paper a variety of methodologies are tested on a range of particles collected in the laboratory. Hierarchical agglomerative clustering (HAC) has been previously applied to UV-LIF data in a number of studies and is tested alongside other algorithms that could be used to solve the classification problem: density-based spatial clustering of applications with noise (DBSCAN), k-means and gradient boosting.

Whilst HAC was able to effectively discriminate between reference narrow-size distribution PSL particles, yielding a classification error of only 1.8 %, similar results were not obtained when testing on laboratory generated aerosol where the classification error was found to be between 11.5 % and 24.2 %. Furthermore, there is a large uncertainty in this approach in terms of the data preparation and the cluster index used, and we were unable to attain consistent results across the different sets of laboratory generated aerosol tested.

The lowest classification errors were obtained using gradient boosting, where the misclassification rate was between 4.38 % and 5.42 %. The largest contribution to the error, in the case of the higher misclassification rate, was the pollen samples where 28.5 % of the samples were incorrectly classified as fungal spores. The technique was robust to changes in data preparation provided a fluorescent threshold was applied to the data.

In the event that laboratory training data are unavailable, DBSCAN was found to be a potential alternative to HAC. In the case of one of the data sets where 22.9 % of the data were left unclassified we were able to produce three distinct clusters obtaining a classification error of only 1.42 % on the classified data. These results could not be replicated for the other data set where 26.8 % of the data were not classified and a classification error of 13.8 % was obtained. This method, like HAC, also appeared to be heavily dependent on data preparation, requiring a different selection of parameters depending on the preparation used. Further analysis will also be required to confirm our selection of the parameters when using this method on ambient data.

There is a clear need for the collection of additional laboratory generated aerosol to improve interpretation of current databases and to aid in the analysis of data collected from an ambient environment. New instruments with a greater resolution are likely to improve on current discrimination between pollen, bacteria and fungal spores, and even between different species; however, the need for extensive laboratory data sets will grow as a result.

Dates

Received: 19 Apr 2018 – Discussion started: 18 Jun 2018 – Revised: 15 Oct 2018 – Accepted: 26 Oct 2018 – Published: 19 Nov 2018

1 Introduction

Biological aerosol, such as bacteria, fungal spores and pollen, has important implications for public health and the environment (Després et al., 2012). Such particles have been linked to the formation of cloud condensation nuclei and ice nuclei, which in turn may have an important influence on the weather (Crawford et al., 2012; Cziczo et al., 2013; Gurian-Sherman and Lindow, 1993; Hader et al., 2014; Hoose and Möhler, 2012; Möhler et al., 2007). These particles have impacts on health (Kennedy and Smith, 2012), particularly for those who suffer from asthma and allergic rhinitis (D'Amato et al., 2001). It is therefore of paramount importance that we continue to develop methods of detecting these particles, to quantify them, determine seasonal trends and to compare different environments.

There is a wide range of biological molecules, commonly referred to as biological fluorophores, that are known to re-emit radiation upon excitation, e.g. amino acids, coenzymes and pigments (Pöhlker et al., 2012, 2013). Ultraviolet-light induced fluorescence (UV-LIF) spectrometers, such as the Wideband Integrated Bioaerosol Sensor (WIBS), have received increased attention in recent years as a potential methodology for detecting biological aerosol (Kaye et al., 2005). The WIBS uses irradiation at 280 and 370 nm to target some of the most significantly fluorescent biofluorophores such as tryptophan (an amino acid) and NADH (a coenzyme). These measurements are combined with an optical measurement of size and shape to further aid in discrimination.

Measurements from the WIBS have limited application in isolation. However, there is a range of techniques that could be used to predict quantities of biological aerosol from these fluorescence, size and morphology measurements. Techniques that could be used to solve this classification problem include field specific techniques such as ABC analysis (Hernandez et al., 2016) as well as supervised and unsupervised machine learning techniques that are broadly used (Friedman et al., 2001).

It is not clear at this point what approach is preferred as all approaches have a range of advantages and disadvantages.

Supervised machine learning uses data collected within the laboratory, where the correct classification is known. Data are split into training data and testing data where the training data are used to fit a model which is then validated on the test set. Once a model is fitted and validated it may then be applied to classify ambient data.

During unsupervised analysis, ambient data are classified without using laboratory training data. Instead, an attempt is made to naturally segregate the data. Ideally, we may expect data to naturally be segregated into broad biological classes or into different groups of similar bacteria, fungal material and pollen, but this may not necessarily be the case.

The supervised methods have the disadvantage that the training data collected may not include the entirety of what might be collected during an ambient campaign. Particularly in an urban environment, the instrument may collect measurements for a large quantity of non-biological material that should be classified as such or removed from the analysis. We would expect most of this non-biological material to be either non-fluorescent or weakly fluorescent, and therefore it should be removed prior to analysis by applying a justifiable threshold to the fluorescent measurements (see Sect. 2.2). Nonetheless, a few weakly fluorescent non-biological particles may remain and could be overlooked if the training data are incomplete.

There are likely to be issues to be explored with either approach, so it seems unlikely that either supervised or unsupervised techniques can justifiably be abandoned at this point in time. It may well be that a variety of techniques is required to better understand the atmospheric environment. Nonetheless, it is still vital to investigate how these different techniques behave when analysing laboratory data to better understand how they can be most appropriately applied to ambient data.

In an ambient setting, determining the number of clusters is difficult, so hierarchical agglomerative clustering (HAC) has been preferred over other methods such as k-means, since it naturally presents a clustering for every possible number of clusters (Robinson et al., 2013). A suggestion of the number of clusters can then be provided using indices such as the Caliński–Harabasz Index (CH Index) (Caliński and Harabasz, 1974) by maximizing a statistic which yields a peak for clusterings which contain clusters that are compact and far apart. HAC has previously been used on data collected using the WIBS to discriminate between different Polystyrene Latex Spheres (PSLs) and has been applied to ambient measurements collected as part of the BEACHON RoMBAS experiment (Crawford et al., 2015; Gabey et al., 2012; Robinson et al., 2013).

Nonetheless, relatively few studies have examined the usage of HAC on laboratory data from the WIBS (Savage and Huffman, 2018; Savage et al., 2017). Evaluating the effectiveness of HAC on generated aerosol is crucial to support or repudiate conclusions made using HAC on ambient data, especially since the fluorescence response from the laboratory generated aerosol will much better reflect fluorescence responses from the environment, when compared with PSLs.

During the process of HAC there are also a number of vital choices that have to be made that could have substantial implications for the effectiveness of the method (these are discussed in detail in Sect. 2.2). For the PSLs previously analysed (Crawford et al., 2015), we determined that standardizing using the z score, combined with removal of non-fluorescent particles and taking logarithms of shape and size, was most effective. The CH index was selected to determine the number of clusters as it was demonstrated to perform best in the literature (Milligan and Cooper, 1985). It is, however, not clear whether these choices will remain the most effective for laboratory generated aerosol or ambient data. See Sect. 2.3 for further details on data preparation for HAC.

Figure 1. Overview of different analysis approaches.

Furthermore, data analysis using HAC can take a matter of hours, if not days, depending on the number of particles. The time requirements for HAC scale between N² and N³, meaning that doubling the number of particles requires between 4 and 8 times as much time. Consequently, the method is not only already quite slow but will become increasingly slower as more data are collected, which may limit its real-time effectiveness.

Within the Python programming language, a package called Scikit-learn (Pedregosa et al., 2011) offers implementations of several unsupervised methods. Some of these methods, namely Affinity Propagation, Mean-shift, Spectral Clustering and Gaussian mixtures, are not explored as they will scale poorly as the number of particles increases (Pedregosa et al., 2011). Instead, our analysis is focused on k-means, HAC and DBSCAN, which can be used on larger data sets.

For HAC we continue to use the fastcluster package (described in Sect. 2.3). Scikit-learn does have a HAC implementation but it is not as fast or memory efficient. We do use Scikit-learn for DBSCAN and k-means, although if one were to use DBSCAN for ambient data we would suggest exploring alternatives such as ELKI (Schubert et al., 2015), as the Scikit-learn implementation of DBSCAN is by default not memory efficient, making it difficult to utilize for more than 30 000 particles. Scikit-learn has a fast implementation of gradient boosting, so this is used.
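
As a rough worked example of this scaling, the sketch below estimates how run time would grow under pure N² and N³ cost models. The function name and the baseline figures (10 000 particles clustered in 1 min) are illustrative assumptions, not timings measured in this study.

```python
def scaled_runtime(base_time, base_n, new_n, exponent):
    """Estimate run time at new_n particles, assuming cost grows as N**exponent."""
    return base_time * (new_n / base_n) ** exponent


# Hypothetical baseline (an assumption): 10 000 particles clustered in 1 minute.
base_n, base_minutes = 10_000, 1.0

for n in (20_000, 100_000):
    low = scaled_runtime(base_minutes, base_n, n, 2)   # N^2 scaling
    high = scaled_runtime(base_minutes, base_n, n, 3)  # N^3 scaling
    print(f"{n} particles: between {low:.0f} and {high:.0f} minutes")
```

Under these assumptions a tenfold increase in the number of particles costs between 100 and 1000 times as much time, which illustrates why the scaling matters more than the absolute speed.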

2 Methods

In this section we discuss the variety of approaches that could be used to classify particles such as bacteria, fungal spores or pollen. In Sect. 2.1 we provide an overview of the instrument used to collect the data. In Sect. 2.2 we discuss the variety of decisions that need to be made prior to passing the data to the machine learning algorithms which are discussed in Sect. 2.3–2.6. An overview of the different methods is given in Fig. 1.

2.1 Instrumentation

The Wideband Integrated Bioaerosol Sensor (WIBS) collects size, shape and fluorescence measurements (Kaye et al., 2005). The size is a single measurement; the shape measurement consists of four measurements (one for each quadrant) which are combined to produce a single asymmetry factor measurement. A more precise definition of asymmetry factor has been provided previously in the literature (Gabey et al., 2010).

To measure fluorescence, the particle is irradiated with UV light at 280 and 370 nm from the firing of two xenon sources. Fluorescence emission is collected via two collection channels in the ranges 310–400 and 420–600 nm. The 370 nm xenon radiation lies within the first detection range and hence elastically scattered light from the particle, sufficient to saturate the detection amplifier, is received. This signal is therefore discarded.

After removal of this fluorescence measurement, there are three remaining fluorescence measurements. The notation FL1_280 is used to denote the measurement in the first detection channel when the particle is irradiated with ultraviolet light at 280 nm and FL2_280 and FL2_370 are used to denote the measurements in the second detection channel when the particle is irradiated with ultraviolet light at 280 and 370 nm, respectively. These fluorescence measurements are combined with the size and asymmetry factor measurements. A more detailed description of the instrument can be found in previous publications (Gabey et al., 2010; Healy et al., 2012a).

Figure 2. Overview of preprocessing steps for WIBS data.

2.2 Data preparation

Prior to analysis using the machine learning algorithm we may choose to make a variety of decisions to pre-process the data with the aim of improving performance (see Fig. 2). An overview of the decisions often made is given below.

First, we may elect to remove particles which are non-fluorescent. Forced trigger data, which measure the instrument response when particles are not present, are collected. We then set a threshold: a particle that fails to exceed this threshold in every fluorescence channel is deemed non-fluorescent, so a particle must exceed the threshold in at least one channel to be retained. Usually we set the threshold to be three standard deviations above the average forced trigger measurement, although a recent laboratory study has suggested that nine standard deviations may be more appropriate (Savage et al., 2017).

Another threshold is usually then applied to the size. A size threshold of 0.8 µm is usually applied, as the detection efficiency of the instrument drops below 50 % at this point (Gabey, 2011; Gabey et al., 2011; Healy et al., 2012b).

Natural logarithms of the size and the asymmetry factor are often taken as these measurements are often log-normally distributed and it is postulated that this will increase performance in the case of hierarchical agglomerative clustering.

It is also widely regarded that standardizing the data prior to analysis is of utmost importance (Milligan and Cooper, 1988). We often subtract the average measurement in each of the five variables and divide by the standard deviation, often referred to as “standardizing using the z score”. Standardization is used to prevent variables with larger magnitude, such as the fluorescent measurements, from dominating the analysis. An alternative approach to standardizing is to divide each of the five variables by the range.
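
The preparation steps described in this section can be sketched as follows. This is a minimal illustration using NumPy, not the code used in the study: the function name `prepare`, the column order, and the synthetic values are all assumptions, and the asymmetry factor is assumed to be strictly positive so that its logarithm is defined.

```python
import numpy as np


def prepare(particles, forced_trigger, n_sigma=3.0, min_size=0.8):
    """Threshold, log-transform and z-score standardize WIBS-like data.

    particles: (n, 5) array; assumed column order is
        FL1_280, FL2_280, FL2_370, size (micrometres), asymmetry factor.
    forced_trigger: (m, 3) array of fluorescence readings with no particle.
    """
    particles = np.asarray(particles, dtype=float)
    forced_trigger = np.asarray(forced_trigger, dtype=float)

    # Per-channel threshold: mean + n_sigma standard deviations of forced trigger.
    threshold = forced_trigger.mean(axis=0) + n_sigma * forced_trigger.std(axis=0)

    # Keep particles that exceed the threshold in at least one channel
    # and that are at least min_size micrometres.
    fluorescent = (particles[:, :3] > threshold).any(axis=1)
    large_enough = particles[:, 3] >= min_size
    kept = particles[fluorescent & large_enough].copy()

    # Natural logarithms of size and asymmetry factor (assumed positive).
    kept[:, 3:] = np.log(kept[:, 3:])

    # Standardize each of the five variables using the z score.
    return (kept - kept.mean(axis=0)) / kept.std(axis=0)


# Synthetic stand-in data (assumed values, for illustration only).
rng = np.random.default_rng(0)
forced = rng.normal(10.0, 2.0, size=(200, 3))
raw = np.hstack([
    rng.normal(50.0, 30.0, size=(500, 3)),    # fluorescence channels
    rng.lognormal(0.5, 0.5, size=(500, 1)),   # size
    rng.lognormal(2.0, 0.5, size=(500, 1)),   # asymmetry factor
])
prepared = prepare(raw, forced)
print(prepared.shape)
```

Passing `n_sigma=9.0` would give the stricter threshold mentioned above; dividing by the range instead of the standard deviation would require changing only the final line of the function.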

2.3 Hierarchical agglomerative clustering

In order for particles to be clustered, we need to define a measurement of how similar two clusters are. These similarity measures are often referred to as linkages. We use the Python package fastcluster (Müllner, 2013) which provides modern implementations of single, complete, average, weighted, Ward, centroid and median linkages (Müllner, 2011). A thorough detailing of the definitions of the different linkages can be found in the fastcluster manual (Müllner, 2013). For the memory efficient mode, which is essential when using the algorithm for large data sets, only Ward, centroid, median and single linkages are available.

Initially each particle is placed into an individual cluster. Next, using the linkage selected, the two most similar clusters are merged. The merging process is repeated until all the particles are placed in a single cluster, which provides a clustering for each k = 1, …, N, where k is the number of clusters and N is the number of particles being analysed. A cluster validation index such as the Caliński–Harabasz index (Caliński and Harabasz, 1974) is then used to identify an appropriate number of clusters. The index is maximized for clusterings that contain compact clusters that are far apart.
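
The procedure above can be sketched as follows. This is an illustrative example rather than the analysis code: SciPy's `linkage` is used as a stand-in because fastcluster provides a `linkage` function with the same interface, the helper names `ch_index` and `cluster_with_ch` are hypothetical, and the three synthetic clusters are assumed data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def ch_index(data, labels):
    """Caliński–Harabasz index: between- over within-cluster dispersion."""
    n = len(data)
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = data.mean(axis=0)
    between = within = 0.0
    for c in clusters:
        members = data[labels == c]
        centre = members.mean(axis=0)
        between += len(members) * np.sum((centre - overall_mean) ** 2)
        within += np.sum((members - centre) ** 2)
    return (between / (k - 1)) / (within / (n - k))


def cluster_with_ch(data, max_k=10):
    """Build a Ward hierarchy, then pick the k that maximizes the CH index."""
    tree = linkage(data, method="ward")
    best = None
    for k in range(2, max_k + 1):
        # Cut the hierarchy so that at most k clusters remain.
        labels = fcluster(tree, t=k, criterion="maxclust")
        score = ch_index(data, labels)
        if best is None or score > best[1]:
            best = (k, score, labels)
    return best


# Synthetic stand-in: three compact, well-separated groups of 50 points.
rng = np.random.default_rng(0)
centres = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
data = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in centres])

best_k, best_score, best_labels = cluster_with_ch(data)
print("suggested number of clusters:", best_k)
```

For large data sets the memory efficient mode mentioned above corresponds, in fastcluster, to its `linkage_vector` function, which accepts the data matrix directly instead of a full distance matrix.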

2.4 K-means clustering

K-means clustering is designed to place particles into k clusters. However, we can repeat the method multiple times, e.g. for k=1, 2, …, 10, where k is the number of clusters. Similar to HAC we can then use a cluster validation index to determine which choice of k gives the most effective results.

The method works as follows. Initially k cluster centroids are set by selecting k particles at random. The rest of the particles are then placed into these k clusters depending on which of the centroids the particle is closest to. At this point a new centroid is calculated for each cluster. The process is then repeated many times until convergence occurs and the centroids do not change significantly from one iteration to the next.
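
The steps above can be sketched directly in NumPy. This is a minimal illustration of the basic algorithm as described, with an assumed function name, tolerance and synthetic data; in practice the Scikit-learn implementation used in this study offers more robust initialization.

```python
import numpy as np


def k_means(data, k, max_iter=100, tol=1e-8, seed=0):
    """Basic k-means: random particles as initial centroids, then iterate."""
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct particles selected at random.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Place each particle in the cluster with the nearest centroid.
        distances = np.linalg.norm(
            data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recalculate centroids; keep the old centroid if a cluster is empty.
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j)
            else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:  # converged: centroids no longer move
            break
    return labels, centroids


# Synthetic stand-in: two well-separated groups of 50 points each.
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal((0.0, 0.0), 0.5, size=(50, 2)),
    rng.normal((10.0, 10.0), 0.5, size=(50, 2)),
])
labels, centroids = k_means(data, k=2)
print(centroids.round(1))
```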

2.5 DBSCAN

For DBSCAN we set two parameters, the radius for a neighbourhood ϵ, and the number of particles required for a neighbourhood to be identified as dense.

Initially a random point, say A, is selected. If there is a sufficient number of points in the neighbourhood of A then all the points in A's neighbourhood are also checked and so on, until the cluster has fully expanded and there are no points left to check. Should the point not have a sufficient number of other points in its neighbourhood then it is left unclassified. Further points are then selected and the above process is repeated until all points have been considered.

Figure 3. Visual representation of DBSCAN. Here each point is represented as a black dot and its neighbourhood is represented by a circle. Here ϵ is the radius of the circle and the minimum number of points is 3. Four points have each been placed into the blue cluster and green cluster, all of which have at least 3 other points in their neighbourhood. One point is classified as noise as it has only 1 other point in its neighbourhood.

We give an example of DBSCAN in Fig. 3. Note that cluster validation indices are not required for DBSCAN, since the number of clusters is intrinsically calculated within the algorithm.
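
A short sketch of how the two parameters are supplied to the Scikit-learn implementation is given below. The synthetic points and the parameter values are assumptions chosen for illustration, not the values selected in this study.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in: two dense groups and one isolated point.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=(0.0, 0.0), scale=0.2, size=(40, 2))
cluster_b = rng.normal(loc=(5.0, 5.0), scale=0.2, size=(40, 2))
outlier = np.array([[20.0, -20.0]])
points = np.vstack([cluster_a, cluster_b, outlier])

# eps is the neighbourhood radius; min_samples is the number of points
# (counting the point itself in Scikit-learn) needed for a dense neighbourhood.
model = DBSCAN(eps=0.5, min_samples=4)
labels = model.fit_predict(points)

# Scikit-learn marks unclassified (noise) points with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters:", n_clusters, "unclassified:", n_noise)
```

Note that `min_samples=4` here corresponds to requiring at least 3 other points in the neighbourhood, because Scikit-learn counts the point itself.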

2.6 Gradient boosting

A basic decision tree is constructed by considering each possible split across all variables and evaluating which split best divides the data. For example, we may consider the third fluorescence channel and split the data on the basis of whether the measurement is more or less than 10 arbitrary units (AU). This process is then repeated many times until a tree is built.
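
The search for a single split can be sketched as below. The text does not state which criterion is used to judge how well a split divides the data, so Gini impurity is assumed here; the function names and the small data set are likewise hypothetical.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a set of class labels (0 for a pure set)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_split(features, labels):
    """Try every threshold on every variable; return the purest split."""
    n, d = features.shape
    best = (None, None, np.inf)  # (variable index, threshold, impurity)
    for j in range(d):
        values = np.unique(features[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2.0:
            left = labels[features[:, j] <= t]
            right = labels[features[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best


# Hypothetical data: three variables, the third separating the two classes.
features = np.array([
    [5.0, 3.0, 2.0],
    [6.0, 1.0, 4.0],
    [5.5, 2.0, 8.0],
    [6.5, 2.5, 12.0],
    [5.2, 1.5, 15.0],
    [6.1, 3.5, 20.0],
])
labels = np.array([0, 0, 0, 1, 1, 1])

variable, threshold, impurity = best_split(features, labels)
print(variable, threshold, impurity)
```

In this constructed example the chosen split is on the third variable at 10 AU, mirroring the example in the text; a full tree would apply the same search recursively to each side of the split.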

There are two ways in which trees can be combined into an ensemble. The first is by averaging multiple trees in the hope of producing a more accurate classification, as is the case in random forests and bagging classifiers (Breiman, 1996, 2001). In the case of random forests and bagging, the data set is sampled with replacement, meaning that the same particle could be selected more than once or not at all. Sampling in this way enables the algorithm to produce a subtly different version of the data from which to build each tree. In addition, when using a random forest, instead of considering all possible variables to use to split the data, only a random subset is used.

Alternatively we can fit a single decision tree to the data, evaluate where the tree is performing well and then fit a second tree to the particles in the data for which the current model is performing poorly. This process can be repeated many times, each time adding a new tree to the model in the hope of making an improvement. This approach is known as AdaBoost (Freund and Schapire, 1997). Gradient boosting is an extension of AdaBoost to allow for other loss functions (Friedman, 2001).

For the current study we elect to use gradient boosting to indicate the performance of the supervised approach since it was the best performer for the Multiparameter Bioaerosol Spectrometer, a UV-LIF spectrometer similar to the WIBS but with single waveband fluorescence, 8 fluorescence detection channels and very high shape analysis capability (Ruske et al., 2017).
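
A minimal sketch of the supervised workflow with the Scikit-learn gradient boosting classifier is given below. The three synthetic classes, the split proportion and the default hyperparameters are assumptions for illustration and do not reproduce the configuration or results of this study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: three classes with five variables each,
# loosely mimicking three fluorescence channels, size and asymmetry factor.
rng = np.random.default_rng(0)
centres = np.array([[0.0] * 5, [3.0] * 5, [6.0] * 5])
features = np.vstack([rng.normal(c, 1.0, size=(200, 5)) for c in centres])
labels = np.repeat([0, 1, 2], 200)

# Split into training data (to fit the model) and test data (to validate it).
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels)

model = GradientBoostingClassifier(random_state=0)
model.fit(x_train, y_train)

# Misclassification rate on data the model has not seen.
error = 1.0 - model.score(x_test, y_test)
print(f"misclassification rate: {error:.3f}")
```

Once validated in this way, `model.predict` could be applied to prepared ambient measurements, subject to the caveat that ambient particles may not resemble anything in the training data.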

2.7 Evaluation criteria

Figure 4. Four example matching matrices. Immediately below each matrix is the percentage of particles placed into the same cluster for both clusterings in each case. At the very bottom we have the adjusted Rand score.

To aid in evaluating how well methodologies performed we used two tools: the matching matrix (Ting, 2010) and the adjusted Rand score (Hubert and Arabie, 1985).

In Fig. 4 we present four different matching matrices. To produce these matrices we compared: two random clusterings with approximately 50 % of the data in each cluster (A); two random clusterings each with 80 % and 20 % of the data in each of the two clusters, respectively (B); two identical clusterings (C); and two clusterings which were nearly identical except one data point had been placed into a third cluster for one of the clusterings (D).

2.7.1 Matching matrix

The matching matrix, often referred to as a confusion matrix, can be used as an aid in comparing two clusterings.

In the case of the current paper, we use this to compare the output from an algorithm with labels assigned to each particle. We may assign labels to indicate what broad type the particle is (e.g. 1 if the particle is bacteria, 2 if the particle is fungal etc.) or we may assign labels to indicate what sample a particle is from (e.g. 1 if the particle is Bacillus atrophaeus, 2 if the particle is E. coli etc.).

Consider example C in Fig. 4. This matching matrix compares two clusterings each containing two clusters. Each row corresponds to a cluster in the first clustering and each column corresponds to a cluster in the second clustering. The element in the first row and the first column (in this case 784) indicates the number of particles that were placed into the first cluster in the first clustering that were also placed into cluster 1 in the second clustering. Two identical clusterings will produce a matching matrix that has non-zero values only on the diagonal.

A and B in Fig. 4 are examples of poor performance and C and D are examples of very good performance.
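
A matching matrix of this kind can be built with a few lines of NumPy, as sketched below. The function name and the six example labels are hypothetical and are not taken from Fig. 4.

```python
import numpy as np


def matching_matrix(first, second):
    """Count particles for each (cluster in first, cluster in second) pair."""
    first = np.asarray(first)
    second = np.asarray(second)
    rows, row_idx = np.unique(first, return_inverse=True)
    cols, col_idx = np.unique(second, return_inverse=True)
    matrix = np.zeros((len(rows), len(cols)), dtype=int)
    # Rows index clusters of the first clustering, columns the second.
    np.add.at(matrix, (row_idx, col_idx), 1)
    return matrix


# Hypothetical labels for six particles under two clusterings.
first = [1, 1, 1, 2, 2, 3]
second = [1, 1, 2, 2, 2, 3]

print(matching_matrix(first, second))
print(matching_matrix(first, first))  # identical clusterings: diagonal only
```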

2.7.2 Adjusted Rand score

When evaluating a large number of clusterings, it may be useful to use a statistic to summarize the information in the matching matrix. In a previous study (Ruske et al., 2017), we used the percentage of particles correctly classified as a statistic for indicating performance. This is an easy to interpret statistic, but can be misleading when used on imbalanced data. In both examples A and B, we have two randomly generated clusterings. However, in B we have 80 % of the data points placed into the first cluster, whereas in A the data points are approximately equally distributed between the two clusters. The percentages of points which are placed into the same cluster for both clusterings are 52.2 % and 68.3 % for A and B, respectively. We can see that the more imbalanced a data set is, the more likely data points are to be placed into the same clusters by chance. It is for this reason that we elect to use an alternative statistic: the adjusted Rand score. This statistic attains a value of approximately zero for both A and B.

Comparing clusterings is a developing area of research and there are alternative statistics, such as the mutual information score (Vinh et al., 2010), that could be preferable to the adjusted Rand score. However, our initial tests (not presented) indicated that calculation of the mutual information often required an order of magnitude more time than the calculation of the adjusted Rand score. Therefore, we elected to use the adjusted Rand score for the current study.
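
The effect of imbalance can be reproduced with a short sketch. For two independent random clusterings that each place 80 % of points in the first cluster, the expected agreement is 0.8² + 0.2² = 0.68, close to the 68.3 % reported for example B, while the balanced case gives 0.5² + 0.5² = 0.5. The code below uses hypothetical function names and freshly generated random labels rather than the data behind Fig. 4, and computes the adjusted Rand score from the counts in the matching matrix following the standard definition of Hubert and Arabie (1985).

```python
import random
from collections import Counter
from math import comb


def agreement(first, second):
    """Fraction of points given the same label by both clusterings."""
    return sum(a == b for a, b in zip(first, second)) / len(first)


def adjusted_rand(first, second):
    """Adjusted Rand score from matching-matrix cell, row and column counts.

    Undefined (division by zero) if both clusterings have a single cluster.
    """
    n = len(first)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(first, second)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(first).values())
    sum_cols = sum(comb(c, 2) for c in Counter(second).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # value expected by chance
    maximum = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (maximum - expected)


# Two independent random clusterings, each with roughly an 80/20 split.
rng = random.Random(0)
n = 5000
first = [0 if rng.random() < 0.8 else 1 for _ in range(n)]
second = [0 if rng.random() < 0.8 else 1 for _ in range(n)]

print(f"agreement:     {agreement(first, second):.3f}")
print(f"adjusted Rand: {adjusted_rand(first, second):.3f}")
```

The agreement is well above 50 % even though the two labelings are unrelated, whereas the adjusted Rand score stays close to zero; for two identical clusterings it equals 1.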

3 Data

The efficacy of the different data analysis approaches was evaluated using three different data sets. The first comprised several industry standard polystyrene latex spheres of various sizes and colours. This data set was first analysed in Crawford et al. (2015), where hierarchical agglomerative clustering was successfully applied to the data, yielding a classification accuracy of 98.2 %. This data set presents a simple challenge for which we would expect any reasonable algorithm to be able to discriminate between the different sizes and colours of particles.

To further extend the previous analysis in Crawford et al. (2015) we include two data sets collected in 2008 and 2014 which are similar to data previously published using the Multiparameter Bioaerosol Spectrometer (Ruske et al., 2017). A subsection of the data collected in 2014 has previously been analysed in the Appendix of Crawford et al. (2017). These data sets consist of various pollen, fungal, bacterial and non-biological samples, and should present a much more difficult challenge for the algorithms.

The samples of laboratory generated aerosol were collected as follows. Material was aerosolized into a large, clean HEPA filtered chamber, which incorporated a recirculation fan. The Bacillus atrophaeus and Escherichia coli (E. coli) bacteria were aerosolized into the chamber using a mini-nebulizer (e.g. Hudson RCI Micro-Mist nebulizer), as were the salt and phosphate buffered saline samples. The dry samples, which included the pollen and fungal samples, were aerosolized directly into the chamber from small quantities of powder utilizing a filtered compressed air jet. The diesel smoke and grass smoke samples were generated by burning a small amount within a fume cupboard using a smoker (a piece of bespoke equipment). The bacterial samples were either washed or unwashed and diluted or undiluted.

Table 1. The number of particles remaining after a fluorescent threshold of 3σ or 9σ was applied for each of the bacterial samples collected in 2008. Each sample was either washed or unwashed, and diluted or undiluted, as indicated by a check mark in the corresponding column.
