Single-particle mass spectrometer (SPMS) analysis of aerosols has become increasingly popular since its invention in the 1990s. Today many iterations of commercial and lab-built SPMSs are in use worldwide. However, supporting analysis toolkits for these powerful instruments are outdated, have limited functionality, or exist only in versions that are not available to the scientific community at large. In an effort to advance this field and allow better communication and collaboration between scientists, we have developed FATES (Flexible Analysis Toolkit for the Exploration of SPMS data), a MATLAB toolkit easily extensible to an array of SPMS designs and data formats. FATES was developed to minimize the computational demands of working with large data sets while still allowing easy maintenance, modification, and utilization by novice programmers. FATES permits scientists to explore, without constraint, complex SPMS data with simple scripts in a language popular for scientific numerical analysis. In addition FATES contains an array of data visualization graphical user interfaces (GUIs) which can aid both novice and expert users in calibration of raw data; exploration of the dependence of mass spectral characteristics on size, time, and peak intensity; and investigations of clustered data sets.
Single-particle mass spectrometers (SPMSs) yield the size and chemical composition of individual aerosol particles in real time. Utilizing laser desorption–ionization (LDI), SPMSs can generate tens of single-particle mass spectra per second. However mass spectra generated by LDI exhibit ion signals only qualitatively dependent on particle chemical composition (e.g., Ge et al., 1998; Gross et al., 2000; Hinz and Spengler, 2007) and can also exhibit large particle-to-particle variation even for chemically uniform particles (e.g., Steele et al., 2005; Wenzel and Prather, 2004; Zelenyuk et al., 2008a, b). Thus SPMSs generate data sets that are both large and highly complex, requiring sophisticated data analysis techniques for exploration and distillation of information.
Table 1. Summary of SPMSs developed and data analysis packages used.
As Table 1 illustrates, individual laboratories have independently developed
a variety of SPMSs, and two commercial versions have also been produced. Due
to the many iterations of SPMSs that exist and the lack of a standard data
format, individual laboratories have had to build their own data analysis
software, though these toolkits are often not reported in the literature
(Table 1). Only two of these data analysis toolkits have been made publicly
available, YAADA (
Motivated by the continued use of SPMS and the limitations of the currently available software, we have developed a new flexible analysis toolkit for the exploration of single-particle mass spectrometer data (FATES). To encourage widespread adoption, we purposely designed the toolkit to be extensible so that it can adapt to the ever-evolving and varied implementations of SPMS. Building open-source tools in a standard, well-known platform and creating a work flow with user-defined parameters for data analysis would benefit the SPMS community by increasing the rate of knowledge discovery and enabling collaboration between researchers. In particular, maintenance and alterations of the software should be easily accessible to chemists and aerosol scientists without extensive training in computer science. In addition, any new toolkit should not be explicitly limited to expected common analyses, which may be built into GUIs, but should give the user complete freedom to access, explore, and utilize SPMS data and also integrate with other temporally and spatially resolved data sets. Finally any framework must carefully consider both the memory and speed constraints imposed by the potentially large size of SPMS data sets. Given these constraints, the FATES toolkit (Sultana et al., 2017) was developed completely in the MATLAB environment, and an extensive manual was written and is provided in the Supplement. MATLAB is a popular language for numerical data analysis by scientists because it has an extensive library of well-documented built-in functions, utilizes libraries optimized for speed in matrix manipulation, and can support both graphical and script-based exploration of data. By taking advantage of native MATLAB data types, FATES is easier to maintain and computationally more efficient than YAADA, the previous publicly available MATLAB toolkit for SPMS analysis.
The FATES framework allows users to creatively explore their data without previous assumptions or constraints with simple scripts and by leveraging built-in MATLAB functions. Additionally FATES offers a suite of GUIs for interactive visualizations which can aid both novice and expert users in calibration of raw data; exploration of data sets using temporal, size, and mass spectral filters; and investigations of clustered data sets. FATES is the first publicly available SPMS toolkit to allow creative, efficient script-based data mining along with GUI-based visual data exploration and calibration all within a single programming environment.
FATES is implemented completely in MATLAB. No other languages, drivers, or software are needed to utilize FATES. In addition FATES was purposely developed in a manner that demands few presumptions about the instrument, particle, and spectral variables collected by the SPMS. For example one SPMS may only record the speed and time of detection for each particle, while another SPMS may also record the power of the desorption–ionization laser pulse. These differences are handled easily as FATES allows users to specify, define, and change the instrument, particle, and spectral variables they would like imported into and saved to a study. To make these alterations, users need only modify simple scripts where the desired variables are listed, and these changes are then carried through the entirety of the source code. This flexible but simple design is useful to the SPMS community because it spares users from needing expert knowledge of any language and from having to search for and make line-by-line or structural changes within the source code. Detailed instructions for making these simple modifications are included in the FATES manual (Supplement M-5) and commented within the code. As distributed, the FATES source code already contains the necessary modifications to read in data sets from three SPMS designs: ATOFMS, ALABAMA, and TSI ATOFMS. In addition FATES avoids the explicit creation of new class objects, which reduces the lines of source code and number of scripts by over an order of magnitude when compared to YAADA. This greatly reduces the maintenance needed to keep FATES compatible with future versions of MATLAB. FATES has been tested for compatibility with MATLAB versions 2014b through 2016b.
SPMS data imported into FATES are stored within separate variables for the experiment description, the particle data, and the spectral data. A SPMS data set imported into MATLAB via FATES is referred to as a FATES study, the data architecture of which is comprehensively detailed in the FATES manual (Supplement M-4). Logically, the data mostly consist of one-to-many relationships from study to experiment, experiment to particle, and particle to spectral peaks. The data are most typically loaded once and then accessed and filtered in bulk. Therefore, it is more efficient to organize the observed measurements into denormalized matrices for particle and spectral data, where key information is duplicated in each matrix.
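The denormalized layout described above can be illustrated with a small sketch. It is written in Python rather than MATLAB purely for illustration, and the column order, field names, and helper functions are hypothetical; they are not the variables or functions used by FATES. The point is only that when each peak row repeats its key columns (here an experiment number and a particle number), bulk filtering needs no join between tables.

```python
# Hypothetical, illustrative layout; names and columns are assumptions,
# not the actual FATES study variables.

# Experiment table: one row per experiment.
experiments = [
    {"exp_id": 1, "instrument": "ATOFMS", "location": "site A"},
    {"exp_id": 2, "instrument": "ATOFMS", "location": "site B"},
]

# Particle matrix: each row is (exp_id, particle_no, time_s, size_um).
particles = [
    (1, 1, 10.0, 0.45),
    (1, 2, 11.5, 1.20),
    (2, 1, 20.0, 0.80),
]

# Peak matrix: each row is (exp_id, particle_no, mz, area).
# The key columns are duplicated on every peak row (denormalized).
peaks = [
    (1, 1, 23.0, 1500.0),
    (1, 1, 39.0, 800.0),
    (1, 2, 23.0, 300.0),
    (2, 1, -35.0, 950.0),
    (2, 1, -62.0, 120.0),
]


def peaks_for_experiment(peak_rows, exp_id):
    """Select all peaks from one experiment using only the peak matrix."""
    return [row for row in peak_rows if row[0] == exp_id]


def peaks_for_particle(peak_rows, exp_id, particle_no):
    """Select the peaks of a single particle by its two-part key."""
    return [row for row in peak_rows
            if row[0] == exp_id and row[1] == particle_no]


exp1_peaks = peaks_for_experiment(peaks, 1)
one_particle = peaks_for_particle(peaks, 2, 1)
```

The trade-off is the usual one for denormalized storage: the repeated key columns cost some space, but data that are loaded once and then filtered in bulk can be selected with a single pass over one matrix.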
Each FATES study stores a data structure that contains a number of
user-defined fields (e.g., instrument name, operator, location) to describe
the experiment in which the data within the study were collected. Each row of
the structure describes a unique experiment, which is assigned a unique
experiment identifier (ID). All particle data (e.g., speed, power of
desorption–ionization laser pulse) are stored in a MATLAB matrix. More
specifically, each particle within a FATES study has a unique two-column
particle ID. The first column of the particle ID is the experiment ID,
previously described, to which the particle belongs. This framework allows
users to easily select for particle or spectral data collected during a
specific experiment within a FATES study that contains data from multiple
experiments. The mass spectral data for all particles in the FATES study are
held in an external binary file. Users can easily and quickly retrieve
spectral peak data (e.g.,
Table 2. Comparison of run times for various operations in YAADA and FATES.
Considerable work has been completed to optimize the FATES framework for memory demands, speed, and ease of use. An ATOFMS data set collected at Bodega Bay, CA, in February and March of 2016 is used throughout this paper to illustrate the speed of data analysis within the FATES toolkit. This data set contains 1 386 042 dual-polarity single-particle mass spectra as well as particle data for an additional 11 454 356 particles that were detected in the light-scattering region but did not generate spectra. All FATES analysis is performed in MATLAB 2014b with an Intel Core i7-4930K CPU running at 3.4 GHz with 16.0 GB of RAM. Run time comparisons, summarized in Table 2, are made using the same computer utilizing a version of YAADA, which had been maintained by Kim Prather's research group to be compatible with MATLAB 2013a.
To begin working with a SPMS data set, a new FATES study has to be created (Supplement M-2). This process only needs to occur once for any data set, but the source code was still designed to minimize the time for study initialization. Despite the large size of the Bodega Bay data set, the creation of the FATES study only took 28.4 min. By comparison, initializing a subset of the Bodega Bay data roughly one-tenth the size of the FATES study (127 077 dual-polarity mass spectra) in YAADA still required 20.8 min. Small ALABAMA and TSI ATOFMS data sets were also initialized quickly in FATES (Table 2). Note that the version of YAADA maintained by Kim Prather's research group is not able to import these data sets into MATLAB for comparison. FATES has also been designed so that additional data can be added to an existing study without having to re-initialize the entire data set (Supplement M-A). This is especially useful for field studies, where daily examination of the data is required, but initialization of increasingly large data sets can become onerous and time consuming.
Once a FATES study is initiated, it is crucial to efficiently handle the
spectral data. Users may desire to examine data sets with millions of mass
spectra, and each spectrum can contain hundreds of peaks. SPMS spectra data
formats usually contain mass-to-charge (
In addition, the binary format minimizes both the time required to write and retrieve spectral data and the storage requirements for the file. Retrieving all 1 386 042 dual-polarity mass spectra in a single call from the external binary file created for the Bodega Bay study and loading them into a MATLAB array only took 3.3 min. It is important to note that this example is used for benchmarking purposes; users would rarely need or choose to hold all spectral information for an entire large data set in memory at once. The FATES framework automatically employs data pointers so that the whole binary file does not need to be read if the user is only attempting to retrieve spectra from particles which make up a subset of all the data in the FATES study. Run times for retrieving all of the dual-polarity mass spectra, and for retrieving contiguous subsets (i.e., subsets for which the raw data files used to create the study were contiguous), from the FATES and YAADA studies are summarized in Table 2. Retrieving a subset of 50 000 mass spectra from the FATES study (2.7 s) was over 6 times faster than in the YAADA study (17.3 s). Searching through and sorting data by particle information is also quickly performed in the FATES framework. Because data for all hit particles (those that generated mass spectra) are held in memory, operations querying the particle data do not require any data input/output calls and are therefore nearly instantaneous in MATLAB. For example retrieving the particle IDs for all submicron particles from the Bodega Bay study only took 0.01 s, while performing a similar analysis on the much smaller YAADA study required 0.6 s.
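The idea of using stored pointers to read only part of a binary spectra file can be sketched as follows. This is a minimal Python illustration, not the FATES implementation. The record format (`PEAK`, two 4-byte floats per peak), the in-memory index of byte offsets, and the function names are all assumptions made for the example.

```python
import os
import struct
import tempfile

# Assumed record layout: one peak = (m/z, area) as two little-endian
# 4-byte floats. This is an illustration, not the FATES file format.
PEAK = struct.Struct("<ff")


def write_spectra(path, spectra):
    """Write spectra back to back; return a (byte offset, peak count) index."""
    index = []
    with open(path, "wb") as fh:
        for peak_list in spectra:
            index.append((fh.tell(), len(peak_list)))
            for mz, area in peak_list:
                fh.write(PEAK.pack(mz, area))
    return index


def read_spectra(path, index, wanted):
    """Read only the requested spectra by seeking to their stored offsets."""
    out = []
    with open(path, "rb") as fh:
        for i in wanted:
            offset, count = index[i]
            fh.seek(offset)  # jump straight to the spectrum, skipping the rest
            raw = fh.read(count * PEAK.size)
            out.append([PEAK.unpack_from(raw, k * PEAK.size)
                        for k in range(count)])
    return out


spectra = [
    [(23.0, 1500.0), (39.0, 800.0)],
    [(12.0, 50.0)],
    [(-35.0, 950.0), (-62.0, 120.0), (-97.0, 40.0)],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "spectra.bin")
    index = write_spectra(path, spectra)
    subset = read_spectra(path, index, [2])  # only the third spectrum is read
```

Because the index is small and lives in memory, the cost of retrieving a subset scales with the size of the subset rather than with the size of the whole file.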
The speed of the FATES framework depends partially upon minimizing retrieval calls to external files outside of the MATLAB workspace. Thus the formatting of the data held within the MATLAB workspace has been carefully considered to minimize the memory demands of the FATES framework. Because spectral data are held in an external binary file, users can choose to store spectral data in the study at a high resolution without increasing the study's working memory. When retrieving spectra from the external binary file, users may specify the resolution at which the data are held in the workspace. This feature allows users to tailor the resolution of the spectra in the workspace, and therefore the memory requirements, to the application. Mass spectral data loaded into the MATLAB workspace are stored in a single-precision floating-point format, saving memory compared to the standard MATLAB double-precision format, which requires twice the space. Particle data stored within a FATES study have also been formatted to minimize memory demands. If the user loads data into a FATES study for both detected particles that generated mass spectra (hit) and detected particles that did not generate spectra (missed), only hit particle data are stored in the particle matrices in MATLAB. Most data analyses utilize spectra, and therefore only hit particle information is necessary, but hit particles usually make up a small fraction of total particles detected by the light-scattering region of the SPMS. Therefore storing missed particle data in MATLAB memory would needlessly take up large amounts of space. All missed particle data are written to an external binary file and can be loaded by the user into MATLAB using a script provided in the FATES toolkit. Furthermore particle data stored in MATLAB memory are split between a single-precision and a double-precision matrix. 
It is not necessary to store most data collected for particles (e.g., speed, laser power) in a double-precision format, so this choice further reduces the space required to store all particle data in memory. For example, storing data for 1 million hit particles in memory, where three columns require double-precision format (the two-column particle ID and time) and three variables only need single-precision format (speed, size, laser power), only requires 0.036 GB, which is easily accommodated by most modern desktop computers. Finally because all SPMS data loaded into a FATES study are held in native MATLAB data types, interacting with the data requires very few FATES-specific functions. Almost all common analyses can be patterned off a basic script, provided with demonstration data in the FATES toolkit, that relies on a handful of MATLAB built-in functions and matrix indexing, which makes the FATES framework accessible and powerful for both expert and novice users.
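The 0.036 GB figure can be checked with a short calculation. The sketch below is in Python for illustration only and assumes that the three double-precision columns are the two particle ID columns plus time, that 1 GB is taken as 10^9 bytes, and that no storage overhead is counted. The function name is hypothetical.

```python
from array import array


def particle_memory_gb(n_particles, n_double, n_single):
    """Estimate memory for particle columns split by floating-point precision."""
    double_bytes = array("d").itemsize  # 8 bytes per double-precision value
    single_bytes = array("f").itemsize  # 4 bytes per single-precision value
    total_bytes = n_particles * (n_double * double_bytes
                                 + n_single * single_bytes)
    return total_bytes / 1e9  # GB taken as 10^9 bytes (assumption)


# Split storage: 3 double-precision and 3 single-precision columns.
mixed = particle_memory_gb(1_000_000, n_double=3, n_single=3)

# For comparison: all 6 columns held in double precision.
all_double = particle_memory_gb(1_000_000, n_double=6, n_single=0)
```

Under these assumptions the split layout needs 36 bytes per particle (0.036 GB per million particles), compared with 48 bytes per particle if every column were stored in double precision.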
In this section we provide a brief overview of common analyses that can be
performed on SPMS data within a FATES study. However it should be mentioned
that it is impossible to describe or predict all data analyses and plotting
options easily available to FATES users due to the extensive library of
built-in and user-developed MATLAB functions. A large array of analyses can
be performed using concise code (Supplement M-6), with only a few examples
briefly discussed here. By utilizing logical indexing, particles and spectra
can be filtered using any single or combination of particle and mass
spectral characteristics (e.g., particle size, peak area at a certain
Figure 1. Screen capture of a guiFATES window with data from 46 432 individual particles.
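Filtering by logical indexing, as mentioned in the text, amounts to building one Boolean mask per criterion, combining the masks element-wise, and using the result to select rows. The sketch below shows the pattern in Python for illustration. The column names, values, and thresholds are hypothetical and are not taken from a FATES study. In MATLAB the same pattern is written directly with logical arrays, e.g. `data(mask, :)`.

```python
# Hypothetical per-particle columns, aligned by row (one entry per particle).
size_um = [0.35, 0.80, 1.40, 0.60, 2.10]
peak_area = [1200.0, 0.0, 450.0, 3000.0, 90.0]  # area of one chosen peak

# One Boolean mask per criterion.
submicron = [s < 1.0 for s in size_um]
has_peak = [a > 100.0 for a in peak_area]

# Combine criteria element-wise (logical AND).
selected = [a and b for a, b in zip(submicron, has_peak)]

# Use the mask to pull out matching row numbers and values.
selected_rows = [i for i, keep in enumerate(selected) if keep]
selected_sizes = [s for s, keep in zip(size_um, selected) if keep]
```

Because the masks are plain Boolean arrays aligned with the particle rows, any number of particle or spectral criteria can be combined in the same way.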
While the FATES toolkit allows flexibility in script-based SPMS data analysis, graphical tools can also be an effective way to explore the data and quickly identify trends and patterns. To this end the FATES toolkit includes GUIs, built within MATLAB, which allow users to easily examine trends in spectra based on particle metrics such as size and time, and cluster and spectral characteristics. Figure 1 is a screen capture of the FATES spectra explorer guiFATES, displaying data for 46 432 particles. This spectra explorer has been modeled after ClusterSculptor, a SPMS data analysis GUI developed by Zelenyuk et al. (2008a) that has not been made publicly available. To initiate guiFATES, the user provides the function with the mass spectra, two user-selected particle metrics, and cluster data for a set of particles. A description of the functionality and abilities of guiFATES is given below.
The main panel of the guiFATES display is the heat map of the individual
particle mass spectra. Each row is an individual mass spectrum with peak
intensity indicated by color. The user can choose to display the provided
mass spectra peak intensity utilizing a linear or log10 scale. The
logarithmic scale makes it easier to visually detect relatively small peak
intensities in the spectra, while the linear scale helps users visualize
absolute differences between peak intensities. In Fig. 1 the logarithmic
scale has been selected. Users can choose to provide any two characteristic
particle metrics, such as particle size, time of detection, laser pulse
energy, or total ion intensity, which are displayed in the left panels. In
Fig. 1 particle time and size have been provided. Clustering information is
displayed in the right panel. The cluster or group assigned to each particle
is indicated by the color of the points on the right, while the location on
the
Figure 2. Screen capture of a dendroFATES window showing the cluster tree or dendrogram for 30 input clusters. The cluster contributions to the user-selected node are shown in the plot on the left. The particle data for the selected node are automatically plotted in a guiFATES window (Fig. S1).
guiFATES provides the user with many options for displaying and exploring
the data, and all functionalities are thoroughly detailed in the manual
(Supplement M-7). A check box allows the user to display all data with or
without grouping by cluster. In addition the user can select to sort the
data by any of the particle metrics in the vertical side panels or by a
Figure 3. Screen capture of a scatterFATES window showing the
These visual sorting and filtering methods enable users to efficiently inspect data sets and visually discover mass spectral trends, differences, and similarities both between distinct particle types and within populations of chemically similar particles. Due to the high variability and qualitative nature of single-particle mass spectra generated by laser desorption–ionization techniques, clustering algorithms utilized to group SPMS mass spectra within a data set often do not generate a one-to-one relationship between the number of chemical particle types in the population and spectra clusters generated (e.g., Giorio et al., 2012; Murphy et al., 2003; Rebotier and Prather, 2007; Wenzel and Prather, 2004; Zelenyuk et al., 2006, 2008a). Therefore it is necessary to leverage expert knowledge either to combine multiple spectra clusters, generated algorithmically, into a single chemical particle type or to further split clusters into smaller groups as has been noted in many SPMS studies of unconstrained aerosol populations (e.g., Dall'Osto and Harrison, 2006; Pratt et al., 2009; Qin et al., 2012). The authors emphasize that there is not a consensus on the most suitable algorithms and thresholds for SPMS analysis and suggest users investigate the previously listed references before embarking on mass-spectral-based algorithmic analysis. However, despite the conditions of initial clustering, guiFATES aids this process by allowing users to visualize all clustered particles at once and combine any number of clusters or split any cluster in any location during the data exploration process. Users can choose to output the particle identifiers of any cluster in the guiFATES window to the MATLAB workspace. 
All plotting, sorting, filtering, and grouping applications of guiFATES have been tested on a set of 100 000 particles with dual-polarity mass spectra, and at this size all updates to the displayed plots occurred nearly instantaneously, making guiFATES an appropriate and efficient tool for the large data sets common to SPMS analysis.
The advantages and benefits of this general method of data visualization and exploration for refining particle clusters have been discussed at length previously (Zelenyuk et al., 2008) and with the publication of FATES will be available to the SPMS community at large. A specific detail of note is that Zelenyuk et al. (2008) demonstrate that discontinuities in the particle cluster size distributions were characteristic of misclassifications of their mass spectra. Because this technique is not dependent on specific ion markers, it has the potential to be effective for a broad range of particle types but is yet to be extensively explored. guiFATES also enables future investigations of the extension of this cluster-discriminating technique to other common particle metrics, such as total ion intensity. Finally many studies have examined the influences of particle and experimental characteristics on the mass spectra generated from particles of uniform composition (e.g., Neubauer et al., 1998; Reinard and Johnston, 2008; Steele et al., 2003; Zelenyuk et al., 2008b). guiFATES can also be utilized in the exploration of these data sets consisting of a single particle type, where algorithmic grouping of particles utilizing mass spectra is unnecessary or even inappropriate.
Figure 4. Screenshot of a calibFATES window displaying a single-particle uncalibrated mass spectrum. Calibration data are input and displayed on the right, and particle size and time are displayed on the bottom.
FATES also includes two supplementary GUIs which allow the users to graphically select the particles to feed into the guiFATES spectra explorer. dendroFATES is a GUI where the user supplies the clusters and representative cluster mass spectra output from any clustering algorithm of the user's choice. The clusters are then automatically grouped into a cluster tree by a hierarchical analysis performed within MATLAB which is displayed in the dendroFATES GUI window. Hierarchical analyses have been utilized previously with SPMS data sets (Giorio et al., 2012; Hinz et al., 2006; Murphy et al., 2003; Rebotier and Prather, 2007; Zelenyuk et al., 2006), but a brief description is given here. The dendrogram links clusters in a binary fashion, creating new groups which are then further linked. Lower linkage heights indicate a higher degree of similarity between groups, and large distances between levels in the dendrogram are indicative of natural divisions in the data set. Figure 2 is a screenshot of the dendroFATES window with a dendrogram generated from the 30 most populous clusters generated using the ART-2a algorithm to cluster a subset of 166 666 particles from the Bodega Bay data set. Zooming in and out of the dendrogram is handled by MATLAB's native graphics functionality and makes it possible to supply dendroFATES with hundreds of clusters and still explore the cluster tree quickly and intuitively. Because the dendrogram allows the user to easily visualize similarities and natural groupings of clusters generated, it is an excellent tool to select clusters for further exploration of the particle and spectral data using the guiFATES tool. Clicking linkages in dendroFATES automatically opens a guiFATES window displaying all particles belonging to the selected node. 
When a linkage is selected, the fractional cluster contribution to the selected node is displayed on the right in the dendroFATES window, and the fraction of the total population belonging to the selected node is also displayed in text. Figure S2 illustrates the guiFATES window generated with the node selection made in Fig. 2 when the user chooses to display particles by their cluster label (Fig. S2a) or grouped by the left and right branch (Fig. S2b). As illustrated in Fig. S2a, when guiFATES is populated by dendroFATES, the clusters are displayed in the same order as displayed in the dendrogram. Therefore very similar clusters are adjacent in the guiFATES window, assisting intuitive visual comparisons and combinations of data. Because all FATES GUIs are in MATLAB and the user can also access the data programmatically, it is straightforward and fast for the user to iteratively select clusters from the dendrogram in dendroFATES, refine them in guiFATES, output new clusters to the workspace, and feed the new cluster results back into dendroFATES until the user is satisfied with the grouping of the data set.
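The binary linking of clusters into a tree can be illustrated with a small agglomerative sketch. The source does not state which distance metric or linkage rule dendroFATES uses, so the Euclidean distance and average linkage below are assumptions chosen for the example, as are the toy representative spectra and the function names. The code is Python for illustration and is not the FATES implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two representative spectra."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(centroids):
    """Merge the two closest groups repeatedly; return the list of merges.

    Each merge is (members_a, members_b, height), where height is the mean
    pairwise distance between the two groups (average linkage, an assumption).
    """
    groups = [(i,) for i in range(len(centroids))]
    merges = []
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                dists = [euclidean(centroids[a], centroids[b])
                         for a in groups[i] for b in groups[j]]
                height = sum(dists) / len(dists)
                if best is None or height < best[0]:
                    best = (height, i, j)
        height, i, j = best
        merges.append((groups[i], groups[j], height))
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return merges


# Toy representative spectra for four input clusters (three m/z bins each).
centroids = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.2, 0.8],
]
merges = average_linkage(centroids)
```

In this toy case clusters 0 and 1 link first at a low height, clusters 2 and 3 link next, and the two resulting groups join last at a much greater height. That large gap between levels is the kind of natural division the text describes.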
The complexity of SPMS data sets means there are numerous relationships that
could be explored, and predicting all desired comparisons is impossible.
scatterFATES is another GUI used to populate guiFATES with user-selected
particles. However, rather than grouping particles via clusters as in
dendroFATES, scatterFATES creates a scatterplot of particles using any two
particle data metrics the user supplies as the axes. The points are then
color-coded by cluster or group. Figure 3 is an example scatterFATES window,
where the
FATES has been designed so that all aspects and functionalities of SPMS data
analysis and exploration are contained within a single programming
environment and language. To this end we developed calibFATES, a GUI to
quickly scan through raw spectra data files before importation into FATES
and generate calibrations to convert raw time-of-flight spectra to
mass-to-charge spectra. calibFATES allows SPMS users to visually examine
generated spectra on the fly, without any time-consuming processing and
even during data acquisition, to ensure the quality and consistency of the
data being acquired. While calibFATES is currently written to be able to
read the raw spectra files generated by the ATOFMS and TSI ATOFMS, it could
be easily modified to read in any raw spectra file (Supplement M-B). Figure 4 is a screenshot of a calibFATES window displaying a single uncalibrated
raw spectrum. Users can scan through and display spectra contained in any
raw spectra files within the folder. A calibration can be generated by
setting selected times to entered
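Converting time of flight to mass-to-charge can be sketched with the ideal time-of-flight relation, in which flight time is linear in the square root of mass-to-charge: t = a·sqrt(m/z) + t0. The Python example below solves for a and t0 from two calibration points and then converts times to m/z. It illustrates the general principle only. The functional form actually used by calibFATES is not given in the text, and the function names and calibration values here are hypothetical.

```python
import math


def fit_two_point(t1, mz1, t2, mz2):
    """Solve t = a * sqrt(m/z) + t0 exactly from two (time, m/z) pairs."""
    a = (t2 - t1) / (math.sqrt(mz2) - math.sqrt(mz1))
    t0 = t1 - a * math.sqrt(mz1)
    return a, t0


def time_to_mz(t, a, t0):
    """Convert a flight time to m/z using the fitted parameters."""
    return ((t - t0) / a) ** 2


# Hypothetical calibration: two peaks at made-up flight times (arbitrary
# units) are assigned to m/z 23 and m/z 39.
t_first, t_second = 5300.0, 6750.0
a, t0 = fit_two_point(t_first, 23.0, t_second, 39.0)

mz_first = time_to_mz(t_first, a, t0)
mz_second = time_to_mz(t_second, a, t0)
mz_between = time_to_mz(6000.0, a, t0)  # a time between the two peaks
```

With more than two calibration points, a least-squares fit of t against sqrt(m/z) would replace the exact two-point solution.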
FATES is the first software package for SPMS data sets to include flexible script-based data analysis and graphical user interfaces for data exploration integrated within a single programming language. Because FATES is designed to be easily extensible to diverse input data formats and implemented completely in MATLAB, a highly documented language popular among scientists, it should be accessible and employable across the SPMS community despite the many independent instrumental designs. SPMS data importation and programmatic and graphical data analyses can be performed quickly in FATES even for large data sets thanks to both speed and memory optimizations and utilization of native MATLAB data types and built-in functions. Within a FATES study data are structured so that complex analyses can be performed using concise code with little reliance on FATES-specific functions. In addition a set of GUIs with many display, sorting, filtering, and grouping functionalities has been developed to assist both expert and novice users in intuitively visualizing a complex SPMS data set and creating robust particle groupings. For these reasons we believe FATES will greatly improve the efficiency of data processing and knowledge discovery from SPMS data sets.
The FATES software package (v1.0.0), an extensive manual, and an example data set are
available online at
The authors declare that they have no conflict of interest.
This work was funded by the National Science Foundation through the Center for Aerosol Impacts on Climate and the Environment (CHE 1305427). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Edited by: G. Phillips
Reviewed by: two anonymous referees