In recent years, machine learning and pattern recognition methods have become common in Earth and space sciences. This is especially true for remote sensing applications, which often rely on massive archives of noisy data and so are well suited to such artificial intelligence (AI) techniques.
As the data science revolution matures, we can assess its impact on specific research disciplines. We focus here on imaging spectroscopy, also known as hyperspectral imaging, as a data-centric remote sensing discipline expected to benefit from machine learning. Imaging spectroscopy involves collecting spectral data from airborne and satellite sensors at hundreds of electromagnetic wavelengths for each pixel in the sensors’ viewing area.
Since the introduction of imaging spectrometers in the early 1980s, their numbers and sophistication have grown dramatically, and their application has expanded across diverse topics in Earth, space, and laboratory sciences. They have, for example, surveyed greenhouse gas emitters across California [Duren et al., 2019], found water on the moon [Pieters et al., 2009], and mapped the tree chemistry of the Peruvian Amazon [Asner et al., 2017]. The data sets involved are large and complex. And a new generation of orbital instruments, slated for launch in coming years, will provide global coverage with far larger archives. Missions featuring these instruments include NASA’s Earth Surface Mineral Dust Source Investigation (EMIT) [Green et al., 2020] and Surface Biology and Geology investigation [National Academies of Sciences, Engineering, and Medicine, 2019].
Researchers have introduced modern signal processing and machine learning concepts to imaging spectroscopy analysis, with potential benefits for numerous areas of geoscience research. But to what extent has this potential been realized? To help answer this question, we assessed whether the growth in signal processing and pattern recognition research, indicated by an increasing number of peer-reviewed technical articles, has produced a commensurate impact on science investigations using imaging spectroscopy.
Mining for Data
Following an established method, we surveyed all articles cataloged in the Web of Science [Harzing and Alakangas, 2016] since 1976 with titles or abstracts containing the term “imaging spectroscopy” or “hyperspectral.” Then, using a modular clustering approach [Waltman et al., 2010], we identified clustered bibliographic communities among the 13,850 connected articles within the citation network.
We found that these articles fall into several independent and self-citing groups (Figure 1): optics and medicine, food and agriculture, machine learning, signal processing, terrestrial Earth science, aquatic Earth science, astrophysics, heliophysics, and planetary science. The articles in two of these nine groups (signal processing and machine learning) make up a distinct cluster of methodological research investigating how signal processing and machine learning can be used with imaging spectroscopy, and those in the other seven involve research using imaging spectroscopy to address questions in applied sciences. The volume of research has increased recently in all of these groups, especially those in the methods cluster (Figure 2). Nevertheless, these methods articles have seldom been cited by the applied sciences papers, drawing more than 96% of their citations internally but no more than 2% from any applied science group.
The siloing is even stronger among published research in high-ranked scholarly journals, defined as having h-indices among the 20 highest in the 2020 public Google Scholar ranking. Fewer than 40% of the articles in our survey came from the clinical, Earth, and space science fields noted above, yet these fields produced all of the publications in top-ranked journals. We did not find a single instance in which one of those papers in a high-impact journal cited a paper from the methods cluster.
A Dramatic Disconnect
From our analysis, we conclude that the recent boom in machine learning and signal processing research has not yet made a commensurate impact on the use of imaging spectroscopy in applied sciences.
A lack of citations does not necessarily imply a lack of influence. For instance, an Earth science paper that borrows techniques published in a machine learning paper may cite that manuscript once, whereas later studies applying the techniques may cite the science paper rather than the progenitor. Nonetheless, it is clear that despite constituting a large fraction of the research volume having to do with imaging spectroscopy for more than half a decade, research focused on machine learning and signal processing methods is nearly absent from high-impact science discoveries. This absence suggests a dramatic disconnect between science investigations and pure methodological research.
Research communities focused on improving the use of signal processing and machine learning with imaging spectroscopy have produced thousands of manuscripts through person-centuries of effort. How can we improve the science impact of these efforts?
Lowering Barriers to Entry
We have two main recommendations. The first is technical. The methodology-science disconnect is symptomatic of high barriers to entry for data science researchers to engage applied science questions.
Imaging spectroscopy data are still expensive to acquire, challenging to use, and regional in scale. Most top-ranked journal publications are written by career experts who plan and conduct specific acquisition campaigns and then perform each stage of the collection and analysis. This effort requires a chain of specialized steps involving instrument calibration, removal of atmospheric interference, and interpretation of reflectance spectra, all of which are challenging for nonexperts. These analyses often require expensive and complex software, raising obstacles for nonexpert researchers to engage cutting-edge geoscience problems.
In contrast, a large fraction of methodological research related to hyperspectral imaging focuses on packaged, publicly available benchmark scenes such as the Indian Pines [Baumgardner et al., 2015] or the University of Pavia [Dell’Acqua et al., 2004]. These benchmark scenes reduce multifaceted real-world measurement challenges to simplified classification tasks, creating well-defined problems with debatable relevance to pressing science questions.
Not all remote sensing disciplines have this disconnect. Hyperspectral imaging, involving hundreds of spectral channels, contrasts with multiband remote sensing, which generally involves only 3 to 10 channels and is far more commonly used. Multiband remote sensing instruments have regular global coverage, producing familiar image-like reflectance data. Although multiband instruments cannot measure the same wide range of phenomena as hyperspectral imagers, the maturity and extent of their data products democratize their use to address novel science questions.
We support efforts to similarly democratize imaging spectrometer data by improving and disseminating core data products, making pertinent science data more accessible to machine learning researchers. Open spectral libraries like SPECCHIO and EcoSIS exemplify this trend, as do the commitments by missions such as PRISMA, EnMAP, and EMIT to distribute reflectance data for each acquisition.
In the longer term, global imaging spectroscopy missions can increase data usage by providing data in a format that is user-friendly and ready to analyze. We also support open-source visualization and high-quality corrections for atmospheric effects to make existing hyperspectral data sets more accessible to nonexperts, thereby strengthening connections among methodological and application-based research communities. Recent efforts in this area include open source packages like the EnMAP-Box, HyTools, ISOFIT, and ImgSPEC.
Expanding the Envelope
Our second recommendation is cultural. Many of today’s most compelling science questions live at the limits of detectability—for example, in the first data acquisition over a new target, in a signal close to the noise, or in a relationship struggling for statistical significance. The papers in the planetary science cluster from our survey are exemplary in this respect, with many focusing on first observations of novel environments and achieving the best high-impact publication rate of any group. In contrast, a lot of methodological work makes use of standardized, well-understood benchmark data sets. Although benchmarks can help to coordinate research around key challenge areas, they should be connected to pertinent science questions.
Journal editors should encourage submission of manuscripts reporting research about specific, new, and compelling science problems of interest while also being more skeptical of incremental improvements in generic classification, regression, or unmixing algorithms. Science investigators in turn should partner with data scientists to pursue challenging (bio)geophysical investigations, thus broadening their technical tool kits and pushing the limits of what can be measured remotely.
Machine learning will play a central role in the next decade of imaging spectroscopy research, but its potential in the geosciences will be realized only through engagement with specific and pressing investigations. There is reason for optimism: The next generation of orbiting imaging spectrometer missions promises global coverage commensurate with existing imagers. We foresee a future in which, with judicious help from data science, imaging spectroscopy becomes as pervasive as multiband remote sensing is today.
The research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with NASA (80NM0018D0004). Copyright 2021. California Institute of Technology. Government sponsorship acknowledged.