Artificial intelligence, machine learning, and data science are gaining popularity throughout the geosciences, but geoscience education has not kept up with these trends. Credit: Gerd Altmann, Pixabay License

Artificial intelligence (AI), machine learning (ML), and data science provide flexible, scalable, and interpretable approaches to harness the growing volume of available data that can help us improve the understanding and prediction of a wide variety of geoscience phenomena, including natural hazards, climate change, and severe weather events. As such, AI/ML and data science are gaining popularity throughout the geosciences. However, geoscience education has not kept up with this trend, leaving students and researchers with knowledge gaps that hinder their ability to innovate and grow through the development of new approaches to and applications of their research. To bridge these gaps, we need to train a new generation of data scientists who are prepared to address the unique needs of geoscience data and related phenomena.

AI/ML methods are domain agnostic and lack inherent physics-based understanding of natural processes. This characteristic of AI/ML methods can be advantageous in some situations, but applying AI/ML to geoscience phenomena and problems requires deep knowledge of the physics involved. And although superficial training may enable a researcher to select existing AI/ML methods that could be useful in their work, creating new methods that can transform scientific understanding requires users to know the underlying characteristics of their data and their methods.

Physical data scientists thus need holistic preparation, including foundational training in their respective disciplines (atmospheric science, oceanography, geoscience, etc.) as well as in AI/ML, that will allow them to work and innovate with increasingly large and complex data sets.

Beyond Basic Mathematics and Programming

A strong grounding in mathematics is fundamental to understanding foundational geoscience processes as well as the principles of computer science. Some math courses already feature in geoscience programs nationwide. The background offered by these courses might be sufficient to develop a superficial understanding of AI/ML methods. However, innovating requires a deeper level of knowledge of the mathematics underlying AI/ML.

For example, adding physics-based constraints to existing AI/ML methods requires understanding partial derivatives and how changing loss functions affects machine learning. Thus, for the next generation of graduates, we must expand core training in mathematics, up to and including courses on partial differential equations and statistics.
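As a rough illustration of why partial derivatives and loss functions matter here, the following minimal sketch fits a straight line by gradient descent and adds a penalty term to the loss whenever the model predicts a negative value for a quantity assumed to be nonnegative (precipitation amount, say). The data, the penalty form, the weight `lam`, and all function names are hypothetical choices made for this example, not a method prescribed by the article.

```python
def predict(w, b, x):
    """Linear model: predicted value for input x."""
    return w * x + b


def loss_and_grads(w, b, xs, ys, lam):
    """Mean squared error plus lam * (mean squared negative prediction).

    Returns the loss and its partial derivatives with respect to w and b.
    The penalty is an assumed stand-in for a physical constraint
    (the predicted quantity should not be negative).
    """
    n = len(xs)
    loss = dw = db = 0.0
    for x, y in zip(xs, ys):
        p = predict(w, b, x)
        err = p - y
        loss += err * err / n
        dw += 2.0 * err * x / n      # d(err^2)/dw
        db += 2.0 * err / n          # d(err^2)/db
        if p < 0.0:                  # constraint violated: add penalty
            loss += lam * p * p / n
            dw += lam * 2.0 * p * x / n
            db += lam * 2.0 * p / n
    return loss, dw, db


def fit(xs, ys, lam, lr=0.01, steps=5000):
    """Plain gradient descent on the (optionally penalized) loss."""
    w = b = 0.0
    for _ in range(steps):
        _, dw, db = loss_and_grads(w, b, xs, ys, lam)
        w -= lr * dw
        b -= lr * db
    return w, b


# Hypothetical data: the observed quantity is zero until x exceeds 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.0, 0.0, 2.0, 4.0]

w_free, b_free = fit(xs, ys, lam=0.0)    # ordinary least squares
w_phys, b_phys = fit(xs, ys, lam=10.0)   # with the nonnegativity penalty

print("unconstrained prediction at x=0:", predict(w_free, b_free, 0.0))
print("penalized prediction at x=0:   ", predict(w_phys, b_phys, 0.0))
```

Changing `lam` changes the loss surface, and therefore the gradients that drive learning; the penalized fit trades a little accuracy elsewhere for a prediction at `x = 0` that is much closer to the physically allowed range.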

Many geoscience disciplines have added introductory computer programming to their curricula, but physical data scientists will also need training in computer science fundamentals such as efficient data structures, parallel programming, and high-performance computing to implement and test ideas using AI/ML. Introductory courses typically teach students about concepts like variables and simple loops and functions but not about more advanced concepts such as object-oriented programming and data structures. We advocate that understanding how to use and create data structures like trees, hash maps, and sets is critical for efficiently and reproducibly handling large geoscience data sets.
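To make the point about data structures concrete, here is a small sketch that uses a hash map (a Python `dict`) to group observations by station and sets to find the stations shared by two networks. The station identifiers, the temperature values, and the function names are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical observations: (station_id, hour, temperature in deg C).
observations = [
    ("STN_A", 0, 21.5),
    ("STN_B", 0, 22.1),
    ("STN_A", 1, 20.9),
    ("STN_C", 0, 23.4),
    ("STN_B", 1, 21.7),
]


def index_by_station(obs):
    """Build a hash map from station id to its list of (hour, temp) pairs.

    One pass over the data; later lookups by station take roughly
    constant time instead of rescanning every record.
    """
    index = defaultdict(list)
    for station, hour, temp in obs:
        index[station].append((hour, temp))
    return index


def mean_temperature(index, station):
    """Mean temperature for one station, or None if it has no records."""
    records = index.get(station, [])
    if not records:
        return None
    return sum(temp for _, temp in records) / len(records)


index = index_by_station(observations)

# Sets make membership tests and intersections cheap and explicit.
network_a = set(index)                       # stations seen in the data
network_b = {"STN_A", "STN_C", "STN_D"}      # a second, hypothetical network
shared = network_a & network_b

print(mean_temperature(index, "STN_A"))
print(sorted(shared))
```

With only five records the benefit is invisible, but the same pattern applies unchanged to millions of observations, where repeated linear scans through a flat list become the bottleneck.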

Efficient high-performance computing also requires a grasp of how modern supercomputers work, including graphics processing units (GPUs). GPUs first came into widespread use in arcade games and were later used for desktop gaming, but their ability to apply the same operation in parallel across many data elements has greatly expanded their range of applications. Used appropriately, GPUs can provide orders-of-magnitude faster processing of multidimensional geoscience data. Many data libraries already use GPUs, although creating specialized code (which often requires linear algebra) may be required to use these processors for novel purposes.

Foundational Training in AI/ML and Data Science

Ensuring that the next generation of physical data scientists is well prepared may require training that diverges from traditional approaches to training scientists in data science and AI/ML research.

Within the geosciences, AI, ML, statistics, and data science all overlap and are therefore often conflated or misunderstood, so it helps to clarify their meanings. We define AI to include all categories of methods that act intelligently to solve problems (Figure 1). Such methods include intelligent search techniques such as A*, a pathfinding method that underlies route planning in many map search apps. They also include multiagent systems that enable AI methods to coordinate actions among diverse agents, such as teams of humans and robots completing a search and rescue operation.
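For readers unfamiliar with A*, the following minimal sketch finds the shortest route between two cells of a small grid in which some cells are blocked. The grid, the unit step cost, and the Manhattan-distance heuristic are assumptions chosen to keep the example short; production route planners are considerably more elaborate.

```python
import heapq


def a_star(grid, start, goal):
    """Return the number of steps on a shortest 4-connected path, or None.

    grid[r][c] == 0 means the cell is open; 1 means it is blocked.
    The heuristic is Manhattan distance, which never overestimates the
    remaining cost on this kind of grid, so the result is optimal.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    # Priority queue entries: (cost so far + heuristic, cost so far, cell).
    frontier = [(heuristic(start), 0, start)]
    best_cost = {start: 0}

    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best_cost[cell]:
            continue  # stale entry; a cheaper route to this cell was found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    heapq.heappush(
                        frontier,
                        (new_cost + heuristic((nr, nc)), new_cost, (nr, nc)),
                    )
    return None  # goal is unreachable


# Hypothetical map: the only opening in the second row is at column 2.
grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]

steps = a_star(grid, (0, 0), (3, 0))
print(steps)
```

The heuristic is what distinguishes A* from an uninformed search: it steers exploration toward the goal, so fewer cells need to be examined before the shortest route is confirmed.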

Fig. 1. Artificial intelligence is a broad field that encompasses and overlaps with other fields, including data science, machine learning, and statistics. Tools from these fields are applied in a wide variety of scientific endeavors.

Within AI, ML focuses on models that adapt over time, given experience or data. ML methods draw in part on traditional statistical methods, like regression or Kullback-Leibler divergence, and are thus not entirely independent of statistics. New techniques involve hybrid statistical and ML approaches. Data science methods, for their part, tend to focus on analysis of big data as well as on data management, and they draw from AI, ML, and statistical methods. Deep learning is a type of ML focusing on the use of specialized neural networks, and it is currently one of the most popular ML methods in use in the geosciences.
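The Kullback-Leibler divergence mentioned above is a good example of a statistical quantity that reappears throughout ML, often as a loss function. A minimal sketch for two discrete probability distributions follows; the function name and the example distributions (imagined category frequencies for observed versus forecast conditions) are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats for discrete distributions.

    p and q are sequences of probabilities over the same categories.
    Terms with p_i == 0 contribute nothing; if q_i == 0 where p_i > 0,
    the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of categories")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            if q_i == 0.0:
                return math.inf
            total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical category frequencies (for example, dry / light / heavy rain).
observed = [0.6, 0.3, 0.1]
forecast = [0.5, 0.3, 0.2]

print(kl_divergence(observed, forecast))
print(kl_divergence(forecast, observed))  # generally differs: not symmetric
```

The divergence is zero only when the two distributions match and is not symmetric in its arguments, which is one reason it helps to understand the statistics before adopting it as an ML objective.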

Traditional training for data scientists typically includes separate classes on each of the topics mentioned above. We suggest that training for physical data scientists could instead focus on foundational methods relevant to all these topics, with strong and synergistic involvement of mathematics and computer science. This approach would be more efficient and likely require a shorter series of classes.

For example, instead of taking a longer series of disciplinary-focused classes in AI/ML, we propose an interdisciplinary three-course sequence. This sequence would cover the mathematical foundations of ML methods while focusing on applications to facilitate understanding of which methods are best suited for which types of phenomena. The course sequence would include a class on more traditional ML methods, an advanced class focused on deep learning, and a class that brings together methods from data science and statistics to facilitate efficient exploration of and experimentation with large data sets, including empirical and statistical analysis and validation of AI/ML methods as applied to different scientific domains.

Workforce Development and Diversity

In addition to reshaping curricula for student education as AI/ML methods gain popularity, it is critically important that we also provide existing geoscience researchers, forecasters, and practitioners with avenues for continuing education and development. Given career constraints, expecting current members of the workforce to attend multiple semester-long classes, as degree-seeking students do, is not realistic. Hence, other paths must be developed.

Several efficient concepts for retraining working atmospheric scientists already exist and might serve as models for similar programs related to AI/ML and data science. These concepts include the following:

Summer schools are intended to get people up to speed quickly on a broad topic, but with less depth. Because of COVID-19, the National Center for Atmospheric Research (NCAR) adapted its traditional in-person summer school format for an online audience, expanding attendance in summer 2020 to more than 2,000 people. In July, NCAR and the National Science Foundation–funded AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography (AI2ES) ran a 4-day joint summer school on developing trustworthy AI for environmental sciences that included lectures, tutorials, and group discussions. The summer school was offered live and was also recorded, so people can access it asynchronously at any time, which should increase its impact.

Short courses cover more focused topics in depth. For example, the Cooperative Institute for Research in the Atmosphere taught a short course on using AI/ML in weather and climate research. AI2ES taught a short course on explainable AI and will be teaching additional courses over the next few years. As with the recent summer school, these short courses were held live, albeit with smaller audiences to facilitate participant interaction, as well as recorded and later provided online to the public. Sample AI/ML codes provided by instructors have proven critical in the success of these courses—with the codes, participants are able to see quickly how the methods work and apply them to phenomena and problems in their own domains.

Tutorials are generally full- or half-day events at which researchers jump into a topic while attending a larger conference. The American Meteorological Society (AMS) AI Committee, for example, has been providing in-depth tutorials on AI for weather research at the AMS annual meeting for several years.

Full-length online courses developed and taught through AI2ES are offered to university students, like traditional semester-long, in-person courses, and are also being provided online for free. Rather than signing up for the full course, members of the workforce can view specific modules as needed and at their own pace, an approach that facilitates targeted and efficient retraining.

Community college certificates in AI are a recent development. Del Mar College in Corpus Christi, Texas, a partner in AI2ES, has developed one of the first such community college certificates in AI for environmental sciences. This five-course sequence, and others like it, could be used for workforce retraining as well as for broadening participation in the geosciences.

We emphasize that such efforts to strengthen and streamline workforce retraining should be broadly available to everyone in the geoscience community to help improve the diversity of the workforce.

Research has demonstrated that women and those in marginalized communities become more interested in science, technology, engineering, and mathematics (STEM) fields if they can see the real-world applications of the work. However, classes focused on foundations of programming, or even about specific AI/ML methods, rarely offer opportunities to appreciate tangible applications. Ensuring that the training for future physical data scientists includes relevant and frequent demonstrations of the applicability of foundational computer science and mathematical principles may thus improve diversity both in the geosciences and in computer science, which could also improve innovation.

Evolving and Adapting Instruction

The need to evolve and innovate in training and education reflects trends in the broader research community. For example, the recently released European Centre for Medium-Range Weather Forecasts 10-year road map for AI/ML states, “We anticipate that it will be increasingly difficult to distinguish between scientists working on machine learning and domain scientists in the future.”

Quickly adapting training and education to leverage new and emerging technologies has traditionally not been a strength of academic communities. Yet with the rapid growth of AI/ML and data science methods and with the range of pressing geoscience questions to which they can be applied, we argue that it is worth the time and investment to recast instructional approaches to train students in physical data science and to better prepare them for the workforce of the future.

Author Information

Amy McGovern (amcgovern@ou.edu), School of Computer Science and School of Meteorology, University of Oklahoma, Norman; also at NSF AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography; and John Allen, Department of Earth and Atmospheric Sciences and Earth and Ecosystem Science Ph.D. Program, Central Michigan University, Mount Pleasant

Citation: McGovern, A., and J. Allen (2021), Training the next generation of physical data scientists, Eos, 102, https://doi.org/10.1029/2021EO210536. Published on 6 October 2021.
This article does not represent the opinion of AGU, Eos, or any of its affiliates. It is solely the opinion of the author.
Text © 2021. The authors. CC BY-NC-ND 3.0
Except where otherwise noted, images are subject to copyright. Any reuse without express permission from the copyright owner is prohibited.