Research Areas in "Geo Data Science"
We focus on eight Methodological Research Areas in data science as well as four Research Fields in earth sciences.
Methodological Research Areas
One of the central challenges for the geosciences in the 21st century arises from the necessity to quantify uncertainty in any scientific conclusions we draw from computer simulations and available observational data. Uncertainty quantification relies on a probabilistic interpretation of all involved quantities and requires advanced mathematical tools from statistics, the theory of stochastic processes, scientific computing, and machine learning.
At the same time, it is important that these tools are considered against the background of specific applications from the geosciences, so that they yield computationally tractable and robust algorithms and analytic techniques. Research areas within uncertainty quantification with particular relevance to the geosciences include data assimilation, stochastic parametrizations, multi-scale analysis, model reduction, sensitivity analysis, stochastic Galerkin and non-intrusive approximation methods, statistical inverse problems, and optimization techniques.
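As a rough illustration of the non-intrusive idea mentioned above, the following minimal sketch propagates one uncertain parameter through a forward model by plain Monte Carlo sampling. The decay model, the parameter values, and all function names are assumptions chosen for illustration only.

```python
import math
import random
import statistics


def toy_model(k: float, t: float = 2.0) -> float:
    """Hypothetical forward model: exponential decay exp(-k * t)."""
    return math.exp(-k * t)


def monte_carlo_uq(n_samples: int = 10_000, k_mean: float = 0.5,
                   k_std: float = 0.1, seed: int = 0):
    """Sample the uncertain rate k from a Gaussian (an assumption) and
    summarise the resulting distribution of model outputs."""
    rng = random.Random(seed)
    # The model is treated as a black box: it is only evaluated, never modified.
    outputs = sorted(toy_model(rng.gauss(k_mean, k_std)) for _ in range(n_samples))
    mean = statistics.fmean(outputs)
    std = statistics.stdev(outputs)
    # Empirical 95% interval read off the sorted samples.
    lower = outputs[int(0.025 * n_samples)]
    upper = outputs[int(0.975 * n_samples) - 1]
    return mean, std, (lower, upper)


if __name__ == "__main__":
    mean, std, interval = monte_carlo_uq()
    print(f"mean={mean:.4f} std={std:.4f} 95% interval={interval}")
```

Plain sampling is only the simplest option; for expensive geoscientific models, the other techniques listed above (model reduction, stochastic Galerkin methods) aim to reach similar summaries with far fewer model evaluations.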
The evolution of complex geo-systems generally involves interactions of many individual subprocesses and mechanisms across a range of length and time scales. A substantial portfolio of methods and techniques from modern applied mathematics is available today to support an in-depth understanding of such systems. It includes homogenization techniques, asymptotic multiscale analysis, and stochastics. Possible targets of research in this area may come from atmosphere-ocean science, climate research, and geophysics as much as from climate impact research dealing with links between society and the environment.
Given the sometimes overwhelming complexity of systems of interest in the geo- and climate sciences, computational simulation is often the most viable approach to developing modelling and prediction capabilities. Scientifically sound computation-based research calls for solid knowledge of numerical mathematics and related computer science techniques, combined with a good understanding of the applied geo-science issues and language(s).
With the ever-increasing precision of measuring devices and the multiplication of data sources come increased expectations about the amount of information that scientists will be able to extract from the mass of collected data; the geosciences find themselves at the forefront of that trend. While fundamental statistical questions such as separating signal from noise are more current than ever, they come with significant new challenges due to the huge size of these new data, both in dimensionality (number of variables) and in sample size (number of observations). Traditional approaches face severe limitations due both to intrinsic statistical limits (the so-called "curse of dimensionality") and to computational bottlenecks. Modern statistical methods try to tackle these issues simultaneously by introducing innovative points of view and techniques, which should be developed further to meet the needs rooted in specific geoscience applications. Particularly relevant themes include dimensionality reduction, approximations that spare computational resources, inference of low-dimensional structures (such as manifolds and sparse or low-rank representations), statistics of networks, variable and model selection, resampling methods, iterative methods, and regularization methods.
Data Integration is the process of unifying datasets from disparate sources. The goal is to build automated systems that combine datasets efficiently into a complete, correct, and useful target schema. Typical steps in the data integration pipeline are data extraction, data cleaning, data transformation, and entity resolution. Building appropriate wrapper-mediator architectures and designing efficient data transformation and cleaning algorithms are key challenges in data integration. In the realm of geosciences, data collected from dynamic geophysical remote sensors and other static data sources have to be appropriately integrated before the actual knowledge generation and data analytics can take place.
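The pipeline steps named above can be sketched on a toy example. The two source schemas, the field names, and the matching rule (same cleaned name and coordinates within a tolerance) are all hypothetical; real entity resolution is usually considerably more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Station:
    """Target schema shared by all sources (hypothetical)."""
    name: str
    lat: float
    lon: float
    source: str


def clean_name(raw: str) -> str:
    """Cleaning: trim, lower-case, and collapse internal whitespace."""
    return " ".join(raw.strip().lower().split())


def from_source_a(row: dict) -> Station:
    """Wrapper for source A, which stores lat/lon as separate strings."""
    return Station(clean_name(row["station"]), float(row["lat"]),
                   float(row["lon"]), "A")


def from_source_b(row: dict) -> Station:
    """Wrapper for source B, which stores coordinates as 'lat,lon'."""
    lat, lon = (float(x) for x in row["coords"].split(","))
    return Station(clean_name(row["NAME"]), lat, lon, "B")


def same_entity(a: Station, b: Station, tol_deg: float = 0.01) -> bool:
    """Entity resolution rule: equal cleaned names and nearby coordinates."""
    return (a.name == b.name
            and abs(a.lat - b.lat) <= tol_deg
            and abs(a.lon - b.lon) <= tol_deg)


def integrate(rows_a: list, rows_b: list) -> list:
    """Transform both sources to the target schema and drop duplicates."""
    merged = []
    candidates = [from_source_a(r) for r in rows_a] + [from_source_b(r) for r in rows_b]
    for station in candidates:
        if not any(same_entity(station, kept) for kept in merged):
            merged.append(station)
    return merged


if __name__ == "__main__":
    a = [{"station": " Potsdam ", "lat": "52.38", "lon": "13.06"}]
    b = [{"NAME": "POTSDAM", "coords": "52.381,13.062"},
         {"NAME": "Berlin", "coords": "52.52,13.40"}]
    for station in integrate(a, b):
        print(station)
```

The pairwise comparison in `integrate` is quadratic in the number of records, which is acceptable for a sketch but is exactly the kind of cost that efficient integration algorithms try to avoid.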
Data Profiling is the process of generating metadata, such as statistics, summaries, and dependencies, from one or multiple datasets. The results of data profiling algorithms can support the decision process in data integration, data management, feature engineering, and data analysis. Given the velocity of the data produced by geophysical remote sensors, new challenges include adapting existing profiling techniques for relational data to data streams.
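To make the kinds of metadata concrete, here is a minimal sketch of two profiling primitives: a one-pass column profile that can be updated value by value (using Welford's online algorithm, so it also suits streams) and a check of whether a functional dependency holds in a set of rows. The class, function, and column names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningProfile:
    """One-pass numeric column profile; state is updated per value."""
    count: int = 0
    nulls: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, value) -> None:
        if value is None:
            self.nulls += 1
            return
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)  # Welford update
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 until at least two values were seen."""
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def holds_fd(rows: list, lhs: str, rhs: str) -> bool:
    """Return True if column `lhs` functionally determines column `rhs`."""
    seen = {}
    for row in rows:
        # Remember the first rhs value per lhs value; any mismatch breaks the FD.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


if __name__ == "__main__":
    profile = RunningProfile()
    for reading in [1.0, None, 2.0, 3.0, 4.0]:
        profile.update(reading)
    print(profile.count, profile.nulls, profile.mean, profile.variance)
```

Because `RunningProfile` keeps only a handful of numbers, it never needs to store the stream itself; dependency discovery over streams is harder, since `holds_fd` as written must remember every distinct left-hand value it has seen.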
Many interesting research problems in the geosciences can only be tackled by using a multitude of different data sources. These are typically distributed over various sites, exhibit highly heterogeneous structure, format, and semantics, and provide their data at various scales and granularities. Providing integrated, up-to-date, and semantically homogeneous views on such data sources is a difficult problem, requiring a mixture of techniques from research fields such as databases, statistics, and semantic web technologies. Heterogeneous data sources are particularly challenging when the data are very large or change very rapidly.
Many interesting research problems in the geosciences can be cast as instances of classical machine learning problems, such as classification, clustering, or rule mining. However, adequately representing the specific nature of the data underlying most geoscientific problems, such as their spatio-temporal reference frame or their domain-specific error models, within present machine learning methods is challenging.
Geoscientific Research Fields
Natural Hazards and Risks occur at the interface between natural and societal processes, e.g. from the combination of hazard probability, exposure, and vulnerability. In recent decades these aspects have been the subject of major research efforts because of their high societal relevance and their large uncertainty. Major scientific challenges are to better understand the mechanisms of hazard development and vulnerability processes and to improve their predictability. In addition, the magnitude, frequency, and impact of certain natural hazards are likely to change over time because of, e.g., changes in the climate or other geophysical systems, an increasing population at risk, or altered wealth. Quantifying and managing hazards and risks, assessing their future development, and deriving risk reduction strategies at the appropriate spatio-temporal scales and resolutions are currently among the most essential challenges in Earth sciences, and can best be addressed in an interdisciplinary network such as Geo.X. We focus on seismic, hydrological, geomorphological, atmospheric, and volcanic events in terrestrial systems of any region in the world. This may include cascades of different types of hazard and risk, as well as conditions with complex interactions between anthropogenic activities and natural phenomena. We strive to push forward innovative early warning and prediction systems and to develop new approaches using, e.g., big-data-driven Earth observations, high spatio-temporal resolutions, data fusion for rapid event analysis, new smart technologies, and coupled modelling.
The sustainable supply of georesources and their environmentally responsible use by this and future generations is a key societal challenge. This is a traditional field for the geosciences and related geo-engineering disciplines, but today the topic of georesources is much more complex because (a) many of the "easy" near-surface and rich deposits of mineral and energy resources have been exhausted, so future exploration must address deeply buried, more diffuse deposits and/or more remote areas including the seabed; and (b) exploration and exploitation of resources require broad societal acceptance, meaning that a balance must be found with competing interests in land use, soil, and water resources. Georesource research and development is a rapidly growing, data-intensive, and global activity. A key challenge and prerequisite for success is therefore the efficient use of new and existing information. Resource exploration programmes rely on the integration of extensive and diverse data sets from geology, geophysics, geochemistry, and remote sensing, including innovative numerical tools and models for interpretation. To these must be added data sets and information on other aspects such as climate, land use, and population statistics, which are needed by societal stakeholders and political decision-makers who must reach agreement on the conditions for resource extraction and use.
Investigations of bio-geological interactions in space and time enlarge our knowledge about how life is changing the environment of Planet Earth and how the Geosphere and the Biosphere interact. On the one hand, the origin, evolution, and distribution of life as well as biogeochemical cycling influence different processes within the Geosphere. On the other hand, geomorphological features, geological processes, climate dynamics, and geophysical conditions affect the Biosphere and are key elements of the past, present, and future co-evolution of the Geo- and Biosphere on Earth. New ideas and innovative strategies are needed to investigate these complex interactions between the Biosphere and the Geosphere through big data synthesis and analysis across wide spatial and temporal scales, ranging from DNA-level studies to plate tectonics. Results obtained by this interdisciplinary approach will also be of significance for the understanding of other terrestrial planets and the potential for life beyond Earth. The envisioned methods for this data-driven research enterprise are completely open and may include, but are not limited to, experimental approaches or model studies as well as in situ or remote-sensing observations.
Humans continuously interact with the geosphere. Human agency leads to global climate change, land use change or, more generally, changed states and flows in the ecosystem services provided to humanity. Therefore, information on the human impact on Earth and on how to transition to a more sustainable use of resources is among the core challenges at the intersection of information sciences and geosciences. More complete datasets and models are urgently needed to monitor ongoing changes and to illustrate future mitigation and transitional adaptation pathways related to climate change. Modelling and monitoring of global land cover and land use change need substantial advancement. A prerequisite for such novel information is the integration of big-data-related approaches. In the domain of "Human habitat and sustainability", Geo.X therefore explicitly addresses information needs related to climate change modelling and to land cover and land use change analyses based on satellite data and other big data domains.