How Datathons Drive New Understanding of Earth’s Evolution

Shaunna Morrison at Datathon
Datathons help researchers develop new ways to analyze and visualize vast mineralogical datasets and ultimately better understand the evolution of our planet. In this image, Carnegie mineralogist Shaunna Morrison works with data scientist Ahmed Eleish (RPI) during a small datathon in March 2017 to explore embedded trendlines in the carbon mineral-locality network (pictured on the screen). Photo taken by Anirudh Prabhu (RPI).
Friday, August 28, 2020 

Studying minerals helps scientists tell the story of our planet’s evolution and even helps us understand life’s development on Earth. However, geochemists collect a lot of data in many different labs all over the world. Connecting data sets and wading through the sea of information to tease apart patterns and relationships is a major challenge for modern geoscientists.

That’s where “datathons” come in.

Defining a datathon

Much like “hackathons” before them, datathons are high-intensity collaborative meetings that bring together people from a variety of disciplines to solve a set of problems. The difference is that, rather than building software, a datathon starts with a scientific question or dataset and can end with potential publications.

Participants start their day of data diving by discussing the goals of the meeting. Shaunna Morrison, an EPL Research Scientist who organized the recent 4D (Deep-Time Data Driven Discovery) Initiative Datathon, explains what that looks like for geoscientists: “These goals could be things like working on a machine-learning algorithm to generate a predictive model in a particular system of interest, or it could be a general exploration for trends in a large, complex data resource.”

A group photo of a datathon in 2019, back when they could be held in person. Sara Walker (ASU), Hyunju Kim (ASU), Adrienne Hoarfrost (Rutgers), Shuang Zhang (EPL), Joy Buongiorno (then EPL, now Maryville), Donato Giovannelli (not pictured), Robert Hazen (EPL) and Shaunna Morrison (EPL) came together to dig through biochemical, metagenomic, and geochemical data to characterize the biochemical network signatures of Earth and generate potential biosignature prediction tools based on the biochemical networks of other planetary bodies. Image Credit: Donato Giovannelli

Once everyone has a handle on the questions, the participants work in small groups to clean, explore, and analyze the data. They also spend time generating models, figures, and ideas. They come up with new questions, and may even start to draft papers. By the end of the day, the group comes back together to share their findings and delegate tasks for future work.

Morrison explained it like this: “Basically, it's a very intense, data-centric working meeting.” She continued, “We usually start in the morning and go until dinner time. If we can meet in real life, we keep talking about ideas and next steps through dinner. It’s a very long day, but very productive.”

Getting on the Datathon Train

Example of carbon mineral network analysis that comes out of 4D Initiative's datathons. This figure is from a recently published paper, Exploring Carbon Mineral Systems: Recent Advances in C Mineral Evolution, Mineral Ecology, and Network Analysis.

Morrison attended her first datathon in 2016. From that one meeting emerged a groundbreaking idea: scientists could use network analysis—like that used to analyze groups of friends on social media—to understand the relationships between rocks and minerals. 

“We knew there were higher dimensional trends in our data, but we didn't have the tools or the knowledge to understand how to explore those trends, visually or analytically,” said Morrison. The collaboration across disciplines was essential, Morrison continued, “Once we showed our data to the Rensselaer Polytechnic Institute data scientists, I was blown away by what they came up with!” 
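To give a rough sense of what network analysis means here, the sketch below builds a tiny mineral co-occurrence network in Python. It is only an illustration under stated assumptions: the locality names and the occurrence table are made up, and the helper functions `cooccurrence_network` and `degree` are hypothetical names, not the tools used by the 4D Initiative. Two minerals are linked when they are found at the same locality, and the link gets stronger with every locality they share.

```python
from collections import defaultdict
from itertools import combinations

# Made-up toy data (an assumption for illustration, not real records):
# which carbonate minerals are reported at which hypothetical localities.
occurrences = {
    "locality_A": {"calcite", "dolomite", "siderite"},
    "locality_B": {"calcite", "dolomite"},
    "locality_C": {"calcite", "malachite", "azurite"},
    "locality_D": {"malachite", "azurite"},
}


def cooccurrence_network(occurrences):
    """Turn a locality -> minerals table into a mineral-mineral network.

    Each edge is a pair of minerals; its weight is the number of
    localities where both minerals occur together.
    """
    edges = defaultdict(int)
    for minerals in occurrences.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(minerals), 2):
            edges[(a, b)] += 1
    return dict(edges)


def degree(edges):
    """Count how many distinct minerals each mineral is linked to."""
    deg = defaultdict(int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return dict(deg)


edges = cooccurrence_network(occurrences)
degrees = degree(edges)

# List the strongest links first, then the most connected minerals.
for pair, weight in sorted(edges.items(), key=lambda item: -item[1]):
    print(pair, weight)
for mineral, d in sorted(degrees.items(), key=lambda item: -item[1]):
    print(mineral, d)
```

In this toy table, calcite ends up as the most connected mineral because it appears alongside every other mineral somewhere. Real mineral networks involve thousands of species and localities, which is where dedicated network tools and visualization come in.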

Since 2016, Carnegie’s Robert Hazen and Shaunna Morrison have hosted dozens of datathons with the goal of understanding Earth’s complex mineral interactions through data analysis. Datathons continue to be an essential collaborative tool to hasten the data-driven understanding of our ever-evolving planet.

“If I'm organizing a meeting,” says Morrison, “it will be a datathon.”

Analytical tools to come out of 4D Initiative datathons in the past:

Mineral network analysis: provides a holistic view of complex mineralogical relationships through space and time, including finding that mineral data naturally embed timelines and chemical trend lines (Example)

Natural kind clustering: predicts the formational environment of mineral samples based on geochemical and physical properties that are unique to their formational environments (Example)

Mineral affinity analysis: predicts the previously unknown location of mineral deposits or environments of interest, including planetary analogs (Example)

Paleobiology network analysis: provides visualization and analysis of complex paleobiological systems through deep time, including the prediction of previously unrecognized massive faunal turnover events (Example)

Label distribution learning: predicts major and minor component mineral compositions on Mars, based on X-ray diffraction data alone (Example)

EPL’s First Virtual Datathon Finds Deep Connections

In August 2020, Morrison organized the Earth and Planets Laboratory’s 4D Initiative datathon, a 15-person, 2-day virtual meeting that resulted in plans for 13 manuscripts, most of which grew out of ideas from the datathon itself. That is nearly one manuscript per participant!

While the virtual format made casual conversation and small group chats more difficult, the digital meetup allowed people from around the world to participate. The group included researchers and data scientists from Rensselaer Polytechnic Institute, Purdue, George Mason, the Carnegie Institution for Science, the Universities of Naples, Toronto, and Southern Illinois, and Maryville College.

The multifaceted group spent the majority of those two days focused on the cluster analysis of geochemical systems, like that of feldspar minerals. 

Feldspar can form under a wide variety of conditions, and a feldspar’s chemical makeup changes based on what it interacts with in its environment. Cluster analysis aims to look past the traditional method of simply naming the mineral (e.g. “This is feldspar”) and instead model a mineral based on the large number of factors that brought it into existence (e.g. “This is a hydrothermal feldspar that formed at a certain temperature from a parental magma of a certain composition”).

Specimen of a lead-rich feldspar, which gets its translucent light green color from the lead content. Credit: Rob Lavinsky via Wikimedia

Morrison explains, “By simply characterizing what mineral species a sample is, you only learn a little bit about how and why it formed, but if we dig into their complex, multidimensional, multivariate chemical and physical properties, we see that there are distinctions between samples that formed in different environments.”

While that may seem straightforward to some, the challenge is that these higher-dimensional trends don’t show up on a traditional X-Y graph. Teasing them out requires machine learning that can weigh the many complicated variables that contribute to mineral formation.
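As a loose illustration of what cluster analysis does, here is a minimal k-means sketch in Python. Everything in it is an assumption made for the example: the article does not say which clustering algorithm the group used, the six "samples" are made-up numbers rather than real feldspar measurements, and `kmeans` is a hypothetical helper written for this sketch. Each sample is described by three attributes at once, and the algorithm groups samples that sit close together in that three-dimensional space.

```python
import math

# Made-up illustrative measurements (NOT real data). Each sample is a
# tuple of three hypothetical chemical or physical attributes.
samples = [
    (1.0, 0.9, 5.1), (1.2, 1.1, 4.8), (0.9, 1.0, 5.0),  # one loose group
    (4.0, 3.8, 1.0), (4.2, 4.1, 0.9), (3.9, 4.0, 1.2),  # another group
]


def kmeans(points, centroids, iterations=10):
    """Plain k-means with caller-supplied starting centroids.

    Repeats two steps: assign every point to its nearest centroid,
    then move each centroid to the mean of the points assigned to it.
    """
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for each point.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: recompute each centroid from its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    return labels, centroids


# Start from two of the samples so the run is deterministic.
labels, centroids = kmeans(samples, centroids=[samples[0], samples[3]])
print(labels)
print(centroids)
```

With only three attributes the groups could still be spotted by eye, but the same procedure works unchanged when each sample carries dozens of measurements, which is the situation where a simple X-Y plot stops being useful.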

The 13 potential papers include clustering work on pyrite, magnetite, chondrites, and feldspar as well as other data-driven approaches to mineral analysis. 

Stay tuned for publications! Morrison has high confidence in the success of these manuscripts, stating, “It is very likely that all of these 13 manuscripts will be published. Rarely do we put a paper on our list that doesn't come to fruition.”

Learn more about the 4D Initiative here.