Reconstructive Self-Supervised Learning for Astronomy

Status Update

This is very preliminary, in-progress work from fall 2023 and winter 2024. More progress has been made since then, so I will update this page with more up-to-date results in the coming months!

TL;DR

Masking images and making a Machine Learning (ML) model reconstruct the missing pixels is a meaningful task for getting a model to learn certain fundamental patterns in astronomy data.

Built With

Python, Notebook, PyTorch, WandB, GitHub

Background

The Ultraviolet Near Infrared Optical Northern Survey (UNIONS) uses observations from three telescopes in Hawaii to investigate some of the most fundamental questions in astrophysics, such as determining the properties of dark matter and dark energy, as well as the growth of structure in the Universe. However, it is difficult to effectively search through and categorize UNIONS data to address these questions due to the volume of data produced.

Goal

This project aims to exploit advances in a sub-field of ML called Self-Supervised Learning (SSL) to train a model that produces astrophysically meaningful representations of astronomy observations.

Method

The model is trained solely on images of the sky, without the need for explicit labels indicating what source is being observed. When paired with a small number of labelled examples, the learned representations are useful in downstream tasks such as similarity searches for rare astronomical objects, or as inputs to a linear regression layer to predict redshifts.

Most recently, an SSL masked image modeling method called SimMIM was implemented and evaluated on a specific use case: dwarf galaxy identification.

Image 5
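
To make the masked image modeling idea concrete, here is a minimal NumPy sketch of the two ingredients: hiding random patches of an image, and scoring a reconstruction only on the hidden pixels with an L1 loss, in the spirit of SimMIM. This is not the project's actual code. The function names, the 5-channel 64x64 image, the patch size and the mask ratio are all hypothetical values chosen for illustration, and the "reconstruction" is a stand-in rather than a real network output.

```python
import numpy as np

def random_patch_mask(height, width, patch_size, mask_ratio, rng):
    """Boolean (height, width) mask; True marks pixels hidden from the model.

    Assumes height and width are divisible by patch_size.
    """
    grid_h, grid_w = height // patch_size, width // patch_size
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    flat = np.zeros(num_patches, dtype=bool)
    flat[rng.choice(num_patches, size=num_masked, replace=False)] = True
    patch_mask = flat.reshape(grid_h, grid_w)
    # Expand each per-patch decision to pixel resolution.
    return np.repeat(np.repeat(patch_mask, patch_size, axis=0), patch_size, axis=1)

def masked_l1_loss(image, reconstruction, mask):
    """Mean absolute error over masked pixels only; image is (channels, H, W)."""
    diff = np.abs(image - reconstruction) * mask  # mask broadcasts over channels
    return diff.sum() / (mask.sum() * image.shape[0])

rng = np.random.default_rng(0)
image = rng.normal(size=(5, 64, 64))  # hypothetical 5-band cutout
mask = random_patch_mask(64, 64, patch_size=8, mask_ratio=0.5, rng=rng)

# Simplification: masked pixels are zeroed here, whereas SimMIM replaces
# masked patches with a learnable mask token inside the encoder.
model_input = np.where(mask, 0.0, image)

# Stand-in for a network output: a "model" that just returns its input.
loss = masked_l1_loss(image, model_input, mask)
print(f"masked fraction: {mask.mean():.2f}, L1 loss on masked pixels: {loss:.3f}")
```

Because the loss ignores visible pixels, the model gets no credit for copying what it can already see; it only improves by predicting the hidden regions from their surroundings.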

Results

The model is able to reconstruct the astronomy images pretty well:

Image 6

More importantly, useful representations are learned!

Similar observations are clustered together (which is super useful for similarity searches):

Image 7

Linear classifiers and regressors can be built on these embeddings, which allows learning from fewer labelled examples than are needed to train a model of similar performance from scratch for a specific task. As an example, a dwarf galaxy classifier achieves 90% accuracy at identifying known dwarf galaxy candidates with only 200 examples.
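
As a rough illustration of what "a linear classifier on the embeddings" means, the sketch below fits a logistic regression with plain gradient descent on frozen feature vectors. Everything here is an assumption for demonstration: the embeddings are synthetic 16-dimensional Gaussians rather than real model outputs, and the function names, learning rate and epoch count are hypothetical. Only the training set size of 200 echoes the text above.

```python
import numpy as np

def train_linear_probe(embeddings, labels, lr=0.1, epochs=500):
    """Fit logistic regression weights on fixed embeddings via gradient descent."""
    n, d = embeddings.shape
    weights = np.zeros(d)
    bias = 0.0
    for _ in range(epochs):
        logits = embeddings @ weights + bias
        probs = 1.0 / (1.0 + np.exp(-logits))
        error = probs - labels  # gradient of the cross-entropy w.r.t. the logits
        weights -= lr * (embeddings.T @ error) / n
        bias -= lr * error.mean()
    return weights, bias

def predict(embeddings, weights, bias):
    """Return 1 where the linear score is positive, else 0."""
    return (embeddings @ weights + bias > 0).astype(int)

rng = np.random.default_rng(0)

def make_toy_embeddings(n):
    # Synthetic stand-in for encoder outputs: two shifted Gaussian blobs.
    labels = rng.integers(0, 2, size=n)
    embeddings = rng.normal(size=(n, 16)) + 1.5 * (labels[:, None] - 0.5)
    return embeddings, labels

train_x, train_y = make_toy_embeddings(200)   # small labelled set
test_x, test_y = make_toy_embeddings(1000)    # held-out examples

weights, bias = train_linear_probe(train_x, train_y)
accuracy = (predict(test_x, weights, bias) == test_y).mean()
print(f"held-out accuracy on toy data: {accuracy:.3f}")
```

The encoder itself is never updated in this setup; only one weight vector and one bias are learned, which is why a few hundred labels can be enough when the embeddings already separate the classes.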

Further, a similarity search is performed using the representations of just a handful of dwarf galaxies, and 23 of the top 25 most similar sources are found to be dwarf galaxies:

Image 8

And here is an example of how the similarity search functionality can be useful when searching for other examples of similar dwarfs given just one query image:

Image 9
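
The similarity searches above can be sketched as a nearest-neighbour lookup in embedding space. The version below ranks a catalog by cosine similarity to the mean of one or more query embeddings. Both the cosine metric and the averaging of multiple queries are assumptions made for this sketch, not confirmed details of the project, and the tiny 2-D catalog and the function name `top_k_similar` are purely illustrative.

```python
import numpy as np

def top_k_similar(query_embeddings, catalog_embeddings, k):
    """Return indices and cosine scores of the k catalog rows closest to the queries.

    query_embeddings may be a single vector or a (num_queries, dim) array;
    multiple queries are averaged into one search vector (an assumption).
    """
    queries = np.atleast_2d(query_embeddings)
    query = queries.mean(axis=0)
    query = query / np.linalg.norm(query)
    catalog = catalog_embeddings / np.linalg.norm(
        catalog_embeddings, axis=1, keepdims=True
    )
    scores = catalog @ query          # cosine similarity per catalog row
    order = np.argsort(-scores)[:k]   # highest similarity first
    return order, scores[order]

# Toy catalog of 2-D "embeddings"; real embeddings would be much wider.
catalog = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
    [-1.0, 0.0],
])

# Single-query search, as in the one-image example.
single_idx, single_scores = top_k_similar(np.array([1.0, 0.0]), catalog, k=2)

# Multi-query search, as when starting from a handful of known dwarfs.
multi_idx, multi_scores = top_k_similar(catalog[[0, 2]], catalog, k=2)

print(single_idx, multi_idx)
```

For a large survey catalog, the brute-force matrix product would typically be replaced by an approximate nearest-neighbour index, but the ranking logic stays the same.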

Conclusion

These results suggest that this method is a promising avenue to explore, not only for the discovery of more dwarf galaxies but also for various other tasks of interest to astronomers.

Resources

Image 8