DSG Bring Your Own Project

BYOP PROPOSAL (REPRODUCIBILITY TRACK)

Contact Details
● Name: Kriti Garg

● Enrollment Number: 22118038

● Email Id: kriti_g@mt.iitr.ac.in

● Mobile Number: 9759171008

PROJECT TITLE: CLIMATE PREDICTION MODEL

PAPER TITLE: CLIMATESET: A LARGE-SCALE CLIMATE MODEL DATASET

Kriti Garg | Metallurgical and Materials Engineering

CONFERENCE: NeurIPS’23

LINK TO THE PAPER:
https://drive.google.com/drive/u/0/folders/1cNFJK5QhXoyzBGHvZ92Dk1orsQaNiPlZ

OBJECTIVE:
Climate change has been an alarming problem for several years and will reflect severely in the years to come. Climate models have been key in assessing the impact of climate change under future scenarios. Existing work typically emulates a single climate model; here, we will explore climate model emulation using multiple models. The objective is to produce accurate emulations of climate model predictions using a large-scale dataset, and then to choose the most accurate ML model according to the evaluation metrics.

UNDERSTANDING OF THE RESEARCH PAPER USED:


This paper was accepted at the NeurIPS’23 conference. It introduces a large-scale dataset for climate modelling: ClimateSet. The dataset combines inputs and outputs of over 36 climate models. Four climate forcers (CO2, CH4, SO2, BC) are used as input, and the output consists of two variables (temperature and precipitation). The time frame of these climate forecasts falls under medium- to long-term time series forecasting.

A detailed analysis of the paper has been completed; the link to the document containing it is:
https://drive.google.com/drive/u/0/folders/1cNFJK5QhXoyzBGHvZ92Dk1orsQaNiPlZ

CODE AVAILABILITY:
In the research paper, a few links are provided to access the code for the downloader and the preprocessor:
https://code.mpimet.mpg.de/projects/cdo/embedded/cdo.pdf
https://climateset.github.io

ABLATION STUDY:
This paper has been chosen from the NeurIPS’23 conference. It makes the case for a large-scale dataset: climate change plays out over long time periods, and addressing tasks at this scale requires larger and more consistent ML-ready climate model datasets.
The output variables can be extended beyond temperature and precipitation.
Downscaling is performed to bridge the gap between coarse and fine spatial and temporal resolutions.
Further, to reduce spatio-temporal heterogeneity and inherent measurement errors, data assimilation techniques such as Kalman filtering, 3D variational analysis (3D-Var) and 4D variational analysis (4D-Var) are among the statistical algorithms that will be used.
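
As a minimal illustration of the data-assimilation idea, here is a toy one-dimensional Kalman filter in Python; the random-walk state model and the noise levels are my own assumptions for the sketch, not values from the paper:

import numpy as np

def kalman_filter_1d(observations, process_var=1e-3, obs_var=0.25):
    # Assimilate noisy scalar observations into a smoothed state estimate.
    # Toy random-walk state model: x_t = x_{t-1} + process noise.
    x_est = observations[0]   # initial state estimate
    p_est = 1.0               # initial estimate variance
    estimates = []
    for z in observations:
        p_pred = p_est + process_var              # predict: uncertainty grows
        k_gain = p_pred / (p_pred + obs_var)      # Kalman gain
        x_est = x_est + k_gain * (z - x_est)      # update with the observation
        p_est = (1.0 - k_gain) * p_pred
        estimates.append(x_est)
    return np.array(estimates)

# Example: smooth a noisy temperature series.
noisy = 15.0 + np.random.randn(100) * 0.5
smoothed = kalman_filter_1d(noisy)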

RESOURCES USED:
https://drive.google.com/drive/folders/1LFMxIYzjCAQoAM8fyeRMpR1Az8rLQL7H?usp=sharing
https://www.mdpi.com/2076-3417/13/21/12019
https://www.kaggle.com/datasets/berkeleyearth/climate-change-earth-surface-temperature-data
PROJECT SPECIFICATION / METHODOLOGY:

● Dataset Features:
The dataset contains inputs and outputs of 36 models from the CMIP6 and Input4MIPs archives. CMIP6 consists of projections of future climate change scenarios from 58 climate models. Input4MIPs collects the future emission trajectories of climate forcing agents that are used as input to climate models.

● Data Collection:
All data requested through ClimateSet can be downloaded directly from ESGF (the Earth System Grid Federation); a sketch of querying its search API follows.
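
For reference, a hedged sketch of querying the public ESGF search REST API with Python; the node URL and query facets below are my assumptions and should be checked against the ClimateSet downloader:

import requests

ESGF_SEARCH = "https://esgf-node.llnl.gov/esg-search/search"

params = {
    "project": "CMIP6",
    "variable_id": "tas",                  # near-surface air temperature
    "experiment_id": "ssp245",             # SSP2-4.5 scenario
    "frequency": "mon",
    "format": "application/solr+json",
    "limit": 5,
}
resp = requests.get(ESGF_SEARCH, params=params, timeout=60)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc["id"])                       # matching dataset identifiers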

● Data Preprocessing:
The ClimateSet preprocessor is built modularly; it contains (1) the Checker, (2) the Raw Processor, (3) the Resolution Processor and (4) the Structure Processor, chained as sketched below.
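
A minimal sketch of how such a modular pipeline can be chained with xarray; the stage names mirror the paper's modules, but these functions are illustrative, not the actual ClimateSet API:

from typing import Callable, List
import xarray as xr

Stage = Callable[[xr.Dataset], xr.Dataset]

def checker(ds: xr.Dataset) -> xr.Dataset:
    # Verify that the expected variables are present before processing.
    missing = {"tas", "pr"} - set(ds.data_vars)
    if missing:
        raise ValueError(f"missing variables: {missing}")
    return ds

def raw_processor(ds: xr.Dataset) -> xr.Dataset:
    # Example raw step (assumed behaviour): drop timesteps with gaps.
    return ds.dropna(dim="time", how="any")

def resolution_processor(ds: xr.Dataset) -> xr.Dataset:
    # Example resolution step (assumed behaviour): coarsen 2x in space.
    return ds.coarsen(lat=2, lon=2, boundary="trim").mean()

def structure_processor(ds: xr.Dataset) -> xr.Dataset:
    # Example structure step: enforce a canonical dimension order.
    return ds.transpose("time", "lat", "lon")

def run_pipeline(ds: xr.Dataset, stages: List[Stage]) -> xr.Dataset:
    for stage in stages:
        ds = stage(ds)
    return ds

# clean = run_pipeline(raw_ds, [checker, raw_processor, resolution_processor, structure_processor])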

● Extending the ClimateSet:
The core dataset can be extended to include different variables, height levels, ensemble members, scenarios and other information made available through climate models on the CMIP6 server of ESGF.

● Accelerating the ClimateSet:
If run on a machine with 1 CPU core, 16 GB of memory and single threading, the complete preprocessing of the 36 models takes ~160 hours. Preprocessing can therefore be accelerated using the multi-threading option of the CDO-implemented processors (resolution & raw), as sketched below.
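
CDO exposes OpenMP multi-threading through its -P flag; a sketch of invoking a multi-threaded remap from Python (file paths and the target grid are placeholders):

import subprocess

# Conservative remapping with 8 OpenMP threads (-P 8).
subprocess.run(
    ["cdo", "-P", "8", "remapcon,r96x144", "input_model.nc", "output_remapped.nc"],
    check=True,
)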

● Train-Test Split:
Around 10% of the data is held out for validation, and the SSP2-4.5 scenario is reserved for testing (see the sketch below).
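
A sketch of this split, assuming each sample carries a scenario label (all names are illustrative):

import numpy as np

def split_by_scenario(samples, scenarios, val_frac=0.10, test_scenario="ssp245", seed=0):
    # SSP2-4.5 is reserved entirely for testing; ~10% of the rest is validation.
    samples = np.asarray(samples)
    scenarios = np.asarray(scenarios)
    test_mask = scenarios == test_scenario
    train_val = samples[~test_mask]
    idx = np.random.default_rng(seed).permutation(len(train_val))
    n_val = int(len(train_val) * val_frac)
    return train_val[idx[n_val:]], train_val[idx[:n_val]], samples[test_mask]

# train, val, test = split_by_scenario(all_samples, all_scenario_labels)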

● Model Training:
Training is done in two settings for all four algorithms we will consider. (1) Single emulation: an ML model is trained on each of the 15 climate models separately. (2) Super emulation: an ML model is trained jointly on 6 of the 36 climate models.
● Evaluation and Testing:
We use the latitude-longitude weighted root mean square error (RMSE) as the evaluation metric; a minimal implementation is sketched below. Statistical techniques like Empirical Orthogonal Functions (EOF) and Principal Component Analysis (PCA) are employed to identify and validate overarching climatic patterns. Such analyses often have to account for high levels of uncertainty and are cross-validated against geological or even astronomical records, making immediate validation impractical; climate models, by contrast, are evaluated on their ability to accurately reproduce decadal and centennial patterns.
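
The latitude-longitude weighted RMSE weights each grid cell by the cosine of its latitude so that equal-angle cells near the poles are not over-counted; a minimal NumPy version (shapes are my assumption, not the paper's exact implementation):

import numpy as np

def lat_weighted_rmse(pred, target, lats_deg):
    # pred/target shaped (..., n_lat, n_lon); lats_deg shaped (n_lat,).
    w = np.cos(np.deg2rad(lats_deg))
    w = w / w.mean()                          # normalize weights to mean 1
    weighted_sq_err = ((pred - target) ** 2) * w[:, None]
    return np.sqrt(weighted_sq_err.mean())

# rmse = lat_weighted_rmse(model_out, truth, np.linspace(-88.75, 88.75, 144))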

ML Models Used:
(1) ClimaX: a foundation model that is used for both spatial and temporal data. It can be trained on heterogeneous datasets, and it extends the vision transformer architecture to accommodate the intricacies of weather and climate modelling and climate forecasting. For coding reference, help can be taken from https://github.com/microsoft/ClimaX.
One more paper used for reference:
https://drive.google.com/drive/folders/1NA8DakroL3Z6IQmVFatdWdgevVytH8-E?usp=sharing

(2) U-Net: a convolutional neural network built for image segmentation tasks. It has exhibited exceptional capabilities in handling the spatial variables associated with climate forecasts. It comprises an encoder path responsible for capturing contextual information, a decoder path to reconstruct the segmented image, and an output layer. It uses a VGG-11 encoder backbone of the kind used in the ImageNet classification challenge.
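
If the segmentation_models_pytorch library is used, a U-Net with a VGG-11 encoder takes only a few lines; the library choice and grid size are my assumptions, while the 4 input channels (forcers) and 2 output channels (temperature, precipitation) follow the paper's setup:

import torch
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="vgg11",      # VGG-11 encoder backbone
    encoder_weights=None,      # or "imagenet" for pretrained weights
    in_channels=4,             # CO2, CH4, SO2, BC
    classes=2,                 # temperature, precipitation
)

# Spatial dims must be divisible by 32 for the encoder's downsampling stages.
x = torch.randn(8, 4, 96, 128)   # (batch, channels, lat, lon) - assumed grid
y = model(x)                     # -> (8, 2, 96, 128)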

(3) Convolutional LSTM: a combination of a convolutional neural network for spatial features and an LSTM for temporal features. It can be used for climate projection tasks and has been recognized as the best baseline in terms of reducing the RMSE (root mean squared error). Its data flow, sketched in code after this list, is:
● The input data undergoes feature extraction through a CNN layer; a time-distribution layer applies this at each timestep.
● An average pooling operation reduces the dimensionality of the data after feature extraction.
● The features are then fed into an LSTM layer to capture temporal dependencies.
● Finally, a linear readout layer produces the desired output.
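
A compact PyTorch sketch of that data flow; the layer widths and the output grid are illustrative assumptions, not the paper's exact configuration:

import torch
import torch.nn as nn

class CNNLSTMEmulator(nn.Module):
    def __init__(self, in_ch=4, hidden=128, out_dim=2 * 96 * 144):
        super().__init__()
        # CNN applied independently at every timestep (time distribution).
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((12, 18)),   # average pooling reduces dimensionality
        )
        self.lstm = nn.LSTM(32 * 12 * 18, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, out_dim)  # linear readout layer

    def forward(self, x):                        # x: (batch, time, ch, lat, lon)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1))        # fold time into the batch axis
        feats = feats.flatten(1).view(b, t, -1)  # restore the time axis
        seq, _ = self.lstm(feats)                # capture temporal dependencies
        return self.readout(seq[:, -1])          # predict from the last timestep

# y = CNNLSTMEmulator()(torch.randn(2, 10, 4, 96, 144))  # -> (2, 2*96*144)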

(4) Gaussian Process: a non-parametric, Bayesian regression method that provides uncertainty measures over its predictions. Since we deal with large datasets, we use a stochastic variational variant of the Gaussian process for regression. To deal with the multi-output nature of the task, we use the Linear Model of Coregionalization (LMC) with 100 latent Gaussian processes; the number of latents is a hyperparameter that controls the capacity of the model. We use Matérn-1.5 kernels and train with the Adam optimizer with a learning rate of 0.1 and a batch size of 64.
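
In GPyTorch, the combination described above (stochastic variational GP, LMC over latent processes, Matérn-1.5 kernel, Adam with lr 0.1) can be set up roughly as below; I use toy sizes instead of the paper's 100 latents to keep the sketch small:

import torch
import gpytorch

NUM_LATENTS, NUM_TASKS, NUM_INDUCING, INPUT_DIM = 4, 2, 16, 8  # toy sizes

class LMCSVGP(gpytorch.models.ApproximateGP):
    def __init__(self):
        inducing = torch.randn(NUM_LATENTS, NUM_INDUCING, INPUT_DIM)
        var_dist = gpytorch.variational.CholeskyVariationalDistribution(
            NUM_INDUCING, batch_shape=torch.Size([NUM_LATENTS])
        )
        # LMC mixes NUM_LATENTS latent GPs into NUM_TASKS outputs.
        strategy = gpytorch.variational.LMCVariationalStrategy(
            gpytorch.variational.VariationalStrategy(
                self, inducing, var_dist, learn_inducing_locations=True
            ),
            num_tasks=NUM_TASKS, num_latents=NUM_LATENTS, latent_dim=-1,
        )
        super().__init__(strategy)
        self.mean_module = gpytorch.means.ConstantMean(
            batch_shape=torch.Size([NUM_LATENTS])
        )
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.MaternKernel(nu=1.5, batch_shape=torch.Size([NUM_LATENTS])),
            batch_shape=torch.Size([NUM_LATENTS]),
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

model = LMCSVGP()
likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=NUM_TASKS)
opt = torch.optim.Adam(
    list(model.parameters()) + list(likelihood.parameters()), lr=0.1
)
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=1024)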

Access to the code and data is provided at https://climateset.github.io (as mentioned in the paper).

CHALLENGES:
(1) Data retrieval while extending the dataset: users may run into data retrieval issues when fetching data from the ESGF server, since it is down from time to time, though an outage may affect only one climate model.

(2) Weighting of climate models: done to prevent over- and under-representation of certain climate models.

TIMELINE:
WEEK 1 (LAST WEEK OF DECEMBER):
1) Get acquainted with the dataset, ClimateSet; extract the data from the ESGF server and perform preprocessing using the downloaders and preprocessors.

2) Perform spatial and temporal preprocessing (spatial aggregation as well as spatial interpolation). The resolution preprocessor will be implemented using CDO, the fastest tool for remapping NetCDF files. The spatial resolution preprocessor uses CDO commands that handle both aggregation and interpolation, as in the sketch below.
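
A sketch of choosing between the two remapping modes from Python; the operator names are standard CDO, while the paths and the target grid string are placeholders:

import subprocess

def remap(in_path, out_path, target_grid="r96x144", coarsen=True):
    # remapcon (first-order conservative) suits aggregation to a coarser grid;
    # remapbil (bilinear) suits interpolation to a finer grid.
    op = "remapcon" if coarsen else "remapbil"
    subprocess.run(["cdo", f"{op},{target_grid}", in_path, out_path], check=True)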

WEEK 2 (FIRST WEEK OF JANUARY):

(1) Train the ML models on the data.

(2) We’ll use ClimaX, ClimaX-F and U-Net this week.

(3) Set the hyperparameters for the above models, including MLP ratio, prediction depth, hidden dimension, dropout rate, padding size, kernel size, etc.
WEEK 3 (SECOND WEEK OF JANUARY):
(1) Train the remaining ML models, ConvLSTM and Gaussian process regression, on the climate models.

(2) Set their hyperparameters, accelerate the preprocessing using the multi-thread function of CDO, and train the models.

(3) Experiment with both single emulation and super emulation. In single emulation, I will train an ML model (ConvLSTM or Gaussian process regression) on a single climate model and evaluate its performance separately on each model. In super emulation, I will train the ML model on a set of climate models and evaluate the performance separately.

WEEK 4 (THIRD WEEK OF JANUARY):

(1) Evaluate using the latitude-longitude weighted root mean squared error (RMSE) as the performance metric.

(2) Visualize the results for all four models. For visualization, I will use libraries such as Matplotlib.

(3) Document the whole process. I will prepare a document covering all the research done, and yet to be done, in preparing the final outcome. Along with this, all the preprocessing techniques, resolution preprocessing techniques, data acquisition techniques and ML models will be documented, together with pictures taken from the training, evaluation and testing phases of the model.

EXPECTED OUTCOME:
After training, testing and evaluation, we will arrive at certain outcomes. As per the research paper, the expected outcomes are as follows:
(1) In single emulation, ClimaX outperforms the other models. This is plausible since it is fine-tuned from pretrained weights.
(2) In super emulation, ConvLSTM is the best-performing model on both surface temperature and precipitation. This is because it is able to learn faster than the other models due to its smaller number of learnable parameters.

GLIMPSES FROM THE RESEARCH PAPER:
[Figures from the paper: single-emulator and super-emulator results]

ABOUT ME:
Hello, I am Kriti Garg, a B.Tech. sophomore majoring in Metallurgical and Materials Engineering. Alongside my coursework, I am pursuing machine learning. I am currently doing two remote research internships: one at MNIT Jaipur, where I use convolutional Swin transformers for object detection in agriculture, and the other at NIT Trichy, where the task relates to insider threat detection using an amalgamation of ML and DL. Both tasks are ongoing. I have done an open project on portfolio optimisation using Monte Carlo simulation under the Finance Club. I love to explore the tech domain more and more, and I actively do competitive programming on Codeforces. Apart from this, I am a part of the official media body of the campus, Watch Out!, and love to read and watch crime thrillers.

GitHub link: https://github.com/galactic-me?tab=repositories
