Preprocessed Datasets

This is a list of weather and climate datasets that have been preprocessed for AI research, including benchmarks, competitions, and ML papers with published data. The list is in alphabetical order.

AI for Earth System Science Summer School Hackathon

  • Code and Data: https://github.com/NCAR/ai4ess-hackathon-2020

  • Source: NCAR, Lawrence Berkeley Lab, and NOAA

  • Description: Five challenge problems related to prediction and emulation. The GOES challenge focuses on predicting lightning from GOES-16 satellite imagery. The GECKO-A challenge focuses on emulating the GECKO-A chemistry model from a large set of model time series. The microphysics challenge focuses on emulating the TAU bin microphysics scheme. The HOLODEC challenge focuses on estimating raindrop distribution properties from synthetic holographic diffraction patterns. The ENSO challenge focuses on predicting ENSO from gridded model output.

AMS Solar Energy Prediction Contest

AQ-Bench: A Benchmark Dataset for Machine Learning on Global Air Quality Metrics

CAMELS: Catchment Attributes and Meteorology for Large-Sample Studies

ClimateNet: an expert-labelled open dataset and Deep Learning architecture for enabling high-precision analyses of extreme weather

CloudCast: A large-scale dataset and baseline for forecasting clouds

  • Paper: http://doi.org/10.1109/JSTARS.2021.3062936

  • Code and data: https://vision.eng.au.dk/cloudcast-dataset/

  • Source data: Meteosat-11 with cloud types annotated on a pixel-level

  • Description: The CloudCast dataset contains 70,080 images with 10 different cloud types for multiple layers of the atmosphere, annotated at the pixel level. Each image is 928 × 1530 pixels at 3.0 km resolution, recorded at 15-minute intervals over 2017-2018.
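As a quick sanity check on the figures quoted for CloudCast, the image count can be reproduced from the sampling interval. This is only a sketch: it assumes the record covers all of 2017 and 2018 without gaps, which the description does not state explicitly.

```python
from datetime import datetime, timedelta

# Assumption: frames cover all of 2017 and 2018 with no gaps.
start = datetime(2017, 1, 1)
end = datetime(2019, 1, 1)  # exclusive upper bound
interval = timedelta(minutes=15)

# Number of 15-minute frames in the two-year period.
n_frames = (end - start) // interval

# Number of annotated pixels in a single 928 x 1530 image.
pixels_per_frame = 928 * 1530

print(n_frames)
print(pixels_per_frame)
```

Under that assumption the frame count matches the 70,080 images quoted in the description.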

CUMULO: a benchmark dataset for training and evaluating global cloud classification models

Deepti: Deep-Learning-Based Tropical Cyclone Intensity Estimation System (+ Competition)

EarthNet2021: A novel large-scale dataset and challenge for forecasting localized climate impacts

  • Paper: https://arxiv.org/abs/2012.06246

  • Code and data: https://www.earthnet.tech/

  • Source data: Sentinel-2

  • Description: Curated dataset containing target spatio-temporal Sentinel-2 satellite imagery at 20 m resolution, matched with high-resolution topography and mesoscale (1.28 km) weather variables. With over 32,000 samples, it is suitable for training deep neural networks.

The ExtremeWeather Dataset

  • Paper: https://arxiv.org/abs/1612.02095

  • Code and data: https://github.com/eracah/hur-detect, https://extremeweatherdataset.github.io/

  • Source data: CAM5

  • Description: Consists of 768 × 1152 images of the global atmospheric state with a spatial resolution of 25 km, separated by 6-hour intervals from 1979 to 2005. Each image has 16 channels that correspond to different variables, such as surface pressure, surface temperature and humidity at the reference altitude. In addition, there are bounding boxes and class labels for 4 types of extreme weather events: Tropical Depressions, Tropical Cyclones, Extratropical Cyclones and Atmospheric Rivers.
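To give a feel for the data volume implied by these dimensions, the sketch below estimates the size of one global snapshot and the number of snapshots per year. The 32-bit float storage and the 365-day calendar are assumptions made for illustration; check the dataset documentation for the actual storage format.

```python
# Figures taken from the description above.
n_channels, height, width = 16, 768, 1152

# Assumption: values are stored as 32-bit floats (4 bytes each).
bytes_per_value = 4
bytes_per_snapshot = n_channels * height * width * bytes_per_value
mib_per_snapshot = bytes_per_snapshot / 2**20

# Assumption: a 365-day calendar with one snapshot every 6 hours.
snapshots_per_year = 365 * (24 // 6)

print(f"{mib_per_snapshot:.0f} MiB per snapshot")
print(f"{snapshots_per_year} snapshots per year")
```

Multiplying the two gives a rough uncompressed size per simulated year, which helps when planning storage or data loading.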

FlowDB: A new large scale river flow, flash flood, and precipitation dataset

How Much Did It Rain I and II

MeteoNet, an open reference weather dataset by METEO FRANCE

  • Code and data: https://github.com/meteofrance/meteonet

  • Source data: AROME/ARPEGE forecasts, radar reflectivity and ground stations over France

  • Description: Multi-source dataset of forecasts and observations over France, spanning 3 years.

Neural Networks for Postprocessing Ensemble Weather Forecasts

  • Paper: Rasp and Lerch 2018

  • Code and data: https://github.com/slerch/ppnn

  • Source data: TIGGE forecasts and station observations over Germany

  • Description: Postprocessing of ensemble temperature forecasts against station observations over Germany. Covers 9 years of data at 500 stations. Predictors include temperature as well as a range of other variables.

RainBench: Towards Global Precipitation Forecasting from Satellite Imagery

SEVIR Dataset

  • Paper: NeurIPS

  • Code and data: http://sevir.mit.edu/

  • Source data: GOES-16 and NEXRAD over CONUS

  • Description: Preprocessed satellite and radar data over the continental US, served in patches. Supports a range of challenges with baselines (check the website for updates).

SVRIMG - SeVere Reflectivity IMaGe Dataset

  • Presentation: AMS

  • Code and data: https://svrimg.org/

  • Source data: GridRad (which in turn is sourced from NOAA NEXRAD Level II archives)

  • Description: Over 500,000 data-rich, geospatial radar reflectivity images centered on high-impact weather events. These images have consistent dimensions and intensity values on a grid with relatively low spatial distortion over the conterminous United States. Also includes crowd-sourced labeling.

TAASRAD19, a high-resolution weather radar reflectivity dataset for precipitation nowcasting

Understanding Clouds from Satellite Images

VALUE: A framework to validate downscaling approaches for climate change studies

WeatherBench: A Benchmark Data Set for Data‐Driven Weather Forecasting

Template

  • Paper:

  • Code and data:

  • Source data:

  • Description:

  • Papers using this dataset: