Preprocessed Datasets

This is a list of weather and climate datasets preprocessed for AI research. It includes benchmarks, competitions, and ML papers with published data. Entries are listed in alphabetical order.

AI for Earth System Science Summer School Hackathon

Code and data: https://github.com/NCAR/ai4ess-hackathon-2020
Source: NCAR, Lawrence Berkeley Lab, and NOAA
Description: Five challenge problems related to prediction and emulation. The GOES challenge focuses on predicting lightning from GOES-16 satellite imagery. The GECKO-A challenge focuses on emulating the GECKO-A chemistry model from a large set of model time series. The Microphysics challenge focuses on emulating the TAU bin microphysics scheme. The HOLODEC challenge focuses on estimating raindrop distribution properties from synthetic holographic diffraction patterns. The ENSO challenge focuses on predicting ENSO from gridded model output.

AMS Solar Energy Prediction Contest

Code and data: https://www.kaggle.com/c/ams-2014-solar-energy-prediction-contest
Source data: GEFS forecasts and Mesonet solar observations
Description: Predict total daily solar irradiance from GEFS forecasts and Oklahoma Mesonet data.

AQ-Bench: A Benchmark Dataset for Machine Learning on Global Air Quality Metrics

Paper: Betancourt, C., Stomberg, T., Roscher, R., Schultz, M. G., and Stadtler, S.: AQ-Bench: a benchmark dataset for machine learning on global air quality metrics, Earth Syst. Sci. Data, 13, 3013–3033, https://doi.org/10.5194/essd-13-3013-2021, 2021.
Code and data: https://gitlab.version.fz-juelich.de/toar/ozone-mapping and https://doi.org/10.23728/b2share.30d42b5a87344e82855a486bf2123e9f
Source data: Database of the Tropospheric Ozone Assessment Report (TOAR)
Description: Aggregated air quality data from the years 2010–2014 and metadata at more than 5500 air quality monitoring stations all over the world. A well-defined task, a suitable evaluation metric and baseline scores are provided.

CAMELS: Catchment Attributes and Meteorology for Large-Sample Studies

Data: https://ral.ucar.edu/solutions/products/camels
Paper: https://ncar.github.io/hydrology/datasets/CAMELS_timeseries
Source data: Weather models (Daymet, NLDAS, Maurer), streamflow observations (USGS), catchment attributes (USGS, MODIS, Daymet, STATSGO, Global Lithological Map (GLiM), GLobal Hydrogeology Maps (GLHYMPS))
Description: Weather drivers, streamflow observations, and catchment attributes for 671 catchments across the continental US.
Papers using this dataset: https://doi.org/10.5194/hess-22-6005-2018, https://doi.org/10.1029/2019WR026793

ClimateNet: an expert-labelled open dataset and Deep Learning architecture for enabling high-precision analyses of extreme weather

Paper: https://gmd.copernicus.org/preprints/gmd-2020-72/
Code and data: https://portal.nersc.gov/project/ClimateNet/
Source data: Climate Model simulations and expert labels
Description: Detect atmospheric rivers and tropical cyclones in climate model simulations. Includes a labeling tool along with a dataset of expert-labelled data.

CloudCast: A large-scale dataset and baseline for forecasting clouds

Paper: http://doi.org/10.1109/JSTARS.2021.3062936
Code and data: https://vision.eng.au.dk/cloudcast-dataset/
Source data: Meteosat-11 with cloud types annotated on a pixel-level
Description: The CloudCast dataset contains 70080 images with 10 different cloud types for multiple layers of the atmosphere annotated on a pixel level. The dataset has a spatial size of 928 x 1530 pixels recorded with 15-min intervals for the period 2017-2018, with a 3.0 km resolution.
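
The image count above is consistent with one image every 15 minutes over two full calendar years. A minimal sketch of that arithmetic, assuming the period covers all of 2017 and 2018 with no missing time steps (the exact start and end timestamps are an assumption, not taken from the dataset documentation):

```python
from datetime import datetime, timedelta

# Assumption: full coverage of 2017 and 2018, no gaps.
start = datetime(2017, 1, 1)
end = datetime(2019, 1, 1)        # exclusive upper bound
step = timedelta(minutes=15)      # sampling interval from the description

# Dividing two timedeltas with // gives an integer number of steps.
n_images = (end - start) // step
print(n_images)  # 70080
```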

CUMULO: a benchmark dataset for training and evaluating global cloud classification models

Paper: https://arxiv.org/abs/1911.04227
Code and data: https://github.com/FrontierDevelopmentLab/CUMULO
Source data: Moderate Resolution Imaging Spectroradiometer (MODIS) from Aqua satellite and 2B-CLDCLASS-LIDAR
Description: The dataset provides global 1 km resolution MODIS imagery aligned with cloud properties measured by the CloudSat products. It contains three years of 1354 x 2030 pixel hyperspectral images combined with pixel-width ‘tracks’ of cloud labels, corresponding to the eight World Meteorological Organization genera.

Deepti: Deep-Learning-Based Tropical Cyclone Intensity Estimation System (+ Competition)

Paper: http://doi.org/10.1109/JSTARS.2020.3011907
Competition: https://www.drivendata.org/competitions/72/predict-wind-speeds/page/274/
Code and data: http://registry.mlhub.earth/10.34911/rdnt.xs53up/
Source data: GOES
Description: A collection of tropical storms in the Atlantic and East Pacific Oceans from 2000 to 2019 with corresponding maximum sustained surface wind speed. The dataset is split into training and test sets for the purpose of a competition. The training set consists of 70,257 images and the test set consists of 44,377 images, each 366 x 366 pixels.
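
For a quick sense of the split, the counts above can be turned into a total and a training fraction. This is only a sketch of the arithmetic; the variable names are illustrative and not part of any dataset API:

```python
# Counts and image size taken from the description above.
n_train = 70_257
n_test = 44_377
image_side_px = 366

n_total = n_train + n_test
train_fraction = n_train / n_total
pixels_per_image = image_side_px * image_side_px

print(f"total images: {n_total}")
print(f"training fraction: {train_fraction:.3f}")
print(f"pixels per image: {pixels_per_image}")
```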

EarthNet2021: A novel large-scale dataset and challenge for forecasting localized climate impacts

Paper: https://arxiv.org/abs/2012.06246
Code and data: https://www.earthnet.tech/
Source data: Sentinel 2
Description: Curated dataset containing target spatio-temporal Sentinel 2 satellite imagery at 20 m resolution, matched with high-resolution topography and mesoscale (1.28 km) weather variables. With over 32000 samples it is suitable for training deep neural networks.

The ExtremeWeather Dataset

Paper: https://arxiv.org/abs/1612.02095
Code and data: https://github.com/eracah/hur-detect, https://extremeweatherdataset.github.io/
Source data: CAM5
Description: Consists of 768 × 1152 images of the global atmospheric state with a spatial resolution of 25 km, separated by 6-hour intervals from 1979 to 2005. There are 16 image channels that correspond to different variables such as surface pressure, surface temperature and humidity at the reference altitude. In addition, there are bounding boxes and class labels for 4 types of extreme weather events: Tropical Depressions, Tropical Cyclones, Extratropical Cyclones and Atmospheric Rivers.
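
To gauge what these dimensions mean in practice, the following sketch derives the grid spacing, the number of snapshots per day, and the memory footprint of a single snapshot. It assumes a regular global latitude-longitude grid and 4-byte (float32) values; both are assumptions for illustration, not statements about how the files are stored:

```python
# Dimensions taken from the description above.
n_lat, n_lon, n_channels = 768, 1152, 16
hours_between_snapshots = 6

# Assumption: regular grid spanning 180 degrees of latitude and 360 of longitude.
dlat_deg = 180.0 / n_lat
dlon_deg = 360.0 / n_lon

# Assumption: each value is stored as a 4-byte float.
bytes_per_value = 4
bytes_per_snapshot = n_lat * n_lon * n_channels * bytes_per_value

snapshots_per_day = 24 // hours_between_snapshots

print(f"grid spacing: {dlat_deg:.4f} deg lat x {dlon_deg:.4f} deg lon")
print(f"snapshot size: {bytes_per_snapshot / 2**20:.1f} MiB")
print(f"snapshots per day: {snapshots_per_day}")
```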

FlowDB: A new large scale river flow, flash flood, and precipitation dataset

Paper: https://arxiv.org/abs/2012.11154
Code and data: https://flow-forecast.atlassian.net/wiki/spaces/FF/pages/33456135/FlowDB+Dataset (Not public)
Source data: USGS, SNOTEL, NOAA, ASOS, EcoNet
Description: An hourly river flow and precipitation dataset and a second subset of flash flood events with damage estimates and injury counts. Created for general stream flow forecasting and flash flood damage estimation.

How Much Did It Rain I and II

Code and data: https://www.kaggle.com/c/how-much-did-it-rain and https://www.kaggle.com/c/how-much-did-it-rain-ii
Source data: US Radar and rain gauges
Description: Estimate the rainfall probability distribution from dual-polarization radar data.

MeteoNet, an open reference weather dataset by METEO FRANCE

Code and data: https://github.com/meteofrance/meteonet
Source data: AROME/ARPEGE forecasts, radar reflectivity and ground stations over France
Description: Multi-source dataset of forecasts and observations over France spanning 3 years.

Neural Networks for Postprocessing Ensemble Weather Forecasts

Paper: Rasp and Lerch 2018
Code and data: https://github.com/slerch/ppnn
Source data: TIGGE forecasts and station observations over Germany
Description: Postprocessing of ensemble temperature forecasts at stations in Germany. 9 years of data at 500 stations. Predictors include temperature as well as a range of other variables.

RainBench: Towards Global Precipitation Forecasting from Satellite Imagery

Paper: https://arxiv.org/abs/2012.09670
Code and data: https://github.com/frontierdevelopmentlab/pyrain
Source data: IMERG, ERA5 and SimSat
Description: Multi-modal benchmark dataset for data-driven precipitation forecasting at 3 different spatial resolutions: 0.1° (IMERG and SimSat) and 0.5° (ERA5). Presented along with an efficient data-loading pipeline, PyRain.

SEVIR Dataset

Paper: NeurIPS
Code and data: http://sevir.mit.edu/
Source data: GOES-16 and NEXRAD over CONUS
Description: Preprocessed satellite and radar data over the continental US, served in patches. Intended for a range of challenges with baselines (check the website for updates).

SVRIMG - SeVere Reflectivity IMaGe Dataset

Presentation: AMS
Code and data: https://svrimg.org/
Source data: GridRad (which in turn is sourced from NOAA NEXRAD Level II archives)
Description: Over 500,000 data-rich, geospatial radar reflectivity images centered on high-impact weather events. The images have consistent dimensions and intensity values on a grid with relatively low spatial distortion over the conterminous United States. Also includes crowd-sourced labeling.

TAASRAD19, a high-resolution weather radar reflectivity dataset for precipitation nowcasting

Paper: https://doi.org/10.1038/s41597-020-0574-8
Code and data: https://github.com/MPBA/TAASRAD19
Source data: Official public meteorological agency of the civil protection department of the Autonomous Province of Trento (Italy)
Description: Benchmark dataset for radar nowcasting with deep learning. The dataset contains 1,732 radar sequences labeled with precipitation type spanning from 2010 to 2019, for a total of 362,233 radar images. Image size is 480 x 480 at 500 m resolution (UTM grid) covering a complex orographic area in the Italian Alps.
Papers using this dataset: https://doi.org/10.3390/atmos11030267, https://doi.org/10.3390/rs11242922
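
The figures in the description imply a domain extent and an average sequence length. A short sketch of that arithmetic follows; the variable names are illustrative only, and the average says nothing about how sequence lengths are actually distributed:

```python
# Numbers taken from the description above.
image_side_px = 480
pixel_size_m = 500
n_images = 362_233
n_sequences = 1_732

# Side length of the square domain covered by one image, in kilometres.
domain_side_km = image_side_px * pixel_size_m / 1000

# Mean number of images per sequence (individual sequences may differ a lot).
mean_images_per_sequence = n_images / n_sequences

print(f"domain: {domain_side_km:.0f} km x {domain_side_km:.0f} km")
print(f"mean images per sequence: {mean_images_per_sequence:.1f}")
```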

Understanding Clouds from Satellite Images

Paper: BAMS
Code and data: https://www.kaggle.com/c/understanding_cloud_organization
Source data: TERRA and AQUA MODIS visible images
Description: Cloud classification challenge covering 4 human-designed patterns of shallow cloud organization (Sugar, Flower, Fish and Gravel), with 30,000 human labels.

VALUE: A framework to validate downscaling approaches for climate change studies

Paper: https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2014EF000259
Code and data: http://www.value-cost.eu/
Source data: Station observations
Description: Framework for evaluating climate model downscaling methods. Validation observations are provided.
Papers using this dataset: Many

WeatherBench: A Benchmark Data Set for Data‐Driven Weather Forecasting

Paper: https://doi.org/10.1029/2020MS002203
Code and data: https://github.com/pangeo-data/WeatherBench
Source data: ERA5 and TIGGE for baselines
Description: Benchmark dataset for medium-range (3- and 5-day) forecasting of global pressure, temperature and precipitation, with preprocessed data (40 years), evaluation code and baselines.
Papers using this dataset: https://arxiv.org/abs/2003.11927, https://arxiv.org/abs/2008.08626
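
Evaluation on a global latitude-longitude grid usually has to account for grid cells shrinking toward the poles. The sketch below shows one common way to do this, a latitude-weighted RMSE for a single time step. The choice of metric and the function name `lat_weighted_rmse` are assumptions for illustration and are not taken from this list; consult the benchmark's own code for its actual evaluation.

```python
import math

def lat_weighted_rmse(forecast, truth, lat_deg):
    """RMSE over one [lat][lon] field, weighting each row by cos(latitude).

    forecast, truth: nested lists indexed as [lat][lon].
    lat_deg: latitude of each row in degrees.
    """
    cos_lat = [math.cos(math.radians(lat)) for lat in lat_deg]
    mean_cos = sum(cos_lat) / len(cos_lat)
    # Normalize so that the weights average to 1 across latitude rows.
    weights = [c / mean_cos for c in cos_lat]

    total, count = 0.0, 0
    for w, f_row, t_row in zip(weights, forecast, truth):
        for f, t in zip(f_row, t_row):
            total += w * (f - t) ** 2
            count += 1
    return math.sqrt(total / count)

# Tiny synthetic example: a constant error of 1 at every grid point.
lat = [-60.0, 0.0, 60.0]
truth = [[0.0] * 4 for _ in lat]
forecast = [[1.0] * 4 for _ in lat]
score = lat_weighted_rmse(forecast, truth, lat)
print(score)
```

Because the weights average to 1, a uniform error gives the same score as an unweighted RMSE; the weighting only matters when errors vary with latitude.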

s2s-ai-challenge

Paper: BAMS
Code and data: https://renkulab.io/projects/aaron.spring/s2s-ai-challenge-template/
Source data: S2S access via climetlab-s2s-ai-challenge
Description: https://s2s-ai-challenge.github.io/
Papers using this dataset:

Template

Paper:
Code and data:
Source data:
Description:
Papers using this dataset: