Project Overview

Overall Experimental Workflow and Results

Prepare a series of PM2.5 training datasets for ML and geospatial purposes (PurpleAir PA-II sensors and Environmental Protection Agency (EPA) sensors)

Formalize a list of methods for uncertainty quantification and accuracy assessments

Prepare a computing environment for a holistic study of simulation, retrieval, and prediction

Formalize a list of parameters for tuning models

Analyze the results to identify the best-performing configurations for each model and package (a minimal tuning sketch follows this list)
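For illustration only, here is a minimal sketch of the tuning step, assuming a scikit-learn random forest calibration model. The file name `pa_epa_training.csv`, the predictor columns, and the grid values are hypothetical placeholders, not the project's actual settings.

```python
# Sketch of the model-tuning step: grid search over a hypothetical parameter
# list for a random forest PM2.5 calibration model.
# File name, columns, and grid values are illustrative placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

data = pd.read_csv("pa_epa_training.csv")           # hypothetical merged PA-II / EPA dataset
X = data[["pa_pm25", "temperature", "humidity"]]    # illustrative predictors
y = data["epa_pm25"]                                # EPA benchmark PM2.5

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print("Best configuration:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)
```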

Below is open-source documentation for replicating each model across the five packages. A brief illustrative Python sketch follows each of the Python-based package lists.

RStudio

Decision Tree Regressor (DT) | SCRIPT | GUIDE

Random Forest (RF) | SCRIPT | GUIDE

K Nearest Neighbors (KNN) | SCRIPT | GUIDE

XGBoost (XGB) | SCRIPT | GUIDE

Support Vector Regression (SVM) | SCRIPT | GUIDE

Simple Neural Network (SNN) | SCRIPT | GUIDE

Deep Neural Network (DNN) | SCRIPT | GUIDE

Long Short Term Memory (LSTM) | SCRIPT | GUIDE

Recurrent Neural Network (RNN) | SCRIPT | GUIDE

Ordinary Least Square Regression (OLS) | SCRIPT | GUIDE

Lasso Regression (LR) | SCRIPT | GUIDE

Scikit-Learn

Decision Tree Regressor (DT) | SCRIPT | GUIDE

Random Forest (RF) | SCRIPT | GUIDE

K Nearest Neighbors (KNN) | SCRIPT | GUIDE

Support Vector Regression (SVM) | SCRIPT | GUIDE

Simple Neural Network (SNN) | SCRIPT | GUIDE

Deep Neural Network (DNN) | SCRIPT | GUIDE

Ordinary Least Square Regression (OLS) | SCRIPT | GUIDE

Lasso Regression (LR) | SCRIPT | GUIDE
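As a hedged sketch of how one of the Scikit-Learn models above (here, the Decision Tree Regressor) might be trained and scored on the calibration data. The file name, predictor columns, and hyperparameters are illustrative assumptions, not the project's exact setup; the linked SCRIPT and GUIDE remain the authoritative references.

```python
# Sketch: scikit-learn Decision Tree Regressor calibrating PA-II PM2.5 against EPA PM2.5.
# File name, columns, and settings are placeholders.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("pa_epa_training.csv")            # hypothetical merged PA-II / EPA dataset
X = data[["pa_pm25", "temperature", "humidity"]]     # illustrative predictors
y = data["epa_pm25"]                                 # EPA benchmark PM2.5

# 80/20 split, matching the base results reported below.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor(max_depth=10, random_state=42).fit(X_train, y_train)
pred = model.predict(X_test)
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("R2:", r2_score(y_test, pred))
```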

XGBoost

Decision Tree Regressor (DT) | SCRIPT | GUIDE

Random Forest (RF) | SCRIPT | GUIDE

K Nearest Neighbors (KNN) | SCRIPT | GUIDE

XGBoost (XGB) | SCRIPT | GUIDE
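A minimal sketch of the XGBoost (XGB) model from this list, using the Python xgboost API. The data file, predictor columns, and hyperparameters are assumed placeholders.

```python
# Sketch: XGBoost regressor for PM2.5 calibration (placeholder data and settings).
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("pa_epa_training.csv")            # hypothetical merged PA-II / EPA dataset
X = data[["pa_pm25", "temperature", "humidity"]]     # illustrative predictors
y = data["epa_pm25"]                                 # EPA benchmark PM2.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5, "R2:", r2_score(y_test, pred))
```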

PyTorch

Simple Neural Network (SNN) | SCRIPT | GUIDE

Deep Neural Network (DNN) | SCRIPT | GUIDE

Long Short Term Memory (LSTM) | SCRIPT | GUIDE

Recurrent Neural Network (RNN) | SCRIPT | GUIDE

Ordinary Least Square Regression (OLS) | SCRIPT | GUIDE
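A minimal sketch of the Simple Neural Network (SNN) from this list in PyTorch. The architecture, random placeholder data, and training loop are assumptions for illustration only.

```python
# Sketch: a one-hidden-layer neural network (SNN) in PyTorch for PM2.5 calibration.
# Architecture, placeholder data, and training settings are illustrative.
import torch
import torch.nn as nn

# x: (n_samples, 3) predictors; y: (n_samples, 1) EPA PM2.5 targets (placeholder tensors)
x = torch.randn(1000, 3)
y = torch.randn(1000, 1)

model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

rmse = torch.sqrt(loss_fn(model(x), y)).item()   # RMSE is the square root of the MSE
print("Training RMSE:", rmse)
```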

TensorFlow

Decision Tree Regressor (DT) | SCRIPT | GUIDE

Random Forest (RF) | SCRIPT | GUIDE

Simple Neural Network (SNN) | SCRIPT | GUIDE

Deep Neural Network (DNN) | SCRIPT | GUIDE

Long Short Term Memory (LSTM) | SCRIPT | GUIDE

Recurrent Neural Network (RNN) | SCRIPT | GUIDE

Ordinary Least Square Regression (OLS) | SCRIPT | GUIDE
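A minimal sketch of the Deep Neural Network (DNN) from this list in TensorFlow/Keras. Layer sizes, placeholder data, and the number of epochs are illustrative assumptions.

```python
# Sketch: a deep neural network (DNN) in TensorFlow/Keras for PM2.5 calibration.
# Layer sizes, placeholder data, and epochs are illustrative.
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 3).astype("float32")   # placeholder predictors
y = np.random.rand(1000, 1).astype("float32")   # placeholder EPA PM2.5 targets

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam",
              loss="mse",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])
model.fit(x, y, epochs=20, validation_split=0.2, verbose=0)
print(model.evaluate(x, y, verbose=0))   # [MSE, RMSE]
```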

Systematic Study Preliminary Results

The popular training data splits of 80/20 and 70/30 were both examined, and the choice of split was found to have minimal impact across models and packages.

Across all models and packages, the mean difference between the two splits was 0.051 in RMSE and 0.00381 in R2, corresponding to mean percent differences of 1.55% for RMSE and 0.745% for R2.
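For reference, a minimal sketch of how such split-to-split differences could be computed from per-model metrics. Only two example models from the RStudio results are shown, and the symmetric percent-difference definition used here is an assumption; the same computation applies to R2.

```python
# Sketch: mean absolute difference and mean percent difference in RMSE
# between the 80/20 and 70/30 splits (two RStudio models as examples).
rmse_80_20 = {"Decision Tree": 6.028, "Random Forest": 5.26}
rmse_70_30 = {"Decision Tree": 6.11,  "Random Forest": 5.36}

diffs = [abs(rmse_80_20[m] - rmse_70_30[m]) for m in rmse_80_20]
pct_diffs = [abs(rmse_80_20[m] - rmse_70_30[m])
             / ((rmse_80_20[m] + rmse_70_30[m]) / 2) * 100
             for m in rmse_80_20]

print("Mean difference:", sum(diffs) / len(diffs))
print("Mean percent difference (%):", sum(pct_diffs) / len(pct_diffs))
```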

This study uses the Root Mean Square Error (RMSE) and the Coefficient of Determination (R2) to evaluate the fit of the PM2.5 calibration models against the EPA data used as the benchmark.

RSTUDIO BASE RESULTS

| Model | RMSE (80/20) | R2 (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R2 (70/30) | Time Elapsed (70/30) |
| --- | --- | --- | --- | --- | --- | --- |
| Decision Tree Regressor | 6.028 | 0.7018 | 00:02.7 | 6.11 | 0.7018 | 00:02:42 |
| Random Forest | 5.26 | 0.7728 | 00:02:23 | 5.36 | 0.7619 | 00:02:23 |
| K-Nearest Neighbors | 5.959 | 0.7128 | 00:00:09 | 6.0329 | 0.7032 | 00:00:09 |
| XGBoost | 5.17 | 0.7807 | 00:00:09 | 5.26 | 0.7702 | 00:00:08 |
| Support Vector Regression | 5.39 | 0.76407 | 20:56:00 | 5.49 | 0.752 | 20:56:00 |
| Simple Neural Network | 5.3569 | 0.7647 | 00:07:34 | 5.464 | 0.7528 | 00:10:50 |
| Deep Neural Network | 5.336 | 0.76468 | 00:06:32 | 5.426 | 0.7586 | 00:11:54 |
| Long Short-Term Memory neural network (LSTM) | 4.2518 | 0.8578 | 01:34:48 | 4.2022 | 0.8554 | 01:30:36 |
| Recurrent Neural Networks | 5.584 | 0.7543 | 00:47:52 | 5.382 | 0.7626 | 00:45:18 |
| Ordinary Least Square Regression (OLS) | 5.724 | 0.7309 | 00:00:02 | 5.7 | 0.7306 | 00:00:02 |
| Lasso Regression | 5.704 | 0.7306 | 00:00:04 | 5.704 | 0.7306 | 00:00:02 |
SCIKIT BASE RESULTS

| Model | RMSE (80/20) | R2 (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R2 (70/30) | Time Elapsed (70/30) |
| --- | --- | --- | --- | --- | --- | --- |
| Decision Tree Regressor | 5.6466 | 0.7358 | 00:07.9 | 5.6356 | 0.7389 | 00:07.0 |
| Random Forest | 5.3867 | 0.7628 | 00:03:57 | 5.4233 | 0.7553 | 00:03:30 |
| K-Nearest Neighbors | 5.992 | 0.7018 | 00:0.86 | 5.9668 | 0.7073 | 00:00:02 |
| Support Vector Regression | 5.6871 | 0.732 | 07:03:26 | 5.703 | 0.7326 | 06:19:41 |
| Simple Neural Network | 5.4854 | 0.7588 | 00:03:19 | 5.3855 | 0.7589 | 00:03:05 |
| Deep Neural Network | 5.6852 | 0.7467 | 01:15:31 | 5.5777 | 0.7426 | 01:12:21 |
| Ordinary Least Square Regression (OLS) | 5.6638 | 0.7369 | 00:04.2 | 5.6707 | 0.7338 | 00:03.6 |
| Lasso Regression | 5.9398 | 0.7111 | 00:03.4 | 5.914 | 0.7139 | 00:03.3 |
XGBOOST BASE RESULTS

| Model | RMSE (80/20) | R2 (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R2 (70/30) | Time Elapsed (70/30) |
| --- | --- | --- | --- | --- | --- | --- |
| Decision Tree Regressor | 5.5834 | 0.7417 | 00:08.6 | 5.6125 | 0.7411 | 00:08.6 |
| Random Forest | 5.5833 | 0.7417 | 00:08.4 | 5.6176 | 0.7486 | 00:08.4 |
| K-Nearest Neighbors | 6.0018 | 0.7016 | 00:08.2 | 6 | 0.7048 | 00:08.2 |
| Support Vector Regression | 5.5834 | 0.7417 | 00:08.6 | 5.6125 | 0.7411 | 00:08.6 |
PYTORCH BASE RESULTS

| Model | RMSE (80/20) | R2 (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R2 (70/30) | Time Elapsed (70/30) |
| --- | --- | --- | --- | --- | --- | --- |
| Simple Neural Network | 5.46451 | 0.69883 | 08:45.0 | 5.7053 | 0.6793 | 06:34.0 |
| Deep Neural Network | 4.9802 | 0.742 | 12:52.0 | | | |
| Long Short-Term Memory neural network (LSTM) | 5.15965 | 0.78209 | 05:29.0 | 5.2655 | 0.7821 | 00:03:36 |
| Recurrent Neural Networks | 5.4578 | 0.7658 | 03:19.0 | 5.4639 | 0.7653 | 00:01:11 |
| Ordinary Least Square Regression (OLS) | 5.7293 | 0.66898 | 00:34:21 | 5.7294 | 0.6673 | 00:34:21 |
TENSORFLOW BASE RESULTS

| Model | RMSE (80/20) | R2 (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R2 (70/30) | Time Elapsed (70/30) |
| --- | --- | --- | --- | --- | --- | --- |
| Decision Tree Regressor | 5.2659 | 0.775 | 00:01:06 | 5.3638 | 0.7565 | 00:00:48 |
| Random Forest | 5.2349 | 0.7756 | 00:00:47 | 5.3248 | 0.772 | 00:00:46 |
| Simple Neural Network | 5.4009 | 0.7593 | 00:55:50 | 5.438 | 0.7548 | 00:48:41 |
| Deep Neural Network | 5.2749 | 0.7631 | 00:55:59 | 5.4507 | 0.7565 | 00:19:13 |
| Long Short-Term Memory neural network (LSTM) | 4.26 | 0.857 | 01:31:12 | 4.2328 | 0.8533 | 01:13:12 |
| Recurrent Neural Networks | 5.459 | 0.7657 | 05:15:12 | 5.5208 | 0.7603 | 05:20:28 |
| Ordinary Least Square Regression (OLS) | 5.5575 | 0.7399 | 00:43:11 | 5.6918 | 0.7323 | 00:37:45 |

The RMSE is calculated using the formula:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

Here, y_i represents the actual PM2.5 values from the EPA data, ŷ_i denotes the predicted values from the model, and n is the number of spatiotemporal data points. This metric measures the average magnitude of the errors between the model's predictions and the actual benchmark EPA data. A lower RMSE value indicates a model with higher accuracy, reflecting a closer fit to the benchmark.
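A minimal worked example of this calculation in Python, using placeholder values:

```python
# Sketch: RMSE between EPA benchmark values (y) and model predictions (y_hat).
import numpy as np

y = np.array([12.0, 8.5, 30.2, 15.1])       # actual EPA PM2.5 (placeholder values)
y_hat = np.array([11.2, 9.0, 28.7, 16.0])   # model predictions (placeholder values)

rmse = np.sqrt(np.mean((y - y_hat) ** 2))    # square root of the mean squared error
print(rmse)
```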

The Coefficient of Determination, denoted as R2, is given by:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

In this formula, RSS is the residual sum of squares (the squared differences between actual and predicted values) and TSS is the total sum of squares (the squared differences between actual values and their mean). R2 represents the proportion of variance in the observed EPA PM2.5 levels that is predictable from the models. An R2 value close to 1 suggests that the model has a high degree of explanatory power, aligning well with the variability observed in the EPA dataset.
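The same placeholder values can be used to compute R2 from RSS and TSS:

```python
# Sketch: R2 computed from the residual sum of squares (RSS) and total sum of squares (TSS).
import numpy as np

y = np.array([12.0, 8.5, 30.2, 15.1])       # actual EPA PM2.5 (placeholder values)
y_hat = np.array([11.2, 9.0, 28.7, 16.0])   # model predictions (placeholder values)

rss = np.sum((y - y_hat) ** 2)               # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)            # total sum of squares
r2 = 1 - rss / tss
print(r2)
```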

For a comprehensive understanding of model performance, both RMSE and R2 are reported. RMSE provides a direct measure of prediction accuracy, while R2 offers insight into how well the model captures the overall variance in the EPA dataset. Together, these metrics are crucial for validating the effectiveness of the calibrated PM2.5 model in replicating the benchmark data.