**Overall Experimental Workflow and Results**

1. Prepare a series of **training datasets** of PM2.5 for ML and geospatial purposes (PurpleAir PA-II sensors and Environmental Protection Agency (EPA) sensors).

2. Formalize a **list of methods** for uncertainty quantification and accuracy assessment.

3. Prepare the **computing environment** for a holistic study of simulation, retrieval, and prediction.

4. Formalize a **list of parameters** for tuning the models.

5. Analyze the results to **identify the best-performing configurations** for each model and package.

Below is open-source documentation for replicating each model across the five packages.

## RStudio

| Model | Script | Guide |
|---|---|---|
| Decision Tree Regressor (DT) | SCRIPT | GUIDE |
| Random Forest (RF) | SCRIPT | GUIDE |
| K-Nearest Neighbors (KNN) | SCRIPT | GUIDE |
| XGBoost (XGB) | SCRIPT | GUIDE |
| Support Vector Regression (SVM) | SCRIPT | GUIDE |
| Simple Neural Network (SNN) | SCRIPT | GUIDE |
| Deep Neural Network (DNN) | SCRIPT | GUIDE |
| Long Short-Term Memory (LSTM) | SCRIPT | GUIDE |
| Recurrent Neural Network (RNN) | SCRIPT | GUIDE |

## Sci-Kit Learn

| Model | Script | Guide |
|---|---|---|
| Decision Tree Regressor (DT) | SCRIPT | GUIDE |
| Random Forest (RF) | SCRIPT | GUIDE |
| K-Nearest Neighbors (KNN) | SCRIPT | GUIDE |
| Support Vector Regression (SVM) | SCRIPT | GUIDE |
| Simple Neural Network (SNN) | SCRIPT | GUIDE |
| Deep Neural Network (DNN) | SCRIPT | GUIDE |

## XGBoost

| Model | Script | Guide |
|---|---|---|
| Decision Tree Regressor (DT) | SCRIPT | GUIDE |
| Random Forest (RF) | SCRIPT | GUIDE |

## Pytorch

| Model | Script | Guide |
|---|---|---|
| Simple Neural Network (SNN) | SCRIPT | GUIDE |
| Deep Neural Network (DNN) | SCRIPT | GUIDE |
| Long Short-Term Memory (LSTM) | SCRIPT | GUIDE |

## Tensorflow

| Model | Script | Guide |
|---|---|---|
| Decision Tree Regressor (DT) | SCRIPT | GUIDE |
| Random Forest (RF) | SCRIPT | GUIDE |
| Simple Neural Network (SNN) | SCRIPT | GUIDE |
| Deep Neural Network (DNN) | SCRIPT | GUIDE |
| Long Short-Term Memory (LSTM) | SCRIPT | GUIDE |

## Systematic Study Preliminary Results

The popular training-data splits of 80/20 and 70/30 were examined, and the choice of split was found to have minimal impact across models and packages.

Across all models and packages, the mean difference between the two splits was 0.051 in RMSE and 0.00381 in R^{2}, corresponding to mean percent differences of 1.55% for RMSE and 0.745% for R^{2}.
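The split comparison above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the synthetic dataset, the feature list in the comment, and the RandomForestRegressor settings are all assumptions (the real workflows are in the SCRIPT links for each package).

```python
# Sketch of the 80/20 vs. 70/30 split comparison on synthetic data,
# standing in for the paired PurpleAir/EPA PM2.5 training set (assumption).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 4))  # e.g. raw sensor PM2.5, RH, temperature, dew point
y = X @ np.array([3.0, 1.5, -0.5, 0.2]) + rng.normal(scale=2.0, size=2000)

for test_size in (0.2, 0.3):  # 80/20 and 70/30 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    r2 = r2_score(y_te, pred)
    print(f"{1 - test_size:.0%}/{test_size:.0%}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

Repeating this loop per model and per package yields the paired RMSE/R^{2} columns reported in the tables below.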

This study uses the evaluation metrics Root Mean Square Error (RMSE) and Coefficient of Determination (R^{2}) to evaluate the fit of the PM2.5 calibration models against the EPA data used as the benchmark.

**RStudio Base Results**

| Model | RMSE (80/20) | R² (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R² (70/30) | Time Elapsed (70/30) |
|---|---|---|---|---|---|---|
| Decision Tree Regressor | 6.028 | 0.7018 | 00:02.7 | 6.11 | 0.7018 | 00:02:42 |
| Random Forest | 5.26 | 0.7728 | 00:02:23 | 5.36 | 0.7619 | 00:02:23 |
| K-Nearest Neighbors | 5.959 | 0.7128 | 00:00:09 | 6.0329 | 0.7032 | 00:00:09 |
| XGBoost | 5.17 | 0.7807 | 00:00:09 | 5.26 | 0.7702 | 00:00:08 |
| Support Vector Regression | 5.39 | 0.76407 | 20:56:00 | 5.49 | 0.752 | 20:56:00 |
| Simple Neural Network | 5.3569 | 0.7647 | 00:07:34 | 5.464 | 0.7528 | 00:10:50 |
| Deep Neural Network | 5.336 | 0.76468 | 00:06:32 | 5.426 | 0.7586 | 00:11:54 |
| Long Short-Term Memory (LSTM) | 4.2518 | 0.8578 | 01:34:48 | 4.2022 | 0.8554 | 01:30:36 |
| Recurrent Neural Network | 5.584 | 0.7543 | 00:47:52 | 5.382 | 0.7626 | 00:45:18 |
| Ordinary Least Squares Regression (OLS) | 5.724 | 0.7309 | 00:00:02 | 5.7 | 0.7306 | 00:00:02 |
| Lasso Regression | 5.704 | 0.7306 | 00:00:04 | 5.704 | 0.7306 | 00:00:02 |

**Scikit-Learn Base Results**

| Model | RMSE (80/20) | R² (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R² (70/30) | Time Elapsed (70/30) |
|---|---|---|---|---|---|---|
| Decision Tree Regressor | 5.6466 | 0.7358 | 00:07.9 | 5.6356 | 0.7389 | 00:07.0 |
| Random Forest | 5.3867 | 0.7628 | 00:03:57 | 5.4233 | 0.7553 | 00:03:30 |
| K-Nearest Neighbors | 5.992 | 0.7018 | 00:0.86 | 5.9668 | 0.7073 | 0:00:02 |
| Support Vector Regression | 5.6871 | 0.732 | 07:03:26 | 5.703 | 0.7326 | 06:19:41 |
| Simple Neural Network | 5.4854 | 0.7588 | 00:03:19 | 5.3855 | 0.7589 | 00:03:05 |
| Deep Neural Network | 5.6852 | 0.7467 | 01:15:31 | 5.5777 | 0.7426 | 01:12:21 |
| Ordinary Least Squares Regression (OLS) | 5.6638 | 0.7369 | 00:04.2 | 5.6707 | 0.7338 | 00:03.6 |
| Lasso Regression | 5.9398 | 0.7111 | 00:03.4 | 5.914 | 0.7139 | 00:03.3 |

**XGBoost Base Results**

| Model | RMSE (80/20) | R² (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R² (70/30) | Time Elapsed (70/30) |
|---|---|---|---|---|---|---|
| Decision Tree Regressor | 5.5834 | 0.7417 | 00:08.6 | 5.6125 | 0.7411 | 00:08.6 |
| Random Forest | 5.5833 | 0.7417 | 00:08.4 | 5.6176 | 0.7486 | 00:08.4 |
| K-Nearest Neighbors | 6.0018 | 0.7016 | 00:08.2 | 6 | 0.7048 | 00:08.2 |
| Support Vector Regression | 5.5834 | 0.7417 | 00:08.6 | 5.6125 | 0.7411 | 00:08.6 |

**PyTorch Base Results**

| Model | RMSE (80/20) | R² (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R² (70/30) | Time Elapsed (70/30) |
|---|---|---|---|---|---|---|
| Simple Neural Network | 5.46451 | 0.69883 | 08:45.0 | 5.7053 | 0.6793 | 06:34.0 |
| Deep Neural Network | 4.9802 | 0.742 | 12:52.0 | | | |
| Long Short-Term Memory (LSTM) | 5.15965 | 0.78209 | 05:29.0 | 5.2655 | 0.7821 | 00:03:36 |
| Recurrent Neural Network | 5.4578 | 0.7658 | 03:19.0 | 5.4639 | 0.7653 | 00:01:11 |
| Ordinary Least Squares Regression (OLS) | 5.7293 | 0.66898 | 00:34:21 | 5.7294 | 0.6673 | 00:34:21 |

**TensorFlow Base Results**

| Model | RMSE (80/20) | R² (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R² (70/30) | Time Elapsed (70/30) |
|---|---|---|---|---|---|---|
| Decision Tree Regressor | 5.2659 | 0.775 | 00:01:06 | 5.3638 | 0.7565 | 00:00:48 |
| Random Forest | 5.2349 | 0.7756 | 00:00:47 | 5.3248 | 0.772 | 00:00:46 |
| Simple Neural Network | 5.4009 | 0.7593 | 00:55:50 | 5.438 | 0.7548 | 00:48:41 |
| Deep Neural Network | 5.2749 | 0.7631 | 00:55:59 | 5.4507 | 0.7565 | 00:19:13 |
| Long Short-Term Memory (LSTM) | 4.26 | 0.857 | 01:31:12 | 4.2328 | 0.8533 | 1:13:12 |
| Recurrent Neural Network | 5.459 | 0.7657 | 05:15:12 | 5.5208 | 0.7603 | 05:20:28 |
| Ordinary Least Squares Regression (OLS) | 5.5575 | 0.7399 | 00:43:11 | 5.6918 | 0.7323 | 00:37:45 |

The RMSE is calculated using the formula:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

Where y_{i} represents the actual PM2.5 values from the EPA data, ŷ_{i} denotes the predicted values from the model, and *n* is the number of spatiotemporal data points. This metric measures the average magnitude of the errors between the model’s predictions and the actual benchmark EPA data. A lower RMSE value indicates a model with higher accuracy, reflecting a closer fit to the benchmark.
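As a concrete illustration, the formula above translates directly into a few lines of NumPy. The input values here are made up for the example; they are not from the study's data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between benchmark (EPA) and predicted PM2.5."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Tiny worked example: each prediction is off by 1, so
# RMSE = sqrt((1 + 1 + 1) / 3) = 1.0
print(rmse([10.0, 12.0, 8.0], [11.0, 11.0, 9.0]))  # → 1.0
```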

The Coefficient of Determination, denoted as R^{2}, is given by:

$$R^{2} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

In this formula, RSS is the residual sum of squares (the squared differences between actual and predicted values) and TSS is the total sum of squares (the squared differences between actual values and their mean). R^{2} represents the proportion of variance in the observed EPA PM2.5 levels that is predictable from the models. An R^{2} value close to 1 suggests that the model has a high degree of explanatory power, aligning well with the variability observed in the EPA dataset.
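The RSS/TSS definition can likewise be checked with a small worked example; the numbers below are illustrative, not drawn from the study.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - RSS/TSS."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - rss / tss

# Worked example: mean(y_true) = 11, so TSS = 1+1+9+9 = 20;
# each residual is ±1, so RSS = 4, giving R^2 = 1 - 4/20 = 0.8
y_true = np.array([10.0, 12.0, 8.0, 14.0])
y_pred = np.array([11.0, 11.0, 9.0, 13.0])
print(r_squared(y_true, y_pred))  # → 0.8
```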

For a comprehensive understanding of the models' performance, both RMSE and R^{2} are obtained. RMSE provides a direct measure of prediction accuracy, while R^{2} offers insight into how well the model captures the overall variance in the EPA dataset. Together, these metrics are crucial for validating the effectiveness of the calibrated PM2.5 models in replicating the benchmark data.