**Overall Experimental Workflow and Results**

1. Prepare a series of **training datasets** of PM2.5 for ML and geospatial purposes (PurpleAir PA-II sensors and Environmental Protection Agency (EPA) sensors).

2. Formalize a **list of methods** for uncertainty quantification and accuracy assessment.

3. Prepare the **computing environment** for a holistic study of simulation, retrieval, and prediction.

4. Formalize a **list of parameters** for tuning the models.

5. Analyze the results to **identify the best-performing configurations** for each model and package.

Below is open-source documentation for replicating each model across the five packages.

## RStudio

| Model | Script | Guide |
|---|---|---|
| Decision Tree Regressor (DT) | SCRIPT | GUIDE |
| Random Forest (RF) | SCRIPT | GUIDE |
| K-Nearest Neighbors (KNN) | SCRIPT | GUIDE |
| XGBoost (XGB) | SCRIPT | GUIDE |
| Support Vector Regression (SVM) | SCRIPT | GUIDE |
| Simple Neural Network (SNN) | SCRIPT | GUIDE |
| Deep Neural Network (DNN) | SCRIPT | GUIDE |
| Long Short-Term Memory (LSTM) | SCRIPT | GUIDE |
| Recurrent Neural Network (RNN) | SCRIPT | GUIDE |

## Sci-Kit Learn

| Model | Script | Guide |
|---|---|---|
| Decision Tree Regressor (DT) | SCRIPT | GUIDE |
| Random Forest (RF) | SCRIPT | GUIDE |
| K-Nearest Neighbors (KNN) | SCRIPT | GUIDE |
| Support Vector Regression (SVM) | SCRIPT | GUIDE |
| Simple Neural Network (SNN) | SCRIPT | GUIDE |
| Deep Neural Network (DNN) | SCRIPT | GUIDE |

## XGBoost

| Model | Script | Guide |
|---|---|---|
| Decision Tree Regressor (DT) | SCRIPT | GUIDE |
| Random Forest (RF) | SCRIPT | GUIDE |

## Pytorch

| Model | Script | Guide |
|---|---|---|
| Simple Neural Network (SNN) | SCRIPT | GUIDE |
| Deep Neural Network (DNN) | SCRIPT | GUIDE |
| Long Short-Term Memory (LSTM) | SCRIPT | GUIDE |

## Tensorflow

| Model | Script | Guide |
|---|---|---|
| Decision Tree Regressor (DT) | SCRIPT | GUIDE |
| Random Forest (RF) | SCRIPT | GUIDE |
| Simple Neural Network (SNN) | SCRIPT | GUIDE |
| Deep Neural Network (DNN) | SCRIPT | GUIDE |
| Long Short-Term Memory (LSTM) | SCRIPT | GUIDE |

## Systematic Study Preliminary Results

The popular training-data splits of 80/20 and 70/30 were examined, and the choice of split was found to have minimal impact across models and packages.

Across all models and packages, the mean difference between the two splits was 0.051 in RMSE and 0.00381 in R^{2}, corresponding to mean percent differences of 1.55% for RMSE and 0.745% for R^{2}.
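The split comparison above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the synthetic dataset, the feature list in the comment, and the RandomForestRegressor settings are all assumptions (the real workflows are in the SCRIPT links for each package).

```python
# Sketch of the 80/20 vs. 70/30 split comparison on synthetic data,
# standing in for the paired PurpleAir/EPA PM2.5 training set (assumption).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 4))  # e.g. raw sensor PM2.5, RH, temperature, dew point
y = X @ np.array([3.0, 1.5, -0.5, 0.2]) + rng.normal(scale=2.0, size=2000)

for test_size in (0.2, 0.3):  # 80/20 and 70/30 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    r2 = r2_score(y_te, pred)
    print(f"{1 - test_size:.0%}/{test_size:.0%}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

Repeating this loop per model and per package yields the paired RMSE/R^{2} columns reported in the tables below.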

This study uses the evaluation metrics Root Mean Square Error (RMSE) and Coefficient of Determination (R^{2}) to evaluate the fit of the PM2.5 calibration models against the EPA data used as the benchmark.

**RStudio Base Results**

| Model | RMSE (80/20) | R² (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R² (70/30) | Time Elapsed (70/30) |
|---|---|---|---|---|---|---|
| Decision Tree Regressor | 6.028 | 0.7018 | 00:02.7 | 6.11 | 0.7018 | 00:02:42 |
| Random Forest | 5.26 | 0.7728 | 00:02:23 | 5.36 | 0.7619 | 00:02:23 |
| K-Nearest Neighbors | 5.959 | 0.7128 | 00:00:09 | 6.0329 | 0.7032 | 00:00:09 |
| XGBoost | 5.17 | 0.7807 | 00:00:09 | 5.26 | 0.7702 | 00:00:08 |
| Support Vector Regression | 5.39 | 0.76407 | 20:56:00 | 5.49 | 0.752 | 20:56:00 |
| Simple Neural Network | 5.3569 | 0.7647 | 00:07:34 | 5.464 | 0.7528 | 00:10:50 |
| Deep Neural Network | 5.336 | 0.76468 | 00:06:32 | 5.426 | 0.7586 | 00:11:54 |
| Long Short-Term Memory (LSTM) | 4.2518 | 0.8578 | 01:34:48 | 4.2022 | 0.8554 | 01:30:36 |
| Recurrent Neural Network | 5.584 | 0.7543 | 00:47:52 | 5.382 | 0.7626 | 00:45:18 |
| Ordinary Least Squares Regression (OLS) | 5.724 | 0.7309 | 00:00:02 | 5.7 | 0.7306 | 00:00:02 |
| Lasso Regression | 5.704 | 0.7306 | 00:00:04 | 5.704 | 0.7306 | 00:00:02 |

**Scikit-Learn Base Results**

| Model | RMSE (80/20) | R² (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R² (70/30) | Time Elapsed (70/30) |
|---|---|---|---|---|---|---|
| Decision Tree Regressor | 5.6466 | 0.7358 | 00:07.9 | 5.6356 | 0.7389 | 00:07.0 |
| Random Forest | 5.3867 | 0.7628 | 00:03:57 | 5.4233 | 0.7553 | 00:03:30 |
| K-Nearest Neighbors | 5.992 | 0.7018 | 00:0.86 | 5.9668 | 0.7073 | 0:00:02 |
| Support Vector Regression | 5.6871 | 0.732 | 07:03:26 | 5.703 | 0.7326 | 06:19:41 |
| Simple Neural Network | 5.4854 | 0.7588 | 00:03:19 | 5.3855 | 0.7589 | 00:03:05 |
| Deep Neural Network | 5.6852 | 0.7467 | 01:15:31 | 5.5777 | 0.7426 | 01:12:21 |
| Ordinary Least Squares Regression (OLS) | 5.6638 | 0.7369 | 00:04.2 | 5.6707 | 0.7338 | 00:03.6 |
| Lasso Regression | 5.9398 | 0.7111 | 00:03.4 | 5.914 | 0.7139 | 00:03.3 |

**XGBoost Base Results**

| Model | RMSE (80/20) | R² (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R² (70/30) | Time Elapsed (70/30) |
|---|---|---|---|---|---|---|
| Decision Tree Regressor | 5.5834 | 0.7417 | 00:08.6 | 5.6125 | 0.7411 | 00:08.6 |
| Random Forest | 5.5833 | 0.7417 | 00:08.4 | 5.6176 | 0.7486 | 00:08.4 |
| K-Nearest Neighbors | 6.0018 | 0.7016 | 00:08.2 | 6 | 0.7048 | 00:08.2 |
| Support Vector Regression | 5.5834 | 0.7417 | 00:08.6 | 5.6125 | 0.7411 | 00:08.6 |

**PyTorch Base Results**

| Model | RMSE (80/20) | R² (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R² (70/30) | Time Elapsed (70/30) |
|---|---|---|---|---|---|---|
| Simple Neural Network | 5.46451 | 0.69883 | 08:45.0 | 5.7053 | 0.6793 | 06:34.0 |
| Deep Neural Network | 4.9802 | 0.742 | 12:52.0 | | | |
| Long Short-Term Memory (LSTM) | 5.15965 | 0.78209 | 05:29.0 | 5.2655 | 0.7821 | 00:03:36 |
| Recurrent Neural Network | 5.4578 | 0.7658 | 03:19.0 | 5.4639 | 0.7653 | 00:01:11 |
| Ordinary Least Squares Regression (OLS) | 5.7293 | 0.66898 | 00:34:21 | 5.7294 | 0.6673 | 00:34:21 |

**TensorFlow Base Results**

| Model | RMSE (80/20) | R² (80/20) | Time Elapsed (80/20) | RMSE (70/30) | R² (70/30) | Time Elapsed (70/30) |
|---|---|---|---|---|---|---|
| Decision Tree Regressor | 5.2659 | 0.775 | 00:01:06 | 5.3638 | 0.7565 | 00:00:48 |
| Random Forest | 5.2349 | 0.7756 | 00:00:47 | 5.3248 | 0.772 | 00:00:46 |
| Simple Neural Network | 5.4009 | 0.7593 | 00:55:50 | 5.438 | 0.7548 | 00:48:41 |
| Deep Neural Network | 5.2749 | 0.7631 | 00:55:59 | 5.4507 | 0.7565 | 00:19:13 |
| Long Short-Term Memory (LSTM) | 4.26 | 0.857 | 01:31:12 | 4.2328 | 0.8533 | 1:13:12 |
| Recurrent Neural Network | 5.459 | 0.7657 | 05:15:12 | 5.5208 | 0.7603 | 05:20:28 |
| Ordinary Least Squares Regression (OLS) | 5.5575 | 0.7399 | 00:43:11 | 5.6918 | 0.7323 | 00:37:45 |

The RMSE is calculated using the formula:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

Where y_{i} represents the actual PM2.5 values from the EPA data, ŷ_{i} denotes the predicted values from the model, and *n* is the number of spatiotemporal data points. This metric measures the average magnitude of the errors between the model’s predictions and the actual benchmark EPA data. A lower RMSE value indicates a model with higher accuracy, reflecting a closer fit to the benchmark.
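As a concrete illustration, the formula above translates directly into a few lines of NumPy. The input values here are made up for the example; they are not from the study's data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between benchmark (EPA) and predicted PM2.5."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Tiny worked example: each prediction is off by 1, so
# RMSE = sqrt((1 + 1 + 1) / 3) = 1.0
print(rmse([10.0, 12.0, 8.0], [11.0, 11.0, 9.0]))  # → 1.0
```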

The Coefficient of Determination, denoted as R^{2}, is given by:

$$R^{2} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

In this formula, RSS is the residual sum of squares (the squared differences between actual and predicted values) and TSS is the total sum of squares (the squared differences between actual values and their mean). R^{2} represents the proportion of variance in the observed EPA PM2.5 levels that is predictable from the models. An R^{2} value close to 1 suggests that the model has a high degree of explanatory power, aligning well with the variability observed in the EPA dataset.
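The RSS/TSS definition can likewise be checked with a small worked example; the numbers below are illustrative, not drawn from the study.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - RSS/TSS."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - rss / tss

# Worked example: mean(y_true) = 11, so TSS = 1+1+9+9 = 20;
# each residual is ±1, so RSS = 4, giving R^2 = 1 - 4/20 = 0.8
y_true = np.array([10.0, 12.0, 8.0, 14.0])
y_pred = np.array([11.0, 11.0, 9.0, 13.0])
print(r_squared(y_true, y_pred))  # → 0.8
```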

For a comprehensive understanding of the models' performance, both RMSE and R^{2} are obtained. RMSE provides a direct measure of prediction accuracy, while R^{2} offers insight into how well the model captures the overall variance in the EPA dataset. Together, these metrics are crucial for validating the effectiveness of the calibrated PM2.5 models in replicating the benchmark data.