Abstract
Accurate air pollution monitoring is critical for understanding how human activities affect air quality and for mitigating impacts on public health and ecological systems. Although high-accuracy regulatory sensors are deployed to monitor air pollutants, they are far outnumbered by low-cost, low-accuracy sensors. Calibrating low-cost sensors improves measurement accuracy and helps fill geographic gaps in sensor coverage with acceptable data quality. In addition to physics-based instrument calibration, this research examines how different ML models and open-source packages can improve accuracy, using large particulate matter (PM2.5) datasets collected by the popular low-cost PurpleAir sensors and authoritative sensors maintained by the Environmental Protection Agency (EPA). Eleven ML models are tested across five packages: XGBoost, Scikit-learn, TensorFlow, PyTorch, and RStudio. The 11 models are: Decision Tree Regressor (DTR), Random Forest (RF), K-Nearest Neighbors (KNN), XGBRegressor, Support Vector Regression (SVR), Simple Neural Network (SNN), Deep Neural Network (DNN), Long Short-Term Memory neural network (LSTM), Recurrent Neural Network (RNN), Ordinary Least Squares (OLS) regression, and Least Absolute Shrinkage and Selection Operator (Lasso) regression. The TensorFlow- and RStudio-based LSTMs performed best, with an R² of 0.857 and an RMSE of 4.26. Both the model and the package affect calibration accuracy, whereas the choice of training/testing split (80/20 vs. 70/30) had minimal impact on model performance. In the package comparison, RStudio and TensorFlow excelled with LSTM models, showing high R² scores of 0.8578 and 0.857 and low RMSEs of 4.2518 and 4.26, respectively, indicating their strong capability to process high-volume data. The choice of package affects model performance, with OLS regression showing the largest difference. Computational demands may make LSTM too slow or expensive (xxx hours) for applications with fast response requirements. Our results also suggest a practical alternative in tree-boosted models such as XGBoost in RStudio and RF in TensorFlow, which also exhibited high performance (R² of xxx, RMSE of xxx) with shorter training times (xxx seconds), making them suitable for applications with limited computational resources or a need for quick model training. These findings suggest that AI/ML models, particularly LSTM, can effectively calibrate low-cost sensors, potentially enhancing large-scale air quality monitoring and public health risk assessments. Furthermore, by improving the accuracy of low-cost sensors, this study supports broader environmental health initiatives and informs policy decisions with precise, localized air quality data.
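To make the calibration setup concrete, below is a minimal sketch, assuming hourly collocated PurpleAir and EPA PM2.5 records in a single CSV with hypothetical column names (pa_pm25, temp, rh, epa_pm25), of training a small TensorFlow LSTM on an 80/20 split and reporting R² and RMSE. It is illustrative only, not the study's exact pipeline.

```python
# Minimal sketch (not the study's exact pipeline): calibrating PurpleAir PM2.5
# against collocated EPA measurements with a small Keras LSTM.
# The file name, column names, and 24-step window are illustrative assumptions.
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("collocated_pm25.csv")          # hypothetical merged PurpleAir/EPA file
features = df[["pa_pm25", "temp", "rh"]].values  # low-cost sensor reading + meteorology
target = df["epa_pm25"].values                   # reference (EPA AQS) reading

# Build sliding windows so the LSTM sees the previous 24 hourly records.
window = 24
X = np.stack([features[i - window:i] for i in range(window, len(features))])
y = target[window:]

split = int(0.8 * len(X))                        # 80/20 train/test split, as in the study
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(window, X.shape[2])),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=20, batch_size=64, validation_split=0.1, verbose=0)

pred = model.predict(X_test).ravel()
print("R2  :", r2_score(y_test, pred))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
```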
Why do we need to calibrate low-cost sensors?
There are inconsistencies between low-cost and high-quality regulatory measurements, and calibration is essential to produce reliable, validated data.
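As a minimal illustration of the simplest calibration step, the sketch below fits an OLS linear correction from collocated readings; the file and column names are assumptions, not the project's actual schema.

```python
# Illustrative sketch: fit a linear (OLS) correction of raw PurpleAir PM2.5
# to the collocated EPA reference, then apply it to produce corrected values.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("collocated_pm25.csv")    # hypothetical collocated dataset
X = df[["pa_pm25"]]                        # raw PurpleAir PM2.5
y = df["epa_pm25"]                         # EPA reference PM2.5

ols = LinearRegression().fit(X, y)
print(f"corrected = {ols.coef_[0]:.3f} * raw + {ols.intercept_:.3f}")

df["pa_pm25_corrected"] = ols.predict(X)   # apply the linear correction
```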
Total number of sensors in the USA: PurpleAir – 15,000+; EPA AQS – 5,000+
Plans to further our investigation to best leverage AI/ML for air quality studies
- Hyperparameter tuning should further improve accuracy and reduce uncertainty, but investigating different combinations will require significant computing power. LSTM emerged as the best-performing model in this study, and we plan to explore it further, including detailed hyperparameter tuning and model optimization (see the sketch after this list). The difference in LSTM performance between PyTorch and TensorFlow also warrants further investigation.
- Different air pollutant species may exhibit different patterns, so a systematic study of each of them, e.g., NO2, ozone, or methane, may be needed. In-situ sensors provide good temporal coverage but lack continuous geographic coverage, which can be complemented by satellite retrievals of pollutants.
- Other analytics, such as data downscaling, upscaling, interpolation, and fusion, are needed to best represent air pollution status.
- To better facilitate such systematic studies and extensive AI/ML model runs, an adaptable ML toolkit, potentially released as a Python package, can be developed to speed up air quality research and forecasting. The experiments here were run without any GPU or accelerator, so training times could be reduced with GPU support (Wang, 2023). To further streamline configuration, a cloud-based Docker container would help improve the computing workflow and reduce accidental operational errors.
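As a concrete illustration of the planned hyperparameter tuning, the sketch below runs a small manual grid search over LSTM units and learning rate in TensorFlow/Keras. The grid values, epoch count, and the expectation of pre-windowed training/validation arrays are illustrative assumptions, not the study's final tuning setup.

```python
# Sketch of a small hyperparameter sweep for the LSTM calibration model:
# a manual grid over hidden units and learning rate, selected by validation MSE.
import itertools
import tensorflow as tf

def build_lstm(units, lr, window, n_features):
    # Assemble a single-layer LSTM regressor with the given hyperparameters.
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(units, input_shape=(window, n_features)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), loss="mse")
    return model

def grid_search(X_train, y_train, X_val, y_val):
    # X_* are windowed arrays of shape (samples, window, features); y_* are 1-D targets.
    best = (None, float("inf"))
    for units, lr in itertools.product([32, 64, 128], [1e-2, 1e-3, 1e-4]):
        model = build_lstm(units, lr, X_train.shape[1], X_train.shape[2])
        model.fit(X_train, y_train, epochs=10, batch_size=64, verbose=0)
        val_mse = model.evaluate(X_val, y_val, verbose=0)
        if val_mse < best[1]:
            best = ((units, lr), val_mse)
    return best  # ((units, lr), validation MSE)
```

A full study would likely replace this manual loop with a dedicated tuner and GPU-backed runs, but the structure of the search is the same.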