Module #13: Dynamic/Interaction and Animation Visualization

 

The Final Project for Visual Analytics: Data Analysis of Air Quality and Pollution Trends

This will probably be my last week updating this blog, because this is my final assignment for this course on visual analytics. For this final project, I will discuss a real-world problem or form a hypothesis surrounding one, and explain how the problem ties back to visual analytics. I will then describe a solution or methodology to address the stated problem and create a data visualization from it in RStudio. I really like RStudio, so I will be using it again. This is all pretty normal final exam stuff, so without further ado, let's get into it.

Selecting the data

    For my final project in data visualization, I selected the dataset "AirQuality.csv," which contains various air quality measurements over time. The dataset contains over 50 observations and more than 10 variables. The variables include levels of pollutants such as CO, NMHC, C6H6, NOx, and NO2, as well as meteorological data like temperature (T), relative humidity (RH), and absolute humidity (AH). Each observation is recorded with a timestamp, which makes the data suitable for time series analysis.

    The goal of this project is to analyze air quality data collected over time to understand trends, variations, and relationships between pollutants and environmental conditions. Specifically, I aim to investigate how air quality parameters such as CO and NO2 change over time and how they relate to environmental factors like temperature and humidity.

    Visual analytics of environmental data, especially air quality, has been widely studied. Existing visualizations typically include time series charts showing the variation of pollutant levels over time, as well as correlation heatmaps to identify relationships between different air quality parameters. For example, a study on air quality in urban areas analyzed how CO, NO2, and other pollutants vary with temperature and humidity, using line graphs and scatter plots for visual representation. I found these methods insightful and have employed similar approaches in my analysis.

    The problem will be addressed using time series analysis to observe the variations of air quality parameters over time. Additionally, multivariate analysis will be applied to understand the relationships between pollutants and environmental variables like temperature and humidity. The technical approach involves the following steps:
Time series analysis: plotting pollutants such as CO and NO2 over time.
Correlation analysis: analyzing the relationships between different pollutants and environmental factors.
Imputation of missing data: replacing missing or erroneous data with median values to avoid skewed results.
Visualization: using ggplot2 for line plots and scatter plots, and lubridate (an R package) to handle datetime formatting.
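
To make the datetime and numeric-conversion steps concrete, here is a minimal sketch in R. It is not the exact code used for this project: the sample rows, the column names (Date, Time, CO, T), and the day/month/year date format are all assumptions made for illustration, so they may need adjusting to match the real AirQuality.csv.

```r
library(lubridate)

# Hypothetical sample rows standing in for AirQuality.csv.
# Column names and formats are assumptions, not taken from the real file.
raw <- data.frame(
  Date = c("10/03/2004", "10/03/2004", "11/03/2004"),
  Time = c("18:00:00", "19:00:00", "00:00:00"),
  CO   = c("2,6", "2,0", "-200"),
  T    = c("13,6", "13,3", "11,3"),
  stringsAsFactors = FALSE
)

# Combine the date and time columns into a single datetime value
# (assumes day/month/year ordering).
raw$datetime <- dmy_hms(paste(raw$Date, raw$Time))

# Swap decimal commas for dots, then convert the text to numbers.
to_numeric <- function(x) as.numeric(gsub(",", ".", x, fixed = TRUE))
raw$CO <- to_numeric(raw$CO)
raw$T  <- to_numeric(raw$T)

str(raw)
```

Keeping the comma-to-dot conversion in a small helper function means the same step can be applied to every measurement column without repeating the gsub call.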

Methodology

    I used ggplot2 to create line plots of pollutant concentrations over time, which show how these levels fluctuate and bring out their temporal trends. Next, I explored how individual air quality parameters contribute to overall environmental conditions and how pollutants relate to factors like temperature and humidity. I then calculated the correlation between pollutants and environmental conditions using Pearson's correlation coefficient and visualized the results with scatter plots and correlation heatmaps. After that, I looked for significant deviations in pollutant levels by checking the cleaned and processed data for outliers or irregular trends. The dataset contained invalid values (e.g., -200), which were replaced with NA during data cleaning. I then used median imputation to fill in the missing values.
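
The cleaning and correlation steps can be sketched in base R as follows. The numbers below are made up for illustration and the column names are assumptions; only the -200 marker, the median imputation, and the use of Pearson's coefficient come from the description above.

```r
# Hypothetical readings; -200 marks an invalid value, as described above.
air <- data.frame(
  CO  = c(2.6, 2.0, -200, 1.6, 1.2),
  NO2 = c(113, 92, 114, -200, 116),
  T   = c(13.6, 13.3, 11.9, 11.0, 11.2)
)

# Replace the -200 marker with NA so it cannot distort any statistics.
air[air == -200] <- NA

# Fill the missing values in a column with that column's median.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
air[] <- lapply(air, impute_median)

# Pearson correlation matrix across all variables.
cor_matrix <- cor(air, method = "pearson")
print(round(cor_matrix, 2))
```

The median is used instead of the mean because it is less sensitive to the extreme readings that pollution spikes can produce.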

Data Visualization and Discussion

    First, I imported the dataset and combined the date and time columns into a single datetime variable. I cleaned the pollutant and environmental variables by replacing decimal commas with dots so they could be converted to numeric values. I also handled missing or erroneous values (e.g., -200) by replacing them with NA and performing median imputation. To analyze the time series of CO levels, I created a line plot using ggplot2 (shown below), which reveals how the concentration of CO fluctuates over time. I then calculated the correlation between the various pollutants and environmental factors and visualized the results using a heatmap to identify strong relationships. To better understand the relationship between CO levels and temperature, I created a scatter plot of the two variables (see below). The plot shows how CO levels vary with temperature, which could be important in understanding air quality dynamics.
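
The three plots described above could be produced along these lines with ggplot2. This is a sketch under stated assumptions and not the project's actual code: the data are randomly generated placeholders, and the column names (datetime, CO, NOx, T) are hypothetical.

```r
library(ggplot2)

# Hypothetical cleaned data: random placeholder values, hourly timestamps.
set.seed(1)
n <- 48
air <- data.frame(
  datetime = as.POSIXct("2004-03-10 18:00:00", tz = "UTC") + 3600 * (0:(n - 1)),
  CO  = runif(n, 0.5, 4),
  NOx = runif(n, 50, 300),
  T   = runif(n, 5, 25)
)

# Line plot: CO concentration over time.
p_line <- ggplot(air, aes(x = datetime, y = CO)) +
  geom_line() +
  labs(title = "CO over time", x = "Date and time", y = "CO")

# Heatmap: reshape the correlation matrix into long form, one row per pair.
cor_long <- as.data.frame(as.table(cor(air[, c("CO", "NOx", "T")])))
names(cor_long) <- c("var1", "var2", "r")
p_heat <- ggplot(cor_long, aes(x = var1, y = var2, fill = r)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1)) +
  labs(title = "Pearson correlation", x = NULL, y = NULL)

# Scatter plot: CO against temperature, with a straight-line fit.
p_scatter <- ggplot(air, aes(x = T, y = CO)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "CO vs. temperature", x = "Temperature (T)", y = "CO")

print(p_line)
print(p_heat)
print(p_scatter)
```

The heatmap uses a diverging color scale fixed to the range -1 to 1, so positive and negative correlations get visually distinct colors regardless of the data.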

    Before conducting the analysis, I handled missing data by replacing it with median values, which gave me a clean dataset to work with. The time series plot showed pronounced fluctuations in CO levels, which might suggest pollution spikes at certain times. The correlation heatmap indicated a strong positive correlation between CO and NOx, suggesting that higher levels of CO tend to occur alongside elevated NOx levels. The scatter plot of CO vs. temperature displayed some negative correlation, implying that CO levels might decrease as temperature increases, although this relationship needs further exploration.
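
One way to explore the CO and temperature relationship further would be a significance test on the Pearson correlation, using cor.test from base R. The sketch below runs on simulated data with a built-in negative trend, so its numbers say nothing about the real dataset; the variable names are hypothetical.

```r
# Simulated placeholder data with a negative trend built in on purpose.
set.seed(42)
temp <- runif(60, 5, 30)
co   <- 4 - 0.1 * temp + rnorm(60, sd = 0.4)

# Pearson correlation with a confidence interval and p-value.
result <- cor.test(co, temp, method = "pearson")

print(result$estimate)  # correlation coefficient r
print(result$conf.int)  # 95% confidence interval for r
print(result$p.value)   # p-value for the null hypothesis r = 0
```

A confidence interval that stays below zero would support the negative relationship, while one that spans zero would mean the scatter plot's downward trend could be noise.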

Conclusions 

    This analysis successfully demonstrates the application of time series analysis, correlation analysis, and multivariate analysis to air quality data. Future work could include more advanced statistical models, such as time series forecasting, to predict future air quality levels based on historical trends. Additionally, expanding the dataset to include more environmental factors or using machine learning models could improve the accuracy of predictions. If you would like to explore this topic even further than I have, here is a link to where you can find the data yourself and continue this study. 
