The Final Project for Visual Analytics: Data Analysis of Air Quality and Pollution Trends
This will probably be my last post on this blog, since this is the final assignment for this course on visual analytics. For this final project, I will discuss a real-world problem (or form a hypothesis around one) and explain how it ties back to visual analytics. I will then describe a solution or methodology to address the stated problem and build a data visualization from it in RStudio. I really like RStudio, so I will be using it again. This is all standard final-exam material, so without further ado, let's get into it.
Selecting the data
For my final project in visual analytics, I selected the dataset "AirQuality.csv," which contains air quality measurements recorded over time. The dataset contains 9,471 observations and more than 10 variables, including pollutant levels such as CO, NMHC, C6H6, NOx, and NO2, as well as meteorological data like temperature (T), relative humidity (RH), and absolute humidity (AH). Because the data are recorded at hourly timestamps, the dataset is well suited to time series analysis.
The goal of this project is to analyze air quality data collected over time to understand trends, variations, and relationships between pollutants and environmental conditions. Specifically, I aim to investigate how air quality parameters such as CO and NO2 change over time and how they relate to environmental factors like temperature and humidity.
Visual analytics of environmental data, especially air quality, has been widely studied. Existing visualizations typically include time series charts showing the variation of pollutant levels over time, as well as correlation heatmaps to identify relationships between different air quality parameters. For example, a study on air quality in urban areas analyzed how CO, NO2, and other pollutants vary with temperature and humidity, using line graphs and scatter plots for visual representation. I found these methods insightful and have employed similar approaches in my analysis.
The problem will be addressed using time series analysis to observe how the air quality parameters vary over time. Additionally, multivariate analysis will be applied to understand the relationships between pollutants and environmental variables like temperature and humidity. The technical approach involves the following steps:
Time series analysis: plotting pollutants such as CO and NO2 over time.
Correlation analysis: analyzing the relationships between different pollutants and environmental factors.
Imputation of missing data: replacing missing or erroneous values with the median to avoid skewed results.
Visualization: using ggplot2 for line plots and scatter plots, and lubridate (the R package) to handle datetime formatting.
Methodology
I used ggplot2 to create line plots of pollutant concentrations over time, providing a visual representation of how these levels fluctuate; the focus is on the temporal trends of the pollutants. I then explored how individual air quality parameters contribute to overall environmental conditions and how pollutants relate to factors like temperature and humidity. Next came the calculation of the correlation between pollutants and environmental conditions using Pearson's correlation coefficient, visualized through scatter plots and a correlation heatmap. After that, I looked for significant deviations in pollutant levels by checking the cleaned and processed data for outliers or irregular trends. The dataset contained invalid placeholder values (e.g., -200), which were replaced with NA during data cleaning, and I used median imputation to handle the missing values.
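As a minimal sketch of the Pearson correlation step, the coefficient between CO and temperature can be computed with base R. This assumes the cleaned data frame `data_clean` produced by the cleaning code in the transcript below; it is an illustration, not the exact call I ran:

```r
# Sketch: Pearson correlation between CO and temperature.
# Assumes data_clean from the cleaning steps in the transcript.
r <- cor(data_clean$CO.GT., data_clean$T,
         use = "complete.obs", method = "pearson")
print(r)

# cor.test() additionally reports a confidence interval and p-value
cor.test(data_clean$CO.GT., data_clean$T, method = "pearson")
```

`use = "complete.obs"` drops any rows where either variable is still NA, which matters if correlations are computed before imputation.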
Data Visualization and Discussion
First, I imported the dataset and processed the date and time columns to create a datetime variable. The pollutant and environmental variables were cleaned by replacing commas with dots for numeric conversion. I also handled missing or erroneous values (e.g., -200) by replacing them with NA and performing median imputation. To analyze the time series of CO levels, I created a line plot using ggplot2 (shown below), revealing how the concentration of CO changes over time and providing insight into the fluctuation of this pollutant. I then calculated the correlations between the various pollutants and environmental factors and visualized the results as a heatmap to identify strong relationships. To better understand the relationship between CO levels and temperature, I created a scatter plot, allowing me to examine how these two variables are correlated (see below). The plot reveals how CO levels vary with temperature, which could be important for understanding air quality dynamics.
> # LIS 4317 Visual Analytics FINAL
>
> # Load necessary packages
> library(dplyr)
> library(ggplot2)
> library(lubridate)
>
> # Read the data (replace with the correct file path)
> data <- read.csv("CENSORED/AirQuality.csv", sep = ";", header = TRUE)
>
> # Inspect the first few rows of the data
> head(data)
Date Time CO.GT. PT08.S1.CO. NMHC.GT. C6H6.GT. PT08.S2.NMHC. NOx.GT. PT08.S3.NOx. NO2.GT.
1 10/03/2004 18.00.00 2,6 1360 150 11,9 1046 166 1056 113
2 10/03/2004 19.00.00 2 1292 112 9,4 955 103 1174 92
3 10/03/2004 20.00.00 2,2 1402 88 9,0 939 131 1140 114
4 10/03/2004 21.00.00 2,2 1376 80 9,2 948 172 1092 122
5 10/03/2004 22.00.00 1,6 1272 51 6,5 836 131 1205 116
6 10/03/2004 23.00.00 1,2 1197 38 4,7 750 89 1337 96
PT08.S4.NO2. PT08.S5.O3. T RH AH X X.1
1 1692 1268 13,6 48,9 0,7578 NA NA
2 1559 972 13,3 47,7 0,7255 NA NA
3 1555 1074 11,9 54,0 0,7502 NA NA
4 1584 1203 11,0 60,0 0,7867 NA NA
5 1490 1110 11,2 59,6 0,7888 NA NA
6 1393 949 11,2 59,2 0,7848 NA NA
>
> # Convert Date and Time to datetime format
> data$datetime <- dmy_hms(paste(data$Date, data$Time))
Warning message:
114 failed to parse.
>
> # Replace commas with dots in numeric columns and convert to numeric
> data_clean <- data %>%
+ mutate(across(where(is.character), ~ gsub(",", ".", .))) %>% # Replace commas with dots in numeric columns
+ mutate(across(c("CO.GT.", "NMHC.GT.", "C6H6.GT.", "NOx.GT.", "NO2.GT.", "T", "RH", "AH"), as.numeric)) # Convert to numeric
>
> # Handle invalid values such as -200, which are likely placeholders for missing data
> data_clean <- data_clean %>%
+ mutate(across(c("CO.GT.", "NMHC.GT.", "C6H6.GT.", "NOx.GT.", "NO2.GT.", "T", "RH", "AH"),
+ ~ ifelse(. == -200, NA, .))) # Replace -200 with NA
>
> # Check for missing values after conversion and cleaning
> missing_values <- colSums(is.na(data_clean[, c("CO.GT.", "NMHC.GT.", "C6H6.GT.", "NOx.GT.", "NO2.GT.", "T", "RH", "AH")]))
> print(missing_values)
CO.GT. NMHC.GT. C6H6.GT. NOx.GT. NO2.GT. T RH AH
1797 8557 480 1753 1756 480 480 480
>
> # Impute missing values by replacing them with the median (optional)
> data_clean <- data_clean %>%
+ mutate(across(c("CO.GT.", "NMHC.GT.", "C6H6.GT.", "NOx.GT.", "NO2.GT.", "T", "RH", "AH"),
+ ~ ifelse(is.na(.), median(., na.rm = TRUE), .))) # Impute missing values with the median
>
> # Double-check for missing values after imputation
> missing_values_after_imputation <- colSums(is.na(data_clean[, c("CO.GT.", "NMHC.GT.", "C6H6.GT.", "NOx.GT.", "NO2.GT.", "T", "RH", "AH")]))
> print(missing_values_after_imputation)
CO.GT. NMHC.GT. C6H6.GT. NOx.GT. NO2.GT. T RH AH
0 0 0 0 0 0 0 0
>
> # Check the summary of cleaned data
> summary(data_clean[, c("CO.GT.", "NMHC.GT.", "C6H6.GT.", "NOx.GT.", "NO2.GT.", "T", "RH", "AH")])
CO.GT. NMHC.GT. C6H6.GT. NOx.GT. NO2.GT. T
Min. : 0.100 Min. : 7.0 Min. : 0.100 Min. : 2.0 Min. : 2.0 Min. :-1.90
1st Qu.: 1.200 1st Qu.: 150.0 1st Qu.: 4.600 1st Qu.: 113.0 1st Qu.: 86.0 1st Qu.:12.10
Median : 1.800 Median : 150.0 Median : 8.200 Median : 180.0 Median :109.0 Median :17.80
Mean : 2.086 Mean : 156.6 Mean : 9.988 Mean : 234.5 Mean :112.3 Mean :18.29
3rd Qu.: 2.600 3rd Qu.: 150.0 3rd Qu.:13.500 3rd Qu.: 281.5 3rd Qu.:132.0 3rd Qu.:24.00
Max. :11.900 Max. :1189.0 Max. :63.700 Max. :1479.0 Max. :340.0 Max. :44.60
RH AH
Min. : 9.20 Min. :0.1847
1st Qu.:36.70 1st Qu.:0.7501
Median :49.60 Median :0.9954
Mean :49.25 Mean :1.0240
3rd Qu.:61.70 3rd Qu.:1.2915
Max. :88.70 Max. :2.2310
>
> # Display the cleaned column names to confirm no issues with them
> colnames(data_clean) %>% print()
[1] "Date" "Time" "CO.GT." "PT08.S1.CO." "NMHC.GT." "C6H6.GT."
[7] "PT08.S2.NMHC." "NOx.GT." "PT08.S3.NOx." "NO2.GT." "PT08.S4.NO2." "PT08.S5.O3."
[13] "T" "RH" "AH" "X" "X.1" "datetime"
>
> # View the first few rows of the cleaned data
> head(data_clean)
Date Time CO.GT. PT08.S1.CO. NMHC.GT. C6H6.GT. PT08.S2.NMHC. NOx.GT. PT08.S3.NOx. NO2.GT.
1 10/03/2004 18.00.00 2.6 1360 150 11.9 1046 166 1056 113
2 10/03/2004 19.00.00 2.0 1292 112 9.4 955 103 1174 92
3 10/03/2004 20.00.00 2.2 1402 88 9.0 939 131 1140 114
4 10/03/2004 21.00.00 2.2 1376 80 9.2 948 172 1092 122
5 10/03/2004 22.00.00 1.6 1272 51 6.5 836 131 1205 116
6 10/03/2004 23.00.00 1.2 1197 38 4.7 750 89 1337 96
PT08.S4.NO2. PT08.S5.O3. T RH AH X X.1 datetime
1 1692 1268 13.6 48.9 0.7578 NA NA 2004-03-10 18:00:00
2 1559 972 13.3 47.7 0.7255 NA NA 2004-03-10 19:00:00
3 1555 1074 11.9 54.0 0.7502 NA NA 2004-03-10 20:00:00
4 1584 1203 11.0 60.0 0.7867 NA NA 2004-03-10 21:00:00
5 1490 1110 11.2 59.6 0.7888 NA NA 2004-03-10 22:00:00
6 1393 949 11.2 59.2 0.7848 NA NA 2004-03-10 23:00:00
> summary(data_clean)
Date Time CO.GT. PT08.S1.CO. NMHC.GT. C6H6.GT.
Length:9471 Length:9471 Min. : 0.100 Min. :-200 Min. : 7.0 Min. : 0.100
Class :character Class :character 1st Qu.: 1.200 1st Qu.: 921 1st Qu.: 150.0 1st Qu.: 4.600
Mode :character Mode :character Median : 1.800 Median :1053 Median : 150.0 Median : 8.200
Mean : 2.086 Mean :1049 Mean : 156.6 Mean : 9.988
3rd Qu.: 2.600 3rd Qu.:1221 3rd Qu.: 150.0 3rd Qu.:13.500
Max. :11.900 Max. :2040 Max. :1189.0 Max. :63.700
NA's :114
PT08.S2.NMHC. NOx.GT. PT08.S3.NOx. NO2.GT. PT08.S4.NO2. PT08.S5.O3.
Min. :-200.0 Min. : 2.0 Min. :-200 Min. : 2.0 Min. :-200 Min. :-200.0
1st Qu.: 711.0 1st Qu.: 113.0 1st Qu.: 637 1st Qu.: 86.0 1st Qu.:1185 1st Qu.: 700.0
Median : 895.0 Median : 180.0 Median : 794 Median :109.0 Median :1446 Median : 942.0
Mean : 894.6 Mean : 234.5 Mean : 795 Mean :112.3 Mean :1391 Mean : 975.1
3rd Qu.:1105.0 3rd Qu.: 281.5 3rd Qu.: 960 3rd Qu.:132.0 3rd Qu.:1662 3rd Qu.:1255.0
Max. :2214.0 Max. :1479.0 Max. :2683 Max. :340.0 Max. :2775 Max. :2523.0
NA's :114 NA's :114 NA's :114 NA's :114
T RH AH X X.1 datetime
Min. :-1.90 Min. : 9.20 Min. :0.1847 Mode:logical Mode:logical Min. :2004-03-10 18:00:00
1st Qu.:12.10 1st Qu.:36.70 1st Qu.:0.7501 NA's:9471 NA's:9471 1st Qu.:2004-06-16 05:00:00
Median :17.80 Median :49.60 Median :0.9954 Median :2004-09-21 16:00:00
Mean :18.29 Mean :49.25 Mean :1.0240 Mean :2004-09-21 16:00:00
3rd Qu.:24.00 3rd Qu.:61.70 3rd Qu.:1.2915 3rd Qu.:2004-12-28 03:00:00
Max. :44.60 Max. :88.70 Max. :2.2310 Max. :2005-04-04 14:00:00
NA's :114
>
> #Visualization
> ggplot(data_clean, aes(x = datetime, y = CO.GT.)) +
+ geom_line(color = "pink") +
+ labs(title = "CO.GT. Over Time", x = "Time", y = "CO.GT.") +
+ theme_minimal()
Warning message:
Removed 114 rows containing missing values or values outside the scale range (`geom_line()`).
>
> ggplot(data_clean, aes(x = T, y = CO.GT.)) +
+ geom_point(color = "blue") +
+ labs(title = "CO vs. Temperature", x = "Temperature (°C)", y = "CO (ppm)") +
+ theme_minimal()
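The console transcript above omits the correlation heatmap step described in the methodology. Here is a minimal sketch of one way it could be built with base R's cor() and ggplot2, assuming the `data_clean` frame from the cleaning code; the variable selection and color scale are illustrative choices:

```r
# Sketch of the correlation heatmap step (not shown in the transcript).
# Assumes data_clean from the cleaning code; variable list is illustrative.
library(ggplot2)

vars <- c("CO.GT.", "NOx.GT.", "NO2.GT.", "C6H6.GT.", "T", "RH", "AH")
cor_mat <- cor(data_clean[, vars], use = "complete.obs", method = "pearson")

# Reshape the correlation matrix to long format with base R for ggplot2
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("Var1", "Var2", "Correlation")

ggplot(cor_long, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       limits = c(-1, 1)) +
  labs(title = "Correlation Heatmap of Pollutants and Weather",
       x = NULL, y = NULL) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

The as.data.frame(as.table(...)) idiom avoids pulling in an extra reshaping package just for one plot.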
[Figures: "CO.GT. Over Time" line plot and "CO vs. Temperature" scatter plot]
Before conducting the analysis, I made sure missing data was handled by replacing it with median values, giving a clean dataset for reliable results. The time series plot showed significant fluctuations in CO levels, which might suggest pollution spikes at certain times. The correlation heatmap indicated a strong positive correlation between CO and NOx, suggesting that higher levels of CO may occur alongside elevated NOx levels. The scatter plot of CO vs. temperature displayed some negative correlation, implying that CO levels might decrease as temperature increases, although this relationship needs further exploration.
Conclusions
This analysis successfully demonstrates the application of time series analysis, correlation analysis, and multivariate analysis to air quality data. Future work could include more advanced statistical models, such as time series forecasting, to predict future air quality levels based on historical trends. Additionally, expanding the dataset to include more environmental factors or using machine learning models could improve the accuracy of predictions. If you would like to explore this topic even further than I have, here is a link to where you can find the data yourself and continue this study.
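For the forecasting extension suggested above, here is a minimal sketch using base R's arima() on the hourly CO series. It assumes the `data_clean` frame from earlier; the (1, 0, 1) model order is an arbitrary illustration, not a tuned choice:

```r
# Illustrative sketch only: a simple ARIMA forecast of hourly CO levels.
# Assumes data_clean from the cleaning steps; the (1,0,1) order is an
# arbitrary example, not a fitted/tuned model.
co_ts <- ts(data_clean$CO.GT., frequency = 24)  # 24 observations per day
fit <- arima(co_ts, order = c(1, 0, 1))
pred <- predict(fit, n.ahead = 24)              # forecast the next 24 hours
print(pred$pred)
```

A real forecasting study would compare candidate orders (e.g., via AIC) and account for the strong daily seasonality in traffic-related pollutants.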