The Final Project for Visual Analytics: Data Analysis of Air Quality and Pollution Trends
This will probably be my last post on this blog, since this is the final assignment for this course on visual analytics. For this final project, I will discuss a real-world problem (or form a hypothesis around one) and explain how it ties back to visual analytics. I will then describe a solution or methodology to address the stated problem and build a data visualization from it in RStudio. I really like RStudio, so I will be using it again. This is all standard final-exam material, so without further ado, let's get into it.
Selecting the data
For my final project in visual analytics, I selected the dataset "AirQuality.csv," which contains air quality measurements recorded over time. The dataset contains 9,471 observations and more than 10 variables, including pollutant levels such as CO, NMHC, C6H6, NOx, and NO2, as well as meteorological data like temperature (T), relative humidity (RH), and absolute humidity (AH). Because the data are recorded at hourly timestamps, the dataset is well suited to time series analysis.
The goal of this project is to analyze air quality data collected over time to understand trends, variations, and relationships between pollutants and environmental conditions. Specifically, I aim to investigate how air quality parameters such as CO and NO2 change over time and how they relate to environmental factors like temperature and humidity.
Visual analytics of environmental data, especially air quality, has been widely studied. Existing visualizations typically include time series charts showing the variation of pollutant levels over time, as well as correlation heatmaps to identify relationships between different air quality parameters. For example, a study on air quality in urban areas analyzed how CO, NO2, and other pollutants vary with temperature and humidity, using line graphs and scatter plots for visual representation. I found these methods insightful and have employed similar approaches in my analysis.
The problem will be addressed using time series analysis to observe how the air quality parameters vary over time. Additionally, multivariate analysis will be applied to understand the relationships between pollutants and environmental variables like temperature and humidity. The technical approach involves the following steps:
Time series analysis: plotting pollutants such as CO and NO2 over time.
Correlation analysis: analyzing the relationships between different pollutants and environmental factors.
Imputation of missing data: replacing missing or erroneous values with the median to avoid skewed results.
Visualization: using ggplot2 for line plots and scatter plots, and lubridate (the R package) to handle datetime formatting.
Methodology
I used ggplot2 to create line plots of pollutant concentrations over time, providing a visual representation of how these levels fluctuate; the focus is on the temporal trends of the pollutants. I then explored how individual air quality parameters contribute to overall environmental conditions and how pollutants relate to factors like temperature and humidity. Next came the calculation of the correlation between pollutants and environmental conditions using Pearson's correlation coefficient, visualized through scatter plots and a correlation heatmap. After that, I looked for significant deviations in pollutant levels by checking the cleaned and processed data for outliers or irregular trends. The dataset contained invalid placeholder values (e.g., -200), which were replaced with NA during data cleaning, and I used median imputation to handle the missing values.
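As a minimal sketch of the Pearson correlation step, the coefficient between CO and temperature can be computed with base R. This assumes the cleaned data frame `data_clean` produced by the cleaning code in the transcript below; it is an illustration, not the exact call I ran:

```r
# Sketch: Pearson correlation between CO and temperature.
# Assumes data_clean from the cleaning steps in the transcript.
r <- cor(data_clean$CO.GT., data_clean$T,
         use = "complete.obs", method = "pearson")
print(r)

# cor.test() additionally reports a confidence interval and p-value
cor.test(data_clean$CO.GT., data_clean$T, method = "pearson")
```

`use = "complete.obs"` drops any rows where either variable is still NA, which matters if correlations are computed before imputation.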
Data Visualization and Discussion
First, I imported the dataset and processed the date and time columns to create a datetime variable. The pollutant and environmental variables were cleaned by replacing commas with dots for numeric conversion. I also handled missing or erroneous values (e.g., -200) by replacing them with NA and performing median imputation. To analyze the time series of CO levels, I created a line plot using ggplot2 (shown below), revealing how the concentration of CO changes over time and providing insight into the fluctuation of this pollutant. I then calculated the correlations between the various pollutants and environmental factors and visualized the results as a heatmap to identify strong relationships. To better understand the relationship between CO levels and temperature, I created a scatter plot, allowing me to examine how these two variables are correlated (see below). The plot reveals how CO levels vary with temperature, which could be important for understanding air quality dynamics.
> # LIS 4317 Visual Analytics FINAL
>
> # Load necessary packages
> library(dplyr)
> library(ggplot2)
> library(lubridate)
>
> # Read the data (replace with the correct file path)
> data <- read.csv("CENSORED/AirQuality.csv", sep = ";", header = TRUE)
>
> # Inspect the first few rows of the data
> head(data)
Date Time CO.GT. PT08.S1.CO. NMHC.GT. C6H6.GT. PT08.S2.NMHC. NOx.GT. PT08.S3.NOx. NO2.GT.
1 10/03/2004 18.00.00 2,6 1360 150 11,9 1046 166 1056 113
2 10/03/2004 19.00.00 2 1292 112 9,4 955 103 1174 92
3 10/03/2004 20.00.00 2,2 1402 88 9,0 939 131 1140 114
4 10/03/2004 21.00.00 2,2 1376 80 9,2 948 172 1092 122
5 10/03/2004 22.00.00 1,6 1272 51 6,5 836 131 1205 116
6 10/03/2004 23.00.00 1,2 1197 38 4,7 750 89 1337 96
PT08.S4.NO2. PT08.S5.O3. T RH AH X X.1
1 1692 1268 13,6 48,9 0,7578 NA NA
2 1559 972 13,3 47,7 0,7255 NA NA
3 1555 1074 11,9 54,0 0,7502 NA NA
4 1584 1203 11,0 60,0 0,7867 NA NA
5 1490 1110 11,2 59,6 0,7888 NA NA
6 1393 949 11,2 59,2 0,7848 NA NA
>
> # Convert Date and Time to datetime format
> data$datetime <- dmy_hms(paste(data$Date, data$Time))
Warning message:
114 failed to parse.
>
> # Replace commas with dots in numeric columns and convert to numeric
> data_clean <- data %>%
+ mutate(across(where(is.character), ~ gsub(",", ".", .))) %>% # Replace commas with dots in numeric columns
+ mutate(across(c("CO.GT.", "NMHC.GT.", "C6H6.GT.", "NOx.GT.", "NO2.GT.", "T", "RH", "AH"), as.numeric)) # Convert to numeric
>
> # Handle invalid values such as -200, which are likely placeholders for missing data
> data_clean <- data_clean %>%
+ mutate(across(c("CO.GT.", "NMHC.GT.", "C6H6.GT.", "NOx.GT.", "NO2.GT.", "T", "RH", "AH"),
+ ~ ifelse(. == -200, NA, .))) # Replace -200 with NA
>
> # Check for missing values after conversion and cleaning
> missing_values <- colSums(is.na(data_clean[, c("CO.GT.", "NMHC.GT.", "C6H6.GT.", "NOx.GT.", "NO2.GT.", "T", "RH", "AH")]))
> print(missing_values)
CO.GT. NMHC.GT. C6H6.GT. NOx.GT. NO2.GT. T RH AH
1797 8557 480 1753 1756 480 480 480
>
> # Impute missing values by replacing them with the median (optional)
> data_clean <- data_clean %>%
+ mutate(across(c("CO.GT.", "NMHC.GT.", "C6H6.GT.", "NOx.GT.", "NO2.GT.", "T", "RH", "AH"),
+ ~ ifelse(is.na(.), median(., na.rm = TRUE), .))) # Impute missing values with the median
>
> # Double-check for missing values after imputation
> missing_values_after_imputation <- colSums(is.na(data_clean[, c("CO.GT.", "NMHC.GT.", "C6H6.GT.", "NOx.GT.", "NO2.GT.", "T", "RH", "AH")]))
> print(missing_values_after_imputation)
CO.GT. NMHC.GT. C6H6.GT. NOx.GT. NO2.GT. T RH AH
0 0 0 0 0 0 0 0
>
> # Check the summary of cleaned data
> summary(data_clean[, c("CO.GT.", "NMHC.GT.", "C6H6.GT.", "NOx.GT.", "NO2.GT.", "T", "RH", "AH")])
CO.GT. NMHC.GT. C6H6.GT. NOx.GT. NO2.GT. T
Min. : 0.100 Min. : 7.0 Min. : 0.100 Min. : 2.0 Min. : 2.0 Min. :-1.90
1st Qu.: 1.200 1st Qu.: 150.0 1st Qu.: 4.600 1st Qu.: 113.0 1st Qu.: 86.0 1st Qu.:12.10
Median : 1.800 Median : 150.0 Median : 8.200 Median : 180.0 Median :109.0 Median :17.80
Mean : 2.086 Mean : 156.6 Mean : 9.988 Mean : 234.5 Mean :112.3 Mean :18.29
3rd Qu.: 2.600 3rd Qu.: 150.0 3rd Qu.:13.500 3rd Qu.: 281.5 3rd Qu.:132.0 3rd Qu.:24.00
Max. :11.900 Max. :1189.0 Max. :63.700 Max. :1479.0 Max. :340.0 Max. :44.60
RH AH
Min. : 9.20 Min. :0.1847
1st Qu.:36.70 1st Qu.:0.7501
Median :49.60 Median :0.9954
Mean :49.25 Mean :1.0240
3rd Qu.:61.70 3rd Qu.:1.2915
Max. :88.70 Max. :2.2310
>
> # Display the cleaned column names to confirm no issues with them
> colnames(data_clean) %>% print()
[1] "Date" "Time" "CO.GT." "PT08.S1.CO." "NMHC.GT." "C6H6.GT."
[7] "PT08.S2.NMHC." "NOx.GT." "PT08.S3.NOx." "NO2.GT." "PT08.S4.NO2." "PT08.S5.O3."
[13] "T" "RH" "AH" "X" "X.1" "datetime"
>
> # View the first few rows of the cleaned data
> head(data_clean)
Date Time CO.GT. PT08.S1.CO. NMHC.GT. C6H6.GT. PT08.S2.NMHC. NOx.GT. PT08.S3.NOx. NO2.GT.
1 10/03/2004 18.00.00 2.6 1360 150 11.9 1046 166 1056 113
2 10/03/2004 19.00.00 2.0 1292 112 9.4 955 103 1174 92
3 10/03/2004 20.00.00 2.2 1402 88 9.0 939 131 1140 114
4 10/03/2004 21.00.00 2.2 1376 80 9.2 948 172 1092 122
5 10/03/2004 22.00.00 1.6 1272 51 6.5 836 131 1205 116
6 10/03/2004 23.00.00 1.2 1197 38 4.7 750 89 1337 96
PT08.S4.NO2. PT08.S5.O3. T RH AH X X.1 datetime
1 1692 1268 13.6 48.9 0.7578 NA NA 2004-03-10 18:00:00
2 1559 972 13.3 47.7 0.7255 NA NA 2004-03-10 19:00:00
3 1555 1074 11.9 54.0 0.7502 NA NA 2004-03-10 20:00:00
4 1584 1203 11.0 60.0 0.7867 NA NA 2004-03-10 21:00:00
5 1490 1110 11.2 59.6 0.7888 NA NA 2004-03-10 22:00:00
6 1393 949 11.2 59.2 0.7848 NA NA 2004-03-10 23:00:00
> summary(data_clean)
Date Time CO.GT. PT08.S1.CO. NMHC.GT. C6H6.GT.
Length:9471 Length:9471 Min. : 0.100 Min. :-200 Min. : 7.0 Min. : 0.100
Class :character Class :character 1st Qu.: 1.200 1st Qu.: 921 1st Qu.: 150.0 1st Qu.: 4.600
Mode :character Mode :character Median : 1.800 Median :1053 Median : 150.0 Median : 8.200
Mean : 2.086 Mean :1049 Mean : 156.6 Mean : 9.988
3rd Qu.: 2.600 3rd Qu.:1221 3rd Qu.: 150.0 3rd Qu.:13.500
Max. :11.900 Max. :2040 Max. :1189.0 Max. :63.700
NA's :114
PT08.S2.NMHC. NOx.GT. PT08.S3.NOx. NO2.GT. PT08.S4.NO2. PT08.S5.O3.
Min. :-200.0 Min. : 2.0 Min. :-200 Min. : 2.0 Min. :-200 Min. :-200.0
1st Qu.: 711.0 1st Qu.: 113.0 1st Qu.: 637 1st Qu.: 86.0 1st Qu.:1185 1st Qu.: 700.0
Median : 895.0 Median : 180.0 Median : 794 Median :109.0 Median :1446 Median : 942.0
Mean : 894.6 Mean : 234.5 Mean : 795 Mean :112.3 Mean :1391 Mean : 975.1
3rd Qu.:1105.0 3rd Qu.: 281.5 3rd Qu.: 960 3rd Qu.:132.0 3rd Qu.:1662 3rd Qu.:1255.0
Max. :2214.0 Max. :1479.0 Max. :2683 Max. :340.0 Max. :2775 Max. :2523.0
NA's :114 NA's :114 NA's :114 NA's :114
T RH AH X X.1 datetime
Min. :-1.90 Min. : 9.20 Min. :0.1847 Mode:logical Mode:logical Min. :2004-03-10 18:00:00
1st Qu.:12.10 1st Qu.:36.70 1st Qu.:0.7501 NA's:9471 NA's:9471 1st Qu.:2004-06-16 05:00:00
Median :17.80 Median :49.60 Median :0.9954 Median :2004-09-21 16:00:00
Mean :18.29 Mean :49.25 Mean :1.0240 Mean :2004-09-21 16:00:00
3rd Qu.:24.00 3rd Qu.:61.70 3rd Qu.:1.2915 3rd Qu.:2004-12-28 03:00:00
Max. :44.60 Max. :88.70 Max. :2.2310 Max. :2005-04-04 14:00:00
NA's :114
>
> #Visualization
> ggplot(data_clean, aes(x = datetime, y = CO.GT.)) +
+ geom_line(color = "pink") +
+ labs(title = "CO.GT. Over Time", x = "Time", y = "CO.GT.") +
+ theme_minimal()
Warning message:
Removed 114 rows containing missing values or values outside the scale range (`geom_line()`).
>
> ggplot(data_clean, aes(x = T, y = CO.GT.)) +
+ geom_point(color = "blue") +
+ labs(title = "CO vs. Temperature", x = "Temperature (°C)", y = "CO (ppm)") +
+ theme_minimal()
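The console transcript above omits the correlation heatmap step described in the methodology. Here is a minimal sketch of one way it could be built with base R's cor() and ggplot2, assuming the `data_clean` frame from the cleaning code; the variable selection and color scale are illustrative choices:

```r
# Sketch of the correlation heatmap step (not shown in the transcript).
# Assumes data_clean from the cleaning code; variable list is illustrative.
library(ggplot2)

vars <- c("CO.GT.", "NOx.GT.", "NO2.GT.", "C6H6.GT.", "T", "RH", "AH")
cor_mat <- cor(data_clean[, vars], use = "complete.obs", method = "pearson")

# Reshape the correlation matrix to long format with base R for ggplot2
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("Var1", "Var2", "Correlation")

ggplot(cor_long, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       limits = c(-1, 1)) +
  labs(title = "Correlation Heatmap of Pollutants and Weather",
       x = NULL, y = NULL) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

The as.data.frame(as.table(...)) idiom avoids pulling in an extra reshaping package just for one plot.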
[Figures: "CO.GT. Over Time" line plot and "CO vs. Temperature" scatter plot]
Before conducting the analysis, I made sure missing data was handled by replacing it with median values, giving a clean dataset for reliable results. The time series plot showed significant fluctuations in CO levels, which might suggest pollution spikes at certain times. The correlation heatmap indicated a strong positive correlation between CO and NOx, suggesting that higher levels of CO may occur alongside elevated NOx levels. The scatter plot of CO vs. temperature displayed some negative correlation, implying that CO levels might decrease as temperature increases, although this relationship needs further exploration.
Conclusions
This analysis successfully demonstrates the application of time series analysis, correlation analysis, and multivariate analysis to air quality data. Future work could include more advanced statistical models, such as time series forecasting, to predict future air quality levels based on historical trends. Additionally, expanding the dataset to include more environmental factors or using machine learning models could improve the accuracy of predictions. If you would like to explore this topic even further than I have, here is a link to where you can find the data yourself and continue this study.
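For the forecasting extension suggested above, here is a minimal sketch using base R's arima() on the hourly CO series. It assumes the `data_clean` frame from earlier; the (1, 0, 1) model order is an arbitrary illustration, not a tuned choice:

```r
# Illustrative sketch only: a simple ARIMA forecast of hourly CO levels.
# Assumes data_clean from the cleaning steps; the (1,0,1) order is an
# arbitrary example, not a fitted/tuned model.
co_ts <- ts(data_clean$CO.GT., frequency = 24)  # 24 observations per day
fit <- arima(co_ts, order = c(1, 0, 1))
pred <- predict(fit, n.ahead = 24)              # forecast the next 24 hours
print(pred$pred)
```

A real forecasting study would compare candidate orders (e.g., via AIC) and account for the strong daily seasonality in traffic-related pollutants.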