In this exercise, we will explore packages such as tidyverse and ggplot2 and perform visual analysis. We will study the demographic of the city of Engagement,Ohio,USA.
While we do so, we will also aim to address some common issues such as why is Pie chart not a preferred graph for visual analysis and how we can understand the distribution better.
Each graph has been created in a step by step manner for the reader to understand the steps and the rationale behind each upgradation.
Reading the required packages:
packages=c('tidyverse','psych','rmarkdown','ggmosaic','ggpubr')
for (p in packages){
if(!require(p,character.only=T)){
install.packages(p)
}
library(p,character.only = T)
}
The data required some cleaning to make it visually appealing to the eyes.
Since ggplot2 does not give us very aesthetically looking graphs, they have to modified rigorously.
Datatype for columns such as HouseholdSize is an integer while it should be of nominal data type. This becomes a caveat while plotting a bargraph and needs to be converted to relevant data type to process it further.
Dataset: Participants.csv
Source: https://vast-challenge.github.io/2022/
data=read_csv("data/Participants.csv",show_col_types = FALSE)
paged_table(data)
1. Regrouping Age
Understanding the distribution of age in the dataset
ggplot(data,aes(x=age))+
geom_histogram(bins=10,fill="lightblue",colour="black")+
coord_flip()+
ggtitle("Distribution of Age")+
theme(plot.title = element_text(hjust=0.5))
The age can be regrouped in the following categories: Below 30, 30-39, 40-49, 50 and above
d<-data%>%mutate(agegroup=case_when(age<30~"Below 30",
age>=30 &age<40~"30-39",
age>=40 &age<50~"40-49",
age>=50 ~"50 and above"))
x_selected<-d%>%select(participantId,householdSize,haveKids,age,agegroup,educationLevel)%>%arrange(age)
paged_table(x_selected)
2. Reordering haveKids
Instead of having values as True, False- it will be modified to ‘WithKids’ and ‘WithoutKids’ for better visualization
d<-d%>%mutate('Kids_Reordered'=
case_when(haveKids=="TRUE"~"With Kids",
haveKids=="FALSE"~"Without Kids"))
paged_table(d)
Objective:Observe the Age composition of the participants
ggplot(d,aes(x=agegroup))+
geom_bar()+
ggtitle("Frequency of Age-Groups")+
theme(plot.title = element_text(hjust=0.5))
As a next step, the x axis would be arranged in the following order- Below 30, 30-39, 40-49 and 50 and above. Also, the colour of the bar to be changed to pink
ggplot(d,aes(x=factor(agegroup,levels=c("Below 30","30-39","40-49","50 and above"))))+
geom_bar(fill="pink")
Adding a line and point graph to identify the trend in the frequency of age. And to get a better understanding, the graph would be labelled with the proportion of each of the age group.
ggplot(d,aes(x=factor(agegroup,levels=c("Below 30","30-39","40-49","50 and above"))))+
geom_bar(fill="pink")+
geom_line(aes(group=1),stat="count",colour="darkblue")+
geom_point(aes(),stat="count",colour="black")+
geom_text(stat="count",aes(label=paste0(round(..count../sum(..count..)*100,2),"%")),vjust=-1)
Further adjustments
Modifying the x and y labels
Increasing the y limit to read the numbers
Changing the theme
ggplot(d,aes(x=factor(agegroup,levels=c("Below 30","30-39","40-49","50 and above"))))+
geom_bar(fill="pink")+
geom_line(aes(group=1),stat="count",colour="darkblue")+
geom_point(aes(),stat="count",colour="black")+
geom_text(stat="count",aes(label=paste0(round(..count../sum(..count..)*100,2),"%")),vjust=-1)+
ylim(0,350)+
xlab("Age Group")+
ylab("Frequency")+
ggtitle("How old is the crowd?") +
theme_classic()+
theme(plot.title = element_text(hjust = 0.5))
Objective: Understand the frequency and the proportion of qualifications of the participants
ggplot(data = d,
aes(x=agegroup,fill=educationLevel))+
geom_bar()
Here it can be seen that the graph is plotted with respect to the frequency of specific age group. Since the frequency of age group is different,this kind of visualization does not give us a clarity on the proportion of Education Status.
To rectify this issue, column chart would be used to plot the proportion of Education Status for each age group
Creating a dataset where the count and percentage of the age group-education status combination is calculated.
AgeGroup_EducationLevel<-d%>%group_by(agegroup,educationLevel)%>%tally()
AgeGroup_EducationLevel<-AgeGroup_EducationLevel%>%
group_by(agegroup)%>%
mutate(Total=sum(n),percent=round(n*100/Total))%>%
ungroup()
paged_table(AgeGroup_EducationLevel)
This dataset helps to understand the percentage value for each agegroup-educationlevel combination
ggplot(data = AgeGroup_EducationLevel,
aes(x=factor(agegroup,
levels=c("Below 30","30-39","40-49","50 and above")),
y=percent,
fill=educationLevel))+
geom_col()
Further adjustments:
Using geom_text(), perecantage of participants who have attained a particular level of education to be labelled.
X,Y axis labels to be renamed
Y axis can be made continuous
Flipping the coordinates to make it visually appealing
ggplot(data = AgeGroup_EducationLevel,
aes(x=factor(agegroup,levels=c("Below 30","30-39","40-49","50 and above")),
y=percent,
fill=educationLevel))+
geom_col()+
geom_text(aes(label=paste0(percent,"%")),
position = position_stack(vjust = 0.5),size=3)+
theme(axis.text.x = element_text(angle=0))+
coord_flip()+
xlab("Age Group")+
ylab("% of participants")+
ggtitle("How well educated are the participants?") +
theme_bw()+
guides(fill = guide_legend(title = "Education Level"),
shape = guide_legend(override.aes = list(size = 0.5)))+
theme(plot.title = element_text(hjust = 0.5,size = 13),
legend.title = element_text(size = 9),
legend.text = element_text(size = 7))
Since there are 2 categorical variables in question, Mosaic plot can also be explored.A mosaic plot can be understood intuitively: the entire rectangle represents 100% of the observations. The area of each mosaic piece shows the proportion of observations in that category combination. The rest basically works like a stacked bar chart, which should be familiar to the average reader.
d[['agegroup']]=factor(d[['agegroup']],levels=c("Below 30","30-39","40-49","50 and above"))
ggplot(data = d) +
geom_mosaic(aes(x = product(agegroup,educationLevel),
fill=agegroup))+
labs(title='How well educated are the participants?')+
ylab("Age Group")+
xlab("Education Level")+
guides(fill = guide_legend(title = "Education Level"),
shape = guide_legend(override.aes = list(size = 0.5)))+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5,size = 13),
legend.title = element_text(size = 9),
legend.text = element_text(size = 7))
Objective: Compare the frequency of Household Size
Here,both doughnut chart and bar graph can be explored to understand the better visualization technique to read and interpret the household size.
First,the dataset is grouped based on household size to get the count and the cumulative frequency (denoted as ymax). Post calculating the cummulative frequency, another column which will downshift the cummulative frequency will be created (denoted as ymin).
These variables are used to create the doughnut chart.
x<-d%>%
group_by(householdSize)%>%
tally()%>%
ungroup()
y<-x%>%mutate(`ymax`=cumsum(`n`/sum(`n`)))
z=y%>%mutate(`ymin`=c(0,head(`ymax`,n=-1)))
paged_table(z)
doughnutchart<-ggplot(z, aes(x=3,y=n, fill=as.character(householdSize))) +
geom_col(color = "black")
doughnutchart
The graph is created using rectangular coordinates. Converting it to Polar coordinates
doughnutchart<-ggplot(z, aes(x=3,y=n, fill=as.character(householdSize))) +
geom_col(color = "black") +
coord_polar(theta="y")
doughnutchart
Adding x limits to create a doughnut chart
doughnutchart<-ggplot(z, aes(x=3,y=n, fill=as.character(householdSize))) +
geom_col(color = "black") +
coord_polar(theta="y")+
xlim(c(2,4))
doughnutchart
Further adjustments:
Change the colour pallete
Theme to be updated to theme_void() to remove the grid
Legend Title to be modified
doughnutchart<-ggplot(z, aes(x=3,y=n, fill=as.character(householdSize))) +
geom_col(color = "black") +
coord_polar(theta="y")+
xlim(c(2,4))+
scale_color_brewer(palette=1) +
scale_fill_manual(values = c("#FFF7FB", "#D0D1E6",
"#74A9CF"))+
theme_void()+
guides(fill = guide_legend(title = "Household Size"))+
ggtitle("Which has the highest frequency of Household Size?")+
theme(plot.title = element_text(hjust=0.5))
doughnutchart
Seems like our question is still not answered!
To understand the difference, bar graph with the same dataset and colour pallete would be created.
doughnutchart<-ggplot(z, aes(x=3,y=n, fill=as.character(householdSize))) +
geom_col(color = "black") +
coord_polar(theta="y")+
xlim(c(2,4))+
scale_color_brewer(palette=1) +
scale_fill_manual(values = c("#FFF7FB", "#D0D1E6",
"#74A9CF"))+
theme_void()+
guides(fill = guide_legend(title = "Household Size"))
barchart<-ggplot(z, aes(x=as.character(householdSize),y=n)) +
geom_col(fill=c("#FFF7FB", "#D0D1E6","#74A9CF"))+
xlab("Household Size")+
ylab("")+
theme_classic()
ggarrange(doughnutchart,barchart)
Even though the numbers are not labelled in the graphs, the doughnut graph does not help us in comparing the frequency of household size. Bar graph, on the other hand gives us a fair idea for comparison.
This can be because in case of doughnut chart, the count is quite similar to each other and there is no significant difference.
Adding labels to make it more readable.
doughnutchart<-ggplot(z, aes(x=3,y=n, fill=as.character(householdSize))) +
geom_col(color = "black") +
coord_polar(theta="y")+
xlim(c(2,4))+
geom_text(aes(label=n),
position = position_stack(vjust = 0.5)) +
scale_color_brewer(palette=1) +
scale_fill_manual(values = c("#FFF7FB", "#D0D1E6",
"#74A9CF"))+
theme_void()+
guides(fill = guide_legend(title = "Household Size"))
barchart<-ggplot(z, aes(x=as.character(householdSize),y=n)) +
geom_col(fill=c("#FFF7FB", "#D0D1E6","#74A9CF"))+
geom_text(aes(label=n),
position = position_stack(vjust =0.5)) +
xlab("Household Size")+
ylab("")+
theme_classic()
c<-ggarrange(doughnutchart,barchart)
annotate_figure(c,top = text_grob("Doughnut vs Bar-The better one",
face = "bold", size = 14))
<b.Objective: To observe if people with kids are happier than the ones who are childless
Based on the dataset provided, this vital objective which has been running across the minds of all couples will be achieved using the boxplot and violin plot to understand the distribution amongst both the categories.
ggplot(data=d,aes(y=joviality,x=Kids_Reordered))+
geom_violin(fill="light blue")
Adding a boxplot to the graph
ggplot(data=d,aes(y=joviality,x=Kids_Reordered))+
geom_violin(fill="light blue")+
geom_boxplot()
To understand the distribution of joviality amongst various interest groups, facet_wrap() would be used and would be arranged based on the number of interest groups.
ggplot(data=d,aes(y=joviality,x=Kids_Reordered))+
geom_violin(fill="light blue")+
geom_boxplot()+
facet_wrap(~interestGroup)
Rearranging it in 2 rows by adding nrow=2
ggplot(data=d,aes(y=joviality,x=Kids_Reordered))+
geom_violin(fill="light blue")+
geom_boxplot()+
facet_wrap(~interestGroup,nrow = 2)
Further adjustments:
Adding mean to the box plot
Tilting the x axis by 45 degree to make it more readable
Graph title to be added
ggplot(data=d,aes(y=joviality*100,x=Kids_Reordered))+
geom_violin(fill="light blue")+
geom_boxplot()+
facet_wrap(~interestGroup,nrow = 2)+
geom_point(stat="summary",
fun.y="mean",
colour='red',size=1)+
xlab("Kids")+
ylab("Joviality (*0.01)")+
theme(axis.text.x = element_text(angle=0),
plot.title = element_text(hjust = 0.5,size = 10))+
ggtitle("Are Participants with kids based on each interest group happy?")+
coord_flip()
https://www.singstat.gov.sg/-/media/files/publications/population/population2020.pdf
https://vast-challenge.github.io/2022/