Simple histograms
When embarking on data analysis for a project, one of the most important things you can do in the beginning is to run descriptives on your data. It is important to get to know your data well, so that you know the patterns that exist in your data, what the distribution of your data is like, and of course, to catch any mistakes you may have made during data cleaning.
One of the first things I do when learning about my data is make histograms of all of my variables: both demographic variables, which I may use as covariates, and key variables of interest.
You can also visually glean how the distribution of your data may differ within your sample by grouping your data by demographic indicators such as gender, race, or age.
Below is some R code for generating histograms that place histograms for multiple subgroups on top of each other (with some translucency). Lines for group means are added as well so you can see whether the means differ by group.
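library(plyr)     # ddply() for group-wise summaries
library(ggplot2)  # plotting
library(ggsci)    # scale_color_npg() and scale_fill_npg()

# Mean of criticalreflection within each level of female
groupmeans <-
  ddply(datafile,
        c("female"),
        summarise,
        group.mean = mean(criticalreflection, na.rm = TRUE))

# factor() makes the 0/1 indicator discrete, which the npg color scales require
ggplot(datafile, aes(x = criticalreflection, color = factor(female), fill = factor(female))) +
  geom_histogram(position = "identity",
                 alpha = 0.3,
                 bins = 15) +
  geom_vline(data = groupmeans,
             aes(xintercept = group.mean, color = factor(female)),
             linetype = "dashed") +
  scale_color_npg() +
  scale_fill_npg() +
  xlab("Critical Reflection") +
  ylab("Count") +
  # theme_bw() resets earlier theme() calls, so it goes before the custom settings
  theme_bw() +
  theme(legend.position = "right",
        legend.title = element_blank(),
        plot.margin = margin(1.5, 1.5, 1.5, 1.5, "cm"))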
The above code creates this graph:
Here is a breakdown of what each part of the R code does:
library(plyr)
library(ggplot2)
library(ggsci)
I loaded three packages for this histogram: plyr, which helps with data manipulation; ggplot2, for creating visualizations of data; and ggsci, for my color scheme.
groupmeans <-
  ddply(datafile,
        c("female"),
        summarise,
        group.mean = mean(criticalreflection, na.rm = TRUE))
Here I am creating a data frame called groupmeans that holds the mean of my variable of interest, criticalreflection, for each group. The grouping variable is female, which is coded 1 for female and 0 for male. The part na.rm = TRUE tells mean() to drop missing values before averaging; without it, any group that contains a missing value would get a mean of NA.
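To see what na.rm = TRUE changes, here is a minimal sketch on made-up data. The toy data frame and its values are invented for illustration, and I use base R's tapply() in place of ddply() so the example runs without extra packages.

```r
# Toy data, invented for illustration: group 0 has one missing score
toy <- data.frame(
  female = c(0, 0, 0, 1, 1, 1),
  criticalreflection = c(2, 4, NA, 3, 5, 7)
)

# Without na.rm, the mean of any group containing an NA is itself NA
means_with_na <- tapply(toy$criticalreflection, toy$female, mean)

# With na.rm = TRUE, missing values are dropped before averaging
means_no_na <- tapply(toy$criticalreflection, toy$female, mean, na.rm = TRUE)

print(means_with_na)  # group 0 is NA, group 1 is 5
print(means_no_na)    # group 0 is 3, group 1 is 5
```

The same na.rm = TRUE argument does the same job inside the ddply() call, because it is passed straight to mean().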
# factor() makes the 0/1 indicator discrete, which the npg color scales require
ggplot(datafile, aes(x = criticalreflection, color = factor(female), fill = factor(female))) +
  geom_histogram(position = "identity",
                 alpha = 0.3,
                 bins = 15) +
  geom_vline(data = groupmeans,
             aes(xintercept = group.mean, color = factor(female)),
             linetype = "dashed") +
  scale_color_npg() +
  scale_fill_npg() +
  xlab("Critical Reflection") +
  ylab("Count") +
  # theme_bw() resets earlier theme() calls, so it goes before the custom settings
  theme_bw() +
  theme(legend.position = "right",
        legend.title = element_blank(),
        plot.margin = margin(1.5, 1.5, 1.5, 1.5, "cm"))
Next I am using ggplot2 to create the histogram. In addition to the regular histogram, I’ve added a position = "identity" argument because I wanted the bars for each group to be drawn on top of each other (the alpha = 0.3 argument adds the transparency). You can instead place the bars next to each other with "dodge" or stack them on top of each other with "stack". I use geom_vline() to add a dashed line at each group mean, one for female and one for male. The lines are in the assigned color for each group. The rest of the code specifies where the legend is and what the axis labels are, and adds a 1.5 cm margin around everything. scale_color_npg() and scale_fill_npg() apply my chosen color scheme for the graph. It's from the ggsci package and is inspired by the color scheme used in Nature Publishing Group journals like Nature.
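If you want to compare the three position options side by side, a sketch like the following may help. The toy data frame is simulated purely for illustration (it is not my real data), and the object names p_identity, p_dodge, and p_stack are just placeholders.

```r
library(ggplot2)

# Simulated data for illustration only
set.seed(1)
toy <- data.frame(
  female = factor(rep(c(0, 1), each = 50)),
  criticalreflection = c(rnorm(50, mean = 3), rnorm(50, mean = 3.5))
)

base_plot <- ggplot(toy, aes(x = criticalreflection, fill = female))

# Overlapping bars; transparency keeps both groups visible
p_identity <- base_plot +
  geom_histogram(position = "identity", alpha = 0.3, bins = 15)

# Bars for each group placed next to each other within a bin
p_dodge <- base_plot +
  geom_histogram(position = "dodge", bins = 15)

# Bars for each group stacked on top of each other within a bin
p_stack <- base_plot +
  geom_histogram(position = "stack", bins = 15)
```

Printing each object (for example, print(p_dodge)) draws that version of the plot, so you can judge which layout reads best for your own variables.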
And that’s it!
Another thing that would be cool to do is actually run a t-test on the means and add a text box to the plot with the means, their SDs, and whether the difference between the means is statistically significant. I may post some code for doing that. Stay tuned.
But since, for this project, I am less interested in differences in my variables by gender identity, I decided to just make plots for all my variables with a visual representation of any differences by the female variable, as a way to get to know the data better.
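For the t-test idea mentioned above, a rough sketch of the calculation might look like this. It is not code I have used on this project: the data are simulated for illustration, the label format is just one possibility, and t.test() runs a Welch test by default, which may or may not be the test you want.

```r
# Simulated data for illustration only
set.seed(1)
toy <- data.frame(
  female = factor(rep(c(0, 1), each = 50)),
  criticalreflection = c(rnorm(50, mean = 3), rnorm(50, mean = 3.5))
)

# Welch two-sample t-test (R's default) comparing the two group means
tt <- t.test(criticalreflection ~ female, data = toy)

# Group means and standard deviations, in the order of the factor levels (0, 1)
group_mean <- as.numeric(tapply(toy$criticalreflection, toy$female, mean, na.rm = TRUE))
group_sd <- as.numeric(tapply(toy$criticalreflection, toy$female, sd, na.rm = TRUE))

# Text that could be placed on the plot
label <- sprintf(
  "M(0) = %.2f (SD %.2f)\nM(1) = %.2f (SD %.2f)\np = %.3f",
  group_mean[1], group_sd[1], group_mean[2], group_sd[2], tt$p.value
)
cat(label, "\n")
```

One way to put the label on the histogram would be ggplot2's annotate("text", ...) with x and y positions chosen to suit your plot.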
Oh, and another thing I would like to find out more about is better ways to deal with a categorical variable that has very small categories. In the social sciences, a lot of demographic data we collect usually consists of some very large categories (e.g., male, female) and some smaller categories (e.g., non-binary/third gender). In order to use these variables as covariates, I often drop the cases belonging to the tiny categories! I hate doing this, as I feel like I am further marginalizing communities that experience erasure in my data analysis.
What are some ways that we can preserve these cases, even though they represent very small categories?