Number of PhDs by Field
An Incomplete Data Exploration
By Harshvardhan in R statistics economics thoughts
February 20, 2022
Yesterday I was talking to one of my friends about his plans after his PhD. “I want to go for pure sciences and abstract mathematics, but there are hardly any positions in academia on these topics,” he said. It got me thinking about how many PhD students graduate every year and whether the demand (in academia or in industry) falls short of that. But I didn’t even know how many PhDs are awarded each year, let alone how many graduates find employment.
While searching for a dataset for my Text Mining class project, I discovered this dataset on the number of PhDs by field. So, let’s explore!
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8.9000
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(garlic)
library(DT)
theme_set(theme_linedraw())
# Loading dataset from their repository
phds = readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-19/phd_by_field.csv")
## Rows: 3370 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): broad_field, major_field, field
## dbl (2): year, n_phds
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
phds
## # A tibble: 3,370 × 5
## broad_field major_field field year n_phds
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Life sciences Agricultural sciences and natural resources Agric… 2008 111
## 2 Life sciences Agricultural sciences and natural resources Agric… 2008 28
## 3 Life sciences Agricultural sciences and natural resources Agric… 2008 3
## 4 Life sciences Agricultural sciences and natural resources Agron… 2008 68
## 5 Life sciences Agricultural sciences and natural resources Anima… 2008 41
## 6 Life sciences Agricultural sciences and natural resources Anima… 2008 18
## 7 Life sciences Agricultural sciences and natural resources Anima… 2008 77
## 8 Life sciences Agricultural sciences and natural resources Envir… 2008 182
## 9 Life sciences Agricultural sciences and natural resources Fishi… 2008 52
## 10 Life sciences Agricultural sciences and natural resources Food … 2008 96
## # … with 3,360 more rows
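The three text columns in the printout are nested levels of granularity. One way to check how many distinct values each level holds, and which years are covered, is a single `summarise()` call. The snippet below is only a sketch: the `toy` tibble and its numbers are made up for illustration and are not rows from the dataset, but the same call should work on `phds`.

```r
library(dplyr)

# Made-up rows that mimic the columns of `phds`; the values are illustrative only.
toy <- tibble::tribble(
  ~broad_field,    ~major_field,            ~field,                ~year, ~n_phds,
  "Life sciences", "Agricultural sciences", "Agronomy",             2008,      68,
  "Life sciences", "Agricultural sciences", "Agronomy",             2009,      70,
  "Life sciences", "Health sciences",       "Nursing science",      2008,      NA,
  "Engineering",   "Other engineering",     "Nuclear engineering",  2008,     100,
  "Engineering",   "Other engineering",     "Nuclear engineering",  2009,     110
)

# Distinct values at each level of granularity, the span of years,
# and the number of missing counts.
shape <- toy %>%
  summarise(
    n_broad    = n_distinct(broad_field),
    n_major    = n_distinct(major_field),
    n_field    = n_distinct(field),
    first_year = min(year),
    last_year  = max(year),
    n_missing  = sum(is.na(n_phds))
  )

shape
```

Replacing `toy` with `phds` gives the same summary for the real data.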
The records come at three levels of granularity: broad field, major field and field. There are 337 fields, and we have records for each of them from 2008 to 2017. Let’s see how many PhDs each broad field awarded.
phds %>%
  group_by(broad_field) %>%
  summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
  arrange(desc(n_phds)) %>%
  datatable(colnames = c("Broad Field", "Number of PhDs"),
            rownames = FALSE,
            caption = "Number of PhDs by their broad fields. Life sciences lead the way.") %>%
  formatRound("n_phds", digits = 0)
Life sciences has the highest number of graduates. Engineering has the lowest, even lower than the mysterious Other. Surprisingly, social sciences, humanities and education are all higher than mathematics and computer sciences, and they lead by a clear margin. The number of graduates in “humanities and social science” subjects is four times the number of PhDs in “hard sciences” like engineering and maths. No wonder there is such a shortage of people in the tech world.
Life sciences is such a broad, encompassing field. Let’s explore what it covers.
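A comparison like this boils down to dividing one group's total by another's. Here is a minimal sketch of that arithmetic, assuming a hypothetical helper `field_ratio()` (not part of any package) and a made-up `toy` tibble whose numbers are purely illustrative. With the real data, the names passed in would have to match the values of `broad_field` exactly.

```r
library(dplyr)

# Hypothetical helper: total PhDs in one set of broad fields divided by
# the total in another set. Missing counts are ignored.
field_ratio <- function(data, numerator, denominator) {
  totals <- data %>%
    group_by(broad_field) %>%
    summarise(n_phds = sum(n_phds, na.rm = TRUE), .groups = "drop")

  num <- sum(totals$n_phds[totals$broad_field %in% numerator])
  den <- sum(totals$n_phds[totals$broad_field %in% denominator])
  num / den
}

# Made-up numbers, for illustration only.
toy <- tibble::tribble(
  ~broad_field,  ~year, ~n_phds,
  "Humanities",   2008,     300,
  "Humanities",   2009,     300,
  "Engineering",  2008,     100,
  "Engineering",  2009,      NA,
  "Other",        2008,     100
)

ratio <- field_ratio(
  toy,
  numerator   = "Humanities",
  denominator = c("Engineering", "Other")
)

ratio
```

Summing per broad field first keeps the helper usable for any grouping of fields, not just this pair.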
phds %>%
  filter(broad_field == "Life sciences") %>%
  group_by(major_field) %>%
  summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
  arrange(desc(n_phds)) %>%
  datatable(colnames = c("Major Field", "Number of PhDs"),
            rownames = FALSE,
            caption = "Number of PhDs by their major fields. Biology, excluding health sciences, leads the way.") %>%
  formatRound("n_phds", digits = 0)
Biological and biomedical sciences has the highest number of graduates. There are so few PhDs in geosciences. With climate change becoming a major issue, I wonder why the field isn’t picking up faster.
Let me explore engineering too and see its fields.
phds %>%
  filter(broad_field == "Engineering") %>%
  group_by(major_field) %>%
  summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
  arrange(desc(n_phds))
## # A tibble: 1 × 2
## major_field n_phds
## <chr> <dbl>
## 1 Other engineering 18139
Oh, so no information here: every engineering record has the same major field. The detail must be nested in another column, I guess, so I’ll have to group by field.
phds %>%
  filter(broad_field == "Engineering") %>%
  group_by(field) %>%
  summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
  arrange(desc(n_phds)) %>%
  datatable(colnames = c("Field", "Number of PhDs")) %>%
  formatRound("n_phds", digits = 0)
Computer engineering PhDs are the most popular, with about twice as many as the next field on the list. Environmental engineering is the second most popular. That’s impressive. Let’s visualise the counts.
phds %>%
  filter(broad_field == "Engineering") %>%
  group_by(field) %>%
  summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
  ggplot(aes(reorder(field, n_phds), n_phds)) +
  geom_col() +
  coord_flip() +
  labs(y = "Number of PhDs", x = "Field (Engineering only)")
The data gives me an opportunity to see how these numbers changed with the rise in popularity of computer engineering. I’ve heard numerous times that its popularity has increased over the years.
# ggrepel for text labels
library(ggrepel)
phds %>%
  filter(broad_field == "Engineering") %>%
  mutate(label = if_else(year == max(year), field, NA_character_)) %>%
  ggplot(aes(x = year, y = n_phds, colour = field)) +
  geom_line() +
  scale_x_continuous(breaks = seq(from = 2008, to = 2017, by = 1)) +
  geom_label_repel(aes(label = label),
                   nudge_x = 1,
                   na.rm = TRUE) +
  labs(x = "Year", y = "Number of PhDs") +
  theme(legend.position = "none")
## Warning: Removed 20 row(s) containing missing values (geom_path).
## Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
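Both warnings can probably be handled in the plotting call itself. The sketch below assumes a recent version of ggrepel, where `geom_label_repel()` accepts `max.overlaps`, and uses a made-up `toy` tibble rather than the real data. Passing `na.rm = TRUE` to `geom_line()` should silence the missing-value warning, and `max.overlaps = Inf` asks ggrepel to draw every label even when they crowd each other.

```r
library(dplyr)
library(ggplot2)
library(ggrepel)

# Made-up counts with one missing value, for illustration only.
toy <- tibble::tribble(
  ~field,    ~year, ~n_phds,
  "Field A",  2008,      10,
  "Field A",  2009,      NA,
  "Field A",  2010,      30,
  "Field B",  2008,      20,
  "Field B",  2009,      25,
  "Field B",  2010,      15
)

p <- toy %>%
  # Label each line only at its last year.
  mutate(label = if_else(year == max(year), field, NA_character_)) %>%
  ggplot(aes(x = year, y = n_phds, colour = field)) +
  geom_line(na.rm = TRUE) +              # drop missing values without a warning
  geom_label_repel(aes(label = label),
                   nudge_x = 1,
                   na.rm = TRUE,
                   max.overlaps = Inf) + # never discard a label for overlapping
  labs(x = "Year", y = "Number of PhDs") +
  theme(legend.position = "none")

p
```

With many fields the labels may still be hard to read, which is one reason to narrow the plot down to a handful of fields.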
phds_top_engineering = phds %>%
  filter(broad_field == "Engineering") %>%
  group_by(field) %>%
  summarise(n_phds = sum(n_phds)) %>%
  filter(n_phds > 100) %>%
  slice_max(order_by = n_phds, n = 6)
phds_top_engineering
## # A tibble: 6 × 2
## field n_phds
## <chr> <dbl>
## 1 Computer engineering 4030
## 2 Environmental, environmental health engineeringl 2001
## 3 Engineering, other 1488
## 4 Nuclear engineering 1166
## 5 Operations research (engineering) 985
## 6 Systems engineering 924
phds %>%
  filter(field %in% phds_top_engineering$field) %>%
  ggplot(aes(x = year, y = n_phds, fill = field)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(labels = scales::label_number(accuracy = 1)) +
  scale_fill_manual(values = MetBrewer::met.brewer("Hokusai1", 6)) +
  facet_wrap( ~ field) +
  labs(x = "Year", y = "Number of PhDs", fill = "Field")
Computer engineering has been popular throughout. I didn’t expect that.
But wait, wasn’t there a computer science in broad_field? What was that? It was called “Mathematics and computer sciences”.
phds %>%
  filter(broad_field == "Mathematics and computer sciences") %>%
  group_by(major_field) %>%
  summarise(n_phds = sum(n_phds, na.rm = TRUE)) %>%
  arrange(desc(n_phds)) %>%
  datatable(colnames = c("Major Field", "Number of PhDs"),
            rownames = FALSE,
            caption = "Mathematics and computer sciences has two fields.") %>%
  formatRound("n_phds", digits = 0)
phds %>%
  filter(broad_field == "Mathematics and computer sciences") %>%
  filter(n_phds >= 300) %>%
  mutate(label = if_else(year == max(year), field, NA_character_)) %>%
  ggplot(aes(x = year, y = n_phds, colour = field)) +
  geom_line() +
  scale_x_continuous(breaks = seq(from = 2008, to = 2017, by = 1)) +
  geom_label_repel(aes(label = label),
                   nudge_x = 1,
                   na.rm = TRUE) +
  labs(x = "Year", y = "Number of PhDs") +
  theme(legend.position = "none")
Computer engineering averaged around 400 PhDs a year; computer science averaged around 1,500. I think this is the “computer science” of general parlance.
This exploration is incomplete. I couldn’t finish it in time, but I’ll get back to it someday.
Today I found this wonderful visualisation on Twitter that I thought I’d replicate for the number of PhDs by field.
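Yearly averages like these can be computed rather than read off a chart. The following is a minimal sketch on a made-up `toy` tibble (the field names and counts are invented for illustration). On the real data, one would first filter `phds` to the fields of interest.

```r
library(dplyr)

# Made-up yearly counts, for illustration only.
toy <- tibble::tribble(
  ~field,    ~year, ~n_phds,
  "Field A",  2008,      10,
  "Field A",  2009,      20,
  "Field A",  2010,      NA,
  "Field B",  2008,     100,
  "Field B",  2009,     200,
  "Field B",  2010,     300
)

# Mean PhDs per year for each field, ignoring years with no recorded count.
avg <- toy %>%
  group_by(field) %>%
  summarise(
    mean_per_year   = mean(n_phds, na.rm = TRUE),
    years_with_data = sum(!is.na(n_phds))
  )

avg
```

Keeping `years_with_data` next to the mean shows how many observations each average rests on.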
library(tweetrmd)
tweet_screenshot("https://twitter.com/jenjentro/status/1512997114896269312?t=nWQqyQa3tHQVNSHPakh2TA")
Her code is available on GitHub.
# Loading packages
library(tidytuesdayR)
library(tidylog)
##
## Attaching package: 'tidylog'
## The following objects are masked from 'package:dplyr':
##
## add_count, add_tally, anti_join, count, distinct, distinct_all,
## distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
## full_join, group_by, group_by_all, group_by_at, group_by_if,
## inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
## relocate, rename, rename_all, rename_at, rename_if, rename_with,
## right_join, sample_frac, sample_n, select, select_all, select_at,
## select_if, semi_join, slice, slice_head, slice_max, slice_min,
## slice_sample, slice_tail, summarise, summarise_all, summarise_at,
## summarise_if, summarize, summarize_all, summarize_at, summarize_if,
## tally, top_frac, top_n, transmute, transmute_all, transmute_at,
## transmute_if, ungroup
## The following objects are masked from 'package:tidyr':
##
## drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
## spread, uncount
## The following object is masked from 'package:stats':
##
## filter
library(showtext)
## Loading required package: sysfonts
## Loading required package: showtextdb