The tidyverse team recently completed a 1.0.0 release for dplyr
, which was a pretty big deal, and it included a bunch of new features. One of the things that I really enjoyed was that they wrote a series of blog posts describing new features in the release. This was great, because we got to see what was coming up, and great because people tried them out and gave them feedback. Then, the tidyverse listened, and changed behaviour based on feedback from the community.
Isn’t that great?
Let’s celebrate something from the tidyverse today: rowwise
. This function has actually been around for a while, but I never really used it, for some reason. A student recently had an issue where they had data like this:
library(tibble)
income <- tibble(income_range = c("0-74",
"75-145",
"150-325",
"325+"),
count = c(125,
170,
215,
250))
income
#> # A tibble: 4 x 2
#> income_range count
#> <chr> <dbl>
#> 1 0-74 125
#> 2 75-145 170
#> 3 150-325 215
#> 4 325+ 250
They wanted to calculate the median of income_range
.
This presents an interesting problem, with a few steps:
- Separate the range values into two columns.
- Calculate the median of each of those pairs of numbers.
We can get ourselves into a better position by separating out income_range
into two columns, lower
, and upper
, and converting the contents. We can use separate
from tidyr
. It is kind of magical. While you can specify a specific thing that separates the numbers, separate
has a nice bit og magic that just finds the most likely character to separate on.
library(tidyr)
income_sep <- income %>%
separate(col = income_range,
into = c("lower", "upper"),
convert = TRUE)
income_sep
#> # A tibble: 4 x 3
#> lower upper count
#> <int> <int> <dbl>
#> 1 0 74 125
#> 2 75 145 170
#> 3 150 325 215
#> 4 325 NA 250
So now have a lower and an upper range of values, and we want to calculate the median of these.
This…gets a little bit tricky.
At first instinct, you might try something like this:
calculate the median based on the lower and upper columns:
library(dplyr, warn.conflicts = FALSE)
income_sep %>%
mutate(med = median(c(lower, upper), na.rm = TRUE))
#> # A tibble: 4 x 4
#> lower upper count med
#> <int> <int> <dbl> <int>
#> 1 0 74 125 145
#> 2 75 145 170 145
#> 3 150 325 215 145
#> 4 325 NA 250 145
But this doesn’t give us what we want. It just gives us the median of the vector, I think?
Anyway, how do we solve this?
We can now call rowwise()
and calculate the median based on the lower
and upper
, and it will consider each row and take the median of those two numbers:
income_sep %>%
rowwise() %>%
mutate(med = median(c(lower, upper), na.rm = TRUE))
#> # A tibble: 4 x 4
#> # Rowwise:
#> lower upper count med
#> <int> <int> <dbl> <dbl>
#> 1 0 74 125 37
#> 2 75 145 170 110
#> 3 150 325 215 238.
#> 4 325 NA 250 325
And I think that’s pretty neat.
Thanks, tidyverse team!