I think I have learnt, forgotten, and then re-learnt this feature of R a few times in the past 4 years. The idea (I think) is this:
- R allows you to pass functions as arguments
- This means the function used inside another function can be swapped out
So what the hell does that mean?
Well, I think I can summarise it down to this crazy piece of magic:
my_fun <- function(x, fun){
  fun(x)
}
Now we can pass in some input, and any function.
Let’s take the storms data from dplyr.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
storms
## # A tibble: 10,010 x 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropi… -1 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropi… -1 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropi… -1 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropi… -1 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropi… -1 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropi… -1 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropi… 0 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropi… 0 40 1002
## # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>,
## # hu_diameter <dbl>
Let’s take the mean of wind:
my_fun(storms$wind, mean)
## [1] 53.495
And we can also take the standard deviation, the variance, or the median:
my_fun(storms$wind, sd)
## [1] 26.21387
my_fun(storms$wind, var)
## [1] 687.1668
my_fun(storms$wind, median)
## [1] 45
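The function passed in does not need to have a name. Here is a minimal sketch, assuming a made-up vector of wind speeds (wind_speeds is not from the storms data) and repeating my_fun so the snippet runs on its own:

```r
# Same helper as above, repeated so this snippet is self-contained
my_fun <- function(x, fun) {
  fun(x)
}

# Made-up wind speeds, standing in for storms$wind
wind_speeds <- c(25, 30, 35, 40, 65)

# An anonymous function: the gap between the largest and smallest value
my_fun(wind_speeds, function(v) max(v) - min(v))
## [1] 40

# A built-in function goes in exactly the same slot
my_fun(wind_speeds, median)
## [1] 35
```

Anything that can be called with one argument will do here, whether it is built in, written earlier, or defined on the spot.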
Why would you want to do this?
Let’s say you want to summarise the storms data further, for each month.
We take storms, group by month, then take the mean of wind for each month.
storms %>%
  group_by(month) %>%
  summarise(wind_summary = mean(wind))
## # A tibble: 10 x 2
## month wind_summary
## <dbl> <dbl>
## 1 1 45.7
## 2 4 44.6
## 3 5 36.3
## 4 6 37.8
## 5 7 41.2
## 6 8 52.1
## 7 9 58.0
## 8 10 54.6
## 9 11 52.5
## 10 12 47.9
You could repeat the code again, changing mean to, say, sd:
storms %>%
  group_by(month) %>%
  summarise(wind_summary = sd(wind))
## # A tibble: 10 x 2
## month wind_summary
## <dbl> <dbl>
## 1 1 9.08
## 2 4 5.94
## 3 5 9.57
## 4 6 13.4
## 5 7 19.1
## 6 8 26.0
## 7 9 28.2
## 8 10 25.3
## 9 11 22.0
## 10 12 14.6
Over the years, every time I repeat some code like this, I have felt a tug somewhere in my brain - a little spidey sense saying (something like): “Don’t repeat yourself, Nick”.
We can avoid repeating ourselves by using the template from earlier with dplyr. The part we want to vary is the summary function (mean), so that we could also take the median, variance, and so on.
We can write the following:
storms_wind_summary <- function(fun){
  storms %>%
    group_by(month) %>%
    summarise(wind_summary = fun(wind))
}
And now we can pass the function name, say, mean.
storms_wind_summary(mean)
## # A tibble: 10 x 2
## month wind_summary
## <dbl> <dbl>
## 1 1 45.7
## 2 4 44.6
## 3 5 36.3
## 4 6 37.8
## 5 7 41.2
## 6 8 52.1
## 7 9 58.0
## 8 10 54.6
## 9 11 52.5
## 10 12 47.9
Or, any other function!
storms_wind_summary(sd)
## # A tibble: 10 x 2
## month wind_summary
## <dbl> <dbl>
## 1 1 9.08
## 2 4 5.94
## 3 5 9.57
## 4 6 13.4
## 5 7 19.1
## 6 8 26.0
## 7 9 28.2
## 8 10 25.3
## 9 11 22.0
## 10 12 14.6
storms_wind_summary(var)
## # A tibble: 10 x 2
## month wind_summary
## <dbl> <dbl>
## 1 1 82.5
## 2 4 35.3
## 3 5 91.5
## 4 6 180.
## 5 7 365.
## 6 8 678.
## 7 9 793.
## 8 10 638.
## 9 11 482.
## 10 12 213.
storms_wind_summary(median)
## # A tibble: 10 x 2
## month wind_summary
## <dbl> <dbl>
## 1 1 50
## 2 4 45
## 3 5 35
## 4 6 35
## 5 7 37.5
## 6 8 45
## 7 9 50
## 8 10 50
## 9 11 50
## 10 12 45
We could even make our own!
range_diff <- function(x){
  diff(range(x))
}
storms_wind_summary(range_diff)
## # A tibble: 10 x 2
## month wind_summary
## <dbl> <int>
## 1 1 25
## 2 4 15
## 3 5 35
## 4 6 70
## 5 7 130
## 6 8 140
## 7 9 145
## 8 10 145
## 9 11 120
## 10 12 50
Looks like there was a pretty huge range in July through to November!
Pretty neat, eh? You can manipulate the function itself!
Going slightly further
The above was an example demonstrating how you can manipulate a function being passed.
But, there are other ways to do this with dplyr that I might use instead.
We could use summarise_at here, to specify a function in a different, equivalent, way.
storms_wind_summary <- function(fun){
  storms %>%
    group_by(month) %>%
    summarise_at(.vars = vars(wind),
                 .funs = list(fun))
}
storms_wind_summary(mean)
## # A tibble: 10 x 2
## month wind
## <dbl> <dbl>
## 1 1 45.7
## 2 4 44.6
## 3 5 36.3
## 4 6 37.8
## 5 7 41.2
## 6 8 52.1
## 7 9 58.0
## 8 10 54.6
## 9 11 52.5
## 10 12 47.9
storms_wind_summary(median)
## # A tibble: 10 x 2
## month wind
## <dbl> <dbl>
## 1 1 50
## 2 4 45
## 3 5 35
## 4 6 35
## 5 7 37.5
## 6 8 45
## 7 9 50
## 8 10 50
## 9 11 50
## 10 12 45
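As an aside, dplyr 1.0.0 and later mark summarise_at as superseded in favour of across(). Here is a rough sketch of the same wrapper written that way, assuming a recent dplyr is installed. The name storms_wind_across is made up for this example and is not from the post above.

```r
library(dplyr)

# Hypothetical name: the same idea as storms_wind_summary, using across()
storms_wind_across <- function(fun) {
  storms %>%
    group_by(month) %>%
    summarise(across(wind, fun))  # apply fun to the wind column per month
}

storms_wind_across(median)
```

As with summarise_at, the summary column keeps the name wind, so this should be a drop-in alternative rather than a change in behaviour.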
What if we want to provide many functions? Say, the mean, median, sd, variance, all together, how they belong?
We can do this.
This is done by passing dots (ellipsis) ... to the function. This allows for any number of inputs.
storms_wind_summary <- function(...){
  storms %>%
    group_by(month) %>%
    summarise_at(.vars = vars(wind),
                 .funs = list(...))
}
storms_wind_summary(median, mean, max)
## # A tibble: 10 x 4
## month fn1 fn2 fn3
## <dbl> <dbl> <dbl> <int>
## 1 1 50 45.7 55
## 2 4 45 44.6 50
## 3 5 35 36.3 60
## 4 6 35 37.8 80
## 5 7 37.5 41.2 140
## 6 8 45 52.1 150
## 7 9 50 58.0 160
## 8 10 50 54.6 160
## 9 11 50 52.5 135
## 10 12 45 47.9 75
What’s the point of this?
So, this might not be the most useful summary of the storms data…and writing functions like this might not be the most general use case. dplyr provides some amazingly flexible syntax to summarise data. Sometimes the answer isn’t writing a function, and you want to be mindful of replicating the flexibility of dplyr itself.
That said, with a task like this, or any section of code, I really think it can be useful to wrap it in a function, which describes more broadly what that section does. And, with features like the ones I wrote about here, I think that you can more clearly and flexibly wrap up these tasks for your own use.
R is flexible enough to make that quite straightforward, and I think that is pretty darn neat!
Fin
Go forth, and use the power of functions in your work!
PS
Upon reflection, I’m pretty sure Mitchell O’Hara-Wild was the one who helped really solidify this into my brain. Thanks, Mitch!