The other day a colleague of mine ran into an issue where NA values were appearing in their dataset that weren’t there before - it was strange stuff!
A close look revealed some interesting things, that I thought might be worthwhile briefly going into. It is to do with the use of []
when subsetting, and how this is different to base::subset
and dplyr::filter
.
Let’s start by creating a dataset, df
. This contains countries and an airport
df <- data.frame(
country = c(NA, "AUS", "NZL", "USA "),
airport = c(NA, "BNE", "CHC", " SFO")
)
df
#> country airport
#> 1 <NA> <NA>
#> 2 AUS BNE
#> 3 NZL CHC
#> 4 USA SFO
If we want to just look at rows that contain “AUS”, we can do the following:
df[df$country == "AUS", ]
Which is asking to only choose rows where country is “AUS”.
However, when we run this, we get something unexpected
df[df$country == "AUS", ]
#> country airport
#> NA <NA> <NA>
#> 2 AUS BNE
We get an NA row?
What? Weird?
Under the hood, ==
is identifying rows that match this statement.
You could also do this:
df[c(FALSE, TRUE, FALSE, FALSE), ]
#> country airport
#> 2 AUS BNE
But that’s too hard to manually create those vectors for datasets - we instead get that result out by running it inside the []
:
df[df$country == "AUS", ]
#> country airport
#> NA <NA> <NA>
#> 2 AUS BNE
So why the NA row?
Well, we do have an NA row in the dataset:
df
#> country airport
#> 1 <NA> <NA>
#> 2 AUS BNE
#> 3 NZL CHC
#> 4 USA SFO
And we can check what the output of df$country == "AUS"
is:
where_match_aus <- df$country == "AUS"
where_match_aus
#> [1] NA TRUE FALSE FALSE
This shows us the same as what we saw above
df[where_match_aus, ]
#> country airport
#> NA <NA> <NA>
#> 2 AUS BNE
df[c(NA, TRUE, FALSE, FALSE), ]
#> country airport
#> NA <NA> <NA>
#> 2 AUS BNE
But what is weird about this is that you can use NA
inside []
when subsetting:
df[NA, ]
#> country airport
#> NA <NA> <NA>
#> NA.1 <NA> <NA>
#> NA.2 <NA> <NA>
#> NA.3 <NA> <NA>
And you get a whole lot of weird rows now? Strange, right?
And what happens when our filter is wrong? Remember, “USA” has a trailing space in it, so if we write out “USA”, we get:
df[df$country == "USA", ]
#> country airport
#> NA <NA> <NA>
Since we get NA
when we do:
df$country == "USA"
#> [1] NA FALSE FALSE FALSE
What about base::subset or dplyr::filter
For what it’s worth, this issue doesn’t appear inside base::subset
or dplyr::filter
- I guess this means the turn NA
into FALSE?
# vs subset
subset(df, country == "AUS")
#> country airport
#> 2 AUS BNE
subset(df, country == "USA")
#> [1] country airport
#> <0 rows> (or 0-length row.names)
subset(df, NA)
#> [1] country airport
#> <0 rows> (or 0-length row.names)
# vs filter
dplyr::filter(df, country == "AUS")
#> country airport
#> 1 AUS BNE
dplyr::filter(df, country == "USA")
#> [1] country airport
#> <0 rows> (or 0-length row.names)
dplyr::filter(df, NA)
#> [1] country airport
#> <0 rows> (or 0-length row.names)
I have more questions, but I have to go
I feel like there are more questions I have about this, and I am probably missing some important details, but I just thought this was interesting!