Nick logo Credibly Curious

Nick Tierney's (mostly) rstats blog

2021-06-30

Some Unexpected Things About Subsetting

Nicholas Tierney

Categories: rstats Tags: rstats data science dplyr data cleaning

3 minute read

The other day a colleague of mine ran into an issue where NA values were appearing in their dataset that weren’t there before - it was strange stuff!

A close look revealed some interesting things, that I thought might be worthwhile briefly going into. It is to do with the use of [] when subsetting, and how this is different to base::subset and dplyr::filter.

Let’s start by creating a dataset, df. This contains countries and an airport

df <- data.frame(
  country = c(NA, "AUS", "NZL", "USA "),
  airport = c(NA, "BNE", "CHC", " SFO")
)

df
#>   country airport
#> 1    <NA>    <NA>
#> 2     AUS     BNE
#> 3     NZL     CHC
#> 4    USA      SFO

If we want to just look at rows that contain “AUS”, we can do the following:

df[df$country == "AUS", ]

Which is asking to only choose rows where country is “AUS”.

However, when we run this, we get something unexpected

df[df$country == "AUS", ]
#>    country airport
#> NA    <NA>    <NA>
#> 2      AUS     BNE

We get an NA row?

What? Weird?

Under the hood, == is identifying rows that match this statement.

You could also do this:

df[c(FALSE, TRUE, FALSE, FALSE), ]
#>   country airport
#> 2     AUS     BNE

But that’s too hard to manually create those vectors for datasets - we instead get that result out by running it inside the []:

df[df$country == "AUS", ]
#>    country airport
#> NA    <NA>    <NA>
#> 2      AUS     BNE

So why the NA row?

Well, we do have an NA row in the dataset:

df
#>   country airport
#> 1    <NA>    <NA>
#> 2     AUS     BNE
#> 3     NZL     CHC
#> 4    USA      SFO

And we can check what the output of df$country == "AUS" is:

where_match_aus <- df$country == "AUS"
where_match_aus
#> [1]    NA  TRUE FALSE FALSE

This shows us the same as what we saw above

df[where_match_aus, ]
#>    country airport
#> NA    <NA>    <NA>
#> 2      AUS     BNE
df[c(NA, TRUE, FALSE, FALSE), ]
#>    country airport
#> NA    <NA>    <NA>
#> 2      AUS     BNE

But what is weird about this is that you can use NA inside [] when subsetting:

df[NA, ]
#>      country airport
#> NA      <NA>    <NA>
#> NA.1    <NA>    <NA>
#> NA.2    <NA>    <NA>
#> NA.3    <NA>    <NA>

And you get a whole lot of weird rows now? Strange, right?

And what happens when our filter is wrong? Remember, “USA” has a trailing space in it, so if we write out “USA”, we get:

df[df$country == "USA", ]
#>    country airport
#> NA    <NA>    <NA>

Since we get NA when we do:

df$country == "USA"
#> [1]    NA FALSE FALSE FALSE

What about base::subset or dplyr::filter

For what it’s worth, this issue doesn’t appear inside base::subset or dplyr::filter - I guess this means the turn NA into FALSE?

# vs subset
subset(df, country == "AUS")
#>   country airport
#> 2     AUS     BNE
subset(df, country == "USA")
#> [1] country airport
#> <0 rows> (or 0-length row.names)
subset(df, NA)
#> [1] country airport
#> <0 rows> (or 0-length row.names)
# vs filter
dplyr::filter(df, country == "AUS")
#>   country airport
#> 1     AUS     BNE
dplyr::filter(df, country == "USA")
#> [1] country airport
#> <0 rows> (or 0-length row.names)
dplyr::filter(df, NA)
#> [1] country airport
#> <0 rows> (or 0-length row.names)

I have more questions, but I have to go

I feel like there are more questions I have about this, and I am probably missing some important details, but I just thought this was interesting!