R dplyr – making ‘count’ function work in summarize_at chain

You just need to be careful about the functions you are passing. Any extra argument you give to summarize_at, such as na.rm = TRUE, is passed on to every function in your list, so you can only add it if all of them accept it. length does not take an na.rm argument, so passing na.rm = TRUE to it throws an error.
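To see the mismatch in isolation, here is a minimal base R sketch. The vector x is made up for illustration and is not from your data; mean accepts na.rm, while length rejects any extra argument.

```r
# Hypothetical example vector with one missing value
x <- c(1, NA, 3)

# mean() has an na.rm argument, so this works
mean_result <- mean(x, na.rm = TRUE)

# length() takes exactly one argument, so adding na.rm = TRUE is an error;
# tryCatch() captures the error so the script keeps running
length_result <- tryCatch(length(x, na.rm = TRUE),
                          error = function(e) "error")

print(mean_result)
print(length_result)
```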

One way round this is to create a wrapper around length that takes an na.rm parameter:

# length() has no na.rm argument, so the wrapper accepts one and ignores it
funsList <- list(count = function(x, na.rm = FALSE) length(x),
                 mean = mean,
                 sum = sum,
                 median = median,
                 max = max,
                 min = min)

(Note that you can use unquoted function names directly, rather than storing them all as character strings and looking them up with match.fun.)
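As a rough check that every function in such a list now tolerates na.rm = TRUE, the sketch below calls each one on a made-up vector using base R only. The vector x and the sapply call are illustrative assumptions, not part of your pipeline. Note that the wrapper ignores na.rm, so the count still includes the NA.

```r
# Same shape of list as above; the count wrapper accepts and ignores na.rm
funsList <- list(count = function(x, na.rm = FALSE) length(x),
                 mean = mean,
                 sum = sum,
                 median = median,
                 max = max,
                 min = min)

# Hypothetical vector with one missing value
x <- c(4, NA, 10)

# Call every function with na.rm = TRUE, as summarize_at would
results <- sapply(funsList, function(f) f(x, na.rm = TRUE))
print(results)
```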

This approach works, but it reveals an additional problem in your code. Some of the groups you are summarizing have no non-NA entries, so you are effectively calling min(NA, na.rm = TRUE) on those groups. This raises a warning and returns Inf rather than NA, which you probably don’t want. Similarly, max returns an unwanted -Inf, mean returns NaN and sum returns 0 for those groups. (median already returns NA in this case.)
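The behaviour on an all-NA group can be reproduced in base R. This sketch uses a made-up vector, and suppressWarnings is there only to keep the output quiet; it shows what each summary function returns once the NAs have been removed and nothing is left.

```r
# Hypothetical group in which every value is missing
all_missing <- c(NA_real_, NA_real_)

# min() and max() warn about having no non-missing arguments
min_result    <- suppressWarnings(min(all_missing, na.rm = TRUE))  # Inf
max_result    <- suppressWarnings(max(all_missing, na.rm = TRUE))  # -Inf

mean_result   <- mean(all_missing, na.rm = TRUE)    # NaN
sum_result    <- sum(all_missing, na.rm = TRUE)     # 0
median_result <- median(all_missing, na.rm = TRUE)  # NA

print(c(min = min_result, max = max_result, mean = mean_result,
        sum = sum_result, median = median_result))
```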

The solution is to be specific about what you want each function to do in this scenario. For example, you can create a little function that takes a summary function as its argument and returns an NA-safe version of it:

# Wrap a summary function so that an all-NA input gives NA
# rather than Inf, -Inf or NaN
handle_NA <- function(func) {
  function(x) if (all(is.na(x))) NA else func(x, na.rm = TRUE)
}
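A quick way to convince yourself that the wrapper behaves as intended is to try it on two made-up vectors, one entirely NA and one mixed. The names safe_min, empty_group and mixed_group are hypothetical and used only for this sketch.

```r
# Same wrapper as above, repeated so the sketch is self-contained
handle_NA <- function(func) {
  function(x) if (all(is.na(x))) NA else func(x, na.rm = TRUE)
}

safe_min <- handle_NA(min)

empty_group <- c(NA_real_, NA_real_)  # no non-NA entries
mixed_group <- c(5, NA, 2)            # one missing value

empty_result <- safe_min(empty_group)  # NA, with no warning
mixed_result <- safe_min(mixed_group)  # 2, the NA is dropped

print(empty_result)
print(mixed_result)
```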

This allows you to create a safe funsList like so:

funsList <- list(count = length,
                 mean = handle_NA(mean),
                 sum = handle_NA(sum),
                 median = handle_NA(median), 
                 max = handle_NA(max), 
                 min = handle_NA(min))

# data, dimensionsVec and measuresVec are assumed to be defined as in the question
data %>%
  group_by(across(all_of(dimensionsVec))) %>%
  summarize(across(all_of(measuresVec), funsList), .groups = "drop")
#> # A tibble: 53 x 14
#>    skin_color eye_color height_count height_mean height_sum height_median
#>    <chr>      <chr>            <int>       <dbl>      <int>         <dbl>
#>  1 blue       blue                 1        196         196          196 
#>  2 blue       hazel                1        178         178          178 
#>  3 blue, grey yellow               2        116.        231          116.
#>  4 brown      blue                 1        234         234          234 
#>  5 brown      brown                2        130.        259          130.
#>  6 brown      yellow               1        198         198          198 
#>  7 brown mot~ orange               1        180         180          180 
#>  8 brown, wh~ green, y~            1        216         216          216 
#>  9 dark       blue                 1        184         184          184 
#> 10 dark       brown                4        183.        733          184 
#> # ... with 43 more rows, and 8 more variables: height_max <int>,
#> #   height_min <int>, mass_count <int>, mass_mean <dbl>, mass_sum <dbl>,
#> #   mass_median <dbl>, mass_max <dbl>, mass_min <dbl>

Note that the scoped verbs summarize_at and group_by_at have been superseded by across, so I have switched to the more modern syntax.
