
# group_by

Complete the group_by sections of the data-wrangling-and-analysis-exercises.Rmd file.

Solutions

Video

Code

General Data Wrangling and Analysis Resources

Because most material on data wrangling and analysis with the dplyr package covers all of the verbs discussed in this course, I have chosen not to separate the resources by lesson. Instead, here are some helpful resources for learning more about all of the tidyverse verbs discussed in this course:

Chapter 5 of R for Data Science

RStudio Cloud primer on working with data

Tidyverse for Beginners by Danielle Navarro

Learning Statistics with R by Danielle Navarro

Introduction to the Tidyverse by Alison Hill

A gRadual intRoduction to data wRangling by Chester Ismay and Ted Laderas

Working in the Tidyverse by Desi Quintans and Jeff Powell

Christine Monnier video tutorials on dplyr

#### Have any questions? Put them below and we will help you out!

1. Hello, I have a question. How could I get the subtotals by group in the same data frame that we obtain when we use group_by + summarise?

We get something like this, but I would like to see in the same table the subtotals (by gender) and the grand total (which should be 10,000):

female Looking 6.940299 135
female NotWorking 7.094077 1732
female Working 6.909353 2086
female NA NaN 1067
male Looking 7.147727 176
male NotWorking 7.101619 1115
male Working 6.736634 2527
male NA 6.000000 1162

2. I made a short video to show how you could do this. You can also find the code that I used here. Hope that helps! If you have other questions, let me know.
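For readers without access to the video, here is one possible approach, sketched under assumptions: summarise at each level of grouping and stack the results with `bind_rows()`. The `survey` tibble and its values are made up for illustration (a stand-in for `nhanes`), and this is not necessarily the method used in the video.

```r
library(dplyr)

# Hypothetical stand-in for the nhanes data; values are invented
survey <- tibble(
  gender = c("female", "female", "female", "male", "male", "male"),
  work = c("Working", "NotWorking", "Working", "Working", "Looking", "Working"),
  sleep_hrs_night = c(7, 8, 6, 5, 7, 6)
)

# Detail rows: one row per gender/work combination
by_gender_work <- survey %>%
  group_by(gender, work) %>%
  summarise(mean_sleep = mean(sleep_hrs_night, na.rm = TRUE), n = n()) %>%
  ungroup()

# Subtotal rows: one row per gender, labelled in the work column
by_gender <- survey %>%
  group_by(gender) %>%
  summarise(mean_sleep = mean(sleep_hrs_night, na.rm = TRUE), n = n()) %>%
  mutate(work = "Subtotal")

# Grand total: a single row for the whole data frame
grand_total <- survey %>%
  summarise(mean_sleep = mean(sleep_hrs_night, na.rm = TRUE), n = n()) %>%
  mutate(gender = "All", work = "Grand total")

# Stack the three summaries into one table
with_totals <- bind_rows(by_gender_work, by_gender, grand_total)
with_totals
```

Because each summary is computed from the raw rows, the subtotal and grand-total means are true means of the underlying observations rather than averages of the group means.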

3. Hey, I am getting the following message in my console when I run anything with group_by:

## `summarise()` has grouped output by ‘gender’. You can override using the `.groups` argument.

The output seems fine and I googled to find a way to suppress it (message = FALSE) but just want to double check that nothing is wrong.

1. That’s perfect. Good to know that I can sort of toggle between keeping the group or not, depending on the need.

So, if I’m getting it correctly, I should basically add .groups = “keep” when I want all groups to show (which is probably the vast majority of the time) but if I wanted a single specific detail, like a max, I would want the groups to be dropped, right?

And at a much more basic level, I shouldn’t even really worry about it (i.e. keep default arguments) if I don’t really care about the message popping up, right? (…as I can usually suppress it with message = FALSE)

1. Yeah, I’d say mostly just don’t even worry about it. I never set anything, as I’ve been using R long enough that it wasn’t an option when I started so I just don’t think in that way. I’d say go with whatever works best for you. And yes, no need to worry about the message.

1. Awesome. Thanks!

2. Hey David and Blayne,

I wanted to chime in on the .groups argument of summarise(). It’s currently labelled “lifecycle:experimental” which means theoretically how it works could change, and there’s a small chance the feature would be removed. I’m grumpy about this feature and deliberately don’t use it, but that’s my personal opinion.
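To make the `.groups` discussion concrete, here is a small sketch using a made-up `survey` tibble (a hypothetical stand-in for `nhanes`). The summary rows are the same in all three cases; what changes is the grouping left on the result, which only matters for whatever you do next in the pipeline.

```r
library(dplyr)

# Hypothetical stand-in for the nhanes data; values are invented
survey <- tibble(
  gender = c("female", "female", "male", "male"),
  work = c("Working", "NotWorking", "Working", "Looking"),
  sleep_hrs_night = c(7, 8, 5, 7)
)

# Default: the last grouping variable (work) is dropped and the
# result stays grouped by gender, which is what the message reports
default_result <- survey %>%
  group_by(gender, work) %>%
  summarise(n = n())

# .groups = "drop" returns an ungrouped result
dropped <- survey %>%
  group_by(gender, work) %>%
  summarise(n = n(), .groups = "drop")

# .groups = "keep" leaves both grouping variables in place
kept <- survey %>%
  group_by(gender, work) %>%
  summarise(n = n(), .groups = "keep")

group_vars(default_result)
group_vars(dropped)
group_vars(kept)
```

If you prefer not to use `.groups` at all, adding `ungroup()` after `summarise()` gives the same ungrouped result as `.groups = "drop"`.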

4. In this lesson, you opted to use number_of_observations for one of the calculations but not the other. Was there any particular reason behind that?

1. No particular reason. I was just wondering when I did the analysis!

5. Hello! I was doing the exercises without a problem, but all of a sudden at this stage (when I try to use group_by) I am getting a repeated error message: “Error: attempt to use zero-length variable name.” I closed and reopened, but now nothing seems to be working.
nhanes %>%
group_by(gender) %>%
Error: attempt to use zero-length variable name

(I know that I’m a bit behind this week, so don’t expect a quick answer!)

In my code, also, the variables and such use capital letters rather than underscores. I believe you mentioned this syntax difference in class but cannot remember whether it is an issue. Thank you!

1. It’s hard to say without seeing your full code, but I’m guessing this is because you didn’t use the `clean_names()` function when you read in your nhanes object and so the variable name is Gender with a capital G. If you want to post your full code as a GitHub Gist and then post the link in response I can confirm if that’s the issue.

1. Thank you, David!
I think the problem might have been me just messing up the ```{r} chunk header by accident. I did rerun the clean_names() function and load all my packages again today, and things seem to be working now. Thanks so much.

6. Hi,
I followed the instructions (below, in line 305). I added the function “filter” because you said “(whether or not respondents are working)”; however, in the solution, you don’t use “filter”. Maybe I didn’t understand the assignment?

We can use `group_by` with multiple groups.

Use `group_by` for `gender` and `work` (whether or not respondents are working) before calculating mean hours of sleep.

```{r}
nhanes %>%
  group_by(gender, work) %>%
  dplyr::filter(work %in% c("Working", "NotWorking")) %>%
  summarize(mean_sleep = mean(sleep_hrs_night, na.rm = TRUE))
```

Thank you very much for your help.

1. Hi Sara,
In this video David uses group_by() to summarise the data by their working status, meaning that when summarise() is used we calculate the values for all different working statuses. We would need to use filter() if we were interested in only specific values in the work column. Does that help?
Cheers, Charlie

1. Can you explain this more with an example of the code where we would filter work? I am thinking this would be appropriate to add to the last solutions code:
filter(work == "Looking" | work == "NotWorking" | work == "Working")
Would this go before group by? Before summarize? Does the order matter?

1. That code should work. You can put it before the `group_by()` or between the `group_by()` and the `summarize()`. Either way, the summary will only include rows that are not filtered out.
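As a quick check of the point about ordering, the sketch below runs the same filter before `group_by()` and again between `group_by()` and `summarize()`. The `survey` tibble is a made-up stand-in for `nhanes`, so the numbers are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for the nhanes data; values are invented
survey <- tibble(
  gender = c("female", "female", "female", "male", "male", "male"),
  work = c("Working", "NotWorking", "Looking", "Working", "Working", NA),
  sleep_hrs_night = c(7, 8, 6, 5, 7, 6)
)

# Filter first, then group and summarize
filter_first <- survey %>%
  filter(work %in% c("Working", "NotWorking")) %>%
  group_by(gender, work) %>%
  summarize(mean_sleep = mean(sleep_hrs_night, na.rm = TRUE)) %>%
  ungroup()

# Group first, then filter, then summarize
filter_after_grouping <- survey %>%
  group_by(gender, work) %>%
  filter(work %in% c("Working", "NotWorking")) %>%
  summarize(mean_sleep = mean(sleep_hrs_night, na.rm = TRUE)) %>%
  ungroup()

# Both orders give the same summary table
all.equal(filter_first, filter_after_grouping)
```

The order stops being interchangeable once the filter depends on the groups themselves, for example `filter(sleep_hrs_night > mean(sleep_hrs_night))`, which compares each row to its own group's mean when run on grouped data.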

7. Hi David –

Quick question. Let’s say I had state level data with numerical values per state. Let’s say I wanted to assign each state to relevant regions, like northeast, pacific, etc. How can I first assign the states to the object, then perform a group by, to then summarize the numerical values aggregated to the regions I assigned them to? Thanks

1. Hello!

This is a great question. To do so, you need to decide which states belong to which regions and then combine left_join() and group_by(). In this video tutorial I made use of the {tigris} package for an authoritative decision on regions and divisions. The code I wrote can be found in this gist. Please let me know if you have any questions.

Cheers,

Charlie
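A minimal sketch of the `left_join()` plus `group_by()` pattern described above. It uses a small hand-written lookup table rather than {tigris}, and the `state_values` numbers are invented for illustration, so treat it as an outline of the idea rather than the code from the gist.

```r
library(dplyr)

# Hypothetical state-level data; the values are made up
state_values <- tibble(
  state = c("Maine", "Vermont", "Oregon", "Washington"),
  value = c(10, 20, 30, 40)
)

# Hand-written lookup table assigning each state to a region
state_regions <- tibble(
  state = c("Maine", "Vermont", "Oregon", "Washington"),
  region = c("Northeast", "Northeast", "West", "West")
)

# Attach the region to each state, then aggregate by region
region_totals <- state_values %>%
  left_join(state_regions, by = "state") %>%
  group_by(region) %>%
  summarise(total_value = sum(value), mean_value = mean(value))

region_totals
```

Any state missing from the lookup table would end up with `NA` in `region` after the join, so it is worth checking for `NA` regions before summarising.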

1. Hi! Thanks, Charlie. I will be looking to apply this concept for a summer project.