group_by() and summarize()

This lesson is called group_by() and summarize(), part of the R in 3 Months (Spring 2025) course.

View code shown in video

# Load Packages -----------------------------------------------------------

library(tidyverse)

# Import Data -------------------------------------------------------------

penguins <- read_csv("penguins.csv")

# group_by() and summarize() ----------------------------------------------

# summarize() becomes truly powerful when paired with group_by(), 
# which enables us to perform calculations on multiple groups. 

# Calculate the mean bill length for penguins on different islands.

penguins |> 
  group_by(island) |> 
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

# We can use group_by() with multiple groups.

penguins |> 
  group_by(island, year) |> 
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE)) 

# Another option is to use the .by argument in summarize().

penguins |> 
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
            .by = c(island, year))

# You can count the number of penguins in each group using the n() summary function.

penguins |> 
  group_by(island) |> 
  summarize(number_of_penguins = n())

# But a simpler way to do this is with the count() function.

penguins |> 
  count(island)

# You can also use count() with multiple groups.

penguins |> 
  count(island, year)
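
One detail worth knowing when grouping by more than one variable: by default, summarize() drops only the last grouping variable, so the result is still grouped by the first one. The sketch below illustrates this with a small made-up tibble (toy_penguins is a hypothetical stand-in for penguins.csv, not the course data) and shows the .groups = "drop" argument as one way to get an ungrouped result.

```r
library(tidyverse)

# Hypothetical toy data standing in for penguins.csv
toy_penguins <- tibble(
  island = c("Biscoe", "Biscoe", "Dream", "Dream", "Torgersen"),
  year = c(2007, 2008, 2007, 2007, 2008),
  bill_length_mm = c(45.1, 47.3, 39.5, NA, 38.9)
)

# Default behavior: the result is still grouped by island
still_grouped <- toy_penguins |>
  group_by(island, year) |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

group_vars(still_grouped)

# .groups = "drop" returns an ungrouped tibble
ungrouped <- toy_penguins |>
  group_by(island, year) |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
            .groups = "drop")

group_vars(ungrouped)
```

If later steps in a pipeline behave unexpectedly after a grouped summary, leftover grouping is a likely place to look; ungroup() is another way to remove it.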

Your Turn

# Load Packages -----------------------------------------------------------

# Load the tidyverse package

library(tidyverse)

# Import Data -------------------------------------------------------------

# Download data from https://rfor.us/penguins
# Copy the data into the RStudio project
# Create a new R script file and add code to import your data

penguins <- read_csv("penguins.csv")
			
# group_by() and summarize() ----------------------------------------------

# Calculate the weight of the heaviest penguin on each island.

# YOUR CODE HERE

# Calculate the weight of the heaviest penguin on each island for each year.

# YOUR CODE HERE

Learn More

To learn more about the group_by() and summarize() functions, check out Chapter 3 of R for Data Science.

Have any questions? Put them below and we will help you out!

gene trevino • January 30, 2025

When I run the following code:

penguins %>% group_by(island, year) %>%
summarize(Heaviest_Penguins = max(body_mass_g, na.rm = TRUE))

I get the following output:

island year Heaviest_Penguins

Why do I get NA for Biscoe and Torgersen?

Thanks

David Keyes Founder • January 31, 2025

Hmm, that's strange. I see something different:

# A tibble: 9 × 3
# Groups:   island [3]
  island     year Heaviest_Penguins
  <fct>     <int>             <int>
1 Biscoe     2007              6300
2 Biscoe     2008              6000
3 Biscoe     2009              6000
4 Dream      2007              4650
5 Dream      2008              4800
6 Dream      2009              4475
7 Torgersen  2007              4675
8 Torgersen  2008              4700
9 Torgersen  2009              4300

Can you share the code you used to import the CSV file?

Pepper Phillips • February 11, 2025

Can you use drop_na instead of na.rm = TRUE?

David Keyes Founder • February 11, 2025

Yes, absolutely! I do that quite often, in fact.
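
For anyone comparing the two approaches, here is a minimal sketch on a small made-up tibble (toy_penguins and its values are hypothetical, not the course data). It names the column inside drop_na() so that only rows missing body_mass_g are removed; calling drop_na() with no arguments would drop rows with a missing value in any column.

```r
library(tidyverse)

# Hypothetical toy data with one missing weight
toy_penguins <- tibble(
  island = c("Biscoe", "Biscoe", "Dream", "Dream"),
  body_mass_g = c(6300, NA, 4650, 4800)
)

# Option 1: ignore missing values inside the summary function
with_na_rm <- toy_penguins |>
  group_by(island) |>
  summarize(heaviest = max(body_mass_g, na.rm = TRUE))

# Option 2: remove rows with a missing weight before summarizing
with_drop_na <- toy_penguins |>
  drop_na(body_mass_g) |>
  group_by(island) |>
  summarize(heaviest = max(body_mass_g))

# Both approaches give the same answer for this data
all.equal(with_na_rm, with_drop_na)
```

The two can differ when a group contains only missing values: drop_na() removes that group from the result entirely, while max() with na.rm = TRUE keeps the group and returns -Inf with a warning.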

Zaynaib Giwa • March 24, 2025

What is the difference between using summarize n()/count() vs tally?

Gracielle Higino Coach • March 24, 2025

The difference is very small, and these functions are all connected somehow!

tally() counts the rows in each group, assuming you've done the grouping beforehand; it's equivalent to df |> summarise(n = n());
count() groups your data by the variables you pass as arguments and then counts the rows with n() (or sum() if you supply the wt argument). It's roughly equivalent to df |> group_by(a, b) |> summarise(n = n());
n() is the basic counting function that returns the size of the current group, and it only works inside dplyr verbs such as summarise(), mutate(), or filter().
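
To see the relationship between the three in practice, here is a small sketch on made-up data (toy_penguins is hypothetical, not the course dataset). Each pipeline should produce the same counts per island.

```r
library(tidyverse)

# Hypothetical toy data: one row per penguin
toy_penguins <- tibble(
  island = c("Biscoe", "Biscoe", "Dream", "Torgersen", "Dream", "Biscoe")
)

# n() inside summarize(), after grouping explicitly
via_summarize <- toy_penguins |>
  group_by(island) |>
  summarize(n = n())

# tally() relies on the grouping already being in place
via_tally <- toy_penguins |>
  group_by(island) |>
  tally()

# count() does the grouping and the counting in one step
via_count <- toy_penguins |>
  count(island)

via_count
```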

Rebecca Heilman • October 1, 2025

An observation that I am curious about: The group_by with two variables appears to sort the tibble by the first variable alphabetically (island), while the alternative method David showed (i.e., the .by) sorts it by the year in the tibble. Is this always the case or was this simply by chance that this change in sorting of the summary tibble occurred?

My code:

# Calculate the weight of the heaviest penguin on each island for each year.

penguins |> 
  drop_na(body_mass_g) |> 
  group_by(island, year) |> 
  summarize(max(body_mass_g))

# Another method to calculate the weight of the heaviest penguin on each island for each year, but this one sorts by year while the other method sorts alphabetically by island???

penguins |> 
  drop_na(body_mass_g) |> 
    summarize(max(body_mass_g), 
              .by = c(island, year))

Gracielle Higino Coach • October 3, 2025

Hi Rebecca! We covered your question during the live session today, but for the record: the .by argument keeps the groups in the order they first appear in the data, while group_by() sorts the result by the grouping variables (alphabetically, in the case of island). Beyond that, notice that group_by() actually creates groups within your dataset, and those groups can carry over into the result, while .by groups only for that one operation and always returns an ungrouped tibble.
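
A small sketch of the ordering difference, using made-up data in which the islands are deliberately not in alphabetical order (toy_penguins is hypothetical, and the .by argument assumes dplyr 1.1.0 or later):

```r
library(tidyverse)

# Hypothetical toy data; Torgersen appears first on purpose
toy_penguins <- tibble(
  island = c("Torgersen", "Biscoe", "Dream", "Biscoe"),
  body_mass_g = c(3750, 6300, 4650, 5700)
)

# group_by() sorts the summary by the grouping variable
sorted_by_key <- toy_penguins |>
  group_by(island) |>
  summarize(heaviest = max(body_mass_g))

# .by keeps groups in the order they first appear in the data
first_appearance <- toy_penguins |>
  summarize(heaviest = max(body_mass_g), .by = island)

sorted_by_key$island     # "Biscoe" "Dream" "Torgersen"
first_appearance$island  # "Torgersen" "Biscoe" "Dream"
```

If a particular order matters for the output, adding an explicit arrange() after either approach makes the result independent of this difference.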