Fundamentals of R

group_by() and summarize()

This lesson is called group_by() and summarize(), part of the Fundamentals of R course.

View code shown in video

# Load Packages -----------------------------------------------------------

library(tidyverse)

# Import Data -------------------------------------------------------------

penguins <- read_csv("penguins.csv")

# group_by() and summarize() ----------------------------------------------

# summarize() becomes truly powerful when paired with group_by(), 
# which enables us to perform calculations on multiple groups. 

# Calculate the mean bill length for penguins on different islands.

penguins |> 
  group_by(island) |> 
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

# We can use group_by() with multiple groups.

penguins |> 
  group_by(island, year) |> 
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE)) 

# Another option is to use the .by argument in summarize().

penguins |> 
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
            .by = c(island, year))

# You can count the number of penguins in each group using the n() summary function.

penguins |> 
  group_by(island) |> 
  summarize(number_of_penguins = n())

# But a simpler way to do this is with the count() function.

penguins |> 
  count(island)

# You can also use count() with multiple groups.

penguins |> 
  count(island, year)
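
One practical difference between group_by() and .by is what happens to the grouping afterwards. The sketch below is a minimal illustration, not part of the video code: toy_penguins is a small made-up data frame standing in for penguins.csv, and it assumes dplyr 1.1.0 or later (for .by) and R 4.1 or later (for the |> pipe). With two grouping variables, summarize() drops only the last one by default, so the result is still grouped by island; .by always returns an ungrouped data frame.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins.csv (values are made up)
toy_penguins <- tibble(
  island = c("Biscoe", "Biscoe", "Dream", "Dream"),
  year = c(2007, 2008, 2007, 2007),
  bill_length_mm = c(40, 42, NA, 38)
)

# group_by() with two variables: the result stays grouped by island
grouped_result <- toy_penguins |>
  group_by(island, year) |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

group_vars(grouped_result)

# .by groups only for this one call: the result is ungrouped
by_result <- toy_penguins |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
            .by = c(island, year))

group_vars(by_result)

# Adding .groups = "drop" gives an ungrouped result with group_by() as well
dropped_result <- toy_penguins |>
  group_by(island, year) |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
            .groups = "drop")

group_vars(dropped_result)
```

If you keep working with the summarized data (for example with mutate() or another summarize()), leftover grouping can change the results, so it is worth checking with group_vars() or calling ungroup().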

Your Turn

# Load Packages -----------------------------------------------------------

# Load the tidyverse package

library(tidyverse)

# Import Data -------------------------------------------------------------

# Download data from https://rfor.us/penguins
# Copy the data into the RStudio project
# Create a new R script file and add code to import your data

penguins <- read_csv("penguins.csv")
			
# group_by() and summarize() ----------------------------------------------

# Calculate the weight of the heaviest penguin on each island.

# YOUR CODE HERE

# Calculate the weight of the heaviest penguin on each island for each year.

# YOUR CODE HERE

Learn More

To learn more about the group_by() and summarize() functions, check out Chapter 3 of R for Data Science.

Have any questions? Put them below and we will help you out!

gene trevino • January 30, 2025

When I run the following code:

penguins %>% group_by(island, year) %>%
summarize(Heaviest_Penguins = max(body_mass_g, na.rm = TRUE))

I get the following output:

island year Heaviest_Penguins

Why do I get NA for Biscoe and Torgersen?

Thanks

David Keyes Founder • January 31, 2025

Hmm, that's strange. I see something different:

# A tibble: 9 × 3
# Groups:   island [3]
  island     year Heaviest_Penguins
  <fct>     <int>             <int>
1 Biscoe     2007              6300
2 Biscoe     2008              6000
3 Biscoe     2009              6000
4 Dream      2007              4650
5 Dream      2008              4800
6 Dream      2009              4475
7 Torgersen  2007              4675
8 Torgersen  2008              4700
9 Torgersen  2009              4300

Can you share the code you used to import the CSV file?

Pepper Phillips • February 11, 2025

Can you use drop_na instead of na.rm = TRUE?

David Keyes Founder • February 11, 2025

Yes, absolutely! I do that quite often, in fact.
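
As a rough sketch of the two approaches side by side (toy_penguins is a made-up data frame, not the course data, and drop_na() comes from the tidyr package, which library(tidyverse) also loads):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing bill length (values are made up)
toy_penguins <- tibble(
  island = c("Biscoe", "Biscoe", "Dream", "Dream"),
  bill_length_mm = c(40, 42, NA, 38)
)

# Option 1: keep every row and ignore NAs inside mean()
with_na_rm <- toy_penguins |>
  group_by(island) |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

# Option 2: remove rows with a missing bill length first, then summarize
with_drop_na <- toy_penguins |>
  drop_na(bill_length_mm) |>
  group_by(island) |>
  summarize(mean_bill_length = mean(bill_length_mm))

with_na_rm
with_drop_na
```

Both give the same means here. They can differ in edge cases: drop_na() removes the rows entirely, so row counts from n() will be smaller, and a group whose values are all missing disappears from the output instead of showing NaN.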

Zaynaib Giwa • March 24, 2025

What is the difference between using summarize() with n(), count(), and tally()?

Gracielle Higino Coach • March 24, 2025

The difference is very small, and these functions are all connected somehow!

tally() counts the rows in each group, assuming you've already done the grouping with group_by(). It's equivalent to df |> summarise(n = n()).
count() does the grouping for you, based on the variables you pass as arguments, and then counts rows with n() (or with sum() of a weight column if you supply the wt argument). It's roughly equivalent to df |> group_by(a, b) |> summarise(n = n()).
n() is the basic counting function: it returns the number of rows in the current group, and it only works inside dplyr verbs such as summarise(), mutate(), or filter().
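
To make the relationship concrete, here is a small sketch on made-up data (toy_penguins is hypothetical, not the course data). All three pipelines should produce the same counts per island:

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy_penguins <- tibble(
  island = c("Biscoe", "Biscoe", "Dream"),
  year = c(2007, 2007, 2008)
)

# n() inside summarize(), after grouping by hand
via_summarize <- toy_penguins |>
  group_by(island) |>
  summarize(n = n())

# tally() also expects the grouping to be done already
via_tally <- toy_penguins |>
  group_by(island) |>
  tally()

# count() does the grouping and the counting in one step
via_count <- toy_penguins |>
  count(island)

via_summarize
via_tally
via_count
```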
