Fundamentals of R

summarize()

# Load Packages -----------------------------------------------------------

library(tidyverse)

# Import Data -------------------------------------------------------------

penguins <- read_csv("penguins.csv")

# summarize() -------------------------------------------------------------

# With summarize(), we can go from a complete dataset down to a summary.

# We can use any summary function (e.g., mean() or max()) inside summarize().
# Here's how we calculate the mean bill length.

penguins |> 
  summarize(mean_bill_length = mean(bill_length_mm))

# This doesn't work! Notice that the result is NA, because bill_length_mm contains missing values.

# We need to add na.rm = TRUE to tell R to drop NA values.

penguins |> 
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

# Another option is to drop NA values before calling summarize().

penguins |> 
  drop_na(bill_length_mm) |> 
  summarize(mean_bill_length = mean(bill_length_mm))

# We can calculate multiple summaries in a single call to summarize().

penguins |> 
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
            max_bill_depth = max(bill_depth_mm, na.rm = TRUE))
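
To see the NA behavior without needing penguins.csv, here is a minimal sketch on a small made-up tibble. The name `toy_penguins` and its values are hypothetical stand-ins, not the course data, and the sketch assumes the dplyr and tidyr packages (both part of the tidyverse) are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for penguins.csv; one value is missing
toy_penguins <- tibble(bill_length_mm = c(39.1, NA, 40.3))

# mean() returns NA as soon as any input value is NA
with_na <- toy_penguins |>
  summarize(mean_bill_length = mean(bill_length_mm))

# Option 1: ask mean() to ignore NA values
with_na_rm <- toy_penguins |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

# Option 2: drop the incomplete rows before summarizing
with_drop_na <- toy_penguins |>
  drop_na(bill_length_mm) |>
  summarize(mean_bill_length = mean(bill_length_mm))
```

Both options should give the same mean here, since they discard the same single missing value.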

Your Turn

# Load Packages -----------------------------------------------------------

# Load the tidyverse package

library(tidyverse)

# Import Data -------------------------------------------------------------

# Download data from https://rfor.us/penguins
# Copy the data into the RStudio project
# Create a new R script file and add code to import your data

penguins <- read_csv("penguins.csv")

# Calculate the weight of the heaviest penguin.
# Don't forget to drop NAs!

# YOUR CODE HERE

# Calculate the minimum and maximum weight of penguins in the dataset.

# YOUR CODE HERE

Learn More

To learn more about the summarize() function, check out Chapter 3 of R for Data Science.

Have any questions? Put them below and we will help you out!

Brian Slattery • September 20, 2023

I'm just curious, so feel free to ignore this if it is covered later. You mentioned that piping sequential summarize() calls into each other doesn't work to get a single table with multiple columns. Is there a way to do that? I didn't know if mutate() would be able to take in a tibble from summarize(), or whether there is some other way to combine tibbles. For example, say you were getting the mean bill length from the penguins data, but also wanted the mean bill length from some other bird dataset, and to have these side by side in the same table. (I googled it and it looked like there's a merge() function, but I didn't know if that was the best way to go about it in this case.)

Gracielle Higino Coach • September 20, 2023

Hi Brian! Thank you for your question, that's very interesting! We'll discuss this in our live session this week, stay tuned!
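
In the meantime, here is one possible sketch, not necessarily the approach covered in the live session. It uses made-up data: `toy_penguins`, `toy_gulls`, and their values are hypothetical. It shows why a second summarize() cannot see the original columns, and uses dplyr's bind_cols() as one way to place two one-row summaries side by side.

```r
library(dplyr)

# Hypothetical toy datasets standing in for two different bird datasets
toy_penguins <- tibble(bill_length_mm = c(39.1, 40.3))
toy_gulls <- tibble(bill_length_mm = c(50, 52))

# After summarize(), only the summary column remains, so a second
# summarize() piped onto it can no longer find the original columns
penguin_mean <- toy_penguins |>
  summarize(penguin_mean_bill_length = mean(bill_length_mm))
remaining_columns <- names(penguin_mean)

gull_mean <- toy_gulls |>
  summarize(gull_mean_bill_length = mean(bill_length_mm))

# bind_cols() pastes the two one-row tibbles together column-wise
side_by_side <- bind_cols(penguin_mean, gull_mean)
```

For summaries of the same dataset, putting several arguments in one summarize() call, as in the lesson code, is simpler than combining tibbles afterwards.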

Gabby Bachhuber • March 20, 2024

Perfect

Brian Slattery • September 20, 2023

Is there some way to change the default behavior of summarize() so that it ignores NAs without having to specify it every time? I didn't know if there was something like a global variable that you can set in the R script file, or a setting within the RStudio environment or an installed package.

Gracielle Higino Coach • September 20, 2023

Hey! =D The short answer is: you shouldn't do that! There are some complicated workarounds, but by default, you should make it explicit in your code when NAs are being ignored/dropped. We'll discuss this on Thursday too!

Rachel Udow • March 17, 2024

Hello! Two questions about this lesson:

  1. Why is it required to use the summarize() function before using the more specific summary functions (e.g., mean())?

  2. Does the "rm" in "na.rm" stand for anything? Just asking as it might help me remember that argument if so.

Thank you!

Libby Heeren Coach • March 17, 2024

Hey, Rachel! Here are some answers for you:

Q: Why is it required to use the summarize() function before using the more specific summary functions (e.g., mean())?

A: summarize() is allowing you to create/name a column in a data frame or tibble that will contain your summary value (like your mean). You could run mean directly on your column and it would output just a value, not a data frame or tibble with a named column.

For example, if you ran:

tomatoes |>
    drop_na(productionvalue) |> 
    pull(productionvalue) |> 
    mean()

You'd get a single value as your output:

[1] 3816933

If you use summarize, like this:

tomatoes |>
    drop_na(productionvalue) |> 
    summarize(meanvalue = mean(productionvalue))

You'd get a tibble that looks like this:

# A tibble: 1 × 1
  meanvalue
      <dbl>
1  3816933.

Of course, it's more helpful when you're adding a mean column to a grouped data set so that your resulting data frame has a row for each group containing the group name and the group's mean. I hope this helps!
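
As a rough sketch of that grouped case, the example below uses a made-up tibble. The name `tomatoes_toy`, the `state` column, and the values are assumptions for illustration, not the real tomatoes data.

```r
library(dplyr)

# Hypothetical data: a grouping column plus the value to summarize
tomatoes_toy <- tibble(
  state = c("CA", "CA", "FL", "FL"),
  productionvalue = c(10, 20, 30, 50)
)

# One row per group, holding the group name and that group's mean
mean_by_state <- tomatoes_toy |>
  group_by(state) |>
  summarize(meanvalue = mean(productionvalue))
```

The result keeps the grouping column next to the summary column, which is something a bare mean() call cannot give you.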

Q: Does the "rm" in "na.rm" stand for anything? Just asking as it might help me remember that argument if so.

A: Yep! It stands for "remove" so you can say "na remove" in your mind to help remember it.

Rachel Udow • March 19, 2024

Thanks Libby, this is really helpful!

Michelle Brodesky • March 20, 2024

I was curious about Rachel's Q1 as well! Thanks for asking, Rachel, and for your response, Libby.

Maria Montenegro • April 1, 2024

I am not sure why I am not getting the same output as in the video. When I ran the exact same code as in the "solutions" for the first exercise, I see this in the console, but for all variables:

$year
$year$variableType
Variable type: numeric
$year$countMissing
Number of missing obs.: 0 (0 %)
$year$uniqueValues
Number of unique values: 3
$year$centralValue
Median: 2008
$year$quartiles
1st and 3rd quartiles: 2007; 2009
$year$minMax
Min. and max.: 2007; 2009

Any idea why?

David Keyes Founder • April 2, 2024

Hmm, hard to say. Can you please post your full code so I can run it and check what might be going on?

Libby Heeren Coach • April 3, 2024

Hey, Maria! Just wanted to contribute my experience here: sometimes this happens to me when I use summarize and I don't know why, but if I restart R (Session > Restart R) it goes away and works fine! When in doubt, restart R 😅 Let us know if it works and share your code so we can see what you were running!

Maria Montenegro • April 8, 2024

Thank you! I found out what the issue was... I changed it to summarise() and it worked! so strange...

Sara Parisi • September 25, 2024

This same thing happened to me, too. I figured out that it's because there is a function called "summarize" in the package "dataReporter" which David talks about in Week 1. For my work in Week 2, I just went ahead and loaded all of the packages I used in Week 1 (including dataReporter) before completing the exercises, and I got the same output as Maria Montenegro. If I don't load dataReporter, it works fine and gives me the mean. There is a warning message about dataReporter masking summarize from dplyr, but it's easy to miss if you're moving quickly. dataReporter doesn't contain a function called "summarise", so if you're using the English spelling, you don't run into this problem.

Gracielle Higino Coach • September 26, 2024

Good observation, Sara! One way around it is to make explicit which package your function is coming from, using the double colon notation. dplyr::summarize() will return one type of result, and dataReporter::summarize() will return another.

You can play with the code below to check the difference yourself:

library(dplyr)
library(dataReporter)
library(palmerpenguins)

penguins |>
  dplyr::summarize(
    mean_bill_length_mm = mean(bill_length_mm)
  )

penguins |> 
  dataReporter::summarize( 
    mean_bill_length_mm = mean(bill_length_mm)
  )
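
If you are unsure which package a bare summarize() call will use in your current session, a sketch like the one below may help. It relies on environment() and environmentName() from base R and find() from utils, and it assumes only that dplyr is installed. The variable names are made up for illustration.

```r
library(dplyr)

# Which package does a bare summarize() currently resolve to?
# With only dplyr attached this is "dplyr"; if another attached
# package masks it, that package's name appears instead.
summarize_source <- environmentName(environment(summarize))

# Every attached package that defines something named summarize,
# listed in search-path order (the first entry wins)
summarize_locations <- find("summarize")
```

If `summarize_source` is not "dplyr", either restart R and load only the packages you need, or write dplyr::summarize() explicitly.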

Sara Parisi • September 26, 2024

Oh thanks! Throughout my haphazard, self-guided journey of learning R, I have always wondered why code sometimes lists the "package::" before the function. The pieces of the puzzle are coming together for me now :)

Keith Karani • April 30, 2024

For this question, my solution is below. Is it good practice to write R code like this, or should I switch to the approach given in the solution?

# Calculate the minimum and maximum weight of penguins in the dataset.

wght_min_max <- penguins %>%
  drop_na(body_mass_g) %>%
  summarize(max_body_weight = max(body_mass_g), min(body_mass_g))

View(wght_min_max)

David Keyes Founder • April 30, 2024

That's a perfectly good way to write code!

Keith Karani • May 2, 2024

thank you

Samreen Chhabra • November 11, 2024

Hi, regarding the first half of the video, where you mentioned the invalid output caused by NA values in the observations: I ran my code without the na.rm argument and it showed me the same thing. I wonder how that happens... I did save my code with the filter functions earlier, but as you also mentioned, that does not modify the original dataset. Just curious about how the code worked:

> penguins |> 
  summarise(mean_bill_depth =
              mean(bill_length_mm))
# A tibble: 1 × 1
  mean_bill_depth
            <dbl>
1            37.9

The same result shows if I use the na.rm argument:

> penguins |> 
  summarise(mean_bill_depth =
              mean(bill_length_mm, na.rm = TRUE))
# A tibble: 1 × 1
  mean_bill_depth
            <dbl>
1            37.9

thank you!

Samreen Chhabra • November 11, 2024

Oh, I just realized it doesn't show the output; it is this for both codes:

David Keyes Founder • November 12, 2024

It's hard to know exactly what's going on because I can't see your code all the way from when you imported the data, but my guess is that you removed the NAs prior to running the summarize() function. If you do this (e.g. with the drop_na() function), then summarize() will work fine without the na.rm argument.
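
One way to check this guess is to count the missing values before summarizing. The sketch below uses a made-up tibble (`toy_penguins` and its values are hypothetical, not the course data) and assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing bill length
toy_penguins <- tibble(bill_length_mm = c(39.1, NA, 40.3))

# Count the NAs in the column; 0 would mean na.rm makes no difference
missing_before <- toy_penguins |>
  summarize(n_missing = sum(is.na(bill_length_mm)))

# If the NAs were already dropped earlier in the script...
cleaned <- toy_penguins |>
  drop_na(bill_length_mm)

# ...the mean is the same with or without na.rm = TRUE
mean_plain <- cleaned |>
  summarize(mean_bill_length = mean(bill_length_mm))
mean_na_rm <- cleaned |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))
```

Running the same count on your own penguins object should show whether it still contains missing bill lengths.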

Myles Kwitny • March 18, 2025

When I use the summary function (listed below)

penguins |> 
  summarize(mean_bill_length = mean(bill_length_mm))

In my console I get a summary of all the variables (example of one variable below). Is there another step I should be taking before entering this command?

$bill_length_mm
$bill_length_mm$variableType
Variable type: numeric
$bill_length_mm$countMissing
Number of missing obs.: 2 (0.58 %)
$bill_length_mm$uniqueValues
Number of unique values: 164
$bill_length_mm$centralValue
Median: 44.45
$bill_length_mm$quartiles
1st and 3rd quartiles: 39.23; 48.5
$bill_length_mm$minMax
Min. and max.: 32.1; 59.6

Thank you!

Gracielle Higino Coach • March 18, 2025

Hi Myles! I can't reproduce your error. The possibilities that I can think of are that you have a package loaded with a similarly named function that is masking dplyr's summarise() function, or that there is a typo in your code somewhere... Restarting the session might help in this case!

Do you wanna share more details about your script?

Myles Kwitny • March 18, 2025

Yes, this is an example of the code I entered:

# Load Packages -----------------------------------------------------------

# Load the tidyverse package

library(tidyverse)

# Import Data -------------------------------------------------------------

# Download data from https://rfor.us/penguins
# Copy the data into the RStudio project
# Create a new R script file and add code to import your data

penguins <- read_csv("penguins_data.csv")

# Calculate the weight of the heaviest penguin.
# Don't forget to drop NAs!

penguins |> 
  summarize(max_body_mass = max(body_mass_g))

Myles Kwitny • March 18, 2025

I restarted RStudio and it worked! Thank you

Gracielle Higino Coach • March 18, 2025

Amazing! Sometimes that's all it takes! 😁

Heather Worker • March 20, 2025

Do max() and min() work with dates, or just with strictly numeric values?
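
A quick sketch suggests they do work on Date columns. The example below uses made-up data: `visits`, `visit_date`, and the dates themselves are hypothetical. It assumes the column is stored as a Date rather than as text.

```r
library(dplyr)

# Hypothetical data with a Date column, including one missing date
visits <- tibble(
  visit_date = as.Date(c("2024-03-01", "2024-01-15", NA))
)

# min() and max() return the earliest and latest dates;
# na.rm = TRUE is still needed when dates are missing
date_range <- visits |>
  summarize(first_visit = min(visit_date, na.rm = TRUE),
            last_visit = max(visit_date, na.rm = TRUE))
```

If the dates were read in as character strings, min() and max() would compare them alphabetically instead, so it is worth checking the column type first.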