Fundamentals of R
summarize()
This lesson is called summarize(), part of the Fundamentals of R course.
View code shown in video
# Load Packages -----------------------------------------------------------
library(tidyverse)
# Import Data -------------------------------------------------------------
penguins <- read_csv("penguins.csv")
# summarize() -------------------------------------------------------------
# With summarize(), we can go from a complete dataset down to a summary.
# We can use any summary function, such as mean() or max(), inside summarize().
# Here's how we calculate the mean bill length.
penguins |>
  summarize(mean_bill_length = mean(bill_length_mm))
# This doesn't work! The result is NA because bill_length_mm contains missing values.
# We need to add na.rm = TRUE to tell R to drop NA values.
penguins |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))
# Another option is to drop NA values before calling summarize().
penguins |>
  drop_na(bill_length_mm) |>
  summarize(mean_bill_length = mean(bill_length_mm))
# We can calculate several summaries in a single call to summarize().
penguins |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
            max_bill_depth = max(bill_depth_mm, na.rm = TRUE))
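The steps above can also be tried without penguins.csv. The sketch below uses a small made-up tibble (toy_penguins and its values are assumptions for illustration, not the course data) to show why mean() returns NA, and that na.rm = TRUE and drop_na() give the same answer here.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for penguins.csv; the values are invented.
toy_penguins <- tibble(
  bill_length_mm = c(39.1, 40.3, NA, 46.5),
  bill_depth_mm  = c(18.7, 18.0, NA, 17.9)
)

# A single NA in the column makes mean() return NA.
with_na <- toy_penguins |>
  summarize(mean_bill_length = mean(bill_length_mm))

# na.rm = TRUE drops NA values inside mean().
na_removed <- toy_penguins |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

# drop_na() removes the incomplete rows before summarizing.
rows_dropped <- toy_penguins |>
  drop_na(bill_length_mm) |>
  summarize(mean_bill_length = mean(bill_length_mm))

# Several summaries in one call give one row with several columns.
both <- toy_penguins |>
  summarize(
    mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
    max_bill_depth = max(bill_depth_mm, na.rm = TRUE)
  )
```

Printing with_na, na_removed, rows_dropped, and both in the console shows each result as a one-row tibble.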
Your Turn
# Load Packages -----------------------------------------------------------
# Load the tidyverse package
library(tidyverse)
# Import Data -------------------------------------------------------------
# Download data from https://rfor.us/penguins
# Copy the data into the RStudio project
# Create a new R script file and add code to import your data
penguins <- read_csv("penguins.csv")
# Calculate the weight of the heaviest penguin.
# Don't forget to drop NAs!
# YOUR CODE HERE
# Calculate the minimum and maximum weight of penguins in the dataset.
# YOUR CODE HERE
Learn More
To learn more about the summarize() function, check out Chapter 3 of R for Data Science.
Brian Slattery
September 20, 2023
I'm just curious, so feel free to ignore if this is covered later. But, you mentioned that piping sequential summarizes into each other doesn't work to get a single table with multiple columns. Is there a way to do that? I didn't know if mutate would be able to handle taking in a tibble from summarize? I was guessing there must be some other way to combine tibbles? For example, if you were getting the mean bill length from the penguins data, but also wanted to get a mean bill length from some other bird dataset, and have these in the same table side by side (I googled it and it looked like there's a merge() function, but I didn't know if that was the best way to go about it in this case)
Gracielle Higino Coach
September 20, 2023
Hi Brian! Thank you for your question, that's very interesting! We'll discuss this in our live session this week, stay tuned!
David Keyes Founder
September 22, 2023
Hi Brian! I recorded a quick video to show how you could deal with this. Hope it helps!
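One possible approach to Brian's question, not necessarily the one shown in the video, is to summarize each dataset separately and then combine the one-row results with dplyr's bind_rows() or bind_cols(). The datasets toy_penguins and toy_finches below are invented for illustration.

```r
library(dplyr)

# Hypothetical toy datasets; names and values are invented.
toy_penguins <- tibble(bill_length_mm = c(39.1, 40.3, NA, 46.5))
toy_finches  <- tibble(bill_length_mm = c(10.2, 11.0, 9.8))

penguin_summary <- toy_penguins |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

finch_summary <- toy_finches |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

# Stack the summaries, labeling each row with the dataset it came from.
stacked <- bind_rows(
  penguins = penguin_summary,
  finches = finch_summary,
  .id = "dataset"
)

# Or place them side by side, renaming the columns so they do not clash.
side_by_side <- bind_cols(
  penguin_summary |> rename(penguin_mean_bill_length = mean_bill_length),
  finch_summary |> rename(finch_mean_bill_length = mean_bill_length)
)
```

Piping one summarize() into another does not work because the first call collapses the data to a single row, so the original columns are no longer available to the second. For several summaries of the same dataset, putting them all in one summarize() call avoids the problem.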
Gabby Bachhuber
March 20, 2024
Perfect
Brian Slattery
September 20, 2023
Is there some way to change the default behavior of summarize() so that it ignores NAs without having to specify it every time? I didn't know if there was something like a global variable that you can set in the R script file, or something within the RStudio environment or an installed package?
Gracielle Higino Coach
September 20, 2023
Hey! =D The short answer is: you shouldn't do that! There are some complicated workarounds, but by default, you should make it explicit in your code when NAs are being ignored/dropped. We'll discuss this on Thursday too!
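One way to avoid retyping the argument while still keeping the NA handling visible is a small named helper. This is only a sketch: mean_without_na() is a hypothetical function defined here, not part of dplyr or base R, and the data is made up.

```r
library(dplyr)

# Hypothetical helper: the name itself states that NA values are removed.
mean_without_na <- function(x) mean(x, na.rm = TRUE)

# Made-up data for illustration.
toy_penguins <- tibble(bill_length_mm = c(39.1, 40.3, NA, 46.5))

helper_result <- toy_penguins |>
  summarize(mean_bill_length = mean_without_na(bill_length_mm))
```

The default behavior of mean() is unchanged, so anyone reading the code can still see where NAs are dropped.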
Rachel Udow
March 17, 2024
Hello! Two questions about this lesson:
Why is it required to use the summarize() function before using the more specific summary functions (e.g., mean())?
Does the "rm" in "na.rm" stand for anything? Just asking as it might help me remember that argument if so.
Thank you!
Libby Heeren Coach
March 17, 2024
Hey, Rachel! Here are some answers for you:
Q: Why is it required to use the summarize() function before using the more specific summary functions (e.g., mean())?
A: summarize() is allowing you to create/name a column in a data frame or tibble that will contain your summary value (like your mean). You could run mean directly on your column and it would output just a value, not a data frame or tibble with a named column.
For example, if you ran:
You'd get a single value as your output:
If you use summarize, like this:
You'd get a tibble that looks like this:
Of course, it's more helpful when you're adding a mean column to a grouped data set so that your resulting data frame has a row for each group containing the group name and the group's mean. I hope this helps!
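A minimal sketch of the contrast described in this answer, using a small made-up tibble rather than the course data (toy_penguins and its values are assumptions):

```r
library(dplyr)

# Made-up data for illustration.
toy_penguins <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_length_mm = c(39.1, 40.3, NA, 46.5)
)

# mean() on a column returns a bare number.
bare_value <- mean(toy_penguins$bill_length_mm, na.rm = TRUE)

# summarize() returns a one-row tibble with a named column.
summary_tbl <- toy_penguins |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

# With group_by(), the result has one row per group.
by_species <- toy_penguins |>
  group_by(species) |>
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))
```

Here bare_value is a single number, summary_tbl is a tibble with one row and one column, and by_species has one row for each species.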
Q: Does the "rm" in "na.rm" stand for anything? Just asking as it might help me remember that argument if so.
A: Yep! It stands for "remove" so you can say "na remove" in your mind to help remember it.
Rachel Udow
March 19, 2024
Thanks Libby, this is really helpful!
Michelle Brodesky
March 20, 2024
I was curious about Rachel's Q1 as well! Thanks for asking, Rachel, and for your response, Libby.
Maria Montenegro
April 1, 2024
I am not sure why I am not getting the same output as in the video. When I ran the exact same code as in the "solutions" for the first exercise I see this in the console but for all variables:
$year
$year$variableType
Variable type: numeric
$year$countMissing
Number of missing obs.: 0 (0 %)
$year$uniqueValues
Number of unique values: 3
$year$centralValue
Median: 2008
$year$quartiles
1st and 3rd quartiles: 2007; 2009
$year$minMax
Min. and max.: 2007; 2009
Any idea why?
David Keyes Founder
April 2, 2024
Hmm, hard to say. Can you please post your full code so I can run it and check what might be going on?
Libby Heeren Coach
April 3, 2024
Hey, Maria! Just wanted to contribute my experience here: sometimes this happens to me when I use summarize and I don't know why, but if I restart R (Session > Restart R) it goes away and works fine! When in doubt, restart R 😅 Let us know if it works and share your code so we can see what you were running!
Maria Montenegro
April 8, 2024
Thank you! I found out what the issue was... I changed it to summarise() and it worked! so strange...
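A plausible explanation, though it is not confirmed in this thread, is that another loaded package defines its own summarize() that masks dplyr's version, while summarise() still reaches dplyr. The sketch below imitates that situation with a hypothetical stand-in function and made-up data, and shows that prefixing the call with dplyr:: picks the intended function.

```r
library(dplyr)

# Made-up data for illustration.
toy_penguins <- tibble(weight_g = c(3750, 3800, NA, 4675))

# Hypothetical stand-in for another package's summarize(); it masks dplyr's version.
summarize <- function(.data, ...) "not the dplyr summary"

# This call now reaches the stand-in, not dplyr.
masked <- toy_penguins |>
  summarize(max_weight = max(weight_g, na.rm = TRUE))

# Prefixing with dplyr:: calls the dplyr function regardless of masking.
explicit <- toy_penguins |>
  dplyr::summarize(max_weight = max(weight_g, na.rm = TRUE))

# Remove the stand-in so summarize() refers to dplyr again.
rm(summarize)
```

Restarting R, as suggested above, has a similar effect when the masking package was only attached in the current session.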