Fundamentals of R
select
This lesson is called select, part of the Fundamentals of R course.
If you want to see the examples file used for this section of the course, you can take a look at the RMarkdown version as well as the knitted HTML version.
Your Turn
Complete the select sections of the data-wrangling-and-analysis-exercises.Rmd file.
Learn More
To learn more about select helper functions (e.g. contains), check out the Tidyverse website. We only covered a few of them and there are more!
General Data Wrangling and Analysis Resources
Because most material that discusses data wrangling and analysis with the dplyr package does so in a way that covers all of the verbs discussed in this course, I have chosen not to separate them by lesson. Instead, here are some helpful resources for learning more about all of the tidyverse verbs discussed in this course:
Chapter 5 of R for Data Science
RStudio Cloud primer on working with data
Tidyverse for Beginners by Danielle Navarro
Learning Statistics with R by Danielle Navarro
Introduction to the Tidyverse by Alison Hill
A gRadual intRoduction to data wRangling by Chester Ismay and Ted Laderas
Jyoni Shuler
March 23, 2021
Hi David, I'm trying to figure out the keyboard shortcut to run code - for Macs, it says up arrow + Command + and another arrow I cannot figure out. What is that exactly? Thanks!
David Keyes
March 24, 2021
I just use command and enter and that works on my Mac. You do have to have your cursor on the line of the code you're wanting to run for it to work. Let me know if that helps!
Ellen Wilson
October 6, 2022
It seems like command+enter just runs a line, and command+shift+enter will run a whole code chunk.
Lindsey Kenyon
March 24, 2021
Is it possible to 'select' from a row rather than a column? Or does wrangling data in R require data frames to be vertical?
David Keyes
March 24, 2021
The select() function works on columns, and the filter() function works on rows. Hope that helps!
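A minimal sketch of that distinction, assuming the dplyr package is installed. The toy data frame and its values are made up for illustration and are not the course's nhanes data:

```r
library(dplyr)

# Hypothetical toy data standing in for the course's nhanes data frame
toy <- tibble(
  id = 1:4,
  age = c(34, 51, 27, 45),
  marital_status = c("Married", "Divorced", "NeverMarried", "Married")
)

# select() keeps columns, chosen by name
toy_columns <- toy %>% select(id, marital_status)

# filter() keeps rows, chosen by a condition on their values
toy_rows <- toy %>% filter(age > 40)
```

Here toy_columns still has every row but only two columns, while toy_rows still has every column but only the rows where age is above 40.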
Lindsey Kenyon
March 24, 2021
How do you accommodate using the 'select' function if your table headers are merged?
David Keyes
March 24, 2021
Can you explain more what you mean when you say your table headers are merged?
Abby Isaacson
March 29, 2021
FYI to the group I was looking for the pipe shortcut reminder and came across this link: https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts
Kathleen Carson
March 30, 2021
Thank you!
Atlang Mompe
March 30, 2021
Thank you for sharing...
Abby Isaacson
March 29, 2021
Also a note for what's worked for me on 'select' section: when I try the suggested code format such as "marital_status", I get an error; but when I format identically to the imported nhanes dataset, "MaritalStatus" works (matches variable name). Same throughout the exercise.
David Keyes
March 29, 2021
Did you run the clean_names() function in your code? The original variable name in the CSV is MaritalStatus, but when you run clean_names() it becomes marital_status.
Abby Isaacson
March 29, 2021
huh! I have it in my code from the last time I worked on this, but I hadn't run it today before I started. I am also not sure I had successfully imported the dataset before today. Does it only have to be run once, or fresh every session?
Abby Isaacson
March 29, 2021
This was the message I got from running clean_names:
Parsed with column specification:
cols(
  .default = col_character(),
  ID = col_double(),
  Age = col_double(),
  Weight = col_double(),
  Height = col_double(),
  BMI = col_double(),
  DaysPhysHlthBad = col_double(),
  DaysMentHlthBad = col_double(),
  SleepHrsNight = col_double(),
  PhysActiveDays = col_double(),
  TVHrsDay = col_logical()
)
See spec(...) for full column specifications.
4859 parsing failures.
row col expected actual file
5001 TVHrsDay 1/0/T/F/TRUE/FALSE 2_hr 'data/nhanes.csv'
5002 TVHrsDay 1/0/T/F/TRUE/FALSE More_4_hr 'data/nhanes.csv'
5003 TVHrsDay 1/0/T/F/TRUE/FALSE 4_hr 'data/nhanes.csv'
5004 TVHrsDay 1/0/T/F/TRUE/FALSE 4_hr 'data/nhanes.csv'
5005 TVHrsDay 1/0/T/F/TRUE/FALSE 1_hr 'data/nhanes.csv'
David Keyes
March 29, 2021
Yup, that is an informational message telling you how it parsed your variables. col_character() means it's treating a variable as a string, col_double() is numeric, and col_logical() is TRUE/FALSE.
David Keyes
March 29, 2021
It's a good question! By default, RStudio saves any objects you have created when you exit and then reloads them when you restart. However, this isn't actually the best approach, as you often don't know if the object you created is based on your most recent code or code you wrote last week, month, etc. In a later lesson, I'll show you how to change this so that you have to rerun your code each session to recreate any objects.
Kathleen Carson
March 30, 2021
Is there an output that tells us all of our columns? I have the parsing notice but that doesn't show all the columns, so I am not sure how to do the "select variables from health_gen to end" exercise without looking at the solutions.
David Keyes
March 30, 2021
Yes, there is! Check out the last_col() function. FYI you need to have loaded the tidyverse first in order to use this function.
Harold Stanislaw
March 31, 2021
Comment rather than a question. When dropping a range of variables, I tried this code: select(-health_gen:education), which left out a set of parentheses. One solution is to include the parentheses, so the code is select(-(health_gen:education)). However, I found that this also works: select(-health_gen:-education).
David Keyes
March 31, 2021
Cool, thanks! I didn't know that the last one would work.
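The range tips in this part of the thread can be sketched as follows, assuming dplyr is installed. The toy data frame is hypothetical; only the column names health_gen, education, and smoke_now come from the comments. names() is base R and is shown here as one way to list every column:

```r
library(dplyr)

# Hypothetical toy data; only the column names echo the thread
toy <- tibble(
  id = 1:3,
  health_gen = c("Good", "Fair", "Excellent"),
  education = c("College", "HighSchool", "College"),
  smoke_now = c("No", "Yes", "No")
)

# List every column name, in order
names(toy)

# Keep everything from health_gen through the final column
from_health_gen <- toy %>% select(health_gen:last_col())

# Drop a range of columns; the parentheses make the minus
# apply to the whole range rather than to health_gen alone
without_range <- toy %>% select(-(health_gen:education))
```

The parenthesized form is the one used here because it states the intent directly: build the range first, then drop it.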
Naomi Nichols
April 13, 2021
None of the exercises that are evident in your tutorial video are accessible to me in the data-wrangling-and-analysis-exercises RMD file. I just have the code you used.
David Keyes
April 13, 2021
That's very odd! I will follow up with you by email.
Marcus Lee
May 16, 2021
Hi David,
Is there a quick way to select from a column to the last column? E.g., health_gen to the last column, instead of typing out select(health_gen:smoke_now)?
David Keyes
May 16, 2021
Yes, there is! Check out the last_col() function. Using that, you could write: select(health_gen:last_col()).
Matt M
September 26, 2021
Nothing major, but I've noticed that several of the variable names differ between your solutions video and my nhanes (e.g., Height vs height and HealthGen vs. health_gen).
It may be another issue of things changing slightly in the data over time. But it has served as a good reminder to be careful about capitalization (and why to avoid it in variable names).
David Keyes
September 27, 2021
Interesting. Are you running the clean_names() function? I'm wondering if that might explain the discrepancy.
Matt M
September 29, 2021
ah, I believe that was it. I must not have done that the last time.
Matt M
September 29, 2021
When I run clean_names I see a tibble created in the RMarkdown that has all the clean variable names (e.g., marital_status). But in the next code chunk, if I input nhanes %>% select(marital_status), I get an error: Error: Can't subset columns that don't exist. x Column marital_status doesn't exist. Run rlang::last_error() to see where the error occurred.
But if I run nhanes %>% select(MaritalStatus) it works.
David Keyes
September 29, 2021
I believe what's happening here is that you're getting confused between displaying the results of running code vs assigning the results of your code to a new (or existing) object. Here is a video I made for a different lesson, but which I think will help you understand why the variable marital_status doesn't exist in your nhanes object. tldr: you need to run code with the clean_names() function to create the nhanes object like so:
Matt M
October 6, 2021
Thanks for the help. But I don't think that's my issue.
nhanes %>% select(marital_status)
## I get the error: "Error: Can't subset columns that don't exist. x Column marital_status doesn't exist. Run rlang::last_error() to see where the error occurred." But using the original variable name MaritalStatus, the code runs fine.
David Keyes
October 6, 2021
Ah, got it! In that case, I think the difference is likely to do with whether you're using the clean_names() function or not. The variable name in the CSV is MaritalStatus, but it becomes marital_status if you run clean_names(). Let me know if that helps!
Matt M
October 6, 2021
I believe I have been running clean_names(). I've also tried clicking on the Run All Chunks Above option instead of just running the code chunk and clean_names() gets a tibble to appear with the cleaned names (age_decade), but I have to do that in every code chunk. In your Solutions video, the clean_names() continues to work through all subsequent code chunks.
David Keyes
October 6, 2021
Could you please post your code as a GitHub Gist and post the link so I can take a look?
Matt M
October 6, 2021
Here you go: https://gist.github.com/MmattC/3ebdee50c012bcb813da5482d86f0491
Chhavi Kotwani
March 18, 2022
Hi David!
I ran clean_names on nhanes and then displayed nhanes to see if it worked - it did. However, when I move on to the select function, it refuses to recognize the cleaned version and still refers to the earlier version. Is there something I missed?
Thanks!
Charlie Hadley
March 18, 2022
Hello Chhavi! The most likely cause of this error is writing code that doesn't make an assignment. Let me take you through some steps.
Step 1: Run this code
Step 2: Run this code
This will cause an issue because when you ran the first step the output was only being printed to the console and not assigned - meaning that nhanes hasn't been changed. Instead, if you followed these steps the code would work:
Step a: Assign the result of your code to nhanes
Step b: Run this and the code will work
If this hasn't answered your question please do give me more details.
Cheers,
Charlotte
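The difference between printing and assigning described above can be sketched like this, assuming the dplyr and janitor packages are installed (clean_names() comes from janitor). The object names raw and cleaned and the two-row data frame are hypothetical stand-ins, not the course's actual code:

```r
library(dplyr)
library(janitor)

# Hypothetical stand-in for the imported data, with the original CSV-style names
raw <- tibble(
  MaritalStatus = c("Married", "Divorced"),
  HealthGen = c("Good", "Fair")
)

# No assignment: the cleaned result is only printed, and raw is unchanged
raw %>% clean_names()
names(raw)

# With assignment: the cleaned result is stored and can be reused in later chunks
cleaned <- raw %>% clean_names()
cleaned_selection <- cleaned %>% select(marital_status)
```

After the first call, raw still has the column MaritalStatus, which is why select(marital_status) fails on it. Only the assigned object, cleaned, carries the new names forward.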
Tatiana Bustos
July 27, 2022
I'm getting an error "attempt to use zero-length variable name" when I use the following code:
Any idea what the error message means? It worked for the single select.
Tatiana Bustos
July 27, 2022
Figured it out - I was highlighting the backticks along with my code. I thought I had to highlight everything :X
Tatiana Bustos
July 27, 2022
Just reflecting on the data wrangling - it looks like the data in the Excel (or CSV) sheet has to be set up just right to be able to use these coding exercises. Can you share more about the data file preparation? What practices should we have in place regarding variables, types of inputted data, etc.? A lot of my time is spent in data cleaning before actually getting to the analyses. Sorry if I am getting ahead!
David Keyes
July 28, 2022
So I got this data from the NHANES R package. It's been a while so I don't remember exactly what I did (this is bad practice from a time I didn't know better), but in terms of general data cleaning advice, here are a couple of articles:
https://ivelasq.rbind.io/blog/tidying-census-data/
https://evamaerey.github.io/little_flipbooks_library/data_cleaning/data_cleaning
The main thing I would say is that all data cleaning should happen in R so that you can always see what you did and re-run your cleaning code in the future (again, I didn't follow this advice).
Elsa Bailey
October 3, 2022
Could you please go over the use of "quotes" around a term. When are quotes required, and when are they not necessary? For example, quotes are used here - select(contains("hlth_bad")). But no quotes are used here - select(marital_status, education). Thanks!
David Keyes
October 4, 2022
Yes, we can definitely discuss this!
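Until then, here is a rough sketch of the pattern in the two examples from the question, assuming dplyr is installed. The toy data frame is hypothetical. As a rule of thumb: a bare name refers to one exact column, while a quoted string is a piece of text handed to a helper such as contains(), which matches it against the column names:

```r
library(dplyr)

# Hypothetical toy data with names shaped like the cleaned nhanes columns
toy <- tibble(
  marital_status = c("Married", "Divorced"),
  education = c("College", "HighSchool"),
  days_phys_hlth_bad = c(0, 3),
  days_ment_hlth_bad = c(2, 5)
)

# Bare names: each one is the exact name of a column
by_name <- toy %>% select(marital_status, education)

# Quoted text: "hlth_bad" is not a column, it is a fragment
# that contains() looks for inside the column names
by_pattern <- toy %>% select(contains("hlth_bad"))
```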
Rachel Nicholson
October 4, 2022
I believe I'm having the same issue as described by Matt M below. I have run the clean_names function and I get an output with the new correct names. However when I run the select functions if I don't put in the previous names I get an error message that says the columns don't exist. If I put in the non-cleaned names it works fine. I see that you have a video below, but I get a message that I don't have permission to view the video. Could you let me know what the solution was?
Rachel Nicholson
October 5, 2022
I found a temporary solution - if I run "clean_names" inside of each code chunk then it works, but I have to add that to every single code chunk. Is that correct?
Charlie Hadley
October 5, 2022
Hello Alyssa! As a participant on our VSA course, you were in the live session where we discussed this. But so that anyone reading the comments can benefit, let me answer your question here too. When we run code in R scripts or RMarkdown documents we are almost always outputting content to the console - there are things that appear under the code chunk or in the console. If we want to store the results of code then we must make sure to use an assignment, e.g., data_clean <- data_raw %>% clean_names()
If we don't make an assignment all that happens is the output is printed.
Hope that helps! Charlie