If you want to see the examples file used for this section of the course, you can take a look at the RMarkdown version as well as the knitted HTML version.

Your Turn

Complete the select sections of the data-wrangling-and-analysis-exercises.Rmd file.

Learn More

To learn more about select helper functions (e.g. contains), check out the Tidyverse website. We only covered a few of them and there are more!

General Data Wrangling and Analysis Resources

Because most material that discusses data wrangling and analysis with the dplyr package covers all of the verbs discussed in this course together, I have chosen not to separate the resources by lesson. Instead, here are some helpful resources for learning more about all of the tidyverse verbs discussed in this course:

Chapter 5 of R for Data Science

RStudio Cloud primer on working with data

Tidyverse for Beginners by Danielle Navarro

Learning Statistics with R by Danielle Navarro

Introduction to the Tidyverse by Alison Hill

A gRadual intRoduction to data wRangling by Chester Ismay and Ted Laderas

Working in the Tidyverse by Desi Quintans and Jeff Powell

Christine Monnier video tutorials on dplyr

Have any questions? Put them below and we will help you out!

Jyoni Shuler

March 24, 2021

Hi David, I'm trying to figure out the keyboard shortcut to run code - for Macs, it says up arrow + Command + and another arrow I cannot figure out. What is that exactly? Thanks!

David Keyes

March 24, 2021

I just use command and enter and that works on my Mac. You do have to have your cursor on the line of the code you're wanting to run for it to work. Let me know if that helps!

Ellen Wilson

October 6, 2022

It seems like command+enter just runs a line, and command+shift+enter will run a whole code chunk.

Lindsey Kenyon

March 24, 2021

Is it possible to 'select' from a row rather than a column? Or does wrangling data in R require data frames to be vertical?

David Keyes

March 24, 2021

The select() function works on columns, while the filter() function works on rows. Hope that helps!
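To make the column/row distinction concrete, here is a minimal sketch. The `survey` data frame is a made-up stand-in for nhanes, so its values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data standing in for the nhanes data frame
survey <- tibble(
  id = 1:4,
  age = c(25, 41, 67, 33),
  marital_status = c("Married", "Divorced", "Married", "NeverMarried")
)

# select() keeps columns, chosen by name
survey_columns <- survey %>%
  select(id, marital_status)

# filter() keeps rows, chosen by a condition on their values
survey_rows <- survey %>%
  filter(age > 30)

survey_columns
survey_rows
```

Both verbs return a new data frame and leave `survey` itself unchanged.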

Lindsey Kenyon

March 24, 2021

How do you accommodate using the 'select' function if your table headers are merged?

David Keyes

March 24, 2021

Can you explain more what you mean when you say your table headers are merged?

Abby Isaacson

March 29, 2021

FYI to the group I was looking for the pipe shortcut reminder and came across this link: https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts

Kathleen Carson

March 30, 2021

Thank you!

Atlang Mompe

March 30, 2021

Thank you for sharing...

Abby Isaacson

March 30, 2021

Also a note for what's worked for me on 'select' section: when I try the suggested code format such as "marital_status", I get an error; but when I format identically to the imported nhanes dataset, "MaritalStatus" works (matches variable name). Same throughout the exercise.

David Keyes

March 30, 2021

Did you run the clean_names() function in your code? The original variable name in the CSV is MaritalStatus, but when you run clean_names() it becomes marital_status.

Abby Isaacson

March 30, 2021

huh! I have it in my code from the last time I worked on this, but I hadn't run it today before I started. I am also not sure I had successfully imported the dataset before today. Does it only have to be run once, or fresh every session?

Abby Isaacson

March 30, 2021

This was the message I got from running clean_names:

Parsed with column specification:
cols(
  .default = col_character(),
  ID = col_double(),
  Age = col_double(),
  Weight = col_double(),
  Height = col_double(),
  BMI = col_double(),
  DaysPhysHlthBad = col_double(),
  DaysMentHlthBad = col_double(),
  SleepHrsNight = col_double(),
  PhysActiveDays = col_double(),
  TVHrsDay = col_logical()
)
See spec(...) for full column specifications.
4859 parsing failures.
 row      col           expected    actual              file
5001 TVHrsDay 1/0/T/F/TRUE/FALSE 2_hr      'data/nhanes.csv'
5002 TVHrsDay 1/0/T/F/TRUE/FALSE More_4_hr 'data/nhanes.csv'
5003 TVHrsDay 1/0/T/F/TRUE/FALSE 4_hr      'data/nhanes.csv'
5004 TVHrsDay 1/0/T/F/TRUE/FALSE 4_hr      'data/nhanes.csv'
5005 TVHrsDay 1/0/T/F/TRUE/FALSE 1_hr      'data/nhanes.csv'

David Keyes

March 30, 2021

Yup, that is an informational message telling you how it parsed your variables. col_character() means it's treating a variable as a string, col_double() is numeric, and col_logical() is TRUE/FALSE.
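The parsing failures in the message above are most likely because readr guessed that TVHrsDay was logical from rows where the value was missing, then met text such as 2_hr further down the file. One possible way around this is to declare the column type yourself with col_types. The sketch below uses a small made-up CSV written to a temporary file, not the course's nhanes.csv, so the file contents are assumptions.

```r
library(readr)

# Write a small hypothetical CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "ID,Age,TVHrsDay",
    "1,34,NA",
    "2,51,2_hr",
    "3,28,More_4_hr"
  ),
  csv_path
)

# Declare TVHrsDay as character instead of letting readr guess its type
tv_data <- read_csv(
  csv_path,
  col_types = cols(
    ID = col_double(),
    Age = col_double(),
    TVHrsDay = col_character()
  )
)

# Show the column specification that was actually used
spec(tv_data)
```

With the type declared up front, values like 2_hr are read as text rather than reported as failures.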

David Keyes

March 30, 2021

It's a good question! By default, RStudio saves any objects you have created when you exit and then reloads them when you restart. However, this isn't actually the best approach, as you often don't know if the object you created is based on your most recent code or code you wrote last week, month, etc. In a later lesson, I'll show you how to change this so that you have to rerun your code each session to recreate any objects.

Kathleen Carson

March 30, 2021

Is there an output that tells us all of our columns? I have the parsing notice, but that doesn't show all the columns, so I am not sure how to do the "select variables from health_gen to end" exercise without looking at the solutions.

David Keyes

March 30, 2021

Yes, there is! Check out the last_col() function. FYI you need to have loaded the tidyverse first in order to use this function.
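For the first part of the question (seeing every column), two commonly used options are names() and glimpse(). This is a small sketch on a made-up data frame; `nhanes_small` and its values are assumptions standing in for nhanes after clean_names().

```r
library(dplyr)

# Hypothetical stand-in for nhanes after clean_names()
nhanes_small <- tibble(
  id = 1:2,
  health_gen = c("Good", "Vgood"),
  days_phys_hlth_bad = c(0, 2),
  smoke_now = c("No", "Yes")
)

# Character vector of every column name, in order
column_names <- names(nhanes_small)
column_names

# One line per column, with its type and first few values
glimpse(nhanes_small)
```

Unlike the parsing notice, both of these list every column rather than only those that differ from the default type.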

Harold Stanislaw

April 1, 2021

Comment rather than a question. When dropping a range of variables, I tried this code: select(-health_gen:education), which left out a set of parentheses. One solution is to include the parentheses, so the code is select(-(health_gen:education)). However, I found that this also works: select(-health_gen:-education).

David Keyes

April 1, 2021

Cool, thanks! I didn't know that the last one would work.
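The two working forms from the comment above can be compared side by side. This is a sketch on a made-up data frame (`toy` and its values are assumptions), not the course data.

```r
library(dplyr)

# Hypothetical toy data with a contiguous block of columns to drop
toy <- tibble(
  id = 1:2,
  health_gen = c("Good", "Fair"),
  days_phys_hlth_bad = c(0, 5),
  education = c("College", "HighSchool"),
  smoke_now = c("No", "Yes")
)

# Parenthesized form: negate the whole range
dropped_with_parens <- toy %>%
  select(-(health_gen:education))

# Form with a minus sign on both ends of the range
dropped_with_two_minuses <- toy %>%
  select(-health_gen:-education)

names(dropped_with_parens)
names(dropped_with_two_minuses)
```

Both calls drop health_gen through education and keep the remaining columns. The parenthesized form is arguably easier to read, since it states the range first and then negates it.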

Naomi Nichols

April 13, 2021

None of the exercises that are evident in your tutorial video are accessible to me in the data-wrangling-and-analysis-exercises RMD file. I just have the code you used.

David Keyes

April 14, 2021

That's very odd! I will follow up with you by email.

Marcus Lee

May 16, 2021

Hi David,

Is there a quick way to select from one column through the last column, e.g., health_gen to the last column, instead of typing out select(health_gen:smoke_now)?

David Keyes

May 16, 2021

Yes, there is! Check out the last_col() function. Using that, you could write: select(health_gen:last_col()).
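Here is that last_col() call in a runnable form. The `toy` data frame is a made-up stand-in for nhanes, so its columns and values are assumptions.

```r
library(dplyr)

# Hypothetical toy data; smoke_now is the last column, as in the question
toy <- tibble(
  id = 1:2,
  age = c(34, 51),
  health_gen = c("Good", "Fair"),
  days_phys_hlth_bad = c(0, 5),
  smoke_now = c("No", "Yes")
)

# Select from health_gen through whatever the last column happens to be
from_health_gen <- toy %>%
  select(health_gen:last_col())

names(from_health_gen)
```

Because last_col() is resolved when the code runs, the selection keeps working if more columns are later added to the end of the data frame.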

Nothing major, but I've noticed that several of the variable names differ between your solutions video and my nhanes (e.g., Height vs height and HealthGen vs. health_gen).

It may be another issue of things changing slightly in the data over time. But it has served as a good reminder to be careful about capitalization (and why to avoid it in variable names).

Interesting. Are you running the clean_names() function? I'm wondering if that might explain the discrepancy.

ah, I believe that was it. I must not have done that the last time.

When I run clean_names I see a tibble created in the RMarkdown that has all the clean variable names (e.g., marital_status). But in the next code chunk if I input: nhanes %>% select(marital_status) I get an error Error: Can't subset columns that don't exist. x Column marital_status doesn't exist. Run rlang::last_error() to see where the error occurred.

But if I run nhanes %>% select(MaritalStatus)

it works

I believe what's happening here is that you're getting confused between displaying the results of running code vs assigning the results of your code to a new (or existing) object. Here is a video I made for a different lesson, but which I think will help you understand why the variable marital_status doesn't exist in your nhanes object. tldr: you need to run code with the clean_names() function to create the nhanes object like so:

library(tidyverse)
library(janitor)

nhanes <- read_csv("../data/nhanes.csv") %>% 
  clean_names()

Thanks for the help. But I don't think that's my issue.

nhanes %>% select(marital_status)

##I get the error: "Error: Can't subset columns that don't exist. x Column marital_status doesn't exist. Run rlang::last_error() to see where the error occurred." But using the original variable name MaritalStatus, the code runs fine.

David Keyes

October 6, 2021

Ah, got it! In that case, I think the difference is likely to do with whether you're using the clean_names() function or not. The variable name in the CSV is MaritalStatus, but it becomes marital_status if you run clean_names(). Let me know if that helps!

I believe I have been running clean_names(). I've also tried clicking on the Run All Chunks Above option instead of just running the code chunk and clean_names() gets a tibble to appear with the cleaned names (age_decade), but I have to do that in every code chunk. In your Solutions video, the clean_names() continues to work through all subsequent code chunks.

David Keyes

October 7, 2021

Could you please post your code as a GitHub Gist and post the link so I can take a look?

Here you go: https://gist.github.com/MmattC/3ebdee50c012bcb813da5482d86f0491

Chhavi Kotwani

March 18, 2022

Hi David!

I ran clean_names on nhanes and then displayed nhanes to see if it worked - it did. However, when I move on to the select function, it refuses to recognize the cleaned version and still refers to the earlier version. Is there something I missed?

Thanks!

Charlie Hadley

March 19, 2022

Hello Chhavi! The most likely cause of this error is writing code that doesn't make an assignment. Let me take you through some steps.

Step 1: Run this code

nhanes %>% 
  clean_names()

Step 2: Run this code

nhanes %>% 
  select(days_phys_hlth_bad)

This will cause an issue because when you ran the first step the output was only being printed to the console and not assigned - meaning that nhanes hasn't been changed. Instead, if you followed these steps the code would work:

Step a: Assign the result of your code to nhanes

nhanes <- nhanes %>% 
  clean_names()

Step b: Run this and the code will work

nhanes %>% 
  select(days_phys_hlth_bad)

If this hasn't answered your question please do give me more details.

Cheers,

Charlotte

Tatiana Bustos

July 27, 2022

I'm getting an error "attempt to use zero-length variable name" when I use the following code:

nhanes %>% 
  select(marital_status, education)

Any idea what the error message means? It worked for the single select.

Tatiana Bustos

July 28, 2022

Figured it out - I was highlighting the back ticks with my code. I thought I had to highlight everything :X

Tatiana Bustos

July 28, 2022

Just reflecting on the data wrangling - it looks like the data in the Excel (or CSV) sheet has to be set up just right to be able to use these coding exercises. Can you share more about the data file preparation? What practices should we have in place regarding variables, types of inputted data, etc.? A lot of my time is spent in data cleaning before actually getting to the analyses. Sorry if I am getting ahead!

David Keyes

July 29, 2022

So I got this data from the NHANES R package. It's been a while so I don't remember exactly what I did (this is bad practice from a time I didn't know better), but in terms of general data cleaning advice, here are a couple of articles:

https://ivelasq.rbind.io/blog/tidying-census-data/

https://evamaerey.github.io/little_flipbooks_library/data_cleaning/data_cleaning

The main thing I would say is that all data cleaning should happen in R so that you can always see what you did and re-run your cleaning code in the future (again, I didn't follow this advice).

Elsa Bailey

October 4, 2022

Could you please go over the use of "quotes" around a term. When are quotes required, and when are they not necessary? For example, quotes are used here - select(contains("hlth_bad")). But no quotes are used here - select(marital_status, education). Thanks!

David Keyes

October 5, 2022

Yes, we can definitely discuss this!
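As a rough rule of thumb for the question above: inside select(), a bare name refers to a column, while a helper such as contains() takes a quoted string because it is text to match against the column names. The sketch below uses a made-up data frame (`toy`, its values, and the `wanted` vector are assumptions), and also shows all_of(), which is one way to select from a character vector of names.

```r
library(dplyr)

# Hypothetical toy data using cleaned column names from the exercises
toy <- tibble(
  marital_status = c("Married", "Divorced"),
  education = c("College", "HighSchool"),
  days_phys_hlth_bad = c(0, 5),
  days_ment_hlth_bad = c(2, 0)
)

# Bare names: each one refers directly to a column
by_name <- toy %>%
  select(marital_status, education)

# Quoted string: a piece of text that contains() looks for in column names
by_match <- toy %>%
  select(contains("hlth_bad"))

# Column names stored as strings in a vector: wrap the vector in all_of()
wanted <- c("marital_status", "education")
by_vector <- toy %>%
  select(all_of(wanted))

names(by_name)
names(by_match)
names(by_vector)
```

Here hlth_bad is not a column itself, only a fragment shared by two column names, which is why it has to be passed as a string.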

Rachel Nicholson

October 5, 2022

I believe I'm having the same issue as described by Matt M below. I have run the clean_names function and I get an output with the new correct names. However when I run the select functions if I don't put in the previous names I get an error message that says the columns don't exist. If I put in the non-cleaned names it works fine. I see that you have a video below, but I get a message that I don't have permission to view the video. Could you let me know what the solution was?

Rachel Nicholson

October 5, 2022

I found a temporary solution - if I run "clean_names" inside of each code chunk then it works, but I have to add that to every single code chunk. Is that correct?

Charlie Hadley

October 6, 2022

Hello Alyssa! As a participant on our VSA course, we had a live session where we discussed this. But so that anyone reading the comments can benefit, let me answer your question here too. When we run code in R scripts or RMarkdown documents, we are almost always outputting content to the console - there are things that appear under the code chunk or in the console. If we want to store the results of code, then we must make sure to use an assignment, e.g., data_clean <- data_raw %>% clean_names()

If we don't make an assignment all that happens is the output is printed.

Hope that helps! Charlie
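The difference between printing and assigning can be checked on a tiny example. This is a sketch with made-up data: `raw_data` and `clean_data` are hypothetical names and values, and only the column names come from the nhanes file discussed in this thread.

```r
library(dplyr)
library(janitor)

# Hypothetical toy data using two original column names from the CSV
raw_data <- tibble(
  MaritalStatus = c("Married", "Divorced"),
  DaysPhysHlthBad = c(0, 3)
)

# No assignment: the cleaned version is printed, then thrown away
raw_data %>%
  clean_names()

# raw_data still has its original column names
names(raw_data)

# Assignment: the cleaned version is stored in a new object
clean_data <- raw_data %>%
  clean_names()

# The cleaned names now exist, so select() can find them
names(clean_data)
clean_data %>%
  select(marital_status)
```

In an RMarkdown document, one chunk near the top that makes this assignment should be enough. Later chunks run in the same session can then use the cleaned names without calling clean_names() again.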