select()

This lesson is called select(), part of the Fundamentals of R course.

View code shown in video

# Load Packages -----------------------------------------------------------

library(tidyverse)

# Import Data -------------------------------------------------------------

penguins <- read_csv("penguins.csv")

# select() ----------------------------------------------------------------

penguins

# With select() we can select variables from the larger data frame.

penguins |> 
  select(bill_length_mm)

# We can also use select() for multiple variables:

penguins |>
  select(bill_length_mm, bill_depth_mm)

# select() has several helper functions for selecting variables.

# The contains() function finds any variable with certain text 
# in the variable name:

penguins |>
  select(contains("bill"))

# The starts_with() function allows us to select variables 
# that start with certain text:

penguins |> 
  select(starts_with("bill"))

# The ends_with() function allows us to select variables
# that end with certain text:

penguins |> 
  select(ends_with("mm"))

# We can select a range of columns using the var1:var2 pattern:

penguins |> 
  select(species:bill_length_mm)

# We can drop variables using the -var format:

penguins |> 
  select(-bill_length_mm)

# We can drop a set of variables using the -(var1:var2) format:

penguins |> 
  select(-(bill_length_mm:flipper_length_mm))
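
In the penguins data, contains("bill") and starts_with("bill") happen to return the same columns, so the difference between the helpers is easy to miss. The sketch below is one way to see it. It uses a hypothetical one-row data frame named toy as a stand-in for penguins.csv; the values and the column order are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical one-row stand-in for penguins.csv (values and column order are assumed)
toy <- data.frame(
  species = "Adelie", island = "Torgersen",
  bill_length_mm = 39.1, bill_depth_mm = 18.7,
  flipper_length_mm = 181, body_mass_g = 3750,
  sex = "male", year = 2007
)

# contains() matches text anywhere in the variable name
length_cols <- toy |> select(contains("length")) |> names()

# starts_with() only matches at the beginning, so "length" finds no variables
no_cols <- toy |> select(starts_with("length")) |> names()

# ends_with() only matches at the end of the variable name
mm_cols <- toy |> select(ends_with("mm")) |> names()

print(length_cols)
print(no_cols)
print(mm_cols)
```

Piping into names() is just a compact way to check which variables a select() call keeps without printing the data itself.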

Your Turn

Copy the code below into your R script file and complete the exercises. Please make sure you are using the CSV file that you downloaded from https://rfor.us/penguins. If you use the penguins_data.csv file from the Getting Started with R course or continue to use the penguins object you created in that course, you will run into problems!

# Load Packages -----------------------------------------------------------

# Load the tidyverse package

library(tidyverse)

# Import Data -------------------------------------------------------------

# Download data from https://rfor.us/penguins
# Copy the data into the RStudio project
# Create a new R script file and add code to import your data

penguins <- read_csv("penguins.csv")

# select() ----------------------------------------------------------------

# Use select() to keep only the sex variable

# YOUR CODE HERE

# Use select() to keep the island and sex variables

# YOUR CODE HERE

# Use one of the select() helper functions to keep all variables that have the letter s in their names

# YOUR CODE HERE

# Use one of the select() helper functions to keep all variables that start with the letter b

# YOUR CODE HERE

# Use select() to keep the variables from island to the end

# YOUR CODE HERE

# Use the dropping syntax with - to keep the same variables as above (island to the end)

# YOUR CODE HERE

# Drop all variables from bill_length_mm to body_mass_g

# YOUR CODE HERE

Learn More

To learn more about the select() function, check out Chapter 3 of R for Data Science.

Have any questions? Put them below and we will help you out!


Olivia Noel • September 20, 2023

This may be covered in a future lesson, but I'm wondering about toggling variables on/off in the data view instead of using the select() function. I understand that the select() function will allow me to view the data for selected variables in the console, though I don't find this very useful (or maybe am just not used to it?), as it's providing such a small snippet of the data (i.e. only the first 10 rows; I'm also not sure how this would look if I were working with a dataset that is 40 vars wide). Currently, when I'm reviewing my data in Stata, I open up the data viewer, which looks similar to the data tab in RStudio. I can then select any variables to show/hide directly in this panel, and then may highlight a particular observation of interest and toggle additional variables on or off. I'm not sure that the select() function in RStudio would allow me to explore my data as seamlessly -- but, like I said, maybe this will come in a future lesson, or I will get used to viewing my data in the console :P

Olivia Noel • September 20, 2023

As a response to my own question, I saw on the filter() lesson that I can use view() to open up the selected variables in a new data pane! I imagine this being less "click-and-pointy" than my current workflow in Stata, as I imagine I would need to return to my R script/console many times to be toggling variables on and off as I explore my data -- but this already gets me much closer to what I'm trying to do!

Gracielle Higino Coach • September 20, 2023

Hi Olivia! That's a great question. I also sometimes wish RStudio/Posit had a more intuitive and exploratory view function, but you totally get used to doing everything on the command line! I think this solution using view() is the most straightforward. You can use the following code to check only the first three columns, for example:

penguins |>
  select(1:3) |>
  view()
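
If the worry is mainly that the console shows only the first 10 rows, another option is to ask print() for more. The sketch below assumes a hypothetical 30-row tibble named toy rather than the real penguins data; n and width are arguments of the tibble print method, and glimpse() is an alternative that lists one variable per line, which may suit wide data sets.

```r
library(dplyr)

# Hypothetical 30-row tibble standing in for a larger data set
toy <- tibble(id = 1:30, value = id * 2)

# Print 20 rows instead of the default 10; width = Inf shows every column
toy |> print(n = 20, width = Inf)

# glimpse() turns the data on its side: one line per variable
toy |> glimpse()
```

Neither call changes the data; both only control how much of it is displayed in the console.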

Lorenzo Dragani • March 15, 2024

I think there is a mistake in the solution provided to one of the exercises, namely the one that asks to use select() to keep the variables from island to the end. The solution provided says that the command is: penguins |> select(island:sex) but it seems that the last column of the penguins dataset is year so the answer should be either: penguins |> select(island:year) or equivalently: penguins |> select(island:last_col()) . Is this correct or am I mistaken? Thanks.

David Keyes Founder • March 15, 2024

Oh, you're totally right. Fixed!

Gabby Bachhuber • March 19, 2024

What's the logic for the "-" (minus) in the second solution below being in its own parenthesis but it's not in the first solution? That tripped me up.

penguins |> select(-species)

penguins |> select(-(bill_length_mm:body_mass_g))

David Keyes Founder • March 19, 2024

Recorded a little video for you. Let me know if it helps!
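
For readers without access to the video, here is one way to think about it (a sketch, not necessarily the explanation David gave). In R, a leading minus binds more tightly than the : operator. With a single name such as -species there is nothing for the minus to be confused with, but with a range the parentheses are what make the minus apply to the whole range instead of only its first name. The toy data frame below is a hypothetical one-row stand-in for penguins.csv, with the column order assumed.

```r
library(dplyr)

# Hypothetical one-row stand-in for penguins.csv (values and column order are assumed)
toy <- data.frame(
  species = "Adelie", island = "Torgersen",
  bill_length_mm = 39.1, bill_depth_mm = 18.7,
  flipper_length_mm = 181, body_mass_g = 3750,
  sex = "male", year = 2007
)

# One variable: the minus applies to that single name
drop_one <- toy |> select(-species) |> names()

# A range: the parentheses make the minus apply to the whole range
drop_range <- toy |> select(-(bill_length_mm:body_mass_g)) |> names()

print(drop_one)
print(drop_range)
```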

Charles Obiorah • March 21, 2024

Hello David, I just finished this lesson, and I have the following concerns:

1. In the video, in importing the data penguins, you only used the data frame without adding .csv. It kept returning an error message of not having the file. Meanwhile, I could see the file below and on the environment. Well after a while I had to add .csv before I could continue.
2. If you were to select a range from a big frame, how do you name the first and last columns without having to type in the names of the columns? That is, supposing that you would not go to the table to know their names.
3. I do not understand and therefore did not attempt how to use the drop syntax (-) and yet keep the variable of interest.

David Keyes Founder • March 22, 2024

In the video, in importing the data penguins, you only used the data frame without adding .csv. It kept returning an error message of not having the file. Meanwhile, I could see the file below and on the environment. Well after a while I had to add .csv before I could continue.

I'm not sure I quite follow your question. Could you please elaborate and/or post a video using this link?

if you were to select a range from a big frame, how do you name the first and last columns without having to type in the names of the columns? That is, supposing that you would not go to the table to know their names

You could do it with this syntax:

penguins |>
  select(1, last_col())

In this code, the 1 refers to the first column (the select() function can select columns by position as well as name) and the last_col() function selects the last column.
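
Note that select(1, last_col()) keeps just those two columns. If the goal is a range of columns chosen by position, positions and last_col() can also be combined with the : pattern. The sketch below assumes a hypothetical one-row data frame named toy in place of penguins.csv, with the column order assumed.

```r
library(dplyr)

# Hypothetical one-row stand-in for penguins.csv (values and column order are assumed)
toy <- data.frame(
  species = "Adelie", island = "Torgersen",
  bill_length_mm = 39.1, bill_depth_mm = 18.7,
  flipper_length_mm = 181, body_mass_g = 3750,
  sex = "male", year = 2007
)

# Only the first and the last column
first_last <- toy |> select(1, last_col()) |> names()

# Every column from the second one to the last one
second_to_end <- toy |> select(2:last_col()) |> names()

print(first_last)
print(second_to_end)
```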

I do not understand and therefore did not attempt how to use the drop syntax (-) and yet keep the variable of interest

Happy to give further guidance if there are particular things you did not understand.

Queeneth Onwuka • March 19, 2025

# Use the dropping syntax with - to keep the same variables as above (island to the end)
penguins |> 
  select(-island:year)

This does not remove the "species" variable (unless I add -1, as in select(-1, island:year)), but this does:

# Use select() to keep the variables from island to the end
penguins |> 
  select(island:year)

Gracielle Higino Coach • March 21, 2025

Hi Queeneth! As we mentioned today in the live session, we're very intrigued by this. It seems to be a syntax complexity, I'll try to detail it here - if it's still confusing, let me know and I can record a video for you!

Using `-x:y` in the `select()` function

The : in R indicates a sequence. If you run -1:5 in the console, the result is the sequence of integers from -1 to 5, because the minus sign applies only to the 1. My guess is that, when you write select(-island:year), the minus attaches only to island rather than to the whole island:year range, so R is trying to build a sequence between -island and year. That is not the range you meant to drop, and you end up seeing all your columns in the output. It does not return an error because, as far as R is concerned, the call ran successfully.
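
The -1:5 behaviour can be checked in plain R, with no packages loaded. This small sketch only shows how R reads the two forms with numbers; it does not attempt to reproduce what select() does internally.

```r
# Without parentheses, the minus applies only to the 1: this is (-1):5
x <- -1:5

# With parentheses, the minus applies to the whole sequence 1:5
y <- -(1:5)

print(x)
print(y)
```

Here x has seven elements running from -1 up to 5, while y has five elements, all negative.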

Using `-(x:y)` or `-c()` in the `select()` function

In this case you are telling R that you want it to look at the set of column names located between x and y, including these two, and create a new set of elements (column names) excluding them. The c() function and the parentheses are a small syntax tweak that helps R understand that you have a predefined set of elements. The result is that it finds those column names in your data frame and excludes them.

select() as an ordering function

One thing to notice is that select() can also be used as an ordering function, to reorder your columns. select(c(year, island:bill_depth_mm)) keeps only these columns, in this specific order.
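
To reorder without dropping anything, one common pattern is to name the columns you want first and then add the everything() helper for the rest. The sketch below assumes a hypothetical four-column data frame named toy, not the real penguins data.

```r
library(dplyr)

# Hypothetical small data frame; the column names mimic a few penguins variables
toy <- data.frame(
  species = "Adelie", island = "Torgersen",
  sex = "male", year = 2007
)

# year moves to the front; everything() then adds the remaining columns in their original order
year_first <- toy |> select(year, everything()) |> names()

print(year_first)
```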

Questions remaining

I noticed, though, that using the minus sign in reverse order works well! For example, you can use select(-year:bill_depth_mm) to exclude columns from year to bill_depth_mm (but include bill_depth_mm) or select(-year:-bill_depth_mm) to exclude columns from year to bill_depth_mm (but exclude bill_depth_mm). I have no clue why this works only in the reverse order, though! 😂

Queeneth Onwuka • March 24, 2025

Hi Gracielle,

Still quite confusing. Could you please do a short video.

Use the dropping syntax with - to keep the same variables as above (island to the end)

penguins |> select(-1, island: year)

or

penguins |> select(-species)

I wrote the first code for the DIY assignment after the select video and the solution as shown is the second code.

Gracielle Higino Coach • March 29, 2025

Hey Queeneth! I tried to illustrate it better here in this video. It's quite complicated to explain why the "-" syntax only works with parentheses, but I think it'll be helpful to see what happens in each case!

David also has recorded a video to another participant above in this thread, and you can watch his video here.