Skip to content
R for the Rest of Us Logo

This lesson is locked

Get access to all lessons in this course.

Transcript

Click on the transcript to go to that point in the video. Please note that transcripts are auto generated and may contain minor inaccuracies.

Your Turn

  1. Download the oregon-districts.xlsx file into the data-raw folder. The URL to download the file is https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/oregon-districts.xlsx

  2. Import a new data frame called oregon_districts from oregon-districts.xlsx

  3. Merge the oregon_districts data frame into the enrollment_by_race_ethnicity data frame so you can see the names of the districts

Learn More

If you do nothing else, check out the tidyexplain project mentioned throughout the video. The animations are incredible for grasping how joins work.

Two other references to help you understand joining data are Chapter 13 of R for Data Science and Chapter 15 of Stat 545.

Have any questions? Put them below and we will help you out!

You need to be signed-in to comment on this post. Login.

Daniel Sossa

Daniel Sossa

March 24, 2021

Hello, If we have 2 df with many columns on both and we want to add only ONE column from df2 to df1 (instead of merging the whole thing), what should we do? is there an argument on the left_join function to do that ?

David Keyes

David Keyes

March 24, 2021

There probably is a more elegant way to do this, but I usually just do a left join and then drop the columns I don't need using select(-c(col1:col2). Does that help?

Daniel Sossa

Daniel Sossa

March 24, 2021

That works, thanks!

Jody Oconnor

Jody Oconnor

May 6, 2021

When I download the file it I end up with something in my data-raw folder that can not be opened by Excel or by R. This is the error message:

> or_districts <- read_excel(path = "data-raw/oregon_districts.xlsx",

  •                        sheet = &quot;Sheet1&quot;)
    

Error: Evaluation error: zip file 'data-raw/oregon_districts.xlsx' cannot be opened.

If I go to the url directly and download it to my laptop then move it into the data-raw folder I am able to open it. How can I get R to download a usable file directly? Thanks!

David Keyes

David Keyes

May 6, 2021

Did you use the mode = "wb" argument when downloading it in the Importing Data lesson? See the big orange box on that lesson which explains why this might be necessary.

Jody Oconnor

Jody Oconnor

May 6, 2021

Easy fix - that solved the issue. Thanks!

Atlang Mompe

Atlang Mompe

May 9, 2021

Hi David, Out of curiosity, how would you do an anti-join in order to keep 4, 4y? Thanks, Atty

David Keyes

David Keyes

May 10, 2021

I'm not sure I quite understand your question. Do you mind expanding?

Atlang Mompe

Atlang Mompe

May 9, 2021

Hi David,

When I import the district data - I cannot read sheet 1, so I ended up using this code without specifying sheet 1: oregon_districts <- read_excel(path = "data-raw/oregon-districts.xlsx").

Also, I noticed that maybe you omitted some steps in your final code (enrollment_by_race_ethnicity) - it is not in the solution code you shared, but it is in the video.

Here is my code, I wonder why I could not read my excel Sheet 1.

#Download Oregon District File download.file(url = "https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/oregon-districts.xlsx", mode = "wb", destfile = "data-raw/oregon-districts.xlsx")

#Import Oregon District Data oregon_districts % clean_names ()

#Join Race/Ethnicity and Oregon District Data enrollment_by_race_ethnicity <- left_join(enrollment_by_race_ethnicity, oregon_districts, by = c("district_id" = "attending_district_institutional_id"))

Thanks, Atty

David Keyes

David Keyes

May 10, 2021

On the "Sheet 1" issue, are you sure you had a capital S? If not, it wouldn't work.

On the missing code, I'm not exactly sure what's missing. Do you mind giving me a bit more detail? Thanks!

the code shown below the solution video appears to have an error and does not match the code in the Solution video. On line 38 it is missing the x data frame of the left_join() "enrollment_by_race_ethnicity" should come before the comma with oregon_districts after, right? [that's what was done in the solutions video

Charlie Hadley

Charlie Hadley

December 3, 2021

What's happening on line 38 is that the tibble created by bind_rows() is being piped into the left_join() which means that becomes the 'x' part of the join.

Here's a slightly simpler version of what's happening

data %>% left_join(other_data, by = c("name" = "NAME"))

I understand this distinction now (it is taught a few lessons later). But the code that shows up below the Solutions video is not the same as the code used in Solutions video.

David Keyes

David Keyes

December 6, 2021

Thanks for pointing this out, Matt. I've updated the code so it matches.

The animations are Fantastic. The slide text describing full_join() and and inner_join() was confusing for me. The slides appear to be using the same text (I’ve checked multiple times and can’t find a difference) and was confusing.

The animation of inner_join() makes it look like it does NOT keep “all rows and all columns from both x and y”. it only does so IF there is a match.

Is this a correct interpretation of inner_join(): “Only returns rows where there are matches between x and y”?

Charlie Hadley

Charlie Hadley

December 3, 2021

Honestly, I've almost always used left_join() with a filter() to ensure I get the results I want. But the other joins can be a faster way to get there. You're right about inner_join() this will only return matches that appear in both datasets. Here's an example demonstrating what I meant about using filter

band_members %>% inner_join(band_instruments)

band_members %>% filter(name %in% band_instruments$name) %>% left_join(band_instruments)

A good question is "why are there different kinds of join, then?!". Joins come from SQL where folks are working with (very) large databases and it's more efficient to use inner_join() than pre-filtering the data.

JULIO VERA DE LEON

JULIO VERA DE LEON

April 29, 2022

Hi! When I try to use the read_excel function I get this error:

Error: filepath: /Users/julio/Library/Mobile Documents/com~apple~CloudDocs/R/R for the Rest of Us/Going Deeper with R/going-deeper-dk/data-raw/oregon-districts.xls libxls error: Unable to open file

This is my code: oregon_districs <- read_excel(path = "data-raw/oregon-districts.xls", sheet = "Sheet1")

If I open it with excel I get a warning that the format and extension of the file don't match. In the end I can open it and see the values if I press I want to open it anyway.

JULIO VERA DE LEON

JULIO VERA DE LEON

April 29, 2022

I made it work with this https://github.com/tidyverse/readxl/issues/598#issuecomment-597718089

Charlie Hadley

Charlie Hadley

May 3, 2022

I'm pleased you were able to find this solution! Unfortunately this is a consequence of the very old and outdated .xls file type. Could you help me find where you got the link to the .xls file as the solutions section links to the .xlsx file for which there aren't any issues.

download.file("https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/oregon-districts.xlsx",
              "data-raw/oregon-districts.xlsx")
oregon_districts &lt;- read_excel(&quot;data-raw/oregon-districts.xlsx&quot;,
                               sheet = &quot;Sheet1&quot;)

Thanks,

Charlie

JULIO VERA DE LEON

JULIO VERA DE LEON

May 3, 2022

Hi! Yes, sure. I found it here:

https://stackoverflow.com/questions/59218057/readxlread-xls-returns-libxls-error-unable-to-open-file

They have there the link to the respective issue on GitHub.

Charlie Hadley

Charlie Hadley

May 5, 2022

Thank you!

Sara Cifuentes

Sara Cifuentes

May 25, 2022

Dear Charlie and David, I do not know what I'm doing wrong. I used inner_join to join a df that has 72500 observations and 253 variables (x), and I wanted to join with a df that has 718232 observations and 2 variables (y). I joined them using the variable in which they share the same code ("annotation"). I expected that I would have 72,508 observations and 254 variables in my new matrix, but a matrix with 718,232 observations and 254 variables was generated.

gene_ecoli_mat <- inner_join(matrix_profile_ecoli, ecoli_pangenome_annot, by = "Annotation")

I thought there would be a matrix where the 72508 observations are shared.

Thanks in advance,

Sara

Charlie Hadley

Charlie Hadley

May 26, 2022

Hi Sara,

Let's use this syntax: inner_join(x, y) which we'd write with pipes as follows

x %>% inner_join(y)

inner_join(x, y) will only join those observations that exist in both x and y. This means that the number of rows in your output must be equal to or lower than the number of rows in the smallest dataset - which is why you ended up with 718,232 observations. The output will also contain the combination of all unique columns across the two datasets, which is why there are 254 variables.

There are some excellent animations demonstrating the different joins built by Garrick Aden-Buie.

I'd also be slightly careful about the difference between a matrix and a data.frame (or tibble). In R we can create matrix() objects and we use them for linear algebra and other things, but they are not general purpose rectangular data stores.

Let me know if you have any questions.

Cheers,

Charlie

I am often combining multiple datasets where I combining by name. Many times the names are slightly different. (Things like Jr. or AJ vs A.J. or Mike vs Michael) Do you have any suggestions to do handle this?

Charlie Hadley

Charlie Hadley

November 11, 2022

Hi Matt,

There's a useful package called {fuzzyjoin} that you can find here - https://github.com/dgrtwo/fuzzyjoin. This allows you to match strings together based on string distance. There are a few different distance metrics built into the package, however I'm not sure which one would perform best for this kind of data.

Cheers,

Charlie

In the solutions code lines 51-53 seem redundant?

David Keyes

David Keyes

January 13, 2023

Ah yes, you're right. Updated. Thanks!