Get access to all lessons in this course.
-
Advanced Data Wrangling and Analysis
- Overview
- Importing Data
- Tidy Data
- Reshaping Data
- Dealing with Missing Data
- Changing Variable Types
- Advanced Variable Creation
- Advanced Summarizing
- Binding Data Frames
- Functions
- Merging Data
- Renaming Variables
- Quick Interlude to Reorganize our Code
- Exporting Data
-
Advanced Data Visualization
- Data Visualization Best Practices
- Tidy Data
- Pipe Data Into ggplot
- Reorder Plots to Highlight Findings
- Line Charts
- Use Color to Highlight Findings
- Declutter
- Use the scales Package for Nicely Formatted Values
- Use Direct Labeling
- Use Axis Text Wisely
- Use Titles to Highlight Findings
- Use Color in Titles to Highlight Findings
- Use Annotations to Explain
- Tweak Spacing
- Customize Your Theme
- Customize Your Fonts
- Try New Plot Types
-
Advanced RMarkdown
- Advanced Markdown Text Formatting
- Tables
- Advanced YAML
- Inline R Code
- Making Your Reports Shine: Word Edition
- Making Your Reports Shine: HTML Edition
- Making Your Reports Shine: PDF Edition
- Presentations
- Dashboards
- Other Formats
-
Wrapping Up
- You Did It!
Going Deeper with R
Merging Data
This lesson is locked
This lesson is called Merging Data, part of the Going Deeper with R course. This lesson is called Merging Data, part of the Going Deeper with R course.
Transcript
Click on the transcript to go to that point in the video. Please note that transcripts are auto generated and may contain minor inaccuracies.
Your Turn
Download the oregon-districts.xlsx file into the data-raw folder. The URL to download the file is https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/oregon-districts.xlsx
Import a new data frame called
oregon_districts
from oregon-districts.xlsxMerge the
oregon_districts
data frame into theenrollment_by_race_ethnicity
data frame so you can see the names of the districts
Learn More
If you do nothing else, check out the tidyexplain project mentioned throughout the video. The animations are incredible for grasping how joins work.
Two other references to help you understand joining data are Chapter 13 of R for Data Science and Chapter 15 of Stat 545.
You need to be signed-in to comment on this post. Login.
Daniel Sossa
March 24, 2021
Hello, If we have 2 df with many columns on both and we want to add only ONE column from df2 to df1 (instead of merging the whole thing), what should we do? is there an argument on the left_join function to do that ?
David Keyes
March 24, 2021
There probably is a more elegant way to do this, but I usually just do a left join and then drop the columns I don't need using select(-c(col1:col2). Does that help?
Daniel Sossa
March 24, 2021
That works, thanks!
Jody Oconnor
May 6, 2021
When I download the file it I end up with something in my data-raw folder that can not be opened by Excel or by R. This is the error message:
> or_districts <- read_excel(path = "data-raw/oregon_districts.xlsx",
Error: Evaluation error: zip file 'data-raw/oregon_districts.xlsx' cannot be opened.
If I go to the url directly and download it to my laptop then move it into the data-raw folder I am able to open it. How can I get R to download a usable file directly? Thanks!
David Keyes
May 6, 2021
Did you use the mode = "wb" argument when downloading it in the Importing Data lesson? See the big orange box on that lesson which explains why this might be necessary.
Jody Oconnor
May 6, 2021
Easy fix - that solved the issue. Thanks!
Atlang Mompe
May 9, 2021
Hi David, Out of curiosity, how would you do an anti-join in order to keep 4, 4y? Thanks, Atty
David Keyes
May 10, 2021
I'm not sure I quite understand your question. Do you mind expanding?
Atlang Mompe
May 9, 2021
Hi David,
When I import the district data - I cannot read sheet 1, so I ended up using this code without specifying sheet 1: oregon_districts <- read_excel(path = "data-raw/oregon-districts.xlsx").
Also, I noticed that maybe you omitted some steps in your final code (enrollment_by_race_ethnicity) - it is not in the solution code you shared, but it is in the video.
Here is my code, I wonder why I could not read my excel Sheet 1.
#Download Oregon District File download.file(url = "https://github.com/rfortherestofus/going-deeper/raw/master/data-raw/oregon-districts.xlsx", mode = "wb", destfile = "data-raw/oregon-districts.xlsx")
#Import Oregon District Data oregon_districts % clean_names ()
#Join Race/Ethnicity and Oregon District Data enrollment_by_race_ethnicity <- left_join(enrollment_by_race_ethnicity, oregon_districts, by = c("district_id" = "attending_district_institutional_id"))
Thanks, Atty
David Keyes
May 10, 2021
On the "Sheet 1" issue, are you sure you had a capital S? If not, it wouldn't work.
On the missing code, I'm not exactly sure what's missing. Do you mind giving me a bit more detail? Thanks!
Matt M
December 2, 2021
the code shown below the solution video appears to have an error and does not match the code in the Solution video. On line 38 it is missing the x data frame of the left_join() "enrollment_by_race_ethnicity" should come before the comma with oregon_districts after, right? [that's what was done in the solutions video
Charlie Hadley
December 3, 2021
What's happening on line 38 is that the tibble created by bind_rows() is being piped into the left_join() which means that becomes the 'x' part of the join.
Here's a slightly simpler version of what's happening
data %>% left_join(other_data, by = c("name" = "NAME"))
Matt M
December 4, 2021
I understand this distinction now (it is taught a few lessons later). But the code that shows up below the Solutions video is not the same as the code used in Solutions video.
David Keyes
December 6, 2021
Thanks for pointing this out, Matt. I've updated the code so it matches.
Matt M
December 2, 2021
The animations are Fantastic. The slide text describing full_join() and and inner_join() was confusing for me. The slides appear to be using the same text (I’ve checked multiple times and can’t find a difference) and was confusing.
The animation of inner_join() makes it look like it does NOT keep “all rows and all columns from both x and y”. it only does so IF there is a match.
Is this a correct interpretation of inner_join(): “Only returns rows where there are matches between x and y”?
Charlie Hadley
December 3, 2021
Honestly, I've almost always used left_join() with a filter() to ensure I get the results I want. But the other joins can be a faster way to get there. You're right about inner_join() this will only return matches that appear in both datasets. Here's an example demonstrating what I meant about using filter
band_members %>% inner_join(band_instruments)
band_members %>% filter(name %in% band_instruments$name) %>% left_join(band_instruments)
A good question is "why are there different kinds of join, then?!". Joins come from SQL where folks are working with (very) large databases and it's more efficient to use inner_join() than pre-filtering the data.
JULIO VERA DE LEON
April 29, 2022
Hi! When I try to use the read_excel function I get this error:
Error: filepath: /Users/julio/Library/Mobile Documents/com~apple~CloudDocs/R/R for the Rest of Us/Going Deeper with R/going-deeper-dk/data-raw/oregon-districts.xls libxls error: Unable to open file
This is my code: oregon_districs <- read_excel(path = "data-raw/oregon-districts.xls", sheet = "Sheet1")
If I open it with excel I get a warning that the format and extension of the file don't match. In the end I can open it and see the values if I press I want to open it anyway.
JULIO VERA DE LEON
April 29, 2022
I made it work with this https://github.com/tidyverse/readxl/issues/598#issuecomment-597718089
Charlie Hadley
May 3, 2022
I'm pleased you were able to find this solution! Unfortunately this is a consequence of the very old and outdated .xls file type. Could you help me find where you got the link to the .xls file as the solutions section links to the .xlsx file for which there aren't any issues.
Thanks,
Charlie
JULIO VERA DE LEON
May 3, 2022
Hi! Yes, sure. I found it here:
https://stackoverflow.com/questions/59218057/readxlread-xls-returns-libxls-error-unable-to-open-file
They have there the link to the respective issue on GitHub.
Charlie Hadley
May 5, 2022
Thank you!
Sara Cifuentes
May 25, 2022
Dear Charlie and David, I do not know what I'm doing wrong. I used inner_join to join a df that has 72500 observations and 253 variables (x), and I wanted to join with a df that has 718232 observations and 2 variables (y). I joined them using the variable in which they share the same code ("annotation"). I expected that I would have 72,508 observations and 254 variables in my new matrix, but a matrix with 718,232 observations and 254 variables was generated.
gene_ecoli_mat <- inner_join(matrix_profile_ecoli, ecoli_pangenome_annot, by = "Annotation")
I thought there would be a matrix where the 72508 observations are shared.
Thanks in advance,
Sara
Charlie Hadley
May 26, 2022
Hi Sara,
Let's use this syntax: inner_join(x, y) which we'd write with pipes as follows
x %>% inner_join(y)
inner_join(x, y) will only join those observations that exist in both x and y. This means that the number of rows in your output must be equal to or lower than the number of rows in the smallest dataset - which is why you ended up with 718,232 observations. The output will also contain the combination of all unique columns across the two datasets, which is why there are 254 variables.
There are some excellent animations demonstrating the different joins built by Garrick Aden-Buie.
I'd also be slightly careful about the difference between a matrix and a data.frame (or tibble). In R we can create matrix() objects and we use them for linear algebra and other things, but they are not general purpose rectangular data stores.
Let me know if you have any questions.
Cheers,
Charlie
Matt Kropp
November 9, 2022
I am often combining multiple datasets where I combining by name. Many times the names are slightly different. (Things like Jr. or AJ vs A.J. or Mike vs Michael) Do you have any suggestions to do handle this?
Charlie Hadley
November 11, 2022
Hi Matt,
There's a useful package called {fuzzyjoin} that you can find here - https://github.com/dgrtwo/fuzzyjoin. This allows you to match strings together based on string distance. There are a few different distance metrics built into the package, however I'm not sure which one would perform best for this kind of data.
Cheers,
Charlie
Hatem Kotb
January 13, 2023
In the solutions code lines 51-53 seem redundant?
David Keyes
January 13, 2023
Ah yes, you're right. Updated. Thanks!