Tidy Data Rule #2: Every Row is an Observation
This lesson is called Tidy Data Rule #2: Every Row is an Observation, part of the Going Deeper with R course.
View code shown in video
activities_survey |>
  separate_longer_delim(cols = activities,
                        delim = ", ")
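The activities_survey data itself is not shown on this page, so here is a minimal sketch using a made-up stand-in. The column names and values below are assumptions for illustration only, and separate_longer_delim() requires tidyr 1.3.0 or later.

```r
library(tidyverse)

# Hypothetical stand-in for the activities_survey data used in the video
activities_survey <- tibble(
  respondent_id = c(1, 2, 3),
  activities = c("Hiking, Reading", "Cooking", "Hiking, Cooking, Running")
)

# Split each comma-separated response into its own row,
# so that every row is one respondent-activity observation
activities_long <- activities_survey |>
  separate_longer_delim(cols = activities,
                        delim = ", ")

# With one activity per row, counting responses is a single step
activities_long |>
  count(activities, sort = TRUE)
```

Note that the delimiter includes the space after the comma. With delim = "," alone, the split values would keep a leading space and would be counted as separate categories.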
Your Turn
Run the following code to view the built-in gss_cat data frame.
library(tidyverse)

gss_cat |>
  view()
Then, write code to count the number of unique responses in the partyid variable. You’ll need to use the separate_longer_delim() and count() functions to do this.
Learn More
This lesson from the R for Social Scientists course materials covers separate_longer_delim().
Have any questions? Put them below and we will help you out!
Olivia Noel • November 1, 2023
I'm seeing a difference in the syntax used in the main video vs the solution video and am wondering if you can please explain the difference. In the main video, you use separate_longer_delim(cols = activities,... but the solution video uses separate_longer_delim(partyid,... without the use of "cols =".
David Keyes Founder • November 1, 2023
This is a great question! The answer is that you can list arguments explicitly (that is, include their name) or by position (that is, by the order they appear in the documentation). There is a good discussion of this in this video, which comes from the Reproducibility for the Rest of Us course.
When teaching, I almost always try to list arguments with their names, but the reality is, when I write code myself, I often skip argument names because it's kind of a pain to write them out. Functionally, there is no difference between the code shown in the video and the code in the solution.
Does that all make sense?
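For readers following along, here is a small sketch of the named versus positional point. The survey data frame is made up for illustration. The positional call relies on cols and delim being the second and third arguments of separate_longer_delim(), after the data.

```r
library(tidyverse)

# Hypothetical data, made up for illustration
survey <- tibble(
  id = c(1, 2),
  activities = c("Hiking, Reading", "Cooking")
)

# Arguments matched by name
named <- survey |>
  separate_longer_delim(cols = activities, delim = ", ")

# Arguments matched by position: cols first, then delim
positional <- survey |>
  separate_longer_delim(activities, ", ")

# Compare the two results
identical(named, positional)
```

Naming arguments costs a few keystrokes but makes the call easier to read later, especially for less familiar arguments such as delim.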
Olivia Noel • November 13, 2023
Thanks David! That makes perfect sense :D
Sara Parisi • November 18, 2024
This video you shared on function arguments was so helpful! I have always been confused about why sometimes I'd see "the argument =" in example code and sometimes not. Thanks!
Rachel Udow • April 25, 2024
When working with "select all that apply" variables in the past in datasets where one row = one individual, I've typically dealt with this by converting each response option to its own column where a 1 ("yes") is present if that response option was selected. This has worked well for my purposes, but I understand now that it violates Tidy Data Rule 1 because a single variable is spread across multiple columns. Am I understanding correctly that while the approach I've used in the past is not inherently better or worse than tidy format, the advantage to making it tidy is that it will be easier to analyze using tidyverse? Is this still true if the unit of analysis I'm interested in is the individual and not the activity (for example)?
David Keyes Founder • April 25, 2024
Your understanding sounds right to me! In terms of analyzing data if you care about individuals rather than responses, you might consider leaving your data as is. At a certain point, what tidy data is kind of just depends on what you plan to do with it. BTW, you may find this chat I had with a former R in 3 Months participant on tidy data interesting.
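As a rough sketch of the two layouts discussed in this exchange, the code below moves between a long format (one row per selected activity) and a wide format (one row per individual, one 0/1 column per option). The data and column names are hypothetical, and this is only one way to do the conversion.

```r
library(tidyverse)

# Hypothetical "select all that apply" data: one row per respondent-activity
activities_long <- tibble(
  respondent_id = c(1, 1, 2, 3, 3),
  activity = c("Hiking", "Reading", "Cooking", "Hiking", "Cooking")
)

# Long -> wide: one row per respondent, one 0/1 column per activity
activities_wide <- activities_long |>
  mutate(selected = 1) |>
  pivot_wider(
    names_from = activity,
    values_from = selected,
    values_fill = 0
  )

# Wide -> long: back to one row per selected activity
activities_back <- activities_wide |>
  pivot_longer(
    cols = -respondent_id,
    names_to = "activity",
    values_to = "selected"
  ) |>
  filter(selected == 1) |>
  select(-selected)
```

Because the conversion works in both directions, the choice of layout can follow the unit of analysis: wide when each row should be an individual, long when each row should be a response.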
Rachel Udow • April 27, 2024
David, thanks for your response, and especially for sharing this video with me -- I'm feeling a lot more clear about tidy data after watching it. Your former student's questions are exactly the ones I've been wrestling with, though he formulated and articulated them much more clearly! (Bonus -- we work with similar kinds of data, so his examples are directly relevant to my work.)
Aside from the very practical points about dealing with "select all that apply" questions, my biggest conceptual takeaways from this video were about tidying data as being both an art and a science, and the ways in which tidy data offers flexibility and promotes strong vs. "brittle" code. A big project I'm currently working on is creating the structure for an "annual dataset" for the Research & Evaluation team I'm part of at work, which will involve many types of data from many different sources. My past/default approach would have been to create one wide dataset with an individual on each row (and many NAs across columns since not all variables apply to all individuals). However, I'm now understanding that the best approach will likely be to create smaller, tidy data frames that can be further tidied or joined as needed to address the multitude of questions that will be answered using the dataset.
This is one of the most significant shifts I've ever experienced in my thinking about data, and I'm grateful for the learning and the resources you've shared to help me wrap my head around it. Thank you again!
David Keyes Founder • April 29, 2024
Glad you're finding it helpful! It seems odd to make a bunch of smaller objects at first, but I get a ton of value in working this way. Here's an example of some code that shows this.
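The linked example is not reproduced on this page. As a separate, minimal sketch of the general idea (not the code David linked), the snippet below keeps two small tidy data frames and joins them only when a question needs both. All table names, column names, and values are made up.

```r
library(tidyverse)

# Hypothetical table with one row per participant
participants <- tibble(
  participant_id = c(1, 2, 3),
  program = c("Youth", "Adult", "Youth")
)

# Hypothetical table with one row per participant-activity
activities <- tibble(
  participant_id = c(1, 1, 3),
  activity = c("Hiking", "Reading", "Cooking")
)

# Join only when a question needs both tables,
# here: which activities were selected within each program?
activities_by_program <- activities |>
  left_join(participants, by = "participant_id") |>
  count(program, activity)

activities_by_program
```

Keeping the tables separate avoids a wide data frame full of NAs for variables that do not apply to every individual.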
Rachel Udow • May 1, 2024
This is great, David - thank you for sharing! Aside from this approach being more manageable visually (in terms of the object width), I imagine there are also huge benefits for performance when working w/ datasets with a lot of records. Thanks again
Tobia Dondè • March 18, 2025
Hello David and thanks for your clear explanations! Just a thought: the gss_cat$partyid variable already seems tidy to me in the sense that the values "Ind,near rep" and "Ind,near dem" are not multiple but simple abbreviations (they should have used a full stop). If I straightforwardly count the values of the variable it doesn't seem that there are multiple answers. Perhaps a useful, albeit scholastic, example is to obtain tidyr::table2 starting from tidyr::table3.
David Keyes Founder • March 18, 2025
Yes, you're probably right. It's the challenge in finding good teaching examples!
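For anyone who wants to try the table3 to table2 exercise suggested in this thread, here is one possible sketch. It assumes that each rate value in tidyr::table3 has the form "cases/population", so the two pieces of every split are labelled in that order. The first step counts partyid directly, as the comment describes.

```r
library(tidyverse)

# Counting partyid as-is: values such as "Ind,near rep" stay intact
gss_cat |>
  count(partyid)

# table3 stores two values in one cell, e.g. "745/19987071".
# Split them into separate rows, then label and convert each piece.
table3_long <- table3 |>
  separate_longer_delim(cols = rate, delim = "/") |>
  mutate(
    type = rep(c("cases", "population"), times = n() / 2),
    count = as.double(rate)
  ) |>
  select(country, year, type, count)

table3_long
```

The rep() call depends on every rate value splitting into exactly two pieces. For messier data, separate_wider_delim() followed by pivot_longer() would be a safer route.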