Tidy Data Rule #2: Every Row is an Observation

This lesson is called Tidy Data Rule #2: Every Row is an Observation, part of the Going Deeper with R course.


Code shown in the video

activities_survey |> 
  separate_longer_delim(cols = activities,
                        delim = ", ")
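To try this outside the video, here is a minimal sketch. The activities_survey data frame below is a hypothetical stand-in (the real one from the video is not reproduced here), and the respondent_id column name is an assumption. It assumes tidyr 1.3.0 or later and R 4.1 or later for the native pipe.

```r
library(tidyverse)

# Hypothetical stand-in for the activities_survey data used in the video:
# one row per respondent, with all selected activities in a single string
activities_survey <- tibble(
  respondent_id = c(1, 2, 3),
  activities = c("Running, Swimming", "Reading", "Running, Reading, Cooking")
)

# Split the comma-separated string so that each activity gets its own row;
# the other columns (here respondent_id) are repeated as needed
activities_long <- activities_survey |>
  separate_longer_delim(cols = activities,
                        delim = ", ")

activities_long
```

Each respondent now appears once per activity they selected, so every row is a single observation.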

Your Turn

Run the following code to view the built-in gss_cat data frame.

library(tidyverse)

gss_cat |> 
  view()

Then, write code to count the number of unique responses in the partyid variable.

You’ll need to use the separate_longer_delim() and count() functions to do this.

Learn More

This lesson from the R for Social Scientists course materials covers separate_longer_delim().

Have any questions? Put them below and we will help you out!


Olivia Noel • November 1, 2023

I'm seeing a difference in the syntax used in the main video vs the solution video and am wondering if you can please explain the difference. In the main video, you use separate_longer_delim(cols = activities,... but the solution video uses separate_longer_delim(partyid,... without the use of "cols =".

David Keyes Founder • November 1, 2023

This is a great question! The answer is that you can list arguments explicitly (that is, include their name) or by position (that is, by the order they appear in the documentation). There is a good discussion of this in this video, which comes from the Reproducibility for the Rest of Us course.

When teaching, I almost always try to list arguments with their names, but the reality is, when I write code myself, I often skip argument names because it's kind of a pain to write them out. Functionally, there is no difference between the code shown in the video and the code in the solution.
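As a quick check of that equivalence, here is a small sketch using a made-up activities_survey data frame (hypothetical data, not the one from the video). It relies on cols and delim being the second and third arguments of separate_longer_delim(), after the data.

```r
library(tidyverse)

# Hypothetical example data
activities_survey <- tibble(
  respondent_id = c(1, 2),
  activities = c("Running, Swimming", "Reading")
)

# Arguments matched by name
by_name <- activities_survey |>
  separate_longer_delim(cols = activities, delim = ", ")

# Arguments matched by position: cols first, then delim
by_position <- activities_survey |>
  separate_longer_delim(activities, ", ")

# Compare the two results
identical(by_name, by_position)
```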

Does that all make sense?

Olivia Noel • November 13, 2023

Thanks David! That makes perfect sense :D

Sara Parisi • November 18, 2024

This video you shared on function arguments was so helpful! I have always been confused about why sometimes I'd see "the argument =" in example code and sometimes not. Thanks!

Rachel Udow • April 25, 2024

When working with "select all that apply" variables in the past in datasets where one row = one individual, I've typically dealt with this by converting each response option to its own column where a 1 ("yes") is present if that response option was selected. This has worked well for my purposes, but I understand now that it violates Tidy Data Rule 1 because a single variable is spread across multiple columns. Am I understanding correctly that while the approach I've used in the past is not inherently better or worse than tidy format, the advantage to making it tidy is that it will be easier to analyze using the tidyverse? Is this still true if the unit of analysis I'm interested in is the individual and not the activity (for example)?

David Keyes Founder • April 25, 2024

Your understanding sounds right to me! In terms of analyzing data if you care about individuals rather than responses, you might consider leaving your data as is. At a certain point, what tidy data is kind of just depends on what you plan to do with it. BTW, you may find this chat I had with a former R in 3 Months participant on tidy data interesting.
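For anyone weighing the two layouts, here is a rough sketch of moving between them. The data frame, its column names, and the selected helper column are all hypothetical; the point is only that the tidy (long) form can be turned into the one-column-per-option form when the individual is the unit of analysis.

```r
library(tidyverse)

# Hypothetical "select all that apply" data, one row per individual
activities_survey <- tibble(
  respondent_id = c(1, 2, 3),
  activities = c("Running, Swimming", "Reading", "Running, Reading, Cooking")
)

# Tidy (long) form: one row per individual per selected activity
activities_long <- activities_survey |>
  separate_longer_delim(cols = activities, delim = ", ")

# Counting how many individuals chose each activity is a one-liner here
activity_counts <- activities_long |>
  count(activities)

# Wide indicator form: one row per individual, one 0/1 column per option
activities_wide <- activities_long |>
  mutate(selected = 1) |>
  pivot_wider(names_from = activities,
              values_from = selected,
              values_fill = 0)

activities_wide
```

Which form to keep depends on the question being asked: the long form suits summaries by activity, while the wide form keeps one row per person.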

Rachel Udow • April 27, 2024

David, thanks for your response, and especially for sharing this video with me -- I'm feeling a lot more clear about tidy data after watching it. Your former student's questions are exactly the ones I've been wrestling with, though he formulated and articulated them much more clearly! (Bonus -- we work with similar kinds of data, so his examples are directly relevant to my work.)

Aside from the very practical points about dealing with "select all that apply" questions, my biggest conceptual takeaways from this video were about tidying data as being both an art and a science, and the ways in which tidy data offers flexibility and promotes strong vs. "brittle" code. A big project I'm currently working on is creating the structure for an "annual dataset" for the Research & Evaluation team I'm part of at work, which will involve many types of data from many different sources. My past/default approach would have been to create one wide dataset with an individual on each row (and many NAs across columns since not all variables apply to all individuals). However, I'm now understanding that the best approach will likely be to create smaller, tidy data frames that can be further tidied or joined as needed to address the multitude of questions that will be answered using the dataset.

This is one of the most significant shifts I've ever experienced in my thinking about data, and I'm grateful for the learning and the resources you've shared to help me wrap my head around it. Thank you again!
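The "smaller tidy data frames, joined as needed" idea might look roughly like the sketch below. Every table and column name here is made up for illustration and does not come from the project described above.

```r
library(tidyverse)

# Hypothetical table 1: one row per individual
respondents <- tibble(
  respondent_id = c(1, 2, 3),
  age_group = c("18-34", "35-54", "55+")
)

# Hypothetical table 2: one row per individual per activity
activities <- tibble(
  respondent_id = c(1, 1, 2, 3, 3, 3),
  activity = c("Running", "Swimming", "Reading",
               "Running", "Reading", "Cooking")
)

# Join only when a question needs both tables,
# for example activities broken down by age group
activities_with_age <- activities |>
  left_join(respondents, by = "respondent_id")

activities_by_age <- activities_with_age |>
  count(age_group, activity)

activities_by_age
```

Keeping the tables separate avoids a wide data frame full of NA values for variables that do not apply to every individual.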

David Keyes Founder • April 29, 2024

Glad you're finding it helpful! It seems odd to make a bunch of smaller objects at first, but I get a ton of value in working this way. Here's an example of some code that shows this.

Rachel Udow • May 1, 2024

This is great, David - thank you for sharing! Aside from this approach being more manageable visually (in terms of the object width), I imagine there are also huge benefits for performance when working w/ datasets with a lot of records. Thanks again.

Tobia Dondè • March 18, 2025

Hello David and thanks for your clear explanations! Just a thought: the gss_cat$partyid variable already seems tidy to me in the sense that the values "Ind,near rep" and "Ind,near dem" are not multiple but simple abbreviations (they should have used a full stop). If I straightforwardly count the values of the variable it doesn't seem that there are multiple answers. Perhaps a useful, albeit scholastic, example is to obtain tidyr::table2 starting from tidyr::table3.
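One possible sketch of that table3-to-table2 exercise follows. It assumes the rate column in tidyr::table3 always lists cases first and population second, separated by "/", and it labels the resulting rows on that basis.

```r
library(tidyverse)

# table3 stores two values in one cell, e.g. "745/19987071"
table3_long <- table3 |>
  separate_longer_delim(cols = rate, delim = "/") |>
  # After splitting, each country-year pair has two rows:
  # cases first, then population (assumed order within the rate string)
  group_by(country, year) |>
  mutate(type = c("cases", "population"),
         count = as.numeric(rate)) |>
  ungroup() |>
  select(country, year, type, count)

table3_long

# Counting partyid directly, without any splitting, for comparison
gss_cat |>
  count(partyid)
```

The first result should have the same shape as tidyr::table2, with one row per country, year, and type.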