R for the Rest of Us Community
Public / Community
Public / Community
This is a place to ask questions and get help along the way on your R journey.
In addition to discussions of general questions, you’ll see threads for office hours. These are twice-monthly sessions to help you get unstuck. Ask questions and get live answers from me as well as guest experts. All code for office hours can be found here.
Not yet a member? If you have an account on the R for the Rest of Us website and are logged in, just click Join Group on the top right of this page. If you need to create an account, you can sign up here.
Sum within group
-
I am trying to group my data by two variables and then get the sum within that grouping. However, when I run my code, I get the sum across all rows, not within groups.
Here is an example with the iris dataset. I would like an output table with sum_petal_length to be the sum within the petal_width and species group, but I’m getting sum_petal_length = 563.7 for all groups.
as_tibble(iris) %>%
clean_names() %>%
group_by(species, petal_width) %>%
mutate(sum_petal_length = sum(petal_length)) %>%
distinct(species, petal_width, sum_petal_length)
-
I don’t think I quite understand the question. Would you mind rephrasing or perhaps even using a tool like Loom to record a short video explaining it further? Thanks!
-
Hi David,
I can try to record a video if this is still unclear. For instance, for grouping of petal_width = 0.1 and species = setosa, I would like sum_petal_length = 1.5+1.4+1.1 = 4, but it’s showing up as 563.7. Similarly for petal_width = 0.3 and species = setosa, sum_petal_length = 1.4 + 1.7 + 1.5 + 1.3 = 5.8, but it’s also showing up as 563.7.
as_tibble(iris) %>%
clean_names() %>%
mutate(species = as.factor(species)) %>%
mutate(petal_width = as.factor(petal_width)) %>%
group_by(species, petal_width) %>%
mutate(sum_petal_length = sum(petal_length)) %>%
distinct(species, petal_width, sum_petal_length, petal_length) %>%
arrange(petal_width)
This approach works (sum_petal_length = 4) if I move my distinct() function up and filter the data to one group:
as_tibble(iris) %>%
clean_names() %>%
filter(petal_width == 0.1) %>%
filter(species == "setosa") %>%
distinct(species, petal_width, petal_length) %>%
mutate(species = as.factor(species)) %>%
mutate(petal_width = as.factor(petal_width)) %>%
group_by(species, petal_width) %>%
mutate(sum_petal_length = sum(petal_length)) %>%
arrange(petal_width)
However, it doesn’t work if I remove the filter rows (sum_petal_length = 438.1).
as_tibble(iris) %>%
clean_names() %>%
distinct(species, petal_width, petal_length) %>%
mutate(species = as.factor(species)) %>%
mutate(petal_width = as.factor(petal_width)) %>%
group_by(species, petal_width) %>%
mutate(sum_petal_length = sum(petal_length)) %>%
arrange(petal_width)
I’ve also tried using the aggregate() function instead of sum() but I kept getting errors.
iris_data <- as_tibble(iris) %>%
clean_names()
iris_data %>%
filter(petal_width == 0.1) %>%
filter(species == "setosa") %>%
distinct(species, petal_width, petal_length) %>%
mutate(sum_petal_length = aggregate(iris_data$petal_length, by = list(iris_data$petal_width, iris_data$species), FUN = sum)) %>%
arrange(petal_width)
Error: Assigned data
eval(cols[[col]], .data, parent.frame())
must be compatible with existing data. x Existing data has 3 rows. x Assigned data has 27 rows. i Only vectors of size 1 are recycled. -
Hi David,
I made a video to explain how my output is different from yours. I think something’s wrong with my packages that I’m loading or something to give me different behavior from your output. Let me know if you have trouble accessing the video.
-
Hi David, I wanted to let you know that I got a little closer to solving this. I’m now certain it’s a problem with one of my packages. If I just load tidyverse and janitor, I get similar behavior to what you got in your video. Have you seen this before? I found this as a possible explanation. I can’t remember why I was using plyr in the first place:
https://stackoverflow.com/questions/26923862/why-are-my-dplyr-group-by-summarize-not-working-properly-name-collision-with -
Oh, that totally makes sense! Yes, try using dplyr::group_by() to make sure you are using dplyr and not plyr. This is, unfortunately, a semi-common thing (see, for example, responses to this tweet). Moving forward, I really don’t think you need to use the plyr package at all if you’re using dplyr so I would just remove it. Let me know if that works!
-
Thanks. I see the pain and frustration in that Twitter feed. This little detail cost me half of last Friday and several hours this week. Now that I know this is a problem, I will be wiser!
-
Oh no, that’s terrible! If you have other packages with identically-named functions, you might also look at the conflicted package.
Log in to reply.