Sum within group

  • Sum within group

     David updated 3 hours, 4 minutes ago 2 Members · 9 Posts
  • jordan.trachtenberg

    Member
    February 22, 2021 at 10:28 am

    I am trying to group my data by two variables and then get the sum within that grouping. However, when I run my code, I get the sum across all rows, not within groups.

    Here is an example with the iris dataset. I would like an output table with sum_petal_length to be the sum within the petal_width and species group, but I’m getting sum_petal_length = 563.7 for all groups.

    as_tibble(iris) %>%

    clean_names() %>%

    group_by(species, petal_width) %>%

    mutate(sum_petal_length = sum(petal_length)) %>%

    distinct(species, petal_width, sum_petal_length)

  • David

    Organizer
    February 23, 2021 at 7:15 am

    I don’t think I quite understand the question. Would you mind rephrasing or perhaps even using a tool like Loom to record a short video explaining it further? Thanks!

  • jordan.trachtenberg

    Member
    February 23, 2021 at 8:36 am

    Hi David,

    I can try to record a video if this is still unclear. For instance, for grouping of petal_width = 0.1 and species = setosa, I would like sum_petal_length = 1.5+1.4+1.1 = 4, but it’s showing up as 563.7. Similarly for petal_width = 0.3 and species = setosa, sum_petal_length = 1.4 + 1.7 + 1.5 + 1.3 = 5.8, but it’s also showing up as 563.7.

    as_tibble(iris) %>%

    clean_names() %>%

    mutate(species = as.factor(species)) %>%

    mutate(petal_width = as.factor(petal_width)) %>%

    group_by(species, petal_width) %>%

    mutate(sum_petal_length = sum(petal_length)) %>%

    distinct(species, petal_width, sum_petal_length, petal_length) %>%

    arrange(petal_width)

    This approach works (sum_petal_length = 4) if I move my distinct() function up and filter the data to one group:

    as_tibble(iris) %>%

    clean_names() %>%

    filter(petal_width == 0.1) %>%

    filter(species == "setosa") %>%

    distinct(species, petal_width, petal_length) %>%

    mutate(species = as.factor(species)) %>%

    mutate(petal_width = as.factor(petal_width)) %>%

    group_by(species, petal_width) %>%

    mutate(sum_petal_length = sum(petal_length)) %>%

    arrange(petal_width)

    However, it doesn’t work if I remove the filter rows (sum_petal_length = 438.1).

    as_tibble(iris) %>%

    clean_names() %>%

    distinct(species, petal_width, petal_length) %>%

    mutate(species = as.factor(species)) %>%

    mutate(petal_width = as.factor(petal_width)) %>%

    group_by(species, petal_width) %>%

    mutate(sum_petal_length = sum(petal_length)) %>%

    arrange(petal_width)

    I’ve also tried using the aggregate() function instead of sum() but I kept getting errors.

    iris_data <- as_tibble(iris) %>%

    clean_names()


    iris_data %>%

    filter(petal_width == 0.1) %>%

    filter(species == "setosa") %>%

    distinct(species, petal_width, petal_length) %>%

    mutate(sum_petal_length = aggregate(iris_data$petal_length, by = list(iris_data$petal_width, iris_data$species), FUN = sum)) %>%

    arrange(petal_width)

    Error: Assigned data eval(cols[[col]], .data, parent.frame()) must be compatible with existing data. x Existing data has 3 rows. x Assigned data has 27 rows. i Only vectors of size 1 are recycled.

  • David

    Organizer
    February 24, 2021 at 2:20 pm

    Ok, did my best to get you a response. Let me know if this helps!

  • jordan.trachtenberg

    Member
    February 24, 2021 at 3:04 pm

    Hi David,

    I made a video to explain how my output is different from yours. I think something’s wrong with my packages that I’m loading or something to give me different behavior from your output. Let me know if you have trouble accessing the video.

  • jordan.trachtenberg

    Member
    February 26, 2021 at 6:05 am

    Hi David, I wanted to let you know that I got a little closer to solving this. I’m now certain it’s a problem with one of my packages. If I just load tidyverse and janitor, I get similar behavior to what you got in your video. Have you seen this before? I found this as a possible explanation. I can’t remember why I was using plyr in the first place:
    https://stackoverflow.com/questions/26923862/why-are-my-dplyr-group-by-summarize-not-working-properly-name-collision-with

  • David

    Organizer
    February 26, 2021 at 12:57 pm

    Oh, that totally makes sense! Yes, try using dplyr::group_by() to make sure you are using dplyr and not plyr. This is, unfortunately, a semi-common thing (see, for example, responses to this tweet). Moving forward, I really don’t think you need to use the plyr package at all if you’re using dplyr so I would just remove it. Let me know if that works!

  • jordan.trachtenberg

    Member
    February 26, 2021 at 1:15 pm

    Thanks. I see the pain and frustration in that Twitter feed. This little detail cost me half of last Friday and several hours this week. Now that I know this is a problem, I will be wiser!

  • David

    Organizer
    February 26, 2021 at 1:21 pm

    Oh no, that’s terrible! If you have other packages with identically-named functions, you might also look at the conflicted package.

Viewing 1 - 9 of 9 posts

Log in to reply.

Original Post
0 of 0 posts June 2018
Now
Just announced: R in 3 Months, a program to help you FINALLY learn R. Starts March 18.