Hello David and all. I asked this question on Stack Overflow two weeks ago and received two answers, neither of which I felt offered a valid solution. I have a dataset of systolic blood pressure readings and the date/time each was taken; I have created a fictional example dataset for one individual. The readings were sometimes taken within minutes of each other, sometimes not. An example is below:
I would like to group together systolic readings that took place within a span of 10 minutes. This is the result I would want:

group 1: rows 1 and 2
group 2: rows 3, 4, and 5
group 3: row 6
group 4: row 7

The ultimate objective is then to average readings that were taken close to each other in time. The criterion for "close in time" is that all readings in a group are taken within 10 minutes, i.e., the last reading in any group has to be within 10 minutes of the first reading in that group. Groups can have any number of readings, as long as the group meets the 10-minutes-from-start-to-finish criterion.
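To make the criterion concrete, here is a sketch in Python (rather than R, just for illustration) using hypothetical timestamps standing in for the example data, with groups hand-assigned the way described above; the check at the end confirms every group satisfies the 10-minutes-from-start-to-finish rule:

```python
from datetime import datetime, timedelta

# Hypothetical timestamps with hand-assigned groups:
# rows 1-2 -> group 1, rows 3-5 -> group 2, row 6 -> group 3, row 7 -> group 4.
rows = [
    (datetime(2024, 1, 1, 8, 0),  1),
    (datetime(2024, 1, 1, 8, 7),  1),
    (datetime(2024, 1, 1, 8, 12), 2),
    (datetime(2024, 1, 1, 8, 15), 2),
    (datetime(2024, 1, 1, 8, 20), 2),
    (datetime(2024, 1, 1, 9, 0),  3),
    (datetime(2024, 1, 1, 9, 30), 4),
]

# Within each group, the last reading must fall within 10 minutes
# of the first reading in that group.
window = timedelta(minutes=10)
for g in {grp for _, grp in rows}:
    ts = [t for t, grp in rows if grp == g]
    assert max(ts) - min(ts) <= window
```

Note that row 5 (8:20) is 8 minutes after its group's first reading (8:12), even though it is 13 minutes after row 3's predecessor-free neighbor would suggest if you only compared adjacent rows; the window is anchored to the group's first reading, not the previous row.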
I've tried to solve this with the lag function, but the result is awkward and possibly inaccurate. I suspect this might be addressed by detecting time "clusters," but I'm not familiar with those methods. I apologize for asking something that has already been asked on Stack Overflow, but since I haven't gotten any helpful answers there, I'm not sure where to go next. (If you provide a solution, I would answer the question there with attribution. I suspect this is not an unusual problem, since blood pressure readings taken close together in time are often averaged.)
That’s a good question and not one I know the answer to! I’m going to post on Twitter to see if I get anyone with a good idea for you. I think one solution might be the tsibble package but I haven’t used it so I’m not 100% sure.
Actually, I’ve rethought this. I think the question is a bit too complicated for Twitter, but it would be perfect for the RStudio Community. Would you consider posting there and seeing if you get any good responses?
I’m thinking of having a column with the start of the interval, a column with the stop of the interval, and then an indicator column for whether or not each time is in the interval; you’d then group the times accordingly, and tidy that up to do your averaged readings.
Hello Jordan, thanks for your suggestion. I’ve gone down that road and, as far as I can see, while interval or something like it would help, I don’t think it solves the main difficulty.
Assuming that the dataset is sorted ascending by dtetime, the program needs to: (1) establish the dtetime in row 1 as the start of the first group; (2) advance to the next row; (3) evaluate whether the dtetime in that row is more than 10 minutes after the group's start dtetime from step 1. If no, advance to the next row and repeat step 3, remembering that the starting dtetime may be more than one row back. If yes, establish that row's dtetime as the start of a new group and repeat steps 2–3.
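For what it's worth, the steps above translate directly into a short loop. Here is a sketch in Python for clarity (an R version would be an analogous for loop, or purrr::accumulate), using made-up timestamps since the example data isn't reproduced here:

```python
from datetime import datetime, timedelta

def assign_groups(times, window=timedelta(minutes=10)):
    """Follow the steps above: a reading joins the current group unless it
    falls more than `window` after that group's *first* reading, in which
    case it starts a new group. Assumes `times` is sorted ascending."""
    groups = []
    group = 0
    start = None  # dtetime that started the current group
    for t in times:
        if start is None or t - start > window:
            group += 1   # establish this row as the start of a new group
            start = t
        groups.append(group)
    return groups

# Hypothetical timestamps (not the original example data):
times = [
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 8, 7),
    datetime(2024, 1, 1, 8, 12),
    datetime(2024, 1, 1, 8, 15),
    datetime(2024, 1, 1, 8, 20),
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 30),
]
print(assign_groups(times))  # [1, 1, 2, 2, 2, 3, 4]
```

Once the group column exists, averaging within groups is an ordinary grouped summary.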
Interesting package, but it has limitations I don’t know how to get around. slide() does slide 10-minute windows along the data, but the windows overlap because the window is sliding. (I didn’t see a way to prevent overlap or to filter out the windows I don’t want.) From what I can see, two other functions, slide_period() and block(), looked promising for non-overlapping periods, but both depend on a fixed origin date/time rather than a dynamically assigned start date/time for the group being formed. So the 10-minute increments are all calculated from the origin date/time, which inappropriately splits one of the groups. I’m going to work on a loop.
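The fixed-origin problem described above can be shown with a tiny sketch in Python (illustrative only; the timestamps are made up). Three readings that all fall within 10 minutes of the first one get split when the bins are measured from a fixed origin (here, minute boundaries of the hour) instead of from the group's own start:

```python
from datetime import datetime

times = [
    datetime(2024, 1, 1, 8, 12),
    datetime(2024, 1, 1, 8, 15),
    datetime(2024, 1, 1, 8, 20),
]

# Fixed-origin 10-minute bins: minutes 10-19 land in bin 1, 20-29 in bin 2.
# (Binning by t.minute // 10 only makes sense within one hour; this is
# just to illustrate the splitting behavior.)
fixed_bins = [t.minute // 10 for t in times]
print(fixed_bins)  # [1, 1, 2] -> 8:20 is cut off into its own bin

# Dynamic-start rule: 8:15 and 8:20 are both within 10 minutes of 8:12,
# so all three readings should be one group.
```

A loop with a group-start variable (rather than a fixed origin) avoids this, which is why the windowing functions don't quite fit here.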