Action trigger data grows quickly, would benefit from cleanup
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Evergreen | Fix Released | Wishlist | Unassigned | 3.0-alpha
Bug Description
Evergreen 2.12 / All versions
Email notices, print notices, a variety of print documents, and many more are built with Action/Trigger. Their data all ends up in the action_trigger.event and action_trigger.event_output tables.
Another concern, of course, is patron privacy: we are retaining data that links patrons to circulations and holds (via notices) that need not be retained.
Arguably, much of this data does not need to persist very long. For example, each time a patron clicks the Print option on the record detail page of the catalog, the print output is stored in the database forever. In this example, the data is only needed for a few seconds. In other cases, like overdue notices, we may want the data to persist for weeks or months for debugging purposes, but likely not for years.
I propose a new retention_interval column on action_trigger.event_definition, coupled with a periodic purge process that deletes events (and their output) once they are older than the configured interval.
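As a rough sketch of the idea (not a committed schema change), the new column and a naive purge might look like the following; the event_def and complete_time columns on action_trigger.event are assumed here, and the comments below discuss which timestamp is most appropriate and how events that share output should be handled.

```sql
-- Sketch only: a per-definition retention interval plus a naive purge.
-- A NULL retention_interval means "keep this definition's events forever".
ALTER TABLE action_trigger.event_definition
    ADD COLUMN retention_interval INTERVAL;

-- Naive purge: delete events older than their definition's interval.
-- complete_time is used purely for illustration; see the later comments
-- about update_time and about grouped events sharing one output row.
DELETE FROM action_trigger.event evt
USING action_trigger.event_definition def
WHERE def.id = evt.event_def
  AND def.retention_interval IS NOT NULL
  AND evt.complete_time < NOW() - def.retention_interval;
```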
One concern with removing event data is that it's used to determine when something has already happened. For example, a patron can't receive a second 7-day overdue email notice for a given circulation, because such an event already exists in the database. This will have to be taken into account when considering reasonable default retention intervals.
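For context, that duplicate check boils down to looking for an existing event row for the same definition and target before generating a new one. Conceptually (an illustrative query with made-up IDs, not the actual A/T collector code):

```sql
-- Has a 7-day overdue email event already been created for this circulation?
-- 42 and 123456 are hypothetical event-definition and circulation IDs.
SELECT 1
  FROM action_trigger.event
 WHERE event_def = 42
   AND target = 123456
 LIMIT 1;
```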
Comments, suggestions appreciated.
Changed in evergreen:
  milestone: 3.next → 3.0-alpha

Changed in evergreen:
  assignee: nobody → Galen Charlton (gmc)

Changed in evergreen:
  importance: Undecided → Wishlist
  status: New → Confirmed
  assignee: Galen Charlton (gmc) → nobody

Changed in evergreen:
  status: Fix Committed → Fix Released
Some more..
Taking bug #1672824 into account, plus the fact that non-completed (invalid, error) events, which do not get a complete_time, should also be purged, a likely candidate for the timestamp used when testing the retention interval would be the event's update_time. This should be the last time an event was modified, regardless of its outcome. Or start_time -- I think the difference would be negligible in practice.
When (grouped) events link to the same output, none of those events or their shared output should be deleted until all of them are eligible for deletion. Maybe just check the max(update_time) across the grouped events.
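A purge along those lines might group on the shared output and compare the group's newest update_time against the interval. A sketch, assuming the template_output linkage on action_trigger.event:

```sql
-- Sketch: only purge a group of events sharing one output row once the
-- newest event in the group has aged past the retention interval.
WITH purgeable AS (
    SELECT evt.template_output
      FROM action_trigger.event evt
      JOIN action_trigger.event_definition def ON (def.id = evt.event_def)
     WHERE def.retention_interval IS NOT NULL
       AND evt.template_output IS NOT NULL
     GROUP BY evt.template_output, def.retention_interval
    HAVING MAX(evt.update_time) < NOW() - def.retention_interval
)
DELETE FROM action_trigger.event
 WHERE template_output IN (SELECT template_output FROM purgeable);
-- Any now-unreferenced rows in action_trigger.event_output would be
-- removed in a follow-up step.
```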
Regarding the final paragraph of the description, I believe that as long as every event definition that gets a non-NULL retention_interval also has a max_delay value, and the retention_interval exceeds the max_delay, there would be no chance of duplicates.
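That rule could even be enforced as a constraint; a possible sketch (assuming retention_interval and max_delay are both INTERVAL columns on action_trigger.event_definition, with a made-up constraint name):

```sql
-- Sketch: refuse retention intervals short enough that an event could be
-- purged while a duplicate could still legitimately be generated.
ALTER TABLE action_trigger.event_definition
    ADD CONSTRAINT retention_interval_sanity CHECK (
        retention_interval IS NULL
        OR (max_delay IS NOT NULL AND retention_interval > max_delay)
    );
```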
Also, to be clear, retention_interval can be null. We may want to keep some data indefinitely.