Create cron to aggregate old data
Closed, Resolved · Public · 5 Estimated Story Points

Description

Create a cron job to aggregate all data from link_aggregates, user_aggregates, and pageproject_aggregates. We don't need granular per-day data for all the collections, and some tables (especially pageproject_aggregates) are getting too big.

Acceptance criteria

  • Create a cron job that sums links added and links removed per month instead of per day. The day field should be zero.
  • Delete the daily aggregates from link_aggregates, user_aggregates, and pageproject_aggregates
  • Filters should continue to work on aggregated data (maybe we should add a sign that indicates this is not daily data whenever someone filters on the aggregated data).
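
A minimal sketch of the roll-up these criteria describe, in plain Python rather than against the real models. `AggregateRow`, its field names, and `roll_up_month` are hypothetical stand-ins; the actual columns in link_aggregates and the other tables may differ.

```python
from collections import defaultdict
from typing import NamedTuple


class AggregateRow(NamedTuple):
    # Hypothetical stand-in for one aggregate row; real column names may differ.
    collection_id: int
    year: int
    month: int
    day: int  # 0 marks a monthly roll-up row, as in the acceptance criteria
    links_added: int
    links_removed: int


def roll_up_month(rows, year, month):
    """Replace every row for (year, month) with one day-0 row per collection."""
    totals = defaultdict(lambda: [0, 0])
    kept = []
    for row in rows:
        if row.year == year and row.month == month:
            # Daily rows and any existing day-0 row fold into the same total,
            # so running the roll-up twice gives the same result.
            totals[row.collection_id][0] += row.links_added
            totals[row.collection_id][1] += row.links_removed
        else:
            kept.append(row)
    monthly = [
        AggregateRow(cid, year, month, 0, added, removed)
        for cid, (added, removed) in sorted(totals.items())
    ]
    return kept + monthly
```

Because the daily rows are dropped in the same step that writes the day-0 row, a filter on year and month still sees the same totals; only a filter on a specific day would come back empty, which is where the "not daily data" notice would apply.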

Details

Other Assignee
jsn.sherman

Event Timeline

Regarding the second item:

  • Delete the daily aggregates from link_aggregates, user_aggregates, and pageproject_aggregates

Can we confirm if we have all data in these tables since inception? Or do we have old records stored in some sort of dump file? Just thinking if we should worry about aggregating old records stored somewhere else.

We have all of the data in those tables. We perform backups of the database every other day. We don't have the need to aggregate old records elsewhere.

Scardenasmolinar changed the task status from Open to In Progress. Feb 20 2025, 1:35 AM

Assigning myself as other assignee to write a little script for clearing historical aggregates.

Okay, after a mod tools engineering meeting, we decided to adjust the commands to start with the oldest aggregates when no year-month is specified. In this PR we're now running the fill monthly crons every 5 minutes so we can chew through the backlog as quickly as possible. Command run logs are also visible in Django admin:
https://github.com/WikipediaLibrary/externallinks/pull/419
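
For illustration only, the "oldest first when no year-month is given" rule might look something like the sketch below. `oldest_unaggregated_month` is a hypothetical helper, not the code in the PR, and skipping the current (still incomplete) month is an assumption on my part.

```python
from datetime import date


def oldest_unaggregated_month(day_keys, today):
    """Pick the oldest (year, month) that still has daily rows.

    day_keys is an iterable of (year, month, day) tuples; day == 0 means
    the row is already a monthly aggregate. Assumption: the current month
    is skipped because it is still receiving daily data.
    """
    current = (today.year, today.month)
    candidates = {
        (year, month)
        for (year, month, day) in day_keys
        if day != 0 and (year, month) < current
    }
    return min(candidates) if candidates else None
```

Each scheduled run would process the month this returns and then exit, so a 5-minute schedule works through the backlog one month at a time and becomes a no-op once it returns `None`.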

Noting that we've been having reliability issues with django cron job logs not saving to the database. We've been able to run the underlying management commands reliably in the crons service on the production host, so it doesn't seem to be an issue with the code itself. We'll take some more troubleshooting steps to see if we can improve the reliability of django cron.

jsn.sherman added a subscriber: Mimurawil.

Okay, we tried adjusting the django cron schedules to work around this issue, but it persists; we should move these to not-django-cron cron tasks for improved reliability. We've done this previously with our (now archived) matomo setup https://github.com/WikipediaLibrary/twlight_matomo/tree/master

https://github.com/WikipediaLibrary/externallinks/pull/428

I've let this rot on the vine, so thank you @Scardenasmolinar for picking it up!