
archive events after each day's aggregates are completed
Closed, ResolvedPublic5 Estimated Story Points

Description

Currently we have the ability to manually archive link events by month and year. The link events table currently holds 6,580,783 rows, covering events for 205 days (we have archived all events from before 2024). That's approximately 32,000 events per day on average. Performance should improve significantly if we can reduce the table size this drastically.
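As a quick sanity check of the per-day average quoted above:

```python
# Rows divided by days covered gives the average daily event volume.
print(round(6_580_783 / 205))  # 32101, i.e. roughly 32,000 events/day
```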

Acceptance criteria

  • identify and remediate any code that relies on linkevents older than 1 day
  • update the link event archive management command to name files uniquely per day
  • add a cron task for archiving events daily
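A minimal sketch of the two pieces of logic the criteria imply: picking which day to archive and building a unique per-day filename. The `linkevents_` prefix and `.json.gz` extension are assumptions for illustration, not the actual command's naming scheme:

```python
import datetime


def day_to_archive(today: datetime.date) -> datetime.date:
    # Archive the previous full day; the daily aggregate jobs may still
    # need today's events.
    return today - datetime.timedelta(days=1)


def archive_filename(day: datetime.date) -> str:
    # One uniquely named file per day (the existing command names files
    # per month and year, so daily runs would overwrite each other).
    return f"linkevents_{day.isoformat()}.json.gz"


print(archive_filename(day_to_archive(datetime.date(2025, 2, 20))))
# linkevents_2025-02-19.json.gz
```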

Event Timeline

If I'm reading https://phabricator.wikimedia.org/T370977 correctly we'll be modifying the aggregate jobs to run once a month rather than daily as they do now. With that change it sounds like we would need to do this automated archival once a month instead of daily so we can ensure the aggregate jobs have all of the data they need.

The aggregate jobs run daily on information from the linkevents table. The task in T370977: Create cron to aggregate old data would run monthly on information from the *_aggregates tables. Therefore, the relevant linkevent information is already stored in the aggregates tables and can be disposed of. The only caveat is that some aggregate commands (mainly the pageprojects command) have been known to run for more than 24 hours, so you would have to account for that before archiving information from the linkevent table. Let me know if you have any follow-up questions about the dataflow (it can be a bit of a headache).

Ah okay, I misread that ticket then, thank you for the clarification! As for dealing with the long-running aggregate commands, would it make sense to check the cron job log for when the most recent successful aggregate jobs completed, and not archive anything that comes after the most recent end_time?

Yes, that would be a great idea!

Scardenasmolinar changed the task status from Open to In Progress. Feb 19 2025, 1:17 AM

The cron class needs to be added to settings; right now the job isn't running.
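For context, django-cron only runs job classes that are registered in the Django settings. A sketch of the registration (the dotted class path here is hypothetical, not the project's actual module layout):

```python
# settings.py (sketch)
INSTALLED_APPS = [
    # ... existing apps ...
    "django_cron",
]

# django-cron only executes jobs listed in CRON_CLASSES; a job class
# that exists in the codebase but is missing from this list never runs.
CRON_CLASSES = [
    "extlinks.links.cron.LinkEventArchiveCron",  # hypothetical path
]
```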

Note that we should query the state of the aggregates data rather than checking the cron job logs; we have discovered that django-cron is somewhat unreliable and shouldn't be treated as a source of truth.
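Under the assumption that the aggregates tables carry a per-day date column, the archive cutoff can be derived from the data itself rather than the cron logs. A sketch (the model name `LinkAggregate` and field `full_date` mentioned in the docstring are assumptions, not confirmed identifiers):

```python
import datetime
from typing import Optional


def safe_to_archive(event_day: datetime.date,
                    latest_aggregated: Optional[datetime.date]) -> bool:
    """Return True only if events from event_day are already reflected
    in the aggregates tables.

    In the real command, latest_aggregated would be read from the
    database, e.g. (hypothetical model and field names):
        LinkAggregate.objects.aggregate(Max("full_date"))["full_date__max"]
    """
    if latest_aggregated is None:
        return False  # no aggregate rows yet; archive nothing
    return event_day <= latest_aggregated


print(safe_to_archive(datetime.date(2025, 2, 18), datetime.date(2025, 2, 19)))
print(safe_to_archive(datetime.date(2025, 2, 20), datetime.date(2025, 2, 19)))
```

This also naturally handles the long-running pageprojects command: days it hasn't finished aggregating simply aren't eligible for archiving yet.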

This is being addressed in this PR: https://github.com/WikipediaLibrary/externallinks/pull/428/files#diff-3ea96e0e2a46bf5e184c22e77820491fe3bdfc10fc2f30b0bfd010da779354ee

Scardenasmolinar moved this task from QA to Done on the Moderator-Tools-Team (Kanban) board.

Verified that the link event archiving ran successfully.