
Export aggregates to static files
Closed, Resolved · Public · 5 Estimated Story Points

Description

The aggregates tables have not grown as big as the external links table, but a couple already have millions of rows. To improve the performance of Wikilink, we should export the aggregate tables to static files. This way, the table sizes remain manageable.

Acceptance Criteria

Event Timeline

Scardenasmolinar set the point value for this task to 5.
Scardenasmolinar moved this task from To be estimated to Estimated on the Moderator-Tools-Team board.
Amdrel subscribed.

When we archive aggregate data and remove it from the tables, it won't be viewable from the programs and organisations pages anymore. How far back are we planning to keep data? We could add a required date option to the aggregate archival commands so we have control over that regardless of what retention period we decide on.

I was thinking maybe the last 3 years? @Samwalton9-WMF, your opinion on this would be appreciated.

Could we not provide the information via a static page, treating the archive json urls like API responses for xhr?

Can you clarify the user-facing implication of this? I'm not sure I completely understand. Would this mean users could only view data up to 3 years back? Or the aggregates would only be based on the last 3 years?

> Could we not provide the information via a static page, treating the archive json urls like API responses for xhr?

If we group the information by organisation and month, then yes, though the size of those archives would be skewed depending on which organisation is being viewed, at least for page project aggregates.

For example, if I run the following query, I see that one organisation (126) has millions of page aggregate records while others only have hundreds of thousands or fewer:

SELECT ap.organisation_id, COUNT(*) AS organisation_count
FROM aggregates_pageprojectaggregate ap
GROUP BY ap.organisation_id
ORDER BY organisation_count DESC;

Against my local dump I can see that organisation 126 has 8,970,681 records in the table, while the rest of the organisations combined have 6,995,978 records. That means that if 126 were dumped into archives, all of them together would be at least 1.43 GB uncompressed (the table has a data length of 2.4 GB). Those archives, split into months, would average 21.42 MB per archive. That data would also have to be processed in order to render the totals.

Since historical data for previous months never changes, we could instead store organisation monthly totals for all aggregate types separately and add them together for whatever range is requested, rather than pulling from the aggregate archives (which have totals split by day) directly on each request. Storing additional monthly aggregates just for use by pages like /organisations/<pk> would leave far fewer records to scan through, and our date filter only considers full months anyway.
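The summing-precomputed-monthly-totals idea could be sketched roughly like this. All names and values here are made up for illustration; in the real codebase these totals would be rows written by a monthly aggregation job, not an in-memory dict:

```python
from datetime import date

# Hypothetical precomputed monthly totals, keyed by (organisation_id, year, month).
monthly_totals = {
    (126, 2019, 6): 1500,
    (126, 2019, 7): 1800,
    (126, 2019, 8): 1200,
}

def total_for_range(org_id, start, end):
    """Sum precomputed monthly totals over [start, end], inclusive of both months."""
    total = 0
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        total += monthly_totals.get((org_id, y, m), 0)
        # Advance one month.
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return total

print(total_for_range(126, date(2019, 6, 1), date(2019, 8, 31)))  # 4500
```

A request for any month range then touches one row per month instead of thirty-odd daily rows per month.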

Edit: It looks like we recently made a change along these lines: https://phabricator.wikimedia.org/T370977. Does that change alone not solve the performance issues that necessitate archiving the aggregate data?

The CSV downloads for pages would have to pull from the aggregate archives, since link stats per page are included; the website view, by contrast, is grouped by project name.

Never mind, I misunderstood this. There are both page and project totals, but if they're grouped by month, I think that would be fine.

@Amdrel,

It sounds like you had some concerns about this approach, but I couldn't find them written down anywhere. Could you share them here?

I'm worried about how this is going to work with the date range filters on pages like /organisations/<pk> and /programs/<pk> if we pull archive .json files over XHR client-side to be rendered. We need to be able to support any range of months for the chart and totals, so the archives need to be split up to enable that. If archives of aggregate data are split by month and data from the last 6 years is requested for an organisation's collection, that could mean over 60 archives to download just to generate the graph and calculate the totals below it. I don't yet know exactly how large the final archives will be, since I'm still running the monthly aggregate jobs against my local data.

For example, if I have to pull down a page project aggregate archive for July 2019, that month alone returns an archive with 44,995 records even after they've been squashed together by the job. If we instead pull archives from object storage server-side as needed and cache/process the totals there, then maybe it will be more manageable.

Also, "The Wikipedia Library" program (id 1) has most organisations under it, so we'd also need to figure out how to render data there too.
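To illustrate the fan-out, here is a rough sketch of how many monthly archive fetches a multi-year range implies. The key format is invented for illustration; it is not the actual naming scheme:

```python
def monthly_archive_keys(org_id, collection_id, start_year, start_month, n_months):
    """Generate hypothetical object-storage keys, one per month of aggregates."""
    keys = []
    y, m = start_year, start_month
    for _ in range(n_months):
        keys.append(f"aggregates/{org_id}/{collection_id}/{y:04d}-{m:02d}.json.gz")
        # Advance one month.
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return keys

keys = monthly_archive_keys(42, 7, 2019, 1, 72)  # a 6-year range
print(len(keys))  # 72
```

A 6-year filter means 72 separate downloads per collection if nothing is pre-aggregated or cached.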


Makes sense! Saving totals (aggregates of aggregates of aggregates, I guess) would be a completely reasonable approach. Basically, we're not so worried about page load performance, but more about the scalability and performance of things running on a VM that we have to manage. We should keep an eye on storage usage though; as you suggest, it could get big!

We could check with @Samwalton9-WMF on how much leeway we have with the date filtering feature. Eg. for data from before some rolling cutoff, could we just provide the monthly (and maybe annual) statistics and not provide the filtering feature at all? Just links to charts for each month for orgs & programs? My assumption is that there is a level of historical data that we still want to be queryable for research, but isn't actually used as often by partners.

My monthly aggregate jobs are complete and I've got some stats on what these archives will look like. To work with the existing filters, I have the archives split by organisation_id, collection_id, full_date, and on_user_list (if false, include all aggregates).
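A minimal sketch of that split, with made-up rows and a dict standing in for the archive files (the field names mirror the split described above; everything else is hypothetical):

```python
from collections import defaultdict

# Hypothetical daily aggregate rows:
# (organisation_id, collection_id, full_date, on_user_list, total)
rows = [
    (126, 3, "2019-07-01", False, 10),
    (126, 3, "2019-07-02", False, 7),
    (126, 3, "2019-07-01", True, 4),
    (9, 1, "2019-07-01", False, 2),
]

# Bucket rows into one archive per (organisation, collection, month, on_user_list).
archives = defaultdict(list)
for org, coll, full_date, on_list, total in rows:
    month = full_date[:7]  # "YYYY-MM"
    archives[(org, coll, month, on_list)].append((full_date, total))

print(len(archives))  # 3 distinct archives for these sample rows
```

Each bucket would then be serialized and uploaded as one compressed archive file.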

In total, 10,693 archives were generated. All of them combined are 400.8 MB compressed and 6.59 GB uncompressed. Of the archives, 10,668 are < 1 MB, 25 are > 1 MB, 7 are > 10 MB, and 3 are > 20 MB (all compressed sizes). Each archive represents a single month of page project aggregates for a single collection under an organisation.

I have attached a text file containing archive file size stats per organisation for all collections under them to get a better idea of how aggregate data is distributed between different organisations.

> We could check with @Samwalton9-WMF on how much leeway we have with the date filtering feature. Eg. for data from before some rolling cutoff, could we just provide the monthly (and maybe annual) statistics and not provide the filtering feature at all? Just links to charts for each month for orgs & programs? My assumption is that there is a level of historical data that we still want to be queryable for research, but isn't actually used as often by partners.

The primary use case for the date filters is to ask fairly basic questions like "how many citations were added in the last year?" I think monthly statistics would be totally fine, because a user could simply copy those into a spreadsheet and sum up whatever they need.

I implemented the calculation of monthly totals in my patch, which solves the issue of having to pull most of the archives from object storage to calculate program-level totals. It doesn't result in much additional database bloat (~200 MB, mostly coming from user totals) and it performs well. Filters still work as well.

I don't think the totals approach makes as much sense for individual organisations, since the recently added monthly aggregate jobs effectively compress things down just as much as calculating and storing additional organisation-level totals would. Downloading aggregate archives from object storage on demand for individual organisations should work fine for most of them, though as I pointed out earlier in the thread, there are a few outliers that are 10-20 MB in size. If we download and cache these archives server-side, that should mitigate any issues. With that done, we should be able to archive old aggregate data while still keeping access to historical data in a reasonably performant way for both program-level and organisation-level statistics.
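The download-and-cache idea could look roughly like this. The in-memory store, key format, and fetch function are stand-ins for illustration (a real deployment might call something like boto3's get_object there); this is not the implementation in the patch:

```python
import gzip
import json
from functools import lru_cache

# Stand-in for an object store: key -> gzipped JSON bytes.
FAKE_STORE = {
    "aggregates/126/3/2019-07.json.gz": gzip.compress(
        json.dumps([{"full_date": "2019-07-01", "total": 10}]).encode()
    ),
}

def fetch_archive_bytes(key):
    """Pretend network fetch; a real version would hit object storage."""
    return FAKE_STORE[key]

@lru_cache(maxsize=256)
def load_archive(key):
    """Download, decompress, and parse an archive once; later calls hit the cache."""
    return json.loads(gzip.decompress(fetch_archive_bytes(key)))

records = load_archive("aggregates/126/3/2019-07.json.gz")
print(records[0]["total"])  # 10
```

One caveat with caching parsed JSON like this: lru_cache hands back the same mutable list on every hit, so callers must treat it as read-only (or the cache should store an immutable form).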

Here's the work in progress patch for reference: https://github.com/WikipediaLibrary/externallinks/pull/422. Right now that patch includes archival scripts for aggregate data and jobs that calculate program-level totals. I still need to add code that grabs aggregate archives from object storage on-demand for organisation pages.

Kgraessle changed the task status from Open to Stalled.Jun 17 2025, 4:16 PM
Kgraessle subscribed.

Stalling this until T396681: Create cron tasks for archiving aggregates is completed, as we'll need that before we can merge this in.

> …statically served from an object store instead of serving it from the database.

Thanks for all your work!

Scardenasmolinar changed the task status from Stalled to In Progress.Jul 18 2025, 7:32 PM
Scardenasmolinar claimed this task.
Scardenasmolinar updated the task description. (Show Details)
Scardenasmolinar moved this task from QA to Done on the Moderator-Tools-Team (Kanban) board.

Can confirm we have archived aggregates on the Object Store!