
Include contributors count in REST Attribution endpoint
Open, In Progress, High · Public · Spike

Description

MediaWiki provides a WikiPage::getContributors() method, but it relies on a complex SQL query and may not perform well, especially under higher load.

We should check whether we can use that method for the attribution endpoint; if not, we should look into a different way of retrieving the contributors count.

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · View Herald Transcript · Tue, Feb 17, 5:06 PM

Change #1240678 had a related patch set uploaded (by Pmiazga; author: Pmiazga):

[mediawiki/extensions/WikimediaCustomizations@master] DNM Use WikiPage::getContributors to retrieve contributors count

https://gerrit.wikimedia.org/r/1240678

@Ladsgroup we have a way to retrieve the contributors count via a MediaWiki query. This is a sample of the SQL it runs:

SELECT  
    actor_user AS `user_id`,
    actor_name AS `user_name`,
    MIN(rev_actor) AS `actor_id`,
    MIN(user_real_name) AS `user_real_name`,
    MAX(rev_timestamp) AS `timestamp`  
FROM 
    `revision` 
    JOIN `actor` ON ((rev_actor = actor_id)) 
    LEFT JOIN `user` ON ((actor_user = user_id))  
WHERE 
  rev_page = 2 
  AND (rev_actor != 31) 
  AND ((rev_deleted & 4) = 0)  
GROUP BY 
  actor_user,actor_name 
ORDER BY 
  timestamp DESC

From what I see, the query plan is pretty bad:

+------+-------------+----------+--------+-----------------------------------------------------------------+--------------------+---------+--------------------------------+------+----------------------------------------------+
| id   | select_type | table    | type   | possible_keys                                                   | key                | key_len | ref                            | rows | Extra                                        |
+------+-------------+----------+--------+-----------------------------------------------------------------+--------------------+---------+--------------------------------+------+----------------------------------------------+
|    1 | SIMPLE      | revision | ref    | rev_page_timestamp,rev_actor_timestamp,rev_page_actor_timestamp | rev_page_timestamp | 4       | const                          | 2    | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | actor    | eq_ref | PRIMARY                                                         | PRIMARY            | 8       | my_database.revision.rev_actor | 1    |                                              |
|    1 | SIMPLE      | user     | eq_ref | PRIMARY                                                         | PRIMARY            | 4       | my_database.actor.actor_user   | 1    | Using where                                  |
+------+-------------+----------+--------+-----------------------------------------------------------------+--------------------+---------+--------------------------------+------+----------------------------------------------+

I don't think we can use it directly in the attribution endpoint. Is this something we could use as-is, or could we add some indexes to make it usable in production?

So first, adding an extra index on revision is a no-go. It's the biggest table on almost all wikis, and we should move in the opposite direction (we have removed columns and indexes from it in the past couple of years).

On your question: I assume you're saying it's bad because the Extra column says filesort and temporary, right? If that's the case, I don't think it's an issue: a filesort of 2 rows is going to be blazing fast no matter how bad it looks.

In fact I tried this query on stat machines and it took 1.5 seconds:

SELECT  
    actor_user AS `user_id`,
    actor_name AS `user_name`,
    MIN(rev_actor) AS `actor_id`,
    MIN(user_real_name) AS `user_real_name`,
    MAX(rev_timestamp) AS `timestamp`  
FROM 
    `revision` 
    JOIN `actor` ON ((rev_actor = actor_id)) 
    LEFT JOIN `user` ON ((actor_user = user_id))  
WHERE 
  rev_page = 55943877 
  AND (rev_actor != 31) 
  AND ((rev_deleted & 4) = 0)  
GROUP BY 
  actor_user,actor_name 
ORDER BY 
  timestamp DESC

1859 rows in set (1.509 sec)

That's the Bad Bunny page, which has almost 2K distinct editors, and it's still blazing fast.

See:

mysql:research@dbstore1008.eqiad.wmnet [enwiki]> explain SELECT  
    ->     actor_user AS `user_id`,
    ->     actor_name AS `user_name`,
    ->     MIN(rev_actor) AS `actor_id`,
    ->     MIN(user_real_name) AS `user_real_name`,
    ->     MAX(rev_timestamp) AS `timestamp`  
    -> FROM 
    ->     `revision` 
    ->     JOIN `actor` ON ((rev_actor = actor_id)) 
    ->     LEFT JOIN `user` ON ((actor_user = user_id))  
    -> WHERE 
    ->   rev_page = 55943877 
    ->   AND (rev_actor != 31) 
    ->   AND ((rev_deleted & 4) = 0)  
    -> GROUP BY 
    ->   actor_user,actor_name 
    -> ORDER BY 
    ->   timestamp DESC;
+------+-------------+----------+--------+-----------------------------------------------------------------+--------------------+---------+---------------------------+------+----------------------------------------------+
| id   | select_type | table    | type   | possible_keys                                                   | key                | key_len | ref                       | rows | Extra                                        |
+------+-------------+----------+--------+-----------------------------------------------------------------+--------------------+---------+---------------------------+------+----------------------------------------------+
|    1 | SIMPLE      | revision | ref    | rev_actor_timestamp,rev_page_actor_timestamp,rev_page_timestamp | rev_page_timestamp | 4       | const                     | 4431 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | actor    | eq_ref | PRIMARY                                                         | PRIMARY            | 8       | enwiki.revision.rev_actor | 1    |                                              |
|    1 | SIMPLE      | user     | eq_ref | PRIMARY                                                         | PRIMARY            | 4       | enwiki.actor.actor_user   | 1    | Using where                                  |
+------+-------------+----------+--------+-----------------------------------------------------------------+--------------------+---------+---------------------------+------+----------------------------------------------+
3 rows in set (0.012 sec)

I'd say do not trigger this query on every page load, and put a max execution time on it (let's say ten seconds) to make sure pathological cases don't show up. Then you're good to go.
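On MariaDB (which Wikimedia production databases run), a per-query cap like the ten seconds suggested above can be applied with a SET STATEMENT wrapper. A minimal sketch, reusing the contributors query from earlier in this thread:

```sql
-- Sketch: abort this query if it runs longer than 10 seconds.
-- MariaDB's max_statement_time is in seconds; on MySQL the rough
-- equivalent is the /*+ MAX_EXECUTION_TIME(10000) */ optimizer hint (ms).
SET STATEMENT max_statement_time=10 FOR
SELECT actor_user AS user_id, actor_name AS user_name
FROM revision
  JOIN actor ON rev_actor = actor_id
WHERE rev_page = 2
GROUP BY actor_user, actor_name;
```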

This query won't be executed on every page load, but it would be triggered on every call to the Attribution endpoint, which we expect to be frequent, at first most likely from Enterprise.

Yeah, I was a bit scared of that filesort. My local DB doesn't have that many revisions/contributors, so I wasn't sure whether this could be used safely.

Just to clarify -- while this will likely be used by Enterprise, I don't think they'll be calling it on every hit on THEIR API, I think they may use this when they do their periodic fetch + every time a page changes/gets edited.

We should verify this, but it does mean that we're absolutely not expecting Enterprise-scale load -- but we *are* expecting traffic on this endpoint to be significantly higher than, say, the "page history" REST endpoint where this query is already used today.

In that case, you could add an option to limit the timespan (e.g. get credits for the last month only), so the number of rows read will be even smaller (via the rev_page_timestamp index). For some pages, checking twenty years of history might not be useful or needed. Of course, from time to time, they can do a full check based on dumps or whatever else they see fit.
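A timespan option like the one suggested above could be a plain rev_timestamp bound, which lets the rev_page_timestamp index skip old revisions entirely. A minimal sketch (the 30-day window and page ID are illustrative; MediaWiki stores rev_timestamp as a binary(14) YYYYMMDDHHMMSS string, so a string comparison against a computed cutoff works):

```sql
-- Sketch: count only contributors from the last 30 days.
SELECT COUNT(DISTINCT rev_actor) AS recent_contributors
FROM revision
WHERE rev_page = 55943877
  AND rev_timestamp > DATE_FORMAT(NOW() - INTERVAL 30 DAY, '%Y%m%d%H%i%S')
  AND (rev_deleted & 4) = 0;
```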

pmiazga renamed this task from Spike: check the WikiPage::getContributors() performance. to Include contributors count in REST Attribution endpoint. · Edited · Thu, Feb 19, 7:02 PM
pmiazga changed the task status from Open to In Progress.
pmiazga triaged this task as High priority.

I updated this from a spike to a regular ticket: initially I expected more work to be required (performance concerns), but if we can use what MediaWiki provides, we can wrap up the work in this single task.

In any case, it's only doing a filesort because you are grouping on columns from another table and sorting the results (which might be necessary if you need to list people in the order of their last contribution, but for a count it's irrelevant).

SELECT  
    actor_user AS `user_id`,
    actor_name AS `user_name`,
    MIN(rev_actor) AS `actor_id`,
    MIN(user_real_name) AS `user_real_name`,
    MAX(rev_timestamp) AS `timestamp`  
FROM 
    `revision` 
    JOIN `actor` ON ((rev_actor = actor_id)) 
    LEFT JOIN `user` ON ((actor_user = user_id))  
WHERE 
  rev_page = 55943877 
  AND (rev_actor != 31) 
  AND ((rev_deleted & 4) = 0)  
GROUP BY 
  rev_actor 

+------+-------------+----------+--------+-----------------------------------------------------------------+--------------------------+---------+---------------------------+------+-------------+
| id   | select_type | table    | type   | possible_keys                                                   | key                      | key_len | ref                       | rows | Extra       |
+------+-------------+----------+--------+-----------------------------------------------------------------+--------------------------+---------+---------------------------+------+-------------+
|    1 | SIMPLE      | revision | range  | rev_actor_timestamp,rev_page_actor_timestamp,rev_page_timestamp | rev_page_actor_timestamp | 12      | NULL                      | 8127 | Using where |
|    1 | SIMPLE      | actor    | eq_ref | PRIMARY                                                         | PRIMARY                  | 8       | enwiki.revision.rev_actor | 1    |             |
|    1 | SIMPLE      | user     | eq_ref | PRIMARY                                                         | PRIMARY                  | 4       | enwiki.actor.actor_user   | 1    | Using where |
+------+-------------+----------+--------+-----------------------------------------------------------------+--------------------------+---------+---------------------------+------+-------------+

Even so, the query can be pretty slow for huge articles. E.g. for Donald Trump, it took me 17 sec on a stat machine. There was some discussion about how to handle slow but predictable aggregate queries on T341649: Provide an easy way for MediaWiki to fetch aggregate data from the data lake, maybe worth reviving?

That's definitely something we could look into, or we could try to get something similar to "PageViewInfo" from AQS but for the contributor count (they already have an API endpoint). They, however, *also* have time limits, though it might be easier to ask for expanded timespans there than to do this within MW...?

It's worth measuring the SQL query without the sort, but I think we might need input here from Product about the time-span limit.

We should make sure this ticket summarizes the pros/cons and requirements of the SQL approach, and the alternatives we may use (like AQS, data lake, etc). Halley and our stakeholders will be able to then make a much more informed decision about what type of data works for this -- and if needed, prioritize resourcing for other work (or not).

Let's make sure this ticket summarizes the available options and their implications.

Ideally, we want to have the full page count. Limiting it to the last X days/months or even years would negatively affect pages that have been around a long time and are relatively stable, yet still trustworthy.

I do also think that we should consider using the data lake here, if possible. Tagging @GGoncalves-WMF to weigh in as well -- it seems like the data we want is already largely available in AQS (although I'm not sure what the look back window is) and/or we could potentially have a derived data set from the data lake, so we aren't calculating it on the fly all the time (or are at least counting from the data lake instead of production tables).

Yeah, but iirc there's another entrypoint for number of contributors for a page (I mentioned above) -- but that one also seems to require date range, iirc.

We should check what the options are and see if we can use existing tools or potentially create something more dedicated for this (and what would be the scope/price/pros/cons etc).

Change #1240678 abandoned by Pmiazga:

[mediawiki/extensions/WikimediaCustomizations@master] Use WikiPage::getContributors to retrieve contributors count

Reason:

Not performant enough; we need a different solution

https://gerrit.wikimedia.org/r/1240678

Let me sum up my findings:

MediaWiki was never written to list all contributors of a page; this was one of the early assumptions of the system. We can easily retrieve a list of recent contributors, or query the revision table in a specific time range, but a full scan is just very slow due to the table's size. And the recent changes information is periodically purged of entries older than 30 days.

There are a couple of existing code paths that retrieve information regarding contributors:

MediaWiki presenting contributors count
The PageInfo action

Example: https://en.wikipedia.org/w/index.php?title=79th_British_Academy_Film_Awards&action=info
It shows the recent number of distinct authors, where "recent" refers to $wgRCMaxAge, which is 30 days. We can easily fetch the number of recent editors, but only for the last 30 days.
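For comparison, the figure shown by action=info comes from the recentchanges table, which only retains rows newer than $wgRCMaxAge. A roughly equivalent query would be the following sketch (not the exact code path; the page ID is hypothetical):

```sql
-- Sketch: distinct recent editors from recentchanges.
-- Cheap only because recentchanges is purged of rows older than $wgRCMaxAge.
SELECT COUNT(DISTINCT rc_actor) AS recent_editors
FROM recentchanges
WHERE rc_cur_id = 12345      -- page ID (hypothetical)
  AND rc_type IN (0, 1);     -- RC_EDIT and RC_NEW
```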

The built-in WikiPage::getContributors() method

This is built into MediaWiki and retrieves the entire list from the revision table. The issue with this approach is performance: it has to go through a very big table, and as Gergo mentioned, the query for the Donald Trump article on a stats database takes more than 15 seconds.

REST endpoint PageHistoryCountHandler

Source: https://github.com/wikimedia/mediawiki/blob/master/includes/Rest/Handler/PageHistoryCountHandler.php#L556
This one also queries the revision table. It's a tad faster, but only because it likewise limits the results: it retrieves counts only up to the timestamp of the latest entry in the logging table for the given page ID.
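Conceptually, the handler's editors count boils down to a bounded COUNT(DISTINCT ...) on revision. A sketch (not the exact query the handler builds; the cutoff timestamp is illustrative):

```sql
-- Sketch: count distinct editors only up to a cutoff derived from
-- the logging table, so the index scan stays bounded.
SELECT COUNT(DISTINCT rev_actor) AS editors
FROM revision
WHERE rev_page = 55943877
  AND rev_timestamp <= '20260101000000'  -- hypothetical cutoff
  AND (rev_deleted & 4) = 0;
```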

Possible solutions:

  • Call existing methods, but limit the response to the last 30 days. Contributors is just one of the signals, and retrieving it shouldn't take more than a couple of seconds; because of that, we cannot retrieve the entire history. This is a small task, doable within this quarter.
  • Update the existing DB schema. At this stage, I'm not even sure that adding an extra index to the revision table would help, and even if it did, it would take a very long time (I've heard about half a year) to apply. I don't think we can fit this into this fiscal year, and it's most likely not the way to go.
  • Provide a new table with this specific information and fill it in during the recent-change save hook. This solution would let us collect information like contributors count, reference count, etc., but we would still need a script to backfill the data for old articles, and that processing can again take a long time (we're talking months). The implementation phase should fit within two sprints, but running the backfill script will take a couple of months for sure (all titles, all wikis, all projects).
  • Use PageProps and update information like the contributors count during the recent-change save hook. This is a bit faster than the previous solution, as we can reuse an existing mechanism, but it still requires us to process the entire database and fill in the required information (contributor counts for all articles). Implementation can fit within a sprint, but it requires running a migration script, which will take a couple of months.
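For the dedicated-table option above, the schema could be as small as one row per page, written from the save hook and backfilled by a maintenance script. A hypothetical sketch (table and column names are made up; a real implementation would update the count incrementally rather than recomputing it on every save):

```sql
-- Hypothetical schema for a per-page contributors-count table.
CREATE TABLE page_contributors (
  pc_page INT UNSIGNED NOT NULL PRIMARY KEY,  -- references page.page_id
  pc_count INT UNSIGNED NOT NULL DEFAULT 0,   -- distinct contributors so far
  pc_updated BINARY(14) NOT NULL              -- MW timestamp of last refresh
);

-- Backfill for one page (sketch); the save hook would instead bump
-- pc_count only when a previously unseen actor edits the page.
REPLACE INTO page_contributors (pc_page, pc_count, pc_updated)
SELECT rev_page, COUNT(DISTINCT rev_actor),
       DATE_FORMAT(NOW(), '%Y%m%d%H%i%S')
FROM revision
WHERE rev_page = 55943877
GROUP BY rev_page;
```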

We may have to query DataPlatform anyway to retrieve the Trending and PageViews signals. The only issue with this solution is that the information could be up to 24 hours old.

Because we want to release this feature quickly, I would propose checking with DataPlatform whether they can provide us the contributors count.
A partial solution for now would be just the recent contributors, but this may be invalid (returning 0) for older articles that haven't changed recently.

@Mooeypoo can you review this and let me know if there is anything I missed?

Just chiming in here with notes from a conversation with Piotr. The long story short is that the 30 day timebox is likely better than nothing, and could potentially be used as a starting point while we wait for data platform supported solutions. However, it really goes against the intent of what we're trying to foster with having the total count. My preference would therefore be to see if there is a reasonable way to get something that at least approximates the total contributor count a bit more accurately.

I think we need a few additional data points to make a decision here:

  1. I'd love to get more information about the relative performance of the REST endpoint PageHistoryCountHandler vs GetContributors(). Based on initial exploration that Piotr and I did on the call, it does seem like it's a lot faster, but it's hard to tell if it's fast "enough" without having some additional comparisons and consultation.
  2. Get clarity on how far back the PageHistoryCountHandler goes during calculation. In addition to investigating when the logs actually start, I would recommend comparing results for contributor counts between PageHistoryCountHandler vs GetContributors() as well, to get a sense for how big of a difference the logging timebox makes in practice.

For the above, it seems like we would need to test with highly active, long-lived pages. We should also see whether caching is a thing/makes a difference in any of the options presented.

It also seems like adding something to the DB schema would probably be a last resort, given it would take months to just populate it. We can reasonably assume that DPE will hook us up with something better by then.

check with DataPlatform if they can provide us contributors count.

Total editors per page can for sure be counted. But! How can it be served? That is the question!

T341649: Provide an easy way for MediaWiki to fetch aggregate data from the data lake

Example Data Lake query (from @JAllemandou):

SELECT 
  wiki_id, 
  page_id,
  COUNT(DISTINCT user_central_id) AS distinct_users
FROM 
  mediawiki_revision_history_v1 
GROUP BY 
  wiki_id, 
  page_id

mediawiki_revision_history_v1 is updated hourly, and reconciled (eventual consistency) daily.

This data is also available in the mediawiki_history_reduced dataset, which is loaded into Druid and powers many contributor-related AQS queries. This is only updated monthly. We had a plan to update it daily (or more often) when we were working on Global Editor Metrics for Year in Review in Q2, but it looks like that plan was changed to more quickly support just Global Editor Metrics.

Even if we had implemented incremental updates for the Druid dataset, it isn't clear that Druid would perform well for an all-time-history COUNT DISTINCT query.