
Investigation: Should we grab data from xtools? [timebox 6 days]
Closed, Resolved, Public

Description

As a PM for content integrity, I want our users to understand the quality of the revisions they are receiving. We collate this data from various locations in the wikiverse. The following link contains two items I am interested in for our reusers, listed below. The latter is relevant to the attribution guidelines the Foundation is raising for reusers overall.

To do any of this, we need to test the quality of the APIs first.

https://xtools.wmcloud.org/articleinfo/en.wikipedia.org/Lionel_Messi

Note that Xtools is open source.

  1. the current assessment of each article where they are available per language
  2. the number of editors the article has overall.

To Do -

Event Timeline

FNavas-foundation renamed this task from Investigation: Can we grab content assessment grades from xtools? to Investigation: Can we grab data from xtools?. Apr 1 2025, 1:58 PM
FNavas-foundation renamed this task from Investigation: Can we grab data from xtools? to Investigation: Should we we grab data from xtools?.
FNavas-foundation updated the task description.
JArguello-WMF renamed this task from Investigation: Should we we grab data from xtools? to Investigation: Should we we grab data from xtools? [timebox 6 days]. May 7 2025, 1:52 PM

Please consider the warning message at the top of this page:

"This page is very old. Some data may be inaccurate due to how revisions were stored in the early days of MediaWiki."

https://xtools.wmcloud.org/articleinfo/en.wikipedia.org/Albert_Einstein

Should we grab data from xtools?

Do you mean to power the Enterprise APIs? If so, the answer is no. XTools enjoys ~99% uptime, but is not a production WMF product and comes with no guarantees. It also definitely cannot handle the load you'd expect, assuming you want to query its APIs in real time.

Note also that XTools has to query the sanitized Cloud Services replicas because it runs outside production. This usually isn't a problem with respect to data integrity, but, for example, if a page has a lot of subsequent suppressed revisions, XTools data may be inaccurate because it can't see those revisions. Enterprise API consumers presumably can't, either, so maybe that doesn't matter much.

As noted, XTools is open source, and at least for the metrics mentioned in this task, you're better off querying for them directly.


the number of editors the article has overall.

For that you'd use the PageInfo API, going by the editors attribute as you noted. A production DB query would look something like SELECT COUNT(DISTINCT rev_actor) FROM revision WHERE rev_page = <page-id>.
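To make the query above concrete, here is a minimal sketch that runs the same SELECT against a toy in-memory SQLite table. This is only an illustration: the real `revision` table lives in MediaWiki's production database and has many more columns, and the helper names `count_distinct_editors` and `make_demo_db` as well as the sample rows are made up for this example.

```python
import sqlite3


def count_distinct_editors(conn: sqlite3.Connection, page_id: int) -> int:
    """Count distinct actors that have at least one revision on the page."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT rev_actor) FROM revision WHERE rev_page = ?",
        (page_id,),
    ).fetchone()
    return row[0]


def make_demo_db() -> sqlite3.Connection:
    """Build a toy stand-in for the revision table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revision ("
        "rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_actor INTEGER)"
    )
    conn.executemany(
        "INSERT INTO revision (rev_page, rev_actor) VALUES (?, ?)",
        # Page 1 has three revisions by two actors; page 2 has one revision.
        [(1, 10), (1, 11), (1, 10), (2, 12)],
    )
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    print(count_distinct_editors(demo, 1))
```

Note that this counts every actor who ever saved a revision on the page, which is the "editors overall" metric rather than the authors of the current content.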

WikiWho is a content persistence algorithm. The list of authors it returns represents the authors of the content of the given revision (the current revision unless specified otherwise), not the entire history of the page. So note these are different metrics.

the current assessment of each article where they are available per language

You should use the PageAssessments API directly for this. No need to go through XTools :)
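As a rough sketch of what querying PageAssessments directly could look like, the snippet below builds an Action API request with `prop=pageassessments` and pulls the per-project class out of a response. The endpoint default, the helper names, and the assumed response shape (a `pages` list under `formatversion=2`, with `pageassessments` keyed by project name) are assumptions to verify against the extension's documentation; no network call is made here.

```python
from urllib.parse import urlencode

# Assumed endpoint; each language edition has its own api.php.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_assessment_url(title: str, endpoint: str = API_ENDPOINT) -> str:
    """Build an Action API URL requesting page assessments for one title."""
    params = {
        "action": "query",
        "prop": "pageassessments",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return f"{endpoint}?{urlencode(params)}"


def extract_assessments(payload: dict) -> dict:
    """Map page title -> {project name: class} from a decoded JSON response.

    Pages without any assessment map to an empty dict.
    """
    result = {}
    for page in payload.get("query", {}).get("pages", []):
        grades = page.get("pageassessments", {})
        result[page.get("title", "")] = {
            project: info.get("class", "") for project, info in grades.items()
        }
    return result


if __name__ == "__main__":
    print(build_assessment_url("Lionel_Messi"))
```

Assessments are recorded per WikiProject, so a single article can carry several grades, and wikis where the extension is not in use would simply return no `pageassessments` entry for the page.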

Please consider the warning message at the top of this page:

"This page is very old. Some data may be inaccurate due to how revisions were stored in the early days of MediaWiki."

https://xtools.wmcloud.org/articleinfo/en.wikipedia.org/Albert_Einstein

This is just a warning that some calculations could be off due to imported edits, data corruption, and other oddities that happened in early versions of MediaWiki. As a random example, https://en.wikipedia.org/?diff=5 shows "No difference" when actual content was added and removed.


Also note that the Page History tool (but not its APIs) is limited to 50,000 revisions for performance reasons. Having this many revisions is rare for any mainspace article, but common for Project namespace pages such as https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard

Thank you @MusikAnimal, this is gold for whoever picks up this ticket. Are you happy for us to reach out to you with questions, perhaps via here?

The PageAssessments API may be the solution for that part.

99.99% we would not want to hit XTools in real time; it's not necessary for the use case we're thinking of. I imagine it'd be more of a cached, offline situation.

Thank you @MusikAnimal, this is gold for whoever picks up this ticket. Are you happy for us to reach out to you with questions, perhaps via here?

Of course! Through here or elsewhere (IRC / email / Slack), I'm happy to help :)

99.99% we would not want to hit XTools in real time; it's not necessary for the use case we're thinking of. I imagine it'd be more of a cached, offline situation.

Okay, thank you. I am relieved, hehe!