
Migrate ORES clients to LiftWing
Open, Needs Triage, Public

Description

The scope of the project probably deserves a Phabricator tag, but we can start collecting ideas and suggestions from other people and discuss what road to take.

Some useful links:

https://www.mediawiki.org/wiki/ORES/Components
https://www.mediawiki.org/wiki/ORES/Applications

Clients

The clients can be divided into multiple macro areas:

  • ChangeProp and revision-score events: when a revision-create event is generated, ChangeProp calls the ORES precache endpoint to get a score from every model associated with the rev-id's wiki. For example, if the revision-create event carries rev-id 123456 for enwiki, ORES is asked to score that revision with all models compatible with enwiki (the list is part of the ORES configuration). ChangeProp then generates a mediawiki.revision-score event, which is sent to EventGate. We opened T301878 to address this use case in Lift Wing in a more modern way.
  • External clients hitting ores.wikimedia.org: mostly bots that score rev-ids in batches to help the community fight vandalism. See "Counter Vandalism" in the /Applications link above for an initial list (which may be outdated and inaccurate).
  • MediaWiki extension ORES: this frontend displays revscoring data on the Special:Contributions and Special:RecentChanges pages. A FetchScoreJob event is created in response to the RecentChange_save event; it fetches scores from the ORES API and caches them in the local MediaWiki database for efficient access. The PHP code looks highly configurable, but it will need to be adapted to work with the Lift Wing API.
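
To make the batch-scoring use case concrete, here is a minimal sketch of how a client might build an ORES v3 scores URL for several rev-ids and models at once. The helper name build_ores_scores_url is made up for this example, and nothing is actually sent over the network; it only shows the shape of the request.

```python
from urllib.parse import urlencode

ORES_BASE_URL = "https://ores.wikimedia.org/"  # public ORES endpoint named in this task


def build_ores_scores_url(wiki, rev_ids, models, api_version=3):
    """Return the ORES scores URL for a batch of rev-ids (hypothetical helper)."""
    query = urlencode(
        {
            # ORES accepts several values per parameter, separated by "|".
            "models": "|".join(models),
            "revids": "|".join(str(r) for r in rev_ids),
        },
        safe="|",  # keep the separator readable instead of percent-encoding it
    )
    return f"{ORES_BASE_URL}v{api_version}/scores/{wiki}/?{query}"


print(build_ores_scores_url("enwiki", [123456, 123457], ["damaging", "goodfaith"]))
```

Any migration path has to account for clients that build URLs like this one by hand.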

Migration strategies

The Lift Wing API is still to be decided due to T288789, but we'll likely have two main entry points:

  • api.wikimedia.org for the external clients
  • an internal discovery endpoint for internal clients.

By "internal clients" we mean ChangeProp, MediaWiki, and the researchers, data engineers, etc. who need to contact Lift Wing.

Migrating external users from ores.wikimedia.org to api.wikimedia.org will surely be a problem, since:

  • We don't know the exact list of bots/tools and their codebases, together with owners and their point of contact.
  • Moving from ores.wikimedia.org to api.wikimedia.org is not only a change of endpoint but also a change in API calls. Given the point above, this may be difficult because some tools/bots don't have a clear owner.
  • Some bots/tools are vital for the community, but their codebase may not be owned by anyone available to change the code.

We have essentially two options:

  1. Create a thin rewrite/transition layer behind an endpoint like ores-legacy.wikimedia.org that simply accepts ORES-like API calls (maybe a limited set) and "translates" them into Lift Wing ones.
  2. Follow up with all bot owners, asking them to migrate their tools to Lift Wing, and keep ORES up for the time being.
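
As a rough sketch of what option 1 could look like, the function below takes an ORES-style scores URL and expands it into one Lift Wing request per model and revision. Everything on the Lift Wing side is an assumption: the API is still undecided, so the URL template, the "{wiki}-{model}" naming, and the {"rev_id": ...} body are placeholders, and translate_ores_request is a hypothetical name. No network calls are made.

```python
from urllib.parse import parse_qs, urlsplit

# ASSUMPTION: placeholder for the future Lift Wing endpoint. The real API is
# still undecided (see T288789), so this URL layout is illustrative only.
LIFT_WING_URL_TEMPLATE = (
    "https://api.wikimedia.org/HYPOTHETICAL/models/{wiki}-{model}:predict"
)


def translate_ores_request(ores_url):
    """Map one ORES-style scores URL to a list of (lift_wing_url, json_body) pairs."""
    parts = urlsplit(ores_url)
    # Expected path: /v3/scores/<wiki>/ ; anything else is outside the limited set.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 3 or segments[1] != "scores":
        raise ValueError(f"unsupported ORES path: {parts.path}")
    wiki = segments[2]
    params = parse_qs(parts.query)
    models = params.get("models", [""])[0].split("|")
    rev_ids = params.get("revids", [""])[0].split("|")
    if models == [""] or rev_ids == [""]:
        raise ValueError("both 'models' and 'revids' are required in this sketch")
    # ASSUMPTION: Lift Wing scores one model and one revision per call.
    return [
        (LIFT_WING_URL_TEMPLATE.format(wiki=wiki, model=model), {"rev_id": int(rev_id)})
        for rev_id in rev_ids
        for model in models
    ]
```

A real layer would also have to merge the individual responses back into an ORES-shaped reply, which is where most of the work would likely be.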

Event Timeline

The ORES extension runs PHP code that calls ORES for damaging and goodfaith only (but others are supported, see the extension.json file). The function that returns the HTTP URL to hit is:

	/**
	 * @return string Base URL plus your wiki's `scores` API path.
	 */
	public function getUrl() {
		$wikiId = self::getWikiID();
		$prefix = 'v' . self::API_VERSION;
		$baseUrl = self::getBaseUrl();
		$url = "{$baseUrl}{$prefix}/scores/{$wikiId}/";
		return $url;
	}

ores.wikimedia.org is configured as the BaseUrl. So in theory we just need to add an option to use LiftWing (which could be rolled out on a few wikis first); it shouldn't be much harder than extending the function above. Since it is the PHP code that calls ORES, we should be able to use our internal inference.discovery.wmnet endpoint. I'll follow up with Amir to verify whether my understanding is right :)

More or less copying over a comment from another task; it is more pertinent here, though likely beyond scope. The ORES Extension has the two MariaDB tables mentioned in the description. They are obviously important to the functioning of the extension, but they also serve as a public record* of ORES scores that has been used for research. I doubt there are tools that depend on these tables, but it's worth recognizing that if there is a way to create a nice public dataset of ORES scores (or really of any LiftWing model) as part of this work, that would have research benefits.

Relevant tasks based on quick search: T209611, T209739, T280107

*A poor public record unfortunately as their scope is very unclear and the data schema is very difficult to understand.

Hi @Isaac! I have a follow-up question on the public datasets: do you think we need the MariaDB tables for the dumps, or would it be OK to explore alternatives like the mediawiki.revision-score dataset in Hive/HDFS? If the latter is viable, we could partner with Data Engineering and see if we can generate a dump from Hadoop. The ORES extension seems useful for people patrolling Recent Changes etc., but it seems a stretch to use it for dumps. What do you think?

Do you think that we need the MariaDB tables for the dumps, or would it be ok to explore alternatives like the mediawiki.revision-score dataset in Hive/HDFS? If the latter is viable we could partner with Data Engineering and see if we can generate a dump from Hadoop.

@elukey yeah, I think that would be a much better strategy than the MariaDB tables. I think a lag of, say, one month is acceptable for public large-scale access to revision scores. I have found the format of the MariaDB tables particularly opaque, so I would much rather see the scores packaged up in an interpretable format and merged with some other metadata, as they are in events. It also presumably has the benefit of not splitting the data across lots of databases and of including all scores (not just the ones used by the ORES extension).
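
To illustrate what "an interpretable format" could mean, here is a small sketch that flattens revision-score events into one TSV row per model. The field names used (database, rev_id, scores, model_version, prediction, probability) are assumptions about the event shape made for this example and have not been checked against the actual schema; the function names are hypothetical too.

```python
import csv
import io

COLUMNS = ["wiki", "rev_id", "model", "model_version", "prediction", "probability_true"]


def flatten_revision_score(event):
    """Yield one flat row per model score found in a revision-score event."""
    # ASSUMPTION: field names are illustrative, not checked against the real schema.
    for model_name, score in sorted(event.get("scores", {}).items()):
        yield {
            "wiki": event["database"],
            "rev_id": event["rev_id"],
            "model": model_name,
            "model_version": score.get("model_version", ""),
            "prediction": "|".join(score.get("prediction", [])),
            "probability_true": score.get("probability", {}).get("true", ""),
        }


def events_to_tsv(events):
    """Render events as a TSV string with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for event in events:
        writer.writerows(flatten_revision_score(event))
    return out.getvalue()
```

Because each row carries its own wiki and model columns, a single file could cover all wikis and all models rather than one table per database.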