
Move MediaWiki QueryPages computation to Hadoop
Open, Needs Triage, Public

Description

I was talking with @Ladsgroup and we see a potential collaboration. MediaWiki wastes a lot of resources* computing the results of queries that would take only a few minutes on the analytics cluster. One class of such queries powers the Special pages called QueryPages. Here are two examples:

Most Linked Pages: the query and the result on English Wikipedia.

Wanted Pages: the query and the result on English Wikipedia.

* These queries run once a month, or sometimes more frequently, and often take a full day to compute. While they run, they execute full table scans and are very disruptive to anything else trying to access the database.
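
To make the footnote concrete, here is a minimal sketch of the shape of such a query, written as the QueryPage::getQueryInfo()-style array MediaWiki special pages use. The column names follow the historical pagelinks schema and are illustrative only; the actual queries are linked above.

// Illustrative only: roughly what a QueryPage like Most Linked Pages asks
// the database for, as a getQueryInfo()-style array. Column names follow
// the historical pagelinks schema and may differ from the current one.
$queryInfo = [
    'tables' => [ 'pagelinks' ],
    'fields' => [
        'namespace' => 'pl_namespace',
        'title' => 'pl_title',
        'value' => 'COUNT(*)',
    ],
    // Grouping over the entire links table is what forces the full scan
    // described in the footnote above.
    'options' => [ 'GROUP BY' => [ 'pl_namespace', 'pl_title' ] ],
];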

A data pipeline that helps offload this work could be:

sqoop monthly -> wmf_raw -> wmf.<<new table>> -> load into Cassandra -> AQS

The PHP special pages above could then be changed to query AQS for the data directly, instead of running expensive SQL. Or we could skip the pages' queries altogether and load the precomputed data into the querycache table.
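
As a rough sketch of that second option, assuming a hypothetical source URL (the qc_* column names match the querycache rows that appear later in this task):

// Hypothetical sketch of the "load into querycache" option: fetch
// precomputed rows from an external source and stage them for the
// querycache table (qc_type, qc_namespace, qc_title, qc_value).
// The URL layout and function name are illustrative, not a real endpoint.
function loadPrecomputedRows( string $wiki, string $queryPage ): array {
    $url = "https://example.org/querypage/{$queryPage}/{$wiki}.json";
    $json = file_get_contents( $url );
    if ( $json === false ) {
        throw new RuntimeException( "Failed to fetch {$url}" );
    }
    $rows = json_decode( $json, true );
    if ( !is_array( $rows ) ) {
        throw new RuntimeException( "Unexpected payload from {$url}" );
    }
    return $rows; // each row: qc_type, qc_namespace, qc_title, qc_value
}

$rows = loadPrecomputedRows( 'testwiki', 'MostTranscludedPages' );
echo count( $rows ), " precomputed rows fetched\n";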

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Introduce querypage_most_categories DAG | repos/data-engineering/airflow-dags!1828 | zabe | T309738 | main
querypage: Change the target path to be the exact name of the query page | repos/data-engineering/airflow-dags!566 | ladsgroup | querypage_improve_path | main
Introduce querypage_most_linked_templates_monthly DAG | repos/data-engineering/airflow-dags!534 | milimetric | airflow-dags-start_querypage | main
Introduce querypage_most_linked_templates_monthly DAG | repos/data-engineering/airflow-dags!494 | ladsgroup | start_querypage | main

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change #1195427 merged by jenkins-bot:

[mediawiki/core@master] QueryPage: Add external query source support

https://gerrit.wikimedia.org/r/1195427

Change #1196112 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/mediawiki-config@master] Using Hadoop for MostTranscludedPages on testwiki

https://gerrit.wikimedia.org/r/1196112

After meeting with Eric, we decided for now to push it somewhere as a file; originally Swift, but Dan suggested we can just push them to datasets: https://analytics.wikimedia.org/published/datasets/

Does this have an internal domain?

Zabe moved this task from Ready to In progress on the DBA board.

> After meeting with Eric, we decided for now to push it somewhere as a file; originally Swift, but Dan suggested we can just push them to datasets: https://analytics.wikimedia.org/published/datasets/
>
> Does this have an internal domain?

I don't think it's needed. It'd be fun to make it go through envoy, but that domain is not a service that needs an envoy proxy, nor is it hit frequently enough for a proxy to make sense. From what I'm seeing, it's accessible internally too (no need for http_proxy). Is it failing somehow?

> > After meeting with Eric, we decided for now to push it somewhere as a file; originally Swift, but Dan suggested we can just push them to datasets: https://analytics.wikimedia.org/published/datasets/
> >
> > Does this have an internal domain?
>
> I don't think it's needed. It'd be fun to make it go through envoy, but that domain is not a service that needs an envoy proxy, nor is it hit frequently enough for a proxy to make sense. From what I'm seeing, it's accessible internally too (no need for http_proxy). Is it failing somehow?

I would have used it if there were one already available.

From my testing, it does work on deploy2002 but not on a kubernetes node.

zabe@deploy2002:~$ mwscript shell.php testwiki
Psy Shell v0.12.10 (PHP 8.1.33 — cli) by Justin Hileman
> use MediaWiki\MediaWikiServices;
> $httpRequestFactory = MediaWikiServices::getInstance()->getHttpRequestFactory();
= MediaWiki\Http\HttpRequestFactory {#5441}

> $request = $httpRequestFactory->create( 'https://analytics.wikimedia.org/published/datasets/querypage/MostTranscludedPages/testwiki.json', [ 'timeout' => 10 ], 'TestFName' );
= GuzzleHttpRequest {#5451}

> $status = $request->execute();
= MediaWiki\Status\Status {#8314
    +value: & 200,
    +success: & [],
    +successCount: & 0,
    +failCount: & 0,
    +statusData: & null,
    +cleanCallback: false,
  }

> count( json_decode( $request->getContent(), true ) )
= 5000

> ^D

   INFO  Ctrl+D.

zabe@deploy2002:~$ mwscript-k8s --attach -- shell.php --wiki=testwiki
⏳ Starting shell.php on Kubernetes as job mw-script.codfw.sasyrsq2 ...
🚀 Job is running.
ℹ Expecting a prompt but don't see it? Due to a race condition, the beginning of the output might be missing. Try pressing enter.
📜 Attached to stdin/stdout:

> use MediaWiki\MediaWikiServices;
> $httpRequestFactory = MediaWikiServices::getInstance()->getHttpRequestFactory();
= MediaWiki\Http\HttpRequestFactory {#5457}

> $request = $httpRequestFactory->create( 'https://analytics.wikimedia.org/published/datasets/querypage/MostTranscludedPages/testwiki.json', [ 'timeout' => 10 ], 'TestFName' );
= GuzzleHttpRequest {#5461}

> $status = $request->execute();
= MediaWiki\Status\Status {#8325
    +value: & 0,
    +success: & [],
    +successCount: & 0,
    +failCount: & 0,
    +statusData: & null,
    +cleanCallback: false,
  }

> $request->getContent()
= ""

> $status->isOk()
= false

> $status->getMessage()->text()
= """
  * HTTP request timed out.\n
  * There was a problem during the HTTP request: 0 Error
  """

> ^D

   INFO  Ctrl+D.

zabe@deploy2002:~$

Let me ask around. ServiceOps should be able to guide us.

Internally, analytics.wikimedia.org is analytics-web.discovery.wmnet:8443 (note: what looks like a discovery service there is really a CNAME pointing to an-web1001.eqiad.wmnet). In any case, it looks like the analytics-web listener was added to the service mesh as part of the work in T380623 earlier this year.

Without knowing anything about the client or its use case [0], the k8s service where it will run will need to have (1) the listener enabled (i.e., will listen on localhost:6206) and (2) the network policies updated to allow connectivity to the relevant IP/port pair(s) (in this case, singular, if indeed it's backed by a single host).

In any case, with a bit more detail on the former part (i.e., what and where), I can help point you folks toward what needs to be changed.

[0] though I do implicitly have questions about how reliable this is intended to be, if it's backed by a single host

> Internally, analytics.wikimedia.org is analytics-web.discovery.wmnet:8443 (note: what looks like a discovery service there is really a CNAME pointing to an-web1001.eqiad.wmnet). In any case, it looks like the analytics-web listener was added to the service mesh as part of the work in T380623 earlier this year.
>
> Without knowing anything about the client or its use case [0], the k8s service where it will run will need to have (1) the listener enabled (i.e., will listen on localhost:6206) and (2) the network policies updated to allow connectivity to the relevant IP/port pair(s) (in this case, singular, if indeed it's backed by a single host).

The clients would be the updatequerypages periodic jobs. We want them to stop performing database queries themselves and instead fetch pre-processed data from analytics (e.g. https://analytics.wikimedia.org/published/datasets/querypage/MostTranscludedPages/enwiki.json). The total number of HTTP requests will be quite low.

> In any case, with a bit more detail on the former part (i.e., what and where), I can help point you folks toward what needs to be changed.
>
> [0] though I do implicitly have questions about how reliable this is intended to be, if it's backed by a single host

Thanks for the context, @Zabe!

Alright, in that case, it should be straightforward to add the analytics-web listener to the list of enabled listeners in MediaWiki, then update the shared networkpolicy to allow egress to an-web1001.eqiad.wmnet on 8443. I can take a look at that tomorrow.

Once that's done, you should be able to confirm (e.g., in mwscript-k8s) that http://localhost:6206 works and go from there (i.e., wiring that into your wgExternalQuerySources if I'm understanding correctly).
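
For illustration only, the wiring might end up looking roughly like the snippet below. The actual shape of $wgExternalQuerySources is defined by the core patch in this task (change #1195427), so treat every key here as a guess:

// Guesswork, not the real configuration format: route external QueryPage
// fetches through the local service-mesh listener on port 6206 rather
// than the public analytics.wikimedia.org URL.
$wgExternalQuerySources = [
    'MostTranscludedPages' => [
        'baseUrl' => 'http://localhost:6206/published/datasets/querypage',
    ],
];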

Change #1196109 merged by jenkins-bot:

[operations/mediawiki-config@master] BETA: Try using Hadoop QueryPage computations

https://gerrit.wikimedia.org/r/1196109

Mentioned in SAL (#wikimedia-operations) [2025-10-16T13:20:04Z] <zabe@deploy2002> Started scap sync-world: Backport for [[gerrit:1196109|BETA: Try using Hadoop QueryPage computations (T309738)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-16T13:22:27Z] <zabe@deploy2002> zabe: Backport for [[gerrit:1196109|BETA: Try using Hadoop QueryPage computations (T309738)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-10-16T13:28:13Z] <zabe@deploy2002> Finished scap sync-world: Backport for [[gerrit:1196109|BETA: Try using Hadoop QueryPage computations (T309738)]] (duration: 08m 09s)

Change #1196733 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hieradata: enable analytics-web listener in mediawiki

https://gerrit.wikimedia.org/r/1196733

Change #1196734 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hieradata: allow access to analytics-web from wikikube

https://gerrit.wikimedia.org/r/1196734

Change #1196735 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mw-*: update network policy for access to analytics-web

https://gerrit.wikimedia.org/r/1196735

I took a fresh look this morning, and there's at least one additional step beyond what I described in T309738#11279756 - it looks like we'll need to open up access to analytics-web from wikikube pod IP ranges in order for this to work. I believe this would look something like https://gerrit.wikimedia.org/r/1196734.

@brouberol - Two questions for you, and feel free to redirect them (tagging you since you added the mesh listener, etc.):

  1. Do you have any concerns about opening up access like this?
  2. Are there any additional access controls, etc., that might need to be tuned in order to enable access to analytics-web via the service mesh? (this is to facilitate the use case @Zabe describes in T309738#11275322)

Change #1196774 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/mediawiki-config@master] PS.php: Add analytics-web service

https://gerrit.wikimedia.org/r/1196774

@Scott_French The patch and the approach look fine to me. I've talked to @BTullis, who feels the same but also wanted to take a bit of time to soak in the full context.

Change #1196733 merged by Scott French:

[operations/puppet@production] hieradata: enable analytics-web listener in mediawiki

https://gerrit.wikimedia.org/r/1196733

Change #1196735 merged by jenkins-bot:

[operations/deployment-charts@master] mw-*: update network policy for access to analytics-web

https://gerrit.wikimedia.org/r/1196735

Change #1196734 merged by Scott French:

[operations/puppet@production] hieradata: allow access to analytics-web from wikikube

https://gerrit.wikimedia.org/r/1196734

Change #1197688 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] Revert "mw-*: update network policy for access to analytics-web"

https://gerrit.wikimedia.org/r/1197688

Change #1197688 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "mw-*: update network policy for access to analytics-web"

https://gerrit.wikimedia.org/r/1197688

Mentioned in SAL (#wikimedia-operations) [2025-10-21T17:46:13Z] <swfrench@deploy2002> Started scap sync-world: Deploy mesh configuration change for T309738

Mentioned in SAL (#wikimedia-operations) [2025-10-21T17:53:55Z] <swfrench@deploy2002> Finished scap sync-world: Deploy mesh configuration change for T309738 (duration: 11m 50s)

After piloting the mesh change in mw-debug and applying the firewall change to an-web1001, and shamelessly borrowing from T309738#11274818:

swfrench@deploy2002:~$ sudo mw-debug-repl testwiki
Finding a mw-debug pod in codfw...
Now running shell.php for testwiki inside pod/mw-debug.codfw.pinkunicorn-79bd5d45c8-bs2fz on release pinkunicorn...
Psy Shell v0.12.10 (PHP 8.1.33 — cli) by Justin Hileman
> use MediaWiki\MediaWikiServices;
> $httpRequestFactory = MediaWikiServices::getInstance()->getHttpRequestFactory();
= MediaWiki\Http\HttpRequestFactory {#5459}
                                                                                                                                                                                                                                                                             
> $request = $httpRequestFactory->create( 'http://localhost:6206/published/datasets/querypage/MostTranscludedPages/testwiki.json', [ 'timeout' => 10 ], 'TestFName' );
= GuzzleHttpRequest {#5469}

> $status = $request->execute();
= MediaWiki\Status\Status {#8339
    +value: & 200,
    +success: & [],
    +successCount: & 0,
    +failCount: & 0,
    +statusData: & null,
    +cleanCallback: false,
  }

> $status->isOk()
= true

> $request->getContent()
= "[{"qc_type":"MostTranscludedPages","qc_namespace":828, [...]

@Zabe - I think you should be good to proceed with testing on your end.

Change #1196774 merged by jenkins-bot:

[operations/mediawiki-config@master] PS.php: Add analytics-web service

https://gerrit.wikimedia.org/r/1196774

Mentioned in SAL (#wikimedia-operations) [2025-10-21T23:34:06Z] <zabe@deploy2002> Started scap sync-world: Backport for [[gerrit:1196774|PS.php: Add analytics-web service (T309738)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-21T23:38:42Z] <zabe@deploy2002> zabe: Backport for [[gerrit:1196774|PS.php: Add analytics-web service (T309738)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-10-21T23:44:01Z] <zabe@deploy2002> Finished scap sync-world: Backport for [[gerrit:1196774|PS.php: Add analytics-web service (T309738)]] (duration: 09m 55s)

Change #1197735 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/core@master] QueryPage: Improve error handling for external sources

https://gerrit.wikimedia.org/r/1197735

Change #1197737 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] QueryPage: Fix typo in wgExternalQuerySources docs

https://gerrit.wikimedia.org/r/1197737

Change #1197738 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] QueryPage: Improve wgExternalQuerySources docs

https://gerrit.wikimedia.org/r/1197738

Change #1197737 merged by jenkins-bot:

[mediawiki/core@master] QueryPage: Fix typo in wgExternalQuerySources docs

https://gerrit.wikimedia.org/r/1197737

Change #1197735 merged by jenkins-bot:

[mediawiki/core@master] QueryPage: Improve error handling for external sources

https://gerrit.wikimedia.org/r/1197735

Change #1197738 merged by jenkins-bot:

[mediawiki/core@master] QueryPage: Improve wgExternalQuerySources docs

https://gerrit.wikimedia.org/r/1197738

Change #1196112 merged by jenkins-bot:

[operations/mediawiki-config@master] Using Hadoop for MostTranscludedPages on testwiki

https://gerrit.wikimedia.org/r/1196112

Mentioned in SAL (#wikimedia-operations) [2025-10-27T11:26:14Z] <zabe@deploy2002> Started scap sync-world: Backport for [[gerrit:1196112|Using Hadoop for MostTranscludedPages on testwiki (T309738)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-27T11:30:21Z] <zabe@deploy2002> zabe: Backport for [[gerrit:1196112|Using Hadoop for MostTranscludedPages on testwiki (T309738)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-10-27T11:37:16Z] <zabe@deploy2002> Finished scap sync-world: Backport for [[gerrit:1196112|Using Hadoop for MostTranscludedPages on testwiki (T309738)]] (duration: 11m 02s)

Mentioned in SAL (#wikimedia-operations) [2025-10-28T15:11:41Z] <swfrench-wmf> applied mediawiki-common network policy updates in mw-script / mw-cron - T309738

Change #1199521 had a related patch set uploaded (by Zabe; author: Zabe):

[analytics/refinery@master] Introduce MostCategories.hql

https://gerrit.wikimedia.org/r/1199521

Change #1199522 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/mediawiki-config@master] Using Hadoop for MostTranscludedPages on enwiki

https://gerrit.wikimedia.org/r/1199522

Change #1199522 merged by jenkins-bot:

[operations/mediawiki-config@master] Using Hadoop for MostTranscludedPages on enwiki

https://gerrit.wikimedia.org/r/1199522

Mentioned in SAL (#wikimedia-operations) [2025-11-04T00:23:00Z] <zabe@deploy2002> Started scap sync-world: Backport for [[gerrit:1199522|Using Hadoop for MostTranscludedPages on enwiki (T309738)]]

Mentioned in SAL (#wikimedia-operations) [2025-11-04T00:25:04Z] <zabe@deploy2002> zabe: Backport for [[gerrit:1199522|Using Hadoop for MostTranscludedPages on enwiki (T309738)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-11-04T00:32:05Z] <zabe@deploy2002> Finished scap sync-world: Backport for [[gerrit:1199522|Using Hadoop for MostTranscludedPages on enwiki (T309738)]] (duration: 09m 05s)

Change #1199521 merged by Joal:

[analytics/refinery@master] Introduce MostCategories.hql

https://gerrit.wikimedia.org/r/1199521

Getting organized here, the pieces we need are these (a client sketch follows the list):

  • An AQS 2.0 endpoint serves responses at /statistics/{wiki project}/{query}/{date as yyyy-mm} (example endpoint here, developer docs pending)
  • One or more Cassandra tables hold the data, depending on whether the outputs of all these queries share the same shape or can be comfortably squeezed into one
  • Airflow runs each query and loads the output into Cassandra
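
Here is the client sketch referenced above. Only the path template comes from this task; the host, response shape, and function name are assumptions:

// Hypothetical client for the AQS 2.0 endpoint shape described above.
// Only the path template /statistics/{wiki project}/{query}/{yyyy-mm}
// comes from this task; the host below is a placeholder.
function fetchQueryPageStats( string $project, string $query, string $month ): array {
    $base = 'https://aqs.example.org'; // placeholder host
    $url = $base . '/statistics/'
        . rawurlencode( $project ) . '/'
        . rawurlencode( $query ) . '/'
        . rawurlencode( $month );
    $json = file_get_contents( $url );
    if ( $json === false ) {
        throw new RuntimeException( "Request failed: {$url}" );
    }
    $data = json_decode( $json, true );
    return is_array( $data ) ? $data : [];
}

// Example: October 2025 results for Most Linked Pages on English Wikipedia.
$rows = fetchQueryPageStats( 'en.wikipedia', 'MostLinkedPages', '2025-10' );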

Checklist of queries we want to port:

I'm going to put new SQL files that implement this logic here for now: https://gitlab.wikimedia.org/milimetric/sql

In addition, we want to port the following queries due to T398709: FY2025-26 WE 6.4.1: Move links tables of commons to a dedicated cluster.

Change #1224958 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/core@master] Fix createMock not using class names via …::class

https://gerrit.wikimedia.org/r/1224958

Change #1224958 merged by jenkins-bot:

[mediawiki/core@master] Fix createMock not using class names via …::class

https://gerrit.wikimedia.org/r/1224958

Change #1226348 had a related patch set uploaded (by Zabe; author: Zabe):

[analytics/refinery@master] MostCategories: Fix copy/paste mistake in documentation

https://gerrit.wikimedia.org/r/1226348

Hi!

Just making sure you are aware:

analytics.wikimedia.org is hosted on a single bare metal server. It is a SPOF, and is meant for ad-hoc data publishing.

There are no SLOs or operational commitments to keep analytics.wikimedia.org up and running.

Reading the comments, I can see why analytics.wikimedia.org was chosen for data transfer here. WMF has many use cases like this, but no platform support for it.

FWIW, this falls squarely within the ongoing 'Derived Data' initiatives. I hope we can work together to influence and prioritize platform support for stuff like this.

This problem is also described at https://wikitech.wikimedia.org/wiki/MediaWiki_Externalized_Data_Problem

For many products, the required shape of data is highly specific. This leads to the creation of brittle and bespoke data pipelines.

One big reason analytics.wikimedia.org was chosen is that this data is also not mission critical (it gets updated monthly, so live data is not really a concern), and even if it fails for a month or so, it's still not a big deal. I'd argue we shouldn't provide HA and spend resources on something that doesn't need it.

> I'd argue we shouldn't provide HA and spend resources on something that doesn't need it.

I'm not so worried about HA if expectations like this are clear. What I'm more interested in is how to make this kind of thing easier via a more 'paved path'.

Change #983961 abandoned by Zabe:

[mediawiki/core@master] [WIP] Querypage: Migrate from direct db query to hadoop-backed json res

https://gerrit.wikimedia.org/r/983961