
Move Mostcategories computation to Hadoop
Open, Medium, Public

Description

See parent

Event Timeline

Marostegui triaged this task as Medium priority.Dec 23 2025, 6:25 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Change #1226348 had a related patch set uploaded (by Zabe; author: Zabe):

[analytics/refinery@master] MostCategories: Fix copy/paste mistake in documentation

https://gerrit.wikimedia.org/r/1226348

Change #1226348 merged by Joal:

[analytics/refinery@master] MostCategories: Fix copy/paste mistake in documentation

https://gerrit.wikimedia.org/r/1226348

Broken DAG: [/opt/airflow/dags/airflow_dags/platform_eng/dags/querypage/querypage_most_categories_monthly_dag.py]
Traceback (most recent call last):
  File "/opt/airflow/dags/airflow_dags/wmf_airflow_common/config/dag_properties.py", line 118, in __init__
    self._override_fields(variable_body)
  File "/opt/airflow/dags/airflow_dags/wmf_airflow_common/config/dag_properties.py", line 179, in _override_fields
    raise KeyError(f"Property {prop_name} is not overridable.")
KeyError: 'Property hive_linktarget_table is not overridable.'
Zabe added a subscriber: JAllemandou.

@JAllemandou Hey, I am a bit confused by this error ^ since the DAG no longer references linktarget. Could you maybe take a look? The error appears on airflow-platform-eng.

Hi! Smells like T348963: DagProperties don't automatically update Airflow variables.

I suspect the hive_linktarget_table property is defined in the Airflow UI. Check somewhere in Admin -> Variables.

I can't double-check myself, as I don't have admin rights on the platform-eng instance. I think Andrew is right in his diagnosis: it looks like a stale variable problem. Deleting the existing config variable for the DAG should fix it (the variable is automatically recreated from the values in code when it does not exist).
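To illustrate the failure mode, here is a minimal sketch of how a stale override Variable can break DAG parsing. It is not the actual wmf_airflow_common code: the class `DagPropertiesSketch`, its constructor arguments, and the helper `try_parse` are hypothetical, and only the error message format is taken from the traceback above.

```python
class DagPropertiesSketch:
    """Hypothetical stand-in for a DagProperties-like config holder."""

    def __init__(self, defaults, variable_body=None):
        # Defaults come from the DAG code; they define what is overridable.
        self._props = dict(defaults)
        # variable_body plays the role of the JSON stored in the Airflow Variable.
        self._override_fields(variable_body or {})

    def _override_fields(self, variable_body):
        for prop_name, value in variable_body.items():
            if prop_name not in self._props:
                # Same message shape as in the traceback above.
                raise KeyError(f"Property {prop_name} is not overridable.")
            self._props[prop_name] = value

    def __getitem__(self, name):
        return self._props[name]


# Defaults after the DAG stopped referencing linktarget (illustrative subset).
code_defaults = {
    "hive_querycache_table": "wmf.querycache",
    "hive_categorylinks_table": "wmf_raw.mediawiki_categorylinks",
}

# The stored Variable still carries the removed key.
stale_variable = {
    "hive_categorylinks_table": "wmf_raw.mediawiki_categorylinks",
    "hive_linktarget_table": "wmf_raw.mediawiki_private_linktarget",
}


def try_parse(variable_body):
    """Return None if parsing succeeds, otherwise the KeyError message."""
    try:
        DagPropertiesSketch(code_defaults, variable_body)
        return None
    except KeyError as err:
        return err.args[0]
```

Under these assumptions, `try_parse(stale_variable)` reports the same "not overridable" message, while an empty or missing Variable parses cleanly, which is consistent with deleting the Variable being a fix.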

I think I have admin rights and I think I deleted the record. I keep a copy here in case I break things:

platform_eng/dags/querypage/querypage_most_categories_monthly_dag.py
{
  "wikis_to_run": [
    "testwiki",
    "frwiki"
  ],
  "hive_querycache_table": "wmf.querycache",
  "hive_categorylinks_table": "wmf_raw.mediawiki_categorylinks",
  "hive_linktarget_table": "wmf_raw.mediawiki_private_linktarget",
  "wmf_raw_tables_path": "hdfs://analytics-hadoop/wmf/data/raw/mediawiki/tables",
  "wmf_raw_private_tables_path": "hdfs://analytics-hadoop/wmf/data/raw/mediawiki_private/tables",
  "hdfs_destination_dir": "hdfs://analytics-hadoop/wmf/data/published/datasets/querypage/MostCategories/",
  "temporary_directory": "hdfs://analytics-hadoop//tmp/platform_eng/querypage/MostCategories/",
  "monthly_dag_start_date": "2025-12-01T00:00:00",
  "dag_hql": "hdfs://analytics-hadoop/wmf/refinery/current/hql/querypage/MostCategories.hql",
  "dag_sla": "P10D",
  "alerts_email": "platform-eng-alerts@wikimedia.org"
}

        This Variable can be used to override the properties of the DAG located at:
        platform_eng/dags/querypage/querypage_most_categories_monthly_dag.py
        To check that your DAG got parsed and updated with the overrides, see:
        Details >> DagModel debug information >> last_parsed_time
        To reset a property to its original value, remove it from the json blob.
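Before restoring any part of a saved copy like the one above, it may help to check which keys the current DAG code no longer defines. A small sketch follows; `split_overrides` is a hypothetical helper, the JSON is a shortened excerpt of the saved blob, and `current_property_names` is an assumption standing in for whatever the DAG code defines today (here, the excerpt's keys minus hive_linktarget_table, per the discussion above).

```python
import json


def split_overrides(saved_json, current_property_names):
    """Split a saved Variable blob into still-valid and stale overrides."""
    saved = json.loads(saved_json)
    valid = {k: v for k, v in saved.items() if k in current_property_names}
    stale = sorted(k for k in saved if k not in current_property_names)
    return valid, stale


# Shortened excerpt of the saved Variable body.
saved_json = """
{
  "wikis_to_run": ["testwiki", "frwiki"],
  "hive_querycache_table": "wmf.querycache",
  "hive_linktarget_table": "wmf_raw.mediawiki_private_linktarget",
  "dag_sla": "P10D"
}
"""

# Assumption: the property names the DAG code defines now.
current_property_names = {"wikis_to_run", "hive_querycache_table", "dag_sla"}

valid, stale = split_overrides(saved_json, current_property_names)
```

Only the `valid` part would be safe to paste back into Admin -> Variables; anything listed in `stale` would trigger the "not overridable" error again.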

Change #1237596 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/mediawiki-config@master] Configure Hadoop source for Mostcategories computations

https://gerrit.wikimedia.org/r/1237596

Change #1237597 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/mediawiki-config@master] Use Hadoop for Mostcategories on testwiki

https://gerrit.wikimedia.org/r/1237597

Change #1237596 merged by jenkins-bot:

[operations/mediawiki-config@master] Configure Hadoop source for Mostcategories computations

https://gerrit.wikimedia.org/r/1237596

Mentioned in SAL (#wikimedia-operations) [2026-02-09T20:12:44Z] <zabe@deploy2002> Started scap sync-world: Backport for [[gerrit:1237596|Configure Hadoop source for Mostcategories computations (T413362)]]

Mentioned in SAL (#wikimedia-operations) [2026-02-09T20:14:36Z] <zabe@deploy2002> zabe: Backport for [[gerrit:1237596|Configure Hadoop source for Mostcategories computations (T413362)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-02-09T20:19:35Z] <zabe@deploy2002> Finished scap sync-world: Backport for [[gerrit:1237596|Configure Hadoop source for Mostcategories computations (T413362)]] (duration: 06m 51s)

Change #1237597 merged by jenkins-bot:

[operations/mediawiki-config@master] Use Hadoop for Mostcategories on testwiki

https://gerrit.wikimedia.org/r/1237597

Mentioned in SAL (#wikimedia-operations) [2026-02-09T20:34:30Z] <zabe@deploy2002> Started scap sync-world: Backport for [[gerrit:1237597|Use Hadoop for Mostcategories on testwiki (T413362)]]

Mentioned in SAL (#wikimedia-operations) [2026-02-09T20:36:25Z] <zabe@deploy2002> zabe: Backport for [[gerrit:1237597|Use Hadoop for Mostcategories on testwiki (T413362)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-02-09T20:40:52Z] <zabe@deploy2002> Finished scap sync-world: Backport for [[gerrit:1237597|Use Hadoop for Mostcategories on testwiki (T413362)]] (duration: 06m 23s)

Change #1238025 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/mediawiki-config@master] Reenable MostCategories on frwiki

https://gerrit.wikimedia.org/r/1238025

Change #1238025 merged by jenkins-bot:

[operations/mediawiki-config@master] Reenable MostCategories on frwiki

https://gerrit.wikimedia.org/r/1238025

Mentioned in SAL (#wikimedia-operations) [2026-02-10T00:08:30Z] <zabe@deploy2002> Started scap sync-world: Backport for [[gerrit:1224794|Add config variable for MultiTitle (T404461)]], [[gerrit:1238025|Reenable MostCategories on frwiki (T413362)]]

Mentioned in SAL (#wikimedia-operations) [2026-02-10T00:12:33Z] <zabe@deploy2002> tbodt, zabe: Backport for [[gerrit:1224794|Add config variable for MultiTitle (T404461)]], [[gerrit:1238025|Reenable MostCategories on frwiki (T413362)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-02-10T00:20:11Z] <zabe@deploy2002> Finished scap sync-world: Backport for [[gerrit:1224794|Add config variable for MultiTitle (T404461)]], [[gerrit:1238025|Reenable MostCategories on frwiki (T413362)]] (duration: 11m 41s)

Change #1248909 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/mediawiki-config@master] Use Hadoop for Mostcategories on commonswiki

https://gerrit.wikimedia.org/r/1248909

OK, the difference is that the MediaWiki implementation filters for pages in $wgContentNamespaces, which includes files on Commons (see here), while the Hadoop implementation currently filters only for NS_MAIN.
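A toy sketch of why the two filters diverge on Commons. The data and the function `most_categories` are made up for illustration and this is not the HQL; the namespace ids (NS_MAIN = 0, NS_FILE = 6) are the standard MediaWiki values, and the minimum count of 2 mirrors the `qc_value > 1` condition of the query.

```python
from collections import Counter

NS_MAIN = 0  # MediaWiki main (article) namespace
NS_FILE = 6  # MediaWiki file namespace


def most_categories(category_links, pages, content_namespaces, min_count=2):
    """Count categories per page, keeping only pages in content namespaces.

    category_links: list of page ids, one entry per category membership.
    pages: dict mapping page id to (namespace, title).
    """
    counts = Counter(
        page_id
        for page_id in category_links
        if pages[page_id][0] in content_namespaces
    )
    ranked = [(pages[pid][1], n) for pid, n in counts.items() if n >= min_count]
    # Highest count first, title as a tie-breaker.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Toy data: one article with 2 categories, one file with 3.
pages = {1: (NS_MAIN, "Main_Page"), 2: (NS_FILE, "Example.jpg")}
category_links = [1, 1, 2, 2, 2]

main_only = most_categories(category_links, pages, {NS_MAIN})
with_files = most_categories(category_links, pages, {NS_MAIN, NS_FILE})
```

With the NS_MAIN-only filter the file page never appears, even though it has the most categories; on a wiki where files are content pages, that changes the whole top of the list.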

Change #1267966 had a related patch set uploaded (by Zabe; author: Zabe):

[analytics/refinery@master] querypage: mostcategories: Include NS_FILE if running on commons

https://gerrit.wikimedia.org/r/1267966

Ping @Ahoelzl on this. There are patches to review that the team doesn't know about.

Change #1267966 merged by Xcollazo:

[analytics/refinery@master] querypage: MostCategories: Include all content namespaces

https://gerrit.wikimedia.org/r/1267966

@xcollazo the DAG failed. Could you tell me what the underlying error is?

[2026-05-04, 18:44:58 UTC] {taskinstance.py:3337} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/app/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 776, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/app/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 742, in _execute_callable
    return ExecutionCallableRunner(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/app/.local/lib/python3.11/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/app/.local/lib/python3.11/site-packages/airflow/models/baseoperator.py", line 425, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/app/.local/lib/python3.11/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 174, in execute
    self._hook.submit(self.application)
  File "/opt/airflow/dags/airflow_dags/wmf_airflow_common/hooks/spark.py", line 494, in submit
    return self._skein_hook.submit()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/airflow/dags/airflow_dags/wmf_airflow_common/hooks/skein.py", line 314, in submit
    raise AirflowException(
airflow.exceptions.AirflowException: SkeinHook Airflow SparkSkeinSubmitHook skein launcher querypage_most_categories_monthly__compute_commonswiki__20260401 application_1773845446826_922512: expected final status was 'SUCCEEDED', but got 'FAILED' instead

@xcollazo the DAG failed. Could you tell me what the underlying error is?

DAG link: https://airflow-platform-eng.wikimedia.org/dags/querypage_most_categories_monthly/grid

Specific task referenced link.

In the logs you quote, we can find the following:

[2026-05-04, 18:44:57 UTC] {skein.py:296} INFO - SkeinHook Airflow SparkSkeinSubmitHook skein launcher querypage_most_categories_monthly__compute_commonswiki__20260401 application_1773845446826_922512 - YARN application log collection is disabled. To view logs for the YARN App Master, run the following command:
	See also https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes#Use_the_yarn_CLI
	yarn logs -appOwner analytics-platform-eng -applicationId application_1773845446826_922512
If your App Master launched other YARN applications (e.g. a Spark app), you will need to look at these logs and run a similar command but with the appropriate YARN application_id.

Thus we want to run yarn logs -appOwner analytics-platform-eng -applicationId application_1773845446826_922512 following instructions from https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes#Use_the_yarn_CLI:

$ ssh deployment.eqiad.wmnet

...

xcollazo@deploy1003:~$ kube_env airflow-platform-eng-deploy dse-k8s-eqiad
xcollazo@deploy1003:~$ kubectl exec -it $(kubectl get pod -l app=airflow,component=hadoop-shell --no-headers -o custom-columns=":metadata.name") -- bash

runuser@airflow-hadoop-shell-6cd879f7b6-7k642:/opt/airflow$ yarn logs -appOwner analytics-platform-eng -applicationId application_1773845446826_922512

...

Container: container_e147_1773845446826_922512_01_000001 on an-worker1217.eqiad.wmnet_8041_1777920293939
LogAggregationType: AGGREGATED
========================================================================================================
LogType:application.driver.log
LogLastModifiedTime:Mon May 04 18:44:53 +0000 2026
LogLength:3460
LogContents:
Running /opt/conda-analytics/bin/spark-submit $@
SPARK_HOME: /usr/lib/spark3
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/05/04 18:44:47 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13000. Attempting port 13001.
ADD JAR file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
26/05/04 18:44:49 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Added [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar] to class path
Added resources: [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar]
Spark master: yarn, Application Id: application_1773845446826_922520
key	value
snapshot	CONCAT(LPAD('2026', 4, '0'), '-', LPAD('4', 2, '0'))
Time taken: 1.221 seconds, Fetched 1 row(s)
Error in query: Table or view not found: nsm; line 13 pos 10;
'InsertIntoDir false, Storage(Location: hdfs://analytics-hadoop/tmp/platform_eng/querypage/MostCategories/commonswiki), text, true
+- 'Repartition 1, false
   +- 'Project ['to_json('collect_list('named_struct(qc_type, 'qc_type, qc_namespace, 'qc_namespace, qc_title, 'qc_title, qc_value, 'qc_value, qc_wiki, 'qc_wiki, qc_snapshot, 'qc_snapshot))) AS most_categories#10]
      +- 'SubqueryAlias output
         +- 'GlobalLimit 5000
            +- 'LocalLimit 5000
               +- 'Sort ['qc_value DESC NULLS LAST], true
                  +- 'UnresolvedHaving ('qc_value > 1)
                     +- 'Aggregate ['qc_namespace, 'qc_title], [Mostcategories AS qc_type#11, 'p.page_namespace AS qc_namespace#12, 'p.page_title AS qc_title#13, count(1) AS qc_value#14L, commonswiki AS qc_wiki#15, concat(lpad(2026, 4, 0), -, lpad(4, 2, 0)) AS qc_snapshot#16]
                        +- 'Filter (((('cls.snapshot = concat(lpad(2026, 4, 0), -, lpad(4, 2, 0))) AND ('cls.wiki_db = commonswiki)) AND ('p.snapshot = concat(lpad(2026, 4, 0), -, lpad(4, 2, 0)))) AND (('p.wiki_db = commonswiki) AND ('nsm.namespace_is_content = 1)))
                           +- 'Join Inner, ((('p.wiki_db = 'nsm.dbname) AND ('p.page_namespace = 'nsm.namespace)) AND ('nsm.snapshot = concat(lpad(2026, 4, 0), -, lpad(4, 2, 0))))
                              :- Join LeftOuter, (cl_from#18L = page_id#29L)
                              :  :- SubqueryAlias cls
                              :  :  +- SubqueryAlias spark_catalog.wmf_raw.mediawiki_categorylinks
                              :  :     +- HiveTableRelation [`wmf_raw`.`mediawiki_categorylinks`, org.apache.hadoop.hive.serde2.avro.AvroSerDe, Data Cols: [cl_from#18L, cl_to#19, cl_sortkey#20, cl_timestamp#21, cl_sortkey_prefix#22, cl_collation#23, cl..., Partition Cols: [snapshot#27, wiki_db#28]]
                              :  +- SubqueryAlias p
                              :     +- SubqueryAlias spark_catalog.wmf_raw.mediawiki_page
                              :        +- HiveTableRelation [`wmf_raw`.`mediawiki_page`, org.apache.hadoop.hive.serde2.avro.AvroSerDe, Data Cols: [page_id#29L, page_namespace#30, page_title#31, page_is_redirect#32, page_is_new#33, page_random#..., Partition Cols: [snapshot#40, wiki_db#41]]
                              +- 'UnresolvedRelation [nsm], [], false

...

(Let me know if you don't have access to deployment.eqiad.wmnet or to the k8s bits and we can bring in the right folks for that.)
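When the `yarn logs` output is long, a small script can pull out the relevant bits. A hedged sketch, assuming the log text has already been saved to a string; `summarize_driver_log` is a hypothetical helper, and the sample lines are copied from the output above.

```python
import re

APP_ID_RE = re.compile(r"application_\d+_\d+")
ERROR_RE = re.compile(r"^Error in query: (.+)$", re.MULTILINE)


def summarize_driver_log(log_text, launcher_app_id):
    """Return (child YARN application ids, Spark SQL error messages)."""
    app_ids = set(APP_ID_RE.findall(log_text))
    # Anything other than the launcher itself was started by the App Master.
    children = sorted(app_ids - {launcher_app_id})
    errors = ERROR_RE.findall(log_text)
    return children, errors


sample = """\
Container: container_e147_1773845446826_922512_01_000001 on an-worker1217.eqiad.wmnet_8041_1777920293939
Spark master: yarn, Application Id: application_1773845446826_922520
Error in query: Table or view not found: nsm; line 13 pos 10;
"""

children, errors = summarize_driver_log(
    sample, "application_1773845446826_922512"
)
```

Here `children` holds the Spark application id to pass to a further `yarn logs` call if needed, and `errors` surfaces the actual failure: the query refers to `nsm`, which the plan shows as an unresolved relation, so the alias does not appear to be bound to any table or view.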