See parent
Description
Details
| Title | Reference | Author | Source Branch | Dest Branch |
|---|---|---|---|---|
| querypage: mostcategories: Pass nsm table | repos/data-engineering/airflow-dags!2228 | zabe | T413362 | main |
| mostcategories: Pass page table, not linktarget table | repos/data-engineering/airflow-dags!1896 | zabe | T413362 | main |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T343131 Commons database is growing way too fast |
| Open | | Ladsgroup | T398709 FY2025-26 WE 6.4.1: Move links tables of commons to a dedicated cluster |
| Open | | Zabe | T309738 Move MediaWiki QueryPages computation to Hadoop |
| Open | | Zabe | T413362 Move Mostcategories computation to Hadoop |
Event Timeline
zabe opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1896
mostcategories: Pass page table, not linktarget table
joal merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1896
mostcategories: Pass page table, not linktarget table
Change #1226348 had a related patch set uploaded (by Zabe; author: Zabe):
[analytics/refinery@master] MostCategories: Fix copy/paste mistake in documentation
Change #1226348 merged by Joal:
[analytics/refinery@master] MostCategories: Fix copy/paste mistake in documentation
Broken DAG: [/opt/airflow/dags/airflow_dags/platform_eng/dags/querypage/querypage_most_categories_monthly_dag.py]
Traceback (most recent call last):
File "/opt/airflow/dags/airflow_dags/wmf_airflow_common/config/dag_properties.py", line 118, in __init__
self._override_fields(variable_body)
File "/opt/airflow/dags/airflow_dags/wmf_airflow_common/config/dag_properties.py", line 179, in _override_fields
raise KeyError(f"Property {prop_name} is not overridable.")
KeyError: 'Property hive_linktarget_table is not overridable.'
@JAllemandou Hey, I am a bit confused regarding this error ^ since the DAG no longer references linktarget. Could you maybe take a look? The error appears at airflow-platform-eng.
Hi! Smells like T348963: DagProperties don't automatically update Airflow variables.
I suspect the hive_linktarget_table property is defined in the Airflow UI. Check somewhere in Admin -> Variables.
I can't double-check myself, as I don't have admin rights on the platform-eng instance. I think Andrew is right in his diagnosis: it looks like a stale variable problem. Deleting the existing config variable for the DAG should fix it (the variable is automatically recreated from the values in code when it does not exist).
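For readers following along, here is a minimal sketch of why a leftover key in the stored variable breaks DAG parsing. It is not the actual `dag_properties.py` code: `ExampleDagProperties`, `hive_page_table`, `stale_override_keys` and `apply_overrides` are hypothetical names, and only the error message is taken from the traceback above.

```python
from dataclasses import dataclass, fields


@dataclass
class ExampleDagProperties:
    """Hypothetical stand-in for the properties a DAG declares in code."""
    hive_categorylinks_table: str = "wmf_raw.mediawiki_categorylinks"
    hive_page_table: str = "wmf_raw.mediawiki_page"  # assumed name, for illustration


def stale_override_keys(props, variable_body):
    """Return keys stored in the variable that the code no longer declares."""
    declared = {f.name for f in fields(props)}
    return sorted(key for key in variable_body if key not in declared)


def apply_overrides(props, variable_body):
    """Apply overrides, rejecting unknown keys with the message seen in the traceback."""
    declared = {f.name for f in fields(props)}
    for prop_name, value in variable_body.items():
        if prop_name not in declared:
            raise KeyError(f"Property {prop_name} is not overridable.")
        setattr(props, prop_name, value)
    return props


# A stored variable that still carries a key removed from the code.
stored_variable = {
    "hive_categorylinks_table": "wmf_raw.mediawiki_categorylinks",
    "hive_linktarget_table": "wmf_raw.mediawiki_private_linktarget",
}

print(stale_override_keys(ExampleDagProperties(), stored_variable))
# ['hive_linktarget_table']
```

Under these assumptions, removing the listed keys from the variable (or deleting the variable entirely) would let the overrides apply cleanly again.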
I think I have admin rights and I think I deleted the record. I keep a copy here in case I break things:
platform_eng/dags/querypage/querypage_most_categories_monthly_dag.py
{
"wikis_to_run": [
"testwiki",
"frwiki"
],
"hive_querycache_table": "wmf.querycache",
"hive_categorylinks_table": "wmf_raw.mediawiki_categorylinks",
"hive_linktarget_table": "wmf_raw.mediawiki_private_linktarget",
"wmf_raw_tables_path": "hdfs://analytics-hadoop/wmf/data/raw/mediawiki/tables",
"wmf_raw_private_tables_path": "hdfs://analytics-hadoop/wmf/data/raw/mediawiki_private/tables",
"hdfs_destination_dir": "hdfs://analytics-hadoop/wmf/data/published/datasets/querypage/MostCategories/",
"temporary_directory": "hdfs://analytics-hadoop//tmp/platform_eng/querypage/MostCategories/",
"monthly_dag_start_date": "2025-12-01T00:00:00",
"dag_hql": "hdfs://analytics-hadoop/wmf/refinery/current/hql/querypage/MostCategories.hql",
"dag_sla": "P10D",
"alerts_email": "platform-eng-alerts@wikimedia.org"
}
This Variable can be used to override the properties of the DAG located at:
platform_eng/dags/querypage/querypage_most_categories_monthly_dag.py
To check that your DAG got parsed and updated with the overrides, see:
Details >> DagModel debug information >> last_parsed_time
To reset a property to its original value, remove it from the json blob.
Change #1237596 had a related patch set uploaded (by Zabe; author: Zabe):
[operations/mediawiki-config@master] Configure Hadoop source for Mostcategories computations
Change #1237597 had a related patch set uploaded (by Zabe; author: Zabe):
[operations/mediawiki-config@master] Use Hadoop for Mostcategories on testwiki
Change #1237596 merged by jenkins-bot:
[operations/mediawiki-config@master] Configure Hadoop source for Mostcategories computations
Mentioned in SAL (#wikimedia-operations) [2026-02-09T20:12:44Z] <zabe@deploy2002> Started scap sync-world: Backport for [[gerrit:1237596|Configure Hadoop source for Mostcategories computations (T413362)]]
Mentioned in SAL (#wikimedia-operations) [2026-02-09T20:14:36Z] <zabe@deploy2002> zabe: Backport for [[gerrit:1237596|Configure Hadoop source for Mostcategories computations (T413362)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
Mentioned in SAL (#wikimedia-operations) [2026-02-09T20:19:35Z] <zabe@deploy2002> Finished scap sync-world: Backport for [[gerrit:1237596|Configure Hadoop source for Mostcategories computations (T413362)]] (duration: 06m 51s)
Change #1237597 merged by jenkins-bot:
[operations/mediawiki-config@master] Use Hadoop for Mostcategories on testwiki
Mentioned in SAL (#wikimedia-operations) [2026-02-09T20:34:30Z] <zabe@deploy2002> Started scap sync-world: Backport for [[gerrit:1237597|Use Hadoop for Mostcategories on testwiki (T413362)]]
Mentioned in SAL (#wikimedia-operations) [2026-02-09T20:36:25Z] <zabe@deploy2002> zabe: Backport for [[gerrit:1237597|Use Hadoop for Mostcategories on testwiki (T413362)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
Mentioned in SAL (#wikimedia-operations) [2026-02-09T20:40:52Z] <zabe@deploy2002> Finished scap sync-world: Backport for [[gerrit:1237597|Use Hadoop for Mostcategories on testwiki (T413362)]] (duration: 06m 23s)
Change #1238025 had a related patch set uploaded (by Zabe; author: Zabe):
[operations/mediawiki-config@master] Reenable MostCategories on frwiki
zabe opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1995
Run MostCategories/MostTranscludedTemplates DAGs on commons and enwiki
Change #1238025 merged by jenkins-bot:
[operations/mediawiki-config@master] Reenable MostCategories on frwiki
Mentioned in SAL (#wikimedia-operations) [2026-02-10T00:08:30Z] <zabe@deploy2002> Started scap sync-world: Backport for [[gerrit:1224794|Add config variable for MultiTitle (T404461)]], [[gerrit:1238025|Reenable MostCategories on frwiki (T413362)]]
Mentioned in SAL (#wikimedia-operations) [2026-02-10T00:12:33Z] <zabe@deploy2002> tbodt, zabe: Backport for [[gerrit:1224794|Add config variable for MultiTitle (T404461)]], [[gerrit:1238025|Reenable MostCategories on frwiki (T413362)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
Mentioned in SAL (#wikimedia-operations) [2026-02-10T00:20:11Z] <zabe@deploy2002> Finished scap sync-world: Backport for [[gerrit:1224794|Add config variable for MultiTitle (T404461)]], [[gerrit:1238025|Reenable MostCategories on frwiki (T413362)]] (duration: 11m 41s)
milimetric merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1995
Run MostCategories/MostTranscludedTemplates DAGs on commons and enwiki
Change #1248909 had a related patch set uploaded (by Zabe; author: Zabe):
[operations/mediawiki-config@master] Use Hadoop for Mostcategories on commonswiki
Comparing https://analytics.wikimedia.org/published/datasets/querypage/MostCategories/commonswiki.json with https://commons.wikimedia.org/wiki/Special:MostCategories, it seems like the Hadoop implementation does not include files.
Ok, the difference is that the MediaWiki implementation filters for pages in $wgContentNamespaces, which includes files on Commons (see here), while the Hadoop implementation currently only filters for NS_MAIN.
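A small sketch of the difference described above, assuming made-up page rows and an illustrative content-namespace set of NS_MAIN plus NS_FILE (the real Commons configuration may list more namespaces). The `most_categories` helper is hypothetical and only mimics the shape of the query, not the HQL itself.

```python
NS_MAIN = 0  # MediaWiki main (article) namespace
NS_FILE = 6  # MediaWiki file namespace

# Made-up rows: one article, one file, one talk page.
pages = [
    {"title": "Example_article", "namespace": NS_MAIN, "categories": 12},
    {"title": "Example_image.jpg", "namespace": NS_FILE, "categories": 40},
    {"title": "Example_talk", "namespace": 1, "categories": 3},
]


def most_categories(rows, content_namespaces, min_categories=2):
    """Keep pages in the given namespaces with at least min_categories, most first."""
    kept = [
        row for row in rows
        if row["namespace"] in content_namespaces
        and row["categories"] >= min_categories
    ]
    return sorted(kept, key=lambda row: row["categories"], reverse=True)


# Filtering on NS_MAIN only drops the file page entirely.
main_only = most_categories(pages, {NS_MAIN})
# Filtering on an assumed content-namespace set keeps it, and it ranks first.
content = most_categories(pages, {NS_MAIN, NS_FILE})

print([row["title"] for row in main_only])
print([row["title"] for row in content])
```

On a wiki where files carry most of the categories, the two filters would produce very different top lists, which matches the mismatch observed between the JSON output and the special page.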
Change #1267966 had a related patch set uploaded (by Zabe; author: Zabe):
[analytics/refinery@master] querypage: mostcategories: Include NS_FILE if running on commons
Change #1267966 merged by Xcollazo:
[analytics/refinery@master] querypage: MostCategories: Include all content namespaces
@xcollazo the DAG failed. Could you tell me what the underlying error is?
[2026-05-04, 18:44:58 UTC] {taskinstance.py:3337} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/app/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 776, in _execute_task
result = _execute_callable(context=context, **execute_callable_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 742, in _execute_callable
return ExecutionCallableRunner(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/.local/lib/python3.11/site-packages/airflow/utils/operator_helpers.py", line 252, in run
return self.func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/.local/lib/python3.11/site-packages/airflow/models/baseoperator.py", line 425, in wrapper
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/.local/lib/python3.11/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 174, in execute
self._hook.submit(self.application)
File "/opt/airflow/dags/airflow_dags/wmf_airflow_common/hooks/spark.py", line 494, in submit
return self._skein_hook.submit()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/airflow/dags/airflow_dags/wmf_airflow_common/hooks/skein.py", line 314, in submit
raise AirflowException(
airflow.exceptions.AirflowException: SkeinHook Airflow SparkSkeinSubmitHook skein launcher querypage_most_categories_monthly__compute_commonswiki__20260401 application_1773845446826_922512: expected final status was 'SUCCEEDED', but got 'FAILED' instead
DAG link: https://airflow-platform-eng.wikimedia.org/dags/querypage_most_categories_monthly/grid
Specific task referenced link.
In the logs you quote, we can find the following:
[2026-05-04, 18:44:57 UTC] {skein.py:296} INFO - SkeinHook Airflow SparkSkeinSubmitHook skein launcher querypage_most_categories_monthly__compute_commonswiki__20260401 application_1773845446826_922512 - YARN application log collection is disabled. To view logs for the YARN App Master, run the following command:
See also https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes#Use_the_yarn_CLI
yarn logs -appOwner analytics-platform-eng -applicationId application_1773845446826_922512
If your App Master launched other YARN applications (e.g. a Spark app), you will need to look at these logs and run a similar command but with the appropriate YARN application_id.
Thus we want to run yarn logs -appOwner analytics-platform-eng -applicationId application_1773845446826_922512 following instructions from https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes#Use_the_yarn_CLI:
$ ssh deployment.eqiad.wmnet
...
xcollazo@deploy1003:~$ kube_env airflow-platform-eng-deploy dse-k8s-eqiad
xcollazo@deploy1003:~$ kubectl exec -it $(kubectl get pod -l app=airflow,component=hadoop-shell --no-headers -o custom-columns=":metadata.name") -- bash
runuser@airflow-hadoop-shell-6cd879f7b6-7k642:/opt/airflow$ yarn logs -appOwner analytics-platform-eng -applicationId application_1773845446826_922512
...
Container: container_e147_1773845446826_922512_01_000001 on an-worker1217.eqiad.wmnet_8041_1777920293939
LogAggregationType: AGGREGATED
========================================================================================================
LogType:application.driver.log
LogLastModifiedTime:Mon May 04 18:44:53 +0000 2026
LogLength:3460
LogContents:
Running /opt/conda-analytics/bin/spark-submit $@
SPARK_HOME: /usr/lib/spark3
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/05/04 18:44:47 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13000. Attempting port 13001.
ADD JAR file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
26/05/04 18:44:49 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Added [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar] to class path
Added resources: [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar]
Spark master: yarn, Application Id: application_1773845446826_922520
key value
snapshot CONCAT(LPAD('2026', 4, '0'), '-', LPAD('4', 2, '0'))
Time taken: 1.221 seconds, Fetched 1 row(s)
Error in query: Table or view not found: nsm; line 13 pos 10;
'InsertIntoDir false, Storage(Location: hdfs://analytics-hadoop/tmp/platform_eng/querypage/MostCategories/commonswiki), text, true
+- 'Repartition 1, false
   +- 'Project ['to_json('collect_list('named_struct(qc_type, 'qc_type, qc_namespace, 'qc_namespace, qc_title, 'qc_title, qc_value, 'qc_value, qc_wiki, 'qc_wiki, qc_snapshot, 'qc_snapshot))) AS most_categories#10]
      +- 'SubqueryAlias output
         +- 'GlobalLimit 5000
            +- 'LocalLimit 5000
               +- 'Sort ['qc_value DESC NULLS LAST], true
                  +- 'UnresolvedHaving ('qc_value > 1)
                     +- 'Aggregate ['qc_namespace, 'qc_title], [Mostcategories AS qc_type#11, 'p.page_namespace AS qc_namespace#12, 'p.page_title AS qc_title#13, count(1) AS qc_value#14L, commonswiki AS qc_wiki#15, concat(lpad(2026, 4, 0), -, lpad(4, 2, 0)) AS qc_snapshot#16]
                        +- 'Filter (((('cls.snapshot = concat(lpad(2026, 4, 0), -, lpad(4, 2, 0))) AND ('cls.wiki_db = commonswiki)) AND ('p.snapshot = concat(lpad(2026, 4, 0), -, lpad(4, 2, 0)))) AND (('p.wiki_db = commonswiki) AND ('nsm.namespace_is_content = 1)))
                           +- 'Join Inner, ((('p.wiki_db = 'nsm.dbname) AND ('p.page_namespace = 'nsm.namespace)) AND ('nsm.snapshot = concat(lpad(2026, 4, 0), -, lpad(4, 2, 0))))
                              :- Join LeftOuter, (cl_from#18L = page_id#29L)
                              :  :- SubqueryAlias cls
                              :  :  +- SubqueryAlias spark_catalog.wmf_raw.mediawiki_categorylinks
                              :  :     +- HiveTableRelation [`wmf_raw`.`mediawiki_categorylinks`, org.apache.hadoop.hive.serde2.avro.AvroSerDe, Data Cols: [cl_from#18L, cl_to#19, cl_sortkey#20, cl_timestamp#21, cl_sortkey_prefix#22, cl_collation#23, cl..., Partition Cols: [snapshot#27, wiki_db#28]]
                              :  +- SubqueryAlias p
                              :     +- SubqueryAlias spark_catalog.wmf_raw.mediawiki_page
                              :        +- HiveTableRelation [`wmf_raw`.`mediawiki_page`, org.apache.hadoop.hive.serde2.avro.AvroSerDe, Data Cols: [page_id#29L, page_namespace#30, page_title#31, page_is_redirect#32, page_is_new#33, page_random#..., Partition Cols: [snapshot#40, wiki_db#41]]
                              +- 'UnresolvedRelation [nsm], [], false
...
(Let me know if you don't have access to deployment.eqiad.wmnet or to the k8s bits and we can bring in the right folks for that.)
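When the yarn logs output is long, a small filter can help surface the relevant lines. The following is only a sketch: `find_query_errors` and `missing_tables` are hypothetical helpers, the sample text is a few lines copied from the log above, and the regular expression assumes the error wording stays exactly "Table or view not found: <name>".

```python
import re


def find_query_errors(log_text):
    """Return the lines of a driver log that start with 'Error in query:'."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if line.lstrip().startswith("Error in query:")
    ]


def missing_tables(log_text):
    """Return the table or view names reported as not found."""
    return re.findall(r"Table or view not found: ([\w.]+)", log_text)


# A few lines copied from the driver log quoted above.
sample_log = """\
Spark master: yarn, Application Id: application_1773845446826_922520
Time taken: 1.221 seconds, Fetched 1 row(s)
Error in query: Table or view not found: nsm; line 13 pos 10;
+- 'UnresolvedRelation [nsm], [], false
"""

print(find_query_errors(sample_log))
print(missing_tables(sample_log))
```

In practice the text would come from saving the `yarn logs` output to a file and reading it in, rather than from an inline string.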
zabe opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2228
querypage: mostcategories: Pass nsm table
zabe merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2228
querypage: mostcategories: Pass nsm table
zabe opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2269
querypage: Add DAG for SpecialWantedCategories