Fix CommonsCategoryGraphBuilder to reflect latest changes to categorylinks table
Closed, Resolved, Public

Description

Column categorylinks.cl_to was recently deprecated and is now set to NULL (see T397923), but org.wikimedia.analytics.refinery.job.CommonsCategoryGraphBuilder still depends on that column in its SQL query:

SELECT
    cl.cl_from,
    pg.page_id AS cl_to,
    cl.cl_type
FROM ${params.categorylinksTable} cl
    INNER JOIN ${params.pageTable} pg
    ON (cl.cl_to = pg.page_title)
WHERE
    cl.wiki_db = 'commonswiki' AND
    cl.snapshot = '${params.mediawikiSnapshot}' AND
    pg.wiki_db = 'commonswiki' AND
    pg.snapshot = '${params.mediawikiSnapshot}' AND
    pg.page_namespace = 14

This issue prevented us from delivering the Commons Impact Metrics job results for August 2025. Please fix the query in org.wikimedia.analytics.refinery.job.CommonsCategoryGraphBuilder and re-run the August 2025 job.
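
To illustrate the failure mode: an equality join on a NULL column matches nothing, so once cl_to is NULL the inner join above yields zero rows. The following is a minimal sketch using Python's built-in sqlite3 on toy tables, not the actual Spark job. The helper name `count_old_join` and the sample rows are hypothetical, and the wiki_db/snapshot filters are left out for brevity.

```python
import sqlite3


def count_old_join(cl_to_values):
    """Count rows produced by the old cl_to = page_title join (toy data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_type TEXT);
        CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
        INSERT INTO page VALUES (100, 14, 'Cats'), (101, 14, 'Dogs');
    """)
    # One categorylinks row per supplied cl_to value; cl_from is just 1, 2, ...
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?, 'file')",
        list(enumerate(cl_to_values, start=1)),
    )
    (count,) = conn.execute("""
        SELECT COUNT(*)
        FROM categorylinks cl
            INNER JOIN page pg
            ON (cl.cl_to = pg.page_title)
        WHERE pg.page_namespace = 14
    """).fetchone()
    conn.close()
    return count


if __name__ == "__main__":
    print(count_old_join(["Cats", "Dogs"]))  # cl_to still populated
    print(count_old_join([None, None]))      # cl_to set to NULL
```

With populated cl_to values every link resolves to a category page; with NULLs the comparison is never true, so the join returns no rows.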

Event Timeline

Change #1188914 had a related patch set uploaded (by Aleksandar Mastilovic; author: Aleksandar Mastilovic):

[analytics/refinery/source@master] Update CommonsCategoryGraphBuilder to avoid column cl_to

https://gerrit.wikimedia.org/r/1188914

The new SQL query is:

SELECT
    cl.cl_from,
    pg.page_id AS cl_to,
    cl.cl_type
FROM ${params.categorylinksTable} cl
    INNER JOIN ${params.linktargetTable} lt
    INNER JOIN ${params.pageTable} pg
    ON (cl.cl_target_id = lt.lt_id AND lt.lt_title = pg.page_title)
WHERE
    cl.wiki_db = 'commonswiki' AND
    cl.snapshot = '${params.mediawikiSnapshot}' AND
    lt.snapshot = '${params.mediawikiSnapshot}' AND
    pg.wiki_db = 'commonswiki' AND
    pg.snapshot = '${params.mediawikiSnapshot}' AND
    pg.page_namespace = 14
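
For readers who want to see the two-hop resolution in isolation, here is a rough sketch of the same idea on toy tables with Python's sqlite3: categorylinks.cl_target_id points at linktarget.lt_id, and lt_title is then matched against page_title. The table contents and the helper names `build_toy_db` and `resolve_category_links` are made up for illustration. The joins are written with one ON clause each, and the wiki_db/snapshot partition filters are omitted, so this is not a copy of the query in the patch.

```python
import sqlite3


def build_toy_db():
    """Create in-memory tables with a handful of hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE categorylinks (cl_from INTEGER, cl_target_id INTEGER, cl_type TEXT);
        CREATE TABLE linktarget (lt_id INTEGER, lt_namespace INTEGER, lt_title TEXT);
        CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
        -- Same title in the category namespace (14) and the main namespace (0).
        INSERT INTO page VALUES (100, 14, 'Cats'), (200, 0, 'Cats');
        INSERT INTO linktarget VALUES (7, 14, 'Cats');
        INSERT INTO categorylinks VALUES (1, 7, 'file'), (2, 7, 'subcat');
    """)
    return conn


def resolve_category_links(conn):
    """Resolve each category link to the page_id of its category page."""
    return conn.execute("""
        SELECT
            cl.cl_from,
            pg.page_id AS cl_to,
            cl.cl_type
        FROM categorylinks cl
            INNER JOIN linktarget lt ON (cl.cl_target_id = lt.lt_id)
            INNER JOIN page pg ON (lt.lt_title = pg.page_title)
        WHERE pg.page_namespace = 14
        ORDER BY cl.cl_from
    """).fetchall()


if __name__ == "__main__":
    for row in resolve_category_links(build_toy_db()):
        print(row)
```

In this toy data the main-namespace page that shares the title 'Cats' is excluded by the page_namespace = 14 filter, so each link resolves only to the category page.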

Manually running the query against the 2025-08 snapshot returns data that looks correct:

image.png (209×303 px, 17 KB)

Change #1188914 merged by jenkins-bot:

[analytics/refinery/source@master] Update CommonsCategoryGraphBuilder to avoid column cl_to

https://gerrit.wikimedia.org/r/1188914

Change #1189323 had a related patch set uploaded (by Aleksandar Mastilovic; author: Aleksandar Mastilovic):

[analytics/refinery/source@master] Additional constraints to the CommonsCategoryGraphBuilder query

https://gerrit.wikimedia.org/r/1189323

Change #1189323 merged by jenkins-bot:

[analytics/refinery/source@master] Additional constraints to the CommonsCategoryGraphBuilder query

https://gerrit.wikimedia.org/r/1189323

The 2025-08 backfill run of the DAG has completed successfully, and judging by the data sizes on HDFS, the output is in line with what we've seen in previous months. @GFontenelle_WMF if you have some basic validation checks to run on this data, now would be a good time. Thank you!

Change #1189588 had a related patch set uploaded (by Aleksandar Mastilovic; author: Aleksandar Mastilovic):

[analytics/refinery/source@master] Update changelog.md for v0.3.2

https://gerrit.wikimedia.org/r/1189588

@amastilovic: I've tested it and it looks like it's working normally now. Thanks so much!

Change #1189588 merged by Aleksandar Mastilovic:

[analytics/refinery/source@master] Update changelog.md for v0.3.2

https://gerrit.wikimedia.org/r/1189588