
[L] Prepare image suggestions for a new set of Wikipedias
Closed, Resolved (Public)

Description

The Growth team is planning to expand "add an image" to 10 new Wikipedias, as per T360059: Communication around scaling "add an image" to more Wikipedias.
We already generate suggestions for those wikis:

# Suggestions for the 10 target wikis in the 2024-04-01 snapshot
isu = spark.read.table('analytics_platform_eng.image_suggestions_suggestions').where('snapshot="2024-04-01" and wiki in ("ckbwiki", "frrwiki", "hywiki", "jvwiki", "kuwiki", "newiki", "pawiki", "simplewiki", "sqwiki", "skwiki")')
# Article-level suggestions have no section index; section-level ones do
alis = isu.where('section_index is null')
slis = isu.where('section_index is not null')
alis.groupBy('wiki').count().show(truncate=False)
slis.groupBy('wiki').count().show(truncate=False)

+----------+-------+
|wiki      |count  |
+----------+-------+
|hywiki    |708904 |
|simplewiki|3124112|
|sqwiki    |4162703|
|frrwiki   |23313  |
|ckbwiki   |661384 |
|jvwiki    |1946997|
|newiki    |2731371|
|kuwiki    |532225 |
|pawiki    |2850709|
|skwiki    |5737200|
+----------+-------+

+----------+-----+
|wiki      |count|
+----------+-----+
|hywiki    |31074|
|simplewiki|34753|
|sqwiki    |12049|
|frrwiki   |281  |
|ckbwiki   |525  |
|jvwiki    |7144 |
|newiki    |3186 |
|kuwiki    |2399 |
|pawiki    |4732 |
|skwiki    |42531|
+----------+-----+
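
To make the two tables easier to compare, here is a minimal plain-Python sketch that ranks the target wikis by section-level count and sums the totals. The numbers are copied from the output above; the variable names are only illustrative, and nothing here is part of the pipeline itself.

```python
# Row counts copied from the two tables above (snapshot 2024-04-01).
alis_counts = {
    "hywiki": 708904, "simplewiki": 3124112, "sqwiki": 4162703,
    "frrwiki": 23313, "ckbwiki": 661384, "jvwiki": 1946997,
    "newiki": 2731371, "kuwiki": 532225, "pawiki": 2850709,
    "skwiki": 5737200,
}
slis_counts = {
    "hywiki": 31074, "simplewiki": 34753, "sqwiki": 12049,
    "frrwiki": 281, "ckbwiki": 525, "jvwiki": 7144,
    "newiki": 3186, "kuwiki": 2399, "pawiki": 4732,
    "skwiki": 42531,
}

# Wikis ordered from fewest to most section-level suggestions.
by_slis = sorted(slis_counts, key=slis_counts.get)
total_alis = sum(alis_counts.values())
total_slis = sum(slis_counts.values())

for wiki in by_slis:
    print(f"{wiki:<11}{slis_counts[wiki]:>7}{alis_counts[wiki]:>9}")
print(f"{'total':<11}{total_slis:>7}{total_alis:>9}")
```

The wikis at the top of this ranking have the fewest section-level rows, so they may be the ones where a manual sample covers the largest share of the data.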

Wikis

  • ckbwiki
  • frrwiki
  • hywiki
  • jvwiki
  • kuwiki
  • newiki
  • pawiki
  • simplewiki
  • skwiki
  • sqwiki

Tasks

  • run detect_html_tables.py against the target wikis
  • eventually run check_bad_parsing.py against relevant wikis
  • put scripts' outputs on HDFS at analytics_platform_eng
  • execute dev pipeline
  • manually check a random sample & determine general data quality
  • change section topics DAG's default data quality scripts output
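
For the manual check, one possible way to draw a reproducible random sample per wiki is sketched below. It is an assumption about how the check could be done, not the procedure actually used in this task: `sample_per_wiki` is a hypothetical helper, it expects rows that are already in local memory and can be indexed by `"wiki"`, and the demo rows are placeholders rather than real suggestions.

```python
import random
from collections import defaultdict


def sample_per_wiki(rows, k, seed=0):
    """Return up to k randomly chosen rows for each wiki (hypothetical helper)."""
    by_wiki = defaultdict(list)
    for row in rows:
        by_wiki[row["wiki"]].append(row)
    # A fixed seed keeps the sample stable across reruns of the check.
    rng = random.Random(seed)
    return {
        wiki: rng.sample(group, min(k, len(group)))
        for wiki, group in sorted(by_wiki.items())
    }


# Placeholder rows for illustration only; real rows would come from the table.
demo_rows = [{"wiki": "frrwiki", "n": i} for i in range(3)]
demo_rows += [{"wiki": "skwiki", "n": i} for i in range(20)]

demo_sample = sample_per_wiki(demo_rows, k=5)
for wiki, picked in demo_sample.items():
    print(wiki, len(picked))
```

Since PySpark `Row` objects also support `row["wiki"]`, the same helper should work on collected rows, provided the collected set is small enough to fit on the driver.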

Event Timeline

Since the Growth team is committed to scaling to 10 more wikis, we wanted to suggest "backup" wikis to consider scaling to if any of the ten defined in this task turn out to have low data quality:

  • nlwiki
  • cewiki
  • thwiki
  • lvwiki
MarkTraceur renamed this task from Prepare image suggestions for a new set of Wikipedias to [L] Prepare image suggestions for a new set of Wikipedias. May 15 2024, 4:36 PM

@AUgolnikova-WMF - I just wanted to check in. Is it still possible for Structured Content to complete this task in early June, so Growth has time to complete T360060: Scale Article-Level "add an image" to more Wikipedias by the end of June?

Hey @KStoller-WMF , chiming in while @AUgolnikova-WMF is out of office: yes, I'll pick up this ticket next week. Stay tuned!

mfossati changed the task status from Open to In Progress. May 31 2024, 11:07 AM
mfossati claimed this task.
  • change section topics DAG's default data quality scripts output

Noting that we still need to tackle this item.

targets = [f'{w}wiki' for w in ('ckb', 'frr', 'hy', 'jv', 'ku', 'ne', 'pa', 'simple', 'sq', 'sk')]
suggestions = spark.read.table('analytics_platform_eng.image_suggestions_suggestions')
# Section-level suggestions in the two snapshots being compared
prev = suggestions.where("snapshot='2024-07-15' and section_index is not null")
curr = suggestions.where("snapshot='2024-07-29' and section_index is not null")
prev.where(prev.wiki.isin(targets)).count(), curr.where(curr.wiki.isin(targets)).count()

(156870, 136622)

SLIS counts for the target wikis have decreased to a level similar to the one obtained during implementation, i.e., 134133 for the 2024-05-20 snapshot. Closing.
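
For reference, the comparison behind the closing comment can be spelled out as simple arithmetic. This is a minimal plain-Python sketch using only the three counts quoted above; `relative_change` is an illustrative helper, not something from the pipeline.

```python
def relative_change(reference: int, value: int) -> float:
    """Fractional change of value with respect to reference."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (value - reference) / reference


prev_count = 156870  # SLIS rows for the target wikis, snapshot 2024-07-15
curr_count = 136622  # SLIS rows for the target wikis, snapshot 2024-07-29
baseline = 134133    # count obtained during implementation, snapshot 2024-05-20

drop = relative_change(prev_count, curr_count)
gap = relative_change(baseline, curr_count)

# prints: -12.9% vs previous, +1.9% vs baseline
print(f"{drop:.1%} vs previous, {gap:+.1%} vs baseline")
```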