
[L] Prepare image suggestions for a new set of Wikipedias
Open, Needs Triage, Public

Description

The Growth team is planning to expand "add an image" to 10 new Wikipedias, as per T360059: Communication around scaling "add an image" to 10 more Wikipedias.
We already generate suggestions for those wikis:

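# Suggestions for the 10 target wikis in the 2024-04-01 snapshot (assumes an active `spark` session)
isu = spark.read.table('analytics_platform_eng.image_suggestions_suggestions').where('snapshot="2024-04-01" and wiki in ("ckbwiki", "frrwiki", "hywiki", "jvwiki", "kuwiki", "newiki", "pawiki", "simplewiki", "sqwiki", "skwiki")')
# Rows without a section index are article-level suggestions; rows with one are section-level
alis = isu.where('section_index is null')
slis = isu.where('section_index is not null')
# Row counts per wiki: article-level first, then section-level
alis.groupBy('wiki').count().show(truncate=False)
slis.groupBy('wiki').count().show(truncate=False)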

+----------+-------+
|wiki      |count  |
+----------+-------+
|hywiki    |708904 |
|simplewiki|3124112|
|sqwiki    |4162703|
|frrwiki   |23313  |
|ckbwiki   |661384 |
|jvwiki    |1946997|
|newiki    |2731371|
|kuwiki    |532225 |
|pawiki    |2850709|
|skwiki    |5737200|
+----------+-------+

+----------+-----+
|wiki      |count|
+----------+-----+
|hywiki    |31074|
|simplewiki|34753|
|sqwiki    |12049|
|frrwiki   |281  |
|ckbwiki   |525  |
|jvwiki    |7144 |
|newiki    |3186 |
|kuwiki    |2399 |
|pawiki    |4732 |
|skwiki    |42531|
+----------+-----+
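
To read the two tables side by side, the counts can be combined per wiki. The sketch below is only illustrative: it copies the counts printed above into plain dictionaries (the first table taken as `alis`, the second as `slis`, following the order of the `show()` calls) and computes how many section-level rows exist per article-level row. The names `alis_counts`, `slis_counts` and `share` are hypothetical and not part of the pipeline.

```python
# Counts copied from the two tables above (snapshot 2024-04-01).
alis_counts = {
    "hywiki": 708904, "simplewiki": 3124112, "sqwiki": 4162703,
    "frrwiki": 23313, "ckbwiki": 661384, "jvwiki": 1946997,
    "newiki": 2731371, "kuwiki": 532225, "pawiki": 2850709,
    "skwiki": 5737200,
}
slis_counts = {
    "hywiki": 31074, "simplewiki": 34753, "sqwiki": 12049,
    "frrwiki": 281, "ckbwiki": 525, "jvwiki": 7144,
    "newiki": 3186, "kuwiki": 2399, "pawiki": 4732,
    "skwiki": 42531,
}

# Section-level rows per article-level row, for each wiki.
share = {wiki: slis_counts[wiki] / alis_counts[wiki] for wiki in alis_counts}

# Print wikis from the smallest to the largest share.
for wiki, value in sorted(share.items(), key=lambda item: item[1]):
    print(f"{wiki:<11}{alis_counts[wiki]:>9}{slis_counts[wiki]:>7}  {value:.4%}")
```

A low share does not by itself say anything about data quality; it only shows how many section-level rows are available to sample from.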

Tasks

  • manually check a random sample & determine general data quality
  • run detect_html_tables.py against the target wikis
  • eventually run check_bad_parsing.py against relevant wikis
  • put scripts' outputs on HDFS at analytics_platform_eng
  • check production run

Event Timeline

Since the Growth team is committed to scaling to 10 more wikis, we want to suggest "backup" wikis that we could scale to if any of the ten defined in this task turn out to have low data quality:

  • nlwiki
  • cewiki
  • thwiki
  • lvwiki

MarkTraceur renamed this task from Prepare image suggestions for a new set of Wikipedias to [L] Prepare image suggestions for a new set of Wikipedias. Wed, May 15, 4:36 PM

@AUgolnikova-WMF - I just wanted to check in. Is it still possible for Structured Content to complete this task in early June, so Growth has time to complete T360060: Scale "add an image" to 10 more Wikipedias by the end of June?