The Growth team is planning to expand "add an image" to 10 new Wikipedias, per T360059: Communication around scaling "add an image" to more Wikipedias.
We already generate suggestions for those wikis:
# Image suggestions from the 2024-04-01 snapshot for the 10 target wikis
isu = spark.read.table('analytics_platform_eng.image_suggestions_suggestions').where(
    'snapshot="2024-04-01" and wiki in ("ckbwiki", "frrwiki", "hywiki", "jvwiki", '
    '"kuwiki", "newiki", "pawiki", "simplewiki", "sqwiki", "skwiki")'
)
# Article-level suggestions carry no section index; section-level ones do
alis = isu.where('section_index is null')
slis = isu.where('section_index is not null')
alis.groupBy('wiki').count().show(truncate=False)
slis.groupBy('wiki').count().show(truncate=False)
Article-level suggestions per wiki:
+----------+-------+
|wiki |count |
+----------+-------+
|hywiki |708904 |
|simplewiki|3124112|
|sqwiki |4162703|
|frrwiki |23313 |
|ckbwiki |661384 |
|jvwiki |1946997|
|newiki |2731371|
|kuwiki |532225 |
|pawiki |2850709|
|skwiki |5737200|
+----------+-------+
Section-level suggestions per wiki:
+----------+-----+
|wiki |count|
+----------+-----+
|hywiki |31074|
|simplewiki|34753|
|sqwiki |12049|
|frrwiki |281 |
|ckbwiki |525 |
|jvwiki |7144 |
|newiki |3186 |
|kuwiki |2399 |
|pawiki |4732 |
|skwiki |42531|
+----------+-----+
Wikis
- ckbwiki
- frrwiki
- hywiki
- jvwiki
- kuwiki
- newiki
- pawiki
- simplewiki
- skwiki
- sqwiki
Tasks
- run detect_html_tables.py against the target wikis
- if needed, run check_bad_parsing.py against the relevant wikis
- put the scripts' outputs on HDFS under analytics_platform_eng (see the HDFS sketch after this list)
- execute the dev pipeline
- manually check a random sample and assess general data quality (see the sampling sketch after this list)
- change the section topics DAG's default data-quality scripts output
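A minimal sketch for the HDFS step, assuming the two scripts write local TSV files and that the standard hdfs dfs CLI is available on the client; the local file names and the destination path are placeholders, not the actual ones:
import subprocess
# Hypothetical local outputs of detect_html_tables.py and check_bad_parsing.py
local_outputs = ['html_tables.tsv', 'bad_parsing.tsv']
# Placeholder destination under the analytics_platform_eng area (assumption)
hdfs_dir = '/user/analytics_platform_eng/image_suggestions_qa'
# Create the target directory, then copy each file into it, overwriting if present
subprocess.run(['hdfs', 'dfs', '-mkdir', '-p', hdfs_dir], check=True)
for path in local_outputs:
    subprocess.run(['hdfs', 'dfs', '-put', '-f', path, hdfs_dir], check=True)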
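And a sketch for the manual spot-check, reusing the isu DataFrame from above; the sample size and seed are arbitrary choices, not part of the task:
from pyspark.sql import functions as F
# Shuffle the suggestions and pull a small random sample for manual review
sample = isu.orderBy(F.rand(seed=42)).limit(100)
sample.show(truncate=False)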