The Growth team is planning to expand "add an image" to 10 new Wikipedias, as per T360059: Communication around scaling "add an image" to 10 more Wikipedias.
We already generate suggestions for those wikis:
```
isu = spark.read.table('analytics_platform_eng.image_suggestions_suggestions').where(
    'snapshot="2024-04-01" and wiki in ("ckbwiki", "frrwiki", "hywiki", "jvwiki", "kuwiki", "newiki", "pawiki", "simplewiki", "sqwiki", "skwiki")'
)
alis = isu.where('section_index is null')
slis = isu.where('section_index is not null')
alis.groupBy('wiki').count().show(truncate=False)
slis.groupBy('wiki').count().show(truncate=False)
```

Article-level suggestions:

```
+----------+-------+
|wiki      |count  |
+----------+-------+
|hywiki    |708904 |
|simplewiki|3124112|
|sqwiki    |4162703|
|frrwiki   |23313  |
|ckbwiki   |661384 |
|jvwiki    |1946997|
|newiki    |2731371|
|kuwiki    |532225 |
|pawiki    |2850709|
|skwiki    |5737200|
+----------+-------+
```

Section-level suggestions:

```
+----------+-----+
|wiki      |count|
+----------+-----+
|hywiki    |31074|
|simplewiki|34753|
|sqwiki    |12049|
|frrwiki   |281  |
|ckbwiki   |525  |
|jvwiki    |7144 |
|newiki    |3186 |
|kuwiki    |2399 |
|pawiki    |4732 |
|skwiki    |42531|
+----------+-----+
```
Tasks
- manually check a random sample of suggestions and assess general data quality
- run detect_html_tables.py against the target wikis
- eventually run check_bad_parsing.py against relevant wikis
- put the scripts' outputs on HDFS under analytics_platform_eng
- verify the production run