Background
This Python script automates the cleanup of deprecated language codes (e.g., kr) from Wikidata items. It deletes or moves labels, descriptions, and aliases based on comparisons with fallback and related languages. The script uses the Pywikibot framework to process each item and logs cases it skips due to data inconsistencies or ambiguity to a CSV file for manual review.
A previous attempt to run this script (described in T394877) was unsuccessful: it took ~6 hours to update ~2,000 of the ~500,000 items for the kr language code. At that rate, processing the full set would take ~1,500 hours (~2 months). To make the cleanup sustainable, the recommendation was to improve the script’s efficiency (it currently makes multiple edits per item) and to host it in Wikimedia Cloud Services. A related task to improve the cleanup process is in progress: T403097.
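The extrapolation can be checked with a few lines of Python. This is a rough sketch that uses only the rounded figures quoted above (~2,000 items in ~6 hours, ~500,000 items in total), so it gives an order of magnitude rather than an exact figure; the variable names are illustrative.

```python
# Rounded figures from the previous run (approximate, as quoted above).
items_done = 2_000      # items updated in the test run
hours_spent = 6         # wall-clock hours the test run took
items_total = 500_000   # items carrying the kr language code

# Scale the observed run linearly to the full set of items.
hours_needed = items_total * hours_spent / items_done
days_needed = hours_needed / 24

print(f"Estimated runtime: {hours_needed:,.0f} hours ({days_needed:.1f} days)")
```

With these rounded inputs the estimate comes out at about 1,500 hours, a little over 60 days of continuous running, which is consistent with the roughly two-month figure above.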
Tasks
- Host the bot script on Wikimedia Cloud Services.
- Request bot permissions on Wikidata.
- Improve the efficiency of the existing bot script (it currently edits the same item multiple times).
- Set up a cron job so that the bot runs automatically.
- Ensure proper hosting and monitoring of the bot.
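
One possible way to approach the efficiency task is to collect all planned label and description changes for an item and save them in a single edit instead of one edit per change. The sketch below is not the existing script: `build_single_edit`, the `planned` list, and the placeholder target code `xx` are hypothetical names used for illustration. It assumes that Wikibase treats an empty value as a removal of that term, and it leaves aliases out to stay short.

```python
def build_single_edit(changes):
    """Merge per-term changes into one payload for a single edit.

    `changes` is an iterable of (term_type, language, value) tuples, where
    term_type is "labels" or "descriptions". An empty value is assumed to
    mean "remove this term".
    """
    payload = {}
    for term_type, language, value in changes:
        if term_type not in ("labels", "descriptions"):
            raise ValueError(f"unsupported term type: {term_type}")
        # Later changes to the same term and language overwrite earlier ones.
        payload.setdefault(term_type, {})[language] = {
            "language": language,
            "value": value,
        }
    return payload


# Hypothetical plan for one item: remove the kr label and description and
# move the label text to a placeholder target code "xx".
planned = [
    ("labels", "kr", ""),
    ("labels", "xx", "Example label"),
    ("descriptions", "kr", ""),
]
payload = build_single_edit(planned)

# With Pywikibot, the merged payload could then be saved in one request:
#   item.editEntity(payload, summary="Clean up deprecated language code kr")
```

Saving one merged payload per item would mean one write per item instead of several, which is where most of the time in the previous run appears to have gone.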