
Remove data tagged as `kr` from Wikidata
Open, Medium, Public

Description

This task is about removing data tagged as kr from Wikidata: deleting it or moving it to knc. Part of T356144.

For this task, use of @Amire80's script Wikidata language code cleanup is needed: https://gitlab.wikimedia.org/amire80/scriptology/-/tree/main/cleanup-wikidata-language-code.

This Python script automates the cleanup of deprecated language codes (e.g. kr) from Wikidata items. It deletes or moves labels, descriptions, and aliases based on comparisons with fallback and related languages. The script uses the Pywikibot framework to process each item and logs cases it skips due to data inconsistencies or ambiguity to a CSV file for manual review.
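For orientation, here is a minimal, self-contained sketch of the kind of per-item decision the script makes (delete a redundant label, move it to knc when that slot is free, otherwise skip and log for review). The function name and the exact comparison rules are illustrative assumptions, not the actual script; the real implementation is in the linked repository and works through Pywikibot.

```python
DEPRECATED = "kr"   # code being retired
TARGET = "knc"      # code the data may be moved to

def plan_cleanup(labels: dict) -> tuple[str, dict]:
    """Decide what to do with the deprecated-language label of one item.

    `labels` maps language codes to label strings. Returns an action
    name and the updated mapping:
      - "delete": the kr label duplicates another language's label,
        so it carries no information and is dropped;
      - "move":   knc has no label yet, so the kr label moves there;
      - "skip":   knc already has a *different* label; ambiguous,
        so leave the item alone and log it for manual review.
    """
    value = labels.get(DEPRECATED)
    if value is None:
        return "skip", labels

    rest = {k: v for k, v in labels.items() if k != DEPRECATED}
    if value in rest.values():
        # The same string already exists under another code: safe to delete.
        return "delete", rest
    if TARGET not in rest:
        rest[TARGET] = value
        return "move", rest
    # Conflicting knc label: too ambiguous for an automated decision.
    return "skip", labels
```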

Event Timeline

srishakatux moved this task from Backlog to In Progress on the LPL Onboarding and Development board.

I have iterated through 30,000 items so far; close to 30 have been skipped and fewer than 150 items have been deleted. I am also responding to some queries and to edits reverted by admins:

I am also improving the bot script to log skipped data items in both wikitext and CSV formats, for easier manual handling later:
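A possible shape for such dual-format logging is sketched below; the file names and the (qid, reason) record layout are illustrative assumptions, not the script's actual interface.

```python
import csv
from pathlib import Path

def log_skipped(items, csv_path="skipped.csv", wiki_path="skipped.wiki"):
    """Write skipped items both as CSV (for scripts) and as a wikitext
    table (for pasting onto a wiki page for manual review).

    `items` is an iterable of (qid, reason) pairs.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["item", "reason"])
        writer.writerows(items)

    lines = ['{| class="wikitable"', "! Item !! Reason"]
    for qid, reason in items:
        lines.append("|-")
        lines.append(f"| [[{qid}]] || {reason}")
    lines.append("|}")
    Path(wiki_path).write_text("\n".join(lines), encoding="utf-8")
```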

Just in case you wanted to estimate the time needed, there are currently almost half a million items (~0.4% of all items) with a string in kr:

MariaDB [wikidatawiki_p]> SELECT COUNT(DISTINCT wbit_item_id) AS total
    -> FROM wbt_item_terms
    -> JOIN wbt_term_in_lang ON wbtl_id = wbit_term_in_lang_id
    -> JOIN wbt_text_in_lang ON wbxl_id = wbtl_text_in_lang_id
    -> WHERE wbxl_language = 'kr';
+--------+
| total  |
+--------+
| 458071 |
+--------+
1 row in set (5.575 sec)

Changing the SELECT clause to just DISTINCT wbit_item_id should give you the list of all the items without needing to examine each and every one.

@matej_suchanek Thanks for the neat idea! I was able to use your query in the Quarry tool to generate a list of items with a kr label:
https://quarry.wmcloud.org/query/93849
https://quarry.wmcloud.org/run/978915/output/0/json

@Amire80 I have integrated fetching data items from the Quarry JSON result URL into your script. The script now iterates only over the items returned by the query. See a PR here.
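A minimal sketch of what fetching item IDs from a Quarry result might look like, assuming Quarry's JSON output shape of {"headers": [...], "rows": [[...], ...]} with a numeric wbit_item_id per row (the function names are illustrative, not the PR's actual code). The parsing step is factored out so it can be exercised without network access:

```python
import json
from urllib.request import urlopen

# Result URL from the query above (run 978915).
QUARRY_JSON = "https://quarry.wmcloud.org/run/978915/output/0/json"

def rows_to_qids(data):
    """Turn a parsed Quarry JSON result into a list of Q-ids.

    Assumes the shape {"headers": [...], "rows": [[item_id], ...]},
    where each row's first column is a numeric wbit_item_id.
    """
    return [f"Q{row[0]}" for row in data["rows"]]

def fetch_item_ids(url=QUARRY_JSON):
    """Download the Quarry result and return the item Q-ids."""
    with urlopen(url) as resp:
        return rows_to_qids(json.load(resp))
```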

@Bugreporter Could you share more? Is it possible to prevent addition of new KR labels using AbuseFilter or other means? @Amire80 Do you also have any advice here?

We could use an abuse filter, but only for cases where the language is mentioned in the edit summary, so this cannot be the definitive solution. In fact, there are plenty of deprecated language codes around that are regularly purged from and re-added to items. (The rabbit hole starts at T44396.) We really need a solution for this.

If you order the query from the newest items, you are more likely to find recent additions. This should give you the idea of why the language code is still being added. For example, some tools aid users in adding labels in many languages at once (which may include kr). But this has been discouraged since T312097.

> We could use an abuse filter, but only for cases where the language is mentioned in the edit summary, so this cannot be the definitive solution. In fact, there are plenty of deprecated language codes around that are regularly purged from and re-added to items. (The rabbit hole starts at T44396.) We really need a solution for this.

By whom are they re-added? Are those specific bots or users?

I don't keep a list of them. It isn't really a problem now, because the fields are not empty. If they were, it would only be a matter of time before someone enthusiastically started populating them again (e.g., using QuickStatements).

But we are discussing a mass-editing policy that could also help get this under control.

Thanks @matej_suchanek! It looks like it might be possible to request the creation of an AbuseFilter that can block the addition of any new labels using the language code in question, something similar to https://www.wikidata.org/wiki/Special:AbuseFilter/129 and probably customizable for other language codes too. I am asking for more information in the Wikidata Telegram channel.
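For illustration only, a rough, untested sketch of what such a condition might look like in AbuseFilter rule syntax. As noted above, a filter can only catch this through the edit summary (label edits on Wikidata produce summaries like `/* wbsetlabel-add:1|kr */`), and this is an assumption of mine, not the actual text of filter 129:

```
/* Illustrative sketch – not the real filter. Matches edit summaries
   produced by label/description/alias edits in language code kr. */
action == "edit" &
summary rlike "wb(setlabel|setdescription|setaliases)[^*]*\|kr(\||[^a-z])"
```

A real filter would likely also need exceptions for the cleanup bot itself, and, as discussed below, edits made through wbeditentity would still slip past it.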

There may be one caveat, as @Lucas_Werkmeister_WMDE points out: the way the filter is implemented means there could be some false negatives (for example, edits made using wbeditentity might not be caught). So while it may not prevent all cases of the language code being re-added, it could still help reduce accidental additions by editors who may not be aware that the language code is being retired on Wikidata.

Based on discussions in the Wikidata channel, it seems the filter cannot solve all the problems, but it might still catch major edits involving the disabled language code. I've made a request here: https://www.wikidata.org/wiki/Wikidata:Administrators%27_noticeboard#Request_for_AbuseFilter_to_block_labels_using_a_disabled_language_code. I am now continuing to run the script to remove kr data.

After a few hours of running the script to modify Wikidata items yesterday, here is a short update: ~2,000 of the ~460,000 items with language code kr were processed in six hours. The edits made can be seen here. At this rate, it would take roughly 1,400 hours to complete the task. Unless I am missing something, this may not be a sustainable approach going forward. At this point, guidance from engineers in Language and Product Localization or further discussion with the Wikidata team would be helpful.
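A back-of-the-envelope check of that estimate, using the round figures from the update above:

```python
# Rough runtime projection from the sample run.
processed = 2_000        # items handled in the sample run
total = 460_000          # items with a kr term (approx., from the SQL count)
hours = 6                # wall-clock time for the sample

rate = processed / hours                      # ~333 items per hour
remaining_hours = (total - processed) / rate  # time left at this pace
print(round(remaining_hours))                 # prints 1374, i.e. roughly 1,400 hours
```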

MaryMunyoki subscribed.

To be reassigned once we begin working on the task.