"WMDE should take over the operation of KrBot (that generates Constraint Violation pages) because it now takes huge resources ("Now the bot requires 106 GB of memory to load and process all data") and Ivan Krestinin cannot cope. See discussion at User_talk:Ivan_A._Krestinin and at Telegram"
source https://phabricator.wikimedia.org/T201150#7341220
Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T244043 suggestions and possible decisions from the 2020 report on Property constraints
Open | | None | T192565 Find constraint violations
Open | | None | T201150 Regularly run constraint checks for all items
Declined | | None | T290635 Evaluate whether WMDE can take over our essential community run constraints bot
Event Timeline
Maybe someone in the community can adopt it from Ivan Krestinin? It should be able to run fine in a WMF cloud VPS I assume.
Where is the code for this bot?
And what exactly is the data in and out of this code?
I also see that the description says WMF should take over, and the title says WMF operations.
I'm guessing the intent here is for WMDE to consider taking this over; this is certainly not something that the WMF would get involved in for this project.
(Yes, we mean WMDE.)
This is basically a reopen of T189747 which was declined because the source is not available.
@Ivan_A_Krestinin Can you comment about opening the source?
The bot generates the constraint violation pages, eg see https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P2088.
- These pages are the best way to work out data quality problems. Eg I'm now working through https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P2088#%22Single_value%22_violations to remove stale or wrong CrunchBase identifiers
- Given T201150, this is currently the ONLY way to find constraint violations
- But even when I can get all violations with SPARQL, I'll prefer to work from a generated WD page
- The only improvement I'd ask for is to print the labels of WD items in addition to Qnnnn
- (Hmm, https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/KrBot describes different tasks).
https://phabricator.wikimedia.org/T189747#4058205 describes how the bot operates.
If WMDE won't take over the operation of KrBot, another option would be to write a WMDE bot to do it.
I believe that when violations are fully exposed for querying (T201150), this should be possible to do with SPARQL (or with SQL if they are also available in the underlying RDBMS).
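As a rough illustration of the SPARQL route, the sketch below builds a query for one constraint type only ("single value"): items that carry more than one value for a property. It does not use the stored violations that T201150 asks for, and it ignores separators, exceptions and statement ranks that the real constraint check may take into account. The function name `single_value_violation_query` and the `limit` parameter are made up for this example; the query assumes the `wdt:` prefix that the Wikidata Query Service predefines.

```python
def single_value_violation_query(property_id: str, limit: int = 100) -> str:
    """Build a SPARQL query listing items with more than one value for a property.

    Hypothetical helper for illustration; this is not how KrBot works.
    """
    # Accept only ids of the form "P" followed by digits, e.g. "P2088".
    if not (property_id.startswith("P") and property_id[1:].isdecimal()):
        raise ValueError(f"not a property id: {property_id!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT ?item (COUNT(DISTINCT ?value) AS ?valueCount) WHERE {\n"
        f"  ?item wdt:{property_id} ?value .\n"
        "}\n"
        "GROUP BY ?item\n"
        "HAVING (COUNT(DISTINCT ?value) > 1)\n"
        "ORDER BY DESC(?valueCount)\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Print the query for the CrunchBase identifier property mentioned above.
    print(single_value_violation_query("P2088", limit=50))
```

The resulting string can be pasted into the query service UI or sent to its endpoint by whatever client the bot uses.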
So I certainly don't think we want to operate the bot.
But I'd imagine something else might be possible, cc @Lydia_Pintscher @Manuel
As I understand it, the bot makes one page per property, containing a broken-down list of the current violations of each constraint defined on that property, with some slightly useful extra data / formatting (such as the values at play)?
@Addshore I think that's a fair description. To add:
- those pages are linked to the Discussion pages of each property, so are perceived as an integral part of WD, eg https://www.wikidata.org/wiki/Property_talk:P245:
  - Database reports/Constraint violations/P245: KrBot
  - Database reports/Complex constraint violations/P245: not sure who generates
  - Database reports/Humans with missing claims/P245: not sure who generates
  - And there's a link for each individual constraint
- It also generates useful Type Statistics (eg https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P245#Types_statistics) that allow us to evaluate whether the type constraints are right. Eg the first "No" is https://www.wikidata.org/wiki/Q15632617 "fictional human" (ULAN is only supposed to include real persons and organizations) and I can search for Q15632617 to see whether to remove that type from WD items, or to add it to the ULAN property
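The type statistics idea can be sketched as a simple tally. The bot's source is not available, so this is only an assumption about the general shape, not KrBot's method: `item_types` (a mapping from item id to its "instance of" classes) and `allowed_types` (the classes named in the property's type constraint) are hypothetical inputs, and the item ids in the usage lines are placeholders.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def type_statistics(
    item_types: Dict[str, List[str]], allowed_types: Set[str]
) -> List[Tuple[str, int, bool]]:
    """Count how often each class occurs and flag whether the constraint allows it."""
    # Tally every class id across all items that use the property.
    counts = Counter(t for types in item_types.values() for t in types)
    # Most frequent classes first; the boolean is the "Yes"/"No" column.
    return [(type_id, n, type_id in allowed_types) for type_id, n in counts.most_common()]


if __name__ == "__main__":
    stats = type_statistics(
        {"Q900001": ["Q5"], "Q900002": ["Q5"], "Q900003": ["Q15632617"]},
        allowed_types={"Q5"},
    )
    for type_id, n, allowed in stats:
        print(type_id, n, "Yes" if allowed else "No")
```

Rows flagged "No" are the ones to review: either the type is wrong on the items, or the type constraint on the property needs extending.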
As Adam said, we can't take over this bot. Instead, I want us to focus our energy on the underlying issue: constraint violations are not accessible in a meaningful way via an API or any other way to query them. We are working on fixing this by persistently storing the constraint violations. Once that is in place, a much simpler and less resource-heavy bot can be written to handle these tasks, if still desired. The work on persistently storing constraint violations is ongoing in T214362. I am declining this ticket in favor of that one, as that is the more scalable long-term solution.
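For a sense of how small the page-writing half of such a bot could be, here is a minimal sketch that turns already-fetched violations into one wikitext section, printing item labels next to the Qnnnn ids as requested earlier in the thread. Everything here is an assumption for illustration: the function name, the shape of `violations` (item id to list of offending values) and `labels` (item id to label), and the exact wikitext layout, which does not claim to match what KrBot writes.

```python
from typing import Dict, List


def render_violation_section(
    constraint_name: str,
    violations: Dict[str, List[str]],
    labels: Dict[str, str],
) -> str:
    """Render one report section as wikitext, one bullet per violating item."""
    lines = [f'== "{constraint_name}" violations ==']
    # Sort by the numeric part of the id so Q7 comes before Q42.
    for item_id, values in sorted(violations.items(), key=lambda kv: int(kv[0][1:])):
        label = labels.get(item_id)
        # Show the label next to the id when one is known; fall back to the bare id.
        link = f"[[{item_id}|{label} ({item_id})]]" if label else f"[[{item_id}]]"
        lines.append(f"* {link}: {', '.join(values)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        render_violation_section(
            "Single value",
            {"Q42": ["value-a", "value-b"], "Q7": ["value-c"]},
            {"Q42": "Douglas Adams"},
        )
    )
```

Fetching the violations and saving the page are left out on purpose, since both depend on how T214362 ends up exposing the stored data.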