WMDE declined to take over KrBot operations (T290635, T189747) for two reasons:
- It's not open source
- WMDE puts a higher priority on making violations accessible through SPARQL/API (T214362), but that work still depends on various tech tasks being completed (eg T201150)
So this task asks to reimplement (something like) KrBot using the new SPARQL/API access.
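One possible building block is the API route (as opposed to pure SPARQL): the WikibaseQualityConstraints extension already exposes an `action=wbcheckconstraints` module. A minimal sketch follows; the module and its `id` parameter are real, but the response traversal is an assumption about the documented result shape and the helper names are hypothetical:

```python
"""Sketch: batch-check items via wbcheckconstraints.
The response traversal is an assumption and may need adjusting."""
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"

def build_check_url(qids):
    """Build a wbcheckconstraints request URL for a batch of item IDs
    (the API caps how many IDs one request may carry)."""
    params = {
        "action": "wbcheckconstraints",
        "format": "json",
        "id": "|".join(qids),
    }
    return API + "?" + urllib.parse.urlencode(params)

def violations_for(qids, prop):
    """Yield (qid, message) for violations on `prop`.
    ASSUMED shape: wbcheckconstraints -> qid -> claims -> prop ->
    claim results, each carrying a `results` list with a `status`."""
    req = urllib.request.Request(
        build_check_url(qids),
        headers={"User-Agent": "violation-report-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    for qid, entity in data.get("wbcheckconstraints", {}).items():
        for claim in entity.get("claims", {}).get(prop, []):
            for check in claim.get("results", []):
                if check.get("status") == "violation":
                    yield qid, check.get("message-html", "")
```

The API checks one item at a time, so a property-wide report would still need a candidate list (eg all items using the property, fetched via SPARQL) to drive the batches.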
KrBot generates violation reports such as
https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P2088, which are integrated into Property Discussion pages and are viewed as a core part of WD.
- These pages are the best way to work through data quality problems of specific props. Eg I'm currently working through https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P2088#%22Single_value%22_violations to remove stale or wrong CrunchBase identifiers
- Even once I can get all violation info with SPARQL, I'd still prefer to work from a generated WD page because:
- all the info is available at a glance,
- it can be used by non-tech people (eg Getty Vocabulary Program editors will now use ULAN constraint violations to improve their own data)
- I can use it to generate QS corrections.
- (There is https://www.wikidata.org/wiki/Special:ConstraintReport/ (eg https://www.wikidata.org/wiki/Special:ConstraintReport/Q389336 shows `P2088` violations), but big-data editors don't fix data problems item by item.)
- One needed improvement: print the labels of WD items in addition to their `Qnnnn` identifiers
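The label improvement and the QS-correction workflow mentioned above could look roughly like this. The WDQS endpoint, the `wikibase:label` service, and the QuickStatements v1 `-` removal prefix are real; the query itself is a hand-rolled approximation of a "single value" violation report, not KrBot's actual logic, and the function names are hypothetical:

```python
"""Sketch: list "single value" violations for a property with labels,
and emit QuickStatements v1 removal commands (tab-separated)."""
import json
import urllib.parse
import urllib.request

WDQS = "https://query.wikidata.org/sparql"

def single_value_query(prop):
    """SPARQL: items holding more than one value for `prop`, with
    English labels (so the report shows labels, not just Qnnnn)."""
    return f"""
SELECT ?item ?itemLabel ?value WHERE {{
  ?item wdt:{prop} ?value .
  {{ SELECT ?item WHERE {{ ?item wdt:{prop} ?v }}
     GROUP BY ?item HAVING (COUNT(?v) > 1) }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}"""

def fetch_violations(prop):
    """Run the query against WDQS; yield (qid, label, value) rows."""
    url = WDQS + "?" + urllib.parse.urlencode(
        {"query": single_value_query(prop), "format": "json"})
    req = urllib.request.Request(
        url, headers={"User-Agent": "violation-report-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    for b in data["results"]["bindings"]:
        qid = b["item"]["value"].rsplit("/", 1)[-1]
        yield qid, b["itemLabel"]["value"], b["value"]["value"]

def qs_remove(qid, prop, value):
    """QuickStatements v1 command removing one external-id value."""
    return f'-{qid}\t{prop}\t"{value}"'
```

After manual review of the fetched rows, the `qs_remove` lines for the stale identifiers can be pasted straight into QuickStatements.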
Scheduling (update/refresh)
- T201150#7351510 discusses potentially useful schedules for when to reprocess (though that discussion is per-item, not per-property)
- A benefit of a SPARQL/API based bot is that violation pages can easily be refreshed **on demand**
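The scheduling behavior described above (periodic refresh plus an on-demand override) could be sketched as follows; this is a hypothetical structure, not anything KrBot or the tasks above specify:

```python
"""Sketch: periodic per-property refresh with an on-demand override."""
import time

class ReportScheduler:
    """Track when each property's violation page was last regenerated;
    refresh on schedule, or immediately when requested on demand."""

    def __init__(self, min_interval_s=24 * 3600, clock=time.time):
        self.min_interval_s = min_interval_s
        self.clock = clock          # injectable for testing
        self.last_run = {}          # property id -> last refresh time

    def should_refresh(self, prop, on_demand=False):
        last = self.last_run.get(prop)
        return (on_demand or last is None
                or self.clock() - last >= self.min_interval_s)

    def refresh(self, prop, regenerate, on_demand=False):
        """Call `regenerate(prop)` if due; return True if it ran."""
        if not self.should_refresh(prop, on_demand):
            return False
        regenerate(prop)
        self.last_run[prop] = self.clock()
        return True
```

The on-demand path is the point: an editor cleaning up one property can trigger a fresh report immediately instead of waiting for the next scheduled run.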