rewrite KrBot to publish Constraint Violation pages
Open, Needs Triage, Public

Description

WMDE declined to take over KrBot operations (T290635, T189747) for two reasons:

  • It's not open source
  • WMDE puts higher priority on being able to access violations through SPARQL/API (T214362), but that still requires completing various tech tasks (eg T201150)

So this task asks to reimplement (something like) KrBot using the new SPARQL/API access.
KrBot generates violation reports like Wikidata:Database_reports/Constraint_violations/P2088, which are integrated into Property Discussion pages and are viewed as a core part of WD.

  • These pages are the best way to work out data quality problems of specific props. Eg I'm now working through the "Single value" violations to remove stale or wrong CrunchBase identifiers
  • Even when I can get all violation info with SPARQL, I'd prefer to work from a generated WD page because:
    • all the info is available at a glance,
    • it can be used by non-tech people (eg Getty Vocabulary Program editors will now use ULAN constraint violations to improve their own data)
    • I can use it to generate QS corrections.
    • (There is Special:ConstraintReport, eg ConstraintReport/Q389336 shows some P2088 violations of that item, but big-data editors don't fix data problems item by item.)
  • An improvement is needed: print the labels of WD items in addition to Qnnnn
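
To illustrate the QS use case in the list above, here is a minimal Python sketch that turns a list of stale values into QuickStatements V1 removal lines. The function name `qs_removals`, the input shape (pairs of item ID and string value) and the sample value are hypothetical; the sketch only relies on the V1 conventions that fields are tab-separated, that a leading `-` removes a statement, and that string values are double-quoted. It does not escape quotes inside values.

```python
from typing import Iterable, List, Tuple


def qs_removals(prop: str, stale: Iterable[Tuple[str, str]]) -> List[str]:
    """Build QuickStatements V1 lines that remove the given string values.

    `stale` is a hypothetical input: (item ID, value) pairs that an editor
    has already decided to remove after reviewing a violation report.
    """
    lines = []
    for qid, value in stale:
        # Leading "-" asks QuickStatements to remove the matching statement.
        lines.append(f'-{qid}\t{prop}\t"{value}"')
    return lines
```

For example, `qs_removals("P2088", [("Q389336", "some-stale-id")])` yields a single line that can be pasted into QuickStatements, where `some-stale-id` stands in for whatever value the report flagged.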

Scheduling (update/refresh)

  • T201150#7351510 discusses potentially useful schedules for when to reprocess (though that's per-item not per-property)
  • A benefit of a SPARQL/API based bot is that violation pages can easily be refreshed on demand
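
As a rough illustration of what an on-demand, per-property refresh could look like, the sketch below builds a SPARQL query that recomputes "Single value" violations for one property and parses the standard SPARQL JSON result. This is an assumption-laden stand-in: it recomputes the check itself against the public query endpoint rather than reading stored constraint check results (the access that T214362 is about), all function names are hypothetical, and for large properties such a query may well time out.

```python
import json
import re
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def single_value_query(prop: str) -> str:
    """Return a query listing items with more than one distinct value."""
    if not re.fullmatch(r"P\d+", prop):
        raise ValueError(f"not a property ID: {prop!r}")
    return f"""
SELECT ?item ?label (COUNT(DISTINCT ?value) AS ?n) WHERE {{
  ?item wdt:{prop} ?value .
  OPTIONAL {{ ?item rdfs:label ?label . FILTER(LANG(?label) = "en") }}
}}
GROUP BY ?item ?label
HAVING (COUNT(DISTINCT ?value) > 1)
"""


def parse_rows(result: dict) -> list:
    """Turn SPARQL JSON bindings into (item ID, label, value count) tuples."""
    rows = []
    for binding in result["results"]["bindings"]:
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        label = binding.get("label", {}).get("value", "")
        rows.append((qid, label, int(binding["n"]["value"])))
    return rows


def refresh(prop: str) -> list:
    """Run the query now; this is the 'on demand' part (needs network)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": single_value_query(prop)})
    request = urllib.request.Request(url, headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": "constraint-report-sketch/0.1 (illustrative example)",
    })
    with urllib.request.urlopen(request) as response:
        return parse_rows(json.load(response))
```

Calling `refresh("P2088")` would return the current rows, which a bot could then write to the report page; the query also fetches English labels, which is one way to address the label request above.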

Event Timeline

I think we should wait until we can directly transclude query results into wiki pages (T67626: [Epic] Support for queries on-wiki (automated list generation)).

For me, one of the best features of KrBot is that it actually edits wiki pages instead of embedding the results from somewhere: this produces valuable diffs. Diffs are useful to see new violations; for example, I may want to concentrate on them so that reverting an incomprehensible value is more likely to result in the author adding a correct value; or I may have intentionally skipped several constraint violations that are extreme rather than wrong values, and I don’t want to go over them again and again. Page histories also provide nice statistics about the constraint violations over time: are we doing a good job, with fewer and fewer (or a constant low number of) constraint violations? Or do we get more new ones than we clean up? The edits also show up on watchlists, reminding editors from time to time that there may be new violations. For these reasons, I’d like T67626’s result not to be used for the KrBot rewrite.
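
The comparison that a page diff provides can be sketched in a few lines of Python, assuming each report is reduced to a set of (item ID, value) pairs. The names `Violation` and `diff_reports` are hypothetical, and this only illustrates the "new versus already seen" split described above; it is not a replacement for page histories or watchlists.

```python
from typing import List, Set, Tuple

Violation = Tuple[str, str]  # (item ID, offending value); hypothetical shape


def diff_reports(
    previous: Set[Violation], current: Set[Violation]
) -> Tuple[List[Violation], List[Violation], List[Violation]]:
    """Split violations into new, fixed, and unchanged since the last run."""
    new = sorted(current - previous)        # appeared since the last report
    fixed = sorted(previous - current)      # no longer reported
    unchanged = sorted(current & previous)  # e.g. intentionally skipped ones
    return new, fixed, unchanged
```

Counting `new` and `fixed` per run would give the kind of trend that page histories currently show.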

  • An improvement is needed: print the labels of WD items in addition to Qnnnn

KrBot supports this, see for example Wikidata:Database reports/Constraint violations/P4082. It’s actually a feature that it doesn’t produce output using the {{Q}} template on certain pages: these pages are so big that the template would break the display of the page (e.g. it may exceed the Lua memory/time limits).
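
The trade-off described above (template on small pages, plain output on very large ones) could be sketched as follows. This is not KrBot's actual logic or output format: the function names, the plain-text layout and the `template_limit` default of 500 are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


def render_item(qid: str, label: str, use_template: bool) -> str:
    """Render one item either via the {{Q}} template or as a plain link."""
    if use_template:
        return "{{Q|" + qid + "}}"
    if label:
        # Plain link plus a pre-fetched label: no template expansion needed.
        return f"[[{qid}]] ({label})"
    return f"[[{qid}]]"


def render_report(rows: Iterable[Tuple[str, str]], template_limit: int = 500) -> str:
    """Use the template only while the page stays small (assumed threshold)."""
    rows = list(rows)
    use_template = len(rows) <= template_limit
    return "\n".join("* " + render_item(qid, label, use_template) for qid, label in rows)
```

With a pre-fetched label, large pages still show readable names without the per-item cost of template expansion.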

So: T291091: Snapshots for saved queries.

I don’t have access to the bot’s source code either, but I guess so.

I think rewriting KrBot is the right idea. The bot has one significant limitation: it generates reports only periodically. As a user of the reports, I want to see an updated report immediately after fixing several items.

But please do not lose several important features of KrBot:

  • The report is generated per property, not per item. Tasks like data imports touch only a few properties but a huge number of items, so such processes cannot be controlled using reports for individual items.
  • The bot processes all values of a property, not just a subset of items.
  • The report covers all constraints of an individual property, which allows a quick review of the property's status.
  • The bot also processes deprecated values. This is important for Format, Type and some other constraints.
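
The point about deprecated values matters for a SPARQL-based rewrite because the `wdt:` prefix only returns best-rank ("truthy") statements, while the `p:`/`ps:` path returns statements of every rank. A minimal sketch, with hypothetical function names and a caller-supplied pattern (Python's `re` may not match the regex dialect used by the real Format constraint exactly):

```python
import re
from typing import Iterable, List, Tuple


def all_ranks_query(prop: str) -> str:
    """Query every statement of a property, including deprecated ones."""
    if not re.fullmatch(r"P\d+", prop):
        raise ValueError(f"not a property ID: {prop!r}")
    return f"""
SELECT ?item ?value ?rank WHERE {{
  ?item p:{prop} ?statement .
  ?statement ps:{prop} ?value ;
             wikibase:rank ?rank .
}}
"""


def format_violations(
    rows: Iterable[Tuple[str, str]], pattern: str
) -> List[Tuple[str, str]]:
    """Return the (item ID, value) pairs whose value does not fully match."""
    regex = re.compile(pattern)
    return [(qid, value) for qid, value in rows if regex.fullmatch(value) is None]
```

A bot built this way would feed the values from `all_ranks_query` into `format_violations`, so that a malformed deprecated value is still reported.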

Maybe we should think about checking constraints in the item edit API. This could eventually make constraint reports redundant altogether. I created T291335 to discuss the idea.

@Ivan

see updated report immediately after fixing several items

How would this work, on demand? Click a button and the page is regenerated?