Wikidata Gap Fixer: AI‑assisted detection and fixing of missing lexeme and item content (starting with Dutch)
Closed, Invalid (Public)

Description

Problem

Many Wikidata lexemes, in Dutch and even more so in smaller languages, are missing basic morphological forms (e.g. plural nouns, past participles) and glosses. This project uses AI to suggest fixes for these simple, low‑hanging gaps, focusing on cases where high‑confidence, pattern‑based suggestions are realistic. This approach is not expected to cover very small languages.

It will focus on:

  • Lexemes (missing senses, forms, grammatical features, and lexeme–item links).
  • Items (missing or inconsistent labels/descriptions/aliases in that language which should be derivable from lexemes and existing multilingual data).

Today, filling these gaps is:

  • Time‑consuming and manual for editors.
  • Hard to prioritize (no good overview of “high‑impact, high‑confidence” missing content).

At the same time, modern language models are reasonably good at spotting missing/misaligned pieces of structured lexical information given enough context, provided that humans stay in the loop to approve or correct suggestions.

We lack a dedicated tool that:

  • Systematically detects gaps in lexeme/item data for a target language.
  • Uses AI/LLMs to propose concrete, small, reviewable edits (not bulk auto‑edits).
  • Provides an efficient human‑in‑the‑loop UI where editors can accept/reject/edit those suggestions and push them to Wikidata.

Goal

Create an experimental, opt‑in tool (working name: “Wikidata Gap Fixer”) that:

  1. Detects gaps in lexeme and item data for a configurable language (initially Dutch).
  2. Uses an AI/LLM‑based pipeline to generate high‑confidence, explainable suggestions for filling those gaps.
  3. Presents these suggestions in a review queue where human editors can:
    • Accept as‑is.
    • Edit then accept.
    • Reject (feeding back into the system’s scoring/filters).
  4. Applies accepted suggestions to Wikidata via the API under the editor’s account, with clear, tool‑specific edit summaries.

The long‑term design should be language‑agnostic so that other communities can plug in their language‑specific rules and data sources.


Scope (initial phase)

  • Focus language: Dutch (nl).
  • Domain: Lexemes and directly related item content.
  • Initial suggestion types (MVP):
    1. Lexeme–item linking suggestions
      • Suggest Lexeme → Item links where:
        • Lemma matches an item label/alias in Dutch.
        • There is clear disambiguation via sitelinks, existing statements, or IDs.
        • The LLM can confirm the match given context (definitions, examples).
    2. Lexeme form / grammatical feature suggestions (regular patterns only)
      • Propose missing forms (e.g. Dutch noun plurals, regular verb inflections) where patterns are simple and low risk.
      • Suggest missing grammatical features on existing forms (e.g. gender for Dutch nouns) based on regular patterns + existing data.
    3. Dutch item label/description suggestions from multilingual data
      • When an item has rich labels/descriptions in other languages and related lexemes, suggest:
        • Missing Dutch labels.
        • Missing or obviously improvable Dutch descriptions.

Everything is review‑only: no automatic edits without an explicit human action.
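
As an illustration of the "regular patterns only" form suggestions in item 2, here is a minimal Python sketch. The two suffix patterns, the function name suggest_plural, and the PluralSuggestion type are assumptions made for this example; a real Dutch profile would need a rule set curated by Dutch lexeme editors. The rule returns nothing when it is unsure, which matches the precision-over-recall stance of the task.

```python
from typing import NamedTuple, Optional


class PluralSuggestion(NamedTuple):
    form: str        # proposed plural form
    rule_id: str     # rule that produced it
    rationale: str   # short human-readable explanation


# Hypothetical, deliberately narrow patterns: (lemma suffix, plural suffix).
# Anything not listed here is left for a human editor.
REGULAR_PATTERNS = [
    ("heid", "heden"),  # mogelijkheid -> mogelijkheden
    ("je", "jes"),      # huisje -> huisjes (diminutives)
]


def suggest_plural(
    lemma: str, known_irregular: frozenset = frozenset()
) -> Optional[PluralSuggestion]:
    """Return a plural suggestion for a covered pattern, else None."""
    word = lemma.strip()
    # Skip multi-word lemmas, capitalised names, and listed exceptions.
    if not word.isalpha() or not word.islower() or word in known_irregular:
        return None
    for suffix, plural_suffix in REGULAR_PATTERNS:
        # Require a stem of at least two letters before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            form = word[: -len(suffix)] + plural_suffix
            return PluralSuggestion(
                form=form,
                rule_id="NL_REGULAR_PLURAL",
                rationale=f"Plural formed via regular Dutch pattern -{suffix} > -{plural_suffix}",
            )
    return None
```

A lemma outside the listed patterns (for example "kind") yields None and never reaches the review queue through this rule.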


Users / Personas

  • Wikidata lexeme contributors (especially Dutch‑focused editors).
  • Wikidata item editors interested in improving Dutch labels/descriptions.
  • Language‑focused WikiProjects wanting a queue of “low‑hanging fruit” edits.

User stories

  • As a Dutch lexeme editor, I want a list of lexemes with high‑confidence suggestions (e.g. missing plurals, missing lexeme–item links), so that I can quickly review and apply them.
  • As a Wikidata item editor, I want suggested Dutch labels/descriptions for items already well‑described in other languages, so that I can improve Dutch coverage efficiently.
  • As a community member, I want transparent documentation of how each suggestion type is generated, so I can trust (or opt out of) suggestions that are too aggressive.
  • As a tool maintainer, I want clearly separated language‑specific rules (e.g. Dutch inflection patterns) and a generic core pipeline, so I can extend the tool to other languages later.

Non‑goals (initially)

  • Fully automated bot edits without human review.
  • Complex sense creation for highly polysemous or controversial terms.
  • Deep semantic modeling or complex ontology reasoning.
  • Full integration into the Wikidata UI (this will be a standalone, external tool initially).

Requirements / Constraints

  • Human‑in‑the‑loop:
    • Every AI suggestion must be explicitly accepted or edited by a logged‑in user before being applied.
  • Transparency:
    • Each suggestion shows:
      • The exact proposed edit(s).
      • A short rationale in natural language (e.g. “Plural formed via regular Dutch pattern X”).
      • Confidence or priority score.
  • Safety and conservatism:
    • Focus on high‑precision suggestions over high recall.
    • Start with conservative thresholds and simple, regular patterns.
  • Wikidata integration:
    • OAuth‑based login.
    • Edits via the Wikidata API under the user’s account.
    • Clear, tool‑specific edit summaries (e.g. “via Wikidata Gap Fixer (rule: NL_REGULAR_PLURAL)”).
  • Language modularity:
    • Language‑specific rules and resources live in separate modules/configs.
    • Core pipeline should not hardcode Dutch; Dutch is “language profile #1”.
  • LLM usage constraints:
    • Respect WMF and community policies around external services, privacy, and data handling.
    • Avoid sending unnecessary personal data or large amounts of content to LLM providers.
    • Prefer cached/derived representations where possible.
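
The language-modularity and edit-summary requirements above could be sketched roughly as follows. Everything here (LanguageProfile, register_profile, run_rules, edit_summary, and the toy Dutch rule) is a hypothetical name chosen for illustration, not an existing API; the point is only that the core functions take a language code and never mention Dutch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

TOOL_NAME = "Wikidata Gap Fixer"

# A rule maps a lemma to a proposed value, or None when it does not apply.
Rule = Callable[[str], Optional[str]]


@dataclass
class LanguageProfile:
    code: str                                   # language code, e.g. "nl"
    rules: Dict[str, Rule] = field(default_factory=dict)  # rule ID -> rule


_PROFILES: Dict[str, LanguageProfile] = {}


def register_profile(profile: LanguageProfile) -> None:
    """Add a language profile; refuse silent overwrites."""
    if profile.code in _PROFILES:
        raise ValueError(f"profile already registered: {profile.code}")
    _PROFILES[profile.code] = profile


def run_rules(language_code: str, lemma: str) -> List[Tuple[str, str]]:
    """Core pipeline step: run every rule of the chosen profile."""
    hits = []
    for rule_id, rule in _PROFILES[language_code].rules.items():
        value = rule(lemma)
        if value is not None:
            hits.append((rule_id, value))
    return hits


def edit_summary(action: str, rule_id: str, reviewed: bool = True) -> str:
    """Build a tool-specific edit summary in the format described above."""
    suffix = ", manually reviewed" if reviewed else ""
    return f"{action} via {TOOL_NAME} (rule: {rule_id}{suffix})"


# "Language profile #1": Dutch, with a single toy rule for diminutives.
register_profile(LanguageProfile(
    code="nl",
    rules={"NL_REGULAR_PLURAL": lambda lemma: lemma + "s" if lemma.endswith("je") else None},
))
```

Under this layout, adding another language means registering another profile; run_rules and edit_summary stay unchanged.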

Proposed technical approach / architecture

High‑level architecture
  • Backend service (e.g. Node/TypeScript, Python, or similar stack familiar to WMF tooling):
    • Periodically scans Wikidata (via SPARQL + API) to find candidate gaps.
    • Runs rule‑based pre‑filters (simple, deterministic checks).
    • Where needed, calls an LLM layer to:
      • Validate or refine suggestions.
      • Generate short rationales or canonicalized text.
    • Stores suggestions and their metadata in a database (e.g. PostgreSQL).
    • Exposes an HTTP API for:
      • Fetching suggestion queues.
      • Accepting/rejecting/editing suggestions.
      • Triggering write actions to Wikidata.
  • Frontend web UI:
    • Login via OAuth to Wikidata.
    • Suggestion queue, with:
      • Filters by language, suggestion type, confidence band.
      • Per‑suggestion diff view (“before / after”).
      • Actions: Accept, Edit+Accept, Reject.
  • Integration with Wikidata:
    • Reads via SPARQL (querying for lexemes/items with likely gaps).
    • Writes via Wikidata API (edit endpoints).
    • Adheres to usual rate limits, bot policies, and community practices.
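
To make the "reads via SPARQL" part concrete, here is a hedged Python sketch that builds a query for lexemes with exactly one form and parses the standard SPARQL JSON result format. The function names are invented for this example, and the QIDs in the usage note (Q7411 for Dutch, Q1084 for noun) should be verified before use. Sending the query to the endpoint (with a descriptive User-Agent and respect for rate limits) is left out on purpose.

```python
import re
from typing import Any, Dict, List

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
_QID = re.compile(r"Q[1-9]\d*")


def single_form_lexeme_query(language_qid: str, category_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query for lexemes that have exactly one form."""
    for qid in (language_qid, category_qid):
        if not _QID.fullmatch(qid):  # avoid injecting arbitrary text into the query
            raise ValueError(f"not a valid item ID: {qid!r}")
    return f"""
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
SELECT ?lexeme ?lemma WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          wikibase:lexicalCategory wd:{category_qid} ;
          wikibase:lemma ?lemma ;
          ontolex:lexicalForm ?form .
}}
GROUP BY ?lexeme ?lemma
HAVING (COUNT(?form) = 1)
LIMIT {int(limit)}
""".strip()


def parse_candidates(result: Dict[str, Any]) -> List[Dict[str, str]]:
    """Turn a SPARQL JSON result into a list of {lexeme_id, lemma} dicts."""
    rows = []
    for binding in result["results"]["bindings"]:
        uri = binding["lexeme"]["value"]
        rows.append({
            "lexeme_id": uri.rsplit("/", 1)[-1],  # .../entity/L123 -> L123
            "lemma": binding["lemma"]["value"],
        })
    return rows
```

For example, single_form_lexeme_query("Q7411", "Q1084") would target Dutch nouns, assuming those QIDs are correct; the resulting candidates would then go to rule-based pre-filtering rather than straight to the queue.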

Data flow (simplified)
  1. Candidate discovery
    • Scheduled job or on‑demand query finds potential targets, e.g.:
      • Dutch lexemes with a single form where typical language rules imply more.
      • Lexemes missing Lexeme → Item links where labels/aliases strongly suggest a match.
      • Items with no Dutch label/description but rich labels in other languages and related lexemes.
    • Implemented via SPARQL queries + supplemental API calls as needed.
  2. Rule‑based pre‑filtering
    • For each candidate, apply deterministic rules per suggestion type, e.g.:
      • Dutch noun pluralization with known regular suffix patterns.
      • Excluding known irregular verbs from “regular” modules.
      • Rejecting ambiguous lemma–item matches (multiple possible items without strong evidence).
    • This stage should already eliminate obviously bad or ambiguous candidates.
  3. LLM assistance layer
    • For candidates that pass pre‑filtering but still need semantic judgement or text generation:
      • Send a minimal context prompt to an LLM, for example:
        • Short description of the lexeme/item.
        • Relevant labels/descriptions in other languages.
        • Existing forms/senses/statements.
        • The candidate suggestion and the rule that proposed it.
      • Ask the LLM to:
        • Confirm or flag the suggestion (e.g. “likely correct / unclear / probably wrong”).
        • Generate/refine short natural language text where needed (e.g. concise Dutch description).
        • Provide a brief rationale (“This matches the English lemma and described meaning…”).
    • Output is a structured JSON object (decision, confidence estimate, suggested text, rationale).
    • The LLM is advisory, not authoritative: a human still makes the final call.
  4. Suggestion storage
    • Store each suggestion with:
      • Target entity (lexeme or item ID).
      • Proposed change(s) (as a small patch or set of statements).
      • Suggestion type and rule ID (e.g. NL_REGULAR_PLURAL, LEXEME_ITEM_LINK).
      • Confidence scores (rule‑based and LLM‑based).
      • Human‑readable rationale.
      • Status: pending, accepted, rejected, applied, superseded, etc.
      • Audit trail (who accepted/rejected, when, what was changed before/after).
  5. Review UI
    • Frontend fetches pending suggestions for the logged‑in user.
    • For each suggestion:
      • Show:
        • Entity summary (labels, key statements, links).
        • The proposed edit in a diff‑like view.
        • Rationale and confidence badge.
      • Allow:
        • Accept: send to backend to apply via Wikidata API.
        • Edit: let the user tweak the suggestion (e.g. adjust gloss) then accept.
        • Reject: mark as rejected; optionally ask for a short reason or category (e.g. “wrong lemma”, “too ambiguous”).
    • Provide filters:
      • By suggestion type (forms, labels, links, etc.).
      • By confidence.
      • By entity type or project (e.g. nouns only, verbs only).
  6. Applying edits
    • When a user accepts (or edits+accepts):
      • Backend:
        • Verifies the suggestion is still valid (entity not heavily changed in the meantime).
        • Constructs the appropriate API payload for Wikidata.
        • Performs the edit using the user’s OAuth token.
        • Writes an explicit edit summary, such as: “Adding Dutch plural form via Wikidata Gap Fixer (rule: NL_REGULAR_PLURAL, manually reviewed).”
      • Marks the suggestion as applied (or failed with error details).
  7. Feedback and learning loop
    • Periodically analyze:
      • Acceptance/rejection rates per rule and suggestion type.
      • Common rejection reasons.
    • Use this to:
      • Tighten or relax rule thresholds.
      • Remove or revise low‑precision rules.
      • Adjust when/if we call the LLM (e.g. only for borderline cases).
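
The storage, apply, and feedback steps above can be sketched as a small state machine. This is a simplified illustration under stated assumptions: the class and function names are invented, the staleness check compares revision IDs exactly (stricter than "not heavily changed"), and the actual Wikidata API write is only marked by a comment.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

# Allowed status transitions, using the status names from the storage step.
TRANSITIONS = {
    "pending": {"accepted", "rejected", "superseded"},
    "accepted": {"applied", "superseded"},
}


@dataclass
class Suggestion:
    entity_id: str          # lexeme or item ID
    rule_id: str            # e.g. NL_REGULAR_PLURAL
    proposed_change: dict   # small patch to apply
    base_revision: int      # assumption: revision ID seen at generation time
    status: str = "pending"
    audit: List[Tuple[str, str, str]] = field(default_factory=list)  # (user, old, new)

    def transition(self, new_status: str, user: str) -> None:
        """Move to a new status and record who did it."""
        if new_status not in TRANSITIONS.get(self.status, set()):
            raise ValueError(f"cannot go from {self.status} to {new_status}")
        self.audit.append((user, self.status, new_status))
        self.status = new_status


def apply_if_fresh(s: Suggestion, current_revision: int, user: str) -> bool:
    """Apply only if the entity is unchanged since generation; else supersede."""
    if current_revision != s.base_revision:
        s.transition("superseded", user)
        return False
    s.transition("accepted", user)
    # A real backend would perform the Wikidata API edit here.
    s.transition("applied", user)
    return True


def acceptance_rates(suggestions: Iterable[Suggestion]) -> Dict[str, float]:
    """Share of reviewed suggestions per rule that were accepted or applied."""
    accepted, reviewed = Counter(), Counter()
    for s in suggestions:
        if s.status in {"accepted", "applied", "rejected"}:
            reviewed[s.rule_id] += 1
            if s.status != "rejected":
                accepted[s.rule_id] += 1
    return {rule: accepted[rule] / count for rule, count in reviewed.items()}
```

Superseded suggestions are excluded from the rates because no human judged them; a rule whose rate stays low would be a candidate for tightening or removal.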

LLM considerations (revised)

  • External provider choice (pragmatic recommendation)
    • Use an external hosted model that is cheap, fast, and good at short, structured tasks.
    • Start with something like OpenAI’s gpt-4.1-mini (or a similar “mini/flash/haiku”‑class model):
      • Good at following JSON/structured prompts.
      • Much cheaper than full flagship models.
      • Fast enough for interactive use.
  • Cost control strategy
    • Keep prompts very short and structured: send only essential labels/descriptions and a compact schema.
    • Favor classification/validation (“is this suggestion OK?”) and short gloss generation over long explanations.
    • Cache results for identical prompts where possible (same item/lexeme, same context).
  • Abstraction layer
    • Wrap all LLM calls behind a small internal API, so that maintainers can:
      • Swap gpt-4.1-mini for alternatives like Anthropic Claude Haiku or Google Gemini Flash if pricing/terms change.
      • Disable LLM use entirely (falling back to rule‑only suggestions) without rewriting the rest of the system.
  • Data minimization and privacy
    • Only send:
      • Item/lexeme IDs, short labels, terse descriptions, and maybe 1–2 key statements.
      • No user identifiers or edit histories.
    • Clearly document what is sent to the external provider and why, so community members can assess the privacy implications.
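
A rough sketch of the abstraction layer, caching, and rule-only fallback described above. The class name SuggestionValidator, the decision labels, and the injected call_model callable are assumptions for illustration; no real provider SDK is used, so any provider could be wrapped in a function that takes a prompt string and returns the model's raw text.

```python
import hashlib
import json
from typing import Callable, Dict, Optional

# Hypothetical labels for "likely correct / unclear / probably wrong".
ALLOWED_DECISIONS = {"likely_correct", "unclear", "probably_wrong"}


def parse_verdict(raw: str) -> dict:
    """Parse model output; anything malformed degrades to 'unclear'."""
    fallback = {"decision": "unclear", "confidence": 0.0, "rationale": ""}
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return fallback
    if not isinstance(data, dict):
        return fallback
    decision = data.get("decision")
    if not isinstance(decision, str) or decision not in ALLOWED_DECISIONS:
        return fallback
    conf = data.get("confidence", 0.0)
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        conf = 0.0
    conf = float(conf)
    if not 0.0 <= conf <= 1.0:  # out-of-range or NaN values are not trusted
        conf = 0.0
    return {"decision": decision, "confidence": conf,
            "rationale": str(data.get("rationale", ""))}


class SuggestionValidator:
    """Single entry point for LLM calls; pass call_model=None for rule-only mode."""

    def __init__(self, call_model: Optional[Callable[[str], str]] = None):
        self._call_model = call_model
        self._cache: Dict[str, dict] = {}

    def validate(self, context: dict) -> dict:
        if self._call_model is None:
            return {"decision": "unclear", "confidence": 0.0,
                    "rationale": "LLM disabled; rule-only suggestion"}
        # Context should hold only IDs, short labels, and terse descriptions.
        prompt = json.dumps(context, sort_keys=True, ensure_ascii=False)
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._cache:  # identical prompts hit the cache
            self._cache[key] = parse_verdict(self._call_model(prompt))
        return self._cache[key]
```

Swapping providers then means passing a different call_model, and disabling LLM use means passing None; the rest of the pipeline only sees the verdict dictionary. The in-memory cache is a stand-in for whatever persistent cache the service would use.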

Risks / open questions

  • Community acceptance:
    • AI‑assisted suggestions may raise concerns; we need clear documentation and opt‑in usage.
  • Model bias / hallucination:
    • LLMs can make plausible‑sounding but wrong suggestions; mitigated by:
      • Conservative rule‑based pre‑filtering.
      • Strict human review requirement.
  • Scaling to other languages:
    • Need clear patterns for how language modules are defined and maintained by interested communities.
  • Ops and maintenance:
    • Who runs and maintains the service (infrastructure, LLM keys, rate limits)?

Deliverables (initial phase)

  • A minimal but working backend + frontend for Dutch lexemes/items with:
    • At least 1–2 conservative suggestion types implemented end‑to‑end.
    • OAuth login and Wikidata integration.
    • Basic queue UI for reviewing and applying suggestions.
  • Documentation:
    • Explanation of each suggestion type and its heuristics.
    • Privacy/LLM data‑handling notes.
    • How other language communities could add their own modules later.

Event Timeline

DSmit-WMF renamed this task from “Prototype agent to suggest missing Dutch forms and glosses for Wikidata Lexemes” to “Wikidata Lexeme Gap Suggester: AI‑assisted detection and fixing of missing lexeme and item content (starting with Dutch)”. Mar 13 2026, 11:10 AM
DSmit-WMF updated the task description.
DSmit-WMF renamed this task from “Wikidata Lexeme Gap Suggester: AI‑assisted detection and fixing of missing lexeme and item content (starting with Dutch)” to “Wikidata Lexeme Gap Fixer: AI‑assisted detection and fixing of missing lexeme and item content (starting with Dutch)”. Mar 13 2026, 2:18 PM
DSmit-WMF renamed this task from “Wikidata Lexeme Gap Fixer: AI‑assisted detection and fixing of missing lexeme and item content (starting with Dutch)” to “Wikidata Gap Fixer: AI‑assisted detection and fixing of missing lexeme and item content (starting with Dutch)”.
DSmit-WMF changed the task status from Open to In Progress.
DSmit-WMF claimed this task.