Problem
Many Wikidata lexemes, in Dutch and even more so in smaller languages, are missing basic morphological forms (e.g. plural nouns, past participles) and glosses. This project uses AI to suggest fixes for these simple, low‑hanging gaps, focusing on cases where high‑confidence, pattern‑based suggestions are realistic. This focus means the tool will not cover very small languages.
It will focus on:
- Lexemes (missing senses, forms, grammatical features, and lexeme–item links).
- Items (missing or inconsistent labels/descriptions/aliases in the target language that should be derivable from lexemes and existing multilingual data).
Today, filling these gaps is:
- Time‑consuming and manual for editors.
- Hard to prioritize (no good overview of “high‑impact, high‑confidence” missing content).
At the same time, modern language models are reasonably good at spotting missing/misaligned pieces of structured lexical information given enough context, provided that humans stay in the loop to approve or correct suggestions.
We lack a dedicated tool that:
- Systematically detects gaps in lexeme/item data for a target language.
- Uses AI/LLMs to propose concrete, small, reviewable edits (not bulk auto‑edits).
- Provides an efficient human‑in‑the‑loop UI where editors can accept/reject/edit those suggestions and push them to Wikidata.
Goal
Create an experimental, opt‑in tool (working name: “Wikidata Gap Fixer”) that:
- Detects gaps in lexeme and item data for a configurable language (initially Dutch).
- Uses an AI/LLM‑based pipeline to generate high‑confidence, explainable suggestions for filling those gaps.
- Presents these suggestions in a review queue where human editors can:
- Accept as‑is.
- Edit then accept.
- Reject (feeding back into the system’s scoring/filters).
- Applies accepted suggestions to Wikidata via the API under the editor’s account, with clear, tool‑specific edit summaries.
The long‑term design should be language‑agnostic so that other communities can plug in their language‑specific rules and data sources.
Scope (initial phase)
- Focus language: Dutch (nl).
- Domain: Lexemes and directly related item content.
- Initial suggestion types (MVP):
- Lexeme–item linking suggestions
- Suggest Lexeme → Item links where:
- Lemma matches an item label/alias in Dutch.
- There is clear disambiguation via sitelinks, existing statements, or IDs.
- The LLM can confirm the match given context (definitions, examples).
- Lexeme form / grammatical feature suggestions (regular patterns only)
- Propose missing forms (e.g. Dutch noun plurals, regular verb inflections) where patterns are simple and low risk.
- Suggest missing grammatical features on existing forms (e.g. gender for Dutch nouns) based on regular patterns + existing data.
- Dutch item label/description suggestions from multilingual data
- When an item has rich labels/descriptions in other languages and related lexemes, suggest:
- Missing Dutch labels.
- Missing or obviously improvable Dutch descriptions.
Everything is review‑only: no automatic edits without an explicit human action.
Users / Personas
- Wikidata lexeme contributors (especially Dutch‑focused editors).
- Wikidata item editors interested in improving Dutch labels/descriptions.
- Language‑focused WikiProjects wanting a queue of “low‑hanging fruit” edits.
User stories
- As a Dutch lexeme editor, I want a list of lexemes with high‑confidence suggestions (e.g. missing plurals, missing lexeme–item links), so that I can quickly review and apply them.
- As a Wikidata item editor, I want suggested Dutch labels/descriptions for items already well‑described in other languages, so that I can improve Dutch coverage efficiently.
- As a community member, I want transparent documentation of how each suggestion type is generated, so that I can trust the suggestions or opt out of suggestion types that are too aggressive.
- As a tool maintainer, I want clearly separated language‑specific rules (e.g. Dutch inflection patterns) and a generic core pipeline, so I can extend the tool to other languages later.
Non‑goals (initially)
- Fully automated bot edits without human review.
- Complex sense creation for highly polysemous or controversial terms.
- Deep semantic modeling or complex ontology reasoning.
- Full integration into the Wikidata UI (this will be a standalone, external tool initially).
Requirements / Constraints
- Human‑in‑the‑loop:
- Every AI suggestion must be explicitly accepted or edited by a logged‑in user before being applied.
- Transparency:
- Each suggestion shows:
- The exact proposed edit(s).
- A short rationale in natural language (e.g. “Plural formed via regular Dutch pattern X”).
- Confidence or priority score.
- Safety and conservatism:
- Focus on high‑precision suggestions over high recall.
- Start with conservative thresholds and simple, regular patterns.
- Wikidata integration:
- OAuth‑based login.
- Edits via the Wikidata API under the user’s account.
- Clear, tool‑specific edit summaries (e.g. “via Wikidata Gap Fixer (rule: NL_REGULAR_PLURAL)”).
- Language modularity:
- Language‑specific rules and resources live in separate modules/configs.
- Core pipeline should not hardcode Dutch; Dutch is “language profile #1”.
- LLM usage constraints:
- Respect WMF and community policies around external services, privacy, and data handling.
- Avoid sending unnecessary personal data or large amounts of content to LLM providers.
- Prefer cached/derived representations where possible.
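The language‑modularity requirement above could be met with a small profile registry, sketched below under stated assumptions. The names `LanguageProfile`, `FormRule`, `PROFILES`, `register_profile`, `suggest_forms`, and the rule ID `NL_DIMINUTIVE_PLURAL` are hypothetical and not part of any existing tool; Q7411 is the Wikidata item for Dutch.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# A rule maps a lemma to a proposed form, or None when it does not apply.
FormRule = Callable[[str], Optional[str]]


@dataclass(frozen=True)
class LanguageProfile:
    """What the core pipeline needs to know about one language (hypothetical shape)."""
    code: str                 # language code, e.g. "nl"
    language_item: str        # Wikidata item for the language, e.g. "Q7411" for Dutch
    form_rules: dict[str, FormRule] = field(default_factory=dict)  # rule ID -> rule


PROFILES: dict[str, LanguageProfile] = {}


def register_profile(profile: LanguageProfile) -> None:
    """Register a profile so the core pipeline can look it up by language code."""
    if profile.code in PROFILES:
        raise ValueError(f"profile for {profile.code!r} already registered")
    PROFILES[profile.code] = profile


def suggest_forms(language_code: str, lemma: str) -> dict[str, str]:
    """Run every rule of the profile and keep only the rules that produce a form."""
    profile = PROFILES[language_code]
    results = {}
    for rule_id, rule in profile.form_rules.items():
        form = rule(lemma)
        if form is not None:
            results[rule_id] = form
    return results


# Dutch as "language profile #1", with a single illustrative rule.
register_profile(LanguageProfile(
    code="nl",
    language_item="Q7411",
    form_rules={
        "NL_DIMINUTIVE_PLURAL": lambda lemma: lemma + "s" if lemma.endswith("je") else None,
    },
))
```

The core function `suggest_forms` never mentions Dutch; another community would only add its own `register_profile(...)` call.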
Proposed technical approach / architecture
High‑level architecture
- Backend service (e.g. Node/TypeScript, Python, or a similar stack already familiar from WMF tooling):
- Periodically scans Wikidata (via SPARQL + API) to find candidate gaps.
- Runs rule‑based pre‑filters (simple, deterministic checks).
- Where needed, calls an LLM layer to:
- Validate or refine suggestions.
- Generate short rationales or canonicalized text.
- Stores suggestions and their metadata in a database (e.g. PostgreSQL).
- Exposes an HTTP API for:
- Fetching suggestion queues.
- Accepting/rejecting/editing suggestions.
- Triggering write actions to Wikidata.
- Frontend web UI:
- Login via OAuth to Wikidata.
- Suggestion queue, with:
- Filters by language, suggestion type, confidence band.
- Per‑suggestion diff view (“before / after”).
- Actions: Accept, Edit+Accept, Reject.
- Integration with Wikidata:
- Reads via SPARQL (querying for lexemes/items with likely gaps).
- Writes via Wikidata API (edit endpoints).
- Adheres to usual rate limits, bot policies, and community practices.
Data flow (simplified)
- Candidate discovery
- Scheduled job or on‑demand query finds potential targets, e.g.:
- Dutch lexemes that have only a single form although typical language rules imply more.
- Lexemes missing Lexeme → Item links where labels/aliases strongly suggest a match.
- Items with no Dutch label/description but rich labels in other languages and related lexemes.
- Implemented via SPARQL queries + supplemental API calls as needed.
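As a rough illustration of candidate discovery, the sketch below queries the Wikidata Query Service for Dutch nouns that currently have exactly one form. It assumes the standard WDQS lexeme modelling (`dct:language`, `wikibase:lexicalCategory`, `wikibase:lemma`, `ontolex:lexicalForm`) with Q7411 for Dutch and Q1084 for noun; the function names, the `LIMIT`, and the User-Agent string are placeholders.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Dutch (Q7411) nouns (Q1084) that currently have exactly one form.
SINGLE_FORM_NOUNS_QUERY = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q7411 ;
          wikibase:lexicalCategory wd:Q1084 ;
          wikibase:lemma ?lemma ;
          ontolex:lexicalForm ?form .
}
GROUP BY ?lexeme ?lemma
HAVING (COUNT(?form) = 1)
LIMIT 100
"""


def parse_bindings(payload: dict) -> list[tuple[str, str]]:
    """Turn a SPARQL JSON result into (lexeme ID, lemma) pairs."""
    rows = []
    for binding in payload["results"]["bindings"]:
        uri = binding["lexeme"]["value"]  # entity URI ending in the lexeme ID
        rows.append((uri.rsplit("/", 1)[-1], binding["lemma"]["value"]))
    return rows


def fetch_candidates(query: str = SINGLE_FORM_NOUNS_QUERY) -> list[tuple[str, str]]:
    """Run the query against WDQS; needs network access and a descriptive User-Agent."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "WikidataGapFixer-sketch/0.1 (placeholder contact)"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_bindings(json.load(response))
```

Keeping `parse_bindings` separate from the network call makes the discovery step testable offline.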
- Rule‑based pre‑filtering
- For each candidate, apply deterministic rules per suggestion type, e.g.:
- Dutch noun pluralization with known regular suffix patterns.
- Excluding known irregular verbs from “regular” modules.
- Rejecting ambiguous lemma–item matches (multiple possible items without strong evidence).
- This stage should already eliminate obviously bad or ambiguous candidates.
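A deterministic pre‑filter of this kind might look like the sketch below. It deliberately covers only two narrow Dutch patterns (nouns in -heid, diminutives in -je) and stays silent on everything else; the exclusion list is illustrative and far from complete, and the rule IDs `NL_REGULAR_PLURAL_HEID` and `NL_REGULAR_PLURAL_DIMINUTIVE` are hypothetical refinements of the `NL_REGULAR_PLURAL` ID used elsewhere in this proposal.

```python
from typing import Optional

# Lemmas that must never be handled by the "regular" module (illustrative only).
KNOWN_IRREGULAR = {"kind", "ei", "stad", "schip"}


def suggest_regular_plural(lemma: str) -> Optional[tuple[str, str]]:
    """Return (plural, rule ID) for a narrow set of safe patterns, else None."""
    if lemma in KNOWN_IRREGULAR or " " in lemma:
        return None
    # -heid -> -heden (mogelijkheid -> mogelijkheden)
    if lemma.endswith("heid") and len(lemma) > 4:
        return lemma[:-4] + "heden", "NL_REGULAR_PLURAL_HEID"
    # Diminutives in -je take -s (huisje -> huisjes)
    if lemma.endswith("je") and len(lemma) > 3:
        return lemma + "s", "NL_REGULAR_PLURAL_DIMINUTIVE"
    # Other nouns need spelling rules (boom -> bomen, kat -> katten); stay silent.
    return None
```

Returning `None` for anything outside the known patterns favours precision over recall, in line with the conservative thresholds described under Requirements.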
- LLM assistance layer
- For candidates that pass pre‑filtering but still need semantic judgement or text generation:
- Send a minimal context prompt to an LLM, for example:
- Short description of the lexeme/item.
- Relevant labels/descriptions in other languages.
- Existing forms/senses/statements.
- The candidate suggestion and the rule that proposed it.
- Ask the LLM to:
- Confirm or flag the suggestion (e.g. “likely correct / unclear / probably wrong”).
- Generate/refine short natural language text where needed (e.g. concise Dutch description).
- Provide a brief rationale (“This matches the English lemma and described meaning…”).
- Output is a structured JSON object (decision, confidence estimate, suggested text, rationale).
- The LLM is advisory, not authoritative: a human still makes the final call.
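The prompt and structured output described above could be handled roughly as follows. This is a sketch, not a fixed schema: the field names (`decision`, `confidence`, `suggested_text`, `rationale`), the three decision labels, and the helpers `build_prompt` and `parse_verdict` are assumptions derived from the bullet points, and no specific provider API is called.

```python
import json
from dataclasses import dataclass

ALLOWED_DECISIONS = {"likely_correct", "unclear", "probably_wrong"}


@dataclass(frozen=True)
class Verdict:
    decision: str
    confidence: float
    suggested_text: str
    rationale: str


def build_prompt(entity_id: str, description: str, candidate: str, rule_id: str) -> str:
    """Assemble a minimal-context prompt from short, public entity data."""
    context = {"entity": entity_id, "description": description,
               "candidate": candidate, "rule": rule_id}
    return (
        "Assess the candidate edit below. Reply with one JSON object with keys "
        "decision (likely_correct | unclear | probably_wrong), confidence (0-1), "
        "suggested_text, rationale.\n" + json.dumps(context, ensure_ascii=False)
    )


def parse_verdict(raw: str) -> Verdict:
    """Validate the model's reply; anything malformed is downgraded to 'unclear'."""
    try:
        data = json.loads(raw)
        decision = data["decision"]
        confidence = float(data["confidence"])
        if decision not in ALLOWED_DECISIONS or not 0.0 <= confidence <= 1.0:
            raise ValueError("field out of range")
        return Verdict(decision, confidence,
                       str(data.get("suggested_text", "")),
                       str(data.get("rationale", "")))
    except (ValueError, KeyError, TypeError):
        return Verdict("unclear", 0.0, "", "unparseable model output")
```

Treating malformed replies as "unclear" keeps a misbehaving model from ever raising a suggestion's confidence.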
- Suggestion storage
- Store each suggestion with:
- Target entity (lexeme or item ID).
- Proposed change(s) (as a small patch or set of statements).
- Suggestion type and rule ID (e.g. NL_REGULAR_PLURAL, LEXEME_ITEM_LINK).
- Confidence scores (rule‑based and LLM‑based).
- Human‑readable rationale.
- Status: pending, accepted, rejected, applied, superseded, etc.
- Audit trail (who accepted/rejected, when, what was changed before/after).
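One possible in‑memory shape for a stored suggestion is sketched below. The field names, the `Status` enum, and in particular the `ALLOWED_TRANSITIONS` table are assumptions; the proposal lists the statuses but does not define a state machine, and a real implementation would persist this in the database rather than in Python objects.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    APPLIED = "applied"
    SUPERSEDED = "superseded"


# Which status changes are allowed (an assumption, not specified in the proposal).
ALLOWED_TRANSITIONS = {
    Status.PENDING: {Status.ACCEPTED, Status.REJECTED, Status.SUPERSEDED},
    Status.ACCEPTED: {Status.APPLIED, Status.SUPERSEDED},
}


@dataclass
class Suggestion:
    entity_id: str                    # lexeme or item ID
    patch: dict                       # proposed change(s)
    rule_id: str                      # e.g. "NL_REGULAR_PLURAL"
    rule_confidence: float
    llm_confidence: Optional[float]   # None when no LLM was consulted
    rationale: str
    status: Status = Status.PENDING
    audit_trail: list[dict] = field(default_factory=list)

    def transition(self, new_status: Status, user: str) -> None:
        """Change status and record who did it and when."""
        if new_status not in ALLOWED_TRANSITIONS.get(self.status, set()):
            raise ValueError(f"cannot go from {self.status.value} to {new_status.value}")
        self.audit_trail.append({
            "user": user,
            "from": self.status.value,
            "to": new_status.value,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        self.status = new_status
```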
- Review UI
- Frontend fetches pending suggestions for the logged‑in user.
- For each suggestion:
- Show:
- Entity summary (labels, key statements, links).
- The proposed edit in a diff‑like view.
- Rationale and confidence badge.
- Allow:
- Accept: send to backend to apply via Wikidata API.
- Edit: let the user tweak the suggestion (e.g. adjust gloss) then accept.
- Reject: mark as rejected; optionally ask for a short reason or category (e.g. “wrong lemma”, “too ambiguous”).
- Provide filters:
- By suggestion type (forms, labels, links, etc.).
- By confidence.
- By entity type or project (e.g. nouns only, verbs only).
- Applying edits
- When a user accepts (or edits+accepts):
- Backend:
- Verifies the suggestion is still valid (e.g. the entity has not been edited in a conflicting way since the suggestion was generated).
- Constructs the appropriate API payload for Wikidata.
- Performs the edit using the user’s OAuth token.
- Writes an explicit edit summary, such as: “Adding Dutch plural form via Wikidata Gap Fixer (rule: NL_REGULAR_PLURAL, manually reviewed).”
- Marks the suggestion as applied (or failed with error details).
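The payload construction step could be sketched as below for the "add a plural form" case. It assumes the `wbeditentity` action with a `forms` entry using `"add": ""`, as accepted by the WikibaseLexeme extension, and uses `baserevid` so the API can detect conflicting edits; verify both against the current API documentation before relying on them. The function names are hypothetical, the OAuth‑signed POST itself is omitted, and Q146786 is the Wikidata item for "plural".

```python
import json

PLURAL_FEATURE = "Q146786"  # Wikidata item for the grammatical feature "plural"


def build_edit_summary(action: str, rule_id: str) -> str:
    """Tool-specific edit summary in the style described in this proposal."""
    return f"{action} via Wikidata Gap Fixer (rule: {rule_id}, manually reviewed)"


def build_add_form_request(lexeme_id: str, representation: str, language: str,
                           features: list[str], base_revision: int,
                           rule_id: str, csrf_token: str) -> dict[str, str]:
    """Build POST parameters for adding one form to an existing lexeme."""
    data = {"forms": [{
        "add": "",  # marks this entry as a new form rather than an edit
        "representations": {language: {"language": language, "value": representation}},
        "grammaticalFeatures": features,
        "claims": [],
    }]}
    return {
        "action": "wbeditentity",
        "id": lexeme_id,
        "data": json.dumps(data, ensure_ascii=False),
        "baserevid": str(base_revision),  # revision the suggestion was generated against
        "summary": build_edit_summary("Adding Dutch plural form", rule_id),
        "token": csrf_token,
        "format": "json",
    }
```

Passing the revision ID recorded at suggestion time gives a cheap staleness check on top of any validation the backend does itself.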
- Feedback and learning loop
- Periodically analyze:
- Acceptance/rejection rates per rule and suggestion type.
- Common rejection reasons.
- Use this to:
- Tighten or relax rule thresholds.
- Remove or revise low‑precision rules.
- Adjust when/if we call the LLM (e.g. only for borderline cases).
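The periodic analysis could start as simply as the sketch below, which computes acceptance rates per rule and flags rules below a precision floor. The function names and the 0.9 default threshold are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def acceptance_rates(reviews: list[tuple[str, str]]) -> dict[str, float]:
    """reviews holds (rule_id, outcome) pairs; outcome is 'accepted' or 'rejected'."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for rule_id, outcome in reviews:
        counts[rule_id][outcome] += 1
    return {
        rule_id: c["accepted"] / (c["accepted"] + c["rejected"])
        for rule_id, c in counts.items()
        if c["accepted"] + c["rejected"] > 0
    }


def rules_to_revisit(rates: dict[str, float], min_precision: float = 0.9) -> list[str]:
    """Rules whose acceptance rate falls below the (assumed) precision floor."""
    return sorted(rule for rule, rate in rates.items() if rate < min_precision)
```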
LLM considerations
- External provider choice (pragmatic recommendation)
- Use an external hosted model that is cheap, fast, and good at short, structured tasks.
- Start with something like OpenAI’s gpt-4.1-mini (or a similar “mini/flash/haiku”‑class model):
- Good at following JSON/structured prompts.
- Much cheaper than full flagship models.
- Fast enough for interactive use.
- Cost control strategy
- Keep prompts very short and structured: send only essential labels/descriptions and a compact schema.
- Favor classification/validation (“is this suggestion OK?”) and short gloss generation over long explanations.
- Cache results for identical prompts where possible (same item/lexeme, same context).
- Abstraction layer
- Wrap all LLM calls behind a small internal API, so that maintainers can:
- Swap gpt-4.1-mini for alternatives like Anthropic Claude Haiku or Google Gemini Flash if pricing/terms change.
- Disable LLM use entirely (falling back to rule‑only suggestions) without rewriting the rest of the system.
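A minimal sketch of such an internal API, combined with the prompt‑level caching mentioned under cost control, is shown below. `LLMBackend`, `DisabledBackend`, and `CachingBackend` are hypothetical names; a provider‑specific backend would implement the same `complete` method, and a production cache would live in a shared store rather than a dict.

```python
import hashlib
from typing import Optional, Protocol


class LLMBackend(Protocol):
    def complete(self, prompt: str) -> Optional[str]:
        """Return the raw model reply, or None when no LLM opinion is available."""
        ...


class DisabledBackend:
    """Rule-only mode: never calls a model."""

    def complete(self, prompt: str) -> Optional[str]:
        return None


class CachingBackend:
    """Wraps any backend and reuses replies for identical prompts."""

    def __init__(self, inner: LLMBackend) -> None:
        self.inner = inner
        self.cache: dict[str, Optional[str]] = {}
        self.calls = 0  # number of requests actually forwarded to the inner backend

    def complete(self, prompt: str) -> Optional[str]:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.inner.complete(prompt)
        return self.cache[key]
```

Because the rest of the pipeline only sees `complete`, swapping providers or switching to `DisabledBackend` is a configuration change rather than a rewrite.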
- Data minimization and privacy
- Only send:
- Item/lexeme IDs, short labels, terse descriptions, and at most 1–2 key statements.
- No user identifiers or edit histories.
- Clearly document what is sent to the external provider and why, so community members can assess the privacy implications.
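Data minimization can be enforced in code with an allowlist, as in the sketch below. The field names, the limit of two statements, and the 200‑character cap are assumptions for illustration; the point is that anything not explicitly allowed never reaches the provider.

```python
ALLOWED_FIELDS = ("id", "label", "description", "statements")
MAX_STATEMENTS = 2      # "1-2 key statements"
MAX_TEXT_LENGTH = 200   # keep labels and descriptions terse


def minimal_context(entity: dict) -> dict:
    """Keep only allowlisted fields; drop everything else (users, history, ...)."""
    context = {}
    for name in ALLOWED_FIELDS:
        if name not in entity:
            continue
        value = entity[name]
        if name == "statements":
            value = list(value)[:MAX_STATEMENTS]
        elif isinstance(value, str):
            value = value[:MAX_TEXT_LENGTH]
        context[name] = value
    return context
```

The allowlist also doubles as documentation of exactly what is sent to the external provider.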
Risks / open questions
- Community acceptance:
- AI‑assisted suggestions may raise concerns; we need clear documentation and opt‑in usage.
- Model bias / hallucination:
- LLMs can make plausible‑sounding but wrong suggestions; mitigated by:
- Conservative rule‑based pre‑filtering.
- Strict human review requirement.
- Scaling to other languages:
- Need clear patterns for how language modules are defined and maintained by interested communities.
- Ops and maintenance:
- Who runs and maintains the service (infrastructure, LLM keys, rate limits)?
Deliverables (initial phase)
- A minimal but working backend + frontend for Dutch lexemes/items with:
- At least 1–2 conservative suggestion types implemented end‑to‑end.
- OAuth login and Wikidata integration.
- Basic queue UI for reviewing and applying suggestions.
- Documentation:
- Explanation of each suggestion type and its heuristics.
- Privacy/LLM data‑handling notes.
- How other language communities could add their own modules later.