Wikidata Integration
Open, Needs Triage, Public

Description

This task proposes integrating Wikidata as an external knowledge source into Wanda (https://www.mediawiki.org/wiki/Extension:Wanda). With the integration, Wanda would query Wikidata to answer questions about entities, facts, and structured data without requiring users to specify entity IDs or property IDs manually.

Planned Features

Core Functionality

  • Automatically identify and resolve entity IDs (QIDs) from natural language queries without user intervention
  • Automatically identify and resolve property IDs (PIDs) relevant to user questions
  • Break down complex queries into logical thinking steps to construct appropriate SPARQL queries or Wikidata API calls
  • Generate optimized Wikidata queries based on user intent and context
  • Fetch and process structured data from Wikidata including:
    • Entity properties and values
    • Relationships between entities
    • Qualifiers and references
    • Temporal and quantitative data
  • Combine Wikidata results with wiki content for comprehensive answers
  • Allow users to query both local wiki content and Wikidata simultaneously

Technical Implementation

The integration will use a multi-step agentic approach:

  1. Intent Analysis: The LLM analyzes the user query to identify:
    • Target entities mentioned
    • Properties/attributes being queried
    • Relationships between entities
    • Query type (factual lookup, comparison, temporal query, etc.)
  2. Entity Resolution: An agentic process identifies the correct entity IDs:
    • Search Wikidata for candidate entities
    • Use context clues to disambiguate
    • Confirm entity selection through reasoning
    • Handle aliases and multilingual labels
  3. Property Identification: Determine the relevant property IDs:
    • Map natural language attributes to Wikidata properties
    • Select appropriate properties for the query type
    • Include relevant qualifiers and references
  4. Query Construction: Build optimized queries:
    • Generate SPARQL for complex queries
    • Use the Wikidata API for simple lookups
    • Construct federated queries when needed
    • Apply filters and constraints
  5. Result Processing: Process and synthesize results:
    • Parse structured data responses
    • Format results for natural language output
    • Combine with wiki context if available
    • Provide source citations
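
The entity resolution step could be sketched roughly as follows. This is an illustration only, not the planned implementation: it builds a request for the existing Wikidata `wbsearchentities` API module and then uses plain keyword overlap as a stand-in for the LLM-based disambiguation described above. The function names and the placeholder candidate `Q0` are made up for this example.

```python
from typing import Optional
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(term: str, language: str = "en", limit: int = 5) -> str:
    """Build a wbsearchentities request URL that returns candidate items."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def pick_candidate(candidates: list, context_words: set) -> Optional[dict]:
    """Pick the candidate whose description shares the most words with the context.

    Ties keep the search ranking: an earlier candidate wins over a later one.
    Keyword overlap is only a placeholder for LLM reasoning.
    """
    best, best_score = None, -1
    for candidate in candidates:
        description_words = set(candidate.get("description", "").lower().split())
        score = len(context_words & description_words)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hand-written candidates shaped like the "search" array of a wbsearchentities
# response. "Q0" is a made-up placeholder, not a real item.
candidates = [
    {"id": "Q0", "label": "Albert Einstein", "description": "made-up placeholder entry"},
    {"id": "Q937", "label": "Albert Einstein", "description": "German-born theoretical physicist"},
]

print(build_search_url("Albert Einstein"))
print(pick_candidate(candidates, {"physicist", "born"})["id"])  # Q937
```
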

Configuration Options

// Enable Wikidata integration
$wgWandaEnableWikidata = true;

// Wikidata SPARQL endpoint
$wgWandaWikidataSparqlEndpoint = 'https://query.wikidata.org/sparql';

// Wikidata API endpoint
$wgWandaWikidataApiEndpoint = 'https://www.wikidata.org/w/api.php';

// Enable agentic resolution (multi-step thinking)
$wgWandaWikidataAgenticMode = true;

// Show thinking steps to users
$wgWandaWikidataShowThinking = false;

// Cache duration for Wikidata queries (in seconds)
$wgWandaWikidataCacheDuration = 3600;

// Maximum SPARQL query timeout (in seconds)
$wgWandaWikidataQueryTimeout = 30;

// Preferred language for Wikidata labels
$wgWandaWikidataLanguage = 'en';

// Fallback languages for labels
$wgWandaWikidataFallbackLanguages = [ 'en', 'mul' ];
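
As an illustration of how the two language settings might interact, the sketch below resolves a label by trying the preferred language first and then each fallback in order. It is written in Python for brevity and is not part of the extension; `pick_label` and the two constants are hypothetical names that simply mirror the settings above. The `labels` dictionary follows the shape of the `labels` object returned by the Wikidata `wbgetentities` API module.

```python
from typing import Optional

# Hypothetical mirrors of $wgWandaWikidataLanguage and
# $wgWandaWikidataFallbackLanguages.
PREFERRED_LANGUAGE = "en"
FALLBACK_LANGUAGES = ["en", "mul"]


def pick_label(labels: dict, preferred: str = PREFERRED_LANGUAGE,
               fallbacks: Optional[list] = None) -> Optional[str]:
    """Return the first available label: preferred language, then each fallback."""
    if fallbacks is None:
        fallbacks = FALLBACK_LANGUAGES
    for language in [preferred, *fallbacks]:
        entry = labels.get(language)
        if entry:
            return entry["value"]
    return None  # no label in any configured language


# Only a "mul" (multiple languages) label is present in this sample.
labels = {"mul": {"language": "mul", "value": "Albert Einstein"}}
print(pick_label(labels, preferred="de"))  # Albert Einstein
```
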

API Enhancements

Extend the existing action=wandachat API with new parameters:

| Parameter | Required | Description |
| --- | --- | --- |
| sources | No | Comma-separated list of sources to query (e.g., 'wiki,wikidata') |
| wikidataonly | No | Query only Wikidata, skip local wiki content (default: false) |
| showthinking | No | Include reasoning steps in response (default: false) |
| wikidatalang | No | Preferred language for Wikidata labels (default: wiki content language) |
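
One way the `sources` and `wikidataonly` parameters could be reconciled is sketched below. Several points are assumptions rather than part of this proposal: that `wikidataonly` takes precedence over `sources`, that `wiki` and `wikidata` are the only valid source names, and that an empty `sources` value falls back to local wiki content. `resolve_sources` is a hypothetical helper name.

```python
# Assumption: the two names from the example value are the only valid sources.
ALLOWED_SOURCES = ("wiki", "wikidata")


def resolve_sources(sources: str = "", wikidataonly: bool = False) -> list:
    """Turn the raw parameter values into an ordered list of sources to query."""
    # Assumption: wikidataonly overrides whatever was passed in `sources`.
    if wikidataonly:
        return ["wikidata"]
    requested = [name.strip().lower() for name in sources.split(",") if name.strip()]
    unknown = [name for name in requested if name not in ALLOWED_SOURCES]
    if unknown:
        raise ValueError(f"Unknown source(s): {', '.join(unknown)}")
    # Drop duplicates while keeping the caller's order.
    # Assumption: default to local wiki content when nothing was requested.
    return list(dict.fromkeys(requested)) or ["wiki"]


print(resolve_sources("wiki,wikidata"))  # ['wiki', 'wikidata']
```
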

Example API request:

api.php?action=wandachat&message=When%20was%20Albert%20Einstein%20born&sources=wiki,wikidata&showthinking=true&format=json
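
A client could assemble this request with the Python standard library, as in the sketch below. `build_wandachat_url` is a hypothetical helper. Because MediaWiki API boolean parameters count as true whenever they are present, the sketch omits `showthinking` entirely when it is not wanted instead of sending `showthinking=false`.

```python
from urllib.parse import quote, urlencode


def build_wandachat_url(message: str, sources: str, showthinking: bool = False) -> str:
    """Build a relative api.php URL for the proposed wandachat parameters."""
    params = {"action": "wandachat", "message": message, "sources": sources}
    if showthinking:
        # Boolean API parameters are true when present, so only add it if wanted.
        params["showthinking"] = "true"
    params["format"] = "json"
    # quote() encodes spaces as %20; safe="," keeps the source list readable.
    return "api.php?" + urlencode(params, safe=",", quote_via=quote)


print(build_wandachat_url("When was Albert Einstein born", "wiki,wikidata", showthinking=True))
```
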

Example API response:

{
  "wandachat": {
    "response": "Albert Einstein was born on March 14, 1879 in Ulm, Germany.",
    "success": true,
    "sources_used": ["wikidata"],
    "entities_resolved": [
      {
        "label": "Albert Einstein",
        "qid": "Q937",
        "description": "German-born theoretical physicist"
      }
    ],
    "properties_used": [
      {
        "label": "date of birth",
        "pid": "P569"
      },
      {
        "label": "place of birth",
        "pid": "P19"
      }
    ],
    "thinking_steps": [
      "1. Identified entity: Albert Einstein",
      "2. Searched Wikidata for matching entities",
      "3. Resolved to Q937 (Albert Einstein - physicist)",
      "4. Identified required properties: date of birth (P569), place of birth (P19)",
      "5. Queried Wikidata for P569 and P19 values of Q937",
      "6. Retrieved: 14 March 1879; Ulm"
    ]
  }
}
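
A consumer of this response could turn the resolved entities and properties into source citations. The sketch below assumes the response shape shown above (which is itself only proposed) and uses the standard Wikidata page URL patterns. `build_citations` is a hypothetical helper name.

```python
import json

ENTITY_URL = "https://www.wikidata.org/wiki/{qid}"
PROPERTY_URL = "https://www.wikidata.org/wiki/Property:{pid}"


def build_citations(payload: dict) -> list:
    """Return one 'label (ID): URL' line per resolved entity and used property."""
    body = payload.get("wandachat", {})
    lines = []
    for entity in body.get("entities_resolved", []):
        url = ENTITY_URL.format(qid=entity["qid"])
        lines.append(f"{entity['label']} ({entity['qid']}): {url}")
    for prop in body.get("properties_used", []):
        url = PROPERTY_URL.format(pid=prop["pid"])
        lines.append(f"{prop['label']} ({prop['pid']}): {url}")
    return lines


# Trimmed copy of the example response.
raw = """
{
  "wandachat": {
    "entities_resolved": [{"label": "Albert Einstein", "qid": "Q937"}],
    "properties_used": [
      {"label": "date of birth", "pid": "P569"},
      {"label": "place of birth", "pid": "P19"}
    ]
  }
}
"""

for line in build_citations(json.loads(raw)):
    print(line)
```
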

Example Use Cases

Use Case 1: Biographical Information

Query: "Who was Marie Curie's husband?"

Agentic Process:

  1. Identify entity: Marie Curie → Q7186 (physicist and chemist)
  2. Identify property: spouse → P26
  3. Query: Get P26 values for Q7186
  4. Result: Pierre Curie (Q37463)
  5. Format response with context

Response: "Marie Curie's husband was Pierre Curie, a French physicist. They were married in 1895 and worked together on radioactivity research."
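
Step 3 of this use case could map to a single-hop SPARQL query. The sketch below only builds the query text; sending it to the SPARQL endpoint is left out. `build_direct_claim_query` is a hypothetical helper, and the ID checks are a simple guard against injecting arbitrary text into the query.

```python
import re


def build_direct_claim_query(qid: str, pid: str, language: str = "en") -> str:
    """Build a SPARQL query returning the values of one property for one item."""
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"Not a valid item ID: {qid!r}")
    if not re.fullmatch(r"P[1-9]\d*", pid):
        raise ValueError(f"Not a valid property ID: {pid!r}")
    if not re.fullmatch(r"[a-z]{2,3}(-[a-z]+)?", language):
        raise ValueError(f"Not a supported language code: {language!r}")
    # wdt: selects "truthy" values, i.e. the best-ranked statements only.
    return (
        "SELECT ?value ?valueLabel WHERE {\n"
        f"  wd:{qid} wdt:{pid} ?value .\n"
        f'  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}". }}\n'
        "}"
    )


# Spouse (P26) of Marie Curie (Q7186).
print(build_direct_claim_query("Q7186", "P26"))
```
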

Use Case 2: Comparative Queries

Query: "Compare the population of Tokyo and New York"

Agentic Process:

  1. Identify entities: Tokyo → Q1490, New York → Q60
  2. Identify property: population → P1082
  3. Query: Get P1082 with temporal qualifiers for both entities
  4. Compare values and dates
  5. Format comparative response

Response: "As of 2024, Tokyo has a population of approximately 14 million (metropolitan area: 37.4 million), while New York City has a population of approximately 8.3 million (metropolitan area: 19.6 million)."
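
Steps 3 and 4 of this use case might look like the sketch below: a SPARQL query that reads population statements together with their "point in time" qualifier (P585), followed by a helper that keeps the most recent statement per city. The rows are simplified dictionaries rather than the raw SPARQL JSON bindings, and the population figures are made-up placeholders, not real Wikidata values. Both function names are hypothetical.

```python
def build_population_query(qids: list) -> str:
    """Build a query for population (P1082) statements and their P585 dates."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?city ?population ?date WHERE {\n"
        f"  VALUES ?city {{ {values} }}\n"
        "  ?city p:P1082 ?statement .\n"
        "  ?statement ps:P1082 ?population .\n"
        "  OPTIONAL { ?statement pq:P585 ?date . }\n"
        "}"
    )


def latest_per_city(rows: list) -> dict:
    """Keep, for each city, the row with the most recent date.

    Dates are ISO 8601 strings in one format, so string comparison orders them
    correctly. Undated rows compare as oldest.
    """
    latest = {}
    for row in rows:
        current = latest.get(row["city"])
        if current is None or row.get("date", "") > current.get("date", ""):
            latest[row["city"]] = row
    return latest


# Placeholder figures for illustration only.
rows = [
    {"city": "Q1490", "population": 100, "date": "2010-01-01T00:00:00Z"},
    {"city": "Q1490", "population": 120, "date": "2020-01-01T00:00:00Z"},
    {"city": "Q60", "population": 80, "date": "2020-01-01T00:00:00Z"},
    {"city": "Q60", "population": 70},
]

print(build_population_query(["Q1490", "Q60"]))
print({city: row["population"] for city, row in latest_per_city(rows).items()})
```
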

Use Case 3: Temporal Queries

Query: "When did World War II start and end?"

Agentic Process:

  1. Identify entity: World War II → Q362
  2. Identify properties: start time → P580, end time → P582
  3. Query: Get P580 and P582 values for Q362
  4. Format temporal response

Response: "World War II started on September 1, 1939 and ended on September 2, 1945, lasting approximately 6 years."
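
The "approximately 6 years" part of such a response can be derived from the two retrieved values. The sketch below parses day-precision time strings in the format used by Wikidata's JSON (for example `+1939-09-01T00:00:00Z`) and computes the elapsed time. It does not handle values with year or month precision, where the month or day is `00`, nor BCE dates. The helper names are hypothetical.

```python
from datetime import date


def parse_wikidata_day(time_value: str) -> date:
    """Parse a day-precision Wikidata time string such as '+1939-09-01T00:00:00Z'."""
    return date.fromisoformat(time_value.lstrip("+")[:10])


def whole_years_between(start: date, end: date) -> int:
    """Count completed years between two dates."""
    before_anniversary = (end.month, end.day) < (start.month, start.day)
    return end.year - start.year - int(before_anniversary)


start = parse_wikidata_day("+1939-09-01T00:00:00Z")  # start time (P580)
end = parse_wikidata_day("+1945-09-02T00:00:00Z")    # end time (P582)

print((end - start).days)               # 2193
print(whole_years_between(start, end))  # 6
```
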

Use Case 4: Relationship Traversal

Query: "List the novels written by the author of Pride and Prejudice"

Agentic Process:

  1. Identify work: Pride and Prejudice → Q170583
  2. Query property: author → P50
  3. Result: Jane Austen → Q36322
  4. Query inverse: works by author (P800 or P50 inverse)
  5. Filter by instance of novel (Q8261)
  6. Format list response

Response: "The author of Pride and Prejudice is Jane Austen. Her novels include: Sense and Sensibility (1811), Pride and Prejudice (1813), Mansfield Park (1814), Emma (1815), Northanger Abbey (1817), and Persuasion (1817)."
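
Steps 1 to 5 of this use case could collapse into one two-hop SPARQL query, sketched below as a query builder. It follows the steps literally, going from work to author via P50, then from author to other works via the inverse of P50, restricted to instances of novel (Q8261). Real Wikidata modelling may require a broader class match, so treat the filter as an assumption. `build_novels_by_same_author_query` is a hypothetical helper name.

```python
import re


def build_novels_by_same_author_query(work_qid: str, language: str = "en") -> str:
    """Build a query listing novels that share an author with the given work."""
    if not re.fullmatch(r"Q[1-9]\d*", work_qid):
        raise ValueError(f"Not a valid item ID: {work_qid!r}")
    if not re.fullmatch(r"[a-z]{2,3}(-[a-z]+)?", language):
        raise ValueError(f"Not a supported language code: {language!r}")
    return (
        "SELECT DISTINCT ?novel ?novelLabel WHERE {\n"
        # Hop 1: the given work -> its author.
        f"  wd:{work_qid} wdt:P50 ?author .\n"
        # Hop 2: every item with the same author that is an instance of novel.
        "  ?novel wdt:P50 ?author ;\n"
        "         wdt:P31 wd:Q8261 .\n"
        f'  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}". }}\n'
        "}"
    )


# Pride and Prejudice (Q170583).
print(build_novels_by_same_author_query("Q170583"))
```
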