
Measuring the consistency of information between Wikipedia articles and Wikidata items.
Closed, Resolved · Public

Description

Wikidata is currently the most edited project in the Wikimedia sphere [1]. While there are some efforts to use Wikidata information to populate Wikipedia pages [2] (which is especially interesting for Wikipedias in languages with few articles), there is not much research about the consistency between the information already existing on Wikipedia and the content on Wikidata. Ongoing research is trying to compare Wikipedia infoboxes (structured information on articles) with the content on Wikidata [3]; however, most of the information on Wikipedia is unstructured (i.e. the text of the articles), and currently there is no solution to compare such content with the information on Wikidata. For example, consider the article about Chile in the English Wikipedia, which says: “... It borders Peru to the north, Bolivia to the northeast, Argentina to the east, and the Drake Passage in the far south”. We want to extract that information and compare it with the Wikidata item about Chile (Q298) in the property “shares border with” (P47), to know whether the claims are consistent with the text of the English Wikipedia (a minimal sketch of this comparison follows the references below).

Being able to make such comparisons will have important and positive effects on the quality and availability of information in both Wikidata and Wikipedia, helping to detect inconsistencies, to detect missing content (in both projects), and to share references. Moreover, applying this technique to several languages will help to use Wikidata as a bridge to improve the flow of information across languages, addressing the knowledge gaps across projects [4]. This alignment of content will also be important to help patrollers in under-resourced communities [5] to detect suspicious content early (e.g. information introduced in the Wikipedia in language X that is not consistent with the information contained in Wikidata or other wikis), helping to fight disinformation campaigns [6].

[1] stats.wikimedia.org/v2
[2] Kaffee, Lucie-Aimée, et al. "Mind the (language) gap: generation of multilingual Wikipedia summaries from Wikidata for ArticlePlaceholders." In: European Semantic Web Conference. Springer, Cham, 2018, pp. 319-334.
[3] https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE
[4] https://research.wikimedia.org/knowledge-gaps.html
[5] https://meta.wikimedia.org/wiki/Research:Patrolling_on_Wikipedia
[6] Saez-Trumper, D. (2019). Online Disinformation and the Role of Wikipedia. arXiv preprint arXiv:1910.12596.
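
To make the Chile example concrete, here is a minimal sketch (not part of the project's codebase) of the kind of check described above: it fetches the "shares border with" (P47) values of Chile (Q298) from the public Wikidata API and reports which of them are mentioned in a sentence from the English Wikipedia article. The `requests` dependency and the hard-coded sentence are assumptions for illustration only.

```lang=python
import requests

API = "https://www.wikidata.org/w/api.php"

def claim_values(item_id, prop_id):
    """Return the QIDs used as values of `prop_id` on item `item_id`."""
    resp = requests.get(API, params={
        "action": "wbgetclaims", "entity": item_id,
        "property": prop_id, "format": "json"}).json()
    return [c["mainsnak"]["datavalue"]["value"]["id"]
            for c in resp.get("claims", {}).get(prop_id, [])
            if c["mainsnak"].get("datavalue")]

def english_labels(qids):
    """Map each QID to its English label."""
    resp = requests.get(API, params={
        "action": "wbgetentities", "ids": "|".join(qids),
        "props": "labels", "languages": "en", "format": "json"}).json()
    return {q: e["labels"]["en"]["value"] for q, e in resp["entities"].items()}

sentence = ("It borders Peru to the north, Bolivia to the northeast, "
            "Argentina to the east, and the Drake Passage in the far south")
labels = english_labels(claim_values("Q298", "P47"))
mentioned = {q: lbl for q, lbl in labels.items() if lbl in sentence}
missing = {q: lbl for q, lbl in labels.items() if lbl not in sentence}
print("Consistent with the text:", mentioned)
print("In Wikidata but not in this sentence:", missing)
```

String matching on labels is of course only a naive baseline; the point of the project is to go beyond it for unstructured article text.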

Event Timeline

Weekly Update: Preparing the dataset.

Update from the last two weeks:

  • Implemented a (naive) approach to convert Wikidata claims into readable text by concatenating ITEM.label + Prop.label + ITEM.label or value (see the sketch after this list).
  • Downloaded the subset of articles under Wikiproject Medicine and created a parallel dataset between articles and Wikidata claims.
  • Created a subset of the aforementioned articles and wikidata items, keeping just high quality (FA) articles.
  • Shared the data with collaborators.
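
As an illustration of the naive verbalization mentioned in the first bullet, here is a minimal sketch with hard-coded labels; the actual dataset resolves labels from Wikidata itself, and the population value below is purely illustrative.

```lang=python
# Naive verbalization: ITEM.label + Prop.label + (object label or literal value).
ITEM_LABELS = {"Q298": "Chile", "Q419": "Peru", "Q750": "Bolivia"}
PROP_LABELS = {"P47": "shares border with", "P1082": "population"}

def verbalize(subject_qid, prop_id, value):
    """Concatenate the subject label, the property label, and the object label (or literal)."""
    obj = ITEM_LABELS.get(value, str(value))  # QID -> label, otherwise keep the literal
    return f"{ITEM_LABELS[subject_qid]} {PROP_LABELS[prop_id]} {obj}"

print(verbalize("Q298", "P47", "Q419"))      # Chile shares border with Peru
print(verbalize("Q298", "P1082", 19000000))  # Chile population 19000000 (illustrative value)
```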

Update:

  • Reviewing coronavirus-related cases.

Weekly update:

  • Nothing to report

Weekly update:

  • Experimenting with T-REX

Weekly update:

  • Continued experimentation with T-REX.

Weekly update:

  • No updates this week

Weekly update:

  • Work will restart next week. The project is delayed because the collaborators' lab (IBS/KAIST) was partially closed during the COVID-19 crisis in South Korea. We will have a meeting on May 5th to reschedule the work.
  • Project restarted after the COVID-19-related delay.
  • Planned to be finished on August 18th.
  • Wrote a "Quick Introduction to Wikidata & Wikitext" in order to understand how to use wikilinks and sitelinks as anchors for alignments (see the sketch after this list).
  • Started the hiring process for an intern. CRF and SOW submitted. Waiting for approvals.
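
The following is a minimal sketch, under assumed tooling (mwparserfromhell and the public Wikidata API; not the project's code), of how wikilinks and sitelinks can serve as alignment anchors: extract the [[wikilink]] targets from a wikitext snippet and resolve each linked title to its Wikidata item through the enwiki sitelink.

```lang=python
import mwparserfromhell
import requests

API = "https://www.wikidata.org/w/api.php"

def wikilink_titles(wikitext):
    """Extract the target titles of all [[wikilinks]] in a wikitext snippet."""
    code = mwparserfromhell.parse(wikitext)
    return [str(link.title).strip() for link in code.filter_wikilinks()]

def titles_to_qids(titles, site="enwiki"):
    """Resolve page titles to Wikidata QIDs via their sitelinks."""
    resp = requests.get(API, params={
        "action": "wbgetentities", "sites": site, "titles": "|".join(titles),
        "props": "sitelinks", "format": "json"}).json()
    mapping = {}
    for entity in resp.get("entities", {}).values():
        if "sitelinks" in entity:  # skip titles with no Wikidata item
            mapping[entity["sitelinks"][site]["title"]] = entity["id"]
    return mapping

snippet = "It borders [[Peru]] to the north, [[Bolivia]] to the northeast."
print(titles_to_qids(wikilink_titles(snippet)))  # e.g. {'Peru': 'Q419', 'Bolivia': 'Q750'}
```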

Weekly updates:
  • Finalizing the hiring for the intern.
  • Defining two ML tasks:
    • Coverage is about the overlap of information between Wikipedia pages and Wikidata items. For example, if a Wikidata item X about a person contains information about nationality, birthday, birth place, and eye color (4 properties), and the corresponding Wikipedia page only contains information about nationality and birth place (2 properties), we say the coverage of the Wikipedia page is 2/4 (50%) with respect to the Wikidata item (a toy calculation of this measure follows this list).
    • Consistency is about the content of the properties. Considering the example above, if the birth place in Wikidata is Egypt but in Wikipedia it is Chile, that is an inconsistency.
  • Working on models to measure coverage.
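
As a toy illustration of the coverage definition above (property sets hard-coded here; the project extracts them from the article text and the Wikidata item):

```lang=python
WIKIDATA_PROPS = {"nationality", "birthday", "birth place", "eye color"}
ARTICLE_PROPS = {"nationality", "birth place"}

def coverage(article_props, item_props):
    """Fraction of the item's properties that the article also covers."""
    if not item_props:
        return 1.0
    return len(article_props & item_props) / len(item_props)

print(coverage(ARTICLE_PROPS, WIKIDATA_PROPS))  # 0.5, i.e. 50% coverage
```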

Updates

  • Trying seq2seq models to create aligned embedding spaces for Wikidata and the English Wikipedia.
  • Reviewing the most popular properties in Wikidata for each topic (using the ORES drafttopic model taxonomy).

Updates

  • Improving the seq2seq model.
  • Adding editor information.
  • Found relevant differences in model performance depending on article length.

Updates

  • Started to design a crowdsourced task to label data.
  • Studying the differences in performance depending on article length.
diego triaged this task as High priority. Jun 13 2020, 12:59 AM

Updates

  • Defined a subset of pages for manual evaluation.
  • Studying the suitability of a Q&A setup to compare Wikipedia and Wikidata.

Updates

  • Working on representing a Wikipedia page as a set of Wikidata items.

Updates

  • Defined a new supervised approach, using a novel technique of negative sampling.
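
For readers unfamiliar with the idea, here is a generic sketch of negative sampling for sentence–claim matching. This is not the novel technique mentioned above (which is not described in this task); it only shows the standard setup in which aligned (sentence, claim) pairs are positives and the same sentence paired with randomly drawn claims from other items provides negatives.

```lang=python
import random

def build_training_pairs(aligned_pairs, all_claims, negatives_per_positive=3, seed=0):
    """aligned_pairs: (sentence, claim) tuples assumed to match.
    all_claims: pool of claims to sample negatives from (must contain more than one claim)."""
    rng = random.Random(seed)
    examples = []
    for sentence, claim in aligned_pairs:
        examples.append((sentence, claim, 1))            # positive example
        for _ in range(negatives_per_positive):
            negative = rng.choice(all_claims)
            while negative == claim:                      # avoid re-sampling the positive
                negative = rng.choice(all_claims)
            examples.append((sentence, negative, 0))      # negative example
    return examples

pairs = [("Chile shares a border with Peru.", ("Q298", "P47", "Q419"))]
claims = [("Q298", "P47", "Q419"), ("Q298", "P47", "Q750"), ("Q298", "P30", "Q18")]
print(build_training_pairs(pairs, claims))
```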

@leila: this task was delayed due to COVID-19-related problems. That is the reason it is still in the FY20 April-June lane; we are working to finish it by the end of August. I'm OK with moving it to FY21 if that works for you.

@diego Got it, thanks for confirming here. It's okay to keep it here given that, if I recall correctly, you expect it to conclude in August.

Updates

  • Analyzing which properties in Wikidata receive the most changes.
  • Prepared a new dataset about biographies.

Updates

  • We have finished developing the first two models, one based on a DNN and another based on a bag-of-words (BoW) representation. We are working on the report, which should be online (on Meta) in the following weeks.
  • We are also analyzing the differences in entity coverage across Wikipedia language editions.
  • Based on the two developed models, we are comparing against related work and discussing a new model or a fusion of the existing ones.

Updates

  • We have finished the first model.
  • The report can be found here.
  • I will close this task and continue reporting progress on this line of research here: T260566