
Measuring the consistency of information between Wikipedia articles and Wikidata items.
Closed, Resolved · Public

Description

Wikidata is currently the most edited project in the Wikimedia sphere [1]. While there are some efforts to use Wikidata information to populate Wikipedia pages [2] (which is especially interesting for Wikipedias in languages with few articles), there is not much research about the consistency between the information already existing on Wikipedia and the content on Wikidata. Ongoing research is trying to compare Wikipedia infoboxes (structured information on articles) with the content on Wikidata [3]; however, most of the information on Wikipedia is unstructured (i.e. the text of the articles), and currently there is no solution to compare such content with the information on Wikidata. For example, consider the article about Chile in the English Wikipedia, which says: “... It borders Peru to the north, Bolivia to the northeast, Argentina to the east, and the Drake Passage in the far south”. We want to extract that information and compare it with the Wikidata item about Chile (Q298) in the property “shares border with” (P47), to know whether the claims are consistent with the text of the English Wikipedia (a minimal sketch of this comparison follows the references below).

Being able to make such comparisons will have important and positive effects on the quality and availability of information in both Wikidata and Wikipedia, helping to detect inconsistencies, to detect missing content (in both projects), and to share references. Moreover, applying this technique to several languages will help to use Wikidata as a bridge to improve the flow of information across languages, addressing the knowledge gaps across projects [4]. This alignment of content will also be important to help patrollers in under-resourced communities [5] to detect suspicious content early (e.g. information introduced in the Wikipedia in language X that is not consistent with the information contained in Wikidata or other wikis), helping to fight disinformation campaigns [6].

[1] stats.wikimedia.org/v2
[2] Kaffee, Lucie-Aimée, et al. "Mind the (language) gap: generation of multilingual Wikipedia summaries from Wikidata for ArticlePlaceholders." In: European Semantic Web Conference. Springer, Cham, 2018, pp. 319-334.
[3] https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE
[4] https://research.wikimedia.org/knowledge-gaps.html
[5] https://meta.wikimedia.org/wiki/Research:Patrolling_on_Wikipedia
[6] Saez-Trumper, D. (2019). Online Disinformation and the Role of Wikipedia. arXiv preprint arXiv:1910.12596.
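
To make the Chile example concrete, here is a minimal sketch (not part of the project's codebase) of the kind of check described above: it fetches the "shares border with" (P47) values of Chile (Q298) from the public Wikidata API and reports which of them are mentioned in a sentence from the English Wikipedia article. The `requests` dependency and the hard-coded sentence are assumptions for illustration only.

```lang=python
import requests

API = "https://www.wikidata.org/w/api.php"

def claim_values(item_id, prop_id):
    """Return the QIDs used as values of `prop_id` on item `item_id`."""
    resp = requests.get(API, params={
        "action": "wbgetclaims", "entity": item_id,
        "property": prop_id, "format": "json"}).json()
    return [c["mainsnak"]["datavalue"]["value"]["id"]
            for c in resp.get("claims", {}).get(prop_id, [])
            if c["mainsnak"].get("datavalue")]

def english_labels(qids):
    """Map each QID to its English label."""
    resp = requests.get(API, params={
        "action": "wbgetentities", "ids": "|".join(qids),
        "props": "labels", "languages": "en", "format": "json"}).json()
    return {q: e["labels"]["en"]["value"] for q, e in resp["entities"].items()}

sentence = ("It borders Peru to the north, Bolivia to the northeast, "
            "Argentina to the east, and the Drake Passage in the far south")
labels = english_labels(claim_values("Q298", "P47"))
mentioned = {q: lbl for q, lbl in labels.items() if lbl in sentence}
missing = {q: lbl for q, lbl in labels.items() if lbl not in sentence}
print("Consistent with the text:", mentioned)
print("In Wikidata but not in this sentence:", missing)
```

String matching on labels is of course only a naive baseline; the point of the project is to go beyond it for unstructured article text.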

Event Timeline

Weekly Update: Preparing the dataset.

Update from the last two weeks:

  • Implemented a (naive) approach to convert Wikidata claims into readable text by concatenating ITEM.label + Prop.label + ITEM.label or value (see the sketch after this list).
  • Downloaded the subset of articles under Wikiproject Medicine and created a parallel dataset between articles and Wikidata claims.
  • Created a subset of the aforementioned articles and wikidata items, keeping just high quality (FA) articles.
  • Shared the data with collaborators.
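
As an illustration of the naive verbalization mentioned in the first bullet, here is a minimal sketch with hard-coded labels; the actual dataset resolves labels from Wikidata itself, and the population value below is purely illustrative.

```lang=python
# Naive verbalization: ITEM.label + Prop.label + (object label or literal value).
ITEM_LABELS = {"Q298": "Chile", "Q419": "Peru", "Q750": "Bolivia"}
PROP_LABELS = {"P47": "shares border with", "P1082": "population"}

def verbalize(subject_qid, prop_id, value):
    """Concatenate the subject label, the property label, and the object label (or literal)."""
    obj = ITEM_LABELS.get(value, str(value))  # QID -> label, otherwise keep the literal
    return f"{ITEM_LABELS[subject_qid]} {PROP_LABELS[prop_id]} {obj}"

print(verbalize("Q298", "P47", "Q419"))      # Chile shares border with Peru
print(verbalize("Q298", "P1082", 19000000))  # Chile population 19000000 (illustrative value)
```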

Update:

  • Reviewing coronavirus-related cases.

Weekly update:

  • Nothing to report

Weekly update:

  • Experimenting with T-REX

Weekly update:

  • Continued experimentation with T-REX.

Weekly update:

  • No updates this week

Weekly update:

  • Work will restart next week. The project is delayed because the collaborators' lab (IBS/KAIST) was partially closed during the COVID-19 crisis in South Korea. We will have a meeting on May 5th to reschedule the work.
  • Project restarted after the COVID-19-related delay.
  • Planned to be finished on August 18th.
  • Wrote a "Quick Introduction to Wikidata & Wikitext" in order to understand how to use wikilinks and sitelinks as anchors for alignments (see the sketch after this list).
  • Started the hiring process for an intern. CRF and SOW submitted. Waiting for approvals.
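
The following is a minimal sketch, under assumed tooling (mwparserfromhell and the public Wikidata API; not the project's code), of how wikilinks and sitelinks can serve as alignment anchors: extract the [[wikilink]] targets from a wikitext snippet and resolve each linked title to its Wikidata item through the enwiki sitelink.

```lang=python
import mwparserfromhell
import requests

API = "https://www.wikidata.org/w/api.php"

def wikilink_titles(wikitext):
    """Extract the target titles of all [[wikilinks]] in a wikitext snippet."""
    code = mwparserfromhell.parse(wikitext)
    return [str(link.title).strip() for link in code.filter_wikilinks()]

def titles_to_qids(titles, site="enwiki"):
    """Resolve page titles to Wikidata QIDs via their sitelinks."""
    resp = requests.get(API, params={
        "action": "wbgetentities", "sites": site, "titles": "|".join(titles),
        "props": "sitelinks", "format": "json"}).json()
    mapping = {}
    for entity in resp.get("entities", {}).values():
        if "sitelinks" in entity:  # skip titles with no Wikidata item
            mapping[entity["sitelinks"][site]["title"]] = entity["id"]
    return mapping

snippet = "It borders [[Peru]] to the north, [[Bolivia]] to the northeast."
print(titles_to_qids(wikilink_titles(snippet)))  # e.g. {'Peru': 'Q419', 'Bolivia': 'Q750'}
```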

Weekly updates:
  • Finalizing the hiring for the intern.
  • Defining two ML tasks:
    • Coverage is about the overlap of information between Wikipedia pages and Wikidata items. For example, if a Wikidata item X about a person contains information about nationality, birthday, birth place, and eye color (4 properties), and the corresponding Wikipedia page only contains information about nationality and birth place (2 properties), we say the coverage of the Wikipedia page is 2/4 (50%) with respect to the Wikidata item (a toy calculation of this measure follows this list).
    • Consistency is about the content of the properties. Considering the example above, if the birth place in Wikidata is Egypt but in Wikipedia it is Chile, that is an inconsistency.
  • Working on models to measure coverage.
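
As a toy illustration of the coverage definition above (property sets hard-coded here; the project extracts them from the article text and the Wikidata item):

```lang=python
WIKIDATA_PROPS = {"nationality", "birthday", "birth place", "eye color"}
ARTICLE_PROPS = {"nationality", "birth place"}

def coverage(article_props, item_props):
    """Fraction of the item's properties that the article also covers."""
    if not item_props:
        return 1.0
    return len(article_props & item_props) / len(item_props)

print(coverage(ARTICLE_PROPS, WIKIDATA_PROPS))  # 0.5, i.e. 50% coverage
```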

Updates

  • Trying seq2seq models to create aligned embedding spaces for Wikidata and the English Wikipedia.
  • Reviewing the most popular properties in Wikidata for each topic (using the ORES drafttopic model taxonomy).

Updates

  • Improving the seq2seq model.
  • Adding editor information.
  • Found relevant differences in model performance depending on article length.

Updates

  • Started to design a crowdsourced task to label data.
  • Studying the differences in performance depending on article length.
diego triaged this task as High priority. Jun 13 2020, 12:59 AM

Updates

  • Defined a subset of pages for manual evaluation.
  • Studying the suitability of a Q&A setup to compare Wikipedia and Wikidata.

Updates

  • Working on representing a Wikipedia page as a set of Wikidata items.

Updates

  • Defined a new supervised approach, using a novel technique of negative sampling.
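
For readers unfamiliar with the idea, here is a generic sketch of negative sampling for sentence–claim matching. This is not the novel technique mentioned above (which is not described in this task); it only shows the standard setup in which aligned (sentence, claim) pairs are positives and the same sentence paired with randomly drawn claims from other items provides negatives.

```lang=python
import random

def build_training_pairs(aligned_pairs, all_claims, negatives_per_positive=3, seed=0):
    """aligned_pairs: (sentence, claim) tuples assumed to match.
    all_claims: pool of claims to sample negatives from (must contain more than one claim)."""
    rng = random.Random(seed)
    examples = []
    for sentence, claim in aligned_pairs:
        examples.append((sentence, claim, 1))            # positive example
        for _ in range(negatives_per_positive):
            negative = rng.choice(all_claims)
            while negative == claim:                      # avoid re-sampling the positive
                negative = rng.choice(all_claims)
            examples.append((sentence, negative, 0))      # negative example
    return examples

pairs = [("Chile shares a border with Peru.", ("Q298", "P47", "Q419"))]
claims = [("Q298", "P47", "Q419"), ("Q298", "P47", "Q750"), ("Q298", "P30", "Q18")]
print(build_training_pairs(pairs, claims))
```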

@leila: this task was delayed due to COVID-19-related problems. That is the reason it is still in the FY20 April-June lane; we are working to finish it by the end of August. I'm OK with moving it to FY21 if that works for you.

@diego Got it, thanks for confirming here. It's okay to keep it here given that, if I recall correctly, you expect it to conclude in August.

Updates

  • Analyzing which properties in Wikidata receive the most changes.
  • Prepared a new dataset about biographies.

Updates

  • We have finished developing the first two models, one based on a DNN and another based on a bag-of-words (BoW) representation. We are working on the report, which should be online (on Meta) in the following weeks.
  • We are also analyzing the differences in entity coverage across Wikipedia language editions.
  • Based on the two developed models, we are comparing against related work and discussing a new model or a fusion of the existing ones.

Updates

  • We have finished the first model.
  • The report can be found here.
  • I will close this task and continue reporting progress on this line of research here: T260566