
Let us play with PubMed to enrich Wikidata with medical information
Closed, ResolvedPublic


Username or display name (will be displayed publicly): Csisc

Categories/Tags/Keywords (up to 5): Wikidata,

Session type (select one): Tutorial (including Q/A)

Session Details
In this tutorial, I will introduce the joint research efforts of the Data Engineering and Semantics Research Unit, University of Sfax, Tunisia, together with Wikimedia users and SisonkeBiotik, an open community of African biomedical machine learning enthusiasts, to develop bots that enrich and validate biomedical information in Wikidata by analysing PubMed, a large-scale bibliographic database of biomedical research publications. I will go through the source code developed in this context, which uses several Python libraries such as Wikibase Integrator and Biopython as well as Toolforge tools like Hub. Then, I will briefly discuss the importance of this work for the Wikimedia Community and our efforts to develop and evaluate PubMed-based approaches for automatically editing Wikidata.

Target audience:

  • People interested in automatically editing Wikidata
  • People working with Python Libraries like Wikibase Integrator
  • People dealing with medical knowledge in Wikidata

What will participants get out of this session?

In this session, attendees will learn how to process bibliographic metadata in PubMed for biomedical information retrieval. They will also learn methods for developing a Wikidata bot.
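As a rough illustration of what processing PubMed metadata can look like, here is a minimal sketch that reads MeSH headings (`MH` lines) from a record in MEDLINE text format and pairs descriptors by their qualifiers. The sample record, the function names and the choice of qualifiers are assumptions made for this example and are not taken from the session's code. A pair produced this way is only a candidate for human review, not a verified medical fact.

```python
# Hypothetical record in MEDLINE text format, used only for illustration.
SAMPLE_MEDLINE = """\
PMID- 00000000
TI  - Hypothetical record used only for illustration.
MH  - Hypertension/*drug therapy
MH  - Amlodipine/*therapeutic use
MH  - Humans
"""


def parse_mesh_headings(medline_text):
    """Return a list of (descriptor, [qualifiers]) taken from MH lines."""
    headings = []
    for line in medline_text.splitlines():
        if not line.startswith("MH  - "):
            continue
        value = line[6:].strip()
        # A heading looks like "Descriptor/qualifier/qualifier";
        # a leading "*" marks a major topic and is dropped here.
        descriptor, *qualifiers = value.split("/")
        headings.append(
            (descriptor.lstrip("*"), [q.lstrip("*") for q in qualifiers])
        )
    return headings


def candidate_pairs(headings,
                    subject_qualifier="drug therapy",
                    object_qualifier="therapeutic use"):
    """Pair descriptors whose qualifiers suggest a disease/drug relation.

    The qualifier pairing is an assumption of this sketch; co-occurrence
    in one record does not show that a treatment works.
    """
    subjects = [d for d, qs in headings if subject_qualifier in qs]
    objects = [d for d, qs in headings if object_qualifier in qs]
    return [(s, o) for s in subjects for o in objects]


if __name__ == "__main__":
    print(candidate_pairs(parse_mesh_headings(SAMPLE_MEDLINE)))
```

In practice the record text would come from PubMed itself (for example through Biopython), and each candidate pair would still need to be mapped to Wikidata items and checked before any edit.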

(Optional) Additional resources:

Event Timeline

Csisc updated the task description.

@Csisc: Thanks for participating in the Hackathon! We hope you had a great time.

  • If this session / event took place: Please change the task status to resolved via the Add Action...Change Status dropdown.
    • If there are session notes (e.g. on Etherpad or a wiki page), or if the session was recorded, please make sure these resources are linked from this task.
    • If there are specific follow-up tasks from this session / event: Please create dedicated tasks and add another active project tag to those tasks, so others can find those tasks (as likely nobody in the future will look at the Hackathon workboard when trying to find something they are interested in).
  • If this session / event did not take place: Please set the task status to declined.

Thank you,
your Hackathon venue housekeeping service

@Aklapper, I presented the session as shown in the Program. It was attended by 30 participants from the Wikimedia Community and eight people at the Wikimania in-person event held in Monastir, Tunisia. It was a live session of both the Hackathon and the in-person event in Monastir. We received interesting comments that will help us develop this work, and we will certainly take them into account in our research.

Please find the slides at

Question 1:

I do have issues with it being put right into Wikidata. For example, Covid-19 has Drug-or-therapy-used-for-treatment with ivermectin (!!!!!!!!) administered orally (!!!!!!!!!!!!!!!!!!). There are 10 references, that no one will chase down. But people (or in particular "AI bots") will pick this up and think that it is true. That is so dangerous! There needs to be a link to and not to any lists of clinical trials - the RESULTS of the clinical trials are needed and not the fact that there was a clinical trial for this. This is why I feel that you need to keep the medical information on a separate Wikibase implementation until you are clear on how you will be dealing with all these issues that will arise. WiseWoman (talk) 15:46, 13 August 2022 (UTC)

WiseWoman: We will certainly consider your comments. You are certainly right. --Csisc (talk) 13:59, 14 August 2022 (UTC)
WiseWoman: We are currently exploring whether we can develop an interface that allows medical specialists to curate our data once it is retrieved from the PubMed database. --Csisc (talk) 14:02, 14 August 2022 (UTC)
That would be a step in the right direction, but people argue so much and do not agree on what is correct in medicine! Why are you not concentrating on Cochrane reports first? That would make so much more sense! I've had students working on PubMed, even identifying named entities is quite challenging, we found multiple cases with 5+ different ways that a name of the same person was written, as the journals all have different rules as to whether or not first names are written out or abbreviated, etc. Journals are quite sloppy about putting metadata into PubMed and they seldom correct errors. It's a great resource for a human medical researcher or practitioner, who can determine if an article makes sense or not. But again, people don't agree. --WiseWoman (talk) 20:30, 14 August 2022 (UTC)
WiseWoman: Sure, it will be useful to use Cochrane Reviews to enrich Wikidata. We are already working on using Cochrane Reviews to add new statements to the Open Research Knowledge Graph. This work can be scaled to Wikidata. --Csisc (talk) 18:47, 15 August 2022 (UTC)
WiseWoman: If you have other interesting proposals, we will be honoured to implement them for our project. --Csisc (talk) 18:49, 15 August 2022 (UTC)

Question 2:

I have concerns about the use of AI tools to edit biomedical knowledge in Wikidata.

Of course, before extraction, we check the source database for legal compliance and quality before deciding to use it. After extraction, we will store what we have extracted in a database and expose it in a tool similar to Reference Island of Wikimedia Deutschland so that Wikimedia users can verify it. Such a tool could be vandalized by users seeking to make edits with minimal effort. That is why we are considering mirroring the extracted medical knowledge and having it validated by physicians online.
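One way to picture such a validation step is a small review queue in which extracted statements wait for reviewers' votes before anything is uploaded. The sketch below is only an assumption about how this could be organised: the class, the field names, the placeholder identifiers and the two-approval threshold are all hypothetical and do not describe the project's actual tool.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateStatement:
    """An extracted statement awaiting human validation (hypothetical model)."""
    subject: str
    predicate: str
    obj: str
    source_pmid: str
    approvals: set = field(default_factory=set)
    rejections: set = field(default_factory=set)

    def review(self, reviewer, accept):
        # A reviewer's latest vote replaces any earlier vote by the same person.
        self.approvals.discard(reviewer)
        self.rejections.discard(reviewer)
        (self.approvals if accept else self.rejections).add(reviewer)


def ready_for_upload(candidates, min_approvals=2):
    """Keep statements with enough distinct approvals and no rejection.

    Requiring several independent reviewers is one possible guard against
    low-effort or malicious validation; the threshold is an assumption.
    """
    return [
        c for c in candidates
        if len(c.approvals) >= min_approvals and not c.rejections
    ]


if __name__ == "__main__":
    # Placeholder identifiers, not real Wikidata items or PubMed IDs.
    a = CandidateStatement("Q_DISEASE_A", "P_TREATMENT", "Q_DRUG_A", "00000000")
    b = CandidateStatement("Q_DISEASE_B", "P_TREATMENT", "Q_DRUG_B", "00000001")
    a.review("reviewer1", True)
    a.review("reviewer2", True)
    b.review("reviewer1", True)
    b.review("reviewer2", False)
    print([c.subject for c in ready_for_upload([a, b])])
```

Only statements that pass this filter would be handed to the bot, so a single careless or hostile reviewer cannot push an edit through alone.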

Question 3:

How will you deal with negated statements?

There is no negation in the MeSH Keywords. However, we think that this matter can be solved by human validation.

Question 4:

Will you consider other resources to enrich Wikidata with biomedical information?

Certainly, we look forward to using other available resources to enrich Wikidata with biomedical knowledge in the very near term. We have a project funded by the Wikimedia Foundation that will deal with this issue (,_Semantic_Web_and_Machine_Learning).