
Improving cite extraction from URLs when the author has one or two surnames (Basque, Spanish...)
Open, Needs Triage, Public

Description

Hi.

We are computing students and a teacher. As this is our first programming contribution to Wikimedia, we need help integrating an improvement into the Basque Wikipedia editor.

Adding a reference when editing the Basque Wikipedia is very easy when it can be extracted from a URL: the editor automatically extracts all the information from just the URL.
However, it often splits the author's given name(s) and surname(s) incorrectly when there is a second surname (English does not use one, but Basque sometimes uses two surnames; in addition, a surname may contain more than one word, e.g. López de Agirre). For example, if the author of a book is "Olatz Arbelaitz Gallego", it recognizes "Olatz Arbelaitz" as the given name and "Gallego" as the surname ("Olatz Arbelaitz ; Gallego"), instead of the correct split, "Olatz ; Arbelaitz Gallego".

It is difficult to distinguish given names from surnames. The program currently used in Wikipedia always takes the last word as the surname and all the preceding words as the given name. Of course, the result is not always correct in Basque; it makes many mistakes. We have collected many names and surnames and trained a Python program to split an author's given names and surnames correctly (https://github.com/EkhiAzur/WikipediaNameProblem). We evaluated it and it performs much better.
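To make the failure mode concrete, here is a minimal Python sketch of the heuristic described above (the function name is ours, not the actual code used by the editor):

```python
# Minimal sketch of the current heuristic: the last whitespace-separated
# token becomes the surname, everything before it the given name(s).
def naive_split(full_name):
    *given_names, surname = full_name.split()
    return " ".join(given_names), surname

naive_split("Olatz Arbelaitz Gallego")
# → ('Olatz Arbelaitz', 'Gallego')  # wrong: the surname is "Arbelaitz Gallego"
naive_split("Francisco Xabier Albizuri")
# → ('Francisco Xabier', 'Albizuri')  # correct, but only by coincidence
```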

Question: How can we integrate this program into the Wikipedia editor?
Please, will anyone help us?

Some examples:

| Name | Correct split | Split by the current Wikipedia version | Output of our program |
|---|---|---|---|
| Olatz Arbelaitz Gallego | Olatz ; Arbelaitz Gallego | Olatz Arbelaitz ; Gallego | Olatz ; Arbelaitz Gallego |
| Ekhi Azurmendi Arrue | Ekhi ; Azurmendi Arrue | Ekhi Azurmendi ; Arrue | Ekhi ; Azurmendi Arrue |
| Arantza Diaz de Ilarraza | Arantza ; Diaz de Ilarraza | Arantza Diaz de ; Ilarraza | Arantza ; Diaz de iLarraza |
| Patxi Angulo Perez | Patxi ; Angulo Perez | Patxi Angulo ; Perez | Patxi ; Angulo Perez |
| Arantza Diaz de Ilarraza | Arantza ; Diaz de Ilarraza | Arantza Diaz de ; Ilarraza | Arantza ; Diaz de Ilarraza |
| Arantza Diaz de Ilarraza Sanchez | Arantza ; Diaz de Ilarraza Sanchez | Arantza Diaz de Ilarraza ; Sanchez | Arantza ; Diaz de iLarraza Sanchez |
| Francisco Xabier Albizuri Irigoyen | Francisco Xabier ; Albizuri Irigoyen | Francisco Xabier Albizuri ; Irigoyen | Francisco Xabier ; Albizuri Irigoyen |
| Francisco Xabier Albizuri | Francisco Xabier ; Albizuri | Francisco Xabier ; Albizuri | Francisco Xabier ; Albizuri |
| Jose Luis Alvarez | Jose Luis ; Alvarez | Jose Luis ; Alvarez | Jose Luis ; Alvarez |
| Jose Luis Alvarez Enparantza | Jose Luis ; Alvarez Enparantza | Jose Luis Alvarez ; Enparantza | Jose Luis ; Alvarez Enparantza |
| Arantza Irastorza Goñi | Arantza ; Irastorza Goñi | Arantza Irastorza ; Goñi | Arantza ; Irastorza Goñi |
| María Arantza Irastorza Goñi | María Arantza ; Irastorza Goñi | María Arantza Irastorza ; Goñi | María Arantza ; Irastorza Goñi |

Event Timeline

This is very cool! The short answer is: it would be very difficult to integrate this into the actual service we use. It uses a web scraper (Zotero), and each website is parsed by a different translator.

If this is a common problem, you might want to consider changing the "citoid" TemplateData map for the citation template on the Basque Wikipedia to store names in a flat manner (i.e. "author": ["abizena", "abizena2", "abizena3"]), which simply concatenates the whole name, i.e. it would render as "Francisco Xabier Albizuri" (see: https://www.mediawiki.org/wiki/Citoid/Maps_TemplateData#Two-dimensional_Arrays_(list_of_lists)). This is hacky, but it circumvents the problem by not dealing with it. :)
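For illustration, such a flat "citoid" map inside the template's TemplateData might look roughly like the sketch below. The parameter names ("izenburua", "egilea1", …) are invented for this example; the actual Basque template parameters and the exact map syntax should be checked against the Citoid/Maps_TemplateData page linked above.

```json
{
  "maps": {
    "citoid": {
      "title": "izenburua",
      "author": ["egilea1", "egilea2", "egilea3"]
    }
  }
}
```

With a flat (one-dimensional) "author" array like this, each author's full name would be passed to a single template parameter instead of being split into a given-name parameter and a surname parameter.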

Alternatively, there might be a way to run a bot on the Basque Wikipedia (probably using Pywikibot?) that could fix incorrectly added citations (although this might be tricky if it gets some wrong and "fixes" ones that were correct!).
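At its core, such a bot could be a pure text transformation over the article wikitext. Below is a rough sketch only: the surname set, the template parameter names (|first=/|last=), and the function names are all assumptions, and the Basque citation templates may use different parameters.

```python
import re

# Hypothetical surname dictionary; in practice this would come from the
# collected names / trained model mentioned above.
SURNAMES = {"Arbelaitz Gallego", "Azurmendi Arrue", "Diaz de Ilarraza"}

def fix_citation(wikitext):
    """Re-split |first=/|last= pairs whose surname lost its first part."""
    def repl(match):
        first, last = match.group(1), match.group(2)
        tokens = f"{first} {last}".split()
        # Try surname suffixes from longest to shortest.
        for i in range(1, len(tokens)):
            surname = " ".join(tokens[i:])
            if surname in SURNAMES:
                return f"|first={' '.join(tokens[:i])} |last={surname}"
        return match.group(0)  # unknown name: leave it untouched
    return re.sub(r"\|first=\s*([^|}]+?)\s*\|last=\s*([^|}]+?)(?=\s*[|}])",
                  repl, wikitext)

print(fix_citation("{{cite book |first=Olatz Arbelaitz |last=Gallego |title=X}}"))
# → {{cite book |first=Olatz |last=Arbelaitz Gallego |title=X}}
```

With Pywikibot, the surrounding script would fetch each page's text (`page.text`), apply a function like this, and save with `page.save()` and an explanatory edit summary — ideally with human review of every change, given the risk of "fixing" names that were already correct.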

OK, thank you very much for your answer, Mvolz.

Then we are going to start with Pywikibot. At the beginning we will check the references in recently created Basque Wikipedia articles, and in a list of articles created by a user that extracts references automatically.

Hi! I think that the idea suggested by Marielle would be great because it would be source-agnostic and because it would be easily compatible with your Python code.

Alternatively, you may suggest changes to the Zotero web scrapers that Marielle mentioned, but that would imply lots of changes because many of them are source-specific. And they are written in JavaScript.

As a side comment, we are currently developing Web2Cit, a tool that will let Wikipedia contributors collaboratively tweak this web scraping process. In the future it may be integrated into the editor via a gadget or user script, but in principle we plan for it to be used via the current automatic citation tool (Citoid): Source → Web2Cit → Citoid → Wikipedia.

Collaborators will be able to define (on a per-domain basis) a series of selection and transformation steps for each citation metadata field (we will start with the most basic: item type, title, authors, date and source). Different selection types will be available, including XPath and Citoid (we will soon post a video here explaining this in more detail). That means that collaborators will also be able to use the Citoid response for some fields, transform it, and return it again.

In principle we will provide some basic transformation steps (mostly for string manipulation), but it has been suggested that we include custom transformations too. Maybe some of these may interact with an external API. Probably this won't be included in our first releases, but I wonder whether your last name corrections may be one such transformation step in the future.
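As a sketch of the data flow such a custom transformation step might handle (the function name and the surname set are hypothetical, and Web2Cit's actual step interface is not defined here): Citoid-style creators arrive as (given name, surname) pairs, and the step would re-join and re-split them.

```python
# Hypothetical re-splitting step over Citoid-style creator pairs.
SURNAMES = {"Arbelaitz Gallego"}

def resplit_authors(creators):
    fixed = []
    for given, surname in creators:
        tokens = f"{given} {surname}".split()
        # Move tokens from the given name into a known surname suffix.
        for i in range(1, len(tokens)):
            suffix = " ".join(tokens[i:])
            if suffix in SURNAMES:
                given, surname = " ".join(tokens[:i]), suffix
                break
        fixed.append([given, surname])
    return fixed

resplit_authors([["Olatz Arbelaitz", "Gallego"], ["Jose Luis", "Alvarez"]])
# → [['Olatz', 'Arbelaitz Gallego'], ['Jose Luis', 'Alvarez']]
```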

> You might want to consider changing the "citoid": TemplateData for the citation template on Basque Wikipedia to store names in a flat manner (i.e. "author": ["abizena", "abizena2", "abizena3"]) which just concatenates the whole name, i.e. would render it as "Francisco Xabier Albizuri"

I personally think this is the best method: we should not be interpreting names (without human oversight) to figure out what is first/last in one system (Zotero?) and then outputting that in another like Citoid and, indirectly, Wikipedia.

This is a long-known problem in Western IT, which imposes this separation while in practice it is required almost nowhere. There are good articles, like this one and this one, which explain why we are just doing this wrong.

But I think it will be hard to change Wikipedians' minds about this. Some of it comes from citation guidelines developed outside of Wikimedia that people try to follow, and those are really only possible to execute correctly when people apply this separation manually.

But I still think we should only use |authorn and, maybe, if absolutely needed, have a Wikidata-based bot clean up after the users.

Hi folks!

I hope it's appropriate to do this on a Phabricator thread -- but we've identified a couple of threads, such as this one, where there are issues with Citoid and automatic extraction / generation of metadata, and @diegodlh has been working on a community-based solution for this problem, called Web2Cit. Web2Cit aims to solve some of these problems without requiring users to fiddle with Zotero translators or to have a lot of technical skills.

On May 11 at 4 PM UTC we will be running a workshop to show the tool and allow users to test the early adopters version. If you're interested, you can register here: https://us06web.zoom.us/meeting/register/tZIpfu2upj4sE9ZrqblmM3-QujaeqekAAINK

If you want to know more about Web2Cit or the workshop, check here: https://meta.wikimedia.org/wiki/Web2Cit/Workshops

We would also greatly appreciate it if you happen to know anyone who might be interested in attending such a workshop and who can handle some technical complexity.

cheers,
scann