[Session] How to fix automatic citations in Wikipedia with Web2Cit
Closed, Resolved (Public)


Wikipedia's automatic citation generator (Citoid) creates formatted citations from given web sources. However, sometimes metadata are not accurately retrieved. One way to fix this is with Web2Cit, a tool to collaboratively improve automatic citations in Wikipedia.

The Web2Cit community has already indicated (both manually and automatically) the expected citation metadata for a series of webpages, some of which do not match their automatic citations. In this workshop we will introduce the basics of Web2Cit and show how to use it to fix automatic citations in Wikipedia for these sources, or for others relevant to your communities.

  • Title of session: How to fix automatic citations in Wikipedia with Web2Cit
  • Session description: We will introduce the basics of Web2Cit and show how to use it to fix automatic citations for previously identified sources or others which may be relevant to your local communities.
  • Username for contact: @diegodlh & @Nidiah
  • Session duration (25 or 50 min): 50 min
  • Session type (presentation, workshop, discussion, etc.): workshop
  • Language of session (English, Arabic, etc.): English or Spanish (depending on the audience)
  • Prerequisites (some Python, etc.): some XPath and JSON might be useful
  • Any other details to share?:
  • Interested? Add your username below:

If you attended last year's hackathon Web2Cit session, this one may still be relevant: Web2Cit development has continued since then. The most notable changes include:

  • Translation test support, to indicate expected output for specific webpages, including tests automatically generated from data from our research team.
  • Web2Cit monitor, which regularly checks translation tests and publishes test results.
  • JSON-LD support, a popular format to embed metadata in webpages.

Proposed agenda

  1. Web2Cit basics and installation.
  2. Identify a problematic webpage (either bring one relevant to your community or choose one from the Web2Cit monitor list).
  3. Indicate or check the expected output on a translation test.
  4. Configure a translation template to match the expected output.

Event Timeline

Thanks for this proposal! Due to limited space, not all proposed sessions could be included in the agenda published today, and yours is among those left out. I'll ping here if a room becomes available to organize remaining or impromptu sessions.

Hello! There is now additional space available for you to schedule a session: (See "Small Hacking Room Corner (up to 20 participants)"). If you are still interested in organizing a session, you can claim this slot on a first-come, first-served basis. There may not be any AV equipment in this small room, just a bunch of tables, and the room might be separated from the big hacking room by dividers.

Cool, @srishakatux! So we've added our session to one of the 50-min slots available. Shall we move this task to the "Accepted sessions" column in the workboard, or to a separate new column, so people checking the workboard for sessions may easily find those that will actually take place? Thank you!

Below you may find the link attached that redirects the user to the corresponding Etherpad:

Do you (or could you) support Cite Q through this tool? Also see T289287.

Hi, Mike! Thank you for your question. Cita provides the Citoid extension with an alternative reply similar to the one provided by the Citoid service, but based on community-maintained algorithms. But the transformation of the citation metadata returned this way into a citation template is still handled by the Citoid extension.

Therefore, once this is implemented in Citoid, it would automatically become available in Cita as well.

Session Notes:

How to fix automatic citations in Wikipedia with Web2Cit

Date & time: Saturday, May 20th at 15:00 EEST / 12:00 UTC

Relevant links

Participants (~20)


Grant project

Automatic citation tool in Wikimedia: URL -> citation. It fetches metadata from the source (author, title, etc.), which is far easier than doing this manually. The problem is that sometimes the metadata has errors or is missing fields. This system (Citoid) requests the URL and extracts the metadata from the webpage. It uses Zotero (third-party software) to extract the metadata. If the metadata is appropriately embedded via standards, great! That is generally not the case though, so the Zotero community writes specific "translators" for each domain. This is a lot of work, and when a webpage changes the translator breaks.
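To make the "URL -> citation" step concrete, here is a minimal sketch of how a request to Citoid could be built. It assumes the public Wikimedia REST endpoint shown in `CITOID_ENDPOINT` and the `mediawiki` output format; neither is stated in these notes, so check the current Citoid API documentation before relying on them. The function name `citoid_request_url` is made up for this example.

```python
from urllib.parse import quote

# Assumption: public REST endpoint that fronts the Citoid service.
CITOID_ENDPOINT = "https://en.wikipedia.org/api/rest_v1/data/citation"


def citoid_request_url(source_url: str, fmt: str = "mediawiki") -> str:
    """Build the URL that asks Citoid for metadata about source_url.

    The source URL travels as a single path segment, so every reserved
    character (including ":" and "/") is percent-encoded.
    """
    return f"{CITOID_ENDPOINT}/{fmt}/{quote(source_url, safe='')}"


if __name__ == "__main__":
    # Only builds and prints the request URL; no network call is made.
    print(citoid_request_url("https://example.com/news/story?id=1"))
```

Fetching that URL would return the JSON metadata that Citoid managed to extract, which is the output Web2Cit then tries to improve on.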

To fix this, a few options:

Manual: doesn't benefit others and can be slow

Convince webmasters to fix site: very slow and often doesn't work

Fix Zotero translator: requires coding experience though so high barrier to entry and slow iteration

Web2Cit attempts to solve these issues: it complements Citoid. When Citoid works, Web2Cit is not needed. When it doesn't, though, Web2Cit can help fill the gaps with a lower barrier to entry, by just editing configuration.

Demo! If Web2Cit is installed (user script) and you use Citoid, both the original Citoid output and the Web2Cit output are shown. You can inspect the Web2Cit output and edit it to be what is actually expected. The edits are saved as JSON on Meta-Wiki.

Once saved, you can see how well the extracted content overlaps with the expected content. You could stop here, but you can also edit the extractor. Options include setting hard-coded (fixed) values, using a specific Citoid field, using an HTML selector (XPath), or pulling from the site's JSON-LD.
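The four extractor options mentioned above (fixed value, Citoid field, XPath, JSON-LD) can be illustrated with a small sketch. This is not Web2Cit's actual code or configuration schema: the page, the pretend Citoid response, the `TEMPLATE` layout, and the `select` and `apply_template` helpers are all hypothetical, and the XPath support is limited to the subset implemented by Python's standard library.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a fetched webpage.
PAGE = """<html>
  <head>
    <script type="application/ld+json">
      {"@type": "NewsArticle", "headline": "Example headline",
       "author": {"@type": "Person", "name": "Ada Example"}}
    </script>
  </head>
  <body>
    <h1 class="title">Example headline | Example News</h1>
    <span class="date">2023-05-20</span>
  </body>
</html>"""

# Hypothetical Citoid response for the same page (note the wrong title).
CITOID_RESPONSE = {"itemType": "webpage", "title": "Example News", "language": "en"}


def select(kind, config, page_root, citoid):
    """Return a list of values for one selection step."""
    if kind == "fixed":
        # Hard-coded value, independent of the page.
        return [config]
    if kind == "citoid":
        # Reuse a field from the original Citoid response.
        value = citoid.get(config)
        return [] if value is None else [value]
    if kind == "xpath":
        # ElementTree implements only a subset of XPath.
        return [(el.text or "").strip() for el in page_root.findall(config)]
    if kind == "json-ld":
        # Follow a dotted path inside each embedded JSON-LD object.
        values = []
        for script in page_root.findall(".//script[@type='application/ld+json']"):
            data = json.loads(script.text)
            for key in config.split("."):
                data = data.get(key) if isinstance(data, dict) else None
            if data is not None:
                values.append(data)
        return values
    raise ValueError(f"unknown selection type: {kind}")


# Hypothetical template: one (selection type, config) pair per output field.
TEMPLATE = {
    "itemType": ("fixed", "webpage"),
    "title": ("json-ld", "headline"),
    "date": ("xpath", ".//span[@class='date']"),
    "language": ("citoid", "language"),
}


def apply_template(template, page_root, citoid):
    """Run every field's selection and collect the results."""
    return {
        field: select(kind, config, page_root, citoid)
        for field, (kind, config) in template.items()
    }


root = ET.fromstring(PAGE)
result = apply_template(TEMPLATE, root, CITOID_RESPONSE)
print(result)
```

In this toy case the title comes from the page's JSON-LD rather than from Citoid, which is the kind of correction the notes describe: the configuration changes, not the code.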

Under Web2Cit/monitor, you can see a table (updated approximately hourly) of all the domains with their scores and details. By watching the page, you can see if a score changes because something changed on the site.

The user script isn't required (you can adjust the URL you add to Citoid), but installing it makes things much easier.

As part of the project, the current performance of Citoid was measured. To do so, references were extracted from Citoid for featured articles and compared to manually curated results. These were then used to automatically generate a set of initial expected Web2Cit outputs.
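A comparison between extracted and expected metadata can be sketched as a per-field score. The version below is only an illustration under simple assumptions (exact match after normalizing case and whitespace, equal weight per field); the notes do not describe how Web2Cit or the research team actually compute their scores, and the field names and values are made up.

```python
def normalize(values):
    """Lowercase each value and collapse runs of whitespace."""
    return [" ".join(str(v).lower().split()) for v in values]


def score_fields(expected, actual):
    """Score each expected field: 1.0 on a normalized exact match, else 0.0."""
    return {
        field: 1.0 if normalize(actual.get(field, [])) == normalize(goal) else 0.0
        for field, goal in expected.items()
    }


def overall_score(expected, actual):
    """Average the per-field scores; 0.0 when nothing is expected."""
    scores = score_fields(expected, actual)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical expected output (curated by hand) and extracted output.
EXPECTED = {
    "title": ["Example headline"],
    "date": ["2023-05-20"],
    "authorLast": ["Example"],
}
ACTUAL = {
    "title": ["Example News"],   # wrong value
    "date": ["2023-05-20 "],     # matches after normalization
    # authorLast is missing entirely
}

print(score_fields(EXPECTED, ACTUAL))
print(overall_score(EXPECTED, ACTUAL))
```

A graded similarity measure could replace the exact match to give partial credit for near misses such as truncated titles.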


Can we name and shame the publishers who do poorly?

Is it feasible to auto-generate patches for Zotero?

Yes, but hard. Probably better to hand-code, but the tests themselves can be very useful for anyone with the skills to create Zotero translators.

Will this work for any language edition of Wikipedia?

Yes, we chose a subset of fields that would provide a basic citation. For instance, we don't support DOIs yet, but that might be a good one to add. The templates are stored on Meta, so they are not language-specific. They output field names that match those used by Zotero/Citoid, so as long as a community has configured Citoid to work with their citation templates, they can use Web2Cit. There may be some UI issues with certain languages though -- e.g., right-to-left languages.