
Create recommendations for databases/journals/websites, by WikiProject for WikiProject X
Open, NormalPublic

Description

At Wikimania, James and I talked about the possibility of using WikiProject X to create recommendations. @Halfak said that his current capabilities cover what the tool needs: it would just be a matter of engineering it.

The tool would (a) screen the references in top-class articles for a WikiProject (likely FA, GA, B); (b) identify the most frequently used sources in that topic by comparing either URLs or the contents of certain reference fields, whether by journal (e.g. The Lancet), website (e.g. Newspapers.com), identifier (DOI, for example https://gist.github.com/hubgit/5974843) or publisher/via (e.g. JSTOR, Project Muse); and (c) recommend those resources to editors as places to start their research. This would replace what happens now, which is either manually curated lists or heavy reliance on editors' previous knowledge of a field or of research, neither of which is a reliable "guarantee" of a quality research strategy.
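Steps (a) to (c) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the tool itself: the input shape (a `quality` class plus a list of `refs` with optional `journal` and `url` fields), the sample data, and the function names are all hypothetical, and references are assumed to have been extracted already.

```python
from collections import Counter
from urllib.parse import urlparse

# Quality classes treated as "top class" (taken from the description above).
TOP_CLASSES = {"FA", "GA", "B"}

# Hypothetical sample input; the field names are assumptions for illustration.
SAMPLE_ARTICLES = [
    {"title": "A", "quality": "FA", "refs": [
        {"journal": "The Lancet"},
        {"url": "https://www.jstor.org/stable/1"},
    ]},
    {"title": "B", "quality": "GA", "refs": [
        {"journal": "the lancet "},
        {"url": "https://jstor.org/stable/2"},
        {"url": "https://books.google.com/books?id=x"},
    ]},
    {"title": "C", "quality": "B", "refs": [{"journal": "The Lancet"}]},
    {"title": "D", "quality": "Stub", "refs": [{"url": "https://example.org/"}]},
]


def source_key(ref):
    """Return a normalized label for a reference's source, or None."""
    journal = ref.get("journal")
    if journal:
        return journal.strip().lower()
    url = ref.get("url")
    if url:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        return host or None
    return None


def recommend_sources(articles, top_n=3):
    """Rank sources by how many top-class articles cite them."""
    counts = Counter()
    for article in articles:
        if article["quality"] not in TOP_CLASSES:
            continue
        # Count each source once per article so that a single article
        # citing one source many times does not dominate the ranking.
        keys = {source_key(ref) for ref in article["refs"]}
        keys.discard(None)
        counts.update(keys)
    return counts.most_common(top_n)
```

Counting per article rather than per citation is a design choice here, not something the task specifies. A real ranking would probably also need to down-weight very generic sources, which is the risk raised in the description.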

The main risks here are that the tool isn't used and that the recommendations tend to be very generic (such as Google Books).

Additional potential use cases: recommending research starting points in unreferenced tags, based on WikiProject or categories; recommending TWL and/or open access sources to newish editors.

Useful links:
*Capability to figure out article quality in WP articles: https://meta.wikimedia.org/wiki/ORES/wp10
*Capability to extract structured citation information: https://meta.wikimedia.org/wiki/Research:Scholarly_article_citations_in_Wikipedia

See also: T120502: Tools for dealing with citations of withdrawn academic journal articles

Event Timeline

Sadads created this task.Sep 1 2015, 3:28 PM
Sadads raised the priority of this task from to Needs Triage.
Sadads updated the task description. (Show Details)
Sadads added a project: WikiProject-X.
Sadads added subscribers: Sadads, Halfak, Harej.
Restricted Application added a subscriber: Aklapper.Sep 1 2015, 3:28 PM
Harej triaged this task as Normal priority.Sep 1 2015, 10:19 PM
Harej set Security to None.

I would like to create a structured database of Wikipedia citations. This includes the citation data itself (think OCLC on steroids), but also information about where the citation appears on Wikipedia. Implementation-wise, this could manifest itself as a MediaWiki+Wikibase instance on Labs combined with a script that pulls citation data from Wikipedia. The script would (a) pull anything with a citation template and parse it; (b) pull anything between <ref> tags (and run incomplete citations through Parsoid); and (c) map out the data according to a schema. In principle this could be done on Wikidata, but the level of granularity I want may be overkill for Wikidata's purposes.
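A rough sketch of what steps (a) to (c) of such a script could look like, using only the Python standard library. Everything here is an assumption for illustration: the regular expressions are deliberately naive (they do not handle nested templates or pipes inside wikilinks), the record layout and the `cited_in` field are a made-up schema, and the Parsoid step for incomplete citations is left out entirely.

```python
import re

# Naive patterns: real wikitext needs a proper parser.
REF_RE = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
CITE_RE = re.compile(r"\{\{\s*cite\s+(\w+)\s*\|(.*?)\}\}", re.DOTALL | re.IGNORECASE)


def parse_template(kind, body):
    """Turn the parameters of a citation template into a flat dict."""
    record = {"type": kind.lower()}
    for part in body.split("|"):
        if "=" in part:
            name, value = part.split("=", 1)
            record[name.strip().lower()] = value.strip()
    return record


def extract_citations(page_title, wikitext):
    """Return one record per citation found in the wikitext of a page."""
    records = []
    # (a) every citation template, wherever it appears
    for kind, body in CITE_RE.findall(wikitext):
        records.append(parse_template(kind, body))
    # (b) <ref> bodies without a template are kept as raw text
    for ref_body in REF_RE.findall(wikitext):
        if not CITE_RE.search(ref_body):
            records.append({"type": "unstructured", "raw": ref_body.strip()})
    # (c) map to the (hypothetical) schema: note where the citation appears
    for record in records:
        record["cited_in"] = page_title
    return records


SAMPLE_WIKITEXT = (
    "Claim.<ref>{{cite journal |title=Foo |journal=The Lancet "
    "|doi=10.1000/xyz}}</ref> "
    'Other.<ref name="b">Smith, J. (2001). Some Book.</ref> '
    'Again.<ref name="b" />'
)
```

Calling `extract_citations("Example", SAMPLE_WIKITEXT)` yields one structured record for the journal citation and one unstructured record for the free-text reference; the self-closing reuse of `<ref name="b" />` is skipped.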

With that infrastructure in place, a script could compare its index of WikiProjects and articles to entries in this database and provide information on the sources that are used the most by the highest-quality articles. The list could then be coupled with links to the Wikipedia Library "library card" system, itself containing a workflow encouraging people to sign up. (Or if the text is available on Wikisource or Commons, we could link to that. Paging @Daniel_Mietchen)

This approach would allow for applications far beyond WikiProjects, and it would provide a long-term solution to other citation-related issues. For example, when a journal article is retracted, it would be useful to see which articles cite that journal article. It also would help us get insight on sources used on other projects, since different language versions of articles are linked through their Wikidata item.
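The retraction use case amounts to a reverse lookup from an identifier to the articles that cite it. A minimal sketch, assuming citation records are plain dicts with hypothetical `doi` and `cited_in` fields (these names and the sample data are illustrative, not an actual schema):

```python
from collections import defaultdict

# Hypothetical citation records; the field names are assumptions.
SAMPLE_RECORDS = [
    {"doi": "10.1000/XYZ", "cited_in": "Article one"},
    {"doi": "10.1000/xyz", "cited_in": "Article two"},
    {"doi": "10.1000/abc", "cited_in": "Article one"},
    {"raw": "Smith, J. (2001). Some Book.", "cited_in": "Article three"},
]


def build_citation_index(records):
    """Map each DOI to the set of article titles that cite it."""
    index = defaultdict(set)
    for record in records:
        doi = record.get("doi")
        if doi:
            # DOIs are case-insensitive, so normalize before indexing.
            index[doi.strip().lower()].add(record["cited_in"])
    return index


def articles_citing(index, doi):
    """Return the sorted titles of articles citing the given DOI."""
    return sorted(index.get(doi.strip().lower(), set()))
```

With an index like this, a list of retracted DOIs could be checked against it in one pass; the same structure would work for PMIDs or other identifiers.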

Sadads added a comment.Sep 2 2015, 4:37 AM

The Wikipedia Library is very much interested in the historical changes in citation data as well (especially when it comes to most cited works, and our particular partner sources). See https://phabricator.wikimedia.org/T102064

Harej moved this task from Needs Triage to In Progress on the WikiProject-X board.Sep 2 2015, 9:15 PM
Sadads added a subscriber: Mvolz.Sep 3 2015, 6:46 PM

@Halfak Do you know if anyone else would be interested in working with @Harej on the reference database? Do we need to reach out to anyone else to include them in that work?

Moreover, does your current strategy for extracting the reference data pair well with his intended use of the data?

Also adding @Mvolz, who might be interested in this for Citoid: we could, for instance, fix repeated citation-scraping errors coming out of Zotero as editors manually create good citations for that source.

Halfak added a comment.EditedSep 4 2015, 2:50 PM

/me puts on his volunteer hat

I'm working on some utilities now that will likely be relevant to extracting and processing <ref>s and identifiers historically. See https://github.com/mediawiki-utilities/python-mwrefs and https://github.com/mediawiki-utilities/python-mwcites . I was just discussing plans with @Harej in Research. I plan to prioritize the tooling you guys need in those utilities.

Thanks @Halfak! Looking forward to you working on this.

I also wanted to add @Jdforrester-WMF . This might be of interest to Citoid, especially if https://phabricator.wikimedia.org/T111141 is the strategy used to back the recommendations.

@Harej what is the timeline or next steps beyond the Wikibase and @Halfak's work?

@Harej Also, I just discovered https://www.wikidata.org/wiki/Wikidata:WikiProject_Source_MetaData. We wouldn't want to replicate too much of their effort.

Harej added a comment.Sep 15 2015, 5:39 PM

I'm familiar with that project and will be working with them.

Another use for a data set like this: http://arxiv.org/pdf/1509.05631v1.pdf , Verifiability metrics :)

Tarrow added a subscriber: Tarrow.Oct 14 2015, 10:40 AM
Harej moved this task from In Progress to Stalled on the WikiProject-X board.Oct 20 2015, 5:46 AM
Harej moved this task from Stalled to Requests on the WikiProject-X board.Oct 26 2015, 7:06 PM

Hi,

I'd be interested in working on creating a reference database. I'm particularly interested in tracking the usage of PMIDs (I'm about to start a short project with EuropePMC) rather than DOIs but it seems silly to replicate work.

I've got the output of @Halfak 's mwcites on a recent dump of enwiki on tools and am just thinking about importing the results into a wikibase install.

@Tarrow, I'm considering adding some generalized metadata extraction to mwcites (and integrating it into the more general mwrefs) at the hackathon. See https://phabricator.wikimedia.org/T114247. I've already got some people working on DOIs. Maybe we could work together on making the metadata extractor for PMIDs easier to use at the hacka-summit. :)

Harej added a comment.Oct 28 2015, 7:09 PM

Hello @Tarrow! I would be happy to work with you on integrating your work into Librarybase, a Wikibase instance I set up for exactly this kind of thing: http://librarybase.wmflabs.org

jrbs added a subscriber: jrbs.Nov 17 2015, 8:59 PM

Hey @Harej, @Tarrow, @Halfak, what's the status on this? Is there a direction and/or progress? Can I help with anything?

Halfak added a comment.Dec 1 2015, 9:50 PM

I see this as blocked on an initial import to Librarybase. @Harej, what's the most apt card for that?

Harej added a comment.Dec 2 2015, 7:56 PM

@Halfak, that would be this one ---> T120115

Sadads added a comment.Dec 2 2015, 9:18 PM

Very exciting! Keep up the good work!

Harej moved this task from Requests to Stalled on the WikiProject-X board.Jan 5 2016, 12:57 AM
Harej added a comment.Apr 18 2016, 7:45 PM

For presentation, it is worth looking at how they are done for "WikiProject libraries": https://en.wikipedia.org/wiki/Category:WikiProject_libraries

Harej edited projects, added Reports-bot; removed WikiProject-X.Apr 20 2016, 1:22 AM
Harej moved this task from Backlog to Requests on the Reports-bot board.Apr 26 2016, 3:40 AM
Harej moved this task from Backlog to Radar on the VPS-project-Librarybase board.Aug 4 2016, 3:29 AM