Consider using a hierarchy of configuration files
Open, Needs Triage, Public

Description

Copying a proposal by @Kerry_Raymond from the advisory board mailing list:

Would this allow us to set up a hierarchy of rules for automatic citation? First, use the one in my user space if it exists and succeeds, else use the one in the community space if that exists and succeeds, else use Citoid, else do the minimum {{cite web|url= ... |title=Unknown title |access-date=today}}

This would be nice as it would allow individual users to precisely control the citations of URLs of importance to them (presumably ones they use regularly) or to test new rules without disrupting the community's use of existing rules. It could also be potentially extended to enable more complex cascading, e.g. alternative rules in different languages, WikiProject-specific rules, etc.

Probably not something for the minimum viable product, but it never hurts to have ideas for future extensions on the table early, so that early design choices don't frustrate later development, and to mark points in the code base with comments such as "this is where we'd deal with SuchAndSuch".
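The fallback order in the proposal above (user space, then community space, then Citoid, then a minimal cite web) can be sketched roughly as follows. This is only an illustration of the cascade, not how Web2Cit is implemented: the `Source` type, `cite`, and `minimal_citation` are hypothetical names, and each source is assumed to return citation wikitext or `None` when it has no rule or its rule fails.

```python
from datetime import date
from typing import Callable, Optional, Sequence

# A "source" takes a URL and returns citation wikitext, or None when it has
# no rule for that URL. Hypothetical type, for illustration only.
Source = Callable[[str], Optional[str]]


def minimal_citation(url: str, today: Optional[date] = None) -> str:
    """Last-resort citation, following the wording of the proposal."""
    access = (today or date.today()).isoformat()
    return f"{{{{cite web|url={url} |title=Unknown title |access-date={access}}}}}"


def cite(url: str, sources: Sequence[Source], today: Optional[date] = None) -> str:
    """Try each source in priority order; fall back to the minimal citation."""
    for source in sources:
        try:
            result = source(url)
        except Exception:
            # A rule that exists but fails falls through to the next level.
            result = None
        if result:
            return result
    return minimal_citation(url, today)
```

A caller would pass the levels in priority order, e.g. `cite(url, [user_rules, community_rules, citoid])`, where each of those three callables is an assumed wrapper around the corresponding configuration location or service. Adding further levels (language-specific or WikiProject-specific rules) would only mean inserting more entries into that list.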

The Web2Cit-Core is already prepared to use configuration files from different storage locations (as used by the sandbox endpoint of the Web2Cit-Server), so it should be ready to support this use case in the future.

I just wonder to what extent this would discourage collaboration around a single set of configuration files.

Possibly related to other tasks such as T302019 and T305168

Event Timeline

Currently on en.WP, each contributor using "cite web" etc. uses it more or less as they choose. Which piece of text from the web page do I regard as the title? Do I put the organisation name as the website or as the publisher? Etc. So it is not unusual that many citations to the same website will look quite different in the hands of different editors. I am not aware of any WikiProjects that I am involved with seeking to standardise these things. And if someone proposed trying to mandate a standard, I would probably say "it's just going to create arguments among the active contributors and everyone else won't even be aware of it and won't follow it anyway". Our biggest problem on en.WP is not people producing variant styles of citation to the same resource but people not citing at all, or just adding a URL as an external link.

What some WikiProjects or just individuals currently do is create specific templates for citing certain resources, e.g. I created Template:Cite QHR, which is used to cite listings on the Qld Heritage Register. We also have ones to cite the Australian Census, e.g. Template:Census 2016 AUS. In VE, you use these with Cite > Basic > Insert > Template > name of the template. Active contributors working in a particular topic use these a lot (and will collaborate on their development) as the standard fields like "website" etc. are pre-filled, so it's less work for them and the resultant citation is more useful. However, occasional contributors will still use cite web and do it the hard way because they don't know of any other way. "Collaborating" on Wikipedia is far more disengaged and disorganised than (say) an academic collaboration on writing an academic paper. The idea that all citations to a particular website could be mandated to use a particular presentation is virtually unachievable.

Of course, active contributors do sometimes clean up poor citations created by occasional contributors to a higher standard (or to use custom templates) but at the same time, you get occasional contributors removing custom templates because they want to update something in them but don't understand the custom template, so they remove it and do a cite-web instead. This is the downside of custom templates: others don't understand what they are and don't know how to find out how to use them. In an ideal universe, active contributors would spend more time nurturing occasional contributors but, unfortunately, the scale factors work against this -- too few active contributors and a never-ending stream of occasional contributors. In an ideal universe, the nurtured new contributor will flourish into an active contributor but this is rarely so ("Wikipedians are born, not made").

So I think with Web2Cit becoming a mainstream tool, we will see WikiProjects look at their custom templates and create the Web2Cit equivalents in some cases. But not all. One of the reasons we use custom templates is because of frequent changes in domain name and website structure. Because custom templates are evaluated at read-time, we can update the custom template to construct the URL field so that it delivers the reader to the current URL. Web2Cit and Citoid cannot do that as they are resolved at write-time, not read-time. All we can do with cite-web is rely on an archive website to provide a copy of the contents as it was. So I expect we will find that WikiProjects using custom templates will want to "teach" Web2Cit how to create an instance of those custom templates from the provided URL rather than a "standard" cite-web. This is not a minimum viable product, but I think an inevitable future request if Web2Cit is successful (the price of success).

I understand your comments are regarding my concern that having a hierarchy of configuration files may "discourage collaboration around a single set of configuration files".

The idea that all citations to a particular website could be mandated to use a particular presentation is virtually unachievable.

Web2Cit would not mandate a particular citation presentation, the same way that Citoid doesn't. It is citation templates which mandate citation formatting.

Collaborating around a single set of Web2Cit configuration files means agreeing on a single set of citation metadata for cited sources (and the procedures necessary to extract them). This sounds to me more like what is done in Wikidata for bibliographic resources (used in Wikipedia via the CiteQ citation template), where contributors agree on a single title, publisher, etc. for each source.

So I expect we will find that WikiProjects using custom templates will want to "teach" Web2Cit how to create an instance of those custom templates from the provided URL rather than a "standard" cite-web. This is not a minimum viable product, but I think an inevitable future request if Web2Cit is successful (the price of success).

For now, Web2Cit's interaction with Wikipedia resembles that of Citoid. That is, it is up to each Wikipedia to decide which citation template to use based on the item type returned by Citoid/Web2Cit, using this configuration file. As far as I know, there is no way to make this decision based on citation data other than the item type. Being able to change the citation template suggested by the citation extension (based on the Citoid response) has been discussed before (T97936), but in the context of Citoid returning a wrong item type (which is not the context we are discussing here).
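To make the limitation concrete, here is a minimal sketch of template selection keyed on item type alone. The mapping entries, the `choose_template` helper, and the default value are all illustrative assumptions; the actual per-wiki configuration and its contents may differ.

```python
# Illustrative mapping only; each wiki maintains its own, and the real
# entries may differ from these.
TEMPLATE_BY_ITEM_TYPE = {
    "webpage": "Cite web",
    "book": "Cite book",
    "journalArticle": "Cite journal",
}


def choose_template(citation: dict,
                    type_map: dict = TEMPLATE_BY_ITEM_TYPE,
                    default: str = "Cite web") -> str:
    """Pick a template from the item type only; every other field is ignored."""
    return type_map.get(citation.get("itemType"), default)
```

Because only `itemType` is consulted, two web pages from different sites always get the same template, which is why a URL-specific custom template cannot be selected under this scheme.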

This goes beyond the topic of this task, though, I believe. We can further discuss alternative Wikipedia-Web2Cit integration ideas on separate tasks, if you wish.

At Wikimedia Hackathon 2022's Web2Cit session (T308449), @Mvolz commented that having separate Web2Cit configurations per Wikipedia may be useful for the specific case described in T132308. That is, incomplete dates returned by Citoid (e.g., 2010-12, meaning December 2010) throw an error in English Wikipedia citation templates, to avoid confusion with date ranges (i.e., 2010-2012). As described there, a -XX suffix was tried (i.e., 2010-12-XX); it was accepted by English Wikipedia but rejected by other Wikipedias.

If this is the only case where we would benefit from having different Web2Cit/Citoid response formats between Wikipedias, I think that it would be outweighed by the benefits of having one single set of common Web2Cit configurations across Wikipedias (i.e., the benefit of a larger community). I've added a comment to T132308 with an idea of how this could be sorted out at the Citoid extension's level.
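As a rough illustration of handling this on the consuming side rather than in per-wiki Web2Cit configurations, the sketch below rewrites year-month dates only for wikis that reject them. It is not necessarily the approach suggested in T132308; the function names, the `spell_out_on` set, and the choice of spelling out the month in English are all assumptions made for the example.

```python
import re

# Matches YYYY, YYYY-MM or YYYY-MM-DD; anything else (e.g. 2010-2012 or
# 2010-12-XX) is left untouched by the functions below.
_ISO_PARTIAL = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")
_MONTHS = ["January", "February", "March", "April", "May", "June", "July",
           "August", "September", "October", "November", "December"]


def spell_out_partial_date(value: str) -> str:
    """Turn '2010-12' into 'December 2010'; return other values unchanged."""
    match = _ISO_PARTIAL.match(value)
    if not match:
        return value
    year, month, day = match.groups()
    if month and not day and 1 <= int(month) <= 12:
        return f"{_MONTHS[int(month) - 1]} {year}"
    return value


def format_date(value: str, wiki: str,
                spell_out_on: frozenset = frozenset({"enwiki"})) -> str:
    """Apply the rewrite only on wikis assumed to reject year-month dates."""
    return spell_out_partial_date(value) if wiki in spell_out_on else value
```

With this split, the shared configuration keeps returning `2010-12` for every wiki, and only the per-wiki presentation step differs.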

@Mvolz, can you think of where we may benefit from having separate Web2Cit configurations per Wikipedia? Thanks!

I agree the default should be a single template. My main worry would be that we would have edit wars between different Wikipedias. I think maybe this won't come up much, but it could! In that case it would be useful if there were a way to fork them to easily settle such disputes. But I suppose people could just save their custom configs if they disagree with the central template, so perhaps a lower priority.

Another point is that having separate templates for separate wikis would allow wikilinking to pages in the fields (i.e. the publisher, for instance). (Relevant mailing list discussion here: https://lists.wikimedia.org/hyperkitty/list/webtocit@lists.wikimedia.org/thread/HRYDDXA6MLTKBEV477H6FMZ4CBSNVGS4/)

It might be better to use Wikidata for this use case, but that adds yet another part to the translator creation workflow. So instead of "fixed" it'd be "qid", and then the service would fetch the correct page from Wikidata based on the current wiki. That info would have to be included in the request to the service, via the gadget, and the service would have to accept that as input. Maybe a can of worms you don't want to open :).
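A minimal sketch of that idea, under several assumptions: the `{"type": ..., "value": ...}` step shape, the "qid" step type, and the `resolve_field` helper are hypothetical, and the sitelink lookup is injected as a plain function so the example runs offline (a real service would query Wikidata instead).

```python
from typing import Callable, Dict, Optional

# Maps a QID to {site id: page title}. Injected so the sketch needs no network.
SitelinkFetcher = Callable[[str], Dict[str, str]]


def resolve_field(step: dict, wiki: str,
                  fetch_sitelinks: SitelinkFetcher) -> Optional[str]:
    """Resolve a 'fixed' step or a hypothetical 'qid' step to a field value."""
    kind, value = step["type"], step["value"]
    if kind == "fixed":
        # Same text on every wiki.
        return value
    if kind == "qid":
        # Wikilink to the local page for this item, if the wiki has one.
        title = fetch_sitelinks(value).get(wiki)
        return f"[[{title}]]" if title else None
    raise ValueError(f"unsupported step type: {kind}")


# Made-up stub data standing in for Wikidata sitelinks.
_STUB = {"Q_EXAMPLE": {"enwiki": "Example Publisher",
                       "eswiki": "Editorial Ejemplo"}}


def stub_fetcher(qid: str) -> Dict[str, str]:
    return _STUB.get(qid, {})
```

Note that `wiki` has to reach the service with each request, which is the extra input mentioned above; what to return when the current wiki has no page for the item (here `None`) is a further design decision this sketch leaves open.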

Another point is that having separate templates for separate wikis would allow wikilinking to pages in the fields

Thanks! I opened a separate task for that: T309869.