
Come up with a better way to auto-label references
Open, Low · Public · 8 Story Points

Description

Add TemplateData configuration for how reference names should be generated.

Event Timeline

Jdforrester-WMF raised the priority of this task from to Medium.
Jdforrester-WMF updated the task description. (Show Details)
Jdforrester-WMF added a subscriber: Jdforrester-WMF.
Restricted Application added a subscriber: Aklapper. Mar 11 2015, 7:29 PM
Jdforrester-WMF edited a custom field.

The ideal system would use TemplateData to specify which parameters the name should be generated from. Where no such data exists we still need a sensible fallback, and the current system has the advantage of using Latin digits, which work well in other languages.

Mvolz added a subscriber: Mvolz. Mar 25 2015, 4:54 PM
Jdforrester-WMF lowered the priority of this task from Medium to Low. Apr 16 2015, 9:02 PM
Elitre added a subscriber: Elitre. Sep 2 2015, 3:31 PM

No progress on this? <ref name=":0"> is not just ugly, it looks like an error code or something.

No progress on this? <ref name=":0"> is not just ugly, it looks like an error code or something.

It's not in the top 100 things we're working on.

Anomie added a subscriber: Anomie. Nov 4 2015, 2:22 PM

Another reason to fix this bug: people copy wikitext between articles (often enough that enwiki has a bot to look for it). With sensibly named references it is easier to figure out what is going on when someone copies text containing a <ref name="..."/> (which results in a broken reference in the copied-to article), and name collisions are less likely when someone copies a reference that includes the full reference body text.

I ran into this earlier today. name=":0" is too generic and easily causes problems when bringing in content from another page (both in wikitext and through VE) because it is basically guaranteed to conflict if both pages had at least one VE-generated citation.

One small improvement we could make is to use a simple hash. Nothing strong or cryptographic, but something like string-hash.js (5 lines of code).

We can convert the digest number to a string with .toString(36) and produce a short unique string (taking care to check if it already exists, at which point one could add Math.random and hash again).

That takes care of the cross-article conflict problem and is better than starting the count at 0 and using :0 as id (which we currently do – also no idea why there is a colon in the name).
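The hashing idea above could look roughly like the following. This is only a sketch under stated assumptions: stringHash and generateRefName are hypothetical helper names, not functions from the Cite extension, and the hash is a djb2-style function similar in spirit to string-hash.js. The sketch keeps a leading ":" because a base-36 digest can consist only of digits, and purely numeric names are reported elsewhere in this thread as unsupported by Cite.

```javascript
// Sketch only: stringHash and generateRefName are hypothetical helpers,
// not functions from the Cite extension or VisualEditor.

// Small non-cryptographic hash (djb2 with XOR), in the style of string-hash.js.
// Returns an unsigned 32-bit integer.
function stringHash(str) {
    let hash = 5381;
    for (let i = str.length - 1; i >= 0; i--) {
        hash = (hash * 33) ^ str.charCodeAt(i);
    }
    return hash >>> 0;
}

// Derive a short name from the reference content. existingNames is a Set of
// names already used on the page.
function generateRefName(content, existingNames) {
    let name = ':' + stringHash(content).toString(36);
    while (existingNames.has(name)) {
        // Collision: mix in a random value and hash again.
        name = ':' + stringHash(content + Math.random()).toString(36);
    }
    return name;
}

const used = new Set();
const first = generateRefName('{{cite web|url=https://example.org}}', used);
used.add(first);
const second = generateRefName('{{cite web|url=https://example.org}}', used);
console.log(first, second); // two different short names
```

Because the name depends on the reference content rather than on a per-page counter, two pages that each contain one VE-generated citation would usually end up with different names, which is the cross-article property asked for above.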

also no idea why there is a colon in the name

The Cite extension doesn't support integers as reference names. I don't know if colon specifically was chosen for any particular reason.

Neil_P._Quinn_WMF removed Neil_P._Quinn_WMF as the assignee of this task. Dec 11 2015, 10:31 PM

I'd love to see Citoid use some sort of reference naming! Auto-generating a name for each reference is a good alternative to setting them manually, and I support it.

@Anomie, I believe that the colon was chosen because it's in the very small set of (things that can be used) and (characters present on the keyboards of most MediaWiki users). The first requirement explains why it's not all numbers (some non-numeric character is required), and the second explains why it's not a Latin alphabet character.

Krinkle removed a subscriber: Krinkle. May 20 2016, 6:08 PM

I ran into this earlier today. name=":0" is too generic and easily causes problems when bringing in content from another page (both in wikitext and through VE) because it is basically guaranteed to conflict if both pages had at least one VE-generated citation.

I agree. We don't (yet) need a system that creates meaningful names, but we need one that doesn't generate likely collisions.

If someone can point me to where the current code lives, I'll write up a patch and submit it.

If someone can point me to where the current code lives

I think this is modules/ve-cite/ve.dm.MWReferenceNode.js in the Cite extension, see the "Generate a name starting with ':' to distinguish it from normal names" comment

I think this is modules/ve-cite/ve.dm.MWReferenceNode.js in the Cite extension, see the "Generate a name starting with ':' to distinguish it from normal names" comment

Thanks for the pointer. I'm working on this. For folks who are also looking, that isn't the only place where we expect that reference numbering: https://phabricator.wikimedia.org/diffusion/ECIT/browse/master/modules/ve-cite/ve.ui.MWReferenceSearchWidget.js;06376669d9c1895d9b312998d0ee331520eea6a1$161-165

Boghog added a subscriber: Boghog. Apr 2 2017, 7:12 PM

While ref tags that take the form of ":0", ":1", ":2" are unique, they are not very informative. One alternative would be a Harvard-style ref tag in the form of first author's last name + year of publication (e.g., "Smith_2017").

@Boghog I agree, and if there were several different publications by Smith from 2017, they would become Smith_2017a, Smith_2017b, and so on.
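A minimal sketch of the Harvard-style scheme with letter suffixes follows. The function name harvardRefName is hypothetical. It assumes the first reference keeps the plain "Smith_2017" and later ones get "a", "b", and so on, which is one possible reading of the suggestion; it also assumes a numeric suffix once the 26 letters run out, which nobody in this thread specified.

```javascript
// Sketch only: harvardRefName is a hypothetical helper, not existing Cite code.
// existingNames is a Set of reference names already used on the page.
function harvardRefName(lastName, year, existingNames) {
    // Replace whitespace so multi-word surnames stay a single token.
    const base = lastName.trim().replace(/\s+/g, '_') + '_' + year;
    if (!existingNames.has(base)) {
        return base;
    }
    // Smith_2017a, Smith_2017b, ... Smith_2017z
    for (let i = 0; i < 26; i++) {
        const candidate = base + String.fromCharCode(97 + i);
        if (!existingNames.has(candidate)) {
            return candidate;
        }
    }
    // Assumed fallback after 26 letters: Smith_2017_27, Smith_2017_28, ...
    let n = 27;
    while (existingNames.has(base + '_' + n)) {
        n++;
    }
    return base + '_' + n;
}

const pageNames = new Set();
for (let i = 0; i < 3; i++) {
    const name = harvardRefName('Smith', 2017, pageNames);
    pageNames.add(name);
    console.log(name);
}
```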

TheDJ added a subscriber: TheDJ. Apr 4 2017, 8:53 AM
tomasz removed a subscriber: tomasz. Jun 11 2017, 10:59 PM
Krinkle removed a subscriber: Krinkle.
Izno added a project: Cite. Nov 20 2017, 2:21 AM
Izno moved this task from Unsorted backlog to External on the Cite board. Nov 20 2017, 2:26 AM

Auto-label them before insertion, but allow them to be changed by pressing the Edit button when the mouse pointer hovers over the newly created citation. This would be done before the changes are saved.

PamD added a comment. Nov 5 2018, 9:38 PM

I'm not surprised to find that this has already been raised, but am surprised and disappointed that it's been allowed to remain unresolved for so long.

If a reference uses a citation template, then there are fields which can be used to make a reference name. It doesn't depend on Artificial Intelligence solutions, just a "If LAST1 is present, use it. If that name matches an existing reference, and DATE is present, add the year. If no year, add a running number. etc etc". Even if the flowchart had some "too difficult" end boxes saying "If all else fails use a colon and a number", we could get the vast majority of reference names chosen sensibly, in a way compliant with the spirit of the enwiki guideline which forbids the use of purely numeric reference names. ":0" is not purely numeric, but all arguments against purely numeric names apply to it.
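The flowchart described above can be sketched as a short fallback chain. Everything here is illustrative: chooseRefName is a hypothetical helper, the parameter names last1, last, date and year are the English citation-template names assumed for the example, and applying the running number after the year variant is also taken is one interpretation of "etc etc".

```javascript
// Sketch only: chooseRefName is a hypothetical helper, not existing Cite code.
// params holds citation template parameters; existingNames is a Set of names
// already used on the page.
function chooseRefName(params, existingNames) {
    const last = String(params.last1 || params.last || '').trim().replace(/\s+/g, '_');
    const yearMatch = /\b(\d{4})\b/.exec(String(params.date || params.year || ''));

    if (last) {
        // 1. Use the surname on its own if it is free.
        if (!existingNames.has(last)) {
            return last;
        }
        // 2. If a date is present, add the year.
        let stem = last;
        if (yearMatch) {
            stem = last + '_' + yearMatch[1];
            if (!existingNames.has(stem)) {
                return stem;
            }
        }
        // 3. Otherwise (or if that is taken too) add a running number.
        for (let n = 2; ; n++) {
            if (!existingNames.has(stem + '_' + n)) {
                return stem + '_' + n;
            }
        }
    }

    // "If all else fails use a colon and a number."
    for (let n = 0; ; n++) {
        if (!existingNames.has(':' + n)) {
            return ':' + n;
        }
    }
}

const seen = new Set();
const cite = { last1: 'Smith', date: '12 March 2017' };
for (let i = 0; i < 3; i++) {
    const name = chooseRefName(cite, seen);
    seen.add(name);
    console.log(name);
}
```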

While this would be easy to implement for any specific language (e.g. only for English), keep in mind that citation templates are translated to 200+ languages. When this task was filed, we had no way to know that e.g. "nazwisko" in Polish is equivalent to "last" in English.

It seems that since then, someone has invented Citoid and TemplateData :), and as part of these, invented a way for communities to specify a mapping like this – see e.g. https://pl.wikipedia.org/w/index.php?title=Szablon:Cytuj_stronę/opis&action=edit (search for "maps"; this is the "cite web" template).

We could probably use those mappings now, there is some documentation here: https://www.mediawiki.org/wiki/Citoid/Maps_TemplateData
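As a rough illustration of how such a mapping might be used, the sketch below looks up the surname and date parameters through a TemplateData "maps" entry. The shape shown (a "citoid" map whose "author" value is a list of [first, last] parameter pairs) is my understanding of the Citoid maps format and should be checked against the documentation linked above. Apart from "nazwisko", the Polish parameter names are made up for the example, and canonicalFields is a hypothetical helper.

```javascript
// Illustrative TemplateData fragment. Only "nazwisko" comes from this thread;
// the other parameter names and the exact map shape are assumptions.
const exampleTemplateData = {
    maps: {
        citoid: {
            title: 'tytuł',
            date: 'data',
            author: [['imię', 'nazwisko'], ['imię2', 'nazwisko2']]
        }
    }
};

// Hypothetical helper: translate localized template parameters into the
// language-independent fields a name generator would need.
function canonicalFields(templateData, params) {
    const map = (templateData.maps && templateData.maps.citoid) || {};
    const firstAuthor = Array.isArray(map.author) && Array.isArray(map.author[0]) ?
        map.author[0] : [];
    const lastParam = firstAuthor[1]; // assumed order: [first, last]
    return {
        last: lastParam ? params[lastParam] : undefined,
        date: typeof map.date === 'string' ? params[map.date] : undefined
    };
}

const fields = canonicalFields(exampleTemplateData, {
    nazwisko: 'Kowalski',
    data: '2017-03-12'
});
console.log(fields.last, fields.date);
```

Templates without a map simply yield undefined fields, so a generator could fall back to the existing ":0" style for them.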

As for the actual algorithm for generating the name, surely there exists some bot or something already that merges and names identical references? It would be a lot easier if such a thing was out there and if we could borrow that code.

Elitre removed a subscriber: Elitre. Nov 8 2018, 4:42 PM

Such a bot has operated in the past (https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/Polbot_8 ), but I don't know if any bots are currently doing this.

Looks like that also didn't generate the names cleverly. It just used "botgen1", "botgen2" etc., instead of ":0", ":1" etc.

This has been proposed as part of the 2019 Community Wishlist: https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2019/Citations/VisualEditor:_Allow_references_to_be_named It's too early to know whether it will make the top 10 (voting will be open until 30 November 2018), but it's currently among the more popular items, which suggests that solving this problem has widespread community support.

While this would be easy to implement for any specific language (e.g. only for English), keep in mind that citation templates are translated to 200+ languages. When this task was filed, we had no way to know that e.g. "nazwisko" in Polish is equivalent to "last" in English. ...

Remember the good ol', "don't let the perfect be the enemy of the good." When I go to https://www.wikipedia.org, I only see ten Wikipedias listed there. If you implement the fix just for those ten, I'm guessing you're fixing a very significant percentage of the problem. Nothing wrong with incremental rollout: I see no reason to hold up an initial fix for a handful of languages, while someone figures out how to say "last1" and "year" in Inuktitut, Kapampangan, Tuvinian and Cherokee.

Izno added a comment. Dec 8 2018, 12:03 AM

This has been proposed as part of the 2019 Community Wishlist: https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2019/Citations/VisualEditor:_Allow_references_to_be_named It's too early to know whether it will make the top 10 (voting will be open until 30 November 2018), but it's currently among the more popular items, which suggests that solving this problem has widespread community support.

This and T52568 (VisualEditor: Be able to name references manually in the reference dialog) were both in the top 10.

Tgr added a subscriber: Tgr. Dec 25 2018, 7:02 AM
PamD added a comment. Apr 17 2019, 9:03 AM

The discussion above seems to ignore the needs of human editors. When I try to work in the text editor on an article which has multiple multi-used references, created in VE, I need to be able to see which reference is which. Initially I can see that footnote n refers to reference "colon, n minus one"; by the time I've rearranged the text of the article I now have footnote "4" as ref ":3", and so on. See https://en.wikipedia.org/wiki/Kate_Jagoe-Davies as an example.