
Come up with a better way to auto-label references
Open, Low · Public · 8 Estimated Story Points

Description

Add TemplateData configuration for how reference names should be generated.


Event Timeline


If someone can point me to where the current code lives

I think this is modules/ve-cite/ve.dm.MWReferenceNode.js in the Cite extension, see the "Generate a name starting with ':' to distinguish it from normal names" comment


Thanks for the pointer. I'm working on this. For folks who are also looking, that isn't the only place where we expect that reference numbering: https://phabricator.wikimedia.org/diffusion/ECIT/browse/master/modules/ve-cite/ve.ui.MWReferenceSearchWidget.js;06376669d9c1895d9b312998d0ee331520eea6a1$161-165

While ref tags that take the form of ":0", ":1", ":2" are unique, they are not very informative. One alternative would be a Harvard-style ref tag in the form of the first author's last name + year of publication (e.g., "Smith_2017").

@Boghog I agree + if there were more different publications by Smith from 2017, then Smith_2017a, Smith_2017b...
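A minimal sketch of the "Smith_2017, Smith_2017a, Smith_2017b" idea, not taken from any existing code. The function name `harvard_ref_name` is hypothetical, and it assumes the first reference keeps the bare name while later ones get a letter suffix, which the comment above leaves open.

```python
import string


def harvard_ref_name(last_name, year, existing_names):
    """Return 'Last_Year', or 'Last_Yeara', 'Last_Yearb', ... if that is taken."""
    base = f"{last_name}_{year}"
    if base not in existing_names:
        return base
    # Assumption: the earlier reference keeps its name; only new ones get a letter.
    for letter in string.ascii_lowercase:
        candidate = base + letter
        if candidate not in existing_names:
            return candidate
    raise ValueError(f"no free letter suffix left for {base!r}")


names = set()
for _ in range(3):
    names.add(harvard_ref_name("Smith", 2017, names))
print(sorted(names))  # ['Smith_2017', 'Smith_2017a', 'Smith_2017b']
```

Leaving the first name untouched avoids renaming references that are already in use elsewhere on the page.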

Auto-label them before insertion, but allow them to be changed by pressing the Edit button when our mouse pointer is hovered on the newly created Citation. This would be done before the changes are Saved.

I'm not surprised to find that this has already been raised, but am surprised and disappointed that it's been allowed to remain unresolved for so long.

If a reference uses a citation template, then there are fields which can be used to make a reference name. It doesn't depend on Artificial Intelligence solutions, just a "If LAST1 is present, use it. If that name matches an existing reference, and DATE is present, add the year. If no year, add a running number. etc etc". Even if the flowchart had some "too difficult" end boxes saying "If all else fails use a colon and a number", we could get the vast majority of reference names chosen sensibly, in a way compliant with the spirit of the enwiki guideline which forbids the use of purely numeric reference names. ":0" is not purely numeric, but all arguments against purely numeric names apply to it.
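The flowchart described above could look roughly like this. It is only a sketch: `suggest_ref_name` and the dict-of-fields input are hypothetical, the `last1` and `date` keys follow the enwiki parameter names mentioned in the comment, and falling through to a running number when surname plus year is also taken is an assumption filling in the "etc etc".

```python
import re


def suggest_ref_name(fields, existing_names):
    """Pick a readable ref name from citation-template fields.

    fields: template parameters, e.g. {"last1": "Smith", "date": "3 May 2017"}
    existing_names: set of ref names already used on the page
    """
    last = (fields.get("last1") or "").strip()
    if last:
        # Step 1: the bare surname, if nobody else is using it.
        if last not in existing_names:
            return last
        # Step 2: surname plus the first four-digit year found in the date.
        year = re.search(r"\b(\d{4})\b", fields.get("date") or "")
        if year:
            candidate = f"{last}{year.group(1)}"
            if candidate not in existing_names:
                return candidate
        # Step 3 (assumed): surname plus a running number.
        n = 2
        while f"{last}{n}" in existing_names:
            n += 1
        return f"{last}{n}"
    # "Too difficult" box: fall back to the current ':N' scheme.
    n = 0
    while f":{n}" in existing_names:
        n += 1
    return f":{n}"
```

For example, with "Smith" already on the page, `suggest_ref_name({"last1": "Smith", "date": "3 May 2017"}, {"Smith"})` yields "Smith2017", and a reference with no usable fields still gets ":0", ":1", and so on.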

While this would be easy to implement for any specific language (e.g. only for English), keep in mind that citation templates are translated to 200+ languages. When this task was filed, we had no way to know that e.g. "nazwisko" in Polish is equivalent to "last" in English.

It seems that since then, someone has invented Citoid and TemplateData :), and as part of these, invented a way for communities to specify a mapping like this – see e.g. https://pl.wikipedia.org/w/index.php?title=Szablon:Cytuj_stronę/opis&action=edit (search for "maps"; this is the "cite web" template).

We could probably use those mappings now, there is some documentation here: https://www.mediawiki.org/wiki/Citoid/Maps_TemplateData
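As a rough illustration of how such a mapping could make the naming language-neutral: the sketch below assumes the TemplateData block has a `maps.citoid` object in which `author` is a list of [first, last] parameter-name pairs, as I understand the Citoid documentation linked above. The helper names and the Polish parameter names other than "nazwisko" are hypothetical.

```python
def local_param_for(template_data, citoid_field):
    """Look up which local template parameter a Citoid field maps to."""
    mapping = template_data.get("maps", {}).get("citoid", {})
    target = mapping.get(citoid_field)
    if citoid_field == "author" and target:
        # Assumed shape: a list of [first, last] parameter-name pairs.
        return target[0][1]
    return target


# Hypothetical excerpt of a Polish "cite web" TemplateData block.
PL_CITE_WEB = {
    "maps": {
        "citoid": {
            "author": [["imię", "nazwisko"], ["imię2", "nazwisko2"]],
            "date": "data",
        }
    }
}


def name_fields(template_data, params):
    """Extract language-neutral 'last' and 'date' values from a template call."""
    last_param = local_param_for(template_data, "author")
    date_param = local_param_for(template_data, "date")
    return {
        "last": params.get(last_param, ""),
        "date": params.get(date_param, ""),
    }


print(name_fields(PL_CITE_WEB, {"nazwisko": "Kowalski", "data": "2017-05-03"}))
```

A template without a Citoid map simply yields empty fields, so the caller can fall back to whatever default naming it already uses.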

As for the actual algorithm for generating the name, surely there exists some bot or something already that merges and names identical references? It would be a lot easier if such a thing was out there and if we could borrow that code.

Such a bot has operated in the past: https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/Polbot_8 I don't know if any bots are currently doing this.

Looks like that also didn't generate the names cleverly. It just used "botgen1", "botgen2" etc., instead of ":0", ":1" etc.

This has been proposed as part of the 2019 Community Wishlist: https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2019/Citations/VisualEditor:_Allow_references_to_be_named It's too early to know whether it will make the top 10 (voting will be open until the 30 November 2018), but it's currently among the more popular items, which suggests that solving this problem has widespread community support.

While this would be easy to implement for any specific language (e.g. only for English), keep in mind that citation templates are translated to 200+ languages. When this task was filed, we had no way to know that e.g. "nazwisko" in Polish is equivalent to "last" in English. ...

Remember the good ol', "don't let the perfect be the enemy of the good." When I go to https://www.wikipedia.org, I only see ten Wikipedias listed there. If you implement the fix just for those ten, I'm guessing you're fixing a very significant percentage of the problem. Nothing wrong with incremental rollout: I see no reason to hold up an initial fix for a handful of languages, while someone figures out how to say "last1" and "year" in Inuktitut, Kapampangan, Tuvinian and Cherokee.

This has been proposed as part of the 2019 Community Wishlist: https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2019/Citations/VisualEditor:_Allow_references_to_be_named It's too early to know whether it will make the top 10 (voting will be open until the 30 November 2018), but it's currently among the more popular items, which suggests that solving this problem has widespread community support.

This along with T52568: VisualEditor: Be able to name references manually in the reference dialog were in the top 10.

The discussion above seems to ignore the needs of human editors. When I try to work in the text editor on an article which has multiple multi-used references, created in VE, I need to be able to see which reference is which. Initially I can see that "footnote n refers to reference colon - n - minus - one"; by the time I've rearranged the text of the article I now have footnote "4" as ref ":3", and so on. See https://en.wikipedia.org/wiki/Kate_Jagoe-Davies as an example.

The current reference naming system is a problem for excerpts (transclusions of part of an article into another article) which are used heavily on the Spanish Wikipedia and may soon start seeing much more use on other wikis. The reason is that often, a part of an article containing a reference named :0 is transcluded into another article that also has a reference named :0, causing a conflict. The current solution is to just rename one of the references, but this is not easy for new users and sometimes it's not even easy for advanced users. This issue would no longer happen (or at least be very rare) if the references were named "semantically".

How is this still languishing after five years? Visual editor adds useless ref names such as ":2". When I need to switch to source, I can't tell which ref is which. The only way I've found to prevent this is to add a ref, switch to source before using it a second time, add a useful ref name like jonesNYT1may2020 or whatever, which besides being tedious when I'm adding more than one ref doesn't fix all the other stupidly named refs added by someone else using VE. For editors who switch back and forth often, or editors who edit primarily in source, these useless ref names are just infuriating. Why are we deciding that the priority on this should be low? It's literally to me the most irritating thing about Visual Editor.

This would need some sort of simple hash codes. I'm not sure what they should be based on. Timestamp? URL? Contents generally? Revision ID?

With support of Citoid, we could generate simple acronyms using Zotero or some Zotero plugin like BetterBibTeX, which can handle this quite well.

This would need some sort of simple hash codes.

I think what you meant was, "one solution might involve some sort of simple hash codes." It's certainly not true that this would need hash codes.

Another solution was proposed by PamD (Nov 5, 2018 1:38PM), and hers is far better, in my opinion, as it is human-friendly, and hash codes are not.

Even a hash code solution would have to deal with what to do about collisions, which could be made as improbable as desired, but not impossible. So, you're going to have to code the collision pathway anyway, and figure out what to do. Or, don't code it, allow the collision, and leave the rare named reference collisions lying around like little unexploded mines, that virtually no user, no matter how advanced, will ever disentangle.
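To make the collision point concrete, here is one possible shape for a hash-based name with an explicit collision pathway. Everything in it is an assumption for illustration: the `ref-` prefix, the six-character default, hashing the reference content with SHA-1, and resolving a collision by lengthening the prefix.

```python
import hashlib


def hashed_ref_name(content, existing, length=6):
    """Name a reference after a short hash of its content.

    existing maps names already in use to the content they stand for.
    """
    digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
    # Lengthen the prefix until it is free or already means this same content.
    for end in range(length, len(digest) + 1):
        name = "ref-" + digest[:end]
        if existing.get(name, content) == content:
            return name
    raise RuntimeError("full digest collided with different content")
```

Identical content gets the same name again, which is what allows reuse; different content that happens to share a prefix gets a longer one. The resulting names are still opaque to a human reader, which is the objection raised above.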

Let's prioritize the users, here. PamD's solution seems better to me. If there are weak points in her design that I'm not seeing, let's identify them, and resolve them.

Another solution was proposed by PamD (Nov 5, 2018 1:38PM), and hers is far better, in my opinion, as it is human-friendly, and hash codes are not.

There is no difference between my and PamD's suggestion:

With support of Citoid, we could generate simple acronyms using Zotero or some Zotero plugin like BetterBibTeX, which can handle this quite well.

We both want to use Citation templates data if there is any. Simple hash codes would be needed for references without Citation templates only. I just suggested a way PamD's suggestion could be made possible.

I see, my apologies in that case; I misread what you were proposing. Thanks for the clarification.

For references without citation templates, could one scan for something resembling date, isbn, doi, or pmid, and use that if possible? And in either case, what happens with collisions?
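A sketch of that scanning idea for DOI, PMID, and ISBN (dates are left out). The regular expressions are deliberately loose and are my own assumptions, not validated identifier grammars; `name_from_identifier` is a hypothetical helper, and it returns None so the caller can decide what to do when nothing is found or when the name is already taken.

```python
import re

# Deliberately loose patterns; real identifiers have more edge cases.
PATTERNS = [
    ("doi", re.compile(r"\b(10\.\d{4,9}/[^\s|}<\]]+)")),
    ("pmid", re.compile(r"\bpmid\s*[:=]?\s*(\d+)", re.IGNORECASE)),
    ("isbn", re.compile(r"\bisbn\s*[:=]?\s*(\d[\d\- ]{8,15}[\dXx])", re.IGNORECASE)),
]


def name_from_identifier(ref_text):
    """Return a name such as 'pmid112233', or None if no identifier is found."""
    for kind, pattern in PATTERNS:
        match = pattern.search(ref_text)
        if match:
            value = match.group(1)
            if kind == "isbn":
                # Drop hyphens and spaces so equivalent ISBNs give the same name.
                value = re.sub(r"[\s\-]", "", value)
            return f"{kind}{value}"
    return None


print(name_from_identifier("Lee J. Some paper. PMID 112233."))  # pmid112233
```

Because the same identifier always produces the same name, a collision here usually means the same source is being cited twice, which is a candidate for reuse rather than renaming.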

Why was this moved from medium to low priority?

@Valereee because in whatever ticket it was where they did the actual investigation, they found a whole bunch of reasons that doing this would be hard, and so I think it's not getting done, if I recall.

The priority got lowered in 2015, likely because there was no developer assigned to work on it at that point. See https://www.mediawiki.org/wiki/Phabricator/Project_management#Priority_levels

Please move it to highest priority. Develop the fix per PamD and patch it in to the existing visual editor, or just disable the visual editor until this is done. It is unacceptable for the visual editor to generate names that (a) frequently cause name conflicts and (b) go against :en:WP:REFNAME, which discourages this style of ref name.

Who has the authority to upgrade or downgrade a priority?

I agree that this would still be an important improvement to consider. Appropriate naming using the "reuse" button on vis editor would be helpful for edu projects as well. Thank you for flagging me here!

disable the visual editor until this is done.

That's a bit of an overreaction. The way the visual editor is doing things right now isn't ideal, but it's easy enough to fix afterwards.

It is unacceptable for the visual editor to generate names that (a) frequently cause name conflicts

Can you give an example of this happening? I've never seen the visual editor do this.

Who has the authority to upgrade or downgrade a priority?

The Editing Team, who have almost certainly seen the comments here.

Community Tech, who is probably more likely to actually work on this, has also looked at this (T243300: Spike: Investigate Named References in VE [8 hours]), but it hasn't made it onto their list of upcoming projects. We might see them prioritize it sometime this year.

@Barkeep49 I won't pretend to understand why a fix would be hard, but for heaven's sake the fact something might be difficult shouldn't be a reason to downgrade its importance. That just tries to hide the biggest problems by calling them minor. It's like dropping your keys in the street but looking for them on the porch because the porch light makes it easier to search there. We should be prioritizing by actual priority as assessed by the people who are using the tool. I use this tool and honestly this is for me the single biggest frustration I have with Vis Ed. And honestly whoever decided it would be okay to make Vis Ed work like this in the first place must not actually edit. No one who had written an article from scratch would ever have thought this was by ANY measure a reasonable decision.

@Valereee You appear to be confusing "community importance" with "developer priority".

Importance is how impactful a particular bug or feature request is to a project's users. It's important to consider, but prioritizing only by importance would not be an efficient use of development resources.

Development priority refers to how a bug or feature request fits into a volunteer developer's interests or a WMF team's planning. There are vastly more bugs and feature requests than developers with the time and knowledge to fix them, so developers have to prioritize. The priority on a task communicates from the developers to the community (and other developers) what is being worked on now and in the near future. Increasing the priority of a task doesn't change how WMF teams and volunteer developers plan their time.

WMF development teams typically focus on one project at a time, often based on priorities set in WMF Annual Plans and other strategy processes. This often does result in a de-prioritization of smaller fixes and feature requests that don't require a large team to be redirected for a period of time or that don't produce a new thing that can be shown off. That's unfortunate, and I don't like it, but I also don't see it changing anytime soon.

The Community Tech team uses the Community Wishlist to pick up some of those smaller, high-importance feature requests. This task has come up twice in the Wishlist, and CommTech does appear to be looking at picking it up -- they just haven't done it yet.
For clarity, @ifried, is Community Tech planning to work on how VE names references at some point in the near-ish future, or has that decision not been made yet?

The other option is convincing someone to do this as a volunteer. I'm not sure that all-volunteer work is the best way to handle this task, as anything VE-related and TemplateData-related can be fairly involved.

@Barkeep49 I won't pretend to understand why a fix would be hard, but for heaven's sake the fact something might be difficult shouldn't be a reason to downgrade its importance. That just tries to hide the biggest problems by calling them minor.

Difficulty to fix should and does affect priority; teams have to figure out the reward per unit effort and prioritise accordingly. There are whole management frameworks based on this concept (e.g. impact-effort matrix).

It's like dropping your keys in the street but looking for them on the porch because the porch light makes it easier to search there. We should be prioritizing by actual priority as assessed by the people who are using the tool.

That's exactly what teams do, and why users are able to file Phabricator tasks in the first place. The teams then have to balance user requests with other strategic and infrastructural work.

I use this tool and honestly this is for me the single biggest frustration I have with Vis Ed.

You're not alone in that. There are also thousands of visual editor users out there that do not share your frustration, and thousands of other tasks the team have to work on too, many of which have far larger impact than this.

And honestly whoever decided it would be okay to make Vis Ed work like this in the first place must not actually edit. No one who had written an article from scratch would ever have thought this was by ANY measure a reasonable decision.

James Forrester and I both have over 20,000 edits each, so no, not really. How about we focus on the substance of the task, and not on the characteristics of the people involved?

Hey, @AntiCompositeNumber, thanks for pinging me! Yes, the Community Tech team did receive a wish from the 2019 wishlist to allow named references in VE. However, we haven't conducted an analysis to make a decision yet. We are currently working on two other projects (watchlist expiry and the ebook export improvement project), which are our main focuses right now. When we do conduct an analysis, we'll share our findings with the community. Thanks!

My apologies; I didn't mean to make this personal and shouldn't have said that. I'm just finding it so difficult to understand how anyone who adds sources to articles and uses Visual Editor to do it wouldn't be ridiculously frustrated by Vis Ed's manner of naming sources and consider this more than 'low' priority. I started editing nearly fifteen years ago. I am very comfortable with source editing, and I think I'm probably pretty unusual in that as someone who edited in source for over a decade, I now use Vis Ed for probably 99% of my editing and only switch to source when necessary. When I first encountered these refnames I didn't realize they were from Vis Ed and thought there was some really prolific editor out there who was naming stuff in ways only they could possibly understand. If you're using Vis Ed to add sources how do you work around this problem? Are you adding the source in Vis Ed, then switching to source each time so you don't leave behind a meaningless ref name mess for source editors and switchers to have to deal with, then switching back to Vis Ed until you add the next ref? And if so, don't you find that incredibly frustrating when it would be so much easier if you could just specify what Vis Ed names the reference instead of having to switch to source every time you add a ref for the first time?

@AntiCompositeNumber, and yes, if this is something we need to pay someone to do because it's not interesting for volunteer developers, then of course let's for heaven's sake pay someone to solve a problem that causes high levels of editor frustration.

I'm just finding it so difficult to understand how anyone who adds sources to articles and uses Visual Editor to do it wouldn't be ridiculously frustrated by Vis Ed's manner of naming sources

If you were using the visual editor for all/nearly all of your editing, then you would never see these 'names' anyway, so it's not frustrating at all.

If you're using Vis Ed to add sources how do you work around this problem? Are you adding the source in Vis Ed, then switching to source each time so you don't leave behind a meaningless ref name mess for source editors and switchers to have to deal with, then switching back to Vis Ed until you add the next ref? And if so, don't you find that incredibly frustrating when it would be so much easier if you could just specify what Vis Ed names the reference instead of having to switch to source every time you add a ref for the first time?

I mostly don't worry about it, and if I do, then I switch to a wikitext editor and use find-and-replace to change all ":0" to "something sensible", and then move on. You could also generate them in the 2017 wikitext editor in the first place. It has the same toolbar, with the same magical referencing system.

And now I have a request: If you all want to carry on a non-technical conversation about this problem, could we please do that at https://www.mediawiki.org/wiki/VisualEditor/Feedback ? Phabricator isn't a great place to discuss whether a problem should be solved, what year it should happen in, or who should do the work.

It is unacceptable for the visual editor to generate names that (a) frequently cause name conflicts

Can you give an example of this happening? I've never seen the visual editor do this.

It basically doesn't happen when people create new content. It does happen occasionally, e.g., if someone is careless about a wikitext-based page merge or copying between articles. (In the visual editing mode, such ref names are resolved automagically but a little strangely – @Deskana, do you remember the bug about an unexpected <ref name=":12"/>, and then it becomes <ref name=":122"/>, and <ref name=":1222"/>? Pasting a conflicting refname into the visual mode will trigger the addition of the extra 2 at the end. It was probably meant to turn Smith into Smith2.)
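The behaviour described here can be reproduced with a few lines. This is not the actual VisualEditor code, only a guess at the kind of rule that would produce those names; `resolve_by_appending_2` is a hypothetical function.

```python
def resolve_by_appending_2(name, existing_names):
    """Naive conflict resolution: keep appending '2' until the name is free."""
    while name in existing_names:
        name += "2"
    return name


print(resolve_by_appending_2("Smith", {"Smith"}))        # Smith2
print(resolve_by_appending_2(":12", {":12"}))            # :122
print(resolve_by_appending_2(":12", {":12", ":122"}))    # :1222
```

With a word-like name the suffix reads as "second Smith", but with an auto-generated numeric name it just looks like a different number, which would explain why the result seems strange.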

Some years back, there was a bot at enwiki that tried to use ref names to 'rescue' refs across articles. The idea was that <ref name="pmid112233" /> was going to be the same across all articles. He eventually had to stop doing that for short ref names, because not only does <ref name=":0" /> not always refer to the same source across all articles, but <ref name="Lee"/> and <ref name="WHO" /> don't, either. Before he realized what was happening, there were a few messes created.

The generation of meaningless, easily confused ref names happens, it is a frustration to editors trying to maintain the integrity of sourcing, therefore it should be fixed. By all means pay someone to do it if the volunteers don't want to do it for whatever reason. This is one of the fundamental reasons for the WMF's existence, and the annual fundraising.

Just to get a sense of perspective here, how many of those 20 000 edits were fixing problems caused by VE?

Cheers, Peter


And now I have a request: If you all want to carry on a non-technical conversation about this problem, could we please do that at https://www.mediawiki.org/wiki/VisualEditor/Feedback ? Phabricator isn't a great place to discuss whether a problem should be solved, what year it should happen in, or who should do the work.

Sure, but that directs you to a page that directs you to a page that directs you back here, just FYI. :)

With apologies to all wrt adding comments here that aren't directly in the task cycle of this ticket as cautioned by WaId above, I have one bit of great news to report, which I'm pretty sure everyone here who has faced this issue would love to know about, and I don't know how else to bring it to your attention.

Following a script request at en-wiki, User @Nardog has come up with the script RefRenamer. This script will convert all VE numeric names on a page to useful named references (default: Lastname-YYYY) with lots of additional options. This has worked flawlessly on pages containing more than a hundred numeric references, and is one of the most useful scripts I have encountered at en-wiki, possibly the most useful. Example: ~136 changes at "Generation Z" (diff).

It doesn't stop the VE problem from occurring, but it is a complete solution for converting one page that you're working on to reasonable ref names. Enjoy! (Send your love letters to Nardog at his UTP, or on the Talk page of the script.)

RefRenamer seems like a Beta tool. It is very good, but not super intuitive, not explained. The graphical interface could have user tips right on it, IMHO. The good thing is the preview window to have a double check moment before submitting the changes. Sadly, it does not (yet) search and suggest changes for the other problematic ref names which are: auto, auto1, auto2...auto234567890

[Update to above comments]: User:Nardog is very responsive and open-minded about improving or fixing their tool. The tool is hecka helpful as is. Looking forward to its evolution. Check out their Talk page. User:Mathglot is helping it evolve with their input too! Apologies if I came across as snarky or ungrateful. Cheers! Wayne

Change 891513 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Cite@master] Make auto-generated reference names like name=":0" visible in VE

https://gerrit.wikimedia.org/r/891513

Change 891513 merged by jenkins-bot:

[mediawiki/extensions/Cite@master] Make auto-generated reference names like name=":0" visible in VE

https://gerrit.wikimedia.org/r/891513