
Numeric ref name collision from merge of VE-updated sandbox section back to article having prior VE ref names
Open, Needs Triage | Public | BUG REPORT

Description

What happened:

When an editor copied their VE-updated sandbox back into a live article, an H:CERDK error was displayed in the References section ("named reference defined multiple times"). (permalink)

What should have happened:

Reference "names must be unique", so the merged article should not end up with two different references sharing one name. Editors using VE have no control over which reference names are created.

Why this happened:

This happened (see caveat below) because only part of the article was copied into the sandbox. While the editor was working in the sandbox, VE assigned a ref name to new content containing a new citation, with no knowledge of the ref names used in the portions of the live article that were not copied. VE therefore assigned a numeric name that was unique in the sandbox but not unique in the live article.

This is partly conjecture, but is based on my best understanding and reconstruction of events, based on reports by the editor, and by examination of article, sandbox, and editor history: the editor worked in a sandbox created empty, then progressively expanded it, interlacing some edits with new content (e.g., here) or new citations (e.g., here) with other edits being copies of source from the live article into the sandbox (e.g., here). I believe the smoking gun is this edit to their sandbox, which ultimately found its way back into the article, causing the collision. (Note: the term 'sandbox' is a convenience term; in fact, it was a user subpage.)

Reproduce:

It's difficult to know the exact sequence of events carried out by a student editor, and in any case reproduction involves multiple pieces. One is the "smoking gun" edit (if that reconstruction is correct); but if VE created that reference, that by itself is not a bug, as the numeric reference is unique in the sandbox. Likewise, a user "copying sandbox material back to the live article" is not, ipso facto, a problematic procedure; in fact, it is required of editors in the Wiki Ed program. Finally, I don't know VE's algorithm for assigning new numeric names for citations added by a user. Given all that, I can propose the following way to reproduce the problem. It is not definitive without internal knowledge of VE's numeric-id assignment algorithm, so multiple trials may be needed before the problematic behavior occurs:

Possible step sequence to reproduce:

  1. Find an article having some numeric, VE-style references.
  2. Create a user sandbox for the article.
  3. Optional: copy some source content (say, one or two sections to be worked on) from article to sandbox, ensuring that:
    • The copy must not be the entire article.
    • The portion of the article not copied must contain at least one VE numeric named reference (the more the better, to improve the likelihood of a given trial reproducing the bug).
  4. Using VE, add content in the sandbox, adding at least one citation (adding more than one may improve your chances).
  5. Observe sandbox for added numeric names: run diffs to see what numeric names were added, and note them.
  6. Search the live article for citations matching any of those names.
  7. Did you find any of those ref names in live? NO: go back to step 4. YES: continue.
  8. Match them up, live vs. sandbox. Are they for the same source (author, date, pub, etc.) in every case? YES: go back to step 4. NO: continue.
  9. Merge back: copy the updated section content over the original content in the article, after making sure that no other editors have made intervening updates to the working sections (see below).
  10. Click Preview mode, and scroll down to the References section.
  11. Note the red message "Cite error: The named reference ":123" was defined multiple times with different content (see the help page)."

This reproduction sequence contains a loop because of the uncertainty about what numeric name VE will assign, and whether that will or won't create a collision.
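Steps 5 to 8 can be partly automated. The following is a minimal sketch, assuming both pages are available as wikitext strings and that references use the simple `<ref name="...">...</ref>` form; the function names are hypothetical, and refs carrying extra attributes (such as `group`) or names containing a slash are not handled.

```python
import re

# Matches <ref name="X">content</ref>. Self-closed reuses such as
# <ref name="X"/> are skipped: they carry no content of their own.
REF_DEF = re.compile(
    r'<ref\s+name\s*=\s*"?([^">/]+?)"?\s*>(.*?)</ref>',
    re.DOTALL | re.IGNORECASE,
)


def named_ref_definitions(wikitext):
    """Map each ref name to the set of distinct contents it is defined with."""
    defs = {}
    for name, content in REF_DEF.findall(wikitext):
        # Collapse whitespace so that trivial spacing differences are ignored.
        defs.setdefault(name.strip(), set()).add(" ".join(content.split()))
    return defs


def find_collisions(live_wikitext, sandbox_wikitext):
    """Return ref names defined on both pages with different content."""
    live = named_ref_definitions(live_wikitext)
    sandbox = named_ref_definitions(sandbox_wikitext)
    shared = live.keys() & sandbox.keys()
    return sorted(name for name in shared if live[name] != sandbox[name])
```

Any name this returns is a candidate for the "defined multiple times with different content" error after merge-back. An empty list corresponds to the "go back to step 4" branches of the loop.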

Note: in the case of two editors working on the same article with VE, even on mutually exclusive sections (as I believe sometimes happens among student editors sharing work on an article), the problem becomes more complex, and is not part of this bug description. At a minimum, collisions become more likely, and more tedious to find and resolve.

As a practical matter: analysis of this type of collision is tedious and time-consuming, and requires a fairly experienced editor to perform. Repairing the collision can be even worse, and if not caught early on, or if involving multiple citations (or god forbid, multiples by two editors) may be so tedious as to be effectively impossible. (See this UTP discussion for an example. In this case, the collision was reported rapidly, involved no intervening edits by any other editor (save two minor bot edits), and involved only a single collision of one numeric reference–i.e., the simplest possible case of this. Still, analysis and especially repair was tedious.)

A sidebar about the assignment to project VE, and setting it as task='bug': I'm not unaware that from a narrow point of view, this might not be considered a 'bug' (especially by VE software developers), as strictly speaking, there is no way for VE to know what's going on in a separate file that is later merged back; or if it is a bug, maybe it's not a VE bug. I understand and sympathize (I've been a developer, and have been on the other side of this) but I try to take a user-first view, and from the point of view of a user editing a valid article containing valid citations using long-established procedures such as "developing in your sandbox" and "merging back", using an approved editor, for it to all go kablooey and generate an error message on the page is certainly not the editor's fault. That, to me, is the definition of a bug: something is not working right someplace, and the editor is not responsible for it. It doesn't matter much to the user who owns the task, or whether it is called a "bug" or something else, or what project board it ends up on. Maybe the locus of the problem lies somewhere in the interstices of insufficiently robust software, loose procedures, missing copy tools, vague documentation, or something as yet not clearly identified, but whatever it is, this is not the editing experience we wish our users to have. On that basis, I've raised it as a "bug", because that's how it looks to a user, and I've attached it to "VE" even if it is blameless in some sense, because that seems to be the "proximate cause" here; I have no objection if someone reassigns the task type and/or the project appropriately.

Possible mitigating factors, workarounds, or solutions:

If an article contains no VE numeric ref names, then regardless of which editing tool a user employs, there is very low risk of collision at merge-back time, provided that user is the only one editing the article. However, even if the article contains no numeric names, there is a risk if two editors are editing it with VE and each is using their own sandbox for updates, even in strictly separate sections.

These conditions are probably too complex to expect student editors to be able to handle. So, to be safe under current circumstances, only one student editor should edit an article using a sandbox at one time, until the sandbox is merged back. (Switching to wikitext editor is a workaround; then two student editors may edit, with low risk of collision.)

The collision possibility arises from editing only a partial copy of an article in a sandbox, where VE doesn't know the full set of ref names already in use in the article. A workaround is to not make partial copies; that is, instruct users who wish to edit in a sandbox using VE to copy the entire article into the sandbox, even if they only want to work on one portion of it. This is safe with respect to VE and will avoid collisions, but may be error-prone for student editors in other ways, and thus is less than optimal, although possibly better than nothing for some student editors. (Feedback on this point from Wiki Ed experts would be helpful.)

Workaround: don't do anything; the problem doesn't appear to occur all that often, and when it does, get somebody else to fix it. One downside: who ya gonna call? My one experience with one of the easiest manifestations of the problem was far too tedious to ever want to repeat.

Mitigate it with a tool: proposed function: VE-section-copy:

Create a new copy tool (or function within VE); let's call it "VE-Section-Copy". This is designed to copy a section(s) of an article to a sandbox (or other destination page), and contain everything VE needs to know, in order that future citations added by an editor using VE in the copied sandbox will never collide with existing ones, even in uncopied parts of the article.

(With my designer hat on, I can't help envisioning a specific implementation for this: copy the desired section wikitext to the destination page, then append a metadata section within hidden text delimiters consisting of a string with every named ref in the article (numeric or not) as a self-closed named ref, e.g.: 'Lorem.<ref name="Foo"/><ref name=":3"/><ref name=":17"/>' etc. VE would require a modification to be able to recognize and read the metadata section with the ref names *as if they were not hidden* and so avoid assigning them, as the editor starts modifying the sandbox section with new citations. On merge-back, ideally the student would know enough to just copy the updated section and skip the metadata, but if they forget and copy the hidden section, too, no harm done: the worst that would happen is that there would be a hidden text section in the article that doesn't belong there (as long as the delimiters were not corrupted); the references tool wouldn't generate any spurious citations. A bot could harvest the hidden metadata later, if no one else does. Okay, sorry for the digression; couldn't help myself.)
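As a rough illustration of that metadata idea, the sketch below builds the hidden block from the full article's wikitext and reads the reserved names back out of a sandbox. The delimiter strings `VE-REFNAMES-START` and `VE-REFNAMES-END` and both function names are invented for this example; nothing like them exists in VE today, and only quoted ref names are handled.

```python
import re

# Captures the name from <ref name="X"> and <ref name="X"/> alike.
REF_NAME = re.compile(r'<ref\s+[^>]*?name\s*=\s*"([^"]+)"', re.IGNORECASE)

# Hypothetical delimiters: an HTML comment keeps the block hidden on the page.
META_START = "<!-- VE-REFNAMES-START"
META_END = "VE-REFNAMES-END -->"


def build_metadata(article_wikitext):
    """Return a hidden block listing every named ref as a self-closed stub."""
    names = sorted(set(REF_NAME.findall(article_wikitext)))
    stubs = "".join(f'<ref name="{name}"/>' for name in names)
    return f"{META_START}\n{stubs}\n{META_END}"


def reserved_names(sandbox_wikitext):
    """Return the ref names listed in the sandbox's metadata block, if any."""
    start = sandbox_wikitext.find(META_START)
    if start == -1:
        return set()
    end = sandbox_wikitext.find(META_END, start)
    if end == -1:
        return set()  # corrupted delimiters: treat as no metadata
    return set(REF_NAME.findall(sandbox_wikitext[start:end]))
```

A section copy would append `build_metadata(article)` to the sandbox, and a modified VE would consult `reserved_names(sandbox)` before naming a new citation.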

Mitigate it after the fact with an analysis/repair tool:

Create a tool able to examine article history, sandbox history, and find last good version before a H:CERDK error, identify the edit that caused the error, and either on its own or interactively with editor assistance, resolve the problem.

Lower the risk before the fact: use script RefRenamer to remove all VE numeric ref names from the article, before copying sections to a sandbox. Unless editors are working simultaneously in two sandboxes, this lowers the risk of any collision upon merge-back to near zero.

Conclusion:

The upshot of all this is that, at the very least, changes to doc or Help pages may be required in various locations: for starters, to describe the current situation and advise users about possible risks and workarounds, and later to cover any new tools that might be created. In particular, Wiki Ed procedures and training materials for students and professors may require changes to document best practices regarding the use of sandboxes.

Event Timeline

Mathglot updated the task description.

Another possible solution occurs to me: a VE-sandbox-main-link function, which would be usable from sandboxes and would provide a VE user with an input field where they could enter the name of the main article for this sandbox. (Maybe the function would search for 'User:Example/ArticleName' and pop it up as a 'suggestion' if it existed; ditto for 'User:Example/sandbox', if it contained an H2 heading matching 'ArticleName' near-ish the top of page.)

This function would only be available away from the live article (e.g., in sandboxes, user subpages, and so on). Upon linking the sandbox to an article, VE could access the main article, and ensure that future additions of citations to the sandbox did not reuse any of the ref names already in use at the linked main article.
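The name-selection part of such a linked-sandbox function could look like the sketch below. It assumes, purely for illustration, that new names take the form ":N" and that the lowest unused N is chosen; VE's actual assignment rule is not known to this report, and `next_numeric_name` is a hypothetical helper.

```python
def next_numeric_name(sandbox_names, linked_article_names):
    """Pick the lowest ":N" name unused in both the sandbox and the linked article.

    Assumption: names look like ":0", ":1", ... and the lowest free number wins.
    """
    used = set(sandbox_names) | set(linked_article_names)
    n = 0
    while f":{n}" in used:
        n += 1
    return f":{n}"
```

With the sandbox holding only ":0" and the linked article holding ":1", ":2" and ":13", this skips the article's names and picks ":3", whereas a sandbox-only check would have reused ":1".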

Can you clarify if you've found that this happens when only using the visual editor to copy content between pages, or when using the source editor as well?

I'm asking because when using the visual editor, the copy/paste function should already work like your desired "VE-section-copy" function: when copy/pasting a reference to another page, its content is included, and its name is adjusted to avoid conflicts with existing references. (Occasionally this works too well, and the reference is duplicated. I think we have a task about this somewhere.)

When using the source editor, I'm afraid that there isn't really anything we can do about this (other than generating better names for the references).

The quick answer is, I don't know; depends what the student actually did, and it's not clear to me (yet).

Can you clarify if you've found that this happens when only using the visual editor to copy content between pages, or when using the source editor as well?

For starters, the best report for what happened, is from the student editor's comment (diff) on their UTP in response to my earlier question about what they did:

I transferred in Source mode from my sandbox initially, but received the error after making adjustments in Visual Editor.

and that's a bit cryptic to me.

Getting back to your question: I'm not sure I understand the part about "using VE to copy content" (I rarely use VE); if there is a copy function within VE already, I was unaware of it. Maybe you meant selecting and copying text from the live page as rendered by VE while editing it? That seems plausible, but I really don't know; I'd refer you back to the user comment. I did follow that up with another question attempting to elicit more information, but they haven't edited WP since their previous reply on 1 April.

Again, I don't completely follow the bit about copy/paste function, and this may be getting too much into the weeds of how VE interacts with the clipboard of a particular OS it's running on, but I assume in the source (live) page, ^C (or a menu item, if there is one) copies selected material from the live page to the clip/paste-board, and ^V pastes it from there to a second location, either VE if it's open, or the wikitext editor if that's open. Is that what you are asking? Either way, as far as what the student actually did, we're back to "I don't know", see above. Let's try it this way, as a Gedankenexperiment:

Suppose you have an article with ten sections and 100 VE-style numeric named refs, all sequential, so each section has exactly ten, in sequence. User:Student1 decides to expand section eight (refs :70 to :79) and copies it to his sandbox. (A wikitext editor user would see ref names :70 to :79 in the sandbox, right?) The student edits the sandbox using VE, adding 20 more citations. (Not sure what numeric ref names VE assigns for those; maybe :0 to :19, maybe :80 to :99; doesn't matter.) Then the student copies the sandbox back to live, overlaying the original section eight with his version. What happens with the 20 new citations, if their names happen to collide with numbers already used in other citations?

Do I understand you correctly that if the 20 new citations happened to be numbered :0 to :19 in the sandbox, say (thus potentially colliding with sections 1 and 2 of live), they would be renumbered on the copy to live into some unused range in the live article, perhaps :100 to :119?

In the actual case with the student editor, they added a citation that was named ":13" in the sandbox, for whatever reason, (I still hope to hear back from the student, if they can shed any light on that), but ":13" was already in use in the live article for something different, unbeknownst to the sandbox editor, and when the student copied back, a collision occurred. If a copy-back is supposed to renumber colliding ref names, why didn't it do so in that case, instead of saving the article with two unrelated sources both named as ":13"?
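To make the renumbering question concrete, here is a sketch of what "rename colliding names on copy-back" could mean. It is not a description of how VE's paste handling works; the function name and the lowest-unused-number rule are assumptions, and only quoted `name` attributes placed directly after `<ref` are rewritten.

```python
import re

REF_NAME_ATTR = re.compile(r'(<ref\s+name\s*=\s*")([^"]+)(")', re.IGNORECASE)


def rename_colliding_refs(section_wikitext, colliding, names_in_use):
    """Rename each colliding ref in the section to an unused ":N" name.

    names_in_use should hold every name in the live article and in the section.
    Returns the rewritten section and the old-name to new-name mapping.
    """
    used = set(names_in_use)
    mapping = {}
    n = 0
    for old in sorted(colliding):
        while f":{n}" in used:
            n += 1
        mapping[old] = f":{n}"
        used.add(f":{n}")

    def swap(match):
        name = match.group(2)
        return match.group(1) + mapping.get(name, name) + match.group(3)

    return REF_NAME_ATTR.sub(swap, section_wikitext), mapping
```

In the thought experiment above, with :0 to :99 already in use, sandbox citations named :0 and :1 would be mapped to :100 and :101 before the section is saved, and both the defining tag and any self-closed reuses would be rewritten consistently.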

When using the source editor, I'm afraid that there isn't really anything we can do about this (other than generating better names for the references).

Not to pick a nit, but when we say "we" cannot do anything about this, I'm not sure what the scope of "we" is; if it's the VE project, I get it. (I referred to this issue of "what is a bug" and where "the locus of the problem lies" in the "Sidebar" of the OP.) I have my see-a-problem-and-raise-it hat on, not my assigning-the-right-fixit-project hat (which I leave to you, or others). For starters, do you agree that this behavior is problematic for users, and should be fixed, if possible?

Also, even if the source editor is involved in the copyback, I don't agree that we can't do anything about it, if "we" includes WMF/Wikipedia as a whole, and is not limited to the VE project (which I assigned, as proximate cause, but I recognize that that may be the wrong assignment).

Switching hats to problem-solver: I hinted at one possible solution in my previous reply. I'm now thinking of another (more like a workaround), and I may try it out and mock something up, to see what it looks like, and perhaps stimulate discussion, or better approaches to a solution.

I'm probably breaking all sorts of rules or conventions about how Phabricator is supposed to be used, who's responsible for what, and what gets assigned where, for all of which I apologize. I hope that can be excused partly due to my unfamiliarity with the WMF development cycle and reporting mechanisms, and partly due to my fixation on spotting and trying to improve a real situation in the wild, regardless of cause, which is currently impacting our users, and may continue to do so.