
Revisit HTML included in pastes from popular LLMs
Closed, Resolved | Public | 8 Estimated Story Points

Description

In T379908, we learned that popular LLMs (e.g. Gemini, Claude, and ChatGPT) include unique HTML elements along with the text people paste from these services.

Through this investigation, we also concluded (T379908#10419299) that the HTML these services generate is likely to evolve over time and is thus not reliable/stable enough to be used as a signal to configure Paste Check with.

This ticket involves the work of revisiting the HTML these popular LLMs include in text people paste from them to:

  • Learn if and how the HTML has changed since we first investigated this in November 2024
  • Decide whether we think the HTML is stable enough to be used as a signal:
    • Paste Check can be configured with/off of
    • Used to label (read: tag) edits with

Related

Event Timeline

One ChatGPT artifact that started to emerge: :contentReference[oaicite:1]{index=1} or similar (example). Confirmed by... ChatGPT itself.
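A minimal sketch of how such an artifact could be spotted in pasted text. The pattern is generalised from the single example above (a numeric oaicite value and a numeric index); other variants may exist, and the function names are hypothetical.

```python
import re

# Assumed shape, generalised from one observed example:
#   :contentReference[oaicite:N]{index=N}
OAICITE_RE = re.compile(r":contentReference\[oaicite:\d+\]\{index=\d+\}")


def find_oaicite_artifacts(text: str) -> list[str]:
    """Return every citation placeholder found in the text."""
    return OAICITE_RE.findall(text)


def strip_oaicite_artifacts(text: str) -> str:
    """Remove the placeholders, leaving the surrounding text untouched."""
    return OAICITE_RE.sub("", text)
```

Unlike the clipboard HTML, this marker survives in the pasted text itself, so it could also be searched for in wikitext that has already been saved.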

You need to use the "Share" button in ChatGPT to create a shareable link. The regular URL is private.

Edited the original post.

I realize the task might not have been understood correctly. The title is "Revisit HTML...", that is, the "code" stored in the clipboard prior to copy-pasting it into a wiki (like T376306#10197664 or T379908#10322254).

On the other hand, the resources deal rather with the impression after copy-pasting (e.g., wording).

Editors of the English Wikipedia have been collecting empirical signs of machine-generated content, including some markup quirks.

Great spot, @matej_suchanek. Could you please have a look at T409484 and share to what (if any) extent you think what that ticket describes could be helpful?

It's definitely a way forward and, to some extent, a mitigation. Yes, let's divide the effort between communities, each focused on the language they are most fluent in, and have them observe the patterns.

With "the resources deal rather with the impression after copy-pasting (e.g., wording)" I was referring to the "Related" resources in the task description, as opposed to the title, "Revisit HTML included in pastes from popular LLMs".

Most sites have a "copy" button as well as the ability to just select the text on the page, so we can document these cases separately:

| LLM | Copy button | Select and copy [1] |
|---|---|---|
| Gemini | Same as select and copy | Appears to have the distinctive attribute data-sourcepos, e.g. <p data-sourcepos="9:1-9:117">.... This may come from some generic library though, so not necessarily a solid indicator |
| ChatGPT | Plain text, unidentifiable | Contains some attributes starting data-message- such as data-message-author-role="assistant", data-message-id and data-message-model-slug="gpt-4o" |
| Claude | Plain text, unidentifiable | Very plain HTML, but specific class list might be fairly unique: <p class="whitespace-pre-wrap break-words"> |

[1] In all cases, if a paragraph is only partially selected it usually pastes without any HTML and is unidentifiable.
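For illustration, the "select and copy" signals in the table above could be checked against the HTML flavour of a paste roughly as follows. This is a sketch using Python's standard html.parser, not how Paste Check is implemented; the lookup tables and function name are hypothetical, and only the attribute and class names come from the table.

```python
from html.parser import HTMLParser

# Signals taken from the table above; the grouping is illustrative.
ATTRIBUTE_PREFIXES = {
    "data-sourcepos": "Gemini",
    "data-message-": "ChatGPT",
}
CLASS_SETS = {
    frozenset({"whitespace-pre-wrap", "break-words"}): "Claude",
}


class _SignalCollector(HTMLParser):
    """Collects the names of LLMs whose markers appear on any start tag."""

    def __init__(self):
        super().__init__()
        self.sources = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            for prefix, source in ATTRIBUTE_PREFIXES.items():
                if name.startswith(prefix):
                    self.sources.add(source)
            if name == "class" and value:
                classes = set(value.split())
                for wanted, source in CLASS_SETS.items():
                    if wanted <= classes:  # all wanted classes are present
                        self.sources.add(source)


def guess_llm_sources(clipboard_html: str) -> set[str]:
    """Return the set of LLM names whose markers were found (may be empty)."""
    parser = _SignalCollector()
    parser.feed(clipboard_html)
    parser.close()
    return parser.sources
```

An empty result does not mean the text was not pasted from an LLM: as the footnote says, partial selections and the copy buttons of ChatGPT and Claude yield no identifiable HTML.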

As of today:

  • ChatGPT sometimes includes data-start and data-end if you select across nodes (e.g. select multiple paragraphs, select across bold/italics/links, or triple click to select a whole paragraph).
  • Similarly, Gemini will sometimes include data-path-to-node when the selection spans nodes.
  • Claude seems to have font-claude-response-body as a CSS class. As above, it only gets copied when the selection spans nodes.

This means there are still indicators available, but they are very unstable and will need to be monitored and updated regularly (maybe once per quarter).
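One way to keep track of that monitoring, sketched under assumptions: record when each marker was last confirmed and list the ones that are overdue. The marker names come from the comment above; the Signal type, the 90-day threshold (standing in for "once per quarter"), and all dates are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Signal:
    source: str          # which LLM the marker was observed on
    marker: str          # attribute or class name seen in pasted HTML
    last_verified: date  # when someone last confirmed it still appears


def due_for_review(signals, today, max_age=timedelta(days=90)):
    """Return the signals whose last verification is older than max_age."""
    return [s for s in signals if today - s.last_verified > max_age]


# Markers are from the comment above; the dates are made up for illustration.
SIGNALS = [
    Signal("ChatGPT", "data-start", date(2025, 1, 10)),
    Signal("Gemini", "data-path-to-node", date(2025, 3, 20)),
    Signal("Claude", "font-claude-response-body", date(2025, 3, 20)),
]
```

Calling due_for_review(SIGNALS, today=date(2025, 5, 1)) would return only the data-start entry, since it is the only one last verified more than 90 days earlier.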

The addition I made to the task description in T382608#11695116 [i] emerged in response to me noticing the following pattern in how I'm using VisualEditor Suggestion Mode:

  1. I'll open an article where an Add citation suggestion is present. E.g. https://en.wikipedia.org/w/index.php?title=070_Shake&oldid=1342417919 .
  2. I'll copy the sentence/text the suggestion is asking me to consider adding a source to verify, paste it into ChatGPT, and prompt the machine to see if it can find sources that Wikipedia policies would be likely to consider reliable.
  3. I'll use my own understanding of WP:RS to evaluate the source candidates ChatGPT will have offered.
  4. Once I find a source that seems reliable, I'll verify it does, in fact, corroborate the currently-unsourced claim within Wikipedia.
  5. Paste the URL of the source I've deemed to be reliable and capable of verifying the on-wiki claim into the Wikipedia article I opened in "1."

i. "Used to label (read: tag) edits with..."

...

  1. Paste the URL of the source I've deemed to be reliable and capable of verifying the on-wiki claim into the Wikipedia article I opened in "1."

Yes, these are the URLs containing utm_source=chatgpt.com covered by the abusefilter on dewiki.
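As a rough illustration of that URL signal (not the dewiki abusefilter's actual rule, which is not reproduced here), a check for the tracking parameter could look like this. The function name is hypothetical; the parsing uses Python's standard urllib.parse.

```python
from urllib.parse import parse_qs, urlsplit


def has_chatgpt_utm(url: str) -> bool:
    """True if the URL's query string carries utm_source=chatgpt.com."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each parameter name to a list of its values.
    return "chatgpt.com" in query.get("utm_source", [])
```

Parsing the query string rather than searching for the raw substring avoids matching URLs where the text merely appears in the path or in another parameter's value.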

ldelench_wmf set the point value for this task to 8. (Mon, Mar 16, 6:04 PM)
ppelberg claimed this task.

Per today's team meeting, we're going to:

  1. Consider this task resolved based on the outcome of the investigation @Esanders shared in T382608#11658361.
  2. Use the newly-filed T420258 to implement the LLM Paste Check variation. Timing still TBD.