Identify paste source from copied HTML
Closed, Resolved (Public)

Description

Certain apps may generate identifiable HTML when copying to the clipboard. If we can reliably identify certain apps, we may consider their content less likely to contain copyright infringements than HTML copied from an unidentified website. For example, content from Google Docs or Microsoft Word might be more likely to be a self-written article draft.

https://evercoder.github.io/clipboard-inspector/ can be used for inspecting clipboard contents.

Event Timeline

Google Docs

<... id="docs-internal-guid-[GUID]">

(as well as the extra data entry under application/x-vnd.google-docs-document-slice-clip+wrapped)

LibreOffice:

<meta name="generator" content="LibreOffice 7.3.7.2 (Linux)"/>

OpenOffice:

<META NAME="GENERATOR" CONTENT="OpenOffice 4.1.15  (Win32)">

Word 365 (web):
Nothing explicit, but the HTML contains some unusual attributes, such as data-contrast, and the CSS classes TextRun and NormalTextRun

Word 365 (Windows):
P69459

<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">
...
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 15">
<meta name=Originator content="Microsoft Word 15">

and various other pieces of metadata


It seems we should be able to identify the most popular document editors, but we would need to keep checking over time whether the HTML they generate changes.
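
As a rough illustration of how the markers listed above could be turned into checks, here is a minimal sketch. It is not the code from the VisualEditor change; the names `exampleDetectors` and `detectPasteSource` are hypothetical, and the patterns are derived only from the samples in this task, so they would need re-checking whenever an editor changes its output. The Word 365 (web) check in particular is a weak heuristic, since that editor emits no explicit marker.

```javascript
// Hypothetical detector table, based only on the HTML samples in this task.
// Each entry has an illustrative name and a predicate over the pasted HTML string.
const exampleDetectors = [
  {
    name: 'googleDocs',
    // Sample: <... id="docs-internal-guid-[GUID]">
    matches: (html) => /\sid=["']docs-internal-guid-/.test(html)
  },
  {
    name: 'libreOffice',
    // Sample: <meta name="generator" content="LibreOffice 7.3.7.2 (Linux)"/>
    matches: (html) =>
      /<meta\s+name=["']?generator["']?\s+content=["']?LibreOffice/i.test(html)
  },
  {
    name: 'openOffice',
    // Sample: <META NAME="GENERATOR" CONTENT="OpenOffice 4.1.15  (Win32)">
    matches: (html) =>
      /<meta\s+name=["']?generator["']?\s+content=["']?OpenOffice/i.test(html)
  },
  {
    name: 'microsoftWord',
    // Samples: <meta name=ProgId content=Word.Document>
    //          <meta name=Generator content="Microsoft Word 15">
    matches: (html) =>
      /<meta\s+name=["']?(?:ProgId|Generator|Originator)["']?\s+content=["']?(?:Word\.Document|Microsoft Word)/i.test(html)
  },
  {
    name: 'wordOnline',
    // No explicit marker; assume both a data-contrast attribute and a
    // NormalTextRun class must be present (weak heuristic).
    matches: (html) =>
      /\sdata-contrast=/.test(html) &&
      /class=["'][^"']*\bNormalTextRun\b/.test(html)
  }
];

// Returns the name of the first matching detector, or null if none match.
function detectPasteSource(html) {
  const hit = exampleDetectors.find((detector) => detector.matches(html));
  return hit ? hit.name : null;
}
```

For example, `detectPasteSource('<meta name=ProgId content=Word.Document>')` would return `'microsoftWord'` under these assumptions, while HTML copied from an arbitrary website would return `null`.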

Nice! @Esanders, earlier this week, we talked about the prospect of using this information (were it to be available) to decide when/if to suppress the Paste Check (T359107).

Seeing this list makes me wonder: would it be technically possible for us to use this information to tailor the feedback Edit Check surfaces?

It would be technically possible, yes. We could annotate the paste with the source (if detected).
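
To make "annotate the paste with the source" concrete, here is a minimal sketch assuming a clipboard object shaped like the browser's `DataTransfer` (a `types` list and a `getData( type )` method). The function name `annotatePaste`, the `source` field and the optional `detectFromHtml` callback are hypothetical and are not taken from the VisualEditor patch; only the Google Docs MIME type comes from the notes above.

```javascript
// MIME type of the extra clipboard entry observed when copying from Google Docs.
const GOOGLE_DOCS_SLICE_TYPE =
  'application/x-vnd.google-docs-document-slice-clip+wrapped';

// Hypothetical helper: returns the pasted HTML together with a `source`
// annotation, or source = null when nothing is recognised.
// `detectFromHtml` is an optional callback (html) => string|null for
// HTML-based checks; it is an assumption of this sketch.
function annotatePaste(clipboard, detectFromHtml) {
  const html = clipboard.getData('text/html') || '';
  const types = Array.from(clipboard.types || []);

  let source = null;
  if (types.includes(GOOGLE_DOCS_SLICE_TYPE)) {
    // The extra data entry is a signal independent of the HTML itself.
    source = 'googleDocs';
  } else if (typeof detectFromHtml === 'function') {
    source = detectFromHtml(html);
  }
  return { html, source };
}
```

A consumer such as a paste check could then read `source` to decide whether to suppress or reword its feedback, without needing to re-inspect the clipboard.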

Change #1077930 had a related patch set uploaded (by Esanders; author: Esanders):

[VisualEditor/VisualEditor@master] Implement pasteSourceDetectors

https://gerrit.wikimedia.org/r/1077930

Excellent; we'll decide whether we want to tailor the Check message based on paste source as part of T359107. See "Open questions" #4.

Change #1077930 merged by jenkins-bot:

[VisualEditor/VisualEditor@master] Implement pasteSourceDetectors

https://gerrit.wikimedia.org/r/1077930

Change #1081991 had a related patch set uploaded (by Esanders; author: Esanders):

[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (bae9101b7)

https://gerrit.wikimedia.org/r/1081991

Change #1081991 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (bae9101b7)

https://gerrit.wikimedia.org/r/1081991

ppelberg claimed this task.