
SPIKE: Research and document how VisualEditor retrieves Attribution Information.
Closed, Resolved · Public · 3 Estimated Story Points · Spike

Description

VisualEditor displays attribution details for media files, including licensing, author, and uploader information.

When you're in VisualEditor, select Insert from the toolbar, then select Images and media. This opens a modal showing your recent uploads, and also lets you search across the local wiki and Commons. Search for any image, say "Mona Lisa", and all available images are listed. Select the first one and a media information panel appears, showing the license, author, and uploader.

Goals:
• Get VisualEditor working locally
• Identify the mechanism used to fetch attribution (license, author, uploader) information.
• Determine whether retrieving the data requires parsing the full article content.
• Evaluate whether attribution logic should be centralized in WikimediaCustomizations, or whether it can be implemented directly within the Attribution module (in other words, how much logic would need to be duplicated between MultimediaViewer and Customizations).

VisualEditor documentation: https://www.mediawiki.org/wiki/Extension:VisualEditor

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Tue, Feb 17, 5:12 PM
BPirkle set the point value for this task to 3. · Thu, Feb 19, 4:15 PM

Hi everyone, after playing around a bit, these are the findings I came up with. I'll go through each of the expected goals one by one.

  • Get VisualEditor working locally
    • Done — VE is running locally and the media insertion dialog correctly displays attribution information (license, author, uploader) when selecting images from Commons.
  • Identify the mechanism used to fetch attribution information
    • VE fetches attribution data through the standard MediaWiki imageinfo API with the extmetadata property. The interesting part happens server-side: FormatMetadata builds the extmetadata object by reading EXIF/XMP from the file, and then fires a hook that CommonsMetadata hooks into. CommonsMetadata fetches the rendered HTML of the file description page and scrapes it for structured CSS classes that Commons templates emit — this is how fields like Artist and LicenseShortName are extracted. The result is cached, so the rendering cost only occurs when the file description page changes.
  • Does retrieving attribution data require parsing full article content?
    • Not article content, no. But it does involve rendering the file description page to extract template data. This is handled transparently by the existing infrastructure (FormatMetadata + CommonsMetadata) and is not something the Attribution module would need to replicate.
  • Should attribution logic be centralized in WikimediaCustomizations?
    • Very little duplication would be needed. VE and MultimediaViewer both call the imageinfo API directly from the browser and will continue to do so — no changes needed there. The Attribution module in WikimediaCustomizations is a separate REST endpoint for external consumers, and the fix is localized to that handler. Right now AttributionRestHandler has two issues for file pages: it reads the license from the site-wide wiki config instead of the file's actual license, and it returns the uploader's username instead of the artist. Both fields are already available in extmetadata — the handler just needs to consume them correctly, using the same FormatMetadata layer that everything else already relies on. Some extmetadata fields — including Artist and LicenseShortName — contain raw HTML rather than plain text (e.g. Artist may come back as <a href="...">Leonardo da Vinci</a>). VE handles this by stripping all HTML tags before displaying the value. The Attribution module will need to make the same decision: either strip the HTML server-side before returning it, or return it as-is and let the client handle it.

Follow ups from the planning/demo:

  • Stripping HTML is expensive; how bad is it in this case? Is there an easier way to do it?
  • Can we cache it?
  • Within the handler, can it access the class directly? (It doesn't do HTTP requests, so can we access the raw data more directly?)
  • Alternative idea: potentially add the data to page props, which are cacheable since they rarely change; potentially backfill page props instead of parsing on demand? A job to update and store the data, then a lighter-weight API approach.

Meta points:

  • There was a hope that there would be more commonality in how the data is fetched. The good news is that everything lives in the CommonsMetadata extension, but actually fetching it poses challenges. We may need to look into the hook to get the logic into a service and make it callable, but that could also be messy.
  • The main takeaway is that VisualEditor is also messy; we likely can't replicate the data directly. There are too many templates, and parsing differs across solutions.
  • Image discoverability and metadata is another topic for MWP next year with the Commons KR. We might be able to do something simple and imperfect now, then leverage the new work for future improvements.
  • Let's close this spike and regroup on which follow-ups to pursue in other tickets; @pmiazga will add more notes consolidating the findings across research tasks to determine what can be taken from Commons directly.

Follow ups from the planning/demo:

  • Stripping HTML is expensive; how bad is it in this case? Is there an easier way to do it?

I don't know if there is an easier way to do that. The code already parses the DOM structure to search for elements like .vcard or .licensetpl, and then does some cleanup on those (https://gerrit.wikimedia.org/g/mediawiki/extensions/CommonsMetadata/+/48641d30497c561c7a2bf030e76590fcb6c0f9f8/src/TemplateParser.php#329). We would have to add a switch on top of it to clean the values up even further, returning just plain text. I would say stripping is just an extra step in the existing process and shouldn't cause problems.

{{Information
|description = Schlüsselfeld, Elsendorf, Hl. Nepomuk
|date = 2014-01-17
|source = {{own}}
|author = [[user:Tilman2007|Tilman2007]]
|permission = 
|other_versions = 
}}

or

{{Information
|Description	= <b>Via férrea</b>: Variante de Alcácer [PK 81]
<b>Local</b>: Foros de Albergaria (entre Alcácer do Sal e Grândola) * PS VI [km 23,137]
<b>Data e hora</b>: 12 de Novembro de 2010 [12h58]
<b>Sentido</b>: Pinheiro
|Source		= [[Flickr]]: [https://www.flickr.com/photos/76815277@N00/5387843346 PK 81, Variante de Alcácer, 2010.11.12]
|Date		= 2010-11-12 12:58:29
|Author		= [https://www.flickr.com/people/76815277@N00 Nuno Morão] 
|Permission	= {{User:Flickr upload bot/upload|date=15:26, 12 July 2011 (UTC)|reviewer=Tm}}
{{cc-by-sa-2.0}}
|other_versions	=
}}

or

{{ Artwork
  | Other fields 1 = {{ InFi | Creator | Skaggs, J. Lemoine }}
  | title = Lipscomb Lime Light and Follett Times (Follett, Tex.), Vol. 8, No. 42, Ed. 1 Thursday, September 2, 1920
  | description = eight pages: ill. ; page 18 x 11 in. Digitized from 35 mm. microfilm.; Weekly newspaper from Follett, Texas that includes local, state and national news along with advertising.
  | date = 1920-09-02
  | permission = {{PD-US}}
  | source = {{ DPLA
      | Q69492256
      | hub = Q83878505
      | url = https://texashistory.unt.edu/ark:/67531/metapth390546/
      | dpla_id = a32d2bc403de6414c84677033db3eb94
      | local_id = https://texashistory.unt.edu/ark:/67531/metapth390546/; sn86088923; 14201440; https://texashistory.unt.edu/ark:/67531/metapth390546/small/; ark:/67531/metapth390546; https://texashistory.unt.edu/ark:/67531/metapth390546/manifest/
  }}
  | Institution = {{ Institution | wikidata = Q69492256 }}
  | Other fields = {{ InFi | Standardized rights statement | {{ rights statement | http://rightsstatements.org/vocab/NoC-US/1.0/ }} }}
}}
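
Despite the template variety above, the rendered HTML exposes stable class hooks, which is what makes scraping workable. A rough Python sketch of that class-based extraction (the sample HTML and the collector are illustrative, not the actual TemplateParser logic, which lives in PHP):

```python
from html.parser import HTMLParser

# Hypothetical rendered snippet. Real Commons templates emit richer markup,
# but .licensetpl and .vcard are the kind of structured classes the
# CommonsMetadata TemplateParser searches for.
HTML = (
    '<div class="licensetpl"><span>CC BY-SA 4.0</span></div>'
    '<span class="vcard"><a href="/wiki/User:Example">Example User</a></span>'
)


class ClassTextCollector(HTMLParser):
    """Records the text content of every element carrying a watched CSS class."""

    def __init__(self, watched):
        super().__init__()
        self.watched = watched
        self.stack = []   # one entry per open element: its watched classes
        self.found = {}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        self.stack.append([c for c in classes if c in self.watched])

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text belongs to every watched ancestor that is still open.
        for hits in self.stack:
            for c in hits:
                self.found[c] = self.found.get(c, "") + data


collector = ClassTextCollector({"licensetpl", "vcard"})
collector.feed(HTML)
print(collector.found)  # {'licensetpl': 'CC BY-SA 4.0', 'vcard': 'Example User'}
```

The key point: the scraper keys off classes in the rendered output, so it never has to understand the wikitext of Information, Artwork, or any other template directly.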
  • Can we cache it?

CommonsMetadata is already caching it. I don't think caching is the problem here; it's more the fact that this endpoint would be called for many different titles. Retrieving metadata for the same file a second or third time in a row is fast. Also, the metadata is cached based on the last file modification, so if there were no changes and the entry wasn't evicted, it will not require re-parsing.

  • Within the handler, can it access the class directly? (It doesn't do HTTP requests, so can we access the raw data more directly?)

Not at the moment, as CommonsMetadata uses the GetExtendedMetadata hook to inject the metadata. Although it wouldn't be difficult to refactor the code (introduce a service class) to allow calling it directly.
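
The shape of that refactor, sketched in Python for brevity (the real code is PHP, and all class and method names here are hypothetical): move the metadata-building logic out of the hook handler into a service, so the hook becomes a thin adapter and the REST handler can call the service directly, without going through the hook or HTTP.

```python
class CommonsMetadataService:
    """Owns the build logic currently inlined behind the hook."""

    def get_extended_metadata(self, file_name):
        # Placeholder values; the real step renders the description
        # page and scrapes the template classes.
        return {"Artist": "(artist)", "LicenseShortName": "(license)"}


class GetExtendedMetadataHookHandler:
    """The hook handler now just delegates to the service."""

    def __init__(self, service):
        self.service = service

    def on_get_extended_metadata(self, file_name, metadata):
        metadata.update(self.service.get_extended_metadata(file_name))


class AttributionRestHandler:
    """The REST handler consumes the same service, bypassing the hook."""

    def __init__(self, service):
        self.service = service

    def run(self, file_name):
        meta = self.service.get_extended_metadata(file_name)
        return {"author": meta["Artist"], "license": meta["LicenseShortName"]}


service = CommonsMetadataService()
result = AttributionRestHandler(service).run("Mona_Lisa.jpg")
print(sorted(result))  # ['author', 'license']
```

Both consumers share one implementation, which is exactly the duplication question raised in the spike goals.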

  • Alternative idea: potentially add the data to page props, which are cacheable since they rarely change; potentially backfill page props instead of parsing on demand? A job to update and store the data, then a lighter-weight API approach.

I don't think this changes anything. This information is already cached in MainWANObjectCache. Storing it in PageProps may provide long-term storage, but I don't think it's something we should look into right now. Only if we start seeing lots of WANObjectCache evictions would I consider moving it to more persistent storage.