Attribution API returns duplicate 'Unknown authorUnknown author'
Closed, Resolved | Public | 2 Estimated Story Points | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What should have happened instead?:
The image does have "Unknown author" set as the author value. I would expect that to be returned as is, not duplicated. It's unclear where the duplication happens, but it has been observed on multiple images, seemingly only on those where "Unknown author" is the text set in the field on Commons.

Event Timeline

HCoplin-WMF set the point value for this task to 2.

Note from estimation:

  • Please document why this is happening, and whether it might be happening elsewhere (in less obvious ways). This is coming from the ImageInfo parser, and/or might be a broken template.

After investigating why the Attribution API returns "Unknown authorUnknown author" instead of "Unknown author" for some file pages, this is what's happening:

The author name comes from Commons file description pages. When Commons uses a template like {{Unknown|author}}, that template sometimes generates HTML that contains the visible text plus a hidden copy of the same text.
Our API strips the HTML tags to extract just the text, but in doing so it naively concatenates all text nodes, including the hidden one. The result is the author name appearing twice.
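
The duplication is easy to reproduce in isolation. A minimal sketch, assuming the extraction behaves like PHP's built-in strip_tags() (the actual Attribution API code path may differ) and using a value shaped like the template output described above:

```php
<?php
// Sample value: visible text followed by a hidden copy of the same text.
$artist = 'Unknown author<span style="display: none;">Unknown author</span>';

// strip_tags() removes the markup but keeps the text of every element,
// including the one a browser would never display.
$plain = strip_tags( $artist );

echo $plain, "\n"; // Unknown authorUnknown author
```

Tag stripping knows nothing about CSS, so the hidden copy survives as ordinary text.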

See these two examples:

Two separate issues found

  • Issue 1 (the original bug): Some files return "Unknown authorUnknown author" because the template output embeds a hidden copy of the text inside a display:none element. This is fixable on our side in the Attribution API code.
"Artist": {
    "value": "Unknown author\u003Cspan style=\"display: none;\"\u003EUnknown author\u003C/span\u003E",
    "source": "commons-desc-page"
}
  • Issue 2 (a different file, different root cause): This looks similar at first: at least one file returns a garbled string like style="background: ..." class="table-Unknown" | Author as the author value. This is broken HTML: raw wikitext table markup that was never properly parsed. It appears to be a broken template, not something we can fix on our end.
"Artist": {
    "value": "\u003Cp\u003Estyle=\"background: var(--background-color-interactive, #EEE); color: var(--color-base, black); vertical-align: middle; text-align: center; \" class=\"table-Unknown\" | Author\n\u003C/p\u003E",
    "source": "commons-desc-page"
}

That said, the two questions I raise are:

  • Should I fix Issue 1 in the PHP code (strip hidden elements before extracting text), or should I report it to Commons (or whoever maintains the template) so the template gets fixed?
  • Should Issue 2 be tracked separately, and is it worth raising with the Commons template maintainers?

cc. @pmiazga @Mooeypoo

I was wondering why some templates render the second <span> with display:none, so I started digging:

If you check the Unknown Template - https://commons.wikimedia.org/w/index.php?title=Template:Unknown&action=edit - it includes at the end <span style="display: none;">Unknown {{lc:{{{1|}}}}}</span>

This invisible marking was added by Jarekt: add invisible marking https://commons.wikimedia.org/w/index.php?title=Template%3AUnknown&diff=526331398&oldid=487946645
A quick check led me to https://commons.wikimedia.org/wiki/Commons:Watermarks, which states:

Invisible watermarks are acceptable and should not be intentionally removed.

This means we can have more cases like this one. My understanding is that this falls into the category of "let's make author information machine-readable". For example, the licensing information is stored in <span class="licensetpl_shortname" style="display:none">...</span>, which hides the license information in the browser but makes it easy for scripts to find and retrieve.
The text that goes into <span style="display:none"> is most likely the simple, plain-text form, because the {{Creator}} template can output really complex structures (like the one here: https://commons.wikimedia.org/wiki/File:-Emma_Charlotte_Dillwyn_Llewelyn%27s_Album-_MET_DP143470.jpg ).
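
To illustrate the machine-readable use case: a script can pull the plain-text value straight out of such a hidden span. A rough sketch, where extractMarkedSpan() is a hypothetical helper and the sample HTML and license string are made up for illustration (real template output may carry more attributes or nested markup):

```php
<?php
/**
 * Hypothetical helper: return the text of the first <span> carrying the given
 * class, or null when there is none. Assumes the class attribute is
 * double-quoted and the span contains plain text only.
 */
function extractMarkedSpan( string $html, string $class ): ?string {
    $pattern = '/<span\b[^>]*\bclass\s*=\s*"[^"]*\b' . preg_quote( $class, '/' )
        . '\b[^"]*"[^>]*>([^<]*)<\/span>/i';
    if ( preg_match( $pattern, $html, $m ) === 1 ) {
        return html_entity_decode( trim( $m[1] ) );
    }
    return null;
}

// Illustrative input only; the license name is an assumed example value.
$html = 'Licensed under <span class="licensetpl_shortname" style="display:none">CC BY-SA 4.0</span> terms';

echo extractMarkedSpan( $html, 'licensetpl_shortname' ), "\n"; // CC BY-SA 4.0
```

So the hidden spans are deliberate, and any fix on our side has to tolerate them rather than expect Commons to remove them.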

> That said, the two questions I raise are:
>
>   • Should I fix Issue 1 in the PHP code (strip hidden elements before extracting text), or should I report it to Commons (or whoever maintains the template) so the template gets fixed?

By "strip hidden elements", are we talking about loading a DOM document? If that's the case, I am a little nervous about performance.

But I wonder: is the Unknown author template the only case we have? If so, we can correct for it specifically.

>   • Should Issue 2 be tracked separately, and is it worth raising with the Commons template maintainers?
>
> cc. @pmiazga @Mooeypoo

I would not wait for the community, but we might want to check in and ask why this is rendered both visibly and invisibly. What use case is it fulfilling?

> By "strip hidden elements", are we talking about loading a DOM document? If that's the case, I am a little nervous about performance.

As we said during the standup: no, I'm not thinking about DOM manipulation, just string stripping :)

I wouldn't worry about Issue 2. It's a case of garbage in, garbage out: the error is in the templates and how they are used in this specific scenario. Fixing it requires fixing the file description page, so we don't have to tackle it programmatically.

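function stripDisplayNoneElements( string $html ): string {
    // Match any element whose inline style contains display:none, up to the
    // closing tag of the same name (\1). Nested same-name elements are not handled.
    $result = preg_replace(
        '/<([a-z][a-z0-9]*)\b[^>]*\bstyle\s*=\s*["\'][^"\']*\bdisplay\s*:\s*none\b[^"\']*["\'][^>]*>.*?<\/\1\s*>/is',
        '',
        $html
    );
    // preg_replace() returns null on failure; fall back to the input so the
    // declared string return type always holds.
    return $result ?? $html;
}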

What do you think about doing something like this snippet? With this regex we can catch every <span>, <div>, <p>, etc. that has display:none.

Change #1267203 had a related patch set uploaded (by Aghirelli; author: Aghirelli):

[mediawiki/extensions/WikimediaCustomizations@master] Attribution: strip display:none elements from extmetadata values

https://gerrit.wikimedia.org/r/1267203

As I wrote in a comment on the patch: I wanted to try a different approach than the regex, so I used a more programmatic, strpos-based way to recognize these edge cases and benchmarked all three options over 1 million iterations. The results below show that strpos is far faster than DOM parsing. It runs at about 0.6x the speed of the regex, but the difference is a fraction of a microsecond per call, and the strpos approach is much more readable and maintainable.

------------------------------------------------------------
REGEX   — Total: 0.1325s  |  Avg: 0.221 μs/call
DOM     — Total: 5.0975s  |  Avg: 8.496 μs/call
STRPOS  — Total: 0.2365s  |  Avg: 0.394 μs/call
============================================================
VERDICT: Regex is 38.5x faster than DOM
VERDICT: strpos is 0.6x faster than Regex
============================================================
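
For readers who don't want to open the patch, here is a rough idea of what string-only stripping can look like. This is a hypothetical sketch, not the code from the merged change: stripHiddenElements() is an invented name, and it assumes well-formed markup with no same-name element nested inside the hidden one.

```php
<?php
/**
 * Hypothetical sketch: drop elements whose opening tag carries an inline
 * style with display:none, using string functions only (no regex, no DOM).
 */
function stripHiddenElements( string $html ): string {
    // Fast path: most values never mention "display" at all.
    if ( stripos( $html, 'display' ) === false ) {
        return $html;
    }
    $offset = 0;
    while ( ( $start = strpos( $html, '<', $offset ) ) !== false ) {
        $end = strpos( $html, '>', $start );
        if ( $end === false ) {
            break; // Unterminated tag: leave the rest untouched.
        }
        $tag = strtolower( substr( $html, $start, $end - $start + 1 ) );
        // Remove whitespace so "display: none" and "display:none" compare equal.
        $compact = str_replace( [ ' ', "\t", "\n", "\r" ], '', $tag );
        // Tag name = leading run of letters/digits right after "<" (0 for closing tags).
        $nameLength = strspn( $tag, 'abcdefghijklmnopqrstuvwxyz0123456789', 1 );
        if ( $nameLength === 0
            || strpos( $compact, 'style=' ) === false
            || strpos( $compact, 'display:none' ) === false
        ) {
            $offset = $end + 1;
            continue;
        }
        $closing = '</' . substr( $tag, 1, $nameLength ) . '>';
        $close = stripos( $html, $closing, $end + 1 );
        if ( $close === false ) {
            $offset = $end + 1; // No closing tag found: keep the markup as is.
            continue;
        }
        // Cut the whole element out and rescan from the same position.
        $html = substr( $html, 0, $start ) . substr( $html, $close + strlen( $closing ) );
        $offset = $start;
    }
    return $html;
}

$artist = 'Unknown author<span style="display: none;">Unknown author</span>';
echo strip_tags( stripHiddenElements( $artist ) ), "\n"; // Unknown author
```

The early stripos() check is what keeps the common case cheap: values without any "display" substring are returned without scanning tags at all.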

cc. @pmiazga @Mooeypoo

Change #1267203 merged by jenkins-bot:

[mediawiki/extensions/WikimediaCustomizations@master] Attribution: strip display:none elements from extmetadata values

https://gerrit.wikimedia.org/r/1267203

Marking as resolved to close MWI-Sprint-30 (2026-03-24 to 2026-04-07)