
NSFW image incorrectly included in MediaList Response
Closed, Resolved (Public)

Description

The Russian article for Blade shows an NSFW image when you tap the lead image in the app: https://ru.wikipedia.org/wiki/%D0%91%D0%BB%D1%8D%D0%B9%D0%B4

You can see this comes from the MediaList (https://ru.wikipedia.org/api/rest_v1/page/media-list/%D0%91%D0%BB%D1%8D%D0%B9%D0%B4).

The NSFW image is the redirect target for blade.jpg, which is likely related to the problem.

Event Timeline

MattCleinman renamed this task from NSFW image incorrectly included in MediaList to NSFW image incorrectly included in MediaList Response. Aug 6 2021, 8:55 PM
MattCleinman triaged this task as High priority.

I investigated that specific image using the MachineVision API, and it looks like the state is withheld for all of its labels.

The mobileapps response for that page shows only one image in use (which is not NSFW).
Where does the second lead image in the Android app get populated from? I can only reproduce the issue if I swipe the lead image in the Android app.

EDIT: I can reproduce this locally on the mobileapps media-list endpoint.

I think the problem is that, in order to populate the media-list, we:

  1. query parsoid and get the article page
  2. filter out elements to extract only media items
  3. resolve redirects of files

Up to step 2, things look OK while debugging. The issue shows up when we try to resolve redirects for the media list (as mentioned in the description).
Judging from the redirects API response, the article image actually redirects to the NSFW image.

I think the problem is that on ru.wikipedia.org the file points to the right location, but when we query commons.wikimedia.org for redirects, it returns the NSFW redirect, which is a valid redirect for the en.wikipedia.org domain.

Change 710939 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Resolve media redirects only for request domain and not commons

https://gerrit.wikimedia.org/r/710939

A bit more investigation:

I think we need to differentiate between how we handle media files uploaded to a specific wiki and images uploaded to Commons.
In our case, the article's image is not freely licensed, which means that it is uploaded only for use on ru.wikipedia.org.

By querying commons.wikimedia.org for redirects, we might end up redirecting to a file that has nothing to do with the initial request (e.g. it's just a coincidence that two files have the same name but are stored in completely different environments).
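
To illustrate the suspected name collision, here is a minimal sketch. The redirect tables, the file title `File:Some_unrelated_image.jpg`, and the function names are all hypothetical stand-ins; this is not the mobileapps code and not real API data.

```python
# Hypothetical redirect tables keyed by wiki domain. The titles are invented
# for illustration and do not reflect real API responses.
REDIRECTS = {
    # On Commons, a file with the same name happens to be a redirect.
    "commons.wikimedia.org": {"File:Blade.jpg": "File:Some_unrelated_image.jpg"},
    # On the local wiki, the file is a real upload and not a redirect.
    "ru.wikipedia.org": {},
}


def resolve_redirect(domain, title):
    """Return the redirect target of title on domain, or title if none."""
    return REDIRECTS.get(domain, {}).get(title, title)


def resolve_via_commons(title):
    # Always asks Commons, regardless of where the file actually lives.
    return resolve_redirect("commons.wikimedia.org", title)


def resolve_via_request_domain(request_domain, title):
    # Asks only the wiki that served the article.
    return resolve_redirect(request_domain, title)


if __name__ == "__main__":
    print(resolve_via_commons("File:Blade.jpg"))
    print(resolve_via_request_domain("ru.wikipedia.org", "File:Blade.jpg"))
```

In this toy setup, the Commons lookup swaps the locally uploaded file for an unrelated one only because the names match, while the request-domain lookup leaves the title unchanged.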

Change 710939 abandoned by Jgiannelos:

[mediawiki/services/mobileapps@master] Resolve media redirects only for request domain and not commons

Reason:

https://phabricator.wikimedia.org/T288376#7270345

https://gerrit.wikimedia.org/r/710939

Change 711014 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Fix redirects for media list

https://gerrit.wikimedia.org/r/711014

I think the problem is that, in order to populate the media-list, we:

  1. query parsoid and get the article page
  2. filter out elements to extract only media items
  3. resolve redirects of files

Why does mobileapps resolve redirects? Is this for the case where a file is renamed between the time Parsoid generates the HTML and the time mobileapps fetches it?

Here is the Parsoid output for a renamed image:

[subbu@earth:~/work/wmf/parsoid] echo '[[File:Langholz_-_Landskov_2.jpg]]' | php bin/parse.php
<p data-parsoid='{"dsr":[0,34,0,0]}'><span class="mw-default-size" typeof="mw:Image" data-parsoid='{"optList":[],"dsr":[0,34,null,null]}'><a href="./File:Langholz_-_Landskov_2.jpg" data-parsoid="{}"><img resource="./File:Langholz_-_Landskov_2.jpg" src="//upload.wikimedia.org/wikipedia/commons/5/58/Langholz_-_Langskov_2.jpg" data-file-width="800" data-file-height="594" data-file-type="bitmap" height="594" width="800" data-parsoid='{"a":{"resource":"./File:Langholz_-_Landskov_2.jpg","height":"594","width":"800"},"sa":{"resource":"File:Langholz_-_Landskov_2.jpg"}}'/></a></span></p>

The image is linked to the canonical URL and the canonical filename is accessible in the image's resource property already.

This is the relevant ticket for redirects on media-list: https://phabricator.wikimedia.org/T230040
We could use the resource HTML attribute to get the canonical file name, which is what the Android app uses in the end. In my last patch I did something similar by querying the imageinfo API (I didn't know that the information was available from Parsoid), and it works fine.
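
For reference, a rough sketch of reading the resource attribute straight from Parsoid HTML, using only the Python standard library. The sample markup is a shortened copy of the Parsoid output quoted above; the class and function names are made up for this example and are not part of mobileapps.

```python
from html.parser import HTMLParser
from urllib.parse import unquote

# Shortened from the Parsoid output quoted earlier in this thread.
SAMPLE_HTML = (
    '<span typeof="mw:Image"><a href="./File:Langholz_-_Landskov_2.jpg">'
    '<img resource="./File:Langholz_-_Landskov_2.jpg" '
    'src="//upload.wikimedia.org/wikipedia/commons/5/58/Langholz_-_Langskov_2.jpg" '
    'height="594" width="800"/></a></span>'
)


class ImageResourceCollector(HTMLParser):
    """Collect (resource title, src URL) pairs from <img> tags."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img .../> tags also arrive here, because the default
        # handle_startendtag() delegates to handle_starttag().
        if tag != "img":
            return
        attr = dict(attrs)
        resource = attr.get("resource")
        if resource is None:
            return
        # Drop the relative "./" prefix and decode any percent-encoding.
        if resource.startswith("./"):
            resource = resource[2:]
        self.images.append((unquote(resource), attr.get("src") or ""))


def extract_images(html):
    """Return a list of (resource title, src URL) pairs found in html."""
    collector = ImageResourceCollector()
    collector.feed(html)
    return collector.images


if __name__ == "__main__":
    for title, src in extract_images(SAMPLE_HTML):
        print(title, src)
```

This needs no extra API request, since both the resource title and the src URL are already present in the HTML that Parsoid returns.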

Change 711014 abandoned by Jgiannelos:

[mediawiki/services/mobileapps@master] Fix redirects for media list

Reason:

https://gerrit.wikimedia.org/r/711014

Change 711094 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Keep original resources as titles for media list response

https://gerrit.wikimedia.org/r/711094

Change 711095 had a related patch set uploaded (by MSantos; author: MSantos):

[mediawiki/services/mobileapps@master] media-list: remove canonical title regex

https://gerrit.wikimedia.org/r/711095

Change 711095 abandoned by MSantos:

[mediawiki/services/mobileapps@master] media-list: remove canonical title regex

Reason:

in favor of I57e9d10dd9733899660032554be94afbf6a18634

https://gerrit.wikimedia.org/r/711095

Change 711094 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Keep original resources as titles for media list response

https://gerrit.wikimedia.org/r/711094

The resource attribute in the Parsoid HTML corresponds to the file named in the wikitext. That is the correct key to use IMO, because that is what is going to be rendered. For example, suppose the wikitext author wrote [[File:A.jpg]], then A.jpg was deleted (so the file was served from Commons), then the file A.jpg on Commons was moved to B.jpg, and later A.jpg was uploaded again on the local wiki. The page would appear to have four different images on it at these points. Watching "just" A on Commons, or B on Commons, or A on the local wiki would miss the changes, and you'd risk showing the wrong image to the reader.

After the last deployment, media-list no longer returns the NSFW image. The RESTBase response is still cached, though.

Things look better on the media-list side after invalidating the RESTBase cache:

{
  "revision": "115723889",
  "tid": "e07f5ce0-f7fa-11eb-b3c3-af3b29fca021",
  "items": [
    {
      "title": "Файл:Blade.jpg",
      "leadImage": true,
      "section_id": 0,
      "type": "image",
      "showInGallery": true,
      "srcset": [
        {
          "src": "//upload.wikimedia.org/wikipedia/ru/thumb/9/94/Blade.jpg/320px-Blade.jpg",
          "scale": "1x"
        }
      ]
    },
    {
      "title": "Файл:Blade-poster.jpg",
      "leadImage": false,
      "section_id": 6,
      "type": "image",
      "caption": {
        "html": "<a rel=\"mw:WikiLink\" href=\"./Снайпс,_Уэсли\" title=\"Снайпс, Уэсли\" id=\"mwYA\">Уэсли Снайпс</a> в роли Блэйда",
        "text": "Уэсли Снайпс в роли Блэйда"
      },
      "showInGallery": true,
      "srcset": [
        {
          "src": "//upload.wikimedia.org/wikipedia/ru/thumb/8/88/Blade-poster.jpg/150px-Blade-poster.jpg",
          "scale": "1x"
        },
        {
          "src": "//upload.wikimedia.org/wikipedia/ru/thumb/8/88/Blade-poster.jpg/225px-Blade-poster.jpg",
          "scale": "1.5x"
        },
        {
          "src": "//upload.wikimedia.org/wikipedia/ru/thumb/8/88/Blade-poster.jpg/300px-Blade-poster.jpg",
          "scale": "2x"
        }
      ]
    }
  ]
}
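
A quick way to sanity-check a response like this is sketched below. The sample is a trimmed copy of the response above (one srcset entry per item); the function names and the prefix check are assumptions made for this example, not an existing mobileapps test.

```python
import json

# Trimmed copy of the media-list response above.
SAMPLE = json.loads("""
{
  "items": [
    {
      "title": "Файл:Blade.jpg",
      "leadImage": true,
      "srcset": [
        {"src": "//upload.wikimedia.org/wikipedia/ru/thumb/9/94/Blade.jpg/320px-Blade.jpg", "scale": "1x"}
      ]
    },
    {
      "title": "Файл:Blade-poster.jpg",
      "leadImage": false,
      "srcset": [
        {"src": "//upload.wikimedia.org/wikipedia/ru/thumb/8/88/Blade-poster.jpg/150px-Blade-poster.jpg", "scale": "1x"}
      ]
    }
  ]
}
""")


def lead_image_titles(media_list):
    """Return the titles of all items flagged as the lead image."""
    return [
        item["title"]
        for item in media_list.get("items", [])
        if item.get("leadImage")
    ]


def srcs_outside_prefix(media_list, prefix):
    """Return every srcset URL that does not start with prefix."""
    return [
        entry["src"]
        for item in media_list.get("items", [])
        for entry in item.get("srcset", [])
        if not entry["src"].startswith(prefix)
    ]


if __name__ == "__main__":
    print(lead_image_titles(SAMPLE))
    print(srcs_outside_prefix(SAMPLE, "//upload.wikimedia.org/wikipedia/ru/"))
```

For this article, an empty second list would mean that every image is served from the ru.wikipedia.org upload path, which is what we expect for locally uploaded, non-free files.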

Also, on a fresh Wikipedia app installation, the gallery doesn't show the NSFW picture for that article.