
Some citation links are redirected to GDPR portal, that is blocked by SpamBlacklist
Open, Needs Triage, Public

Description

How to reproduce:

  1. Go to a French Wikipedia sandbox page such as https://fr.wikipedia.org/w/index.php?title=Aide:Bac_%C3%A0_sable
  2. Add a new citation with any URL from 7sur7, such as https://www.7sur7.be/show/un-voyage-en-train-couchettes-que-vous-ne-regretterez-pas~a585919f/
  3. Citoid rewrites the URL as https://myprivacy.dpgmedia.be/consent?siteKey=atXMVFeyFP1Ki09i&callbackUrl=https%3A%2F%2Fwww.7sur7.be%2Fprivacy-gate%2Faccept-tcf2%3FredirectUri%3D%252Fshow%252Fun-voyage-en-train-couchettes-que-vous-ne-regretterez-pas~a585919f
  4. You can't save your edit because myprivacy.dpgmedia.be is on the blacklist. (Edit: the domain is no longer on the blacklist, because a bot now cleans up the URL afterwards.)

Context:
This myprivacy.dpgmedia.be domain is used by the site to manage GDPR consent. The site redirects all its readers to this URL first, then redirects them back to the article. Citoid gets stuck at the redirect.
For a wiki, these redirect URLs are garbage.
7sur7 is just one of several sites that use this GDPR portal.

French Wikipedia added this myprivacy.dpgmedia.be domain (and other similar domains used by news portals) to the SpamBlacklist today because, for a wiki, these redirect URLs are garbage.
Starting today, a wiki editor can no longer add a citation to such news portals in the normal way, because Citoid uses the redirect URL, which is blocked. The user has to edit the URL manually to get rid of the redirection.
Instead of editing the url, Citoid should keep the original url.

Event Timeline

Instead of editing the url, Citoid should keep the original url.

Unfortunately this is too simplistic because recently we had exactly the opposite case where the original url triggered the blacklist, as opposed to the resolved one: T359527

On principle, also, it makes sense to use the resolved URL because this is the actual url the metadata has come from. Sometimes this makes it easier for users to detect when their citation has gone awry - they can see that the url they used isn't the one the metadata is coming from so it allows them to see where the error has come in. In some cases, a GDPR proxy could mean we aren't getting correct metadata about the website the person actually wants.

And in this case, you can see the metadata that comes from that url is garbage - it does not include the original title, but just metadata about the gateway itself. So we really *wouldn't* want to include the correct url they entered along with that garbage metadata from the portal.

So there's no clear resolution to this issue. That's not to say it's unsolvable, but unfortunately I think the simplest solution, using the original url, is something we shouldn't/can't do, and it doesn't really fix the situation anyway.

I can envision us somehow navigating the GDPR /accept cookies message ourselves - but that's a bit more complicated and would require further investigation because we are not an actual browser, so we can't do everything a browser can do.

Thank you for your quick answer, which seems perfectly logical. The site does not comply with the standards.
If there's no existing "don't change url blacklist" to which the domain can be added, don't spend more time on this. I've written a small Python script that corrects these occurrences daily, so the impact on French Wikipedia is limited.
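
For anyone who needs a similar cleanup, here is a minimal sketch of how the original article URL might be recovered from a consent-portal URL. This is not the script mentioned above. The names `CONSENT_HOST`, `original_url` and `clean_wikitext` are made up for illustration, and the parameter names `callbackUrl` and `redirectUri` are taken from the example URL in step 3 of the description. It assumes other DPG Media sites use the same URL layout, which has not been checked.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urljoin, urlsplit

# Assumption: only this consent host is handled; other portals would need their own rules.
CONSENT_HOST = "myprivacy.dpgmedia.be"

# Matches a consent URL up to the next whitespace or wikitext delimiter.
CONSENT_URL_RE = re.compile(r"https://myprivacy\.dpgmedia\.be/consent\?[^\s|\]}<]+")


def original_url(consent_url: str) -> Optional[str]:
    """Return the article URL hidden inside a consent URL, or None if not recognised."""
    parts = urlsplit(consent_url)
    if parts.hostname != CONSENT_HOST:
        return None
    # parse_qs percent-decodes once, which yields the callback URL.
    callback = parse_qs(parts.query).get("callbackUrl")
    if not callback:
        return None
    # The callback's own query string holds the (once more encoded) article path.
    redirect = parse_qs(urlsplit(callback[0]).query).get("redirectUri")
    if not redirect:
        return None
    # Resolve the path against the callback URL to get scheme and host back.
    return urljoin(callback[0], redirect[0])


def clean_wikitext(text: str) -> str:
    """Replace every recognised consent URL in the text; leave anything else untouched."""
    def repl(match: "re.Match[str]") -> str:
        return original_url(match.group(0)) or match.group(0)
    return CONSENT_URL_RE.sub(repl, text)
```

A bot could run `clean_wikitext` over the wikitext of recently edited pages and save only when the text changed. Note that `redirectUri` is trusted as-is here; a production version would probably want to check that the recovered URL stays on the expected site.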