
Feature Request: request search engines to update caches of pages with rev deleted content
Open, Needs Triage, Public

Description

Currently, if a webpage has been cached by a search engine, its content remains accessible through the cached copy until the cache is updated, even if that content has since been revision deleted (revdel). I am requesting that when revdel is performed, Google and Bing are automatically contacted to refresh the relevant page caches. (Yahoo! no longer maintains its own cache, and other engines are lower priority.)

Google accepts up to 500 individual URL recrawl requests every 30 days through its "Fetch as Google" tool, as explained here: https://support.google.com/webmasters/answer/6065812?hl=en

Bing usually updates its cache within 24 hours of a request made through its webmaster tool, as explained here: https://www.bing.com/webmaster/help/bing-content-removal-tool-cb6c294d
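Because Google's limit is fairly small, any automated caller would need to track how many requests it has already made. Below is a minimal sketch of that bookkeeping in Python. The class name `RecrawlBudget` and its methods are hypothetical, and it assumes the 30 days are a rolling window, which the linked help page may or may not mean. It does not contact Google or Bing at all.

```python
from collections import deque
from datetime import datetime, timedelta


class RecrawlBudget:
    """Hypothetical helper: tracks requests against a rolling quota.

    Assumption: the quota is a rolling window (500 requests per 30 days by
    default). This only counts requests; it sends nothing to any search engine.
    """

    def __init__(self, limit=500, window=timedelta(days=30)):
        self.limit = limit
        self.window = window
        self._sent = deque()  # timestamps of requests still inside the window

    def try_spend(self, now):
        """Record one request at time `now` if quota remains; return True on success."""
        # Forget requests that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) >= self.limit:
            return False
        self._sent.append(now)
        return True
```

A caller would check `budget.try_spend(datetime.now())` before submitting each URL and skip or queue the request when it returns `False`.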

Event Timeline

Hi @M.A.Bruhn, thanks for taking the time to report this!

Which underlying problem would you like to see solved here?
What exactly is proposed to be changed in MediaWiki code here? Adding code in MediaWiki that would contact lots of 3rd party internet search engines via their APIs to update their 3rd party search indexes?

Hi @Aklapper !

The problem is that currently, if an admin uses revdel to remove information, such as outing information, I can just go to the Google cache of the page in question and see that information (at least until the cache is refreshed). It's something I do on forums and sites like reddit to see comments that have been edited or deleted, and I've been able to confirm that I can do the same on Wikipedia to see information that was stricken out. If code could be added to the revdel tool to call the two third-party tools that ask Google and Bing to refresh their caches (the other search engines don't cache often enough to really be a concern), it would help keep sensitive information away from prying eyes.

Specifically, if information was removed from the edit summary, then the URLs of the "History" page, which has all the diffs, and the "User contributions" page, which has all of that user's diffs, would be reported to Google and Bing. If the material was in the edit content itself, then the URL of the page that was edited could be reported.
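To make that rule concrete, here is a rough sketch in Python of how the list of URLs could be chosen. The function name `urls_to_refresh` and its parameters are hypothetical and not part of MediaWiki, and the `/w/index.php` and `/wiki/` paths assume the Wikimedia-style URL layout, which other wikis may configure differently. It only builds the list; reporting the URLs to Google or Bing is left out.

```python
from urllib.parse import quote, urlencode


def urls_to_refresh(base_url, page_title, user_name, summary_hidden, content_hidden):
    """Hypothetical helper: list the URLs whose cached copies may still show
    revision-deleted material.

    Assumption: the wiki uses the Wikimedia-style layout, with /w/index.php
    for actions and /wiki/ for page views.
    """
    title = page_title.replace(" ", "_")
    user = user_name.replace(" ", "_")
    urls = []
    if summary_hidden:
        # Edit summaries are shown on the page history and on the
        # editor's contributions list.
        query = urlencode({"title": title, "action": "history"})
        urls.append(f"{base_url}/w/index.php?{query}")
        urls.append(f"{base_url}/wiki/Special:Contributions/{quote(user)}")
    if content_hidden:
        # The edit content itself is shown on the page that was edited.
        urls.append(f"{base_url}/wiki/{quote(title)}")
    return urls
```

For example, `urls_to_refresh("https://en.wikipedia.org", "Example page", "Some User", True, False)` would return the history URL and the contributions URL, but not the page itself.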

I don't know 1) how busy you all are, 2) how difficult this would be, or 3) what priority this would be, so I realize this may just not be worth the effort. I just thought I'd make you all aware of the exploit and suggest a possible workaround.

It's not an exploit - as soon as a revision is saved, it's public, anyone can download it and republish it. What revision deletion does is stop the host wiki from continuing to display it to everyone. It doesn't claim to attempt to wipe out all copies of the revision across the web.

Exploits aren't well defined; whether it's an exploit is a matter of choosing your definition. Revdel claims to hide information from the public, but the public can still access that information if the page was cached with it. This is something that I use pretty commonly on other sites, and I've tried it three times on Wikipedia and the info has been there each time.

Although unlikely, it would be possible for a troll to make a bot that monitors Special:RecentChanges for revdel usage and automatically looks up and stores the most recent cache of the page on Google/Bing. They could then go through and see if there's any information outing people and use that info maliciously. This could be aided by the fact that a lot of the edit summaries for revdel also state their purpose as an outing violation or removing private information. Requesting a cache update wouldn't stop this, though, since presumably the bot would work faster than Google/Bing, although you could make it so that revdel edits don't show up on new edits feeds.

I don't know if it is relevant, but what prompted me to bring this up here was a user linking to this statement by WMF legal, in which they say they are forced to constrain the idea of giving access to deleted revisions to people who have not been through an RfA or similar process.