
Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents
Closed, Resolved, Public

Description

We need at the very least to:

  • Provide a way to override the UA used by ForeignFileRepo to add a suffix containing contact information
  • Always set the referrer to the root url of the wiki making the request

Ideally, we should also fix the defaults, and add contact information to the UA automatically (something like https://<wiki url>).

Additionally, we might want to find a way to allow people running older versions of MediaWiki to monkey-patch their installation to add contact info to the UA.

Event Timeline

Joe removed Joe as the assignee of this task.
Joe added a project: MediaWiki-File-management.
taavi subscribed.

(User-notice, not relevant to editors of Wikimedia wikis.)

(assuming this will also apply to QuickInstantCommons, remove if I'm wrong :))

Provide a way to override the UA used by ForeignFileRepo to add a suffix containing contact information

Easy enough, but we probably need to think about how we communicate to the wiki sysadmin that they need to fill this out, and how (if at all) it's presented in the installer.

Always set the referrer to the url of the wiki making the request

Should be easy provided it's fine that the referrer goes to the wiki's main page or just the domain. It might potentially be trickier if the referrer should refer to the specific wiki page that caused the request to be triggered.
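To illustrate the "just the domain" option, here is a minimal standalone sketch in plain PHP. The helper name `wikiRootReferer()` is made up for this example and is not a MediaWiki function; inside MediaWiki one would more likely start from existing configuration such as `$wgCanonicalServer`.

```php
<?php
/**
 * Reduce a full page URL to the site root, for use as a Referer value.
 * Hypothetical helper for illustration only; not part of MediaWiki.
 */
function wikiRootReferer( string $pageUrl ): ?string {
	$parts = parse_url( $pageUrl );
	if ( $parts === false || !isset( $parts['scheme'], $parts['host'] ) ) {
		// Not an absolute URL, so there is no root to derive.
		return null;
	}
	$root = $parts['scheme'] . '://' . $parts['host'];
	if ( isset( $parts['port'] ) ) {
		$root .= ':' . $parts['port'];
	}
	// Path and query string are dropped, so the page that triggered
	// the request is not revealed.
	return $root . '/';
}

echo wikiRootReferer( 'https://wiki.example.org/wiki/File:Example.jpg?action=info' ), "\n";
```

The path and query are discarded on purpose: sending only the root sidesteps the question of which specific page caused the request.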

Ideally, we should also fix the defaults, and add contact information to the UA automatically (something like wiki user@wiki url).

Are you suggesting including the wiki user who caused the request to happen (as opposed to the server admin)? That feels like a privacy violation and I fail to see how it would be helpful in abuse fighting. Keep in mind the triggering event could just be a page view (whenever the page falls out of the appropriate cache) and may be an anonymous user.

As far as wiki name, QuickInstantCommons already includes $wgSitename in the user-agent (core MW does not do this).

Reedy renamed this task from Make InstantCommons and other uses of ForeignApiRepo use policy-compliant user agents to Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents. (Jul 31 2025, 5:01 PM)

Change #1174858 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[mediawiki/extensions/QuickInstantCommons@master] Set a referrer in HTTP requests and adjust user-agent

https://gerrit.wikimedia.org/r/1174858

Change #1174858 merged by jenkins-bot:

[mediawiki/extensions/QuickInstantCommons@master] Set a referrer in HTTP requests and adjust user-agent

https://gerrit.wikimedia.org/r/1174858

Change #1174867 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[mediawiki/extensions/QuickInstantCommons@REL1_44] Set a referrer in HTTP requests and adjust user-agent

https://gerrit.wikimedia.org/r/1174867

Change #1174868 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[mediawiki/extensions/QuickInstantCommons@REL1_43] Set a referrer in HTTP requests and adjust user-agent

https://gerrit.wikimedia.org/r/1174868

Change #1174869 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[mediawiki/extensions/QuickInstantCommons@REL1_42] Set a referrer in HTTP requests and adjust user-agent

https://gerrit.wikimedia.org/r/1174869

Change #1174868 merged by jenkins-bot:

[mediawiki/extensions/QuickInstantCommons@REL1_43] Set a referrer in HTTP requests and adjust user-agent

https://gerrit.wikimedia.org/r/1174868

Change #1174867 merged by jenkins-bot:

[mediawiki/extensions/QuickInstantCommons@REL1_44] Set a referrer in HTTP requests and adjust user-agent

https://gerrit.wikimedia.org/r/1174867

Change #1174869 merged by Brian Wolff:

[mediawiki/extensions/QuickInstantCommons@REL1_42] Set a referrer in HTTP requests and adjust user-agent

https://gerrit.wikimedia.org/r/1174869

Change #1174870 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[mediawiki/extensions/QuickInstantCommons@REL1_39] Set a referrer in HTTP requests and adjust user-agent

https://gerrit.wikimedia.org/r/1174870

Change #1174871 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[mediawiki/extensions/QuickInstantCommons@REL1_35] Set a referrer in HTTP requests and adjust user-agent

https://gerrit.wikimedia.org/r/1174871

Change #1174871 merged by Brian Wolff:

[mediawiki/extensions/QuickInstantCommons@REL1_35] Set a referrer in HTTP requests and adjust user-agent

https://gerrit.wikimedia.org/r/1174871

Change #1174870 merged by jenkins-bot:

[mediawiki/extensions/QuickInstantCommons@REL1_39] Set a referrer in HTTP requests and adjust user-agent

https://gerrit.wikimedia.org/r/1174870

[Anyways, I adjusted the QuickInstantCommons part. That's the only part of this bug I plan to work on, so it should stay open for the stuff in MW core.]

Are you suggesting including the wiki user who caused the request to happen (as opposed to the server admin)? That feels like a privacy violation and I fail to see how it would be helpful in abuse fighting. Keep in mind the triggering event could just be a page view (whenever the page falls out of the appropriate cache) and may be an anonymous user.

I was looking at this from the perspective of wiki operators. If we have the username in the UA, that allows us to rate-limit requests based on the UA string and only block abusive behaviour of single users, instead of needing to punish an entire wiki. I'm not sure how that's a privacy violation, unless you assume that the username used on wiki X is somehow private identifying information, which I'm not convinced of.

[Anyways, I adjusted the QuickInstantCommons part. That's the only part of this bug I plan to work on, so it should stay open for the stuff in MW core.]

Thanks for working on this <3

I think the personal information might be the combined information that user X was viewing article Y/Commons images Y, rather than necessarily a username on its own (although, to be fair, possibly also just a username on its own). IANAL, but I believe that either of these may be covered by the GDPR's broad definition of "personal data", and would potentially cause data-protection headaches for wiki operators were we to implement a system that sent any sort of user data back to Wikimedia servers. On the face of the issue, I think I agree with @Bawolff here.

Fair enough. I am not fully convinced this is covered by GDPR, but nonetheless I agree it is a risk not worth taking. After all, wiki operators who find themselves rate-limited or blocked will be able to look at their own databases and find out who the offender is.

InstantCommons does work in a way that is pretty easy to abuse. If you view any File page on a wiki, or call a relevant API method, InstantCommons will cause the wiki to look up that file. It's essentially an open proxy for MediaWiki's imageinfo API.

When the images are hotlinked (but the downstream wiki still needs to fetch metadata), adding a username would reveal IP / username combinations to the upstream wiki via timing correlations. Can't violate privacy much more than that.

There are two scenarios, assuming reasonable caching configuration:

  • Downstream wiki reuses a modest number of Commons images. Since the results of the API requests are cached, even if they are being scraped or whatever, eventually all images or image metadata will be cached. The images themselves might still be hotlinked, but those requests will use normal browser UAs anyway.
  • Downstream wiki reuses a very large number of Commons images (with the assumption that most of them won't be visited regularly, so at any given time only a small fraction of them is cached). E.g. some sort of public Wikipedia mirror. An abusive user *will* break InstantCommons for the entire wiki, but if you run a wiki with a ton of Commons image references, I think it's reasonable to expect the downstream site operator to handle the abuse on their side in this case.

So I think generic wiki contact information in the UA will always make more sense.
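The first scenario can be made concrete with a toy model. This is a sketch under stated assumptions, not MediaWiki's caching code: the class `CachedMetadataLookup`, its fields, and the simulated upstream call are all hypothetical. The 43200-second TTL is borrowed from the 12-hour API cache mentioned elsewhere in this task.

```php
<?php
/**
 * Toy model of a downstream wiki's metadata cache. Illustration only:
 * the class, its fields and the simulated upstream call are hypothetical
 * and do not mirror MediaWiki's actual caching code.
 */
final class CachedMetadataLookup {
	/** @var array<string, int> file name => expiry timestamp */
	private $expiry = [];
	/** @var int number of simulated upstream API requests */
	public $upstreamRequests = 0;
	/** @var int cache lifetime in seconds */
	private $ttl;

	public function __construct( int $ttlSeconds ) {
		$this->ttl = $ttlSeconds;
	}

	/** Look up a file at time $now (seconds); only cache misses go upstream. */
	public function lookup( string $fileName, int $now ): void {
		if ( isset( $this->expiry[$fileName] ) && $this->expiry[$fileName] > $now ) {
			return; // served from cache, upstream sees nothing
		}
		$this->upstreamRequests++; // stands in for one imageinfo request
		$this->expiry[$fileName] = $now + $this->ttl;
	}
}

// A scraper requests the same three File pages 1000 times within the TTL.
$cache = new CachedMetadataLookup( 43200 );
for ( $i = 0; $i < 1000; $i++ ) {
	foreach ( [ 'A.jpg', 'B.jpg', 'C.jpg' ] as $name ) {
		$cache->lookup( $name, $i );
	}
}
echo $cache->upstreamRequests, "\n"; // prints 3
```

In this model the upstream load is bounded by the number of distinct files per TTL window, not by the number of page views. That is why the second scenario, with far more distinct files than the cache ever holds, is the one where abuse actually reaches Commons.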

MediaWiki-Platform-Team will pick up the core part of this. Note that the soonest a change to the InstantCommons code could make a difference is after the next MediaWiki release (so in about 3 months). Many sites will only upgrade when the next LTS version is released (in about 15 months).

Maybe we can recommend some code that you can drop into your wiki configuration today to affect user agents.

Some related tasks, if we are already touching InstantCommons:

MediaWiki-Platform-Team will pick up the core part of this. Note that the soonest a change to the InstantCommons code could make a difference is after the next MediaWiki release (so in about 3 months). Many sites will only upgrade when the next LTS version is released (in about 15 months).

Maybe we can recommend some code that you can drop into your wiki configuration today to affect user agents.

Yes that's my core idea - provide a code snippet people can add to their LocalSettings.php to monkey-patch the behaviour. I'll see what's the distribution of versions I see in the logs and report on the task.

So basically we need a ForeignAPIRepo subclass that overrides httpGet() with something along the lines of

$version = MW_VERSION;
$contact = Title::newMainPage()->getCanonicalUrl(); // or use $wgEmergencyContact?
$options['userAgent'] = "InstantCommons MediaWiki/$version ($contact)";
return parent::httpGet( $url, $timeout, $options, $mtime );

There is no way to declaratively add arbitrary headers, so if we really need a referer, that will be more complex.
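For reference, this is roughly what the request would need to carry, sketched in plain PHP with stream-context options instead of MediaWiki's HTTP layer. The helper `buildIdentifiedContextOptions()` is hypothetical, the 10-second timeout is an arbitrary choice for the example, and the UA layout is only one possibility, modelled on the formats discussed in this task.

```php
<?php
/**
 * Build PHP HTTP stream-context options carrying an identifying
 * User-Agent and a Referer. Hypothetical helper for illustration;
 * ForeignAPIRepo itself goes through MediaWiki's HTTP layer rather
 * than stream contexts.
 */
function buildIdentifiedContextOptions( string $mwVersion, string $siteRoot ): array {
	// Strip CR/LF so a misconfigured value cannot inject extra headers.
	$siteRoot = str_replace( [ "\r", "\n" ], '', $siteRoot );
	return [
		'http' => [
			'method' => 'GET',
			'user_agent' => "MediaWiki/$mwVersion ($siteRoot) ForeignAPIRepo/T400881",
			'header' => "Referer: $siteRoot\r\n",
			'timeout' => 10, // arbitrary value for the example
		],
	];
}

$options = buildIdentifiedContextOptions( '1.43.0', 'https://wiki.example.org/' );
// To actually send a request, the options would be passed on, e.g.:
// $body = file_get_contents( $apiUrl, false, stream_context_create( $options ) );
print_r( $options['http'] );
```

The point of the sketch is only that the Referer is an extra header alongside the UA; wiring the same thing through an existing ForeignAPIRepo subclass is the part that takes more work.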

Change #1176852 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@master] filerepo: Improve identification of ForeignAPIRepo requests

https://gerrit.wikimedia.org/r/1176852

Change #1176853 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@REL1_44] filerepo: Improve identification of ForeignAPIRepo requests

https://gerrit.wikimedia.org/r/1176853

Change #1176854 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@REL1_43] filerepo: Improve identification of ForeignAPIRepo requests

https://gerrit.wikimedia.org/r/1176854

Change #1176855 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@REL1_42] filerepo: Improve identification of ForeignAPIRepo requests

https://gerrit.wikimedia.org/r/1176855

Change #1176856 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@REL1_39] filerepo: Improve identification of ForeignAPIRepo requests

https://gerrit.wikimedia.org/r/1176856

LocalSettings code snippet that in theory works going back to 1.34 (although I haven't tested it on old versions):

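// Disable the built-in InstantCommons repo and register an equivalent one that uses the subclass defined below.
$wgUseInstantCommons = false;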
$wgForeignFileRepos[] = [
	'class' => ForeignAPIRepoWithFixedUA::class,
	'name' => 'wikimediacommons',
	'apibase' => 'https://commons.wikimedia.org/w/api.php',
	'url' => 'https://upload.wikimedia.org/wikipedia/commons',
	'thumbUrl' => 'https://upload.wikimedia.org/wikipedia/commons/thumb',
	'directory' => $wgUploadDirectory,
	'hashLevels' => 2,
	'transformVia404' => true,
	'fetchDescription' => true,
	'descriptionCacheExpiry' => 43200,
	'apiThumbCacheExpiry' => 0,
];

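class ForeignAPIRepoWithFixedUA extends \ForeignAPIRepo {
	/**
	 * Identify this wiki in requests: MediaWiki version first, then the
	 * wiki's canonical server URL as contact information.
	 */
	public static function getUserAgent() {
		global $wgCanonicalServer;
		$mediaWikiVersion = 'MediaWiki/' . MW_VERSION;
		return "$mediaWikiVersion ($wgCanonicalServer) ForeignAPIRepo/T400881";
	}
}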

The settings are the defaults for InstantCommons (hotlinking, with a 12-hour cache for the API requests). Do we want to make any changes to those?

(This doesn't add a referer, which would be a lot more complicated. The patch does add it for future versions, although I'm not sure how useful it is, since it just duplicates the URL from the UA.)

Change #1176855 abandoned by Gergő Tisza:

[mediawiki/core@REL1_42] filerepo: Improve identification of ForeignAPIRepo requests

Reason:

EOL branch

https://gerrit.wikimedia.org/r/1176855

@Joe which format do you think would be more useful?

ForeignAPIRepo/2.1 (https://example.org) MediaWiki/1.45.0

or

MediaWiki/1.45.0 (https://example.org) ForeignAPIRepo/2.1

Updated T400881#11072676 and the patch to put MediaWiki first.
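With MediaWiki first, a consumer of the logs could split the string into its parts along these lines. This is a sketch only: `parseForeignApiRepoUserAgent()` is a hypothetical helper, and it assumes the UA looks exactly like the second example above, with nothing before or after it.

```php
<?php
/**
 * Split a User-Agent of the form
 *   MediaWiki/<version> (<contact>) ForeignAPIRepo/<version>
 * into its parts. Hypothetical helper; returns null for anything else.
 */
function parseForeignApiRepoUserAgent( string $ua ): ?array {
	$pattern = '#^MediaWiki/(\S+) \(([^()]+)\) ForeignAPIRepo/(\S+)$#';
	if ( !preg_match( $pattern, $ua, $m ) ) {
		return null;
	}
	return [ 'mediawiki' => $m[1], 'contact' => $m[2], 'repo' => $m[3] ];
}

$parsed = parseForeignApiRepoUserAgent( 'MediaWiki/1.45.0 (https://example.org) ForeignAPIRepo/2.1' );
echo $parsed['contact'], "\n"; // prints https://example.org
```

Grouping requests by the `contact` field would then allow rate limits to apply per wiki rather than per IP, without any per-user data in the string.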

Change #1176854 merged by jenkins-bot:

[mediawiki/core@REL1_43] filerepo: Improve identification of ForeignAPIRepo requests

https://gerrit.wikimedia.org/r/1176854

Change #1176856 merged by jenkins-bot:

[mediawiki/core@REL1_39] filerepo: Improve identification of ForeignAPIRepo requests

https://gerrit.wikimedia.org/r/1176856

Change #1176853 merged by jenkins-bot:

[mediawiki/core@REL1_44] filerepo: Improve identification of ForeignAPIRepo requests

https://gerrit.wikimedia.org/r/1176853

Change #1176852 merged by jenkins-bot:

[mediawiki/core@master] filerepo: Improve identification of ForeignAPIRepo requests

https://gerrit.wikimedia.org/r/1176852

The patches are merged, and I added the code snippet to https://www.mediawiki.org/wiki/InstantCommons#Temporary_solution. I think we are done here.

Does the referrer thing make sense to set in MediaWiki in general, just by default? I'm sure there are lots of other HTTP requests MW might make which aren't as common as InstantCommons.

For example, 3rd party wikis may allow transwiki import from Wikimedia projects.

IMO because they aren't as common it doesn't matter much (does make sense but isn't important). In any case, filed as T402740: Consider making the default UA or referrer for MediaWiki server-side web requests more informative.