Images served with text/html content type
Closed, Declined · Public


I do not know whether this is an artifact of these images being proxied by googleweblight, but images such as:

which should have content-type image/webp,

appear in the webrequest data as text/html:

cp1081.eqiad.wmnet 1938530061 2019-07-01T02:17:56 7.66E-4 hit-local 200 21344 GET /wiki/File:Arm_muscles_back_numbers.png text/html; charset=UTF-8 - NULL Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/some Mobile Safari/535.19 pt-BR ns=6;page_id=39031;https=1;nocookies=1 - true 0.0.21 some {"city":"Unknown","subdivision":"Unknown","longitude":"-97.822","timezone":"America/Chicago","country_code":"US","country":"United States","latitude":"37.751","continent":"North America","postal_code":"Unknown"} cp1075 hit/1, cp1081 miss {"browser_family":"Chrome Mobile","os_major":"4","wmf_app_version":"-","browser_major":"38","os_minor":"2","os_family":"Android","device_family":"Nexus 5"} {"page_id":"39031","ns":"6","nocookies":"1","https":"1"} 2019-07-01 02:17:56 mobile web user NULL none {"project_class":"wikimedia","project":"commons","qualifiers":["m"],"tld":"org","project_family":"wikimedia"} {"language_variant":"default","project":"commons.wikimedia","page_title":"File:Arm_muscles_back_numbers.png"} 39031 6 ["pageview"] {"organization":"Google Proxy","autonomous_system_organization":"Google LLC","isp":"Google Proxy","autonomous_system_number":"15169"} text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 text 201971

Event Timeline

Nuria created this task. Sep 11 2019, 10:03 PM
Restricted Application added a project: Operations. Sep 11 2019, 10:03 PM
Restricted Application added a subscriber: Aklapper.

The effect is that these images are being counted as content pageviews when they are just asset requests.

I think we need to add proxy=googleweblight to x-analytics.
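
A minimal sketch of what that tagging could look like, assuming x-analytics is the semicolon-separated key=value string visible in the log line above (`ns=6;page_id=39031;https=1;nocookies=1`) and that the proxy can be recognized by the `googleweblight` token in the User-Agent. The function names are hypothetical and this is only an illustration of the idea, not how the header is actually set in production.

```python
from typing import Dict


def parse_x_analytics(value: str) -> Dict[str, str]:
    """Split 'k1=v1;k2=v2' into a dict; an empty value or '-' yields {}."""
    if not value or value == "-":
        return {}
    fields = {}
    for item in value.split(";"):
        if "=" in item:
            key, val = item.split("=", 1)
            fields[key] = val
    return fields


def format_x_analytics(fields: Dict[str, str]) -> str:
    """Serialize the dict back to 'k1=v1;k2=v2', keeping insertion order."""
    return ";".join(f"{key}={val}" for key, val in fields.items())


def tag_weblight(x_analytics: str, user_agent: str) -> str:
    """Add proxy=googleweblight when the User-Agent carries that token.

    Assumption: matching on the User-Agent substring is enough to
    identify the proxy. An existing proxy value is left untouched.
    """
    fields = parse_x_analytics(x_analytics)
    if "googleweblight" in user_agent.lower():
        fields.setdefault("proxy", "googleweblight")
    return format_x_analytics(fields)
```

Called with the x-analytics value and User-Agent from the log line above, `tag_weblight` would append `proxy=googleweblight` to the existing fields and leave requests from other user agents unchanged.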

jbond triaged this task as Normal priority. Sep 12 2019, 11:06 AM
jbond added a subscriber: jbond.

We need to add googleweblight to the proxy list to make sure it is treated appropriately. I think misc/trusted_proxies.json is outside my boundaries, so possibly @BBlack or @ema can do it.

cc @Ottomata, in case he can make the change too.

Nuria updated the task description. Sep 12 2019, 11:02 PM
jijiki added a subscriber: jijiki. Sep 13 2019, 3:06 AM

The URL mentioned at the top isn't a media URL, it actually is HTML content and is a pageview. Try it in your browser:
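
To make the distinction concrete, here is a rough sketch that separates a file description page from the image bytes themselves. It assumes that description pages live under `/wiki/File:` and are served as text/html, while the media is served with an image/* content type. The helper name is hypothetical, and this is a simplification rather than the actual pageview definition.

```python
def looks_like_file_page(uri_path: str, content_type: str) -> bool:
    """Return True for an HTML file description page, False for media bytes.

    Assumption: a path under /wiki/File: combined with a text/html
    content type means the request is for the description page.
    """
    # Drop parameters such as '; charset=UTF-8' before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    return uri_path.startswith("/wiki/File:") and mime == "text/html"
```

With the path and content type from the log line in the description (`/wiki/File:Arm_muscles_back_numbers.png`, `text/html; charset=UTF-8`), this returns `True`, which matches the record being tagged as a pageview.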

Can we get a separate, appropriately titled ticket about adding Weblight to the trusted proxies list, covering the rationale and the upstream source of IPs to whitelist? Keep in mind that our proxies database is only manually curated (so it will inevitably fall behind upstream changes) and currently lacks many proxies (IIRC, it only has OperaMini to date). It was an outgrowth of the now-defunct Zero project. We may want to consider a better system for managing "trusted" proxies for analytics purposes in the future.

Nuria added a comment. Sep 13 2019, 4:58 PM

I have started another ticket that, as you suggested, better explains the rationale behind having "trusted proxies". We really do not need them if we can capture the original IP:

Ottomata closed this task as Declined. Sep 23 2019, 3:30 PM

Nuria, I think we can decline this, yes? Doing so now; feel free to reopen if I am wrong.