
API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML)
Closed, Invalid (Public)

Description

On en.wikipedia.org, the browser connection security error page that has just been activated may contain invalid XML. This causes a parse failure when the response is read using XmlDocument.Load() in the continuation of an HttpClient.GetStreamAsync() call that issues an API query request.


The 'img' start tag on line 21 position 38 does not match the end tag of 'a'. Line 22, position 3.

Although I honestly can't see what is wrong with the HTML. The attachment is snapped from a Fiddler trace.

I'm using HTTPS but I suspect I'm not properly enabling TLS 1.2. My workaround is to fix that so I don't see the error.

Event Timeline

Hi @DavidBrooks, thanks for taking the time to report this!

If you parse it as XML, I assume that <img /> is expected by the parser instead of <img>?

Thanks for picking this up. I assume the answer is buried deep in XmlDocument.Load(). According to the stack trace it's System.Xml.XmlTextReaderImpl.ParseEndElement() that gets upset. But you're probably right; I didn't look critically enough. The <IMG> element isn't closed.

To use TLS 1.2 I have to make a one-line code fix. Because I'm still on .NET Framework 4.5, I have to add System.Net.ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12; before constructing the HTTP handlers. That's not ideal; Framework 4.8 has a more flexible fix.
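The same class of fix exists in other stacks. As a point of comparison only (not the .NET code discussed above), here is a minimal Python sketch that pins a client to TLS 1.2 or newer, assuming Python 3.7+ for ssl.TLSVersion:

```python
import ssl
import urllib.request

# Build an SSL context that refuses anything older than TLS 1.2,
# analogous to the one-line ServicePointManager fix on .NET 4.5.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

# Any urlopen() made through this opener will negotiate TLS 1.2+
# or fail outright, instead of being served the sec-warning page.
opener = urllib.request.build_opener(
    urllib.request.HTTPSHandler(context=ctx)
)
```

Failing the handshake outright is arguably better than the silent downgrade: the client gets an immediate, unambiguous error rather than an HTML page where XML was expected.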

It might be invalid XML, but https://validator.w3.org/ has no problem with it.

Screenshot 2019-12-12 at 01.40.48.png (954×2 px, 378 KB)

And https://developer.mozilla.org/en-US/docs/Web/HTML/Element/img, for example, doesn't show it as needing the closing /.

HTML5 isn't valid XML....

Oh, goodness, my thinking is muddled today. Blame a cold.

This is an API query (GET /w/api.php?action=query...). It expects well-formed XML with, some levels down, a <rev> element that contains the wiki page as inner text (whose wikitext syntax is not parsed). Instead it's getting an actual HTML page. Even if that page had parsed as XML successfully, the code wouldn't be able to understand the resulting tree.
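To illustrate the shape the client expects, here is a hedged Python sketch that parses a hypothetical, well-formed action=query response and extracts the page text from the <rev> element. The element nesting follows the description above; the sample document and its attributes are assumptions, not the API's exact output:

```python
import xml.etree.ElementTree as ET

# A hypothetical, well-formed API response of the shape described above.
sample = """<?xml version="1.0"?>
<api>
  <query>
    <pages>
      <page title="Example">
        <revisions>
          <rev>This is the raw wikitext of the page.</rev>
        </revisions>
      </page>
    </pages>
  </query>
</api>"""

# fromstring() raises ParseError if fed the HTML warning page instead.
root = ET.fromstring(sample)
rev = root.find("./query/pages/page/revisions/rev")
print(rev.text)  # the wikitext, passed on unparsed
```

When the cache layer substitutes the HTML sec-warning page, the parse fails at the first unclosed element, which is exactly the XmlDocument.Load() failure reported in the description.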

Returning HTML when XML or JSON is expected is something of a category error, isn't it? Instead, it would be really nice for the API route to return an XML or JSON object with at least an <api><warnings> element calling out the new security requirement. It would be a cleaner response, even if the <rev> element isn't there. That said, if I'm the only one facing the regression, and have already coded a fix, this would be a low priority.

It kinda is and it isn't. You've been served a static HTML error page that isn't generated by the MW API; it comes from the caches in front of MediaWiki. The request never gets as far as MediaWiki, so there's no XML or JSON choice, just the error/warning page.

Do you know what HTTP Status it was served with? Does it let you continue with subsequent requests?

Aklapper renamed this task from Browser Connection Security warning page apparently produces invalid XML to API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML). Dec 12 2019, 12:58 PM

The way it works is that if the connection isn't using TLSv1.2, the user is served a 302 redirect to /sec-warning on the same domain, which in turn returns a cacheable 200 OK with the HTML warning content and the Content-Type header set to text/html; charset=utf-8. There are a lot of gory details in the compromises made by that solution (vs., e.g., returning some kind of 4xx error immediately rather than 302->200), but we've learned this is the best pattern to avoid misbehavior by certain bots and scrapers out in the world that spam-retry [45]xx return codes.

That the contents of the 200 response don't parse as intended is actually a good thing. If they did parse correctly like the response you were expecting, you might still not notice that anything was wrong, and never upgrade the TLS version you're using. We served these sec-warning interruptions earlier as well, at lower percentages (e.g. some days before this, you should've seen ~10% of your requests failing this way), as an early-warning system, but there is no universal early-warning system for such changes that actually works in the real world (nobody reads or reacts to announcements, and generally people don't pay attention when we break a small percentage of their requests either, but we do what we can). The really awful case is POST requests: there are some bots out there which only ever POST, never GET, and don't really bother to parse the output response either...

Thanks for the detailed explanation. Does that mean the external parser should check the header to realize the mime type is text/html (and not application/xml or text/xml or application/json), and that this task should be declined?
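A defensive client could do exactly that check before handing the body to the XML parser. A minimal Python sketch of the idea (the helper name and the accepted MIME list are my own, not anything the API documents):

```python
def is_expected_api_type(content_type):
    """Return True if the Content-Type header names a payload the API
    client knows how to parse (XML or JSON); False for anything else,
    such as the text/html sec-warning page."""
    # Strip parameters like "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in {"application/xml", "text/xml", "application/json"}

# The sec-warning page is served as HTML, so it is rejected up front:
print(is_expected_api_type("text/html; charset=utf-8"))  # False
print(is_expected_api_type("application/json"))          # True
```

Rejecting on the header gives a clear "this isn't the API talking" error instead of a confusing XML parse failure deep inside XmlDocument.Load().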

I'm not even sure what the task is asking for, but yeah in general we're not going to make the sec-warning mechanism comply with all expected valid outputs from all possible APIs/URIs it's covering. It's designed to break things, in a way that at least provides some level of human info on what's going on if someone digs in and looks. The next step in the transition process after this is that whatever agent they're using which is getting the sec-warning output won't be able to establish a connection to our infrastructure at all, which is way more broken than this.

Thanks, BBlack, for the obvious thoughtful care that went into this. And, in my case, it had the desired end result, although I had to read to the end of the page before I knew where to start.

I guess the task is asking "don't suddenly surprise API users like this", but I understand now it was by design to be attention-getting. Perhaps one improvement would be to add a comment right at the top of the HTML page (because I'll look at the source in the first instance) explaining to coders in general terms what to do. Of course it'll be moot in 3 weeks.

For the Action API (api.php), it would be sufficient to return a 4xx or 5xx rather than a 200. But it sounds like that causes some crawlers to be stupid in other contexts.

jcrespo subscribed.

@DavidBrooks Based on your last comment, it seems like the right actionable here would be to close this as "Invalid" or "Declined", and maybe file a new ticket, if you wanted to, with suggestions for making the /sec-warning page easier for developers to understand. While I agree your concerns are valid, given the comments above about which layer handles this (it is generic for all services, not something the application does), it is difficult to suggest a better outcome than that.

(Of course, the immediate actionable is to use only TLS 1.2+ connections on the client, following the industry-wide deprecation: https://hacks.mozilla.org/2019/05/tls-1-0-and-1-1-removal-update/ )

DavidBrooks closed this task as Invalid. Edited Dec 15 2019, 2:40 PM

Closed as suggested. Sorry about the delay; I'll open another ticket although it's rapidly becoming moot. Is this the equivalent of "By Design"?