Page MenuHomePhabricator

Reduce the usage of API format=php
Open, LowestPublic

Assigned To
None
Authored By
MaxSem
Nov 13 2015, 12:57 AM
Referenced Files
None
Tokens
"Like" token, awarded by Pcj."Doubloon" token, awarded by Ricordisamoa."Dislike" token, awarded by Cyberpower678."Dislike" token, awarded by APPER."Heartbreak" token, awarded by Luke081515.

Description

Since T95715 is nearing a conclusion, time to discuss the next step.

Let's be honest, format=php sucks. It is not human-readable, it's restricted to one programming language, and even there it can result in a remote code execution on client if attacker owns/MITMs the server. Therefore, it would be wise to get rid of it, yet its current usage share is over 5% so even marking it as deprecated would be a poinless waste of bandwidth and logs to generate a zillion of deprecation warnings.

What can be done here? Some kind of soft-deprecation? Also, probably the time has come to start preaching to bot developers to cease using it. A fix in a couple of frameworks could improve the statistics significantly.

For the reference:

$ head -n 1000000 api.log | grep format=php | grep -oE 'API (GET|POST) \S+' | awk '{print $3;}' | sort | uniq -c | sort -rn | sed -r  -e 's/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/<IP user>/' | head -n 30
23521 Cyberbot_I
12615 <IP user>
 6833 <IP user>
 4664 Cyberbot_II
 3347 ClueBot_III
 1726 Legobot
 1612 Luke081515Bot
 1568 <IP user>
  868 ArndBot
  717 <IP user>
  338 <IP user>
  242 <IP user>
  240 ClueBot_NG
  240 <IP user>
  210 <IP user>
  191 <IP user>
  152 OLMBot
  152 NasirkhanBot
  135 <IP user>
  105 <IP user>
   81 <IP user>
   77 <IP user>
   74 <IP user>
   69 SineBot
   67 <IP user>
   66 <IP user>
   62 Beau.bot.admin
   59 <IP user>
   51 <IP user>
   39 <IP user>

Event Timeline

MaxSem raised the priority of this task from to Needs Triage.
MaxSem updated the task description. (Show Details)
MaxSem added subscribers: Anomie, csteipp, Yurik, MaxSem.

How about we introduce an artificial delay, say a few seconds, for each non JSON format result? This would get most developer's attention, and would help us migrate faster. Non-JSON formats cause us a huge headache, so we really should notify our subscribers about it somehow.

I think we should mark the format as deprecated (without logging if volume is a problem), and update docs (e.g. https://www.mediawiki.org/wiki/API:Data_formats) to indicate that format=php is deprecated and discouraged. Migration should be as simple as format=phpformat=json and unserialize(...)json_decode(..., true).

A fix in a couple of frameworks could improve the statistics significantly.

I have updated my copy of botclasses.php.

MaxSem set Security to None.

Sorry, but I have no intentions of moving my bot anytime soon and I will consider intentional delays disruptive.

Tough. Your bots will eventually break.

Using other format would need a big workaround, because the encoding would be a problem.

@Luke081515, what encoding problems do you have with JSON?

Tough. Your bots will eventually break.

Can you tell me what what the huge problem even is?

The fundamental problem is that each piece of the API (which is huge) needs to be aware of different output formats because they differ. XML is actually a bigger problem, but PHP serialization is a cause of a concern as well, especially due to security. Hence, we have wanted to standardize on just one format that everyone on the web uses.

Anomie triaged this task as Lowest priority.Nov 13 2015, 2:44 PM

How about we introduce an artificial delay, say a few seconds, for each non JSON format result?

No, -2. We are emphatically not going to do that.

Non-JSON formats cause us a huge headache,

Since I fixed the stupidity with format=xml throwing exceptions if various hoops weren't jumped through, I'm not aware of any remaining headaches caused by non-JSON formats.

The fundamental problem is that each piece of the API (which is huge) needs to be aware of different output formats because they differ.

This seems false to me. Can you explain a situation where any piece of the API outside of the formatter itself actually has to care which output format is being used, that isn't a special case like action=opensearch?

@Anomie, this was based on api formats discussion a year or two ago. Are you saying that everything is now peachy with formats, and we never incur any costs or maintenance headaches with multiple formats?

@Anomie, this was based on api formats discussion a year or two ago. Are you saying that everything is now peachy with formats, and we never incur any costs or maintenance headaches with multiple formats?

I'm saying that, since the overhaul, I'm not aware of any situations where API modules have to care about which format module is in use unless they're something like action=opensearch or the feed modules where the format is part of some external specification. There is still metadata that can be used for nicer output in some cases, but nothing will break anymore if people don't do it.

Even dump, dbg, txt, and yaml didn't cause any real problems, for that matter. But var_dump() (dump) and print_r() (txt) are just kind of pointless, and our yaml has been exactly the same as json since 2011.

PHP serialize format does have its security risks for the client, but I don't know that it rises to the level that comments like T118538#1803277 are really needed here. Our format=php doesn't even output any objects (and if it were to be added someday, which could be done easily enough, only stdClass would be used), so a client could likely use the new PHP7 feature to prevent object instantiation from unserialize and be safe. Or use code like people used to use in JavaScript when using eval() to parse JSON.

Is it possible at the moment to add stdClass for {...} and array() for [...] ? I haven't looked lately at the output metadata, but it seems like using stdClass for all object-like output would be preferred to assoc-arrays results.

There is no way to get such output at the moment from format=php. Adding a parameter (like format=json's 'utf8' or 'ascii' parameters) to set $transforms['Types']['AssocAsObject'] = true would be simple enough if anyone actually wants it.

@Luke081515, what encoding problems do you have with JSON?

For example, when I get letters like ä, ö or ü, (Get this usually at dewiki) the bot got problems with encode this to input for the API. When I'm using format=PHP, the programm, (in php too) don't have this problem.

I read all the messages but didn't find any reason, which would justify the disadvantages of breaking tools and forcing volunteers to change their code.
If it's as easy as Legoktm wrote, this simple transformation from JSON to PHP could simply be done in mediawiki. If it's not that easy, the format shouldn't be removed.

I must agree with APPER. I see no justification to force volunteers to allot hours of their time to comply to an unnecessary API change. If it's not broken don't fix it. I have no time to make any changes to my bots and tools, and if this change goes into effect, a widely used set of tools will be broken too.

PHP serialize format does have its security risks for the client, but I don't know that it rises to the level that comments like T118538#1803277 are really needed here. Our format=php doesn't even output any objects (and if it were to be added someday, which could be done easily enough, only stdClass would be used), so a client could likely use the new PHP7 feature to prevent object instantiation from unserialize and be safe. Or use code like people used to use in JavaScript when using eval() to parse JSON.

IMO the biggest issue with format=php is the security risk for the client, and there have been security issues with it in the past (T73478). We should not be encouraging people to unserialize() data received over the network.

If it's as easy as Legoktm wrote, this simple transformation from JSON to PHP could simply be done in mediawiki.

I don't understand what you mean by "done in mediawiki", the clients who are receiving the data need to use json_decode() instead of unserialize() when reading the API response.

@Luke081515, what encoding problems do you have with JSON?

For example, when I get letters like ä, ö or ü, (Get this usually at dewiki) the bot got problems with encode this to input for the API. When I'm using format=PHP, the programm, (in php too) don't have this problem.

@Luke081515, i think you are doing something really wrong there if you get UTF8 decoding problems -- all these should be transparent to you. All you need to do is something similar to the two line change that @Ricordisamoa showed above.

and there have been security issues with it in the past (T73478).

OTOH, we've had security issues in json too (T70187).

So we have security issues with both formats, so I don't see a reason to deprecate the php format. The result of that are just breaking tools and (maybe) angry volunteers, because the are forced to change their code (some people need to do more than the two line change), and as result we have unhappy developers, who lose time to develop new tools.

To reduce stress and confusion, this is the current status:

  • No formats are being deprecated right now. No one's code is currently in danger of being broken. This task not (currently) an RFC, and currently is a position statement with nothing actionable beyond "advocate to developers to switch to json".
  • If any format is to be deprecated or removed, it'll probably have to go through the requests for comment process.
    • If the hypothetical RFC is accepted, I'll insist on a decent notice interval. Likely at least a year if the format is currently in use, as was done for txt, dbg, and yaml.
    • If anyone does submit an RFC, make sure it gets announced to mediawiki-api-announce. The easiest way to do that is to tell me "Hey, anomie, post about Txxxxxx to mediawiki-api-announce" on IRC or via email.
  • People are free to advocate that developers update their code to use json instead of php or xml, and developers' doing so is not a bad idea. But please stick to the facts when advocating.
    • People are also free to advocate that developers use php or xml instead of json, for that matter. I don't know of any valid reason for anyone to advocate that, though, and unless you have reasons that aren't "increase usage to prevent the format from being deprecated" you probably shouldn't.

To reduce stress and confusion, this is the current status:

  • No formats are being deprecated right now. No one's code is currently in danger of being broken. This task not (currently) an RFC, and currently is a position statement with nothing actionable beyond "advocate to developers to switch to json".
  • If any format is to be deprecated or removed, it'll probably have to go through the requests for comment process.
    • If the hypothetical RFC is accepted, I'll insist on a decent notice interval. Likely at least a year if the format is currently in use, as was done for txt, dbg, and yaml.
    • If anyone does submit an RFC, make sure it gets announced to mediawiki-api-announce. The easiest way to do that is to tell me "Hey, anomie, post about Txxxxxx to mediawiki-api-announce" on IRC or via email.
  • People are free to advocate that developers update their code to use json instead of php or xml, and developers' doing so is not a bad idea. But please stick to the facts when advocating.
    • People are also free to advocate that developers use php or xml instead of json, for that matter. I don't know of any valid reason for anyone to advocate that, though, and unless you have reasons that aren't "increase usage to prevent the format from being deprecated" you probably shouldn't.

I feel much better now.