Page MenuHomePhabricator

Reduce the usage of API format=php
Open, MediumPublic

Description

Since T95715 is nearing a conclusion, time to discuss the next step.

Let's be honest, format=php sucks. It is not human-readable, it's restricted to one programming language, and even there it can result in a remote code execution on client if attacker owns/MITMs the server. Therefore, it would be wise to get rid of it, yet its current usage share is over 5% so even marking it as deprecated would be a poinless waste of bandwidth and logs to generate a zillion of deprecation warnings.

What can be done here? Some kind of soft-deprecation? Also, probably the time has come to start preaching to bot developers to cease using it. A fix in a couple of frameworks could improve the statistics significantly.

For the reference:

$ head -n 1000000 api.log | grep format=php | grep -oE 'API (GET|POST) \S+' | awk '{print $3;}' | sort | uniq -c | sort -rn | sed -r  -e 's/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/<IP user>/' | head -n 30
23521 Cyberbot_I
12615 <IP user>
 6833 <IP user>
 4664 Cyberbot_II
 3347 ClueBot_III
 1726 Legobot
 1612 Luke081515Bot
 1568 <IP user>
  868 ArndBot
  717 <IP user>
  338 <IP user>
  242 <IP user>
  240 ClueBot_NG
  240 <IP user>
  210 <IP user>
  191 <IP user>
  152 OLMBot
  152 NasirkhanBot
  135 <IP user>
  105 <IP user>
   81 <IP user>
   77 <IP user>
   74 <IP user>
   69 SineBot
   67 <IP user>
   66 <IP user>
   62 Beau.bot.admin
   59 <IP user>
   51 <IP user>
   39 <IP user>

Event Timeline

MaxSem raised the priority of this task from to Needs Triage.
MaxSem updated the task description. (Show Details)
MaxSem added subscribers: Anomie, csteipp, Yurik, MaxSem.

How about we introduce an artificial delay, say a few seconds, for each non JSON format result? This would get most developer's attention, and would help us migrate faster. Non-JSON formats cause us a huge headache, so we really should notify our subscribers about it somehow.

I think we should mark the format as deprecated (without logging if volume is a problem), and update docs (e.g. https://www.mediawiki.org/wiki/API:Data_formats) to indicate that format=php is deprecated and discouraged. Migration should be as simple as format=phpformat=json and unserialize(...)json_decode(..., true).

A fix in a couple of frameworks could improve the statistics significantly.

I have updated my copy of botclasses.php.

MaxSem set Security to None.

Sorry, but I have no intentions of moving my bot anytime soon and I will consider intentional delays disruptive.

Tough. Your bots will eventually break.

Using other format would need a big workaround, because the encoding would be a problem.

@Luke081515, what encoding problems do you have with JSON?

Tough. Your bots will eventually break.

Can you tell me what what the huge problem even is?

The fundamental problem is that each piece of the API (which is huge) needs to be aware of different output formats because they differ. XML is actually a bigger problem, but PHP serialization is a cause of a concern as well, especially due to security. Hence, we have wanted to standardize on just one format that everyone on the web uses.

Anomie triaged this task as Lowest priority.Nov 13 2015, 2:44 PM

How about we introduce an artificial delay, say a few seconds, for each non JSON format result?

No, -2. We are emphatically not going to do that.

Non-JSON formats cause us a huge headache,

Since I fixed the stupidity with format=xml throwing exceptions if various hoops weren't jumped through, I'm not aware of any remaining headaches caused by non-JSON formats.

The fundamental problem is that each piece of the API (which is huge) needs to be aware of different output formats because they differ.

This seems false to me. Can you explain a situation where any piece of the API outside of the formatter itself actually has to care which output format is being used, that isn't a special case like action=opensearch?

@Anomie, this was based on api formats discussion a year or two ago. Are you saying that everything is now peachy with formats, and we never incur any costs or maintenance headaches with multiple formats?

@Anomie, this was based on api formats discussion a year or two ago. Are you saying that everything is now peachy with formats, and we never incur any costs or maintenance headaches with multiple formats?

I'm saying that, since the overhaul, I'm not aware of any situations where API modules have to care about which format module is in use unless they're something like action=opensearch or the feed modules where the format is part of some external specification. There is still metadata that can be used for nicer output in some cases, but nothing will break anymore if people don't do it.

Even dump, dbg, txt, and yaml didn't cause any real problems, for that matter. But var_dump() (dump) and print_r() (txt) are just kind of pointless, and our yaml has been exactly the same as json since 2011.

PHP serialize format does have its security risks for the client, but I don't know that it rises to the level that comments like T118538#1803277 are really needed here. Our format=php doesn't even output any objects (and if it were to be added someday, which could be done easily enough, only stdClass would be used), so a client could likely use the new PHP7 feature to prevent object instantiation from unserialize and be safe. Or use code like people used to use in JavaScript when using eval() to parse JSON.

Is it possible at the moment to add stdClass for {...} and array() for [...] ? I haven't looked lately at the output metadata, but it seems like using stdClass for all object-like output would be preferred to assoc-arrays results.

There is no way to get such output at the moment from format=php. Adding a parameter (like format=json's 'utf8' or 'ascii' parameters) to set $transforms['Types']['AssocAsObject'] = true would be simple enough if anyone actually wants it.

@Luke081515, what encoding problems do you have with JSON?

For example, when I get letters like ä, ö or ü, (Get this usually at dewiki) the bot got problems with encode this to input for the API. When I'm using format=PHP, the programm, (in php too) don't have this problem.

I read all the messages but didn't find any reason, which would justify the disadvantages of breaking tools and forcing volunteers to change their code.
If it's as easy as Legoktm wrote, this simple transformation from JSON to PHP could simply be done in mediawiki. If it's not that easy, the format shouldn't be removed.

I must agree with APPER. I see no justification to force volunteers to allot hours of their time to comply to an unnecessary API change. If it's not broken don't fix it. I have no time to make any changes to my bots and tools, and if this change goes into effect, a widely used set of tools will be broken too.

PHP serialize format does have its security risks for the client, but I don't know that it rises to the level that comments like T118538#1803277 are really needed here. Our format=php doesn't even output any objects (and if it were to be added someday, which could be done easily enough, only stdClass would be used), so a client could likely use the new PHP7 feature to prevent object instantiation from unserialize and be safe. Or use code like people used to use in JavaScript when using eval() to parse JSON.

IMO the biggest issue with format=php is the security risk for the client, and there have been security issues with it in the past (T73478). We should not be encouraging people to unserialize() data received over the network.

If it's as easy as Legoktm wrote, this simple transformation from JSON to PHP could simply be done in mediawiki.

I don't understand what you mean by "done in mediawiki", the clients who are receiving the data need to use json_decode() instead of unserialize() when reading the API response.

@Luke081515, what encoding problems do you have with JSON?

For example, when I get letters like ä, ö or ü, (Get this usually at dewiki) the bot got problems with encode this to input for the API. When I'm using format=PHP, the programm, (in php too) don't have this problem.

@Luke081515, i think you are doing something really wrong there if you get UTF8 decoding problems -- all these should be transparent to you. All you need to do is something similar to the two line change that @Ricordisamoa showed above.

and there have been security issues with it in the past (T73478).

OTOH, we've had security issues in json too (T70187).

So we have security issues with both formats, so I don't see a reason to deprecate the php format. The result of that are just breaking tools and (maybe) angry volunteers, because the are forced to change their code (some people need to do more than the two line change), and as result we have unhappy developers, who lose time to develop new tools.

To reduce stress and confusion, this is the current status:

  • No formats are being deprecated right now. No one's code is currently in danger of being broken. This task not (currently) an RFC, and currently is a position statement with nothing actionable beyond "advocate to developers to switch to json".
  • If any format is to be deprecated or removed, it'll probably have to go through the requests for comment process.
    • If the hypothetical RFC is accepted, I'll insist on a decent notice interval. Likely at least a year if the format is currently in use, as was done for txt, dbg, and yaml.
    • If anyone does submit an RFC, make sure it gets announced to mediawiki-api-announce. The easiest way to do that is to tell me "Hey, anomie, post about Txxxxxx to mediawiki-api-announce" on IRC or via email.
  • People are free to advocate that developers update their code to use json instead of php or xml, and developers' doing so is not a bad idea. But please stick to the facts when advocating.
    • People are also free to advocate that developers use php or xml instead of json, for that matter. I don't know of any valid reason for anyone to advocate that, though, and unless you have reasons that aren't "increase usage to prevent the format from being deprecated" you probably shouldn't.

To reduce stress and confusion, this is the current status:

  • No formats are being deprecated right now. No one's code is currently in danger of being broken. This task not (currently) an RFC, and currently is a position statement with nothing actionable beyond "advocate to developers to switch to json".
  • If any format is to be deprecated or removed, it'll probably have to go through the requests for comment process.
    • If the hypothetical RFC is accepted, I'll insist on a decent notice interval. Likely at least a year if the format is currently in use, as was done for txt, dbg, and yaml.
    • If anyone does submit an RFC, make sure it gets announced to mediawiki-api-announce. The easiest way to do that is to tell me "Hey, anomie, post about Txxxxxx to mediawiki-api-announce" on IRC or via email.
  • People are free to advocate that developers update their code to use json instead of php or xml, and developers' doing so is not a bad idea. But please stick to the facts when advocating.
    • People are also free to advocate that developers use php or xml instead of json, for that matter. I don't know of any valid reason for anyone to advocate that, though, and unless you have reasons that aren't "increase usage to prevent the format from being deprecated" you probably shouldn't.

I feel much better now.

More than ten years have passed since the last comment and I'm planning to finally pull the plug on the php format. If anyone hasn't updated their code in the past ten years, they are not going to update it in the next decade either.

More than ten years have passed since the last comment and I'm planning to finally pull the plug on the php format. If anyone hasn't updated their code in the past ten years, they are not going to update it in the next decade either.

Makes sense to me. I guess the steps would be something like:

  1. Ping the few authors of affected bots.
  2. Update https://www.mediawiki.org/wiki/API:Data_formats (which already says format=php is deprecated).
  3. Send a message to mediawiki-api-announce that format=php is planned for removal in 3 months (like https://lists.wikimedia.org/hyperkitty/list/mediawiki-api-announce@lists.wikimedia.org/message/MIA3RXYN762VIYXFD3DKLCF2G47VO566/).
  4. Remove format=php support once that deadline is reached.

Re-running the api.log grep command today gave:

aaron@mwlog1003:/srv/mw-log$ head -n 1000000 api.log | grep format=php | grep -oE 'API (GET|POST) \S+' | awk '{print $3;}' | sort | uniq -c | sort -rn | sed -r  -e 's/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/<IP user>/' | head -n 30
   4903 Cyberbot_I
    500 ClueBot_III
    143 <IP user>
    114 <IP user>
     89 <IP user>
     72 <IP user>
     69 <IP user>
     67 <IP user>
     45 <IP user>
     43 <IP user>
     42 <IP user>
     40 SineBot
     35 2607:5300:205:300:0:0:0:563
     34 <IP user>
     32 <IP user>
     32 <IP user>
     31 <IP user>
     29 <IP user>
     26 ClueBot_NG
     25 <IP user>
     23 <IP user>
     22 LivingBot
     22 <IP user>
     21 <IP user>
     21 <IP user>
     21 <IP user>
     20 <IP user>
     20 2001:41D0:203:8878:0:0:0:0
     18 Luke081515Bot
     18 <IP user>

Basically I'm just seeing:

aaron raised the priority of this task from Lowest to Medium.Mar 17 2026, 8:51 PM

I pinged these two bots on their talk pages.

Cluebot stuff has been fixed now (they are asking whether we are seeing more requests like that from our side). Cyberbot message was ignored and went to archive

Cluebot stuff has been fixed now (they are asking whether we are seeing more requests like that from our side). Cyberbot message was ignored and went to archive

Not ignored. Just haven't had an opportunity to address it. I have now. Can you confirm Cyberbot I isn't using format=php anymore?

It looks like it's fixed now. Only these two are showing up in very small numbers:

37 AkeronBot
35 XabatuBot

Announced. I'm giving one month instead of three since this is a trivial fix and the ratio of the requests is extremely small specially legitimate requests (excluding scrapers following links)

Izno subscribed.

Re-adding User-notice (added/removed by same person); please make it a general forum to notify for breaking changes that might impact live users.

Suggested text might be

The [[:mw:API:Action API|action API]] has had several formats for requested output. One of them, <code>format=php</code>, is being removed soon. Please ensure your scripts or bots use the [[:mw:API:Data formats#Output|JSON format]]. This removal should affect very few scripts and bots.

With a potential addition of

We are also planning to remove <code>format=xml</code> at some point. See T421973 for that one.

(added/removed by same person)

I accidentally removed it and didn't realize until you pointed it out, sorry about that.

Re-adding User-notice (added/removed by same person); please make it a general forum to notify for breaking changes that might impact live users.

Suggested text might be

The [[:mw:API:Action API|action API]] has had several formats for requested output. One of them, <code>format=php</code>, is being removed soon. Please ensure your scripts or bots use the [[:mw:API:Data formats#Output|JSON format]]. This removal should affect very few scripts and bots.

With a potential addition of

We are also planning to remove <code>format=xml</code> at some point. See T421973 for that one.

Sounds good to me. Thanks!

Not marked as deprecated
here: https://www.mediawiki.org/wiki/API:Action_API
or here: https://www.mediawiki.org/wiki/Special:ApiSandbox
or here: https://www.mediawiki.org/w/api.php?action=help&modules=php

I am using my own fork of botclasses.php which has format=php https://github.com/bamyers99/Bambots/blob/master/src/lib/ChrisG/wikipedia.php
Some anonymous GETs by various bots and hundreds of POSTs a day by InceptionBot: https://en.wikipedia.org/wiki/Special:Contributions/InceptionBot
InceptionBot runs for a couple of hours each day, so maybe it wasn't in the 1000000 entry api.log grep.
User agent has bambots in it.
Had it been deprecated in the api sandboxes, I would have seen it and had more time to migrate the code and test all of the usages.

It's marked as deprecated at the page describing the formats: https://www.mediawiki.org/wiki/API:Data_formats

Why would anybody be looking at the API:Data_formats page when they have been using the sandboxes for years with format=php?

I make no comment on whether there has been sufficient notice, only that there was notice. And that notice is ancient (like this task is).

Sad to see people aren't following documented deprecation processes anymore. No announcement to the mediawiki-api-announce mailing list, no marking of the module as deprecated in code and waiting a release cycle, no using the deprecation log to measure usage, just "I'm going to remove this because I think it's old, and some greps over a relatively small sample don't show much usage".

@Ladsgroup @Izno would we like to be more specific on how soon this is happening?

Re-adding User-notice (added/removed by same person); please make it a general forum to notify for breaking changes that might impact live users.

Suggested text might be

The [[:mw:API:Action API|action API]] has had several formats for requested output. One of them, <code>format=php</code>, is being removed soon. Please ensure your scripts or bots use the [[:mw:API:Data formats#Output|JSON format]]. This removal should affect very few scripts and bots.

With a potential addition of

We are also planning to remove <code>format=xml</code> at some point. See T421973 for that one.

Sounds good to me. Thanks!

@Ladsgroup @Izno would we like to be more specific on how soon this is happening?

The notice for this task made it into this week's tech news. I am not sure if you forgot to move it to already announced.