[Task] Drop support for php-serialized output from Special:EntityData
Open, Medium, Public

Description

Special:EntityData is intended to be a LinkedData interface.
It should support various RDF flavours, and plain JSON.
Other formats (based on MediaWiki API result formats) should be dropped (currently, php-serialized is allowed by default).

This would allow us to use the JSON serialization of the entity directly, and drop the clunky dependency on the API result formatters.

TODO reasons we want to drop it

Notes:

  • needs to go through the breaking change process

Event Timeline

daniel raised the priority of this task from to Needs Triage.
daniel updated the task description.
daniel added a project: Wikidata.
daniel added subscribers: Aklapper, daniel.
JanZerebecki set Security to None.
JanZerebecki moved this task from incoming to ready to go on the Wikidata board.

Change 224773 had a related patch set uploaded (by Addshore):
Drop php format from default entityDataFormats

https://gerrit.wikimedia.org/r/224773

thiemowmde subscribed.

What does "other formats" mean? Only php? Or more?

This ticket does not explain why we "should" drop this support? What do we win? It works just fine right now. Even if it's not perfect and things are missing from the serialization, how is that a reason to kill it instead of fixing it? All serializations use arrays as an intermediate format. Just pass this to PHP's serialize() and be done. Why drop it?

Change 224773 abandoned by Addshore:
Drop php format from default entityDataFormats

Reason:
For now

https://gerrit.wikimedia.org/r/224773

Jonas renamed this task from Drop support for php-serialized output from Special:EntityData to [Task] Drop support for php-serialized output from Special:EntityData. Aug 14 2015, 8:31 AM
Jonas added a project: Technical-Debt.

Should this be moved to discussion?

What does "other formats" mean? Only php? Or more?

This ticket does not explain why we "should" drop this support? What do we win? It works just fine right now. Even if it's not perfect and things are missing from the serialization, how is that a reason to kill it instead of fixing it? All serializations use arrays as an intermediate format. Just pass this to PHP's serialize() and be done. Why drop it?

Random example: in the context of T128486: [Story] Make Special:EntityData be up to date after an edit, supporting this absurd output format means there are two additional URLs we need to purge on each edit. (Unless Varnish already normalizes .php and ?format=php into one URL, or something, then it’s one URL.)

Can we find out how much this format is still used, similar to T220826#5185202?

Oh dear, I just realized how strange my comment from 2015 sounds by now. I totally support dropping this esoteric format! I believe PHP's internal serialization format should be something that's internal to PHP, and not be part of a public API. Please drop it.

My problem with this ticket was – and still is – that it just states how something "should be", without providing any information how this decision was made. Is it even a decision?

Can we find out how much this format is still used, similar to T220826#5185202?

Apparently it has between 10k and 100k usages per day right now: https://grafana.wikimedia.org/d/000000169/wikidata-api-format-usage?orgId=1&refresh=30m&from=now-30d&to=now

My problem with this ticket was – and still is – that it just states how something "should be", without providing any information how this decision was made. Is it even a decision?

To me it read more like a proposal than a decision, though I hope the decision won’t be controversial.

Apparently it has between 10k and 100k usages per day right now: https://grafana.wikimedia.org/d/000000169/wikidata-api-format-usage?orgId=1&refresh=30m&from=now-30d&to=now

Are you sure this is for Special:EntityData and not for the action API? I’m missing some RDF formats in that graph (though that might just mean that we’ve misconfigured the list of allowed formats – I couldn’t figure out where the data ultimately comes from).

Ah, never mind, I found the Wikidata Special:EntityData dashboard. Somewhere around 5k PHP requests per day, it seems, compared to millions for JSON and Turtle.

This is a product decision, so putting in the product column.
From the tech side, removing this will leave us with fewer things to maintain; I'm not really sure if we "support" the php format here at all.
I would guess that most of the PHP calls are perhaps scrapers and things requesting this format by accident.
Migration path would be to use JSON instead.
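
As a rough illustration of that migration path, here is a minimal sketch of how a client might rewrite its php-format requests to JSON. The two URL shapes (a `.php` suffix and a `?format=php` query parameter) are the ones mentioned earlier in this thread; the helper names `to_json_url` and `english_label` are hypothetical, and the payload layout (`entities` → entity ID → `labels`) is assumed from the usual Wikibase entity JSON.

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_json_url(url: str) -> str:
    """Rewrite a php-format Special:EntityData URL to the JSON equivalent.

    Hypothetical helper: handles the '.php' suffix and '?format=php' forms.
    URLs that do not point at Special:EntityData are returned unchanged.
    """
    parts = urlsplit(url)
    if "Special:EntityData" not in parts.path:
        return url
    path = parts.path
    if path.endswith(".php"):
        path = path[: -len(".php")] + ".json"
    # Swap format=php for format=json, keep every other parameter as-is.
    query = [
        (key, "json" if key == "format" and value == "php" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode(query), parts.fragment)
    )


def english_label(payload: str, entity_id: str) -> str:
    """Read the English label from an entity JSON document (assumed layout)."""
    data = json.loads(payload)
    return data["entities"][entity_id]["labels"]["en"]["value"]
```

A consumer that previously called PHP's unserialize() on the response would decode the JSON body instead; the nested array structure should be largely the same, though that is worth checking per client.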

I can look it up in hadoop if that helps PM decision (@Lydia_Pintscher) on this.

Yesterday we had 3400 hits on php endpoints: 2089 were spiders and 1300 were from users (or at least they presented a user UA, which can be faked and often is). 1000 of the hits come from only four countries (which are not the usual suspects), but I can't disclose more in a public ticket.

I'll raise this with Lydia in my next 1:1 with her

Yesterday we had 3400 hits on php endpoints: 2089 were spiders and 1300 were from users (or at least they presented a user UA, which can be faked and often is). 1000 of the hits come from only four countries (which are not the usual suspects), but I can't disclose more in a public ticket.

@Ladsgroup can we get up to date numbers and also put this in relation to the other formats we expose?

Then @Lydia_Pintscher can make a current and informed decision on the future of the output.

Reasons to drop:

  • It adds a non-negligible amount of code that needs maintaining
  • It adds another stable interface to our stable interfaces, which means we need to communicate and follow the procedure for each and every change.
  • It adds two urls for cache busting (as explained above)
  • It doesn't give much benefit: its usage is small (numbers below), and it can only be unserialized in PHP, not in any other language (unless with gymnastics).
  • (Might not be a big deal): Serialization and deserialization are security sensitive; we might expose something we shouldn't or receive something which would lead to arbitrary code execution.
    • This is not true here AFAIK, but avoiding serialization and deserialization in the language the server is running is highly encouraged to reduce attack vectors.
Type    Number of hits in September 21
json    7,598,854
rdf     89,861
ttl     11,388,708
php     2,116
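
To put the php figure in proportion, the shares can be worked out directly from the numbers in the table. The sketch below only covers the four formats listed there; the variable names are illustrative.

```python
# Hit counts copied from the table above (only the four listed formats).
hits = {"json": 7_598_854, "rdf": 89_861, "ttl": 11_388_708, "php": 2_116}

total = sum(hits.values())
shares = {fmt: count / total for fmt, count in hits.items()}

# Print each format's share of the listed requests, largest first.
for fmt, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{fmt:<5}{share:.4%}")
```

On these numbers php accounts for roughly 0.01% of the listed requests.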

Thank you!
Alright. Then let's do this.

@Ladsgroup can you say if the hits for the php-serialized output are coming from one/very few individuals making a lot of requests or a lot of individuals making a few requests? Is there any discernible pattern in the requests or the tools they are made with? (I'm asking as this might change the communication a bit.)

It adds two urls for cache busting (as explained above)

No, we only cache a limited set of URLs and the RDF format is not included in those. (The earlier comment was from before the caching story was resolved, we changed plans at some point in there.)

(But to be clear, I also support getting rid of this.)

@Ladsgroup can you say if the hits for the php-serialized output are coming from one/very few individuals making a lot of requests or a lot of individuals making a few requests? Is there any discernible pattern in the requests or the tools they are made with? (I'm asking as this might change the communication a bit.)

It has four major consumers. Three seem to be bots and one is either a bot with a fake UA or a gadget. There are maybe thirty usages in total outside these four, which is negligible.