[Task] Drop support for php-serialized output from Special:EntityData
Open, MediumPublic

Description

Special:EntityData is intended to be a LinkedData interface.
It should support various RDF flavours, and plain JSON.
Other formats (based on MediaWiki API result formats) should be dropped (currently, php-serialized is allowed by default).

This would allow us to use the JSON serialization of the entity directly, and drop the clunky dependency on the API result formatters.

TODO reasons we want to drop it

Notes:

needs to go through the breaking change process

Details

	Subject	Repo	Branch	Lines +/-
	Drop php format from default entityDataFormats	mediawiki/extensions/Wikibase	master	+0 -1


Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T98033 [Task] Drop API-Style wrapper from JSON output of Special:EntityData
		Open		None	T98035 [Task] Drop support for php-serialized output from Special:EntityData

Event Timeline

daniel created this task. May 4 2015, 5:34 PM

daniel raised the priority of this task from to Needs Triage.

daniel updated the task description.

daniel added a project: Wikidata.

daniel added subscribers: Aklapper, daniel.

JanZerebecki triaged this task as Medium priority. May 15 2015, 6:37 PM

JanZerebecki set Security to None.

JanZerebecki moved this task from incoming to ready to go on the Wikidata board.

Change 224773 had a related patch set uploaded (by Addshore):
Drop php format from default entityDataFormats

https://gerrit.wikimedia.org/r/224773

gerritbot added a project: Patch-For-Review. Jul 15 2015, 11:45 AM

What does "other formats" mean? Only php? Or more?

This ticket does not explain why we "should" drop this support? What do we win? It works just fine right now. Even if it's not perfect and things are missing from the serialization, how is that a reason to kill it instead of fixing it? All serializations use arrays as an intermediate format. Just pass this to PHP's serialize() and be done. Why drop it?

Change 224773 abandoned by Addshore:
Drop php format from default entityDataFormats

Reason:
For now

https://gerrit.wikimedia.org/r/224773

Addshore removed a project: Patch-For-Review. Aug 7 2015, 5:21 PM

• Jonas renamed this task from Drop support for php-serialized output from Special:EntityData to [Task] Drop support for php-serialized output from Special:EntityData. Aug 14 2015, 8:31 AM

• Jonas added a project: Technical-Debt.

Should this be moved to discussion?

Lydia_Pintscher moved this task from ready to go to needs discussion or investigation on the Wikidata board. Aug 14 2015, 8:57 AM

Danny_B moved this task from Unsorted to Needs removal on the Technical-Debt board. Jan 23 2016, 1:06 AM

Restricted Application added a subscriber: PokestarFan. Jul 25 2017, 8:14 AM

For reference, PHP is still available: https://test.wikidata.org/wiki/Special:EntityData/L84.php

Krinkle edited projects, added Technical-Debt (Deprecation process); removed Technical-Debt. Jul 13 2018, 11:11 PM

Krinkle moved this task from Untriaged to Not yet on the Technical-Debt (Deprecation process) board. Oct 12 2019, 10:50 PM

In T98035#1460263, @thiemowmde wrote:

What does "other formats" mean? Only php? Or more?

This ticket does not explain why we "should" drop this support? What do we win? It works just fine right now. Even if it's not perfect and things are missing from the serialization, how is that a reason to kill it instead of fixing it? All serializations use arrays as an intermediate format. Just pass this to PHP's serialize() and be done. Why drop it?

Random example: in the context of T128486: [Story] Make Special:EntityData be up to date after an edit, supporting this absurd output format means there are two additional URLs we need to purge on each edit. (Unless Varnish already normalizes .php and ?format=php into one URL, or something, then it’s one URL.)
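As a rough illustration of the two URL shapes mentioned above, here is a minimal Python sketch. The helper name `php_url_variants` and the default base URL are assumptions made for this example (the thread itself only links a test.wikidata.org URL); it says nothing about how purging is actually implemented in Wikibase.

```python
from typing import List

# Assumed base URL, used only for illustration.
BASE = "https://www.wikidata.org/wiki/Special:EntityData"


def php_url_variants(entity_id: str, base: str = BASE) -> List[str]:
    """Return the two URL forms that serve the php format for one entity."""
    return [
        f"{base}/{entity_id}.php",         # file-extension form
        f"{base}/{entity_id}?format=php",  # query-parameter form
    ]


if __name__ == "__main__":
    for url in php_url_variants("Q42"):
        print(url)
```

Every additional output format adds a pair like this per entity, unless a cache layer normalizes the two forms into one.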

In T98035#4420235, @Addshore wrote:

For reference, PHP is still available: https://test.wikidata.org/wiki/Special:EntityData/L84.php


Can we find out how much this format is still used, similar to T220826#5185202?

Oh dear, I just realized how strange my comment from 2015 sounds by now. I totally support dropping this esoteric format! I believe PHP's internal serialization format should be something that's internal to PHP, and not be part of a public API. Please drop it.

My problem with this ticket was – and still is – that it just states how something "should be", without providing any information how this decision was made. Is it even a decision?

Ladsgroup added a project: Wikidata-Campsite. Feb 18 2020, 10:18 PM

In T98035#5893953, @Lucas_Werkmeister_WMDE wrote:

Can we find out how much this format is still used, similar to T220826#5185202?

Apparently it has between 10k and 100k usages per day right now: https://grafana.wikimedia.org/d/000000169/wikidata-api-format-usage?orgId=1&refresh=30m&from=now-30d&to=now

My problem with this ticket was – and still is – that it just states how something "should be", without providing any information how this decision was made. Is it even a decision?

To me it read more like a proposal than a decision, though I hope the decision won’t be controversial.

Apparently it has between 10k and 100k usages per day right now: https://grafana.wikimedia.org/d/000000169/wikidata-api-format-usage?orgId=1&refresh=30m&from=now-30d&to=now

Are you sure this is for Special:EntityData and not for the action API? I’m missing some RDF formats in that graph (though that might just mean that we’ve misconfigured the list of allowed formats – I couldn’t figure out where the data ultimately comes from).

Ah, never mind, I found the Wikidata Special:EntityData dashboard. Somewhere around 5k PHP requests per day, it seems, compared to millions for JSON and Turtle.

Lucas_Werkmeister_WMDE mentioned this in T128486: [Story] Make Special:EntityData be up to date after an edit. Feb 19 2020, 4:01 PM

This is a product decision, so putting in the product column.
From the tech side, removing this will leave us with fewer things to maintain; I'm not really sure if we "support" the php format here at all.
I would guess that most of the PHP calls are perhaps scrapers and things requesting this format by accident.
Migration path would be to use JSON instead.
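For consumers, the switch can be as small as changing the requested URL. The sketch below is a hypothetical helper (`to_json_url` is a name invented for this example, not part of any Wikibase client) that rewrites either php URL form to its JSON equivalent. It only touches the URL; client code that calls PHP's unserialize() on the response would also need to parse JSON instead.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_json_url(url: str) -> str:
    """Rewrite a Special:EntityData URL that asks for php so it asks for json."""
    parts = urlsplit(url)

    # Extension form: .../Q42.php -> .../Q42.json
    path = parts.path
    if path.endswith(".php"):
        path = path[: -len(".php")] + ".json"

    # Query form: ?format=php -> ?format=json (other parameters are kept as-is)
    query = [
        (key, "json" if key == "format" and value == "php" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]

    return urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode(query), parts.fragment)
    )


if __name__ == "__main__":
    print(to_json_url("https://test.wikidata.org/wiki/Special:EntityData/L84.php"))
```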

Lucas_Werkmeister_WMDE mentioned this in T166470: Include links in Wikidata HTTP responses to different entity representations as Link headers. Aug 18 2020, 3:48 PM

In T98035#6092192, @Addshore wrote:

This is a product decision, so putting in the product column.
From the tech side, removing this will leave us with fewer things to maintain; I'm not really sure if we "support" the php format here at all.
I would guess that most of the PHP calls are perhaps scrapers and things requesting this format by accident.
Migration path would be to use JSON instead.

I can look it up in hadoop if that helps PM decision (@Lydia_Pintscher) on this.

Yesterday we had 3400 hits on php endpoints, 2089 were spiders and 1300 were from users (at least they faked user UA which is possible and happens quite often). 1000 of the hits belong to only four countries (which are not usual suspects) but I can't disclose more in a public ticket.

Addshore added a project: [DEPRECATED] wdwb-tech. Sep 21 2021, 8:40 AM

I'll raise this with Lydia in my next 1:1 with her

WMDE-leszek subscribed. Sep 21 2021, 12:45 PM

In T98035#6394020, @Ladsgroup wrote:

Yesterday we had 3400 hits on php endpoints, 2089 were spiders and 1300 were from users (at least they faked user UA which is possible and happens quite often). 1000 of the hits belong to only four countries (which are not usual suspects) but I can't disclose more in a public ticket.

@Ladsgroup can we get up to date numbers and also put this in relation to the other formats we expose?

Then @Lydia_Pintscher can make a current and informed decision on the future of the output.

Addshore updated the task description. Sep 21 2021, 12:48 PM

Reasons to drop:

It adds a non-negligible amount of code that needs maintaining.
It adds one more interface to our set of stable interfaces, so each and every change to it has to be communicated and follow the procedure.
It adds two URLs for cache busting (as explained above).
It doesn't give much benefit: its usage is small (I put numbers below), and it can only be unserialized in PHP, not in any other language (unless with gymnastics).
(Might not be a big deal): Serialization and deserialization are security sensitive; we might expose something we shouldn't, or receive something that would lead to arbitrary code execution.
- This is not true here AFAIK, but avoiding serialization and deserialization in the language the server is running is highly encouraged to reduce attack vectors.
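To illustrate the "gymnastics" point: JSON output can be read with one standard-library call in most languages, whereas the php format needs a hand-written reader. The sketch below is a hypothetical, data-only Python reader for a small subset of PHP's serialize() format (null, bool, int, float, string, array). The names `php_unserialize` and `_read` are invented for this example and are not part of Wikibase or any library. It deliberately rejects object tags such as `O:`, since object handling is the part of deserialization that usually raises security concerns.

```python
from typing import Any, Tuple


def php_unserialize(data: bytes) -> Any:
    """Parse a data-only subset of PHP's serialize() output."""
    value, pos = _read(data, 0)
    if pos != len(data):
        raise ValueError("trailing bytes after value")
    return value


def _read(data: bytes, pos: int) -> Tuple[Any, int]:
    """Read one value starting at pos; return (value, position after it)."""
    tag = data[pos:pos + 1]
    if tag == b"N":                        # N;
        return None, pos + 2
    if tag in (b"b", b"i", b"d"):          # b:1;  i:42;  d:0.5;
        end = data.index(b";", pos)
        raw = data[pos + 2:end].decode("ascii")
        convert = {b"b": lambda s: s == "1", b"i": int, b"d": float}[tag]
        return convert(raw), end + 1
    if tag == b"s":                        # s:<byte length>:"<bytes>";
        colon = data.index(b":", pos + 2)
        length = int(data[pos + 2:colon])
        start = colon + 2                  # skip the colon and opening quote
        text = data[start:start + length].decode("utf-8")
        return text, start + length + 2    # skip closing quote and semicolon
    if tag == b"a":                        # a:<count>:{<key><value>...}
        colon = data.index(b":", pos + 2)
        count = int(data[pos + 2:colon])
        pos = colon + 2                    # skip the colon and opening brace
        result = {}
        for _ in range(count):
            key, pos = _read(data, pos)
            result[key], pos = _read(data, pos)
        return result, pos + 1             # skip closing brace
    # Objects (O:, C:) and references (r:, R:) are refused on purpose.
    raise ValueError(f"unsupported type tag {tag!r}")


if __name__ == "__main__":
    sample = b'a:2:{s:2:"id";s:3:"Q42";s:4:"type";s:4:"item";}'
    print(php_unserialize(sample))
```

Note that string lengths are counted in bytes, not characters, which is why the reader works on `bytes` and decodes only after slicing.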

Type	Number of hits in September 21
json	7,598,854
rdf	89,861
ttl	11,388,708
php	2,116
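To put the table in proportion, the share of each format can be worked out from the four numbers above. This is only arithmetic on the figures quoted in the table; the variable and function names are made up for the example.

```python
# Hit counts copied from the table above.
hits = {"json": 7_598_854, "rdf": 89_861, "ttl": 11_388_708, "php": 2_116}

total = sum(hits.values())


def share_percent(fmt: str) -> float:
    """Share of all counted hits that asked for the given format, in percent."""
    return 100 * hits[fmt] / total


if __name__ == "__main__":
    print(f"total hits: {total:,}")
    for fmt in hits:
        print(f"{fmt}: {share_percent(fmt):.4f}%")
```

With these numbers the total is 19,079,539 hits, of which php accounts for roughly 0.011%.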

Thank you!
Alright. Then let's do this.

@Ladsgroup can you say if the hits for the php-serialized output are coming from one/very few individuals making a lot of requests or a lot of individuals making a few requests? Is there any discernible pattern in the requests or the tools they are made with? (I'm asking as this might change the communication a bit.)

It adds two urls for cache busting (as explained above)

No, we only cache a limited set of URLs and the RDF format is not included in those. (The earlier comment was from before the caching story was resolved, we changed plans at some point in there.)

(But to be clear, I also support getting rid of this.)

In T98035#7373635, @Lydia_Pintscher wrote:

@Ladsgroup can you say if the hits for the php-serialized output are coming from one/very few individuals making a lot of requests or a lot of individuals making a few requests? Is there any discernible pattern in the requests or the tools they are made with? (I'm asking as this might change the communication a bit.)

It has four major consumers: three seem to be bots and one is either a bot with a fake UA or a gadget. There are maybe thirty usages in total outside these four, which is negligible.

Thanks! :)

Lydia_Pintscher updated the task description. Sep 23 2021, 6:03 PM

Lydia_Pintscher added a subscriber: Mohammed_Sadat_WMDE.

Manuel moved this task from Needs Wikidata PM Work to Miscellaneous on the Wikidata-Campsite board. Jul 12 2022, 11:57 AM

Manuel moved this task from Miscellaneous to Tech realm on the Wikidata-Campsite board. Aug 2 2022, 11:57 AM

Manuel removed a project: Wikidata-Campsite. Nov 8 2022, 11:11 AM

Addshore unsubscribed. Jun 27 2023, 12:42 PM