
Stop shipping open graph image tags to authenticated users
Closed, Declined · Public · Feature

Description

Feature summary:
Check whether the user is authenticated before shipping open graph image tags, as they are only meant for social media crawlers rather than humans.

Benefits:
The bytes shipped add up, as up to three different image sizes can be used for the open graph tags, leading to slower load times than necessary as well as extra CO2 emissions.

Event Timeline

Change #1172673 had a related patch set uploaded (by R4356thwiki; author: R4356thwiki):

[mediawiki/extensions/PageImages@master] Stop shipping open graph image tags to authenticated users

https://gerrit.wikimedia.org/r/1172673

I'm not sure I understand. What if someone runs a tool or bot that tries to find these tags while being logged in? What do we gain by adding this complexity? The environmental impact is probably close to zero when the change only affects logged-in users.

What if someone runs a tool or bot that tries to find these tags while being logged in?

Is anyone actually doing that? There is an API module that PageImages provides which could be used instead. They can also simply log out.
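
For example, a tool could fetch the same image data through the action API instead of scraping meta tags. A minimal sketch in Python, assuming the standard prop=pageimages parameters (adjust for the wiki in question):

# Minimal sketch: fetch a page's lead image via the PageImages API prop
# instead of scraping og:image meta tags from the HTML.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"  # any wiki with PageImages installed

def fetch_page_image(title: str) -> dict:
    """Return PageImages data (thumbnail/original URLs) for a page title."""
    params = {
        "action": "query",
        "prop": "pageimages",
        "titles": title,
        "piprop": "thumbnail|original",
        "pithumbsize": 640,
        "format": "json",
        "formatversion": 2,
    }
    response = requests.get(API_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.json()["query"]["pages"][0]  # formatversion=2 returns a list

print(fetch_page_image("Tom Hanks"))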

RE benefit:
This is present on enwiki's homepage right now:

<meta property="og:image" content="https://upload.wikimedia.org/wikipedia/commons/0/0e/Lesley_James_McNair_%28US_Army_General%29.jpg">
<meta property="og:image:width" content="1200">
<meta property="og:image:height" content="1508">
<meta property="og:image" content="https://upload.wikimedia.org/wikipedia/commons/0/0e/Lesley_James_McNair_%28US_Army_General%29.jpg">
<meta property="og:image:width" content="800">
<meta property="og:image:height" content="1006">
<meta property="og:image:width" content="640">
<meta property="og:image:height" content="804">

Multiplying by the number of visits the page had over the last 30 days, this would have amounted to more than 93 GB of data shipped if we didn't gzip it (assuming, slightly incorrectly, that the tags would have taken up approximately the same number of bytes on other days as well), which would have translated to anywhere between 25 and 30 kg CO2e depending on where the visits were from. And that's just one page on one wiki. Of course, you could argue that compression helps here and that not all pages have open graph image tags; both are true, but it certainly adds up across the Wikimedia wikis. It also helps pages load just a bit faster; that may not feel significant to most devs here, but not everyone is that fortunate. I would love to stop shipping these for anons as well, but that would defeat the whole purpose of the feature, so in lieu of that, my hope is that this makes logged-in contributors' lives a tiny bit better. Thanks.
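
For reference, the shape of that back-of-the-envelope calculation is roughly the following; the per-view tag size, the 30-day view count, and the carbon intensity per GB are all assumptions picked to illustrate the arithmetic, not measured values:

# Rough shape of the estimate above; all three constants are assumptions.
OG_TAG_BYTES = 680            # approx. uncompressed size of the og:image meta tags
MONTHLY_VIEWS = 137_000_000   # assumed 30-day view count for the page
KG_CO2E_PER_GB = 0.3          # assumed network carbon intensity; published figures vary widely

total_gb = OG_TAG_BYTES * MONTHLY_VIEWS / 1e9
print(f"~{total_gb:.0f} GB of uncompressed tag bytes over 30 days")          # ~93 GB
print(f"~{total_gb * KG_CO2E_PER_GB:.0f} kg CO2e at the assumed intensity")  # ~28 kg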

[…] number of visits the page had over the last 30 days, this would have amounted to more than 93 GB of data […]

How is this an argument for the patch? The numbers you are bringing up are for anonymous users. That is unaffected by the patch.

[…] if we didn't gzip it […]

What do you mean by "if we didn't"? Is there a problem with how we gzip the traffic?

I still don't see what you are trying to do here, and how this is a Performance Issue. Maybe our time is better spent helping with T295521 instead?

How is this an argument for the patch? The numbers you are bringing up are for anonymous users. That is unaffected by the patch.

I know I don't have to tell you this, but the numbers also include authenticated users. In a perfect world, we would be able to serve open graph tags to crawlers only, or they would all know which API endpoints to use when looking for that data. We can't do that, so the next best thing seems to be serving these tags only to anonymous users, because crawlers shouldn't be logging in. Of course, we could also sniff User-Agent strings to decide when to add open graph metadata, but that would be way too complex and certainly not foolproof.

[…] if we didn't gzip it […]

What do you mean by "if we didn't"? Is there a problem with how we gzip the traffic?

Of course not. But I cannot think of any way to flawlessly calculate how many bytes that part alone ends up taking during transfer with gzip.

Maybe our time is better spent helping with T295521 instead?

I agree that would likely result in higher savings overall but that should not stop this from being implemented, especially since it's so simple.

[…] how many bytes that part alone ends up taking during transfer with gzip.

Removing all og:image tags saves about 90 bytes (gzipped) for the https://en.wikipedia.org/wiki/Tom_Hanks page.
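
One way to reproduce this kind of measurement (not necessarily exactly how the figure above was obtained) is to compare the gzipped HTML size with and without the og:image meta tags:

# Sketch: gzip the page HTML with and without the og:image-related meta tags
# and compare. The regex assumes the attribute layout quoted earlier in this task.
import gzip
import re
import requests

html = requests.get("https://en.wikipedia.org/wiki/Tom_Hanks", timeout=10).text
stripped = re.sub(r'<meta property="og:image[^"]*"[^>]*>\n?', "", html)

with_tags = len(gzip.compress(html.encode("utf-8")))
without_tags = len(gzip.compress(stripped.encode("utf-8")))
print(f"gzipped difference: {with_tags - without_tags} bytes")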

[…] the numbers also include authenticated users.

On average the Tom Hanks page is visited 350,000 times per month. In the same time, 12 edits have been made per month. This means the patch would save a few kilobytes on a page that generates 50 GB of traffic per month (gzipped).
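
Spelled out, with the number of logged-in views as an explicit assumption scaled loosely from the edit count:

# Spelling out the arithmetic above. The logged-in view count is an assumption,
# not a measurement; the other numbers come from the comments in this task.
MONTHLY_VIEWS = 350_000
MONTHLY_TRAFFIC_GB = 50        # gzipped HTML traffic for the page per month
SAVED_BYTES_PER_VIEW = 90      # gzipped og:image bytes, measured above
ASSUMED_LOGGED_IN_VIEWS = 50   # assumption: a small multiple of the monthly edit count

per_view_kb = MONTHLY_TRAFFIC_GB * 1e6 / MONTHLY_VIEWS
saved_kb = ASSUMED_LOGGED_IN_VIEWS * SAVED_BYTES_PER_VIEW / 1000
print(f"~{per_view_kb:.0f} kB of gzipped HTML per view")     # ~143 kB
print(f"~{saved_kb:.1f} kB saved per month by the patch")    # ~4.5 kB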

[…] it's so simple.

My argument is that it's not simple. It adds complexity not only to the code but also to how MediaWiki behaves. How do we explain to logged-in users that they are not allowed to make use of this metadata?

[snipped] In the same time, 12 edits have been made per month. This means the patch would save a few kilobytes on a page that generates 50 GB of traffic per month (gzipped).

The absence of data regarding visits from authenticated users makes this calculation highly unreliable, imo. In reality, most editors likely read articles while logged in.

It adds complexity not only to the code but also to how MediaWiki behaves.

Does that mean you are concerned about the appservers having to do more work?

How do we explain to logged-in users that they are not allowed to make use of this metadata?

But is anyone genuinely doing this? :-/ Because if they are, they may be able to use the API module PageImages provides. Otherwise, they can make the requests without cookies.

The absence of data […] makes this calculation highly unreliable […]

The thing is that it's your job to provide data to demonstrate that your patch is worth the trouble, or find a team that can help you with that. The numbers I can find so far tell me that the patch will have effectively zero effect and should not be merged.

[…] you are concerned about the appservers having to do more work?

When I talk about complexity I'm referring to the devs that have to read, understand, and maintain our codebases long-term.

The thing is that it's your job to provide data to demonstrate that your patch is worth the trouble [snip]

There must be some miscommunication here, because I believe I have done that with my estimate above, but I will do so again. Let's say only a fourth of the visits to the Tom Hanks page were made by authenticated users - that would have meant ~8 GB of traffic could have been saved by this patch.

The numbers I can find so far tell me that the patch will have effectively zero effect and should not be merged.

Most editors don't log out after they are done editing. How can we be sure? Because of the sheer number of users who have complained about getting randomly logged out, or about auto-login not working, due to recent-ish changes to how cookies work in modern browsers.

[....] I'm referring to the devs that have to read, understand, and maintain our codebases long-term.

The patch waiting to be reviewed is very simple and adds little complexity to a piece of code that is itself currently simple enough; the pros outweigh the cons here. I don't expect most WM(F/DE) staffers to understand this from first-hand experience, but for many of us in parts of the world with poor connectivity, having even a few kilobytes shaved off results in improvement that is noticeable all too often. So even if you think the emissions prevented alone do not warrant making this change because of the supposed complexity involved, it is still worth doing because of the performance improvement.

Let's say only a fourth of the visits […] were made by authenticated users […]

Logged-in users usually make up less than 1% of views on such a high-traffic page. What is your source?

[…] even a few kilobytes shaved off results in improvement that is noticeable […]

As said above, we are not talking about kilobytes but merely 90 bytes. On a page where the HTML alone is 145 kB (gzipped). Even the slowest possible 16 kbit/s ISDN data line will transfer that in 45 milliseconds. It's impossible to notice that.
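
For completeness, the transfer-time arithmetic behind that figure:

# 90 gzipped bytes over a 16 kbit/s line.
saved_bits = 90 * 8
line_bits_per_second = 16_000
print(f"{saved_bits / line_bits_per_second * 1000:.0f} ms")  # 45 ms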

Let's say only a fourth of the visits […] were made by authenticated users […]

Logged-in users usually make up less than 1% of views on such a high-traffic page. [...]

I am curious to know what qualifies as 'high-traffic' in this case. Regarding my source, I did say it's an assumption.

[…] even a few kilobytes shaved off results in improvement that is noticeable […]

As said above, we are not talking about kilobytes but merely 90 bytes. On a page where the HTML alone is 145 kB (gzipped). Even the slowest possible 16 kbit/s ISDN data line will transfer that in 45 milliseconds. It's impossible to notice that.

My apologies, I completely misread that. I agree with you here that the impact would be very low indeed. That being said, my point about the change being, and the code staying, simple enough still stands.

The bytes shipped add up, as up to three different image sizes can be used for the open graph tags, leading to slower load times than necessary as well as extra CO2 emissions.

This needs more evidence to back it up. My understanding is that only crawlers will parse and fetch these image URLs. If you are viewing the page as a human, the tags will be parsed but no network requests will be made (which can be confirmed by viewing the network tab on any page), so I am not sure why we would "Stop shipping open graph image tags to authenticated users".

Side note: In general I believe we should be trying to keep the HTML for authenticated users the same as for anonymous users, as in future that would lead to better caching for logged-in users and a better overall experience if we can serve all of the page (or parts of it) from cache.

[…] keep the HTML for authenticated users the same as for anonymous users, as in future that would lead to better caching […]

Thanks. In other words: The suggested patch would make it harder for us to make our caching layers more efficient, resulting in the opposite of what it aims to do.

Change #1172673 abandoned by Thiemo Kreuz (WMDE):

[mediawiki/extensions/PageImages@master] Stop shipping open graph image tags to authenticated users

Reason:

See T400489.

https://gerrit.wikimedia.org/r/1172673

...My understanding is that only crawlers will parse and fetch these image URLs. If you are viewing the page as a human, the tags will be parsed but no network requests will be made (which can be confirmed by viewing the network tab on any page), so I am not sure why we would "Stop shipping open graph image tags to authenticated users".

That's not the point I was trying to make; I was talking about the HTML needed for the meta tags themselves, but it is moot anyway given your statement regarding the WMF's future caching aims.