General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow"
Closed, ResolvedPublic

Description

Hello,

Several arwiki users are complaining that they can't open Wikimedia projects and are seeing this error message (from 10:30 UTC until now, on 26 July 2021):

upstream connect error or disconnect/reset before headers. reset reason: overflow

dddddd.PNG (102×844 px, 3 KB)

Event Timeline

For what it's worth, these are the wikis living on s3 databases: https://noc.wikimedia.org/conf/dblists/s3.dblist

Ummm, arwiki and enwiki are not among "the wikis living on s3 databases", so why were they affected?

Worker exhaustion: they share the same appservers. The appservers (PHP workers) got so busy handling slow s3 requests that they couldn't respond to any requests at all.

See the comment above that, T287362#7235931. Requests to s3 wikis were holding up php-fpm workers which caused free workers to run out (they're not isolated between database shards, but shared globally), causing all sites to go down.
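
The worker-exhaustion mechanism described above can be illustrated with a toy discrete-time model. This is only a sketch under stated assumptions: the pool size, arrival rates, and durations are made-up numbers, and `simulate` is a hypothetical helper, not anything from php-fpm or MediaWiki.

```python
def simulate(workers, ticks, slow_per_tick, slow_duration, fast_per_tick):
    """Toy model of one worker pool shared by every wiki.

    Returns how many fast requests were rejected because no worker was free.
    All parameters are illustrative assumptions, not production values.
    """
    busy_until = []  # tick at which each busy worker becomes free again
    rejected_fast = 0
    for now in range(ticks):
        # Release workers whose request has finished.
        busy_until = [t for t in busy_until if t > now]
        # Slow requests (e.g. expensive queries against one shard) arrive.
        for _ in range(slow_per_tick):
            if len(busy_until) < workers:
                busy_until.append(now + slow_duration)
        # Fast requests for unrelated wikis compete for the same workers.
        for _ in range(fast_per_tick):
            if len(busy_until) < workers:
                busy_until.append(now + 1)
            else:
                rejected_fast += 1
    return rejected_fast


baseline = simulate(workers=10, ticks=50, slow_per_tick=0,
                    slow_duration=20, fast_per_tick=5)
overloaded = simulate(workers=10, ticks=50, slow_per_tick=2,
                      slow_duration=20, fast_per_tick=5)
print(baseline, overloaded)
```

With no slow traffic, nothing is rejected. Once a steady trickle of long-running requests is added, they end up holding every worker, and fast requests for wikis that have nothing to do with the slow shard are turned away as well.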

With a second incident, it is clear that the issue will recur if we simply re-enable DPL on ruwikinews now. It is not acceptable to enable DPL on ruwikinews at the cost of half an hour of downtime per year (or more) on 900+ wikis. From T262391, it is pretty clear that DPL is not, and is unlikely to be, actively developed in the near future. With this in mind, ruwikinews probably does not have a choice but to move their DPL uses to bots, at least in the short run.

Jdforrester-WMF renamed this task from Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" to General site outage caused by ruwikinews abuse of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow".Jul 26 2021, 12:09 PM
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

I've re-titled the task to be more accurate; the outage did indeed expand beyond s3 to be a general outage.

Peachey88 renamed this task from General site outage caused by ruwikinews abuse of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" to General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow".Jul 26 2021, 12:16 PM

Change 708101 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[operations/mediawiki-config@master] Disable DPL on ruwikinews

https://gerrit.wikimedia.org/r/708101

Status update: I disabled DPL on ruwikinews and now things are getting back to normal

It's a very bad solution. Russian Wikinews will only grow, and killing one of the largest projects is just a cosmetic attempt to fix the problem. It's necessary to eliminate the real cause ASAP.

Let me try to explain it. The real cause is DPL and it should be fully undeployed from Wikimedia. The software behind it is not designed with scale in mind and can easily break. Software is not magic. At the end of the day, some solutions just won't work. As I explained before, you can simply replace it with a bot (a compromise with a big benefit and a small drawback). We can't simply allow such a problematic piece of code to stay in production. Lots of other Wikinews wikis don't use DPL and they are just fine. DPL is not a requirement for being a Wikinews.

Change 708101 merged by jenkins-bot:

[operations/mediawiki-config@master] Disable DPL on ruwikinews

https://gerrit.wikimedia.org/r/708101

@Krassotkin: Please see and follow https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette if you would like to be active here. Thanks for your understanding.

Absolutely everyone has known about this problem and the ways to solve it (CirrusSearch) for about a year.

This problem can be solved by 1 (one) programmer in 1 (one) week. Even in lite mode.

I don’t understand why it hasn’t been solved yet.

Therefore, the problem must be urgently resolved. And Russian Wikinews should be launched as soon as possible.

I also have one more question. Why doesn't one of the largest Wikimedia wikis have its own infrastructure?

@Krassotkin: You are very welcome to contribute by providing patches if you have already analyzed the situation and know all the required technical fixes.
Please bring up general technical questions in technical forums instead - again, see https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette . Thanks.

I don't write PHP.

As you can see, I'm doing my job on the client side very well.

Could the Wikimedia Foundation please do a good job on the server side too?

@Krassotkin: In that case, please either refrain from adding unhelpful comments which don't lead to solving problems, or I politely suggest that you may want to spend your time somewhere else than here. Thanks for your understanding.

@Aklapper Please note I don't speak English, but there is nothing offensive in my words in Russian. I apologize if they sounded harsh. Wikimedia is a multicultural environment. Please be patient and stop threatening me. Thanks for your understanding.

For what is worth, these are the wikis living on s3 databases: https://noc.wikimedia.org/conf/dblists/s3.dblist

Ummm, arwiki and enwiki not on "the wikis living on s3 databases", why it was affected?

In fact, zhwiki is not on this list either, but also affected.

FYI: ruwikinews "owners" (Krassotkin & Co.) recently uploaded to ruwikinews millions of "news" items, created 10-15 years ago on other Russian-language news websites whose owners published this content under a free license. I call on Wikimedia administrators to delete this mass upload, because this isn't the sort of content for which Wikinews was created.

Russian Wikinews' administrators will not delete these news, because they are already free, and one of the purposes of this site is to collect and display these news.

In T287362#7236237, @IN wrote:

For what is worth, these are the wikis living on s3 databases: https://noc.wikimedia.org/conf/dblists/s3.dblist

Ummm, arwiki and enwiki not on "the wikis living on s3 databases", why it was affected?

In fact, zhwiki is not on this list either, but also affected.

I think you need to read T287362#7235939

Now Phabricator is also running more slowly; it also takes a very long time to query a tag.

In fact, when I clicked on your link, phabricator took a long time to navigate to your location. Besides, it takes a long time for me to reply to your message.
Phabricator is stalled.

Also when I edited this comment, it took a long time for the server to display the source code.

The server took a long time to preview my message. @AnYiLin Maybe you can see why it is running so slowly?

@IN: Please stop adding totally unrelated comments. This task is not about Phabricator. Thanks.

But it seems that Phabricator is also running very slowly, do I have to open another task for this?

@IN: Obviously yes if you do not get the very same error message as in this task. See https://www.mediawiki.org/wiki/How_to_report_a_bug

Ladsgroup claimed this task.

The site is back online after T287362#7235783, we should create tickets for mid-term and long-term follow ups and continue discussing there, otherwise it'll be just derailing comments all in one place.

The site is back online <...>

But Wikinews is actually offline because news sites can't exist without a news feed. Imagine a social network without a feed. It's the same as if it doesn't exist.

Screenshot 2021-07-26 at 14-29-44 Викиновости, свободный источник новостей.png (418×564 px, 37 KB)

Background information from September 2020, when mass uploads (~100,000) at the Russian Wikinews caused problems: https://ru.wikinews.org/wiki/Викиновости:Форум/Общий/Архив/2020#DynamicPageList

https://ru.wikinews.org is not "actually offline": You can verify by going to https://ru.wikinews.org and not seeing a HTTP 502 error anymore. Exaggerations don't help.
How to proceed with DPL on Wikimedia servers is subject of evaluation and discussion (which might take a while).
This task's scope is about fixing a site outage. This task's scope is not making DPL work as before allegedly mass-adding lots of items. Thanks for your understanding.

ruwikinews now has 13M pages (bigger than enwiktionary). Even if the band-aid solution would have worked last time, we might be past that at this point.

Also like dude, this is the second time this has caused problems. You didn't think to, i dont know, do a bit more of a slow ramp up on the bot speed?

The last NewsBot stream finished its mass upload of News.ru in the morning (UTC+3), a few hours before the incident. The problems started when the bot was working as usual.

Also like dude, this is the second time this has caused problems. You didn't think to, i dont know, do a bit more of a slow ramp up on the bot speed?

Oh, he did. Just not in any reasonable way expected. Two weeks ago:

Based on previous experience, the maximum number of uploads [of bot-created articles] should best be around 15-30 thousand [articles] a day; otherwise the servers start to break down for multiple reasons, and also due to requests. So 100 thousand articles in half a week or a week. … --cаша ([[User:Krassotkin|krassotkin]]) 18:35, 11 July 2021 (UTC)

@IN: Obviously yes if you do not get the very same error message as in this task. See https://www.mediawiki.org/wiki/How_to_report_a_bug

But by the time I replied to your message, this problem had disappeared, and everything was back to normal.

@stjn This is about API. I did not observe such problem this time.

@Aklapper, this is not an exaggeration. I have attached screenshots above. A news project is neither an archive nor a search engine; a news project is its news feeds. They are currently not working. This means that the news project is dead, despite the absence of a server error.

@Krassotkin you can be as loud as you like (until you get banned due to violation of phabricator etiquette or tech CoC) but it doesn't mean ruwikinews will get DPL enabled. It's clear after the second full outage caused by it, ruwikinews won't have DPL enabled again. Feel free to write a bot to replace its functionality.

... Problems had started when bot worked as usual.

Page creation, edits, and many other things can create jobs that are queued to run in the background, which is why there can be a delay between an action being performed and issues arising.
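
To make the delay concrete, here is a minimal sketch of a background queue that fills faster than it drains. The numbers and the `backlog_over_time` helper are hypothetical and only illustrate the general idea; they do not reflect how the MediaWiki job queue is actually sized or scheduled.

```python
def backlog_over_time(enqueue_per_tick, enqueue_ticks, process_per_tick, total_ticks):
    """Return the queue backlog after each tick.

    Jobs are enqueued only during the first `enqueue_ticks` ticks, while the
    runners process at most `process_per_tick` jobs on every tick.
    """
    backlog = 0
    history = []
    for tick in range(total_ticks):
        if tick < enqueue_ticks:
            backlog += enqueue_per_tick  # e.g. a bot creating pages
        backlog = max(0, backlog - process_per_tick)  # background runners
        history.append(backlog)
    return history


history = backlog_over_time(enqueue_per_tick=100, enqueue_ticks=10,
                            process_per_tick=40, total_ticks=30)
# Ticks on which queued work is still pending after enqueueing has stopped.
still_busy = sum(1 for b in history[10:] if b > 0)
print(history[9], still_busy)
```

In this toy run the backlog peaks when the uploads stop and then takes longer to drain than the upload itself lasted, so load caused by the upload can show up well after the bot appears to be idle.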

@Ladsgroup I'm sorry but I'm not trying to be loud. I'm just trying to describe the problem. Please try to hear us.

This is not my Wikinews. Wikinews is a Wikimedia project.

Wikinews is a specific project. It is not like other archival ones.

News feed is the main functionality of a news project.

For example, the Wikimedia Language Committee closes Wikinews editions if there is no recent news.

After you turned off the news feeds on Russian Wikinews there are no recent news there although yesterday we created about 600. Please see the screenshot.

Now you will not be able to find recent news anywhere on the project. The project is not fulfilling its function.

Screenshot 2021-07-27 at 08-44-27 Викиновости, свободный источник новостей.png (324×1 px, 84 KB)

@Aklapper Please note I don't speak English, but there is nothing offensive in my words in Russian. I apologize if they sounded harsh. Wikimedia is a multicultural environment. Please be patient and stop threatening me. Thanks for your understanding.

I'm sorry if you felt threatened, but I don't think Aklapper is doing so. Could you please assume good faith? Thanks.

@Ladsgroup I'm sorry but I'm not trying to be loud. I'm just trying to describe the problem. Please try to hear us.

This is not my Wikinews. Wikinews is a Wikimedia project.

Wikinews is a specific project. It is not like other archival ones.

[Newsfeed] is the main functionality of a news project.

For example, the Wikimedia Language Committee closes Wikinews editions if there is no recent news.

After you turned off the news feeds on Russian Wikinews there [is] no recent news[,] although yesterday we created about 600. Please see the screenshot.

Now you will not be able to find recent news anywhere on the project. The project is not fulfilling its function.

Screenshot 2021-07-27 at 08-44-27 Викиновости, свободный источник новостей.png (324×1 px, 84 KB)

This isn't technically related to the task at hand (it's more of a defense of your behavior on ruwikinews than an attempt to discuss the task).

@Firestar464 The project does not seem to give an error but it is completely inoperative.

In reality the project has been just disabled. This is not a solution to the problem.

I'm at a loss. I don't know how to explain this. Maybe someone can do it better.

Just to clarify, given @Ladsgroup has been the target of severe on-wiki harassment by ruwikinews people, I am the person who made the call first to disable DPL in order to save the rest of the websites, not him. And I firmly stand by it. And that decision was supported by all the responders to this emergency. Amir didn't act alone or in isolation.

It was the right call then, and it's the right call now. If anything, if we (WMF SRE) made a mistake, in hindsight it was agreeing to re-enable the extension there.

I'm going to -2 any attempt to reenable DPL on ruwikinews unless significant technical work is made on it, but I bet the best solution would be to get rid of the usage of an extension that is not designed for such a large (in articles, not traffic) wiki.

Also: harassment will get you nowhere. And I'm not going to dignify harassers with any further response.

@Joe You are the third person to tell me about @Ladsgroup's harassment. But I don't find it in the discussion above.

In any case, I apologize if my or anyone else's words were received in this way.

I don't think we should blame anyone now. We just need to solve the problem together.

Can we stop whining about the DPL and get back to improving Wikimedia?

Dear @Ladsgroup. If Krassotkin or I insulted you in some way, please forgive us. We really had no intention of harassing you. Please do not be offended by us. Good luck with your work.

@Joe You are the third person to tell me about @Ladsgroup's harassment. But I don't find it in the discussion above.

Dear @Ladsgroup . If I or Krassotkin insulted you with something excuse us please. We really have no minds to harass you. Do not be offended by us. Good luck with your work.

Let's be clear, the article titled "The Wikimedia Foundation broke Russian Wikinews again" currently on the homepage of ru.wikinews.org is obviously harassment (thanks to the IP editor who removed the photo) and POV pushing, not "news". I don't know how you reconcile what's on that wiki page with how we are expected to treat people with respect and dignity. There are, or were, multiple developers who felt sympathetic and unhappy with disabling a used feature and would likely have worked to fix or remediate the problem in some way or another, but I doubt much of that will is left if this is how you all plan to treat colleagues in the movement.

There's certainly some irony in linking to https://en.wikipedia.org/wiki/Wikipedia:Don%27t_worry_about_performance, which contains "In most cases, there is little you can do to appreciably speed up or slow down the site's servers" and "if a sysadmin tells you to make a change, listen to them". Unambiguously, creating 100k new articles in a day and 200k over 4 days (per Wikimedia News) is one of those things!!! This was previously communicated at https://ru.wikinews.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BD%D0%BE%D0%B2%D0%BE%D1%81%D1%82%D0%B8:%D0%A4%D0%BE%D1%80%D1%83%D0%BC/%D0%9E%D0%B1%D1%89%D0%B8%D0%B9/%D0%90%D1%80%D1%85%D0%B8%D0%B2/2020#DynamicPageList - if you're going to ignore what sysadmins tell you, then I'm only surprised that User:NewsBots wasn't blocked.

MediaWiki and its extensions scale at different levels. Large wikis will have certain things update slower and get new features later just because they're larger and certain actions are slower. Normally this change happens gradually as the wiki organically grows, but importing hundreds of thousands of pages in a very short period of time is going to force sysadmins to take faster and what appear to be more drastic actions, even though all other large wikis already deal with those limitations (example).

Can we stop whining about the DPL and get back to improving Wikimedia?

We have strayed way too far from "General site outage caused by ruwikinews usage of DPL: 'upstream connect error or disconnect/reset before headers. reset reason: overflow'". Let's drop this.

@Legoktm, this is Phabricator, not Wikinews. This is not the proper place to judge things such as news headlines. If you want, let's talk about this news item on its discussion page. By the way, when a footballer scores, the news about it often contains an image of him. This is the same situation. The disabling of the extension is just a fact and nothing more.

Then, you wrote to us about punishment if we were going to ignore what sysadmins told us. But we received no requests to shut down NewsBots or slow it down. Moreover, we were told last year that engineers would monitor the servers and immediately report technical violations to us. We have received no such messages for a whole year. So we were sure that everything was all right.

@DonSimon Please re-read the previous message.

@Legoktm, this is Phabricator, not Wikinews. This is not the proper place to judge things such as news headlines. If you want, let's talk about this news item on its discussion page. By the way, when a footballer scores, the news about it often contains an image of him. This is the same situation. The disabling of the extension is just a fact and nothing more.

I will call out harassment when I see it to make it clear that the behavior isn't tolerated here.

Then, you wrote to us about punishment if we were going to ignore what sysadmins told us. But we received no requests to shut down NewsBots or slow it down. Moreover, we were told last year that engineers would monitor the servers and immediately report technical violations to us. We have received no such messages for a whole year. So we were sure that everything was all right.

Uh, that's not at all what you were told:

From T262391#6449590:

So it seems like what happened, is that NewsBot imported a lot (~100k) articles over a very short time frame...Being somewhat slow by itself was ok, but having 100k articles edited at roughly the same time which all had the same slow DPL on it, was too much for the servers.

From T262391#6450864:

I would suggest future imports that happen after dpl is reenabled import things a bit more slowly at least at first. Maybe that won't be necessary after the patch, but we should be cautious. It would probably be good for whenever they start again if someone could give a heads up to the folks in #wikimedia-operations irc channel that the import is restarting and reference this task so if something bad happens ops knows what might have happened.

From T262391#6473291:

It does NOT mean any changes were made to optimize DPL itself, or that DPL is now safer to call...if it starts to be an issue again, it may need to be disabled again.

Anyways, we do monitor the servers, and ruwikinews DPL usage became unacceptable and it was turned off again. Like you were told would happen. The general bot edit rate is 12 edits per minute, which should've taken about 5.8 days to create 100k pages, not 1 day. But we're not interested in punitive measures, just keeping the sites up, and if that means disabling DPL on one large wiki that is choosing to ignore sysadmin direction, so be it.
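
As a quick check of the rate arithmetic above, here is a minimal sketch. The function names are illustrative and not taken from any Wikimedia tool; the only inputs are the figures quoted in this thread (12 edits per minute, 100k pages, 1 day).

```python
MINUTES_PER_DAY = 60 * 24


def days_to_create(pages, edits_per_minute):
    """Days needed to create `pages` pages at a constant edit rate."""
    return pages / edits_per_minute / MINUTES_PER_DAY


def rate_needed(pages, days):
    """Edits per minute needed to create `pages` pages in `days` days."""
    return pages / (days * MINUTES_PER_DAY)


days = days_to_create(100_000, 12)
rate = rate_needed(100_000, 1)
print(f"{days:.2f} days at 12 edits/min")   # 5.79 days at 12 edits/min
print(f"{rate:.1f} edits/min for one day")  # 69.4 edits/min for one day
```

At the general bot rate, 100k pages take a little under six days. Creating them in a single day implies a sustained rate of roughly 69 edits per minute, nearly six times the general limit.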

@Legoktm Please pay attention to the words

at least at first

I urge you to close the topic of the accusations. If someone later needs to understand the reasons for the crash, I will be ready to provide the facts. But now we just need to fix the problem. Hope for understanding.

Basically, what this is all about is you accusing everyone else of punishing ruwikinews and everyone else trying to reassure you that they are not and urging you to stop harassing @Ladsgroup. This discussion is useless; the sites are up, and DPL has been disabled on ruwikinews. Case closed, end of the story.

@Krassotkin I think you would also benefit from re-reading the previous messages.

May I remind everyone that this task has been closed, as the specific issue (general site outage) has been handled. The future of the DPL extension is being actively discussed at T287380.

Note that comments on phabricator, like all other behaviour in Wikimedia technical spaces, are governed by the https://www.mediawiki.org/wiki/Code_of_Conduct. Critique and disagreement are welcomed but they must be done in a civil manner.
Contributors' behavior on this task is also governed by https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette.
If a contributor seems to have violated those guidelines, the first step is to contact the person privately or publicly to point out the problem. If apparent violations continue, a report can be made to the Code of Conduct committee: https://www.mediawiki.org/wiki/Code_of_Conduct#Report_a_problem

No further comments should be needed on the current task. Thanks.

Posting in my official capacity as Code of Conduct Committee auxiliary member.

Aklapper changed the edit policy from "All Users" to "Custom Policy".Jul 28 2021, 11:25 AM
This comment was removed by Arbnos.