
Hundreds of concurrent SELECT MASTER_POS_WAIT('db1049-bin.002927', 354219766, 10) without debugging information (no user, no function) on s5 API node
Closed, Resolved, Public

Description

This makes it impossible to see what the root cause of the lag is.

  • Make the timeout smaller
  • Add debug information (the usual /* function user */ comment; see the example below)

They are not the direct cause of the lag, but we lack debugging information.
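For illustration, an annotated query with a shorter timeout would look something like the line below; the function and user names are placeholders, following the same /* function user */ comment style visible later in this task's query samples.

  SELECT /* SomeCaller::waitForReplicaPos SomeUser */ MASTER_POS_WAIT('db1049-bin.002927', 354219766, 1)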

Event Timeline

I think the maxlag parameter for bots made it worse, not better.

It could be related to T135470, but the queries were being locked on a non-API node, so I assume they are only a result of lag, not a cause.

Since this was s5, maybe it was related to https://www.wikidata.org/wiki/User_talk:Termininja#Number_of_edits which caused a bunch of slave lag yesterday?

I think there are several issues with more than one bot, mostly coming from the loss of database capacity mentioned in T135100#2289175. Whatever "bad bot behaviours" existed before were masked by the overcapacity. Now that we are temporarily restricted, they have a more visible effect.

This ticket's scope is only to add the missing information to the queries (coming from the API? from a job?) so that the cause can be identified.

Based on other errors, it could be batches of CategoryMembershipUpdates, but that is based only on the time frame (no conclusive evidence). It could be that they are simply affected at the same time.

I think I found it:

| COMMIT /* DatabaseBase::deadlockLoop BotNinja */                                                     |    0.000 |
| NULL                                                                                                 |    0.000 |
| COMMIT /* DatabaseBase::deadlockLoop BotNinja */                                                     |    0.000 |
| NULL                                                                                                 |    0.000 |
| COMMIT /* DatabaseBase::deadlockLoop BotNinja */                                                     |    0.000 |
| NULL                                                                                                 |    0.000 |
| DELETE /* Wikibase\TermSqlIndex::deleteTermsInternal BotNinja */ FROM `wb_terms` WHERE term_language |    0.000 |
| NULL                                                                                                 |    0.000 |
| COMMIT /* DatabaseBase::deadlockLoop BotNinja */                                                     |    0.000 |
| NULL                                                                                                 |    0.000 |
| NULL                                                                                                 |    0.000 |
| NULL                                                                                                 |    0.000 |
| NULL                                                                                                 |    0.000 |
| COMMIT /* DatabaseBase::deadlockLoop BotNinja */                                                     |    0.000 |
| NULL                                                                                                 |    0.000 |
| NULL                                                                                                 |    0.000 |
| NULL                                                                                                 |    0.000 |
| COMMIT /* DatabaseBase::deadlockLoop BotNinja */                                                     |    0.000 |
| COMMIT /* DatabaseBase::deadlockLoop BotNinja */                                                     |    0.000 |
| NULL                                                                                                 |    0.000 |
| COMMIT /* DatabaseBase::deadlockLoop BotNinja */                                                     |    0.000 |
| NULL                                                                                                 |    0.000 |
| COMMIT /* DatabaseBase::deadlockLoop BotNinja */                                                     |    0.000 |
| COMMIT /* DatabaseBase::deadlockLoop BotNinja */                                                     |    0.000 |
| NULL                                                                                                 |    0.000 |
| NULL                                                                                                 |    0.000 |
| COMMIT /* DatabaseBase::deadlockLoop BotNinja */                                                     |    0.000 |
| NULL                                                                                                 |    0.000 |
| NULL                                                                                                 |    0.000 |
| NULL                                                                                                 |    0.000 |
| NULL                                                                                                 |    0.000 |
| UPDATE /* SiteStatsUpdate::tryDBUpdateInternal BotNinja */  `site_stats` SET ss_total_edits=ss_total |

BotNinja has at the very least 13 concurrent commits happening (maybe hundreds), plus the job queue tasks they generate. That goes against the API policy, and the operator has already been warned; I propose to ban the bot.

This bot was doing 10 edits per second, causing lag, especially on recentchanges on both wikidata and dewiki, and probably 10 requests in parallel. The API policy does not set a concrete limit, but it is very clear about not going so fast that it creates infrastructure issues, especially when there is temporarily reduced capacity on these two wikis.

My recommendations are:

  • Make requests serially, that is, wait for one to finish before starting the next one.
  • Limit the rate to no more than 1 per second, or 60 per minute (and not the 610 that the bot is currently doing); a sketch of such a loop follows below.
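A minimal sketch of what these two recommendations could look like in a bot's edit loop, assuming a simple requests-based Python client (the endpoint and edit payloads are illustrative only, not the bot's actual code):

  import time
  import requests

  API = "https://www.wikidata.org/w/api.php"   # illustrative endpoint
  MIN_INTERVAL = 1.0                           # at most one write per second

  def run_edits(edits):
      """Send edits strictly one at a time: each POST must finish
      before the next one starts, and never faster than MIN_INTERVAL."""
      session = requests.Session()
      for payload in edits:
          started = time.time()
          # The call blocks until the server answers, so edits stay serial.
          session.post(API, data=dict(payload, format="json"))
          # Throttle: keep the overall rate at or below one edit per second.
          remaining = MIN_INTERVAL - (time.time() - started)
          if remaining > 0:
              time.sleep(remaining)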

One year ago there was a similar issue; back then my bot made ~200 edits/sec, but I didn't know anything about bot policies or lag. When I was warned, my bot started to check the median lag every minute and stopped working if the lag reached 300. But the problem continued, so I decreased the maximum lag to 100, and later I made it slow down if the lag is more than 50 and stop if it reaches 60. This is how my bot has worked for the last year.

Yesterday there was again a complaint, so I added maxlag=5 to every POST request. I don't know how much this helped... For me it is not a problem to decrease the speed further, but I need to know by how much and what the criteria are. Just tell me a normal limit for edits per second (1, 2, 3 or ...) or a critical median lag (10, 30, 50, ...) or maybe both. Of course I totally disagree with the speed being less than or equal to 1 edit/sec, because that means it would take ~1 month to edit all taxon items (if needed, for example), which number more than 2 million.
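For background on what maxlag=5 does: the API refuses a request whenever replication lag exceeds the given value and returns an error (code "maxlag") that the client is expected to back off on, so its effect is only as good as the client's handling of that error. A minimal sketch of such handling, assuming the same kind of requests-based client as above (endpoint and payload are illustrative):

  import time
  import requests

  API = "https://www.wikidata.org/w/api.php"   # illustrative endpoint

  def post_with_maxlag(session, payload, maxlag=5, backoff=10):
      """POST one edit with maxlag set; if the server reports that it
      is lagged, wait and retry instead of piling on more requests."""
      data = dict(payload, maxlag=maxlag, format="json")
      while True:
          reply = session.post(API, data=data).json()
          if reply.get("error", {}).get("code") == "maxlag":
              time.sleep(backoff)   # replicas are lagged: back off, retry
              continue
          return reply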

Independently of the policy, which says:

"There is no hard and fast limit on read requests, but we ask that you be considerate and try not to take a site down. Most sysadmins reserve the right to unceremoniously block you if you do endanger the stability of their site."

We received several complaints from other bot operators and had to take action, as you can see here: T135100 (I would suggest starting by apologizing there to the other Wikimedia users and saying how you are going to fix the issues you created).

"because this means to take ~1 month to edit all taxon items (if needed, for example) which are more than 2 millions."

is not a valid excuse; it violates the guideline mentioned above (and, as you said, this is not the first time it has happened, despite several warnings).

MediaWiki will go into read-only mode if the lag is higher than 5-7 seconds, and requests should not be done at all if it is higher than 0. Unless the recommendations I mention here (T135471#2300457) are followed, I will encourage the administrators on wikidata to keep the block on.

The block is not a problem; it is the normal and right action when there is a problem with a bot. But don't forget that this bot is operated by me, and I don't want to harm any wiki, so keeping the block on is of course an unnecessary and very stupid decision. As you can see, every time someone informed me about a problem I took immediate action to resolve it (the same happened yesterday: when I was warned, the advice was to use maxlag, and I did it), but with what you are talking about here I can't help. The guideline you mention above is too abstract for anyone to follow strictly; we need concrete measures and values. Take whatever decision is needed and just inform me, and I'll take the needed actions on my side afterwards, as always. I'm here to help, not to harm! Just don't forget to clarify these changes in https://www.wikidata.org/wiki/Wikidata:Bots#Bot_accounts for all bots globally, not only for mine.

I need to know by how much and what the criteria are. Just tell me a normal limit for edits per second (1, 2, 3 or ...) or a critical median lag (10, 30, 50, ...) or maybe both.

For Commons, see https://commons.wikimedia.org/wiki/Commons:Bots#Bot_speed

@Termininja: are you making edits in parallel or in series? I find it very hard to believe you'd be able to reach those speeds without using parallel requests.

Thanks Aklapper, I'll use it if I decide to do something on Commons. And yes Legoktm, I never said that my bot uses synchronous requests; why do you have to believe that?

And yes Legoktm, I never said that my bot uses synchronous requests; why do you have to believe that?

https://www.mediawiki.org/wiki/API:Etiquette#Request_limit says to make your requests in series, not parallel.

Yes, I learned about this yesterday for the first time. Nobody informed me about it before on Wikidata, where my bot mainly works. Also, nobody taught me how to write my bot code; I learned as I went. When I started a year and a half ago my bot used a web browser to make edits, and it took ~15 seconds to add a link to bgwiki for an item (for example). I was reading only Wikidata:Bots, where nothing about parallel or serial requests was mentioned. Even if it had been mentioned, I didn't know anything about asynchronous requests, so I could not have understood it. But later I improved my bot code, it started working fast, and I was happy... After the first warning about the speed I changed the code, the other people were happy, and so was I... Since then I have improved my bot code non-stop, which means I improve everything else but do not increase the speed! And as I already mentioned above, every time somebody tells me that my bot is fast, I decrease the speed further (see above: 300 → 100 → 60 sec as the limit for median lag). When I was warned the last time, 4 days ago by User:Hoo man, he only suggested using the maxlag parameter, so I did it. And I thought everyone was happy after that, but no, I didn't know anything about the problems on other wikis, and how would I have known?!

Termininja, let's try to fix this. Can you run edits exclusively in a serial way, as suggested, even if that means they will be slower?

Of course, this will solve the problems; in the next few days I'll refactor my bot code. I plan to make it do all POST requests serially, but to continue using parallel GET requests. In any case, in the future there will be a minimum 1 min interval between my bot's contributions.

Termininja, you do not need to write only once per minute; something like 1 POST/write per second is unlikely to cause issues. Just wait for one edit to finish before starting the next one (DO NOT SEND MULTIPLE REQUESTS IN PARALLEL). I will tell the admins on wikidata to unblock you, with that condition. Thank you.

jcrespo claimed this task.

Your account has already been unblocked.

Oops, sorry, I meant to write 1 second, and yes, I will make it serial so every previous request has to be finished first.