
[Investigate] Intermittent performance issues with wikilabels
Closed, Resolved, Public

Description

Wikilabels sometimes takes a very long time to respond or responds with a 500 when requesting worksets.

For example, this request will get a 500 response sometimes:
https://labels.wmflabs.org/users/6877667/28/?callback=jQuery111308495149720143409_1458849769686&worksets=stats&_=1458849769686

Event Timeline

I want to check the logs, but I'm not on the member list :(

Okay, I've checked the logs. Generating content takes on the order of 1-10 milliseconds for most requests:
(I deleted the IP addresses since I wasn't sure whether they are internal or external.)

Apr  6 12:04:57 wikilabels-01 uwsgi[23132]: [pid: 23137|app: 0|req: 15301/115333] ip.ip.ip.ip () {36 vars in 654 bytes} [Wed Apr  6 12:04:57 2016] GET /gadget/loader.js => generated 702 bytes in 1 msecs (HTTP/1.1 200) 2 headers in 78 bytes (1 switches on core 0)

But some requests take an unimaginably long time to generate. For example, we have this one today:

Apr  6 09:15:43 wikilabels-01 uwsgi[23132]: [pid: 23141|app: 0|req: 9777/110443] ip.ip.ip.ip () {38 vars in 958 bytes} [Wed Apr  6 07:44:41 2016] GET /campaigns/nowiki/27/915/300456/?callback=jQuery11130845116797315624_1459927355554&label=%7B%22damaging%22%3Afalse%2C%22goodfaith%22%3Atrue%2C%22unsure%22%3Afalse%7D&_=1459927355800 => generated 250 bytes in 5462104 msecs (HTTP/1.1 200) 2 headers in 78 bytes (1 switches on core 0)

i.e., it took 546 seconds to generate this. For me it took a fraction of a second (although that might have been Varnish, if we have Varnish at all, or our own cache).
I'm continuing my checks.

It happens at all times (it looks like a random spike across several web requests) and for all types of requests. My guess is uwsgi; let me check.

That's as far as I can tell. Someone else needs to check this.

Maybe we ought to turn up logging for a little while. It'll fill the logs with debug output, but it might help us diagnose the problem.

I'm thinking that the DB is likely the problem here. It could be that we forgot an index, which causes intermittent issues when a query result isn't already cached -- or the DB just has some intermittent timeouts.
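One hedged way to test the missing-index theory, once we know which query is slow, would be to look at its plan; the table and filter below are only placeholders:

-- A sequential scan on a large table here would point at a missing index.
EXPLAIN ANALYZE
SELECT * FROM workset WHERE user_id = 6877667 AND campaign_id = 32;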

Ladsgroup triaged this task as High priority.

Okay, I downloaded the logs for the last week. The problem stopped after April 7th, or at least got under control (the number of timed-out requests dropped to a negligible level), but before that it was happening on a very frequent basis.

With some help I realized that pgsql (the DB: labsdb1004.eqiad.wmnet) was highly overloaded from mid-March until April 6th. I also realized that requests that don't involve the DB (like loading the gadget, etc.) never had this issue.

So that's probably a Labs DB issue. I think @chasemp and @Andrew would be the best people to take a look at this.

Please don't add me to this. I am not the Chase you're looking for.

Hello,

On the DB I see these errors repeatedly:

2016-04-06 09:37:37 GMT ERROR:  duplicate key value violates unique constraint "label_pkey"
2016-04-06 09:37:37 GMT DETAIL:  Key (task_id, user_id)=(304406, 28410464) already exists.
2016-04-06 09:37:37 GMT STATEMENT:  
	            INSERT INTO label VALUES
	              (304406, 28410464, NOW(), '{"information-removed": false, "semantic_labels": ["elaboration"], "information-added": true, "information-modified": false, "notes": "not sure this counts as \"substantive\", but it''s not an update (new information)"}')
	            RETURNING *

Is it possible that these are causing (some of) your problems?

On the DB utilization front, the DB was indeed more loaded during Week 14 (this is clearer at https://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&h=labsdb1004.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=MySQL+eqiad where the IOwait is obvious), but I cannot currently say why, as we haven't enabled that kind of logging. It is possible to do that and find the queries that are expensive, but right now I fail to see how that would help. It's a shared database, so users might indeed be running expensive queries on it. That's a pattern to be expected, I think.
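For what it's worth, a rough sketch of what enabling that kind of logging could look like, assuming superuser access and that the database is actually named u_wikilabels (both assumptions):

-- Log every statement slower than one second (value is in milliseconds).
-- Alternatively, set it instance-wide in postgresql.conf.
ALTER DATABASE u_wikilabels SET log_min_duration_statement = 1000;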

Hey, thank you for taking a look at this.

> Hello,
>
> On the DB I see these errors repeatedly:
>
> 2016-04-06 09:37:37 GMT ERROR:  duplicate key value violates unique constraint "label_pkey"
> 2016-04-06 09:37:37 GMT DETAIL:  Key (task_id, user_id)=(304406, 28410464) already exists.
> 2016-04-06 09:37:37 GMT STATEMENT:
>             INSERT INTO label VALUES
>               (304406, 28410464, NOW(), '{"information-removed": false, "semantic_labels": ["elaboration"], "information-added": true, "information-modified": false, "notes": "not sure this counts as \"substantive\", but it''s not an update (new information)"}')
>             RETURNING *
>
> Is it possible that these are causing (some of) your problems?

I think this is related to https://github.com/wiki-ai/wikilabels/blob/master/wikilabels/database/labels.py#L11. I've fixed it, but I highly doubt that it is the underlying issue. Do these errors come with long waiting times?

> On the DB utilization front, the DB was indeed more loaded during Week 14 (this is clearer at https://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&h=labsdb1004.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=MySQL+eqiad where the IOwait is obvious), but I cannot currently say why, as we haven't enabled that kind of logging. It is possible to do that and find the queries that are expensive, but right now I fail to see how that would help. It's a shared database, so users might indeed be running expensive queries on it. That's a pattern to be expected, I think.

I agree that on a shared host this is to be expected, but these queries are making the database unstable (= unusable). IMHO we should either:

1. scale up this database so it is big enough not to cause wait times on the order of hours,
2. find the most expensive users and cut their resources or move them to a dedicated DB, or
3. make a dedicated DB for wikilabels.

Thank you again.

> Hey, thank you for taking a look at this.
>
>> Hello,
>>
>> On the DB I see these errors repeatedly:
>>
>> 2016-04-06 09:37:37 GMT ERROR:  duplicate key value violates unique constraint "label_pkey"
>> 2016-04-06 09:37:37 GMT DETAIL:  Key (task_id, user_id)=(304406, 28410464) already exists.
>> 2016-04-06 09:37:37 GMT STATEMENT:
>>             INSERT INTO label VALUES
>>               (304406, 28410464, NOW(), '{"information-removed": false, "semantic_labels": ["elaboration"], "information-added": true, "information-modified": false, "notes": "not sure this counts as \"substantive\", but it''s not an update (new information)"}')
>>             RETURNING *
>>
>> Is it possible that these are causing (some of) your problems?
>
> I think this is related to https://github.com/wiki-ai/wikilabels/blob/master/wikilabels/database/labels.py#L11. I've fixed it, but I highly doubt that it is the underlying issue. Do these errors come with long waiting times?

Ah, the UPSERT feature missing from Postgres (it's being added in 9.5). Getting an implementation of that correct is notoriously difficult (see https://wiki.postgresql.org/wiki/UPSERT#PostgreSQL_.28today.29 and, even more so, http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/). The one in wikilabels, for example, suffers from race conditions in concurrent environments.
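For reference, the 9.5 feature collapses the whole dance into a single statement; a minimal sketch, with the column list and the payload below being assumptions based on the queries pasted in this task:

-- PostgreSQL 9.5+ native upsert (not available on this server yet).
INSERT INTO label (task_id, user_id, timestamp, data)
VALUES (304406, 28410464, NOW(), '{"damaging": false, "goodfaith": true, "unsure": false}')
ON CONFLICT (task_id, user_id)
DO UPDATE SET data = EXCLUDED.data, timestamp = NOW()
RETURNING *;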

As far as the log goes, the error does not record the time the operation took, but it is notable that it was happening much more often during the period the DB was more heavily used. So it definitely has something to do with contention, which supports the race-condition theory.

>> On the DB utilization front, the DB was indeed more loaded during Week 14 (this is clearer at https://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&h=labsdb1004.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=MySQL+eqiad where the IOwait is obvious), but I cannot currently say why, as we haven't enabled that kind of logging. It is possible to do that and find the queries that are expensive, but right now I fail to see how that would help. It's a shared database, so users might indeed be running expensive queries on it. That's a pattern to be expected, I think.

> I agree that on a shared host this is to be expected, but these queries are making the database unstable (= unusable). IMHO we should either:
>
> 1. scale up this database so it is big enough not to cause wait times on the order of hours,
> 2. find the most expensive users and cut their resources or move them to a dedicated DB, or
> 3. make a dedicated DB for wikilabels.

In the example above, 546 secs is mentioned. How did we get from < 10 minutes to hours? Not that 10 minutes for a query to finish is acceptable for a service, but labsdb1004 was never dedicated to a service and no guarantees are made as far as performance goes.

As far as the solutions go,

  1. and (even more so 3.) need to be justified in terms of hardware/budget, which means needs/requirements have to be articulated by a team to justify scaling that infrastructure up. 1. might be doable by getting some SSDs (given the box is in IOwait), but that's probably the limit.
  2. is a stopgap, and depending on the user it might end up hindering them instead. For example, a user doing OLAP on that box would be hindered by being asked not to issue expensive queries. So while it might help sometimes, it is not a real solution for this problem, mostly because this box is not dedicated to a specific role.

Btw, the only users I currently see connected are u_wikilabels and u_wikilabels_test. Next time this problem surfaces, ping me in time so maybe we can find what caused it.
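A hedged sketch of what could be captured the moment it happens again, assuming PostgreSQL 9.2 or later (older releases name the columns procpid and current_query):

-- Show the currently running statements, longest-running first.
SELECT pid, usename, state,
       now() - query_start AS runtime,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;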

In other news, I am happy that people are using this infrastructure and that such problems surface. It means there is a use for it.

I think there are several upserts in wikilabels right now. I'll take a look at the articles and try to implement a robust/efficient method. Thank you!

In the above comments I mentioned a task that took 5462104 msecs (= 1.51 hours). I have longer tasks too; fortunately, by implementing harakiri, we no longer have any task blocking a worker for that long. It times out after 30 seconds.

BTW, upgrading to 9.5 would be amazing, especially since I could move to using "CREATE INDEX IF NOT EXISTS" instead of writing ten lines of SQL.
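For comparison, a rough sketch of the pre-9.5 guard versus the 9.5 one-liner; the index name and columns here are only illustrative:

-- Pre-9.5: guard the CREATE INDEX by hand.
DO $$
BEGIN
    IF NOT EXISTS (
        SELECT 1
        FROM pg_indexes
        WHERE schemaname = 'public'
          AND indexname = 'workset_user_campaign'
    ) THEN
        CREATE INDEX workset_user_campaign ON workset (user_id, campaign_id);
    END IF;
END
$$;

-- 9.5+: a single statement.
CREATE INDEX IF NOT EXISTS workset_user_campaign ON workset (user_id, campaign_id);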

> I think there are several upserts in wikilabels right now. I'll take a look at the articles and try to implement a robust/efficient method. Thank you!

You are welcome.

> In the above comments I mentioned a task that took 5462104 msecs (= 1.51 hours). I have longer tasks too; fortunately, by implementing harakiri, we no longer have any task blocking a worker for that long. It times out after 30 seconds.

Ah, I missed the 7th digit on that one. The comment that clearly stated 546 seconds right below that line did not help either :(

That sounds like way too much. I am thinking that even a hardware upgrade would not solve that issue; there must be something else. Ping me ASAP next time it happens so we can find out what is going on right then.

Do we know, btw, which query took so long to run? I don't see it pasted in this task.

> BTW, upgrading to 9.5 would be amazing, especially since I could move to using "CREATE INDEX IF NOT EXISTS" instead of writing ten lines of SQL.

That is unfortunately not currently possible. 9.5 is present in Debian stretch, which is not even close to being released; the ETA is at least a year.

Based on the request URL, I can say it was an upsert:

The insert query:

INSERT INTO label VALUES
  (300456, %(user_id)s, NOW(), '{"damaging":false,"goodfaith":true,"unsure":false}')
RETURNING *

The update query:

UPDATE label
SET
    data = '{"damaging":false,"goodfaith":true,"unsure":false}',
    timestamp = NOW()
WHERE
    task_id = 300456 AND
    user_id = %(user_id)s
RETURNING *

I can't get the user_id easily, but the process is that the worker tries the insert and, when it fails, tries the update (source code). I made the PR I mentioned to make it better, but based on your articles, I think I should implement something else.

Based on the discussion above and the rate of error logging during the period of bad database performance, I am starting to think that the insert-then-update-on-failure implementation of the upsert is the reason for the problem.

My current theory is that in a highly concurrent environment (which, for some reason, the wikilabels application was during that period), that upsert implementation performs badly. The reason is the large number of transactions that have to be rolled back because of the failed inserts. Rolling back a transaction is costly in both CPU and IO.

I would suggest an implementation that first tries to update and, in case no rows are updated because the WHERE clause matches 0 rows, tries to insert. Then, in case there is an integrity error, it should try to update once more, since now it knows that somebody else (another worker process?) was faster and inserted first. That way, in the best case you get an update, in many cases an update + an insert, and only in the worst case an update + insert + update. That actually gets the job done faster than the select/insert/update cycle in the above PR, which even in the best case has a race condition and is twice as expensive.
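A minimal sketch of that update-first pattern, written as a PL/pgSQL function (the same retry loop could equally live in the application code); the table and column names follow the queries pasted above, while the function name and the text type for data are assumptions:

CREATE FUNCTION upsert_label(p_task_id integer, p_user_id integer, p_data text)
RETURNS void AS
$$
BEGIN
    LOOP
        -- Common case: the row already exists and the update succeeds.
        UPDATE label
        SET data = p_data, timestamp = NOW()
        WHERE task_id = p_task_id AND user_id = p_user_id;
        IF FOUND THEN
            RETURN;
        END IF;
        -- No row yet: try to insert it. Another worker may insert first.
        BEGIN
            INSERT INTO label (task_id, user_id, timestamp, data)
            VALUES (p_task_id, p_user_id, NOW(), p_data);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- Lost the race; loop back and retry the update.
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;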

In T130274, @Ladsgroup said:

> I encountered T130872 [performance issues] again today but it was bearable.

So I'm re-opening this. I think our best bet for getting out of this is either to determine what's going on with the shared Postgres server or to host our own Postgres.

I dug into the logs and extracted the cases where generation took more than 0.1 sec. P2980 is the result. Tell me if that's bearable or whether we should do something else.

I'm closely monitoring this.
Among the cases that took more than one second, we had several today and yesterday. Most of them were strange cases of just getting the stats of the assigned worksets for the current campaign:

ladsgroup@wikilabels-01:/var/log$ grep "[0-9][0-9][0-9][0-9] msec" syslog
May  3 17:38:31 wikilabels-01 uwsgi[16514]: [pid: 28348|app: 0|req: 11104/123540] 10.68.21.68 () {38 vars in 862 bytes} [Tue May  3 17:38:04 2016] GET /users/6877667/32/?callback=jQuery111307152888143950435_1462297057608&worksets=stats&_=1462297057609 => generated 291 bytes in 27025 msecs (HTTP/1.1 500) 2 headers in 84 bytes (1 switches on core 0)
May  3 17:38:37 wikilabels-01 uwsgi[16514]: [pid: 28336|app: 0|req: 10240/123546] 10.68.21.68 () {38 vars in 832 bytes} [Tue May  3 17:38:31 2016] GET /campaigns/enwiki/?callback=jQuery111302985390569476789_1462297085068&_=1462297085070 => generated 933 bytes in 5905 msecs (HTTP/1.1 200) 2 headers in 78 bytes (1 switches on core 0)
ladsgroup@wikilabels-01:/var/log$ grep "[0-9][0-9][0-9][0-9] msec" syslog.1
May  2 19:02:22 wikilabels-01 uwsgi[16514]: [pid: 24203|app: 0|req: 5948/92999] 10.68.21.68 () {36 vars in 790 bytes} [Mon May  2 19:02:08 2016] GET /users/43607616/32/?callback=jQuery111305071653840132058_1462215727833&worksets=stats&_=1462215727834 => generated 8312 bytes in 13931 msecs (HTTP/1.1 200) 2 headers in 79 bytes (1 switches on core 0)
May  2 19:02:45 wikilabels-01 uwsgi[16514]: [pid: 24204|app: 0|req: 9582/93019] 10.68.21.68 () {36 vars in 790 bytes} [Mon May  2 19:02:19 2016] GET /users/43607616/32/?callback=jQuery111303417759307194501_1462215738605&worksets=stats&_=1462215738606 => generated 8312 bytes in 26097 msecs (HTTP/1.1 200) 2 headers in 79 bytes (1 switches on core 0)
May  2 19:02:50 wikilabels-01 uwsgi[16514]: [pid: 24201|app: 0|req: 7043/93030] 10.68.21.68 () {36 vars in 790 bytes} [Mon May  2 19:02:22 2016] GET /users/43607616/31/?callback=jQuery111308521612603217363_1462215741831&worksets=stats&_=1462215741832 => generated 291 bytes in 27961 msecs (HTTP/1.1 500) 2 headers in 84 bytes (1 switches on core 0)
May  2 19:02:54 wikilabels-01 uwsgi[16514]: [pid: 16526|app: 0|req: 13900/93034] 10.68.21.68 () {36 vars in 790 bytes} [Mon May  2 19:02:50 2016] GET /users/43607616/4/?callback=jQuery1113006808375101536512_1462215769745&worksets=stats&_=1462215769748 => generated 389 bytes in 4038 msecs (HTTP/1.1 200) 2 headers in 78 bytes (1 switches on core 0)
May  2 19:16:09 wikilabels-01 uwsgi[16514]: [pid: 28349|app: 0|req: 7176/93551] 10.68.21.68 () {36 vars in 1244 bytes} [Mon May  2 19:16:08 2016] GET /campaigns/enwiki/32/1134/310352/?callback=jQuery1113006808375101536512_1462215769745&label=%7B%22semantic_labels%22%3A%5B%22wikify%22%2C%22verifiability%22%2C%22elaboration%22%5D%2C%22information-added%22%3Afalse%2C%22information-modified%22%3Afalse%2C%22information-removed%22%3Afalse%2C%22notes%22%3A%22%22%7D&_=1462215770037 => generated 409 bytes in 1094 msecs (HTTP/1.1 200) 2 headers in 78 bytes (1 switches on core 0)

We had several cases that took about 27 seconds (why 27? because anything over 30 gets a harakiri), and we had 5 harakiri cases today and yesterday.

@akosiaris, it would be great if you could check this and/or send me the logs so I can take a look at them.

Here's the query plan for an example query (the first workset stats query to fail in @Ladsgroup's paste above):

Aggregate  (cost=452.43..452.44 rows=1 width=8)
  ->  Nested Loop Left Join  (cost=0.86..452.10 rows=44 width=8)
        Join Filter: (label.user_id = workset.user_id)
        ->  Nested Loop  (cost=0.57..93.91 rows=44 width=8)
              ->  Index Only Scan using workset_task_pkey on workset_task  (cost=0.29..85.06 rows=44 width=8)
                    Index Cond: (workset_id = 32)
              ->  Materialize  (cost=0.28..8.30 rows=1 width=8)
                    ->  Index Scan using workset_pkey on workset  (cost=0.28..8.29 rows=1 width=8)
                          Index Cond: (id = 32)
        ->  Index Only Scan using label_pkey on label  (cost=0.29..8.13 rows=1 width=8)
              Index Cond: (task_id = workset_task.task_id)

Looks like we can improve the basic scan for relevant worksets with an index on user_id and campaign_id on workset.

u_wikilabels_test=> explain SELECT
u_wikilabels_test->     id, user_id,
u_wikilabels_test->     campaign_id,
u_wikilabels_test->     EXTRACT(EPOCH FROM created) AS created,
u_wikilabels_test->     EXTRACT(EPOCH FROM expires) AS expires
u_wikilabels_test-> FROM workset
u_wikilabels_test-> WHERE workset.user_id = 6877667 AND
u_wikilabels_test->       workset.campaign_id = 32;
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Bitmap Heap Scan on workset  (cost=4.21..14.37 rows=1 width=28)
   Recheck Cond: (user_id = 6877667)
   Filter: (campaign_id = 32)
   ->  Bitmap Index Scan on workset_user  (cost=0.00..4.21 rows=7 width=0)
         Index Cond: (user_id = 6877667)
(5 rows)

u_wikilabels_test=> CREATE INDEX workset_user_campaign (user_id, campaign_id);
ERROR:  syntax error at or near "("
LINE 1: CREATE INDEX workset_user_campaign (user_id, campaign_id);
                                           ^
u_wikilabels_test=> CREATE INDEX workset_user_campaign ON workset (user_id, campaign_id);
CREATE INDEX
u_wikilabels_test=> explain SELECT
u_wikilabels_test->     id, user_id,
u_wikilabels_test->     campaign_id,
u_wikilabels_test->     EXTRACT(EPOCH FROM created) AS created,
u_wikilabels_test->     EXTRACT(EPOCH FROM expires) AS expires
u_wikilabels_test-> FROM workset
u_wikilabels_test-> WHERE workset.user_id = 6877667 AND
u_wikilabels_test->       workset.campaign_id = 32;
                       QUERY PLAN                       
--------------------------------------------------------
 Seq Scan on workset  (cost=0.00..1.38 rows=1 width=28)
   Filter: ((user_id = 6877667) AND (campaign_id = 32))

I've been monitoring the time it takes to generate responses over the past few days. It works perfectly fine.