Page MenuHomePhabricator

Parsoid on beta cluster times out doing API requests
Closed, ResolvedPublic

Description

On the beta cluster, Parsoid runs on the instance deployment-parsoid05.eqiad.wmflabs. I noticed the four parsoid process eating all the CPU.

Looking at the log file at /data/project/parsoid/parsoid.log tgere are a bunch of socket timeout errors against eswiki/wikidatawiki db (at least):

{
"name":"parsoid",
"hostname":"deployment-parsoid05",
"pid":22036,
"level":40,
"logType":"warning/api/esockettimedout",
"wiki":"wikidatawiki",
"title":"Q63760",
"oldId":null,
"reqId":"572eeab6459c44ffbf2e7a439858cfcd",
"msg":"Failed API request, {\"error\":{\"code\":\"ESOCKETTIMEDOUT\"},\"retries-remaining\":1}",
"longMsg":"Failed API request,\n{\"error\":{\"code\":\"ESOCKETTIMEDOUT\"},\"retries-remaining\":1}",
"time":"2015-04-28T15:27:21.925Z",
"v":0
}

https://logstash-beta.wmflabs.org/ gives a bit more informations using the filters:

field : _type == value : parsoid 
logType == query : "warning/api/etimedout"

parsoid-beta-apitimeout.png (180×1 px, 17 KB)

Event Timeline

hashar raised the priority of this task from to Needs Triage.
hashar updated the task description. (Show Details)
hashar added a subscriber: hashar.

A curl query works just fine right now: curl http://wikidata.beta.wmflabs.org/w/api.php

The instance had some heavy CPU spike earlier today and dmesg reports a CPU being stuck:

[Wed Apr 29 10:18:38 2015] BUG: soft lockup - CPU#1 stuck for 23s! [nodejs:27358]
[Wed Apr 29 10:18:38 2015] Modules linked in: nfsv4 nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache 8021q garp stp mrp llc kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw floppy psmouse
[Wed Apr 29 10:18:38 2015] CPU: 1 PID: 27358 Comm: nodejs Not tainted 3.13.0-37-generic #64-Ubuntu
[Wed Apr 29 10:18:38 2015] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[Wed Apr 29 10:18:38 2015] task: ffff8800b6b8c800 ti: ffff880137c76000 task.ti: ffff880137c76000
[Wed Apr 29 10:18:38 2015] RIP: 0010:[<ffffffff81726b9b>]  [<ffffffff81726b9b>] _raw_spin_unlock_irqrestore+0x1b/0x40
[Wed Apr 29 10:18:38 2015] RSP: 0018:ffff88013fd03df8  EFLAGS: 00000282
[Wed Apr 29 10:18:38 2015] RAX: ffff88013fd14cc8 RBX: ffffffff8147a5b1 RCX: 000000000000000f
[Wed Apr 29 10:18:38 2015] RDX: 000000000000000c RSI: 0000000000000282 RDI: 0000000000000282
[Wed Apr 29 10:18:38 2015] RBP: ffff88013fd03e00 R08: 0000000000000000 R09: 0000000000000000
[Wed Apr 29 10:18:38 2015] R10: 0000000000000001 R11: 00000000005f0d72 R12: ffff88013fd03d68
[Wed Apr 29 10:18:38 2015] R13: ffffffff8173059d R14: ffff88013fd03e00 R15: 0000000000000282
[Wed Apr 29 10:18:38 2015] FS:  00007f5848050780(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
[Wed Apr 29 10:18:38 2015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Apr 29 10:18:38 2015] CR2: 00002327155927e0 CR3: 000000009b906000 CR4: 00000000001406e0
[Wed Apr 29 10:18:38 2015] Stack:
[Wed Apr 29 10:18:38 2015]  ffff88013fd14480 ffff88013fd03e70 ffffffff810a0ea4 ffffffffffffe82d
[Wed Apr 29 10:18:38 2015]  ffffffffffffffff 0000000000000000 0003f63a7dfe6d8c ffff88013fd03e70
[Wed Apr 29 10:18:38 2015]  0000000000000282 ffff88013fd14cc8 00000001062a94bd ffff88013fd14480
[Wed Apr 29 10:18:38 2015] Call Trace:
[Wed Apr 29 10:18:38 2015]  <IRQ> 
[Wed Apr 29 10:18:38 2015]  [<ffffffff810a0ea4>] update_blocked_averages+0x2a4/0x720
[Wed Apr 29 10:18:38 2015]  [<ffffffff810a796b>] rebalance_domains+0x4b/0x250
[Wed Apr 29 10:18:38 2015]  [<ffffffff810a7bba>] run_rebalance_domains+0x4a/0x1b0
[Wed Apr 29 10:18:38 2015]  [<ffffffff8106cb8c>] __do_softirq+0xec/0x2c0
[Wed Apr 29 10:18:38 2015]  [<ffffffff8106d0d5>] irq_exit+0x105/0x110
[Wed Apr 29 10:18:38 2015]  [<ffffffff81731c05>] smp_apic_timer_interrupt+0x45/0x60
[Wed Apr 29 10:18:38 2015]  [<ffffffff8173059d>] apic_timer_interrupt+0x6d/0x80
[Wed Apr 29 10:18:38 2015]  <EOI> 
[Wed Apr 29 10:18:38 2015]  [<ffffffff81726b9b>] ? _raw_spin_unlock_irqrestore+0x1b/0x40
[Wed Apr 29 10:18:38 2015]  [<ffffffff8115c1cd>] release_pages+0xdd/0x210
[Wed Apr 29 10:18:38 2015]  [<ffffffff8118b6e5>] free_pages_and_swap_cache+0x95/0xb0
[Wed Apr 29 10:18:38 2015]  [<ffffffff81174bdc>] tlb_flush_mmu.part.56+0x4c/0x90
[Wed Apr 29 10:18:38 2015]  [<ffffffff81176708>] unmap_page_range+0x648/0x7f0
[Wed Apr 29 10:18:38 2015]  [<ffffffff81176931>] unmap_single_vma+0x81/0xf0
[Wed Apr 29 10:18:38 2015]  [<ffffffff811779a9>] unmap_vmas+0x49/0x90
[Wed Apr 29 10:18:38 2015]  [<ffffffff81180b6c>] exit_mmap+0x9c/0x170
[Wed Apr 29 10:18:38 2015]  [<ffffffff8106480c>] mmput+0x5c/0x120
[Wed Apr 29 10:18:38 2015]  [<ffffffff81069c5c>] do_exit+0x26c/0xa50
[Wed Apr 29 10:18:38 2015]  [<ffffffff8115cd0e>] ? lru_cache_add+0xe/0x10
[Wed Apr 29 10:18:38 2015]  [<ffffffff8106a4bf>] do_group_exit+0x3f/0xa0
[Wed Apr 29 10:18:38 2015]  [<ffffffff8107a140>] get_signal_to_deliver+0x1d0/0x6f0
[Wed Apr 29 10:18:38 2015]  [<ffffffff8172ae24>] ? __do_page_fault+0x204/0x560
[Wed Apr 29 10:18:38 2015]  [<ffffffff81013448>] do_signal+0x48/0x960
[Wed Apr 29 10:18:38 2015]  [<ffffffff810a0175>] ? set_next_entity+0x95/0xb0
[Wed Apr 29 10:18:38 2015]  [<ffffffff81012609>] ? __switch_to+0x169/0x4c0
[Wed Apr 29 10:18:38 2015]  [<ffffffff81013dc9>] do_notify_resume+0x69/0xb0
[Wed Apr 29 10:18:38 2015]  [<ffffffff8172739f>] retint_signal+0x48/0x89
[Wed Apr 29 10:18:38 2015] Code: 48 8b 7d 08 e8 27 63 94 ff 5d c3 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 f3 66 66 66 66 90 66 83 07 02 48 89 df 57 9d <66> 66 90 66 90 5b 5d c3 0f 1f 44 00 00 8b 37 f0 66 83 07 02 f6 
[Wed Apr 29 10:19:02 2015] BUG: soft lockup - CPU#1 stuck for 21s! [apt-cache:7539]
[Wed Apr 29 10:19:02 2015] Modules linked in: nfsv4 nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache 8021q garp stp mrp llc kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw floppy psmouse
[Wed Apr 29 10:19:02 2015] CPU: 1 PID: 7539 Comm: apt-cache Not tainted 3.13.0-37-generic #64-Ubuntu
[Wed Apr 29 10:19:02 2015] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[Wed Apr 29 10:19:02 2015] task: ffff88009bb1e000 ti: ffff880083db2000 task.ti: ffff880083db2000
[Wed Apr 29 10:19:02 2015] RIP: 0033:[<00007fcbd07f4205>]  [<00007fcbd07f4205>] 0x7fcbd07f4205
[Wed Apr 29 10:19:02 2015] RSP: 002b:00007ffff2723c20  EFLAGS: 00000293
[Wed Apr 29 10:19:02 2015] RAX: 000000000002c961 RBX: 000000000000e2a6 RCX: 00000000000001d5
[Wed Apr 29 10:19:02 2015] RDX: 00000000000b7772 RSI: 00007ffff2723e20 RDI: 0000000000000030
[Wed Apr 29 10:19:02 2015] RBP: 00007fcbcd72d000 R08: 00007ffff2723e30 R09: 0000000000000002
[Wed Apr 29 10:19:02 2015] R10: fffffffffffff7cc R11: 00007fcbcfe0b870 R12: 00007ffff27241b0
[Wed Apr 29 10:19:02 2015] R13: 00007ffff2725000 R14: 0000000000008071 R15: 000000000000a32a
[Wed Apr 29 10:19:02 2015] FS:  00007fcbd0cf8780(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
[Wed Apr 29 10:19:02 2015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Apr 29 10:19:02 2015] CR2: 00007fcbcd96f000 CR3: 000000009bb4f000 CR4: 00000000001406e0

It seems deployment-parsoid05 runs on virt1007. That machine has spikes of WIO but is otherwise low on CPU usage. virt1007 Ganglia view over a day.

That seems to be an hardware / virtualization issue and its tracked at T97520: Instances CPU being stuck on at least a couple instances

hashar claimed this task.

The instance has been migrated from labvirt1005 which is suffering from memory issue (T97521). It is now on labvirt1001.

puppet run is fast again so I guess the Parsoid service is now running fine.

/data/project/parsoid/parsoid.log shows it is parsing things just fine now.