Page MenuHomePhabricator

Parsoid on beta cluster times out doing API requests
Closed, ResolvedPublic

Description

On the beta cluster, Parsoid runs on the instance deployment-parsoid05.eqiad.wmflabs. I noticed the four parsoid process eating all the CPU.

Looking at the log file at /data/project/parsoid/parsoid.log tgere are a bunch of socket timeout errors against eswiki/wikidatawiki db (at least):

{
"name":"parsoid",
"hostname":"deployment-parsoid05",
"pid":22036,
"level":40,
"logType":"warning/api/esockettimedout",
"wiki":"wikidatawiki",
"title":"Q63760",
"oldId":null,
"reqId":"572eeab6459c44ffbf2e7a439858cfcd",
"msg":"Failed API request, {\"error\":{\"code\":\"ESOCKETTIMEDOUT\"},\"retries-remaining\":1}",
"longMsg":"Failed API request,\n{\"error\":{\"code\":\"ESOCKETTIMEDOUT\"},\"retries-remaining\":1}",
"time":"2015-04-28T15:27:21.925Z",
"v":0
}

https://logstash-beta.wmflabs.org/ gives a bit more informations using the filters:

field : _type == value : parsoid 
logType == query : "warning/api/etimedout"

Event Timeline

hashar created this task.Apr 28 2015, 3:50 PM
hashar raised the priority of this task from to Needs Triage.
hashar updated the task description. (Show Details)
hashar added a subscriber: hashar.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 28 2015, 3:50 PM

A curl query works just fine right now: curl http://wikidata.beta.wmflabs.org/w/api.php

The instance had some heavy CPU spike earlier today and dmesg reports a CPU being stuck:

[Wed Apr 29 10:18:38 2015] BUG: soft lockup - CPU#1 stuck for 23s! [nodejs:27358]
[Wed Apr 29 10:18:38 2015] Modules linked in: nfsv4 nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache 8021q garp stp mrp llc kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw floppy psmouse
[Wed Apr 29 10:18:38 2015] CPU: 1 PID: 27358 Comm: nodejs Not tainted 3.13.0-37-generic #64-Ubuntu
[Wed Apr 29 10:18:38 2015] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[Wed Apr 29 10:18:38 2015] task: ffff8800b6b8c800 ti: ffff880137c76000 task.ti: ffff880137c76000
[Wed Apr 29 10:18:38 2015] RIP: 0010:[<ffffffff81726b9b>]  [<ffffffff81726b9b>] _raw_spin_unlock_irqrestore+0x1b/0x40
[Wed Apr 29 10:18:38 2015] RSP: 0018:ffff88013fd03df8  EFLAGS: 00000282
[Wed Apr 29 10:18:38 2015] RAX: ffff88013fd14cc8 RBX: ffffffff8147a5b1 RCX: 000000000000000f
[Wed Apr 29 10:18:38 2015] RDX: 000000000000000c RSI: 0000000000000282 RDI: 0000000000000282
[Wed Apr 29 10:18:38 2015] RBP: ffff88013fd03e00 R08: 0000000000000000 R09: 0000000000000000
[Wed Apr 29 10:18:38 2015] R10: 0000000000000001 R11: 00000000005f0d72 R12: ffff88013fd03d68
[Wed Apr 29 10:18:38 2015] R13: ffffffff8173059d R14: ffff88013fd03e00 R15: 0000000000000282
[Wed Apr 29 10:18:38 2015] FS:  00007f5848050780(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
[Wed Apr 29 10:18:38 2015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Apr 29 10:18:38 2015] CR2: 00002327155927e0 CR3: 000000009b906000 CR4: 00000000001406e0
[Wed Apr 29 10:18:38 2015] Stack:
[Wed Apr 29 10:18:38 2015]  ffff88013fd14480 ffff88013fd03e70 ffffffff810a0ea4 ffffffffffffe82d
[Wed Apr 29 10:18:38 2015]  ffffffffffffffff 0000000000000000 0003f63a7dfe6d8c ffff88013fd03e70
[Wed Apr 29 10:18:38 2015]  0000000000000282 ffff88013fd14cc8 00000001062a94bd ffff88013fd14480
[Wed Apr 29 10:18:38 2015] Call Trace:
[Wed Apr 29 10:18:38 2015]  <IRQ> 
[Wed Apr 29 10:18:38 2015]  [<ffffffff810a0ea4>] update_blocked_averages+0x2a4/0x720
[Wed Apr 29 10:18:38 2015]  [<ffffffff810a796b>] rebalance_domains+0x4b/0x250
[Wed Apr 29 10:18:38 2015]  [<ffffffff810a7bba>] run_rebalance_domains+0x4a/0x1b0
[Wed Apr 29 10:18:38 2015]  [<ffffffff8106cb8c>] __do_softirq+0xec/0x2c0
[Wed Apr 29 10:18:38 2015]  [<ffffffff8106d0d5>] irq_exit+0x105/0x110
[Wed Apr 29 10:18:38 2015]  [<ffffffff81731c05>] smp_apic_timer_interrupt+0x45/0x60
[Wed Apr 29 10:18:38 2015]  [<ffffffff8173059d>] apic_timer_interrupt+0x6d/0x80
[Wed Apr 29 10:18:38 2015]  <EOI> 
[Wed Apr 29 10:18:38 2015]  [<ffffffff81726b9b>] ? _raw_spin_unlock_irqrestore+0x1b/0x40
[Wed Apr 29 10:18:38 2015]  [<ffffffff8115c1cd>] release_pages+0xdd/0x210
[Wed Apr 29 10:18:38 2015]  [<ffffffff8118b6e5>] free_pages_and_swap_cache+0x95/0xb0
[Wed Apr 29 10:18:38 2015]  [<ffffffff81174bdc>] tlb_flush_mmu.part.56+0x4c/0x90
[Wed Apr 29 10:18:38 2015]  [<ffffffff81176708>] unmap_page_range+0x648/0x7f0
[Wed Apr 29 10:18:38 2015]  [<ffffffff81176931>] unmap_single_vma+0x81/0xf0
[Wed Apr 29 10:18:38 2015]  [<ffffffff811779a9>] unmap_vmas+0x49/0x90
[Wed Apr 29 10:18:38 2015]  [<ffffffff81180b6c>] exit_mmap+0x9c/0x170
[Wed Apr 29 10:18:38 2015]  [<ffffffff8106480c>] mmput+0x5c/0x120
[Wed Apr 29 10:18:38 2015]  [<ffffffff81069c5c>] do_exit+0x26c/0xa50
[Wed Apr 29 10:18:38 2015]  [<ffffffff8115cd0e>] ? lru_cache_add+0xe/0x10
[Wed Apr 29 10:18:38 2015]  [<ffffffff8106a4bf>] do_group_exit+0x3f/0xa0
[Wed Apr 29 10:18:38 2015]  [<ffffffff8107a140>] get_signal_to_deliver+0x1d0/0x6f0
[Wed Apr 29 10:18:38 2015]  [<ffffffff8172ae24>] ? __do_page_fault+0x204/0x560
[Wed Apr 29 10:18:38 2015]  [<ffffffff81013448>] do_signal+0x48/0x960
[Wed Apr 29 10:18:38 2015]  [<ffffffff810a0175>] ? set_next_entity+0x95/0xb0
[Wed Apr 29 10:18:38 2015]  [<ffffffff81012609>] ? __switch_to+0x169/0x4c0
[Wed Apr 29 10:18:38 2015]  [<ffffffff81013dc9>] do_notify_resume+0x69/0xb0
[Wed Apr 29 10:18:38 2015]  [<ffffffff8172739f>] retint_signal+0x48/0x89
[Wed Apr 29 10:18:38 2015] Code: 48 8b 7d 08 e8 27 63 94 ff 5d c3 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 f3 66 66 66 66 90 66 83 07 02 48 89 df 57 9d <66> 66 90 66 90 5b 5d c3 0f 1f 44 00 00 8b 37 f0 66 83 07 02 f6 
[Wed Apr 29 10:19:02 2015] BUG: soft lockup - CPU#1 stuck for 21s! [apt-cache:7539]
[Wed Apr 29 10:19:02 2015] Modules linked in: nfsv4 nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache 8021q garp stp mrp llc kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw floppy psmouse
[Wed Apr 29 10:19:02 2015] CPU: 1 PID: 7539 Comm: apt-cache Not tainted 3.13.0-37-generic #64-Ubuntu
[Wed Apr 29 10:19:02 2015] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[Wed Apr 29 10:19:02 2015] task: ffff88009bb1e000 ti: ffff880083db2000 task.ti: ffff880083db2000
[Wed Apr 29 10:19:02 2015] RIP: 0033:[<00007fcbd07f4205>]  [<00007fcbd07f4205>] 0x7fcbd07f4205
[Wed Apr 29 10:19:02 2015] RSP: 002b:00007ffff2723c20  EFLAGS: 00000293
[Wed Apr 29 10:19:02 2015] RAX: 000000000002c961 RBX: 000000000000e2a6 RCX: 00000000000001d5
[Wed Apr 29 10:19:02 2015] RDX: 00000000000b7772 RSI: 00007ffff2723e20 RDI: 0000000000000030
[Wed Apr 29 10:19:02 2015] RBP: 00007fcbcd72d000 R08: 00007ffff2723e30 R09: 0000000000000002
[Wed Apr 29 10:19:02 2015] R10: fffffffffffff7cc R11: 00007fcbcfe0b870 R12: 00007ffff27241b0
[Wed Apr 29 10:19:02 2015] R13: 00007ffff2725000 R14: 0000000000008071 R15: 000000000000a32a
[Wed Apr 29 10:19:02 2015] FS:  00007fcbd0cf8780(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
[Wed Apr 29 10:19:02 2015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Apr 29 10:19:02 2015] CR2: 00007fcbcd96f000 CR3: 000000009bb4f000 CR4: 00000000001406e0

It seems deployment-parsoid05 runs on virt1007. That machine has spikes of WIO but is otherwise low on CPU usage. virt1007 Ganglia view over a day.

That seems to be an hardware / virtualization issue and its tracked at T97520: Instances CPU being stuck on at least a couple instances

hashar closed this task as Resolved.Apr 29 2015, 2:49 PM
hashar claimed this task.

The instance has been migrated from labvirt1005 which is suffering from memory issue (T97521). It is now on labvirt1001.

puppet run is fast again so I guess the Parsoid service is now running fine.

/data/project/parsoid/parsoid.log shows it is parsing things just fine now.