
Track queries that run on the elasticsearch clusters longer than some timeout
Closed, ResolvedPublic

Description

We run all search queries with both server-side and client-side timeouts. Unfortunately, multiple times now (T152895#2872562, T172653 and T195491) we have found edge cases where queries can run for 10+ minutes on a shard. We don't currently have any good way of identifying that such queries exist; they have been noticed either because the cluster was struggling under the added load of 10+ minute queries, or because a search engineer happened to notice a query running longer than expected.

Elasticsearch has slow logging functionality; we should look into how we can use it to make these queries visible.
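
For reference, the search slow log is configured per index via dynamic settings. Something along these lines would flag queries that blow well past our client-side timeouts (the index name and thresholds here are purely illustrative, not a concrete proposal):

$ curl -XPUT 'http://localhost:9200/enwiki_content/_settings' -H 'Content-Type: application/json' -d '
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.threshold.fetch.info": "800ms"
}'

The entries land in each node's local slow log file, so they still need to be shipped somewhere central (e.g. logstash) to actually be visible.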

Event Timeline

Restricted Application edited projects, added Discovery-Search; removed Discovery-Search (Current work). · View Herald Transcript · Jun 1 2018, 5:41 PM
EBernhardson updated the task description. (Show Details) · Jun 1 2018, 5:44 PM

Change 436841 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Send elasticsearch slowlogs to logstash

https://gerrit.wikimedia.org/r/436841

Change 436848 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Tune CirrusSearch slow logging

https://gerrit.wikimedia.org/r/436848

Change 436841 merged by Gehel:
[operations/puppet@production] Send elasticsearch slowlogs to logstash

https://gerrit.wikimedia.org/r/436841

The slow logs are now being collected in logstash, but they are being truncated. I'll poke at the infrastructure and figure out how to stop the truncation.

EBernhardson added a subscriber: ayounsi. · Edited · Jun 5 2018, 10:43 PM

Checked with tcpdump on the logstash machines; we are receiving oversized UDP packets from elasticsearch:

$ sudo tcpdump udp and port 12201 | head
22:35:51.993187 IP elastic1046.eqiad.wmnet.52560 > logstash.svc.eqiad.wmnet.12201: UDP, bad length 2059 > 1472
22:35:52.020144 IP elastic2007.codfw.wmnet.45672 > logstash.svc.eqiad.wmnet.12201: UDP, length 346
22:35:52.020881 IP elastic2007.codfw.wmnet.45672 > logstash.svc.eqiad.wmnet.12201: UDP, length 346
22:35:52.021190 IP elastic2007.codfw.wmnet.45672 > logstash.svc.eqiad.wmnet.12201: UDP, length 349
22:35:52.021458 IP elastic2007.codfw.wmnet.45672 > logstash.svc.eqiad.wmnet.12201: UDP, length 348
22:35:52.081435 IP elastic1049.eqiad.wmnet.52314 > logstash.svc.eqiad.wmnet.12201: UDP, bad length 2059 > 1472
22:35:52.118145 IP elastic1039.eqiad.wmnet.60308 > logstash.svc.eqiad.wmnet.12201: UDP, bad length 2053 > 1472
...

For oversized UDP packets to make it through, the network needs to support packet fragmentation; whether the NICs offload it can be checked with ethtool.

The frontend logstash machines report UDP fragmentation offload enabled:

$ for i in $(seq 7 9); do ssh logstash100$i.eqiad.wmnet sudo ethtool -k eth0 | grep udp-fragmentation-offload; done
udp-fragmentation-offload: on
udp-fragmentation-offload: on
udp-fragmentation-offload: on

The eqiad elasticsearch cluster, however, reports it disabled everywhere:

$ for i in $(seq -w 17 52); do ssh elastic10$i.eqiad.wmnet sudo ethtool -k eth0 | grep udp-fragmentation-offload; done | sort | uniq -c
     35 udp-fragmentation-offload: off [fixed]

Similarly, the codfw cluster reports it disabled:

$ for i in $(seq -w 1 36); do ssh elastic20$i.codfw.wmnet sudo ethtool -k eth0 | grep udp-fragmentation-offload; done | sort | uniq -c
     36 udp-fragmentation-offload: off [fixed]

@ayounsi Would you by chance know whether our networks (eqiad internal and cross-DC) should support UDP fragmentation? The logstash frontends live inside Ganeti, so maybe they are incorrectly reporting support.

I'm also possibly completely misinterpreting what's going on here...

EBernhardson added a comment. · Edited · Jun 5 2018, 11:04 PM

Fragmentation should be fine. I turned off ferm on logstash1009 so I could use an arbitrary port, then used netcat to transfer a 4kB file over UDP from an elastic machine to logstash; everything worked as expected. Some more investigation suggests tcpdump will report "bad length" even when the length is fine and the remainder of the datagram is expected/found in a subsequent packet at layer 3.
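
For the record, the test was roughly the following (the port and file names are arbitrary, and the exact flags depend on which netcat variant is installed):

# on logstash1009 (ferm disabled), listen on an arbitrary UDP port
$ nc -u -l -p 12345 > received.bin

# on an elastic host, send a 4kB blob to that port
$ head -c 4096 /dev/urandom > payload.bin
$ nc -u logstash1009.eqiad.wmnet 12345 < payload.bin

# checksums on both ends should match
$ md5sum payload.bin     # on the elastic host
$ md5sum received.bin    # on logstash1009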

By taking some tcpdump recordings and feeding the reconstructed packets through a zlib decompressor, I can verify that events are received by logstash over UDP with their full contents. Something inside logstash must be doing the truncation, but I'm not sure what yet.
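
Something like this reproduces the check, assuming the raw UDP payload of one reassembled datagram has been exported from the capture to payload.bin (e.g. via wireshark's "export packet bytes"); GELF over UDP here is zlib-compressed JSON:

$ python3 -c 'import sys, zlib; sys.stdout.buffer.write(zlib.decompress(open(sys.argv[1], "rb").read()))' payload.bin | head -c 500

The inflated output is the full GELF JSON document, with nothing truncated.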

Change 437657 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] logstash: Use gelf long_message when provided

https://gerrit.wikimedia.org/r/437657

Most likely the problem is that elasticsearch is sending two message fields in the GELF payload: a truncated version in short_message and the whole thing in long_message. Our GELF handling in logstash always uses short_message. The attached puppet patch uses long_message when available.

Change 437657 merged by Gehel:
[operations/puppet@production] logstash: Use gelf long_message when provided

https://gerrit.wikimedia.org/r/437657

> Fragmentation should be fine. I turned off ferm on logstash1009 so I could use an arbitrary port, then used netcat to transfer a 4kB file over UDP from an elastic machine to logstash; everything worked as expected. Some more investigation suggests tcpdump will report "bad length" even when the length is fine and the remainder of the datagram is expected/found in a subsequent packet at layer 3.

Correct, tcpdump shows you the packet after it has been re-assembled by the NIC/kernel, and it freaks out because the result is larger than the NIC's MTU, but that's a red herring.
If the datagrams were transiting the wire with a size larger than the MTU, they would:
1/ not exit the source host (as the NIC MTU is 1500),
2/ not be accepted by the destination host (which would reject them with an "ICMP packet too big" message).

As a side note, the network supports jumbo frames (>9000 bytes), so if you feel adventurous you can increase your NIC's MTU (but that's another can of worms).
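
Side note for anyone poking at this later: the interface MTU and what actually fits across the path without fragmentation can be sanity-checked with something like the following (the hostname is just an example):

# current interface MTU
$ ip link show eth0 | grep -o 'mtu [0-9]*'

# largest ICMP payload that fits in a 1500 byte MTU (1472 + 8 ICMP + 20 IP), sent with the don't-fragment bit set
$ ping -M do -s 1472 -c 3 logstash.svc.eqiad.wmnet

# one byte more should fail with "message too long" if the path MTU really is 1500
$ ping -M do -s 1473 -c 3 logstash.svc.eqiad.wmnet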

EBjune triaged this task as Normal priority. · Jun 7 2018, 5:09 PM
EBjune moved this task from needs triage to Current work on the Discovery-Search board.

Change 436848 merged by jenkins-bot:
[operations/mediawiki-config@master] Tune CirrusSearch slow logging

https://gerrit.wikimedia.org/r/436848

The truncation problem looks resolved. We unfortunately don't have the exact timings extracted, but the thresholds are set well enough that anything slow-logged at the warn level should either be a real problem or at least occur at a very low rate. In the last week only 2 queries hit the warn level, both regexes that took ~15s for the fetch phase.

Vvjjkkii renamed this task from Track queries that run on the elasticsearch clusters longer than some timeout to vtbaaaaaaa. · Jul 1 2018, 1:06 AM
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii removed EBernhardson as the assignee of this task.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: Aklapper, gerritbot.
CommunityTechBot assigned this task to EBernhardson.
CommunityTechBot lowered the priority of this task from High to Normal.
CommunityTechBot renamed this task from vtbaaaaaaa to Track queries that run on the elasticsearch clusters longer than some timeout.
CommunityTechBot added subscribers: Aklapper, gerritbot.
debt closed this task as Resolved. · Jul 9 2018, 11:45 PM