
Superset: Presto backend: Unable to access some charts
Closed, Resolved · Public

Assigned To
Authored By
ppelberg
Dec 15 2022, 10:35 PM

Description

Linked Slack thread: https://wikimedia.slack.com/archives/CSV483812/p1671212018121909

Multiple dashboards are exhibiting errors from Presto data sources.

Sub-Saharan Africa Editors Superset Dashboard:

Some charts within two of the tabs of the Sub-Saharan Africa Editors Superset Dashboard are not loading as expected.

See "Behavior" below for more details...

Behavior

Chart: SSA Editors by Project and Country

  1. Navigate to https://superset.wikimedia.org/r/2208
  2. ❗️ Notice the following errors:
Chart | Error Message
Monthly distinct editors - Sub-Saharan Africa | Screen Shot 2022-12-15 at 1.09.41 PM.png (752×1 px, 102 KB)
Monthly distinct editors year over year changes - Sub-Saharan Africa | Screen Shot 2022-12-15 at 1.10.20 PM.png (748×1 px, 102 KB)
Monthly distinct editors by country - Sub-Saharan Africa | Screen Shot 2022-12-15 at 1.10.46 PM.png (752×1 px, 102 KB)

Chart: Monthly Distinct Editors by Project and Location

  1. Navigate to: https://superset.wikimedia.org/r/2210
  2. ❗️ Notice errors similar to those pictured above

Done

I've had inconsistent timeouts on the editors metrics and readers metrics dashboards as well.

Sometimes they time out, and sometimes they load after a longer-than-usual wait.
Individual charts from the content metrics, editors metrics, and readers metrics dashboards are all taking a very long time to load.

Event Timeline

BTullis renamed this task from Sub-Saharan Africa Editors Superset Dashboard: Unable to access some charts to Superset: Presto backend: Unable to access some charts.Dec 19 2022, 10:25 AM
BTullis claimed this task.
BTullis triaged this task as Unbreak Now! priority.
BTullis updated Other Assignee, added: Stevemunene.
BTullis updated the task description. (Show Details)

Picking up this ticket and working on it with high priority.

I can replicate the errors, so I think that this is a widespread issue.

I've been checking the Presto UI (https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration#View_the_Presto_UI) and I can view the logs there for failed jobs, such as this one:

image.png (413×1 px, 73 KB)

We'll proceed to look at the server logs on the different worker nodes to track down the source of these errors.
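As a rough sketch of that step (hedged: the log path below is an assumption about where the package writes the server log on these hosts), cumin can pull the most recent errors from every worker in one pass:

# Hypothetical sketch only: collect the latest ERROR lines from each worker's server log.
# The path /var/log/presto/server.log is assumed and may differ on these hosts.
sudo cumin A:presto-analytics 'grep ERROR /var/log/presto/server.log | tail -n 5'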

We can see errors in the logs for all workers, not only the new ones or the old ones. Checking how long the services have been running shows that some have been running since before the addition of the 10 new nodes.

btullis@cumin1001:~$ sudo cumin --no-progress A:presto-analytics 'systemctl status presto-server|grep Active'
15 hosts will be targeted:
an-presto[1001-1015].eqiad.wmnet
Ok to proceed on 15 hosts? Enter the number of affected hosts to confirm or "q" to quit 15
===== NODE GROUP =====
(1) an-presto1014.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 12:47:18 UTC; 1 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1010.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 10:44:37 UTC; 1 weeks 5 days ago
===== NODE GROUP =====
(1) an-presto1015.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 12:48:22 UTC; 1 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1003.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
   Active: active (running) since Wed 2022-11-30 16:49:08 UTC; 2 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1006.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Fri 2022-12-02 12:05:44 UTC; 2 weeks 2 days ago
===== NODE GROUP =====
(1) an-presto1013.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 11:02:48 UTC; 1 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1002.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
   Active: active (running) since Wed 2022-11-30 16:47:04 UTC; 2 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1009.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 10:37:29 UTC; 1 weeks 5 days ago
===== NODE GROUP =====
(1) an-presto1007.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Mon 2022-12-05 11:46:08 UTC; 1 weeks 6 days ago
===== NODE GROUP =====
(1) an-presto1004.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
   Active: active (running) since Wed 2022-11-30 16:51:11 UTC; 2 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1008.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 10:29:48 UTC; 1 weeks 5 days ago
===== NODE GROUP =====
(1) an-presto1012.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 10:55:07 UTC; 1 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1011.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 10:51:32 UTC; 1 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1001.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
   Active: active (running) since Wed 2022-11-30 16:45:01 UTC; 2 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1005.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
   Active: active (running) since Wed 2022-11-30 16:53:14 UTC; 2 weeks 4 days ago
================
100.0% (15/15) success ratio (>= 100.0% threshold) for command: 'systemctl status...rver|grep Active'.
100.0% (15/15) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

I think the first thing for us to try is a rolling restart of the presto worker nodes.

Running this cookbook now.

btullis@cumin1001:~$ sudo cookbook sre.presto.roll-restart-workers analytics
START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: an-presto[1001-1015]

Cookbook finished at: 11:28 UTC

!log btullis@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons.

Initial results show that we are still getting lots of timeouts. Some queries are completing, but only after the 180-second timeout in Superset.

image.png (873×1 px, 237 KB)

Errors in the logs seem reduced after the cookbook run, but queries are still much too slow.

I suggest that we experiment with shutting down an-presto10[06-15] and repeating these tests with the original cluster size of 5 worker nodes. Maybe there is something here related to the increased cluster size.

Icinga downtime and Alertmanager silence (ID=e639d47b-4061-4222-bd4a-5e533a454811) set by btullis@cumin1001 for 30 days, 0:00:00 on 10 host(s) and their services with reason: Reverting presto cluster size from 15 to 5 as a test

an-presto[1006-1015].eqiad.wmnet

I have downtimed the 10 new hosts with the cookbook above, then shut them down with:

btullis@cumin1001:~$ sudo cumin 'an-presto10[06-15].eqiad.wmnet' 'shutdown -h now'

The number of workers dropped as expected. I have not restarted the presto coordinator, nor the five remaining worker processes.

image.png (460×1 px, 106 KB)
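For reference, the active worker count can also be double-checked outside the UI by querying Presto's built-in system.runtime.nodes table. This is a sketch only: it assumes the presto CLI wrapper available on an analytics client and the analytics_hive catalog name.

# Hypothetical check: count active, non-coordinator nodes as seen by the coordinator.
presto --catalog analytics_hive --execute \
  "SELECT count(*) AS active_workers FROM system.runtime.nodes WHERE coordinator = false AND state = 'active'"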

I'll repeat the test queries to see if this gets them back to a working state.

It does! All dashboards mentioned in the description are now working correctly and returning results quickly.

I'm going to try a different experiment, shutting down an-presto100[1-5] and starting up an-presto10[06-10].

This experiment should tell us whether it's related to the cluster size of five workers, or whether it's something specific to the new worker nodes.

Booted an-presto10[06-10] using ipmitool

btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1006.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1007.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1008.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1009.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1010.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On

Cluster expanded to 10 nodes as expected.

image.png (220×1 px, 41 KB)

I've tried a couple of queries with 10 workers and I see slow queries again.
I'll now shut down an-presto100[1-5] to see if we can ascertain more.
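To see which statements are dragging while the workers are swapped around, the coordinator's system.runtime.queries table can be inspected as well (same CLI assumptions as the worker-count sketch above; the available columns can vary between Presto versions):

# Hypothetical query: list the 20 most recently created queries and their state.
presto --catalog analytics_hive --execute \
  "SELECT query_id, state, created FROM system.runtime.queries ORDER BY created DESC LIMIT 20"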

Icinga downtime and Alertmanager silence (ID=c69af7fc-f74f-45b1-b7fc-d5114d5c9ce0) set by btullis@cumin1001 for 30 days, 0:00:00 on 5 host(s) and their services with reason: Trying five of the new preto servers instead of the original five

an-presto[1001-1005].eqiad.wmnet

I have shut down an-presto100[1-5] with the following command.

btullis@cumin1001:~$ sudo cumin 'an-presto100[1-5].eqiad.wmnet' 'shutdown -h now'
5 hosts will be targeted:
an-presto[1001-1005].eqiad.wmnet
BTullis lowered the priority of this task from Unbreak Now! to High.Dec 19 2022, 12:34 PM

These dashboards seem to be fine when using any 5 worker nodes, but they perform badly with 10 and very badly with 15.

@ppelberg , @mpopov , @Iflorez - Could you please let me know if things are working as they were previously, or if you believe that there is still an issue with queries being slow?

The current configuration is:

  • an-presto100[1-5] - powered off (these were the original 5 worker nodes)
  • an-presto10[06-10] - powered on
  • an-presto10[11-15] - powered off

The timeline of adding additional worker nodes looks like this.

image.png (427×928 px, 33 KB)

If your impression is that there is still a performance degradation compared with how things were before the increase in worker node count at the start of December, then I will focus on what's different about these nodes.

If, however, you feel that things are exactly the way they were before the increase, then I will focus on the differences between a cluster size of 5 and a cluster size of 15.

Any other feedback about the Superset/Presto experience may also be relevant in this context, so please feel free to share any observations.

I'll triage this back from UBN to High priority whilst we investigate this issue.

Thank you for the quick investigation @BTullis - let's try to understand what prevents Presto from performing well with more nodes.

A quick look at the Presto Grafana dashboard makes me think that it could be related to the coordinator not having enough memory (see GC time and GC count when the number of queries grows).


I think you're right. We're only giving the presto server 4 GB of RAM for the heap.

image.png (167×1 px, 55 KB)

We have 19 GB of RAM that is not even in use as cache by the OS, so I think it is safe to allocate another 12 GB to this process. That will quadruple the size of the heap for the presto coordinator.

btullis@an-coord1001:~$ free -g
              total        used        free      shared  buff/cache   available
Mem:             62          41           0           0          20          19
Swap:             0           0           0

We're going to be upgrading these hosts to 128 GB of RAM as part of the upcoming refresh, and we're also going to be moving the MariaDB services to their own host, so I'm not worried about adding another 12 GB to the presto server now.
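For context, Presto reads its JVM options from a jvm.config file with one flag per line, so the change boils down to bumping the -Xmx flag from 4G to 16G. The file path below is an assumption about the package layout on an-coord1001; the resulting flag matches the -Xmx16G visible in the process command line further down.

# Hedged sketch, not the actual puppet diff: verify the effective heap flag after the change.
# Expected output: -Xmx16G (previously -Xmx4G). The jvm.config path is assumed.
grep Xmx /etc/presto/jvm.config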

Change 869214 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add another 12 GB of RAM to the presto server JVM

https://gerrit.wikimedia.org/r/869214

Change 869214 merged by Btullis:

[operations/puppet@production] Add another 12 GB of RAM to the presto server JVM

https://gerrit.wikimedia.org/r/869214

Mentioned in SAL (#wikimedia-analytics) [2022-12-19T13:45:08Z] <btullis> restart presto-server on an-coord1001 to increase heap from 4GB to 16 GB T325331

I have now deployed that change and restarted the presto-server on an-coord1001 with the increased heap size of 16 GB.

btullis@an-coord1001:~$ systemctl status -n 0 presto-server.service 
● presto-server.service - Presto Server Node
   Loaded: loaded (/lib/systemd/system/presto-server.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2022-12-19 13:44:13 UTC; 1min 53s ago
 Main PID: 15682 (java)
    Tasks: 284 (limit: 4915)
   Memory: 1.5G
   CGroup: /system.slice/presto-server.service
           └─15682 java -cp /usr/lib/presto/lib/* -server -Xmx16G -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError -XX:+ExitOnOutOfMe

Running the tests again with 5 active workers and monitoring GC usage.
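As a sketch of how GC pressure on the coordinator can be spot-checked at the host level, alongside the Grafana dashboard (assumptions: jstat from the same JDK is installed, and the service runs as the presto user):

# Hypothetical spot check: sample heap occupancy and GC counts/times every 5 seconds,
# 12 times, for the presto-server main PID.
PID=$(systemctl show -p MainPID --value presto-server.service)
sudo -u presto jstat -gcutil "$PID" 5000 12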

btullis@cumin1001:~$ sudo cookbook sre.hosts.remove-downtime --force 'an-presto101[1-5].eqiad.wnet'
==> Will remove downtime for 5 unverified hosts: an-presto[1011-1015].eqiad.wnet
Type "go" to proceed or "abort" to interrupt the execution
> go
START - Cookbook sre.hosts.remove-downtime for an-presto[1011-1015].eqiad.wnet
Removing downtime for an-presto[1011-1015].eqiad.wnet
END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-presto[1011-1015].eqiad.wnet
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1011.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1012.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1013.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1014.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1015.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On

Tests with only the 10 new workers and the increased heap size suggest no improvement; there are many timeouts again.

I'm going to try to revert to the original 5 workers.

So we're now running the 5 original workers. The 10 new hosts are powered off for now.

This seems to be working again, and quite quickly, so it looks like the heap settings may not have been what we needed.

Indeed - the same GC problems despite a much higher heap. Weird!

Change 870819 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Disable the presto server on the 10 new hosts

https://gerrit.wikimedia.org/r/870819

Change 870819 merged by Btullis:

[operations/puppet@production] Disable the presto server on the 10 new hosts

https://gerrit.wikimedia.org/r/870819

I've now powered an-presto10[06-15] back up, but presto-server will remain disabled on those hosts.
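A quick way to confirm that the unit stays down on those hosts (a sketch, assuming the same unit name used elsewhere in this task; the || true is there because is-enabled exits non-zero for disabled units):

# Hypothetical check: presto-server should report 'disabled' and 'inactive' on the 10 new hosts.
sudo cumin 'an-presto10[06-15].eqiad.wmnet' 'systemctl is-enabled presto-server || true; systemctl is-active presto-server || true'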

I'm going to mark this incident as resolved, but I have created a follow-up ticket for the investigation into the apparent instability with more than 5 worker nodes: T325809: Presto is unstable with more than 5 worker nodes