
Superset: Presto backend: Unable to access some charts
Closed, Resolved · Public

Assigned To
Authored By
ppelberg
Dec 15 2022, 10:35 PM

Description

Linked Slack thread: https://wikimedia.slack.com/archives/CSV483812/p1671212018121909

Multiple dashboards are exhibiting errors from Presto data sources.

Sub-Saharan Africa Editors Superset Dashboard:

Some charts within two of the tabs of the Sub-Saharan Africa Editors Superset Dashboard are not loading as expected.

See "Behavior" below for more details...

Behavior

Chart: SSA Editors by Project and Country

  1. Navigate to https://superset.wikimedia.org/r/2208
  2. ❗️ Notice the following errors:
Chart | Error Message
Monthly distinct editors - Sub-Saharan Africa | Screen Shot 2022-12-15 at 1.09.41 PM.png (752×1 px, 102 KB)
Monthly distinct editors year over year changes - Sub-Saharan Africa | Screen Shot 2022-12-15 at 1.10.20 PM.png (748×1 px, 102 KB)
Monthly distinct editors by country - Sub-Saharan Africa | Screen Shot 2022-12-15 at 1.10.46 PM.png (752×1 px, 102 KB)

Chart: Monthly Distinct Editors by Project and Location

  1. Navigate to: https://superset.wikimedia.org/r/2210
  2. ❗️ Notice errors similar to those pictured above

Done

I've had inconsistent timeouts on the editors metrics and readers metrics dashboards as well.

Sometimes they time out, and sometimes they load after a longer-than-usual wait.
Individual charts from the content metrics, editors metrics, and readers metrics dashboards are all taking a very long time to load.

Event Timeline

BTullis renamed this task from Sub-Saharan Africa Editors Superset Dashboard: Unable to access some charts to Superset: Presto backend: Unable to access some charts.Dec 19 2022, 10:25 AM
BTullis claimed this task.
BTullis triaged this task as Unbreak Now! priority.
BTullis updated Other Assignee, added: Stevemunene.
BTullis updated the task description. (Show Details)

Picking up this ticket and working on it with high priority.

I can replicate the errors, so I think that this is a widespread issue.

I've been checking the Presto UI (https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration#View_the_Presto_UI) and I can view the logs there for failed jobs, such as this one:

image.png (413×1 px, 73 KB)

We'll proceed to look at the server logs on the different worker nodes to track down the source of these errors.
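As a rough sketch of that step (hedged: the log path below is an assumption about where the package writes the server log on these hosts), cumin can pull the most recent errors from every worker in one pass:

# Hypothetical sketch only: collect the latest ERROR lines from each worker's server log.
# The path /var/log/presto/server.log is assumed and may differ on these hosts.
sudo cumin A:presto-analytics 'grep ERROR /var/log/presto/server.log | tail -n 5'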

We can see errors in the logs for all workers, not only the new ones or the old ones. Checking how long the services have been running shows that some have been running since before the addition of the 10 new nodes.

btullis@cumin1001:~$ sudo cumin --no-progress A:presto-analytics 'systemctl status presto-server|grep Active'
15 hosts will be targeted:
an-presto[1001-1015].eqiad.wmnet
Ok to proceed on 15 hosts? Enter the number of affected hosts to confirm or "q" to quit 15
===== NODE GROUP =====
(1) an-presto1014.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 12:47:18 UTC; 1 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1010.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 10:44:37 UTC; 1 weeks 5 days ago
===== NODE GROUP =====
(1) an-presto1015.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 12:48:22 UTC; 1 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1003.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
   Active: active (running) since Wed 2022-11-30 16:49:08 UTC; 2 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1006.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Fri 2022-12-02 12:05:44 UTC; 2 weeks 2 days ago
===== NODE GROUP =====
(1) an-presto1013.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 11:02:48 UTC; 1 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1002.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
   Active: active (running) since Wed 2022-11-30 16:47:04 UTC; 2 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1009.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 10:37:29 UTC; 1 weeks 5 days ago
===== NODE GROUP =====
(1) an-presto1007.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Mon 2022-12-05 11:46:08 UTC; 1 weeks 6 days ago
===== NODE GROUP =====
(1) an-presto1004.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
   Active: active (running) since Wed 2022-11-30 16:51:11 UTC; 2 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1008.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 10:29:48 UTC; 1 weeks 5 days ago
===== NODE GROUP =====
(1) an-presto1012.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 10:55:07 UTC; 1 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1011.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
     Active: active (running) since Wed 2022-12-07 10:51:32 UTC; 1 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1001.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
   Active: active (running) since Wed 2022-11-30 16:45:01 UTC; 2 weeks 4 days ago
===== NODE GROUP =====
(1) an-presto1005.eqiad.wmnet
----- OUTPUT of 'systemctl status...rver|grep Active' -----
   Active: active (running) since Wed 2022-11-30 16:53:14 UTC; 2 weeks 4 days ago
================
100.0% (15/15) success ratio (>= 100.0% threshold) for command: 'systemctl status...rver|grep Active'.
100.0% (15/15) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

I think the first thing for us to try is a rolling restart of the presto worker nodes.

Running this cookbook now.

btullis@cumin1001:~$ sudo cookbook sre.presto.roll-restart-workers analytics
START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: an-presto[1001-1015]

Cookbook finished at: 11:28 UTC

!log btullis@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons.

Initial results show that we are still getting lots of timeouts. Some queries are completing, but only after the 180-second timeout in Superset.

image.png (873×1 px, 237 KB)

Errors in the logs seem reduced after the cookbook run, but queries are still much too slow.

I suggest that we experiment with shutting down an-presto10[06-15] and repeating these tests with the original cluster size of 5 worker nodes. Maybe there is something here related to the increased cluster size.

Icinga downtime and Alertmanager silence (ID=e639d47b-4061-4222-bd4a-5e533a454811) set by btullis@cumin1001 for 30 days, 0:00:00 on 10 host(s) and their services with reason: Reverting presto cluster size from 15 to 5 as a test

an-presto[1006-1015].eqiad.wmnet

I have downtimed the 10 new hosts with the cookbook above, then shut them down with:

btullis@cumin1001:~$ sudo cumin 'an-presto10[06-15].eqiad.wmnet' 'shutdown -h now'

The number of workers dropped as expected. I have not restarted the presto coordinator, nor the five remaining worker processes.

image.png (460×1 px, 106 KB)
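For reference, the active worker count can also be double-checked outside the UI by querying Presto's built-in system.runtime.nodes table. This is a sketch only: it assumes the presto CLI wrapper available on an analytics client and the analytics_hive catalog name.

# Hypothetical check: count active, non-coordinator nodes as seen by the coordinator.
presto --catalog analytics_hive --execute \
  "SELECT count(*) AS active_workers FROM system.runtime.nodes WHERE coordinator = false AND state = 'active'"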

I'll repeat the test queries to see if this gets them back to a working state.

It does! All dashboards mentioned in the description are now working correctly and returning results quickly.

I'm going to try a different experiment, shutting down an-presto100[1-5] and starting up an-presto10[06-10].

This experiment should tell us whether it's related to the cluster size of five workers, or whether it's something specific to the new worker nodes.

Booted an-presto10[06-10] using ipmitool

btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1006.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1007.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1008.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1009.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1010.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On

Cluster expanded to 10 nodes as expected.

image.png (220×1 px, 41 KB)

I've tried a couple of queries with 10 workers and I see slow queries again.
I'll now shut down an-presto100[1-5] to see if we can ascertain more.
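To see which statements are dragging while the workers are swapped around, the coordinator's system.runtime.queries table can be inspected as well (same CLI assumptions as the worker-count sketch above; the available columns can vary between Presto versions):

# Hypothetical query: list the 20 most recently created queries and their state.
presto --catalog analytics_hive --execute \
  "SELECT query_id, state, created FROM system.runtime.queries ORDER BY created DESC LIMIT 20"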

Icinga downtime and Alertmanager silence (ID=c69af7fc-f74f-45b1-b7fc-d5114d5c9ce0) set by btullis@cumin1001 for 30 days, 0:00:00 on 5 host(s) and their services with reason: Trying five of the new preto servers instead of the original five

an-presto[1001-1005].eqiad.wmnet

I have shut down an-presto100[1-5] with the following command.

btullis@cumin1001:~$ sudo cumin 'an-presto100[1-5].eqiad.wmnet' 'shutdown -h now'
5 hosts will be targeted:
an-presto[1001-1005].eqiad.wmnet
BTullis lowered the priority of this task from Unbreak Now! to High.Dec 19 2022, 12:34 PM

These dashboards seem to be fine when using any 5 worker nodes, but they perform badly with 10 and very badly with 15.

@ppelberg , @mpopov , @Iflorez - Could you please let me know if things are working as they were previously, or if you believe that there is still an issue with queries being slow?

The current configuration is:

  • an-presto100[1-5] - powered off (these were the original 5 worker nodes)
  • an-presto10[06-10] - powered on
  • an-presto10[11-15] - powered off

The timeline of adding additional worker nodes looks like this.

image.png (427×928 px, 33 KB)

If your impression is that there is still a performance degradation compared with how things were before the increase in worker node count at the start of December, then I will focus on what's different about these nodes.

If, however, you feel that things are exactly the way they were before the increase, then I will focus on the differences between a cluster size of 5 and a cluster size of 15.

Any other feedback about the Superset/Presto experience may also be relevant in this context, so please feel free to share any observations.

I'll triage this back from UBN to High priority whilst we investigate this issue.

Thank you for the quick investigation @BTullis - let's try to understand what prevents Presto from performing well with more nodes.

A quick look at the Presto Grafana dashboard makes me think that it could be related to the coordinator not having enough memory (see GC time and GC count when the number of queries grows).


I think you're right. We're only giving the presto server 4 GB of RAM for the heap.

image.png (167×1 px, 55 KB)

We have 19 GB of RAM that is not even in use as cache by the OS, so I think it is safe to allocate another 12 GB to this process. That will quadruple the size of the heap for the presto coordinator.

btullis@an-coord1001:~$ free -g
              total        used        free      shared  buff/cache   available
Mem:             62          41           0           0          20          19
Swap:             0           0           0

We're going to be upgrading these hosts to 128 GB of RAM as part of the upcoming refresh, and we're also going to be moving the MariaDB services to their own host, so I'm not worried about adding another 12 GB to the presto server now.
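For context, Presto reads its JVM options from a jvm.config file with one flag per line, so the change boils down to bumping the -Xmx flag from 4G to 16G. The file path below is an assumption about the package layout on an-coord1001; the resulting flag matches the -Xmx16G visible in the process command line further down.

# Hedged sketch, not the actual puppet diff: verify the effective heap flag after the change.
# Expected output: -Xmx16G (previously -Xmx4G). The jvm.config path is assumed.
grep Xmx /etc/presto/jvm.config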

Change 869214 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add another 12 GB of RAM to the presto server JVM

https://gerrit.wikimedia.org/r/869214

Change 869214 merged by Btullis:

[operations/puppet@production] Add another 12 GB of RAM to the presto server JVM

https://gerrit.wikimedia.org/r/869214

Mentioned in SAL (#wikimedia-analytics) [2022-12-19T13:45:08Z] <btullis> restart presto-server on an-coord1001 to increase heap from 4GB to 16 GB T325331

I have now deployed that change and restarted the presto-server on an-coord1001 with the increased heap size of 16 GB.

btullis@an-coord1001:~$ systemctl status -n 0 presto-server.service 
● presto-server.service - Presto Server Node
   Loaded: loaded (/lib/systemd/system/presto-server.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2022-12-19 13:44:13 UTC; 1min 53s ago
 Main PID: 15682 (java)
    Tasks: 284 (limit: 4915)
   Memory: 1.5G
   CGroup: /system.slice/presto-server.service
           └─15682 java -cp /usr/lib/presto/lib/* -server -Xmx16G -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError -XX:+ExitOnOutOfMe

Running the tests again with 5 active workers and monitoring GC usage.
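As a sketch of how GC pressure on the coordinator can be spot-checked at the host level, alongside the Grafana dashboard (assumptions: jstat from the same JDK is installed, and the service runs as the presto user):

# Hypothetical spot check: sample heap occupancy and GC counts/times every 5 seconds,
# 12 times, for the presto-server main PID.
PID=$(systemctl show -p MainPID --value presto-server.service)
sudo -u presto jstat -gcutil "$PID" 5000 12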

btullis@cumin1001:~$ sudo cookbook sre.hosts.remove-downtime --force 'an-presto101[1-5].eqiad.wnet'
==> Will remove downtime for 5 unverified hosts: an-presto[1011-1015].eqiad.wnet
Type "go" to proceed or "abort" to interrupt the execution
> go
START - Cookbook sre.hosts.remove-downtime for an-presto[1011-1015].eqiad.wnet
Removing downtime for an-presto[1011-1015].eqiad.wnet
END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-presto[1011-1015].eqiad.wnet
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1011.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1012.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1013.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1014.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On
btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-presto1015.mgmt.eqiad.wmnet" -U root -E chassis power on
Unable to read password from environment
Password: 
Chassis Power Control: Up/On

Tests with only the 10 new workers and the increased heap size suggest no improvement; there are many timeouts again.

I'm going to try to revert to the original 5 workers.

So we're now running the 5 original workers. The 10 new hosts are powered off for now.

This seems to be working again, and quite quickly, so it looks like the heap settings may not have been what we needed.

Indeed - the same GC problems despite a much higher heap. Weird!

Change 870819 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Disable the presto server on the 10 new hosts

https://gerrit.wikimedia.org/r/870819

Change 870819 merged by Btullis:

[operations/puppet@production] Disable the presto server on the 10 new hosts

https://gerrit.wikimedia.org/r/870819

I've now powered an-presto10[06-15] back up, but presto-server will remain disabled on those hosts.
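A quick way to confirm that the unit stays down on those hosts (a sketch, assuming the same unit name used elsewhere in this task; the || true is there because is-enabled exits non-zero for disabled units):

# Hypothetical check: presto-server should report 'disabled' and 'inactive' on the 10 new hosts.
sudo cumin 'an-presto10[06-15].eqiad.wmnet' 'systemctl is-enabled presto-server || true; systemctl is-active presto-server || true'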

I'm going to mark this incident as resolved, but I have created a follow-up ticket for the investigation into the apparent instability with more than 5 worker nodes: T325809: Presto is unstable with more than 5 worker nodes