
Spike: Create a new presto coordinator
Closed, Declined (Public)

Event Timeline

Side note on my current understanding of the Presto coordinator:

  • there is only one Presto coordinator currently deployed in our stack => moving it as is will lead to some unavailability (see if we could introduce disaggregation with RMs and multiple coordinators?)

Puppet config:

The replica coordinator role doesn't deploy the Presto coordinator (only other services): https://gerrit.wikimedia.org/g/operations/puppet/+/14ca9ee030980db076e5649c8fd445aa4bf162ac/modules/role/manifests/analytics_cluster/coordinator/replica.pp
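
For context, a minimal sketch of what distinguishes a coordinator from a worker in Presto's config.properties (values are placeholders for illustration, not what our puppet module actually sets):

  # Coordinator node (illustrative values only)
  coordinator=true
  node-scheduler.include-coordinator=false
  discovery-server.enabled=true
  discovery.uri=http://an-coord1001.eqiad.wmnet:8080

  # Worker node
  coordinator=false
  discovery.uri=http://an-coord1001.eqiad.wmnet:8080

Workers find the coordinator through discovery.uri, which is why moving the coordinator host has to propagate either to every worker's config or to the name that URI points at.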

Some more side notes to help understand the architecture.
Current coordinator hosts:

  • test cluster: an-test-coord1001.eqiad.wmnet
  • prod cluster: an-coord1001.eqiad.wmnet

Presto worker nodes:

  • test: an-test-presto1001.eqiad.wmnet
  • prod: an-presto10(0[1-9]|1[0-5])\.eqiad\.wmnet

New nodes on which we could deploy a coordinator: an-presto10[06-15] (i.e. an-presto1006 to an-presto1015)

@BTullis, @Stevemunene
As you surely know, the Presto coordinator is a key component of the Presto stack, and moving it will probably lead to some unavailability of the service.
This is a new topic for me (I have never administered a Presto cluster before), so I may be wrong in my assertions; perhaps it is possible to add a new coordinator and move the workers to it smoothly.
Depending on that first assertion, I'd like to discuss the possibilities with you:

  • some planned unavailability is no big deal => we can just do the migration
  • moving to coordinator disaggregation => this will require bootstrapping multiple RMs and then adding/migrating coordinators to them (not even sure there won't be any impact)
  • some other solution that you have already applied to this service?

The change of host will also require a DNS change for https://gerrit.wikimedia.org/g/operations/dns/+/219c1e88dc4429d344a014e8cff91a9762d54df3/templates/wmnet#64 so our clients can still query the coordinator (or the new RMs).
I'm probably missing other things that will require changes; don't hesitate to comment.
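
Purely for illustration, the kind of record in the wmnet zone template that would need repointing looks like this (BIND syntax; the alias name below is made up, the real entry is the one at line 64 of the link above):

  ; hypothetical alias for the Presto coordinator
  analytics-presto-coordinator    1H  IN CNAME    an-coord1001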

Looking at the Grafana board for when the 10 nodes were added, we can indeed see some increase in young GC, but I'm not really sure it would be linked to things timing out: https://grafana.wikimedia.org/d/pMd25ruZz/presto?orgId=1&from=1669847567132&to=1671594072414&var-datasource=eqiad%20prometheus%2Fanalytics&var-coordinator=an-coord1001:10281&viewPanel=20
Were we facing some OOM issues? From the graph it doesn't seem to be the case: memory is reclaimed well and only the young GC is affected (no increase on the old one).

I would recommend adding GC logging so we can analyze it, see if some tuning can be done there, and determine whether it could be the real cause.
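
A sketch of the kind of JVM options involved (not necessarily what the puppet patch below ends up using; which form applies depends on the JDK the cluster runs, and the log path is assumed to match the one seen later in this task):

  # JDK 11+ unified logging: gc.log rotated over 5 files of 20M
  -Xlog:gc*:file=/srv/presto/var/log/gc.log:time,uptime,level,tags:filecount=5,filesize=20M
  # JDK 8 equivalent
  -Xloggc:/srv/presto/var/log/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M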

Also, from what I can see, after the bump of the heap to 16G the heap usage on the coordinator remains below GB: https://grafana.wikimedia.org/d/pMd25ruZz/presto?orgId=1&from=1670957037772&to=1671594072414&var-datasource=eqiad%20prometheus%2Fanalytics&var-coordinator=an-coord1001:10281&viewPanel=13 / https://grafana.wikimedia.org/d/pMd25ruZz/presto?orgId=1&from=1674652000431&to=1674687613682&var-datasource=eqiad%20prometheus%2Fanalytics&var-coordinator=an-coord1001:10281
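
To cross-check the Grafana numbers directly on the host, something like this could be used (assuming the Presto server process command line matches 'PrestoServer' and jstat is installed):

  # Print heap and GC utilisation of the Presto JVM every 5 seconds
  PID=$(pgrep -f PrestoServer | head -n 1)
  sudo -u presto jstat -gcutil "$PID" 5000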

Change 888214 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/puppet@production] feat(presto): add gc logs

https://gerrit.wikimedia.org/r/888214

As suggested by @BTullis, I performed some query tests to see the effect on TCP connections and to identify between which components the connections are used.

queries executed:
https://superset.wikimedia.org/superset/dashboard/readers-metrics/
https://superset.wikimedia.org/superset/dashboard/editors-metrics/
https://superset.wikimedia.org/superset/dashboard/riskobservatory/

This leads to 10 to 15 queries being launched in parallel.

The number of connections per host increases to 400-500 per Presto worker, and most of them (300-400) are used for interconnections between Presto nodes.
Only 10 to 20 are used to connect to the datanodes (one connection per datanode...) -> so nothing huge here in the nominal config.
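
For reference, a rough way to reproduce these counts on a worker is to group established TCP connections by peer port; connections to port 50010 (the datanode port seen in the DFSClient logs below) are HDFS traffic, while the rest is mostly Presto inter-node traffic:

  # Count established connections per remote port ($4 is the peer address when a state filter is used)
  sudo ss -Htn state established | awk '{print $4}' | awk -F: '{print $NF}' | sort | uniq -c | sort -rn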

Some strange DFSClient logs are reported by the Presto nodes, but so far I haven't found real issues on the reported datanodes nor issues on the blocks (no under-replicated or missing ones); see the check commands sketched after the log excerpts:

  • Connection failure
Feb 10 01:50:45 an-presto1005 presto-server[6698]: 2023-02-10T01:50:44.622Z        WARN        20230210_015036_00017_t8v4f.1.0.1-663-48196        org.apache.hadoop.hdfs.DFSClient        Connection failure: Failed to connect to /10.64.53.47:50010 for file /wmf/data/wmf/pageview/actor/year=2023/month=1/day=30/hour=1/000023_0 for block BP-1552854784-10.64.21.110-1405114489661:blk_2131205256_1057547221:java.nio.channels.ClosedByInterruptException
  • No live nodes contain the block
Feb 09 22:46:34 an-presto1005 presto-server[6698]: 2023-02-09T22:46:33.671Z        INFO        20230209_224114_01177_t8v4f.2.0.2-11687-46435        org.apache.hadoop.hdfs.DFSClient        Could not obtain BP-1552854784-10.64.21.110-1405114489661:blk_2120375203_1046716903 from any node: java.io.IOException: No live nodes contain block BP-1552854784-10.64.21.110-1405114489661:blk_2120375203_1046716903 after checking nodes = [DatanodeInfoWithStorage[10.64.53.47:50010,DS-4a19a2b4-af53-4489-9950-fbe8a39fa280,DISK], DatanodeInfoWithStorage[10.64.5.40:50010,DS-c13cc26f-7394-49f3-b3dd-8a01d08c7e4d,DISK], DatanodeInfoWithStorage[10.64.53.33:50010,DS-a57f095d-aff5-4f12-873d-e3bf9dd781a7,DISK]], ignoredNodes = null No live nodes contain current block Block locations: DatanodeInfoWithStorage[10.64.53.47:50010,DS-4a19a2b4-af53-4489-9950-fbe8a39fa280,DISK] DatanodeInfoWithStorage[10.64.5.40:50010,DS-c13cc26f-7394-49f3-b3dd-8a01d08c7e4d,DISK] DatanodeInfoWithStorage[10.64.53.33:50010,DS-a57f095d-aff5-4f12-873d-e3bf9dd781a7,DISK] Dead nodes:  DatanodeInfoWithStorage[10.64.53.47:50010,DS-4a19a2b4-af53-4489-9950-fbe8a39fa280,DISK] DatanodeInfoWithStorage[10.64.53.33:50010,DS-a57f095d-aff5-4f12-873d-e3bf9dd781a7,DISK] DatanodeInfoWithStorage[10.64.5.40:50010,DS-c13cc26f-7394-49f3-b3dd-8a01d08c7e4d,DISK]. Will get new block locations from namenode and retry...
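
The block/datanode checks mentioned above can be done with something like the following (running as the hdfs user via sudo is an assumption about the rights needed in our setup):

  # Inspect the file reported in the Connection failure log: blocks, locations, replication status
  sudo -u hdfs hdfs fsck /wmf/data/wmf/pageview/actor/year=2023/month=1/day=30/hour=1 -files -blocks -locations
  # Cluster-wide summary of under-replicated / missing blocks and datanode liveness
  sudo -u hdfs hdfs dfsadmin -report | head -n 30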

Thanks @nfraison - from memory, when we added more Presto workers it was this type of Connection failure: message that was being logged fairly rapidly. I'm not surprised that we still see some with only five nodes.

an-presto1005 presto-server[6698]: 2023-02-10T01:50:44.622Z        WARN        20230210_015036_00017_t8v4f.1.0.1-663-48196        org.apache.hadoop.hdfs.DFSClient        Connection failure: Failed to connect to ...

Also interesting to know that it's only one connection per datanode from the Presto servers.

This is potentially a consequence of adding the 10 Presto nodes: in the same timeframe as the addition of the Presto nodes we can see some spikes of under-replicated blocks: https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=1674431692162&to=1674855030508 (as if there was some transient unavailability of datanodes).

This will have to be monitored if we add back those nodes.

I will decline the ticket. The investigation will have to be continued in another ticket.

Change 888214 merged by Nicolas Fraison:

[operations/puppet@production] feat(presto): add gc logs

https://gerrit.wikimedia.org/r/888214

Mentioned in SAL (#wikimedia-analytics) [2023-02-13T08:59:54Z] <nfraison> restarting presto-worker on an-test-presto1001 to pick up new gc logging settings T329054

Mentioned in SAL (#wikimedia-analytics) [2023-02-13T09:04:33Z] <nfraison> restarting presto-coordinator on an-test-coord1001 to pick up new gc logging settings T329054

GC logging was applied correctly on the worker but not on the coordinator, due to a permission denied error:

Feb 13 09:05:03 an-test-coord1001 presto-server[37018]: OpenJDK 64-Bit Server VM warning: Cannot open file /srv/presto/var/log/gc.log due to Permission denied

In prod there are also some nodes with the appropriate rights on /srv/presto/var (presto:presto) and some with the wrong rights on it (root:root).
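
A quick way to spot and, if needed, ad hoc fix the mismatch on a node (the durable fix is the puppet change below):

  # Show current ownership of the presto var/log directories
  ls -ld /srv/presto/var /srv/presto/var/log
  # Ad hoc fix so the presto user can write gc.log; puppet should enforce this going forward
  sudo chown -R presto:presto /srv/presto/var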

Change 888636 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/puppet@production] fix(presto): ensure log folder has appropriate right

https://gerrit.wikimedia.org/r/888636

Change 888636 merged by Nicolas Fraison:

[operations/puppet@production] fix(presto): ensure log folder has appropriate right

https://gerrit.wikimedia.org/r/888636

Mentioned in SAL (#wikimedia-analytics) [2023-02-13T09:59:44Z] <nfraison> restarting presto-coordinator on an-coord1001 to pick up new gc logging settings T329054

Mentioned in SAL (#wikimedia-analytics) [2023-02-13T10:46:13Z] <nfraison> restarting presto-worker on an-presto[1001-1015].eqiad.wmnet to pick up new gc logging settings T329054