
Spike: Create a new presto coordinator
Closed, Declined (Public)

Event Timeline

Side note on my current understanding of the Presto coordinator:

  • there is only one Presto coordinator currently deployed in our stack => moving it as is will lead to some unavailability (see if we could introduce disaggregation with RMs and multiple coordinators?)

Puppet config:

The replica coordinator role doesn't deploy the Presto coordinator (only other services): https://gerrit.wikimedia.org/g/operations/puppet/+/14ca9ee030980db076e5649c8fd445aa4bf162ac/modules/role/manifests/analytics_cluster/coordinator/replica.pp
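
For context, a minimal sketch of what distinguishes a coordinator from a worker in Presto's config.properties (values are placeholders for illustration, not what our puppet module actually sets):

  # Coordinator node (illustrative values only)
  coordinator=true
  node-scheduler.include-coordinator=false
  discovery-server.enabled=true
  discovery.uri=http://an-coord1001.eqiad.wmnet:8080

  # Worker node
  coordinator=false
  discovery.uri=http://an-coord1001.eqiad.wmnet:8080

Workers find the coordinator through discovery.uri, which is why moving the coordinator host has to propagate either to every worker's config or to the name that URI points at.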

Some more side notes to help understand the architecture.
Current coordinator hosts:

  • test cluster: an-test-coord1001.eqiad.wmnet
  • prod cluster: an-coord1001.eqiad.wmnet

Presto worker nodes:

  • test: an-test-presto1001.eqiad.wmnet
  • prod: an-presto10(0[1-9]|1[0-5])\.eqiad\.wmnet

New nodes on which we could deploy a coordinator: an-presto10[06-15] (i.e. an-presto1006 to an-presto1015)

@BTullis, @Stevemunene
As you surely know, the Presto coordinator is a key component of the Presto stack, and moving it will probably lead to some unavailability of the service.
This is a new topic for me (I have never administered a Presto cluster before), so I may be wrong in my assertions; perhaps it is possible to add a new coordinator and move the workers to it smoothly.
Depending on that first assertion, I'd like to discuss the possibilities with you:

  • some planned unavailability is no big deal => we can just do the migration
  • moving to coordinator disaggregation => this will require bootstrapping multiple RMs and then adding/migrating coordinators to them (not even sure there won't be any impact)
  • some other solution that you have already applied to this service?

The change of host will also require a DNS change for https://gerrit.wikimedia.org/g/operations/dns/+/219c1e88dc4429d344a014e8cff91a9762d54df3/templates/wmnet#64 so our clients can still query the coordinator (or the new RMs).
I'm probably missing other things that will require changes; don't hesitate to comment.
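
Purely for illustration, the kind of record in the wmnet zone template that would need repointing looks like this (BIND syntax; the alias name below is made up, the real entry is the one at line 64 of the link above):

  ; hypothetical alias for the Presto coordinator
  analytics-presto-coordinator    1H  IN CNAME    an-coord1001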

Looking at the Grafana board for when the 10 nodes were added, we can indeed see some increase in young GC, but I'm not really sure it would be linked to things timing out: https://grafana.wikimedia.org/d/pMd25ruZz/presto?orgId=1&from=1669847567132&to=1671594072414&var-datasource=eqiad%20prometheus%2Fanalytics&var-coordinator=an-coord1001:10281&viewPanel=20
Were we facing some OOM issues? From the graph it doesn't seem to be the case: memory is reclaimed well and only the young GC is affected (no increase on the old one).

I would recommend adding GC logging so we can analyze it, see if some tuning can be done there, and determine whether it could be the real cause.
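
A sketch of the kind of JVM options involved (not necessarily what the puppet patch below ends up using; which form applies depends on the JDK the cluster runs, and the log path is assumed to match the one seen later in this task):

  # JDK 11+ unified logging: gc.log rotated over 5 files of 20M
  -Xlog:gc*:file=/srv/presto/var/log/gc.log:time,uptime,level,tags:filecount=5,filesize=20M
  # JDK 8 equivalent
  -Xloggc:/srv/presto/var/log/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M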

Also, from what I can see, after the bump of the heap to 16G the heap usage on the coordinator remains below GB: https://grafana.wikimedia.org/d/pMd25ruZz/presto?orgId=1&from=1670957037772&to=1671594072414&var-datasource=eqiad%20prometheus%2Fanalytics&var-coordinator=an-coord1001:10281&viewPanel=13 / https://grafana.wikimedia.org/d/pMd25ruZz/presto?orgId=1&from=1674652000431&to=1674687613682&var-datasource=eqiad%20prometheus%2Fanalytics&var-coordinator=an-coord1001:10281
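
To cross-check the Grafana numbers directly on the host, something like this could be used (assuming the Presto server process command line matches 'PrestoServer' and jstat is installed):

  # Print heap and GC utilisation of the Presto JVM every 5 seconds
  PID=$(pgrep -f PrestoServer | head -n 1)
  sudo -u presto jstat -gcutil "$PID" 5000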

Change 888214 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/puppet@production] feat(presto): add gc logs

https://gerrit.wikimedia.org/r/888214

As suggested by @BTullis, I performed some query tests to see the effect on TCP connections and to identify between which components the connections are used.

queries executed:
https://superset.wikimedia.org/superset/dashboard/readers-metrics/
https://superset.wikimedia.org/superset/dashboard/editors-metrics/
https://superset.wikimedia.org/superset/dashboard/riskobservatory/

This leads to 10 to 15 queries being launched in parallel.

The number of connections per host increases to 400-500 per Presto worker, and most of them (300-400) are used for interconnections between Presto nodes.
Only 10 to 20 are used to connect to the datanodes (one connection per datanode...) -> so nothing huge here in the nominal config.
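
For reference, a rough way to reproduce these counts on a worker is to group established TCP connections by peer port; connections to port 50010 (the datanode port seen in the DFSClient logs below) are HDFS traffic, while the rest is mostly Presto inter-node traffic:

  # Count established connections per remote port ($4 is the peer address when a state filter is used)
  sudo ss -Htn state established | awk '{print $4}' | awk -F: '{print $NF}' | sort | uniq -c | sort -rn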

Some strange DFSClient logs are reported by the Presto nodes, but so far I haven't found real issues on the reported datanodes nor issues on the blocks (no under-replicated or missing ones); see the check commands sketched after the log excerpts:

  • Connection failure
Feb 10 01:50:45 an-presto1005 presto-server[6698]: 2023-02-10T01:50:44.622Z        WARN        20230210_015036_00017_t8v4f.1.0.1-663-48196        org.apache.hadoop.hdfs.DFSClient        Connection failure: Failed to connect to /10.64.53.47:50010 for file /wmf/data/wmf/pageview/actor/year=2023/month=1/day=30/hour=1/000023_0 for block BP-1552854784-10.64.21.110-1405114489661:blk_2131205256_1057547221:java.nio.channels.ClosedByInterruptException
  • No live nodes contain the block
Feb 09 22:46:34 an-presto1005 presto-server[6698]: 2023-02-09T22:46:33.671Z        INFO        20230209_224114_01177_t8v4f.2.0.2-11687-46435        org.apache.hadoop.hdfs.DFSClient        Could not obtain BP-1552854784-10.64.21.110-1405114489661:blk_2120375203_1046716903 from any node: java.io.IOException: No live nodes contain block BP-1552854784-10.64.21.110-1405114489661:blk_2120375203_1046716903 after checking nodes = [DatanodeInfoWithStorage[10.64.53.47:50010,DS-4a19a2b4-af53-4489-9950-fbe8a39fa280,DISK], DatanodeInfoWithStorage[10.64.5.40:50010,DS-c13cc26f-7394-49f3-b3dd-8a01d08c7e4d,DISK], DatanodeInfoWithStorage[10.64.53.33:50010,DS-a57f095d-aff5-4f12-873d-e3bf9dd781a7,DISK]], ignoredNodes = null No live nodes contain current block Block locations: DatanodeInfoWithStorage[10.64.53.47:50010,DS-4a19a2b4-af53-4489-9950-fbe8a39fa280,DISK] DatanodeInfoWithStorage[10.64.5.40:50010,DS-c13cc26f-7394-49f3-b3dd-8a01d08c7e4d,DISK] DatanodeInfoWithStorage[10.64.53.33:50010,DS-a57f095d-aff5-4f12-873d-e3bf9dd781a7,DISK] Dead nodes:  DatanodeInfoWithStorage[10.64.53.47:50010,DS-4a19a2b4-af53-4489-9950-fbe8a39fa280,DISK] DatanodeInfoWithStorage[10.64.53.33:50010,DS-a57f095d-aff5-4f12-873d-e3bf9dd781a7,DISK] DatanodeInfoWithStorage[10.64.5.40:50010,DS-c13cc26f-7394-49f3-b3dd-8a01d08c7e4d,DISK]. Will get new block locations from namenode and retry...
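
The block/datanode checks mentioned above can be done with something like the following (running as the hdfs user via sudo is an assumption about the rights needed in our setup):

  # Inspect the file reported in the Connection failure log: blocks, locations, replication status
  sudo -u hdfs hdfs fsck /wmf/data/wmf/pageview/actor/year=2023/month=1/day=30/hour=1 -files -blocks -locations
  # Cluster-wide summary of under-replicated / missing blocks and datanode liveness
  sudo -u hdfs hdfs dfsadmin -report | head -n 30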

Thanks @nfraison - from memory, when we added more Presto workers it was this type of Connection failure: message that was being logged fairly rapidly. I'm not surprised that we still see some with only five nodes.

an-presto1005 presto-server[6698]: 2023-02-10T01:50:44.622Z        WARN        20230210_015036_00017_t8v4f.1.0.1-663-48196        org.apache.hadoop.hdfs.DFSClient        Connection failure: Failed to connect to ...

Also interesting to know that it's only one connection per datanode from the Presto servers.

This is potentially a consequence of adding the 10 Presto nodes: in the same timeframe as the addition of the Presto nodes we can see some spikes of under-replicated blocks: https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=1674431692162&to=1674855030508 (as if there was some transient unavailability of datanodes).

This will have to be monitored if we add back those nodes.

I will decline the ticket. The investigation will have to be continued in another ticket.

Change 888214 merged by Nicolas Fraison:

[operations/puppet@production] feat(presto): add gc logs

https://gerrit.wikimedia.org/r/888214

Mentioned in SAL (#wikimedia-analytics) [2023-02-13T08:59:54Z] <nfraison> restarting presto-worker on an-test-presto1001 to pick up new gc logging settings T329054

Mentioned in SAL (#wikimedia-analytics) [2023-02-13T09:04:33Z] <nfraison> restarting presto-coordinator on an-test-coord1001 to pick up new gc logging settings T329054

GC logging was applied correctly on the worker but not on the coordinator, due to a permission denied error:

Feb 13 09:05:03 an-test-coord1001 presto-server[37018]: OpenJDK 64-Bit Server VM warning: Cannot open file /srv/presto/var/log/gc.log due to Permission denied

In prod there are also some nodes with the appropriate rights on /srv/presto/var (presto:presto) and some with the wrong rights on it (root:root).
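
A quick way to spot and, if needed, ad hoc fix the mismatch on a node (the durable fix is the puppet change below):

  # Show current ownership of the presto var/log directories
  ls -ld /srv/presto/var /srv/presto/var/log
  # Ad hoc fix so the presto user can write gc.log; puppet should enforce this going forward
  sudo chown -R presto:presto /srv/presto/var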

Change 888636 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/puppet@production] fix(presto): ensure log folder has appropriate right

https://gerrit.wikimedia.org/r/888636

Change 888636 merged by Nicolas Fraison:

[operations/puppet@production] fix(presto): ensure log folder has appropriate right

https://gerrit.wikimedia.org/r/888636

Mentioned in SAL (#wikimedia-analytics) [2023-02-13T09:59:44Z] <nfraison> restarting presto-coordinator on an-coord1001 to pick up new gc logging settings T329054

Mentioned in SAL (#wikimedia-analytics) [2023-02-13T10:46:13Z] <nfraison> restarting presto-worker on an-presto[1001-1015].eqiad.wmnet to pick up new gc logging settings T329054