
[Data Platform] Test Alluxio as cache layer for Presto
Open, Medium, Public

Assigned To
None
Authored By
elukey
Oct 28 2020, 9:10 AM
Referenced Files
F34676274: image.png
Oct 6 2021, 4:38 PM
F34676262: image.png
Oct 6 2021, 4:32 PM
F34638129: image.png
Sep 9 2021, 11:45 AM
F34618942: image.png
Aug 24 2021, 5:27 PM
F34618938: image.png
Aug 24 2021, 5:27 PM
F34598202: image.png
Aug 17 2021, 11:22 AM

Description

In T256108 we were wondering whether it was worthwhile to co-locate Presto with YARN on the Hadoop worker nodes. The alternative would be to keep separate nodes and use Alluxio as a caching layer for HDFS.

Alluxio is packaged in Bigtop, so after the upgrade we could test it and see how it performs. The current Presto bottleneck is all the data moving between the Presto workers and the HDFS nodes (it is entirely network bound). Alluxio would alleviate the problem by caching HDFS data in the Presto workers' RAM.
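For reference, wiring Alluxio up this way is roughly a matter of mounting HDFS as the under storage and giving each worker a RAM tier. A minimal sketch of an alluxio-site.properties fragment follows; the property names are from the Alluxio 2.x documentation, but the values (UFS path, ramdisk size) are illustrative, not our actual config:

```properties
# Illustrative alluxio-site.properties fragment (values are assumptions, not our config)
# Mount HDFS as the root under file system (UFS)
alluxio.master.mount.table.root.ufs=hdfs://analytics-test-hadoop/wmf/alluxio/root
# Single-tier worker storage backed by RAM
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.ramdisk.size=16GB
```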

Event Timeline


Change 724407 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the ferm configuration for alluxio workers

https://gerrit.wikimedia.org/r/724407

Change 724407 merged by Btullis:

[operations/puppet@production] Fix the ferm configuration for alluxio workers

https://gerrit.wikimedia.org/r/724407

I have started running the master process manually as the alluxio user on an-test-coord1001 with the following command.

alluxio@an-test-coord1001:~$ kerberos-run-command alluxio /usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java -cp /usr/lib/alluxio/conf/::/usr/lib/alluxio/assembly/server/target/alluxio-assembly-server-2.4.1-jar-with-dependencies.jar -Dalluxio.home=/usr/lib/alluxio -Dalluxio.conf.dir=/usr/lib/alluxio/conf -Dalluxio.logs.dir=/usr/lib/alluxio/logs -Dalluxio.user.logs.dir=/usr/lib/alluxio/logs/user -Dlog4j.configuration=file:/usr/lib/alluxio/conf/log4j.properties -Dorg.apache.jasper.compiler.disablejsr199=true -Djava.net.preferIPv4Stack=true -Dorg.apache.ratis.thirdparty.io.netty.allocator.useCacheForAllThreads=false -Dalluxio.logger.type=MASTER_LOGGER -Dalluxio.master.audit.logger.type=MASTER_AUDIT_LOGGER -Xmx8g -XX:MetaspaceSize=256M alluxio.master.AlluxioMaster

Currently investigating a permissions error, which I believe to be on the local file system, as opposed to HDFS.

I have instead been running the commands with:

an-test-coord1001: bash -x /usr/lib/alluxio/bin/alluxio-start.sh master
an-test-presto1001: bash -x /usr/lib/alluxio/bin/alluxio-start.sh worker SudoMount

I have temporarily disabled puppet on an-test-presto1001 so that I can test a required sudoers entry:

alluxio ALL=(ALL) NOPASSWD: /bin/mount * /mnt/ramdisk, /bin/umount * /mnt/ramdisk, /bin/mkdir * /mnt/ramdisk, /bin/chmod * /mnt/ramdisk

I have formatted the master with the following command as the alluxio user on an-test-coord1001: alluxio formatMasters
The output was as shown below.

I'll keep looking into the warnings, as I had thought that at least the native library would have been found.

alluxio@an-test-coord1001:/etc/alluxio/conf$ alluxio formatMasters
Formatting Alluxio Master @ an-test-coord1001.eqiad.wmnet
2021-09-30 13:10:55,807 INFO  Format - Formatting master journal: hdfs://analytics-test-hadoop/wmf/alluxio/journal/
2021-09-30 13:10:55,845 INFO  ExtensionFactoryRegistry - Loading core jars from /usr/lib/alluxio/lib
2021-09-30 13:10:55,881 INFO  ExtensionFactoryRegistry - Loading extension jars from /usr/lib/alluxio/extensions
2021-09-30 13:10:55,985 WARN  HdfsUnderFileSystem - Cannot create SupportedHdfsAclProvider. HDFS ACLs will not be supported.
2021-09-30 13:10:56,051 WARN  HdfsUnderFileSystem - Cannot create SupportedHdfsActiveSyncProvider.HDFS ActiveSync will not be supported.
2021-09-30 13:10:56,065 INFO  ExtensionFactoryRegistry - Loading core jars from /usr/lib/alluxio/lib
2021-09-30 13:10:56,076 INFO  ExtensionFactoryRegistry - Loading extension jars from /usr/lib/alluxio/extensions
2021-09-30 13:10:56,156 WARN  HdfsUnderFileSystem - Cannot create SupportedHdfsAclProvider. HDFS ACLs will not be supported.
2021-09-30 13:10:56,174 WARN  NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-09-30 13:10:56,186 WARN  HdfsUnderFileSystem - Cannot create SupportedHdfsActiveSyncProvider.HDFS ActiveSync will not be supported.
2021-09-30 13:10:56,188 INFO  ExtensionFactoryRegistry - Loading core jars from /usr/lib/alluxio/lib
2021-09-30 13:10:56,198 INFO  ExtensionFactoryRegistry - Loading extension jars from /usr/lib/alluxio/extensions
2021-09-30 13:10:56,276 WARN  HdfsUnderFileSystem - Cannot create SupportedHdfsAclProvider. HDFS ACLs will not be supported.
2021-09-30 13:10:56,295 WARN  NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-09-30 13:10:56,306 WARN  HdfsUnderFileSystem - Cannot create SupportedHdfsActiveSyncProvider.HDFS ActiveSync will not be supported.
2021-09-30 13:10:56,308 INFO  ExtensionFactoryRegistry - Loading core jars from /usr/lib/alluxio/lib
2021-09-30 13:10:56,319 INFO  ExtensionFactoryRegistry - Loading extension jars from /usr/lib/alluxio/extensions
2021-09-30 13:10:56,397 WARN  HdfsUnderFileSystem - Cannot create SupportedHdfsAclProvider. HDFS ACLs will not be supported.
2021-09-30 13:10:56,414 WARN  NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-09-30 13:10:56,425 WARN  HdfsUnderFileSystem - Cannot create SupportedHdfsActiveSyncProvider.HDFS ActiveSync will not be supported.
2021-09-30 13:10:56,427 INFO  ExtensionFactoryRegistry - Loading core jars from /usr/lib/alluxio/lib
2021-09-30 13:10:56,436 INFO  ExtensionFactoryRegistry - Loading extension jars from /usr/lib/alluxio/extensions
2021-09-30 13:10:56,583 WARN  HdfsUnderFileSystem - Cannot create SupportedHdfsAclProvider. HDFS ACLs will not be supported.
2021-09-30 13:10:56,603 WARN  NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-09-30 13:10:56,614 WARN  HdfsUnderFileSystem - Cannot create SupportedHdfsActiveSyncProvider.HDFS ActiveSync will not be supported.
2021-09-30 13:10:56,614 INFO  UfsJournal - Formatting hdfs://analytics-test-hadoop/wmf/alluxio/journal/BlockMaster/v1
2021-09-30 13:10:57,297 INFO  UfsJournal - Formatting hdfs://analytics-test-hadoop/wmf/alluxio/journal/TableMaster/v1
2021-09-30 13:10:57,789 INFO  UfsJournal - Formatting hdfs://analytics-test-hadoop/wmf/alluxio/journal/FileSystemMaster/v1
2021-09-30 13:10:58,246 INFO  UfsJournal - Formatting hdfs://analytics-test-hadoop/wmf/alluxio/journal/MetaMaster/v1
2021-09-30 13:10:58,791 INFO  UfsJournal - Formatting hdfs://analytics-test-hadoop/wmf/alluxio/journal/MetricsMaster/v1
2021-09-30 13:10:59,212 INFO  Format - Formatting complete

I was able to execute the alluxio runTests command:

alluxio@an-test-coord1001:/etc/alluxio/conf$ alluxio runTests
2021-09-30 13:33:08,712 INFO  ZkMasterInquireClient - Creating new zookeeper client for zk@an-test-druid1001.eqiad.wmnet/alluxio/leader
2021-09-30 13:33:08,749 INFO  ZooKeeper - Client environment:zookeeper.version=3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:53 GMT
2021-09-30 13:33:08,749 INFO  ZooKeeper - Client environment:host.name=an-test-coord1001.eqiad.wmnet
2021-09-30 13:33:08,749 INFO  ZooKeeper - Client environment:java.version=1.8.0_302
2021-09-30 13:33:08,749 INFO  ZooKeeper - Client environment:java.vendor=Oracle Corporation
2021-09-30 13:33:08,749 INFO  ZooKeeper - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2021-09-30 13:33:08,749 INFO  ZooKeeper - Client environment:java.class.path=/usr/lib/alluxio/conf/::/usr/lib/alluxio/assembly/client/target/alluxio-assembly-client-2.4.1-jar-with-dependencies.jar:/usr/lib/alluxio/lib/alluxio-integration-tools-validation-2.4.1.jar
2021-09-30 13:33:08,749 INFO  ZooKeeper - Client environment:java.library.path=/usr/lib/hadoop/lib/native/
2021-09-30 13:33:08,749 INFO  ZooKeeper - Client environment:java.io.tmpdir=/tmp
2021-09-30 13:33:08,749 INFO  ZooKeeper - Client environment:java.compiler=<NA>
2021-09-30 13:33:08,749 INFO  ZooKeeper - Client environment:os.name=Linux
2021-09-30 13:33:08,749 INFO  ZooKeeper - Client environment:os.arch=amd64
2021-09-30 13:33:08,749 INFO  ZooKeeper - Client environment:os.version=4.19.0-16-amd64
2021-09-30 13:33:08,749 INFO  ZooKeeper - Client environment:user.name=alluxio
2021-09-30 13:33:08,750 INFO  ZooKeeper - Client environment:user.home=/var/lib/alluxio
2021-09-30 13:33:08,750 INFO  ZooKeeper - Client environment:user.dir=/etc/alluxio/conf.analytics-test-hadoop
2021-09-30 13:33:08,750 INFO  ZooKeeper - Client environment:os.memory.free=1856MB
2021-09-30 13:33:08,750 INFO  ZooKeeper - Client environment:os.memory.max=27305MB
2021-09-30 13:33:08,750 INFO  ZooKeeper - Client environment:os.memory.total=1926MB
2021-09-30 13:33:08,750 INFO  Compatibility - Using emulated InjectSessionExpiration
2021-09-30 13:33:08,797 INFO  TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=an-test-coord1001.eqiad.wmnet, rack=null)
2021-09-30 13:33:08,812 INFO  CuratorFrameworkImpl - Starting
2021-09-30 13:33:08,816 INFO  X509Util - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
2021-09-30 13:33:08,818 INFO  ZooKeeper - Initiating client connection, connectString=an-test-druid1001.eqiad.wmnet sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@4ae3c1cd
2021-09-30 13:33:08,821 INFO  ClientCnxnSocket - jute.maxbuffer value is 4194304 Bytes
2021-09-30 13:33:08,826 INFO  ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=
2021-09-30 13:33:08,832 INFO  CuratorFrameworkImpl - Default schema
2021-09-30 13:33:08,837 INFO  ClientCnxn - Opening socket connection to server an-test-druid1001.eqiad.wmnet/10.64.53.6:2181. Will not attempt to authenticate using SASL (unknown error)
2021-09-30 13:33:08,843 INFO  ClientCnxn - Socket connection established, initiating session, client: /10.64.53.41:32994, server: an-test-druid1001.eqiad.wmnet/10.64.53.6:2181
2021-09-30 13:33:08,851 INFO  ClientCnxn - Session establishment complete on server an-test-druid1001.eqiad.wmnet/10.64.53.6:2181, sessionid = 0x1036c6f602300c4, negotiated timeout = 40000
2021-09-30 13:33:08,861 INFO  ConnectionStateManager - State change: CONNECTED
2021-09-30 13:33:08,981 INFO  NettyUtils - EPOLL_MODE is available
runTest --operation BASIC --readType CACHE_PROMOTE --writeType MUST_CACHE
2021-09-30 13:33:10,420 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_CACHE_PROMOTE_MUST_CACHE took 701 ms.
2021-09-30 13:33:10,524 INFO  BasicOperations - readFile file /default_tests_files/BASIC_CACHE_PROMOTE_MUST_CACHE took 104 ms.
Passed the test!
runTest --operation BASIC_NON_BYTE_BUFFER --readType CACHE_PROMOTE --writeType MUST_CACHE
2021-09-30 13:33:10,582 INFO  BasicNonByteBufferOperations - writeFile to file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_PROMOTE_MUST_CACHE took 40 ms.
2021-09-30 13:33:10,604 INFO  BasicNonByteBufferOperations - readFile file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_PROMOTE_MUST_CACHE took 21 ms.
Passed the test!
runTest --operation BASIC --readType CACHE_PROMOTE --writeType CACHE_THROUGH
2021-09-30 13:33:12,241 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_CACHE_PROMOTE_CACHE_THROUGH took 1637 ms.
2021-09-30 13:33:12,263 INFO  BasicOperations - readFile file /default_tests_files/BASIC_CACHE_PROMOTE_CACHE_THROUGH took 22 ms.
Passed the test!
runTest --operation BASIC_NON_BYTE_BUFFER --readType CACHE_PROMOTE --writeType CACHE_THROUGH
2021-09-30 13:33:12,445 INFO  BasicNonByteBufferOperations - writeFile to file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_PROMOTE_CACHE_THROUGH took 155 ms.
2021-09-30 13:33:12,462 INFO  BasicNonByteBufferOperations - readFile file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_PROMOTE_CACHE_THROUGH took 17 ms.
Passed the test!
runTest --operation BASIC --readType CACHE_PROMOTE --writeType THROUGH
2021-09-30 13:33:12,589 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_CACHE_PROMOTE_THROUGH took 127 ms.
2021-09-30 13:33:12,697 INFO  BasicOperations - readFile file /default_tests_files/BASIC_CACHE_PROMOTE_THROUGH took 108 ms.
Passed the test!
runTest --operation BASIC_NON_BYTE_BUFFER --readType CACHE_PROMOTE --writeType THROUGH
2021-09-30 13:33:12,841 INFO  BasicNonByteBufferOperations - writeFile to file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_PROMOTE_THROUGH took 117 ms.
2021-09-30 13:33:12,883 INFO  BasicNonByteBufferOperations - readFile file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_PROMOTE_THROUGH took 41 ms.
Passed the test!
runTest --operation BASIC --readType CACHE_PROMOTE --writeType ASYNC_THROUGH
2021-09-30 13:33:12,945 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_CACHE_PROMOTE_ASYNC_THROUGH took 62 ms.
2021-09-30 13:33:12,959 INFO  BasicOperations - readFile file /default_tests_files/BASIC_CACHE_PROMOTE_ASYNC_THROUGH took 13 ms.
Passed the test!
runTest --operation BASIC_NON_BYTE_BUFFER --readType CACHE_PROMOTE --writeType ASYNC_THROUGH
2021-09-30 13:33:13,007 INFO  BasicNonByteBufferOperations - writeFile to file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_PROMOTE_ASYNC_THROUGH took 29 ms.
2021-09-30 13:33:13,023 INFO  BasicNonByteBufferOperations - readFile file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_PROMOTE_ASYNC_THROUGH took 15 ms.
Passed the test!
runTest --operation BASIC --readType CACHE --writeType MUST_CACHE
2021-09-30 13:33:13,100 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_CACHE_MUST_CACHE took 76 ms.
2021-09-30 13:33:13,113 INFO  BasicOperations - readFile file /default_tests_files/BASIC_CACHE_MUST_CACHE took 13 ms.
Passed the test!
runTest --operation BASIC_NON_BYTE_BUFFER --readType CACHE --writeType MUST_CACHE
2021-09-30 13:33:13,226 INFO  BasicNonByteBufferOperations - writeFile to file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_MUST_CACHE took 94 ms.
2021-09-30 13:33:13,237 INFO  BasicNonByteBufferOperations - readFile file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_MUST_CACHE took 11 ms.
Passed the test!
runTest --operation BASIC --readType CACHE --writeType CACHE_THROUGH
2021-09-30 13:33:13,357 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_CACHE_CACHE_THROUGH took 119 ms.
2021-09-30 13:33:13,369 INFO  BasicOperations - readFile file /default_tests_files/BASIC_CACHE_CACHE_THROUGH took 12 ms.
Passed the test!
runTest --operation BASIC_NON_BYTE_BUFFER --readType CACHE --writeType CACHE_THROUGH
2021-09-30 13:33:13,537 INFO  BasicNonByteBufferOperations - writeFile to file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_CACHE_THROUGH took 144 ms.
2021-09-30 13:33:13,549 INFO  BasicNonByteBufferOperations - readFile file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_CACHE_THROUGH took 12 ms.
Passed the test!
runTest --operation BASIC --readType CACHE --writeType THROUGH
2021-09-30 13:33:13,652 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_CACHE_THROUGH took 102 ms.
2021-09-30 13:33:13,686 INFO  BasicOperations - readFile file /default_tests_files/BASIC_CACHE_THROUGH took 34 ms.
Passed the test!
runTest --operation BASIC_NON_BYTE_BUFFER --readType CACHE --writeType THROUGH
2021-09-30 13:33:13,806 INFO  BasicNonByteBufferOperations - writeFile to file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_THROUGH took 99 ms.
2021-09-30 13:33:13,835 INFO  BasicNonByteBufferOperations - readFile file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_THROUGH took 29 ms.
Passed the test!
runTest --operation BASIC --readType CACHE --writeType ASYNC_THROUGH
2021-09-30 13:33:13,871 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_CACHE_ASYNC_THROUGH took 36 ms.
2021-09-30 13:33:13,884 INFO  BasicOperations - readFile file /default_tests_files/BASIC_CACHE_ASYNC_THROUGH took 12 ms.
Passed the test!
runTest --operation BASIC_NON_BYTE_BUFFER --readType CACHE --writeType ASYNC_THROUGH
2021-09-30 13:33:13,973 INFO  BasicNonByteBufferOperations - writeFile to file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_ASYNC_THROUGH took 71 ms.
2021-09-30 13:33:13,984 INFO  BasicNonByteBufferOperations - readFile file /default_tests_files/BASIC_NON_BYTE_BUFFER_CACHE_ASYNC_THROUGH took 11 ms.
Passed the test!
runTest --operation BASIC --readType NO_CACHE --writeType MUST_CACHE
2021-09-30 13:33:14,018 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_NO_CACHE_MUST_CACHE took 34 ms.
2021-09-30 13:33:14,029 INFO  BasicOperations - readFile file /default_tests_files/BASIC_NO_CACHE_MUST_CACHE took 11 ms.
Passed the test!
runTest --operation BASIC_NON_BYTE_BUFFER --readType NO_CACHE --writeType MUST_CACHE
2021-09-30 13:33:14,128 INFO  BasicNonByteBufferOperations - writeFile to file /default_tests_files/BASIC_NON_BYTE_BUFFER_NO_CACHE_MUST_CACHE took 83 ms.
2021-09-30 13:33:14,138 INFO  BasicNonByteBufferOperations - readFile file /default_tests_files/BASIC_NON_BYTE_BUFFER_NO_CACHE_MUST_CACHE took 10 ms.
Passed the test!
runTest --operation BASIC --readType NO_CACHE --writeType CACHE_THROUGH
2021-09-30 13:33:14,278 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_NO_CACHE_CACHE_THROUGH took 140 ms.
2021-09-30 13:33:14,288 INFO  BasicOperations - readFile file /default_tests_files/BASIC_NO_CACHE_CACHE_THROUGH took 9 ms.
Passed the test!
runTest --operation BASIC_NON_BYTE_BUFFER --readType NO_CACHE --writeType CACHE_THROUGH
2021-09-30 13:33:14,425 INFO  BasicNonByteBufferOperations - writeFile to file /default_tests_files/BASIC_NON_BYTE_BUFFER_NO_CACHE_CACHE_THROUGH took 119 ms.
2021-09-30 13:33:14,435 INFO  BasicNonByteBufferOperations - readFile file /default_tests_files/BASIC_NON_BYTE_BUFFER_NO_CACHE_CACHE_THROUGH took 10 ms.
Passed the test!
runTest --operation BASIC --readType NO_CACHE --writeType THROUGH
2021-09-30 13:33:14,557 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_NO_CACHE_THROUGH took 122 ms.
2021-09-30 13:33:14,579 INFO  BasicOperations - readFile file /default_tests_files/BASIC_NO_CACHE_THROUGH took 22 ms.
Passed the test!
runTest --operation BASIC_NON_BYTE_BUFFER --readType NO_CACHE --writeType THROUGH
2021-09-30 13:33:14,690 INFO  BasicNonByteBufferOperations - writeFile to file /default_tests_files/BASIC_NON_BYTE_BUFFER_NO_CACHE_THROUGH took 92 ms.
2021-09-30 13:33:14,715 INFO  BasicNonByteBufferOperations - readFile file /default_tests_files/BASIC_NON_BYTE_BUFFER_NO_CACHE_THROUGH took 25 ms.
Passed the test!
runTest --operation BASIC --readType NO_CACHE --writeType ASYNC_THROUGH
2021-09-30 13:33:14,749 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_NO_CACHE_ASYNC_THROUGH took 34 ms.
2021-09-30 13:33:14,757 INFO  BasicOperations - readFile file /default_tests_files/BASIC_NO_CACHE_ASYNC_THROUGH took 8 ms.
Passed the test!
runTest --operation BASIC_NON_BYTE_BUFFER --readType NO_CACHE --writeType ASYNC_THROUGH
2021-09-30 13:33:14,797 INFO  BasicNonByteBufferOperations - writeFile to file /default_tests_files/BASIC_NON_BYTE_BUFFER_NO_CACHE_ASYNC_THROUGH took 26 ms.
2021-09-30 13:33:14,805 INFO  BasicNonByteBufferOperations - readFile file /default_tests_files/BASIC_NON_BYTE_BUFFER_NO_CACHE_ASYNC_THROUGH took 8 ms.
Passed the test!
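To compare the write/read timings across read and write types without eyeballing the log, a quick parse helps. Here is a minimal sketch; the sample lines are copied from the runTests output above, and the regex is my assumption about the log format:

```python
import re
from collections import defaultdict

# Sample lines copied from the `alluxio runTests` output above.
LOG = """\
2021-09-30 13:33:10,420 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_CACHE_PROMOTE_MUST_CACHE took 701 ms.
2021-09-30 13:33:10,524 INFO  BasicOperations - readFile file /default_tests_files/BASIC_CACHE_PROMOTE_MUST_CACHE took 104 ms.
2021-09-30 13:33:12,241 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_CACHE_PROMOTE_CACHE_THROUGH took 1637 ms.
2021-09-30 13:33:12,945 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_CACHE_PROMOTE_ASYNC_THROUGH took 62 ms.
"""

# Matches both "writeFile to file <path> took N ms" and "readFile file <path> took N ms".
PATTERN = re.compile(r"(writeFile|readFile) (?:to )?file (\S+) took (\d+) ms")

def timings(log: str) -> dict:
    """Map each test name to its write/read durations in milliseconds."""
    out = defaultdict(dict)
    for op, path, ms in PATTERN.findall(log):
        out[path.rsplit("/", 1)[-1]][op] = int(ms)
    return dict(out)

for name, t in sorted(timings(LOG).items()):
    print(name, t)
```

This makes it easy to spot, for instance, that CACHE_THROUGH writes pay the HDFS round trip while MUST_CACHE writes stay in RAM.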

I have also started the job_master process on the master node.

bash -x /usr/lib/alluxio/bin/alluxio-start.sh job_master

There is currently an issue with running these scripts: they prompt for a password. I'm not yet sure where it comes from, but I think it's from a kinit somewhere; it's not from sudo.
The process runs even without entering a password, but the tty is still captured, so I'm still investigating this.

I have made some more progress on this, but it is still fairly slow going.
Firstly, I have tried the vanilla download of Alluxio 2.6.2 instead of our packaged version.
It would appear that our package somehow omitted the /webui directory, so there was no web interface on the master's port 19999.

I'm currently working with an extracted tarball in /home/btullis/alluxio-2.6.2/, with conf symlinked to /etc/alluxio/conf and logs symlinked to /var/log/alluxio.
I also had to create a metadata directory in the extracted directory and configure it to be owned by alluxio:alluxio. This is because version 2.6.2 uses a RocksDB instance for its metastore: https://docs.alluxio.io/os/user/stable/en/operation/Metastore.html
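The metastore change boils down to something like the following alluxio-site.properties fragment. The property names are from the linked Metastore documentation; the directory path is an assumption based on where I extracted the tarball:

```properties
# Illustrative metastore settings (directory path is an assumption)
alluxio.master.metastore=ROCKS
alluxio.master.metastore.dir=/home/btullis/alluxio-2.6.2/metastore
```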

The commands I am using to start and stop the services are as follows:

an-test-coord1001

Start the Master
bin/alluxio-start.sh -a master

Start the Job Master
bin/alluxio-start.sh -a job_master

Stop the master
bin/alluxio-stop.sh master

Stop the Job Master
bin/alluxio-stop.sh job_master

an-test-presto1001

Start the Worker
bin/alluxio-start.sh -a worker SudoMount

Start the Job Worker
bin/alluxio-start.sh -a job_worker

Stop the Worker
bin/alluxio-stop.sh worker

Stop the Job Worker
bin/alluxio-stop.sh job_worker

Next I am trying to attach a Hive database as a UDB, following these instructions: https://docs.alluxio.io/os/user/stable/en/core-services/Catalog.html#attaching-databases

However, it's not yet working.

alluxio@an-test-presto1001:/home/btullis/alluxio-2.6.2$ bin/alluxio table attachdb --db alluxio_event hive thrift://analytics-test-hive.eqiad.wmnet:9083 event_sanitized
Failed to connect underDb for Alluxio db 'alluxio_event': Failed to get hive database event_sanitized. null

I have posted several more questions to the Alluxio Slack workspace. I have a feeling that the error above is related to Kerberos: HDFS access with Kerberos is working, but we have not specified a keytab to use for Hive access.

Here is the full stacktrace from the master.log file for this operation.

2021-10-06 13:59:05,016 ERROR AlluxioCatalog - Sync (during attach) failed for db 'alluxio_event'.
java.io.IOException: Failed to get hive database default. null
	at alluxio.table.under.hive.HiveDatabase.getDatabaseInfo(HiveDatabase.java:137)
	at alluxio.master.table.Database.sync(Database.java:226)
	at alluxio.master.table.AlluxioCatalog.attachDatabase(AlluxioCatalog.java:126)
	at alluxio.master.table.DefaultTableMaster.attachDatabase(DefaultTableMaster.java:85)
	at alluxio.master.table.TableMasterClientServiceHandler.lambda$attachDatabase$0(TableMasterClientServiceHandler.java:74)
	at alluxio.RpcUtils.callAndReturn(RpcUtils.java:121)
	at alluxio.RpcUtils.call(RpcUtils.java:83)
	at alluxio.RpcUtils.call(RpcUtils.java:58)
	at alluxio.master.table.TableMasterClientServiceHandler.attachDatabase(TableMasterClientServiceHandler.java:72)
	at alluxio.grpc.table.TableMasterClientServiceGrpc$MethodHandlers.invoke(TableMasterClientServiceGrpc.java:1135)
	at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
	at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
	at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
	at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
	at alluxio.security.authentication.AuthenticatedUserInjector$1.onHalfClose(AuthenticatedUserInjector.java:67)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:797)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
	at alluxio.concurrent.jsr.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1378)
	at alluxio.concurrent.jsr.ForkJoinTask.doExec(ForkJoinTask.java:609)
	at alluxio.concurrent.jsr.ForkJoinPool.runWorker(ForkJoinPool.java:1356)
	at alluxio.concurrent.jsr.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:131)
Caused by: org.apache.thrift.transport.TTransportException
	at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
	at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
	at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_database(ThriftHiveMetastore.java:782)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_database(ThriftHiveMetastore.java:769)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:1290)
	at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:169)
	at com.sun.proxy.$Proxy74.getDatabase(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at alluxio.table.under.hive.util.CompatibleMetastoreClient.invoke(CompatibleMetastoreClient.java:39)
	at com.sun.proxy.$Proxy74.getDatabase(Unknown Source)
	at alluxio.table.under.hive.HiveDatabase.getDatabaseInfo(HiveDatabase.java:128)
	... 22 more
2021-10-06 13:59:05,040 WARN  TableMasterClientServiceHandler - Exit (Error): attachDatabase: , Error=java.io.IOException: Failed to connect underDb for Alluxio db 'alluxio_event': Failed to get hive database default. null

This is also an interesting error. We have generated a keytab for each host that accesses HDFS, but the doctor output states that this configuration item is required to be identical across servers. I don't know whether this is causing an issue or is just a warning.

alluxio@an-test-coord1001:/home/btullis/alluxio-2.6.2$ bin/alluxio fsadmin doctor
Server-side configuration errors (those properties are required to be identical): 
key: alluxio.master.mount.table.root.option.alluxio.security.underfs.hdfs.kerberos.client.principal
    value: alluxio/an-test-coord1001.eqiad.wmnet@WIKIMEDIA (an-test-coord1001.eqiad.wmnet:19998)
    value: alluxio/an-test-presto1001.eqiad.wmnet@WIKIMEDIA (an-test-presto1001.eqiad.wmnet:29999)
All worker storage paths are in working state.

Hmm. Not looking good. The word on the street is that the Alluxio Catalog Service doesn't support kerberized Hive.

image.png (100×979 px, 28 KB)

https://app.slack.com/client/TEXALQC8J/CEXGGUBDK/thread/CEXGGUBDK-1633528717.482200

There is still the option of creating Hive tables that point to Alluxio locations, but that is a compromise compared with what we wanted to achieve.

image.png (658×997 px, 165 KB)

Instead of Alluxio as a caching layer, we might like to look at the caching features of the Hive connector that is available in Trino: https://trino.io/docs/current/connector/hive-caching.html

We are already working to T266640: Decide whether to migrate from Presto to Trino (cc @razzi) so this feature of Trino might be a good differentiator.

The cache architecture section of the Hive connector documentation for Trino states the following:

Caching can operate in two modes. The async mode provides the queried data directly and caches any objects asynchronously afterwards. Async is the default and recommended mode. The query doesn’t pay the cost of warming up the cache. The cache is populated in the background and the query bypasses the cache if the cache is not already populated. Any following queries requesting the cached objects are served directly from the cache.

The other mode is a read-through cache. In this mode, if an object is not found in the cache, it is read from the storage, placed in the cache, and then provided to the requesting query. In read-through mode, the query always reads from cache and must wait for the cache to be populated.

In both modes, objects are cached on local storage of each worker. Workers can request cached objects from other workers to avoid requests from the object storage.

The cache chunks are 1MB in size and are well suited for ORC or Parquet file formats.
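Enabling this in Trino is, per the linked documentation, a catalog-level setting on the Hive connector. A minimal sketch follows; the property names are taken from that documentation page, and the cache location is an illustrative value, not a path on our hosts:

```properties
# Illustrative catalog/hive.properties fragment for Trino's Hive connector caching
hive.cache.enabled=true
# Local directory on each worker used for cached chunks (illustrative path)
hive.cache.location=/var/cache/trino
```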

I'll start researching whether Kerberos, user impersonation, and access control would operate in the manner we need.

Unfortunately, that's a no on all three counts.
https://trino.io/docs/current/connector/hive-caching.html#limitations

Limitations

Caching does not support user impersonation and cannot be used with HDFS secured by Kerberos. It does not take any user-specific access rights to the object storage into account.
The cached objects are simply transparent binary blobs to the caching system and full access to all content is available.

Looking at the details of the JMX monitoring and the GitHub history, it would appear that Trino merged RubiX into their codebase.

The same limitations are present in their Starburst Enterprise Presto product: https://docs.starburst.io/latest/connector/hive-caching.html#limitations

Do we want to revisit the idea of T256108: Co-locate Presto with Hadoop worker nodes?

@JAllemandou spoke of an alternative solution: create a second Hadoop cluster that is essentially co-located with the Presto nodes.
Data would then be regularly synced from (let's call it) the primary cluster to the Presto cluster by some means.

Change 731115 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove alluxio from the test cluster

https://gerrit.wikimedia.org/r/731115

Change 731115 merged by Btullis:

[operations/puppet@production] Remove alluxio resources from puppet

https://gerrit.wikimedia.org/r/731115

I have deployed a patch to remove the alluxio resources from puppet, given that it's not going to be able to meet our needs.

I will remove the packages and any remnants manually from an-test-coord1001 and an-test-presto1001.

I need to roll-restart the Hadoop masters in both the analytics and test clusters before removing the alluxio user and group in a subsequent commit.

Change 732296 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove the alluxio user and group

https://gerrit.wikimedia.org/r/732296

Change 732296 merged by Btullis:

[operations/puppet@production] Remove the alluxio user and group

https://gerrit.wikimedia.org/r/732296

Change 732719 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove all remaining references to alluxio

https://gerrit.wikimedia.org/r/732719

I have checked that puppet has run recently and successfully on all 102 servers that have the profile bigtop::alluxio::user applied.
This ensures that the user and group will have been removed by the change that marked those resources as absent in that class.

btullis@cumin1001:~$ sudo cumin --no-progress C:bigtop::alluxio::user "/usr/local/lib/nagios/plugins/check_puppetrun -c 3600 -w 2000"
102 hosts will be targeted:
an-airflow1001.eqiad.wmnet,an-coord[1001-1002].eqiad.wmnet,an-launcher1002.eqiad.wmnet,an-master[1001-1002].eqiad.wmnet,an-test-client1001.eqiad.wmnet,an-test-coord1001.eqiad.wmnet,an-test-master[1001-1002].eqiad.wmnet,an-test-worker[1001-1003].eqiad.wmnet,an-worker[1078-1141].eqiad.wmnet,analytics[1058-1077].eqiad.wmnet,stat[1004-1008].eqiad.wmnet
Ok to proceed on 102 hosts? Enter the number of affected hosts to confirm or "q" to quit 102
===== NODE GROUP =====
(1) analytics1072.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 26 minutes ago with 0 failures
===== NODE GROUP =====
(2) an-worker[1078-1079].eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 15 minutes ago with 0 failures
===== NODE GROUP =====
(1) an-master1001.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures
===== NODE GROUP =====
(1) an-test-worker1003.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 17 minutes ago with 0 failures
===== NODE GROUP =====
(2) an-worker[1137,1141].eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures
===== NODE GROUP =====
(1) analytics1075.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
===== NODE GROUP =====
(3) an-launcher1002.eqiad.wmnet,an-worker1128.eqiad.wmnet,analytics1069.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures
===== NODE GROUP =====
(4) an-master1002.eqiad.wmnet,an-worker[1085,1114].eqiad.wmnet,analytics1066.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 9 minutes ago with 0 failures
===== NODE GROUP =====
(3) an-coord1002.eqiad.wmnet,an-test-client1001.eqiad.wmnet,analytics1073.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures
===== NODE GROUP =====
(2) an-worker1094.eqiad.wmnet,analytics1058.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
===== NODE GROUP =====
(4) an-worker[1093,1119,1139].eqiad.wmnet,analytics1068.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures
===== NODE GROUP =====
(3) an-worker[1080,1121,1123].eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures
===== NODE GROUP =====
(3) an-test-coord1001.eqiad.wmnet,an-worker1107.eqiad.wmnet,analytics1062.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 25 minutes ago with 0 failures
===== NODE GROUP =====
(5) an-worker[1102,1108,1133,1135].eqiad.wmnet,analytics1076.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures
===== NODE GROUP =====
(7) an-worker[1095,1113,1117,1120,1127,1129].eqiad.wmnet,analytics1074.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 8 minutes ago with 0 failures
===== NODE GROUP =====
(7) an-test-master1001.eqiad.wmnet,an-test-worker1002.eqiad.wmnet,an-worker[1088,1091,1138].eqiad.wmnet,analytics1059.eqiad.wmnet,stat1006.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
===== NODE GROUP =====
(4) an-airflow1001.eqiad.wmnet,an-worker1122.eqiad.wmnet,analytics[1064,1067].eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 14 minutes ago with 0 failures
===== NODE GROUP =====
(2) an-worker[1110,1136].eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures
===== NODE GROUP =====
(1) an-worker1111.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
===== NODE GROUP =====
(7) an-worker[1082,1106,1109,1115,1118,1140].eqiad.wmnet,analytics1065.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures
===== NODE GROUP =====
(1) stat1004.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
===== NODE GROUP =====
(3) an-worker1098.eqiad.wmnet,analytics[1060,1070].eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures
===== NODE GROUP =====
(3) an-worker[1089,1100].eqiad.wmnet,stat1007.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures
===== NODE GROUP =====
(3) an-worker[1099,1116,1132].eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures
===== NODE GROUP =====
(7) an-worker[1083,1086-1087,1101,1103,1124].eqiad.wmnet,analytics1063.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures
===== NODE GROUP =====
(6) an-test-master1002.eqiad.wmnet,an-test-worker1001.eqiad.wmnet,an-worker[1096,1130,1134].eqiad.wmnet,analytics1077.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures
===== NODE GROUP =====
(5) an-coord1001.eqiad.wmnet,an-worker[1104-1105,1112].eqiad.wmnet,stat1008.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures
===== NODE GROUP =====
(4) an-worker[1081,1090,1092,1097].eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
===== NODE GROUP =====
(7) an-worker[1084,1125-1126,1131].eqiad.wmnet,analytics[1061,1071].eqiad.wmnet,stat1005.eqiad.wmnet
----- OUTPUT of '/usr/local/lib/n... -c 3600 -w 2000' -----
OK: Puppet is currently enabled, last run 19 minutes ago with 0 failures
================
100.0% (102/102) success ratio (>= 100.0% threshold) for command: '/usr/local/lib/n... -c 3600 -w 2000'.
100.0% (102/102) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

I've checked all ext4 file systems on all of these hosts for files owned by the uid/gid 914.
There are still a few files left around belonging to the alluxio user, specifically a kerberos keytab and a stray /var/lib/alluxio directory.

The following files are still owned by the alluxio user.

btullis@cumin1001:~$ sudo cumin -x C:bigtop::alluxio::user "sudo find $(findmnt -t ext4 -o TARGET -r -n|tr '\n' ' ') -xdev -uid 914"
===== NODE GROUP =====
(1) an-test-client1001.eqiad.wmnet                                                                                                                                                                                 
----- OUTPUT of 'sudo find / /srv  -xdev -uid 914' -----                                                                                                                                                           
/var/lib/alluxio                                                                                                                                                                                                   

===== NODE GROUP =====                                                                                                                                                                                             
(1) an-test-coord1001.eqiad.wmnet                                                                                                                                                                                  
----- OUTPUT of 'sudo find / /srv  -xdev -uid 914' -----                                                                                                                                                           
/etc/security/keytabs/alluxio                                                                                                                                                                                      
/etc/security/keytabs/alluxio/alluxio.keytab                                                                                                                                                                       
/tmp/krb5cc_914

I propose to remove these manually.

As @jbond pointed out, the alluxio user and group still exist on an-test-presto1001. They have not been absented by the recent change to puppet.

btullis@an-test-presto1001:~$ grep alluxio /etc/passwd
alluxio:x:914:914:alluxio User,,,:/var/lib/alluxio:/bin/false
btullis@an-test-presto1001:~$ grep alluxio /etc/group
hadoop:x:908:yarn,mapred,hdfs,alluxio
alluxio:x:914:

I could remove these manually, but I would like to understand why puppet isn't absenting it.

I have discovered that the bigtop::alluxio::user class was never applied to presto servers, because it wasn't included in the catalog.

We might like to think about whether we want to apply the profile::analytics::cluster::users profile to presto servers in future, but for now I am happy to remove the stray user and group manually on this server.

The same keytab file has been found on this server as on an-test-coord1001, so I will remove that manually as well.

Removing files

btullis@an-test-coord1001:~$ sudo rm /etc/security/keytabs/alluxio/alluxio.keytab && sudo rmdir /etc/security/keytabs/alluxio
btullis@an-test-coord1001:~$ sudo rm /tmp/krb5cc_914

btullis@an-test-presto1001:~$ sudo rm /etc/security/keytabs/alluxio/alluxio.keytab && sudo rmdir /etc/security/keytabs/alluxio

btullis@an-test-client1001:~$ sudo rmdir /var/lib/alluxio

Removing user and group

btullis@an-test-presto1001:~$ sudo deluser alluxio
Removing user `alluxio' ...
Warning: group `alluxio' has no more members.
Done.
btullis@an-test-presto1001:~$ sudo delgroup alluxio
The group `alluxio' does not exist.
btullis@an-test-presto1001:~$ grep alluxio /etc/group
btullis@an-test-presto1001:~$ grep alluxio /etc/passwd
btullis@an-test-presto1001:~$

Change 732952 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Remove unused dummy keytabs and an SSH key for alluxio

https://gerrit.wikimedia.org/r/732952

I have deleted the kerberos principals:

btullis@krb1001:~$ sudo manage_principals.py delete alluxio/an-test-coord1001.eqiad.wmnet@WIKIMEDIA
Principal successfully deleted.
btullis@krb1001:~$ sudo manage_principals.py delete alluxio/an-test-presto1001.eqiad.wmnet@WIKIMEDIA
Principal successfully deleted.

...and the keytabs that were generated.

root@krb1001:/srv/kerberos/keytabs# rm an-test-presto1001.eqiad.wmnet/alluxio/alluxio.keytab && rmdir an-test-presto1001.eqiad.wmnet/alluxio
root@krb1001:/srv/kerberos/keytabs# rm an-test-coord1001.eqiad.wmnet/alluxio/alluxio.keytab && rmdir an-test-coord1001.eqiad.wmnet/alluxio

I have removed these keytabs from the private puppet repository, and from the dummy puppet repository.

Change 732952 merged by Btullis:

[labs/private@master] Remove unused dummy keytabs and an SSH key for alluxio

https://gerrit.wikimedia.org/r/732952

Change 732719 merged by Btullis:

[operations/puppet@production] Remove all remaining references to alluxio

https://gerrit.wikimedia.org/r/732719

BTullis edited subscribers, added: Gehel; removed: razzi.

Reopening this ticket, as we may have a way forward with using Alluxio to optimize Presto and improve performance for Superset and stat machine users.

Previously, we had focused all of our efforts on the Alluxio integration with the Hive connector for Presto: https://prestodb.io/docs/current/connector/hive.html#alluxio-configuration

We were looking at setting up an Alluxio cluster independently of Presto, but co-located on the same hosts and using the Catalog Service to support read-only workloads.
However, this all came to a halt when we ascertained that the Apache 2.0 licensed parts of Alluxio were unable to connect to an upstream Hive metastore that was protected by Kerberos authentication. This was, and remains, a feature requiring an enterprise licence.

However, it has now been brought to our attention that Presto has another mechanism for caching via Alluxio, which is already built into it.
https://prestodb.io/docs/current/cache/local.html

To quote from that page:

Presto supports caching input data with a built-in Alluxio SDK cache to reduce query latency using the Hive Connector. This built-in cache utilizes local storage (such as SSD) on each worker with configurable capacity and locations.

Note that this is a read-cache, which is completely transparent to users and fully managed by individual Presto workers.

Enabling the Alluxio SDK cache is quite simple. Include the following configuration in etc/catalog/hive.properties and restart the Presto coordinator and workers:

hive.node-selection-strategy=SOFT_AFFINITY
cache.enabled=true
cache.type=ALLUXIO
cache.alluxio.max-cache-size=500GB
cache.base-directory=/tmp/alluxio-cache

The disadvantage compared with the methods that we were looking at previously is that it is not a distributed cache. Each presto worker maintains its own independent alluxio cache, which is available only to itself.
Other methods, such as the Alluxio Cache Service and the Alluxio Catalog Service, function as a distributed cache.
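Because each worker's cache is purely local, the hive.node-selection-strategy=SOFT_AFFINITY setting is what makes it pay off: the scheduler deterministically prefers the same worker for the same split, so repeated reads of a file land on a node whose cache is already warm. As a hypothetical sketch of that idea (not Presto's actual implementation; the worker names are made up):

```shell
#!/usr/bin/env bash
# Deterministically map a split (file path) to a preferred worker by hashing
# the path, so the same split always goes to the same node's local cache.
workers=(an-presto1001 an-presto1002 an-presto1003)

preferred_worker() {
  local h
  h=$(printf '%s' "$1" | md5sum | cut -c1-8)   # first 32 bits of the hash
  echo "${workers[$(( 16#$h % ${#workers[@]} ))]}"
}

preferred_worker /wmf/data/a.parquet
preferred_worker /wmf/data/a.parquet   # same worker both times: cache hit
```

Under soft (rather than hard) affinity, Presto can still fall back to another worker if the preferred one is busy, trading a cache miss for better load balancing.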

Nevertheless, I suggest we try this at once on the test cluster. We only have one presto worker node on the test cluster and it's a VM, so the cache isn't going to make a huge difference to performance.

On the production presto cluster we currently have 15 nodes, each of which has 12 x 4 TB hard drives that are unused, so a local cache on each worker could prove very effective.

Change 940097 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable local caching for presto on the test cluster

https://gerrit.wikimedia.org/r/940097

Change 940097 merged by Btullis:

[operations/puppet@production] Enable local caching for presto on the test cluster

https://gerrit.wikimedia.org/r/940097

@JAllemandou and I got this working!

We substituted the alluxio-shaded-client jar from our presto-server package with version 2.9.3 from presto-server version 0.283 as shown.

root@an-test-presto1001:~# ls -l /usr/lib/presto/plugin/hive-hadoop2/|head -n 3
total 204768
-rw-r--r-- 1 root root   181089 Jun  7 12:55 aircompressor-0.15.jar
lrwxrwxrwx 1 root root       42 Aug 18 14:21 alluxio-shaded-client-2.9.3.jar -> /home/joal/alluxio-shaded-client-2.9.3.jar
root@an-test-presto1001:~#

We made the same modification on both the coordinator (an-test-coord1001) and the worker (an-test-presto1001) and restarted the presto-server service.

As an aside, it also worked with version alluxio-shaded-client-301.jar but that has yet to be released in a stable presto-server release.

When the cache files were created on disk, they were originally world-readable, which didn't seem suitable for our purposes.
So, for continued testing, we created a systemd override file that set the umask of the presto-server service to 0027, as shown.

root@an-test-presto1001:~# cat /etc/systemd/system/presto-server.service.d/override.conf 
[Service]
UMask=0027
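As a quick sanity check of what UMask=0027 yields (a generic shell demonstration, not tied to presto): new files come out as 0640 and new directories as 0750, i.e. no access at all for "other".

```shell
# Demonstrate the effect of umask 0027 on newly created files and directories.
(
  umask 0027
  demo=$(mktemp -d)
  touch "$demo/file"          # 666 & ~027 = 640
  mkdir "$demo/dir"           # 777 & ~027 = 750
  stat -c '%a %n' "$demo/file" "$demo/dir"
  rm -r "$demo"
)
```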

Once that change was made, we could see that the cached files were no longer readable by any process other than the presto-server service.

root@an-test-presto1001:~# tree -ugp /tmp/alluxio-cache/
/tmp/alluxio-cache/
└── [drwxr-x--- presto   presto  ]  LOCAL
    └── [drwxr-x--- presto   presto  ]  1048576
        ├── [drwxr-x--- presto   presto  ]  543
        │   └── [drwxr-x--- presto   presto  ]  8dcc5c07b6841364e5dbd2cdfeafa26b
        │       └── [-rw-r----- presto   presto  ]  0
        ├── [drwxr-x--- presto   presto  ]  599
        │   └── [drwxr-x--- presto   presto  ]  201ba184f90482d050aadd0cf7549bdd
        │       └── [-rw-r----- presto   presto  ]  0
        ├── [drwxr-x--- presto   presto  ]  609
        │   └── [drwxr-x--- presto   presto  ]  c605668aba1fff051085f5a1b68c86d3
        │       └── [-rw-r----- presto   presto  ]  0
        └── [drwxr-x--- presto   presto  ]  728
            └── [drwxr-x--- presto   presto  ]  b067baf7d70efa9babc6fa397eab8f9d
                └── [-rw-r----- presto   presto  ]  0

We also did some due diligence on the caching features, to ensure that they did not allow different users to bypass HDFS filesystem permissions via the Alluxio cache, and they passed our tests.

We also checked the available disk space on each of our presto worker nodes and verified that there is currently 17 TB of unused space available in /srv.

btullis@cumin1001:~$ sudo cumin A:presto-analytics 'df -h /srv'
15 hosts will be targeted:
an-presto[1001-1015].eqiad.wmnet
OK to proceed on 15 hosts? Enter the number of affected hosts to confirm or "q" to quit: 15
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-presto1011.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'df -h /srv' -----                                                                                                                                                                                 
Filesystem           Size  Used Avail Use% Mounted on                                                                                                                                                              
/dev/mapper/vg1-srv   18T  219M   17T   1% /srv                                                                                                                                                                    
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-presto1009.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'df -h /srv' -----                                                                                                                                                                                 
Filesystem           Size  Used Avail Use% Mounted on                                                                                                                                                              
/dev/mapper/vg1-srv   18T  215M   17T   1% /srv                                                                                                                                                                    
===== NODE GROUP =====                                                                                                                                                                                             
(2) an-presto[1003,1015].eqiad.wmnet                                                                                                                                                                               
----- OUTPUT of 'df -h /srv' -----                                                                                                                                                                                 
Filesystem           Size  Used Avail Use% Mounted on                                                                                                                                                              
/dev/mapper/vg1-srv   18T  204M   17T   1% /srv                                                                                                                                                                    
===== NODE GROUP =====                                                                                                                                                                                             
(2) an-presto[1002,1008].eqiad.wmnet                                                                                                                                                                               
----- OUTPUT of 'df -h /srv' -----                                                                                                                                                                                 
Filesystem           Size  Used Avail Use% Mounted on                                                                                                                                                              
/dev/mapper/vg1-srv   18T  208M   17T   1% /srv                                                                                                                                                                    
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-presto1004.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'df -h /srv' -----                                                                                                                                                                                 
Filesystem           Size  Used Avail Use% Mounted on                                                                                                                                                              
/dev/mapper/vg1-srv   18T  211M   17T   1% /srv                                                                                                                                                                    
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-presto1006.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'df -h /srv' -----                                                                                                                                                                                 
Filesystem           Size  Used Avail Use% Mounted on                                                                                                                                                              
/dev/mapper/vg1-srv   18T  224M   17T   1% /srv                                                                                                                                                                    
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-presto1013.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'df -h /srv' -----                                                                                                                                                                                 
Filesystem           Size  Used Avail Use% Mounted on                                                                                                                                                              
/dev/mapper/vg1-srv   18T  200M   17T   1% /srv                                                                                                                                                                    
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-presto1014.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'df -h /srv' -----                                                                                                                                                                                 
Filesystem           Size  Used Avail Use% Mounted on                                                                                                                                                              
/dev/mapper/vg1-srv   18T  203M   17T   1% /srv                                                                                                                                                                    
===== NODE GROUP =====                                                                                                                                                                                             
(2) an-presto[1010,1012].eqiad.wmnet                                                                                                                                                                               
----- OUTPUT of 'df -h /srv' -----                                                                                                                                                                                 
Filesystem           Size  Used Avail Use% Mounted on                                                                                                                                                              
/dev/mapper/vg1-srv   18T  202M   17T   1% /srv                                                                                                                                                                    
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-presto1007.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'df -h /srv' -----                                                                                                                                                                                 
Filesystem           Size  Used Avail Use% Mounted on                                                                                                                                                              
/dev/mapper/vg1-srv   18T  218M   17T   1% /srv                                                                                                                                                                    
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-presto1005.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'df -h /srv' -----                                                                                                                                                                                 
Filesystem           Size  Used Avail Use% Mounted on                                                                                                                                                              
/dev/mapper/vg1-srv   18T  210M   17T   1% /srv                                                                                                                                                                    
===== NODE GROUP =====                                                                                                                                                                                             
(1) an-presto1001.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'df -h /srv' -----                                                                                                                                                                                 
Filesystem           Size  Used Avail Use% Mounted on                                                                                                                                                              
/dev/mapper/vg1-srv   18T  205M   17T   1% /srv                                                                                                                                                                    
================

Enabling this feature could make a big difference to the performance of Superset in particular and it seems likely to be low-risk, so I would be keen for us to move forward with: T342343: Upgrade Presto to version 0.283 as soon as is practicable.

Ping @BTullis on this - could we get prioritization on either a presto version bump or an adaptation of our current version to get this enabled?
Also, putting this on the Data Engineering and Event Platform Team radar :)

Yes, will do. T342343: Upgrade Presto to version 0.283 is already in our prioritized backlog, so I'm hoping to get version 0.283 of presto deployed very soon.

> Yes, will do. T342343: Upgrade Presto to version 0.283 is already in our prioritized backlog, so I'm hoping to get version 0.283 of presto deployed very soon.

<3

Change 965730 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Ensure that alluxio cache directories are present for presto

https://gerrit.wikimedia.org/r/965730

Ahoelzl renamed this task from Test Alluxio as cache layer for Presto to [Platform] Test Alluxio as cache layer for Presto.Oct 20 2023, 4:57 PM
Ahoelzl renamed this task from [Platform] Test Alluxio as cache layer for Presto to [Data Platform] Test Alluxio as cache layer for Presto.Oct 20 2023, 5:16 PM

It would be a great idea to implement https://phabricator.wikimedia.org/T269832 before this one.
Without having historical data (logged by query logger), we can't really measure the gain.

> It would be a great idea to implement https://phabricator.wikimedia.org/T269832 before this one.
> Without having historical data (logged by query logger), we can't really measure the gain.

Yes, I see your point. I did make a good start on the presto query logger, but I only got as far as making plain text logs. I had hoped to make ECS compatible logs that we could feed into logstash, but I wasn't confident enough with the Java to make rapid enough progress.

What is your feeling on whether we should go this route, or implement a simpler log format?

Hi @BTullis,
We've talked with the team and decided that we'd postpone working on this topic as the improvement, while relevant, is not a "game changer" in usability for users (yet).
I have looked at the code and might squeeze in some advancement as part of my 10% at-will time.

> Hi @BTullis,
> We've talked with the team and decided that we'd postpone working on this topic as the improvement, while relevant, is not a "game changer" in usability for users (yet).
> I have looked at the code and might squeeze in some advancement as part of my 10% at-will time.

OK, I'm totally happy to be guided by you on this.
In terms of enabling the Alluxio SDK cache in the production Presto cluster, it's now completely ready to go. All it would take is a configuration change.
However, gathering baseline metrics with a query logger and measuring the difference it might make to users' work, would still take some time.

One possible alternative that you might like to think about is creating two additional presto catalogs (with and without iceberg), which have the presto cache enabled.
That way, we might be able to have both the cached and not cached versions available in parallel and do some side-by-side comparisons.

Whatever you think best.

I think it's worth adding a query-logger :) I think it's worth spending the time enabling proper metrics instead of trying to fast-track the solution. Thanks for the secondary option though!

Change 965730 merged by Btullis:

[operations/puppet@production] Ensure that alluxio cache directories are present for presto

https://gerrit.wikimedia.org/r/965730