
Upgrade Presto to release that aligns with Iceberg 1.2.1
Closed, Resolved · Public

Description

In T311525, we picked up a version of Presto that supports Iceberg at version 0.13.1.

In our Iceberg Working Session we decided that we want to upgrade to a version that would be closer to 1.2.1.

The Presto folks have committed support for Iceberg 1.2.1, but it currently sits only in their master branch.

In this task we should:

  • Wait until the Presto folks release a version that includes Iceberg 1.2.1 (0.282?). Edit: no need to wait for 0.282; 0.281 is fine.
  • Build Debian packages for Presto version 0.281 and make them available for installation.
  • Deploy that version to our test cluster.
  • Make sure it works as expected.
  • Upgrade our main cluster.

Event Timeline

BTullis triaged this task as High priority.
BTullis moved this task from Incoming to In Progress on the Data-Platform-SRE board.

I'm starting to work on this now.

As per the README: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/presto/+/refs/heads/debian/debian/README.Debian

btullis@marlin:~/wmf/debs/presto$ export VERSION=0.281
btullis@marlin:~/wmf/debs/presto$ wget -P /tmp https://repo1.maven.org/maven2/com/facebook/presto/presto-server/$VERSION/presto-server-$VERSION.tar.gz
--2023-06-07 13:32:38--  https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.281/presto-server-0.281.tar.gz
Resolving repo1.maven.org (repo1.maven.org)... 199.232.56.209, 2a04:4e42:4b::209
Connecting to repo1.maven.org (repo1.maven.org)|199.232.56.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1895968788 (1.8G) [application/x-gzip]
Saving to: ‘/tmp/presto-server-0.281.tar.gz’

presto-server-0.281.tar.gz                           100%[=====================================================================================================================>]   1.77G  7.11MB/s    in 2m 37s  

2023-06-07 13:35:48 (11.5 MB/s) - ‘/tmp/presto-server-0.281.tar.gz’ saved [1895968788/1895968788]

btullis@marlin:~/wmf/debs/presto$ [ "$(curl -s https://repo1.maven.org/maven2/com/facebook/presto/presto-server/$VERSION/presto-server-$VERSION.tar.gz.sha1)" == "$(shasum -a 1 /tmp/presto-server-$VERSION.tar.gz | awk '{print $1}')" ] && echo "presto-server sha1 checksum matches, continue" || echo "presto-server sha1 checksum does not match!"
presto-server sha1 checksum matches, continue

btullis@marlin:~/wmf/debs/presto$ gbp import-orig -u $VERSION --upstream-branch=master --debian-branch=debian --merge-mode=replace /tmp/presto-server-$VERSION.tar.gz
gbp:info: Importing '/tmp/presto-server-0.281.tar.gz' to branch 'master'...
gbp:info: Source package is presto
gbp:info: Upstream version is 0.281
gbp:info: Replacing upstream source on 'debian'
gbp:info: Successfully imported version 0.281 of /tmp/presto-server-0.281.tar.gz

btullis@marlin:~/wmf/debs/presto$ git rm debian/lib/presto-cli-*-executable.jar
rm 'debian/lib/presto-cli-0.273.3-executable.jar'

btullis@marlin:~/wmf/debs/presto$ wget -P debian/lib https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/$VERSION/presto-cli-$VERSION-executable.jar
--2023-06-07 13:47:46--  https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.281/presto-cli-0.281-executable.jar
Resolving repo1.maven.org (repo1.maven.org)... 199.232.56.209, 2a04:4e42:4b::209
Connecting to repo1.maven.org (repo1.maven.org)|199.232.56.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15761227 (15M) [application/java-archive]
Saving to: ‘debian/lib/presto-cli-0.281-executable.jar’

presto-cli-0.281-executable.jar                      100%[=====================================================================================================================>]  15.03M  10.9MB/s    in 1.4s    

2023-06-07 13:47:48 (10.9 MB/s) - ‘debian/lib/presto-cli-0.281-executable.jar’ saved [15761227/15761227]

btullis@marlin:~/wmf/debs/presto$ [ "$(curl -s https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/$VERSION/presto-cli-$VERSION-executable.jar.sha1)" == "$(shasum -a 1 ./debian/lib/presto-cli-$VERSION-executable.jar | awk '{print $1}')" ] && echo "presto-cli sha1 checksum matches, continue" || echo "presto-cli sha1 checksum does not match!"
presto-cli sha1 checksum matches, continue

btullis@marlin:~/wmf/debs/presto$ chmod 755 debian/lib/presto-cli-$VERSION-executable.jar

That's a good start.

Now finding the binaries that need to be included.

btullis@marlin:~/wmf/debs/presto$ find {debian/lib,lib,plugin} -type f -exec file {} \; | grep -v text | awk -F ':' '{print $1}' | sort > debian/source/include-binaries

I edited the files debian/presto-cli.links and debian/presto-cli.install as instructed, replacing references to version 0.273.3 with 0.281.

I bumped the debian/changelog file with the following command:

dch -v 0.281-1 -D buster-wikimedia --force-distribution

I committed with:

git add debian
git commit -m "Upstream release $VERSION"

Now pushing to gerrit.
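For reference, both the master branch (which now holds the imported upstream tarball contents) and the debian branch need to be pushed, along with any upstream tag created by gbp import-orig. A sketch, assuming direct push access to both branches; depending on repository permissions the debian branch change may need to go through review instead:

git push origin master debian
git push origin --tags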

I've checked out the new version to a build server:

btullis@build2001:~$ git clone "https://gerrit.wikimedia.org/r/operations/debs/presto" && (cd "presto" && mkdir -p `git rev-parse --git-dir`/hooks/ && curl -Lo `git rev-parse --git-dir`/hooks/commit-msg https://gerrit.wikimedia.org/r/tools/hooks/commit-msg; chmod +x `git rev-parse --git-dir`/hooks/commit-msg)
Cloning into 'presto'...
remote: Counting objects: 212, done
remote: Finding sources: 100% (211/211)
remote: Getting sizes: 100% (187/187)
remote: Compressing objects: 100% (192953/192953)
remote: Total 1564 (delta 41), reused 1534 (delta 21)
Receiving objects: 100% (1564/1564), 991.08 MiB | 49.32 MiB/s, done.
Resolving deltas: 100% (276/276), done.
Updating files: 100% (1526/1526), done.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2272  100  2272    0     0  17082      0 --:--:-- --:--:-- --:--:-- 17082
btullis@build2001:~$ cd presto/
btullis@build2001:~/presto$ git branch
* master
btullis@build2001:~/presto$ git checkout debian
Updating files: 100% (2026/2026), done.
Branch 'debian' set up to track remote branch 'debian' from 'origin'.
Switched to a new branch 'debian'
btullis@build2001:~/presto$ git log -n 1
commit d2f7f778b1af8235579306254451a4cd6fc8a856 (HEAD -> debian, origin/debian)
Author: Ben Tullis <btullis@wikimedia.org>
Date:   Wed Jun 7 13:58:55 2023 +0100

    Upstream release 0.281
    
    Change-Id: I5855f32bbbe1d51012f15e16714ca3eef8ff8b6d

Now, as per the README again, proceeding to build the package.

btullis@build2001:~/presto$ GIT_PBUILDER_AUTOCONF=no DIST=buster WIKIMEDIA=yes gbp buildpackage -sa -us -uc --git-builder=git-pbuilder --source-option="--include-removal"

Oh, that didn't work as well as I had hoped.

I see a lot of error messages like this:

btullis@build2001:~/presto$ GIT_PBUILDER_AUTOCONF=no DIST=buster WIKIMEDIA=yes gbp buildpackage -sa -us -uc --git-builder=git-pbuilder --source-option="--include-removal"
gbp:info: Creating presto_0.281.orig.tar.gz from 'master'
gbp:info: Performing the build
Building with cowbuilder
W: /home/btullis/.pbuilderrc does not exist
I: using cowbuilder as pbuilder
dpkg-source: info: using options from presto/debian/source/options: --extend-diff-ignore=^\.gitreview$
dh clean --with=systemd
dh: warning: Compatibility levels before 10 are deprecated (level 9 in use)
   dh_clean
dh_clean: warning: Compatibility levels before 10 are deprecated (level 9 in use)
dpkg-source: info: using options from presto/debian/source/options: --extend-diff-ignore=^\.gitreview$
dpkg-source: info: using source format '3.0 (quilt)'
dpkg-source: info: building presto using existing ./presto_0.281.orig.tar.gz
dpkg-source: warning: ignoring deletion of directory plugin/raptor
dpkg-source: warning: ignoring deletion of directory plugin/geospatial
dpkg-source: error: cannot represent change to lib/bootstrap-0.198.jar: binary file contents changed
dpkg-source: error: add lib/bootstrap-0.198.jar in debian/source/include-binaries if you want to store the modified binary in the debian tarball
dpkg-source: error: cannot represent change to lib/checker-qual-2.5.2.jar: binary file contents changed
dpkg-source: error: add lib/checker-qual-2.5.2.jar in debian/source/include-binaries if you want to store the modified binary in the debian tarball
dpkg-source: error: cannot represent change to lib/concurrent-0.198.jar: binary file contents changed
dpkg-source: error: add lib/concurrent-0.198.jar in debian/source/include-binaries if you want to store the modified binary in the debian tarball

The thing that I don't get is that bootstrap-0.198.jar isn't mentioned in the diff; the version shown there is bootstrap-0.205.jar.

image.png (449×763 px, 44 KB)

Continuing to investigate.

I think I have worked out what it was. I hadn't pushed both the master and debian branches to gerrit, so when I ran the gbp buildpackage command it built the tarball from an outdated master.

I've now pushed and pulled the master branch and executed the buildpackage command again, like this:

btullis@build2001:~/presto$ GIT_PBUILDER_AUTOCONF=no DIST=buster WIKIMEDIA=yes gbp buildpackage -sa -us -uc --git-builder=git-pbuilder --git-force-create --source-option="--include-removal"

gbp:info: Creating presto_0.281.orig.tar.gz from 'master'

Without the --git-force-create option it found a cached version of the presto tarball from somewhere and showed this:

dpkg-source: info: building presto using existing ./presto_0.281.orig.tar.gz

Now it seems to be proceeding more happily.
I'm not sure if I should be updating the DIST=buster to DIST=bullseye at the moment, but I can come back to that.

Yes, I previously copied the packages from the buster repository to bullseye here: T323783#8435164
So I'm fairly confident that we can do so again and come back to a specific build for bullseye.
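If we do later want a native bullseye build, it should just be a matter of re-running the same build with the target distribution changed, roughly like this (a sketch; not run as part of this task):

GIT_PBUILDER_AUTOCONF=no DIST=bullseye WIKIMEDIA=yes gbp buildpackage -sa -us -uc --git-builder=git-pbuilder --git-force-create --source-option="--include-removal"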

Strangely, the packages are in the /var/cache/pbuilder/result/bullseye-amd64 directory. I had expected them to be in the buster version of that directory, but I don't think it matters.

btullis@build2001:/var/cache/pbuilder/result/bullseye-amd64$ ls -l presto*
-rw-r--r-- 1 btullis wikidev       5307 Jun  7 14:32 presto_0.281-1_amd64.buildinfo
-rw-r--r-- 1 btullis wikidev       2046 Jun  7 14:33 presto_0.281-1_amd64.changes
-rw-r--r-- 1 btullis wikidev   13891996 Jun  7 13:54 presto_0.281-1.debian.tar.xz
-rw-r--r-- 1 btullis wikidev        879 Jun  7 13:54 presto_0.281-1.dsc
-rw-r--r-- 1 btullis wikidev       1397 Jun  7 14:33 presto_0.281-1_source.changes
-rw-r--r-- 1 btullis wikidev 1893730742 Jun  7 13:50 presto_0.281.orig.tar.gz
-rw-r--r-- 1 btullis wikidev   13876288 Jun  7 14:29 presto-cli_0.281-1_all.deb
-rw-r--r-- 1 btullis wikidev 1843903104 Jun  7 14:32 presto-server_0.281-1_all.deb

Copying them to the apt server with:

btullis@apt1001:~$ rsync rsync://build2001.codfw.wmnet/pbuilder-result/bullseye-amd64/presto* .

Uploaded them to the apt repository with:

btullis@apt1001:~$ sudo -i reprepro include buster-wikimedia `pwd`/presto_0.281-1_amd64.changes 
Exporting indices...
btullis@apt1001:~$ sudo -i reprepro --ignore=wrongdistribution include bullseye-wikimedia `pwd`/presto_0.281-1_amd64.changes 
.changes put in a distribution not listed within it!
Ignoring as --ignore=wrongdistribution given.
Exporting indices...
Deleting files no longer referenced...

They are now available for installation:

btullis@apt1001:~$ sudo -i reprepro ls presto
presto | 0.246-wmf-1 |  stretch-wikimedia | source
presto |     0.281-1 |   buster-wikimedia | source
presto |     0.281-1 | bullseye-wikimedia | source
btullis@apt1001:~$ sudo -i reprepro ls presto-server
presto-server | 0.246-wmf-1 |  stretch-wikimedia | amd64, i386
presto-server |     0.281-1 |   buster-wikimedia | amd64, i386
presto-server |     0.281-1 | bullseye-wikimedia | amd64, i386
btullis@apt1001:~$ sudo -i reprepro ls presto-cli
presto-cli | 0.246-wmf-1 |  stretch-wikimedia | amd64, i386
presto-cli |     0.281-1 |   buster-wikimedia | amd64, i386
presto-cli |     0.281-1 | bullseye-wikimedia | amd64, i386

I'm installing this on the test cluster with:

btullis@cumin1001:~$ sudo cumin 'A:presto-analytics-test or A:hadoop-coordinator-test' 'apt install -y presto presto-server presto-cli'

Followed by:

btullis@cumin1001:~$ sudo cumin A:hadoop-client-test 'apt install -y presto-cli'

It looked like it was going well, but then the service running on an-test-coord1001 threw an error relating to logging.

btullis@an-test-coord1001:/etc/presto$ journalctl -u presto-server.service |grep ERROR
Jun 07 15:50:22 an-test-coord1001 presto-server[29501]: 2023-06-07T15:50:22.977Z        INFO        main        stdout        2023-06-07 15:50:22,976 main ERROR File contains an invalid element or attribute "EcsLayout"
Jun 07 15:50:22 an-test-coord1001 presto-server[29501]: 2023-06-07T15:50:22.980Z        INFO        main        stdout        2023-06-07 15:50:22,980 main ERROR FileManager (/var/log/presto/queries-20230607.log) java.io.IOException: Could not create directory /var/log/presto java.io.IOException: Could not create directory /var/log/presto
Jun 07 15:50:22 an-test-coord1001 presto-server[29501]: 2023-06-07T15:50:22.983Z        INFO        main        stdout        2023-06-07 15:50:22,983 main ERROR Could not create plugin of type class org.apache.logging.log4j.core.appender.FileAppender for element File: java.lang.IllegalStateException: ManagerFactory [org.apache.logging.log4j.core.appender.FileManager$FileManagerFactory@f46fff6] unable to create manager for [/var/log/presto/queries-20230607.log] with data [org.apache.logging.log4j.core.appender.FileManager$FactoryData@1a366d0] java.lang.IllegalStateException: ManagerFactory [org.apache.logging.log4j.core.appender.FileManager$FileManagerFactory@f46fff6] unable to create manager for [/var/log/presto/queries-20230607.log] with data [org.apache.logging.log4j.core.appender.FileManager$FactoryData@1a366d0]
Jun 07 15:50:22 an-test-coord1001 presto-server[29501]: 2023-06-07T15:50:22.983Z        INFO        main        stdout        2023-06-07 15:50:22,983 main ERROR Unable to invoke factory method in class org.apache.logging.log4j.core.appender.FileAppender for element File: java.lang.IllegalStateException: No factory method found for class org.apache.logging.log4j.core.appender.FileAppender java.lang.IllegalStateException: No factory method found for class org.apache.logging.log4j.core.appender.FileAppender
Jun 07 15:50:22 an-test-coord1001 presto-server[29501]: 2023-06-07T15:50:22.984Z        INFO        main        stdout        2023-06-07 15:50:22,984 main ERROR Null object returned for File in Appenders.
Jun 07 15:50:22 an-test-coord1001 presto-server[29501]: 2023-06-07T15:50:22.988Z        INFO        main        stdout        2023-06-07 15:50:22,987 main ERROR Unable to locate appender "FileAppender" for logger config "root"
Jun 07 15:55:41 an-test-coord1001 presto-server[29501]: 2023-06-07T15:55:41.445Z        ERROR        dispatcher-query-1        com.facebook.airlift.concurrent.BoundedExecutor        Task failed

This might be a config issue. I'll look into it.

I think that it's working now. The error was caused by some cruft that I had manually added to an-test-coord1001 as part of some earlier work.

I removed it with sudo mv /var/lib/presto/plugin/presto-query-logger/ ~ and sudo mv /etc/presto/event-listener.properties ~/
Then I restarted the presto-server service with: sudo systemctl restart presto-server
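A quick sanity check that the service came back cleanly might look like this (a sketch; the unit name is taken from the journalctl output above):

sudo systemctl status presto-server --no-pager
sudo journalctl -u presto-server.service -n 200 | grep -i error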

Now presto is working.
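As an extra check that the bundled Iceberg libraries are at the version we expect, one could also list the jars shipped in the Iceberg plugin directory on the coordinator (a sketch; the plugin path is taken from the server log quoted further down, and the jar naming is an assumption):

ls -l /srv/presto/plugin/iceberg/ | grep -i iceberg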

btullis@an-test-client1001:~$ presto
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
presto> show catalogs;
        Catalog         
------------------------
 analytics_test_hive    
 analytics_test_iceberg 
 system                 
(3 rows)

Query 20230607_161415_00002_nwp2p, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
[Latency: client-side: 0:02, server-side: 0:02] [0 rows, 0B] [0 rows/s, 0B/s]

We can see that the version is correct on both the coordinator and the worker with:

presto> SELECT node_id,node_version FROM system.runtime.nodes;
            node_id             | node_version  
--------------------------------+---------------
 an-test-coord1001-eqiad-wmnet  | 0.281-cfbc6eb 
 an-test-presto1001-eqiad-wmnet | 0.281-cfbc6eb 
(2 rows)

Query 20230607_161741_00005_nwp2p, FINISHED, 2 nodes
Splits: 17 total, 17 done (100.00%)
[Latency: client-side: 315ms, server-side: 265ms] [2 rows, 184B] [7 rows/s, 694B/s]

@xcollazo - can I hand this over to you for further testing of the iceberg integration please?

BTullis updated the task description. (Show Details)
BTullis moved this task from In Progress to Needs Review on the Data-Platform-SRE board.
BTullis subscribed.

Here's a list of the Iceberg-related options that Presto has applied, including the defaults and where we have configured something different. It might come in handy later.

com.facebook.presto.server.PluginManager -- Loading plugin /srv/presto/plugin/iceberg --
com.facebook.presto.server.PluginManager Installing com.facebook.presto.iceberg.IcebergPlugin
com.facebook.presto.server.PluginManager Registering connector iceberg
com.facebook.presto.server.PluginManager -- Finished loading plugin /srv/presto/plugin/iceberg --
com.facebook.presto.metadata.StaticCatalogStore -- Loading catalog properties etc/catalog/analytics_test_iceberg.properties --
com.facebook.presto.metadata.StaticCatalogStore -- Loading catalog analytics_test_iceberg --
Bootstrap PROPERTY DEFAULT RUNTIME DESCRIPTION
Bootstrap iceberg.catalog.cached-catalog-num 10 10 number of Iceberg catalog to cache across all sessions
Bootstrap iceberg.catalog.type HIVE HIVE Iceberg catalog type
Bootstrap iceberg.catalog.warehouse ---- ---- Iceberg catalog warehouse root path
Bootstrap iceberg.compression-codec GZIP SNAPPY
Bootstrap iceberg.file-format PARQUET PARQUET
Bootstrap iceberg.hadoop.config.resources [] [] Comma separated paths to Hadoop configuration resource files
Bootstrap iceberg.max-partitions-per-writer 100 100 Maximum number of partitions per writer
Bootstrap iceberg.minimum-assigned-split-weight 0.05 0.05 Minimum weight that a split can be assigned
Bootstrap iceberg.nessie.auth.type ---- ---- The authentication type to use. Available values are BASIC | BEARER
Bootstrap iceberg.nessie.auth.bearer.token ---- ---- The token to use with BEARER authentication
Bootstrap iceberg.nessie.client-builder-impl ---- ---- Configure the custom ClientBuilder implementation class to be used
Bootstrap iceberg.nessie.compression-enabled true true Configure whether compression should be enabled or not. Default: true
Bootstrap iceberg.nessie.connect-timeout-ms ---- ---- The connection timeout in milliseconds for the client
Bootstrap iceberg.nessie.ref main main The default Nessie reference to work on
Bootstrap iceberg.nessie.auth.basic.password ---- ---- The password to use with BASIC authentication
Bootstrap iceberg.nessie.read-timeout-ms ---- ---- The read timeout in milliseconds for the client
Bootstrap iceberg.nessie.uri ---- ---- The URI to connect to the Nessie server
Bootstrap iceberg.nessie.auth.basic.username ---- ---- The username to use with BASIC authentication
Bootstrap analytics_test_iceberg.orc.dwrf-stripe-cache-enabled true true Check DWRF stripe cache to look for stripe footers and index streams
Bootstrap analytics_test_iceberg.orc.expected-file-tail-size 16384B 16384B Expected size of the file tail. This value should be increased to read StripeCache and footer in one IO
Bootstrap analytics_test_iceberg.orc.file-tail-cache-enabled false false Enable cache for orc file tail
Bootstrap analytics_test_iceberg.orc.file-tail-cache-size 1MB 1MB Size of the orc file tail cache
Bootstrap analytics_test_iceberg.orc.file-tail-cache-ttl-since-last-access 0.00s 0.00s Time-to-live for orc file tail cache entry after last access
Bootstrap analytics_test_iceberg.orc.row-group-index-cache-enabled false false
Bootstrap analytics_test_iceberg.orc.row-group-index-cache-size 0B 0B Size of the stripe row group index stream cache
Bootstrap analytics_test_iceberg.orc.row-group-index-cache-ttl-since-last-access 0.00s 0.00s Time-to-live for stripe stream row group index cache entry after last access
Bootstrap analytics_test_iceberg.orc.stripe-footer-cache-size 0B 0B Size of the stripe footer cache
Bootstrap analytics_test_iceberg.orc.stripe-footer-cache-ttl-since-last-access 0.00s 0.00s Time-to-live for stripe footer cache entry after last access
Bootstrap analytics_test_iceberg.orc.stripe-metadata-cache-enabled false false Enable cache for stripe metadata
Bootstrap analytics_test_iceberg.orc.stripe-stream-cache-size 0B 0B Size of the stripe stream cache
Bootstrap analytics_test_iceberg.orc.stripe-stream-cache-ttl-since-last-access 0.00s 0.00s Time-to-live for stripe stream cache entry after last access
com.facebook.presto.metadata.StaticCatalogStore -- Added catalog analytics_test_iceberg using connector iceberg --

@xcollazo - can I hand this over to you for further testing of the iceberg integration please?

Yes please.

@Milimetric noticed that a simple query fails, so I'd better keep looking into this. At first glance it looks like a Kerberos issue.

presto:wmf> select * from webrequest where year=2023 and month=6 and day=6 limit 10;

Query 20230607_180003_00031_nwp2p, FAILED, 1 node
Splits: 17 total, 0 done (0.00%)
[Latency: client-side: 0:01, server-side: 349ms] [0 rows, 0B] [0 rows/s, 0B/s]

Query 20230607_180003_00031_nwp2p failed: Failed to list directory: hdfs://analytics-test-hadoop/wmf/data/wmf/webrequest/webrequest_source=test_text/year=2023/month=6/day=6/hour=8. Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: No common protection layer between client and server; Host Details : local host is: "an-test-coord1001/10.64.53.41"; destination host is: "an-test-master1002.eqiad.wmnet":8020;

I've identified the cause of this issue and we have a workaround.
There is an open issue on the prestodb GitHub tracker: "Presto not works with kerberos secured HDFS and Hive, the cause maybe a confiuration 'hive.copy-on-first-write-configuration-enabled' from PR 18115"

I have added a comment to say that we too are affected by this bug.

In the meantime, we can implement a workaround by adding the configuration option: hive.copy-on-first-write-configuration-enabled=false to each catalog.

I tested this by manually adding it to the file: /etc/presto/catalog/analytics_test_hive.properties and restarting the presto service.
The client on an-test-client1001 was immediately able to access the data:

image.png (282×1 px, 100 KB)

It's interesting that it doesn't need to be added to the catalog definition on the presto worker node (an-test-presto1001).
I'll follow this up with a puppet change to set it on the coordinator's catalog, but with a note that we might be able to remove it when they have fixed the bug.
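For the record, the manual test on an-test-coord1001 amounts to roughly the following (a sketch reconstructing the steps described above; in practice this setting belongs in puppet rather than a hand-edited file):

echo 'hive.copy-on-first-write-configuration-enabled=false' | sudo tee -a /etc/presto/catalog/analytics_test_hive.properties
sudo systemctl restart presto-server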

Change 928506 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a workaround for a kerberos issue with Hive/HDFS on Presto 0.281

https://gerrit.wikimedia.org/r/928506

Change 928506 merged by Btullis:

[operations/puppet@production] Add a workaround for a kerberos issue with Hive/HDFS on Presto 0.281

https://gerrit.wikimedia.org/r/928506

Ah, it looks like it is required on the worker nodes after all. When I ran the query repeatedly, I got errors from an-test-presto1001 as well.

presto:wmf> select * from webrequest where year=2023 and month=6 and day=6 limit 10;

Query 20230608_113945_00006_qm8em, FAILED, 1 node
Splits: 517 total, 0 done (0.00%)
[Latency: client-side: 0:01, server-side: 444ms] [0 rows, 0B] [0 rows/s, 0B/s]

Query 20230608_113945_00006_qm8em failed: Error opening Hive split hdfs://analytics-test-hadoop/wmf/data/wmf/webrequest/webrequest_source=test_text/year=2023/month=6/day=6/hour=9/part-00056-49ea308c-a037-4582-b284-29fe4e017092.c000.snappy.parquet (offset=0, length=39489): Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: No common protection layer between client and server; Host Details : local host is: "an-test-presto1001/10.64.53.11"; destination host is: "an-test-master1002.eqiad.wmnet":8020;

I'll follow up with another patch.

Change 928509 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] presto: add the workaround for kerberos problem to worker nodes

https://gerrit.wikimedia.org/r/928509

Change 928509 merged by Btullis:

[operations/puppet@production] presto: add the workaround for kerberos problem to worker nodes

https://gerrit.wikimedia.org/r/928509

BTullis added a subscriber: JAllemandou.

OK, @xcollazo , @Milimetric , @JAllemandou and anyone else interested in testing this new version of presto with its upgraded iceberg support, it's back over to you. Let me know if anything still looks off.

First, HDFS was not happy with me creating a database in Hive:

hive (default)> create database xcollazo_test_presto_0281;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=xcollazo, access=WRITE, inode="/user/hive/warehouse":hive:analytics-privatedata-users:drwxr-x---

This is how permissions look:

xcollazo@an-test-client1001:~$ hdfs dfs -ls /user/hive | grep warehouse
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
drwxr-x---   - hive analytics-privatedata-users          0 2022-06-28 18:44 /user/hive/warehouse

This is how it looks on the main cluster:

xcollazo@stat1007:~/heapdumps$ hostname -f
stat1007.eqiad.wmnet
xcollazo@stat1007:~/heapdumps$ hdfs dfs -ls /user/hive | grep warehouse
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
drwxrwx---   - hive analytics-privatedata-users          0 2023-06-07 16:33 /user/hive/warehouse

So I took the liberty of fixing it myself on the test cluster:

ssh an-test-master1001.eqiad.wmnet
sudo -u hdfs bash
hdfs@an-test-master1001:/home/xcollazo$ kerberos-run-command hdfs hdfs dfs -chmod g+w /user/hive/warehouse
hdfs@an-test-master1001:/home/xcollazo$ kerberos-run-command hdfs hdfs dfs -ls /user/hive | grep warehouse
drwxrwx---   - hive analytics-privatedata-users          0 2022-06-28 18:44 /user/hive/warehouse

Side note: should these permission settings be automated? @BTullis?

Ok then second database attempt:

hive (default)> create database xcollazo_test_presto_0281;
OK
Time taken: 1.207 seconds

Success! Now let's create an Iceberg table that looks like some other Hive table:

xcollazo@an-test-client1001:~$ spark3-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64
spark-sql (default)> use xcollazo_test_presto_0281;
Response code
Time taken: 0.139 seconds
spark-sql (default)> select count(1) from wmf.webrequest where year=2023 and month=6 and day=9 and hour=9;
count(1)
12467
Time taken: 12.268 seconds, Fetched 1 row(s)
spark-sql (default)> CREATE TABLE xcollazo_test_presto_0281.webrequest
                   > USING ICEBERG
                   > PARTITIONED BY (hours(ts))
                   > AS SELECT
                   >   *
                   >   FROM wmf.webrequest
                   >   WHERE year=2023 and month=6 and day=9;
23/06/09 18:40:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
23/06/09 18:40:59 WARN TaskSetManager: Lost task 1.0 in stage 2.0 (TID 6) (an-test-worker1002.eqiad.wmnet executor 2): java.lang.ClassNotFoundException: org.apache.iceberg.spark.source.SparkWrite$WriterFactory

Looks like Iceberg is not available on the test cluster? Let's try a simpler DDL:

spark-sql (default)>   CREATE TABLE xcollazo_test_presto_0281.simple_test( id int, name string) USING ICEBERG;
Response code
Time taken: 0.54 seconds

Ok that works. Let's try an INSERT:

spark-sql (default)> insert into xcollazo_test_presto_0281.simple_test values (1, 'a'), (2, 'b');
23/06/09 18:45:09 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 30) (an-test-worker1003.eqiad.wmnet executor 20): java.lang.ClassNotFoundException: org.apache.iceberg.spark.source.SparkWrite$WriterFactory
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)

So the write path seems broken. I speculate again that perhaps we are missing configuration on the test cluster? CC @BTullis

You're quite right @xcollazo

btullis@an-test-client1001:~$ apt-cache policy conda-analytics
conda-analytics:
  Installed: 0.0.13
  Candidate: 0.0.18
  Version table:
     0.0.18 1001
       1001 http://apt.wikimedia.org/wikimedia buster-wikimedia/main amd64 Packages
 *** 0.0.13 100
        100 /var/lib/dpkg/status

I'll upgrade this now to 0.0.18 across the whole of the test cluster.

@xcollazo - That's done now, so you should be good to commence testing again. Apologies for that omission.
I'll come back to the default hive permissions question that you raised earlier another time.

@BTullis I can still repro the same stack trace.

Side note: I noticed that, even though all test nodes do have version 0.0.18 of conda-analytics, in general there seem to be many hosts that are still on 0.0.13 as per https://debmonitor.wikimedia.org/packages/conda-analytics.
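Alongside debmonitor, a quick way to check what is actually installed across the fleet would be something like this from a cumin host (a sketch; the A:hadoop-worker alias is an assumption):

sudo cumin 'A:hadoop-worker' 'apt-cache policy conda-analytics | grep Installed'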

Thanks @xcollazo I see what you mean. I have also verified that I can create a table using iceberg, but writing any values into it throws the error:

java.lang.ClassNotFoundException: org.apache.iceberg.spark.source.SparkWrite$WriterFactory

Do we need to recreate the spark3 assembly to include the iceberg jar?

Do we need to recreate the spark3 assembly to include the iceberg jar?

I'm going ahead with this. Firstly, I'm creating a backup of the old assembly.

btullis@an-test-master1001:~$ sudo kerberos-run-command hdfs hdfs dfs -cp /user/spark/share/lib/spark-3.1.2-assembly.jar /user/spark/share/lib/spark-3.1.2-assembly.jar.bak
btullis@an-test-master1001:~$ sudo kerberos-run-command hdfs hdfs dfs -ls /user/spark/share/lib/
Found 3 items
-rw-r--r--   2 hdfs  hadoop  195258057 2021-02-19 15:49 /user/spark/share/lib/spark-2.4.4-assembly.zip
-rw-r--r--   2 spark spark   228506202 2022-05-25 13:22 /user/spark/share/lib/spark-3.1.2-assembly.jar
-rw-r-----   2 hdfs  hadoop  228506202 2023-06-13 10:33 /user/spark/share/lib/spark-3.1.2-assembly.jar.bak

Interesting to note that the ownership is different between the old spark2 assembly and the spark3 assembly that is in use.
I can come back to that.
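If we do want to restore the spark:spark ownership that the spark3 assembly had before, something like this would do it (a sketch only; not run as part of this task):

sudo kerberos-run-command hdfs hdfs dfs -chown spark:spark /user/spark/share/lib/spark-3.1.2-assembly.jar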

I used the generate_spark_assembly.sh script on an-test-master1002.

This worked, apart from the fact that it failed to overwrite the existing assembly file:

put: `/user/spark/share/lib/spark-3.1.2-assembly.jar': File exists

So I manually copied the file and forced an overwrite.

sudo kerberos-run-command hdfs hdfs dfs -put -f spark-3.1.2-assembly.jar /user/spark/share/lib/

Verified that the file sizes were different:

btullis@an-test-master1001:~$ sudo kerberos-run-command hdfs hdfs dfs -ls /user/spark/share/lib/
Found 3 items
-rw-r--r--   2 hdfs hadoop  195258057 2021-02-19 15:49 /user/spark/share/lib/spark-2.4.4-assembly.zip
-rw-r-----   2 hdfs hadoop  255798851 2023-06-13 10:43 /user/spark/share/lib/spark-3.1.2-assembly.jar
-rw-r-----   2 hdfs hadoop  228506202 2023-06-13 10:33 /user/spark/share/lib/spark-3.1.2-assembly.jar.bak

Changed the permissions to permit read access by all, which is what the script would have done.

sudo kerberos-run-command hdfs hdfs dfs -chmod +r /user/spark/share/lib/spark-3.1.2-assembly.jar

Verified that this command functioned as expected:

btullis@an-test-master1001:~$ sudo kerberos-run-command hdfs hdfs dfs -ls /user/spark/share/lib/
Found 3 items
-rw-r--r--   2 hdfs hadoop  195258057 2021-02-19 15:49 /user/spark/share/lib/spark-2.4.4-assembly.zip
-rw-r--r--   2 hdfs hadoop  255798851 2023-06-13 10:43 /user/spark/share/lib/spark-3.1.2-assembly.jar
-rw-r-----   2 hdfs hadoop  228506202 2023-06-13 10:33 /user/spark/share/lib/spark-3.1.2-assembly.jar.bak

I'll now see if this makes any difference to the attempt to write data into the iceberg table.

That looks ok from a new spark3-sql session.

btullis@an-test-client1001:/etc/spark3/conf$ spark3-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64
Running /opt/conda-analytics/bin/spark-sql $@
SPARK_HOME: /opt/conda-analytics/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/06/13 10:50:24 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
ADD JAR file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
23/06/13 10:50:26 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Added [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar] to class path
Added resources: [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar]
Spark master: yarn, Application Id: application_1684836135432_13348
spark-sql (default)> insert into btullis_test_presto_0281.simple_test values (1, 'a'), (2, 'b');
Response code
Time taken: 12.415 seconds
spark-sql (default)>

I could try your more complicated insert operation @xcollazo, but I'm inclined to leave that to you.

I'll work on pushing out conda-analytics 0.0.18 to the remaining nodes in the prod cluster, as you suggested.

Rolling out conda-analytics 0.0.18 to hadoop-workers with debdeploy now.
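debdeploy is the right tool for this; for reference, the blunt cumin equivalent would be something along these lines, following the same pattern as the test-cluster install above (the A:hadoop-worker alias and the batch size are assumptions):

sudo cumin -b 4 'A:hadoop-worker' 'apt install -y conda-analytics'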

I've finished rolling out the new version of conda-analytics to all hosts.
Now updating the assembly from an-launcher1002.

Backing up the current assembly.

btullis@an-launcher1002:~$ sudo kerberos-run-command hdfs hdfs dfs -cp /user/spark/share/lib/spark-3.1.2-assembly.jar /user/spark/share/lib/spark-3.1.2-assembly.jar.bak

Creating the new one:

./generate_spark_assembly.sh

(I commented out the ssh command and updated the dfs -put to add a -f to overwrite.)
Checked the presence of the new version:

btullis@an-launcher1002:~$ sudo kerberos-run-command hdfs hdfs dfs -ls /user/spark/share/lib/
Found 10 items
-rw-r--r--   3 spark spark  208371635 2019-08-07 16:16 /user/spark/share/lib/spark-2.3.1-assembly.zip
-rw-r--r--   3 spark spark  211950174 2019-08-26 20:56 /user/spark/share/lib/spark-2.4.3-assembly.zip
-rw-r--r--   3 hdfs  spark  195258057 2021-02-22 14:14 /user/spark/share/lib/spark-2.4.4-assembly.zip
-rw-r--r--   3 spark spark  212114480 2019-10-10 20:14 /user/spark/share/lib/spark-2.4.4-hadoop2.6-assembly.zip
-rw-r--r--   3 hdfs  spark  255798851 2023-06-13 13:34 /user/spark/share/lib/spark-3.1.2-assembly.jar
-rw-r--r--   3 spark spark  228506202 2022-05-25 13:02 /user/spark/share/lib/spark-3.1.2-assembly.jar.backup
-rw-r-----   3 hdfs  spark  255798851 2023-06-13 13:28 /user/spark/share/lib/spark-3.1.2-assembly.jar.bak
-rw-r--r--   3 hdfs  spark  301590410 2022-07-05 11:17 /user/spark/share/lib/spark-3.3.0-assembly.zip
-rw-r--r--   3 spark spark  114157041 2018-11-12 14:44 /user/spark/share/lib/spark-assembly.jar
-rw-r--r--   3 hdfs  spark  208371635 2019-04-17 14:57 /user/spark/share/lib/spark2-assembly.zip

Thanks @xcollazo I see what you mean. I have also verified that I can create a table using iceberg, but writing any values into it throws the error:

java.lang.ClassNotFoundException: org.apache.iceberg.spark.source.SparkWrite$WriterFactory

Do we need to recreate the spark3 assembly to include the iceberg jar?

This definitely was the Spark issue. I was able to confirm that Iceberg on Spark works as expected with the following sanity test:

xcollazo@an-test-client1001:~$ hostname -f
an-test-client1001.eqiad.wmnet
xcollazo@an-test-client1001:~$ spark3-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64


spark-sql (default)> CREATE TABLE xcollazo_test_presto_0281.webrequest
                   > USING ICEBERG
                   > PARTITIONED BY (hours(ts))
                   > AS SELECT
                   >   *
                   >   FROM wmf.webrequest
                   >   WHERE year=2023 and month=6 and day=9
                   >   ORDER BY ts;
23/06/13 18:19:15 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Response code
Time taken: 42.93 seconds
  
spark-sql (default)> select count(1) from xcollazo_test_presto_0281.webrequest;
count(1)
303526
Time taken: 1.935 seconds, Fetched 1 row(s)
spark-sql (default)> select count(1) from wmf.webrequest WHERE year=2023 and month=6 and day=9;
count(1)
303526

Presto is happy with Hive tables:

presto> show catalogs;
        Catalog         
------------------------
 analytics_test_hive    
 analytics_test_iceberg 
 system                 
(3 rows)

Query 20230613_182543_00000_qm8em, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
[Latency: client-side: 0:01, server-side: 0:01] [0 rows, 0B] [0 rows/s, 0B/s]
xcollazo@an-test-client1001:~$ hostname -f
an-test-client1001.eqiad.wmnet
xcollazo@an-test-client1001:~$ presto
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8

presto> SELECT count(1) FROM analytics_test_hive.wmf.webrequest WHERE year=2023 and month=6 and day=9;
 _col0  
--------
 303526 
(1 row)

Query 20230613_182926_00002_qm8em, FINISHED, 1 node
Splits: 6,161 total, 6,161 done (100.00%)
[Latency: client-side: 1:02, server-side: 1:01] [304K rows, 96MB] [4.93K rows/s, 1.56MB/s]

However, Presto doesn't seem to be happy with Iceberg:

presto> SELECT count(1) FROM analytics_test_iceberg.xcollazo_test_presto_0281.webrequest;
Query 20230613_182622_00001_qm8em failed: Table metadata is missing.

@xcollazo - could you have another go please? I just tried to reproduce your error on an-test-client1001 and it's working for me.

Creating the Iceberg table via spark3-sql:

btullis@an-test-client1001:~$ spark3-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64
Running /opt/conda-analytics/bin/spark-sql $@
SPARK_HOME: /opt/conda-analytics/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/06/19 15:26:33 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
ADD JAR file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
23/06/19 15:26:36 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Added [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar] to class path
Added resources: [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar]
Spark master: yarn, Application Id: application_1684836135432_17074
spark-sql (default)> show tables;
database	tableName	isTemporary
default	t1	false
default	test_avro	false
Time taken: 2.734 seconds, Fetched 2 row(s)
spark-sql (default)> CREATE TABLE btullis_test_presto_0281.webrequest
                   > USING ICEBERG
                   > PARTITIONED BY (hours(ts))
                   > AS SELECT
                   >  *
                   >  FROM wmf.webrequest
                   >  WHERE year=2023 and month=6 and day=18
                   > ORDER BY ts;
23/06/19 15:27:51 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Response code
Time taken: 46.373 seconds

Checking the count of records:

spark-sql (default)> select count(1) from btullis_test_presto_0281.webrequest;
count(1)
296774
Time taken: 2.059 seconds, Fetched 1 row(s)

Running the same count in Presto:

btullis@an-test-client1001:~$ presto
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
presto> select count(1) from analytics_test_iceberg.btullis_test_presto_0281.webrequest;
 _col0  
--------
 296774 
(1 row)

Query 20230619_153458_00005_zunyi, FINISHED, 1 node
Splits: 241 total, 241 done (100.00%)
[Latency: client-side: 0:02, server-side: 0:02] [297K rows, 3.69MB] [150K rows/s, 1.86MB/s]

presto>

If you can let me know whether you're still seeing the error, that would be great. Thanks.

@BTullis confirming it is working for me now:

presto> SELECT count(1) FROM analytics_test_hive.wmf.webrequest WHERE year=2023 and month=6 and day=9;
 _col0  
--------
 303526 
(1 row)

Query 20230620_143623_00001_zunyi, FINISHED, 1 node
Splits: 6,161 total, 6,161 done (100.00%)
[Latency: client-side: 1:04, server-side: 1:04] [304K rows, 96MB] [4.75K rows/s, 1.5MB/s]

presto> SELECT count(1) FROM analytics_test_iceberg.xcollazo_test_presto_0281.webrequest;
 _col0  
--------
 303526 
(1 row)

Query 20230620_143740_00002_zunyi, FINISHED, 1 node
Splits: 240 total, 240 done (100.00%)
[Latency: client-side: 0:02, server-side: 0:02] [304K rows, 3.67MB] [171K rows/s, 2.06MB/s]

presto> SELECT count(1) as count, hostname FROM analytics_test_iceberg.xcollazo_test_presto_0281.webrequest GROUP BY hostname ORDER BY count DESC LIMIT 20;
 count |      hostname      
-------+--------------------
 12111 | cp3060.esams.wmnet 
 11971 | cp3056.esams.wmnet 
 11593 | cp3062.esams.wmnet 
 11544 | cp3058.esams.wmnet 
 11533 | cp3064.esams.wmnet 
 11348 | cp3050.esams.wmnet 
 11286 | cp3052.esams.wmnet 
 10947 | cp3054.esams.wmnet 
  9422 | cp1087.eqiad.wmnet 
  8962 | cp1081.eqiad.wmnet 
  8840 | cp1079.eqiad.wmnet 
  8827 | cp1089.eqiad.wmnet 
  8451 | cp1077.eqiad.wmnet 
  8443 | cp1083.eqiad.wmnet 
  8349 | cp1075.eqiad.wmnet 
  8221 | cp5022.eqsin.wmnet 
  8156 | cp5021.eqsin.wmnet 
  7995 | cp5024.eqsin.wmnet 
  7945 | cp5017.eqsin.wmnet 
  7900 | cp5018.eqsin.wmnet 
(20 rows)

Query 20230620_144403_00005_zunyi, FINISHED, 1 node
Splits: 272 total, 272 done (100.00%)
[Latency: client-side: 0:02, server-side: 0:02] [304K rows, 3.96MB] [178K rows/s, 2.32MB/s]

I know that Presto can cache the Hive Metastore data so perhaps what I saw before was just that? Not sure, but I cannot repro anymore so let's not worry about it.

+1 to move 0.281 to prod!

xcollazo updated the task description. (Show Details)

Change 932827 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a workaround for a kerberos issue affecting Presto version 0.281

https://gerrit.wikimedia.org/r/932827

Change 932827 merged by Btullis:

[operations/puppet@production] Add a workaround for a kerberos issue affecting Presto version 0.281

https://gerrit.wikimedia.org/r/932827

This all looks fine now. The new version has been rolled out to all nodes.
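For the record, the upgrade steps on the main cluster mirror those used on the test cluster above. This is a sketch rather than a transcript of the exact commands run, and the prod cumin aliases are assumptions modelled on the test aliases used earlier:

sudo cumin 'A:presto-analytics or A:hadoop-coordinator' 'apt install -y presto presto-server presto-cli'
sudo cumin A:hadoop-client 'apt install -y presto-cli'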

presto:wmf> SELECT node_id,node_version FROM system.runtime.nodes;
          node_id          | node_version  
---------------------------+---------------
 an-coord1001-eqiad-wmnet  | 0.281-cfbc6eb 
 an-presto1001-eqiad-wmnet | 0.281-cfbc6eb 
 an-presto1002-eqiad-wmnet | 0.281-cfbc6eb 
 an-presto1003-eqiad-wmnet | 0.281-cfbc6eb 
 an-presto1004-eqiad-wmnet | 0.281-cfbc6eb 
 an-presto1005-eqiad-wmnet | 0.281-cfbc6eb 
 an-presto1006-eqiad-wmnet | 0.281-cfbc6eb 
 an-presto1007-eqiad-wmnet | 0.281-cfbc6eb 
 an-presto1008-eqiad-wmnet | 0.281-cfbc6eb 
 an-presto1009-eqiad-wmnet | 0.281-cfbc6eb 
 an-presto1010-eqiad-wmnet | 0.281-cfbc6eb 
 an-presto1011-eqiad-wmnet | 0.281-cfbc6eb 
 an-presto1012-eqiad-wmnet | 0.281-cfbc6eb 
 an-presto1013-eqiad-wmnet | 0.281-cfbc6eb 
 an-presto1014-eqiad-wmnet | 0.281-cfbc6eb 
 an-presto1015-eqiad-wmnet | 0.281-cfbc6eb 
(16 rows)

Query 20230626_100330_00046_yzu6v, FINISHED, 2 nodes
Splits: 17 total, 17 done (100.00%)
[Latency: client-side: 183ms, server-side: 160ms] [16 rows, 1.29KB] [100 rows/s, 8.06KB/s]

I was also able to test a query against webrequest, validating that the kerberos issue is not affecting us.

Marking as done.

Confirmed that the Iceberg production table referrer_daily is working as expected on the prod Presto instance:

presto:xcollazo_iceberg> select sum(num_referrals) as sum, country, browser_family, search_engine from analytics_iceberg.wmf_traffic.referrer_daily  where day = date '2023-05-18' group by country, browser_family, search_engine order by sum DESC LIMIT 5;
   sum    |    country    | browser_family | search_engine 
----------+---------------+----------------+---------------
 12765387 | India         | Chrome Mobile  | Google        
 12175410 | United States | Mobile Safari  | Google        
  8392264 | United States | Chrome Mobile  | Google        
  8238541 | United States | Chrome         | Google        
  4593786 | Germany       | Chrome Mobile  | Google        
(5 rows)

Query 20230626_135636_00178_yzu6v, FINISHED, 15 nodes
Splits: 498 total, 498 done (100.00%)
[Latency: client-side: 0:02, server-side: 0:02] [270K rows, 1.08MB] [172K rows/s, 705KB/s]

Thank you @BTullis !