Page MenuHomePhabricator

Do performance testing of a big Hadoop Table hosted by Ceph
Open, HighPublic

Description

Update Feb 2026

We are still working towards enabling this functionality, before being able to validate the performance characterisitics.
This is the bare minimum that we believe we will need, in order to be able to get hive databases backed by Ceph/S3.

Then we will be able to create a table on Ceph/S3 with:

create external table blah (whatever) location 's3a://rgw.eqiad.dpe.anycast.wmnet/mybucket'

...and it should work.

Original description below

In T381412, we learnt that the eqiad datacenter utilizes core routers to manage all traffic between the Hadoop HDFS nodes and the Presto nodes. Things got ugly when we attempted to read ~40TB of data.

We now have a Ceph cluster that we have discussed could be used to offload some data away from HDFS. But presumably, this new Ceph cluster is also one core router away from HDFS, and thus any and all data read or write will have to go thru a core router for any job, that is, not only Presto, but any Spark job, which is the biggest workload in our Hadoop infrastructure.

@cmooney mentioned in Slack that eqiad will eventually switch its network topology away from a core router design, but this work may be ~24 months away.

In this task, we want to test the performance and reliability of the current architecture by:

  • Running a similarly big query as in T381412 via Spark on a table hosted in Ceph, and see what happens.
  • Running a smaller query, say, ~5TB, and see what happens.
  • Gather info, and inform future network topology changes for the analytics side of things.

Event Timeline

Let's try that. The network load is indeed a known impact of decoupling compute and storage in big-data world.

I'm all for this performance testing, but we haven't got any tables hosted by Ceph yet, so we can't run a similarly big query yet.

Getting to the point where we have a hive table that is hosted on an S3 bucket is a necessary first step. We will need at least:

Then we will be able to create a table on Ceph/S3 with:

create external table blah (whatever) location 's3a://rgw.eqiad.dpe.anycast.wmnet/mybucket'

...and it should work.

These are useful pages:
https://www.redhat.com/en/blog/storing-tables-ceph-object-storage
https://www.redhat.com/en/blog/anatomy-s3a-filesystem-client

Gehel triaged this task as Medium priority.Jan 8 2025, 8:33 AM
Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.

Copy pasting part of conversation on T381389: Add QoS markings to profile Hadoop/HDFS analytics traffic when testing a ~42TB query on top of Presto, for future reference:

...

Here are my heavy query results:

First, I attempted query failed due to transient error:

presto> select max(length(revision_content_slots['main'].content_body)) from wmf_content.mediawiki_content_history_v1 where wiki_id='enwiki';

Query 20250401_153238_00024_cdjjz, FAILED, 15 nodes
Splits: 57,360 total, 52,468 done (91.47%)
[Latency: client-side: 19:59, server-side: 19:59] [0 rows, 0B] [0 rows/s, 0B/s]

Query 20250401_153238_00024_cdjjz failed: Error reading from /wmf/data/wmf_content/mediawiki_content_history_v1/data/wiki_id=enwiki/01221-346327-f15044f1-ab5f-463a-bb6e-9eb664ad9e79-00003.parquet at position 128590371

Seems like it is easy for Iceberg tables to fail when Presto is really busy. This is unrelated to QoS, so I switched to the query that started this whole effort when I caused a minor prod outage with it before the QoS markings:

SELECT
    count(1) as count,
    user_id_categories
FROM (
        SELECT
            CASE
                WHEN user_id = 0 THEN 'user_id = 0'
                WHEN user_id = -1 THEN 'user_id = -1'
                WHEN user_id < -1 THEN 'user_id < -1'
                WHEN user_id IS NULL THEN 'user_id IS NULL'
                ELSE 'user_id > 0'
            END AS user_id_categories
        FROM wmf.mediawiki_wikitext_history
        WHERE snapshot = '2025-02'
)
GROUP BY user_id_categories
ORDER BY count DESC

Here is the run and results:

presto> SELECT
     ->     count(1) as count,
     ->     user_id_categories
     -> FROM (
     ->         SELECT
     ->             CASE
     ->                 WHEN user_id = 0 THEN 'user_id = 0'
     ->                 WHEN user_id = -1 THEN 'user_id = -1'
     ->                 WHEN user_id < -1 THEN 'user_id < -1'
     ->                 WHEN user_id IS NULL THEN 'user_id IS NULL'
     ->                 ELSE 'user_id > 0'
     ->             END AS user_id_categories
     ->         FROM wmf.mediawiki_wikitext_history
     ->         WHERE snapshot = '2025-02'
     -> )
     -> GROUP BY user_id_categories
     -> ORDER BY count DESC;
   count    | user_id_categories 
------------+--------------------
 6610858626 | user_id > 0        
  526779900 | user_id IS NULL    
    8437200 | user_id = 0        
     760449 | user_id = -1       
(4 rows)

Query 20250401_160931_00030_cdjjz, FINISHED, 15 nodes
Splits: 795,809 total, 795,809 done (100.00%)
[Latency: client-side: 71:16, server-side: 71:16] [7.15B rows, 42TB] [1.67M rows/s, 10.1GB/s]

Thus we moved 42TB of data from HDFS over to Presto. Grafana link for this event: https://grafana.wikimedia.org/goto/mDjTquTNR?orgId=1

Grafana link for the core router for this event: https://grafana.wikimedia.org/goto/ihY1qXoNg?orgId=1. Looks like the query had a sustained ~60GB/s load on the core router.

No one is yelling on IRC so I think I am happy with this. I am done from my side.


No one is yelling on IRC so I think I am happy with this. I am done from my side.

Ok cool. FWIW we did saturate some links on the core, similar to what caused problems the last time.

Unfortunately it was outbound on the devices where we don't have stats. We got paged for the link usage, but no other alarms fired - unlike last time. So I'm happy that the QoS is doing what we need of it and this didn't affect other applications across the DC.


Also to get a sense of total throughput this graph is good:

https://grafana.wikimedia.org/goto/dVbqeuTNR

Looks like it was fairly solid at ~125Gbps across the network.

Adding this to our current milestone, based on recent discussions with @JAllemandou and friends.

We're currently most interested in testing:

Integration of Ceph/S3 with YARN, as originally set out in the description will be a secondary testing target.

Change #1190277 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the scaffolding helpers for spark-operator

https://gerrit.wikimedia.org/r/1190277

Change #1190278 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Fix the networkpolicy allowing the k8s api to call the spark-operator webhook

https://gerrit.wikimedia.org/r/1190278

Change #1190279 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add the required securitycontext to pods created by the spark-operator

https://gerrit.wikimedia.org/r/1190279

Change #1190279 abandoned by Btullis:

[operations/deployment-charts@master] Add the required securitycontext to pods created by the spark-operator

Reason:

Not required here

https://gerrit.wikimedia.org/r/1190279

Change #1190282 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Enable external-services for spark-operator

https://gerrit.wikimedia.org/r/1190282

Change #1190283 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Rename the release from spark-operator to production

https://gerrit.wikimedia.org/r/1190283

Change #1190285 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add kubernetes tokens for an analytics-test namespace on dse-k8s

https://gerrit.wikimedia.org/r/1190285

Change #1190277 merged by jenkins-bot:

[operations/deployment-charts@master] Update the scaffolding helper module for spark-operator

https://gerrit.wikimedia.org/r/1190277

Change #1190278 merged by jenkins-bot:

[operations/deployment-charts@master] Fix the networkpolicy allowing the k8s api to call the spark-operator webhook

https://gerrit.wikimedia.org/r/1190278

Change #1190282 merged by jenkins-bot:

[operations/deployment-charts@master] Enable external-services for spark-operator

https://gerrit.wikimedia.org/r/1190282

Change #1190283 merged by jenkins-bot:

[operations/deployment-charts@master] Rename the release from spark-operator to production

https://gerrit.wikimedia.org/r/1190283

Change #1190285 merged by Btullis:

[operations/puppet@production] Add kubernetes tokens for an analytics-test namespace on dse-k8s

https://gerrit.wikimedia.org/r/1190285

Change #1190678 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add a namespace called analytics-test to both dse-k8s clusters

https://gerrit.wikimedia.org/r/1190678

Change #1190678 merged by jenkins-bot:

[operations/deployment-charts@master] Add a namespace called analytics-test to both dse-k8s clusters

https://gerrit.wikimedia.org/r/1190678

BTullis raised the priority of this task from Medium to High.Sep 24 2025, 8:43 AM

Bringing into the current milestone, since we plan to move forward with this as soon as is practicable.