nfraison
Disabled

Projects

User does not belong to any projects.

User Details

User Since
Feb 6 2023, 9:38 AM (68 w, 5 d)
Roles
Disabled
LDAP User
Nicolas Fraison
MediaWiki User
NFraison-WMF [ Global Accounts ]

Recent Activity

Aug 16 2023

xcollazo awarded T330176: [Data Platform] Deploy Spark History Service a Pterodactyl token.
Aug 16 2023, 4:35 PM · Data-Engineering (Sprint 7), Patch-For-Review, Data-Platform-SRE

Jul 4 2023

JAllemandou awarded T330176: [Data Platform] Deploy Spark History Service a Mountain of Wealth token.
Jul 4 2023, 1:50 PM · Data-Engineering (Sprint 7), Patch-For-Review, Data-Platform-SRE

Mar 24 2023

nfraison moved T331970: Bootstrap spark cli to submit jobs on the DSE K8S cluster from In Progress to Done on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 24 2023, 1:06 PM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)

Mar 23 2023

nfraison created T332913: [dse-k8s] Provide common spark config for spark jobs.
Mar 23 2023, 4:12 PM · Data-Platform-SRE
nfraison created T332912: [dse-k8s] Provide common hive config for spark jobs.
Mar 23 2023, 4:12 PM · Data-Platform-SRE
nfraison changed the status of T332909: [dse-k8s] Provide common hadooop config for spark jobs from Open to In Progress.
Mar 23 2023, 3:52 PM · Data-Platform-SRE, Patch-For-Review
nfraison changed the status of T332908: [dse-k8s] Spark-deploy need to create secret object in spark namespace from Open to In Progress.
Mar 23 2023, 3:52 PM · Data-Platform-SRE, Patch-For-Review
nfraison claimed T332908: [dse-k8s] Spark-deploy need to create secret object in spark namespace.
Mar 23 2023, 3:52 PM · Data-Platform-SRE, Patch-For-Review
nfraison created T332909: [dse-k8s] Provide common hadooop config for spark jobs.
Mar 23 2023, 3:52 PM · Data-Platform-SRE, Patch-For-Review
nfraison created T332908: [dse-k8s] Spark-deploy need to create secret object in spark namespace.
Mar 23 2023, 3:44 PM · Data-Platform-SRE, Patch-For-Review

Mar 22 2023

nfraison added a comment to T331859: Enable egress traffic from spark pods to HDFS and HIVE.

FYI, updating the mainApplicationFile to mainApplicationFile: "hdfs://analytics-hadoop/user/nfraison/spark-examples_2.12-3.3.0.jar" works fine, so there is no specific need to manage this for now.

Mar 22 2023, 2:20 PM · Patch-For-Review, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison moved T331859: Enable egress traffic from spark pods to HDFS and HIVE from In Review to Done on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 22 2023, 2:13 PM · Patch-For-Review, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T331859: Enable egress traffic from spark pods to HDFS and HIVE.

Works fine in prod:

Mar 22 2023, 2:13 PM · Patch-For-Review, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison moved T331858: Deploy spark-operator webhook from Ready to Deploy to Done on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 22 2023, 10:25 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T331859: Enable egress traffic from spark pods to HDFS and HIVE.

How to get a token for test

Mar 22 2023, 10:09 AM · Patch-For-Review, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T331859: Enable egress traffic from spark pods to HDFS and HIVE.

I need to test this one; I'm not a hundred percent sure it works, but if it does we should indeed rely on it, and it should be managed by our spark8s CLI.
The CLI should check those dependencies and ensure that they are pushed to HDFS, then update the conf to point to the appropriate HDFS path.
This feature could be disabled with a specific flag if needed.

Mar 22 2023, 8:40 AM · Patch-For-Review, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)

Mar 21 2023

nfraison added a comment to T331859: Enable egress traffic from spark pods to HDFS and HIVE.

Two things will have to be added to the roadmap:

  • Management of hadoop/hive/spark config. It is currently pushed as a configmap with the job but should probably be a common configmap for all jobs.
  • Management of jars/dependencies. We currently rely on local example jars; for real application dependencies we need to find a solution: ceph s3 (not available for now), archiva (fine for prod jobs with released artifacts, but not good for test jobs not pushed to archiva), an http server hosted on the wrapper submitting the job to serve files to the running app, other?
Mar 21 2023, 6:07 PM · Patch-For-Review, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T331859: Enable egress traffic from spark pods to HDFS and HIVE.

The kerberos issue was due to my config being wrong. I was using a Hadoop delegation token (HDT) for HDFS on analytics-test-hadoop while my config was pointing at analytics-hadoop. As a result the HDT was not found and the client fell back to kerberos...
The config below works fine in the test cluster:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: hadoop-conf
  namespace: spark
data:
  core-site.xml: |
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://analytics-test-hadoop/</value>
      </property>
      <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
      </property>
      <property>
          <name>fs.permissions.umask-mode</name>
          <value>027</value>
      </property>
      <property>
          <name>hadoop.http.staticuser.user</name>
          <value>yarn</value>
      </property>
      <property>
          <name>hadoop.rpc.protection</name>
          <value>privacy</value>
      </property>
      <property>
          <name>hadoop.security.authentication</name>
          <value>kerberos</value>
      </property>
      <property>
          <name>hadoop.ssl.enabled.protocols</name>
          <value>TLSv1.2</value>
      </property>
    </configuration>
  hdfs-site.xml: |
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>dfs.nameservices</name>
        <value>analytics-test-hadoop</value>
      </property>
      <property>
        <name>dfs.ha.namenodes.analytics-test-hadoop</name>
        <value>an-test-master1001-eqiad-wmnet,an-test-master1002-eqiad-wmnet</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.analytics-test-hadoop.an-test-master1001-eqiad-wmnet</name>
        <value>an-test-master1001.eqiad.wmnet:8020</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.analytics-test-hadoop.an-test-master1002-eqiad-wmnet</name>
        <value>an-test-master1002.eqiad.wmnet:8020</value>
      </property>
      <property>
        <name>dfs.client.failover.proxy.provider.analytics-test-hadoop</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>
      <property>
       <name>dfs.blocksize</name>
       <value>268435456</value>
      </property>
      <property>
        <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
        <value>true</value>
      </property>
      <property>
          <name>dfs.block.access.token.enable</name>
          <value>true</value>
      </property>
      <property>
          <name>dfs.data.transfer.protection</name>
          <value>privacy</value>
      </property>
      <property>
          <name>dfs.datanode.kerberos.principal</name>
          <value>hdfs/_HOST@WIKIMEDIA</value>
      </property>
      <property>
          <name>dfs.encrypt.data.transfer</name>
          <value>true</value>
      </property>
      <property>
          <name>dfs.encrypt.data.transfer.cipher.key.bitlength</name>
          <value>128</value>
      </property>
      <property>
          <name>dfs.encrypt.data.transfer.cipher.suites</name>
          <value>AES/CTR/NoPadding</value>
      </property>
      <property>
          <name>dfs.http.policy</name>
          <value>HTTPS_ONLY</value>
      </property>
      <property>
          <name>dfs.namenode.kerberos.principal</name>
          <value>hdfs/_HOST@WIKIMEDIA</value>
      </property>
      <property>
          <name>dfs.web.authentication.kerberos.principal</name>
          <value>HTTP/_HOST@WIKIMEDIA</value>
      </property>
    </configuration>
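The mismatch described in this comment (a delegation token issued for one nameservice while the client config points at another) can be caught before a job is submitted. A minimal sketch, assuming the two XML documents are available as strings and that the nameservice the token was issued for is already known; the helper names `hadoop_properties` and `nameservice_mismatches` are hypothetical and not part of any Hadoop tooling:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def hadoop_properties(xml_text: str) -> dict:
    """Parse a Hadoop *-site.xml document into a {name: value} dict."""
    # strip() so an indented document (as in a ConfigMap) still parses
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def nameservice_mismatches(core_site: str, hdfs_site: str, token_nameservice: str) -> list:
    """Return human-readable problems; an empty list means the names agree.

    token_nameservice is an assumed input: the nameservice the delegation
    token was requested for (for example "analytics-test-hadoop").
    """
    core = hadoop_properties(core_site)
    hdfs = hadoop_properties(hdfs_site)

    # "hdfs://analytics-test-hadoop/" -> "analytics-test-hadoop"
    default_fs = urlparse(core.get("fs.defaultFS", "")).netloc
    declared = [ns.strip() for ns in hdfs.get("dfs.nameservices", "").split(",") if ns.strip()]

    problems = []
    if default_fs not in declared:
        problems.append(
            f"fs.defaultFS points at {default_fs!r}, which is not in dfs.nameservices {declared}"
        )
    if token_nameservice != default_fs:
        problems.append(
            f"token is for {token_nameservice!r} but fs.defaultFS is {default_fs!r}"
        )
    return problems
```

Run against the core-site.xml and hdfs-site.xml values above with `token_nameservice="analytics-test-hadoop"`, this sketch should report no problems; with a config pointing at analytics-hadoop it would flag the token mismatch.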
Mar 21 2023, 5:42 PM · Patch-For-Review, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T331859: Enable egress traffic from spark pods to HDFS and HIVE.

It also seems that dse-k8s-worker1002.eqiad.wmnet has a strange pattern: the driver is stuck at

Mar 21 2023, 5:26 PM · Patch-For-Review, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T331859: Enable egress traffic from spark pods to HDFS and HIVE.

FW access is open. Access is OK with the job config below

Mar 21 2023, 4:00 PM · Patch-For-Review, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)

Mar 20 2023

nfraison moved T303168: Investigate trend of gradual hive server heap exhaustion from Blocked/Paused to Done on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 20 2023, 5:10 PM · Patch-For-Review, Data-Platform-SRE

Mar 17 2023

nfraison added a comment to T330162: Research and test methods for accessing kerberized services from spark running on the DSE K8S cluster.

Sent messages on the #wikimedia-serviceops IRC channel to get reviews from SRE and check whether the chosen vault mechanism is acceptable

Mar 17 2023, 8:59 AM · Data-Platform-SRE
nfraison claimed T331970: Bootstrap spark cli to submit jobs on the DSE K8S cluster.
Mar 17 2023, 7:27 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison claimed T331859: Enable egress traffic from spark pods to HDFS and HIVE.
Mar 17 2023, 7:27 AM · Patch-For-Review, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison moved T331970: Bootstrap spark cli to submit jobs on the DSE K8S cluster from Next Up to In Progress on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 17 2023, 7:27 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)

Mar 16 2023

nfraison moved T303168: Investigate trend of gradual hive server heap exhaustion from Unexpected work/incident to Blocked/Paused on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 16 2023, 4:52 PM · Patch-For-Review, Data-Platform-SRE
nfraison added a comment to T303168: Investigate trend of gradual hive server heap exhaustion.

Taking a heap dump of the test hiveserver2 and analyzing it with MAT shows multiple instances of many of our Singleton UDFs, while there should be only one of each:

Mar 16 2023, 3:08 PM · Patch-For-Review, Data-Platform-SRE

Mar 15 2023

nfraison placed T331859: Enable egress traffic from spark pods to HDFS and HIVE up for grabs.
Mar 15 2023, 1:40 PM · Patch-For-Review, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison updated the task description for T331859: Enable egress traffic from spark pods to HDFS and HIVE.
Mar 15 2023, 9:07 AM · Patch-For-Review, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison updated the task description for T331858: Deploy spark-operator webhook.
Mar 15 2023, 9:06 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison moved T331859: Enable egress traffic from spark pods to HDFS and HIVE from Next Up to In Progress on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 15 2023, 9:03 AM · Patch-For-Review, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison moved T331858: Deploy spark-operator webhook from In Progress to In Review on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 15 2023, 9:03 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison moved T331858: Deploy spark-operator webhook from Next Up to In Progress on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 15 2023, 9:03 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)

Mar 14 2023

nfraison added a comment to T331133: Deploy timeline server.

Wait for sparkhistory to be deployed, then see if this is still needed.
It will require some testing to see if it is stable and scalable enough

Mar 14 2023, 4:59 PM · Data-Platform-SRE
nfraison moved T332038: Study blackbox exporter to see if it can be used to probe our web based service from Backlog to To be discussed on the Shared-Data-Infrastructure board.
Mar 14 2023, 4:08 PM · Data-Platform-SRE, Shared-Data-Infrastructure
nfraison created T332038: Study blackbox exporter to see if it can be used to probe our web based service.
Mar 14 2023, 4:08 PM · Data-Platform-SRE, Shared-Data-Infrastructure
nfraison created T331971: [dse-k8s] Deploy spark cli to submit jobs on DSE K8S cluster with K8S config.
Mar 14 2023, 10:31 AM · Data-Platform-SRE
nfraison updated the task description for T331970: Bootstrap spark cli to submit jobs on the DSE K8S cluster.
Mar 14 2023, 10:29 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison created T331970: Bootstrap spark cli to submit jobs on the DSE K8S cluster.
Mar 14 2023, 10:27 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison closed T330656: Implementation of cgroup on top of yarn nodemanager as Invalid.
Mar 14 2023, 10:18 AM · Shared-Data-Infrastructure
nfraison updated subscribers of T330657: Improve our monitoring to more rely on probes.

@BTullis @Stevemunene here is the epic we just discussed IRL.
If you are fine with it, I'd like us to start looking at this during this sprint.
For example, adding one ticket to add probing on one of our web-based services: datahub, superset or turnilo?

Mar 14 2023, 10:17 AM · Shared-Data-Infrastructure, Epic

Mar 13 2023

nfraison added a comment to T331125: Security Issue Access Request for nfraison.

@Aklapper could you confirm that my MFA is correctly set up so @Mstyles can provide appropriate access?

Mar 13 2023, 5:07 PM · SecTeam-Processed, Security-Team, Security
nfraison added a subtask for T331265: Investigate DB connection issues faced from airflow on an-launcher1002: T331892: Move eventlogging_to_druid_ jobs to airflow in order to rely on cluster spark mechanism instead of client.
Mar 13 2023, 4:20 PM · Data-Engineering-Planning, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a parent task for T331892: Move eventlogging_to_druid_ jobs to airflow in order to rely on cluster spark mechanism instead of client: T331265: Investigate DB connection issues faced from airflow on an-launcher1002.
Mar 13 2023, 4:20 PM · Data Pipelines
nfraison created T331892: Move eventlogging_to_druid_ jobs to airflow in order to rely on cluster spark mechanism instead of client.
Mar 13 2023, 4:20 PM · Data Pipelines
nfraison moved T331310: Update FSimage monitoring to rely on JMX metrics and take in account namenode status from Ready to Deploy to Done on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 13 2023, 2:04 PM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T331858: Deploy spark-operator webhook.

Example of a NetworkPolicy:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kserve/templates/networkpolicy.yaml

Mar 13 2023, 10:51 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison updated the task description for T331858: Deploy spark-operator webhook.
Mar 13 2023, 10:50 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison created T331859: Enable egress traffic from spark pods to HDFS and HIVE.
Mar 13 2023, 10:27 AM · Patch-For-Review, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison changed the status of T331858: Deploy spark-operator webhook from Open to In Progress.
Mar 13 2023, 10:25 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison created T331858: Deploy spark-operator webhook.
Mar 13 2023, 10:24 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison assigned T318924: Submit a spark job to the dse-k8s cluster to BTullis.
Mar 13 2023, 10:05 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison moved T318924: Submit a spark job to the dse-k8s cluster from In Progress to Done on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 13 2023, 10:05 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T318924: Submit a spark job to the dse-k8s cluster.

Volumes and volumeMounts which also rely on webhook...

Mar 13 2023, 9:29 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T318924: Submit a spark job to the dse-k8s cluster.

Because the mutating webhook is not enabled, we can't rely on the hadoopConfigMap spec of the SparkApplication -> TODO: create a phab ticket to add the webhook.
Currently trying to manually perform the actions done by the webhook:

  • ConfigMap: hadoop-conf with core-site.xml and hdfs-site.xml
  • volumeMounts and volumes configured
  • HADOOP_CONF_DIR env var exposed
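The three manual steps above can be sketched as a small transformation of a pod spec. This is only an illustration, assuming the pod spec is handled as a plain dict using the standard Kubernetes field names (`volumes`, `volumeMounts`, `env`); the function name `with_hadoop_conf`, the volume name and the `/etc/hadoop/conf` mount path are assumptions and do not come from the task:

```python
import copy


def with_hadoop_conf(pod_spec: dict,
                     config_map: str = "hadoop-conf",
                     mount_path: str = "/etc/hadoop/conf") -> dict:
    """Return a copy of pod_spec with the Hadoop config wired in by hand.

    Mirrors the three manual steps: reference the ConfigMap as a volume,
    mount it in every container, and expose HADOOP_CONF_DIR.
    """
    spec = copy.deepcopy(pod_spec)   # leave the caller's dict untouched
    volume_name = "hadoop-conf-volume"  # hypothetical volume name

    # 1. volume backed by the ConfigMap holding core-site.xml / hdfs-site.xml
    spec.setdefault("volumes", []).append(
        {"name": volume_name, "configMap": {"name": config_map}}
    )

    for container in spec.get("containers", []):
        # 2. mount the volume inside the container
        container.setdefault("volumeMounts", []).append(
            {"name": volume_name, "mountPath": mount_path}
        )
        # 3. point Hadoop clients at the mounted directory
        container.setdefault("env", []).append(
            {"name": "HADOOP_CONF_DIR", "value": mount_path}
        )
    return spec
```

Once the webhook is enabled, these steps would presumably no longer need to be done by hand.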
Mar 13 2023, 9:21 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T318924: Submit a spark job to the dse-k8s cluster.

Spark job submission works with the new NetworkPolicy and the port config on the SparkApplication.
Now trying to run a job accessing HDFS

Mar 13 2023, 8:12 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)

Mar 10 2023

nfraison added a comment to T331580: Fix permissions in hdfs://analytics-hadoop/wmf/data/discovery.

All rights updated to analytics-search:analytics-search-users on hdfs://analytics-hadoop/wmf/data/discovery

Mar 10 2023, 4:21 PM · Patch-For-Review, Data-Engineering, Discovery-Search (Current work), CirrusSearch
nfraison added a comment to T331580: Fix permissions in hdfs://analytics-hadoop/wmf/data/discovery.

The command to change the rights is running

Mar 10 2023, 4:20 PM · Patch-For-Review, Data-Engineering, Discovery-Search (Current work), CirrusSearch
nfraison updated Other Assignee for T331580: Fix permissions in hdfs://analytics-hadoop/wmf/data/discovery, added: nfraison.
Mar 10 2023, 4:16 PM · Patch-For-Review, Data-Engineering, Discovery-Search (Current work), CirrusSearch
nfraison added a comment to T318924: Submit a spark job to the dse-k8s cluster.

Here is the latest version of the SparkApplication definition, which sets the driver and blockmanager ports

Mar 10 2023, 4:13 PM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T330176: [Data Platform] Deploy Spark History Service.

What about putting this component in K8S?

  • Will require to build a docker image
  • Get a keytab to access eventlog in hdfs
  • Expose the web ui to internet on sparkhistory.wikimedia.org
Mar 10 2023, 1:47 PM · Data-Engineering (Sprint 7), Patch-For-Review, Data-Platform-SRE
nfraison added a comment to T330176: [Data Platform] Deploy Spark History Service.

We currently rely on airflow pkg to deploy spark3 so will need to deploy airflow to get it.
TODO:

  • update the package to expose a spark-history-server in /usr/bin
  • create systemd service for it
  • create hdfs folder where history file will be created
  • update spark-conf
Mar 10 2023, 11:19 AM · Data-Engineering (Sprint 7), Patch-For-Review, Data-Platform-SRE
nfraison added a comment to T318924: Submit a spark job to the dse-k8s cluster.

Yes the coreLimit fixed it
And yes we need a specific NetworkPolicy to have driver and executor -> pushing it this morning

Mar 10 2023, 7:50 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)

Mar 9 2023

nfraison added a comment to T318924: Submit a spark job to the dse-k8s cluster.

Executor pods are now launched correctly:

Mar 9 2023, 7:26 PM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T331543: anworker1132 BBU issue/replacement.

Sorry missed that ticket.
It is indeed not needed anymore
The cache issue was linked to the bad disk

Mar 9 2023, 5:02 PM · SRE, ops-eqiad
nfraison moved T330162: Research and test methods for accessing kerberized services from spark running on the DSE K8S cluster from In Progress to In Review on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 9 2023, 12:59 PM · Data-Platform-SRE
nfraison moved T331310: Update FSimage monitoring to rely on JMX metrics and take in account namenode status from In Progress to In Review on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 9 2023, 12:59 PM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T330162: Research and test methods for accessing kerberized services from spark running on the DSE K8S cluster.

@elukey could you please review https://docs.google.com/document/d/1Aub7lUr1nPGN3MXz8FI7CCCZ5a5Y1BRpY3poVmui6AM/edit# with our proposal for hadoop access mechanism for spark jobs on K8S

Mar 9 2023, 8:49 AM · Data-Platform-SRE

Mar 8 2023

nfraison moved T330979: Investigate slownesses on an-worker1132 from Unexpected work/incident to Done on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 8 2023, 5:05 PM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up), Data-Engineering-Planning
nfraison added a comment to T330979: Investigate slownesses on an-worker1132.

Adding the node back, as all partitions are available and the disk cache is back

Mar 8 2023, 4:23 PM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up), Data-Engineering-Planning
nfraison added a comment to T330971: Degraded RAID on an-worker1132.

Strangely since the change of disk everything is back to normal

RECOVERY - MegaRAID on an-worker1132 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
Mar 8 2023, 3:59 PM · Data-Engineering, SRE, ops-eqiad
nfraison added a comment to T330151: Deploy ceph osd processes to data-engineering cluster.

https://phabricator.wikimedia.org/T326945#8534579

Mar 8 2023, 1:40 PM · Patch-For-Review, Data-Platform-SRE
nfraison added a comment to T330151: Deploy ceph osd processes to data-engineering cluster.

Step performed by the cookbook bootstrap script

Mar 8 2023, 11:06 AM · Patch-For-Review, Data-Platform-SRE
nfraison added a comment to T303168: Investigate trend of gradual hive server heap exhaustion.

What we can see from the graph is that the leak in the OldGC is linked to the leak in Metaspace.
512MB is not enough; trying with 1G, but not really confident about that one.
Also added heap dump generation in case of OOM to hopefully understand the leak, but seeing the correlation between Metaspace and old GC I would say it is related to some classloader

Mar 8 2023, 9:38 AM · Patch-For-Review, Data-Platform-SRE
nfraison closed T330982: Automate run of refreshNodes on masters as Resolved.
Mar 8 2023, 9:16 AM · Shared-Data-Infrastructure

Mar 7 2023

nfraison added a comment to T303168: Investigate trend of gradual hive server heap exhaustion.

Need to recheck the setting as it leads to OOM in Metaspace :(

Mar 7 2023, 6:25 PM · Patch-For-Review, Data-Platform-SRE
nfraison added a comment to T331448: Make YARN web interface work with both primary and standby resourcemanager.

FYI, the DNS repo:

Mar 7 2023, 3:43 PM · Data-Platform-SRE
nfraison closed T331446: Ensure yarn.wikimedia.org point to the active RM as Declined.
Mar 7 2023, 3:40 PM · Shared-Data-Infrastructure
nfraison added a comment to T331446: Ensure yarn.wikimedia.org point to the active RM.

Duplicate of https://phabricator.wikimedia.org/T331448

Mar 7 2023, 3:40 PM · Shared-Data-Infrastructure
nfraison created T331446: Ensure yarn.wikimedia.org point to the active RM.
Mar 7 2023, 3:28 PM · Shared-Data-Infrastructure
nfraison added a comment to T331265: Investigate DB connection issues faced from airflow on an-launcher1002.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/895228

Mar 7 2023, 2:21 PM · Data-Engineering-Planning, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T331265: Investigate DB connection issues faced from airflow on an-launcher1002.

The driver is in charge of serving files, jars and the app jar through an HTTP file server. With potentially 64 executors * 4 jobs fetching the 100MB refinery-job-0.0.146.jar, it does generate some load (more than 3 min of network transfer at full speed).
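As a rough check of the "more than 3 min" figure, here is a minimal sketch of the arithmetic. It assumes a 1 Gbit/s link on the host serving the jar, decimal units (1 MB = 10^6 bytes) and no protocol overhead; the function name is hypothetical:

```python
def transfer_seconds(executors: int, jobs: int, jar_mb: float, link_gbit_per_s: float) -> float:
    """Time to serve one jar to every executor of every job over a single link."""
    total_mb = executors * jobs * jar_mb          # 64 * 4 * 100 = 25600 MB
    link_mb_per_s = link_gbit_per_s * 1000 / 8    # 1 Gbit/s = 125 MB/s
    return total_mb / link_mb_per_s


seconds = transfer_seconds(executors=64, jobs=4, jar_mb=100, link_gbit_per_s=1.0)
print(f"{seconds:.0f} s ({seconds / 60:.1f} min)")  # prints "205 s (3.4 min)"
```

Under these assumptions the link is saturated for about 3.4 minutes, which is consistent with the estimate in the comment.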

Mar 7 2023, 12:44 PM · Data-Engineering-Planning, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T331265: Investigate DB connection issues faced from airflow on an-launcher1002.

Tonight's issue:

Mar 07 00:01:31 an-launcher1002 airflow-scheduler@analytics[5803]: Process DagFileProcessor652438-Process:
Mar 07 00:03:17 an-launcher1002 airflow-scheduler@analytics[5803]: [2023-03-07 00:03:17,896] {scheduler_job.py:354} INFO - 5 tasks up for execution:
Mar 7 2023, 9:58 AM · Data-Engineering-Planning, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison moved T329362: Upgrade the Data Engineering team's Zookeeper servers to Bullseye from In Progress to Done on the Shared-Data-Infrastructure (2022-23 Q4 Wrap up) board.
Mar 7 2023, 8:50 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up), Data-Engineering-Planning
nfraison updated the task description for T329362: Upgrade the Data Engineering team's Zookeeper servers to Bullseye.
Mar 7 2023, 8:50 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up), Data-Engineering-Planning

Mar 6 2023

nfraison updated the task description for T331310: Update FSimage monitoring to rely on JMX metrics and take in account namenode status.
Mar 6 2023, 4:53 PM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison updated the task description for T331310: Update FSimage monitoring to rely on JMX metrics and take in account namenode status.
Mar 6 2023, 4:48 PM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison created T331310: Update FSimage monitoring to rely on JMX metrics and take in account namenode status.
Mar 6 2023, 4:46 PM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T331265: Investigate DB connection issues faced from airflow on an-launcher1002.

One idea was that a reportupdater job could be the root cause of that high tx.
Here is the log of all reportupdater services running at midnight (or the last log when there are no logs at midnight). It doesn't seem to match the timeline

● reportupdater-pingback.service - Periodic execution of reportupdater-pingback.service
Mar 06 00:00:01 an-launcher1002 systemd[1]: Started Report Updater job for pingback.
Mar 06 00:00:01 an-launcher1002 systemd[1]: reportupdater-pingback.service: Succeeded.
Mar 06 00:00:01 an-launcher1002 kerberos-run-command[22919]: 2023-03-06 00:00:01,321 - INFO - Execution complete.
Mar 06 00:00:01 an-launcher1002 kerberos-run-command[22919]: 2023-03-06 00:00:00,730 - INFO - Starting execution.
Mar 06 00:00:01 an-launcher1002 kerberos-run-command[22919]: kinit: Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_906) while getting initial credentials
Mar 06 00:00:00 an-launcher1002 kerberos-run-command[23016]: User analytics executes as user analytics the command ['/usr/bin/python3', '/srv/reportupdater/reportupdater/update_reports.py', '-l', 'info', '/srv/reportupdater/jobs/reportupdater-queries/pin
Mar 06 00:00:00 an-launcher1002 systemd[1]: Starting Report Updater job for pingback...
Mar 6 2023, 3:13 PM · Data-Engineering-Planning, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison renamed T331265: Investigate DB connection issues faced from airflow on an-launcher1002 from Investigate DB connection issues faced from airflow on an-aluncher1002 to Investigate DB connection issues faced from airflow on an-launcher1002.
Mar 6 2023, 2:23 PM · Data-Engineering-Planning, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T329362: Upgrade the Data Engineering team's Zookeeper servers to Bullseye.

an-conf1002 done

Mar 6 2023, 1:08 PM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up), Data-Engineering-Planning
nfraison added a comment to T329362: Upgrade the Data Engineering team's Zookeeper servers to Bullseye.

an-conf1001 reimaged, but zookeeper was not starting.
This was due to /etc/zookeeper/conf/version-2/ not belonging to zookeeper:zookeeper (expected, as the user ID is not kept on reimage)

Mar 6 2023, 10:32 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up), Data-Engineering-Planning
nfraison claimed T331265: Investigate DB connection issues faced from airflow on an-launcher1002.
Mar 6 2023, 9:09 AM · Data-Engineering-Planning, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T331265: Investigate DB connection issues faced from airflow on an-launcher1002.

Trying to identify the process which could generate this tx network usage:

  • running a ps command every 15s on an-launcher1002, stored in /home/nfraison/proc.log
  • running netstat -tup to get process/src/dst ip/port, stored in /home/nfraison/netstat.log
  • running iftop -tPs 5 to get traffic sent/received over the last 5s per src host/dst host/port, stored in /home/nfraison/iftop.log
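The periodic sampling listed above can be sketched as a small loop that runs a command at a fixed interval and appends its timestamped output to a log file. This is not the script that was actually used; the function name `sample_command` and the header format are assumptions:

```python
import subprocess
import time
from datetime import datetime


def sample_command(cmd: list, log_path: str, interval_s: float = 15.0, samples: int = 4) -> None:
    """Run cmd `samples` times, `interval_s` seconds apart, appending stdout to log_path."""
    with open(log_path, "a", encoding="utf-8") as log:
        for i in range(samples):
            # check=False: a failing sample should not stop the collection
            result = subprocess.run(cmd, capture_output=True, text=True, check=False)
            # the timestamp header makes it possible to line samples up with graphs
            log.write(f"=== {datetime.now().isoformat()} {' '.join(cmd)} ===\n")
            log.write(result.stdout)
            log.flush()
            if i < samples - 1:
                time.sleep(interval_s)


# Example calls (the ps flags are an assumption; the comment does not give them):
#   sample_command(["ps", "aux"], "/home/nfraison/proc.log", interval_s=15)
#   sample_command(["netstat", "-tup"], "/home/nfraison/netstat.log", interval_s=15)
```

Keeping one log file per tool, as in the list above, makes it easier to compare what each tool reports for the same time window.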
Mar 6 2023, 9:05 AM · Data-Engineering-Planning, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a comment to T331265: Investigate DB connection issues faced from airflow on an-launcher1002.

It looks to me like we reach 100% network usage on an-launcher1002 when the connection issues happen: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-launcher1[…]analytics&from=1677715234951&to=1677715701346&viewPanel=11
This node only has a 1GB NIC; we should identify the local job which causes this usage and see if we can throttle it or make it run on hadoop.

Mar 6 2023, 8:42 AM · Data-Engineering-Planning, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison updated the task description for T331265: Investigate DB connection issues faced from airflow on an-launcher1002.
Mar 6 2023, 8:42 AM · Data-Engineering-Planning, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison added a project to T331265: Investigate DB connection issues faced from airflow on an-launcher1002: Data-Engineering-Planning.
Mar 6 2023, 8:38 AM · Data-Engineering-Planning, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison created T331265: Investigate DB connection issues faced from airflow on an-launcher1002.
Mar 6 2023, 8:37 AM · Data-Engineering-Planning, Shared-Data-Infrastructure (2022-23 Q4 Wrap up)
nfraison moved T330979: Investigate slownesses on an-worker1132 from Backlog to 2022-23 Q4 Wrap up on the Shared-Data-Infrastructure board.
Mar 6 2023, 8:31 AM · Shared-Data-Infrastructure (2022-23 Q4 Wrap up), Data-Engineering-Planning
nfraison moved T303168: Investigate trend of gradual hive server heap exhaustion from Backlog to 2022-23 Q4 Wrap up on the Shared-Data-Infrastructure board.
Mar 6 2023, 8:31 AM · Patch-For-Review, Data-Platform-SRE