nfraison
User

User Details

User Since
Feb 6 2023, 9:38 AM (6 w, 1 d)
Availability
Available
LDAP User
Nicolas Fraison
MediaWiki User
NFraison-WMF [ Global Accounts ]

Recent Activity

Today

nfraison added a comment to T331859: Enable egress traffic from spark pods to HDFS and HIVE.

Two things that will have to be added to the roadmap:

  • Management of hadoop/hive/spark config. It is currently pushed as a configmap with the job, but should probably be a common configmap for all jobs.
  • Management of jars/dependencies. We currently rely on local example jars; for real application dependencies we need to find a solution: ceph s3 (not available for now), archiva (fine for prod jobs with released artifacts, but not good for test jobs not pushed to archiva), http hosted on the wrapper submitting the job to serve files to the running app, other?
Tue, Mar 21, 6:07 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T331859: Enable egress traffic from spark pods to HDFS and HIVE.

The Kerberos issue was due to my config being wrong. I'm using a Hadoop delegation token for HDFS analytics-test-hadoop while my config was pointing at analytics-hadoop. So the HDT was not found and the client fell back to Kerberos...
The config below works fine on the test cluster:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: hadoop-conf
  namespace: spark
data:
  core-site.xml: |
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://analytics-test-hadoop/</value>
      </property>
      <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
      </property>
      <property>
          <name>fs.permissions.umask-mode</name>
          <value>027</value>
      </property>
      <property>
          <name>hadoop.http.staticuser.user</name>
          <value>yarn</value>
      </property>
      <property>
          <name>hadoop.rpc.protection</name>
          <value>privacy</value>
      </property>
      <property>
          <name>hadoop.security.authentication</name>
          <value>kerberos</value>
      </property>
      <property>
          <name>hadoop.ssl.enabled.protocols</name>
          <value>TLSv1.2</value>
      </property>
    </configuration>
  hdfs-site.xml: |
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>dfs.nameservices</name>
        <value>analytics-test-hadoop</value>
      </property>
      <property>
        <name>dfs.ha.namenodes.analytics-test-hadoop</name>
        <value>an-test-master1001-eqiad-wmnet,an-test-master1002-eqiad-wmnet</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.analytics-test-hadoop.an-test-master1001-eqiad-wmnet</name>
        <value>an-test-master1001.eqiad.wmnet:8020</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.analytics-test-hadoop.an-test-master1002-eqiad-wmnet</name>
        <value>an-test-master1002.eqiad.wmnet:8020</value>
      </property>
      <property>
        <name>dfs.client.failover.proxy.provider.analytics-test-hadoop</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>
      <property>
       <name>dfs.blocksize</name>
       <value>268435456</value>
      </property>
      <property>
        <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
        <value>true</value>
      </property>
      <property>
          <name>dfs.block.access.token.enable</name>
          <value>true</value>
      </property>
      <property>
          <name>dfs.data.transfer.protection</name>
          <value>privacy</value>
      </property>
      <property>
          <name>dfs.datanode.kerberos.principal</name>
          <value>hdfs/_HOST@WIKIMEDIA</value>
      </property>
      <property>
          <name>dfs.encrypt.data.transfer</name>
          <value>true</value>
      </property>
      <property>
          <name>dfs.encrypt.data.transfer.cipher.key.bitlength</name>
          <value>128</value>
      </property>
      <property>
          <name>dfs.encrypt.data.transfer.cipher.suites</name>
          <value>AES/CTR/NoPadding</value>
      </property>
      <property>
          <name>dfs.http.policy</name>
          <value>HTTPS_ONLY</value>
      </property>
      <property>
          <name>dfs.namenode.kerberos.principal</name>
          <value>hdfs/_HOST@WIKIMEDIA</value>
      </property>
      <property>
          <name>dfs.web.authentication.kerberos.principal</name>
          <value>HTTP/_HOST@WIKIMEDIA</value>
      </property>
    </configuration>
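
A mismatch like the one described above (fs.defaultFS pointing at a nameservice that hdfs-site.xml does not declare) can be caught before submitting a job. The following is a minimal sketch, not part of the original comment: it assumes the two XML documents are available as strings, and the helper names `read_properties` and `check_nameservice` are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def read_properties(xml_text):
    """Return the <name>/<value> pairs of a Hadoop *-site.xml document as a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def check_nameservice(core_site_xml, hdfs_site_xml):
    """List inconsistencies between fs.defaultFS and the declared nameservices."""
    core = read_properties(core_site_xml)
    hdfs = read_properties(hdfs_site_xml)

    # hdfs://analytics-test-hadoop/ -> analytics-test-hadoop
    default_ns = urlparse(core.get("fs.defaultFS", "")).netloc
    declared = [ns.strip()
                for ns in hdfs.get("dfs.nameservices", "").split(",")
                if ns.strip()]

    problems = []
    if default_ns not in declared:
        problems.append(
            f"fs.defaultFS uses {default_ns!r} but dfs.nameservices declares {declared}")
    for ns in declared:
        if f"dfs.ha.namenodes.{ns}" not in hdfs:
            problems.append(f"no dfs.ha.namenodes.{ns} entry for nameservice {ns!r}")
    return problems
```

An empty list means the two files agree on the nameservice; it does not prove that the delegation token itself was issued for that cluster.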
Tue, Mar 21, 5:42 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T331859: Enable egress traffic from spark pods to HDFS and HIVE.

It also seems that dse-k8s-worker1002.eqiad.wmnet has a strange pattern, with the driver stuck at

Tue, Mar 21, 5:26 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T331859: Enable egress traffic from spark pods to HDFS and HIVE.

FW access is open. Access is OK with the job config below

Tue, Mar 21, 4:00 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)

Yesterday

nfraison moved T303168: Investigate trend of gradual hive server heap exhaustion from Blocked/Paused to Done on the Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10) board.
Mon, Mar 20, 5:10 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning

Fri, Mar 17

nfraison added a comment to T330162: DSE Experiment - PoC how to Address Kerberos from spark running on DSE K8S cluster.

Sent messages on the #wikimedia-serviceops IRC channel to get reviews from SRE and check whether the chosen vault mechanism is acceptable

Fri, Mar 17, 8:59 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison claimed T331970: Bootstrap spark cli to submit jobs on the DSE K8S cluster.
Fri, Mar 17, 7:27 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison claimed T331859: Enable egress traffic from spark pods to HDFS and HIVE.
Fri, Mar 17, 7:27 AM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison moved T331970: Bootstrap spark cli to submit jobs on the DSE K8S cluster from Next Up to In Progress on the Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10) board.
Fri, Mar 17, 7:27 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)

Thu, Mar 16

nfraison moved T303168: Investigate trend of gradual hive server heap exhaustion from Unexpected work/incident to Blocked/Paused on the Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10) board.
Thu, Mar 16, 4:52 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison added a comment to T303168: Investigate trend of gradual hive server heap exhaustion.

Taking a heap dump of the test hiveserver2 and analyzing it with MAT shows multiple instances of many of our singleton UDFs, while there should be only one of each:

Thu, Mar 16, 3:08 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning

Wed, Mar 15

nfraison placed T331859: Enable egress traffic from spark pods to HDFS and HIVE up for grabs.
Wed, Mar 15, 1:40 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison updated the task description for T331859: Enable egress traffic from spark pods to HDFS and HIVE.
Wed, Mar 15, 9:07 AM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison updated the task description for T331858: Deploy spark-operator webhook.
Wed, Mar 15, 9:06 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison moved T331859: Enable egress traffic from spark pods to HDFS and HIVE from Next Up to In Progress on the Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10) board.
Wed, Mar 15, 9:03 AM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison moved T331858: Deploy spark-operator webhook from In Progress to In Review on the Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10) board.
Wed, Mar 15, 9:03 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison moved T331858: Deploy spark-operator webhook from Next Up to In Progress on the Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10) board.
Wed, Mar 15, 9:03 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)

Tue, Mar 14

nfraison added a comment to T331133: Deploy timeline server.

Wait for sparkhistory to be deployed, then see if this is still needed.
Will require some testing to see if it is stable and scales well enough.

Tue, Mar 14, 4:59 PM · Shared-Data-Infrastructure
nfraison moved T332038: Study blackbox exporter to see if it can be used to probe our web based service from Backlog to To be discussed on the Shared-Data-Infrastructure board.
Tue, Mar 14, 4:08 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison created T332038: Study blackbox exporter to see if it can be used to probe our web based service.
Tue, Mar 14, 4:08 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison created T331971: Deploy spak cli to submit jobs on DSE K8S cluster with K8S config.
Tue, Mar 14, 10:31 AM · Shared-Data-Infrastructure
nfraison updated the task description for T331970: Bootstrap spark cli to submit jobs on the DSE K8S cluster.
Tue, Mar 14, 10:29 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison created T331970: Bootstrap spark cli to submit jobs on the DSE K8S cluster.
Tue, Mar 14, 10:27 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison closed T330656: Implementation of cgroup on top of yarn nodemanager as Invalid.
Tue, Mar 14, 10:18 AM · Shared-Data-Infrastructure
nfraison updated subscribers of T330657: Improve our monitoring to more rely on probes.

@BTullis @Stevemunene here is the epic we just discussed IRL.
If you are fine with it, I'd like us to start looking at this in this sprint.
For example, adding one ticket to add probing on one of our web-based services: datahub, superset or turnilo?

Tue, Mar 14, 10:17 AM · Epic, Shared-Data-Infrastructure

Mon, Mar 13

nfraison added a comment to T331125: Security Issue Access Request for nfraison.

@Aklapper could you confirm that my MFA is correctly set up so @Mstyles can provide appropriate access?

Mon, Mar 13, 5:07 PM · Security-Team, Security
nfraison added a subtask for T331265: Investigate DB connection issues faced from airflow on an-launcher1002: T331892: Move eventlogging_to_druid_ jobs to airflow in order to rely on cluster spark mechanism instead of client.
Mon, Mar 13, 4:20 PM · Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a parent task for T331892: Move eventlogging_to_druid_ jobs to airflow in order to rely on cluster spark mechanism instead of client: T331265: Investigate DB connection issues faced from airflow on an-launcher1002.
Mon, Mar 13, 4:20 PM · Data Pipelines
nfraison created T331892: Move eventlogging_to_druid_ jobs to airflow in order to rely on cluster spark mechanism instead of client.
Mon, Mar 13, 4:20 PM · Data Pipelines
nfraison moved T331310: Update FSimage monitoring to rely on JMX metrics and take in account namenode status from Ready to Deploy to Done on the Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10) board.
Mon, Mar 13, 2:04 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T331858: Deploy spark-operator webhook.

Ex. of Networkpolicy:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kserve/templates/networkpolicy.yaml

Mon, Mar 13, 10:51 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison updated the task description for T331858: Deploy spark-operator webhook.
Mon, Mar 13, 10:50 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison created T331859: Enable egress traffic from spark pods to HDFS and HIVE.
Mon, Mar 13, 10:27 AM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison changed the status of T331858: Deploy spark-operator webhook from Open to In Progress.
Mon, Mar 13, 10:25 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison created T331858: Deploy spark-operator webhook.
Mon, Mar 13, 10:24 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison assigned T318924: Submit a spark job to the dse-k8s cluster to BTullis.
Mon, Mar 13, 10:05 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison moved T318924: Submit a spark job to the dse-k8s cluster from In Progress to Done on the Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10) board.
Mon, Mar 13, 10:05 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T318924: Submit a spark job to the dse-k8s cluster.

Volumes and volumeMounts also rely on the webhook...

Mon, Mar 13, 9:29 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T318924: Submit a spark job to the dse-k8s cluster.

Because the mutating webhook is not enabled, we can't rely on the hadoopConfigMap spec on the SparkApplication -> TODO: create a phab ticket to add the webhook.
Currently trying to manually perform the actions done by the webhook:

  • ConfigMap: hadoop-conf with core-site.xml and hdfs-site.xml
  • volumeMounts and volumes configured
  • HADOOP_CONF_DIR env var exposed
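
The three manual steps above can be sketched as a function that patches a pod spec dictionary. This is an illustration only: the mount path `/etc/hadoop/conf`, the volume name `hadoop-conf-volume` and the function name `add_hadoop_conf` are assumptions, not values taken from the task.

```python
import copy


def add_hadoop_conf(pod_spec, configmap_name="hadoop-conf",
                    mount_path="/etc/hadoop/conf",          # assumed path
                    volume_name="hadoop-conf-volume"):      # assumed name
    """Return a copy of pod_spec with the Hadoop ConfigMap mounted in every container."""
    spec = copy.deepcopy(pod_spec)

    # 1. Declare a volume backed by the ConfigMap.
    spec.setdefault("volumes", []).append(
        {"name": volume_name, "configMap": {"name": configmap_name}})

    for container in spec.get("containers", []):
        # 2. Mount that volume in the container.
        container.setdefault("volumeMounts", []).append(
            {"name": volume_name, "mountPath": mount_path, "readOnly": True})
        # 3. Point Hadoop clients at the mounted directory.
        container.setdefault("env", []).append(
            {"name": "HADOOP_CONF_DIR", "value": mount_path})
    return spec
```

The input spec is left untouched, so the patched and unpatched versions can be compared side by side.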
Mon, Mar 13, 9:21 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T318924: Submit a spark job to the dse-k8s cluster.

Spark job submission works with the new NetworkPolicy and the port config on the SparkApplication.
Now trying to run a job accessing HDFS.

Mon, Mar 13, 8:12 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)

Fri, Mar 10

nfraison added a comment to T331580: Fix permissions in hdfs://analytics-hadoop/wmf/data/discovery.

All rights updated to analytics-search:analytics-search-users on hdfs://analytics-hadoop/wmf/data/discovery

Fri, Mar 10, 4:21 PM · Data-Engineering, Discovery-Search (Current work), CirrusSearch
nfraison added a comment to T331580: Fix permissions in hdfs://analytics-hadoop/wmf/data/discovery.

The command to change rights is running

Fri, Mar 10, 4:20 PM · Data-Engineering, Discovery-Search (Current work), CirrusSearch
nfraison updated Other Assignee for T331580: Fix permissions in hdfs://analytics-hadoop/wmf/data/discovery, added: nfraison.
Fri, Mar 10, 4:16 PM · Data-Engineering, Discovery-Search (Current work), CirrusSearch
nfraison added a comment to T318924: Submit a spark job to the dse-k8s cluster.

Here is the latest version of the SparkApplication definition, which sets the driver and blockmanager ports

Fri, Mar 10, 4:13 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T330176: Deploy spark history.

What about putting this component in K8S?

  • Will require building a docker image
  • Get a keytab to access eventlog in hdfs
  • Expose the web ui to internet on sparkhistory.wikimedia.org
Fri, Mar 10, 1:47 PM · Patch-For-Review, Shared-Data-Infrastructure
nfraison added a comment to T330176: Deploy spark history.

We currently rely on airflow pkg to deploy spark3 so will need to deploy airflow to get it.
TODO:

  • update the package to expose a spark-history-server in /usr/bin
  • create systemd service for it
  • create hdfs folder where history file will be created
  • update spark-conf
Fri, Mar 10, 11:19 AM · Patch-For-Review, Shared-Data-Infrastructure
nfraison added a comment to T318924: Submit a spark job to the dse-k8s cluster.

Yes, the coreLimit fixed it.
And yes, we need a specific NetworkPolicy to let the driver and executors communicate -> pushing it this morning

Fri, Mar 10, 7:50 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)

Thu, Mar 9

nfraison added a comment to T318924: Submit a spark job to the dse-k8s cluster.

Executor pods are now launched correctly:

Thu, Mar 9, 7:26 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T331543: anworker1132 BBU issue/replacement.

Sorry, I missed that ticket.
It is indeed not needed anymore.
The cache issue was linked to the bad disk.

Thu, Mar 9, 5:02 PM · SRE, ops-eqiad
nfraison moved T330162: DSE Experiment - PoC how to Address Kerberos from spark running on DSE K8S cluster from In Progress to In Review on the Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10) board.
Thu, Mar 9, 12:59 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison moved T331310: Update FSimage monitoring to rely on JMX metrics and take in account namenode status from In Progress to In Review on the Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10) board.
Thu, Mar 9, 12:59 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T330162: DSE Experiment - PoC how to Address Kerberos from spark running on DSE K8S cluster.

@elukey could you please review https://docs.google.com/document/d/1Aub7lUr1nPGN3MXz8FI7CCCZ5a5Y1BRpY3poVmui6AM/edit# with our proposal for hadoop access mechanism for spark jobs on K8S

Thu, Mar 9, 8:49 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)

Wed, Mar 8

nfraison moved T330979: Investigate slownesses on an-worker1132 from Unexpected work/incident to Done on the Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10) board.
Wed, Mar 8, 5:05 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison added a comment to T330979: Investigate slownesses on an-worker1132.

Adding the node back as all partitions are available and the disk cache is back

Wed, Mar 8, 4:23 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison added a comment to T330971: Degraded RAID on an-worker1132.

Strangely since the change of disk everything is back to normal

RECOVERY - MegaRAID on an-worker1132 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
Wed, Mar 8, 3:59 PM · Data-Engineering, SRE, ops-eqiad
nfraison added a comment to T330151: Deploy ceph osd processes to data-engineering cluster.

https://phabricator.wikimedia.org/T326945#8534579

Wed, Mar 8, 1:40 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T330151: Deploy ceph osd processes to data-engineering cluster.

Step performed by the cookbook bootstrap script

Wed, Mar 8, 11:06 AM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T303168: Investigate trend of gradual hive server heap exhaustion.

What we can see from the graph is that the leak in the OldGC is linked to the leak in Metaspace.
512MB is not enough; trying with 1G, but I'm not really confident about that one.
Also added heap dump generation in case of OOM to hopefully understand the leak, but seeing the correlation between Metaspace and OldGC I would say it is related to some classloader.

Wed, Mar 8, 9:38 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison closed T330982: Automate run of refreshNodes on masters as Resolved.
Wed, Mar 8, 9:16 AM · Shared-Data-Infrastructure

Tue, Mar 7

nfraison added a comment to T303168: Investigate trend of gradual hive server heap exhaustion.

Need to recheck the setting as it leads to OOM in Metaspace :(

Tue, Mar 7, 6:25 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison added a comment to T331448: Make YARN web interface work with both primary and standby resourcemanager.

FYI, dns repo:

Tue, Mar 7, 3:43 PM · Shared-Data-Infrastructure, Data-Engineering-Planning
nfraison closed T331446: Ensure yarn.wikimedia.org point to the active RM as Declined.
Tue, Mar 7, 3:40 PM · Shared-Data-Infrastructure
nfraison added a comment to T331446: Ensure yarn.wikimedia.org point to the active RM.

Duplicate of https://phabricator.wikimedia.org/T331448

Tue, Mar 7, 3:40 PM · Shared-Data-Infrastructure
nfraison created T331446: Ensure yarn.wikimedia.org point to the active RM.
Tue, Mar 7, 3:28 PM · Shared-Data-Infrastructure
nfraison added a comment to T331265: Investigate DB connection issues faced from airflow on an-launcher1002.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/895228

Tue, Mar 7, 2:21 PM · Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T331265: Investigate DB connection issues faced from airflow on an-launcher1002.

The driver is in charge of serving files, jars and the app jar through an HTTP file server. With potentially 64 executors * 4 jobs each fetching the 100MB refinery-job-0.0.146.jar, it does indeed generate some load (more than 3 min of network transfer at full speed).
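
The estimate above can be reproduced with a short calculation. This is a sketch that assumes a 1 Gbit/s link on the host serving the jar and ignores protocol overhead; the function and parameter names are illustrative.

```python
def transfer_seconds(executors, jobs, jar_megabytes, link_megabits_per_s):
    """Time needed to push one jar copy to every executor of every job."""
    total_megabytes = executors * jobs * jar_megabytes
    link_megabytes_per_s = link_megabits_per_s / 8  # 8 bits per byte
    return total_megabytes / link_megabytes_per_s


# 64 executors * 4 jobs * 100 MB = 25600 MB over an assumed 1 Gbit/s link
seconds = transfer_seconds(executors=64, jobs=4, jar_megabytes=100,
                           link_megabits_per_s=1000)
print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")  # prints "205 s (~3.4 min)"
```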

Tue, Mar 7, 12:44 PM · Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T331265: Investigate DB connection issues faced from airflow on an-launcher1002.

Tonight's issue:

Mar 07 00:01:31 an-launcher1002 airflow-scheduler@analytics[5803]: Process DagFileProcessor652438-Process:
Mar 07 00:03:17 an-launcher1002 airflow-scheduler@analytics[5803]: [2023-03-07 00:03:17,896] {scheduler_job.py:354} INFO - 5 tasks up for execution:
Tue, Mar 7, 9:58 AM · Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison moved T329362: Upgrade the Data Engineering team's Zookeeper servers to Bullseye from In Progress to Done on the Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10) board.
Tue, Mar 7, 8:50 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison updated the task description for T329362: Upgrade the Data Engineering team's Zookeeper servers to Bullseye.
Tue, Mar 7, 8:50 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning

Mon, Mar 6

nfraison updated the task description for T331310: Update FSimage monitoring to rely on JMX metrics and take in account namenode status.
Mon, Mar 6, 4:53 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison updated the task description for T331310: Update FSimage monitoring to rely on JMX metrics and take in account namenode status.
Mon, Mar 6, 4:48 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison created T331310: Update FSimage monitoring to rely on JMX metrics and take in account namenode status.
Mon, Mar 6, 4:46 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T331265: Investigate DB connection issues faced from airflow on an-launcher1002.

One idea was that a reportupdater job could be the root cause of that high tx.
Here is the log of all reportupdater services running at midnight (or the last log when there are no logs at midnight). It doesn't seem to match the timeline.

● reportupdater-pingback.service - Periodic execution of reportupdater-pingback.service
Mar 06 00:00:01 an-launcher1002 systemd[1]: Started Report Updater job for pingback.
Mar 06 00:00:01 an-launcher1002 systemd[1]: reportupdater-pingback.service: Succeeded.
Mar 06 00:00:01 an-launcher1002 kerberos-run-command[22919]: 2023-03-06 00:00:01,321 - INFO - Execution complete.
Mar 06 00:00:01 an-launcher1002 kerberos-run-command[22919]: 2023-03-06 00:00:00,730 - INFO - Starting execution.
Mar 06 00:00:01 an-launcher1002 kerberos-run-command[22919]: kinit: Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_906) while getting initial credentials
Mar 06 00:00:00 an-launcher1002 kerberos-run-command[23016]: User analytics executes as user analytics the command ['/usr/bin/python3', '/srv/reportupdater/reportupdater/update_reports.py', '-l', 'info', '/srv/reportupdater/jobs/reportupdater-queries/pin
Mar 06 00:00:00 an-launcher1002 systemd[1]: Starting Report Updater job for pingback...
Mon, Mar 6, 3:13 PM · Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison renamed T331265: Investigate DB connection issues faced from airflow on an-launcher1002 from Investigate DB connection issues faced from airflow on an-aluncher1002 to Investigate DB connection issues faced from airflow on an-launcher1002.
Mon, Mar 6, 2:23 PM · Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T329362: Upgrade the Data Engineering team's Zookeeper servers to Bullseye.

an-conf1002 done

Mon, Mar 6, 1:08 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison added a comment to T329362: Upgrade the Data Engineering team's Zookeeper servers to Bullseye.

an-conf1001 reimaged but zookeeper not starting.
This was due to /etc/zookeeper/conf/version-2/ not belonging to zookeeper:zookeeper (expected, as the user id is not kept on reimage).

Mon, Mar 6, 10:32 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison claimed T331265: Investigate DB connection issues faced from airflow on an-launcher1002.
Mon, Mar 6, 9:09 AM · Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T331265: Investigate DB connection issues faced from airflow on an-launcher1002.

Trying to identify the process which could generate this tx network usage:

  • running a ps command every 15s on an-launcher1002, with output stored in /home/nfraison/proc.log
  • running netstat -tup to get process/src/dst ip/port
  • running iftop -tPs 5 to get traffic sent/received over the last 5s per src host/dst host/port
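
The periodic ps sampling in the first bullet could look like the sketch below. The exact ps invocation used on the host is not given in the comment, so the column list here is an assumption, as are the function names; the log path and interval are passed in by the caller.

```python
import subprocess
import time
from datetime import datetime


def format_sample(timestamp, ps_output):
    """Prefix every line of one ps snapshot with the sampling time."""
    stamp = timestamp.isoformat(timespec="seconds")
    return "".join(f"{stamp} {line}\n" for line in ps_output.splitlines())


def sample_processes(log_path, interval_s=15, iterations=1):
    """Append `iterations` ps snapshots to log_path, one every interval_s seconds."""
    for i in range(iterations):
        # Assumed column selection; adjust to whatever is useful for the investigation.
        ps = subprocess.run(
            ["ps", "-eo", "pid,user,pcpu,pmem,args"],
            capture_output=True, text=True, check=True)
        with open(log_path, "a") as log:
            log.write(format_sample(datetime.now(), ps.stdout))
        if i + 1 < iterations:
            time.sleep(interval_s)
```

Timestamping each line makes it possible to line the snapshots up with the network graphs afterwards, for example with `sample_processes("/home/nfraison/proc.log", interval_s=15, iterations=240)` to cover one hour.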
Mon, Mar 6, 9:05 AM · Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T331265: Investigate DB connection issues faced from airflow on an-launcher1002.

It looks to me like we reach 100% network usage on an-launcher1002 when the connection issues happen: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-launcher1[…]analytics&from=1677715234951&to=1677715701346&viewPanel=11
This node only has a 1Gb NIC; we should identify the local job which causes this usage and see if we can throttle it or make it run on hadoop.

Mon, Mar 6, 8:42 AM · Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison updated the task description for T331265: Investigate DB connection issues faced from airflow on an-launcher1002.
Mon, Mar 6, 8:42 AM · Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a project to T331265: Investigate DB connection issues faced from airflow on an-launcher1002: Data-Engineering-Planning.
Mon, Mar 6, 8:38 AM · Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison created T331265: Investigate DB connection issues faced from airflow on an-launcher1002.
Mon, Mar 6, 8:37 AM · Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison moved T330979: Investigate slownesses on an-worker1132 from Backlog to Shared-Data-Infra Sprint 10 on the Shared-Data-Infrastructure board.
Mon, Mar 6, 8:31 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison moved T303168: Investigate trend of gradual hive server heap exhaustion from Backlog to Shared-Data-Infra Sprint 10 on the Shared-Data-Infrastructure board.
Mon, Mar 6, 8:31 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning

Fri, Mar 3

nfraison created T331133: Deploy timeline server.
Fri, Mar 3, 3:04 PM · Shared-Data-Infrastructure
nfraison added a comment to T330151: Deploy ceph osd processes to data-engineering cluster.

Currently OSDs are set up in 2 steps:

Fri, Mar 3, 2:21 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a project to T331125: Security Issue Access Request for nfraison: Security-Team.
Fri, Mar 3, 12:59 PM · Security-Team, Security
nfraison updated subscribers of T331125: Security Issue Access Request for nfraison.

@odimitrijevic could you please approve?
Thks

Fri, Mar 3, 12:58 PM · Security-Team, Security
nfraison created T331125: Security Issue Access Request for nfraison.
Fri, Mar 3, 12:57 PM · Security-Team, Security
nfraison added a comment to T330971: Degraded RAID on an-worker1132.

@Cmjohnson this node has strange behaviour on raid/disks

Fri, Mar 3, 10:30 AM · Data-Engineering, SRE, ops-eqiad
nfraison added a comment to T330979: Investigate slownesses on an-worker1132.

Enforcing cache to WriteBack doesn't work: sudo megacli -LDSetProp -WB -Immediate -Lall -aAll

Fri, Mar 3, 9:25 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison added a comment to T330979: Investigate slownesses on an-worker1132.

BBU looks fine

Fri, Mar 3, 9:02 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison added a comment to T330979: Investigate slownesses on an-worker1132.

The RAID disk cache policy is WriteThrough instead of WriteBack.

  • On an-worker1131
nfraison@an-worker1131:~$ sudo megacli -LDInfo -Lall -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 446.625 GB
Sector Size         : 512
Is VD emulated      : Yes
Mirror Data         : 446.625 GB
State               : Optimal
Strip Size          : 64 KB
Number Of Drives    : 2
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
  • On an-worker1132
nfraison@an-worker1132:~$ sudo megacli -LDInfo -Lall -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 446.625 GB
Sector Size         : 512
Is VD emulated      : Yes
Mirror Data         : 446.625 GB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 2
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
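
The comparison above (Default vs Current cache policy per virtual drive) can be automated. The following is a minimal sketch that parses text in the format shown by `megacli -LDInfo -Lall -aALL` above; the function name `cache_policy_drift` is hypothetical, and the parser only relies on the three line prefixes visible in that output.

```python
def cache_policy_drift(ldinfo_output):
    """Return (drive, default_policy, current_policy) for drives whose policies differ."""
    drifts = []
    drive = None
    default = None
    for raw in ldinfo_output.splitlines():
        line = raw.strip()
        if line.startswith("Virtual Drive:"):
            # "Virtual Drive: 0 (Target Id: 0)" -> "0"
            drive = line.split(":", 1)[1].split("(")[0].strip()
            default = None
        elif line.startswith("Default Cache Policy:"):
            default = line.split(":", 1)[1].strip()
        elif line.startswith("Current Cache Policy:"):
            current = line.split(":", 1)[1].strip()
            if default is not None and current != default:
                drifts.append((drive, default, current))
    return drifts
```

On the an-worker1131 output this returns an empty list, while on the an-worker1132 output it reports drive 0 as having fallen back from WriteBack to WriteThrough.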
Fri, Mar 3, 9:00 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison added a comment to T330979: Investigate slownesses on an-worker1132.

On an-worker1132 all disks have the same stats, with no more than 4MiB/s for read/write and 238/158 IOPS

Fri, Mar 3, 8:45 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison added a comment to T330979: Investigate slownesses on an-worker1132.

For reference disk bench from an-worker1131

Fri, Mar 3, 8:36 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning

Thu, Mar 2

nfraison claimed T330151: Deploy ceph osd processes to data-engineering cluster.
Thu, Mar 2, 4:20 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison added a comment to T330979: Investigate slownesses on an-worker1132.

https://phabricator.wikimedia.org/T330971

Thu, Mar 2, 3:53 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison moved T330151: Deploy ceph osd processes to data-engineering cluster from Next Up to In Progress on the Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10) board.
Thu, Mar 2, 3:47 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)
nfraison claimed T330979: Investigate slownesses on an-worker1132.
Thu, Mar 2, 2:00 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
nfraison added a comment to T330979: Investigate slownesses on an-worker1132.

Downtimed the node to avoid false alerts

Thu, Mar 2, 12:57 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning