User Details
- User since: Feb 6 2023, 9:38 AM
- LDAP user: Nicolas Fraison
- MediaWiki user: NFraison-WMF
Today
Two things that will have to be added to the roadmap:
- Management of the hadoop/hive/spark config. It is currently pushed as a ConfigMap with each job, but it should probably be a common ConfigMap shared by all jobs.
- Management of jars/dependencies. We currently rely on local example jars; for real application dependencies we need to find a solution: Ceph S3 (not available for now), Archiva (fine for prod jobs with released artifacts, but not good for test builds not pushed to Archiva), an HTTP server hosted on the wrapper submitting the job to serve files to the running app, or something else? (A rough sketch of both ideas follows below.)
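As a rough sketch of where this could go (illustrative only; it assumes the spark-on-k8s-operator SparkApplication CRD, a shared hadoop-conf ConfigMap like the one pasted below, and a placeholder Archiva path; note that hadoopConfigMap relies on the operator mutating webhook, as noted in the Mar 13 entry):

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example-job        # illustrative name
  namespace: spark
spec:
  type: Scala
  mode: cluster
  # common Hadoop config shared by all jobs instead of one ConfigMap per job
  hadoopConfigMap: hadoop-conf
  # dependencies fetched from a remote store instead of local example jars
  deps:
    jars:
      - https://archiva.wikimedia.org/.../refinery-job.jar   # placeholder URL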
For the kerb issue, it was due to my config not being right. I'm using a Hadoop delegation token for the HDFS cluster analytics-test-hadoop while my config was pointing at analytics-hadoop, so it couldn't find the HDT and fell back to Kerberos...
The config below is working fine in the test cluster:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: hadoop-conf
  namespace: spark
data:
  core-site.xml: |
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property><name>fs.defaultFS</name><value>hdfs://analytics-test-hadoop/</value></property>
      <property><name>io.file.buffer.size</name><value>131072</value></property>
      <property><name>fs.permissions.umask-mode</name><value>027</value></property>
      <property><name>hadoop.http.staticuser.user</name><value>yarn</value></property>
      <property><name>hadoop.rpc.protection</name><value>privacy</value></property>
      <property><name>hadoop.security.authentication</name><value>kerberos</value></property>
      <property><name>hadoop.ssl.enabled.protocols</name><value>TLSv1.2</value></property>
    </configuration>
  hdfs-site.xml: |
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property><name>dfs.nameservices</name><value>analytics-test-hadoop</value></property>
      <property><name>dfs.ha.namenodes.analytics-test-hadoop</name><value>an-test-master1001-eqiad-wmnet,an-test-master1002-eqiad-wmnet</value></property>
      <property><name>dfs.namenode.rpc-address.analytics-test-hadoop.an-test-master1001-eqiad-wmnet</name><value>an-test-master1001.eqiad.wmnet:8020</value></property>
      <property><name>dfs.namenode.rpc-address.analytics-test-hadoop.an-test-master1002-eqiad-wmnet</name><value>an-test-master1002.eqiad.wmnet:8020</value></property>
      <property><name>dfs.client.failover.proxy.provider.analytics-test-hadoop</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
      <property><name>dfs.blocksize</name><value>268435456</value></property>
      <property><name>dfs.datanode.hdfs-blocks-metadata.enabled</name><value>true</value></property>
      <property><name>dfs.block.access.token.enable</name><value>true</value></property>
      <property><name>dfs.data.transfer.protection</name><value>privacy</value></property>
      <property><name>dfs.datanode.kerberos.principal</name><value>hdfs/_HOST@WIKIMEDIA</value></property>
      <property><name>dfs.encrypt.data.transfer</name><value>true</value></property>
      <property><name>dfs.encrypt.data.transfer.cipher.key.bitlength</name><value>128</value></property>
      <property><name>dfs.encrypt.data.transfer.cipher.suites</name><value>AES/CTR/NoPadding</value></property>
      <property><name>dfs.http.policy</name><value>HTTPS_ONLY</value></property>
      <property><name>dfs.namenode.kerberos.principal</name><value>hdfs/_HOST@WIKIMEDIA</value></property>
      <property><name>dfs.web.authentication.kerberos.principal</name><value>HTTP/_HOST@WIKIMEDIA</value></property>
    </configuration>
Also, it seems that dse-k8s-worker1002.eqiad.wmnet has a strange pattern: the driver gets stuck at
FW access is open. Access is OK with the job config below.
Yesterday
Fri, Mar 17
Send messages on the #wikimedia-serviceops IRC channel to get some reviews from SRE and check whether the chosen vault mechanism is acceptable or not.
Thu, Mar 16
Taking a heap dump of the test HiveServer2 and analyzing it with MAT shows many instances of our singleton UDFs, while there should be only one of each (see the heap-dump sketch below):
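For reference, the heap dump can be taken with jmap along these lines (a sketch; the PID lookup, the hive user, and the output path are placeholders):

# locate the HiveServer2 process and dump its heap for analysis in MAT
HS2_PID=$(pgrep -f HiveServer2)
sudo -u hive jmap -dump:live,format=b,file=/tmp/hiveserver2.hprof "$HS2_PID"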
Wed, Mar 15
Tue, Mar 14
Wait for the Spark history server to be deployed and then see if this is still needed.
Will require some testing to see if it is stable and scales well enough.
@BTullis @Stevemunene here is the epic we just discussed IRL.
If you are fine with it, I'd like us to start looking at this in this sprint.
For example, adding one ticket to add probing on one of our web-based services (DataHub, Superset, or Turnilo)?
Mon, Mar 13
Volumes and volumeMounts also rely on the webhook...
Because the mutating webhook is not enabled, we can't rely on the hadoopConfigMap spec on the SparkApplication -> TODO: create a Phab ticket to add the webhook.
Currently trying to perform manually the actions normally done by the webhook (see the sketch after this list):
- ConfigMap: hadoop-conf with core-site.xml and hdfs-site.xml
- volumeMounts and volumes configured
- HADOOP_CONF_DIR env var exposed
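Roughly, the end state to reproduce on the driver and executor pods looks like this (an illustrative pod spec fragment only, assuming the hadoop-conf ConfigMap above; the mount path is a placeholder, and this mirrors what the operator webhook would normally inject):

volumes:
  - name: hadoop-conf-vol
    configMap:
      name: hadoop-conf
containers:
  - name: spark-kubernetes-driver
    volumeMounts:
      - name: hadoop-conf-vol
        mountPath: /etc/hadoop/conf
    env:
      - name: HADOOP_CONF_DIR
        value: /etc/hadoop/conf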
Spark job submission works with the new NetworkPolicy and the port config on the SparkApplication.
Now trying to run a job accessing HDFS.
Fri, Mar 10
All rights updated to analytics-search:analytics-search-users on hdfs://analytics-hadoop/wmf/data/discovery
Command to change the rights is running
Here is the latest version of the SparkApplication definition that sets the driver and block manager ports (a generic sketch of those settings is below):
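For reference, a minimal sketch of how those ports can be pinned on a SparkApplication (the port numbers are placeholders and must match the NetworkPolicy):

spec:
  sparkConf:
    # fixed ports so the NetworkPolicy can target driver <-> executor traffic
    "spark.driver.port": "40000"
    "spark.blockManager.port": "40001"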
What about putting this component in K8S?
- Will require building a Docker image
- Get a keytab to access eventlog in hdfs
- Expose the web ui to internet on sparkhistory.wikimedia.org
We currently rely on the Airflow package to deploy Spark 3, so we will need to deploy Airflow to get it.
TODO (a rough sketch of these pieces is below):
- update the package to expose a spark-history-server binary in /usr/bin
- create a systemd service for it
- create the HDFS folder where history files will be written
- update spark-conf
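A sketch of what those pieces could look like (the event-log directory, user, and conf paths are assumptions, not actual values; /usr/bin/spark-history-server is the binary mentioned in the TODO above):

# spark-defaults.conf additions (illustrative values)
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs://analytics-hadoop/var/log/spark-history
spark.history.fs.logDirectory     hdfs://analytics-hadoop/var/log/spark-history

# /lib/systemd/system/spark-history-server.service (sketch)
[Unit]
Description=Spark History Server

[Service]
User=analytics
Environment=SPARK_CONF_DIR=/etc/spark3/conf
ExecStart=/usr/bin/spark-history-server
Restart=on-failure

[Install]
WantedBy=multi-user.target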
Yes, the coreLimit fixed it.
And yes, we need a specific NetworkPolicy to allow driver/executor traffic -> pushing it this morning. A rough sketch is below.
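A minimal sketch of such a NetworkPolicy (namespace, labels, and ports are illustrative; the ports must match the SparkApplication sparkConf):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spark-driver-ingress
  namespace: spark
spec:
  podSelector:
    matchLabels:
      spark-role: driver
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              spark-role: executor
      ports:
        - protocol: TCP
          port: 40000   # spark.driver.port (placeholder)
        - protocol: TCP
          port: 40001   # spark.blockManager.port (placeholder)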
Thu, Mar 9
Executor pods are now launching correctly:
Sorry, I missed that ticket.
It is indeed not needed anymore.
The cache issue was linked to the bad disk
@elukey could you please review https://docs.google.com/document/d/1Aub7lUr1nPGN3MXz8FI7CCCZ5a5Y1BRpY3poVmui6AM/edit# with our proposal for hadoop access mechanism for spark jobs on K8S
Wed, Mar 8
Adding the node back as all partitions are available and the disk cache is back.
Strangely, since the disk change everything is back to normal.
RECOVERY - MegaRAID on an-worker1132 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
Step performed by the cookbook bootstrap script
What we can see from the graph is that the leak in the old GC is linked to the Metaspace leak.
512 MB is not enough; trying with 1 GB, but I'm not really confident about that one.
Also added heap dump generation on OOM to hopefully understand the leak, but given the correlation between Metaspace and old GC I would say it is around some classloader. (A sketch of the JVM options is below.)
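The JVM options involved look roughly like this (a sketch; the dump path, the 1 GB Metaspace cap, and the use of hive-env.sh are assumptions):

# e.g. in hive-env.sh for HiveServer2
export HADOOP_OPTS="$HADOOP_OPTS -XX:MaxMetaspaceSize=1g \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/tmp/hiveserver2-oom.hprof"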
Tue, Mar 7
Need to recheck the setting as it leads to OOM in Metaspace :(
FYI, the DNS repo:
Duplicate of https://phabricator.wikimedia.org/T331448
The driver is in charge of serving files, jars, and the app jar through an HTTP file server. With potentially 64 executors × 4 jobs fetching the 100 MB refinery-job-0.0.146.jar, that indeed generates some load: 64 × 4 × 100 MB ≈ 25.6 GB, i.e. more than 3 minutes of network transfer at full speed on a 1 Gb link.
Tonight's issue:
Mar 07 00:01:31 an-launcher1002 airflow-scheduler@analytics[5803]: Process DagFileProcessor652438-Process:
Mar 07 00:03:17 an-launcher1002 airflow-scheduler@analytics[5803]: [2023-03-07 00:03:17,896] {scheduler_job.py:354} INFO - 5 tasks up for execution:
Mon, Mar 6
An idea was that a reportupdater job could be the root cause of that high tx.
Here are the logs of all reportupdater services running at midnight (or the last log when there is none at midnight). It doesn't seem to match the timeline:
● reportupdater-pingback.service - Periodic execution of reportupdater-pingback.service
Mar 06 00:00:01 an-launcher1002 systemd[1]: Started Report Updater job for pingback.
Mar 06 00:00:01 an-launcher1002 systemd[1]: reportupdater-pingback.service: Succeeded.
Mar 06 00:00:01 an-launcher1002 kerberos-run-command[22919]: 2023-03-06 00:00:01,321 - INFO - Execution complete.
Mar 06 00:00:01 an-launcher1002 kerberos-run-command[22919]: 2023-03-06 00:00:00,730 - INFO - Starting execution.
Mar 06 00:00:01 an-launcher1002 kerberos-run-command[22919]: kinit: Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_906) while getting initial credentials
Mar 06 00:00:00 an-launcher1002 kerberos-run-command[23016]: User analytics executes as user analytics the command ['/usr/bin/python3', '/srv/reportupdater/reportupdater/update_reports.py', '-l', 'info', '/srv/reportupdater/jobs/reportupdater-queries/pin
Mar 06 00:00:00 an-launcher1002 systemd[1]: Starting Report Updater job for pingback...
an-conf1002 done
an-conf1001 reimaged but zookeeper not starting
This was due to /etc/zookeeper/conf/version-2/ not belonging to zookeeper:zookeeper (expected, as the user id is not kept on reimage).
Trying to identify the process which could generate this tx network usage (see the sketch after this list):
- running a ps command every 15s on an-launcher1002, with output stored in /home/nfraison/proc.log
- running netstat -tup to get process/src/dst IP/port
- running iftop -tPs 5 to get traffic sent/received over the last 5s per src host/dst host/port
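Roughly, the sampling looks like this (a sketch; the exact ps invocation and the 20-line cut are illustrative, the rest matches the commands above):

# sample the process table every 15s into proc.log
while true; do
  { date; ps aux --sort=-%cpu | head -n 20; } >> /home/nfraison/proc.log
  sleep 15
done

# per-connection view: process / source / destination / port
sudo netstat -tup

# traffic sent/received over the last 5s, per host and port
sudo iftop -tPs 5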
It looks to me like we reach 100% network usage on an-launcher1002 when the connection issues happen: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-launcher1[…]analytics&from=1677715234951&to=1677715701346&viewPanel=11
This node only has a 1 Gb NIC; we should identify the local job which causes this usage and see if we can throttle it or make it run on Hadoop.
Fri, Mar 3
Currently OSDs are set up in 2 steps:
- Puppet manages the system config, MTU, disk scheduler, hdparm, Prometheus ping, Ceph config, Ceph auth keys and FW, plus some cluster interface (probably not needed for our case)
- Then a specific cookbook takes care of bootstrapping the Ceph OSDs: https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/wmcs-cookbooks/+/refs/heads/main/cookbooks/wmcs/ceph/osd/bootstrap_and_add.py
@odimitrijevic could you please approve?
Thks
@Cmjohnson this node has strange behaviour on raid/disks
Forcing the cache to WriteBack doesn't work: sudo megacli -LDSetProp -WB -Immediate -Lall -aAll
BBU looks fine
The RAID cache configuration is in WriteThrough instead of WriteBack.
- On an-worker1131
nfraison@an-worker1131:~$ sudo megacli -LDInfo -Lall -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 446.625 GB
Sector Size         : 512
Is VD emulated      : Yes
Mirror Data         : 446.625 GB
State               : Optimal
Strip Size          : 64 KB
Number Of Drives    : 2
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
- On an-worker1132
nfraison@an-worker1132:~$ sudo megacli -LDInfo -Lall -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 446.625 GB
Sector Size         : 512
Is VD emulated      : Yes
Mirror Data         : 446.625 GB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 2
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
On an-worker1132 all disks show the same stats, with no more than 4 MiB/s read/write and 238/158 IOPS.
For reference, disk bench from an-worker1131:
Thu, Mar 2
Downtiming the node to avoid false alerts.