A single Hadoop worker node for analytics, all good.
I replaced the Analytics tag for kafka1001 with @herron since the kafka main cluster is now handled by infrastructure foundations.
All the analytics nodes are Hadoop workers, not a big deal if they lose power.
The kafka hosts are going to be decommed in T226517, so not a concern. The other hosts can go down without horrible consequences :)
We can check per-host metrics with:
In light of what I wrote above:
All right this is a clear PEBCAK (problem exists between computer and keyboard). First of all my tests were not correct:
This is happening on mw1261 (currently depooled):
The services in analytics that use LDAP are:
To keep archives happy:
@Nuria probably yes, I agree!
From my home ipv6 address (removed the first hops):
Sun, Jul 14
Sat, Jul 13
Fri, Jul 12
@aaron I added two new rows to https://grafana.wikimedia.org/dashboard/db/mcrouter with new per-shard metrics. Let me know what you think about it and if anything is missing.
Recap of what has been done so far in various (sub)tasks:
Ok turns out this is only a problem that I have, Fran seems to get no errors.. Now I am really confused :D
No users were logged in, so I also ran apt-get purge hive; puppet agent -tv to get a clean version of hive. The issue persists.
The hdfs-audit log error comes from this bit in /usr/lib/hive/bin/hive:
We can do it anytime with 10/15 mins of heads-up, Chris (I need to stop replication and traffic to db1107 before you can operate). Ping me on IRC! :)
For some reason the disk doesn't show as failed by megacli but:
Thu, Jul 11
Ah no, sorry, there is another issue, namely that you are not in the nda LDAP group. What is currently failing is not superset, but the httpd LDAP auth in front of it (which requires the user to be in either wmf or nda). We'll need to create a task like https://phabricator.wikimedia.org/T188105
@Jan_Dittrich hi! The superset LDAP config requires the uid, which usually is not different from the username, but in your case is wmde-jand. I have now amended your account in superset, can you retry?
Error on stat1004 was due to an experiment that I was doing to nail down why the --verbose option leads to:
All right, so ROCm 2.5 and tensorflow-rocm 1.13.3 seem to work. Other versions of TF (1.13.4 and 1.14.0) lead to the following error:
This is a pyspark2 session opened on stat1007:
Very nice investigation, I was in fact trying to figure out the purpose of the last port and you solved it :) I'll make sure that port will get a range too!
Wed, Jul 10
@EBernhardson should work now, let me know!
We currently are running Hive 1.1.0 (that should be 0.15 IIUC) and I can see this:
I am a little bit lost with LDAP config, since we use:
I filed a code review to create the initial version of the node exporter, with the following metrics:
- usage percent
- power consumption (in watts)
- fan usage percent
- temperature (in celsius)
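As a rough illustration of the four metrics listed above, here is a minimal sketch of how they could be rendered in the Prometheus text exposition format. It is not the code from the review: the metric names, the `card` label and the sample values are all hypothetical, and a real exporter would read the values from the GPU driver rather than from a hand-built object.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One reading of the four metrics; the values used below are made up."""
    usage_percent: float
    power_watts: float
    fan_percent: float
    temperature_celsius: float


# (metric name, help text, GpuSample attribute). The names are hypothetical.
METRICS = [
    ("amd_gpu_usage_percent", "GPU usage in percent", "usage_percent"),
    ("amd_gpu_power_watts", "GPU power consumption in watts", "power_watts"),
    ("amd_gpu_fan_usage_percent", "GPU fan usage in percent", "fan_percent"),
    ("amd_gpu_temperature_celsius", "GPU temperature in celsius", "temperature_celsius"),
]


def render_metrics(sample: GpuSample, card: str = "card0") -> str:
    """Render the sample as gauges in the Prometheus text exposition format."""
    lines = []
    for name, help_text, attr in METRICS:
        value = getattr(sample, attr)
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f'{name}{{card="{card}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics(GpuSample(12.5, 35.0, 20.0, 41.0)), end="")
```

All four are exposed as gauges since each one is an instantaneous reading that can go up or down.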
From puppet I can see that the change for ldap-ro was reverted:
Tue, Jul 9
Not really, I wish my past self had added more info. I asked @ayounsi and he didn't come up with a reason not to, so in theory we could try to modify the term on the firewall and see how it goes.
Makes sense, I am now wondering if we should create a generic and configurable alarm or not :)
Added documentation in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU
Now the annoying part:
Mon, Jul 8
Not actionable yet.
root@install1002:~# reprepro --noskipold --component thirdparty/amd-rocm checkupdate buster-wikimedia
Calculating packages to get...
elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_A test_not_existing.eqiad.wmnet --vcpus 2 --memory 4 --disk 150 --link analytics
START - Cookbook sre.ganeti.makevm
Exception raised while executing cookbook sre.ganeti.makevm:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/dns.py", line 137, in resolve
    response = self._resolver.query(qname, record_type)
  File "/usr/lib/python3/dist-packages/dns/resolver.py", line 1051, in query
    raise NXDOMAIN(qnames=qnames_to_try, responses=nxdomain_responses)
dns.resolver.NXDOMAIN: None of DNS query names exist: test_not_existing.eqiad.wmnet., test_not_existing.eqiad.wmnet.eqiad.wmnet.
Amended this task and created T227425 :)