Not all new slab metrics are rendered, opened an issue upstream: https://github.com/prometheus/memcached_exporter/issues/75
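A quick way to see which slab metric families the exporter actually renders is to scrape /metrics and collect the metric names; a minimal sketch (the sample metric names are illustrative, not a claim about the exporter's exact output):

```python
def metric_families(exposition_text):
    """Return the set of metric names in Prometheus text exposition format."""
    names = set()
    for line in exposition_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        # The metric name ends at '{' (labels) or at the first space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names

# Illustrative exposition snippet, as if fetched from host:9150/metrics.
sample = """\
# HELP memcached_slab_current_items Number of items in this slab.
memcached_slab_current_items{slab="63"} 10
memcached_slab_current_chunks{slab="63"} 12
"""
print(sorted(metric_families(sample)))
```

Diffing this set against the metric families listed in the exporter's changelog would show exactly which new slab metrics are missing.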
Built 0.6.0 and deployed it on deployment-prep-memc08. Also created the following:
Mon, Nov 11
Fri, Nov 8
The "stats items" output is something that changed a bit. For example, take slab 63:
Thu, Nov 7
Thanks a lot for the tests!
After a chat with Moritz and my team, this is what we are planning to do:
@Cmjohnson we might need to add a new GPU next quarter (need to triple check with the Research team). In your opinion, can any of the above hosts host a GPU, or should we open each one of them and measure to find out?
Wed, Nov 6
Terminated Jobs:
 JobId  Level    Files     Bytes  Status   Finished         Name
===================================================================
[..]
159737  Incr         2   3.596 G  OK       06-Nov-19 04:05  an-master1002.eqiad.wmnet-Monthly-1st-Mon-production-hadoop-namenode-backup
159834  Full       781   19.16 G  OK       06-Nov-19 11:28  an-master1002.eqiad.wmnet-Monthly-1st-Mon-production-analytics-meta-mysql-lvm-backup  <<===
Sanitization is still running on the two databases!
On an-coord1001 I can see:
Tue, Nov 5
Added metrics to http://beta-prometheus.wmflabs.org about memcached:
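To spot-check the new metrics, one option is the Prometheus HTTP API on the beta instance; a minimal sketch that only builds the instant-query URL (the base path and the chosen metric name are assumptions, not verified against the beta setup):

```python
from urllib.parse import urlencode

def prometheus_query_url(base, expr):
    # Prometheus instant-query endpoint: <base>/api/v1/query?query=<expr>
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"

url = prometheus_query_url(
    "http://beta-prometheus.wmflabs.org/beta",  # assumed base path
    "memcached_current_connections",            # example metric name
)
print(url)
```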
Mon, Nov 4
@Ottomata I am wondering if we could simply configure Bacula to copy the meta backup's files as we do for Archiva. This would allow us to remove the cron that uploads to HDFS. Need to check with Jaime first if this is possible, but if so, what do you think about it?
Nice thanks! Just pushed the new rules to the routers, so in theory an-master1002 and analytics1029 should go away now! Let me know :)
To keep archives happy:
I am all for simplifying and standardizing confs, so no opposition to incremental. Only one question - what would it change when trying to restore the database? This is basically my only concern at the moment. If it is as simple as doing a recovery via Bacula, I am all for it!
Thanks a lot Jaime!
High level plan that I have in mind:
@mforns When you are online can you ping me? I'd like to drop the above tables but with somebody triple checking what I am doing :)
Added documentation to https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#New_Worker_Installation_(12_disk,_2_flex_bay_drives_-_analytics1028-analytics1077) Cc: @Ottomata
Thu, Oct 31
Summary of things done if a rollback is needed:
The procedure should be as simple as:
As part of this task I'd also clean up profile::mariadb::misc::eventlogging::sanitization from the db110 hosts :)
On puppetmaster2001 I cannot see /etc/apt/sources.list.d/buster-cergen.list, hence the new package version seems not to be available. Is that expected?
Wed, Oct 30
In case anybody needs to build the package: CERGEN=yes DIST=buster-wikimedia pdebuild :)
Hello! Do you have a coordinator that I can check in Hue? https://hue.wikimedia.org/oozie/list_oozie_coordinator/0015017-190822093211873-oozie-oozi-C/ seems the closest one, and it works as far as I can see.
I set up a test webrequest.conf on cp2001, and confirmed that the solution works!
Sure! The JSON format of what we collect from Varnish for webrequest is in profile::cache::kafka::webrequest:
Sure, though I might not resolve the conflict in the best way and end up breaking Beta :)
Tue, Oct 29
@awight this is probably something that we didn't test; as far as I know we don't use the graphite writer. The stacktrace makes sense: a text string should not be passed to sock.send(). Will check the code asap!
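For reference, this looks like the usual Python 3 str-vs-bytes issue: socket send methods only accept bytes, so the metric line has to be encoded before sending. A minimal sketch of the likely fix (the metric name and writer shape here are hypothetical, not the actual graphite writer code):

```python
import socket

def send_metric(sock, name, value, timestamp):
    # Graphite plaintext protocol: "<name> <value> <timestamp>\n".
    line = f"{name} {value} {timestamp}\n"
    # sock.send()/sendall() raise TypeError on str under Python 3; encode first.
    sock.sendall(line.encode("utf-8"))

# Demo with a local socketpair instead of a real Graphite endpoint.
a, b = socket.socketpair()
send_metric(a, "memcached.slab63.evictions", 42, 1572998400)
print(b.recv(1024).decode("utf-8").strip())
# → memcached.slab63.evictions 42 1572998400
```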
About the database - we could use a MariaDB instance on the VM, but my experience is that if DB usage goes a bit above the norm, the overall performance of the VM suffers a lot (as used to happen with Matomo). I'd start by using an-coord1001, and then we can move the db out if usage turns out to be too high (I don't expect that, but it might happen). We'll also create a user that is able to modify only one database, so there shouldn't be any risk of inadvertently hitting the rest. @Ottomata what do you think?
All right then, let's close this and re-open if necessary!