The JS master @fdans and I checked Pivot's code this morning, and after a lot of tests we identified what causes the error:
Pivot deployed on d-1, usable via:
Fri, Apr 20
Upgraded d[1-3] in labs to druid 0.10, adding manual hiera config as replacement for https://gerrit.wikimedia.org/r/#/c/355471.
Thu, Apr 19
ping - status :)
After a chat with the team we decided to proceed with Druid 0.10 for the moment, since we have basically everything that we need ready to go.
Wed, Apr 18
Restarting work on this now that the Hadoop cluster has been migrated to Java 8. The latest stable release is currently 0.12, while we are running 0.9.2. The previous attempt targeted 0.10.
Ok! I am going to chat with Andrew about https://gerrit.wikimedia.org/r/427159, since adding r-base-core to the packages pinned to jessie-backports seems to work (tested this morning with `sudo apt-get install r-base-core -t jessie-backports` on an1028 before cleaning up).
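For reference, a pin of this kind in apt's preferences would look roughly like the following (the filename and priority here are hypothetical; the actual change is the Gerrit patch above):

```
# /etc/apt/preferences.d/r-base-core (hypothetical path)
Package: r-base-core
Pin: release a=jessie-backports
Pin-Priority: 1001
```

With a priority above 1000 apt prefers the backport even without passing `-t jessie-backports` on the command line.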
This morning I removed the old apt config for jessie-backports (since after https://gerrit.wikimedia.org/r/427170 it seemed no longer needed, and puppet was broken on the Jessie hosts), but now this is the situation for the Hadoop workers:
Tue, Apr 17
Mon, Apr 16
Tested in labs a migration with two stretch hosts running zk 3.4.9 and one jessie host running zk 3.4.9 (Moritz's backport) and the host swap happened without any issue (no host in LOOKING state).
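To verify each ensemble member's role during a swap like this, one option is ZooKeeper's four-letter `srvr` command; a small sketch (the hostname in the comment is a placeholder, not one of our hosts):

```shell
# Extract the "Mode:" line (leader/follower/standalone) from ZooKeeper
# `srvr` output read on stdin.
zk_mode() {
  awk -F': ' '/^Mode:/ {print $2}'
}

# Typical use against a live ensemble member (placeholder hostname):
#   echo srvr | nc -q 1 zk-host.example 2181 | zk_mode
```

A host stuck in LOOKING state would simply not report a `Mode:` line here, since `srvr` only answers once the server has joined the quorum.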
Interesting discovery today while testing zookeeper on stretch. I tried to clean up /etc/zookeeper/conf and ran puppet to check whether everything would be restored, and to my great surprise the zookeeper systemd unit wasn't able to start. After a bit of digging, the culprit seems to be the following:
Fri, Apr 13
@chelsyx the infrastructure that runs the bohrium host (and hence piwik) is now much more stable, and we hope to have solved the issues that were causing the host to frequently freeze and not archive data. If I have understood correctly, the last remaining step is on your side, namely working on a wider dispatch interval; is that right? Are there pending actions for Analytics?
elukey@db1108:~$ sudo -u eventlogcleaner crontab -l
0 11 * * * /usr/bin/flock --verbose -n /var/lock/eventlogging_cleaner /usr/local/bin/eventlogging_cleaner --whitelist /etc/analytics/sanitization/eventlogging_purging_whitelist.yaml --yaml --older-than 90 --start-ts-file /var/run/eventlogging_cleaner --batch-size 10000 --sleep-between-batches 2 >> /var/log/eventlogging_cleaner/eventlogging_cleaner.log
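The `flock -n` in that cron entry makes overlapping runs a no-op: a second invocation fails immediately instead of queueing behind a cleaner run that is still in progress. A minimal sketch of that behaviour (the lock path here is hypothetical, not the production one):

```shell
# Sketch of non-blocking flock: while one process holds the lock,
# a second attempt with -n exits immediately instead of waiting.
LOCK=/tmp/eventlogging_cleaner_demo.lock

( flock -n 9 || exit 1
  sleep 2                     # stand-in for a long cleaner run
) 9>"$LOCK" &

sleep 0.5                     # give the background run time to take the lock
if flock -n "$LOCK" -c true; then
  echo "acquired"
else
  echo "lock busy"            # what a second cron firing would see
fi
wait
```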
elukey@analytics1003:~$ ls -l /etc/analytics/sanitization/eventlogging_purging_whitelist.yaml
-r--r--r-- 1 root root 26165 Apr 13 07:44 /etc/analytics/sanitization/eventlogging_purging_whitelist.yaml
Thu, Apr 12
From the consumers' point of view (two Kafka clusters and one Hadoop cluster) I have observed only some non-critical logs on one of the Hadoop masters when zookeeper broke its session:
Today I re-tested the swap of one node in labs (analytics project) to verify the logs again and look for things that might break. Some details about the procedure:
Added documentation to https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster#recover_files_deleted_by_mistake_using_the_hdfs_CLI_rm_command; the last step is to send a mail to analytics@ (and possibly research, engineering?) to announce the new feature.
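The recovery procedure documented there relies on HDFS trash being enabled, which in standard Hadoop configuration (the value below is illustrative, not necessarily ours) is controlled in core-site.xml:

```xml
<!-- core-site.xml: keep deleted files in .Trash (value is minutes; 1440 = 24h) -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```

With trash enabled, `hdfs dfs -rm` moves files under `/user/<name>/.Trash/Current/` rather than deleting them, and they can be moved back with `hdfs dfs -mv` until the interval expires.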
Tue, Apr 10
@mforns sorry, I can't find the whitelist, can you add it here? Moreover, do you need puppet to also push it to HDFS?
A bit of historical context about why db1108 is not read-only:
So Jonas (user: jk) is already in analytics-privatedata-users, and as far as I can see access is already granted for notebook1003, stat1004 and stat1006. The only one that is not included is stat1005, but it is probably redundant for Jonas' use case (let me know otherwise).
Mon, Apr 9
Fri, Apr 6
Nothing varnish-related happened on Feb 6th as far as I can see from the ops SAL: https://tools.wmflabs.org/sal/production?p=0&q=&d=2018-02-06
Thu, Apr 5
Since this task has been open for a long time, I'll open a new one when we're ready to create the analytics-in6 filter.
Wed, Apr 4
Ok, mystery solved after checking with Andrew. The version of https://github.com/julienschmidt/httprouter in Debian is stuck at the 1.1 tag (from 2015), and a ton of things have changed since then. https://github.com/julienschmidt/httprouter/issues/207 has been open since last year asking for a 1.2 release (which would probably kick off a new Debian package release), but it has gained no traction, so I think we'd probably need to go back to bundling all the Burrow dependencies in our package rather than relying on Debian upstream :(
Tried to build Burrow 1.0 using only the Debian dependencies (rather than the godeps added to the package), but this is what I get:
After removing /srv/deployment/prometheus, I don't see any trace of the jmx exporter jar contained in that dir in `lsof -Xd DEL` on rdb2001/1007.
Sadly, after a long battle there doesn't seem to be a good way to add prometheus monitoring for the hive/oozie JVMs; we'll revisit later on when upgrading to a newer CDH version.
Tue, Apr 3
Fri, Mar 30
Tested the two values that I've set in the above patch in labs: