Fri, Jul 20
After a power cycle the error went away; probably a transient issue in the RAID controller?
Wed, Jul 18
Declining this for the moment, it doesn't seem a good way to go.
Any comment about my last entry? Sorry to ping you guys, maybe a quick meeting between the three of us would be better?
Tue, Jul 17
Hi @Miriam! You guys were not in any LDAP group; I added you to the wmf one since you are staff, and Tiziano/Michele to nda. Both LDAP groups should grant access to the SWAP UI, let me know if it works.
Mon, Jul 16
Hardware order placed in T199674
I had a chat with @MoritzMuehlenhoff about this use case, here's some more notes:
- there will be no data shared with the Hadoop production cluster or any other host in production.
- we (as Analytics) will periodically load public data (no PII) to this new cluster in labs, which will effectively be a new small-scale Hadoop cluster in labs.
Fri, Jul 13
An IPv6-enabled RPC service must join a well-known multicast group, FF02::202. An IPv6 host is expected to remain in this group for its entire life and should rejoin if it ever leaves the group for any reason. ONC RPC uses the rpcbind or portmapper service to join this group early during the boot phase.
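As a concrete illustration of the above (a minimal sketch, not anything we run in production), this is roughly what joining FF02::202 looks like at the socket level on Linux; interface index 0 means "let the kernel choose":

```python
import socket
import struct

# ff02::202 is the well-known link-local multicast group that
# rpcbind/portmapper joins early at boot (per the note above).
RPC_MCAST_GROUP = "ff02::202"

# Pack the 16-byte group address plus an interface index (0 = kernel
# default) into the ipv6_mreq structure used by IPV6_JOIN_GROUP.
group_bin = socket.inet_pton(socket.AF_INET6, RPC_MCAST_GROUP)
mreq = group_bin + struct.pack("@I", 0)

# The actual join would then be:
#   sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
#   sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_JOIN_GROUP, mreq)

print(len(mreq))             # 20 on Linux: 16-byte address + 4-byte ifindex
print(group_bin[-2:].hex())  # 0202, the group id of ff02::202
```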
So now on stat* and notebook* hosts we have an /etc/gitconfig rule that forces all git users to go through the http[s] proxy. The conf1006 flow is related to the zookeeper term, which has been added to analytics-in6.
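For reference, the system-wide rule looks something like this (the proxy hostname and port below are illustrative placeholders, not the actual production values; git applies http.proxy to https URLs as well, since it goes through curl):

```
[http]
	proxy = http://webproxy.example.wmnet:8080
```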
@ayounsi only the PTRs, right? Or should we add the AAAA records too? I am a bit afraid of seeing Hadoop start using IPv6 after adding the AAAA records and triggering weird JVM bugs :)
Thu, Jul 12
Wed, Jul 11
@ayounsi I tried to fix the git::clone calls to text-lb with the above changes, let me know if I fixed it or not :)
I think that there are more git::clone calls to gerrit from the Analytics VLAN; it is only a matter of finding them and possibly adding the http.proxy setting directly in puppet (maybe as an exec?).
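For one-off cases, the same setting can also be passed per invocation instead of system-wide; a sketch (the proxy URL is a placeholder, the point is just the http.proxy mechanism):

```shell
# One-off clone through a proxy, without touching /etc/gitconfig:
#   git -c http.proxy=http://webproxy.example.wmnet:8080 \
#       clone https://gerrit.wikimedia.org/r/operations/puppet
#
# Or set it per repository and read it back:
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config http.proxy http://webproxy.example.wmnet:8080
git -C "$repo" config --get http.proxy
```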
Tue, Jul 10
Last change applied by Arzhel, including merging common-infrastructure4 into analytics-in4
As reference, https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/314336/ is the change that Brandon did a while ago to adjust webrequest's vk batch settings.
The snappy compression change had a positive effect on the bytes transmitted to Kafka, which dropped considerably:
Mon, Jul 9
Another point to figure out is what security level we are aiming for, just to do our homework before ordering hardware and choosing Presto + Hadoop as the technology for this project.
Any comment about this? (No rush, I am just checking my open tasks :)
Fixed archiva and removed puppet in analytics-in4. The last step is to drop Ganglia and git-deploy terms from common-infrastructure4.
Sun, Jul 8
It was probably a regular mw appserver for beta; I don't have any specific memory of it.
Fri, Jul 6
@ayounsi are we sure that we can touch common-infrastructure4 without affecting anything else? Is there any trace of who made it? If it is used only by Analytics, it would make sense to just include those terms in analytics-in4 for visibility.
So the first attempt is to introduce Snappy compression for vk eventlogging, which in theory should considerably reduce the size of messages sent from Singapore to eqiad. Historically it was enabled only for webrequest traffic, but now eventlogging might be a good candidate too?
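If I read the varnishkafka config handling right, librdkafka properties are passed through with a "kafka." prefix, so enabling it should look roughly like this (a sketch, not the exact production stanza):

```
# librdkafka's compression.codec property, prefixed with "kafka."
# in the varnishkafka instance config; valid codecs include
# none, gzip and snappy.
kafka.compression.codec = snappy
```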
Would it be enough to just redirect stdout to /dev/null, since we log to a file? Seems easy enough to drop the JAVA_TOOL_OPTIONS things..
Today I reviewed the varnishkafka Grafana dashboard and saw an ongoing pattern of sporadic delivery error reports for the varnishkafka eventlogging instance on cp5* hosts (Singapore caches):
@Marostegui, @jcrespo: we'd need to deploy the Analytics Refinery repository to db110[7,8] to move the Python eventlogging data purging script + table whitelist away from puppet. Is it something that we can freely do, or are there any restrictions for database hosts? (Never done it, so I'd like to have your opinion first :)
We could also think about moving the Python script (currently in puppet) to the refinery, and referencing it from puppet?
Thu, Jul 5
Followed the awesome https://debmonitor.wikimedia.org/packages/prometheus-jmx-exporter, and upgraded all the remaining kafka/analytics hosts. I haven't restarted any jvm, will do it during the next round of restarts.
Wed, Jul 4
Disk space consumption looks stable:
Reverting the netboot changes via https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/443806/
After a chat with Discovery we ended up refreshing the list of hosts in the Analytics VLAN firewall (which governs traffic from analytics hosts towards production, e.g. stat1005 to wdqs):
New term wdqs:
New term es:
First batch of changes: