Tue, Nov 13
this has been fixed already
Fri, Nov 9
Thu, Nov 8
The build succeeded. The new docs are published at https://doc.wikimedia.org/search-highlighter/experimental/, the publication date has been updated, and the site looks fine (minus the usual broken links).
Wed, Nov 7
@Smalyshev we're good on this from my point of view. Could you check that running the updater manually (with the -S option to output to console, or with -v for verbose) works as you expect?
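Something along these lines should do it, as a sketch only — the deployment path and wrapper script name are assumptions on my side, only the -S / -v options are the ones mentioned above:

```
# Sketch of a manual run on a wdqs host (path and runUpdate.sh are assumptions):
cd /srv/deployment/wdqs/wdqs
./runUpdate.sh -S    # -S: print updates to the console instead of applying them
./runUpdate.sh -v    # -v: verbose output
```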
Tue, Nov 6
Mon, Nov 5
Approved in weekly SRE meeting for full root access to wdqs servers
Why not drop the wrapper completely and just allow running an arbitrary command in the project directory? That way, we'll be able to configure ./mvnw clean install && ./mvnw site site:stage (or whatever we need), and we won't have to care about transient state that needs to be propagated between containers. I'm not sure what value the wrapper brings, so I might be missing something.
Fri, Nov 2
Thanks for the feedback!
Thu, Nov 1
For context, T202765 is about a bot sending annoying and somewhat expensive requests. That specific issue is now resolved.
maps is a different project from maps-team. But we can still delete the maps-team project: we now have maps instances on deployment-prep, which makes more sense.
Honestly, I'm not sure what the underlying issue is. What I can say is that there is no foreseeable need to have any NFS mount in current or future instances of the maps project.
At this point, there are no live instances in the maps-team project, and no immediate need to use NFS. So feel free to remove it, we'll ask for it again if the need arises.
That change is probably reasonably easy to implement in the JJB templates, but I'm lost there. I hope @hashar can provide some support.
Tue, Oct 30
New configuration deployed, but it is raising some deprecation warnings and needs some tuning.
Fri, Oct 26
Thu, Oct 25
There are 3 issues here, and maybe they should be addressed on different tickets:
So this shows that we have less than 0.04% packet loss on the elasticsearch eqiad cluster? I would expect a loss rate that low not to be an issue (which matches the fact that we don't see a functional issue on that cluster). The goal is not to raise alerts on any packet loss, it is to raise alerts when we reach a level that is actually problematic.
Wed, Oct 24
Actually, we are already seeing some pool counter errors with 29 nodes on eqiad.
My current patch is trying to put all that logic into logback.xml, but it is definitely starting to be unreadable. And coding ifs in XML just seems wrong :/
Tue, Oct 23
With 6 servers depooled / banned, the cluster seems to be just fine. Starting at 7 nodes depooled, I see the load rising on some of the other servers. The response times don't show any significant change.
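For reference, banning a node here means excluding it from shard allocation via the standard cluster settings API, roughly like this (host name is only an example):

```
# Exclude a node from shard allocation (transient setting, reset on full cluster restart):
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": { "cluster.routing.allocation.exclude._name": "elastic1017" }
}'
```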
Mon, Oct 22
Some minimal packet drop is still seen (< 100 packets / 24h), so the situation is much better. More work needs to be done on limiting CPU usage on the blazegraph side.
Fri, Oct 19
A few wishes I have from an operations point of view for any replacement. Those are not necessarily mandatory, but we should evaluate them at some point:
After some discussion (P7699), the idea is:
Thu, Oct 18
Wed, Oct 17
@Smalyshev if you could take a heap dump of blazegraph under load, we might be able to trace more precisely where this unnamed thread pool is coming from. Feel free to send me the dump for analysis.
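In case it helps, the standard jmap invocation should be enough (file path is just an example, <blazegraph-pid> is the Blazegraph process id):

```
# Heap dump of the live objects in the Blazegraph JVM:
jmap -dump:live,format=b,file=/tmp/blazegraph-heap.hprof <blazegraph-pid>
```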
Tue, Oct 16
The current puppet code for tlsproxy::localssl does not allow multiple default_server entries, even when they are on different ports.
Oct 15 2018
I think the proposal makes sense. This check is here so that we don't forget to reshard when needed, but there isn't a hard limit on the maximum shard size (well, there is the overall disk space, but we're going to be in trouble well before that). The main goal is to get a low priority alert when things are climbing too high, and "too high" isn't well defined. So we have some latitude as to what limit we want to set.
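To get an idea of where we currently stand, shard sizes can be eyeballed with the standard _cat API (host/port are illustrative):

```
# Largest shards first, with their on-disk size:
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,store&s=store:desc' | head -20
```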
Oct 12 2018
The ones I have seen are relatively short bursts of errors in the error counters of interfaces on WDQS (node_network_receive_drop in prometheus). In the case of WDQS, it seems related to CPU starvation and seems to be actionable (T206105). It looks to me like something we need to address more generally, but that's sufficiently out of my comfort zone that feedback is definitely welcome!
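For the record, the same counters that prometheus exposes as node_network_receive_drop can be checked directly on a host (interface name is an example):

```
# RX/TX stats, including the "dropped" column:
ip -s link show eth0
# Raw kernel counter for dropped received packets:
cat /sys/class/net/eth0/statistics/rx_dropped
```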
Oct 11 2018
General question on how to deploy this kind of change:
The main contention point for WDQS (or for investigating alternatives) seems to be IOPS. We tried setting up a wdqs test instance on WMCS, but IO contention meant that we were not able to keep up with the update flow. Our production instances consume ~3-4K IOPS just for updates. If there is a way to get this kind of throughput on our VMs, that would be great!
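The ~3-4K IOPS figure is easy enough to double-check on a production host with iostat (sampling interval is arbitrary; r/s + w/s approximate IOPS):

```
# Extended per-device stats every 5 seconds; look at r/s and w/s:
iostat -dx 5
```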
Oct 10 2018
Coming back to this discussion, I'll try to make my point more clear:
With some trial and error, it looks like smp_affinity = 00ff00ff should allow the IRQ to be handled by any CPU, but in practice it is still handled by the first one (in this case, any == CPU0). Pinning each IRQ to a specific CPU (and one only) does spread them. It looks like puppet interface::rps is setting this up, but it is also setting up a few other things. Time to read some more.
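As a sketch of what "one IRQ per CPU" looks like (IRQ numbers would come from /proc/interrupts, masks are hex CPU bitmasks, all values here are examples):

```
# Pin each RX queue IRQ to its own CPU (01 = CPU0, 02 = CPU1, 04 = CPU2, ...):
echo 01 > /proc/irq/54/smp_affinity
echo 02 > /proc/irq/55/smp_affinity
echo 04 > /proc/irq/56/smp_affinity
```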
Oct 9 2018
Looking at dropped packets, it looks like we did not have any over the last few days. So our lag has another cause. Also note that while the issue still seems more present on wdqs2003, we also see issues on other nodes.