Wed, May 22
Thanks a lot for the deploy! I checked the metrics and nothing seems to have changed :(
Tue, May 21
elukey@ganeti1003:~$ sudo gnt-instance remove analytics-tool1003.eqiad.wmnet
This will remove the volumes of the instance analytics-tool1003.eqiad.wmnet (including mirrors), thus removing all the data of the instance.
Continue?
y/[n]/?: y
Debian GNU/Linux 10 (buster)
an-tool1005 is a Superset web GUI for analytics dashboards (staging environment) (analytics_cluster::superset::staging)
After restoring traffic served by php-fpm, it is clear that the increase in GET is correlated with it. Very interesting that GETS remained stable.
Grabbed a sample of localhost traffic to port 11213 on mw1238 from 11:45:39 to 11:48:05 (146s).
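For reference, a capture along these lines would produce such a sample (a sketch, not the exact command used; interface, duration and output path are placeholders):

# Hypothetical sketch: capture ~150s of localhost traffic to port 11213 on mw1238
sudo timeout 150 tcpdump -i lo -s 0 'tcp port 11213' -w /tmp/mw1238-11213.pcap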
20s of traffic from mw1239:
Other 20s of traffic from mw1238:
Would it be possible to deploy https://gerrit.wikimedia.org/r/511612 before the weekly train?
Mon, May 20
@Krinkle the graph of ops/second clearly shows a bump from ~3k ops to ~12k ops, and from tcpdump the number of gets to the chronology protector is close to 9k per second. I'll follow up as you suggest, but it seems to me to be a call-volume problem rather than big keys hitting Redis.
@abi_ sorry, I should have been more precise - I meant to ask if we have an idea about the size of those keys etc. (you know, ops people are always complaining about network usage :P). If we don't have a clear idea yet, never mind, I was only curious :)
What results are expected after this change gets deployed? (just to summarize what to look for / check)
Grabbed a .pcap from mw1238 (09:26:41 -> 09:27:00 UTC, so ~20s of traffic) and tried to create a meaningful summary of the get traffic:
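The actual summary isn't pasted here, but a pipeline roughly like this can be used to rank the get keys in a pcap (a sketch; the pcap file name is a placeholder):

# Hypothetical sketch: extract "get" commands from the capture and count occurrences per key
tcpdump -r /tmp/mw1238-get.pcap -A 2>/dev/null \
  | grep -oE 'get [^[:space:]]+' \
  | sort | uniq -c | sort -rn | head -20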
Very good news from the task opened with the Language team: https://gerrit.wikimedia.org/r/508112 is merged and will be part of this week's mediawiki train. It may very well be the last fix needed, fingers crossed :)
Not a very lucky morning:
elukey@mc1019:~$ dpkg -l | grep stdc
ii  libstdc++6:amd64  4.9.2-10+deb8u1  amd64  GNU Standard C++ Library v3
@Addshore After writing the above entry I remembered that a while ago we discussed a difference in memcached behavior between hhvm and php7 for Wikidata traffic. Could it be related?
Sun, May 19
I zoomed to check when the increase happened, and I noticed something interesting:
Sat, May 18
Today I have seen some alarms firing for mediawiki exceptions due to this error :)
Killed all the processes on stat1007 and also commented out his crontab (so nothing should restart). Let me know if it is ok :)
Fri, May 17
Thanks a lot for the feedback, and don't worry about the delay, better to be sure and test the code! One week more doesn't make a lot of difference :)
+1, this solution is fine too; I am ok with anything that is easy and doesn't require more than a couple of hours to complete :)
Hello everybody, I am investigating https://phabricator.wikimedia.org/T223310, namely the Redis MainStash being constantly hammered by GETs for global:Wikimedia\Rdbms\ChronologyProtector since the deployment of 1.34.0-wmf.3. Is it related to https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/504819/ by any chance?
New hosts should be racked this month: https://phabricator.wikimedia.org/T220687
Thu, May 16
It would be better in my opinion to keep using the mysql database in deployment-prep as an interim solution, and see if we can find an alternative in the meantime. Event Gate will eventually replace eventlogging, so it is probably better to spend time on the former rather than the latter. If we are afraid of keeping the mysql code/dependencies around, we could try to make them optional, removing them on eventlog1002 (in production) and keeping them in deployment-prep. What do you think @Ottomata ?
I grabbed a pcap (~200MB) from 11:03:25 to 11:04:41 of traffic directed to mc1033's Redis port (only one direction, to avoid having to deal with huge pcaps), grepped for GET and tried to aggregate some results:
@Krinkle @aaron I raised the priority to high since the tx network bandwidth usage grew a lot and we might end up saturating it with spikes in memcached traffic (causing timeouts etc.. like T203786). Can we check what caused the increase of traffic asap?
I have modified the ops/s panels of the Redis dashboard (https://grafana.wikimedia.org/d/000000174/redis) to show per-host metrics:
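As a rough idea of the kind of change, the per-host panels can be driven by a query like the following (a sketch assuming a redis_exporter-style metric name, not necessarily the exact expression in the dashboard):

# Hypothetical sketch: per-host Redis ops/s instead of a cluster-wide aggregate
sum by (instance) (rate(redis_commands_processed_total[5m]))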
Connections to the memcached shards dropped nicely after the nutcracker restart.
Wed, May 15
Started up just now, forgot to add the task in the SAL :)
@MoritzMuehlenhoff I am ready to make this host a proper staging environment for superset, let me know if we can proceed or not :)
@aaron hi! I am trying to figure out if Redis traffic on mc1033 is causing this increase in bandwidth usage. From tcpdump I can see a lot of traffic for GET global:Wikimedia\Rdbms\ChronologyProtector::v2. Is it something added by 1.34.0-wmf.3? If not I'll look deeper :)
Tue, May 14
From memcached's bytes_written and bytes_read metrics I don't see anything changing dramatically:
From a quick look with memkeys on mc1033, these are the top talker keys:
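For context, memkeys was run roughly like this (a sketch; the interface and port are assumptions, not taken from this log):

# Hypothetical sketch: sniff memcached traffic on mc1033 and rank keys by usage
sudo memkeys -i eth0 -p 11211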
Mon, May 13
@abi_ you guys rock, thanks a lot for this effort!
@abi_ Hi! Any news? :)
Update: we were running 1.4.1-1~stretch1; I have rolled back eventlogging to it and all the instabilities went away. Judging from the changelog, 1.4.3 seems to be a broken version: a lot of changes to the consumer code and bugs introduced.
Sun, May 12
@Ottomata sorry but 1.4.3 was not the right version to roll back to :(
Again all processors stuck, py-bt for reference:
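The backtrace itself is not reproduced here; for reference, such a py-bt can be taken roughly like this (a sketch, assuming the python gdb helpers / -dbg symbols are installed; the PID is a placeholder):

# Hypothetical sketch: attach gdb to a stuck eventlogging processor and get a Python-level backtrace
sudo gdb -p <processor_pid>
(gdb) py-bt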
Sat, May 11
A big lag accumulated during the EU nighttime; after the EL restart it seems to be working fine. The errors were all like:
Fri, May 10
Andrew deployed 1.4.3 and we are back to stable.
Rebuilt python-kafka_1.4.3-1_all.deb and uploaded it to eventlog1002 in case we decide to roll back.
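If we do roll back, a downgrade along these lines should work (a sketch; the package path is a placeholder):

# Hypothetical sketch: install the rebuilt package directly on eventlog1002
sudo dpkg -i /home/elukey/python-kafka_1.4.3-1_all.deb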
https://github.com/dpkp/kafka-python/issues/1418 seems related, some workarounds are listed.
The upgrade of python-kafka to 1.4.6 on eventlog1002 coincides very well with T222941 :(
SAL for the 04-30:
I tried with the following changes:
Thu, May 9
Hive server 2 was OOMing, still not sure why. After a restart it seems to be working fine again. Will need to dig a bit deeper into the logs!
No segfaults today, logrotate ran fine! I don't see any logs in logstash though, worth investigating? (Maybe in a separate task if confirmed by others.)
@Cmjohnson I'd need a heads-up ~15 mins before the maintenance to shut down the host properly, but we can do it anytime!
Wed, May 8
Side note: I can't find any messages in Kibana related to netmon1002 or netbox, not sure if that is normal or not.
The fix looks very promising: I have restarted uwsgi-netbox 3 times in a row with no trace of the segfault. Let's wait for tomorrow's round of logrotate restarts to confirm that we are good.
Tue, May 7
On boron I have built uwsgi-core_2.0.14+20161117-3+deb9u2~wmf1_amd64.deb with the following patch:
@Jdcc-berkman since you have access to the host, can you please clean up those files? That way I won't accidentally delete anything valuable etc. Thanks :)
@MoritzMuehlenhoff we have an interesting segfault that happens in uwsgi when systemctl restarts the netbox unit, but only in production and not in labs. Cas and I tried several times to reproduce the bug without any luck. After opening https://github.com/unbit/uwsgi/issues/2010, there is a promising patch that may fix the issue, so in order to keep mental sanity I propose the following:
@ssingh very happy to help! We should be ok to close the task right?
Mon, May 6
@Marostegui sorry, I was under the impression that we needed to wait for feedback from Chris/Rob about how to proceed. Is there anything pending on the Analytics side?
Please also restart the job as user analytics, not hdfs (the related patch is https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/508283/; if it is not deployed when you restart, please use the -D user=analytics override and start the job via sudo -u analytics oozie etc.)
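Something along these lines should do it (a sketch; the properties file path is a placeholder, not from this log):

# Hypothetical sketch: submit the job as the analytics user with the user property overridden
sudo -u analytics oozie job -D user=analytics -config /path/to/coordinator.properties -run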
A systemctl restart triggered a segfault, and a core was available under /var/tmp/core. This is what gdb says:
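For reference, a backtrace from such a core can be obtained roughly like this (a sketch; the uwsgi binary path is an assumption):

# Hypothetical sketch: open the core dump with the matching binary and print the backtrace
sudo gdb /usr/bin/uwsgi-core /var/tmp/core
(gdb) bt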
Added a couple of code reviews as an attempt to add LimitCORE to the netbox systemd unit. If this is not the idea that you guys had, please feel free to discard them :)
Sun, May 5
I am wondering if we could do the following:
Fri, May 3
Swapped all the occurrences of instance=~"$kafka_broker" with instance=~"($kafka_broker).*", and the dashboard seems to load faster now. Also removed the .* custom value from the $kafka_broker All values field.
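Concretely, the panel queries went from something like the first form to the second (a sketch; some_kafka_metric is a generic placeholder, not the actual panel expression):

# Before: only matches if the variable expands to the full instance label value
some_kafka_metric{instance=~"$kafka_broker"}
# After: matches instances whose label starts with the selected broker name(s)
some_kafka_metric{instance=~"($kafka_broker).*"}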
This works with the default all value:
It seems that the following happens when using the default all value (using the current cpu usage query, since it runs into the same problem):
Note to self: remember that doing the above breaks all the kafka graphs
@GoranSMilovanovic have you tried to use https://wikitech.wikimedia.org/wiki/SWAP and install the package there? pip is available and you could start using it straight away. The other alternative is to create your own Python virtual environment on stat1007 and pip install the package (testing and evaluating it). Packaging it for us would require a lot of time and effort, so if you only need it on one host this might be the quickest solution :)
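For example, the virtual environment route on stat1007 would look roughly like this (a sketch; the venv path and package name are placeholders):

# Hypothetical sketch: create a personal virtualenv and install the package into it
python3 -m venv ~/venv
source ~/venv/bin/activate
pip install <package-name>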