
elukey (Luca Toscano)
User

User Details

User Since
Jan 5 2016, 9:54 PM (176 w, 4 d)
Availability
Away until May 27.
LDAP User
Unknown
MediaWiki User
LToscano (WMF) [ Global Accounts ]

Recent Activity

Wed, May 22

elukey added a comment to T223310: Investigate increase in tx bandwidth usage for mc1033.

Thanks a lot for the deploy! I checked the metrics and nothing seems to have changed :(

Wed, May 22, 5:15 AM · MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, User-jijiki, serviceops, Operations

Tue, May 21

elukey changed the point value for T212243: Staging environment for upgrades of superset from 3 to 8.
Tue, May 21, 3:12 PM · Analytics-Kanban, Patch-For-Review, User-Elukey, Analytics
elukey changed the point value for T212243: Staging environment for upgrades of superset from 5 to 3.
Tue, May 21, 3:11 PM · Analytics-Kanban, Patch-For-Review, User-Elukey, Analytics
elukey set the point value for T212243: Staging environment for upgrades of superset to 5.
Tue, May 21, 3:11 PM · Analytics-Kanban, Patch-For-Review, User-Elukey, Analytics
elukey moved T212243: Staging environment for upgrades of superset from Paused to Done on the Analytics-Kanban board.
Tue, May 21, 3:11 PM · Analytics-Kanban, Patch-For-Review, User-Elukey, Analytics
elukey updated the task description for T212243: Staging environment for upgrades of superset .
Tue, May 21, 3:11 PM · Analytics-Kanban, Patch-For-Review, User-Elukey, Analytics
elukey added a comment to T212243: Staging environment for upgrades of superset .

https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset#Deploy

Tue, May 21, 3:11 PM · Analytics-Kanban, Patch-For-Review, User-Elukey, Analytics
elukey updated the task description for T212243: Staging environment for upgrades of superset .
Tue, May 21, 2:40 PM · Analytics-Kanban, Patch-For-Review, User-Elukey, Analytics
elukey moved T224023: Decommission analytics-tool1003 (old superset host) from Next Up to Done on the Analytics-Kanban board.
Tue, May 21, 2:40 PM · Patch-For-Review, Operations, vm-requests, Analytics-Kanban, User-Elukey, Analytics
elukey triaged T224023: Decommission analytics-tool1003 (old superset host) as Normal priority.
Tue, May 21, 2:40 PM · Patch-For-Review, Operations, vm-requests, Analytics-Kanban, User-Elukey, Analytics
elukey added a comment to T224023: Decommission analytics-tool1003 (old superset host).
elukey@ganeti1003:~$ sudo gnt-instance remove analytics-tool1003.eqiad.wmnet
This will remove the volumes of the instance analytics-
tool1003.eqiad.wmnet (including mirrors), thus removing all the data
of the instance. Continue?
y/[n]/?: y
Tue, May 21, 2:27 PM · Patch-For-Review, Operations, vm-requests, Analytics-Kanban, User-Elukey, Analytics
elukey created T224023: Decommission analytics-tool1003 (old superset host).
Tue, May 21, 2:12 PM · Patch-For-Review, Operations, vm-requests, Analytics-Kanban, User-Elukey, Analytics
elukey updated the task description for T212243: Staging environment for upgrades of superset .
Tue, May 21, 1:51 PM · Analytics-Kanban, Patch-For-Review, User-Elukey, Analytics
elukey moved T217738: Create an-tool1005 (Staging environment for Superset) from Next Up to Done on the Analytics-Kanban board.
Tue, May 21, 1:51 PM · Analytics-Kanban, vm-requests, Operations, User-Elukey, Analytics
elukey set the point value for T217738: Create an-tool1005 (Staging environment for Superset) to 5.
Tue, May 21, 1:50 PM · Analytics-Kanban, vm-requests, Operations, User-Elukey, Analytics
elukey added a comment to T217738: Create an-tool1005 (Staging environment for Superset).

Debian GNU/Linux 10 (buster)
an-tool1005 is a Superset web GUI for analytics dashboards (staging environment) (analytics_cluster::superset::staging)

Tue, May 21, 1:50 PM · Analytics-Kanban, vm-requests, Operations, User-Elukey, Analytics
elukey added a comment to T223647: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster.

After restoring the traffic served by php-fpm, it is clear that the increase in GET ops is correlated with it. Interestingly, GETS remained stable.

Tue, May 21, 12:26 PM · User-jijiki, serviceops, Operations
elukey added a comment to T223647: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster.

Grabbed a sample of localhost traffic to port 11213 on mw1238 from 11:45:39 to 11:48:05 (146s).
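For reference, a sample like that could be grabbed roughly as follows (a sketch only; the output file name, snaplen and the rough GET-count pipeline are assumptions, not the exact commands used):

elukey@mw1238:~$ sudo timeout 146 tcpdump -i lo -n -s0 -w /tmp/mcrouter-11213.pcap 'tcp port 11213'
# rough count of memcached get requests seen in the sample
elukey@mw1238:~$ tcpdump -r /tmp/mcrouter-11213.pcap -A 2>/dev/null | grep -o 'get [^ ]*' | wc -l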

Tue, May 21, 9:57 AM · User-jijiki, serviceops, Operations
elukey added a comment to T223647: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster.

20s of traffic from mw1239:

Tue, May 21, 9:29 AM · User-jijiki, serviceops, Operations
elukey added a comment to T223647: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster.

Other 20s of traffic from mw1238:

Tue, May 21, 9:25 AM · User-jijiki, serviceops, Operations
elukey added a comment to T223310: Investigate increase in tx bandwidth usage for mc1033.

Would it be possible to deploy https://gerrit.wikimedia.org/r/511612 before the weekly train?

Tue, May 21, 8:16 AM · MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, User-jijiki, serviceops, Operations

Mon, May 20

elukey added a comment to T223310: Investigate increase in tx bandwidth usage for mc1033.

@Krinkle the graph of ops/second clearly shows a bump from ~3k ops to ~12k ops, and the number of GETs per second to the chronology protector is close to 9k ops from tcpdump. I'll follow up as you suggest, but it seems to me to be a volume-of-calls problem, rather than big keys hitting Redis.

Mon, May 20, 8:58 PM · MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, User-jijiki, serviceops, Operations
elukey added projects to T223310: Investigate increase in tx bandwidth usage for mc1033: Performance-Team, MediaWiki-Cache.
Mon, May 20, 4:03 PM · MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, User-jijiki, serviceops, Operations
elukey added a comment to T213802: Investigate ways to reduce the size of translate-groups cache key.

@abi_ sorry, I should have been more precise - I meant to ask whether we have an idea about the size of those keys etc. (you know, ops people are always complaining about network usage :P). If we don't have a clear idea yet, never mind, I was only curious :)

Mon, May 20, 3:50 PM · MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), MW-1.33-notes (1.33.0-wmf.24; 2019-04-02), User-abi_, User-Nikerabbit, MediaWiki-extensions-Translate
elukey added a comment to T213802: Investigate ways to reduce the size of translate-groups cache key.

What results are expected after this change gets deployed? (just to summarize what to look for / check)

Mon, May 20, 2:42 PM · MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), MW-1.33-notes (1.33.0-wmf.24; 2019-04-02), User-abi_, User-Nikerabbit, MediaWiki-extensions-Translate
elukey added a comment to T223647: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster.

Grabbed a .pcap from mw1238 (09:26:41 -> 09:27:00 UTC, so ~20s of traffic) and tried to create a meaningful summary of the get traffic:

Mon, May 20, 8:30 AM · User-jijiki, serviceops, Operations
elukey added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions.

Very good news from the task opened with the Language team: https://gerrit.wikimedia.org/r/508112 has been merged and is going to be part of this week's mediawiki train. It may very well be the last fix needed, fingers crossed :)

Mon, May 20, 8:17 AM · MW-1.33-notes (1.33.0-wmf.25; 2019-04-09), Patch-For-Review, Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations
elukey added a comment to T223647: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster.

Not a very lucky morning:

Mon, May 20, 7:41 AM · User-jijiki, serviceops, Operations
elukey added a comment to T223863: memkeys segfaults on Debian Stretch.
elukey@mc1019:~$ dpkg -l | grep stdc
ii  libstdc++6:amd64                     4.9.2-10+deb8u1                  amd64        GNU Standard C++ Library v3
Mon, May 20, 6:50 AM · Operations
elukey updated subscribers of T223647: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster.

@Addshore After writing the above entry I remembered that a while ago we discussed a difference in memcached behavior between hhvm and php7 for Wikidata traffic. Could it be related?

Mon, May 20, 6:45 AM · User-jijiki, serviceops, Operations
elukey created T223863: memkeys segfaults on Debian Stretch.
Mon, May 20, 6:37 AM · Operations

Sun, May 19

jijiki awarded T223647: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster an Evil Spooky Haunted Tree token.
Sun, May 19, 6:13 PM · User-jijiki, serviceops, Operations
elukey added a comment to T223647: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster.

I zoomed to check when the increase happened, and I noticed something interesting:

Sun, May 19, 6:11 PM · User-jijiki, serviceops, Operations

Sat, May 18

elukey added a comment to T222921: extensions/CirrusSearch/includes/Sanity/Checker.php:369 Cannot fetch ids from index.

Today I have seen some alarms firing for mediawiki exceptions due to this error :)

Sat, May 18, 4:17 PM · CirrusSearch, Wikimedia-production-error, Discovery-Search
elukey added a comment to T219546: Change permissions for daily traffic anomaly reports on stat1007.

Killed all the processes on stat1007 and also commented out his crontab (so nothing should restart). Let me know if this is ok :)

Sat, May 18, 6:01 AM · User-Elukey, Analytics-Kanban, Analytics

Fri, May 17

elukey triaged T223647: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster as Normal priority.
Fri, May 17, 5:39 PM · User-jijiki, serviceops, Operations
elukey claimed T223310: Investigate increase in tx bandwidth usage for mc1033.
Fri, May 17, 5:27 PM · MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, User-jijiki, serviceops, Operations
elukey added a comment to T213802: Investigate ways to reduce the size of translate-groups cache key.

Thanks a lot for the feedback, and don't worry about the delay, better to be sure and test the code! One week more doesn't make a lot of difference :)

Fri, May 17, 4:05 PM · MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), MW-1.33-notes (1.33.0-wmf.24; 2019-04-02), User-abi_, User-Nikerabbit, MediaWiki-extensions-Translate
elukey added a comment to T223415: Strategy to be able to test eventlogging in beta in the absence of mysql .

+1, this solution is also fine; I am ok with anything that is easy and doesn't require more than a couple of hours to complete :)

Fri, May 17, 2:35 PM · Analytics, Analytics-EventLogging
elukey added a comment to T212550: Implement support for ChronologyProtection in events sent when editing Mediawiki/Wikidata.

Hello everybody, I am investigating https://phabricator.wikimedia.org/T223310, namely that the Redis MainStash is being constantly hammered by GETs for global:Wikimedia\Rdbms\ChronologyProtector since the deployment of 1.34.0-wmf.3. Is it related to https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/504819/ by any chance?

Fri, May 17, 9:50 AM · Performance-Team (Radar), MW-1.34-notes (1.34.0-wmf.3; 2019-04-30), Patch-For-Review, User-Smalyshev, Wikidata, Wikidata-Query-Service
elukey added a comment to T223310: Investigate increase in tx bandwidth usage for mc1033.

@aaron what about https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/504819/ and T212550 ?

Fri, May 17, 9:44 AM · MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, User-jijiki, serviceops, Operations
elukey updated the task description for T211706: Superset Updates .
Fri, May 17, 5:46 AM · Better Use Of Data, Analytics-Kanban, Product-Analytics
elukey added a comment to T217057: decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020].

New hosts should be racked this month: https://phabricator.wikimedia.org/T220687

Fri, May 17, 5:23 AM · User-Elukey, Analytics
elukey added a project to T217057: decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020]: User-Elukey.
Fri, May 17, 5:21 AM · User-Elukey, Analytics

Thu, May 16

elukey updated subscribers of T223415: Strategy to be able to test eventlogging in beta in the absence of mysql .

It would be better, in my opinion, to keep using the mysql database in deployment-prep as an interim solution and see if we can find an alternative in the meantime. Event Gate will eventually replace eventlogging, so it is probably better to spend time on the former rather than the latter. If we are worried about keeping the mysql code/dependencies, we could try to make them optional, so that we can remove them on eventlog1002 (in production) and keep them in deployment-prep. What do you think @Ottomata?

Thu, May 16, 5:44 PM · Analytics, Analytics-EventLogging
elukey added a comment to T223310: Investigate increase in tx bandwidth usage for mc1033.

I grabbed a pcap (~200MB) from 11:03:25 to 11:04:41 of traffic directed to mc1033's redis port (only one direction, to avoid having to deal with huge pcaps), grepped for GET and tried to aggregate some results:
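A rough aggregation of that kind could look like the following (a sketch; the Redis port, the interface and the key-extraction regex are assumptions, with the global: prefix taken from the key name mentioned elsewhere in this task):

elukey@mc1033:~$ sudo timeout 76 tcpdump -i eth0 -n -s0 -w /tmp/redis-in.pcap 'tcp dst port 6379'
# extract the GET'ed key names from the request payloads and count the top talkers
elukey@mc1033:~$ tcpdump -r /tmp/redis-in.pcap -A 2>/dev/null | grep -o 'global:[^ .]*' | sort | uniq -c | sort -rn | head -20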

Thu, May 16, 9:24 AM · MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, User-jijiki, serviceops, Operations
elukey added a comment to T223310: Investigate increase in tx bandwidth usage for mc1033.

I would expect https://github.com/wikimedia/mediawiki/commit/1ef64b2f2683051066a33ba34c6d04723252c3c0 to cause some increase, though. Unless the process cache isn't working, it would seem odd for that to cause problems.

Thu, May 16, 9:01 AM · MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, User-jijiki, serviceops, Operations
elukey added a comment to T223310: Investigate increase in tx bandwidth usage for mc1033.

@Krinkle @aaron I raised the priority to High since the tx network bandwidth usage grew a lot and we might end up saturating it with spikes in memcached traffic (causing timeouts etc. like in T203786). Can we check what caused the increase in traffic ASAP?

Thu, May 16, 5:58 AM · MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, User-jijiki, serviceops, Operations
elukey raised the priority of T223310: Investigate increase in tx bandwidth usage for mc1033 from Normal to High.
Thu, May 16, 5:53 AM · MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, User-jijiki, serviceops, Operations
elukey added a comment to T223310: Investigate increase in tx bandwidth usage for mc1033.

I have modified the ops/s panel of the Redis dashboard (https://grafana.wikimedia.org/d/000000174/redis) to show per-host metrics:

Thu, May 16, 5:52 AM · MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, User-jijiki, serviceops, Operations
elukey added a comment to T214275: Consider removing the last traces of nutcracker in Mediawiki configs.

Connections to the memcached shards dropped nicely after the nutcracker restart.

Thu, May 16, 5:44 AM · Performance-Team (Radar), Patch-For-Review, User-Elukey, Operations, MediaWiki-Cache

Wed, May 15

elukey added a comment to T217398: elastic2038 DOWN (CPU/memory errors ).

Started up just now, forgot to add the task in the SAL :)

Wed, May 15, 5:13 PM · Discovery-Search (Current work), Operations, ops-codfw
elukey added a comment to T217738: Create an-tool1005 (Staging environment for Superset).

@MoritzMuehlenhoff I am ready to make this host a proper staging environment for superset, let me know if we can proceed or not :)

Wed, May 15, 3:21 PM · Analytics-Kanban, vm-requests, Operations, User-Elukey, Analytics
elukey updated subscribers of T223310: Investigate increase in tx bandwidth usage for mc1033.

@aaron hi! I am trying to figure out if Redis traffic for mc1033 is causing this increase in bandwidth usage. From tcpdump I can see a lot of traffic for GET global:Wikimedia\Rdbms\ChronologyProtector::v2. Is it something added by 1.34.0-wmf.3? If not, I'll dig deeper :)

Wed, May 15, 2:23 PM · MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, User-jijiki, serviceops, Operations
elukey updated the task description for T212243: Staging environment for upgrades of superset .
Wed, May 15, 1:24 PM · Analytics-Kanban, Patch-For-Review, User-Elukey, Analytics
elukey added a comment to T223335: Can't log into Superset.

@Nuria the user that you added was missing the "Alpha" role; I think that this is the reason for the 401. It should be good now! @kaldari please retry when you have time :)

Wed, May 15, 6:52 AM · Analytics

Tue, May 14

elukey updated subscribers of T223310: Investigate increase in tx bandwidth usage for mc1033.

From memcached's bytes_written and bytes_read metrics I don't see anything changing dramatically:

Tue, May 14, 5:53 PM · MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, User-jijiki, serviceops, Operations
elukey triaged T223310: Investigate increase in tx bandwidth usage for mc1033 as Normal priority.
Tue, May 14, 5:33 PM · MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, User-jijiki, serviceops, Operations
elukey added a comment to T97368: Fix inefficient CacheAwarePropertyInfoStore memcached access pattern.

From a quick look with memkeys on mc1033, these are the top-talker keys:

Tue, May 14, 8:26 AM · MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), User-Elukey, Wikidata-Campsite, MW-1.32-notes (WMF-deploy-2018-10-02 (1.32.0-wmf.24)), User-Addshore, Performance-Team (Radar), Patch-For-Review, Wikimedia-Incident, Operations, wikidata-tech-focus, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Wikidata

Mon, May 13

elukey added a comment to T213802: Investigate ways to reduce the size of translate-groups cache key.

@abi_ you guys rock, thanks a lot for this effort!

Mon, May 13, 1:28 PM · MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), MW-1.33-notes (1.33.0-wmf.24; 2019-04-02), User-abi_, User-Nikerabbit, MediaWiki-extensions-Translate
elukey added a comment to T213802: Investigate ways to reduce the size of translate-groups cache key.

@abi_ Hi! Any news? :)

Mon, May 13, 8:38 AM · MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), MW-1.33-notes (1.33.0-wmf.24; 2019-04-02), User-abi_, User-Nikerabbit, MediaWiki-extensions-Translate
elukey lowered the priority of T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation from High to Normal.
Mon, May 13, 8:23 AM · Patch-For-Review, Operations
elukey added a comment to T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=928927

Mon, May 13, 8:22 AM · Patch-For-Review, Operations
elukey added a comment to T221848: Upgrade python-kafka.

Update: we were running 1.4.1-1~stretch1; I have rolled back eventlogging to it and all the instabilities went away. Judging from the changelog, 1.4.3 seems to be a broken version: a lot of changes related to the consumer code, with bugs introduced.
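For the record, a rollback of this kind normally boils down to pinning the older package version and restarting the eventlogging daemons; a minimal sketch (the unit name in the restart command is hypothetical):

elukey@eventlog1002:~$ sudo apt-get install python-kafka=1.4.1-1~stretch1
# restart the eventlogging daemons so they pick up the downgraded library (hypothetical unit name)
elukey@eventlog1002:~$ sudo systemctl restart eventlogging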

Mon, May 13, 7:01 AM · Patch-For-Review, Analytics-Kanban, Performance-Team (Radar), Operations, Analytics, Analytics-EventLogging

Sun, May 12

elukey added a comment to T222941: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances.

@Ottomata sorry, but 1.4.3 was not the right version to roll back to :(

Sun, May 12, 3:30 PM · Patch-For-Review, Analytics-Kanban, Analytics
elukey added a comment to T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation.

Upstream commit: https://github.com/unbit/uwsgi/commit/d642e635b3d558ce91e80442c74f4d16b9d81146

Sun, May 12, 3:15 PM · Patch-For-Review, Operations
elukey added a comment to T222941: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances.

All the processors are stuck again; py-bt output for reference:

Sun, May 12, 12:13 PM · Patch-For-Review, Analytics-Kanban, Analytics

Sat, May 11

elukey added a comment to T222941: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances.

There was a big lag accumulated during the EU nighttime; after the EL restart it seems to be working fine. The errors were all like:

Sat, May 11, 6:49 AM · Patch-For-Review, Analytics-Kanban, Analytics

Fri, May 10

elukey lowered the priority of T222941: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances from High to Normal.

Andrew deployed 1.4.3 and we are back to stable.

Fri, May 10, 3:34 PM · Patch-For-Review, Analytics-Kanban, Analytics
elukey added a comment to T221848: Upgrade python-kafka.

We can probably simply backport https://github.com/dpkp/kafka-python/pull/1628/commits/f12d4978e06c191871e092c190c2a34977f0c8bd on top of our 1.4.3 package.

If we confirm that this fixes it, and given that 1.4.3 is in buster, we should also report it to Debian; there's no pre-existing bug in the Debian BTS.
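Such a backport would presumably look something along these lines, using the usual quilt workflow (a sketch; the build host, the patch file name and the changelog entry are hypothetical):

elukey@build-host:~$ apt-get source python-kafka=1.4.3-1
elukey@build-host:~$ cd python-kafka-1.4.3
# import the upstream commit as a quilt patch (hypothetical patch name)
elukey@build-host:~/python-kafka-1.4.3$ wget -O debian/patches/heartbeat-deadlock.patch \
      https://github.com/dpkp/kafka-python/commit/f12d4978e06c191871e092c190c2a34977f0c8bd.patch
elukey@build-host:~/python-kafka-1.4.3$ echo heartbeat-deadlock.patch >> debian/patches/series
elukey@build-host:~/python-kafka-1.4.3$ dch --local +wmf 'Backport upstream heartbeat fix from PR 1628'
elukey@build-host:~/python-kafka-1.4.3$ dpkg-buildpackage -us -uc -b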

Fri, May 10, 10:55 AM · Patch-For-Review, Analytics-Kanban, Performance-Team (Radar), Operations, Analytics, Analytics-EventLogging
elukey added a comment to T222941: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances.

Rebuilt python-kafka_1.4.3-1_all.deb and uploaded it to eventlog1002 in case we decide to roll back.

Fri, May 10, 10:34 AM · Patch-For-Review, Analytics-Kanban, Analytics
elukey added a comment to T222941: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances.

https://github.com/dpkp/kafka-python/issues/1418 seems related, some workarounds are listed.

Fri, May 10, 10:26 AM · Patch-For-Review, Analytics-Kanban, Analytics
elukey added a comment to T221848: Upgrade python-kafka.

The upgrade of python-kafka to 1.4.6 on eventlog1002 coincides very well with T222941 :(

Fri, May 10, 10:12 AM · Patch-For-Review, Analytics-Kanban, Performance-Team (Radar), Operations, Analytics, Analytics-EventLogging
elukey added a comment to T222941: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances.

SAL for the 04-30:

Fri, May 10, 10:01 AM · Patch-For-Review, Analytics-Kanban, Analytics
elukey added a comment to T222941: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances.

I tried with the following changes:

Fri, May 10, 9:22 AM · Patch-For-Review, Analytics-Kanban, Analytics
elukey triaged T222941: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances as High priority.
Fri, May 10, 9:17 AM · Patch-For-Review, Analytics-Kanban, Analytics

Thu, May 9

elukey added a comment to T222895: "TSocket read 0 bytes" error in Hue when querying.

Hive Server 2 was OOMing, still not sure why. After a restart it seems to be working fine again. Will need to dig a bit deeper into the logs!

Thu, May 9, 4:44 PM · Analytics
elukey moved T220702: Check if HDFS offers a way to prevent/limit/throttle users to overwhelm the HDFS Namenode from Backlog to In Progress on the User-Elukey board.
Thu, May 9, 7:35 AM · Analytics-Kanban, Analytics, User-Elukey
elukey added a comment to T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation.

No segfaults today, logrotate ran fine! I don't see any logs in logstash though, worth investigating? (Maybe in a separate task if confirmed by others.)

Thu, May 9, 7:02 AM · Patch-For-Review, Operations
elukey added a comment to T222050: db1107 (eventlogging db master) possibly memory issues.

@Cmjohnson I'd need a heads-up ~15 mins before the maintenance to shut down the host properly, but we can do it anytime!

Thu, May 9, 6:19 AM · Analytics, Operations, ops-eqiad, Analytics-EventLogging, DBA

Wed, May 8

elukey added a comment to T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation.

Side note: I don't find any messages in Kibana related to netmon1002 or netbox; not sure whether that is normal or not.

Wed, May 8, 7:54 AM · Patch-For-Review, Operations
elukey added a comment to T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation.

The fix looks very promising: I have restarted uwsgi-netbox 3 times in a row with no trace of the segfault. Let's wait for tomorrow's round of logrotate restarts to confirm that we are good.

Wed, May 8, 7:51 AM · Patch-For-Review, Operations

Tue, May 7

elukey added a comment to T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation.

On boron I have built uwsgi-core_2.0.14+20161117-3+deb9u2~wmf1_amd64.deb with the following patch:

Tue, May 7, 5:50 PM · Patch-For-Review, Operations
elukey added a comment to T219546: Change permissions for daily traffic anomaly reports on stat1007.

@Jdcc-berkman since you have access to the host, can you please clean up those files, so that I won't accidentally delete anything valuable? Thanks :)

Tue, May 7, 4:32 PM · User-Elukey, Analytics-Kanban, Analytics
elukey updated subscribers of T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation.

@MoritzMuehlenhoff we have an interesting segfault that happens for uwsgi when systemctl restarts the netbox unit, but only in production and not in labs. Cas and I tried several times to reproduce the bug without any luck. After opening https://github.com/unbit/uwsgi/issues/2010, there is a promising patch that may fix the issue, so in order to preserve mental sanity I propose the following:

Tue, May 7, 4:09 PM · Patch-For-Review, Operations
elukey added a comment to T219546: Change permissions for daily traffic anomaly reports on stat1007.

@ssingh very happy to help! We should be ok to close the task, right?

Tue, May 7, 9:31 AM · User-Elukey, Analytics-Kanban, Analytics

Mon, May 6

elukey added a comment to T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation.

Opened https://github.com/unbit/uwsgi/issues/2010

Mon, May 6, 5:26 PM · Patch-For-Review, Operations
elukey added a comment to T222050: db1107 (eventlogging db master) possibly memory issues.

@Marostegui sorry, I was under the impression that we would need to wait for feedback from Chris/Rob about how to proceed. Is there anything pending on the Analytics side?

Mon, May 6, 4:45 PM · Analytics, Operations, ops-eqiad, Analytics-EventLogging, DBA
elukey added a comment to T222603: Fix oozie banner_impression monthly job.

Please also restart the job as user analytics, not hdfs. The related patch is https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/508283/; if it is not deployed when you restart, please use the -D user=analytics override and start the job via sudo -u analytics oozie, roughly as sketched below.
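A hedged sketch of what that restart could look like (the coordinator host, $OOZIE_URL and the properties file path below are hypothetical placeholders, not the real ones; the -D user=analytics override is the one mentioned above):

elukey@an-coord1001:~$ sudo -u analytics oozie job -oozie $OOZIE_URL \
      -D user=analytics \
      -config /srv/deployment/analytics/refinery/oozie/banner_impression/monthly/coordinator.properties \
      -run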

Mon, May 6, 2:36 PM · Analytics-Kanban, Analytics
elukey added a comment to T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation.

Highlight to:

Mon, May 6, 11:06 AM · Patch-For-Review, Operations
elukey added a comment to T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation.

A systemctl restart triggered a segfault, and a core was available under /var/tmp/core. This is what gdb says:

Mon, May 6, 9:43 AM · Patch-For-Review, Operations
elukey added a comment to T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation.

Added a couple of code reviews as an attempt to add LimitCore to the netbox systemd unit. If this is not the idea that you guys had, please feel free to discard them :)
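For context, outside of Puppet the change amounts to a small systemd drop-in like the one below (a sketch; the drop-in path/file name and the infinity value are assumptions, the directive is spelled LimitCORE in systemd, and in production it would be applied through the code reviews mentioned above):

elukey@netmon1002:~$ sudo mkdir -p /etc/systemd/system/uwsgi-netbox.service.d
elukey@netmon1002:~$ printf '[Service]\nLimitCORE=infinity\n' | \
      sudo tee /etc/systemd/system/uwsgi-netbox.service.d/coredump.conf
elukey@netmon1002:~$ sudo systemctl daemon-reload && sudo systemctl restart uwsgi-netbox.service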

Mon, May 6, 7:00 AM · Patch-For-Review, Operations

Sun, May 5

elukey added a comment to T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation.

I am wondering if we could do the following:

Sun, May 5, 2:39 PM · Patch-For-Review, Operations

Fri, May 3

elukey added a comment to T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.

Swapped all the occurrences of instance=~"$kafka_broker" with instance=~"($kafka_broker).*", and the dashboard seems to load faster now. Also removed the .* custom value from the $kafka_broker All values field.

Fri, May 3, 2:26 PM · Wikimedia-Incident, Operations, observability
elukey added a comment to T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.

This works with the default all value:

Fri, May 3, 2:10 PM · Wikimedia-Incident, Operations, observability
elukey added a comment to T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.

It seems that the following happens when using the default all value (using the current cpu usage query since it runs into the same problem):

Fri, May 3, 2:07 PM · Wikimedia-Incident, Operations, observability
elukey added a comment to T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.

Note to self: remember that doing the above breaks all the kafka graphs

Fri, May 3, 1:46 PM · Wikimedia-Incident, Operations, observability
elukey added a comment to T222359: Install Gensim for Python3 on stat1007.

@GoranSMilovanovic have you tried using https://wikitech.wikimedia.org/wiki/SWAP and installing the package there? pip is available and you could start using it straight away. The other alternative is to create your own Python virtual environment on stat1007 and pip install the package (testing and evaluating it). Packaging it for us would require a lot of time and effort, so if you only need it on one host this might be the quickest solution :)
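The virtual-environment route would look roughly like this (a sketch; the venv directory name and the user@ in the prompt are placeholders, and a web proxy may or may not be needed for pip depending on the host's network setup):

user@stat1007:~$ python3 -m venv ~/venv-gensim
user@stat1007:~$ source ~/venv-gensim/bin/activate
(venv-gensim) user@stat1007:~$ pip install --upgrade pip
(venv-gensim) user@stat1007:~$ pip install gensim
# quick check that the package imports correctly
(venv-gensim) user@stat1007:~$ python -c 'import gensim; print(gensim.__version__)'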

Fri, May 3, 1:39 PM · WMDE-Analytics-Engineering, Analytics
elukey added a comment to T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.

I think you should just be able to remove the "custom all value" in the dashboard settings and have it work. In this case Grafana will create its own 'all' value that is simply a regex OR'ing together all the known values, which it looks like it computes based on the cluster=kafka_jumbo hidden variable.

Fri, May 3, 11:15 AM · Wikimedia-Incident, Operations, observability
elukey added a comment to T222301: Upgrade pandas in spark SWAP notebooks.

We are also open to suggestions; it is a new thing for everybody, so please feel free to create tasks and follow up with us. We don't think you are looking to argue, on the contrary!

Fri, May 3, 7:28 AM · Analytics