@herron decided to proceed and unblock oxygen's decom process; from now on we can decide how to proceed with logstash/webrequest-503 (it will likely take a bit of time, so better to nuke oxygen in the meantime :). Hope that it is ok!
Created https://phabricator.wikimedia.org/T211883 :)
@fdans and I deployed 0.28.1 this morning, and we had to apply a hotfix for an outstanding upstream bug (see T211605#4820128 for more juicy info). Please check if everything is ok; we did a quick check and didn't notice anything significant (except what was referenced before, of course :)).
After deploying, the Charts panel was broken; fixed manually following https://github.com/apache/incubator-superset/issues/6347#issuecomment-442178847
While testing the `superset db upgrade` command I got:
This server is going to be decommed very soon (OOW); I acked the alarm a long time ago to avoid it spamming us. Good to close in my opinion, +1
Wed, Dec 12
The first breaking change that I can see (use of f-strings) happened in commit https://github.com/apache/incubator-superset/commit/cc3a625a4bb6b0e581b30f3112315ff5a8ab6807, which should be in the upcoming release, not in 0.28.1, so in theory reverting https://github.com/lyft/incubator-superset/commit/174ee13b512f8aaa311fe0980276ac970930f4e6 and building with Python 3.5 should be enough for this upgrade.
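For context on why building with Python 3.5 works once the f-string commit is reverted: f-strings are a syntax feature introduced in Python 3.6, so a module using them fails to even parse on 3.5. A quick sketch of how to check whether a snippet parses under the running interpreter (the helper name is invented for illustration):

```python
# f-strings are Python 3.6+ syntax: on 3.5 they raise SyntaxError at
# parse time, before any code runs. This helper (invented for
# illustration) checks whether a piece of source parses at all.
import ast

def parses_ok(source: str) -> bool:
    """Return True if `source` is valid syntax for this interpreter."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

print(parses_ok("x = 'plain string'"))   # True everywhere
print(parses_ok("x = f'{1 + 1}'"))       # False on Python 3.5, True on 3.6+
```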
Tue, Dec 11
@MoritzMuehlenhoff I think that you have a good first candidate for buster testing :D
@mpopov please also keep in mind that things like T211605#4814020 could happen with a project that is still developing fast and does not care much about breaking existing users, so upgrades might not be easy :D
Very nice issue I just found: https://github.com/apache/incubator-superset/pull/5985
Very good news: stat1005 is finally ready for experimenting with GPU drivers, etc. I am completely ignorant about the subject, so if anybody has time/patience please come forward :)
We discussed this during the Analytics standup and we have a proposal: we could start by creating a tracking task for Superset/Turnilo upgrade schedules, which everybody can bookmark easily, and then start with one update every quarter (if upstream has released a new version, of course). If this turns out not to be enough, the same tracking task can also be used to request new versions for specific reasons (like solving a bug, etc.). How does that sound?
@AndyRussG ping :)
I checked the tcpdump traffic again, and the "new" peaks in mc1022's usage are due to CAS commands, as is visible in https://grafana.wikimedia.org/d/000000614/memcache-elukey?orgId=1&panelId=10&fullscreen
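For reference, a minimal Python sketch of what a memcached text-protocol `cas` (check-and-set) request looks like on the wire, i.e. the kind of command those tcpdump captures show (the key, flags, token, and value below are invented for illustration, not taken from the capture):

```python
# Minimal sketch of a memcached text-protocol "cas" request, the kind of
# command visible in a tcpdump capture of traffic to port 11211.
# Key, flags, exptime, cas token, and value are invented examples.
def build_cas_request(key: str, flags: int, exptime: int,
                      cas_token: int, value: bytes) -> bytes:
    """Serialize a 'cas' command line plus data block, as sent to memcached."""
    header = f"cas {key} {flags} {exptime} {len(value)} {cas_token}\r\n"
    return header.encode("ascii") + value + b"\r\n"

req = build_cas_request("WANCache:v:example", 0, 300, 42, b"payload")
print(req)
```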
Mon, Dec 10
- log is no longer on dbstore1002/analytics-store; you can find it on analytics-slave (db1108)
- centralauth should be s7
- wikishared no idea (@Banyek can you help?)
We (analytics) have been trying to move away from crons in favor of systemd timers, adding some automation in profile::analytics::systemd_timer. It shouldn't need too much work to be generalized and adapted to the mediawiki use case; I can help/work on it if you think it's a good idea!
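As an illustration of the cron-to-timer translation, a cron entry like `0 5 * * * /usr/local/bin/refine` roughly becomes a timer/service pair like the one below (unit names, schedule, and paths are invented examples, not the actual output of profile::analytics::systemd_timer):

```ini
# refine.timer -- hypothetical example, not the real puppet output.
[Unit]
Description=Run the refine job daily at 05:00

[Timer]
OnCalendar=*-*-* 05:00:00
# Catch up on runs missed during downtime, something cron cannot do.
Persistent=true

[Install]
WantedBy=timers.target

# refine.service (separate file)
[Unit]
Description=Refine job

[Service]
Type=oneshot
ExecStart=/usr/local/bin/refine
```

One nice side effect versus cron is that logs end up in the journal and failures are visible via `systemctl list-timers` and unit state.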
Fri, Dec 7
It happened on the 30th on various cp hosts, and in most of the Jumbo brokers I can see something like the following (repeated multiple times and for different brokers):
Thu, Dec 6
To follow up what I wrote (after a chat with the data persistence team):
Turnilo is now running on nodejs 10!
Wed, Dec 5
The plan is:
This is a very good point, I'll bring it up at my team's standup today and let you know. As far as I know, it has been used for two purposes:
- join tables from different databases into tmp tables to work on them freely (something that is not possible anymore)
- use it as a holding area for various scripts/analytics-reporting/etc.
No mcrouter proxies on A4, all good.
No mcrouter codfw proxies present in B4, all good.
Piwik/Matomo upgraded, but while testing the users I noticed that the piwik user outlined in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Piwik#Access seems to have a different password. @Nuria: I tried to change the pass and it seems that it needs more than 6 chars, so it must be another one. Shall we update wikitech?
Database Upgrade Required
Tue, Dec 4
Opened a procurement task for 1 Cloudb replica in T211135. We are not planning to buy two hosts, based on the following assumption:
Before doing this, we probably need to run npm install for Turnilo with nodejs 10... just realized it.
FYI, ema told me that https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/477424/ reverted https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476311, an experiment to disable N-hit-wonder for some days. The experiment caused issues while loading images from Commons - https://phabricator.wikimedia.org/T210890 (hence upload).
Upgraded Turnilo in labs (turnilo.eqiad.wmflabs); if anybody wants to test it: `ssh -N turnilo.eqiad.wmflabs -L 9091:turnilo.eqiad.wmflabs:9091`
Between 8:10 and 9 UTC this morning there were enough TKOs to trigger the logstash exception alarm; from https://grafana.wikimedia.org/dashboard/db/memcache-elukey?orgId=1&from=1543910817973&to=1543914647084 the timing matches nicely.
Need to check with Joe but I'd do the following:
Mon, Dec 3
Since the refined data should now be there, lowering the priority to High :)
Sun, Dec 2
Fri, Nov 30
They are not, forgot to mention :(
Reading the backlog only now; this was a good learning lesson for me too (I was aware of what Chase did, as mentioned, and didn't think that it would have been flagged as an issue to review). Thanks a lot to all who contributed their thoughts and suggestions :)
A possible solution, instead of ordering new hardware, would be to reuse one or two of the new Hadoop nodes racked in T207192 for this use case: they have 12x3.6TB disks and 128GB of RAM, so I'd say that they could do the job (I just don't know if 128GB of RAM would be enough for our use case, but I'll defer to Manuel/Balazs/Jaime's judgement).
If this option is viable, we'll also need to get the green light for the repurpose from Faidon or Mark.
@hashar quick question - we are about to migrate AQS to NodeJS 10, will it be easy to migrate npm test to it when needed?
Thanks a lot for all the input, I'd say that we don't need proxies for the moment. We'll probably just need some automation around the mapping between replicated wiki and mariadb host/instance (which the Analytics team can do, of course) to make it easier for every user to connect to a specific instance.
Update after the mediawiki train deployment:
Thu, Nov 29
As far as I know we have to go multi-instance, but I don't have a lot of context on whether multi-source is still possible or needed (I guess not, but I prefer to ask :)