Upgrade Turnilo
Closed, Resolved (Public)

Description

We're a few versions behind; there have been some security patches, and there's a new Grid visualization. I read through the changes, and it feels like if we wait much longer the upgrade will become too big: https://github.com/allegro/turnilo/releases?page=3

Related Objects

Status: Resolved, Assigned: BTullis
Status: Resolved, Assigned: Stevemunene

Event Timeline

Thanks for the task; I was looking for the exact same thing.

The scatterplot visualization would be useful for us in SRE as well.

Change 777881 had a related patch set uploaded (by Razzi; author: Razzi):

[analytics/turnilo/deploy@master] Upgrade to upstream version 1.35.0

https://gerrit.wikimedia.org/r/777881

I made a patch for this, but the scap deploy to staging failed due to some error with locales:

Apr 06 21:42:49 an-tool1005 turnilo[26803]: Child process initialized in 22.28 ms
Apr 06 21:42:50 an-tool1005 turnilo[26803]: internal/modules/cjs/loader.js:638
Apr 06 21:42:50 an-tool1005 turnilo[26803]:     throw err;
Apr 06 21:42:50 an-tool1005 turnilo[26803]:     ^
Apr 06 21:42:50 an-tool1005 turnilo[26803]: Error: Cannot find module '../locale/locale'
Apr 06 21:42:50 an-tool1005 turnilo[26803]:     at Function.Module._resolveFilename (internal/modules/cjs/loader.js:636:15)
Apr 06 21:42:50 an-tool1005 turnilo[26803]:     at Function.Module._load (internal/modules/cjs/loader.js:562:25)
Apr 06 21:42:50 an-tool1005 turnilo[26803]:     at Module.require (internal/modules/cjs/loader.js:692:17)
Apr 06 21:42:50 an-tool1005 turnilo[26803]:     at require (internal/modules/cjs/helpers.js:25:18)
Apr 06 21:42:50 an-tool1005 turnilo[26803]:     at Object.<anonymous> (/srv/deployment/analytics/turnilo/deploy-cache/revs/a1c5c6fa88e6e8ada5fe57ae88fd86acc578cb4d/node_modules/t
Apr 06 21:42:50 an-tool1005 turnilo[26803]:     at Module._compile (internal/modules/cjs/loader.js:778:30)
Apr 06 21:42:50 an-tool1005 turnilo[26803]:     at Object.Module._extensions..js (internal/modules/cjs/loader.js:789:10)
Apr 06 21:42:50 an-tool1005 turnilo[26803]:     at Module.load (internal/modules/cjs/loader.js:653:32)
Apr 06 21:42:50 an-tool1005 turnilo[26803]:     at tryModuleLoad (internal/modules/cjs/loader.js:593:12)
Apr 06 21:42:50 an-tool1005 turnilo[26803]:     at Function.Module._load (internal/modules/cjs/loader.js:585:3)
Apr 06 21:42:50 an-tool1005 turnilo[26803]: Parent is shutting down, bye...

I'm wrapping up for the day. If anybody else would like to take a look, please feel free.

The deploy command that I ran and then rolled back was:

razzi@deploy1002:/srv/deployment/analytics/turnilo/deploy$ scap deploy --limit an-tool1005.eqiad.wmnet

Ah, OK. It appears we're now too far behind on Node.js versions:

Pre-requisites

    Node.js - 12.x or 14.x version

https://github.com/allegro/turnilo

So we'll have to upgrade node from version 10.24 on an-tool1005 to at least 12 to upgrade turnilo (at least to the latest version, which we should do in my opinion).

+1
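
For anyone checking a host before attempting the upgrade, here is a minimal sketch of the version test implied above. It assumes "at least 12" is the only criterion; the function name node_major_at_least is made up for illustration and is not part of scap or Turnilo.

```shell
# Succeed when a Node.js version string has a major version >= the given minimum.
# Hypothetical helper, for illustration only.
node_major_at_least() {
  version="${1#v}"        # strip the leading "v" from e.g. "v12.22.5"
  major="${version%%.*}"  # keep everything before the first dot
  [ "$major" -ge "$2" ]
}

# Only query node when it is actually installed on this host.
if command -v node >/dev/null 2>&1; then
  current="$(node --version)"
  if node_major_at_least "$current" 12; then
    echo "node $current meets the Turnilo prerequisite"
  else
    echo "node $current is too old; Turnilo needs 12.x or 14.x"
  fi
fi
```

Run on a host like an-tool1005 with node 10.24, this should take the "too old" branch.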

razzi moved this task from Next Up to In Progress on the Data-Engineering-Kanban board.
razzi added a subscriber: hashar.

According to @hashar we can get node 12.22.5 by upgrading Debian to version 11 Bullseye (staging and production Turnilo run Debian 10). I'll try upgrading Debian on the staging host and see if the latest Turnilo works then.

Change 791397 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] an-tool1005: set operating system image to bullseye

https://gerrit.wikimedia.org/r/791397

Change 791397 merged by Razzi:

[operations/puppet@production] an-tool1005: set operating system image to bullseye

https://gerrit.wikimedia.org/r/791397

Change 791461 had a related patch set uploaded (by Razzi; author: Razzi):

[analytics/turnilo/deploy@master] Upgrade to superset 1.35.0

https://gerrit.wikimedia.org/r/791461

Change 791705 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] dhcpd: downgrade an-tool1005 to stretch, upgrade an-tool1007 to bullseye

https://gerrit.wikimedia.org/r/791705

Change 777881 merged by Razzi:

[analytics/turnilo/deploy@master] Upgrade to upstream version 1.35.0

https://gerrit.wikimedia.org/r/777881

Change 791705 merged by Razzi:

[operations/puppet@production] dhcpd: downgrade an-tool1005 to stretch, upgrade an-tool1007 to bullseye

https://gerrit.wikimedia.org/r/791705

I completed the reinstall of an-tool1007 to bullseye with the following on ganeti1024.eqiad.wmnet:

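# Stop the VM before changing its boot settings
sudo gnt-instance shutdown an-tool1007.eqiad.wmnet
# Boot from the network so that the installer runs
sudo gnt-instance modify --hypervisor-parameters=boot_order=network an-tool1007.eqiad.wmnet
# Start the VM and attach to its console to follow the install
sudo gnt-instance start an-tool1007 && sudo gnt-instance console an-tool1007.eqiad.wmnet
# Switch back to booting from disk
sudo gnt-instance modify --hypervisor-parameters=boot_order=disk an-tool1007.eqiad.wmnet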

I then followed the required steps from the manual installation procedure to delete and regenerate the puppet certificate.

After running puppet a couple of times, turnilo came up.

I also ran a deploy with scap:

btullis@deploy1002:/srv/deployment/analytics/turnilo/deploy$ scap deploy --limit an-tool1007.eqiad.wmnet
09:16:14 Started deploy [analytics/turnilo/deploy@bf60521]
09:16:14 Deploying Rev: HEAD = bf605219112cedf74ef8001b5e2b396afe47e23a
09:16:14 Started deploy [analytics/turnilo/deploy@bf60521]: (no justification provided)
09:16:14
== DEFAULT ==
:* an-tool1007.eqiad.wmnet
analytics/turnilo/deploy: fetch stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0)
analytics/turnilo/deploy: config_deploy stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0)
analytics/turnilo/deploy: promote and restart_service stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0)
09:16:16
== DEFAULT ==
:* an-tool1007.eqiad.wmnet
analytics/turnilo/deploy: finalize stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0)
09:16:17 Finished deploy [analytics/turnilo/deploy@bf60521]: (no justification provided) (duration: 00m 03s)
09:16:17 Finished deploy [analytics/turnilo/deploy@bf60521] (duration: 00m 03s)

However, according to regular users, some dashboards are missing and some may be missing data.
This is the list that I see.

image.png (954×659 px, 57 KB)

Every minute we see an entry like this in the logs:

May 17 09:45:46 an-tool1007 turnilo[5457]: Scanning cluster 'druid-analytics-eqiad' for new sources
May 17 09:45:46 an-tool1007 turnilo[5457]: Cluster 'druid-analytics-eqiad' has never seen 'banner_activity_minutely' and will introspect 'banner_activity_minutely'
May 17 09:45:46 an-tool1007 turnilo[5457]: Cluster 'druid-analytics-eqiad' has never seen 'mediawiki_geoeditors_monthly' and will introspect 'mediawiki_geoeditors_monthly'
May 17 09:45:46 an-tool1007 turnilo[5457]: Cluster 'druid-analytics-eqiad' has never seen 'pageviews_daily' and will introspect 'pageviews_daily'
May 17 09:45:46 an-tool1007 turnilo[5457]: Cluster 'druid-analytics-eqiad' has never seen 'pageviews_hourly' and will introspect 'pageviews_hourly'
May 17 09:45:46 an-tool1007 turnilo[5457]: Cluster 'druid-analytics-eqiad' has never seen 'unique_devices_per_domain_daily' and will introspect 'unique_devices_per_domain_daily'
May 17 09:45:46 an-tool1007 turnilo[5457]: Cluster 'druid-analytics-eqiad' has never seen 'unique_devices_per_domain_monthly' and will introspect 'unique_devices_per_domain_monthly'
May 17 09:45:46 an-tool1007 turnilo[5457]: Cluster 'druid-analytics-eqiad' has never seen 'unique_devices_per_project_family_daily' and will introspect 'unique_devices_per_project_family_daily'
May 17 09:45:46 an-tool1007 turnilo[5457]: Cluster 'druid-analytics-eqiad' has never seen 'unique_devices_per_project_family_monthly' and will introspect 'unique_devices_per_project_family_monthly'
May 17 09:45:46 an-tool1007 turnilo[5457]: Cluster 'druid-analytics-eqiad' has never seen 'virtualpageviews_hourly' and will introspect 'virtualpageviews_hourly'
May 17 09:45:46 an-tool1007 turnilo[5457]: Cluster 'druid-analytics-eqiad' has never seen 'webrequest_sampled_128' and will introspect 'webrequest_sampled_128'
May 17 09:45:46 an-tool1007 turnilo[5457]: Cluster 'druid-analytics-eqiad' has never seen 'wmf_netflow' and will introspect 'wmf_netflow'

I believe that these may well constitute a list of the missing dashboards, so it looks like it might be a configuration issue between the two versions. Continuing to investigate.
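
To compare that list against the dashboards users report as missing, the source names can be pulled out of the log lines. This is a rough sketch: the function name never_seen_sources is made up, and the sample input is just three of the lines quoted above.

```shell
# Print the unique data source names from Turnilo "has never seen" log lines on stdin.
# Hypothetical helper, for illustration only.
never_seen_sources() {
  sed -n "s/.*has never seen '\([^']*\)'.*/\1/p" | sort -u
}

# Sample input copied from the log excerpt above; the "Scanning" line is ignored.
never_seen_sources <<'EOF'
May 17 09:45:46 an-tool1007 turnilo[5457]: Scanning cluster 'druid-analytics-eqiad' for new sources
May 17 09:45:46 an-tool1007 turnilo[5457]: Cluster 'druid-analytics-eqiad' has never seen 'pageviews_daily' and will introspect 'pageviews_daily'
May 17 09:45:46 an-tool1007 turnilo[5457]: Cluster 'druid-analytics-eqiad' has never seen 'wmf_netflow' and will introspect 'wmf_netflow'
EOF
```

On the host, the same function could be fed from the journal instead, for example journalctl -u turnilo | never_seen_sources, assuming the systemd unit is named turnilo.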

There seems to be a clear problem with our current configuration and version 1.35.
The symptoms are as follows:

  • If I load the config as-is, then only the automatically detected data cubes are loaded. Those present in the configuration files are not shown.

There are error messages like:

Cluster 'druid-analytics-eqiad' already has an external for 'pageviews_hourly' ('pageviews_hourly')

...and yet the pageviews_hourly cube isn't shown.

  • If I add sourceListScan: disable to the data source, then all of the cubes from the configuration file are loaded, but none of the cubes that are not in the files are loaded.
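
For reference, a sketch of what the second variant looks like. It assumes the option sits on the cluster entry, as in Turnilo's documented cluster settings; the url is a placeholder, and only the cluster name and the sourceListScan key come from this task.

```shell
# Write a minimal, hypothetical cluster fragment to a scratch file.
config_file="$(mktemp)"
cat > "$config_file" <<'EOF'
clusters:
  - name: druid-analytics-eqiad
    url: http://druid.example.invalid:8082
    sourceListScan: disable
EOF

# Report the scan setting as a quick sanity check before restarting the service.
scan_mode="$(sed -n 's/^ *sourceListScan: *//p' "$config_file")"
echo "sourceListScan=${scan_mode:-unset}"

rm -f "$config_file"
```

With this fragment, Turnilo stops scanning the cluster for new sources, which matches the second symptom: only cubes defined in the configuration files are shown.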

I've posted a message to the Turnilo Slack channel, to see if they can help at all.

image.png (685×1 px, 128 KB)

I'm receiving substantial help and a great response via the Turnilo Slack from Adrian Mróź, who is a key contributor to Turnilo.
I've supplied logs and a config file and I'm awaiting more information.

BTullis triaged this task as Medium priority. May 17 2022, 3:19 PM

Change 791461 abandoned by Razzi:

[analytics/turnilo/deploy@master] Upgrade to superset 1.35.0

Reason:

Did this upgrade in another patch

https://gerrit.wikimedia.org/r/791461

I've created a child ticket to track the fix for Turnilo: T308778: Fix turnilo after upgrade
I'm hopeful that the upstream author will be able to find the issue and suggest a workaround or provide a fix in the code.

Moving to paused while the fix is being investigated. One of the options is to downgrade again; otherwise I would resolve this ticket now.

I'm resolving this ticket because the upgrade is done. There is still an issue, but I'll work on that in T308778.