Tue, Feb 23
Fri, Feb 19
ALTER TABLE row_level_security_filters ROW_FORMAT=DYNAMIC; fixed it, thanks! Here's the full procedure so the order is clear:
Thu, Feb 18
Ok, the problem was that I had upgraded pip in the Docker container used to build the wheels, which made the wheels incompatible with the staging server.
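One way to guard against this in the future would be to pin the build toolchain in the image, so wheels are always built with the same pip/wheel versions the target server expects. A sketch (the version numbers are illustrative, not the ones we actually use):

```dockerfile
# Dockerfile fragment (sketch; version numbers are illustrative).
# Pinning pip and wheel keeps the wheel-building toolchain in the image
# in sync with what the staging server's pip can consume.
RUN pip install 'pip==20.2.4' 'wheel==0.35.1'
```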
I tried to deploy superset to the staging box, but it failed with
I'm thinking of writing up the steps for rebalancing partitions in a wiki article such as https://wikitech.wikimedia.org/wiki/Kafka/Administration, and I'm reminded of how I scp'd the topicmappr executable to kafka-jumbo1002 and how that's hacky. Should we make a plan to properly package topicmappr?
Wed, Feb 17
Ok! Now that we're on to the final and highest-traffic topics, webrequest_upload and webrequest_text, we're switching to migrating one partition at a time. Here are the full migration plans, in case they get modified in the process.
Sat, Feb 13
Since it's already mid-February and there's still preparation to do, we're going to wait until sqoop runs on March 1 to proceed with this.
Fri, Feb 12
Wed, Feb 10
Tue, Feb 9
Mon, Feb 8
Fri, Feb 5
Thu, Feb 4
Wed, Feb 3
@Ottomata and I discussed next steps for this ticket, and came up with the following:
Tue, Feb 2
Mon, Feb 1
@JAllemandou what do you mean by snapshot data?
As we get into the higher-volume topics, we are seeing some alerts about replica max lag and under-replicated partitions. While I continue to run migrations, those alerts should be disabled for a few hours at a time and the metrics observed manually in Grafana.
Fri, Jan 29
One way to go about this may be to use hive.max-partitions-per-scan. From the docs:
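If we go that route, the knob would live in the Presto Hive connector's catalog properties file. A sketch (the path and limit value are illustrative, not what we'd necessarily deploy):

```properties
# etc/catalog/hive.properties (sketch; the value shown is illustrative)
# Caps how many partitions a single table scan may touch, so runaway
# unpartition-filtered queries fail fast instead of scanning everything.
hive.max-partitions-per-scan=100000
```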
Thanks, all set.
Thu, Jan 28
Tue, Jan 26
@Jgreen I did get this working, and confirmed it was working by visiting it in the UI, where you can see whether a chart is cached in the overflow menu:
Jan 25 2021
Jan 21 2021
Jan 20 2021
One more useful command: to change the throttle rate, run the following on both the node the data is coming from and the node the data is going to. For example, if data is being copied from kafka-jumbo1003 to kafka-jumbo1009:
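For reference, with stock Kafka tooling the throttle is a per-broker dynamic config set via kafka-configs.sh: the source (leader) side gets `leader.replication.throttled.rate` and the destination (follower) side gets `follower.replication.throttled.rate`. A sketch of the command pair (broker IDs and rate are illustrative, and this only prints the commands rather than running them):

```shell
RATE=50000000  # bytes/sec, illustrative
SRC=1003       # broker id for kafka-jumbo1003, the node data is coming from
DST=1009       # broker id for kafka-jumbo1009, the node data is going to

# The source side throttles outbound (leader) replication traffic;
# the destination side throttles inbound (follower) traffic.
LEADER_CMD="kafka-configs.sh --alter --entity-type brokers --entity-name $SRC --add-config leader.replication.throttled.rate=$RATE"
FOLLOWER_CMD="kafka-configs.sh --alter --entity-type brokers --entity-name $DST --add-config follower.replication.throttled.rate=$RATE"
echo "$LEADER_CMD"
echo "$FOLLOWER_CMD"
```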
Migrated the following topics on kafka-jumbo:
With @Ottomata we came up with a way to roll back a partition migration.
When applying a migration, the tool prints the current partition assignment, which can be saved and used to migrate the partitions back.
However, while a migration is running, trying to start another one gives the error "There is an existing assignment running."
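So the safe sequence is: save the printed assignment before starting, wait for the in-flight reassignment to finish, then feed the saved JSON back to the reassignment tool. The saved file looks roughly like this (topic, partition, and broker IDs here are illustrative):

```json
{
  "version": 1,
  "partitions": [
    {"topic": "webrequest_text", "partition": 0, "replicas": [1001, 1002, 1003]}
  ]
}
```

Passing this file back to kafka-reassign-partitions.sh via `--reassignment-json-file` with `--execute` restores the previous assignment.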
@jcrespo It looks like this is normal - traffic to wikimediafoundation.org has spiked since the 20th birthday last week, so the access logs have grown proportionally.
Jan 14 2021
Closing since there has been no reply; feel free to reopen.
Jan 12 2021
Cluster is up and running!
- Migration plan for partition rebalancing
@kostajh that was quick! Comment if you have any issues.
@JAnstee_WMF this should be all set, check your email :) and comment if you have any issues!
@DNdubane_WMF this should be all set, check your email :) and comment if you have any issues!
Jan 8 2021
Here's the error from attempting to decommission analytics-tool1004:
Jan 4 2021
Spoke with @elukey and we're thinking of leaving turnilo on an-tool1007 for now, rather than co-locating it with superset, so that issues with either service won't affect the other. If we go that route, all that's left for this ticket is to decommission analytics-tool1004. @Ottomata what do you think?
Dec 22 2020
Superset is now running on an-tool1010, so analytics-tool1004 can be decommissioned.
Dec 18 2020
Quick poll: what should the default caching timeout be? I'm thinking 12 hours, since it seems most charts have daily granularity, so viewing a chart one day and then the next day will show the latest data point. The timeout is also configurable on a per-table or per-chart level, but I expect most users won't discover this.
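For concreteness, Superset's cache settings come from the Flask-Caching `CACHE_CONFIG` dict in superset_config.py. A sketch of what the 12-hour default would look like (the memcached address is a placeholder, not our real one):

```python
# superset_config.py (sketch; the memcached address is a placeholder).
# 12 hours means a chart with daily granularity is re-computed at most
# twice between daily data updates.
TWELVE_HOURS = 12 * 60 * 60  # 43200 seconds

CACHE_CONFIG = {
    "CACHE_TYPE": "memcached",
    "CACHE_MEMCACHED_SERVERS": ["127.0.0.1:11211"],  # placeholder host:port
    "CACHE_DEFAULT_TIMEOUT": TWELVE_HOURS,
}
```

Per-table and per-chart timeouts set in the UI override this default, so power users can still opt for fresher data.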
For superset, the following 3 patches should be all we need to move traffic over with a short window of downtime:
Dec 17 2020
Dec 14 2020
This has been deployed to the hadoop workers and master. To test, we can view a long-running job and see that its logs are aggregated at the 1-hour mark.
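Assuming this uses YARN's rolling log aggregation for running applications, the relevant yarn-site.xml property would look like this (a sketch, with the one-hour interval from above):

```xml
<!-- yarn-site.xml fragment (sketch): aggregate logs of still-running
     applications every hour instead of waiting for app completion. -->
<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>3600</value>
</property>
```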
Dec 11 2020
@elukey: I confirmed that memcached was working based on the presence of superset_result keys in memcached.
Dec 9 2020
Current progress: I configured staging superset to use memcached, but pylibmc was installed as an apt package while the process runs in a virtual environment, so pylibmc needs to be installed in that virtual environment instead, via the https://gerrit.wikimedia.org/r/admin/repos/analytics/superset/deploy repository.
Dec 4 2020
My thought for next steps here is to install superset on an-tool1010, using the existing database at an-coord1001, and testing that caching works as expected.
@Ottomata Yeah, I'll add an $is_critical parameter.
Dec 3 2020
Dec 1 2020
Here's the cumin output for the kafka-test1001 decommission:
Nov 30 2020
I originally created these virtual machines in the analytics vlan, but they should be in the default private network instead, so I'm decommissioning the nodes I created and remaking them.
Nov 19 2020
@Ottomata and I are planning to create a new small standalone node to run ZooKeeper, requiring 2 GB RAM, 20 GB of disk, and 2 vCPUs.
I plan to put these machines on the same ganeti host, since this is a test use case and we don't need high availability. Let me know if they should be distributed instead.
@akosiaris does this seem like a reasonable request?