
Dec 14 2022

razzi committed rCCKBc4b1429f7315: Add cookbooks for running maintain-views.
Add cookbooks for running maintain-views
Dec 14 2022, 3:30 PM
razzi committed rCCKB1d02a7452191: druid: Add option to roll restart test druid worker java processes.
druid: Add option to roll restart test druid worker java processes
Dec 14 2022, 3:28 PM
razzi committed rCCKB7fd527ec25e7: sre.hadoop.roll-restart-masters: consistent sleep confirmation.
sre.hadoop.roll-restart-masters: consistent sleep confirmation
Dec 14 2022, 3:28 PM
razzi committed rCCKBa800a32578a6: sre.hadoop.roll-restart-masters: run hdfs as hdfs and yarn as yarn.
sre.hadoop.roll-restart-masters: run hdfs as hdfs and yarn as yarn
Dec 14 2022, 3:28 PM
razzi committed rCCKBbd9021fc1fab: sre.hadoop.roll-restart-masters: use sudo -u hdfs kerberos-run-command.
sre.hadoop.roll-restart-masters: use sudo -u hdfs kerberos-run-command
Dec 14 2022, 3:28 PM
razzi committed rCCKB838744efaa42: sre.druid.roll-restart-workers: properly pass commands list.
sre.druid.roll-restart-workers: properly pass commands list
Dec 14 2022, 3:27 PM
razzi committed rCCKB53ce9a1c269a: sre.kafka.reboot-workers: Properly format arguments in log message.
sre.kafka.reboot-workers: Properly format arguments in log message
Dec 14 2022, 3:27 PM
razzi committed rCCKB49666366bfe4: Pass list of single host to hosts_downtimed.
Pass list of single host to hosts_downtimed
Dec 14 2022, 3:27 PM
razzi committed rCCKBe49443550734: Rename kafka cluster from test-eqiad to test.
Rename kafka cluster from test-eqiad to test
Dec 14 2022, 3:27 PM
razzi committed rCCKB5ffc19512bdc: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster.
sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster
Dec 14 2022, 3:27 PM
razzi committed rCCKB473619d560ce: sre.druid.reboot-workers: pass single host as list.
sre.druid.reboot-workers: pass single host as list
Dec 14 2022, 3:27 PM
razzi committed rCCKBdf3023554f56: Add cookbook for rebooting druid nodes.
Add cookbook for rebooting druid nodes
Dec 14 2022, 3:27 PM

May 17 2022

razzi closed T306213: Site: 1 VM request for turnilo/superset staging on Bullseye as Resolved.

VM created. Work continues at https://phabricator.wikimedia.org/T308597

May 17 2022, 9:52 PM · vm-requests, Infrastructure-Foundations, SRE
razzi created T308597: Split turnilo staging off of an-tool1005.
May 17 2022, 6:08 PM · Patch-For-Review, Data-Engineering-Kanban, Data-Engineering
razzi added a comment to T306213: Site: 1 VM request for turnilo/superset staging on Bullseye.

I'm going to go ahead and put this on row A. Here's a little snippet I used to look at the ganeti resource totals by row (python -m pip install pandas ipython first):

May 17 2022, 5:23 PM · vm-requests, Infrastructure-Foundations, SRE

May 16 2022

razzi added a comment to T308441: Error when updating dashboard.

I downloaded the whole dashboard as json, edited the json to make the name have "TEST COPY", scp'd it to the superset host, and loaded it with:

May 16 2022, 4:13 PM · Product-Analytics, Data-Engineering-Kanban, superset.wikimedia.org, Data-Engineering
razzi reopened T306213: Site: 1 VM request for turnilo/superset staging on Bullseye as "Open".

Upgrading the VM worked for Turnilo, but Superset needs updating before it will work on Bullseye. Generally there's no guarantee that both staging services will be compatible with the same Debian version, so I say we split Turnilo staging onto its own server.

May 16 2022, 2:21 PM · vm-requests, Infrastructure-Foundations, SRE
razzi created T308441: Error when updating dashboard.
May 16 2022, 2:01 PM · Product-Analytics, Data-Engineering-Kanban, superset.wikimedia.org, Data-Engineering

May 12 2022

razzi moved T304972: Upgrade Superset to 1.4.2 from Ready to Deploy to Done on the Data-Engineering-Kanban board.
May 12 2022, 4:05 PM · Product-Analytics, Patch-For-Review, Data-Engineering-Kanban, superset.wikimedia.org, Data-Engineering
razzi closed T306213: Site: 1 VM request for turnilo/superset staging on Bullseye as Resolved.

I forgot there's a way to upgrade a virtual machine's operating system: https://wikitech.wikimedia.org/wiki/Ganeti#Reinstall_/_Reimage_a_VM and I'm following that now for the superset / turnilo staging instance (an-tool1005).

May 12 2022, 3:30 PM · vm-requests, Infrastructure-Foundations, SRE
razzi closed T308174: Restarting pybal caused icinga error as Resolved.

Thanks for the explanation @BBlack, nothing to do here so I'll close this.

May 12 2022, 2:14 PM · SRE, Traffic

May 11 2022

razzi added a comment to T298940: Reimage WMCS db proxies to Bullseye.

I merged the related patch, but when I restarted pybal it caused an alert, so I'm waiting for input from the traffic team before proceeding: https://phabricator.wikimedia.org/T308174

May 11 2022, 7:05 PM · Data-Engineering-Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
razzi created T308174: Restarting pybal caused icinga error.
May 11 2022, 7:03 PM · SRE, Traffic
razzi moved T298940: Reimage WMCS db proxies to Bullseye from In Progress to In Code Review on the Data-Engineering-Kanban board.
May 11 2022, 4:05 PM · Data-Engineering-Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
razzi moved T304972: Upgrade Superset to 1.4.2 from In Code Review to Ready to Deploy on the Data-Engineering-Kanban board.
May 11 2022, 4:05 PM · Product-Analytics, Patch-For-Review, Data-Engineering-Kanban, superset.wikimedia.org, Data-Engineering

May 5 2022

razzi closed T275575: Add superset-next.wikimedia.org domain for superset staging, a subtask of T288115: Upgrade Superset to 1.3.1 or higher, as Resolved.
May 5 2022, 10:34 PM · User-razzi, Analytics-Clusters, Analytics-Kanban, Data-Engineering-Kanban, Data-Engineering, Better Use Of Data, Product-Analytics
razzi closed T275575: Add superset-next.wikimedia.org domain for superset staging as Resolved.
May 5 2022, 10:34 PM · Patch-For-Review, Data-Engineering-Kanban, Product-Analytics, superset.wikimedia.org, Data-Engineering, Analytics-Clusters
razzi added a comment to T304972: Upgrade Superset to 1.4.2.

I tried out Superset 1.5 briefly but found it requires python 3.8, and an-tool1005 is currently running python 3.7. The error:

May 5 2022, 10:33 PM · Product-Analytics, Patch-For-Review, Data-Engineering-Kanban, superset.wikimedia.org, Data-Engineering

May 4 2022

razzi added a comment to T290146: Pressing the Stop button in Quarry results in a 500 error.

Here's the traceback of a 500 error I got:

[2022-05-04 17:25:52,664] ERROR in app: Exception on /api/query/stop [POST]
Traceback (most recent call last):
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "./quarry/web/api.py", line 152, in api_stop_query
    cur.execute("KILL %s;", (result_dictionary["connection_id"]))
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/cursors.py", line 148, in execute
    result = self._query(query)
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/cursors.py", line 310, in _query
    conn.query(q)
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/connections.py", line 548, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/connections.py", line 775, in _read_query_result
    result.read()
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/connections.py", line 1156, in read
    first_packet = self.connection._read_packet()
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/connections.py", line 725, in _read_packet
    packet.raise_for_error()
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/protocol.py", line 221, in raise_for_error
    err.raise_mysql_exception(self._data)
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/err.py", line 143, in raise_mysql_exception
    raise errorclass(errno, errval)
pymysql.err.OperationalError: (1094, 'Unknown thread id: 8295086')
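The traceback suggests the KILL was issued for a connection that no longer existed (MySQL error 1094, "Unknown thread id"). Below is a minimal sketch of how a handler might treat that case as "already stopped" instead of returning a 500. The names `kill_connection` and `error_class` are hypothetical and are not taken from Quarry's code; `error_class` stands in for the driver's OperationalError so the sketch runs without a database driver.

```python
ER_NO_SUCH_THREAD = 1094  # MySQL error code for "Unknown thread id"


def kill_connection(cursor, connection_id, error_class=Exception):
    """Issue KILL for connection_id; return False if the thread is already gone.

    error_class is an assumption: pass the driver's OperationalError
    class here when using a real database connection.
    """
    try:
        # The trailing comma passes the parameter as a one-element tuple.
        cursor.execute("KILL %s;", (connection_id,))
    except error_class as exc:
        if exc.args and exc.args[0] == ER_NO_SUCH_THREAD:
            return False  # the query already finished; nothing left to stop
        raise  # any other database error is still surfaced
    return True
```

With this shape, the caller could report "query already stopped" when the function returns False, while unrelated database errors still propagate.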
May 4 2022, 7:51 PM · Patch-For-Review, Quarry
razzi awarded T307403: Request to add razzi to Quarry Cloud VPS project a Like token.
May 4 2022, 6:27 PM · Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned, User-dcaro, Quarry, cloud-services-team (Kanban)

May 3 2022

razzi moved T275575: Add superset-next.wikimedia.org domain for superset staging from In Code Review to Done on the Data-Engineering-Kanban board.

It's working! Visit https://superset-next.wikimedia.org/

May 3 2022, 7:28 PM · Patch-For-Review, Data-Engineering-Kanban, Product-Analytics, superset.wikimedia.org, Data-Engineering, Analytics-Clusters
razzi closed T306148: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye as Resolved.

Updated netbox status to "Active".

May 3 2022, 5:15 PM · Data-Engineering-Kanban, Data-Engineering, Infrastructure-Foundations, DC-Ops
razzi closed T306148: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye, a subtask of T278938: Netbox IPv6 for some cloud hosts is missing or not set as primary, as Resolved.
May 3 2022, 5:15 PM · cloud-services-team (Kanban)

May 2 2022

razzi created T307403: Request to add razzi to Quarry Cloud VPS project.
May 2 2022, 11:10 PM · Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned, User-dcaro, Quarry, cloud-services-team (Kanban)
razzi moved T275575: Add superset-next.wikimedia.org domain for superset staging from In Progress to In Code Review on the Data-Engineering-Kanban board.
May 2 2022, 10:37 PM · Patch-For-Review, Data-Engineering-Kanban, Product-Analytics, superset.wikimedia.org, Data-Engineering, Analytics-Clusters
razzi moved T275575: Add superset-next.wikimedia.org domain for superset staging from Next Up to In Progress on the Data-Engineering-Kanban board.
May 2 2022, 7:11 PM · Patch-For-Review, Data-Engineering-Kanban, Product-Analytics, superset.wikimedia.org, Data-Engineering, Analytics-Clusters
razzi added a project to T275575: Add superset-next.wikimedia.org domain for superset staging: Data-Engineering-Kanban.
May 2 2022, 7:11 PM · Patch-For-Review, Data-Engineering-Kanban, Product-Analytics, superset.wikimedia.org, Data-Engineering, Analytics-Clusters
razzi added a comment to T307055: Remaining data engineering host security restarts.

I sent out an email saying that all the named hosts, other than an-airflow, will be rebooted this Friday, May 6, in a window from 17:00 to 19:00 UTC (10am-12pm Pacific).

May 2 2022, 4:31 PM · Dumps-Generation, Infrastructure-Foundations (FY2021/2022-Q4), SRE, Security

Apr 28 2022

razzi claimed T298940: Reimage WMCS db proxies to Bullseye.
Apr 28 2022, 4:12 PM · Data-Engineering-Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
razzi moved T298940: Reimage WMCS db proxies to Bullseye from Next Up to In Progress on the Data-Engineering-Kanban board.
Apr 28 2022, 4:10 PM · Data-Engineering-Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
razzi created T307055: Remaining data engineering host security restarts.
Apr 28 2022, 12:21 AM · Dumps-Generation, Infrastructure-Foundations (FY2021/2022-Q4), SRE, Security

Apr 27 2022

razzi updated subscribers of T298940: Reimage WMCS db proxies to Bullseye.

If we merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/779915/ that @BTullis and I came up with, we can do the following to reimage these hosts.

Apr 27 2022, 11:48 PM · Data-Engineering-Planning (Sprint 02), Data-Services, cloud-services-team (Kanban)
razzi closed T304591: View wb_changes_dispatch in commonswiki_p shows an error as Resolved.

This is done. None of the views were changing, but passing --replace-all was hanging because some views were currently in use, so I answered "no" to every "replace" prompt with the following:

Apr 27 2022, 7:06 PM · Data-Engineering-Kanban, Data-Engineering, Data-Services
razzi closed T278938: Netbox IPv6 for some cloud hosts is missing or not set as primary as Resolved.

This seems to be resolved now that clouddb1021 has been reimaged successfully. I'll close this but if I'm missing something please reopen.

Apr 27 2022, 5:11 PM · cloud-services-team (Kanban)
razzi updated the task description for T278938: Netbox IPv6 for some cloud hosts is missing or not set as primary.
Apr 27 2022, 5:10 PM · cloud-services-team (Kanban)
razzi closed T306148: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye, a subtask of T278938: Netbox IPv6 for some cloud hosts is missing or not set as primary, as Resolved.
Apr 27 2022, 5:08 PM · cloud-services-team (Kanban)
razzi closed T306148: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye as Resolved.
Apr 27 2022, 5:07 PM · Data-Engineering-Kanban, Data-Engineering, Infrastructure-Foundations, DC-Ops

Apr 26 2022

razzi moved T304591: View wb_changes_dispatch in commonswiki_p shows an error from Next Up to In Progress on the Data-Engineering-Kanban board.
Apr 26 2022, 5:39 PM · Data-Engineering-Kanban, Data-Engineering, Data-Services
razzi closed T299480: Upgrade clouddb* hosts to Bullseye as Resolved.
Apr 26 2022, 5:21 PM · Data-Engineering-Kanban, Patch-For-Review, Data-Engineering-Radar, Data-Services, cloud-services-team (Kanban)
razzi closed T299480: Upgrade clouddb* hosts to Bullseye, a subtask of T298585: Upgrade WMF database-and-backup-related hosts to bullseye, as Resolved.
Apr 26 2022, 5:21 PM · DBA
razzi updated the task description for T299480: Upgrade clouddb* hosts to Bullseye.
Apr 26 2022, 5:19 PM · Data-Engineering-Kanban, Patch-For-Review, Data-Engineering-Radar, Data-Services, cloud-services-team (Kanban)
razzi added a comment to T306148: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye.

I did the reimage again just now and it worked fine selecting "No" when prompted to load missing firmware. @MoritzMuehlenhoff I misread your comment and didn't realize your change should have been submitted first, sorry!! Let me know if I can still be useful in testing that, but otherwise, this ticket can be closed. Thanks for your input everybody.

Apr 26 2022, 5:18 PM · Data-Engineering-Kanban, Data-Engineering, Infrastructure-Foundations, DC-Ops

Apr 18 2022

razzi renamed T306148: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye from clouddb1021 missing firmware; debian installer cannot connect to network to clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye.
Apr 18 2022, 8:18 PM · Data-Engineering-Kanban, Data-Engineering, Infrastructure-Foundations, DC-Ops
razzi added a comment to T306148: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye.

I reimaged the host back to Buster for now, which went smoothly. Replication lag is a few days behind but is catching up gradually: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=clouddb1021&var-port=13311&viewPanel=6&from=now-5m&to=now

Apr 18 2022, 8:16 PM · Data-Engineering-Kanban, Data-Engineering, Infrastructure-Foundations, DC-Ops

Apr 14 2022

razzi claimed T306213: Site: 1 VM request for turnilo/superset staging on Bullseye.
Apr 14 2022, 7:25 PM · vm-requests, Infrastructure-Foundations, SRE
razzi created T306213: Site: 1 VM request for turnilo/superset staging on Bullseye.
Apr 14 2022, 7:25 PM · vm-requests, Infrastructure-Foundations, SRE
razzi renamed T306148: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye from clouddb1021 missing firmware: debian installer cannot connect to network to clouddb1021 missing firmware; debian installer cannot connect to network.
Apr 14 2022, 6:50 AM · Data-Engineering-Kanban, Data-Engineering, Infrastructure-Foundations, DC-Ops
razzi updated subscribers of T306148: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye.

@Marostegui says I should tag @MoritzMuehlenhoff - hopefully we can all solve this together :)

Apr 14 2022, 6:50 AM · Data-Engineering-Kanban, Data-Engineering, Infrastructure-Foundations, DC-Ops
razzi updated subscribers of T306148: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye.

Possibly relevant links thanks to @jhathaway:

Apr 14 2022, 12:02 AM · Data-Engineering-Kanban, Data-Engineering, Infrastructure-Foundations, DC-Ops

Apr 13 2022

razzi created T306148: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye.
Apr 13 2022, 11:59 PM · Data-Engineering-Kanban, Data-Engineering, Infrastructure-Foundations, DC-Ops
razzi updated the task description for T299480: Upgrade clouddb* hosts to Bullseye.
Apr 13 2022, 11:35 PM · Data-Engineering-Kanban, Patch-For-Review, Data-Engineering-Radar, Data-Services, cloud-services-team (Kanban)
razzi edited P24619 Missing firmware message on clouddb1021.
Apr 13 2022, 10:30 PM
razzi created P24619 Missing firmware message on clouddb1021.
Apr 13 2022, 10:29 PM
razzi moved T299480: Upgrade clouddb* hosts to Bullseye from Next Up to In Progress on the Data-Engineering-Kanban board.
Apr 13 2022, 4:19 PM · Data-Engineering-Kanban, Patch-For-Review, Data-Engineering-Radar, Data-Services, cloud-services-team (Kanban)
razzi added a project to T299480: Upgrade clouddb* hosts to Bullseye: Data-Engineering-Kanban.
Apr 13 2022, 4:19 PM · Data-Engineering-Kanban, Patch-For-Review, Data-Engineering-Radar, Data-Services, cloud-services-team (Kanban)

Apr 12 2022

razzi added a comment to T299480: Upgrade clouddb* hosts to Bullseye.

Ok after some help with wmf-pt-kill in https://phabricator.wikimedia.org/T305974 and a patch to update netboot for other clouddb10xx hosts https://gerrit.wikimedia.org/r/c/operations/puppet/+/779557 the reimage of clouddb1014 went smoothly. I'm repooling all hosts and will continue with clouddb1015-1021 tomorrow.

Apr 12 2022, 11:28 PM · Data-Engineering-Kanban, Patch-For-Review, Data-Engineering-Radar, Data-Services, cloud-services-team (Kanban)
razzi updated the task description for T299480: Upgrade clouddb* hosts to Bullseye.
Apr 12 2022, 11:27 PM · Data-Engineering-Kanban, Patch-For-Review, Data-Engineering-Radar, Data-Services, cloud-services-team (Kanban)
razzi added a comment to T305974: Provide wmf-pt-kill on Debian Bullseye.

Thank you both! Looks good 👍 👍

Apr 12 2022, 10:35 PM · cloud-services-team (Kanban), DBA
razzi updated the task description for T299480: Upgrade clouddb* hosts to Bullseye.
Apr 12 2022, 9:00 PM · Data-Engineering-Kanban, Patch-For-Review, Data-Engineering-Radar, Data-Services, cloud-services-team (Kanban)
razzi added a comment to T299480: Upgrade clouddb* hosts to Bullseye.

I forgot to tell netboot to treat these hosts as database hosts, which I have now done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/779488

Apr 12 2022, 3:40 PM · Data-Engineering-Kanban, Patch-For-Review, Data-Engineering-Radar, Data-Services, cloud-services-team (Kanban)

Apr 11 2022

razzi claimed T301990: Upgrade Turnilo.

According to @hashar we can get node 12.22.5 by upgrading Debian to version 11 Bullseye (staging and production Turnilo run Debian 10). I'll try upgrading Debian on the staging host and see if the latest Turnilo works then.

Apr 11 2022, 3:17 PM · Patch-For-Review, Data-Engineering-Kanban, Data-Engineering

Apr 7 2022

razzi added a comment to T299480: Upgrade clouddb* hosts to Bullseye.

I'll do this next week. To my knowledge these hosts are pretty much the same as the dbstore hosts I did this week for https://phabricator.wikimedia.org/T299481, except that there can be no downtime if I depool the hosts first.

Apr 7 2022, 7:16 PM · Data-Engineering-Kanban, Patch-For-Review, Data-Engineering-Radar, Data-Services, cloud-services-team (Kanban)

Apr 6 2022

razzi added a comment to T301990: Upgrade Turnilo.

Ah ok it appears we're now too far behind on nodejs versions

Apr 6 2022, 9:50 PM · Patch-For-Review, Data-Engineering-Kanban, Data-Engineering
razzi added a comment to T301990: Upgrade Turnilo.

I made a patch for this, but the scap deploy to staging failed due to some error with locales:

Apr 6 2022, 9:46 PM · Patch-For-Review, Data-Engineering-Kanban, Data-Engineering
razzi claimed T305591: Error with refinery-drop-mediawiki-snapshots: table specs not matching partitions for wmf/wikidata/entity and wmf/wikidata/item_page_link.
Apr 6 2022, 8:48 PM · Data-Engineering
razzi moved T305591: Error with refinery-drop-mediawiki-snapshots: table specs not matching partitions for wmf/wikidata/entity and wmf/wikidata/item_page_link from Incoming (new tickets) to Data Products & Metrics on the Data-Engineering board.
Apr 6 2022, 8:48 PM · Data-Engineering
razzi created T305591: Error with refinery-drop-mediawiki-snapshots: table specs not matching partitions for wmf/wikidata/entity and wmf/wikidata/item_page_link.
Apr 6 2022, 8:48 PM · Data-Engineering
razzi created P24175 refinery-drop-mediawiki-snapshots April 6.
Apr 6 2022, 8:47 PM

Apr 5 2022

razzi added a project to T304972: Upgrade Superset to 1.4.2: Product-Analytics.

Hi Product Analytics, superset 1.4.2 is ready to be tested on staging. Once we confirm there are no showstopping bugs we'll release it to production.

Apr 5 2022, 10:27 PM · Product-Analytics, Patch-For-Review, Data-Engineering-Kanban, superset.wikimedia.org, Data-Engineering
razzi added a comment to T304972: Upgrade Superset to 1.4.2.

Ok, looks like the following will resolve it:

Apr 5 2022, 10:20 PM · Product-Analytics, Patch-For-Review, Data-Engineering-Kanban, superset.wikimedia.org, Data-Engineering
razzi added a comment to T304972: Upgrade Superset to 1.4.2.

I thought I'd update the staging database to be the same as production before sharing superset staging widely, and I'm glad I did, because it looks like there's some sort of database issue with the update.

Apr 5 2022, 10:06 PM · Product-Analytics, Patch-For-Review, Data-Engineering-Kanban, superset.wikimedia.org, Data-Engineering
razzi moved T299481: Upgrade dbstore100* hosts to Bullseye from Ready to Deploy to Done on the Data-Engineering-Kanban board.
Apr 5 2022, 5:42 PM · Data-Engineering-Kanban, Data-Persistence (work done), Data-Engineering, Analytics
razzi added a comment to T299481: Upgrade dbstore100* hosts to Bullseye.

All the reimages are done. Thanks for your input @Marostegui and @Ladsgroup.

Apr 5 2022, 5:42 PM · Data-Engineering-Kanban, Data-Persistence (work done), Data-Engineering, Analytics
razzi updated the task description for T299481: Upgrade dbstore100* hosts to Bullseye.
Apr 5 2022, 5:06 PM · Data-Engineering-Kanban, Data-Persistence (work done), Data-Engineering, Analytics
razzi updated the task description for T299481: Upgrade dbstore100* hosts to Bullseye.
Apr 5 2022, 4:33 PM · Data-Engineering-Kanban, Data-Persistence (work done), Data-Engineering, Analytics
razzi added a comment to T299481: Upgrade dbstore100* hosts to Bullseye.

Looks like reimage went fine; the warning about icinga status is that the replication has not caught up, but I see the replication Seconds_Behind_Master decreasing over time.

Apr 5 2022, 3:42 PM · Data-Engineering-Kanban, Data-Persistence (work done), Data-Engineering, Analytics

Mar 31 2022

razzi updated subscribers of T275575: Add superset-next.wikimedia.org domain for superset staging.

I'm thinking about requiring superset-next to have 2fa for 2 reasons:

  • since it's the staging environment for superset, it's more likely to have misconfigurations that would lead to security issues
  • to test out what it would look like to eventually require 2fa for superset
Mar 31 2022, 8:46 PM · Patch-For-Review, Data-Engineering-Kanban, Product-Analytics, superset.wikimedia.org, Data-Engineering, Analytics-Clusters
razzi closed T304065: Check home/HDFS leftovers of clarakosi as Resolved.

I removed stat100* directories. All done!

Mar 31 2022, 6:54 PM · Data-Engineering-Kanban, Data-Engineering
razzi closed T300607: Check home/HDFS leftovers of bumeh-ctr as Resolved.

I removed the data in stat1006. Thanks everyone.

Mar 31 2022, 6:38 PM · Data-Engineering-Kanban, Data-Engineering, Analytics
razzi closed T302194: Check home/HDFS leftovers of rhuang-ctr as Resolved.

Ok, I have removed the data on each stat host and also ran hdfs dfs -rmdir on the empty hive database.

Mar 31 2022, 6:33 PM · Data-Engineering-Kanban, Data-Engineering
razzi added a comment to T304478: Move wikireplicas dbproxy haproxy config to etcd.

By declaring the host -> ip mappings using https://github.com/kelseyhightower/confd/blob/master/docs/templates.md#map earlier in the etcd template, we should be able to keep the data stored in etcd in line with the current node statuses; as simple as "pooled/not pooled". I'd prefer this, since repeating the ip address in etcd is a potential point of confusion.
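A rough Python sketch of the idea in the comment above: the host -> ip mapping lives with the template, and the data store holds only a pooled / not pooled flag per host. This is not confd template syntax; the hostnames, the addresses (from the documentation range 192.0.2.0/24), the port, and the `render_backend` helper are all hypothetical.

```python
# Hypothetical host -> IP map, kept with the template instead of in etcd.
HOST_IPS = {
    "dbhost-a.example": "192.0.2.10",
    "dbhost-b.example": "192.0.2.11",
}


def render_backend(pooled_state):
    """Return haproxy-style server lines for the hosts marked "pooled".

    pooled_state mimics what would be read from etcd: a dict of
    hostname -> "pooled" or "not pooled", with no IP addresses in it.
    """
    lines = []
    for host in sorted(HOST_IPS):
        if pooled_state.get(host) == "pooled":
            # Port 3306 is an assumption made for this illustration.
            lines.append(f"server {host} {HOST_IPS[host]}:3306 check")
    return lines
```

Because the IP address appears in only one place, changing a host's state in the store never risks introducing a mismatched address.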

Mar 31 2022, 6:28 AM · Patch-For-Review, Data-Engineering, Data-Services
razzi awarded T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs a Manufacturing Defect? token.
Mar 31 2022, 5:24 AM · SRE, Infrastructure-Foundations, netops

Mar 30 2022

razzi moved T299481: Upgrade dbstore100* hosts to Bullseye from Next Up to Ready to Deploy on the Data-Engineering-Kanban board.
Mar 30 2022, 4:16 PM · Data-Engineering-Kanban, Data-Persistence (work done), Data-Engineering, Analytics
razzi added a project to T299481: Upgrade dbstore100* hosts to Bullseye: Data-Engineering-Kanban.
Mar 30 2022, 4:15 PM · Data-Engineering-Kanban, Data-Persistence (work done), Data-Engineering, Analytics
razzi moved T301562: Set up karapace instance for datahub from In Code Review to Ready to Deploy on the Data-Engineering-Kanban board.
Mar 30 2022, 4:14 PM · Data-Engineering-Kanban, Data-Engineering, Data-Catalog
razzi moved T301565: Create debian package of karapace from Ready to Deploy to Done on the Data-Engineering-Kanban board.
Mar 30 2022, 4:14 PM · Data-Engineering-Kanban, Patch-For-Review, Data-Engineering, Data-Catalog
razzi moved T304972: Upgrade Superset to 1.4.2 from Next Up to In Code Review on the Data-Engineering-Kanban board.
Mar 30 2022, 4:13 PM · Product-Analytics, Patch-For-Review, Data-Engineering-Kanban, superset.wikimedia.org, Data-Engineering

Mar 29 2022

razzi claimed T299481: Upgrade dbstore100* hosts to Bullseye.
Mar 29 2022, 9:49 PM · Data-Engineering-Kanban, Data-Persistence (work done), Data-Engineering, Analytics
razzi added a comment to T299481: Upgrade dbstore100* hosts to Bullseye.

Ok thanks for chiming in @Marostegui and @Ladsgroup. Here is my updated plan, and I'm planning to kick this off a week from today on April 5 at 15:00 UTC.

Mar 29 2022, 9:37 PM · Data-Engineering-Kanban, Data-Persistence (work done), Data-Engineering, Analytics
razzi added a comment to T304972: Upgrade Superset to 1.4.2.

Superset 1.4.2 is running on superset-staging: an-tool1005. When making the change, I had to pin markupsafe to 2.0.1 since the default markupsafe it downloaded is not compatible with the version of flask that superset is using.

Mar 29 2022, 6:09 PM · Product-Analytics, Patch-For-Review, Data-Engineering-Kanban, superset.wikimedia.org, Data-Engineering