
Fri, Sep 20

Eevans committed rMSKSf9e6066f4257: logging: Logger optimizations (authored by Eevans).
logging: Logger optimizations
Fri, Sep 20, 7:49 AM

Thu, Sep 19

Eevans added a comment to T222099: Staging release of RESTBagOStuff using Kask.

As far as we can see, it's only seemingly broken on testwiki (something put into session data doesn't come out the same). And may have been broken for a while, as I imagine enabling 2FA on testwiki isn't how most people would do it
The class in question going in is https://github.com/wikimedia/mediawiki-extensions-OATHAuth/blob/master/src/Key/TOTPKey.php and apparently comes out as an empty array...

Thu, Sep 19, 2:50 PM · Core Platform Team Workboards (Green), CPT Initiatives (Session Management Service (CDP2)), User-Clarakosi, User-Eevans
Eevans awarded T224553: Migrate remaining Restbase servers to Stretch a Stroopwafel token.
Thu, Sep 19, 2:35 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), RESTBase-Cassandra, Cassandra, RESTBase, Operations
Eevans added a comment to T209110: Logging for the session storage service.

TTBMK, this is the only remaining blocker to fully moving sessionstore to production.

Thu, Sep 19, 12:12 AM · CPT Initiatives (Session Management Service (CDP2)), Patch-For-Review, User-Clarakosi, User-Eevans

Wed, Sep 18

Eevans added a comment to T224553: Migrate remaining Restbase servers to Stretch.

restbase2012 is decommissioned and can be reimaged at any time.

Wed, Sep 18, 9:25 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), RESTBase-Cassandra, Cassandra, RESTBase, Operations
Eevans added a comment to T209110: Logging for the session storage service.

After startup, Kask doesn't log much outside of exceptional errors. To invoke log output, try issuing a HEAD request:

Wed, Sep 18, 9:01 PM · CPT Initiatives (Session Management Service (CDP2)), Patch-For-Review, User-Clarakosi, User-Eevans
Eevans moved T229697: Investigate Kask request latency from Doing to Done on the Core Platform Team Workboards (Clinic Duty Team) board.
Wed, Sep 18, 8:43 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Session Management Service (CDP2)), Performance-Team (Radar)
Eevans added a comment to T229697: Investigate Kask request latency.

And here is what latency looks like after r/537539 has been deployed (see also: direct link to Grafana dashboard).

Wed, Sep 18, 8:43 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Session Management Service (CDP2)), Performance-Team (Radar)
Eevans committed rDEPLOYCHARTSa7e5d55fe511: sessionstore: Upgrade image to 2019-09-18-090156-production (authored by Eevans).
sessionstore: Upgrade image to 2019-09-18-090156-production
Wed, Sep 18, 8:12 PM
Eevans created T233259: Docker registry not updating when a tag is pushed to Gerrit.
Wed, Sep 18, 8:08 PM · Release-Engineering-Team
Eevans committed rDEPLOYCHARTSe38a8eda385d: sessionstore: configure cassandra for `local_dc` (authored by Eevans).
sessionstore: configure cassandra for `local_dc`
Wed, Sep 18, 4:35 PM
Eevans committed rMSKS249587a6fefe: (Actually )be data-center aware. (authored by Eevans).
(Actually )be data-center aware.
Wed, Sep 18, 10:01 AM

Tue, Sep 17

Eevans added a comment to T224553: Migrate remaining Restbase servers to Stretch.

restbase2011 is fully decommissioned and ready to be reimaged.

Tue, Sep 17, 9:14 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), RESTBase-Cassandra, Cassandra, RESTBase, Operations
Eevans added a comment to T229697: Investigate Kask request latency.

In T229697#5471288, @akosiaris wrote:
[ ... ]

The p99 looks suspiciously close to the cross DC latency, is there any way it is related?

Tue, Sep 17, 8:55 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Session Management Service (CDP2)), Performance-Team (Radar)

Mon, Sep 16

Eevans added a comment to T224553: Migrate remaining Restbase servers to Stretch.

restbase2010 is ready to be reimaged.

Mon, Sep 16, 11:07 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), RESTBase-Cassandra, Cassandra, RESTBase, Operations

Fri, Sep 13

Eevans added a comment to T224553: Migrate remaining Restbase servers to Stretch.

restbase2009 is fully decommissioned and ready to be reimaged.

Fri, Sep 13, 10:21 AM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), RESTBase-Cassandra, Cassandra, RESTBase, Operations

Thu, Sep 12

Eevans updated the task description for T224553: Migrate remaining Restbase servers to Stretch.
Thu, Sep 12, 8:51 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), RESTBase-Cassandra, Cassandra, RESTBase, Operations
Eevans added a comment to T224553: Migrate remaining Restbase servers to Stretch.

restbase1018 is decommissioned and ready to be reimaged.

Thu, Sep 12, 1:13 AM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), RESTBase-Cassandra, Cassandra, RESTBase, Operations

Tue, Sep 10

Eevans moved T229697: Investigate Kask request latency from Done to Doing on the Core Platform Team Workboards (Clinic Duty Team) board.

eevans@deploy1001:~$ KASK_URL=https://sessionstore.svc.eqiad.wmnet:8081 ./wrk.sh
+ WRK_ARGS=--latency -d10m
+ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so
+ KASK_URL=https://sessionstore.svc.eqiad.wmnet:8081
+ dirname ./wrk.sh
+ cd .
+ [ ! -f multi-request-json.lua ]
+ [ ! -f requests.json ]
+ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so wrk --latency -d10m -s multi-request-json.lua https://sessionstore.svc.eqiad.wmnet:8081
multiplerequests: Found 101000 requests
multiplerequests: Found 101000 requests
multiplerequests: Found 101000 requests
Running 10m test @ https://sessionstore.svc.eqiad.wmnet:8081
  2 threads and 10 connections
  Thread Stats Avg Stdev Max +/- Stdev
  Latency 19.96ms 18.05ms 270.12ms 64.20%
  Req/Sec 251.88 37.75 360.00 59.98%
  Latency Distribution
  50% 22.14ms
  75% 37.76ms
  90% 37.95ms
  99% 38.88ms
  301034 requests in 10.00m, 50.53MB read
  Non-2xx or 3xx responses: 66355
Requests/sec: 501.67
Transfer/sec: 86.23KB
------------------------------
10%,1703
20%,1815
30%,1920
40%,2057
50%,22144
60%,37591
70%,37706
80%,37811
90%,37946
99%,38882
duration (micros),600062009
requests,301034
bytes,52984874
connect errors,0
read errors,0
write errors,0
status errors,66355
timeout errors,0
------------------------------
eevans@deploy1001:~$ KASK_URL=https://sessionstore.svc.codfw.wmnet:8081 ./wrk.sh
+ WRK_ARGS=--latency -d10m
+ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so
+ KASK_URL=https://sessionstore.svc.codfw.wmnet:8081
+ dirname ./wrk.sh
+ cd .
+ [ ! -f multi-request-json.lua ]
+ [ ! -f requests.json ]
+ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so wrk --latency -d10m -s multi-request-json.lua https://sessionstore.svc.codfw.wmnet:8081
multiplerequests: Found 101000 requests
multiplerequests: Found 101000 requests
multiplerequests: Found 101000 requests
Running 10m test @ https://sessionstore.svc.codfw.wmnet:8081
  2 threads and 10 connections
  Thread Stats Avg Stdev Max +/- Stdev
  Latency 56.33ms 18.03ms 301.04ms 55.91%
  Req/Sec 88.94 12.77 131.00 79.11%
  Latency Distribution
  50% 73.43ms
  75% 74.26ms
  90% 74.50ms
  99% 74.99ms
  106474 requests in 10.00m, 13.35MB read
Requests/sec: 177.44
Transfer/sec: 22.78KB
------------------------------
10%,38066
20%,38210
30%,38338
40%,38497
50%,73429
60%,74046
70%,74193
80%,74327
90%,74495
99%,74989
duration (micros),600072527
requests,106474
bytes,14000072
connect errors,0
read errors,0
write errors,0
status errors,0
timeout errors,0
------------------------------
eevans@deploy1001:~$

The p99 looks suspiciously close to the cross DC latency, is there any way it is related?
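
The question above can be sanity-checked with some arithmetic on the percentile summaries that wrk.sh printed. What follows is only a sketch: the numbers are the microsecond values from the two runs, while the helper name `plateau_gap_ms` is made up for illustration and is not part of wrk.sh.

```python
# Percentile latencies in microseconds, copied from the two wrk.sh summaries.
eqiad = {10: 1703, 20: 1815, 30: 1920, 40: 2057, 50: 22144,
         60: 37591, 70: 37706, 80: 37811, 90: 37946, 99: 38882}
codfw = {10: 38066, 20: 38210, 30: 38338, 40: 38497, 50: 73429,
         60: 74046, 70: 74193, 80: 74327, 90: 74495, 99: 74989}


def plateau_gap_ms(percentiles):
    """Gap between the fast (p10) and slow (p99) plateaus, in milliseconds.

    Hypothetical helper, named here for illustration only.
    """
    return (percentiles[99] - percentiles[10]) / 1000.0


for name, data in (("eqiad", eqiad), ("codfw", codfw)):
    print(f"{name}: p10={data[10] / 1000:.2f}ms "
          f"p99={data[99] / 1000:.2f}ms "
          f"gap={plateau_gap_ms(data):.2f}ms")

# Shift of the fast plateau between the two data centers.
print(f"codfw p10 - eqiad p10 = {(codfw[10] - eqiad[10]) / 1000:.2f}ms")
```

Both runs are bimodal, and the gap between the fast and slow plateaus is about 37 ms in each (37.18 ms for eqiad, 36.92 ms for codfw). That is what one would expect if the slow requests each pay one additional fixed cost. Whether that cost is a cross-DC round trip is the hypothesis being asked about, not something this arithmetic establishes.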

Tue, Sep 10, 10:27 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Session Management Service (CDP2)), Performance-Team (Radar)
Eevans added a comment to T224554: Migrate Restbase-dev cluster to Stretch.

restbase-dev1005 has been decommissioned and is ready to be reimaged.

Tue, Sep 10, 6:21 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), Cassandra, RESTBase, Operations
Eevans renamed T231027: Cassandra instances outages (was: Outage of restbase2017-b) from Outage of restbase2017-b to Cassandra instances outages (was: Outage of restbase2017-b).
Tue, Sep 10, 6:03 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team)
Eevans added a comment to T231027: Cassandra instances outages (was: Outage of restbase2017-b).

Part of what has made this so odd is that it had only ever occurred on this one instance; so, good/bad news: this has now been observed on 2009-b as well.

Tue, Sep 10, 6:03 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team)
Eevans added a comment to T224554: Migrate Restbase-dev cluster to Stretch.

I've started the decommission of -dev1005-b quite late in my evening; it should be complete by EU morning. If there is no output from running ssh restbase-dev1004.eqiad.wmnet -- c-any-nt status -r | grep 1005, then the node can be taken down for reimage.
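
The check described above (no grep output means the node has left the ring) could be scripted along these lines. This is a sketch under assumptions: `node_still_listed` is a hypothetical helper, and it only mimics the `grep 1005` step on status text that has already been captured. It does not run ssh or c-any-nt itself, and the sample strings in the usage lines are made up rather than real status output.

```python
def node_still_listed(status_output: str, needle: str) -> bool:
    """Mimic `grep needle`: True if any line of the output contains needle.

    Hypothetical helper; status_output is whatever text the status
    command printed, captured by the caller.
    """
    return any(needle in line for line in status_output.splitlines())


# Made-up placeholder text, NOT real status output.
captured = "line mentioning node-a\nline mentioning node-b\n"
if node_still_listed(captured, "1005"):
    print("node still listed; decommission not finished")
else:
    print("no match; node can be taken down for reimage")
```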

Tue, Sep 10, 5:36 AM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), Cassandra, RESTBase, Operations
Eevans updated the task description for T224554: Migrate Restbase-dev cluster to Stretch.
Tue, Sep 10, 12:41 AM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), Cassandra, RESTBase, Operations

Mon, Sep 9

Eevans added a comment to T231027: Cassandra instances outages (was: Outage of restbase2017-b).

If the systemd syslog output is to be believed, the service exited with status code 3 (...status=3/NOTIMPLEMENTED). AFAICT, Java would exit with a status code > 127 if the JVM shut down as the result of a signal; otherwise this would have to be the result of a java.lang.System.exit(3) call. However, I can find no explanation of this in Cassandra's code.

I think we can trust systemd on that. I checked and it correctly identifies java as the mainpid, which was the only doubt I had.
Very dumb question: which version of the cassandra code did you check? Did you include debian patches we apply, if any?
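
The reasoning above about exit statuses can be sketched as a small classifier. It assumes the common POSIX shell convention that a process killed by signal N is reported as status 128 + N (143 for SIGTERM, for example). `describe_exit_status` is a hypothetical helper for illustration and is not taken from systemd or Cassandra.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Classify an exit status using the 128 + N signal convention.

    Hypothetical helper: values above 128 are read as "killed by
    signal (status - 128)"; anything else is treated as a plain exit code.
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"


print(describe_exit_status(3))
print(describe_exit_status(143))
```

Under this convention a status of 3 is not a signal death, which is why the comment concludes that something must have exited with code 3 explicitly.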

Mon, Sep 9, 11:42 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team)
Eevans added a comment to T224554: Migrate Restbase-dev cluster to Stretch.

@Eevans I recreated the certs for restbase-dev1004 through restbase-dev1006 and committed in the private repo. Please try again now.

Mon, Sep 9, 10:31 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), Cassandra, RESTBase, Operations

Fri, Sep 6

Eevans added a comment to T229697: Investigate Kask request latency.

https://gerrit.wikimedia.org/r/534613 merged and deployed fine.

I wasn't either. I ended up doing a helmfile destroy && helmfile apply to fix it, but we will need to have a better look at why this happened, since even manually fixing the underlying cause did not make the release move from FAILED to DEPLOYED

helmfile sync fixed it though after all.

Fri, Sep 6, 1:33 AM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Session Management Service (CDP2)), Performance-Team (Radar)
Eevans edited P9046 wrk.sh.
Fri, Sep 6, 1:31 AM

Thu, Sep 5

Eevans updated the title for P9045 test.js from kask_test to test.js.
Thu, Sep 5, 10:22 PM
Eevans edited P9045 test.js.
Thu, Sep 5, 10:21 PM
Eevans added a comment to T224554: Migrate Restbase-dev cluster to Stretch.

restbase-dev1004 has been reinstalled as Stretch. @Eevans, you can bootstrap 1004 in Cassandra and decom 1005 in Cassandra, then I'll proceed with reimaging 1005.

Thu, Sep 5, 2:45 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), Cassandra, RESTBase, Operations
Eevans added a comment to T224554: Migrate Restbase-dev cluster to Stretch.

restbase-dev1004 has been reinstalled as Stretch. @Eevans, you can bootstrap 1004 in Cassandra and decom 1005 in Cassandra, then I'll proceed with reimaging 1005.

Thu, Sep 5, 2:19 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), Cassandra, RESTBase, Operations
Eevans added a comment to T224554: Migrate Restbase-dev cluster to Stretch.

restbase-dev1004 has been decommissioned and can come down for a re-image at any time.

Thu, Sep 5, 2:10 AM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), Cassandra, RESTBase, Operations

Wed, Sep 4

Eevans added a comment to T229697: Investigate Kask request latency.

[ ... ]

@akosiaris are we OK to merge this? Happy to do so and deploy, just making sure it's there.

Yes, feel free. I can deploy it as well/be around when you do.

Wed, Sep 4, 9:30 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Session Management Service (CDP2)), Performance-Team (Radar)
Eevans moved T224554: Migrate Restbase-dev cluster to Stretch from Backlog to In-Progress on the User-Eevans board.
Wed, Sep 4, 8:13 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), Cassandra, RESTBase, Operations
Eevans moved T224554: Migrate Restbase-dev cluster to Stretch from Blocked Externally to Doing on the Core Platform Team Workboards (Clinic Duty Team) board.
Wed, Sep 4, 8:13 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), Cassandra, RESTBase, Operations
Eevans moved T224553: Migrate remaining Restbase servers to Stretch from Backlog to In-Progress on the User-Eevans board.
Wed, Sep 4, 8:12 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), RESTBase-Cassandra, Cassandra, RESTBase, Operations
Eevans added a project to T224553: Migrate remaining Restbase servers to Stretch: User-Eevans.
Wed, Sep 4, 8:12 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), RESTBase-Cassandra, Cassandra, RESTBase, Operations
Eevans edited P9041 get_events.
Wed, Sep 4, 8:06 PM
Eevans committed rDEPLOYCHARTS7673dc22599b: staging/sessionstore: restbase-dev1006 is back online (authored by Eevans).
staging/sessionstore: restbase-dev1006 is back online
Wed, Sep 4, 8:00 PM
Eevans committed rDEPLOYCHARTS1096091e75b8: staging/sessionstore: bump memory back to 100Mi in response to errors (authored by Eevans).
staging/sessionstore: bump memory back to 100Mi in response to errors
Wed, Sep 4, 7:35 PM
Eevans committed rDEPLOYCHARTS43594bf5c4f7: sessionstore: Bump limits and requests (authored by akosiaris).
sessionstore: Bump limits and requests
Wed, Sep 4, 7:17 PM

Tue, Sep 3

Eevans added a comment to T229697: Investigate Kask request latency.

Change 533922 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] sessionstore: Bump limits and requests
https://gerrit.wikimedia.org/r/533922

Tue, Sep 3, 6:22 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Session Management Service (CDP2)), Performance-Team (Radar)
Eevans added a comment to T229697: Investigate Kask request latency.

As an FYI, the 3 different stanzas of data in https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&var-dc=eqiad%20prometheus%2Fk8s-staging&var-service=sessionstore&from=1567432126930&to=1567435225421 are 3 distinct benchmarkings, limits are respectively 150m, 1500m, 2500m. Some observations off the top of my head and in no particular order:

Tue, Sep 3, 6:14 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Session Management Service (CDP2)), Performance-Team (Radar)
Eevans moved T229697: Investigate Kask request latency from Backlog to In-Progress on the User-Eevans board.
Tue, Sep 3, 5:37 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Session Management Service (CDP2)), Performance-Team (Radar)
Eevans moved T229697: Investigate Kask request latency from Ready to Doing on the Core Platform Team Workboards (Clinic Duty Team) board.
Tue, Sep 3, 5:37 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Session Management Service (CDP2)), Performance-Team (Radar)
Eevans added a project to T229697: Investigate Kask request latency: User-Eevans.
Tue, Sep 3, 5:37 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Session Management Service (CDP2)), Performance-Team (Radar)
Eevans awarded T230515: Grafana dashboards for sessionstore, k8s staging, are not working a Cookie token.
Tue, Sep 3, 3:01 PM · Operations, Core Platform Team Workboards (Green), Performance-Team (Radar)

Fri, Aug 30

Eevans added a comment to T231027: Cassandra instances outages (was: Outage of restbase2017-b).

This instance has gone down again, we should dig deeper.
[ ... ]

Fri, Aug 30, 1:43 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team)

Thu, Aug 29

Eevans added a comment to T231027: Cassandra instances outages (was: Outage of restbase2017-b).

The timeline of events goes like this:

Thu, Aug 29, 10:11 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team)
Eevans moved T231027: Cassandra instances outages (was: Outage of restbase2017-b) from Backlog to In-Progress on the User-Eevans board.
Thu, Aug 29, 2:36 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team)
Eevans added a project to T231027: Cassandra instances outages (was: Outage of restbase2017-b): User-Eevans.
Thu, Aug 29, 2:35 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team)
Eevans moved T231027: Cassandra instances outages (was: Outage of restbase2017-b) from Done to Doing on the Core Platform Team Workboards (Clinic Duty Team) board.

This instance has gone down again, we should dig deeper.

Thu, Aug 29, 2:35 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team)

Wed, Aug 28

Eevans moved T196377: Imbalanced storage distribution over JBOD devices from Inbox to Done on the Core Platform Team Workboards (Clinic Duty Team) board.

This imbalance seems to have resolved itself.

Wed, Aug 28, 9:08 PM · Core Platform Team Workboards (Clinic Duty Team), Services (next), Cassandra
Eevans edited projects for T196377: Imbalanced storage distribution over JBOD devices, added: Core Platform Team Workboards (Clinic Duty Team); removed Core Platform Team Legacy (Later), User-Eevans.
Wed, Aug 28, 8:57 PM · Core Platform Team Workboards (Clinic Duty Team), Services (next), Cassandra
Eevans edited P8999 Masterwork From Distant Lands.
Wed, Aug 28, 8:56 PM
Eevans moved T132632: puppetize turning off reserved space for cassandra /srv from Doing to Done on the Core Platform Team Workboards (Clinic Duty Team) board.

I do not know how it came to pass that machines are getting set up without reserved space, but given how long this issue has been open (and since I'm still unsure how to best go about Puppetizing this), I think we should accept this as a gift and close the issue.

Wed, Aug 28, 7:46 PM · Core Platform Team Workboards (Clinic Duty Team), Operations, Cassandra
Eevans added a comment to T132632: puppetize turning off reserved space for cassandra /srv.

Interestingly, the main data volumes in the production cluster already have zero reserved blocks.

Wed, Aug 28, 7:35 PM · Core Platform Team Workboards (Clinic Duty Team), Operations, Cassandra
Eevans updated the language for P8997 Masterwork From Distant Lands from autodetect to shell.
Wed, Aug 28, 7:26 PM
Eevans edited P8997 Masterwork From Distant Lands.
Wed, Aug 28, 7:26 PM
Eevans edited P8996 Masterwork From Distant Lands.
Wed, Aug 28, 7:23 PM
Eevans moved T132632: puppetize turning off reserved space for cassandra /srv from Backlog to In-Progress on the User-Eevans board.
Wed, Aug 28, 6:54 PM · Core Platform Team Workboards (Clinic Duty Team), Operations, Cassandra
Eevans removed a project from T209098: Document Kask: User-Eevans.
Wed, Aug 28, 6:52 PM · Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Session Management Service (CDP2)), Documentation, User-Clarakosi
Eevans removed a project from T226551: Package table_properties utility for Debian: User-Eevans.
Wed, Aug 28, 6:49 PM · CPT Initiatives (Session Management Service (CDP2)), serviceops, User-Clarakosi
Eevans removed a project from T226554: Document table_properties work flow: User-Eevans.
Wed, Aug 28, 6:49 PM · Core Platform Team Workboards (Green), CPT Initiatives (Session Management Service (CDP2)), User-WDoran, User-Clarakosi
Eevans removed a project from T226555: Bootstrap initial Cassandra table properties configuration in Puppet: User-Eevans.
Wed, Aug 28, 6:49 PM · Patch-For-Review, CPT Initiatives (Session Management Service (CDP2)), User-WDoran, serviceops-radar, User-Clarakosi
Eevans removed a project from T226556: Relocate MVP table_properties util repo from Github to Gerrit: User-Eevans.
Wed, Aug 28, 6:49 PM · Core Platform Team Workboards (Green), CPT Initiatives (Session Management Service (CDP2)), User-WDoran, User-Clarakosi
Eevans removed a project from T226557: Integrate table_properties utility's tests into CI : User-Eevans.
Wed, Aug 28, 6:49 PM · Core Platform Team Workboards (Green), CPT Initiatives (Session Management Service (CDP2)), User-WDoran, User-Clarakosi
Eevans removed a project from T198787: Revisit default settings for c-foreach-restart: User-Eevans.
Wed, Aug 28, 6:47 PM · Core Platform Team Workboards (Clinic Duty Team), Cassandra, Operations
Eevans removed a project from T224995: Document that session TTL mismatch between Kask and MediaWiki (or other applications) will be silently ignored: User-Eevans.
Wed, Aug 28, 6:45 PM · Core Platform Team Workboards (Green), CPT Initiatives (Multi-DC (TEC1))
Eevans moved T224995: Document that session TTL mismatch between Kask and MediaWiki (or other applications) will be silently ignored from Ready to Deploy to Done on the Core Platform Team Workboards (Green) board.

In the short-term, production configuration lives in deploy1001:/srv/scap-helm/sessionstore/sessionstore-{codfw,eqiad,staging}-values.yaml. I've updated each of these files with the following comment.

# WARNING: The value of $wgObjectCacheSessionExpiry in MediaWiki must
# correspond to the TTL defined here; If you alter default_ttl, update
# MediaWiki accordingly or problems with session renewal/expiry may occur.
default_ttl: 86400

Longer-term, these files will be version-controlled as part of operations/deployment-charts repository (and will be initialized from the above files).

FYI: just leaving this ticket open to follow up later and ensure that these comments make it into the Git repository.
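
Since a mismatch between the two settings is silently ignored, a small consistency check could complement the warning comment. The following is only a sketch: `read_default_ttl` and `ttl_matches` are hypothetical helpers, the values file is read with a simple regular expression rather than a YAML parser, and the MediaWiki value is passed in by the caller instead of being read from MediaWiki's configuration.

```python
import re


def read_default_ttl(values_yaml: str) -> int:
    """Extract the integer default_ttl from the text of a values file.

    Hypothetical helper. Comment lines that merely mention default_ttl
    are skipped because the pattern must match from the start of a line.
    """
    match = re.search(r"^\s*default_ttl:\s*(\d+)\s*$", values_yaml, re.MULTILINE)
    if match is None:
        raise ValueError("default_ttl not found")
    return int(match.group(1))


def ttl_matches(values_yaml: str, mediawiki_session_expiry: int) -> bool:
    """True if Kask's default_ttl equals the value given for MediaWiki."""
    return read_default_ttl(values_yaml) == mediawiki_session_expiry


# Text mirroring the snippet quoted in the comment above.
sample = (
    "# WARNING: The value of $wgObjectCacheSessionExpiry in MediaWiki must\n"
    "# correspond to the TTL defined here; If you alter default_ttl, update\n"
    "# MediaWiki accordingly or problems with session renewal/expiry may occur.\n"
    "default_ttl: 86400\n"
)
print(ttl_matches(sample, 86400))
```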

Wed, Aug 28, 6:44 PM · Core Platform Team Workboards (Green), CPT Initiatives (Multi-DC (TEC1))
Eevans moved T132632: puppetize turning off reserved space for cassandra /srv from Ready to Doing on the Core Platform Team Workboards (Clinic Duty Team) board.
Wed, Aug 28, 6:39 PM · Core Platform Team Workboards (Clinic Duty Team), Operations, Cassandra
Eevans moved T229421: restbase-dev1006: ACPI errors from Doing to Done on the Core Platform Team Workboards (Clinic Duty Team) board.

Host has been rebooted and there are no ACPI errors present.

Wed, Aug 28, 6:37 PM · Core Platform Team Workboards (Clinic Duty Team)
Eevans moved T132632: puppetize turning off reserved space for cassandra /srv from Inbox to Ready on the Core Platform Team Workboards (Clinic Duty Team) board.
Wed, Aug 28, 6:25 PM · Core Platform Team Workboards (Clinic Duty Team), Operations, Cassandra
Eevans claimed T132632: puppetize turning off reserved space for cassandra /srv.
Wed, Aug 28, 6:25 PM · Core Platform Team Workboards (Clinic Duty Team), Operations, Cassandra
Eevans claimed T229421: restbase-dev1006: ACPI errors.
Wed, Aug 28, 6:20 PM · Core Platform Team Workboards (Clinic Duty Team)
Eevans committed rODCTW2e4c333657e7: Merge master into debian (authored by Eevans).
Merge master into debian
Wed, Aug 28, 1:38 AM
Eevans committed rODCTW4aafc4ba1726: Updated to version 1.0.3 (authored by Eevans).
Updated to version 1.0.3
Wed, Aug 28, 1:38 AM
Eevans committed rODCTW1168813f07ae: Optionally pass cqlshrc file from environment (authored by Eevans).
Optionally pass cqlshrc file from environment
Wed, Aug 28, 1:38 AM

Tue, Aug 27

Eevans moved T198787: Revisit default settings for c-foreach-restart from Doing to Blocked Externally on the Core Platform Team Workboards (Clinic Duty Team) board.
Tue, Aug 27, 7:55 PM · Core Platform Team Workboards (Clinic Duty Team), Cassandra, Operations
Eevans updated the task description for T198787: Revisit default settings for c-foreach-restart.
Tue, Aug 27, 7:55 PM · Core Platform Team Workboards (Clinic Duty Team), Cassandra, Operations
Eevans added a comment to T198787: Revisit default settings for c-foreach-restart.

@Eevans Who could look at this, would this be a good task for @Clarakosi? Not sure if you've done deb packaging yet Clara

Tue, Aug 27, 7:54 PM · Core Platform Team Workboards (Clinic Duty Team), Cassandra, Operations
Eevans moved T198787: Revisit default settings for c-foreach-restart from Ready to Doing on the Core Platform Team Workboards (Clinic Duty Team) board.
Tue, Aug 27, 7:40 PM · Core Platform Team Workboards (Clinic Duty Team), Cassandra, Operations

Mon, Aug 26

Eevans closed T94329: secure Cassandra/RESTBase cluster as Resolved.
Mon, Aug 26, 10:01 PM · Core Platform Team (Needs Cleaning - Cassandra Operational), Cassandra, Operations, RESTBase-Cassandra, RESTBase
Eevans raised the priority of T92471: enable authenticated access to Cassandra JMX from Normal to Needs Triage.
Mon, Aug 26, 9:59 PM · Core Platform Team Workboards (Clinic Duty Team), User-Eevans, Cassandra, Operations, Patch-For-Review
Eevans updated subscribers of T92471: enable authenticated access to Cassandra JMX.

[ ... ]
This is no longer the case (again?); RMI is (again?) bound only to the IPv4 loopback.
So to summarize the risk of sticking with the status quo:
Anyone with local access to the node can a) issue any command nodetool is capable of, in addition to b) execute arbitrary code as the Cassandra user (only network access is needed)
If we enabled password authentication of RMI, we would restrict this level of access to anyone capable of reading the credentials file (root, presumably).

Mon, Aug 26, 9:58 PM · Core Platform Team Workboards (Clinic Duty Team), User-Eevans, Cassandra, Operations, Patch-For-Review
Eevans added a comment to T92471: enable authenticated access to Cassandra JMX.

OK, time for our yearly update of this ticket!

Mon, Aug 26, 9:49 PM · Core Platform Team Workboards (Clinic Duty Team), User-Eevans, Cassandra, Operations, Patch-For-Review
Eevans moved T224554: Migrate Restbase-dev cluster to Stretch from Inbox to Blocked Externally on the Core Platform Team Workboards (Clinic Duty Team) board.
Mon, Aug 26, 9:28 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), Cassandra, RESTBase, Operations
Eevans edited projects for T224554: Migrate Restbase-dev cluster to Stretch, added: Core Platform Team Workboards (Clinic Duty Team); removed Core Platform Team (Needs Cleaning - Services Operations).
Mon, Aug 26, 9:28 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), Cassandra, RESTBase, Operations
Eevans updated the task description for T224554: Migrate Restbase-dev cluster to Stretch.
Mon, Aug 26, 9:27 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), Cassandra, RESTBase, Operations
Eevans added a subtask for T224553: Migrate remaining Restbase servers to Stretch: T224554: Migrate Restbase-dev cluster to Stretch.
Mon, Aug 26, 9:26 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), RESTBase-Cassandra, Cassandra, RESTBase, Operations
Eevans added a parent task for T224554: Migrate Restbase-dev cluster to Stretch: T224553: Migrate remaining Restbase servers to Stretch.
Mon, Aug 26, 9:26 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), Cassandra, RESTBase, Operations
Eevans removed a project from T127472: Investigate reducing impact of single-node Cassandra latencies: Core Platform Team (Needs Cleaning - Cassandra Operational).
Mon, Aug 26, 8:13 PM · User-Eevans, Cassandra, RESTBase-Cassandra
Eevans closed T127472: Investigate reducing impact of single-node Cassandra latencies as Resolved.

We're up to 3.5.0 of the driver at this time; Let's close this unless we can establish that there is more to do.

Mon, Aug 26, 8:13 PM · User-Eevans, Cassandra, RESTBase-Cassandra
Eevans moved T198787: Revisit default settings for c-foreach-restart from Inbox to Ready on the Core Platform Team Workboards (Clinic Duty Team) board.
Mon, Aug 26, 8:02 PM · Core Platform Team Workboards (Clinic Duty Team), Cassandra, Operations
Eevans edited projects for T198787: Revisit default settings for c-foreach-restart, added: Core Platform Team Workboards (Clinic Duty Team); removed Core Platform Team (Needs Cleaning - Cassandra Operational).

Let's just do this already.

Mon, Aug 26, 8:02 PM · Core Platform Team Workboards (Clinic Duty Team), Cassandra, Operations
Eevans added a comment to T228294: Cassandra PHP driver evaluation.

I used some 10% time to have a (cursory) look at the Datastax PHP driver. Some observations:

Mon, Aug 26, 7:13 PM · Core Platform Team (Needs Cleaning - Cassandra Operational), User-Eevans

Fri, Aug 23

Eevans moved T224553: Migrate remaining Restbase servers to Stretch from Inbox to Blocked Externally on the Core Platform Team Workboards (Clinic Duty Team) board.
Fri, Aug 23, 8:49 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), RESTBase-Cassandra, Cassandra, RESTBase, Operations
Eevans edited projects for T224553: Migrate remaining Restbase servers to Stretch, added: Core Platform Team Workboards (Clinic Duty Team); removed Core Platform Team (Needs Cleaning - Services Operations).

T208087, T223976 and T222960 are fixed. Could we get restbase2009-restbase2012, restbase1018 (and the two remaining -dev servers) migrated to Stretch in the next 1-2 months?
cassandra/restbase is the only remaining use case for our custom OpenJDK 8 backports in jessie-wikimedia and it would be fantastic not to spend more time on this when the October Java security release gets released.

Assuming we're adhering to the process of decommissioning, re-imaging, and bootstrapping, then I suspect the only blocker to doing so would be SRE resources. :)

Fri, Aug 23, 8:48 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), RESTBase-Cassandra, Cassandra, RESTBase, Operations

Aug 22 2019

Eevans moved T231027: Cassandra instances outages (was: Outage of restbase2017-b) from Doing to Done on the Core Platform Team Workboards (Clinic Duty Team) board.

I've only looked at this briefly, but some observations:

  • The JVM seems to have spontaneously and unceremoniously self-destructed:
    • No logged (fatal) exceptions
    • No crash log or heap dump
  • None of the usual signs of distress preceding the event
    • StatusLogger frequency
    • Major GCs
    • Load avg, cpu, IO
  • High latency immediately preceding the event shared by 2009-b
Aug 22 2019, 7:25 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team)