
Jun 26 2019

Eevans created T226666: RESTBagOStuff client error handling.
Jun 26 2019, 8:35 PM · Core Platform Team (Multi-DC (TEC1)), User-Clarakosi, Core Platform Team Legacy (Next), User-Eevans
Eevans renamed T226553: Install Cassandra table properties Debian package on Cassandra hosts from table_properties Puppet install config to Install Cassandra table properties Debian package on Cassandra hosts.
Jun 26 2019, 8:23 PM · Patch-For-Review, Core Platform Team Workboards (Green), User-WDoran, Core Platform Team (Needs Cleaning - Cassandra Operational)
Eevans renamed T226555: Bootstrap initial Cassandra table properties configuration in Puppet from Enable initial configuration of Cassandra instances via Puppet to Bootstrap initial Cassandra table properties configuration in Puppet.
Jun 26 2019, 8:09 PM · Patch-For-Review, Core Platform Team Workboards (Green), CPT Initiatives (Session Management Service (CDP2)), User-WDoran, serviceops-radar, User-Clarakosi, User-Eevans
Eevans added a comment to T221986: Security Review of RESTBagOStuff.

Responding (in-line) to those items I'm familiar with...

Jun 26 2019, 12:50 AM · Core Platform Team Workboards (Team 2), Core Platform Team (Session Management Service (CDP2))

Jun 25 2019

Eevans added a comment to T224993: Example configuration clauses for using RESTBagOStuff with Kask.
  • setting $wgObjectCacheSessionExpiry to the same value as is configured for kask (9 * 3600?)
Jun 25 2019, 4:27 PM · Core Platform Team Workboards (Green), CPT Initiatives (Session Management Service (CDP2))

Jun 12 2019

Eevans closed T219831: Security Review For Kask as Resolved.
Jun 12 2019, 2:24 PM · Restricted Project, Security-Team-Reviews, Services (watching), Core Platform Team Legacy (Watching / External), Core Platform Team (Session Management Service (CDP2)), User-Clarakosi, User-Eevans
Eevans closed T219831: Security Review For Kask, a subtask of T206016: Create a service for session storage, as Resolved.
Jun 12 2019, 2:24 PM · CPT Initiatives (Multi-DC (TEC1)), User-Clarakosi, User-Eevans
Eevans added a comment to T219831: Security Review For Kask.

@Eevans - I'm not seeing anything for this particular review, though I might dig a little deeper into the code and attempt some dynamic-scanning this week, as mentioned in T219831#5173498. But none of this should block resolving the task or deployment IMO.

Jun 12 2019, 2:23 PM · Restricted Project, Security-Team-Reviews, Services (watching), Core Platform Team Legacy (Watching / External), Core Platform Team (Session Management Service (CDP2)), User-Clarakosi, User-Eevans

Jun 11 2019

Eevans added a comment to T219831: Security Review For Kask.

@sbassett is there anything remaining here before we close/resolve this?

Jun 11 2019, 7:36 PM · Restricted Project, Security-Team-Reviews, Services (watching), Core Platform Team Legacy (Watching / External), Core Platform Team (Session Management Service (CDP2)), User-Clarakosi, User-Eevans
Eevans closed T217650: Deployment strategy for the session storage application. as Resolved.
Jun 11 2019, 7:12 PM · Patch-For-Review, Kubernetes, serviceops, Core Platform Team (Multi-DC (TEC1)), User-Clarakosi, Core Platform Team Legacy (Next), User-Eevans
Eevans closed T217650: Deployment strategy for the session storage application., a subtask of T206016: Create a service for session storage, as Resolved.
Jun 11 2019, 7:12 PM · CPT Initiatives (Multi-DC (TEC1)), User-Clarakosi, User-Eevans
Eevans updated subscribers of T209110: Logging for the session storage service.

I believe this issue is basically complete, but it has been left open because we weren't certain whether or not we needed the @cee token to be prepended to log messages; @akosiaris, is logging working as expected in k8s?

Jun 11 2019, 7:11 PM · CPT Initiatives (Session Management Service (CDP2)), Patch-For-Review, User-Clarakosi, User-Eevans
Eevans updated subscribers of T209109: Security model for session storage service.

I believe the current status here to be:

Jun 11 2019, 6:57 PM · CPT Initiatives (Session Management Service (CDP2)), Security-Team, User-Clarakosi, User-Eevans
Eevans closed T209108: Monitoring and data collection for session storage service as Resolved.
Jun 11 2019, 6:20 PM · Patch-For-Review, User-Clarakosi, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), User-Eevans
Eevans closed T209108: Monitoring and data collection for session storage service, a subtask of T206016: Create a service for session storage, as Resolved.
Jun 11 2019, 6:19 PM · CPT Initiatives (Multi-DC (TEC1)), User-Clarakosi, User-Eevans
Eevans moved T224995: Document that session TTL mismatch between Kask and MediaWiki (or other applications) will be silently ignored from Backlog to Blocked on the User-Eevans board.
Jun 11 2019, 6:06 PM · Core Platform Team Workboards (Green), CPT Initiatives (Multi-DC (TEC1)), User-Eevans
Eevans added a comment to T224995: Document that session TTL mismatch between Kask and MediaWiki (or other applications) will be silently ignored.

In the short-term, production configuration lives in deploy1001:/srv/scap-helm/sessionstore/sessionstore-{codfw,eqiad,staging}-values.yaml. I've updated each of these files with the following comment.

# WARNING: The value of $wgObjectCacheSessionExpiry in MediaWiki must
# correspond to the TTL defined here; If you alter default_ttl, update
# MediaWiki accordingly or problems with session renewal/expiry may occur.
default_ttl: 86400

Longer-term, these files will be version-controlled as part of operations/deployment-charts repository (and will be initialized from the above files).
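The warning above could also be enforced mechanically. The following is a minimal sketch, not anything that exists in the deployment tooling: it assumes the values file contains a top-level `default_ttl: <seconds>` line as shown above, and the helper names `read_default_ttl` and `ttl_matches` are hypothetical. The MediaWiki value of $wgObjectCacheSessionExpiry is passed in by the caller.

```python
import re


def read_default_ttl(values_text):
    """Extract default_ttl (seconds) from the text of a values file.

    Only a line of the form 'default_ttl: <integer>' is accepted, so
    comment lines that merely mention default_ttl are ignored.
    """
    match = re.search(r"^\s*default_ttl:\s*(\d+)\s*$", values_text, re.MULTILINE)
    if match is None:
        raise ValueError("default_ttl not found")
    return int(match.group(1))


def ttl_matches(values_text, mw_session_expiry):
    """Return True when the Kask TTL equals the MediaWiki session expiry."""
    return read_default_ttl(values_text) == mw_session_expiry
```

A check like this could run wherever both values are visible, for example in CI for the configuration repository, so that a mismatch is reported instead of being silently ignored.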

Jun 11 2019, 6:05 PM · Core Platform Team Workboards (Green), CPT Initiatives (Multi-DC (TEC1)), User-Eevans

Jun 6 2019

Eevans updated the task description for T220246: Management of Cassandra schema and keyspace/table configuration.
Jun 6 2019, 4:58 PM · Patch-For-Review, User-WDoran, Core Platform Team Workboards (Clinic Duty Team), serviceops-radar

Jun 5 2019

Eevans updated the task description for T220246: Management of Cassandra schema and keyspace/table configuration.
Jun 5 2019, 9:34 PM · Patch-For-Review, User-WDoran, Core Platform Team Workboards (Clinic Duty Team), serviceops-radar
Eevans updated the task description for T220246: Management of Cassandra schema and keyspace/table configuration.
Jun 5 2019, 9:18 PM · Patch-For-Review, User-WDoran, Core Platform Team Workboards (Clinic Duty Team), serviceops-radar
Eevans updated the task description for T220246: Management of Cassandra schema and keyspace/table configuration.
Jun 5 2019, 9:10 PM · Patch-For-Review, User-WDoran, Core Platform Team Workboards (Clinic Duty Team), serviceops-radar

Jun 4 2019

Eevans triaged T224995: Document that session TTL mismatch between Kask and MediaWiki (or other applications) will be silently ignored as Normal priority.
Jun 4 2019, 4:09 PM · Core Platform Team Workboards (Green), CPT Initiatives (Multi-DC (TEC1)), User-Eevans
Eevans added a comment to T224995: Document that session TTL mismatch between Kask and MediaWiki (or other applications) will be silently ignored.

In the short-term, production configuration lives in deploy1001:/srv/scap-helm/sessionstore/sessionstore-{codfw,eqiad,staging}-values.yaml. I've updated each of these files with the following comment.

Jun 4 2019, 4:08 PM · Core Platform Team Workboards (Green), CPT Initiatives (Multi-DC (TEC1)), User-Eevans
Eevans added a comment to T220401: Introduce kask session storage service to kubernetes.

And LVS done today.

akosiaris@deploy1001:~$ curl -i https://sessionstore.svc.eqiad.wmnet:8081/healthz
HTTP/2 200 
content-type: application/json
content-length: 0
date: Tue, 04 Jun 2019 15:33:30 GMT
akosiaris@deploy1001:~$ curl -i https://sessionstore.svc.codfw.wmnet:8081/healthz
HTTP/2 200 
content-type: application/json
content-length: 0
date: Tue, 04 Jun 2019 15:33:36 GMT

Resolving this, the service is ready to be used.

Jun 4 2019, 3:38 PM · Patch-For-Review, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
Eevans added a comment to T224995: Document that session TTL mismatch between Kask and MediaWiki (or other applications) will be silently ignored.

I'm not sure config.yaml.sample is a good place. There isn't anything MediaWiki- or session-storage-specific about Kask, but this warning is. It'd be confusing in every other context.

Jun 4 2019, 3:36 PM · Core Platform Team Workboards (Green), CPT Initiatives (Multi-DC (TEC1)), User-Eevans

Jun 3 2019

Eevans added a comment to T220246: Management of Cassandra schema and keyspace/table configuration.

[...]
All in all, with a couple of caveats on finding a better process, I think this proposal can work. I'm just not sure about the strawman puppet proposal. I'd create a define called something like cassandra::table::config and just declare all instances directly in a puppet class cassandra::tables_config that collects them all, instead of relying on hiera and something like create_resources.

Jun 3 2019, 11:56 PM · Patch-For-Review, User-WDoran, Core Platform Team Workboards (Clinic Duty Team), serviceops-radar
Eevans added a comment to T220401: Introduce kask session storage service to kubernetes.

One minor question. Given per T220401#5128786 1 kask instance is able to handle ~300req/s, how many instances will we require? I am unsure of the current rate of sessions requests to/from redis.

What was the test environment used there? When I tested using the sessionstore Cassandra cluster nodes, I got at least two orders of magnitude higher throughput.

An admittedly underpowered minikube environment with a probably untuned cassandra. Some values for cassandra itself are in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kask/values.yaml#114. It makes absolute sense that a well tuned and more powered cassandra cluster would be able to serve more req/s.
Now, to answer my question, and by looking at T221292, I'll assume a single instance for production should be able to serve some 30k req/s (I am rounding down from the lowest score in that table just to be on the safe side). So 1 instance would probably not cover it, we would need at least 2 instances. Adding 2x rack row redundancy means 4 instances. Looks like that's our number for now. We can always increase it ofc.

FWIW, we've been bouncing around a target throughput of 30k/sec in production based on Redis metrics, but as was later noted in T212129, that number includes everything in Mainstash, only a fraction of which is sessions (we're moving sessions over separately from the rest). IOW, sessions should be something considerably less than 30k/s, even if we don't know exactly what.

Yup, that's true. In fact, some numbers I've heard (I have no actual proof) place sessions at <10% of the total Mainstash (which, however, says nothing about the rate of requests for sessions). That being said, without an actual number it's hard to do math. Now, given that kask really isn't expensive to run, my take is: Play it safe, assume everything in Mainstash is sessions and work with that. When we get actual numbers we can always revisit the decisions here. It's not like we are going to etch them in stone.
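The sizing arithmetic in this exchange can be written down as a small sketch. The 30k req/s target, the roughly 30k req/s per-instance figure, and the 2x row redundancy come from the discussion above. The `headroom` factor of 1.5 is an assumption introduced here to reproduce the "one instance would probably not cover it" judgement, and the function name is hypothetical.

```python
import math


def instances_needed(target_rps, per_instance_rps, headroom=1.5, redundancy=2):
    """Estimate the instance count for a target request rate.

    headroom   -- assumed safety factor on top of the target rate
    redundancy -- multiplier for rack-row redundancy (2x in the thread)
    """
    base = math.ceil(target_rps * headroom / per_instance_rps)
    return base * redundancy


# Worst case from the thread: treat all Mainstash traffic as sessions.
print(instances_needed(30_000, 30_000))
```

If sessions turn out to be a small fraction of Mainstash traffic, rerunning the same calculation with the measured rate shows how much the instance count could shrink.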

Jun 3 2019, 4:32 PM · Patch-For-Review, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
Eevans added a comment to T220401: Introduce kask session storage service to kubernetes.

One minor question. Given per T220401#5128786 1 kask instance is able to handle ~300req/s, how many instances will we require? I am unsure of the current rate of sessions requests to/from redis.

What was the test environment used there? When I tested using the sessionstore Cassandra cluster nodes, I got at least two orders of magnitude higher throughput.

An admittedly underpowered minikube environment with a probably untuned cassandra. Some values for cassandra itself are in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kask/values.yaml#114. It makes absolute sense that a well tuned and more powered cassandra cluster would be able to serve more req/s.
Now, to answer my question, and by looking at T221292, I'll assume a single instance for production should be able to serve some 30k req/s (I am rounding down from the lowest score in that table just to be on the safe side). So 1 instance would probably not cover it, we would need at least 2 instances. Adding 2x rack row redundancy means 4 instances. Looks like that's our number for now. We can always increase it ofc.

Jun 3 2019, 1:59 PM · Patch-For-Review, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team

May 31 2019

Eevans added a comment to T220401: Introduce kask session storage service to kubernetes.

One minor question. Given per T220401#5128786 1 kask instance is able to handle ~300req/s, how many instances will we require? I am unsure of the current rate of sessions requests to/from redis.

May 31 2019, 5:26 PM · Patch-For-Review, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
Eevans added a comment to T220401: Introduce kask session storage service to kubernetes.

One thing that I just met is that kask stops accepting HTTP connections if kask cert/key pair is configured. That's fine normally, but there is a very interesting repercussion. Kubernetes readiness probes to the /healthz endpoint now fail. kask logs
2019/05/31 09:36:00 http: TLS handshake error from 10.64.0.247:55194: tls: first record does not look like a TLS handshake
Kubernetes can't do TLS enabled HTTP probes.

That is unfortunate. Out of curiosity, is that a design decision? Something that just hasn't been a priority to implement yet?

Options:

  1. We don't enable TLS for the session store. Unless I am mistaken that is what our current situation is, so we at least don't introduce a regression. That being said, it's clearly very suboptimal and against our goals. On the plus side, we could rely on envoy as a sidecar container for TLS demarcation/termination sometime in the future
  2. We migrate to a TCP probe. However we already use this for the liveness probe and experience has shown that using the same liveness/readiness probe is completely wrong. Pods get killed before they are depooled and have a chance to recover causing outages.
  3. We don't have a readiness probe. No, not a good idea. This protects us from domino effects and overloads of the service. The entire idea of the /healthz endpoint is exactly to avoid that.
  4. We use an Exec probe that executes something like curl https://<POD_IP>:8081/healthz. This is generally suboptimal as the execution of an external command is more expensive than an HTTP GET probe from the kubelet. I am also not clear on how the pod IP would be communicated to the command, need to research that more.
  5. We amend kask to not require TLS for the /healthz endpoint. This is ugly and would complicate the code considerably I think.

We can do this. It'd technically be a different server (different Server object, different port), even if bound to the same Go process; will it still satisfy its mandate? I wonder, does it make sense to do the same with the Prometheus agent? We could add the idea of a management interface or similar, make the listen address and port configurable, and hang /health and /metrics there.
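The management-interface idea can be illustrated with a minimal sketch. It is written in Python for brevity (Kask itself is Go) and is not Kask's implementation: a second, plain-HTTP listener with its own configurable address and port serves only /healthz, while the TLS-protected API would keep its own listener. The names `ManagementHandler` and `start_management_server` are hypothetical.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ManagementHandler(BaseHTTPRequestHandler):
    """Serves operational endpoints only; no session data is reachable here."""

    def do_GET(self):
        if self.path == "/healthz":
            # Mirror the empty 200 response seen in the curl output above.
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_management_server(host="127.0.0.1", port=0):
    """Start the plain-HTTP management listener in a background thread.

    port=0 lets the OS pick a free port; a real deployment would take the
    listen address and port from configuration.
    """
    server = HTTPServer((host, port), ManagementHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

With a layout like this, a kubelet HTTP readiness probe could target the management port without TLS, and a /metrics handler could be added to the same listener.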

May 31 2019, 3:36 PM · Patch-For-Review, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
Eevans added a comment to T220401: Introduce kask session storage service to kubernetes.

One thing that I just met is that kask stops accepting HTTP connections if kask cert/key pair is configured. That's fine normally, but there is a very interesting repercussion. Kubernetes readiness probes to the /healthz endpoint now fail. kask logs
2019/05/31 09:36:00 http: TLS handshake error from 10.64.0.247:55194: tls: first record does not look like a TLS handshake
Kubernetes can't do TLS enabled HTTP probes.

May 31 2019, 3:09 PM · Patch-For-Review, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
Eevans added a comment to T222960: Fix restbase1017's physical rack.

I think DC-Ops should sync with you all on a schedule for this move. According to @Eevans it would be desirable to schedule the physical move on a Monday, so that the cassandra decommission can be started before the weekend.

May 31 2019, 2:53 PM · Patch-For-Review, serviceops, Core Platform Team Workboards (Team 2), Operations, Services (doing), Core Platform Team (Needs Cleaning - Security, stability, performance and scalability (TEC1)), User-Eevans, Cassandra

May 30 2019

Eevans committed rMSKS0f85b276f6c1: Log `ResponseWriter.Write` errors (authored by Eevans).
Log `ResponseWriter.Write` errors
May 30 2019, 9:59 PM
Eevans added a comment to T219831: Security Review For Kask.

[ ... ]

[ ... ]
Static Analysis Findings
I ran gosec against the source and it came back with what appear to be mostly false positives, but I thought I'd post them here for completeness' sake:

  • config.go:62 - G304: Potential file inclusion via variable (Confidence: HIGH, Severity: MEDIUM)
    • ioutil.ReadFile(filename)
May 30 2019, 7:53 PM · Restricted Project, Security-Team-Reviews, Services (watching), Core Platform Team Legacy (Watching / External), Core Platform Team (Session Management Service (CDP2)), User-Clarakosi, User-Eevans
Eevans added a comment to T209108: Monitoring and data collection for session storage service.

https://grafana.wikimedia.org/d/000001590/sessionstore?refresh=1m&orgId=1 is now available courtesy of @akosiaris

May 30 2019, 5:26 PM · Patch-For-Review, User-Clarakosi, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), User-Eevans
Eevans closed T224620: Convert Cassandra contact to a list as Resolved.
May 30 2019, 5:09 PM · Core Platform Team (Multi-DC (TEC1)), User-Clarakosi, Core Platform Team Legacy (Next), User-Eevans
Eevans closed T224620: Convert Cassandra contact to a list, a subtask of T206016: Create a service for session storage, as Resolved.
May 30 2019, 5:09 PM · CPT Initiatives (Multi-DC (TEC1)), User-Clarakosi, User-Eevans
Eevans committed rMSKSc13c61f7bc01: Accept a list of hostnames as Cassandra contacts (authored by Eevans).
Accept a list of hostnames as Cassandra contacts
May 30 2019, 3:46 PM

May 29 2019

Eevans committed rMSKS989ec0a8fa2c: Accept a list of hostnames as Cassandra contacts (authored by Eevans).
Accept a list of hostnames as Cassandra contacts
May 29 2019, 11:28 PM
Eevans merged task T224623: Upgrade RESTBase cluster to Stretch into T224553: Migrate remaining Restbase servers to Stretch.
May 29 2019, 8:25 PM · Core Platform Team, RESTBase, Cassandra
Eevans merged T224623: Upgrade RESTBase cluster to Stretch into T224553: Migrate remaining Restbase servers to Stretch.
May 29 2019, 8:25 PM · Core Platform Team (Needs Cleaning - Services Operations), RESTBase-Cassandra, Cassandra, RESTBase, Operations
Eevans updated the task description for T224553: Migrate remaining Restbase servers to Stretch.
May 29 2019, 8:23 PM · Core Platform Team (Needs Cleaning - Services Operations), RESTBase-Cassandra, Cassandra, RESTBase, Operations
Eevans updated the task description for T224623: Upgrade RESTBase cluster to Stretch.
May 29 2019, 8:12 PM · Core Platform Team, RESTBase, Cassandra
Eevans created T224623: Upgrade RESTBase cluster to Stretch.
May 29 2019, 8:11 PM · Core Platform Team, RESTBase, Cassandra
Eevans updated the task description for T223976: Decommission restbase10(0[7-9]|1[0-5]).
May 29 2019, 7:53 PM · Core Platform Team Workboards (Done with CPT), Services (done), Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Security, stability, performance and scalability (TEC1)), Operations
Eevans triaged T224620: Convert Cassandra contact to a list as Normal priority.
May 29 2019, 7:10 PM · Core Platform Team (Multi-DC (TEC1)), User-Clarakosi, Core Platform Team Legacy (Next), User-Eevans
Eevans created T224620: Convert Cassandra contact to a list.
May 29 2019, 7:10 PM · Core Platform Team (Multi-DC (TEC1)), User-Clarakosi, Core Platform Team Legacy (Next), User-Eevans

May 28 2019

Eevans updated the task description for T223976: Decommission restbase10(0[7-9]|1[0-5]).
May 28 2019, 9:31 PM · Core Platform Team Workboards (Done with CPT), Services (done), Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Security, stability, performance and scalability (TEC1)), Operations
Eevans renamed T224041: Kask functional testing with Cassandra via the Deployment Pipeline from Kask integration testing with Cassandra via the Deployment Pipeline to Kask functional testing with Cassandra via the Deployment Pipeline.
May 28 2019, 1:15 PM · CPT Initiatives (Session Management Service (CDP2)), Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Services (next), User-Eevans, Release Pipeline, Operations, serviceops
Eevans updated the task description for T223976: Decommission restbase10(0[7-9]|1[0-5]).
May 28 2019, 12:36 AM · Core Platform Team Workboards (Done with CPT), Services (done), Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Security, stability, performance and scalability (TEC1)), Operations

May 27 2019

Eevans updated the task description for T223976: Decommission restbase10(0[7-9]|1[0-5]).
May 27 2019, 2:59 AM · Core Platform Team Workboards (Done with CPT), Services (done), Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Security, stability, performance and scalability (TEC1)), Operations

May 26 2019

Eevans updated the task description for T223976: Decommission restbase10(0[7-9]|1[0-5]).
May 26 2019, 1:59 PM · Core Platform Team Workboards (Done with CPT), Services (done), Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Security, stability, performance and scalability (TEC1)), Operations

May 24 2019

Eevans updated the task description for T223976: Decommission restbase10(0[7-9]|1[0-5]).
May 24 2019, 9:53 PM · Core Platform Team Workboards (Done with CPT), Services (done), Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Security, stability, performance and scalability (TEC1)), Operations
Eevans updated the task description for T223976: Decommission restbase10(0[7-9]|1[0-5]).
May 24 2019, 12:15 AM · Core Platform Team Workboards (Done with CPT), Services (done), Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Security, stability, performance and scalability (TEC1)), Operations

May 23 2019

Eevans updated the task description for T223976: Decommission restbase10(0[7-9]|1[0-5]).
May 23 2019, 1:52 PM · Core Platform Team Workboards (Done with CPT), Services (done), Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Security, stability, performance and scalability (TEC1)), Operations

May 22 2019

Eevans updated the task description for T223976: Decommission restbase10(0[7-9]|1[0-5]).
May 22 2019, 1:05 PM · Core Platform Team Workboards (Done with CPT), Services (done), Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Security, stability, performance and scalability (TEC1)), Operations
Eevans added a comment to T222907: Determine if per-request TTLs are needed.

[ ... ]
From the task description:

Put another way, what is at question is NOT whether arbitrary TTLs are possible, but whether the client can override the default with something shorter.

The ability to configure session TTLs is baked in to our current SessionManager/SessionBackend/BagOStuff implementation, and nothing I do in RESTBagOStuff will eliminate that configuration setting. So it will be possible for someone to *think* they are configuring an arbitrary TTL on the client (wiki) side. But we can, of course, ignore that setting. And in fact, we currently are ignoring it - the current RESTBagOStuff implementation never even sends it to Kask.
So if it suffices to provide a consistent TTL across all sessions on all wikis (and it appears it does) then I think that we are, technically speaking, done.
However, it feels sketchy to me that it is possible to configure a TTL that gets silently ignored. Someone might set $wgObjectCacheSessionExpiry on a wiki that uses RESTBagOStuff and reasonably expect it to do something. But it would not.

May 22 2019, 2:05 AM · Core Platform Team Workboards (Done with CPT), Core Platform Team (Session Management Service (CDP2))

May 21 2019

Eevans added a comment to T222990: Audit session storage to determine max age of un-GC'd sessions.

@Eevans is this a task for you or were you looking for input from @EvanProdromou ?

May 21 2019, 6:49 PM · Core Platform Team Workboards (Green), CPT Initiatives (Multi-DC (TEC1)), audits-data-retention, User-Clarakosi, User-Eevans
Eevans updated the task description for T222907: Determine if per-request TTLs are needed.
May 21 2019, 3:45 PM · Core Platform Team Workboards (Done with CPT), Core Platform Team (Session Management Service (CDP2))
Eevans updated the task description for T222907: Determine if per-request TTLs are needed.
May 21 2019, 3:38 PM · Core Platform Team Workboards (Done with CPT), Core Platform Team (Session Management Service (CDP2))

May 20 2019

Eevans added a comment to T223825: Degraded RAID on restbase-dev1006.

Not production; This host can be taken down at any time, without coordination.

May 20 2019, 2:34 PM · ops-eqiad, Operations

May 17 2019

Eevans added a comment to T215533: Enable use of session storage service in MediaWiki.

[ ... ]
However, if there's no compelling need to do Basic Auth, I'd vote to support the option that it sounds like both RESTBagOStuff and Kask already support.

May 17 2019, 1:24 PM · Core Platform Team Workboards (Epics), Epic, CPT Initiatives (Session Management Service (CDP2)), MW-1.34-notes (1.34.0-wmf.7; 2019-05-28), Patch-For-Review, User-Clarakosi, User-Eevans

May 14 2019

Eevans added a comment to T219879: Create a test runner for end-to-end API tests (Phester).

Additional notes and considerations:
[ ... ]

  • The test runner should not hard code any knowledge about the MediaWiki action API, and should be designed to be usable for testing other APIs, such as RESTbase.
May 14 2019, 6:18 PM · CPT Initiatives (API Integration Tests), Epic, Code-Health

May 11 2019

Eevans added a comment to T219831: Security Review For Kask.

[ ... ]
Policy Compliance
I am assuming data will live in Cassandra and within relevant logs in accordance with Wikimedia's existing data retention guidelines.

No PII should ever be logged, but session data is persisted in Cassandra. Sessions expire well before the max retention period, but that data is not immediately removed from storage; it's GC'd by compaction (as dictated by the compaction algorithm in use) after a moratorium (defaults to 10 days). I would be really surprised if that process were to drag out past 90 days (or anything even close to 90 days), but it might be worth putting a pin in this and conducting an audit at some point to verify just how long they do hang around.

May 11 2019, 12:34 AM · Restricted Project, Security-Team-Reviews, Services (watching), Core Platform Team Legacy (Watching / External), Core Platform Team (Session Management Service (CDP2)), User-Clarakosi, User-Eevans
Eevans triaged T222990: Audit session storage to determine max age of un-GC'd sessions as Normal priority.
May 11 2019, 12:33 AM · Core Platform Team Workboards (Green), CPT Initiatives (Multi-DC (TEC1)), audits-data-retention, User-Clarakosi, User-Eevans
Eevans created T222990: Audit session storage to determine max age of un-GC'd sessions.
May 11 2019, 12:32 AM · Core Platform Team Workboards (Green), CPT Initiatives (Multi-DC (TEC1)), audits-data-retention, User-Clarakosi, User-Eevans
Eevans added a comment to T219831: Security Review For Kask.

This is awesome @sbassett; Thanks for looking it over! I left some comments inline:

May 11 2019, 12:19 AM · Restricted Project, Security-Team-Reviews, Services (watching), Core Platform Team Legacy (Watching / External), Core Platform Team (Session Management Service (CDP2)), User-Clarakosi, User-Eevans

May 10 2019

Eevans closed T222227: Kask support for operations/software/service-checker as Resolved.
May 10 2019, 4:27 PM · Patch-For-Review, User-Clarakosi, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), User-Eevans
Eevans closed T222227: Kask support for operations/software/service-checker, a subtask of T209108: Monitoring and data collection for session storage service, as Resolved.
May 10 2019, 4:27 PM · Patch-For-Review, User-Clarakosi, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), User-Eevans
Eevans committed rMSKS7aaf55560c58: Serve OpenAPI specification (authored by Eevans).
Serve OpenAPI specification
May 10 2019, 4:24 PM

May 9 2019

Eevans added a comment to T222907: Determine if per-request TTLs are needed.

FYI, I think CentralAuth needs a TTL different to that of regular sessions (IIRC, this was cited as a reason for exposing the TTL). Regardless of whether or not this is true, we should probably consider using a different instance of the service for CentralAuth (it's trivial to do). If we do, we can configure it with a different default TTL as well.

May 9 2019, 7:19 PM · Core Platform Team Workboards (Done with CPT), Core Platform Team (Session Management Service (CDP2))
Eevans added a comment to T222908: Determine if set-if-not-exists method is necessary for session storage.

This task is to check if this is needed now, or ever.

May 9 2019, 7:08 PM · Core Platform Team Workboards (Done with CPT), Core Platform Team (Session Management Service (CDP2))
Eevans added a comment to T222908: Determine if set-if-not-exists method is necessary for session storage.

I'll check through our session-handling code to see if we ever use this type of call (I think it's add() in BagOStuff) for sessions, and if it seems like it's necessary at that point.
Off my dome, I could see using this type of write when you're worried about propagation delays between data centres. One scenario:

  1. Client connects to server in DC1 and gets a session ID AAAA
  2. Application server in DC1 unconditionally writes an empty session object with key AAAA to DC1 cluster
  3. User authenticates to application server in DC1
  4. Application server in DC1 unconditionally writes a session object with user ID with key AAAA to DC1 cluster
  5. Client connects to application server in DC2, with session ID AAAA
  6. Application server in DC2 reads session with key AAAA from DC2 cluster, gets no value
  7. Empty session object with key AAAA propagates from DC1 cluster to DC2 cluster
  8. Session object with user ID and key AAAA propagates from DC1 cluster to DC2 cluster
  9. Application server in DC2 unconditionally writes empty session object for key AAAA to cluster in DC2
  10. Empty session object with key AAAA propagates from DC2 cluster to DC1 cluster
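The ten steps above can be replayed in a toy simulation. This is only a sketch under stated assumptions: each data centre is modelled as a plain dictionary, replication is modelled as an overwrite applied in arrival order, and the `Cluster` class and `replay` function are hypothetical names, not part of BagOStuff or Kask. The only thing varied is whether the write in step 9 is unconditional or set-if-not-exists.

```python
class Cluster:
    """Toy key-value store standing in for one data centre's cluster."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        """Unconditional write; always accepted."""
        self.data[key] = value
        return True

    def add(self, key, value):
        """Set-if-not-exists; rejected when the key is already present."""
        if key in self.data:
            return False
        self.data[key] = value
        return True


def replay(use_add):
    """Replay the scenario and return what DC1 holds for the session."""
    dc1, dc2 = Cluster(), Cluster()
    key = "AAAA"
    dc1.set(key, {})                    # step 2: empty session in DC1
    dc1.set(key, {"user": 42})          # step 4: authenticated session in DC1
    seen_in_dc2 = dc2.data.get(key)     # step 6: DC2 read finds nothing
    dc2.set(key, {})                    # step 7: empty session arrives in DC2
    dc2.set(key, {"user": 42})          # step 8: authenticated session arrives
    if seen_in_dc2 is None:             # step 9: DC2 writes an empty session
        write = dc2.add if use_add else dc2.set
        if write(key, {}):
            dc1.set(key, {})            # step 10: accepted write reaches DC1
    return dc1.data[key]


print(replay(use_add=False))  # unconditional write in step 9
print(replay(use_add=True))   # set-if-not-exists in step 9
```

In this model the unconditional write leaves DC1 holding an empty session, while the conditional write is rejected in DC2 and the authenticated session survives.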
May 9 2019, 6:49 PM · Core Platform Team Workboards (Done with CPT), Core Platform Team (Session Management Service (CDP2))
Eevans added a comment to T215533: Enable use of session storage service in MediaWiki.

One thing I want to make sure of is that KaskBagOStuff is compatible with the TLS setup defined in T209109, as well as supporting username/password HTTP Basic authentication.

May 9 2019, 6:33 PM · Core Platform Team Workboards (Epics), Epic, CPT Initiatives (Session Management Service (CDP2)), MW-1.34-notes (1.34.0-wmf.7; 2019-05-28), Patch-For-Review, User-Clarakosi, User-Eevans
Eevans committed rMSKSb95678dedba4: Serve OpenAPI specification (authored by Eevans).
Serve OpenAPI specification
May 9 2019, 1:42 AM

May 7 2019

Eevans committed rMSKS51ba6c7211ce: [WIP] Serve OpenAPI specification (authored by Eevans).
[WIP] Serve OpenAPI specification
May 7 2019, 1:43 PM

May 1 2019

Eevans added a comment to T222227: Kask support for operations/software/service-checker.

According to the site, the benefits of OpenAPI are...

May 1 2019, 1:53 AM · Patch-For-Review, User-Clarakosi, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), User-Eevans
Eevans updated the task description for T222227: Kask support for operations/software/service-checker.
May 1 2019, 1:19 AM · Patch-For-Review, User-Clarakosi, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), User-Eevans

Apr 30 2019

Eevans committed rMSKS87c2e9a45d7d: [WIP] Serve OpenAPI specification (authored by Eevans).
[WIP] Serve OpenAPI specification
Apr 30 2019, 9:24 PM
Eevans updated the task description for T222227: Kask support for operations/software/service-checker.
Apr 30 2019, 9:13 PM · Patch-For-Review, User-Clarakosi, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), User-Eevans
Eevans renamed T222227: Kask support for operations/software/service-checker from Kask: operations/software/service-checker support to Kask support for operations/software/service-checker.
Apr 30 2019, 9:00 PM · Patch-For-Review, User-Clarakosi, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), User-Eevans
Eevans triaged T222227: Kask support for operations/software/service-checker as Normal priority.
Apr 30 2019, 9:00 PM · Patch-For-Review, User-Clarakosi, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), User-Eevans
Eevans updated the task description for T222227: Kask support for operations/software/service-checker.
Apr 30 2019, 8:59 PM · Patch-For-Review, User-Clarakosi, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), User-Eevans
Eevans created T222227: Kask support for operations/software/service-checker.
Apr 30 2019, 8:48 PM · Patch-For-Review, User-Clarakosi, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), User-Eevans
Eevans added a comment to T221292: Establish performance of the session storage service.

@Eevans maybe the mystery of the longer requests is just that they're writes instead of reads?

Apr 30 2019, 8:11 PM · Performance-Team (Radar), Core Platform Team Workboards (Green), CPT Initiatives (Multi-DC (TEC1)), User-Clarakosi, User-Eevans
Eevans added a comment to T221292: Establish performance of the session storage service.

Also, is this task done when we're within the boundaries of T211721 or is the task just to have a tool to do measurements?

Apr 30 2019, 8:10 PM · Performance-Team (Radar), Core Platform Team Workboards (Green), CPT Initiatives (Multi-DC (TEC1)), User-Clarakosi, User-Eevans
Eevans updated the task description for T219883: Draft file format for phester test definitions.
Apr 30 2019, 7:46 PM · CPT Initiatives (API Integration Tests), Core Platform Team Workboards (Done with CPT)
Eevans added a comment to T211721: Establish an SLA for session storage.

On a related note, do we want or need an SLA on consistency?

Apr 30 2019, 7:43 PM · CPT Initiatives (Session Management Service (CDP2)), MW-1.33-notes (1.33.0-wmf.24; 2019-04-02), Patch-For-Review, Performance-Team (Radar), TechCom, Services (next), Operations, User-Clarakosi, User-Eevans
Eevans added a comment to T211721: Establish an SLA for session storage.

[ ... ]
... My understanding is session writes would be visible in all DCs once an HTTP response is sent (at least the old security team wanted that).

Apr 30 2019, 7:42 PM · CPT Initiatives (Session Management Service (CDP2)), MW-1.33-notes (1.33.0-wmf.24; 2019-04-02), Patch-For-Review, Performance-Team (Radar), TechCom, Services (next), Operations, User-Clarakosi, User-Eevans
Eevans committed rMSKS62a132898024: [WIP] Serve OpenAPI specification (authored by Eevans).
[WIP] Serve OpenAPI specification
Apr 30 2019, 6:52 PM

Apr 29 2019

Eevans added a comment to T211721: Establish an SLA for session storage.

@Eevans have we been considering cross-DC writes in the performance testing? Are we comparing apples to apples here?

Cross-DC only occurs for DELETE operations and should be identical to POST, plus whatever the inter-DC latency is. I have not included them in performance testing because the focus thus far has been on Kask performance (and any fixes/optimizations needed there), and including DELETE would only provide a measurement of Cassandra over the WAN.

Why would there be cross-DC requests to Kask at all? MW should send requests only to Kask in the local DC, and then it's up to Cassandra to propagate the change to the other DCs. Or am I missing something?

Apr 29 2019, 7:04 PM · CPT Initiatives (Session Management Service (CDP2)), MW-1.33-notes (1.33.0-wmf.24; 2019-04-02), Patch-For-Review, Performance-Team (Radar), TechCom, Services (next), Operations, User-Clarakosi, User-Eevans
Eevans added a comment to T211721: Establish an SLA for session storage.

@Eevans have we been considering cross-DC writes in the performance testing? Are we comparing apples to apples here?

Apr 29 2019, 6:55 PM · CPT Initiatives (Session Management Service (CDP2)), MW-1.33-notes (1.33.0-wmf.24; 2019-04-02), Patch-For-Review, Performance-Team (Radar), TechCom, Services (next), Operations, User-Clarakosi, User-Eevans
Eevans added a comment to T221292: Establish performance of the session storage service.

@Eevans should we break out a separate task for identifying the cause of the extra 40-50ms delay?

Apr 29 2019, 6:45 PM · Performance-Team (Radar), Core Platform Team Workboards (Green), CPT Initiatives (Multi-DC (TEC1)), User-Clarakosi, User-Eevans

Apr 25 2019

Eevans moved T209108: Monitoring and data collection for session storage service from Backlog to In-Progress on the User-Eevans board.
Apr 25 2019, 8:17 PM · Patch-For-Review, User-Clarakosi, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), User-Eevans
Eevans added a comment to T220246: Management of Cassandra schema and keyspace/table configuration.

In this scenario, our service "foo" is already accessing table "foo.meta" and we're currently adding a column to the table. Altering the schema and upgrading the application need to be able to happen asynchronously, because neither schema changes nor application deployments are atomic. Let's assume we do what MediaWiki does, that is, make the code deployable with or without the new schema and protect the use of the new features behind a feature flag. In this scenario:

  1. Code (and the proposed ALTER) is written and deployed to production.
  2. This is just a schema change, so the CQL command for the schema change can be issued whenever we want - even as a final step of deployment - directly by the deployer.
  3. The feature flag gets flipped in a subsequent deployment when it's assured the schema change has happened everywhere.
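
As a rough illustration of the steps above, the sketch below shows code that is deployable with or without the new column. Everything beyond the table name "foo.meta" is an assumption made for the example: the `FeatureFlags` class, the `build_insert` helper, and the column names `key`, `value`, and `extra` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureFlags:
    # Hypothetical flag; stays False until the schema change is applied everywhere.
    use_new_column: bool = False


def build_insert(flags, key, value, extra=None):
    """Return a (statement, parameters) pair that matches the currently enabled schema."""
    if flags.use_new_column:
        # New code path: only reachable after the flag is flipped (step 3).
        return (
            "INSERT INTO foo.meta (key, value, extra) VALUES (%s, %s, %s)",
            (key, value, extra),
        )
    # Old code path: valid both before and after the column is added.
    return (
        "INSERT INTO foo.meta (key, value) VALUES (%s, %s)",
        (key, value),
    )
```

With the flag off, the statement never references the new column, so the order in which the ALTER and the code deployment land does not matter.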
Apr 25 2019, 4:47 PM · Patch-For-Review, User-WDoran, Core Platform Team Workboards (Clinic Duty Team), serviceops-radar
Eevans added a comment to T220401: Introduce kask session storage service to kubernetes.

@Eevans @Clarakosi the chart has been merged and is published. The only thing missing before we can move on to the deployment is the Swagger/OpenAPI spec, so that service-checker[1] can run and monitor this service.
[1] https://github.com/wikimedia/operations-software-service-checker

Apr 25 2019, 4:16 PM · Patch-For-Review, Core Platform Team Legacy (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
Eevans moved T209106: Setup session storage service testing/continuous integration from Backlog to Next on the User-Eevans board.
Apr 25 2019, 3:50 PM · CPT Initiatives (Session Management Service (CDP2)), User-Clarakosi, User-Eevans
Eevans moved T209110: Logging for the session storage service from Backlog to In-Progress on the User-Eevans board.
Apr 25 2019, 3:50 PM · CPT Initiatives (Session Management Service (CDP2)), Patch-For-Review, User-Clarakosi, User-Eevans
Eevans moved T217650: Deployment strategy for the session storage application. from Backlog to In-Progress on the User-Eevans board.
Apr 25 2019, 3:50 PM · Patch-For-Review, Kubernetes, serviceops, Core Platform Team (Multi-DC (TEC1)), User-Clarakosi, Core Platform Team Legacy (Next), User-Eevans
Eevans moved T219831: Security Review For Kask from Backlog to In-Progress on the User-Eevans board.
Apr 25 2019, 3:48 PM · Restricted Project, Security-Team-Reviews, Services (watching), Core Platform Team Legacy (Watching / External), Core Platform Team (Session Management Service (CDP2)), User-Clarakosi, User-Eevans