
Service Ops Review of Metrics Platform Configuration Management UI
Open, High, Public

Description

As part of the SDS 2.5.2 hypothesis, Data Products is building a Configuration Management UI to enable Product Owners to easily create and manage metrics instrumentation collection and, ultimately, full experiments.

Request for review by SRE

We have two main documents we would like SRE input on:

For additional context, we also have:

Event Timeline

WDoranWMF created this task.

Hello @Kappakayala, please let us know if the March 22nd deadline is still feasible for this review. The most important thing is to learn early from your team whether there are any showstoppers from your perspective.

Hi @MShilova_WMF. This is on my list for today, though it might spill into early next week. I've started the review, but I don't seem to have access to T358115 (linked from the description); could you please grant me access?

Hi,

I've already left various comments on the two docs. I am still going through the Miro board, but I can summarize the following:

This proposal shows promise; thanks for working on it and reaching out to us early in the process, I appreciate it very much. I would like some more clarifications first, though.

  • There are multiple caches mentioned in the design doc. It is my current understanding that they are pretty crucial for the performance of both the app AND the bridging MW extension. They are, however, not described in much detail, and I'd like to see a more detailed description of them. My goal is to understand possible failure patterns and scenarios.
  • The decision has been taken to host the application part of this in dse-k8s, which also implies it is going to be eqiad-only, at least in the beginning. This means that for 6 months out of 12, the bridging extension of MediaWiki will need to pay the latency cost of reaching out across the 2 DCs. While the service mesh we have will alleviate some of that by maintaining persistent HTTP connections and paying the TLS negotiation cost only once, this is still at least 40 ms (and that only in the absolute best-case scenario) plus the time the app will take to respond. If this ends up being in the critical path of end-user requests, it will create a significant performance regression for all users for 6 months.
  • If this bridging extension is going to participate in the critical path of every end-user request, which my current understanding is it will, it is imperative that it remains very performant. In the worst-case scenario, where all the caches mentioned above are empty and/or unavailable, the fallback to static configuration needs to happen very fast to avoid thundering-herd problems, probably in less than 500 ms (if not even less). This is going to be somewhat interesting to calculate, as the static-config fallback is the third step per my understanding, and the timeouts for each of the previous lookups will need to be taken into consideration and summed up (see the sketch after this list).
  • The application part implicitly assumes that it is going to be publicly available (to be editable by people creating instruments). Correct me if I am wrong. This isn't the default for most services deployed currently, and thus needs to be written down explicitly to avoid misunderstandings and miscommunications during deployment.
  • Pertinent to the above, we'll need to know which URL/domain this is going to be accessible under.
  • It is not clear what an "instrument" is in the context of this proposal. Furthermore, the instrument is going to be delivered to end users via ResourceLoader. It would help with both if there were some example showcasing how this is currently envisioned to work in practice.
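To make the timeout-summation point concrete, here is a minimal sketch of such a cascading lookup. This is illustrative only: the tier names (cache, configuration service, static fallback), the helper names (withTimeout, fetchInstrumentConfig), and the timeout values are assumptions for the purpose of the example, not the actual Metrics Platform design, and the real code would live in the MediaWiki extension rather than TypeScript. What it shows is that the worst-case time to reach the static fallback is the sum of the per-tier timeouts, so each tier's budget has to be chosen with the overall (< 500 ms) budget in mind.

```lang=typescript
// Illustrative sketch only: tier names, ordering and timeout values are assumptions,
// not the actual Metrics Platform design. The worst-case time to reach the static
// fallback is the SUM of the per-tier timeouts.

type InstrumentConfig = Record<string, unknown>;

const STATIC_FALLBACK_CONFIG: InstrumentConfig = {}; // shipped with the extension

// Per-tier timeouts; worst case to fallback = 50 + 250 = 300 ms, under a 500 ms budget.
const CACHE_TIMEOUT_MS = 50;
const SERVICE_TIMEOUT_MS = 250;

async function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms),
    ),
  ]);
}

async function fetchInstrumentConfig(
  cacheLookup: () => Promise<InstrumentConfig | null>,
  serviceLookup: () => Promise<InstrumentConfig>,
): Promise<InstrumentConfig> {
  // Tier 1: cache lookup.
  try {
    const cached = await withTimeout(cacheLookup(), CACHE_TIMEOUT_MS);
    if (cached !== null) {
      return cached;
    }
  } catch {
    // Cache miss, error or timeout: fall through to the service.
  }

  // Tier 2: the configuration service (cross-DC in the dse-k8s scenario).
  try {
    return await withTimeout(serviceLookup(), SERVICE_TIMEOUT_MS);
  } catch {
    // Tier 3: static configuration, reached after at most ~300 ms in this sketch.
    return STATIC_FALLBACK_CONFIG;
  }
}
```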

Thank you, @akosiaris. I've just added you as a subscriber to {T358115}. Let me know if it automatically granted you access.

  • There are multiple caches mentioned in the design doc. It is my current understanding that they are pretty crucial for the performance of both the app AND the bridging MW extension. They are, however, not described in much detail, and I'd like to see a more detailed description of them. My goal is to understand possible failure patterns and scenarios.
  • If this bridging extension is going to participate in the critical path of every end-user request, which my current understanding is it will, it is imperative that it remains very performant. In the worst-case scenario, where all the caches mentioned above are empty and/or unavailable, the fallback to static configuration needs to happen very fast to avoid thundering-herd problems, probably in less than 500 ms (if not even less). This is going to be somewhat interesting to calculate, as the static-config fallback is the third step per my understanding, and the timeouts for each of the previous lookups will need to be taken into consideration and summed up.
  • The application part implicitly assumes that it is going to be publicly available (to be editable by people creating instruments). Correct me if I am wrong. This isn't the default for most services deployed currently, and thus needs to be written down explicitly to avoid misunderstandings and miscommunications during deployment.
  • Pertinent to the above, we'll need to know which URL/domain this is going to be accessible under.

I've updated the doc and responded to your comments around these points.

  • The decision has been taken to host the application part of this in dse-k8s, which also implies it is going to be eqiad-only, at least in the beginning. This means that for 6 months out of 12, the bridging extension of MediaWiki will need to pay the latency cost of reaching out across the 2 DCs. While the service mesh we have will alleviate some of that by maintaining persistent HTTP connections and paying the TLS negotiation cost only once, this is still at least 40 ms (and that only in the absolute best-case scenario) plus the time the app will take to respond. If this ends up being in the critical path of end-user requests, it will create a significant performance regression for all users for 6 months.

This still needs more thought. I've proposed a budget of 250 ms for fetching the configuration from the app before falling back to default (static) configuration. 40+ ms best case is a huge chunk of that. Perhaps we should target deploying to WikiKube straight away.
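For a rough sense of how much of that 250 ms budget is left for the application itself under each deployment option, here is a back-of-the-envelope check. The overhead figures are illustrative assumptions taken from the discussion above, not measured numbers.

```lang=typescript
// Back-of-the-envelope budget check (illustrative assumptions, not agreed figures).
const FETCH_BUDGET_MS = 250;     // proposed budget before falling back to static config

// Best-case network overhead for the two deployment options under discussion.
const CROSS_DC_OVERHEAD_MS = 40; // dse-k8s (eqiad only): cross-DC hop for ~6 months/year
const SAME_DC_OVERHEAD_MS = 1;   // WikiKube: service co-located with MediaWiki

for (const [option, overhead] of [
  ["dse-k8s (cross-DC)", CROSS_DC_OVERHEAD_MS],
  ["WikiKube (same DC)", SAME_DC_OVERHEAD_MS],
] as const) {
  const appBudget = FETCH_BUDGET_MS - overhead;
  console.log(`${option}: ~${appBudget} ms left for the app to respond`);
}
```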

  • It is not clear what an "instrument" is in the context of this proposal. Furthermore, the instrument is going to be delivered to end users via ResourceLoader. It would help with both if there were some example showcasing how this is currently envisioned to work in practice.

Thanks for this. I'll start working on a comprehensive overview/glossary and supporting diagrams.

  • The decision has been taken to host the application part of this in dse-k8s, which also implies it is going to be eqiad-only, at least in the beginning. This means that for 6 months out of 12, the bridging extension of MediaWiki will need to pay the latency cost of reaching out across the 2 DCs. While the service mesh we have will alleviate some of that by maintaining persistent HTTP connections and paying the TLS negotiation cost only once, this is still at least 40 ms (and that only in the absolute best-case scenario) plus the time the app will take to respond. If this ends up being in the critical path of end-user requests, it will create a significant performance regression for all users for 6 months.

This still needs more thought. I've proposed a budget of 250 ms for fetching the configuration from the app before falling back to default (static) configuration. 40+ ms best case is a huge chunk of that. Perhaps we should target deploying to WikiKube straight away.

Alternatively, we could propose a significantly smaller response time for the GET /api/v1/instruments route, e.g. a median response time of 50 ms.
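If a stricter target like a 50 ms median for GET /api/v1/instruments were adopted, a simple client-side probe along these lines could be used to spot-check it. This is a hypothetical sketch: the base URL is a placeholder (the public domain is still an open question above), the sample size is arbitrary, and a real SLO would be tracked from server-side metrics rather than an ad-hoc client.

```lang=typescript
// Hypothetical latency spot-check for a proposed GET /api/v1/instruments median target.
// Base URL and sample size are placeholders; a production SLO would be measured from
// server-side metrics (e.g. latency histograms), not a one-off client probe.

const BASE_URL = "https://example.invalid"; // placeholder; actual domain is TBD
const SAMPLES = 100;
const TARGET_MEDIAN_MS = 50;

async function measureMedianLatency(): Promise<number> {
  const timings: number[] = [];
  for (let i = 0; i < SAMPLES; i++) {
    const start = performance.now();
    const res = await fetch(`${BASE_URL}/api/v1/instruments`);
    await res.arrayBuffer(); // make sure the full body is received
    timings.push(performance.now() - start);
  }
  timings.sort((a, b) => a - b);
  return timings[Math.floor(timings.length / 2)];
}

measureMedianLatency().then((median) => {
  const verdict = median <= TARGET_MEDIAN_MS ? "meets" : "misses";
  console.log(`median ${median.toFixed(1)} ms ${verdict} the ${TARGET_MEDIAN_MS} ms target`);
});
```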