
Service Ops Review of Metrics Platform Configuration Management UI
Open, High, Public

Description

As part of the SDS 2.5.2 hypothesis, Data Products is building a Configuration Management UI to enable Product Owners to easily create and manage metrics instrumentation collection and, ultimately, full experiments.

Request for review by SRE

We have two main documents we would like SRE input on:

For additional context, we also have:

Event Timeline

WDoranWMF created this task.

Hello @Kappakayala, please let us know if the March 22nd deadline is still feasible for this review. The most important thing is to learn early from your team whether there are any showstoppers from your perspective.

Hi @MShilova_WMF. This is on my list for today, though it might spill into early next week. I've started the review, but I don't seem to have access to T358115 (linked from the description); could you please grant me access?

Hi,

I've already left various comments on the two docs. I am still going through the Miro board, but I can summarize the following:

This proposal shows promise; thanks for working on it and reaching out to us early in the process, I appreciate it very much. I would like some more clarifications first, though.

  • There are multiple caches mentioned in the design doc. It is my current understanding that they are pretty crucial for the performance of both the app AND the bridging MW extension. They are, however, not described in much detail, and I'd like to see a more detailed description of them. My goal is to understand possible failure patterns and scenarios.
  • The decision has been taken to host the application part of this in dse-k8s, which also implies it is going to be eqiad-only, at least in the beginning. This means that for 6 months out of 12, the bridging extension of MediaWiki will need to pay the latency cost of reaching out across the 2 DCs. While the service mesh we have will alleviate some of that by maintaining persistent HTTP connections and paying the TLS negotiation cost only once, this is still at least 40 ms (and that only in the absolute best-case scenario) plus the time the app will take to respond. If this ends up being in the critical path of end-user requests, it will create a significant performance regression for all users for 6 months.
  • If this bridging extension is going to participate in the critical path of every end-user request, which my current understanding is it will, it is imperative that it remains very performant. In the worst-case scenario, where all the caches mentioned above are empty and/or unavailable, the fallback to static configuration needs to happen very fast to avoid thundering-herd problems, probably in less than 500 ms (if not even less). This is going to be somewhat interesting to calculate, as the static-config fallback is the third step per my understanding, and the timeouts for each of the previous lookups will need to be taken into consideration and summed up (see the sketch after this list).
  • The application part implicitly assumes that it is going to be publicly available (to be editable by people creating instruments). Correct me if I am wrong. This isn't the default for most services deployed currently, and thus needs to be written down explicitly to avoid misunderstandings and miscommunications during deployment.
  • Pertinent to the above, we'll need to know which URL/domain this is going to be accessible under.
  • It is not clear what an "instrument" is in the context of this proposal. Furthermore, the instrument is going to be delivered to end users via ResourceLoader. It would help with both if there were some example showcasing how this is currently envisioned to work in practice.
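To make the timeout-summation point concrete, here is a minimal sketch of such a cascading lookup. This is illustrative only: the tier names (cache, configuration service, static fallback), the helper names (withTimeout, fetchInstrumentConfig), and the timeout values are assumptions for the purpose of the example, not the actual Metrics Platform design, and the real code would live in the MediaWiki extension rather than TypeScript. What it shows is that the worst-case time to reach the static fallback is the sum of the per-tier timeouts, so each tier's budget has to be chosen with the overall (< 500 ms) budget in mind.

```lang=typescript
// Illustrative sketch only: tier names, ordering and timeout values are assumptions,
// not the actual Metrics Platform design. The worst-case time to reach the static
// fallback is the SUM of the per-tier timeouts.

type InstrumentConfig = Record<string, unknown>;

const STATIC_FALLBACK_CONFIG: InstrumentConfig = {}; // shipped with the extension

// Per-tier timeouts; worst case to fallback = 50 + 250 = 300 ms, under a 500 ms budget.
const CACHE_TIMEOUT_MS = 50;
const SERVICE_TIMEOUT_MS = 250;

async function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms),
    ),
  ]);
}

async function fetchInstrumentConfig(
  cacheLookup: () => Promise<InstrumentConfig | null>,
  serviceLookup: () => Promise<InstrumentConfig>,
): Promise<InstrumentConfig> {
  // Tier 1: cache lookup.
  try {
    const cached = await withTimeout(cacheLookup(), CACHE_TIMEOUT_MS);
    if (cached !== null) {
      return cached;
    }
  } catch {
    // Cache miss, error or timeout: fall through to the service.
  }

  // Tier 2: the configuration service (cross-DC in the dse-k8s scenario).
  try {
    return await withTimeout(serviceLookup(), SERVICE_TIMEOUT_MS);
  } catch {
    // Tier 3: static configuration, reached after at most ~300 ms in this sketch.
    return STATIC_FALLBACK_CONFIG;
  }
}
```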

Thank you, @akosiaris. I've just added you as a subscriber to {T358115}. Let me know if it automatically granted you access.

  • There are multiple caches mentioned in the design doc. It is my current understanding that they are pretty crucial for the performance of both the app AND the bridging MW extension. They are, however, not described in much detail, and I'd like to see a more detailed description of them. My goal is to understand possible failure patterns and scenarios.
  • If this bridging extension is going to participate in the critical path of every end-user request, which my current understanding is it will, it is imperative that it remains very performant. In the worst-case scenario, where all the caches mentioned above are empty and/or unavailable, the fallback to static configuration needs to happen very fast to avoid thundering-herd problems, probably in less than 500 ms (if not even less). This is going to be somewhat interesting to calculate, as the static-config fallback is the third step per my understanding, and the timeouts for each of the previous lookups will need to be taken into consideration and summed up.
  • The application part implicitly assumes that it is going to be publicly available (to be editable by people creating instruments). Correct me if I am wrong. This isn't the default for most services deployed currently, and thus needs to be written down explicitly to avoid misunderstandings and miscommunications during deployment.
  • Pertinent to the above, we'll need to know which URL/domain this is going to be accessible under.

I've updated the doc and responded to your comments around these points.

  • The decision has been taken to host the application part of this in dse-k8s, which also implies it is going to be eqiad-only, at least in the beginning. This means that for 6 months out of 12, the bridging extension of MediaWiki will need to pay the latency cost of reaching out across the 2 DCs. While the service mesh we have will alleviate some of that by maintaining persistent HTTP connections and paying the TLS negotiation cost only once, this is still at least 40 ms (and that only in the absolute best-case scenario) plus the time the app will take to respond. If this ends up being in the critical path of end-user requests, it will create a significant performance regression for all users for 6 months.

This still needs more thought. I've proposed a budget of 250 ms for fetching the configuration from the app before falling back to default (static) configuration. 40+ ms best case is a huge chunk of that. Perhaps we should target deploying to WikiKube straight away.
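For a rough sense of how much of that 250 ms budget is left for the application itself under each deployment option, here is a back-of-the-envelope check. The overhead figures are illustrative assumptions taken from the discussion above, not measured numbers.

```lang=typescript
// Back-of-the-envelope budget check (illustrative assumptions, not agreed figures).
const FETCH_BUDGET_MS = 250;     // proposed budget before falling back to static config

// Best-case network overhead for the two deployment options under discussion.
const CROSS_DC_OVERHEAD_MS = 40; // dse-k8s (eqiad only): cross-DC hop for ~6 months/year
const SAME_DC_OVERHEAD_MS = 1;   // WikiKube: service co-located with MediaWiki

for (const [option, overhead] of [
  ["dse-k8s (cross-DC)", CROSS_DC_OVERHEAD_MS],
  ["WikiKube (same DC)", SAME_DC_OVERHEAD_MS],
] as const) {
  const appBudget = FETCH_BUDGET_MS - overhead;
  console.log(`${option}: ~${appBudget} ms left for the app to respond`);
}
```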

  • It is not clear what an "instrument" is in the context of this proposal. Furthermore, the instrument is going to be delivered to end users via ResourceLoader. It would help with both if there were some example showcasing how this is currently envisioned to work in practice.

Thanks for this. I'll start working on a comprehensive overview/glossary and supporting diagrams.

  • The decision has been taken to host the application part of this in dse-k8s, which also implies it is going to be eqiad-only, at least in the beginning. This means that for 6 months out of 12, the bridging extension of MediaWiki will need to pay the latency cost of reaching out across the 2 DCs. While the service mesh we have will alleviate some of that by maintaining persistent HTTP connections and paying the TLS negotiation cost only once, this is still at least 40 ms (and that only in the absolute best-case scenario) plus the time the app will take to respond. If this ends up being in the critical path of end-user requests, it will create a significant performance regression for all users for 6 months.

This still needs more thought. I've proposed a budget of 250 ms for fetching the configuration from the app before falling back to default (static) configuration. 40+ ms best case is a huge chunk of that. Perhaps we should target deploying to WikiKube straight away.

Alternatively, we could propose a significantly smaller response time for the GET /api/v1/instruments route, e.g. a median response time of 50 ms.
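If a stricter target like a 50 ms median for GET /api/v1/instruments were adopted, a simple client-side probe along these lines could be used to spot-check it. This is a hypothetical sketch: the base URL is a placeholder (the public domain is still an open question above), the sample size is arbitrary, and a real SLO would be tracked from server-side metrics rather than an ad-hoc client.

```lang=typescript
// Hypothetical latency spot-check for a proposed GET /api/v1/instruments median target.
// Base URL and sample size are placeholders; a production SLO would be measured from
// server-side metrics (e.g. latency histograms), not a one-off client probe.

const BASE_URL = "https://example.invalid"; // placeholder; actual domain is TBD
const SAMPLES = 100;
const TARGET_MEDIAN_MS = 50;

async function measureMedianLatency(): Promise<number> {
  const timings: number[] = [];
  for (let i = 0; i < SAMPLES; i++) {
    const start = performance.now();
    const res = await fetch(`${BASE_URL}/api/v1/instruments`);
    await res.arrayBuffer(); // make sure the full body is received
    timings.push(performance.now() - start);
  }
  timings.sort((a, b) => a - b);
  return timings[Math.floor(timings.length / 2)];
}

measureMedianLatency().then((median) => {
  const verdict = median <= TARGET_MEDIAN_MS ? "meets" : "misses";
  console.log(`median ${median.toFixed(1)} ms ${verdict} the ${TARGET_MEDIAN_MS} ms target`);
});
```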