
Load test current state of the Article Topic service
Closed, ResolvedPublic

Description

To find the limits of the current production service, we will run load tests to verify the number of requests per second we can safely serve on LiftWing.
We are doing this because the service went through a couple of implementation changes in the past months. Additionally, previous load tests were performed against staging, and findings suggest that their results might have been unreliable due to the load test configuration.

Current state and assumptions for load tests:

  • I'm testing against internal production endpoint https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict.
  • I'm using page_id and lang parameters, which offer the best performance.
  • The current production deployment scales up to 5 replicas, so the tests reflect the performance of the current 5-replica setup.
  • I'm using a set of ~6000 unique page_id values for enwiki during testing.
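The setup above can be sketched as a small closed-loop load test: a fixed pool of workers, each sending one request at a time, with per-request latencies collected for the summary. This is a minimal sketch, not the script actually used for this task. The names `run_load_test`, `make_http_sender` and `percentile` are hypothetical, the JSON body shape (`page_id`, `lang`) is assumed from the parameters listed above, and any extra headers or TLS settings the internal endpoint may require are not shown.

```python
import json
import math
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(len(sorted_values) * pct / 100))
    return sorted_values[rank - 1]


def make_http_sender(url, timeout=10.0):
    """Build a send(payload) callable that POSTs JSON and reports success.

    Hypothetical helper: header and TLS requirements of the real endpoint
    are assumptions and may need adjusting.
    """
    def send(payload):
        request = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    return send


def timed_call(send, payload):
    """Call send(payload) and return (succeeded, latency_in_seconds)."""
    start = time.perf_counter()
    try:
        ok = bool(send(payload))
    except Exception:
        ok = False  # any exception counts as a failed request
    return ok, time.perf_counter() - start


def run_load_test(send, payloads, workers=20):
    """Send every payload through a fixed-size worker pool and summarize."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: timed_call(send, p), payloads))
    total_time = time.perf_counter() - started

    latencies = sorted(latency for _, latency in results)
    successes = sum(1 for ok, _ in results if ok)
    return {
        "requests": len(results),
        "success": successes,
        "failures": len(results) - successes,
        "total_time": total_time,
        # Successful requests per second; how the reported throughput
        # figure is defined is an assumption here.
        "throughput": successes / total_time if total_time > 0 else 0.0,
        "latency_avg": mean(latencies),
        "latency_p50": percentile(latencies, 50),
        "latency_p95": percentile(latencies, 95),
        "latency_p99": percentile(latencies, 99),
    }
```

A run against the endpoint would then look like `run_load_test(make_http_sender(url), [{"page_id": pid, "lang": "en"} for pid in page_ids], workers=20)`, where `page_ids` is the list of unique page IDs. Taking `send` as a parameter keeps the measuring logic testable without network access.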

Event Timeline

I'm using page_id and lang parameters, which offer the best performance.

@BWojtowicz-WMF is page_id the only way to get cache support or would requesting with page_title still hit the cache (though it sounds like slightly less efficiently)? Context: we have a prototype tool that gathers topic predictions for lists of articles but it currently uses the page title to make requests.

@Isaac The details of the cache, and how exactly it will be implemented for Article Topics, are still not fully decided. Current approaches we explored would work with page_id, whereas page_title requests would not go through cache. This ticket does not take the cache into consideration; we're verifying how fast we can get without it. As a bonus, I can also check the page_title variant in this ticket so we'll have more context on it :)

Current approaches we explored would work with page_id, whereas page_title requests would not go through cache.

Understood -- at the point where you have to make an API call to get the pageID for a given title, might as well just compute the full thing. I'll see if there's a cheap way for us to use pageID within the context of the tool.

As a bonus, I can also check the page_title variant in this ticket so we'll have more context on it :)

Thanks!

I'm sharing load test numbers from the production deployment on eqiad, tested via the internal endpoint. I've made sure the responses return valid predictions, and I ran the load test after a few hours of cooldown to make sure the results are not skewed by caching on the MWAPI side.

  1. Requests using page_id and lang parameters:
Results (5000 requests, 20 workers)
  Total time:    14.84s
  Throughput:    336.91 req/s
  Success:       5000/5000
  Failures:      0
  Latency avg:   0.059s
  Latency p50:   0.047s
  Latency p95:   0.133s
  Latency p99:   0.255s
  2. Requests using page_title and lang parameters:
Results (5000 requests, 20 workers)
  Total time:    22.80s
  Throughput:    219.29 req/s
  Success:       5000/5000
  Failures:      0
  Latency avg:   0.089s
  Latency p50:   0.077s
  Latency p95:   0.187s
  Latency p99:   0.309s

I've also run an experiment with a setup where we use page_id + lang + revision_id information. Those results are shared below. It seems the overhead of the additional query for getting outlinks linked to a specific revision_id is significant, especially under load test scenario:

Results (5000 requests, 20 workers)
  Total time:    141.58s
  Throughput:    35.18 req/s
  Success:       4981/5000
  Failures:      19
  Latency avg:   0.564s
  Latency p50:   0.459s
  Latency p95:   1.473s
  Latency p99:   2.714s
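
As a quick arithmetic cross-check of the three result blocks above, the sketch below recomputes throughput from the totals and applies Little's law (in-flight requests ≈ throughput × average latency). It only uses the figures reported above; the interpretation is an assumption. The implied concurrency lands just under the 20 workers in all three runs, which is consistent with a closed-loop test where each worker waits for a response before sending the next request. Under that assumption the numbers describe what 20 concurrent clients observe, not necessarily the service's maximum.

```python
WORKERS = 20

# Figures copied from the three "Results (5000 requests, 20 workers)" blocks.
runs = {
    "page_id + lang": {
        "success": 5000, "total_time": 14.84,
        "throughput": 336.91, "latency_avg": 0.059,
    },
    "page_title + lang": {
        "success": 5000, "total_time": 22.80,
        "throughput": 219.29, "latency_avg": 0.089,
    },
    "page_id + lang + revision_id": {
        "success": 4981, "total_time": 141.58,
        "throughput": 35.18, "latency_avg": 0.564,
    },
}


def derived(run):
    """Recompute throughput and the implied number of in-flight requests."""
    return {
        # Successful requests divided by wall-clock time.
        "throughput_from_totals": run["success"] / run["total_time"],
        # Little's law: concurrency = throughput * average latency.
        "implied_concurrency": run["throughput"] * run["latency_avg"],
    }


for name, run in runs.items():
    d = derived(run)
    print(
        f"{name}: {d['throughput_from_totals']:.2f} req/s from totals, "
        f"~{d['implied_concurrency']:.1f} of {WORKERS} workers busy"
    )
```

The recomputed throughput only matches the reported value for the revision_id run when successful requests, rather than all 5000, are divided by the total time.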

@BWojtowicz-WMF thanks for running these tests! Although the results look great, the Grafana dashboard doesn't align 100% with them: I can see some increased latencies in the preprocess step (>10s). Could that be because of the horizontal scaling and the pods being spawned, or something else?

@isarantopoulos

I see the regime with >10s p99 latencies; however, it happened during the night and not while these tests were running. It seems to me that the Grafana numbers align well with the latencies reported above, see:

  1. page_id + lang requests: https://grafana.wikimedia.org/goto/cfgzhd4aveg3kf?orgId=1
  2. page_title + lang requests: https://grafana.wikimedia.org/goto/ffgzhfegn63uoc?orgId=1
  3. page_id + lang + revision_id requests: https://grafana.wikimedia.org/goto/bfgzhglvqmpdse?orgId=1

From my experience, the very long preprocessing step is due to fetching outlinks for a specific revision_id: our changeprop configuration contains revision_id information, so each event is processed via the "slow" route, which sometimes gets really slow.


One more inconsistency with Grafana is on RPS: Grafana never showed >300 RPS throughput, but topped out at ~150 RPS (link here). However, this can be explained by smoothing, which results from both the Grafana time range (currently set to 5 min) and the Prometheus scraping interval.
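
The smoothing effect can be illustrated with a rough model: a short burst of requests, averaged over a rate window longer than the burst, shows up as a lower peak. This is only a sketch. The window lengths below are hypothetical, since the actual query window of the dashboard is not stated here, and real Prometheus rate calculations also depend on scrape timing.

```python
def windowed_rate(total_requests, burst_seconds, window_seconds):
    """Highest average rate a single burst can show over a given window.

    If the window is shorter than the burst, the true burst rate is visible;
    otherwise the burst is spread across the whole window.
    """
    return total_requests / max(burst_seconds, window_seconds)


# The page_id + lang run: 5000 requests in 14.84 s.
for window in (10, 30, 60):  # hypothetical rate windows, in seconds
    rate = windowed_rate(5000, 14.84, window)
    print(f"window={window:>3}s -> peak of about {rate:.0f} req/s")
```

In this model, any averaging window longer than the roughly 15 s burst lowers the displayed peak below the measured throughput.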

Great, thanks for clarifying!
@Seddon are the numbers reported above (T420931#11742010) good for you in case you integrate directly with LiftWing?

It seems the overhead of the additional query for getting outlinks linked to a specific revision_id is significant, especially under load test scenario

I'd guess this is because the outlinks are obtained via a MW API that ultimately requires rendered HTML? I can't recall which API is used to obtain the outlinks (do you get HTML and then parse it yourself?), but either way an old revision is unlikely to be in the parser cache (and not in the pagelinks MW table either), which means the API has to convert from wikitext -> HTML when you request it. We are dealing with a similar problem in T360794: Event stream with latest revision HTML & parent revision HTML diff when trying to test a 'backfill', and also because we request the parent revision HTML.

BTW, when we did T328899: Add a new outlink topic stream for EventGate main, we intended (T328899#8661226) to also do T331399: Create new mediawiki links change streams based on fragment/mediawiki/state/change/page and use that as the input to the model, rather than LiftWing having to request or compute the outlinks itself. But it was never prioritized 😢. Just an FYI! Not really related to your main use case here (testing the request latency of the full endpoint).