awight (Adam Roses Wight)
User

Projects (14)

User Details

User Since
Oct 12 2014, 9:02 PM (165 w, 5 d)
Availability
Available
IRC Nick
awight
LDAP User
Awight
MediaWiki User
Awight (WMF)

Recent Activity

Yesterday

awight created T182942: Tests should have covered regression in T182936.
Fri, Dec 15, 1:56 AM · Scoring-platform-team (Current), User-Ladsgroup, MediaWiki-extensions-ORES, Collaboration-Team-Triage (Collab-Team-This-Quarter)

Thu, Dec 14

awight added a project to T182921: Extension:ORES caused MW train rollback: Wikimedia-Incident.
Thu, Dec 14, 10:22 PM · Wikimedia-Incident, Scoring-platform-team, MediaWiki-extensions-ORES
awight created T182921: Extension:ORES caused MW train rollback.
Thu, Dec 14, 10:22 PM · Wikimedia-Incident, Scoring-platform-team, MediaWiki-extensions-ORES
awight created T182915: Look at additional uWSGI metrics for potential use in the ORES dashboard.
Thu, Dec 14, 9:35 PM · Scoring-platform-team, ORES, Graphite
awight created T182914: Add a graph of ORES Celery task queue length.
Thu, Dec 14, 9:34 PM · Scoring-platform-team, Graphite, ORES
awight added a comment to T182799: Make sure ORES is compatible with stretch.

I'm reconsidering my proposal to use python3.4. It's only available by adding jessie as an apt source, and causes some annoying dependency fu such as downgrading findutils and maybe other packages.

Thu, Dec 14, 9:30 PM · Scoring-platform-team, ORES
awight closed T181552: Git refusing to clone some ORES submodules as Resolved.

I haven't seen this issue in a few weeks, closing. Thank you!

Thu, Dec 14, 9:25 PM · Scoring-platform-team, Release-Engineering-Team (Kanban), User-Ladsgroup, Operations, User-Joe, ORES
awight closed T181552: Git refusing to clone some ORES submodules, a subtask of T168073: Switch ORES to dedicated cluster, as Resolved.
Thu, Dec 14, 9:25 PM · ORES, Scoring-platform-team
awight changed the status of T168073: Switch ORES to dedicated cluster from Stalled to Open.

Unstalling, now that the stress testing is complete.

Thu, Dec 14, 9:24 PM · ORES, Scoring-platform-team
awight changed the status of T168073: Switch ORES to dedicated cluster, a subtask of T179501: Use external dsh group to list pooled ORES nodes, from Stalled to Open.
Thu, Dec 14, 9:24 PM · Scap, ORES, Scoring-platform-team, Operations
awight changed the status of T168073: Switch ORES to dedicated cluster, a subtask of T176324: Scoring platform team FY18 Q2, from Stalled to Open.
Thu, Dec 14, 9:24 PM · Scoring-platform-team (Current)
awight added a subtask for T168073: Switch ORES to dedicated cluster: T182799: Make sure ORES is compatible with stretch.
Thu, Dec 14, 9:23 PM · ORES, Scoring-platform-team
awight added a parent task for T182799: Make sure ORES is compatible with stretch: T168073: Switch ORES to dedicated cluster.
Thu, Dec 14, 9:23 PM · Scoring-platform-team, ORES
awight closed T169246: Stress/capacity test new ores* cluster as Resolved.

Ok, done for real now. @Halfak and I decided that the remaining bottlenecks are something non-trivial and we don't want to block on further optimization.

Thu, Dec 14, 9:23 PM · User-Ladsgroup, Operations, User-Joe, Scoring-platform-team (Current), Patch-For-Review, ORES
awight closed T169246: Stress/capacity test new ores* cluster, a subtask of T168073: Switch ORES to dedicated cluster, as Resolved.
Thu, Dec 14, 9:23 PM · ORES, Scoring-platform-team
awight closed T169246: Stress/capacity test new ores* cluster, a subtask of T171851: Reimage ores* hosts with Debian Stretch, as Resolved.
Thu, Dec 14, 9:23 PM · Scoring-platform-team, Operations, ORES
awight removed a parent task for T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests: T169246: Stress/capacity test new ores* cluster.
Thu, Dec 14, 9:17 PM · Scoring-platform-team (Current), Operations, Patch-For-Review, Performance, ORES
awight removed a subtask for T169246: Stress/capacity test new ores* cluster: T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests.
Thu, Dec 14, 9:17 PM · User-Ladsgroup, Operations, User-Joe, Scoring-platform-team (Current), Patch-For-Review, ORES
awight changed the status of T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests from Open to Stalled.

I think we've got our tuning parameters! 45 minutes of overload, and everything looks good. Available memory hasn't dropped below 8GB. We'll be eating into this margin as we add models, so it's perfect to leave this way for now.

Thu, Dec 14, 9:16 PM · Scoring-platform-team (Current), Operations, Patch-For-Review, Performance, ORES
awight changed the status of T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests, a subtask of T169246: Stress/capacity test new ores* cluster, from Open to Stalled.
Thu, Dec 14, 9:16 PM · User-Ladsgroup, Operations, User-Joe, Scoring-platform-team (Current), Patch-For-Review, ORES
awight created T182912: scap deploy --service-restart doesn't affect ORES celery.
Thu, Dec 14, 8:26 PM · Scap, Scoring-platform-team, ORES
awight added a comment to T182876: ORES timeout error graph is incorrect.

The last comment was wrong: I see how the timeout is caught and metrics are recorded. I currently can't find any code paths to explain the missing metrics.

Thu, Dec 14, 7:44 PM · Scoring-platform-team, Graphite, ORES
awight added a comment to T182876: ORES timeout error graph is incorrect.

There's one code path that can throw a TimeoutError without adding to this metric: the outer timeout in ores/util.py. Interesting that we're hitting this code; I don't think that should be happening.

Thu, Dec 14, 7:26 PM · Scoring-platform-team, Graphite, ORES
awight added a comment to T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests.

I'm happy with that. It looks like it's going to be difficult to break through this ceiling, and we have lots of data to compare against when we try again, somewhere experimental. The capacity is fine, overload behavior is much better than before.

Thu, Dec 14, 6:46 PM · Scoring-platform-team (Current), Operations, Patch-For-Review, Performance, ORES
awight added a comment to T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests.

Ran a tricky test, in which I stepped up from 1 to 3 test harnesses, then back down.

  • tester 1 ran 17:03 – 18:09
  • tester 2 ran 17:11 – 17:51
  • tester 3 ran 17:31 – 17:40
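
To read the dashboard against this schedule, it helps to know how many harnesses were running at each moment. A minimal sketch that derives this from the three start and end times listed above; the helper name `concurrency_timeline` is made up for illustration:

```python
from datetime import datetime

# Start/end times of each test harness, copied from the list above.
RUNS = {
    "tester 1": ("17:03", "18:09"),
    "tester 2": ("17:11", "17:51"),
    "tester 3": ("17:31", "17:40"),
}


def concurrency_timeline(runs):
    """Return (start, end, active_count) segments covering all runs."""
    fmt = "%H:%M"
    events = []
    for start, end in runs.values():
        events.append((datetime.strptime(start, fmt), +1))  # harness starts
        events.append((datetime.strptime(end, fmt), -1))    # harness stops
    events.sort()

    segments = []
    active = 0
    previous = None
    for moment, delta in events:
        # Close the segment that ran from the previous event to this one.
        if previous is not None and moment > previous and active > 0:
            segments.append(
                (previous.strftime(fmt), moment.strftime(fmt), active)
            )
        active += delta
        previous = moment
    return segments


for start, end, count in concurrency_timeline(RUNS):
    print(f"{start} to {end}: {count} harness(es) running")
```

The segments step up 1, 2, 3 and back down 2, 1, with all three harnesses overlapping only between 17:31 and 17:40.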
Thu, Dec 14, 6:10 PM · Scoring-platform-team (Current), Operations, Patch-For-Review, Performance, ORES
awight added a comment to T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests.

Ran another test:
https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1513256100000&to=1513257600000
Test conditions:

  • Only 8 of the ores* machines were online
  • Test harness was hitting 7 machines' uwsgi endpoints
  • Using a delay of 0.001s.
  • uWSGI workers: 230 per machine
  • Celery workers: 150 per machine
Thu, Dec 14, 1:38 PM · Scoring-platform-team (Current), Operations, Patch-For-Review, Performance, ORES
awight created T182876: ORES timeout error graph is incorrect.
Thu, Dec 14, 1:36 PM · Scoring-platform-team, Graphite, ORES
awight created T182871: ORES web worker memory usage graph is meaningless.
Thu, Dec 14, 1:16 PM · Graphite, Scoring-platform-team, ORES
awight closed T181661: Connection timeout from tin to new ores servers as Resolved.

Using a workaround for T182865, where we go into submodules and checkout master before submodule updating from the ORES root, I was able to successfully parallel deploy to ores*, and in under a minute! It looks like the SSH timeout is resolved!!

Thu, Dec 14, 12:45 PM · Patch-For-Review, Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES
awight closed T181661: Connection timeout from tin to new ores servers, a subtask of T168073: Switch ORES to dedicated cluster, as Resolved.
Thu, Dec 14, 12:45 PM · ORES, Scoring-platform-team
awight added a comment to T181661: Connection timeout from tin to new ores servers.

Oops, we aren't expecting this repo to be mirrored to gerrit. So the surprise is that the revision exists in Phabricator but can't be found by the deployment tool. Making a new task: T182865: Source revision is in Phabricator, but can't be found by deployment tools

Thu, Dec 14, 12:24 PM · Patch-For-Review, Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES
awight created T182865: Source revision is in Phabricator, but can't be found by deployment tools.
Thu, Dec 14, 12:24 PM · Release-Engineering-Team (Kanban), Scoring-platform-team, Scap, ORES
awight added a comment to T181661: Connection timeout from tin to new ores servers.

The Phabricator control panels look happy,
https://phabricator.wikimedia.org/source/editquality/manage/uris/ shows that we're observing GitHub,
https://github.com/wiki-ai/editquality/commit/15d5283b7422919d85203b5ba907027f9356e421 that commit has been in GitHub for 3 days,
https://gerrit.wikimedia.org/r/#/q/15d5283b7422919d85203b5ba907027f9356e421 It's not in Gerrit yet,
https://phabricator.wikimedia.org/source/editquality/manage/basics/ "Last updated Thu, Dec 14, 7:07 AM (9 m, 19 s ago)."

Thu, Dec 14, 12:18 PM · Patch-For-Review, Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES
awight created T182864: Access request: Phabricator Repository-Admins.
Thu, Dec 14, 12:14 PM · Repository-Admins, Phabricator, Scoring-platform-team
awight added a comment to T181661: Connection timeout from tin to new ores servers.

Those revisions aren't in gerrit. I think the github -> gerrit mirroring broke when we were messing around with pointing to git-lfs repos?

Thu, Dec 14, 12:06 PM · Patch-For-Review, Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES
awight added a comment to T182354: OresDamagingPref back-compatibility is logging exceptions.

@Johan This only affected fawiki, to my knowledge. Feel free to change tags as necessary, thanks!

Thu, Dec 14, 11:36 AM · Wikimedia-Incident, User-notice-collaboration, Regression, Patch-For-Review, Wikimedia-log-errors, Scoring-platform-team, MediaWiki-extensions-ORES
awight created T182862: Exception killing threads in ORES celery workers.
Thu, Dec 14, 11:35 AM · Scoring-platform-team, ORES
awight added a comment to T181661: Connection timeout from tin to new ores servers.

Not sure if this is related, but now I'm seeing a deploy-local failure with no diagnostics other than error code 70:

Thu, Dec 14, 11:28 AM · Patch-For-Review, Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES
awight created P6463 (An Untitled Masterwork).
Thu, Dec 14, 11:25 AM
awight added a comment to T181099: Train/test reverted model for Icelandic.

aspell-is is available on scb* now, so we're ready to deploy on Thursday.

Thu, Dec 14, 1:13 AM · Patch-For-Review, Scoring-platform-team (Current), User-Ladsgroup, editquality-modeling, artificial-intelligence

Wed, Dec 13

awight added a comment to T182614: Investigate why ORES logs are being written to syslog despite explicit logging config. Fix..

The explanation is that Celery follows an archaic pattern of hijacking the Python root logger, using its own customizations and ignoring existing logging config. All in the interest of "having something show up on the terminal for people who don't configure logging."
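
The pattern described here can be reproduced with the standard library alone. The sketch below is not Celery code: `hijack_root_logger` is a hypothetical stand-in for a library that replaces the root logger's handlers at startup, and `app_handler` stands in for the application's explicit logging config.

```python
import io
import logging


def hijack_root_logger(stream):
    """Hypothetical stand-in for a library that takes over the root logger.

    It discards whatever handlers the application configured and installs
    its own, which is the behavior described in the comment above.
    """
    root = logging.getLogger()
    for handler in list(root.handlers):
        root.removeHandler(handler)
    root.addHandler(logging.StreamHandler(stream))


# The application's explicit logging config: one handler writing to app_log.
app_log = io.StringIO()
app_handler = logging.StreamHandler(app_log)
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
root_logger.addHandler(app_handler)

logging.getLogger("ores").info("before hijack")

# The library starts up and swaps in its own handler.
library_log = io.StringIO()
hijack_root_logger(library_log)

logging.getLogger("ores").info("after hijack")

print("application log:", repr(app_log.getvalue()))
print("library log:", repr(library_log.getvalue()))
```

After the takeover, the application's handler no longer receives anything, which matches the symptom of logs landing somewhere other than the configured destination. Celery documents a `worker_hijack_root_logger` setting and a `setup_logging` signal for opting out of this behavior; the comment does not say which approach, if either, was used for the fix.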

Wed, Dec 13, 6:24 PM · Scoring-platform-team (Current), Patch-For-Review, ORES, Operations
awight added a comment to T181661: Connection timeout from tin to new ores servers.

Looks like I'm getting the same error.

Wed, Dec 13, 5:35 PM · Patch-For-Review, Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES
awight added a comment to T181099: Train/test reverted model for Icelandic.

This is deployed to the beta cluster and ready for testing:
http://ores-beta.wmflabs.org/v3/scores/iswiki/123456

Wed, Dec 13, 5:30 PM · Patch-For-Review, Scoring-platform-team (Current), User-Ladsgroup, editquality-modeling, artificial-intelligence
awight added a comment to T181099: Train/test reverted model for Icelandic.

Temporarily stalled on https://gerrit.wikimedia.org/r/#/c/398078/

Wed, Dec 13, 5:22 PM · Patch-For-Review, Scoring-platform-team (Current), User-Ladsgroup, editquality-modeling, artificial-intelligence
awight committed rORESDEPLOYde25ae84b3b5: Limit to no more than 3 cached revisions (authored by awight).
Limit to no more than 3 cached revisions
Wed, Dec 13, 5:15 PM
awight added a comment to T182801: New, mysterious scap failure.

Strange indeed. A full disk can cause all sorts of weird behaviors, though.

Wed, Dec 13, 5:14 PM · Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES
awight closed T182801: New, mysterious scap failure as Invalid.

/srv is full. Strange that there was no error message during deployment, though...
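
One way to surface this condition before it turns into a silent failure is a free-space check ahead of the deploy. A minimal sketch, assuming a hypothetical helper name and an arbitrary 10% threshold; it is not part of scap:

```python
import shutil


def has_free_space(path, min_free_fraction=0.10):
    """Return True if the filesystem holding `path` has enough free space.

    The 10% default is an arbitrary illustration, not a scap setting.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat that as "no space".
        return False
    return usage.free / usage.total >= min_free_fraction


# "/srv" is the mount point named in the comment; "." is used here so the
# sketch runs on any machine.
target = "."
if has_free_space(target):
    print(f"{target}: enough free space")
else:
    print(f"{target}: low on disk space, deploy should stop here")
```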

Wed, Dec 13, 5:12 PM · Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES
awight closed T182801: New, mysterious scap failure, a subtask of T181661: Connection timeout from tin to new ores servers, as Invalid.
Wed, Dec 13, 5:12 PM · Patch-For-Review, Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES
awight triaged T182801: New, mysterious scap failure as High priority.
Wed, Dec 13, 5:09 PM · Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES
awight created T182799: Make sure ORES is compatible with stretch.
Wed, Dec 13, 4:59 PM · Scoring-platform-team, ORES
awight added a comment to T181661: Connection timeout from tin to new ores servers.

@mmodell Tangential note, I've been happy using git clone --depth 1 on personal projects. Would that make any sense for scap?

Wed, Dec 13, 4:08 PM · Patch-For-Review, Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES
Gerrit Code Review <gerrit@wikimedia.org> committed rORESDEPLOYb67bba77acb7: Merge "Adds iswiki and eswikiquote reverted models." (authored by awight).
Merge "Adds iswiki and eswikiquote reverted models."
Wed, Dec 13, 4:08 PM

Tue, Dec 12

awight claimed T182614: Investigate why ORES logs are being written to syslog despite explicit logging config. Fix..
Tue, Dec 12, 7:25 PM · Scoring-platform-team (Current), Patch-For-Review, ORES, Operations
awight added a comment to T181661: Connection timeout from tin to new ores servers.

Pardon me, but I have to ask why a file whose log entries are timestamped Dec 11th, and whose local filesystem timestamp (on tin) is also December 11th, is named 2017-12-09, giving the impression it was created on Dec 9th. A bug, I guess?

Tue, Dec 12, 3:38 PM · Patch-For-Review, Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES

Mon, Dec 11

awight added a comment to T181661: Connection timeout from tin to new ores servers.

Done. Logs are in scap-sync-2017-12-09-0004.log (sic; note that the dates are still misleading).

Mon, Dec 11, 7:20 PM · Patch-For-Review, Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES
awight renamed T182614: Investigate why ORES logs are being written to syslog despite explicit logging config. Fix. from Investigate why ORES logs are being written to syslog despite explicit logging config to Investigate why ORES logs are being written to syslog despite explicit logging config. Fix..
Mon, Dec 11, 6:26 PM · Scoring-platform-team (Current), Patch-For-Review, ORES, Operations
awight triaged T182614: Investigate why ORES logs are being written to syslog despite explicit logging config. Fix. as Normal priority.
Mon, Dec 11, 6:24 PM · Scoring-platform-team (Current), Patch-For-Review, ORES, Operations
awight moved T181334: Split Cache.php to different services from Review to Pending deployment on the Scoring-platform-team (Current) board.
Mon, Dec 11, 5:51 PM · MW-1.31-release-notes (WMF-deploy-2017-12-12 (1.31.0-wmf.12)), Patch-For-Review, MediaWiki-extensions-ORES, Scoring-platform-team (Current), User-Ladsgroup
awight moved T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests from Active to Review on the Scoring-platform-team (Current) board.
Mon, Dec 11, 5:51 PM · Scoring-platform-team (Current), Operations, Patch-For-Review, Performance, ORES
awight triaged T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests as Normal priority.
Mon, Dec 11, 5:51 PM · Scoring-platform-team (Current), Operations, Patch-For-Review, Performance, ORES
awight edited projects for T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests, added: Operations, Scoring-platform-team (Current); removed Scoring-platform-team.
Mon, Dec 11, 5:51 PM · Scoring-platform-team (Current), Operations, Patch-For-Review, Performance, ORES
awight created T182600: Clarify that ORES UI can be disabled separately from the extension.
Mon, Dec 11, 4:05 PM · Global-Collaboration, Scoring-platform-team, MediaWiki-extensions-ORES
awight updated the task description for T181830: How can we test all the wiki/page combinations that can be affected by ORES?.
Mon, Dec 11, 3:53 PM · Wikimedia-Incident, Scoring-platform-team (Current)
awight moved T181183: Improvements to ORES deployment documentation and process from Active to Done on the Scoring-platform-team (Current) board.
Mon, Dec 11, 3:51 PM · Wikimedia-Incident, Scoring-platform-team (Current)
awight moved T181071: Cache ORES virtualenv within versioned source from Active to Review on the Scoring-platform-team (Current) board.
Mon, Dec 11, 3:51 PM · Wikimedia-Incident, Patch-For-Review, Operations, Scoring-platform-team (Current), Release-Engineering-Team (Watching / External)
awight placed T181830: How can we test all the wiki/page combinations that can be affected by ORES? up for grabs.
Mon, Dec 11, 3:51 PM · Wikimedia-Incident, Scoring-platform-team (Current)

Sat, Dec 9

awight lowered the priority of T182497: Update log config for scb* boxes, to deal with ORES verbose logging from High to Normal.

Urgent fix is deployed, lowering the priority.

Sat, Dec 9, 10:19 PM · Patch-For-Review, Scoring-platform-team, ORES, Operations
awight renamed T182497: Update log config for scb* boxes, to deal with ORES verbose logging from Update logrotate config for scb* boxes, to deal with ORES verbose logging to Update log config for scb* boxes, to deal with ORES verbose logging.
Sat, Dec 9, 4:38 PM · Patch-For-Review, Scoring-platform-team, ORES, Operations
awight added a project to T181661: Connection timeout from tin to new ores servers: Scap.

This is affecting me in production, now:

Timeout, server scb2004.codfw.wmnet not responding.
Sat, Dec 9, 4:07 PM · Patch-For-Review, Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES
awight created T182498: Scap is unhappy about deploying from a branch other than master.
Sat, Dec 9, 3:59 PM · Scap, Scoring-platform-team, Release-Engineering-Team
awight committed rORESDEPLOY1c0ede0c3d8b: Less verbose Celery logging (authored by awight).
Less verbose Celery logging
Sat, Dec 9, 3:52 PM
awight committed rORESDEPLOYb4f2b0249a60: Less verbose Celery logging (authored by awight).
Less verbose Celery logging
Sat, Dec 9, 3:52 PM
awight triaged T182497: Update log config for scb* boxes, to deal with ORES verbose logging as High priority.
Sat, Dec 9, 3:34 PM · Patch-For-Review, Scoring-platform-team, ORES, Operations
awight created T182497: Update log config for scb* boxes, to deal with ORES verbose logging.
Sat, Dec 9, 3:30 PM · Patch-For-Review, Scoring-platform-team, ORES, Operations
awight added a comment to T182354: OresDamagingPref back-compatibility is logging exceptions.

@Arash.pt Taking your questions at face value: Yes, yes and yes :). The issue here is that it's difficult to write tests that cover scenarios we haven't seen before, such as exotic combinations of settings. Now that we know about this problem, we can write a regression test for it (https://en.wikipedia.org/wiki/Regression_testing) which will prevent it from happening again.

Sat, Dec 9, 12:40 PM · Wikimedia-Incident, User-notice-collaboration, Regression, Patch-For-Review, Wikimedia-log-errors, Scoring-platform-team, MediaWiki-extensions-ORES

Thu, Dec 7

awight created T182354: OresDamagingPref back-compatibility is logging exceptions.
Thu, Dec 7, 8:52 PM · Wikimedia-Incident, User-notice-collaboration, Regression, Patch-For-Review, Wikimedia-log-errors, Scoring-platform-team, MediaWiki-extensions-ORES
awight created T182350: Profile ORES code memory use.
Thu, Dec 7, 8:17 PM · Performance, Scoring-platform-team, ORES
awight reopened T169246: Stress/capacity test new ores* cluster as "Open".

Reopening until we finish with T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests.

Thu, Dec 7, 8:05 PM · User-Ladsgroup, Operations, User-Joe, Scoring-platform-team (Current), Patch-For-Review, ORES
awight reopened T169246: Stress/capacity test new ores* cluster, a subtask of T168073: Switch ORES to dedicated cluster, as Open.
Thu, Dec 7, 8:05 PM · ORES, Scoring-platform-team
awight added a comment to T169246: Stress/capacity test new ores* cluster.

@Dzahn sorry--we decided to test some more, to overcome a suspiciously low performance ceiling. I'll make the follow-up task a blocker.

Thu, Dec 7, 8:02 PM · User-Ladsgroup, Operations, User-Joe, Scoring-platform-team (Current), Patch-For-Review, ORES
awight added a comment to T182331: [Epic] Deploy ORES in kubernetes cluster.
Thu, Dec 7, 4:05 PM · Operations, ORES, Scoring-platform-team

Wed, Dec 6

awight created T182258: ORES virtualenv deployment step fails intermittently.
Wed, Dec 6, 10:01 PM · Scap, Scoring-platform-team, ORES
awight closed T181538: ORES overload incident, 2017-11-28 as Resolved.
Wed, Dec 6, 9:53 PM · Wikimedia-Incident, Operations, Scoring-platform-team
awight moved T181567: Rate limit thresholds requests when the service is down from To Triage to Active Situation on the Wikimedia-Incident board.
Wed, Dec 6, 9:53 PM · MediaWiki-extensions-ORES, Wikimedia-Incident, MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Scoring-platform-team
awight created T182256: Clean up ORES thresholds cache: pre-emptively check before expiry.
Wed, Dec 6, 9:50 PM · Performance, MediaWiki-extensions-ORES, Scoring-platform-team
awight closed T181567: Rate limit thresholds requests when the service is down, a subtask of T181538: ORES overload incident, 2017-11-28, as Resolved.
Wed, Dec 6, 9:47 PM · Wikimedia-Incident, Operations, Scoring-platform-team
awight closed T181567: Rate limit thresholds requests when the service is down as Resolved.
Wed, Dec 6, 9:47 PM · MediaWiki-extensions-ORES, Wikimedia-Incident, MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Scoring-platform-team
awight edited projects for T181567: Rate limit thresholds requests when the service is down, added: Wikimedia-Incident, MediaWiki-extensions-ORES; removed Patch-For-Review.
Wed, Dec 6, 9:47 PM · MediaWiki-extensions-ORES, Wikimedia-Incident, MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Scoring-platform-team
awight committed rORESDEPLOY42cf53249ffa: Bump editquality and ores submodules (authored by awight).
Bump editquality and ores submodules
Wed, Dec 6, 9:44 PM
awight added a comment to T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests.

At their current performance, the celery workers we have should be able to handle 9 nodes * 150 workers / node * (1 request / 1.17 node-s) * (60 s/min) = 69,000 req/min. We would probably be limited by CPU (at 5x 20% CPU = 22,500 req/min).
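
The estimate can be reproduced directly. A short sketch of the arithmetic: the node count, worker count, and 1.17 s figure are the ones quoted in the comment (with the 1.17 s treated as per-worker time, as the formula implies), the 4,500 req/min ceiling is taken from the task title, and reading "5x 20% CPU" as scaling that ceiling up to full CPU is an assumption.

```python
# Figures quoted in the comment.
nodes = 9
workers_per_node = 150
seconds_per_request = 1.17  # time one worker spends on one request

# Theoretical Celery throughput if every worker stays busy.
celery_capacity = nodes * workers_per_node / seconds_per_request * 60
print(f"Celery worker capacity: {celery_capacity:,.0f} req/min")

# CPU-bound estimate: the observed ceiling (from the task title) was reached
# at about 20% CPU, so full CPU would allow roughly five times as much.
# Interpreting "5x 20% CPU" this way is an assumption.
observed_ceiling = 4500  # req/min
cpu_fraction_at_ceiling = 0.20
cpu_bound_capacity = observed_ceiling / cpu_fraction_at_ceiling
print(f"CPU-bound capacity: {cpu_bound_capacity:,.0f} req/min")
```

The first figure comes out slightly above 69,000 req/min, which the comment rounds down. Both estimates sit well above the observed 4,500 req/min, which is why the ceiling looked suspicious.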

Wed, Dec 6, 9:29 PM · Scoring-platform-team (Current), Operations, Patch-For-Review, Performance, ORES
awight created T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests.
Wed, Dec 6, 9:08 PM · Scoring-platform-team (Current), Operations, Patch-For-Review, Performance, ORES
awight added a comment to T169246: Stress/capacity test new ores* cluster.

Let's do it.

Wed, Dec 6, 8:58 PM · User-Ladsgroup, Operations, User-Joe, Scoring-platform-team (Current), Patch-For-Review, ORES
awight added a comment to T169246: Stress/capacity test new ores* cluster.

Point well taken. What if we temporarily depool some of the servers for future tests? Any single ores* machine can handle our average traffic, so we'd be fine depooling 4 or 5 servers.

Wed, Dec 6, 8:52 PM · User-Ladsgroup, Operations, User-Joe, Scoring-platform-team (Current), Patch-For-Review, ORES
awight added a parent task for T181661: Connection timeout from tin to new ores servers: T168073: Switch ORES to dedicated cluster.
Wed, Dec 6, 8:06 PM · Patch-For-Review, Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES
awight added a subtask for T168073: Switch ORES to dedicated cluster: T181661: Connection timeout from tin to new ores servers.
Wed, Dec 6, 8:06 PM · ORES, Scoring-platform-team
awight added a parent task for T181552: Git refusing to clone some ORES submodules: T168073: Switch ORES to dedicated cluster.
Wed, Dec 6, 8:05 PM · Scoring-platform-team, Release-Engineering-Team (Kanban), User-Ladsgroup, Operations, User-Joe, ORES
awight added a subtask for T168073: Switch ORES to dedicated cluster: T181552: Git refusing to clone some ORES submodules.
Wed, Dec 6, 8:05 PM · ORES, Scoring-platform-team
awight removed subtasks for T169246: Stress/capacity test new ores* cluster: T181661: Connection timeout from tin to new ores servers, T181552: Git refusing to clone some ORES submodules.
Wed, Dec 6, 8:05 PM · User-Ladsgroup, Operations, User-Joe, Scoring-platform-team (Current), Patch-For-Review, ORES
awight removed a parent task for T181552: Git refusing to clone some ORES submodules: T169246: Stress/capacity test new ores* cluster.
Wed, Dec 6, 8:05 PM · Scoring-platform-team, Release-Engineering-Team (Kanban), User-Ladsgroup, Operations, User-Joe, ORES
awight removed a parent task for T181661: Connection timeout from tin to new ores servers: T169246: Stress/capacity test new ores* cluster.
Wed, Dec 6, 8:05 PM · Patch-For-Review, Scap, Scoring-platform-team, Release-Engineering-Team, Operations, ORES