Thu, Dec 14
I'm reconsidering my proposal to use python3.4. It's only available by adding jessie as an apt source, which causes some annoying dependency fu such as downgrading findutils and maybe other packages.
I haven't seen this issue in a few weeks, closing. Thank you!
Unstalling, now that the stress testing is complete.
Ok, done for real now. @Halfak and I decided that the remaining bottlenecks are non-trivial, and we don't want to block on further optimization.
I think we've got our tuning parameters! 45 minutes of overload, and everything looks good. Available memory hasn't dropped below 8GB. We'll be eating into this margin as we add models, so this is a good place to leave things for now.
The last comment was wrong; I see how the timeout is caught and metrics are recorded. I currently can't find any code paths that explain the missing metrics.
There's one code path that can throw a TimeoutError without adding to this metric: the outer timeout in ores/util.py. It's interesting that we're hitting this code; I don't think that should be happening.
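To illustrate the shape of that code path, here's a minimal, hypothetical sketch (not the actual ores code; the function names, timings, and metrics dict are invented). The outer timeout can fire outside the block where timeouts are counted, so the TimeoutError propagates without the metric ever being incremented:

```python
# Hypothetical sketch of two timeout layers; all names and timings are invented.
import signal
import time
from contextlib import contextmanager

metrics = {"scoring_timeouts": 0}

@contextmanager
def timeout(seconds):
    """Outer, signal-based timeout wrapped around the whole request."""
    def _raise(signum, frame):
        raise TimeoutError("outer timeout")
    old_handler = signal.signal(signal.SIGALRM, _raise)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)

def extract_features(rev_id):
    time.sleep(5)  # pretend this step is the slow one today
    return [0.1, 0.2, 0.3]

def score(features):
    return sum(features)

def handle_request(rev_id):
    features = extract_features(rev_id)  # the outer timeout can fire here...
    try:
        return score(features)           # ...but only timeouts raised in this
    except TimeoutError:                 # block are ever counted.
        metrics["scoring_timeouts"] += 1
        raise

if __name__ == "__main__":
    try:
        with timeout(1):
            handle_request(12345)
    except TimeoutError:
        pass
    print(metrics)  # {'scoring_timeouts': 0}: a timeout happened, but no metric
```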
I'm happy with that. It looks like it's going to be difficult to break through this ceiling, and we have lots of data to compare against when we try again, somewhere experimental. The capacity is fine, and overload behavior is much better than before.
Ran a tricky test, in which I stepped up from 1 to 3 test harnesses, then back down.
- tester 1 ran 17:03 – 18:09
- tester 2 ran 17:11 – 17:51
- tester 3 ran 17:31 – 17:40
Ran another test (a rough sketch of the harness loop follows the list):
- Only 8 of the ores* machines were online
- Test harness was hitting 7 machines' uwsgi endpoints
- Using a delay of 0.001s.
- uWSGI workers: 230 per machine
- Celery workers: 150 per machine
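The loop is roughly shaped like the sketch below (the hostnames, port, endpoint path, and revision ID are placeholders, not the real test configuration):

```python
# Rough sketch of the harness loop; hosts, port, and URL path are placeholders.
import itertools
import time

import requests

HOSTS = ["ores100%d.eqiad.wmnet:8080" % i for i in range(1, 8)]  # 7 uwsgi endpoints
DELAY = 0.001  # seconds between requests

def run(duration_s=60):
    host_cycle = itertools.cycle(HOSTS)
    deadline = time.time() + duration_s
    sent = errors = 0
    while time.time() < deadline:
        host = next(host_cycle)
        try:
            # Hypothetical scoring URL; the real endpoint path may differ.
            requests.get("http://%s/v2/scores/enwiki/damaging/12345/" % host,
                         timeout=10)
        except requests.RequestException:
            errors += 1
        sent += 1
        time.sleep(DELAY)
    print("sent=%d errors=%d" % (sent, errors))

if __name__ == "__main__":
    run()
```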
Using a workaround for T182865, where we go into the submodules and check out master before running the submodule update from the ORES root, I was able to successfully parallel deploy to ores*, and in under a minute! It looks like the SSH timeout is resolved!!
Oops, we aren't expecting this repo to be mirrored to gerrit. So the surprise is that the revision exists in Phabricator but can't be found by the deployment tool. Making a new task: T182865: Source revision is in Phabricator, but can't be found by deployment tools
The Phabricator control panels look happy:
- https://phabricator.wikimedia.org/source/editquality/manage/uris/ shows that we're observing GitHub.
- https://github.com/wiki-ai/editquality/commit/15d5283b7422919d85203b5ba907027f9356e421 shows that the commit has been in GitHub for 3 days.
- https://gerrit.wikimedia.org/r/#/q/15d5283b7422919d85203b5ba907027f9356e421 shows that it's not in Gerrit yet.
- https://phabricator.wikimedia.org/source/editquality/manage/basics/ says "Last updated Thu, Dec 14, 7:07 AM (9 m, 19 s ago)."
Those revisions aren't in gerrit. I think the github -> gerrit mirroring broke when we were messing around with pointing to git-lfs repos?
@Johan This only affected fawiki, to my knowledge. Feel free to change tags as necessary, thanks!
Not sure if this is related, but now I'm seeing a deploy-local failure with no diagnostics other than error code 70:
aspell-is is available on scb* now, so we're ready to deploy on Thursday.
Wed, Dec 13
The explanation is that Celery follows an archaic pattern of hijacking the Python root logger, applying its own customizations and ignoring any existing logging config, all in the interest of "having something show up on the terminal for people who don't configure logging."
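For reference, opting out looks roughly like this (the app name and logging config below are placeholders; the relevant pieces are the worker_hijack_root_logger setting and the setup_logging signal, which tell Celery to leave logging configuration to us):

```python
# Sketch of opting out of Celery's root-logger hijacking.
# App name and logging config are placeholders; either the setting or the
# signal below is enough on its own to stop Celery from replacing our config.
import logging.config

from celery import Celery
from celery.signals import setup_logging

app = Celery("ores_example")
app.conf.worker_hijack_root_logger = False  # Celery 4+; CELERYD_HIJACK_ROOT_LOGGER in older releases

@setup_logging.connect
def configure_logging(**kwargs):
    # With a receiver connected, Celery skips its own logging setup entirely
    # and lets us (re)apply whatever config we already use.
    logging.config.dictConfig({
        "version": 1,
        "disable_existing_loggers": False,
        "handlers": {"console": {"class": "logging.StreamHandler"}},
        "root": {"handlers": ["console"], "level": "INFO"},
    })
```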
Looks like I'm getting the same error.
This is deployed to the beta cluster and ready for testing:
Temporarily stalled on https://gerrit.wikimedia.org/r/#/c/398078/
/srv is full. Strange that there was no error message during deployment, though...
@mmodell Tangential note: I've been happy using git clone --depth 1 on personal projects. Would that make any sense for scap?
Tue, Dec 12
Mon, Dec 11
Done. Logs are in scap-sync-2017-12-09-0004.log (sic; note that the dates are still misleading)
Sat, Dec 9
Urgent fix is deployed, lowering the priority.
This is affecting me in production now:
Timeout, server scb2004.codfw.wmnet not responding.
@Arash.pt Taking your questions at face value: Yes, yes, and yes :). The issue here is that it's difficult to write tests that cover scenarios we haven't seen before, such as exotic combinations of settings. Now that we know about this problem, we can write a regression test for it (https://en.wikipedia.org/wiki/Regression_testing), which will prevent it from happening again.
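For illustration only (a generic, hypothetical example, not from the ORES test suite), a regression test just pins down the exact combination of settings that broke before, so it can never silently break again:

```python
# Hypothetical regression test; load_config and the settings are stand-ins.
def load_config(overrides):
    """Stand-in for the real config loader."""
    config = {"workers": 150, "timeout": 15}
    config.update(overrides)
    if config["timeout"] <= 0:
        raise ValueError("timeout must be positive")
    return config

def test_exotic_setting_combination_loads_cleanly():
    # The exact combination that broke before must keep working.
    config = load_config({"workers": 1, "timeout": 15})
    assert config == {"workers": 1, "timeout": 15}
```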
Thu, Dec 7
Reopening until we finish with T182249: Diagnose and fix 4.5k req/min ceiling for ores* requests.
@Dzahn sorry--we decided to test some more, to overcome a suspiciously low performance ceiling. I'll make the follow-up task a blocker.
Wed, Dec 6
At their current performance, the celery workers we have should be able to handle 9 nodes * 150 workers/node * (1 request / 1.17 worker-s) * (60 s/min) = 69,000 req/min. We would probably be limited by CPU first: we're at roughly 20% CPU at the current ~4.5k req/min, so about 5x that (22,500 req/min) would saturate it.
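Spelling the arithmetic out (reading "5x 20% CPU" as: roughly 20% CPU at the current ~4.5k req/min ceiling):

```python
# Back-of-the-envelope check of the figures above.
nodes = 9
workers_per_node = 150
seconds_per_request = 1.17  # worker-seconds spent per request

worker_capacity = nodes * workers_per_node / seconds_per_request * 60
print(round(worker_capacity))  # ~69231 req/min, i.e. roughly 69,000

current_req_per_min = 4500   # the ceiling we've been hitting
current_cpu_fraction = 0.20  # roughly 20% CPU at that rate
cpu_capacity = current_req_per_min / current_cpu_fraction
print(round(cpu_capacity))   # 22500 req/min before CPU saturates
```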
Let's do it.
Point well taken. What if we temporarily depool some of the servers for future tests? Any single ores* machine can handle our average traffic, so we'd be fine depooling 4 or 5 servers.