Fri, Oct 19
@kostajh If you have time for that, it would be perfect. I admit I don't have any idea how to test this.
Thu, Oct 18
Just checked: Python jsonschema validates ISO-8601 timestamps with milliseconds against the date-time format just fine. :)
The above patch should mitigate the problem; however, we also need to account for possible clock drift between our servers. The more drift we tolerate, the less efficient our deduplication becomes, so I'm wondering whether we have any data on the possible clock drift to help decide on the exact number to tolerate?
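To make the trade-off concrete, here is a minimal sketch (names and semantics are my assumption, not the actual patch): an event is treated as a duplicate only if it is older than the last-seen event by more than the tolerated drift, so a larger `toleranceMs` means fewer events get deduplicated.

```javascript
// Hypothetical sketch: treat an event as a duplicate only when it is not
// newer than the last-seen event even after allowing `toleranceMs` of
// clock drift between servers. Larger tolerance = less effective dedup.
function isDuplicate(lastSeenDt, eventDt, toleranceMs) {
  const last = Date.parse(lastSeenDt);
  const current = Date.parse(eventDt);
  return current + toleranceMs <= last;
}
```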
We've switched from requesting all the onthisday content at once to requesting each portion separately.
Wow, in the example one Accept-Language value is capitalized and the other is all lower-case. Which one do we go with?
And the fact that it increased the performance of /onthisday 2x suggests there is a lot of room for improvement in MCS itself: we're now making 5 times more RB->MCS requests for that endpoint, so more CPU time goes to making, parsing, and routing the requests in MCS and combining the results, yet just by splitting the actual processing across different nodes we get a lot of improvement.
Wed, Oct 17
I know what's happening. If there are more items in the list than the batch size, the code re-enqueues exactly the same job again here. That means the deduplication info is exactly the same as for the previous job, but the Kafka queue is so quick that the jobs are executed with < 1 second delay, so the dt for the second job ends up being exactly the same as the dt for the previous job, and they get deduplicated.
Yeah. We can tune and tweak it indefinitely, but for now I think we're in a good state.
However, there have been a lot of improvements to the upstream code even without the aforementioned PR. Those improvements need to be brought into our fork.
Tue, Oct 16
I did fix the latency graph: https://grafana.wikimedia.org/dashboard/db/proton?orgId=1&from=now-3h&to=now
Fri, Oct 12
How're we satisfying the requirement of
If we do adopt the policy of having the latest schema use references and then rendering it into full schemas in versioned files, so that clients are not required to support all the fancy features, we'd need to make the pre-commit hook an executable script to satisfy development requirements like:
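The "render into full schemas" step could look roughly like this minimal sketch (a real hook would also handle remote refs, cycles, JSON Pointer escaping, and allOf merging; this only expands local `#/definitions/...` references):

```javascript
// Minimal sketch: expand local JSON Pointer $refs so the versioned files
// are plain, self-contained schemas that any client can consume.
function deref(schema, root = schema) {
  if (Array.isArray(schema)) {
    return schema.map(item => deref(item, root));
  }
  if (schema && typeof schema === 'object') {
    if (typeof schema.$ref === 'string' && schema.$ref.startsWith('#/')) {
      // Resolve the pointer path against the root schema
      const target = schema.$ref.slice(2).split('/')
        .reduce((node, key) => node[key], root);
      return deref(target, root);
    }
    const out = {};
    for (const [key, value] of Object.entries(schema)) {
      out[key] = deref(value, root);
    }
    return out;
  }
  return schema;
}
```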
Q: should we use the term 'repository' or 'registry' here? I'm considering retitling the tickets to 'repository' since we will be using git repositories. However, there may be some extra features on a potential HTTP service that serves schemas. If we have that, would we call that the 'registry'?
Up-to-date JSONSchema support (draft-07?)
Thu, Oct 11
We actually have tests for all of these in the current event-schemas repo, but they're not perfect. What I would like to change:
Do we need to use a custom meta JSONSchema for this, or can we just add type information outside of the JSONSchema spec in the schemas?
I've repeated your experiments but cached the Parsoid and MW API results in memory to eliminate networking. The variant without promisifying the long sync processing is faster on average, but the promisified version has a much flatter distribution of request latencies.
The -c 10 numbers for the promise-chain version do look a bit better in this sample, but I think that only reflects a transient network connection improvement. After running multiple times, I haven't seen the numbers be consistently better or worse.
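The trade-off being measured can be sketched as follows (a generic illustration, not the actual MCS code): doing all the work in one synchronous pass blocks the event loop, while yielding between chunks via setImmediate adds a little average latency but lets concurrent requests interleave, flattening the latency distribution.

```javascript
// Sketch: process a large array in chunks, yielding to the event loop
// between chunks so other requests can interleave. Slightly slower on
// average than one synchronous pass, but flatter request latencies.
function processChunked(items, workFn, chunkSize = 100) {
  return new Promise(resolve => {
    const results = [];
    let index = 0;
    function step() {
      const end = Math.min(index + chunkSize, items.length);
      for (; index < end; index++) {
        results.push(workFn(items[index]));
      }
      if (index < items.length) {
        setImmediate(step); // yield between chunks
      } else {
        resolve(results);
      }
    }
    step();
  });
}
```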
@Ottomata I think there should be 3 steps here
- first we just stop emitting events completely
- second we deploy this
- third we re-enable event emitting in the new format
The patch above will make the retries stop, but will preserve all the logging.
Can this job be configured to not retry at all? I think that would be ideal. At the end of the day it's a warmup script and it's fine for it to fail.
@Pchelolo would adding more workers to MCS be a reasonable course of action?
That said, I doubt we'd see gains this dramatic in production, for a few reasons. First, of course, we wouldn't just be hammering away at a single page and enjoying a 100% cache hit rate; also, we'd expect that many of the requests that reach MCS are in response to page content changes, meaning that cached Document objects from previous renders wouldn't help us. OTOH, most MCS endpoints for both page and feed content include domino.createDocument as a processing step, so at least for popular pages we could expect cached Document objects to be reused at least several times.
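The "reuse cached Document objects" idea could look roughly like this (a hypothetical sketch: `parseFn` stands in for the expensive parse step such as domino.createDocument, and the cache key would be something like a revision/render id):

```javascript
// Hypothetical sketch: memoize parsed Document objects per render so that
// multiple endpoints processing the same render skip re-parsing the HTML.
function makeDocumentCache(parseFn, maxEntries = 100) {
  const cache = new Map();
  return function getDocument(renderId, html) {
    if (cache.has(renderId)) {
      return cache.get(renderId);
    }
    const doc = parseFn(html);
    cache.set(renderId, doc);
    if (cache.size > maxEntries) {
      // Evict the oldest entry (Map preserves insertion order)
      cache.delete(cache.keys().next().value);
    }
    return doc;
  };
}
```

Note that, as observed above, this only pays off for renders that are hit more than once before being invalidated by a content change.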
Wed, Oct 10
Mon, Oct 8
Fri, Oct 5
Thu, Oct 4
I think all the pieces were deployed, so I'm resolving the task. Let's see how it goes next week; I will reopen in case of an issue.
Parsoid deploys aren't exactly fast, so there's a period where some nodes are producing the latest version which other nodes don't know how to handle.
Both CP and CPJQ were deployed with the fix. This bug happens quite rarely, so I will close the ticket now. If it happens again we will reopen it.
@Gilles these sound like legitimate errors. Should we ignore 429 errors and not retry the job then?
Wed, Oct 3
The previous comment also explains why we started seeing the errors after the DC switchover. Topics are created on demand, and while codfw was not the primary DC a lot of job types did not exist there, so when we switched and new job types started being emitted in codfw, the faulty code path was executed and we ran into the race condition.
I found the reason for this to be happening, or at least one possible reason. And to be honest, I'm embarrassed.
09:37 Pchelolo: arlolra: I think I found why parsoid is failing in beta
09:39 arlolra: Pchelolo: /me perks up ... I haven't actually looked yet
09:40 arlolra: I just assumed it was a configuration change
09:41 Pchelolo: when using ApiRequest, https://en.wikipedia.beta.wmflabs.org/w/api.php is provided as a uri which is conf.wiki.apiURI. http.Agent is selected based on it and https agent is assigned. then on line 291 the uri is assigned to mwApiServer===conf.parsoid.mwApiServer which is http://deployment-mediawiki-07.deployment-prep.eqiad.wmflabs/w/api.php
09:41 Pchelolo: so the protocol starts to mismatch the agent
09:41 Pchelolo: we just need to move the agent selection code way to the bottom of ApiRequest.prototype.request method and we should be fine
09:41 Pchelolo: verifying
09:43 Pchelolo: voilà, parsoid works in beta
09:44 Pchelolo: I'm not sure my approach is entirely correct, but parsoid is life-hacked on deployment-parsoid09 now and it works
Beautiful. Look at the update latency impact!
This might be a bit too advanced to deserve the goodfirstbug tag, but at least it's very straightforward and gives good exposure to the convoluted storage semantics™ and the hell hole of the parsoid.js module, so I will tag it.
hm... on restbase1007:
Ok, @Pchelolo gets the persistence award!
Tue, Oct 2
We did enable the feature after all by looking at requests reaching RESTBase, but that's not very convenient.
Tagging as a good onboarding bug, as once the subtask is resolved it will be easy to fix in code, and it provides a great glimpse into what a render is, how Parsoid, RESTBase and VE interact, and what constraints we need to maintain in order to make the three work together correctly.
There have been 700 cases where If-Match was not supplied over the last month, and only 2 user agents:
So was this done or not after all?
I believe that's not an issue any more?
The problem is that MW API is configured as 'http://', but for some reason request uses the https Agent - thus the failure.
It does. Yesterday I restarted the JobQueue for that.
@Mholloway Everything is deployed in both RB and Parsoid, but let's wait till tomorrow for the MCS deployment in case we need to roll back?
Mon, Oct 1
There is a base set of npm packages that are used by all services. Currently, server.js installs heapdump and gc-stats (possibly among others).
Related patch for change-prop https://github.com/wikimedia/change-propagation/pull/292
Wed, Sep 26
@Ryasmeen try now
@Ryasmeen can you list the titles you are trying now? I bet my right hand Varnish cached something wrong.
Tue, Sep 25
In my opinion - yes. Thank you.
Mon, Sep 24
Ok, I guess you're right. For the sake of correctness here's a PR https://github.com/wikimedia/restbase/pull/1066
As an engineer, I want to specify concrete settings for different topics like the number of partitions or the retention interval. T157092
@Pchelolo where would database settings live? Would it be the service codebase itself or do we have a separate repository for that?
Fri, Sep 21
These are different things designed for different purposes, so you should do both.
This is a big and interesting question that I've been thinking about myself for a long, long time, so it needs a lot of discussion.