Another approach to minimizing the disruption to normal ChangeProp while making nice and uniform use of the service_name everywhere is to rename the normal ChangeProp service from changeprop to change-prop. This will break logging and metrics, but those dashboards are easily fixable.
Ideally the consumer group names should be prefixed with the service_name as well just for consistency, but renaming the consumer groups would make us lose all the backlog we have in all the topics and never process it, since committed offsets are keyed by the consumer group name and a new group would start fresh. I don't think consistency is important enough to do that. What do you think @mobrovac ?
The wikibase-UpdateUsagesForPage job sounds like a perfect candidate to be the next one. It's ~220 jobs/s on average over the last month, it was well tested in beta labs, it seems idempotent, and it doesn't seem to use any of the advanced JobQueue features like root job deduplication or delayed execution.
After some discussion with @mobrovac we think that it's better to replicate the sharding mapping in the EventBus service instead of providing it together with the event.
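For concreteness, a hypothetical sketch (in TypeScript; the actual service-side representation is still undecided, and the domains and shard names below are purely illustrative) of what replicating the mapping in the service could look like: the service keeps its own domain => shard table in its config and consults it when routing a job event, so the event itself carries no sharding info.

```typescript
// Hypothetical sketch only: the real service would load this from its config.
const domainToShard: Record<string, string> = {
    'en.wikipedia.org': 's1',       // illustrative entries
    'commons.wikimedia.org': 's4',
};

// Fall back to a default shard for domains the mapping doesn't know about.
function shardFor(domain: string, defaultShard: string = 's3'): string {
    return domainToShard[domain] || defaultShard;
}
```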
AFAIK, the new queue would be able to deal with that, though I'm not 100% sure how writes initiated in a different DC perform.
One issue I've encountered is that I can't find an easy way to find out which shard a given domain belongs to from the extension code, or how to provide this configuration to the EventBus service. @Joe do you know if it's possible to get the domain => shard mapping in the MW extension code somehow?
Per discussion on the JobQueue meeting:
Wed, Nov 22
Mon, Nov 20
@hashar Puppet has been re-enabled on both the change-prop and redis hosts in deployment-prep and a puppet run was made. I will be working more on this at some point, so I will disable it again for a while, but probably not today/tomorrow.
Thu, Nov 16
Out of the IRC discussion we've got 3 candidates for the next migration:
- wikibase-UpdateUsagesForPage - super high traffic, well tested on beta, and super easy. TODO: talk to Wikidata
- ORESFetchScoresJob - low traffic, quite problematic
- recentchangesupdate - decent traffic, very high user-visible effect.
Yup. That's why I'm very suspicious about this particular correlation and want to wait for the RecordLintJob backlog to get cleaned up naturally.
Local debugging wasn't that fruitful. I propose to postpone deployment of https://gerrit.wikimedia.org/r/#/c/391801/ until the RecordLintJob backlog disappears (it's going down now) to get proof that this is indeed the reason for the memory growth.
I've conducted an experiment in deployment-prep. I've connected ChangeProp directly to deployment-redis06 and generated some extensive load on ChangeProp. Then I killed a redis instance and observed a bunch of redis connection errors, but a very limited number compared to production. Also, the exceptions were coming from the redis client's on_error handler, not from the code directly using redis. However, during that time CP completely stopped processing incoming Kafka events. The theory was that the redis client waits for the connection to come back and stops the world. I changed the redis retry policy to never retry and the stalled event processing went away. So we should try to write a custom retry policy for production.
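For reference, a minimal sketch of what such a policy could look like, assuming the node_redis 2.x client's retry_strategy option (the host name, attempt limit, and backoff numbers are placeholders, not a tested production config):

```typescript
import * as redis from 'redis';

const client = redis.createClient({
    host: 'deployment-redis06', // illustrative host
    // retry_strategy is called on every connection loss; returning a number
    // schedules a reconnect after that many ms, while returning an Error
    // fails pending commands immediately instead of stopping the world.
    retry_strategy: (options) => {
        if (options.attempt > 3) {
            // Give up fast so Kafka event processing isn't blocked forever.
            return new Error('Redis unreachable, failing fast');
        }
        return Math.min(options.attempt * 100, 1000); // bounded backoff in ms
    }
});
```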
@Ottomata I've reverted your change on kafka1001 as there's some AttributeError: 'NoneType' object has no attribute 'append' in the local logs.
Wed, Nov 15
According to @fgiunchedi the 05 and 06 instances are new Redis instances to which MW has not been migrated yet, so we have a few days to conduct experiments and can then reuse the instances for normal operation together with MediaWiki. Resolving.
cc @fgiunchedi as he's doing some redis migrations in deployment-prep
I've increased the request timeout in the EventBus extension to 10 seconds to match the sync timeout in Kafka, but it did not fix the timeout errors.
I can see there's deployment-redis05 and deployment-redis06 in deployment-prep, but I don't see any references to these instances anywhere. @Joe @elukey do you know if these are unused redis instances? I've checked whether they have any keys and these 2 nodes have completely empty redis installations. Can I just use them?
One more piece of info: the page_edit rule stopped processing completely after midnight on Nov 14 - there was one more instance of all workers dying at midnight, after first a huge amount of redis connection logs were emitted and then KafkaConsumer not connected errors appeared on scb1002.
Here's the list of topics that should be deleted:
Script used to create yaml for title_revisions-ng table:
Raising the priority since today, during the investigation of T180568, this issue has been raised again. In case only one instance per node is misbehaving it's possible to use logstash, but if there were 2 misbehaving per node that would make logstash absolutely useless.
Tue, Nov 14
restbase-dev1006 is a part of our dev cluster. We are not currently using it, however there's a change-prop instance running there - that's the source of the errors from there. I'll stop change-prop in the dev cluster.
I would say having a marker (the "experimental" one, or a similar one) and setting expectations to be "it may disappear or its API may change at any point in time without notice" would be the way to go here.
@Tgr we've never had the need for sorting, so we've never settled on a convention. I personally like /lists/?sort=name more than the others.
Thu, Nov 9
In general there are several issues we've observed:
As the change will not affect any current users, I think we can go ahead and deploy everywhere.
Seems like we don't specify it in mediawiki-config, so we're using the default of 5 seconds.
Judging by the logs (on mw-log1001, not currently available in logstash) the timeouts did not disappear. Maybe we could consider increasing request timeout?
If we can deploy to all wikis at the same time, that would be easier for the Services team, and I think @Pchelolo said that that is doable.
Wed, Nov 8
Script used to generate the table creation statements:
The EventBus logs were fixed and can now be seen in logstash. Resolving.
Today we've had a semi-outage because of this.
Ok, I know what's happening. You're hitting a bug on our side that I will fix; however, you're using the latest master from GitHub, which we use for development, so it's unstable and will not work. I'd suggest only using released versions. The latest release is https://github.com/wikimedia/restbase/releases/tag/v0.17.0 - that one works perfectly with your config.
I am not sure which config you are referring to?
Please provide some more details:
Tue, Nov 7
I guess we can go with https://github.com/wikimedia/restbase/pull/896 right away without creating a proxy. The endpoint is quite low-volume, so it's ok if we just recreate everything.
Mon, Nov 6
Fri, Nov 3
Thu, Nov 2
Wed, Nov 1
Tue, Oct 31
We've decided to migrate stashing together with the normal parsed tables, btw.
It does give us a pretty significant latency improvement for Varnish cache misses: from 500 ms generating it each time down to 230 ms using Cassandra. However, the Varnish hit rate is quite high on this particular endpoint: out of 28 req/s that hit the Varnishes, only about 1.5 req/s are cache misses (a ~95% hit rate).
Mon, Oct 30
Logging this from MediaWiki (even if just the meta) is more appropriate.
@Ottomata Pushing >100 MB into logstash? I don't think that's a good idea.
Fri, Oct 27
That gerrit change was a workaround; the real issue is that the jobs don't follow the Job contract. Reopening.
We've discussed this issue during the Services-Reading meeting and here are some ideas from the discussion:
So, we have an event that's >100 MB of serialized JSON... Wonderful. Do you think it's possible to dump it somewhere on the filesystem to find out what it is? It's probably creating some issues in the current JobQueue and MediaWiki as well; it's even more abnormal than the previously found 17 MB events. Perhaps temporarily increase the max_buffer_size to something like 200 MB so that tornado can at least accept the event and log it?
Thu, Oct 26
One more bit: I've tried very hard to locate any log entries that would have different actual types for the problematic fields and did not succeed. This might suggest that the order of messages is not the reason, but maybe I've just missed a couple of sneaky records.
I've found that particular log entry from 2017-10-26T06:17:47 in the parsoid logs locally on the machine, and nothing is unusual - it's just a normal Parsoid log entry.
Wed, Oct 25
The same happens with UpdateConstraintsTableJob.
Oct 25 2017
Oct 17 2017
Thanks, I'll submit fixes soon. Btw, we've moved to gerrit and the new repo lives at https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/services/chromium-render. I'll ping you once the patch is ready.
Left some comments regarding the code: https://github.com/kodchi/mediawiki-services-chromium-render/commit/d02e5a57ec1e986f0992edaf6b8c8169d13b5203
The public /revision hierarchy was removed, as well as the secondary indexing usage in Cassandra. Deleting the secondary indexes does not really delete the tables, so the cleanup should be done manually. Here's the list of tables that should be removed:
The logs are back where they belong, so I guess the ticket can be resolved. Thank you @fgiunchedi
`:` and `$` are reserved characters, so URLs which differ in how these characters are encoded are not considered equal. The web server will consider them equivalent, so this is not a big deal in practice, but it will split Varnish and other caches, and maybe confuse semantic web applications.
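To illustrate, a hypothetical normalization sketch: decode the percent-encodings only for characters the web server treats as equivalent, so the variants collapse to a single cache key (the function and character set here are illustrative, not an existing API):

```typescript
// ':' and '$' are characters whose encoded and literal forms name the same
// resource, while caches key on the raw URL string.
const EQUIVALENT = new Set([':', '$']);

function normalizeTitle(title: string): string {
    // Decode %xx sequences for equivalent characters; upper-case the rest
    // so that '%3a' and '%3A' also collapse to one form.
    return title.replace(/%[0-9A-Fa-f]{2}/g, (seq) => {
        const ch = decodeURIComponent(seq);
        return EQUIVALENT.has(ch) ? ch : seq.toUpperCase();
    });
}

// 'Foo:Bar' and 'Foo%3ABar' now produce the same cache key:
console.log(normalizeTitle('Foo%3ABar') === normalizeTitle('Foo:Bar')); // true
```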
Oct 12 2017
Indeed, we need to register all the definitions in ajv when doing the validations.
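Something along these lines (a minimal sketch; the schema key and definitions below are illustrative, not our actual schemas):

```typescript
import Ajv from 'ajv';

const ajv = new Ajv();

// Register the shared definitions under a key before compiling any schema
// that $refs them; otherwise ajv can't resolve the reference.
ajv.addSchema({
    definitions: {
        title: { type: 'string', minLength: 1 }
    }
}, 'defs.json');

const validate = ajv.compile({
    type: 'object',
    properties: {
        page_title: { $ref: 'defs.json#/definitions/title' }
    },
    required: ['page_title']
});

console.log(validate({ page_title: 'Main_Page' })); // true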
Oct 11 2017
Oh, ok, I didn't understand that. It's completely doable; I'll make a PR later today.
@bearND unfortunately we can't do that in the RESTBase layer. The bug is in the TextExtracts extension itself - for the request we're making to it from RB, we only get content up to the asterisk: https://cs.wikipedia.org/w/api.php?action=query&prop=extracts&exintro=true&exsentences=5&titles=Marek_Eben