Fri, May 18
Done for summaries as well
Tue, May 15
This was deployed to production, and the number of rebalance log messages during consumer startups declined, so I'm resolving the ticket.
Mobile is done, running summaries.
Mon, May 14
Started one for mobile-sections with concurrency 100 on restbase-dev1004 in a screen session. I will monitor for a little while to make sure the concurrency's fine.
@Pchelolo do you have any other objections or were you just looking to reuse schemas for consistency?
Yeah, it might indeed be more complex to parse. Just throwing out ideas, feel free to discard it.
@bearND heh, I've just copy-pasted this from the config; obviously it should return this in JSON format.
@Pchelolo TranslateDeleteJobs are not being run in parallel in two job queues, are they?
@bearND basically just the schema we already have, just adjusted per-domain:
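Roughly something like this (a hypothetical sketch; the actual property names would come from the schema we already expose, just keyed by domain):

```json
{
  "en.wikipedia.org": {
    "type": "object",
    "properties": {
      "items": { "type": "array" }
    },
    "required": ["items"]
  },
  "de.wikipedia.org": {
    "type": "object",
    "properties": {
      "items": { "type": "array" }
    },
    "required": ["items"]
  }
}
```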
Fri, May 11
We've got the logs in logstash, thank you @Ottomata
Finally, after a bunch of logging enhancements, it's now possible to do this with the following PR: https://github.com/wikimedia/hyperswitch/pull/90
I'm wondering if it would be nicer to expose the actual JSON schema of the expected response and not invent a custom format?
After breaking the loop it seems to have stopped. Digging through the code for a whole day was not fruitful, and even an attempt to recreate a similar situation by manually inserting a non-existent revision into storage didn't let me reproduce this.
Thu, May 10
@Ottomata, when we send the revision-create event to the ORES precache endpoint we get the scores back as a result, but in the config we don't have the capability to inject those results into the event and re-send it. What we could do is write a JS module to handle that.
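A minimal sketch of what such a module might look like (all names, the endpoint URL, and the target topic are illustrative assumptions, not the actual change-prop module interface):

```typescript
// Hypothetical transform: send a revision-create event to the ORES
// precache endpoint, then re-emit the event with the scores injected.
interface RevisionCreateEvent {
  database: string;
  rev_id: number;
  [key: string]: unknown;
}

// Stub producer for the sketch; in change-prop this would be the
// service's existing Kafka producer.
async function produce(topic: string, message: object): Promise<void> {
  console.log(`produce to ${topic}:`, JSON.stringify(message));
}

async function precacheAndReemit(event: RevisionCreateEvent): Promise<void> {
  // POST the event to the (assumed) precache endpoint and read the scores.
  const res = await fetch('https://ores.wikimedia.org/v3/precache', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(event),
  });
  if (!res.ok) {
    throw new Error(`ORES precache failed: ${res.status}`);
  }
  const scores = await res.json();

  // Re-emit a single enriched message so downstream consumers get
  // the event and its scores together.
  await produce('mediawiki.revision-score', { ...event, scores });
}
```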
Wed, May 9
HA! GELF as the solution? I told you!!!
pdfservice can go away for a well-deserved retirement
swproxy can go away for a well-deserved retirement
cc @bearND re appservice, do we have a real beta cluster instance already? RB is still going to the appservice for tests
cc @mobrovac @Mvolz re zotero-test citoid-test citoid-jessie-test sca1
Tue, May 8
I think it's time to close this one. Please reopen if that breaks again during the transition process.
Interesting: revision 269290610 actually did exist for the page once, but it was somehow deleted, since that revision ID exists in the MySQL archive table.
Oh, sorry. It actually just happened again at 07:13 UTC:
Mon, May 7
This happened again today with the on_transclusions_update group: it just stopped consuming completely without any visible reason. There are some logs regarding the topic this group was consuming and some messages about it being rebalanced, but no crazy multi-generation reassignment logs.
@Tgr the link you provided doesn't work and I can't find instances of logs that look like the one you're talking about in restbase logs. Can you show one again please?
Doesn't seem related to the job queue either, as there are no job-related logs and it was propagating correctly.
Thu, May 3
We've enabled it for change-prop instances as well, so I consider this task resolved.
@Ottomata so we are still not getting proper logs, right? At least I can't find them ;(
Wed, May 2
All the 4xx from MW Action API except 404 are logged with 1% probability now. Log entry example: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2018.05.02/restbase?id=AWMiRbXNpesmgM3lqi_i&_g=h@b74aee6
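For the record, the sampling boils down to something like this (a simplified sketch, not the exact RESTBase code):

```typescript
// Log non-404 4xx responses from the MW Action API with 1% probability.
const SAMPLE_RATE = 0.01;

function maybeLogClientError(status: number, detail: object): void {
  if (status >= 400 && status < 500 && status !== 404 && Math.random() < SAMPLE_RATE) {
    console.log(JSON.stringify({ level: 'error', status, ...detail }));
  }
}

// Example: maybeLogClientError(400, { uri: '/w/api.php', method: 'POST' });
```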
Tue, May 1
Yes, wgJobTypeConf is intended to be set the same on all wikis to avoid having to shell/API out.
The topics have been separated. They use the service name as prefix now.
This has been deployed; the domain is now reported correctly.
It seems this task got derailed completely from the original purpose.
That suggests something times out and gets retried, but I have no idea what that something might be.
Almost all seem to come from the job queue; unfortunately, I don't think the job name is recorded.
@Tgr we don't have stats JUST for the Action API request itself, but we do have stats for calls to the action.js module. Since it's such a thin wrapper over a pure request to the Action API, and since for reading lists we only use rawquery, which is even thinner, I guess the stats for restbase.external.sys_action_rawquery.ALL.ALL.p95 would be a decent representation of the actual latencies we see in requests from RESTBase to the Action API:
Also, playing with the latencies dashboard on the RESTBase level, we have the ability to separate latencies by response code, and I can see the same degradation for 2xx as for 4xx:
Mon, Apr 30
Other instances of cross-wiki job scheduling that are yielded by a quick ack 'JobQueueGroup::singleton\(': Cognate/LocalJobSubmitJob, MassMessage/MassMessageSubmitJob, GlobalUsage/GlobalUsageCachePurgeJob, GlobalUserPage/LocalJobSubmitJob, SecurePoll/PopulateVoterListJob.
Do we need to migrate CentralAuthRename too? If so, can it be done? Thanks.
As you know, we can't call it resolved until we're sure, because there are many pending requests; if we told all the global renamers that the issue is solved, there would be a large number of rename processes in the log!
The problem started within hours of Kafka being enabled on mediawikiwiki, and it affects the wiki that's after mediawikiwiki alphabetically (which is the order global renames go), so it seems pretty likely there is a connection. (Except Husseinzadeh02/Hüseynzadə which also got stuck on the next wiki, minwiki, and the job did not finish properly on mediawikiwiki either. No idea what's up with that one.)
@mobrovac do you know if LocalRenameUserJob jobs on meta (and only there) could somehow be affected by the Redis-Kafka migration? I'm probably grasping at straws here, but not sure where else to look.
Wed, Apr 25
Consider troubleshooting problems like this with kafkacat -C | jq .
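Something like this (BROKER and TOPIC are placeholders):

```
# Consume the last 10 messages from TOPIC and pretty-print each one.
kafkacat -C -b BROKER:9092 -t TOPIC -o -10 -e | jq .
```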
I've run some analysis on the logs and indeed sometimes the cirrusSearchElasticWrite job is too large. Here are the sizes in bytes for all the log entries I could find so far:
If there is a way to monitor such errors, I guess we can pick up known large pages and modify them while writes are frozen?
When we freeze writes we start to push ElasticaWrite jobs that contain the full page doc which can be relatively large. We had to raise some limits in the past due to that (nginx request size when we added nginx in front of elastic).
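If we do add monitoring, a size guard could be as simple as this sketch (the threshold and names are made up for illustration):

```typescript
// Warn when a serialized job payload exceeds a threshold, so oversized
// write jobs are visible before they hit producer/broker size limits.
const MAX_JOB_BYTES = 4 * 1024 * 1024; // illustrative 4 MiB limit

function checkJobSize(jobType: string, payload: object): number {
  const size = Buffer.byteLength(JSON.stringify(payload), 'utf8');
  if (size > MAX_JOB_BYTES) {
    console.warn(`Oversized ${jobType} job: ${size} bytes`);
  }
  return size;
}
```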
The subtasks that were created to fix issues discovered during the first iteration of the switch have been resolved, and I don't see any logs indicating problems, so it seems like nothing is blocking us from moving some more projects to the Kafka queue.
We might want to test more wikis, or perhaps all of them?
I believe the fix for it has been deployed, so we can try to proceed with switching CirrusSearch for some more wikis?
This has been resolved by enabling the EventBus extension on loginwiki in T191464.
Support was enabled for all wikis except wikitech (see T192361 for reasoning). Resolving.
Tue, Apr 24
Wed, Apr 18 2018
So, we have 2 options for how to implement this.
Tue, Apr 17 2018
The deployment-mediawiki04.deployment-prep.eqiad.wmflabs host was removed per T192071, which explains the issue. I think this can be resolved now; please reopen if it comes back.