I updated this task with what I hope is a more descriptive summary of the problem and some possible solutions. I really think solving this is key to the Architecture team's mission of supporting 'infinite use cases insert rest of arch mission phraseology here' :)
We'd also have to somehow tie the Kafka produce call and the MySQL DB write call together into a single transaction.
To do this I think we'd need some kind of two-phase commit service for MediaWiki, which sounds really hard to me!
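To make the problem concrete, here's a minimal sketch (hypothetical code with made-up table/topic names, using pymysql and kafka-python, not anything MediaWiki actually does): the DB write and the Kafka produce happen sequentially, so a crash between them leaves one side without the other, and reversing the order just inverts the failure.

```python
# Sketch of the atomicity gap between a MySQL write and a Kafka produce.
# Table, topic, and connection details are made up for illustration.
import json

import pymysql
from kafka import KafkaProducer

conn = pymysql.connect(host="db.example", user="wiki", database="wiki")
producer = KafkaProducer(
    bootstrap_servers="kafka.example:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def record_edit(event):
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO edits (page_id, comment) VALUES (%s, %s)",
            (event["page_id"], event["comment"]),
        )
    conn.commit()
    # If the process dies right here, the row is committed but the event is
    # never produced; producing first instead risks an event with no DB row.
    producer.send("mediawiki.edit", event)
    producer.flush()
```

A two-phase commit coordinator would prepare both sides before committing either. For comparison (not something proposed in this task), the transactional outbox pattern sidesteps 2PC by writing the event to a DB table in the same transaction and having a separate relay ship it to Kafka.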
OH OOPS, I did, thank you.
@WMDE-leszek will also need a Kerberos principal and should be in either the wmf or nda LDAP group.
Ah, we should for sure not page on this. I just looked, and if monitoring is enabled, we set critical => true for the Kafka Broker Server process: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/kafka/broker/monitoring.pp#L53
Thu, Dec 3
The problem we were having was that the jobs were timing out before their execution time due to the long delay (longer than 60 days), which led to their success-files being deleted when the jobs were restarted.
Wed, Dec 2
Well, those ones would succeed because now the _SUCCESS flags exist. The problem is when the job times out before the _SUCCESS flags exist. If the October job succeeds without timing out before the Dec 19th dependency exists, we'll know we fixed it.
Related: T263672: Figure out where stream/schema annotations belong (for sanitization and other use cases). @mpopov mentioned that you might want to use stream config to enable/disable the setting of particular base fields by client libraries; we might want to do something similar for server-side default settings, like automatically filling in HTTP header values.
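As a strawman (a hypothetical config shape, not the real EventStreamConfig schema; all key and field names here are invented), per-stream settings for this could look something like:

```python
# Hypothetical per-stream config: which base fields the client library should
# populate, and which defaults the server side should fill in automatically.
stream_config = {
    "analytics.example_clicks": {
        "schema_title": "analytics/example_clicks",
        # client library sets these on every event
        "client_provided_fields": ["page_id", "session_id"],
        # server (e.g. the intake service) fills these in from the request
        "server_default_fields": ["http.client_ip", "http.request_headers"],
    },
}
```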
The concept of an event topic is designed to replace the idea of instrumentation producing events to a particular stream. Our system allows multiple streams to subscribe to the same events, and we've made this explicit: events are not sent to streams; rather, streams subscribe to topics and receive the events for those topics.
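A tiny sketch of that model (hypothetical names, not an actual Event Platform API): producers address topics, and streams are just subscriptions that receive every event on their topic.

```python
# Sketch of topics vs. streams: the producer knows nothing about streams;
# fan-out happens at subscription time. Names are made up for illustration.
from collections import defaultdict

subscriptions = defaultdict(list)  # topic name -> subscribed stream names
stream_events = defaultdict(list)  # stream name -> events received

def subscribe(stream, topic):
    subscriptions[topic].append(stream)

def produce(topic, event):
    for stream in subscriptions[topic]:
        stream_events[stream].append(event)

subscribe("analytics.page-visit", "page-visit")
subscribe("debug.page-visit-sampled", "page-visit")
produce("page-visit", {"page_id": 42})
# Both streams received the same event; the instrumentation only named the topic.
```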
Interesting, thank you! It's good to have that on the profile class.
Tue, Dec 1
I found SpecialInvestigate; it's in the CheckUser extension: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/644536
Hiya, I think we need to expedite https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/623459. We need it for the EventLogging -> EventGate migration.
I wonder if the -1 timeout in coordinator.xml conflicts with the Oozie property [[ https://oozie.apache.org/docs/3.1.3-incubating/oozie-default.xml | oozie.service.coord.default.max.timeout ]], which defaults to 60 days. It does seem likely, as the workflows are being put in TIMEDOUT 60 days after their created time.
Mon, Nov 30
I'm also trying to find out if we actually need to migrate AutoblockIpBlock and CookieBlock after all. Neither of these seems to have had any data for many years. Can we just deprecate them and not migrate?
So, the sqoop puppetization should just ensure that this directory exists?
I don't think this should be ops week, right? This is a regular task that should be scheduled, no? Moving back to incoming.
This looks like a problem with Firefox; it works OK in Chrome.
Hm, this looks to be related to or caused by T222603: Fix oozie banner_impression monthly job.
revision_id: undefined (which I'm assuming EventLogging filtered out better than it did the previously-mentioned nulls).
If revision_id is not required and you don't want to send it, you should omit it from the event data entirely.
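For example (a Python sketch with made-up field values; the same idea applies in the JS client, where an unset property is simply never serialized): build the payload so the optional field is absent, rather than present with a null value that may fail schema validation.

```python
# Omit optional fields instead of sending them as null.
import json

def build_event(action, revision_id=None):
    event = {"action": action}
    if revision_id is not None:
        event["revision_id"] = revision_id  # only present when there's a value
    return event

print(json.dumps(build_event("edit-attempt")))
# {"action": "edit-attempt"}  <- no revision_id key at all
```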
It's back! This time it looks like it was caused by a revision_id: null.
@Niharika I have migrated SpecialMuteSubmit and SpecialInvestigate to Event Platform.
@Niharika, AFAICT, AutoblockIpBlock hasn't had any data for years, if ever.
To fight phab proliferation I'm merging the subtasks in.
FYI @sdkim I'm declining this one and marking it as To Deprecate on our audit sheet.
I propose we do not migrate this schema and mark it as unused. There hasn't been a schema edit on metawiki since Jan 2017, and there isn't even a Hive table for this, which means there hasn't been an event sent since before we migrated to Hive.
+1 sounds good!
execute `kinit -R` automatically upon login for every user
THAT IS AWESOME YES PLEASE!
This is very similar to an issue in Spark: https://issues.apache.org/jira/browse/SPARK-23890, which is why we are using the Hive session to alter the table in the first place. The changes we are making are compatible type changes; Hive/Spark just can't tell the difference. The Spark fixes should be in Spark 3, but I haven't tested that.
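For reference, a hedged sketch of the Hive-session route (PyHive, with made-up host/table/column names; the specific ALTER is only an example of a compatible widening):

```python
# Issue the ALTER through a Hive session rather than Spark, since Spark
# (pre-3.x) rejects even compatible type changes (see SPARK-23890).
from pyhive import hive

cursor = hive.connect(host="hive-server.example", port=10000).cursor()
cursor.execute(
    "ALTER TABLE event.some_table "
    "CHANGE COLUMN some_field some_field BIGINT"  # e.g. widening int -> bigint
)
```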
We deploy an on-host memcached instance; we already have a lot of puppet code + monitoring + metrics to re-use.
Sounds good. Q: Is there a reason we couldn't/shouldn't use the prod memcached clusters?