Anyone can kill YARN jobs that they own:
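For reference, a quick sketch of what that looks like (the application ID here is just a placeholder):

    # List running applications to find yours, then kill it:
    yarn application -list
    yarn application -kill application_1234567890123_0001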
Ok, I think we are ready to do this! If there are no objections, I'll start on codfw tomorrow.
As in, maybe we should change as little as possible, and instead think about how to get a much better system in the medium term?
I don't love it! I feel like 4MB is already huge. Consider troubleshooting some problem with kafkacat -C | jq . and having to consume individual 4MB messages.
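E.g. something like this (broker and topic names are made up):

    # -C = consume, -b = broker, -t = topic, -c 1 = stop after one message
    kafkacat -C -b kafka-jumbo1001:9092 -t some.huge.topic -c 1 | jq .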
I already feel like 4MB messages are a lot, and would much prefer not to increase the max message size any more. Can these jobs be split up?
One nit! Remember that your JSON field names are going to be directly mapped to caseless SQL column names, so please avoid using camelCase when you can. snake_case is much better, e.g. app_install_id, cross_device_id, etc. :)
We might need some input from another analytics team member (@joal ?) about how this would work in Druid. Druid is usually (but maybe not always) inherently time series, so the idea of putting an updated state into it seems a little weird. I might be wrong though.
Tue, Apr 24
BTW, don't forget you probably want is_anon and primary_language, etc.! I know you are trying to be consistent, but we should fix the old bad usage asap. Mixed case doesn't work well in SQL systems.
I was trying to avoid this much simpler design just because I didn't want to send all the properties every time when only one of them has changed.
BTW, just checking that you are synced up with the iOS team on this. It sounds like you are both trying to do similar things! See T192819.
would be much easier and simpler to do on the backend side.
The properties field is a nested data structure, so in a Hive table, you'll get a struct field, and be able to access the fields like:
SELECT event.properties.readinglist_count or SELECT event.properties.*, etc. We recommend flat structures because there are many systems (like Druid) where nested fields aren't supported. (We actually have to flatten any EventLogging data for Druid anyway, because technically all schema fields are nested, since they are enclosed in the capsule as the event field.) We can support nested fields just fine, but they might not be very future proof, so in the schema guidelines we recommend that you usually avoid them.
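E.g., roughly (the database, table, and field names here are hypothetical):

    # Access a nested struct field in a Hive table:
    hive -e "SELECT event.properties.readinglist_count FROM event.someschema LIMIT 10"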
Mon, Apr 23
Yes, we will follow ISO 8601 with a minor deviation.
That deviation (using the timezone) is fine, but let's call the field something with dt in the name then. We try to use 'ts' when it is an integer timestamp. event_dt makes the most sense to me, and is consistent with the concept of 'event time' in general, not just 'client time'.
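For illustration, the kind of value I'd expect in event_dt (a sketch using GNU date, assuming UTC):

    date -u --iso-8601=seconds
    # => 2018-04-24T14:00:00+00:00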
I think A-team is drafting a more comprehensive response, but here's a few quick ones:
Ah great! Was going to ask for a phab task. https://gerrit.wikimedia.org/r/#/c/428337/
Into it, but I think I'd like to wait until we get all the workers upgraded to stretch before we do stat1004.
Thu, Apr 19
I don't have much context on how geowiki runs, but storing this in HDFS would be fine. We (I?) just thought it would be better to use some non-analytics-based way of doing this, and since geoip already comes from Puppet, we just thought of expanding on that there.
Gonna ping @JAllemandou on this one ^ :)
Wed, Apr 18
That'd be fine!
We could do that, but we wanted something centralized and reproducible (e.g. include a puppet class, get the historical dbs). We would have just put this as-is in gerrit and auto-committed to it, but we can't host it anywhere publicly, since we pay for these files.
@fdans, puppet will do that.
K sounds good. I'd go for 0.11 (after labs testing), but if you'd prefer to do 0.10 first, that sounds fine too.
Whaa? I did the very same cumin rm + apt-get update thing you did yesterday! Puppet ran everywhere fine then... or at least I thought it did.
Great! 17G is a little big for the puppetmasters as they are now, but we can ask ops if we can expand the partition, or add another one. We'll talk about this with them today.
Tue, Apr 17
Think so! We can reopen if there are more problems.
Ok! Got it. Should be good now.
Still working on this, I'm having versioning problems on different nodes...
Yeah, rats. Madhu's original rsync crons used the --delete flag. I disabled that flag in https://gerrit.wikimedia.org/r/426931, but by that time it had already run once. I'm now rsyncing pagecounts-ez over (without --delete) from (old) dataset1001 to restore anything that was there before.
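For anyone following along, the difference (paths are placeholders):

    # --delete removes destination files that no longer exist at the source:
    rsync -a --delete source/ dest/
    # without it, extra files at the destination are left alone:
    rsync -a source/ dest/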
Mon, Apr 16
For reference, just tested in labs. MirrorMaker 0.9 works (but is flaky and buggy) with both 0.9 and 1.x brokers, but 1.x MirrorMaker will only work with 1.x brokers, not 0.9.
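For context, upstream MirrorMaker is basically a consumer wired to a producer, each with its own config, which is why broker compatibility matters independently on each side. A rough sketch (the config file names are made up):

    kafka-mirror-maker.sh \
        --consumer.config source-cluster.properties \
        --producer.config dest-cluster.properties \
        --whitelist '.*'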
Ah @elukey I looked more into this and remembered that this might actually cause TCP issues after all. I think we should not move this to jumbo yet, but first look into getting a new PHP Kafka client deployed. Let's hold off on that though, and focus on the Kafka main and MirrorMaker stuff. This can be our last holdout on Kafka analytics, and we don't need MirrorMaker for it, so it doesn't block us.
Hm, I just thought about this a little bit, and I'm not so sure we should do it. The hiera info that this function is looking up is always at the global common.yaml level. There is never a case in a given environment (e.g. labs vs prod) where the value of kafka_clusters is different per role or node or whatever.
That timeline sounds fine to me!
Not sure I totally understand the problem. What is an example of a good metric name and a corrupted one here? You are saying that metric names are coming in with dots in the name as they should, and sometimes something is replacing those dots with underscores, but not all of the time?
The crons had been disabled, because the source locations didn't exist and were breaking things. I just re-enabled them.
Thu, Apr 12
The new producer in MirrorMaker 1.x is not compatible with old 0.9 brokers. So once this upgrade happens, we can no longer mirror to the old Kafka analytics cluster. What to do...?
We had planned to pause this until after the main Kafka upgrade in T167039, but from https://kafka.apache.org/documentation/#upgrade_1_1_0:
But in any case, I can't get pyhive to work either right now.
Ok, but we should keep (other?) change-prop topics mirrored?
+1 :) To clarify: we haven't officially scheduled any MirrorMaker TLS work, but after we upgrade main Kafka clusters, it should be relatively easy to set this up.
This goes for change-prop too, right? So blacklist .+\.(job|change-prop)\..+ ?
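Roughly like this, I think (config names made up; note upstream MirrorMaker only supports --blacklist with the old consumer):

    kafka-mirror-maker.sh \
        --consumer.config main-eqiad.properties \
        --producer.config jumbo-eqiad.properties \
        --blacklist '.+\.(job|change-prop)\..+'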
Wed, Apr 11
Timeline for upgrading main is Q4, but MirrorMaker + TLS wasn't in the plan. I don't think we should block your work though (private cross-DC data has happened for a long time, we just have to get incrementally better).
Hm, however, we're trying to make internal 'private' cross-DC data all go over TLS. If we do this, we would want to have TLS enabled for the main-eqiad <-> main-codfw MirrorMaker instance. To do that we need to upgrade the main Kafka clusters first.
Hm, currently the data we import into Hadoop is readable by anyone with a Hadoop account (not just analytics-privatedata-users), though the overlap of Hadoop users who aren't in analytics-privatedata-users is very small. We could change the permissions on the event database (or even possibly just a few job topic tables) to be readable only by analytics-privatedata-users.
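Something like this, I'd guess (the path and group here are assumptions about our setup):

    # Restrict the event database dir to the analytics-privatedata-users group:
    hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/event
    hdfs dfs -chmod -R o-rwx /wmf/data/event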
Tue, Apr 10
Or, more correctly, he hasn't updated his jobs to write the files to /srv/dumps on stat1005?