Since change-prop is responsible for emitting the revision-score event, we'll have to make sure that these fields are in the event schema, and that change-prop sets them properly. Also ping @Pchelolo.
We plan to deploy this Monday Nov 26th.
Ping on this! I know it is Thanksgiving week so things might be slow, but I'm checking in anyway :)
Sam, that'd be great! Find me and Marcel (mforns) on IRC in #wikimedia-analytics and let's discuss. Actually...@fgiunchedi or @herron might have something helpful to say, as they are currently implementing some new Kafka -> Logstash integration.
Thu, Nov 15
We need a name! Have been brainstorming over on https://etherpad.wikimedia.org/p/event-platform. The current three top contenders:
PERFECT, thanks Andrew! We'll try to fit what we need in the stock VMs first. Sounds good.
Wed, Nov 14
If they work, I'm fine with them as long as...
Oook thanks Petr!
Tue, Nov 13
18/11/13 20:52:34 INFO RefineMonitor: No dataset targets in /wmf/data/raw/eventlogging between 2018-11-11T20:50:26.600Z and 2018-11-13T16:50:26.601Z need refinement to /wmf/data/event
IIUC it has to be row B for them to be used as Cloud Virts. @Andrew to confirm. If they can go in any row, then they should be spread as evenly as possible amongst as many rows as possible.
When ALTERing Hive tables, DataFrameToHive uses a manual JDBC connection to Hive rather than Spark SQL. This is a workaround for https://issues.apache.org/jira/browse/SPARK-23890: Spark doesn't allow issuing ALTER statements via spark.sql().
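Roughly, the workaround looks like this. This is a minimal sketch, not the actual DataFrameToHive code; the connection URL and ALTER statement are illustrative:

```scala
import java.sql.DriverManager

// Open a direct JDBC connection to HiveServer2, bypassing spark.sql()
val connection = DriverManager.getConnection("jdbc:hive2://hive-server:10000/default")
try {
  val statement = connection.createStatement()
  // spark.sql() would fail on this ALTER (SPARK-23890), so issue it over JDBC instead
  statement.execute("ALTER TABLE event.example ADD COLUMNS (`new_field` string)")
} finally {
  connection.close()
}
```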
@Cmjohnson I updated T207194: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet to reflect the new naming. Please proceed and then assign to Cloud VPS folks for OS install / puppetization setup as Cloud Virt nodes.
Mon, Nov 12
Ok, @Cmjohnson your call then: we'd prefer cloudvirtanalytics1xxx, but if that is too long, then use cloudvirtan1xxx. How should we proceed?
Let's make this happen! @Andrew are you ok with cloudvirt-anXXXX? @Cmjohnson, would you prefer to coordinate racking of these nodes on this ticket or on T207194: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet? (We'll have to rename the nodes.)
@ArielGlenn, they need to be copied into HDFS inside of Hadoop, not just available on a regular filesystem.
It has pros and cons: the downside of using backports is that it is a moving target, and we don't necessarily want to follow every upstream update. With a locally imported package we can more easily control when to upgrade (which is particularly important for fleet-wide installed packages, where we try to keep oldstable and stable at the same version).
The only one there that we should check on for sure is wdqs1009.eqiad.wmnet, but I'm pretty sure they use the Java client, not librdkafka for their updater service.
Oh hm. There are no prod services running on the stat boxes. We can (and should) upgrade there anyway.
I'm fairly certain there shouldn't be any stretch hosts using 0.9.3-1, or at least none running prod services. It should be fine.
I believe we had this problem (and discussion) before...and we decided that apt pinning to backports was better than importing our own version. This allows us to more easily upgrade and keep track of what versions are used where.
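For reference, the pinning lives in /etc/apt/preferences.d/ and looks roughly like this (package names and suite are illustrative; assuming stretch-backports here):

```
Package: librdkafka1 librdkafka++1 librdkafka-dev
Pin: release a=stretch-backports
Pin-Priority: 1001
```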
Could you get it up and running in just a python virtualenv instead of Docker? Or, mediawiki-vagrant has an eventlogging role!
I'd prefer if we used 'analytics' instead of 'data lake'. Can we do cloudvirtanXXXX? cloudvirt-anXXXX?
Ha yea Petr doesn't really have anything to do with this. Hm.
Mon, Nov 5
Cool, done: T208756: New Cloud VPS project 'cloud-analytics'
Declining in favor of T208756: New Cloud VPS project 'cloud-analytics'
Ok, so plan:
Hold on this while we figure out T207321.
Sun, Nov 4
Fri, Nov 2
K, here are our thoughts about this:
But why?! It's a map on webrequest.
As an engineer, I want to batch produce many events at once so mobile apps can produce events after an offline period.
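E.g., a client that was offline could flush its queue in a single request whose body is an array of events. This payload shape is purely hypothetical, just to illustrate the user story:

```json
[
  { "$schema": "/analytics/app/search/1.0.0", "action": "click", "client_dt": "2018-11-02T10:00:00Z" },
  { "$schema": "/analytics/app/search/1.0.0", "action": "click", "client_dt": "2018-11-02T10:05:12Z" }
]
```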
Interesting...I suppose this service isn't quite as critical as our prod ones. Maybe this is ok?
1/ Will all those hosts need to be in the same vlan/row (e.g. cloud-hosts1-b-eqiad)? Ideally they should be spread across multiple rows, to avoid an outage of one row (aka one failure domain) taking the whole service down.
OH duh, I forgot to account for the HDFS replication. Right. Ok in that case, let's go with option 2. Is there room on the switches for 10g? :D
Thu, Nov 1
@Nuria says check out https://meta.wikimedia.org/wiki/Data_retention_guidelines#How_long_do_we_retain_non-public_data. I think there is some problem with saving page_id together with user_id or any other user identifier.
If this is possible we don't see any reason why not! I don't think there is anything on the analytics backend side that would need to be changed; just the VSL settings and the EL extension. If traffic team is cool with it please proceed!
Oh, this is a duplicate; you already know about it over in T207424: Many errors on "MobileWikiAppiOSSearch" and "MobileWikiAppiOSUserHistory". Ok, thanks!
@chelsyx I just assigned to you because you are listed as a maintainer of this schema. Please route to whomever is appropriate! :)
Wed, Oct 31
hey hey look at that! :)
Let's hear what @JAllemandou thinks. The mediawiki_history dataset is under 1TB (snappy compressed parquet) per snapshot, and we want to keep a few snapshots around. We also may need some space to 'stage' the dataset copy while we load it into HDFS (not sure about this yet). I think ~15TB to start with should be ok.
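Back-of-the-envelope, assuming 3x HDFS replication: ~1TB per snapshot × 3-4 snapshots × 3 replicas ≈ 9-12TB, plus some headroom for staging, so ~15TB.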
If possible, I think I slightly prefer option 1. We may need more storage in the future, but I think for the time being it should be fine. @JAllemandou can correct me if I'm wrong, but we might not need more than 10TB or so to host a few mediawiki_history snapshots at once.
Oh, I think we won't necessarily need so much storage. CPU and RAM are more important. Faster disks might actually be better than larger ones in this case.
Are these numbers per worker, or total?
Per worker. This number is also flexible; it's just what we were aiming for with our bare metal hardware order.
Ok! I just disabled the whitelist filter in beta. Produce some events and they *should* show up in the MySQL database on deployment-eventlog05.
Just had a great meeting with @chasemp, @faidon, @JAllemandou and @Nuria. The main action item (after Nuria had to go) was to talk with Cloud VPS engineers to see if we could make this cluster on Cloud Virts instead of bare metal in prod. That would be totally fine with us, and actually even preferred. I think we thought this was not possible originally, but if it is, and we can do it within a couple of weeks, we'd like to proceed that way.
We need to be able to identify the schema for an event outside of any particular transport or api call. The schema will be used to dynamically create downstream data structures, e.g. Hive table schemas, etc.
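For illustration, a self-describing event might carry its schema URI in the event body itself, so any consumer can look up the schema without knowing the transport. The field names and URI here are hypothetical, just to show the idea:

```json
{
  "$schema": "/mediawiki/revision/score/1.0.0",
  "meta": {
    "id": "b0caf18d-6c7f-4403-947d-2712bbe28610",
    "dt": "2018-10-30T00:00:00Z"
  },
  "database": "enwiki",
  "rev_id": 123456
}
```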
Tue, Oct 30
Mon, Oct 29
Oo, here's a plausible explanation. kafka-jumbo1006 was the only broker that was missing some of its leaders. It is usually the last node to be rebooted for a full cluster reboot. webrequest_text takes the longest to sync back up after a broker restart. I betcha that all but these two partitions had resynced into the ISR, and the auto leader rebalancer was triggered to run (after 300 seconds) and saw that the imbalance percentage was greater than 10%. It then triggered a leader election BEFORE these two webrequest_text replicas were back in the ISR. Most of the leaders would have then been elected appropriately, but not for these partitions. Soon after, these replicas would have resynced, but at that point the partition imbalance is less than 10%, so any future auto rebalancer runs wouldn't trigger an election.
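For reference, these are the broker settings that drive this behavior. The values shown are the Kafka defaults, which I'm assuming jumbo runs with; they match the 300 seconds and 10% above:

```
# Periodically elect preferred leaders when imbalance exceeds the threshold
auto.leader.rebalance.enable=true
# How often the controller checks for leader imbalance (the "300 seconds" above)
leader.imbalance.check.interval.seconds=300
# Only rebalance a broker whose leader imbalance exceeds this percentage (the "10%" above)
leader.imbalance.per.broker.percentage=10
```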
Interesting! Today Luca and I were about to move partition leadership using kafka reassign-partitions, but we noticed that the replica assignment actually looked correct, and was already what we were going to change it to. It was only the leadership that was out of whack. So we ran a kafka preferred-replica-election to see if it would rebalance the leadership. And it did! The partition leadership now looks good.
Fri, Oct 26
Ping @Nuria too
I think we can close this. How we actually use $refs hasn't been decided, though. We'll need to find a schema $ref resolver library, or write one. AJV won't do it.
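To illustrate what needs resolving: a schema like this (hypothetical) uses a $ref that has to be dereferenced into a standalone schema before we can derive things like Hive table schemas from it:

```json
{
  "title": "example/event",
  "type": "object",
  "properties": {
    "meta": { "$ref": "#/definitions/meta" },
    "action": { "type": "string" }
  },
  "definitions": {
    "meta": {
      "type": "object",
      "properties": { "dt": { "type": "string", "format": "date-time" } }
    }
  }
}
```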
Am fine with /srv/geoip!
Thu, Oct 25
Hm a tricky bit about $refs and generating fully dereferenced schemas with AJV:
FYI, networking considerations being worked out in T207321: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster
hey hey heyyy, the nodes are in! https://phabricator.wikimedia.org/T204177#4695147
Reopening because we have an idea:
@Nuria needs to give the sign off from analytics, but from my POV this is all correct! Yeehaw!
@Pchelolo https://snowplowanalytics.com/blog/2014/05/15/introducing-self-describing-jsons/ is a really good read if you haven't seen it.