Fri, Dec 7
To duplicate something like check_private_data in Hadoop, I'd guess a day to write it and a couple of days to review and test it. So probably like a week to get everything deployed and integrated with the current job. We have to change some other jobs too to make them depend on this check.
Thu, Dec 6
Wed, Dec 5
On the issue of storage, the average per wiki would definitely not come out to 5GB. If you plot all wikis by number of revisions, the curve has a very long tail, with big wikis like enwiki, commons, and wikidata being outliers rather than the norm. I just did a basic count(*) in Hadoop and enwiki has 801325542 out of a total of 3704431563 revisions across all wikis across all time. If the comment and actor tables scale roughly in line with this, and enwiki has a 31GB materialized comment view as @Banyek mentioned, then my estimate for the overall size of the materialized comment view is around 150GB. Actor should be smaller because it benefits from sharing. I'm not sure if that's still too big, but I just wanted to say I have good reason to think it's not as bad as the 4500GB estimate.
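For clarity, here's the back-of-envelope math behind that ~150GB figure (assuming, as above, that the comment view scales roughly linearly with revision count):

```python
# Back-of-envelope estimate (assumption: the materialized comment view scales
# roughly linearly with revision count across wikis).
enwiki_revisions = 801_325_542
total_revisions = 3_704_431_563
enwiki_comment_view_gb = 31        # size reported for enwiki by @Banyek

fraction = enwiki_revisions / total_revisions        # ~0.216
estimated_total_gb = enwiki_comment_view_gb / fraction
print(round(estimated_total_gb))                     # ~143 GB, i.e. "around 150GB"
```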
Thanks very much @Anomie, I see now where my misunderstanding was, and your third answer is what I was asking.
Following up on a good problem that @Bstorm raised with my approach. I would love for @Anomie to take a look at my comment as well and see if it makes sense. Ok, so to confirm @Marostegui's understanding: yes, we are importing archive, logging, and revision from the views in the cloud replicas, and actor and comment, unsanitized, from the production replicas. The only way we use rows from actor and comment is to join them to rows from the three views, which are sanitized. I believe what @Bstorm was pointing out was that if rev_deleted is set to sanitize rev_actor, that would be fine for the revision table; but if the archive table references the same actor_id through ar_actor and the ar_deleted flag is not set, then that reference would not be sanitized. According to the actor view definition (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml#222), a particular actor_id is not present if it's sanitized by *any* log/rev/ar/etc_deleted flag.
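To make the concern concrete, here's a minimal sketch of the two join paths being discussed (simplified, hypothetical queries; not the actual import SQL):

```python
# Illustrative sketch only (simplified column lists; not the actual import queries).
# The safety property being relied on: actor rows are only ever reached through
# joins against the sanitized views, so sanitization has to hold for every
# referencing table, not just revision.
REVISION_ACTOR_JOIN = """
select rev_id, actor_name
  from revision    -- imported from the sanitized cloud-replica view
  join actor on actor_id = rev_actor
"""
ARCHIVE_ACTOR_JOIN = """
select ar_id, actor_name
  from archive     -- also a sanitized view, but governed by its own ar_deleted flag
  join actor on actor_id = ar_actor
"""
# The risk: if rev_deleted hides an actor reference in revision but ar_deleted is
# not set in archive, the same actor_id would still surface through the archive
# join. The maintain-views definition linked above handles this by excluding an
# actor_id when *any* referencing *_deleted flag is set.
```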
Tue, Dec 4
I have done a limited test on 3 wikis: etwiki, simplewiki, and hawiktionary. Description of the test and results:
The other reason for a single host instead of redundant hosts is this: our only critical use of the box is during the first few days of the month. The box would have to break during those first days to really impact us; if it breaks at any other time and is rebuilt by the time we need it, we're fine. So that further mitigates our risk. Assuming we can query from the other replicas in the worst-case scenario, I think we're fine with just one additional box.
And to follow up on my first bullet from before:
+1 to @Bstorm's framing of the problem. I was going to say the same thing, and I withdraw the revision query I posted above from @Banyek's consideration. We can work around our performance problem with or without more hardware (more is better), but everyone wins if we get faster queries. What the DBAs should consider, with our help, is simply improving the performance in general as @Bstorm outlined.
Still, if we got all the media requests from Media Viewer into EventLogging, we would not have all media requests for MediaWiki in general. To do that, we'd have to go around instrumenting every place that fetches and renders media (in core, extensions, etc.). So if we want thorough and meaningful mediacounts before all that work happens, I think we need to handle the filtering in Hadoop.
- exact query for a materialized view that would allow us to import the revision table into Hadoop. There are two possible ways to do this, depending on where the query runs:
Thanks to @Banyek for the questions and a talk we just had over hangouts, we decided to go in two directions in parallel:
Mon, Dec 3
Yes, all the templates for the queries are here, and they're easy to read, just basic Python templating: https://github.com/wikimedia/analytics-refinery/blob/7cf62e1a7b9cc65cdf229e8a13ced8f6769f0a14/python/refinery/sqoop.py#L180 (that's archive, but scroll down and you'll see all the tables).
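For anyone who doesn't want to click through, the style is roughly like this (an illustrative sketch, not the actual template from sqoop.py; the column list is trimmed, and $CONDITIONS is the placeholder Sqoop substitutes with its split predicates):

```python
# Illustrative sketch of the templating style; see the linked sqoop.py for the
# real per-table templates.
ARCHIVE_QUERY_TEMPLATE = """
select ar_id,
       ar_title,
       ar_timestamp
  from archive
 where $CONDITIONS
"""
```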
Ok, done and agreed. But instead of trying to find hardware that will keep up with replication, I'm asking whether replication is necessary at all: could we do it some other way, given the relatively simple requirements?
More details on requirements for Analytics. The queries that we'll be running will be of the form:
Got it. Luca asked me to comment here describing exactly what we need to do on the boxes. But if basic replication can't even run, I of course defer to you all. It does raise questions about this approach, though: it seems wasteful to replicate all operations for hundreds of tables in real time when ultimately we only need a snapshot of a small minority of them.
The queries that we'll be running here will be of the form:
@Marostegui, we would like to go over plans for implementation during our Wednesday meeting. Is there anything else you'd like us to define or discuss before then?
I can take care of this, @Krinkle, unless you're doing it as an urgent matter. Let me know.
Thu, Nov 29
- I assume that instead of images what we really mean here is "files." E.g., presumably this will also give us a count of pageviews (not plays) for video or audio files?
- My understanding is that it is actually pretty common for users to upload images, etc., directly to individual wikis. Can this track that as well?
Will merge others into this, but keep in mind this nice analysis about the Cassandra storage implications: T88775#4751882
cc @Nuria so she's in the loop that we're following up here about the materialized views work.
Wed, Nov 28
I think the questions are around unifying Mobile Front End and Core. Is this a good idea, is it something that's planned at some point, or has it been considered and decided against? And, relevant to this epic, how does the Minerva refactor affect any such plans?
I also have some questions about this effort on behalf of TechCom. Mainly, is there a future where we unify front ends? If so, we could sketch plans together for how to get that done. If not, do you have thoughts about keeping the separation? @daniel is thinking a lot about not having two front ends, so cc-ing him on any conversation. I personally don't know enough yet, I think, but this is an area of our platform that I've been interested in. I'd like to take a more active role, maybe helping with some basic tasks to get up to speed.
The undefined line is due to an empty newline at the end of the file. I'll fix it as a bug in Dashiki and deploy your dashboard again; no other action will be required. See T210570
@rafidaslam, I'll leave this up to you; do whatever you like: ignore 504, ignore 503, or even ignore both. It's a matter of style that I don't think affects the readability of the code too much.
Tue, Nov 27
@domoritz I've been trying to make time to work on it. I think I can start this Friday, thank you for offering to help! I'll ping you with any questions on this task.
@Pchelolo, weird! I just blamed the first thing I saw that made any sense, and now we have no explanation. But it doesn't matter; like you said, the bug was not sorting in the first place. Agreed with marking this Invalid, thanks.
Mon, Nov 26
I tried to register at the https://codein.withgoogle.com/ site, but that site has some major bugs... For example, when signing in, it says I am not authorized to be there. But it also signs me out of Chrome!!! I didn't think that was possible... I'm going to stay away unless someone lets me know how to get around this.
Just to follow up and put this issue to rest: the caches are all cleared and the responses are all consistent, going from rank 1 to rank 1000 in order. The sort now happens at the service level, so it won't be affected by underlying libraries anymore.
Wed, Nov 21
The fix for this has been deployed, but it'll take a while to clear the cache. Sorry for the inconvenience.
Private. Niharika would benefit from being a part of analytics-privatedata-users, including access to data before it's sanitized.
Random thought related to this: part of the fun of "random page" was getting really weird things, and this bug might have helped with that. The older the page, the higher the probability it's a more common topic (Britney Spears was added before Free energy principle). Not sure how everyone feels about messing with "random page", but we could for example build a service that picks a random page, making it less likely to be picked if it has high pageviews.
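A rough sketch of what I mean (purely hypothetical service logic; inverse-pageview weighting is just one option):

```python
# Hypothetical sketch of a "weighted random page" service: pages with fewer
# pageviews are more likely to be picked, so the weird stuff still surfaces.
import random

def pick_random_page(pages_with_views):
    """pages_with_views: list of (title, pageview_count) tuples."""
    weights = [1.0 / (views + 1) for _, views in pages_with_views]
    title, _ = random.choices(pages_with_views, weights=weights, k=1)[0]
    return title

print(pick_random_page([("Britney Spears", 500_000), ("Free energy principle", 1_200)]))
```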
Tue, Nov 20
Sorry to have missed this ping for so long. If you're waiting on me for more than a day and you need an answer quickly, do ping me on IRC.
Great. Once this is available in the cloud replicas, give me a ping and I'll update the Hadoop import to include it.
I think @Bawolff was referring to the automatic query that Sqoop generates against the table you point it at, usually something like select min(id_you_split_by), max(id_you_split_by) from table_to_sqoop. I would hope the MariaDB optimizer knows to treat that the same as the order-by-comment approach you mentioned, but you never know :) In any case, we don't have much control over how Sqoop builds these queries, but we could choose to import those tables without parallel splits, which would skip that inefficient query; it would also slow the import down, though. Basically it's a tricky problem, and we tried a few different solutions, but ultimately the new views are just a bit too slow.
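To illustrate why that min/max query exists at all, here's a simplified sketch of the splitting logic (not Sqoop's actual code):

```python
# Simplified sketch of how Sqoop uses the boundary query to parallelize an import;
# not Sqoop's actual implementation.
BOUNDARY_QUERY = "select min(id_you_split_by), max(id_you_split_by) from table_to_sqoop"

def split_ranges(min_id, max_id, num_mappers=4):
    """Carve [min_id, max_id] into one contiguous id range per mapper."""
    step = (max_id - min_id + 1) // num_mappers
    return [
        (min_id + i * step,
         max_id if i == num_mappers - 1 else min_id + (i + 1) * step - 1)
        for i in range(num_mappers)
    ]

# Each mapper then pulls its own id range in parallel; with a single mapper
# (no splits) the boundary query isn't needed, but the import runs serially.
print(split_ranges(1, 1_000_000))
```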