Started trying to help on this, ran into two problems:
This latest change I pushed removes any EventLogging instrumentation from the SpamBlacklist extension. Needs some review.
Not only is this not in the whitelist, it's also excluded from the pageview definition. The logic was that we don't want to include traffic to wikis that are not about "content". I believe we made an exception for outreach wiki, so we have to talk about whether we want to just allow all wikis.
This sucks but we're not likely to work on it, as we're moving away from mysql. We don't want to be mean though, so we can help sqoop this stuff into Hadoop if you need to use your painful workaround too much.
This is really great. @Ottomata, it's a good example if you want to try TypeScript in EventGate. It's pretty complete, and the linked guide is good too.
It's not you, it's Java. But I can't help without details; ping me on IRC, I'm very behind on my phab pings as you can see.
How about a new $wg that specifies an alternate set of baseModules? Something like $wgAlternateBaseModules: if it exists, use it instead of https://github.com/wikimedia/mediawiki/blob/master/includes/resourceloader/ResourceLoaderStartUpModule.php#L376
This increase in data sounds fine, and the proposed example path looks fine too. Hundreds of subfolders are only annoying when it comes to repairing Hive tables to make them aware of new partitions, but if you're accessing from Spark it won't matter.
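Just for illustration, here's a rough PySpark sketch of that difference (the database, table, and path names are made up): Hive needs an explicit repair to notice new partition subfolders, while reading the directory from Spark discovers them on its own.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hive only sees new subfolders after the table's partition metadata is
# repaired, e.g. with MSCK REPAIR TABLE (hypothetical table name).
spark.sql("MSCK REPAIR TABLE my_db.my_partitioned_table")

# Reading the directory directly from Spark discovers the
# year=/month=/day= subfolders automatically (hypothetical path).
df = spark.read.parquet("hdfs:///wmf/data/example/dataset")
df.where("year = 2019 AND month = 2").count()
```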
Yeah, going to close as a duplicate of that. The main obstacle now is expanding storage on the API cluster to fit the mediacounts dataset.
Thu, Feb 14
Wed, Feb 13
I understand it, yay! And I like it. We could even compute the tolerance from past data once in a while and use that instead of our guess; that way this approach could grow organically with the data. We should always have some absolute alarms, like: if entropy is ever 0, something went wrong. So we could put an entropy(min=0) check on pretty much every column.
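To sketch what I mean (hypothetical code, not the actual alarm implementation): the tolerance comes from past entropy values, and the absolute floor at 0 stays no matter what.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def check_column(current_values, past_entropies, sigmas=3):
    """Alarm on the absolute floor (entropy == 0) or on drifting too far
    from a tolerance learned from past data."""
    current = entropy(current_values)
    if current == 0:
        return "ALARM: entropy is 0, something went wrong"
    mean = sum(past_entropies) / len(past_entropies)
    std = (sum((e - mean) ** 2 for e in past_entropies) / len(past_entropies)) ** 0.5
    if abs(current - mean) > sigmas * std:
        return f"ALARM: entropy {current:.3f} is outside the tolerance around {mean:.3f}"
    return "OK"
```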
Also, I created the table and filled it through 2019-01 manually. This way we can just restart the whole job and not have to worry about adding a step to the workflow. I'm going to stall this task a bit until I add a job to productionize the GII aggregation in a yearly coordinator. I'll finish that up tomorrow.
For reference, this is how I generated the numbers sent to the GII folks:
Tue, Feb 12
OOUI uses a build step.
I really wish it did not. The build process makes testing OOUI patches with MediaWiki a major chore.
The build step only involves concatenating JS files, compiling Less files, and generating colored icon variants, all of which are also available in MediaWiki's ResourceLoader. It could probably be refactored to avoid the build step.
Mon, Feb 11
@Anomie: would it be much worse to implement this in mediawiki directly? Maybe a redacted column that starts out false and becomes true if rev_delete, log_delete, etc. are set.
The instrumentation for ExternalLinksChange was kind of jammed into the SpamBlacklist extension. It never really worked, so it could just be pulled out entirely. It seemed hard to force it to work, but it doesn't seem hard to undo it and clean up SpamBlacklist, as I don't think there are problems with the extension itself.
Fri, Feb 8
Reviewed the doc and fixed some spelling. I don't know what spill files are, but the rest made sense to me.
@Krinkle, the schema wouldn't change; it's fine as it is. The rest is correct. The confusion might come from the fact that we don't use the schema to refine data. That's how we ran into this problem: not knowing what type a field is, we just infer it from the JSON, which doesn't work for whole numbers that need to be stored as Doubles. So, my proposed hack until we can make refine aware of the schema:
Thu, Feb 7
@Krinkle: refined data was down-cast and is going to remain incorrect unless we re-ingest. We have raw data going back 90 days, so if you need it, we can alter the table and re-ingest to fix existing records. Let us know either way.
Tue, Feb 5
The status right now is that WMF is looking for a steward for the project. Ideally, a team would agree this is critical infrastructure and take over ownership of Graphoid and the Graph extension. We are mostly in agreement that it would be a bad idea to add even more logic to the current implementation, before it gets restructured. If no team agrees to take it over by the end of this quarter, I will see if I can make space to deal with it myself. As for GSoC work, before we get the basic service in shape I don't think it makes sense to draft any other tasks around it. Hope that helps, and I really hope a team decides to take this over.
I like the [a-z0-9_] restriction being up-front. It's predictable and a very fair limitation considering the many systems this data is expected to flow through.
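As a hypothetical example of why it's easy to live with, checking a name against that restriction is a one-liner in pretty much any system the data passes through:

```python
import re

NAME_RE = re.compile(r"^[a-z0-9_]+$")

def is_valid_name(name: str) -> bool:
    """Only lowercase letters, digits, and underscores are allowed."""
    return bool(NAME_RE.match(name))

assert is_valid_name("page_id")
assert not is_valid_name("pageId")   # uppercase rejected
assert not is_valid_name("page-id")  # hyphens rejected
```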
You mean T212386#4905949, right? Elukey, I'm going to wait until you have that in code review, just in case you have other thoughts as you make it more official.
Sun, Jan 27
Wed, Jan 23
This wikitext-in-JSON thing seems really complicated. I read through both comments above and walked away with a much better understanding of mediawiki, and I think that's a bad thing :)
Jan 22 2019
- The XML file with current pages, including user and talk pages (uncompressed)
- The full history dumps (uncompressed)
Jan 21 2019
Switching to Andrew per Luca.
Jan 16 2019
Quick note to thank everyone very much for all these use cases. They're very useful for both the short- and long-term planning that's always spinning in my brain. Thank you!
Which major design goal would that be? /me genuinely interested
Oh man, revoke my arithmetic license. Classic off-by-1000 error, sorry about that. I edited my comment above; it's 0.24%.
Jan 15 2019
UPDATE: my math below is very wrong; it should be 0.24% of pageviews, not 24%. Sorry!
Anything else to do here? The draft has been published, shall we move to Last Call?
I just dropped the data I mentioned in this task. Since we've been sqooping from mediawiki, we have a version of this kind of data in the wmf_raw database's mediawiki_revision table, which you can get at by comparing data across snapshots. And we can easily look at smaller periods of time.
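As a rough sketch of what comparing across snapshots looks like (PySpark; the snapshot values here are made up, and I'm assuming the snapshot and wiki_db partition/column names):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def revisions_per_wiki(snapshot):
    # Count revisions per wiki in one sqooped snapshot.
    return (spark.table("wmf_raw.mediawiki_revision")
                 .where(F.col("snapshot") == snapshot)
                 .groupBy("wiki_db")
                 .agg(F.count("*").alias("revisions_" + snapshot.replace("-", "_"))))

# Outer-join two snapshots to see how revision counts changed between them.
diff = (revisions_per_wiki("2018-12")
        .join(revisions_per_wiki("2019-01"), "wiki_db", "outer")
        .fillna(0))
diff.show()
```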
Dropped all milimetric_ tables, and gave up on my dreams of figuring out what exactly is going on with mediawiki's revision table.
I just checked and I think we've exorcised any async or roundtrip code. @Krinkle, is there a specific test you do or a specific mw.track call that was too slow that you'd like me to check into?
Jan 14 2019
Jan 12 2019
Thank you for pointing out the issues and helping with the docs, @srishakatux.
Jan 11 2019
I just want to reiterate our simpler way of handling build steps in Wikistats 2. In light of the problems described here, I think it deserves a second look. It's pretty standard on other Node.js projects I've looked at. We basically factor out most of our common build config and use it in two kinds of builds: