Thu, Jul 19
Tue, Jul 17
@daniel Surprisingly, there is interest in this going through TechCom after all. I've been digesting the discussion into this page, https://etherpad.wikimedia.org/p/JADE_scalability_FAQ, and would like to ask you for an example of the best format for presenting the issues to the committee.
Mon, Jul 16
I don't like it, but the solution was to overwrite the page with new content...
Fri, Jul 13
Thu, Jul 12
I added an XML dump and import clause so we have more content, but we don't have data in every table yet. See the commit message for a list of empty tables. Personally, I'd prefer to merge like this and incrementally improve, since it's usable and adds value as-is...
@ArielGlenn The patch is ready to submit now, IMO. With the global parsing fix merged and the latest tweak to the vagrant role, it's possible to run a dump.
Testing more, I found a missing step. I need to run "make" in /vagrant/srv/mwbzutils/xmldumps-backup/mwbzutils
Wed, Jul 11
Looks like we can override Content::prepareSave and return a Status::newFatal with a more specific error string.
@notconfusing Great results so far! I've been pretty distant from the investigations into bad models, but have a few random thoughts:
- Reverts are a problematic data set. @Ladsgroup did a k-means cluster analysis and found many distinct clusters, with differing numbers of clusters in each language. This informed our decision to move beyond revert classifiers. We should probably focus on damaging and goodfaith for now. My interpretation is that "reverted" is not a proper label; it masks several more meaningful and detailed labels.
- We've noticed that certain labelers have systematic biases. It might be interesting to calculate an average variance for each individual labeler.
- https://en.wikipedia.org/wiki/Active_learning_(machine_learning)#Query_strategies has alternative methods for identifying problematic observations, FWIW.
- My naive impulse is to set up a second round of human labeling. Maybe we should even provide a "not sure" choice to get some signal about human certainty levels? I don't know what the industry norms are for that.
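A rough sketch of the per-labeler idea above, in case it helps the discussion. It assumes labels arrive as (labeler, item, label) tuples with 0/1 labels and that each item has more than one labeler; the function name and data layout are hypothetical, not anything from our pipeline. For each labeler it reports the mean signed deviation from the per-item average label ("bias") and the variance of those deviations.

```python
from collections import defaultdict
from statistics import mean, pvariance


def labeler_deviation_stats(labels):
    """Summarize how each labeler deviates from the per-item consensus.

    labels: iterable of (labeler, item_id, label) tuples, label in {0, 1}.
    The tuple layout is an assumption made for this sketch.
    """
    labels = list(labels)  # iterated twice below

    # Consensus for an item = mean of all labels it received.
    by_item = defaultdict(list)
    for _labeler, item, label in labels:
        by_item[item].append(label)
    consensus = {item: mean(vals) for item, vals in by_item.items()}

    # Signed deviation of each individual label from that consensus.
    deviations = defaultdict(list)
    for labeler, item, label in labels:
        deviations[labeler].append(label - consensus[item])

    return {
        labeler: {"bias": mean(devs), "variance": pvariance(devs)}
        for labeler, devs in deviations.items()
    }
```

A consistently positive or negative bias would suggest a systematic tendency, while a high variance would suggest inconsistency. The consensus includes the labeler's own label, so this is only a first approximation.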
Thu, Jul 5
Lowered the priority because the epoch seconds check is only a fallback for the more accurate editRevId, the latest revision ID. The remaining work is to remove the fallback.
Wed, Jul 4
There's a lot to go through in this thread. We won't be doubling the revision table. My current estimate for the upper bound of activity is actually 0.5M additional pages and revisions per year on the largest wikis, and only hundreds or thousands of additional pages on the smaller wikis. If you want to store revisions from this namespace on x1, that sounds like a reasonable precaution to me. Where is this sharding configured? Is it okay that we continue to use wiki pages and revisions, or would we have to use a custom table?
@jcrespo I see, well in this case content storage is exactly what we're planning to use. Is there anything special to do in order to set that up? For example, the judgment about https://en.wikipedia.org/?diff=12345678 will be made on the same wiki, in https://en.wikipedia.org/wiki/Judgment:Diff/12345678
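To make the mapping concrete, here is a minimal sketch that derives the judgment page URL from a diff URL, following the example above. The helper name is hypothetical, and it only handles the `?diff=<id>` form shown here.

```python
from urllib.parse import parse_qs, urlsplit


def judgment_url_for_diff(diff_url):
    """Map a diff URL to the Judgment page on the same wiki.

    Hypothetical helper: only the "?diff=<revision id>" URL form is handled.
    """
    parts = urlsplit(diff_url)
    values = parse_qs(parts.query).get("diff", [])
    if not values or not values[0].isdigit():
        raise ValueError("expected a numeric diff parameter")
    # Same scheme and host, so the judgment lives on the wiki it describes.
    return f"{parts.scheme}://{parts.netloc}/wiki/Judgment:Diff/{values[0]}"
```

For the URL above, this yields https://en.wikipedia.org/wiki/Judgment:Diff/12345678.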
@jcrespo Thanks for the reply!
Hi @jcrespo @BBlack, nudging per T183381#4296475 and here. We're hoping to deploy a new extension whose impact is limited to about 0.5M additional pages created per year on large wikis, assuming the most optimistic, uncontrollable uptake scenario. I'd love to hear DBA and Traffic perspectives on the proposal.
Tue, Jul 3
CC @Fjalapeno, I'd be interested in your thoughts about the potential for a flood of data here.
Now I'm thinking that we shouldn't make the editRevId parameter mandatory, and can point to the other ApiEditPage params being optional as precedent. The default behavior of guessing editRevId = page.latest is sane and usually correct as well, so requiring clients to track page.latest before making an edit adds complexity with no gain.
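A toy sketch of the optional-parameter behavior, to illustrate the trade-off. This is an in-memory model, not MediaWiki code; `Page`, `save_edit`, and `EditConflict` are names made up for the example.

```python
class EditConflict(Exception):
    """Raised when the edit is based on a revision that is no longer latest."""


class Page:
    def __init__(self, latest_rev_id, text):
        self.latest = latest_rev_id
        self.text = text


def save_edit(page, new_text, edit_rev_id=None):
    """Save new_text; edit_rev_id is optional and defaults to page.latest."""
    if edit_rev_id is None:
        # Guess that the client based its change on the current revision.
        edit_rev_id = page.latest
    if edit_rev_id != page.latest:
        raise EditConflict(
            f"edit based on revision {edit_rev_id}, latest is {page.latest}"
        )
    page.text = new_text
    page.latest += 1  # stand-in for allocating a new revision ID
    return page.latest
```

In this model, a client that omits the parameter never sees a conflict, even if it read an older revision, whereas a client that passes it gets an error as soon as the page has moved on.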
Mon, Jul 2
It doesn't look like we ever solved the race condition for ApiEditPage, which doesn't use editRevId. I'd like to require this param everywhere, since it's quite risky to edit a page without knowing what revision we're basing changes on.
Sat, Jun 30
Thu, Jun 28
Should I post my yubikey serial again? I won't actually have access to it until July 12th, fwiw.
Wed, Jun 27
It looks like clients can't communicate any type of "PATROLLED" message, indicating that a change has been reviewed and is good?