Still broken for me. Steps to reproduce:
- In Firefox, go to http://xtools.wmflabs.org/ec
- Fill in "fr.wikipedia.org" as the project
- Fill in "Kaldari" as the Username
- Hit Submit.
Was not able to test due to T168676.
This is while not logged in via OAuth.
Not relevant for Community Tech. Removing from board.
I'm not sure whether I like Tim's or Nemo's suggestion better. Tim's version would be more concise and simpler to use in most cases, but Nemo's version would be closer to the underlying implementation and possibly offer more flexibility. If we do use Nemo's suggestion, I would advise against calling the function cclike(), as that will probably confuse people into thinking it's an operator (like like, rlike, and irlike). A better function name might be are_confusable() (similar to the PHP function).
I haven't looked at this in depth, but it seems the best solution here would be to surface some of the functions from the PHP Spoofchecker class within AntiSpoof, perhaps with some overrides for edge cases like zh-hans / zh-hant pairs. In other words, provide methods like: AntiSpoof::areConfusable(), AntiSpoof::isSuspicious(), AntiSpoof::setChecks(), etc. (and eventually surface these functions within AbuseFilter as well).
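As a rough sketch of what I mean (the wrapper bodies are hypothetical; Spoofchecker and its methods come from PHP's intl extension):

```php
<?php
// Sketch only: hypothetical AntiSpoof wrappers around PHP's intl Spoofchecker.
class AntiSpoof {
	/** Are two usernames visually confusable with each other? */
	public static function areConfusable( $name1, $name2 ) {
		// Edge-case overrides (e.g. zh-hans / zh-hant pairs) would go here.
		$checker = new Spoofchecker();
		return $checker->areConfusable( $name1, $name2 );
	}

	/** Does a single username look suspicious (e.g. mixes scripts)? */
	public static function isSuspicious( $name ) {
		$checker = new Spoofchecker();
		$checker->setChecks(
			Spoofchecker::MIXED_SCRIPT_CONFUSABLE | Spoofchecker::INVISIBLE
		);
		return $checker->isSuspicious( $name );
	}
}
```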
No worries. The purpose of this table was fulfilled years ago. It is safe to burn with fire.
Wed, Jun 21
The front-end implementation could be a tool like CopyPatrol (but for audio instead of text). Or, if we wanted to do something quick and dirty, it could work like CorenSearchBot instead.
@Dispenser: Can you elaborate on "Match count is low."?
In the one page I checked that had two rev_parent_id = 0 revisions, both revisions were 'redirect' revisions. It is possible that rev_parent_id = 0 + page_is_redirect = false would get you only real page creations (if you don't want to count automatically created redirects).
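Untested, but the heuristic would look something like this with core's tables:

```php
<?php
// Untested sketch: revisions with no parent, on pages that are not
// currently redirects. Note page_is_redirect reflects the page's current
// state, not the state of the individual revision.
$dbr = wfGetDB( DB_REPLICA );
$res = $dbr->select(
	[ 'revision', 'page' ],
	[ 'rev_id', 'rev_page', 'rev_timestamp' ],
	[
		'rev_parent_id' => 0,
		'page_is_redirect' => 0,
	],
	__METHOD__,
	[],
	[ 'page' => [ 'INNER JOIN', 'page_id = rev_page' ] ]
);
```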
@Ottomata: I don't think that would solve the problem, as there are definitely cases of rev_parent_id = 0 that aren't redirects. For example, none of the 9 rev_parent_id = 0 revisions on this page are redirects: https://ia.wikipedia.org/wiki/Wikipedia:A_proposito/ro
According to Max, fixing all calls to Sanitizer::escapeId() so that they specify whether they are being used for a fragment or an ID would be a large task, as there are dozens of uses in different extensions. Because of this, we considered using percent-encoding for both, but using both percent-encoded fragments and percent-encoded section IDs doesn't work in Firefox, although it seems to work in all the other browsers.
Tue, Jun 20
Currently blocked by https://phabricator.wikimedia.org/T153393
@4shadoww: The first step is to start a community discussion on the Finnish Wikipedia. If the community supports the creation of a CopyPatrol interface and there are volunteers wanting to actually use the tool, we can then start working on the implementation. It's important to make sure that there are people actually wanting to use the tool, as we have built interfaces for other language wikis that have been completely unused and were a waste of time to create.
Are we still working on this, or is it complete?
@Matthewrbowker: Can you explain this task? I believe the Intuition migration guide is about moving message keys out of the Intuition repo (as used to be the practice) and into your own tool's repo (which I'm pretty sure is already the case for XTools).
I like the idea of having a page-creation event, but I would really, really like us to be absolutely sure that PageContentInsertComplete is indeed used only for new pages, and I don't think we can currently be sure that statement holds true.
@mobrovac: After parsing through the code, I'm pretty confident that PageContentInsertComplete is only called during new page creation. It looks like there are two cases where PageContentInsertComplete is not called during page creation, however: When a page is created via import (import/WikiRevision.php) and when a redirect is automatically created during a page move (MovePage.php). These seem like sensible exceptions (for my use case at least).
The only purpose of PageContentInsertComplete is to handle events related to page creation, so it should be the most reliable thing to use. rev_parent_id == 0 is just a proxy for page creation, so I don't think it makes as much sense to rely on, especially since we already know that it isn't reliable. My vote would be to create a new page creation schema for EventBus and use PageContentInsertComplete. The other advantage of using that hook is that the data is more likely to be comparable to the EventLogging data for page creation (which has been using that hook for years).
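For illustration, a minimal handler might look like this (hook signature per docs/hooks.txt; the actual event emission is left as a placeholder):

```php
<?php
// Sketch: fires on page creation only. PageContentInsertComplete is not
// called for subsequent edits, and (per the exceptions noted above) is
// skipped for imports and for the auto-redirects left behind by page moves.
$wgHooks['PageContentInsertComplete'][] = function (
	$wikiPage, $user, $content, $summary, $isMinor, $isWatch, $section,
	$flags, $revision
) {
	// A real implementation would emit a page-create event to EventBus
	// here; for now just log the creation.
	wfDebugLog( 'page-create', $wikiPage->getTitle()->getPrefixedText() );
	return true;
};
```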
@Ottomata: PageContentInsertComplete isn't supposed to capture all revision creates. It only captures page creates, which is exactly what I need.
@Marostegui: Thanks for the info about the back-up. Might be useful data for some wiki archeologist one day :)
@MusikAnimal: If querying only for pagetriage-curation reviews in the logging table is relatively fast, let's do that. 1.5 seconds (or 1.7 in my test) still feels slow for an API response. I tried querying a week instead of a month and it only took 0.6 seconds. How about we restrict the API to only querying the past day or week (as it was originally), and let your bot handle the longer time frames (albeit for both patrolling and reviewing)?
Mon, Jun 19
Since querying the logging table is slow, it seems like the easiest solution is to just convert pagetriage_log into a permanent log and remove the syncing/purging. Setting up a new purging arrangement (that is separate from the pagetriage_page purging) just sounds like unneeded complexity. Once the table is too big to query efficiently (in like 10 years) we can purge it again :)
If we decide to switch to using the logging table, we'll need to look for actions with log_type of pagetriage-curation and log_action of reviewed.
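An untested sketch of that query, limited to the past week, via core's DB wrapper:

```php
<?php
// Untested sketch: count PageTriage reviews logged in the last 7 days.
$dbr = wfGetDB( DB_REPLICA );
$cutoff = $dbr->timestamp( time() - 7 * 24 * 3600 );
$count = $dbr->selectField(
	'logging',
	'COUNT(*)',
	[
		'log_type' => 'pagetriage-curation',
		'log_action' => 'reviewed',
		'log_timestamp > ' . $dbr->addQuotes( $cutoff ),
	],
	__METHOD__
);
```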
Sun, Jun 18
Sat, Jun 17
@Samwilson: Is this live currently? I tried checking it at http://tools.wmflabs.org/xtools-dev/ec/fr.wikipedia.org/Toughpigs, but just got a bunch of blank boxes.
Fri, Jun 16
Yes, we can help with this. Do you want to put together a changeset we can code-review? But, to reiterate my earlier point: these events can be created, but they will not have historical information, so I cannot see how you would use this work immediately for ACTRIAL metrics.
During our meeting with Tobey and Victoria a few weeks ago, we decided that we needed a two-pronged approach to dealing with ACTRIAL: a short-term plan (to deal with the immediate issues) and a longer-term plan (that includes the possibility of ACTRIAL being implemented). The dashboard that we want to build from EventBus data is mainly to address the longer-term needs, while the improvements to the Data Lake data are to address the short-term needs. Since neither of these is really going to be available in the short term, we've been working with whatever imperfect data we've been able to cobble together (with Dan and Tilman's help) in the meantime. Thanks for your continued assistance on this and for prioritizing work on it. I know you guys are busy with lots of other projects and it isn't fun dealing with interruptions and context switching.
I like it!
Thu, Jun 15
I agree this is probably not easy to fix, but @Pastakhov might know better.
@Jdlrobson: Normally I'm a fan of your ruthless triaging, but this seems like a legit bug.
@Ottomata: From brion and Niharika's comments above, it looks like rev_parent_id = 0 isn't reliable. I would like to move ahead with using the PageContentInsertComplete hook instead and having a dedicated page creation schema/table. Do you think that makes sense, and is it something that you would want to help with?
The workarounds are:
- use the ssh:// protocol for review
- fetch patches manually (e.g. git fetch origin refs/changes/DE/ABCDE && git checkout FETCH_HEAD)
Here's a page that has 9 revisions (out of 12) with rev_parent_id = 0: https://ia.wikipedia.org/w/index.php?title=Wikipedia:A_proposito/ro&action=history
Hmm, not sure what to make of the results from recentchanges. That's really weird. I ran milimetric's query against iawiki and got 47 pages out of 32600 having multiple revisions with parent ID 0, which is 0.14%. That's really tiny, but I wonder if there's a chance the percentages are higher on larger wikis. For example, maybe it's due to oversighted revisions, which would be more common on bigger wikis. Of course it would probably take days to run the query against enwiki, so hard to tell.
@Ottomata: Nevermind, I see we can use revision-create where rev_parent_id == 0. I imagine this will be an order of magnitude slower than if we had a dedicated page creation event, but oh well.
Wed, Jun 14
To clarify, this was basically a suggestion to only run the regex over the first line instead of the whole page (as MediaWiki itself does in essence). This may also make things a bit faster.
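For the redirect case, the suggestion amounts to something like this sketch (simplified; MediaWiki's real check also honors localized redirect magic words):

```php
<?php
// Sketch: test only the first line of the wikitext, roughly mirroring
// how MediaWiki itself detects redirects.
function looksLikeRedirect( $text ) {
	$firstLine = strtok( $text, "\n" );
	if ( $firstLine === false ) {
		return false;
	}
	// Simplified pattern; the real magic word is localized per wiki.
	return (bool)preg_match( '/^\s*#REDIRECT\b/i', $firstLine );
}
```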
This has definitely gone too far down the rabbit hole. I agree with Danny that we should just forget about accounting for redirects if it's going to make getting this data significantly more difficult.
We were mostly worried about incoming links; however, there's also the problem of outgoing interwiki links from new wikis to old ones. Shall we also introduce a separate setting for them?
I don't think we'll be able to solve this problem, since there will be a mix of 3rd party wikis using legacy, legacy+html5, and html5-only section IDs over the next several years.
Looks good to me!
@Revent: Good point. We shouldn't overthink this until we have feedback from the Commons community.
Sure, I was going to create it at https://meta.wikimedia.org/wiki/Research:Wikipedia_article_creation, but that already exists. Guess I'll create it at https://meta.wikimedia.org/wiki/Research:Wikipedia_article_creation_II.
Apparently we can get 3 logo designs for $69 at https://worthylogollc.com/.
We probably don't want to hack MediaWiki core just to add this. AbuseFilter might be better suited for this.
Using AbuseFilter for this would be great from a developer point of view, but would suck for end users. Imagine going through all the steps in Upload Wizard only to be told that your upload was rejected (since AbuseFilter is only triggered at the last stage of uploading). I'm pretty sure there's a hook for adding valid MIME types, and there's also an UploadVerification hook in verifyUpload(), so this could be handled by an extension (rather than in core). Maybe the UploadWizard extension (although it's a bit of an awkward fit, since this will affect non-UploadWizard uploads as well).
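As a rough sketch of the extension approach, using the UploadVerifyFile hook (the MIME rule here is purely illustrative):

```php
<?php
// Sketch: reject a file type during upload verification, which runs
// before the final publish step where AbuseFilter would kick in.
$wgHooks['UploadVerifyFile'][] = function ( $upload, $mime, &$error ) {
	// Hypothetical rule: disallow this MIME type outright.
	if ( $mime === 'application/x-example' ) {
		$error = [ 'filetype-banned-type', $mime ];
		return false;
	}
	return true;
};
```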