May 30 2020
I'd like to close this as "declined" for now, as we haven't really seen any interest in this since the last comment. If there's interest, and I'm able to get more focused time to work on this part of SuggestBot, then we can reopen this. It could also be a potential candidate for a Hackathon project, but I don't know much about the criteria for those.
I don't have the bandwidth to work on this, so I'm removing myself as the assignee.
Aug 26 2019
@Bstorm : looks like https://gerrit.wikimedia.org/r/513943 did not include a definition of comment_archive, was that intentional? I would expect there to be a comment_archive table to allow for joining with the archive table when querying archived edit comments (similar to how there's an actor_archive table), but that table doesn't exist on the replicated databases on Toolforge.
Jan 29 2019
Dec 15 2018
The suggestbot-prod instance has been shut down and deleted (ref the suggestbot log). SuggestBot is now running from suggestbot-01. Move completed, task resolved.
Oct 27 2018
The migration process has been started by creating a new instance suggestbot-01. Had a bit of a start-stop-start situation as the process of moving instances to the new eqiad region is coinciding with this work, so the new instance had a short life in the old region, but is now present in eqiad1-r.
Oct 10 2018
Aug 21 2018
The report is now posted as a sub-page of the AfC Process Improvement page on enwiki. Marking this as resolved and reassigning it so I can track it there in case it gets reopened.
Fixed it by getting an email alias set up, so I'm marking this as resolved.
Aug 20 2018
Can confirm I have access and everything seems to be working. Thanks for taking care of this, and so quickly as well, awesome work!
Is this something that can be resolved on the Phabricator end, or should I look for a workaround? Either way is fine with me, as long as I can get a second account set up.
Aug 16 2018
Aug 14 2018
Aug 13 2018
@Milimetric Thanks for taking care of the SQL queries! I don't see a need for backfilling the data at the moment; the benefit doesn't warrant the cost. As mentioned, I can help the NPP folks out with getting their data together. In other words, as far as I can tell, this ticket can be closed now.
Aug 10 2018
@Niharika Yes, I'd like to keep this open and try to wrap it up in the near future, if that's okay with you?
I don't think backfilling all the data is very important. The only ones that appear to be affected are the NPP reviewers, and I should be able to run some queries on the Data Lake to either fill the missing data, or get reasonable estimates they can use.
Aug 8 2018
Aug 7 2018
Aug 2 2018
Aug 1 2018
Jul 31 2018
Jul 30 2018
Apr 30 2018
I see you're running into some of the same challenges that I had with getting good data on this for ACTRIAL, and that you've found some of the code and data that I have. Since I'm currently working on T192574, there's also some newer code and data available.
Apr 23 2018
The data gathering for this is now running, and I expect it'll take a day or two to complete. I also updated the database schema to have a column for the timestamp when a submission was withdrawn so that we can use that to better estimate the contribution to the AfC backlog from pages created in the Draft namespace (hypothesis 17).
Apr 19 2018
Mar 27 2018
I've spent a bit of time looking at this, and as far as I can find, the revision_deleted_timestamp is consistently incorrect. Using a sample dataset of creations from four different months, I've found that 15% of the time the deletion timestamp is missing. For pages that have it set, the vast majority of entries (almost 90%) do not match against the logging table. Lastly, of those that match against the logging table, it's almost always not a page deletion event.
Mar 22 2018
As mentioned on IRC earlier today, I never filed a ticket because I didn't have the time to sit down and make sure I had data that allowed me to understand exactly what the problem is. Picked it up again today because I now have some time to dig in.
Feb 23 2018
Jan 17 2018
I checked the dashboard for enwiki and spot-checked a dataset, and the data appears to be in working order. Thanks for helping take care of this @Milimetric, and great to learn there's a way to easily fix this next time!
Jan 16 2018
Nov 29 2017
Nov 21 2017
Never mind, turns out @mforns has already updated that configuration; I should've checked that first. Thanks again for taking care of it!
The data behind Page Creation Dashboard is configured to read data from the log database on dbstore1002. Can I at this point submit a patch to the ReportUpdater configuration that updates it to use db1108.eqiad.wmnet, as that now has the updated log database?
Oct 26 2017
Looks good to me, thanks again!
Oct 19 2017
Sep 27 2017
- Verified that the dataset of number of pages created is available in the correct dataset directory.
- Added metric for number of pages created in the Draft namespace to Dashiki:CategorizedMetrics in this edit.
- Added metric to Config:Dashiki:PageCreations in this edit.
- Verified that the metric is now available, it can be viewed here.
Sep 21 2017
Sep 20 2017
Sep 12 2017
@kaldari : The three last metrics are only defined for English Wikipedia, partly because I saw them as ACTRIAL-specific. The autopatrol right is also granted to different user groups depending on which wiki we're looking at, and I didn't see the benefit of figuring those out for the entire set of wikis.
@Nuria : Thanks for taking care of this! Sorry I didn't get around to updating the commit message as you requested, forgot to put that on my todo list.
Sep 11 2017
There's a user group called "autoreviewer" that specifically gets the "autopatrol" user right. That right is also applied to bots and admins. Or at least that's how I read en:Special:ListGroupRights. The help page mentions that it used to be called "autoreviewer", so I guess they just never renamed the user group.
@kaldari : No, I really mean "autoreviewer", ref en:Special:ListGroupRights. I haven't been able to find any documentation that defines the user group in the system as "autopatrolled". And yes, I find that confusing.
@Neil_P._Quinn_WMF : I actually ran a query to get similar data on Friday, because I've been using it to figure out how long it takes for articles to get reviewed. My current best version of the query is in our GitHub repository: non_autopatrolled_creations.hql It looks for non-autopatrolled creations, but it's trivial to calculate the opposite proportion as I also have data on all article creations.
@awight : I was working on this yesterday, but didn't get the dataset ready overnight. The process I have goes as follows:
@Nuria : I added a short note to the tutorial about the requirements. Since I don't know npm very well, it's rather non-specific on how to get them installed. I'll make a mental note to look into nvm on a rainy day, as that might allow it to be more specific on how to go about doing this since I'll then know how to do this for both a global npm install as well as for a local one using nvm.
Sep 8 2017
@Nuria: I've tested our dashboard locally here and everything seemed to be working just fine. How do we go about getting it deployed? In this specific project, having a VM on Labs isn't really an option.
Sep 6 2017
Ah, I see! The tutorial isn't aligned with said documentation then. I'll update the tutorial and move forward.
From what I can tell after digging around a bit, the configuration of the Dashiki extension limits the creation of pages in the "Config" namespace to ones with titles starting with "Dashiki:" (refs [1,2]). Thus, I can create "Config:Dashiki:PageCreations", but not "Config:PageCreations"; I suspect the latter is instead a pseudo page used by the JsonConfig extension.
Sep 5 2017
@Nuria : I'm working on this now, got the metrics added to [[m:Dashiki:CategorizedMetrics]] without breaking anything, or so it seems. I do not have permissions to create [[m:Config:PageCreationDashboard]], but it appears I can edit existing dashboards. Could you (or someone else who has permissions, pinging @kaldari) create the config page for our dashboard so I can edit it? Feel free to create it with a different title if the one I suggested breaks conventions.
Aug 31 2017
Ah, I remember being confused by the configuration file path in the examples I looked at, but forgot to ask about what it should be. Thanks for figuring that out and updating it, and also for your help with reviewing the patch, much appreciated!
Aug 29 2017
I'm a bit pressed for time at the moment, so to prevent this from stalling I'd like to propose that a first priority is that I try to create a dataset that doesn't have any redirects in it. Given the low number of redirects we have in the dataset, I expect this problem to be minimal if I simply sample a few hundred extra articles in the classes where that is possible. I'll also make sure the dataset doesn't contain any disambiguation pages.
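A rough sketch of what building such a sample could look like. The field names (`quality_class`, `is_redirect`, `is_disambiguation`) and the toy data are hypothetical, and this version filters out redirects and disambiguation pages before sampling rather than drawing extra articles afterwards, so treat it as an illustration and not the actual process:

```python
import random


def sample_without_redirects(pages, per_class, seed=42):
    """Sample up to `per_class` pages per quality class, skipping
    redirects and disambiguation pages (hypothetical record layout)."""
    rng = random.Random(seed)
    eligible = {}
    for page in pages:
        if page["is_redirect"] or page["is_disambiguation"]:
            continue
        eligible.setdefault(page["quality_class"], []).append(page)
    # A class with fewer eligible pages than requested is used in full.
    return {
        cls: rng.sample(candidates, min(per_class, len(candidates)))
        for cls, candidates in eligible.items()
    }


# Made-up pages, purely to exercise the function.
pages = [
    {
        "title": f"Page {i}",
        "quality_class": "Stub" if i % 2 else "Start",
        "is_redirect": i % 5 == 0,
        "is_disambiguation": i % 7 == 0,
    }
    for i in range(1, 41)
]
sample = sample_without_redirects(pages, per_class=10)
```

Classes that don't have enough eligible pages simply end up smaller, which matches the "where that is possible" caveat.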
Aug 28 2017
Aug 25 2017
I adapted this query for use in gathering some statistics for the ACTRIAL project and noticed that it seemed to fail to pick up deleted articles. In my dataset gathered a week ago there are 730 article creations on 2017-01-01, and 729 of those currently exist in the revision table. What appears to be a key reason for this is that event_comment for those deleted articles is NULL, which causes any event_comment NOT REGEXP 'foo' condition to remove that row from the query result.
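A minimal sketch of the NULL behaviour, assuming a toy `events` table with made-up rows. It uses Python's sqlite3 with NOT LIKE standing in for MySQL's NOT REGEXP (SQLite has no built-in REGEXP function); both operators evaluate to NULL when the column is NULL, so the row is dropped by the WHERE clause:

```python
import sqlite3

# In-memory toy table; the table name and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (page_id INTEGER, event_comment TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "created page"), (2, "foo redirect"), (3, None)],
)

# Naive filter: the comparison is NULL for page 3, so that row is dropped.
naive_ids = [
    row[0]
    for row in conn.execute(
        "SELECT page_id FROM events "
        "WHERE event_comment NOT LIKE '%foo%' ORDER BY page_id"
    )
]

# NULL-safe filter: explicitly keep rows without a comment.
null_safe_ids = [
    row[0]
    for row in conn.execute(
        "SELECT page_id FROM events "
        "WHERE event_comment IS NULL OR event_comment NOT LIKE '%foo%' "
        "ORDER BY page_id"
    )
]

print(naive_ids)      # [1]
print(null_safe_ids)  # [1, 3]
```

Adding the `event_comment IS NULL OR ...` guard is one possible way to keep the deleted articles in the result; whether that is the right fix for the original query depends on what the filter is meant to exclude.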
Aug 23 2017
@mforns Patch submitted (linked below), and I added you as a reviewer. First time working with Gerrit, hopefully I got it mostly right! Happy to make changes as need be, fun to learn how to do this. Thanks again!
@mforns : Thanks much for your help with this! I've set up the queries so they return two columns, with the second named after the wiki as you recommended. Also, thanks for the link to the tutorial, it's a lot easier to follow than the technical documentation ([[:wikitech:Analytics/Systems/Dashiki]]). I'd be happy to add a link to the tutorial from that page if that's useful.
Aug 22 2017
Just a heads-up that we've rephrased our hypotheses around patroller workload since the start of the ACTRIAL project, and "number of active patrollers" is now one of our measurements together with a few related ones. Ref hypotheses 9–13 on our project page: https://meta.wikimedia.org/wiki/Research:Autoconfirmed_article_creation_trial . I plan to reuse your query for counting the number of active patrollers, thanks!
I'm working on this and got ReportUpdater working locally. A couple of questions:
Jul 14 2017
I've gathered revision timestamps for all the revisions in the published dataset, and also checked for redirects. Here are some summaries:
Jul 12 2017
Jun 8 2017
Coming back to this I have a bunch of questions, so I'll just ask them and see where we go from there. Apologies if this is counterproductive, feel free to let me know how to improve in future work.
Jun 7 2017
@Mavrikant Excellent! The extractor looks good to go as far as I can tell. Also, happy to hear that you don't have HTML comments in your WikiProject templates, that makes life a lot easier :)
Jun 6 2017
@Mavrikant: thanks for getting code for the trwiki extractor up on https://github.com/Mavrikant/wikiclass/blob/master/wikiclass/extractors/trwiki.py, it makes everything a lot easier!
Apr 17 2017
Jan 4 2017
3,746,600 rows. The file I'm importing is 259 MiB when unzipped.