@Milimetric Will do. I've already made some use of the wmf.mediawiki_history table, but now I see there are plenty of new tables there already.
@Lea_WMDE Yes, I am waiting for you to close this task. BTW I have no idea what needs to happen when you 'enter something for the namespaces'.
@Lea_WMDE Let me check. It's certainly in the data, and it's possible that I was absent-minded enough to update the data set but not the dashboard charts themselves.
@Lea_WMDE No. The file is now in the designated directory on the stat1005 server. I will send you a copy via e-mail.
Tue, Feb 20
- We're clear on how to proceed here; technically it takes some copying and pasting from the existing (new) singlestats to develop the missing ones. This shouldn't take long.
- This will be solved from the EditConflict schema in combination with joins across other relevant SQL tables.
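A purely illustrative sketch of the kind of join meant above - the EventLogging table name and its event_* column are placeholders, not the actual EditConflict schema fields, and dewiki is just an example wiki database:

```
-- Illustrative only: break conflicts down by namespace by joining the
-- EventLogging rows to the MediaWiki page table.
SELECT p.page_namespace,
       COUNT(*) AS conflicts
FROM log.EditConflict_REVISION ec      -- placeholder for the revision-suffixed table
JOIN dewiki.page p
  ON p.page_id = ec.event_pageId       -- assumed field name
GROUP BY p.page_namespace;
```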
Mon, Feb 19
I will be referring to the following three singlestats (in the bottom three rows of the dashboard, in the leftmost column): (1) TwoColConflict page views NS:Main, (2) TwoColConflict resolved conflicts NS:Main, and (3) Percent resolved in NS:Main w. TwoColConflict.
Sat, Feb 17
@Addshore I want to take care of T182011 first, given that T180571 is probably done now. Then I will get back to you in relation to this. Thanks for spotting the event logging related constraints; I'll also have a look at whether the data in Hadoop look any different.
- The Dashboard is fully operational: http://wdcm.wmflabs.org/TW_AdvancedSearchExtension/
- The updates are still being run manually and will be automated once T187606 is resolved.
- Feature correlations tab on the Dashboard is now implemented (screenshot attached);
- Shiny Server problems are currently preventing the Dashboard from going live; I need to inspect this in detail;
- We are still running manual updates for this Dashboard, but as soon as the public data set is approved we will sync Labs and production and switch to automated updates.
- Back-end completed; keeping data for the last three months, hourly resolution;
- crontab set: a new update runs every hour
- public data set review requested in order to migrate the updates from production to CloudVPS (where the Dashboard is hosted): T187606
Fri, Feb 16
@Marostegui The R script that orchestrates Apache Sqoop connects to analytics-store.eqiad.wmnet using my analytics-research-client.cnf credentials from stat1004 - I don't know exactly which server that resolves to.
@Marostegui m = 0, h = 0, dom = 7,14,21,29, mon = *, dow = *, i.e. every 7th, 14th, 21st, and 29th of the month, 00:00 UTC.
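For reference, that is the crontab entry (the script path is just a placeholder):

```
# m h dom        mon dow command
0 0 7,14,21,29 * * /path/to/update_script.sh
```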
@Ladsgroup @Marostegui I have a cron job on stat1004 that Sqoops the wbc_entity_usage tables for all projects into a Hive table for the Wikidata Concepts Monitor pre-processing. The cron job runs on a weekly schedule. Please let me know if you think it would be affected by whatever optimization you plan to do there. It wouldn't cause any serious problems if I needed to drop a weekly update or two, should you predict any interactions. Thanks.
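For context, the Hive-side pre-processing boils down to aggregations along these lines (a sketch only - the Hive table name and the wiki_db column are assumptions; the eu_* columns are the standard wbc_entity_usage fields):

```
-- Sketch: per-wiki usage counts from the sqooped wbc_entity_usage data.
SELECT wiki_db,
       eu_entity_id,
       eu_aspect,
       COUNT(DISTINCT eu_page_id) AS pages_using_entity
FROM wdcm_clients_wb_entity_usage   -- assumed table name
GROUP BY wiki_db, eu_entity_id, eu_aspect;
```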
Thu, Feb 15
The only other thing that confuses me: you write that there were 4773 observations in the data, but with roughly 2000 edit conflicts per day and a month of observation data I would have expected a number about 15 times higher. This can't all be anonymous users, can it?
@Lea_WMDE Thanks for re-opening. It's not just the file type keyword; the back-end is not fully developed yet, either. This will be finished tomorrow.
@Lea_WMDE It would be great if I could join the dev team meeting; could you please send me an invite? Thanks.
Wed, Feb 14
Tue, Feb 13
I thought you could choose whether to include it in a particular schema or not? Thanks anyway; for some reason (hint: contact which MediaWiki API?) I needed that field badly yesterday.
@Addshore So, when the data were fetched from https://meta.wikimedia.org/wiki/Schema:TwoColConflictConflict, max(id) was 4777, and the following ids were skipped: 133, 2332, 2433, 4523. As a control for this result, this morning: max(id) = 4816, while count(id) = 4812. Hope this helps.
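The control check itself is trivial (the table name is a placeholder for the actual revision-suffixed EventLogging table):

```
-- Compare max(id) to count(id); any difference means skipped ids.
SELECT MAX(id)             AS max_id,    -- 4816 this morning
       COUNT(id)           AS n_rows,    -- 4812 this morning
       MAX(id) - COUNT(id) AS n_skipped  -- 4 skipped ids
FROM log.TwoColConflictConflict_REVISION;
```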
Mon, Feb 12
@Addshore Thank you for the webHost field.
The results are shared on Google Drive.
Sun, Feb 11
@Addshore So, let's see: "Track the number of paragraphs that need resolution per edit conflict, split by whether the conflict was resolved or not." - I think that the first part of the question can be answered, while I am not sure about the second ("split") part.
In order to complete this task we must be able to perform the following:
@Addshore If you decide to add some test data, make sure that you add *a lot*. If you can put the data directly into the table - bypassing the event logging - then I can generate the test data, send you the .csv file or whatever you prefer, and then play with it from SQL.
Thu, Feb 8
Wed, Feb 7
@Ladsgroup Thanks, Amir - exactly as I had assumed. @Lydia_Pintscher I wanted to bypass the analysis of the templatelinks tables because I would face the same problem there that I faced in my early attempts to analyze the wbc_entity_usage tables from SQL, while sqooping them first would be a massive operation (it takes 6-7 hours on the analytics cluster for the wbc_entity_usage tables for WDCM). Anyway, this has resolved itself with Amir's pointer to the already processed transclusion data sets from templatelinks. Again, thank you @Ladsgroup
@Lydia_Pintscher Hmm, I had some doubts about whether "S" aspect Wikidata usage is equivalent to transclusion for templates or not. If you need the pages that @Ladsgroup pointed to web-scraped and served or analyzed somehow - let me know. Also, do you happen to know how these transclusion statistics were produced? Thanks.
- We have the WDCM Structure Dashboard now to help us navigate the classes that we are most interested in;
- I have left an option for a user to produce a P31|P279 upward paths graph for any desired Wikidata item; I find that handy;
- Next step: fetching cumulative class item counts. This is, essentially, the most important information for selecting what undergoes analysis in WDCM; most probably the Wikidata Toolkit will be employed to do this, because WDQS could not process some of the operations.
@Lydia_Pintscher Please find enclosed the following:
Tue, Feb 6
Sun, Feb 4
Sat, Feb 3
Fri, Feb 2
Tue, Jan 30
Mon, Jan 29
@Stefan_Schneider_WMDE You're welcome, Stefan.
Fri, Jan 26
@Stefan_Schneider_WMDE My responses are here and also in an e-mail, where you will find a data integrity check table:
@Stefan_Schneider_WMDE Responding in an e-mail, because I am sending you a table with usernames (requires NDA).
@Stefan_Schneider_WMDE Checking this as of now.
Thu, Jan 25
@Stefan_Schneider_WMDE You're welcome. Well, the data didn't change, really - it's just that we are now looking at the relevant subset of the Training Module data set.
@Stefan_Schneider_WMDE Here's an updated report with the correct data on the Training Modules. The conclusions are now completely different, of course.
@Stefan_Schneider_WMDE I think the error is due to my improper handling of the Training Modules data set. I am still working to fix it; I'll be back ASAP.
Wed, Jan 24
Did 8 users start the artikel-bewerten module?
- Edits of new users (in monthly steps after the campaigns and after registration). Q: I understand you need a data breakdown per month and for each campaign separately, but I do not understand the difference between "after the campaigns" and "after registration" - new users typically register during the campaign, don't they?
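For the per-month part, what I would produce is along the lines of this sketch (the wmf.mediawiki_history snapshot and the user names are placeholders):

```
-- Monthly edit counts for a given set of campaign users, from wmf.mediawiki_history.
SELECT substr(event_timestamp, 1, 7) AS month,
       COUNT(*)                      AS edits
FROM wmf.mediawiki_history
WHERE snapshot = '2018-01'
  AND event_entity = 'revision'
  AND event_type = 'create'
  AND event_user_text IN ('ExampleUser1', 'ExampleUser2')
GROUP BY substr(event_timestamp, 1, 7)
ORDER BY month;
```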
Tue, Jan 23
- There are no keyword co-occurrences present in the currently available data, so the Correlations tab will be developed once we gather enough data for it.
@Lea_WMDE The per-wiki statistics are now ready. Reminder: I am still using the old schema (no file_type field) until the new schema is populated with more data.
Jan 23 2018
Jan 22 2018
@bd808 The name of the Horizon project that Adam means is wikidataconcepts. It should not be deleted before the new project, wmde-dashboards, is puppetized and everything transferred there smoothly. Thanks.
Jan 21 2018
@Addshore It does not. It refers to the WDCM puppetization on both the production machines and labs.
Jan 20 2018
Agreed. Let's have one project then, wmde-dashboards, for example, and serve all (non-Wikidata, non-WDCM) Shiny dashboards from there.