
Basic data on Wikidata use: Edit counts, frequency
Closed, Resolved · Public

Description

User Story: As a user researcher, I want to know how our user base is composed so that I can evaluate whether survey and interview data are biased compared to it, and so that I can see basic patterns I might want to explore further.

Information needs:

I imagine the following data to be useful (all can be aggregates):
I imagine this should not be a Grafana board but a report (ideally: RMarkdown or Jupyter Notebook)

  • How is our user base composed in terms of edit count? (e.g. shown by a histogram, y = count, x = bins of edit count ranges)
  • How is our user base composed in terms of recent edits? How many accounts edited n times in the last month (or so)? (e.g. shown by a histogram, y = count, x = bins of edit count ranges for a month)
  • How many of those users participate in discussions? (mosaic plot, y = bins of edit count, x = bins of discussion participation)
  • How many new users do we have each month?
    • And of them, how many edit, and how much?
    • And of them, how many participate in discussions?
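
For illustration only, the binning behind such a histogram could be sketched in Python as follows. The bin edges, labels, and sample counts are assumptions made up for this sketch; they are not taken from the report.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bins of edit count ranges (assumed, not from the report):
# 1, 2-10, 11-100, 101-1000, 1001 and above.
BIN_EDGES = [1, 2, 11, 101, 1001]            # lower bound of each bin
BIN_LABELS = ["1", "2-10", "11-100", "101-1000", "1001+"]


def bin_label(edit_count):
    """Return the label of the bin that an edit count (>= 1) falls into."""
    if edit_count < 1:
        raise ValueError("edit_count must be at least 1")
    return BIN_LABELS[bisect_right(BIN_EDGES, edit_count) - 1]


def histogram(edit_counts):
    """Count how many accounts fall into each bin (x = bin, y = count)."""
    counts = Counter(bin_label(n) for n in edit_counts)
    return {label: counts.get(label, 0) for label in BIN_LABELS}


# Made-up per-account edit counts, one number per account.
example_counts = [1, 1, 3, 7, 12, 150, 2500, 1]
print(histogram(example_counts))
```

The same function would work for the "recent edits" question if it were fed per-account counts restricted to a single month.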

Or, more generally, what is the relation between:

  • Edit count and: Discussion participation (possibly split by item and property discussions vs. other project discussions), adding values, correcting values, adding references, adding qualifiers

Event Timeline

Restricted Application added a subscriber: Aklapper. Oct 4 2018, 12:30 PM
Jan_Dittrich renamed this task from Basic data on Wikidata use: Edit counts, frequency, language(?) to Basic data on Wikidata use: Edit counts, frequency. Oct 4 2018, 12:32 PM
Addshore moved this task from incoming to in progress on the Wikidata board. Oct 9 2018, 9:34 AM

@Jan_Dittrich Please find your Report attached. Pinging @Lydia_Pintscher who might also be interested to take a look at the results.

@Jan_Dittrich Regarding the following request:

Edit count and: Discussion participation (possibly split by item and property discussions vs. other project discussions), adding values, correcting values, adding references, adding qualifiers

I can provide breakdowns per namespace if you like, but I am not sure where to go for data on "... adding values, correcting values, adding references, adding qualifiers"?

Please provide your feedback once you have read through this report. Thank you.

@GoranSMilovanovic thanks!

Some questions for understanding it right:

– 1
– 1.1. "Q1.1 Checking for power-law behavior" has two log-scaled axes, if I read it correctly. I do not get what the numerical labels on the y axis mean – is this the number of users, but after log transformation, so 10000 users becomes 9.21… on the log scale? The log is a natural logarithm, correct? (2.718281828459…)


– 1.2. Q1.2 and Q1.3 show the same data as Q1.1, but in natural numbers without transformations.

– 1.3. The diagrams say "We have many accounts that edited a few times and a small number of accounts which edit(ed) a lot"
– 1.4. To clarify, this is how many revisions a specific account has created until the data was acquired? (I call "revisions" "edits", it seems then)
– 2.1 Q2.x are like 1.x, but only for September 2018.
– 2.2 The diagrams say "recent edit counts by user follow the same pattern (log) as the general edit count distribution"
– 3. Q3 The diagrams are indeed tricky to read. I would read the crosstabular one like a scatterplot; however, in this case, it would need jitter, wouldn't it? Maybe a "heatmap"-like approach might also be OK: a 2-D bin would be darker if more edit/discussion counts fall into it.
– 4. Q4.2 is again a diagram I am unable to read well. Could it be that the labels are off? X seems to show dates, but y seems to show dates (year-month), too, but log scaled? So I could not make sense of it.

@Jan_Dittrich Thank you for your comments. I will provide all necessary explanations later in the evening.

@Jan_Dittrich Here we go:

– 1.1. "Q1.1 Checking for power-law behavior" has two log-scaled axes, if I read it correctly. I do not get what the numerical labels on the y axis mean – is this the number of users, but after log transformation, so 10000 users becomes 9.21… on the log scale? The log is a natural logarithm, correct? (2.718281828459…)

Correct. Both the number of revisions (x-axis) and the number of users who made the respective number of revisions (y-axis) were log-transformed (it's a natural log indeed, base = e), and then plotted against each other.
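
A minimal Python sketch of the transformation described above, for readers who want to reproduce the axis values. The function name and the sample data are assumptions for illustration; this is not the code used for the report.

```python
import math
from collections import Counter


def log_log_points(revisions_per_user):
    """Turn per-user revision counts (each >= 1) into log-log plot points.

    Each point is (ln(number of revisions), ln(number of users who made
    exactly that many revisions)).
    """
    users_per_count = Counter(revisions_per_user)
    return [
        (math.log(revisions), math.log(users))
        for revisions, users in sorted(users_per_count.items())
    ]


# Made-up data: three users with 1 revision, two with 2, one with 5.
points = log_log_points([1, 1, 1, 2, 2, 5])

# The natural log of 10000 users is about 9.21, which matches the
# y-axis label asked about in question 1.1.
print(round(math.log(10_000), 2))
```

On such a plot, a roughly straight, downward-sloping line is what one would expect to see if the distribution behaves like a power law.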

– 1.2. Q1.2 and Q1.3 show the same data as Q1.1, but in natural numbers without transformations.

Yes. Except that Q1.3 is misnamed: it should be titled Q1.3 Histogram: Distribution of the number of users across revisions. But yes, both the Q1.2 and Q1.3 sections deal with the same data as Q1.1, except that they operate on the natural (count) scale, not on log-transforms.

– 1.3. The diagrams say "We have many accounts that edited a few times and a small number of accounts which edit(ed) a lot"

Exactly.

– 1.4. To clarify, this is how many revisions a specific account has created until the data was acquired? (I call "revisions" "edits", it seems then)

That is true. Q1.3 is the distribution of the number of users who have made a particular number of revisions (edits, if you prefer), up to the moment of data acquisition for this analysis.

– 2.1 Q2.x are like 1.x, but only for September 2018.

Yes. Initially, you asked for an overview of the last 30 days or so. However, the wmf.mediawiki_history Hadoop table, where the data for this analysis is found, is partitioned by snapshot, where a snapshot represents all the data up to a particular month (e.g. everything since the beginning of history up to August 2018, September 2018, October 2018, etc.). At the time I ran the analysis, the September snapshot was the freshest one available. So you don't see any of the October 2018 data here, for example. We will not be able to do October 2018 before November 2018; it takes some time for that table to update.
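
To make the "last month" window concrete, here is a small Python sketch that restricts edit timestamps to the final month covered by a snapshot. It assumes snapshots are labelled "YYYY-MM" and works on an in-memory list of made-up timestamps; it does not query the Hadoop table, and the function names are hypothetical.

```python
from datetime import datetime


def snapshot_month_window(snapshot):
    """Return (start, end) of the last month covered by a 'YYYY-MM' snapshot.

    The window is half-open: start <= timestamp < end.
    """
    start = datetime.strptime(snapshot, "%Y-%m")
    if start.month == 12:
        end = datetime(start.year + 1, 1, 1)
    else:
        end = datetime(start.year, start.month + 1, 1)
    return start, end


def edits_in_snapshot_month(edit_timestamps, snapshot):
    """Keep only the edits made during the snapshot's final month."""
    start, end = snapshot_month_window(snapshot)
    return [ts for ts in edit_timestamps if start <= ts < end]


# Made-up timestamps: one in August, two in September, one in October 2018.
sample = [
    datetime(2018, 8, 31, 23, 59),
    datetime(2018, 9, 1, 0, 0),
    datetime(2018, 9, 30, 12, 0),
    datetime(2018, 10, 1, 0, 0),
]
recent = edits_in_snapshot_month(sample, "2018-09")
```

With the September snapshot, only the two September timestamps survive the filter, which mirrors why no October 2018 data appears in the Q2 sections.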

– 2.2 The diagrams say "recent edit counts by user follow the same pattern (log) as the general edit count distribution"

Q2.2 does not use any log scaling (nor does the corresponding Q1.2), but yes, it tells us that in qualitative terms the distribution of the number of users who have made a particular number of edits is no different from what we saw when we analyzed all of the data in the Q1 sections. One could also compare Q2.3 to Q1.3 and reach the same judgment by eyeballing alone. I am not very enthusiastic about going into strict statistical, model-based comparisons between the two distributions here, but if you really want to make sure whether the case is true as stated here... let me know.

– 3. Q3 The diagrams are indeed tricky to read. I would read the crosstabular one like a scatterplot; however, in this case, it would need jitter, wouldn't it? Maybe a "heatmap"-like approach might also be OK: a 2-D bin would be darker if more edit/discussion counts fall into it.

As you have already observed, it gave me a bit of a headache too. Please allow me some time to figure out the most informative approach to this visualization, and as soon as I have it you will have it too. Thank you.
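
As a rough sketch of the 2-D binning idea suggested in question 3, the following Python counts how many users fall into each (edit bin, discussion bin) cell; a plotting layer could then shade each cell by its count. Binning by order of magnitude is an assumption of this sketch, not necessarily the approach taken in the report.

```python
from collections import Counter


def magnitude_bin(count):
    """Bin a non-negative count by order of magnitude: 0, 1-9, 10-99, ..."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return "0"
    digits = len(str(count))
    return f"{10 ** (digits - 1)}-{10 ** digits - 1}"


def heatmap_cells(users):
    """Count users per 2-D bin.

    `users` is an iterable of (edits, discussion_edits) pairs, one per user.
    The result maps (edit bin, discussion bin) to the number of users in it.
    """
    return Counter(
        (magnitude_bin(edits), magnitude_bin(discussions))
        for edits, discussions in users
    )


# Made-up users: (edits, discussion edits).
cells = heatmap_cells([(5, 0), (7, 0), (12, 1), (300, 40)])
```

Because every user lands in exactly one cell, overlapping points are summed rather than hidden, which is the problem jitter would otherwise have to work around.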

– 4. Q4.2 is again a diagram I am unable to read well. Could it be that the labels are off? X seems to show dates, but y seems to show dates (year-month), too, but log scaled? So I could not make sense of it.

Oh, oh, oh: sorry. The axes are off: x represents Month-Year, y represents log(number of Revisions). I will correct this as soon as I am done with the tricky Q3 diagrams.

Please let me know if you have any additional questions. Thank you for your feedback!

2.1 Q2.x are like 1.x, but only for September 2018.

Yes. Initially, you asked for an overview of the last 30 days or so. However, the wmf.mediawiki_history Hadoop table, where the data for this analysis is found, …

That’s fine. I only wanted to check whether edit patterns have changed dramatically in recent years. The actual month, and exactly how many days it covers, is not that crucial as long as the period is recent-ish and long enough to smooth out possible extremes, so all is fine.

Q3…

As you have already observed, it gave me a bit of a headache too. Please allow me some time to figure out the most informative approach to this visualization, and as soon as I have it you will have it too. Thank you.

Great.

The axes are off: x represents Month-Year, y represents log(number of Revisions). I will correct this as soon as I am done with the tricky Q3 diagrams.

Thanks! I think it is useful, but I get the scaling issues… could you use a base-10 (instead of base-e) log scale for this one? I think it would be easier to understand, and we could label the axis 1-10-100… (so the "actual" counts).

Last but not least and for some context:

I would like to create a little poster based on these, so I am trying to make it graspable for "everyone". I think we are well on the way there. This explains why I would go with natural frequencies and labels as often as possible, with linear scaling where possible, and otherwise with a base-10 rather than base-e log scale.

GoranSMilovanovic added a comment. Edited Oct 23 2018, 11:56 PM

@Jan_Dittrich Here we go:

  • please let me know whether the alternative visualization of edits vs. discussions in Q3.1 works for you;
  • Q4.2 is now fixed: log10() used instead of the natural logarithm; data points are labeled in order to preserve the absolute values; the y-axis simply must be scaled because of the great disproportion between the number of edits (i.e. revisions on main namespaces) and the number of discussions (i.e. revisions on talk pages).
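
The labeling idea behind the Q4.2 fix (log10 positions, "actual" counts as labels) can be sketched in Python as below. This is an illustrative helper with a hypothetical name, not the R code used for the report.

```python
def log10_ticks(max_count):
    """Return (position, label) pairs for a base-10 log axis.

    `position` is where the tick sits on the log10 scale (0, 1, 2, ...);
    `label` is the untransformed count shown to the reader (1, 10, 100, ...).
    Ticks run up to the first power of ten that is >= max_count.
    """
    if max_count < 1:
        raise ValueError("max_count must be at least 1")
    top_exponent = 0
    while 10 ** top_exponent < max_count:
        top_exponent += 1
    return [(float(exp), str(10 ** exp)) for exp in range(top_exponent + 1)]


# For a series whose largest value is 2500, the axis would carry the
# labels 1, 10, 100, 1000 and 10000.
ticks = log10_ticks(2500)
```

Keeping positions and labels separate lets the two series (edits and discussions) share one scaled axis while the reader still sees plain counts.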

I would like to create a little poster based on these, so I try to make it graspable for "everyone".

If you want to print that poster someday, let me know so that I can produce 300 dpi printable graphics from R for you.

Your feedback, please. Thank you!

Thanks for the progress so far. I'd like to see the tool expanded to cover usage of Wikidata outside namespace 0 (including on Wikidata itself) and in SPARQL queries/ API calls.

GoranSMilovanovic added a comment. Edited Oct 24 2018, 12:18 AM

@Daniel_Mietchen My bad: I gave you the wrong Phab ticket for this. Sorry. Please: https://phabricator.wikimedia.org/T193969

Also, there is the WDCM system that tracks Wikidata Usage. The WDCM Usage dashboard, specifically, keeps track of the WD usage in all namespaces and all projects. All WDCM dashboards will be re-designed in the following month or so.

@Jan_Dittrich I did not attach this file in relation to T206214#4690469, did I?

This comment was removed by GoranSMilovanovic.

@Jan_Dittrich Any feedback in relation to this? Shall I close the task as resolved or do you have any further requests in relation to this dataset?

@Jan_Dittrich Jan, please: do we need this ticket anymore? Thanks!

Jan_Dittrich closed this task as Resolved. Nov 19 2018, 8:16 AM

@GoranSMilovanovic no, I'll close it. Thanks btw, this was very useful.