
User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey
Closed, Resolved · Public

Description

Context: In the 2021 Wikidata Community survey we had more people stating they started in 2021 and 2020 than, let's say, in 2017. A plausible explanation is that people stop editing over time and that the likelihood that they stop editing is related to the time they already were editing. (Similar to the Lindy Effect)

image.png (846×856 px, 26 KB)

Question: What is the relation between past length of participation in the Wikidata community and the likelihood to stop participating?

Operationalization:

  • A user is active in a given month when they have made at least 5 edits in that month
  • The time a user has been active is their "streak" of consecutive months in which they are active
  • We consider that a user has left the project if they stop editing and have so far never become an active editor (5+ edits/month) again.

Additional Questions:

  • How high is the likelihood to become an active editor again?
  • For people who become active editors again, it would be interesting to understand the patterns: Do they leave for a year and start again (like parents taking a baby break)? Do they stop for a month and then continue (maybe they were sick)? Etc. Different thresholds were proposed in this paper; an extensive analysis of inter-activity time is published here

Event Timeline

Jan_Dittrich renamed this task from User Retention Wikidata: Exploring the resons for patterns in the 2021 Wikidata Community Survey to User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey.May 11 2021, 2:23 PM
Jan_Dittrich updated the task description. (Show Details)

@Jan_Dittrich

Please find the analytics dataset attached.

Columns:

  • userId: the anonymized Wikidata user Id
  • registrationYM: the YYYY-MM timestamp of user registration on Wikidata
  • revisionYM: the YYYY-MM timestamp of user revisions on Wikidata
  • revisions: the count of revisions made in the revisionYM month.

Next steps:

  • We will proceed to test the Lindy effect for Wikidata by (a) calculating the difference in months between user registration and each revision month, and (b) searching for pauses in editing behavior.
  • All hypotheses/research questions will be addressed using the time lags derived between user revisions and registrations.
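Step (a) above reduces to simple arithmetic on the YYYY-MM strings; a minimal sketch in Python (the function name is illustrative, not from the actual analysis code, which was written in R):

```python
def months_between(start_ym: str, end_ym: str) -> int:
    """Difference in whole months between two YYYY-MM timestamps,
    e.g. registrationYM and revisionYM from the dataset above."""
    start_year, start_month = map(int, start_ym.split("-"))
    end_year, end_month = map(int, end_ym.split("-"))
    return (end_year - start_year) * 12 + (end_month - start_month)

months_between("2019-11", "2021-02")  # -> 15
```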

Notes.

  • Bot and anonymous revisions were filtered out.
  • Only item (0), property (120), and lexeme (146) namespaces are taken into account.

The ETL was performed in HiveQL from wmf.mediawiki_history's current snapshot (and that would be 2021-04). Here's the query:

USE wmf;
SELECT
  event_user_id, event_user_registration_timestamp,
  substring(event_timestamp, 1, 4) AS year,
  substring(event_timestamp, 6, 2) AS month,
  COUNT(*) AS revisions
FROM mediawiki_history
WHERE (
  event_entity = 'revision' AND
  event_type = 'create' AND
  wiki_db = 'wikidatawiki' AND
  event_user_is_anonymous = FALSE AND
  NOT ARRAY_CONTAINS(event_user_is_bot_by, 'name') AND
  NOT ARRAY_CONTAINS(event_user_is_bot_by, 'group') AND
  NOT ARRAY_CONTAINS(event_user_is_bot_by_historical, 'name') AND
  NOT ARRAY_CONTAINS(event_user_is_bot_by_historical, 'group') AND
  NOT ARRAY_CONTAINS(event_user_groups, 'bot') AND
  NOT ARRAY_CONTAINS(event_user_groups_historical, 'bot') AND
  event_user_id != 0 AND
  page_is_redirect = FALSE AND
  revision_is_deleted_by_page_deletion = FALSE AND
  (page_namespace = 1 OR page_namespace = 120 OR page_namespace = 146) AND
  snapshot = '2021-04'
)
GROUP BY
  event_user_id,
  event_user_registration_timestamp,
  substring(event_timestamp, 1, 4),
  substring(event_timestamp, 6, 2);

@Jan_Dittrich

To answer the following simple question:

How high is the likelihood to become an active editor again?

please find a dataset attached: userId is a fake (but unique) Wikidata user ID, and reactivationsN is the number of times the respective user started editing again following a period of inactivity.

The probability of becoming an editor again after a period of inactivity - considering whether that ever happened for a particular user, not how many times - is 0.104778 (approx. 10.5%). We are considering 11,825 users in this analysis (see the ETL step in T282563#7094294 to understand what was filtered out).

The distribution of the number of "comebacks" is the following one (first row: how many reactivations, second row: how many users did that):

0       1     2     3      4     5     6     7      8     9 
10586   720   224   136    79    36    22    15     5     2

Now the Lindy Effect.

This line can be removed,

event_user_id != 0

Anonymous users are already filtered out with event_user_is_anonymous = FALSE, and anyway event_user_id is set to null rather than 0 for anonymous (or revision-deleted) users.

I think the bot tests can be simplified to NOT ARRAY_CONTAINS(event_user_groups, 'bot'); why would we need to check the historical column? If the user was classified as a bot at some time but now is not, shouldn't we respect the updated classification? And the specific tests against the event_user_is_bot_by columns seem redundant.

The text says your filter will include the Item namespace (0), but the query only includes the talk pages: page_namespace = 1. Maybe this explains why there's such a low user count? I would have expected to see virtually all non-bot users who have edited wikidata.

Can you share more about the query that produced reactivations.csv? I can't tell from the information provided what counts as a "period of inactivity".

@awight

This line can be removed,
event_user_id != 0

Indeed.

why would we need to check the historical column? If the user was classified as a bot at a time but now is not, shouldn't we respect the updated classification?

Because in the data collection we want to be conservative and make sure that if we talk about human and not bot editors we certainly talk about human and not bot editors. To put it simply: it reduces uncertainty in our data.

And the specific tests against the event_user_is_bot_by columns seem to be redundant.

Better safe than sorry.

The text says your filter will include the Item namespace (0), but the query only includes the talk pages: page_namespace = 1. Maybe this explains why there's such a low user count? I would have expected to see virtually all non-bot users who have edited wikidata.

Good catch, thank you! I will re-run the ETL now.

Can you share more about the query that produced reactivations.csv? I can't tell from the information provided what counts as a "period of inactivity".

R code. If you still would like me to share it with you I will open a Gerrit repo for this ticket.

Can you share more about the query that produced reactivations.csv? I can't tell from the information provided what counts as a "period of inactivity".

R code. If you still would like me to share it with you I will open a Gerrit repo for this ticket.

A Gerrit repo could take weeks, and is probably overkill in this case. If it's a single file, pasting is a good option, or you could create a GitLab repo if it's several files. I am curious, but please only bother posting the files if you'd like review. Most of all, I was wondering whether one month below the active threshold is long enough to count as a period of inactivity, and whether multiple reactivations are counted for a user if they oscillate between inactive and active.

A Gerrit repo could take weeks

I have no opinion on whether it would be overkill or not, but for what it's worth, I can create Gerrit repos when needed, and hopefully can do it within the same working day as requested.

@WMDE-leszek Thank you. Don't worry, I will request the repo: we need one for this kind of one-shot task anyway.

@awight @Jan_Dittrich

The following is based on 395,680 Wikidata editors and incorporates the corrections suggested by @awight in T282563#7104580:

  • The probability of editor reactivation is 0.08548827;
  • Reactivation is defined in the following way: we look at consecutive months since user registration and mark active months (>= 5 edits) as 1 and non-active months as 0, so the whole user revision history becomes a string e.g. 010001111100101...
  • We search for a regex pattern 1+0+1+ in the revision histories; each match is recognized as a reactivation.
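The match counting described above can be sketched in Python (the actual analysis was done in R; note that re.findall returns non-overlapping matches, which is how the counting is described here):

```python
import re

def count_reactivations(history: str) -> int:
    """Count reactivations as matches of the pattern 1+0+1+ in a
    user's revision history string (1 = active month, 0 = inactive)."""
    return len(re.findall(r"1+0+1+", history))

count_reactivations("010001111100101")  # -> 2
count_reactivations("000111")           # -> 0 (never inactive after activity)
```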

Here is the distribution of the number of reactivations:

     0      1      2      3      4      5      6      7      8      9     10     11     12 
361854  18324   6660   3580   2194   1363    869    447    236    101     37     12      3

Obviously, a vast majority of editors never reactivate following one month of inactivity, implying that user retention in Wikidata is indeed a serious problem.

Again, only item, property, and lexeme namespaces are considered.

@awight Thanks again for T282563#7104580!

@Jan_Dittrich

Question: What is the relation between past length of participation in the Wikidata community and the likelihood to stop participating?

@Jan_Dittrich I had to impose some definitions in order to be able to precisely formulate the effect that we are looking for. Please feel free to suggest any changes of the following methodological approach.

  • Definitions:
    • Q1. When do we say that a user has left Wikidata?
    • A1. (1) We look at the progression of months since user registration, each month coded as active (1) or inactive (0); (2) we look at all sequences of inactive months in a particular user's revision history and find the longest one; (3) if the respective user's revision history ends in a sequence of inactive months, we compare the length of that latest period of inactivity with the previously longest period of inactivity; (4) if the former is longer than the latter, we declare that the user has left Wikidata.
    • Q2. How do we measure how long the user was active before leaving Wikidata, given that any user can have interspersed periods of activity and inactivity?
    • A2. We count all active months for a user before it was declared that the user stopped editing. Motivation: that is how we align (a) a user who has edited for six months, *each month*, and left, i.e. 111111000..., with (b) a user who has edited here and there over six months and then left, e.g. 101001000....
    • Q3. What about the users who are still active (in the sense of not meeting the definition given in A1)?
    • A3. We say that their measure of the length of activity is simply the count of their active months; thus the measures in A3 and A2 are comparable.
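The A1 definition can be sketched as follows (an illustrative Python rendering of the rule, not the actual R code; the function name is hypothetical):

```python
import re

def has_left(history: str) -> bool:
    """A1: the user has left if their revision history ends in a run
    of inactive months that is strictly longer than any earlier run
    of inactive months (1 = active month, 0 = inactive month)."""
    if not history.endswith("0"):
        return False  # history ends in an active month
    runs = re.findall(r"0+", history)
    trailing = len(runs[-1])            # the current period of inactivity
    earlier = [len(r) for r in runs[:-1]]
    return trailing > max(earlier, default=0)

has_left("1001000")  # -> True  (trailing run of 3 > earlier run of 2)
has_left("100100")   # -> False (trailing run of 2 is not strictly longer)
```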

N.B. All registered users that have never edited at all are filtered out in this analysis.

Here is how it looks:

TotalActive_ProbabilityLeave.png (453×684 px, 34 KB)

Interestingly, this looks more like a negation of the Lindy Effect. It seems that with more active months in the past, the probability of an editor leaving Wikidata slightly increases. Only after very prolonged periods of activity does the Lindy Effect seem to begin to hold (though with higher uncertainty/variance in the data points for prolonged periods of activity), but with our data it is problematic to assert whether that is definitely true or not.

Again, please take a look at my definitions above and suggest if anything needs to be changed.

@Jan_Dittrich This might also help, a larger version of the chart in T282563#7107757 with the number of editors on each level of activity (x-axis) included. Of course we get to observe fewer editors with prolonged periods of activity, that is simply the nature of the data here. But it also suggests that we should be careful in drawing any conclusions on whether the Lindy Effect holds or not following prolonged periods of Wikidata editing. Please take a look and let me know what you think.

TotalActive_ProbabilityLeave_Large.png (679×1 px, 76 KB)

@Jan_Dittrich This is also interesting: the higher the number of reactivations in editing behavior, the higher the probability of leaving Wikidata.

NumReactivations_ProbabilityLeave.png (484×513 px, 27 KB)

I am beginning to think that we might need to readjust the definition of when we consider an editor to have left Wikidata as I have proposed it. What do you think? Any ideas?

A (scientifically) conservative idea might be:

  • we say that a user is active only if they are now found in a sequence of active months (e.g. ...11111); otherwise we say that they have left;
  • however, given that many users have edit/pause/edit... sequences... I am not sure, because this might turn out to be too strict.

@Jan_Dittrich

For people who become active editors again, it would be interesting to understand the patterns: Do they leave for a year and start again? (Like parents taking a baby break) Do they stop for a month and continue? (maybe they were sick or so) etc. Different thresholds were proposed in this paper, an extensive analysis of inter-activity time is published here

I guess the first step - before introducing any hypotheses on semantics ("parents taking a baby break", "maybe they were sick or so", etc) - is to take a look at the distributions of the length of sequences of active and inactive months.

Here's the chart (NOTE: the log(y) scale is used!):

Activity_Inactivity_SeqLength.png (679×1 px, 136 KB)

And there is already something interesting to observe: see how (a) on shorter activity/inactivity sequences (x-axis, represents the length of 000... or 111...) we observe more inactivity periods than activity periods, while (b) the situation switches in favor of activity periods as the length of the sequences increases:

  • The longer the observed sequence, the more likely it is to be a sequence of active rather than inactive months;
  • vice versa, the shorter the observed sequence, the more likely it is to be a sequence of inactive rather than active months.
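The sequence lengths behind this chart can be derived with a run-length encoding of the history strings; a sketch (illustrative Python, not the actual analysis code):

```python
from itertools import groupby

def run_lengths(history: str):
    """Split a 0/1 revision history into runs and return the lengths
    of active (1...) and inactive (0...) sequences separately."""
    active, inactive = [], []
    for symbol, run in groupby(history):
        (active if symbol == "1" else inactive).append(len(list(run)))
    return active, inactive

run_lengths("0011100010")  # -> ([3, 1], [2, 3, 1])
```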

This is looking great! I mean, it's a discouraging phenomenon but a promising analysis :-)

I have a question about how to interpret the "0" reactivations category, which is stated above to be "a vast majority of editors [who] never reactivate following one month of inactivity". If the zero reactivations category includes anyone for whom the 1+0+1+ regex doesn't match, doesn't this also include active editors who simply have a history like, 000111..., and who are still active?

Also, I have a vague and anecdotal memory that Wikidata has a lot of users who are semi- or fully-automated bots but are not registered as such. Is this the case? If so, is there some other heuristic like overly rapid editing that we can use to filter out these users or analyze them separately?

@awight

If the zero reactivations category includes anyone for whom the 1+0+1+ regex doesn't match, doesn't this also include active editors who simply have a history like, 000111..., and who are still active?

Bravo... Will take a look at it and correct the analysis if necessary. Thanks again!

Also, I have a vague and anecdotal memory that Wikidata has a lot of users who are semi- or fully-automated bots but are not registered as such. Is this the case? If so, is there some other heuristic like overly rapid editing that we can use to filter out these users or analyze them separately?

It is easier said than done but I also have a vague memory that once I did something similar for the purposes of some analysis. Let me think for a while and try to remember what exactly was done to account for such semi-human-semi-bot editors.

Also, I have a vague and anecdotal memory that Wikidata has a lot of users who are semi- or fully-automated bots but are not registered as such. Is this the case? If so, is there some other heuristic like overly rapid editing that we can use to filter out these users or analyze them separately?

It is easier said than done but I also have a vague memory that once I did something similar for the purposes of some analysis. Let me think for a while and try to remember what exactly was done to account for such semi-human-semi-bot editors.

Good point, I guess @Manuel has also some ideas on this.

I am beginning to think that we might need to readjust the definition of when we consider an editor to have left Wikidata as I have proposed it. What do you think? Any ideas?

Would probably make sense to adjust it, but I have currently no good idea how. I'll keep thinking.

@Jan_Dittrich @awight

  • I need to re-adjust the regular expression for editor reactivations as suggested by Adam in T282563#7110253 now

I am beginning to think that we might need to readjust the definition of when we consider an editor to have left Wikidata as I have proposed it. What do you think? Any ideas?

Would probably make sense to adjust it, but I have currently no good idea how. I'll keep thinking.

There is a range of criteria that we can test and then see which one fits our needs best (i.e. what exactly do we want to learn and understand).
Also, I still did not find the time to go through the papers that you have provided - maybe someone has already figured out some good, working criteria.
This is a first shot at the data and analysis that we are discussing here. I don't see the questions posed in this ticket as simple in any sense. Maybe a concise meeting/brainstorming on this would do us good? Let me know.

From: "How Long Do Wikipedia Editors Keep Active?"

…specifically, we consider an editor to be "dead" or inactive if he did not make any edit for a certain period of time. Here we set the threshold of inactivity to be 5 months, since it reflects WMF's concern as demonstrated in the recent Wikipedia Participation Challenge

Not sure if that is ideal, but it is certainly simpler.

@Jan_Dittrich Happening now:

Reporting back as soon as I have something.

@Jan_Dittrich

Please disregard all previous findings. The following is based on:

  • the definition of editor inactivity in T282563#7124389,
  • and the two important corrections in the analytics code;
    • one to guard against what @awight has observed on regex in T282563#7110253,
    • and the other - even more important one - that I had to introduce to fix a fatal flaw in the existing analysis (having to do with considering only months with > 0 edits: my incorrect assumption about the structure of the ETL result set).

We consider only editors that were active at some point in time.

How high is the likelihood to become an active editor again?

Here we consider those users who have had at least one period of inactivity (5 months w/o edits): the probability of editor reactivation is 0.208437:

  • around 21% of users who were active at least at some point in time and had at least one period of inactivity became active editors again at some point in time.

The estimate of the probability of reactivation is based on 358,823 editors. Out of these 358,823 editors, 338,058 (~94%) had eventually left Wikidata, while 20,765 (~6%) are still with us.
The definition of "has left Wikidata" used here is the following: the editor currently has five or more months of inactivity.
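Under these revised definitions, both quantities reduce to simple pattern checks on the 0/1 history strings; an illustrative sketch (function names are hypothetical, not from the actual R code):

```python
import re

def left_wikidata(history: str) -> bool:
    """Revised convention: the editor has left if their revision
    history currently ends in five or more inactive months."""
    return bool(re.search(r"0{5,}$", history))

def had_reactivation(history: str) -> bool:
    """The editor became active again after at least one period of
    inactivity, i.e. five or more consecutive inactive months."""
    return bool(re.search(r"10{5,}1", history))
```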

What is the relation between past length of participation in the Wikidata community and the likelihood to stop participating?

Again, we define the "past length of participation" as the total number of active months since registration, and "to stop participating" as "has left Wikidata" (see above): the editor currently has five or more months of inactivity.

@Jan_Dittrich And here goes a beautiful illustration of the Lindy Effect for Wikidata editing (the data labels refer to how many editors are represented):

WikidataUserRetention_NEW_LindyEffect.png (726×1 px, 36 KB)

@Jan_Dittrich Following our 20210630 discussion:

Additional questions

  • for those ~ 6% who are still with us: can we find any interesting patterns
    • the distribution of the length of their periods of inactivity
    • the distribution of their usage counts
    • user behavior on talk pages
  • compare the ~ 6% retained group vs. others

I really like where this is going.

Maybe also look for patterns in the 94% who have dropped off, for example any variables that negatively correlate with longevity.

Seems also relevant: https://eprints.whiterose.ac.uk/140352/1/evolution-wikidata-editors.pdf "The evolution of power and standard Wikidata editors: comparing editing behavior over time to predict lifespan and volume of edits"

  • Re-work on a fresh dataset (the 2021-06 snapshot of the wmf.mediawiki_history table) is underway;
  • Reporting: until tonight (hopefully);
  • @Jan_Dittrich I will be getting in touch via e-mail about the research/paper part later during the day.

@Jan_Dittrich @awight

In reference to T282563#7186386 and T282563#7226336:

  • I have used a fresh dataset, relying on the 2021-06 snapshot of the wmf.mediawiki_history table;
  • the results are fully replicated (in a qualitative sense, of course);
  • I have also filtered out all editors who have fewer than six (6) months of presence on Wikidata, simply because they never really had a chance to leave (where "left Wikidata" is defined as five (5) months of inactivity).

The Lindy Effect

I have used several different operational definitions of the "length of past activity" to illustrate the Lindy Effect in Wikidata editing.

A. The total number of active months in editor's revision history

So, an editor can be active and inactive now and then; this measure of "length of past activity" is defined as the count of months in which an editor was active over the whole course of their presence on Wikidata since registration.
The vertical axis represents the probability to leave Wikidata given the count of active months.

01_LindyA.png (674×865 px, 61 KB)

B. The probability of an active month

The previous measure could be criticized on the grounds that it is not the same if (a) someone has ten active months having registered a year ago versus (b) someone has ten active months having registered three years ago. I have turned the absolute counts of active months per editor into proportions of their total stay on Wikidata since registration (effectively calculating the probability of any given month in the editor's revision history being an active month).
The horizontal axis is the probability of having an active month in the course of one's revision history, binned into 100 intervals. The vertical axis represents the probability of leaving Wikidata given that proportion.

02_LindyA.png (616×1 px, 91 KB)

C. The age of the account
This is simple yet probably inconclusive with respect to the Lindy Effect itself: how old is the editor's account vs. what is the probability that they have left Wikidata (i.e. are now inactive for at least five months)?

03_LindyA.png (555×786 px, 57 KB)

The distribution of the number of revisions vs. having left Wikidata or not
The horizontal axis represents the log of the number of revisions, while the vertical axis is probability density. Obviously, those who are still with us are those who have made more edits so far - as expected.

04_RevisionsVSLeftWikidata.png (559×1 px, 27 KB)

Here are the descriptive statistics on revisions:

Left Wikidata:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
1       1       2     203       7 5891740

Active on Wikidata:

Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
2       19      108    15268      720 31003903

The distribution of the length of inactivity periods vs. having left Wikidata or not
A single editor can have several periods of inactivity of varying length in months. I have analyzed the distributions of both the mean and the median length of inactivity periods per user, grouped according to whether they are still editing or not.

Mean length of inactivity periods first:

05_MeanLengthInactiveVSLeftWikidata.png (450×1 px, 26 KB)

Obviously, the editors who are still active typically have much less prolonged sequences of inactive months.

The descriptive statistics on mean length of inactivity periods:

Left Wikidata:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
1.429  14.500  30.000  37.185  56.000 105.000

Active on Wikidata:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   1.875   3.000   4.942   5.600  77.000      88

N.B. NA's represent those editors who did not have a single inactive month in their revision history.

And now for the median length of inactivity periods:

06_MedianLengthInactiveVSLeftWikidata.png (351×949 px, 23 KB)

The descriptive statistics:

Left Wikidata:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
1.00   13.00   30.00   36.52   56.00  105.00

Active on Wikidata:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
1.000   1.000   2.000   3.609   4.000  77.000      88

N.B. NA's represent those editors who did not have a single inactive month in their revision history.

My present conclusions:

  • The Lindy Effect holds in Wikidata editing: the lengthier the past editing behavior, the higher the chances that it will persist;
  • As expected, currently active Wikidata editors made more revisions in the past in comparison to those who are now inactive;
  • Currently active Wikidata editors have, on average, less prolonged periods of inactivity (measured in months) relative to those who are now inactive.

What is missing from this analysis?

This is missing from T282563#7186386:

... user behavior on talk pages

because it takes another ETL run through the wmf.mediawiki_history table; I will try to produce that dataset tonight, join it with the existing data, and report upon it. Sincerely: I do not expect any finding to emerge other than that active editors make more revisions on talk pages.

@Jan_Dittrich I did not find enough time to focus on all the papers that you have shared (and for which I am thankful). I will focus on them tonight, as much as I can (there are other tickets calling for my attention too), and then get in touch on our idea to publish this finding. Thank you for the very inspirational question that you have raised here in relation to the Lindy Effect!

@Jan_Dittrich @awight

Finally, as of

... user behavior on talk pages

07_RevisionTalkNamespacesVSLeftWikidata.png (706×1 px, 35 KB)

but please take into consideration that the distributions are somewhat misleading, since 368,410 (out of 383,045 in total) editors considered have never made a single edit in any talk namespace on Wikidata.

In fact, revisions in talk namespaces matter a lot with respect to whether the editor is currently found to be active or not:

Left Wikidata

Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000    0.0000    0.0000    0.2381    0.0000 2258.0000

Active on Wikidata

Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.00     0.00     0.00    10.13     0.00 11943.00

We can see that those who are still active make an order of magnitude more edits in talk namespaces than those who have left (given the current definition of "left Wikidata", of course).

@Jan_Dittrich Do we really find a Lindy effect in the Wikidata account age distribution?

Assumption. As demonstrated in Eliazar, Iddo (November 2017). "Lindy's Law". Physica A: Statistical Mechanics and Its Applications. 486: 797–805, if the Lindy effect holds then the survival function of the account age is Pareto. So, we need to test whether the Wikidata account age follows a power law or not.

Now, this is a bit tricky, so let's go one step at a time:

  • the data are the frequencies of Wikidata account ages;
  • the age of the account is the number of months from registration until the first sequence of five inactive months (when we pronounce an editor officially inactive by convention);
  • bots are filtered out in the ETL phase;
  • if the x_min is set to the de facto minimum of the account age (which is 69; no x_min estimation), then we have power-law behavior with an estimated alpha of 1.626341 - a power law with all moments diverging.
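The alpha estimate above was obtained with the R {poweRlaw} package; for intuition only, the discrete maximum-likelihood step can be approximated as in Clauset, Shalizi & Newman (2009), eq. 3.7. A sketch of that approximation (not the bootstrap goodness-of-fit test used here):

```python
import math

def alpha_mle(ages, x_min):
    """Approximate discrete power-law exponent for the tail x >= x_min
    (Clauset et al. 2009, eq. 3.7):
    alpha ~= 1 + n / sum(ln(x / (x_min - 0.5)))."""
    tail = [x for x in ages if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / (x_min - 0.5)) for x in tail)
```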

However, following the recommendations of the authors of {poweRlaw}, the bootstrap analysis shows that in neither of the two cases is a power law really present (see the Hypothesis Testing framework implemented in {poweRlaw}, 2. Examples using the poweRlaw package: https://cran.r-project.org/web/packages/poweRlaw/vignettes/b_powerlaw_examples.pdf, pages 4-5).

So, it does not seem to be a case of the Lindy effect after all. The code will be shared on Gerrit soon and referenced from Phab.

I would also feel at least a bit more confident than I am now if @MGerlach could find some time to take a look at the data.

It is methodologically problematic, at least in my view, to try to establish whether a power law (and thus Lindy) holds for the total number of active months (obtained by neglecting all inactive months in the editor's revision history). However, we can try.
Edit. In the meantime, I have tested, just in case, the total number of active months for power-law behavior: it does not follow a power law. So, no Lindy effect there either.

@Jan_Dittrich @awight @Lydia_Pintscher @Manuel @Tobi_WMDE_SW

Probably of interest to all of you, because we have a quite interesting - and potentially very useful - outcome here.

As a side project to this ticket, I have trained a Random Forest classifier, following some feature engineering steps first, to predict which editors will probably continue to work on Wikidata vs. who will probably leave.

All features are derived from user revision histories coded as 00010101010111111100001110001100..., where 1 represents an active month (>= 5 edits) and 0 an inactive month.
All revision histories of users who are officially and by convention absent at the present moment (i.e., their revision history ends in ...00000+$ - five or more consecutive months of inactivity) were truncated to end in four consecutive months of inactivity - simply because we would like to predict what will happen to a user who is still an active editor, and not do so once we have already pronounced them inactive.

Anyway, following a series of cross-validations and tricks to account for a highly imbalanced dataset, one Random Forest classifier was able to predict leave vs. stay in Wikidata with:

  • Accuracy of 97%,
  • Hit rate (True Positive Rate, TPR) of 90%,
  • and a False Alarm rate (False Positive Rate, FPR) of only 2.8%.

This means that we can recognize, with decent accuracy and a low level of false alarms, those editors who are on a streak to continue contributing to Wikidata in the future, and think about how to use that information in community building to make Wikidata more sustainable.

The result should be taken as preliminary, but these initial tests were already quite extensive (8-10 h of processing, model selection among 240 cross-validated Random Forest classifiers...).

The model encompasses the following features (MeanDecreaseGini is a measure of variable importance in Random Forests):

                             MeanDecreaseGini
med_inact                          12274.1092
sumActiveMonths                     7676.7991
mean_inact                          6686.6961
accountAge                          5541.5158
averageRevisionsPerMonth            3875.9850
pActiveMonth                        3692.2568
numRevisions                        3618.5379
H                                   2269.2940
reactivationsN                      2145.5995
averageTalkRevisionsPerMonth         552.7711
talkrevisions                        384.1718

Feature Vocabulary:

  • med_inact - the median of the lengths of the user's periods of inactivity in months (say we find 000, 000, 0000, 00, 0, 00 in a particular user's revision history - we take the median of the interval lengths)
  • sumActiveMonths - the count of active months in a particular user's revision history
  • mean_inact - the average length of the user's periods of inactivity in months (say we find 000, 000, 0000, 00, 0, 00 in a particular user's revision history - we take the average of the interval lengths)
  • accountAge - the length of user's revision history in months, since user registration and up to the present moment
  • averageRevisionsPerMonth - the average number of revisions in the namespaces 0, 120, 146
  • pActiveMonth - the proportion of active months in a particular user's revision history (i.e. the probability of an active month for a user)
  • numRevisions - the total number of revisions in the namespaces 0, 120, 146
  • H - the Shannon Diversity Index derived from the user's revision history (i.e. entropy normalized by Hmax)
  • reactivationsN - the number of reactivations of the user (slightly problematic from a methodological viewpoint: if the user is currently inactive and we observe their inactivity for the first time, it is zero by definition, and then there is also the question of whether we keep following that user's data in the future or not)
  • averageTalkRevisionsPerMonth - the average number of edits in the Talk namespaces
  • talkrevisions - the total number of edits in the Talk namespaces
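A few of these features can be illustrated on a toy per-month activity sequence. This is a Python sketch of the definitions above, not the actual ETL code (which is in R); encoding months as 0/1 and reading H as the normalized entropy of the active/inactive split are illustrative assumptions.

```python
from statistics import mean, median
import math

def activity_features(active):
    """Derive a few of the retention features from a per-month activity
    sequence (1 = active month, 0 = inactive month)."""
    # collect runs of consecutive inactive months ("000", "00", ...)
    runs, n = [], 0
    for a in active:
        if a == 0:
            n += 1
        elif n:
            runs.append(n)
            n = 0
    if n:
        runs.append(n)
    p_active = sum(active) / len(active)
    # Shannon entropy of the active/inactive split, normalized by Hmax = log 2
    H = 0.0
    for p in (p_active, 1 - p_active):
        if p > 0:
            H -= p * math.log(p)
    H /= math.log(2)
    return {
        "med_inact": median(runs) if runs else 0,
        "mean_inact": mean(runs) if runs else 0,
        "sumActiveMonths": sum(active),
        "pActiveMonth": p_active,
        "H": H,
    }

# 1,1,0,0,0,1,0,0,1 -> inactivity runs of lengths 3 and 2
feats = activity_features([1, 1, 0, 0, 0, 1, 0, 0, 1])
```

For this toy history the inactivity runs are 3 and 2 months, so med_inact and mean_inact are both 2.5, with 4 active months out of 9.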

These features are somewhat redundant (Random Forests, however, do not care much about collinearity and similar issues), so the prospects are good that we can develop a lighter, more efficient and yet successful model in the future.

All computations were performed on DataKolektiv's servers on a dataset with anonymized user ids.

Thanks, super interesting! Some things are beyond my data-skills, thus, I look forward to feedback from other people!

Anyways, following a series of cross-validations and tricks to account for a highly imbalanced dataset, one Random Forest classifier was able to predict leave vs stay in Wikidata with:

  • Accuracy of 97%,
  • Hit rate (True Positive Rate, TPR) of 90%,
  • and a False Alarm rate (False Positive Rate, FPR) of only 2.8%.

What about the true/false negative rate? To my untrained eye, these numbers look typical for an imbalanced training/test set, where we have a lot of people abandoning so it's really easy for a classifier to accurately predict that a user will leave, but probably much less accurate at predicting that a person will stay. I'm unsure whether "positive" here means the classifier identifies a person who will leave or stay, btw., can you share more about the test results?

The model encompasses the following features (MeanDecreaseGini is a measure of variable importance in Random Forests):

Thanks for including the relative importance of each feature. I like your "median length of inactivity" measure, that could be a good single-parameter predictor. Of course, there is some risk of this being tautological: e.g. if a user is absent for a median of 5 months then they are roughly 50% likely to be absent for another 5 months (therefore considered "abandoned") in the future. Maybe it would help the exploration to run a tool like LIME on the model to learn more about how features are related to the prediction.

@awight

First of all, I may have failed to mention that the outcome variable (i.e. what we are predicting) is "stay", not "leave". My bad.

I'm unsure whether "positive" here means the classifier identifies a person who will leave or stay, btw., can you share more about the test results?

These terms have one and the same meaning in Statistical Decision Theory and ML, always, see ROC Analysis from Wikipedia.

What about the true/false negative rate?

Well, they are just 1 minus their positive counterparts, right?

To my untrained eye, these numbers look typical for an imbalanced training/test set, where we have a lot of people abandoning so it's really easy for a classifier to accurately predict that a user will leave, but probably much less accurate at predicting that a person will stay.

To the contrary, the reported low FA rate means that the model is good at avoiding the Type I error, i.e. predicting that someone will stay when they actually leave. And the dataset is still very imbalanced - but there are techniques to deal with that, and I have used some of them here. Your confusion is probably a consequence of my failure to explain what the outcome variable is, sorry.
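To make the terminology concrete, here is how all of these rates fall out of a single confusion matrix. The counts below are invented to roughly match the reported rates; they are not the actual test-set counts.

```python
def rates(tp, fn, fp, tn):
    """Standard confusion-matrix rates, with 'positive' meaning the
    model's target class ('stay' here)."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "TPR": tp / (tp + fn),  # hit rate
        "FPR": fp / (fp + tn),  # false alarm rate
        "TNR": tn / (fp + tn),  # = 1 - FPR
        "FNR": fn / (tp + fn),  # = 1 - TPR
    }

# hypothetical imbalanced test set: 800 stayers vs 9200 leavers
r = rates(tp=720, fn=80, fp=258, tn=8942)
```

With these made-up counts, TPR is 0.90 and FPR about 0.028, while the negative rates are simply their complements.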

I like your "median length of inactivity" measure, that could be a good single-parameter predictor.

Could be, don't know yet.

Of course, there is some risk of this being tautological: e.g. if a user is absent for a median of 5 months then they are roughly 50% likely to be absent for another 5 months (therefore considered "abandoned") in the future.

Wouldn't that hold only if Lindy and Power-Law hold too? But I think they do not, see T282563#7250712.

N.B. I am still experimenting to see if the feature engineering process can give us even more information than we are using now. Then I will share the code and the data so that anyone can play with the model or build their own.

@GoranSMilovanovic thanks for sharing, these results look interesting. some comments below.

@Jan_Dittrich Do we really find a Lindy effect in the Wikidata acount age distribution?

Assumption. As demonstrated in Eliazar, Iddo (November 2017). "Lindy's Law". Physica A: Statistical Mechanics and Its Applications. 486: 797–805, if the Lindy effect holds, then the survival function of the account age is Pareto. So we need to test whether the Wikidata account age follows a power law or not.

Now, this is a bit tricky, so let's go one step at a time:

  • the data are the frequencies of Wikidata account ages;
  • the age of the account is the number of months since the registration until the first sequence of five inactive months (when we pronounce an editor officially inactive by convention)
  • Bots are filtered out in the ETL phase;
  • if x_min is set to the de facto minimum of the account age (which is 69; no x_min estimation), then we observe power-law behavior with an estimated alpha of 1.626341 - a power law with all moments diverging.

How can the x_min be so large (estimated or not)? My understanding of the parameter x_min is that we fit a powerlaw distribution to all x > x_min. Thus we only fit the powerlaw for account ages of more than 69 or 153 months, respectively. From the plots you showed above, this applies only to a small fraction of accounts. This is problematic because your fitted distribution does not try to describe anything that happens at x < x_min, essentially ignoring the vast majority of accounts. Instead, I believe one should fit a distribution with a fixed x_min = 1 (or similarly small).

However, following the recommendations of the authors of {poweRlaw}, the bootstrap analysis shows that in neither of the two cases is the power law really present (see the Hypothesis Testing framework implemented in {poweRlaw}, 2. Examples using the poweRlaw package: https://cran.r-project.org/web/packages/poweRlaw/vignettes/b_powerlaw_examples.pdf, pages 4 - 5).

This means that the powerlaw distribution is rejected for the data. However, this is not surprising - real data is messy, and this type of hypothesis test rejects even when we have really strong reasons to believe the data should follow a powerlaw distribution, e.g. due to small correlations, etc. (you can read about this argument in more detail in a paper we wrote some time ago).

One possible path out of this is to slightly change the question. Instead of asking whether the data is perfectly described by a powerlaw (in most cases it is not), it might be more interesting to know whether a powerlaw describes the data better than another distribution. This is also described in the package you mention (3. Comparing distributions with the poweRlaw package). For example, one could compare the fit of a powerlaw with a Poisson. The latter is an interesting comparison because the Poisson follows if the probability of stopping is independent of the time an editor has already been around. In contrast, the powerlaw follows if the probability of stopping decreases with time (in a specific way). If the powerlaw fits better than the Poisson, this would then be evidence that the probability of stopping does depend (somehow) on the time an editor has been already around.
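The model-comparison idea can be sketched with a toy log-likelihood comparison. This is a Python illustration only (the actual analysis used {poweRlaw} in R, e.g. its compare_distributions approach); the sample data, the truncation point and the crude grid-search MLE are all assumptions for the sketch.

```python
import math

def loglik_powerlaw(xs, alpha, xmax=10**4):
    """Log-likelihood of a discrete power law P(x) ~ x^(-alpha),
    normalized by a truncated zeta sum up to xmax."""
    logZ = math.log(sum(k ** -alpha for k in range(1, xmax + 1)))
    return sum(-alpha * math.log(x) - logZ for x in xs)

def loglik_poisson(xs, lam):
    """Log-likelihood of a Poisson model - the case where the
    probability of stopping is independent of account age."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

# invented, heavy-tailed "account age" sample (in months)
xs = [1] * 50 + [2] * 20 + [3] * 10 + [10] * 3 + [50]

# crude MLEs: coarse grid search for alpha, sample mean for lambda
alpha = max((a / 100 for a in range(110, 350, 5)),
            key=lambda a: loglik_powerlaw(xs, a))
lam = sum(xs) / len(xs)

# a positive log-likelihood ratio favours the power law over Poisson
lr = loglik_powerlaw(xs, alpha) - loglik_poisson(xs, lam)
```

On this heavy-tailed toy sample the Poisson model pays an enormous penalty for the single 50-month observation, so the ratio comes out strongly in favour of the power law.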

Anyways, following a series of cross-validations and tricks to account for a highly imbalanced dataset, one Random Forest classifier was able to predict leave vs stay in Wikidata with:

  • Accuracy of 97%,
  • Hit rate (True Positive Rate, TPR) of 90%,
  • and a False Alarm rate (False Positive Rate, FPR) of only 2.8%.

What about the true/false negative rate? To my untrained eye, these numbers look typical for an imbalanced training/test set, where we have a lot of people abandoning so it's really easy for a classifier to accurately predict that a user will leave, but probably much less accurate at predicting that a person will stay.

I agree with @awight. The high accuracy is not to be taken at face value as the positive/negative groups are probably highly imbalanced (not sure if this is true but it looks like most accounts stop editing very quickly). Two options to make the numbers more interpretable:

  • compare with a baseline predictor that does not use any of the features. This could be either a random guess (for example based on the Lindy-curve) or simply always guessing the majority-class
  • using a balanced test-set such that you have the same number of positive and negative examples (for example via downsampling the majority class or vice versa)
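The second option can be sketched generically. This is a hypothetical helper for illustration, not code from this analysis:

```python
import random

def downsample(examples, labels, seed=42):
    """Balance a binary dataset by randomly downsampling the majority
    class to the size of the minority class."""
    random.seed(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    major, minor = (pos, neg) if len(pos) > len(neg) else (neg, pos)
    keep = sorted(minor + random.sample(major, len(minor)))
    return [examples[i] for i in keep], [labels[i] for i in keep]

# toy imbalanced data: 7 positives, 3 negatives -> balanced 3 vs 3
X = list(range(10))
y = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Xb, yb = downsample(X, y)
```

Downsampling throws away data, which is why upsampling the minority class (or reweighting) is the usual alternative.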

@MGerlach

First of all, thank you very much for the insights that you have provided.

On Power Laws and Lindy:

One possible path out of this is to slightly change the question. Instead of asking whether the data is perfectly described by a powerlaw (in most cases it is not), it might be more interesting to know whether a powerlaw describes the data better than another distribution.

I agree completely, and that is what I am about to do here next.

How can the x_min be so large (estimated or not)? My understanding of the parameter x_min is that we fit a powerlaw distribution to all x > x_min. Thus we only fit the powerlaw for account ages of more than 69 or 153 months, respectively. From the plots you showed above, this applies only to a small fraction of accounts.

From POWER-LAW DISTRIBUTIONS IN EMPIRICAL DATA, AARON CLAUSET, COSMA ROHILLA SHALIZI, AND M. E. J. NEWMAN (2009):

In practice, few empirical phenomena obey power laws for all values of x. More often the power law applies only for values greater than some minimum xmin. In such cases we say that the tail of the distribution follows a power law.

and the {poweRlaw} package - which implements the estimation approach of Clauset, Shalizi & Newman - estimates xmin to be as large as 153. Let me remind you that I have also tried with xmin set to the minimum of the empirical observations (that would be 69 in our dataset) - essentially what you have also suggested (see T282563#7250712).

This means that the powerlaw-distribution is rejected for the data. However, this is not surprising - real data is messy and this type of hypothesis test rejects even if we have really strong reasons to believe it should follow the powerlaw distribution, e.g. due to small correlations etc (you can read in more detail about this argument in a paper we wrote some time ago).

The paper you mention, Gerlach & Altmann (2019), Testing statistical laws in complex systems, is overkill for me. If you promise to find some time to meet and provide a translation into plain English, I promise to be all ears.

Now, as of the Random Forest classifier:

The high accuracy is not to be taken at face value as the positive/negative groups are probably highly imbalanced (not sure if this is true but it looks like most accounts stop editing very quickly).

Yes, the high accuracy at face value does not tell a thing, but we have a Hit rate (the model predicts "stay" and the editor "stays") at 90% and the False Alarm rate (the model says "stay" but the editor "leaves") at "only" 2.8%. Some would say "not great, not terrible", but given that this is our first attempt at the problem at hand I would really say that is not bad at all.

using a balanced test-set such that you have the same number of positive and negative examples (for example via downsampling the majority class or vice versa)

Instead of using upsampling or downsampling, I have controlled for the priors in classification to account for the (huge) imbalance in the distribution of the outcome (see: classwt argument of randomForest() in {randomForest}).

compare with a baseline predictor that does not use any of the features. This could be either a random guess (for example based on the Lindy-curve) or simply always guessing the majority-class

Definitely. Will do.

Thanks again @MGerlach

Change 709690 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):

[analytics/wmde/WD/WikidataAdHocAnalytics@master] T282563

https://gerrit.wikimedia.org/r/709690

Change 709690 merged by GoranSMilovanovic:

[analytics/wmde/WD/WikidataAdHocAnalytics@master] T282563

https://gerrit.wikimedia.org/r/709690

Here's the ETL code.
I will add modeling and power law estimation as soon as I complete all additional steps as suggested.

@GoranSMilovanovic regarding the prediction model, a recent paper from this year's ISCW-conference might be very interesting (e.g. which features they use and to compare prediction-performance):

Learning to Predict the Departure Dynamics of Wikidata Editors (link to pdf)

Wikidata as one of the largest open collaborative knowledge bases has drawn much attention from researchers and practitioners since its launch in 2012. As it is collaboratively developed and maintained by a community of a great number of volunteer editors, understanding and predicting the departure dynamics of those editors are crucial but have not been studied extensively in previous works. In this paper, we investigate the synergistic effect of two different types of features: statistical and pattern-based ones with DeepFM as our classification model which has not been explored in a similar context and problem for predicting whether a Wikidata editor will stay or leave the platform. Our experimental results show that using the two sets of features with DeepFM provides the best performance regarding AUROC (0.9561) and F1 score (0.8843), and achieves substantial improvement compared to using either of the sets of features and over a wide range of baselines.

@Esh77 @MGerlach @Jan_Dittrich

Martin and Jan: thank you for your readiness to present our findings on Wikidata User Retention in the WikidataCon 2021 Education & science track (see WikidataCon 2021 program, Sunday, October 31st from 11:00 to 18:00 UTC) !

@awight Adam, you have contributed too, and there is still time to join us to prepare the session for the WikidataCon!

@MGerlach @Jan_Dittrich

If you agree:

  • I would prepare a synthesis of the work done here so far, and then
  • share an (R Markdown, rendered to html) Notebook with you;
  • we could use the Notebook as a starting point to develop the session.

@MGerlach

  • Thank you for sharing the paper in T282563#7419722;
  • I will take a look, but I doubt that I will have enough time until WikidataCon 2021 to experiment with any other model except for Random Forest (almost implemented) and XGBoost (in preparation).

I also suggest that we schedule a concise meeting on this.

Thank you - I am looking forward to seeing you soon!

@MGerlach @Jan_Dittrich

We also need to decide on the following in order to submit our session proposal to WikidataCon 2021:

  • Proposal title
  • Session type (please take a look at the submission form for options)
  • Abstract
  • Notes
  • Session Image

I will add a submission once we meet and figure out the format and everything else + add you as co-speakers.

What about:

Proposal title: Wikidata user retention over time?
Session type: Lightning or short, I guess?
Abstract: People leave online communities after some time. However, the likelihood that a particular user leaves the project depends on the time they have already been on the project: People who have only spent a brief time on the project are more likely to leave than long-term members. This is similar to the so-called "Lindy Effect". We modeled the curve for the likelihood of leaving the project depending on the length of past participation and will present methods, outcomes and practical relevance.
Notes: We would show some slides
Session Image: Maybe the diagram from the original issue?

@Jan_Dittrich Sounds great, but I think @MGerlach and I would like to add some modeling efforts to see if we can predict whether a user stays or not.

@MGerlach

The authors of the paper that you cited in T282563#7419722 use a similar - if not the same - approach to feature engineering for the prediction task as I used in T282563#7251679 with the RF classifier (p. 4 in the PDF):

3. Diversity of edit actions (Div_edit-act). To capture the diversity of different types of edit actions (see Section 4), we use the Shannon-Entropy [25] of different edit actions in the same manner as in [24] as: H(T) = −∑_{i=1}^{n} P(t_i) · log P(t_i), where T indicates different types of edit actions, and |T| = n.

4. Diversity of entities (Div_ent). We measure the diversity of edited entities of a user using the Shannon-Entropy. The intuition is that the diversity of edited entities of a user could also be different across active and inactive editors.
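The quoted measure is the standard Shannon entropy over a user's edit-action types; a quick sketch (the action labels are invented examples, not actual Wikidata action names):

```python
import math
from collections import Counter

def edit_action_diversity(actions):
    """Shannon entropy H(T) = -sum_i P(t_i) * log P(t_i) over the
    edit-action types observed in a user's history."""
    counts = Counter(actions)
    n = len(actions)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# a user doing only one action type has zero diversity
h0 = edit_action_diversity(["claim-add"] * 5)

# mixing action types increases the entropy
h1 = edit_action_diversity(["claim-add", "label-edit", "claim-add", "merge"])
```

The maximum, log |T|, is reached when all action types are equally frequent, which is why normalizing by H_max makes the values comparable across users.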

@MGerlach @Jan_Dittrich

It is a power-law (and thus Lindy) after all:

"H0: data IS generated from a power law distribution; 
 H1: data IS NOT generated from a power law distribution."

(from: Fitting Heavy Tailed Distributions: The poweRlaw Package, Colin S. Gillespie, Newcastle University, Journal of Statistical Software, February 2015, Volume 64, Issue 2).

*I was reading the bootstrap p-values incorrectly* - our findings say that we cannot reject H0, and under this hypothesis testing framework H0 means: it's a power-law.

See our slides for test results.

@MGerlach I will test this against Poisson of course. I did not forget about your earlier remark in T282563#7254843.

EDIT: Except that I have fitted RANK frequencies in place of raw observations - so forget it, no power law (I promise; my last word); the changes have been entered into our slides.

@MGerlach @Jan_Dittrich

I have used XGBoost to train a leave vs stay binary classifier to our data.

I did not go into elaborate cross-validation: I used only a single training and a single test dataset, downsampled by a huge factor (because of the huge class imbalance), used scale_pos_weight to upweight them back, and cross-validated across eta, max_depth and subsample only; only shallow trees (max_depth of 5 or 10) were used - but lots of them (n_rounds set to 10,000).

Results

  • The best model I've found had an AUC = 0.9689391 - a bit better than what is reported in the paper that used the DeepFM architecture and which @MGerlach shared in T282563#7419722; however, the models are not really comparable since the authors of the DeepFM paper used a different criterion to define "leave" than we did (5 months of inactivity);
  • Since we get P(Leave) from XGBoost, I have performed a full ROC analysis. When the decision criterion (or boundary, if you prefer) is set to be very high (0.999001), we obtain the following characteristics:
    • TPR = 0.911067
    • FNR = 0.088933
    • FPR = 0.1030485
    • TNR = 0.8969515

That seems quite satisfactory, especially if we consider a simple Bayesian analysis:

  • starting from an a priori of P(Leave) = .92 - and we know that from our data,
  • the a posteriori P(Leave|Model says "Leave") = .99.
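Spelled out, that Bayes update uses the reported rates with "Leave" as the positive class (a sketch of the arithmetic, not code from the analysis):

```python
def posterior_leave(prior, tpr, fpr):
    """P(Leave | model says "Leave") by Bayes' rule:
    numerator   = P(says Leave | Leave) * P(Leave)
    denominator = P(says Leave), marginalized over Leave and Stay."""
    p_says_leave = tpr * prior + fpr * (1 - prior)
    return tpr * prior / p_says_leave

# reported rates: prior P(Leave) = .92, TPR = 0.911067, FPR = 0.1030485
p = posterior_leave(prior=0.92, tpr=0.911067, fpr=0.1030485)
print(round(p, 2))  # 0.99
```

Plugging in the reported rates indeed gives a posterior of about .99.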

I hope this addresses all remarks made in T282563#7252149 and T282563#7254863.

I guess an even better model could be built with XGBoost - after all, I have searched a rather constrained parameter space only - but I do not think I have enough time to run tons of ML cycles before Sunday, October 31, when we need to present at WikidataCon 2021. These results will enter our WikidataCon 2021 slides.

Change 735736 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):

[analytics/wmde/WD/WikidataAdHocAnalytics@master] T282563

https://gerrit.wikimedia.org/r/735736

Change 735736 merged by GoranSMilovanovic:

[analytics/wmde/WD/WikidataAdHocAnalytics@master] T282563

https://gerrit.wikimedia.org/r/735736