Let's add some more languages (and projects) to the clickstream datasets! My goal is to add some easy ones up front and, longer term, to establish a process for adding more and to understand any barriers to further expanding the dataset.
Current status
There are 11 Wikipedia projects for which the clickstream is currently produced monthly (de, en, es, fa, fr, it, ja, pl, pt, ru, zh). I was not around when these languages were chosen but they seem to be the largest languages by # of pageviews. When these were added, the smallest (fa) was receiving just over 100 million pageviews per month.
Why add more?
The clickstream dataset is our sole source of public information about reader navigation -- i.e. how people move between pages. This is valuable for a variety of reasons:
- Editors: seeing which links readers click so they can improve the quality of those pages
- Researchers: as a measure of "link importance" for algorithms like PageRank, generating page embeddings based on where readers come/go, designing strategies for addressing gaps in Wikipedia (e.g., ensuring that not only is there more content about women but that it is sufficiently discoverable), etc.
One of the main challenges for this data has been accessibility -- i.e., it was only available as dumps, with no API for quick access and no friendly interface for exploring the data without data-science skills. That barrier has recently been lowered thanks to an Outreachy project, so it feels like a good time to discuss adding more languages: https://wikinav.toolforge.org/
As far as I can tell, there is no clear reason to limit the clickstream to just these 11 languages beyond concerns of complicating the job that produces them. There are some fast-growing wiki communities that I believe could make use of this data (and that we should encourage researchers to include in their work).
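As background for anyone wanting to work with the dumps: each monthly clickstream file is a tab-separated table with four columns (prev, curr, type, n). A minimal sketch of pulling a page's top referrers out of such a file -- the function name and file handling here are illustrative, not part of any existing tool:

```python
import csv
from collections import defaultdict

def top_referrers(path, target, limit=5):
    """Aggregate (prev -> target) counts from a clickstream TSV.

    Each row is: prev, curr, type, n (tab-separated).
    Returns the `limit` biggest sources of traffic to `target`.
    """
    counts = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for prev, curr, _type, n in csv.reader(f, delimiter="\t"):
            if curr == target:
                counts[prev] += int(n)
    return sorted(counts.items(), key=lambda kv: -kv[1])[:limit]
```

Note the published files are gzipped, so in practice you would open them with gzip.open rather than a plain open.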
Desired state
In an ideal world, the dataset would be produced for all Wikipedia languages, and we would discuss adding relevant sister projects too -- e.g., Wikisource. Practically speaking, I assume the job will need some refactoring to scale that dramatically (?), and the privacy requirement (a source-destination pair needs at least 10 clicks to be retained in the data) means that for some projects the dataset would be missing far too much data to be of much use.
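The 10-click rule is simple to illustrate: aggregate (prev, curr) page pairs and drop any pair below the threshold. A toy sketch of the idea, not the production job's code:

```python
from collections import Counter

MIN_CLICKS = 10  # pairs seen fewer times than this are dropped entirely

def apply_privacy_threshold(events, min_clicks=MIN_CLICKS):
    """Aggregate (prev, curr) page pairs from raw navigation events and
    drop any pair with fewer than min_clicks occurrences, mirroring the
    released dataset's threshold."""
    counts = Counter(events)
    return {pair: n for pair, n in counts.items() if n >= min_clicks}
```

On a small wiki, most pairs fall under the threshold, which is exactly why the resulting dataset may be too sparse to be useful there.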
Thus, I would like to at least propose a priority order for these projects, sorted by pageviews per month (turnilo):
- >100M: ar (Arabic), nl (Dutch), tr (Turkish), id (Indonesian)
- >50M: sv (Swedish), ko (Korean), vi (Vietnamese), hi (Hindi), th (Thai), he (Hebrew), cs (Czech), fi (Finnish)
- >10M: hu (Hungarian), uk (Ukrainian), el (Greek), ro (Romanian), no (Norwegian (Bokmål)), da (Danish), sr (Serbian), ms (Malay), bn (Bengali), bg (Bulgarian), hr (Croatian), ta (Tamil), sk (Slovak), az (Azerbaijani), simple (Simple), ca (Catalan), mr (Marathi), ml (Malayalam)
Of note, if you expand beyond Wikipedia, then these cut-offs also include Commons, Wikidata, and a few high-traffic Wiktionary sites. There have been previous requests for a Wikidata clickstream (see below), but I'm not sure whether the data would be useful for Commons or Wiktionary.
Concerns
- I don't know what the original privacy review covered, but it might be worth asking for a new privacy review for the >50M and >10M languages, given that the reader pool generating the data shrinks for those wikis.
- In theory, adding a Wikipedia site to the job is just a matter of adding it to the config. In practice, I don't know whether including all languages in a single job will scale forever. Though English dwarfs all of these wikis in the amount of data, a language like Swedish has almost half as many articles, so it still might break the job. Other projects like Commons or Wikidata are of course also wildcards for this reason.
- In practice, adding a bunch more languages shouldn't drastically increase the amount of data being stored, because we already generate the dataset for the largest wikis.
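To make the "just add it to the config" point concrete, here is a purely hypothetical sketch of a per-wiki parameterization -- the names and structure are invented for illustration and do not reflect the actual job's config. The idea is that if each wiki is its own job run, a large or slow wiki doesn't block the rest:

```python
# Hypothetical config sketch: each wiki becomes its own job run rather
# than one monolithic job over all languages. Lists are illustrative.
CURRENT = ["de", "en", "es", "fa", "fr", "it", "ja", "pl", "pt", "ru", "zh"]
PROPOSED = ["ar", "nl", "tr", "id", "sv", "ko", "vi", "hi", "th", "he", "cs", "fi"]

def wikis_to_process():
    """One job parameter per wiki, so failures/runtimes stay isolated."""
    return [f"{lang}.wikipedia" for lang in CURRENT + PROPOSED]
```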