Tue, Mar 20
Please find the data to be labeled here: https://drive.google.com/drive/folders/1pzR3P16ck7FyrE7QgIpcSx1TPumTGA9u?usp=sharing
Mon, Mar 19
Tue, Mar 13
@JAllemandou : just for the record, in this case I meant the parquet partitions. See you in IRC
@JAllemandou : My understanding is that if you partition by a unique id, the data gets sorted by that key, and then all the contiguous ids end up in the same partition, as explained here: https://hackernoon.com/managing-en,spark-partitions-with-coalesce-and-repartition-4050c57ad5c4
@JAllemandou : as we discussed on IRC, could you please add the timestamp for each revision?
Also, it would be good to have the data partitioned by revision_id, because this would make future joins to get additional information (e.g. user) easier.
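To illustrate the idea of keeping contiguous ids together, here is a minimal pure-Python sketch of range partitioning. It is not Spark or parquet code and does not claim to reflect how the actual dataset is written; the helper names `range_partition` and `find_partition` and the sample rows are hypothetical.

```python
from bisect import bisect_left


def range_partition(rows, key, num_partitions):
    """Sort rows by `key` and split them into contiguous, roughly equal chunks."""
    ordered = sorted(rows, key=lambda r: r[key])
    if not ordered:
        return []
    size = -(-len(ordered) // num_partitions)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def find_partition(partitions, key, value):
    """Return the index of the partition whose key range would hold `value`.

    Only the upper bound of each partition is inspected, so a lookup (or a
    join on `key`) can skip every other partition. Returns None when `value`
    is larger than every key.
    """
    uppers = [p[-1][key] for p in partitions]
    idx = bisect_left(uppers, value)
    return idx if idx < len(partitions) else None


# Hypothetical revisions, deliberately out of order.
rows = [{"revision_id": i} for i in [7, 2, 9, 4, 1, 8, 3, 6, 5]]
parts = range_partition(rows, "revision_id", 3)
ids_per_partition = [[r["revision_id"] for r in p] for p in parts]
```

With the data laid out this way, a join on revision_id only needs to read the partition whose id range covers the key, which is the benefit I had in mind.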
now 66G /home/dsaez/
Mon, Mar 12
@Ottomata and @JAllemandou I found a work-around by creating a python2.7 virtualenv on stat1005.
I think that is the easiest solution right now. Updating python3 on the workers might be a good idea for the future :)
You can find the candidates for synonyms here:
Sun, Mar 11
@JAllemandou : just came back to this. The parquet version is amazing!! Thank you very much!
Tue, Feb 27
Mon, Feb 26
@DarTar: Do you have any preference for the format of this dataset? I can think of two ways to present it:
Feb 15 2018
to make stat1004 we need to solve this: https://phabricator.wikimedia.org/T187178
Feb 14 2018
Feb 13 2018
Feb 12 2018
@bmansurov please find the candidates here: @stat1005:/home/dsaez/code/alignment/resultsMapping
Feb 10 2018
Feb 9 2018
@bmansurov could you try to upload the 20180201 dumps for en,ru,ar,jp,fr,es in parquet_
This is not urgent but might be useful for the section recommendations project.
Feb 8 2018
quick comment: from these results http://gapfinder.wmflabs.org/fr.wikipedia.org/v1/section/article/Barack_Obama
Feb 7 2018
Feb 5 2018
Jan 18 2018
X is the number of people that speak 'en' and 'uz', Y is the number of people that speak 'en' and 'fr' ...etc
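A minimal sketch of how such pair counts could be computed, assuming each user comes with a list of language codes. The function name `language_pair_counts` and the sample users are hypothetical and are not taken from the actual dataset.

```python
from collections import Counter
from itertools import combinations


def language_pair_counts(speakers):
    """Count, for every unordered language pair, how many users speak both."""
    counts = Counter()
    for langs in speakers.values():
        # Deduplicate and sort so ('en', 'uz') and ('uz', 'en') share one key.
        for pair in combinations(sorted(set(langs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical input: user -> languages spoken.
speakers = {
    "u1": ["en", "uz"],
    "u2": ["en", "fr"],
    "u3": ["fr", "en", "uz"],
}
counts = language_pair_counts(speakers)
x = counts[("en", "uz")]  # X in the description above
y = counts[("en", "fr")]  # Y in the description above
```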
this is great @bmansurov !
Jan 3 2018
Results are interesting for understanding the topic span of edit wars. Cross-topic edit wars are rare, and usually associated with very active users.
The viability of applying this approach for detecting harassment or more specifically wikihounding requires deeper analysis.
Jan 2 2018
Dec 6 2017
Nov 14 2017
Nov 8 2017
Sep 27 2017
Installing sshfs would also be a good solution for this and for https://phabricator.wikimedia.org/T176093
Sep 17 2017
@Aklapper, sorry, it is analytics. Tagged.
Sep 16 2017
Sep 6 2017
Sep 5 2017
Sep 1 2017
Aug 18 2017
Aug 17 2017
new production key:
my ssh config
Aug 9 2017
@RobH: I would prefer to have just one account, with 'diego' as username. I can delete the personal one.
just FYI my username in wikitech is diego, but my 'Instance shell account name:' is dsaez (diego was not available)