Wed, Aug 8
We did some tests in the PySpark CLI with @Ottomata this evening and found memory settings that work (with some minor changes to the code).
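For reference, a minimal sketch of how such settings can be passed when building a PySpark session; the values below are illustrative, not the ones we settled on:

```
from pyspark.sql import SparkSession

# Illustrative values only, not the settings from our test.
spark = (SparkSession.builder
         .appName("memory-test")
         .config("spark.executor.memory", "4g")
         # spark.driver.memory must be set before the driver JVM starts,
         # so in practice it is usually passed on the command line instead
         .config("spark.driver.memory", "8g")
         .getOrCreate())
```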
Hi @GoranSMilovanovic, please excuse me, I meant to follow up on this task but then forgot...
I think @Milimetric is right: when you call write from a dataframe, any worker (meaning any of the cluster nodes) responsible for a dataframe partition will write its own chunk. Since you've repartitioned to 1, only one file will be written, but it could be on any worker.
If you're sure the data is small enough to fit on the driver, you should use collect to get the data back, and then use regular R functions on the driver to write locally.
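A minimal PySpark sketch of the same pattern (the thread is about R, where collect() plus regular write functions play the same role; the dataframe df and the output path are assumptions):

```
import csv

rows = df.collect()  # only safe if the data really fits in driver memory
with open("output.csv", "w", newline="") as f:  # local path on the driver
    writer = csv.writer(f)
    writer.writerow(df.columns)               # header
    writer.writerows(tuple(r) for r in rows)  # each Row behaves like a tuple
```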
I hope it makes sense :)
Many thanks @Ottomata ! Those notebooks are awesome :)
Mon, Aug 6
I have not tested it, but I don't see why it would break :)
Let's go!
Hi again @Astinson,
The wikisource item in the Wikistats selector refers to the wikisource.org URL, not to every wikisource bundled together.
See https://stats.wikimedia.org/v2/#/en.wikisource.org/contributing/editors/normal|line|2-Year~2016070100~2018080300|~total for the number of edits on en.wikisource.org for instance.
We plan on providing "project-aggregated" metrics in Wikistats, but they are not available as of today.
Hi @Astinson :)
Indeed there is more data available. Not since the beginning of wiki time, but at least a few years.
Data is available by month through API calls of the form: https://wikimedia.org/api/rest_v1/metrics/pageviews/top-by-country/wikisource.org/all-access/2018/07
You can read some docs here: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Pageviews_split_by_country
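A small hedged sketch of calling that endpoint from Python (the URL is the one above; check the response fields against the linked docs before relying on them):

```
import requests

url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
       "top-by-country/wikisource.org/all-access/2018/07")
# Wikimedia APIs expect a descriptive User-Agent; the value here is illustrative.
resp = requests.get(url, headers={"User-Agent": "example-script/0.1"})
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item)  # one entry per month, with a per-country breakdown
```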
May I let you follow up with this task?
Sounds good to me! Maybe we could put together a bigger list of topics to check, to increase the probability of catching errors, but apart from that it looks good.
Tue, Jul 24
Sounds good to me, maybe with a .org$ to match only at end-of-string (and make the regexp parser's life easier).
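A toy illustration of the anchoring (escaping the dot as \. is a small extra safety, since an unescaped . matches any character):

```
import re

pattern = re.compile(r"\.org$")  # only matches when ".org" ends the string
print(bool(pattern.search("en.wikisource.org")))    # True
print(bool(pattern.search("wikisource.org.other"))) # False
```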
Jul 6 2018
removing the "SQL-Context not loaded" part from the description, as spark 2.3 uses SparkSession named spark instead of a sql-context. the sqlContext is still usable through spark.sqlContext, but most of the sql-related suff is now available under the spark-session: spark.sql, spark.udf etc.
As for the other two, I have the same problems :)
Jul 4 2018
Thanks for the very accurate summary @elukey :)
Jun 16 2018
I did another quick check this morning: there are some valid user-agent strings longer than 512 characters in our faulty hour (9 out of 64). The other 55 are exactly the same string, of length 2035.
I also successfully parsed user-agents with a length limit of 1024 over the faulty hour, and double-checked how many user-agents would not have been parsed with various limits for another full day of raw webrequest:
- Total number of rows for that day: 3626986512
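A hedged sketch of the kind of counting query used for that check (PySpark; the table name wmf_raw.webrequest, the user_agent column, and the partition filter are assumptions for illustration):

```
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("ua-length-check")
         .enableHiveSupport()
         .getOrCreate())

ua_len = (spark.table("wmf_raw.webrequest")   # assumed raw webrequest table
          .where("year = 2018 AND month = 6 AND day = 15")
          .select(F.length("user_agent").alias("len")))

ua_len.agg(
    F.count("*").alias("total_rows"),
    F.sum(F.when(F.col("len") > 512, 1).otherwise(0)).alias("longer_than_512"),
    F.sum(F.when(F.col("len") > 1024, 1).otherwise(0)).alias("longer_than_1024"),
).show()
```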
Jun 15 2018
One problem is related to user-agent parsing for very long strings:
```
sudo -u hdfs spark2-shell --master yarn --conf spark.dynamicAllocation.maxExecutors=256 --jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar
```