Thanks a lot @Samat for the details.
Indeed you were right: the difference is explained by a methodological change. I'm sorry not to have noticed right away.
From the 2019-05 snapshot onward, we changed the way editors are computed, removing edits made on deleted pages.
We did this for consistency, as other metrics (edits and edited-pages, for instance) were already computed with deleted edits removed.
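To make the change concrete, here is a minimal sketch of the new counting rule: editors are counted only from edits on non-deleted pages. The field names (`user`, `page_is_deleted`) are illustrative, not the actual refinery schema.

```python
# Hypothetical sketch of the methodology change: an editor is counted
# only if they have at least one edit on a non-deleted page.
def count_editors(edits, exclude_deleted=True):
    """Count distinct editors, optionally dropping edits on deleted pages."""
    if exclude_deleted:
        edits = [e for e in edits if not e["page_is_deleted"]]
    return len({e["user"] for e in edits})

edits = [
    {"user": "a", "page_is_deleted": False},
    {"user": "b", "page_is_deleted": True},   # only edited a deleted page
    {"user": "a", "page_is_deleted": True},
]
print(count_editors(edits, exclude_deleted=False))  # 2 -- old methodology
print(count_editors(edits))                         # 1 -- new methodology
```

Under the old methodology user "b" would have been counted; under the new one they are not, which is the kind of small downward shift visible in the numbers.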
Correct (see https://druid.apache.org/docs/latest/tutorials/tutorial-delete-data.html, paragraph "How to permanently delete data"). We can also use API calls to mark segments as unused if we prefer not to use rules.
A better, more precise explanation:
- In order for data to be dropped from deep storage, it first needs to be unloaded from the historical nodes. This can be done in two ways: disabling a full datasource, or disabling segments using rules.
- Once segments are disabled, you can run a kill task to drop them.
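The two steps above can be sketched as coordinator/overlord API calls. This only builds the request payloads without sending them; the host, datasource name, and interval are made up, and exact endpoint paths vary between Druid versions, so treat this as an assumption-laden illustration rather than our actual setup.

```python
# Sketch of the two-step drop described above (hypothetical cluster values).
import json

COORDINATOR = "http://druid-coordinator:8081"   # made-up host
DATASOURCE = "edits_hourly"                     # made-up datasource

# Step 1: mark segments in an interval as unused, which unloads them
# from the historical nodes (endpoint path per recent Druid docs).
mark_unused = {
    "method": "POST",
    "url": f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}/markUnused",
    "body": {"interval": "2019-01-01/2019-02-01"},
}

# Step 2: submit a kill task to delete the now-unused segments
# from deep storage.
kill_task = {
    "type": "kill",
    "dataSource": DATASOURCE,
    "interval": "2019-01-01/2019-02-01",
}

print(json.dumps(kill_task, indent=2))
```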
Given the need to use rules to disable segments on the historicals, I'd rather keep as much data as possible in Hadoop (no storage issue so far).
@Nuria: We set it up this way on purpose, to make it easy to reload data into Druid if needed (data stays in deep storage for 60 days) while still saving space on Druid.
Having agreed we should keep 1 month of data in Druid, I still recommend using rules to unload data after 1 month and keeping 60 days in deep storage, as 2 months would mean 2 TB per server in Druid, probably too much.
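The retention proposed above (1 month loaded in Druid, older data dropped from historicals while deep storage keeps its own 60 days) can be expressed with Druid load/drop rules. A minimal sketch, assuming the default tier and a replication factor of 2, both of which are guesses rather than our actual config:

```python
# Hypothetical Druid retention rules: keep the last ~30 days loaded,
# drop everything older from the historicals. Rules are evaluated in
# order, so dropForever only applies to segments not matched above.
rules = [
    {"type": "loadByPeriod", "period": "P30D",
     "tieredReplicants": {"_default_tier": 2}},
    {"type": "dropForever"},
]
```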
Hi @Samat, thanks for reaching out.
It would help if you could upload the files again, and also confirm the URL you downloaded the data from, as my tests/checks don't show differences that big.
I have checked the number of editors (users only) for huwiki over 4 years, looking for differences across our last 3 snapshots (we call the monthly recomputations snapshots). While there is a very small deletion drift (a difference due to pages being deleted, as they are excluded from statistics computation), it is nowhere near a 5%/10% change: more like -0.05% to -0.10%, and only for the 3/4 months before the last one.
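For reference, the drift check above is just a relative difference between snapshots. A minimal sketch with illustrative numbers (the counts below are invented, not actual huwiki figures):

```python
# Relative difference (in percent) of an editor count between two
# monthly snapshots of the same month.
def drift_pct(old_count, new_count):
    return (new_count - old_count) / old_count * 100

# Illustrative: a deletion drift of about -0.05%, far from a -5% change.
print(round(drift_pct(20000, 19990), 2))  # -0.05
```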
Fri, Jun 21
Tue, Jun 18
Mon, Jun 17
We can easily get data for older days if needed (we don't drop statistics data).
Fri, Jun 14
Hi @Nuria - Can you confirm the above request is correct for generating the data?
Thu, Jun 13
Tue, Jun 11
I found a workaround URL:
Mon, Jun 10
Thanks for offering @Neil_P._Quinn_WMF :)
I'm still working on changing the algorithm, so nothing is needed from you as of now.
I'll let you know once I have a test dataset.
Sat, Jun 8
Issue pinpointed in the new TransformFunction applied to drop non-MediaWiki data: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/TransformFunctions.scala#L105
@GoranSMilovanovic: You're welcome :) At some point I'll manage to get that productionized ;)
Fri, Jun 7
The Spark driver is not launched from the notebook but from the kernel, and its configuration cannot be updated on the fly, so I'm not surprised it doesn't work.
The solution is to bump driver-memory at the kernel level (see my ping to Andrew and Luca in the previous comment).
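For illustration, bumping driver memory at the kernel level could look like the following Jupyter `kernel.json` fragment. This is a sketch under assumptions: the kernel name, paths, and the 4g value are hypothetical, and the exact mechanism depends on how the PySpark kernels are wired up on our notebook hosts.

```json
{
  "display_name": "PySpark (4g driver)",
  "language": "python",
  "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "env": {
    "PYSPARK_SUBMIT_ARGS": "--master yarn --driver-memory 4g pyspark-shell"
  }
}
```

Since the driver is started when the kernel launches, the memory setting has to be baked in here; setting it from inside a running notebook comes too late.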
I have reproduced the error. I think the problem comes from driver memory: I was able to make the computation succeed for 1 day in a Python notebook, and for 1 month on the CLI with higher driver memory.
Issue found by manually testing DataFrameToHive (I added logging and created a small class using DataFrameToHive to test) at this line: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-spark/src/main/scala/org/wikimedia/analytics/refinery/spark/connectors/DataFrameToHive.scala#L234
Thu, Jun 6
May 23 2019
May 21 2019
Following your steps, I confirm I see the same problem you do.
Thanks a lot for reporting @Formatierer!
May 20 2019
Hi @Formatierer - While I definitely see the snapshot, I can't reproduce the issue on Wikistats :(
May 17 2019
NO WAY !!!! I'm super sorry for having derailed that :(
May 16 2019
A lot trickier :)
We have the wmf_raw.mediawiki_private_cu_changes table in Hive, allowing us to compute geo-editors (editors by country, aggregated). This table only contains 3 months of data, for PII-removal reasons. It's probably not enough for what you're after, but I have nothing better (see https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/geoeditors/monthly/insert_geoeditors_monthly_data.hql for an example).
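The aggregation the linked HQL performs boils down to counting distinct editors per country. A minimal sketch with a deliberately simplified schema (the `user`/`country` field names and the sample rows are invented for illustration, not the actual cu_changes columns):

```python
# Hypothetical geo-editors aggregation: distinct editors per country.
from collections import defaultdict

rows = [
    {"user": "a", "country": "DE"},
    {"user": "b", "country": "DE"},
    {"user": "a", "country": "DE"},  # same editor again, counted once
    {"user": "c", "country": "FR"},
]

editors_by_country = defaultdict(set)
for r in rows:
    editors_by_country[r["country"]].add(r["user"])

print({k: len(v) for k, v in editors_by_country.items()})  # {'DE': 2, 'FR': 1}
```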
I've just created T223444 to submit the general idea of having geo-editors stats split by desktop/mobile.
May 14 2019
May 13 2019
May 7 2019
spark.sql("select uri_host, uri_path, uri_query from wmf.webrequest where webrequest_source = 'text' and year = 2019 and month = 5 and day = 6 and hour = 16 and is_pageview and pageview_info['project'] = '15.wikipedia'").show(10, false)
May 6 2019
Here are the faulty lines:
spark.sql("select uri_host, uri_path, uri_query from wmf.webrequest where webrequest_source = 'text' and year = 2019 and month = 4 and day = 29 and hour = 6 and is_pageview and pageview_info['project'] = '15.wikipedia'").show(10, false)
A manual fix has been applied to 2018 jobs.
May 3 2019
Hi @Ladsgroup - I'm extremely sorry for not having taken the time to answer you faster :(
I've quickly tested your patch and it seems to work.
I ran it on fawiki and frwiki to compare the proportions of link vs other-* link types:
It looks super good :)
Merging for a deploy next week.