Jul 3 2019
Jul 1 2019
We have not implemented the proposal defined here for the page-create event timestamp definition. I'll let @Milimetric explain (either here or in a sync-up meeting; it might be easier face to face).
This is solved from snapshot 2019-05 onward thanks to the rebuild of the page-history reconstruction algorithm:
Improved greatly by the last page-history reconstruction refactor:
This is solved from snapshot 2019-05 onward.
- The page_first_edit_timestamp is the field containing the value of interest, not page_creation_timestamp, as the latter reflects the timestamp of the first create event. Most of the time they are equal, but they can differ for pages with complicated histories involving deletes and restores.
- The page_first_edit_timestamp is not always equal to the timestamp of the revision having parent_page_id = 0, as the dataset also uses archived revisions (therefore the first revision can be an archived one), and because complex histories can lead to multiple revisions having parent_page_id = 0 in a page's history (illustrated in the sketch after this list).
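A minimal pyspark sketch of the distinction (assuming a spark session and the wmf.mediawiki_history table; the snapshot value is illustrative):

    # Pages whose first edit differs from their recorded creation event,
    # e.g. after delete/restore cycles or imported revisions.
    spark.sql("""
        SELECT page_id, page_creation_timestamp, page_first_edit_timestamp
        FROM wmf.mediawiki_history
        WHERE snapshot = '2019-05'
          AND event_entity = 'page'
          AND event_type = 'create'
          AND page_first_edit_timestamp != page_creation_timestamp
    """).show(10, False)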
Jun 28 2019
Jun 27 2019
Results confirmed after page-history algorithm refactor. Marking as done :)
Some information in that respect is provided as part of T221825 with the new field page_is_from_before_page_creation. But this is incomplete, as it only accounts for imports that happened before the page creation, not after.
I haven't had time to fix this in this batch of changes. Keeping it in the backlog of things to do for mediawiki_history.
Actually I haven't had time to tackle this issue in this round of changes, sorry about that :(
Keeping the task in the backlog of things to do for mediawiki-history.
Done! Sorry for the delay.
Jun 25 2019
Thanks a lot @Samat for the details.
Indeed you were right: the difference is due to a methodological change. I'm sorry I didn't notice right away.
From the month 2019-05 onward, we have changed the way editors are computed by removing edits on deleted pages.
We did this to be more homogeneous, as other metrics (edits and edited-pages for instance) were already computed with deleted edits removed.
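A hedged sketch of what the new computation amounts to (assuming the page_is_deleted flag of wmf.mediawiki_history is how deleted pages are identified; the exact production logic may differ, and the snapshot/wiki values are illustrative):

    # Monthly distinct editors, excluding edits made on since-deleted pages.
    spark.sql("""
        SELECT substr(event_timestamp, 1, 7) AS month,
               COUNT(DISTINCT event_user_id) AS editors
        FROM wmf.mediawiki_history
        WHERE snapshot = '2019-05'
          AND wiki_db = 'huwiki'
          AND event_entity = 'revision'
          AND event_type = 'create'
          AND NOT page_is_deleted   -- the deleted-edits removal
        GROUP BY substr(event_timestamp, 1, 7)
    """).show(12, False)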
Jun 24 2019
Correct (see https://druid.apache.org/docs/latest/tutorials/tutorial-delete-data.html, paragraph "How to permanently delete data"). We can also use API calls to mark segments as unused if we prefer not using rules.
With a better/more precise explanation:
- In order for data to be dropped from deep storage, it first needs to be unloaded from the historical nodes. This can be done in 2 ways: disabling a full datasource, or disabling segments using rules.
- Once segments are disabled, you can run the kill task to drop them (sketched below).
Given the need to use rules to disable segments on historicals, I'd rather keep the maximum of data in Hadoop (no storage issue so far).
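A hedged sketch of the two-step flow using Druid's HTTP APIs (hosts, ports, datasource name, and interval are all placeholders; rule-based disabling would replace step 1):

    import requests

    COORDINATOR = "http://druid-coordinator:8081"  # placeholder host:port
    OVERLORD = "http://druid-overlord:8090"        # placeholder host:port
    DATASOURCE = "example_datasource"              # placeholder datasource

    # Step 1: disable the full datasource so historicals unload its segments
    # (drop rules are the alternative when only some segments should go).
    requests.delete(f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}")

    # Step 2: submit a kill task to permanently delete the now-disabled
    # segments from deep storage over the given interval.
    kill_task = {
        "type": "kill",
        "dataSource": DATASOURCE,
        "interval": "2019-01-01/2019-03-01",
    }
    requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=kill_task)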
@Nuria: We set it up this way on purpose, in order to facilitate loading data into Druid in case it is needed (data present in deep storage for 60 days) while still keeping space on Druid.
Having agreed we should keep 1 month of data in Druid, I still recommend using rules to unload data after 1 month and keeping 60 days in deep storage, as 2 months means 2 TB per server in Druid, probably too much.
Hi @Samat, thanks for reaching out.
It would be interesting if you could upload the files again, and also possibly confirm the URL you downloaded data from, as my tests/checks don't show differences that big.
I have checked the number of editors (registered users only) for huwiki over 4 years, looking for differences across our last 3 snapshots (we call the monthly recomputations snapshots). While there is a very small deletion drift (a difference due to pages being deleted, as those are excluded from statistics computation), it is really not a 5%/10% change, more like -0.05% to -0.10%, and only for the 3-4 months before the last one.
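The check is easy to reproduce; a hedged pyspark sketch (the snapshot names and the anonymous-user filter are assumptions about the exact metric definition):

    # The same editor count computed from two snapshots, to surface drift.
    spark.sql("""
        SELECT snapshot,
               substr(event_timestamp, 1, 7) AS month,
               COUNT(DISTINCT event_user_id) AS editors
        FROM wmf.mediawiki_history
        WHERE snapshot IN ('2019-04', '2019-05')
          AND wiki_db = 'huwiki'
          AND event_entity = 'revision'
          AND event_type = 'create'
          AND NOT event_user_is_anonymous
        GROUP BY snapshot, substr(event_timestamp, 1, 7)
        ORDER BY month, snapshot
    """).show(100, False)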
Jun 21 2019
Jun 18 2019
Jun 17 2019
We can easily get data for older days if needed (we don't drop the statistics data).
Jun 14 2019
Hi @Nuria - Can you confirm the above request is correct for generating the data?
Jun 13 2019
Jun 11 2019
I found a workaround url:
Jun 10 2019
Thanks for offering @Neil_P._Quinn_WMF :)
I'm still working on changing the algorithm, so nothing is needed from you as of now.
I'll let you know once I have a test dataset.
Jun 8 2019
Issue pinpointed in the new TransformFunction applied to drop non-mediawiki data: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/TransformFunctions.scala#L105
@GoranSMilovanovic: You're welcome :) At some point I'll manage to get that productionized ;)
Jun 7 2019
The Spark driver is not launched from the notebook but from the kernel, and its configuration is not updatable on the fly, so I'm not surprised it doesn't work.
The solution is to bump driver-memory at the kernel level (see my ping to Andrew and Luca in the previous comment).
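For illustration, a hedged sketch of why the setting must exist before the driver JVM starts (the 8g value is arbitrary; with a preconfigured pyspark kernel, the variable has to be set in the kernel definition itself rather than in the notebook):

    import os

    # Must be set before the SparkSession (and thus the driver JVM) exists.
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 8g pyspark-shell"

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()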
I have reproduced the error. The problem comes from driver memory, I think. I have been able to make the computation succeed for 1 day in a Python notebook, and for 1 month on the CLI with higher driver memory.
Issue found by manually testing DataFrameToHive (I added logging and created a small class using DataFrameToHive to test) on that line: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-spark/src/main/scala/org/wikimedia/analytics/refinery/spark/connectors/DataFrameToHive.scala#L234
Jun 6 2019
May 23 2019
May 21 2019
Following your steps, I confirm I have the same problem you do.
Thanks a lot for reporting @Formatierer!
May 20 2019
Hi @Formatierer - While I definitely see the snapshot, I can't reproduce on wikistats :(
May 17 2019
NO WAY!!!! I'm super sorry for having derailed that :(
May 16 2019
A lot trickier :)
We have the wmf_raw.mediawiki_private_cu_changes table in Hive, allowing us to compute geo-editors (editors by country, aggregated). This table only contains 3 months of data for PII-removal reasons. It's probably not enough for what you're after, but I have nothing better (see https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/geoeditors/monthly/insert_geoeditors_monthly_data.hql for an example).
I've just created T223444 to submit the general idea of having geo-editors stats split by desktop/mobile.
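For a feel of the data, a hedged sketch of a simple aggregation on that table (the month partition format, the cuc_user > 0 registered-user filter, and the column names are assumptions based on MediaWiki's cu_changes schema; the real per-country split needs the geocoding done in the linked HQL):

    # Distinct registered editors per wiki over one month of cu_changes data.
    spark.sql("""
        SELECT wiki_db, COUNT(DISTINCT cuc_user) AS distinct_editors
        FROM wmf_raw.mediawiki_private_cu_changes
        WHERE month = '2019-04'
          AND cuc_user > 0
        GROUP BY wiki_db
    """).show(10, False)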
May 14 2019
May 13 2019
May 7 2019
spark.sql("select uri_host, uri_path, uri_query from wmf.webrequest where webrequest_source = 'text' and year = 2019 and month = 5 and day = 6 and hour = 16 and is_pageview and pageview_info['project'] = '15.wikipedia'").show(10, false)