Fri, Nov 15
Thanks for creating this task @CCicalese_WMF! I completely forgot about the reruns.
I already deleted the data that needs to be rerun. Reportupdater will start computing the missing data in the next hour.
As there are lots of data points to rerun and the queries are non-trivial, this computation might take a few days, possibly one week.
Let's close this task only when all data is visible and OK in the Dashboard :]
Thu, Nov 14
I'm trying to match a first draft of the traffic_per_country metric with the outage data that you put together.
Also, it would be great to have false positive examples to do the same and see how the metric behaves in such cases.
Do you have any false positive examples that I can use?
Thanks a lot!
Tue, Nov 12
I added a first draft of the metric to the data quality pipeline (see patch above), and added a chart to the data quality dashboard in Superset.
Fri, Nov 8
Hi all! Here's one idea that might reduce false positives when there are traffic peaks for any reason.
Yes, @Nuria, the data starts on the 17th of August, so we can merge at the end of next week. Or on Monday the 18th? There's a better chance of having an ops person to merge the change.
Thu, Nov 7
It's still backfilling; it will take a couple of days.
Here's the data quality dashboard in Superset:
Tue, Nov 5
EventLogging data was first enabled in Hive on 2017-11-20T19:00:00Z.
I believe that is why we have partial data for 2017; the proportion seems to match.
We could theoretically drop data with timestamp >= '20171120190000'.
Mon, Nov 4
Change proposal: Remove the lists from https://github.com/wikimedia/analytics-refinery/blob/master/bin/refinery-drop-mediawiki-snapshots and pass them as parameters.
Having this would allow us to have different jobs for different retention times.
Yes, definitely :]
This retention period feels small!! Is the deletion scheme deleting more than expected?
Fri, Nov 1
The reports in stat1006 are the ones created by querying MySQL.
The new, updated ones are in stat1007, together with all the other Hive reports, and they already have systemd timers in puppet running them.
I left the old report files in stat1006 for safety. But if you think it's confusing or @CCicalese_WMF prefers to drop them, I can delete them.
Wed, Oct 30
I think the rlike comparison does not require the matched string to start with the pattern, unless you use ^.
The like comparison, however, does.
So it could be that the extra counts in the rlike results are records whose revision text has something before #redirect.
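For illustration, here is a hedged HiveQL sketch of the difference; the table and column names (wmf.mediawiki_wikitext_current, revision_text) and the pattern are assumptions, not the exact queries from the comparison:

-- LIKE must match the whole string, so '#redirect%' only counts texts that start with #redirect;
-- RLIKE matches anywhere in the string unless anchored with ^.
SELECT
  SUM(CASE WHEN lower(revision_text) LIKE  '#redirect%' THEN 1 ELSE 0 END) AS like_starts_with,
  SUM(CASE WHEN lower(revision_text) RLIKE '#redirect'  THEN 1 ELSE 0 END) AS rlike_anywhere,
  SUM(CASE WHEN lower(revision_text) RLIKE '^#redirect' THEN 1 ELSE 0 END) AS rlike_anchored
FROM wmf.mediawiki_wikitext_current
WHERE snapshot = '2019-09';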
Tue, Oct 29
sanitize data in the log databases
This shouldn't be necessary, right? IIUC, data should be sanitized already?
I need to check with @mforns, but my understanding is that the last 90d of events are not sanitized yet.
Mon, Oct 28
@mforns it looks like the subtasks we had for this have all been resolved; can this be closed?
Thu, Oct 24
The white-list that we use as the single source of truth for which EL data gets sanitized and kept indefinitely is:
There you can maybe find your schemas by name?
I also added the MediaWiki version 1.32 to the PHP drilldown Dashboard, which was missing:
All data from MySQL:log.MediaWikiPingback_* is now present in Hive:event_sanitized.mediawikipingback.
Also, all pingback reportupdater-queries have been completely migrated to Hive and tested.
Oct 17 2019
@srishakatux This is the puppet patch.
When it gets merged by one of our ops people, RU will start to execute your job periodically.
You'll find the resulting report in stat1007.eqiad.wmnet:/srv/reportupdater/output/metrics/wmcs.
Remember the first data-point corresponds to 2019-10-01, so it will be calculated on 2019-11-02 (1 day lag).
Oct 16 2019
Sorry @srishakatux I didn't notice in the code review that the report name in the config does not match the file name of the script.
They should match for RU to work. Could you please fix that in a new patch?
Thanks a lot.
The docs are here: https://wikitech.wikimedia.org/wiki/Analytics/Data_quality/Entropy_alarms
Moving task to Done.
I will then copy the data over to Hive and rerun the queries to correct inconsistencies in the reports and dashboards.
I will also create a task regarding the other thoughts we discussed.
Thanks a lot for the inputs!
Totally understand your point.
If you'd be fine with the solution in my prior comment, I could apply it to the MySQL database for the 2017 data, then copy the reports over to stat1007 and complete them by querying Hive.
This way you'd have complete data since 2017-04 and we'd solve the query performance caveat. Plus, it would be easier for us in Analytics to migrate it that way, rather than importing MySQL data into Hive.
What do you think?
Oct 15 2019
Ok, makes sense. I guess this scenario, where we want to calculate secondary data quality metrics that help troubleshoot, or we want to first plot some new data quality metrics before deciding whether to alarm on them or not, is going to happen again. So the data quality pipeline should probably have an option to activate/deactivate alarming on given metrics. I will take this into account when implementing that.
In the meantime, as the calculation of the useragent entropy metrics is finished, I will move this task to done.
The trends over time are one of the important aspects of the pingback data. It would be good to import the missing data from 2017-04 to 2017-11 into Hive so we have coverage from the beginning of the period in which we began to collect that data.
@Nuria Should I remove the rest of the useragent metrics from the data quality Oozie pipeline then?
I have deleted all data directories and Hive partitions for event_sanitized.helppanel up to Oct 1st 2019 (not included).
I checked that the table looks good, but please ping us if you find any inconsistency.
The deleted data will stay in Hadoop's trash folder for a couple weeks, in case you want to recover something, then will be automatically deleted.
I still need to post the results of the proof of concept to Wikitech.
What's missing in this task is to decide which of the 4 metrics (one or more) considered for the proof of concept of entropy-based quality metrics we are going to choose to productionize as a data quality alarm.
Oct 11 2019
@CCicalese_WMF Hi :]
Note that the pingback reports are still bad after the migration to Hive.
But not for scheduling reasons; rather, the queries are still not doing what the MySQL ones did before.
Working on that right now.
This is normal behavior I think.
In reportupdater, weekly reports span weeks starting on Sunday 00:00:00 and ending the following Saturday 23:59:59.
Also, the data-point receives the label of the start date of the week, meaning 2019-09-29 is the week that starts on that date and ends at 2019-10-05T23:59:59.
The subsequent week (2019-10-06T00:00:00 -> 2019-10-12T23:59:59) has not been calculated yet, because it has not completely passed.
Also, reports usually have a lag of a couple of hours or one day, to give time for the collected data to reach the database.
It is likely that the reports will update early on Monday 2019-10-14.
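As a hedged illustration (the table name is an assumption and this is not the actual report query), the data point labeled 2019-09-29 would be computed roughly like this:

-- The row is labeled with the week's start date (2019-09-29) and covers
-- Sunday 00:00:00 through the following Saturday 23:59:59.
SELECT
  '2019-09-29' AS `date`,
  COUNT(*)     AS events
FROM event_sanitized.mediawikipingback
WHERE dt >= '2019-09-29T00:00:00Z' AND dt <= '2019-10-05T23:59:59Z';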
Oct 10 2019
- The docs say: “The first column must be equal to start_date parameter (consider naming it date). This is an unnecessary limitation and might be removed in the future...”. Do I need to follow this?
I don't know if the limitation is still valid or not - I have looked at some example queries (see repo in my next line) and the first column is always date ... @mforns can you tell us more?
Yes, this limitation is still there. The first column of the results should be the date in YYYY-MM-DD format.
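As a hedged sketch of the expected result shape (the table is an assumption and the placeholder names are illustrative, not necessarily the exact reportupdater syntax), a query could look like:

-- The first output column must be the date in YYYY-MM-DD format;
-- the placeholder names below are illustrative only.
SELECT
  SUBSTR(dt, 1, 10) AS `date`,
  COUNT(*)          AS events
FROM event_sanitized.mediawikipingback
WHERE dt >= '{from_timestamp}' AND dt < '{to_timestamp}'
GROUP BY SUBSTR(dt, 1, 10);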
Oct 3 2019
I think the query that defines this quality metric should just forward the absolute value of pageviews per country.
I think normal+stddev is not needed there, because the anomaly detection algorithm should take care of that, no?
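For clarity, here is a minimal HiveQL sketch of "just forwarding the absolute value"; the table and column names (wmf.pageview_hourly, country, view_count) are assumptions about the source data, not the actual metric query:

-- Forward raw pageview counts per country; no mean/stddev normalization here,
-- since the downstream anomaly detection step is expected to handle that.
SELECT
  country,
  SUM(view_count) AS pageviews
FROM wmf.pageview_hourly
WHERE year = 2019 AND month = 10 AND day = 3
GROUP BY country;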
Sep 23 2019
@ArielGlenn thanks for chiming in!
Eventually the data will be accessible via datasets.w.o or dumps.w.o, right?
Yes, from dumps.w.o
Sep 20 2019
@Ottomata Hm indeed...
Specifying sanitization for the whole map can be dangerous, because a non-sensitive map marked as 'keep' could have a sensitive field added to it in the future, and that field would be 'kept' by default.
How will it work with the geocoded_data and user_agent fields? Are they structs or maps in MEP? Some EL schemas do specify keeping parts of them.
We could maybe do:
geocoded_data:
  type: object
  additionalProperties:
    type: string
  annotations:
    sanitize:
      country: keep
      state: hash
      blah:
        foo: keep
        bar: hash
    olap_properties: [country:dimension, state:dimension]
Reviewed https://gerrit.wikimedia.org/r/538235 and the fix makes sense to me!
There is a call to os.system returning a value, but I checked the Python docs and it seems that both python2 and python3 return exactly the same thing for os.system.
I'm not sure about the underlying magic, but what I've seen in other similar cases of data sets that need rsync and publication on analytics.wikimedia.org is here:
I have added an analogous code block to that file pointing to the MediaWiki history dumps.
I believe the source directory is mounted in stat1007, see: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/dumps/distribution/datasets/fetcher.pp#L18
I know miscdumpsdir is /srv/dumps/xmldatadumps/public/other, but I'm not sure on which host. Shouldn't it be Thorium, from where analytics.wikimedia.org is served?