Wed, Mar 25
@Aklapper yes, this task should be resolved. Thanks!
Mon, Mar 23
@EBernhardson thanks a lot for all explanations!
Hi @EBernhardson! As we discussed in IRC, this is the task I was mentioning.
I'd like to access your Airflow instance to be able to test our job, could you help me with that?
I couldn't find any docs in Wikitech or Meta apart from https://wikitech.wikimedia.org/wiki/Discovery/Analytics.
I believe the code for that goes here, right? https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia/discovery/analytics
Wed, Mar 18
Luca has made it possible to use Superset with Presto and Kerberos, so now we can plot all data quality metrics (and all other tables) with Superset.
Mon, Mar 16
We've deployed the patch now.
It has already started to crunch data starting at 2020-01-01.
It will take a couple hours to backfill up to today.
Wed, Mar 11
Yes, if we choose to hash some fields in EditAttemptStep, we should also hash all identifiers that could be used to "bridge the quarterly salt change".
BTW, I've tried to explain that many times and was much less effective than Neil! The sentence "bridge the quarterly salt change" is great!
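For anyone reading along, here's a minimal Python sketch of the idea. The field names and quarterly salts are hypothetical (the real EditAttemptStep fields and salt-rotation mechanics may differ); the point is just why every identifier must be hashed, not only some.

```python
import hashlib
import hmac

def hash_identifier(identifier: str, salt: str) -> str:
    """Keyed hash of an identifier. The same salt yields the same hash,
    so events within one salt period can be correlated, but not across
    periods once the salt rotates."""
    return hmac.new(salt.encode(), identifier.encode(), hashlib.sha256).hexdigest()

# Hypothetical quarterly salts.
q1_salt = "salt-2020-Q1"
q2_salt = "salt-2020-Q2"

session = "session-abc123"  # hypothetical identifier value

# Same quarter: hashes match, events can be joined.
assert hash_identifier(session, q1_salt) == hash_identifier(session, q1_salt)

# Across the rotation: hashes differ, so this field cannot "bridge the
# quarterly salt change". Any identifier left *unhashed* would still be
# stable across quarters and reintroduce that bridge -- hence hashing all of them.
assert hash_identifier(session, q1_salt) != hash_identifier(session, q2_salt)
```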
Fri, Mar 6
Option 3 sounds great to me (given that @Krinkle approves potential performance issues).
I believe it's definitely the simplest and most natural from both our perspective and the instrumentation developer perspective.
I also like the approach for the produce() public interface.
Thu, Mar 5
OK, the problem is fixed now!
I'm sorry we're getting so many issues (even if small) with this dashboard.
I changed the showLastDays offset of the sunburst chart to 64, so it can always show some data.
Now, there's another problem, not related to your reports or dashboard:
We've recently moved the reports to another host (stat1007.eqiad.wmnet -> an-launcher1001.eqiad.wmnet).
I checked in the new host, and the reports are updated fine!
So, it seems that the rsync to https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/wmcs/ is not working.
Will look into that.
OK then, I modified the change to just include additions to the pageview_hourly datasource in Druid.
I tested by loading a test datasource (both with hourly and daily jobs), and everything seems to work.
So I opened the Gerrit change for final code review.
Wed, Mar 4
@cchen Hi! I finished the code for this task and I'm testing the loading of the corresponding data. I see that virtualpageview_hourly has no project families other than wikipedia, and no namespaces other than main. Is the addition of project_family and namespace data still useful there?
One thought re. code complexity and people having difficulties using the code from the product perspective:
Mon, Mar 2
Fri, Feb 28
Oh! Didn't know about the heartbeat pingback.
That is great, looking one month back would be completely fine of course.
Is the heartbeat pingback issued at a fixed date (say, the first of the month) for all active wikis?
Or does the date depend on the install timestamp, or maybe other factors?
Feb 27 2020
Hmm, all queries have the same first step, which is to isolate the last ping from each wiki. Only the last ping is considered for the calculations of new data points.
It is true that, if we delete the data as proposed, we'll not be able to use the table for retroactive queries (say accumulate values until 2018-06-12).
But that's why I suggested copying the whole original data to a backup location before purging event_sanitized.mediawikipingback.
We can even create a table on top of the backup data to allow for retroactive queries there.
Regarding new data points, I think calculations will work even with only last pings for each wiki, no?
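To illustrate the point (this is a sketch in Python, not the production Hive query, and the field names wiki/timestamp are assumptions): isolating the last ping per wiki is a simple keep-latest-per-key step, and the surviving rows are all the "new data points" calculations need.

```python
# Hypothetical pingback events; in reality these live in
# event_sanitized.mediawikipingback.
pings = [
    {"wiki": "enwiki", "timestamp": "2020-01-10", "version": "1.34"},
    {"wiki": "enwiki", "timestamp": "2020-02-10", "version": "1.35"},
    {"wiki": "dewiki", "timestamp": "2020-02-01", "version": "1.34"},
]

# Keep only the most recent ping per wiki -- the shared first step
# of all the queries.
last_ping = {}
for ping in pings:
    current = last_ping.get(ping["wiki"])
    if current is None or ping["timestamp"] > current["timestamp"]:
        last_ping[ping["wiki"]] = ping

# Only the latest state of each wiki survives.
assert last_ping["enwiki"]["version"] == "1.35"
assert len(last_ping) == 2
```

Purging the table down to exactly this set is what loses the retroactive "accumulate until date X" queries, which is why the backup copy matters.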
Feb 26 2020
I think this might be caused by the following patch, which I +2'd (my bad):
The error that is popping up in the test is a ValueError, but the one expected in the test was changed to RuntimeError.
I think ValueError does not inherit from RuntimeError, it inherits directly from Exception.
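A quick Python sanity check confirms the hierarchy, and shows why the test breaks: an except clause for RuntimeError will not catch a raised ValueError.

```python
# ValueError is a direct subclass of Exception, not of RuntimeError.
assert not issubclass(ValueError, RuntimeError)
assert issubclass(ValueError, Exception)
assert ValueError.__bases__ == (Exception,)

# So a handler (or a test's "expected exception") set to RuntimeError
# lets a ValueError fly right past it.
try:
    raise ValueError("boom")
except RuntimeError:
    caught = "RuntimeError"
except ValueError:
    caught = "ValueError"
assert caught == "ValueError"
```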
One thing we could do, as suggested by Dan, is to purge event_sanitized.mediawikipingback by deleting all events that are not the latest state of a given wiki (remove all but the last pingback per wiki). And we could still keep un-purged data in a backup if needed.
This would reduce the size of the table, by my calculations, by a factor of 20 approx. All queries would continue to work and the results should be still the same.
Of course, this would not be a permanent solution, in 1 or 2 years we'd have to repeat this operation to purge the table from "unused" events again or we'd see the same issues.
So maybe this could be a 'quick' solution that would populate the dashboard temporarily and would give us all a couple months to rethink the pingback pipeline?
Feb 25 2020
I recreated tables plus repaired partitions of event_sanitized.helppanel, event_sanitized.homepagemodule and event_sanitized.homepagevisit.
This eliminated loose Hive partitions that might have been left there mistakenly when deleting the data manually.
This should solve any issues when querying from Hive or Presto.
@nettrom_WMF, I already tested with some queries in Hive and Presto that this works, but please confirm that all works for you.
Feb 24 2020
Hmmmm... yes it could be.
Not sure if this is due to manual deletion or to the deletion jobs that were set up.
I think it's more likely to be because of manual deletions.
Feb 21 2020
First: name of file & class & singleton
I think mw.eventStreams is cool!
Feb 20 2020
I believe this is done!
Feb 17 2020
Feb 13 2020
Hehe, no problem, I'm also constantly checking these in my brain, especially with weekly reports.
@srishakatux no problemo! Yes, the month of February is still not over, so reportupdater will wait until then to calculate the corresponding metrics.
I checked in https://analytics.wikimedia.org/datasets/periodic/reports/metrics/wmcs/ and all files are now complete with latest data.
The dashboard is still showing old data for me, but it must be a caching problem that will resolve itself in a bit.
I checked the report files, the logs, and the data in edit_hourly.
I believe the problem was that the mediawiki_history data set was not available on 2 Feb 2020,
and thus the edit_hourly table was not yet populated with January data,
and then the reports failed.
Looking into this now
Feb 7 2020
Thanks a lot @jlinehan for refactoring the older task and putting together this one.
Feb 6 2020
I pushed the code that I had, for when we resume this task, see above.
Feb 5 2020
Feb 4 2020
Cool, moving this to Ready to Deploy.
If I'm not the one deploying, remember there are deployment instructions in the etherpad:
Jan 23 2020
Jan 22 2020
Thanks for the explanation Joseph.
Working on this right now.
Jan 15 2020
- I see a new feature in 0.35.2: filter labels in the chart views of a dashboard. They tell you which params you can alter and when you click on them, they point you to the corresponding control. And they added some coloring. Seems cool!
- They have changed the top menu order and the icon of the "Manage" option in the menu has disappeared.
- Apart from this, couldn't find anything that is broken or different.
Jan 14 2020
Maven tree shows 4 different versions of guava:
- 11.0.2 Used by json-schema-core.jackson-coreutils and CDH5.hadoop-common
- 12.0 Used by reflection
- 16.0.1 Used by hadoop-common.hadoop-auth.apache-curator and uri-template
- 18.0 Specified in refinery-core's pom.xml
Of all those versions, the only one that does not implement the com.google.common.base.Stopwatch.<init>() constructor is 18.0.
If I understand it correctly, the version compiled and included in the jar is 18.0 ([INFO] +- com.google.guava:guava:jar:18.0:compile).
Jan 13 2020
The metric is there, but maybe not directly visible:
You have to select the "editors" metric, and then enable the split by "activity level".
Jan 10 2020
Jan 2 2020
We enabled the deletion of the data for the 3 specified schemas: HelpPanel, HomepageVisit, HomepageModule.
No data has been deleted yet because all events are still less than 270 days old.
So, provided you have everything you want to keep in the sanitization white-list, I guess this task can be marked as done!
@elukey There was no task, because this was treated as part of the initial task to develop the data quality metrics.
The fix we applied was to bump the oozie_spark_lib property to spark-2.4.4. And it seemed to reduce the frequency of this problem!
But it turns out it's still there.
Dec 23 2019
Dec 19 2019
Dec 18 2019
I looked at whether the entropy UDAF indicates any unexpected changes in the webrequest fields that are used in webrequest.load.
=== USER AGENT ===
hour  entropy
11    10.54753394241449
12    10.56885036456681
13    10.494309570202567
14    10.475783511181957
15    10.41000334885259
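For context on what those numbers mean, here's a small Python sketch of the metric (the actual UDAF runs in Hive; this is just the same Shannon entropy computed locally). A field whose value distribution suddenly collapses, e.g. one user agent dominating, would show up as a sharp drop, whereas the stable hourly values above suggest nothing unexpected.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy in bits of a list of categorical values --
    the kind of per-field metric the entropy UDAF produces."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A uniform distribution over 4 user agents gives exactly 2 bits.
assert shannon_entropy(["a", "b", "c", "d"]) == 2.0

# A collapse in diversity (one agent at 97%) drops the entropy sharply.
assert shannon_entropy(["a"] * 97 + ["b", "c", "d"]) < 1.0
```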
Dec 9 2019
@Nuria is this something we want to tackle next year?