Mon, May 20
Fri, May 17
Idea looks fine, but I do not think it will be wise to change naming at this stage.
I think changing naming now will indeed be a bit of work (database, jobs, coordinating the deployment, documentation, notifying people, etc.).
And it's likely that we'd mess something up and have to do backfilling and such.
But I think the advantages of this approach are also significant:
- No need to blacklist data sets from the deletion-after-90-days script (which is dangerous: if the blacklisting fails, non-EL data could be deleted from the event database).
- No need to blacklist data sets from the sanitization process (not as dangerous, but avoidable).
- Better organization of the data, which would make it easier to delete-after-90-days and sanitize more data sets, and would avoid confusion overall.
I believe, as Andrew suggested once, that instead of having an "event" db and an "event_sanitized" db, we should have an "event_unsanitized" db and an "event" db.
This way, event_unsanitized would contain only temporary data that will be deleted after 90 days, and the event db would contain the final (sanitized if necessary) data.
The data sets that we control and that do not need sanitization could be ingested directly into the final event database.
This would make everything easier.
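To illustrate the proposed flow (the table and field names here are hypothetical, just a sketch of the idea):

-- raw events land in the temporary db and get purged after 90 days;
-- the sanitization job then writes the final copy into the event db
INSERT OVERWRITE TABLE event.some_schema PARTITION (year, month, day)
SELECT wiki, event_action,  -- whitelisted fields only
       year, month, day     -- dynamic partition columns go last
FROM event_unsanitized.some_schema;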
Thu, May 16
We think this is the related change:
@Ottomata, this is done, right?
Wed, May 15
Added some docs on Wikitech:
Tue, May 14
The new datasource is available in Turnilo!
Please, have a look :]
Mon, May 13
Thu, May 9
Wed, May 8
When we use RU with Hive, we have to use a script instead of a plain query file.
That is because RU doesn't yet have a Hive client, so we use a bash script that calls hive -e "<query>".
The way RU passes dates (and other params) to the script is different from the way it passes them to SQL files.
In a nutshell, to add a date column in a Hive query (bash script) use:
SELECT ... '$1' AS date, ...
$1 is the first parameter that RU passes to the script, which is the date in question.
You can find this and other details in the RU documentation:
Also, take a look at this example of another Hive-based RU report:
You can basically copy the way hive is called (hive -e "..." 2> /dev/null | grep -v parquet.hadoop), and also the way $1 is used.
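Putting it together, a minimal sketch of such a script (the table and fields below are made up; only the calling pattern matters):

#!/bin/bash
# $1 is the date that RU passes as the first parameter
hive -e "
    SELECT
        '$1' AS date,
        wiki,
        COUNT(*) AS edits
    FROM some_db.some_table
    WHERE day = '$1'
    GROUP BY wiki
" 2> /dev/null | grep -v parquet.hadoop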
Tue, May 7
OK, I think I have some conclusions.
Mon, May 6
Thu, Apr 25
Apr 22 2019
We'll look at this task and prioritize it with the team this Thursday.
Can we do this in a hackathon?
Waiting for a schema registry, so we can implement this.
@Ottomata, this task can be closed, right? Because of the changes we're doing on EventGate.
Closing it. Please, reopen if I'm wrong.
@diego Hi! Is there anything additional for us (Analytics) here? Thanks!
Why do you think a list of most linked articles would be useful?
We can see the value of a list of top images by number of appearances, because image uploaders might be interested in that.
Can you elaborate? Thanks!
@Urbanecm Do you mean images used in articles?
Apr 17 2019
Apr 16 2019
Cool! Glad that you guys liked it.
Yes, I left user_tenure_bucket for the next iteration, as per Nuria's suggestion in the doc.
User_tenure_bucket was a bit more complex than the other fields, but I checked and I believe it's feasible.
Apr 12 2019
I found https://www.mediawiki.org/wiki/Extension:QuickSurveys,
and it explains that the code for the survey is loaded dynamically, so disabled JS is not the cause.
DNT is also not the cause, because when it's on, the surveys don't even show.
I've been looking into this for a bit.
Is there any documentation I can read on the flow of the surveys?
Does the user click a link on-wiki that opens a Google/Qualtrics form?
And when are the QuickSurveyInitiation and QuickSurveysResponses events triggered?
Apr 11 2019
@Amire80 I couldn't find any other task that refers to fixing the broken job.
Maybe it was in an email... or a conversation? I couldn't find them either.
We can use this task for that anyway, no?
Apr 9 2019
This is corrected now. See: https://turnilo.wikimedia.org/#edits_hourly
Apr 8 2019
Apr 5 2019
It's a problem with the generation of edit_hourly in Hive.
The timestamp is not formatted correctly; I was using:
FROM_UNIXTIME( UNIX_TIMESTAMP(event_timestamp, 'yyyy-MM-dd hh:mm:ss.sss'), 'yyyy-MM-dd hh:00:00.0' ) AS dt
But that converts the hours to an AM-only (12-hour) format; it's an easy fix.
Will fix that on Monday.
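For the record, the cause is that in Java's SimpleDateFormat (which Hive's UNIX_TIMESTAMP/FROM_UNIXTIME use), 'hh' is the 12-hour pattern and 'HH' the 24-hour one, so the fix is presumably something like:

FROM_UNIXTIME( UNIX_TIMESTAMP(event_timestamp, 'yyyy-MM-dd HH:mm:ss.SSS'), 'yyyy-MM-dd HH:00:00.0' ) AS dt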
Wow, that's weird.
Thanks @Neil_P._Quinn_WMF! Forgot to do that.
As you can see, we had a slight change of plans in the implementation.
We encountered an issue in Druid, which does not allow applying transforms to fields that are not listed as dimensions, for Hive tables stored in Parquet format.
So we decided to create an intermediate table in Hive called edit_hourly (maybe edit_daily, if hourly turns out to perform poorly).
This way we won't need to use Druid transforms (the transforms will happen in Hadoop via HiveQL).
Also, we can take advantage of having the Hive version of the data set for more detailed querying.
Druid developers are fixing this issue in the new version, but it will still take some time until we upgrade to that.
In any case, it won't harm to have that intermediate table in Hive.
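As a sketch of what that looks like (column names are from memory and may not match the final table exactly), the transforms just become part of the HiveQL that builds the intermediate table:

CREATE TABLE edit_hourly STORED AS PARQUET AS
SELECT
    -- truncate the timestamp to the hour here in Hive, instead of transforming in Druid
    FROM_UNIXTIME(UNIX_TIMESTAMP(event_timestamp, 'yyyy-MM-dd HH:mm:ss.SSS'), 'yyyy-MM-dd HH:00:00.0') AS dt,
    wiki_db,
    IF(event_user_is_anonymous, 'anonymous', 'registered') AS user_type
FROM wmf.mediawiki_history
WHERE event_entity = 'revision' AND event_type = 'create';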