Mon, Mar 18
Wed, Mar 13
@chelsyx thanks for the clarification!
We could do it on one of these EU evenings?
Sure! I'll be on vacation, though, starting this Fri 15th (included), and will be back on Mon 25th.
Tue, Mar 12
Great, I already accepted the invite. Thank you!
Mon, Mar 11
@ezachte Issue 1 has changed since your initial description.
We no longer have intervals; instead we round to the thousands. However, there are still some non-human-friendly values, like 304.33M.
What do you think would be a good format?
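In case it helps the discussion, here's one possible rule sketched in Python (just a suggestion, not what the code does today): keep at most one decimal, and drop it once the value reaches the hundreds of a unit, so 304.33M would render as 304M while 1.23M stays 1.2M.
```python
def human_format(n):
    """Format a count with a metric suffix, keeping at most one decimal."""
    for suffix, factor in (("B", 1e9), ("M", 1e6), ("K", 1e3)):
        if abs(n) >= factor:
            value = n / factor
            # Fewer decimals for larger mantissas keeps the label short.
            decimals = 0 if value >= 100 else 1
            return f"{value:.{decimals}f}{suffix}"
    return f"{n:.0f}"

print(human_format(304_330_000))  # -> "304M"
print(human_format(1_230_000))    # -> "1.2M"
```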
@elukey we were here in grosking and agreed to look into this, to be sure there are no action items left before closing.
@mpopov Hi! Do you need google-auth here, or did you manage with the others?
Much of this data may be coming from bots as well, see: T210006
We have not been able to reproduce this issue. But maybe @elukey has an idea of how to solve it?
Hi @JAllemandou :]
Can you outline what remains to be done here please?
Hi @MNeisler! We'd like to have this done by the end of this quarter. Is there anything we can do? I can help you build a job that loads that data. Maybe we can have a meeting and you can pass me the requirements for the data set.
Fri, Mar 8
Thu, Mar 7
Re-opening; we need to check whether it is deployed and restarted.
This task is done. Resolving.
Wed, Mar 6
Tue, Mar 5
Thu, Feb 28
@GoranSMilovanovic I found 2 schemas that I wasn't aware of, that I believe belong to WMDE:
WMDEBannerEvents and WMDEBannerSizeIssue.
I didn't delete them yet, in case their owner wasn't aware of the 90-day deletion policy.
Are you the owner of those schemas?
The deletion is finished now. The event database now contains only the last 90 days of data.
We're still deleting the corresponding Hive partitions (metadata); this may cause some warnings when querying the data.
The partition metadata sync is estimated to finish in about 24 hours.
Ok, starting the deletion now.
@chelsyx thanks for the check!
@Tbayer thanks for the check!
Wed, Feb 27
Here's the plan for tomorrow (a rough sketch of the first steps follows the list):
- Collect the names of all tables in the event database that belong to the EL pipeline.
- For each table T, delete /wmf/data/event/T/year=2017 and /wmf/data/event/T/year=2018/month=M with M in (1,2,...10).
- For each table T, execute an msck repair table T command in Hive.
- Execute the deletion script (refinery-drop-older-than) once with --older-than=90 to delete the remaining days due for deletion.
- Productionize a systemd timer that calls the deletion script periodically (every day) in puppet.
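To make the first three steps concrete, here's roughly what they could look like if scripted (illustrative only: the table filtering and the hive/hdfs invocations are simplifications, not the exact commands we'll run).
```python
import subprocess

def run(cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# 1. Tables in the event database (in practice filtered down to only the
#    tables that belong to the EL pipeline).
tables = run(["hive", "-e", "USE event; SHOW TABLES;"]).split()

for t in tables:
    # 2. Drop the HDFS data for 2017 and Jan-Oct 2018.
    run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", f"/wmf/data/event/{t}/year=2017"])
    for m in range(1, 11):
        run(["hdfs", "dfs", "-rm", "-r", "-skipTrash",
             f"/wmf/data/event/{t}/year=2018/month={m}"])
    # 3. Re-sync Hive's partition metadata with what is left on HDFS.
    run(["hive", "-e", f"USE event; MSCK REPAIR TABLE {t};"])
```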
Mon, Feb 25
@leila, +1 to Nuria, but I think you're just confused by my latest message.
We already discussed this from the 14th of November onwards.
As I understood it from @Miriam's comments, you guys copied the CitationUsage schemas over to another location, right?
So, the script won't affect that copy.
I was just giving a last ping after final deletion.
Please, confirm :]
Fri, Feb 22
Sorry for the mass ping.
@chelsyx @mpopov @Neil_P._Quinn_WMF (cc @Nuria)
I backfilled both schemas since the discussed dates: event_sanitized.VisualEditorFeatureUse since 2018-10-24 and event_sanitized.MobileWikiAppShareAFact since 2018-06-21. I vetted the resulting data and it all looks good to me, but please give a quick check to confirm.
Thanks for the changes you guys made! Now I will proceed to productionize the purging script that will delete events older than 90 days from the raw events database (event). This will happen on Feb 28th.
@Tbayer event_sanitized.readingdepth is backfilled using the new whitelist.
I have vetted the resulting data and it looks good to me, but please do a quick check.
Note that on the 28th of this month (in one week) we'll run the purging script to delete data older than 90 days from the event database, so backfills like this will no longer be possible (beyond 90 days).
Thu, Feb 21
Wed, Feb 20
Sure, we can discuss here or have a meeting, whatever works better for you. I also just talked to @Neil_P._Quinn_WMF about whether we should extract the data from mediawiki_history to an intermediate Hive table and then load from that one, or just use Druid transforms to ingest directly from mediawiki_history. I lean towards the second option, because it doesn't need the extra step (a table that would have to be maintained). But let's discuss!
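To make option 2 a bit more concrete, here's a very rough sketch (field names and the expression are made up, and this is not our actual ingestion config) of the transformSpec fragment a Druid ingestion spec could use to derive dimensions and filter rows while reading mediawiki_history directly, instead of materializing an intermediate Hive table first.
```python
# Hypothetical fragment of a Druid ingestion spec's transformSpec.
transform_spec = {
    "transforms": [
        {
            "type": "expression",
            "name": "user_type",  # hypothetical derived dimension
            "expression": "if(event_user_is_anonymous == 'true', 'anonymous', 'registered')",
        },
    ],
    # Keep only the rows we actually want in the datasource.
    "filter": {"type": "selector", "dimension": "event_entity", "value": "revision"},
}
```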
@hashar Thanks a lot for all the dedicated work!!
Tue, Feb 19
@Ottomata I tried the simple ALTER TABLE, and it works, provided the field whose type you want to change is a top-level field. In our case, we want to change the type of a subfield of the event struct. When you do that, the whole event field becomes unreadable for all the data that was already in the table. I think the problem lies in the serde (the Parquet serde?). That is why we needed to backfill the entire table since 2017-11. However, we only had raw events for the last 3 months, so part of the backfilling was done from a temp copy of the old data.
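For reference, here's roughly what I mean (the table and field names below are made up; only the event struct is real, and this assumes a Parquet-backed Hive table).
```python
import subprocess

def hive(query):
    """Run a HiveQL statement via the hive CLI (assumes hive is on the PATH)."""
    subprocess.run(["hive", "-e", query], check=True)

# Changing the type of a top-level column works in place:
hive("ALTER TABLE event.SomeSchema CHANGE COLUMN revision revision BIGINT;")

# Changing a subfield of the event struct means redefining the whole struct;
# with the Parquet serde, data written under the old struct definition then
# becomes unreadable, which is why we had to backfill instead:
hive("""
ALTER TABLE event.SomeSchema
CHANGE COLUMN event event STRUCT<page_id: BIGINT, action: STRING>;
""")
```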
I think this could work, but I believe it has a couple of drawbacks that we can avoid:
- It adds some extra cognitive load, especially because different fields with different orders of magnitude will need different decimal shifts. We can store the length of the shift in the schema field description, but it's still potentially confusing, I think.
- Also, it would need an explicit conversion on the client (1.2345 => 12345), and the shift length per field would have to be stored somewhere or hardcoded in the instrumentation, no? (Rough sketch after this list.)
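To illustrate the shift I mean (the SHIFTS table, field name and helpers below are hypothetical; in practice the shift would live in the schema field description or the instrumentation):
```python
# The client multiplies before sending; every consumer has to know the
# per-field shift to undo it.
SHIFTS = {"load_time": 4}  # 4 decimal places, i.e. multiply by 10**4

def encode(field, value):
    return round(value * 10 ** SHIFTS[field])  # 1.2345 -> 12345

def decode(field, value):
    return value / 10 ** SHIFTS[field]         # 12345 -> 1.2345

assert encode("load_time", 1.2345) == 12345
assert decode("load_time", 12345) == 1.2345
```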
@JAllemandou Yes. The checksum does not change when the dates change, because the --older-than=90 parameter remains the same.
Mon, Feb 18
Hey all :]
@Amire80, sure feel free to schedule one!
You need to run the script once without the --execute flag (dry run). The checksum will be printed by the script at the end.
To *really* execute the script, run it again with the same parameters and with --execute <checksum>.
The puppet systemd-timer should thus include the checksum.
This checksum mechanism is intended to force us to do a dry run first and, hopefully, check the printed output before executing for real.
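A minimal sketch of that dry-run / checksum pattern, assuming a made-up checksum function (the real refinery-drop-older-than computes it differently):
```python
import hashlib
import sys

def checksum(args):
    # The checksum covers only the parameters, which is why it stays stable as
    # the dates roll forward, as long as --older-than=90 etc. don't change.
    return hashlib.sha1(" ".join(sorted(args)).encode()).hexdigest()[:10]

def main(args, execute=None):
    expected = checksum(args)
    if execute is None:
        # Dry run: print what would be deleted and the checksum to pass next time.
        print(f"DRY RUN; re-run with --execute {expected} to really delete")
    elif execute == expected:
        print("Checksum matches the parameters: executing the deletion for real")
    else:
        sys.exit("Checksum does not match the given parameters; refusing to run")
```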