Fri, Oct 23
Deleted both the HDFS and Hive directories, plus the corresponding database in Hive.
Marking this as resolved!
I deleted the HDFS and Hive files.
Thu, Oct 22
@Shilad ping? :]
@leila, T264255 is now resolved (I believe a tarball with all required files was copied over to a public location).
Please, can you confirm that we can proceed to delete the data on the stat100* machines and in HDFS?
One of the files in stat1006 is empty, and the other is a simple query.
I believe we can delete all.
I already removed the user's empty folder in HDFS.
Deleted /user/jkumalah from HDFS.
Could not delete the folders on the stat* boxes, no permissions...
The file and directories found are all empty. Will delete everything.
Wed, Oct 21
The table wmf_raw.mediawiki_page is not historical and only has snapshots since 2020-04.
Thus, we cannot re-calculate the denominator using that same table and query.
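In case it's useful, here's a quick sketch for double-checking which snapshots exist. This assumes a PySpark session on the cluster with access to wmf_raw, and that the table is partitioned by snapshot and wiki_db:

```
# Sketch: list the available snapshots of wmf_raw.mediawiki_page, to confirm
# that nothing older than 2020-04 is around. Assumes snapshot/wiki_db
# partitioning and read access to the wmf_raw database.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("check_mediawiki_page_snapshots").getOrCreate()

rows = spark.sql("SHOW PARTITIONS wmf_raw.mediawiki_page").collect()
# Partition strings look like "snapshot=2020-04/wiki_db=enwiki"; keep the snapshot value.
snapshots = sorted({r[0].split("/")[0].split("=")[1] for r in rows})
print(snapshots)
```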
Fri, Oct 16
Wed, Oct 14
Yes, please merge when ready, thanks!
Tue, Oct 13
Let me know when it's fine to merge the relevant change (for src_net + dst_net at least).
Please, merge whenever you are ready. Thanks!
Fri, Oct 9
Which field do we want to extract the AS name for? I see as_src, as_dst, peer_as_src, and peer_as_dst.
Ideally all of them, but at least as_src and as_dst. Note that because the sampled traffic is to/from our network, if as_src is a public AS (one you can look up), then as_dst will most likely be a private one (not present in the MaxMind DB), and the other way around. We could either keep those empty or feed them from a static list of ASNs (see https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations#Private_AS).
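For illustration only (the helper and names below are hypothetical, not the actual ingestion code): one way to handle this is a small function that falls back to a static map for private ASNs and leaves the public ones to whatever lookup we already have.

```
# Hypothetical sketch: resolve an AS name for fields like as_src / as_dst,
# falling back to a static map for private ASNs that won't be in the MaxMind DB.

# Placeholder entries only; the real values would come from
# https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations#Private_AS
PRIVATE_ASN_NAMES = {
    64600: "example-private-site-a",  # placeholder
    64601: "example-private-site-b",  # placeholder
}

def as_name(asn, public_lookup):
    """Return a human-readable AS name, or "" if unknown.

    `public_lookup` stands for whatever resolves public ASNs for us
    (e.g. a MaxMind-backed lookup); it is assumed, not defined here.
    """
    if asn is None:
        return ""
    if 64512 <= asn <= 65534 or 4200000000 <= asn <= 4294967294:
        # Private ASN ranges (RFC 6996): never present in public databases.
        return PRIVATE_ASN_NAMES.get(asn, "")
    return public_lookup(asn) or ""
```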
Hi @ayounsi, can you help me? I have some more questions:
Thu, Oct 8
After discussing with the team, we think it's fine for now.
If we want to add more fields or increase the sampling ratio,
then we should indeed make some calculations to make sure we're ok :]
Luca asked me to give some feedback about hue-next; here are some thoughts.
- Overall hue-next looks OK to me; it seems I can do all I need from it, except maybe:
- I usually like to open workflow instances in new tabs, so that I don't have to go back and forth when checking jobs, but hue-next does not allow that (or at least I could not manage to do it).
- The filter box defaults to user:mforns every time I reload the UI. This is a bit annoying, but not a big deal.
- You cannot order the job list by any field AFAICS (like order by name or order by creation date, etc.); you can do that in the old Hue.
The size of the events has increased by about 25-30%, which is considerable, but I believe sustainable for now.
When we sanitize this data set for long term retention, we'll have to think about the size of the remaining data.
Wed, Oct 7
@ayounsi Confirmed that you can merge the changes that add BGP communities to pmacct!
We'll be monitoring the kafka topic. Thanks!
A comment on sanitization:
I was looking at T254332 and one option is to move the netflow data and table to the event database.
This way, we could use the already present sanitization scheme (event -> event_sanitized), just by adding a couple lines to the sanitization include-list.
Ok to merge anytime or should I sync up with you?
I believe it's OK to merge, and that Refine should identify the new field and automagically evolve netflow's Hive schema. But let me confirm later today!
OK, after a very interesting chat with Joseph, here are our conclusions:
The communities can be encoded as a single underscore-joined string, e.g. 14907:0_14907:2_14907:3, to mean that the flow has the 3 communities 14907:0, 14907:2 and 14907:3.
So we can easily add it to the pmacct producer (let me know when would be a good time to do so). But I believe Faidon's question is about how to use it in Druid/Turnilo to, for example, filter only on 14907:2.
Awesome. Yes, as you said, Druid allows for multi-value dimensions. Either the Refine job or a subsequent job can transform BGP strings like "14907:0_14907:2_14907:3" into a list like ["14907:0", "14907:2", "14907:3"] and that would be ingested by Druid easily. In Turnilo's UI you would just use the drop-down filter with check-boxes to select those communities that you want to see (1 or more). I saw your patch, and think that whenever that gets merged, the current Refine job will automagically add that field to the refined netflow table (@Nuria correct me if I'm wrong).
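To make that transformation concrete, here's a minimal PySpark sketch (the "comms" column name and the sample values are made up for illustration; this is not the actual Refine code):

```
# Sketch: split the underscore-joined community string into an array column,
# which Druid can then ingest as a multi-value dimension.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("split_bgp_communities").getOrCreate()

df = spark.createDataFrame(
    [("14907:0_14907:2_14907:3",), (None,)],
    ["comms"],
)

# split() returns null for null input, which is fine for Druid ingestion.
df = df.withColumn("comms_list", F.split(F.col("comms"), "_"))
df.show(truncate=False)
```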
Tue, Oct 6
Raw data for mediawiki_job and netflow *older than 90 days* has been deleted with the script,
and periodic deletion jobs have been deployed.
Mon, Oct 5
Sat, Oct 3
Yesterday I tested the el_drop_unsanitized deletion job with the newest code (order fix + partial match fix) and it worked well.
I think that can be merged, and we can discuss further next week.
Thu, Oct 1
Can @mforns confirm that reportupdater can use this package as is?
Sep 28 2020
On Friday @razzi and I encountered a Puppet compiler error when trying to test your Puppet change for the test cluster.
Razzi created a task for the error: T263876.
We believe the error is unrelated to your change, but we didn't want to merge it anyway.
Sep 24 2020
I think we've discussed this before, but just for the record:
I think one important aspect of the sanitization configs is that changes to them can only take effect after a +2 from the analytics/security team.
Otherwise, that might cause privacy-sensitive data to be stored for more than 90 days, plus back-filling, auditing, and discussions.
So, I believe there must be some centralized control over sanitization configs. Maybe this fact helps us choose which way to go.
Sep 21 2020
Sep 18 2020
This is solved now, right?
@ssingh Oh, cool. :]
Maybe we can even leave the Spark job running as it is now (with very few changes on our side),
and in the systemd timer job, just check for the existence of an anomaly file.
The Spark job writes the anomalies file under the directory hdfs://analytics-hadoop/tmp/analytics/anomalies/
Now we'll have to think of a way to parse the file name, because the file name is what indicates which metric was anomalous.
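Something like the following could do it on the timer side (just a sketch; the file-name pattern below is a guess and would need to match whatever the Spark job actually writes):

```
# Sketch for the systemd timer side: check for anomaly files and pull the
# metric name out of the file name. The regex/pattern is hypothetical.
import re
import subprocess

ANOMALY_DIR = "hdfs://analytics-hadoop/tmp/analytics/anomalies/"

def anomalous_metrics():
    """Return the metric names found in anomaly file names, if any."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", "-C", ANOMALY_DIR],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []  # directory missing (or not readable): treat as no anomalies
    metrics = []
    for path in result.stdout.splitlines():
        filename = path.rsplit("/", 1)[-1]
        # Hypothetical file-name pattern: anomalies_<metric>_<YYYY-MM-DD>
        match = re.match(r"anomalies_(?P<metric>.+)_\d{4}-\d{2}-\d{2}", filename)
        if match:
            metrics.append(match.group("metric"))
    return metrics

if __name__ == "__main__":
    print(anomalous_metrics())
```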
Feel free to ping me whenever you tackle this!
Sep 17 2020
From grosking: let's keep them for 13 months.
@ssingh Hey! Do you have bandwidth to work on this in the end? We have had more ideas that might turn this into an easier task.
Sep 16 2020
Sep 15 2020
We're keeping track of the design requirements of this task in the doc:
Please, request access if interested.
Sep 14 2020
We decided on Superset and Hue.
We already have Presto in our stack.
In subsequent tasks we might add Alluxio to speed up Presto.