
Add data quality metric: traffic variations per country
Closed, Resolved · Public · 13 Estimated Story Points

Description

We should be able to detect whether the makeup of traffic for all projects, or maybe just en.wikipedia, has shifted considerably. This could indicate several things:

  • network-wise, we have a problem serving a particular location
  • traffic from a country is being blocked
  • traffic is flowing into a country where it did not before

The more targeted the measures are, the more useful they will be, so entropy measures on "desktop" English Wikipedia are likely to be more useful than ones on "overall" projects.
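
For illustration, here is a minimal sketch of what such an entropy measure could look like (the function name and input format are hypothetical, not the actual refinery implementation):

```
import math

def traffic_entropy(pageviews_by_country):
    """Shannon entropy (bits) of the per-country pageview distribution.

    A stable traffic makeup yields a stable entropy; blocking one country
    (or a flood of traffic from one country) shifts the distribution and
    therefore the entropy.
    """
    total = sum(pageviews_by_country.values())
    return -sum(
        (n / total) * math.log2(n / total)
        for n in pageviews_by_country.values()
        if n > 0
    )

# e.g. "desktop" English Wikipedia pageviews per country (made-up numbers)
print(traffic_entropy({"US": 500_000, "GB": 120_000, "IN": 90_000, "IR": 40_000}))
```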

Event Timeline

And also traffic increases: is this a good use of entropy, or do we want to go with a simple normal + standard deviation model?

I think the query that defines this quality metric should just forward the absolute value of pageviews per country.
I think normal+stddev is not needed there, because the anomaly detection algorithm should take care of that, no?
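
For reference, the simple "normal + std deviation" check amounts to something like this sketch (illustrative only; per the comment above, the real thresholding would live in the anomaly detection algorithm):

```
from statistics import mean, stdev

def is_anomalous(history, latest, k=3.0):
    """Flag `latest` if it falls more than k standard deviations away
    from the mean of the recent history of the metric."""
    mu, sigma = mean(history), stdev(history)
    return abs(latest - mu) > k * sigma

# Daily pageview counts for one country (made-up numbers)
history = [102_000, 98_500, 101_200, 99_800, 100_400, 97_900, 103_100]
print(is_anomalous(history, 58_000))  # True: a sharp drop
```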

Nuria renamed this task from Add anomaly detection alarm to detect traffic blockage per country to Add anomaly detection alarm to detect traffic variations on countries overall.Oct 4 2019, 4:01 AM
Nuria updated the task description.
Milimetric moved this task from Incoming to Data Quality on the Analytics board.
Milimetric added a project: Analytics-Kanban.

Hi,

We (research) will be supporting @ssingh in his work related to this problem, especially focused on censorship.

@Nuria & @mforns the problem I've seen in the past is that traffic tends to be seasonal, and the first thing here is to learn those seasons. What is "normal" on a Monday in January can be completely different from the Sunday of the same week, or from another Monday in July. Traffic can also change because of important events (imagine the World Cup final). So it's not clear which point in time should be the reference to compare against.

What I've seen in previous cases is that a good practice is to compare traffic among countries. If traffic is decaying simultaneously in many countries, it is less likely to be censorship. Another interesting feature is to compute the slope of traffic changes, to learn what is normal (I understand that slopes should always be 'soft') and what is not.

In general, I would say that this can be scoped as an outlier-detection problem in time series. I don't know the full literature on that, but we can have a look; the goal is not to build something super sophisticated, but something good enough to capture most of the cases with reasonable precision.
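
To make the seasonality point concrete, here is a sketch of one simple option (all names and parameters are assumptions for illustration, not an agreed design): compare each day only against the same weekday in recent weeks, using median/MAD so a single past spike does not distort the baseline.

```
from statistics import median

def seasonal_outlier(daily_counts, weeks=8, k=5.0):
    """Compare the last day in `daily_counts` (a chronological list of
    daily pageview counts for one country) against the same weekday over
    the previous `weeks` weeks, using median and MAD instead of
    mean/stddev so one past spike does not inflate the baseline."""
    today = daily_counts[-1]
    same_weekday = daily_counts[-1 - 7 * weeks : -1 : 7]  # every 7th day back
    base = median(same_weekday)
    mad = median(abs(x - base) for x in same_weekday) or 1.0
    return abs(today - base) > k * mad
```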

@diego @ssingh has done work in the past in this regard. I think the strategy of looking just at countries where we "suspect" drops might be related to censorship could be a good one. And to avoid "World Cup traffic" anomalies, maybe we can look at entropy (see parent task T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics) rather than raw traffic numbers. Entropy should be a more robust indicator of significant variance (I should say *seems like it would be* a more robust indicator, because that needs to be proven). We just did some work to add the ability to calculate entropy over a series of values: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/data_quality/hourly/queries/eventcapsule_metrics.hql and that work can probably be reused here.
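
As a toy demonstration of why entropy seems promising here (again, this needs to be proven empirically, and the numbers below are made up):

```
import math

def entropy(counts):
    """Shannon entropy (bits) of a list of per-country pageview counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

baseline  = [500_000, 120_000, 90_000, 40_000]  # four countries
world_cup = [500_000, 240_000, 90_000, 40_000]  # one country's traffic doubles
blockage  = [500_000, 120_000, 90_000, 400]     # one country nearly disappears

for name, counts in [("baseline", baseline), ("world cup", world_cup),
                     ("blockage", blockage)]:
    print(f"{name:>9}: total={sum(counts):>9,} entropy={entropy(counts):.3f}")
```

In this toy example the raw total swings about +16% for the spike but only about -5% for the blockage, whereas the entropy moves about +8% for the spike and about -17% for the blockage; that is, it reacts more strongly to the event we actually care about.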

@Nuria the entropy approach looks very cool, thanks for sharing.

The approach of having suspicious countries sounds dangerous to me. There are business and political motivations behind censorship accusations, and I think all countries should be treated the same.

Hi @Nuria and @diego. Thank you for your comments.

I think I should give some context on my discussions with Diego. From the censorship-detection point of view, we are interested in the global picture. Relatedly, on stat1007, the traffic anomaly project runs every day and generates anomalies in traffic patterns for all (246) countries (see T215379 and /home/sukhe/project_monitoring on stat1007), which is similar to what we are trying to do here. However, there are some concerns with the existing implementation: the anomaly detection needs improvement (what constitutes an "anomaly"?), and it takes more than 24 hours for the report to be generated. (There are other issues as well: the code is no longer maintained, and since we will be changing the logic, rewriting it is probably better than fixing it.) This was the reason I asked Diego for help on how to better detect such anomalies, but then we noticed this ticket, and it seems you had started discussing this before Diego and I did -- I guess it's a good sign that all of us think this is a problem that needs to be solved :)

The reason such a thing is useful is that it serves as a trigger: when we detect that traffic from country X went down, we can then run other tests to figure out whether the drop was due to censorship of Wikipedia in that country or to a general internet outage. So while we could start with some specific countries, I think it will be best to cover all countries, not just for the reasons Diego mentioned but also because this data may be useful as a monitoring tool for Wikipedia's accessibility from countries around the world. The data may also be of interest to researchers, projects, and organizations working on detecting internet censorship.

I am happy to help in any way I can, as this is directly related to the work we are doing on censorship detection, and the completion of this task will be very useful to us. It also seems like Research has some time to spare, so this can be a collaboration across the three teams.

I am going to set up a meeting to coordinate efforts. Defining an anomaly is not easy, but we can work with a more robust measure than raw pageviews (this is where entropy comes in), use past blockage events as "training data", and see where we get.

mforns renamed this task from Add anomaly detection alarm to detect traffic variations on countries overall to Add data quality metric: traffic variations per country.Oct 15 2019, 10:04 AM

Hi all! Here's one idea that may reduce false positives when there are traffic peaks for any given reason.

I assume that when there's a traffic peak in a wiki, it only affects a small subset of the articles.
If there's an event that triggers people to look up a topic on Wikipedia, they will not increase the traffic for all articles equally, but rather just for a small subset.

So, maybe, when counting pageviews per country, we can leave the top N% most-visited articles out of the calculation.
We can sort the articles by pageviews and remove the top ones until a certain % of the total is reached, like with a percentile calculation (see the sketch after this comment).

Maybe I'm completely naive in my assumption, but we could try?
Another concern is the computational load of this query; it could be heavy.

We could apply this in addition to normalizing a given country's pageviews by the total (global) pageviews,
as Nuria mentioned before. That would help with the drops in the data (like Christmas).
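
A sketch of both ideas combined (hypothetical helper names; the real thing would presumably be an HQL query over the pageview tables, so this is just to pin down the logic):

```
def trimmed_pageviews(article_views, trim_pct=0.05):
    """Sum of a country's per-article pageviews with the top articles
    excluded: articles are dropped in descending order of views until at
    least `trim_pct` of the country's total has been removed, so that an
    event-driven spike concentrated in a few articles barely moves the
    metric."""
    total = sum(article_views)
    kept = sorted(article_views)   # ascending; biggest articles at the end
    removed = 0
    while kept and removed < trim_pct * total:
        removed += kept.pop()      # drop the currently biggest article
    return sum(kept)

def country_metric(article_views, global_pageviews, trim_pct=0.05):
    """Trimmed country pageviews normalized by global pageviews, so that
    global dips (e.g. Christmas) cancel out instead of alarming."""
    return trimmed_pageviews(article_views, trim_pct) / global_pageviews
```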

> So, maybe, when counting pageviews per country, we can leave the top N% most-visited articles out of the calculation.

Nice! I think this is an excellent idea. We, of course, need to verify empirically that it works, but it very likely will.

Change 550498 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add data quality metric: traffic variations per country

https://gerrit.wikimedia.org/r/550498

I added a first draft of the metric to the data quality pipeline (see patch above), and added a chart to the data quality dashboard in Superset.
https://superset.wikimedia.org/superset/dashboard/73/

@ssingh
I'm trying to match a first draft of the traffic_per_country metric with the outage data that you put together.
Also, it would be great to have examples of false positives, to do the same and see how the metric behaves in such cases.
Do you have any false-positive examples that I can use?
Thanks a lot!

Hi @mforns. I will share some examples on the same page as the outage data. Thank you! ( @DED and I had a discussion about the same topic yesterday!)

See the event in Iran as of today.

Change 550498 merged by Mforns:
[analytics/refinery@master] Add data quality metric: traffic variations per country

https://gerrit.wikimedia.org/r/550498

Nuria set the point value for this task to 13.Dec 20 2019, 5:28 PM