
Add data quality metric: traffic variations per country
Closed, Resolved · Public · 13 Estimated Story Points

Description

We should be able to detect whether the makeup of traffic for all projects, or maybe just en.wikipedia, has shifted considerably. This could indicate several things:

  • network-wise, we have a problem serving a particular location
  • traffic from a country is being blocked
  • traffic is flowing into a country where it did not before

The more targeted the measures are, the more useful they will be, so entropy measures on "desktop" English Wikipedia are likely to be more useful than ones on "overall" projects.
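
For illustration, here is a minimal sketch of what such an entropy measure could look like (the function name and input format are hypothetical, not the actual refinery implementation):

```
import math

def traffic_entropy(pageviews_by_country):
    """Shannon entropy (bits) of the per-country pageview distribution.

    A stable traffic makeup yields a stable entropy; blocking one country
    (or a flood of traffic from one country) shifts the distribution and
    therefore the entropy.
    """
    total = sum(pageviews_by_country.values())
    return -sum(
        (n / total) * math.log2(n / total)
        for n in pageviews_by_country.values()
        if n > 0
    )

# e.g. "desktop" English Wikipedia pageviews per country (made-up numbers)
print(traffic_entropy({"US": 500_000, "GB": 120_000, "IN": 90_000, "IR": 40_000}))
```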

Event Timeline

And also traffic increases: is this a good use of entropy, or do we want to go with a simple normal + standard deviation model?

I think the query that defines this quality metric should just forward the absolute value of pageviews per country.
I think normal+stddev is not needed there, because the anomaly detection algorithm should take care of that, no?
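
For reference, the simple "normal + std deviation" check amounts to something like this sketch (illustrative only; per the comment above, the real thresholding would live in the anomaly detection algorithm):

```
from statistics import mean, stdev

def is_anomalous(history, latest, k=3.0):
    """Flag `latest` if it falls more than k standard deviations away
    from the mean of the recent history of the metric."""
    mu, sigma = mean(history), stdev(history)
    return abs(latest - mu) > k * sigma

# Daily pageview counts for one country (made-up numbers)
history = [102_000, 98_500, 101_200, 99_800, 100_400, 97_900, 103_100]
print(is_anomalous(history, 58_000))  # True: a sharp drop
```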

Nuria renamed this task from Add anomaly detection alarm to detect traffic blockage per country to Add anomaly detection alarm to detect traffic variations on countries overall.Oct 4 2019, 4:01 AM
Nuria updated the task description.
Milimetric moved this task from Incoming to Data Quality on the Analytics board.
Milimetric added a project: Analytics-Kanban.

Hi,

We (research) will be supporting @ssingh in his work related to this problem, especially focused on censorship.

@Nuria & @mforns the problem I've seen in the past is that traffic tends to be seasonal, and the first thing here is to learn those seasons. What is "normal" on a Monday in January can be completely different from the Sunday of the same week, or from another Monday in July. Traffic can also change because of important events (imagine the World Cup final). So it's not clear which point in time should be the reference to compare against.

What I've seen in previous cases is that a good practice is to compare traffic among countries. If traffic is decaying simultaneously in many countries, it is less likely to be censorship. Another interesting feature is to compute the slope of traffic changes, to learn what is normal (I understand that slopes should always be 'soft') and what is not.

In general, I would say that this can be scoped as an outlier-detection problem in time series. I don't know the full literature on that, but we can have a look; the goal is not to build something super sophisticated, but something good enough to capture most of the cases with reasonable precision.
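
To make the seasonality point concrete, here is a sketch of one simple option (all names and parameters are assumptions for illustration, not an agreed design): compare each day only against the same weekday in recent weeks, using median/MAD so a single past spike does not distort the baseline.

```
from statistics import median

def seasonal_outlier(daily_counts, weeks=8, k=5.0):
    """Compare the last day in `daily_counts` (a chronological list of
    daily pageview counts for one country) against the same weekday over
    the previous `weeks` weeks, using median and MAD instead of
    mean/stddev so one past spike does not inflate the baseline."""
    today = daily_counts[-1]
    same_weekday = daily_counts[-1 - 7 * weeks : -1 : 7]  # every 7th day back
    base = median(same_weekday)
    mad = median(abs(x - base) for x in same_weekday) or 1.0
    return abs(today - base) > k * mad
```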

@diego @ssingh has done work in the past in this regard. I think the strategy of looking just at countries where we "suspect" drops might be related to censorship could be a good one. And to avoid "World Cup traffic" anomalies, maybe we can look at entropy (see parent task T215863: Proof of concept: Entropy calculations can be used to alarm on anomalies for data quality metrics) rather than raw traffic numbers. Entropy should be a more robust indicator of significant variance (I should say *seems like it would be* a more robust indicator, because that needs to be proven). We just did some work to add the ability to calculate entropy over a series of values: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/data_quality/hourly/queries/eventcapsule_metrics.hql and that work can probably be reused here.
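
As a toy demonstration of why entropy seems promising here (again, this needs to be proven empirically, and the numbers below are made up):

```
import math

def entropy(counts):
    """Shannon entropy (bits) of a list of per-country pageview counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

baseline  = [500_000, 120_000, 90_000, 40_000]  # four countries
world_cup = [500_000, 240_000, 90_000, 40_000]  # one country's traffic doubles
blockage  = [500_000, 120_000, 90_000, 400]     # one country nearly disappears

for name, counts in [("baseline", baseline), ("world cup", world_cup),
                     ("blockage", blockage)]:
    print(f"{name:>9}: total={sum(counts):>9,} entropy={entropy(counts):.3f}")
```

In this toy example the raw total swings about +16% for the spike but only about -5% for the blockage, whereas the entropy moves about +8% for the spike and about -17% for the blockage; that is, it reacts more strongly to the event we actually care about.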

@Nuria the entropy approach looks very cool, thanks for sharing.

The approach of having suspicious countries sounds dangerous to me. There are business and political motivations behind censorship accusations, and I think all countries should be treated the same.

Hi @Nuria and @diego. Thank you for your comments.

I think I should give some context on my discussions with Diego. From the censorship-detection point of view, we are interested in the global picture. Relatedly, on stat1007, the traffic anomaly project runs every day and generates anomalies in traffic patterns for all (246) countries (see T215379 and /home/sukhe/project_monitoring on stat1007), which is similar to what we are trying to do here. However, there are some concerns with the existing implementation: the anomaly detection needs improvement (what constitutes an "anomaly"?), and it takes more than 24 hours for the report to be generated. (There are other issues as well: the code is no longer maintained, and since we will be changing the logic, rewriting it is probably better than fixing it.) This was the reason I asked Diego for help on how to better detect such anomalies, but then we noticed this ticket, and it seems you had started discussing this before Diego and I did -- I guess it's a good sign that all of us think this is a problem that needs to be solved :)

The reason such a thing is useful is that it serves as a trigger: when we detect that traffic from country X went down, we can then run other tests to figure out whether the drop was due to censorship of Wikipedia in that country or to a general internet outage. So while we could start with some specific countries, I think it will be best to cover all countries, not just for the reasons Diego mentioned but also because this data may be useful as a monitoring tool for Wikipedia's accessibility from countries around the world. The data may also be of interest to researchers, projects, and organizations working on detecting internet censorship.

I am happy to help in any way I can, as this is directly related to the work we are doing on censorship detection, and the completion of this task will be very useful to us. It also seems like Research has some time to spare, so this can be a collaboration across the three teams.

I am going to set up a meeting to coordinate efforts. Defining an anomaly is not easy, but we can work with a more robust measure than raw pageviews (this is where entropy comes in), use past blockage events as "training data", and see where we get.

mforns renamed this task from Add anomaly detection alarm to detect traffic variations on countries overall to Add data quality metric: traffic variations per country.Oct 15 2019, 10:04 AM

Hi all! Here's one idea that may reduce false positives when there are traffic peaks for any given reason.

I assume that when there's a traffic peak in a wiki, it only affects a small subset of the articles.
If there's an event that triggers people to look up a topic on Wikipedia, they will not increase the traffic for all articles equally, but rather just for a small subset.

So, maybe, when counting pageviews per country, we can leave the top N% most-visited articles out of the calculation.
We can sort the articles by pageviews and remove the top ones until a certain % of the total is reached, like with a percentile calculation (see the sketch after this comment).

Maybe I'm completely naive in my assumption, but we could try?
Another concern is the computational load of this query; it could be heavy.

We could apply this in addition to normalizing a given country's pageviews by the total (global) pageviews,
as Nuria mentioned before. That would help with the drops in the data (like Christmas).
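
A sketch of both ideas combined (hypothetical helper names; the real thing would presumably be an HQL query over the pageview tables, so this is just to pin down the logic):

```
def trimmed_pageviews(article_views, trim_pct=0.05):
    """Sum of a country's per-article pageviews with the top articles
    excluded: articles are dropped in descending order of views until at
    least `trim_pct` of the country's total has been removed, so that an
    event-driven spike concentrated in a few articles barely moves the
    metric."""
    total = sum(article_views)
    kept = sorted(article_views)   # ascending; biggest articles at the end
    removed = 0
    while kept and removed < trim_pct * total:
        removed += kept.pop()      # drop the currently biggest article
    return sum(kept)

def country_metric(article_views, global_pageviews, trim_pct=0.05):
    """Trimmed country pageviews normalized by global pageviews, so that
    global dips (e.g. Christmas) cancel out instead of alarming."""
    return trimmed_pageviews(article_views, trim_pct) / global_pageviews
```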

> So, maybe, when counting pageviews per country, we can leave the top N% most-visited articles out of the calculation.

Nice! I think this is an excellent idea. We, of course, need to verify empirically that it works, but it very likely will.

Change 550498 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add data quality metric: traffic variations per country

https://gerrit.wikimedia.org/r/550498

I added a first draft of the metric to the data quality pipeline (see patch above), and added a chart to the data quality dashboard in Superset.
https://superset.wikimedia.org/superset/dashboard/73/

@ssingh
I'm trying to match a first draft of the traffic_per_country metric with the outage data that you put together.
Also, it would be great to have examples of false positives, to do the same and see how the metric behaves in such cases.
Do you have any false-positive examples that I can use?
Thanks a lot!

Hi @mforns. I will share some examples on the same page as the outage data. Thank you! ( @DED and I had a discussion about the same topic yesterday!)

See the event in Iran as of today.

Change 550498 merged by Mforns:
[analytics/refinery@master] Add data quality metric: traffic variations per country

https://gerrit.wikimedia.org/r/550498

Nuria set the point value for this task to 13.Dec 20 2019, 5:28 PM