
Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed
Open, Needs Triage, Public

Description

What happened

After today's restart of the varnishkafka instances for a couple of changes (everything went fine), I noticed two weird metrics in the dashboard for cp403[56] (related to the number of requests waiting to be acked by poll()). I asked Valentin for help, and we quickly found out that those vk instances were stuck with:

%7 VSM_OPEN: Failed to open Varnish VSL: Cannot open /var/lib/varnish/frontend/_.vsm: No such file or directory

(Note: to discover this I started another instance using the webrequest config but emitting data to stdout and with debug logging).

It is not clear why, but the varnishkafka package installed was 1.0.14-1 instead of 1.1.0-1 (the upgrade happened a while ago, IIUC for T264074).

The caching nodes affected were: cp[6002,6005,6009-6013].drmrs.wmnet,cp1087.eqiad.wmnet,cp[4021,4033-4036].ulsfo.wmnet

More specifically, the affected instances were:

  • varnishkafka-webrequest on cp1087, cp4035 and cp4036 (text)
  • varnishkafka-webrequest on cp4021, cp4033, cp4034 (upload)
  • varnishkafka-statsv on cp1087, cp4035 and cp4036
  • varnishkafka-eventlogging on cp1087, cp4035 and cp4036

Some notes about these nodes:

  • cp1087 is an old node, but its OS was most recently installed on Fri Jun 4 20:02:59 UTC 2021
  • the drmrs nodes are related to the new DC and have never served traffic
  • cp4021's OS was installed on Wed Oct 13 11:48:34 UTC 2021
  • cp4035's OS was installed on Wed Nov 3 18:04:46 UTC 2021
  • cp4036's OS was installed on Thu Nov 4 10:42:30 UTC 2021
Why varnishkafka didn't fail and trigger an alert

We coded a special feature into vk to retry indefinitely when the Varnish shared memory handle is not available. This is needed for flexibility when Varnish is restarted, since that caused a lot of headaches in the past.
This special behavior was only present in the old version of varnishkafka; in the newer one the Varnish API takes care of everything, including retrying and failing after a bit. I tested it today on cp3055:

elukey@cp3055:~$ sudo varnishkafka -S webrequest.conf 
.....
VSM: Could not get hold of varnishd, is it running?

I used a wrong Varnish handle name in the config, and I got a meaningful error and exit code after some tries. So the correct/up-to-date version of varnishkafka doesn't retry indefinitely and fails after a few retries. We ended up in this situation because an old version of varnishkafka was deployed.

Why didn't we get any alert?

We have alerts for errors like delivery reports (i.e. when a message is not delivered to Kafka) and alerts on the status of the varnishkafka daemons (up/down), but we don't have any alert related to the amount of traffic an instance handles (e.g. whether it is zero or not).

Event Timeline


varnishkafka-webrequest on cp1087, cp4035 and cp4036 (text)
varnishkafka-webrequest on cp4021, cp4033, cp4034 (upload)

We should estimate how much webrequest log loss over how much time this was.

The main issue is:

varnishkafka | 1.0.14-1 |  buster-wikimedia |               main | amd64, source
varnishkafka |  1.1.0-1 |  buster-wikimedia | component/varnish6 | amd64, source

So varnishkafka 1.0.14 is installed by the varnishkafka class (init.pp) before the apt config for the component is added, but puppet never upgrades the package, and if we don't do it manually we end up in this task.
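This kind of skew is mechanically detectable. A hypothetical sketch of the comparison (a real check should shell out to `dpkg --compare-versions` rather than parse versions by hand; this crude parser only handles simple `upstream-revision` versions like the two above):

```python
def debian_version_key(version):
    """Crude sort key for simple Debian versions like '1.0.14-1'.
    Handles only numeric dotted upstream versions plus a numeric revision."""
    upstream, _, revision = version.partition("-")
    return tuple(int(part) for part in upstream.split(".")), int(revision or 0)

def is_outdated(installed, candidate):
    """True when the installed package is older than the repo candidate."""
    return debian_version_key(installed) < debian_version_key(candidate)

# The situation in this task: 1.0.14-1 on disk, 1.1.0-1 in component/varnish6.
print(is_outdated("1.0.14-1", "1.1.0-1"))  # → True
```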

varnishkafka-webrequest on cp1087, cp4035 and cp4036 (text)
varnishkafka-webrequest on cp4021, cp4033, cp4034 (upload)

We should estimate how much webrequest log loss over how much time this was.

My suspicion is that these instances dropped data since the last OS reimage date :(

Besides figuring out the data loss, we should:

  1. Fix puppet and apt, the right varnishkafka version should be deployed. Not sure what's best, it also depends on how the traffic team wants to proceed.
  2. Add monitors for zero msg/s of traffic in varnishkafka instances
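For item 2, the core of such a monitor is just a threshold check on per-instance message rates. A minimal sketch, with instance names and rates invented for illustration (a real alert would read the rates from the existing varnishkafka stats pipeline):

```python
def zero_traffic_instances(rates, threshold=0.0):
    """Return the instances whose recent msg/s rate is at or below threshold.
    `rates` maps 'host:instance' to a recent messages-per-second value."""
    return sorted(name for name, rate in rates.items() if rate <= threshold)

# Invented sample: one silent webrequest instance among healthy ones.
rates = {
    "cp1087:webrequest": 0.0,
    "cp1085:webrequest": 2300.5,
    "cp1089:webrequest": 2410.2,
}
print(zero_traffic_instances(rates))  # → ['cp1087:webrequest']
```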

This confirms all of the package install dates:

btullis@cumin1001:~$ sudo cumin 'cp1087.* or cp4021.* or cp403[3-6].*' 'zgrep -E "(install|upgrade) varnishkafka:amd64" /var/log/dpkg.log*'
6 hosts will be targeted:
cp1087.eqiad.wmnet,cp[4021,4033-4036].ulsfo.wmnet
Ok to proceed on 6 hosts? Enter the number of affected hosts to confirm or "q" to quit 6
===== NODE GROUP =====                                                                                                                                                                                             
(1) cp4035.ulsfo.wmnet                                                                                                                                                                                             
----- OUTPUT of 'zgrep -E "(insta...ar/log/dpkg.log*' -----                                                                                                                                                        
/var/log/dpkg.log:2022-01-26 15:41:16 upgrade varnishkafka:amd64 1.0.14-1 1.1.0-1                                                                                                                                  
/var/log/dpkg.log.2.gz:2021-11-03 18:16:48 install varnishkafka:amd64 <none> 1.0.14-1                                                                                                                              
===== NODE GROUP =====                                                                                                                                                                                             
(1) cp4034.ulsfo.wmnet                                                                                                                                                                                             
----- OUTPUT of 'zgrep -E "(insta...ar/log/dpkg.log*' -----                                                                                                                                                        
/var/log/dpkg.log:2022-01-26 15:55:10 upgrade varnishkafka:amd64 1.0.14-1 1.1.0-1                                                                                                                                  
/var/log/dpkg.log.2.gz:2021-11-04 09:39:01 install varnishkafka:amd64 <none> 1.0.14-1                                                                                                                              
===== NODE GROUP =====                                                                                                                                                                                             
(1) cp4021.ulsfo.wmnet                                                                                                                                                                                             
----- OUTPUT of 'zgrep -E "(insta...ar/log/dpkg.log*' -----                                                                                                                                                        
/var/log/dpkg.log:2022-01-26 15:55:10 upgrade varnishkafka:amd64 1.0.14-1 1.1.0-1                                                                                                                                  
/var/log/dpkg.log.3.gz:2021-10-13 12:06:32 install varnishkafka:amd64 <none> 1.0.14-1                                                                                                                              
===== NODE GROUP =====                                                                                                                                                                                             
(1) cp4033.ulsfo.wmnet                                                                                                                                                                                             
----- OUTPUT of 'zgrep -E "(insta...ar/log/dpkg.log*' -----                                                                                                                                                        
/var/log/dpkg.log:2022-01-26 15:55:10 upgrade varnishkafka:amd64 1.0.14-1 1.1.0-1                                                                                                                                  
/var/log/dpkg.log.2.gz:2021-11-03 16:30:15 install varnishkafka:amd64 <none> 1.0.14-1                                                                                                                              
===== NODE GROUP =====                                                                                                                                                                                             
(1) cp4036.ulsfo.wmnet                                                                                                                                                                                             
----- OUTPUT of 'zgrep -E "(insta...ar/log/dpkg.log*' -----                                                                                                                                                        
/var/log/dpkg.log:2022-01-26 15:55:10 upgrade varnishkafka:amd64 1.0.14-1 1.1.0-1                                                                                                                                  
/var/log/dpkg.log.2.gz:2021-11-04 10:53:53 install varnishkafka:amd64 <none> 1.0.14-1                                                                                                                              
===== NODE GROUP =====                                                                                                                                                                                             
(1) cp1087.eqiad.wmnet                                                                                                                                                                                             
----- OUTPUT of 'zgrep -E "(insta...ar/log/dpkg.log*' -----                                                                                                                                                        
/var/log/dpkg.log:2022-01-26 15:55:09 upgrade varnishkafka:amd64 1.0.14-1 1.1.0-1                                                                                                                                  
/var/log/dpkg.log.7.gz:2021-06-04 20:23:56 install varnishkafka:amd64 <none> 1.0.14-1                                                                                                                              
================

I make it the following number of days without data:

cp1087 = 236 days
cp4021 = 105 days
cp4033 = 84 days
cp4044 = 83 days
cp4045 = 84 days
cp4046 = 83 days

Discounting the drmrs nodes.
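These counts follow directly from the dpkg install dates above and the 2022-01-26 upgrade; for example:

```python
from datetime import date

# Install dates from the dpkg logs above; the upgrade to 1.1.0-1
# landed on 2022-01-26.
fix_date = date(2022, 1, 26)
install_dates = {
    "cp1087": date(2021, 6, 4),
    "cp4021": date(2021, 10, 13),
    "cp4033": date(2021, 11, 3),
    "cp4034": date(2021, 11, 4),
    "cp4035": date(2021, 11, 3),
    "cp4036": date(2021, 11, 4),
}
days_without_data = {h: (fix_date - d).days for h, d in install_dates.items()}
print(days_without_data["cp1087"])  # → 236
print(days_without_data["cp4021"])  # → 105
```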

This lines up with the dip in traffic that @kzimmerman was asking us about. I'm gonna loop her in here sooner than later, even though we don't really have a high quality report.

I make it the following number of days without data:

cp1087 = 236 days
cp4021 = 105 days
cp4033 = 84 days
cp4044 = 83 days
cp4045 = 84 days
cp4046 = 83 days

btw, I think this has typos, should be cp403[3-6], instead of cp404[4-6]

Queries used for analysis:

Thanks @Milimetric! Here's the main task we were using as we investigated pageviews: https://phabricator.wikimedia.org/T296875

Summary of the data loss analysis:

  • Between 2021-06-04 and 2021-11-03 we lost 2.80% of webrequest-text, statsv and eventlogging data (pageview impact, eqiad datacenter only)
  • Between 2021-11-04 and 2022-01-27 we lost 4.34% of webrequest-text, statsv and eventlogging data (pageview impact, eqiad + ulsfo datacenters only)
  • Between 2021-10-13 and 2021-11-03 we lost 1.01% of webrequest-upload data (mediarequest impact, ulsfo datacenter only)
  • Between 2021-11-04 and 2022-01-27 we lost 2.19% of webrequest-upload data (mediarequest impact, ulsfo datacenter only)
EChetty moved this task from Incoming to Ops Week on the Data-Engineering board.
EChetty moved this task from Ops Week to Ops on the Data-Engineering board.

Thank you Data Engineering team for the report. I used the Webrequest Loss end of 2021 report to get the % data loss for Nov and Dec, and here's what I found:

November
We reported 15,888,030,722 pageviews. However, given the 4.23% data loss, we should have had 16,589,778,346.04 actual pageviews.
Nov declines:

  • Compared to 2020: we reported -11.41% YoY. However, the actual decline was -7.49%
  • Compared to 2019: we reported -2.11% YoYoY. However, there was an increase of +2.21%

December
We reported 15,826,651,147 pageviews. However, given the 4.45% data loss, we should have had 16,563,737,464.15 actual pageviews.
Dec declines:

  • Compared to 2020: we reported -7.70%. However, the actual decline was -3.40%
  • Compared to 2019: we reported an increase of +1.13%. However, the actual inc was +5.84%
Conclusion

The impact of data loss on the pageview decline is high (~4%)! It is a significant contributor to the drop in pageviews.

Scale

The difference between reported and actual pageviews is ~700M in Nov.
In terms of scale, we can say: "In November we lost around 3 days' worth of enwiki pageviews"

@Mayakp.wiki thanks for doing this analysis! Question: it looks like these numbers are for 0.0423% data loss as opposed to 4.23% data loss. Am I missing something?

Thanks @Isaac for noting that; I have edited the above comment with my analysis. My formula was correct, but I was getting different numbers because of percentage formatting in my spreadsheet.
Please let me know if it makes better sense now.

Pv_reported : pageviews from pageview_hourly
Pv_Loss : data loss % from DE (a percentage of the actual total)
Pv_actuals : real pageviews

Pv_actuals = (Pv_reported * 100) / (100 - Pv_Loss)
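As a sanity check, plugging the November numbers from the comment above into that formula reproduces the reported figure:

```python
def pv_actuals(pv_reported, pv_loss_pct):
    """Scale reported pageviews back up, treating the loss percentage
    as a share of the actual (not the reported) total."""
    return pv_reported * 100 / (100 - pv_loss_pct)

# November: 15,888,030,722 reported pageviews, 4.23% data loss.
actual = pv_actuals(15_888_030_722, 4.23)
print(f"{actual:,.2f}")  # roughly 16,589,778,346 actual pageviews
```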

Yep! That's all making more sense to me now. Thanks again for running the numbers and contextualizing them!

Wow, thanks so much, everyone, for finding this and for all the work figuring out the impact! I'm wondering if there's any chance of estimating the loss for smaller segments, that is, by wiki project, country and access method. Maybe by looking at how much traffic was served each day for each segment, from the affected datacenters?

Just a radical idea here... so if this information could help partly rectify our aggregated data sets, but is not available outside the 90-day purge window, might we consider temporarily suspending the full data purges, at least until we're able to run queries for this on the data that's still there?

Giant hug to all... I know how dedicated folks are to this work, and I imagine this doesn't feel great. It'll be ok! Your work is amazing. Thanks again for everything!!!!!!

@AndyRussG : For next steps we are planning to look into the impact of data loss on US traffic and other dimensions like access method and referrer traffic. This will help us continue investigating causes (other than the data loss) that have contributed to the decline.

Just a radical idea here... so if this information could help partly rectify our aggregated data sets, but is not available outside the 90-day purge window, might we consider temporarily suspending the full data purges, at least until we're able to run queries for this on the data that's still there?

I don't know the full implications here, but I'm fairly certain a request like this would require legal approval. :)

@AndyRussG : For next steps we are planning to look into the impact of data loss on US traffic and other dimensions like access method and referrer traffic. This will help us continue investigating causes (other than the data loss) that have contributed to the decline.

Ah sounds great! So it seems the calculations here would apply to aggregated global traffic, but since the two datacenters affected are in the US, the impact on reported pageviews will be even greater for the US and countries that mainly reach us through those datacenters, and less, or maybe almost none, for countries that reach us through datacenters elsewhere?

(Perhaps folks from data engineering or Traffic would be able to verify this assumption?)

I think this bias would then carry forward to all country-specific data, like referrer, browser share, page rank, etc.?

Hmmm maybe this could partly explain what seems like some geographic clustering in the pageview decline?

I wonder how random the distribution of work across hosts in data centers is? Would there be an impact on circadian patterns (important for some research) due to different hosts handling more or less requests at different times of the day, when there is more or less traffic?

might we consider temporarily suspending the full data purges, at least until we're able to run queries for this on the data that's still there?

I don't know the full implications here, but I'm fairly certain a request like this would require legal approval. :)

Definitely I think it would require legal approval, and I imagine that might actually take longer than just running the queries... though if people think this is worthwhile, an idea could be to try starting on both?

Thanks again!!!! :)

since the two datacenters affected are in the US, the impact on reported pageviews will be even greater for the US and countries that mainly reach us through those datacenters, and less, or maybe almost none, for countries that reach us through datacenters elsewhere?

@AndyRussG : that's a valid point. thanks for bringing it up.
@JAllemandou / @Milimetric : would you be able to provide a Geographic breakdown of the loss ? This is especially relevant since we need to look at US pageviews and potential impact on the 2021 December Fundraiser.

@JAllemandou A few questions to better understand the situation and the details shared so far:
a) how does the data loss break out across webrequest-text, statsv, eventlogging data, and webrequest-upload data? For example, "between 2021-06-04 and 2021-11-03 [we lost] 2.80% of webrequest-text, statsv and eventlogging data (pageview impact, eqiad datacenter only)." Were the three impacted equally? Did they each lose 2.80%? or did they each lose the same smaller percent that aggregates to 2.80%? or were they each impacted differently and together they lost 2.80%?
b) Were all data sets within those categories impacted? And were they impacted equally? Can I assume that all eventlogging data was impacted? If yes, then we'll need to consider the data loss not only in conversations on reader metrics but also editors and content.

Summary of the data loss analysis:

  • Between 2021-06-04 and 2021-11-03 we lost 2.80% of webrequest-text, statsv and eventlogging data (pageview impact, eqiad datacenter only)
  • Between 2021-11-04 and 2022-01-27 we lost 4.34% of webrequest-text, statsv and eventlogging data (pageview impact, eqiad + ulsfo datacenters only)
  • Between 2021-10-13 and 2021-11-03 we lost 1.01% of webrequest-upload data (mediarequest impact, ulsfo datacenter only)
  • Between 2021-11-04 and 2022-01-27 we lost 2.19% of webrequest-upload data (mediarequest impact, ulsfo datacenter only)

Ran a quick test query to get a by-country breakdown for pageviews on 2021-12-01. I'm not super confident about these numbers, but it looks like the data loss was almost exclusively in the Americas. Links: data, query, map.

That day, the US may have seen an error of -14% in reported pageviews [1] with most other countries in the Americas seeing similar levels of error. (Mexico is the only country in the Americas that appears to have been spared, presumably because it gets routed exclusively through codfw, I guess?)

Most of the rest of the world was unaffected.

Again, I believe we likely could estimate the correct figures for aggregated datasets, for the entire period of the outage, even though the full webrequest data prior to mid-November is now gone. It would not be a one-day task to do it properly, but if I may say so, after thinking it over, I do feel quite strongly that we should try.

This loss occurred very close to the source of all our most essential data. It impacts the data we use for decisions throughout the organization, as well as the public datasets used by wiki communities. This data is also essential for current and future research, by both the WMF and researchers elsewhere, and has immeasurable heritage value as a document about one of the world's foremost social movements.

Many thanks once again!!!!

[1] ( reported - estimated_actual ) * 100 / estimated_actual
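For concreteness, footnote [1] as code, with round invented numbers showing how reporting 86 out of an actual 100 yields the -14% error magnitude mentioned above:

```python
def reporting_error_pct(reported, estimated_actual):
    """Footnote [1]: (reported - estimated_actual) * 100 / estimated_actual."""
    return (reported - estimated_actual) * 100 / estimated_actual

print(reporting_error_pct(86, 100))  # → -14.0
```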

If it helps: https://github.com/wikimedia/operations-dns/blob/master/geo-maps

We resolve our public DNS domains to specific DCs via geo-map configuration, to spread traffic to the DCs/pops that we want. Of course there is the corner case of clients not following DNS (but hitting a specific IP, for example particular bots etc..) but it should be a tiny part of the traffic.

Adding some clarifications

@JAllemandou A few questions to better understand the situation and the details shared so far:
a) how does the data loss break out across webrequest-text, statsv, eventlogging data, and webrequest-upload data? For example, "between 2021-06-04 and 2021-11-03 [we lost] 2.80% of webrequest-text, statsv and eventlogging data (pageview impact, eqiad datacenter only)." Were the three impacted equally? Did they each lose 2.80%? or did they each lose the same smaller percent that aggregates to 2.80%? or were they each impacted differently and together they lost 2.80%?

The loss values have been computed for webrequest-text and webrequest-upload data. The two other flows, statsv and eventlogging, are served by the same hosts that serve webrequest-text, so we can expect a similar order of magnitude of loss for those two systems.

b) Were all data sets within those categories impacted? And were they impacted equally? Can I assume that all eventlogging data was impacted? If yes, then we'll need to consider the data loss not only in conversations on reader metrics but also editors and content.

As stated above, Eventlogging data was impacted, with a similar expected magnitude to webrequest-text. I don't think this impacts editor/content metrics, as I would expect them to be computed from mediawiki-history, which doesn't rely on eventlogging. Only the specific metrics relying on eventlogging have been impacted. Also, please note that Eventlogging is our legacy system, different from the new EventPlatform to which a lot has already been migrated.

I agree we should look at this loss, @AndyRussG, and estimate as accurately as possible along as many dimensions as possible. I'll brain-dump some thoughts about this and we should make a plan to analyze. I'd like our managers' support in how we resource and prioritize this work.

  • The geo-maps Luca shared above are useful, but I think mainly to validate that the other estimates make sense
  • During the days we are missing data from a given host, we can use its sibling hosts in that datacenter to estimate traffic patterns
  • Here we have to look closer at why certain hosts seem to serve a higher proportion of pageviews than their datacenter siblings (sometimes limited to a specific dimension)
  • We can look at the country breakdown by host before and after the fix restored data flow from the affected hosts
  • We should probably store some sampled data from November and December while we still have it, but we have to think this through
  • I think the more people we get together to think about this, the better, as I think so far each of us missed some key aspects of the problem

Thanks so much, @elukey, @JAllemandou, @Milimetric!

Heheh ok I guess following @Milimetric, I'll not hold back on the brain dump...

we can use its sibling hosts in that datacenter to estimate traffic patterns [...]
we have to look closer at why certain hosts seem to serve a higher proportion of pageviews than their datacenter siblings (sometimes limited to a specific dimension)

Yes, exactly! So, using data in webrequest for the time since the outage, I imagine we could create a model that predicts patterns of requests handled by the previously broken hosts, based on patterns of requests handled by the sibling hosts. (And, 100% agreed, technical details about how requests are distributed among hosts would absolutely be useful for this.)

Then we could apply that model to webrequest data for the time during the outage to estimate requests dropped by each host.

Finally, we could use timeseries models to estimate loss, broken down over selected dimensions, for the period for which webrequest data is no longer available.
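One very simplified version of such an estimator, assuming the broken host's share of datacenter traffic (as measured after the fix) was stable during the outage; all numbers here are invented:

```python
def estimate_lost_requests(sibling_counts, broken_host_share):
    """Estimate what a silent host would have handled, given its siblings'
    observed counts and the share of DC traffic it normally takes.
    observed = (1 - share) * total  =>  total = observed / (1 - share)."""
    observed = sum(sibling_counts)
    total = observed / (1 - broken_host_share)
    return total - observed

# Invented: 7 siblings saw 100 requests each; the broken host normally
# takes 1/8 of the datacenter's traffic.
print(round(estimate_lost_requests([100] * 7, 1 / 8)))  # → 100
```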

Existing streams that we know were not affected could potentially help with validation.

I ran a quick query to look at differences among hosts on one day this month (after the outage). Number of requests per host seems to vary from the mean by about +/- 11%. (In previous estimates, above, it's assumed that requests were evenly distributed among hosts.)

For eqiad, on 2022-02-05 (query):

Host                  Requests      % difference from average
cp1085.eqiad.wmnet    181062584     -10.65
cp1079.eqiad.wmnet    187249488      -7.59
cp1089.eqiad.wmnet    200323153      -1.14
cp1083.eqiad.wmnet    202850986       0.11
cp1077.eqiad.wmnet    204546391       0.94
cp1081.eqiad.wmnet    205003916       1.17
cp1087.eqiad.wmnet    214503700       5.86
cp1075.eqiad.wmnet    225531742      11.30
Average               202633995
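The per-host deviations above can be recomputed directly from the raw counts (same numbers as the table):

```python
requests = {
    "cp1085.eqiad.wmnet": 181_062_584,
    "cp1079.eqiad.wmnet": 187_249_488,
    "cp1089.eqiad.wmnet": 200_323_153,
    "cp1083.eqiad.wmnet": 202_850_986,
    "cp1077.eqiad.wmnet": 204_546_391,
    "cp1081.eqiad.wmnet": 205_003_916,
    "cp1087.eqiad.wmnet": 214_503_700,
    "cp1075.eqiad.wmnet": 225_531_742,
}
average = sum(requests.values()) / len(requests)
pct_diff = {h: round((n - average) * 100 / average, 2) for h, n in requests.items()}
print(int(average))                    # → 202633995
print(pct_diff["cp1085.eqiad.wmnet"])  # → -10.65
```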

estimate as accurately as possible along as many dimensions as possible

Yeah! So then maybe a result could be a table with coefficients to multiply reported values by, for each day, for each permutation of values of the selected dimensions?

Also, please note that Eventlogging is our legacy system, different from the new EventPlatform where a lot has already been migrated.

Ah this is important news, thanks so much! So does that mean that if an event has been migrated to use the new json schemas, it's unaffected? For example, CentralNoticeBannerHistory was migrated to a json schema, and is sent from the client using mw.eventLog.logEvent(). Would it have been affected?

I took a quick look at virtualpageviews_hourly, but I couldn't tell visually whether it had been impacted. Maybe it wasn't?

What about the beacons used for the new session length dataset?

Ah fantastic, that's super helpful!!! I also see routing for some countries differs by region, which would explain differences among countries that use ulsfo, eqiad and codfw.

For archives to be happy, here is the email I sent today about gathering data for analysis:

Hi Folks,
I'm about to gather data that will allow us to build approximations of traffic lost across various dimensions[1].

I'd like confirmation before I start:

  • Are we interested in pageview only, or the broader webrequest data?
  • Is getting daily data enough in terms of time granularity, or would hourly data be better?

As we are eager to report on this fast, I'll start my extraction job later today.
Without answers to this message, I'll use pageview data and daily granularity.

Thank you!

[1] Dimensions to extract

  • year/month/day/[hour?]
  • hostname (split into host + datacenter)
  • is_pageview (if we use webrequest)
  • access_method
  • agent_type (automated not available in webrequest)
  • country_code
  • project (as in pageview: main project + project-family / .m etc accounted for in the access_method)
  • referer_class

@JAllemandou : Thanks Joseph!
These would be great for us to investigate. Additionally, can we please add OS family, OS major, Browser family, and Browser major to the dimension list?
For product-analytics, we are fine with pageview only.
From Fundraising - I believe we would want to have the broader webrequest data so we can also see the impact to banner impressions (which is very important for fundraising to understand).

Hi! Thanks so so much @JAllemandou, @Mayakp.wiki, @JMando, @Milimetric! Hugely, hugely appreciated!!!!!!

Following discussions on other channels, just some additional notes for Fundraising... it'd be great to have the following dimensions:

  • is_banner_beacon for data on banner impressions (as you mentioned, using the same filters as here would be great).
  • device_family is also important, since Fundraising often shows different content to iPads and iPhones (which would both appear as iOS in os_family).

Also, Fundraising data cubes use hourly granularity.

Hope it's OK to share some more general thoughts, below...

  • More dimensions I'd suggest we include:
    • automated agent_type (at least for requests that were pageviews, perhaps joining on the pageview table?)
    • geocoded_data.subdivision (region), at least for countries that route to datacenters based on subdivision (US and CA). (This is because the country breakdown seems to follow the routing map very closely. Aggregate data for the US and CA looks like a more complex mix, and splitting them by subdivision I imagine is likely to produce cleaner results?)
  • +1 for hourly granularity for better post-processing options (in addition to the reasons mentioned elsewhere, though I understand there's a need to balance resource requirements).
  • +1 to using webrequest rather than pageviews, since routing to hosts within datacenters is not random, and it's not clear whether a request being the first one from a given client might not introduce some bias.
  • Not sure exactly how the data will be queried and stored, but if it would be possible to store reported request counts as well as estimated correct counts, that'd be fantastic. (I think that would help support further refinement of how we estimate loss.)

Hmmm perhaps Traffic (@BBlack?) might have some further input about routing and load balancing within datacenters? Also maybe Research (@MGerlach, @Isaac) would have input about dimensions that are important for their work?

Thanks again!

Ah more still....

  • Maybe, as well as referer_class, we could break down referrer by search engine, for correcting referrer_daily? (Though I'm not certain we'll need that breakdown to do so...? That is, if we can reliably estimate loss over all the other dimensions in that table, maybe that'll be enough?)
  • language_variant (as in pageview_hourly, so we can more accurately correct data in that table ).
  • From the X-analytics header, maybe proxy, nocookies (useful for detecting bots and proxies, which I imagine might introduce bias, for various reasons) and loggedIn?

Also, I see a sequence column in webrequest, documented as "Per-host sequence number". I don't imagine this is at all useful for further digging here? Maybe it's just for reconstructing the order in which requests were processed by each host, I guess?

Thanks again!!

Ahhhh one more idea, hope it's useful...

For a lot of countries, I see (for 2021-12-01, though it's probably consistent) the data loss is very small, often less than 0.1%. If query time and storage space are issues, maybe it'd be helpful to pull the data in two stages, with different levels of granularity: first globally, on just a few dimensions and at daily time granularity, and then run a more directed query, on all selected dimensions and at hourly granularity, but only for countries/subdivisions above a specific threshold of traffic passing through the affected datacenters?

Thanks again!!!!!!

Ahhhhhh oki so just one more thought that I guess might be worth sharing here... (Really hope this is useful, if not, many apologies, and in that case please ignore...!)

Soooo if we can find a smaller set of dimensions that represents all the variability among requests handled by hosts at an affected datacenter, I think we don't have to store data broken down further... does that make sense?

For example, hypothetically, if it were determined that, for countries/subdivisions that have more than 0.1% of their traffic passing through eqiad, the distribution of requests among eqiad hosts varies only by access_method, agent_type and referer_class, such that, after breaking down by those dimensions, all the other dimensions of interest are extremely stable across hosts... then there wouldn't be any point in storing data extracted from webrequest for those very stable dimensions, correct? Since, in that case, with country/subdivision, datacenter and the three dimensions over which host distribution does vary, we'd have all we need to model the loss and calculate correct numbers for the aggregated data sets, over all dimensions of interest?

Thanks and many apologies for the long stream of comments....... (And also many apologies if the above is already known and obvious to folks, or, conversely, if it's completely wrong......)

Change 763189 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Temporarily disable traffic data purge

https://gerrit.wikimedia.org/r/763189

Change 763189 merged by Btullis:

[operations/puppet@production] Temporarily disable traffic data purge

https://gerrit.wikimedia.org/r/763189

Based on yesterday's feedback, I have not yet started extracting data.

I have sent a patch to temporarily stop data purging (thank you @BTullis for merging) - I should have done that earlier.

I have started a job extracting daily pageviews with the dimensions discussed with @Mayakp.wiki. Once available, the data will be in hdfs:///wmf/data/wmf/pageview/dataloss_2021-12_2022-01.
This dataset contains daily aggregated pageviews with host and datacenter information, plus the discussed dimensions, between 2021-11-18 and 2022-01-31 inclusive.

About the webrequest data, I would like a more formal definition of the need before proceeding, as webrequest data is VERY big and extracting 75 days of data is costly.

@AndyRussG maybe you can combine your thoughts from above into a precise list of dimensions that Fundraising would need. To your point about limiting it to certain countries, you could look at the pageview data Joseph is extracting and the information so far and pick a set of countries you need.

I have started a job extracting daily pageviews with the dimensions discussed with @Mayakp.wiki.

Woohooo! Thanks for doing all this!!!! :) :)

About the webrequest data, I would like a more formal definition of the need before proceeding as webrequest data is VERY big, and extracting 75 days of data is costly.

@AndyRussG maybe you can combine your thoughts from above into a precise list of dimensions that Fundraising would need.

Yep, I should clarify--as far as I know, the only specific request on behalf of Fundraising for data from the webrequest table, as discussed with @JMando, would be data about banner impressions (see notes above about is_banner_beacon) with the added dimension of device_family, at hourly granularity. Perhaps @JMando can confirm?

All the rest of my comments yesterday were just more general ideas about how to study the loss in detail and reconstruct datasets as fully as possible. Sorry it was a lot, hope it wasn't distracting, and also very sorry if that wasn't clear. :)

(Also great idea about limiting to countries needed by Fundraising! Will investigate.)

Confirming what @AndyRussG said above about is_banner_beacon and device_family at hourly granularity.

However, I believe it is okay to wait on those pieces until we see findings on the overall loss by the dimensions in the data pull that has been started.

I updated the task's description with the current varnishkafka behavior, if nobody opposes I'd like to remove the old varnishkafka package from apt:

elukey@apt1001:/srv/wikimedia$ sudo reprepro lsbycomponent varnishkafka
varnishkafka | 1.0.13-1 | stretch-wikimedia |               main | amd64, source
varnishkafka | 1.0.14-1 |  buster-wikimedia |               main | amd64, source
varnishkafka |  1.1.0-1 |  buster-wikimedia | component/varnish6 | amd64, source

This way, only the version in component/varnish6 can be deployed; otherwise Puppet errors out. I'll also follow up with Traffic.

The varnishkafka package version will be handled in T302301 by the Traffic team.

The old version is gone from buster-wikimedia :)

The varnishkafka package version will be handled in T302301 by the Traffic team.

The old version is gone from buster-wikimedia :)

Thank you so much @elukey <3

I have started a job extracting daily pageviews with the dimensions discussed with @Mayakp.wiki. Once available, the data will be in hdfs:///wmf/data/wmf/pageview/dataloss_2021-12_2022-01.
This dataset contains daily aggregated pageviews with host and datacenter information, plus the discussed dimensions, between 2021-11-18 and 2022-01-31 inclusive.

Confirmed with @JAllemandou : This is completed and available in mayakpwiki.pageview_dataloss_202112_202201
Note: Data for the entire period we experienced data loss is not included in this extract due to data purging, i.e. at the time this extract was generated, data before 2021-11-18 was no longer available.

Hey Everyone, here is a full report on Product Health Timelines (pageview decline info)
For more details, pls see Pageview Decline impact due to data loss, that has specific data points for Global and US underreporting, as well as by Geos, referrer class and access method, including en6c.

Thanks @Mayakp.wiki for sharing these presentations! Is there a suggested public reference for this (ideally a wiki page or blog post, but even just a deck on Commons is fine)? I'm thinking of it as a reference for documentation etc.

@Isaac We provided updates on the data loss as a part of our monthly key metrics for February 2022 (first 2 slides). It is available on Commons: https://commons.wikimedia.org/wiki/File:February_2022_Wikimedia_movement_metrics.pdf
We've also added the summary to the upcoming Executive Weekly updates (Update March 29), but this is internal on Officewiki.

@Mayakp.wiki I don't have access to the full report, but the public slides mention 2-4% (globally) / 5-9% (USA) before Oct 3, and 5-8% (globally) / 15-21% (USA) after Nov 4. Am I correct to assume that the Nov was a typo and should be Oct 4?

For a banner project with data spanning Sept/Oct (T290387), I observed a dip in many countries on 3/4 October, and I'm trying to see if this could play a role (the dip is stronger in the US than other countries). Could it be that the October data loss is broader than the nodes affected before?

Hi @Effeietsanders , sorry about the confusion but the typo was in the slides.
Per T300164#7690173, it should've been 3 November and not 3 October.

Summary of the data loss analysis:

  • Between 2021-06-04 and 2021-11-03 we have lost 2.80% of webrequest-text, statsv and eventlogging data (pageview impact, eqiad datacenter only)
  • Between 2021-11-04 and 2022-01-27 we have lost 4.34% of webrequest-text, statsv and eventlogging data (pageview impact, eqiad + ulsfo datacenters only)
  • Between 2021-10-13 and 2021-11-03 we have lost 1.01% of webrequest-upload data (mediarequest impact, ulsfo datacenter only)
  • Between 2021-11-04 and 2022-01-27 we have lost 2.19% of webrequest-upload data (mediarequest impact, ulsfo datacenter only)

I have updated the February 2022 Wikimedia movement metrics.pdf to reflect the correct dates.
Pls let me know if there were any other public slides where you saw these dates.

@Effeietsanders I saw that you are collaborating with @MGerlach on Research. We don't yet have more details that have been vetted for external sharing, but we do have some internal data that we could share with Martin.

@MGerlach, per Effeietsanders's question, it is possible that the data loss impacted the data they are looking at with regard to the WLM campaigns. Most of the technical details are available in this ticket. The Product Analytics team has analyzed the impact specifically on pageviews and we have initial estimates (including some breakdowns by certain countries, like the US). You can reach out to @Mayakp.wiki if you have questions on that work.