
Run a 1-off sqoop over the new labsdb servers
Closed, ResolvedPublic8 Estimated Story Points

Description

See parent task T152788.
This one is about the first test.
It requires synchronisation with @MoritzMuehlenhoff to open network access ( T155487).
The plan is to sqoop all available wikis, and measure the load on the servers since they are not yet used by real labs users.

For this we would need some technical details:

  • A user allowed to read the dbs (probably easy to get)
  • The db locations (which project on which server)
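Once those two details are known, the one-off import could take a shape like the following minimal sketch. Every concrete value here (host, user, password file, paths) is a placeholder assumption, not a real credential or location:

```shell
# Hypothetical shape of the per-wiki sqoop import; host, user, password
# file and target paths are all placeholders.
build_sqoop_cmd() {
  local wiki_db="$1"
  local host="${LABSDB_HOST:-labsdb.placeholder.eqiad.wmnet}"
  echo "sqoop import --connect jdbc:mysql://${host}/${wiki_db}" \
       "--username research_ro --password-file /user/hdfs/mysql.pw" \
       "--table revision --target-dir /wmf/data/raw/${wiki_db}/revision"
}
build_sqoop_cmd enwiki
```

In practice one such invocation would be generated per wiki database, which is why the user and location details above are the blocking questions.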

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jan 18 2017, 7:16 PM
Nuria added a comment. · Jan 19 2017, 5:04 PM

Steps:

  • Technical information of dbs from chase
  • Synchronization with Moritz to open access to the labs db hosts from the analytics network, since sqooping happens from Hadoop
  • We are going to try to get everything in one pass, as the DBs are not yet in use

This process will inform how we do recurrent updates

Nuria set the point value for this task to 8. · Jan 19 2017, 5:05 PM
elukey added a subscriber: elukey. · Jan 20 2017, 12:42 PM

Moritz's patch (https://gerrit.wikimedia.org/r/#/c/332457/) has been abandoned after a chat with Jaime. We should not open ports on the labsdb replicas but use labsdb-analytics.eqiad.wmnet directly. This is a CNAME to one of the labsdb replica proxies, which should be ready for our use case (modulo networking ACLs that might interfere).

Some remarks from Jaime:

  • Keep him and Manuel in the loop since the new labsdb is fairly new and they can ease our pain to make everything work with some tips.
  • Test labsdb-analytics.eqiad.wmnet with extreme care:
13:06  <jynus> if the server goes down, it will take weeks to recover
13:06  <jynus> because we cannot do gtids yet on labsdbs
elukey@analytics1030:~$ telnet labsdb-analytics.eqiad.wmnet 3306
Trying 10.64.37.14...

Seems like a network ACL is preventing access.

It's not blocked by ferm, so this in fact needs network ACL changes.
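The hang-vs-refusal distinction used in this diagnosis is worth automating when bisecting this kind of problem. A small sketch (bash-specific `/dev/tcp`; the 3-second timeout is an arbitrary choice): a timeout suggests packets are being silently dropped by an ACL, while an immediate error means the packet got through and was rejected by the host.

```shell
# Distinguish "filtered by an ACL" (connect hangs until timeout kills it,
# exit 124) from "reachable but closed" (immediate refusal). Bash-only.
check_port() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
  case $? in
    0)   echo "open" ;;
    124) echo "filtered (likely a network ACL dropping packets)" ;;
    *)   echo "closed or unreachable" ;;
  esac
}
# e.g. check_port labsdb-analytics.eqiad.wmnet 3306
```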

elukey added a subscriber: faidon. · Jan 20 2017, 5:33 PM

So far I have followed what @faidon suggested, namely:

  1. Checking input filter ACLs on cr1/cr2 for the ports related to labsdb-analytics.eqiad.wmnet and analytics10XX.eqiad.wmnet (using show route $hostname to find the ports).
  2. Running tcpdump on both labsdb-analytics.eqiad.wmnet and analytics10XX.eqiad.wmnet, trying to find anomalies. telnet labsdb-analytics.eqiad.wmnet 3306 run on analytics1030.eqiad.wmnet hangs, and I can't see any SYN packet logged on labsdb-analytics.eqiad.wmnet.

Are there any outbound rules for the Analytics VLAN by any chance? I can see the following on cr1:

show configuration firewall family inet filter analytics-in4
[...]
term mysql {
    from {
        destination-address {
            10.64.16.36/32;
            10.64.16.9/32;
            10.64.0.166/32;
            10.64.48.18/32;
            10.64.16.20/32;
            10.64.0.165/32;
            10.64.16.35/32;
            10.64.16.148/32;
            10.192.32.19/32;
        }
        protocol tcp;
        destination-port 3306;
    }
    then accept;
}
[...]

That should be an inbound traffic rule for one of the interfaces routing traffic to analytics1030. It doesn't make much sense to me, since these seem to be outbound rules (filtering traffic from the analytics VLAN to some mysql databases). I tried to telnet to these IPs from analytics1030 and everything works.

So I am fairly ignorant on this subject and I'll wait for some network master to teach me :)


I was indeed :)

After a chat with Faidon I realized that these ACLs are applied to a router's port, so inbound traffic (input filters ACLs) applies to traffic directed TO the port from whatever host/switch is attached to it. Adding dbproxy1010's IP to the term mysql list allowed the traffic!
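For the record, the change amounts to adding one destination address to the existing term; in Junos syntax that would look roughly like the sketch below (the IP is a placeholder, not dbproxy1010's actual address):

```
set firewall family inet filter analytics-in4 term mysql from destination-address 10.64.x.y/32
commit confirmed
```

Using `commit confirmed` on a router ACL change is a common safety net: the change auto-rolls-back unless confirmed, which matters when editing filters on a production router.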

Last step is to verify with Jaime or Manuel if we need to whitelist dbproxy1011 too, since it might be dbproxy1010's backup if it fails (remember that we are going to use a CNAME, labsdb-analytics.eqiad.wmnet, that currently points to dbproxy1010 but that might change).


added dbproxy1011 as well, network whitelist work done.

Change 334042 had a related patch set uploaded (by Joal):
Update sqoop script with labsdb specificity

https://gerrit.wikimedia.org/r/334042

Follow-up:

  • sqoop-tool labs tool created for db credentials
  • sqoop launched on wiki groups 1 (enwiki) and 13 (lots of very small projects)

Seems to be working so far.
On load: ~25Mb/s of data is sent from labsdb1009, CPU load is ~0.5. Everything seems ok.
It might even be fun to try to raise the parallelisation (for now, not parallelised).
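If parallelisation is raised later, sqoop's mapper count is the natural knob. A sketch under the assumption (stated later in this thread) that each import process holds two connections, so the mapper count should stay at half the per-user connection cap:

```shell
# Derive a safe sqoop --num-mappers value from the per-user connection cap,
# assuming two connections per import process (cap 10 -> parallelism 5).
mappers_for_cap() {
  echo $(( $1 / 2 ))
}
echo "sqoop import ... --num-mappers $(mappers_for_cap 10)"
```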

I do not mind labsdb-web being open, but please do not use it, as it should be reserved for web-like requests. If needed, I think I could provide even load balancing to labsdb-analytics to double the throughput. Tuning is still very much in beta.

@jaime: I currently use labsdb-analytics, so I have no idea how and which machine got chosen.

It works !

Some takeaways:

  • Only two differences in one schema: the archive table is missing ar_content_format and ar_content_model
  • Some projects are not available (expected) - see the list below.
  • Parallelization is limited by the user-concurrent-connection limit (currently 10). @jcrespo / @chasemp - Is this a negotiable parameter?

@elukey: Quick question about the network: is the hole we opened potentially open forever, or will it be patched back so that we need to find another solution?

List of missing projects (from our own list built from the site matrix):

fawiki
frwiktionary
nlwiki
nowiki
ptwiki
idwiki
eswiki
plwiki
cswiki
itwiki
bgwiki
fiwiki
viwiki
wikidatawiki
ukwiki
rowiki
labtestwiki
enwikiquote
jawiki
zhwiki
thwiki
dewiki
eowiki
trwiki
huwiki
svwiki
cawiki
commonswiki
ruwiki
labswiki
hewiki
frwiki
metawiki
enwiktionary
kowiki
arwiki
bgwiktionary
  • Parallelization is limited by the user-concurrent-connection limit (currently 10). @jcrespo / @chasemp - Is this a negotiable parameter?

Open to helping determine the load impact however I can, but I think this is most directly something the DBA crew should dictate.

@elukey: Quick question about the network: is the hole we opened potentially open forever, or will it be patched back so that we need to find another solution?

Should be ok to keep the ACL indefinitely, I don't see any issue with it :)

@jcrespo / @Marostegui : Questions for you guys:

  • On the list of projects above that are not present in labsdb-analytics (checked this morning, the list is the same):
    • can you tell me if some of them will never end up in labs?
    • can you give me an ETA, for those that will be included, as to when we expect them to be accessible?
  • In the table/view archive, I noticed a difference in schema between production and labsdb: the ar_content_format and ar_content_model fields are not present in labs. Will that somehow be changed, or is this the expected behaviour?
  • On load impact: can we discuss raising the number of concurrent connections for user s53272, in order to access data faster, or is this not a good idea?

Thanks !

Hello!!

@jcrespo / @Marostegui : Questions for you guys:

  • On the list of projects above that are not present in labsdb-analytics (checked this morning, the list is the same):
    • can you tell me if some of them will never end up in labs?

So far we have only imported s1 and s3, and we are in the process of importing s4. The idea is to have as many shards as possible (or all of them) there. You might want to subscribe to: T153743

  • can you give me an ETA, for those that will be included, as to when we expect them to be accessible?

It is not an easy process, but we want to start moving forward again. s4 will be next.
We have been wanting to do this for some weeks now, but there have been lots of things in the middle:

  • code freeze
  • holidays
  • all hands

And once we were back from all hands, we had to respond to some higher-priority issues, like the enwiki switchover due to a faulty switch (T155875) or all the phabricator crashes...

  • In the table/view archive, I noticed a difference in schema between production and labsdb: the ar_content_format and ar_content_model fields are not present in labs. Will that somehow be changed, or is this the expected behaviour?

Yes, for some tables the production and labs schemas can differ because of the privacy and data-sanitization process. Remember, there is data in production that is private and should not be in labs.
Maybe @chasemp or @yuvipanda can provide more details for this specific table, as I don't know all the history behind it.

  • On load impact: can we discuss raising the number of concurrent connections for user s53272, in order to access data faster, or is this not a good idea?

Thanks !

The problem I can see here is that if you are able to get more connections and pull data faster, that might impact performance for other users, as their processes might get slowed down.
A quick select over labsdb1009 shows that there are no exceptions on the number of connections users get. However, on labsdb1003 I don't see a connection limit for that same user. @jcrespo @chasemp was this decided for the new labs infra?
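The kind of "quick select" referred to here could look like the following on MariaDB (illustrative sketch only; a value of 0 in max_user_connections means there is no per-user cap, so only the global limit applies):

```
-- Sketch: inspect the per-user connection cap for the sqoop user.
SELECT User, Host, max_user_connections
FROM mysql.user
WHERE User = 's53272';
```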

So far we have only imported s1 and s3, and we are in the process of importing s4. The idea is to have as many shards as possible (or all of them) there. You might want to subscribe to: T153743

I follow that task, but it doesn't help me understand which project DBs will be present once shards are added :)
Unfortunately I don't know our projects' DB architecture, and therefore can't see which projects are related to which shard.
Is there a page referencing this?

Yes, for some tables the production and labs schemas can differ because of the privacy and data-sanitization process. Remember, there is data in production that is private and should not be in labs.
Maybe @chasemp or @yuvipanda can provide more details for this specific table, as I don't know all the history behind it.

Double ping @chasemp / @yuvipanda : any idea as to why we don't have those fields (they don't look private to me, but I might be misunderstanding something)?

The problem I can see here is that if you are able to get more connections and pull data faster, that might impact performance for other users, as their processes might get slowed down.
A quick select over labsdb1009 shows that there are no exceptions on the number of connections users get. However, on labsdb1003 I don't see a connection limit for that same user. @jcrespo @chasemp was this decided for the new labs infra?

The idea would be for me to test loading the infra before it's available widely, in order to know how much is good / not good. But for that I need more connections (with 10, knowing that each of my processes uses 2, I can only parallelize by 5, which is not very much - there was no perceptible load on the servers when I was pulling data, while network-wise it could be seen).

Thanks again :)

I follow that task, but it doesn't help me understand which project DBs will be present once shards are added :)
Unfortunately I don't know our projects' DB architecture, and therefore can't see which projects are related to which shard.
Is there a page referencing this?

So, to give you some idea of what is probably coming next:

s4: commonswiki
s5: wikidatawiki and dewiki
s2:
arwiki
cawiki
centralauth
eswiki
fawiki
frwiktionary
heartbeat
hewiki
huwiki
kowiki
metawiki
mysql
rowiki
test
ukwiki
viwiki

s6:
frwiki
jawiki
ruwiki

s7:
arwiki
cawiki
centralauth
eswiki
fawiki
frwiktionary
heartbeat
hewiki
huwiki
kowiki
metawiki
mysql
rowiki
test
ukwiki
viwiki

You can check wmf-config/db-eqiad.php for the list of specific wikis per shard; whatever is not in a specific shard is on s3 (which is already imported by default).
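The mapping above can be condensed into a toy lookup for scripting purposes. This is not authoritative (wmf-config/db-eqiad.php remains the source of truth), and since the s2 and s7 lists are shown identically above, those wikis are tagged ambiguously here; everything unlisted falls through to s3, as noted.

```shell
# Toy wiki -> shard lookup distilled from the lists in this thread.
shard_for() {
  case "$1" in
    enwiki) echo "s1" ;;
    commonswiki) echo "s4" ;;
    wikidatawiki|dewiki) echo "s5" ;;
    frwiki|jawiki|ruwiki) echo "s6" ;;
    arwiki|cawiki|centralauth|eswiki|fawiki|frwiktionary|hewiki|huwiki|kowiki|metawiki|rowiki|ukwiki|viwiki) echo "s2/s7" ;;
    *) echo "s3" ;;  # default shard per the comment above
  esac
}
shard_for commonswiki   # s4
```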

The idea would be for me to test loading the infra before it's available widely, in order to know how much is good / not good. But for that I need more connections (with 10, knowing that each of my processes uses 2, I can only parallelize by 5, which is not very much - there was no perceptible load on the servers when I was pulling data, while network-wise it could be seen).

How many connections do you think you need, and for how long?

You can check wmf-config/db-eqiad.php for the list of specific wikis per shard; whatever is not in a specific shard is on s3 (which is already imported by default).

Thanks a lot for the info, this really helps


How many connections do you think you need, and for how long?

I'd love to go from 10 to 50 connections, to test how much load it puts on the machines when I start parallelizing the dumping.
Obviously this ends after the test, once the correct level of parallelization is decided.
Do you think this is acceptable?

jcrespo added a comment. · Edited · Feb 2 2017, 3:52 PM

testing how much load it puts on the machines

I would block that until we have gtid on the slaves. If for some reason the slaves crash before gtid_domain_id is deployed, that would mean starting the import process from the beginning. I would leave the last word on that to @Marostegui, as he would be the first person affected by that issue.

Once gtid is deployed, a crash would be a non-issue, again, in my opinion.

I mentioned this in a previous quote:

13:06 <jynus> if the server goes down, it will take weeks to recover
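For context, the replication state jcrespo is waiting on can be checked per-replica on MariaDB with something like the following sketch (a gtid_domain_id of 0 is the server default, i.e. not yet explicitly configured):

```
-- Sketch: check whether GTID is configured on a MariaDB replica.
SELECT @@global.gtid_domain_id;
SHOW GLOBAL VARIABLES LIKE 'gtid%';
```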

Updated list of wikis not yet imported onto the new servers, and where they'll be:

wiki | shard
fawiki | s2
frwiktionary | s2
nlwiki | s2
nowiki | s2
ptwiki | s2
idwiki | s2
eswiki | s2
plwiki | s2
cswiki | s2
itwiki | s2
bgwiki | s2
fiwiki | s2
viwiki | s2
wikidatawiki | s5
ukwiki | s2
rowiki | s2
labtestwiki | labtestweb2001 ???
enwikiquote | s2
jawiki | s6
zhwiki | s2
thwiki | s2
dewiki | s5
eowiki | s2
trwiki | s2
huwiki | s2
svwiki | s2
cawiki | s2
commonswiki | s4
ruwiki | s6
labswiki | silver ???
hewiki | s2
frwiki | s6
metawiki | s2
enwiktionary | s2
kowiki | s2
arwiki | s2
bgwiktionary | s2

I found everything I was looking for, except for labtestwiki and labswiki.

testing how much load it puts on the machines

I would block that until we have gtid on the slaves. If for some reason the slaves crash before gtid_domain_id is deployed, that would mean starting the import process from the beginning. I would leave the last word on that to @Marostegui, as he would be the first person affected by that issue.

Once gtid is deployed, a crash would be a non-issue, again, in my opinion.

That is an extremely good point and I totally agree.
@JAllemandou for your information: today @jcrespo and I discussed going ahead with deploying that variable, as it's been tested on our misc shards for weeks now without any issue. So we are thinking of slowly starting to deploy it in our production shards, maybe starting next week.

This is the related task: T149418

There are no plans to import labswiki or labtestwiki; those are special wikis that are not part of the main cluster.

No plans does not mean it will never happen, but it is difficult at this moment (unlike the other wikis, which are scheduled to be done soon) and would require a separate ticket for the request.

Thanks a lot again @Marostegui and @jcrespo for your answers.
I understand the GTID thing (at least how it can impact us), and I'm completely happy to wait until this is done :)

@chasemp / @yuvipanda : Another question for you guys - there seems to be no rev_text_id in the DBs (0 everywhere). Can you explain why, and whether this is something that could be changed?
Thanks !

Recap before closing and creating new actions:

  • The Analytics VLAN had to be granted access to the new labsdb hosts - thanks again @elukey and @MoritzMuehlenhoff.
  • After the network was opened, the process worked almost out of the box - small changes were needed in both the sqoop and scala code, but very minor ones
  • Load on the SQL machines seemed very acceptable (to be confirmed with @jcrespo and @Marostegui )
  • Differences in terms of number of rows in imported tables are very small (less than 0.1%, except for ipblocks, logging and user):

Table | Difference between labs and prod (percent)
archive | -0.07%
ipblocks | -5.76%
logging | -0.74%
page | -0.05%
revision | 0.01%
user | -0.14%
user_groups | -0.03%

reading: the archive table in labsdb is 0.07% smaller than in prod

  • Differences in terms of errors while reconstructing and events reconstructed are small, except for user parsing:

Metric | Difference between labs and prod (percent)
User history parsing errors | 1441.42%
Unmatched user history events | 0.65%
Page history parsing errors | 3.12%
Page history unmatched events | -0.16%
Archived revisions with delete times | -0.07%
Union-ed revisions | 0.00%
Users with User data | -0.15%
Pages with User data | -0.03%
Denormalized revisions, pages and users union | -0.03%

reading: user history parsing errors are 1441.42% higher for labsdb data than for prod

  • Numbers above are for ~600 small wikis, but are very similar for enwiki.
  • Parsing problems come from using nullified columns in the logging table (sanitization process)
  • Another issue, oriented towards potential future dumps generation is that labsdb revision table doesn't contain rev_text_id.
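The percentage columns in the tables above follow the convention spelled out in the reading notes, i.e. (labs - prod) / prod * 100, so a negative value means the labs copy is smaller. A quick sketch of that arithmetic:

```shell
# (labs - prod) / prod * 100: negative => the labs copy has fewer rows.
pct_diff() {
  awk -v l="$1" -v p="$2" 'BEGIN { printf "%.2f", (l - p) / p * 100 }'
}
pct_diff 9993 10000   # a labs table 7 rows short of 10000 prod rows -> -0.07
```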

Do you happen to have some graph links so we can take a close look at how the servers behave?
Or a rough estimate of the day/time when you were doing your tests so we can check the graphs?

Thanks!

@chasemp / @yuvipanda : Another question for you guys - there seems to be no rev_text_id in the DBs (0 everywhere). Can you explain why, and whether this is something that could be changed?
Thanks !

  • Another issue, oriented towards potential future dumps generation is that labsdb revision table doesn't contain rev_text_id.

Hmmm, the applicable setup for the view is

https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/role/templates/labs/db/views/maintain-views.yaml;7678fccb897fd6912a75e3a162d338eb01a31193$320-326

if(rev_deleted&1,null,rev_text_id) as rev_text_id

going to have to ask the good DBA people to help shed some light


@JAllemandou I talked with @jcrespo about this briefly, and our conclusion was to split this consideration off into another task. If you guys could put up a changeset with logic that would result in it being correctly populated, we can reason about the outcomes. We would then get someone from legal and/or security to look at it. I'm not sure of the why of this exact logic to begin with; I need someone with mediawiki context knowledge to review and instruct, I think.

Do you happen to have some graph links so we can take a close look at how the servers behave?
Or a rough estimate of the day/time when you were doing your tests so we can check the graphs?

Thanks!

@Marostegui : I looked at the server board from 2017-01-25 to 2017-01-27, using network as an approximation of when things were happening.
Is that good enough?

@Marostegui : I looked at the server board from 2017-01-25 to 2017-01-27, using network as an approximation of when things were happening.
Is that good enough?

That is good - thanks! Looks like you only hit labsdb1009.

Looking at the same time and day for the rest of the labs servers (1001, 1003), you generated almost the same (or more) network traffic, and it was pretty much just you using the server :-). We are nowhere near network saturation there, at least summing up your usage plus the normal 1001/1003 usage as per the graphs now.
In terms of IOPS you generated a big initial spike, which is around 1/10 of the current steady usage we have on the other labs servers.

Seeing these numbers and going back to your previous comment about increasing connections from 10 to 50: does that mean you'd be increasing the network traffic/iops by 5x?
Remember we are still blocked by the gtid_domain_id deployment we discussed above.

Thanks!

Seeing these numbers and going back to your previous comment about increasing connections from 10 to 50: does that mean you'd be increasing the network traffic/iops by 5x?

I hope it would :) The idea is indeed to test and watch.
To repeat: this would be a test, with the objective of finding a correct, reliable and maintainable value for the limit :)

Remember we are still blocked by the gtid_domain_id deployment we discussed above.

I am waiting for you to tell me when I can continue playing, no pressure :)

Thanks again :)

Seeing these numbers and going back to your previous comment about increasing connections from 10 to 50: does that mean you'd be increasing the network traffic/iops by 5x?

I hope it would :) The idea is indeed to test and watch.
To repeat: this would be a test, with the objective of finding a correct, reliable and maintainable value for the limit :)

Sure :-)

Remember we are still blocked by the gtid_domain_id deployment we discussed above.

I am waiting for you to tell me when I can continue playing, no pressure :)

Yeah, we found something worrying during the last gtid_domain_id deployment: T149418#3004834
I will keep you posted anyway!

Thanks for understanding!!

Change 337793 had a related patch set uploaded (by Joal):
Add new fields to archive_p view in labsdb

https://gerrit.wikimedia.org/r/337793


Just porting a comment from the review for clarity:

Per @jcrespo

Those fields are not in use on production (I blocked that), and they will be done properly (deleted) later in the year: https://www.mediawiki.org/wiki/User:Brion_VIBBER/Compacting_the_revision_table_round_2#Provisional


Noted ! We'll fake them while waiting for the (big!) schema change.
Thanks @jcrespo and @chasemp

Change 337793 abandoned by Joal:
Add new fields to archive_p view in labsdb

Reason:
As per @jcrespo comment, will fake having them waiting for future schema change.

https://gerrit.wikimedia.org/r/337793

Change 334042 merged by Milimetric:
Update sqoop script with labsdb specificity

https://gerrit.wikimedia.org/r/334042

Last bit of checking - In metrics computed from reconstructed history.

metric | % difference in rows | % difference in metric
daily_edits | 0.279892 | -0.001966
daily_edits_by_anonymous_users | 1.022705 | -0.401228
daily_edits_by_bot_users | 0.340350 | 0.001282
daily_edits_by_registered_users | 0.497532 | 0.060303
daily_unique_anonymous_editors | 0.764458 | 0.012528
daily_unique_bot_editors | 0.004341 | -0.001030
daily_unique_page_creators | 0.380557 | 0.156646
daily_unique_registered_editors | 0.376493 | 0.052439
monthly_new_editors | 1.103066 | 4.510354
monthly_newly_registered | 1.462593 | 0.615933
monthly_surviving_new_editors | 1.107147 | 0.041126

The number of rows having differences is small overall (~1%).
Differences account for a very small portion of the data (less than 0.1% for 7 metrics, less than 1% for 3), except for one metric: monthly_new_editors.
This percentage is high because, for small wikis, even a small difference makes a big change in comparison.

Reviewed by @Milimetric and @mforns --> Let's move forward and productionize history reconstruction from labs :)

Nuria closed this task as Resolved. · Mar 8 2017, 8:03 PM