
Data Lake edit data missing for many wikis
Closed, Resolved | Public | 8 Estimated Story Points

Description

The Data Lake seems to have no data for many (most?) wikis.

select wiki_db, count(*)
from wmf.mediawiki_history
where
snapshot = "2017-04" and
wiki_db in ("arwiki", "cswiki", "commonswiki", "dawiki", "enwiki", "enwiktionary", "kowiki", "itwiki", "zhwiki") and
event_timestamp >= "201701"
group by wiki_db;

wiki_db _c1
commonswiki     18474976
enwiki  22939154
dawiki  248147
3 rows selected (60.582 seconds)
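To make the gap explicit, the requested wiki list can be compared against the rows that came back. This is only a minimal sketch using the numbers from the 2017-04 result above; the helper name `missing_wikis` is hypothetical and not part of any Data Lake tooling.

```python
# Wikis named in the IN (...) clause of the query above.
REQUESTED = [
    "arwiki", "cswiki", "commonswiki", "dawiki", "enwiki",
    "enwiktionary", "kowiki", "itwiki", "zhwiki",
]

# Rows returned for snapshot = "2017-04", copied from the output above.
RETURNED = {
    "commonswiki": 18474976,
    "enwiki": 22939154,
    "dawiki": 248147,
}


def missing_wikis(requested, returned):
    """Return the requested wikis that have no row in the result, in request order."""
    return [wiki for wiki in requested if wiki not in returned]


if __name__ == "__main__":
    # Lists the wikis that were asked for but are absent from the snapshot.
    print(missing_wikis(REQUESTED, RETURNED))
```

The same check can be pointed at the 2017-02 and 2016-12_private results by swapping in their rows.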

This applies to older snapshots and older data too:

select wiki_db, count(*)
from wmf.mediawiki_history
where
snapshot = "2017-02" and
wiki_db in ("arwiki", "cswiki", "commonswiki", "dawiki", "enwiki", "enwiktionary", "kowiki", "itwiki", "zhwiki") and
event_timestamp >= "201601"
group by wiki_db;

wiki_db _c1
enwiki  79402391
commonswiki     64383940
dawiki  632861
3 rows selected (84.672 seconds)

However, the pre-Labs snapshots seem to be fine.

select wiki_db, count(*)
from wmf.mediawiki_history
where
snapshot = "2016-12_private" and
wiki_db in ("arwiki", "cswiki", "commonswiki", "dawiki", "enwiki", "enwiktionary", "kowiki", "itwiki", "zhwiki") and
event_timestamp >= "201601"
group by wiki_db;

wiki_db _c1
enwiktionary    7029529
dawiki  514631
commonswiki     54721834
zhwiki  4568247
cswiki  1466825
itwiki  8224350
arwiki  4918431
enwiki  67648716
kowiki  2847875
9 rows selected (95.034 seconds)

Event Timeline

Ping @Marostegui: do we have an ETA on when these wikis will be available on labs new db hosts?

Ping @Neil_P._Quinn_WMF: this data is in the production snapshot (we are doing both to compare). Is that data too outdated for your needs?

We are thinking by the end of FQ1, according to our roadmap.

Ok, so we can plan on this data being available in September, correct? Until then we will continue taking snapshots of production and labs.

Hmm, I don't see any production snapshots other than one from December of last year (and unfortunately that one is too outdated to use for some of my projects, like evaluating the effect of the new recent changes filters on new editor retention). Am I just missing them?

show partitions mediawiki_history;

partition
snapshot=2016-03
snapshot=2016-12_private
snapshot=2017-02
snapshot=2017-04
snapshot=2017-05
snapshot=2017-06
6 rows selected (0.047 seconds)
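A partition listing like the one above could be split into production-sourced and other snapshots with a few lines of Python. This is a rough sketch and rests on an assumption: that the `_private` suffix marks snapshots pulled from production, which the thread suggests but does not state as a rule. The function name `split_snapshots` is hypothetical.

```python
# Output of `show partitions mediawiki_history;`, copied from above.
PARTITIONS = """\
snapshot=2016-03
snapshot=2016-12_private
snapshot=2017-02
snapshot=2017-04
snapshot=2017-05
snapshot=2017-06
"""


def split_snapshots(lines):
    """Split partition lines into (private, other) snapshot names.

    Assumption: a name ending in "_private" denotes a production-sourced snapshot.
    """
    private, other = [], []
    for line in lines:
        line = line.strip()
        if not line.startswith("snapshot="):
            continue  # skip headers, blank lines, and the row-count footer
        name = line.split("=", 1)[1]
        if name.endswith("_private"):
            private.append(name)
        else:
            other.append(name)
    return private, other


if __name__ == "__main__":
    private, other = split_snapshots(PARTITIONS.splitlines())
    print("private:", private)
    print("other:", other)
```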

Ping @Marostegui: do we have an ETA on when these wikis will be available on labs new db hosts? Is it still end of Q1?

@Neil_P._Quinn_WMF The team agreed to only do prod snapshots ad hoc, as we hope they will not be needed for much longer. If we have availability we might be able to do one this week (cc @Milimetric); keep in mind that a snapshot takes a couple of days to complete, though.

We are currently importing the last wikis (s7):

arwiki
cawiki
centralauth
fawiki
frwiktionary
eswiki
hewiki
huwiki
kowiki
metawiki
rowiki
ukwiki
viwiki

The rest of the wikis are already available, check: T153743

Excellent, so we can count on all being available by end of this quarter!

We are hoping to have them available in around 2 weeks or less (keep in mind that we have Wikimania in between)

@Neil_P._Quinn_WMF The team agreed to only do prod snapshots ad hoc, as we hope they will not be needed for much longer. If we have availability we might be able to do one this week (cc @Milimetric); keep in mind that a snapshot takes a couple of days to complete, though.

Okay, I didn't realize this. From your comment above (the only communication I've had about this), it sounded like you were planning to do these snapshots proactively.

Yes, I would like a full snapshot that includes all wikis. Without it, I'm not able to do some planned projects like analyzing new user retention. I can definitely put that work off for 2 weeks (although of course I would prefer not to :), but if I'd have to wait until next quarter, that's much more disruptive for me.

@Neil_P._Quinn_WMF we are going to be short 2 people on the team in the upcoming days, so I cannot promise we can get this done in the next couple of days. We will try our best.

Thanks, I appreciate that. Like I said, I can wait longer than a few weeks if it's necessary, as long as I know about it up front so that I can rearrange my plans and give accurate timelines to people who want analysis.

@Neil_P._Quinn_WMF

The labs snapshot (with the newer wikis, but not all of them) is about to start. All going well, it will take 3 to 4 days to finish. Even if everything goes well (and with these massive jobs it is not rare that we have to restart to fix things), the soonest we will be able to provide a prod snapshot is two weeks from now, and that is Wikimania week, so I would recommend you plan for the prod snapshot not being ready for several weeks. A contingency plan would be good.

Change 369408 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] Enable newly available wikis for sqooping

https://gerrit.wikimedia.org/r/369408

Change 369408 merged by Milimetric:
[analytics/refinery@master] Enable newly available wikis for sqooping

https://gerrit.wikimedia.org/r/369408

Change 369409 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] [WIP] DO NOT MERGE UNTIL THESE WIKIS ARE IMPORTED

https://gerrit.wikimedia.org/r/369409

Ok, I just deployed the list for all but the 12 wikis that are still being imported. That means the next reconstruction will have data from everything except those 12. (Centralauth isn't used in reconstruction.)

Hm, @Marostegui I need some help. I ran the sqoop job to import from all the wikis except the ones you mentioned in T165233#3486500. I get access denied errors:

mysql:s53272@labsdb-analytics.eqiad.wmnet [enwiki_p]> use jawiki_p;
ERROR 1044 (42000): Access denied for user 's53272'@'%' to database 'jawiki_p'
mysql:s53272@labsdb-analytics.eqiad.wmnet [enwiki_p]> use cswiki_p;
ERROR 1044 (42000): Access denied for user 's53272'@'%' to database 'cswiki_p'
mysql:s53272@labsdb-analytics.eqiad.wmnet [enwiki_p]> use enwikiquote;
ERROR 1044 (42000): Access denied for user 's53272'@'%' to database 'enwikiquote'

Does that user maybe need to be set up on those dbs? The full list of the ones that failed:

bgwiki
bgwiktionary
cswiki
enwikiquote
enwiktionary
eowiki
fiwiki
frwiki
idwiki
itwiki
jawiki
nlwiki
nowiki
plwiki
ptwiki
ruwiki
svwiki
thwiki
trwiki
zhwiki

And they're predictably basically all the ones I added recently: https://gerrit.wikimedia.org/r/#/c/369408/1/static_data/mediawiki/grouped_wikis/labs_grouped_wikis.csv
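For a longer run, the denied databases could be pulled out of the client output rather than collected by hand. The sketch below parses the three error lines quoted above with a regular expression; the pattern and the function name `denied_databases` are illustrative assumptions, not part of the sqoop job or of refinery.

```python
import re

# MySQL client output, copied from the comment above.
LOG = """\
mysql:s53272@labsdb-analytics.eqiad.wmnet [enwiki_p]> use jawiki_p;
ERROR 1044 (42000): Access denied for user 's53272'@'%' to database 'jawiki_p'
mysql:s53272@labsdb-analytics.eqiad.wmnet [enwiki_p]> use cswiki_p;
ERROR 1044 (42000): Access denied for user 's53272'@'%' to database 'cswiki_p'
mysql:s53272@labsdb-analytics.eqiad.wmnet [enwiki_p]> use enwikiquote;
ERROR 1044 (42000): Access denied for user 's53272'@'%' to database 'enwikiquote'
"""

# Matches an "access denied to database" error and captures the database name.
DENIED = re.compile(
    r"^ERROR 1044 \(42000\): Access denied for user '[^']*'@'[^']*' "
    r"to database '([^']+)'"
)


def denied_databases(log_text):
    """Return the database names from ERROR 1044 lines, in order of appearance."""
    names = []
    for line in log_text.splitlines():
        match = DENIED.match(line)
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    print(denied_databases(LOG))
```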

@Milimetric can you try jawiki_p again for instance?

@Marostegui: thank you, I confirm the problem is addressed. In the meantime, we decided to just run the 2017-07_private snapshot, @Neil_P._Quinn_WMF. This should be done sometime this weekend and available Monday, with all available wikis pulled from production. And we'll re-run the 2017-07 snapshot from labs later this month when @Marostegui tells us the rest of the wikis are imported and ready to be used. (At that time I'll be on leave, so you'll be talking to @JAllemandou.) Thanks all!

Thanks @Milimetric - I will fix all the pending wikis you've listed along with jawiki_p above on Monday.
s7 has been imported to all the labs servers already, I need to run some extra checks on Monday before making them visible :-)

No rush but just a heads up to @Neil_P._Quinn_WMF, the 2017-07_private job failed and I'm not sure what's wrong.

Thanks for confirming @Milimetric - I have now fixed all the wikis listed at: T165233#3498662.
Please try again and let me know if it all works now

@Milicevic01 @JAllemandou @Nuria we have finished importing all the production shards into the new labs infra: T153743#3505593

No rush but just a heads up to @Neil_P._Quinn_WMF, the 2017-07_private job failed and I'm not sure what's wrong.

Thanks for the heads up! Because of Wikimania, I won't have any chance to use it this week anyway, so it's definitely fine if you want to wait until next week or so to troubleshoot.

I think I worked out the bugs, should be ready soon unless something else goes wrong.

Good news, the 2017-07_private snapshot finished. I will now start the 2017-07 snapshot process, and if there are no problems that should be available before the end of Wikimania as well.

Change 369409 merged by Milimetric:
[analytics/refinery@master] Enable all wikis to sqoop from labs

https://gerrit.wikimedia.org/r/369409

Hi,

Is there anything pending here?

@Marostegui: We're waiting for the August run to happen (in the first days of September) before closing, but we expect everything to be OK.

Looks like it is all good:

select wiki_db, count(*)
from wmf.mediawiki_history
where
snapshot = "2017-08" and
wiki_db in ("arwiki", "cswiki", "commonswiki", "dawiki", "enwiki", "enwiktionary", "kowiki", "itwiki", "zhwiki") and
event_timestamp >= "2017"
group by wiki_db;

Returns:

wiki_db _c1
enwiktionary 5864481
dawiki 436517
commonswiki 37066408
zhwiki 3842614
cswiki 879174
itwiki 6274804
arwiki 2779956
enwiki 45446114
kowiki 1704058

Closing ticket

Nuria set the point value for this task to 8.

@Nuria, yes, the queries where I originally discovered this work now. Thank you!