
disk failure on labsdb1002
Closed, ResolvedPublic

Description

[21555223.116628] XFS (dm-1): xfs_log_force: error 5 returned.

[21554975.965623] scsi 7:0:1:0: [sdc] Unhandled error code

Feb 14 22:24:55 labsdb1002 kernel: [21554975.965750] XFS (dm-1): Log I/O Error Detected. Shutting down filesystem

re: T126942

https://gerrit.wikimedia.org/r/#/c/270650/

https://phabricator.wikimedia.org/rOPUP266761623ba8ca20bbf53c6425761118c57ae9da

Event Timeline

chasemp assigned this task to Cmjohnson.
chasemp raised the priority of this task from to High.
chasemp updated the task description. (Show Details)
chasemp subscribed.

labsdb1002 is a Cisco server. I can pull a disk out of one of the
decommissioned servers and replace it, but we should really start working on
replacing this server.

The problem I am having is figuring out which disk is /dev/sdc.

Cmjohnson: See my comment on T118174#2062707

Any plan or progress on this?

Basically I was using user databases there, and my tools stopped working because those databases were gone when *wiki.labsdb was switched. I have file dumps of those databases and I can restore them to the new server, but I'm afraid the db host will be switched back and my tools will stop working again.

More seriously, if the disk contents were not lost on the old host and *wiki.labsdb gets pointed back to the old host at some point, my tools may resume from a state they have already moved past, because their progress is stored on the new host...

Suggestions for my current situation are appreciated.

@liangent, the user databases were lost and cannot be recovered.

Hmm, I'm now trying to restore the dumps to the current hosts, but I can't do so (ERROR 1227 (42000) at line 67: Access denied; you need (at least one of) the SUPER privilege(s) for this operation). Can anyone execute this for me?

mysql -h commonswiki.labsdb p50380g50497__wikidb_commonswiki < /data/project/liangent-php/sql/dumps/20160214000020/commonswiki.sql
mysql -h wikidatawiki.labsdb p50380g50497__wikidb_wikidatawiki < /data/project/liangent-php/sql/dumps/20160214000020/wikidatawiki.sql
mysql -h zhwiki.labsdb p50380g50497__wikidb_zhwiki < /data/project/liangent-php/sql/dumps/20160214000020/zhwiki.sql

I happily would, but I would prefer to actually solve the issue for you permanently so you can self-serve.

One tip before I research what is failing:

  • It is probable that commons and wikidata will be moved around by default (but the data will not) once c2 is replaced. Sticking to a particular machine (c1, c3) will allow you to import and export that data reliably. It is ok to use the logical name, but the default machine can be moved at any time (and you will need to reimport).

@liangent Your problem is that your script is trying to execute CREATE TRIGGER with DEFINER=`root`@`208.80.154.151`, for which you have no rights.

Removing the DEFINER with something like:

sed -e 's/DEFINER[ ]*=[ ]*[^*]*\*/\*/' < sql/dumps/20160214000020/commonswiki.sql | mysql ... [mysql parameters]

should work for you.
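
For reference, the full pipeline would look something like the following (a sketch only; the dump path, credentials file, and database name are copied from the commands quoted earlier in this ticket, so adjust them to your own):

# Strip DEFINER clauses from the dump and pipe the result straight into the target database.
sed -e 's/DEFINER[ ]*=[ ]*[^*]*\*/\*/' \
    < /data/project/liangent-php/sql/dumps/20160214000020/commonswiki.sql \
    | mysql --defaults-file=$HOME/replica.my.cnf -h commonswiki.labsdb p50380g50497__wikidb_commonswiki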

Also, please use a different ticket for importing issues, as this is offtopic.

Thanks. It will not be a reimport but a move, so I'll have to take care of host names every time... There's no way to ensure reliability I think, unless you have prev.commonswiki.labsdb and prev2.commonswiki.labsdb etc.

Just for a note, the next command I'll issue is:

mysqldump --defaults-file=$HOME/replica.my.cnf -h c3.labsdb p50380g50497__wikidb_commonswiki | sed -e 's/DEFINER[ ]*=[ ]*[^*]*\*/\*/' | mysql --defaults-file=$HOME/replica.my.cnf -h commonswiki.labsdb p50380g50497__wikidb_commonswiki
echo drop database p50380g50497__wikidb_commonswiki\; | mysql --defaults-file=$HOME/replica.my.cnf -h c3.labsdb

This is the last post in this ticket here...

@chasemp any updates on this disk?

I didn't have this on my radar.

Is the status still:

The problem I am having is figuring out which disk is /dev/sdc.

I took this to mean this server was not coming back into service?

Cmjohnson: See my comment on T118174#2062707

@jcrespo can you help determine the way forward here? Do we need to pursue figuring out which disk is the faulty one here for replacement?

We already have the replacement racked; it is pending installation and reprovisioning (there are issues with that).

Change 278268 had a related patch set uploaded (by Jcrespo):
Configure labsdb1008 for the first time

https://gerrit.wikimedia.org/r/278268

Change 278268 merged by Jcrespo:
Configure labsdb1008 for the first time

https://gerrit.wikimedia.org/r/278268

jcrespo edited projects, added DBA; removed Patch-For-Review, DC-Ops, ops-eqiad.
jcrespo moved this task from Triage to In progress on the DBA board.
jcrespo added a subscriber: Cmjohnson.

List of tables to reimport:

1. abuse_filter
2. abuse_filter_action
3. abuse_filter_log
4. aft_article_answer
5. aft_article_answer_text
6. aft_article_feedback
7. aft_article_feedback_properties
8. aft_article_feedback_ratings_rollup
9. aft_article_feedback_select_rollup
10. aft_article_field
11. aft_article_field_group
12. aft_article_field_option
13. aft_article_filter_count
14. aft_article_revision_feedback_ratings_rollup
15. aft_article_revision_feedback_select_rollup
16. archive
17. article_feedback
18. article_feedback_pages
19. article_feedback_properties
20. article_feedback_ratings
21. article_feedback_revisions
22. article_feedback_stats
23. article_feedback_stats_types
24. category
25. categorylinks
26. change_tag
27. ep_articles
28. ep_cas
29. ep_courses
30. ep_events
31. ep_instructors
32. ep_oas
33. ep_orgs
34. ep_revisions
35. ep_students
36. ep_users_per_course
37. externallinks
38. filearchive
39. flaggedimages
40. flaggedpage_config
41. flaggedpage_pending
42. flaggedpages
43. flaggedrevs
44. flaggedrevs_promote
45. flaggedrevs_statistics
46. flaggedrevs_stats
47. flaggedrevs_stats2
48. flaggedrevs_tracking
49. flaggedtemplates
50. geo_tags
51. global_block_whitelist
52. hitcounter
53. image
54. imagelinks
55. interwiki
56. ipblocks
57. iwlinks
58. l10n_cache
59. langlinks
60. localisation
61. localisation_file_hash
62. logging
63. mark_as_helpful
64. math
65. module_deps
66. msg_resource_links
67. oldimage
68. page
69. page_props
70. page_restrictions
71. pagelinks
72. pagetriage_log
73. pagetriage_page
74. pagetriage_page_tags
75. pagetriage_tags
76. pif_edits
77. povwatch_log
78. povwatch_subscribers
79. protected_titles
80. recentchanges
81. redirect
82. revision
83. site_identifiers
84. site_stats
85. sites
86. tag_summary
87. templatelinks
88. transcode
89. updatelog
90. updates
91. user
92. user_former_groups
93. user_groups
94. user_properties
95. wikigrok_claims
96. wikigrok_questions
97. wikigrok_responses
98. wikilove_image_log
99. wikilove_log

The following tables on enwiki have already been imported:

73350868	logging
51366344	archive
68959980	article_feedback
49353723	updates
39641126	page
33369615	article_feedback_revisions
27806343	user
27636591	page_props
26014665	pagetriage_page_tags
23210895	langlinks
19587823	article_feedback_properties
19161458	iwlinks
17053050	change_tag
15107916	abuse_filter_log
11530228	tag_summary
9695235	recentchanges
7529728	redirect
7500460	article_feedback_pages
2923618	math
2789547	filearchive
2420811	category
1631231	aft_article_answer
1594954	geo_tags
1506119	pagetriage_page
1467662	pagetriage_log
1359974	ipblocks
992023	pif_edits
988652	flaggedrevs_promote
968537	aft_article_feedback
899159	aft_article_filter_count
894364	image
817934	flaggedrevs
757963	flaggedrevs_tracking
501330	ep_events
452112	article_feedback_stats
415074	aft_article_revision_feedback_ratings_rollup
288558	wikigrok_questions
264543	flaggedrevs_statistics
167162	aft_article_feedback_ratings_rollup
129368	oldimage
127461	flaggedpage_pending
100212	aft_article_feedback_properties
98679	page_restrictions
93098	wikilove_log
59058	module_deps
51313	l10n_cache
50400	protected_titles
21818	user_groups
17589	wikilove_image_log
17012	aft_article_answer_text
14804	ep_users_per_course
13584	aft_article_revision_feedback_select_rollup
12727	ep_students
8703	ep_articles
4455	user_former_groups
4440	flaggedpage_config
4360	flaggedpages
3796	msg_resource_links
3055	ep_revisions
2363	mark_as_helpful
2046	transcode
1992	aft_article_feedback_select_rollup
980	localisation
855	sites
774	localisation_file_hash
754	abuse_filter
681	ep_courses
621	abuse_filter_action
584	site_identifiers
301	ep_orgs
58	ep_oas
49	ep_cas
29	updatelog
25	global_block_whitelist
17	pagetriage_tags
17	aft_article_field
4	article_feedback_ratings
4	aft_article_field_option
2	flaggedrevs_stats
2	article_feedback_stats_types
1	site_stats
1	povwatch_subscribers
1	povwatch_log
1	interwiki
1	flaggedtemplates
1	flaggedrevs_stats2
1	flaggedimages
1	ep_instructors
1	aft_article_field_group
0	wikigrok_responses
0	wikigrok_claims
0	hitcounter

These are pending:

1975408295	pagelinks
649773310	revision
610990450	templatelinks
119323634	externallinks
103124031	categorylinks
92999329	user_properties
80917852	imagelinks

If someone feels brave while I am gone, they can run on neodymium:

./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 0 and 4999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 5000000 and 9999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 10000000 and 14999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 15000000 and 19999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 20000000 and 24999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 25000000 and 29999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 30000000 and 34999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 35000000 and 39999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 40000000 and 44999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 45000000 and 49999999"
./import_from_production.sh 1 enwiki imagelinks "il_from < 0 or il_from > 49999999"

./import_from_production.sh 1 enwiki user_properties "up_user BETWEEN 0 and 4999999"
./import_from_production.sh 1 enwiki user_properties "up_user BETWEEN 5000000 and 9999999"
./import_from_production.sh 1 enwiki user_properties "up_user BETWEEN 10000000 and 14999999"
./import_from_production.sh 1 enwiki user_properties "up_user BETWEEN 15000000 and 19999999"
./import_from_production.sh 1 enwiki user_properties "up_user BETWEEN 20000000 and 24999999"
./import_from_production.sh 1 enwiki user_properties "up_user BETWEEN 25000000 and 29999999"
./import_from_production.sh 1 enwiki user_properties "up_user < 0 or up_user > 29999999"

./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 0 and 4999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 5000000 and 9999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 10000000 and 14999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 15000000 and 19999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 20000000 and 24999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 25000000 and 29999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 30000000 and 34999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 35000000 and 39999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 40000000 and 44999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 45000000 and 49999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from < 0 or cl_from > 49999999"
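
The chunked calls above follow a regular pattern, so a small loop along these lines could generate them (a sketch only, assuming import_from_production.sh takes the shard, wiki, table, and WHERE condition exactly as in the commands shown; imagelinks is used here, and the same pattern applies to user_properties and categorylinks with their respective columns and cut-offs):

# Sketch: print the same 5M-row chunked invocations as above for imagelinks.
for start in $(seq 0 5000000 45000000); do
  end=$((start + 4999999))
  echo "./import_from_production.sh 1 enwiki imagelinks \"il_from BETWEEN $start and $end\""
done
echo "./import_from_production.sh 1 enwiki imagelinks \"il_from < 0 or il_from > 49999999\""
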
jcrespo mentioned this in Unknown Object (Task). May 11 2016, 7:27 AM

These also have been imported:

610990450	templatelinks
119323634	externallinks
103124031	categorylinks
92999329	user_properties
80917852	imagelinks

Revision table is ongoing now, but it has 700 M rows and it takes almost half a day to import and filter 10 million of them.

Is there any update on the status of this? On 23 May, the revision table was in progress and was expected to take ~12 hours. The pagelinks table is about 3X larger and so might be expected to take ~36 hours. But this was eight days ago, so even if the estimates were off by a factor of 2 (or 4), the process could have completed by now.

I'm sorry, what? When did I say it was going to take 12 hours? My last estimation was:

Revision table is ongoing now, but it has 700 M rows and it takes almost half a day to import and filter 10 million of them.

It takes around 6 hours to import 5 million rows, and I have imported 110 million of those so far. A more reasonable estimation would be 1 month for revision (roughly 700 M rows at 5 M rows per 6 hours works out to about 35 days).

Sorry; it appears that I must have stopped reading before the end of the sentence. :-*>

So, if importing revision will take roughly a month, that means that pagelinks will take another three months, more or less?

Not necessarily; it is probably a narrower table and will not have as many metadata issues as revision.

I noticed this bug just before opening a new bug report about cswiki_p missing virtually all revisions and categorylinks from between 2016-03-08 18:00 and 21:00 UTC, which spoils the results of a tool of mine that is supposed to find non-categorized articles (articles in no categories other than, possibly, some hidden ones).

I would be happy to know if cswiki_p is also on the schedule and if there is an estimate for when this will be fixed. And also whether there is any other way I could work around this in the meantime, other than having to check the results against the live API. Thank you.

It is scheduled.

It is difficult to give an estimation, but it can be done after enwiki is finished, so 3-6 months? Labs hosts, by their own nature, cannot and will probably never be 100% consistent, due to filtering happening at any time, and your code should be ready to handle those exceptions.

You can (and should) use the dumps -which contain all revisions- to do large-scale analysis/analytics-like queries. They are available locally on labs; here is the manual about it: https://wikitech.wikimedia.org/wiki/Help:Shared_storage#.2Fpublic.2Fdumps
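
As a rough sketch of that approach (hedged: the exact layout under /public/dumps and the dump file names should be verified against the page linked above, and the target host and user database name here are only placeholders):

# Sketch: load a categorylinks table dump from the shared dumps mount into a user database.
# The path follows the wikitech page above; u1234__cswiki_categorylinks_copy is a hypothetical name.
mysql --defaults-file=$HOME/replica.my.cnf -h c1.labsdb \
    -e "CREATE DATABASE IF NOT EXISTS u1234__cswiki_categorylinks_copy"
zcat /public/dumps/public/cswiki/latest/cswiki-latest-categorylinks.sql.gz \
    | mysql --defaults-file=$HOME/replica.my.cnf -h c1.labsdb u1234__cswiki_categorylinks_copy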

Thank you. I did not realize there were also SQL dumps, not only XML. Would it perhaps be possible to have the latest dump readily available on an SQL server? That could be an alternative for anyone who can sacrifice recency for accuracy. Needing to periodically install new dumps into my own user tables feels like adding too much extra work (that might break in the process) for it to be feasible, plus it is possibly redundant if more users need to access the same data.

I might give those SQL dumps a try, but using them has two major drawbacks:

  1. You lose the live connection with your audience (i.e. Wikipedians), which is vital if you want to make any difference. Without regular updates, people do not see the progress of their collaborative effort and are therefore much less motivated to continue (categorizing uncategorized pages, in this case). This is not one-time research, and without my tool's data being effectively used by the community in near real time, there is no need for the tool to exist in the first place.
  2. You lose collaborators, because you lose the ease of a Quarry query. Every Wikipedian can fork and execute a query by clicking buttons and copy the result into a wiki page. In this way, eager users can update the list of work to do at any time, as often as they want. As a developer, I do not have to set up and maintain a dedicated tool account and can concentrate only on the query logic itself – and forget it once the community has been told about it, which is very important so that I can go on developing other utilities. I thought this was one of Quarry's raisons d'être, and access to live (and accurate) database replicas is the single most important feature for which I like Tool Labs so much.

Thanks for the estimate, but half a year is way too long to be worth waiting for – I will probably already be working on something else by then and my Village pump call to action will long since have been archived. I understand that the amount of data to handle is huge, but if the replicas cannot be relied upon for a long time, this puts tasks like this one back into the old days when there were no live replicas at all :-(

@Blahma what do you want me to do? I can load those files, but they would be out of sync as soon as they are imported, and impossible to keep updated. You can load those tables onto the same database server that I am importing the replica ones to, with little to no difference.

If you want to speed up the process you could compare the replicas to those SQL exports and I will be happy to apply a patch if you provide one.
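
One possible shape for such a comparison (a sketch only, assuming the dump has been loaded into a user database as sketched above; host and file names are illustrative):

# Sketch: find categorylinks rows present in the dump copy but missing from the replica.
mysql --defaults-file=$HOME/replica.my.cnf -h cswiki.labsdb cswiki_p -N \
    -e "SELECT cl_from, cl_to FROM categorylinks" | sort > replica_cl.tsv
mysql --defaults-file=$HOME/replica.my.cnf -h c1.labsdb u1234__cswiki_categorylinks_copy -N \
    -e "SELECT cl_from, cl_to FROM categorylinks" | sort > dump_cl.tsv
comm -13 replica_cl.tsv dump_cl.tsv > missing_cl.tsv   # rows only in the dump copy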

I think you are making too big a deal of losing 3 hours of data (what is that, 100? 1000? records - we have 3000 million of those on enwiki alone) - it is again your code that should be able to handle gaps like that, the same way there are intended gaps for privacy reasons or any other potential incidents. Even production regularly loses *link table data for multiple reasons -downtimes, parsing errors, job errors, code errors- so you cannot ever rely on it being accurate. Meanwhile, fixing your code to work around missing rows ("which spoils the results of a tool of mine") or using the suggested workarounds "feels like adding too much extra work" :-/

Like in any open source project, please either be patient so things can be fixed in a proper long-term way or provide a patch I can review and apply.

@jcrespo Thanks for staying constructive. FYI, the output in question is https://cs.wikipedia.org/wiki/Wikipedie:%C3%9Adr%C5%BEba/Nekategorizovan%C3%A9_%C4%8Dl%C3%A1nky_s_ohledem_na_skryt%C3%A9_kategorie produced by a query listed on https://cs.wikipedia.org/wiki/Diskuse_k_Wikipedii:%C3%9Adr%C5%BEba/Nekategorizovan%C3%A9_%C4%8Dl%C3%A1nky_s_ohledem_na_skryt%C3%A9_kategorie – users work on categorizing pages from the list and removing items that have been categorized in the meantime (perhaps by another user), and they find it annoying when I or another developer run a new update which restores some items that should not be there, just because the replicas lack the relevant categorylinks entries. I hope this explains why even a few missing table rows can be annoying.

A confirmed workaround is to remove the categories from the affected articles, save, and then immediately revert, but I was questioned by a patroller after doing this once, so I did not dare go on with more. Simply purging with forcelinkupdate does not seem to help. If we could somehow force all categorylinks on cswiki to be regenerated in some other way, such as by running a maintenance script on production, that would solve my problem. Otherwise, I will be happy to identify the missing rows and produce a patch for you to merge into the replica - thanks for suggesting that! (I originally thought that this was what you were already doing and it could not be sped up in any way.)

This is very easy to fix - tell your users to mark those that are incorrect, and exclude them from your query - that is very easy to do and doesn't require waiting.

This is very easy to fix - tell your users to mark those that are incorrect, and exclude them from your query - that is very easy to do and doesn't require waiting.

Is this a joke? The latest "solution" to the constant stream of data integrity issues on Wikimedia Labs database replicas is to further inconvenience volunteers?

The latest "solution" to the constant stream of data integrity issues on Wikimedia Labs database replicas is to further inconvenience volunteers?

No, the latest solution is the reimport I am already doing; that suggestion was a workaround because Blahma could not wait.

A confirmed workaround is to remove the categories from the affected articles, save, and then immediately revert, but I was questioned by a patroller after doing this once, so I did not dare go on with more.

I thought that's the usual workaround for the issue. Just go for it, and if you are ever questioned, point to this ticket.

labsdb1002 itself has been decommissioned; the tracking of the new server setup (and the labs fix in general) will be done at T140452.