disk failure on labsdb1002
Closed, ResolvedPublic

Description

[21555223.116628] XFS (dm-1): xfs_log_force: error 5 returned.

[21554975.965623] scsi 7:0:1:0: [sdc] Unhandled error code

Feb 14 22:24:55 labsdb1002 kernel: [21554975.965750] XFS (dm-1): Log I/O Error Detected. Shutting down filesystem

re: T126942

https://gerrit.wikimedia.org/r/#/c/270650/

https://phabricator.wikimedia.org/rOPUP266761623ba8ca20bbf53c6425761118c57ae9da

chasemp created this task. · Feb 14 2016, 11:15 PM
chasemp updated the task description. (Show Details)
chasemp raised the priority of this task from to High.
chasemp assigned this task to Cmjohnson.
chasemp added projects: Operations, Labs, ops-eqiad.
chasemp added a subscriber: chasemp.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Feb 14 2016, 11:15 PM

labsdb1002 is a Cisco server. I can pull a disk out of one of the decommissioned servers and replace it, but we should really start working on replacing this server.

The problem I am having is figuring out which disk is /dev/sdc.
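For what it's worth, two generic ways to map a device node to a physical bay, assuming smartmontools and ledmon are available on the host and the controller exposes the drive (a sketch, not a verified procedure for this chassis):

smartctl -i /dev/sdc          # print the drive's model and serial number, to match against the bay label
ledctl locate=/dev/sdc        # blink the locate LED on that drive's bay, if the enclosure supports it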

Cmjohnson: See my comment on T118174#2062707

liangent added a subscriber: liangent. Edited · Feb 28 2016, 4:42 AM

Any plan or progress on this?

Basically I'm using user databases there, and my tools stopped working because those databases were gone when *wiki.labsdb was switched. I have file dumps of those databases and I can restore them to the new server, but I'm afraid the db host will be switched back and my tools will stop working again.

More seriously, if the disk contents were not lost on the old host and *wiki.labsdb gets pointed back to it at some point, my tools may resume from a state they have already processed, because their progress markers are stored on the new host...

Suggestions for my current situation are appreciated.

@liangent, user databases were lost and cannot be recovered.

liangent added a comment. Edited · Feb 29 2016, 8:44 AM

@liangent, user databases were lost and cannot be recovered.

Hmm, I'm now trying to restore the dumps to the current hosts, but I can't (ERROR 1227 (42000) at line 67: Access denied; you need (at least one of) the SUPER privilege(s) for this operation). Can anyone execute this for me?

mysql -h commonswiki.labsdb p50380g50497__wikidb_commonswiki < /data/project/liangent-php/sql/dumps/20160214000020/commonswiki.sql
mysql -h wikidatawiki.labsdb p50380g50497__wikidb_wikidatawiki < /data/project/liangent-php/sql/dumps/20160214000020/wikidatawiki.sql
mysql -h zhwiki.labsdb p50380g50497__wikidb_zhwiki < /data/project/liangent-php/sql/dumps/20160214000020/zhwiki.sql

I happily would, but I would prefer to actually solve the issue for you forever, so you can self-serve.

One tip before I research what is failing:

  • It is probable that commons and wikidata will be moved around by default (but the data will not) once c2 is replaced. Sticking to a particular machine (c1, c3) will allow you to import and export that data reliably. It is OK to use the logical name, but the default machine can be moved at any time (and you would then need to reimport); see the example below.
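For example, to pin your work to a specific machine rather than the floating alias (a sketch using the host names and credentials file that appear elsewhere in this ticket):

mysql --defaults-file=$HOME/replica.my.cnf -h c1.labsdb p50380g50497__wikidb_commonswiki            # pinned to one machine
mysql --defaults-file=$HOME/replica.my.cnf -h commonswiki.labsdb p50380g50497__wikidb_commonswiki   # logical alias; may move at any time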

@liangent Your problem is that your script is trying to execute CREATE TRIGGER with DEFINER=`root`@`208.80.154.151`, for which you have no rights.

Removing the DEFINER with something like:

sed -e 's/DEFINER[ ]*=[ ]*[^*]*\*/\*/' < sql/dumps/20160214000020/commonswiki.sql | mysql ... [mysql parameters]

should work for you.
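To illustrate what that expression rewrites, a trigger header in a mysqldump file looks roughly like this (the exact formatting varies with the mysqldump version):

# before: /*!50003 CREATE*/ /*!50017 DEFINER=`root`@`208.80.154.151`*/ /*!50003 TRIGGER ...
# after:  /*!50003 CREATE*/ /*!50017 */ /*!50003 TRIGGER ...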

Also, please use a different ticket for importing issues, as this is off-topic.

Thanks. It will not be a reimport but a move, so I'll have to take care of host names every time... There's no way to ensure reliability, I think, unless you have prev.commonswiki.labsdb, prev2.commonswiki.labsdb, etc.

Just for a note, the next command I'll issue is:

mysqldump --defaults-file=$HOME/replica.my.cnf -h c3.labsdb p50380g50497__wikidb_commonswiki | sed -e 's/DEFINER[ ]*=[ ]*[^*]*\*/\*/' | mysql --defaults-file=$HOME/replica.my.cnf -h commonswiki.labsdb p50380g50497__wikidb_commonswiki
echo drop database p50380g50497__wikidb_commonswiki\; | mysql --defaults-file=$HOME/replica.my.cnf -h c3.labsdb

This is the last post in this ticket here...

chasemp updated the task description. (Show Details) · Mar 5 2016, 12:15 AM
chasemp set Security to None.

@chasemp any updates on this disk?

@chasemp any updates on this disk?

I didn't have this on my radar.

Is the status:

The problem I am having is figuring out which disk is /dev/sdc.

I took this to mean this server was not coming back into service?

Cmjohnson: See my comment on T118174#2062707

@jcrespo can you help determine the way forward here? Do we need to pursue figuring out which disk is the faulty one, for replacement?

@jcrespo can you help determine the way forward here? Do we need to pursue figuring out which disk is the faulty one, for replacement?

We already have the replacement in the rack; it is pending installation and reprovisioning (there are issues with that).

Change 278268 had a related patch set uploaded (by Jcrespo):
Configure labsdb1008 for the first time

https://gerrit.wikimedia.org/r/278268

Change 278268 merged by Jcrespo:
Configure labsdb1008 for the first time

https://gerrit.wikimedia.org/r/278268

jcrespo claimed this task. · Mar 18 2016, 12:30 PM
jcrespo edited projects, added DBA; removed Patch-For-Review, DC-Ops, ops-eqiad.
jcrespo moved this task from Triage to In progress on the DBA board.
jcrespo added a subscriber: Cmjohnson.

List of tables to reimport:

abuse_filter
abuse_filter_action
abuse_filter_log
aft_article_answer
aft_article_answer_text
aft_article_feedback
aft_article_feedback_properties
aft_article_feedback_ratings_rollup
aft_article_feedback_select_rollup
aft_article_field
aft_article_field_group
aft_article_field_option
aft_article_filter_count
aft_article_revision_feedback_ratings_rollup
aft_article_revision_feedback_select_rollup
archive
article_feedback
article_feedback_pages
article_feedback_properties
article_feedback_ratings
article_feedback_revisions
article_feedback_stats
article_feedback_stats_types
category
categorylinks
change_tag
ep_articles
ep_cas
ep_courses
ep_events
ep_instructors
ep_oas
ep_orgs
ep_revisions
ep_students
ep_users_per_course
externallinks
filearchive
flaggedimages
flaggedpage_config
flaggedpage_pending
flaggedpages
flaggedrevs
flaggedrevs_promote
flaggedrevs_statistics
flaggedrevs_stats
flaggedrevs_stats2
flaggedrevs_tracking
flaggedtemplates
geo_tags
global_block_whitelist
hitcounter
image
imagelinks
interwiki
ipblocks
iwlinks
l10n_cache
langlinks
localisation
localisation_file_hash
logging
mark_as_helpful
math
module_deps
msg_resource_links
oldimage
page
page_props
page_restrictions
pagelinks
pagetriage_log
pagetriage_page
pagetriage_page_tags
pagetriage_tags
pif_edits
povwatch_log
povwatch_subscribers
protected_titles
recentchanges
redirect
revision
site_identifiers
site_stats
sites
tag_summary
templatelinks
transcode
updatelog
updates
user
user_former_groups
user_groups
user_properties
wikigrok_claims
wikigrok_questions
wikigrok_responses
wikilove_image_log
wikilove_log

The following tables on enwiki have already been imported:

73350868	logging
51366344	archive
68959980	article_feedback
49353723	updates
39641126	page
33369615	article_feedback_revisions
27806343	user
27636591	page_props
26014665	pagetriage_page_tags
23210895	langlinks
19587823	article_feedback_properties
19161458	iwlinks
17053050	change_tag
15107916	abuse_filter_log
11530228	tag_summary
9695235	recentchanges
7529728	redirect
7500460	article_feedback_pages
2923618	math
2789547	filearchive
2420811	category
1631231	aft_article_answer
1594954	geo_tags
1506119	pagetriage_page
1467662	pagetriage_log
1359974	ipblocks
992023	pif_edits
988652	flaggedrevs_promote
968537	aft_article_feedback
899159	aft_article_filter_count
894364	image
817934	flaggedrevs
757963	flaggedrevs_tracking
501330	ep_events
452112	article_feedback_stats
415074	aft_article_revision_feedback_ratings_rollup
288558	wikigrok_questions
264543	flaggedrevs_statistics
167162	aft_article_feedback_ratings_rollup
129368	oldimage
127461	flaggedpage_pending
100212	aft_article_feedback_properties
98679	page_restrictions
93098	wikilove_log
59058	module_deps
51313	l10n_cache
50400	protected_titles
21818	user_groups
17589	wikilove_image_log
17012	aft_article_answer_text
14804	ep_users_per_course
13584	aft_article_revision_feedback_select_rollup
12727	ep_students
8703	ep_articles
4455	user_former_groups
4440	flaggedpage_config
4360	flaggedpages
3796	msg_resource_links
3055	ep_revisions
2363	mark_as_helpful
2046	transcode
1992	aft_article_feedback_select_rollup
980	localisation
855	sites
774	localisation_file_hash
754	abuse_filter
681	ep_courses
621	abuse_filter_action
584	site_identifiers
301	ep_orgs
58	ep_oas
49	ep_cas
29	updatelog
25	global_block_whitelist
17	pagetriage_tags
17	aft_article_field
4	article_feedback_ratings
4	aft_article_field_option
2	flaggedrevs_stats
2	article_feedback_stats_types
1	site_stats
1	povwatch_subscribers
1	povwatch_log
1	interwiki
1	flaggedtemplates
1	flaggedrevs_stats2
1	flaggedimages
1	ep_instructors
1	aft_article_field_group
0	wikigrok_responses
0	wikigrok_claims
0	hitcounter

These are pending:

1975408295	pagelinks
649773310	revision
610990450	templatelinks
119323634	externallinks
103124031	categorylinks
92999329	user_properties
80917852	imagelinks

If someone feels brave while I am gone, they can run this on neodymium:

./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 0 and 4999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 5000000 and 9999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 10000000 and 14999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 15000000 and 19999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 20000000 and 24999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 25000000 and 29999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 30000000 and 34999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 35000000 and 39999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 40000000 and 44999999"
./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN 45000000 and 49999999"
./import_from_production.sh 1 enwiki imagelinks "il_from < 0 or il_from > 49999999"

./import_from_production.sh 1 enwiki user_properties "up_user BETWEEN 0 and 4999999"
./import_from_production.sh 1 enwiki user_properties "up_user BETWEEN 5000000 and 9999999"
./import_from_production.sh 1 enwiki user_properties "up_user BETWEEN 10000000 and 14999999"
./import_from_production.sh 1 enwiki user_properties "up_user BETWEEN 15000000 and 19999999"
./import_from_production.sh 1 enwiki user_properties "up_user BETWEEN 20000000 and 24999999"
./import_from_production.sh 1 enwiki user_properties "up_user BETWEEN 25000000 and 29999999"
./import_from_production.sh 1 enwiki user_properties "up_user < 0 or up_user > 29999999"

./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 0 and 4999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 5000000 and 9999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 10000000 and 14999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 15000000 and 19999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 20000000 and 24999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 25000000 and 29999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 30000000 and 34999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 35000000 and 39999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 40000000 and 44999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from BETWEEN 45000000 and 49999999"
./import_from_production.sh 1 enwiki categorylinks "cl_from < 0 or cl_from > 49999999"
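All of the chunked invocations above follow the same pattern; for instance, the imagelinks batch could equally be generated with a loop like this (same script and arguments as above; the other two tables work analogously):

for start in $(seq 0 5000000 45000000); do
    end=$((start + 4999999))
    ./import_from_production.sh 1 enwiki imagelinks "il_from BETWEEN $start and $end"
done
# catch-all for rows outside the chunked ranges
./import_from_production.sh 1 enwiki imagelinks "il_from < 0 or il_from > 49999999"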
jcrespo moved this task from In progress to Next on the DBA board. · Apr 1 2016, 4:17 PM
jcrespo moved this task from Next to In progress on the DBA board. · May 4 2016, 9:50 AM
jcrespo mentioned this in Unknown Object (Task). · May 11 2016, 7:27 AM

These have also been imported:

610990450	templatelinks
119323634	externallinks
103124031	categorylinks
92999329	user_properties
80917852	imagelinks

The revision table import is ongoing now, but it has 700 M rows and it takes almost half a day to import and filter 10 million of them.

Is there any update on the status of this? On 23 May, the revision table was in progress and was expected to take ~12 hours. The pagelinks table is about 3X larger and so might be expected to take ~36 hours. But this was eight days ago, so even if the estimates were off by a factor of 2 (or 4), the process could have completed by now.

I'm sorry, what? When did I say it was going to take 12 hours? My last estimation was:

The revision table import is ongoing now, but it has 700 M rows and it takes almost half a day to import and filter 10 million of them.

It takes around 6 hours to import 5 million rows, and I have imported 110 million of them so far. A more reasonable estimate would be 1 month for revision.
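(As a rough sanity check on that estimate: about 590 of the 700 million rows remain; at 5 million rows per ~6 hours, that is roughly 118 more batches, i.e. on the order of 29-30 days.)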

russblau added a comment. Edited · Jun 1 2016, 8:48 PM

Sorry; it appears that I must have stopped reading before the end of the sentence. :-*>

So, if importing revision will take roughly a month, that means that pagelinks will take another three months, more or less?

Not necessarily; it is probably a narrower table and will not have as many metadata issues as revision.

Blahma added a subscriber: Blahma. · Jun 9 2016, 9:29 PM

I just noticed this bug, right before opening a new report about cswiki_p missing virtually all revisions and categorylinks from between 2016-03-08 18:00 and 21:00 UTC, which spoils the results of a tool of mine that is supposed to find non-categorized articles (articles in no categories other than, possibly, some hidden ones).

I would be happy to know whether cswiki_p is also on the schedule and whether there is an estimate for when this will be fixed. Also, is there any other way I could work around this in the meantime, other than checking the results against the live API? Thank you.

It is scheduled.

It is difficult to give an estimate, but it can be done after enwiki is finished, so 3-6 months? Labs hosts, by their own nature, cannot and will probably never be 100% consistent, due to filtering happening at any time, and your code should be ready to handle those exceptions.

You can (and should) use the dumps -which contain all revisions- to do large-scale analysis/analytics-like queries. They are available locally on labs; here is the manual about it: https://wikitech.wikimedia.org/wiki/Help:Shared_storage#.2Fpublic.2Fdumps
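For example, from a labs/Tools host the dumps are mounted on the shared filesystem (path per the manual linked above; the exact directory layout may differ):

ls /public/dumps/public/cswiki/          # list the available cswiki dump runs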

Thank you. I did not realize there were also SQL dumps, not only XML. Would it perhaps be possible to have the latest dump readily available on an SQL server? That could be an alternative for anyone who can sacrifice recency for accuracy. Needing to periodically load new dumps into my own user tables feels like too much extra work (which might break in the process) for it to be feasible, plus it is possibly redundant if more users need to access the same data.

I might give those SQL dumps a try, but using them has two major drawbacks:

  1. You lose the live connection with your audience (i.e. Wikipedians), which is vital if you want to make any difference. Without regular updates, people do not see the progress of their collaborative effort and are therefore much less motivated to continue (categorizing uncategorized pages, in this case). This is not one-time research, and without my tool's data being effectively used by the community in near real time, there is no need for the tool to exist in the first place.
  2. You lose collaborators - because you lose the ease of a Quarry query. Every Wikipedian can fork and execute a query by clicking buttons and copy the result into a wiki page. In this way, eager users can update the list of work to do at any time, as often as they want. I as a developer do not have to set up and maintain a dedicated tool account and can concentrate only on the query logic itself – and forget it once the community has been told about it, which is very important so that I can go on developing other utilities. I thought this was one of Quarry's raisons d'être, and access to live (and accurate) database replicas is the single most important feature for which I like Tool Labs so much.

Thanks for the estimate, but half a year is far too long to be worth waiting for – I will probably already be working on something else by that time, and my Village pump call to action will long since have been archived. I understand that the amount of data to handle is huge, but if the replicas cannot be relied upon for a long time, this puts tasks like this one back into the old days when there were no live replicas at all :-(

@Blahma what do you want me to do? I can load those files, but they would be out of sync as soon as they are imported, and impossible to keep updated. You can load those tables onto the same database server that I am importing the replica tables to, with little-to-no difference.

If you want to speed up the process you could compare the replicas to those SQL exports and I will be happy to apply a patch if you provide one.
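A minimal sketch of what such a comparison could look like, assuming the dump's categorylinks file has been loaded into a hypothetical scratch user database (myuser__scratch here is illustrative) on the same server that serves cswiki_p:

# hypothetical: load the dump's categorylinks into a scratch user database
mysql --defaults-file=$HOME/replica.my.cnf -h c1.labsdb myuser__scratch < cswiki-categorylinks.sql
# list rows present in the dump but absent from the replica
echo "SELECT d.cl_from, d.cl_to
      FROM myuser__scratch.categorylinks d
      LEFT JOIN cswiki_p.categorylinks r
        ON r.cl_from = d.cl_from AND r.cl_to = d.cl_to
      WHERE r.cl_from IS NULL;" |
    mysql --defaults-file=$HOME/replica.my.cnf -h c1.labsdb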

I think you are making too much of a deal over losing 3 hours of data (what is that, 100? 1,000 records? We have 3,000 million of those on enwiki alone). It is, again, your code that should be able to handle gaps like that, the same way there are intended gaps for privacy reasons or any other potential incidents. Even production regularly loses *link table data for multiple reasons -downtimes, parsing errors, job errors, code errors- so you cannot ever rely on them being accurate. Meanwhile, fixing your code to work around missing rows ("which spoils the results of a tool of mine") or using the suggested workarounds "feels like adding too much extra work" :-/

Like in any open source project, please either be patient so things can be fixed in a proper, long-term way, or provide a patch I can review and apply.

@jcrespo Thanks for staying constructive. FYI, the output in question is https://cs.wikipedia.org/wiki/Wikipedie:%C3%9Adr%C5%BEba/Nekategorizovan%C3%A9_%C4%8Dl%C3%A1nky_s_ohledem_na_skryt%C3%A9_kategorie produced by a query listed on https://cs.wikipedia.org/wiki/Diskuse_k_Wikipedii:%C3%9Adr%C5%BEba/Nekategorizovan%C3%A9_%C4%8Dl%C3%A1nky_s_ohledem_na_skryt%C3%A9_kategorie – users work on categorizing pages from the list, removing items that have been categorized in the meantime (perhaps by another user), and they find it annoying when I or another developer run a new update which restores items that should not be there, just because the replicas lack the relevant categorylinks entries. I hope this explains why even a few missing table rows can be annoying.

A confirmed workaround is to remove the categories from the affected articles, save, and then immediately revert, but I was questioned by a patroller after doing this once, so I did not dare go on with more. Simply purging with forcelinkupdate does not seem to help. If we could somehow force all categorylinks on cswiki to be regenerated in some other way, such as by running a maintenance script on production, that would solve my problem. If such a thing is not possible, I will be happy to identify the missing rows and produce a patch for you to merge into the replica - thanks for suggesting that! (I originally thought that this was what you were already doing and that it could not be sped up in any way.)

This is very easy to fix: tell your users to mark those that are incorrect, and exclude them from your query - that is very easy to do and doesn't require waiting.
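As an illustration only: a simplified form of such a query, with a hypothetical marker table myuser__marks.categorized that users (or the tool) fill with page IDs to exclude; the hidden-category refinement of the real query is omitted, and this assumes the user database lives on the same server as the replica:

echo "SELECT p.page_id, p.page_title
      FROM page p
      LEFT JOIN categorylinks cl ON cl.cl_from = p.page_id
      LEFT JOIN myuser__marks.categorized m ON m.page_id = p.page_id
      WHERE p.page_namespace = 0
        AND cl.cl_from IS NULL
        AND m.page_id IS NULL;" |
    mysql --defaults-file=$HOME/replica.my.cnf -h cswiki.labsdb cswiki_p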

This is very easy to fix: tell your users to mark those that are incorrect, and exclude them from your query - that is very easy to do and doesn't require waiting.

Is this a joke? The latest "solution" to the constant stream of data integrity issues on Wikimedia Labs database replicas is to further inconvenience volunteers?

The latest "solution" to the constant stream of data integrity issues on Wikimedia Labs database replicas is to further inconvenience volunteers?

No, the latest solution is the reimport I am already doing; that was a workaround because Blahma could not wait.

A confirmed workaround is to remove the categories from the affected articles, save, and then immediately revert, but I was questioned by a patroller after doing this once, so I did not dare go on with more.

I thought that's the usual workaround for the issue. Just go for it, and if you are ever questioned, point to this ticket.

Dvorapa removed a subscriber: Dvorapa. · Jun 20 2016, 10:15 AM
jcrespo closed this task as Resolved. · Jul 28 2016, 5:13 PM

labsdb1002 itself has been decommissioned; the tracking of the new server setup (and the labs fix in general) will be done at T140452.