Page MenuHomePhabricator

Consistent TimeoutErrors when using Celery 4
Closed, ResolvedPublic

Assigned To
Authored By
awight
Nov 1 2017, 6:17 PM
Referenced Files
F10558565: succeeded
Nov 1 2017, 9:35 PM
F10558564: failed
Nov 1 2017, 9:35 PM
F10557825: wd
Nov 1 2017, 8:14 PM
F10557823: failed
Nov 1 2017, 8:14 PM
F10557824: succeeded
Nov 1 2017, 8:14 PM

Description

These errors only occur with "?features=", which disables caching and includes features in the response. I haven't been able to reproduce the issue without "features", even when scoring uncached revisions. The strangest thing about these timeouts is that we see the errors, but no scores:

1{
2 "enwiki": {
3 "models": {
4 "damaging": {
5 "version": "0.4.0"
6 },
7 "draftquality": {
8 "version": "0.1.0"
9 },
10 "goodfaith": {
11 "version": "0.4.0"
12 },
13 "wp10": {
14 "version": "0.6.1"
15 }
16 },
17 "scores": {
18 "123456": {
19 "damaging": {
20 "error": {
21 "message": "Timed out after 15 seconds.",
22 "type": "TimeoutError"
23 },
24 "features": {
25 "feature.english.badwords.revision.diff.match_delta_decrease": 0,
26 "feature.english.badwords.revision.diff.match_delta_increase": 0,
27 "feature.english.badwords.revision.diff.match_delta_sum": 0,
28 "feature.english.badwords.revision.diff.match_prop_delta_decrease": 0.0,
29 "feature.english.badwords.revision.diff.match_prop_delta_increase": 0.0,
30 "feature.english.badwords.revision.diff.match_prop_delta_sum": 0.0,
31 "feature.english.dictionary.revision.diff.dict_word_delta_decrease": 0,
32 "feature.english.dictionary.revision.diff.dict_word_delta_increase": 1,
33 "feature.english.dictionary.revision.diff.dict_word_delta_sum": 1,
34 "feature.english.dictionary.revision.diff.dict_word_prop_delta_decrease": 0.0,
35 "feature.english.dictionary.revision.diff.dict_word_prop_delta_increase": 1.0,
36 "feature.english.dictionary.revision.diff.dict_word_prop_delta_sum": 1.0,
37 "feature.english.dictionary.revision.diff.non_dict_word_delta_decrease": 0,
38 "feature.english.dictionary.revision.diff.non_dict_word_delta_increase": 2,
39 "feature.english.dictionary.revision.diff.non_dict_word_delta_sum": 2,
40 "feature.english.dictionary.revision.diff.non_dict_word_prop_delta_decrease": 0.0,
41 "feature.english.dictionary.revision.diff.non_dict_word_prop_delta_increase": 2.0,
42 "feature.english.dictionary.revision.diff.non_dict_word_prop_delta_sum": 2.0,
43 "feature.english.informals.revision.diff.match_delta_decrease": 0,
44 "feature.english.informals.revision.diff.match_delta_increase": 0,
45 "feature.english.informals.revision.diff.match_delta_sum": 0,
46 "feature.english.informals.revision.diff.match_prop_delta_decrease": 0.0,
47 "feature.english.informals.revision.diff.match_prop_delta_increase": 0.0,
48 "feature.english.informals.revision.diff.match_prop_delta_sum": 0.0,
49 "feature.len(<datasource.tokenized(datasource.revision.parent.text)>)": 0,
50 "feature.len(<datasource.tokenized(datasource.revision.text)>)": 11,
51 "feature.len(<datasource.wikitext.revision.markups>)": 2,
52 "feature.len(<datasource.wikitext.revision.parent.markups>)": 0,
53 "feature.len(<datasource.wikitext.revision.parent.uppercase_words>)": 0,
54 "feature.len(<datasource.wikitext.revision.parent.words>)": 0,
55 "feature.len(<datasource.wikitext.revision.words>)": 3,
56 "feature.revision.comment.has_link": false,
57 "feature.revision.comment.suggests_section_edit": true,
58 "feature.revision.diff.longest_new_repeated_char": 2,
59 "feature.revision.diff.longest_new_token": 16,
60 "feature.revision.page.is_articleish": false,
61 "feature.revision.page.is_draftspace": false,
62 "feature.revision.page.is_mainspace": false,
63 "feature.revision.user.has_advanced_rights": false,
64 "feature.revision.user.is_admin": false,
65 "feature.revision.user.is_anon": true,
66 "feature.revision.user.is_bot": false,
67 "feature.revision.user.is_curator": false,
68 "feature.revision.user.is_patroller": false,
69 "feature.revision.user.is_trusted": false,
70 "feature.temporal.revision.user.seconds_since_registration": 0,
71 "feature.wikitext.revision.chars": 36,
72 "feature.wikitext.revision.diff.markup_delta_decrease": 0.0,
73 "feature.wikitext.revision.diff.markup_delta_increase": 2.0,
74 "feature.wikitext.revision.diff.markup_delta_sum": 2.0,
75 "feature.wikitext.revision.diff.markup_prop_delta_decrease": 0.0,
76 "feature.wikitext.revision.diff.markup_prop_delta_increase": 2.0,
77 "feature.wikitext.revision.diff.markup_prop_delta_sum": 2.0,
78 "feature.wikitext.revision.diff.number_delta_decrease": 0.0,
79 "feature.wikitext.revision.diff.number_delta_increase": 0.0,
80 "feature.wikitext.revision.diff.number_delta_sum": 0.0,
81 "feature.wikitext.revision.diff.number_prop_delta_decrease": 0.0,
82 "feature.wikitext.revision.diff.number_prop_delta_increase": 0.0,
83 "feature.wikitext.revision.diff.number_prop_delta_sum": 0.0,
84 "feature.wikitext.revision.diff.uppercase_word_delta_decrease": 0.0,
85 "feature.wikitext.revision.diff.uppercase_word_delta_increase": 0.0,
86 "feature.wikitext.revision.diff.uppercase_word_delta_sum": 0.0,
87 "feature.wikitext.revision.diff.uppercase_word_prop_delta_decrease": 0.0,
88 "feature.wikitext.revision.diff.uppercase_word_prop_delta_increase": 0.0,
89 "feature.wikitext.revision.diff.uppercase_word_prop_delta_sum": 0.0,
90 "feature.wikitext.revision.external_links": 0,
91 "feature.wikitext.revision.headings": 0,
92 "feature.wikitext.revision.parent.chars": 0,
93 "feature.wikitext.revision.parent.external_links": 0,
94 "feature.wikitext.revision.parent.headings": 0,
95 "feature.wikitext.revision.parent.ref_tags": 0,
96 "feature.wikitext.revision.parent.templates": 0,
97 "feature.wikitext.revision.parent.wikilinks": 0,
98 "feature.wikitext.revision.ref_tags": 0,
99 "feature.wikitext.revision.templates": 0,
100 "feature.wikitext.revision.wikilinks": 0
101 }
102 },
103 "draftquality": {
104 "error": {
105 "message": "Timed out after 15 seconds.",
106 "type": "TimeoutError"
107 }
108 },
109 "goodfaith": {
110 "error": {
111 "message": "Timed out after 15 seconds.",
112 "type": "TimeoutError"
113 },
114 "features": {
115 "feature.english.badwords.revision.diff.match_delta_decrease": 0,
116 "feature.english.badwords.revision.diff.match_delta_increase": 0,
117 "feature.english.badwords.revision.diff.match_delta_sum": 0,
118 "feature.english.badwords.revision.diff.match_prop_delta_decrease": 0.0,
119 "feature.english.badwords.revision.diff.match_prop_delta_increase": 0.0,
120 "feature.english.badwords.revision.diff.match_prop_delta_sum": 0.0,
121 "feature.english.dictionary.revision.diff.dict_word_delta_decrease": 0,
122 "feature.english.dictionary.revision.diff.dict_word_delta_increase": 1,
123 "feature.english.dictionary.revision.diff.dict_word_delta_sum": 1,
124 "feature.english.dictionary.revision.diff.dict_word_prop_delta_decrease": 0.0,
125 "feature.english.dictionary.revision.diff.dict_word_prop_delta_increase": 1.0,
126 "feature.english.dictionary.revision.diff.dict_word_prop_delta_sum": 1.0,
127 "feature.english.dictionary.revision.diff.non_dict_word_delta_decrease": 0,
128 "feature.english.dictionary.revision.diff.non_dict_word_delta_increase": 2,
129 "feature.english.dictionary.revision.diff.non_dict_word_delta_sum": 2,
130 "feature.english.dictionary.revision.diff.non_dict_word_prop_delta_decrease": 0.0,
131 "feature.english.dictionary.revision.diff.non_dict_word_prop_delta_increase": 2.0,
132 "feature.english.dictionary.revision.diff.non_dict_word_prop_delta_sum": 2.0,
133 "feature.english.informals.revision.diff.match_delta_decrease": 0,
134 "feature.english.informals.revision.diff.match_delta_increase": 0,
135 "feature.english.informals.revision.diff.match_delta_sum": 0,
136 "feature.english.informals.revision.diff.match_prop_delta_decrease": 0.0,
137 "feature.english.informals.revision.diff.match_prop_delta_increase": 0.0,
138 "feature.english.informals.revision.diff.match_prop_delta_sum": 0.0,
139 "feature.len(<datasource.tokenized(datasource.revision.parent.text)>)": 0,
140 "feature.len(<datasource.tokenized(datasource.revision.text)>)": 11,
141 "feature.len(<datasource.wikitext.revision.markups>)": 2,
142 "feature.len(<datasource.wikitext.revision.parent.markups>)": 0,
143 "feature.len(<datasource.wikitext.revision.parent.uppercase_words>)": 0,
144 "feature.len(<datasource.wikitext.revision.parent.words>)": 0,
145 "feature.len(<datasource.wikitext.revision.words>)": 3,
146 "feature.revision.comment.has_link": false,
147 "feature.revision.comment.suggests_section_edit": true,
148 "feature.revision.diff.longest_new_repeated_char": 2,
149 "feature.revision.diff.longest_new_token": 16,
150 "feature.revision.page.is_articleish": false,
151 "feature.revision.page.is_draftspace": false,
152 "feature.revision.page.is_mainspace": false,
153 "feature.revision.user.has_advanced_rights": false,
154 "feature.revision.user.is_admin": false,
155 "feature.revision.user.is_anon": true,
156 "feature.revision.user.is_bot": false,
157 "feature.revision.user.is_curator": false,
158 "feature.revision.user.is_patroller": false,
159 "feature.revision.user.is_trusted": false,
160 "feature.temporal.revision.user.seconds_since_registration": 0,
161 "feature.wikitext.revision.chars": 36,
162 "feature.wikitext.revision.diff.markup_delta_decrease": 0.0,
163 "feature.wikitext.revision.diff.markup_delta_increase": 2.0,
164 "feature.wikitext.revision.diff.markup_delta_sum": 2.0,
165 "feature.wikitext.revision.diff.markup_prop_delta_decrease": 0.0,
166 "feature.wikitext.revision.diff.markup_prop_delta_increase": 2.0,
167 "feature.wikitext.revision.diff.markup_prop_delta_sum": 2.0,
168 "feature.wikitext.revision.diff.number_delta_decrease": 0.0,
169 "feature.wikitext.revision.diff.number_delta_increase": 0.0,
170 "feature.wikitext.revision.diff.number_delta_sum": 0.0,
171 "feature.wikitext.revision.diff.number_prop_delta_decrease": 0.0,
172 "feature.wikitext.revision.diff.number_prop_delta_increase": 0.0,
173 "feature.wikitext.revision.diff.number_prop_delta_sum": 0.0,
174 "feature.wikitext.revision.diff.uppercase_word_delta_decrease": 0.0,
175 "feature.wikitext.revision.diff.uppercase_word_delta_increase": 0.0,
176 "feature.wikitext.revision.diff.uppercase_word_delta_sum": 0.0,
177 "feature.wikitext.revision.diff.uppercase_word_prop_delta_decrease": 0.0,
178 "feature.wikitext.revision.diff.uppercase_word_prop_delta_increase": 0.0,
179 "feature.wikitext.revision.diff.uppercase_word_prop_delta_sum": 0.0,
180 "feature.wikitext.revision.external_links": 0,
181 "feature.wikitext.revision.headings": 0,
182 "feature.wikitext.revision.parent.chars": 0,
183 "feature.wikitext.revision.parent.external_links": 0,
184 "feature.wikitext.revision.parent.headings": 0,
185 "feature.wikitext.revision.parent.ref_tags": 0,
186 "feature.wikitext.revision.parent.templates": 0,
187 "feature.wikitext.revision.parent.wikilinks": 0,
188 "feature.wikitext.revision.ref_tags": 0,
189 "feature.wikitext.revision.templates": 0,
190 "feature.wikitext.revision.wikilinks": 0
191 }
192 },
193 "wp10": {
194 "error": {
195 "message": "Timed out after 15 seconds.",
196 "type": "TimeoutError"
197 },
198 "features": {
199 "feature.english.stemmed.revision.stems_length": 28,
200 "feature.enwiki.main_article_templates": 0,
201 "feature.enwiki.revision.category_links": 0,
202 "feature.enwiki.revision.cite_templates": 0,
203 "feature.enwiki.revision.cn_templates": 0,
204 "feature.enwiki.revision.image_links": 0,
205 "feature.enwiki.revision.infobox_templates": 0,
206 "feature.enwiki.revision.who_templates": 0,
207 "feature.wikitext.revision.chars": 36,
208 "feature.wikitext.revision.content_chars": 36,
209 "feature.wikitext.revision.external_links": 0,
210 "feature.wikitext.revision.headings_by_level(2)": 0,
211 "feature.wikitext.revision.headings_by_level(3)": 0,
212 "feature.wikitext.revision.ref_tags": 0,
213 "feature.wikitext.revision.templates": 0,
214 "feature.wikitext.revision.wikilinks": 0
215 }
216 }
217 }
218 }
219 }
220}

Reading debug logs and looking in the score cache, the scores are indeed generated and stored.

Event Timeline

Here are debug logs of failed and succeeded requests, I'm combing through the wdiff now.

wdiff succeeded failed > wd

I instrumented the four places we can raise this error, and we're hitting this one, a Celery exception from score_result.get, caught by _process_missing_scores.
https://github.com/wiki-ai/ores/blob/master/ores/scoring_systems/celery_queue.py#L143

Simplifying the query to just one model, https://ores-beta.wmflabs.org/v3/scores/enwiki/123456?features=&models=damaging

There's a good chance the problem is a race condition. In the successful run, looking at the differing lines we see:

  1. Setting state to SENT (process_score_map)
  2. Received task: ores.scoring_systems.celery_queue._process_score_map
  3. Task accepted: ores.scoring_systems.celery_queue._process_score_map
  4. Setting state to SENT (lookup_score_in_map)
  5. Received task: ores.scoring_systems.celery_queue._lookup_score_in_map
  6. TaskPool: Apply (lookup_score_in_map)
  7. Looking up damaging
  8. Task accepted: ores.scoring_systems.celery_queue._lookup_score_in_map

In the run that times out, the order is:

  1. Setting state to SENT (process_score_map)
  2. Received task: ores.scoring_systems.celery_queue._process_score_map
  3. Setting state to SENT (lookup_score_in_map)
  4. Received task: ores.scoring_systems.celery_queue._lookup_score_in_map
  5. TaskPool: Apply (lookup_score_in_map)
  6. Task accepted: ores.scoring_systems.celery_queue._process_score_map
  7. Looking up damaging
  8. Task accepted: ores.scoring_systems.celery_queue._lookup_score_in_map
awight renamed this task from Consistent TimeoutErrors affecting ores-beta.wmflabs.org to Consistent TimeoutErrors when using Celery 4.Nov 2 2017, 6:16 PM

The approach I want to try is to coerce the "inprogress" and "missing" child tasks into a Celery native subtask "group".

I ended up doing something simpler but wasteful. The synchronous subtask step came from sharing feature extraction between multiple models in a scoring request. Now we'll be doing feature extraction twice, but scoring each model in its own task. This burns 2x the CPU by duplicating feature extraction work. We'll want to win back that optimization, but I'm postponing because this smaller code change allows us to move to Celery 4, massively increase the number of workers, and get on the new cluster.

Change 389287 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@CELERY_4] Bump ores to get new subtask design

https://gerrit.wikimedia.org/r/389287

Change 389287 merged by Awight:
[mediawiki/services/ores/deploy@CELERY_4] Bump ores to get new subtask design

https://gerrit.wikimedia.org/r/389287

Harej triaged this task as High priority.Aug 3 2018, 7:04 PM

There's a Celery bug which seems to map to our issue, but I can't find it in the task history.

When we do an experimental upgrade, we must have monitoring in place to detect these TimeoutErrors.

awight lowered the priority of this task from High to Low.Sep 26 2018, 4:43 PM

I'm running celery 4.2 on ores-staging.wmflabs.org and run a hammer test with ?features= on and it worked just fine.

@Ladsgroup That's great news! Note that in T179524#3735543 I disclose that I may have worked around this problem with a barbaric hack.

@Ladsgroup That's great news! Note that in T179524#3735543 I disclose that I may have worked around this problem with a barbaric hack.

Well, when we have task tracker in place this should not be needed anymore. (ores-staging is working with the task tracker on)