CirrusSearch: We should dig into CirrusSearch-failures.log
Closed, ResolvedPublic

Description

We should figure out exactly what is going on. For example, I saw this:

2014-03-07 02:40:46 mw1015 wikidatawiki: Update for doc ids: 15087781
2014-03-07 02:40:46 mw1008 wikidatawiki: Update for doc ids: 15087781
2014-03-07 02:40:46 mw1008 wikidatawiki: Update for doc ids: 15087781

These might just be attempts to update the same document concurrently. Chad was talking about just retrying here. If we're trying to update the same document multiple times, we could do that. We also might want to use the pool counter to prevent it. We could probably use the shared acquire to notice that another job has already tried the update and then, well, not do it. One problem, though, is that updating a page will get a parser lock (I think), so we have to make sure the jobs don't lock each other out. I think.
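
A minimal sketch of the "just retry" idea, written in Python for illustration only. `VersionConflict`, `update_with_retry`, and `send_update` are hypothetical names, not CirrusSearch or Elasticsearch APIs, and the attempt limit is an assumption:

```python
class VersionConflict(Exception):
    """Hypothetical stand-in for the search backend's version conflict error."""


def update_with_retry(send_update, doc_id, max_attempts=3):
    """Call send_update(doc_id), retrying when it raises VersionConflict.

    Returns the number of attempts used. Re-raises the conflict once
    max_attempts attempts have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send_update(doc_id)
            return attempt
        except VersionConflict:
            if attempt == max_attempts:
                raise
```

Retrying only helps if the concurrent writers eventually finish. It does not stop the duplicate work itself, which is what the pool counter idea is aimed at.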


Version: unspecified
Severity: normal

Details

Reference
bz62358

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:07 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz62358.
bzimport added a subscriber: Unknown Object (MLST).
demon added a comment. Mar 7 2014, 2:50 AM

We do retry now as of gerrit #117335.

Cool! I guess I was just going on faith that we were retrying....

I just checked and saw two things:

  1. If we try to send 50 updates all at once we might bump against an Elasticsearch queue limit. I'm chunking it to 10 at a time.
  2. I _think_ moving a page and leaving behind a redirect can cause that version conflict error. I believe it makes two jobs - one for the new page and one for the redirect. We need to keep the redirect job but might be able to throw away the one for the new page. Worth checking.
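
The chunking in point 1 could look roughly like this. It is a sketch in Python rather than the actual patch; `chunked`, `send_in_chunks`, and `send_bulk` are hypothetical names, and only the chunk size of 10 comes from the comment above:

```python
def chunked(items, size=10):
    """Yield consecutive slices of items with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_in_chunks(doc_ids, send_bulk, size=10):
    """Pass doc_ids to send_bulk (a hypothetical bulk sender) one chunk at a time.

    Returns the number of bulk requests made.
    """
    requests = 0
    for chunk in chunked(doc_ids, size):
        send_bulk(chunk)
        requests += 1
    return requests
```

With 50 pending updates this makes five requests of 10 documents instead of one request of 50, so each request puts fewer operations on the queue at once.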

Change 148405 had a related patch set uploaded by Manybubbles:
Chunk updates at 10

https://gerrit.wikimedia.org/r/148405

Finishing up skipping the second update in point #2 from comment 3.

Change 148417 had a related patch set uploaded by Manybubbles:
On article move only use one job

https://gerrit.wikimedia.org/r/148417

Change 148405 merged by jenkins-bot:
Chunk updates at 10

https://gerrit.wikimedia.org/r/148405

Change 148417 merged by jenkins-bot:
On article move only use one job

https://gerrit.wikimedia.org/r/148417

Shifting back to new - we'll have to reevaluate in two weeks or so once these changes hit production and we've churned through the queue.

Dug into these and the vast majority of them now come from trying to run the same updates twice or three times at the same time. Noop detection, going out to wikipedias tomorrow, should squash most of these.
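
For readers unfamiliar with the term, noop detection means skipping a write when the incoming update would not change the stored document. The thread does not say how it is implemented, so the following Python sketch is only an illustration under that assumption; `is_noop`, `apply_update`, and the dict-based `index` are hypothetical:

```python
def is_noop(indexed_doc, new_doc):
    """Return True when every field in new_doc already has that value in indexed_doc."""
    return all(
        key in indexed_doc and indexed_doc[key] == value
        for key, value in new_doc.items()
    )


def apply_update(index, doc_id, new_doc):
    """Merge new_doc into index[doc_id] unless the update changes nothing.

    `index` is a plain dict standing in for the search index.
    Returns True if a write happened, False if the update was skipped.
    """
    current = index.get(doc_id)
    if current is not None and is_noop(current, new_doc):
        return False
    index[doc_id] = {**(current or {}), **new_doc}
    return True
```

When the same update runs two or three times, only the first run writes anything, so the later runs have nothing to conflict over.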

One year later: Is this still an issue or can this task be closed as resolved?

Restricted Application added a project: Discovery. Sep 1 2015, 12:40 PM

One year later: Is this still an issue or can this task be closed as resolved? @Deskana maybe?

@Deskana: One year later: Is this still an issue or can this task be closed as resolved?

@Deskana: More than one year later: Is this still an issue or can this task be closed as resolved?

Deskana closed this task as Resolved. Dec 23 2015, 12:25 AM
Deskana claimed this task.

I think we can assume the deployed fix resolved the issue. If that's not the case, someone can reopen.