
categorylinks tables are missing rows
Closed, Resolved (Public)

Description

MariaDB [enwikisource_p]>     SELECT
    ->       cl_to
    ->     FROM page
    ->     JOIN categorylinks
    ->     ON cl_from = page_id
    ->     WHERE page_id = cl_from
    ->     AND page_namespace = 104
    ->     AND page_title = 'Euripides_(Mahaffy).djvu/153';
Empty set (0.01 sec)

After making a null edit (just clicking edit and saving the page), the categorylinks entry appears:

MariaDB [enwikisource_p]>     SELECT
    ->       cl_to
    ->     FROM page
    ->     JOIN categorylinks
    ->     ON cl_from = page_id
    ->     WHERE page_id = cl_from
    ->     AND page_namespace = 104
    ->     AND page_title = 'Euripides_(Mahaffy).djvu/153';
+---------------+
| cl_to         |
+---------------+
| Not_proofread |
+---------------+
1 row in set (0.01 sec)

My suspicion is that the production (master) server has these rows in categorylinks and the Labs replicas are just missing rows. I guess I'll need to find another example.

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.
MariaDB [enwikisource_p]>     SELECT
    ->       cl_to
    ->     FROM page
    ->     JOIN categorylinks
    ->     ON cl_from = page_id
    ->     WHERE page_id = cl_from
    ->     AND page_namespace = 104
    ->     AND page_title = 'Instigations_of_Ezra_Pound.djvu/2';
Empty set (0.00 sec)

And yet https://en.wikisource.org/wiki/Page:Instigations_of_Ezra_Pound.djvu/2 is clearly in "Category: Without text".

Hrmph.

Poking at this a bit further, https://en.wikisource.org/w/api.php?action=query&prop=categories&titles=Page:Instigations+of+Ezra+Pound.djvu/2 returns:

{
    "batchcomplete": "",
    "query": {
        "pages": {
            "1986660": {
                "pageid": 1986660,
                "ns": 104,
                "title": "Page:Instigations of Ezra Pound.djvu/2"
            }
        }
    }
}

Compare with a working page such as https://en.wikisource.org/wiki/Page:Instigations_of_Ezra_Pound.djvu/4:

{
    "batchcomplete": "",
    "query": {
        "pages": {
            "1986662": {
                "pageid": 1986662,
                "ns": 104,
                "title": "Page:Instigations of Ezra Pound.djvu/4",
                "categories": [
                    {
                        "ns": 14,
                        "title": "Category:Without text"
                    }
                ]
            }
        }
    }
}
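The comparison between the two responses above can be sketched in Python. This is a minimal illustration, assuming the response has already been decoded from JSON into a dict; the function name `categories_by_title` is hypothetical and not part of any MediaWiki client library.

```python
from __future__ import annotations


def categories_by_title(response: dict) -> dict[str, list[str]]:
    """Map each page title in a prop=categories response to its category titles."""
    pages = response.get("query", {}).get("pages", {})
    result = {}
    for page in pages.values():
        # In the responses shown above, a page with no category rows has no
        # "categories" key at all, so a missing key is treated as an empty list.
        result[page["title"]] = [c["title"] for c in page.get("categories", [])]
    return result


# The two responses quoted above, reduced to the fields that matter here.
broken = {"query": {"pages": {"1986660": {
    "pageid": 1986660, "ns": 104,
    "title": "Page:Instigations of Ezra Pound.djvu/2"}}}}
working = {"query": {"pages": {"1986662": {
    "pageid": 1986662, "ns": 104,
    "title": "Page:Instigations of Ezra Pound.djvu/4",
    "categories": [{"ns": 14, "title": "Category:Without text"}]}}}}

print(categories_by_title(broken))
print(categories_by_title(working))
```

An empty list for the first page and a single category for the second is the same discrepancy the raw JSON shows.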

So it looks like the HTML output for https://en.wikisource.org/wiki/Page:Instigations_of_Ezra_Pound.djvu/2 is correct:

$ curl -s "https://en.wikisource.org/wiki/Page:Instigations_of_Ezra_Pound.djvu/2" | grep -A2 "catlinks"
				<div id="catlinks" class="catlinks" data-mw="interface"><div id="mw-normal-catlinks" class="mw-normal-catlinks"><a href="/wiki/Special:Categories" title="Special:Categories">Category</a>: <ul><li><a href="/wiki/Category:Without_text" title="Category:Without text">Without text</a></li></ul></div></div>				<div class="visualClear"></div>
							</div>
		</div>

But given the api.php output, I'm more inclined to believe that the categorylinks rows are missing on the master database server.
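For checking more than one page, the curl-and-grep step can be approximated with Python's standard `html.parser`. This is only a sketch: the class and function names are hypothetical, and it assumes that any link whose `href` starts with `/wiki/Category:` is a category link, which is broader than the catlinks box alone.

```python
from html.parser import HTMLParser


class CategoryLinkParser(HTMLParser):
    """Collect the title attribute of every link pointing at /wiki/Category:..."""

    def __init__(self):
        super().__init__()
        self.categories = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        # Assumption: category links use the /wiki/Category:<name> URL form,
        # as in the catlinks markup shown above.
        if href.startswith("/wiki/Category:"):
            self.categories.append(attributes.get("title") or "")


def rendered_categories(html: str) -> list:
    """Return the category titles linked from a rendered page's HTML."""
    parser = CategoryLinkParser()
    parser.feed(html)
    return parser.categories


# The catlinks markup from the curl output above.
sample = (
    '<div id="catlinks" class="catlinks" data-mw="interface">'
    '<div id="mw-normal-catlinks" class="mw-normal-catlinks">'
    '<a href="/wiki/Special:Categories" title="Special:Categories">Category</a>: '
    '<ul><li><a href="/wiki/Category:Without_text" '
    'title="Category:Without text">Without text</a></li></ul></div></div>'
)
print(rendered_categories(sample))
```

The result can then be compared with what api.php or the categorylinks table reports for the same page.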

Krenair renamed this task from categorylinks tables on Tool Labs are missing rows to categorylinks tables on Labs replicas are missing rows. May 20 2016, 2:23 PM
Krenair removed a project: Toolforge.
Krenair updated the task description.

That query returns the same thing against the production master server.

jcrespo closed this task as Resolved. Edited May 20 2016, 3:08 PM
jcrespo claimed this task.
jcrespo removed projects: DBA, Cloud-Services.
jcrespo added a subscriber: jcrespo.

As with most reports, there are 3 possibilities:

  1. categorylinks is not updated in real time. Updates are done by a background job, which means they can take anywhere from seconds to days. Krenair's comment suggests this is the case. These tables are non-canonical, and occasionally (say 1 page in 1 million) something goes wrong: the page fails to parse, or the job hits an issue or is very delayed. A null edit fixes those. I did that and it worked: https://en.wikisource.org/w/api.php?action=query&prop=categories&titles=Page:Instigations+of+Ezra+Pound.djvu/2 Note that "Category: Without text" is probably a huge category, which means updating it takes a lot of processing power. I ask for your patience. I have some ideas to improve this kind of issue, but they will take time to implement.
  2. Because of ongoing imports, categorylinks tables on labs can be out of sync with production for a short period of time (the alternative would be bringing down the whole server).
  3. There is a genuine difference between labs and production, and the previously mentioned reimports will fix the issue (several tickets, for example, T126946).

It's pretty frustrating to have categorylinks rows missing. It's also fairly aggravating for this task to be marked resolved when the issue is so clearly not.

Sure, it's possible to "fix" individual cases such as https://en.wikisource.org/wiki/Page:Instigations_of_Ezra_Pound.djvu/2 with a null edit, but that's a terrible workaround. In the past, many people have tried to do mass null-editing and there have been complaints from operations people.

I'm not sure what users are expected to do. Just live with incomplete and missing data indefinitely?

@MZMcBride, your original issue, in the way you expressed it, "T135801: categorylinks tables on Labs replicas are missing rows", for which you clearly sought DBA help ("MZMcBride added a project: DBA."), is resolved. Technically it would be invalid (labs is not missing any rows), and I explained why (in particular, I made the very example you mentioned work) and:

My suspicion is that the production (master) server has these rows in categorylinks and the Labs replicas are just missing rows

was confirmed untrue by Krenair. That doesn't prevent you from opening a *MediaWiki* issue, or reopening this one with a modified title and description and without the DBA/Labs tags, as you prefer (I'd personally prefer a new one, leaving this as a reference). If you suggest this is a DBA issue, the DBAs will say "there is nothing wrong with the database" :-) The DBA team rarely handles actual database contents. It may not make sense, but the difference between "losing rows" and "rows not being inserted in the first place" is that each has to be fixed in a completely different way.

I wouldn't disagree that this *could* be a general problem, and I would suggest seeking MediaWiki input on the Performance-Team/MediaWiki-Core-JobQueue or Wikimedia-Rdbms projects (maybe you added DBA by mistake? If that is the case, let me reopen and add those, but please change the title and scope of the task to reflect the general issue, not the very specific example).

In fact, there is no need to open another; it is already open as T87716. To push for someone to work on that, I would suggest commenting there with a link to this ticket, explaining how much this impacts you.

Krenair renamed this task from categorylinks tables on Labs replicas are missing rows to categorylinks tables are missing rows. May 22 2016, 2:49 PM
Krenair added a project: Wikimedia-Rdbms.

I wouldn't disagree that this *could* be a general problem, and I would suggest seeking MediaWiki input on the Performance-Team/MediaWiki-Core-JobQueue or Wikimedia-Rdbms projects (maybe you added DBA by mistake? If that is the case, let me reopen and add those, but please change the title and scope of the task to reflect the general issue, not the very specific example).

Without query access to the master database, it's very difficult for me to determine whether this issue only affects Labs or also affects production. I think that's why I added DBA. I'm also not sure I knew about Wikimedia-Rdbms. In any case, my issue showed symptoms on Labs, but ultimately the source of the problem was higher up.

In fact, there is no need to open another; it is already open as T87716. To push for someone to work on that, I would suggest commenting there with a link to this ticket, explaining how much this impacts you.

Yes, I'm subscribed to that task. Yes, this task is likely a duplicate of that task.

As a user and volunteer, it's very frustrating to be repeatedly hitting data integrity issues. I imagine you're also frustrated. As a user and volunteer, I don't really care what the cause is, all I know is that other Wikimedians complain to me about some ancient script or tool that I once wrote not working in certain cases. I spend a few minutes diagnosing what went wrong, and it's often "data is inexplicably missing from either Labs or production." Grrrrr.

The truth is that, in the current state of things, non-canonical tables cannot be 100% reliable. Currently they are at something like ~99.9%, but fixing that last 0.1% would probably cost a lot of performance (and a lot of code refactoring, an Epic task), making things like editing a page incredibly slow and more prone to failures. I am not saying that this could not be improved, but with so much data, the preference is to inconvenience API/labs users rather than the editors who create content, which I think you will agree is the better policy, considering that, in theory, desyncs should be very rare and fixable with a reparse. In the real world, sometimes jobs fail or bugs happen, and that affects the results.

Please note that I am not a MediaWiki programmer, so this is an "outsider" view that may not reflect reality.

As an op, what I can tell you is what is being done in labs to improve replication issues (something that I own). I agree that things have been really bad in the last few years:

a) I am reimporting all rows from production to labs for the first time in years, something that will take months because 20 TB of data has to be filtered row by row.
b) New hardware is about to arrive (recently approved), which will let us get rid of the TokuDB engine, the cause of most of the lag and replication errors and the thing that prevents faster imports to fix data issues.

And you are right: I am the first one interested in getting those issues fixed.

Regarding tags: it is not your responsibility to know who should fix an issue. What I was suggesting is to not tag it at all and leave that to the triagers. Setting a tag and a priority usually means processing gets delayed, because the task goes to the wrong inbox. If you want to know whether it is a labs issue or a production issue, the trick is this: if the API results are the same as on labs, it is a production issue; otherwise it is a replication filtering issue (which I own).
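That rule of thumb can be written down as a small Python sketch. The function name, the returned labels, and the extra `expected` argument (the categories shown on the rendered page) are assumptions added for illustration; all three inputs are assumed to be normalised to the same form, for example `Without_text`.

```python
def locate_problem(expected: set, api: set, replica: set) -> str:
    """Guess where missing categorylinks rows should be reported.

    expected: categories shown on the rendered page (assumed input)
    api:      categories returned by api.php (reads production)
    replica:  cl_to values from the labs replica
    """
    if api != replica:
        # Production and labs disagree, so rows were lost or filtered
        # on the way to the replica.
        return "replication"
    if api != expected:
        # Production and labs agree with each other but not with the page,
        # so the rows are missing or stale in production itself.
        return "production"
    return "no discrepancy"


# Page:Instigations_of_Ezra_Pound.djvu/2 before the null edit, per the
# thread: both the API and the replica returned nothing.
print(locate_problem({"Without_text"}, set(), set()))
```

For the example in this task the sketch reports a production problem, matching the conclusion that the rows were never inserted rather than lost in replication.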