Data disappeared from labs replica in cswiki_p.page_props
Closed, Resolved · Public

Description

Hi,
I was told that the query 'select * from page_props where pp_propname="page_image"', run on cswiki_p, does not return all the rows it should (there must be more pages with this property).

Can anybody fix it?

Martin Urbanec

Event Timeline

Urbanecm created this task. Dec 21 2016, 7:02 PM
Restricted Application added a project: User-Urbanecm. Dec 21 2016, 7:02 PM
Restricted Application added a subscriber: Aklapper.

This broke my tool that generates a map of geographical articles on cswiki with no page image. On production, page images are still shown, so the information is retained. My log indicates that the change occurred at the end of November 2016.

This might be related to T152155 which broke many other uses of PageImage on Wikimedia sites. It got somehow fixed, but I do not understand what that bug's implications are for my scenario, whether I should change anything in my code, and whether things will sometime get back to "normal" or whether this is now the "normal" state. Could someone please explain?

For my case, I do not mind at all whether the image is free or not – I just need to be able to tell that Wikipedia already has an image for that article. Anyway, non-free images are not allowed at all on Czech Wikipedia.

Urbanecm updated the task description. Dec 22 2016, 7:54 AM
phuedx added subscribers: bmansurov, Jdlrobson, phuedx.

@bmansurov, @Jdlrobson: Can you speak to why this query might not return all the rows that it should? IIRC there's now an additional property which explains the freeness of an image?

At this moment, the query returns 4 rows. Yesterday it was still 5. A script or something similar must be running in the background, but rows with "page_image" are disappearing rather than reappearing. On the other hand, there are 206870 page_image_free rows on cswiki.

Documentation for PageImages is scarce, as noted in T152239 and evinced by [[mediawikiwiki:Extension:PageImages]]. What should I do in my tool now if I don't care about the free/non-free distinction? I tried to dig into the code but am still confused about what the intended behavior is. Am I right in understanding that page_image should have remained in place and page_image_free was intended merely to be added beside it?

Anyway, as the recent change broke existing code, it is clear that something went wrong. And while the API might have got a temporary fix, database queries like mine stopped working. (I do not use the API because I need to combine the search with the geo_tags table to look for articles with geographical coordinates but with no image.)

I do not know much about the whole incident, but maybe page_image_free will work instead of page_image? At least it seems to have many more rows:

[cswiki]> SELECT count(*) FROM page_props WHERE pp_propname = 'page_image_free';
+----------+
| count(*) |
+----------+
|   206872 |
+----------+
1 row in set (0.12 sec)

[cswiki]> SELECT count(*) FROM page_props WHERE pp_propname = 'page_image';
+----------+
| count(*) |
+----------+
|        2 |
+----------+
1 row in set (0.00 sec)

@jcrespo: Thank you for your hint. This is indeed the workaround that I have been using since yesterday. Is this a stable fix, however? If so, does "page_image" therefore now mean "page_image_nonfree"? I cannot believe someone would have deliberately broken existing code. I would understand that populating "page_image_free" takes time, but why has the old "page_image" disappeared, or been renamed to "page_image_free" with no replacement, although all images on cswiki should be free images (no fair use is allowed there)? These are my questions.

@Blahma, yes, from now on there will be two page properties. If 'page_image_free' exists but 'page_image' doesn't, then the best image is the free image. If, however, both properties exist, then the best free image differs from the best image (which is non-free).
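To make the rule above concrete, here is a minimal sketch in Python of how a tool might interpret the two properties for a single page. The names `PageImages` and `interpret_page_image_props` are hypothetical and are not part of the PageImages extension. The case where only 'page_image' is present is not covered by the explanation above, so treating it as "a non-free best image with no free alternative" is an assumption.

```python
from typing import Mapping, NamedTuple, Optional


class PageImages(NamedTuple):
    """Hypothetical container for the images attached to one page."""
    best: Optional[str]       # best image overall, free or not
    best_free: Optional[str]  # best free image, if any


def interpret_page_image_props(props: Mapping[str, str]) -> PageImages:
    """Map a page's page_props (pp_propname -> pp_value) to its images.

    Follows the rule described in this thread:
    - only 'page_image_free' present: the free image is also the best image;
    - both present: 'page_image' is the (non-free) best image and
      'page_image_free' is the best free image.
    """
    free = props.get("page_image_free")
    nonfree = props.get("page_image")
    # Assumption: if only 'page_image' exists, it is the best image and
    # the page has no free image at all.
    best = nonfree if nonfree is not None else free
    return PageImages(best=best, best_free=free)


def has_any_image(props: Mapping[str, str]) -> bool:
    """True if the page has an image, regardless of its license."""
    return interpret_page_image_props(props).best is not None
```

A tool that only needs to know whether an article already has an image would call `has_any_image` and ignore the free/non-free split.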

jcrespo added a comment. Edited Dec 22 2016, 10:30 PM

@Blahma Sadly, I can help you with the database itself, but I know no more about its contents than you do. From the source code, I would say that the best query you could implement is:

SELECT pp_value FROM page_props WHERE pp_propname IN ('page_image_free', 'page_image') AND pp_page=$page_id ORDER BY IF(pp_propname = 'page_image_free', 0, 1) LIMIT 1;

This gives you a result whether or not the image is free, while preferring free images.
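For the use case described earlier in this thread (articles with coordinates but no image), the same idea can be turned into an anti-join against geo_tags. The sketch below runs against an in-memory SQLite database so that it is self-contained. The table layouts are trimmed-down assumptions rather than the real replica schema, the sample rows are invented, and `CASE` stands in for MariaDB's `IF()` so that the per-page query also runs on SQLite. The helper names are hypothetical.

```python
import sqlite3

IMAGE_PROPS = ("page_image_free", "page_image")

# Per-page lookup, preferring the free image (CASE replaces MariaDB's IF()).
BEST_IMAGE_SQL = """
SELECT pp_value FROM page_props
WHERE pp_propname IN ('page_image_free', 'page_image') AND pp_page = ?
ORDER BY CASE WHEN pp_propname = 'page_image_free' THEN 0 ELSE 1 END
LIMIT 1
"""

# Articles that have coordinates but neither image property.
NO_IMAGE_SQL = """
SELECT DISTINCT p.page_title
FROM page p
JOIN geo_tags g ON g.gt_page_id = p.page_id
WHERE p.page_namespace = 0
  AND NOT EXISTS (
    SELECT 1 FROM page_props pp
    WHERE pp.pp_page = p.page_id
      AND pp.pp_propname IN ('page_image_free', 'page_image')
  )
ORDER BY p.page_title
"""


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny stand-in database with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                           page_namespace INTEGER, page_title TEXT);
        CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT,
                                 pp_value TEXT);
        CREATE TABLE geo_tags (gt_page_id INTEGER, gt_lat REAL, gt_lon REAL);
        INSERT INTO page VALUES (1, 0, 'Praha'), (2, 0, 'Brno'),
                                (3, 0, 'Ostrava');
        INSERT INTO page_props VALUES
            (1, 'page_image_free', 'Praha.jpg'),
            (2, 'page_image_free', 'Brno_free.jpg'),
            (2, 'page_image', 'Brno_logo.png');
        INSERT INTO geo_tags VALUES (1, 50.08, 14.42), (2, 49.19, 16.61),
                                    (3, 49.84, 18.29);
    """)
    return conn


def best_image(conn: sqlite3.Connection, page_id: int):
    """Return the preferred image name for a page, or None if it has none."""
    row = conn.execute(BEST_IMAGE_SQL, (page_id,)).fetchone()
    return row[0] if row else None


def geo_pages_without_image(conn: sqlite3.Connection):
    """Return titles of main-namespace pages with coordinates but no image."""
    return [title for (title,) in conn.execute(NO_IMAGE_SQL)]
```

Checking both property names in the `NOT EXISTS` subquery means the result does not depend on whether the migration script has already processed a given page.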

Thank you @bmansurov for explanation of the goal of the recent change and @jcrespo for expanding on it and pointing to the source code. I dare consider it bad design that an existing property has been assigned a new meaning, instead of either keeping it untouched or abandoning it completely. In this particular case, the change moreover, perhaps accidentally, favors non-free images, as only those will be returned in queries that have not been adapted to the new schema.

To learn from this, can someone advise me how I can prevent such a breakdown of my tool in the future? Do I need to subscribe in some way to the particular extension's repository and follow every change to tell that a design change is approaching? Are there any release notes? The extension's page on meta has hardly changed recently. Or should I, as a tool developer, accept that such things just happen? I maintain several tools on Tool Labs and almost every month something breaks and needs my attention because of some third-party change. How can I prevent this from happening? Aren't there some standards, at least for code that is running on production wikis?

Urbanecm moved this task from Backlog to Watching on the User-Urbanecm board. Dec 23 2016, 12:01 PM

I dare consider it bad design that an existing property has been assigned a new meaning, instead of either keeping it untouched or abandoning it completely.

Abandoning it has its own problems as stale data piles up in the database. The maintenance script is meant to take care of migrations. As long as it's run we should not expect any problems.

In this particular case, the change moreover, perhaps accidentally, favors non-free images, as only those will be returned in queries that have not been adapted to the new schema.

Not quite, but temporarily yes. The API has been made flexible to return both free and non-free (if it's better than the free image) with the additional query parameter license, which is documented at https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bpageimages. It's true that the maintenance script is still running, and we're returning any image for now, but that will be fixed when we work on T152216. Once we do that, by default, we'd be returning free images when the license parameter is not passed.
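As an illustration of the license parameter mentioned above, here is a small sketch that only builds a request URL and makes no network call. The endpoint is the one linked in this comment. The prefixed parameter name `pilicense` and its values `free` and `any` are my reading of that help page and should be checked against it before use. The function name `pageimages_url` is hypothetical.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def pageimages_url(titles, license_filter="free", api=API_ENDPOINT):
    """Build a prop=pageimages query URL for the given page titles.

    license_filter is passed as 'pilicense' (assumed values: 'free' to
    restrict results to free images, 'any' to allow a non-free best image).
    """
    params = {
        "action": "query",
        "format": "json",
        "prop": "pageimages",
        "piprop": "name",             # return only the image file name
        "pilicense": license_filter,  # assumption: see lead-in above
        "titles": "|".join(titles),   # multiple titles are pipe-separated
    }
    return f"{api}?{urlencode(params)}"
```

Passing the license explicitly, rather than relying on the default, keeps a client's behavior stable if the default changes as described here.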

To learn from this, can someone advise me how I can prevent such a breakdown of my tool in the future? Do I need to subscribe in some way to the particular extension's repository and follow every change to tell that a design change is approaching? Are there any release notes? The extension's page on meta has hardly changed recently. Or should I, as a tool developer, accept that such things just happen? I maintain several tools on Tool Labs and almost every month something breaks and needs my attention because of some third-party change. How can I prevent this from happening? Aren't there some standards, at least for code that is running on production wikis?

There was an announcement about this on the wikitech-l mailing list: https://lists.wikimedia.org/pipermail/wikitech-l/2016-November/087098.html. Generally speaking, keeping an eye on this mailing list should cover most of the situations you mention.

I dare consider it bad design that an existing property has been assigned a new meaning, instead of either keeping it untouched or abandoning it completely. In this particular case, the change moreover, perhaps accidentally, favors non-free images, as only those will be returned in queries that have not been adapted to the new schema.

Hi @Blahma, I apologise for the breakage here. This was something we should have caught in code review and failed to. As the incident report (https://wikitech.wikimedia.org/wiki/Incident_documentation/20161202-20161201-PageImages) suggests, we are planning to protect ourselves against such changes in future.

To learn from this, can someone advise me how I can prevent such a breakdown of my tool in the future?

As @bmansurov points out, we will send notifications for big changes like this. Unfortunately, this change, which was not supposed to trouble you, backfired. The best you can do is keep up to date with what we are supposed to be doing and report when we break things (which hopefully we won't do again!)

Jdlrobson closed this task as Resolved. Jan 25 2017, 7:35 PM
Jdlrobson claimed this task.

Please reopen if you have any further questions.