Empty edits in edit history
Closed, Resolved (Public)

Description

I noticed strange empty edits by Edoderoobot (example edit); they are shown on my watchlist and in the edit history even though they are empty. I cannot undo them either, because "There is nothing that can be undone here." Is this normal behaviour?

My watchlist:

Képkivágás.PNG (47×1 px, 7 KB)

Edit history:

Képkivágás.PNG (50×1 px, 7 KB)

Event Timeline

More and more. I fear I could easily find more of them. @Lydia_Pintscher, shall we do an emergency shutdown?

Yeah I am not sure what's happening. So I'd say let's stop the bot until either @Edoderoo or someone from the dev team can look into it.

Usually, a null edit (nothing changed) results in just that: no entry in the history and no edit counted for the account.
For some reason, I sometimes run into null edits that do leave a trace. For the moment I cannot see why that happens; to me it looks like a small bug in the wiki software in the Wikidata area.
These edits have been occurring for quite a while, and updating the pywikibot repo on my machine will not solve the issue.
I have only seen this when updating descriptions and/or labels, when the update in the end does not actually change anything.

@Edoderoo I noticed that the log file linked in the edit summaries seems to be very outdated. Are more up-to-date logs available somewhere?

I did some investigation with the help of @Lucas_Werkmeister_WMDE and @Ladsgroup but didn't find the root cause so far.

As far as we can tell, the content is exactly the same. We have seen such a phenomenon before when the serialization of entities changed, but this would have to have (also) happened between 20th of March and 18th of April. And so far I didn't find any candidate for that.

  • The content seems to be exactly the same, even of those revisions whose size apparently changed: API query with prop=revisions
  • There are some null-edits where content size increased and some null-edits where it stays the same. So far I didn't find any null-edits where it decreased.
  • We got the API query that made the null edit from Hadoop. It contains an empty data object ({}):
wikidatawiki 323 false [] {"summary":"nl-description, [[User:Edoderoobot/Set-nl-description|python code]], logfile on https://goo .gl/BezTim","maxlag":"15","data":"{}","assert":"user","bot":"1","format":"json","action":"wbeditentity","id":"Q1471595","baserevid":"889025858","token":"[redacted]"} 2019 4 18 14

Thanks Michael.
I have made two changes to my description Python code. One concerns the weird, intentionally broken link to the log, which has not been updated for over a year anyway.
I don't think that is the cause, but I have taken it out now.

The other change is that my code now checks whether the data set is filled. If it's empty, it makes no sense for me to make a null edit, and from now on I no longer need to trust the wiki API to do this check for me.
Maybe there will still be weird errors; time will tell.
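
For illustration, a client-side guard of the kind described above could look roughly like the sketch below. This is not the bot's actual code: `has_payload`, `edit_if_changed` and the `submit` callback are hypothetical names, and `submit` stands in for whatever call actually sends the `wbeditentity` request.

```python
from typing import Any, Callable, Mapping


def has_payload(data: Mapping[str, Any]) -> bool:
    """Return True if any top-level section (e.g. 'labels', 'descriptions') is non-empty."""
    return any(bool(section) for section in data.values())


def edit_if_changed(
    item_id: str,
    data: Mapping[str, Any],
    submit: Callable[[str, Mapping[str, Any]], None],
) -> bool:
    """Call submit() only when there is something to write.

    `submit` is a hypothetical stand-in for the real API call.
    Returns True if the edit was sent, False if it was skipped.
    """
    if not has_payload(data):
        # Nothing to change: skip the request instead of sending data={}.
        return False
    submit(item_id, data)
    return True
```

With this guard, a request such as the one logged above (`"data":"{}"`) would never be sent.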

I just checked this for the revisions 889025858 and 917938818. Both refer to equal content blobs (before and after inflating).

I looked into this locally a bit and my findings are very inconvenient: These edits seem to happen because the blobs differ in size (as determined by EntityContent::getSize). EntityContent::getSize is defined as strlen( serialize( $this->getNativeData() ) );, meaning the size of the entity depends on the size of the PHP serialization. But given PHP serialization isn't guaranteed to be exactly the same always, this can lead to bogus empty edits. In the case mentioned above (where not much time passed between the edits), I suppose the size difference got introduced because the parent revision was measured in PHP7, but the new revision was measured in HHVM.
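
The effect can be illustrated with a small Python analogy. This is not the PHP code in question: the two `*_size` functions merely imitate two runtimes that serialize the same data into slightly different byte strings, and the entity value is made up for the example.

```python
import hashlib
import json

# Made-up entity data, used only for this illustration.
entity = {"id": "Q1471595", "descriptions": {"nl": {"language": "nl", "value": "voorbeeld"}}}


def runtime_a_size(data: dict) -> int:
    """Size as measured by a hypothetical runtime A (default JSON separators)."""
    return len(json.dumps(data, sort_keys=True))


def runtime_b_size(data: dict) -> int:
    """Size as measured by a hypothetical runtime B (compact JSON separators)."""
    return len(json.dumps(data, sort_keys=True, separators=(",", ":")))


def content_sha1(data: dict) -> str:
    """Hash over one fixed, canonical serialization; independent of the runtime."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


# Same content, different measured size, identical hash.
print(runtime_a_size(entity) != runtime_b_size(entity))
print(content_sha1(entity) == content_sha1(dict(entity)))
```

A comparison that includes the measured size therefore reports a change even though the content and its hash are identical.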

Ways forward:

  1. Change SlotRecord::hasSameContent to no longer take the size into account (given the other criteria checked in there, I guess this would be fine)
  2. (SUPER UGLY) Retrieve the latest (or the latest n) revision(s) in EntityContent::getSize and return their size if the sha1 matches.
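
As a rough sketch of option 1, here is the difference between a comparison that includes the size and one that relies on the hash alone. This is a simplified model, not the actual `SlotRecord::hasSameContent` implementation: `SlotInfo` and both functions are hypothetical, the sizes are invented, and the real method checks further criteria that are omitted here.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotInfo:
    """Hypothetical, simplified stand-in for the stored revision metadata."""
    sha1: str
    size: int


def same_content_with_size(a: SlotInfo, b: SlotInfo) -> bool:
    """Treat content as equal only if both size and hash match."""
    return a.size == b.size and a.sha1 == b.sha1


def same_content_hash_only(a: SlotInfo, b: SlotInfo) -> bool:
    """Option 1: ignore the size and compare the hash only."""
    return a.sha1 == b.sha1


digest = hashlib.sha1(b'{"id":"Q1471595"}').hexdigest()
parent = SlotInfo(sha1=digest, size=120)  # invented size, measured by one runtime
child = SlotInfo(sha1=digest, size=123)   # invented size, measured by another

print(same_content_with_size(parent, child))  # False: recorded as a new, empty edit
print(same_content_hash_only(parent, child))  # True: recognised as a null edit
```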

Change SlotRecord::hasSameContent to no longer take the size into account (given the other criteria checked in there, I guess this would be fine)

Comparing the hash would be sufficient, but it's expensive if the hash isn't in the database (can't happen with the new schema, but the new schema isn't even the default yet). But doesn't the hash also depend on the php serialization?... Or is that based on the entity's json serialization for some reason?

  1. Stop using serialize in EntityContent::getSize() and instead use e.g. json_encode()? (Though that only fixes the problem going forward, and might make the problem much worse when comparing against entities whose size was still computed using serialize?)

Change SlotRecord::hasSameContent to no longer take the size into account (given the other criteria checked in there, I guess this would be fine)

Comparing the hash would be sufficient, but it's expensive if the hash isn't in the database (can't happen with the new schema, but the new schema isn't even the default yet). But doesn't the hash also depend on the php serialization?... Or is that based on the entity's json serialization for some reason?

That seems to be based on the JSON (at least it seems to be stable even when the size, and thus the serialization, changes).

  1. Stop using serialize in EntityContent::getSize() and instead use e.g. json_encode()? (Though that only fixes the problem going forward, and might make the problem much worse when comparing against entities whose size was still computed using serialize?)

Yes, we should definitely do that… but unless we back-propagate it (yikes!), this won't solve this particular problem.

I just checked this for the revisions 889025858 and 917938818. Both refer to equal content blobs (before and after inflating).

[...]

  1. (SUPER UGLY) Retrieve the latest (or the latest n) revision(s) in EntityContent::getSize and return their size if the sha1 matches.

It is interesting that the sha1 is the same for those revisions. This isn't the case for the diff mentioned in the original post which covers more time and also has a different size and sha1 but no changes: https://w.wiki/3Sa.
Maybe the change to how we serialize things affected that item?

Maybe the change to how we serialize things affected that item?

For the sha1, the JSON version is taken into account, so that's not the case. But please note that we transform the JSON before passing it to the API, so what you get there is not what's actually in the DB.

As can be seen when comparing P8452 and P8453, the serialization format actually changed (all Snak hashes changed and the wikibase-entityid datavalues gained an additional id key). These null changes have been around forever and are probably not (at least not entirely) new.

But please note that we transform the JSON before passing it to the API, so what you get there is not what's actually in the DB.

That is the reason I've been trying to get the DB contents directly. Is there a practical way to get the DB's contents as in P8452 without connecting to the production servers?

That is the reason I've been trying to get the DB contents directly. Is there a practical way to get the DB's contents as in P8452 without connecting to the production servers?

No. External storage data is not replicated to Labs.

Change 507027 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Don't allow edits with equal content

https://gerrit.wikimedia.org/r/507027

Change 507027 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Don't allow edits with equal content

https://gerrit.wikimedia.org/r/507027

\o/
Closing this. If it still happens for any new edits please reopen.