
Remove BETA from Wikidata entities dump
Closed, Resolved, Public

Description

Since the Wikidata RDF ontology is not "beta" anymore, it's time to remove the BETA marker from RDF dumps. The name is currently e.g. wikidata-20190617-all-BETA.ttl.bz2 but should be just wikidata-20190617-all.ttl.bz2.
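For downloaders who need to adapt their scripts, the rename described above can be sketched as a small filename helper. This is only an illustration, not the actual puppet change: the function name `strip_beta` is hypothetical, and it assumes the marker always appears as `-BETA` directly before a `.ttl` or `.nt` extension.

```python
import re

# Assumption: the marker is the literal "-BETA" placed right before the
# RDF serialization extension (.ttl or .nt), as in the example filename.
_BETA_RE = re.compile(r"-BETA(?=\.(?:ttl|nt)\.)")


def strip_beta(filename: str) -> str:
    """Return the dump filename without the -BETA marker, if present."""
    return _BETA_RE.sub("", filename, count=1)


old_name = "wikidata-20190617-all-BETA.ttl.bz2"
new_name = strip_beta(old_name)
print(new_name)
```

Filenames that carry no marker, such as the JSON dumps, pass through unchanged.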

Event Timeline

Smalyshev added a project: User-Smalyshev.
Smalyshev moved this task from Backlog to Next on the User-Smalyshev board.

Time to figure out who/how we notify, and put a date out for the name change.

I think we need to drop a note to wikidata-l, maybe also add something to Weekly notes (@Lea_Lacroix_WMDE ?). Not sure what else.

Because this affects downloaders, might as well blast xmldatadumps-l and tbh I would forward to wikitech-l too.

I've sent a note to wikidata and xmldatadumps-l lists.

I take care of Wikidata newsletter and TechNews. Any idea when this change will take place?

What do people think of a July 29 deadline (the start of that run)? Unfortunately we can't really do a 1st of the month change.

Seems enough for people to change their code if needed.

Smalyshev triaged this task as Medium priority. Jun 20 2019, 6:55 PM

Change 518108 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Remove BETA from RDF dump filenames

https://gerrit.wikimedia.org/r/518108

Thanks for the ping! I don't use RDF dumps at the moment, and I'm fine with this change.

I thought we agreed above on a July 29 deadline?

Oh, July... Somehow I've read that as "June". Maybe a bit earlier? Couple of weeks should be enough for preparing the software...

OK, let's go for July 15th then, again between runs. How does that sound? (But let's make sure that date is announced everywhere.)

@ArielGlenn OK for me, I'll make sure that it's announced asap.

@ArielGlenn just to be sure, are you going to rename only the new dumps to come, or also the previous ones?

My understanding was that we would rename them going forwards only.

Announced ✅ on Wikidata, on the wikidata, wikidata-tech, wikitech-l, xmldatadumps-l mailing-lists, on Weekly Summary and TechNews.

Change 518108 merged by ArielGlenn:
[operations/puppet@production] Remove BETA from RDF dump filenames

https://gerrit.wikimedia.org/r/518108

Smalyshev moved this task from Waiting/Blocked to Done on the User-Smalyshev board.

While this issue is supposed to be closed, https://dumps.wikimedia.org/wikidatawiki/entities/20210628/ still shows "-all-BETA" dumps (in .nt and .ttl formats) alongside an -all.json dump. Is that expected? Can you please confirm that the content of those dumps is the same except for the serialization format?

Hm, I think there are two different things here.

  1. It looks like we removed the “-BETA” from the name of the latest dumps (e.g. latest-all.ttl.gz), but not from the timestamped ones (e.g. wikidata-20210628-all-BETA.ttl.gz). This wasn’t mentioned in the announcement, so I don’t think it’s intentional, and we probably want to fix it.
  2. @Rtroncy, I’m not sure what you mean by the same content, but as far as I’m aware, we don’t guarantee any atomicity for those dumps, neither within a dump nor between them. Since the .nt, .ttl and .json dumps are created independently (as far as I know), they probably don’t quite contain the same data, because Wikidata edits continue while the dumpers are working. Does that answer your question?
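Until the timestamped names are fixed, a downloader could try both spellings. The following is a minimal sketch under assumptions: the helper name `candidate_urls` is hypothetical, the URL layout is inferred from the directory and filenames quoted in this thread, and the code only builds the URLs without fetching anything.

```python
from typing import List

# Base directory taken from the dump listing URL quoted in this task.
BASE_URL = "https://dumps.wikimedia.org/wikidatawiki/entities"


def candidate_urls(date: str, fmt: str = "ttl", compression: str = "gz") -> List[str]:
    """Build URLs for a timestamped RDF dump, new name first, then the -BETA name."""
    names = [
        f"wikidata-{date}-all.{fmt}.{compression}",       # intended name
        f"wikidata-{date}-all-BETA.{fmt}.{compression}",  # name still seen on timestamped dumps
    ]
    return [f"{BASE_URL}/{date}/{name}" for name in names]


for url in candidate_urls("20210628"):
    print(url)
```

A client would request each URL in order and use the first one that exists, so it keeps working whether or not the marker is later dropped from the timestamped files.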

Thanks for the clarifications, this perfectly answers my questions. I would still consider the differences between the dump formats to be minor, even if the processes are independent. It is nonetheless worth highlighting, since I don't think many people are aware of this.