Currently, wikidata-entities dumps are generated on a fixed-weekday basis (e.g. every other Monday). It would be easier for some use cases to get a fixed day-of-month basis (e.g. the 1st and 15th of each month).
I don't have a strong opinion on this one. Generally, the only concern for dump users is for the dump to be recent enough to be useful (wikidata is rather volatile, and its linked nature means that volatility in one area can influence results in a bigger one). That said, I do not see a huge problem with having dumps on the 1st and 15th. The only argument I can see for weekly dumps is that we could time them for when editor activity is lowest to reduce impact, but I am not sure whether that's an important consideration.
@ArielGlenn : Could we decide on regular day-of-month patterns for the various entity-dumps that need to be generated?
Here is my suggestion:
| Entities | Formats | Current frequency | New suggested frequency |
| --- | --- | --- | --- |
| all | json / nt / ttl | Every Monday | 1st, 8th, 15th, 22nd of every month |
| truthy | nt | Every Wednesday | 3rd, 10th, 17th, 24th of every month |
| lexemes | nt / ttl | Every Friday | 5th, 12th, 19th, 26th of every month |
We'd be a bit late at the end of the month (worst case, +3 days of waiting compared to regular releases). Is that an issue?
What I'd prefer to do, if we are changing things around, is to run one right after another: all, truthy, lexemes back to back, and decide on a starting date for the first one. This ensures we aren't running more than one type at once (control over resource use on the server) and that there's no dead time between runs (so server maintenance etc. can be scheduled more easily).
Works for me :) I assume the system would work similarly to the existing XML dumps, meaning that dumps would be generated in the same date folder (1st, 8th, 15th, 22nd of every month for instance), one after the other, providing information on availability in a json file?
The directories and any links or status files would be as they are now, but the date could, I presume, be passed into the script. Again, though, we need to see what existing users need and expect.
I've sent mail to wikitech-l, xmldatadumps-l and research-internal (not sure if that last arrived, I may not be subscribed and tbh I am full up on subscriptions). See https://lists.wikimedia.org/pipermail/xmldatadumps-l/2019-February/001455.html
Let's give folks 7 days and if no one steps forward we'll rearrange things.
Hi Ariel et al.,
I don't think switching data dump generation from a pattern like "every Monday" to a pattern like "on the 1st, 8th, 15th, 22nd of each month" would have any real impact on data consumers that follow best practices.
For example, the best practice that we (Yahoo / Verizon Media) and others follow is to check once a day if a new dataset has been made available, and download this new dataset if it differs from the previous one.
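A minimal sketch of that "poll daily, fetch only when new" pattern, using the `Last-Modified` header as the change marker. The URL below is illustrative; the actual file layout on the dumps site may differ, and the stamp handling (where the previous value is stored) is left to the caller:

```python
import urllib.request
from typing import Optional

# Hypothetical URL; the real layout on dumps.wikimedia.org may differ.
LATEST_URL = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"

def dump_last_modified(url: str) -> Optional[str]:
    """Fetch only the headers to see when the dump was last regenerated."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("Last-Modified")

def should_download(current_stamp: Optional[str], previous_stamp: Optional[str]) -> bool:
    """Download only when the server reports something newer than what we kept."""
    return current_stamp is not None and current_stamp != previous_stamp
```

A daily cron job would call `dump_last_modified`, compare the result against the stamp saved from the previous run, and fetch the file only when `should_download` returns true. Note that this works identically whether the dumps appear on a fixed weekday or a fixed day of the month.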
As a json dump consumer, I have a script run by crontab to retrieve the latest dump each week. Usually the download starts on Tuesday and I can work with it on Wednesday.
Changing when the dumps are generated means I'll have to keep an eye on the current date just because of this change (everyone always knows what day of the week it is; far fewer people know which day of the month it is). That makes me wonder: what are the use cases where this will make things easier?
Another thing: the generation of a dump sometimes fails, and I see it re-started. With all the dumps generated one after the other to avoid competing for the same resources, does that mean a re-start of a failed dump would have to wait until all the other generations are finished? Just wondering.
I'm also interested in the specific reasons why the update frequency needs to be changed, i.e. besides streamlining the monthly workload on the Wikimedia machines.
(@Melderick: FWIW, cron can also be configured to start on specific days of the month, and the fact that jobs may fail or be postponed is the very reason why we don't rely too heavily on official release calendars for data collection)
I would get in touch with Women in Red, and ping by email other groups I know to be ingesting these dumps. By Friday I'll have done that; by next Wednesday let's make a decision, barring any huge obstacles.
Summarizing an IRC conversation with @JAllemandou :
- Analytics ingests full revision history xml dumps from the run on the 1st of the month; these correspond to the xml revision metadata ('stub') files, which are generated on the 1st or 2nd of the month depending on how long they take to complete. While some revisions may have been hidden or deleted, the revision history files generally represent the content state on the 1st or 2nd.
- Analytics also gathers other data around the 1st of the month that is correlated with the revision content.
- Analytics ingests wikidata entity dumps in json format, and would like that data to be as close to the 1st of the month as possible. This means the wikidata-YYYYMMDD-all.json.bz2 (or gz) files specifically.
In his own words, "I want to productionize joining wikidata json dumps to other datasources, and matching dates at the month level is really what I'm after".
Email sent to wikitech-l at https://lists.wikimedia.org/pipermail/wikitech-l/2019-March/091705.html
I sometimes wonder if dumps should have been split off to their own email list. Oh well. Googling for these dumps produces an impressive number of hits, no idea how to reach all those folks or even whether it's important to them. Anywhere else I ought to go fishing?
@ArielGlenn The proposed change from every other Monday to the 1st and 15th of each month will be fine for Women in Red. What's important for us is that the update is frequent enough (e.g. twice per month is fine) and that it's available on the day when it's expected to update. If/when this change is finalized, please email me as I don't keep an eye on Phabricator. Thanks. cc: @notconfusing
The reason for the reschedule is to have the first run coincide with the xml/sql dumps run, also on the first of the month, so that the results more or less correspond. Having the second run on the 20th will do that for both entity runs.
The xml/sql dumps run twice a month; the 'full' dumps, containing all content for all revisions, run on the 1st of the month. The 'partial' dumps, containing all content for latest revisions only, run on the 20th of the month. They can't really run on the 15th because the previous run doesn't complete in time. We've done various tricks to speed things up but that just lets us break even as the amount of content continues to grow.
If we want to 'sync up' the wikidata entity dumps with the xml/sql dumps, which seems to be a good idea, I could add in one more run on the 8th or 9th as an 'extra' run but we can't really fit a 4th run in there.
Hello, can't we have a bit of both worlds? Keep a weekly wikidata dump and have it sync with the xml/sql dumps only on the 1st of each month?
Because, really, having almost 3 weeks with no wikidata dump is a major change.
And it is sad that I seem to be the only one concerned about it.
Hello, I'm also very surprised by how things are evolving here. This task was about changing how dump generation is scheduled (by week or by month), not their frequency. The first proposals by @JAllemandou had 4 dump generations per month. Now, you are talking about reducing the frequency from 4-5 per month to only 2, on an irregular basis (sometimes 3 weeks between two dumps, sometimes 1 week).
I also don't understand the goal of this change. What is the point of synchronizing the XML/SQL and JSON dump generations? As a user of the JSON dumps, I prefer to have them on a regular basis, rather than synchronized with another dump that I don't use (especially if it's just the same data in another format).
+1 to what was said above; I also thought it was just a change of day, not a change in frequency, and communicated it as such. I agree that we should keep the existing frequency of the JSON dumps; if the plan is to change it, then I need to announce it to the community in a clearer way.
Summarizing, we have three things at play:
- consistency between entity dumps and xml/sql dumps, for people who use both and are trying to get a consistent picture of the data
- frequency; it has been once a week until now; the best we could do if we 'sync up' entity dumps and the two monthly runs of xml/sql dumps is 3 times a month
- allowing time for maintenance windows; at some point we need times when nothing is writing to the back-end server, so that we can do things like security updates (we have a fallback server, but it's not nice to switch it in as the main NFS server while writes are happening; it's intended for use in case the first back-end server dies)
Currently the xml/sql dumps run from the 1st of the month until around the 15th or 16th; entity dumps start Monday 3 am UTC and complete sometime Friday, although they have taken longer (running through late Sunday night). The newly updated patch would run entity dumps on the 1st, 10th and 20th of the month. This is not as frequent as before but not awful; it allows predictable maintenance windows as needed (18th-19th, 28th-??), and it allows users of both dumps to have a more-or-less consistent picture across them in cases where they need it.
What do folks think about this?
Obviously no change will go out until we have agreement here, and whatever we do must be clearly and widely announced.
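To make the proposed calendar concrete, here is a small sketch of how a consumer could find the most recent run date under the 1st/10th/20th schedule. The run days come from the proposal above; everything else (the function name, the month-boundary behavior) is illustrative:

```python
from datetime import date

# Run days from the proposed schedule above.
RUN_DAYS = (1, 10, 20)

def latest_run(today: date) -> date:
    """Date of the most recent entity-dump run on or before `today`.

    Day 1 is always a candidate, so max() never sees an empty sequence
    within a month.
    """
    return today.replace(day=max(d for d in RUN_DAYS if d <= today.day))
```

For example, `latest_run(date(2019, 3, 17))` gives `date(2019, 3, 10)`: the 10th-of-month run is the freshest one available on the 17th.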
Thanks for the summary Ariel.
- Changing the Wikidata dump generation start dates won't guarantee absolute consistency between Wiki dumps anyway.
- Technical glitches and maintenance windows will keep happening too.
- So dealing with inconsistencies is something that external users will still need to manage anyway to guarantee data quality.
- I'd even argue that external users like me actually benefit more from freshness (i.e. weekly dumps) than approximate synchronization b/w dumps.
Is this compromise driven by a lack of resources?
Anyway, this is just my point of view. I'll be fine either way, and I'm very grateful for your work and for what Wikidata provides to the community.
[Disclaimer: I work on the Yahoo Knowledge Graph at Verizon Media]
I agree with @Nicolastorzec above.
I suspect that the entity dumps are more popular and cheaper to generate than the full SQL/XML dumps, so I would argue that they should be generated more often. But if it already takes almost a week to generate the entity dumps, it does make sense to generate them less often. Even if the entity dumps and the SQL/XML dumps were generated "at the same time", it would probably be hard to give guarantees about their relatedness: some revisions will always be included in one but not the other, given the different time frames for generation, no?
I have proposed a patch to add the revision id for each entity in the JSON dumps. This way, users have at least a way to compare the freshness of the data from different sources.
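With such a field in place, a consumer could compare copies of the same entity from different sources. A minimal sketch; the field name `lastrevid` and the sample values are assumptions for illustration, not the patch's actual format:

```python
import json

def fresher(entity_a: dict, entity_b: dict) -> dict:
    """Return whichever copy of the same entity carries the higher revision id."""
    assert entity_a["id"] == entity_b["id"], "compare copies of the same entity"
    return entity_a if entity_a["lastrevid"] >= entity_b["lastrevid"] else entity_b

# Two hypothetical copies of Q42: one from a dump, one from a live API call.
dump_copy = json.loads('{"id": "Q42", "lastrevid": 900000000}')
api_copy = json.loads('{"id": "Q42", "lastrevid": 900000123}')

print(fresher(dump_copy, api_copy)["lastrevid"])  # the API copy is fresher
```

Since revision ids increase monotonically per wiki, this gives a simple total order on copies of an entity without needing timestamps.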
Yeah, given everything we know about re-use, I don't think syncing between the different dump formats is more important than freshness, especially given that they are never really going to be in sync anyway.