
Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday
Closed, Declined (Public)

Description

Currently, wikidata-entities dumps are generated on a fixed-weekday basis (for instance, Monday every two weeks). For some use cases it would be easier to have them on a fixed day-of-month basis (1st and 15th day of the month).

Event Timeline

Adding @hoo and @Smalyshev because they may have an idea of the needs of current users of these entity dumps.

I don't have a strong opinion on this one. Generally, the only concern for dump users is that the dump be recent enough to be useful (Wikidata is rather volatile, and its linked nature means that volatility in one area can influence results in a bigger area). That said, I do not see a huge problem with having dumps on the 1st and 15th. The only argument I can see for weekly dumps is that we could time them for when editor activity is lowest, to reduce impact, but I am not sure whether that's an important consideration.

@ArielGlenn : Could we decide on regular day-in-month patterns for the various entity-dumps that need to be generated?
Here is my suggestion:

| Entities | Formats | Current frequency | New suggested frequency |
| --- | --- | --- | --- |
| all | json / nt / ttl | Every Monday | 1st, 8th, 15th, 22nd of every month |
| truthy | nt | Every Wednesday | 3rd, 10th, 17th, 24th of every month |
| lexemes | nt / ttl | Every Friday | 5th, 12th, 19th, 26th of every month |

We'd be a bit late at end of month (worst case +3 days of waiting in comparison to regular releases). Is that an issue?
Thanks :)
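
The "+3 days" figure above can be checked with a little date arithmetic. The following is a minimal sketch, not part of any dump tooling: the helper name `gaps_between_runs` is hypothetical, and the only assumption is that a run starts on each listed day of every month.

```python
from datetime import date


def gaps_between_runs(days_of_month, year):
    """Return the gaps (in days) between consecutive run dates in `year`.

    Hypothetical helper for illustration; `days_of_month` is the proposed
    schedule, e.g. (1, 8, 15, 22). Days must exist in every month (<= 28).
    """
    runs = [
        date(year, month, day)
        for month in range(1, 13)
        for day in sorted(days_of_month)
    ]
    # Add the first run of the following year to close the last gap.
    runs.append(date(year + 1, 1, min(days_of_month)))
    return [(later - earlier).days for earlier, later in zip(runs, runs[1:])]


gaps = gaps_between_runs((1, 8, 15, 22), 2019)
print(min(gaps), max(gaps))
```

For 2019 the shortest gap is 7 days and the longest is 10 days (from the 22nd to the 1st after a 31-day month), which is the 3 extra days mentioned above compared with a strict weekly cadence.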

What I'd prefer to do, if we are changing things around, is to do one right after another, so: all, truthy, lexemes back to back, and decide on a starting date for the first one. This ensures we aren't running more than one type at once (control over resource use on the server) and that there's no dead time between runs (so server maintenance etc can be scheduled easier).

Works for me :) I assume the system would work similarly to the existing XML dumps, meaning that dumps would be generated in the same date folder (1st, 8th, 15th, 22nd of every month for instance), one after the other, providing information on availability in a json file?

The directories and any links or status files would be as they are now, but the date could I presume be passed into the script. We need again to see what existing users need and expect though.

I've sent mail to wikitech-l, xmldatadumps-l and research-internal (not sure if that last arrived, I may not be subscribed and tbh I am full up on subscriptions). See https://lists.wikimedia.org/pipermail/xmldatadumps-l/2019-February/001455.html

Let's give folks 7 days and if no one steps forward we'll rearrange things.

Hi Ariel et al.,

I don't think switching data dump generation from a pattern like "every Monday" to a pattern like "on the 1st, 8th, 15th, 22nd of each month" would have any real impact on data consumers that follow best practices.

For example, the best practice that we (Yahoo / Verizon Media) and others follow is to check once a day if a new dataset has been made available, and download this new dataset if it differs from the previous one.

-N
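
The daily check described above could look roughly like the sketch below. It is an illustration only: `fetch_fingerprint`, `download` and the state file are hypothetical, caller-supplied pieces, and the "fingerprint" is assumed to be anything that changes when a new dump is published (a published checksum or a last-modified value, for example).

```python
import json
from pathlib import Path


def check_for_new_dump(state_path, fetch_fingerprint, download):
    """Download the dump only if its fingerprint changed since the last check.

    `fetch_fingerprint` and `download` are hypothetical callables supplied by
    the consumer. Returns True if a download happened, False otherwise.
    """
    state_file = Path(state_path)
    previous = None
    if state_file.exists():
        previous = json.loads(state_file.read_text()).get("fingerprint")

    current = fetch_fingerprint()
    if current == previous:
        return False  # Nothing new since the last check.

    download()
    # Record the fingerprint only after the download has completed.
    state_file.write_text(json.dumps({"fingerprint": current}))
    return True
```

Run once a day, a check like this does not depend on which weekday or day of the month the dump is generated.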

Hello,
As a json dump consumer, I have a script run by crontab to retrieve the latest dump each week. Usually the download starts on Tuesday and I can work with it on Wednesday.
Changing when the dumps are generated means I'll have to keep an eye on the current date just because of this change (everyone always knows which day of the week it is; not that many people know which day of the month it is). That makes me wonder: what are the use cases where this will make things easier?

Another thing: the generation of a dump sometimes fails and I see it restarted. With all the dumps generated one after the other to avoid competing for the same resources, does it mean that a restart of a failed dump would have to wait until all the other generations are finished? Just wondering.

I'm also interested in the specific reasons why the update frequency needs to be changed, i.e. beside streamlining the monthly workload on the Wikimedia machines.

(@Melderick: FWIW, cron can also be configured to start on specific days of the month, and the fact that jobs may fail or be postponed is the very reason why we don't rely too heavily on official release calendars for data collection)

I can't speak about failures and restarts, as I don't know much about the dumps-generation process. @ArielGlenn would be the person to know best.
As for the dates, the main reason we ask for the change is date consistency by month, mimicking what exists for the xml dumps.

@Melderick Restarts are done automatically by the script running the specific dump, up to a certain number of times. That is independent of the starting date. If we have one wrapper script run them one after another, the same will still hold.
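
A rough sketch of that behaviour, for illustration only: each job is retried up to a fixed number of times, and a wrapper runs the jobs one after another. The function names, the limit of 3 attempts, and the choice to move on to the next job after the final failure are all assumptions, not a description of the actual dump scripts.

```python
def run_with_retries(job, max_attempts):
    """Run `job` until it succeeds or `max_attempts` is reached.

    Returns the attempt number that succeeded, or None if all attempts failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            job()
            return attempt
        except Exception as error:
            print(f"attempt {attempt} failed: {error}")
    return None


def run_all_in_order(jobs, max_attempts=3):
    """Run each (name, job) pair back to back, never two at once.

    Assumption: a job that exhausts its retries does not block the next one.
    """
    results = {}
    for name, job in jobs:
        results[name] = run_with_retries(job, max_attempts)
    return results
```

Because retries happen inside each job's own slot, a restart does not have to wait for the other dump types to finish; it only delays the jobs queued after it.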

@ArielGlenn I see. Thanks for the information.
@Nicolastorzec Yes, I don't worry about configuring cron to automatically download the dump whatever the day it is generated. My main concern is having to change my habits for an unclear benefit.

Hello,

Women in Red rely on Wikidata Human Gender Indicators (WHGI) statistics, which are generated once a week using the Wikidata dump. If not already done, you should get in touch with these communities to evaluate the impact of the change.

Following up on this: another viable solution to get monthly-coherence between dumps is to force a dump on the 1st of the month ... I'm not sure the idea is better.
@ArielGlenn - How do we proceed to try moving forward (in either direction) ?

I would get in touch with Women in Red, and ping by email other groups I know to be ingesting these dumps. By Friday I'll have done that; by next Wednesday let's make a decision, barring any huge obstacles.

@notconfusing I think you are the WHGI person; can you please weigh in?

> By Friday I'll have done that; by next Wednesday let's make a decision, barring any huge obstacles.

Awesome, thanks @ArielGlenn :)

Mail sent to Rosie Stephenson-Goodknight of the Women in Red group. Mail also sent to Max Klein in case he doesn't monitor phab notifications closely

Summarizing an IRC conversation with @JAllemandou :

  • Analytics ingests full revision history xml dumps from the run on the 1st of the month; these correspond to the xml revision metadata ('stub') files, which are generated on the 1st or 2nd of the month depending on how long they take to complete. While some revisions may have been hidden or deleted, generally the revision history files represent the content state on the 1st or 2nd.
  • Analytics also gathers other data around the 1st of the month that is correlated with the revision content.
  • Analytics ingests wikidata entity dumps in json format, and would like that data to be as close to the 1st of the month as possible. This means the wikidata-YYYYMMDD-all.json.bz2 (or gz) files specifically.

In his own words, "I want to productionize joining wikidata json dumps to other datasources, and matching dates at the month level is really what I'm after."
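
To illustrate the "as close to the 1st of the month as possible" requirement under a weekly schedule, here is a minimal sketch that picks the nearest dump by file name. The file name pattern follows the wikidata-YYYYMMDD-all.json.bz2 (or gz) form mentioned above; the function name and the example file list are hypothetical.

```python
import re
from datetime import date

# Matches names such as wikidata-20190304-all.json.bz2 or ...-all.json.gz
DUMP_NAME = re.compile(r"^wikidata-(\d{4})(\d{2})(\d{2})-all\.json\.(?:bz2|gz)$")


def closest_to_first_of_month(filenames, year, month):
    """Return the dump file whose date is nearest to the 1st of the given month.

    Names that do not match the pattern are ignored; returns None if no
    name matches.
    """
    target = date(year, month, 1)
    dated = []
    for name in filenames:
        match = DUMP_NAME.match(name)
        if match:
            y, m, d = map(int, match.groups())
            dated.append((abs((date(y, m, d) - target).days), name))
    return min(dated)[1] if dated else None


files = [
    "wikidata-20190225-all.json.bz2",
    "wikidata-20190304-all.json.bz2",
    "wikidata-20190311-all.json.gz",
]
print(closest_to_first_of_month(files, 2019, 3))
```

With weekly Monday runs, the nearest dump can be up to a few days away from the 1st (here 3 days, for the 4 March file), which is the mismatch the fixed day-of-month proposal is trying to remove.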

ArielGlenn added a subscriber: Mvolz.

The WikiCite project apparently uses these dumps, so adding them too. @Mvolz (hope you're the right person) can you let us know if schedule changes are going to be ok, or what would be needed to make sure WikiCite's data workflow isn't disrupted?

Email sent to wikitech-l at https://lists.wikimedia.org/pipermail/wikitech-l/2019-March/091705.html

I sometimes wonder if dumps should have been split off to their own email list. Oh well. Googling for these dumps produces an impressive number of hits, no idea how to reach all those folks or even whether it's important to them. Anywhere else I ought to go fishing?

@ArielGlenn I already mentioned it in the Wikidata newsletter a few weeks ago, will put a reminder in the next one :)

@Lea_Lacroix_WMDE Thanks, because absolutely no one showed up, and I would rather we get people sorted before the change instead of having them scream afterwards :-)

@ArielGlenn The proposed change from every other Monday to the 1st and 15th of each month will be fine for Women in Red. What's important for us is that the update is frequent enough (e.g. twice per month is fine) and that it's available on the day when it's expected to update. If/when this change is finalized, please email me, as I don't keep an eye on Phabricator. Thanks. cc: @notconfusing

Normally whgi.wmflabs.org tries to run the day after the dumps are created to make its data as close to real time as possible. 1st and 15th would work for WHGI; I would just have to adjust the crontab. Thanks @ArielGlenn for being thoughtful beforehand.

@ArielGlenn, actually, I came here because I saw @Lea_Lacroix_WMDE's message in the newsletter :)

@ArielGlenn, like @Melderick, I commented here thanks to @Lea_Lacroix_WMDE's addition in the Wikidata weekly summary a few weeks ago :)

Change 498164 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] start wikidata entity dumps on the 1st and 20th of each month

https://gerrit.wikimedia.org/r/498164

Why 1st and 20th of each month, rather than more equally-spaced, e.g. 1st and 15th, or 7th and 21st?

The reason for the reschedule is to have the first run correspond with the xml/sql dumps run, which also starts on the first of the month, so that the results more or less correspond. Having the second run on the 20th will do that for both entity runs.

@ArielGlenn I'm getting confused here. Wikidata dumps are generated every week (well, at least the JSON dumps are). And you're changing them to be generated only twice a month?

@ArielGlenn Am I understanding correctly that the xml/sql dumps run only on the 1st of the month and the 20th of the month, but not on other days (e.g. 15th)?

The xml/sql dumps run twice a month; the 'full' dumps, containing all content for all revisions, run on the 1st of the month. The 'partial' dumps, containing all content for latest revisions only, run on the 20th of the month. They can't really run on the 15th because the previous run doesn't complete in time. We've done various tricks to speed things up but that just lets us break even as the amount of content continues to grow.

If we want to 'sync up' the wikidata entity dumps with the xml/sql dumps, which seems to be a good idea, I could add in one more run on the 8th or 9th as an 'extra' run but we can't really fit a 4th run in there.

@ArielGlenn - Reflecting on what you've said, the 1st and the 20th are fine for Women in Red. Thanks for what you do. We appreciate it.

Hello, can't we try to have a bit of both worlds? Keep a weekly wikidata dump and have it sync with the xml/sql dumps only on the 1st of each month?
Because, really, having almost 3 weeks with no wikidata dump is a major change.
And it is sad that I seem to be the only one concerned about it.

Hello, I'm also very surprised at how things are evolving here. This task was about changing how dump generation is scheduled (based on week or on month), not its frequency. The first proposals by @JAllemandou had 4 dump generations per month. Now you are talking about reducing the frequency from 4-5 per month to only 2, on an irregular basis (sometimes with 3 weeks between two dumps, sometimes with 1 week).

I also don't understand what is the goal of this change. What is the point to have XML/SQL and JSON dumps generations synchronized? As a user of JSON dumps, I prefer to have JSON dumps on a regular basis, rather than synchronized to another dump that I don't use (especially if it's just the same data in another format).

+1 with what was said above, I also thought that it was just a change of day and not a change in the frequency, and communicated it as such. I agree that we should keep the existing frequency of the JSON dumps ; if the plan is to change it, then I need to announce it to the community in a clearer way.

Summarizing, we have three things at play:

  • consistency between entity dumps and xml/sql dumps, for people who use them both and are trying to get a consistent picture of the data
  • frequency; it has been once a week until now; the best we could do if we 'sync up' entity dumps and the two monthly runs of xml/sql dumps is 3 times a month
  • allowing time for maintenance windows; at some point we need to have times that nothing is writing to the back-end server, so that we can do things like security updates (we have a fallback server but it's not nice to switch it in as main NFS server while writes are happening, it's intended for use in case the first backend server dies)

Currently xml/sql dumps run 1st of the month, until around the 15th or 16th; entity dumps run starting Monday 3 am UTC and complete sometime Friday, although they have taken longer (running through Sunday late night). The newly updated patch would run entity dumps the 1st, 10th and 20th of the month. This is not as frequent as before but not awful; it allows predictable maintenance windows as needed (18th-19th, 28th-??); it allows users of both dumps to have a more-or-less consistent picture across both dumps in cases where they need it.

What do folks think about this?

Obviously no change will go out until we have agreement here, and whatever we do must be clearly and widely announced.

Thanks for the summary Ariel.

FWIW:

  • Changing the Wikidata dump generation start dates won't guarantee absolute consistency between Wiki dumps anyway.
  • Technical glitches and maintenance windows will keep happening too.
  • So dealing with inconsistencies is something that external users will still need to manage anyway to guarantee data quality.
  • I'd even argue that external users like me actually benefit more from freshness (i.e. weekly dumps) than approximate synchronization b/w dumps.

Is this compromise driven by a lack of resources?
Anyway, this is just my point of view. I'll be fine either way, and I'm very grateful for your work and for what Wikidata provides to the community.

[Disclaimer: I work on the Yahoo Knowledge Graph at Verizon Media]

I agree with @Nicolastorzec above.

I suspect that the entity dumps are more popular and cheaper to generate than the full SQL/XML dumps, so I would argue that they should be generated more often. But if it already takes almost a week to generate the entity dumps, it does make sense to generate them less often. Even if the entity dumps and the SQL/XML dumps were generated "at the same time", it would probably be hard to give guarantees about their relatedness: some revisions will always be included in one but not the other given the different time frames for the generation, no?

I have proposed a patch to add the revision id for each entity in the JSON dumps. This way, users have at least a way to compare the freshness of the data from different sources.
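
As a sketch of how a consumer might use such a revision id, the snippet below compares two copies of the same entity. It is illustrative only: the field name `lastrevid` is an assumption (the comment above only says a patch was proposed), the entity records are simplified, and it assumes that a higher revision id means a more recent state.

```python
import json


def fresher_source(entity_json_a, entity_json_b, field="lastrevid"):
    """Return 'a', 'b', or 'same' depending on which copy is more recent.

    `field` is the assumed name of the per-entity revision id; both inputs
    are JSON strings describing the same entity from two different sources.
    """
    rev_a = json.loads(entity_json_a)[field]
    rev_b = json.loads(entity_json_b)[field]
    if rev_a == rev_b:
        return "same"
    return "a" if rev_a > rev_b else "b"


from_dump = '{"id": "Q42", "lastrevid": 100}'
from_other = '{"id": "Q42", "lastrevid": 140}'
print(fresher_source(from_dump, from_other))
```

A per-entity comparison like this lets users detect and handle mismatches between sources themselves, regardless of when each dump was started.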

Yeah given everything we know about re-use I don't think syncing between the different dump formats is more important than freshness - especially given that they are never really going to be in sync anyway.

@JAllemandou bouncing back to you in light of what others have said here.

The community has spoken; we'll find workarounds. Thanks a lot @ArielGlenn for helping drive this :)

I'll close this for now as declined; the idea of getting a completely consistent snapshot for all dumps is in the back of my mind always but that will be an epic task :-)

T94019 might be relevant for getting things more in sync.

Change 498164 abandoned by ArielGlenn:
explicitly start wikidata entity dumps on the 1st and 20th of each month

Reason:
no community consensus

https://gerrit.wikimedia.org/r/498164