See my private homepage for details about myself.
Fri, Sep 14
Yes, it is fine to stop the extraction for now. Many thanks!
Jul 2 2018
This is good news -- thanks for the careful review! The lack of specific threat models for this data was also a challenge for us, for similar reasons, but it is also a good sign that many years after the first SPARQL data releases, there is still no realistic danger to user anonymity known. The footer is still a good idea for general community awareness. People who do have concerns about their anonymity could be encouraged to come forward with scenarios that we should take into account.
Apr 3 2018
The code is here: https://github.com/Wikidata/QueryAnalysis
It was not written for general re-use, so it might be a bit messy in places. The code includes the public Wikidata example queries as test data that can be used without accessing any confidential information.
Dec 16 2017
I agree with Stas: regular data releases are desirable, but need further thought. The task is easier for our current case since we already know what is in the data. For a regular process, one has to be very careful to monitor potential future issues. By releasing historic data, we avoid exploits that could be theoretically possible based on detailed knowledge of the methodology.
Sep 10 2016
@AndrewSu As I just replied to Benjamin Good in this matter, it is a bit too early for this, since we only have the basic technical access as of very recently. We have not had a chance to extract any community shareable data sets yet, and it is clear that it will require some time to get clearance for such data even after we believe it is ready.
Aug 24 2016
Regarding my remaining todos:
- I have signed the L3 doc
- Here is my dedicated production-only SSH key:
- My preferred login name is "mkroetzsch", same as my user name on labs. My wikitech user name is "Markus Kroetzsch".
Aug 12 2016
P.S. Alex is on vacation and possibly disconnected. His reply might therefore be delayed. He is officially back in two weeks.
I have signed the Acknowledgement of Wikimedia Server Access Responsibilities.
Jul 5 2016
May 25 2016
We have two kinds of large data files: biweekly Wikidata json entity dumps and RDF exports that we generate from them. The RDF exports are what we offer through our website http://tools.wmflabs.org/wikidata-exports/rdf/index.php?content=exports.php
Feb 25 2016
Re parsing strings: You are skipping the first step here. The question is not which format is better for advanced interpretation, but which format is specified at all. Whatever your proposal is, I have not seen any syntactic description of it yet. If -- in addition to having a specified syntax -- it can also be parsed for more complex features, that's a nice capability. But let's maybe start by saying what the proposed "structured data" format actually is.
Feb 24 2016
+1 sounds like a workable design
Feb 16 2016
Re chemical markup for semantics: this is true for Wikitext, where you cannot otherwise know that "C" is carbon. It does not apply to Wikidata, where you already get the same information from the property used. Think of P274 as a way of putting text into "semantic markup" on Wikipedia.
Feb 15 2016
I really wonder if the introduction of all kinds of specific markup languages in Wikidata is the right way to go. We could just have a Wikitext datatype, since it seems that Wikitext became the gold standard for all these special data types recently. Mark-up over semantics. By this I mean that the choice of format is focussed on presentation, not on data exchange. I am not an expert in chemical modelling (but then again, is anyone in this thread?), but it seems that this mark-up centric approach is fairly insufficient and naive.
Feb 10 2016
The MathML expression includes the TeX representation, which can be used in LaTeX documents and also to create new statements.
The format should be the same as in JSON. If MathML is preferred there, then this is fine with me. If LaTeX is preferred, we can also use this. It seems that MathML would be a more reasonable data exchange format, but Moritz was suggesting in his emails that he does not think it to be usable enough today, so there might be practical reasons to avoid it.
Nov 23 2015
Nov 19 2015
...and if we consider our data dump to be an ontology, then what isn't an ontology?
Nov 18 2015
I don't want to detail every bit here, but it should be clear that one can easily eliminate the dependency to $db in the formatter code. The Sites object I mentioned is an example. It is *not* static in our implementation. You can make it an interface. You can inject Sites (or a mocked version of it) for testing -- this is what we do. The only dependency you will retain is that the formatting code, or some code that calls the formatting code, must know where to get the URL from.
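To make the injection point concrete, here is a minimal Python sketch (not the actual Wikibase or WDTK code). The names SiteLookup, EntityLinkFormatter and FakeSites are hypothetical and only illustrate the pattern: the formatter depends on an interface, and the caller decides where the URL comes from.

```python
from typing import Protocol


class SiteLookup(Protocol):
    """Hypothetical interface: anything that can map a site key to a base URL."""

    def base_url(self, site_key: str) -> str: ...


class EntityLinkFormatter:
    """Formats entity links; it knows nothing about databases."""

    def __init__(self, sites: SiteLookup) -> None:
        self._sites = sites  # injected, so tests can pass a mocked lookup

    def format(self, site_key: str, entity_id: str) -> str:
        return self._sites.base_url(site_key) + entity_id


class FakeSites:
    """Mocked lookup for tests; a production version could read the sites table."""

    def __init__(self, urls: dict[str, str]) -> None:
        self._urls = urls

    def base_url(self, site_key: str) -> str:
        return self._urls[site_key]


formatter = EntityLinkFormatter(FakeSites({"wikidata": "http://www.wikidata.org/entity/"}))
link = formatter.format("wikidata", "Q42")
```

The only thing the formatter still needs is some object that can answer "what is the URL for this site", which is exactly the dependency described above.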
@daniel As long as it works for you, this is all fine by me, but in my experience with PHP this could cost a lot of memory, which could be a problem for the long item pages that already caused problems in the past.
Structurally, this would work, but it seems like a very general solution with a lot of overhead. Not sure that this pattern works well on PHP, where the cost of creating additional objects is huge. I also wonder whether it really is good to manage all those (very different!) types of "derived" information in a uniform way. The examples given belong to very different objects and are based on very different inputs (some things requiring external data sources, some not). I find it a bit unmotivated to architecturally unify things that are conceptually and technically so very different. The motivation given for choosing this solution starts from the premise that one has to find a single solution that works for all cases, including some "edge cases". Without this assumption, one would be free to solve the different problems individually, using what is best for each, instead of being forced to go for some least common denominator.
Sep 22 2015
This was a suggestion we came up with when discussing during WikiCon. People are asking for a way to edit the data they pull into infobox templates. Clearly, doing this in place will be a long-term effort that needs a complicated solution and many more design discussions. Until this is in place, people can only link to Wikidata. Unfortunately, people often feel intimidated by what they see there, because they get a very long page that takes a long time to load and contains all kinds of data that they have not seen in the infobox.
Sep 11 2015
I think the discussion now lists all main ideas on how to handle this in RDF, but most of them are not feasible because of the very general way in which Wikibase implements unit support now. Given that there is no special RDF datatype for units and given that we have neither conversion support nor any kind of way to restrict that a property must/must not have units, only one of the options is actually possible now: export as string (no range queries, but minimally more informative than just using a blank node).
If we could distinguish quantity-type properties that require a unit from those that do not allow units, there would be another option. Then we could use a compound value as the "simple" value for all properties with unit to simulate the missing datatype. On the query level, this would be fully equivalent to having a custom datatype, since one can specify the unit and the (ranged) number individually. (The P1234inCm properties, by contrast, support only the number, but no queries that refer to the unit.)
Note that this discussion is no longer just about the wdt property values (called "truthy" above). Simple values are now used on several levels in the RDF encoding.
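As a rough illustration of the difference between the two options (string export versus compound value), here is a small Python sketch. It is not RDF and not the Wikibase export code; the Quantity class and the example.org unit URIs are assumptions made up for the example.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical unit URIs, for illustration only.
CM = "http://example.org/unit/centimetre"
M = "http://example.org/unit/metre"


@dataclass(frozen=True)
class Quantity:
    """Compound value: number and unit are kept as separate parts."""

    amount: Decimal
    unit: str


def as_string(q: Quantity) -> str:
    """String export: number and unit squeezed into one literal."""
    return f"{q.amount} {q.unit}"


values = [Quantity(Decimal("9"), CM), Quantity(Decimal("10"), CM), Quantity(Decimal("50"), M)]

# Compound value: unit and (ranged) number can be constrained individually.
at_least_10_cm = [q for q in values if q.unit == CM and q.amount >= Decimal("10")]

# String export: comparison is lexicographic, so "9 ..." sorts after "10 ...".
string_order_is_wrong = as_string(values[0]) > as_string(values[1])
```

The last line is why the string export gives no usable range queries: ordering on the literal does not follow the numeric order.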
Sep 10 2015
Including more data (within reason) will not be a problem (other than a performance/bandwidth problem for your servers).
Sep 9 2015
Data on the referenced entities does not have to be included as long as one can get this data by resolving these entities' URIs. However, some basic data (ontology header, license information) should be in each single entity export.
On the mailing list, Stas brought up the question "which RDF" should be delivered by the linked data URIs by default. Our dumps contain data in multiple encodings (simple and complex), and the PHP code can create several variants of RDF based on parameters now.
Sep 8 2015
As another useful feature, this will also allow us to have our SPARQL endpoint monitored at http://sparqles.ai.wu.ac.at/ Basic registration should not be too much work; please look into it (I don't want to create an account for Wikimedia ;-).
Aug 24 2015
It seems that the Web API for wbeditentities is also returning empty lists when creating new items (at least on test.wikidata.org). Is this the same bug or a different component?
Aug 5 2015
If not dropped, then it should be fixed. The value of "1" (a string literal) is not correct. Units should be represented by URIs, not by literals.
Jun 23 2015
While I did say that pretty much all URIs I know use http, I do not have any reason to believe that https would cause problems. It is not so extensively tested maybe, but in most contexts it should work fine.
Jun 17 2015
Jun 12 2015
We once planned a popup box with links to the various formats. It would be shown when you click on the Q-id in the title.
Jun 10 2015
I think this is a useful change if you want Wikibase sites to be able to refer to other Wikibase sites. In WDTK, all of our EntityId objects are "external", of course. A lesson learned for us was that it is not enough to know the base URI in all cases. You sometimes need URLs for API, file path, and page path in addition to the plain URI. MediaWiki already has a solution for this in the form of the sites table. I would suggest using this, storing pairs <sitekey,localEntityId>, and having the URI prefix stored in the sites table. It's cleaner than storing the actual URI string (which might change if an external site is reconfigured!) in the actual values on the page.
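A minimal Python sketch of the <sitekey,localEntityId> idea, under the assumption that a sites table can be modelled as a simple mapping. SiteInfo, ExternalEntityId and the field names are hypothetical; they are not the MediaWiki sites schema or the WDTK classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteInfo:
    """One hypothetical row of a sites table."""

    uri_prefix: str
    api_url: str
    page_path: str


@dataclass(frozen=True)
class ExternalEntityId:
    """What gets stored in a value: the pair, never the full URI."""

    site_key: str
    local_id: str

    def uri(self, sites: dict[str, SiteInfo]) -> str:
        # The URI is derived at read time from the current site configuration.
        return sites[self.site_key].uri_prefix + self.local_id


sites = {
    "wikidata": SiteInfo(
        uri_prefix="http://www.wikidata.org/entity/",
        api_url="https://www.wikidata.org/w/api.php",
        page_path="https://www.wikidata.org/wiki/$1",
    )
}

stored = ExternalEntityId("wikidata", "Q42")
uri_before = stored.uri(sites)

# If the external site is reconfigured, only the table changes, not the stored value.
sites["wikidata"] = SiteInfo("https://www.wikidata.org/entity/", sites["wikidata"].api_url, sites["wikidata"].page_path)
uri_after = stored.uri(sites)
```

The stored pair stays the same across the reconfiguration; only the derived URI changes.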
May 21 2015
@thiemowmde I don't know what you mean by the multiple tickets you refer to. I am not aware of other tickets related to readability. I was just saying that the requirement you are trying to address will never be addressed even halfway. It's still nice to improve readability a bit if it is possible without much pain and without any other disadvantages, but I don't think that this is the case here.
@thiemowmde One could have documentation as a text that is added as a description of the property used for precision. However, most users would more likely read a web page than look up the description stored in an OWL file. In the end, when you type in a SPARQL query, there is not much documentation directly available to you, even if it is stored in the RDF database somewhere.
A big advantage of the numbers is that you can search for values where the precision is at least a certain value (e.g., dates with precision day or above). This would be lost when using URIs.
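A small Python sketch of why the numeric encoding is handy. The codes 9, 10 and 11 are the Wikibase precision values for year, month and day; the sample data and variable names are made up for illustration.

```python
# Wikibase time precision codes (higher means more precise).
PRECISION_YEAR = 9
PRECISION_MONTH = 10
PRECISION_DAY = 11

# Illustrative (timestamp, precision) pairs, not real query results.
dates = [
    ("+1952-03-11T00:00:00Z", PRECISION_DAY),
    ("+1952-03-00T00:00:00Z", PRECISION_MONTH),
    ("+1952-00-00T00:00:00Z", PRECISION_YEAR),
]

# "Precision day or above" is a single numeric comparison.
day_or_finer = [time for time, precision in dates if precision >= PRECISION_DAY]

# "At least month precision" works the same way.
month_or_finer = [time for time, precision in dates if precision >= PRECISION_MONTH]
```

With URIs for the precision levels, the same filter would need an explicit list of every acceptable level instead of one comparison.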
May 19 2015
@Jc3s5h You are right that date conversion only makes sense in a certain range. I think the software should disallow day-precision dates in prehistoric eras (certainly everything before -10000). There are no records that could possibly justify this precision, and the question of calendar conversion becomes moot. Do you think 4713 BCE would be enough already, or do you think there could be a reason to find more complex algorithms to get calendar support that extends further into the past?
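A minimal sketch of the kind of check meant here, in Python. The function name and the default cut-off are assumptions for illustration; the cut-off is a parameter precisely because the right value is the open question.

```python
PRECISION_DAY = 11  # Wikibase precision code for "day"


def is_acceptable(year: int, precision: int, earliest_day_precision_year: int = -10000) -> bool:
    """Reject dates with day (or finer) precision that lie before the cut-off year.

    Coarser precisions (year, decade, ...) stay allowed for any year.
    """
    if precision >= PRECISION_DAY and year < earliest_day_precision_year:
        return False
    return True


prehistoric_day = is_acceptable(-20000, 11)   # day precision, far in the past
prehistoric_year = is_acceptable(-20000, 9)   # year precision, same era
modern_day = is_acceptable(1952, 11)          # ordinary modern date
```

Under this rule no calendar conversion is ever needed before the cut-off, since no value there carries a day.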
May 11 2015
Apr 3 2015
@daniel Changing the base URIs is not working as a way to communicate breaking changes to users of RDF. You can change them, but there is no way to make users notice this change, and it will just break a few more queries. It's just not how RDF works. Most of our test queries do not even mention any wikibase ontology URI, yet they are likely to be broken by changes to come. If you think that we need a way to warn users of such changes, you need to think of another way of doing this.
I agree with the proposal of @Smalyshev.
Mar 31 2015
@Smalyshev You comment on my Item 1 by referring to BlazeGraph and Virtuoso. However, my Item 1 is about reading Wikidata, not about exporting to RDF. Your concerns about BlazeGraph compatibility are addressed by my item 2. I hope this clarifies this part.
Mar 30 2015
@Smalyshev P.S. Your finding of "0000" years in our Virtuoso instance is quite peculiar given that this endpoint is based on RDF 1.0 dumps as they are currently generated in WDTK using this code: https://github.com/Wikidata/Wikidata-Toolkit/blob/a9f676bfbc2df545d386bfa72e5130fa280521a9/wdtk-rdf/src/main/java/org/wikidata/wdtk/rdf/values/TimeValueConverter.java#L112-L117
@Smalyshev We really want the same thing: move on with minimal disturbance as quickly as possible. As you rightly say, the data we generate right now is not meant for production use but for testing. We must make sure that our production environment will understand dates properly, but it's still some time before that. Here is my proposal summed up:
@mkroetzsch I already listed a few of the tools that implement XSD 1.0 style BCE years and I read your answer as to say that you know of no tools that implement XSD 1.1 style BCE years.
Feel free to post a list of the RDF tools that you found to implement RDF 1.0 rather than RDF 1.1 in terms of dates.
Re "halting the work on the query engine"/"produce code now": The WDTK RDF exports are generated based on the original specification. There is no technical issue with this and it does not block development to do just this. The reason we are in a blocker situation is that you want to move forward with an implementation that is different from the RDF model we proposed and that goes against our original specification, so that Denny and I are fundamentally disagreeing with your design. If you want to return to the original plan, please do it and move on. If not, then better wait until Lydia has a conclusion for what to do with dates, rather than implementing your point of view without consensus. For me, this is a benchmark of whether or not our current discussion setup is working.
Mar 29 2015
@mkroetzsch Do you know of some widely used software that implements XSD 1.1 handling of BCE dates?
@Smalyshev @Lydia_Pintscher Dates without years should not be allowed by the time datatype. They are impossible to order, almost impossible to query, and they do not have any meaning whatsoever in combination with a preferred calendar model. All the arguments @Denny has already given elsewhere for why we should unify dates to Proleptic Gregorian internally apply here too. My suspicion is that the existing dates of this form are simply a glitch in the UI, where users got the impression that dates without years are recognized and pressing "save" silently set the year to zero without them seeing the change in meaning. If this is an important use case, then we should develop a day-of-year datatype that supports this, or suggest that the community use dedicated properties/qualifiers to encode this. However, other datatype extensions would be much more important than this rare case (e.g., units of measurement).
Mar 27 2015
All RDF tools should be able to handle resources without labels (no matter if used as subject, predicate, or object). But data browsers or other UIs will simply show the URL (or an automatically created abbreviated version of it) to the user. So instead of "instance of" it would read something like "http://www.wikidata.org/entity/P31c". Nevertheless, we can accept this for now. AFAIK there are no widely used generic RDF data browsers anyway, and it's much more likely that people will first create Wikidata-aware interfaces.
"we don't know what year it was but it was July 4th"
Yes, the discussion on SPARQL has converged surprisingly quickly to the view that XSD 1.1 is both normative and intended in SPARQL 1.1 (by the way, I can only recommend this list if you have SPARQL questions, or the analogous list for RDF -- people are usually very quick and helpful in answering queries, esp. if you say why you need it ;-).
Note that all current data representation formats assume that "0000-01-01T00:00:00" is a valid representation:
Don't see why it would be this many. It'd be like 4 additional rows per property:
Mar 26 2015
@Smalyshev Yes, using lower-case local names for properties is a widely used convention and we should definitely follow that for our ontology. However, I would rather not change case of our P1234 property ids when they occur in property URIs, since Wikibase ids might be case sensitive in the future (Commons files will have their filename as id, and even if standard MW is first-letter case-insensitive in articles, it can be configured to be otherwise). It would also create some confusion if one would have to write "p1234" in some interfaces and "P1234" in others (maybe even both would occur in RDF since we have a P1234 entity and several related properties).
Mar 25 2015
Re "what does consistent mean": to be based on the same input data. All dumps are based on Wikidata content. If they are based on the same content, they are consistent, otherwise they are not.
Re using the same code: That's not essential here. All we want is that the dumps are the same. It's also not necessary to develop the code twice, since it is already there twice anyway. It's just the question if we want to use a slow method that keeps people waiting for the dumps for days (as they already do now with many other dumps), or a fast one that you can run anywhere (even without DB access; on a laptop if you like). The fact that we must have the code in PHP too makes it possible to go back to the slow system if it should ever be needed, so there is no lock-in. Dump file generation is also not operation-critical for Wikidata (the internal SPARQL query will likely be based on a live feed, not on dumps). What's not to like?
All of these dumps will be generated by exporting from the DB.
@Lydia_Pintscher I understand this problem, but if you put different dumps for different times all in one directory, won't this become quite big over time and hard to use? Maybe one should group dumps by how often they are created (and have date-directories only below that). For some cases, there does not seem to be any problem. For example, creating all RDF dumps from the JSON dump takes about 3-6h in total (on labs). So this is easily doable on the same day as the JSON dump generation. I am sure that we could also generate alternative JSON dumps in comparable time (maybe add an hour to the RDF if you do it in one batch). The slow part seems to be the DB export that leads to the first JSON dump -- once you have this the other formats should be relatively quick to do.
@daniel Changing URIs of the ontology vocabulary is "silently producing wrong results" as well. I understand the problems you are trying to solve. I am just saying that changing the URIs does not actually solve them.
@hoo Thanks for the heads up! I do have comments.
Mar 22 2015
Mar 21 2015
Mar 20 2015
@daniel It makes sense to use wikibase rather than wikidata, but I don't think it matters very much at all. We should just define it sooner rather than later.
Mar 19 2015
@daniel: Have you wondered why XML Schema decided against changing their URIs? It is by far the most disruptive thing that you could possibly do. Ontologies don't work like software libraries where you download a new version and build your tool against it, changing identifiers as required. Changing all URIs of an ontology (even if only on a major version increment) will break third-party applications and established usage patterns in a single step. There is no mechanism in place to do this smoothly. You never want to do this. Even changing a single URI can be very costly, and is probably not what you want if the breaking change affects only a diminishing part of your users (How many BCE dates are there in XSD files? How many of those were already assuming the ISO reading anyway?).
Mar 16 2015
Yes, this refers to the Wikidata Toolkit RDF exports. I have created an issue now: https://github.com/Wikidata/Wikidata-Toolkit/issues/128 As it turns out, the error is actually caused by the code that treats the year-0 issue (which we of course are well aware of). Should be easy to fix as part of our upcoming RDF improvements. Note that negative dates before -9999 may still confuse some RDF tools.
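For readers not familiar with the year-0 issue: XSD 1.0 has no year 0000 and reads -0001 as 1 BCE, while XSD 1.1 follows ISO 8601, where 0000 is 1 BCE and -0001 is 2 BCE. The Python sketch below shows the off-by-one shift, assuming the input is an astronomical year number. It is not the WDTK TimeValueConverter code, and the function names are made up.

```python
def xsd11_year(astronomical_year: int) -> str:
    """Lexical year for XSD 1.1 (ISO 8601 style): 0000 is 1 BCE."""
    sign = "-" if astronomical_year < 0 else ""
    return f"{sign}{abs(astronomical_year):04d}"


def xsd10_year(astronomical_year: int) -> str:
    """Lexical year for XSD 1.0: no year 0000, and -0001 is 1 BCE."""
    # Years at or before astronomical 0 are shifted by one to skip 0000.
    year = astronomical_year if astronomical_year > 0 else astronomical_year - 1
    sign = "-" if year < 0 else ""
    return f"{sign}{abs(year):04d}"


one_bce_11 = xsd11_year(0)    # 1 BCE in XSD 1.1
one_bce_10 = xsd10_year(0)    # 1 BCE in XSD 1.0
two_bce_10 = xsd10_year(-1)   # 2 BCE in XSD 1.0
far_past = xsd11_year(-10000) # more than four digits, which some tools dislike
```

Years with more than four digits are legal in both versions, but as noted above, some RDF tools may still be confused by them.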