Wed, Mar 18
Unassigning, I'm not working on this any more!
Wow, was that really almost 3 years ago? There doesn't seem to be a real need for this, so I'm closing the request as declined.
Feb 14 2020
Feb 12 2020
I think increasing the factor will not make things better; it only increases the oscillation period
Feb 11 2020
Possibly relevant comment here: I believe there is a plan also to move to incremental updates (updating only the statements/triples that have changed) so it is probably important that any parallelism in updating be coordinated so that updates for the same item (Q value) be grouped together and done in the same process, so they don't clobber one another. Updates for separate items (different Q values) can be handled in parallel as the associated RDF triples are independent (the subject of a triple is always the item, a statement on the item, or a further node derived from the item). Even without that incremental update process, grouping updates on the same item together could be a significant speed boost, as 5 updates for Q9999 can be collapsed into just the last update under the current procedure of completely rewriting the triples.
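The collapsing idea in that comment can be sketched in a few lines of Python (names and shapes invented for illustration, not the actual updater code):

```python
from collections import OrderedDict

def collapse_updates(events):
    """Keep only the most recent update per item (Q-id).

    Since the current procedure rewrites all of an item's triples,
    earlier updates to the same item are superseded by the last one;
    updates for different items stay independent and could be run
    in parallel.
    """
    latest = OrderedDict()
    for qid, payload in events:
        latest[qid] = payload      # later events overwrite earlier ones
        latest.move_to_end(qid)    # keep items in order of last update
    return list(latest.items())

# Five updates for Q9999 collapse into just the final one.
events = [("Q9999", 1), ("Q123", 7), ("Q9999", 2),
          ("Q9999", 3), ("Q9999", 4), ("Q9999", 5)]
```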
Feb 7 2020
Feb 4 2020
@Addshore and others - the problem has deteriorated since Saturday - see this discussion on Wikidata: https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team/Query_Service_and_search#WDQS_lag
Jan 19 2020
Jan 18 2020
@Bugreporter well something must have changed early today - was it previously "mean" and is now "median"? I'm not sure which is better, but having WDQS hours out of date (we're over 4 hours now) is NOT a good situation, and what this whole task was intended to avoid! @Pintoch any thoughts on this?
Jan 17 2020
Just saw this - I'm wondering how you would implement it technically? You could generate a random number between 2.5 and 5, and deny the edit if maxlag is greater than your random number?
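A minimal sketch of that suggestion, with an invented function name; each client draws its own threshold so they don't all pause and resume at the same lag value:

```python
import random

def should_defer_edit(maxlag, low=2.5, high=5.0, rng=random):
    """Randomized maxlag threshold (hypothetical helper, not an API).

    Each client draws its own cutoff in [low, high] and defers the
    edit when the reported maxlag exceeds it, so clients stop and
    resume at staggered lag values instead of all at once.
    """
    return maxlag > rng.uniform(low, high)
```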
Am I misreading this graph? https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=8&fullscreen&orgId=1&from=now-12h&to=now&refresh=10s It looks like the query service lag for 3 of the servers has been growing steadily for the past roughly 8 hours. However, edits are going through. Did something change in the maxlag logic somewhere earlier today?
Dec 11 2019
Marking as resolved...
I increased the default number of retries to 12, so it will now retry for up to an hour. I think we're good here?
(A) Pintoch's patch has been applied, and (B) I also increased the retry time from 5 seconds to 5 minutes - that still means an edit will fail after 25 minutes if maxlag doesn't drop, with only 5 retries. Is there a consensus to retry for an hour? Or if there's a better standard for handling retries let me know!
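The retry behaviour under discussion (fixed waits, a configurable retry cap; 12 retries of 5 minutes is roughly an hour) can be sketched as follows - names are illustrative, not the actual client code:

```python
import time

def edit_with_retries(do_edit, max_retries=12, wait_seconds=300,
                      sleep=time.sleep):
    """Retry an edit that is rejected while maxlag is high.

    do_edit() returns True on success, False on a maxlag rejection.
    12 retries of 5 minutes each waits for up to about an hour.
    `sleep` is injectable so the logic can be tested without waiting.
    """
    for attempt in range(max_retries + 1):
        if do_edit():
            return True
        if attempt < max_retries:
            sleep(wait_seconds)
    return False
```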
Oct 22 2019
Here's a draft of slides for our workshop. Please feel free to edit this. Also I think you wanted to cover a bit more basics of how lexemes are put together - maybe that should go first? This was mostly just gathering statistics and data and then a page of questions at the end...
Oct 21 2019
I am uploading files with data on the counts of forms and senses by date for the last year (also totals in last column). There may have been some issues with this - it comes from the Lexicographical statistics pages that are generated from WDQS queries, so there were a few periods where I think the numbers were off. Anyway, it should be close to correct for most of the time period. So we can plot this along a bit of a timeline for the last year I think?
Oct 18 2019
I just did some exploring but I don't think Quarry will help with forms and senses - at least they're not "pages" in themselves with their own namespace. Actually I couldn't figure out where they were in the database schema at all... Anyway, I think I can get some rough numbers from looking at the stats page as it has changed over time, I will work on this.
Oct 11 2019
I was thinking some graphics on the growth of lexemes, forms, and senses would be good - do we already have that somewhere?
Sep 26 2019
If you go to the search page and select "Lexeme" as the only namespace you get the same error with "thanks" in the search box, but "thank" alone works fine - the two lexemes that match are L3798 (verb) and L28468 (noun).
Sep 12 2019
The Basque collection is even more complete now!
I do think some customization may be needed for Lexemes due to the different structure - the forms and senses etc. Perhaps the most useful link for a wiktionary may be from words to senses to wikidata items via the "item for this sense" property. That in principle allows translations to be provided, grouped by sense.
Aug 1 2019
I see the problem also (Safari browser). When you talk about it affecting lexemes, where do you see that? I experimented with adding a form and that seemed fine.
Feb 18 2019
Jan 28 2019
Can you add a test to the statement ID generation code that ensures it has an RDF compatible format (except for the 1 character that's a problem now), and a note that this is required for RDF support?
promise it will always be one-to-one, no matter what happens with internal IDs
Jan 26 2019
Another thought - even better would be if the API could be adjusted so it accepts the WDQS statement ID format as it is (all hyphens).
Thanks for creating this ticket! Actually, my use case is the opposite of Lucas's - I want to be able to go from the results of a WDQS query to fetch the full statement via the API, which requires the statement ID. So I would like to see the id conversion documented in BOTH directions - and in particular the arbitrary regex replace listed above (preg_replace( '/[^\w-]/', '-', $statementID )) would NOT work for that purpose. Rather can we just settle that the first $ or - is switched, and that's it? Or is there something else that's an issue here?
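Assuming the convention argued for here - only the first separator is switched - the two-way conversion is trivial. This is a sketch of that proposal, not documented Wikibase behaviour:

```python
def api_to_wdqs(statement_id):
    """Q42$UUID -> Q42-UUID: the API form's '$' becomes a '-'."""
    return statement_id.replace("$", "-", 1)

def wdqs_to_api(statement_id):
    """Q42-UUID -> Q42$UUID: switch back only the FIRST '-', so the
    hyphens inside the UUID itself survive the round trip."""
    return statement_id.replace("-", "$", 1)
```

This only stays one-to-one if the convention is pinned down as "switch the first separator"; the blanket preg_replace quoted above maps several characters to '-' and is not invertible.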
Jan 7 2019
I didn't know about the "award token" option!
Nov 28 2018
Just a note - WDQS query gives different results hopping up and down - sometimes 3004 (for English lexeme senses) and sometimes 2872, over about the last 10 minutes.
@Smalyshev I'd forgotten there was a phabricator ticket for this - anyway, this is what I was referring to... Last night's update bumped the number down again to 2718; however when I run the query directly on WDQS I get 3004 right now. Something's not right!
Nov 27 2018
I ran a manual update and the total for English bumped up to 2819 - so it doesn't look as if we've actually lost lexeme senses, just that some of the query servers don't know about all of them?
I wouldn't be surprised if it's a WDQS problem, this is definitely generated from an RDF query.
Oct 16 2018
According to https://www.mediawiki.org/wiki/Extension:WikibaseLexeme/RDF_mapping a lexeme should be "a wikibase:Lexeme" as well as "a ontolex:LexicalEntry", but in the query service I can only find things via the latter relation. Similarly for forms and "wikibase:Form". Was something left out of the dump?
Jun 29 2018
WDQS works for me! I'm not sure where that is of course - I guess I could check Phabricator!
Jun 19 2018
Does "alphabetical" ordering even make sense for words in a collection of vastly different writing systems? If this is done I would recommend it be accompanied by some filtering - for language, part of speech, grammatical features, certain properties perhaps.
Jun 1 2018
I am generally in favor of Micru's proposal, and perhaps Pamputt's elaboration of it above: for one, using wikidata items directly allows the lemma language to be represented naturally in the user's own script/language, plus the other automatic bonuses of using items given the structured-data ethos etc. However, I'm a little confused about the details of how this would work - specifically, the most commonly used lexemes would usually have the same spelling, use, etc. across all variants of a language; do we give such a lexeme a more general language ("en" = Q1860, say) and only use the specific items mentioned ("en-US" = Q7976, "en-GB" = Q7979, "en-CA" = Q44676, etc.) where there really are variations? Or would it be possible to attach multiple language items to a single lexeme, to indicate it applies to several specific variants?
May 29 2018
Here's a specific question that might be detailed enough in description: suppose we have a collection of facts (say the names, countries, inception dates, and official websites for a collection of organizations) that has been extracted from multiple sources, including various language wikipedias, a CC-0 data source (for example https://grid.ac/) and a non-CC-0 non-wikipedia data source - these sources would be indicated in wikidata by the reference/source section on each statement. This extraction has been done by users either manually or running bots with the understanding that they are adding facts to a CC-0 database (wikidata). Reconciling the facts - for example merging duplicates with slightly different names, dates, or URLs - has been done by users manually or semi-automatically, again with the understanding they are contributing to a CC-0 database. Are there any copyright or other rights constraints that apply to this collection, or can it be fully considered to legally be CC-0?
Hmm, I'm not sure this is all that useful, at least as it stands. Most external ids can be found just as easily now via the Wikidata Resolver tool - https://tools.wmflabs.org/wikidata-todo/resolver.php - What I would find useful would be a way to locate, for example, partial street addresses - this property (P969) is often entered as a qualifier on headquarters location (P159). Searching for 'haswbstatement:P969=Main' now finds something, but only because that item oddly has just 'Main' as the value for P969, and lowercasing the string ('main') finds nothing, which is definitely not what I would expect here... I don't think treating string values as if they were identifiers is the right approach; the usefulness of a search engine is in normalizing string values so you can find them without having the exact matching string. And qualifiers should be folded in somehow!
May 28 2018
Hi - my most recent response was following MisterSynergy's comment on Denny's proposed questions, and specifically the meaning of "processes that in bulk extract facts from Wikipedia articles," - it sounds like from subsequent discussion that we are not talking solely of automated "processes", so I think I echo MisterSynergy's comment that the question needs to be better defined to "describe how these processes look like". On the one hand there's overall averages, with less than one "fact" per wikipedia article; on the other hand the distribution is probably quite wide, with some articles having dozens of "facts" extracted from them. Since CC-BY-SA applies to each article individually, does extraction of too much factual data from one article potentially violate its copyright?
May 26 2018
based on the fact that we have ~42M “imported from” references and ~64M sitelinks in Wikidata
May 25 2018
Some references on why CC0 is essential for a free public database:
"Databases may contain facts that, in and of themselves, are not protected by copyright law. However, the copyright laws of many jurisdictions cover creatively selected or arranged compilations of facts and creative database design and structure, and some jurisdictions like those in the European Union have enacted additional sui generis laws that restrict uses of databases without regard for applicable copyright law. CC0 is intended to cover all copyright and database rights, so that however data and databases are restricted (under copyright or otherwise), those rights are all surrendered"
May 23 2018
FYI I agree with VIGNERON on what it should look like - but at least something more than the id!
May 22 2018
It has been asserted here several times that OSM data has been wholesale imported into Wikidata - do we know that has happened? Wikidata has two properties related to OSM, one that relates wikidata items to OSM tags like "lighthouse", and one that is essentially deprecated (see T145284), so I assume those are not the issue. According to https://www.wikidata.org/wiki/Wikidata:OpenStreetMap (text which has been there since at least last September) "it is not possible to import coordinates from OpenStreetMap to Wikidata". If the issue is coordinates imported via wikipedia infoboxes that originated with OSM, I can see there might be an issue there, and maybe that should be added to Denny's suggested question in some fashion. But as far as actual importing of OSM data, the only specific cases that I noticed explicitly cited above are (A) a bot request that has been rejected, and (B) a discussion from 2013 where the copyright issue was explicitly raised right away.
Oct 11 2017
Jul 21 2017
Of course, now these examples I gave are working - probably because I updated them recently. However, I found more that are not now, or only partially - for example Q2256713:
Jul 19 2017
Jul 14 2017
I don't understand why Multichill can unilaterally alter the priority on this request in the face of an active wikidata RFC where the voting has been 2:1 in support of this change. It would also be nice to get some actual feedback from developers - is this really "against the core data model of Wikidata"? I don't see it - particularly as the workarounds in place now prove it can be easily supported.
Jul 13 2017
Thanks! I did search through the open tasks first and didn't find anything on this....
Jun 6 2017
The dummy user solution sounds good to me. Magnus Manske is doing something like this with his QuickStatementsBot so maybe a special purpose Bot account on wikidata for this?
Mar 23 2017
I believe a way this could be done would be to allow the attachment of regular expressions to the formatter URL, and have the external id URL conversion code understand them. That is, if there was a qualifier property that specified "regex substitution" for example, the ISNI problem (of additional spaces within the id that must be removed for the formatter URL) would be handled by a value something like "s/\s+//g" (remove all spaces). Some of the others might need a "regex match" on the id that allows specifying a $1, $2, $3 grouping pattern, and the formatter URL then looks something like http://...../$1/$2/$3 (or that could also possibly be handled by a substitution as in the ISNI case). The IMDB case is more difficult because it's essentially 4 different formatter URLs based on the first characters of the id, so it might need a "regex filter" that limits the scope of each formatter URL based on the id; wikibase would then need to look through the filter regexes to find a matching formatter URL and use that.
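A sketch of how such qualifier-driven rules might behave; the rule names, properties, and URL patterns below are hypothetical, not existing Wikibase features:

```python
import re

# Hypothetical per-property rules: an optional substitution applied to
# the raw id, plus (filter-regex, formatter-URL) pairs tried in order.
RULES = {
    "ISNI": {
        "sub": (r"\s+", ""),   # "regex substitution": strip all spaces
        "formatters": [(r".*", "https://isni.org/isni/$1")],
    },
    "IMDb": {
        "sub": None,           # "regex filter": URL depends on id prefix
        "formatters": [
            (r"tt.*", "https://www.imdb.com/title/$1/"),
            (r"nm.*", "https://www.imdb.com/name/$1/"),
        ],
    },
}

def format_external_id(prop, raw_id):
    rule = RULES[prop]
    if rule["sub"]:
        pattern, repl = rule["sub"]
        raw_id = re.sub(pattern, repl, raw_id)
    for flt, url in rule["formatters"]:
        if re.fullmatch(flt, raw_id):   # first matching filter wins
            return url.replace("$1", raw_id)
    return None
```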
Mar 22 2017
As background, I'm seeing about 2000 "hits" per day on this service right now, with about a dozen properties linking through it to their databases.
Mar 21 2017
Hmm, Ok, I read through the discussion you linked with @coren - I certainly see there can be a privacy violation regarding expectations in cases as were discussed there. I think this is a quite different case though (for example, the links are exclusively to third-party sites, not anything I or any other WMF person controls) and would like to hear directly from somebody with WMF (and some voices from wikidata) on this. If there is a clearly posted policy somewhere that would be great too. The policy linked by @coren focused on the Labs user collecting personal information, which is not at all happening here, and said nothing specifically about redirects per se.
(claiming task - if this really needs to be done I can certainly take care of it)
Hmm, I think the big issue may be point 3. Do you have an example where this might have come up? I could certainly make it an interstitial easily enough, but that makes these links a bit less convenient for people (extra click); if the links are being included with or without a warning elsewhere based on the wmflabs URL then I can see how it may be important to address this somehow. Also is there boilerplate text we should use if we really do need to put this in?
specifically, looking at The Godfather, which you mention here, there are close to 3 dozen OTHER external id links that similarly would show user IP information if followed.
@Dispenser, ok the issue is that people clicking an "external id" link are going to an external site? Is there any situation in which it is not obvious this is going to an external website? Every wikidata item with "external id" values has links directly to third party sites, without any interstitial or warning other than that it is external. I don't see the harm or potential for anybody's expectations of privacy to be violated.
@Dispenser wikidata-externalid-url is installed on tool-labs which fully preserves user privacy, I'm not sure what your concern is? Please clarify where you think any policy has been violated.
Nov 16 2016
@jeblad I'm resolving this as invalid, as the initial claim of an information leak seems to be incorrect. However, you might want to open up a separate phabricator ticket with your detailed suggestion on how to do formatter URLs better; I think it's a promising approach to allow pulling components from the "regular expression" syntax.
Ha, if I'd actually looked at the logs I would have known that. Yes all the IP addresses in the file are a 10.68 address, which is locally identified as "tools-proxy....wmflabs" so yes, no external IP addresses are visible to the service.
Or if there's some privacy agreement to sign as jeblad suggested then I'm happy to do that too. I met Lydia Pintscher in person last week so she can vouch for who I am :)
There are two basic issues which the URL redirect script tackles: IDs that need cleaning up (such as ISNI, which is supposed to be entered as an ID with space characters, but whose URL requires the spaces to be removed), and formatter URLs that require more sophisticated handling than a single $1 substitution - the IMDB case, for example, where the first two characters of the id determine the specific formatter URL to be used. It's not clear to me where the best place for either of those pieces of logic is. Wikibase could have some code for this (feel free to import what I've written), perhaps exposed as some sort of service, but anybody using the P1630 values directly wouldn't benefit from that. For now, if there's some protocol for wiping log files, or not even recording them on the tool labs server, I'd be happy to implement that too. I have no interest in these log files.
Oct 14 2016
I see you've closed - looks good by the way. Anyway, on the question of retaining WDQ - no, I don't think that's necessary; I think Magnus would like to shut it down eventually. I don't see that WDQ adds anything to this tool now that SPARQL is working reliably; it's fast and stable. So feel free to
Sep 26 2016
I'm not sure what the issue is here - you can enter a unit URL via the WbQuantity initializer (unit = 'http://www.wikidata.org/entity/Q....') and it works fine. The documentation in __init__.py seems to be out of date on this though.
Sep 23 2016
@Yurik and all, I'm glad to see all this work going on, I was pointed to this after I made a comment on a wikidata property proposal that I thought would be best addressed by somehow allowing a tabular data value rather than a single value. However, I'm wondering if this might be best driven by specific problem cases rather than trying to tackle generic "data" records. One of the most common needs is for time-series data: population of a city vs time, for instance, economic data by point in time, physical data like temperature vs time, etc. The simplest extension beyond the single value allowed by wikidata would be to allow a set of pairs defined by two wikidata properties (eg. P585 - "point in time", P1082 - "population"). The relation to wikidata takes care of localization (those properties have labels in many different languages) and defines the value types (time and quantity in this case), and the dataset would somehow be a statement attached to a wikidata item (eg. a particular city) so that the item and pair of properties fully define the meaning of the collection of pairs. The underlying structure of the pairs doesn't really matter much. But there seems to be something missing here - I think it might be best addressed in wikidata itself...
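The pair-of-properties idea can be made concrete with a plain data structure (the shape, the item, and the population figures are invented for illustration; nothing like this exists in the Wikibase data model):

```python
# A time series attached to an item, defined by two Wikidata
# properties: P585 ("point in time") keys the pairs and P1082
# ("population") types the values. Labels and datatypes for both come
# from Wikidata itself, so the structure needs no localization of its
# own.
population_series = {
    "item": "Q64",              # the city the series describes
    "key_property": "P585",
    "value_property": "P1082",
    "pairs": [
        ("2010-12-31", 3460725),
        ("2015-12-31", 3520031),
    ],
}

def latest_value(series):
    """Value for the most recent key (ISO dates sort lexicographically)."""
    return max(series["pairs"])[1]
```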
Excellent, thanks! I probably should have sent you an email...
Aug 10 2016
So I updated to https in my local copy and that definitely fixed the problem. Not sure if @Ricordisamoa is around? I don't have permission right now to do anything with ptable, but I do have an account (apsmith) on tools.wmflabs.org so if I was in the right group I could help out here maybe...
Still broken (at least 3 days now). I can't see the error messages but I tried running my own copy and ran into:
Aug 8 2016
Jul 11 2016
Ok, the WbRepresentation superclass looks like it might help simplify this. But FilePage, ItemPage and PropertyPage (and basestring) are not subclasses of that, so I think just returning the json hash would be best there. But the function could certainly run fromWikibase for the other types, that seems pretty easy, I'll look into that.
Jul 7 2016
@Multichill - could be, I'm not familiar with WbTime other than a glance at the code. Are there edge cases (eg. 10^20 years into the future?) that would break the "int/long" assumptions? But it definitely does NOT work for WbQuantity the way things currently are. Fixing WbQuantity seemed to be out of scope here, though it does need to be done. Coordinate may have similar issues as it uses floats.
The function should return an object. Possibilities seem to be commonsMedia, globe-coordinate, monolingualtext, quantity, string, time, url, external-id, wikibase-item, wikibase-property, math
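One way to organize such a function is a dispatch table keyed on the datatype string; the constructors below are stand-ins, not the actual pywikibot classes:

```python
# Stand-in constructors; in pywikibot these would be the richer wrapper
# types (WbTime, WbQuantity, ItemPage, etc.). String-like datatypes
# need no wrapper and fall back to str.
def make_quantity(value):
    return float(value)

def make_item(value):
    return {"entity": value}

PARSERS = {
    "quantity": make_quantity,
    "wikibase-item": make_item,
    "string": str, "url": str, "external-id": str, "math": str,
}

def parse_value(datatype, value):
    """Return a typed object for a parsed value, or raise for
    datatypes not handled yet."""
    try:
        return PARSERS[datatype](value)
    except KeyError:
        raise NotImplementedError(datatype)
```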
Jul 6 2016
See https://gerrit.wikimedia.org/r/#/c/297637/ for proposed implementation...
Jul 5 2016
Ok, that echoes something Tobias has said also about using strings and avoiding IEEE fp. I'm going to look at getting T112140 working first and then see if I can bring that implementation to bear on this.
I'm going to have a shot at implementing this - it looks like it will be useful for a number of other open phabricator issues for pywikibot. I was figuring a function that will take all the parameters the API offers (datatype - a string, values - a list of strings, options - a dict, validate - boolean). Any other recommendations?
Jul 4 2016
You're the one who brought up JSON! It sounds like the issue is something different though - internal representation as strings? Anyway, are you recommending pywikibot use the wbparsevalue API for all (or at least numerical) input? That could be a good idea. Looks like there was already a phabricator ticket on this - T112140
Jul 2 2016
That restriction is NOT in the JSON spec: http://tools.ietf.org/html/rfc7159.html#section-6 - also the leading plus is not required by JSON. Is there some other reason for the limitation in the wikidata code? DataValues is a wikidata-specific PHP library right? I can't think of any good reason to keep this limitation on input values.
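For what it's worth, Python's json module follows the spec here and imposes no range limit on integers; the second example assumes the Wikibase convention of serializing quantity amounts as JSON strings, where a leading sign is an application choice rather than JSON number syntax:

```python
import json

# RFC 7159 section 6 allows implementations to set numeric limits but
# requires none; Python's json round-trips integers of any size.
big = json.loads('{"amount": 123456789012345678901234567890}')
assert big["amount"] == 123456789012345678901234567890

# When the amount is a JSON *string* (as in Wikibase quantity values),
# the leading plus passes through untouched.
amount = json.loads('{"amount": "+123"}')["amount"]
assert amount == "+123"
```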
Jul 1 2016
Hmm. So is it a pywikibot problem or a wikibase API problem? Is pywikibot sending in JSON format?
As far as testing goes, I have (in my own copy) added the following to the pywikibot tests/wikibase_edit_tests.py file (within the class TestWikibaseMakeClaim):
Jun 27 2016
Please note this is still an issue with the latest pywikibot code and current wikidata release - as of June 23, 2016. The following is the fix I have in the pywikibot core pywikibot/__init__.py file: