See https://phabricator.wikimedia.org/T360295 for the request to run tests before deploying a tool.
Mar 18 2024
Mar 15 2024
Mar 14 2024
May 13 2023
@Mitar Done.
May 8 2023
I’ve tried Wikidata dumps in QuickStatements format with Zstd compression, and benchmarked it: https://github.com/brawer/wikidata-qsdump
File size shrinks to one third, and decompression is 150 times faster (on a typical modern cloud server) compared to pbzip2.
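For anyone wanting to consume such a dump, a minimal Go sketch might look like this, assuming the github.com/klauspost/compress/zstd package; the file name is hypothetical (the benchmark repo above has the actual tooling):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"

	"github.com/klauspost/compress/zstd"
)

func main() {
	f, err := os.Open("wikidata.qs.zst") // hypothetical file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	dec, err := zstd.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}
	defer dec.Close()

	// Each line is one QuickStatements command, eg. "Q72\tP31\tQ515".
	numStatements := 0
	scanner := bufio.NewScanner(dec)
	for scanner.Scan() {
		numStatements++
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println(numStatements, "statements")
}
```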
Mar 18 2023
Done.
For the past ~2 years, qrank.toolforge.org has been serving redirects to qrank.wmcloud.org. After such a long time, it should be okay to shut down the redirect service entirely. Done.
Dec 14 2022
Actually, due to how the zstd format was designed, the current parallelization approach for bzip2 will probably work exactly the same way with zstd. With the right command-line option, the zstd tool already distributes its input across all CPU cores for parallel compression (the reference zstd implementation is similar to pbzip2 in that respect). But one should also be able to split the input oneself, compress the shards in parallel on a set of machines, and then simply concatenate the compressed outputs at the end; again, that is the same approach pbzip2 takes.
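To illustrate the concatenation property, here’s a small Go sketch using the github.com/klauspost/compress/zstd package (the sample statements are just placeholders): two shards are compressed independently, and the byte-concatenation of the outputs decodes as a single stream.

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/compress/zstd"
)

// compress squeezes one shard into an independent zstd frame.
func compress(shard []byte) []byte {
	enc, err := zstd.NewWriter(nil)
	if err != nil {
		log.Fatal(err)
	}
	defer enc.Close()
	return enc.EncodeAll(shard, nil)
}

func main() {
	// In practice, the shards would be compressed on separate machines.
	shard1 := compress([]byte("Q64\tP31\tQ515\n"))
	shard2 := compress([]byte("Q90\tP31\tQ515\n"))

	// Concatenating the compressed frames yields a valid zstd stream.
	stream := append(shard1, shard2...)

	dec, err := zstd.NewReader(bytes.NewReader(stream))
	if err != nil {
		log.Fatal(err)
	}
	defer dec.Close()

	var out bytes.Buffer
	if _, err := out.ReadFrom(dec); err != nil {
		log.Fatal(err)
	}
	fmt.Print(out.String()) // both shards, decompressed as one stream
}
```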
Dec 12 2022
Sep 26 2022
Apr 25 2022
This seems like a duplicate
Somewhat, although this ticket was meant specifically for monitoring cronjob completions, which is different from (and simpler than) setting up Cortex/Thanos-like monitoring on metrics exposed by continuously running services.
Will Toolforge and Cloud VPS jobs be able to read and write into their own custom buckets? (That would be super helpful.)
Feb 2 2022
Possibly relevant: https://qrank.wmcloud.org/ which ranks Wikidata items by how often their pages get viewed on Wikipedia, Wikitravel, etc.; updated ~weekly.
Jan 25 2022
@nskaggs, it’s in Go, although I’m working on a web frontend that’ll eventually have a part written in JavaScript/React. Here’s my current “release process” that I’m hoping to make less manual. The environment variable GOOS=linux can be omitted when the compiler runs on a Linux machine.
Jan 21 2022
Does the push-to-deploy pipeline accept early adopters? I’d gladly volunteer as a guinea pig.
Jan 6 2022
Meanwhile, I’ve set up Wikidata QRank, which computes this dataset on a weekly basis and offers the results for public download. So, feel free to close this ticket. But if the data engineering team is interested in joining, improving, or taking over the project, please feel welcome; it’d be great to work together!
Dec 23 2021
Nov 17 2021
Jun 15 2021
Oh, if it’s useful to you, I’ll gladly keep it running. Do you want it to monitor other domains beyond the current four?
Jun 14 2021
May 17 2021
Friendly ping? The bug is still present.
May 14 2021
Cool, glad it’s useful! When you set up Prometheus rules, consider alerting when certmon_tls_certificate_expiration_timestamp - time() drops below about two weeks for a domain; see the Prometheus recommendations for timestamps. The SRE team would then get plenty of advance notice of expiring TLS certificates, so problems could be fixed long before they become user-visible outages. (Apologies if I’m stating the obvious here; you’ll know more about this than I do.)
May 11 2021
If it helps, feel free to adopt https://certmon.toolforge.org/, which was quickly thrown together in an attempt to help Wikimedia improve its monitoring. See the source code and the metrics endpoint for Prometheus monitoring. Feel free to fork, send pull requests, whatever. Please do tell if you end up using it; I’m quite curious. If it’s useful, my personal preference would be for you to clone the repo into a better place (perhaps a Phabricator project) and run it yourself, so the Wikimedia SRE team could change things without me getting involved.
May 7 2021
Apr 28 2021
Hm, good point. Could the dumps be made consistent? Maybe like this: Before starting a dump, find the current last revision; pass this cut-off revision ID to the dumping shards; change the dump-producing code to not consider changes after the cut-off revision. But I wouldn’t know how hard this would be. Actually, DumpEntities already seems to take a last-page-id flag, but I don’t know if/where that is getting set in production (and if that’s really enough).
Regarding dump-level metadata, it would be super useful to know what timestamp should be passed to EventStreams for catching up with user edits after the dump was produced. To find this timestamp, can clients extract the entity ID with the highest lastrevid from a Wikidata dump, and then retrieve the corresponding timestamp via Special:EntityData like this? Or would a sync-up client lose some edits if it were to do this? (For example, if dumps get produced by parallel workers, they’d probably have to agree on a cut-off revision before starting the dumping process; otherwise, the JSON file wouldn’t necessarily contain all changes before the highest lastrevid in the dump file... correct?)
To find the timestamp of the last Wikidata change that went into a dump file, couldn’t one — while processing the dump — extract the entity and revision ID with the highest lastrevid value in the entire dump, and then retrieve the corresponding modified timestamp for that single edit via Special:EntityData like in this query? The lastrevid field seems to have been added to dumps by T87283 in changeset 500806.
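A rough Go sketch of that sync-up idea; the entity ID and revision number in main are hypothetical stand-ins for the maximum lastrevid found while scanning the dump:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// fetchModified retrieves the "modified" timestamp of one specific
// revision via Special:EntityData.
func fetchModified(id string, revision int64) (string, error) {
	url := fmt.Sprintf(
		"https://www.wikidata.org/wiki/Special:EntityData/%s.json?revision=%d",
		id, revision)
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var reply struct {
		Entities map[string]struct {
			Modified string `json:"modified"`
		} `json:"entities"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&reply); err != nil {
		return "", err
	}
	return reply.Entities[id].Modified, nil
}

func main() {
	// Hypothetical values; in reality, these would be the entity and
	// revision with the highest lastrevid found in the dump.
	timestamp, err := fetchModified("Q42", 1400000000)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("dump cut-off timestamp:", timestamp)
}
```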
Apr 9 2021
Apr 6 2021
Sure, glad to try. I’ve changed the qrank-builder job config to use the new image. It seems to work fine.
Mar 25 2021
Mar 22 2021
Thanks for the pointer! Indeed, I was hoping the Wikimedia Cloud had something like Cortex or Thanos running on behalf of custom tools. Hm, considering how long these discussions have already been going on, it doesn’t look like this will be coming anytime soon. So I’m closing this ticket as stalled; things won’t go any faster with more tickets around.
Mar 19 2021
As @bd808 suspected, security is indeed the main reason why I’d like to run my dinky webservice in a constrained environment. As an external volunteer developer, I always fear that my contributions may cause more harm than good. Especially when contributing some minor tool that doesn’t see much attention, I sleep better when there isn’t much else bundled into the container for my webservice. Of course, the risks can be mitigated with container scanning, actively checking CVEs, etc. — but as an external volunteer, I don’t really want to impose such a maintenance burden on others. Keeping containers lean isn’t a universal solution to all problems in production security — still, with less baggage, fewer things can go wrong. Basically, it’s an attempt at taming the beast of system complexity.
Mar 18 2021
With the Go programming language, binaries typically get statically linked. So, compiled programs will typically run without any runtime dependencies whatsoever — they wouldn’t access package files, call shared libraries, or use any other files. When compiling for Linux, the compiler builds an ELF binary that directly invokes the operating system kernel through Linux system calls, not even using libc or anything else in a Linux distribution. Rust may be similar in that respect (not sure); static linking can also be done with C and C++, although it’s a bit less common there.
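For illustration, a minimal example; the build commands in the comment assume a Linux target:

```go
// hello.go: compiling with cgo disabled yields a fully static ELF binary.
//
//	CGO_ENABLED=0 GOOS=linux go build -o hello hello.go
//	ldd hello   # prints "not a dynamic executable"
package main

import "fmt"

func main() {
	fmt.Println("hello, statically linked world")
}
```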
Also, you mention building the Go program on Toolforge. How do you build it? I guess you build it on the Toolforge bastion?
Thank you! Yes, this is a build pipeline for data; it isn’t compiling code. For background, see the technical design document. (Feedback very welcome!)
Mar 16 2021
Perhaps the QRank signal might be helpful here? The signal is computed in the Wikimedia cloud infrastructure (Toolforge) and gets periodically refreshed. It’s just aggregated pageviews, but I found it pretty useful in my own projects, which is why I contributed it to Toolforge.
- Site: https://qrank.toolforge.org/
- Data download: https://qrank.toolforge.org/download/qrank.gz
- Source code: https://github.com/brawer/wikidata-qrank
- Technical design: https://github.com/brawer/wikidata-qrank/blob/main/doc/design.md
Mar 15 2021
Mar 2 2021
Yes, it works now. Thank you!
Feb 24 2021
Note that Kubernetes can also directly mount volumes from Ceph RBD, so this wouldn’t necessarily have to be done via Cinder. If Kubernetes mounted Ceph RBD directly, there would be one less layer to maintain. But I don’t know how well this would fit into Wikimedia’s production setup in terms of quota enforcement, key management, monitoring, etc. Here are some pointers, in case you want to explore this. The example setup actually looks quite simple.
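For concreteness, this is roughly what a directly mounted RBD volume looks like when expressed with the k8s.io/api/core/v1 Go types; the monitor address, pool, image, user, and secret names below are all made up:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	vol := corev1.Volume{
		Name: "tool-data", // hypothetical volume name
		VolumeSource: corev1.VolumeSource{
			RBD: &corev1.RBDVolumeSource{
				CephMonitors: []string{"ceph-mon.example.org:6789"},
				RBDPool:      "tools",
				RBDImage:     "tool-data",
				FSType:       "ext4",
				RadosUser:    "tools-user",
				SecretRef: &corev1.LocalObjectReference{
					Name: "ceph-secret",
				},
			},
		},
	}
	fmt.Printf("%+v\n", vol)
}
```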
Feb 23 2021
Feb 22 2021
Feb 17 2021
Nov 12 2019
Do you need a beta tester? I have a public-domain list of 600 Sursilvan verbs including inflected forms. (In Sursilvan, verb inflection is quite complicated and fills entire textbooks; sort of like Latin, but with more exceptions.) I’d like to import this knowledge into Wikidata. (Actually, if QuickStatements2 were able to create lexemes and refer to the newly created lexeme from within the same batch, that would probably be enough.) Example:
May 6 2019
Friendly ping?
May 3 2019
The codes are valid (and registered) IETF BCP47 language codes.
Hm... in the sidebar on Wikimedia Commons (see screenshot), would it perhaps make sense to replace the link to Special:WhatLinksHere with a link to Special:GlobalUsage? Currently, there seems to be a usability/UX issue: the feature is already implemented (thanks for the kind explanation on this bug, I had no idea!), but people might never come across Special:GlobalUsage unless they already know that it exists. Hence the suggestion to remove “What links here” from the sidebar and replace it with “Global usage”, which seems to be a superset. (There’s a risk of cluttering the user experience when the sidebar has too many links.)
@Lea_Lacroix_WMDE Thank you! Filed T222426.
Is there anything specific I should do so that people can enter usage examples for Sursilvan lexemes, and likewise for lexemes in the various other Romansh variants? I’ll gladly file more tickets if it helps; just tell me what to do.
Filed T222423 for another (very minor) issue that seems related to language variants.
Hm, adding usage examples (and probably similar properties) doesn’t seem to work yet. Try adding the sentence “Ils tgauns vivan dalla naschientscha naven ensemen cullas nuorsas.” as usage example (P5831) in language “rm-sursilv” for tgaun (L45642); see screenshot.
May 2 2019
Ah, got it. Thank you!
Is something else needed to activate lexemes in variants of Romansh? See screenshot:
Apr 8 2019
Just to clarify, the codes in this ticket (rm-rumgr etc.) are not made up; they have been standardized by IETF and appear in the IANA language subtag registry.
Apr 2 2019
Apr 1 2019
If nobody else has time to do this, may I volunteer to write the code? Please tell me where to start (which programming language, what framework, etc.)
Mar 25 2019
Curious, is it possible to estimate by what date this might get implemented? Is there anything I can do to help?
Mar 20 2019
Mar 14 2019
Oh, all you need from CLDR is an English label? Nothing else? In that case, this Wikidata query might be helpful:
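The query itself didn’t survive in this thread; a hypothetical reconstruction, wrapped in a small Go client for the Wikidata Query Service, might look like this. It fetches the English label for every language that carries an IETF language tag (P305):

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

// English labels for every language that has an IETF language tag (P305).
const sparql = `
SELECT ?code ?langLabel WHERE {
  ?lang wdt:P305 ?code .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}`

func main() {
	u := "https://query.wikidata.org/sparql?format=json&query=" +
		url.QueryEscape(sparql)
	resp, err := http.Get(u)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body)) // JSON bindings: code and English label
}
```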
Sure, but it will take a while until the next official release of CLDR, so you’d have to read the CLDR data from the development branch (“trunk”). I do wonder, though, if you could read the IANA registry in addition to CLDR and use IANA as a fallback for the English names when CLDR has no data yet. Then you would immediately get an English name for every language with an ISO 639 or IETF BCP 47 code, so you’d add support for a couple thousand languages at once.
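A rough sketch of that IANA fallback in Go; the registry is a plain-text file whose records are separated by “%%” lines (multi-line descriptions and subtag ranges are ignored here for brevity):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("https://www.iana.org/assignments/" +
		"language-subtag-registry/language-subtag-registry")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	names := make(map[string]string) // subtag → first English description
	var subtag, desc string
	flush := func() {
		if subtag != "" && desc != "" {
			names[subtag] = desc
		}
		subtag, desc = "", ""
	}
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case line == "%%": // record separator
			flush()
		case strings.HasPrefix(line, "Subtag: "):
			subtag = strings.TrimPrefix(line, "Subtag: ")
		case strings.HasPrefix(line, "Description: ") && desc == "":
			desc = strings.TrimPrefix(line, "Description: ")
		}
	}
	flush() // don’t forget the last record
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println(names["rm"]) // "Romansh"
}
```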
The easiest way to add a new language to CLDR is preparing ‘seed’ files in XML format;
Feb 28 2019
Feb 8 2019
Friendly ping, is there anything I can do to help with this ticket?
Feb 6 2019
Feb 2 2019
Jan 11 2019
@GerardM, is there anything I can do to help with this ticket? There’s a sizable Romansh dictionary whose data can be donated to Wikidata, but this is currently blocked on this ticket. (Try an exact search for a few German words, e.g. “Hund” or “Gelbsucht”, to see how the words differ across the various Romansh variants.)
For languages that have no language code yet, perhaps Lingua Libre could use “mis-x-Q12345” (where Q12345 would be the Wikidata item for the language of the pronunciation audio). That would be a syntactically valid IETF BCP 47 tag, and you wouldn’t lump unrelated languages into the same category. Once the language does get a code, some bot could change the categories of uploaded files on Wikimedia Commons. @GerardM, what do you think?
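As a quick sanity check that such a tag is well-formed, here’s a tiny Go sketch using golang.org/x/text/language; Q12345 is of course a placeholder:

```go
package main

import (
	"fmt"

	"golang.org/x/text/language"
)

func main() {
	// "mis" is the registered ISO 639 code for “uncoded languages”;
	// Q12345 stands in for the Wikidata item ID of the actual language.
	tag, err := language.Parse("mis-x-q12345")
	if err != nil {
		// A language.ValueError would mean the library doesn’t know a
		// subtag even though the tag is syntactically well-formed.
		fmt.Println("note:", err)
	}
	fmt.Println(tag) // mis-x-q12345
}
```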
Sorry, here’s the correct link to the Unicode FAQ about Zawgyi: https://www.unicode.org/faq/myanmar.html