User Details
- User Since
- Oct 7 2014, 6:35 PM (492 w, 6 d)
- Availability
- Available
- IRC Nick
- dr0ptp4kt
- LDAP User
- Unknown
- MediaWiki User
- ABaso (WMF) [ Global Accounts ]
Fri, Mar 8
Thanks @bking ! It looks like the NVMe in this one is not a higher-speed model for writes, and I'm also wondering if perhaps its write performance has degraded with age. I'll paste in the results here, but this server was slower than the others (not surprising, given the slower NVMe and slightly slower processor). This slower write speed is atypical of the other NVMes I've encountered; I believe the newer models are rated for 6000 MB/s writes. But I'm going to ping on the task to see if we can get a comparative read of disk throughput from one of the newer and faster cp#### NVMes.
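For reference, here is the sort of sequential-write check I have in mind (a sketch only; the test file paths and sizes are placeholders):

```
# Sketch of a sequential write throughput test with fio; path and size are placeholders.
fio --name=seqwrite --filename=/srv/fio-testfile --rw=write --bs=1M --size=16G \
    --ioengine=libaio --direct=1 --numjobs=1 --runtime=60 --time_based --group_reporting
# A cruder check with dd, also bypassing the page cache for writes:
dd if=/dev/zero of=/srv/dd-testfile bs=1M count=16384 oflag=direct status=progress
```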
Thu, Mar 7
First, adding some commands that were used for Blazegraph imports on Ubuntu 22.04. I had originally tried a good number of EC2 instance types, and then went back to focus on just four of them with a sequence of repeatable commands (this wasn't scripted, as I didn't want to spend time automating and also wanted to make sure I got the systems' feedback along the way). I forgot to grab RAM clock speed as a routine step when running these commands (I recall maybe checking one server in the original round, and I did check my Alienware), but generally these servers are DDR4 unless the AWS documentation says DDR5 (for my 2018 Alienware and 2019 MacBook Pro they're DDR4, BTW).
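A sketch of the kind of hardware-info commands in question (illustrative only; dmidecode needs root and reports the configured memory speed, the piece that's easy to forget to record):

```
lscpu                                    # CPU model, core count, clock speeds
free -h                                  # installed RAM
sudo dmidecode -t memory | grep -i -E 'speed|type: ddr'   # DDR generation and clock
lsblk -d -o NAME,MODEL,SIZE,ROTA         # disks, including NVMe model
```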
Wed, Mar 6
Tue, Mar 5
Originally, the thought was to be able to simply count the relative volume of these types of inbound taps/clicks. Although we want fidelity on whether a link actually resolves to a page (and I know there are Phabricator comments about this here and elsewhere), often a simple count is sufficient to know whether there's any traction whatsoever. I see that it's considered desirable to have a definite mapping of bona fide pageviews or previews (or other things of that nature) to these wprov values - that makes sense.
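As a sketch of the kind of simple counting I mean (illustrative only - the table and field names follow wmf.webrequest, but the regexp, partition filters, and the spark3-sql invocation here are assumptions, not a query taken from this task):

```
spark3-sql -e "
  SELECT regexp_extract(uri_query, 'wprov=([A-Za-z0-9_-]+)', 1) AS wprov,
         COUNT(*)                                               AS requests
  FROM wmf.webrequest
  WHERE year = 2024 AND month = 3 AND day = 5
    AND webrequest_source = 'text'
    AND uri_query LIKE '%wprov=%'
  GROUP BY regexp_extract(uri_query, 'wprov=([A-Za-z0-9_-]+)', 1)
  ORDER BY requests DESC;
"
```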
@VRiley-WMF any pointers on how to connect via iDRAC / iLO to this node and set it up with a hostname of wdqs1025.eqiad.wmnet? I'm wondering if maybe there's a direct IP or IPs, given that there don't seem to be DNS records for cp1086.eqiad.wmnet or cp1086.mgmt.eqiad.wmnet?
Mon, Mar 4
Fri, Mar 1
Thanks @VRiley-WMF ! @bking is up next for imaging, I think.
Thu, Feb 29
Hi team - @lbowmaker asked if I could take a look at this and provide some context. I've been thinking it over, and I'd like to ponder it for up to a few more days before providing some thoughts.
Wed, Feb 28
Tue, Feb 27
After setup, I would be interested in using it for 6 weeks (hopefully things would only take 4 weeks, but there's some PTO, and real-life stuff always comes up). Would that be okay?
Feb 9 2024
Feb 8 2024
Feb 5 2024
I summarized at https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Graph_split_IGUANA_performance . When we have a mailing list post during the next week or so, we'll want to move this to be a subpage of the target page of the post.
Feb 2 2024
Feb 1 2024
Here's the output from the latest run based upon a larger set of queries from a random sample of WDQS queries.
Jan 31 2024
A run is in progress for 78K+ queries from a set of 100,000 random queries. It should be done in under 10 hours from now.
Jan 30 2024
Below are "per-query" summary stats. I just put this together by bringing the CSV data into Google Sheets for now - all of the columns are calculated over the "per-query" rows (but you'll see how the Mean basically corresponds with the value calculated up above). The underlying CSV data don't contain the actual queries (the .nt files from which they're generated do), but rather rows of this form:
Jan 27 2024
Here are the data produced by IGUANA once piped through the CSV utility introduced in https://gitlab.wikimedia.org/repos/search-platform/IGUANA/-/merge_requests/3/diffs with a command of the following form (for the attentive reader: note that I had to rename the originally produced files to have an .nt extension so that the underlying Jena libraries wouldn't throw an exception).
Now a screenshot from the re-run of the randomized order queries, followed by a screenshot showing the two runs on the randomized order queries side by side.
Jan 26 2024
Now, the screenshot from the randomized order queries. I'll run one more time to see that comparable output is achieved. Those were produced with the following. This latest output file has been moved to result.nt.003.
Now, a screenshot showing the re-run. And then a screenshot showing them side-by-side. This is just for the visual, and the data produced from IGUANA (what is in the .nt output that we can convert to a handy CSV) should be more telling.
Dropping in a screenshot from Grafana from this first pass; I've also made a copy of result.nt as result.nt.001. Re-running to see that server behavior is similar.
Jan 25 2024
For the first pass, an hour-long test is being conducted from stat1006, using config file wdqs-split-test.yml as follows.
It's back up and running. The following query is producing results. Note that the is_goog_isp field is mainly for helping to resolve whether traffic was likely to have come via a Google proxy server, but all the usual caveats apply: ISP mappings can change, isp_data['isp'] can contain strings that merely include "Google" without being an exact match, and so on. The IP list at https://www.gstatic.com/chrome/prefetchproxy/prefetch_proxy_geofeed , like what you see in @fkaelin's analysis at T346463#9393571, adds more precision.
Jan 18 2024
It's entering the analytics system based on the following query:
Documentation updated: https://wikitech.wikimedia.org/w/index.php?title=X-Analytics&diff=2140528&oldid=2028273
It's live and looking good in kafkacat. Now we wait a little for stuff to show up in the analytics tables. Thanks @Vgutierrez and @BTullis for the additional reviews and thanks @Vgutierrez for the deployment.
Jan 17 2024
@fkaelin Sec-Purpose: prefetch;prerender is mentioned for the omnibox use case at https://developer.chrome.com/docs/web-platform/prerender-pages , so I've added that, as well as Chrome's apparent link preview functionality (Sec-Purpose: prefetch;prerender;preview).
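If it helps for testing, something along these lines - purely illustrative, with an arbitrary target page and a simplified User-Agent - reproduces a request carrying that Sec-Purpose header so one can eyeball how the edge responds:

```
curl -sv -o /dev/null \
  -H 'Sec-Purpose: prefetch;prerender' \
  -A 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36' \
  'https://en.wikipedia.org/wiki/Special:Random'
```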
Just to put something concrete (not saying this is the thing), here's an interesting unit test on the prefetch predictor mechanism:
Thanks @fkaelin . Yes, those prefetches happened without clicking on them. It seems to occur both for searches originating from the location bar and for searches entered into the <input> search field on the Google search webpage.
Jan 16 2024
By the way, here are the corresponding wmf_raw.webrequest fields for this latest SERP. Notice how two prefetch requests were made from the same SERP, but the exit IP differs slightly.
Jan 12 2024
I managed to make a connection via the Chrome private prefetch proxy using a Fire with a sideloaded Chrome 120 APK. In this case the User-Agent is perceived as a desktop one by Google, but processed as mobile in the Wikipedia infrastructure, so Chrome saw it as a 302 (delivered from the Wikipedia edge via the Google proxy) while in the Google SERP.
In the Android emulator, it's possible to make Chrome initiate this sort of request. Unfortunately, it seems that in the emulator there may be some lower-level networking issue, at least from a Mac (Intel-based in this case): it shows a network-related error, and a click on the article title from a Google SERP issues a plain GET not loaded from cache, based on a quick look with DevTools and kafkacat. I rather strongly expect that the network-related error would not be present (i.e., the fetch would succeed and be cached) on a real physical device when one is able to trigger Chrome private proxy prefetch.
@Vgutierrez @BTullis https://gerrit.wikimedia.org/r/c/operations/puppet/+/981352 is ready for review. Would it be possible to review and arrange for a deployment next week?
Jan 10 2024
I'm scheduling time with @Mayakp.wiki and @MGerlach to discuss potential future use cases soon, but if folks familiar with VCL could give the latest version of the patch a look, it'd be appreciated. I updated the Description a bit to note some additional considerations - it dawned on me that we ought to capture some different browsers so that we can hopefully cover the broad majority of browser prefetch, irrespective of whether an intervening proxy architecture is used. This doesn't solve the difficulty of properly classifying whether a pageview actually materialized in the user's browser (e.g., from the cache of a prefetched resource), but it hopefully adds more coverage in case we are curious about different browser vintages.
Jan 8 2024
I'll amend the patch.
Jan 4 2024
Imports seemed to work.
Dec 8 2023
Dec 6 2023
I like where @elukey is going with this.
After an update to the script (PS6) and a fresh run of the same commands, new files have been hdfs-rsync'd to stat1006:~dr0ptp4kt/gzips in anticipation of a file transfer over to the WDQS graph split test servers.
Dec 5 2023
I ran the current version of the code as follows:
Dec 4 2023
Not using right now, but here's roughly how one might go about generating more expanded Turtle statements without reverse-mapping prefixes: F41561068
Nov 29 2023
Adding a note so I don't forget: advice from @BTullis is to avoid NFS if possible, and advice from @JAllemandou is to consider use of hdfs-rsync (after our call I sought this out and found these: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/python/refinery/hdfs.py and https://gerrit.wikimedia.org/g/analytics/hdfs-tools/deploy/+/2445aec92f6b3d409531fb74ab3f9a22d9716823/bin/hdfs-rsync and https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/bin/hdfs-rsync EDIT and https://github.com/wikimedia/hdfs-tools/blob/master/src/main/scala/org/wikimedia/analytics/hdfstools/HdfsRsyncCLI.scala - the latter being available from stat boxes, from a quick glance). Chances are we'd need to add a ferm rule and possibly wire up some Kerberos stuff on the WDQS servers if going the hdfs-rsync route.
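The general shape of an hdfs-rsync copy would be something like the following (the paths are hypothetical, and the exact flags should be taken from the tool's own help output rather than from this sketch):

```
# Hypothetical source/destination; consult hdfs-rsync's help for real options.
hdfs-rsync hdfs:///user/dr0ptp4kt/wikibase_rdf_with_split/ file:///home/dr0ptp4kt/gzips/
```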
Nov 20 2023
The job completed. The counts match up on this productionized job compared with the prior one run in my namespace. Following are some Hive queries in case needed later. Below that is a really small sample of the resultant data in tabular format for each partition.
Nov 16 2023
Spark patch merged, new Jenkins build of the rdf JAR done, Airflow patch merged. This is deployed to Search's Airflow instance and the job is running. Thank you, @dcausse and @EBernhardson.
Here's what I saw after re-running. So, we should be good with the latest patchset that goes without distinct() on the final graphs.
Oct 27 2023
Update: it seems to be working. Thus, I'd say this is maybe 75% complete.
Oct 25 2023
I also see https://grafana.wikimedia.org/d/000000264/wikidata-dump-downloads?orgId=1&refresh=5m&from=now-2y&to=now which I noticed from some tickets involving @Addshore (cf. T280678: Crunch and delete many old dumps logs and friends) and a pointer from a colleague.
^ Update.
UPDATE from the previous comment: reducing to GETs, it's closer to 100 (a bunch of the requests were HEAD requests). Also, it seems there may be some range requests going on in there, so it's messier than it appeared at first glance.
Oct 20 2023
It took about 26min 24s to write S_direct_triples (7_293_925_470 rows) in basic Parquet. That's not all the rows (not even for its own partition, as that will also include Value and Reference triples), but it means it ought to be possible for the job to write the total of ~15B rows in about an hour of wall time (maybe double that to play it safe).
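Rough arithmetic behind that extrapolation (illustrative only):

```
# ~7.29B rows took ~26.4 minutes, so ~15B rows should take roughly:
echo "15000000000 / 7293925470 * 26.4" | bc -l   # ~54 minutes
```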
TL;DR this is about 45% done.
Oct 16 2023
Good question - I meant the contrast with respect to the .ttl.gz dumps and everything that goes into munging and importing (in aggregate across all downloaders of those files), versus the same if this were done with the .jnl, where they don't have to munge and import. Napkin-mathing it, the thought was that the energy savings accrue about as soon as the 16 cores x 12 hours of compression time on the .jnl has been "saved" by people in aggregate not needing to run the import process (and I'm just waving away the client-side decompression, which in a way technically happens twice for the .ttl.gz user but only once for the .jnl.zst user, and any other disk or network transfer pieces, as those are all close enough, I suppose).
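To make that napkin math concrete with a made-up per-user cost (the 10 core-hours per munge+import below is purely an assumption for illustration, not a measurement):

```
# Compression cost: 16 cores x 12 hours = 192 core-hours.
# Assumed (not measured) cost of one munge+import skipped: 10 core-hours.
echo "16 * 12 / 10" | bc -l   # ~19 downloaders to break even
```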
Oct 13 2023
Thank you @thcipriani and @brennen for the guidance and support throughout!
Personalized dev environment on the analytics cluster with Airflow set up (stat1006): I was able to execute the job, slightly hacked up to target specific dates and not keep running regularly (it eats lots of disk), to produce dr0ptp4kt.wikibase_rdf_with_split using my Kerberos principal. I'm also verifying the Jupyter notebook approach from David / Andy on stat1005 - some glitches, as to be expected, but it worked okay after doubling timeouts and removing some caps. Next up: working on a job that will do the splitting in a fashion similar to what's achieved with the join-antijoin approach of the notebooks. I'll want to have the produced data separated out from the existing table, I think; in this case it would be okay, in my opinion, to use some extra disk.
Thanks @herron !
I think the amount of time taken to decompress the JNL file on varying hardware should also be taken into consideration if compression is being considered.
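A quick way to get that number on a given box (file names illustrative; --long is only needed if the archive was produced with a matching long window):

```
time zstd -d -T0 --long=31 wikidata.jnl.zst -o wikidata.jnl
```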
@bking just wanted to express my gratitude for the support on this ticket and its friends T344905: Publish WDQS JNL files to dumps.wikimedia.org and T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`. FWIW I do think it would be good to automate this. As a matter of getting to a functional WDQS local environment replete with BlazeGraph data, it would accelerate things a lot. I think my only reservations are that:
Oct 11 2023
Oct 5 2023
Addressing @Addshore's comment in T344905#9210122...
Drawing from your inspiration, I downloaded with wget overnight and the sha1sum now matches that from wdqs1016. Deflating now, will update with results.
Oct 3 2023
I'm going to close this for now, given that the later dump munged okay and there seems to have been an underlying issue somewhere, probably related to file transfer. The `-- --skolemize` flag will be a thing to consider for any future run, nonetheless.
I did manage to run a sha1sum on the older dump where the import had failed.
Here's the sha1sum for the latest file I had downloaded:
For me, the first 300 GB of the file went really, really fast, but axel was dropping connections, similar to when I had downloaded the large 1 TB file, so this download took about 5 hours. I'm pretty sure it could be done in 1-3 hours, though, if everything were working well.
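For future reference, the kind of invocation I mean (the URL is a placeholder, not the actual dump location):

```
# Multi-connection download; fall back to a resumable wget if connections keep dropping.
axel -n 4 -o latest-all.jnl.zst 'https://dumps.wikimedia.org/path/to/dump'
wget -c -O latest-all.jnl.zst 'https://dumps.wikimedia.org/path/to/dump'
sha1sum latest-all.jnl.zst
```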