dr0ptp4kt (Adam Baso)
Principal Software Engineer, Wikimedia Foundation

User Details

User Since
Oct 7 2014, 6:35 PM (492 w, 6 d)
Availability
Available
IRC Nick
dr0ptp4kt
LDAP User
Unknown
MediaWiki User
ABaso (WMF) [ Global Accounts ]

Recent Activity

Fri, Mar 8

dr0ptp4kt updated subscribers of T359062: Assess Wikidata dump import hardware.

@ssingh would you mind if the following command is run on one of the newer cp#### hosts with a new higher write throughput NVMe? If so, got a recommended node? I don't have access, but I think @bking may.

Fri, Mar 8, 3:42 PM · Wikidata, Discovery-Search (Current work)
dr0ptp4kt added a comment to T359062: Assess Wikidata dump import hardware.

Thanks @bking ! It looks like the NVMe in this one is not a higher speed one for writes, and I'm also wondering if perhaps its write performance has degraded with age. I'll paste in the results here, but this was slower than the other servers, ironically (although not surprisingly because of the slower NVMe and slightly slower processor). This slower write speed is atypical of the other NVMes I've encountered. I believe the newer model ones are rated for 6000 MB/s for writes. But, I'm going to ping on task to see if we can get a comparative read of disk throughput from one of the newer and faster cp#### NVMes.

Fri, Mar 8, 3:36 PM · Wikidata, Discovery-Search (Current work)

Thu, Mar 7

dr0ptp4kt updated the task description for T359062: Assess Wikidata dump import hardware.
Thu, Mar 7, 12:22 PM · Wikidata, Discovery-Search (Current work)
dr0ptp4kt updated the task description for T359062: Assess Wikidata dump import hardware.
Thu, Mar 7, 12:20 PM · Wikidata, Discovery-Search (Current work)
dr0ptp4kt added a comment to T359062: Assess Wikidata dump import hardware.

First, adding some commands that were used for Blazegraph imports on Ubuntu 22.04. I had originally tried a good number of EC2 instance types, and then after that went back to focus on just four of them with a sequence of repeatable commands (this wasn't scripted, as I didn't want to spend time automating and also wanted to make sure I got the systems' feedback along the way). I forgot to grab RAM clock speed as a routine step when running these commands (I recall checking on one server maybe in the original checks, and did look at my Alienware), but generally servers are DDR4 unless the documentation in AWS says DDR5 (for my 2018 Alienware and 2019 MacBook Pro they're DDR4, BTW).

Thu, Mar 7, 12:09 PM · Wikidata, Discovery-Search (Current work)

Wed, Mar 6

dr0ptp4kt updated the task description for T359062: Assess Wikidata dump import hardware.
Wed, Mar 6, 9:38 PM · Wikidata, Discovery-Search (Current work)

Tue, Mar 5

dr0ptp4kt added a comment to T252227: Mobile redirects drop provenance parameters.

Originally, the thought was to be able to simply count relative volume of these types of inbound taps/clicks. Although we want fidelity on whether a link actually resolves to a page (and I know there are Phabricator comments about this here and elsewhere), often a simple count is sufficient to know if there's any traction whatsoever. I see that it's considered desirable to have a definite mapping of bona fide pageviews or previews (or other things of that nature) to these wprov values - makes sense.

Tue, Mar 5, 1:31 PM · Data-Engineering, Data Pipelines, Traffic-Icebox, SRE
dr0ptp4kt added a comment to T358727: Reclaim recently-decommed CP host for WDQS (see T352253).

@VRiley-WMF any pointers on how to iDRAC / iLO to this node and establish with a hostname of wdqs1025.eqiad.wmnet? I'm wondering if maybe there's a direct IP or IPs given that there don't seem to be DNS records for cp1086.eqiad.wmnet or cp1086.mgmt.eqiad.wmnet?

Tue, Mar 5, 12:51 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.03.04 - 2024.03.24), wmde-wikidata-tech, Wikidata, SRE, ops-eqiad

Mon, Mar 4

dr0ptp4kt updated the task description for T359062: Assess Wikidata dump import hardware.
Mon, Mar 4, 4:32 PM · Wikidata, Discovery-Search (Current work)
dr0ptp4kt moved T359062: Assess Wikidata dump import hardware from Incoming to Current work on the Wikidata-Query-Service board.
Mon, Mar 4, 4:30 PM · Wikidata, Discovery-Search (Current work)
dr0ptp4kt updated the task description for T359062: Assess Wikidata dump import hardware.
Mon, Mar 4, 4:29 PM · Wikidata, Discovery-Search (Current work)
dr0ptp4kt updated the task description for T359062: Assess Wikidata dump import hardware.
Mon, Mar 4, 4:17 PM · Wikidata, Discovery-Search (Current work)
dr0ptp4kt moved T359062: Assess Wikidata dump import hardware from Incoming to In Progress on the Discovery-Search (Current work) board.
Mon, Mar 4, 3:28 PM · Wikidata, Discovery-Search (Current work)
dr0ptp4kt changed the status of T359062: Assess Wikidata dump import hardware from Open to In Progress.
Mon, Mar 4, 3:28 PM · Wikidata, Discovery-Search (Current work)
dr0ptp4kt created T359062: Assess Wikidata dump import hardware.
Mon, Mar 4, 3:24 PM · Wikidata, Discovery-Search (Current work)

Fri, Mar 1

dr0ptp4kt added a comment to T358727: Reclaim recently-decommed CP host for WDQS (see T352253).

Thanks @VRiley-WMF ! @bking is up next for imaging, I think.

Fri, Mar 1, 7:30 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.03.04 - 2024.03.24), wmde-wikidata-tech, Wikidata, SRE, ops-eqiad

Thu, Feb 29

dr0ptp4kt added a parent task for T358727: Reclaim recently-decommed CP host for WDQS (see T352253): T358533: Hardware requests for Search Platform FY2024-2025.
Thu, Feb 29, 9:28 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.03.04 - 2024.03.24), wmde-wikidata-tech, Wikidata, SRE, ops-eqiad
dr0ptp4kt added a subtask for T358533: Hardware requests for Search Platform FY2024-2025: T358727: Reclaim recently-decommed CP host for WDQS (see T352253).
Thu, Feb 29, 9:28 PM · Data-Platform-SRE (2024.03.04 - 2024.03.24)
dr0ptp4kt added a parent task for T358727: Reclaim recently-decommed CP host for WDQS (see T352253): T336443: Investigate performance differences between wdqs2022 and older hosts.
Thu, Feb 29, 9:26 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.03.04 - 2024.03.24), wmde-wikidata-tech, Wikidata, SRE, ops-eqiad
dr0ptp4kt added a subtask for T336443: Investigate performance differences between wdqs2022 and older hosts: T358727: Reclaim recently-decommed CP host for WDQS (see T352253).
Thu, Feb 29, 9:26 PM · Data-Platform-SRE
dr0ptp4kt added a comment to T252227: Mobile redirects drop provenance parameters.

Hi team - @lbowmaker asked if I could take a look at this and provide some context. I was having a think on this, and I'd like to ponder up to a few more days and provide some thoughts.

Thu, Feb 29, 12:02 PM · Data-Engineering, Data Pipelines, Traffic-Icebox, SRE

Wed, Feb 28

dr0ptp4kt added a comment to T352253: Decommission task for old cp hosts (cp1075-1090).

@bking , @RKemper , and I met today. @bking has an action on this here ticket (@bking LMK in case I need to chime in on anything!). Thanks!

Wed, Feb 28, 8:14 PM · SRE, ops-eqiad, DC-Ops, Traffic

Tue, Feb 27

dr0ptp4kt updated subscribers of T352253: Decommission task for old cp hosts (cp1075-1090).

After setup, I would be interested in using it for 6 weeks if that's okay (hopefully things would only take 4 weeks, but there's some PTO and real life stuff always comes up). Would that be okay?

Tue, Feb 27, 10:50 PM · SRE, ops-eqiad, DC-Ops, Traffic

Feb 9 2024

dr0ptp4kt added a project to T357064: Use custom CDN if possible for Jupyter HTML exported notebooks: Security.
Feb 9 2024, 6:25 PM · Data-Platform-SRE, Security, Data-Engineering, Data-Engineering-Jupyter

Feb 8 2024

dr0ptp4kt updated the task description for T357064: Use custom CDN if possible for Jupyter HTML exported notebooks.
Feb 8 2024, 9:37 PM · Data-Platform-SRE, Security, Data-Engineering, Data-Engineering-Jupyter
dr0ptp4kt added a project to T357064: Use custom CDN if possible for Jupyter HTML exported notebooks: Data-Engineering-Jupyter.
Feb 8 2024, 9:07 PM · Data-Platform-SRE, Security, Data-Engineering, Data-Engineering-Jupyter
dr0ptp4kt added a project to T357064: Use custom CDN if possible for Jupyter HTML exported notebooks: Data-Platform-SRE.
Feb 8 2024, 9:07 PM · Data-Platform-SRE, Security, Data-Engineering, Data-Engineering-Jupyter
dr0ptp4kt created T357064: Use custom CDN if possible for Jupyter HTML exported notebooks.
Feb 8 2024, 9:02 PM · Data-Platform-SRE, Security, Data-Engineering, Data-Engineering-Jupyter
dr0ptp4kt awarded T349512: [Analytics] Collect multiple sets of SPARQL queries a Party Time token.
Feb 8 2024, 11:48 AM · Wikidata Analytics (Kanban), Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Feb 5 2024

dr0ptp4kt added a comment to T355037: Compare the performance of sparql queries between the full graph and the subgraphs.

I summarized at https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Graph_split_IGUANA_performance . When we have a mailing list post during the next week or so, we'll want to move this to be a subpage of the target page of the post.

Feb 5 2024, 9:58 PM · Discovery-Search (Current work), Wikidata

Feb 2 2024

dr0ptp4kt added a comment to T355037: Compare the performance of sparql queries between the full graph and the subgraphs.

@dr0ptp4kt thanks! is the difference in the number of successful queries only explained by the improvement in query time or are there some improvements in the number of queries that timeout as well?

Feb 2 2024, 8:39 PM · Discovery-Search (Current work), Wikidata

Feb 1 2024

dr0ptp4kt added a comment to T355037: Compare the performance of sparql queries between the full graph and the subgraphs.

Here's the output from the latest run based upon a larger set of queries from a random sample of WDQS queries.

Feb 1 2024, 5:04 PM · Discovery-Search (Current work), Wikidata

Jan 31 2024

dr0ptp4kt added a comment to T355037: Compare the performance of sparql queries between the full graph and the subgraphs.

A run is in progress for 78K+ queries from a set of 100,000 random queries. It should be done in under 10 hours from now.

Jan 31 2024, 11:55 PM · Discovery-Search (Current work), Wikidata

Jan 30 2024

dr0ptp4kt added a comment to T355037: Compare the performance of sparql queries between the full graph and the subgraphs.

Following below are "per-query" summary stats. I actually just put this together by bringing CSV data into Google Sheets for now - all of the columns are calculated upon the "per-query" rows (but you'll see how the Mean corresponds basically with the value calculated up above). The underlying CSV data don't bear actual queries (the .nt files from which they're generated do), but rather rows of this form:

Jan 30 2024, 11:04 PM · Discovery-Search (Current work), Wikidata

Jan 27 2024

dr0ptp4kt added a comment to T355037: Compare the performance of sparql queries between the full graph and the subgraphs.

Here were the data produced by IGUANA once piped through the CSV utility introduced in https://gitlab.wikimedia.org/repos/search-platform/IGUANA/-/merge_requests/3/diffs with a command of the following form (for the attentive reader, note that I had to rename the originally produced files to have an .nt extension to make the underlying Jena libraries not throw an exception).

Jan 27 2024, 1:56 PM · Discovery-Search (Current work), Wikidata
dr0ptp4kt added a comment to T355037: Compare the performance of sparql queries between the full graph and the subgraphs.

Now a screenshot from the re-run of the randomized order queries, followed by a screenshot showing the two runs on the randomized order queries side by side.

Jan 27 2024, 12:54 PM · Discovery-Search (Current work), Wikidata

Jan 26 2024

dr0ptp4kt added a comment to T355037: Compare the performance of sparql queries between the full graph and the subgraphs.

Now, the screenshot from the randomized order queries. I'll run one more time to see that comparable output is achieved. Those were produced with the following. This latest output file has been moved to result.nt.003.

Jan 26 2024, 7:35 PM · Discovery-Search (Current work), Wikidata
dr0ptp4kt added a comment to T355037: Compare the performance of sparql queries between the full graph and the subgraphs.

Now, a screenshot showing the re-run. And then a screenshot showing them side-by-side. This is just for the visual, and the data produced from IGUANA (what is in the .nt output that we can convert to a handy CSV) should be more telling.

Jan 26 2024, 4:26 PM · Discovery-Search (Current work), Wikidata
dr0ptp4kt added a comment to T355037: Compare the performance of sparql queries between the full graph and the subgraphs.

Dropping in a screenshot from Grafana from this first pass. I also made a copy of result.nt to result.nt.001. Re-running to see that server behavior is similar.

Jan 26 2024, 1:59 AM · Discovery-Search (Current work), Wikidata

Jan 25 2024

dr0ptp4kt added a comment to T355037: Compare the performance of sparql queries between the full graph and the subgraphs.

For the first pass, the following configuration is being used for an hour long test conducted from stat1006 with config file wdqs-split-test.yml as follows.

Jan 25 2024, 10:52 PM · Discovery-Search (Current work), Wikidata
dr0ptp4kt claimed T355037: Compare the performance of sparql queries between the full graph and the subgraphs.
Jan 25 2024, 10:33 PM · Discovery-Search (Current work), Wikidata
dr0ptp4kt moved T328330: Create SLI / SLO on Search update lag from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.
Jan 25 2024, 10:32 PM · Data-Platform-SRE, Discovery-Search (Current work)
dr0ptp4kt updated the task description for T355037: Compare the performance of sparql queries between the full graph and the subgraphs.
Jan 25 2024, 10:32 PM · Discovery-Search (Current work), Wikidata
dr0ptp4kt added a comment to T346463: Identify and label prefetch proxy data in our traffic.

It's back up and running. The following query is producing results. Note that the is_goog_isp field is mainly for helping to better resolve if traffic was likely to come via a Google proxy server; but all the usual caveats apply such as ISP mappings can change, isp_data['isp'] can bear strings that merely contain "Google" but aren't an exact match, and so on. The IP list at https://www.gstatic.com/chrome/prefetchproxy/prefetch_proxy_geofeed , like what you see in @fkaelin analysis at T346463#9393571, adds more precision.

Jan 25 2024, 5:59 PM · Traffic, Movement-Insights, Data-Engineering

Jan 18 2024

dr0ptp4kt added a comment to T346463: Identify and label prefetch proxy data in our traffic.

It's entering the analytics system based on the following query:

Jan 18 2024, 7:59 PM · Traffic, Movement-Insights, Data-Engineering
dr0ptp4kt added a comment to T346463: Identify and label prefetch proxy data in our traffic.

Documentation updated: https://wikitech.wikimedia.org/w/index.php?title=X-Analytics&diff=2140528&oldid=2028273

Jan 18 2024, 5:54 PM · Traffic, Movement-Insights, Data-Engineering
dr0ptp4kt added a comment to T346463: Identify and label prefetch proxy data in our traffic.

It's live and looking good in kafkacat. Now we wait a little for stuff to show up in the analytics tables. Thanks @Vgutierrez and @BTullis for the additional reviews and thanks @Vgutierrez for the deployment.

Jan 18 2024, 5:05 PM · Traffic, Movement-Insights, Data-Engineering

Jan 17 2024

dr0ptp4kt added a comment to T346463: Identify and label prefetch proxy data in our traffic.

@fkaelin Sec-Purpose: prefetch;prerender is mentioned for the omnibox use case at https://developer.chrome.com/docs/web-platform/prerender-pages , so I've added that, as well as Chrome's apparent link preview functionality (Sec-Purpose: prefetch;prerender;preview).
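
To make the two header values mentioned above concrete, here is a minimal Python sketch of how one might bucket a Sec-Purpose value into a coarse label. This is not the VCL patch; the function name `classify_sec_purpose` and the label strings are hypothetical, and only the two values quoted in the comment are treated specially.

```python
def classify_sec_purpose(header_value):
    """Return a coarse, hypothetical label for a Sec-Purpose header value."""
    if not header_value:
        return "none"
    # Split on ';' and normalize, e.g. "prefetch;prerender" -> {"prefetch", "prerender"}.
    tokens = {t.strip().lower() for t in header_value.split(";") if t.strip()}
    if "prefetch" not in tokens:
        return "other"
    # Check the most specific value first: prefetch;prerender;preview.
    if "preview" in tokens:
        return "preview"
    if "prerender" in tokens:
        return "prerender"
    return "prefetch"


for value in ("prefetch;prerender", "prefetch;prerender;preview", "prefetch", ""):
    print(repr(value), "->", classify_sec_purpose(value))
```

Checking the most specific token first keeps the link preview case from being counted as an ordinary prerender.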

Jan 17 2024, 7:38 PM · Traffic, Movement-Insights, Data-Engineering
dr0ptp4kt updated the task description for T346463: Identify and label prefetch proxy data in our traffic.
Jan 17 2024, 7:35 PM · Traffic, Movement-Insights, Data-Engineering
dr0ptp4kt added a comment to T346463: Identify and label prefetch proxy data in our traffic.

Just to put something concrete (not saying this is the thing), here's an interesting unit test on the prefetch predictor mechanism:

Jan 17 2024, 6:17 PM · Traffic, Movement-Insights, Data-Engineering
dr0ptp4kt added a comment to T346463: Identify and label prefetch proxy data in our traffic.

Thanks @fkaelin . Yes, those prefetches happened without clicking on them. It seems to occur both for searches originating from the location bar, as well as searches entered into the <input> search field on the Google search webpage.

Jan 17 2024, 6:13 PM · Traffic, Movement-Insights, Data-Engineering

Jan 16 2024

dr0ptp4kt added a comment to T346463: Identify and label prefetch proxy data in our traffic.

By the way, here are corresponding wmf_raw.webrequest fields for this latest SERP. Notice how two prefetch requests were made from the same SERP, but the exit IP differs a little.

Jan 16 2024, 6:54 PM · Traffic, Movement-Insights, Data-Engineering

Jan 12 2024

dr0ptp4kt added a comment to T346463: Identify and label prefetch proxy data in our traffic.

I managed to make a connection via the Chrome private prefetch proxy using a Fire with a sideloaded Chrome 120 APK. In this case the User-Agent is perceived as a desktop one by Google, but processed as mobile in the Wikipedia infrastructure, so Chrome saw it as a 302 (delivered from the Wikipedia edge via the Google proxy) while in the Google SERP.

Jan 12 2024, 11:34 PM · Traffic, Movement-Insights, Data-Engineering
dr0ptp4kt added a comment to T346463: Identify and label prefetch proxy data in our traffic.

In the Android emulator, it's possible to make Chrome initiate this sort of request. Unfortunately, it seems that in the emulator there may be some lower level networking issue, at least from a Mac (Intel based in this case), because it shows a network related error, and a click on the article title from a Google SERP issues a plain GET not loaded from cache based on a quick look with DevTools and kafkacat. I somewhat strongly expect that the network related error would not be present (i.e., the fetch would succeed and be cached) on a real physical device when one is able to trigger Chrome private proxy prefetch.

Jan 12 2024, 8:23 PM · Traffic, Movement-Insights, Data-Engineering
dr0ptp4kt added a comment to T346463: Identify and label prefetch proxy data in our traffic.

@Vgutierrez @BTullis https://gerrit.wikimedia.org/r/c/operations/puppet/+/981352 is ready for review. Would it be possible to review and arrange for a deployment next week?

Jan 12 2024, 5:16 PM · Traffic, Movement-Insights, Data-Engineering

Jan 10 2024

dr0ptp4kt updated the task description for T346463: Identify and label prefetch proxy data in our traffic.
Jan 10 2024, 6:58 PM · Traffic, Movement-Insights, Data-Engineering
dr0ptp4kt added a comment to T346463: Identify and label prefetch proxy data in our traffic.

I'm scheduling time with @Mayakp.wiki and @MGerlach to soon discuss potential future use cases, but if folks familiar with VCL could give the latest version of the patch a look it'd be appreciated. I updated the Description a bit to note some additional considerations - it dawned on me we ought to capture some different browsers so that we can hopefully cover the broader majority of browser prefetch, irrespective of use of an intervening proxy architecture. This doesn't solve for the difficulty of properly classifying whether a pageview actually materialized in the user's browser (e.g., from cache of a prefetched resource), but hopefully aids for more coverage in case we are curious about different browser vintages.

Jan 10 2024, 6:52 PM · Traffic, Movement-Insights, Data-Engineering
dr0ptp4kt updated the task description for T346463: Identify and label prefetch proxy data in our traffic.
Jan 10 2024, 6:06 PM · Traffic, Movement-Insights, Data-Engineering

Jan 8 2024

dr0ptp4kt added a comment to T346463: Identify and label prefetch proxy data in our traffic.

I'll amend the patch.

Jan 8 2024, 7:35 PM · Traffic, Movement-Insights, Data-Engineering
dr0ptp4kt moved T350106: Implement a spark job that converts a RDF triples table into a RDF file format from Needs review to In Progress on the Discovery-Search (Current work) board.
Jan 8 2024, 4:06 PM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Jan 4 2024

dr0ptp4kt added a comment to T350106: Implement a spark job that converts a RDF triples table into a RDF file format.

Imports seemed to work.

Jan 4 2024, 5:53 PM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Dec 8 2023

dr0ptp4kt updated the task description for T352783: Change data platform-related IRC channels to improve communication.
Dec 8 2023, 2:52 PM · Data-Platform-SRE (2024.03.04 - 2024.03.24), observability
dr0ptp4kt updated the task description for T352783: Change data platform-related IRC channels to improve communication.
Dec 8 2023, 2:52 PM · Data-Platform-SRE (2024.03.04 - 2024.03.24), observability

Dec 6 2023

dr0ptp4kt added a comment to T346463: Identify and label prefetch proxy data in our traffic.

I like where @elukey is going with this.

Dec 6 2023, 7:03 PM · Traffic, Movement-Insights, Data-Engineering
dr0ptp4kt created P54265 Logback config for Blazegraph.
Dec 6 2023, 11:53 AM
dr0ptp4kt added a comment to T350106: Implement a spark job that converts a RDF triples table into a RDF file format.

After an update to the script (PS6) and a fresh run of the same commands new files have been hdfs-rsync'd to stat1006:~dr0ptp4kt/gzips in anticipation of doing a file transfer over to the WDQS graph split test servers.

Dec 6 2023, 12:43 AM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Dec 5 2023

JAllemandou awarded T350106: Implement a spark job that converts a RDF triples table into a RDF file format a Burninate token.
Dec 5 2023, 8:39 AM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
dr0ptp4kt updated subscribers of T350106: Implement a spark job that converts a RDF triples table into a RDF file format.

I ran the current version of the code as follows:

Dec 5 2023, 2:37 AM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Dec 4 2023

dr0ptp4kt added a comment to T350106: Implement a spark job that converts a RDF triples table into a RDF file format.

Not using right now, but here's roughly how one might go about generating more expanded Turtle statements without reverse-mapping prefixes: F41561068

Dec 4 2023, 10:21 PM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
dr0ptp4kt created P54143 Example for generating bigger Turtle statements.
Dec 4 2023, 10:10 PM

Nov 29 2023

dr0ptp4kt moved T350106: Implement a spark job that converts a RDF triples table into a RDF file format from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.
Nov 29 2023, 5:12 PM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
dr0ptp4kt updated subscribers of T350106: Implement a spark job that converts a RDF triples table into a RDF file format.

Adding a note so I don't forget: advice from @BTullis is to avoid NFS if possible, and advice from @JAllemandou is to consider use of hdfs-rsync (after our call I sought this out and found these: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/python/refinery/hdfs.py and https://gerrit.wikimedia.org/g/analytics/hdfs-tools/deploy/+/2445aec92f6b3d409531fb74ab3f9a22d9716823/bin/hdfs-rsync and https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/bin/hdfs-rsync EDIT and https://github.com/wikimedia/hdfs-tools/blob/master/src/main/scala/org/wikimedia/analytics/hdfstools/HdfsRsyncCLI.scala - the latter being available from stat boxes from a quick glance). Chances are we'd need to add a ferm rule and possibly wire up some Kerberos stuff on the WDQS servers if going the hdfs-rsync route.

Nov 29 2023, 4:37 PM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
dr0ptp4kt claimed T350106: Implement a spark job that converts a RDF triples table into a RDF file format.
Nov 29 2023, 4:31 PM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Nov 20 2023

dr0ptp4kt closed T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules as Resolved.
Nov 20 2023, 5:46 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
dr0ptp4kt closed T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules, a subtask of T337013: [Epic] Splitting the graph in WDQS, as Resolved.
Nov 20 2023, 5:45 PM · Discovery-Search (Current work), Epic, Wikidata-Query-Service, Wikidata
dr0ptp4kt added a comment to T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules.

The job completed. The counts match up on this productionized job compared with the prior one run in my namespace. Following are some Hive queries in case needed later. Below that is a really small sample of the resultant data in tabular format for each partition.

Nov 20 2023, 5:42 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Nov 16 2023

dr0ptp4kt updated subscribers of T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules.

Spark patch merged, new Jenkins build of the rdf JAR done, Airflow patch merged. This is deployed to Search's Airflow instance and the job is running. Thank you, @dcausse and @EBernhardson.

Nov 16 2023, 10:17 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
dr0ptp4kt moved T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules from In Progress to Needs review on the Discovery-Search (Current work) board.

Here's what I saw after re-running. So, we should be good with the latest patchset that goes without distinct() on the final graphs.

Nov 16 2023, 2:49 AM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Oct 27 2023

dr0ptp4kt added a comment to T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules.

Update: it seems to be working. Thus, I'd say this is maybe 75% complete.

Oct 27 2023, 5:54 AM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Oct 25 2023

dr0ptp4kt added a comment to T347605: Document process for getting JNL files/consider automation.

I also see https://grafana.wikimedia.org/d/000000264/wikidata-dump-downloads?orgId=1&refresh=5m&from=now-2y&to=now which I noticed from some tickets involving @Addshore (cf. T280678: Crunch and delete many old dumps logs and friends) and a pointer from a colleague.

Oct 25 2023, 2:18 PM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service
dr0ptp4kt added a comment to T347605: Document process for getting JNL files/consider automation.

^ Update.

Oct 25 2023, 1:11 PM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service
dr0ptp4kt added a comment to T347605: Document process for getting JNL files/consider automation.

UPDATE from previous comment: reducing to GETs, it's closer to 100 (a bunch of the requests were HEAD requests). Also, it seems that there may be some sort of range requests going on in there, so it's messier than at first glance.

Oct 25 2023, 12:57 PM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service

Oct 20 2023

dr0ptp4kt added a comment to T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules.

It took about 26min 24s to write S_direct_triples (7_293_925_470 rows) in basic Parquet. It's not all the rows (not even for its own partition, as that will include Value and Reference triples as well), but this means it ought to be possible for the job to write a total of 15B rows with about an hour of wall time (maybe double that to play it safe).
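
The extrapolation above can be checked with a short Python calculation. It is only a sketch: it assumes write throughput scales linearly with row count, which may not hold across partitions of different shape, and the variable names are illustrative.

```python
# Figures quoted in the comment above.
rows_written = 7_293_925_470
elapsed_seconds = 26 * 60 + 24  # 26min 24s

# Target taken from the comment's "15B rows" estimate.
target_rows = 15_000_000_000

# Assumption: throughput stays constant as the row count grows.
rows_per_second = rows_written / elapsed_seconds
estimated_minutes = target_rows / rows_per_second / 60

print(f"{rows_per_second:,.0f} rows/s")
print(f"{estimated_minutes:.1f} min for {target_rows:,} rows")
print(f"{2 * estimated_minutes:.1f} min with the doubled safety margin")
```

Under that assumption the estimate lands a little under an hour, consistent with the "about an hour" figure.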

Oct 20 2023, 4:06 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
dr0ptp4kt added a comment to T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules.

TL;DR this is about 45% done.

Oct 20 2023, 2:23 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Oct 16 2023

dr0ptp4kt added a comment to T347605: Document process for getting JNL files/consider automation.

Good question - I meant the contrast with respect to the .ttl.gz dumps and everything that goes into munging and importing (in aggregate across all downloaders of those files) versus the same for if this was done with the .jnl where they don't have to munge and import. Napkin-mathsing it, the thought was that the savings on energy accrue about as soon as the 16 cores x 12 hours of compression time on the .jnl have been "saved" by people in aggregate not needing to run the import process (and I'm just waving away the client side decompression, which in a way technically happens twice for the .ttl.gz user but only once for the .jnl.zst user, and any other disk or network transfer pieces, as those are all close enough, I suppose).
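
The break-even reasoning above can be sketched in Python. Only the 16 cores x 12 hours figure comes from the comment; the per-downloader import cost is a PLACEHOLDER that was not measured here, and the function name is hypothetical.

```python
import math

# One-time cost of compressing the .jnl, from the comment above.
compression_core_hours = 16 * 12

# PLACEHOLDER, not a measured value: core-hours one downloader would spend
# munging and importing the .ttl.gz dump instead of using the .jnl.
import_core_hours_per_user = 48.0


def break_even_users(one_time_core_hours, per_user_core_hours):
    """Smallest number of downloaders whose avoided imports cover the one-time cost."""
    if per_user_core_hours <= 0:
        raise ValueError("per_user_core_hours must be positive")
    return math.ceil(one_time_core_hours / per_user_core_hours)


print(compression_core_hours, "core-hours of compression")
print(break_even_users(compression_core_hours, import_core_hours_per_user), "downloaders to break even")
```

Swapping in a measured import cost would turn the placeholder result into a real estimate.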

Oct 16 2023, 11:52 AM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service

Oct 13 2023

dr0ptp4kt updated subscribers of T347089: Deployment training request for dr0ptp4kt.

Thank you @thcipriani and @brennen for the guidance and support throughout!

Oct 13 2023, 9:47 PM · Release-Engineering-Team (Deployment Training Requests)
dr0ptp4kt added a comment to T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules.

Personalized dev environment on analytics cluster with Airflow setup (stat1006) - was able to execute job, slightly hacked up to get specific dates and not keep running regularly (eats lots of disk) to get dr0ptp4kt.wikibase_rdf_with_split using my Kerberos principal. Verifying Jupyter notebook approach from David / Andy on stat1005 - some glitches, as is to be expected, but worked okay by doubling timeouts and removing some caps. Next up, working on a job that will do the splitting in a fashion similar to what's achieved with the join-antijoin approach of the notebooks. I'll want to have the produced data separated out from the existing table, I think - in this case it would be okay in my opinion to use some extra disk.

Oct 13 2023, 5:36 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
dr0ptp4kt added a comment to T346688: Icinga contact for dr0ptp4kt.

Thanks @herron !

Oct 13 2023, 3:45 PM · SRE Observability (FY2023/2024-Q2), observability, SRE
dr0ptp4kt added a comment to T344905: Publish WDQS JNL files to dumps.wikimedia.org.

I think the amount of time taken to decompress the JNL file should also be taken into consideration on varying hardware if compression is being considered.

Oct 13 2023, 2:05 PM · Data Products, Data-Engineering, Wikidata, Wikidata-Query-Service
dr0ptp4kt added a comment to T347605: Document process for getting JNL files/consider automation.

@bking just wanted to express my gratitude for the support on this ticket and its friends T344905: Publish WDQS JNL files to dumps.wikimedia.org and T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`. FWIW I do think it would be good to automate this. As a matter of getting to a functional WDQS local environment replete with BlazeGraph data, it would accelerate things a lot. I think my only reservations are that:

Oct 13 2023, 2:01 PM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service

Oct 11 2023

dr0ptp4kt updated the task description for T346920: VisualEditor's Add a link should suggest a redirect with exact case match.
Oct 11 2023, 4:23 PM · Verified, MW-1.42-notes (1.42.0-wmf.3; 2023-10-31), Discovery-Search (Current work), Editing-team (Kanban Board), VisualEditor

Oct 5 2023

dr0ptp4kt added a comment to T347605: Document process for getting JNL files/consider automation.

Addressing @Addshore's comment in T344905#9210122...

Oct 5 2023, 9:26 PM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service
dr0ptp4kt added a comment to T347605: Document process for getting JNL files/consider automation.

Drawing from your inspiration, I downloaded with wget overnight and the sha1sum now matches that from wdqs1016. Deflating now, will update with results.
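
For anyone repeating the checksum comparison without the sha1sum command, here is a minimal Python sketch that hashes a large file in chunks so it never has to fit in memory. The function names and the chunk size are illustrative choices, not part of the process described in this ticket.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with a published checksum, ignoring case and whitespace."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

A call such as `matches_expected(local_path, published_checksum)` would then take the place of eyeballing the two hex strings.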

Oct 5 2023, 12:38 PM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service
dr0ptp4kt added a comment to T347089: Deployment training request for dr0ptp4kt.

Um there is no Thurs Oct 8. There is Thurs Oct 5 (today) and Thurs Oct 12, 19, 26... wonder if you meant any of these?

Oct 5 2023, 10:33 AM · Release-Engineering-Team (Deployment Training Requests)
dr0ptp4kt updated the task description for T347089: Deployment training request for dr0ptp4kt.
Oct 5 2023, 10:32 AM · Release-Engineering-Team (Deployment Training Requests)

Oct 3 2023

dr0ptp4kt closed T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'` as Resolved.

I'm going to close this for now given that the later dump munged okay and there seems to be an underlying issue somewhere probably related to file transfer. The `-- --skolemize` will be a thing to consider for any future run, nonetheless.

Oct 3 2023, 6:40 PM · Data-Engineering, Wikidata, Discovery-Search (Current work), Wikidata-Query-Service
dr0ptp4kt added a comment to T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`.

I did manage to run a sha1sum on the older dump where the import had failed.

Oct 3 2023, 5:49 PM · Data-Engineering, Wikidata, Discovery-Search (Current work), Wikidata-Query-Service
dr0ptp4kt added a comment to T347605: Document process for getting JNL files/consider automation.

Here's the sha1sum for the latest file I had downloaded:

Oct 3 2023, 5:28 PM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service
dr0ptp4kt claimed T347989: Adapt rdf-spark-tools to split the wikidata graph based on a set of rules.
Oct 3 2023, 3:49 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
dr0ptp4kt added a comment to T347605: Document process for getting JNL files/consider automation.

For me the first 300 GB of the file went really, really fast. But axel was dropping connections, similar to when I had downloaded the large 1 TB file. So this download took about 5 hours. I'm pretty sure it could be done in 1-3 hours, though, if everything were working well.

Oct 3 2023, 1:19 PM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service