
dcausse (David Causse)
User

User Details

User Since
Jun 9 2015, 9:03 AM (475 w, 2 d)
Availability
Available
IRC Nick
dcausse
LDAP User
DCausse
MediaWiki User
DCausse (WMF) [ Global Accounts ]

Recent Activity

Today

dcausse moved T367510: Request permission to create 4 kafka topics in kafka-main (WDQS graph split) from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board.

The new code was deployed this morning and data is flowing properly into the new topics. We improved batching a bit to save some space via compression, and I believe we have room to increase some buffer sizes if we want to optimize for space even further.

Thu, Jul 18, 9:37 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Discovery-Search (Current work), wmde-wikidata-tech, serviceops, Wikidata
dcausse updated the task description for T367510: Request permission to create 4 kafka topics in kafka-main (WDQS graph split).
Thu, Jul 18, 9:29 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Discovery-Search (Current work), wmde-wikidata-tech, serviceops, Wikidata
dcausse moved T361935: Adapt the WDQS Streaming Updater to update multiple WDQS subgraphs from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board.
Thu, Jul 18, 9:29 AM · Discovery-Search (Current work), Wikidata
dcausse moved T365158: Homogenise jackson version from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board.
Thu, Jul 18, 9:29 AM · Discovery-Search (Current work), Wikidata
dcausse moved T369729: RDF Updater Consumer: Allow "empty" patches from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board.
Thu, Jul 18, 9:28 AM · Discovery-Search (Current work), Wikidata
dcausse added a comment to T369808: The Commons search "deepcategory" operator often does not work (Deep category query returned too many categories).

I can't reproduce with the example given in the description: File:Nut_Grab.jpg (page id 29851242) is properly excluded when searching pageid:29851242 -deepcategory:"Animals with nuts". So I suspect that the problem might have been caused by the issues we had with dumps recently. @Prototyperspective could you confirm, or possibly provide another example file that does not comply with the search query?
For reference (at the time of writing this comment) the list of categories identified by deepcategory:"Animals with nuts" is:

  • Animals with nuts
  • Animals eating nuts
  • Animals eating peanuts
  • Curculio (larval damage)
  • Animals eating hazelnuts
  • Animals eating walnuts
  • Birds eating nuts
  • Sciurus vulgaris eating walnuts
  • Sciurus vulgaris eating hazelnuts
  • Sciuridae eating peanuts
  • Birds eating peanuts
  • Sciurus carolinensis eating walnuts
  • Curculio nucum (larva)
  • Tamias striatus eating peanuts
  • Sciurus vulgaris eating peanuts
  • Sciurus carolinensis eating peanuts
  • Tamias striatus fed by hand (EIC)
Thu, Jul 18, 8:08 AM · Discovery-Search (Current work), CirrusSearch, Commons
dcausse moved T369808: The Commons search "deepcategory" operator often does not work (Deep category query returned too many categories) from Ready for Dev -- SWE to Blocked/Waiting on the Discovery-Search (Current work) board.
Thu, Jul 18, 8:08 AM · Discovery-Search (Current work), CirrusSearch, Commons

Yesterday

dcausse added a comment to T365595: [S] [Tech Debt] Swap out page/related API call with MediaWiki API equivalent.

@dcausse @MSantos

We are in the process of switching this over on iOS. I wanted to call out that our UI expects 20 results, so we are requesting that many. From what I can see Android requests 5. Will 20 results be a problem on the backend?

Wed, Jul 17, 10:07 AM · Wikipedia-iOS-App-Backlog (iOS Release FY2024-25)
dcausse edited projects for T366253: Create a generic stream to populate CirrusSearch weighted_tags, added: Discovery-Search (Current work); removed Discovery-Search.
Wed, Jul 17, 9:48 AM · Patch-For-Review, Discovery-Search (Current work), CirrusSearch
dcausse updated subscribers of T367510: Request permission to create 4 kafka topics in kafka-main (WDQS graph split).

We are getting ready to deploy the new updater that will populate these new topics, @bking could we have the topics created with proper retention and partitioning? (We could also let the topics autocreate and adapt the retention after the fact using https://wikitech.wikimedia.org/wiki/Kafka/Administration#Alter_topic_retention_settings). Thanks!

Wed, Jul 17, 7:53 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Discovery-Search (Current work), wmde-wikidata-tech, serviceops, Wikidata
dcausse added a comment to T368714: kafka-main replacement nodes don't fit kafka-main (storage wise).

Something else to consider is fine-tuning mirrormaker: I don't think that, in the case of the wdqs updater, we need the *.rdf-streaming-updater.mutation* topics replicated between the two kafka-main clusters.

Wed, Jul 17, 7:48 AM · serviceops

Tue, Jul 16

dcausse moved T368010: Search not working for entity schemas from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board.

I think that all 412 schemas are now properly indexed.

Tue, Jul 16, 3:55 PM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), CirrusSearch, Wikidata, EntitySchema
dcausse moved T361950: Ensure that WDQS query throttling does not interfere with federation from Needs review to Needs Reporting on the Discovery-Search (Current work) board.

When you get a chance can you see if everything looks good? For the time being we've verified that the patches didn't break wdqs and also that, at least in some cases, the disable-throttling header is being set, so things are looking good thus far.

Tue, Jul 16, 8:01 AM · wmde-wikidata-tech, Patch-For-Review, Discovery-Search (Current work), Wikidata

Mon, Jul 15

dcausse moved T355298: Investigate the impact of the WDQS graph split on constraints checks from Blocked/Waiting to Needs Reporting on the Discovery-Search (Current work) board.
Mon, Jul 15, 3:12 PM · Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), Wikibase-Quality-Constraints, Wikidata
dcausse moved T361935: Adapt the WDQS Streaming Updater to update multiple WDQS subgraphs from Needs review to To Be Deployed on the Discovery-Search (Current work) board.
Mon, Jul 15, 3:07 PM · Discovery-Search (Current work), Wikidata
dcausse moved T368010: Search not working for entity schemas from Needs review to To Be Deployed on the Discovery-Search (Current work) board.
Mon, Jul 15, 3:07 PM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), CirrusSearch, Wikidata, EntitySchema
dcausse moved T369495: Make `haswbstatement:` work for the EntitySchema property from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board.
Mon, Jul 15, 3:06 PM · Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), Wikidata, EntitySchema

Wed, Jul 10

dcausse added a comment to T361935: Adapt the WDQS Streaming Updater to update multiple WDQS subgraphs.

@Scott_French thanks for pinging us, indeed the test that was running was using https://api-ro.discovery.wmnet and this is unintentional (we rarely run such tests and by re-using an old configuration I overlooked that it was relying on api-ro). I have stopped the test (and removed this old config), there should be no use-cases from our end still hitting api-ro.

Wed, Jul 10, 6:57 AM · Discovery-Search (Current work), Wikidata

Fri, Jul 5

dcausse added a comment to T369149: Search has outdated label for P12861 (“Shape Expression for class” rather than “EntitySchema for class”).

@Lucas_Werkmeister_WMDE thanks for the fix! I manually re-indexed this item with our new (WIP) tooling. The cleanup process would have fixed it automatically, but in the worst case it could have taken up to 2 weeks to discover it.

Fri, Jul 5, 11:55 AM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), wmde-wikidata-tech, Wikimedia-production-error, Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), EntitySchema, Wikidata
dcausse edited P65853 Search Modules with the search API.
Fri, Jul 5, 8:13 AM
dcausse edited P65853 Search Modules with the search API.
Fri, Jul 5, 8:11 AM
dcausse created P65853 Search Modules with the search API.
Fri, Jul 5, 8:07 AM

Thu, Jul 4

dcausse added a comment to T368543: Error: Call to a member function getPageAsLinkTarget() on null.

We’ve already checked that $valParts[1] isn’t set, but then we still cast it to an int, and try to load a revision record from that ID? Is the if condition just flipped from what it should be?

Thu, Jul 4, 8:26 AM · Metrics Platform Backlog, CirrusSearch, Data Products, MediaWiki-extensions-EventLogging, Wikimedia-production-error, Data-Engineering
dcausse added a comment to T237773: Move Wikitech onto the production MW cluster.

Should T192361 be re-opened and added as a subtask here?

Thu, Jul 4, 6:49 AM · cloud-services-team, wikitech.wikimedia.org

Wed, Jul 3

dcausse moved T369149: Search has outdated label for P12861 (“Shape Expression for class” rather than “EntitySchema for class”) from Incoming to Blocked/Waiting on the Discovery-Search (Current work) board.
Wed, Jul 3, 1:39 PM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), wmde-wikidata-tech, Wikimedia-production-error, Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), EntitySchema, Wikidata
dcausse added a comment to T369149: Search has outdated label for P12861 (“Shape Expression for class” rather than “EntitySchema for class”).

It seems that \EntitySchema\Wikibase\DataValues\EntitySchemaValue::getType() returns EntityIdValue::getType(), so some code treats it as an EntityIdValue ('VT:wikibase-entityid'). Here WikibaseCirrusSearch is calling https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/8b3312396b4b8b91790d7b33c4703fb31bd290d8/repo/WikibaseRepo.datatypes.php#421 with an EntitySchemaValue.

Wed, Jul 3, 1:35 PM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), wmde-wikidata-tech, Wikimedia-production-error, Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), EntitySchema, Wikidata
dcausse added a comment to T369149: Search has outdated label for P12861 (“Shape Expression for class” rather than “EntitySchema for class”).

The process is unable to render this document: https://www.wikidata.org/w/api.php?action=query&cbbuilders=content|links&format=json&format=json&formatversion=2&pageids=120965176&prop=cirrusbuilddoc fails with Caught exception of type TypeError:

Wed, Jul 3, 1:10 PM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), wmde-wikidata-tech, Wikimedia-production-error, Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), EntitySchema, Wikidata
dcausse edited projects for T369149: Search has outdated label for P12861 (“Shape Expression for class” rather than “EntitySchema for class”), added: Discovery-Search (Current work); removed Discovery-Search.
Wed, Jul 3, 1:04 PM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), wmde-wikidata-tech, Wikimedia-production-error, Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), EntitySchema, Wikidata
dcausse added a comment to T362702: APT errors when installing custom packages in MediaWiki-Docker.

It seems that https://packages.sury.org/php/dists/buster/ recently started returning a 403

Wed, Jul 3, 8:51 AM · Release-Engineering-Team (Priority Backlog 📥), dev-images, MediaWiki-Docker
dcausse added a comment to T369080: statsd-exporter in k8s is not configured to use its mapping configuration.

@dcausse, I see some metrics now at mediawiki_cirrus_search_request_time_bucket. Anything amiss?

Wed, Jul 3, 6:56 AM · SRE, Observability-Metrics

Tue, Jul 2

dcausse updated the task description for T350597: Audit and prioritize metrics for conversion to statslib that are used for graphite-based alerting.
Tue, Jul 2, 2:16 PM · SRE Observability (FY2024/2025-Q1), Discovery-Search (Current work), Data-Platform-SRE, MW-1.42-notes (1.42.0-wmf.20; 2024-02-27), User-fgiunchedi, Observability-Metrics
dcausse added a comment to T368996: Entering "Palestine" on en.wp, search suggestions do not offer "State of Palestine".

Related https://en.wikipedia.org/wiki/Talk:Palestine#Requested_move_29_June_2024

Tue, Jul 2, 8:02 AM · Discovery-Search, CirrusSearch

Mon, Jul 1

dcausse moved T331127: phantom redirects lingering in incategory searches after page moves from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board.

Moving a page from one namespace to another should now properly clean up the search index. Existing phantom redirects might still be around for a couple of weeks while the automated cleanup process takes care of them. Please let me know if you see new instances of this problem in the future. Sorry for the inconvenience.

Mon, Jul 1, 3:48 PM · MW-1.40-notes (1.40.0-wmf.25; 2023-02-27), Discovery-Search (Current work), CirrusSearch
dcausse added a comment to T362978: Update all helm modules and charts to be compatible with the restricted PSS.

Hi, I'm having issues with a flink job running in staging that fails to deploy with this error:
>>> Status | Error | DEPLOYED | {"type":"org.apache.flink.kubernetes.operator.exception.DeploymentFailedException","message":"pods \"flink-app-consumer-search-784bc9fd87-9n862\" is forbidden: violates PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (container \"flink-main-container\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"flink-main-container\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or container \"flink-main-container\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"flink-main-container\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")","additionalMetadata":{"reason":"FailedCreate"},"throwableList":[]}

Mon, Jul 1, 1:21 PM · Patch-For-Review, serviceops, Prod-Kubernetes
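
The PodSecurity error quoted in the comment above names four settings that the container must carry under the restricted profile. As a minimal sketch (not the actual admission controller logic, and simplified in how pod-level and container-level settings are combined), the checks can be expressed as follows. The function name and the dictionary layout are illustrative assumptions; only the field names and expected values come from the error message.

```python
def restricted_violations(pod_sc: dict, container_sc: dict) -> list:
    """Return the checks named in the error message that the given
    pod-level and container-level securityContext dicts fail."""
    violations = []

    # Container-level only: privilege escalation must be explicitly disabled.
    if container_sc.get("allowPrivilegeEscalation") is not False:
        violations.append("allowPrivilegeEscalation != false")

    # Container-level only: all capabilities must be dropped.
    dropped = container_sc.get("capabilities", {}).get("drop", [])
    if "ALL" not in dropped:
        violations.append("unrestricted capabilities")

    # Pod or container level: the container value wins when it is set.
    run_as_non_root = container_sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot"))
    if run_as_non_root is not True:
        violations.append("runAsNonRoot != true")

    # Pod or container level: seccomp profile must be one of the allowed types.
    seccomp = container_sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
    if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
        violations.append("seccompProfile")

    return violations


# A container with no securityContext fails all four checks, as in the error.
print(restricted_violations({}, {}))

# A securityContext that satisfies every check listed in the error message.
compliant = {
    "allowPrivilegeEscalation": False,
    "capabilities": {"drop": ["ALL"]},
    "runAsNonRoot": True,
    "seccompProfile": {"type": "RuntimeDefault"},
}
print(restricted_violations({}, compliant))
```

The second call prints an empty list, which is the state the flink-main-container would need to reach for the deployment to be admitted.
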
dcausse added a comment to T368894: Cirrus search does not prioritise master pages on their subpages.

The talk page is indeed ranked very low: it is quite recent (created in May 2024) and has 0 incoming_links, and is thus far behind https://he.wikipedia.org/wiki/שיחת_משתמש:קיפודנחש/ארכיון_31_עד_מאי_2024 which has more than 3k incoming links. CirrusSearch indeed does not prioritize master pages over their subpages. If we want to do this, it would have to be carefully evaluated, because one thing we can't do is rank a subpage lower solely relative to its master page: all subpages would be down-ranked.

Mon, Jul 1, 12:39 PM · Discovery-Search, CirrusSearch
dcausse moved T286814: '.event.pageViewId' should be string, '.event.subTest' should be string, '.event.searchSessionId' should be string from Bugs to needs triage on the Discovery-Search board.
Mon, Jul 1, 10:15 AM · MW-1.43-notes (1.43.0-wmf.14; 2024-07-16), Discovery-Search (Current work), Wikimedia-production-error, Data-Engineering
dcausse moved T331127: phantom redirects lingering in incategory searches after page moves from Needs review to To Be Deployed on the Discovery-Search (Current work) board.
Mon, Jul 1, 7:15 AM · MW-1.40-notes (1.40.0-wmf.25; 2023-02-27), Discovery-Search (Current work), CirrusSearch

Fri, Jun 28

dcausse moved T363521: Completion suggester can promote a bad build from In Progress to Blocked/Waiting on the Discovery-Search (Current work) board.

I added some logging info to get a sense of the numbers, moving to waiting while we gather a bit more info.

Fri, Jun 28, 3:39 PM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Discovery-Search (Current work), serviceops-radar, CirrusSearch
dcausse claimed T363521: Completion suggester can promote a bad build.
Fri, Jun 28, 2:43 PM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Discovery-Search (Current work), serviceops-radar, CirrusSearch
dcausse claimed T366589: PHP Deprecated: Implicit conversion from float 75000.00000000001 to int loses precision.
Fri, Jun 28, 2:07 PM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Discovery-Search (Current work), affects-translatewiki.net, PHP 8.1 support, CirrusSearch
dcausse added a project to T361950: Ensure that WDQS query throttling does not interfere with federation: serviceops.
Fri, Jun 28, 12:45 PM · wmde-wikidata-tech, Patch-For-Review, Discovery-Search (Current work), Wikidata
dcausse added a comment to T361950: Ensure that WDQS query throttling does not interfere with federation.

tagging serviceops for help on envoy to see if it can be used as a load balancer to balance the internal requests made from one blazegraph cluster to another without using lvs.

Fri, Jun 28, 12:44 PM · wmde-wikidata-tech, Patch-For-Review, Discovery-Search (Current work), Wikidata
dcausse added a comment to T361950: Ensure that WDQS query throttling does not interfere with federation.

@Vgutierrez thanks for the help!

Fri, Jun 28, 12:25 PM · wmde-wikidata-tech, Patch-For-Review, Discovery-Search (Current work), Wikidata

Wed, Jun 26

dcausse added a comment to T368010: Search not working for entity schemas.

In the meantime an ugly workaround is to search both the EntitySchema and EntitySchema talk namespaces but filter on the content model using the keyword contentmodel:EntitySchema: https://www.wikidata.org/w/index.php?search=contentmodel%3AEntitySchema+intitle%3A%2FE%2F&title=Special:Search&profile=advanced&fulltext=1&ns640=1&ns641=1 .

Wed, Jun 26, 7:07 PM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), CirrusSearch, Wikidata, EntitySchema
dcausse added a comment to T368010: Search not working for entity schemas.

Hm, though the search links in the task description still don’t yield the expected results :/

Yes, this is sadly kind of expected (I should have told you about this on the config patch, sorry). The cleanup process had already started moving pages around while the entity schema namespace was considered non-content, and thus these pages are no longer findable now that the namespace has been brought back into the content namespaces. I need to reindex these pages to make search work again, but sadly our tooling is not working as expected and I need to deploy https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/143 first to be able to fix the index. If this is causing major disruption I can mess with the index by hand, but I'd rather not do that if not strictly required. Sorry for the inconvenience!

Wed, Jun 26, 6:46 PM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), CirrusSearch, Wikidata, EntitySchema
dcausse added a comment to T362977: WDQS updater missed some updates.

Another instance of this issue was reported on wiki:

@dcausse (WMF): fwiw, I have 6 items updated on the 19 & 20 June - https://w.wiki/ASz6 - for which WDQS has not been updated ... on the production WDQS, not test. Only one of them was edited within the June 19 between 03:00 and 15:30 UTC window, afaics. It's not a prolem for me, more of a FYI. --Tagishsimon (talk) 16:01, 21 June 2024 (UTC)

Wed, Jun 26, 1:16 PM · Data-Engineering, Data-Platform, Wikidata, Wikidata-Query-Service
dcausse updated subscribers of T368010: Search not working for entity schemas.

Surprisingly, E378, which is one of the schemas that is not indexed, appears to be indexed in the "content" index of wikidata, but AFAICT 640 is not a content namespace.
But it might have been considered a content namespace a few weeks ago.
I wonder if T363153 and esp. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EntitySchema/+/1040113/ might be the reason for this change. When a namespace with existing documents has its search characteristics changed (wgContentNamespace and/or wgNamespacesToBeSearchedDefault), the indexed docs are not moved automatically from one index to another; they rely on the saneitizer to slowly fix the inconsistencies. This is what might have happened here, and it would explain why the schemas suddenly disappeared and got re-indexed slowly over time.

Wed, Jun 26, 10:35 AM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), CirrusSearch, Wikidata, EntitySchema
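
The mechanism described in the comment above (documents staying in their old index after a namespace changes from content to non-content) can be illustrated with a small sketch. This is not CirrusSearch code. The "general" name for the non-content index, the function names, and the sample content namespace ids are assumptions made for illustration; only E378, namespace 640 and the "content" index come from the comment.

```python
def expected_index(namespace: int, content_namespaces: set) -> str:
    """Pick the index a page should live in, based on the current config.
    'general' as the name of the non-content index is an assumption."""
    return "content" if namespace in content_namespaces else "general"


def misplaced_docs(indexed_docs: list, content_namespaces: set) -> list:
    """Return titles whose current index no longer matches the config.
    indexed_docs holds (title, namespace, index_where_doc_lives) tuples.
    These are the inconsistencies a cleanup process has to fix over time."""
    return [
        title
        for title, namespace, current_index in indexed_docs
        if expected_index(namespace, content_namespaces) != current_index
    ]


# E378 was indexed while namespace 640 counted as a content namespace.
docs = [("E378", 640, "content")]

# With 640 still configured as content, nothing is out of place.
print(misplaced_docs(docs, {0, 120, 640}))

# After the config change the doc is stale: a search of namespace 640 now
# targets the other index and no longer finds it.
print(misplaced_docs(docs, {0, 120}))
```

The point of the sketch is that a config change alone moves no documents: each stale entry stays where it was until something re-indexes it.
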
dcausse moved T368010: Search not working for entity schemas from In Progress to Needs review on the Discovery-Search (Current work) board.
Wed, Jun 26, 10:33 AM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), CirrusSearch, Wikidata, EntitySchema
dcausse claimed T368010: Search not working for entity schemas.
Wed, Jun 26, 10:03 AM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), CirrusSearch, Wikidata, EntitySchema
dcausse added a comment to T368010: Search not working for entity schemas.

The above reindex did not work as I expected. The attached patch should remedy this by allowing non-indexed pages to be re-indexed properly when manually re-indexing a whole namespace.
The root cause as to why these schemas were not indexed in the first place is yet to be investigated.

Wed, Jun 26, 9:55 AM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), CirrusSearch, Wikidata, EntitySchema

Tue, Jun 25

dcausse added a comment to T368010: Search not working for entity schemas.

There are currently 354 pages indexed in the entity schema namespace, while the all pages API suggests that there are 397 schemas.

Tue, Jun 25, 2:25 PM · MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), CirrusSearch, Wikidata, EntitySchema
dcausse moved T331127: phantom redirects lingering in incategory searches after page moves from In Progress to Needs review on the Discovery-Search (Current work) board.
Tue, Jun 25, 2:12 PM · MW-1.40-notes (1.40.0-wmf.25; 2023-02-27), Discovery-Search (Current work), CirrusSearch

Mon, Jun 24

dcausse moved T366346: Mute helmfile apply notifications from cirrus-streaming-updater deploys from In Progress to Needs Reporting on the Discovery-Search (Current work) board.

Should be done in https://gitlab.wikimedia.org/repos/search-platform/cirrus-reindex-orchestrator/-/commit/9175d48ab9ff47f7e53d150156f9d71366563849

Mon, Jun 24, 3:19 PM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Discovery-Search (Current work), CirrusSearch
dcausse claimed T331127: phantom redirects lingering in incategory searches after page moves.
Mon, Jun 24, 7:34 AM · MW-1.40-notes (1.40.0-wmf.25; 2023-02-27), Discovery-Search (Current work), CirrusSearch
dcausse moved T350597: Audit and prioritize metrics for conversion to statslib that are used for graphite-based alerting from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.
Mon, Jun 24, 7:33 AM · SRE Observability (FY2024/2025-Q1), Discovery-Search (Current work), Data-Platform-SRE, MW-1.42-notes (1.42.0-wmf.20; 2024-02-27), User-fgiunchedi, Observability-Metrics
dcausse updated the task description for T241128: EPIC: Reduce the time needed to do the initial WDQS import.
Mon, Jun 24, 7:32 AM · Epic, Wikidata-Query-Service, Wikidata

Fri, Jun 21

dcausse updated the task description for T241128: EPIC: Reduce the time needed to do the initial WDQS import.
Fri, Jun 21, 9:40 AM · Epic, Wikidata-Query-Service, Wikidata

Thu, Jun 20

dcausse added a comment to T361950: Ensure that WDQS query throttling does not interfere with federation.

After discussing this with Erik, we have a rough plan:

  • add a new lvs endpoint dedicated to internal federation and targeting a new port opened by nginx
  • add a new port in the nginx config for which we add the X-Disable-Throttling + x-bigdata-read-only headers to the request forwarded to blazegraph
  • use the blazegraph service alias feature to map https://query-main.wikidata.org/sparql -> https://wdqs-main.discovery.wmnet:$NEW_PORT/sparql
  • adapt ProxiedHttpConnectionFactory to allow the bypass of *.wmnet hostnames
Thu, Jun 20, 7:47 AM · wmde-wikidata-tech, Patch-For-Review, Discovery-Search (Current work), Wikidata
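
The request flow implied by the plan in the comment above can be sketched roughly as follows. This is an illustration only, not the nginx, blazegraph or ProxiedHttpConnectionFactory implementation. The port number 8443 stands in for $NEW_PORT and is hypothetical, the value "1" for X-Disable-Throttling is an assumption, and all function names are invented for the example. The value "yes" for x-bigdata-read-only matches the header dump quoted elsewhere in this feed.

```python
from urllib.parse import urlsplit

NEW_PORT = 8443  # hypothetical stand-in for $NEW_PORT

# Service alias: public federation endpoint -> internal endpoint.
SERVICE_ALIASES = {
    "https://query-main.wikidata.org/sparql":
        f"https://wdqs-main.discovery.wmnet:{NEW_PORT}/sparql",
}

# Headers added on the dedicated internal port before forwarding to blazegraph.
INTERNAL_HEADERS = {
    "X-Disable-Throttling": "1",   # assumed value
    "x-bigdata-read-only": "yes",
}


def resolve_federation_endpoint(url: str) -> str:
    """Map a federated SERVICE url to its internal alias, if one exists."""
    return SERVICE_ALIASES.get(url, url)


def bypasses_proxy(url: str) -> bool:
    """Internal *.wmnet hosts are reached directly instead of via the proxy."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".wmnet")


def forwarded_headers(incoming: dict) -> dict:
    """Headers as they would reach blazegraph from the internal port."""
    merged = dict(incoming)
    merged.update(INTERNAL_HEADERS)
    return merged


target = resolve_federation_endpoint("https://query-main.wikidata.org/sparql")
print(target, bypasses_proxy(target))
print(forwarded_headers({"accept": "*/*"}))
```

Keeping the alias map and the bypass rule separate mirrors the plan: the alias decides where the federated request goes, and the hostname check decides whether it is allowed to skip the outbound proxy.
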
dcausse added a comment to T355298: Investigate the impact of the WDQS graph split on constraints checks.

TypeChecker & ValueTypeChecker are using Sparql to inspect the class hierarchy which may or may not be affected by the split.

Yes. Notably, the initial lookup of the class to check (the subject’s “instance of” and/or “subclass of” statements) always happens in PHP, not in SPARQL. My assumption would be that the class hierarchy is always fully included in the main graph, and only individual instances are potentially in the scholarly graph; in that case, we could run all the “is subclass of” queries against the main graph. Is that correct?

Yes this is my understanding as well, the undesirable effects I could see are:

  • someone tagging an entity with a P31 that points to a scholarly article
  • introducing a scholarly article in the chain of subclass of, thus making the sparql property path ineffective

I'm not knowledgeable enough but I suspect these problems should be quite rare and perhaps already identified via other means?

Thu, Jun 20, 7:08 AM · Wikidata Dev Team (Wikidata.org Slice), Discovery-Search (Current work), Wikibase-Quality-Constraints, Wikidata

Jun 18 2024

dcausse updated the task description for T241128: EPIC: Reduce the time needed to do the initial WDQS import.
Jun 18 2024, 9:24 AM · Epic, Wikidata-Query-Service, Wikidata
dcausse updated the task description for T241128: EPIC: Reduce the time needed to do the initial WDQS import.
Jun 18 2024, 9:24 AM · Epic, Wikidata-Query-Service, Wikidata
dcausse updated the task description for T367510: Request permission to create 4 kafka topics in kafka-main (WDQS graph split).
Jun 18 2024, 8:48 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Discovery-Search (Current work), wmde-wikidata-tech, serviceops, Wikidata
dcausse added a comment to T367510: Request permission to create 4 kafka topics in kafka-main (WDQS graph split).

I do have one question though. Why 4 weeks retention? Is there some business reason or could it be dropped to a smaller duration?

We need 4 weeks to be able to backfill after an import: this covers the time from when the wikidata dump process starts, through the time required to shuffle the data around (compression, hdfs-rsync to hdfs), until the end of the import into blazegraph. See the initial lag column in T241128 for past import times. Perhaps 3 weeks would be manageable, but we went with 4 weeks to have extra room.

Jun 18 2024, 8:44 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Discovery-Search (Current work), wmde-wikidata-tech, serviceops, Wikidata
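
For reference, Kafka expresses topic retention in milliseconds through the retention.ms topic setting, so the 4-week and 3-week options discussed in the comment above translate as in the short sketch below. The helper names are illustrative, and the 20-day import lag in the example is a made-up figure, not a value taken from T241128.

```python
from datetime import timedelta


def retention_ms(weeks: int) -> int:
    """Convert a retention period in weeks to a retention.ms value."""
    return int(timedelta(weeks=weeks).total_seconds() * 1000)


def can_backfill(import_lag: timedelta, weeks: int) -> bool:
    """True if events from the start of the dump are still retained
    when an import with the given total lag finishes."""
    return import_lag <= timedelta(weeks=weeks)


print(retention_ms(4))  # 2419200000
print(retention_ms(3))  # 1814400000

# Hypothetical import taking 20 days from dump start to end of import.
print(can_backfill(timedelta(days=20), 3), can_backfill(timedelta(days=20), 4))
```

With a lag close to three weeks, the 3-week setting leaves almost no margin, which is the extra room the 4-week choice is meant to provide.
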

Jun 14 2024

dcausse added a comment to T214378: Check simple format constraints (no grouping) in PHP instead of SPARQL.

Unsure if feasible, but perhaps manually flagging a list of safe and very popular regexes could help reduce the number of requests to shellbox?

Jun 14 2024, 2:11 PM · [DEPRECATED] wdwb-tech, Security-Team, Wikidata-Campsite, Wikibase-Quality-Constraints, Wikidata
dcausse renamed T367510: Request permission to create 4 kafka topics in kafka-main (WDQS graph split) from Request permission to create 4 kafka topics in kafka-main to Request permission to create 4 kafka topics in kafka-main (WDQS graph split).
Jun 14 2024, 1:20 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Discovery-Search (Current work), wmde-wikidata-tech, serviceops, Wikidata
dcausse created T367510: Request permission to create 4 kafka topics in kafka-main (WDQS graph split).
Jun 14 2024, 1:18 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Discovery-Search (Current work), wmde-wikidata-tech, serviceops, Wikidata
dcausse added a comment to T361950: Ensure that WDQS query throttling does not interfere with federation.

I did some testing and sadly when a wdqs node makes a query to https://query.wikidata.org it hits varnish again:
from wdqs1020 to https://query.wikidata.org (echo 'SELECT ?test_dcausse { ?test_dcausse ?p ?o . } LIMIT 1' | curl -f -s --data-urlencode query@- https://query.wikidata.org/sparql?format=json)

"x-request-id": "b34bb930-ef85-4b23-956e-7dcb11f0f7ec",
"content-length": "99",
"x-forwarded-proto": "http",
"x-client-port": "40256",
"x-bigdata-max-query-millis": "60000",
"x-wmf-nocookies": "1",
"x-client-ip": "2620:0:861:10a:10:64:131:24",
"x-varnish": "800949377",
"x-forwarded-for": "2620:0:861:10a:10:64:131:24\\, 10.64.0.79\\, 2620:0:861:10a:10:64:131:24",
"x-requestctl": "",
"x-cdis": "pass",
"accept": "*/*",
"x-real-ip": "2620:0:861:10a:10:64:131:24",
"via-nginx": "1",
"x-bigdata-read-only": "yes",
"host": "query.wikidata.org",
"content-type": "application/x-www-form-urlencoded",
"connection": "close",
"x-envoy-expected-rq-timeout-ms": "65000",
"x-connection-properties": "H2=1; SSR=0; SSL=TLSv1.3; C=TLS_AES_256_GCM_SHA384; EC=UNKNOWN;",
"user-agent": "curl/7.74.0"
Jun 14 2024, 12:55 PM · wmde-wikidata-tech, Patch-For-Review, Discovery-Search (Current work), Wikidata

Jun 13 2024

dcausse added a comment to P64016 Testing wdqs.data-reload with HDFS.

@RKemper I think we should now do a full import to measure the time it takes in order to have a rough estimation to answer T367409
To have a full run we need to re-enable the updater on wdqs2023 (which I think will be done with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042965)
The command to run should be (using the latest dumps):

cookbook sre.wdqs.data-reload \
 --task-id T349069 \
 --reason "Test wdqs reload based on HDFS" \
 --reload-data wikidata_full \
 --from-hdfs hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603/ \
 --stat-host stat1009.eqiad.wmnet \
 wdqs2023.codfw.wmnet
Jun 13 2024, 1:32 PM

Jun 11 2024

dcausse updated the task description for T365692: PHP Notice: Undefined index: lexeme_language / lexical_category.
Jun 11 2024, 8:36 AM · MW-1.43-notes (1.43.0-wmf.8; 2024-06-04), Discovery-Search (Current work), wmde-wikidata-tech, Wikidata, Wikidata Lexicographical data, Wikimedia-production-error
dcausse moved T365692: PHP Notice: Undefined index: lexeme_language / lexical_category from In Progress to Needs Reporting on the Discovery-Search (Current work) board.

Triggered a reindex of all the lexemes using https://gitlab.wikimedia.org/repos/search-platform/cirrus-rerender, might take about 3 hours to complete.

Jun 11 2024, 8:36 AM · MW-1.43-notes (1.43.0-wmf.8; 2024-06-04), Discovery-Search (Current work), wmde-wikidata-tech, Wikidata, Wikidata Lexicographical data, Wikimedia-production-error

Jun 10 2024

dcausse added a comment to T366904: Improve mysql search for files.

@dcausse @Gehel As far as I can see, updateTitle is not implemented by CirrusSearch right, and thus a noop per the parent SearchEngine class ? If so, then i can safely modify this.

Jun 10 2024, 9:14 PM · Patch-For-Review, User-TheDJ, Discovery-Search, MediaWiki-Search
dcausse awarded T358373: [Dumps 2] Reconcillation mechanism to detect and fetch missing/mismatched revisions a Love token.
Jun 10 2024, 6:22 PM · Patch-For-Review, Dumps 2.0 (Kanban Board)

Jun 6 2024

dcausse added a comment to P64016 Testing wdqs.data-reload with HDFS.

@RKemper for testing I created a smaller folder at hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ with only two chunks, so I hope it helps iterate a bit faster on this. The command should become:

cookbook sre.wdqs.data-reload \
 --task-id T349069 \
 --reason "Test wdqs reload based on HDFS" \
 --reload-data wikidata_full \
 --from-hdfs hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ \
 --stat-host stat1009.eqiad.wmnet \
 wdqs2023.codfw.wmnet
Jun 6 2024, 12:20 PM

Jun 4 2024

dcausse edited P64016 Testing wdqs.data-reload with HDFS.
Jun 4 2024, 3:31 PM
dcausse created P64016 Testing wdqs.data-reload with HDFS.
Jun 4 2024, 3:24 PM

Jun 3 2024

dcausse placed T331127: phantom redirects lingering in incategory searches after page moves up for grabs.
Jun 3 2024, 4:18 PM · MW-1.40-notes (1.40.0-wmf.25; 2023-02-27), Discovery-Search (Current work), CirrusSearch
dcausse added a comment to T362518: Deprecate buster-backports.

@dcausse docker-registry.wikimedia.org/wikimedia/wikidata-query-flink-rdf-streaming-updater seems to be deprecated in favor of docker-registry.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater, can you confirm?

Yes (all the images under docker-registry.wikimedia.org/wikimedia/wikidata-query-flink-rdf-streaming-updater should no longer be used and can be safely removed if needed)

Jun 3 2024, 3:56 PM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
dcausse moved T331127: phantom redirects lingering in incategory searches after page moves from Needs Reporting to Incoming on the Discovery-Search (Current work) board.

Sorry to see this happening again; we probably missed some edge cases when deploying T317045.

Jun 3 2024, 8:08 AM · MW-1.40-notes (1.40.0-wmf.25; 2023-02-27), Discovery-Search (Current work), CirrusSearch

May 31 2024

dcausse added a comment to T366253: Create a generic stream to populate CirrusSearch weighted_tags.

From a SUP perspective this would replace all sources of weighted tags (config option: stream name):

  • article-topic-stream: mediawiki.page_outlink_topic_prediction_change.v1
  • draft-topic-stream: mediawiki.revision_score_drafttopic
  • recommendation-create-stream: mediawiki.revision-recommendation-create
May 31 2024, 7:38 AM · Patch-For-Review, Discovery-Search (Current work), CirrusSearch

May 30 2024

dcausse added a comment to T364856: Outreach to producers of "other dumps" to raise awareness about Dumps 2.0 and options for deprecation or migration.

Hi, we might have a use case related to "other dumps" that could benefit from the Dumps 2.0 infrastructure. I filed T366248 with some details about it.

May 30 2024, 9:32 AM · Data Products, Dumps 2.0, Dumps-Generation, Epic
dcausse created T366253: Create a generic stream to populate CirrusSearch weighted_tags.
May 30 2024, 9:28 AM · Patch-For-Review, Discovery-Search (Current work), CirrusSearch
dcausse created T366248: Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script.
May 30 2024, 9:07 AM · CirrusSearch, Dumps 2.0, Discovery-Search

May 29 2024

dcausse added a comment to T365692: PHP Notice: Undefined index: lexeme_language / lexical_category.

The system should now index lexemes properly.
We still have to reindex all the lexemes to fix the ones created/edited before the fix was applied.

May 29 2024, 10:20 AM · MW-1.43-notes (1.43.0-wmf.8; 2024-06-04), Discovery-Search (Current work), wmde-wikidata-tech, Wikidata, Wikidata Lexicographical data, Wikimedia-production-error
dcausse updated the task description for T365692: PHP Notice: Undefined index: lexeme_language / lexical_category.
May 29 2024, 10:18 AM · MW-1.43-notes (1.43.0-wmf.8; 2024-06-04), Discovery-Search (Current work), wmde-wikidata-tech, Wikidata, Wikidata Lexicographical data, Wikimedia-production-error
dcausse added a comment to T366043: Some dumps are not available since mid may 2024.

@BTullis thanks! Categories are reloaded via a cronjob on all WDQS machines; the job will run in about 30 minutes.

May 29 2024, 7:36 AM · Data-Platform-SRE (2024.05.27 - 2024.06.16), Discovery-Search, Data-Engineering, Dumps-Generation
dcausse added a comment to T366043: Some dumps are not available since mid may 2024.

@BTullis thanks! Categories are reloaded via a cronjob on all WDQS machines; the job will run in about 30 minutes.

May 29 2024, 7:15 AM · Data-Platform-SRE (2024.05.27 - 2024.06.16), Discovery-Search, Data-Engineering, Dumps-Generation

May 28 2024

dcausse added a comment to P63465 extra fields in cirrus indices.

Output with:

cirrus = (spark.table("discovery.cirrus_index").where('cirrus_replica="codfw" AND snapshot="20240428"'))
May 28 2024, 5:15 PM
dcausse created P63465 extra fields in cirrus indices.
May 28 2024, 5:12 PM
dcausse committed rEWLC5e903c77c46b: Workaround missing lemma fields.
May 28 2024, 4:48 PM
dcausse added a comment to T365692: PHP Notice: Undefined index: lexeme_language / lexical_category.

The search fields specific to Lexemes are currently ignored causing this NOTICE but also preventing lexemes from being searchable (esp. the new ones).
The schemas should be adapted to support these fields and the lexemes will have to be re-indexed.

May 28 2024, 9:53 AM · MW-1.43-notes (1.43.0-wmf.8; 2024-06-04), Discovery-Search (Current work), wmde-wikidata-tech, Wikidata, Wikidata Lexicographical data, Wikimedia-production-error
dcausse merged task T365684: Particular lexeme (L1326823) not indexed so search with the Wikidata API returns nothing into T365692: PHP Notice: Undefined index: lexeme_language / lexical_category.
May 28 2024, 9:51 AM · Discovery-Search (Current work), Wikidata
dcausse merged T365684: Particular lexeme (L1326823) not indexed so search with the Wikidata API returns nothing into T365692: PHP Notice: Undefined index: lexeme_language / lexical_category.
May 28 2024, 9:50 AM · MW-1.43-notes (1.43.0-wmf.8; 2024-06-04), Discovery-Search (Current work), wmde-wikidata-tech, Wikidata, Wikidata Lexicographical data, Wikimedia-production-error
dcausse claimed T365692: PHP Notice: Undefined index: lexeme_language / lexical_category.
May 28 2024, 8:47 AM · MW-1.43-notes (1.43.0-wmf.8; 2024-06-04), Discovery-Search (Current work), wmde-wikidata-tech, Wikidata, Wikidata Lexicographical data, Wikimedia-production-error
dcausse added a comment to T361483: Selectively disable changeprop functionality that is no longer used.

@achou except for expert search users explicitly searching for topics (which I suspect is rare), the Growth team is the only team using this data in a user-facing product. It is hard to tell what the impact would be for them, but I suspect that losing only a few (<100) would hardly affect anything. If you suspect that more might be lost, perhaps having duplicates is better, if this is an option for you.

May 28 2024, 8:03 AM · Patch-For-Review, Machine-Learning-Team, Lift-Wing, ORES, RESTBase Sunsetting, Content-Transform-Team, serviceops, ChangeProp, API Platform (RESTBase Deprecation Roadmap)
dcausse created T366043: Some dumps are not available since mid may 2024.
May 28 2024, 7:44 AM · Data-Platform-SRE (2024.05.27 - 2024.06.16), Discovery-Search, Data-Engineering, Dumps-Generation

May 23 2024

dcausse moved T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers from In Progress to Needs review on the Discovery-Search (Current work) board.
May 23 2024, 2:52 PM · Discovery-Search (Current work), Data-Platform-SRE, Wikidata-Query-Service, Wikidata
dcausse moved T365190: Cannot provide empty array to wikis as $wgCirrusSearchWriteClusters from Needs review to Needs Reporting on the Discovery-Search (Current work) board.
May 23 2024, 2:51 PM · MW-1.43-notes (1.43.0-wmf.6; 2024-05-21), Discovery-Search (Current work), CirrusSearch
dcausse moved T364837: Q125918173 missing from elastic@codfw from Needs review to Needs Reporting on the Discovery-Search (Current work) board.
May 23 2024, 2:51 PM · Discovery-Search (Current work), CirrusSearch
dcausse moved T362060: Generalize ScholarlyArticleSplitter from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board.
May 23 2024, 2:51 PM · Discovery-Search (Current work), Wikidata

May 16 2024

dcausse moved T364837: Q125918173 missing from elastic@codfw from In Progress to Needs review on the Discovery-Search (Current work) board.
May 16 2024, 6:40 PM · Discovery-Search (Current work), CirrusSearch