
EBernhardson (EBernhardson)
User

User Details

User Since
Oct 7 2014, 4:49 PM (409 w, 6 d)
Availability
Available
LDAP User
EBernhardson
MediaWiki User
EBernhardson (WMF) [ Global Accounts ]

Recent Activity

Today

EBernhardson added a comment to T304954: Import data from hdfs to commonswiki_file.

I'm not sure why airflow decided to not create the dag runs as expected.

Mon, Aug 15, 6:57 PM · Structured-Data-Backlog (Current Work), Discovery-Search (Current work), Image-Suggestions
EBernhardson committed rWDAN230a82046462: HivePartitionRangeSensor: report missing partitions when using mode=reschedule (authored by EBernhardson).
HivePartitionRangeSensor: report missing partitions when using mode=reschedule
Mon, Aug 15, 6:36 PM
EBernhardson committed rWDANdc9abc0f23bc: Increase subgraph query SLAs to 2 days (authored by EBernhardson).
Increase subgraph query SLAs to 2 days
Mon, Aug 15, 5:22 PM
EBernhardson added a comment to T304954: Import data from hdfs to commonswiki_file.

I'm not sure why airflow decided to not create the dag runs as expected. To get things moving again I deleted the dag state from the instance (delete button in UI) and let airflow re-initialize. I manually marked the 2022-07-25 run as complete and it proceeded to start the 2022-08-01 instance and run it through to completion. The 2022-08-08 instance has also started but is currently waiting for data to arrive in analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2022-08-08.

Mon, Aug 15, 4:51 PM · Structured-Data-Backlog (Current Work), Discovery-Search (Current work), Image-Suggestions
EBernhardson moved T314426: Job queue for writes to cloudelastic falling behind from Needs review to Needs Reporting on the Discovery-Search (Current work) board.

Can see in the JobQueue Job grafana dashboard that concurrency jumped from ~16 to ~25 around the same time as the above patches were merged (~Aug 9 at 23:00 UTC). Backlog appears to be staying low. The Saneitizer fix rate on cloudelastic is still high, but we suspect that is related to jobs that were dropped when the backlog grew beyond retention. Expecting the Saneitizer fix rate to return to the same level as eqiad/codfw within ~2 weeks of the deployment (around Aug 23 or so). With respect to this ticket, that means Saneitizer is pushing additional jobs beyond what we normally see from page edits, and the job queue is keeping up with that additional load.

Mon, Aug 15, 4:21 PM · MW-1.39-notes (1.39.0-wmf.22; 2022-07-25), Discovery-Search (Current work), Patch-For-Review

Fri, Aug 5

EBernhardson added a comment to T306899: WCQS 500 errors.

I've tracked down one source of 500 errors; unclear if the original report here is for the same thing.

Fri, Aug 5, 6:10 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Thu, Aug 4

EBernhardson added a comment to T314473: Ingest new image suggestions index diffs.

I suppose a related thought: are the regenerated diffs diffing against the correct thing? For the diffs to be correct they need to be diffed against the expected state of the production indices. When you regenerate the 7-11 dataset, is it building against the previous 7-11 dataset that was shipped, or against the previous dataset which is no longer the expected state?

Thu, Aug 4, 4:58 PM · Structured-Data-Backlog (Current Work), Discovery-Search (Current work), Image-Suggestions
EBernhardson added a comment to T314426: Job queue for writes to cloudelastic falling behind.

While the patch for deployment-charts was merged, when SRE went to deploy the patch the systems reported no change to the deployment. Unclear what the necessary next step is to have the cpjobqueue configuration updated.

Thu, Aug 4, 4:50 PM · MW-1.39-notes (1.39.0-wmf.22; 2022-07-25), Discovery-Search (Current work), Patch-For-Review
EBernhardson added a comment to T314473: Ingest new image suggestions index diffs.

@EBernhardson , thanks for the update. Are you talking about the re-generated 2022-07-11 snapshot? This is the one that should be ingested: it contains the bugfix for T314120 and it's available on Hive since August 2 at 5 pm UTC.

Thu, Aug 4, 4:45 PM · Structured-Data-Backlog (Current Work), Discovery-Search (Current work), Image-Suggestions

Wed, Aug 3

EBernhardson added a comment to T314473: Ingest new image suggestions index diffs.

The 2022-07-11 snapshot had already been pushed into the kafka queues to be ingested, but the daemons were paused while they were running. I've run the DAG to push 2022-07-18 into the queues today.

Wed, Aug 3, 9:57 PM · Structured-Data-Backlog (Current Work), Discovery-Search (Current work), Image-Suggestions
EBernhardson added a comment to T314078: Fix slow super_detect_noop code and monitor for future Elastic hangs.

search-loader instances are both re-enabled now; eqiad has been running since yesterday. They are still processing through the backlog of updates that were generated while they were paused, including the weekly update. Eqiad should hopefully finish its backlog by tomorrow; codfw might take two days.

Wed, Aug 3, 9:53 PM · Patch-For-Review, Discovery-Search (Current work)
EBernhardson added a comment to T306899: WCQS 500 errors.

Leaving the commons-query.wikimedia.org browser tab open and re-running queries every 30-60 minutes or so reproduced a 500 after a few hours. Related JS console errors. (Timestamps are PDT. It is unclear whether the errors at 13:00 and 13:10 are directly related, but I'm including them since they were there.)

Wed, Aug 3, 8:37 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
EBernhardson added a comment to T306899: WCQS 500 errors.

Experienced the same error today again, here is an exact timestamp (of the response): Wed, 03 Aug 2022 17:15:19 GMT.

Wed, Aug 3, 6:05 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
EBernhardson added a comment to T314426: Job queue for writes to cloudelastic falling behind.

This will require increasing the partition counts in kafka for the appropriate topics. Today they should have 3 partitions; we want to change them to have 6 partitions. The topics live in the main-eqiad and main-codfw kafka clusters.

Wed, Aug 3, 5:46 PM · MW-1.39-notes (1.39.0-wmf.22; 2022-07-25), Discovery-Search (Current work), Patch-For-Review

Tue, Aug 2

EBernhardson moved T314426: Job queue for writes to cloudelastic falling behind from Incoming to Needs review on the Discovery-Search (Current work) board.
Tue, Aug 2, 9:44 PM · MW-1.39-notes (1.39.0-wmf.22; 2022-07-25), Discovery-Search (Current work), Patch-For-Review
EBernhardson added a project to T314426: Job queue for writes to cloudelastic falling behind: Discovery-Search (Current work).
Tue, Aug 2, 9:44 PM · MW-1.39-notes (1.39.0-wmf.22; 2022-07-25), Discovery-Search (Current work), Patch-For-Review
EBernhardson claimed T314426: Job queue for writes to cloudelastic falling behind.
Tue, Aug 2, 6:38 PM · MW-1.39-notes (1.39.0-wmf.22; 2022-07-25), Discovery-Search (Current work), Patch-For-Review
EBernhardson added a comment to T314426: Job queue for writes to cloudelastic falling behind.

The general idea is to add a new parameter to the ElasticaWrite job, jobqueue_partition, and have that include both the cluster name and an integer partition number derived through random number % num_partitions. cpjobqueue should then be configured to partition by jobqueue_partition rather than the existing cluster value.
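
A minimal Python sketch of the partition-key idea described above, for illustration only. The real ElasticaWrite job lives in MediaWiki, so the function name, the "cluster-N" key format, and the use of getrandbits are all assumptions rather than what was actually merged.

```python
import random
from typing import Optional


def jobqueue_partition(cluster: str, num_partitions: int,
                       rng: Optional[random.Random] = None) -> str:
    """Return a partition key combining the cluster name and a partition number.

    The "cluster-N" format is a hypothetical choice for this sketch.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    rng = rng if rng is not None else random.Random()
    # "random number % num_partitions", as described in the comment
    partition = rng.getrandbits(32) % num_partitions
    return f"{cluster}-{partition}"


if __name__ == "__main__":
    # With 6 partitions, writes for one cluster spread over 6 distinct keys
    print(jobqueue_partition("cloudelastic", 6))
```

Because the number is random rather than derived from the page, writes for a single cluster are spread roughly evenly across the partitions, which is what allows the consumer to process them with more concurrency.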

Tue, Aug 2, 6:36 PM · MW-1.39-notes (1.39.0-wmf.22; 2022-07-25), Discovery-Search (Current work), Patch-For-Review
EBernhardson created T314426: Job queue for writes to cloudelastic falling behind.
Tue, Aug 2, 6:33 PM · MW-1.39-notes (1.39.0-wmf.22; 2022-07-25), Discovery-Search (Current work), Patch-For-Review

Mon, Aug 1

EBernhardson moved T314336: Repair cirrus integration test runner from Incoming to In Progress on the Discovery-Search (Current work) board.
Mon, Aug 1, 8:22 PM · MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Discovery-Search (Current work)
EBernhardson added a comment to T314336: Repair cirrus integration test runner.

Monitoring the test run through the chrome inspector (port forward 9222 into the cloud instance, and then again into the vagrant instance, then visit chrome://inspect in a chromium based browser) shows that the strings we are telling the browser to write into the <input> box aren't always making it there. Most of the time they make it, but sometimes a letter is lost.

Mon, Aug 1, 8:22 PM · MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Discovery-Search (Current work)
EBernhardson created T314336: Repair cirrus integration test runner.
Mon, Aug 1, 8:15 PM · MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Discovery-Search (Current work)
EBernhardson closed T313941: Evaluate security features of Opensearch as Declined.

Declining for now; opensearch is still a ways out and this will be evaluated once we are in a position to consider the opensearch migration.

Mon, Aug 1, 3:39 PM · Discovery-Search
EBernhardson closed T313752: CirrusSearchIndexTooOld as Resolved.
Mon, Aug 1, 3:10 PM · Discovery-Search (Current work)
EBernhardson closed T217742: Rework the data flow between logstash and cirrus elasticsearch cluster for ApiFeatureUsage, a subtask of T297239: Move logstash api-feature-usage output away from v5 cluster, as Resolved.
Mon, Aug 1, 2:36 PM · SRE Observability (FY2022/2023-Q1), Sustainability (Incident Followup), Patch-For-Review, Observability-Logging, SRE
EBernhardson closed T217742: Rework the data flow between logstash and cirrus elasticsearch cluster for ApiFeatureUsage as Resolved.

I think this was accomplished in T297239 by moving apifeatureusage logstash to a host separate from the main logging pipeline. Do you agree?

Mon, Aug 1, 2:36 PM · Discovery-Search

Thu, Jul 28

EBernhardson created P32103 potentially oversized image recommendation weighted_tags updates.
Thu, Jul 28, 10:06 PM
EBernhardson added a comment to T314078: Fix slow super_detect_noop code and monitor for future Elastic hangs.

Order of tracking down the deadlock was:

  1. cirrus error rate increased
  2. One node had its write queue building in https://search.svc.eqiad.wmnet:9243/_cat/thread_pool/write?v
  3. That node's index_total value in https://search.svc.eqiad.wmnet:9243/_nodes/elastic1071-production-search-eqiad/stats/indices/indexing?pretty wasn't increasing
  4. pulled jstack for write threads with: sudo nsenter -t 68132 -m sudo -u elasticsearch jstack 68132 | grep -A 7 '\[write]'
  5. all threads reported by jstack showed:
"elasticsearch[elastic1072-production-search-eqiad][write][T#1]" #420 daemon prio=5 os_prio=0 tid=0x00007fa7c401a800 nid=0x110f8 runnable [0x00007f99dd459000]
   java.lang.Thread.State: RUNNABLE
        at java.util.ArrayList.indexOf(ArrayList.java:323)
        at java.util.ArrayList.contains(ArrayList.java:306)
        at java.util.AbstractCollection.containsAll(AbstractCollection.java:318)
        at org.wikimedia.search.extra.superdetectnoop.MultiListHandler$MultiList.equalsIgnoreOrder(MultiListHandler.java:98)
        at org.wikimedia.search.extra.superdetectnoop.MultiListHandler$MultiList.replaceFrom(MultiListHandler.java:76)
        at org.wikimedia.search.extra.superdetectnoop.MultiListHandler.handle(MultiListHandler.java:31)
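
The stack above bottoms out in ArrayList.indexOf called via containsAll, which is a linear scan per element and therefore quadratic for two large lists. The Python sketch below illustrates that cost pattern and a hash-based alternative. It is not the plugin's code or the actual fix: the function names are hypothetical, and it assumes the comparison treats both lists as sets (duplicates ignored).

```python
from typing import Hashable, Sequence


def equals_ignore_order_scan(a: Sequence[Hashable], b: Sequence[Hashable]) -> bool:
    """Quadratic: each `in` on a list is a linear scan, like ArrayList.contains."""
    return all(x in b for x in a) and all(x in a for x in b)


def equals_ignore_order_hashed(a: Sequence[Hashable], b: Sequence[Hashable]) -> bool:
    """Linear on average: build each hash set once, then compare the sets."""
    return set(a) == set(b)


if __name__ == "__main__":
    old = [f"tag/{i}" for i in range(2000)]
    new = list(reversed(old))
    # Both agree on the answer; only the amount of work differs
    print(equals_ignore_order_scan(old, new), equals_ignore_order_hashed(old, new))
```

With lists of n and m elements the scan version does on the order of n*m comparisons, so a handful of very large updates can keep every write thread RUNNABLE for a long time, which matches the symptom of a write queue building while index_total stops increasing.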
Thu, Jul 28, 6:31 PM · Patch-For-Review, Discovery-Search (Current work)
EBernhardson created P32060 (An Untitled Masterwork).
Thu, Jul 28, 2:43 PM
EBernhardson created P32059 (An Untitled Masterwork).
Thu, Jul 28, 2:43 PM
EBernhardson created P32058 (An Untitled Masterwork).
Thu, Jul 28, 2:37 PM

Wed, Jul 27

EBernhardson moved T307391: Enable CORS support for WCQS SPARQL endpoint access from Ready for Development to Needs Reporting on the Discovery-Search (Current work) board.
Wed, Jul 27, 9:45 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
EBernhardson committed rWDAN1a721954a38e: image_suggestion_manual should read from _delta not _full (authored by EBernhardson).
image_suggestion_manual should read from _delta not _full
Wed, Jul 27, 5:55 PM

Tue, Jul 26

EBernhardson added a comment to T307391: Enable CORS support for WCQS SPARQL endpoint access.

https://commons-query.wikimedia.org/sparql returns CORS headers in the same way that https://query.wikidata.org/sparql does.

Tue, Jul 26, 9:34 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
EBernhardson moved T313434: Migrate apifeatureusage to elasticsearch 7 from Incoming to Waiting on the Discovery-Search (Current work) board.

First two patches are merged. The two remaining patches (temporarily remove index template, then drop mapping type from template) will go out as part of the es7 deployment process. I've added them in appropriate places to the es7 rollout plan (T308676)

Tue, Jul 26, 5:22 PM · Patch-For-Review, Discovery-Search (Current work)

Mon, Jul 25

EBernhardson added a comment to T304954: Import data from hdfs to commonswiki_file.

Regarding scheduling of the job once it lands in airflow: our weekly job always runs with a Sunday execution date. If the image-suggestions job could also be run on Sunday, that would greatly simplify things (otherwise various date math will be involved and it will delay import until the following Sunday). We don't strictly have to be sending the imports on Sunday, but it takes some time to process through the elastic servers and Sunday is a generally less-busy time.
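
To illustrate the kind of date math mentioned above, here is a small Python sketch that maps a snapshot date to the first Sunday on or after it. The helper name is hypothetical, and this says nothing about how airflow itself assigns execution dates; it only shows the delay a non-Sunday snapshot would incur.

```python
from datetime import date, timedelta


def next_sunday_on_or_after(day: date) -> date:
    """Return `day` itself if it is a Sunday, else the following Sunday."""
    # date.weekday(): Monday is 0, Sunday is 6
    return day + timedelta(days=(6 - day.weekday()) % 7)


if __name__ == "__main__":
    # 2022-08-08 is a Monday, so an import tied to Sundays waits until 2022-08-14
    print(next_sunday_on_or_after(date(2022, 8, 8)))
```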

Mon, Jul 25, 11:18 PM · Structured-Data-Backlog (Current Work), Discovery-Search (Current work), Image-Suggestions
EBernhardson moved T303831: Productionize Wikidata subgraph analysis from In Progress to Blocked (from outside the team) on the Discovery-Search (Current work) board.
Mon, Jul 25, 10:30 PM · Patch-For-Review, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)
EBernhardson moved T218994: Deprecation warning on elasticsearch 6 from In Progress to Needs review on the Discovery-Search (Current work) board.
Mon, Jul 25, 10:30 PM · MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Patch-For-Review, Discovery-Search (Current work), CirrusSearch
EBernhardson moved T313752: CirrusSearchIndexTooOld from Incoming to In Progress on the Discovery-Search (Current work) board.

Reindexing is currently in progress

Mon, Jul 25, 10:29 PM · Discovery-Search (Current work)
EBernhardson added a comment to T218994: Deprecation warning on elasticsearch 6 .

Looks like wikidata is still triggering the should clause warning (rarely):

Mon, Jul 25, 6:37 PM · MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Patch-For-Review, Discovery-Search (Current work), CirrusSearch
EBernhardson added a comment to T218994: Deprecation warning on elasticsearch 6 .

We had a few custom log levels set in the elasticsearch cluster state, specifically logger.org.elasticsearch.deprecation.common.ParseField and logger.org.elasticsearch.deprecation.index.query.functionscore.ScoreFunctionBuilder were muted. I've unmuted them and nothing seems to be complaining (and they used to be quite verbose). There is reindexing currently going on so if the reindexing process triggers any deprecations those should come up today. If nothing comes up this can be considered complete.

Mon, Jul 25, 5:28 PM · MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Patch-For-Review, Discovery-Search (Current work), CirrusSearch
EBernhardson updated the task description for T308676: Elasticsearch 7.10.2 rollout plan.
Mon, Jul 25, 4:48 PM · Discovery-Search (Current work), CirrusSearch
EBernhardson updated the task description for T308676: Elasticsearch 7.10.2 rollout plan.
Mon, Jul 25, 4:48 PM · Discovery-Search (Current work), CirrusSearch
EBernhardson closed T123268: Reduce amount of queries per search request as Declined.

The queries issued are not particularly expensive. This does have some outsized cost during cluster maintenance, where we direct traffic from eqiad to codfw (paying the round-trip latency cost), but that only happens a few times a year and doesn't seem worth the effort required to re-architect how mediawiki talks to search in these instances.

Mon, Jul 25, 3:44 PM · Performance Issue, MediaWiki-Search, Discovery-Search, Discovery-ARCHIVED
EBernhardson removed a project from T303831: Productionize Wikidata subgraph analysis: Patch-For-Review.

Double checked all linked patches, no patches remain for review.

Mon, Jul 25, 3:23 PM · Patch-For-Review, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)
EBernhardson moved T304954: Import data from hdfs to commonswiki_file from In Progress to Blocked (from outside the team) on the Discovery-Search (Current work) board.
Mon, Jul 25, 3:16 PM · Structured-Data-Backlog (Current Work), Discovery-Search (Current work), Image-Suggestions
EBernhardson added a comment to T311247: CirrusSearchChangeFailed: Error in one or more bulk request actions.

Not seeing repeat occurrences and the missing index has been restored.

Mon, Jul 25, 3:16 PM · Discovery-Search (Current work), CirrusSearch, Wikimedia-production-error
EBernhardson moved T311247: CirrusSearchChangeFailed: Error in one or more bulk request actions from Waiting to Needs Reporting on the Discovery-Search (Current work) board.
Mon, Jul 25, 3:15 PM · Discovery-Search (Current work), CirrusSearch, Wikimedia-production-error
EBernhardson moved T309648: Restore lost index in cloudelastic from Needs review to Needs Reporting on the Discovery-Search (Current work) board.
Mon, Jul 25, 3:14 PM · MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), Patch-For-Review, Discovery-Search (Current work)

Thu, Jul 21

EBernhardson added a comment to T301096: Add a link: prioritize suggestions of underlinked articles.

Accessing the array count in realtime without specific mapping will be expensive, if it is possible at all. Once indexed the array no longer exists and the only way to find out would be to decode the json blob that contains the entire document, or walk the positions lists and count the gaps. Not really plausible to do while scoring.

Thu, Jul 21, 8:57 PM · Patch-For-Review, Growth-Team (Current Sprint), Add-Link
EBernhardson committed rEAFUeded19d57270: quiet deprecation warning about date formatting (authored by EBernhardson).
quiet deprecation warning about date formatting
Thu, Jul 21, 9:56 AM

Wed, Jul 20

EBernhardson closed T313248: Undeploy ApiFeatureUsage extension from WMF production infrastructure as Declined.

The decision has been made to migrate ApiFeatureUsage to elastic 7 with the rest of the stack. That migration will be tracked in T313434

Wed, Jul 20, 6:51 PM · Discovery-Search (Current work), ApiFeatureUsage
EBernhardson created T313434: Migrate apifeatureusage to elasticsearch 7.
Wed, Jul 20, 6:41 PM · Patch-For-Review, Discovery-Search (Current work)

Tue, Jul 19

EBernhardson added a comment to T313248: Undeploy ApiFeatureUsage extension from WMF production infrastructure.

The undeployment process is being put on hold for at least one week.

Tue, Jul 19, 4:40 PM · Discovery-Search (Current work), ApiFeatureUsage
EBernhardson updated the task description for T135280: insource: queries can't search for the return character.
Tue, Jul 19, 4:35 PM · CirrusSearch, Discovery-Search, Discovery-ARCHIVED

Mon, Jul 18

EBernhardson added a comment to T218994: Deprecation warning on elasticsearch 6 .

Checked logstash with the query channel:CirrusSearchDeprecation AND NOT exception.trace:"ApiFeatureUsage" and verified nothing is coming through

Mon, Jul 18, 7:02 PM · MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Patch-For-Review, Discovery-Search (Current work), CirrusSearch
EBernhardson updated the task description for T313248: Undeploy ApiFeatureUsage extension from WMF production infrastructure.
Mon, Jul 18, 5:54 PM · Discovery-Search (Current work), ApiFeatureUsage
EBernhardson updated the task description for T313248: Undeploy ApiFeatureUsage extension from WMF production infrastructure.
Mon, Jul 18, 5:49 PM · Discovery-Search (Current work), ApiFeatureUsage
EBernhardson updated the task description for T313248: Undeploy ApiFeatureUsage extension from WMF production infrastructure.
Mon, Jul 18, 5:48 PM · Discovery-Search (Current work), ApiFeatureUsage
EBernhardson created T313248: Undeploy ApiFeatureUsage extension from WMF production infrastructure.
Mon, Jul 18, 5:36 PM · Discovery-Search (Current work), ApiFeatureUsage
EBernhardson removed a project from T255981: Persistant error 500 getting category members: ApiFeatureUsage.
Mon, Jul 18, 5:24 PM · Platform Team Workboards (Clinic Duty Team), Upstream, Commons, Pywikibot
EBernhardson removed a project from T301336: EntitySchemas API Question: ApiFeatureUsage.

Removing ApiFeatureUsage, that project is specifically about recording information about requests made to api.php in mediawiki

Mon, Jul 18, 5:23 PM · Wikidata, Shape Expressions
EBernhardson removed a project from T304070: API Endpoint to search for Schemas: ApiFeatureUsage.

Removing ApiFeatureUsage, that project is specifically about usage of api.php in mediawiki

Mon, Jul 18, 5:20 PM · Shape Expressions, Wikidata

Jul 15 2022

EBernhardson added a comment to T303831: Productionize Wikidata subgraph analysis.

There is actually one piece remaining: we typically use refinery-drop-older-than to prune our tables. That worked when we used date=... as the partitioning scheme, but it doesn't support snapshot=.... It takes minimal work (I already have a working POC) to make it interpret snapshot the same as date, but I suspect the partitioning changed the name to snapshot=... due to an intent to not only use dates for partitioning? If so, analytics does have a refinery-drop-mediawiki-snapshots script, but it's fairly specialized to their use case. I suspect we would need to make a work-alike script that uses the same refinery library methods but provides our own configuration to the script. Or the script could be modified to import its configuration from somewhere user-defined instead of having the configuration embedded in the script itself.
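
As a rough illustration of "interpret snapshot the same as date", the Python sketch below parses either partition key and decides whether a partition is older than a cutoff. It is not the POC mentioned above and does not reflect refinery-drop-older-than's real interface; the function names and the YYYY-MM-DD value format are assumptions.

```python
from datetime import date, datetime
from typing import Optional

# Partition keys treated as carrying a date (assumption for this sketch)
DATE_LIKE_KEYS = ("date", "snapshot")


def partition_date(spec: str) -> Optional[date]:
    """Parse 'date=2022-07-15' or 'snapshot=2022-07-15'; None if not date-like."""
    key, sep, value = spec.partition("=")
    if not sep or key not in DATE_LIKE_KEYS:
        return None
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        # e.g. a snapshot value that is not a full date is left alone
        return None


def should_drop(spec: str, cutoff: date) -> bool:
    """True only when the partition parses as a date strictly before the cutoff."""
    parsed = partition_date(spec)
    return parsed is not None and parsed < cutoff


if __name__ == "__main__":
    print(should_drop("snapshot=2022-07-04", date(2022, 8, 1)))
```

Returning None for values that do not parse as a full date keeps the sketch conservative: a snapshot partition that is not date-shaped is never dropped, which matters if snapshot=... is later used for something other than dates.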

Jul 15 2022, 5:13 PM · Patch-For-Review, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)

Jul 14 2022

EBernhardson added a comment to T312134: Request for SQL Templating to be enabled in Superset.

@BTullis thanks! Loading my old dashboard, it looks to be working the same as it did before. Looks complete to me.

Jul 14 2022, 3:01 PM · Product-Analytics, Data-Engineering
EBernhardson added a comment to T311939: Degraded RAID on elastic2049.

From the other ticket these are the messages that were coming in on dmesg before the reimage was attempted:

[Tue Jul  5 17:45:27 2022] ata2: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
[Tue Jul  5 17:45:27 2022] ata2: irq_stat 0x00000040, connection status changed
[Tue Jul  5 17:45:27 2022] ata2: SError: { CommWake DevExch }
[Tue Jul  5 17:45:27 2022] ata2: hard resetting link
[Tue Jul  5 17:45:28 2022] ata2: SATA link down (SStatus 0 SControl 300)
[Tue Jul  5 17:45:28 2022] ata2: EH complete
Jul 14 2022, 2:57 AM · Discovery-Search (Current work), Patch-For-Review, Elasticsearch, SRE, ops-codfw

Jul 13 2022

EBernhardson created T312960: Track elasticsearch bulk index queue rejections.
Jul 13 2022, 3:57 PM · Discovery-Search

Jul 12 2022

EBernhardson added a comment to T303831: Productionize Wikidata subgraph analysis.

All dags are now enabled and have completed at least one full execution of each dag.

Jul 12 2022, 11:27 PM · Patch-For-Review, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)
EBernhardson moved T309648: Restore lost index in cloudelastic from In Progress to Needs review on the Discovery-Search (Current work) board.

I've updated https://wikitech.wikimedia.org/wiki/Search/S3_Plugin_Enable with notes related to the final restore that managed to complete.

Jul 12 2022, 11:25 PM · MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), Patch-For-Review, Discovery-Search (Current work)
EBernhardson committed rWDAN45ae36dd036a: subgraph_and_query_metrics: Drop wiki from sparql event partition spec (authored by EBernhardson).
subgraph_and_query_metrics: Drop wiki from sparql event partition spec
Jul 12 2022, 10:12 PM
EBernhardson added a comment to T309648: Restore lost index in cloudelastic.

Managed a complete restore of commonswiki_file in cloudelastic. This ticket should stay open until the process is documented in https://wikitech.wikimedia.org/wiki/Search

Jul 12 2022, 3:29 PM · MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), Patch-For-Review, Discovery-Search (Current work)
EBernhardson committed rWDAN89cb17dfa816: subgraph_and_query_mapping: Increase memory to 12g, use repartition (authored by EBernhardson).
subgraph_and_query_mapping: Increase memory to 12g, use repartition
Jul 12 2022, 7:07 AM

Jul 11 2022

EBernhardson committed rWDAN3ba1d4c33dd2: subgraph_query_mapping_daily: Increase partitioning to 2048 (authored by EBernhardson).
subgraph_query_mapping_daily: Increase partitioning to 2048
Jul 11 2022, 9:38 PM
EBernhardson committed rWDANa559f8287345: Remove external queries from wait_for_data (authored by EBernhardson).
Remove external queries from wait_for_data
Jul 11 2022, 8:34 PM
EBernhardson committed rWDAN00b2b320ce3d: subgraph: Use HivePartitionRangeSensor to wait for sparql queries (authored by EBernhardson).
subgraph: Use HivePartitionRangeSensor to wait for sparql queries
Jul 11 2022, 7:33 PM
EBernhardson committed rWDAN701ee34114cd: Remove image recommendations from weekly data load (authored by EBernhardson).
Remove image recommendations from weekly data load
Jul 11 2022, 4:38 PM
EBernhardson merged task T312187: elastic2049 /srv is read-only into T311939: Degraded RAID on elastic2049.
Jul 11 2022, 3:20 PM · Discovery-Search
EBernhardson merged T312187: elastic2049 /srv is read-only into T311939: Degraded RAID on elastic2049.
Jul 11 2022, 3:20 PM · Discovery-Search (Current work), Patch-For-Review, Elasticsearch, SRE, ops-codfw

Jul 8 2022

EBernhardson added a comment to T303831: Productionize Wikidata subgraph analysis.

Summary of what was done so far to deploy:

Jul 8 2022, 4:33 AM · Patch-For-Review, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)
EBernhardson committed rWDANb5d49feaf4d2: airflow: Use mode=reschedule on all sensors (authored by EBernhardson).
airflow: Use mode=reschedule on all sensors
Jul 8 2022, 4:00 AM
EBernhardson added a comment to T303831: Productionize Wikidata subgraph analysis.

Stats on the final join building topSubgraphTriples. This is using 4096 partitions and repartition(). It works for now so probably not worth dealing with the skew, but these stats might be useful to compare against in the future if it starts failing:

Jul 8 2022, 3:18 AM · Patch-For-Review, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)
EBernhardson committed rWDANc27177479410: Update rdf-spark-tools to 0.3.112 (authored by EBernhardson).
Update rdf-spark-tools to 0.3.112
Jul 8 2022, 1:38 AM

Jul 7 2022

EBernhardson edited P30960 mediasearch -> wcqs.
Jul 7 2022, 11:18 PM
EBernhardson created T312604: Monitor the number of image recommendations available and alert when lower than expected.
Jul 7 2022, 10:53 PM · Structured-Data-Backlog
EBernhardson committed rWDANe0a8f038588c: Tune subgraph_mapping_weekly based on first prod run (authored by EBernhardson).
Tune subgraph_mapping_weekly based on first prod run
Jul 7 2022, 8:47 PM
EBernhardson added a comment to T303831: Productionize Wikidata subgraph analysis.

I tried a run with the three coalesce calls in SubgraphMapper converted into repartitions. In this case instead of having 8 partitions where 7 finish and the 8th takes forever and then fails, now it has 200 partitions and 199 finish with the 200th taking forever and then failing. This seems like it could be a case of skew-join: the dataset is being partitioned based on the join condition (rather than randomly) and a specific part of the join has significantly more values to work through than anything else. To get an idea of how significant the skew is I doubled the RAM again (to 24g) in hopes that it will eventually complete and give some stats. The final stats are as follows, clearly showing a significant skew:
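
A toy Python illustration of why adding partitions does not help when rows are partitioned by a skewed join key. The data is synthetic (one hot key with made-up counts), and crc32 stands in for Spark's partitioner, so this shows only the general effect, not the actual dataset.

```python
import zlib
from collections import Counter
from typing import Iterable


def partition_sizes_by_key(keys: Iterable[str], num_partitions: int) -> Counter:
    """Rows with the same key always land in the same partition."""
    sizes: Counter = Counter()
    for key in keys:
        sizes[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return sizes


def partition_sizes_round_robin(keys: Iterable[str], num_partitions: int) -> Counter:
    """Rows are spread evenly regardless of key."""
    sizes: Counter = Counter()
    for i, _ in enumerate(keys):
        sizes[i % num_partitions] += 1
    return sizes


# Synthetic data: one hot key plus many rare keys (made-up numbers)
keys = ["hot"] * 90_000 + [f"k{i}" for i in range(10_000)]

if __name__ == "__main__":
    for n in (8, 200):
        by_key = partition_sizes_by_key(keys, n)
        even = partition_sizes_round_robin(keys, n)
        print(n, "largest by key:", max(by_key.values()),
              "largest round-robin:", max(even.values()))
```

Whatever the partition count, every row for the hot key ends up in one partition, so the largest partition never shrinks below the hot key's row count. That is consistent with 199 of 200 tasks finishing quickly while the last one runs out of memory.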

Jul 7 2022, 6:50 PM · Patch-For-Review, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)
EBernhardson created P30960 mediasearch -> wcqs.
Jul 7 2022, 5:54 PM
EBernhardson added a comment to T303831: Productionize Wikidata subgraph analysis.

The airflow patch is deployed but I only turned on *_init dags and subgraph_mapping_weekly today (ran out of time, will do the rest tomorrow).

subgraph_mapping_weekly failed the first time through. I updated executor memory from 8g to 12g but the second execution is still failing. Something is quite unbalanced about the topSubgraphItems: of the 8 shards, inputs vary from 100MB to 450MB, giving execution times of ~30s on the small ones and ~8m before the final one fails.

Not specifically related to this patch, but I wonder if we could change up the SparkUtils.saveTables method to somehow take parameters in the path to specify coalesce vs repartition and the number of partitions to save by, so we only have to update the airflow invocation and not the jar as well to test variations there.

Should we have params called coalesce, and repartition, and have them default to false. And when true, use num_partitions to coalesce or repartition accordingly?

Edit: I realize all arg classes that need to coalesce or repartition will need to have these params set.
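
One way the proposed coalesce / repartition / num_partitions parameters could fit together, sketched in Python for illustration. The actual SparkUtils code is in the rdf-spark-tools jar, so the flag spellings and function names here are hypothetical. The only thing assumed about the dataframe is that it exposes coalesce(n) and repartition(n), as Spark DataFrames do.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Both flags default to false, matching the proposal above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--coalesce", action="store_true")
    parser.add_argument("--repartition", action="store_true")
    parser.add_argument("--num-partitions", type=int, default=None)
    return parser


def apply_partitioning(df, args: argparse.Namespace):
    """Coalesce or repartition `df` according to the flags, else return it as-is."""
    if args.coalesce and args.repartition:
        raise ValueError("choose at most one of --coalesce and --repartition")
    if args.coalesce or args.repartition:
        if args.num_partitions is None or args.num_partitions < 1:
            raise ValueError("--num-partitions must be a positive integer")
    if args.coalesce:
        return df.coalesce(args.num_partitions)
    if args.repartition:
        return df.repartition(args.num_partitions)
    return df
```

With something along these lines, switching a job from coalesce to repartition would only need a change to the arguments in the airflow invocation, which is the motivation given above.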

Jul 7 2022, 5:18 PM · Patch-For-Review, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)
EBernhardson added a comment to T303831: Productionize Wikidata subgraph analysis.

The airflow patch is deployed but I only turned on *_init dags and subgraph_mapping_weekly today (ran out of time, will do the rest tomorrow).

Jul 7 2022, 12:12 AM · Patch-For-Review, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)

Jul 6 2022

EBernhardson committed rWDAN5082f1774707: Increase subgraph_mapping_weekly executor memory (authored by EBernhardson).
Increase subgraph_mapping_weekly executor memory
Jul 6 2022, 11:46 PM
EBernhardson committed rWDANdebd4025db5d: Airflow dags to generate subgraph and query mapping and their metrics (authored by AKhatun_WMF).
Airflow dags to generate subgraph and query mapping and their metrics
Jul 6 2022, 9:53 PM
EBernhardson committed rWDAN5f39dbfa34e8: Update rdf-spark-tools classes (authored by AKhatun_WMF).
Update rdf-spark-tools classes
Jul 6 2022, 9:53 PM
EBernhardson added a comment to T312134: Request for SQL Templating to be enabled in Superset.

I suspect the presto.latest_partition function will come for free. Here is a query that I'm hoping will start working again with templating enabled. Here the table name is parameterized (probably not necessary, but it DRYs things up a little) and it sources filter variables from another chart in the dashboard. This used to work pre-superset 1.0.

Jul 6 2022, 9:41 PM · Product-Analytics, Data-Engineering
EBernhardson added a comment to T312134: Request for SQL Templating to be enabled in Superset.

I would also love to see this come back, I have old dashboards that no longer work because templating was turned off. Looking things over, that was turned off for the 1.0 release here (without justification): https://github.com/apache/superset/pull/11172

Jul 6 2022, 7:14 PM · Product-Analytics, Data-Engineering
EBernhardson added a comment to T309648: Restore lost index in cloudelastic.

One other suspicious piece, the cluster-overview dashboard for the eqiad thanos cluster showed significantly increased disk utilization, up to a solid 100% on one of the disks for all 4 backend nodes, starting at about 23:30 which is the same time we started the restore. There was a similar spike yesterday at around the same time, but not every day for the past 7 days. Planning to wait until thanos has settled down before attempting to snapshot+restore again.

Jul 6 2022, 2:55 AM · MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), Patch-For-Review, Discovery-Search (Current work)
EBernhardson added a comment to T309648: Restore lost index in cloudelastic.

Recovery failed, similar errors to last time. Using shardId 14 as an example:

Jul 6 2022, 2:53 AM · MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), Patch-For-Review, Discovery-Search (Current work)
EBernhardson created T312187: elastic2049 /srv is read-only.
Jul 6 2022, 2:08 AM · Discovery-Search

Jul 5 2022

EBernhardson added a comment to T309648: Restore lost index in cloudelastic.

On restore it's using the same index settings as the source cluster, which means 2 replicas instead of the 1 that cloudelastic is expecting. Updated to match expectations:

Jul 5 2022, 11:35 PM · MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), Patch-For-Review, Discovery-Search (Current work)
EBernhardson added a comment to T309648: Restore lost index in cloudelastic.

Started the restore:

$ curl -XPOST -H 'Content-Type: application/json' https://cloudelastic.wikimedia.org:9243/_snapshot/elastic_snaps/snapshot_t309648_attempt_4/_restore -d '{
    "indices": "commonswiki_file_1647921177",
    "include_global_state": false
}'
{"accepted":true} :)
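
For reference, the same restore request can be assembled in Python with only the standard library. This sketch builds the request object without sending it; the helper name is hypothetical, and the host, repository, snapshot and index values are the ones from the curl command above.

```python
import json
import urllib.request


def build_restore_request(base_url: str, repository: str,
                          snapshot: str, indices: str) -> urllib.request.Request:
    """Build (but do not send) a snapshot restore request for a single index."""
    body = {"indices": indices, "include_global_state": False}
    return urllib.request.Request(
        url=f"{base_url}/_snapshot/{repository}/{snapshot}/_restore",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_restore_request(
        "https://cloudelastic.wikimedia.org:9243",
        "elastic_snaps",
        "snapshot_t309648_attempt_4",
        "commonswiki_file_1647921177",
    )
    # Passing req to urllib.request.urlopen would actually start the restore
    print(req.get_method(), req.full_url)
```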
Jul 5 2022, 11:27 PM · MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), Patch-For-Review, Discovery-Search (Current work)
EBernhardson added a comment to T309648: Restore lost index in cloudelastic.

Updated cloudelastic snapshot repository settings to increase restore throttling, by default it was 40mb/s:

Jul 5 2022, 11:18 PM · MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), Patch-For-Review, Discovery-Search (Current work)