Gehel (Guillaume Lederrey)
Operations Engineer - Discovery

User Details

User Since
Nov 9 2015, 9:18 PM (189 w, 2 d)
Availability
Available
IRC Nick
gehel
LDAP User
Gehel
MediaWiki User
GLederrey (WMF) [ Global Accounts ]

Recent Activity

Tue, Jun 25

Gehel removed a project from T207676: Increase deployment window of wdqs or parallelize scap deployment: Discovery-Wikidata-Query-Service-Sprint.
Tue, Jun 25, 5:26 PM · Wikidata-Query-Service, Wikidata
Gehel moved T221121: Capacity planning for elastic search from Needs review to Done on the Discovery-Search (Current work) board.
Tue, Jun 25, 12:45 PM · Discovery-Search (Current work)
Gehel moved T226471: WDQS bans its own monitoring due to bad user agent from Backlog to Done on the Discovery-Wikidata-Query-Service-Sprint board.

prometheus blazegraph exporter updated, we should be good now.

Tue, Jun 25, 9:09 AM · Discovery-Wikidata-Query-Service-Sprint, Wikidata-Query-Service, Wikidata
Gehel claimed T226471: WDQS bans its own monitoring due to bad user agent.
Tue, Jun 25, 8:32 AM · Discovery-Wikidata-Query-Service-Sprint, Wikidata-Query-Service, Wikidata
Gehel added a comment to T186037: Need mvn build mode that does not build gui.

We could define the GUI module in a profile and disable that profile as needed (-P !gui). Some ideas: https://stackoverflow.com/questions/13381179/using-profiles-to-control-which-maven-modules-are-built

Tue, Jun 25, 8:18 AM · Discovery, Wikidata, Wikidata-Query-Service

Mon, Jun 24

Gehel closed T220982: maps hosts have bad permissions under /srv/deployment as Resolved.

no further issues seen, let's get this closed.

Mon, Jun 24, 5:45 PM · Operations
Restricted Application added a project to T226413: Investigate rate limiting of edits on WDQS: Wikidata.
Mon, Jun 24, 3:24 PM · Wikidata, Wikidata-Query-Service, Wikimedia-Incident

Mon, Jun 17

Gehel added a comment to T225904: Mjolnir bulk update failure check - eqiad.

Mjolnir's workload is to transfer updates to the elasticsearch cluster, which happens weekly. So it is expected to have no updates for part of the week. The revised check we deployed checks for a ratio of errors, but does not guard against division by zero.
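
The failure mode described here (an error ratio computed while no updates are flowing) can be illustrated with a small sketch. The actual check is not shown in this feed, so the function names and the 10% threshold below are hypothetical; the point is only that an idle period needs an explicit branch instead of a division.

```python
from typing import Optional


def failure_ratio(failed: int, total: int) -> Optional[float]:
    """Return failed/total, or None when no updates were attempted."""
    if total == 0:
        # Idle period (expected for part of the week): no ratio is defined.
        return None
    return failed / total


def check_status(failed: int, total: int, threshold: float = 0.1) -> str:
    """Map update counters to a check status.

    The threshold is a hypothetical value, not the one used in production.
    """
    ratio = failure_ratio(failed, total)
    if ratio is None:
        return "OK"  # nothing was sent, so nothing failed
    return "CRITICAL" if ratio > threshold else "OK"
```

With this shape, `check_status(0, 0)` reports OK during the quiet part of the week instead of raising or returning an undefined value.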

Mon, Jun 17, 8:20 AM · Discovery-Search
Gehel added a comment to T225904: Mjolnir bulk update failure check - eqiad.

The last change related to T214494 was merged on June 12, but according to the graph the updates stopped on June 15 5:00 UTC. The cause is probably somewhere else.

Mon, Jun 17, 8:15 AM · Discovery-Search

Tue, Jun 11

Gehel added a comment to T214283: Memory correctable errors -EDAC- elastic1029.

@Cmjohnson elastic1029 is shut down and downtimed in icinga, do whatever you need to do and restart whenever it is done.

Tue, Jun 11, 3:42 PM · Discovery-Search (Current work), ops-eqiad, Discovery, DC-Ops, Operations
Gehel added a comment to T214283: Memory correctable errors -EDAC- elastic1029.

@Cmjohnson any news on this? Do you need anything from our side?

Tue, Jun 11, 8:37 AM · Discovery-Search (Current work), ops-eqiad, Discovery, DC-Ops, Operations

Fri, Jun 7

Gehel merged T225238: reimage of maps2002 fails to "run preseeded command" into T225278: Installation failing on late_command.sh .
Fri, Jun 7, 7:57 AM · Operations
Gehel merged task T225238: reimage of maps2002 fails to "run preseeded command" into T225278: Installation failing on late_command.sh .
Fri, Jun 7, 7:57 AM · Maps, Operations

Thu, Jun 6

Gehel added a comment to T225238: reimage of maps2002 fails to "run preseeded command".

Looking around at maps2002, I see an invalid apt source list (P8595) during late command:

Thu, Jun 6, 11:01 PM · Maps, Operations
Gehel created P8595 maps2002:/var/log/installer/syslog.
Thu, Jun 6, 10:59 PM
Gehel added projects to T225238: reimage of maps2002 fails to "run preseeded command": Operations, Maps.
Thu, Jun 6, 5:32 PM · Maps, Operations
Gehel created T225238: reimage of maps2002 fails to "run preseeded command".
Thu, Jun 6, 5:31 PM · Maps, Operations

Tue, Jun 4

Gehel moved T214283: Memory correctable errors -EDAC- elastic1029 from in progress to Waiting/Blocked on the Discovery-Search (Current work) board.
Tue, Jun 4, 5:22 PM · Discovery-Search (Current work), ops-eqiad, Discovery, DC-Ops, Operations
Gehel moved T216055: Move backend for current search dashboard to pull data from Hadoop from Needs review to Waiting/Blocked on the Discovery-Search (Current work) board.
Tue, Jun 4, 5:20 PM · Discovery-Search (Current work), Patch-For-Review, Product-Analytics, Epic
Gehel added a comment to T224967: Find a better partitioning scheme for maps.

For context: The maps servers have 2x900GB + 2x1.5TB disks. We are at the moment using RAID10 across those disks, so we're wasting a bunch of space. We could do better by doing RAID1 on the same size disks and LVM across those.
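
A rough capacity comparison for the two layouts, as a sketch. It assumes that RAID10 limits every member to the size of the smallest disk and that a RAID1 pair yields the size of its smaller disk; filesystem and LVM overhead are ignored, and the helper names are made up for illustration.

```python
def raid10_usable(disks_gb):
    """Usable space for RAID10 across all disks.

    Assumption: each member contributes only as much as the smallest disk,
    and half of the total is used for mirroring.
    """
    return min(disks_gb) * len(disks_gb) // 2


def raid1_pairs_lvm_usable(pairs_gb):
    """Usable space for RAID1 per same-size pair, concatenated with LVM."""
    return sum(min(pair) for pair in pairs_gb)


disks = [900, 900, 1500, 1500]  # 2x900GB + 2x1.5TB, in GB
raid10 = raid10_usable(disks)
raid1_lvm = raid1_pairs_lvm_usable([(900, 900), (1500, 1500)])

print(f"RAID10: {raid10} GB, RAID1 + LVM: {raid1_lvm} GB, gain: {raid1_lvm - raid10} GB")
```

Under these assumptions RAID10 gives 1800 GB and RAID1 + LVM gives 2400 GB, so the second layout recovers about 600 GB per server.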

Tue, Jun 4, 8:58 AM · Operations, Maps

Mon, Jun 3

Gehel renamed T224911: [Epic] Migrate log transport to kafka for Search Platform applications from [EpicMigrate log transport to kafka for Search Platform applications to [Epic] Migrate log transport to kafka for Search Platform applications.
Mon, Jun 3, 4:42 PM · Wikimedia-Logstash, Operations, Discovery-Search, Epic
Gehel created T224911: [Epic] Migrate log transport to kafka for Search Platform applications.
Mon, Jun 3, 4:39 PM · Wikimedia-Logstash, Operations, Discovery-Search, Epic

Tue, May 28

Gehel closed T216701: Wikidata Query Service should have a proper high level error handler as Declined.
Tue, May 28, 5:25 PM · Patch-For-Review, Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
Gehel edited projects for T220205: Define constraints for cloudelastic use cases, added: Discovery-Search; removed Discovery-Search (Current work).
Tue, May 28, 5:23 PM · Discovery-Search
Gehel moved T220205: Define constraints for cloudelastic use cases from in progress to Waiting/Blocked on the Discovery-Search (Current work) board.
Tue, May 28, 5:23 PM · Discovery-Search
Gehel moved T224385: Create WDQS reboot cookbook from Done to Needs review on the Discovery-Search (Current work) board.
Tue, May 28, 5:22 PM · Discovery-Search (Current work), Operations-Software-Development, User-jijiki, User-Joe, Operations
Gehel moved T224385: Create WDQS reboot cookbook from Needs review to Done on the Discovery-Search (Current work) board.
Tue, May 28, 5:22 PM · Discovery-Search (Current work), Operations-Software-Development, User-jijiki, User-Joe, Operations

May 27 2019

Gehel added a comment to T224395: Maps[12]004 /srv disk space is critical.

For whatever reason, only maps1004 was reimaged to RAID10 (instead of RAID1) when adding new disks (so we have 2 unused disks in each server). Note that since we have disks of different sizes, RAID10 still wastes quite a bit of space. We should probably have RAID1 over the physical disks and use LVM to spread the partition over those 2 RAID1 arrays.

May 27 2019, 8:25 AM · Operations, Maps
Gehel added a comment to T224395: Maps[12]004 /srv disk space is critical.

Previous instance of a similar problem: T194966

May 27 2019, 7:05 AM · Operations, Maps

May 22 2019

Gehel created T224097: Make spicerack / cumin cluster aware.
May 22 2019, 9:21 AM · Operations-Software-Development

May 21 2019

Gehel closed T220554: Open cloudelastic to wmf cloud hosts as Resolved.

Duplicate of T223519

May 21 2019, 8:52 AM · Discovery-Search (Current work), Cloud-Services, Elasticsearch, Discovery
Gehel closed T220554: Open cloudelastic to wmf cloud hosts, a subtask of T109715: Replicate production elasticsearch indices to labs, as Resolved.
May 21 2019, 8:52 AM · Discovery-Search, Cloud-Services, Elasticsearch, Discovery
Gehel edited projects for T220554: Open cloudelastic to wmf cloud hosts, added: Discovery-Search (Current work); removed Discovery-Search.
May 21 2019, 8:50 AM · Discovery-Search (Current work), Cloud-Services, Elasticsearch, Discovery

May 20 2019

Gehel added a project to T214283: Memory correctable errors -EDAC- elastic1029: Discovery-Search (Current work).
May 20 2019, 5:12 PM · Discovery-Search (Current work), ops-eqiad, Discovery, DC-Ops, Operations
Gehel updated subscribers of T214283: Memory correctable errors -EDAC- elastic1029.

Error reset as documented in Monitoring/Memory.

May 20 2019, 1:22 PM · Discovery-Search (Current work), ops-eqiad, Discovery, DC-Ops, Operations

May 7 2019

Gehel moved T217398: elastic2038 DOWN (CPU/memory errors ) from Waiting/Blocked to in progress on the Discovery-Search (Current work) board.
May 7 2019, 5:23 PM · Discovery-Search (Current work), Operations, ops-codfw

May 6 2019

Gehel closed Unknown Object (Task), a subtask of T221630: [Epic] Search platform - Hardware requests for 2019-2020, as Resolved.
May 6 2019, 4:35 PM · Discovery-Search, Epic
Gehel moved T220901: Elasticsearch nodes overloading in eqiad from in progress to Done on the Discovery-Search (Current work) board.
May 6 2019, 4:32 PM · Operations, Discovery-Search (Current work)
Gehel added a comment to T221121: Capacity planning for elastic search .

Excellent, can we document these findings on wikitech so they are easy to find?

May 6 2019, 4:20 PM · Discovery-Search (Current work)
Gehel added a comment to T221121: Capacity planning for elastic search .

Did we also take into account codfw being smaller? If we recorded on 35 nodes in eqiad but only replayed from 30 nodes in codfw then we are replaying 86% of the actual traffic.
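
The 86% figure follows from simple proportionality, sketched below. It assumes every node carries an equal share of the recorded traffic, which the comment implies but does not state; the function name is hypothetical.

```python
def replay_fraction(recorded_nodes: int, replayed_nodes: int) -> float:
    """Fraction of recorded traffic that gets replayed.

    Assumption: traffic is spread evenly, so each node's recording
    represents 1/recorded_nodes of the total.
    """
    return replayed_nodes / recorded_nodes


fraction = replay_fraction(recorded_nodes=35, replayed_nodes=30)
print(f"Replaying {fraction:.0%} of the actual traffic")
```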

May 6 2019, 4:18 PM · Discovery-Search (Current work)
Gehel added a comment to T221121: Capacity planning for elastic search .

Executive summary: we should have enough capacity for next year.

May 6 2019, 11:58 AM · Discovery-Search (Current work)

May 3 2019

Gehel closed T222432: Decrease shard alert threshold for omega and psi elasticsearch clusters as Invalid.

Actually, the check timed out, which makes sense if it was routed to the problematic server before it was marked invalid. This is expected, so nothing to do here.

May 3 2019, 9:35 AM · Operations, Discovery-Search (Current work)
Gehel created T222432: Decrease shard alert threshold for omega and psi elasticsearch clusters.
May 3 2019, 9:27 AM · Operations, Discovery-Search (Current work)
Gehel reopened T217398: elastic2038 DOWN (CPU/memory errors ) as "Open".
May 3 2019, 9:25 AM · Discovery-Search (Current work), Operations, ops-codfw
Gehel added a comment to T217398: elastic2038 DOWN (CPU/memory errors ).

It looks like we need to investigate this a bit more.

May 3 2019, 9:24 AM · Discovery-Search (Current work), Operations, ops-codfw

May 2 2019

Gehel added a comment to T222349: Do not rate limit dumps from internal network.

The use case being run currently is actually the cirrus dumps to initialize cloudelastic servers. They are downloaded on mwmaint1002 with curl -s https://dumps.wikimedia.org/other/cirrussearch/20190429/enwiki-20190429-cirrussearch-general.json.gz

May 2 2019, 12:30 PM · Discovery-Search, Operations
Gehel created T222349: Do not rate limit dumps from internal network.
May 2 2019, 9:34 AM · Discovery-Search, Operations

May 1 2019

Gehel removed a project from T221636: Replace elastic1017-1031: Epic.
May 1 2019, 4:36 PM · Discovery-Search
Gehel moved T221630: [Epic] Search platform - Hardware requests for 2019-2020 from needs triage to Ops / SRE on the Discovery-Search board.
May 1 2019, 4:36 PM · Discovery-Search, Epic
Gehel moved T221632: Storage capacity upgrade for WDQS from needs triage to Ops / SRE on the Discovery-Search board.
May 1 2019, 4:36 PM · Wikidata, Wikidata-Query-Service, Discovery-Search
Gehel moved T221631: Dedicated servers on WMCS to test WDQS scalability strategy from needs triage to Ops / SRE on the Discovery-Search board.
May 1 2019, 4:35 PM · Wikidata, Wikidata-Query-Service, cloud-services-team, Discovery-Search
Gehel moved T221633: Bring back codfw elasticsearch cluster to its intended size from needs triage to Ops / SRE on the Discovery-Search board.
May 1 2019, 4:35 PM · Discovery-Search
Gehel moved T221634: replace elastic[1048-1052] (lease expiry) from needs triage to Ops / SRE on the Discovery-Search board.
May 1 2019, 4:35 PM · Discovery-Search
Gehel moved T221635: elastic[2025-2036] (lease expiry) from needs triage to Ops / SRE on the Discovery-Search board.
May 1 2019, 4:35 PM · Discovery-Search
Gehel moved T221636: Replace elastic1017-1031 from needs triage to Ops / SRE on the Discovery-Search board.
May 1 2019, 4:35 PM · Discovery-Search

Apr 30 2019

Gehel closed T221670: Smooth tilerator load as Resolved.

After a few days, the load looks good and smoother than before. Let's close this!

Apr 30 2019, 3:11 PM · Patch-For-Review, Maps (Tilerator)
Gehel changed the subtype of T221523: Feature request: allow rotating the wikimedia map from "Bug Report" to "Feature Request".
Apr 30 2019, 3:10 PM · Maps (Kartographer)

Apr 26 2019

Gehel created T221938: [Epic] Scaling strategy for Wikidata Query Service.
Apr 26 2019, 9:50 AM · Operations, Wikidata-Query-Service, Epic, Wikidata

Apr 25 2019

Gehel added a comment to T221632: Storage capacity upgrade for WDQS.

I don't think it makes sense to perpetuate a vertical scaling model.

Apr 25 2019, 5:12 PM · Wikidata, Wikidata-Query-Service, Discovery-Search
Gehel added a comment to T221632: Storage capacity upgrade for WDQS.

This also makes me note that if we do introduce sharding (in any shape, either with Blazegraph or another solution) we'd need even more servers, since each shard would need to be at least on 2-3 servers to survive, so for sharding to make any sense we'd need at least 6, maybe even more servers, otherwise we'd just store every or nearly every shard on every server, which makes sharding pointless.

So if we'd want resilience to loss of a server and meaningful sharding, we'd need something like 2x servers probably.

Apr 25 2019, 8:58 AM · Wikidata, Wikidata-Query-Service, Discovery-Search

Apr 23 2019

Gehel updated the task description for T221670: Smooth tilerator load.
Apr 23 2019, 4:53 PM · Patch-For-Review, Maps (Tilerator)
Gehel claimed T221670: Smooth tilerator load.
Apr 23 2019, 4:51 PM · Patch-For-Review, Maps (Tilerator)
Gehel created T221670: Smooth tilerator load.
Apr 23 2019, 4:50 PM · Patch-For-Review, Maps (Tilerator)
Gehel added a comment to T221631: Dedicated servers on WMCS to test WDQS scalability strategy.

Yes, this is the continuation of T206636.

Apr 23 2019, 4:11 PM · Wikidata, Wikidata-Query-Service, cloud-services-team, Discovery-Search
Gehel created T221636: Replace elastic1017-1031.
Apr 23 2019, 2:45 PM · Discovery-Search
Gehel updated the task description for T221630: [Epic] Search platform - Hardware requests for 2019-2020.
Apr 23 2019, 2:44 PM · Discovery-Search, Epic
Gehel created T221635: elastic[2025-2036] (lease expiry).
Apr 23 2019, 2:43 PM · Discovery-Search
Gehel created T221634: replace elastic[1048-1052] (lease expiry).
Apr 23 2019, 2:41 PM · Discovery-Search
Gehel created T221633: Bring back codfw elasticsearch cluster to its intended size.
Apr 23 2019, 2:41 PM · Discovery-Search
Gehel created T221632: Storage capacity upgrade for WDQS.
Apr 23 2019, 2:37 PM · Wikidata, Wikidata-Query-Service, Discovery-Search
Gehel triaged T221631: Dedicated servers on WMCS to test WDQS scalability strategy as Normal priority.
Apr 23 2019, 2:30 PM · Wikidata, Wikidata-Query-Service, cloud-services-team, Discovery-Search
Gehel created T221631: Dedicated servers on WMCS to test WDQS scalability strategy.
Apr 23 2019, 2:29 PM · Wikidata, Wikidata-Query-Service, cloud-services-team, Discovery-Search
Gehel triaged T221630: [Epic] Search platform - Hardware requests for 2019-2020 as Normal priority.
Apr 23 2019, 2:24 PM · Discovery-Search, Epic
Gehel created T221630: [Epic] Search platform - Hardware requests for 2019-2020.
Apr 23 2019, 2:23 PM · Discovery-Search, Epic

Apr 18 2019

Gehel added a comment to T141324: Look into shoving gerrit logs into logstash.

I had a conversation with @hashar about this topic. So here are a few ideas:

Apr 18 2019, 2:39 PM · Release-Engineering-Team, Release-Engineering-Team-TODO, observability, Patch-For-Review, Technical-Debt, Wikimedia-Logstash, Gerrit
Gehel moved T220830: data reimport on wdqs1009 and wdqs1010 from Backlog to Done on the Discovery-Wikidata-Query-Service-Sprint board.

Data transfer completed with the new cookbook, everything seems fine.

Apr 18 2019, 9:18 AM · Operations, Discovery-Wikidata-Query-Service-Sprint

Apr 16 2019

Gehel closed T219849: Tilerator crashed on maps200[1-3].codfw.wmnet, a subtask of T198622: migrate maps servers to stretch with the current style, as Resolved.
Apr 16 2019, 3:08 PM · Patch-For-Review, Reading-Infrastructure-Team-Backlog, Operations, Maps
Gehel closed T219849: Tilerator crashed on maps200[1-3].codfw.wmnet as Resolved.

Stretch migration is completed. This should be fixed, we'll reopen if this happens again.

Apr 16 2019, 3:08 PM · Maps (Tilerator), Operations
Gehel assigned T221055: Collect metrics on maps cassandra to Mathew.onipe.
Apr 16 2019, 3:04 PM · Operations, Maps, Cassandra
Gehel moved T221013: prometheus-wmf-elasticsearch-exporter interferes with prometheus-wmf-elasticsearch-exporter-9* unit on elastic nodes from in progress to Done on the Discovery-Search (Current work) board.
Apr 16 2019, 1:00 PM · Discovery-Search (Current work), Elasticsearch
Gehel added a comment to T221013: prometheus-wmf-elasticsearch-exporter interferes with prometheus-wmf-elasticsearch-exporter-9* unit on elastic nodes.

redundant units have been cleaned via cumin:

Apr 16 2019, 1:00 PM · Discovery-Search (Current work), Elasticsearch
Gehel created P8404 (An Untitled Masterwork).
Apr 16 2019, 12:44 PM

Apr 15 2019

Gehel added a comment to T220982: maps hosts have bad permissions under /srv/deployment.

Deployment seems to be a noop:

Apr 15 2019, 2:57 PM · Operations
Gehel added a comment to T220982: maps hosts have bad permissions under /srv/deployment.

permissions reset via:

Apr 15 2019, 2:23 PM · Operations
Gehel removed a project from T202898: Decommission maps-test cluster: Maps.

Removing maps from this ticket, since there isn't any work left on our side.

Apr 15 2019, 7:44 AM · Patch-For-Review, Reading-Infrastructure-Team-Backlog, ops-codfw, decommission, Operations

Apr 12 2019

Gehel created T220830: data reimport on wdqs1009 and wdqs1010.
Apr 12 2019, 2:58 PM · Operations, Discovery-Wikidata-Query-Service-Sprint

Apr 11 2019

Gehel moved T219799: Create cookbook to reset readonly indices on elasticsearch clusters from Needs review to Done on the Discovery-Search (Current work) board.
Apr 11 2019, 12:13 PM · Patch-For-Review, Operations, Wikimedia-Incident, Discovery-Search (Current work)
Gehel closed T217557: Socket timeout on wdqs.svc.eqiad.wmnet as Resolved.

I don't think there is anything actionable at this point. Let's close.

Apr 11 2019, 7:26 AM · Wikidata, Operations, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint

Apr 10 2019

Gehel added a comment to T220625: Initialize CirrusSearch on cloudelastic.

Open firewall on cloudelastic machines to allow connections from mwmaint*, mw job runners and cloudelastic

Apr 10 2019, 4:24 PM · MW-1.34-notes (1.34.0-wmf.4; 2019-05-07), Discovery-Search (Current work), Patch-For-Review, Cloud-Services, Elasticsearch, Discovery

Apr 9 2019

Gehel claimed T219799: Create cookbook to reset readonly indices on elasticsearch clusters.
Apr 9 2019, 5:17 PM · Patch-For-Review, Operations, Wikimedia-Incident, Discovery-Search (Current work)
Gehel moved T220038: Degraded RAID on elastic2048 from in progress to Done on the Discovery-Search (Current work) board.
Apr 9 2019, 2:43 PM · Discovery-Search (Current work), Operations, ops-codfw
Gehel added a project to T220038: Degraded RAID on elastic2048: Discovery-Search (Current work).

Reimage was problematic, with first a puppet failure and then the server not booting over PXE. Manually booting in PXE (F12) finally fixed the issue.

Apr 9 2019, 2:43 PM · Discovery-Search (Current work), Operations, ops-codfw
Gehel created P8376 (An Untitled Masterwork).
Apr 9 2019, 1:51 PM

Apr 8 2019

Gehel moved T219799: Create cookbook to reset readonly indices on elasticsearch clusters from in progress to Needs review on the Discovery-Search (Current work) board.
Apr 8 2019, 5:39 PM · Patch-For-Review, Operations, Wikimedia-Incident, Discovery-Search (Current work)

Apr 5 2019

Mill <mill@mail.com> committed rCUMIN4d1480f7c3f0: 0kbaaaaaaaaaaa (authored by Gehel).
0kbaaaaaaaaaaa
Apr 5 2019, 10:29 PM
Mill <mill@mail.com> committed rCUMIN6964d19b3846: )ccaaaaaaaaaaa (authored by Gehel).
)ccaaaaaaaaaaa
Apr 5 2019, 10:29 PM
Gehel created T220205: Define constraints for cloudelastic use cases.
Apr 5 2019, 2:08 PM · Discovery-Search
Gehel committed rDPOM88f102a989f4: [maven-release-plugin] prepare for next development iteration (authored by Gehel).
[maven-release-plugin] prepare for next development iteration
Apr 5 2019, 9:41 AM
Gehel committed rDPOMe280fd8f3a01: [maven-release-plugin] prepare release discovery-parent-pom-1.28 (authored by Gehel).
[maven-release-plugin] prepare release discovery-parent-pom-1.28
Apr 5 2019, 9:41 AM
Gehel committed rDPOM1a3d9f7828fe: Update surefire / failsafe to latest milestone. (authored by Gehel).
Update surefire / failsafe to latest milestone.
Apr 5 2019, 9:25 AM