
thcipriani (Tyler Cipriani)
¯\_(ツ)_/¯ Administrator

Projects (20)

User Details

User Since
Feb 9 2015, 10:04 PM (231 w, 3 d)
Roles
Administrator
Availability
Available
IRC Nick
thcipriani
LDAP User
Unknown
MediaWiki User
TCipriani (WMF) [ Global Accounts ]

Recent Activity

Today

thcipriani moved T228328: 'scap pull' stopped working on appservers ? from Doing to Done (within RelEng) on the Release-Engineering-Team-TODO (201907) board.
Thu, Jul 18, 10:38 PM · Release-Engineering-Team-TODO (201907), Release-Engineering-Team (Deployment services), Scap, serviceops
thcipriani moved T228482: Deploy scap 3.11.1-1 from Needs triage to External/Watching on the Scap board.
Thu, Jul 18, 10:25 PM · Release-Engineering-Team-TODO (201907), Release-Engineering-Team (Deployment services), Scap, serviceops
thcipriani moved T228482: Deploy scap 3.11.1-1 from INBOX to Done (within RelEng) on the Release-Engineering-Team-TODO (201907) board.
Thu, Jul 18, 10:25 PM · Release-Engineering-Team-TODO (201907), Release-Engineering-Team (Deployment services), Scap, serviceops
thcipriani created T228482: Deploy scap 3.11.1-1.
Thu, Jul 18, 10:24 PM · Release-Engineering-Team-TODO (201907), Release-Engineering-Team (Deployment services), Scap, serviceops
thcipriani added a comment to T227529: Request rename of "waldir" to "waldyrious" on LDAP.

I am resolving this; feel free to reopen if something is amiss.

The cn and sn for uid=waldir,ou=people,dc=wikimedia,dc=org are both waldyrious (lower case w). Wikitech will never be able to authenticate as MediaWiki will canonicalize the username to start with a capital letter W and wikitech is configured to enforce same case matching for user account lookup.

Changed to Waldyrious (upper case) per comment above. @waldyrious could you please try again logging into wikitech?
@thcipriani The gerrit error seems to be the same as T216605, mind running that script?

Thu, Jul 18, 8:47 PM · LDAP-Access-Requests
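The case mismatch discussed in T227529 can be illustrated with a minimal Python sketch. It assumes, as the comment states, that MediaWiki uppercases the first letter of a username and that the account lookup against the LDAP cn is case-sensitive. The function names are hypothetical and this is not Wikitech's actual code.

```python
def canonicalize_username(name: str) -> str:
    """Uppercase the first character and leave the rest unchanged (assumed rule)."""
    return name[:1].upper() + name[1:]


def can_authenticate(login: str, ldap_cn: str) -> bool:
    """Compare the canonicalized login against the LDAP cn, case-sensitively."""
    return canonicalize_username(login) == ldap_cn


# With a lowercase cn the canonicalized name can never match.
print(can_authenticate("waldyrious", "waldyrious"))
# After the cn is changed to start with a capital W, the lookup succeeds.
print(can_authenticate("waldyrious", "Waldyrious"))
```

Under these assumptions, no spelling of the login can match a cn that starts with a lowercase letter, which is why the cn itself had to change.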
thcipriani created T228446: Add prometheus metrics for Blubberoid.
Thu, Jul 18, 4:10 PM · Release Pipeline (Blubber)

Yesterday

thcipriani added a comment to T228196: docker-registry: some layers has been corrupted due to deleting other swift containers.

For that particular image I can recreate locally:

Wed, Jul 17, 7:39 PM · Patch-For-Review, Release-Engineering-Team-TODO (201907), Operations, Wikimedia-Incident, serviceops
thcipriani updated subscribers of T228328: 'scap pull' stopped working on appservers ?.

refreshMessageBlobs was added in T222539
One of two solutions:

  • Install scap::scripts on all appservers rather than canary appservers
  • rethink how this is included in scap
Wed, Jul 17, 7:35 PM · Release-Engineering-Team-TODO (201907), Release-Engineering-Team (Deployment services), Scap, serviceops
thcipriani added a comment to T228328: 'scap pull' stopped working on appservers ?.

refreshMessageBlobs was added in T222539

Wed, Jul 17, 7:04 PM · Release-Engineering-Team-TODO (201907), Release-Engineering-Team (Deployment services), Scap, serviceops

Tue, Jul 16

thcipriani committed rGBLBRa4b12761b8c6: Unit tests: PosOf InsertElement (authored by thcipriani).
Unit tests: PosOf InsertElement
Tue, Jul 16, 7:45 PM
thcipriani created P8755 (An Untitled Masterwork).
Tue, Jul 16, 5:56 PM
thcipriani closed T207702: contint1001:/var/lib/docker growth as Invalid.

We use /mnt/docker now and it's got a lot of space.

Tue, Jul 16, 4:56 PM · Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO, Release Pipeline, Continuous-Integration-Infrastructure
thcipriani closed T207703: Pruning docker-pkg images, a subtask of T207702: contint1001:/var/lib/docker growth, as Resolved.
Tue, Jul 16, 4:56 PM · Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO, Release Pipeline, Continuous-Integration-Infrastructure
thcipriani closed T207703: Pruning docker-pkg images as Resolved.

This looks to be released as a feature on contint1001 now.

Tue, Jul 16, 4:56 PM · docker-pkg, Continuous-Integration-Infrastructure

Mon, Jul 15

mmodell awarded T140921: Static asset time on disk a Love token.
Mon, Jul 15, 10:41 PM · Performance-Team (Radar), Release-Engineering-Team-TODO (201907), Release-Engineering-Team (Deployment services), Deployments

Fri, Jul 12

thcipriani added a comment to T227814: [Regression wmf.13] Wikidata localisation is broken.

So did we figure out what happened? (The change @thcipriani linked is by me, and I also think it feels suspect, so I hope this is not my fault 😬)

Fri, Jul 12, 7:41 PM · User-greg, Release-Engineering-Team-TODO (201907), Wikimedia-production-error (Shared Build Failure), Performance-Team (Radar), I18n, Wikidata
thcipriani added a comment to T227814: [Regression wmf.13] Wikidata localisation is broken.

So did we figure out what happened? (The change @thcipriani linked is by me, and I also think it feels suspect, so I hope this is not my fault 😬)
Edit: And if this was only fixed by reverting the train, and the issue persists on Test Wikidata, shouldn’t the task remain open?

Fri, Jul 12, 7:08 PM · User-greg, Release-Engineering-Team-TODO (201907), Wikimedia-production-error (Shared Build Failure), Performance-Team (Radar), I18n, Wikidata
thcipriani added a comment to T193824: Determine a standard way of installing MediaWiki lib/extension dependencies within containers.

I've been operating under the assumption that at some stage we're going to have some kind of containers with just some extensions installed, and we'll need to resolve those dependencies for testing or whatever.

I'm not sure that this would actually be very useful. As far as I understand, the goal here is to test the interoperability of extensions. Just enabling all of them and running all their tests in the combined environment doesn't seem like a great approach for that.

Fri, Jul 12, 7:02 PM · Release-Engineering-Team-TODO (201907), MW-1.34-notes (1.34.0-wmf.8; 2019-06-04), Core Platform Team (Extension Management (TEC13)), Patch-For-Review, Release Pipeline

Thu, Jul 11

thcipriani added a comment to T227814: [Regression wmf.13] Wikidata localisation is broken.

Ran in another terminal window after I got the message 22:10:59 Updating ExtensionMessages-1.34.0-wmf.13.php from scap

Thu, Jul 11, 10:12 PM · User-greg, Release-Engineering-Team-TODO (201907), Wikimedia-production-error (Shared Build Failure), Performance-Team (Radar), I18n, Wikidata
thcipriani added a comment to T227814: [Regression wmf.13] Wikidata localisation is broken.

mergeMessageFileList, which scap shells out to in order to build ExtensionMessages, does a require_once on each of the files listed in the extension-list file. It then dumps $wgMessagesDirs into the ExtensionMessages file.

Thu, Jul 11, 10:02 PM · User-greg, Release-Engineering-Team-TODO (201907), Wikimedia-production-error (Shared Build Failure), Performance-Team (Radar), I18n, Wikidata
thcipriani added a comment to T227814: [Regression wmf.13] Wikidata localisation is broken.

Nothing obvious in the repo's logs.

Thu, Jul 11, 9:26 PM · User-greg, Release-Engineering-Team-TODO (201907), Wikimedia-production-error (Shared Build Failure), Performance-Team (Radar), I18n, Wikidata
thcipriani added a comment to T227814: [Regression wmf.13] Wikidata localisation is broken.

hrm, so WikibaseLib is in ExtensionMessages for 1.34.11, but not in ExtensionMessages for 1.34.13

Thu, Jul 11, 8:54 PM · User-greg, Release-Engineering-Team-TODO (201907), Wikimedia-production-error (Shared Build Failure), Performance-Team (Radar), I18n, Wikidata
thcipriani raised the priority of T227814: [Regression wmf.13] Wikidata localisation is broken from High to Unbreak Now!.

UBN since it's a deployment blocker.

Thu, Jul 11, 8:27 PM · User-greg, Release-Engineering-Team-TODO (201907), Wikimedia-production-error (Shared Build Failure), Performance-Team (Radar), I18n, Wikidata
thcipriani added a comment to T227814: [Regression wmf.13] Wikidata localisation is broken.

Nothing looks to have changed with the actual json in the extension:

Thu, Jul 11, 8:15 PM · User-greg, Release-Engineering-Team-TODO (201907), Wikimedia-production-error (Shared Build Failure), Performance-Team (Radar), I18n, Wikidata
thcipriani added a comment to T227814: [Regression wmf.13] Wikidata localisation is broken.

Hrm. That is bizarre.

Thu, Jul 11, 8:12 PM · User-greg, Release-Engineering-Team-TODO (201907), Wikimedia-production-error (Shared Build Failure), Performance-Team (Radar), I18n, Wikidata
thcipriani closed T223266: Unable to login to gerrit as Resolved.

Hi @Shirayuki, I believe your issue should be resolved. Please re-open this ticket if you are still unable to log in.

Thu, Jul 11, 4:01 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (201907), Gerrit
thcipriani closed Restricted Task, a subtask of T223266: Unable to login to gerrit, as Resolved.
Thu, Jul 11, 4:00 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (201907), Gerrit

Tue, Jul 9

thcipriani triaged T227613: Cannot save on 1.34.0-wmf.13 - "Cannot access the database: Unknown error" as Unbreak Now! priority.

didn't mean to change priority

Tue, Jul 9, 8:59 PM · MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Performance-Team, AbuseFilter, Wikimedia-production-error, MediaWiki-Database
thcipriani lowered the priority of T227613: Cannot save on 1.34.0-wmf.13 - "Cannot access the database: Unknown error" from Unbreak Now! to Needs Triage.
Tue, Jul 9, 8:58 PM · MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Performance-Team, AbuseFilter, Wikimedia-production-error, MediaWiki-Database
thcipriani added a comment to P8731 (An Untitled Masterwork).
[thcipriani@contint1001 mathoid ((cc375c7db...) %)]$ git status
Not currently on any branch.
Untracked files:
  (use "git add <file>..." to include in what will be committed)
Tue, Jul 9, 3:20 PM
thcipriani created P8731 (An Untitled Masterwork).
Tue, Jul 9, 3:18 PM

Mon, Jul 8

thcipriani added a comment to T207707: contint1001 store docker images on separate partition or disk.

So I think we can just:

  • stick to overlay2
  • reconfigure Docker to use /mnt/docker
  • restart Docker
  • run docker-pkg and verify it actually pulls all images / does not rebuild any
  • archive /var/lib/docker and ultimately delete it
Mon, Jul 8, 10:13 PM · Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (201907), serviceops, Operations, Continuous-Integration-Infrastructure
thcipriani added a subtask for T223266: Unable to login to gerrit: Unknown Object (Task).
Mon, Jul 8, 7:18 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (201907), Gerrit
thcipriani added a comment to T208259: OSError: [Errno 1] Operation not permitted when running git fat pull.

I ran into this issue again when deploying WDQS today. Some of the binaries were owned by the previous deployer. My workaround was to reset ownership to myself, but that's obviously not a step that I would like to do every time we switch the deployer.
@thcipriani: Any idea how we could fix this for the longer term?

Mon, Jul 8, 5:53 PM · Deployments, Operations, Release
thcipriani moved T222539: Scap deployments are not purging MessageBlobStore (was: Stale localized messages) from Ready to Completed on the Release-Engineering-Team-TODO (201907) board.
Mon, Jul 8, 3:22 PM · Release-Engineering-Team-TODO (201907), MW-1.34-notes (1.34.0-wmf.7; 2019-05-28), Performance-Team (Radar), Patch-For-Review, Scap, Regression, MediaWiki-ResourceLoader
thcipriani closed T222539: Scap deployments are not purging MessageBlobStore (was: Stale localized messages) as Resolved.

This should now be resolved in production as of Tue, Jun 25, 7:45 AM when the new scap version (3.10.0-1) was released.

Mon, Jul 8, 3:22 PM · Release-Engineering-Team-TODO (201907), MW-1.34-notes (1.34.0-wmf.7; 2019-05-28), Performance-Team (Radar), Patch-For-Review, Scap, Regression, MediaWiki-ResourceLoader
thcipriani reopened Restricted Task, a subtask of T218750: Re-enable use of Gerrit HTTP token to push patchsets, as Open.
Mon, Jul 8, 2:59 PM · Release-Engineering-Team-TODO, Gerrit
thcipriani removed a project from T223266: Unable to login to gerrit: Patch-For-Review.

Hi @Shirayuki sorry for the delay

Mon, Jul 8, 2:57 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (201907), Gerrit

Wed, Jul 3

thcipriani reassigned T227225: release a scap that contains I85a2161 (Remove functionality to talk to conftool) from thcipriani to CDanis.

@CDanis I merged in the changes to the release branch and pushed up the debian/3.11.0-1 tag: you should be good to release from the scap side.

Wed, Jul 3, 11:13 PM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (201907), Scap
thcipriani triaged T227225: release a scap that contains I85a2161 (Remove functionality to talk to conftool) as Normal priority.
Wed, Jul 3, 8:32 PM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (201907), Scap

Tue, Jul 2

thcipriani added a comment to T207707: contint1001 store docker images on separate partition or disk.

Hi Hashar, at this point I think it makes sense to assign back to you to check if it seems sane and then for your next step:

and change its config to point there.

Tue, Jul 2, 10:04 PM · Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (201907), serviceops, Operations, Continuous-Integration-Infrastructure
thcipriani closed T216605: Cannot assign user name "XXX" to account ####; name already in use. as Resolved.

Applied patch and reindexed accounts 2019-07-02 17:53 UTC.

Tue, Jul 2, 6:06 PM · Security, Gerrit
thcipriani added a comment to T227111: Zuul is no longer adding jobs to any jenkins pipelines.
09:47:13 <James_F> I just got a large number of e-mails from gerrit all at once.
09:47:19 <James_F> Possibly a deadlock got resolved?
Tue, Jul 2, 5:13 PM · Wikimedia-production-error (Shared Build Failure), Release-Engineering-Team, Zuul
thcipriani added a comment to T227111: Zuul is no longer adding jobs to any jenkins pipelines.
09:47:13 <James_F> I just got a large number of e-mails from gerrit all at once.
09:47:19 <James_F> Possibly a deadlock got resolved?
Tue, Jul 2, 5:00 PM · Wikimedia-production-error (Shared Build Failure), Release-Engineering-Team, Zuul
thcipriani added a comment to T227111: Zuul is no longer adding jobs to any jenkins pipelines.

Only thing I can imagine is that hmm Zuul lost its connection to Gerrit somehow :-\

Tue, Jul 2, 4:52 PM · Wikimedia-production-error (Shared Build Failure), Release-Engineering-Team, Zuul
thcipriani assigned T226660: Make helm chart template in deployment-charts support local development to jeena.

Assigning to @jeena since she already has an initial patchset. Adding serviceops-radar since folks on their team will likely be reviewing (thanks @akosiaris for the current review on the patchset)

Tue, Jul 2, 3:30 PM · serviceops-radar, Release Pipeline, Release-Engineering-Team (Local Dev)

Mon, Jul 1

thcipriani added a comment to T216605: Cannot assign user name "XXX" to account ####; name already in use..
  • what's stopping people from making the same case mistakes and causing the issue again? Won't new accounts face that same problem? Or do we expect LocalUsernamesToLowerCase to mitigate that?
Mon, Jul 1, 10:49 PM · Security, Gerrit
thcipriani committed rDEPLOYCHARTS639471e5bcc1: blubberoid: Add policy file (authored by thcipriani).
blubberoid: Add policy file
Mon, Jul 1, 5:35 PM
thcipriani closed T226724: Gerrit manager rights for Ottomata as Resolved.

Not managed by SREs, as discussed in the SRE meeting; some SREs do not even have the admin rights on gerrit AFAIK.

Mon, Jul 1, 5:08 PM · Release-Engineering-Team, Gerrit-Privilege-Requests

Fri, Jun 28

thcipriani committed rGBLBRd2ff2621c453: Bump config version to v4 (authored by thcipriani).
Bump config version to v4
Fri, Jun 28, 8:49 PM
thcipriani committed rGBLBR45d0ce6fe6fe: Bump config version to v4 (authored by thcipriani).
Bump config version to v4
Fri, Jun 28, 8:42 PM

Thu, Jun 27

thcipriani added a comment to T224448: Gerrit http threads stuck behind sendemail thread.

Investigated this a bit today. I was hoping with 3 incidents in one day that the trigger for this event might be obvious.

Thu, Jun 27, 6:23 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, serviceops-radar, Gerrit

Tue, Jun 25

thcipriani added a comment to T224448: Gerrit http threads stuck behind sendemail thread.

This happened twice in the past 24 hours.

Tue, Jun 25, 9:27 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, serviceops-radar, Gerrit
thcipriani reopened T218750: Re-enable use of Gerrit HTTP token to push patchsets as "Open".

Spoke too soon. Gerrit 2.15.14 caused a lot of SendEmail locks (T224448: Gerrit http threads stuck behind sendemail thread). I have to rollback to 2.15.13. I will update wikitech as well.

Tue, Jun 25, 9:21 PM · Release-Engineering-Team-TODO, Gerrit
thcipriani committed rGERRITDEPLOY7b379a61d95c: Revert "Gerrit v2.15.14" (authored by thcipriani).
Revert "Gerrit v2.15.14"
Tue, Jun 25, 9:13 PM
thcipriani added a reverting change for rGERRITDEPLOYe3695fdcc486: Gerrit v2.15.14: rGERRITDEPLOY7b379a61d95c: Revert "Gerrit v2.15.14".
Tue, Jun 25, 9:13 PM
thcipriani added a comment to T224448: Gerrit http threads stuck behind sendemail thread.

This happened twice in the past 24 hours.

Tue, Jun 25, 8:35 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, serviceops-radar, Gerrit

Mon, Jun 24

thcipriani changed the status of T225308: Users with a different name in the cn field compared to uid field cannot use http auth from Open to Stalled.

@thcipriani: the fix is turning on http passwords.

Mon, Jun 24, 9:47 PM · Gerrit
thcipriani closed T218750: Re-enable use of Gerrit HTTP token to push patchsets as Resolved.

Config change has been deployed.

Mon, Jun 24, 9:47 PM · Release-Engineering-Team-TODO, Gerrit
thcipriani added a comment to T224857: Enhance MediaWiki deployments for support of php7.x.

Should we be working to implement symlink swapping in scap? Sounds like it's currently a blocker for this task (although opcache exhaustion may be blocking both).
If so, the key is that the realpath for each deployment needs to be unique, correct? Atomic deployments are a side effect of the unique real path?

No, I don't think that's what we should focus on.
We want to solve both the atomicity problem and the opcache exhaustion problem, and the symlink swapping won't solve the latter, which is by far the more urgent matter.
As I said, rolling restarts on each deployment solve both problems, but for now given the doubts I've seen floating around we could find a middle ground as follows:

  • we only run the rolling restart when a significant code path change happens, so when we run the train

This would unblock the php7 transition at the very least. Is that acceptable to you?

Mon, Jun 24, 3:10 PM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, Patch-For-Review, User-jijiki, PHP 7.2 support, Scap, serviceops

Fri, Jun 21

thcipriani added a comment to T224857: Enhance MediaWiki deployments for support of php7.x.

To summarize the situation a bit, we have 2 problems right now that need to be solved:

  1. Deploys are not atomic
  2. Opcache gets progressively exhausted by deploying code

We can solve 1) using the symlink swapping and mod_realdoc or switching to nginx on the frontend, but the second problem can only be solved via regular resets or restarts of the opcache.
Given we don't trust the opcache resets at all at this point, the rolling restart of the daemons seems like the best option to solve both issues.

Fri, Jun 21, 8:50 PM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, Patch-For-Review, User-jijiki, PHP 7.2 support, Scap, serviceops
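The symlink swapping mentioned in the T224857 discussion can be sketched in Python as follows. This is a minimal illustration, not scap's implementation. It assumes a POSIX filesystem where renaming over an existing symlink is atomic, and the directory names (release-1, release-2, current) are hypothetical. It addresses only the atomicity problem; as the comments above note, it does nothing for opcache exhaustion.

```python
import os
import tempfile


def swap_symlink(link_path: str, new_target: str) -> None:
    """Point link_path at new_target in a single rename step."""
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    # Build the new link beside the old one, then rename it into place.
    os.symlink(new_target, tmp_link)
    os.replace(tmp_link, link_path)


def demo_symlink_swap() -> tuple:
    """Deploy two hypothetical releases and report what 'current' resolves to."""
    with tempfile.TemporaryDirectory() as root:
        old_release = os.path.join(root, "release-1")
        new_release = os.path.join(root, "release-2")
        os.mkdir(old_release)
        os.mkdir(new_release)
        current = os.path.join(root, "current")

        swap_symlink(current, old_release)
        before = os.path.basename(os.path.realpath(current))

        swap_symlink(current, new_release)
        after = os.path.basename(os.path.realpath(current))
        return before, after


print(demo_symlink_swap())
```

Because each release lives in its own directory, the real path behind the link changes on every deployment, which is the property the discussion refers to as a unique realpath.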

Thu, Jun 20

thcipriani added a comment to T224915: Deploy scap 3.10.0-1.

@thcipriani I have packaged the newer version and uploaded it to stretch-wikimedia. I will upgrade the servers soon, time permitting.

Thu, Jun 20, 5:10 PM · Release-Engineering-Team (Deployment services), User-jijiki, Release-Engineering-Team-TODO, serviceops, Scap
thcipriani committed rGBLBR15715357cf4e: Update go-playground validator (authored by thcipriani).
Update go-playground validator
Thu, Jun 20, 4:37 PM
thcipriani moved T207707: contint1001 store docker images on separate partition or disk from Backlog to Blocked (externally) on the Release-Engineering-Team (Kanban) board.

The new disks can be shown as sdc and sdd.
Currently I think we have 3 RAID 1 arrays, with LVM on the largest one (md2):
Disk /dev/sda: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Disk /dev/sdb: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Disk /dev/md0: 46.5 GiB, 49965694976 bytes, 97589248 sectors
Disk /dev/md1: 953.4 MiB, 999751680 bytes, 1952640 sectors
Disk /dev/md2: 883.9 GiB, 949069283328 bytes, 1853650944 sectors
Disk /dev/mapper/contint1001--vg-data: 883.9 GiB, 949066137600 bytes, 1853644800 sectors
Disk /dev/sdc: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Disk /dev/sdd: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
I don't know what the ops best practices for disks are. I guess we will need SRE to create a new RAID 1 array over the two disks, create a new LVM volume group and then we can do some partitioning.
For Docker I guess we can start with a 500GB partition on the new disks? Then mount that to /srv/docker and change its config to point there.

Thu, Jun 20, 3:00 PM · Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (201907), serviceops, Operations, Continuous-Integration-Infrastructure
thcipriani triaged T226191: PipelineBot should be voting as Normal priority.
Thu, Jun 20, 2:57 PM · Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Release Pipeline
thcipriani moved T226191: PipelineBot should be voting from Backlog to CI on the Release Pipeline board.
Thu, Jun 20, 2:57 PM · Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Release Pipeline
thcipriani created T226191: PipelineBot should be voting.
Thu, Jun 20, 2:57 PM · Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Release Pipeline
thcipriani committed rGERRITDEPLOYe3695fdcc486: Gerrit v2.15.14 (authored by thcipriani).
Gerrit v2.15.14
Thu, Jun 20, 2:35 PM

Jun 18 2019

thcipriani committed rDEPLOYCHARTS1677c01cdda8: blubberoid: Add policy file (authored by thcipriani).
blubberoid: Add policy file
Jun 18 2019, 12:25 AM

Jun 17 2019

thcipriani added a comment to T217830: Problems deploying dblists/commonsuploads.dblist.

I think the simpler solution would be to do what we generally do to avoid this problem in production code, e.g. in MediaWiki core: make the verification mechanism part of the value instead of some fragile indirect side effect of the value.
In other words, either

  1. Make the fragile side-effect not a side effect and not fragile, by having the PHP code in wmf-config create the cache with the explicit mtime set to the value it was creating it for.
Jun 17 2019, 3:41 PM · Release-Engineering-Team-TODO, Performance-Team (Radar), Scap, User-zeljkofilipin
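The idea in option 1 of the T217830 comment, creating the cache with an explicit mtime equal to that of the file it was built from, can be sketched in Python. The original proposal concerns PHP code in wmf-config, so this is only an analogy; the file names and helper functions are hypothetical.

```python
import os
import tempfile


def write_cache(source: str, cache: str, payload: str) -> None:
    """Write the cache, then stamp it with the source file's mtime."""
    src_stat = os.stat(source)
    with open(cache, "w") as fh:
        fh.write(payload)
    os.utime(cache, ns=(src_stat.st_atime_ns, src_stat.st_mtime_ns))


def cache_is_fresh(source: str, cache: str) -> bool:
    """The cache is valid only if its mtime equals the source's mtime."""
    if not os.path.exists(cache):
        return False
    return os.stat(cache).st_mtime_ns == os.stat(source).st_mtime_ns


def demo_mtime_cache() -> tuple:
    with tempfile.TemporaryDirectory() as root:
        source = os.path.join(root, "example.dblist")
        cache = os.path.join(root, "example.cache")
        with open(source, "w") as fh:
            fh.write("wiki_a\n")

        write_cache(source, cache, "wiki_a")
        fresh = cache_is_fresh(source, cache)

        # Simulate a later edit of the source by moving its mtime forward.
        st = os.stat(source)
        os.utime(source, ns=(st.st_atime_ns, st.st_mtime_ns + 5_000_000_000))
        stale = cache_is_fresh(source, cache)
        return fresh, stale


print(demo_mtime_cache())
```

The point of the explicit stamp is that validity no longer depends on when the cache happened to be written, only on which version of the source it was written for.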

Jun 13 2019

thcipriani added a comment to T224857: Enhance MediaWiki deployments for support of php7.x.

Checking my understanding of what you're saying (please correct me where I'm misunderstanding): for rollback/emergencies, scap does opcache_reset, syncs the file, then calls the smart script to do a rolling restart (11 minutes -- 350 servers, 2 seconds/server). Sending the opcache_reset first means that the change goes live immediately, and the restart is just for cache sanity.

We'd sync the files, reset the opcache (which would actually make the rollback effective) and then do the rolling restart afterwards.

Jun 13 2019, 11:26 PM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, Patch-For-Review, User-jijiki, PHP 7.2 support, Scap, serviceops
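The "11 minutes" estimate in the T224857 comment follows from simple arithmetic, shown here as a small Python sketch. It assumes the restart is strictly sequential with no batching or parallelism, which the comment does not state; the function name is hypothetical.

```python
def rolling_restart_seconds(servers: int, seconds_per_server: int) -> int:
    """Total duration of a strictly sequential rolling restart."""
    return servers * seconds_per_server


total = rolling_restart_seconds(350, 2)
minutes, seconds = divmod(total, 60)
print(f"{total} s = {minutes} min {seconds} s")  # 700 s = 11 min 40 s
```

That window of roughly 11 to 12 minutes is what a rollback would have to wait for if the opcache reset were not sent first.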
thcipriani added a comment to T224939: Create an interface for the local-charts ecosystem.

...

  1. Python (present or installable on most systems?)
Jun 13 2019, 6:31 PM · Release-Engineering-Team (Local Dev), Release-Engineering-Team-TODO, Developer Productivity, local-charts
thcipriani updated subscribers of T216605: Cannot assign user name "XXX" to account ####; name already in use..

Thanks for the enthusiasm :)

Jun 13 2019, 3:56 PM · Security, Gerrit

Jun 12 2019

thcipriani committed rGBLBRf73e3ccafa7a: Update go-playground validator (authored by thcipriani).
Update go-playground validator
Jun 12 2019, 9:53 PM

Jun 11 2019

thcipriani added a comment to T216605: Cannot assign user name "XXX" to account ####; name already in use..

Happening to me too:

Jun 11 2019, 7:34 PM · Security, Gerrit
thcipriani created P8606 gerritUsernameToLowercase.sh.
Jun 11 2019, 7:27 PM
thcipriani added a comment to T224996: Gerrit repo scoring/ores/editquality not mirroing.

My theory here is that there is some ssh timeout on the phab side for pushing to gerrit (or maybe some error on the phab side(?)). I don't see anything in the gerrit error logs related to phab/editquality.

Jun 11 2019, 7:21 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, artificial-intelligence, editquality-modeling, Scoring-platform-team, Diffusion
thcipriani added a comment to T224996: Gerrit repo scoring/ores/editquality not mirroing.

Looks like master is showing a newer version:

Jun 11 2019, 6:50 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, artificial-intelligence, editquality-modeling, Scoring-platform-team, Diffusion
thcipriani assigned T224069: Add/reserve a Jenkins node for the pipeline's trigger jobs to brennen.

@brennen made the mistake of showing interest in this task, assigning accordingly :)

Jun 11 2019, 4:34 PM · Release-Engineering-Team (Kanban), Release Pipeline

Jun 10 2019

thcipriani claimed T224637: php-1.33.0-wmf.23/cache/l10n isn't cleaned up in prod.

Guessing this is the aborted clean from https://tools.wmflabs.org/sal/log/AWq2FV6OOwpQ-3Pk_Reb

Jun 10 2019, 8:17 PM · Release-Engineering-Team-TODO, Scap
thcipriani created P8604 not-latest-docker.py.
Jun 10 2019, 7:01 PM
thcipriani added a comment to T224857: Enhance MediaWiki deployments for support of php7.x.

Regarding rollbacks: I guess that we can add a switch that, just for rollbacks, sends out the opcache reset first, and does the rolling restart afterwards, thus reducing the window of time in which the issues are present. We'll have a small window of time in which an opcache corruption would be possible, but only in case of emergency.
@thcipriani would that work for addressing your concerns?

Jun 10 2019, 3:41 PM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, Patch-For-Review, User-jijiki, PHP 7.2 support, Scap, serviceops

Jun 9 2019

thcipriani closed T213198: Update Blubber documentation as Resolved.

https://wikitech.wikimedia.org/wiki/Blubber has been rewritten (by @thcipriani), is there anything specific that still needs to be improved for this ticket to be done?

Jun 9 2019, 7:02 PM · Release-Engineering-Team (Next), Release Pipeline (Blubber), Documentation, Operations, Prod-Kubernetes
thcipriani closed T213198: Update Blubber documentation, a subtask of T213090: TEC3:O6:O:6.1:Q3: Deployment Pipeline Documentation, as Resolved.
Jun 9 2019, 7:02 PM · Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Documentation, Operations, Release Pipeline, Prod-Kubernetes

Jun 7 2019

thcipriani added a comment to T225308: Users with a different name in the cn field compared to uid field cannot use http auth.

@thcipriani: the fix is turning on http passwords.

Jun 7 2019, 6:25 PM · Gerrit
thcipriani added a comment to T225308: Users with a different name in the cn field compared to uid field cannot use http auth.

This is due to gerrit using cn as the UI login (internally in gerrit this is the gerrit schema), while using uid as the ssh/api login (internally in gerrit this is the username schema).

Jun 7 2019, 4:03 PM · Gerrit

Jun 6 2019

thcipriani added a comment to T225252: mediawiki-config (and others?) should ride gate-and-submit-swat not gate-and-submit.

gate-and-submit-swat is for immediate-to-prod code. Yes, mw-config is in its own queue, but it should also have priority…

Jun 6 2019, 10:16 PM · Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure
thcipriani added a comment to T225064: post merge builds in citoid are failing.

Nice, thank you for the explanation :-] Left to figure out in a different task is how to test Citoid together with Zotero, but I guess that is for another task.

Euh, no, that's what this task is for :) We were able to build images before, now we are not.

Jun 6 2019, 3:46 PM · Core Platform Team Workboards (Done with CPT), Services (done), Release Pipeline, Citoid

Jun 5 2019

thcipriani added a comment to T225166: Gerrit crashed due to out of Heap.

I have gc info from right when this happened: https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTkvMDYvNS8tLWp2bV9nYy5nZXJyaXQubG9nLjcuY3VycmVudC0tMjMtMjgtNTE=&channel=WEB

Jun 5 2019, 11:41 PM · Gerrit
thcipriani added a comment to T225064: post merge builds in citoid are failing.

The pod that is running in the ci namespace of the staging cluster seems to be logging 504 errors:

Jun 5 2019, 7:04 PM · Core Platform Team Workboards (Done with CPT), Services (done), Release Pipeline, Citoid
thcipriani added a comment to T225064: post merge builds in citoid are failing.

Rerunning the test on contint1001 the failure message I see in the logs is:

Jun 5 2019, 3:50 PM · Core Platform Team Workboards (Done with CPT), Services (done), Release Pipeline, Citoid

Jun 4 2019

thcipriani claimed T177867: Pipeline image build cleanup.
Jun 4 2019, 4:08 PM · Release-Engineering-Team-TODO, Patch-For-Review, Release Pipeline
thcipriani added a comment to T224857: Enhance MediaWiki deployments for support of php7.x.

If there are not any better short-term solutions/ideas, depooling-deploy-restarting-pooling looks like the only option we have. We will have to sacrifice speed for platform stability, because at the end of the day, that's what our users expect.

Jun 4 2019, 2:45 PM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, Patch-For-Review, User-jijiki, PHP 7.2 support, Scap, serviceops

Jun 3 2019

thcipriani added a comment to T224857: Enhance MediaWiki deployments for support of php7.x.

What effect have the opcache_invalidate calls for specific files via sync-file had? Do we only see this corruption for opcache_reset?

Jun 3 2019, 9:14 PM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, Patch-For-Review, User-jijiki, PHP 7.2 support, Scap, serviceops
thcipriani created T224915: Deploy scap 3.10.0-1.
Jun 3 2019, 5:31 PM · Release-Engineering-Team (Deployment services), User-jijiki, Release-Engineering-Team-TODO, serviceops, Scap
thcipriani added a comment to T222539: Scap deployments are not purging MessageBlobStore (was: Stale localized messages).

Deployed on beta, the old method took ~25 seconds, new method takes ~5 seconds. I'll prep a new prod scap release.

Jun 3 2019, 5:13 PM · Release-Engineering-Team-TODO (201907), MW-1.34-notes (1.34.0-wmf.7; 2019-05-28), Performance-Team (Radar), Patch-For-Review, Scap, Regression, MediaWiki-ResourceLoader

May 28 2019

thcipriani added a comment to T224448: Gerrit http threads stuck behind sendemail thread.

Here SendEmail-1 is not "blocked" it's just waiting for jobs, however the dump says:
Locked ownable synchronizers:- <0x00000001c13617f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync).
Meaning that this SendEmail-1 thread holds the ReentrantLock lock which is blocking Http threads.
So something happened previously in that thread that caused the lock to still be held.

Is it also possible that it's some kind of blocking read causing a deadlock that doesn't show up in dumps? i.e., https://dzone.com/articles/java-concurrency-hidden-thread

Here it's clear that SendEmail-1 has held the lock but failed to release it. Also, it's the responsibility of guava LocalCache to not allow this to happen. Since SendEmail-1 is clearly not in a zone where the lock can legitimately be held, I suspect a bug in guava or in the loading method of the Account info (if LocalCache does not allow hard failures).

May 28 2019, 2:49 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, serviceops-radar, Gerrit
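The failure mode described in the T224448 comments, a worker that acquired a lock, never released it, and then appears idle while other threads wait, can be reproduced in miniature with Python's threading module. This is an analogy only: Gerrit is Java, and Python's threading.Lock, unlike java.util.concurrent.locks.ReentrantLock, does not track an owning thread. The worker names are hypothetical.

```python
import threading


def send_email_worker(lock: threading.Lock, acquired: threading.Event) -> None:
    """Take the lock and return without releasing it (the bug being illustrated)."""
    lock.acquire()
    acquired.set()


def http_worker(lock: threading.Lock, result: list) -> None:
    """Try to take the same lock, giving up after a short timeout."""
    got_lock = lock.acquire(timeout=0.2)
    result.append(got_lock)
    if got_lock:
        lock.release()


def demo_stuck_lock() -> bool:
    lock = threading.Lock()
    acquired = threading.Event()
    sender = threading.Thread(target=send_email_worker, args=(lock, acquired))
    sender.start()
    acquired.wait()
    sender.join()  # the sender has finished its work, yet the lock stays held

    result = []
    http = threading.Thread(target=http_worker, args=(lock, result))
    http.start()
    http.join()
    return result[0]


print(demo_stuck_lock())
```

In this sketch the second worker times out instead of hanging forever; without a timeout it would block indefinitely, much like the Http threads in the dump.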
thcipriani added a comment to T224448: Gerrit http threads stuck behind sendemail thread.

Here SendEmail-1 is not "blocked" it's just waiting for jobs, however the dump says:
Locked ownable synchronizers:- <0x00000001c13617f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync).
Meaning that this SendEmail-1 thread holds the ReentrantLock lock which is blocking Http threads.
So something happened previously in that thread that caused the lock to still be held.

May 28 2019, 1:53 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, serviceops-radar, Gerrit

May 27 2019

thcipriani added a comment to T222472: Investigate gerrit session expiration.

Restarted Gerrit 2019-05-27T23:10,

May 27 2019, 11:31 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, Gerrit
thcipriani triaged T224448: Gerrit http threads stuck behind sendemail thread as Normal priority.
May 27 2019, 11:18 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, serviceops-radar, Gerrit