Fri, Sep 22
Correct. spec.template.yaml is a spec template that should be used to create the actual spec.yaml for your service.
Thu, Sep 21
The fix for the summary endpoint in RB has been merged and deployed.
Wed, Sep 20
FTR, all of the aforementioned services use logstash1001 directly. That ought to change soon(TM) with T175242: all log producers need to use the logstash LVS endpoint.
This is currently occurring on RESTBase and Parsoid hosts as well as SCB, impacting most of the Node.js services and leaving them without logs in logstash.
Patchset merged, deploy about to happen. Resolving.
The main reason for this is that the MW API returns HTTP origins for images. On the RESTBase side, I have created PR #865 to cover the original field as well, which should fix the problem in the summary endpoint; for mobile-sections, a similar hack will need to be placed in MCS.
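The kind of normalisation described above can be sketched as follows. This is a hypothetical illustration, not the actual PR #865 code; the function name and the exact scope of the rewrite are assumptions.

```python
def normalize_origin(url):
    """Rewrite an http:// image origin to https://.

    Hypothetical sketch of normalising the origins the MW API returns;
    the real fix lives in RESTBase (PR #865) and may differ in scope.
    """
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    return url
```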
Tue, Sep 19
Confirmed to have fixed deployments on SCB, resolving. Thank you @Joe for the quick fix!
Mon, Sep 18
Note that we are talking about librdkafka v0.9.5 here ;) node-rdkafka does not support the v0.11 version (yet).
Fri, Sep 15
CSS files are considered to be part of the codebase. That means that they (can) change with each code deploy. Consequently, there are no events happening in the system when a particular source file changes (and given the amount of files we have in all of the repositories combined, it's not even feasible to do so). In other words, CSS files are not considered content per se. If we wanted to promote them to that status, option #2 is probably the most pragmatic and easiest to implement in the short term. To implement #1 reliably, we could explore setting up some hooks during the deployment process that would check the CSS files' last modified time. However, this assumes that (i) the CSS files are all part of the same service codebase; (ii) all of them are in the same directory (for setting up the checks more easily); and (iii) we can actually pull it off with Scap.
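The hook idea from option #1 could look roughly like this: after a deploy, scan the (assumed single) CSS directory and record the newest modification time. A minimal sketch, assuming caveats (i) and (ii) above hold; the function name is made up.

```python
import os


def css_last_modified(root):
    """Return the most recent mtime among .css files under root.

    Sketch of a post-deploy check: walk the tree once and keep the
    newest modification time seen on any CSS file. A real Scap hook
    would compare this against the previously recorded value.
    """
    latest = 0.0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(".css"):
                mtime = os.path.getmtime(os.path.join(dirpath, name))
                latest = max(latest, mtime)
    return latest
```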
Thu, Sep 14
Merged and deployed, resolving.
We should find and list all such domains and simply declare them in the configs, regardless of whether they will ever be used.
Ok, the above patch truly fixed the issue. There were problems in the seed list in both labs and staging, and they have now been remedied.
Wed, Sep 13
The job is being double-produced now, so resolving.
An intermediate solution that consults the MW API is available as PR 863, but as of this writing it needs some improvements, so comments and suggestions are welcome on the PR.
From the description, it looks like this would just perform reformatting? If so, and given the fact that clients already have all the needed info, what would be the exact benefit of doing this server-side?
Tue, Sep 12
Raising the priority as we should settle on this before migrating to the new storage scheme.
Mon, Sep 11
IMHO, updateBetaFeaturesUserCounts is the perfect candidate here. It's very lightweight (one SELECT, one UPDATE), it's idempotent and low-volume.
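The reason a job like this makes a good first candidate can be sketched concretely. The table and column names below are made up; the point is the shape: one SELECT, one UPDATE, and running it twice leaves the database in the same state (idempotent).

```python
import sqlite3


def update_beta_feature_user_count(conn, feature):
    """Recompute the user count for one beta feature.

    Hypothetical sketch of an updateBetaFeaturesUserCounts-style job:
    count enabled users (the SELECT), write the count back (the UPDATE).
    Re-delivering the job is harmless because the result is the same.
    """
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM user_prefs WHERE feature = ? AND enabled = 1",
        (feature,),
    ).fetchone()
    conn.execute(
        "UPDATE feature_counts SET users = ? WHERE feature = ?",
        (count, feature),
    )
    conn.commit()
```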
we might want to improve upon that.
Everything is set up now, and the cpjobqueue service is live in production on the SCB cluster (currently idling pending the resolution of T175210: Select candidate jobs for transferring to the new infrastructure). Calling this done!
Fri, Sep 8
The repo has been set up and cloned on tin, and the ops/puppet profile has been created and merged. What's left is to add the profile to SCB's role, which is scheduled to happen on Monday, 2017-09-11.
WRT the 10h lag, one theory could be that it is connected to T173710: Job queue is increasing non-stop, where the backlog of refreshLinks jobs (used to trigger updates to page properties as well) has been very high lately, especially on commons. If that is indeed the case (this is yet TBD), then no caching setting would help us.
Thu, Sep 7
My 2 cents as an Android app user: there is no difference to me whether the time frame is 10 mins or 1h, since the typical workflow is (i) spot something is wrong; (ii) refresh 2, 3 times (takes way less than a second); and then either complain or ignore. As somebody on the back-end side of this story, I opt for ignoring it (if I'm not in a position to purge it from Varnish right away), but I can relate to people that complain about it. I think that informing users about this edge case would go a long way. Posting something wherever people complain most often would greatly help, given the fact that solving this problem properly is not a small endeavour in technical terms.
RB PR has been undeployed.
With the deploy to production of the MCS part I assumed that the tests had been carried out in Beta (seeing that you don't need RESTBase there in order to compare the outputs). I can undeploy this from the RB side in production if needed.
The RESTBase side of things has been deployed, so it now contacts MCS to get the info needed for the summary endpoint. Note, however, that because the summary endpoint is used by Page Previews (high volume), I haven't dropped the old data from storage. This means that the new format will replace the old one gradually as pages need to be re-rendered.
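The gradual-replacement behaviour described above can be sketched in a few lines. This is a stand-in model, not RESTBase code: storage is a plain dict and render a callable; old-format entries are served untouched until something (an edit, a purge) evicts them, and only then is the new format written back.

```python
def get_summary(title, storage, render):
    """Serve a summary, upgrading stored values lazily.

    Whatever is in storage (old or new format) is returned as-is; a
    missing entry triggers a re-render, which stores the new format.
    Over time, re-renders replace all old-format entries.
    """
    entry = storage.get(title)
    if entry is not None:
        return entry
    entry = render(title)  # produces the new format
    storage[title] = entry
    return entry
```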
+1 on decoupling these concerns from the running services. This model would allow developers to concentrate solely on their service's functionality and would also decouple the configuration of the service itself from auxiliary facilities (like where to send logs and metrics, how to handle auth(n|z), etc).
Wed, Sep 6
Tue, Sep 5
@Nuria there seems to be some confusion in this conversation. I am not proposing to migrate the existing endpoints to project-specific domains; I am proposing to add these there so that the API is more easily discoverable.
Doesn't http://dev.wiki.local.wmftest.net:8888/?doc give you MCS' help page?
cause an error by putting the wrong URL in the input field
I looked over the patches, and I don't think they will solve this problem. One thing to note is that in the code you sometimes use the GET method, but the template enforces POST, which is the correct way: the MW API should be used with the POST method only.
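Ensuring the POST method can be illustrated with a minimal sketch. Nothing is sent here; the point is just that supplying a body (and the explicit method) yields a POST request. The endpoint URL and parameters are placeholders, and `format=json` is the usual MW API response format parameter.

```python
from urllib import parse, request


def mw_api_request(endpoint, params):
    """Build (but don't send) a POST request to a MediaWiki API endpoint.

    Sketch only: a real client would also set a User-Agent and actually
    open the request. The key point is method="POST", never GET.
    """
    body = parse.urlencode(dict(params, format="json")).encode()
    return request.Request(endpoint, data=body, method="POST")
```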
Mon, Sep 4
This has been merged and Puppet has been run. The main IPs are no longer in the seed lists, so resolving.
Agreed, this task has become confusing. As the start-up issue has been worked around, I am closing this task. I have created T174916: electron/pdfrender hangs where we can track the service's hangs in production.