Change Details

After an outage caused by expensive gallery tag expansions not timing out and being retried, we introduced a [PHP API timeout of 290s](https://gerrit.wikimedia.org/r/#/c/206440/). This was a good first step on the path from no timeout to something more reasonable (<60s? See earlier discussion in {T64615}). However, clients like Parsoid typically set significantly lower client timeouts for normally-cheap API actions, and retry when those elapse. For example, Parsoid batch requests [have a client timeout of 60s](https://github.com/wikimedia/parsoid/blob/de4a6f49c76c3e1036c4669b131b1e5b2868dfff/lib/config/ParsoidConfig.js#L34), and will [retry once](https://github.com/wikimedia/parsoid/blob/de4a6f49c76c3e1036c4669b131b1e5b2868dfff/lib/config/ParsoidConfig.js#L51) before the general API request timeout triggers.

Subsequent changes lowered the API timeout to 60s. However, there is evidence in this task, T152074 and T149421 that those timeouts do not actually work in an FCGI context. Expensive requests can and do pile up in HHVM, causing outages such as T151702.

## Further tightening timeouts per API end point or request

Since the normal cost of different API end points differs by several orders of magnitude, a global upper bound like 60 seconds is unlikely to be useful for many clients. For example, in many situations users are likely to move on instead of waiting for 60s. More critically, avoiding retry amplification requires a coordination of timeouts in our infrastructure (as described in https://www.mediawiki.org/wiki/Rules_of_thumb_for_robust_service_infrastructure), which means that the time budget for the lowest level services is very limited.

It would be much better for overall system stability to support tighter timeouts per API end point. One option would be to generally set up timeouts based on expected execution times, and clearly document these so that clients can set their timeouts slightly larger. Another option would be to allow clients to pass in a lower timeout. With that option, clients would:

- pass in a low timeout (slightly lower than the client HTTP timeout), and
- detect whether a timeout happened on the API side based on the HTTP status code, and stop retrying if it did.
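
The client-side half of the "pass in a lower timeout" option could look roughly like the Python sketch below. It is only an illustration under stated assumptions: `send`, `Response`, the 2-second margin and the set of status codes treated as "the API timed out" are all hypothetical, and nothing here implies that the API currently accepts a client-supplied timeout.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Assumption: the API reports a server-side timeout with one of these codes.
TIMEOUT_STATUSES = frozenset({503, 504})


@dataclass
class Response:
    # None means the client-side HTTP timeout elapsed before any reply arrived.
    status: Optional[int]
    body: str = ""


def server_timeout_for(client_timeout_s: float, margin_s: float = 2.0) -> float:
    """Server-side budget, slightly lower than the client HTTP timeout."""
    return max(client_timeout_s - margin_s, 0.1)


def call_api(
    send: Callable[[float, float], Response],
    client_timeout_s: float = 60.0,
    max_retries: int = 1,
) -> Tuple[Response, int]:
    """Call `send(server_timeout_s, client_timeout_s)` with a bounded retry policy.

    `send` is a hypothetical transport hook: it performs one request, asks the
    API to give up after `server_timeout_s`, and enforces `client_timeout_s`
    locally. Returns the last response and the number of attempts made.
    """
    server_timeout_s = server_timeout_for(client_timeout_s)
    attempts = 0
    while True:
        attempts += 1
        resp = send(server_timeout_s, client_timeout_s)
        if resp.status in TIMEOUT_STATUSES:
            # The API itself ran out of time: the request is expensive,
            # so retrying would only pile more work onto the backend.
            return resp, attempts
        if resp.status is not None and resp.status < 500:
            # Success or a client error: nothing to retry.
            return resp, attempts
        if attempts > max_retries:
            return resp, attempts
        # Otherwise (no reply, or a non-timeout 5xx): retry.
```

Because the server-side budget is lower than the client's HTTP timeout, an expensive request should normally end with an explicit timeout status rather than a silent client-side timeout, which is what lets the client tell "too expensive, do not retry" apart from a transient failure.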