[02:40:12] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5006:9331 is unreachable - https://alerts.wikimedia.org
[02:59:57] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Legoktm) @Vgutierrez cp5006 dropped off monitoring at exactly midnight, and ssh for it has been flapping - currently I can't get in....
[04:42:47] 10Wikimedia-Apache-configuration, 10Fundraising-Backlog, 10SRE, 10Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10Tsevener) This fix is in new release candidate Testflight 6.8.2 (1868).
[06:40:12] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5006:9331 is unreachable - https://alerts.wikimedia.org
[07:56:45] asking again during EU daytime: I closed T263829 yesterday, does that mean that we can close https://phabricator.wikimedia.org/T210411 and https://phabricator.wikimedia.org/T108580 now too?
[07:56:45] T263829: cloudweb2001-dev: add TLS termination - https://phabricator.wikimedia.org/T263829
[08:19:04] majavah: very nice!
[08:19:05] let me see
[08:21:16] it does indeed look like T210411 can now be closed
[08:21:16] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411
[08:22:26] 10Traffic, 10SRE, 10Patch-For-Review: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10ema)
[08:32:49] 10HTTPS, 10SRE, 10Traffic-Icebox: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (10ema) 05Open→03Resolved a:03ema Many of the assumptions made when this task was created have changed since the migration to ATS for cache backends (no more IPSec, the difference between Ti...
[08:33:01] 10HTTPS, 10SRE, 10Traffic-Icebox: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (10Majavah) Anything left to do here, now that all backends are using TLS?
[08:33:06] lol, commented at the same time
[08:33:10] * majavah deletes his comment
[08:33:47] haha, precisely at the same minute
[09:48:21] I'm looking for review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/742148 , I'm out of my element and needed to make a deeper change than I hoped.
[09:48:48] Happy to split or improve the patch if others make suggestions.
[10:40:12] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5006:9331 is unreachable - https://alerts.wikimedia.org
[10:50:24] cp5006 behavior is weird
[10:50:36] haproxy metrics are still being gathered properly
[10:50:42] See https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqsin%20prometheus%2Fops&var-instance=cp5006&from=now-30m&to=now
[10:53:52] And according to those metrics it's been serving traffic till ema depooled it
[12:51:30] yeah some varnish metrics were also still being reported
[12:56:27] but there's something very wrong with the host, I can't even log in on the console as root
[12:56:46] it just hangs there after getting the password
[13:05:12] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp5006:9331 is unreachable - https://alerts.wikimedia.org
[14:00:56] ema: Please let me know if there's some process I should follow to get this patch into a review backlog, https://gerrit.wikimedia.org/r/c/operations/puppet/+/742148
[14:06:24] Just asking because I'm an outsider to the codebase and don't know who can review, how this sort of .vcl change can be safely tested, etc.
[14:22:14] awight: hey :)
[14:23:48] awight: the best way to test your change is with https://github.com/wikimedia/puppet/tree/production/modules/varnish/files/tests
[14:24:19] if you follow the README you'll see that there's a dockerfile and a procedure to run your VTC tests and see what happens
[14:28:11] Very nice, thanks for the pointer :-)
[14:28:39] I haven't had the time yet to properly review your change, but could you explain what the problem is? What is the status quo, why is it not great, and what would the ideal solution be?
[14:32:28] the reason I'm asking is that in general changing the way vcl_hash works is a fairly drastic measure to take :)
[14:32:58] and maybe we can achieve the same end using other means
[14:44:20] ema: +1 thanks, I was hoping to have that conversation!
[14:44:50] So, we're adding a new URL parameter for revid, which makes it possible to render maps on an older version of a page.
[14:45:40] However, the embedded maps rarely change, so the revid will fragment the cache unnecessarily. There's already a hash of the embedded map in the URL, so we can rely on that to provide the variation.
[14:46:58] We learned some things that might make this change extraneous, however. Map images have roughly a 75% cache hit rate, and cache entries only seem to last for an hour. Therefore, the fragmentation caused by revid would only affect very fast-changing pages, and would only result in a few extra cache entries for maps on those pages.
[14:48:18] When we came up with this vcl_hash thing I had the wrong assumption that images were cached for a long time.
[14:50:37] awight: I see that karto is returning images with Cache-Control: max-age=3600
[14:51:08] in general we follow what the origin server says in that regard (with some caps for sanity)
[14:52:33] currently the cap is 24h, so there's definitely room for increasing the lifetime of map images
[14:53:07] 75% hit rate seemed pretty good to me--how would we tune that?
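[Editor's note: the VTC workflow ema points to can exercise exactly this kind of vcl_hash question. Below is a minimal sketch, not the contents of change 742148: the URL scheme, the `hash`/`revid` parameter names, and the vcl_hash body are illustrative assumptions.]

```vtc
varnishtest "revid variants should share one cache object"

server s1 {
    rxreq
    txresp -hdr "Cache-Control: max-age=3600" -body "png-bytes"
} -start

varnish v1 -vcl+backend {
    sub vcl_hash {
        # Hash the URL with any revid parameter removed.
        # Sketch only: assumes revid is never the first query parameter.
        hash_data(regsuball(req.url, "&revid=[^&]*", ""));
        if (req.http.host) {
            hash_data(req.http.host);
        }
        return (lookup);
    }
    sub vcl_deliver {
        # Expose the hit counter so the test can assert on it.
        set resp.http.X-Hits = obj.hits;
    }
} -start

client c1 {
    txreq -url "/img/tile.png?hash=abc&revid=100"
    rxresp
    expect resp.status == 200
    # Same map hash, different revid: should be served from cache.
    txreq -url "/img/tile.png?hash=abc&revid=200"
    rxresp
    expect resp.http.X-Hits == 1
} -run
```

Run with `varnishtest` (or via the dockerized procedure in the README); the second request failing the `X-Hits` expectation would indicate the two URLs still hash to distinct objects.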
[14:53:15] I mean, what would a target hit rate be?
[14:53:55] 100%!
[14:54:10] just kidding, the higher the better obviously, but it depends on many factors
[14:54:56] Yes, my understanding is that we're balancing cache memory against application load, and there is a point of diminishing returns.
[14:56:08] Perhaps, for the feature we're trying to implement now, there isn't any cache tuning required. We can watch the hit rate and if it drops, then we look at this .vcl patch again?
[14:57:03] sure
[14:57:31] Our team (WMDE Technical Wishes) does want to take a look at a few other aspects of maps caching, so we've made time for this either way.
[14:57:33] more in general, if the object being returned to the client does not depend on revid, then let's just drop revid
[14:58:00] ah, that's the issue--revid is needed for making a backend request to get the embedded mapdata from mediawiki
[14:58:07] (it's a horrifying circular architecture)
[14:58:39] I see
[14:59:07] then we can just hide it, which is similar to what you're doing but can be done without touching vcl_hash
[14:59:50] O_O how do we hide it?
[14:59:55] it isn't pretty, but: we can set an additional header (say x-original-revid), remove revid from the url, and restore it when sending the request to ATS
[15:00:07] then do something similar in Lua at the ATS layer :)
[15:00:21] I don't think we have that option, these are requests coming from user browsers
[15:00:56] no, I mean in the CDN
[15:01:14] we don't have just a layer of Varnish for caching, we also have Apache Traffic Server for on-disk caching
[15:02:05] and it also needs to be instructed about all this
[15:02:22] * awight consults https://wikitech.wikimedia.org/wiki/Caching_overview
[15:02:47] * ema hopes it is not entirely out of date
[15:03:14] Okay, so this would be implemented using .vcl still, but the goal would be to take load off of the karto server and fetch redundant images straight from ATS. I can dig it.
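[Editor's note: ema's hide-the-parameter idea could look roughly like this at the Varnish layer. A sketch under assumptions: the regexes are simplified, the case where revid is the first query parameter is not handled, and the real solution would also need the matching Lua logic on the ATS side that ema mentions.]

```vcl
sub vcl_recv {
    # Stash revid in a header and strip it from the URL, so the
    # default cache key (which includes req.url) no longer varies on it.
    # Simplified: assumes revid is never the first query parameter.
    if (req.url ~ "&revid=") {
        set req.http.X-Original-Revid =
            regsub(req.url, ".*&revid=([^&]*).*", "\1");
        set req.url = regsuball(req.url, "&revid=[^&]*", "");
    }
}

sub vcl_backend_fetch {
    # On a miss, put revid back before the request leaves for ATS,
    # which in turn would restore it for the maps origin.
    if (bereq.http.X-Original-Revid) {
        if (bereq.url ~ "\?") {
            set bereq.url = bereq.url + "&revid=" + bereq.http.X-Original-Revid;
        } else {
            set bereq.url = bereq.url + "?revid=" + bereq.http.X-Original-Revid;
        }
    }
}
```

The design point here is that vcl_hash stays untouched: the built-in hash still keys on req.url, but the URL it sees has already been normalized, so revid variants collapse into one object while the origin still receives the full URL.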
[15:04:43] It bothers me semantically, since the revid is part of the request and belongs at the same level as "title" etc., but it's understandable that we make compromises to protect the machines we love :-)
[15:05:21] I think a good way forward now is to first evaluate the performance impact of the revid change and then take it from there
[15:05:50] Meaning, evaluate the actual impact on production? I have to agree.
[15:05:58] right
[15:06:19] We've asked for a performance review of this feature; if you don't mind, I'll subscribe you to the task.
[15:06:25] sure
[15:07:33] Unfortunately, it's not simple to analyze the impact ahead of time (because revid is not sent yet), but once it is, we can look at the cache hit rate for a step change, and we can look through the web request logs to calculate how many times a revid changes but the map hash does not, within a given time period. Thanks for all the suggestions!
[21:09:24] 10Traffic, 10Foundational Technology Requests, 10SRE, 10Wikimedia Enterprise, 10Wikimedia Enterprise Discussion: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (10DAbad) 2021-12-08 Tech Steering Committee - seems like a small amount of effort - need by December 17th
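[Editor's note: the log analysis awight describes at 15:07:33 could start with a sketch like this one. The `hash` and `revid` query-parameter names are hypothetical; the real map-image URL scheme may differ.]

```python
from urllib.parse import parse_qs, urlsplit


def fragmentation_ratio(urls):
    """Estimate how much keyspace revid adds to map-image requests:
    unique (map hash, revid) pairs divided by unique map hashes alone.
    A ratio near 1.0 means revid barely fragments the cache."""
    hashes = set()
    pairs = set()
    for url in urls:
        q = parse_qs(urlsplit(url).query)
        h = q.get("hash", [""])[0]
        r = q.get("revid", [""])[0]
        hashes.add(h)
        pairs.add((h, r))
    return len(pairs) / len(hashes) if hashes else 1.0


urls = [
    "/img/tile.png?hash=abc&revid=1",
    "/img/tile.png?hash=abc&revid=2",  # same map, new page revision
    "/img/tile.png?hash=def&revid=2",
]
print(fragmentation_ratio(urls))  # 3 pairs / 2 hashes = 1.5
```

Fed with a window of sampled webrequest URLs, this would quantify the "revid changes but the map hash does not" case directly, before deciding whether the vcl_hash patch is worth it.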