[09:40:45] (HAProxyRestarted) firing: HAProxy server restarted on cp4040:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4040&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[10:15:45] (HAProxyRestarted) resolved: HAProxy server restarted on cp4040:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4040&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[11:39:48] 10Traffic, 10MW-on-K8s, 10SRE, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[11:48:35] 10Traffic, 10MW-on-K8s, 10SRE, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[13:29:29] <_joe_> hi, was there any update re: esams? it looks like things have settled down quite a bit compared to yesterday
[13:29:59] _joe_: UK and DE are still going to drmrs
[13:30:10] we should merge https://gerrit.wikimedia.org/r/c/operations/dns/+/951486 at some point
[13:30:20] yeah, we're past the 24h mark now and things do seem stable, we should go ahead with that
[13:31:29] <_joe_> I am going by https://grafana.wikimedia.org/d/O9zAmeOWz/ats-cache-operations?orgId=1&var-site=esams&var-site=drmrs&var-layer=backend&viewPanel=4&from=now-24h&to=now
[13:31:38] <_joe_> it seems the cache hit ratio has improved a lot today
[13:32:02] well, the /backend/ hit ratio is tricky, because the frontend caches are large, too
[13:32:25] <_joe_> it's what counts for the amount of requests we send to the backends
[13:32:37] but yeah, now it's been long enough that contents have rolled around in the frontend and started really using the backend more
[13:32:44] <_joe_> claime: you were saying rps were still a bit elevated in eqiad?
[13:32:45] fyi, UK and DE are a lot of traffic: https://librenms.wikimedia.org/graphs/to=1692797400/id=19399%2C19413%2C19414%2C19112%2C19145/type=multiport_bits_separate/from=1692624600
[13:32:55] see the first spike
[13:32:56] it is still elevated from before
[13:33:04] just not as bad as initially
[13:33:06] yes, from baseline
[13:33:13] that was before re-routing the 2 countries
[13:33:45] Yesterday same time was 3k rps, today is *grafana crashes*
[13:34:04] between 4 and 3.6k
[13:34:32] <_joe_> claime: lol same thing just happened to me :D
[13:35:01] <_joe_> yeah, I'm not sure it's sustainable to move back DE and UK
[13:35:24] _joe_: well, the frontend gets the first shot at the traffic, so the ratio of incoming user requests to applayer requests out of ATS depends on the combined hitrate of fe+be.
[13:35:28] <_joe_> we won't be in a secure place if we have to send all read-only traffic to a single dc
[13:36:07] <_joe_> bblack: actually, the cache hit ratio does, the number of requests to the backend not really
[13:36:12] but usually the backend caches are storing content and not really hitting it much early on when they're cold, because the FE is caching the same thing the BE is.
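To make the fe+be arithmetic in the exchange above concrete, here is a minimal Python sketch of how the combined frontend+backend hit ratio determines what reaches the applayer. The function name and all the traffic figures are invented for illustration; they are not measured values from these clusters.

```python
# Sketch: applayer load depends on the combined fe+be hit ratio,
# not on the backend hit ratio in isolation.

def applayer_rps(user_rps: float, fe_hit: float, be_hit: float) -> float:
    """Requests per second that miss both cache layers (illustrative model)."""
    misses_after_fe = user_rps * (1 - fe_hit)   # what the frontend lets through
    return misses_after_fe * (1 - be_hit)       # what the backend then lets through

# Hypothetical numbers: 30k user rps, 88% frontend hit rate.
print(applayer_rps(30_000, 0.88, 0.005))  # ~3582 rps with a 0.5% backend hit rate
print(applayer_rps(30_000, 0.88, 0.024))  # ~3514 rps with a 2.4% backend hit rate
```

The point of the toy model is that the absolute effect of a backend hit-rate change is scaled by whatever the frontend has already absorbed, which is the "relative to what comes out of the frontend cache" argument made in the conversation.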
[13:36:37] <_joe_> bblack: yeah but we're past 24 hours, I doubt it can improve much from here
[13:36:50] correct
[13:37:32] btw, there are deployments ongoing, so we should wait until that's done for anything
[13:37:37] on text the be hitrate is never all that great lately.
[13:38:39] <_joe_> but it's less than half what it was in drmrs before the repooling, which is ok if we make our backends beefier
[13:39:15] yeah, past 24h average for cache_text: drmrs 2.4%, esams 0.5%
[13:39:32] <_joe_> it makes a lot of difference :)
[13:40:22] I'm not saying it doesn't make a difference, just saying it's relative to what comes out of the frontend cache, too
[13:40:45] (which, we probably still need to upsize on the new hardware)
[13:44:20] (the new cp nodes used in the single-backend config have more memory for FE cache, but I don't think we puppeted the tunables to use more of it)
[13:44:36] e.g. esams cache_text has over half its RAM sitting idle
[13:55:57] <_joe_> bblack: ok, that seems like an easy optimization to improve the state of things
[13:56:49] in theory yes, in practice it's tricky to tune. We need to spend some time validating that we don't push it too far and destabilize things for lack of free working memory for other things
[13:57:14] but the currently-puppetized tuning is a generic formula that was built to handle the much smaller-memory varnish hosts we used to have.
[13:57:35] an esams text node with the new 512GB memory config, after 24h, still has a ton of free RAM not being used: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=cp3066&var-datasource=thanos&var-cluster=cache_text&viewPanel=4&from=now-24h&to=now
[13:57:41] <_joe_> bblack: my point is - if we can tune things to get the increase in backend requests under control
[13:57:58] <_joe_> then move back all countries to esams
[13:59:00] the other two countries shouldn't have a /huge/ impact on reqrates once the caches have filled. Some, but not as much as before.
[13:59:11] <_joe_> that would be acceptable - to be clear, the current request rate is not strictly sustainable given the current database capacity
[13:59:36] but fixing the memory tuning will take some time, we'll want to do it on a few nodes and see stability for a week or so, etc.
[13:59:56] (you can go too far and some memory usage spike kills things)
[14:00:28] (also, changing it implies wiping the frontend cache of that node)
[14:01:08] <_joe_> sure, I'm not saying it's fast
[14:01:58] <_joe_> if you want we can try to move back UK and DE, and if the backend req rate doesn't increase significantly, we can keep esams pooled while you work on optimizations
[14:02:08] <_joe_> what we cannot do is wait for the disks as-is
[14:03:06] we have some more-nuclear options too, like reverting to a multi-backend config, but that's kind of a last resort as well.
[14:03:23] we could try to minimize the wall-clock duration for which we're finding the new RAM amount by (at off-peak?) restarting each node in esams with a differently-sized frontend cache
[14:03:49] and staggered quite a bit too
[14:03:58] from +10% to +50% splayed across nodes, or similar
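As a rough sketch of the splay idea just discussed, the snippet below steps frontend cache sizes evenly between +10% and +50% over a baseline so that each node probes a different ceiling. The `splayed_sizes` helper, the node list, and the 200 GB baseline are all hypothetical, purely to show the shape of the calculation; this is not the actual puppet/hiera change referenced later in the log.

```python
# Hypothetical splay of frontend cache sizes across a set of nodes,
# stepping linearly from +10% to +50% over the current baseline.

def splayed_sizes(nodes: list[str], baseline_gb: float,
                  low: float = 0.10, high: float = 0.50) -> dict[str, float]:
    """Assign each node a frontend cache size, stepped evenly from +low to +high."""
    if len(nodes) == 1:
        return {nodes[0]: round(baseline_gb * (1 + high), 1)}
    step = (high - low) / (len(nodes) - 1)
    return {node: round(baseline_gb * (1 + low + i * step), 1)
            for i, node in enumerate(nodes)}

# Example with made-up hosts and a made-up 200 GB baseline:
print(splayed_sizes(["cp3066", "cp3067", "cp3068", "cp3069"], 200))
# {'cp3066': 220.0, 'cp3067': 246.7, 'cp3068': 273.3, 'cp3069': 300.0}
```

Restarting nodes with staggered values like these would let the safe upper bound be found in one pass instead of serially re-tuning a single host, at the cost of wiping each node's frontend cache once.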
[14:06:38] yeah, I have an old patch, I'm gonna work up something to make it modernly tunable in a sec
[14:08:55] <_joe_> and btw: I do agree it would be desirable to not rely on the CDN to survive our normal traffic, but changing that would require a 5x increase in our hardware budget I think, at least in the short term
[14:09:13] <_joe_> and that doesn't seem likely in the mid term
[14:09:58] of course, that's long-range thinking
[14:10:48] but the multi-backend hashing has proved unresilient over time; that's the push for this new config. the way heavy traffic focuses on one node and kills it, and the perf interactions of the 8 nodes in general
[14:11:00] One last thing about the particulars: while excluding UK makes sense from a cache-warming point of view (it's going to get warmed up by people all over EU using enwiki), repooling DE with dewiki being barely used outside Germany means we'll take a noticeable cold-cache hit again
[14:11:26] it only takes a few to warm up hot articles
[14:11:30] It won't be as big as yesterday for sure, but the dewiki cache will not be warm
[14:11:58] I would expect it to already have some decent contents
[14:12:13] geoip isn't always perfect, and there are plenty of DE speakers in neighboring countries
[14:14:08] https://w.wiki/7JzW <- cache hosts receiving dewiki traffic past 24h
[14:14:40] anyways, we should wait for off-peak to switch anything back regardless, I think
[14:17:26] I estimate about 30% of dewiki traffic coming from non-DE geoip-tagged users
[14:18:16] right, but it's more about the content patterns than the number of users
[14:18:38] if those 30% are reading a lot of the same articles as the other 70%, they've already done the cache filling for them. statistically it tends to work out.
[14:19:06] when caches are cold, a lot more raw traffic bleeds through of course
[14:48:04] https://gerrit.wikimedia.org/r/c/operations/puppet/+/849633 -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/951949/1
[14:48:38] ^ implements new tuning for fe cache mem, optional, with the legacy calculation as the default. the second patch sets a range of experimental values on the esams clusters as per cdanis's suggestion
[14:50:24] has a dumb syntax/definedness issue, will fix after the meeting
[14:55:26] (or now, maybe)
[15:48:52] 10Traffic: confd seems to be leaking memory in cp hosts - https://phabricator.wikimedia.org/T344831 (10Vgutierrez)
[15:50:35] 10Traffic: confd seems to be leaking memory in cp hosts - https://phabricator.wikimedia.org/T344831 (10Vgutierrez) we are currently using the confd version shipped with bullseye:
` vgutierrez@cp6016:~$ apt policy confd
confd:
  Installed: 0.16.0-1+deb11u0
  Candidate: 0.16.0-1+deb11u0
  Version table:
 *** 0.16....