[01:08:01] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724#12048730 (Papaul)
[01:10:15] netops, DC-Ops, Infrastructure-Foundations, ops-eqsin, SRE: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724#12048732 (Papaul)
[01:13:19] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12048738 (Papaul)
[05:44:56] netops, Traffic, Data-Persistence, Data-Platform-SRE, and 5 others: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12048903 (ayounsi)
[05:57:34] netops, Traffic, Data-Persistence, Data-Platform-SRE, and 5 others: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12048932 (ayounsi)
[08:13:57] Traffic, SRE-swift-storage: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#12049171 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1003 for host cp7001.magru.wmnet with OS trixie
[09:08:43] Traffic, SRE-swift-storage: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#12049314 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1003 for host cp7001.magru.wmnet with OS trixie completed: - cp7001 (**PASS**) - Downtimed on Icinga/Alertmana...
[09:12:35] \o I am working on some IPIP stuff for our k8s clusters, and my next step would be to run the sre.loadbalancer.migrate-service-ipip cookbook. However, its dry-run already fails since lvs2013 and lvs2014 have critical errors on icinga (config diff and an unhealthy backend). I am not sure what the not-loaded change is, so I'm hesitant to touch anything
[09:13:20] thanks klausman I'll have a look soon
[09:13:26] merci!
[09:24:37] It might be related to https://phabricator.wikimedia.org/T429773 (btullis mentioned it)
[09:24:38] Hello. I have a suspicion that the pybal unhealthy backend is probably an artifact of https://phabricator.wikimedia.org/T429773
[09:26:17] Scratch that, this is codfw. The cluster there is still small, so it will be something else.
[09:26:49] I will still depool dse-k8s-wdqs2001 and investigate.
[09:26:57] Traffic: HTTP 503 error trying to make any edits on Wikipedia - https://phabricator.wikimedia.org/T423991#12049382 (Aklapper) Open→Invalid Unfortunately closing this Phabricator task as no further information has been provided. @Ergur: After you have provided the information asked for and if thi...
[09:29:21] btullis: Thank you
[09:29:39] klausman: A minute or two more and I think the other alert should clear
[09:30:52] Nope, guess not
[09:31:18] That host is already depooled.
[09:32:08] lvs2013 and 14 only show the config diff alert now, so the DSE bit has cleared
[09:32:32] PyBal has been restarted to pick up some configuration changes
[09:34:32] and we're down to the ipvs<>pybal mismatch
[09:40:00] klausman: Should clear now
[09:40:15] yeah, 2013 is already all green. Thanks a bunch!
[09:40:17] netops, Data-Persistence, DC-Ops, Infrastructure-Foundations, and 3 others: codfw: rack A6 maintenance - https://phabricator.wikimedia.org/T429812#12049442 (jcrespo) I've stopped ms-backups2003 network operations for now, codfw media backups will continue to flow temporarily only through ms-backu...
[09:40:33] Next icinga run is in two minutes on 2014
[09:40:44] I'll stare at it until morale improves :-)
[09:41:12] thanks :) I will do my IPIP shenanigans after lunch, I now have to run get said lunch, before the outside becomes entirely unbearable
[09:41:57] So I'm unlear. Was the only problem that pybal hadn't been restarted on lvs2013 and lvs2014 - or was there something more fundamental?
I still have one node depooled, but it had been the whole time, I think.
[09:42:05] *unclear
[09:42:43] IPVS also had a stale service that PyBal no longer knew about
[09:44:11] The dse-k8s-wdqs2001 wasn't really to blame I believe. It just sort of hung around as a warning. The other alerts were critical
[09:45:07] So the offending service was inference-staging which had changed ports from 30051 to 30443
[09:45:31] The old port then hadn't been cleared out
[09:47:30] Ah, thank you for the explanation.
[09:49:21] It's this: https://wikitech.wikimedia.org/wiki/PyBal#Services_in_IPVS_but_unknown_to_PyBal if you're interested. The actual addr:port combo is only visible in Icinga, not in the AlertManager UI.
[09:50:17] I can't stand the AlertManager UI. :-)
[10:03:12] Traffic, Sustainability (Incident Followup): Ensure the pre-repooling checklist includes to restart liberica services whenever realserver IPs has changed - https://phabricator.wikimedia.org/T426299#12049567 (hnowlan) I'm not 100% sure I'm afraid, I just filed this action item as it was in the [[ https://...
[10:10:43] Traffic, Patch-For-Review: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#12049606 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1003 for host cp2044.codfw.wmnet with OS trixie
[10:10:45] Traffic, Patch-For-Review: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#12049607 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1003 for host cp2043.codfw.wmnet with OS trixie
[10:32:38] hi - can I get a basic intro to traffic prometheus metrics?
I'm trying to see how I can get requests per second to MW app servers, and if it's possible to see cache hits vs misses for just the subset that would end up hitting the app servers
[10:32:48] I guess it's app "containers" now
[10:33:03] netops, Infrastructure-Foundations, Data-Platform-SRE (2026-06-05 - 2026-06-26), Kubernetes: Calico IPv4/IPv6 block exhaustion on dse-k8s cluster, blocking new node provisioning - https://phabricator.wikimedia.org/T429773#12049644 (BTullis) I think that what I'm going to do in the short term is t...
[10:34:20] I looked around grafana a bit and see some metrics that sound close, like varnish_requests, but not sure how to choose and which one has the dimensions I'm after
[10:34:43] cc maybe @fabfur ?
[10:36:03] not really an expert in prometheus metrics but I can have a look if it's possible... I'm afraid our cache ratio metrics aren't available out of the box for single endpoints...
[10:36:33] probably someone from o11y can confirm
[10:41:03] yep, varnish_x_cache is produced by mtail configuration that doesn't really know about the backend, it collects information about method, status code, hit or miss or pass and ttfb
[10:41:37] milimetric: What are the sort of dimensions you're after? I would have thought that this 'Application Servers' dashboard might be a good place to start. https://grafana.wikimedia.org/goto/efq3eveq9a39cc?orgId=1
[10:42:00] but I think there could be other options to obtain that data, although probably not live
[10:42:39] That's looking at `mediawiki_http_requests_duration_count` broken down by deployment like `mw-web` or `mw-api`. Does that help?
[10:44:05] Oh, sorry. I missed the cache hits v misses part of your question. I'll keep out of your way and leave it to someone who knows better than I do.
[10:51:32] Traffic, Patch-For-Review: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#12049707 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1003 for host cp2044.codfw.wmnet with OS trixie completed: - cp2044 (**PASS**) - Downtimed on Icinga/Al...
[10:57:19] Traffic, Patch-For-Review: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#12049730 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1003 for host cp2043.codfw.wmnet with OS trixie completed: - cp2043 (**PASS**) - Downtimed on Icinga/Al...
[12:02:01] netops, Data-Persistence, DC-Ops, Infrastructure-Foundations, and 3 others: codfw: rack A6 maintenance - https://phabricator.wikimedia.org/T429812#12049887 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b3768ce5-4982-4cdb-ac8d-3735e9e5290b) set by ayounsi@cumin1003 for 2:00:00...
[12:02:58] netops, Data-Persistence, DC-Ops, Infrastructure-Foundations, and 3 others: codfw: rack A6 maintenance - https://phabricator.wikimedia.org/T429812#12049888 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ac4b8e18-e54c-4708-804e-e3c84d435ded) set by ayounsi@cumin1003 for 2:00:00...
[12:42:09] netops, Data-Persistence, DC-Ops, Infrastructure-Foundations, and 3 others: codfw: rack A6 maintenance - https://phabricator.wikimedia.org/T429812#12049997 (ayounsi) Open→Resolved All done, and all services re-pooled.
[12:46:48] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1305397 I have made the change to enable IPIP for the ML staging k8s control plane. Would someone be willing to review?
[12:49:14] (also implicitly asking for an ok for the cookbook to run and thus restart pybal)
[12:49:14] Traffic, Patch-For-Review: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#12050020 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1003 for host cp3066.esams.wmnet with OS trixie
[12:49:15] Traffic, Patch-For-Review: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#12050021 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1003 for host cp3074.esams.wmnet with OS trixie
[13:15:37] milimetric: Oh wait you just want mw-on-k8s rps?
[13:15:59] milimetric: mw-web, mw-api-int or mw-api-ext or all of them?
[13:16:29] because if they end up in mw-web or mw-api-ext, they're cache misses by definition
[13:16:58] Argh, I should backlog completely, you want the hit ratio :(
[13:17:20] sorry, no, I'm also not 100% clear and sure what I need
[13:18:04] jumping into a meeting now but basically working on logged-in/logged-out performance, and I'm trying to figure out what's available. But yeah, something around mw k8s performance, cache ratio
[13:26:48] netops, Traffic, Data-Persistence, Data-Platform-SRE, and 5 others: codfw: rack B2 maintenance 2026-07-01 11:00 am CT - https://phabricator.wikimedia.org/T429861#12050252 (ayounsi)
[13:47:00] Traffic: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#12050414 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1003 for host cp3074.esams.wmnet with OS trixie completed: - cp3074 (**PASS**) - Downtimed on Icinga/Alertmanager - Disable...
[13:51:29] Traffic: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#12050461 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1003 for host cp3066.esams.wmnet with OS trixie completed: - cp3066 (**PASS**) - Downtimed on Icinga/Alertmanager - Disable...
[14:06:12] ok @fabfur (cc @claime) I'm hearing from Hugh that the right metric would maybe be around HAProxy? I think total cache hits and cache misses for mw-web, mw-api-int or mw-api-ext would be ideal, is that possible?
[14:10:50] milimetric: The problem is the backend decision is made by ATS and not HAProxy so we would have to approximate somehow
[14:10:52] It's a bit messy
[14:12:06] yep, I think the problem is that haproxy metrics, while they have the cache hits/miss information (from mtail), they don't contain the information about which one from web|api-int|api-ext
[14:12:30] this on the edge
[14:12:41] aha, so I'd get image cache info mixed in with web stuff, right?
[14:13:44] well, at least you can split between text and upload
[14:14:01] milimetric: well, the edge doesn't directly talk to mw-api-int directly I think at all, for instance
[14:14:09] err take out one of those 'directly's
[14:14:42] hm, I wonder what `haproxy_backend_http_cache_lookups_total` is, sounds close?
[14:14:54] if you don't need this information in real time, I think some query on webrequest could be more useful in this case
[14:15:29] milimetric: that's for haproxy's own in-memory cache impl, which we don't use :) and it's 0 everywhere
[14:15:34] yeah, maybe webrequest is the way here
[14:15:37] I think so too
[14:15:47] k, thanks!
Sorry for the noise
[14:16:18] np
[14:27:29] Traffic: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#12050723 (Fabfur)
[14:28:27] Traffic: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#12050734 (Fabfur) In progress→Resolved All hosts with haproxy-awlc reimaged with standard haproxy 3.2 version
[14:56:33] Traffic, Lift-Wing, ServiceOps new, ServiceOps-SharedInfra, and 2 others: Host Qwen 3.6-27B as an inference service - https://phabricator.wikimedia.org/T425680#12050807 (gkyziridis) After using the latest `vllm0.22.1-3` with the `hipcc` fix, the deployment failed again. I am reporting here the ch...
[15:04:18] Traffic, Lift-Wing, ServiceOps new, ServiceOps-SharedInfra, and 2 others: Host Qwen 3.6-27B as an inference service - https://phabricator.wikimedia.org/T425680#12050839 (klausman) > I removed the ml-serve1015.eqiad.wmnet from the deployment and left only the ml-serve1013.eqiad.wmnet and ml-serve1...
[15:25:19] Traffic: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#12050927 (MoritzMuehlenhoff) Are the results of the comparison between aws-lc and openssl available somewhere?
[15:31:33] Traffic: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#12050972 (Fabfur) >>! In T419825#12050927, @MoritzMuehlenhoff wrote: > Are the results of the comparison between aws-lc and openssl available somewhere? There's a shared doc available that eventually will be turn...
[15:39:10] Traffic, Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review: Surge in webrequest validation check - https://phabricator.wikimedia.org/T422030#12050997 (Snwachukwu) We've been getting Data problem ERROR and Data problem WARNINGS alerts. The alerts could be as a result of bad request...
[16:34:56] Traffic, Lift-Wing, ServiceOps new, ServiceOps-SharedInfra, and 2 others: Host Qwen 3.6-27B as an inference service - https://phabricator.wikimedia.org/T425680#12051238 (gkyziridis) >This is a bit of a "side quest", but should we maybe precompile these? I feel ml-build is sortof the ideal place t...
[17:53:07] netops, Infrastructure-Foundations, Data-Platform-SRE (2026-06-05 - 2026-06-26), Kubernetes: Calico IPv4/IPv6 block exhaustion on dse-k8s cluster, blocking new node provisioning - https://phabricator.wikimedia.org/T429773#12051595 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was star...
[17:53:09] netops, Infrastructure-Foundations, Data-Platform-SRE (2026-06-05 - 2026-06-26), Kubernetes: Calico IPv4/IPv6 block exhaustion on dse-k8s cluster, blocking new node provisioning - https://phabricator.wikimedia.org/T429773#12051596 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was star...
[17:53:11] netops, Infrastructure-Foundations, Data-Platform-SRE (2026-06-05 - 2026-06-26), Kubernetes: Calico IPv4/IPv6 block exhaustion on dse-k8s cluster, blocking new node provisioning - https://phabricator.wikimedia.org/T429773#12051597 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was star...
[18:31:45] netops, Infrastructure-Foundations, Data-Platform-SRE (2026-06-05 - 2026-06-26), Kubernetes: Calico IPv4/IPv6 block exhaustion on dse-k8s cluster, blocking new node provisioning - https://phabricator.wikimedia.org/T429773#12051757 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started...
[18:35:10] netops, Infrastructure-Foundations, Data-Platform-SRE (2026-06-05 - 2026-06-26), Kubernetes: Calico IPv4/IPv6 block exhaustion on dse-k8s cluster, blocking new node provisioning - https://phabricator.wikimedia.org/T429773#12051761 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started...
[18:39:44] netops, Infrastructure-Foundations, Data-Platform-SRE (2026-06-05 - 2026-06-26), Kubernetes: Calico IPv4/IPv6 block exhaustion on dse-k8s cluster, blocking new node provisioning - https://phabricator.wikimedia.org/T429773#12051786 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started...
[18:53:25] Traffic, DNS, SRE, Patch-For-Review: new CNAME record for WikiLearn - https://phabricator.wikimedia.org/T429628#12051808 (BCornwall) Open→Resolved I'm marking this as resolved: Please feel free to reopen if this hasn't been!