[00:43:01] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10603665 (Ladsgroup) I forgot to mention: This will be done as part of {T360589} First, we start serving 250px thumbnails gradually but sized to 220px,...
[04:43:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[04:48:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[06:58:29] Traffic: LVSRealserverMSS alert is broken for ferm based hosts - https://phabricator.wikimedia.org/T367204#10604037 (Vgutierrez) Open→Resolved
[07:41:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[07:46:07] Traffic, Patch-For-Review: Create systemd-tmpfiles configuration for TLS material - https://phabricator.wikimedia.org/T387826#10604080 (Fabfur) Open→Resolved Done, path will be `/run/haproxy-tls`
[07:46:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[07:46:47] Traffic, Patch-For-Review: Allow acmecerts to deploy certificates in tmpfs storage - https://phabricator.wikimedia.org/T384227#10604085 (Fabfur)
[07:47:17] Traffic, Patch-For-Review: Allow acmecerts to deploy certificates in tmpfs storage - https://phabricator.wikimedia.org/T384227#10604086 (Fabfur)
[10:11:27] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10604443 (ayounsi) So what about: * turnilo full dimensions - 1 months * turnilo sanisitzed/reduced - 12 mo...
[10:13:37] netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10604451 (JAllemandou) >>! In T387839#10604443, @ayounsi wrote: > So what about: > * turnilo full dimension...
[10:20:45] netops, Infrastructure-Foundations: Different BFD settings on direct connected links - https://phabricator.wikimedia.org/T387773#10604469 (ayounsi) They're mandatory on long distance link as we've had issue with interface status being up but the provider not forwarding traffic through said link. For loca...
[13:19:32] just a heads-up, I have a minor fix for the citoid change, shouldn't need any careful rolling out https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124766
[13:29:56] hnowlan: ack
[14:01:58] \o We recently moved about (IP-wise) some workers (ml-staging2001 and 2002) for one of our LVS services (inference-staging.svc.codfw), and it seems LVS has not picked up the changed IPs (or we missed a step). It seems to have only one backend (dead IP 10.192.0.201) on the service (on lvs2013). I am not sure how to fix this.
[14:05:54] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review, SRE Observability (FY2024/2025-Q3): Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10605317 (ayounsi) It's live and working fine : {F58611687} https://grafana...
[14:07:38] klausman: it should be on lvs2014 as well and it is but that's not the point, just as an FYI
[14:07:43] PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled: k8s-ingress-ml-staging_31443: Servers ml-staging2001.codfw.wmnet are marked down but pooled
[14:07:48] sukhe@lvs2013:~$ curl localhost:9090/pools/inference_30443
[14:07:51] ml-serve2002.codfw.wmnet: disabled/up/not pooled
[14:08:16] sorry, staging,
[14:08:25] ml-staging2002.codfw.wmnet: enabled/down/not pooled
[14:09:28] but more importantly, ml-staging2001.codfw.wmnet: enabled/down/pooled
[14:10:03] so what's the other context? what was the change?
[14:10:18] the IP of the servers changed
[14:12:35] yeah klausman mentioned that above but I was checking if that's basically it. let me check
[14:15:23] klausman: all done, so given that this was the only thing that changed (the IP), restarted pybal and it's all good now
[14:15:34] sukhe@lvs2013:~$ curl localhost:9090/pools/inference-staging_30443
[14:15:34] ml-staging2001.codfw.wmnet: enabled/up/pooled
[14:15:34] ml-staging2002.codfw.wmnet: enabled/up/pooled
[14:17:35] thank you!
[14:19:37] hang on, 2003 is not in there?
[14:21:17] mh, it should be, let me check 1-3 things
[14:21:44] sukhe: thanks for sorting it! is it the case then that PyBal gets the IP from DNS, but caches it forever?
[14:21:59] we need a restart if the IPs for hostnames have changed?
[14:32:09] I am also a bit puzzled as to why 2003 remains depooled.
[14:32:17] topranks: no but pybal needs to be restarted to reprogram IPVS for the changed IP.
[14:32:30] klausman: so 2003 is a new host?
[14:32:56] sukhe: ok thanks
[14:33:05] No, like the other two, it was recently reimaged, but it is newer, and thus didn't get a new IP
[14:33:08] but yep we need a restart when this happens basically
[14:33:52] 2003 is in etcd, but marked as pooled:inactive.
[14:34:18] I tried `confctl pool --service ml_staging --hostname ml-staging2003.codfw.wmnet` but it doesn't seem to have any effect
[14:35:03] topranks: yeah, in this case I saw the old IP in the ipvsadm output and then worked from there
[14:35:06] klausman: looking now
[14:35:56] sukhe@puppetserver1001:~$ sudo confctl select 'name=ml-staging2003.codfw.wmnet' get
[14:35:59] {"ml-staging2003.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=ml_staging,service=kubesvc"}
[14:36:20] sukhe@lvs2013:~$ curl localhost:9090/pools/inference-staging_30443
[14:36:20] ml-staging2003.codfw.wmnet: enabled/up/pooled
[14:36:20] ml-staging2001.codfw.wmnet: enabled/up/pooled
[14:36:20] ml-staging2002.codfw.wmnet: enabled/up/pooled
[14:36:22] now it's there
[14:36:34] klausman: `sudo confctl select 'name=ml-staging2003.codfw.wmnet' set/pooled=yes`
[14:37:13] ah, my old nemesis, conftool. I just don't use it often enough
[14:37:46] thanks, sukhe, once again
[14:37:55] no worries
[14:38:06] we all get lost on this at some point or the other :)
[15:06:14] netops, Infrastructure-Foundations: Different BFD settings on direct connected links - https://phabricator.wikimedia.org/T387773#10605720 (Papaul) @cmoone @ayounsi thank you all for the input. since we have only cr1/2-codfw with the bfd configuration and the others without it for the main time can i go a...
[15:11:51] netops, Infrastructure-Foundations: Different BFD settings on direct connected links - https://phabricator.wikimedia.org/T387773#10605755 (cmooney) >>! In T387773#10604469, @ayounsi wrote: > They're mandatory on long distance link as we've had issue with interface status being up but the provider not for...
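For anyone retracing the PyBal check from 14:07 to 14:36: the pool endpoint returns one `host: state/state/state` line per server, and the alert condition was a host that is down yet still pooled. A minimal sketch of that check follows, assuming the three slash-separated fields always appear in the order seen in this log. The names `PoolEntry`, `parse_pool_status` and `down_but_pooled` are made up for illustration and are not part of PyBal, and the sample text is copied from the messages above rather than fetched over HTTP.

```python
from typing import List, NamedTuple


class PoolEntry(NamedTuple):
    """One server line from the pool status output (hypothetical helper type)."""
    host: str
    enabled: bool
    up: bool
    pooled: bool


def parse_pool_status(text: str) -> List[PoolEntry]:
    """Parse lines shaped like 'host: enabled/up/pooled'.

    Assumption: exactly three slash-separated fields, in the order seen
    in the log. Lines that do not match this shape are skipped.
    """
    entries = []
    for line in text.splitlines():
        host, sep, state = line.strip().partition(": ")
        parts = state.split("/")
        if not sep or len(parts) != 3:
            continue
        enabled, up, pooled = parts
        entries.append(
            PoolEntry(host, enabled == "enabled", up == "up", pooled == "pooled")
        )
    return entries


def down_but_pooled(entries: List[PoolEntry]) -> List[str]:
    """Return hosts that are marked down while still pooled."""
    return [e.host for e in entries if not e.up and e.pooled]


# Sample lines taken from the 14:07 to 14:09 messages in this log.
SAMPLE = """\
ml-serve2002.codfw.wmnet: disabled/up/not pooled
ml-staging2002.codfw.wmnet: enabled/down/not pooled
ml-staging2001.codfw.wmnet: enabled/down/pooled
"""

if __name__ == "__main__":
    for host in down_but_pooled(parse_pool_status(SAMPLE)):
        print(f"{host} is marked down but pooled")
```

On the sample, only ml-staging2001.codfw.wmnet is reported, which matches the host named in the PYBAL CRITICAL alert at 14:07:43.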
[15:37:33] netops, Infrastructure-Foundations, Prod-Kubernetes, serviceops: WikiKube clusters close to exhausting Calico IPPool allocations - https://phabricator.wikimedia.org/T375845#10605886 (cmooney) FYI I've updated the prefix-list on our switches and routers in eqiad/codfw from the old /18 to the wider...
[17:06:27] Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035 (Vgutierrez) NEW
[17:06:38] Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035#10606526 (Vgutierrez) p:Triage→High
[17:24:45] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review, SRE Observability (FY2024/2025-Q3): Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10606626 (ayounsi) Open→Resolved
[17:41:52] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review, SRE Observability (FY2024/2025-Q3): Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10606718 (ayounsi) I also added that metric to this dashboard as an exa...
[17:44:14] netops, Hiddenparma, Infrastructure-Foundations: HIDDENPARMA feature: superset link → requestctl rule - https://phabricator.wikimedia.org/T388039 (kamila) NEW
[17:44:33] netops, Hiddenparma, Infrastructure-Foundations: HIDDENPARMA feature: superset link → requestctl rule - https://phabricator.wikimedia.org/T388039#10606746 (kamila) p:Triage→Medium
[18:03:45] Traffic: acme_chief and sslcert modules should allow destination parameter - https://phabricator.wikimedia.org/T387929#10606838 (Fabfur)
[18:11:12] netops, Infrastructure-Foundations, Prod-Kubernetes, serviceops: WikiKube clusters close to exhausting Calico IPPool allocations - https://phabricator.wikimedia.org/T375845#10606894 (JMeybohm) p:Medium→High
[18:46:21] Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035#10607094 (BCornwall) Open→In progress