[00:43:01] <wikibugs>	 Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10603665 (Ladsgroup) I forgot to mention: This will be done as part of {T360589} First, we start serving 250px thumbnails gradually but sized to 220px,...
[04:43:09] <jinxer-wm>	 FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[04:48:09] <jinxer-wm>	 RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[06:58:29] <wikibugs>	 Traffic: LVSRealserverMSS alert is broken for ferm based hosts - https://phabricator.wikimedia.org/T367204#10604037 (Vgutierrez) Open→Resolved
[07:41:09] <jinxer-wm>	 FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[07:46:07] <wikibugs>	 Traffic, Patch-For-Review: Create systemd-tmpfiles configuration for TLS material - https://phabricator.wikimedia.org/T387826#10604080 (Fabfur) Open→Resolved Done, path will be `/run/haproxy-tls`
[07:46:09] <jinxer-wm>	 RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[07:46:47] <wikibugs>	 Traffic, Patch-For-Review: Allow acmecerts to deploy certificates in tmpfs storage - https://phabricator.wikimedia.org/T384227#10604085 (Fabfur)
[07:47:17] <wikibugs>	 Traffic, Patch-For-Review: Allow acmecerts to deploy certificates in tmpfs storage - https://phabricator.wikimedia.org/T384227#10604086 (Fabfur)
[10:11:27] <wikibugs>	 netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10604443 (ayounsi) So what about: * turnilo full dimensions - 1 month * turnilo sanitized/reduced - 12 mo...
[10:13:37] <wikibugs>	 netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10604451 (JAllemandou) >>! In T387839#10604443, @ayounsi wrote: > So what about: > * turnilo full dimension...
[10:20:45] <wikibugs>	 netops, Infrastructure-Foundations: Different BFD settings on direct connected links - https://phabricator.wikimedia.org/T387773#10604469 (ayounsi) They're mandatory on long distance link as we've had issue with interface status being up but the provider not forwarding traffic through said link. For loca...
[13:19:32] <hnowlan>	 just a heads-up, I have a minor fix for the citoid change, shouldn't need any careful rolling out https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124766
[13:29:56] <vgutierrez>	 hnowlan: ack
[14:01:58] <klausman>	 \o We recently moved about (IP-wise) some workers (ml-staging2001 and 2002) for one of our LVS services (inference-staging.svc.codfw), and it seems LVS has not picked up the changed IPs (or we missed a step). It seems to have only one backend (dead IP 10.192.0.201) on the service (on lvs2013). I am not sure how to fix this.
[14:05:54] <wikibugs>	 netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review, SRE Observability (FY2024/2025-Q3): Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10605317 (ayounsi) It's live and working fine : {F58611687} https://grafana...
[14:07:38] <sukhe>	 klausman: it should be on lvs2014 as well and it is but that's not the point, just as an FYI
[14:07:43] <sukhe>	 PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled: k8s-ingress-ml-staging_31443: Servers ml-staging2001.codfw.wmnet are marked down but pooled
[14:07:48] <sukhe>	 sukhe@lvs2013:~$ curl localhost:9090/pools/inference_30443
[14:07:51] <sukhe>	 ml-serve2002.codfw.wmnet:	disabled/up/not pooled
[14:08:16] <sukhe>	 sorry, staging,
[14:08:25] <sukhe>	 ml-staging2002.codfw.wmnet:	enabled/down/not pooled
[14:09:28] <sukhe>	 but more importantly, ml-staging2001.codfw.wmnet:	enabled/down/pooled
[14:10:03] <sukhe>	 so what's the other context? what was the change?
[14:10:18] <topranks>	 the IP of the servers changed 
[14:12:35] <sukhe>	 yeah klausman mentioned that above but I was checking if that's basically it. let me check
[14:15:23] <sukhe>	 klausman: all done, so given that this was the only thing that changed (the IP), restarted pybal and it's all good now
[14:15:34] <sukhe>	 sukhe@lvs2013:~$ curl localhost:9090/pools/inference-staging_30443
[14:15:34] <sukhe>	 ml-staging2001.codfw.wmnet:	enabled/up/pooled
[14:15:34] <sukhe>	 ml-staging2002.codfw.wmnet:	enabled/up/pooled
[14:17:35] <klausman>	 thank you!
[14:19:37] <klausman>	 hang on, 2003 is not in there?
[14:21:17] <klausman>	 mh, it should be, let me check 1-3 things
[14:21:44] <topranks>	 sukhe: thanks for sorting it!  is it the case then that PyBal gets the IP from DNS, but caches it forever?
[14:21:59] <topranks>	 we need a restart if the IPs for hostnames have changed?
[14:32:09] <klausman>	 I am also a bit puzzled as to why 2003 remains depooled.
[14:32:17] <sukhe>	 topranks: no but pybal needs to be restarted to reprogram IPVS for the changed IP.
[14:32:30] <sukhe>	 klausman: so 2003 is a new host?
[14:32:56] <topranks>	 sukhe: ok thanks 
[14:33:05] <klausman>	 No, like the other two, it was recently reimaged, but it is newer, and thus didn't get a new IP
[14:33:08] <topranks>	 but yep we need a restart when this happens basically 
[14:33:52] <klausman>	 2003 is in etcd, but marked as pooled:inactive. 
[14:34:18] <klausman>	 I tried `confctl  pool --service ml_staging --hostname ml-staging2003.codfw.wmnet` but it doesn't seem to have any effect
[14:35:03] <sukhe>	 topranks: yeah, in this case I saw the old IP in the ipvsadm output and then worked from there
[14:35:06] <sukhe>	 klausman: looking now
[14:35:56] <sukhe>	 sukhe@puppetserver1001:~$ sudo confctl select 'name=ml-staging2003.codfw.wmnet' get
[14:35:59] <sukhe>	 {"ml-staging2003.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=ml_staging,service=kubesvc"}
[14:36:20] <sukhe>	 sukhe@lvs2013:~$ curl localhost:9090/pools/inference-staging_30443
[14:36:20] <sukhe>	 ml-staging2003.codfw.wmnet:	enabled/up/pooled
[14:36:20] <sukhe>	 ml-staging2001.codfw.wmnet:	enabled/up/pooled
[14:36:20] <sukhe>	 ml-staging2002.codfw.wmnet:	enabled/up/pooled
[14:36:22] <sukhe>	 now it's there
[14:36:34] <sukhe>	 klausman: `sudo confctl select 'name=ml-staging2003.codfw.wmnet' set/pooled=yes`
[14:37:13] <klausman>	 ah, my old nemesis, conftool. I just don't use it often enough
[14:37:46] <klausman>	 thanks, sukhe, once again
[14:37:55] <sukhe>	 no worries
[14:38:06] <sukhe>	 we all get lost on this at some point or the other :)
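A minimal sketch, in Python, of how the `host:	state` lines pasted from `curl localhost:9090/pools/...` above could be checked for the "marked down but pooled" condition the PYBAL alert reports. The field order (enabled / up / pooled) is read off the pasted output only; `PoolEntry`, `parse_pool_line` and `down_but_pooled` are hypothetical names for illustration, not part of PyBal or conftool.

```python
from typing import List, NamedTuple


class PoolEntry(NamedTuple):
    host: str
    enabled: bool
    up: bool
    pooled: bool


def parse_pool_line(line: str) -> PoolEntry:
    """Parse one line such as 'host.example:\tenabled/up/pooled'."""
    # Split on the first colon: hostname on the left, state triple on the right.
    host, sep, state = line.partition(":")
    parts = [p.strip() for p in state.split("/")]
    if not sep or len(parts) != 3:
        raise ValueError(f"unexpected pool line: {line!r}")
    enabled, up, pooled = parts
    return PoolEntry(
        host=host.strip(),
        enabled=(enabled == "enabled"),
        up=(up == "up"),
        pooled=(pooled == "pooled"),  # "not pooled" maps to False
    )


def down_but_pooled(entries: List[PoolEntry]) -> List[str]:
    """Return hosts that are still pooled although marked down."""
    return [e.host for e in entries if e.pooled and not e.up]
```

Fed the three states seen before the restart (2001 `enabled/down/pooled`, 2002 `enabled/down/not pooled`, ml-serve2002 `disabled/up/not pooled`), only ml-staging2001 is flagged, which matches the alert text.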
[15:06:14] <wikibugs>	 netops, Infrastructure-Foundations: Different BFD settings on direct connected links - https://phabricator.wikimedia.org/T387773#10605720 (Papaul) @cmoone @ayounsi thank you all for the input. since we have only cr1/2-codfw with the bfd configuration and the others without it for the main time can i go a...
[15:11:51] <wikibugs>	 netops, Infrastructure-Foundations: Different BFD settings on direct connected links - https://phabricator.wikimedia.org/T387773#10605755 (cmooney) >>! In T387773#10604469, @ayounsi wrote: > They're mandatory on long distance link as we've had issue with interface status being up but the provider not for...
[15:37:33] <wikibugs>	 netops, Infrastructure-Foundations, Prod-Kubernetes, serviceops: WikiKube clusters close to exhausting Calico IPPool allocations - https://phabricator.wikimedia.org/T375845#10605886 (cmooney) FYI I've updated the prefix-list on our switches and routers in eqiad/codfw from the old /18 to the wider...
[17:06:27] <wikibugs>	 Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035 (Vgutierrez) NEW
[17:06:38] <wikibugs>	 Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035#10606526 (Vgutierrez) p:Triage→High
[17:24:45] <wikibugs>	 netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review, SRE Observability (FY2024/2025-Q3): Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10606626 (ayounsi) Open→Resolved
[17:41:52] <wikibugs>	 netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review, SRE Observability (FY2024/2025-Q3): Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10606718 (ayounsi) I also added that metric to this dashboard as an exa...
[17:44:14] <wikibugs>	 netops, Hiddenparma, Infrastructure-Foundations: HIDDENPARMA feature: superset link → requestctl rule - https://phabricator.wikimedia.org/T388039 (kamila) NEW
[17:44:33] <wikibugs>	 netops, Hiddenparma, Infrastructure-Foundations: HIDDENPARMA feature: superset link → requestctl rule - https://phabricator.wikimedia.org/T388039#10606746 (kamila) p:Triage→Medium
[18:03:45] <wikibugs>	 Traffic: acme_chief and sslcert modules should allow destination parameter - https://phabricator.wikimedia.org/T387929#10606838 (Fabfur)
[18:11:12] <wikibugs>	 netops, Infrastructure-Foundations, Prod-Kubernetes, serviceops: WikiKube clusters close to exhausting Calico IPPool allocations - https://phabricator.wikimedia.org/T375845#10606894 (JMeybohm) p:Medium→High
[18:46:21] <wikibugs>	 Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035#10607094 (BCornwall) Open→In progress