[08:30:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[08:35:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[09:45:07] netops, Infrastructure-Foundations, SRE: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10487931 (MoritzMuehlenhoff) a:MoritzMuehlenhoff 0.14.1 is out, I'll import and upgrade
[09:51:06] Traffic, Patch-For-Review: issue unified cert using pki.goog - https://phabricator.wikimedia.org/T384195#10487935 (Vgutierrez) Stalled→In progress pki.goog staging environment issued the unified cert successfully after GTS removed `wikipedia.org` from their denylist: ` vgutierrez@acmechief-test10...
[10:10:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[10:15:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[10:54:30] Traffic, collaboration-services, Language and Product Localization, MinT, Patch-For-Review: MinT: Fails to download models/files from peopleweb.discovery.wmnet - https://phabricator.wikimedia.org/T383750#10488163 (Samwilson)
[11:11:56] Traffic: Upgrade haproxy to 2.8.13 on cp hosts - https://phabricator.wikimedia.org/T383111#10488240 (Vgutierrez)
[11:12:04] Traffic: Upgrade haproxy to 2.8.13 on cp hosts - https://phabricator.wikimedia.org/T383111#10488241 (Vgutierrez) Open→Resolved a:Vgutierrez
[11:28:22] netops, Infrastructure-Foundations, SRE: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10488267 (fgiunchedi) I'm assuming you meant "this won't be too hard", anyways the simplest solution off the top of my head would be to have a map network...
[12:26:13] netops, Infrastructure-Foundations, SRE: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10488524 (cmooney) >>! In T384345#10488267, @fgiunchedi wrote: > I'm assuming you meant "this won't be too hard", anyways the simplest solution off the top...
[12:28:39] netops, Infrastructure-Foundations, SRE: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10488529 (cmooney)
[12:28:40] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488530 (cmooney)
[12:51:28] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488586 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a6b392ba-8b36-4fa0-8d3d-10c8b2d2eb48) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and th...
[13:29:39] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488748 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f0f61f83-b1f7-48c8-9e4a-2e436917a7d3) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th...
[14:41:59] Traffic: clean up testlb services - https://phabricator.wikimedia.org/T384486#10488924 (Vgutierrez) Open→Resolved
[14:42:23] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488927 (cmooney) So I rolled-back the patch to collect the BGP metrics. The config puppet produced worked fine in magru and esams, but for some reason in eqiad stats...
[14:51:27] Traffic, Patch-For-Review: issue unified cert using pki.goog - https://phabricator.wikimedia.org/T384195#10488967 (Vgutierrez) In progress→Resolved a:Vgutierrez ` vgutierrez@acmechief1002:~$ sudo -i openssl x509 -dates -subject -issuer -ext subjectAltName -noout -in /var/lib/acme-chief/ce...
[15:01:41] Traffic: Deploy unified-goog certificate on cp hosts - https://phabricator.wikimedia.org/T384606 (Vgutierrez) NEW
[15:43:17] Traffic, Patch-For-Review: Deploy unified-goog certificate on cp hosts - https://phabricator.wikimedia.org/T384606#10489231 (Vgutierrez) Open→Resolved
[16:16:42] Traffic, DC-Ops, ops-esams, ops-magru, SRE: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10489498 (RobH) a:RobH→Vgutierrez @Vgutierrez, >>! In T382026#10489496, @RobH wrote: >> Good afternoon Dear >> The infrastructure team installed a Blanking Panel...
[16:18:57] Traffic, DC-Ops, ops-esams, ops-magru, SRE: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10489523 (ssingh) a:Vgutierrez→BCornwall
[17:00:18] netops, Hiddenparma, Infrastructure-Foundations, Prod-Kubernetes, Kubernetes: Allow reaching services on the aux k8s cluster bypassing the CDN - https://phabricator.wikimedia.org/T382269#10489758 (CDanis) >>! In T382269#10458292, @CDanis wrote: > I am wondering if we really need the ability t...
[19:41:37] netops, Infrastructure-Foundations, SRE: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10490388 (cmooney) Open→Resolved a:cmooney This is working now {F58260515 width=700}
[21:12:00] netops, Infrastructure-Foundations, SRE: Improve Eqiad outbound traffic balance - https://phabricator.wikimedia.org/T384253#10490568 (cmooney) Open→Resolved Gonna close this one for now, the balance is better with the changes we added and we can review as time goes on.
[21:14:22] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490586 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7b39f587-684b-42ab-a96c-cf552c03a29d) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th...
[21:33:10] netops, Infrastructure-Foundations, SRE: Manage fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10490638 (cmooney)
[21:38:24] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490654 (cmooney) Fwiw I thought I saw a potential optimisation to allow us to go back to the "on change" style subscription. gNMIc has a parameter that can be configu...
[21:59:21] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490691 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3f0feb1a-6c73-4906-bb5a-2df62eb7e156) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and th...
[22:34:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[22:44:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[23:03:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[23:05:09] FIRING: [8x] LVSHighCPU: The host lvs7002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs7002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[23:06:02] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490768 (cmooney) The current configuration we have requires us to enable [[ https://gnmic.openconfig.net/user_guide/caching/ | gnmic caching ]], as we group certain me...
[23:10:09] RESOLVED: [8x] LVSHighCPU: The host lvs7002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs7002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[23:11:14] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490804 (cmooney) FWIW I used the config from P72314 in the most recent tests. I'd tried to use some of the advice from [[ https://github.com/openconfig/gnmic/issues/4...
[23:13:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[23:33:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[23:43:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[23:48:50] Traffic, DC-Ops, ops-esams, ops-magru, SRE: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10490884 (BCornwall) Hi, @RobH, thanks for doing this! At first glance, nothing's improved. The inlet temps are acceptable at ~20° yet the CPUs are still hitting ~90°. O...