[00:29:54] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:42:00] (PS1) Dzahn: jenkins: explicitly set firewall provider for new role [puppet] - https://gerrit.wikimedia.org/r/1256614
[00:52:33] (PS2) Dzahn: jenkins: include firewall and set provider for new role [puppet] - https://gerrit.wikimedia.org/r/1256614 (https://phabricator.wikimedia.org/T418521)
[00:54:27] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1256614/8315/contint1003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1256614 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[01:05:49] (PS1) Dzahn: jenkins: let envoy listen on IPv6 [puppet] - https://gerrit.wikimedia.org/r/1256642 (https://phabricator.wikimedia.org/T418521)
[01:07:59] (CR) Dzahn: "https://phabricator.wikimedia.org/T255568#11587266" [puppet] - https://gerrit.wikimedia.org/r/1256642 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[01:08:24] (CR) Dzahn: "https://phabricator.wikimedia.org/T255568#11587266 fixed this but the default is still FALSE!" [puppet] - https://gerrit.wikimedia.org/r/1256642 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[01:09:10] (CR) Dzahn: [C:+2] jenkins: let envoy listen on IPv6 [puppet] - https://gerrit.wikimedia.org/r/1256642 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[01:12:46] (CR) Dzahn: [C:+2] "- address: 0.0.0.0" [puppet] - https://gerrit.wikimedia.org/r/1256642 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[01:14:21] (CR) Dzahn: [C:+2] "[contint1002:~] $ telnet -6 jenkins.discovery.wmnet 1443" [puppet] - https://gerrit.wikimedia.org/r/1256642 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[01:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:09:18] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:18] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:01:29] (PS1) Codename Noreste: ptwiki: Add suppressredirect to autoreviewer and rollbacker user groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1256748 (https://phabricator.wikimedia.org/T420704)
[04:03:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 20.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:08:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 20.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:17:51] PROBLEM - Backup freshness on backup1014 is CRITICAL: All failures: 1 (install4004), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:29:54] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:54:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:17:59] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:12:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:42:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:06:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 18.41% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:14:19] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:15:39] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1213 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[07:15:41] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1213 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T420812 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[07:15:49] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1213 - https://phabricator.wikimedia.org/T420812 (ops-monitoring-bot) NEW
[07:19:54] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:21:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.18% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:22:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:24:18] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:27:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:28:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:33:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:54:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:19:58] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:59:05] PROBLEM - Host lsw1-b7-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:59:39] RECOVERY - Host lsw1-b7-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.62 ms
[11:02:55] SRE, Traffic: Wikidough: consider regional Anycast addresses - https://phabricator.wikimedia.org/T420819 (cmooney) NEW p:Triage→Low
[11:18:53] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 5 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11734995 (Aklapper)
[11:24:54] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:03:33] PROBLEM - Host thanos-be2006 is DOWN: PING CRITICAL - Packet loss = 100%
[12:16:20] SRE, Infrastructure-Foundations, netops: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP - https://phabricator.wikimedia.org/T420820 (cmooney) NEW p:Triage→Medium
[12:16:24] (CR) Little Sunshine: [C:+1] ptwiki: Add suppressredirect to autoreviewer and rollbacker user groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1256748 (https://phabricator.wikimedia.org/T420704) (owner: Codename Noreste)
[12:19:11] (PS1) Cathal Mooney: wikimedia6 prefix-list: add anycast ranges for wikidough and ns2 [homer/public] - https://gerrit.wikimedia.org/r/1257195 (https://phabricator.wikimedia.org/T420820)
[12:27:23] SRE, Infrastructure-Foundations, netops: Anycast services - depool strategy in terms of BGP routing - https://phabricator.wikimedia.org/T420821 (cmooney) NEW p:Triage→Medium
[12:27:36] SRE, Infrastructure-Foundations, netops: Anycast services - depool strategy in terms of BGP routing - https://phabricator.wikimedia.org/T420821#11735064 (cmooney)
[12:27:38] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP - https://phabricator.wikimedia.org/T420820#11735063 (cmooney)
[12:28:41] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP - https://phabricator.wikimedia.org/T420820#11735065 (cmooney)
[12:31:48] SRE, Traffic: Wikidough: consider regional Anycast addresses - https://phabricator.wikimedia.org/T420819#11735068 (cmooney) Open→Declined FWIW the reason for traffic re-routed to eqiad not drmrs was due to how we have the core routers set up. TL;DR depooling the service (i.e. stopping the do...
[12:33:49] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP - https://phabricator.wikimedia.org/T420820#11735072 (cmooney)
[12:54:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:04:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:19:23] (PS1) Cathal Mooney: Routed ganeti: disable nftables conntrack for forwarded VM traffic [puppet] - https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715)
[13:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:21:52] (PS2) Cathal Mooney: wikimedia6 prefix-list: add wikidough anycast range [homer/public] - https://gerrit.wikimedia.org/r/1257195 (https://phabricator.wikimedia.org/T420820)
[13:26:04] (PS2) Cathal Mooney: Routed ganeti: disable nftables conntrack for forwarded VM traffic [puppet] - https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715)
[13:31:34] (PS3) Cathal Mooney: Routed ganeti: disable nftables conntrack for forwarded VM traffic [puppet] - https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715)
[13:32:28] (CR) Cathal Mooney: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715) (owner: Cathal Mooney)
[13:59:44] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[13:59:45] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[14:24:43] (CR) Little Sunshine: [C:+1] ptwiki: Enable block action for the abuse filter [mediawiki-config] - https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: Gerrit Patch Uploader)
[14:26:15] (PS4) Cathal Mooney: Routed ganeti: disable nftables conntrack for forwarded VM traffic [puppet] - https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715)
[14:26:46] (CR) CI reject: [V:-1] Routed ganeti: disable nftables conntrack for forwarded VM traffic [puppet] - https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715) (owner: Cathal Mooney)
[14:28:37] (PS5) Cathal Mooney: Routed ganeti: disable nftables conntrack for forwarded VM traffic [puppet] - https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715)
[14:31:01] (CR) Cathal Mooney: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1257209 (https://phabricator.wikimedia.org/T420715) (owner: Cathal Mooney)
[14:53:51] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100%
[14:58:53] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 66.07 ms
[15:24:54] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:58:55] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP - https://phabricator.wikimedia.org/T420820#11735166 (cmooney)
[16:02:33] SRE, Infrastructure-Foundations, netops: Anycast services - depool strategy in terms of BGP routing - https://phabricator.wikimedia.org/T420821#11735168 (cmooney)
[16:09:19] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:24:33] SRE, Infrastructure-Foundations, netops: Nokia SR-Linux - wonky routing with IPv6 RAs and EVPN Anycast GW - https://phabricator.wikimedia.org/T420706#11735193 (cmooney)
[16:34:19] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:59:59] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[17:59:59] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[18:01:31] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11735321 (Aklapper) @AnnieKim_WMDE: Please also [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org/settings/panel/extern...
[18:12:12] !incidents
[18:12:12] 7778 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[18:12:12] 7777 (RESOLVED) [10x] ProbeDown sre (ip4 probes/service codfw)
[18:15:07] o/
[18:16:54] Raine: looks like just expired acks?
[18:18:34] jhathaway: my phone thinks everything is acked
[18:18:36] I am confused
[18:18:47] (or possibly splunk is confused)
[18:20:28] but yeah, it's been 24h since when it paged
[18:21:14] according to icinga, everything is still downtimed
[18:21:26] so perhaps splunk weirdness
[18:21:32] yeah
[18:22:04] okay, going afk for now
[18:23:29] same, thanks
[18:48:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[19:24:54] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[19:28:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[20:11:56] FIRING: MaxConntrack: Elevated conntrack usage on ganeti7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[20:49:41] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:50:41] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:16:56] FIRING: [2x] MaxConntrack: Elevated conntrack usage on ganeti7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[21:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:59:59] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[22:00:00] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[22:14:19] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown