[00:41:42] (SystemdUnitFailed) firing: user@0.service Failed on cp2039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:56:42] (SystemdUnitFailed) resolved: user@0.service Failed on cp2039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:54:00] (HAProxyRestarted) firing: HAProxy server restarted on cp2033:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2033&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[06:47:25] netops, Infrastructure-Foundations, SRE, User-jbond: Investigate the potential benefits of BGPalerter - https://phabricator.wikimedia.org/T230600 (fgiunchedi) I noticed the weekly "software-update" emails from bgpalerter, can those be disabled ? (i.e. the version check I guess)
[06:54:00] (HAProxyRestarted) firing: HAProxy server restarted on cp2033:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2033&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[07:20:34] netops, Infrastructure-Foundations, SRE, User-jbond: Investigate the potential benefits of BGPalerter - https://phabricator.wikimedia.org/T230600 (ayounsi) Relevant https://github.com/nttgin/BGPalerter/issues/1058
[07:52:21] Traffic: HAProxy 2.6.16 segfaults on cp2033 - https://phabricator.wikimedia.org/T334448 (Vgutierrez)
[07:52:34] Traffic: HAProxy 2.6.16 segfaults on cp2033 - https://phabricator.wikimedia.org/T334448 (Vgutierrez) p:Triage→High
[07:52:47] Traffic: HAProxy 2.6.12 segfaults on cp2033 - https://phabricator.wikimedia.org/T334448 (Vgutierrez)
[07:57:51] (HAProxyRestarted) resolved: HAProxy server restarted on cp2033:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2033&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[08:09:27] Traffic, Upstream: HAProxy 2.6.12 segfaults on cp2033 - https://phabricator.wikimedia.org/T334448 (Vgutierrez) Reported to upstream in https://github.com/haproxy/haproxy/issues/2111
[08:32:16] Traffic, Upstream: HAProxy 2.6.12 segfaults on cp2033 - https://phabricator.wikimedia.org/T334448 (Vgutierrez)
[09:51:30] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[14:32:56] Traffic, Upstream: HAProxy 2.6.12 segfaults on cp2033 - https://phabricator.wikimedia.org/T334448 (Vgutierrez) This seems to be the same issue as T332796 due to an incomplete bugfix
[15:40:05] Traffic, SRE: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (Ottomata) Interesting. Each purged instance is in a distinct consumer group, yes? What kafka client is it using? (Just curious, neither answers will clue me in as to why they stopped working :) )
[15:40:15] Traffic, DC-Ops: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (BCornwall)
[15:41:00] Traffic, netops, DBA, Data-Engineering, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (LSobanski)
[15:44:31] Traffic, SRE: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (Vgutierrez) purged uses github.com/confluentinc/confluent-kafka-go/kafka
[16:00:11] Traffic, netops, DBA, Data-Engineering, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (LSobanski) p:Triage→Medium
[16:08:15] Traffic, netops, DBA, Data-Engineering, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Jelto)
[16:56:20] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[17:17:38] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs3006.esams.wmnet with OS bullseye
[17:19:59] Traffic, Infrastructure-Foundations, SRE, netbox, Patch-For-Review: Issues converting services from active/passive to active/active - https://phabricator.wikimedia.org/T330084 (jbond) from @brandon via irc >it *seems* like that error in the ticket would've only happened if the puppet agent...
[17:38:54] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[17:59:34] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs3006.esams.wmnet with OS bullseye completed: - lvs3006 (**PASS**) - Downtimed on Icinga/Aler...
[19:01:52] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[20:31:41] Traffic, DC-Ops: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (Papaul) @BCornwall are you using the firmware upgrade cookbook and what is the server running the latest IDRAC version?
[20:49:41] Traffic: Switch to Maglev hashing ('mh') on LVS hosts - https://phabricator.wikimedia.org/T263797 (BCornwall)
[22:01:45] (HAProxyRestarted) firing: HAProxy server restarted on cp2035:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2035&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[22:52:42] (SystemdUnitFailed) firing: (4) varnishkafka-statsv.service Failed on cp3056:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:57:42] (SystemdUnitFailed) firing: (6) varnishkafka-eventlogging.service Failed on cp3056:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:02:20] Traffic, netops, Infrastructure-Foundations: Adjust routing policy to increase SSH session speed from East Asia to toolforge - https://phabricator.wikimedia.org/T334530 (Stang)
[23:02:42] (SystemdUnitFailed) resolved: (6) varnishkafka-eventlogging.service Failed on cp3056:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:03:42] (SystemdUnitFailed) firing: (5) varnishkafka-eventlogging.service Failed on cp3056:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:08:42] (SystemdUnitFailed) firing: (6) varnishkafka-eventlogging.service Failed on cp3056:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:28:42] (SystemdUnitFailed) resolved: (5) varnishkafka-eventlogging.service Failed on cp3056:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed