[06:33:44] jbond, brett, there is a merged but not deployed puppet patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/933398
[06:39:05] I'm reverting it
[07:11:42] (SystemdUnitFailed) firing: haproxy_stek_job.service Failed on cp3081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:46:44] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp3079:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[09:46:57] ^ reimaging so expected
[10:31:42] (SystemdUnitFailed) firing: (2) prometheus-ipmi-exporter.service Failed on cp3079:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:34:42] (SystemdUnitCrashLoop) firing: varnish-frontend.service crashloop on cp3079:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[10:36:42] (SystemdUnitFailed) firing: (3) haproxy_stek_job.service Failed on cp3079:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:36:55] Traffic, DC-Ops, SRE, ops-knams, Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (ssingh)
[10:36:56] ^^ these should be expected
[10:38:41] Traffic, DC-Ops, SRE, ops-knams, Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (ssingh)
[10:39:42] (SystemdUnitCrashLoop) resolved: varnish-frontend.service crashloop on cp3079:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[10:39:45] (HAProxyRestarted) firing: HAProxy server restarted on cp3079:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3079&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[10:41:42] (SystemdUnitCrashLoop) firing: varnish-frontend.service crashloop on cp3079:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[10:41:42] (SystemdUnitFailed) firing: (3) haproxy_stek_job.service Failed on cp3079:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:44:45] (HAProxyRestarted) firing: (2) HAProxy server restarted on cp3073:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[10:46:42] (SystemdUnitCrashLoop) resolved: varnish-frontend.service crashloop on cp3079:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[10:46:42] (SystemdUnitFailed) firing: (4) clean-confd-rundir.service Failed on cp3073:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:46:44] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp3079:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[11:01:08] Traffic, DC-Ops, SRE, ops-knams, Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (ssingh)
[11:01:42] (SystemdUnitFailed) firing: (2) clean-confd-rundir.service Failed on cp3073:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:46:42] (SystemdUnitCrashLoop) firing: varnish-frontend.service crashloop on cp3077:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:46:42] (SystemdUnitFailed) firing: (11) esitest.service Failed on cp3067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:46:44] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp3067:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[11:49:45] (HAProxyRestarted) firing: (4) HAProxy server restarted on cp3067:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[11:51:42] (SystemdUnitCrashLoop) resolved: varnish-frontend.service crashloop on cp3077:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:51:42] (SystemdUnitFailed) firing: (11) esitest.service Failed on cp3067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:51:44] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp3067:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[11:54:45] (HAProxyRestarted) firing: (6) HAProxy server restarted on cp3067:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[11:58:37] Traffic, DC-Ops, SRE, ops-knams, Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (ssingh)
[12:01:00] Traffic, DC-Ops, SRE, ops-knams, Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (Fabfur)
[13:36:43] (SystemdUnitFailed) resolved: haproxy_stek_job.service Failed on cp3081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:11:19] XioNoX: I'm sorry about that :(
[15:53:57] brett: no pb!
[15:54:45] (HAProxyRestarted) firing: (6) HAProxy server restarted on cp3067:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[16:04:45] (HAProxyRestarted) firing: HAProxy server restarted on cp3081:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=esams%20prometheus/ops&var-instance=cp3081&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[16:05:34] Traffic, Observability-Metrics, Patch-For-Review: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 (BCornwall) @fgiunchedi Now that this is merged, would you say that this is complete? Thanks for the feedback.
[17:08:24] Traffic, DC-Ops, SRE, ops-knams, Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (RobH)
[17:21:56] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[17:37:00] (PyBalBGPUnstable) firing: (2) PyBal BGP sessions on instance lvs3009 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=esams%20prometheus/ops&var-server=lvs3009 - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[17:37:25] yeah this is expected
[18:23:52] Traffic, DC-Ops, SRE, ops-knams, Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (Fabfur)
[18:33:46] Traffic, DC-Ops, SRE, ops-knams, Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (BCornwall)
[19:19:24] win 14
[19:23:23] Traffic, DC-Ops, SRE, ops-knams, Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (ssingh)
[19:28:17] sukhe: +1
[19:28:57] thanks, will merge this at the very end
[19:29:47] sukhe: how is dns3003?
[19:29:57] all worked out :)
[19:30:05] so I guess just two CPs left and two LVSes
[19:30:06] and we are done
[19:32:02] nice
[19:42:12] Traffic, DC-Ops, SRE, ops-knams, Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (BCornwall)
[19:47:54] Traffic, DC-Ops, SRE, ops-knams, Patch-For-Review: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (Fabfur)
[19:49:45] (HAProxyRestarted) firing: (2) HAProxy server restarted on cp3066:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[20:09:45] (HAProxyRestarted) firing: (4) HAProxy server restarted on cp3066:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[20:57:06] Traffic, DC-Ops, SRE, ops-knams: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (ssingh)
[20:57:58] Traffic, DC-Ops, SRE, ops-knams: Q1:rack/setup/install new esams/knams hosts - https://phabricator.wikimedia.org/T344174 (ssingh)
[21:01:44] (VarnishHighThreadCount) firing: (2) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[21:04:45] (HAProxyRestarted) firing: (5) HAProxy server restarted on cp3066:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[21:06:44] (VarnishHighThreadCount) firing: (8) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[21:09:45] (HAProxyRestarted) firing: (6) HAProxy server restarted on cp3066:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[21:11:44] (VarnishHighThreadCount) firing: (8) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[21:16:45] (VarnishHighThreadCount) firing: (8) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[21:21:45] (VarnishHighThreadCount) resolved: (8) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[21:37:00] (PyBalBGPUnstable) firing: (2) PyBal BGP sessions on instance lvs3009 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=esams%20prometheus/ops&var-server=lvs3009 - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[21:54:45] (HAProxyRestarted) firing: (7) HAProxy server restarted on cp3066:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[21:59:45] (HAProxyRestarted) firing: (8) HAProxy server restarted on cp3066:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[22:02:01] (PyBalBGPUnstable) firing: (4) PyBal BGP sessions on instance lvs3008 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[22:42:00] (PyBalBGPUnstable) firing: (6) PyBal BGP sessions on instance lvs3008 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable