[00:02:10] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3002.esams.wmnet - denisse@cumin1001"
[00:03:10] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (wiki_willy) Hi @Volans - just a heads up, it looks like some of those 5yr servers on the Accounting Spreadsheet are starting to pop up on the Netbox Error Report as accounti...
[00:03:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:04:11] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3002.esams.wmnet - denisse@cumin1001"
[00:04:11] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:04:11] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus3002.esams.wmnet on all recursors
[00:04:14] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus3002.esams.wmnet on all recursors
[00:06:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1224.eqiad.wmnet with reason: host reimage
[00:07:35] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:07:36] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1222.eqiad.wmnet with OS bullseye
[00:07:42] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1222.eqiad.wmnet with OS bullseye completed: - db1222 (**PASS**)   - Removed from Puppet an...
[00:07:43] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:07:44] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1210.eqiad.wmnet with OS bullseye
[00:07:49] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1210.eqiad.wmnet with OS bullseye completed: - db1210 (**WARN**)   - Removed from Puppet an...
[00:08:11] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1225.eqiad.wmnet with OS bullseye
[00:08:18] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1225.eqiad.wmnet with OS bullseye
[00:09:12] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1155.mgmt.eqiad.wmnet with reboot policy FORCED
[00:09:13] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1224.eqiad.wmnet with reason: host reimage
[00:09:32] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1155.mgmt.eqiad.wmnet with reboot policy FORCED
[00:09:46] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (Papaul)
[00:10:04] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1155.mgmt.eqiad.wmnet with reboot policy FORCED
[00:10:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1156.mgmt.eqiad.wmnet with reboot policy FORCED
[00:11:04] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[00:11:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:15:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (BTullis) Hi @jclark-ctr  Apologies for any omission on my part.  For these servers we use RAID1 for the OS, based on the two ris...
[00:18:37] <wikibugs>	 (PS1) Andrea Denisse: prometheus: Add the prometheus3002 node definition [puppet] - https://gerrit.wikimedia.org/r/904651 (https://phabricator.wikimedia.org/T333627)
[00:18:47] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1156.mgmt.eqiad.wmnet with reboot policy FORCED
[00:19:56] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:19:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1223.eqiad.wmnet with OS bullseye
[00:20:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jclark-ctr)
[00:20:03] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1223.eqiad.wmnet with OS bullseye completed: - db1223 (**PASS**)   - Removed from Puppet an...
[00:20:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gerrit1003.wikimedia.org with OS bullseye
[00:20:30] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit1003.wikimedia.org with OS bullseye
[00:23:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149.eqiad.wmnet']
[00:23:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:25:01] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40470/console" [puppet] - https://gerrit.wikimedia.org/r/904651 (https://phabricator.wikimedia.org/T333627) (owner: Andrea Denisse)
[00:25:34] <wikibugs>	 (CR) Andrea Denisse: prometheus: Add the prometheus3002 node definition [puppet] - https://gerrit.wikimedia.org/r/904651 (https://phabricator.wikimedia.org/T333627) (owner: Andrea Denisse)
[00:26:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1225.eqiad.wmnet with reason: host reimage
[00:29:10] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1225.eqiad.wmnet with reason: host reimage
[00:29:33] <icinga-wm>	 RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:43] <icinga-wm>	 PROBLEM - Check systemd state on graphite1005 is CRITICAL: CRITICAL - degraded: The following units failed: statsd-proxy-socat-6to4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:57] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit1003.wikimedia.org with reason: host reimage
[00:36:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:38:17] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit1003.wikimedia.org with reason: host reimage
[00:41:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:42:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:51:19] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:51:20] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1224.eqiad.wmnet with OS bullseye
[00:51:22] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:51:23] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1225.eqiad.wmnet with OS bullseye
[00:51:27] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1224.eqiad.wmnet with OS bullseye completed: - db1224 (**WARN**)   - Removed from Puppet an...
[00:51:30] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1225.eqiad.wmnet with OS bullseye completed: - db1225 (**PASS**)   - Removed from Puppet an...
[00:53:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:57:20] <icinga-wm>	 RECOVERY - Check systemd state on graphite1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:04:16] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3002.esams.wmnet - denisse@cumin1001"
[01:07:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:07:58] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit1003.wikimedia.org with OS bullseye
[01:08:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit1003.wikimedia.org with OS bullseye completed: - gerrit1003 (**PASS**)   - R...
[01:09:03] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (Papaul)
[01:11:46] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (Papaul) Open→Resolved @Marostegui your 19 servers are ready have fun
[01:12:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:13:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (Papaul) Open→Resolved @LSobanski this is ready
[01:15:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (Papaul)
[01:17:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:31:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:32:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:37:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:39:35] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (Papaul) @Cmjohnson taking over the task to look into it
[03:40:02] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (Papaul) a:Cmjohnson→Papaul
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230331T0600)
[06:13:28] <wikibugs>	 (PS2) Elukey: role::kafka::jumbo::broker: upgrade all brokers to PKI [puppet] - https://gerrit.wikimedia.org/r/904455 (https://phabricator.wikimedia.org/T296064)
[06:13:30] <wikibugs>	 (PS1) Elukey: Switch kafka-main1001 broker's TLS cert to PKI [puppet] - https://gerrit.wikimedia.org/r/904667 (https://phabricator.wikimedia.org/T319372)
[06:15:07] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40471/console" [puppet] - https://gerrit.wikimedia.org/r/904667 (https://phabricator.wikimedia.org/T319372) (owner: Elukey)
[06:15:42] <wikibugs>	 (CR) Elukey: Switch kafka-main1001 broker's TLS cert to PKI [puppet] - https://gerrit.wikimedia.org/r/904667 (https://phabricator.wikimedia.org/T319372) (owner: Elukey)
[06:17:20] <wikibugs>	 SRE, serviceops, Patch-For-Review: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 (elukey) All brokers have the new truststore, so they can validate certs emitted by PKI. Next steps:  1) Upgrade kafka-main1001 to PKI, and monitor if any client fails to conn...
[06:40:11] <wikibugs>	 (CR) Elukey: [C: +1] Update default tls terminator/mesh envoy version to 1.18.3-2 [puppet] - https://gerrit.wikimedia.org/r/904557 (https://phabricator.wikimedia.org/T333551) (owner: JMeybohm)
[06:40:56] <wikibugs>	 (CR) Elukey: [C: +1] "I trust your JS knowledge :D" [puppet] - https://gerrit.wikimedia.org/r/904550 (owner: Volans)
[06:43:56] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[06:43:58] <wikibugs>	 (CR) Elukey: [C: +1] k8s: Configure the IPv6 service ip range for apiserver [puppet] - https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[06:43:59] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[06:51:39] <wikibugs>	 (CR) JMeybohm: [C: +2] Update default tls terminator/mesh envoy version to 1.18.3-2 [puppet] - https://gerrit.wikimedia.org/r/904557 (https://phabricator.wikimedia.org/T333551) (owner: JMeybohm)
[06:54:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:54:39] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/mathoid: apply
[06:54:51] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/mathoid: apply
[06:55:12] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply
[06:55:56] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[06:58:25] <wikibugs>	 (PS1) Krinkle: private: Add readme.FatalErrorSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/904670
[06:58:36] <wikibugs>	 (CR) CI reject: [V: -1] private: Add readme.FatalErrorSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/904670 (owner: Krinkle)
[06:58:38] <wikibugs>	 (PS2) Krinkle: private: Add readme.FatalErrorSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/904670
[06:59:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230331T0700)
[07:04:08] <wikibugs>	 (CR) JMeybohm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/904667 (https://phabricator.wikimedia.org/T319372) (owner: Elukey)
[07:05:46] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:05:49] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:08:11] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM! I don't recall if you already followed up in labs/private, but I guess that there is also a clean up in there to do right? Anyway, c" [puppet] - https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:16:01] <wikibugs>	 (CR) JMeybohm: [V: +1] k8s rsyslog: Use client cert instead of token (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:17:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus: Add the prometheus3002 node definition [puppet] - https://gerrit.wikimedia.org/r/904651 (https://phabricator.wikimedia.org/T333627) (owner: Andrea Denisse)
[07:17:59] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] dns: repoint alert host services to alert2001 [dns] - https://gerrit.wikimedia.org/r/904614 (https://phabricator.wikimedia.org/T333478) (owner: Herron)
[07:19:41] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/902435 (https://phabricator.wikimedia.org/T332709) (owner: Jbond)
[07:20:03] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:20:06] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:20:15] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] alertmanager: also pages to sre for data-engineering, releng and search [puppet] - https://gerrit.wikimedia.org/r/902431 (https://phabricator.wikimedia.org/T332709) (owner: Jbond)
[07:21:20] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "On further thought, I think this policy might as well apply to all warnings, (i.e. a top level route instead with continue: true), what do" [puppet] - https://gerrit.wikimedia.org/r/903615 (owner: Jbond)
[07:22:56] <wikibugs>	 (PS1) JMeybohm: k8s rsyslog: Remove unused tokens [labs/private] - https://gerrit.wikimedia.org/r/904672 (https://phabricator.wikimedia.org/T325268)
[07:23:58] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:24:02] <wikibugs>	 (PS4) JMeybohm: k8s rsyslog: Use client cert instead of token [puppet] - https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268)
[07:25:28] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:27:12] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:27:15] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:27:35] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] k8s rsyslog: Remove unused tokens [labs/private] - https://gerrit.wikimedia.org/r/904672 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:28:01] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:28:03] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:30:51] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40472/console" [puppet] - https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:31:28] <icinga-wm>	 PROBLEM - Check systemd state on graphite1005 is CRITICAL: CRITICAL - degraded: The following units failed: statsd-proxy-socat-6to4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:31:30] <jinxer-wm>	 (Traffic on tunnel link) firing: Alert for device cr4-ulsfo.wikimedia.org - Traffic on tunnel link   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link
[07:33:22] <wikibugs>	 (PS1) Filippo Giunchedi: sre: add check for inodes free [alerts] - https://gerrit.wikimedia.org/r/904675 (https://phabricator.wikimedia.org/T332764)
[07:33:46] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (ayounsi) Following this [[ https://help.expandi.io/en/articles/5405660-making-a-webhook-with-google-sheets | doc ]] I was able to add data to a spreadsheet using a generic p...
[07:34:46] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] k8s rsyslog: Use client cert instead of token [puppet] - https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:38:42] <icinga-wm>	 RECOVERY - Check systemd state on graphite1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:05] <wikibugs>	 (PS1) Filippo Giunchedi: statsd_proxy: reuseaddr for 6to4 proxy to avoid crashloop [puppet] - https://gerrit.wikimedia.org/r/904677 (https://phabricator.wikimedia.org/T239862)
[07:44:28] <wikibugs>	 SRE, ops-codfw, serviceops-collab, GitLab (Infrastructure): Install additional SSDs on gitlab2003.wikimedia.org (B5) - https://phabricator.wikimedia.org/T333304 (Jelto) p:Triage→Medium Thanks @Papaul for the quick installation!  I can confirm new disks are available on the host:  ` gitlab...
[07:47:16] <wikibugs>	 (PS1) JMeybohm: k8s rsyslog: Use client cert instead of token [puppet] - https://gerrit.wikimedia.org/r/904678 (https://phabricator.wikimedia.org/T325268)
[07:49:09] <wikibugs>	 (CR) CI reject: [V: -1] k8s rsyslog: Use client cert instead of token [puppet] - https://gerrit.wikimedia.org/r/904678 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:50:02] <wikibugs>	 (PS2) JMeybohm: k8s rsyslog: Use client cert instead of token [puppet] - https://gerrit.wikimedia.org/r/904678 (https://phabricator.wikimedia.org/T325268)
[07:51:30] <jinxer-wm>	 (Traffic on tunnel link) resolved: Alert for device cr4-ulsfo.wikimedia.org - Traffic on tunnel link   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link
[07:52:14] <wikibugs>	 (CR) JMeybohm: [C: +2] k8s rsyslog: Use client cert instead of token [puppet] - https://gerrit.wikimedia.org/r/904678 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:53:06] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (Volans) @ayounsi we already have all that setup...
[08:04:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:07:30] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:07:48] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:08:52] <wikibugs>	 (CR) MVernon: [C: +1] "Seems reasonable to me :)" [puppet] - https://gerrit.wikimedia.org/r/904677 (https://phabricator.wikimedia.org/T239862) (owner: Filippo Giunchedi)
[08:10:06] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] statsd_proxy: reuseaddr for 6to4 proxy to avoid crashloop [puppet] - https://gerrit.wikimedia.org/r/904677 (https://phabricator.wikimedia.org/T239862) (owner: Filippo Giunchedi)
[08:13:48] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, Analytics-Radar, DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (MatthewVernon) @Jclark-ctr I can shut down ms-be1042 for you (or you can DIY, there's no special procedure for this host). Can I confirm you want it shut dow...
[08:14:16] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T333328 (Peachey88) a:Papaul→Jhancock.wm
[08:14:53] <logmsgbot>	 !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye
[08:15:03] <wikibugs>	 SRE, ops-codfw, serviceops-collab, GitLab (Infrastructure): Install additional SSDs on gitlab2003.wikimedia.org (B5) - https://phabricator.wikimedia.org/T333304 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jelto@cumin2002 for host gitlab2003.wikimedia.org with OS bullseye
[08:15:28] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (ayounsi) Thanks for the details!  I'm always wary of adding configuration knobs and logic that could make troubleshooting more com...
[08:15:46] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:24] <wikibugs>	 (CR) Ayounsi: [C: +1] "LGTM, please add the task number as comment before each one of them so we can remember in the future why it's there." [homer/public] - https://gerrit.wikimedia.org/r/902084 (https://phabricator.wikimedia.org/T332781) (owner: Cathal Mooney)
[08:24:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:25:25] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[08:25:27] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[08:27:04] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[08:27:07] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[08:27:40] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[08:27:43] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[08:27:56] <logmsgbot>	 !log jelto@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage
[08:29:15] <wikibugs>	 (PS6) Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510
[08:29:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:31:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede)
[08:32:08] <logmsgbot>	 !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage
[08:34:44] <wikibugs>	 (03PS7) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510
[08:36:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede)
[08:37:10] <wikibugs>	 (03PS1) 10Ayounsi: Bird: POC use a different ASN for Cloud hosts [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992)
[08:38:16] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on an-worker1091.eqiad.wmnet with reason: Replacing battery
[08:38:30] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on an-worker1091.eqiad.wmnet with reason: Replacing battery
[08:38:36] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f76e48e4-3716-4c3a-8992-2858603cabe9) set by btullis@cumin1001 for 4 days, 0:00:00 on 1 host...
[08:38:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:38:58] <wikibugs>	 (03PS8) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510
[08:39:53] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good. Many thanks." [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth)
[08:41:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede)
[08:43:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:43:59] <wikibugs>	 (03PS9) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510
[08:44:13] <wikibugs>	 (03PS1) 10David Caro: smart_data_dump: adapt for newer ssacli [puppet] - 10https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354)
[08:44:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (10cmooney) >>! In T332781#8744511, @ayounsi wrote: > I'm always wary of adding configuration knobs and logic that could make trouble...
[08:44:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] smart_data_dump: adapt for newer ssacli [puppet] - 10https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354) (owner: 10David Caro)
[08:45:05] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Set BGP MED based on OSPF cost for EVPN originated routes [homer/public] - 10https://gerrit.wikimedia.org/r/902084 (https://phabricator.wikimedia.org/T332781) (owner: 10Cathal Mooney)
[08:45:11] <wikibugs>	 (03PS2) 10David Caro: smart_data_dump: adapt for newer ssacli [puppet] - 10https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354)
[08:45:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:45:40] <wikibugs>	 (03Merged) 10jenkins-bot: Set BGP MED based on OSPF cost for EVPN originated routes [homer/public] - 10https://gerrit.wikimedia.org/r/902084 (https://phabricator.wikimedia.org/T332781) (owner: 10Cathal Mooney)
[08:46:29] <wikibugs>	 (03CR) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede)
[08:47:56] <logmsgbot>	 !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye
[08:48:05] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab2003.wikimedia.org (B5) - https://phabricator.wikimedia.org/T333304 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jelto@cumin2002 for host gitlab2003.wikimedia.org with OS bullseye...
[08:50:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:50:51] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10BTullis) @Jclark-ctr I've shut down an-worker1091 so you can replace the battery at any time. Feel free to boot it when the work is finished, as it should re...
[08:53:18] <wikibugs>	 (03PS2) 10Ayounsi: Bird: POC use a different ASN for Cloud hosts [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992)
[08:53:36] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (10cmooney) 05Open→03Resolved Patch merged, working as expected.  Previous trace from bast1003 to a server in rack E1: ` cmooney@...
[08:56:00] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10Jclark-ctr) @matthewvernon 1300 utc will be on site to change battery
[08:57:20] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10MatthewVernon) >>! In T332883#8744637, @Jclark-ctr wrote: > @matthewvernon 1300 utc will be on site to change battery  Ah, glad I checked! I'll have it shut...
[09:00:00] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[09:00:47] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::kafka::jumbo::broker: upgrade all brokers to PKI [puppet] - 10https://gerrit.wikimedia.org/r/904455 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey)
[09:01:23] <wikibugs>	 (03CR) 10Ayounsi: "Patch goes with this comment https://phabricator.wikimedia.org/T324992#8744630" [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[09:02:18] <elukey>	 !log move kafka-jumbo1002's kafka broker cert to PKI - T296064
[09:02:19] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Set BGP MED based on OSPF cost for EVPN originated routes (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/902084 (https://phabricator.wikimedia.org/T332781) (owner: 10Cathal Mooney)
[09:02:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:24] <stashbot>	 T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064
[09:03:30] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-jumbo1002.eqiad.wmnet with reason: restart kafka, switch to PKI
[09:03:44] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-jumbo1002.eqiad.wmnet with reason: restart kafka, switch to PKI
[09:04:19] <wikibugs>	 (03PS2) 10EoghanGaffney: Removes unnecessary krb:present line [puppet] - 10https://gerrit.wikimedia.org/r/904522
[09:04:23] <wikibugs>	 (03PS1) 10EoghanGaffney: Updates gitlab package versions [puppet] - 10https://gerrit.wikimedia.org/r/904752 (https://phabricator.wikimedia.org/T333636)
[09:06:07] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/904752 (https://phabricator.wikimedia.org/T333636) (owner: 10EoghanGaffney)
[09:06:15] <wikibugs>	 (03PS1) 10Cathal Mooney: Add comment in LSW external announce policy to detail MED setting [homer/public] - 10https://gerrit.wikimedia.org/r/904753 (https://phabricator.wikimedia.org/T332781)
[09:06:24] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Updates gitlab package versions [puppet] - 10https://gerrit.wikimedia.org/r/904752 (https://phabricator.wikimedia.org/T333636) (owner: 10EoghanGaffney)
[09:06:47] <wikibugs>	 (03PS2) 10EoghanGaffney: Updates gitlab package versions [puppet] - 10https://gerrit.wikimedia.org/r/904752 (https://phabricator.wikimedia.org/T333636)
[09:07:47] <wikibugs>	 (03PS2) 10Cathal Mooney: Add comment in LSW external announce policy to detail MED setting [homer/public] - 10https://gerrit.wikimedia.org/r/904753 (https://phabricator.wikimedia.org/T332781)
[09:09:17] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[09:10:11] <elukey>	 this is me --^
[09:10:17] <elukey>	 should resolve soon-ish
[09:10:38] <wikibugs>	 (03PS5) 10Slyngshede: SSH Keymanagement, allow user to manage ssh keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/899519
[09:10:40] <wikibugs>	 (03PS3) 10Cathal Mooney: Add comment in LSW external announce policy to detail MED setting [homer/public] - 10https://gerrit.wikimedia.org/r/904753 (https://phabricator.wikimedia.org/T332781)
[09:12:04] <wikibugs>	 (03PS1) 10Ayounsi: Bird: remove anycast subnet filter [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992)
[09:12:24] <wikibugs>	 (03PS4) 10Cathal Mooney: Add comment in EVPN SW external announce policy to detail MED setting [homer/public] - 10https://gerrit.wikimedia.org/r/904753 (https://phabricator.wikimedia.org/T332781)
[09:12:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Bird: remove anycast subnet filter [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[09:12:55] <wikibugs>	 (03PS5) 10Cathal Mooney: Add comment in EVPN SW external announce policy to detail MED setting [homer/public] - 10https://gerrit.wikimedia.org/r/904753 (https://phabricator.wikimedia.org/T332781)
[09:14:10] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add comment in EVPN SW external announce policy to detail MED setting [homer/public] - 10https://gerrit.wikimedia.org/r/904753 (https://phabricator.wikimedia.org/T332781) (owner: 10Cathal Mooney)
[09:14:17] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[09:14:46] <wikibugs>	 (03Merged) 10jenkins-bot: Add comment in EVPN SW external announce policy to detail MED setting [homer/public] - 10https://gerrit.wikimedia.org/r/904753 (https://phabricator.wikimedia.org/T332781) (owner: 10Cathal Mooney)
[09:15:46] <wikibugs>	 (03PS3) 10Jaime Nuche: scap: block Scap execution on inactive deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756)
[09:19:40] <wikibugs>	 (03PS1) 10Elukey: kafka: update alerts related to brokers down [alerts] - 10https://gerrit.wikimedia.org/r/904756
[09:19:43] <wikibugs>	 (03PS2) 10Ayounsi: Bird: remove anycast subnet filter [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992)
[09:19:47] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] alertmanager: update phabricator project for WMCS alerts [puppet] - 10https://gerrit.wikimedia.org/r/904559 (https://phabricator.wikimedia.org/T333315) (owner: 10Arturo Borrero Gonzalez)
[09:20:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] kafka: update alerts related to brokers down [alerts] - 10https://gerrit.wikimedia.org/r/904756 (owner: 10Elukey)
[09:25:41] <wikibugs>	 (03PS2) 10Elukey: kafka: update alerts related to brokers down [alerts] - 10https://gerrit.wikimedia.org/r/904756
[09:26:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] kafka: update alerts related to brokers down [alerts] - 10https://gerrit.wikimedia.org/r/904756 (owner: 10Elukey)
[09:27:13] <wikibugs>	 (03PS3) 10Elukey: kafka: update alerts related to brokers down [alerts] - 10https://gerrit.wikimedia.org/r/904756
[09:28:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] kafka: update alerts related to brokers down [alerts] - 10https://gerrit.wikimedia.org/r/904756 (owner: 10Elukey)
[09:30:10] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Bird: remove anycast subnet filter [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[09:32:01] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (10ayounsi) Thanks! I think having the visualization of the data is a good start. Next step would be to see how to do drop-in replacement of some of t...
[09:33:02] <wikibugs>	 (03PS1) 10Slyngshede: Signup: Add captcha to signups. [software/bitu] - 10https://gerrit.wikimedia.org/r/904757 (https://phabricator.wikimedia.org/T320809)
[09:34:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Figure out a captcha option for IDM - https://phabricator.wikimedia.org/T320809 (10SLyngshede-WMF) 05Open→03In progress
[09:34:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations: IDM milestone 3 "Build-out for self service" - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF)
[09:34:56] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "A [potentially stupid] question follows:" [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[09:35:28] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Figure out a captcha option for IDM - https://phabricator.wikimedia.org/T320809 (10SLyngshede-WMF) This doesn't solve the issue of captchas across the various projects, but it does provide a simple solution for the IDM (and other Django based projects...
[09:37:28] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "You've registered nodePort 4113 in https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports but it does not appear in CI diff." [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan)
[09:38:24] <wikibugs>	 (03PS4) 10Elukey: kafka: update alerts related to brokers down [alerts] - 10https://gerrit.wikimedia.org/r/904756
[09:39:18] <wikibugs>	 (03PS2) 10Elukey: Switch kafka-main1001 broker's TLS cert to PKI [puppet] - 10https://gerrit.wikimedia.org/r/904667 (https://phabricator.wikimedia.org/T319372)
[09:39:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] kafka: update alerts related to brokers down [alerts] - 10https://gerrit.wikimedia.org/r/904756 (owner: 10Elukey)
[09:41:06] <wikibugs>	 (03CR) 10Elukey: "Sorry folks, trying to run tox locally fails for multiple tests, and not sure what I am missing here. Will try to fix my local setup and p" [alerts] - 10https://gerrit.wikimedia.org/r/904756 (owner: 10Elukey)
[09:44:05] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: Remove redundant or outdated prefixes from aggregate_networks -> labs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669) (owner: 10Ayounsi)
[09:50:06] <wikibugs>	 (03CR) 10Ayounsi: Bird: POC use a different ASN for Cloud hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[09:50:38] <wikibugs>	 (03PS5) 10Elukey: kafka: update alerts related to brokers down [alerts] - 10https://gerrit.wikimedia.org/r/904756
[09:53:31] <wikibugs>	 (03CR) 10Elukey: "Ok ready to go, sorry for the spam :)" [alerts] - 10https://gerrit.wikimedia.org/r/904756 (owner: 10Elukey)
[09:53:52] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-jumbo1003.eqiad.wmnet with reason: restart kafka, switch to PKI
[09:54:05] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-jumbo1003.eqiad.wmnet with reason: restart kafka, switch to PKI
[09:54:20] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: reprovisioning after maintenance
[09:54:34] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: reprovisioning after maintenance
[09:54:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence-Backup: db1150 crashed: DIMM_A8 memory issues - https://phabricator.wikimedia.org/T332708 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3b38157a-7d2c-4b9f-ad17-b2b2c6932dcb) set by jynus@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their...
[09:54:46] <elukey>	 !log move kafka-jumbo1003's kafka broker cert to PKI - T296064
[09:54:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:51] <stashbot>	 T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064
[09:58:17] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[10:02:45] <wikibugs>	 (03PS1) 10DCausse: flink-app: update to mesh.configuration 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904762 (https://phabricator.wikimedia.org/T328675)
[10:02:47] <wikibugs>	 (03PS1) 10DCausse: rdf-streaming-updater: bump job image to flink-1.16-rc2... [deployment-charts] - 10https://gerrit.wikimedia.org/r/904763 (https://phabricator.wikimedia.org/T328675)
[10:03:17] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[10:04:43] <wikibugs>	 (03PS3) 10Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602)
[10:04:45] <wikibugs>	 (03PS1) 10Jcrespo: database-backups: Provision db1150 with s4 and s3 sections [puppet] - 10https://gerrit.wikimedia.org/r/904764 (https://phabricator.wikimedia.org/T332708)
[10:04:58] <wikibugs>	 (03PS2) 10Jcrespo: database-backups: Provision db1150 with s4 and s3 sections [puppet] - 10https://gerrit.wikimedia.org/r/904764 (https://phabricator.wikimedia.org/T332708)
[10:06:27] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[10:06:30] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[10:07:17] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[10:07:19] <wikibugs>	 (03CR) 10Jcrespo: "CC ing dbas so they are aware, no action needed- this host will be left (mostly) passive for quick redundancy" [puppet] - 10https://gerrit.wikimedia.org/r/904764 (https://phabricator.wikimedia.org/T332708) (owner: 10Jcrespo)
[10:07:20] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[10:07:30] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] database-backups: Provision db1150 with s4 and s3 sections [puppet] - 10https://gerrit.wikimedia.org/r/904764 (https://phabricator.wikimedia.org/T332708) (owner: 10Jcrespo)
[10:09:26] <wikibugs>	 (03CR) 10Jbond: "lgtm but see nit" [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene)
[10:10:30] <icinga-wm>	 PROBLEM - Kafka Broker Server on kafka-test1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[10:10:56] <icinga-wm>	 PROBLEM - Check systemd state on kafka-test1006 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:23] <btullis>	 ^looking
[10:11:34] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-test1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[10:11:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/904629 (https://phabricator.wikimedia.org/T333456) (owner: 10Ladsgroup)
[10:12:13] <elukey>	 btullis: sorry it is me!
[10:12:17] <elukey>	 doing some tests
[10:12:32] <btullis>	 Aha, cool. No probs then.
[10:12:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] logstash: normalize_level add grafana error level alias [puppet] - 10https://gerrit.wikimedia.org/r/904591 (owner: 10Cwhite)
[10:14:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] alertmanager: also pages to sre for data-engineering, releng and search [puppet] - 10https://gerrit.wikimedia.org/r/902431 (https://phabricator.wikimedia.org/T332709) (owner: 10Jbond)
[10:15:11] <wikibugs>	 (03PS2) 10Jbond: alertmanager: also pages to sre for data-engineering [puppet] - 10https://gerrit.wikimedia.org/r/902435 (https://phabricator.wikimedia.org/T332709)
[10:15:31] <wikibugs>	 (03PS2) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268)
[10:15:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] alertmanager: also pages to sre for data-engineering [puppet] - 10https://gerrit.wikimedia.org/r/902435 (https://phabricator.wikimedia.org/T332709) (owner: 10Jbond)
[10:16:09] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] purged: Don't specify the kafka compression codec [puppet] - 10https://gerrit.wikimedia.org/r/904490 (https://phabricator.wikimedia.org/T332669) (owner: 10Vgutierrez)
[10:16:25] <wikibugs>	 (03PS3) 10Btullis: Correct the datahub elasticsearch index prefix for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580)
[10:16:42] <vgutierrez>	 Jbond: alertmanager: also pages to sre for data-engineering (9e69319e96)
[10:16:42] <vgutierrez>	 Jbond: alertmanager: also pages to sre for data-engineering, releng and search (0a3c42330a)
[10:16:50] <vgutierrez>	 ok to merge those two? 
[10:17:06] <wikibugs>	 (03CR) 10Jbond: alertmanager: change repeat interval to 1 week for warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond)
[10:17:12] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40476/console" [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[10:17:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[10:17:25] <elukey>	 this is me, fixing --^
[10:17:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Correct the datahub elasticsearch index prefix for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580) (owner: 10Btullis)
[10:18:06] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrading Gitlab
[10:18:10] <icinga-wm>	 RECOVERY - Check systemd state on kafka-test1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:18:50] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-test1006 is OK: SSL OK - Certificate kafka-test1006.eqiad.wmnet valid until 2024-01-13 11:02:00 +0000 (expires in 288 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[10:19:36] <icinga-wm>	 RECOVERY - Kafka Broker Server on kafka-test1006 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[10:19:41] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (10cmooney) Late to the party here.  >>! In T329669#8618111, @ayounsi wrote: > The other point related to above is that we don't have a strict/clea...
[10:20:07] <wikibugs>	 (03PS2) 10Jbond: alertmanager: change repeat interval to 1 week for warnings [puppet] - 10https://gerrit.wikimedia.org/r/903615
[10:22:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[10:24:29] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: purged issues a config warning on service start - https://phabricator.wikimedia.org/T332669 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez purged is now happy ` Mar 31 10:22:06 cp6001 systemd[1]: purged.service: Succeeded. Mar 31 10:22:06 cp6001 systemd[1]: Stoppe...
[10:25:20] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1001.eqiad.wmnet with reason: preparing for m1 primary db switchover
[10:25:34] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1001.eqiad.wmnet with reason: preparing for m1 primary db switchover
[10:27:10] <wikibugs>	 (03PS2) 10Jcrespo: Revert "backups: Replace db1164 with db1101" [puppet] - 10https://gerrit.wikimedia.org/r/903188 (owner: 10Marostegui)
[10:27:50] <wikibugs>	 (03CR) 10Gmodena: [C: 03+1] flink-app: update to mesh.configuration 1.2.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904762 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[10:27:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:28:14] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2067.codfw.wmnet with OS bullseye
[10:28:22] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye
[10:29:05] <wikibugs>	 (03PS1) 10Vgutierrez: trafficserver: Remove esitest backend [puppet] - 10https://gerrit.wikimedia.org/r/904768 (https://phabricator.wikimedia.org/T308799)
[10:32:15] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1149.eqiad.wmnet']
[10:32:24] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Promote db1164 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123) (owner: 10Marostegui)
[10:32:34] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Remove esitest backend [puppet] - 10https://gerrit.wikimedia.org/r/904768 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez)
[10:33:09] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+1] Migrate from git fat to git lfs (031 comment) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar)
[10:33:13] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:33:15] <wikibugs>	 (03PS2) 10Ayounsi: Remove redundant or outdated prefixes from aggregate_networks -> labs [puppet] - 10https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669)
[10:35:28] <wikibugs>	 (03PS3) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268)
[10:35:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:36:50] <wikibugs>	 (03CR) 10DCausse: flink-app: update to mesh.configuration 1.2.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904762 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[10:37:48] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40477/console" [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[10:39:39] <wikibugs>	 (03CR) 10Ayounsi: Remove redundant or outdated prefixes from aggregate_networks -> labs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669) (owner: 10Ayounsi)
[10:39:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr)
[10:40:32] <wikibugs>	 (03PS3) 10Ladsgroup: mariadb: Promote db1164 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123) (owner: 10Marostegui)
[10:40:42] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db1164 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123) (owner: 10Marostegui)
[10:41:34] <wikibugs>	 (03CR) 10Majavah: Remove redundant or outdated prefixes from aggregate_networks -> labs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669) (owner: 10Ayounsi)
[10:42:06] <wikibugs>	 (03PS4) 10Btullis: Correct the datahub elasticsearch index prefix for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580)
[10:44:06] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage
[10:45:41] <Amir1>	 !log Failover m1 from db1101 to db1164 - T333123
[10:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:52] <stashbot>	 T333123: Switchover m1 master (db1101 -> db1164) - https://phabricator.wikimedia.org/T333123
[10:46:39] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage
[10:48:05] <wikibugs>	 (03PS4) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268)
[10:49:07] <Amir1>	 jynus: pt-kill should finish before I move on to the next step?
[10:49:14] <Amir1>	 It's stuck
[10:49:26] <Amir1>	 not stuck, more like not stopping
[10:50:42] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40478/console" [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[10:51:04] <Amir1>	 etherpad works
[10:51:16] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:53:02] <Amir1>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/903188 I think we can merge this now
[10:53:29] <Amir1>	 I leave it to Jaime 
[10:53:34] <jynus>	 ok
[10:54:02] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "I can merge it, or let Jaime merge it, whatever you prefer." [puppet] - 10https://gerrit.wikimedia.org/r/903188 (owner: 10Marostegui)
[10:54:34] <wikibugs>	 (03PS5) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268)
[10:56:14] <wikibugs>	 (03PS5) 10Btullis: Correct the datahub elasticsearch index prefix for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580)
[10:56:18] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Revert "backups: Replace db1164 with db1101" [puppet] - 10https://gerrit.wikimedia.org/r/903188 (owner: 10Marostegui)
[10:56:32] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40479/console" [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[10:57:00] <wikibugs>	 (03PS3) 10Ayounsi: Remove redundant or outdated prefixes from aggregate_networks -> labs [puppet] - 10https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669)
[11:01:14] <wikibugs>	 (03PS1) 10Filippo Giunchedi: statsd_proxy: fix socat invocation to not crashloop [puppet] - 10https://gerrit.wikimedia.org/r/904771 (https://phabricator.wikimedia.org/T239862)
[11:01:20] <Amir1>	 thanks jynus 
[11:01:37] <wikibugs>	 (03CR) 10Ayounsi: Remove redundant or outdated prefixes from aggregate_networks -> labs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669) (owner: 10Ayounsi)
[11:02:04] <wikibugs>	 (03CR) 10Ayounsi: cloudlb: introduce BGP setup by means of bird (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[11:02:30] <jynus>	 I am running a backup to test it, and if all works well, I will restart bacula
[11:02:46] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2067.codfw.wmnet with OS bullseye
[11:02:53] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye completed: - ms-be2067 (**PASS**)   - Downtim...
[11:03:17] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:04:34] <Amir1>	 let me know once it's done so I can close the ticket
[11:05:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:05:51] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Remove redundant or outdated prefixes from aggregate_networks -> labs [puppet] - 10https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669) (owner: 10Ayounsi)
[11:05:55] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] statsd_proxy: fix socat invocation to not crashloop [puppet] - 10https://gerrit.wikimedia.org/r/904771 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi)
[11:06:43] <wikibugs>	 (03PS2) 10Ladsgroup: admin: Add sfaci ssh key and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/904629 (https://phabricator.wikimedia.org/T333456)
[11:06:47] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Add sfaci ssh key and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/904629 (https://phabricator.wikimedia.org/T333456) (owner: 10Ladsgroup)
[11:08:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] statsd_proxy: fix socat invocation to not crashloop [puppet] - 10https://gerrit.wikimedia.org/r/904771 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi)
[11:08:24] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrading Gitlab
[11:09:03] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10API Platform, 10Patch-For-Review: Requesting access to analytics-privatedata-users for sfaci - https://phabricator.wikimedia.org/T333456 (10Ladsgroup) Added now, in thirty minutes you should be able to access stat machines but someone from data engineering needs to do your k...
[11:09:47] <icinga-wm>	 PROBLEM - SSH on kafka-jumbo1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:09:58] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrading Gitlab
[11:11:07] <icinga-wm>	 RECOVERY - SSH on kafka-jumbo1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:11:45] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab2003.wikimedia.org (B5) - https://phabricator.wikimedia.org/T333304 (10Jelto) 05Open→03Resolved The reimage happened on `gitlab2003` but it seems the partman config is not producing the expected result....
[11:12:08] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: Bird: POC use a different ASN for Cloud hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[11:12:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1151.eqiad.wmnet']
[11:15:48] <wikibugs>	 (03PS1) 10Filippo Giunchedi: DNM: move statsd to graphite2004 [dns] - 10https://gerrit.wikimedia.org/r/904774
[11:15:50] <wikibugs>	 (03CR) 10Majavah: Bird: POC use a different ASN for Cloud hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[11:16:03] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:16:37] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:16:41] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, one comment for curiosity's sake only." [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[11:17:53] <icinga-wm>	 PROBLEM - Check systemd state on graphite2004 is CRITICAL: CRITICAL - degraded: The following units failed: statsd-proxy-socat-6to4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:17:55] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: DNM: move statsd to graphite2004 [dns] - 10https://gerrit.wikimedia.org/r/904774 (owner: 10Filippo Giunchedi)
[11:19:27] <icinga-wm>	 RECOVERY - Check systemd state on graphite2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:24:06] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "Post-merge +1, sorry I was asleep and didn't see it in time." [puppet] - 10https://gerrit.wikimedia.org/r/904764 (https://phabricator.wikimedia.org/T332708) (owner: 10Jcrespo)
[11:26:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: alertmanager: change repeat interval to 1 week for warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond)
[11:30:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: alertmanager: change repeat interval to 1 week for warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond)
[11:31:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/904756 (owner: 10Elukey)
[11:34:42] <wikibugs>	 (03PS1) 10Jbond: systemd::unmask: change the default of refreshonly [puppet] - 10https://gerrit.wikimedia.org/r/904776
[11:41:33] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be1042.eqiad.wmnet with reason: Add-in Card 2 ROMB Battery LOW
[11:41:49] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be1042.eqiad.wmnet with reason: Add-in Card 2 ROMB Battery LOW
[11:41:56] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e19efa89-db0e-4ad2-bcc9-ed867218f629) set by mvernon@cumin2002 for 1 day, 0:00:00 on 1 host(...
[11:42:17] <Emperor>	 !log shutdown ms-be1042 for battery swap T332883
[11:42:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:23] <stashbot>	 T332883: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883
[11:43:20] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10MatthewVernon) @Jclark-ctr ms-be1042 shut down ready for you.
[11:44:20] <wikibugs>	 (03PS1) 10Slyngshede: Revert "P:url_downloader send Squid access logs to Logstash" [puppet] - 10https://gerrit.wikimedia.org/r/904691
[11:46:50] <wikibugs>	 (03PS2) 10Jbond: Revert "P:url_downloader send Squid access logs to Logstash" [puppet] - 10https://gerrit.wikimedia.org/r/904691 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede)
[11:50:13] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Revert "P:url_downloader send Squid access logs to Logstash" [puppet] - 10https://gerrit.wikimedia.org/r/904691 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede)
[11:53:52] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] Jupyterhub-conda exclude /mnt from accessible paths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene)
[11:54:58] <wikibugs>	 (03PS6) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268)
[11:55:12] <wikibugs>	 (03PS2) 10Stevemunene: Jupyterhub-conda exclude /mnt from accessible paths [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511)
[12:00:31] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrading Gitlab
[12:01:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] systemd::unmask: change the default of refreshonly [puppet] - 10https://gerrit.wikimedia.org/r/904776 (owner: 10Jbond)
[12:04:18] <logmsgbot>	 !log eoghan@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrading Gitlab
[12:05:38] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Correct the datahub elasticsearch index prefix for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580) (owner: 10Btullis)
[12:07:29] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414)
[12:11:00] <wikibugs>	 (03Merged) 10jenkins-bot: Correct the datahub elasticsearch index prefix for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580) (owner: 10Btullis)
[12:12:33] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] cloudceph: add the location info to the hosts [puppet] - 10https://gerrit.wikimedia.org/r/896372 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro)
[12:14:00] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40482/console" [puppet] - 10https://gerrit.wikimedia.org/r/896372 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro)
[12:25:01] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] Adds flag to start after unmask, starts logrotate (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney)
[12:25:24] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[12:26:21] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (10jbond) > I take it you are basing this off the defined RFC1918 prefixes in the latest revision? yes w just uses pythons `ipaddress.ip_address(ad...
[12:27:48] <wikibugs>	 (03CR) 10Ayounsi: "Abandoning the change as it's not needed." [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[12:27:52] <wikibugs>	 (03Abandoned) 10Ayounsi: Bird: POC use a different ASN for Cloud hosts [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[12:29:05] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Removes unnecessary krb:present line [puppet] - 10https://gerrit.wikimedia.org/r/904522 (owner: 10EoghanGaffney)
[12:29:11] <wikibugs>	 (03PS1) 10Btullis: Bump the main datahub chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/904782 (https://phabricator.wikimedia.org/T333580)
[12:30:12] <wikibugs>	 (03CR) 10EoghanGaffney: Add production ssh account for eoghan (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883114 (owner: 10EoghanGaffney)
[12:30:19] <wikibugs>	 (03PS2) 10DCausse: rdf-streaming-updater: bump job image to flink-1.16-rc2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904763 (https://phabricator.wikimedia.org/T328675)
[12:31:42] <wikibugs>	 (03PS1) 10Slyngshede: P:url_downloader send squid logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676)
[12:33:45] <wikibugs>	 (03PS1) 10Slyngshede: P:installserver::proxy fix typo in log message. [puppet] - 10https://gerrit.wikimedia.org/r/904784
[12:35:22] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Bump the main datahub chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/904782 (https://phabricator.wikimedia.org/T333580) (owner: 10Btullis)
[12:40:22] <wikibugs>	 (03Merged) 10jenkins-bot: Bump the main datahub chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/904782 (https://phabricator.wikimedia.org/T333580) (owner: 10Btullis)
[12:41:16] <wikibugs>	 (03PS7) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268)
[12:41:27] <wikibugs>	 (03PS1) 10Ladsgroup: Add add_af_actor_T333332.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/904786 (https://phabricator.wikimedia.org/T333332)
[12:43:11] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40484/console" [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[12:45:20] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[12:45:30] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] flink-app: update to mesh.configuration 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904762 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[12:46:29] <wikibugs>	 (03PS8) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268)
[12:46:30] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[12:47:03] <wikibugs>	 (03PS3) 10Jbond: alertmanager: change repeat interval to 1 week for warnings [puppet] - 10https://gerrit.wikimedia.org/r/903615
[12:47:24] <wikibugs>	 (03CR) 10Jbond: alertmanager: change repeat interval to 1 week for warnings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond)
[12:47:27] <wikibugs>	 (03PS1) 10David Caro: ceph: Allow setting a crush location hook for the rack [puppet] - 10https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083)
[12:48:06] <wikibugs>	 (03CR) 10Ayounsi: Bird: remove anycast subnet filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[12:48:11] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40485/console" [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[12:49:47] <wikibugs>	 (03PS4) 10Filippo Giunchedi: alertmanager: change repeat interval to 3 days for warnings [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond)
[12:50:07] <wikibugs>	 (03Merged) 10jenkins-bot: flink-app: update to mesh.configuration 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904762 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[12:50:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: change repeat interval to 3 days for warnings [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond)
[12:50:30] <wikibugs>	 (03PS1) 10David Caro: p:cloudceph::osd: enable location hook [puppet] - 10https://gerrit.wikimedia.org/r/904788 (https://phabricator.wikimedia.org/T297083)
[12:51:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] alertmanager: change repeat interval to 3 days for warnings [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond)
[12:52:10] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40486/console" [puppet] - 10https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro)
[12:52:41] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40487/console" [puppet] - 10https://gerrit.wikimedia.org/r/904788 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro)
[12:53:54] <wikibugs>	 (03PS2) 10David Caro: ceph: Allow setting a crush location hook for the rack [puppet] - 10https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083)
[12:53:56] <wikibugs>	 (03PS2) 10David Caro: p:cloudceph::osd: enable location hook [puppet] - 10https://gerrit.wikimedia.org/r/904788 (https://phabricator.wikimedia.org/T297083)
[12:55:52] <logmsgbot>	 !log eoghan@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrading Gitlab
[12:56:24] <jinxer-wm>	 (ProbeDown) firing: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:57:00] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40488/console" [puppet] - 10https://gerrit.wikimedia.org/r/904788 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro)
[12:57:13] <wikibugs>	 (03PS9) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268)
[12:57:52] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: bump job image to flink-1.16-rc2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904763 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[12:58:50] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40489/console" [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[13:01:24] <jinxer-wm>	 (ProbeDown) resolved: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:02:46] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater: bump job image to flink-1.16-rc2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904763 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[13:05:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] kafka: update alerts related to brokers down [alerts] - 10https://gerrit.wikimedia.org/r/904756 (owner: 10Elukey)
[13:09:17] <elukey>	 !log restart kafkatee on centrallog2002 - test to see if there are issues connecting to the jumbo brokers running pki
[13:09:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:26] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[13:10:29] <logmsgbot>	 !log phedenskog@deploy2002 Started deploy [performance/navtiming@c30b954]: (no justification provided)
[13:10:35] <logmsgbot>	 !log phedenskog@deploy2002 Finished deploy [performance/navtiming@c30b954]: (no justification provided) (duration: 00m 05s)
[13:11:01] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[13:11:16] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-jumbo1004.eqiad.wmnet with reason: restart kafka, switch to PKI
[13:11:29] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-jumbo1004.eqiad.wmnet with reason: restart kafka, switch to PKI
[13:12:48] <elukey>	 !log move kafka-jumbo1004's kafka broker cert to PKI - T296064
[13:13:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:01] <stashbot>	 T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064
[13:16:15] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (10cmooney) >>! In T329669#8745241, @jbond wrote: > yes i did wonder if this could be wrong for ipv6  I guess it comes down to whether we want the...
[13:17:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:22:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:23:25] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10Jclark-ctr) ms-be1042 is finished @MatthewVernon
[13:26:51] <jinxer-wm>	 (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on analytics1075:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=analytics1075 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[13:30:21] <wikibugs>	 (03PS1) 10Filippo Giunchedi: profile: let check_dpkg write prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/904792 (https://phabricator.wikimedia.org/T332764)
[13:30:23] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe1013.eqiad.wmnet with OS bullseye
[13:30:51] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10serviceops, 10Language-Team (Language-2023-April-June ), 10Service-deployment-requests: New Service Deployment Request: NNLB-200 for machine translation - https://phabricator.wikimedia.org/T329971 (10Pginer-WMF)
[13:31:32] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1013.eqiad.wmnet with reason: host reimage
[13:31:51] <jinxer-wm>	 (PowerSupply) resolved: (2) Power Supply - PS Redundancy - issue on analytics1075:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=analytics1075 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[13:32:44] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe1013.eqiad.wmnet with OS bullseye
[13:32:51] <godog>	 jbond: hah ^ alert works, I'll add a bit of leeway
[13:32:55] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "[Looks good, comments inline]" [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[13:33:16] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10Jclark-ctr) 05Open→03Resolved an-worker1091 @btullis  Thanks for shutting down server Battery has been replaced
[13:34:26] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Happy to take care of deploying this on Monday, especially if we do decide to remove 203.0.113.1/32 and 2001:db8::1/128, in case something" [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi)
[13:34:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1013.eqiad.wmnet with reason: host reimage
[13:39:06] <wikibugs>	 (03CR) 10Herron: [C: 03+1] prometheus: Add the prometheus3002 node definition [puppet] - 10https://gerrit.wikimedia.org/r/904651 (https://phabricator.wikimedia.org/T333627) (owner: 10Andrea Denisse)
[13:40:08] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: add some leeway in PowerSupply alert [alerts] - 10https://gerrit.wikimedia.org/r/904795 (https://phabricator.wikimedia.org/T332764)
[13:40:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED
[13:41:41] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1153.mgmt.eqiad.wmnet with reboot policy FORCED
[13:43:14] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Ladsgroup)
[13:46:06] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+1] Extract and deploy upstream plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904575 (owner: 10Hashar)
[13:49:53] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED
[13:51:21] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1153.mgmt.eqiad.wmnet with reboot policy FORCED
[13:51:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr)
[13:53:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[13:53:47] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1153.mgmt.eqiad.wmnet with reboot policy FORCED
[13:54:20] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[13:54:21] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1013.eqiad.wmnet with OS bullseye
[13:54:27] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe1013.eqiad.wmnet with OS bullseye completed: - ms-f...
[13:56:19] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Papaul)
[14:02:32] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1153.mgmt.eqiad.wmnet with reboot policy FORCED
[14:03:08] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr)
[14:29:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/904795 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[14:30:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/904792 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[14:32:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add some leeway in PowerSupply alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/904795 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[14:32:50] <wikibugs>	 (03PS2) 10Filippo Giunchedi: sre: add some leeway in PowerSupply alert [alerts] - 10https://gerrit.wikimedia.org/r/904795 (https://phabricator.wikimedia.org/T332764)
[14:33:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: sre: add some leeway in PowerSupply alert [alerts] - 10https://gerrit.wikimedia.org/r/904795 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[14:35:19] <wikibugs>	 (03Merged) 10jenkins-bot: sre: add some leeway in PowerSupply alert [alerts] - 10https://gerrit.wikimedia.org/r/904795 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[14:36:07] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos)
[14:39:00] <wikibugs>	 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10JArguello-WMF)
[14:41:27] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: cloudlb: introduce BGP setup by means of bird [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992)
[14:43:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe1014.eqiad.wmnet
[14:43:46] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ms-fe1014.eqiad.wmnet
[14:43:55] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe1014.eqiad.wmnet
[14:46:18] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: ceph: Allow setting a crush location hook for the rack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro)
[14:47:17] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] "Let's see how it works." [puppet] - 10https://gerrit.wikimedia.org/r/904616 (https://phabricator.wikimedia.org/T333586) (owner: 10Dzahn)
[14:47:25] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: sync on main
[14:47:27] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: cloudlb: introduce BGP setup by means of bird (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[14:47:33] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host ms-fe1014.eqiad.wmnet
[14:47:54] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[14:52:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe1014.eqiad.wmnet with OS bullseye
[14:52:13] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe1014.eqiad.wmnet with OS bullseye
[14:52:14] <wikibugs>	 (03PS1) 10DCausse: rdf-streaming-updater: bump image version to flink-1.16-rc3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904813 (https://phabricator.wikimedia.org/T328675)
[14:52:58] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Papaul) On ms-fe1014 IPMI was disabled; that is the reason it was failing
[14:54:36] <ottomata>	 ?  why doesn't flink-operator show up in the list of k8s namespaces in the Kubernetes Pods dashboard? https://grafana-rw.wikimedia.org/d/000000473/kubernetes-pods?orgId=1&var-cluster=eqiad+prometheus%2Fk8s-dse&from=1680270857261&to=1680274457261
[14:54:50] <ottomata>	 it is def a namespace in dse-k8s-eqiad:
[14:55:27] <ottomata>	 https://www.irccloud.com/pastebin/s33GTltO/
[14:56:31] <dcausse>	 ottomata: I don't see it when selecting eqiad/k8s
[14:57:08] <ottomata>	 it's in dse-k8s
[14:57:22] <ottomata>	 i'd expect to see it there
[15:01:29] <dcausse>	 hm indeed, perhaps it doesn't expose the prometheus labels?
[15:02:03] <dcausse>	 or that's totally different no clue :/
[15:02:04] <ottomata>	 but...these are k8s level metrics...it shouldn't matter what is running?
[15:03:20] <ottomata>	 hm i can't curl the prom port there.
[15:05:10] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "Revert "Revert "mwscript: Switch to use run.php""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904698
[15:05:29] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "Revert "mwscript: Switch to use run.php""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904698 (owner: 10Ladsgroup)
[15:05:43] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1014.eqiad.wmnet with reason: host reimage
[15:06:16] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Revert "mwscript: Switch to use run.php""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904698 (owner: 10Ladsgroup)
[15:06:38] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye
[15:06:46] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye
[15:06:51] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:904698|Revert "Revert "Revert "mwscript: Switch to use run.php"""]]
[15:07:02] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1004.eqiad.wmnet with OS bullseye
[15:07:07] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with...
[15:07:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye
[15:07:32] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye
[15:08:17] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1004.eqiad.wmnet with OS bullseye
[15:08:27] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with...
[15:08:51] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1014.eqiad.wmnet with reason: host reimage
[15:10:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye
[15:10:19] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye
[15:10:45] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1004.eqiad.wmnet with OS bullseye
[15:10:52] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with...
[15:14:03] <wikibugs>	 (03CR) 10Raymond Ndibe: "Thanks for working on this dcaro. It greatly improves the way things work currently!" [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro)
[15:14:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye
[15:14:23] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye
[15:14:36] <wikibugs>	 (03PS16) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505)
[15:14:45] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:904698|Revert "Revert "Revert "mwscript: Switch to use run.php"""]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[15:18:38] <wikibugs>	 (03CR) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry)
[15:19:06] <wikibugs>	 (03CR) 10Raymond Ndibe: maintain-dbusers: add prometheus metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro)
[15:22:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[15:26:05] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:904698|Revert "Revert "Revert "mwscript: Switch to use run.php"""]] (duration: 19m 14s)
[15:26:21] <wikibugs>	 10SRE-tools, 10DNS, 10Infrastructure-Foundations, 10Traffic, 10netbox: Enforce Netbox domain names without period termination - https://phabricator.wikimedia.org/T306809 (10BCornwall)
[15:26:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:27:29] <wikibugs>	 10SRE-tools, 10DNS, 10Infrastructure-Foundations, 10netbox: Enforce Netbox domain names without period termination - https://phabricator.wikimedia.org/T306809 (10BCornwall) Updated the task description to accurately reflect the work that needs doing. I'm also going to remove the Traffic tag since it seems...
[15:28:37] <wikibugs>	 (03CR) 10Ahmon Dancy: "Fancy" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904575 (owner: 10Hashar)
[15:31:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:33:12] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[15:33:19] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1014.eqiad.wmnet with OS bullseye
[15:33:25] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe1014.eqiad.wmnet with OS bullseye completed: - ms-f...
[15:34:07] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Papaul)
[15:34:46] <wikibugs>	 (03PS1) 10Btullis: Remove the hyphen from the datahub staging elasticsearch prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/904820 (https://phabricator.wikimedia.org/T329514)
[15:40:59] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] scap: block Scap execution on inactive deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche)
[15:42:34] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10BTullis)
[15:43:45] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Remove the hyphen from the datahub staging elasticsearch prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/904820 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[15:48:57] <wikibugs>	 (03Merged) 10jenkins-bot: Remove the hyphen from the datahub staging elasticsearch prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/904820 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[15:49:09] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[15:49:48] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[15:55:59] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Revert "Enable hidden tag for "Edit Check" project on Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904700 (https://phabricator.wikimedia.org/T324733)
[15:57:28] <MatmaRex>	 hi, i'd like to get this revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/904700 emergency-deployed today. is anyone around who could help?
[15:59:25] <RhinosF1>	 Amir1, cwhite, herron, thcipriani: ^
[15:59:47] * Lucas_WMDE here
[15:59:52] <Amir1>	 MatmaRex: I'm around
[15:59:57] <Amir1>	 reverts are fine
[15:59:59] <MatmaRex>	 for context: https://phabricator.wikimedia.org/T333612#8746101 https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Stuck_loading,_can't_post_edit
[16:00:07] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: sync on main
[16:00:17] <Lucas_WMDE>	 yeah this looks fine to me
[16:00:20] <Lucas_WMDE>	 Amir1: want to do it or should I?
[16:00:37] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[16:00:39] <Amir1>	 I'm already logged in deploy*
[16:00:43] <Amir1>	 for another revert 
[16:00:50] <Lucas_WMDE>	 ok
[16:00:55] <MatmaRex>	 thanks all
[16:00:59] <wikibugs>	 (03PS2) 10Ladsgroup: Revert "Enable hidden tag for "Edit Check" project on Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904700 (https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński)
[16:01:02] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "Enable hidden tag for "Edit Check" project on Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904700 (https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński)
[16:01:05] <Lucas_WMDE>	 I just logged in but I’ll leave it to you then
[16:01:48] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Enable hidden tag for "Edit Check" project on Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904700 (https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński)
[16:02:30] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:904700|Revert "Enable hidden tag for "Edit Check" project on Wikipedias" (T324733 T333612)]]
[16:02:43] <stashbot>	 T324733: Introduce a tag to identify edits that meet the Edit Check heuristic   - https://phabricator.wikimedia.org/T324733
[16:02:46] <stashbot>	 T333612: Visual Edits do not save - https://phabricator.wikimedia.org/T333612
[16:03:50] <logmsgbot>	 !log ladsgroup@deploy2002 matmarex and ladsgroup: Backport for [[gerrit:904700|Revert "Enable hidden tag for "Edit Check" project on Wikipedias" (T324733 T333612)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[16:04:25] <Amir1>	 MatmaRex: it's in mwdebug, please test
[16:04:57] <MatmaRex>	 Amir1: i actually haven't reproduced the error yet, but the stack traces in logstash are pointing to this code
[16:05:03] <MatmaRex>	 so i can't really test, sorry D:
[16:05:08] <wikibugs>	 (03CR) 10Hashar: Extract and deploy upstream plugins (032 comments) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904575 (owner: 10Hashar)
[16:05:17] <Amir1>	 ok
[16:05:20] <wikibugs>	 (03PS2) 10Hashar: Extract and deploy upstream plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904575
[16:10:49] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:904700|Revert "Enable hidden tag for "Edit Check" project on Wikipedias" (T324733 T333612)]] (duration: 08m 18s)
[16:10:56] <stashbot>	 T324733: Introduce a tag to identify edits that meet the Edit Check heuristic   - https://phabricator.wikimedia.org/T324733
[16:10:56] <stashbot>	 T333612: Visual Edits do not save - https://phabricator.wikimedia.org/T333612
[16:15:17] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: sync on main
[16:15:33] <MatmaRex>	 thanks Amir1
[16:15:47] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[16:15:54] <MatmaRex>	 i'll reply on the task in a sec
[16:16:37] <Amir1>	 ^_^
[16:19:49] <wikibugs>	 10SRE, 10Commons, 10Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (10Aklapper)
[16:21:34] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gitlab_runner: run clear-docker-cache every hour [puppet] - 10https://gerrit.wikimedia.org/r/904616 (https://phabricator.wikimedia.org/T333586) (owner: 10Dzahn)
[16:21:40] <wikibugs>	 (03PS1) 10Papaul: update thanos-fe1004 entry in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/904826 (https://phabricator.wikimedia.org/T326846)
[16:22:30] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] update thanos-fe1004 entry in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/904826 (https://phabricator.wikimedia.org/T326846) (owner: 10Papaul)
[16:22:35] <thcipriani>	 thanks Amir1 RhinosF1 and MatmaRex <3
[16:22:59] <RhinosF1>	 i didn't do anything thcipriani, thanks all though!
[16:23:28] <mutante>	 jbond: papaul: tr-tr-tr-triple merge combo.  merge ahead :)
[16:24:45] <jbond>	 mutante: ?
[16:24:58] <mutante>	 Papaul: update thanos-fe1004 entry in site.pp (a227d022aa)
[16:24:58] <mutante>	 Dzahn: gitlab_runner: run clear-docker-cache every hour (e6c553eca9)
[16:25:01] <mutante>	 Jbond: alertmanager: change repeat interval to 3 days for warnings (9b9af0c0ab)
[16:25:06] <mutante>	 all these want to be merged on master
[16:25:16] <jbond>	 mutante: oh sorry i thought i merged that one please go ahead
[16:25:51] <mutante>	 meanwhile someone else has the lock. what I wanted to say was "mine is fine to be merged" , heh
[16:26:26] <mutante>	 I bet in the other channel it's the same thing :)
[16:26:30] <mutante>	 -dcops
[16:26:48] <jbond>	 yep :)
[16:28:21] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host thanos-fe1004.eqiad.wmnet with OS bullseye
[16:29:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye
[16:29:37] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with...
[16:29:49] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1004.eqiad.wmnet with OS bullseye
[16:29:59] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS...
[16:30:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye
[16:30:28] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with...
[16:41:44] <wikibugs>	 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ladsgroup) Hi, sorry, I just came back from ooo. I want to take a step back a...
[16:47:45] <wikibugs>	 (03PS1) 10Ssingh: Revert "hiera: set bpg-med to 101 for lvs4008 (100 for lvs4010)" [puppet] - 10https://gerrit.wikimedia.org/r/904701
[16:50:08] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Revert "hiera: set bpg-med to 101 for lvs4008 (100 for lvs4010)" [puppet] - 10https://gerrit.wikimedia.org/r/904701 (owner: 10Ssingh)
[16:51:06] <wikibugs>	 (03PS1) 10Papaul: Fix typo on role for thanos-fe1004 [puppet] - 10https://gerrit.wikimedia.org/r/904830 (https://phabricator.wikimedia.org/T326846)
[16:52:04] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Fix typo on role for thanos-fe1004 [puppet] - 10https://gerrit.wikimedia.org/r/904830 (https://phabricator.wikimedia.org/T326846) (owner: 10Papaul)
[16:54:02] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@2aae7d0]: Fix for VirtualPageview Dag - Analytics [airflow-dags@2aae7d0]
[16:54:13] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@2aae7d0]: Fix for VirtualPageview Dag - Analytics [airflow-dags@2aae7d0] (duration: 00m 10s)
[16:55:41] <sukhe>	 !log restart pybal on lvs4008 to set it primary LVS for high-traffic1
[16:55:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:50] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) Thanks, let us know if there's anything we can do in the meantime.  Here's a list of the assets that are reporting as Netbox errors for accounting mismatch, whic...
[17:05:50] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:05:52] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:07:30] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:07:32] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:08:40] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10wiki_willy) @Cmjohnson / @Jclark-ctr - maybe we can try upgrading the firmware first if it's outdated?   Thanks, Willy
[17:13:00] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3002.esams.wmnet - denisse@cumin1001"
[17:13:01] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[17:15:38] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus3002.esams.wmnet - denisse@cumin1001"
[17:16:10] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Add the prometheus3002 node definition [puppet] - 10https://gerrit.wikimedia.org/r/904651 (https://phabricator.wikimedia.org/T333627) (owner: 10Andrea Denisse)
[17:16:33] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus3002.esams.wmnet - denisse@cumin1001"
[17:16:33] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:16:33] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus3002.esams.wmnet on all recursors
[17:16:36] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus3002.esams.wmnet on all recursors
[17:16:36] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus3002.esams.wmnet
[17:16:46] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@48778b4]: bump discolytics to 0.11.0
[17:17:06] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@48778b4]: bump discolytics to 0.11.0 (duration: 00m 19s)
[17:17:23] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus3002.esams.wmnet
[17:17:24] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[17:18:35] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@9182e44]: Fix for VirtualPageview Dag - Analytics [airflow-dags@9182e44]
[17:18:47] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@9182e44]: Fix for VirtualPageview Dag - Analytics [airflow-dags@9182e44] (duration: 00m 11s)
[17:19:22] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3002.esams.wmnet - denisse@cumin1001"
[17:20:16] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3002.esams.wmnet - denisse@cumin1001"
[17:20:16] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:20:16] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus3002.esams.wmnet on all recursors
[17:20:19] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus3002.esams.wmnet on all recursors
[17:20:23] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[17:22:33] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus3002.esams.wmnet - denisse@cumin1001"
[17:23:34] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus3002.esams.wmnet - denisse@cumin1001"
[17:23:34] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:23:34] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus3002.esams.wmnet on all recursors
[17:23:37] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus3002.esams.wmnet on all recursors
[17:23:41] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus3002.esams.wmnet
[17:27:39] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus3002.esams.wmnet
[17:31:25] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[17:32:37] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:32:37] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus3002.esams.wmnet
[17:32:41] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: esams 1 VM request for prometheus3002 - https://phabricator.wikimedia.org/T333627 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by denisse@cumin1001 for hosts: `prometheus3002.esams.wmnet` - prometheus3002.esams.wmnet (**WARN**)   -...
[17:36:27] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus3002.esams.wmnet
[17:36:28] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[17:39:06] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3002.esams.wmnet - denisse@cumin1001"
[17:40:01] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3002.esams.wmnet - denisse@cumin1001"
[17:40:01] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:40:02] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus3002.esams.wmnet on all recursors
[17:40:05] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus3002.esams.wmnet on all recursors
[17:44:47] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host thanos-fe1004.eqiad.wmnet with OS bullseye
[17:48:27] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye
[17:48:33] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye
[17:49:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1004.eqiad.wmnet with reason: host reimage
[17:52:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1004.eqiad.wmnet with reason: host reimage
[18:01:22] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nfs traffic shaping: label IPs circa 2017 [puppet] - https://gerrit.wikimedia.org/r/904623 (owner: Andrew Bogott)
[18:01:30] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nfs traffic-shaping: replace labstore100[67] with clouddumps100[12] [puppet] - https://gerrit.wikimedia.org/r/904624 (owner: Andrew Bogott)
[18:01:38] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nfs traffic shaping: remove refs to labstore100[12] [puppet] - https://gerrit.wikimedia.org/r/904625 (owner: Andrew Bogott)
[18:01:50] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nfs traffic_shaping: replace labstore1003 rules with rules for scratch.svc [puppet] - https://gerrit.wikimedia.org/r/904626 (owner: Andrew Bogott)
[18:05:23] <wikibugs>	 (PS2) Andrew Bogott: nfs traffic-shaping: replace labstore100[67] with clouddumps100[12] [puppet] - https://gerrit.wikimedia.org/r/904624
[18:05:25] <wikibugs>	 (PS2) Andrew Bogott: nfs traffic shaping: remove refs to labstore100[12] [puppet] - https://gerrit.wikimedia.org/r/904625
[18:05:27] <wikibugs>	 (PS2) Andrew Bogott: nfs traffic_shaping: replace labstore1003 rules with rules for scratch.svc [puppet] - https://gerrit.wikimedia.org/r/904626
[18:05:29] <wikibugs>	 (PS3) Andrew Bogott: Toolforge: move to new VM-hosted NFS server [puppet] - https://gerrit.wikimedia.org/r/904562 (https://phabricator.wikimedia.org/T333477)
[18:05:31] <wikibugs>	 (PS2) Andrew Bogott: nfs traffic_shaping: replace labstore1004 rules with rules for tools-nfs.svc [puppet] - https://gerrit.wikimedia.org/r/904627 (https://phabricator.wikimedia.org/T333477)
[18:05:33] <wikibugs>	 (PS2) Andrew Bogott: labstore1004: park in an 'insetup' role until we're ready to decom [puppet] - https://gerrit.wikimedia.org/r/904630 (https://phabricator.wikimedia.org/T333477)
[18:05:49] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:17:51] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:17:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe1004.eqiad.wmnet with OS bullseye
[18:18:04] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye completed: -...
[18:18:15] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (Papaul)
[18:19:19] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (Papaul) Open→Resolved The problem with thanos-fe1004 was wrong entry in site.pp. All the server are now ready
[18:20:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Eqiad: Sprint Week main tracking task - https://phabricator.wikimedia.org/T332516 (Papaul)
[18:20:48] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (Papaul)
[18:21:47] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@30fae0e]: bump discolytics to 0.12.0
[18:22:08] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@30fae0e]: bump discolytics to 0.12.0 (duration: 00m 20s)
[18:23:38] <logmsgbot>	 !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@30fae0e]: (no justification provided)
[18:23:59] <logmsgbot>	 !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@30fae0e]: (no justification provided) (duration: 00m 20s)
[18:40:06] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3002.esams.wmnet - denisse@cumin1001"
[18:40:59] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3002.esams.wmnet - denisse@cumin1001"
[18:40:59] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus3002.esams.wmnet
[18:49:24] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1156.eqiad.wmnet']
[18:49:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1155.eqiad.wmnet']
[18:54:08] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1151.eqiad.wmnet']
[18:56:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1152.eqiad.wmnet']
[18:58:52] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1156.eqiad.wmnet']
[18:58:54] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1155.eqiad.wmnet']
[19:00:00] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1154.eqiad.wmnet']
[19:00:12] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1153.eqiad.wmnet']
[19:10:12] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1153.eqiad.wmnet']
[19:12:25] <wikibugs>	 (CR) Dzahn: [C: -1] "I should not call it "serviceops" though in the code when I say "sre-collab" in the title. And should amend to add another Phab tag as dis" [puppet] - https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[19:13:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (Papaul) @Cmjohnson @Jgreen i did a quick look in Netbox for frbast1002 mgmt IP address it looks like this node is using 10.64.40.36/26 on eqiad mgmt ne...
[19:14:22] <andrewbogott>	 !log upgraded wikitech-static to 1.39.3
[19:14:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:50] <wikibugs>	 (PS5) Dzahn: alertmanager: create receiver for both sre-collab and releng combined [puppet] - https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587)
[19:24:01] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host prometheus3002.esams.wmnet with OS bullseye
[19:24:06] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: Site: esams 1 VM request for prometheus3002 - https://phabricator.wikimedia.org/T333627 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host prometheus3002.esams.wmnet with OS bullseye
[19:24:09] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1154.eqiad.wmnet']
[19:24:16] <wikibugs>	 (CR) Dzahn: [C: +1] "renamed to "sre-collab-releng". added second PID to create a single ticket but tagged for both teams. I think it's good to go now. CCin'g " [puppet] - https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[19:26:14] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus4002.ulsfo.wmnet
[19:26:16] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[19:26:43] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus5002.eqsin.wmnet
[19:26:51] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[19:27:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jclark-ctr)
[19:28:31] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[19:28:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED
[19:28:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED
[19:29:01] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus4002.ulsfo.wmnet - denisse@cumin1001"
[19:30:02] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus4002.ulsfo.wmnet - denisse@cumin1001"
[19:30:02] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:30:02] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus4002.ulsfo.wmnet on all recursors
[19:30:05] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus4002.ulsfo.wmnet on all recursors
[19:30:46] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED
[19:30:48] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED
[19:30:49] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[19:32:35] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet']
[19:32:35] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1073.eqiad.wmnet']
[19:32:44] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001"
[19:32:56] <wikibugs>	 (PS1) Andrew Bogott: Cinder: backup tool project volumes [puppet] - https://gerrit.wikimedia.org/r/904838
[19:33:26] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet']
[19:33:28] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1073.eqiad.wmnet']
[19:33:47] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001"
[19:33:47] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:33:47] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus5002.eqsin.wmnet on all recursors
[19:33:50] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus5002.eqsin.wmnet on all recursors
[19:34:15] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus6002.drmrs.wmnet
[19:34:16] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[19:34:34] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Cinder: backup tool project volumes [puppet] - https://gerrit.wikimedia.org/r/904838 (owner: Andrew Bogott)
[19:35:40] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1152.eqiad.wmnet']
[19:36:14] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus6002.drmrs.wmnet - denisse@cumin1001"
[19:36:51] <wikibugs>	 (PS1) Andrew Bogott: wmcs backups: stop nfs backups, add a second cinder-backup node [puppet] - https://gerrit.wikimedia.org/r/904839
[19:37:18] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus6002.drmrs.wmnet - denisse@cumin1001"
[19:37:18] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:37:18] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus6002.drmrs.wmnet on all recursors
[19:37:22] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus6002.drmrs.wmnet on all recursors
[19:39:32] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED
[19:39:59] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmcs backups: stop nfs backups, add a second cinder-backup node [puppet] - https://gerrit.wikimedia.org/r/904839 (owner: Andrew Bogott)
[19:40:28] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED
[19:41:44] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED
[19:42:27] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus3002.esams.wmnet with reason: host reimage
[19:45:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED
[19:45:30] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus3002.esams.wmnet with reason: host reimage
[19:45:44] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED
[19:46:11] <wikibugs>	 (PS1) Andrea Denisse: prometheus: Add the prometheus Bullseye node definitions [puppet] - https://gerrit.wikimedia.org/r/904841 (https://phabricator.wikimedia.org/T333719)
[19:47:50] <wikibugs>	 (PS1) Andrew Bogott: Revert "wmcs backups: stop nfs backups, add a second cinder-backup node" [puppet] - https://gerrit.wikimedia.org/r/904842
[19:48:24] <wikibugs>	 (PS1) BCornwall: gitlab: Disable listening on port 80 [puppet] - https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720)
[19:48:28] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Revert "wmcs backups: stop nfs backups, add a second cinder-backup node" [puppet] - https://gerrit.wikimedia.org/r/904842 (owner: Andrew Bogott)
[19:51:19] <wikibugs>	 (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40490/console" [puppet] - https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720) (owner: BCornwall)
[19:58:16] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host prometheus3002.esams.wmnet with OS bullseye
[19:58:22] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: Site: esams 1 VM request for prometheus3002 - https://phabricator.wikimedia.org/T333627 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host prometheus3002.esams.wmnet with OS bullseye completed: - prometheus300...
[20:00:28] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED
[20:01:46] <wikibugs>	 (PS1) Andrew Bogott: cinder: increase backup workers [puppet] - https://gerrit.wikimedia.org/r/904847
[20:03:47] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cinder: increase backup workers [puppet] - https://gerrit.wikimedia.org/r/904847 (owner: Andrew Bogott)
[20:04:40] <wikibugs>	 SRE-Sprint-Week-Sustainability-March2023, DBA, MediaWiki-libs-Rdbms, Performance-Team (Radar), Sustainability (Incident Followup): Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the... - https://phabricator.wikimedia.org/T172497
[20:05:08] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED
[20:16:20] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED
[20:28:05] <wikibugs>	 (CR) Herron: [C: +1] prometheus: Add the prometheus Bullseye node definitions [puppet] - https://gerrit.wikimedia.org/r/904841 (https://phabricator.wikimedia.org/T333719) (owner: Andrea Denisse)
[20:30:04] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus4002.ulsfo.wmnet - denisse@cumin1001"
[20:33:56] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus5002.eqsin.wmnet - denisse@cumin1001"
[20:37:20] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus6002.drmrs.wmnet - denisse@cumin1001"
[20:37:34] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus4002.ulsfo.wmnet - denisse@cumin1001"
[20:37:34] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus4002.ulsfo.wmnet
[20:37:40] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus5002.eqsin.wmnet - denisse@cumin1001"
[20:37:41] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[20:38:12] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus6002.drmrs.wmnet - denisse@cumin1001"
[20:38:12] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus6002.drmrs.wmnet
[20:38:34] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host prometheus4002.ulsfo.wmnet with OS bullseye
[20:38:59] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host prometheus6002.drmrs.wmnet with OS bullseye
[20:39:08] <wikibugs>	 SRE, vm-requests, Patch-For-Review: Site: drmrs 1 VM request for prometheus6002 - https://phabricator.wikimedia.org/T333721 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host prometheus6002.drmrs.wmnet with OS bullseye
[20:39:54] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001"
[20:40:54] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001"
[20:40:54] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:40:54] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus5002.eqsin.wmnet on all recursors
[20:40:57] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus5002.eqsin.wmnet on all recursors
[20:40:57] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus5002.eqsin.wmnet
[20:57:20] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/904595
[20:57:22] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/904595 (owner: TrainBranchBot)
[20:58:36] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus5002.eqsin.wmnet
[20:58:37] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[21:00:55] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001"
[21:01:19] <wikibugs>	 (CR) Andrea Denisse: [C: +2] prometheus: Add the prometheus Bullseye node definitions [puppet] - https://gerrit.wikimedia.org/r/904841 (https://phabricator.wikimedia.org/T333719) (owner: Andrea Denisse)
[21:01:46] <wikibugs>	 (PS2) BCornwall: gitlab: Disable listening on port 80 [puppet] - https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720)
[21:02:28] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001"
[21:02:28] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:02:28] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus5002.eqsin.wmnet on all recursors
[21:02:31] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus5002.eqsin.wmnet on all recursors
[21:02:40] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[21:04:11] <wikibugs>	 (PS1) BCornwall: lists: Disable access on port 80 [puppet] - https://gerrit.wikimedia.org/r/904854 (https://phabricator.wikimedia.org/T238720)
[21:04:43] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001"
[21:05:46] <wikibugs>	 (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40491/console" [puppet] - https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720) (owner: BCornwall)
[21:05:47] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001"
[21:05:47] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:05:47] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus5002.eqsin.wmnet on all recursors
[21:05:50] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus5002.eqsin.wmnet on all recursors
[21:05:55] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus5002.eqsin.wmnet
[21:06:00] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests, SRE Observability (FY2022/2023-Q4): Site: esams 1 VM request for prometheus3002 - https://phabricator.wikimedia.org/T333627 (andrea.denisse) Open→Resolved
[21:06:02] <wikibugs>	 (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40492/console" [puppet] - https://gerrit.wikimedia.org/r/904854 (https://phabricator.wikimedia.org/T238720) (owner: BCornwall)
[21:06:03] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests, SRE Observability (FY2022/2023-Q4): Site: esams 1 VM request for prometheus3002 - https://phabricator.wikimedia.org/T333627 (andrea.denisse)
[21:07:16] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus5002
[21:08:03] <wikibugs>	 SRE, vm-requests, Patch-For-Review, SRE Observability (FY2022/2023-Q4): Site: eqsin 1 VM request for prometheus5002 - https://phabricator.wikimedia.org/T333720 (andrea.denisse)
[21:08:22] <wikibugs>	 SRE, vm-requests, Patch-For-Review: Site: drmrs 1 VM request for prometheus6002 - https://phabricator.wikimedia.org/T333721 (andrea.denisse)
[21:08:54] <wikibugs>	 SRE, vm-requests, Patch-For-Review, SRE Observability (FY2022/2023-Q4): Site: drmrs 1 VM request for prometheus6002 - https://phabricator.wikimedia.org/T333721 (andrea.denisse)
[21:09:33] <wikibugs>	 SRE, vm-requests, Patch-For-Review, SRE Observability (FY2022/2023-Q4): Site: eqsin 1 VM request for prometheus5002 - https://phabricator.wikimedia.org/T333720 (andrea.denisse)
[21:11:12] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[21:12:15] <wikibugs>	 (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/904595 (owner: TrainBranchBot)
[21:12:25] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:12:26] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus5002
[21:12:31] <wikibugs>	 SRE, vm-requests, SRE Observability (FY2022/2023-Q4): Site: eqsin 1 VM request for prometheus5002 - https://phabricator.wikimedia.org/T333720 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by denisse@cumin1001 for hosts: `prometheus5002` - prometheus5002 (**WARN**)   - //Host not foun...
[21:16:55] <wikibugs>	 SRE, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (BCornwall) Stalled→In progress a:BCornwall
[21:42:58] <wikibugs>	 (PS1) Dzahn: etherpad: remove process monitoring [puppet] - https://gerrit.wikimedia.org/r/904856 (https://phabricator.wikimedia.org/T331901)
[21:47:08] <wikibugs>	 (PS2) Dzahn: gerrit: replace Icinga with Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587)
[21:47:23] <wikibugs>	 (CR) CI reject: [V: -1] gerrit: replace Icinga with Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[21:47:55] <wikibugs>	 (PS3) Dzahn: gerrit: replace Icinga with Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587)
[21:49:04] <wikibugs>	 (CR) Dzahn: "you can review this as if https://gerrit.wikimedia.org/r/c/operations/puppet/+/903796 is already merged.. so it will have the new notifica" [puppet] - https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[21:52:30] <wikibugs>	 (CR) Dzahn: [C: -1] "arr, no. It needs to be serviceops-sre-releng and then "severity" is "releng". a bit odd but works. need to amend again though, sorry" [puppet] - https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[21:52:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1075.eqiad.wmnet']
[21:52:39] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host prometheus4002.ulsfo.wmnet with OS bullseye
[21:53:24] <wikibugs>	 (CR) Dzahn: [C: -1] "this is why I wanted to do that team rename change first, by the way. but gotta do that all at once later once that is finalized" [puppet] - https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[21:55:06] <wikibugs>	 (PS4) Dzahn: gerrit: replace Icinga with Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587)
[21:57:33] <wikibugs>	 (CR) CI reject: [V: -1] gerrit: replace Icinga with Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[21:57:58] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host prometheus6002.drmrs.wmnet with OS bullseye
[21:58:03] <wikibugs>	 SRE, vm-requests, SRE Observability (FY2022/2023-Q4): Site: drmrs 1 VM request for prometheus6002 - https://phabricator.wikimedia.org/T333721 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host prometheus6002.drmrs.wmnet with OS bullseye executed with erro...
[21:58:42] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus5002.eqsin.wmnet
[21:58:43] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[21:58:44] <wikibugs>	 (CR) Dzahn: [C: -1] "well, I got caught with my "hack" (parameter 'severity' expects a match for Prometheus::Alert::Severity = Enum['critical', 'info', 'page'," [puppet] - https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[22:00:18] <wikibugs>	 SRE, vm-requests: Site: ulsfo 1 VM request for prometheus4002 - https://phabricator.wikimedia.org/T333719 (Peachey88)
[22:00:38] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001"
[22:00:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1132.eqiad.wmnet
[22:01:27] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1075.eqiad.wmnet']
[22:01:43] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001"
[22:01:43] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:01:43] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus5002.eqsin.wmnet on all recursors
[22:01:46] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus5002.eqsin.wmnet on all recursors
[22:03:22] <wikibugs>	 (03PS1) 10Dzahn: gerrit: replace Icinga monitoring with Prometheus, ssh port 29418 [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901)
[22:03:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gerrit: replace Icinga monitoring with Prometheus, ssh port 29418 [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[22:04:06] <wikibugs>	 (03PS2) 10Dzahn: gerrit: replace Icinga monitoring with Prometheus, ssh port 29418 [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901)
[22:05:37] <wikibugs>	 (03CR) 10Dzahn: "Thinking about it, this should probably also use the new receiver including releng.." [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[22:06:27] <wikibugs>	 (03CR) 10Dzahn: "if it was "admins, gerrit" before it should probably be the same level as https on gerrit being down" [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[22:09:33] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "wait for https://gerrit.wikimedia.org/r/c/operations/puppet/+/903796 but feel free to leave other comments regardless" [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[22:09:42] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Jclark-ctr) opened Dell ticket. sent support assist Confirmed: Service Request 165406278 was successfully submitted.
[22:12:47] <wikibugs>	 (03PS1) 10Dzahn: microsites: add monitor for os-reports.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904859 (https://phabricator.wikimedia.org/T327976)
[22:13:52] <wikibugs>	 (03CR) 10Dzahn: "fyi, we are now monitoring this site as well. let us know if you want to receive notifications about it or think it's overkill or it's fin" [puppet] - 10https://gerrit.wikimedia.org/r/904859 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn)
[22:20:35] <wikibugs>	 (03PS1) 10Dzahn: microsites: add monitor for https://15.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904860 (https://phabricator.wikimedia.org/T327976)
[22:21:43] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "one of the last remaining sites to check on miscweb to close this ticket" [puppet] - 10https://gerrit.wikimedia.org/r/904860 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn)
[22:24:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] microsites: add monitor for os-reports.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904859 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn)
[22:24:57] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on miscweb[2002-2003].codfw.wmnet,miscweb[1002-1003].eqiad.wmnet with reason: maintenance
[22:25:12] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on miscweb[2002-2003].codfw.wmnet,miscweb[1002-1003].eqiad.wmnet with reason: maintenance
[22:25:55] <wikibugs>	 (03PS1) 10Cwhite: logstash: grafana_ecs gsub the level field in [puppet] - 10https://gerrit.wikimedia.org/r/904596
[22:29:49] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: normalize_level add grafana error level alias [puppet] - 10https://gerrit.wikimedia.org/r/904591 (owner: 10Cwhite)
[22:29:51] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: grafana_ecs gsub the level field in [puppet] - 10https://gerrit.wikimedia.org/r/904596 (owner: 10Cwhite)
[22:41:59] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host prometheus4002.ulsfo.wmnet with OS bullseye
[22:42:05] <wikibugs>	 10SRE, 10vm-requests: Site: ulsfo 1 VM request for prometheus4002 - https://phabricator.wikimedia.org/T333719 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host prometheus4002.ulsfo.wmnet with OS bullseye
[22:43:33] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host prometheus6002.drmrs.wmnet with OS bullseye
[22:43:39] <wikibugs>	 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: drmrs 1 VM request for prometheus6002 - https://phabricator.wikimedia.org/T333721 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host prometheus6002.drmrs.wmnet with OS bullseye
[22:44:14] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs-cinder-volume-backup: handle backups that both do and don't have dependencies [puppet] - 10https://gerrit.wikimedia.org/r/904863
[22:44:16] <wikibugs>	 (03PS1) 10Andrew Bogott: cinder backups: don't do toolforge full backups on our busiest day [puppet] - 10https://gerrit.wikimedia.org/r/904864
[22:44:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs-cinder-volume-backup: handle backups that both do and don't have dependencies [puppet] - 10https://gerrit.wikimedia.org/r/904863 (owner: 10Andrew Bogott)
[22:45:43] <wikibugs>	 (03PS2) 10Andrew Bogott: wmcs-cinder-volume-backup: handle backups that have and don't have dependencies [puppet] - 10https://gerrit.wikimedia.org/r/904863
[22:45:45] <wikibugs>	 (03PS2) 10Andrew Bogott: cinder backups: don't do toolforge full backups on our busiest day [puppet] - 10https://gerrit.wikimedia.org/r/904864
[22:46:38] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-cinder-volume-backup: handle backups that have and don't have dependencies [puppet] - 10https://gerrit.wikimedia.org/r/904863 (owner: 10Andrew Bogott)
[22:46:53] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cinder backups: don't do toolforge full backups on our busiest day [puppet] - 10https://gerrit.wikimedia.org/r/904864 (owner: 10Andrew Bogott)
[22:48:37] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:49:15] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:54:08] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "confirmed working in https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*miscweb.*%22%7D&g0.tab=1&g0.stacked=0&g0." [puppet] - 10https://gerrit.wikimedia.org/r/904859 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn)
[22:54:34] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "confirmed working in https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*miscweb.*%22%7D&g0.tab=1&g0.stacked=0&g0." [puppet] - 10https://gerrit.wikimedia.org/r/904860 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn)
[22:55:48] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus4002.ulsfo.wmnet with reason: host reimage
[22:57:22] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus6002.drmrs.wmnet with reason: host reimage
[22:58:53] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus4002.ulsfo.wmnet with reason: host reimage
[23:01:17] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) >>! In T320955#8744401, @Volans wrote: > @ayounsi we already have all that setup...  To add a bit more context, I was out on my mobile and just wanted to post a quic...
[23:01:24] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus6002.drmrs.wmnet with reason: host reimage
[23:01:53] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus5002.eqsin.wmnet - denisse@cumin1001"
[23:02:53] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus5002.eqsin.wmnet - denisse@cumin1001"
[23:02:54] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus5002.eqsin.wmnet
[23:10:33] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host prometheus4002.ulsfo.wmnet with OS bullseye
[23:10:38] <wikibugs>	 10SRE, 10vm-requests: Site: ulsfo 1 VM request for prometheus4002 - https://phabricator.wikimedia.org/T333719 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host prometheus4002.ulsfo.wmnet with OS bullseye completed: - prometheus4002 (**WARN**)   - Downtimed on Ic...
[23:14:06] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host prometheus6002.drmrs.wmnet with OS bullseye
[23:14:11] <wikibugs>	 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: drmrs 1 VM request for prometheus6002 - https://phabricator.wikimedia.org/T333721 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host prometheus6002.drmrs.wmnet with OS bullseye completed: - prome...
[23:14:31] <wikibugs>	 (03PS1) 10Cwhite: rsyslog: add rsyslog-namespaced fields to syslog_json [puppet] - 10https://gerrit.wikimedia.org/r/904597 (https://phabricator.wikimedia.org/T315500)
[23:19:36] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) Sounds good, thanks!  > @wiki_willy I'll try to look at it next week, it should be easy to read from the spreadsheet you showed me and exclude those for now. The...
[23:21:29] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host prometheus5002.eqsin.wmnet with OS bullseye
[23:21:37] <wikibugs>	 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: eqsin 1 VM request for prometheus5002 - https://phabricator.wikimedia.org/T333720 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host prometheus5002.eqsin.wmnet with OS bullseye
[23:22:03] <wikibugs>	 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: drmrs 1 VM request for prometheus6002 - https://phabricator.wikimedia.org/T333721 (10andrea.denisse) 05Open→03Resolved
[23:22:21] <wikibugs>	 (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904595 (owner: 10TrainBranchBot)
[23:22:46] <wikibugs>	 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: ulsfo 1 VM request for prometheus4002 - https://phabricator.wikimedia.org/T333719 (10andrea.denisse) 05Open→03Resolved
[23:34:57] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904598
[23:35:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904598 (owner: 10TrainBranchBot)
[23:49:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904598 (owner: 10TrainBranchBot)
[23:52:22] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus5002.eqsin.wmnet with reason: host reimage
[23:52:23] <wikibugs>	 (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904598 (owner: 10TrainBranchBot)
[23:52:36] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10colewhite)
[23:55:45] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus5002.eqsin.wmnet with reason: host reimage