[00:02:10] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3002.esams.wmnet - denisse@cumin1001"
[00:03:10] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (wiki_willy) Hi @Volans - just a heads up, it looks like some of those 5yr servers on the Accounting Spreadsheet are starting to pop up on the Netbox Error Report as accounti...
[00:03:31] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:04:11] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3002.esams.wmnet - denisse@cumin1001"
[00:04:11] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:04:11] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus3002.esams.wmnet on all recursors
[00:04:14] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus3002.esams.wmnet on all recursors
[00:06:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1224.eqiad.wmnet with reason: host reimage
[00:07:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:07:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1222.eqiad.wmnet with OS bullseye
[00:07:42] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1222.eqiad.wmnet with OS bullseye completed: - db1222 (**PASS**) - Removed from Puppet an...
[00:07:43] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:07:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1210.eqiad.wmnet with OS bullseye
[00:07:49] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1210.eqiad.wmnet with OS bullseye completed: - db1210 (**WARN**) - Removed from Puppet an...
[00:08:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1225.eqiad.wmnet with OS bullseye
[00:08:18] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1225.eqiad.wmnet with OS bullseye
[00:09:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1155.mgmt.eqiad.wmnet with reboot policy FORCED
[00:09:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1224.eqiad.wmnet with reason: host reimage
[00:09:32] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1155.mgmt.eqiad.wmnet with reboot policy FORCED
[00:09:46] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (Papaul)
[00:10:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1155.mgmt.eqiad.wmnet with reboot policy FORCED
[00:10:51] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1156.mgmt.eqiad.wmnet with reboot policy FORCED
[00:11:04] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[00:11:58] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:15:34] SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (BTullis) Hi @jclark-ctr Apologies for any omission on my part. For these servers we use RAID1 for the OS, based on the two ris...
[00:18:37] (PS1) Andrea Denisse: prometheus: Add the prometheus3002 node definition [puppet] - https://gerrit.wikimedia.org/r/904651 (https://phabricator.wikimedia.org/T333627)
[00:18:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1156.mgmt.eqiad.wmnet with reboot policy FORCED
[00:19:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:19:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1223.eqiad.wmnet with OS bullseye
[00:20:01] SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jclark-ctr)
[00:20:03] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1223.eqiad.wmnet with OS bullseye completed: - db1223 (**PASS**) - Removed from Puppet an...
[00:20:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gerrit1003.wikimedia.org with OS bullseye
[00:20:30] SRE, ops-eqiad, DC-Ops, serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit1003.wikimedia.org with OS bullseye
[00:23:17] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149.eqiad.wmnet']
[00:23:58] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:25:01] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40470/console" [puppet] - https://gerrit.wikimedia.org/r/904651 (https://phabricator.wikimedia.org/T333627) (owner: Andrea Denisse)
[00:25:34] (CR) Andrea Denisse: prometheus: Add the prometheus3002 node definition [puppet] - https://gerrit.wikimedia.org/r/904651 (https://phabricator.wikimedia.org/T333627) (owner: Andrea Denisse)
[00:26:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1225.eqiad.wmnet with reason: host reimage
[00:29:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1225.eqiad.wmnet with reason: host reimage
[00:29:33] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:43] PROBLEM - Check systemd state on graphite1005 is CRITICAL: CRITICAL - degraded: The following units failed: statsd-proxy-socat-6to4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit1003.wikimedia.org with reason: host reimage
[00:36:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:38:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit1003.wikimedia.org with reason: host reimage
[00:41:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:42:02] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:51:19] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:51:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1224.eqiad.wmnet with OS bullseye
[00:51:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:51:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1225.eqiad.wmnet with OS bullseye
[00:51:27] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1224.eqiad.wmnet with OS bullseye completed: - db1224 (**WARN**) - Removed from Puppet an...
[00:51:30] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1225.eqiad.wmnet with OS bullseye completed: - db1225 (**PASS**) - Removed from Puppet an...
[00:53:52] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:57:20] RECOVERY - Check systemd state on graphite1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:04:16] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3002.esams.wmnet - denisse@cumin1001"
[01:07:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[01:07:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit1003.wikimedia.org with OS bullseye
[01:08:03] SRE, ops-eqiad, DC-Ops, serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit1003.wikimedia.org with OS bullseye completed: - gerrit1003 (**PASS**) - R...
[01:09:03] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (Papaul)
[01:11:46] SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (Papaul) Open→Resolved @Marostegui your 19 servers are ready have fun
[01:12:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:13:08] SRE, ops-eqiad, DC-Ops, serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (Papaul) Open→Resolved @LSobanski this is ready
[01:15:09] SRE, ops-eqiad, DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (Papaul)
[01:17:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:31:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:32:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:37:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:39:35] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (Papaul) @Cmjohnson taking over the task to look into it
[03:40:02] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (Papaul) a:Cmjohnson→Papaul
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230331T0600)
[06:13:28] (PS2) Elukey: role::kafka::jumbo::broker: upgrade all brokers to PKI [puppet] - https://gerrit.wikimedia.org/r/904455 (https://phabricator.wikimedia.org/T296064)
[06:13:30] (PS1) Elukey: Switch kafka-main1001 broker's TLS cert to PKI [puppet] - https://gerrit.wikimedia.org/r/904667 (https://phabricator.wikimedia.org/T319372)
[06:15:07] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40471/console" [puppet] - https://gerrit.wikimedia.org/r/904667 (https://phabricator.wikimedia.org/T319372) (owner: Elukey)
[06:15:42] (CR) Elukey: Switch kafka-main1001 broker's TLS cert to PKI [puppet] - https://gerrit.wikimedia.org/r/904667 (https://phabricator.wikimedia.org/T319372) (owner: Elukey)
[06:17:20] SRE, serviceops, Patch-For-Review: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 (elukey) All brokers have the new truststore, so they can validate certs emitted by PKI. Next steps: 1) Upgrade kafka-main1001 to PKI, and monitor if any client fails to conn...
[06:40:11] (CR) Elukey: [C: +1] Update default tls terminator/mesh envoy version to 1.18.3-2 [puppet] - https://gerrit.wikimedia.org/r/904557 (https://phabricator.wikimedia.org/T333551) (owner: JMeybohm)
[06:40:56] (CR) Elukey: [C: +1] "I trust your JS knowledge :D" [puppet] - https://gerrit.wikimedia.org/r/904550 (owner: Volans)
[06:43:56] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[06:43:58] (CR) Elukey: [C: +1] k8s: Configure the IPv6 service ip range for apiserver [puppet] - https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[06:43:59] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[06:51:39] (CR) JMeybohm: [C: +2] Update default tls terminator/mesh envoy version to 1.18.3-2 [puppet] - https://gerrit.wikimedia.org/r/904557 (https://phabricator.wikimedia.org/T333551) (owner: JMeybohm)
[06:54:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:54:39] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/mathoid: apply
[06:54:51] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/mathoid: apply
[06:55:12] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply
[06:55:56] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[06:58:25] (PS1) Krinkle: private: Add readme.FatalErrorSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/904670
[06:58:36] (CR) CI reject: [V: -1] private: Add readme.FatalErrorSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/904670 (owner: Krinkle)
[06:58:38] (PS2) Krinkle: private: Add readme.FatalErrorSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/904670
[06:59:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230331T0700)
[07:04:08] (CR) JMeybohm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/904667 (https://phabricator.wikimedia.org/T319372) (owner: Elukey)
[07:05:46] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:05:49] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:08:11] (CR) Elukey: [C: +1] "LGTM! I don't recall if you already followed up in labs/private, but I guess that there is also a clean up in there to do right? Anyway, c" [puppet] - https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:16:01] (CR) JMeybohm: [V: +1] k8s rsyslog: Use client cert instead of token (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:17:50] (CR) Filippo Giunchedi: [C: +1] prometheus: Add the prometheus3002 node definition [puppet] - https://gerrit.wikimedia.org/r/904651 (https://phabricator.wikimedia.org/T333627) (owner: Andrea Denisse)
[07:17:59] (CR) Filippo Giunchedi: [C: +1] dns: repoint alert host services to alert2001 [dns] - https://gerrit.wikimedia.org/r/904614 (https://phabricator.wikimedia.org/T333478) (owner: Herron)
[07:19:41] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/902435 (https://phabricator.wikimedia.org/T332709) (owner: Jbond)
[07:20:03] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:20:06] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:20:15] (CR) Filippo Giunchedi: [C: +1] alertmanager: also pages to sre for data-engineering, releng and search [puppet] - https://gerrit.wikimedia.org/r/902431 (https://phabricator.wikimedia.org/T332709) (owner: Jbond)
[07:21:20] (CR) Filippo Giunchedi: [C: +1] "On further thought, I think this policy might as well apply to all warnings, (i.e. a top level route instead with continue: true), what do" [puppet] - https://gerrit.wikimedia.org/r/903615 (owner: Jbond)
[07:22:56] (PS1) JMeybohm: k8s rsyslog: Remove unused tokens [labs/private] - https://gerrit.wikimedia.org/r/904672 (https://phabricator.wikimedia.org/T325268)
[07:23:58] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:24:02] (PS4) JMeybohm: k8s rsyslog: Use client cert instead of token [puppet] - https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268)
[07:25:28] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:27:12] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:27:15] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:27:35] (CR) JMeybohm: [V: +2 C: +2] k8s rsyslog: Remove unused tokens [labs/private] - https://gerrit.wikimedia.org/r/904672 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:28:01] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:28:03] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[07:30:51] (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40472/console" [puppet] - https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:31:28] PROBLEM - Check systemd state on graphite1005 is CRITICAL: CRITICAL - degraded: The following units failed: statsd-proxy-socat-6to4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:31:30] (Traffic on tunnel link) firing: Alert for device cr4-ulsfo.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link
[07:33:22] (PS1) Filippo Giunchedi: sre: add check for inodes free [alerts] - https://gerrit.wikimedia.org/r/904675 (https://phabricator.wikimedia.org/T332764)
[07:33:46] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (ayounsi) Following this [[ https://help.expandi.io/en/articles/5405660-making-a-webhook-with-google-sheets | doc ]] I was able to add data to a spreadsheet using a generic p...
[07:34:46] (CR) JMeybohm: [V: +1 C: +2] k8s rsyslog: Use client cert instead of token [puppet] - https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:38:42] RECOVERY - Check systemd state on graphite1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:05] (PS1) Filippo Giunchedi: statsd_proxy: reuseaddr for 6to4 proxy to avoid crashloop [puppet] - https://gerrit.wikimedia.org/r/904677 (https://phabricator.wikimedia.org/T239862)
[07:44:28] SRE, ops-codfw, serviceops-collab, GitLab (Infrastructure): Install additional SSDs on gitlab2003.wikimedia.org (B5) - https://phabricator.wikimedia.org/T333304 (Jelto) p:Triage→Medium Thanks @Papaul for the quick installation! I can confirm new disks are available on the host: ` gitlab...
[07:47:16] (PS1) JMeybohm: k8s rsyslog: Use client cert instead of token [puppet] - https://gerrit.wikimedia.org/r/904678 (https://phabricator.wikimedia.org/T325268)
[07:49:09] (CR) CI reject: [V: -1] k8s rsyslog: Use client cert instead of token [puppet] - https://gerrit.wikimedia.org/r/904678 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:50:02] (PS2) JMeybohm: k8s rsyslog: Use client cert instead of token [puppet] - https://gerrit.wikimedia.org/r/904678 (https://phabricator.wikimedia.org/T325268)
[07:51:30] (Traffic on tunnel link) resolved: Alert for device cr4-ulsfo.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link
[07:52:14] (CR) JMeybohm: [C: +2] k8s rsyslog: Use client cert instead of token [puppet] - https://gerrit.wikimedia.org/r/904678 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[07:53:06] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (Volans) @ayounsi we already have all that setup...
[08:04:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:07:30] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:07:48] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:08:52] (CR) MVernon: [C: +1] "Seems reasonable to me :)" [puppet] - https://gerrit.wikimedia.org/r/904677 (https://phabricator.wikimedia.org/T239862) (owner: Filippo Giunchedi)
[08:10:06] (CR) Filippo Giunchedi: [C: +2] statsd_proxy: reuseaddr for 6to4 proxy to avoid crashloop [puppet] - https://gerrit.wikimedia.org/r/904677 (https://phabricator.wikimedia.org/T239862) (owner: Filippo Giunchedi)
[08:13:48] SRE, SRE-swift-storage, ops-eqiad, Analytics-Radar, DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (MatthewVernon) @Jclark-ctr I can shut down ms-be1042 for you (or you can DIY, there's no special procedure for this host). Can I confirm you want it shut dow...
[08:14:16] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T333328 (10Peachey88) a:05Papaul→03Jhancock.wm [08:14:53] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [08:15:03] 10SRE, 10ops-codfw, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab2003.wikimedia.org (B5) - https://phabricator.wikimedia.org/T333304 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jelto@cumin2002 for host gitlab2003.wikimedia.org with OS bullseye [08:15:28] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (10ayounsi) Thanks for the details! I'm always wary of adding configuration knobs and logic that could make troubleshooting more com... [08:15:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:24] (03CR) 10Ayounsi: [C: 03+1] "LGTM, please add the task number as comment before each one of them so we can remember in the future why it's there." 
[homer/public] - 10https://gerrit.wikimedia.org/r/902084 (https://phabricator.wikimedia.org/T332781) (owner: 10Cathal Mooney) [08:24:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:25:25] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [08:25:27] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [08:27:04] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [08:27:07] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [08:27:40] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [08:27:43] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [08:27:56] !log jelto@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [08:29:15] (03PS6) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 [08:29:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:31:38] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: merge 
reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [08:32:08] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [08:34:44] (03PS7) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 [08:36:50] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [08:37:10] (03PS1) 10Ayounsi: Bird: POC use a different ASN for Cloud hosts [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) [08:38:16] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on an-worker1091.eqiad.wmnet with reason: Replacing battery [08:38:30] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on an-worker1091.eqiad.wmnet with reason: Replacing battery [08:38:36] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f76e48e4-3716-4c3a-8992-2858603cabe9) set by btullis@cumin1001 for 4 days, 0:00:00 on 1 host... [08:38:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:38:58] (03PS8) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 [08:39:53] (03CR) 10Btullis: [C: 03+1] "Looks good. Many thanks." 
[alerts] - https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: AOkoth) [08:41:03] (CR) CI reject: [V: -1] sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510 (owner: Slyngshede) [08:43:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:43:59] (PS9) Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510 [08:44:13] (PS1) David Caro: smart_data_dump: adapt for newer ssacli [puppet] - https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354) [08:44:14] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (cmooney) >>! In T332781#8744511, @ayounsi wrote: > I'm always wary of adding configuration knobs and logic that could make trouble... 
[08:44:38] (CR) CI reject: [V: -1] smart_data_dump: adapt for newer ssacli [puppet] - https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354) (owner: David Caro) [08:45:05] (CR) Cathal Mooney: [C: +2] Set BGP MED based on OSPF cost for EVPN originated routes [homer/public] - https://gerrit.wikimedia.org/r/902084 (https://phabricator.wikimedia.org/T332781) (owner: Cathal Mooney) [08:45:11] (PS2) David Caro: smart_data_dump: adapt for newer ssacli [puppet] - https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354) [08:45:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:45:40] (Merged) jenkins-bot: Set BGP MED based on OSPF cost for EVPN originated routes [homer/public] - https://gerrit.wikimedia.org/r/902084 (https://phabricator.wikimedia.org/T332781) (owner: Cathal Mooney) [08:46:29] (CR) Slyngshede: sre.hosts.reimage: merge reimage cookbooks (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/904510 (owner: Slyngshede) [08:47:56] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye [08:48:05] SRE, ops-codfw, serviceops-collab, GitLab (Infrastructure): Install additional SSDs on gitlab2003.wikimedia.org (B5) - https://phabricator.wikimedia.org/T333304 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jelto@cumin2002 for host gitlab2003.wikimedia.org with OS bullseye... 
[08:50:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:50:51] SRE, SRE-swift-storage, ops-eqiad, Analytics-Radar, DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (BTullis) @Jclark-ctr I've shut down an-worker1091 so you can replace the battery at any time. Feel free to boot it when the work is finished, as it should re... [08:53:18] (PS2) Ayounsi: Bird: POC use a different ASN for Cloud hosts [puppet] - https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) [08:53:36] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (cmooney) Open→Resolved Patch merged, working as expected. Previous trace from bast1003 to a server in rack E1: ` cmooney@... [08:56:00] SRE, SRE-swift-storage, ops-eqiad, Analytics-Radar, DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (Jclark-ctr) @matthewvernon 1300 utc will be on site to change battery [08:57:20] SRE, SRE-swift-storage, ops-eqiad, Analytics-Radar, DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (MatthewVernon) >>! In T332883#8744637, @Jclark-ctr wrote: > @matthewvernon 1300 utc will be on site to change battery Ah, glad I checked! I'll have it shut... 
[09:00:00] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: Ayounsi) [09:00:47] (CR) Elukey: [C: +2] role::kafka::jumbo::broker: upgrade all brokers to PKI [puppet] - https://gerrit.wikimedia.org/r/904455 (https://phabricator.wikimedia.org/T296064) (owner: Elukey) [09:01:23] (CR) Ayounsi: "Patch goes with this comment https://phabricator.wikimedia.org/T324992#8744630" [puppet] - https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: Ayounsi) [09:02:18] !log move kafka-jumbo1002's kafka broker cert to PKI - T296064 [09:02:19] (CR) Cathal Mooney: [C: +2] Set BGP MED based on OSPF cost for EVPN originated routes (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/902084 (https://phabricator.wikimedia.org/T332781) (owner: Cathal Mooney) [09:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:24] T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 [09:03:30] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-jumbo1002.eqiad.wmnet with reason: restart kafka, switch to PKI [09:03:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-jumbo1002.eqiad.wmnet with reason: restart kafka, switch to PKI [09:04:19] (PS2) EoghanGaffney: Removes unnecessary krb:present line [puppet] - https://gerrit.wikimedia.org/r/904522 [09:04:23] (PS1) EoghanGaffney: Updates gitlab package versions [puppet] - https://gerrit.wikimedia.org/r/904752 (https://phabricator.wikimedia.org/T333636) [09:06:07] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/904752 (https://phabricator.wikimedia.org/T333636) (owner: EoghanGaffney) [09:06:15] (PS1) Cathal Mooney: Add comment in LSW external announce policy to detail MED setting 
[homer/public] - https://gerrit.wikimedia.org/r/904753 (https://phabricator.wikimedia.org/T332781) [09:06:24] (CR) EoghanGaffney: [C: +2] Updates gitlab package versions [puppet] - https://gerrit.wikimedia.org/r/904752 (https://phabricator.wikimedia.org/T333636) (owner: EoghanGaffney) [09:06:47] (PS2) EoghanGaffney: Updates gitlab package versions [puppet] - https://gerrit.wikimedia.org/r/904752 (https://phabricator.wikimedia.org/T333636) [09:07:47] (PS2) Cathal Mooney: Add comment in LSW external announce policy to detail MED setting [homer/public] - https://gerrit.wikimedia.org/r/904753 (https://phabricator.wikimedia.org/T332781) [09:09:17] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [09:10:11] this is me --^ [09:10:17] should resolve soon-ish [09:10:38] (PS5) Slyngshede: SSH Keymanagement, allow user to manage ssh keys. 
[software/bitu] - https://gerrit.wikimedia.org/r/899519 [09:10:40] (PS3) Cathal Mooney: Add comment in LSW external announce policy to detail MED setting [homer/public] - https://gerrit.wikimedia.org/r/904753 (https://phabricator.wikimedia.org/T332781) [09:12:04] (PS1) Ayounsi: Bird: remove anycast subnet filter [puppet] - https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) [09:12:24] (PS4) Cathal Mooney: Add comment in EVPN SW external announce policy to detail MED setting [homer/public] - https://gerrit.wikimedia.org/r/904753 (https://phabricator.wikimedia.org/T332781) [09:12:41] (CR) CI reject: [V: -1] Bird: remove anycast subnet filter [puppet] - https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: Ayounsi) [09:12:55] (PS5) Cathal Mooney: Add comment in EVPN SW external announce policy to detail MED setting [homer/public] - https://gerrit.wikimedia.org/r/904753 (https://phabricator.wikimedia.org/T332781) [09:14:10] (CR) Cathal Mooney: [C: +2] Add comment in EVPN SW external announce policy to detail MED setting [homer/public] - https://gerrit.wikimedia.org/r/904753 (https://phabricator.wikimedia.org/T332781) (owner: Cathal Mooney) [09:14:17] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [09:14:46] (Merged) jenkins-bot: Add comment in EVPN SW external announce policy to detail MED setting [homer/public] - https://gerrit.wikimedia.org/r/904753 (https://phabricator.wikimedia.org/T332781) (owner: Cathal Mooney) [09:15:46] (PS3) Jaime Nuche: scap: block Scap execution on inactive deployment hosts [puppet] - https://gerrit.wikimedia.org/r/904502 
(https://phabricator.wikimedia.org/T330756) [09:19:40] (PS1) Elukey: kafka: update alerts related to brokers down [alerts] - https://gerrit.wikimedia.org/r/904756 [09:19:43] (PS2) Ayounsi: Bird: remove anycast subnet filter [puppet] - https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) [09:19:47] (CR) Arturo Borrero Gonzalez: [C: +2] alertmanager: update phabricator project for WMCS alerts [puppet] - https://gerrit.wikimedia.org/r/904559 (https://phabricator.wikimedia.org/T333315) (owner: Arturo Borrero Gonzalez) [09:20:52] (CR) CI reject: [V: -1] kafka: update alerts related to brokers down [alerts] - https://gerrit.wikimedia.org/r/904756 (owner: Elukey) [09:25:41] (PS2) Elukey: kafka: update alerts related to brokers down [alerts] - https://gerrit.wikimedia.org/r/904756 [09:26:53] (CR) CI reject: [V: -1] kafka: update alerts related to brokers down [alerts] - https://gerrit.wikimedia.org/r/904756 (owner: Elukey) [09:27:13] (PS3) Elukey: kafka: update alerts related to brokers down [alerts] - https://gerrit.wikimedia.org/r/904756 [09:28:26] (CR) CI reject: [V: -1] kafka: update alerts related to brokers down [alerts] - https://gerrit.wikimedia.org/r/904756 (owner: Elukey) [09:30:10] (CR) Arturo Borrero Gonzalez: [C: +1] Bird: remove anycast subnet filter [puppet] - https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: Ayounsi) [09:32:01] Puppet, SRE, Infrastructure-Foundations, netbox, and 3 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (ayounsi) Thanks! I think having the visualization of the data is a good start. Next step would to see how to do drop in replacement of some of t... [09:33:02] (PS1) Slyngshede: Signup: Add captcha to signups. 
[software/bitu] - https://gerrit.wikimedia.org/r/904757 (https://phabricator.wikimedia.org/T320809) [09:34:08] SRE, Infrastructure-Foundations, Patch-For-Review: Figure out a captcha option for IDM - https://phabricator.wikimedia.org/T320809 (SLyngshede-WMF) Open→In progress [09:34:10] SRE, Infrastructure-Foundations: IDM milestone 3 "Build-out for self service" - https://phabricator.wikimedia.org/T320801 (SLyngshede-WMF) [09:34:56] (CR) Arturo Borrero Gonzalez: "A [potentially stupid] questions follows:" [puppet] - https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: Ayounsi) [09:35:28] SRE, Infrastructure-Foundations, Patch-For-Review: Figure out a captcha option for IDM - https://phabricator.wikimedia.org/T320809 (SLyngshede-WMF) This doesn't solve the issue of captchas across the various projects, but it does provides a simple solution for the IDM (and other Django based projects... [09:37:28] (CR) JMeybohm: [C: -1] "You've registered nodePort 4113 in https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports but it does not appear in CI diff." [deployment-charts] - https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: Hnowlan) [09:38:24] (PS4) Elukey: kafka: update alerts related to brokers down [alerts] - https://gerrit.wikimedia.org/r/904756 [09:39:18] (PS2) Elukey: Switch kafka-main1001 broker's TLS cert to PKI [puppet] - https://gerrit.wikimedia.org/r/904667 (https://phabricator.wikimedia.org/T319372) [09:39:36] (CR) CI reject: [V: -1] kafka: update alerts related to brokers down [alerts] - https://gerrit.wikimedia.org/r/904756 (owner: Elukey) [09:41:06] (CR) Elukey: "Sorry folks, trying to run tox locally fails for multiple tests, and not sure what I am missing here. 
Will try to fix my local setup and p" [alerts] - https://gerrit.wikimedia.org/r/904756 (owner: Elukey) [09:44:05] (CR) Arturo Borrero Gonzalez: Remove redundant or outdated prefixes from aggregate_networks -> labs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669) (owner: Ayounsi) [09:50:06] (CR) Ayounsi: Bird: POC use a different ASN for Cloud hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: Ayounsi) [09:50:38] (PS5) Elukey: kafka: update alerts related to brokers down [alerts] - https://gerrit.wikimedia.org/r/904756 [09:53:31] (CR) Elukey: "Ok ready to go, sorry for the spam :)" [alerts] - https://gerrit.wikimedia.org/r/904756 (owner: Elukey) [09:53:52] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-jumbo1003.eqiad.wmnet with reason: restart kafka, switch to PKI [09:54:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-jumbo1003.eqiad.wmnet with reason: restart kafka, switch to PKI [09:54:20] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: reprovisioning after maintenance [09:54:34] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: reprovisioning after maintenance [09:54:39] SRE, ops-eqiad, DBA, Data-Persistence-Backup: db1150 crashed: DIMM_A8 memory issues - https://phabricator.wikimedia.org/T332708 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3b38157a-7d2c-4b9f-ad17-b2b2c6932dcb) set by jynus@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their... 
[09:54:46] !log move kafka-jumbo1003's kafka broker cert to PKI - T296064 [09:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:51] T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 [09:58:17] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [10:02:45] (PS1) DCausse: flink-app: update to mesh.configuration 1.2.0 [deployment-charts] - https://gerrit.wikimedia.org/r/904762 (https://phabricator.wikimedia.org/T328675) [10:02:47] (PS1) DCausse: rdf-streaming-updater: bump job image to flink-1.16-rc2... [deployment-charts] - https://gerrit.wikimedia.org/r/904763 (https://phabricator.wikimedia.org/T328675) [10:03:17] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [10:04:43] (PS3) Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) [10:04:45] (PS1) Jcrespo: database-backups: Provision db1150 with s4 and s3 sections [puppet] - https://gerrit.wikimedia.org/r/904764 (https://phabricator.wikimedia.org/T332708) [10:04:58] (PS2) Jcrespo: database-backups: Provision db1150 with s4 and s3 sections [puppet] - https://gerrit.wikimedia.org/r/904764 (https://phabricator.wikimedia.org/T332708) [10:06:27] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:06:30] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:07:17] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:07:19] (CR) Jcrespo: "CC ing dbas so they are aware, no action needed- this host will be left (mostly) passive for quick redundancy" [puppet] - https://gerrit.wikimedia.org/r/904764 (https://phabricator.wikimedia.org/T332708) (owner: Jcrespo) [10:07:20] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:07:30] (CR) Jcrespo: [C: +2] database-backups: Provision db1150 with s4 and s3 sections [puppet] - https://gerrit.wikimedia.org/r/904764 (https://phabricator.wikimedia.org/T332708) (owner: Jcrespo) [10:09:26] (CR) Jbond: "lgtm but see nit" [puppet] - https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: Stevemunene) [10:10:30] PROBLEM - Kafka Broker Server on kafka-test1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [10:10:56] PROBLEM - Check systemd state on kafka-test1006 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:11:23] ^looking [10:11:34] PROBLEM - Kafka broker TLS certificate validity on kafka-test1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [10:11:57] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/904629 (https://phabricator.wikimedia.org/T333456) (owner: Ladsgroup) [10:12:13] btullis: sorry 
it is me! [10:12:17] doing some tests [10:12:32] Aha, cool. No probs then. [10:12:44] (CR) Jbond: [C: +1] logstash: normalize_level add grafana error level alias [puppet] - https://gerrit.wikimedia.org/r/904591 (owner: Cwhite) [10:14:57] (CR) Jbond: [C: +2] alertmanager: also pages to sre for data-engineering, releng and search [puppet] - https://gerrit.wikimedia.org/r/902431 (https://phabricator.wikimedia.org/T332709) (owner: Jbond) [10:15:11] (PS2) Jbond: alertmanager: also pages to sre for data-engineering [puppet] - https://gerrit.wikimedia.org/r/902435 (https://phabricator.wikimedia.org/T332709) [10:15:31] (PS2) JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [10:15:41] (CR) Jbond: [C: +2] alertmanager: also pages to sre for data-engineering [puppet] - https://gerrit.wikimedia.org/r/902435 (https://phabricator.wikimedia.org/T332709) (owner: Jbond) [10:16:09] (CR) Vgutierrez: [V: +1 C: +2] purged: Don't specify the kafka compression codec [puppet] - https://gerrit.wikimedia.org/r/904490 (https://phabricator.wikimedia.org/T332669) (owner: Vgutierrez) [10:16:25] (PS3) Btullis: Correct the datahub elasticsearch index prefix for staging [deployment-charts] - https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580) [10:16:42] Jbond: alertmanager: also pages to sre for data-engineering (9e69319e96) [10:16:42] Jbond: alertmanager: also pages to sre for data-engineering, releng and search (0a3c42330a) [10:16:50] ok to merge those two? 
[10:17:06] (CR) Jbond: alertmanager: change repeat interval to 1 week for warnings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/903615 (owner: Jbond) [10:17:12] (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40476/console" [puppet] - https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm) [10:17:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:17:25] this is me, fixing --^ [10:17:32] (CR) CI reject: [V: -1] Correct the datahub elasticsearch index prefix for staging [deployment-charts] - https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580) (owner: Btullis) [10:18:06] !log eoghan@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrading Gitlab [10:18:10] RECOVERY - Check systemd state on kafka-test1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:18:50] RECOVERY - Kafka broker TLS certificate validity on kafka-test1006 is OK: SSL OK - Certificate kafka-test1006.eqiad.wmnet valid until 2024-01-13 11:02:00 +0000 (expires in 288 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [10:19:36] RECOVERY - Kafka Broker Server on kafka-test1006 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [10:19:41] Puppet, SRE, Infrastructure-Foundations, netbox, and 3 others: Netbox: use the netbox to also sync networks - 
https://phabricator.wikimedia.org/T329669 (cmooney) Late to the party here. >>! In T329669#8618111, @ayounsi wrote: > The other point related to above is that we don't have a strict/clea... [10:20:07] (PS2) Jbond: alertmanager: change repeat interval to 1 week for warnings [puppet] - https://gerrit.wikimedia.org/r/903615 [10:22:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:24:29] SRE, Traffic, Patch-For-Review: purged issues a config warning on service start - https://phabricator.wikimedia.org/T332669 (Vgutierrez) Open→Resolved a:Vgutierrez purged is now happy ` Mar 31 10:22:06 cp6001 systemd[1]: purged.service: Succeeded. Mar 31 10:22:06 cp6001 systemd[1]: Stoppe... 
[10:25:20] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1001.eqiad.wmnet with reason: preparing for m1 primary db switchover [10:25:34] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1001.eqiad.wmnet with reason: preparing for m1 primary db switchover [10:27:10] (PS2) Jcrespo: Revert "backups: Replace db1164 with db1101" [puppet] - https://gerrit.wikimedia.org/r/903188 (owner: Marostegui) [10:27:50] (CR) Gmodena: [C: +1] flink-app: update to mesh.configuration 1.2.0 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/904762 (https://phabricator.wikimedia.org/T328675) (owner: DCausse) [10:27:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:28:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2067.codfw.wmnet with OS bullseye [10:28:22] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye [10:29:05] (PS1) Vgutierrez: trafficserver: Remove esitest backend [puppet] - https://gerrit.wikimedia.org/r/904768 (https://phabricator.wikimedia.org/T308799) [10:32:15] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1149.eqiad.wmnet'] [10:32:24] (PS2) Jcrespo: mariadb: Promote db1164 to m1 master [puppet] - https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123) (owner: Marostegui) [10:32:34] (CR) Vgutierrez: [C: +2] trafficserver: Remove esitest backend [puppet] 
- https://gerrit.wikimedia.org/r/904768 (https://phabricator.wikimedia.org/T308799) (owner: Vgutierrez) [10:33:09] (CR) Jaime Nuche: [C: +1] Migrate from git fat to git lfs (1 comment) [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) (owner: Hashar) [10:33:13] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:33:15] (PS2) Ayounsi: Remove redundant or outdated prefixes from aggregate_networks -> labs [puppet] - https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669) [10:35:28] (PS3) JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [10:35:45] (JobUnavailable) firing: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:36:50] (CR) DCausse: flink-app: update to mesh.configuration 1.2.0 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/904762 (https://phabricator.wikimedia.org/T328675) (owner: DCausse) [10:37:48] (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40477/console" [puppet] - https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm) [10:39:39] (CR) Ayounsi: Remove redundant or outdated prefixes from aggregate_networks -> labs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669) 
(owner: Ayounsi) [10:39:42] SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jclark-ctr) [10:40:32] (PS3) Ladsgroup: mariadb: Promote db1164 to m1 master [puppet] - https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123) (owner: Marostegui) [10:40:42] (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1164 to m1 master [puppet] - https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123) (owner: Marostegui) [10:41:34] (CR) Majavah: Remove redundant or outdated prefixes from aggregate_networks -> labs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669) (owner: Ayounsi) [10:42:06] (PS4) Btullis: Correct the datahub elasticsearch index prefix for staging [deployment-charts] - https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580) [10:44:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage [10:45:41] !log Failover m1 from db1101 to db1164 - T333123 [10:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:52] T333123: Switchover m1 master (db1101 -> db1164) - https://phabricator.wikimedia.org/T333123 [10:46:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage [10:48:05] (PS4) JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [10:49:07] jynus: pt-kill should finish before I move on to the next step? 
[10:49:14] It's stuck [10:49:26] not stuck more like, not stopping [10:50:42] (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40478/console" [puppet] - https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm) [10:51:04] etherpad works [10:51:16] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:02] https://gerrit.wikimedia.org/r/c/operations/puppet/+/903188 I think we can merge this now [10:53:29] I leave it to Jaime [10:53:34] ok [10:54:02] (CR) Ladsgroup: [C: +1] "I can merge it, or let Jaime merge it, whatever you prefer." [puppet] - https://gerrit.wikimedia.org/r/903188 (owner: Marostegui) [10:54:34] (PS5) JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [10:56:14] (PS5) Btullis: Correct the datahub elasticsearch index prefix for staging [deployment-charts] - https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580) [10:56:18] (CR) Jcrespo: [C: +2] Revert "backups: Replace db1164 with db1101" [puppet] - https://gerrit.wikimedia.org/r/903188 (owner: Marostegui) [10:56:32] (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40479/console" [puppet] - https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm) [10:57:00] (PS3) Ayounsi: Remove redundant or outdated prefixes from aggregate_networks -> labs [puppet] - https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669) [11:01:14] (PS1) Filippo Giunchedi: statsd_proxy: fix socat invocation to not crashloop [puppet] - 
https://gerrit.wikimedia.org/r/904771 (https://phabricator.wikimedia.org/T239862) [11:01:20] thanks jynus [11:01:37] (CR) Ayounsi: Remove redundant or outdated prefixes from aggregate_networks -> labs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669) (owner: Ayounsi) [11:02:04] (CR) Ayounsi: cloudlb: introduce BGP setup by means of bird (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez) [11:02:30] I am running a backup to test it, and if all works well, I will restart bacula [11:02:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2067.codfw.wmnet with OS bullseye [11:02:53] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye completed: - ms-be2067 (**PASS**) - Downtim... 
[11:03:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:34] let me know once so I close the ticket [11:05:45] (JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:05:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Remove redundant or outdated prefixes from aggregate_networks -> labs [puppet] - 10https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669) (owner: 10Ayounsi) [11:05:55] (03CR) 10MVernon: [C: 03+1] statsd_proxy: fix socat invocation to not crashloop [puppet] - 10https://gerrit.wikimedia.org/r/904771 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi) [11:06:43] (03PS2) 10Ladsgroup: admin: Add sfaci ssh key and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/904629 (https://phabricator.wikimedia.org/T333456) [11:06:47] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Add sfaci ssh key and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/904629 (https://phabricator.wikimedia.org/T333456) (owner: 10Ladsgroup) [11:08:10] (03CR) 10Filippo Giunchedi: [C: 03+2] statsd_proxy: fix socat invocation to not crashloop [puppet] - 10https://gerrit.wikimedia.org/r/904771 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi) [11:08:24] !log eoghan@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrading Gitlab [11:09:03] 10SRE, 10SRE-Access-Requests, 10API Platform, 10Patch-For-Review: Requesting access to analytics-privatedata-users for sfaci - https://phabricator.wikimedia.org/T333456 (10Ladsgroup) Added 
now, in thirty minutes you should be able to access stat machines but someone from data engineering needs to do your k... [11:09:47] PROBLEM - SSH on kafka-jumbo1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:09:58] !log eoghan@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrading Gitlab [11:11:07] RECOVERY - SSH on kafka-jumbo1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:11:45] 10SRE, 10ops-codfw, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab2003.wikimedia.org (B5) - https://phabricator.wikimedia.org/T333304 (10Jelto) 05Open→03Resolved The reimage happend on `gitlab2003` but it seems the partman config is not producing the expected result.... [11:12:08] (03CR) 10Arturo Borrero Gonzalez: Bird: POC use a different ASN for Cloud hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi) [11:12:38] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1151.eqiad.wmnet'] [11:15:48] (03PS1) 10Filippo Giunchedi: DNM: move statsd to graphite2004 [dns] - 10https://gerrit.wikimedia.org/r/904774 [11:15:50] (03CR) 10Majavah: Bird: POC use a different ASN for Cloud hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi) [11:16:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:37] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:41] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, one comment 
for curiosity's sake only." [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi) [11:17:53] PROBLEM - Check systemd state on graphite2004 is CRITICAL: CRITICAL - degraded: The following units failed: statsd-proxy-socat-6to4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:55] (03Abandoned) 10Filippo Giunchedi: DNM: move statsd to graphite2004 [dns] - 10https://gerrit.wikimedia.org/r/904774 (owner: 10Filippo Giunchedi) [11:19:27] RECOVERY - Check systemd state on graphite2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:06] (03CR) 10Ladsgroup: [C: 03+1] "Post-merge +1, sorry I was asleep and didn't see it in time." [puppet] - 10https://gerrit.wikimedia.org/r/904764 (https://phabricator.wikimedia.org/T332708) (owner: 10Jcrespo) [11:26:47] (03CR) 10Filippo Giunchedi: alertmanager: change repeat interval to 1 week for warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond) [11:30:10] (03CR) 10Filippo Giunchedi: alertmanager: change repeat interval to 1 week for warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond) [11:31:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/904756 (owner: 10Elukey) [11:34:42] (03PS1) 10Jbond: systemd::unmask: change the default of refreshonly [puppet] - 10https://gerrit.wikimedia.org/r/904776 [11:41:33] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be1042.eqiad.wmnet with reason: Add-in Card 2 ROMB Battery LOW [11:41:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be1042.eqiad.wmnet with reason: Add-in Card 2 ROMB Battery LOW [11:41:56] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - 
https://phabricator.wikimedia.org/T332883 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e19efa89-db0e-4ad2-bcc9-ed867218f629) set by mvernon@cumin2002 for 1 day, 0:00:00 on 1 host(... [11:42:17] !log shutdown ms-be1042 for battery swap T332883 [11:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:23] T332883: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 [11:43:20] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10MatthewVernon) @Jclark-ctr ms-be1042 shut down ready for you. [11:44:20] (03PS1) 10Slyngshede: Revert "P:url_downloader send Squid access logs to Logstash" [puppet] - 10https://gerrit.wikimedia.org/r/904691 [11:46:50] (03PS2) 10Jbond: Revert "P:url_downloader send Squid access logs to Logstash" [puppet] - 10https://gerrit.wikimedia.org/r/904691 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [11:50:13] (03CR) 10Slyngshede: [C: 03+2] Revert "P:url_downloader send Squid access logs to Logstash" [puppet] - 10https://gerrit.wikimedia.org/r/904691 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [11:53:52] (03CR) 10Stevemunene: [V: 03+1] Jupyterhub-conda exclude /mnt from accessible paths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene) [11:54:58] (03PS6) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [11:55:12] (03PS2) 10Stevemunene: Jupyterhub-conda exclude /mnt from accessible paths [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) [12:00:31] !log eoghan@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrading Gitlab [12:01:27] 
(03CR) 10Jbond: [C: 03+2] systemd::unmask: change the default of refreshonly [puppet] - 10https://gerrit.wikimedia.org/r/904776 (owner: 10Jbond) [12:04:18] !log eoghan@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrading Gitlab [12:05:38] (03CR) 10Btullis: [C: 03+2] Correct the datahub elasticsearch index prefix for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580) (owner: 10Btullis) [12:07:29] (03PS1) 10Ilias Sarantopoulos: ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) [12:11:00] (03Merged) 10jenkins-bot: Correct the datahub elasticsearch index prefix for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580) (owner: 10Btullis) [12:12:33] (03CR) 10David Caro: [C: 03+2] cloudceph: add the location info to the hosts [puppet] - 10https://gerrit.wikimedia.org/r/896372 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [12:14:00] (03CR) 10David Caro: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40482/console" [puppet] - 10https://gerrit.wikimedia.org/r/896372 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [12:25:01] (03CR) 10Jbond: [C: 04-1] Adds flag to start after unmask, starts logrotate (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney) [12:25:24] !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main [12:26:21] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (10jbond) > I take it you are basing this off the defined RFC1918 prefixes 
in the latest revision? yes w just uses pythons `ipaddress.ip_address(ad... [12:27:48] (03CR) 10Ayounsi: "Abandoning the change as it's not needed." [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi) [12:27:52] (03Abandoned) 10Ayounsi: Bird: POC use a different ASN for Cloud hosts [puppet] - 10https://gerrit.wikimedia.org/r/904745 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi) [12:29:05] (03CR) 10EoghanGaffney: [C: 03+2] Removes unnecessary krb:present line [puppet] - 10https://gerrit.wikimedia.org/r/904522 (owner: 10EoghanGaffney) [12:29:11] (03PS1) 10Btullis: Bump the main datahub chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/904782 (https://phabricator.wikimedia.org/T333580) [12:30:12] (03CR) 10EoghanGaffney: Add production ssh account for eoghan (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883114 (owner: 10EoghanGaffney) [12:30:19] (03PS2) 10DCausse: rdf-streaming-updater: bump job image to flink-1.16-rc2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904763 (https://phabricator.wikimedia.org/T328675) [12:31:42] (03PS1) 10Slyngshede: P:url_downloader send squid logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/904783 (https://phabricator.wikimedia.org/T333676) [12:33:45] (03PS1) 10Slyngshede: P:installserver::proxy fix typo in log message. 
[puppet] - 10https://gerrit.wikimedia.org/r/904784 [12:35:22] (03CR) 10Btullis: [C: 03+2] Bump the main datahub chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/904782 (https://phabricator.wikimedia.org/T333580) (owner: 10Btullis) [12:40:22] (03Merged) 10jenkins-bot: Bump the main datahub chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/904782 (https://phabricator.wikimedia.org/T333580) (owner: 10Btullis) [12:41:16] (03PS7) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [12:41:27] (03PS1) 10Ladsgroup: Add add_af_actor_T333332.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/904786 (https://phabricator.wikimedia.org/T333332) [12:43:11] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40484/console" [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [12:45:20] !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main [12:45:30] (03CR) 10DCausse: [C: 03+2] flink-app: update to mesh.configuration 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904762 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [12:46:29] (03PS8) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [12:46:30] !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [12:47:03] (03PS3) 10Jbond: alertmanager: change repeat interval to 1 week for warnings [puppet] - 10https://gerrit.wikimedia.org/r/903615 [12:47:24] (03CR) 10Jbond: alertmanager: change repeat interval to 1 week for warnings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond) [12:47:27] (03PS1) 
10David Caro: ceph: Allow setting a crush location hook for the rack [puppet] - 10https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083) [12:48:06] (03CR) 10Ayounsi: Bird: remove anycast subnet filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi) [12:48:11] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40485/console" [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [12:49:47] (03PS4) 10Filippo Giunchedi: alertmanager: change repeat interval to 3 days for warnings [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond) [12:50:07] (03Merged) 10jenkins-bot: flink-app: update to mesh.configuration 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904762 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [12:50:21] (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: change repeat interval to 3 days for warnings [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond) [12:50:30] (03PS1) 10David Caro: p:cloudceph::osd: enable location hook [puppet] - 10https://gerrit.wikimedia.org/r/904788 (https://phabricator.wikimedia.org/T297083) [12:51:22] (03CR) 10Jbond: [C: 03+2] alertmanager: change repeat interval to 3 days for warnings [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond) [12:52:10] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40486/console" [puppet] - 10https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [12:52:41] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40487/console" [puppet] - 10https://gerrit.wikimedia.org/r/904788 
(https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [12:53:54] (03PS2) 10David Caro: ceph: Allow setting a crush location hook for the rack [puppet] - 10https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083) [12:53:56] (03PS2) 10David Caro: p:cloudceph::osd: enable location hook [puppet] - 10https://gerrit.wikimedia.org/r/904788 (https://phabricator.wikimedia.org/T297083) [12:55:52] !log eoghan@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrading Gitlab [12:56:24] (ProbeDown) firing: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:57:00] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40488/console" [puppet] - 10https://gerrit.wikimedia.org/r/904788 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [12:57:13] (03PS9) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [12:57:52] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: bump job image to flink-1.16-rc2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904763 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [12:58:50] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40489/console" [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [13:01:24] (ProbeDown) resolved: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:02:46] (03Merged) 10jenkins-bot: rdf-streaming-updater: bump job image to flink-1.16-rc2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904763 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [13:05:03] (03CR) 10Elukey: [C: 03+2] kafka: update alerts related to brokers down [alerts] - 10https://gerrit.wikimedia.org/r/904756 (owner: 10Elukey) [13:09:17] !log restart kafkatee on centrallog2002 - test to see if there are issues connecting to the jumbo brokers running pki [13:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:26] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [13:10:29] !log phedenskog@deploy2002 Started deploy [performance/navtiming@c30b954]: (no justification provided) [13:10:35] !log phedenskog@deploy2002 Finished deploy [performance/navtiming@c30b954]: (no justification provided) (duration: 00m 05s) [13:11:01] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [13:11:16] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-jumbo1004.eqiad.wmnet with reason: restart kafka, switch to PKI [13:11:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-jumbo1004.eqiad.wmnet with reason: restart kafka, switch to PKI [13:12:48] !log move kafka-jumbo1004's kafka broker cert to PKI - T296064 [13:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:01] T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 [13:16:15] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the 
netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (10cmooney) >>! In T329669#8745241, @jbond wrote: > yes i did wonder if this could be wrong for ipv6 I guess it comes down to whether we want the... [13:17:45] (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:22:45] (JobUnavailable) resolved: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:23:25] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10Jclark-ctr) ms-be1042 is finished @MatthewVernon [13:26:51] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on analytics1075:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=analytics1075 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [13:30:21] (03PS1) 10Filippo Giunchedi: profile: let check_dpkg write prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/904792 (https://phabricator.wikimedia.org/T332764) [13:30:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe1013.eqiad.wmnet with OS bullseye [13:30:51] 10SRE, 10Machine-Learning-Team, 10serviceops, 10Language-Team (Language-2023-April-June ), 10Service-deployment-requests: New Service Deployment Request: NNLB-200 for machine translation - https://phabricator.wikimedia.org/T329971 (10Pginer-WMF) [13:31:32] !log pt1979@cumin2002 START - 
Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1013.eqiad.wmnet with reason: host reimage [13:31:51] (PowerSupply) resolved: (2) Power Supply - PS Redundancy - issue on analytics1075:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=analytics1075 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [13:32:44] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe1013.eqiad.wmnet with OS bullseye [13:32:51] jbond: hah ^ alert works, I'll add a bit of leeway [13:32:55] (03CR) 10Ssingh: [C: 03+1] "[Looks good, comments inline]" [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi) [13:33:16] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10Jclark-ctr) 05Open→03Resolved an-worker1091 @btullis Thanks for shutting down server Battery has been replaced [13:34:26] (03CR) 10Ssingh: [C: 03+1] "Happy to take care of deploying this on Monday, especially if we do decide to remove 203.0.113.1/32 and 2001:db8::1/128, in case something" [puppet] - 10https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: 10Ayounsi) [13:34:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1013.eqiad.wmnet with reason: host reimage [13:39:06] (03CR) 10Herron: [C: 03+1] prometheus: Add the prometheus3002 node definition [puppet] - 10https://gerrit.wikimedia.org/r/904651 (https://phabricator.wikimedia.org/T333627) (owner: 10Andrea Denisse) [13:40:08] (03PS1) 10Filippo Giunchedi: sre: add 
some leeway in PowerSupply alert [alerts] - 10https://gerrit.wikimedia.org/r/904795 (https://phabricator.wikimedia.org/T332764) [13:40:21] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED [13:41:41] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1153.mgmt.eqiad.wmnet with reboot policy FORCED [13:43:14] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Ladsgroup) [13:46:06] (03CR) 10Jaime Nuche: [C: 03+1] Extract and deploy upstream plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904575 (owner: 10Hashar) [13:49:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED [13:51:21] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1153.mgmt.eqiad.wmnet with reboot policy FORCED [13:51:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) [13:53:08] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [13:53:47] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1153.mgmt.eqiad.wmnet with reboot policy FORCED [13:54:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [13:54:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1013.eqiad.wmnet with OS bullseye [13:54:27] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: 
Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe1013.eqiad.wmnet with OS bullseye completed: - ms-f... [13:56:19] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Papaul) [14:02:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1153.mgmt.eqiad.wmnet with reboot policy FORCED [14:03:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) [14:29:05] (03CR) 10Jbond: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/904795 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [14:30:02] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/904792 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [14:32:35] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add some leeway in PowerSupply alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/904795 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [14:32:50] (03PS2) 10Filippo Giunchedi: sre: add some leeway in PowerSupply alert [alerts] - 10https://gerrit.wikimedia.org/r/904795 (https://phabricator.wikimedia.org/T332764) [14:33:30] (03CR) 10Filippo Giunchedi: sre: add some leeway in PowerSupply alert [alerts] - 10https://gerrit.wikimedia.org/r/904795 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [14:35:19] (03Merged) 10jenkins-bot: sre: add some leeway in PowerSupply alert [alerts] - 10https://gerrit.wikimedia.org/r/904795 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [14:36:07] (03CR) 10Ilias Sarantopoulos: "This change is 
ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [14:39:00] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10JArguello-WMF) [14:41:27] (03PS5) 10Arturo Borrero Gonzalez: cloudlb: introduce BGP setup by means of bird [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) [14:43:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe1014.eqiad.wmnet [14:43:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host ms-fe1014.eqiad.wmnet [14:43:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host ms-fe1014.eqiad.wmnet [14:46:18] (03CR) 10Arturo Borrero Gonzalez: ceph: Allow setting a crush location hook for the rack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [14:47:17] (03CR) 10Ahmon Dancy: [C: 03+1] "Let's see how it works." 
[puppet] - 10https://gerrit.wikimedia.org/r/904616 (https://phabricator.wikimedia.org/T333586) (owner: 10Dzahn) [14:47:25] !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: sync on main [14:47:27] (03CR) 10Arturo Borrero Gonzalez: cloudlb: introduce BGP setup by means of bird (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [14:47:33] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host ms-fe1014.eqiad.wmnet [14:47:54] !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:52:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe1014.eqiad.wmnet with OS bullseye [14:52:13] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe1014.eqiad.wmnet with OS bullseye [14:52:14] (03PS1) 10DCausse: rdf-streaming-updater: bump image version to flink-1.16-rc3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904813 (https://phabricator.wikimedia.org/T328675) [14:52:58] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Papaul) On ms-fe1014 IPMI was disable that is the reason it was failing [14:54:36] ? why doesn't flink-operator show up in the list of k8s namespaces in the Kubernetes Pods dashboard? 
https://grafana-rw.wikimedia.org/d/000000473/kubernetes-pods?orgId=1&var-cluster=eqiad+prometheus%2Fk8s-dse&from=1680270857261&to=1680274457261 [14:54:50] it is def a namespace in dse-k8s-eqiad: [14:55:27] https://www.irccloud.com/pastebin/s33GTltO/ [14:56:31] ottomata: I don't see it when selecting eqiad/k8s [14:57:08] its in dse-k8s [14:57:22] i'd expect to see it there [15:01:29] hm indeed, perhaps it does expose the prometheus labels? [15:02:03] or that's totally different no clue :/ [15:02:04] but...these are k8s level metrics...it shoudln't matter what is running? [15:03:20] hm i can't curl the prom port there. [15:05:10] (03PS1) 10Ladsgroup: Revert "Revert "Revert "mwscript: Switch to use run.php""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904698 [15:05:29] (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "Revert "mwscript: Switch to use run.php""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904698 (owner: 10Ladsgroup) [15:05:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1014.eqiad.wmnet with reason: host reimage [15:06:16] (03Merged) 10jenkins-bot: Revert "Revert "Revert "mwscript: Switch to use run.php""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904698 (owner: 10Ladsgroup) [15:06:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:06:46] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:06:51] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:904698|Revert "Revert "Revert "mwscript: Switch to use run.php"""]] [15:07:02] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:07:07] 10SRE, 
10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with... [15:07:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:07:32] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:08:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:08:27] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with... 
[15:08:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1014.eqiad.wmnet with reason: host reimage [15:10:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:10:19] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:10:45] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:10:52] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with... [15:14:03] (03CR) 10Raymond Ndibe: "Thanks for working on this dcaro. It greatly improves the way things work currently!" 
[puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [15:14:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:14:23] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:14:36] (03PS16) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [15:14:45] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:904698|Revert "Revert "Revert "mwscript: Switch to use run.php"""]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [15:18:38] (03CR) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [15:19:06] (03CR) 10Raymond Ndibe: maintain-dbusers: add prometheus metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [15:22:41] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:26:05] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:904698|Revert "Revert "Revert "mwscript: Switch to use run.php"""]] (duration: 19m 14s) [15:26:21] 10SRE-tools, 10DNS, 10Infrastructure-Foundations, 10Traffic, 10netbox: Enforce Netbox domain names without period termination - https://phabricator.wikimedia.org/T306809 (10BCornwall) 
[15:26:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:27:29] 10SRE-tools, 10DNS, 10Infrastructure-Foundations, 10netbox: Enforce Netbox domain names without period termination - https://phabricator.wikimedia.org/T306809 (10BCornwall) Updated the task description to accurately reflect the work that needs doing. I'm also going to remove the Traffic tag since it seems... [15:28:37] (03CR) 10Ahmon Dancy: "Fancy" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904575 (owner: 10Hashar) [15:31:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:33:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:33:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1014.eqiad.wmnet with OS bullseye [15:33:25] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe1014.eqiad.wmnet with OS bullseye completed: - ms-f... 
[15:34:07] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Papaul) [15:34:46] (03PS1) 10Btullis: Remove the hyphen from the datahub staging elasticsearch prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/904820 (https://phabricator.wikimedia.org/T329514) [15:40:59] (03CR) 10Ahmon Dancy: [C: 03+1] scap: block Scap execution on inactive deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [15:42:34] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10BTullis) [15:43:45] (03CR) 10Btullis: [C: 03+2] Remove the hyphen from the datahub staging elasticsearch prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/904820 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [15:48:57] (03Merged) 10jenkins-bot: Remove the hyphen from the datahub staging elasticsearch prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/904820 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [15:49:09] !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:49:48] !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:55:59] (03PS1) 10Bartosz Dziewoński: Revert "Enable hidden tag for "Edit Check" project on Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904700 (https://phabricator.wikimedia.org/T324733) [15:57:28] hi, i'd like to get this revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/904700 emergency-deployed today. is anyone around who could help? 
[15:59:25] Amir1, cwhite, herron, thcipriani: ^ [15:59:47] * Lucas_WMDE here [15:59:52] MatmaRex: I'm around [15:59:57] reverts are fine [15:59:59] for context: https://phabricator.wikimedia.org/T333612#8746101 https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Stuck_loading,_can't_post_edit [16:00:07] !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: sync on main [16:00:17] yeah this looks fine to me [16:00:20] Amir1: want to do it or should I? [16:00:37] !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [16:00:39] I'm already logged in deploy* [16:00:43] for another revert [16:00:50] ok [16:00:55] thanks all [16:00:59] (03PS2) 10Ladsgroup: Revert "Enable hidden tag for "Edit Check" project on Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904700 (https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński) [16:01:02] (03CR) 10Ladsgroup: [C: 03+2] Revert "Enable hidden tag for "Edit Check" project on Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904700 (https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński) [16:01:05] I just logged in but I’ll leave it to you then [16:01:48] (03Merged) 10jenkins-bot: Revert "Enable hidden tag for "Edit Check" project on Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904700 (https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński) [16:02:30] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:904700|Revert "Enable hidden tag for "Edit Check" project on Wikipedias" (T324733 T333612)]] [16:02:43] T324733: Introduce a tag to identify edits that meet the Edit Check heuristic - https://phabricator.wikimedia.org/T324733 [16:02:46] T333612: Visual Edits do not save - https://phabricator.wikimedia.org/T333612 [16:03:50] !log ladsgroup@deploy2002 matmarex and ladsgroup: Backport for [[gerrit:904700|Revert "Enable hidden tag for "Edit Check" project 
on Wikipedias" (T324733 T333612)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [16:04:25] MatmaRex: it's in mwdebug, please test [16:04:57] Amir1: i actually haven't reproduced the error yet, but the stack traces in logstash are pointing to this code [16:05:03] so i can't really test, sorry D: [16:05:08] (03CR) 10Hashar: Extract and deploy upstream plugins (032 comments) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904575 (owner: 10Hashar) [16:05:17] ok [16:05:20] (03PS2) 10Hashar: Extract and deploy upstream plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904575 [16:10:49] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:904700|Revert "Enable hidden tag for "Edit Check" project on Wikipedias" (T324733 T333612)]] (duration: 08m 18s) [16:10:56] T324733: Introduce a tag to identify edits that meet the Edit Check heuristic - https://phabricator.wikimedia.org/T324733 [16:10:56] T333612: Visual Edits do not save - https://phabricator.wikimedia.org/T333612 [16:15:17] !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: sync on main [16:15:33] thanks Amir1 [16:15:47] !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [16:15:54] i'll reply on the task in a sec [16:16:37] ^_^ [16:19:49] 10SRE, 10Commons, 10Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (10Aklapper) [16:21:34] (03CR) 10Dzahn: [C: 03+2] gitlab_runner: run clear-docker-cache every hour [puppet] - 10https://gerrit.wikimedia.org/r/904616 (https://phabricator.wikimedia.org/T333586) (owner: 10Dzahn) [16:21:40] (03PS1) 10Papaul: update thanos-fe1004 entry in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/904826 (https://phabricator.wikimedia.org/T326846) [16:22:30] (03CR) 10Papaul: [C: 03+2] update 
thanos-fe1004 entry in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/904826 (https://phabricator.wikimedia.org/T326846) (owner: 10Papaul) [16:22:35] thanks Amir1 RhinosF1 and MatmaRex <3 [16:22:59] i didn't do anything thcipriani, thanks all though! [16:23:28] jbond: papaul: tr-tr-tr-triple merge combo. merge ahead :) [16:24:45] mutante: ? [16:24:58] Papaul: update thanos-fe1004 entry in site.pp (a227d022aa) [16:24:58] Dzahn: gitlab_runner: run clear-docker-cache every hour (e6c553eca9) [16:25:01] Jbond: alertmanager: change repeat interval to 3 days for warnings (9b9af0c0ab) [16:25:06] all these want to be merged on master [16:25:16] mutante: oh sorry i thought i merged that one please go ahead [16:25:51] meanwhile someone else has the lock. what I wanted to say was "mine is fine to be merged" , heh [16:26:26] I bet in the other channel it's the same thing :) [16:26:30] -dcops [16:26:48] yep :) [16:28:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host thanos-fe1004.eqiad.wmnet with OS bullseye [16:29:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [16:29:37] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with... [16:29:49] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1004.eqiad.wmnet with OS bullseye [16:29:59] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS... 
[16:30:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [16:30:28] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with... [16:41:44] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ladsgroup) Hi, sorry, I just came back from ooo. I want to take a step back a... [16:47:45] (03PS1) 10Ssingh: Revert "hiera: set bpg-med to 101 for lvs4008 (100 for lvs4010)" [puppet] - 10https://gerrit.wikimedia.org/r/904701 [16:50:08] (03CR) 10Ssingh: [C: 03+2] Revert "hiera: set bpg-med to 101 for lvs4008 (100 for lvs4010)" [puppet] - 10https://gerrit.wikimedia.org/r/904701 (owner: 10Ssingh) [16:51:06] (03PS1) 10Papaul: Fix typo on role for thanos-fe1004 [puppet] - 10https://gerrit.wikimedia.org/r/904830 (https://phabricator.wikimedia.org/T326846) [16:52:04] (03CR) 10Papaul: [C: 03+2] Fix typo on role for thanos-fe1004 [puppet] - 10https://gerrit.wikimedia.org/r/904830 (https://phabricator.wikimedia.org/T326846) (owner: 10Papaul) [16:54:02] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@2aae7d0]: Fix for VirtualPageview Dag - Analytics [airflow-dags@2aae7d0] [16:54:13] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@2aae7d0]: Fix for VirtualPageview Dag - Analytics [airflow-dags@2aae7d0] (duration: 00m 10s) [16:55:41] !log restart pybal on lvs4008 to set it primary LVS for high-traffic1 [16:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:50] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed 
hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) Thanks, let us know if there's anything we can do in the meantime. Here's a list of the assets that are reporting as Netbox errors for accounting mismatch, whic... [17:05:50] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:05:52] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:07:30] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:07:32] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:08:40] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10wiki_willy) @Cmjohnson / @Jclark-ctr - maybe we can try upgrading the firmware first if it's outdated? 
Thanks, Willy [17:13:00] !log denisse@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3002.esams.wmnet - denisse@cumin1001" [17:13:01] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [17:15:38] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus3002.esams.wmnet - denisse@cumin1001" [17:16:10] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Add the prometheus3002 node definition [puppet] - 10https://gerrit.wikimedia.org/r/904651 (https://phabricator.wikimedia.org/T333627) (owner: 10Andrea Denisse) [17:16:33] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus3002.esams.wmnet - denisse@cumin1001" [17:16:33] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:16:33] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus3002.esams.wmnet on all recursors [17:16:36] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus3002.esams.wmnet on all recursors [17:16:36] !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus3002.esams.wmnet [17:16:46] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@48778b4]: bump discolytics to 0.11.0 [17:17:06] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@48778b4]: bump discolytics to 0.11.0 (duration: 00m 19s) [17:17:23] !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus3002.esams.wmnet [17:17:24] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [17:18:35] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@9182e44]: Fix for VirtualPageview Dag - Analytics 
[airflow-dags@9182e44] [17:18:47] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@9182e44]: Fix for VirtualPageview Dag - Analytics [airflow-dags@9182e44] (duration: 00m 11s) [17:19:22] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3002.esams.wmnet - denisse@cumin1001" [17:20:16] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3002.esams.wmnet - denisse@cumin1001" [17:20:16] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:20:16] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus3002.esams.wmnet on all recursors [17:20:19] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus3002.esams.wmnet on all recursors [17:20:23] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [17:22:33] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus3002.esams.wmnet - denisse@cumin1001" [17:23:34] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus3002.esams.wmnet - denisse@cumin1001" [17:23:34] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:23:34] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus3002.esams.wmnet on all recursors [17:23:37] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus3002.esams.wmnet on all recursors [17:23:41] !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus3002.esams.wmnet [17:27:39] !log denisse@cumin1001 START - 
Cookbook sre.hosts.decommission for hosts prometheus3002.esams.wmnet [17:31:25] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [17:32:37] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:32:37] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus3002.esams.wmnet [17:32:41] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: esams 1 VM request for prometheus3002 - https://phabricator.wikimedia.org/T333627 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by denisse@cumin1001 for hosts: `prometheus3002.esams.wmnet` - prometheus3002.esams.wmnet (**WARN**) -... [17:36:27] !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus3002.esams.wmnet [17:36:28] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [17:39:06] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3002.esams.wmnet - denisse@cumin1001" [17:40:01] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3002.esams.wmnet - denisse@cumin1001" [17:40:01] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:40:02] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus3002.esams.wmnet on all recursors [17:40:05] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus3002.esams.wmnet on all recursors [17:44:47] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host thanos-fe1004.eqiad.wmnet with OS bullseye [17:48:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [17:48:33] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, 
thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye [17:49:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1004.eqiad.wmnet with reason: host reimage [17:52:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1004.eqiad.wmnet with reason: host reimage [18:01:22] (03CR) 10Andrew Bogott: [C: 03+2] nfs traffic shaping: label IPs circa 2017 [puppet] - 10https://gerrit.wikimedia.org/r/904623 (owner: 10Andrew Bogott) [18:01:30] (03CR) 10Andrew Bogott: [C: 03+2] nfs traffic-shaping: replace labstore100[67] with clouddumps100[12] [puppet] - 10https://gerrit.wikimedia.org/r/904624 (owner: 10Andrew Bogott) [18:01:38] (03CR) 10Andrew Bogott: [C: 03+2] nfs traffic shaping: remove refs to labstore100[12] [puppet] - 10https://gerrit.wikimedia.org/r/904625 (owner: 10Andrew Bogott) [18:01:50] (03CR) 10Andrew Bogott: [C: 03+2] nfs traffic_shaping: replace labstore1003 rules with rules for scratch.svc [puppet] - 10https://gerrit.wikimedia.org/r/904626 (owner: 10Andrew Bogott) [18:05:23] (03PS2) 10Andrew Bogott: nfs traffic-shaping: replace labstore100[67] with clouddumps100[12] [puppet] - 10https://gerrit.wikimedia.org/r/904624 [18:05:25] (03PS2) 10Andrew Bogott: nfs traffic shaping: remove refs to labstore100[12] [puppet] - 10https://gerrit.wikimedia.org/r/904625 [18:05:27] (03PS2) 10Andrew Bogott: nfs traffic_shaping: replace labstore1003 rules with rules for scratch.svc [puppet] - 10https://gerrit.wikimedia.org/r/904626 [18:05:29] (03PS3) 10Andrew Bogott: Toolforge: move to new VM-hosted NFS server [puppet] - 10https://gerrit.wikimedia.org/r/904562 (https://phabricator.wikimedia.org/T333477) [18:05:31] (03PS2) 10Andrew Bogott: nfs traffic_shaping: replace labstore1004 rules with rules for tools-nfs.svc [puppet] - 
10https://gerrit.wikimedia.org/r/904627 (https://phabricator.wikimedia.org/T333477) [18:05:33] (03PS2) 10Andrew Bogott: labstore1004: park in an 'insetup' role until we're ready to decom [puppet] - 10https://gerrit.wikimedia.org/r/904630 (https://phabricator.wikimedia.org/T333477) [18:05:49] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:17:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:17:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe1004.eqiad.wmnet with OS bullseye [18:18:04] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye completed: -... [18:18:15] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Papaul) [18:19:19] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Papaul) 05Open→03Resolved The problem with thanos-fe1004 was wrong entry in site.pp. 
All the server are now ready [18:20:03] 10SRE, 10ops-eqiad, 10DC-Ops: Eqiad: Sprint Week main tracking task - https://phabricator.wikimedia.org/T332516 (10Papaul) [18:20:48] 10SRE, 10ops-eqiad, 10DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (10Papaul) [18:21:47] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@30fae0e]: bump discolytics to 0.12.0 [18:22:08] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@30fae0e]: bump discolytics to 0.12.0 (duration: 00m 20s) [18:23:38] !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@30fae0e]: (no justification provided) [18:23:59] !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@30fae0e]: (no justification provided) (duration: 00m 20s) [18:40:06] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3002.esams.wmnet - denisse@cumin1001" [18:40:59] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3002.esams.wmnet - denisse@cumin1001" [18:40:59] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus3002.esams.wmnet [18:49:24] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1156.eqiad.wmnet'] [18:49:33] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1155.eqiad.wmnet'] [18:54:08] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1151.eqiad.wmnet'] [18:56:03] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1152.eqiad.wmnet'] [18:58:52] !log 
jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1156.eqiad.wmnet'] [18:58:54] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1155.eqiad.wmnet'] [19:00:00] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1154.eqiad.wmnet'] [19:00:12] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1153.eqiad.wmnet'] [19:10:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1153.eqiad.wmnet'] [19:12:25] (03CR) 10Dzahn: [C: 04-1] "I should not call it "serviceops" though in the code when I say "sre-collab" in the title. And should amend to add another Phab tag as dis" [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [19:13:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Papaul) @Cmjohnson @Jgreen i did a quick look in Netbox for frbast1002 mgmt IP address it looks like this node is using 10.64.40.36/26 on eqiad mgmt ne... 
[19:14:22] !log upgraded wikitech-static to 1.39.3 [19:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:50] (03PS5) 10Dzahn: alertmanager: create receiver for both sre-collab and releng combined [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) [19:24:01] !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host prometheus3002.esams.wmnet with OS bullseye [19:24:06] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: esams 1 VM request for prometheus3002 - https://phabricator.wikimedia.org/T333627 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host prometheus3002.esams.wmnet with OS bullseye [19:24:09] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1154.eqiad.wmnet'] [19:24:16] (03CR) 10Dzahn: [C: 03+1] "renamed to "sre-collab-releng". added second PID to create a single ticket but tagged for both teams. I think it's good to go now. 
CCin'g " [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [19:26:14] !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus4002.ulsfo.wmnet [19:26:16] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [19:26:43] !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus5002.eqsin.wmnet [19:26:51] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [19:27:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) [19:28:31] !log denisse@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:28:53] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED [19:28:54] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED [19:29:01] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus4002.ulsfo.wmnet - denisse@cumin1001" [19:30:02] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus4002.ulsfo.wmnet - denisse@cumin1001" [19:30:02] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:30:02] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus4002.ulsfo.wmnet on all recursors [19:30:05] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus4002.ulsfo.wmnet on all recursors [19:30:46] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED [19:30:48] !log jclark@cumin1001 END (FAIL) - 
Cookbook sre.hosts.provision (exit_code=99) for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED [19:30:49] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [19:32:35] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [19:32:35] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1073.eqiad.wmnet'] [19:32:44] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001" [19:32:56] (03PS1) 10Andrew Bogott: Cinder: backup tool project volumes [puppet] - 10https://gerrit.wikimedia.org/r/904838 [19:33:26] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [19:33:28] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1073.eqiad.wmnet'] [19:33:47] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001" [19:33:47] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:33:47] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus5002.eqsin.wmnet on all recursors [19:33:50] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus5002.eqsin.wmnet on all recursors [19:34:15] !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus6002.drmrs.wmnet [19:34:16] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [19:34:34] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: backup tool project volumes [puppet] - 10https://gerrit.wikimedia.org/r/904838 (owner: 10Andrew Bogott) [19:35:40] !log 
jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1152.eqiad.wmnet'] [19:36:14] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus6002.drmrs.wmnet - denisse@cumin1001" [19:36:51] (03PS1) 10Andrew Bogott: wmcs backups: stop nfs backups, add a second cinder-backup node [puppet] - 10https://gerrit.wikimedia.org/r/904839 [19:37:18] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus6002.drmrs.wmnet - denisse@cumin1001" [19:37:18] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:37:18] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus6002.drmrs.wmnet on all recursors [19:37:22] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus6002.drmrs.wmnet on all recursors [19:39:32] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED [19:39:59] (03CR) 10Andrew Bogott: [C: 03+2] wmcs backups: stop nfs backups, add a second cinder-backup node [puppet] - 10https://gerrit.wikimedia.org/r/904839 (owner: 10Andrew Bogott) [19:40:28] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED [19:41:44] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED [19:42:27] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus3002.esams.wmnet with reason: host reimage [19:45:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED [19:45:30] !log denisse@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus3002.esams.wmnet with reason: host reimage [19:45:44] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED [19:46:11] (03PS1) 10Andrea Denisse: prometheus: Add the prometheus Bullseye node definitions [puppet] - 10https://gerrit.wikimedia.org/r/904841 (https://phabricator.wikimedia.org/T333719) [19:47:50] (03PS1) 10Andrew Bogott: Revert "wmcs backups: stop nfs backups, add a second cinder-backup node" [puppet] - 10https://gerrit.wikimedia.org/r/904842 [19:48:24] (03PS1) 10BCornwall: gitlab: Disable listening on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720) [19:48:28] (03CR) 10Andrew Bogott: [C: 03+2] Revert "wmcs backups: stop nfs backups, add a second cinder-backup node" [puppet] - 10https://gerrit.wikimedia.org/r/904842 (owner: 10Andrew Bogott) [19:51:19] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40490/console" [puppet] - 10https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720) (owner: 10BCornwall) [19:58:16] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host prometheus3002.esams.wmnet with OS bullseye [19:58:22] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: esams 1 VM request for prometheus3002 - https://phabricator.wikimedia.org/T333627 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host prometheus3002.esams.wmnet with OS bullseye completed: - prometheus300... 
[20:00:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED [20:01:46] (03PS1) 10Andrew Bogott: cinder: increase backup workers [puppet] - 10https://gerrit.wikimedia.org/r/904847 [20:03:47] (03CR) 10Andrew Bogott: [C: 03+2] cinder: increase backup workers [puppet] - 10https://gerrit.wikimedia.org/r/904847 (owner: 10Andrew Bogott) [20:04:40] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10MediaWiki-libs-Rdbms, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the... - https://phabricator.wikimedia.org/T172497 [20:05:08] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED [20:16:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED [20:28:05] (03CR) 10Herron: [C: 03+1] prometheus: Add the prometheus Bullseye node definitions [puppet] - 10https://gerrit.wikimedia.org/r/904841 (https://phabricator.wikimedia.org/T333719) (owner: 10Andrea Denisse) [20:30:04] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus4002.ulsfo.wmnet - denisse@cumin1001" [20:33:56] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus5002.eqsin.wmnet - denisse@cumin1001" [20:37:20] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus6002.drmrs.wmnet - denisse@cumin1001" [20:37:34] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus4002.ulsfo.wmnet - denisse@cumin1001" [20:37:34] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus4002.ulsfo.wmnet [20:37:40] !log denisse@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus5002.eqsin.wmnet - denisse@cumin1001" [20:37:41] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [20:38:12] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus6002.drmrs.wmnet - denisse@cumin1001" [20:38:12] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus6002.drmrs.wmnet [20:38:34] !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host prometheus4002.ulsfo.wmnet with OS bullseye [20:38:59] !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host prometheus6002.drmrs.wmnet with OS bullseye [20:39:08] 10SRE, 10vm-requests, 10Patch-For-Review: Site: drmrs 1 VM request for prometheus6002 - https://phabricator.wikimedia.org/T333721 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host prometheus6002.drmrs.wmnet with OS bullseye [20:39:54] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001" [20:40:54] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001" [20:40:54] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:40:54] 
!log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus5002.eqsin.wmnet on all recursors [20:40:57] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus5002.eqsin.wmnet on all recursors [20:40:57] !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus5002.eqsin.wmnet [20:57:20] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904595 [20:57:22] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904595 (owner: 10TrainBranchBot) [20:58:36] !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus5002.eqsin.wmnet [20:58:37] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [21:00:55] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001" [21:01:19] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Add the prometheus Bullseye node definitions [puppet] - 10https://gerrit.wikimedia.org/r/904841 (https://phabricator.wikimedia.org/T333719) (owner: 10Andrea Denisse) [21:01:46] (03PS2) 10BCornwall: gitlab: Disable listening on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720) [21:02:28] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001" [21:02:28] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:02:28] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus5002.eqsin.wmnet on all recursors [21:02:31] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache 
(exit_code=0) prometheus5002.eqsin.wmnet on all recursors [21:02:40] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [21:04:11] (03PS1) 10BCornwall: lists: Disable access on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/904854 (https://phabricator.wikimedia.org/T238720) [21:04:43] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001" [21:05:46] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40491/console" [puppet] - 10https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720) (owner: 10BCornwall) [21:05:47] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001" [21:05:47] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:05:47] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus5002.eqsin.wmnet on all recursors [21:05:50] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus5002.eqsin.wmnet on all recursors [21:05:55] !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus5002.eqsin.wmnet [21:06:00] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: esams 1 VM request for prometheus3002 - https://phabricator.wikimedia.org/T333627 (10andrea.denisse) 05Open→03Resolved [21:06:02] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40492/console" [puppet] - 10https://gerrit.wikimedia.org/r/904854 (https://phabricator.wikimedia.org/T238720) (owner: 10BCornwall) [21:06:03] 10SRE, 
10Infrastructure-Foundations, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: esams 1 VM request for prometheus3002 - https://phabricator.wikimedia.org/T333627 (10andrea.denisse) [21:07:16] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus5002 [21:08:03] 10SRE, 10vm-requests, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q4): Site: eqsin 1 VM request for prometheus5002 - https://phabricator.wikimedia.org/T333720 (10andrea.denisse) [21:08:22] 10SRE, 10vm-requests, 10Patch-For-Review: Site: drmrs 1 VM request for prometheus6002 - https://phabricator.wikimedia.org/T333721 (10andrea.denisse) [21:08:54] 10SRE, 10vm-requests, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q4): Site: drmrs 1 VM request for prometheus6002 - https://phabricator.wikimedia.org/T333721 (10andrea.denisse) [21:09:33] 10SRE, 10vm-requests, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q4): Site: eqsin 1 VM request for prometheus5002 - https://phabricator.wikimedia.org/T333720 (10andrea.denisse) [21:11:12] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [21:12:15] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904595 (owner: 10TrainBranchBot) [21:12:25] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:12:26] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus5002 [21:12:31] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: eqsin 1 VM request for prometheus5002 - https://phabricator.wikimedia.org/T333720 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by denisse@cumin1001 for hosts: `prometheus5002` - prometheus5002 (**WARN**) - //Host not foun... 
[21:16:55] 10SRE, 10Traffic, 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10BCornwall) 05Stalled→03In progress a:03BCornwall [21:42:58] (03PS1) 10Dzahn: etherpad: remove process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/904856 (https://phabricator.wikimedia.org/T331901) [21:47:08] (03PS2) 10Dzahn: gerrit: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) [21:47:23] (03CR) 10CI reject: [V: 04-1] gerrit: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [21:47:55] (03PS3) 10Dzahn: gerrit: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) [21:49:04] (03CR) 10Dzahn: "you can review this as if https://gerrit.wikimedia.org/r/c/operations/puppet/+/903796 is already merged.. so it will have the new notifica" [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [21:52:30] (03CR) 10Dzahn: [C: 04-1] "arr, no. It needs to be serviceops-sre-releng and then "severity" is "releng". a bit odd but works. need to amend again though, sorry" [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [21:52:31] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1075.eqiad.wmnet'] [21:52:39] !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host prometheus4002.ulsfo.wmnet with OS bullseye [21:53:24] (03CR) 10Dzahn: [C: 04-1] "this is why I wanted to do that team rename change first, by the way. 
but gotta do that all at once later once that is finalized" [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [21:55:06] (03PS4) 10Dzahn: gerrit: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) [21:57:33] (03CR) 10CI reject: [V: 04-1] gerrit: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [21:57:58] !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host prometheus6002.drmrs.wmnet with OS bullseye [21:58:03] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: drmrs 1 VM request for prometheus6002 - https://phabricator.wikimedia.org/T333721 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host prometheus6002.drmrs.wmnet with OS bullseye executed with erro... 
[21:58:42] !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus5002.eqsin.wmnet [21:58:43] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [21:58:44] (03CR) 10Dzahn: [C: 04-1] "well, I got caught with my "hack" (parameter 'severity' expects a match for Prometheus::Alert::Severity = Enum['critical', 'info', 'page'," [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [22:00:18] 10SRE, 10vm-requests: Site: ulsfo 1 VM request for prometheus4002 - https://phabricator.wikimedia.org/T333719 (10Peachey88) [22:00:38] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001" [22:00:42] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1132.eqiad.wmnet [22:01:27] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1075.eqiad.wmnet'] [22:01:43] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus5002.eqsin.wmnet - denisse@cumin1001" [22:01:43] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:01:43] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus5002.eqsin.wmnet on all recursors [22:01:46] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus5002.eqsin.wmnet on all recursors [22:03:22] (03PS1) 10Dzahn: gerrit: replace Icinga monitoring with Prometheus, ssh port 29418 [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) [22:03:45] (03CR) 10CI reject: [V: 04-1] gerrit: replace Icinga monitoring with Prometheus, ssh port 29418 [puppet] - 
10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [22:04:06] (03PS2) 10Dzahn: gerrit: replace Icinga monitoring with Prometheus, ssh port 29418 [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) [22:05:37] (03CR) 10Dzahn: "Thinking about it, this should probably also use the new receiver including releng.." [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [22:06:27] (03CR) 10Dzahn: "if it was "admins, gerrit" before it should probably be the same level as https on gerrit being down" [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [22:09:33] (03CR) 10Dzahn: [C: 04-1] "wait for https://gerrit.wikimedia.org/r/c/operations/puppet/+/903796 but feel free to leave other comments regardless" [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [22:09:42] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Jclark-ctr) opened Dell ticket. sent support assist Confirmed: Service Request 165406278 was successfully submitted. [22:12:47] (03PS1) 10Dzahn: microsites: add monitor for os-reports.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904859 (https://phabricator.wikimedia.org/T327976) [22:13:52] (03CR) 10Dzahn: "fyi, we are now monitoring this site as well. 
let us know if you want to receive notifications about it or think it's overkill or it's fin" [puppet] - 10https://gerrit.wikimedia.org/r/904859 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [22:20:35] (03PS1) 10Dzahn: microsites: add monitor for https://15.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904860 (https://phabricator.wikimedia.org/T327976) [22:21:43] (03CR) 10Dzahn: [C: 03+2] "one of the last remaining sites to check on miscweb to close this ticket" [puppet] - 10https://gerrit.wikimedia.org/r/904860 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [22:24:00] (03CR) 10Dzahn: [C: 03+2] microsites: add monitor for os-reports.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904859 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [22:24:57] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on miscweb[2002-2003].codfw.wmnet,miscweb[1002-1003].eqiad.wmnet with reason: maintenance [22:25:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on miscweb[2002-2003].codfw.wmnet,miscweb[1002-1003].eqiad.wmnet with reason: maintenance [22:25:55] (03PS1) 10Cwhite: logstash: grafana_ecs gsub the level field in [puppet] - 10https://gerrit.wikimedia.org/r/904596 [22:29:49] (03CR) 10Cwhite: [C: 03+2] logstash: normalize_level add grafana error level alias [puppet] - 10https://gerrit.wikimedia.org/r/904591 (owner: 10Cwhite) [22:29:51] (03CR) 10Cwhite: [C: 03+2] logstash: grafana_ecs gsub the level field in [puppet] - 10https://gerrit.wikimedia.org/r/904596 (owner: 10Cwhite) [22:41:59] !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host prometheus4002.ulsfo.wmnet with OS bullseye [22:42:05] 10SRE, 10vm-requests: Site: ulsfo 1 VM request for prometheus4002 - https://phabricator.wikimedia.org/T333719 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host prometheus4002.ulsfo.wmnet with OS bullseye 
[22:43:33] !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host prometheus6002.drmrs.wmnet with OS bullseye [22:43:39] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: drmrs 1 VM request for prometheus6002 - https://phabricator.wikimedia.org/T333721 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host prometheus6002.drmrs.wmnet with OS bullseye [22:44:14] (03PS1) 10Andrew Bogott: wmcs-cinder-volume-backup: handle backups that both do and don't have dependencies [puppet] - 10https://gerrit.wikimedia.org/r/904863 [22:44:16] (03PS1) 10Andrew Bogott: cinder backups: don't do toolforge full backups on our busiest day [puppet] - 10https://gerrit.wikimedia.org/r/904864 [22:44:44] (03CR) 10CI reject: [V: 04-1] wmcs-cinder-volume-backup: handle backups that both do and don't have dependencies [puppet] - 10https://gerrit.wikimedia.org/r/904863 (owner: 10Andrew Bogott) [22:45:43] (03PS2) 10Andrew Bogott: wmcs-cinder-volume-backup: handle backups that have and don't have dependencies [puppet] - 10https://gerrit.wikimedia.org/r/904863 [22:45:45] (03PS2) 10Andrew Bogott: cinder backups: don't do toolforge full backups on our busiest day [puppet] - 10https://gerrit.wikimedia.org/r/904864 [22:46:38] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-cinder-volume-backup: handle backups that have and don't have dependencies [puppet] - 10https://gerrit.wikimedia.org/r/904863 (owner: 10Andrew Bogott) [22:46:53] (03CR) 10Andrew Bogott: [C: 03+2] cinder backups: don't do toolforge full backups on our busiest day [puppet] - 10https://gerrit.wikimedia.org/r/904864 (owner: 10Andrew Bogott) [22:48:37] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:49:15] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 
0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:54:08] (03CR) 10Dzahn: [C: 03+2] "confirmed working in https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*miscweb.*%22%7D&g0.tab=1&g0.stacked=0&g0." [puppet] - 10https://gerrit.wikimedia.org/r/904859 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [22:54:34] (03CR) 10Dzahn: [C: 03+2] "confirmed working in https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*miscweb.*%22%7D&g0.tab=1&g0.stacked=0&g0." [puppet] - 10https://gerrit.wikimedia.org/r/904860 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [22:55:48] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus4002.ulsfo.wmnet with reason: host reimage [22:57:22] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus6002.drmrs.wmnet with reason: host reimage [22:58:53] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus4002.ulsfo.wmnet with reason: host reimage [23:01:17] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) >>! In T320955#8744401, @Volans wrote: > @ayounsi we already have all that setup... To add a bit more context, I was out on my mobile and just wanted to post a quic... 
[23:01:24] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus6002.drmrs.wmnet with reason: host reimage [23:01:53] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus5002.eqsin.wmnet - denisse@cumin1001" [23:02:53] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus5002.eqsin.wmnet - denisse@cumin1001" [23:02:54] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus5002.eqsin.wmnet [23:10:33] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host prometheus4002.ulsfo.wmnet with OS bullseye [23:10:38] 10SRE, 10vm-requests: Site: ulsfo 1 VM request for prometheus4002 - https://phabricator.wikimedia.org/T333719 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host prometheus4002.ulsfo.wmnet with OS bullseye completed: - prometheus4002 (**WARN**) - Downtimed on Ic... [23:14:06] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host prometheus6002.drmrs.wmnet with OS bullseye [23:14:11] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: drmrs 1 VM request for prometheus6002 - https://phabricator.wikimedia.org/T333721 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host prometheus6002.drmrs.wmnet with OS bullseye completed: - prome... 
[23:14:31] (03PS1) 10Cwhite: rsyslog: add rsyslog-namespaced fields to syslog_json [puppet] - 10https://gerrit.wikimedia.org/r/904597 (https://phabricator.wikimedia.org/T315500) [23:19:36] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) Sounds good, thanks! > @wiki_willy I'll try to look at it next week, it should be easy to read from the spreadsheet you showed me and exclude those for now. The... [23:21:29] !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host prometheus5002.eqsin.wmnet with OS bullseye [23:21:37] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: eqsin 1 VM request for prometheus5002 - https://phabricator.wikimedia.org/T333720 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host prometheus5002.eqsin.wmnet with OS bullseye [23:22:03] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: drmrs 1 VM request for prometheus6002 - https://phabricator.wikimedia.org/T333721 (10andrea.denisse) 05Open→03Resolved [23:22:21] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904595 (owner: 10TrainBranchBot) [23:22:46] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): Site: ulsfo 1 VM request for prometheus4002 - https://phabricator.wikimedia.org/T333719 (10andrea.denisse) 05Open→03Resolved [23:34:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904598 [23:35:03] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904598 (owner: 10TrainBranchBot) [23:49:03] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
10https://gerrit.wikimedia.org/r/904598 (owner: 10TrainBranchBot) [23:52:22] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus5002.eqsin.wmnet with reason: host reimage [23:52:23] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/904598 (owner: 10TrainBranchBot) [23:52:36] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10colewhite) [23:55:45] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus5002.eqsin.wmnet with reason: host reimage