[00:03:25] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:15] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:15] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[00:12:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:38:42] <wikibugs>	 (PS2) Ssingh: [In case of emergency] depool eqsin for hardware refresh [dns] - https://gerrit.wikimedia.org/r/856664
[00:38:49] <sukhe>	 ^ rebase
[00:40:20] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[00:45:30] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Papaul) a:Cmjohnson→Papaul @BTullis thank you I will take over this task
[00:47:05] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:58] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-logging1005']
[00:51:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[01:01:41] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-logging1005']
[01:02:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:04:21] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[01:04:52] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-logging1005']
[01:06:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (Papaul) @jbond if you have time tomorrow i did get the error below on kafka-logging1004. I checked the upgrade completed with no issue but the cook...
[01:08:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:20:18] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[01:21:27] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-logging1005']
[01:25:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2173']
[01:25:41] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp5032 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[01:26:06] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2173']
[01:34:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[01:34:35] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-logging1005']
[01:37:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[01:37:43] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-logging1005']
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:39:23] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[01:39:42] <wikibugs>	 SRE, serviceops-radar, Patch-For-Review, Performance-Team (Radar), Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (Krinkle)
[01:40:39] <wikibugs>	 SRE, serviceops-radar, Performance-Team (Radar), Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (Krinkle)
[01:40:52] <wikibugs>	 SRE, serviceops-radar, Performance-Team (Radar), Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (Krinkle) Open→Declined Prioritising {T291015} instead.
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:42] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-logging1005']
[01:46:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:55:08] <wikibugs>	 (PS1) Krinkle: build: Update to PHPUnit 9.5 and declare php requirement [mediawiki-config] - https://gerrit.wikimedia.org/r/858441 (https://phabricator.wikimedia.org/T235142)
[01:55:33] <wikibugs>	 (CR) CI reject: [V: -1] build: Update to PHPUnit 9.5 and declare php requirement [mediawiki-config] - https://gerrit.wikimedia.org/r/858441 (https://phabricator.wikimedia.org/T235142) (owner: Krinkle)
[01:56:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-logging1005']
[02:07:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:15:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:17:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:29] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[02:45:20] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-logging1005']
[02:56:13] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH) The elevation google sheet has been updated with 11 of the 16 new cp hosts wired up.  We couldn't wire up the last 5 due to msws only being 24 port (oversight by me in planning t...
[02:57:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (Papaul)
[03:03:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:09:15] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:10:31] <icinga-wm>	 PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 874213 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[03:10:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (Papaul) @Volans i tried to run the reimage cookbook on kafka-logging1005 i am getting the error below ` Traceback (most recent call last):   File "/...
[03:17:37] <icinga-wm>	 PROBLEM - SSH on db1123.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:35:31] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:52:38] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Papaul) @Jclark-ctr I have no node in netbox with the name kafka-jumbo1013 but i do have a node wmf10606 with purchase date 2022-06-07 that is set to offline in netbox . can y...
[03:54:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[03:56:33] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:01:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-jumbo1010.mgmt.eqiad.wmnet with reboot policy FORCED
[04:02:39] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:03:43] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Papaul) @Ottomata @BTullis what HW RAID are we using for those servers ? Thanks
[04:06:13] <icinga-wm>	 PROBLEM - SSH on an-coord1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:08:23] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:08:35] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:18:27] <icinga-wm>	 RECOVERY - SSH on db1123.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:27:13] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1010.mgmt.eqiad.wmnet with reboot policy FORCED
[04:28:50] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010']
[04:28:57] <wikibugs>	 ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (phaultfinder)
[04:45:10] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1010']
[04:48:10] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Papaul)
[04:57:00] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Papaul)
[04:58:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Papaul)
[05:50:49] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:50:55] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:51:01] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:51:03] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:51:19] <icinga-wm>	 PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:51:41] <icinga-wm>	 PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:15:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:16:31] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:20:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:25:25] <wikibugs>	 Puppet, SRE, Data-Services, Infrastructure-Foundations, cloud-services-team (Kanban): clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (taavi)
[06:39:27] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[06:40:19] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:20:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:24:09] <XioNoX>	 !log decom all Equinix SV8 BGP sessions - T321323
[07:24:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:41:41] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] P:pontoon: include firewall rules to allow metricsinfra scraping (1 comment) [puppet] - https://gerrit.wikimedia.org/r/857023 (owner: Majavah)
[07:44:01] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: don't double-define ferm rules for metricsinfra prometheus [puppet] - https://gerrit.wikimedia.org/r/858455
[07:45:51] <wikibugs>	 (CR) Majavah: [C: +1] "oops, sorry about this" [puppet] - https://gerrit.wikimedia.org/r/858455 (owner: Filippo Giunchedi)
[07:47:11] <wikibugs>	 (PS2) Filippo Giunchedi: pontoon: don't double-define ferm rules for metricsinfra prometheus [puppet] - https://gerrit.wikimedia.org/r/858455
[07:47:48] <wikibugs>	 (CR) Filippo Giunchedi: pontoon: don't double-define ferm rules for metricsinfra prometheus (2 comments) [puppet] - https://gerrit.wikimedia.org/r/858455 (owner: Filippo Giunchedi)
[07:48:28] <wikibugs>	 (CR) Majavah: [C: +1] "the provider class is empty now, but I guess that is fine and the class can be useful in the future?" [puppet] - https://gerrit.wikimedia.org/r/858455 (owner: Filippo Giunchedi)
[07:49:39] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] pontoon: don't double-define ferm rules for metricsinfra prometheus (1 comment) [puppet] - https://gerrit.wikimedia.org/r/858455 (owner: Filippo Giunchedi)
[07:49:42] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2 C: +2] pontoon: don't double-define ferm rules for metricsinfra prometheus [puppet] - https://gerrit.wikimedia.org/r/858455 (owner: Filippo Giunchedi)
[07:51:20] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] P:pontoon: include firewall rules to allow metricsinfra scraping (1 comment) [puppet] - https://gerrit.wikimedia.org/r/857023 (owner: Majavah)
[07:51:53] <icinga-wm>	 RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:52:11] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:53:13] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:53:15] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:53:19] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:53:27] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:56:51] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:58:01] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:59:52] <wikibugs>	 (PS1) Slyngshede: Fix typing to allow Python 3.7 support. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/858457
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221118T0800)
[08:03:23] <wikibugs>	 (CR) Slyngshede: "I'm a little unsure as to why the existing Debian build configuration wouldn't work. This patch does build on both Buster and Bullseye." [software/bitu-ldap] - https://gerrit.wikimedia.org/r/858457 (owner: Slyngshede)
[08:07:33] <wikibugs>	 (PS1) Marostegui: control-mariadb-10.4-bullseye: Upgrade to 10.4.27 [software] - https://gerrit.wikimedia.org/r/858458 (https://phabricator.wikimedia.org/T322620)
[08:10:01] <icinga-wm>	 RECOVERY - SSH on an-coord1002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:10:02] <wikibugs>	 (CR) Marostegui: [C: +2] control-mariadb-10.4-bullseye: Upgrade to 10.4.27 [software] - https://gerrit.wikimedia.org/r/858458 (https://phabricator.wikimedia.org/T322620) (owner: Marostegui)
[08:10:35] <wikibugs>	 (Merged) jenkins-bot: control-mariadb-10.4-bullseye: Upgrade to 10.4.27 [software] - https://gerrit.wikimedia.org/r/858458 (https://phabricator.wikimedia.org/T322620) (owner: Marostegui)
[08:12:13] <wikibugs>	 (PS9) David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/856933
[08:28:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1019.eqiad.wmnet
[08:31:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5003.eqsin.wmnet
[08:31:45] <wikibugs>	 (PS1) Elukey: benthos: reduce webrequest-live kafka partitions to read [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981)
[08:32:26] <wikibugs>	 (CR) DCausse: WIP flink image (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: Ottomata)
[08:33:02] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38306/console" [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[08:35:59] <wikibugs>	 (PS2) Elukey: benthos: reduce webrequest-live kafka partitions to read [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981)
[08:37:00] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38307/console" [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[08:37:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1019.eqiad.wmnet
[08:37:12] <XioNoX>	 !log shutdown SV8 port - T321323
[08:37:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:38] <godog>	 win 25
[08:38:41] <godog>	 lose some
[08:40:50] <wikibugs>	 (PS1) Giuseppe Lavagetto: role::kubernetes::wroker: allow scap to pre-pull mediawiki images [puppet] - https://gerrit.wikimedia.org/r/858543 (https://phabricator.wikimedia.org/T323349)
[08:40:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5003.eqsin.wmnet
[08:41:11] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 45102
[08:41:27] <wikibugs>	 (CR) Elukey: Add the pause image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/858345 (https://phabricator.wikimedia.org/T322920) (owner: Elukey)
[08:41:59] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 45102
[08:43:27] <wikibugs>	 (CR) CI reject: [V: -1] role::kubernetes::wroker: allow scap to pre-pull mediawiki images [puppet] - https://gerrit.wikimedia.org/r/858543 (https://phabricator.wikimedia.org/T323349) (owner: Giuseppe Lavagetto)
[08:46:03] <moritzm>	 !log failover ganeti master in eqsin to ganeti5003
[08:46:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:48] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (Volans) Apart the fact that the host is in planned state in netbox and hence `--new` is required, the problem is that the DNS record is wrong in Ne...
[08:51:58] <wikibugs>	 Puppet, Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (fgiunchedi) I can definitely relate with the long (and stressful!) cycles of Puppet patches you mention @bking and that one of my [[ https://wikitech.wikimedia.org/wiki/Pu...
[08:52:07] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti5001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[08:58:58] <wikibugs>	 (CR) David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/856933 (owner: David Caro)
[09:01:18] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM, see inline for a nit" [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[09:06:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1019.eqiad.wmnet to cluster eqiad and group D
[09:08:13] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1019.eqiad.wmnet to cluster eqiad and group D
[09:08:37] <wikibugs>	 (PS3) Elukey: benthos: reduce webrequest-live kafka partitions to read [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981)
[09:08:39] <wikibugs>	 (CR) Elukey: benthos: reduce webrequest-live kafka partitions to read (1 comment) [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[09:13:32] <wikibugs>	 (PS1) JMeybohm: aux-k8s: Remove obsolete hiera keys [puppet] - https://gerrit.wikimedia.org/r/858545
[09:14:02] <wikibugs>	 (CR) JMeybohm: "Feel free to merge when you're happy with this" [puppet] - https://gerrit.wikimedia.org/r/858545 (owner: JMeybohm)
[09:16:49] <elukey>	 !log push the 'k8s_116' tag for docker-registry.discovery.wmnet/pause - T322920
[09:16:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:56] <stashbot>	 T322920: Import pause container image >= 3.5 (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T322920
[09:17:23] <wikibugs>	 (CR) JMeybohm: [C: +1] "LGTM" [docker-images/production-images] - https://gerrit.wikimedia.org/r/858345 (https://phabricator.wikimedia.org/T322920) (owner: Elukey)
[09:21:10] <wikibugs>	 (PS1) Elukey: k8s: pin the pause container image to the k8s_116 tag on staging [puppet] - https://gerrit.wikimedia.org/r/858546 (https://phabricator.wikimedia.org/T322920)
[09:21:52] <godog>	 !log nuke MediaWiki.objectcache.*_11ed_* - T323357
[09:21:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:15] <wikibugs>	 (PS2) Elukey: k8s: pin the pause container image to the k8s_116 tag on staging [puppet] - https://gerrit.wikimedia.org/r/858546 (https://phabricator.wikimedia.org/T322920)
[09:22:32] <stashbot>	 T323357: Spam graphite metrics from MediaWiki.objectcache - https://phabricator.wikimedia.org/T323357
[09:22:44] <wikibugs>	 (CR) FNegri: [C: +1] ceph.roll_restart_*daemons: allow ignoring current health issues (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/856933 (owner: David Caro)
[09:25:22] <wikibugs>	 (PS3) Vgutierrez: cache: Remove wikiba.se caching rules [puppet] - https://gerrit.wikimedia.org/r/858408
[09:26:09] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (JMeybohm)
[09:26:35] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:30:23] <wikibugs>	 (PS1) Bartosz Dziewoński: Don't run OutputPageBeforeHTML for the talkpageheader [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858319 (https://phabricator.wikimedia.org/T316175)
[09:32:35] <MatmaRex>	 brennen: thcipriani: i'm hoping to get this patch backported today: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/858319 , is that possible? (i guess you're both asleep now, can anyone else help?)
[09:32:41] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38308/console" [puppet] - https://gerrit.wikimedia.org/r/858546 (https://phabricator.wikimedia.org/T322920) (owner: Elukey)
[09:34:21] <wikibugs>	 (CR) Elukey: k8s: pin the pause container image to the k8s_116 tag on staging [puppet] - https://gerrit.wikimedia.org/r/858546 (https://phabricator.wikimedia.org/T322920) (owner: Elukey)
[09:37:33] <moritzm>	 !log installing ncurses security updates
[09:37:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:21] <MatmaRex>	 Amir1: _joe_: are you around maybe? i'm trying to get this patch backported: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/858319
[09:41:30] <_joe_>	 MatmaRex: and I guess you're not searching for my opinion on that patch, right? :P
[09:41:33] <_joe_>	 I'm here
[09:42:07] <wikibugs>	 SRE, Traffic, Upstream: Wikipedia on flow with no http request, still responds with a Bad Request 400 - https://phabricator.wikimedia.org/T323263 (Vgutierrez) Stalled→In progress We were missing one config option in our HAProxy setup: `option http-ignore-probes`, after enabling it, HAProxy be...
[09:42:22] <MatmaRex>	 _joe_: heh, what is your opinion?
[09:43:36] <MatmaRex>	 i think it's a very simple low-risk change and i think the issue it fixes is bad enough to deploy it today (doubled-up and non-functional buttons on e.g. https://fr.m.wikipedia.org/wiki/Discussion_Wikipédia:Accueil_principal)
[09:44:22] <_joe_>	 MatmaRex: oh my, sure, go on let's backport
[09:44:43] <_joe_>	 UI bugs of this nature should always be considered emergencies IMHO
[09:44:49] <MatmaRex>	 i don't have access, so i need someone to click the buttons / run the commands
[09:45:01] <_joe_>	 ah ok
[09:45:08] <_joe_>	 not even to backport the patch?
[09:45:20] <MatmaRex>	 nope
[09:45:21] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Don't run OutputPageBeforeHTML for the talkpageheader [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858319 (https://phabricator.wikimedia.org/T316175) (owner: Bartosz Dziewoński)
[09:45:38] <_joe_>	 ok
[09:45:43] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] benthos: reduce webrequest-live kafka partitions to read [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[09:45:49] <_joe_>	 I'll test scap backport with this
[09:45:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:45:56] <wikibugs>	 (PS1) Vgutierrez: cache::haproxy: Silently ignore probes [puppet] - https://gerrit.wikimedia.org/r/858550 (https://phabricator.wikimedia.org/T323263)
[09:46:09] <wikibugs>	 (CR) Elukey: [C: +2] benthos: reduce webrequest-live kafka partitions to read [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[09:46:16] <MatmaRex>	 thank you
[09:47:08] <wikibugs>	 (PS2) Vgutierrez: cache::haproxy: Silently ignore probes [puppet] - https://gerrit.wikimedia.org/r/858550 (https://phabricator.wikimedia.org/T323263)
[09:47:37] <wikibugs>	 (CR) Vgutierrez: [C: +2] cache: Remove wikiba.se caching rules [puppet] - https://gerrit.wikimedia.org/r/858408 (owner: Vgutierrez)
[09:47:43] <_joe_>	 MatmaRex: please stay around so that when the change is merged we can test a few pages
[09:48:01] <MatmaRex>	 yeah
[09:49:17] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync-mgmt - filippo@cumin1001"
[09:50:20] <wikibugs>	 (Merged) jenkins-bot: Don't run OutputPageBeforeHTML for the talkpageheader [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858319 (https://phabricator.wikimedia.org/T316175) (owner: Bartosz Dziewoński)
[09:50:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:51:12] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync-mgmt - filippo@cumin1001"
[09:51:43] <_joe_>	 ok lemme try to deploy this
[09:52:16] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by oblivian@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858319 (https://phabricator.wikimedia.org/T316175) (owner: Bartosz Dziewoński)
[09:52:28] <logmsgbot>	 !log oblivian@deploy1002 Started scap: Backport for [[gerrit:858319|Don't run OutputPageBeforeHTML for the talkpageheader (T316175)]]
[09:52:37] <stashbot>	 T316175: Make the mobile Add Topic button easier for people to access  - https://phabricator.wikimedia.org/T316175
[09:52:50] <logmsgbot>	 !log oblivian@deploy1002 oblivian and matmarex: Backport for [[gerrit:858319|Don't run OutputPageBeforeHTML for the talkpageheader (T316175)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[09:52:54] <_joe_>	 MatmaRex: can you test the patch on mwdebug?
[09:53:21] <_joe_>	 it seems to work in my tests
[09:53:21] <MatmaRex>	 _joe_: yeah. looks good to me
[09:53:34] <_joe_>	 ofc there will be a lot of cached pages that will not be fixed
[09:53:45] <wikibugs>	 (CR) Vgutierrez: [C: +2] cache::haproxy: Silently ignore probes [puppet] - https://gerrit.wikimedia.org/r/858550 (https://phabricator.wikimedia.org/T323263) (owner: Vgutierrez)
[09:53:52] <_joe_>	 I would say we'll let the community fix it
[09:54:32] <MatmaRex>	 there shouldn't be too many, this was only broken for a couple of hours
[09:54:56] <MatmaRex>	 well, less than a day, at least :)
[09:57:58] <logmsgbot>	 !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:858319|Don't run OutputPageBeforeHTML for the talkpageheader (T316175)]] (duration: 05m 29s)
[09:58:11] <stashbot>	 T316175: Make the mobile Add Topic button easier for people to access  - https://phabricator.wikimedia.org/T316175
[09:58:28] <_joe_>	 MatmaRex: done :)
[09:58:29] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: openstack: nova: compute: cleanup unused code [puppet] - https://gerrit.wikimedia.org/r/858327
[09:58:31] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184)
[09:58:37] <MatmaRex>	 thank you _joe_!
[09:59:53] <MatmaRex>	 _joe_: related question - can i update https://wikitech.wikimedia.org/wiki/Deployments/Emergencies to suggest messaging people on-call when tyler and the train owner aren't available?
[10:00:10] <_joe_>	 MatmaRex: not really, IMHO
[10:00:23] <_joe_>	 say there was an outage, I would've had to abandon the deployment
[10:00:35] <MatmaRex>	 heh. well, i just did that. but alright :D
[10:00:44] <MatmaRex>	 thanks for the help
[10:00:46] <_joe_>	 I know :)
[10:01:02] <_joe_>	 I'm just saying it shouldn't be a general rule, but I'll leave that to the managers
[10:01:24] <MatmaRex>	 fair
[10:03:44] <wikibugs>	 SRE, Traffic, Patch-For-Review, Upstream: Wikipedia on flow with no http request, still responds with a Bad Request 400 - https://phabricator.wikimedia.org/T323263 (Vgutierrez) In progress→Resolved a:Vgutierrez fix has been merged and it's being deployed, it should be available fleet...
[10:04:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5001.eqsin.wmnet
[10:06:05] <wikibugs>	 (CR) David Caro: [C: +1] openstack: nova: compute: cleanup unused code [puppet] - https://gerrit.wikimedia.org/r/858327 (owner: Arturo Borrero Gonzalez)
[10:06:19] <wikibugs>	 (PS1) Elukey: benthos: use env() in webrequest_live's bloblang config [puppet] - https://gerrit.wikimedia.org/r/858551 (https://phabricator.wikimedia.org/T314981)
[10:07:14] <wikibugs>	 (CR) Ayounsi: [C: +1] "ship it! (carefully) :)" [homer/public] - https://gerrit.wikimedia.org/r/857598 (https://phabricator.wikimedia.org/T312635) (owner: Cathal Mooney)
[10:07:20] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] benthos: use env() in webrequest_live's bloblang config [puppet] - https://gerrit.wikimedia.org/r/858551 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[10:08:09] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] openstack: nova: compute: cleanup unused code [puppet] - https://gerrit.wikimedia.org/r/858327 (owner: Arturo Borrero Gonzalez)
[10:08:14] <wikibugs>	 (CR) Ayounsi: [C: +1] Add OSPF automation template for EVPN switches (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) (owner: Cathal Mooney)
[10:09:25] <wikibugs>	 (PS5) Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184)
[10:09:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:10:02] <wikibugs>	 (CR) Ayounsi: [C: +1] Add function to expose required device VRFs to Homer templates (1 comment) [software/homer/deploy] - https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) (owner: Cathal Mooney)
[10:10:49] <wikibugs>	 (CR) Elukey: [C: +2] benthos: use env() in webrequest_live's bloblang config [puppet] - https://gerrit.wikimedia.org/r/858551 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[10:13:02] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff)
[10:13:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5001.eqsin.wmnet
[10:13:56] <moritzm>	 !log installing sysstat security updates
[10:13:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:16:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:16:11] <wikibugs>	 (PS1) Elukey: benthos: discard the msg in webrequest_live if ip is unset [puppet] - https://gerrit.wikimedia.org/r/858552 (https://phabricator.wikimedia.org/T314981)
[10:16:33] <wikibugs>	 (CR) JMeybohm: P:pontoon: include firewall rules to allow metricsinfra scraping (1 comment) [puppet] - https://gerrit.wikimedia.org/r/857023 (owner: Majavah)
[10:17:52] <wikibugs>	 (PS2) Elukey: benthos: discard the msg in webrequest_live if ip is unset [puppet] - https://gerrit.wikimedia.org/r/858552 (https://phabricator.wikimedia.org/T314981)
[10:18:53] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[10:18:55] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[10:19:57] <wikibugs>	 (CR) David Caro: "LGTM, can you run a pcc on it?" [puppet] - https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[10:22:40] <wikibugs>	 (PS6) Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184)
[10:25:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] benthos: discard the msg in webrequest_live if ip is unset [puppet] - https://gerrit.wikimedia.org/r/858552 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[10:26:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:27:01] <wikibugs>	 (PS7) Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184)
[10:28:01] <wikibugs>	 (PS8) Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184)
[10:29:02] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[10:31:00] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[10:31:03] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:31:38] <wikibugs>	 (CR) Elukey: [C: +2] benthos: discard the msg in webrequest_live if ip is unset [puppet] - https://gerrit.wikimedia.org/r/858552 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[10:33:57] <wikibugs>	 SRE, Traffic, serviceops, Patch-For-Review: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) ping?
[10:34:45] <moritzm>	 !log draining ganeti1012 in preparation of server move to a new rack T308339
[10:34:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:56] <stashbot>	 T308339: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339
[10:35:05] <wikibugs>	 (CR) Vgutierrez: [C: +1] prometheus: Remove old ats config export job [puppet] - https://gerrit.wikimedia.org/r/858418 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[10:41:12] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:42:12] <wikibugs>	 (CR) Vgutierrez: [C: +1] lvs4009: commission new LVS host (ulsfo hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/858336 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh)
[10:42:20] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:47:43] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Stop spamming SAL with helmfile on scap deployments - https://phabricator.wikimedia.org/T323296 (Clement_Goubert) Merge request on `scap` to pass the `SUPPRESS_SAL` variable  https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/26
[10:49:08] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (MoritzMuehlenhoff)
[10:50:47] <wikibugs>	 (PS1) Jbond: P:pki: add new type calidation for ca names [puppet] - https://gerrit.wikimedia.org/r/858556
[10:54:38] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38316/console" [puppet] - https://gerrit.wikimedia.org/r/858556 (owner: Jbond)
[11:06:12] <wikibugs>	 (PS9) Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184)
[11:10:08] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, nice!" [puppet] - https://gerrit.wikimedia.org/r/858556 (owner: Jbond)
[11:12:19] <wikibugs>	 (PS10) Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184)
[11:14:58] <wikibugs>	 SRE, Traffic: Improve handling/logging of HAproxy emergency log messages - https://phabricator.wikimedia.org/T306236 (Vgutierrez) In progress→Resolved a:Vgutierrez
[11:15:33] <wikibugs>	 (Abandoned) Alexandros Kosiaris: arclamp: Add role contact information [puppet] - https://gerrit.wikimedia.org/r/854985 (owner: Alexandros Kosiaris)
[11:16:49] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/858328/38318/" [puppet] - https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[11:18:25] <wikibugs>	 SRE, Traffic: Rename role::cache::(text|upload)_haproxy to role::cache::(text|upload) - https://phabricator.wikimedia.org/T323365 (Vgutierrez)
[11:19:01] <wikibugs>	 SRE, Traffic: Rename role::cache::(text|upload)_haproxy to role::cache::(text|upload) - https://phabricator.wikimedia.org/T323365 (Vgutierrez) p:Triage→Medium
[11:20:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (BTullis) >>! In T318659#8403396, @RobH wrote: > @btullis: I've gone ahead and requested quotation to get replacement...
[11:21:01] <wikibugs>	 (PS11) Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184)
[11:21:48] <Amir1>	 MatmaRex: thanks for the fix, let me know if you still require my services
[11:23:04] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/858328/38319/" [puppet] - https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[11:25:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (jbond) >>! In T313960#8404684, @Papaul wrote: > @jbond if you have time tomorrow i did get the error below on kafka-logging1004. I checked the upgr...
[11:27:23] <claime>	 !log Starting decommission of apple-search service - T316296
[11:27:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:43] <stashbot>	 T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296
[11:28:07] <wikibugs>	 (PS2) Clément Goubert: apple-search: remove discovery record [dns] - https://gerrit.wikimedia.org/r/858207 (https://phabricator.wikimedia.org/T316296)
[11:28:17] <wikibugs>	 (PS10) Clément Goubert: apple-search: Remove DNS records [dns] - https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296)
[11:28:55] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] apple-search: remove discovery record [dns] - https://gerrit.wikimedia.org/r/858207 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert)
[11:29:20] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] apple-search: Remove DNS records [dns] - https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert)
[11:31:30] <wikibugs>	 (CR) Clément Goubert: [C: +2] apple-search: remove discovery record [dns] - https://gerrit.wikimedia.org/r/858207 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert)
[11:31:42] <moritzm>	 !log installing Linux 4.19.260 on Buster systems
[11:31:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:22] <wikibugs>	 (CR) David Caro: [C: +1] cloudvirts: introduce modern NIC setup and use it by default (2 comments) [puppet] - https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[11:34:24] <claime>	 !log Running authdns-update - T316296
[11:34:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:31] <stashbot>	 T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296
[11:36:04] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:36:37] <_joe_>	 Amir1: ^^ lists.w.o down again?
[11:36:43] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] cloudvirts: introduce modern NIC setup and use it by default [puppet] - https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[11:37:08] <Amir1>	 I'll check
[11:37:12] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:37:37] <wikibugs>	 (CR) Giuseppe Lavagetto: "recheck" [puppet] - https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert)
[11:37:50] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:38:26] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:38:31] <Amir1>	 I haven't done anything yet
[11:38:36] <Amir1>	 probably another scraper 
[11:38:38] <wikibugs>	 (CR) Clément Goubert: [C: +2] apple-search: Switch lvs state to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert)
[11:38:40] <Amir1>	 let me check logs
[11:38:45] <wikibugs>	 (PS1) Jbond: upgrade-firmware: small fix to ensure files get saved in the correct path [cookbooks] - https://gerrit.wikimedia.org/r/858559
[11:38:57] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +1] apple-search: Switch lvs state to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert)
[11:39:06] <wikibugs>	 (PS2) Jbond: upgrade-firmware: small fix to ensure files get saved in the correct path [cookbooks] - https://gerrit.wikimedia.org/r/858559
[11:41:28] <claime>	 !log Switching apple-search to state:lvs_setup - T316296
[11:41:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:39] <stashbot>	 T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296
[11:42:58] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: wmcs: proxy: don't fail if killing the proxy fails [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/858560
[11:44:12] <icinga-wm>	 PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:47:31] <wikibugs>	 (CR) David Caro: wmcs: proxy: don't fail if killing the proxy fails (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/858560 (owner: Arturo Borrero Gonzalez)
[11:49:43] <wikibugs>	 (PS3) Jbond: upgrade-firmware: small fix to ensure files get saved in the correct path [cookbooks] - https://gerrit.wikimedia.org/r/858559
[11:50:07] <wikibugs>	 (PS1) Muehlenhoff: Retire conf-lvm partman recipe [puppet] - https://gerrit.wikimedia.org/r/858562 (https://phabricator.wikimedia.org/T156955)
[11:50:57] <wikibugs>	 (PS4) Jbond: upgrade-firmware: small fix to ensure files get saved in the correct path [cookbooks] - https://gerrit.wikimedia.org/r/858559
[11:51:03] <wikibugs>	 (CR) Arturo Borrero Gonzalez: wmcs: proxy: don't fail if killing the proxy fails (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/858560 (owner: Arturo Borrero Gonzalez)
[11:51:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure: Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (BTullis) >>! In T318696#8358171, @Ottomata wrote: > @BTullis can/should we just remove those nodes as Hadoop workers and reimage them as DS...
[11:53:51] <claime>	 !log Switching apple-search to state:service_setup - T316296
[11:53:54] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] apple-search: Remove service from lb and backend [puppet] - https://gerrit.wikimedia.org/r/857691 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert)
[11:53:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:01] <stashbot>	 T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296
[11:54:12] <wikibugs>	 (PS3) Clément Goubert: apple-search: Remove service from lb and backend [puppet] - https://gerrit.wikimedia.org/r/857691 (https://phabricator.wikimedia.org/T316296)
[11:57:40] <wikibugs>	 (CR) Jbond: [C: +2] upgrade-firmware: small fix to ensure files get saved in the correct path [cookbooks] - https://gerrit.wikimedia.org/r/858559 (owner: Jbond)
[12:01:33] <logmsgbot>	 !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on D{lvs2010.codfw.wmnet,lvs1020.eqiad.wmnet} and A:lvs
[12:01:57] <moritzm>	 !log installing libgoogle-gson-java security updates
[12:02:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:32] <wikibugs>	 (PS2) Vgutierrez: role::cache: Link/copy (text|upload)_haproxy to base roles [puppet] - https://gerrit.wikimedia.org/r/849180 (https://phabricator.wikimedia.org/T323365) (owner: BBlack)
[12:02:36] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2001-dev.codfw.wmnet with OS bullseye
[12:02:52] <_joe_>	 claime: uhm the cookbook is waiting for the IPVS_diffs_check to recover
[12:03:07] <_joe_>	 which it won't unless we run ipvsadm
[12:03:08] <wikibugs>	 (CR) CI reject: [V: -1] role::cache: Link/copy (text|upload)_haproxy to base roles [puppet] - https://gerrit.wikimedia.org/r/849180 (https://phabricator.wikimedia.org/T323365) (owner: BBlack)
[12:03:15] <claime>	 _joe_: ack
[12:03:36] <claime>	 _joe_: But we're supposed to do that after restarting pybal on the primary
[12:04:34] <claime>	 Or do I just ipvsadm --delete-service --tcp-service addr:port on lvs2010.codfw.wmnet
[12:04:48] <wikibugs>	 (PS3) Vgutierrez: role::cache: Link/copy (text|upload)_haproxy to base roles [puppet] - https://gerrit.wikimedia.org/r/849180 (https://phabricator.wikimedia.org/T323365) (owner: BBlack)
[12:04:53] <_joe_>	 that seems a very modern syntax, but yes
[12:05:02] <claime>	 cgoubert@lvs2010:~$ sudo ipvsadm --delete-service --tcp-service 10.2.1.68:4013
[12:05:14] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.68:4013]) https://wikitech.wikimedia.org/wiki/PyBal
[12:05:23] <claime>	 done
[12:05:27] <claime>	 Sorry for alert noise
[12:05:30] <vgutierrez>	 , !log? :)
[12:05:38] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.68:4013]) https://wikitech.wikimedia.org/wiki/PyBal
[12:05:47] <_joe_>	 vgutierrez: see above, pybal restart
[12:05:53] <_joe_>	 ah you mean claime
[12:05:56] <_joe_>	 yeah :)
[12:06:01] <vgutierrez>	 yeah
[12:06:05] <claime>	 !log cgoubert@lvs2010:~$ sudo ipvsadm --delete-service --tcp-service 10.2.1.68:4013
[12:06:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:09] <vgutierrez>	 thx <3
[12:06:13] <claime>	 I was doing it :')
[12:06:27] <vgutierrez>	 E_WOULDBLOCK lol
[12:06:29] <claime>	 Just fat fingering
[12:06:54] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.68:4013]) https://wikitech.wikimedia.org/wiki/PyBal
[12:07:42] <claime>	 _joe_: ==> Failed to downtime hosts: Not all services are recovered: lvs2010:PyBal IPVS diff che
[12:07:45] <claime>	 go ?
[12:07:47] <wikibugs>	 (CR) David Caro: wmcs: proxy: don't fail if killing the proxy fails (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/858560 (owner: Arturo Borrero Gonzalez)
[12:08:06] <claime>	 thx
[12:08:10] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38323/console" [puppet] - https://gerrit.wikimedia.org/r/849180 (https://phabricator.wikimedia.org/T323365) (owner: BBlack)
[12:08:24] <_joe_>	 claime:  yeah you can do the same (with the correct IP) on 1020
[12:08:33] <claime>	 yep
[12:08:53] <claime>	 !log cgoubert@lvs1020:~$ sudo ipvsadm --delete-service --tcp-service 10.2.2.68:4013
[12:08:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:58] <claime>	 done
[12:09:18] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:09:26] <logmsgbot>	 !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on D{lvs2010.codfw.wmnet,lvs1020.eqiad.wmnet} and A:lvs
[12:09:29] <claime>	 _joe_: So that ipvsadm has to be run on all lvs afterwards right ?
[12:09:35] <_joe_>	 yes
[12:09:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:10:01] <_joe_>	 ok, let me run on the primaries I guess?
[12:10:02] <wikibugs>	 (PS4) Vgutierrez: role::cache: Link/copy (text|upload)_haproxy to base roles [puppet] - https://gerrit.wikimedia.org/r/849180 (https://phabricator.wikimedia.org/T323365) (owner: BBlack)
[12:10:30] <logmsgbot>	 !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on D{lvs2009.codfw.wmnet,lvs1019.eqiad.wmnet} and A:lvs
[12:11:30] <claime>	 same on my side, run the ipvsadm ?
[12:11:35] <_joe_>	 yes
[12:11:36] <_joe_>	 lvs2009
[12:11:46] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: wmcs: proxy: only mark the proxy as started if it didn't fail [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/858560
[12:12:04] <claime>	 !log cgoubert@lvs2009:~$ sudo ipvsadm --delete-service --tcp-service 10.2.1.68:4013
[12:12:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:10] <claime>	 done
[12:12:43] <claime>	 ready for lvs1019
[12:12:50] <_joe_>	 yeah, sigh icinga
[12:14:23] <wikibugs>	 (CR) Vgutierrez: "BBlack I've addressed your prometheus::ops concerns and the PCC still reports a NOOP, we just need to be careful and remove duplicated yam" [puppet] - https://gerrit.wikimedia.org/r/849180 (https://phabricator.wikimedia.org/T323365) (owner: BBlack)
[12:14:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:15:38] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:16:06] <wikibugs>	 (PS1) Jbond: admin: add all data center ops users to datacenter-ops group [puppet] - https://gerrit.wikimedia.org/r/858567
[12:16:08] <wikibugs>	 (PS1) Jbond: P:spicerack: ensure firmware directory is writable by dc-ops [puppet] - https://gerrit.wikimedia.org/r/858568
[12:17:00] <claime>	 !log cgoubert@lvs1019:~$ sudo ipvsadm --delete-service --tcp-service 10.2.2.68:4013
[12:17:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:38] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:17:44] <logmsgbot>	 !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on D{lvs2009.codfw.wmnet,lvs1019.eqiad.wmnet} and A:lvs
[12:18:57] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2001-dev.codfw.wmnet with reason: host reimage
[12:21:40] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2001-dev.codfw.wmnet with reason: host reimage
[12:22:23] <claime>	 !log apple-search removed from backends - T316296
[12:22:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:38] <stashbot>	 T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296
[12:22:57] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[12:23:11] <dcausse>	 looking ^
[12:24:51] <wikibugs>	 (PS1) Jbond: sre.hardware.upgrade-firmware: ensure folderes are group writable [cookbooks] - https://gerrit.wikimedia.org/r/858569
[12:25:46] <wikibugs>	 (CR) Clément Goubert: [C: +2] apple-search: Remove DNS records [dns] - https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert)
[12:26:16] <claime>	 !log Clean up apple-search DNS - T316296
[12:26:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:31] <claime>	 !log cgoubert@authdns1001:~$ sudo -i authdns-update
[12:26:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:54] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/858567 (owner: Jbond)
[12:27:36] <wikibugs>	 (CR) Jbond: [C: +2] admin: add all data center ops users to datacenter-ops group [puppet] - https://gerrit.wikimedia.org/r/858567 (owner: Jbond)
[12:28:03] <wikibugs>	 (PS2) Jbond: admin: add all data center ops users to datacenter-ops group [puppet] - https://gerrit.wikimedia.org/r/858567
[12:28:05] <icinga-wm>	 RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[12:28:46] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] apple-search: Remove service from service::catalog [puppet] - https://gerrit.wikimedia.org/r/857706 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert)
[12:28:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:28:59] <wikibugs>	 (PS3) Clément Goubert: apple-search: Remove service from service::catalog [puppet] - https://gerrit.wikimedia.org/r/857706 (https://phabricator.wikimedia.org/T316296)
[12:29:44] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38324/console" [puppet] - https://gerrit.wikimedia.org/r/858568 (owner: Jbond)
[12:30:01] <icinga-wm>	 RECOVERY - puppet last run on sretest1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:30:39] <wikibugs>	 (CR) Esanders: Don't run OutputPageBeforeHTML for the talkpageheader (1 comment) [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858319 (https://phabricator.wikimedia.org/T316175) (owner: Bartosz Dziewoński)
[12:30:53] <claime>	 !log Removing apple-search from service::catalog  - T316296
[12:30:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:08] <stashbot>	 T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296
[12:32:20] <wikibugs>	 (PS2) Jbond: P:spicerack: ensure firmware directory is writable by dc-ops [puppet] - https://gerrit.wikimedia.org/r/858568
[12:32:39] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] P:spicerack: ensure firmware directory is writable by dc-ops [puppet] - https://gerrit.wikimedia.org/r/858568 (owner: Jbond)
[12:33:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:37:02] <wikibugs>	 (CR) Clément Goubert: [C: +2] apple-search: Remove apple-search from conftool [puppet] - https://gerrit.wikimedia.org/r/858286 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert)
[12:37:13] <claime>	 !log Removing apple-search from conftool  - T316296
[12:37:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:22] <stashbot>	 T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296
[12:38:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:40:00] <wikibugs>	 SRE, Discovery-Search, serviceops, serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (Clement_Goubert) apple-search removed from DNS, LVS, service::catalog and conftool. Starting removal from wikikube and deployment-charts.
[12:41:48] <claime>	 !log Starting apple-search removal from wikikube - T316296
[12:41:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:29] <wikibugs>	 (CR) Jbond: [C: +2] sre.hardware.upgrade-firmware: ensure folderes are group writable [cookbooks] - https://gerrit.wikimedia.org/r/858569 (owner: Jbond)
[12:43:10] <claime>	 !log cgoubert@deploy1002:/apple-search$ helmfile -e staging -i destroy - T316296
[12:43:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:25] <stashbot>	 T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296
[12:45:13] <claime>	 !log cgoubert@deploy1002:/apple-search$ helmfile -e eqiad -i destroy - T316296
[12:45:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:58] <claime>	 !log cgoubert@deploy1002:/apple-search$ helmfile -e codfw -i destroy - T316296
[12:46:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:07] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2001-dev.codfw.wmnet with OS bullseye
[12:46:40] <wikibugs>	 (Merged) jenkins-bot: sre.hardware.upgrade-firmware: ensure folderes are group writable [cookbooks] - https://gerrit.wikimedia.org/r/858569 (owner: Jbond)
[12:49:35] <wikibugs>	 (CR) Clément Goubert: "This change is ready for review." [labs/private] - https://gerrit.wikimedia.org/r/858573 (owner: Clément Goubert)
[12:51:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:56:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:57:49] <wikibugs>	 SRE, serviceops, Patch-For-Review: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929 (akosiaris) I am not sure what type of coordination is needed from #SRE either. Maybe just making sure that 1 or 2 SREs are around when the patch is...
[12:58:09] <wikibugs>	 (CR) Clément Goubert: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/858575 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert)
[13:00:29] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudvirt2002-dev: move to the modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/858579
[13:02:19] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/858579/38325/" [puppet] - https://gerrit.wikimedia.org/r/858579 (owner: Arturo Borrero Gonzalez)
[13:04:56] <wikibugs>	 (CR) Ladsgroup: [C: +1] P:mediawiki::maintenance: CampaignEvents periodic (1 comment) [puppet] - https://gerrit.wikimedia.org/r/858346 (https://phabricator.wikimedia.org/T320403) (owner: Clément Goubert)
[13:06:34] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[13:07:42] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[13:08:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[13:08:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[13:08:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[13:08:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40124 and previous config saved to /var/cache/conftool/dbconfig/20221118-130829-ladsgroup.json
[13:08:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[13:08:47] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[13:08:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:09:49] <wikibugs>	 (PS2) Clément Goubert: P:mediawiki::maintenance: CampaignEvents periodic [puppet] - https://gerrit.wikimedia.org/r/858346 (https://phabricator.wikimedia.org/T320403)
[13:10:35] <wikibugs>	 (CR) Clément Goubert: P:mediawiki::maintenance: CampaignEvents periodic (1 comment) [puppet] - https://gerrit.wikimedia.org/r/858346 (https://phabricator.wikimedia.org/T320403) (owner: Clément Goubert)
[13:13:06] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38326/console" [puppet] - https://gerrit.wikimedia.org/r/858346 (https://phabricator.wikimedia.org/T320403) (owner: Clément Goubert)
[13:13:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:14:35] <wikibugs>	 (CR) David Caro: [C: +1] "lgtm, might be interesting at some point to have that info though." [puppet] - https://gerrit.wikimedia.org/r/855970 (https://phabricator.wikimedia.org/T271096) (owner: Arturo Borrero Gonzalez)
[13:14:39] <wikibugs>	 (PS1) Jbond: nodegen: skip new files when processing  auto host selector [software/puppet-compiler] - https://gerrit.wikimedia.org/r/858581 (https://phabricator.wikimedia.org/T323282)
[13:14:41] <wikibugs>	 (PS1) Jbond: html: add additional retrun code descriptions [software/puppet-compiler] - https://gerrit.wikimedia.org/r/858582
[13:14:56] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2002-dev.codfw.wmnet with OS bullseye
[13:15:05] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Very nice! LGTM" [puppet] - https://gerrit.wikimedia.org/r/858562 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[13:15:09] <wikibugs>	 (CR) Vivian Rook: [C: +1] cloudvirt2002-dev: move to the modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/858579 (owner: Arturo Borrero Gonzalez)
[13:15:27] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] cloudvirt2002-dev: move to the modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/858579 (owner: Arturo Borrero Gonzalez)
[13:16:10] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.16.74:9042 on aqs1017 is OK: TCP OK - 0.000 second response time on 10.64.16.74 port 9042 https://phabricator.wikimedia.org/T93886
[13:16:35] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: prometheus: drop cloudvirt ceph metrics generator [puppet] - https://gerrit.wikimedia.org/r/855970 (https://phabricator.wikimedia.org/T271096)
[13:19:35] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] prometheus: drop cloudvirt ceph metrics generator [puppet] - https://gerrit.wikimedia.org/r/855970 (https://phabricator.wikimedia.org/T271096) (owner: Arturo Borrero Gonzalez)
[13:21:01] <wikibugs>	 (PS2) Jbond: nodegen: skip new files when processing  auto host selector [software/puppet-compiler] - https://gerrit.wikimedia.org/r/858581 (https://phabricator.wikimedia.org/T323282)
[13:21:03] <wikibugs>	 (PS2) Jbond: html: add additional retrun code descriptions [software/puppet-compiler] - https://gerrit.wikimedia.org/r/858582
[13:21:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[13:21:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[13:21:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T318605)', diff saved to https://phabricator.wikimedia.org/P40125 and previous config saved to /var/cache/conftool/dbconfig/20221118-132141-ladsgroup.json
[13:22:04] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[13:27:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[13:27:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[13:29:56] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Retire conf-lvm partman recipe [puppet] - https://gerrit.wikimedia.org/r/858562 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[13:31:38] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage
[13:32:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40126 and previous config saved to /var/cache/conftool/dbconfig/20221118-133203-ladsgroup.json
[13:32:11] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[13:35:18] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage
[13:37:31] <wikibugs>	 (PS1) Muehlenhoff: Retire two k8s Partman recipes [puppet] - https://gerrit.wikimedia.org/r/858587 (https://phabricator.wikimedia.org/T156955)
[13:37:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:42:24] <wikibugs>	 (PS1) Muehlenhoff: Remove dumpsdata100XH750.cfg partman recipe [puppet] - https://gerrit.wikimedia.org/r/858589
[13:42:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:43:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318605)', diff saved to https://phabricator.wikimedia.org/P40127 and previous config saved to /var/cache/conftool/dbconfig/20221118-134334-ladsgroup.json
[13:43:43] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[13:44:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/858587 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[13:46:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[13:46:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[13:46:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T323214)', diff saved to https://phabricator.wikimedia.org/P40128 and previous config saved to /var/cache/conftool/dbconfig/20221118-134633-ladsgroup.json
[13:47:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P40129 and previous config saved to /var/cache/conftool/dbconfig/20221118-134709-ladsgroup.json
[13:47:53] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[13:48:24] <wikibugs>	 (PS3) Jbond: html: add additional return code descriptions [software/puppet-compiler] - https://gerrit.wikimedia.org/r/858582
[13:51:16] <wikibugs>	 (CR) Jbond: [C: +2] nodegen: skip new files when processing  auto host selector [software/puppet-compiler] - https://gerrit.wikimedia.org/r/858581 (https://phabricator.wikimedia.org/T323282) (owner: Jbond)
[13:51:20] <wikibugs>	 (CR) Jbond: [C: +2] html: add additional return code descriptions [software/puppet-compiler] - https://gerrit.wikimedia.org/r/858582 (owner: Jbond)
[13:53:33] <wikibugs>	 (Merged) jenkins-bot: nodegen: skip new files when processing  auto host selector [software/puppet-compiler] - https://gerrit.wikimedia.org/r/858581 (https://phabricator.wikimedia.org/T323282) (owner: Jbond)
[13:53:35] <wikibugs>	 (Merged) jenkins-bot: html: add additional return code descriptions [software/puppet-compiler] - https://gerrit.wikimedia.org/r/858582 (owner: Jbond)
[13:56:32] <wikibugs>	 (PS1) Jbond: puppet_compiler: bump version to 2.5.2 [puppet] - https://gerrit.wikimedia.org/r/858594
[13:57:18] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38327/console" [puppet] - https://gerrit.wikimedia.org/r/858594 (owner: Jbond)
[13:58:02] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] puppet_compiler: bump version to 2.5.2 [puppet] - https://gerrit.wikimedia.org/r/858594 (owner: Jbond)
[13:58:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P40130 and previous config saved to /var/cache/conftool/dbconfig/20221118-135841-ladsgroup.json
[13:59:22] <wikibugs>	 (CR) JHathaway: "looks great, thanks for cleaning this up!" [puppet] - https://gerrit.wikimedia.org/r/858545 (owner: JMeybohm)
[13:59:26] <wikibugs>	 (CR) JHathaway: [C: +2] aux-k8s: Remove obsolete hiera keys [puppet] - https://gerrit.wikimedia.org/r/858545 (owner: JMeybohm)
[14:01:53] <wikibugs>	 (CR) Filippo Giunchedi: "FYI this is ready to submit but hasn't yet" [puppet] - https://gerrit.wikimedia.org/r/858266 (owner: Jbond)
[14:02:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P40131 and previous config saved to /var/cache/conftool/dbconfig/20221118-140216-ladsgroup.json
[14:04:24] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2002-dev.codfw.wmnet with OS bullseye
[14:04:49] <icinga-wm>	 PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:06:00] <wikibugs>	 (CR) Jbond: [C: +2] Revert "hieradata: move multirootca standard settings to profile" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/858266 (owner: Jbond)
[14:07:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T323214)', diff saved to https://phabricator.wikimedia.org/P40132 and previous config saved to /var/cache/conftool/dbconfig/20221118-140749-ladsgroup.json
[14:07:59] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[14:09:18] <wikibugs>	 (CR) Jbond: [C: +2] Revert "hieradata: move multirootca standard settings to profile" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/858266 (owner: Jbond)
[14:09:41] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] P:pki: add new type calidation for ca names [puppet] - https://gerrit.wikimedia.org/r/858556 (owner: Jbond)
[14:10:52] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (BTullis) @Cmjohnson - Let me know when you're ready to move an-tool1010 please. I'll schedule a maintenance window for Superset and shut it down for you. Am I right in assuming that you'll want...
[14:13:12] <wikibugs>	 Puppet, Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (bking)  > I don't think there is a productive and actionable outcome of the discussion in this task, nor that we've made progress in the discussion. I would suggest we clo...
[14:13:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P40133 and previous config saved to /var/cache/conftool/dbconfig/20221118-141347-ladsgroup.json
[14:13:57] <wikibugs>	 (PS1) Hashar: Plugin to customize Zuul reports [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/858598
[14:14:26] <wikibugs>	 (CR) CI reject: [V: -1] Plugin to customize Zuul reports [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/858598 (owner: Hashar)
[14:17:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40134 and previous config saved to /var/cache/conftool/dbconfig/20221118-141722-ladsgroup.json
[14:17:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[14:17:36] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[14:17:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[14:17:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40135 and previous config saved to /var/cache/conftool/dbconfig/20221118-141744-ladsgroup.json
[14:18:05] <wikibugs>	 Puppet, Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (bking) >>! In T321874#8404989, @fgiunchedi wrote: > I can definitely relate with the long (and stressful!) cycles of Puppet patches you mention @bking and that one of my [...
[14:19:11] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudvirt2003-dev: move to the modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/858600
[14:21:33] <wikibugs>	 SRE, Patch-For-Review, User-fgiunchedi: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955 (fgiunchedi) I was reviewing this work again and realized the audit command should be updated. The situation in puppet.git as of `348f4a06ed` is reported below.  ` $ git grep -h -o 'p...
[14:22:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P40136 and previous config saved to /var/cache/conftool/dbconfig/20221118-142255-ladsgroup.json
[14:24:24] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudvirt2003-dev: move to the modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/858600 (owner: Arturo Borrero Gonzalez)
[14:25:02] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2003-dev.codfw.wmnet with OS bullseye
[14:25:10] <wikibugs>	 SRE, Wikimedia-Portals, Wikimedia-Site-requests, Security, Vuln-XSS: Malicious meta admin can add javascript to https://office.wikimedia.org/api/ . Move api listing off wiki - https://phabricator.wikimedia.org/T109147 (Bawolff)
[14:28:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318605)', diff saved to https://phabricator.wikimedia.org/P40137 and previous config saved to /var/cache/conftool/dbconfig/20221118-142854-ladsgroup.json
[14:29:04] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[14:30:15] <urandom>	 !log initiating Cassandra bootstrap, aqs1017-b -- T307802
[14:30:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:25] <stashbot>	 T307802: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802
[14:31:25] <Amir1>	 MatmaRex: community people are complaining about "ext-discussiontools-init-lede-button-container" element, is it tracked? 
[14:31:37] <Amir1>	 https://usercontent.irccloud-cdn.com/file/GsTIPWT6/image.png
[14:31:58] <Amir1>	 can't see it in T316175
[14:31:58] <icinga-wm>	 RECOVERY - cassandra-b service on aqs1017 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:31:58] <stashbot>	 T316175: Make the mobile Add Topic button easier for people to access  - https://phabricator.wikimedia.org/T316175
[14:32:12] <MatmaRex>	 Amir1: ugh
[14:33:20] <MatmaRex>	 Amir1: apparently https://phabricator.wikimedia.org/T323341 . i haven't seen this before
[14:34:23] <Amir1>	 it seems this is in all pages in fawiki in mobile now
[14:34:34] <Amir1>	 not sure about articles too
[14:34:35] <Amir1>	 let me check
[14:35:00] <Amir1>	 not articles
[14:35:07] <MatmaRex>	 Amir1: all talk pages, surely?
[14:35:13] <MatmaRex>	 yeah
[14:35:20] <MatmaRex>	 looks like we missed some if() somewhere
[14:35:22] <Amir1>	 Should we fix it, etc.
[14:35:27] <Amir1>	 I can help backporting
[14:35:33] <wikibugs>	 (PS1) Muehlenhoff: alertmanager: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/858603 (https://phabricator.wikimedia.org/T308013)
[14:35:35] <wikibugs>	 (PS1) Muehlenhoff: analytics::refinery: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/858604 (https://phabricator.wikimedia.org/T308013)
[14:35:37] <wikibugs>	 (PS1) Muehlenhoff: webperf: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/858605 (https://phabricator.wikimedia.org/T308013)
[14:35:39] <wikibugs>	 (PS1) Muehlenhoff: Add SPDX headers to various IF profiles [puppet] - https://gerrit.wikimedia.org/r/858606 (https://phabricator.wikimedia.org/T308013)
[14:36:29] <MatmaRex>	 probably
[14:38:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P40138 and previous config saved to /var/cache/conftool/dbconfig/20221118-143802-ladsgroup.json
[14:38:12] <MatmaRex>	 Amir1: i'll submit a patch in a minute, let me just make sure i've got the conditions right
[14:38:35] <Amir1>	 SGTM
[14:41:59] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: host reimage
[14:42:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40139 and previous config saved to /var/cache/conftool/dbconfig/20221118-144239-ladsgroup.json
[14:43:35] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[14:45:13] <icinga-wm>	 RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:45:27] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: host reimage
[14:47:53] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.16.78:7001 on aqs1017 is OK: SSL OK - Certificate aqs1017-b valid until 2024-11-08 15:06:22 +0000 (expires in 721 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:48:09] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/858371
[14:50:10] <wikibugs>	 (03PS24) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[14:50:46] <MatmaRex>	 Amir1: the fix is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/858608/ , but i don't think anyone else from my team is around at the moment.
[14:53:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T323214)', diff saved to https://phabricator.wikimedia.org/P40140 and previous config saved to /var/cache/conftool/dbconfig/20221118-145308-ladsgroup.json
[14:53:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[14:53:16] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[14:53:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[14:53:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T323214)', diff saved to https://phabricator.wikimedia.org/P40141 and previous config saved to /var/cache/conftool/dbconfig/20221118-145330-ladsgroup.json
[14:54:01] <moritzm>	 !log installing node-minimist security updates
[14:54:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:14] <Amir1>	 MatmaRex: let me know once it's merged
[14:57:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P40142 and previous config saved to /var/cache/conftool/dbconfig/20221118-145746-ladsgroup.json
[14:58:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858603 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:01:06] <wikibugs>	 (03PS1) 10Filippo Giunchedi: graphite: start mirroring traffic to graphite2004 [puppet] - 10https://gerrit.wikimedia.org/r/858610 (https://phabricator.wikimedia.org/T315524)
[15:01:25] <icinga-wm>	 PROBLEM - SSH on mw1328.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:07:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:08:40] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2003-dev.codfw.wmnet with OS bullseye
[15:09:41] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[15:10:05] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hiera: add graphite2004 to codfw graphite queries [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524)
[15:10:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: "To be merged once graphite2004 is in sync" [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi)
[15:10:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:11:17] <wikibugs>	 (03PS25) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[15:11:35] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[15:12:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P40143 and previous config saved to /var/cache/conftool/dbconfig/20221118-151252-ladsgroup.json
[15:12:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:14:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T323214)', diff saved to https://phabricator.wikimedia.org/P40144 and previous config saved to /var/cache/conftool/dbconfig/20221118-151458-ladsgroup.json
[15:15:10] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[15:17:32] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] k8s: pin the pause container image to the k8s_116 tag on staging [puppet] - 10https://gerrit.wikimedia.org/r/858546 (https://phabricator.wikimedia.org/T322920) (owner: 10Elukey)
[15:18:15] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:18:17] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] k8s: pin the pause container image to the k8s_116 tag on staging [puppet] - 10https://gerrit.wikimedia.org/r/858546 (https://phabricator.wikimedia.org/T322920) (owner: 10Elukey)
[15:19:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Papaul) @jbon I think the issue was with what @Volans mentioned above. Didn't have the issue with another node that I worked with yesterday (kafka-...
[15:19:39] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff)
[15:19:53] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: wmcs: proxy: only mark the proxy as started if it didn't fail (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/858560 (owner: 10Arturo Borrero Gonzalez)
[15:22:08] <MatmaRex>	 Amir1: merged, i think we could backport it
[15:22:32] <Amir1>	 sounds good, do you want to do the honours?
[15:24:09] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[15:24:23] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-logging1005.eqiad.wmnet with OS bullseye
[15:24:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-logging1005.eqiad.wmnet with OS bullseye
[15:25:52] <wikibugs>	 (03PS1) 10Ladsgroup: Don't add lede button if mobile DiscussionTools not enabled [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858320 (https://phabricator.wikimedia.org/T323341)
[15:25:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:26:02] <Amir1>	 MatmaRex: only wmf.10? ^
[15:26:59] <MatmaRex>	 yes. new feature
[15:27:10] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C: 03+1] Don't add lede button if mobile DiscussionTools not enabled [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858320 (https://phabricator.wikimedia.org/T323341) (owner: 10Ladsgroup)
[15:27:23] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Don't add lede button if mobile DiscussionTools not enabled [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858320 (https://phabricator.wikimedia.org/T323341) (owner: 10Ladsgroup)
[15:27:27] <Amir1>	 let's go
[15:27:36] <MatmaRex>	 Amir1: i don't have deployment access, i can't do the honours :)
[15:27:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40145 and previous config saved to /var/cache/conftool/dbconfig/20221118-152758-ladsgroup.json
[15:28:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[15:28:02] <Amir1>	 we should fix that, let's work on that next week
[15:28:07] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[15:28:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[15:28:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T323214)', diff saved to https://phabricator.wikimedia.org/P40146 and previous config saved to /var/cache/conftool/dbconfig/20221118-152820-ladsgroup.json
[15:30:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P40147 and previous config saved to /var/cache/conftool/dbconfig/20221118-153005-ladsgroup.json
[15:30:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:31:04] <MatmaRex>	 Amir1: ew no. it's scary enough with all the things i *can* access
[15:32:40] <wikibugs>	 (03Merged) 10jenkins-bot: Don't add lede button if mobile DiscussionTools not enabled [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858320 (https://phabricator.wikimedia.org/T323341) (owner: 10Ladsgroup)
[15:33:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858320 (https://phabricator.wikimedia.org/T323341) (owner: 10Ladsgroup)
[15:33:54] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:858320|Don't add lede button if mobile DiscussionTools not enabled (T323341)]]
[15:34:05] <stashbot>	 T323341: unnecessary button on mobile talk pages - https://phabricator.wikimedia.org/T323341
[15:34:20] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:858320|Don't add lede button if mobile DiscussionTools not enabled (T323341)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[15:35:04] <Amir1>	 MatmaRex: live on mwdebug1002, can you check?
[15:35:22] <MatmaRex>	 yeah
[15:36:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1005.eqiad.wmnet with reason: host reimage
[15:36:47] <MatmaRex>	 still testing some things
[15:37:18] <Amir1>	 take your time, I'm coding
[15:38:20] <MatmaRex>	 Amir1: all looks good though. the button shows up when it should and doesn't when it shouldn't
[15:38:29] <Amir1>	 awesome
[15:38:43] <MatmaRex>	 tried some pages on en.wp, fr.wp, mw.org
[15:39:08] <Amir1>	 it's being pushed everywhere now
[15:40:01] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1005.eqiad.wmnet with reason: host reimage
[15:40:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-jumbo1011.mgmt.eqiad.wmnet with reboot policy FORCED
[15:42:42] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:858320|Don't add lede button if mobile DiscussionTools not enabled (T323341)]] (duration: 08m 47s)
[15:42:50] <Amir1>	 MatmaRex: deployed everywhere
[15:42:52] <stashbot>	 T323341: unnecessary button on mobile talk pages - https://phabricator.wikimedia.org/T323341
[15:42:59] <MatmaRex>	 thanks
[15:43:30] <wikibugs>	 (03PS1) 10Herron: dispatch: manage config.js locally [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858616 (https://phabricator.wikimedia.org/T313229)
[15:44:42] <wikibugs>	 (03PS1) 10Clément Goubert: apple-search: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/858617 (https://phabricator.wikimedia.org/T316296)
[15:45:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P40148 and previous config saved to /var/cache/conftool/dbconfig/20221118-154511-ladsgroup.json
[15:46:33] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38331/console" [puppet] - 10https://gerrit.wikimedia.org/r/858617 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[15:46:43] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi)
[15:47:21] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/858610 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi)
[15:48:28] <wikibugs>	 (03CR) 10Clément Goubert: apple-search: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/858617 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[15:48:39] <wikibugs>	 (03PS1) 10Clément Goubert: apple-search: final cleanup [puppet] - 10https://gerrit.wikimedia.org/r/858624 (https://phabricator.wikimedia.org/T316296)
[15:52:12] <wikibugs>	 (03CR) 10Herron: [C: 03+1] hiera: add graphite2004 to codfw graphite queries [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi)
[15:52:40] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart  - bking@cumin1001 - T319020
[15:52:41] <wikibugs>	 (03CR) 10Herron: [C: 03+1] graphite: start mirroring traffic to graphite2004 [puppet] - 10https://gerrit.wikimedia.org/r/858610 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi)
[15:52:47] <wikibugs>	 (03PS26) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[15:52:56] <stashbot>	 T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020
[15:52:58] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart  - bking@cumin1001 - T319020
[15:53:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T323214)', diff saved to https://phabricator.wikimedia.org/P40149 and previous config saved to /var/cache/conftool/dbconfig/20221118-155310-ladsgroup.json
[15:53:36] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[15:53:59] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] apple-search: remove dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/858573 (owner: 10Clément Goubert)
[15:54:22] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] admin-ng: remove apple-search namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/858575 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[15:54:30] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] wikikube: remove apple-search deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/858577 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[15:54:37] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1005.eqiad.wmnet with OS bullseye
[15:54:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-logging1005.eqiad.wmnet with OS bullseye comple...
[15:54:42] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] charts: remove apple-search chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/858578 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[15:54:57] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] apple-search: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/858617 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[15:55:12] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] apple-search: final cleanup [puppet] - 10https://gerrit.wikimedia.org/r/858624 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[15:55:35] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart  - bking@cumin1001 - T319020
[15:55:42] <wikibugs>	 (03CR) 10Michael Große: "I guess we can schedule this for the backport-window on Nov 30th, that is the one after the train" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[15:57:00] <wikibugs>	 (03CR) 10Clément Goubert: "recheck" [labs/private] - 10https://gerrit.wikimedia.org/r/858573 (owner: 10Clément Goubert)
[15:57:39] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cp1078:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:58:15] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] apple-search: remove dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/858573 (owner: 10Clément Goubert)
[15:58:40] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cp5012:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:59:24] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart  - bking@cumin1001 - T319020
[15:59:54] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] admin-ng: remove apple-search namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/858575 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[16:00:04] <stashbot>	 T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020
[16:00:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T323214)', diff saved to https://phabricator.wikimedia.org/P40150 and previous config saved to /var/cache/conftool/dbconfig/20221118-160018-ladsgroup.json
[16:00:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[16:00:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[16:00:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T323214)', diff saved to https://phabricator.wikimedia.org/P40151 and previous config saved to /var/cache/conftool/dbconfig/20221118-160039-ladsgroup.json
[16:01:07] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[16:01:16] <icinga-wm>	 RECOVERY - SSH on mw1328.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:02:39] <jinxer-wm>	 (NodeTextfileStale) firing: (7) Stale textfile for cp1078:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:03:40] <jinxer-wm>	 (NodeTextfileStale) firing: (7) Stale textfile for cp2029:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:05:19] <wikibugs>	 (03Merged) 10jenkins-bot: admin-ng: remove apple-search namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/858575 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[16:05:34] <icinga-wm>	 RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:06:14] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh)
[16:07:37] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[16:07:39] <jinxer-wm>	 (NodeTextfileStale) firing: (19) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:07:43] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh)
[16:08:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P40152 and previous config saved to /var/cache/conftool/dbconfig/20221118-160817-ladsgroup.json
[16:08:40] <jinxer-wm>	 (NodeTextfileStale) firing: (19) Stale textfile for cp1077:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:08:46] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM, thanks!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/858560 (owner: 10Arturo Borrero Gonzalez)
[16:08:49] <claime>	 !log removing apple-search namespaces - T316296
[16:08:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:04] <stashbot>	 T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296
[16:09:10] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[16:09:15] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[16:09:18] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney)
[16:09:28] <wikibugs>	 (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney)
[16:09:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Papaul)
[16:10:19] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: proxy: only mark the proxy as started if it didn't fail [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/858560 (owner: 10Arturo Borrero Gonzalez)
[16:10:26] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[16:10:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Papaul) 05Open→03Resolved @herron this is complete
[16:10:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] dispatch: manage config.js locally [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858616 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron)
[16:11:13] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[16:11:50] <wikibugs>	 (03PS27) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[16:12:18] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[16:12:40] <jinxer-wm>	 (NodeTextfileStale) firing: (27) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:12:50] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[16:13:23] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[16:13:24] <vgutierrez>	 brett: ^^ that's for you
[16:13:40] <jinxer-wm>	 (NodeTextfileStale) firing: (30) Stale textfile for cp1077:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:13:52] <wikibugs>	 (03Abandoned) 10Clément Goubert: mw-*: Remove sal logging hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/858360 (https://phabricator.wikimedia.org/T323296) (owner: 10Clément Goubert)
[16:15:02] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] wikikube: remove apple-search deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/858577 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[16:15:46] <vgutierrez>	 brett: per https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile?orgId=1 you need to clean the stale ats_config.prom file
[16:17:40] <jinxer-wm>	 (NodeTextfileStale) firing: (38) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:18:40] <jinxer-wm>	 (NodeTextfileStale) firing: (42) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:18:51] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - bking@cumin1001 - T319020
[16:19:05] <stashbot>	 T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020
[16:19:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:20:01] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[16:20:24] <wikibugs>	 (03Merged) 10jenkins-bot: wikikube: remove apple-search deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/858577 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[16:20:26] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs: add cookbook to add/remove a user to/from a project (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro)
[16:20:29] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[16:20:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro)
[16:21:04] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] charts: remove apple-search chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/858578 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert)
[16:21:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T323214)', diff saved to https://phabricator.wikimedia.org/P40154 and previous config saved to /var/cache/conftool/dbconfig/20221118-162147-ladsgroup.json
[16:21:57] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[16:22:13] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "I'm fine with this being merged as-is and the additional feature comments being left for future patches." [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/851650 (owner: David Caro)
[16:22:22] <wikibugs>	 (CR) Michael Große: Wikidata: don't show Vector search thumbnails (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: Michael Große)
[16:22:32] <wikibugs>	 (CR) Clément Goubert: [C: +2] apple-search: absent kubernetes service [puppet] - https://gerrit.wikimedia.org/r/858617 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert)
[16:22:40] <jinxer-wm>	 (NodeTextfileStale) firing: (42) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:23:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P40155 and previous config saved to /var/cache/conftool/dbconfig/20221118-162323-ladsgroup.json
[16:23:40] <jinxer-wm>	 (NodeTextfileStale) firing: (46) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:24:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:25:04] <wikibugs>	 (PS3) David Caro: wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/851650
[16:25:29] <wikibugs>	 (Merged) jenkins-bot: charts: remove apple-search chart [deployment-charts] - https://gerrit.wikimedia.org/r/858578 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert)
[16:26:02] <wikibugs>	 (CR) Clément Goubert: [C: +2] apple-search: final cleanup [puppet] - https://gerrit.wikimedia.org/r/858624 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert)
[16:26:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1011.mgmt.eqiad.wmnet with reboot policy FORCED
[16:27:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1011']
[16:27:40] <jinxer-wm>	 (NodeTextfileStale) firing: (44) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:35:01] <wikibugs>	 (PS4) David Caro: wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/851650
[16:35:22] <wikibugs>	 (CR) David Caro: wmcs: add cookbook to add/remove a user to/from a project (2 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/851650 (owner: David Caro)
[16:36:52] <wikibugs>	 (PS5) David Caro: wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/851650
[16:36:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P40156 and previous config saved to /var/cache/conftool/dbconfig/20221118-163653-ladsgroup.json
[16:36:54] <wikibugs>	 (CR) David Caro: wmcs: add cookbook to add/remove a user to/from a project (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/851650 (owner: David Caro)
[16:37:40] <jinxer-wm>	 (NodeTextfileStale) resolved: Stale textfile for cp4050:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:37:40] <MatmaRex>	 looks like i have one more thing to backport today, https://phabricator.wikimedia.org/T323343
[16:38:13] <MatmaRex>	 this is the worst friday in months! (at least this one isn't my fault)
[16:38:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T323214)', diff saved to https://phabricator.wikimedia.org/P40157 and previous config saved to /var/cache/conftool/dbconfig/20221118-163830-ladsgroup.json
[16:38:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[16:38:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[16:38:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T323214)', diff saved to https://phabricator.wikimedia.org/P40158 and previous config saved to /var/cache/conftool/dbconfig/20221118-163851-ladsgroup.json
[16:38:55] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[16:41:26] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1011']
[16:45:37] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010']
[16:47:47] <wikibugs>	 (CR) Ahmon Dancy: role::kubernetes::wroker: allow scap to pre-pull mediawiki images (4 comments) [puppet] - https://gerrit.wikimedia.org/r/858543 (https://phabricator.wikimedia.org/T323349) (owner: Giuseppe Lavagetto)
[16:47:56] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[16:49:10] <jinxer-wm>	 (NodeTextfileStale) resolved: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:49:19] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - bking@cumin1001 - T319020
[16:49:23] <wikibugs>	 (PS1) Bartosz Dziewoński: VE: Use <sup> instead of <span> in CE HTML [extensions/Cite] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858321 (https://phabricator.wikimedia.org/T323343)
[16:49:34] <wikibugs>	 (PS1) Bartosz Dziewoński: Undo use of .reference instead of .mw-ref in CSS counter rules [extensions/Cite] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858322 (https://phabricator.wikimedia.org/T323343)
[16:49:42] <wikibugs>	 (CR) David Caro: [C: +2] wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/851650 (owner: David Caro)
[16:49:44] <stashbot>	 T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020
[16:49:48] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:49:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T323214)', diff saved to https://phabricator.wikimedia.org/P40159 and previous config saved to /var/cache/conftool/dbconfig/20221118-164957-ladsgroup.json
[16:50:05] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[16:50:48] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5017
[16:51:06] <wikibugs>	 (Abandoned) Jforrester: onSpecialSearchCreateLink: Handle null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/851016 (https://phabricator.wikimedia.org/T320736) (owner: Jforrester)
[16:51:11] <wikibugs>	 (Abandoned) Jforrester: onSpecialSearchCreateLink: Handle another null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/851017 (https://phabricator.wikimedia.org/T320736) (owner: Jforrester)
[16:51:19] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5017
[16:51:27] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1010']
[16:51:41] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5018
[16:52:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P40160 and previous config saved to /var/cache/conftool/dbconfig/20221118-165200-ladsgroup.json
[16:52:06] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5018
[16:52:07] <MatmaRex>	 brennen: thcipriani: (or anyone else) are you perhaps around for an emergency friday backport? (another one, different than the thing this morning…) https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Cite/+/858321 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Cite/+/858322
[16:52:10] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5019
[16:52:24] * thcipriani looks
[16:52:35] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5019
[16:52:39] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5020
[16:52:40] <MatmaRex>	 the bug is https://phabricator.wikimedia.org/T323343
[16:53:01] <wikibugs>	 (Merged) jenkins-bot: wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/851650 (owner: David Caro)
[16:53:06] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5020
[16:53:12] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5028
[16:53:29] <thcipriani>	 MatmaRex: I can get it out
[16:53:34] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5028
[16:53:37] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5029
[16:53:59] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5029
[16:54:03] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5030
[16:54:26] <MatmaRex>	 thcipriani: thank you
[16:54:27] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5030
[16:54:31] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5031
[16:54:43] <wikibugs>	 SRE, Discovery-Search, serviceops, serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (Clement_Goubert) In progress→Resolved Certificates cleaned up. It's dead, Jim.
[16:54:53] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5031
[16:56:26] <claime>	 !log apple-search service decommissioned - T316296
[16:56:44] * brennen reads backscroll
[16:56:45] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5017.mgmt.eqsin.wmnet with reboot policy FORCED
[16:57:10] <claime>	 Hmm logmsg.bot, plz log to sal
[16:58:04] <claime>	 stash.bot even
[16:58:06] <thcipriani>	 MatmaRex: do these need to go out in any particular order? All at once OK?
[16:58:21] <claime>	 Ah. That explains it.
[16:58:34] <MatmaRex>	 thcipriani: any order, yes
[16:58:48] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy1002 using scap backport" [extensions/Cite] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858321 (https://phabricator.wikimedia.org/T323343) (owner: Bartosz Dziewoński)
[16:58:54] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy1002 using scap backport" [extensions/Cite] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858322 (https://phabricator.wikimedia.org/T323343) (owner: Bartosz Dziewoński)
[16:59:38] <wikibugs>	 (CR) Herron: [V: +2 C: +2] dispatch: manage config.js locally [docker-images/production-images] - https://gerrit.wikimedia.org/r/858616 (https://phabricator.wikimedia.org/T313229) (owner: Herron)
[17:02:17] <Lucas_WMDE>	 claime: I think the last log messages still ended up in https://sal.toolforge.org/ ?
[17:03:40] <claime>	 Lucas_WMDE: Yeah it did
[17:04:05] <claime>	 I just didn't get an echo here since it was in the process of timing out :')
[17:04:45] <Lucas_WMDE>	 you’re right, it should’ve replied to you since you’re not logmsgbot ^^
[17:04:48] <Lucas_WMDE>	 I missed that part
[17:05:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P40161 and previous config saved to /var/cache/conftool/dbconfig/20221118-170503-ladsgroup.json
[17:07:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T323214)', diff saved to https://phabricator.wikimedia.org/P40162 and previous config saved to /var/cache/conftool/dbconfig/20221118-170706-ladsgroup.json
[17:07:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[17:07:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[17:07:24] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[17:07:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T323214)', diff saved to https://phabricator.wikimedia.org/P40163 and previous config saved to /var/cache/conftool/dbconfig/20221118-170727-ladsgroup.json
[17:08:16] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5017.mgmt.eqsin.wmnet with reboot policy FORCED
[17:10:46] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010']
[17:11:18] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5018.mgmt.eqsin.wmnet with reboot policy FORCED
[17:12:01] <logmsgbot>	 !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1010']
[17:13:13] <wikibugs>	 (Merged) jenkins-bot: VE: Use <sup> instead of <span> in CE HTML [extensions/Cite] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858321 (https://phabricator.wikimedia.org/T323343) (owner: Bartosz Dziewoński)
[17:13:19] <wikibugs>	 (Merged) jenkins-bot: Undo use of .reference instead of .mw-ref in CSS counter rules [extensions/Cite] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858322 (https://phabricator.wikimedia.org/T323343) (owner: Bartosz Dziewoński)
[17:13:33] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:858321|VE: Use <sup> instead of <span> in CE HTML (T323343)]], [[gerrit:858322|Undo use of .reference instead of .mw-ref in CSS counter rules (T323343)]]
[17:13:43] <stashbot>	 T323343: [1][2][3] style references in unusual vertical position when editing, and erroneous [0] references added when saving - https://phabricator.wikimedia.org/T323343
[17:13:53] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani and matmarex: Backport for [[gerrit:858321|VE: Use <sup> instead of <span> in CE HTML (T323343)]], [[gerrit:858322|Undo use of .reference instead of .mw-ref in CSS counter rules (T323343)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[17:14:04] <thcipriani>	 ^ MatmaRex finally on mwdebug, check please
[17:15:10] <MatmaRex>	 thcipriani: looks good
[17:15:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010']
[17:15:25] <thcipriani>	 cool, syncing everywhere now
[17:15:39] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1010']
[17:15:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010']
[17:19:04] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1010']
[17:19:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010']
[17:19:31] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:858321|VE: Use <sup> instead of <span> in CE HTML (T323343)]], [[gerrit:858322|Undo use of .reference instead of .mw-ref in CSS counter rules (T323343)]] (duration: 05m 58s)
[17:19:38] <thcipriani>	 ^ MatmaRex should be everywhere now
[17:19:43] <stashbot>	 T323343: [1][2][3] style references in unusual vertical position when editing, and erroneous [0] references added when saving - https://phabricator.wikimedia.org/T323343
[17:19:47] <MatmaRex>	 thanks thcipriani!
[17:20:03] <thcipriani>	 any time: thanks for the patches
[17:20:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P40164 and previous config saved to /var/cache/conftool/dbconfig/20221118-172010-ladsgroup.json
[17:22:54] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5018.mgmt.eqsin.wmnet with reboot policy FORCED
[17:24:00] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5019.mgmt.eqsin.wmnet with reboot policy FORCED
[17:25:32] <wikibugs>	 (PS28) Vgutierrez: Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[17:25:34] <wikibugs>	 (CR) Bartosz Dziewoński: Don't run OutputPageBeforeHTML for the talkpageheader (1 comment) [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858319 (https://phabricator.wikimedia.org/T316175) (owner: Bartosz Dziewoński)
[17:31:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T323214)', diff saved to https://phabricator.wikimedia.org/P40165 and previous config saved to /var/cache/conftool/dbconfig/20221118-173156-ladsgroup.json
[17:32:07] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[17:35:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T323214)', diff saved to https://phabricator.wikimedia.org/P40166 and previous config saved to /var/cache/conftool/dbconfig/20221118-173516-ladsgroup.json
[17:35:28] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5019.mgmt.eqsin.wmnet with reboot policy FORCED
[17:38:18] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5020.mgmt.eqsin.wmnet with reboot policy FORCED
[17:41:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[17:42:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[17:42:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[17:42:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[17:42:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T323214)', diff saved to https://phabricator.wikimedia.org/P40167 and previous config saved to /var/cache/conftool/dbconfig/20221118-174226-ladsgroup.json
[17:45:04] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[17:47:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P40168 and previous config saved to /var/cache/conftool/dbconfig/20221118-174702-ladsgroup.json
[17:49:43] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5020.mgmt.eqsin.wmnet with reboot policy FORCED
[17:52:22] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5028.mgmt.eqsin.wmnet with reboot policy FORCED
[17:56:45] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1010']
[17:57:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010']
[17:57:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T323214)', diff saved to https://phabricator.wikimedia.org/P40169 and previous config saved to /var/cache/conftool/dbconfig/20221118-175717-ladsgroup.json
[17:57:30] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[17:57:50] <wikibugs>	 (PS29) Vgutierrez: Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[18:02:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P40170 and previous config saved to /var/cache/conftool/dbconfig/20221118-180212-ladsgroup.json
[18:03:54] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5028.mgmt.eqsin.wmnet with reboot policy FORCED
[18:04:38] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1010']
[18:05:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1011']
[18:06:33] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5029.mgmt.eqsin.wmnet with reboot policy FORCED
[18:09:28] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[18:11:24] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[18:12:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P40171 and previous config saved to /var/cache/conftool/dbconfig/20221118-181223-ladsgroup.json
[18:12:29] <wikibugs>	 (PS1) Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - https://gerrit.wikimedia.org/r/858633
[18:14:23] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Jclark-ctr) @Papaul corrected netbox it was in as asset tag WMF10621
[18:15:16] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1011']
[18:16:46] <icinga-wm>	 PROBLEM - Exim SMTP on mx1001 is CRITICAL: connect to address 208.80.154.76 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[18:17:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T323214)', diff saved to https://phabricator.wikimedia.org/P40172 and previous config saved to /var/cache/conftool/dbconfig/20221118-181720-ladsgroup.json
[18:17:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[18:17:31] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[18:17:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[18:17:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T323214)', diff saved to https://phabricator.wikimedia.org/P40173 and previous config saved to /var/cache/conftool/dbconfig/20221118-181741-ladsgroup.json
[18:18:07] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5029.mgmt.eqsin.wmnet with reboot policy FORCED
[18:18:55] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-jumbo1012.mgmt.eqiad.wmnet with reboot policy FORCED
[18:19:25] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5030.mgmt.eqsin.wmnet with reboot policy FORCED
[18:20:57] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-jumbo1014.mgmt.eqiad.wmnet with reboot policy FORCED
[18:21:40] <herron>	 !log removed older exim logs to free space T305567
[18:21:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:48] <stashbot>	 T305567: MX: increasing disk space - https://phabricator.wikimedia.org/T305567
[18:27:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P40174 and previous config saved to /var/cache/conftool/dbconfig/20221118-182730-ladsgroup.json
[18:27:50] <icinga-wm>	 RECOVERY - Exim SMTP on mx1001 is OK: OK - Certificate mx1001.wikimedia.org will expire on Fri 30 Dec 2022 08:22:47 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[18:31:10] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5030.mgmt.eqsin.wmnet with reboot policy FORCED
[18:31:34] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5031.mgmt.eqsin.wmnet with reboot policy FORCED
[18:39:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T323214)', diff saved to https://phabricator.wikimedia.org/P40175 and previous config saved to /var/cache/conftool/dbconfig/20221118-183906-ladsgroup.json
[18:39:16] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[18:42:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T323214)', diff saved to https://phabricator.wikimedia.org/P40176 and previous config saved to /var/cache/conftool/dbconfig/20221118-184236-ladsgroup.json
[18:42:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[18:42:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[18:42:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40177 and previous config saved to /var/cache/conftool/dbconfig/20221118-184258-ladsgroup.json
[18:43:01] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5031.mgmt.eqsin.wmnet with reboot policy FORCED
[18:43:47] <wikibugs>	 (PS1) Ssingh: cp5017: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/858635 (https://phabricator.wikimedia.org/T322048)
[18:45:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1014.mgmt.eqiad.wmnet with reboot policy FORCED
[18:47:16] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1012.mgmt.eqiad.wmnet with reboot policy FORCED
[18:48:40] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:51:27] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5017']
[18:51:33] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp5017']
[18:52:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1012']
[18:54:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P40178 and previous config saved to /var/cache/conftool/dbconfig/20221118-185412-ladsgroup.json
[18:54:17] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5017']
[18:56:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service,monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:02:51] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5018']
[19:02:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1014']
[19:03:26] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['kafka-jumbo1014']
[19:03:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40179 and previous config saved to /var/cache/conftool/dbconfig/20221118-190340-ladsgroup.json
[19:04:01] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[19:05:29] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010']
[19:05:44] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['kafka-jumbo1010']
[19:06:58] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5017']
[19:07:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1014']
[19:07:29] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5019']
[19:08:38] <icinga-wm>	 PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:09:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P40180 and previous config saved to /var/cache/conftool/dbconfig/20221118-190919-ladsgroup.json
[19:11:16] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[19:12:33] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[19:12:38] <wikibugs>	 SRE, DynamicPageList (Wikimedia), MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), Sustainability (Incident Followup), Thai-Sites: Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (Bebiezaza) Tagging #thai-sites because this extension is currently in use at Thai Wikisource (t...
[19:15:01] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5018']
[19:18:38] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5020']
[19:18:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P40181 and previous config saved to /var/cache/conftool/dbconfig/20221118-191846-ladsgroup.json
[19:20:00] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[19:21:19] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[19:23:37] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5019']
[19:23:45] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1014']
[19:23:55] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5028']
[19:24:05] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1012']
[19:24:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T323214)', diff saved to https://phabricator.wikimedia.org/P40182 and previous config saved to /var/cache/conftool/dbconfig/20221118-192425-ladsgroup.json
[19:24:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[19:24:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[19:24:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[19:24:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[19:24:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T323214)', diff saved to https://phabricator.wikimedia.org/P40183 and previous config saved to /var/cache/conftool/dbconfig/20221118-192452-ladsgroup.json
[19:25:02] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[19:27:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1012']
[19:28:09] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1014']
[19:28:53] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[19:31:40] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5020']
[19:31:49] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5029']
[19:32:29] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[19:33:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P40184 and previous config saved to /var/cache/conftool/dbconfig/20221118-193353-ladsgroup.json
[19:34:12] <wikibugs>	 (PS1) BCornwall: cp5018: Set cp role via site.pp and related config [puppet] - https://gerrit.wikimedia.org/r/858640 (https://phabricator.wikimedia.org/T322048)
[19:34:34] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1014']
[19:36:03] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5028']
[19:37:16] <wikibugs>	 (PS2) BCornwall: cp5018: Set cp role via site.pp and related config [puppet] - https://gerrit.wikimedia.org/r/858640 (https://phabricator.wikimedia.org/T322048)
[19:39:38] <wikibugs>	 (CR) Ssingh: [C: +1] "Looks good! We will merge it later when we are ready to reimage." [puppet] - https://gerrit.wikimedia.org/r/858640 (https://phabricator.wikimedia.org/T322048) (owner: BCornwall)
[19:44:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-jumbo1015.mgmt.eqiad.wmnet with reboot policy FORCED
[19:46:10] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1012']
[19:46:45] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5030']
[19:47:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T323214)', diff saved to https://phabricator.wikimedia.org/P40185 and previous config saved to /var/cache/conftool/dbconfig/20221118-194721-ladsgroup.json
[19:47:59] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[19:49:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40186 and previous config saved to /var/cache/conftool/dbconfig/20221118-194859-ladsgroup.json
[19:49:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[19:49:09] <wikibugs>	 (CR) BCornwall: [V: +1 C: +2] prometheus: Remove old ats config export job [puppet] - https://gerrit.wikimedia.org/r/858418 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[19:49:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[19:58:18] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1012']
[19:58:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1012']
[19:58:50] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5030']
[19:59:14] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5031']
[20:00:24] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[20:01:03] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[20:02:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P40187 and previous config saved to /var/cache/conftool/dbconfig/20221118-200228-ladsgroup.json
[20:03:12] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp5029']
[20:03:16] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5029']
[20:04:02] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp5029']
[20:05:50] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:06:38] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5029']
[20:07:36] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp5029']
[20:07:42] <icinga-wm>	 RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:08:27] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp5031']
[20:09:26] <icinga-wm>	 RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:09:45] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[20:10:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[20:10:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[20:10:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T323214)', diff saved to https://phabricator.wikimedia.org/P40188 and previous config saved to /var/cache/conftool/dbconfig/20221118-201030-ladsgroup.json
[20:10:36] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[20:15:22] <wikibugs>	 (PS2) Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319)
[20:15:24] <wikibugs>	 (PS1) Andrew Bogott: glance: use www_authenticate_uri [puppet] - https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319)
[20:15:26] <wikibugs>	 (PS1) Andrew Bogott: nova-compute.conf: add explanatory note about live_migration_uri [puppet] - https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319)
[20:15:28] <wikibugs>	 (PS1) Andrew Bogott: neutron.conf: remove allow_overlapping_ips config flag [puppet] - https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319)
[20:15:30] <wikibugs>	 (PS1) Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319)
[20:17:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P40189 and previous config saved to /var/cache/conftool/dbconfig/20221118-201734-ladsgroup.json
[20:18:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1015.mgmt.eqiad.wmnet with reboot policy FORCED
[20:21:00] <wikibugs>	 (PS2) Andrew Bogott: glance: use www_authenticate_uri [puppet] - https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319)
[20:21:02] <wikibugs>	 (PS3) Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319)
[20:21:04] <wikibugs>	 (PS2) Andrew Bogott: nova-compute.conf: add explanatory note about live_migration_uri [puppet] - https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319)
[20:21:06] <wikibugs>	 (PS2) Andrew Bogott: neutron.conf: remove allow_overlapping_ips config flag [puppet] - https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319)
[20:21:08] <wikibugs>	 (PS2) Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319)
[20:21:44] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1015']
[20:21:46] <wikibugs>	 (CR) CI reject: [V: -1] glance: use www_authenticate_uri [puppet] - https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) (owner: Andrew Bogott)
[20:22:08] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[20:22:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T323214)', diff saved to https://phabricator.wikimedia.org/P40190 and previous config saved to /var/cache/conftool/dbconfig/20221118-202245-ladsgroup.json
[20:22:51] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[20:23:02] <wikibugs>	 (PS3) Andrew Bogott: glance: use www_authenticate_uri [puppet] - https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319)
[20:23:04] <wikibugs>	 (PS4) Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319)
[20:23:06] <wikibugs>	 (PS3) Andrew Bogott: nova-compute.conf: add explanatory note about live_migration_uri [puppet] - https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319)
[20:23:08] <wikibugs>	 (PS3) Andrew Bogott: neutron.conf: remove allow_overlapping_ips config flag [puppet] - https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319)
[20:23:10] <wikibugs>	 (PS3) Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319)
[20:23:12] <wikibugs>	 (CR) CI reject: [V: -1] neutron.conf: remove allow_overlapping_ips config flag [puppet] - https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319) (owner: Andrew Bogott)
[20:23:14] <wikibugs>	 (CR) CI reject: [V: -1] Set service_token_roles for services that use Keystone [puppet] - https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319) (owner: Andrew Bogott)
[20:24:29] <wikibugs>	 (CR) CI reject: [V: -1] glance: use www_authenticate_uri [puppet] - https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) (owner: Andrew Bogott)
[20:25:01] <wikibugs>	 (PS4) Andrew Bogott: glance: use www_authenticate_uri [puppet] - https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319)
[20:25:03] <wikibugs>	 (PS5) Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319)
[20:25:05] <wikibugs>	 (PS4) Andrew Bogott: nova-compute.conf: add explanatory note about live_migration_uri [puppet] - https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319)
[20:25:08] <wikibugs>	 (PS4) Andrew Bogott: neutron.conf: remove allow_overlapping_ips config flag [puppet] - https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319)
[20:25:09] <wikibugs>	 (PS4) Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319)
[20:26:59] <wikibugs>	 (CR) CI reject: [V: -1] glance: use www_authenticate_uri [puppet] - https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) (owner: Andrew Bogott)
[20:29:13] <wikibugs>	 (PS5) Andrew Bogott: glance: use www_authenticate_uri [puppet] - https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319)
[20:29:14] <wikibugs>	 (PS6) Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319)
[20:29:16] <wikibugs>	 (PS5) Andrew Bogott: nova-compute.conf: add explanatory note about live_migration_uri [puppet] - https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319)
[20:29:18] <wikibugs>	 (PS5) Andrew Bogott: neutron.conf: remove allow_overlapping_ips config flag [puppet] - https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319)
[20:29:20] <wikibugs>	 (PS5) Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319)
[20:32:16] <wikibugs>	 (CR) BBlack: [C: -1] "Needs: "profile::cache::varnish::frontend::single_backend: true" in the hieradata/cp5017.yaml file as well" [puppet] - https://gerrit.wikimedia.org/r/858635 (https://phabricator.wikimedia.org/T322048) (owner: Ssingh)
[20:32:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T323214)', diff saved to https://phabricator.wikimedia.org/P40191 and previous config saved to /var/cache/conftool/dbconfig/20221118-203241-ladsgroup.json
[20:32:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[20:32:48] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[20:32:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[20:33:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40192 and previous config saved to /var/cache/conftool/dbconfig/20221118-203302-ladsgroup.json
[20:33:13] <wikibugs>	 (CR) BBlack: [C: -1] "Needs: "profile::cache::varnish::frontend::single_backend: true" in the hieradata/cp5017.yaml file as well" [puppet] - https://gerrit.wikimedia.org/r/858640 (https://phabricator.wikimedia.org/T322048) (owner: BCornwall)
[20:37:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P40193 and previous config saved to /var/cache/conftool/dbconfig/20221118-203751-ladsgroup.json
[20:39:01] <wikibugs>	 (PS1) BBlack: cp5032: turn on single_backend [puppet] - https://gerrit.wikimedia.org/r/858649 (https://phabricator.wikimedia.org/T322048)
[20:41:56] <wikibugs>	 (CR) BBlack: [C: +2] cp5032: turn on single_backend [puppet] - https://gerrit.wikimedia.org/r/858649 (https://phabricator.wikimedia.org/T322048) (owner: BBlack)
[20:44:10] <wikibugs>	 (PS6) Andrew Bogott: glance: use www_authenticate_uri [puppet] - https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319)
[20:44:12] <wikibugs>	 (PS7) Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319)
[20:44:14] <wikibugs>	 (PS6) Andrew Bogott: nova-compute.conf: add explanatory note about live_migration_uri [puppet] - https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319)
[20:44:16] <wikibugs>	 (PS6) Andrew Bogott: neutron.conf: remove allow_overlapping_ips config flag [puppet] - https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319)
[20:44:18] <wikibugs>	 (PS6) Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319)
[20:46:10] <wikibugs>	 (PS7) Andrew Bogott: glance: use www_authenticate_uri [puppet] - https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319)
[20:46:12] <wikibugs>	 (PS8) Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319)
[20:46:14] <wikibugs>	 (PS7) Andrew Bogott: nova-compute.conf: add explanatory note about live_migration_uri [puppet] - https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319)
[20:46:16] <wikibugs>	 (PS7) Andrew Bogott: neutron.conf: remove allow_overlapping_ips config flag [puppet] - https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319)
[20:46:18] <wikibugs>	 (PS7) Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319)
[20:46:48] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:48:56] <wikibugs>	 ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (phaultfinder)
[20:49:01] <wikibugs>	 (PS2) Ssingh: cp5017: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/858635 (https://phabricator.wikimedia.org/T322048)
[20:49:14] <wikibugs>	 (CR) Ssingh: cp5017: update site.pp and related configs for cp role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/858635 (https://phabricator.wikimedia.org/T322048) (owner: Ssingh)
[20:52:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P40194 and previous config saved to /var/cache/conftool/dbconfig/20221118-205258-ladsgroup.json
[20:53:58] <wikibugs>	 (CR) Ssingh: [C: +2] cp5017: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/858635 (https://phabricator.wikimedia.org/T322048) (owner: Ssingh)
[20:54:46] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:55:50] <wikibugs>	 (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/output/857079/38343/" [puppet] - https://gerrit.wikimedia.org/r/857079 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[20:56:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40195 and previous config saved to /var/cache/conftool/dbconfig/20221118-205649-ladsgroup.json
[20:56:55] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[20:56:58] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5017.eqsin.wmnet with OS buster
[20:57:05] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5017.eqsin.wmnet with OS buster
[21:01:44] <wikibugs>	 (PS1) Andrew Bogott: glance: use memcached for token caching [puppet] - https://gerrit.wikimedia.org/r/858651 (https://phabricator.wikimedia.org/T323319)
[21:02:46] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "I want to get as much as possible done before the switch itself.. This will add systemd timers, logging config, the dump service.." [puppet] - https://gerrit.wikimedia.org/r/857079 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[21:03:22] <wikibugs>	 (PS2) Dzahn: phabricator: enable dumping on phab1004 [puppet] - https://gerrit.wikimedia.org/r/857079 (https://phabricator.wikimedia.org/T280597)
[21:05:22] <wikibugs>	 (CR) Dzahn: phabricator: enable dumping on phab1004 [puppet] - https://gerrit.wikimedia.org/r/857079 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[21:06:37] <mutante>	 sukhe: you make us get scary puppet changes  :)
[21:08:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T323214)', diff saved to https://phabricator.wikimedia.org/P40196 and previous config saved to /var/cache/conftool/dbconfig/20221118-210804-ladsgroup.json
[21:08:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[21:08:13] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[21:08:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[21:08:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T323214)', diff saved to https://phabricator.wikimedia.org/P40197 and previous config saved to /var/cache/conftool/dbconfig/20221118-210825-ladsgroup.json
[21:08:48] <sukhe>	 mutante: oh which one!
[21:08:48] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:08:48] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2065 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:09:21] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1015']
[21:09:30] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2065 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:09:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:10:05] <mutante>	 sukhe: hehe, everything is ok. it's just.. every time you remove or add a cp host, it means there is an edit to "@def $CACHES" and that in turn means there is an edit to /etc/ferm/conf.d/00_defs and that means on any random host you run puppet on and expect nothing to change.. suddenly there is an entire window full of firewall rule changes :)
[21:11:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P40198 and previous config saved to /var/cache/conftool/dbconfig/20221118-211155-ladsgroup.json
[21:12:15] <mutante>	 (or it looks like there is because it's a huge list that has one host added or removed and gets displayed)
[21:14:42] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1015']
[21:14:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:17:35] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1015']
[21:19:28] <sukhe>	 ah!
[21:19:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T323214)', diff saved to https://phabricator.wikimedia.org/P40199 and previous config saved to /var/cache/conftool/dbconfig/20221118-211931-ladsgroup.json
[21:19:38] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[21:21:54] <mutante>	 !log running phabricator task dump script on phab1004
[21:21:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:21] <wikibugs>	 (PS1) Andrew Bogott: cinder.conf: lock_path to oslo_concurrency [puppet] - https://gerrit.wikimedia.org/r/858653 (https://phabricator.wikimedia.org/T323319)
[21:26:23] <wikibugs>	 (PS1) Andrew Bogott: cinder: remove default quota settings [puppet] - https://gerrit.wikimedia.org/r/858654 (https://phabricator.wikimedia.org/T323319)
[21:26:25] <wikibugs>	 (PS1) Andrew Bogott: trove: remove network_label_regex [puppet] - https://gerrit.wikimedia.org/r/858655 (https://phabricator.wikimedia.org/T323319)
[21:26:28] <wikibugs>	 (CR) Dzahn: [C: +2] "confirmed that the dump script uses the slave DB regardless on which server it runs. and started it on phab1004. it should be just fine." [puppet] - https://gerrit.wikimedia.org/r/857079 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[21:27:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P40200 and previous config saved to /var/cache/conftool/dbconfig/20221118-212702-ladsgroup.json
[21:27:16] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5017.eqsin.wmnet with reason: host reimage
[21:27:41] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - 10https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott)
[21:27:49] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] glance: use www_authenticate_uri [puppet] - 10https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott)
[21:27:55] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nova-compute.conf: add explanatory note about live_migration_uri [puppet] - 10https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott)
[21:32:12] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5017.eqsin.wmnet with reason: host reimage
[21:34:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P40201 and previous config saved to /var/cache/conftool/dbconfig/20221118-213437-ladsgroup.json
[21:34:42] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:39:38] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2065 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:39:58] <icinga-wm>	 PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 42 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:41:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Papaul)
[21:42:01] <wikibugs>	 (03PS3) 10Ssingh: cp5018: Set cp role via site.pp and related config [puppet] - 10https://gerrit.wikimedia.org/r/858640 (https://phabricator.wikimedia.org/T322048) (owner: 10BCornwall)
[21:42:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40202 and previous config saved to /var/cache/conftool/dbconfig/20221118-214208-ladsgroup.json
[21:42:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[21:42:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Papaul)
[21:42:15] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[21:42:16] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 214 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:42:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[21:42:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40203 and previous config saved to /var/cache/conftool/dbconfig/20221118-214230-ladsgroup.json
[21:43:34] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:45:18] <icinga-wm>	 RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 31 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:45:29] <wikibugs>	 (03PS1) 10Dzahn: phabricator: remove hardcoded ports, use parameters in my.cnf for admins [puppet] - 10https://gerrit.wikimedia.org/r/858656
[21:46:12] <wikibugs>	 (03PS2) 10Dzahn: phabricator: remove hardcoded ports, use parameters in my.cnf for admins [puppet] - 10https://gerrit.wikimedia.org/r/858656 (https://phabricator.wikimedia.org/T280597)
[21:46:18] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 180 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:47:50] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:49:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P40204 and previous config saved to /var/cache/conftool/dbconfig/20221118-214944-ladsgroup.json
[21:50:08] <wikibugs>	 (03PS3) 10Dzahn: phabricator: remove hardcoded ports, use parameters in my.cnf for admins [puppet] - 10https://gerrit.wikimedia.org/r/858656 (https://phabricator.wikimedia.org/T280597)
[21:52:51] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "noop https://puppet-compiler.wmflabs.org/output/858656/38346/" [puppet] - 10https://gerrit.wikimedia.org/r/858656 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[21:55:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "I'm going to be a bit more bold here and merge this and prove it's noop on clouddumps1002. We want to switch the phab host name on Monday " [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[21:55:57] <wikibugs>	 (03PS9) 10Dzahn: dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597)
[21:56:30] <wikibugs>	 (03CR) 10Dzahn: "After this I can switch the phab dump host from phab1001 to phab1004 where I have enabled dumping in https://gerrit.wikimedia.org/r/c/oper" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[21:59:09] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "compiling on C:profile::dumps::distribution::datasets::fetcher which then picks for me what the right host is, btw: I don't manually enter" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[21:59:22] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:00:04] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "and it shows there is no change, only the class parameters: https://puppet-compiler.wmflabs.org/output/852259/38347/clouddumps1002.wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[22:01:20] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:02:36] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "ran puppet on clouddumps1002. complete noop" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[22:03:06] <wikibugs>	 (03PS13) 10Dzahn: dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260
[22:04:21] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "needs update after https://gerrit.wikimedia.org/r/c/operations/puppet/+/852259/9 was merged" [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[22:04:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40205 and previous config saved to /var/cache/conftool/dbconfig/20221118-220421-ladsgroup.json
[22:04:28] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[22:04:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T323214)', diff saved to https://phabricator.wikimedia.org/P40206 and previous config saved to /var/cache/conftool/dbconfig/20221118-220450-ladsgroup.json
[22:04:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[22:05:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[22:05:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T323214)', diff saved to https://phabricator.wikimedia.org/P40207 and previous config saved to /var/cache/conftool/dbconfig/20221118-220512-ladsgroup.json
[22:05:58] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5017.eqsin.wmnet with OS buster
[22:06:05] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5017.eqsin.wmnet with OS buster completed: - cp5017 (**PASS**)   -...
[22:11:06] <icinga-wm>	 PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:12:06] <wikibugs>	 (03PS3) 10Dzahn: dumps/phabricator: switch phab dumps host from phab1001 to phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597)
[22:14:35] <wikibugs>	 (03PS4) 10Dzahn: dumps/phabricator: switch phab dumps host from phab1001 to phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597)
[22:16:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T323214)', diff saved to https://phabricator.wikimedia.org/P40209 and previous config saved to /var/cache/conftool/dbconfig/20221118-221612-ladsgroup.json
[22:16:43] <wikibugs>	 (03CR) 10Dzahn: "the dump service is running on phab1004 but waiting for it to complete:" [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[22:18:06] <wikibugs>	 (03PS1) 10BCornwall: node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658
[22:18:28] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[22:19:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P40210 and previous config saved to /var/cache/conftool/dbconfig/20221118-221927-ladsgroup.json
[22:19:28] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 134 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:19:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:21:22] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:26:23] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs-cinder-backup-manager: allow for less frequent backups [puppet] - 10https://gerrit.wikimedia.org/r/858659 (https://phabricator.wikimedia.org/T306200)
[22:31:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P40211 and previous config saved to /var/cache/conftool/dbconfig/20221118-223118-ladsgroup.json
[22:34:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P40212 and previous config saved to /var/cache/conftool/dbconfig/20221118-223434-ladsgroup.json
[22:39:14] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:41:44] <inflatador>	 icinga is having issues with me...or vice versa. hmm
[22:43:00] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1016 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.076 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:43:04] <icinga-wm>	 PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:46:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P40213 and previous config saved to /var/cache/conftool/dbconfig/20221118-224625-ladsgroup.json
[22:49:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40214 and previous config saved to /var/cache/conftool/dbconfig/20221118-224940-ladsgroup.json
[22:49:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[22:49:47] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[22:49:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[22:50:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T323214)', diff saved to https://phabricator.wikimedia.org/P40215 and previous config saved to /var/cache/conftool/dbconfig/20221118-225002-ladsgroup.json
[22:51:26] <icinga-wm>	 PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:01:03] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "Yea, I think it's fine. Will deploy next week though." [puppet] - 10https://gerrit.wikimedia.org/r/858297 (owner: 10Muehlenhoff)
[23:01:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T323214)', diff saved to https://phabricator.wikimedia.org/P40216 and previous config saved to /var/cache/conftool/dbconfig/20221118-230131-ladsgroup.json
[23:01:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[23:01:38] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[23:01:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[23:01:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T323214)', diff saved to https://phabricator.wikimedia.org/P40217 and previous config saved to /var/cache/conftool/dbconfig/20221118-230152-ladsgroup.json
[23:02:38] <wikibugs>	 (03PS1) 10Dzahn: dumps: remove phab1001 from rsync clients [puppet] - 10https://gerrit.wikimedia.org/r/858662
[23:05:39] <wikibugs>	 (03PS1) 10Dzahn: phabricator: stop creating public dump on phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/858663 (https://phabricator.wikimedia.org/T280597)
[23:06:51] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/output/824805/38348/clouddumps1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[23:07:27] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "dump finished and looks fine on phab1004 and stopping the dump script on phab1001 in https://gerrit.wikimedia.org/r/c/operations/puppet/+/" [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[23:07:48] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] dumps/phabricator: switch phab dumps host from phab1001 to phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[23:09:24] <wikibugs>	 (03PS2) 10Krinkle: build: Update to PHPUnit 9.5 and declare php requirement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858441 (https://phabricator.wikimedia.org/T235142)
[23:11:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T323214)', diff saved to https://phabricator.wikimedia.org/P40218 and previous config saved to /var/cache/conftool/dbconfig/20221118-231111-ladsgroup.json
[23:11:17] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[23:11:23] <wikibugs>	 (03CR) 10Krinkle: build: Upgrade symfony/yaml to 5.4.3, the version we use in prod (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793053 (owner: 10Jforrester)
[23:12:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T323214)', diff saved to https://phabricator.wikimedia.org/P40219 and previous config saved to /var/cache/conftool/dbconfig/20221118-231229-ladsgroup.json
[23:12:32] <mutante>	 !log clouddumps1001 - manually ran /usr/local/bin/dump-fetch-phabdumps.sh and confirmed fetching works from new phab host phab1004 after gerrit:824805 T280597
[23:13:36] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "manually ran /usr/local/bin/dump-fetch-phabdumps.sh on clouddumps1002 and confirmed fetching works from new phab host phab1004" [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[23:13:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:55] <stashbot>	 T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597
[23:14:47] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: stop creating public dump on phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/858663 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[23:15:36] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10wiki_willy) a:03Jclark-ctr ++@Jclark-ctr, since @Cmjohnson will be out for a while  >>! In T308339#8405694, @BTullis wrote: > @Cmjohnson - Let me know when you're ready to move an-tool1010 pl...
[23:17:04] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "timer/service removed on phab1004 by puppet. clean." [puppet] - 10https://gerrit.wikimedia.org/r/858663 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[23:17:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "on phab1001 I meant to say. it's active on phab1004" [puppet] - 10https://gerrit.wikimedia.org/r/858663 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[23:18:34] <wikibugs>	 (03PS2) 10Dzahn: dumps: remove phab1001 from rsync clients [puppet] - 10https://gerrit.wikimedia.org/r/858662 (https://phabricator.wikimedia.org/T280597)
[23:20:41] <wikibugs>	 (03CR) 10Dzahn: "now we can get back to this one next :)" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn)
[23:21:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[23:22:16] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:25:01] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[23:26:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P40220 and previous config saved to /var/cache/conftool/dbconfig/20221118-232618-ladsgroup.json
[23:27:31] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:27:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P40221 and previous config saved to /var/cache/conftool/dbconfig/20221118-232736-ladsgroup.json
[23:28:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-jumbo1013.mgmt.eqiad.wmnet with reboot policy FORCED
[23:28:07] <wikibugs>	 10SRE, 10Traffic: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 (10Krinkle)
[23:28:53] <wikibugs>	 (03PS4) 10Dzahn: phabricator: remove production role from phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597)
[23:29:01] <wikibugs>	 (03PS3) 10Dzahn: dumps: remove phab1001 from rsync clients [puppet] - 10https://gerrit.wikimedia.org/r/858662 (https://phabricator.wikimedia.org/T280597)
[23:29:07] <wikibugs>	 (03PS2) 10Dzahn: site: remove phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/858421 (https://phabricator.wikimedia.org/T323418)
[23:29:16] <wikibugs>	 (03PS2) 10Dzahn: mariadb: remove phab1001 from production-m3 grants [puppet] - 10https://gerrit.wikimedia.org/r/858419 (https://phabricator.wikimedia.org/T323418)
[23:29:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] phabricator: remove production role from phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[23:29:39] <wikibugs>	 (03PS2) 10Dzahn: phabricator: remove phab1001 as src_host from migration class [puppet] - 10https://gerrit.wikimedia.org/r/858420 (https://phabricator.wikimedia.org/T323418)
[23:33:08] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "noop https://puppet-compiler.wmflabs.org/output/858420/38349/" [puppet] - 10https://gerrit.wikimedia.org/r/858420 (https://phabricator.wikimedia.org/T323418) (owner: 10Dzahn)
[23:35:17] <wikibugs>	 (03PS5) 10Dzahn: O:phabricator: move common settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: 10Jbond)
[23:35:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:phabricator: move common settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: 10Jbond)
[23:35:46] <wikibugs>	 (03CR) 10Dzahn: "I will get back to this after Monday when phab1001 should not be production anymore." [puppet] - 10https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: 10Jbond)
[23:41:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P40222 and previous config saved to /var/cache/conftool/dbconfig/20221118-234124-ladsgroup.json
[23:42:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P40223 and previous config saved to /var/cache/conftool/dbconfig/20221118-234242-ladsgroup.json
[23:44:44] <mutante>	  /away laters
[23:47:24] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] gitlab_runner: make one Shared Runner canary [puppet] - 10https://gerrit.wikimedia.org/r/858188 (owner: 10Jelto)
[23:51:57] <icinga-wm>	 RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:56:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T323214)', diff saved to https://phabricator.wikimedia.org/P40225 and previous config saved to /var/cache/conftool/dbconfig/20221118-235631-ladsgroup.json
[23:56:37] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[23:57:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1013.mgmt.eqiad.wmnet with reboot policy FORCED
[23:57:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T323214)', diff saved to https://phabricator.wikimedia.org/P40226 and previous config saved to /var/cache/conftool/dbconfig/20221118-235749-ladsgroup.json
[23:57:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[23:58:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance