[00:03:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:15] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[00:12:45] (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:38:42] (PS2) Ssingh: [In case of emergency] depool eqsin for hardware refresh [dns] - https://gerrit.wikimedia.org/r/856664
[00:38:49] ^ rebase
[00:40:20] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[00:45:30] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Papaul) a:Cmjohnson→Papaul @BTullis thank you I will take over this task
[00:47:05] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:58] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-logging1005']
[00:51:30] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[01:01:41] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-logging1005']
[01:02:31] RECOVERY - Check
systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:04:21] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[01:04:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-logging1005']
[01:06:42] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (Papaul) @jbond if you have time tomorrow i did get the error below on kafka-logging1004. I checked the upgrade completed with no issue but the cook...
[01:08:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:20:18] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[01:21:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-logging1005']
[01:25:13] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2173']
[01:25:41] RECOVERY - IPMI Sensor Status on cp5032 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[01:26:06] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2173']
[01:34:02] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[01:34:35] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-logging1005']
[01:37:10] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[01:37:43] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-logging1005']
[01:37:45] (JobUnavailable) firing: (4) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:39:23] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[01:39:42] SRE, serviceops-radar, Patch-For-Review, Performance-Team (Radar), Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (Krinkle)
[01:40:39] SRE, serviceops-radar, Performance-Team (Radar), Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (Krinkle)
[01:40:52] SRE, serviceops-radar, Performance-Team (Radar), Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (Krinkle) Open→Declined Prioritising {T291015} instead.
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:42] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-logging1005']
[01:46:52] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:55:08] (PS1) Krinkle: build: Update to PHPUnit 9.5 and declare php requirement [mediawiki-config] - https://gerrit.wikimedia.org/r/858441 (https://phabricator.wikimedia.org/T235142)
[01:55:33] (CR) CI reject: [V: -1] build: Update to PHPUnit 9.5 and declare php requirement [mediawiki-config] - https://gerrit.wikimedia.org/r/858441 (https://phabricator.wikimedia.org/T235142) (owner: Krinkle)
[01:56:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-logging1005']
[02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:15:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:29] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1005']
[02:45:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-logging1005']
[02:56:13] SRE, ops-ulsfo, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH) The elevation google sheet has been updated with 11 of the 16 new cp hosts wired up. We couldn't wire up the last 5 due to msws only being 24 port (oversight by me in planning t...
[02:57:47] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (Papaul)
[03:03:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:09:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:10:31] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 874213 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[03:10:50] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (Papaul) @Volans i tried to run the reimage cookbook on kafka-logging1005 i am getting the error below ` Traceback (most recent call last): File "/...
[03:17:37] PROBLEM - SSH on db1123.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:35:31] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:52:38] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Papaul) @Jclark-ctr I have no node in netbox with the name kafka-jumbo1013 but i do have a node wmf10606 with purchase date 2022-06-07 that is set to offline in netbox. can y...
[03:54:22] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[03:56:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:01:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-jumbo1010.mgmt.eqiad.wmnet with reboot policy FORCED
[04:02:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:03:43] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Papaul) @Ottomata @BTullis what HW RAID are we using for those servers ? Thanks
[04:06:13] PROBLEM - SSH on an-coord1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:08:23] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:08:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:18:27] RECOVERY - SSH on db1123.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:27:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1010.mgmt.eqiad.wmnet with reboot policy FORCED
[04:28:50] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010']
[04:28:57] ops-eqiad: Port with no description on
access switch - https://phabricator.wikimedia.org/T321719 (phaultfinder)
[04:45:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1010']
[04:48:10] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Papaul)
[04:57:00] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Papaul)
[04:58:15] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Papaul)
[05:50:49] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:50:55] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:51:01] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:51:03] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:51:19] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:51:41] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:15:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:16:31] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect -
HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:20:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:25:25] Puppet, SRE, Data-Services, Infrastructure-Foundations, cloud-services-team (Kanban): clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (taavi)
[06:39:27] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[06:40:19] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:20:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:24:09] !log decom all Equinix SV8 BGP sessions - T321323
[07:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:41:41] (CR) Filippo Giunchedi: [C: +2] P:pontoon: include firewall rules to allow metricsinfra scraping (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/857023 (owner: Majavah)
[07:44:01] (PS1) Filippo Giunchedi: pontoon: don't double-define ferm rules for metricsinfra prometheus [puppet] - https://gerrit.wikimedia.org/r/858455
[07:45:51] (CR) Majavah: [C: +1] "oops, sorry about this" [puppet] - https://gerrit.wikimedia.org/r/858455 (owner: Filippo Giunchedi)
[07:47:11] (PS2) Filippo Giunchedi: pontoon: don't double-define ferm rules for metricsinfra prometheus [puppet] - https://gerrit.wikimedia.org/r/858455
[07:47:48] (CR) Filippo Giunchedi: pontoon: don't double-define ferm rules for metricsinfra prometheus (2 comments) [puppet] - https://gerrit.wikimedia.org/r/858455 (owner: Filippo Giunchedi)
[07:48:28] (CR) Majavah: [C: +1] "the provider class is empty now, but I guess that is fine and the class can be useful in the future?" [puppet] - https://gerrit.wikimedia.org/r/858455 (owner: Filippo Giunchedi)
[07:49:39] (CR) Filippo Giunchedi: [C: +2] pontoon: don't double-define ferm rules for metricsinfra prometheus (1 comment) [puppet] - https://gerrit.wikimedia.org/r/858455 (owner: Filippo Giunchedi)
[07:49:42] (CR) Filippo Giunchedi: [V: +2 C: +2] pontoon: don't double-define ferm rules for metricsinfra prometheus [puppet] - https://gerrit.wikimedia.org/r/858455 (owner: Filippo Giunchedi)
[07:51:20] (CR) Filippo Giunchedi: [C: +2] P:pontoon: include firewall rules to allow metricsinfra scraping (1 comment) [puppet] - https://gerrit.wikimedia.org/r/857023 (owner: Majavah)
[07:51:53] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:52:11] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:53:13] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 13 AdminDown: 0 Down: 0
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:53:15] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:53:19] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:53:27] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:56:51] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:58:01] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:59:52] (PS1) Slyngshede: Fix typing to allow Python 3.7 support. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/858457
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221118T0800)
[08:03:23] (CR) Slyngshede: "I'm a little unsure as to why the existing Debian build configuration wouldn't work. This patch does build on both Buster and Bullseye."
[software/bitu-ldap] - https://gerrit.wikimedia.org/r/858457 (owner: Slyngshede)
[08:07:33] (PS1) Marostegui: control-mariadb-10.4-bullseye: Upgrade to 10.4.27 [software] - https://gerrit.wikimedia.org/r/858458 (https://phabricator.wikimedia.org/T322620)
[08:10:01] RECOVERY - SSH on an-coord1002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:10:02] (CR) Marostegui: [C: +2] control-mariadb-10.4-bullseye: Upgrade to 10.4.27 [software] - https://gerrit.wikimedia.org/r/858458 (https://phabricator.wikimedia.org/T322620) (owner: Marostegui)
[08:10:35] (Merged) jenkins-bot: control-mariadb-10.4-bullseye: Upgrade to 10.4.27 [software] - https://gerrit.wikimedia.org/r/858458 (https://phabricator.wikimedia.org/T322620) (owner: Marostegui)
[08:12:13] (PS9) David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/856933
[08:28:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1019.eqiad.wmnet
[08:31:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5003.eqsin.wmnet
[08:31:45] (PS1) Elukey: benthos: reduce webrequest-live kafka partitions to read [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981)
[08:32:26] (CR) DCausse: WIP flink image (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: Ottomata)
[08:33:02] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38306/console" [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[08:35:59] (PS2) Elukey: benthos: reduce webrequest-live kafka partitions to read [puppet] -
https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981)
[08:37:00] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38307/console" [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[08:37:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1019.eqiad.wmnet
[08:37:12] !log shutdown SV8 port - T321323
[08:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:38] win 25
[08:38:41] lose some
[08:40:50] (PS1) Giuseppe Lavagetto: role::kubernetes::wroker: allow scap to pre-pull mediawiki images [puppet] - https://gerrit.wikimedia.org/r/858543 (https://phabricator.wikimedia.org/T323349)
[08:40:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5003.eqsin.wmnet
[08:41:11] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 45102
[08:41:27] (CR) Elukey: Add the pause image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/858345 (https://phabricator.wikimedia.org/T322920) (owner: Elukey)
[08:41:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 45102
[08:43:27] (CR) CI reject: [V: -1] role::kubernetes::wroker: allow scap to pre-pull mediawiki images [puppet] - https://gerrit.wikimedia.org/r/858543 (https://phabricator.wikimedia.org/T323349) (owner: Giuseppe Lavagetto)
[08:46:03] !log failover ganeti master in eqsin to ganeti5003
[08:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:48] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (Volans) Apart from the fact that the host is in
planned state in netbox and hence `--new` is required, the problem is that the DNS record is wrong in Ne...
[08:51:58] Puppet, Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (fgiunchedi) I can definitely relate with the long (and stressful!) cycles of Puppet patches you mention @bking and that one of my [[ https://wikitech.wikimedia.org/wiki/Pu...
[08:52:07] PROBLEM - ganeti-wconfd running on ganeti5001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[08:58:58] (CR) David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/856933 (owner: David Caro)
[09:01:18] (CR) Filippo Giunchedi: "LGTM, see inline for a nit" [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[09:06:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1019.eqiad.wmnet to cluster eqiad and group D
[09:08:13] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1019.eqiad.wmnet to cluster eqiad and group D
[09:08:37] (PS3) Elukey: benthos: reduce webrequest-live kafka partitions to read [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981)
[09:08:39] (CR) Elukey: benthos: reduce webrequest-live kafka partitions to read (1 comment) [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[09:13:32] (PS1) JMeybohm: aux-k8s: Remove obsolete hiera keys [puppet] - https://gerrit.wikimedia.org/r/858545
[09:14:02] (CR) JMeybohm: "Feel free to merge when you're happy with this" [puppet] - https://gerrit.wikimedia.org/r/858545 (owner: JMeybohm)
[09:16:49] !log push the 'k8s_116' tag for
docker-registry.discovery.wmnet/pause - T322920
[09:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:56] T322920: Import pause container image >= 3.5 (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T322920
[09:17:23] (CR) JMeybohm: [C: +1] "LGTM" [docker-images/production-images] - https://gerrit.wikimedia.org/r/858345 (https://phabricator.wikimedia.org/T322920) (owner: Elukey)
[09:21:10] (PS1) Elukey: k8s: pin the pause container image to the k8s_116 tag on staging [puppet] - https://gerrit.wikimedia.org/r/858546 (https://phabricator.wikimedia.org/T322920)
[09:21:52] !log nuke MediaWiki.objectcache.*_11ed_* - T323357
[09:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:15] (PS2) Elukey: k8s: pin the pause container image to the k8s_116 tag on staging [puppet] - https://gerrit.wikimedia.org/r/858546 (https://phabricator.wikimedia.org/T322920)
[09:22:32] T323357: Spam graphite metrics from MediaWiki.objectcache - https://phabricator.wikimedia.org/T323357
[09:22:44] (CR) FNegri: [C: +1] ceph.roll_restart_*daemons: allow ignoring current health issues (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/856933 (owner: David Caro)
[09:25:22] (PS3) Vgutierrez: cache: Remove wikiba.se caching rules [puppet] - https://gerrit.wikimedia.org/r/858408
[09:26:09] SRE, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (JMeybohm)
[09:26:35] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:30:23] (PS1) Bartosz Dziewoński: Don't run OutputPageBeforeHTML for the talkpageheader [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858319
(https://phabricator.wikimedia.org/T316175)
[09:32:35] brennen: thcipriani: i'm hoping to get this patch backported today: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/858319 , is that possible? (i guess you're both asleep now, can anyone else help?)
[09:32:41] (CR) Elukey: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38308/console" [puppet] - https://gerrit.wikimedia.org/r/858546 (https://phabricator.wikimedia.org/T322920) (owner: Elukey)
[09:34:21] (CR) Elukey: k8s: pin the pause container image to the k8s_116 tag on staging [puppet] - https://gerrit.wikimedia.org/r/858546 (https://phabricator.wikimedia.org/T322920) (owner: Elukey)
[09:37:33] !log installing ncurses security updates
[09:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:21] Amir1: _joe_: are you around maybe? i'm trying to get this patch backported: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/858319
[09:41:30] <_joe_> MatmaRex: and I guess you're not searching for my opinion on that patch, right? :P
[09:41:33] <_joe_> I'm here
[09:42:07] SRE, Traffic, Upstream: Wikipedia on flow with no http request, still responds with a Bad Request 400 - https://phabricator.wikimedia.org/T323263 (Vgutierrez) Stalled→In progress We were missing one config option in our HAProxy setup: `option http-ignore-probes`, after enabling it, HAProxy be...
[09:42:22] _joe_: heh, what is your opinion?
[09:43:36] i think it's a very simple low-risk change and i think the issue it fixes is bad enough to deploy it today (doubled-up and non-functional buttons on e.g.
https://fr.m.wikipedia.org/wiki/Discussion_Wikipédia:Accueil_principal)
[09:44:22] <_joe_> MatmaRex: oh my, sure, go on let's backport
[09:44:43] <_joe_> UI bugs of this nature should always be considered emergencies IMHO
[09:44:49] i don't have access, so i need someone to click the buttons / run the commands
[09:45:01] <_joe_> ah ok
[09:45:08] <_joe_> not even to backport the patch?
[09:45:20] nope
[09:45:21] (CR) Giuseppe Lavagetto: [C: +2] Don't run OutputPageBeforeHTML for the talkpageheader [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858319 (https://phabricator.wikimedia.org/T316175) (owner: Bartosz Dziewoński)
[09:45:38] <_joe_> ok
[09:45:43] (CR) Filippo Giunchedi: [C: +1] benthos: reduce webrequest-live kafka partitions to read [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[09:45:49] <_joe_> I'll test scap backport with this
[09:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:45:56] (PS1) Vgutierrez: cache::haproxy: Silently ignore probes [puppet] - https://gerrit.wikimedia.org/r/858550 (https://phabricator.wikimedia.org/T323263)
[09:46:09] (CR) Elukey: [C: +2] benthos: reduce webrequest-live kafka partitions to read [puppet] - https://gerrit.wikimedia.org/r/858542 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[09:46:16] thank you
[09:47:08] (PS2) Vgutierrez: cache::haproxy: Silently ignore probes [puppet] - https://gerrit.wikimedia.org/r/858550 (https://phabricator.wikimedia.org/T323263)
[09:47:37] (CR) Vgutierrez: [C: +2] cache: Remove wikiba.se caching rules [puppet] -
https://gerrit.wikimedia.org/r/858408 (owner: Vgutierrez)
[09:47:43] <_joe_> MatmaRex: please stay around so that when the change is merged we can test a few pages
[09:48:01] yeah
[09:49:17] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync-mgmt - filippo@cumin1001"
[09:50:20] (Merged) jenkins-bot: Don't run OutputPageBeforeHTML for the talkpageheader [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858319 (https://phabricator.wikimedia.org/T316175) (owner: Bartosz Dziewoński)
[09:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:51:12] !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync-mgmt - filippo@cumin1001"
[09:51:43] <_joe_> ok lemme try to deploy this
[09:52:16] (CR) TrainBranchBot: [C: +2] "Approved by oblivian@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/858319 (https://phabricator.wikimedia.org/T316175) (owner: Bartosz Dziewoński)
[09:52:28] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:858319|Don't run OutputPageBeforeHTML for the talkpageheader (T316175)]]
[09:52:37] T316175: Make the mobile Add Topic button easier for people to access - https://phabricator.wikimedia.org/T316175
[09:52:50] !log oblivian@deploy1002 oblivian and matmarex: Backport for [[gerrit:858319|Don't run OutputPageBeforeHTML for the talkpageheader (T316175)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[09:52:54] <_joe_> MatmaRex: can you test the patch
on mwdebug? [09:53:21] <_joe_> it seems to work in my tests [09:53:21] _joe_: yeah. looks good to me [09:53:34] <_joe_> ofc there will be a lot of cached pages that will not be fixed [09:53:45] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Silently ignore probes [puppet] - 10https://gerrit.wikimedia.org/r/858550 (https://phabricator.wikimedia.org/T323263) (owner: 10Vgutierrez) [09:53:52] <_joe_> I would say we'll let the community fix it [09:54:32] there shouldn't be too many, this was only broken for a couple of hours [09:54:56] well, less than a day, at least :) [09:57:58] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:858319|Don't run OutputPageBeforeHTML for the talkpageheader (T316175)]] (duration: 05m 29s) [09:58:11] T316175: Make the mobile Add Topic button easier for people to access - https://phabricator.wikimedia.org/T316175 [09:58:28] <_joe_> MatmaRex: done :) [09:58:29] (03PS2) 10Arturo Borrero Gonzalez: openstack: nova: compute: cleanup unused code [puppet] - 10https://gerrit.wikimedia.org/r/858327 [09:58:31] (03PS4) 10Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) [09:58:37] thank you _joe_! [09:59:53] _joe_: related question - can i update https://wikitech.wikimedia.org/wiki/Deployments/Emergencies to suggest messaging people on-call when tyler and the train owner aren't available? [10:00:10] <_joe_> MatmaRex: not really, IMHO [10:00:23] <_joe_> say there was an outage, I would've had to abandon the deployment [10:00:35] heh. well, i just did that. 
but alright :D [10:00:44] thanks for the help [10:00:46] <_joe_> I know :) [10:01:02] <_joe_> I'm just saying it shouldn't be a general rule, but I'll leave that to the managers [10:01:24] fair [10:03:44] 10SRE, 10Traffic, 10Patch-For-Review, 10Upstream: Wikipedia on flow with no http request, still responds with a Bad Request 400 - https://phabricator.wikimedia.org/T323263 (10Vgutierrez) 05In progress→03Resolved a:03Vgutierrez fix has been merged and it's being deployed, it should be available fleet... [10:04:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5001.eqsin.wmnet [10:06:05] (03CR) 10David Caro: [C: 03+1] openstack: nova: compute: cleanup unused code [puppet] - 10https://gerrit.wikimedia.org/r/858327 (owner: 10Arturo Borrero Gonzalez) [10:06:19] (03PS1) 10Elukey: benthos: use env() in webrequest_live's bloblang config [puppet] - 10https://gerrit.wikimedia.org/r/858551 (https://phabricator.wikimedia.org/T314981) [10:07:14] (03CR) 10Ayounsi: [C: 03+1] "ship it! 
(carefully) :)" [homer/public] - 10https://gerrit.wikimedia.org/r/857598 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [10:07:20] (03CR) 10Filippo Giunchedi: [C: 03+1] benthos: use env() in webrequest_live's bloblang config [puppet] - 10https://gerrit.wikimedia.org/r/858551 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [10:08:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: nova: compute: cleanup unused code [puppet] - 10https://gerrit.wikimedia.org/r/858327 (owner: 10Arturo Borrero Gonzalez) [10:08:14] (03CR) 10Ayounsi: [C: 03+1] Add OSPF automation template for EVPN switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [10:09:25] (03PS5) 10Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) [10:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:10:02] (03CR) 10Ayounsi: [C: 03+1] Add function to expose required device VRFs to Homer templates (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [10:10:49] (03CR) 10Elukey: [C: 03+2] benthos: use env() in webrequest_live's bloblang config [puppet] - 10https://gerrit.wikimedia.org/r/858551 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [10:13:02] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [10:13:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
ganeti5001.eqsin.wmnet [10:13:56] !log installing sysstat security updates [10:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:16:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:16:11] (03PS1) 10Elukey: benthos: discard the msg in webrequest_live if ip is unset [puppet] - 10https://gerrit.wikimedia.org/r/858552 (https://phabricator.wikimedia.org/T314981) [10:16:33] (03CR) 10JMeybohm: P:pontoon: include firewall rules to allow metricsinfra scraping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857023 (owner: 10Majavah) [10:17:52] (03PS2) 10Elukey: benthos: discard the msg in webrequest_live if ip is unset [puppet] - 10https://gerrit.wikimedia.org/r/858552 (https://phabricator.wikimedia.org/T314981) [10:18:53] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:18:55] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:19:57] (03CR) 10David Caro: "LGTM, can you run a pcc on it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [10:22:40] (03PS6) 10Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) [10:25:50] (03CR) 10Filippo Giunchedi: [C: 03+1] benthos: discard the msg in webrequest_live if ip is unset [puppet] - 10https://gerrit.wikimedia.org/r/858552 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [10:26:03] (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:27:01] (03PS7) 10Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) [10:28:01] (03PS8) 10Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) [10:29:02] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [10:31:00] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:31:03] (ProbeDown) resolved: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[10:31:38] (03CR) 10Elukey: [C: 03+2] benthos: discard the msg in webrequest_live if ip is unset [puppet] - 10https://gerrit.wikimedia.org/r/858552 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [10:33:57] 10SRE, 10Traffic, 10serviceops, 10Patch-For-Review: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (10Vgutierrez) ping? [10:34:45] !log draining ganeti1012 in preparation of server move to a new rack T308339 [10:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:56] T308339: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 [10:35:05] (03CR) 10Vgutierrez: [C: 03+1] prometheus: Remove old ats config export job [puppet] - 10https://gerrit.wikimedia.org/r/858418 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [10:41:12] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:12] (03CR) 10Vgutierrez: [C: 03+1] lvs4009: commission new LVS host (ulsfo hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/858336 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [10:42:20] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:47:43] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Stop spamming SAL with helmfile on scap deployments - https://phabricator.wikimedia.org/T323296 (10Clement_Goubert) Merge request on `scap` to pass the `SUPPRESS_SAL` variable https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/26 [10:49:08] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) 
[10:50:47] (03PS1) 10Jbond: P:pki: add new type calidation for ca names [puppet] - 10https://gerrit.wikimedia.org/r/858556 [10:54:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38316/console" [puppet] - 10https://gerrit.wikimedia.org/r/858556 (owner: 10Jbond) [11:06:12] (03PS9) 10Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) [11:10:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, nice!" [puppet] - 10https://gerrit.wikimedia.org/r/858556 (owner: 10Jbond) [11:12:19] (03PS10) 10Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) [11:14:58] 10SRE, 10Traffic: Improve handling/logging of HAproxy emergency log messages - https://phabricator.wikimedia.org/T306236 (10Vgutierrez) 05In progress→03Resolved a:03Vgutierrez [11:15:33] (03Abandoned) 10Alexandros Kosiaris: arclamp: Add role contact information [puppet] - 10https://gerrit.wikimedia.org/r/854985 (owner: 10Alexandros Kosiaris) [11:16:49] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/858328/38318/" [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:18:25] 10SRE, 10Traffic: Rename role::cache::(text|upload)_haproxy to role::cache::(text|upload) - https://phabricator.wikimedia.org/T323365 (10Vgutierrez) [11:19:01] 10SRE, 10Traffic: Rename role::cache::(text|upload)_haproxy to role::cache::(text|upload) - https://phabricator.wikimedia.org/T323365 (10Vgutierrez) p:05Triage→03Medium [11:20:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Multiple RAID battery failures on hadoop worker hosts - 
https://phabricator.wikimedia.org/T318659 (10BTullis) >>! In T318659#8403396, @RobH wrote: > @btullis: I've gone ahead and requested quotation to get replacement... [11:21:01] (03PS11) 10Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) [11:21:48] MatmaRex: thanks for the fix, let me know if you still require my services [11:23:04] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/858328/38319/" [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:25:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10jbond) >>! In T313960#8404684, @Papaul wrote: > @jbond if you have time tomorrow i did get the error below on kafka-logging1004. I checked the upgr... 
[11:27:23] !log Starting decommission of apple-search service - T316296 [11:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:43] T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 [11:28:07] (03PS2) 10Clément Goubert: apple-search: remove discovery record [dns] - 10https://gerrit.wikimedia.org/r/858207 (https://phabricator.wikimedia.org/T316296) [11:28:17] (03PS10) 10Clément Goubert: apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) [11:28:55] (03CR) 10Giuseppe Lavagetto: [C: 03+1] apple-search: remove discovery record [dns] - 10https://gerrit.wikimedia.org/r/858207 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [11:29:20] (03CR) 10Giuseppe Lavagetto: [C: 03+1] apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [11:31:30] (03CR) 10Clément Goubert: [C: 03+2] apple-search: remove discovery record [dns] - 10https://gerrit.wikimedia.org/r/858207 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [11:31:42] !log installing Linux 4.19.260 on Buster systems [11:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:22] (03CR) 10David Caro: [C: 03+1] cloudvirts: introduce modern NIC setup and use it by default (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:34:24] !log Running authdns-update - T316296 [11:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:31] T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 [11:36:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring 
[11:36:37] <_joe_> Amir1: ^^ lists.w.o down again? [11:36:43] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] cloudvirts: introduce modern NIC setup and use it by default [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:37:08] I check [11:37:12] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:37:37] (03CR) 10Giuseppe Lavagetto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [11:37:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:38:26] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:31] I haven't done anything yet [11:38:36] probably another scraper [11:38:38] (03CR) 10Clément Goubert: [C: 03+2] apple-search: Switch lvs state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [11:38:40] let me check logs [11:38:45] (03PS1) 10Jbond: upgrade-firmware: small fix to ensure files get saved in the correct path [cookbooks] - 10https://gerrit.wikimedia.org/r/858559 [11:38:57] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+1] apple-search: Switch lvs state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [11:39:06] (03PS2) 10Jbond: upgrade-firmware: small fix to ensure files get saved in the correct path [cookbooks] - 10https://gerrit.wikimedia.org/r/858559 [11:41:28] !log Switching apple-search to state:lvs_setup - T316296 
[11:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:39] T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 [11:42:58] (03PS1) 10Arturo Borrero Gonzalez: wmcs: proxy: don't fail if killing the proxy fails [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/858560 [11:44:12] PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:47:31] (03CR) 10David Caro: wmcs: proxy: don't fail if killing the proxy fails (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/858560 (owner: 10Arturo Borrero Gonzalez) [11:49:43] (03PS3) 10Jbond: upgrade-firmware: small fix to ensure files get saved in the correct path [cookbooks] - 10https://gerrit.wikimedia.org/r/858559 [11:50:07] (03PS1) 10Muehlenhoff: Retire conf-lvm partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/858562 (https://phabricator.wikimedia.org/T156955) [11:50:57] (03PS4) 10Jbond: upgrade-firmware: small fix to ensure files get saved in the correct path [cookbooks] - 10https://gerrit.wikimedia.org/r/858559 [11:51:03] (03CR) 10Arturo Borrero Gonzalez: wmcs: proxy: don't fail if killing the proxy fails (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/858560 (owner: 10Arturo Borrero Gonzalez) [11:51:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10BTullis) >>! In T318696#8358171, @Ottomata wrote: > @BTullis can/should we just remove those nodes as Hadoop workers and reimage them as DS... 
[11:53:51] !log Switching apple-search to state:service_setup - T316296 [11:53:54] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] apple-search: Remove service from lb and backend [puppet] - 10https://gerrit.wikimedia.org/r/857691 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [11:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:01] T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 [11:54:12] (03PS3) 10Clément Goubert: apple-search: Remove service from lb and backend [puppet] - 10https://gerrit.wikimedia.org/r/857691 (https://phabricator.wikimedia.org/T316296) [11:57:40] (03CR) 10Jbond: [C: 03+2] upgrade-firmware: small fix to ensure files get saved in the correct path [cookbooks] - 10https://gerrit.wikimedia.org/r/858559 (owner: 10Jbond) [12:01:33] !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on D{lvs2010.codfw.wmnet,lvs1020.eqiad.wmnet} and A:lvs [12:01:57] !log installing libgoogle-gson-java security updates [12:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:32] (03PS2) 10Vgutierrez: role::cache: Link/copy (text|upload)_haproxy to base roles [puppet] - 10https://gerrit.wikimedia.org/r/849180 (https://phabricator.wikimedia.org/T323365) (owner: 10BBlack) [12:02:36] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2001-dev.codfw.wmnet with OS bullseye [12:02:52] <_joe_> claime: uhm the cookbook is waiting for the IPVS_diffs_check to recover [12:03:07] <_joe_> which it won't unless we run ipvsadm [12:03:08] (03CR) 10CI reject: [V: 04-1] role::cache: Link/copy (text|upload)_haproxy to base roles [puppet] - 10https://gerrit.wikimedia.org/r/849180 (https://phabricator.wikimedia.org/T323365) (owner: 10BBlack) [12:03:15] _joe_: ack [12:03:36] _joe_: But we're supposed to do that after restarting pybal on the primary [12:04:34] Or do I just ipvsadm 
--delete-service --tcp-service addr:port on lvs2010.codfw.wmnet [12:04:48] (03PS3) 10Vgutierrez: role::cache: Link/copy (text|upload)_haproxy to base roles [puppet] - 10https://gerrit.wikimedia.org/r/849180 (https://phabricator.wikimedia.org/T323365) (owner: 10BBlack) [12:04:53] <_joe_> that seems a very modern syntax, but yes [12:05:02] cgoubert@lvs2010:~$ sudo ipvsadm --delete-service --tcp-service 10.2.1.68:4013 [12:05:14] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.68:4013]) https://wikitech.wikimedia.org/wiki/PyBal [12:05:23] done [12:05:27] Sorry for alert noise [12:05:30] , !log? :) [12:05:38] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.68:4013]) https://wikitech.wikimedia.org/wiki/PyBal [12:05:47] <_joe_> vgutierrez: see above, pybal restart [12:05:53] <_joe_> ah you mean claime [12:05:56] <_joe_> yeah :) [12:06:01] yeah [12:06:05] !log cgoubert@lvs2010:~$ sudo ipvsadm --delete-service --tcp-service 10.2.1.68:4013 [12:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:09] thx <3 [12:06:13] I was doing it :') [12:06:27] E_WOULDBLOCK lol [12:06:29] Just fat fingering [12:06:54] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.68:4013]) https://wikitech.wikimedia.org/wiki/PyBal [12:07:42] _joe_: ==> Failed to downtime hosts: Not all services are recovered: lvs2010:PyBal IPVS diff che [12:07:45] go ? 
[12:07:47] (03CR) 10David Caro: wmcs: proxy: don't fail if killing the proxy fails (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/858560 (owner: 10Arturo Borrero Gonzalez) [12:08:06] thx [12:08:10] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38323/console" [puppet] - 10https://gerrit.wikimedia.org/r/849180 (https://phabricator.wikimedia.org/T323365) (owner: 10BBlack) [12:08:24] <_joe_> claime: yeah you can do the same (with the correct IP) on 1020 [12:08:33] yep [12:08:53] !log cgoubert@lvs1020:~$ sudo ipvsadm --delete-service --tcp-service 10.2.2.68:4013 [12:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:58] done [12:09:18] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:09:26] !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on D{lvs2010.codfw.wmnet,lvs1020.eqiad.wmnet} and A:lvs [12:09:29] _joe_: So that ipvsadm has to be run on all lvs afterwards right ? [12:09:35] <_joe_> yes [12:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:10:01] <_joe_> ok, let me run on the primaries I guess? [12:10:02] (03PS4) 10Vgutierrez: role::cache: Link/copy (text|upload)_haproxy to base roles [puppet] - 10https://gerrit.wikimedia.org/r/849180 (https://phabricator.wikimedia.org/T323365) (owner: 10BBlack) [12:10:30] !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on D{lvs2009.codfw.wmnet,lvs1019.eqiad.wmnet} and A:lvs [12:11:30] same on my side, run the ipvsadm ? 
[12:11:35] <_joe_> yes [12:11:36] <_joe_> lvs2009 [12:11:46] (03PS2) 10Arturo Borrero Gonzalez: wmcs: proxy: only mark the proxy as started if it didn't fail [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/858560 [12:12:04] !log cgoubert@lvs2009:~$ sudo ipvsadm --delete-service --tcp-service 10.2.1.68:4013 [12:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:10] done [12:12:43] ready for lvs1019 [12:12:50] <_joe_> yeah, sigh icinga [12:14:23] (03CR) 10Vgutierrez: "BBlack I've addressed your prometheus::ops concerns and the PCC still reports a NOOP, we just need to be careful and remove duplicated yam" [puppet] - 10https://gerrit.wikimedia.org/r/849180 (https://phabricator.wikimedia.org/T323365) (owner: 10BBlack) [12:14:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:15:38] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:16:06] (03PS1) 10Jbond: admin: add all data center ops users to datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/858567 [12:16:08] (03PS1) 10Jbond: P:spicerack: ensure firmware directory is writable by dc-ops [puppet] - 10https://gerrit.wikimedia.org/r/858568 [12:17:00] !log cgoubert@lvs1019:~$ sudo ipvsadm --delete-service --tcp-service 10.2.2.68:4013 [12:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:38] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:17:44] !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on D{lvs2009.codfw.wmnet,lvs1019.eqiad.wmnet} 
and A:lvs [12:18:57] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2001-dev.codfw.wmnet with reason: host reimage [12:21:40] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2001-dev.codfw.wmnet with reason: host reimage [12:22:23] !log apple-search removed from backends - T316296 [12:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:38] T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 [12:22:57] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [12:23:11] looking ^ [12:24:51] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: ensure folderes are group writable [cookbooks] - 10https://gerrit.wikimedia.org/r/858569 [12:25:46] (03CR) 10Clément Goubert: [C: 03+2] apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [12:26:16] !log Clean up apple-search DNS - T316296 [12:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:31] !log cgoubert@authdns1001:~$ sudo -i authdns-update [12:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/858567 (owner: 10Jbond) [12:27:36] (03CR) 10Jbond: [C: 03+2] admin: add all data center ops users to datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/858567 (owner: 10Jbond) [12:28:03] (03PS2) 10Jbond: admin: add all data center ops users to datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/858567 [12:28:05] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [12:28:46] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] apple-search: Remove service from service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/857706 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [12:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:28:59] (03PS3) 10Clément Goubert: apple-search: Remove service from service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/857706 (https://phabricator.wikimedia.org/T316296) [12:29:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38324/console" [puppet] - 10https://gerrit.wikimedia.org/r/858568 (owner: 10Jbond) [12:30:01] RECOVERY - puppet last run on sretest1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:30:39] (03CR) 10Esanders: Don't run OutputPageBeforeHTML for the talkpageheader (031 comment) [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 
10https://gerrit.wikimedia.org/r/858319 (https://phabricator.wikimedia.org/T316175) (owner: 10Bartosz Dziewoński) [12:30:53] !log Removing apple-search from service::catalog - T316296 [12:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:08] T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 [12:32:20] (03PS2) 10Jbond: P:spicerack: ensure firmware directory is writable by dc-ops [puppet] - 10https://gerrit.wikimedia.org/r/858568 [12:32:39] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:spicerack: ensure firmware directory is writable by dc-ops [puppet] - 10https://gerrit.wikimedia.org/r/858568 (owner: 10Jbond) [12:33:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:37:02] (03CR) 10Clément Goubert: [C: 03+2] apple-search: Remove apple-search from conftool [puppet] - 10https://gerrit.wikimedia.org/r/858286 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [12:37:13] !log Removing apple-search from conftool - T316296 [12:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:22] T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 [12:38:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:40:00] 10SRE, 10Discovery-Search, 10serviceops, 10serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Clement_Goubert) apple-search removed from DNS, LVS, service::catalog and 
conftool. Starting removal from wikikube and deployment-charts. [12:41:48] !log Starting apple-search removal from wikikube - T316296 [12:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:29] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: ensure folderes are group writable [cookbooks] - 10https://gerrit.wikimedia.org/r/858569 (owner: 10Jbond) [12:43:10] !log cgoubert@deploy1002:/apple-search$ helmfile -e staging -i destroy - T316296 [12:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:25] T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 [12:45:13] !log cgoubert@deploy1002:/apple-search$ helmfile -e eqiad -i destroy - T316296 [12:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:58] !log cgoubert@deploy1002:/apple-search$ helmfile -e codfw -i destroy - T316296 [12:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:07] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2001-dev.codfw.wmnet with OS bullseye [12:46:40] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: ensure folderes are group writable [cookbooks] - 10https://gerrit.wikimedia.org/r/858569 (owner: 10Jbond) [12:49:35] (03CR) 10Clément Goubert: "This change is ready for review." 
[labs/private] - 10https://gerrit.wikimedia.org/r/858573 (owner: 10Clément Goubert) [12:51:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:56:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:57:49] 10SRE, 10serviceops, 10Patch-For-Review: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929 (10akosiaris) I am not sure what type of coordination is needed from #SRE either. Maybe just making sure that 1 or 2 SREs are around when the patch is... [12:58:09] (03CR) 10Clément Goubert: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/858575 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [13:00:29] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt2002-dev: move to the modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/858579 [13:02:19] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/858579/38325/" [puppet] - 10https://gerrit.wikimedia.org/r/858579 (owner: 10Arturo Borrero Gonzalez) [13:04:56] (03CR) 10Ladsgroup: [C: 03+1] P:mediawiki::maintenance: CampaignEvents periodic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858346 (https://phabricator.wikimedia.org/T320403) (owner: 10Clément Goubert) [13:06:34] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [13:07:42] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [13:08:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:08:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [13:08:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:08:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40124 and previous config saved to /var/cache/conftool/dbconfig/20221118-130829-ladsgroup.json [13:08:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [13:08:47] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [13:08:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:09:49] (03PS2) 10Clément Goubert: P:mediawiki::maintenance: CampaignEvents periodic [puppet] - 10https://gerrit.wikimedia.org/r/858346 (https://phabricator.wikimedia.org/T320403) [13:10:35] (03CR) 10Clément Goubert: P:mediawiki::maintenance: CampaignEvents periodic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858346 (https://phabricator.wikimedia.org/T320403) (owner: 10Clément Goubert) [13:13:06] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38326/console" [puppet] - 10https://gerrit.wikimedia.org/r/858346 (https://phabricator.wikimedia.org/T320403) (owner: 10Clément Goubert) [13:13:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:14:35] (03CR) 10David Caro: [C: 03+1] "lgtm, might be interesting at some point to have that info though." [puppet] - 10https://gerrit.wikimedia.org/r/855970 (https://phabricator.wikimedia.org/T271096) (owner: 10Arturo Borrero Gonzalez) [13:14:39] (03PS1) 10Jbond: nodegen: skip new files when processing auto host selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/858581 (https://phabricator.wikimedia.org/T323282) [13:14:41] (03PS1) 10Jbond: html: add additional retrun code descriptions [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/858582 [13:14:56] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2002-dev.codfw.wmnet with OS bullseye [13:15:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "Very nice! 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/858562 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [13:15:09] (03CR) 10Vivian Rook: [C: 03+1] cloudvirt2002-dev: move to the modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/858579 (owner: 10Arturo Borrero Gonzalez) [13:15:27] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] cloudvirt2002-dev: move to the modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/858579 (owner: 10Arturo Borrero Gonzalez) [13:16:10] RECOVERY - cassandra-a CQL 10.64.16.74:9042 on aqs1017 is OK: TCP OK - 0.000 second response time on 10.64.16.74 port 9042 https://phabricator.wikimedia.org/T93886 [13:16:35] (03PS3) 10Arturo Borrero Gonzalez: prometheus: drop cloudvirt ceph metrics generator [puppet] - 10https://gerrit.wikimedia.org/r/855970 (https://phabricator.wikimedia.org/T271096) [13:19:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] prometheus: drop cloudvirt ceph metrics generator [puppet] - 10https://gerrit.wikimedia.org/r/855970 (https://phabricator.wikimedia.org/T271096) (owner: 10Arturo Borrero Gonzalez) [13:21:01] (03PS2) 10Jbond: nodegen: skip new files when processing auto host selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/858581 (https://phabricator.wikimedia.org/T323282) [13:21:03] (03PS2) 10Jbond: html: add additional retrun code descriptions [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/858582 [13:21:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:21:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:21:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T318605)', diff saved to https://phabricator.wikimedia.org/P40125 and previous config saved to /var/cache/conftool/dbconfig/20221118-132141-ladsgroup.json [13:22:04] T318605: Deploy 
new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:27:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [13:27:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [13:29:56] (03CR) 10Muehlenhoff: [C: 03+2] Retire conf-lvm partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/858562 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [13:31:38] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage [13:32:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40126 and previous config saved to /var/cache/conftool/dbconfig/20221118-133203-ladsgroup.json [13:32:11] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [13:35:18] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage [13:37:31] (03PS1) 10Muehlenhoff: Retire two k8s Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/858587 (https://phabricator.wikimedia.org/T156955) [13:37:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:42:24] (03PS1) 10Muehlenhoff: Remove dumpsdata100XH750.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/858589 [13:42:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:43:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318605)', diff saved to https://phabricator.wikimedia.org/P40127 and previous config saved to /var/cache/conftool/dbconfig/20221118-134334-ladsgroup.json [13:43:43] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:44:50] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/858587 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [13:46:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [13:46:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [13:46:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T323214)', diff saved to https://phabricator.wikimedia.org/P40128 and previous config saved to /var/cache/conftool/dbconfig/20221118-134633-ladsgroup.json [13:47:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P40129 and previous config saved to /var/cache/conftool/dbconfig/20221118-134709-ladsgroup.json [13:47:53] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [13:48:24] (03PS3) 10Jbond: html: add additional return code descriptions [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/858582 [13:51:16] (03CR) 10Jbond: [C: 03+2] nodegen: skip new files when processing auto host selector [software/puppet-compiler] - 
10https://gerrit.wikimedia.org/r/858581 (https://phabricator.wikimedia.org/T323282) (owner: 10Jbond) [13:51:20] (03CR) 10Jbond: [C: 03+2] html: add additional return code descriptions [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/858582 (owner: 10Jbond) [13:53:33] (03Merged) 10jenkins-bot: nodegen: skip new files when processing auto host selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/858581 (https://phabricator.wikimedia.org/T323282) (owner: 10Jbond) [13:53:35] (03Merged) 10jenkins-bot: html: add additional return code descriptions [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/858582 (owner: 10Jbond) [13:56:32] (03PS1) 10Jbond: puppet_compiler: bump version to 2.5.2 [puppet] - 10https://gerrit.wikimedia.org/r/858594 [13:57:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38327/console" [puppet] - 10https://gerrit.wikimedia.org/r/858594 (owner: 10Jbond) [13:58:02] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet_compiler: bump version to 2.5.2 [puppet] - 10https://gerrit.wikimedia.org/r/858594 (owner: 10Jbond) [13:58:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P40130 and previous config saved to /var/cache/conftool/dbconfig/20221118-135841-ladsgroup.json [13:59:22] (03CR) 10JHathaway: "looks great, thanks for cleaning this up!" 
[puppet] - 10https://gerrit.wikimedia.org/r/858545 (owner: 10JMeybohm) [13:59:26] (03CR) 10JHathaway: [C: 03+2] aux-k8s: Remove obsolete hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/858545 (owner: 10JMeybohm) [14:01:53] (03CR) 10Filippo Giunchedi: "FYI this is ready to submit but hasn't yet" [puppet] - 10https://gerrit.wikimedia.org/r/858266 (owner: 10Jbond) [14:02:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P40131 and previous config saved to /var/cache/conftool/dbconfig/20221118-140216-ladsgroup.json [14:04:24] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2002-dev.codfw.wmnet with OS bullseye [14:04:49] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:06:00] (03CR) 10Jbond: [C: 03+2] Revert "hieradata: move multirootca standard settings to profile" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858266 (owner: 10Jbond) [14:07:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T323214)', diff saved to https://phabricator.wikimedia.org/P40132 and previous config saved to /var/cache/conftool/dbconfig/20221118-140749-ladsgroup.json [14:07:59] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [14:09:18] (03CR) 10Jbond: [C: 03+2] Revert "hieradata: move multirootca standard settings to profile" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858266 (owner: 10Jbond) [14:09:41] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki: add new type calidation for ca names [puppet] - 10https://gerrit.wikimedia.org/r/858556 (owner: 10Jbond) [14:10:52] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10BTullis) @Cmjohnson - Let me 
know when you're ready to move an-tool1010 please. I'll schedule a maintenance window for Superset and shut it down for you. Am I right in assuming that you'll want... [14:13:12] 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (10bking) > I don't think there is a productive and actionable outcome of the discussion in this task, nor that we've made progress in the discussion. I would suggest we clo... [14:13:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P40133 and previous config saved to /var/cache/conftool/dbconfig/20221118-141347-ladsgroup.json [14:13:57] (03PS1) 10Hashar: Plugin to customize Zuul reports [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/858598 [14:14:26] (03CR) 10CI reject: [V: 04-1] Plugin to customize Zuul reports [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/858598 (owner: 10Hashar) [14:17:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40134 and previous config saved to /var/cache/conftool/dbconfig/20221118-141722-ladsgroup.json [14:17:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [14:17:36] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [14:17:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [14:17:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40135 and previous config saved to /var/cache/conftool/dbconfig/20221118-141744-ladsgroup.json [14:18:05] 10Puppet,
10Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (10bking) >>! In T321874#8404989, @fgiunchedi wrote: > I can definitely relate with the long (and stressful!) cycles of Puppet patches you mention @bking and that one of my [... [14:19:11] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt2003-dev: move to the modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/858600 [14:21:33] 10SRE, 10Patch-For-Review, 10User-fgiunchedi: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955 (10fgiunchedi) I was reviewing this work again and realized the audit command should be updated. The situation in puppet.git as of `348f4a06ed` is reported below. ` $ git grep -h -o 'p... [14:22:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P40136 and previous config saved to /var/cache/conftool/dbconfig/20221118-142255-ladsgroup.json [14:24:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt2003-dev: move to the modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/858600 (owner: 10Arturo Borrero Gonzalez) [14:25:02] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2003-dev.codfw.wmnet with OS bullseye [14:25:10] 10SRE, 10Wikimedia-Portals, 10Wikimedia-Site-requests, 10Security, 10Vuln-XSS: Malicious meta admin can add javascript to https://office.wikimedia.org/api/ . 
Move api listing off wiki - https://phabricator.wikimedia.org/T109147 (10Bawolff) [14:28:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318605)', diff saved to https://phabricator.wikimedia.org/P40137 and previous config saved to /var/cache/conftool/dbconfig/20221118-142854-ladsgroup.json [14:29:04] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:30:15] !log initiating Cassandra bootstrap, aqs1017-b -- T307802 [14:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:25] T307802: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 [14:31:25] MatmaRex: community people are complaining about "ext-discussiontools-init-lede-button-container" element, is it tracked? [14:31:37] https://usercontent.irccloud-cdn.com/file/GsTIPWT6/image.png [14:31:58] can't see it in T316175 [14:31:58] RECOVERY - cassandra-b service on aqs1017 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:31:58] T316175: Make the mobile Add Topic button easier for people to access - https://phabricator.wikimedia.org/T316175 [14:32:12] Amir1: ugh [14:33:20] Amir1: apparently https://phabricator.wikimedia.org/T323341 . i haven't seen this before [14:34:23] it seems this is in all pages in fawiki in mobile now [14:34:34] not sure about articles too [14:34:35] let me check [14:35:00] not articles [14:35:07] Amir1: all talk pages, surely? [14:35:13] yeah [14:35:20] looks like we missed some if() somewhere [14:35:22] Should we fix it, etc.
[14:35:27] I can help backporting [14:35:33] (03PS1) 10Muehlenhoff: alertmanager: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858603 (https://phabricator.wikimedia.org/T308013) [14:35:35] (03PS1) 10Muehlenhoff: analytics::refinery: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858604 (https://phabricator.wikimedia.org/T308013) [14:35:37] (03PS1) 10Muehlenhoff: webperf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858605 (https://phabricator.wikimedia.org/T308013) [14:35:39] (03PS1) 10Muehlenhoff: Add SPDX headers to various IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/858606 (https://phabricator.wikimedia.org/T308013) [14:36:29] probably [14:38:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P40138 and previous config saved to /var/cache/conftool/dbconfig/20221118-143802-ladsgroup.json [14:38:12] Amir1: i'll submit a patch in a minute, let me just make sure i've got the conditions right [14:38:35] SGTM [14:41:59] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: host reimage [14:42:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40139 and previous config saved to /var/cache/conftool/dbconfig/20221118-144239-ladsgroup.json [14:43:35] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [14:45:13] RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:45:27] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: host reimage [14:47:53] RECOVERY - cassandra-b SSL 10.64.16.78:7001 on aqs1017 is OK: SSL OK - 
Certificate aqs1017-b valid until 2024-11-08 15:06:22 +0000 (expires in 721 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:48:09] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/858371 [14:50:10] (03PS24) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [14:50:46] Amir1: the fix is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/858608/ , but i don't think anyone else from my team is around at the moment. [14:53:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T323214)', diff saved to https://phabricator.wikimedia.org/P40140 and previous config saved to /var/cache/conftool/dbconfig/20221118-145308-ladsgroup.json [14:53:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [14:53:16] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [14:53:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [14:53:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T323214)', diff saved to https://phabricator.wikimedia.org/P40141 and previous config saved to /var/cache/conftool/dbconfig/20221118-145330-ladsgroup.json [14:54:01] !log installing node-minimist security updates [14:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:14] MatmaRex: let me know once it's merged [14:57:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P40142 and previous config saved to 
/var/cache/conftool/dbconfig/20221118-145746-ladsgroup.json [14:58:28] (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858603 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:01:06] (03PS1) 10Filippo Giunchedi: graphite: start mirroring traffic to graphite2004 [puppet] - 10https://gerrit.wikimedia.org/r/858610 (https://phabricator.wikimedia.org/T315524) [15:01:25] PROBLEM - SSH on mw1328.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:07:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:08:40] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2003-dev.codfw.wmnet with OS bullseye [15:09:41] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:10:05] (03PS1) 10Filippo Giunchedi: hiera: add graphite2004 to codfw graphite queries [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524) [15:10:39] (03CR) 10Filippo Giunchedi: "To be merged once graphite2004 is in sync" [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi) [15:10:56] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:11:17] (03PS25) 
10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [15:11:35] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:12:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P40143 and previous config saved to /var/cache/conftool/dbconfig/20221118-151252-ladsgroup.json [15:12:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:14:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T323214)', diff saved to https://phabricator.wikimedia.org/P40144 and previous config saved to /var/cache/conftool/dbconfig/20221118-151458-ladsgroup.json [15:15:10] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [15:17:32] (03CR) 10JMeybohm: [C: 03+1] k8s: pin the pause container image to the k8s_116 tag on staging [puppet] - 10https://gerrit.wikimedia.org/r/858546 (https://phabricator.wikimedia.org/T322920) (owner: 10Elukey) [15:18:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:18:17] (03CR) 10Elukey: [C: 03+2] k8s: pin the pause container image to the k8s_116 tag on staging [puppet] - 10https://gerrit.wikimedia.org/r/858546 
(https://phabricator.wikimedia.org/T322920) (owner: 10Elukey) [15:19:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Papaul) @jbon I think the issue was with what @Volans mentioned above. Didn't have the issue with another node that I worked with yesterday (kafka-... [15:19:39] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [15:19:53] (03CR) 10Arturo Borrero Gonzalez: wmcs: proxy: only mark the proxy as started if it didn't fail (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/858560 (owner: 10Arturo Borrero Gonzalez) [15:22:08] Amir1: merged, i think we could backport it [15:22:32] sounds good, do you want to do the honours? [15:24:09] (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [15:24:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-logging1005.eqiad.wmnet with OS bullseye [15:24:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-logging1005.eqiad.wmnet with OS bullseye [15:25:52] (03PS1) 10Ladsgroup: Don't add lede button if mobile DiscussionTools not enabled [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858320 (https://phabricator.wikimedia.org/T323341) [15:25:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:26:02] MatmaRex: one mwf.10? ^ [15:26:59] yes. new feature [15:27:10] (03CR) 10Bartosz Dziewoński: [C: 03+1] Don't add lede button if mobile DiscussionTools not enabled [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858320 (https://phabricator.wikimedia.org/T323341) (owner: 10Ladsgroup) [15:27:23] (03CR) 10Ladsgroup: [C: 03+2] Don't add lede button if mobile DiscussionTools not enabled [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858320 (https://phabricator.wikimedia.org/T323341) (owner: 10Ladsgroup) [15:27:27] let's go [15:27:36] Amir1: i don't have deployment access, i can't do the honours :) [15:27:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40145 and previous config saved to /var/cache/conftool/dbconfig/20221118-152758-ladsgroup.json [15:28:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [15:28:02] we should fix that, let's work on that next week [15:28:07] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [15:28:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [15:28:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T323214)', diff saved to https://phabricator.wikimedia.org/P40146 and previous config saved to /var/cache/conftool/dbconfig/20221118-152820-ladsgroup.json [15:30:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P40147 and previous config saved to /var/cache/conftool/dbconfig/20221118-153005-ladsgroup.json [15:30:59] (KubernetesAPILatency) resolved: 
High Kubernetes API latency (PATCH events) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:31:04] Amir1: ew no. it's scary enough with all the things i *can* access [15:32:40] (03Merged) 10jenkins-bot: Don't add lede button if mobile DiscussionTools not enabled [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858320 (https://phabricator.wikimedia.org/T323341) (owner: 10Ladsgroup) [15:33:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858320 (https://phabricator.wikimedia.org/T323341) (owner: 10Ladsgroup) [15:33:54] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:858320|Don't add lede button if mobile DiscussionTools not enabled (T323341)]] [15:34:05] T323341: unnecessary button on mobile talk pages - https://phabricator.wikimedia.org/T323341 [15:34:20] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:858320|Don't add lede button if mobile DiscussionTools not enabled (T323341)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [15:35:04] MatmaRex: live on mwdebug1002, can you check? [15:35:22] yeah [15:36:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1005.eqiad.wmnet with reason: host reimage [15:36:47] still testing some things [15:37:18] take your time, I'm coding [15:38:20] Amir1: all looks good though. 
the button shows up when it should and doesn't when it shouldn't [15:38:29] awesome [15:38:43] tried some pages on en.wp, fr.wp, mw.org [15:39:08] it's being pushed everywhere now [15:40:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1005.eqiad.wmnet with reason: host reimage [15:40:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-jumbo1011.mgmt.eqiad.wmnet with reboot policy FORCED [15:42:42] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:858320|Don't add lede button if mobile DiscussionTools not enabled (T323341)]] (duration: 08m 47s) [15:42:50] MatmaRex: deployed everywhere [15:42:52] T323341: unnecessary button on mobile talk pages - https://phabricator.wikimedia.org/T323341 [15:42:59] thanks [15:43:30] (03PS1) 10Herron: dispatch: manage config.js locally [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858616 (https://phabricator.wikimedia.org/T313229) [15:44:42] (03PS1) 10Clément Goubert: apple-search: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/858617 (https://phabricator.wikimedia.org/T316296) [15:45:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P40148 and previous config saved to /var/cache/conftool/dbconfig/20221118-154511-ladsgroup.json [15:46:33] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38331/console" [puppet] - 10https://gerrit.wikimedia.org/r/858617 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [15:46:43] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi) [15:47:21] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/858610 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi) [15:48:28] (03CR) 10Clément Goubert: apple-search: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/858617 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [15:48:39] (03PS1) 10Clément Goubert: apple-search: final cleanup [puppet] - 10https://gerrit.wikimedia.org/r/858624 (https://phabricator.wikimedia.org/T316296) [15:52:12] (03CR) 10Herron: [C: 03+1] hiera: add graphite2004 to codfw graphite queries [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi) [15:52:40] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - bking@cumin1001 - T319020 [15:52:41] (03CR) 10Herron: [C: 03+1] graphite: start mirroring traffic to graphite2004 [puppet] - 10https://gerrit.wikimedia.org/r/858610 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi) [15:52:47] (03PS26) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [15:52:56] T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020 [15:52:58] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - bking@cumin1001 - T319020 [15:53:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T323214)', diff saved to https://phabricator.wikimedia.org/P40149 and previous config saved to /var/cache/conftool/dbconfig/20221118-155310-ladsgroup.json [15:53:36] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - 
https://phabricator.wikimedia.org/T323214 [15:53:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] apple-search: remove dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/858573 (owner: 10Clément Goubert) [15:54:22] (03CR) 10Giuseppe Lavagetto: [C: 03+1] admin-ng: remove apple-search namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/858575 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [15:54:30] (03CR) 10Giuseppe Lavagetto: [C: 03+1] wikikube: remove apple-search deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/858577 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [15:54:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1005.eqiad.wmnet with OS bullseye [15:54:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-logging1005.eqiad.wmnet with OS bullseye comple... 
[15:54:42] (03CR) 10Giuseppe Lavagetto: [C: 03+1] charts: remove apple-search chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/858578 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [15:54:57] (03CR) 10Giuseppe Lavagetto: [C: 03+1] apple-search: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/858617 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [15:55:12] (03CR) 10Giuseppe Lavagetto: [C: 03+1] apple-search: final cleanup [puppet] - 10https://gerrit.wikimedia.org/r/858624 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [15:55:35] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - bking@cumin1001 - T319020 [15:55:42] (03CR) 10Michael Große: "I guess we can schedule this for the backport-window on Nov 30th, that is the one after the train" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [15:57:00] (03CR) 10Clément Goubert: "recheck" [labs/private] - 10https://gerrit.wikimedia.org/r/858573 (owner: 10Clément Goubert) [15:57:39] (NodeTextfileStale) firing: Stale textfile for cp1078:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:58:15] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] apple-search: remove dummy service data [labs/private] - 10https://gerrit.wikimedia.org/r/858573 (owner: 10Clément Goubert) [15:58:40] (NodeTextfileStale) firing: (2) Stale textfile for cp5012:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:59:24] 
!log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - bking@cumin1001 - T319020 [15:59:54] (03CR) 10Clément Goubert: [C: 03+2] admin-ng: remove apple-search namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/858575 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [16:00:04] T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020 [16:00:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T323214)', diff saved to https://phabricator.wikimedia.org/P40150 and previous config saved to /var/cache/conftool/dbconfig/20221118-160018-ladsgroup.json [16:00:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [16:00:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [16:00:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T323214)', diff saved to https://phabricator.wikimedia.org/P40151 and previous config saved to /var/cache/conftool/dbconfig/20221118-160039-ladsgroup.json [16:01:07] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [16:01:16] RECOVERY - SSH on mw1328.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:02:39] (NodeTextfileStale) firing: (7) Stale textfile for cp1078:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:03:40] (NodeTextfileStale) firing: (7) Stale textfile for cp2029:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:05:19] (03Merged) 10jenkins-bot: admin-ng: remove apple-search namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/858575 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [16:05:34] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:06:14] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) [16:07:37] !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:07:39] (NodeTextfileStale) firing: (19) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:07:43] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) [16:08:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P40152 and previous config saved to /var/cache/conftool/dbconfig/20221118-160817-ladsgroup.json [16:08:40] (NodeTextfileStale) firing: (19) Stale textfile for cp1077:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:08:46] (03CR) 10David Caro: [C: 03+1] "LGTM, thanks!" 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/858560 (owner: 10Arturo Borrero Gonzalez) [16:08:49] !log removing apple-search namespaces - T316296 [16:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:04] T316296: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 [16:09:10] !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:09:15] !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:09:18] (03CR) 10Cathal Mooney: [C: 03+2] Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [16:09:28] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [16:09:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Papaul) [16:10:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: proxy: only mark the proxy as started if it didn't fail [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/858560 (owner: 10Arturo Borrero Gonzalez) [16:10:26] !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[16:10:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Papaul) 05Open→03Resolved @herron this is complete [16:10:40] (03CR) 10Filippo Giunchedi: [C: 03+1] dispatch: manage config.js locally [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858616 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron) [16:11:13] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:11:50] (03PS27) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [16:12:18] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:12:40] (NodeTextfileStale) firing: (27) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:12:50] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:13:23] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[16:13:24] brett: ^^ that's for you [16:13:40] (NodeTextfileStale) firing: (30) Stale textfile for cp1077:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:13:52] (03Abandoned) 10Clément Goubert: mw-*: Remove sal logging hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/858360 (https://phabricator.wikimedia.org/T323296) (owner: 10Clément Goubert) [16:15:02] (03CR) 10Clément Goubert: [C: 03+2] wikikube: remove apple-search deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/858577 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [16:15:46] brett: per https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile?orgId=1 you need to clean the stale ats_config.prom file [16:17:40] (NodeTextfileStale) firing: (38) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:18:40] (NodeTextfileStale) firing: (42) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:18:51] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - bking@cumin1001 - T319020 [16:19:05] T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020 [16:19:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:20:01] (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [16:20:24] (03Merged) 10jenkins-bot: wikikube: remove apple-search deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/858577 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [16:20:26] (03CR) 10David Caro: [C: 03+2] wmcs: add cookbook to add/remove a user to/from a project (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro) [16:20:29] (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [16:20:39] (03CR) 10CI reject: [V: 04-1] wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro) [16:21:04] (03CR) 10Clément Goubert: [C: 03+2] charts: remove apple-search chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/858578 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [16:21:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T323214)', diff saved to https://phabricator.wikimedia.org/P40154 and previous config saved to /var/cache/conftool/dbconfig/20221118-162147-ladsgroup.json [16:21:57] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [16:22:13] (03CR) 10Andrew Bogott: [C: 03+1] "I'm fine with this being merged as-is and the additional feature comments being left for future patches." 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro) [16:22:22] (03CR) 10Michael Große: Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [16:22:32] (03CR) 10Clément Goubert: [C: 03+2] apple-search: absent kubernetes service [puppet] - 10https://gerrit.wikimedia.org/r/858617 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [16:22:40] (NodeTextfileStale) firing: (42) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:23:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P40155 and previous config saved to /var/cache/conftool/dbconfig/20221118-162323-ladsgroup.json [16:23:40] (NodeTextfileStale) firing: (46) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:24:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:25:04] (03PS3) 10David Caro: wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 [16:25:29] (03Merged) 10jenkins-bot: charts: remove apple-search chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/858578 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) 
[16:26:02] (03CR) 10Clément Goubert: [C: 03+2] apple-search: final cleanup [puppet] - 10https://gerrit.wikimedia.org/r/858624 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [16:26:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1011.mgmt.eqiad.wmnet with reboot policy FORCED [16:27:14] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1011'] [16:27:40] (NodeTextfileStale) firing: (44) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:35:01] (03PS4) 10David Caro: wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 [16:35:22] (03CR) 10David Caro: wmcs: add cookbook to add/remove a user to/from a project (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro) [16:36:52] (03PS5) 10David Caro: wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 [16:36:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P40156 and previous config saved to /var/cache/conftool/dbconfig/20221118-163653-ladsgroup.json [16:36:54] (03CR) 10David Caro: wmcs: add cookbook to add/remove a user to/from a project (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro) [16:37:40] (NodeTextfileStale) resolved: Stale textfile for cp4050:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:37:40] 
looks like i have one more thing to backport today, https://phabricator.wikimedia.org/T323343 [16:38:13] this is the worst friday in months! (at least this one isn't my fault) [16:38:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T323214)', diff saved to https://phabricator.wikimedia.org/P40157 and previous config saved to /var/cache/conftool/dbconfig/20221118-163830-ladsgroup.json [16:38:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [16:38:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [16:38:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T323214)', diff saved to https://phabricator.wikimedia.org/P40158 and previous config saved to /var/cache/conftool/dbconfig/20221118-163851-ladsgroup.json [16:38:55] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [16:41:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1011'] [16:45:37] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010'] [16:47:47] (03CR) 10Ahmon Dancy: role::kubernetes::wroker: allow scap to pre-pull mediawiki images (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/858543 (https://phabricator.wikimedia.org/T323349) (owner: 10Giuseppe Lavagetto) [16:47:56] !log robh@cumin2002 START - Cookbook sre.dns.netbox [16:49:10] (NodeTextfileStale) resolved: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:49:19] !log bking@cumin1001 END (PASS) - Cookbook 
sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - bking@cumin1001 - T319020 [16:49:23] (03PS1) 10Bartosz Dziewoński: VE: Use instead of in CE HTML [extensions/Cite] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858321 (https://phabricator.wikimedia.org/T323343) [16:49:34] (03PS1) 10Bartosz Dziewoński: Undo use of .reference instead of .mw-ref in CSS counter rules [extensions/Cite] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858322 (https://phabricator.wikimedia.org/T323343) [16:49:42] (03CR) 10David Caro: [C: 03+2] wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro) [16:49:44] T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020 [16:49:48] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:49:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T323214)', diff saved to https://phabricator.wikimedia.org/P40159 and previous config saved to /var/cache/conftool/dbconfig/20221118-164957-ladsgroup.json [16:50:05] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [16:50:48] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5017 [16:51:06] (03Abandoned) 10Jforrester: onSpecialSearchCreateLink: Handle null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851016 (https://phabricator.wikimedia.org/T320736) (owner: 10Jforrester) [16:51:11] (03Abandoned) 10Jforrester: onSpecialSearchCreateLink: Handle another null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851017 (https://phabricator.wikimedia.org/T320736) (owner: 
10Jforrester) [16:51:19] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5017 [16:51:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1010'] [16:51:41] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5018 [16:52:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P40160 and previous config saved to /var/cache/conftool/dbconfig/20221118-165200-ladsgroup.json [16:52:06] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5018 [16:52:07] brennen: thcipriani: (or anyone else) are you perhaps around for an emergency friday backport? (another one, different than the thing this morning…) https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Cite/+/858321 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Cite/+/858322 [16:52:10] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5019 [16:52:24] * thcipriani looks [16:52:35] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5019 [16:52:39] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5020 [16:52:40] the bug is https://phabricator.wikimedia.org/T323343 [16:53:01] (03Merged) 10jenkins-bot: wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro) [16:53:06] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5020 [16:53:12] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5028 [16:53:29] MatmaRex: I can get it out [16:53:34] !log robh@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host cp5028 [16:53:37] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5029 [16:53:59] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5029 [16:54:03] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5030 [16:54:26] thcipriani: thank you [16:54:27] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5030 [16:54:31] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5031 [16:54:43] 10SRE, 10Discovery-Search, 10serviceops, 10serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Clement_Goubert) 05In progress→03Resolved Certificates cleaned up. It's dead, Jim. [16:54:53] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5031 [16:56:26] !log apple-search service decommissioned - T316296 [16:56:44] * brennen reads backscroll [16:56:45] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5017.mgmt.eqsin.wmnet with reboot policy FORCED [16:57:10] Hmm logmsg.bot, plz log to sal [16:58:04] stash.bot even [16:58:06] MatmaRex: do these need to go out in any particular order? All at once OK? [16:58:21] Ah. That explains it. 
[16:58:34] thcipriani: any order, yes [16:58:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [extensions/Cite] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858321 (https://phabricator.wikimedia.org/T323343) (owner: 10Bartosz Dziewoński) [16:58:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [extensions/Cite] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858322 (https://phabricator.wikimedia.org/T323343) (owner: 10Bartosz Dziewoński) [16:59:38] (03CR) 10Herron: [V: 03+2 C: 03+2] dispatch: manage config.js locally [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858616 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron) [17:02:17] claime: I think the last log messages still ended up in https://sal.toolforge.org/ ? [17:03:40] Lucas_WMDE: Yeah it did [17:04:05] I just didn't get an echo here since it was in the process of timeouting :') [17:04:45] you’re right, it should’ve replied to you since you’re not logmsgbot ^^ [17:04:48] I missed that part [17:05:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P40161 and previous config saved to /var/cache/conftool/dbconfig/20221118-170503-ladsgroup.json [17:07:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T323214)', diff saved to https://phabricator.wikimedia.org/P40162 and previous config saved to /var/cache/conftool/dbconfig/20221118-170706-ladsgroup.json [17:07:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [17:07:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [17:07:24] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - 
https://phabricator.wikimedia.org/T323214 [17:07:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T323214)', diff saved to https://phabricator.wikimedia.org/P40163 and previous config saved to /var/cache/conftool/dbconfig/20221118-170727-ladsgroup.json [17:08:16] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5017.mgmt.eqsin.wmnet with reboot policy FORCED [17:10:46] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010'] [17:11:18] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5018.mgmt.eqsin.wmnet with reboot policy FORCED [17:12:01] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1010'] [17:13:13] (03Merged) 10jenkins-bot: VE: Use instead of in CE HTML [extensions/Cite] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858321 (https://phabricator.wikimedia.org/T323343) (owner: 10Bartosz Dziewoński) [17:13:19] (03Merged) 10jenkins-bot: Undo use of .reference instead of .mw-ref in CSS counter rules [extensions/Cite] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858322 (https://phabricator.wikimedia.org/T323343) (owner: 10Bartosz Dziewoński) [17:13:33] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:858321|VE: Use instead of in CE HTML (T323343)]], [[gerrit:858322|Undo use of .reference instead of .mw-ref in CSS counter rules (T323343)]] [17:13:43] T323343: [1][2][3] style references in unusual vertical position when editing, and erroneous [0] references added when saving - https://phabricator.wikimedia.org/T323343 [17:13:53] !log thcipriani@deploy1002 thcipriani and matmarex: Backport for [[gerrit:858321|VE: Use instead of in CE HTML (T323343)]], [[gerrit:858322|Undo use of .reference instead of .mw-ref in CSS counter rules (T323343)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, 
mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [17:14:04] ^ MatmaRex finally on mwdebug, check please [17:15:10] thcipriani: looks good [17:15:14] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010'] [17:15:25] cool, syncing everywhere now [17:15:39] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1010'] [17:15:51] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010'] [17:19:04] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1010'] [17:19:08] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010'] [17:19:31] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:858321|VE: Use instead of in CE HTML (T323343)]], [[gerrit:858322|Undo use of .reference instead of .mw-ref in CSS counter rules (T323343)]] (duration: 05m 58s) [17:19:38] ^ MatmaRex should be everywhere now [17:19:43] T323343: [1][2][3] style references in unusual vertical position when editing, and erroneous [0] references added when saving - https://phabricator.wikimedia.org/T323343 [17:19:47] thanks thcipriani! 
[17:20:03] any time: thanks for the patches [17:20:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P40164 and previous config saved to /var/cache/conftool/dbconfig/20221118-172010-ladsgroup.json [17:22:54] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5018.mgmt.eqsin.wmnet with reboot policy FORCED [17:24:00] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5019.mgmt.eqsin.wmnet with reboot policy FORCED [17:25:32] (03PS28) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [17:25:34] (03CR) 10Bartosz Dziewoński: Don't run OutputPageBeforeHTML for the talkpageheader (031 comment) [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858319 (https://phabricator.wikimedia.org/T316175) (owner: 10Bartosz Dziewoński) [17:31:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T323214)', diff saved to https://phabricator.wikimedia.org/P40165 and previous config saved to /var/cache/conftool/dbconfig/20221118-173156-ladsgroup.json [17:32:07] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [17:35:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T323214)', diff saved to https://phabricator.wikimedia.org/P40166 and previous config saved to /var/cache/conftool/dbconfig/20221118-173516-ladsgroup.json [17:35:28] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5019.mgmt.eqsin.wmnet with reboot policy FORCED [17:38:18] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5020.mgmt.eqsin.wmnet with reboot policy FORCED [17:41:50] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [17:42:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [17:42:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:42:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:42:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T323214)', diff saved to https://phabricator.wikimedia.org/P40167 and previous config saved to /var/cache/conftool/dbconfig/20221118-174226-ladsgroup.json [17:45:04] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [17:47:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P40168 and previous config saved to /var/cache/conftool/dbconfig/20221118-174702-ladsgroup.json [17:49:43] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5020.mgmt.eqsin.wmnet with reboot policy FORCED [17:52:22] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5028.mgmt.eqsin.wmnet with reboot policy FORCED [17:56:45] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1010'] [17:57:12] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010'] [17:57:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T323214)', diff saved to https://phabricator.wikimedia.org/P40169 and previous config saved to 
/var/cache/conftool/dbconfig/20221118-175717-ladsgroup.json [17:57:30] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [17:57:50] (03PS29) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [18:02:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P40170 and previous config saved to /var/cache/conftool/dbconfig/20221118-180212-ladsgroup.json [18:03:54] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5028.mgmt.eqsin.wmnet with reboot policy FORCED [18:04:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1010'] [18:05:51] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1011'] [18:06:33] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5029.mgmt.eqsin.wmnet with reboot policy FORCED [18:09:28] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [18:11:24] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [18:12:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P40171 and previous config saved to /var/cache/conftool/dbconfig/20221118-181223-ladsgroup.json [18:12:29] (03PS1) 10Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - 10https://gerrit.wikimedia.org/r/858633 [18:14:23] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Jclark-ctr) @Papaul corrected netbox it was in as asset tag WMF10621 [18:15:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1011'] [18:16:46] PROBLEM - Exim SMTP on mx1001 is CRITICAL: connect to address 208.80.154.76 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [18:17:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T323214)', diff saved to https://phabricator.wikimedia.org/P40172 and previous config saved to /var/cache/conftool/dbconfig/20221118-181720-ladsgroup.json [18:17:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [18:17:31] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [18:17:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [18:17:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T323214)', diff saved to 
https://phabricator.wikimedia.org/P40173 and previous config saved to /var/cache/conftool/dbconfig/20221118-181741-ladsgroup.json [18:18:07] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5029.mgmt.eqsin.wmnet with reboot policy FORCED [18:18:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-jumbo1012.mgmt.eqiad.wmnet with reboot policy FORCED [18:19:25] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5030.mgmt.eqsin.wmnet with reboot policy FORCED [18:20:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-jumbo1014.mgmt.eqiad.wmnet with reboot policy FORCED [18:21:40] !log removed older exim logs to free space T305567 [18:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:48] T305567: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 [18:27:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P40174 and previous config saved to /var/cache/conftool/dbconfig/20221118-182730-ladsgroup.json [18:27:50] RECOVERY - Exim SMTP on mx1001 is OK: OK - Certificate mx1001.wikimedia.org will expire on Fri 30 Dec 2022 08:22:47 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [18:31:10] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5030.mgmt.eqsin.wmnet with reboot policy FORCED [18:31:34] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5031.mgmt.eqsin.wmnet with reboot policy FORCED [18:39:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T323214)', diff saved to https://phabricator.wikimedia.org/P40175 and previous config saved to /var/cache/conftool/dbconfig/20221118-183906-ladsgroup.json [18:39:16] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [18:42:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T323214)', diff saved to https://phabricator.wikimedia.org/P40176 and previous config saved to /var/cache/conftool/dbconfig/20221118-184236-ladsgroup.json [18:42:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [18:42:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [18:42:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40177 and previous config saved to /var/cache/conftool/dbconfig/20221118-184258-ladsgroup.json [18:43:01] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5031.mgmt.eqsin.wmnet with reboot policy FORCED [18:43:47] (03PS1) 10Ssingh: cp5017: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/858635 (https://phabricator.wikimedia.org/T322048) [18:45:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1014.mgmt.eqiad.wmnet with reboot policy FORCED [18:47:16] !log 
pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1012.mgmt.eqiad.wmnet with reboot policy FORCED [18:48:40] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:27] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5017'] [18:51:33] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp5017'] [18:52:12] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1012'] [18:54:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P40178 and previous config saved to /var/cache/conftool/dbconfig/20221118-185412-ladsgroup.json [18:54:17] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5017'] [18:56:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service,monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:51] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5018'] [19:02:58] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1014'] [19:03:26] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['kafka-jumbo1014'] [19:03:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40179 and previous config saved to 
/var/cache/conftool/dbconfig/20221118-190340-ladsgroup.json [19:04:01] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [19:05:29] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1010'] [19:05:44] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['kafka-jumbo1010'] [19:06:58] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5017'] [19:07:03] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1014'] [19:07:29] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5019'] [19:08:38] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:09:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P40180 and previous config saved to /var/cache/conftool/dbconfig/20221118-190919-ladsgroup.json [19:11:16] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [19:12:33] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [19:12:38] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Sustainability (Incident Followup), 10Thai-Sites: Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Bebiezaza) Tagging #thai-sites because this extension is currently in use at Thai Wikisource (t... 
[19:15:01] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5018'] [19:18:38] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5020'] [19:18:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P40181 and previous config saved to /var/cache/conftool/dbconfig/20221118-191846-ladsgroup.json [19:20:00] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [19:21:19] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [19:23:37] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5019'] [19:23:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1014'] [19:23:55] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5028'] [19:24:05] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1012'] [19:24:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T323214)', diff saved to https://phabricator.wikimedia.org/P40182 and previous config saved to /var/cache/conftool/dbconfig/20221118-192425-ladsgroup.json [19:24:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [19:24:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [19:24:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 
0:00:00 on db2095.codfw.wmnet with reason: Maintenance [19:24:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [19:24:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T323214)', diff saved to https://phabricator.wikimedia.org/P40183 and previous config saved to /var/cache/conftool/dbconfig/20221118-192452-ladsgroup.json [19:25:02] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [19:27:12] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1012'] [19:28:09] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1014'] [19:28:53] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [19:31:40] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5020'] [19:31:49] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5029'] [19:32:29] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [19:33:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P40184 and previous config saved to /var/cache/conftool/dbconfig/20221118-193353-ladsgroup.json [19:34:12] (03PS1) 10BCornwall: cp5018: Set cp role via site.pp and related config [puppet] - 10https://gerrit.wikimedia.org/r/858640 (https://phabricator.wikimedia.org/T322048) [19:34:34] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1014'] [19:36:03] !log robh@cumin2002 END 
(PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5028'] [19:37:16] (03PS2) 10BCornwall: cp5018: Set cp role via site.pp and related config [puppet] - 10https://gerrit.wikimedia.org/r/858640 (https://phabricator.wikimedia.org/T322048) [19:39:38] (03CR) 10Ssingh: [C: 03+1] "Looks good! We will merge it later when we are ready to reimage." [puppet] - 10https://gerrit.wikimedia.org/r/858640 (https://phabricator.wikimedia.org/T322048) (owner: 10BCornwall) [19:44:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-jumbo1015.mgmt.eqiad.wmnet with reboot policy FORCED [19:46:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1012'] [19:46:45] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5030'] [19:47:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T323214)', diff saved to https://phabricator.wikimedia.org/P40185 and previous config saved to /var/cache/conftool/dbconfig/20221118-194721-ladsgroup.json [19:47:59] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [19:49:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40186 and previous config saved to /var/cache/conftool/dbconfig/20221118-194859-ladsgroup.json [19:49:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [19:49:09] (03CR) 10BCornwall: [V: 03+1 C: 03+2] prometheus: Remove old ats config export job [puppet] - 10https://gerrit.wikimedia.org/r/858418 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [19:49:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db1171.eqiad.wmnet with reason: Maintenance [19:58:18] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1012'] [19:58:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1012'] [19:58:50] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5030'] [19:59:14] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5031'] [20:00:24] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [20:01:03] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [20:02:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P40187 and previous config saved to /var/cache/conftool/dbconfig/20221118-200228-ladsgroup.json [20:03:12] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp5029'] [20:03:16] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5029'] [20:04:02] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp5029'] [20:05:50] PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:38] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5029'] [20:07:36] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade 
firmware for hosts ['cp5029'] [20:07:42] RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:08:27] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp5031'] [20:09:26] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:09:45] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [20:10:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [20:10:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [20:10:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T323214)', diff saved to https://phabricator.wikimedia.org/P40188 and previous config saved to /var/cache/conftool/dbconfig/20221118-201030-ladsgroup.json [20:10:36] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [20:15:22] (03PS2) 10Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - 10https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319) [20:15:24] (03PS1) 10Andrew Bogott: glance: use www_authenticate_uri [puppet] - 10https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) [20:15:26] (03PS1) 10Andrew Bogott: nova-compute.conf: add explanatory note about live_migration_uri [puppet] - 10https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319) [20:15:28] (03PS1) 10Andrew Bogott: neutron.conf: remove allow_overlapping_ips config flag [puppet] - 
10https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319) [20:15:30] (03PS1) 10Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - 10https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319) [20:17:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P40189 and previous config saved to /var/cache/conftool/dbconfig/20221118-201734-ladsgroup.json [20:18:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1015.mgmt.eqiad.wmnet with reboot policy FORCED [20:21:00] (03PS2) 10Andrew Bogott: glance: use www_authenticate_uri [puppet] - 10https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) [20:21:02] (03PS3) 10Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - 10https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319) [20:21:04] (03PS2) 10Andrew Bogott: nova-compute.conf: add explanatory note about live_migration_uri [puppet] - 10https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319) [20:21:06] (03PS2) 10Andrew Bogott: neutron.conf: remove allow_overlapping_ips config flag [puppet] - 10https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319) [20:21:08] (03PS2) 10Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - 10https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319) [20:21:44] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1015'] [20:21:46] (03CR) 10CI reject: [V: 04-1] glance: use www_authenticate_uri [puppet] - 10https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [20:22:08] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: 
Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [20:22:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T323214)', diff saved to https://phabricator.wikimedia.org/P40190 and previous config saved to /var/cache/conftool/dbconfig/20221118-202245-ladsgroup.json [20:22:51] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [20:23:02] (03PS3) 10Andrew Bogott: glance: use www_authenticate_uri [puppet] - 10https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) [20:23:04] (03PS4) 10Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - 10https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319) [20:23:06] (03PS3) 10Andrew Bogott: nova-compute.conf: add explanatory note about live_migration_uri [puppet] - 10https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319) [20:23:08] (03PS3) 10Andrew Bogott: neutron.conf: remove allow_overlapping_ips config flag [puppet] - 10https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319) [20:23:10] (03PS3) 10Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - 10https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319) [20:23:12] (03CR) 10CI reject: [V: 04-1] neutron.conf: remove allow_overlapping_ips config flag [puppet] - 10https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [20:23:14] (03CR) 10CI reject: [V: 04-1] Set service_token_roles for services that use Keystone [puppet] - 10https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [20:24:29] (03CR) 10CI reject: [V: 04-1] glance: use www_authenticate_uri [puppet] - 10https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew 
Bogott) [20:25:01] (03PS4) 10Andrew Bogott: glance: use www_authenticate_uri [puppet] - 10https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) [20:25:03] (03PS5) 10Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - 10https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319) [20:25:05] (03PS4) 10Andrew Bogott: nova-compute.conf: add explanatory note about live_migration_uri [puppet] - 10https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319) [20:25:08] (03PS4) 10Andrew Bogott: neutron.conf: remove allow_overlapping_ips config flag [puppet] - 10https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319) [20:25:09] (03PS4) 10Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - 10https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319) [20:26:59] (03CR) 10CI reject: [V: 04-1] glance: use www_authenticate_uri [puppet] - 10https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [20:29:13] (03PS5) 10Andrew Bogott: glance: use www_authenticate_uri [puppet] - 10https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) [20:29:14] (03PS6) 10Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - 10https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319) [20:29:16] (03PS5) 10Andrew Bogott: nova-compute.conf: add explanatory note about live_migration_uri [puppet] - 10https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319) [20:29:18] (03PS5) 10Andrew Bogott: neutron.conf: remove allow_overlapping_ips config flag [puppet] - 10https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319) [20:29:20] (03PS5) 10Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - 10https://gerrit.wikimedia.org/r/858647 
(https://phabricator.wikimedia.org/T323319) [20:32:16] (03CR) 10BBlack: [C: 04-1] "Needs: "profile::cache::varnish::frontend::single_backend: true" in the hieradata/cp5017.yaml file as well" [puppet] - 10https://gerrit.wikimedia.org/r/858635 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [20:32:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T323214)', diff saved to https://phabricator.wikimedia.org/P40191 and previous config saved to /var/cache/conftool/dbconfig/20221118-203241-ladsgroup.json [20:32:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [20:32:48] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [20:32:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [20:33:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40192 and previous config saved to /var/cache/conftool/dbconfig/20221118-203302-ladsgroup.json [20:33:13] (03CR) 10BBlack: [C: 04-1] "Needs: "profile::cache::varnish::frontend::single_backend: true" in the hieradata/cp5017.yaml file as well" [puppet] - 10https://gerrit.wikimedia.org/r/858640 (https://phabricator.wikimedia.org/T322048) (owner: 10BCornwall) [20:37:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P40193 and previous config saved to /var/cache/conftool/dbconfig/20221118-203751-ladsgroup.json [20:39:01] (03PS1) 10BBlack: cp5032: turn on single_backend [puppet] - 10https://gerrit.wikimedia.org/r/858649 (https://phabricator.wikimedia.org/T322048) [20:41:56] (03CR) 10BBlack: [C: 03+2] cp5032: turn on single_backend [puppet] - 10https://gerrit.wikimedia.org/r/858649 
(https://phabricator.wikimedia.org/T322048) (owner: 10BBlack) [20:44:10] (03PS6) 10Andrew Bogott: glance: use www_authenticate_uri [puppet] - 10https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) [20:44:12] (03PS7) 10Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - 10https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319) [20:44:14] (03PS6) 10Andrew Bogott: nova-compute.conf: add explanatory note about live_migration_uri [puppet] - 10https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319) [20:44:16] (03PS6) 10Andrew Bogott: neutron.conf: remove allow_overlapping_ips config flag [puppet] - 10https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319) [20:44:18] (03PS6) 10Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - 10https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319) [20:46:10] (03PS7) 10Andrew Bogott: glance: use www_authenticate_uri [puppet] - 10https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) [20:46:12] (03PS8) 10Andrew Bogott: nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - 10https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319) [20:46:14] (03PS7) 10Andrew Bogott: nova-compute.conf: add explanatory note about live_migration_uri [puppet] - 10https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319) [20:46:16] (03PS7) 10Andrew Bogott: neutron.conf: remove allow_overlapping_ips config flag [puppet] - 10https://gerrit.wikimedia.org/r/858646 (https://phabricator.wikimedia.org/T323319) [20:46:18] (03PS7) 10Andrew Bogott: Set service_token_roles for services that use Keystone [puppet] - 10https://gerrit.wikimedia.org/r/858647 (https://phabricator.wikimedia.org/T323319) [20:46:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:48:56] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10phaultfinder) [20:49:01] (03PS2) 10Ssingh: cp5017: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/858635 (https://phabricator.wikimedia.org/T322048) [20:49:14] (03CR) 10Ssingh: cp5017: update site.pp and related configs for cp role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858635 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [20:52:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P40194 and previous config saved to /var/cache/conftool/dbconfig/20221118-205258-ladsgroup.json [20:53:58] (03CR) 10Ssingh: [C: 03+2] cp5017: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/858635 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [20:54:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:50] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/857079/38343/" [puppet] - 10https://gerrit.wikimedia.org/r/857079 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [20:56:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40195 and previous config saved to /var/cache/conftool/dbconfig/20221118-205649-ladsgroup.json [20:56:55] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [20:56:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5017.eqsin.wmnet with OS buster [20:57:05] 
10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5017.eqsin.wmnet with OS buster [21:01:44] (03PS1) 10Andrew Bogott: glance: use memcached for token caching [puppet] - 10https://gerrit.wikimedia.org/r/858651 (https://phabricator.wikimedia.org/T323319) [21:02:46] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "I want to get as much as possible done before the switch itself.. This will add systemd timers, logging config, the dump service.." [puppet] - 10https://gerrit.wikimedia.org/r/857079 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:03:22] (03PS2) 10Dzahn: phabricator: enable dumping on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/857079 (https://phabricator.wikimedia.org/T280597) [21:05:22] (03CR) 10Dzahn: phabricator: enable dumping on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/857079 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:06:37] sukhe: you make us get scary puppet changes :) [21:08:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T323214)', diff saved to https://phabricator.wikimedia.org/P40196 and previous config saved to /var/cache/conftool/dbconfig/20221118-210804-ladsgroup.json [21:08:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [21:08:13] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [21:08:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [21:08:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T323214)', diff saved to https://phabricator.wikimedia.org/P40197 and previous config saved to 
/var/cache/conftool/dbconfig/20221118-210825-ladsgroup.json [21:08:48] mutante: oh which one! [21:08:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:08:48] PROBLEM - Check systemd state on ms-be2065 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:09:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-jumbo1015'] [21:09:30] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2065 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:09:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:10:05] sukhe: hehe, everything is ok. it's just.. every time you remove or add a cp host, it means there is an edit to "@def $CACHES" and that in turn means there is an edit to /etc/ferm/conf.d/00_defs and that means on any random host you run puppet on and expect nothing to change.. 
suddenly there is an entire window full of firewall rule changes :) [21:11:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P40198 and previous config saved to /var/cache/conftool/dbconfig/20221118-211155-ladsgroup.json [21:12:15] (or it looks like there is because it's a huge list that has one host added or removed and gets displayed) [21:14:42] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-jumbo1015'] [21:14:58] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST configurations) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:17:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-jumbo1015'] [21:19:28] ah! 
[21:19:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T323214)', diff saved to https://phabricator.wikimedia.org/P40199 and previous config saved to /var/cache/conftool/dbconfig/20221118-211931-ladsgroup.json [21:19:38] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [21:21:54] !log running phabricator task dump script on phab1004 [21:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:21] (03PS1) 10Andrew Bogott: cinder.conf: lock_path to oslo_concurrency [puppet] - 10https://gerrit.wikimedia.org/r/858653 (https://phabricator.wikimedia.org/T323319) [21:26:23] (03PS1) 10Andrew Bogott: cinder: remove default quota settings [puppet] - 10https://gerrit.wikimedia.org/r/858654 (https://phabricator.wikimedia.org/T323319) [21:26:25] (03PS1) 10Andrew Bogott: trove: remove network_label_regex [puppet] - 10https://gerrit.wikimedia.org/r/858655 (https://phabricator.wikimedia.org/T323319) [21:26:28] (03CR) 10Dzahn: [C: 03+2] "confirmed that the dump script uses the slave DB regardless of which server it runs on. and started it on phab1004. it should be just fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/857079 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:27:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P40200 and previous config saved to /var/cache/conftool/dbconfig/20221118-212702-ladsgroup.json [21:27:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5017.eqsin.wmnet with reason: host reimage [21:27:41] (03CR) 10Andrew Bogott: [C: 03+2] nova-compute.conf: replace 'cpu_model' with 'cpu_models' [puppet] - 10https://gerrit.wikimedia.org/r/858633 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [21:27:49] (03CR) 10Andrew Bogott: [C: 03+2] glance: use www_authenticate_uri [puppet] - 10https://gerrit.wikimedia.org/r/858644 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [21:27:55] (03CR) 10Andrew Bogott: [C: 03+2] nova-compute.conf: add explanatory note about live_migration_uri [puppet] - 10https://gerrit.wikimedia.org/r/858645 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [21:32:12] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5017.eqsin.wmnet with reason: host reimage [21:34:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P40201 and previous config saved to /var/cache/conftool/dbconfig/20221118-213437-ladsgroup.json [21:34:42] RECOVERY - Check systemd state on ms-be2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:38] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2065 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:39:58] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 42 
probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:41:41] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Papaul) [21:42:01] (03PS3) 10Ssingh: cp5018: Set cp role via site.pp and related config [puppet] - 10https://gerrit.wikimedia.org/r/858640 (https://phabricator.wikimedia.org/T322048) (owner: 10BCornwall) [21:42:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40202 and previous config saved to /var/cache/conftool/dbconfig/20221118-214208-ladsgroup.json [21:42:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [21:42:13] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Papaul) [21:42:15] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [21:42:16] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 214 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:42:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [21:42:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40203 and previous config saved to /var/cache/conftool/dbconfig/20221118-214230-ladsgroup.json [21:43:34] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver 
on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:45:18] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 31 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:45:29] (03PS1) 10Dzahn: phabricator: remove hardcoded ports, use parameters in my.cnf for admins [puppet] - 10https://gerrit.wikimedia.org/r/858656 [21:46:12] (03PS2) 10Dzahn: phabricator: remove hardcoded ports, use parameters in my.cnf for admins [puppet] - 10https://gerrit.wikimedia.org/r/858656 (https://phabricator.wikimedia.org/T280597) [21:46:18] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 180 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:47:50] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:49:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P40204 and previous config saved to /var/cache/conftool/dbconfig/20221118-214944-ladsgroup.json [21:50:08] (03PS3) 10Dzahn: phabricator: remove hardcoded ports, use parameters in my.cnf for admins [puppet] - 10https://gerrit.wikimedia.org/r/858656 (https://phabricator.wikimedia.org/T280597) [21:52:51] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "noop https://puppet-compiler.wmflabs.org/output/858656/38346/" [puppet] - 10https://gerrit.wikimedia.org/r/858656 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:55:38] (03CR) 10Dzahn: [C: 03+1] "I'm going to be a bit more bold here and merge this and proof it's noop on clouddumps1002. We want to switch the phab host name on Monday " [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:55:57] (03PS9) 10Dzahn: dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) [21:56:30] (03CR) 10Dzahn: "After this I can switch the phab dump host from phab1001 to phab1004 where I have enabled dumping in https://gerrit.wikimedia.org/r/c/oper" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:59:09] (03CR) 10Dzahn: [C: 03+1] "compiling on C:profile::dumps::distribution::datasets::fetcher which then picks for me that the right host is, btw: I don't manually enter" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:59:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units 
failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:04] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "and it shows there is no change, only the class parameters: https://puppet-compiler.wmflabs.org/output/852259/38347/clouddumps1002.wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:01:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:02:36] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "ran puppet on clouddumps1002. complete noop" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:03:06] (03PS13) 10Dzahn: dumps/distribution: add more data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852260 [22:04:21] (03CR) 10Dzahn: [C: 04-1] "needs update after https://gerrit.wikimedia.org/r/c/operations/puppet/+/852259/9 was merged" [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:04:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40205 and previous config saved to /var/cache/conftool/dbconfig/20221118-220421-ladsgroup.json [22:04:28] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [22:04:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T323214)', diff saved to https://phabricator.wikimedia.org/P40206 and previous config saved to /var/cache/conftool/dbconfig/20221118-220450-ladsgroup.json [22:04:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [22:05:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [22:05:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T323214)', diff saved to https://phabricator.wikimedia.org/P40207 and previous config saved to /var/cache/conftool/dbconfig/20221118-220512-ladsgroup.json [22:05:58] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5017.eqsin.wmnet with OS buster [22:06:05] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5017.eqsin.wmnet with OS buster completed: - cp5017 (**PASS**) -... [22:11:06] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:12:06] (03PS3) 10Dzahn: dumps/phabricator: switch phab dumps host from phab1001 to phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) [22:14:35] (03PS4) 10Dzahn: dumps/phabricator: switch phab dumps host from phab1001 to phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) [22:16:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T323214)', diff saved to https://phabricator.wikimedia.org/P40209 and previous config saved to /var/cache/conftool/dbconfig/20221118-221612-ladsgroup.json [22:16:43] (03CR) 10Dzahn: "the dump service is running on phab1004 but waiting for it to complete:" [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:18:06] (03PS1) 10BCornwall: node: Exclude trafficserver promfile mtime check [alerts] - 
10https://gerrit.wikimedia.org/r/858658 [22:18:28] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [22:19:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P40210 and previous config saved to /var/cache/conftool/dbconfig/20221118-221927-ladsgroup.json [22:19:28] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 134 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:19:58] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:21:22] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:26:23] (03PS1) 10Andrew Bogott: wmcs-cinder-backup-manager: allow for less frequent backups [puppet] - 10https://gerrit.wikimedia.org/r/858659 (https://phabricator.wikimedia.org/T306200) [22:31:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P40211 and previous config saved to /var/cache/conftool/dbconfig/20221118-223118-ladsgroup.json [22:34:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P40212 and previous config saved to /var/cache/conftool/dbconfig/20221118-223434-ladsgroup.json [22:39:14] PROBLEM - WDQS SPARQL on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:41:44] icinga is having issues with me...or vice versa. 
hmm [22:43:00] RECOVERY - WDQS SPARQL on wdqs1016 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.076 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:43:04] PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:46:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P40213 and previous config saved to /var/cache/conftool/dbconfig/20221118-224625-ladsgroup.json [22:49:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T323214)', diff saved to https://phabricator.wikimedia.org/P40214 and previous config saved to /var/cache/conftool/dbconfig/20221118-224940-ladsgroup.json [22:49:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [22:49:47] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [22:49:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [22:50:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T323214)', diff saved to https://phabricator.wikimedia.org/P40215 and previous config saved to /var/cache/conftool/dbconfig/20221118-225002-ladsgroup.json [22:51:26] PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:01:03] (03CR) 10Dzahn: [C: 03+1] "Yea, I think it's fine. Will deploy next week though." 
[puppet] - 10https://gerrit.wikimedia.org/r/858297 (owner: 10Muehlenhoff) [23:01:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T323214)', diff saved to https://phabricator.wikimedia.org/P40216 and previous config saved to /var/cache/conftool/dbconfig/20221118-230131-ladsgroup.json [23:01:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [23:01:38] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [23:01:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [23:01:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T323214)', diff saved to https://phabricator.wikimedia.org/P40217 and previous config saved to /var/cache/conftool/dbconfig/20221118-230152-ladsgroup.json [23:02:38] (03PS1) 10Dzahn: dumps: remove phab1001 from rsync clients [puppet] - 10https://gerrit.wikimedia.org/r/858662 [23:05:39] (03PS1) 10Dzahn: phabricator: stop creating public dump on phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/858663 (https://phabricator.wikimedia.org/T280597) [23:06:51] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/output/824805/38348/clouddumps1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:07:27] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "dump finished and looks fine on phab1004 and stopping the dump script on phab1001 in https://gerrit.wikimedia.org/r/c/operations/puppet/+/" [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:07:48] (03CR) 10Dzahn: [V: 03+1 C: 03+2] dumps/phabricator: switch phab dumps host from phab1001 to phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/824805 
(https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:09:24] (03PS2) 10Krinkle: build: Update to PHPUnit 9.5 and declare php requirement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858441 (https://phabricator.wikimedia.org/T235142) [23:11:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T323214)', diff saved to https://phabricator.wikimedia.org/P40218 and previous config saved to /var/cache/conftool/dbconfig/20221118-231111-ladsgroup.json [23:11:17] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [23:11:23] (03CR) 10Krinkle: build: Upgrade symfony/yaml to 5.4.3, the version we use in prod (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793053 (owner: 10Jforrester) [23:12:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T323214)', diff saved to https://phabricator.wikimedia.org/P40219 and previous config saved to /var/cache/conftool/dbconfig/20221118-231229-ladsgroup.json [23:12:32] !log clouddumps1001 - manually ran /usr/local/bin/dump-fetch-phabdumps.sh and confirmed fetching works from new phab host phab1004 after gerrit:824805 T280597 [23:13:36] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "manually ran /usr/local/bin/dump-fetch-phabdumps.sh on clouddumps1002 and confirmed fetching works from new phab host phab1004" [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:55] T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 [23:14:47] (03CR) 10Dzahn: [C: 03+2] phabricator: stop creating public dump on phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/858663 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:15:36] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 
- https://phabricator.wikimedia.org/T308339 (wiki_willy) a:Jclark-ctr ++@Jclark-ctr, since @Cmjohnson will be out for a while >>! In T308339#8405694, @BTullis wrote: > @Cmjohnson - Let me know when you're ready to move an-tool1010 pl... [23:17:04] (CR) Dzahn: [C: +2] "timer/service removed on phab1004 by puppet. clean." [puppet] - https://gerrit.wikimedia.org/r/858663 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [23:17:22] (CR) Dzahn: [C: +2] "on phab1001 I meant to say. it's active on phab1004" [puppet] - https://gerrit.wikimedia.org/r/858663 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [23:18:34] (PS2) Dzahn: dumps: remove phab1001 from rsync clients [puppet] - https://gerrit.wikimedia.org/r/858662 (https://phabricator.wikimedia.org/T280597) [23:20:41] (CR) Dzahn: "now we can get back to this one next :)" [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [23:21:05] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:22:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:25:01] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:26:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P40220 and previous config saved to /var/cache/conftool/dbconfig/20221118-232618-ladsgroup.json [23:27:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:27:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P40221 and previous config saved to /var/cache/conftool/dbconfig/20221118-232736-ladsgroup.json [23:28:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-jumbo1013.mgmt.eqiad.wmnet with reboot policy FORCED [23:28:07] SRE, Traffic: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 
(Krinkle) [23:28:53] (PS4) Dzahn: phabricator: remove production role from phab1001 [puppet] - https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597) [23:29:01] (PS3) Dzahn: dumps: remove phab1001 from rsync clients [puppet] - https://gerrit.wikimedia.org/r/858662 (https://phabricator.wikimedia.org/T280597) [23:29:07] (PS2) Dzahn: site: remove phab1001 [puppet] - https://gerrit.wikimedia.org/r/858421 (https://phabricator.wikimedia.org/T323418) [23:29:16] (PS2) Dzahn: mariadb: remove phab1001 from production-m3 grants [puppet] - https://gerrit.wikimedia.org/r/858419 (https://phabricator.wikimedia.org/T323418) [23:29:18] (CR) CI reject: [V: -1] phabricator: remove production role from phab1001 [puppet] - https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [23:29:39] (PS2) Dzahn: phabricator: remove phab1001 as src_host from migration class [puppet] - https://gerrit.wikimedia.org/r/858420 (https://phabricator.wikimedia.org/T323418) [23:33:08] (CR) Dzahn: [V: +1] "noop https://puppet-compiler.wmflabs.org/output/858420/38349/" [puppet] - https://gerrit.wikimedia.org/r/858420 (https://phabricator.wikimedia.org/T323418) (owner: Dzahn) [23:35:17] (PS5) Dzahn: O:phabricator: move common settings to role hiera [puppet] - https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: Jbond) [23:35:42] (CR) CI reject: [V: -1] O:phabricator: move common settings to role hiera [puppet] - https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: Jbond) [23:35:46] (CR) Dzahn: "I will get back to this after Monday when phab1001 should not be production anymore." 
[puppet] - https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: Jbond) [23:41:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P40222 and previous config saved to /var/cache/conftool/dbconfig/20221118-234124-ladsgroup.json [23:42:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P40223 and previous config saved to /var/cache/conftool/dbconfig/20221118-234242-ladsgroup.json [23:44:44] /away laters [23:47:24] (CR) Dzahn: [C: +1] gitlab_runner: make one Shared Runner canary [puppet] - https://gerrit.wikimedia.org/r/858188 (owner: Jelto) [23:51:57] RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:56:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T323214)', diff saved to https://phabricator.wikimedia.org/P40225 and previous config saved to /var/cache/conftool/dbconfig/20221118-235631-ladsgroup.json [23:56:37] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [23:57:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1013.mgmt.eqiad.wmnet with reboot policy FORCED [23:57:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T323214)', diff saved to https://phabricator.wikimedia.org/P40226 and previous config saved to /var/cache/conftool/dbconfig/20221118-235749-ladsgroup.json [23:57:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [23:58:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with 
reason: Maintenance