[00:00:08] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:12] (CR) CI reject: [V: -1] phabricator: replace user{} with systemd::sysuser for daemon user [puppet] - https://gerrit.wikimedia.org/r/823767 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[00:00:18] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:14] (CR) CI reject: [V: -1] beta cluster: don't instantiate ::esitest [puppet] - https://gerrit.wikimedia.org/r/823766 (https://phabricator.wikimedia.org/T315350) (owner: Ori)
[00:02:41] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging2005.mgmt.codfw.wmnet with reboot policy FORCED
[00:03:10] (CR) Dzahn: netmon: Create LibreNMS logs file. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[00:03:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-logging2005.mgmt.codfw.wmnet with reboot policy FORCED
[00:04:48] (PS3) Andrea Denisse: netmon: Create LibreNMS logs file. [puppet] - https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393)
[00:06:19] (CR) Andrea Denisse: "LibreNMS" [puppet] - https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: Andrea Denisse)
[00:06:44] (PS3) Dzahn: phabricator::migration: add phd user with systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360)
[00:07:10] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-08-09 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:07:44] (PS2) Dzahn: phabricator: replace user{} with systemd::sysuser for daemon user [puppet] - https://gerrit.wikimedia.org/r/823767 (https://phabricator.wikimedia.org/T313360)
[00:08:55] (CR) Dzahn: netmon: Create LibreNMS logs file. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: Andrea Denisse)
[00:09:48] (CR) Dzahn: netmon: Create LibreNMS logs file. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: Andrea Denisse)
[00:10:36] (CR) Andrea Denisse: netmon: Create LibreNMS logs file. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: Andrea Denisse)
[00:12:16] (CR) Dzahn: "I think I would like it best to just use new code on new servers.. I am not sure yet how easy or messy converting an existing sysuser is.." [puppet] - https://gerrit.wikimedia.org/r/823767 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[00:12:24] (CR) Andrea Denisse: netmon: Create LibreNMS logs file. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: Andrea Denisse)
[00:14:42] (CR) Andrea Denisse: netmon: Create LibreNMS logs file. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: Andrea Denisse)
[00:15:59] SRE, Infrastructure-Foundations, netops: Rancid on netmon1003 unable to login to network devices - https://phabricator.wikimedia.org/T314936 (andrea.denisse)
[00:16:49] SRE, Infrastructure-Foundations, netops: Rancid on netmon1003 unable to login to network devices - https://phabricator.wikimedia.org/T314936 (andrea.denisse) In progress→Resolved
[00:17:20] SRE, Infrastructure-Foundations, netops: Rancid on netmon1003 unable to login to network devices - https://phabricator.wikimedia.org/T314936 (andrea.denisse) Fixed in [[ https://gerrit.wikimedia.org/r/822196 | 822196 ]].
[00:20:54] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-08-09 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:25:07] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging2004']
[00:25:31] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-logging2004']
[00:26:32] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging2004']
[00:26:35] SRE, Cloud-VPS, Performance-Team (Radar), cloud-services-team (Kanban): CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (Krinkle)
[00:29:40] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-08-09 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:33:00] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-08-09 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:33:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-logging2004']
[00:44:37] SRE, Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (tstarling)
[00:48:08] (PS1) Ryan Kemper: elastic: decom elastic10[49-50] [puppet] - https://gerrit.wikimedia.org/r/823771 (https://phabricator.wikimedia.org/T309810)
[00:50:58] (CR) Ryan Kemper: [C: +2] elastic: decom elastic10[49-50] [puppet] - https://gerrit.wikimedia.org/r/823771 (https://phabricator.wikimedia.org/T309810) (owner: Ryan Kemper)
[00:51:47] (PS1) DDesouza: QuickSurveys: Remove research incentive survey from BN wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/823772 (https://phabricator.wikimedia.org/T31433)
[00:52:38] Puppet, Infrastructure-Foundations, MobileFrontend (Tracking), User-Jdlrobson: Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425 (Iniquity) >>! In T60425#5179124, @MaxSem wrote: > Why do y...
[00:55:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:56:43] (PS2) DDesouza: QuickSurveys: Remove research incentive survey from BN wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/823772 (https://phabricator.wikimedia.org/T314333)
[00:58:34] (PS4) Samtar: profile::etcd::v3: use puppet certs for standalone cluster [puppet] - https://gerrit.wikimedia.org/r/668701 (owner: Giuseppe Lavagetto)
[01:00:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:03:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging2005.mgmt.codfw.wmnet with reboot policy FORCED
[01:03:54] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Wed 24 Aug 2022 07:48:40 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:05:11] (PS5) Samtar: profile::etcd::v3: use puppet certs for standalone cluster [puppet] - https://gerrit.wikimedia.org/r/668701 (https://phabricator.wikimedia.org/T315395) (owner: Giuseppe Lavagetto)
[01:06:16] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:12:19] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts elastic[1049-1050].eqiad.wmnet
[01:13:18] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Wed 24 Aug 2022 07:48:40 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:15:36] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:16:43] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging2005']
[01:23:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-logging2005']
[01:26:34] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[01:28:47] SRE, ops-codfw, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging200[45] - https://phabricator.wikimedia.org/T313959 (Papaul)
[01:31:14] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[01:37:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:52] Looks like the sec patch for T309894 fell off for wmf.23 and wmf.25 (likely before that tbh). The patch still applies fine to those versions so I'm going to reapply and scap them out now.
[01:49:16] !log ryankemper@cumin1001 START - Cookbook sre.dns.netbox
[01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:53:06] (PS1) Jforrester: Add try…catch in failing deferred update [extensions/DiscussionTools] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/823640 (https://phabricator.wikimedia.org/T315383)
[01:54:36] !log Re-deployed security fix for T309894 to wmf.23
[01:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:50] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-08-16 00:00:01 (3384 GiB, +0.7 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:59:43] !log Re-deployed security fix for T309894 to wmf.25
[01:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:05] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[02:07:06] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic[1049-1050].eqiad.wmnet
[02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:26] (PS1) Ryan Kemper: elastic: decom elastic10[51-52] [puppet] - https://gerrit.wikimedia.org/r/823786 (https://phabricator.wikimedia.org/T309810)
[02:10:57] (PS1) Andrew Bogott: cloudcephosd1025 through 1034: rename ip interfaces for Bullseye [puppet] - https://gerrit.wikimedia.org/r/823787 (https://phabricator.wikimedia.org/T314870)
[02:10:59] (PS2) Ryan Kemper: elastic: decom elastic10[51-52] [puppet] - https://gerrit.wikimedia.org/r/823786 (https://phabricator.wikimedia.org/T309810)
[02:15:00] (CR) Ryan Kemper: [C: +2] elastic: decom elastic10[51-52] [puppet] - https://gerrit.wikimedia.org/r/823786 (https://phabricator.wikimedia.org/T309810) (owner: Ryan Kemper)
[02:16:08] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts elastic[1051-1052].eqiad.wmnet
[02:16:35] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts elastic[1051-1052].eqiad.wmnet
[02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:38] (PS1) Ryan Kemper: elastic: pick new canary host [puppet] - https://gerrit.wikimedia.org/r/823788 (https://phabricator.wikimedia.org/T309810)
[02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:25:15] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:27:51] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:01] SRE, Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (ori) > I propose making this change on all eqiad appservers in soft state, with cumin. Our latency metrics are noisy so changing it everywhere at once will give us the best chance...
[02:30:05] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:47] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:32:33] (CR) Ryan Kemper: [C: +2] elastic: pick new canary host [puppet] - https://gerrit.wikimedia.org/r/823788 (https://phabricator.wikimedia.org/T309810) (owner: Ryan Kemper)
[02:32:59] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts elastic[1051-1052].eqiad.wmnet
[02:33:43] (PS1) SBassett: Enable StopForumSpam on initial candidate projects [mediawiki-config] - https://gerrit.wikimedia.org/r/823789 (https://phabricator.wikimedia.org/T273220)
[02:34:30] (CR) SBassett: [C: -2] "Prepping for tomorrow's deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/823789 (https://phabricator.wikimedia.org/T273220) (owner: SBassett)
[02:35:29] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[02:37:03] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-08-16 00:00:02 (3362 GiB, +0.7 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:38:41] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:40:29] (PS2) SBassett: Enable StopForumSpam on initial candidate projects [mediawiki-config] - https://gerrit.wikimedia.org/r/823789 (https://phabricator.wikimedia.org/T273220)
[02:45:35] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:45:40] (PS1) SBassett: Enable StopForumSpam on initial candidate projects [mediawiki-config] - https://gerrit.wikimedia.org/r/823790 (https://phabricator.wikimedia.org/T273220)
[02:45:42] !log ryankemper@cumin1001 START - Cookbook sre.dns.netbox
[02:46:39] (PS2) SBassett: Enable StopForumSpam on initial candidate projects (CommonSettings) [mediawiki-config] - https://gerrit.wikimedia.org/r/823790 (https://phabricator.wikimedia.org/T273220)
[02:47:05] (CR) SBassett: [C: -2] "Prepping for tomorrow's deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/823790 (https://phabricator.wikimedia.org/T273220) (owner: SBassett)
[02:51:37] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[02:53:11] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:58:28] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[02:58:28] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic[1051-1052].eqiad.wmnet
[03:03:36] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-08-16 00:00:01 (3362 GiB, +0.7 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:07:16] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:13:30] RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-08-16 00:00:02 (3384 GiB, +0.7 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:19:06] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Wed 24 Aug 2022 07:48:40 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:20:50] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:29:02] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:22] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:20] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: search-drop-query-clicks.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:42:40] PROBLEM - Check unit status of search-drop-query-clicks on stat1007 is CRITICAL: CRITICAL: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:43:56] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:14] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:53:14] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:53:21] (PS2) DLynch: Make DiscussionTools topicsubscription opt-out on A/B test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/823749 (https://phabricator.wikimedia.org/T314693) (owner: Bartosz Dziewoński)
[04:00:10] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:32] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:08:47] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1025.eqiad.wmnet with OS bullseye
[04:09:12] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1030.eqiad.wmnet with OS bullseye
[04:11:46] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:13:06] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Wed 24 Aug 2022 07:48:40 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:15:26] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:15:30] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:28] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:05] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1030.eqiad.wmnet with reason: host reimage
[04:23:30] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1027.eqiad.wmnet with OS bullseye
[04:23:34] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1028.eqiad.wmnet with OS bullseye
[04:23:37] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1029.eqiad.wmnet with OS bullseye
[04:23:41] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1031.eqiad.wmnet with OS bullseye
[04:23:44] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1032.eqiad.wmnet with OS bullseye
[04:23:46] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1033.eqiad.wmnet with OS bullseye
[04:23:47] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1034.eqiad.wmnet with OS bullseye
[04:25:44] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1030.eqiad.wmnet with reason: host reimage
[04:31:11] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1030.eqiad.wmnet with OS bullseye
[04:37:03] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:09] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1030.eqiad.wmnet with OS bullseye
[04:45:57] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:47:56] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1027.eqiad.wmnet with OS bullseye
[04:47:59] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1028.eqiad.wmnet with OS bullseye
[04:48:03] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1029.eqiad.wmnet with OS bullseye
[04:48:05] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1031.eqiad.wmnet with OS bullseye
[04:48:08] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1033.eqiad.wmnet with OS bullseye
[04:48:10] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1034.eqiad.wmnet with OS bullseye
[04:51:29] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:55:59] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1030.eqiad.wmnet with reason: host reimage
[04:56:55] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1027.eqiad.wmnet with OS bullseye
[04:57:05] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1028.eqiad.wmnet with OS bullseye
[04:57:12] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1029.eqiad.wmnet with OS bullseye
[04:57:28] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1031.eqiad.wmnet with OS bullseye
[04:57:36] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1033.eqiad.wmnet with OS bullseye
[04:57:45] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1034.eqiad.wmnet with OS bullseye
[04:58:26] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1030.eqiad.wmnet with reason: host reimage
[04:59:54] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1026.eqiad.wmnet with OS bullseye
[04:59:56] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1025.eqiad.wmnet with OS bullseye
[05:00:17] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:56] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1032.eqiad.wmnet with OS bullseye
[05:03:40] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1032.eqiad.wmnet with OS bullseye
[05:13:39] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1026.eqiad.wmnet with reason: host reimage
[05:13:49] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1025.eqiad.wmnet with reason: host reimage
[05:14:17] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1030.eqiad.wmnet with OS bullseye
[05:16:41] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1026.eqiad.wmnet with reason: host reimage
[05:19:02] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1025.eqiad.wmnet with reason: host reimage
[05:23:14] PROBLEM - Disk space on ms-be1071 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdz1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1071&var-datasource=eqiad+prometheus/ops
[05:26:32] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1034.eqiad.wmnet with OS bullseye
[05:26:42] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1033.eqiad.wmnet with OS bullseye
[05:26:45] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1032.eqiad.wmnet with OS bullseye
[05:26:48] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1029.eqiad.wmnet with OS bullseye
[05:26:50] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1031.eqiad.wmnet with OS bullseye
[05:26:52] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1028.eqiad.wmnet with OS bullseye
[05:26:55] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1027.eqiad.wmnet with OS bullseye
[05:31:01] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1027.eqiad.wmnet with OS bullseye
[05:31:02] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1028.eqiad.wmnet with OS bullseye
[05:31:04] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1029.eqiad.wmnet with OS bullseye
[05:31:05] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1031.eqiad.wmnet with OS bullseye
[05:31:06] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1032.eqiad.wmnet with OS bullseye
[05:31:07] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1033.eqiad.wmnet with OS bullseye
[05:31:09] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1034.eqiad.wmnet with OS bullseye
[05:31:11] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1026.eqiad.wmnet with OS bullseye
[05:32:39] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:39:13] PROBLEM - Check systemd state on ms-be1071 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:55] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1034.eqiad.wmnet with OS bullseye
[05:50:59] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1033.eqiad.wmnet with OS bullseye
[05:51:02] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1032.eqiad.wmnet with OS bullseye
[05:51:05] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1031.eqiad.wmnet with OS bullseye
[05:51:08] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1028.eqiad.wmnet with OS bullseye
[05:51:10] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1029.eqiad.wmnet with OS bullseye
[05:51:12] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1027.eqiad.wmnet with OS bullseye
[05:53:33] SRE, Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (ori) Based on https://www.kernel.org/doc/html/v5.6/admin-guide/pm/intel_pstate.html#operation-modes the scaling behavior will be different for systems depending on whether or not h...
[05:57:06] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1027.eqiad.wmnet with OS bullseye [05:57:08] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1028.eqiad.wmnet with OS bullseye [05:57:09] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1029.eqiad.wmnet with OS bullseye [05:57:10] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1031.eqiad.wmnet with OS bullseye [05:57:11] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1032.eqiad.wmnet with OS bullseye [05:57:12] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1033.eqiad.wmnet with OS bullseye [05:57:14] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1034.eqiad.wmnet with OS bullseye [05:59:05] (03PS1) 10Tim Starling: Set binlog_format=STATEMENT on x2 servers [puppet] - 10https://gerrit.wikimedia.org/r/824037 (https://phabricator.wikimedia.org/T315271) [06:00:19] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1025.eqiad.wmnet with OS bullseye [06:08:27] (03PS1) 10Tim Starling: Re-enable multi-DC mode on testwiki, test2wiki and mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/824039 (https://phabricator.wikimedia.org/T315271) [06:09:39] (03Abandoned) 10Tim Starling: Remove codfw hosts from X-Wikimedia-Debug [puppet] - 10https://gerrit.wikimedia.org/r/823518 (owner: 10Tim Starling) [06:10:18] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1027.eqiad.wmnet with reason: host reimage [06:10:26] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1031.eqiad.wmnet with reason: host reimage [06:10:29] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1029.eqiad.wmnet with reason: host reimage [06:10:33] !log andrew@cumin1001 START - Cookbook 
sre.hosts.downtime for 2:00:00 on cloudcephosd1032.eqiad.wmnet with reason: host reimage [06:10:37] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1033.eqiad.wmnet with reason: host reimage [06:10:40] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1028.eqiad.wmnet with reason: host reimage [06:13:39] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1027.eqiad.wmnet with reason: host reimage [06:14:40] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:15:28] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1032.eqiad.wmnet with reason: host reimage [06:15:44] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:17:51] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1031.eqiad.wmnet with reason: host reimage [06:18:28] (03PS1) 10Andrew Bogott: Revert "netboot.cfg: temporarily switch cloudvirt1025 to a full reimage" [puppet] - 10https://gerrit.wikimedia.org/r/824041 [06:19:00] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:20:12] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcephosd1028.eqiad.wmnet with reason: host reimage [06:20:19] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1033.eqiad.wmnet with reason: host reimage [06:20:29] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on 
cloudcephosd1029.eqiad.wmnet with reason: host reimage [06:21:38] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1034.eqiad.wmnet with OS bullseye [06:21:51] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1034.eqiad.wmnet with OS bullseye [06:23:23] (03CR) 10Filippo Giunchedi: [C: 03+2] memcached: point to active/used configuration options [puppet] - 10https://gerrit.wikimedia.org/r/822039 (https://phabricator.wikimedia.org/T314914) (owner: 10Filippo Giunchedi) [06:28:01] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1027.eqiad.wmnet with OS bullseye [06:28:42] (03CR) 10Filippo Giunchedi: [C: 04-1] "Idea LGTM, though see inline" [puppet] - 10https://gerrit.wikimedia.org/r/823748 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [06:30:34] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1032.eqiad.wmnet with OS bullseye [06:31:53] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! See also inline for a possible simplification. +1'ing though this depends on quickdatacopy change of course" [puppet] - 10https://gerrit.wikimedia.org/r/823752 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [06:32:50] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! See also inline for a possible simplification. 
+1'ing though this depends on quickdatacopy change of course" [puppet] - 10https://gerrit.wikimedia.org/r/823759 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [06:33:37] (03CR) 10Filippo Giunchedi: [C: 03+2] dispatch: add container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/821698 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [06:33:45] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] dispatch: add container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/821698 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [06:34:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:34:55] jouncebot: nowandnext [06:34:55] No deployments scheduled for the next 0 hour(s) and 25 minute(s) [06:34:55] In 0 hour(s) and 25 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220817T0700) [06:35:05] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1034.eqiad.wmnet with reason: host reimage [06:35:12] no task there either [06:35:36] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10Trokhymovych) Public SSH key: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIB8/dfbAQjsOu3EzPIosLsY0Dxz0LOMtW2dKPndAqDnh trokhymovych.mykola@gmail.com [06:36:22] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1029.eqiad.wmnet with OS bullseye [06:36:28] (03PS1) 10Ladsgroup: SpecialRecentChangesLinked: Use rdbms code for building the main query [core] (wmf/1.39.0-wmf.25) - 
10https://gerrit.wikimedia.org/r/824053 [06:36:32] (03CR) 10Ladsgroup: [C: 03+2] SpecialRecentChangesLinked: Use rdbms code for building the main query [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824053 (owner: 10Ladsgroup) [06:36:42] (03PS1) 10Ladsgroup: SpecialRecentChangesLinked: Use rdbms code for building the main query [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/824054 [06:36:49] (03CR) 10Ladsgroup: [C: 03+2] SpecialRecentChangesLinked: Use rdbms code for building the main query [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/824054 (owner: 10Ladsgroup) [06:37:36] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1031.eqiad.wmnet with OS bullseye [06:38:40] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1034.eqiad.wmnet with reason: host reimage [06:38:55] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1033.eqiad.wmnet with OS bullseye [06:39:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:42:51] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1028.eqiad.wmnet with OS bullseye [06:43:36] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:45:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T314041)', diff saved to https://phabricator.wikimedia.org/P32412 and previous config saved to /var/cache/conftool/dbconfig/20220817-064534-ladsgroup.json [06:45:38] T314041: Drop 
old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [06:47:58] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:50:08] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:54:05] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1034.eqiad.wmnet with OS bullseye [06:57:08] (03Merged) 10jenkins-bot: SpecialRecentChangesLinked: Use rdbms code for building the main query [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824053 (owner: 10Ladsgroup) [06:58:15] (03Merged) 10jenkins-bot: SpecialRecentChangesLinked: Use rdbms code for building the main query [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/824054 (owner: 10Ladsgroup) [07:00:05] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220817T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:00:16] I have these two patches [07:00:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P32413 and previous config saved to /var/cache/conftool/dbconfig/20220817-070040-ladsgroup.json [07:04:15] sigh, this is breaking :/ [07:04:32] it works in beta cluster but not test wikipedia [07:05:02] > Error 1060: Duplicate column name 'rc_title' [07:05:20] (03PS1) 10Ladsgroup: Revert "SpecialRecentChangesLinked: Use rdbms code for building the main query" [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824055 [07:05:29] (03PS1) 10Ladsgroup: Revert "SpecialRecentChangesLinked: Use rdbms code for building the main query" [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/824056 [07:05:33] (03CR) 10Ladsgroup: [C: 03+2] Revert "SpecialRecentChangesLinked: Use rdbms code for building the main query" [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824055 (owner: 10Ladsgroup) [07:05:36] (03CR) 10Ladsgroup: [C: 03+2] Revert "SpecialRecentChangesLinked: Use rdbms code for building the main query" [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/824056 (owner: 10Ladsgroup) [07:13:16] (03PS1) 10Hashar: Merge tag 'v3.4.5' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/824122 [07:13:35] (03PS2) 10Hashar: Merge tag 'v3.4.5' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/824122 (https://phabricator.wikimedia.org/T315408) [07:15:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P32414 and previous config saved to /var/cache/conftool/dbconfig/20220817-071546-ladsgroup.json [07:17:43] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/823787 (https://phabricator.wikimedia.org/T314870) (owner: 10Andrew Bogott) [07:19:42] RECOVERY - SSH on 
wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:20:58] (03PS1) 10Hashar: [WMF] update javamelody plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/824124 [07:23:23] (03CR) 10Hashar: [C: 03+2] Merge tag 'v3.4.5' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/824122 (https://phabricator.wikimedia.org/T315408) (owner: 10Hashar) [07:23:32] (03Merged) 10jenkins-bot: Revert "SpecialRecentChangesLinked: Use rdbms code for building the main query" [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824055 (owner: 10Ladsgroup) [07:23:58] (03Merged) 10jenkins-bot: Revert "SpecialRecentChangesLinked: Use rdbms code for building the main query" [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/824056 (owner: 10Ladsgroup) [07:26:14] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:30:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T314041)', diff saved to https://phabricator.wikimedia.org/P32415 and previous config saved to /var/cache/conftool/dbconfig/20220817-073052-ladsgroup.json [07:30:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [07:30:56] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [07:31:02] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T315229 (10Marostegui) p:05Triage→03Medium @Papaul @wiki_willy any chances we can buy one? This is s4 master. 
[07:31:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [07:31:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:31:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:31:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T314041)', diff saved to https://phabricator.wikimedia.org/P32416 and previous config saved to /var/cache/conftool/dbconfig/20220817-073141-ladsgroup.json [07:34:08] (03PS1) 10Marostegui: db1132: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/824127 [07:34:35] (03Merged) 10jenkins-bot: Merge tag 'v3.4.5' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/824122 (https://phabricator.wikimedia.org/T315408) (owner: 10Hashar) [07:35:19] (03CR) 10Marostegui: [C: 03+2] db1132: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/824127 (owner: 10Marostegui) [07:35:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32417 and previous config saved to /var/cache/conftool/dbconfig/20220817-073553-root.json [07:36:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 1%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32418 and previous config saved to /var/cache/conftool/dbconfig/20220817-073613-root.json [07:37:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 1%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32419 and previous config saved to /var/cache/conftool/dbconfig/20220817-073701-root.json [07:37:12] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'db1143 (re)pooling @ 1%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32420 and previous config saved to /var/cache/conftool/dbconfig/20220817-073712-root.json [07:41:00] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:41:37] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Marostegui) Thanks everyone for working on this! From my side db1187 and db1185 aren't reachable. @Cmjohnson can you take a look? Thanks. [07:45:08] (03PS1) 10Samwilson: Enable Realtime Preview on Group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824128 (https://phabricator.wikimedia.org/T314182) [07:46:52] (03CR) 10Hashar: [C: 03+2] [WMF] update javamelody plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/824124 (owner: 10Hashar) [07:47:03] (03PS1) 10David Caro: p:admin: ensure the shells exist before the users are created [puppet] - 10https://gerrit.wikimedia.org/r/824130 [07:50:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 2%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32421 and previous config saved to /var/cache/conftool/dbconfig/20220817-075057-root.json [07:51:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 5%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32422 and previous config saved to /var/cache/conftool/dbconfig/20220817-075118-root.json [07:52:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32423 and previous config saved to /var/cache/conftool/dbconfig/20220817-075206-root.json [07:52:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 2%: Repooling 10.6', 
diff saved to https://phabricator.wikimedia.org/P32424 and previous config saved to /var/cache/conftool/dbconfig/20220817-075216-root.json [07:52:24] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) [07:54:19] (03CR) 10RhinosF1: [C: 03+1] Enable Realtime Preview on Group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824128 (https://phabricator.wikimedia.org/T314182) (owner: 10Samwilson) [07:54:46] (03PS2) 10JMeybohm: sre.discovery.service-route: Make the cookbook work [cookbooks] - 10https://gerrit.wikimedia.org/r/823659 (https://phabricator.wikimedia.org/T260663) [07:55:04] (03Merged) 10jenkins-bot: [WMF] update javamelody plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/824124 (owner: 10Hashar) [07:55:42] (03PS1) 10Jcrespo: mariadb: Move notifications disabled from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/824131 [07:56:28] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:01:00] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:02:16] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:05:12] (03PS2) 10Jcrespo: mariadb: Fix syntax for disabling notifications role-wise on test hosts [puppet] - 10https://gerrit.wikimedia.org/r/824131 [08:06:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32425 and previous config saved to /var/cache/conftool/dbconfig/20220817-080602-root.json 
[08:06:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 10%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32426 and previous config saved to /var/cache/conftool/dbconfig/20220817-080622-root.json [08:07:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32427 and previous config saved to /var/cache/conftool/dbconfig/20220817-080710-root.json [08:07:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 5%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32428 and previous config saved to /var/cache/conftool/dbconfig/20220817-080721-root.json [08:10:11] (03CR) 10Jcrespo: "I think this works now:" [puppet] - 10https://gerrit.wikimedia.org/r/824131 (owner: 10Jcrespo) [08:14:04] (03PS1) 10Hashar: Gerrit v3.4.5 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/824134 (https://phabricator.wikimedia.org/T315408) [08:14:16] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:24] (03CR) 10CI reject: [V: 04-1] Gerrit v3.4.5 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/824134 (https://phabricator.wikimedia.org/T315408) (owner: 10Hashar) [08:15:41] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Fix syntax for disabling notifications role-wise on test hosts [puppet] - 10https://gerrit.wikimedia.org/r/824131 (owner: 10Jcrespo) [08:16:26] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:20:15] (03CR) 10Hashar: "recheck" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/824134 
(https://phabricator.wikimedia.org/T315408) (owner: 10Hashar) [08:20:27] (03CR) 10Jcrespo: "I think Manuel should lead this but IMHO, this shouldn't be added to the hosts, as that will be lost when host are upgraded or under maint" [puppet] - 10https://gerrit.wikimedia.org/r/824037 (https://phabricator.wikimedia.org/T315271) (owner: 10Tim Starling) [08:21:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32429 and previous config saved to /var/cache/conftool/dbconfig/20220817-082106-root.json [08:21:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32430 and previous config saved to /var/cache/conftool/dbconfig/20220817-082127-root.json [08:22:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32431 and previous config saved to /var/cache/conftool/dbconfig/20220817-082215-root.json [08:22:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 10%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32432 and previous config saved to /var/cache/conftool/dbconfig/20220817-082226-root.json [08:25:14] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:32:07] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:36:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32433 and previous config saved to /var/cache/conftool/dbconfig/20220817-083611-root.json [08:36:32] !log marostegui@cumin1001 
dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32434 and previous config saved to /var/cache/conftool/dbconfig/20220817-083631-root.json [08:37:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32435 and previous config saved to /var/cache/conftool/dbconfig/20220817-083719-root.json [08:37:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32436 and previous config saved to /var/cache/conftool/dbconfig/20220817-083730-root.json [08:40:06] (03CR) 10Jbond: "need to circle back to this, im pretty sure this is good to go and matches production. I suspect this was broken with the following" [puppet] - 10https://gerrit.wikimedia.org/r/823762 (owner: 10Zabe) [08:41:44] (03CR) 10Jbond: "and the incident report https://wikitech.wikimedia.org/wiki/Incidents/2022-05-24_Failed_Apache_restart" [puppet] - 10https://gerrit.wikimedia.org/r/823762 (owner: 10Zabe) [08:43:29] (03CR) 10Jbond: [C: 03+2] role::puppetmaster::standalone: Remove apache2 standard ports [puppet] - 10https://gerrit.wikimedia.org/r/823762 (owner: 10Zabe) [08:45:40] (03PS11) 10Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/820749 [08:47:20] (03CR) 10Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche) [08:49:56] (03CR) 10Cathal Mooney: [C: 03+2] admin: Add Purity Waigi to 'wmf' LDAP group and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/823654 (https://phabricator.wikimedia.org/T315257) (owner: 10Cathal Mooney) [08:51:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: Repooling 10.6', diff saved to 
https://phabricator.wikimedia.org/P32437 and previous config saved to /var/cache/conftool/dbconfig/20220817-085115-root.json [08:51:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32438 and previous config saved to /var/cache/conftool/dbconfig/20220817-085136-root.json [08:52:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32439 and previous config saved to /var/cache/conftool/dbconfig/20220817-085224-root.json [08:52:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32440 and previous config saved to /var/cache/conftool/dbconfig/20220817-085235-root.json [08:54:20] (03CR) 10Hashar: [C: 03+2] Gerrit v3.4.5 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/824134 (https://phabricator.wikimedia.org/T315408) (owner: 10Hashar) [08:54:42] (03Merged) 10jenkins-bot: Gerrit v3.4.5 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/824134 (https://phabricator.wikimedia.org/T315408) (owner: 10Hashar) [08:56:16] 10SRE, 10Infrastructure-Foundations, 10netops: Rancid on netmon1003 unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney) 05Resolved→03Open @andrea.denisse Hey, does your patch correct the other problem I observed above? With the prompt for accepting the host key cau... [08:59:44] I am going to upgrade Gerrit from 3.4.4 to 3.4.5 [08:59:47] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (10cmooney) @PWaigi-WMF I've added you to the correct groups now, can you test the access and advise if it is now working? Thanks. 
[09:00:57] (03PS3) 10Jcrespo: mariadb: Fix syntax for disabling notifications role-wise on test hosts [puppet] - 10https://gerrit.wikimedia.org/r/824131 [09:01:04] (03PS4) 10Jcrespo: mariadb: Fix syntax for disabling notifications role-wise on test hosts [puppet] - 10https://gerrit.wikimedia.org/r/824131 [09:03:27] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:03:50] !log hashar@deploy1002 Started deploy [gerrit/gerrit@e11e6a7]: Gerrit to 3.4.5 on gerrit 2002 # T315408 [09:03:54] T315408: Upgrade Gerrit to 3.4.5 - https://phabricator.wikimedia.org/T315408 [09:04:01] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@e11e6a7]: Gerrit to 3.4.5 on gerrit 2002 # T315408 (duration: 00m 11s) [09:05:20] (03CR) 10David Caro: global: add inventory module (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 (owner: 10David Caro) [09:06:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32441 and previous config saved to /var/cache/conftool/dbconfig/20220817-090620-root.json [09:07:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32442 and previous config saved to /var/cache/conftool/dbconfig/20220817-090739-root.json [09:09:06] !log hashar@deploy1002 Started deploy [gerrit/gerrit@e11e6a7]: Gerrit to 3.4.5 on gerrit1001 # T315408 [09:09:10] T315408: Upgrade Gerrit to 3.4.5 - https://phabricator.wikimedia.org/T315408 [09:09:15] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@e11e6a7]: Gerrit to 3.4.5 on gerrit1001 # T315408 (duration: 00m 09s) [09:10:53] !log Upgraded Gerrit from 3.4.4 to 3.4.5 # T315408 [09:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:42] (03PS1) 10Jbond: 
P:cache::varnish::frontend::text: make esitest ensurable and absent on cloud [puppet] - 10https://gerrit.wikimedia.org/r/824146 (https://phabricator.wikimedia.org/T315350) [09:14:32] (03CR) 10Jcrespo: [C: 03+2] mariadb: Fix syntax for disabling notifications role-wise on test hosts [puppet] - 10https://gerrit.wikimedia.org/r/824131 (owner: 10Jcrespo) [09:14:35] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10cmooney) Hi @Trokhymovych thanks for sharing the key, I assume you need shell/ssh access is that correct? I notice you already have two SSH pubkeys associated with your account -... [09:15:31] (03PS2) 10Jbond: P:cache::varnish::frontend::text: make esitest ensurable and absent on cloud [puppet] - 10https://gerrit.wikimedia.org/r/824146 (https://phabricator.wikimedia.org/T315394) [09:15:33] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10cmooney) @KFrancis can you confirm everything is ok here in terms of signed NDA before I add this user to our nda group? Thanks. [09:16:57] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/824147 (https://phabricator.wikimedia.org/T315419) [09:16:57] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10cmooney) @Ottomata @odimitrijevic are you ok to approve this access? Thanks. 
[09:17:02] (PS1) Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/824148 (https://phabricator.wikimedia.org/T315419)
[09:17:50] (PS3) Jbond: P:cache::varnish::frontend::text: make esitest ensurable and absent on cloud [puppet] - https://gerrit.wikimedia.org/r/824146 (https://phabricator.wikimedia.org/T315394)
[09:19:58] (PS4) Jbond: P:cache::varnish::frontend::text: make esitest ensurable and absent on cloud [puppet] - https://gerrit.wikimedia.org/r/824146 (https://phabricator.wikimedia.org/T315394)
[09:21:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32443 and previous config saved to /var/cache/conftool/dbconfig/20220817-092125-root.json
[09:22:42] (PS2) Samtar: beta cluster: don't instantiate ::esitest [puppet] - https://gerrit.wikimedia.org/r/823766 (https://phabricator.wikimedia.org/T315350) (owner: Ori)
[09:22:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: Repooling 10.6', diff saved to https://phabricator.wikimedia.org/P32444 and previous config saved to /var/cache/conftool/dbconfig/20220817-092244-root.json
[09:23:48] SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (cmooney) Open→In progress p:Triage→Medium
[09:24:31] (PS1) David Caro: ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824149
[09:24:34] SRE, SRE-Access-Requests, LDAP-Access-Requests, Patch-For-Review: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (cmooney) Open→In progress p:Triage→Medium
[09:30:54] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (cmooney) Ok reached out to user on slack to confirm ssh key after which I will add for them.
[09:32:04] (PS1) Jbond: C:trafficserver: Ensure we only instantiate the trafficserver class once [puppet] - https://gerrit.wikimedia.org/r/824150 (https://phabricator.wikimedia.org/T315394)
[09:36:14] (PS5) Jbond: P:cache::varnish::frontend::text: make esitest ensurable and absent on cloud [puppet] - https://gerrit.wikimedia.org/r/824146 (https://phabricator.wikimedia.org/T315394)
[09:37:56] (CR) Vgutierrez: [C: +1] C:trafficserver: Ensure we only instantiate the trafficserver class once [puppet] - https://gerrit.wikimedia.org/r/824150 (https://phabricator.wikimedia.org/T315394) (owner: Jbond)
[09:41:57] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:46:24] (PS6) Jbond: P:cache::varnish::frontend::text: make esitest ensurable and absent on cloud [puppet] - https://gerrit.wikimedia.org/r/824146 (https://phabricator.wikimedia.org/T315394)
[09:46:50] (PS1) Zabe: Start writing to cuc_actor on s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/824152 (https://phabricator.wikimedia.org/T233004)
[09:47:03] (PS3) David Caro: global: add inventory module [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823169
[09:47:05] (PS2) David Caro: Openstack: use cluster_name instead of control node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823666
[09:47:07] (PS2) David Caro: ceph: use cluster_name instead of control node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823667
[09:47:09] (PS2) David Caro: ceph: use human-readable names for ceph clusters [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823668
[09:47:11] (PS2) David Caro: ceph: use the correct codfw ceph mon hosts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823669
[09:47:13] (PS2) David Caro: ceph,opensatck: use the inventory to get the nodes domain [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823670
[09:47:15] (PS2) David Caro: ceph: add roll_restart_osd_daemons cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823671
[09:47:17] (PS2) David Caro: ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824149
[09:47:19] (PS1) David Caro: WIP: adding support to change the osd class type [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824153
[09:47:32] (CR) Zabe: [C: -1] "needs to wait for the master switchover which is scheduled to be tomorrow morning" [mediawiki-config] - https://gerrit.wikimedia.org/r/824152 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[09:47:42] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36782/console" [puppet] - https://gerrit.wikimedia.org/r/824146 (https://phabricator.wikimedia.org/T315394) (owner: Jbond)
[09:48:36] (CR) Jbond: [C: +2] C:trafficserver: Ensure we only instantiate the trafficserver class once [puppet] - https://gerrit.wikimedia.org/r/824150 (https://phabricator.wikimedia.org/T315394) (owner: Jbond)
[09:49:30] SRE-Access-Requests, Data-Engineering, Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (Urbanecm) Looks like analytics-privatedata-users request to me. Tagging with #sre-access-requests.
[09:57:12] (CR) CI reject: [V: -1] ceph: use human-readable names for ceph clusters [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823668 (owner: David Caro)
[09:57:22] (CR) CI reject: [V: -1] ceph,opensatck: use the inventory to get the nodes domain [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823670 (owner: David Caro)
[09:57:27] (CR) CI reject: [V: -1] Openstack: use cluster_name instead of control node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823666 (owner: David Caro)
[09:57:59] (CR) CI reject: [V: -1] ceph: use the correct codfw ceph mon hosts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823669 (owner: David Caro)
[09:58:01] (CR) CI reject: [V: -1] ceph: add roll_restart_osd_daemons cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823671 (owner: David Caro)
[09:58:25] (CR) CI reject: [V: -1] global: add inventory module [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823169 (owner: David Caro)
[09:58:44] (CR) CI reject: [V: -1] WIP: adding support to change the osd class type [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824153 (owner: David Caro)
[09:58:46] (CR) CI reject: [V: -1] ceph: use cluster_name instead of control node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823667 (owner: David Caro)
[10:00:14] (CR) CI reject: [V: -1] ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824149 (owner: David Caro)
[10:00:25] (PS1) Jbond: cloud: add etcd CA to cloud pki infrastracture [puppet] - https://gerrit.wikimedia.org/r/824155 (https://phabricator.wikimedia.org/T315395)
[10:00:56] (CR) Jbond: [V: +2 C: +2] cloud: add etcd CA to cloud pki infrastracture [puppet] - https://gerrit.wikimedia.org/r/824155 (https://phabricator.wikimedia.org/T315395) (owner: Jbond)
[10:02:16] (PS3) David Caro: ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824149
[10:02:18] (PS2) David Caro: WIP: adding support to change the osd class type [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824153
[10:05:12] jbond: ref T315395, did you want to take it?
[10:05:12] T315395: Rebase & merge or re-cherry-pick 668701 on deployment-puppetmaster04 - https://phabricator.wikimedia.org/T315395
[10:05:55] TheresNoTime: sure thing
[10:10:46] (CR) CI reject: [V: -1] WIP: adding support to change the osd class type [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824153 (owner: David Caro)
[10:12:18] (CR) CI reject: [V: -1] ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824149 (owner: David Caro)
[10:18:33] (CR) Marostegui: Set binlog_format=STATEMENT on x2 servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824037 (https://phabricator.wikimedia.org/T315271) (owner: Tim Starling)
[10:18:56] (CR) RhinosF1: "This change is ready for review." [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824166 (owner: RhinosF1)
[10:19:17] (PS2) RhinosF1: quota_increase: pretty format SAL entry [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824166
[10:19:53] (PS1) Jbond: hieradata - cloud: add etcd CI to cloud pki instance [puppet] - https://gerrit.wikimedia.org/r/824156
[10:20:18] (PS2) Jbond: hieradata - cloud: add etcd CI to cloud pki instance [puppet] - https://gerrit.wikimedia.org/r/824156 (https://phabricator.wikimedia.org/T315395)
[10:20:36] (CR) Jbond: [V: +2 C: +2] hieradata - cloud: add etcd CI to cloud pki instance [puppet] - https://gerrit.wikimedia.org/r/824156 (https://phabricator.wikimedia.org/T315395) (owner: Jbond)
[10:22:37] (CR) Marostegui: "Created https://phabricator.wikimedia.org/T315427" [puppet] - https://gerrit.wikimedia.org/r/824037 (https://phabricator.wikimedia.org/T315271) (owner: Tim Starling)
[10:23:03] (CR) Marostegui: [C: +2] "Merging this for now to triage the initial issue and to avoid hosts going back to ROW upon reboot/restart" [puppet] - https://gerrit.wikimedia.org/r/824037 (https://phabricator.wikimedia.org/T315271) (owner: Tim Starling)
[10:27:31] (CR) CI reject: [V: -1] quota_increase: pretty format SAL entry [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824166 (owner: RhinosF1)
[10:28:02] (CR) Jbond: "not sure we still need this, I think with some recent changes from ben[1] we should be able to use profile::pki::client, i think this woul" [puppet] - https://gerrit.wikimedia.org/r/668701 (https://phabricator.wikimedia.org/T315395) (owner: Giuseppe Lavagetto)
[10:28:22] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:28:44] (PS3) RhinosF1: quota_increase: pretty format SAL entry [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824166
[10:30:42] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:31:11] (PS4) David Caro: ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824149
[10:31:13] (PS3) David Caro: WIP: adding support to change the osd class type [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824153
[10:33:29] (PS1) Jbond: deployment-prep: use pki for etcd certificates in deployment prep [puppet] - https://gerrit.wikimedia.org/r/824158 (https://phabricator.wikimedia.org/T315395)
[10:35:07] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36783/console" [puppet] - https://gerrit.wikimedia.org/r/824158 (https://phabricator.wikimedia.org/T315395) (owner: Jbond)
[10:35:11] (CR) CI reject: [V: -1] quota_increase: pretty format SAL entry [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824166 (owner: RhinosF1)
[10:37:18] (PS4) RhinosF1: quota_increase: pretty format SAL entry [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824166
[10:38:21] (PS1) Btullis: Add a new intermediate CA for kubernetes [puppet] - https://gerrit.wikimedia.org/r/824161 (https://phabricator.wikimedia.org/T310172)
[10:38:35] (CR) David Caro: "LGTM, once it passes the tests, a couple nits (feel free to ignore)" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824166 (owner: RhinosF1)
[10:38:49] (CR) CI reject: [V: -1] WIP: adding support to change the osd class type [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824153 (owner: David Caro)
[10:42:11] (PS5) RhinosF1: quota_increase: pretty format SAL entry [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824166
[10:42:22] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:42:32] (CR) CI reject: [V: -1] quota_increase: pretty format SAL entry [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824166 (owner: RhinosF1)
[10:42:37] (CR) Jbond: [C: +1] "LGTM but see feature creep comment" [puppet] - https://gerrit.wikimedia.org/r/823755 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[10:42:50] (CR) RhinosF1: quota_increase: pretty format SAL entry (2 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824166 (owner: RhinosF1)
[10:42:56] (CR) RhinosF1: "recheck" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824166 (owner: RhinosF1)
[10:46:15] (CR) Jbond: [C: -1] "lgtm but see comment about uid/gid" [puppet] - https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[10:47:06] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:47:50] (PS2) Urbanecm: Initial configuration for pcmwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/815227 (https://phabricator.wikimedia.org/T310776)
[10:49:57] (CR) CI reject: [V: -1] Initial configuration for pcmwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/815227 (https://phabricator.wikimedia.org/T310776) (owner: Urbanecm)
[10:49:59] (CR) Jbond: [C: -1] phabricator: replace user{} with systemd::sysuser for daemon user (2 comments) [puppet] - https://gerrit.wikimedia.org/r/823767 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[10:50:04] :-(
[10:50:29] (PS2) Urbanecm: Initial configuration for guwwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/815228 (https://phabricator.wikimedia.org/T309054)
[10:51:14] (PS3) Urbanecm: Initial configuration for pcmwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/815227 (https://phabricator.wikimedia.org/T310776)
[10:51:26] (CR) CI reject: [V: -1] Initial configuration for guwwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/815228 (https://phabricator.wikimedia.org/T309054) (owner: Urbanecm)
[10:51:45] (PS3) Urbanecm: Initial configuration for guwwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/815228 (https://phabricator.wikimedia.org/T309054)
[10:52:10] (PS4) Urbanecm: Initial configuration for guwwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/815228 (https://phabricator.wikimedia.org/T309054)
[10:53:53] (PS2) Urbanecm: Initial configuration for bjnwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/815229 (https://phabricator.wikimedia.org/T312209)
[10:57:58] (CR) Jbond: "is there a bug or a way to recreate the error?" [puppet] - https://gerrit.wikimedia.org/r/824130 (owner: David Caro)
[11:00:05] Urbanecm and Amir1: #bothumor I � Unicode. All rise for New wiki creation deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220817T1100).
[11:00:10] o/
[11:00:28] (CR) Samtar: extension-list: Add Phonos (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/821249 (https://phabricator.wikimedia.org/T314294) (owner: Samtar)
[11:00:39] o/
[11:00:53] let's start!
[11:00:57] (CR) Urbanecm: [C: +2] Initial configuration for pcmwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/815227 (https://phabricator.wikimedia.org/T310776) (owner: Urbanecm)
[11:01:49] (Merged) jenkins-bot: Initial configuration for pcmwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/815227 (https://phabricator.wikimedia.org/T310776) (owner: Urbanecm)
[11:02:19] of course. error.
[11:02:25] https://www.irccloud.com/pastebin/qAGwPv2L/
[11:03:29] at least the DB's there, with tables
[11:04:39] well, on main. not on x1.
[11:05:10] Amir1: looks like it's having some issues with assigning new connection to `$conn`, as the desturctor is in the stacktrace.
[11:05:28] let me check
[11:05:46] is it doing reuse?
[11:06:22] hmm it doesn't
[11:06:30] (CR) Giuseppe Lavagetto: [C: +1] sre.discovery.service-route: Make the cookbook work (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/823659 (https://phabricator.wikimedia.org/T260663) (owner: JMeybohm)
[11:06:36] it does ` $conn = $localLb->getConnection( DB_PRIMARY, [], $localLb::DOMAIN_ANY );` at 109 to get main DB connection, and then `$conn = $growthLB->getConnection( DB_PRIMARY, [], $localLb::DOMAIN_ANY );` at 135 to get x1's connection
[11:07:22] Amir1: DBConnRef::__destruct does call reuseConnectionInternal though
[11:07:34] yeah, that's ok
[11:08:01] I think get connection is wrong, you need to set the domain again I think, let me double check
[11:08:31] sure
[11:09:58] (PS1) Btullis: Add new admin_ng values for the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196)
[11:11:13] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[11:11:22] urbanecm: the re-run should just work I think
[11:11:28] with skipping main
[11:11:31] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[11:11:37] yeah, i think so too. trying.
[11:12:18] it does work
[11:12:37] x1 has the db now
[11:12:43] pulling to mwdebug1001
[11:13:08] (PS1) Jbond: C:admin: when creating users make sure we add a dependency on the shell package [puppet] - https://gerrit.wikimedia.org/r/824164
[11:13:28] wiki's live. syncing.
[11:14:43] Amir1: what would be the long-term fix? running `$lbFactory->redefineLocalDomain( $dbName );` again, before acquiring x1 connection for the first time?
[11:14:56] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36784/console" [puppet] - https://gerrit.wikimedia.org/r/824164 (owner: Jbond)
[11:15:02] for shorter term, I suggest renaming the $conn object
[11:15:19] in x1, let's call it $connX1 and $connGrowth?
[11:15:34] (CR) Jbond: "i create a new patch for this however it would be good to have a reproducer as id prefer not to add the dependency if parse order is enoug" [puppet] - https://gerrit.wikimedia.org/r/824130 (owner: David Caro)
[11:15:44] then I think it should simply split into multiple maint scripts, this is not sustainable
[11:15:47] (CR) Jbond: "and the other CR https://gerrit.wikimedia.org/r/c/operations/puppet/+/824164" [puppet] - https://gerrit.wikimedia.org/r/824130 (owner: David Caro)
[11:16:01] probably $echoConn and $growthConn, similar to how the load balancers are named
[11:16:11] sur
[11:16:13] *sure
[11:16:20] oh, scap now says 11:15:15 This takes about 3 minutes. good change :)
[11:17:23] !log urbanecm@deploy1002 Synchronized wmf-config/db-production.php: Creating pcmwiki (T310776) (duration: 03m 22s)
[11:17:27] T310776: Create Wikipedia Nigerian Pidgin - https://phabricator.wikimedia.org/T310776
[11:18:57] SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (diego) >>! In T315262#8160758, @cmooney wrote: > Hi @Trokhymovych thanks for sharing the key, I assume you need shell/ssh access is that correct? Could you advise of exactly what...
[11:18:58] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:20:10] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[11:20:37] !log urbanecm@deploy1002 Synchronized dblists: Creating pcmwiki (T310776) (duration: 03m 13s)
[11:22:30] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[11:23:21] (CR) Jbond: "before we go ahead with this i just want to double check that the intention is to use this for the server to server communication and the " [puppet] - https://gerrit.wikimedia.org/r/824161 (https://phabricator.wikimedia.org/T310172) (owner: Btullis)
[11:24:06] (CR) Btullis: "The `dse-k8s-certl.svc.eqiad.wmnet` endpoint is not yet ready, so I don't think that this should be merged yet, but I believe that the val" [deployment-charts] - https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[11:24:39] !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating pcmwiki (T310776)
[11:24:45] T310776: Create Wikipedia Nigerian Pidgin - https://phabricator.wikimedia.org/T310776
[11:26:07] (CR) Jbond: "also do we want one CA for all k8s clusters or to we want one CA per kubernetes cluster. i think it would be good to speak with o9ther k8" [puppet] - https://gerrit.wikimedia.org/r/824161 (https://phabricator.wikimedia.org/T310172) (owner: Btullis)
[11:26:26] SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (Trokhymovych) Hi @cmooney, Can you please delete all my old keys and leave only the one I have provided in this ticket? Thank you!
[11:27:22] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Wed 24 Aug 2022 07:48:40 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:27:53] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: Creating pcmwiki (T310776) (duration: 03m 13s)
[11:29:10] (CR) Urbanecm: [C: +2] Initial configuration for guwwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/815228 (https://phabricator.wikimedia.org/T309054) (owner: Urbanecm)
[11:29:40] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:29:52] PROBLEM - Check systemd state on mw1429 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-phpfpm-statustext-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:30:38] (Merged) jenkins-bot: Initial configuration for guwwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/815228 (https://phabricator.wikimedia.org/T309054) (owner: Urbanecm)
[11:31:17] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: Creating pcmwiki (T310776) (duration: 03m 24s)
[11:31:21] T310776: Create Wikipedia Nigerian Pidgin - https://phabricator.wikimedia.org/T310776
[11:32:12] RECOVERY - Check systemd state on mw1429 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:34:36] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Creating pcmwiki (T310776) (duration: 03m 18s)
[11:38:19] !log urbanecm@deploy1002 Synchronized langlist: Creating pcmwiki (T310776) (duration: 03m 42s)
[11:38:22] T310776: Create Wikipedia Nigerian Pidgin - https://phabricator.wikimedia.org/T310776
[11:38:43] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache sretest1001.eqiad.wmnet sretest1002.eqiad.wmnet on all recursors
[11:38:48] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest1001.eqiad.wmnet sretest1002.eqiad.wmnet on all recursors
[11:39:01] first wiki done, proceeding with the second one
[11:39:40] (CR) Btullis: Add a new intermediate CA for kubernetes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/824161 (https://phabricator.wikimedia.org/T310172) (owner: Btullis)
[11:40:03] (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/823659 (https://phabricator.wikimedia.org/T260663) (owner: JMeybohm)
[11:40:26] guwwiktionary works via mwdebug1001, syncing
[11:43:27] (PS2) Samtar: Enable Realtime Preview on Group 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/824128 (https://phabricator.wikimedia.org/T314182) (owner: Samwilson)
[11:44:15] !log urbanecm@deploy1002 Synchronized wmf-config/db-production.php: Creating guwwiktionary (T309054) (duration: 03m 08s)
[11:44:19] T309054: Create Wiktionary Gungbe - https://phabricator.wikimedia.org/T309054
[11:46:14] (PS2) Btullis: Add a new intermediate CA for kubernetes [puppet] - https://gerrit.wikimedia.org/r/824161 (https://phabricator.wikimedia.org/T310172)
[11:47:27] !log urbanecm@deploy1002 Synchronized dblists: Creating guwwiktionary (T309054) (duration: 03m 11s)
[11:47:27] (CR) Klausman: [C: +1] "LGTM." [deployment-charts] - https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[11:51:24] !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating guwwiktionary (T309054)
[11:51:28] T309054: Create Wiktionary Gungbe - https://phabricator.wikimedia.org/T309054
[11:53:45] (PS3) JMeybohm: sre.discovery.service-route: Make the cookbook work [cookbooks] - https://gerrit.wikimedia.org/r/823659 (https://phabricator.wikimedia.org/T260663)
[11:54:12] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:54:50] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: Creating guwwiktionary (T309054) (duration: 03m 25s)
[11:56:02] (CR) JMeybohm: sre.discovery.service-route: Make the cookbook work (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/823659 (https://phabricator.wikimedia.org/T260663) (owner: JMeybohm)
[11:56:05] (PS4) JMeybohm: sre.discovery.service-route: Make the cookbook work [cookbooks] - https://gerrit.wikimedia.org/r/823659 (https://phabricator.wikimedia.org/T260663)
[11:57:21] SRE, Performance-Team, Traffic, serviceops: multi-dc.lua ATS script failing in production - https://phabricator.wikimedia.org/T315434 (Vgutierrez)
[11:58:33] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: Creating guwwiktionary (T309054) (duration: 03m 43s)
[11:58:37] T309054: Create Wiktionary Gungbe - https://phabricator.wikimedia.org/T309054
[11:59:10] (CR) CI reject: [V: -1] sre.discovery.service-route: Make the cookbook work [cookbooks] - https://gerrit.wikimedia.org/r/823659 (https://phabricator.wikimedia.org/T260663) (owner: JMeybohm)
[12:01:42] (PS1) Vgutierrez: trafficserver:multi-dc: Avoid get_query_param calls if query is nil [puppet] - https://gerrit.wikimedia.org/r/824190 (https://phabricator.wikimedia.org/T315434)
[12:01:51] (PS1) Hokwelum: Add PDApps organisation details to list of dumps mirrors [puppet] - https://gerrit.wikimedia.org/r/824191
[12:01:55] !log copy prometheus-ipmi-exporter package from bullseye to buster
[12:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:08] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Creating guwwiktionary (T309054) (duration: 03m 34s)
[12:02:46] (CR) Urbanecm: [C: +2] Initial configuration for bjnwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/815229 (https://phabricator.wikimedia.org/T312209) (owner: Urbanecm)
[12:03:35] (Merged) jenkins-bot: Initial configuration for bjnwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/815229 (https://phabricator.wikimedia.org/T312209) (owner: Urbanecm)
[12:05:06] bjnwiktionary works at mwdebug1001, syncing
[12:06:16] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:09:05] !log urbanecm@deploy1002 Synchronized wmf-config/db-production.php: Creating bjnwiktionary (T312209) (duration: 03m 29s)
[12:09:09] T312209: Create Wiktionary Banjar - https://phabricator.wikimedia.org/T312209
[12:09:29] (PS2) Hokwelum: Add PDApps organisation details to list of dumps mirrors [puppet] - https://gerrit.wikimedia.org/r/824191
[12:10:31] (CR) Tim Starling: [C: +1] trafficserver:multi-dc: Avoid get_query_param calls if query is nil [puppet] - https://gerrit.wikimedia.org/r/824190 (https://phabricator.wikimedia.org/T315434) (owner: Vgutierrez)
[12:10:56] (PS5) JMeybohm: sre.discovery.service-route: Make the cookbook work [cookbooks] - https://gerrit.wikimedia.org/r/823659 (https://phabricator.wikimedia.org/T260663)
[12:11:46] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Wed 24 Aug 2022 07:48:40 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:12:38] !log urbanecm@deploy1002 Synchronized dblists: Creating bjnwiktionary (T312209) (duration: 03m 33s)
[12:14:10] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:15:45] !log copy prometheus-ipmi-exporter package from buster to stretch
[12:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:38] !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating bjnwiktionary (T312209)
[12:16:42] T312209: Create Wiktionary Banjar - https://phabricator.wikimedia.org/T312209
[12:17:20] !log remove prometheus-ipmi-exporter from stretch
[12:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:30] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:49] (PS3) Hokwelum: Add PDApps organisation details to list of dumps mirrors [puppet] - https://gerrit.wikimedia.org/r/824191
[12:18:56] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:19:48] (PS4) Hokwelum: Add PDApps organisation details to list of dumps mirrors [puppet] - https://gerrit.wikimedia.org/r/824191
[12:20:06] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: Creating bjnwiktionary (T312209) (duration: 03m 27s)
[12:20:18] (PS1) Jbond: P:base::production: install prometheus::ipmi_exporter to buster [puppet] - https://gerrit.wikimedia.org/r/824193
[12:21:36] (PS5) Hokwelum: Add PDApps organisation details to list of dumps mirrors [puppet] - https://gerrit.wikimedia.org/r/824191
[12:21:56] (CR) Jbond: [C: +2] "FYI" [puppet] - https://gerrit.wikimedia.org/r/824193 (owner: Jbond)
[12:23:26] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: Creating bjnwiktionary (T312209) (duration: 03m 19s)
[12:23:30] T312209: Create Wiktionary Banjar - https://phabricator.wikimedia.org/T312209
[12:24:34] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:26:40] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Creating bjnwiktionary (T312209) (duration: 03m 13s)
[12:28:20] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:30:13] !log urbanecm@deploy1002 Synchronized dblists-index.php: Creating bjnwiktionary (T312209) (duration: 03m 32s)
[12:30:17] T312209: Create Wiktionary Banjar - https://phabricator.wikimedia.org/T312209
[12:30:36] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:30:44] (PS6) Hokwelum: Add PDApps organisation details to list of dumps mirrors [puppet] - https://gerrit.wikimedia.org/r/824191
[12:30:58] bjnwiktionary should now be done. No other wikis were scheduled => updating interwiki cache now.
[12:31:52] PROBLEM - Check systemd state on mw2373 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:34:26] (CR) Bartosz Dziewoński: [C: +1] "Thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/823749 (https://phabricator.wikimedia.org/T314693) (owner: Bartosz Dziewoński)
[12:34:40] (PS1) Urbanecm: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/824194 (https://phabricator.wikimedia.org/T310776)
[12:34:43] (CR) Urbanecm: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/824194 (https://phabricator.wikimedia.org/T310776) (owner: Urbanecm)
[12:35:31] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/824194 (https://phabricator.wikimedia.org/T310776) (owner: Urbanecm)
[12:38:11] (CR) Jbond: "thanks" [puppet] - https://gerrit.wikimedia.org/r/824161 (https://phabricator.wikimedia.org/T310172) (owner: Btullis)
[12:38:31] SRE-swift-storage, ops-eqiad, DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (MatthewVernon)
[12:39:05] (PS1) Hashar: Merge tag 'v3.5.2' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - https://gerrit.wikimedia.org/r/824196 (https://phabricator.wikimedia.org/T307334)
[12:39:23] ops-eqiad, DC-Ops: dbprov1002 lost power redundancy - https://phabricator.wikimedia.org/T315439 (jcrespo)
[12:39:38] !log urbanecm@deploy1002 Synchronized wmf-config/interwiki.php: Update interwiki cache (T310776, T312209, T309054) (duration: 03m 30s)
[12:39:44] T310776: Create Wikipedia Nigerian Pidgin - https://phabricator.wikimedia.org/T310776
[12:39:44] T312209: Create Wiktionary Banjar - https://phabricator.wikimedia.org/T312209
[12:39:44] T309054: Create Wiktionary Gungbe - https://phabricator.wikimedia.org/T309054
[12:39:57] i think that's all
[12:40:12] thanks for the help with the error Amir1
[12:40:30] nada
[12:40:43] you did the heavy lifting
[12:41:58] (CR) ArielGlenn: [C: +2] Add PDApps organisation details to list of dumps mirrors [puppet] - https://gerrit.wikimedia.org/r/824191 (owner: Hokwelum)
[12:42:15] SRE-swift-storage, ops-eqiad, DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (MatthewVernon) a:Cmjohnson→wiki_willy
[12:42:56] (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/823659 (https://phabricator.wikimedia.org/T260663) (owner: JMeybohm)
[12:43:24] SRE-swift-storage, ops-eqiad, DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (MatthewVernon) a:wiki_willy→None
[12:44:06] (CR) JMeybohm: [C: +2] sre.discovery.service-route: Make the cookbook work [cookbooks] - https://gerrit.wikimedia.org/r/823659 (https://phabricator.wikimedia.org/T260663) (owner: JMeybohm)
[12:46:44] (CR) CI reject: [V: -1] Merge tag 'v3.5.2' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - https://gerrit.wikimedia.org/r/824196 (https://phabricator.wikimedia.org/T307334) (owner: Hashar)
[12:47:58] (PS3) Samtar: InitialiseSettings: Add wmgUsePhonos (default => false) [mediawiki-config] - https://gerrit.wikimedia.org/r/822656 (https://phabricator.wikimedia.org/T314294)
[12:48:22] (Merged) jenkins-bot: sre.discovery.service-route: Make the cookbook work [cookbooks] - https://gerrit.wikimedia.org/r/823659 (https://phabricator.wikimedia.org/T260663) (owner: JMeybohm)
[12:51:14] (PS2) Hashar: Merge tag 'v3.5.2' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - https://gerrit.wikimedia.org/r/824196 (https://phabricator.wikimedia.org/T307334)
[12:52:51] (CR) Btullis: Add a new intermediate CA for kubernetes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824161 (https://phabricator.wikimedia.org/T310172) (owner: Btullis)
[12:53:11] (Abandoned) Btullis: Add a new intermediate CA for kubernetes [puppet] - https://gerrit.wikimedia.org/r/824161 (https://phabricator.wikimedia.org/T310172) (owner: Btullis)
[12:54:08] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - 
degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:20] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:59:38] PROBLEM - Check systemd state on thanos-be2004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220817T1300). [13:00:05] MatmaRex and TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:44] hello [13:00:50] here, and noting that we've got a busy window - mine can be pushed if needed, but I can also self-deploy [13:00:58] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:24] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools visual enhancements as beta everywhere except en/de/jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822440 (https://phabricator.wikimedia.org/T312672) (owner: 10Esanders) [13:01:31] (03PS2) 10Bartosz Dziewoński: Make DiscussionTools replytool, newtopictool opt-out on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823746 (https://phabricator.wikimedia.org/T297410) [13:01:36] (03PS3) 10Bartosz Dziewoński: Make DiscussionTools topicsubscription opt-out on A/B test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823749 (https://phabricator.wikimedia.org/T314693) [13:01:47] (03PS3) 10Bartosz 
Dziewoński: Enable visual editor in Project: (Wikipedia:) namespace on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823757 (https://phabricator.wikimedia.org/T314968) [13:01:53] * taavi looks at the 'Max 6 patches' part of the window description [13:01:53] (03PS3) 10Bartosz Dziewoński: Enable wgCiteResponsiveReferences on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823758 (https://phabricator.wikimedia.org/T315333) [13:02:00] (03PS2) 10Bartosz Dziewoński: Add wgDiscussionToolsEnablePermalinksBackend config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823697 (https://phabricator.wikimedia.org/T315353) [13:02:01] I can deploy today if no-one else is around [13:02:09] (03PS3) 10Bartosz Dziewoński: Remove unused config for Echo notification emails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820546 (https://phabricator.wikimedia.org/T314604) [13:02:41] taavi: yeah, sorry about that. it always seemed like more of a guideline to me ;) [13:02:59] but i will completely understand if we can't do all of this, i noted it in the list [13:03:03] lol [13:03:07] (i was just rebasing the patches now) [13:03:32] MatmaRex: yeah, unfortunately scap is a lot slower these days since it needs to restart php-fpm everywhere [13:04:14] bump mine to the late window, it was wishful thinking of me :D [13:04:49] (03CR) 10Btullis: Add new admin_ng values for the dse-k8s-eqiad cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824163 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [13:05:00] (03CR) 10Majavah: [C: 03+2] Add try…catch in failing deferred update [extensions/DiscussionTools] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/823640 (https://phabricator.wikimedia.org/T315383) (owner: 10Jforrester) [13:05:01] the patches are all independent, we could merge them all at once and deploy them all on the canaries. 
if there are any problems, any patch can be reverted separately [13:05:06] (03CR) 10Vgutierrez: [C: 03+2] trafficserver:multi-dc: Avoid get_query_param calls if query is nil [puppet] - 10https://gerrit.wikimedia.org/r/824190 (https://phabricator.wikimedia.org/T315434) (owner: 10Vgutierrez) [13:05:22] (03PS1) 10Hashar: Gerrit v3.5.2 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/824200 (https://phabricator.wikimedia.org/T307334) [13:05:24] and they're config changes (except that one), so at least the merging will be fast [13:06:08] (03CR) 10Majavah: [C: 03+2] Enable DiscussionTools visual enhancements as beta everywhere except en/de/jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822440 (https://phabricator.wikimedia.org/T312672) (owner: 10Esanders) [13:06:19] sounds good, I'll pull all of them at once [13:06:31] (03CR) 10Majavah: [C: 03+2] Make DiscussionTools replytool, newtopictool opt-out on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823746 (https://phabricator.wikimedia.org/T297410) (owner: 10Bartosz Dziewoński) [13:06:53] (03CR) 10Majavah: [C: 03+2] Make DiscussionTools topicsubscription opt-out on A/B test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823749 (https://phabricator.wikimedia.org/T314693) (owner: 10Bartosz Dziewoński) [13:07:10] (03CR) 10Majavah: [C: 03+2] Enable visual editor in Project: (Wikipedia:) namespace on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823757 (https://phabricator.wikimedia.org/T314968) (owner: 10Bartosz Dziewoński) [13:07:25] (03CR) 10Majavah: [C: 03+2] Enable wgCiteResponsiveReferences on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823758 (https://phabricator.wikimedia.org/T315333) (owner: 10Bartosz Dziewoński) [13:08:41] looks like CI is backed up a bit, let's see how long it takes [13:09:36] (03PS2) 10Hashar: Gerrit v3.5.2 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - 
10https://gerrit.wikimedia.org/r/824200 (https://phabricator.wikimedia.org/T307334) [13:11:44] PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:11:46] (03PS2) 10Hashar: gerrit: $gerrit_servers > $ssh_allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/816038 [13:12:23] taavi: thanks for deploying. let me know if i can help. [13:12:48] (03Merged) 10jenkins-bot: Enable DiscussionTools visual enhancements as beta everywhere except en/de/jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822440 (https://phabricator.wikimedia.org/T312672) (owner: 10Esanders) [13:12:51] (03Merged) 10jenkins-bot: Make DiscussionTools replytool, newtopictool opt-out on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823746 (https://phabricator.wikimedia.org/T297410) (owner: 10Bartosz Dziewoński) [13:12:57] (03Merged) 10jenkins-bot: Make DiscussionTools topicsubscription opt-out on A/B test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823749 (https://phabricator.wikimedia.org/T314693) (owner: 10Bartosz Dziewoński) [13:13:01] (03Merged) 10jenkins-bot: Enable visual editor in Project: (Wikipedia:) namespace on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823757 (https://phabricator.wikimedia.org/T314968) (owner: 10Bartosz Dziewoński) [13:13:05] (03Merged) 10jenkins-bot: Enable wgCiteResponsiveReferences on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823758 (https://phabricator.wikimedia.org/T315333) (owner: 10Bartosz Dziewoński) [13:13:09] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/816038 (owner: 10Hashar) [13:14:02] MatmaRex: pulled the first five config patches to mwdebug1001. Please test all of them [13:14:07] (03CR) 10Hashar: "That is a change I have crafted end of July while you prepared the puppet patch to migrate the gerrit replica to a new host. 
The new name" [puppet] - 10https://gerrit.wikimedia.org/r/816038 (owner: 10Hashar) [13:14:22] thanks. looking [13:15:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10dcaro) [13:15:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudcephosd10[25-34] Missing/unplugged hard drives - https://phabricator.wikimedia.org/T315221 (10dcaro) 05Open→03Resolved This is done and ready! Thanks a lot @Cmjohnson! [13:16:23] (03Merged) 10jenkins-bot: Add try…catch in failing deferred update [extensions/DiscussionTools] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/823640 (https://phabricator.wikimedia.org/T315383) (owner: 10Jforrester) [13:17:01] (03PS1) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [13:18:44] (03CR) 10CI reject: [V: 04-1] node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:19:18] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@141f179]: (no justification provided) [13:19:29] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@141f179]: (no justification provided) (duration: 00m 10s) [13:19:52] (still testing) [13:20:32] (03PS1) 10Esanders: Disable DiscussionTools pageframe everywhere except labs and mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824203 [13:23:04] taavi: all 5 look good [13:24:23] MatmaRex: thanks, syncing [13:24:30] in the meantime, please test the backport on mwdebug1001 [13:25:28] RECOVERY - Check systemd state on thanos-be2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:32] 10SRE-OnFire, 10Discovery-Search, 10Sustainability (Incident Followup): Replace certificate on 
deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T315386 (10RhinosF1) [13:26:01] 10Puppet, 10SRE, 10SRE-OnFire, 10Beta-Cluster-Infrastructure, and 3 others: Evaluation Error on deployment-cache-text06 puppet run - https://phabricator.wikimedia.org/T315351 (10RhinosF1) [13:26:05] ^^ def not an "on fire" situation, just removed those tags [13:26:38] (03CR) 10Ori: [C: 03+1] P:cache::varnish::frontend::text: make esitest ensurable and absent on cloud [puppet] - 10https://gerrit.wikimedia.org/r/824146 (https://phabricator.wikimedia.org/T315394) (owner: 10Jbond) [13:27:03] 10SRE-OnFire, 10Beta-Cluster-Infrastructure, 10Sustainability (Incident Followup): (Beta cluster) Running logspam-watch on deployment-mwlog01 gives repeated `Use of uninitialized value $host` errors - https://phabricator.wikimedia.org/T315379 (10RhinosF1) [13:27:31] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: lots of DiscussionTools and other changes (duration: 03m 11s) [13:27:32] inflatador: OnFire is the working group for incidents [13:27:44] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (10JMcLeod_WMF) [13:27:49] it's because TNT started an IR and i've tagged it as a follow up [13:27:54] taavi: i can't really directly test it, saving edits still works (on testwiki), so it should be good [13:28:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] role::alerting_host: run vopsbot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821255 (https://phabricator.wikimedia.org/T314840) (owner: 10Giuseppe Lavagetto) [13:29:00] MatmaRex: "nothing breaks" is good enough for me, so I'll sync [13:29:08] 10SRE-OnFire, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Sustainability (Incident Followup): Remove two cherry-picked reverts from deployment-puppetmaster04 - https://phabricator.wikimedia.org/T315394 (10RhinosF1) [13:29:24] 10SRE-OnFire, 
10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Sustainability (Incident Followup): Rebase & merge or re-cherry-pick 668701 on deployment-puppetmaster04 - https://phabricator.wikimedia.org/T315395 (10RhinosF1) [13:29:33] i don't see the exception in logstash for the edit i just made, but it's (probably) a race condition, so that's not unexpected [13:29:58] TheresNoTime: hello, your patch is up next [13:30:05] taavi: :) [13:30:12] MatmaRex: I'll return to your no-ops if we have time after this [13:30:24] thanks [13:30:25] (03PS3) 10Majavah: Enable Realtime Preview on Group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824128 (https://phabricator.wikimedia.org/T314182) (owner: 10Samwilson) [13:30:43] 10SRE-OnFire, 10Beta-Cluster-Infrastructure, 10Discovery-Search, 10Release-Engineering-Team, and 5 others: Known, Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10RhinosF1) [13:30:49] (03CR) 10Majavah: [C: 03+2] Enable Realtime Preview on Group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824128 (https://phabricator.wikimedia.org/T314182) (owner: 10Samwilson) [13:31:24] (03PS1) 10Clément Goubert: pcc: Encode jenkins username to utf-8 [puppet] - 10https://gerrit.wikimedia.org/r/824209 [13:31:39] 10SRE, 10Performance-Team, 10Traffic, 10serviceops: multi-dc.lua ATS script failing in production - https://phabricator.wikimedia.org/T315434 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Fix has been deployed. 
I'll reopen the task if we are still seeing errors [13:31:49] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10Vgutierrez) [13:32:22] (03Merged) 10jenkins-bot: Enable Realtime Preview on Group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824128 (https://phabricator.wikimedia.org/T314182) (owner: 10Samwilson) [13:32:40] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.25/extensions/DiscussionTools/includes/Hooks/DataUpdatesHooks.php: Backport: [[gerrit:823640|Add try…catch in failing deferred update (T315383)]] (duration: 03m 18s) [13:32:45] T315383: MWException: Error contacting the Parsoid/RESTBase server (HTTP 404) - https://phabricator.wikimedia.org/T315383 [13:33:05] TheresNoTime: please test on mwdebug1001 [13:33:13] testing.. [13:34:19] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:cache::varnish::frontend::text: make esitest ensurable and absent on cloud [puppet] - 10https://gerrit.wikimedia.org/r/824146 (https://phabricator.wikimedia.org/T315394) (owner: 10Jbond) [13:34:34] 10SRE-OnFire, 10Discovery-Search, 10Sustainability (Incident Followup): Replace certificate on deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T315386 (10RhinosF1) per IRC - is a follow up on https://wikitech.wikimedia.org/wiki/Incidents/2022-08-16_Beta_Cluster... [13:34:35] taavi: working :) [13:34:53] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10Papaul) @fgiunchedi @MatthewVernon this server and ms-be2032 were fresh last fiscal year with https://phabricator.wikimedia.org/T285809. Any reason we still that them in production?... 
[13:35:05] ok, syncing [13:36:05] (03PS3) 10Majavah: Add wgDiscussionToolsEnablePermalinksBackend config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823697 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [13:36:28] (03PS2) 10Clément Goubert: pcc: Encode jenkins username to utf-8 [puppet] - 10https://gerrit.wikimedia.org/r/824209 [13:36:49] MatmaRex: is there anything to test about https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/823697/? [13:37:20] taavi: no, the code using that config is not deployed yet [13:37:46] ok, in that case it seems like a perfect candidate for testing the new `scap backport` command [13:37:59] 10SRE, 10Infrastructure-Foundations, 10netops: Rancid on netmon1003 unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10andrea.denisse) @cmooney Thanks for the heads-up, I missed that part, my bad. [13:38:08] :o [13:38:15] (03PS2) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [13:38:24] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:26] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:824128|Enable Realtime Preview on Group 1 (T314182)]] (duration: 03m 26s) [13:38:30] T314182: Enable Realtime Preview on group1 - https://phabricator.wikimedia.org/T314182 [13:38:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823697 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [13:39:14] (thank you for deploying that ta/avi) [13:39:48] you're welcome! 
[13:40:07] (03PS5) 10David Caro: wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 [13:40:17] (03CR) 10CI reject: [V: 04-1] node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:40:21] (03Merged) 10jenkins-bot: Add wgDiscussionToolsEnablePermalinksBackend config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823697 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [13:41:23] !log taavi@deploy1002 Started scap: Backport for [[gerrit:823697]] Add wgDiscussionToolsEnablePermalinksBackend config [13:41:56] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kafka-stretch2001 [13:42:01] (03CR) 10CI reject: [V: 04-1] wmcs: add ldap getent speed alerts [alerts] - 10https://gerrit.wikimedia.org/r/813915 (owner: 10David Caro) [13:42:08] (03PS3) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [13:42:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-stretch2001 [13:42:30] (03CR) 10FNegri: "I'm not sure if we want to apply this to all hosts. 
If I understand this correctly, the node_pinger at the moment is only used for Ceph bu" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:42:32] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kafka-stretch2002 [13:43:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-stretch2002 [13:44:49] (03CR) 10Hashar: "Booted it locally:" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/824200 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [13:45:57] (03PS4) 10Andrea Denisse: quickdatacopy: Added simple username/groupname mapping for the Rsync server [puppet] - 10https://gerrit.wikimedia.org/r/823748 (https://phabricator.wikimedia.org/T314972) [13:46:46] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [13:48:15] (03Abandoned) 10David Caro: p:admin: ensure the shells exist before the users are created [puppet] - 10https://gerrit.wikimedia.org/r/824130 (owner: 10David Caro) [13:48:21] (03PS1) 10Giuseppe Lavagetto: deployment-prep: serve php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/824216 (https://phabricator.wikimedia.org/T306042) [13:48:23] (03PS1) 10Giuseppe Lavagetto: mediawiki::jobrunner: allow picking a default php version [puppet] - 10https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042) [13:48:25] (03PS1) 10Giuseppe Lavagetto: deployment-prep: convert jobrunner to use php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/824218 (https://phabricator.wikimedia.org/T306042) [13:48:27] (03CR) 10David Caro: p:admin: ensure the shells exist before the users are created (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824130 (owner: 10David Caro) [13:48:31] (03CR) 10Andrea Denisse: quickdatacopy: Added simple username/groupname mapping for the Rsync server (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/823748 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [13:50:19] (03CR) 10Giuseppe Lavagetto: [C: 03+1] pcc: Encode jenkins username to utf-8 [puppet] - 10https://gerrit.wikimedia.org/r/824209 (owner: 10Clément Goubert) [13:51:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:52:01] still waiting, apparently the new command is really slow [13:52:31] hmm, it's good that we're testing it [13:52:41] (btw, if you were looking for more no-op patches to deploy, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/824203 is also good to go :P ) [13:52:50] 'started scap: ' sounds like it's doing a sync-world? [13:53:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-stretch2001.mgmt.codfw.wmnet with reboot policy FORCED [13:54:33] (03CR) 10Samtar: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/824216 (https://phabricator.wikimedia.org/T306042) (owner: 10Giuseppe Lavagetto) [13:55:15] zabe: apparently yes [13:55:21] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/plugins/backport.py#138 [13:55:33] but even then with minimal changes I wouldn't expect it to take this long [13:55:50] yeah [13:56:01] just about to finish sync-apaches [13:56:56] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36789/console" [puppet] - 10https://gerrit.wikimedia.org/r/823748 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [13:58:17] (03PS1) 10Papaul: Add graphite2004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/824219 (https://phabricator.wikimedia.org/T313851) [13:58:19] 10SRE, 10Infrastructure-Foundations, 10SRE Observability: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10lmata) [13:59:32] RECOVERY - Check systemd state on alert1001 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:04] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10netops: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (10lmata) [14:00:26] (03CR) 10Papaul: [C: 03+2] Add graphite2004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/824219 (https://phabricator.wikimedia.org/T313851) (owner: 10Papaul) [14:00:48] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:823697]] Add wgDiscussionToolsEnablePermalinksBackend config (duration: 19m 24s) [14:01:00] finally [14:01:08] looks like we're just out of time [14:01:12] !log UTC afternoon deploys done [14:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:54] (03PS1) 10Hashar: gerrit: update style for Gerrit 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/824221 (https://phabricator.wikimedia.org/T315445) [14:01:56] (03PS1) 10Hashar: gerrit: remove Gerrit 3.5 obsolete @apply css statement [puppet] - 10https://gerrit.wikimedia.org/r/824222 (https://phabricator.wikimedia.org/T315445) [14:02:16] 10SRE-OnFire, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Sustainability (Incident Followup): Remove two cherry-picked reverts from deployment-puppetmaster04 - https://phabricator.wikimedia.org/T315394 (10jbond) > > https://gerrit.wikimedia.org/r/c/operations/puppet/+/823638 > https://gerrit.wi... 
[14:04:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host graphite2004.codfw.wmnet with OS bullseye [14:04:10] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10Patch-For-Review: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host graphite2004.codfw.wmnet with OS bullseye [14:07:01] (03CR) 10David Caro: [C: 04-1] node_pinger: use jumbo frames (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [14:07:32] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Allow jumbo frames between cloud hosts in production realm - https://phabricator.wikimedia.org/T315446 (10cmooney) p:05Triage→03Medium [14:08:52] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:10:15] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10MatthewVernon) We do indeed have T294549 open to take this node out of production. Unfortunately, to do so we need to drain them out of the swift rings. For that process to proceed... [14:10:56] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Wed 24 Aug 2022 07:48:40 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:12:58] RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:13:20] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:14:40] (03CR) 10Samtar: profile::etcd::v3: use puppet certs for standalone cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668701 (https://phabricator.wikimedia.org/T315395) (owner: 10Giuseppe Lavagetto) [14:17:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-stretch2001.mgmt.codfw.wmnet with reboot policy FORCED [14:17:57] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10Ottomata) @diego, since this is an account for an external, we need an expiration date for the access. Otherwise, approved! [14:18:07] !log Redact new wikis guwwiktionary pcmwiki bjnwiktionary T312214 T310879 T309056 [14:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:14] T310879: Prepare and check storage layer for pcmwiki - https://phabricator.wikimedia.org/T310879 [14:18:14] T309056: Prepare and check storage layer for guwwiktionary - https://phabricator.wikimedia.org/T309056 [14:18:14] T312214: Prepare and check storage layer for bjnwiktionary - https://phabricator.wikimedia.org/T312214 [14:18:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-stretch2002.mgmt.codfw.wmnet with reboot policy FORCED [14:27:26] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Wed 24 Aug 2022 07:48:40 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:28:17] 10Puppet, 10SRE, 10SRE-OnFire, 10Beta-Cluster-Infrastructure, and 3 others: Evaluation Error on deployment-cache-text06 puppet run - https://phabricator.wikimedia.org/T315351 (10Zabe) 05Open→03Resolved Some cherry-picks made by ori made puppet run again, see T315394 for follow-up. 
[14:28:29] 10SRE-OnFire, 10Beta-Cluster-Infrastructure, 10Discovery-Search, 10Release-Engineering-Team, and 5 others: Known, Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10Zabe) [14:29:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:29:51] (03PS1) 10C. Scott Ananian: RESTBase is not enabled on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824224 (https://phabricator.wikimedia.org/T315383) [14:32:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on graphite2004.codfw.wmnet with reason: host reimage [14:34:30] (03PS2) 10C. Scott Ananian: RESTBase is not enabled on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824224 (https://phabricator.wikimedia.org/T315383) [14:35:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on graphite2004.codfw.wmnet with reason: host reimage [14:35:46] (03CR) 10RhinosF1: [C: 03+1] RESTBase is not enabled on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824224 (https://phabricator.wikimedia.org/T315383) (owner: 10C. Scott Ananian) [14:36:19] (03CR) 10Bartosz Dziewoński: [C: 03+1] RESTBase is not enabled on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824224 (https://phabricator.wikimedia.org/T315383) (owner: 10C. Scott Ananian) [14:36:21] (03CR) 10Zabe: "Is it expected that diffConfig does not catch what changes here?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823716 (https://phabricator.wikimedia.org/T315199) (owner: 10Majavah) [14:37:21] (03CR) 10Zabe: RESTBase is not enabled on closed wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824224 (https://phabricator.wikimedia.org/T315383) (owner: 10C. 
Scott Ananian) [14:37:23] (03CR) 10Majavah: jawiki: Restrict abusefilter log access (2) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823716 (https://phabricator.wikimedia.org/T315199) (owner: 10Majavah) [14:39:45] (03CR) 10Zabe: [C: 03+1] jawiki: Restrict abusefilter log access (1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823715 (https://phabricator.wikimedia.org/T315199) (owner: 10Majavah) [14:39:52] (03CR) 10Zabe: [C: 03+1] jawiki: Restrict abusefilter log access (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823716 (https://phabricator.wikimedia.org/T315199) (owner: 10Majavah) [14:40:05] (03PS1) 10Marostegui: db1169: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/824225 [14:41:05] (03CR) 10Marostegui: [C: 03+2] db1169: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/824225 (owner: 10Marostegui) [14:41:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P32446 and previous config saved to /var/cache/conftool/dbconfig/20220817-144123-root.json [14:41:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P32447 and previous config saved to /var/cache/conftool/dbconfig/20220817-144129-root.json [14:41:32] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10diego) Thanks @Ottomata , the contract finish at December 15th. [14:41:34] (03PS3) 10C. Scott Ananian: RESTBase is not enabled on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824224 (https://phabricator.wikimedia.org/T315383) [14:41:59] (03CR) 10C. Scott Ananian: RESTBase is not enabled on closed wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824224 (https://phabricator.wikimedia.org/T315383) (owner: 10C. 
Scott Ananian) [14:43:30] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36790/console" [puppet] - 10https://gerrit.wikimedia.org/r/822434 (owner: 10Ori) [14:43:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-stretch2002.mgmt.codfw.wmnet with reboot policy FORCED [14:46:30] 10SRE-OnFire, 10Beta-Cluster-Infrastructure, 10Discovery-Search, 10Release-Engineering-Team, and 5 others: Known, Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10jbond) [14:46:37] 10SRE-OnFire, 10Beta-Cluster-Infrastructure, 10Sustainability (Incident Followup): Remove two cherry-picked reverts from deployment-puppetmaster04 - https://phabricator.wikimedia.org/T315394 (10jbond) 05Open→03Resolved a:03jbond resolving as i think this is all resolved now but please reopen if not [14:46:56] MatmaRex: do you want to squeeze in the restbase config change before group1 rolls? [14:47:25] (03CR) 10JMeybohm: ci: enable docker on machine start (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814157 (https://phabricator.wikimedia.org/T313119) (owner: 10Hashar) [14:51:35] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host graphite2004.codfw.wmnet with OS bullseye [14:51:39] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host graphite2004.codfw.wmnet with OS bullseye completed: - graphite2004 (**FAIL**) - R... 
[14:51:41] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host graphite2004.codfw.wmnet with OS bullseye executed with errors: - graphite2004 (**FA... [14:51:58] (03CR) 10Hashar: ci: enable docker on machine start (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814157 (https://phabricator.wikimedia.org/T313119) (owner: 10Hashar) [14:54:17] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/823748 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [14:55:52] cscott: I would like to see the restbase config change deployed [14:56:20] (03CR) 10JMeybohm: [C: 03+2] ci: enable docker on machine start [puppet] - 10https://gerrit.wikimedia.org/r/814157 (https://phabricator.wikimedia.org/T313119) (owner: 10Hashar) [14:56:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P32449 and previous config saved to /var/cache/conftool/dbconfig/20220817-145628-root.json [14:56:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P32450 and previous config saved to /var/cache/conftool/dbconfig/20220817-145634-root.json [15:04:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:04:34] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:45] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (10wiki_willy) a:03Jclark-ctr Looks like this is still under warranty. 
Since @Cmjohnson will be out on vacation soon, @Jclark-ctr - can you submit the RMA for this one? Thanks,... [15:09:24] cscott: sure, i can't deploy it myself though [15:09:28] if you can, go for it [15:09:32] i'm afk for a bit [15:09:33] (03CR) 10Jbond: p:admin: ensure the shells exist before the users are created (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824130 (owner: 10David Caro) [15:10:30] 10SRE, 10ops-eqiad, 10DC-Ops: dbprov1002 lost power redundancy - https://phabricator.wikimedia.org/T315439 (10wiki_willy) a:03Cmjohnson [15:11:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:11:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:11:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P32452 and previous config saved to /var/cache/conftool/dbconfig/20220817-151132-root.json [15:11:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P32453 and previous config saved to /var/cache/conftool/dbconfig/20220817-151139-root.json [15:12:42] (03CR) 10Majavah: [C: 04-1] "This would stop installing zsh on new wmcs instances, which is not wanted I think" [puppet] - 10https://gerrit.wikimedia.org/r/824164 (owner: 10Jbond) [15:16:16] (03CR) 10Jbond: p:admin: ensure the shells exist before the users are created (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824130 (owner: 10David Caro) [15:17:25] (03PS2) 10Jbond: C:admin: when creating users make sure we add a dependency on the shell package [puppet] - 10https://gerrit.wikimedia.org/r/824164 [15:17:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:18:09] (03CR) 10Jbond: C:admin: when creating users make sure we add a dependency on the shell package (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/824164 (owner: 10Jbond) [15:19:42] cscott: dancy: did you want a deploy of 824224 ? [15:20:39] MatmaRex, TheresNoTime : i scheduled it for the late deploy window today but earlier would be great too [15:22:17] i don't know offhand if any of the closed wikis are group1/2, if they are all group0 it probably doesn't matter whether it happens before or after the next train roll. [15:22:38] I'd like the errors go away ASAP [15:22:54] ack, will deploy now [15:23:42] jouncebot: nowandnext [15:23:43] No deployments scheduled for the next 2 hour(s) and 36 minute(s) [15:23:43] In 2 hour(s) and 36 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220817T1800) [15:23:43] In 2 hour(s) and 36 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220817T1800) [15:23:49] ok, i'm online to test [15:23:58] TheresNoTime: ping me when done? I'll also sneak in a config change of my own [15:24:04] !log deploying [[gerrit:824224|RESTBase is not enabled on closed wikis (T315383)]] out of window [15:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:08] T315383: MWException: Error contacting the Parsoid/RESTBase server (HTTP 404) - https://phabricator.wikimedia.org/T315383 [15:24:10] taavi: okay [15:25:00] (03CR) 10Samtar: [C: 03+2] "Out of window deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824224 (https://phabricator.wikimedia.org/T315383) (owner: 10C. Scott Ananian) [15:26:18] (03Merged) 10jenkins-bot: RESTBase is not enabled on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824224 (https://phabricator.wikimedia.org/T315383) (owner: 10C. 
Scott Ananian) [15:26:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P32454 and previous config saved to /var/cache/conftool/dbconfig/20220817-152637-root.json [15:26:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P32455 and previous config saved to /var/cache/conftool/dbconfig/20220817-152643-root.json [15:26:55] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10Papaul) [15:27:03] (03CR) 10Vgutierrez: [V: 03+1] "looks good, could we add a VTC case on modules/varnish/files/tests/text/44-querysort.vtc to cover the new functionality?" [puppet] - 10https://gerrit.wikimedia.org/r/822434 (owner: 10Ori) [15:27:37] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10Papaul) 05Open→03Resolved @fgiunchedi all yours [15:27:59] (03PS2) 10Majavah: jawiki: Restrict abusefilter log access (1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823715 (https://phabricator.wikimedia.org/T315199) [15:28:24] cscott: live on mwdebug1001 for testing [15:28:56] (03CR) 10Vgutierrez: [V: 03+1] Incremental roll-out of query-sorting (0%) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822434 (owner: 10Ori) [15:29:26] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Allow jumbo frames between cloud hosts in production realm - https://phabricator.wikimedia.org/T315446 (10cmooney) Ok so looking at this a bit closer it seems the ommision was just that the MTU wasn't set high on cloudsw1-c8, on its links to... 
[15:29:44] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Bonus Level 🕹️): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (10thcipriani) [15:30:31] TheresNoTime: hm, "automatic local account creation is disabled" on closed wikis. Does someone have an admin account who can test something for me on a closed wiki? [15:30:53] we can surely create you a local account on a closed wiki if you need [15:31:00] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:31:20] aawiki would be fine (obviously just picking one at random) [15:31:37] s/at random/arbitrarily/ [15:31:46] user:cscott, or user:CAnanian (WMF) yr choice [15:32:47] cscott: I'm not seeing that message, all I did was go to (logged out) https://kj.wikipedia.org/wiki/Omafiku_oko_shivike?veaction=edit with the header set and VE loaded as expected? [15:33:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:33:07] (03PS2) 10Jdlrobson: Enable new Vector skin on select pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823587 (https://phabricator.wikimedia.org/T314286) [15:33:25] thanks CentralAuth https://ttm.sh/quq.png [15:33:50] (other than the permission denied popover, "The action you have requested is limited to users in the group: Stewards.") [15:34:31] bah, and the maintenance script fails with the same thing [15:34:39] one sec [15:34:50] anyhow I seem to have an account, what do you need? [15:34:54] taavi: what wiki is that on? i can't log into aawiki with that, unless I'm doing something wrong. [15:35:00] aawiki [15:35:54] hm. 
anyway, starting VE like that should be enough and looking for a 404; a slightly better test is to make a null edit somewhere and verify that the 404 generated by discussiontools in the refreshlinks task don't pop up in the logs [15:36:30] (ah this is why *you're* testing and I'm just pressing buttons) [15:37:07] opening VE on a normal server shows some 404s in the devtools and shows an error message, seems to load fine on mwdebug1001 [15:37:45] ok, good enough [15:37:59] !log install net-snmp updates [15:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:10] cscott: happy for me to sync? [15:38:25] side note: this is the code that restricts account creation on closed wikis: https://github.com/wikimedia/operations-mediawiki-config/blob/6fcc641be4d23273700f6754997da9f0c5c63a79/wmf-config/CommonSettings.php#L4260 [15:38:26] yes [15:38:34] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1079.eqiad.wmnet with OS bullseye [15:38:41] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1079.eqiad.wmnet with OS bullseye [15:38:56] `createaccount` is globally assigned to stewards and jimbo, and I think I created my account before that code was fixed to work properly [15:39:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:39:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:39:56] i'm back [15:40:00] * TheresNoTime is syncing [15:40:43] oh, neat [15:41:20] MatmaRex: you want to poke at it a bit to double check? 
[15:41:28] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [15:41:28] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [15:41:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P32457 and previous config saved to /var/cache/conftool/dbconfig/20220817-154142-root.json [15:41:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P32458 and previous config saved to /var/cache/conftool/dbconfig/20220817-154148-root.json [15:41:49] i just had a look and checked that VE is loading now [15:42:10] !log samtar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:824224|RESTBase is not enabled on closed wikis (T315383)]] (duration: 03m 27s) [15:42:13] and https://kj.wikipedia.org/w/api.php?action=visualeditor&format=jsonfm&paction=parse&page=Omafiku_oko_shivike&uselang=kj&formatversion=2&oldid=3337 has the expected response rather than an error [15:42:13] T315383: MWException: Error contacting the Parsoid/RESTBase server (HTTP 404) - https://phabricator.wikimedia.org/T315383 [15:42:19] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [15:42:20] !log jayme@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=1) [15:42:21] I still can't seem to log in to aa.wikipedia.org (also tried logging in on enwiki and praying to the gods of global login) but i'm convinced by taavi's test [15:42:59] !log finished deploying [[gerrit:824224|RESTBase is not enabled on closed wikis (T315383)]] [15:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:17] Those errors have stopped being logged. 
[15:43:38] Last one at 15:39 [15:43:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:43:54] (03PS3) 10Ori: Incremental roll-out of query-sorting (0%) [puppet] - 10https://gerrit.wikimedia.org/r/822434 [15:44:04] ok, unless anyone objects I'm going to deploy an unrelated config patch [15:44:38] (03CR) 10Majavah: [C: 03+2] jawiki: Restrict abusefilter log access (1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823715 (https://phabricator.wikimedia.org/T315199) (owner: 10Majavah) [15:45:50] (03PS2) 10Majavah: jawiki: Restrict abusefilter log access (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823716 (https://phabricator.wikimedia.org/T315199) [15:45:52] (03Merged) 10jenkins-bot: jawiki: Restrict abusefilter log access (1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823715 (https://phabricator.wikimedia.org/T315199) (owner: 10Majavah) [15:45:54] (03CR) 10Majavah: [C: 03+2] jawiki: Restrict abusefilter log access (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823716 (https://phabricator.wikimedia.org/T315199) (owner: 10Majavah) [15:47:16] (03Merged) 10jenkins-bot: jawiki: Restrict abusefilter log access (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823716 (https://phabricator.wikimedia.org/T315199) (owner: 10Majavah) [15:48:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:49:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:49:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:50:12] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:823715|jawiki: Restrict abusefilter log access (1) (T315199)]] (duration: 03m 25s) [15:50:16] T315199: Restrict viewing [[Special:Log/abusefilter]] only Abusefilter editors on ja.wikipedia - https://phabricator.wikimedia.org/T315199 [15:50:36] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:51:20] (03CR) 10Ori: Incremental roll-out of query-sorting (0%) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/822434 (owner: 10Ori) [15:51:32] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1079.eqiad.wmnet with reason: host reimage [15:52:33] !log push out update for linux-image-amd64 on bullseye [15:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:45] 10SRE, 10Wikimedia-Mailing-lists: lists.wikimedia.org flapping between soon to expire and renewed cert (Aug 2022) - https://phabricator.wikimedia.org/T315294 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup I restarted apache2 so it should be good for the next three months. [15:54:17] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1079.eqiad.wmnet with reason: host reimage [15:54:49] !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:823716|jawiki: Restrict abusefilter log access (2) (T315199)]] (duration: 03m 47s) [15:55:21] * taavi all done [15:55:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:56:18] (03CR) 10Vgutierrez: [C: 03+2] Enable query sorting for all mediawikiwiki requests [puppet] - 10https://gerrit.wikimedia.org/r/823656 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [15:56:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P32460 and previous config saved to /var/cache/conftool/dbconfig/20220817-155646-root.json [15:56:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P32461 and previous config saved to /var/cache/conftool/dbconfig/20220817-155653-root.json [15:58:39] (03PS1) 10PipelineBot: citoid: pipeline 
bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/824233 [15:58:59] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Allow jumbo frames between cloud hosts in production realm - https://phabricator.wikimedia.org/T315446 (10dcaro) [15:59:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:59:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:00:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:01:05] (03PS1) 10JMeybohm: sre.discovery.service-route: Fix argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/824234 [16:04:19] (03PS2) 10JMeybohm: sre.discovery.service-route: Fix argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/824234 [16:07:37] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10Papaul) [16:07:42] (03CR) 10Vgutierrez: [C: 03+1] Incremental roll-out of query-sorting (0%) [puppet] - 10https://gerrit.wikimedia.org/r/822434 (owner: 10Ori) [16:11:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P32462 and previous config saved to /var/cache/conftool/dbconfig/20220817-161151-root.json [16:12:36] (03PS3) 10SBassett: Enable StopForumSpam on initial candidate projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823789 (https://phabricator.wikimedia.org/T273220) [16:12:45] (03PS3) 10SBassett: Enable StopForumSpam on initial candidate projects (CommonSettings) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823790 (https://phabricator.wikimedia.org/T273220) [16:13:46] (03PS1) 10Ladsgroup: wwwportals: Make sure portal assets are also visible in wikinews vhost [puppet] - 10https://gerrit.wikimedia.org/r/824237 (https://phabricator.wikimedia.org/T273179) [16:14:20] (03PS1) 10Ladsgroup: Revert 
"db1169: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/824167 [16:14:50] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:15:29] (03PS2) 10Ladsgroup: Revert "db1169: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/824167 [16:15:35] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1169: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/824167 (owner: 10Ladsgroup) [16:15:53] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1079.eqiad.wmnet with OS bullseye [16:16:00] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1079.eqiad.wmnet with OS bullseye completed: - elastic1055 (... [16:16:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48535 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:16:48] (03Abandoned) 10Ori: beta cluster: don't instantiate ::esitest [puppet] - 10https://gerrit.wikimedia.org/r/823766 (https://phabricator.wikimedia.org/T315350) (owner: 10Ori) [16:17:02] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [16:17:05] (03Abandoned) 10Ori: BETA CLUSTER: Revert "esitest service for cache nodes" [puppet] - 10https://gerrit.wikimedia.org/r/823639 (owner: 10Ori) [16:17:05] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [16:17:40] (03Abandoned) 10Ori: BETA CLUSTER: Revert "trafficserver: 9.x upgrade: install ATS 9.x from 
component" [puppet] - 10https://gerrit.wikimedia.org/r/823638 (owner: 10Ori) [16:18:12] (03PS2) 10Ladsgroup: wwwportals: Make sure portal assets are also visible in wikinews vhost [puppet] - 10https://gerrit.wikimedia.org/r/824237 (https://phabricator.wikimedia.org/T273179) [16:18:15] (03CR) 10Ladsgroup: [C: 03+2] wwwportals: Make sure portal assets are also visible in wikinews vhost [puppet] - 10https://gerrit.wikimedia.org/r/824237 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [16:18:40] (03PS4) 10SBassett: Enable StopForumSpam on initial candidate projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823789 (https://phabricator.wikimedia.org/T273220) [16:18:44] PROBLEM - Check systemd state on elastic1079 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:02] (03PS5) 10SBassett: Enable StopForumSpam on initial candidate projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823789 (https://phabricator.wikimedia.org/T273220) [16:20:01] jouncebot: nowandnext [16:20:01] No deployments scheduled for the next 1 hour(s) and 39 minute(s) [16:20:01] In 1 hour(s) and 39 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220817T1800) [16:20:01] In 1 hour(s) and 39 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220817T1800) [16:20:07] awesome [16:20:12] (03CR) 10Cwhite: [C: 03+2] tcpircbot: add and enable ecs logging handler [puppet] - 10https://gerrit.wikimedia.org/r/822421 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [16:22:10] (03PS4) 10SBassett: Enable StopForumSpam on initial candidate projects (CommonSettings) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823790 (https://phabricator.wikimedia.org/T273220) [16:22:22] (03PS5) 10SBassett: Enable StopForumSpam on initial 
candidate projects (CommonSettings) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823790 (https://phabricator.wikimedia.org/T273220) [16:22:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/824234 (owner: 10JMeybohm) [16:22:35] 10SRE, 10MediaWiki-General, 10Traffic, 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), 10Patch-For-Review: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (10ori) [16:24:24] !log restart logmsgbot T257861 [16:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:28] T257861: Pipe SAL entries into Logstash - https://phabricator.wikimedia.org/T257861 [16:25:33] (03PS1) 10Ladsgroup: portals: Bump to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824239 (https://phabricator.wikimedia.org/T273179) [16:26:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P32463 and previous config saved to /var/cache/conftool/dbconfig/20220817-162655-root.json [16:27:53] (03CR) 10Ladsgroup: [C: 03+2] portals: Bump to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824239 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [16:28:52] (03Merged) 10jenkins-bot: portals: Bump to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824239 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [16:29:00] Amir1: was trying to deploy some config for StopForumSpam. Should I wait? [16:29:07] nah, mine can wait [16:29:15] let me just rebase it [16:29:39] Ok, thanks. Reedy and I should be done soon. 
Either this thing works or it gets reverted quickly :) [16:29:40] sbassett: done, take the floor, let me know once you're done [16:31:11] (03CR) 10Bartosz Dziewoński: [C: 03+1] "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1300" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824203 (owner: 10Esanders) [16:31:14] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [16:31:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:31:32] (03CR) 10Reedy: [C: 03+2] Enable StopForumSpam on initial candidate projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823789 (https://phabricator.wikimedia.org/T273220) (owner: 10SBassett) [16:32:09] (03CR) 10SBassett: [C: 03+1] Enable StopForumSpam on initial candidate projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823789 (https://phabricator.wikimedia.org/T273220) (owner: 10SBassett) [16:32:16] (03CR) 10Reedy: [C: 03+2] Enable StopForumSpam on initial candidate projects (CommonSettings) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823790 (https://phabricator.wikimedia.org/T273220) (owner: 10SBassett) [16:32:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:32:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:32:35] (03CR) 10SBassett: [C: 03+1] Enable StopForumSpam on initial candidate projects (CommonSettings) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823790 (https://phabricator.wikimedia.org/T273220) (owner: 10SBassett) [16:33:12] (03Merged) 10jenkins-bot: Enable StopForumSpam on initial candidate projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823789 (https://phabricator.wikimedia.org/T273220) (owner: 10SBassett) [16:33:16] (03Merged) 10jenkins-bot: Enable StopForumSpam on initial 
candidate projects (CommonSettings) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823790 (https://phabricator.wikimedia.org/T273220) (owner: 10SBassett) [16:33:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:38:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:38:54] RECOVERY - Check systemd state on elastic1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:15] (03PS2) 10Cwhite: tcpircbot: send tcpircbot logs to centralized logging [puppet] - 10https://gerrit.wikimedia.org/r/822423 (https://phabricator.wikimedia.org/T257861) [16:42:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:42:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:43:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:48:01] (03PS3) 10Cwhite: tcpircbot: send tcpircbot logs to centralized logging [puppet] - 10https://gerrit.wikimedia.org/r/822423 (https://phabricator.wikimedia.org/T257861) [16:50:34] !log sbassett@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Enable StopForumSpam on candidate wikis (IS.php) - T273220 (duration: 03m 20s) [16:50:39] T273220: Deploy StopForumSpam extension to production - https://phabricator.wikimedia.org/T273220 [16:51:08] (03CR) 10Cwhite: [C: 03+2] "PCC checks out https://puppet-compiler.wmflabs.org/pcc-worker1002/36794/" [puppet] - 10https://gerrit.wikimedia.org/r/822423 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [16:54:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host graphite2004.codfw.wmnet with OS bullseye [16:54:43] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host graphite2004.codfw.wmnet with OS bullseye [16:54:49] !log sbassett@deploy1002 Synchronized wmf-config/CommonSettings.php: Enable StopForumSpam on candidate wikis (CS.php) - T273220 (duration: 03m 26s) [16:55:14] (03PS1) 10Xcollazo: Add missing airflow service users to yarn's production queue [puppet] - 10https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858) [16:55:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on graphite2004.codfw.wmnet with reason: host reimage [16:57:14] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Allow jumbo frames between cloud hosts in production realm - https://phabricator.wikimedia.org/T315446 (10cmooney) 05Open→03Resolved Ok gonna close this one as the cloud team have confirmed things are now working for them. Apologies for... [16:58:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on graphite2004.codfw.wmnet with reason: host reimage [17:00:35] (03CR) 10Joal: [C: 03+1] "LGTM! 
Thanks Xabriel :)" [puppet] - 10https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [17:02:09] (03PS1) 10Papaul: Add kafka-logging200[45] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/824243 (https://phabricator.wikimedia.org/T313959) [17:04:14] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [17:04:27] (03CR) 10Papaul: [C: 03+2] Add kafka-logging200[45] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/824243 (https://phabricator.wikimedia.org/T313959) (owner: 10Papaul) [17:06:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host graphite2004.codfw.wmnet with OS bullseye [17:06:20] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host graphite2004.codfw.wmnet with OS bullseye completed: - graphite2004 (**PASS**) - D... [17:09:34] (03PS1) 10AOkoth: gitlab: revert gitlab-replica TTL to 600s [dns] - 10https://gerrit.wikimedia.org/r/824244 (https://phabricator.wikimedia.org/T296713) [17:10:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-logging2004.codfw.wmnet with OS bullseye [17:10:19] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T315229 (10wiki_willy) [17:10:27] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-logging2004.codfw.wmnet with OS bullseye [17:11:37] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T315229 (10wiki_willy) Most definitely, we'll get it procured in T315462 >>! 
In T315229#8160434, @Marostegui wrote: > @Papaul @wiki_willy any chances we can buy one? This is s4 master. [17:11:40] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10Papaul) [17:12:23] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T315229 (10Marostegui) Thanks a lot Willy [17:18:42] (03CR) 10Ottomata: "I am not super familiar with the capacity scheduler settings, but I think we'd also want additions to" [puppet] - 10https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [17:22:56] (03PS1) 10Cathal Mooney: Add SSH key for user mikeraish [puppet] - 10https://gerrit.wikimedia.org/r/824245 (https://phabricator.wikimedia.org/T313429) [17:24:09] (03CR) 10CI reject: [V: 04-1] Add SSH key for user mikeraish [puppet] - 10https://gerrit.wikimedia.org/r/824245 (https://phabricator.wikimedia.org/T313429) (owner: 10Cathal Mooney) [17:24:45] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1056.eqiad.wmnet with OS bullseye [17:24:53] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1056.eqiad.wmnet with OS bullseye [17:26:25] (03PS2) 10Cathal Mooney: Add SSH key for user mikeraish [puppet] - 10https://gerrit.wikimedia.org/r/824245 (https://phabricator.wikimedia.org/T313429) [17:27:50] (03CR) 10CI reject: [V: 04-1] Add SSH key for user mikeraish [puppet] - 10https://gerrit.wikimedia.org/r/824245 (https://phabricator.wikimedia.org/T313429) (owner: 10Cathal Mooney) [17:29:35] !log ladsgroup@deploy1002 Synchronized portals/wikipedia.org/assets: Migrate wikinews.org to the modern portals (duration: 03m 29s) [17:30:38] !log pt1979@cumin2002 START - Cookbook 
sre.network.configure-switch-interfaces for host kafka-logging2004 [17:30:53] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:31:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-logging2004 [17:31:57] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:32:17] (03PS3) 10Cathal Mooney: Add SSH key for user mikeraish [puppet] - 10https://gerrit.wikimedia.org/r/824245 (https://phabricator.wikimedia.org/T313429) [17:33:08] !log ladsgroup@deploy1002 Synchronized portals: Migrate wikinews.org to the modern portals (duration: 03m 32s) [17:34:01] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:36:19] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:39:33] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1056.eqiad.wmnet with reason: host reimage [17:41:59] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kafka-logging2005 [17:42:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-logging2005 [17:43:08] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1056.eqiad.wmnet with reason: host reimage [17:46:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T314041)', diff saved to https://phabricator.wikimedia.org/P32465 and previous config saved to /var/cache/conftool/dbconfig/20220817-174644-ladsgroup.json [17:46:50] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [17:48:39] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-logging2004.codfw.wmnet with OS bullseye [17:48:43] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-logging2004.codfw.wmnet with OS bullseye executed with errors: - kafka-logging... [17:50:30] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:00:04] ^demon and dancy: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220817T1800). [18:00:05] ^demon and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220817T1800). [18:00:16] o/ [18:01:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-logging2004.codfw.wmnet with OS bullseye [18:01:47] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-logging2004.codfw.wmnet with OS bullseye [18:01:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P32466 and previous config saved to /var/cache/conftool/dbconfig/20220817-180150-ladsgroup.json [18:07:30] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1056.eqiad.wmnet with OS bullseye [18:07:36] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1056.eqiad.wmnet with OS bullseye completed: - elastic1056 (... 
[18:16:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P32467 and previous config saved to /var/cache/conftool/dbconfig/20220817-181656-ladsgroup.json [18:22:51] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1027.eqiad.wmnet with OS bullseye [18:25:56] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1073.eqiad.wmnet with OS bullseye [18:26:04] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1073.eqiad.wmnet with OS bullseye [18:27:27] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (10cmooney) [18:28:41] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (10cmooney) Hi @JayCano can you approve this request and confirm (if you are aware) that access needs to be given to shell group "analytics-priv... 
[18:29:33] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (10cmooney) p:05Triage→03Medium a:03cmooney [18:30:02] (03CR) 10BCornwall: [C: 03+2] Add SSH key for user mikeraish [puppet] - 10https://gerrit.wikimedia.org/r/824245 (https://phabricator.wikimedia.org/T313429) (owner: 10Cathal Mooney) [18:32:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T314041)', diff saved to https://phabricator.wikimedia.org/P32468 and previous config saved to /var/cache/conftool/dbconfig/20220817-183202-ladsgroup.json [18:32:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [18:32:06] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [18:32:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [18:32:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T314041)', diff saved to https://phabricator.wikimedia.org/P32469 and previous config saved to /var/cache/conftool/dbconfig/20220817-183223-ladsgroup.json [18:32:52] (03CR) 10CDanis: [C: 03+1] quickdatacopy: Added simple username/groupname mapping for the Rsync server [puppet] - 10https://gerrit.wikimedia.org/r/823748 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [18:33:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:33:50] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10cmooney) @Trokhymovych thanks for confirming. 
Can you also review the below Server Access Responsibilities Document and sign it? https://phabricator.wikimedia.org/L3 Once that'... [18:34:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Jclark-ctr) [18:36:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install graphite1005 - https://phabricator.wikimedia.org/T313853 (10Jclark-ctr) [18:36:50] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1027.eqiad.wmnet with reason: host reimage [18:37:50] PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [18:38:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (10Jclark-ctr) [18:38:53] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1073.eqiad.wmnet with reason: host reimage [18:40:22] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1027.eqiad.wmnet with reason: host reimage [18:40:57] !log disabling reserved space on codfw nodes (RESTBase), /dev/md2 (aka /srv/cassandra/instance-data) -- T314941 [18:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:00] T314941: RESTBase Cassandra high utilization alarms (instance-data) - https://phabricator.wikimedia.org/T314941 [18:42:31] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Jclark-ctr) [18:42:44] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10KFrancis) @cmooney @Trokhymovych Is there a 
WMF email (contractor email is also fine) for this request? [18:43:03] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1073.eqiad.wmnet with reason: host reimage [18:44:14] (03CR) 10Xcollazo: Add missing airflow service users to yarn's production queue (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [18:44:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[01] - https://phabricator.wikimedia.org/T313873 (10Jclark-ctr) [18:47:13] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:48:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install new eqiad memcached hosts - https://phabricator.wikimedia.org/T313963 (10Jclark-ctr) [18:51:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:53] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10Jclark-ctr) [18:52:36] (03PS1) 10TrainBranchBot: group1 wikis to 1.39.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824263 (https://phabricator.wikimedia.org/T314186) [18:52:38] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.39.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824263 (https://phabricator.wikimedia.org/T314186) (owner: 10TrainBranchBot) [18:53:53] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824263 (https://phabricator.wikimedia.org/T314186) (owner: 10TrainBranchBot) [18:54:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install 
cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Jclark-ctr) [18:55:27] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1027.eqiad.wmnet with OS bullseye [18:56:23] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10Jclark-ctr) [18:57:18] dancy: i replied at https://phabricator.wikimedia.org/T315383#8162953 , i hope this reassures you. thanks for checking for other problems [18:57:30] Thanks! [18:57:53] RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [18:58:00] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-logging2004.codfw.wmnet with OS bullseye [18:58:03] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-logging2004.codfw.wmnet with OS bullseye executed with errors: - kafka-logging... [18:58:16] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.25 refs T314186 [18:58:21] T314186: 1.39.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T314186 [19:00:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:01:42] !log demon@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.25 refs T314186 (duration: 03m 24s) [19:01:53] <^demon> Rolling back to wmf.23, actually. 
[19:02:28] (03PS1) 10TrainBranchBot: group1 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824268 (https://phabricator.wikimedia.org/T314186) [19:02:30] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824268 (https://phabricator.wikimedia.org/T314186) (owner: 10TrainBranchBot) [19:03:57] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824268 (https://phabricator.wikimedia.org/T314186) (owner: 10TrainBranchBot) [19:04:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1073.eqiad.wmnet with OS bullseye [19:04:46] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1073.eqiad.wmnet with OS bullseye completed: - elastic1073 (... 
[19:04:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:04:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:05:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:07:55] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.23 refs T314186 [19:07:59] T314186: 1.39.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T314186 [19:08:48] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10Trokhymovych) @cmooney I have already reviewed and signed the Server Access Responsibilities Document [19:10:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:11:10] !log demon@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.23 refs T314186 (duration: 03m 15s) [19:11:43] PROBLEM - HP RAID on ms-be1054 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:4 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:1, 3I:3:2, 3I:3:3, 3I:3:4, 4I:5:1, 4I:5:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:11:46] ACKNOWLEDGEMENT - HP RAID on ms-be1054 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:4 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:1, 3I:3:2, 3I:3:3, 3I:3:4, 4I:5:1, 4I:5:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T315480 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:11:49] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1054 - https://phabricator.wikimedia.org/T315480 (10ops-monitoring-bot) [19:11:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: 
apply [19:11:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:12:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:16:14] PROBLEM - Check systemd state on mw2339 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:53] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10Papaul) Thanks for the update [19:21:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-logging2004.codfw.wmnet with OS bullseye [19:21:20] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-logging2004.codfw.wmnet with OS bullseye [19:22:05] (03CR) 10Ottomata: "Hm, yeah not sure! Just going on what the existent users and groups have. 
If analytics-search and analytics-search-users have those perm" [puppet] - 10https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [19:24:14] (03CR) 10Andrew Bogott: [C: 03+2] Revert "netboot.cfg: temporarily switch cloudvirt1025 to a full reimage" [puppet] - 10https://gerrit.wikimedia.org/r/824041 (owner: 10Andrew Bogott) [19:47:50] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:48:14] (03PS4) 10Samtar: InitialiseSettings: Add wmgUsePhonos (default => false) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822656 (https://phabricator.wikimedia.org/T314294) [19:48:59] (03CR) 10Andrea Denisse: [C: 03+2] quickdatacopy: Added simple username/groupname mapping for the Rsync server [puppet] - 10https://gerrit.wikimedia.org/r/823748 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [20:00:05] RoanKattouw, Urbanecm, and cjming: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220817T2000). [20:00:05] danisztls and TheresNoTime: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] hi [20:00:17] i can deploy today [20:00:27] * TheresNoTime is here [20:00:37] TheresNoTime: hi, do you want to self-serve your commits? [20:00:58] sure, when do you want me to do them? [20:01:50] (03CR) 10Urbanecm: QuickSurveys: Remove research incentive survey from BN wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823772 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza) [20:01:55] TheresNoTime: i'll ping you :) [20:02:08] Okay :) [20:02:17] danisztls: can you please review the comment i made in your change? 
[20:02:28] urbanecm: yes [20:04:33] urbanecm: I agree, will remove it. I had a different feedback regarding this when undeploying the previous survey. :) [20:05:01] oops, didn't know that (sorry for giving contradictory feedback :)) [20:05:19] 10SRE: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10xcollazo) [20:06:09] urbanecm: no problem, as I see both ways are fine and will not cause problems but it does make sense to disable the extension be it on this change or on a follow up [20:06:52] urbanecm: as referece, the change is 812377 [20:06:56] *reference [20:07:23] thanks. looks like at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/812377, it was about the deployer not being comfortable with doing both changes at once, but it looks like we both agree it should be removed. [20:07:36] urbanecm: ok [20:07:44] anyway, let's remove that line please :). [20:09:01] (03PS1) 10Ladsgroup: Do not attempt to create a FlaggableWikiPage when the title can't exist [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824169 (https://phabricator.wikimedia.org/T315479) [20:09:03] (03PS3) 10DDesouza: QuickSurveys: Remove research incentive survey from BN wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823772 (https://phabricator.wikimedia.org/T314333) [20:09:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2004.codfw.wmnet with reason: host reimage [20:09:30] (03PS4) 10Urbanecm: QuickSurveys: Remove research incentive survey from BN wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823772 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza) [20:09:34] (03CR) 10Urbanecm: [C: 03+2] QuickSurveys: Remove research incentive survey from BN wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823772 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza) [20:10:36] (03CR) 10DDesouza: QuickSurveys: 
Remove research incentive survey from BN wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823772 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza) [20:10:59] (03Merged) 10jenkins-bot: QuickSurveys: Remove research incentive survey from BN wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823772 (https://phabricator.wikimedia.org/T314333) (owner: 10DDesouza) [20:12:26] danisztls: syncing :) [20:12:38] urbanecm: I will make a new change to disable it for the JA wiki as I forgot to do the follow up on the last survey. [20:12:41] urbanecm: thanks [20:12:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2004.codfw.wmnet with reason: host reimage [20:12:45] thanks! [20:13:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:14:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:14:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:15:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:15:34] (03PS1) 10DDesouza: QuickSurveys: Disable extension on JA wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824283 (https://phabricator.wikimedia.org/T311015) [20:15:43] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 2cf80d1e038b33f7f99d56ca8e30ce37cb726ef2: QuickSurveys: Remove research incentive survey from BN wiki (T314333) (duration: 03m 24s) [20:15:46] T314333: Deploy Research Incentive Survey on Bengali Wikipedia - https://phabricator.wikimedia.org/T314333 [20:16:44] (03CR) 10Urbanecm: [C: 03+2] QuickSurveys: Disable extension on JA wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824283 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:16:52] danisztls: i'll sync out the second one too [20:17:43] (03Merged) 10jenkins-bot: QuickSurveys: 
Disable extension on JA wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824283 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:19:11] syncing second one too [20:19:30] (03CR) 10CI reject: [V: 04-1] Do not attempt to create a FlaggableWikiPage when the title can't exist [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824169 (https://phabricator.wikimedia.org/T315479) (owner: 10Ladsgroup) [20:20:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:20:28] (03PS1) 10Ladsgroup: updateAutoPromote: Fix rev_comment [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824170 [20:20:34] (03CR) 10Ladsgroup: [C: 03+2] updateAutoPromote: Fix rev_comment [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824170 (owner: 10Ladsgroup) [20:21:32] (03CR) 10BCornwall: "This change is ready for review." (033 comments) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: 10BCornwall) [20:22:15] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10KFrancis) @cmooney I am confirming the NDA has been signed. Please go ahead with the access request. Thanks! [20:22:19] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1ddc661e6e73b60542e31d2128c2add3e2307b74: QuickSurveys: Disable extension on JA wiki (T311015) (duration: 03m 19s) [20:22:22] T311015: Deploy QuickSurvey on Japanese Wikipedia - https://phabricator.wikimedia.org/T311015 [20:22:32] danisztls: all done! [20:22:41] urbanecm: thanks! [20:22:45] no problem [20:22:49] TheresNoTime: the floor is yours! [20:23:03] urbanecm: thank you :) will let you know when I'm done! 
[20:23:30] (03CR) 10Samtar: [C: 03+2] extension-list: Add Phonos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821249 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [20:23:47] TheresNoTime: thanks, but i think you should tell Amir1 instead, he likely has a backport to do :) [20:23:57] yeah, ubn fix [20:24:00] will do [20:24:10] (03CR) 10Andrea Denisse: netmon: Set correct username/groupname mappings for LibreNMS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/823752 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [20:24:31] (03PS1) 10Andrea Denisse: netmon: Set correct username/groupname mappings for LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/824284 (https://phabricator.wikimedia.org/T314972) [20:24:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:24:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:25:17] (03Merged) 10jenkins-bot: extension-list: Add Phonos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821249 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [20:25:25] (03CR) 10CI reject: [V: 04-1] updateAutoPromote: Fix rev_comment [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824170 (owner: 10Ladsgroup) [20:25:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:27:08] (03PS1) 10Ladsgroup: Remove indexExists check for page_name_title index [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824171 [20:27:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2004.codfw.wmnet with OS bullseye [20:27:21] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host 
kafka-logging2004.codfw.wmnet with OS bullseye completed: - kafka-logging2004 (**PAS... [20:27:35] (03PS5) 10Samtar: InitialiseSettings: Add wmgUsePhonos (default => false) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822656 (https://phabricator.wikimedia.org/T314294) [20:27:37] (03CR) 10Ladsgroup: [C: 03+2] Remove indexExists check for page_name_title index [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824171 (owner: 10Ladsgroup) [20:28:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-logging2005.codfw.wmnet with OS bullseye [20:29:00] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging200[45] - https://phabricator.wikimedia.org/T313959 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-logging2005.codfw.wmnet with OS bullseye [20:29:27] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/36795/" [puppet] - 10https://gerrit.wikimedia.org/r/824284 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [20:29:58] !log samtar@deploy1002 Synchronized wmf-config/extension-list: Config: [[gerrit:821249|extension-list: Add Phonos (T314294)]] (duration: 03m 17s) [20:30:02] T314294: Deploy Phonos to beta cluster - https://phabricator.wikimedia.org/T314294 [20:30:12] (03CR) 10CI reject: [V: 04-1] Remove indexExists check for page_name_title index [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824171 (owner: 10Ladsgroup) [20:30:13] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10cmooney) @Trokhymovych thanks - yep I see your signature from yesterday, somehow missed earlier apologies. @KFrancis thanks for confirming re: the NDA. No WMF email attached to... 
[20:30:16] (03CR) 10Samtar: [C: 03+2] InitialiseSettings: Add wmgUsePhonos (default => false) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822656 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [20:30:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:31:23] (03Merged) 10jenkins-bot: InitialiseSettings: Add wmgUsePhonos (default => false) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822656 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [20:31:34] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] "Force merging since there is a circular breakage with this and I37312b12dab04" [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824171 (owner: 10Ladsgroup) [20:31:43] (03PS2) 10Ladsgroup: updateAutoPromote: Fix rev_comment [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824170 [20:31:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:31:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:31:50] (03PS1) 10Andrea Denisse: netmon: Set correct username/groupname mappings for Rancid [puppet] - 10https://gerrit.wikimedia.org/r/824286 (https://phabricator.wikimedia.org/T314972) [20:31:52] (03CR) 10Ladsgroup: "trying again" [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824170 (owner: 10Ladsgroup) [20:32:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:32:57] (03CR) 10Andrea Denisse: netmon: Set correct username/groupname mappings for Rancid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/823759 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [20:35:09] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM... but I've not done any tests myself, if you feel I can help there let me know glad to double-check." 
[software/homer] - 10https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [20:35:30] (03CR) 10CI reject: [V: 04-1] updateAutoPromote: Fix rev_comment [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824170 (owner: 10Ladsgroup) [20:36:00] !log samtar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822656|InitialiseSettings: Add wmgUsePhonos (default => false) (T314294)]] (duration: 03m 29s) [20:36:04] T314294: Deploy Phonos to beta cluster - https://phabricator.wikimedia.org/T314294 [20:36:30] Amir1: all yours, cc urbanecm [20:37:30] Thanks [20:37:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:38:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:38:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:39:00] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1001/36796/" [puppet] - 10https://gerrit.wikimedia.org/r/824286 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [20:39:07] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, I've a hard time working out in my head why the box would throw the error you seen. 
Changes to the data / templates seems like a go" [homer/public] - 10https://gerrit.wikimedia.org/r/822414 (https://phabricator.wikimedia.org/T313805) (owner: 10Ayounsi) [20:39:33] (03Abandoned) 10Andrea Denisse: netmon: Set correct username/groupname mappings for Rancid [puppet] - 10https://gerrit.wikimedia.org/r/823759 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [20:39:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:39:51] (03Abandoned) 10Andrea Denisse: netmon: Set correct username/groupname mappings for LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/823752 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [20:41:01] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging200[45] - https://phabricator.wikimedia.org/T313959 (10Papaul) [20:46:57] (03CR) 10Ladsgroup: [C: 03+2] Do not attempt to create a FlaggableWikiPage when the title can't exist [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824169 (https://phabricator.wikimedia.org/T315479) (owner: 10Ladsgroup) [20:50:21] (03CR) 10CI reject: [V: 04-1] Do not attempt to create a FlaggableWikiPage when the title can't exist [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824169 (https://phabricator.wikimedia.org/T315479) (owner: 10Ladsgroup) [20:50:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2005.codfw.wmnet with reason: host reimage [20:51:46] (03CR) 10Xcollazo: Add missing airflow service users to yarn's production queue (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [20:54:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2005.codfw.wmnet with reason: host reimage [20:56:12] (03PS2) 10Xcollazo: Add missing airflow 
service users to yarn's production queue [puppet] - 10https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858) [20:57:19] (03CR) 10Ladsgroup: [C: 03+2] "random failure" [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824170 (owner: 10Ladsgroup) [20:59:01] (03CR) 10Xcollazo: Add missing airflow service users to yarn's production queue (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [20:59:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:01:43] (03Merged) 10jenkins-bot: updateAutoPromote: Fix rev_comment [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824170 (owner: 10Ladsgroup) [21:02:17] (03CR) 10Ladsgroup: [C: 03+2] "retry" [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824169 (https://phabricator.wikimedia.org/T315479) (owner: 10Ladsgroup) [21:06:38] (03Merged) 10jenkins-bot: Do not attempt to create a FlaggableWikiPage when the title can't exist [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824169 (https://phabricator.wikimedia.org/T315479) (owner: 10Ladsgroup) [21:06:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:06:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:07:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2005.codfw.wmnet with OS bullseye [21:07:46] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging200[45] - https://phabricator.wikimedia.org/T313959 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-logging2005.codfw.wmnet with OS bullseye comple... 
[21:09:31] (03CR) 10Samtar: [C: 04-2] "Waiting on extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824294 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [21:13:03] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.25/extensions/FlaggedRevs/frontend/FlaggedRevsUIHooks.php: Backport: [[gerrit:824169|Do not attempt to create a FlaggableWikiPage when the title can't exist (T315479)]] (duration: 03m 26s) [21:13:07] T315479: InvalidArgumentException: WikiPage constructed on a Title that cannot exist as a page: Especial:MobileDiff/17544 - https://phabricator.wikimedia.org/T315479 [21:13:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:14:23] (03CR) 10Cathal Mooney: "Thanks for the review John, some replies in-line, I'll submit a new patch set tomorrow. If you've any thoughts on the resulting netbox mo" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [21:15:24] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:50] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:16:31] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.25/extensions/FlaggedRevs: Backport: [[gerrit:824171|Remove indexExists check for page_name_title index]] (duration: 03m 12s) [21:18:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:22:19] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging200[45] - https://phabricator.wikimedia.org/T313959 (10Papaul) [21:22:59] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE 
Observability, 10observability: Q1:rack/setup/install kafka-logging200[45] - https://phabricator.wikimedia.org/T313959 (10Papaul) 05Open→03Resolved @herron all yours [21:24:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:24:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:25:22] dancy ^demon: blocker fixed, do you want to roll forward? [21:26:31] (03PS1) 10Andrea Denisse: netmon: Add the OpenSSH configuration file inside the rancid home directory [puppet] - 10https://gerrit.wikimedia.org/r/824299 (https://phabricator.wikimedia.org/T314936) [21:27:27] (03CR) 10CI reject: [V: 04-1] netmon: Add the OpenSSH configuration file inside the rancid home directory [puppet] - 10https://gerrit.wikimedia.org/r/824299 (https://phabricator.wikimedia.org/T314936) (owner: 10Andrea Denisse) [21:28:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:31:02] (03PS2) 10Andrea Denisse: netmon: Add the OpenSSH configuration file inside the rancid home directory [puppet] - 10https://gerrit.wikimedia.org/r/824299 (https://phabricator.wikimedia.org/T314936) [21:32:37] Amir1: I'm about to go afk for a bit so I'll wait for ^demon. [21:32:46] Unless you want to press the button! (and watch logs) [21:33:00] it's quite late here, you have fun! [21:33:20] haha. ok. 
I'll check back in a bit to see if anything has happened [21:34:44] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1001/36798/" [puppet] - 10https://gerrit.wikimedia.org/r/824299 (https://phabricator.wikimedia.org/T314936) (owner: 10Andrea Denisse) [21:48:47] 10SRE: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Dzahn) to the SRE on clinic duty: this is configured in `[puppetmaster1001:/srv/private/modules/privateexim/files/wikimedia.org` (but not sure if analytics sre want to confirm this or do it thems... [21:52:56] 10SRE: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Dzahn) This is how to get a list of all existing members: `[mx1001:~] $ sudo exim4 -bt analytics-alerts@wikimedia.org`. Asking ITS to make it a Google group should be just an email to techsuppor... [21:53:24] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:53:56] 10SRE, 10Infrastructure-Foundations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Dzahn) [21:54:51] (03CR) 10Dzahn: [C: 03+1] "so far I don't have a good reason for or against it but if this is reverting and makes it consistent with gerrit, yes, +1" [dns] - 10https://gerrit.wikimedia.org/r/824244 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [21:59:53] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "yes, thank you. I remember this. 
Just a rename but one that makes a lot of sense because we both ran into it and wondered what "gerrit_ser" [puppet] - 10https://gerrit.wikimedia.org/r/816038 (owner: 10Hashar) [22:00:54] (03PS4) 10Ori: Incremental roll-out of query-sorting (0%) [puppet] - 10https://gerrit.wikimedia.org/r/822434 [22:02:37] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop on prod gerrit servers confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/816038 (owner: 10Hashar) [22:03:44] (03CR) 10Dzahn: [C: 04-2] "technical -2 based on "This must not be merged before we have upgraded to Gerrit 3.5." comment. please just delete it once it's ready" [puppet] - 10https://gerrit.wikimedia.org/r/824222 (https://phabricator.wikimedia.org/T315445) (owner: 10Hashar) [22:04:44] (03CR) 10Dzahn: "please clarify whether this is meant to be merged before or after the version upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/824221 (https://phabricator.wikimedia.org/T315445) (owner: 10Hashar) [22:06:10] (03CR) 10Dzahn: phabricator: move lvs::realserver inclusion to profile, depend on vcs_enabled (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/823755 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:06:20] (03PS1) 10Ryan Kemper: elastic: upgrade to 7.10.2-2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/824306 (https://phabricator.wikimedia.org/T299226) [22:07:45] (03CR) 10Ottomata: "One more q, but I think is good! Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [22:09:14] (03PS5) 10Ori: Incremental roll-out of query-sorting (0%) [puppet] - 10https://gerrit.wikimedia.org/r/822434 [22:09:47] (03CR) 10Ori: "PS 4 / 5: rebased and added 'querysort_rollout_percent: 50' for Beta." 
[puppet] - 10https://gerrit.wikimedia.org/r/822434 (owner: 10Ori) [22:10:55] (03PS1) 10Stang: mrwiktionary: Set import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824308 (https://phabricator.wikimedia.org/T314939) [22:13:23] (03PS2) 10Dzahn: phabricator: move lvs::realserver inclusion to profile, create use_lvs parameter [puppet] - 10https://gerrit.wikimedia.org/r/823755 (https://phabricator.wikimedia.org/T280597) [22:14:02] (03CR) 10Dzahn: phabricator: move lvs::realserver inclusion to profile, create use_lvs parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/823755 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:15:03] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:16:11] (03PS3) 10Dzahn: phabricator: move lvs::realserver inclusion to profile, create use_lvs parameter [puppet] - 10https://gerrit.wikimedia.org/r/823755 (https://phabricator.wikimedia.org/T280597) [22:19:11] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36800/" [puppet] - 10https://gerrit.wikimedia.org/r/823755 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:23:18] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop confirmed on prod phab servers, first 2002, 2001, then 1001 last (temp disabled puppet, re-enabled)" [puppet] - 10https://gerrit.wikimedia.org/r/823755 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:23:53] jouncenot now [22:23:56] jouncebot now [22:23:56] No deployments scheduled for the next 7 hour(s) and 36 minute(s) [22:24:17] I'm rolling wmf.25 to group1 again [22:24:37] (03PS1) 10TrainBranchBot: group1 wikis to 1.39.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824310 (https://phabricator.wikimedia.org/T314186) [22:24:39] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.39.0-wmf.25 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/824310 (https://phabricator.wikimedia.org/T314186) (owner: 10TrainBranchBot) [22:27:29] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824310 (https://phabricator.wikimedia.org/T314186) (owner: 10TrainBranchBot) [22:30:51] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:28] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.25 refs T314186 [22:31:33] T314186: 1.39.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T314186 [22:34:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:34:46] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.25 refs T314186 (duration: 03m 17s) [22:35:25] (03CR) 10Dzahn: phabricator::migration: add phd user with systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [22:37:47] (03PS4) 10Dzahn: phabricator::migration: add phd user with systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) [22:38:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:38:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:38:39] (03PS5) 10Dzahn: phabricator::migration: add phd with systemd::sysuser, reserve UID 920 [puppet] - 10https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) [22:39:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:40:34] (03PS6) 10Dzahn: phabricator::migration: add phd with systemd::sysuser, reserve UID 920 [puppet] - 10https://gerrit.wikimedia.org/r/823765 
(https://phabricator.wikimedia.org/T313360) [22:44:32] (03CR) 10Dzahn: [C: 04-2] "Systemd::Sysuser[phd]: has no parameter named 'content'" [puppet] - 10https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [22:45:14] (03CR) 10Dzahn: [C: 04-2] "heh, this is a syntax error but it was a copy/paste from the docs" [puppet] - 10https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [22:50:36] (03PS7) 10Dzahn: phabricator::migration: add phd with systemd::sysuser, reserve UID 920 [puppet] - 10https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) [22:52:00] (03CR) 10Dzahn: "@JBond I made this edit: https://wikitech.wikimedia.org/w/index.php?title=UID&type=revision&diff=2004647&oldid=2004642 looks good?" [puppet] - 10https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [22:54:46] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36802/" [puppet] - 10https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [22:54:52] (03PS8) 10Dzahn: phabricator::migration: add phd with systemd::sysuser, reserve UID 920 [puppet] - 10https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) [23:03:14] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-stretch2001.codfw.wmnet'] [23:10:47] (03CR) 10Dzahn: "on phab1001/phab2002: noop on phab1004: created new user phd with id 920 on phab2002 changed uid of existing user from 498 to 920. 
wil" [puppet] - 10https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [23:10:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-stretch2001.codfw.wmnet'] [23:17:15] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-stretch2002.codfw.wmnet'] [23:21:46] (03PS1) 10Papaul: Add kafka-stretch200[12] to site.pp and to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/824311 (https://phabricator.wikimedia.org/T314160) [23:22:16] (03PS3) 10Dzahn: phabricator: replace user{} with systemd::sysuser, only on new hosts [puppet] - 10https://gerrit.wikimedia.org/r/823767 (https://phabricator.wikimedia.org/T313360) [23:23:09] (03CR) 10Dzahn: phabricator: replace user{} with systemd::sysuser, only on new hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/823767 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [23:23:17] !log phab2002 - chmod -R phd /srv/repos | find /srv/repos/ -gid 498 -exec chown phd:phd {} \; T313360 [23:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:21] T313360: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360 [23:27:18] (03PS4) 10Dzahn: phabricator: replace user{} with systemd::sysuser, only on new hosts [puppet] - 10https://gerrit.wikimedia.org/r/823767 (https://phabricator.wikimedia.org/T313360) [23:27:37] (03PS5) 10Dzahn: phabricator: replace user{} with systemd::sysuser, only on new hosts [puppet] - 10https://gerrit.wikimedia.org/r/823767 (https://phabricator.wikimedia.org/T313360) [23:31:58] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36803/" [puppet] - 10https://gerrit.wikimedia.org/r/823767 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [23:35:08] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware 
(exit_code=97) upgrade firmware for hosts ['kafka-stretch2002.codfw.wmnet'] [23:35:20] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-stretch2002.codfw.wmnet'] [23:35:31] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-stretch2002.codfw.wmnet'] [23:35:57] (03CR) 10Dzahn: [C: 03+2] "noop on old and new - will have an effect once we put prod role on new" [puppet] - 10https://gerrit.wikimedia.org/r/823767 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [23:36:08] (03PS2) 10Dzahn: Revert "Revert "site: add phabricator role to phab2002"" [puppet] - 10https://gerrit.wikimedia.org/r/823636 [23:36:20] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-stretch2002.codfw.wmnet'] [23:36:32] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-stretch2002.codfw.wmnet'] [23:42:03] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-stretch2002.codfw.wmnet'] [23:42:13] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:42:19] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-stretch2002.codfw.wmnet'] [23:50:06] (03PS1) 10Cwhite: logstash: add support for rsyslog-namespaced fields [puppet] - 10https://gerrit.wikimedia.org/r/824314 (https://phabricator.wikimedia.org/T315500) [23:50:08] (03PS1) 10Cwhite: rsyslog: add rsyslog-namespaced fields to syslog_cee [puppet] - 10https://gerrit.wikimedia.org/r/824315 (https://phabricator.wikimedia.org/T315500) [23:50:10] (03PS1) 10Cwhite: logstash: use rsyslog-namespaced fields [puppet] - 10https://gerrit.wikimedia.org/r/824316 
(https://phabricator.wikimedia.org/T315500) [23:50:12] (03PS1) 10Cwhite: logstash: add tcpircbot logging tests [puppet] - 10https://gerrit.wikimedia.org/r/824317 (https://phabricator.wikimedia.org/T257861) [23:51:03] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-stretch2002.codfw.wmnet'] [23:51:19] (03PS2) 10Cwhite: logstash: add support for rsyslog-namespaced fields [puppet] - 10https://gerrit.wikimedia.org/r/824314 (https://phabricator.wikimedia.org/T315500) [23:52:54] (03PS1) 10Dzahn: phabricator: fix location of hosts files, ensure services stopped on new [puppet] - 10https://gerrit.wikimedia.org/r/824319 (https://phabricator.wikimedia.org/T280597) [23:54:04] (03CR) 10Dzahn: [C: 03+2] phabricator: fix location of hosts files, ensure services stopped on new [puppet] - 10https://gerrit.wikimedia.org/r/824319 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:54:18] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10Papaul) [23:57:08] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes2023 [23:57:31] (03CR) 10Xcollazo: Add missing airflow service users to yarn's production queue (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [23:57:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes2023 [23:58:04] (03CR) 10CI reject: [V: 04-1] logstash: add support for rsyslog-namespaced fields [puppet] - 10https://gerrit.wikimedia.org/r/824314 (https://phabricator.wikimedia.org/T315500) (owner: 10Cwhite) [23:58:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-stretch2002.codfw.wmnet'] [23:58:47] PROBLEM - Check systemd state on webperf2004 is CRITICAL: 
CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:58:57] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox