[00:00:54] (03CR) 10Andrew Bogott: [C: 03+2] Eqiad designate -> OpenStack version Xena [puppet] - 10https://gerrit.wikimedia.org/r/825927 (https://phabricator.wikimedia.org/T296561) (owner: 10Andrew Bogott) [00:11:14] (03CR) 10Cwhite: [C: 03+1] "Looks good! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [00:11:47] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [00:15:41] (03CR) 10Cwhite: "Change looks good to me. The additions Filippo suggested may be helpful to you, but the patch is functional as-is." [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup) [00:18:20] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-08-16 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:20:08] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:24:10] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-08-16 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:24:10] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-08-16 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:28:02] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:29:02] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw 
(es2025) taken more than a week ago: Most recent backup 2022-08-16 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:35:06] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:50:05] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10Papaul) Good afternoon Papaul, I have submitted DPS 432866984 for the replacement backplane to ship out. Service is scheduled for Thursday 08/25/22. The tech will call upon assignm... [00:55:57] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: thanos-be2002 sdj failed - https://phabricator.wikimedia.org/T314913 (10Papaul) Good afternoon Papaul, I have submitted DPS 432867152 for the replacement drive on ST 1R1H043. I have set the dispatch to notify you via email with a tracking number once the... 
[01:36:45] (JobUnavailable) firing: (2) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:22] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-08-23 00:00:01 (3418 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:06:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:18] PROBLEM - Check 
systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:47:02] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:54:13] (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:02:54] (03CR) 10Tim Starling: [C: 03+1] "Approved for self-merge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822114 (owner: 10Krinkle) [03:05:30] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:05:42] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-08-23 00:00:01 (3397 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:07:29] (03CR) 10Tim Starling: [C: 03+1] "Approved for self-merge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822119 (https://phabricator.wikimedia.org/T186673) (owner: 10Krinkle) [03:32:04] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-08-23 00:00:01 (3397 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:32:04] RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-08-23 00:00:01 (3418 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:00:42] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:09:27] (03CR) 
10Andrea Denisse: [C: 03+2] netmon: Configure Logrotate for LibreNMS logs [puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [04:11:26] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:16:08] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:22:02] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/825880 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [04:22:44] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/824754 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [04:23:36] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/824752 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [04:28:00] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:22] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:31:15] (03CR) 10Andrew Bogott: "Soon there won't be actual labstore hosts anymore -- should I assume that 'labstore' means 'nfs server' in this context?" 
[puppet] - 10https://gerrit.wikimedia.org/r/813898 (owner: 10David Caro) [04:36:04] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: spamassassin_updates.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:40:12] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:44:34] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:46:56] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:47:18] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:00:03] (03PS2) 10Gergő Tisza: Drop unused wgGECampaignPattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820586 [05:23:42] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Marostegui) Thank you Papaul, I can access db1186 and db1188 fine! 
[05:26:20] (03PS1) 10Marostegui: db1189: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/826118 (https://phabricator.wikimedia.org/T313569) [05:26:52] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:27:07] (03CR) 10Marostegui: [C: 03+2] db1189: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/826118 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [05:28:31] (03PS1) 10Marostegui: instances.yaml: Add db1189 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/826119 (https://phabricator.wikimedia.org/T313569) [05:29:21] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1189 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/826119 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [05:31:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1189 with minimal weight', diff saved to https://phabricator.wikimedia.org/P32858 and previous config saved to /var/cache/conftool/dbconfig/20220824-053141-root.json [05:33:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Move db2180 from s4 to s6', diff saved to https://phabricator.wikimedia.org/P32859 and previous config saved to /var/cache/conftool/dbconfig/20220824-053311-root.json [05:34:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1189 with minimal weight', diff saved to https://phabricator.wikimedia.org/P32860 and previous config saved to /var/cache/conftool/dbconfig/20220824-053434-root.json [05:35:32] (03PS1) 10Marostegui: db1187: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/826121 (https://phabricator.wikimedia.org/T313569) [05:36:35] (03CR) 10Marostegui: [C: 03+2] db1187: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/826121 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [05:37:59] (03PS1) 10Marostegui: instances.yaml: Add db1187 to dbctl 
[puppet] - 10https://gerrit.wikimedia.org/r/826122 (https://phabricator.wikimedia.org/T313569) [05:38:39] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1187 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/826122 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [05:40:06] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:40:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1187 with minimal weight', diff saved to https://phabricator.wikimedia.org/P32861 and previous config saved to /var/cache/conftool/dbconfig/20220824-054018-root.json [05:43:22] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:44:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 1%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32862 and previous config saved to /var/cache/conftool/dbconfig/20220824-054404-root.json [05:47:02] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
An error occurred trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:47:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119', diff saved to https://phabricator.wikimedia.org/P32863 and previous config saved to /var/cache/conftool/dbconfig/20220824-054719-root.json [05:49:48] (03PS1) 10Marostegui: install_server: Do not reimage db1189 [puppet] - 10https://gerrit.wikimedia.org/r/826123 [05:50:45] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1189 [puppet] - 10https://gerrit.wikimedia.org/r/826123 (owner: 10Marostegui) [05:59:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32865 and previous config saved to /var/cache/conftool/dbconfig/20220824-055909-root.json [05:59:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32866 and previous config saved to /var/cache/conftool/dbconfig/20220824-055918-root.json [06:08:57] (03PS1) 10Marostegui: mariadb: Productionize db1186 [puppet] - 10https://gerrit.wikimedia.org/r/826125 (https://phabricator.wikimedia.org/T313569) [06:14:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32867 and previous config saved to /var/cache/conftool/dbconfig/20220824-061413-root.json [06:14:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32868 and previous config saved to /var/cache/conftool/dbconfig/20220824-061422-root.json [06:15:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1129', diff saved to https://phabricator.wikimedia.org/P32869 and previous config saved to 
/var/cache/conftool/dbconfig/20220824-061532-root.json [06:18:06] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1186 [puppet] - 10https://gerrit.wikimedia.org/r/826125 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [06:19:51] (03PS1) 10Marostegui: mariadb: Productionize db1188 [puppet] - 10https://gerrit.wikimedia.org/r/826128 (https://phabricator.wikimedia.org/T313569) [06:20:46] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1188 [puppet] - 10https://gerrit.wikimedia.org/r/826128 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [06:23:42] (03PS1) 10Marostegui: db1129: Change binlog format to ROW [puppet] - 10https://gerrit.wikimedia.org/r/826141 [06:24:25] (03CR) 10Marostegui: [C: 03+2] db1129: Change binlog format to ROW [puppet] - 10https://gerrit.wikimedia.org/r/826141 (owner: 10Marostegui) [06:24:44] 10SRE, 10Infrastructure-Foundations, 10netops: Detect IP address collisions - https://phabricator.wikimedia.org/T189522 (10ayounsi) p:05High→03Triage [06:27:48] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:28:02] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:28:15] (03PS1) 10Marostegui: site.pp: Remove insetup from db1186 [puppet] - 10https://gerrit.wikimedia.org/r/826197 (https://phabricator.wikimedia.org/T313569) [06:29:12] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup from db1186 [puppet] - 10https://gerrit.wikimedia.org/r/826197 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [06:29:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32871 and previous config saved to /var/cache/conftool/dbconfig/20220824-062918-root.json [06:29:28] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32872 and previous config saved to /var/cache/conftool/dbconfig/20220824-062927-root.json [06:32:36] (03PS1) 10Ladsgroup: Changes list filter: don't add fields that are already in the query [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825894 (https://phabricator.wikimedia.org/T316026) [06:32:51] (03PS1) 10Ladsgroup: Changes list filter: don't add fields that are already in the query [extensions/FlaggedRevs] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825895 (https://phabricator.wikimedia.org/T316026) [06:33:02] jouncebot: nowandnext [06:33:02] No deployments scheduled for the next 0 hour(s) and 26 minute(s) [06:33:02] In 0 hour(s) and 26 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220824T0700) [06:33:16] (03CR) 10Ladsgroup: [C: 03+2] Changes list filter: don't add fields that are already in the query [extensions/FlaggedRevs] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825895 (https://phabricator.wikimedia.org/T316026) (owner: 10Ladsgroup) [06:33:20] (03CR) 10Ladsgroup: [C: 03+2] Changes list filter: don't add fields that are already in the query [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825894 (https://phabricator.wikimedia.org/T316026) (owner: 10Ladsgroup) [06:34:57] (03CR) 10Hashar: Gerrit: Disable auto reloading replication config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/541115 (owner: 10Paladox) [06:35:58] (03PS1) 10Hashar: Revert "Gerrit: Disable auto reloading replication config" [puppet] - 10https://gerrit.wikimedia.org/r/825896 [06:36:08] (03PS2) 10Hashar: Revert "Gerrit: Disable auto reloading replication config" [puppet] - 10https://gerrit.wikimedia.org/r/825896 [06:36:14] (03Merged) 10jenkins-bot: Changes list filter: don't add fields 
that are already in the query [extensions/FlaggedRevs] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/825895 (https://phabricator.wikimedia.org/T316026) (owner: 10Ladsgroup) [06:36:21] (03Merged) 10jenkins-bot: Changes list filter: don't add fields that are already in the query [extensions/FlaggedRevs] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/825894 (https://phabricator.wikimedia.org/T316026) (owner: 10Ladsgroup) [06:37:09] !log dbmaint s3 T312160 [06:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:15] T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160 [06:38:03] any SRE around to `puppet-merge` a gerrit config change for me please ? https://gerrit.wikimedia.org/r/c/operations/puppet/+/825896 [06:38:08] hashar: I can do it [06:38:12] it is to get Gerrit to detect replication config changes ;) [06:38:23] (03CR) 10Marostegui: [C: 03+2] Revert "Gerrit: Disable auto reloading replication config" [puppet] - 10https://gerrit.wikimedia.org/r/825896 (owner: 10Hashar) [06:38:26] it got disabled 3 years ago due to "a bug" [06:38:31] hashar: done [06:38:43] and well given I am debugging some replication issue, I felt like I can address that one as well! [06:38:46] marostegui: thank you! 
[06:41:01] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.26/extensions/FlaggedRevs/frontend/FlaggedRevsUIHooks.php: Backport: [[gerrit:825895|Changes list filter: don't add fields that are already in the query (T316026)]] (duration: 03m 07s) [06:41:07] T316026: DBQueryError: Duplicate column name 'fp_stable' (SpecialRecentChangesLinked via FlaggedRevs) - https://phabricator.wikimedia.org/T316026 [06:42:55] !log dbmaint x1 codfw T312574 [06:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:01] T312574: Adjust the field type of flow_revision.rev_mod_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312574 [06:44:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32873 and previous config saved to /var/cache/conftool/dbconfig/20220824-064423-root.json [06:44:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32874 and previous config saved to /var/cache/conftool/dbconfig/20220824-064432-root.json [06:44:34] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:45:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:46:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:46:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:46:58] !log Restarted Gerrit to enable replication configuration autoloading [06:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:50:12] !log ladsgroup@deploy1002 Synchronized 
php-1.39.0-wmf.25/extensions/FlaggedRevs/frontend/FlaggedRevsUIHooks.php: Backport: [[gerrit:825894|Changes list filter: don't add fields that are already in the query (T316026)]] (duration: 02m 57s) [06:50:13] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) Nice! let me know when we're ready to do the move. [06:50:16] T316026: DBQueryError: Duplicate column name 'fp_stable' (SpecialRecentChangesLinked via FlaggedRevs) - https://phabricator.wikimedia.org/T316026 [06:51:09] (03CR) 10Slyngshede: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/825924 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [06:52:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:54:13] (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:56:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:56:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:57:20] (03PS1) 10Marostegui: site.pp: Remove insetup from db1188 [puppet] - 10https://gerrit.wikimedia.org/r/826200 (https://phabricator.wikimedia.org/T313569) [06:57:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:57:36] (03CR) 10Hashar: [C: 03+1] Drop unused wgGECampaignPattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820586 (owner: 10Gergő Tisza) [06:57:52] (03CR) 10JMeybohm: [C: 04-1] "You should split this change. 
realserver config needs to be up before the state change to lvs_setup" [puppet] - 10https://gerrit.wikimedia.org/r/825726 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [06:58:03] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup from db1188 [puppet] - 10https://gerrit.wikimedia.org/r/826200 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [06:59:22] (03CR) 10Slyngshede: "This turns out to be used online by the cloud VPS, could you give it a look?" [puppet] - 10https://gerrit.wikimedia.org/r/819505 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [06:59:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32875 and previous config saved to /var/cache/conftool/dbconfig/20220824-065927-root.json [06:59:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32876 and previous config saved to /var/cache/conftool/dbconfig/20220824-065937-root.json [07:00:04] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220824T0700). [07:00:04] tgr: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:11] o/ [07:01:05] tgr: wanna do it yourself or should I assist? 
[07:01:24] I can do it, it's just cleanup [07:01:31] +1 [07:02:08] (03CR) 10Gergő Tisza: [C: 03+2] Drop unused wgGECampaignPattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820586 (owner: 10Gergő Tisza) [07:03:03] (03Merged) 10jenkins-bot: Drop unused wgGECampaignPattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820586 (owner: 10Gergő Tisza) [07:04:14] (03CR) 10JMeybohm: "This LGTM apart from the question inline." [alerts] - 10https://gerrit.wikimedia.org/r/825741 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [07:07:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:08:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:08:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:08:44] (03PS1) 10Marostegui: pc1014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/826201 (https://phabricator.wikimedia.org/T315526) [07:09:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:10:13] (03CR) 10Marostegui: [C: 03+2] pc1014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/826201 (https://phabricator.wikimedia.org/T315526) (owner: 10Marostegui) [07:10:15] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826202 (https://phabricator.wikimedia.org/T315526) [07:12:45] !log tgr@deploy1002 Synchronized wmf-config: Config: [[gerrit:820586|Drop unused wgGECampaignPattern]] (duration: 02m 57s) [07:13:57] !log UTC morning deploys done [07:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:14:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 25%: Repooling for the first time', diff saved to 
https://phabricator.wikimedia.org/P32877 and previous config saved to /var/cache/conftool/dbconfig/20220824-071432-root.json [07:14:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32878 and previous config saved to /var/cache/conftool/dbconfig/20220824-071441-root.json [07:15:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:15:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:16:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:20:04] (03CR) 10Ladsgroup: [C: 03+1] ProductionServices.php: Promote pc1014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826202 (https://phabricator.wikimedia.org/T315526) (owner: 10Marostegui) [07:20:56] (03CR) 10Jcrespo: [C: 03+1] "Note pc1014 is 10.6 and is replicating at the moment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826202 (https://phabricator.wikimedia.org/T315526) (owner: 10Marostegui) [07:29:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32879 and previous config saved to /var/cache/conftool/dbconfig/20220824-072937-root.json [07:29:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32880 and previous config saved to /var/cache/conftool/dbconfig/20220824-072946-root.json [07:36:40] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826202 (https://phabricator.wikimedia.org/T315526) (owner: 10Marostegui) [07:37:29] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc2 master [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/826202 (https://phabricator.wikimedia.org/T315526) (owner: 10Marostegui) [07:40:17] !log Promote pc1014 to pc2 master T315526 [07:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:24] T315526: Promote pc1014 to pc2 master - https://phabricator.wikimedia.org/T315526 [07:41:11] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc1014 to pc2 master T315526 (duration: 03m 03s) [07:41:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:42:05] (03PS1) 10Marostegui: parsercache: Promote pc1014 to pc2 master [puppet] - 10https://gerrit.wikimedia.org/r/826203 (https://phabricator.wikimedia.org/T315526) [07:42:25] (03CR) 10Marostegui: [V: 03+2 C: 03+2] parsercache: Promote pc1014 to pc2 master [puppet] - 10https://gerrit.wikimedia.org/r/826203 (https://phabricator.wikimedia.org/T315526) (owner: 10Marostegui) [07:42:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:42:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:43:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:44:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32882 and previous config saved to /var/cache/conftool/dbconfig/20220824-074441-root.json [07:44:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32883 and previous config saved to /var/cache/conftool/dbconfig/20220824-074451-root.json [07:47:09] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc1014 to pc2 master T315526 (duration: 02m 48s) [07:47:14] T315526: Promote pc1014 to pc2 master - 
https://phabricator.wikimedia.org/T315526 [07:56:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 10%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P32884 and previous config saved to /var/cache/conftool/dbconfig/20220824-075620-root.json [07:58:41] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for jdfraine - https://phabricator.wikimedia.org/T316044 (10Tobi_WMDE_SW) 05Stalled→03Open [07:58:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 10%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P32885 and previous config saved to /var/cache/conftool/dbconfig/20220824-075843-root.json [07:59:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1147', diff saved to https://phabricator.wikimedia.org/P32886 and previous config saved to /var/cache/conftool/dbconfig/20220824-075927-root.json [07:59:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32887 and previous config saved to /var/cache/conftool/dbconfig/20220824-075946-root.json [07:59:47] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for jdfraine - https://phabricator.wikimedia.org/T316044 (10Tobi_WMDE_SW) >>! In T316044#8179964, @Aklapper wrote: > If there are some WMDE onboarding docs, then please make these docs point to https://phabricator.wikimedia.org/tag/ldap-acce... [07:59:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P32888 and previous config saved to /var/cache/conftool/dbconfig/20220824-075955-root.json [08:00:05] hashar and dduvall: Dear deployers, time to do the MediaWiki train - Utc-0+Utc-7 Version deploy. Don't look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220824T0800). [08:01:11] o/ [08:02:00] (03PS1) 10Marostegui: mariadb: Productionize db1190 [puppet] - 10https://gerrit.wikimedia.org/r/826206 (https://phabricator.wikimedia.org/T313569) [08:04:53] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1190 [puppet] - 10https://gerrit.wikimedia.org/r/826206 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [08:07:56] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10fgiunchedi) >>! In T314835#8178848, @dcausse wrote: > Moving forward we will: > - stop the presto-swift client in favor of an S3 connector. > -... [08:11:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 25%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P32890 and previous config saved to /var/cache/conftool/dbconfig/20220824-081125-root.json [08:11:28] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: duplicate sal logs to Loki [puppet] - 10https://gerrit.wikimedia.org/r/825880 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [08:11:44] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824754 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [08:12:10] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: dlq use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824752 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [08:12:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:12:21] 
(03PS1) 10TrainBranchBot: group1 wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826207 (https://phabricator.wikimedia.org/T314187) [08:12:23] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826207 (https://phabricator.wikimedia.org/T314187) (owner: 10TrainBranchBot) [08:13:12] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826207 (https://phabricator.wikimedia.org/T314187) (owner: 10TrainBranchBot) [08:13:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 25%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P32891 and previous config saved to /var/cache/conftool/dbconfig/20220824-081347-root.json [08:15:00] (03PS1) 10FNegri: Add cloudcephosd1028 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/826208 [08:15:11] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [08:16:49] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.26 refs T314187 [08:16:54] T314187: 1.39.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T314187 [08:17:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:17:37] 10SRE, 10SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (10Ladsgroup) [08:18:44] 10SRE, 10SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (10Ladsgroup) [08:19:06] 
!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:19:35] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.26 refs T314187 (duration: 02m 46s) [08:19:48] (03PS2) 10FNegri: Add cloudcephosd1028 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/826208 (https://phabricator.wikimedia.org/T314870) [08:20:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:20:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:21:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:21:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:22:30] (03PS1) 10Marostegui: site.pp: Remove insetup from db1190 [puppet] - 10https://gerrit.wikimedia.org/r/826209 (https://phabricator.wikimedia.org/T313569) [08:23:39] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup from db1190 [puppet] - 10https://gerrit.wikimedia.org/r/826209 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [08:26:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:26:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 50%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P32892 and previous config saved to /var/cache/conftool/dbconfig/20220824-082630-root.json [08:26:52] 10SRE, 10SRE-Access-Requests: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (10Ladsgroup) [08:27:35] MediaWiki 
train looks good so far [08:28:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1174', diff saved to https://phabricator.wikimedia.org/P32893 and previous config saved to /var/cache/conftool/dbconfig/20220824-082809-root.json [08:28:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:28:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 50%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P32895 and previous config saved to /var/cache/conftool/dbconfig/20220824-082852-root.json [08:29:15] (03PS1) 10Muehlenhoff: Stop reporting releng images to debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/826211 [08:30:12] (03PS1) 10Ladsgroup: admin: Add Aline Bruenger ssh key [puppet] - 10https://gerrit.wikimedia.org/r/826212 (https://phabricator.wikimedia.org/T315865) [08:30:39] (03PS1) 10Marostegui: mariadb: Productionize db1191 [puppet] - 10https://gerrit.wikimedia.org/r/826213 (https://phabricator.wikimedia.org/T313569) [08:31:59] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1191 [puppet] - 10https://gerrit.wikimedia.org/r/826213 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [08:33:08] (03PS1) 10Filippo Giunchedi: sre: include test kafka cluster in alerts [alerts] - 10https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) [08:33:18] (03PS1) 10Marostegui: site.pp: Fix db1191 section [puppet] - 10https://gerrit.wikimedia.org/r/826215 [08:33:36] (03CR) 10Marostegui: [V: 03+2 C: 03+2] site.pp: Fix db1191 section [puppet] - 10https://gerrit.wikimedia.org/r/826215 (owner: 10Marostegui) [08:34:33] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) - 
https://phabricator.wikimedia.org/T315409 (10Ladsgroup) The dumps can be also accessed from WMCS. [08:34:48] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (10Ladsgroup) [08:37:21] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (10Ladsgroup) [08:37:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (10Ladsgroup) Needs approval from @Ottomata or @odimitrijevic [08:38:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:39:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:41:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 75%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P32896 and previous config saved to /var/cache/conftool/dbconfig/20220824-084134-root.json [08:41:39] (03PS2) 10Filippo Giunchedi: sre: followup on Kafka partition replication alerts [alerts] - 10https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) [08:41:41] (03PS6) 10Filippo Giunchedi: sre: port Zookeeper alerts [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) [08:41:43] (03PS3) 10Filippo Giunchedi: sre: alert on appserver unavailability [alerts] - 10https://gerrit.wikimedia.org/r/825741 
(https://phabricator.wikimedia.org/T305847) [08:42:47] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] "Thank you for the reviews! See followup at Icd8425f4b4be" [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [08:43:45] (03CR) 10Filippo Giunchedi: "This patch addresses a comment thread at https://gerrit.wikimedia.org/r/c/operations/alerts/+/818108" [alerts] - 10https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) (owner: 10Filippo Giunchedi) [08:43:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 75%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P32897 and previous config saved to /var/cache/conftool/dbconfig/20220824-084357-root.json [08:47:40] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for jdfraine - https://phabricator.wikimedia.org/T316044 (10Ladsgroup) (Clinic duty this week) The connected SUL account doesn't have `(WMDE)` in it. It's not a big deal but it makes knowing who is WMDE staff and not harder. I'd ask for mana... 
[08:48:11] (03PS1) 10Slavina Stefanova: Add SSH key for sstefanova to authorized keys [labs/private] - 10https://gerrit.wikimedia.org/r/826219 (https://phabricator.wikimedia.org/T313934) [08:48:25] (03PS1) 10Marostegui: mariadb: Add db1195 as sby host for m1 [puppet] - 10https://gerrit.wikimedia.org/r/826220 (https://phabricator.wikimedia.org/T315864) [08:48:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:48:39] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for jdfraine - https://phabricator.wikimedia.org/T316044 (10Ladsgroup) (Actually it wasn't the SUL account, it was the wikitech account which is fine usually) [08:49:26] 10SRE, 10Wikimedia-GitHub: stop syncing and delete labs/private repo from github - https://phabricator.wikimedia.org/T315925 (10Ladsgroup) p:05Triage→03Low [08:49:49] !log jayme@builder-envoy-03:~$ sudo apt-get remove --purge linux-image-4.19.0-6-amd64-dbg linux-image-4.19.0-14-amd64-dbg [08:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:11] (03CR) 10Ladsgroup: [C: 03+1] "The IPs look correct. Checked in production." 
[puppet] - 10https://gerrit.wikimedia.org/r/826220 (https://phabricator.wikimedia.org/T315864) (owner: 10Marostegui) [08:52:15] (03CR) 10Marostegui: [C: 03+2] mariadb: Add db1195 as sby host for m1 [puppet] - 10https://gerrit.wikimedia.org/r/826220 (https://phabricator.wikimedia.org/T315864) (owner: 10Marostegui) [08:52:46] (03CR) 10Marostegui: [C: 03+2] mariadb: Add db1195 as sby host for m1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826220 (https://phabricator.wikimedia.org/T315864) (owner: 10Marostegui) [08:55:32] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for jdfraine - https://phabricator.wikimedia.org/T316044 (10Ladsgroup) p:05Triage→03Medium [08:55:38] 10SRE, 10LDAP-Access-Requests, 10Release-Engineering-Team (Radar): Grant Access to gerritadmin for junuche, demon, jhuneidi - https://phabricator.wikimedia.org/T315887 (10Ladsgroup) p:05Triage→03Medium [08:56:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 100%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P32898 and previous config saved to /var/cache/conftool/dbconfig/20220824-085639-root.json [08:56:59] (03PS1) 10Marostegui: mariadb: Promote db1195 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/826222 (https://phabricator.wikimedia.org/T315864) [08:58:30] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover date" [puppet] - 10https://gerrit.wikimedia.org/r/826222 (https://phabricator.wikimedia.org/T315864) (owner: 10Marostegui) [08:59:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 100%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P32899 and previous config saved to /var/cache/conftool/dbconfig/20220824-085902-root.json [09:01:30] (03PS1) 10Marostegui: backups: Replace m1 master [puppet] - 10https://gerrit.wikimedia.org/r/826223 (https://phabricator.wikimedia.org/T315864) [09:02:25] (03CR) 10Marostegui: "jcrespo let me 
know if you want me to merge this now or after the switchover. The host is ready anyways (it is set as standby on haproxy f" [puppet] - 10https://gerrit.wikimedia.org/r/826223 (https://phabricator.wikimedia.org/T315864) (owner: 10Marostegui) [09:03:02] (03PS2) 10Jcrespo: dbbackups: Replace m1 master [puppet] - 10https://gerrit.wikimedia.org/r/826223 (https://phabricator.wikimedia.org/T315864) (owner: 10Marostegui) [09:03:42] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Promote db1195 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/826222 (https://phabricator.wikimedia.org/T315864) (owner: 10Marostegui) [09:04:54] (03PS3) 10Btullis: Enable the LVS realserver profile for dse-k8s-ctrl [puppet] - 10https://gerrit.wikimedia.org/r/825726 (https://phabricator.wikimedia.org/T310172) [09:05:47] (03CR) 10Jcrespo: "It has to be after the fact, or after today's backups run (~7-8am). While dbmonitor can work with a replica with no issue, statistics write" [puppet] - 10https://gerrit.wikimedia.org/r/826223 (https://phabricator.wikimedia.org/T315864) (owner: 10Marostegui) [09:06:23] (03CR) 10Marostegui: dbbackups: Replace m1 master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826223 (https://phabricator.wikimedia.org/T315864) (owner: 10Marostegui) [09:06:33] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover to be done" [puppet] - 10https://gerrit.wikimedia.org/r/826223 (https://phabricator.wikimedia.org/T315864) (owner: 10Marostegui) [09:09:35] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36911/console" [puppet] - 10https://gerrit.wikimedia.org/r/825726 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [09:09:55] (03CR) 10Filippo Giunchedi: "Thank you for the review!" 
[alerts] - 10https://gerrit.wikimedia.org/r/825741 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:10:27] (03PS4) 10Filippo Giunchedi: sre: alert on appserver unavailability [alerts] - 10https://gerrit.wikimedia.org/r/825741 (https://phabricator.wikimedia.org/T305847) [09:17:55] (03PS2) 10Ladsgroup: admin: Add Aline Bruenger ssh key [puppet] - 10https://gerrit.wikimedia.org/r/826212 (https://phabricator.wikimedia.org/T315865) [09:17:59] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Add Aline Bruenger ssh key [puppet] - 10https://gerrit.wikimedia.org/r/826212 (https://phabricator.wikimedia.org/T315865) (owner: 10Ladsgroup) [09:19:30] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for jdfraine - https://phabricator.wikimedia.org/T316044 (10WMDE-leszek) yeah I believe WMDE has been advised some time ago to not use WMDE in the Wikitech accounts. It is the account of WMDE staff member. [09:19:32] (03CR) 10Muehlenhoff: c:raid::md move from crontab to systemd timer (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:22:36] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:07] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Production shell access and a Kerberos principal for Hadoop - https://phabricator.wikimedia.org/T315865 (10Ladsgroup) 05Open→03Resolved So this is done on SRE side but you need to request separately for Kerberos (with a separate ticket) as we d... [09:25:11] 10SRE, 10Discovery-Search, 10User-MoritzMuehlenhoff: Also use java::security on elasticsearch/relforge - https://phabricator.wikimedia.org/T251540 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This task can be closed, Elastic uses profile::java for over two years now. 
[09:26:12] (03CR) 10Ladsgroup: webperf: add prometheus::blackbox::check::http for performance.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/823737 (https://phabricator.wikimedia.org/T277927) (owner: 10Dzahn) [09:26:53] 10SRE, 10User-MoritzMuehlenhoff: Investigate StorCLI - https://phabricator.wikimedia.org/T254019 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff We still use megacli for a substantial number of servers, but latest controller revisions procurable by Dell now moved to perccli as the successfo... [09:28:26] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for jdfraine - https://phabricator.wikimedia.org/T316044 (10Ladsgroup) Yeah makes sense. I mistook wikitech and SUL. Awesome. I wait for the NDA confirmation and then I add the user to ldap groups. [09:28:54] (03PS5) 10Slyngshede: c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) [09:31:18] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:09] (03PS6) 10Slyngshede: c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) [09:32:47] (03CR) 10CI reject: [V: 04-1] c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:33:41] (03PS7) 10Slyngshede: c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) [09:35:15] (03PS3) 10Muehlenhoff: Add Cumin aliases for ml-cache [puppet] - 10https://gerrit.wikimedia.org/r/820129 [09:38:16] (03PS8) 10Slyngshede: c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 
(https://phabricator.wikimedia.org/T273673) [09:38:39] (03Abandoned) 10JMeybohm: Update to v3.20.6 [debs/calico] (v3.20) - 10https://gerrit.wikimedia.org/r/823159 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:42:17] (03PS9) 10Slyngshede: c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) [09:44:21] (03PS10) 10Slyngshede: c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) [09:44:45] 10SRE-Access-Requests, 10Data-Engineering, 10Discovery-Search: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10Gehel) [09:45:17] (03CR) 10Vgutierrez: [C: 03+2] Restart incremental roll-out of query-sorting at 1% [puppet] - 10https://gerrit.wikimedia.org/r/825917 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [09:46:22] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36917/console" [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:46:27] !log Restart incremental roll-out of query-sorting at 1% - T314868 [09:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:34] T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 [09:47:32] (03PS11) 10Slyngshede: c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) [09:49:00] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Discovery-Search: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10Gehel) [09:49:06] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36918/console" [puppet] - 
10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:50:04] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Discovery-Search: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10Gehel) As Peter's manager, and owner of the Search and W[CD]QS services, I'm approving this request. [09:51:27] (03PS12) 10Slyngshede: c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) [09:52:01] (03PS1) 10Vgutierrez: trafficserver: Set transaction_active_timeout_out for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/826228 (https://phabricator.wikimedia.org/T315533) [09:52:55] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36919/console" [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:55:07] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36920/console" [puppet] - 10https://gerrit.wikimedia.org/r/826228 (https://phabricator.wikimedia.org/T315533) (owner: 10Vgutierrez) [09:55:35] (03CR) 10Slyngshede: [V: 03+1] "Fixed comments and ensured that the timers are equivalent to the original cronjob." 
[puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:56:18] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Set transaction_active_timeout_out for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/826228 (https://phabricator.wikimedia.org/T315533) (owner: 10Vgutierrez) [09:57:46] (03PS3) 10David Caro: Add cloudcephosd1028 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/826208 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [09:57:55] (03CR) 10David Caro: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/826208 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [09:59:15] (03CR) 10FNegri: [C: 03+2] Add cloudcephosd1028 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/826208 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [09:59:52] (03PS2) 10Hnowlan: api-gateway: disable shipping logs to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/824703 [10:06:02] (03CR) 10Jforrester: scap/dsh: remove parsoid service, replaced by parsoid-php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207) (owner: 10Dzahn) [10:08:09] (03PS1) 10Clément Goubert: kubestage: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826229 (https://phabricator.wikimedia.org/T315977) [10:10:20] 10SRE, 10MediaWiki-Docker, 10ARM support: Create and publish arm64 images of wikimedia-stretch and wikimedia-buster - https://phabricator.wikimedia.org/T274140 (10Jdforrester-WMF) [10:11:58] (03CR) 10David Caro: [C: 03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/819505 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [10:13:02] (03CR) 10Hnowlan: [C: 03+2] api-gateway: disable shipping logs to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/824703 (owner: 10Hnowlan) [10:14:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 5%: Repooling after cloning db1190', diff saved to https://phabricator.wikimedia.org/P32900 and previous config saved to /var/cache/conftool/dbconfig/20220824-101414-root.json [10:17:53] (03PS1) 10JMeybohm: Update to v3.23.3 [debs/calico] (v3.23) - 10https://gerrit.wikimedia.org/r/826230 (https://phabricator.wikimedia.org/T307943) [10:18:08] (03PS1) 10Marostegui: site.pp: Remove insetup from db1191 [puppet] - 10https://gerrit.wikimedia.org/r/826231 (https://phabricator.wikimedia.org/T313569) [10:18:15] (03Merged) 10jenkins-bot: api-gateway: disable shipping logs to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/824703 (owner: 10Hnowlan) [10:18:23] (03PS2) 10JMeybohm: Update calico to v3.23.3 [debs/calico] (v3.23) - 10https://gerrit.wikimedia.org/r/826230 (https://phabricator.wikimedia.org/T307943) [10:18:55] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup from db1191 [puppet] - 10https://gerrit.wikimedia.org/r/826231 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [10:19:12] (03CR) 10David Caro: [C: 03+2] grid:exec: cleanup /tmp of stale files [puppet] - 10https://gerrit.wikimedia.org/r/813826 (https://phabricator.wikimedia.org/T313006) (owner: 10David Caro) [10:29:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 10%: Repooling after cloning db1190', diff saved to https://phabricator.wikimedia.org/P32901 and previous config saved to /var/cache/conftool/dbconfig/20220824-102919-root.json [10:30:17] (03PS1) 10Clément Goubert: ml-staging: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826233 
(https://phabricator.wikimedia.org/T315977) [10:31:19] (03PS1) 10Phuedx: testwiki: Add mediawiki.edit_attempt stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826234 (https://phabricator.wikimedia.org/T309013) [10:32:31] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [10:32:47] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [10:35:48] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [10:36:12] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [10:36:37] (03PS1) 10Clément Goubert: kubernetes: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826236 (https://phabricator.wikimedia.org/T315977) [10:37:36] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36926/console" [puppet] - 10https://gerrit.wikimedia.org/r/826236 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [10:38:06] (03PS3) 10Ladsgroup: data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) [10:38:34] (03PS1) 10Hashar: gerrit: allow nist kex algorithms on OpenSsh server [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) [10:39:06] (03PS2) 10Hashar: gerrit: allow nist kex algorithms on OpenSsh server [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) [10:39:24] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [10:40:32] (03CR) 10CI reject: [V: 04-1] data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup) [10:40:34] (03PS1) 10Clément 
Goubert: ml-serve: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826238 (https://phabricator.wikimedia.org/T315977) [10:41:31] (03CR) 10CI reject: [V: 04-1] gerrit: allow nist kex algorithms on OpenSsh server [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [10:42:28] (03CR) 10Ladsgroup: "I added tests but they are failing because I think I messed up adding series and also for unquoted values in URL https://integration.wikim" [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup) [10:43:00] (03CR) 10David Caro: Modify maintain-dbusers.py to call the rest-api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [10:43:42] (03CR) 10Ladsgroup: "I followed https://www.prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/ maybe I misunderstood it?" [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup) [10:44:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 25%: Repooling after cloning db1190', diff saved to https://phabricator.wikimedia.org/P32902 and previous config saved to /var/cache/conftool/dbconfig/20220824-104424-root.json [10:46:12] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [10:46:18] (03PS1) 10Vgutierrez: trafficserver: Disable origin coalescing in cp600[78] [puppet] - 10https://gerrit.wikimedia.org/r/826239 (https://phabricator.wikimedia.org/T315911) [10:46:39] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [10:47:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, just left a few remaining nits. 
If you want to play it safe for the rollout you can disable Puppet on all affected servers wit" [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [10:48:01] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36928/console" [puppet] - 10https://gerrit.wikimedia.org/r/826239 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [10:51:25] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Disable origin coalescing in cp600[78] [puppet] - 10https://gerrit.wikimedia.org/r/826239 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [10:52:19] !log disable origin coalescing in ats@cp600[78] - T315911 [10:52:20] (03PS3) 10Hashar: gerrit: allow nist kex algorithms on OpenSsh server [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) [10:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:23] T315911: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911 [10:52:37] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [10:54:13] (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:54:56] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 50%: Repooling after cloning db1190', diff saved to https://phabricator.wikimedia.org/P32903 and previous config saved to 
/var/cache/conftool/dbconfig/20220824-105928-root.json [11:00:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good! When we backfill the deprecated entries after merging, this is also a good opportunity to fully remove groups no longer in use" [puppet] - 10https://gerrit.wikimedia.org/r/825370 (https://phabricator.wikimedia.org/T248161) (owner: 10Jbond) [11:00:51] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:33] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:54] (03PS1) 10Btullis: Enable the dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) [11:02:04] (03PS13) 10Slyngshede: c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) [11:03:44] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36929/console" [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:07:31] (03PS1) 10Btullis: Add dummy tokens for dse_k8s_workers [labs/private] - 10https://gerrit.wikimedia.org/r/826241 (https://phabricator.wikimedia.org/T310177) [11:07:43] !log klausman@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching ml-cache*: Rolling restart to activate new JRE - klausman@cumin1001 [11:07:57] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy tokens for dse_k8s_workers [labs/private] - 10https://gerrit.wikimedia.org/r/826241 (https://phabricator.wikimedia.org/T310177) (owner: 10Btullis) [11:08:56] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36931/console" [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) (owner: 10Btullis) [11:09:22] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:13:16] (03PS3) 10JMeybohm: Update calico to v3.23.3 [debs/calico] (v3.23) - 10https://gerrit.wikimedia.org/r/826230 (https://phabricator.wikimedia.org/T307943) [11:13:30] (03PS2) 10Btullis: Enable the dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) [11:14:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 75%: Repooling after cloning db1190', diff saved to https://phabricator.wikimedia.org/P32904 and previous config saved to /var/cache/conftool/dbconfig/20220824-111433-root.json [11:14:57] (03PS3) 10Btullis: Enable the dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) [11:17:26] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:17:48] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] c:raid::md move from crontab to systemd timer (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:18:08] (03PS4) 10Btullis: Enable the dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) [11:18:55] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36934/console" [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) (owner: 10Btullis) [11:19:12] 10SRE, 
10Wikidata-Query-Service: Migrate WDQS to Java 11 - https://phabricator.wikimedia.org/T316103 (10MoritzMuehlenhoff) [11:22:58] (03PS4) 10Ladsgroup: es_exporter: Add metrics collection for mediawiki's db errors [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) [11:23:19] (03CR) 10Ladsgroup: es_exporter: Add metrics collection for mediawiki's db errors (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) (owner: 10Ladsgroup) [11:24:20] Was there a deployment just now? [11:26:37] i'm suddenly seeing lots of unparsed in the Wikimedia Incubator where there previously weren't any [11:27:20] example: https://incubator.wikimedia.org/wiki/Wp/sdc/1732 (just below the title) [11:28:24] (03CR) 10Btullis: [V: 03+1] Enable the LVS realserver profile for dse-k8s-ctrl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825726 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [11:29:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 100%: Repooling after cloning db1190', diff saved to https://phabricator.wikimedia.org/P32905 and previous config saved to /var/cache/conftool/dbconfig/20220824-112938-root.json [11:31:38] (03CR) 10David Caro: labstore: Send prom stats for getent_check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813898 (owner: 10David Caro) [11:33:00] (03PS1) 10Clément Goubert: C:profile::docker::storage removal and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826245 (https://phabricator.wikimedia.org/T315977) [11:37:20] Jhs: since when? [11:37:26] https://sal.toolforge.org/production?p=0&q=Synchronized&d= is deploys today [11:37:40] None for 3 hours [11:37:49] RhinosF1, i noticed it a few minutes ago. 
reloaded a page, and suddenly they were there [11:38:28] also, the WikimediaIncubator extension is on version 6.0.0, which I expected to happen only after the 18:00 deployment today [11:38:42] Jhs: the train was this morning [11:38:44] (I checked that earlier today, and it was still on v 5.5.0) [11:38:47] oh okay [11:38:48] We're on the EU schedule [11:39:01] About 3 hours ago it was promoted [11:39:16] aah, i was looking at the wrong week on [[wikitech:Deployments]] [11:39:33] Jhs: could this be an issue with the train deployment or? [11:40:02] not sure. i need to figure out where those breadcrumbs below the title actually come from. they may be from some local JS [11:41:08] (they are not present on my local MediaWiki install with the WikimediaIncubator extension) [11:42:02] !log klausman@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching ml-cache*: Rolling restart to activate new JRE - klausman@cumin1001 [11:42:06] hashar: for awareness ^ [11:42:18] (03PS1) 10JMeybohm: admin_ng: Allow to pin calico chart versions per environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/826268 (https://phabricator.wikimedia.org/T307943) [11:42:20] (03PS1) 10JMeybohm: calico-crd: Split crds.yaml into multiple files [deployment-charts] - 10https://gerrit.wikimedia.org/r/826269 [11:42:22] (03PS1) 10JMeybohm: Update calico-crds to v3.23.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826270 (https://phabricator.wikimedia.org/T307943) [11:42:42] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/826229 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [11:43:36] (03CR) 10CI reject: [V: 04-1] Update calico-crds to v3.23.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826270 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [11:43:38] (03CR) 10Clément Goubert: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/826236 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [11:43:41] (03PS1) 10Ladsgroup: Add rename_echo_push_indexes_T312975.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/826271 (https://phabricator.wikimedia.org/T312975) [11:43:55] Jhs: no idea off the top of my head [11:45:06] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/826233 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [11:45:57] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/826238 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [11:46:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [11:46:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [11:46:56] (03CR) 10Hashar: [C: 04-1] gerrit: allow nist kex algorithms on OpenSsh server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [11:47:15] (03CR) 10Clément Goubert: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/826245 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [11:47:18] (03PS2) 10Hnowlan: helmfile.d: add thumbor configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) [11:47:39] RhinosF1, i will file a bug for it [11:47:40] (03CR) 10Filippo Giunchedi: data-persistence: Add alert for replication lag (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup) [11:47:54] (03PS2) 10JMeybohm: Update calico-crds to v3.23.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826270 (https://phabricator.wikimedia.org/T307943) [11:47:56] (03CR) 10Btullis: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/826233 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [11:48:33] (03CR) 10Btullis: [C: 03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/826238 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [11:48:49] PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The following units failed: session-c42.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:52] (03PS4) 10Ladsgroup: data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) [11:48:57] (03CR) 10Ladsgroup: data-persistence: Add alert for replication lag (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup) [11:49:09] (03CR) 10Hashar: [C: 04-1] gerrit: allow nist kex algorithms on OpenSsh server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [11:49:23] (03PS4) 10Hashar: gerrit: allow nist kex algorithms on OpenSsh server [puppet] - 
10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) [11:49:30] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36938/console" [puppet] - 10https://gerrit.wikimedia.org/r/826229 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [11:49:41] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [11:50:04] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36939/console" [puppet] - 10https://gerrit.wikimedia.org/r/826236 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [11:50:31] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36940/console" [puppet] - 10https://gerrit.wikimedia.org/r/826233 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [11:51:14] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36941/console" [puppet] - 10https://gerrit.wikimedia.org/r/826238 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [11:51:32] (03CR) 10CI reject: [V: 04-1] data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup) [11:52:03] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36942/console" [puppet] - 10https://gerrit.wikimedia.org/r/826245 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [11:52:24] Amir1: I have to go shortly, though the alert is firing now! 
[11:52:41] yeah, progress :D [11:53:02] > MySQL instance 5m 40s has too large replication lag (db1099:13318) [11:53:06] Spot the idiot [11:53:11] lulz [11:53:39] Amir1: if you have 'promtool' installed locally you can also run the tests with 'tox' FWIW [11:53:49] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10MoritzMuehlenhoff) [11:53:50] i.e. the exact same thing that CI runs [11:53:52] godog: I did, the problem is that everything fails [11:54:03] maybe I mistook warnings for failures, let me try again [11:54:43] understandable, coincidentally I filed T316086 this morning about that [11:54:44] T316086: Move dashboard/runbook missing annotations from warning to errors - https://phabricator.wikimedia.org/T316086 [11:56:27] (03CR) 10Nikerabbit: "Probably not needed given wmf.25 branch reaches all wikis today." [extensions/Translate] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824442 (https://phabricator.wikimedia.org/T315657) (owner: 10Jforrester) [11:56:48] (03PS5) 10Ladsgroup: data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) [11:56:57] I think it's the runbook [11:57:39] FWIW, it fails locally with this [11:57:42] https://www.irccloud.com/pastebin/1rM5xxBi/ [11:58:06] interesting, I haven't seen that before [11:58:20] but yeah you have an extra space in var-port {{ ...} [11:59:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [11:59:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [11:59:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1120.eqiad.wmnet with reason: Maintenance [11:59:18] ok gotta go [11:59:30] !log ladsgroup@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1120.eqiad.wmnet with reason: Maintenance [11:59:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1120 (T312975)', diff saved to https://phabricator.wikimedia.org/P32906 and previous config saved to /var/cache/conftool/dbconfig/20220824-115935-ladsgroup.json [11:59:40] T312975: Drop foreign keys and rename index for table echo_push_subscription on wmf wikis - https://phabricator.wikimedia.org/T312975 [11:59:53] (03CR) 10CI reject: [V: 04-1] data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup) [12:01:03] (03CR) 10Marostegui: [C: 03+1] Add rename_echo_push_indexes_T312975.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/826271 (https://phabricator.wikimedia.org/T312975) (owner: 10Ladsgroup) [12:01:52] !log killed refresh links-recomm scripts in rowiki, cswiki, simplewiki, frwiki (T299021) [12:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:56] T299021: Shorten running time of refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021 [12:02:28] (03CR) 10Ladsgroup: [C: 03+2] Add rename_echo_push_indexes_T312975.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/826271 (https://phabricator.wikimedia.org/T312975) (owner: 10Ladsgroup) [12:02:47] (03Merged) 10jenkins-bot: Add rename_echo_push_indexes_T312975.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/826271 (https://phabricator.wikimedia.org/T312975) (owner: 10Ladsgroup) [12:02:56] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] define osm::planet_sync move from cron to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:03:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1120 (T312975)', diff saved to https://phabricator.wikimedia.org/P32907 and previous config saved to /var/cache/conftool/dbconfig/20220824-120346-ladsgroup.json [12:07:38] (03PS1) 10Muehlenhoff: Remove obsolete absented cron file [puppet] - 10https://gerrit.wikimedia.org/r/826274 [12:09:57] (03PS1) 10Slyngshede: define osm::planet_sync Remove invalid cron times [puppet] - 10https://gerrit.wikimedia.org/r/826275 (https://phabricator.wikimedia.org/T273673) [12:10:45] (03CR) 10CI reject: [V: 04-1] define osm::planet_sync Remove invalid cron times [puppet] - 10https://gerrit.wikimedia.org/r/826275 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:11:49] (03PS2) 10Slyngshede: define osm::planet_sync Remove invalid cron times [puppet] - 10https://gerrit.wikimedia.org/r/826275 (https://phabricator.wikimedia.org/T273673) [12:13:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: Repooling after cloning db1191', diff saved to https://phabricator.wikimedia.org/P32908 and previous config saved to /var/cache/conftool/dbconfig/20220824-121343-root.json [12:14:35] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36943/console" [puppet] - 10https://gerrit.wikimedia.org/r/826275 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:16:13] RhinosF1, https://phabricator.wikimedia.org/T316108 [12:16:35] Maybe it should have a train-related tag as well, since this bug became apparent with today's deployment? [12:17:03] Jhs: how used is GeoCrumbs? 
[12:17:12] Wikivoyage + Wikimedia Incubator [12:17:15] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] define osm::planet_sync Remove invalid cron times [puppet] - 10https://gerrit.wikimedia.org/r/826275 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:17:24] (03PS6) 10Ladsgroup: data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) [12:17:27] On Wikivoyage the bug only manifests itself on subpages in the content namespace [12:17:46] On Wikimedia Incubator, every content page is a subpage, so it will be visible there on all pages [12:18:08] but I think we should just disable the entire extension on incubatorwiki, because it never(?) did its job correctly there anyways [12:18:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1120', diff saved to https://phabricator.wikimedia.org/P32909 and previous config saved to /var/cache/conftool/dbconfig/20220824-121852-ladsgroup.json [12:19:37] Jhs: I'm going to ping releng [12:19:41] To decide what to do [12:19:45] ok, thanks [12:19:48] (03CR) 10CI reject: [V: 04-1] data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup) [12:20:04] (03PS1) 10Marostegui: install_server: Do not reimage db1190 [puppet] - 10https://gerrit.wikimedia.org/r/826278 (https://phabricator.wikimedia.org/T313569) [12:20:40] (03PS7) 10Ladsgroup: data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) [12:21:11] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1190 [puppet] - 10https://gerrit.wikimedia.org/r/826278 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [12:23:56] (03CR) 10Ladsgroup: "wohoo, finally." 
[alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup) [12:24:16] !log installing containerd security updates [12:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:48] (03CR) 10Slyngshede: [C: 03+2] c:dynamicproxy move cronjob to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/819505 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:28:03] (03PS1) 10Jon Harald Søby: Remove GeoCrumbs from the Wikimedia Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826279 (https://phabricator.wikimedia.org/T316109) [12:28:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: Repooling after cloning db1191', diff saved to https://phabricator.wikimedia.org/P32910 and previous config saved to /var/cache/conftool/dbconfig/20220824-122848-root.json [12:30:14] (03CR) 10Slyngshede: [C: 04-1] "We actually need to keep the absent part. Otherwise the Debian package will just recreate the file on the next SpamAssassin update." [puppet] - 10https://gerrit.wikimedia.org/r/825924 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [12:31:02] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/826281 (https://phabricator.wikimedia.org/T316110) [12:31:06] (03PS1) 10Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/826282 (https://phabricator.wikimedia.org/T316110) [12:31:21] RhinosF1, I filed a task & patch to remove the GeoCrumbs extension from the Incubator (permanently). Would it be too soon to do that during the next backport window in ~30 minutes? 
[12:31:37] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1181 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/826283 (https://phabricator.wikimedia.org/T316111) [12:31:46] (03PS1) 10Gerrit maintenance bot: wmnet: Update s7-master alias [dns] - 10https://gerrit.wikimedia.org/r/826284 (https://phabricator.wikimedia.org/T316111) [12:33:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1120', diff saved to https://phabricator.wikimedia.org/P32911 and previous config saved to /var/cache/conftool/dbconfig/20220824-123358-ladsgroup.json [12:36:14] (03PS1) 10Marostegui: install_server: Do not reimage db1191 [puppet] - 10https://gerrit.wikimedia.org/r/826286 (https://phabricator.wikimedia.org/T313569) [12:37:05] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1191 [puppet] - 10https://gerrit.wikimedia.org/r/826286 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [12:37:20] (03CR) 10Muehlenhoff: c:dynamicproxy move cronjob to systemd timer. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819505 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:41:16] Jhs: id rather it have some community notice [12:41:21] But in theory no [12:41:29] That call would likely rest with a deployer [12:41:32] Cc urbanecm [12:41:42] what's up? [12:42:10] urbanecm, summary: https://phabricator.wikimedia.org/T316109 [12:42:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:42:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:42:53] Jhs: I'd prefer having a community notice here, to avoid later complaints. 
let's announce and do it in a week [12:43:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance [12:43:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance [12:43:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T314041)', diff saved to https://phabricator.wikimedia.org/P32912 and previous config saved to /var/cache/conftool/dbconfig/20220824-124346-ladsgroup.json [12:43:46] urbanecm, So we will have to live with escaped HTML under every page title (see the bug linked in that task) for a week instead of removing it? [12:43:50] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [12:43:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: Repooling after cloning db1191', diff saved to https://phabricator.wikimedia.org/P32913 and previous config saved to /var/cache/conftool/dbconfig/20220824-124354-root.json [12:44:19] urbanecm: there's a task for it being broken [12:44:26] Jhs: sorry, i didn't see there's a bug. [12:44:27] But no answer from releng as it's with this train [12:44:37] https://phabricator.wikimedia.org/T316108 [12:45:37] i don't mind notifying the community, of course, but i'd rather remove it in the mean-time until the bug is fixed [12:45:41] Jhs: thanks for clarifying that. in that case (and you saying it's not actually needed), let's remove it i guess, but we should announce it regardless :) [12:45:47] (03PS1) 10Slyngshede: c:dynamicproxy clean up after cronjob removal. 
[puppet] - 10https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673) [12:45:52] urbanecm, 👍 [12:46:37] (03PS1) 10Marostegui: install_server: Allow reimage of db1196-db1209 [puppet] - 10https://gerrit.wikimedia.org/r/826288 (https://phabricator.wikimedia.org/T306848) [12:48:36] oh, now that bug is present on all pages on Wikivoyage as well >< . https://en.wikivoyage.org/wiki/Special:Random [12:48:41] it must have been a caching issue when i didn't see it before [12:49:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1120 (T312975)', diff saved to https://phabricator.wikimedia.org/P32914 and previous config saved to /var/cache/conftool/dbconfig/20220824-124905-ladsgroup.json [12:49:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [12:49:08] Jhs: hmm, i don't see it with timeless [12:49:10] T312975: Drop foreign keys and rename index for table echo_push_subscription on wmf wikis - https://phabricator.wikimedia.org/T312975 [12:49:19] or with vector [12:49:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [12:49:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [12:49:34] but if it's not incubator-specific, perhaps we should fix the issue instead? 
[12:49:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [12:49:41] (03PS2) 10Marostegui: install_server: Allow reimage of db1196-db1203 [puppet] - 10https://gerrit.wikimedia.org/r/826288 (https://phabricator.wikimedia.org/T306848) [12:49:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1137.eqiad.wmnet with reason: Maintenance [12:49:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1137.eqiad.wmnet with reason: Maintenance [12:49:59] urbanecm: I see it on Minerva [12:50:02] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10Marostegui) @RobH I just realised the task has more hosts than the ones we actually ordered, at T303435 we say 8 hosts, but... [12:50:03] urbanecm, here's what I see: https://phabricator.wikimedia.org/F35487621 [12:50:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1137 (T312975)', diff saved to https://phabricator.wikimedia.org/P32915 and previous config saved to /var/cache/conftool/dbconfig/20220824-125003-ladsgroup.json [12:50:08] I think train rollback would be better [12:50:16] or that, sure [12:50:29] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage of db1196-db1203 [puppet] - 10https://gerrit.wikimedia.org/r/826288 (https://phabricator.wikimedia.org/T306848) (owner: 10Marostegui) [12:50:50] urbanecm: unless we can find who maintains it and fix it [12:51:02] well if we rollback we have to find someone to fix it anyway [12:51:07] Can you try grab a relenger on slack or something [12:52:29] RhinosF1: can you note it at T314187 first? 
unless i'm blind, i don't see this bug mentioned there [12:52:30] T314187: 1.39.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T314187 [12:53:22] (03PS1) 10Marostegui: db1196-db1203: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/826290 (https://phabricator.wikimedia.org/T306848) [12:54:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T312975)', diff saved to https://phabricator.wikimedia.org/P32916 and previous config saved to /var/cache/conftool/dbconfig/20220824-125414-ladsgroup.json [12:54:19] T312975: Drop foreign keys and rename index for table echo_push_subscription on wmf wikis - https://phabricator.wikimedia.org/T312975 [12:54:47] (03CR) 10Marostegui: [C: 03+2] db1196-db1203: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/826290 (https://phabricator.wikimedia.org/T306848) (owner: 10Marostegui) [12:55:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T314041)', diff saved to https://phabricator.wikimedia.org/P32917 and previous config saved to /var/cache/conftool/dbconfig/20220824-125537-ladsgroup.json [12:55:42] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [12:58:04] urbanecm: yes doing [12:58:37] urbanecm: done [12:58:48] thanks. let's see what happens :) [12:58:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: Repooling after cloning db1191', diff saved to https://phabricator.wikimedia.org/P32918 and previous config saved to /var/cache/conftool/dbconfig/20220824-125858-root.json [12:59:29] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: OwO what's this, a deployment window?? UTC afternoon backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220824T1300). nyaa~ [13:00:05] Jhs: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:27] Jhs: can you drop your patch from the window if we're not deploying [13:00:33] sure [13:00:43] (03PS2) 10Marostegui: mariadb: Promote db1195 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/826222 (https://phabricator.wikimedia.org/T315864) [13:00:51] Jhs: hi, since the change affects other wikis, not only incubator, i prefer not doing the deployment now, and treating this like other train-caused feature regressions. thanks! [13:00:59] no worries [13:01:25] (03CR) 10Dzahn: "Hmm..if that is true it sounds like one of those cases where they will be a puppet change on every single run." [puppet] - 10https://gerrit.wikimedia.org/r/825924 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [13:01:40] in the meantime, i will post on the community portal of the Incubator to have it removed – it doesn't serve any purpose there even when it *is* working [13:02:34] Jhs: great [13:06:16] (03CR) 10Slyngshede: [C: 04-1] spamassassin: remove absented cron file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825924 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [13:08:10] (03CR) 10Hashar: "Puppet diff https://puppet-compiler.wmflabs.org/pcc-worker1003/1406/gerrit1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [13:09:20] (03CR) 10Hashar: [C: 03+1] "The ssh client config is in. 
It does not show up in the catalog cause we copy the whole of modules/gerrit/files/homedir" [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [13:09:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P32919 and previous config saved to /var/cache/conftool/dbconfig/20220824-130920-ladsgroup.json [13:10:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P32920 and previous config saved to /var/cache/conftool/dbconfig/20220824-131043-ladsgroup.json [13:13:00] (03CR) 10Muehlenhoff: spamassassin: remove absented cron file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825924 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [13:14:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: Repooling after cloning db1191', diff saved to https://phabricator.wikimedia.org/P32921 and previous config saved to /var/cache/conftool/dbconfig/20220824-131403-root.json [13:15:11] (03CR) 10Btullis: [V: 03+1 C: 03+2] Enable the LVS realserver profile for dse-k8s-ctrl [puppet] - 10https://gerrit.wikimedia.org/r/825726 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [13:23:46] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (10Ottomata) Approved! 
[13:24:20] !log taavi@mwmaint1002 ~ $ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki mediawikiwiki "Africa Wikimedia Developers Project" "African Wikimedia Technical Community" "Taavi" --reason "per request [[:phab:T316066]]" # T316066 [13:24:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P32922 and previous config saved to /var/cache/conftool/dbconfig/20220824-132426-ladsgroup.json [13:24:32] umh, where's stashbot? [13:24:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (10Ladsgroup) [13:25:48] 10SRE, 10Discovery-Search, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech: Migrate WDQS to Java 11 - https://phabricator.wikimedia.org/T316103 (10Ladsgroup) [13:25:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P32923 and previous config saved to /var/cache/conftool/dbconfig/20220824-132549-ladsgroup.json [13:28:39] (03PS1) 10Marostegui: wmnet: Update pc2-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/826294 (https://phabricator.wikimedia.org/T315526) [13:28:56] (03CR) 10Ottomata: sre: followup on Kafka partition replication alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) (owner: 10Filippo Giunchedi) [13:29:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: Repooling after cloning db1191', diff saved to https://phabricator.wikimedia.org/P32924 and previous config saved to /var/cache/conftool/dbconfig/20220824-132908-root.json [13:29:33] (03CR) 10Marostegui: [C: 03+2] wmnet: Update pc2-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/826294 (https://phabricator.wikimedia.org/T315526) (owner: 10Marostegui) [13:31:55] !log taavi@mwmaint1002 ~ $ mwscript 
extensions/Translate/scripts/moveTranslatableBundle.php --wiki mediawikiwiki "Africa Wikimedia Developers Project" "African Wikimedia Technical Community" "Taavi" --reason "per request [[:phab:T316066]]" # T316066 [13:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:01] T316066: Move Africa Wikimedia Developers Project to Africa Wikimedia Technical Community on MediaWiki.org - https://phabricator.wikimedia.org/T316066 [13:33:27] (03PS1) 10Slyngshede: c:dynamicproxy fix wrong interval format to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/826295 [13:34:04] (03CR) 10CI reject: [V: 04-1] c:dynamicproxy fix wrong interval format to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/826295 (owner: 10Slyngshede) [13:35:00] (03PS2) 10Slyngshede: c:dynamicproxy fix wrong interval format to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/826295 [13:35:24] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/826295 (owner: 10Slyngshede) [13:35:25] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:36:02] (03CR) 10jenkins-bot: c:dynamicproxy fix wrong interval format to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/826295 (owner: 10Slyngshede) [13:36:48] (03PS3) 10Slyngshede: c:dynamicproxy fix wrong interval format to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/826295 [13:37:50] (03CR) 10Slyngshede: [C: 03+2] c:dynamicproxy fix wrong interval format to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/826295 (owner: 10Slyngshede) [13:38:02] (03CR) 10David Caro: [C: 03+1] "Now it's good 😊" [puppet] - 10https://gerrit.wikimedia.org/r/826295 (owner: 10Slyngshede) [13:38:35] (03PS1) 10Btullis: Configure the load-balancers for dse-k8s-ctrl [puppet] - 10https://gerrit.wikimedia.org/r/826296 
(https://phabricator.wikimedia.org/T310172) [13:39:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T312975)', diff saved to https://phabricator.wikimedia.org/P32925 and previous config saved to /var/cache/conftool/dbconfig/20220824-133932-ladsgroup.json [13:39:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2131.codfw.wmnet with reason: Maintenance [13:39:38] T312975: Drop foreign keys and rename index for table echo_push_subscription on wmf wikis - https://phabricator.wikimedia.org/T312975 [13:39:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2131.codfw.wmnet with reason: Maintenance [13:39:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2131 (T312975)', diff saved to https://phabricator.wikimedia.org/P32926 and previous config saved to /var/cache/conftool/dbconfig/20220824-133953-ladsgroup.json [13:40:12] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36944/console" [puppet] - 10https://gerrit.wikimedia.org/r/826296 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [13:40:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T314041)', diff saved to https://phabricator.wikimedia.org/P32927 and previous config saved to /var/cache/conftool/dbconfig/20220824-134057-ladsgroup.json [13:40:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [13:41:02] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [13:41:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T312975)', diff saved to https://phabricator.wikimedia.org/P32928 and previous config saved to /var/cache/conftool/dbconfig/20220824-134104-ladsgroup.json [13:41:12] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [13:41:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T314041)', diff saved to https://phabricator.wikimedia.org/P32929 and previous config saved to /var/cache/conftool/dbconfig/20220824-134118-ladsgroup.json [13:45:51] (03CR) 10FNegri: [C: 04-1] wmcs.openstack.quota_increase: allow all known quota types (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) (owner: 10David Caro) [13:46:29] I am rolling back mediawiki due to T316085 [13:46:29] T316085: Escaped HTML underneath page title in wikis with the GeoCrumbs extension enabled - https://phabricator.wikimedia.org/T316085 [13:47:24] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36945/console" [puppet] - 10https://gerrit.wikimedia.org/r/826296 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [13:49:56] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Revert "Group 1 wikis to 1.39.0-wmf.26" # T316085 T314187 [13:50:02] T314187: 1.39.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T314187 [13:50:13] (03PS1) 10David Caro: dynamicproxy: add simple compile test [puppet] - 10https://gerrit.wikimedia.org/r/826299 [13:50:44] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Jclark-ctr) @Papaul db1185 is fixed loosed cable [13:53:19] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Marostegui) I can now access db1185 - thank you @Jclark-ctr. 
From my side, we can close this task [13:53:50] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) @Jclark-ctr thanks [13:54:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T314041)', diff saved to https://phabricator.wikimedia.org/P32930 and previous config saved to /var/cache/conftool/dbconfig/20220824-135404-ladsgroup.json [13:54:11] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [13:55:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS bullseye [13:55:59] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1185.eqiad.wmnet with OS bullseye [13:56:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P32931 and previous config saved to /var/cache/conftool/dbconfig/20220824-135611-ladsgroup.json [13:56:17] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Marostegui) I am doing a reimage of db1185 as its puppet cert was revoked and it was in a weird state. 
[13:59:21] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:07:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage [14:09:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P32932 and previous config saved to /var/cache/conftool/dbconfig/20220824-140910-ladsgroup.json [14:11:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P32933 and previous config saved to /var/cache/conftool/dbconfig/20220824-141117-ladsgroup.json [14:11:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage [14:13:47] 10SRE, 10ops-codfw, 10Discovery-Search: elastic2054 is down with memory error - https://phabricator.wikimedia.org/T315989 (10Papaul) unfortunately this server is out of warranty. 
I will check to see if we can use some memory from decom servers that we have onsite [14:15:46] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36946/console" [puppet] - 10https://gerrit.wikimedia.org/r/826296 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [14:16:45] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] Configure the load-balancers for dse-k8s-ctrl [puppet] - 10https://gerrit.wikimedia.org/r/826296 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [14:20:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [14:21:03] (03PS2) 10Bking: elastic: clear old es_6 resources during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/825874 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [14:24:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P32934 and previous config saved to /var/cache/conftool/dbconfig/20220824-142416-ladsgroup.json [14:24:23] (03CR) 10CI reject: [V: 04-1] elastic: clear old es_6 resources during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/825874 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [14:25:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1185.eqiad.wmnet with OS bullseye [14:25:08] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1185.eqiad.wmnet with OS bullseye completed: -... 
[14:25:57] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Marostegui) 05Open→03Resolved db1185 was reimaged successfully. Closing this. Thanks everyone for getting these hosts up! [14:26:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T312975)', diff saved to https://phabricator.wikimedia.org/P32935 and previous config saved to /var/cache/conftool/dbconfig/20220824-142623-ladsgroup.json [14:26:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance [14:26:28] T312975: Drop foreign keys and rename index for table echo_push_subscription on wmf wikis - https://phabricator.wikimedia.org/T312975 [14:26:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance [14:26:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2115.codfw.wmnet with reason: Maintenance [14:27:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [14:27:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2115.codfw.wmnet with reason: Maintenance [14:27:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2115 (T312975)', diff saved to https://phabricator.wikimedia.org/P32936 and previous config saved to /var/cache/conftool/dbconfig/20220824-142715-ladsgroup.json [14:27:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Jclark-ctr) kafka-logging1004. e2 u30 port30 20220047 kafka-logging1005 f2. u30. 
port30 20220048 [14:27:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Jclark-ctr) [14:29:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115 (T312975)', diff saved to https://phabricator.wikimedia.org/P32937 and previous config saved to /var/cache/conftool/dbconfig/20220824-142926-ladsgroup.json [14:37:54] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T314041)', diff saved to https://phabricator.wikimedia.org/P32938 and previous config saved to /var/cache/conftool/dbconfig/20220824-143923-ladsgroup.json [14:39:28] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [14:40:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (10Jclark-ctr) centrallog1002 b1 U36 port36 cableid 230000014 [14:40:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (10Jclark-ctr) [14:43:48] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder) [14:44:04] RECOVERY - Check systemd state on mw2387 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115', diff saved to https://phabricator.wikimedia.org/P32939 and previous config saved to /var/cache/conftool/dbconfig/20220824-144432-ladsgroup.json [14:48:44] !log powercycling krb2002 [14:48:46] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:46] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Auth extremely slow on clouddumps100[12] - https://phabricator.wikimedia.org/T316123 (10Andrew) [14:50:27] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Auth extremely slow on clouddumps100[12] - https://phabricator.wikimedia.org/T316123 (10Andrew) [14:53:08] (03CR) 10Btullis: [V: 03+1 C: 03+2] Configure the load-balancers for dse-k8s-ctrl [puppet] - 10https://gerrit.wikimedia.org/r/826296 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [14:53:10] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Auth extremely slow on clouddumps100[12] - https://phabricator.wikimedia.org/T316123 (10Andrew) ` Aug 24 14:52:46 clouddumps1001 dbus-daemon[1329]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeo... [14:54:13] (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:55:05] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1016.eqiad.wmnet: Restarting to canary OpenJDK 8u342 - eevans@cumin1001 [14:56:40] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115', diff saved to https://phabricator.wikimedia.org/P32940 and previous config saved to /var/cache/conftool/dbconfig/20220824-145939-ladsgroup.json [15:00:28] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:01:10] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:42] (03CR) 10Vgutierrez: [C: 04-1] "tests are failing.. you can run them locally using puppet/modules/varnish/files/tests$ ./docker_run.sh cp6016.drmrs.wmnet 506868. You have" [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: 10BCornwall) [15:03:10] (03CR) 10David Caro: c:dynamicproxy clean up after cronjob removal. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [15:04:51] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1016.eqiad.wmnet: Restarting to canary OpenJDK 8u342 - eevans@cumin1001 [15:10:21] (03CR) 10Muehlenhoff: [C: 03+1] c:dynamicproxy clean up after cronjob removal. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [15:12:19] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Restarting to apply OpenJDK 8u342 - eevans@cumin1001 [15:14:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install graphite1005 - https://phabricator.wikimedia.org/T313853 (10Jclark-ctr) graphite1005 B1 U37 port37 cableid:23000036 [15:14:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115 (T312975)', diff saved to https://phabricator.wikimedia.org/P32941 and previous config saved to /var/cache/conftool/dbconfig/20220824-151445-ladsgroup.json [15:14:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install graphite1005 - https://phabricator.wikimedia.org/T313853 (10Jclark-ctr) [15:14:50] T312975: Drop foreign keys and rename index for table echo_push_subscription on wmf wikis - https://phabricator.wikimedia.org/T312975 [15:20:41] (03CR) 10Cwhite: [C: 03+2] logstash: duplicate sal logs to Loki [puppet] - 10https://gerrit.wikimedia.org/r/825880 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [15:20:44] (03CR) 10Dzahn: "ok, thanks both. I was thinking the same about adding a comment because someone else will try to remove it again I would expect. I'll amen" [puppet] - 10https://gerrit.wikimedia.org/r/825924 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [15:23:12] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Degraded RAID on ms-be1054 - https://phabricator.wikimedia.org/T315480 (10Jclark-ctr) @MatthewVernon I have a spare drive from retired host. 
Can we schedule for drive replacement or can it be done at any time [15:24:02] (03CR) 10Cwhite: [C: 03+2] beta-logs: dlq use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824751 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [15:25:34] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (10Jclark-ctr) ` Tue, Aug 23, 3:23 PM (20 hours ago) Hello John, I'm following up with an update we were experiencing staff shortages at the warehouse which the parts ship and con... [15:26:41] (03PS3) 10Cwhite: beta-logs: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824753 (https://phabricator.wikimedia.org/T305175) [15:30:56] (03CR) 10Cwhite: [C: 03+2] beta-logs: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824753 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [15:34:43] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Degraded RAID on ms-be1054 - https://phabricator.wikimedia.org/T315480 (10Eevans) >>! In T315480#8181950, @Jclark-ctr wrote: > @MatthewVernon I have a spare drive from retired host. Can we schedule for drive replacement or can it be done at any time @Jcl... [15:41:23] (03CR) 10Dzahn: "well, the comment there already says it.. 
remove the Debian provided cron..so I guess I can just abandon this" [puppet] - 10https://gerrit.wikimedia.org/r/825924 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [15:41:42] (03CR) 10Herron: [C: 03+1] logstash: dlq use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824752 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [15:42:01] (03Abandoned) 10Dzahn: spamassassin: remove absented cron file [puppet] - 10https://gerrit.wikimedia.org/r/825924 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [15:43:27] (03CR) 10Herron: [C: 03+1] logstash: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824754 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [15:45:37] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (10Tchanders) > Is this monthly data dump script something that runs in Hadoop or perhaps on the stat boxes? If so, analytics-privatedata-users... 
[15:45:59] (03PS5) 10Cwhite: es_exporter: Add metrics collection for mediawiki's db errors [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) (owner: 10Ladsgroup) [15:52:34] (03PS1) 10Hashar: Vector legacy no longer imports variables from Vector modern [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826250 (https://phabricator.wikimedia.org/T213778) [15:52:49] (03PS2) 10Jdlrobson: Vector legacy no longer imports variables from Vector modern [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826250 (https://phabricator.wikimedia.org/T213778) (owner: 10Hashar) [15:58:53] jouncebot: nowandnext [15:58:53] No deployments scheduled for the next 2 hour(s) and 1 minute(s) [15:58:53] In 2 hour(s) and 1 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220824T1800) [15:58:53] In 2 hour(s) and 1 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220824T1800) [15:59:35] (03CR) 10Ladsgroup: [C: 03+2] "I deploy" [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826250 (https://phabricator.wikimedia.org/T213778) (owner: 10Hashar) [15:59:50] hashar Jdlrobson ^ [16:00:26] !log Restarted CI Jenkins, Release Jenkins, Gerrit replica and Gerrit [16:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:03] (03CR) 10CI reject: [V: 04-1] Vector legacy no longer imports variables from Vector modern [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826250 (https://phabricator.wikimedia.org/T213778) (owner: 10Hashar) [16:01:10] oh [16:01:23] Amir1: sorry I restarted Gerrit just when you +2ed it [16:02:07] (03CR) 10Hashar: [C: 03+2] Vector legacy no longer imports variables from Vector modern [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826250 
(https://phabricator.wikimedia.org/T213778) (owner: 10Hashar) [16:03:13] haha [16:03:16] okay [16:03:29] +2ed it again [16:05:25] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1007.eqiad.wmnet with OS bullseye [16:05:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye [16:15:35] !log btullis@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1007.eqiad.wmnet with OS bullseye [16:15:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye exec... 
[16:17:28] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1009.eqiad.wmnet with OS bullseye [16:17:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-presto1009.eqiad.wmnet with OS bullseye [16:20:43] (03CR) 10CI reject: [V: 04-1] Vector legacy no longer imports variables from Vector modern [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826250 (https://phabricator.wikimedia.org/T213778) (owner: 10Hashar) [16:21:34] damn [16:24:13] (03CR) 10Cwhite: [C: 03+2] es_exporter: Add metrics collection for mediawiki's db errors [puppet] - 10https://gerrit.wikimedia.org/r/825306 (https://phabricator.wikimedia.org/T297435) (owner: 10Ladsgroup) [16:25:07] stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/core/': The requested URL returned error: 503' [16:25:09] hmm [16:25:33] (03CR) 10Hashar: [C: 03+2] Vector legacy no longer imports variables from Vector modern [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826250 (https://phabricator.wikimedia.org/T213778) (owner: 10Hashar) [16:26:11] !log mwmaint1002 systemctl start mediawiki_job_initsitestats T315121 [16:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:16] T315121: After new wikis are created/imported from Incubator, statistics should be updated - https://phabricator.wikimedia.org/T315121 [16:26:23] oh no [16:27:13] hashar: is that a "oh no, it's working" or "oh no, something is really bad"? 
gerrit works for me [16:27:19] (03CR) 10Hashar: [C: 03+2] "Failed due to:" [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826250 (https://phabricator.wikimedia.org/T213778) (owner: 10Hashar) [16:27:23] I mixed up the builds [16:27:26] ok [16:27:27] it is a flappy test ;) [16:27:31] ok:) [16:27:57] the "503" "core" and "oh no" combo got me [16:28:08] 10SRE, 10Observability-Logging, 10Patch-For-Review: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10colewhite) [16:30:19] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1009.eqiad.wmnet with reason: host reimage [16:30:42] jouncebot now [16:30:43] No deployments scheduled for the next 1 hour(s) and 29 minute(s) [16:33:04] (03CR) 10Dzahn: webperf: add prometheus::blackbox::check::http for performance.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/823737 (https://phabricator.wikimedia.org/T277927) (owner: 10Dzahn) [16:34:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1009.eqiad.wmnet with reason: host reimage [16:34:45] (03CR) 10Dzahn: webperf: add prometheus::blackbox::check::http for performance.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/823737 (https://phabricator.wikimedia.org/T277927) (owner: 10Dzahn) [16:35:05] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10Andrew) a:03Andrew [16:37:38] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:39:52] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:41:00] (03CR) 10Dzahn: scap/dsh: remove parsoid service, replaced by parsoid-php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207) (owner: 10Dzahn) [16:41:22] (03Merged) 10jenkins-bot: Vector legacy no longer imports variables from Vector modern [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826250 (https://phabricator.wikimedia.org/T213778) (owner: 10Hashar) [16:47:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:47:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:47:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:47:58] (03CR) 10Dzahn: "these scap "dsh groups" (back the in the old days we actually used dsh) are data for conftool. 
but looking at for example https://config-m" [puppet] - 10https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207) (owner: 10Dzahn) [16:48:47] (03PS1) 10Majavah: hieradata: remove unused labs_tld labs_site variables [puppet] - 10https://gerrit.wikimedia.org/r/826346 [16:48:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:51:30] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.26/skins/Vector/resources/mediawiki.less.legacy/mediawiki.skin.variables.less: Backport: [[gerrit:826250|Vector legacy no longer imports variables from Vector modern (T213778)]] (duration: 02m 52s) [16:51:34] T213778: Update link colors in Vector 2022 for improved UX (and consistency) - https://phabricator.wikimedia.org/T213778 [16:52:32] (03PS1) 10Dzahn: mediwiki/initsitestats: change time of day to run initsitestats [puppet] - 10https://gerrit.wikimedia.org/r/826347 [16:56:49] (03CR) 10Hashar: [C: 03+2] "Was busy with some meetings/reviews. I am deploying it now." [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826250 (https://phabricator.wikimedia.org/T213778) (owner: 10Hashar) [16:56:56] Jdlrobson: Amir1: I am deploying the Vector hotfix https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/826250/ [16:57:04] hashar: I did [16:57:15] oh [16:57:31] damn I missed your !log line above. Thank you Amir1 ! [16:57:41] !bacc [16:58:35] !bacc alias https://deploy-commands.toolforge.org/bacc/$1 [16:58:35] Sorry, you are not authorized to perform this [16:58:38] damn [16:59:26] hashar: `scap backport` [17:00:01] OH true [17:00:11] last time I was active I think it was not entirely ready yet [17:00:13] note taken [17:00:36] It's ready now. Feedback (and fixes) are welcome! [17:00:41] (03PS1) 10Majavah: apt::noupgrade: remove [puppet] - 10https://gerrit.wikimedia.org/r/826350 [17:01:20] will do for sure [17:01:42] Speaking of which, lemme know when yall are done deploying. 
I'll update the scap release [17:03:52] (03PS1) 10Andrew Bogott: Make Cloudservices1005 a designate node [puppet] - 10https://gerrit.wikimedia.org/r/826352 (https://phabricator.wikimedia.org/T304888) [17:05:01] (03CR) 10Dzahn: "JBond was added automatically I think because ssh config is involved. Since I know he is out let me ask Moritz, Moritz do you see a securi" [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [17:06:11] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1009.eqiad.wmnet with OS bullseye [17:06:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-presto1009.eqiad.wmnet with OS bullseye exec... [17:06:48] (03CR) 10Ladsgroup: [C: 03+1] webperf: add prometheus::blackbox::check::http for performance.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/823737 (https://phabricator.wikimedia.org/T277927) (owner: 10Dzahn) [17:07:14] (03CR) 10Andrew Bogott: [C: 03+2] Make Cloudservices1005 a designate node [puppet] - 10https://gerrit.wikimedia.org/r/826352 (https://phabricator.wikimedia.org/T304888) (owner: 10Andrew Bogott) [17:10:25] (03CR) 10Dzahn: gerrit: allow nist kex algorithms on OpenSsh server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [17:11:14] (03PS3) 10Ahmon Dancy: Add systemd timer to run scap stage-train on Tuesday morning [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) [17:13:43] (03CR) 10Hashar: [C: 03+1] gerrit: allow nist kex algorithms on OpenSsh server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) 
[17:13:53] (03PS4) 10Ahmon Dancy: Add systemd timer to run scap stage-train on Tuesday morning [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) [17:20:55] I am off for the rest of the day [17:21:01] (03CR) 10Ahmon Dancy: Add systemd timer to run scap stage-train on Tuesday morning (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [17:23:22] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "looks good in prod, now if it's noop in cloud I'll merge. https://puppet-compiler.wmflabs.org/pcc-worker1002/36948/" [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [17:24:54] (03CR) 10Majavah: [C: 04-1] Add systemd timer to run scap stage-train on Tuesday morning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [17:25:42] (03CR) 10Ahmon Dancy: Add systemd timer to run scap stage-train on Tuesday morning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [17:28:46] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "@Majavah fyi, I was especially asking for it to be noop in cloud before merge having you in mind" [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [17:33:08] 10SRE, 10DBA, 10observability, 10MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), 10Patch-For-Review: Send metrics of db errors of mediawiki to prometheus - https://phabricator.wikimedia.org/T297435 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup We now have something https://grafana-rw.wikimedia.org/d/000... 
[17:33:10] 10SRE, 10Data-Persistence, 10observability, 10Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (10Ladsgroup) [17:33:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [17:33:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [17:33:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:33:50] (03PS1) 10Andrew Bogott: Replace ns[01].openstack.codfw1dev.wikimediacloud.org definitions with cnames [dns] - 10https://gerrit.wikimedia.org/r/826354 [17:34:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:34:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T314041)', diff saved to https://phabricator.wikimedia.org/P32942 and previous config saved to /var/cache/conftool/dbconfig/20220824-173409-ladsgroup.json [17:34:14] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [17:34:49] (03CR) 10Majavah: [C: 04-1] Add systemd timer to run scap stage-train on Tuesday morning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [17:37:22] (03CR) 10Dzahn: [V: 03+1 C: 03+1] Add systemd timer to run scap stage-train on Tuesday morning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [17:37:38] (03CR) 10Andrew Bogott: [C: 03+2] Replace ns[01].openstack.codfw1dev.wikimediacloud.org definitions with cnames [dns] - 10https://gerrit.wikimedia.org/r/826354 (owner: 10Andrew Bogott) 
[17:39:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1103.eqiad.wmnet with reason: Maintenance [17:40:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1103.eqiad.wmnet with reason: Maintenance [17:40:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2096.codfw.wmnet with reason: Maintenance [17:40:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2096.codfw.wmnet with reason: Maintenance [17:45:32] (03PS1) 10Andrew Bogott: Add temporary entry for ns2.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/826356 [17:46:31] (03CR) 10CI reject: [V: 04-1] Add temporary entry for ns2.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/826356 (owner: 10Andrew Bogott) [17:46:48] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Restarting to apply OpenJDK 8u342 - eevans@cumin1001 [17:47:43] (03PS5) 10Ahmon Dancy: Add systemd timer to run scap stage-train on Tuesday morning [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) [17:49:20] (03PS2) 10Andrew Bogott: Add temporary entry for ns2.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/826356 [17:49:44] (03PS6) 10Ahmon Dancy: Add systemd timer to run scap stage-train on Tuesday morning [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) [17:49:52] (03PS1) 10Dzahn: site: add phabricator role on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/826357 (https://phabricator.wikimedia.org/T280597) [17:50:13] (03PS2) 10Dzahn: site: add phabricator role on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/826357 (https://phabricator.wikimedia.org/T280597) [17:50:18] (03CR) 10CI reject: [V: 04-1] Add temporary entry for 
ns2.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/826356 (owner: 10Andrew Bogott) [17:50:54] (03CR) 10CI reject: [V: 04-1] site: add phabricator role on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/826357 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [17:52:00] (03PS3) 10Bking: elastic: clear old es_6 resources during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/825874 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [17:52:12] (03CR) 10Ahmon Dancy: Add systemd timer to run scap stage-train on Tuesday morning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [17:52:59] (03PS3) 10Andrew Bogott: Add temporary entry for ns2.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/826356 [17:54:34] (03CR) 10Andrew Bogott: [C: 03+2] Add temporary entry for ns2.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/826356 (owner: 10Andrew Bogott) [17:55:30] (03CR) 10jenkins-bot: elastic: clear old es_6 resources during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/825874 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [17:58:46] (03PS1) 10Andrew Bogott: cloudservices1005: hack in a temporary resolver fqdn [puppet] - 10https://gerrit.wikimedia.org/r/826358 [17:59:22] (03CR) 10CI reject: [V: 04-1] cloudservices1005: hack in a temporary resolver fqdn [puppet] - 10https://gerrit.wikimedia.org/r/826358 (owner: 10Andrew Bogott) [17:59:33] (03PS4) 10Bking: elastic: clear old es_6 resources during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/825874 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [18:00:05] hashar and dduvall: Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220824T1800). 
[18:00:05] hashar and dduvall: Dear deployers, time to do the MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220824T1800). [18:01:45] (03PS2) 10Andrew Bogott: cloudservices1005: hack in a temporary resolver fqdn [puppet] - 10https://gerrit.wikimedia.org/r/826358 (https://phabricator.wikimedia.org/T304888) [18:02:43] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices1005: hack in a temporary resolver fqdn [puppet] - 10https://gerrit.wikimedia.org/r/826358 (https://phabricator.wikimedia.org/T304888) (owner: 10Andrew Bogott) [18:05:39] (03PS2) 10Dzahn: site: add phabricator role on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/824803 (https://phabricator.wikimedia.org/T280597) [18:05:50] (03Abandoned) 10Dzahn: site: add phabricator role on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/826357 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [18:06:06] (03PS3) 10Dzahn: site: add phabricator role on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/824803 (https://phabricator.wikimedia.org/T280597) [18:07:25] (03PS5) 10Ryan Kemper: elastic: clear old es_6 resources during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/825874 (https://phabricator.wikimedia.org/T308676) [18:07:39] 10SRE-Access-Requests: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Abit) [18:08:13] 10SRE-Access-Requests: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10nshahquinn-wmf) @CDunn can you approve Amanda to access private data in Superset? 
[18:08:47] (03CR) 10Dzahn: "there should be no more LVS/git-ssh/pybal etc in https://puppet-compiler.wmflabs.org/pcc-worker1003/36951/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/824803 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [18:09:37] 10SRE-Access-Requests: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10nshahquinn-wmf) [18:09:55] 10SRE-Access-Requests: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10nshahquinn-wmf) @Ottomata / @odimitrijevic can you approve? [18:12:07] (03CR) 10Dzahn: "the "be careful"-part means basically nothing in role/eqiad/phabricator.yaml should be a problem if it's applied to more than the prod hos" [puppet] - 10https://gerrit.wikimedia.org/r/824803 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [18:13:31] (03CR) 10Dzahn: [C: 04-1] "Just like in codfw before I should move the LVS IPs and all that which we won't use on the new host to hosts/phab1001.yaml first. it's the" [puppet] - 10https://gerrit.wikimedia.org/r/824803 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [18:18:43] (03PS1) 10Dzahn: phabricator: move LVS IPs for git-ssh service from role/eqiad to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/826360 (https://phabricator.wikimedia.org/T296022) [18:21:12] (03CR) 10Dzahn: [C: 03+2] "noop on all - https://puppet-compiler.wmflabs.org/pcc-worker1003/36952/" [puppet] - 10https://gerrit.wikimedia.org/r/826360 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [18:32:22] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (10gmodena) >>! In T315409#8181985, @Tchanders wrote: >> Is this monthly data dump script something that runs in Hadoop or perhaps on the stat b... 
[18:33:41] 10SRE, 10Traffic: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (10BCornwall) [18:34:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T314041)', diff saved to https://phabricator.wikimedia.org/P32943 and previous config saved to /var/cache/conftool/dbconfig/20220824-183425-ladsgroup.json [18:34:31] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [18:35:56] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for jdfraine - https://phabricator.wikimedia.org/T316044 (10KFrancis) @jdfraine Please provide me with your WMDE email address. If you'd rather not post it here, please send it to my WMF address: kfrancis@wikimedia.org. Thanks! [18:38:34] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:49:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P32944 and previous config saved to /var/cache/conftool/dbconfig/20220824-184931-ladsgroup.json [18:57:47] (03PS1) 10BCornwall: prometheus: Parse ATS config to node exporter text [puppet] - 10https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) [18:58:31] (03CR) 10CI reject: [V: 04-1] prometheus: Parse ATS config to node exporter text [puppet] - 10https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [18:59:55] (03PS2) 10BCornwall: prometheus: Parse ATS config to node exporter text [puppet] - 10https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) [19:00:39] (03CR) 10CI reject: [V: 04-1] prometheus: Parse ATS config to node exporter text [puppet] - 10https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) (owner: 
10BCornwall) [19:03:13] (03CR) 10Dzahn: [C: 03+2] "if you could recreate the "updatequery" page part for a later patch that would be great. I just wanted to first add shared_periodic_jobs i" [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [19:03:46] (03CR) 10Dzahn: [C: 03+2] Add profile::mediawiki::sharded_periodic_job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [19:04:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P32945 and previous config saved to /var/cache/conftool/dbconfig/20220824-190437-ladsgroup.json [19:05:57] (03PS7) 10Ahmon Dancy: Add systemd timer to run scap stage-train on Tuesday morning [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) [19:06:02] (03PS1) 10Ladsgroup: Convert page title to variant properly [extensions/GeoCrumbs] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826330 (https://phabricator.wikimedia.org/T316085) [19:06:55] (03CR) 10Bking: [C: 03+1] elastic: clear old es_6 resources during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/825874 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [19:07:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:04] MatmaRex: Shall we deploy the backport? 
I can make it in a way you could test it in mwdebug (even if the train is rolled back) [19:10:11] Amir1: yeah, let's [19:10:36] jouncebot: nowandnext [19:10:36] For the next 0 hour(s) and 49 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220824T1800) [19:10:36] In 0 hour(s) and 49 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220824T2000) [19:10:51] the train is blocked so los geht's [19:11:03] (03CR) 10Ladsgroup: [C: 03+2] Convert page title to variant properly [extensions/GeoCrumbs] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826330 (https://phabricator.wikimedia.org/T316085) (owner: 10Ladsgroup) [19:12:37] (03CR) 10Bking: [C: 03+2] elastic: clear old es_6 resources during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/825874 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [19:12:55] (03CR) 10Bking: [V: 03+2 C: 03+2] elastic: clear old es_6 resources during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/825874 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [19:13:02] (03Merged) 10jenkins-bot: Convert page title to variant properly [extensions/GeoCrumbs] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826330 (https://phabricator.wikimedia.org/T316085) (owner: 10Ladsgroup) [19:15:29] (03PS3) 10BCornwall: prometheus: Parse ATS config to node exporter text [puppet] - 10https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) [19:16:23] MatmaRex: pulled in mwdebug1001, rolled forward enwikivoyage and zhwikivoyage to wmf.26 there [19:16:40] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:55] let's see [19:18:08] (03PS8) 10Ahmon Dancy: Add systemd timer to run scap stage-train on Tuesday morning [puppet] 
- 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) [19:18:15] (03PS1) 10Andrew Bogott: Add cloudservices1005 to the list of designate hosts [puppet] - 10https://gerrit.wikimedia.org/r/826364 (https://phabricator.wikimedia.org/T304888) [19:19:40] Amir1: seems to work fine for me [19:19:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T314041)', diff saved to https://phabricator.wikimedia.org/P32946 and previous config saved to /var/cache/conftool/dbconfig/20220824-191943-ladsgroup.json [19:19:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1111.eqiad.wmnet with reason: Maintenance [19:19:50] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [19:19:55] cool, let's push it forward then [19:19:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1111.eqiad.wmnet with reason: Maintenance [19:20:05] (03CR) 10Andrew Bogott: [C: 03+2] Add cloudservices1005 to the list of designate hosts [puppet] - 10https://gerrit.wikimedia.org/r/826364 (https://phabricator.wikimedia.org/T304888) (owner: 10Andrew Bogott) [19:20:06] it's all a bit confusing to test [19:20:20] since the feature combines data from the parser cache for different pages [19:20:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:20:36] and something unrelated also hides the "real" page title on that wiki [19:20:45] so i can't see whether the markup is present in it or not [19:20:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [19:20:50] 10SRE, 10Traffic, 10Patch-For-Review: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (10BCornwall) 05Open→03Stalled Change needs some testing but will be stalled until 
https://phabricator.wikimedia.org/T309651 is fixed [19:21:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [19:21:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T314041)', diff saved to https://phabricator.wikimedia.org/P32947 and previous config saved to /var/cache/conftool/dbconfig/20220824-192119-ladsgroup.json [19:21:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:21:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:22:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:22:29] (03CR) 10Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1001/36954/" [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [19:23:02] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.26/extensions/GeoCrumbs/includes/Hooks.php: Backport: [[gerrit:826330|Convert page title to variant properly (T316085)]] (duration: 02m 50s) [19:23:05] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: acme-chief should be able to refresh OCSP stapling response even if the renewal process fails - https://phabricator.wikimedia.org/T244232 (10BCornwall) 05Open→03In progress [19:23:06] T316085: Escaped HTML underneath page title in wikis with the GeoCrumbs extension enabled - https://phabricator.wikimedia.org/T316085 [19:23:08] 10SRE, 10Traffic: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (10BCornwall) [19:25:17] 10SRE, 10Traffic: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (10BCornwall) 05Resolved→03Open [19:26:15] 10SRE, 10Traffic: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - 
https://phabricator.wikimedia.org/T243948 (10BCornwall) 05Open→03Resolved [19:27:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T314041)', diff saved to https://phabricator.wikimedia.org/P32948 and previous config saved to /var/cache/conftool/dbconfig/20220824-192705-ladsgroup.json [19:27:10] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [19:29:13] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: acme-chief should be able to refresh OCSP stapling response even if the renewal process fails - https://phabricator.wikimedia.org/T244232 (10BCornwall) p:05Medium→03High [19:30:12] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 252, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:30:42] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:30:44] 10SRE, 10Acme-chief, 10Cloud-VPS, 10Traffic-Icebox, and 2 others: acme-chief shouldn't try to perform OCSP stapling of expired certs - https://phabricator.wikimedia.org/T262251 (10BCornwall) 05Open→03In progress p:05Medium→03High [19:34:31] 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Using docker in WMF production network outside of kubernetes - https://phabricator.wikimedia.org/T275551 (10gmodena) >>! In T275551#8178081, @Ottomata wrote: >> will it be possible to consume e.g. events from kafka infra, or read/write to swift? > Nopers :/... 
[19:40:37] (03CR) 10Cwhite: [C: 03+2] logstash: dlq use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824752 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [19:40:45] (03PS2) 10Cwhite: logstash: dlq use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824752 (https://phabricator.wikimedia.org/T305175) [19:41:06] (03PS3) 10Cwhite: logstash: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824754 (https://phabricator.wikimedia.org/T305175) [19:42:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P32949 and previous config saved to /var/cache/conftool/dbconfig/20220824-194211-ladsgroup.json [19:42:51] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Add web_proxy config value for production [puppet] - 10https://gerrit.wikimedia.org/r/826366 [19:46:15] (03PS1) 10BCornwall: varnish: Remove extraneous checks for Docker [puppet] - 10https://gerrit.wikimedia.org/r/826367 [19:48:18] (03PS1) 10Dzahn: Revert "c:spamassassin move Spamassassin updates from crontab to systemd timers." [puppet] - 10https://gerrit.wikimedia.org/r/826331 [19:48:44] (03CR) 10CI reject: [V: 04-1] Revert "c:spamassassin move Spamassassin updates from crontab to systemd timers." [puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn) [19:49:35] (03CR) 10Dzahn: "unfortunately spamassassin_updates.service is not working so let's revert first and figure it out, or we have constant systemd alert on ot" [puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn) [19:51:32] (03PS2) 10Dzahn: Revert "c:spamassassin move Spamassassin updates from crontab to systemd timers." 
[puppet] - 10https://gerrit.wikimedia.org/r/826331 [19:52:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:07] (03CR) 10CI reject: [V: 04-1] Revert "c:spamassassin move Spamassassin updates from crontab to systemd timers." [puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn) [19:56:57] (03PS1) 10Cwhite: logstash: add dlq revision to index pattern [puppet] - 10https://gerrit.wikimedia.org/r/826372 (https://phabricator.wikimedia.org/T305175) [19:57:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P32950 and previous config saved to /var/cache/conftool/dbconfig/20220824-195717-ladsgroup.json [19:59:05] (03CR) 10Cwhite: [C: 03+2] "PCC checks out: https://puppet-compiler.wmflabs.org/pcc-worker1003/36955/" [puppet] - 10https://gerrit.wikimedia.org/r/826372 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [20:00:05] RoanKattouw, Urbanecm, and cjming: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220824T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:00:46] (03PS2) 10Cwhite: logstash: use puppet dlq version and revision for index pattern [puppet] - 10https://gerrit.wikimedia.org/r/826372 (https://phabricator.wikimedia.org/T305175) [20:01:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:10] (03CR) 10Cwhite: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/36956/" [puppet] - 10https://gerrit.wikimedia.org/r/826372 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [20:02:50] (03CR) 10Dzahn: "https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=otrs1001&service=Check+systemd+state" [puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn) [20:03:01] (03PS3) 10Dzahn: Revert "c:spamassassin move Spamassassin updates from crontab to systemd timers." [puppet] - 10https://gerrit.wikimedia.org/r/826331 [20:03:51] (03CR) 10CI reject: [V: 04-1] Revert "c:spamassassin move Spamassassin updates from crontab to systemd timers." [puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn) [20:04:48] (03CR) 10Kosta Harlan: Vector legacy no longer imports variables from Vector modern (031 comment) [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826250 (https://phabricator.wikimedia.org/T213778) (owner: 10Hashar) [20:10:29] 10SRE, 10SRE-Access-Requests: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Ladsgroup) p:05Triage→03Medium a:03Ladsgroup Clinic duty this week. 
[20:12:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T314041)', diff saved to https://phabricator.wikimedia.org/P32951 and previous config saved to /var/cache/conftool/dbconfig/20220824-201224-ladsgroup.json [20:12:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [20:12:29] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [20:12:39] (03PS4) 10Cwhite: logstash: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824754 (https://phabricator.wikimedia.org/T305175) [20:12:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [20:12:53] 10SRE, 10SRE-Access-Requests: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Ladsgroup) [20:13:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [20:13:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [20:13:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32952 and previous config saved to /var/cache/conftool/dbconfig/20220824-201344-ladsgroup.json [20:15:43] 10SRE, 10SRE-Access-Requests: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Ladsgroup) I confirm the coolest username in wikitech belongs to @Abit: ` ladsgroup@mwmaint1002:~$ ldapsearch -x mail=abittaker@wikimedia.org ... uid: wubwubwub cn: Wubwubwub sn: Wubwubwub... 
[20:15:50] 10SRE, 10SRE-Access-Requests: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Ladsgroup) [20:16:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32953 and previous config saved to /var/cache/conftool/dbconfig/20220824-201637-ladsgroup.json [20:16:39] (03CR) 10Bking: [C: 03+2] elastic: es7 removed bulk threadpool [puppet] - 10https://gerrit.wikimedia.org/r/825883 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [20:16:58] (03CR) 10Cwhite: [C: 03+2] logstash: w3creportingapi to use logstash-managed index pattern [puppet] - 10https://gerrit.wikimedia.org/r/824754 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [20:18:13] cwhite looks like we tried to merge at the same instant, feel free to merge my change if you haven't already [20:18:20] (or I can do it if you're done) [20:18:21] inflatador: merged, thanks! [20:18:28] ACK, thank you! [20:21:44] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:21:58] (03CR) 10Dzahn: [C: 03+2] "[deploy1002:~] $ host webproxy" [puppet] - 10https://gerrit.wikimedia.org/r/826366 (owner: 10Ahmon Dancy) [20:27:13] (03CR) 10Dzahn: [C: 04-1] "did https://gerrit.wikimedia.org/r/c/operations/puppet/+/826360 and now I would like https://phabricator.wikimedia.org/T315713#8183037 fir" [puppet] - 10https://gerrit.wikimedia.org/r/824803 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [20:27:52] (03PS1) 10Andrew Bogott: cloudservices1005 will replace ns0 rather than ns1. [puppet] - 10https://gerrit.wikimedia.org/r/826378 (https://phabricator.wikimedia.org/T304888) [20:28:58] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices1005 will replace ns0 rather than ns1. 
[puppet] - 10https://gerrit.wikimedia.org/r/826378 (https://phabricator.wikimedia.org/T304888) (owner: 10Andrew Bogott) [20:31:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P32954 and previous config saved to /var/cache/conftool/dbconfig/20220824-203143-ladsgroup.json [20:32:00] (03PS4) 10Dzahn: Revert "c:spamassassin move Spamassassin updates from crontab to systemd timers." [puppet] - 10https://gerrit.wikimedia.org/r/826331 [20:32:36] (03CR) 10CI reject: [V: 04-1] Revert "c:spamassassin move Spamassassin updates from crontab to systemd timers." [puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn) [20:33:24] (03CR) 10Dzahn: "a real clean revert would not remove the timer though..so it has to be some "custom revert"" [puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn) [20:34:10] (03PS5) 10Dzahn: Revert "c:spamassassin move Spamassassin updates from crontab" [puppet] - 10https://gerrit.wikimedia.org/r/826331 [20:34:46] (03CR) 10CI reject: [V: 04-1] Revert "c:spamassassin move Spamassassin updates from crontab" [puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn) [20:35:31] (03PS6) 10Dzahn: Revert "c:spamassassin move Spamassassin updates from crontab" [puppet] - 10https://gerrit.wikimedia.org/r/826331 [20:38:00] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:38:33] (03CR) 10Dzahn: [C: 03+2] Revert "c:spamassassin move Spamassassin updates from crontab" [puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn) [20:40:20] !log otrs1001 - systemctl reset failed [20:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:52] (03CR) 10Dzahn: [C: 03+2] "File[/etc/cron.daily/spamassassin]/ensure: defined content.. 
and timer/service removed. then "systemctl reset-failed" to clearn alert" [puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn) [20:41:44] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:41:57] 10SRE, 10serviceops, 10Sustainability (Incident Followup): Set API and appserver weights in eqiad - https://phabricator.wikimedia.org/T304800 (10Dzahn) [20:42:12] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:42:36] (03CR) 10Dzahn: [C: 03+2] "20:42 <+icinga-wm> RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikime" [puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn) [20:43:26] (03PS1) 10Bking: opensearch: replace outdated config [puppet] - 10https://gerrit.wikimedia.org/r/826383 (https://phabricator.wikimedia.org/T308676) [20:43:59] (03PS2) 10Ryan Kemper: opensearch: replace outdated config [puppet] - 10https://gerrit.wikimedia.org/r/826383 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [20:44:08] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/826383 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [20:44:17] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Restarting to apply OpenJDK 8u342 - eevans@cumin1001 [20:44:38] i'd also like to merge and backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/826380 before wmf.26 rolls out further [20:44:49] it's a trivial fix so i hope someone here can review [20:46:30] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[20:46:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P32955 and previous config saved to /var/cache/conftool/dbconfig/20220824-204649-ladsgroup.json [20:47:22] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:00] PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:50:31] MatmaRex: +2 applied [20:50:53] thanks! [21:01:04] (03PS1) 10Cwhite: logstash: set ecs routing only when the output is logstash [puppet] - 10https://gerrit.wikimedia.org/r/826384 (https://phabricator.wikimedia.org/T305013) [21:01:08] (03PS1) 10Cwhite: logstash: alerts to use yearly rotation [puppet] - 10https://gerrit.wikimedia.org/r/826385 (https://phabricator.wikimedia.org/T304924) [21:01:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32956 and previous config saved to /var/cache/conftool/dbconfig/20220824-210155-ladsgroup.json [21:01:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [21:02:01] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [21:02:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [21:02:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T314041)', diff saved to https://phabricator.wikimedia.org/P32957 and previous config saved to /var/cache/conftool/dbconfig/20220824-210216-ladsgroup.json [21:05:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T314041)', 
diff saved to https://phabricator.wikimedia.org/P32958 and previous config saved to /var/cache/conftool/dbconfig/20220824-210507-ladsgroup.json [21:05:37] (03CR) 10Cwhite: [C: 03+1] "Looks like this option was deprecated in 6.3 and removed in 7.0 https://www.elastic.co/guide/en/elasticsearch/reference/6.3/breaking-chang" [puppet] - 10https://gerrit.wikimedia.org/r/826383 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [21:06:46] (03PS1) 10Ryan Kemper: elastic: use our jvm not elasticsearch's jvm [puppet] - 10https://gerrit.wikimedia.org/r/826386 (https://phabricator.wikimedia.org/T308676) [21:07:26] (03CR) 10CI reject: [V: 04-1] elastic: use our jvm not elasticsearch's jvm [puppet] - 10https://gerrit.wikimedia.org/r/826386 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [21:08:33] (03PS2) 10Ryan Kemper: elastic: use our jvm not elasticsearch's jvm [puppet] - 10https://gerrit.wikimedia.org/r/826386 (https://phabricator.wikimedia.org/T308676) [21:09:23] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/826386 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [21:10:09] (03PS1) 10Andrew Bogott: Replace cloudservices1003 with cloudservices1005 [puppet] - 10https://gerrit.wikimedia.org/r/826387 (https://phabricator.wikimedia.org/T304888) [21:10:48] (03PS1) 10Ladsgroup: Display page namespace with spaces instead of underscores when page doesn't exist [core] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826332 (https://phabricator.wikimedia.org/T316092) [21:13:00] (03PS1) 10Andrew Bogott: Replace cloudservices1003 with cloudservices1005 for ns0 [dns] - 10https://gerrit.wikimedia.org/r/826388 (https://phabricator.wikimedia.org/T304888) [21:13:19] (03CR) 10Andrew Bogott: [C: 03+2] Replace cloudservices1003 with cloudservices1005 [puppet] - 10https://gerrit.wikimedia.org/r/826387 (https://phabricator.wikimedia.org/T304888) (owner: 10Andrew Bogott) [21:15:48] (03PS1) 10Ryan Kemper: 
delete-old-output-large-reports: fix desc [puppet] - 10https://gerrit.wikimedia.org/r/826390 [21:15:58] !log dzahn@cumin2002 conftool action : set/weight=25; selector: name=mw131[2-7].eqiad.wmnet [21:16:24] 10SRE, 10SRE-Access-Requests: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Abit) > I confirm the coolest username in wikitech belongs to @Abit I was a baby at the time! I didn't know 😭 [21:18:13] !log dzahn@cumin2002 conftool action : set/weight=25; selector: name=mw132[1-9].eqiad.wmnet [21:20:05] !log setting weight to 25 (from 30) for appservers and API servers in the range mw1307 through mw1348 because they are of an older hardware type (not changing weights of jobrunners/videoscalers even if in this range) (T304800) [21:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:09] T304800: Set API and appserver weights in eqiad - https://phabricator.wikimedia.org/T304800 [21:20:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P32959 and previous config saved to /var/cache/conftool/dbconfig/20220824-212013-ladsgroup.json [21:20:29] (03PS2) 10Ryan Kemper: delete-old-output-large-reports: fix desc [puppet] - 10https://gerrit.wikimedia.org/r/826390 [21:20:38] (03PS1) 10Bking: elastic: enable ES7 repo on cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/826391 (https://phabricator.wikimedia.org/T316159) [21:20:54] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/826390 (owner: 10Ryan Kemper) [21:21:35] (03PS7) 10BCornwall: Varnish: Stop sending analytics cookies to API [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) [21:21:40] (03CR) 10BCornwall: Varnish: Stop sending analytics cookies to API (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: 
10BCornwall) [21:22:46] !log dzahn@cumin2002 conftool action : set/weight=25; selector: name=mw133[1-9].eqiad.wmnet,cluster=appserver [21:22:52] !log dzahn@cumin2002 conftool action : set/weight=25; selector: name=mw133[1-9].eqiad.wmnet,cluster=api_appserver [21:23:09] (03PS3) 10Ryan Kemper: delete-old-output-large-reports: fix desc [puppet] - 10https://gerrit.wikimedia.org/r/826390 [21:23:34] !log dzahn@cumin2002 conftool action : set/weight=25; selector: name=mw134[1-8].eqiad.wmnet,cluster=api_appserver [21:24:02] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36960/console" [puppet] - 10https://gerrit.wikimedia.org/r/826386 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [21:24:53] (03CR) 10Bartosz Dziewoński: [C: 03+1] "Please deploy whenever convenient (or I'll put it in the backport window tomorrow)" [core] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826332 (https://phabricator.wikimedia.org/T316092) (owner: 10Ladsgroup) [21:25:21] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/826391 (https://phabricator.wikimedia.org/T316159) (owner: 10Bking) [21:27:33] 10SRE, 10serviceops, 10Sustainability (Incident Followup): Set API and appserver weights in eqiad - https://phabricator.wikimedia.org/T304800 (10Dzahn) @RLazarus Thank you! Done! I changed the value from 30 to 25 for any server within the range mw1307 through mw1348 that was either appserver or api_appserve... 
[21:29:42] 10SRE, 10serviceops, 10Sustainability (Incident Followup): Set API and appserver weights in eqiad - https://phabricator.wikimedia.org/T304800 (10Dzahn) 05Open→03Resolved a:03Dzahn [21:33:31] (03CR) 10Bking: [C: 03+1] elastic: use our jvm not elasticsearch's jvm [puppet] - 10https://gerrit.wikimedia.org/r/826386 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [21:33:33] (03CR) 10Bking: [C: 03+2] elastic: use our jvm not elasticsearch's jvm [puppet] - 10https://gerrit.wikimedia.org/r/826386 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [21:35:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P32961 and previous config saved to /var/cache/conftool/dbconfig/20220824-213519-ladsgroup.json [21:37:15] (03PS2) 10Ryan Kemper: elastic: enable ES7 repo on cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/826391 (https://phabricator.wikimedia.org/T316159) (owner: 10Bking) [21:37:56] (03CR) 10Bking: [C: 03+1] elastic: enable ES7 repo on cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/826391 (https://phabricator.wikimedia.org/T316159) (owner: 10Bking) [21:39:17] (03CR) 10Bking: [C: 03+2] elastic: enable ES7 repo on cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/826391 (https://phabricator.wikimedia.org/T316159) (owner: 10Bking) [21:42:58] (03PS2) 10Dzahn: mediwiki/initsitestats: change time of day to run initsitestats [puppet] - 10https://gerrit.wikimedia.org/r/826347 (https://phabricator.wikimedia.org/T315121) [21:44:18] (03CR) 10Dzahn: "follow-up to Ie98bf696620a3c8a7" [puppet] - 10https://gerrit.wikimedia.org/r/826347 (https://phabricator.wikimedia.org/T315121) (owner: 10Dzahn) [21:46:03] (03CR) 10Andrew Bogott: [C: 03+2] Replace cloudservices1003 with cloudservices1005 for ns0 [dns] - 10https://gerrit.wikimedia.org/r/826388 (https://phabricator.wikimedia.org/T304888) (owner: 10Andrew Bogott) [21:46:34] PROBLEM - 
Check systemd state on cloudelastic1001 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@cloudelastic-chi-eqiad.service,elasticsearch_7@cloudelastic-chi-eqiad.service,elasticsearch_7@cloudelastic-omega-eqiad.service,elasticsearch_7@cloudelastic-psi-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:58] PROBLEM - Check systemd state on cloudelastic1002 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@cloudelastic-chi-eqiad.service,elasticsearch_7@cloudelastic-chi-eqiad.service,elasticsearch_7@cloudelastic-omega-eqiad.service,elasticsearch_7@cloudelastic-psi-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:02] PROBLEM - Check systemd state on cloudelastic1006 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_7@cloudelastic-chi-eqiad.service,elasticsearch_7@cloudelastic-omega-eqiad.service,elasticsearch_7@cloudelastic-psi-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:02] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f2efea14280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:47:04] PROBLEM - Check systemd state on cloudelastic1005 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_7@cloudelastic-chi-eqiad.service,elasticsearch_7@cloudelastic-omega-eqiad.service,elasticsearch_7@cloudelastic-psi-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:08] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL -
CRITICAL - cloudelasticlb_9243: Servers cloudelastic1006.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled: cloudelasticlb6_9243: Servers cloudelastic1006.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled: cloudelasticlb6_8243: Servers cloudelastic1006.wikimedia.org, cloudelastic1002.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled: cloudelasticlb_8243: Servers cloudelastic1006.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:47:09] (03CR) 10Dzahn: "just realized this won't work as expected. reason is that the other jobs are already spread out by project, so the job for wikipedias vs w" [puppet] - 10https://gerrit.wikimedia.org/r/826347 (https://phabricator.wikimedia.org/T315121) (owner: 10Dzahn) [21:47:12] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fd5c4842280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:47:22] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out.
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:47:53] (03CR) 10Dzahn: [C: 04-1] mediwiki/initsitestats: change time of day to run initsitestats [puppet] - 10https://gerrit.wikimedia.org/r/826347 (https://phabricator.wikimedia.org/T315121) (owner: 10Dzahn) [21:48:20] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on 6 hosts with reason: T316159 [21:48:22] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:48:26] T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159 [21:48:37] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 6 hosts with reason: T316159 [21:49:14] (03PS1) 10Andrew Bogott: Remove temporary ns2 def for cloudservices1005 [puppet] - 10https://gerrit.wikimedia.org/r/826393 (https://phabricator.wikimedia.org/T304888) [21:50:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T314041)', diff saved to https://phabricator.wikimedia.org/P32962 and previous config saved to /var/cache/conftool/dbconfig/20220824-215025-ladsgroup.json [21:50:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1116.eqiad.wmnet with reason: Maintenance [21:50:32] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [21:50:40] RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:50:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1116.eqiad.wmnet with reason: Maintenance [21:51:03] search team's 
looking into the cloudelastic alerts [21:51:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1114.eqiad.wmnet with reason: Maintenance [21:51:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1114.eqiad.wmnet with reason: Maintenance [21:51:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T314041)', diff saved to https://phabricator.wikimedia.org/P32963 and previous config saved to /var/cache/conftool/dbconfig/20220824-215143-ladsgroup.json [21:52:19] (03CR) 10Andrew Bogott: [C: 03+2] Remove temporary ns2 def for cloudservices1005 [puppet] - 10https://gerrit.wikimedia.org/r/826393 (https://phabricator.wikimedia.org/T304888) (owner: 10Andrew Bogott) [21:52:20] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 691 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [21:54:42] ACKNOWLEDGEMENT - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 691 bytes in 0.005 second response time Brian_King unexpected result of T316159 actively working https://wikitech.wikimedia.org/wiki/Search%23Administration [21:54:42] ACKNOWLEDGEMENT - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 0.007 second response time Brian_King unexpected result of T316159 actively working https://wikitech.wikimedia.org/wiki/Search%23Administration [21:54:48] !log [Elastic] `ryankemper@cloudelastic1004:~$ sudo systemctl restart elasticsearch_6@cloudelastic-chi-eqiad.service` Restarting 1004's chi eqiad, it died due to `Aug 24 21:43:21 cloudelastic1004 systemd[1]: elasticsearch_6@cloudelastic-chi-eqiad.service: Main process exited, 
code=killed, status=9/KILL` [21:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:00] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 32, number_of_in_flight_fet [21:55:00] ask_max_waiting_in_queue_millis: 4986, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:55:00] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 32, number_of_in_flight_fet [21:55:00] ask_max_waiting_in_queue_millis: 4986, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:55:27] ACKNOWLEDGEMENT - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb_9243: Servers cloudelastic1006.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled: cloudelasticlb6_9243: Servers cloudelastic1006.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled: cloudelasticlb6_8243: Servers cloudelastic1006.wikimedia.org, cloudelastic1002.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled: cloudelasticlb_8243: Servers cloudelastic1006.wikimedia.org, cloudelastic1001.wikimedia.org,
cloudelastic1005.wikimedia.org are marked down but pooled Brian_King unexpected result of T316159 actively working https://wikitech.wikimedia.org/wiki/PyBal [21:56:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T314041)', diff saved to https://phabricator.wikimedia.org/P32964 and previous config saved to /var/cache/conftool/dbconfig/20220824-215634-ladsgroup.json [21:56:35] (03PS1) 10Dzahn: wikistats: run updates of WMF-operated wikis earlier in the day [puppet] - 10https://gerrit.wikimedia.org/r/826394 (https://phabricator.wikimedia.org/T315121) [21:56:39] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [21:58:26] !log [Elastic] `ryankemper@cloudelastic1003:~$ sudo systemctl restart elasticsearch_6@cloudelastic-chi-eqiad.service`, 1003 was also oom-killed: `[4165984.362182] Out of memory: Killed process 3759 (java) total-vm:2277062348kB, anon-rss:61648756kB, file-rss:0kB, shmem-rss:0kB, UID:113 pgtables:1448136kB oom_score_adj:0` [21:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:14] (03PS1) 10Bking: Revert "elastic: enable ES7 repo on cloudelastic" [puppet] - 10https://gerrit.wikimedia.org/r/826333 [22:00:40] (03CR) 10Aaron Schulz: [C: 03+1] Explicitly set wgMessageCacheType=mcrouter (avoid newAnything in prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822119 (https://phabricator.wikimedia.org/T186673) (owner: 10Krinkle) [22:00:48] (03CR) 10Bking: [C: 03+2] Revert "elastic: enable ES7 repo on cloudelastic" [puppet] - 10https://gerrit.wikimedia.org/r/826333 (owner: 10Bking) [22:00:50] (03CR) 10Bking: [V: 03+2 C: 03+2] Revert "elastic: enable ES7 repo on cloudelastic" [puppet] - 10https://gerrit.wikimedia.org/r/826333 (owner: 10Bking) [22:01:00] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb_9243: Servers cloudelastic1006.wikimedia.org, 
cloudelastic1001.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled: cloudelasticlb6_9243: Servers cloudelastic1006.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled: cloudelasticlb6_8243: Servers cloudelastic1006.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled: cloudelasticlb_8243: Servers cloudelastic1006.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:03:57] ACKNOWLEDGEMENT - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb_9243: Servers cloudelastic1006.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled: cloudelasticlb6_9243: Servers cloudelastic1006.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled: cloudelasticlb6_8243: Servers cloudelastic1006.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled: cloudelasticlb_8243: Servers cloudelastic1006.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1005.wikimedia.org are marked down but pooled Brian_King unexpected result of T316159 https://wikitech.wikimedia.org/wiki/PyBal [22:06:42] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetc [22:06:43] sk_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:07:22] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetc [22:07:22] sk_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:07:46] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:08:06] RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 674 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [22:08:46] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:11:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P32965 and previous config saved to /var/cache/conftool/dbconfig/20220824-221140-ladsgroup.json [22:20:16] !log [Elastic] We've got the cloudelastic instances all back up. A bunch of shard recoveries ongoing; currently the cluster is red. It might go all the way back to green; hard to say until the shard recoveries complete. 
[22:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P32966 and previous config saved to /var/cache/conftool/dbconfig/20220824-222646-ladsgroup.json [22:37:29] !log [Elastic] We're back to green in `cloudelastic-chi`, so cloudelastic is back to fully healthy [22:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:48] (03PS1) 10Ryan Kemper: elastic: don't start es7 unit until we tell it [puppet] - 10https://gerrit.wikimedia.org/r/826396 (https://phabricator.wikimedia.org/T308676) [22:41:02] (03CR) 10Ebernhardson: [C: 03+1] "concept seems appropriate, ensure this unit cannot run until we are ready for it" [puppet] - 10https://gerrit.wikimedia.org/r/826396 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [22:41:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T314041)', diff saved to https://phabricator.wikimedia.org/P32967 and previous config saved to /var/cache/conftool/dbconfig/20220824-224153-ladsgroup.json [22:41:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance [22:41:58] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [22:42:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance [22:42:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32968 and previous config saved to /var/cache/conftool/dbconfig/20220824-224214-ladsgroup.json [22:43:08] (03PS1) 10Ryan Kemper: elastic: don't start es 7 until ready [cookbooks] - 10https://gerrit.wikimedia.org/r/826397 [22:43:45] (03PS1) 10Ori: Increase 
roll-out of query-sorting to 5% [puppet] - 10https://gerrit.wikimedia.org/r/826398 (https://phabricator.wikimedia.org/T314868) [22:43:59] (03CR) 10Ebernhardson: [C: 03+1] elastic: don't start es 7 until ready [cookbooks] - 10https://gerrit.wikimedia.org/r/826397 (owner: 10Ryan Kemper) [22:45:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32969 and previous config saved to /var/cache/conftool/dbconfig/20220824-224507-ladsgroup.json [22:45:11] (03CR) 10Ori: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36961/console" [puppet] - 10https://gerrit.wikimedia.org/r/826398 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [23:00:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P32970 and previous config saved to /var/cache/conftool/dbconfig/20220824-230013-ladsgroup.json [23:08:08] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10Jclark-ctr) [23:13:21] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10Jclark-ctr) @Marostegui Can you confirm racking? i am unsure if you meant 1 per row or per rack. Did you want row diversity 1 host per row A-D a... 
[23:15:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P32971 and previous config saved to /var/cache/conftool/dbconfig/20220824-231519-ladsgroup.json [23:18:52] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder) [23:24:48] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:28:51] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder) [23:30:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32972 and previous config saved to /var/cache/conftool/dbconfig/20220824-233025-ladsgroup.json [23:30:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [23:30:30] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [23:30:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [23:30:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T314041)', diff saved to https://phabricator.wikimedia.org/P32973 and previous config saved to /var/cache/conftool/dbconfig/20220824-233046-ladsgroup.json [23:33:36] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Restarting to apply OpenJDK 8u342 - eevans@cumin1001 [23:34:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T314041)', diff saved to https://phabricator.wikimedia.org/P32974 and previous config saved to /var/cache/conftool/dbconfig/20220824-233431-ladsgroup.json [23:43:45] (03PS1) 10Tim Starling: Apply 
scaling_governor=performance to MediaWiki appservers [puppet] - 10https://gerrit.wikimedia.org/r/826405 (https://phabricator.wikimedia.org/T315398) [23:49:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P32975 and previous config saved to /var/cache/conftool/dbconfig/20220824-234937-ladsgroup.json [23:54:15] (03CR) 10Tim Starling: "pcc result: https://puppet-compiler.wmflabs.org/pcc-worker1002/36962/" [puppet] - 10https://gerrit.wikimedia.org/r/826405 (https://phabricator.wikimedia.org/T315398) (owner: 10Tim Starling) [23:56:27] (03PS2) 10Tim Starling: Apply scaling_governor=performance to MediaWiki appservers [puppet] - 10https://gerrit.wikimedia.org/r/826405 (https://phabricator.wikimedia.org/T315398) [23:59:44] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.057 second response time https://wikitech.wikimedia.org/wiki/Swift