[00:06:52] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-06-28 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:19:56] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-06-28 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:23:56] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-06-28 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:39:15] (03PS4) 10Dzahn: vrts: add promtheus blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/810087 (https://phabricator.wikimedia.org/T303190) [00:44:02] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:45:18] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:49:29] 10SRE, 10ops-eqiad: SSH on wtp1036.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311761 (10ssingh) Thanks @Cmjohnson. There is another host, ` 20:45:18 <+icinga-wm> PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-... [00:50:46] 10SRE, 10ops-eqiad: SSH on wtp1036.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311761 (10ssingh) On second thought, making it another task just for clarity. Sorry for the noise. 
[00:50:46] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-07-05 00:00:02 (3231 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:53:41] (03PS1) 10Tim Starling: Set wgStatsCacheType to mcrouter-primary-dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811394 (https://phabricator.wikimedia.org/T310662) [00:54:22] 10SRE, 10ops-eqiad: SSH on wtp1040.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T312185 (10ssingh) [00:55:44] (03CR) 10Krinkle: Set wgStatsCacheType to mcrouter-primary-dc (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811394 (https://phabricator.wikimedia.org/T310662) (owner: 10Tim Starling) [00:58:23] 10SRE, 10Traffic-Icebox, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (10Krinkle) 05Open→03Resolved a:03tstarling [01:00:52] (03PS2) 10Tim Starling: Set wgStatsCacheType to mcrouter-primary-dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811394 (https://phabricator.wikimedia.org/T310662) [01:00:55] (03CR) 10Tim Starling: Set wgStatsCacheType to mcrouter-primary-dc (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811394 (https://phabricator.wikimedia.org/T310662) (owner: 10Tim Starling) [01:04:52] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab2001.wikimedia.org.service,rsync-data-backup-gitlab2001.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:08:04] ^ we have an open ticket about that but were hoping it to be fixed this time [01:10:46] (03CR) 10Dzahn: [C: 03+2] "body_regex_matches needs to be an array, same fix as for the gitlab check" [puppet] - 10https://gerrit.wikimedia.org/r/810087 
(https://phabricator.wikimedia.org/T303190) (owner: 10Dzahn) [01:14:15] (03CR) 10Dzahn: [C: 03+2] "double checked puppet on otrs1001. no error" [puppet] - 10https://gerrit.wikimedia.org/r/810087 (https://phabricator.wikimedia.org/T303190) (owner: 10Dzahn) [01:16:42] ACKNOWLEDGEMENT - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab2001.wikimedia.org.service,rsync-data-backup-gitlab2001.wikimedia.org.service daniel_zahn https://phabricator.wikimedia.org/T274463 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:18:00] the alert is about a systemd unit trying to rsync to a VM that has been decom'ed. that's all [01:21:25] !log gitlab1004 rm /lib/systemd/system/rsync-data-backup-gitlab2001.wikimedia.org.* ; systemctl reset-failed (T274463, T307142) - fix icinga alert after gitlab2001 was decom'ed, we didn't have puppet remove the timer/service [01:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:31] T307142: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 [01:21:31] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463 [01:21:32] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:28:52] !log gitlab1004 - rm /lib/systemd/system/rsync-config-backup-gitlab2001.wikimedia.org.* [01:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:36] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:47:15] (03PS1) 10DLynch: Revert "Hide the lede section on mobile when DiscussionTools is enabled" [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811406 (https://phabricator.wikimedia.org/T312177) 
[01:57:54] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-07-05 00:00:02 (3210 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:19:25] 10SRE, 10Performance-Team, 10Traffic, 10serviceops: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) [02:19:46] 10SRE, 10Performance-Team, 10Traffic, 10serviceops: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) I edited the task description with a proposed rollout plan, and I renamed the task to encompass the actual work, not just deciding on the work. [02:21:38] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:22:58] (03CR) 10Tim Starling: [C: 03+2] "This is a prerequisite for the WRStats backport." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811394 (https://phabricator.wikimedia.org/T310662) (owner: 10Tim Starling) [02:23:42] (03Merged) 10jenkins-bot: Set wgStatsCacheType to mcrouter-primary-dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811394 (https://phabricator.wikimedia.org/T310662) (owner: 10Tim Starling) [02:26:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:27:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:27:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:16] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T310662 g 811394 harmless prerequisite (duration: 03m 39s) [02:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:20] T310662: Acceptably efficient AbuseFilter profiling storage backend - https://phabricator.wikimedia.org/T310662 [02:32:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:41] (03PS1) 10Tim Starling: Introduce new WRStats library for write-read stats [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811407 (https://phabricator.wikimedia.org/T310662) [02:37:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:36] (03PS3) 10Tim Starling: [MultiDC] Switch $wgCentralAuthTokenCacheType to mcrouter-primary-dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809326 (https://phabricator.wikimedia.org/T278392) (owner: 10Krinkle) [02:52:10] (03CR) 10Tim Starling: [C: 03+2] Introduce new WRStats library for write-read stats [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811407 (https://phabricator.wikimedia.org/T310662) (owner: 10Tim Starling) [03:10:45] (03Merged) 10jenkins-bot: Introduce new WRStats library for write-read stats [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811407 
(https://phabricator.wikimedia.org/T310662) (owner: 10Tim Starling) [03:14:14] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-07-05 00:00:01 (3210 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:17:18] !log tstarling@deploy1002 Started scap: WRStats core prereq T310662 g811407 [03:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:17:22] T310662: Acceptably efficient AbuseFilter profiling storage backend - https://phabricator.wikimedia.org/T310662 [03:18:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:18:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:18:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:19:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:22:56] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:23:36] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:28:18] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.133 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:34:39] !log tstarling@deploy1002 Finished scap: WRStats core prereq T310662 g811407 (duration: 17m 20s) [03:34:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:34:42] T310662: Acceptably efficient AbuseFilter profiling storage backend - https://phabricator.wikimedia.org/T310662 [03:51:55] (03PS1) 10Tim Starling: FilterProfiler: use WRStats [extensions/AbuseFilter] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811408 (https://phabricator.wikimedia.org/T310662) [03:52:56] (03CR) 10Tim Starling: [C: 03+2] FilterProfiler: use WRStats [extensions/AbuseFilter] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811408 (https://phabricator.wikimedia.org/T310662) (owner: 10Tim Starling) [04:06:33] (03Merged) 10jenkins-bot: FilterProfiler: use WRStats [extensions/AbuseFilter] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811408 (https://phabricator.wikimedia.org/T310662) (owner: 10Tim Starling) [04:15:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [04:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:16:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [04:16:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [04:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:16:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [04:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:46] !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/AbuseFilter: T310662 deployment with possible post-send error spike due to ServiceWiring/FilterProfiler interdependency (duration: 03m 33s) [04:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:49] T310662: Acceptably efficient AbuseFilter profiling storage backend - https://phabricator.wikimedia.org/T310662 [04:19:52] 
PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8803.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:22:00] PROBLEM - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate on alert1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [250000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now [04:29:14] RECOVERY - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate on alert1001 is OK: OK: Less than 1.00% above the threshold [100000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now [04:30:40] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:37:58] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:48:29] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Marostegui) Thank you! 
[05:04:26] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: After restart', diff saved to https://phabricator.wikimedia.org/P30868 and previous config saved to /var/cache/conftool/dbconfig/20220706-050615-root.json [05:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:16] (03PS1) 10Marostegui: instances.yaml: Add db2159 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/811485 (https://phabricator.wikimedia.org/T311493) [05:09:07] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2159 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/811485 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [05:10:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2159 to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P30869 and previous config saved to /var/cache/conftool/dbconfig/20220706-051046-marostegui.json [05:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:51] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493 [05:11:29] (03PS1) 10Marostegui: db2158: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/811567 (https://phabricator.wikimedia.org/T311493) [05:11:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: codfw s6 sanitarium master switch [05:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: codfw s6 sanitarium master switch [05:11:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:00] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:19:36] (03CR) 10Marostegui: [C: 03+2] db2158: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/811567 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [05:21:19] (03PS1) 10Marostegui: mariadb: db2076 no longer sanitariu master [puppet] - 10https://gerrit.wikimedia.org/r/811575 (https://phabricator.wikimedia.org/T311493) [05:21:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 2%: After restart', diff saved to https://phabricator.wikimedia.org/P30870 and previous config saved to /var/cache/conftool/dbconfig/20220706-052119-root.json [05:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:03] (03CR) 10Marostegui: [C: 03+2] mariadb: db2076 no longer sanitariu master [puppet] - 10https://gerrit.wikimedia.org/r/811575 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [05:33:29] (03PS1) 10Marostegui: db2159: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/811577 (https://phabricator.wikimedia.org/T311493) [05:33:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 12 hosts with reason: codfw s7 sanitarium master switch [05:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 12 hosts with reason: codfw s7 sanitarium master switch [05:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:09] (03CR) 10Marostegui: [C: 03+2] db2159: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/811577 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [05:35:28] PROBLEM - 
Backup freshness on backup1001 is CRITICAL: Stale: 1 (phab1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:36:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: After restart', diff saved to https://phabricator.wikimedia.org/P30871 and previous config saved to /var/cache/conftool/dbconfig/20220706-053623-root.json [05:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:02] (03PS1) 10Marostegui: mariadb: db2077 no longer s7 codfw sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/811578 (https://phabricator.wikimedia.org/T311493) [05:39:09] (03CR) 10Marostegui: [C: 03+2] mariadb: db2077 no longer s7 codfw sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/811578 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [05:41:10] 10ops-codfw, 10decommission-hardware: decommission db2076 - https://phabricator.wikimedia.org/T312190 (10Marostegui) [05:41:47] 10ops-codfw, 10decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (10Marostegui) [05:42:18] 10ops-codfw, 10decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (10Marostegui) [05:42:20] 10ops-codfw, 10decommission-hardware: decommission db2076 - https://phabricator.wikimedia.org/T312190 (10Marostegui) [05:45:56] !log dbmaint x1@eqiad T312161 [05:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:59] T312161: Adjust the field type of cx_lists.cxl_start_time/cxl_end_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312161 [05:46:12] !log dbmaint s3@eqiad T312161 [05:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:55] !log dbmaint s3@eqiad T312162 [05:48:57] !log dbmaint x1@eqiad T312162 [05:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:58] T312162: Adjust the field type of 
cx_notification_log.cxn_date/cxn_newest to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312162 [05:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After restart', diff saved to https://phabricator.wikimedia.org/P30872 and previous config saved to /var/cache/conftool/dbconfig/20220706-055127-root.json [05:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:44] (03PS3) 10Giuseppe Lavagetto: mediawiki: install php7.4 on jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/808910 (https://phabricator.wikimedia.org/T311386) [06:01:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ayounsi) Agreed! However, >>! In T305414#8034695, @Jclark-ctr wrote: > cloudweb1003 c8 u39 20220099 port 10 (cloudsw2-c8-eqiad) > clo... 
[06:06:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After restart', diff saved to https://phabricator.wikimedia.org/P30873 and previous config saved to /var/cache/conftool/dbconfig/20220706-060631-root.json [06:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: install php7.4 on jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/808910 (https://phabricator.wikimedia.org/T311386) (owner: 10Giuseppe Lavagetto) [06:16:50] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:21:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After restart', diff saved to https://phabricator.wikimedia.org/P30874 and previous config saved to /var/cache/conftool/dbconfig/20220706-062135-root.json [06:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:47] marostegui: is your work impacting thanos-fe2003 ? 
it's saturating lvs2009 [06:30:30] XioNoX: hmm, i just started a snapshot from elastic2* to thanos-swift.discovery.wmnet [06:30:41] actually it's a ton of elastic hosts in codfw flooding lvs2009 [06:30:43] ebernhardson: :) [06:30:52] lemme see if i can back that off, sec [06:30:54] (03CR) 10Giuseppe Lavagetto: sre: add php busy workers alerts for parsoid, jobrunners (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/797313 (owner: 10Giuseppe Lavagetto) [06:30:58] XioNoX: nop [06:31:04] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2009&viewPanel=8 [06:31:18] (03PS3) 10Giuseppe Lavagetto: mediawiki: install php7.4 on the maintenance server [puppet] - 10https://gerrit.wikimedia.org/r/808911 (https://phabricator.wikimedia.org/T311386) [06:31:48] https://librenms.wikimedia.org/device/device=94/tab=port/port=21632/ [06:31:53] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: install php7.4 on the maintenance server [puppet] - 10https://gerrit.wikimedia.org/r/808911 (https://phabricator.wikimedia.org/T311386) (owner: 10Giuseppe Lavagetto) [06:33:29] (03CR) 10Ayounsi: netops: add DNS probes alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/811207 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [06:36:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After restart', diff saved to https://phabricator.wikimedia.org/P30875 and previous config saved to /var/cache/conftool/dbconfig/20220706-063639-root.json [06:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:45] XioNoX: it should be calming itself down now [06:37:55] ebernhardson: yep, got the recovery, thanks! [06:39:19] i wonder if high-bandwidth-ish things (this is trying to move 1.5tb between clusters) would be better avoiding lvs? 
i suppose there isn't much other option though [06:43:51] (03CR) 10Jcrespo: [C: 03+2] Add new user for dbbackups database for django dashboard [puppet] - 10https://gerrit.wikimedia.org/r/810885 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [06:44:40] ebernhardson: LVS should have more bandwidth than the hosts they front :) [06:47:04] well i was thinking more that lvs architecture is more specialized for small requests and arbitrary responses, but here i'm trying to push 1.5TB through lvs [06:47:12] ebernhardson: but yeah for now the short term workaround is to bypass it or rate limit it [06:48:02] i've adjusted the rate limit for now, it was 40mb * 32 shards, bumped down to 20mb * 32 [06:48:11] thanks! [06:51:31] RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1002 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [06:51:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After restart', diff saved to https://phabricator.wikimedia.org/P30876 and previous config saved to /var/cache/conftool/dbconfig/20220706-065143-root.json [06:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:34] (03PS4) 10Jcrespo: bacula::storage: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [06:58:13] (03CR) 10Jcrespo: [C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36196/backup1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [06:58:24] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: introduce blackbox::module [puppet] - 10https://gerrit.wikimedia.org/r/811295 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [06:58:31] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: move blackbox http check to prometheus::rule [puppet] - 10https://gerrit.wikimedia.org/r/811294 (https://phabricator.wikimedia.org/T305847) (owner: 
10Filippo Giunchedi) [06:58:39] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: deploy custom probedown alerts [puppet] - 10https://gerrit.wikimedia.org/r/811241 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [06:58:42] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: deploy alerts as yml not yaml [puppet] - 10https://gerrit.wikimedia.org/r/811242 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [06:58:45] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: switch to blackbox::module [puppet] - 10https://gerrit.wikimedia.org/r/811296 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [06:58:45] RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1001 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [06:59:24] (03PS5) 10Jcrespo: bacula::storage: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [07:00:04] Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220706T0700). [07:00:05] kemayo: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:59] I am, indeed, around. [07:05:18] Kemayo: can you self-service? [07:05:36] Amir1: I don't think I have the relevant permissions. 
[07:06:04] ok [07:06:11] backports would take a bit [07:06:24] (03CR) 10Ladsgroup: [C: 03+2] Revert "Hide the lede section on mobile when DiscussionTools is enabled" [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811406 (https://phabricator.wikimedia.org/T312177) (owner: 10DLynch) [07:07:30] PROBLEM - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:07:32] RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1006 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [07:08:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P30878 and previous config saved to /var/cache/conftool/dbconfig/20220706-070835-ladsgroup.json [07:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:46] (03PS1) 10Marostegui: mariadb: Productionize db2160 [puppet] - 10https://gerrit.wikimedia.org/r/811581 (https://phabricator.wikimedia.org/T311493) [07:09:49] (03PS2) 10Filippo Giunchedi: netops: add DNS probes alerts [alerts] - 10https://gerrit.wikimedia.org/r/811207 (https://phabricator.wikimedia.org/T169860) [07:09:51] (03PS1) 10Filippo Giunchedi: sre: expand comments re: probes alerts and puppet [alerts] - 10https://gerrit.wikimedia.org/r/811582 [07:10:44] (03CR) 10Filippo Giunchedi: "Thank you for the reviews!" 
[alerts] - 10https://gerrit.wikimedia.org/r/811207 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [07:10:54] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2160 [puppet] - 10https://gerrit.wikimedia.org/r/811581 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [07:11:45] (JobUnavailable) firing: Reduced availability for job mysql-misc in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:11:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P30879 and previous config saved to /var/cache/conftool/dbconfig/20220706-071157-ladsgroup.json [07:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:05] (03Merged) 10jenkins-bot: Revert "Hide the lede section on mobile when DiscussionTools is enabled" [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811406 (https://phabricator.wikimedia.org/T312177) (owner: 10DLynch) [07:12:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [07:14:38] Kemayo: It's live in mwdebug1002 [07:14:44] do you know how to test it? [07:15:03] (03PS2) 10Filippo Giunchedi: sre: expand comments re: probes alerts and puppet [alerts] - 10https://gerrit.wikimedia.org/r/811582 [07:15:49] (03CR) 10Jcrespo: [C: 03+2] bacula::storage: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [07:15:51] Amir1: I do, one second [07:16:48] Amir1: Looks good! 
[07:16:57] awesome, gonna sync [07:17:45] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: expand comments re: probes alerts and puppet [alerts] - 10https://gerrit.wikimedia.org/r/811582 (owner: 10Filippo Giunchedi) [07:19:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:40] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:19:40] PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:19:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:19:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:42] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/DiscussionTools/modules/dt.init.less: Backport: [[gerrit:811406|Revert "Hide the lede section on mobile when DiscussionTools is enabled" (T312177)]] (duration: 03m 37s) [07:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:20:44] T312177: mediawiki.org Main Page issue on mobile - https://phabricator.wikimedia.org/T312177 [07:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:59] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Ladsgroup) Disabling P_S increased number of concurrent connections I could make (700*) but still started to 
throw the same errors... [07:21:58] mmhh there might be some P A G E alerts coming in, in the alert text only, not actual oncall pages tho [07:22:08] Amir1: Thanks for the help! [07:22:35] Kemayo: thank you for building this awesome tool. I just hit shiny buttons and copy pasted stuff [07:23:06] 😂 [07:23:31] godog: https://phabricator.wikimedia.org/T312194 got created too [07:23:57] RhinosF1: ah thank you, yeah that makes sense! [07:24:26] godog: np [07:24:48] RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1005 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [07:25:04] Alert went off with the text in -serviceops for gitlab [07:27:21] ah that explains why we didn't see the P A G E here [07:28:31] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2024.codfw.wmnet with reason: Remove node for reimage [07:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2024.codfw.wmnet with reason: Remove node for reimage [07:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:16] godog: there was only 1. Not the list on the task. [07:29:45] I think you're in #wikimedia-serviceops though so you can see [07:29:51] (03CR) 10Ayounsi: [C: 03+1] "Nice!" 
[alerts] - 10https://gerrit.wikimedia.org/r/811207 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [07:30:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1135, if anything breaks, it's marostegui's fault (T311106)', diff saved to https://phabricator.wikimedia.org/P30880 and previous config saved to /var/cache/conftool/dbconfig/20220706-073052-ladsgroup.json [07:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:57] T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 [07:30:58] XD [07:31:34] RhinosF1: yeah, it was one notification though notice the (8) in the text, meaning the actual alerts firing are 8 [07:31:53] I see! [07:31:58] (03CR) 10Filippo Giunchedi: [C: 03+2] netops: add DNS probes alerts [alerts] - 10https://gerrit.wikimedia.org/r/811207 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [07:32:02] (03PS3) 10Filippo Giunchedi: netops: add DNS probes alerts [alerts] - 10https://gerrit.wikimedia.org/r/811207 (https://phabricator.wikimedia.org/T169860) [07:32:26] PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:32:39] haproxy alerts are expected [07:33:16] uh? ack! 
[07:35:27] (03PS1) 10Giuseppe Lavagetto: mediawiki: install php 7.4 on all maintenance servers [puppet] - 10https://gerrit.wikimedia.org/r/811585 (https://phabricator.wikimedia.org/T311386) [07:36:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: install php 7.4 on all maintenance servers [puppet] - 10https://gerrit.wikimedia.org/r/811585 (https://phabricator.wikimedia.org/T311386) (owner: 10Giuseppe Lavagetto) [07:40:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 10%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30881 and previous config saved to /var/cache/conftool/dbconfig/20220706-074028-ladsgroup.json [07:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:32] T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 [07:40:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30882 and previous config saved to /var/cache/conftool/dbconfig/20220706-074051-ladsgroup.json [07:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2024.codfw.wmnet with OS bullseye [07:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:09] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2024.codfw.wmnet with OS bullseye [07:42:32] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (31) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, 
cloudservices1003, cloudservices1004, elastic2049, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003, thumbor1006, thumbor2003, thumbor2004, thumbor2005, thumbor2006 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [07:43:53] (03CR) 10Muehlenhoff: [C: 03+1] admin: add gitlab-roots group to gitlab_runner role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) (owner: 10Dzahn) [07:45:54] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) >>! In T311106#8054391, @Ladsgroup wrote: > Disabling P_S increased number of concurrent connections I could make (700*... [07:47:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1143', diff saved to https://phabricator.wikimedia.org/P30883 and previous config saved to /var/cache/conftool/dbconfig/20220706-074721-root.json [07:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:32] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) ` ===== NODE GROUP ===== (4) db[1111,1127,1132,1143].eqiad.wmnet ----- OUTPUT of 'sudo mysql -e "s...rformance_schema'... 
[07:52:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30884 and previous config saved to /var/cache/conftool/dbconfig/20220706-075206-root.json [07:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30885 and previous config saved to /var/cache/conftool/dbconfig/20220706-075211-root.json [07:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30886 and previous config saved to /var/cache/conftool/dbconfig/20220706-075224-root.json [07:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 25%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30887 and previous config saved to /var/cache/conftool/dbconfig/20220706-075532-ladsgroup.json [07:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:36] T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 [07:55:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30888 and previous config saved to /var/cache/conftool/dbconfig/20220706-075555-ladsgroup.json [07:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:14] RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1003 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [07:56:18] RECOVERY - Check 
systemd state on mw2378 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:56:40] (03PS3) 10Giuseppe Lavagetto: mediawiki: install php7.4 on all appservers [puppet] - 10https://gerrit.wikimedia.org/r/808912 (https://phabricator.wikimedia.org/T311386) [07:58:03] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2024.codfw.wmnet with reason: host reimage [07:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:04] jnuche and dduvall: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220706T0800). [08:01:27] !log elukey@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1028.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001 [08:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1028.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001 [08:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2024.codfw.wmnet with reason: host reimage [08:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:35] (03PS1) 10Jaime Nuche: group1 wikis to 1.39.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811589 (https://phabricator.wikimedia.org/T308072) [08:03:37] (03CR) 10Jaime Nuche: [C: 03+2] group1 wikis to 1.39.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811589 (https://phabricator.wikimedia.org/T308072) (owner: 10Jaime Nuche) [08:04:20] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811589 (https://phabricator.wikimedia.org/T308072) (owner: 
10Jaime Nuche) [08:07:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 2%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30889 and previous config saved to /var/cache/conftool/dbconfig/20220706-080710-root.json [08:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 2%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30890 and previous config saved to /var/cache/conftool/dbconfig/20220706-080715-root.json [08:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 2%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30891 and previous config saved to /var/cache/conftool/dbconfig/20220706-080728-root.json [08:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:55] !log elukey@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1029.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001 [08:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:32] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.19 refs T308072 [08:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:35] T308072: 1.39.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T308072 [08:09:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1029.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001 [08:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:19] RECOVERY - puppet last run on ms-be1029 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun 
[08:10:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 75%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30892 and previous config saved to /var/cache/conftool/dbconfig/20220706-081036-ladsgroup.json [08:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:41] T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 [08:10:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30893 and previous config saved to /var/cache/conftool/dbconfig/20220706-081059-ladsgroup.json [08:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:05] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [08:12:12] !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.19 refs T308072 (duration: 03m 39s) [08:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:56] !log elukey@cumin1001 START - Cookbook 
sre.puppet.renew-cert for ms-be1030.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001 [08:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:15] !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1030.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001 [08:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:59] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:18:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2024.codfw.wmnet with OS bullseye [08:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:50] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2024.codfw.wmnet with OS bullseye completed: - ganeti2024 (**PASS**) - Downtimed on... [08:20:13] !log elukey@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1031.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001 [08:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:45] (03CR) 10Slavina Stefanova: novafullstack: Refactor and minor fix (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/811316 (owner: 10David Caro) [08:20:49] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) a:03Papaul @papaul, nice! We should keep all the same switch's uplinks on the same breakout cable: So instead of doing: 0/0 - asw2-c-eqiad:xe-2/0/[44... 
[08:21:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1031.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001 [08:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30894 and previous config saved to /var/cache/conftool/dbconfig/20220706-082214-root.json [08:22:15] RECOVERY - puppet last run on ms-be1031 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30895 and previous config saved to /var/cache/conftool/dbconfig/20220706-082219-root.json [08:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30896 and previous config saved to /var/cache/conftool/dbconfig/20220706-082232-root.json [08:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:42] !log elukey@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1032.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001 [08:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1032.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001 [08:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:37] !log elukey@cumin1001 START - Cookbook sre.puppet.renew-cert for 
ms-be1033.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001 [08:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 100%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30897 and previous config saved to /var/cache/conftool/dbconfig/20220706-082540-ladsgroup.json [08:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:51] T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 [08:26:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30898 and previous config saved to /var/cache/conftool/dbconfig/20220706-082603-ladsgroup.json [08:26:05] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:55] !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1033.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001 [08:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:47] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:28:42] 10SRE, 10Icinga, 10Observability-Alerting: PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (10Volans) I see that now the crontab entries are: ` */5 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga --tries 5 --sleep 60 alert2001.wiki... 
[08:30:32] RECOVERY - puppet last run on ms-be1033 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:31:51] (03PS1) 10Urbanecm: GrowthExperiments: Remove redundant IS-labs definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811661 [08:31:53] (03PS1) 10Urbanecm: [beta] Remove overrides for GrowthExperiments enable percentage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811662 [08:33:25] jnuche: Is the train deployment complete? Any objections if I merge a Beta Cluster only config patch? [08:34:08] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:34:55] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10jcrespo) > Disable P_S on all the sX hosts that run 10.6 Note disabling P_S on production hosts will break the query killer. 
[08:35:33] (03PS4) 10Giuseppe Lavagetto: mediawiki: install php7.4 on all appservers [puppet] - 10https://gerrit.wikimedia.org/r/808912 (https://phabricator.wikimedia.org/T311386) [08:36:33] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1029.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30899 and previous config saved to /var/cache/conftool/dbconfig/20220706-083718-root.json [08:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30900 and previous config saved to /var/cache/conftool/dbconfig/20220706-083723-root.json [08:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30901 and previous config saved to /var/cache/conftool/dbconfig/20220706-083736-root.json [08:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:51] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1029.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:54] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1030.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: install php7.4 on all appservers [puppet] - 
10https://gerrit.wikimedia.org/r/808912 (https://phabricator.wikimedia.org/T311386) (owner: 10Giuseppe Lavagetto) [08:39:10] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1030.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:39:12] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1031.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:09] (03CR) 10Kosta Harlan: [C: 03+1] [beta] Remove overrides for GrowthExperiments enable percentage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811662 (owner: 10Urbanecm) [08:40:17] (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: Remove redundant IS-labs definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811661 (owner: 10Urbanecm) [08:40:29] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1031.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:40:30] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1032.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:42] (03CR) 10Slavina Stefanova: novafullstack: Refactor and minor fix (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/811316 (owner: 10David Caro) [08:41:46] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1032.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:48] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1033.eqiad.wmnet: Renew puppet certificate - 
mvernon@cumin1001 [08:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:56] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) >>! In T311106#8054563, @jcrespo wrote: >> Disable P_S on all the sX hosts that run 10.6 > > Note disabling P_S on pro... [08:43:10] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1033.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:40] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8803.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:43:46] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:44:02] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811270 (owner: 10David Caro) [08:44:22] (03PS1) 10Urbanecm: [beta] GrowthExperiments: Remove variables that are primarily set on-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811663 [08:44:24] (03PS1) 10Urbanecm: GrowthExperiments: Remove wgGEHomepageTutorialTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811664 [08:44:27] (03CR) 10David Caro: [C: 03+2] Revert "profile::mariadb::packages_wmf: Remove support for stretch" [puppet] - 10https://gerrit.wikimedia.org/r/811270 (owner: 10David Caro) [08:45:15] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) [08:46:13] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2028.codfw.wmnet: Renew puppet certificate - 
mvernon@cumin1001 [08:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:22] (03CR) 10David Caro: [C: 03+2] toolsdb: enable pt-heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/763584 (owner: 10Majavah) [08:47:32] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2028.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:34] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2029.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:54] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2029.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:56] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2030.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:38] RECOVERY - puppet last run on ms-be2028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:50:18] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2030.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:20] phuedx: hi, yeah, the train deployment is finished, please go ahead [08:50:20] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2031.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:33] jnuche: Thanks! 
[08:51:39] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2031.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:41] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2032.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30902 and previous config saved to /var/cache/conftool/dbconfig/20220706-085221-root.json [08:52:24] RECOVERY - puppet last run on ms-be2030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:27] (03PS1) 10Slyngshede: Add Nagios script for monitoring the Dell PERC RAID controller. 
[puppet] - 10https://gerrit.wikimedia.org/r/811667 [08:52:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30903 and previous config saved to /var/cache/conftool/dbconfig/20220706-085227-root.json [08:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:38] RECOVERY - puppet last run on ms-be2031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30904 and previous config saved to /var/cache/conftool/dbconfig/20220706-085240-root.json [08:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:57] Since it's BC-only. I'll merge it, let it roll out to the Beta Cluster, and then pull it onto the deployment host [08:53:02] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2032.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:53:04] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2033.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:07] (03CR) 10CI reject: [V: 04-1] Add Nagios script for monitoring the Dell PERC RAID controller. 
[puppet] - 10https://gerrit.wikimedia.org/r/811667 (owner: 10Slyngshede) [08:54:23] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2033.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:54:25] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2034.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:14] RECOVERY - puppet last run on ms-be2033 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:55:45] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2034.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:29] (03CR) 10Phuedx: [C: 03+2] beta: Add mediawiki.web_ui.interactions event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811322 (https://phabricator.wikimedia.org/T311268) (owner: 10Phuedx) [08:57:23] (03Merged) 10jenkins-bot: beta: Add mediawiki.web_ui.interactions event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811322 (https://phabricator.wikimedia.org/T311268) (owner: 10Phuedx) [08:57:58] (03PS2) 10Slyngshede: Add Nagios script for monitoring the Dell PERC RAID controller. [puppet] - 10https://gerrit.wikimedia.org/r/811667 [08:58:34] (03CR) 10CI reject: [V: 04-1] Add Nagios script for monitoring the Dell PERC RAID controller. 
[puppet] - 10https://gerrit.wikimedia.org/r/811667 (owner: 10Slyngshede) [08:58:48] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2036.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [08:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:07] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2036.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:00:09] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2037.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:38] (03PS3) 10Slyngshede: Add Nagios script for monitoring the Dell PERC RAID controller. [puppet] - 10https://gerrit.wikimedia.org/r/811667 [09:01:17] (03CR) 10CI reject: [V: 04-1] Add Nagios script for monitoring the Dell PERC RAID controller. [puppet] - 10https://gerrit.wikimedia.org/r/811667 (owner: 10Slyngshede) [09:01:30] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2037.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:32] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2038.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:54] I've updated the deployment host prior to the next backport window. 
I'm testing on the Beta Cluster now [09:02:00] RECOVERY - puppet last run on ms-be2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:02:06] (03PS4) 10Slyngshede: Add Nagios script for monitoring the Dell PERC RAID controller. [puppet] - 10https://gerrit.wikimedia.org/r/811667 [09:02:15] (03CR) 10David Caro: [C: 03+1] "Adding Moritz and Slyngshede to review the repos side, the kubeadm looks good :+1:" [puppet] - 10https://gerrit.wikimedia.org/r/802143 (owner: 10Majavah) [09:02:53] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2038.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:02:55] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2039.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:16] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2039.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:04:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:05:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:09] (03CR) 10JMeybohm: [C: 03+1] sre: add php busy workers alerts for parsoid, jobrunners (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/797313 (owner: 10Giuseppe Lavagetto) [09:06:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/802143 (owner: 10Majavah) [09:06:38] phuedx: i assume you're done with your beta cluster deployment? [09:06:55] (i'd like to do my own) [09:07:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30905 and previous config saved to /var/cache/conftool/dbconfig/20220706-090725-root.json [09:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:30] (03CR) 10David Caro: [C: 03+2] kubeadm: drop support for 1.20 [puppet] - 10https://gerrit.wikimedia.org/r/802143 (owner: 10Majavah) [09:07:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30906 and previous config saved to /var/cache/conftool/dbconfig/20220706-090731-root.json [09:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:38] (03CR) 10Slyngshede: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/802143 (owner: 10Majavah) [09:07:41] urbanecm: Yes :) [09:07:43] All yours [09:07:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30907 and previous config saved to /var/cache/conftool/dbconfig/20220706-090744-root.json [09:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:49] thanks! 
[09:08:26] (03PS2) 10Urbanecm: [beta] GrowthExperiments: Remove redundant IS-labs definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811661 [09:08:30] (03PS3) 10Urbanecm: [beta] GrowthExperiments: Remove redundant IS-labs definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811661 [09:08:34] (03CR) 10Urbanecm: [C: 03+2] [beta] GrowthExperiments: Remove redundant IS-labs definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811661 (owner: 10Urbanecm) [09:08:48] (03PS2) 10Urbanecm: [beta] Remove overrides for GrowthExperiments enable percentage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811662 [09:08:53] (03CR) 10Urbanecm: [C: 03+2] [beta] Remove overrides for GrowthExperiments enable percentage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811662 (owner: 10Urbanecm) [09:09:06] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1035.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:45] (03Merged) 10jenkins-bot: [beta] GrowthExperiments: Remove redundant IS-labs definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811661 (owner: 10Urbanecm) [09:09:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet [09:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:25] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1035.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:10:27] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1036.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:31] (03Merged) 10jenkins-bot: [beta] Remove overrides for GrowthExperiments 
enable percentage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811662 (owner: 10Urbanecm) [09:10:58] * urbanecm done [09:11:43] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1036.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:45] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1037.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:13] (03PS3) 10Aqu: [WIP] Build spark assembly for Spark3 [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) [09:13:04] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1037.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:13:06] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1038.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:35] (03CR) 10CI reject: [V: 04-1] [WIP] Build spark assembly for Spark3 [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu) [09:14:24] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1038.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:14:26] !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1039.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:50] (03CR) 10JMeybohm: k8s: Add KubernetesNode.taints propertry (031 
comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:15:36] (03PS3) 10JMeybohm: Alert on helm releases in bad state [alerts] - 10https://gerrit.wikimedia.org/r/808968 (https://phabricator.wikimedia.org/T310714) [09:15:44] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1039.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001 [09:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:10] RECOVERY - puppet last run on ms-be2037 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:17:14] (03CR) 10Jelto: [C: 03+2] site/hiera: remove gitlab2001 after decom [puppet] - 10https://gerrit.wikimedia.org/r/811362 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [09:17:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P30908 and previous config saved to /var/cache/conftool/dbconfig/20220706-091717-root.json [09:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:49] 
(03PS2) 10Jelto: site/hiera: remove gitlab2001 after decom [puppet] - 10https://gerrit.wikimedia.org/r/811362 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [09:18:02] (03PS1) 10David Caro: distributions-wikimedia: add note to the docs [puppet] - 10https://gerrit.wikimedia.org/r/811671 [09:19:29] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, see a couple of minor comments." [alerts] - 10https://gerrit.wikimedia.org/r/808968 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [09:20:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/811671 (owner: 10David Caro) [09:21:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: After restart', diff saved to https://phabricator.wikimedia.org/P30911 and previous config saved to /var/cache/conftool/dbconfig/20220706-092130-root.json [09:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:04] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) After having a chat with Jaime: - db1132 got P_S enabled but with `performance-schema-instrument='memory/%=OFF'` [09:22:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30912 and previous config saved to /var/cache/conftool/dbconfig/20220706-092229-root.json [09:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30913 and previous config saved to /var/cache/conftool/dbconfig/20220706-092237-root.json [09:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:48] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'db1143 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30914 and previous config saved to /var/cache/conftool/dbconfig/20220706-092248-root.json [09:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:22:57] (03CR) 10David Caro: [C: 03+2] P:wmcs::metricsinfra::prometheus: enable thanos sidecar [puppet] - 10https://gerrit.wikimedia.org/r/806551 (https://phabricator.wikimedia.org/T286301) (owner: 10Majavah) [09:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:01] (03CR) 10David Caro: [C: 03+2] P:metricsinfra: add thanos query [puppet] - 10https://gerrit.wikimedia.org/r/806552 (https://phabricator.wikimedia.org/T286301) (owner: 10Majavah) [09:23:07] (03CR) 10David Caro: [C: 03+2] P:metricsinfra::haproxy: add thanos routing [puppet] - 10https://gerrit.wikimedia.org/r/806553 (https://phabricator.wikimedia.org/T286301) (owner: 10Majavah) [09:23:19] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2024.codfw.wmnet [09:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:39] (03CR) 10Hnowlan: [C: 03+1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [09:23:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:12] (03PS4) 10David Caro: Remove hiera files for nonexistent Cloud VPS instances [puppet] - 
10https://gerrit.wikimedia.org/r/767925 (owner: 10Majavah) [09:24:30] PROBLEM - Check systemd state on ganeti2024 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:04] (03CR) 10David Caro: [C: 03+2] distributions-wikimedia: add note to the docs [puppet] - 10https://gerrit.wikimedia.org/r/811671 (owner: 10David Caro) [09:25:08] (03CR) 10CI reject: [V: 04-1] Remove hiera files for nonexistent Cloud VPS instances [puppet] - 10https://gerrit.wikimedia.org/r/767925 (owner: 10Majavah) [09:25:16] RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [09:26:03] (03CR) 10David Caro: Remove hiera files for nonexistent Cloud VPS instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767925 (owner: 10Majavah) [09:26:40] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:27:23] (03PS5) 10David Caro: Remove hiera files for nonexistent Cloud VPS instances [puppet] - 10https://gerrit.wikimedia.org/r/767925 (owner: 10Majavah) [09:27:25] (03CR) 10David Caro: Remove hiera files for nonexistent Cloud VPS instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767925 (owner: 10Majavah) [09:28:46] (03PS4) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) [09:28:57] (03PS4) 10Aqu: [WIP] Build spark assembly for Spark3 [puppet] - 10https://gerrit.wikimedia.org/r/810951 
(https://phabricator.wikimedia.org/T310578) [09:32:44] Anyone familiar with the Beta Cluster config? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/811322 doesn't seem to be applying and I can't figure out why [09:33:08] urbanecm maybe? ^ [09:33:27] phuedx: what does "doesn't seem to be applying" mean? [09:33:33] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:32] urbanecm: The stream that I added isn't visible in the output of https://en.wikipedia.beta.wmflabs.org/w/api.php?action=streamconfigs&format=json&all_settings=1, say [09:34:36] (03PS1) 10Jelto: wikimedia.org: remove gitlab-replica-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/811674 (https://phabricator.wikimedia.org/T307142) [09:35:06] Hrrm [09:36:13] i see [09:36:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 2%: After restart', diff saved to https://phabricator.wikimedia.org/P30915 and previous config saved to /var/cache/conftool/dbconfig/20220706-093634-root.json [09:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:07] at the very least, i can confirm the config is on the servers themselves [09:37:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30916 and previous config saved to /var/cache/conftool/dbconfig/20220706-093733-root.json [09:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30917 and previous config saved to /var/cache/conftool/dbconfig/20220706-093741-root.json [09:37:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30918 and previous config saved to /var/cache/conftool/dbconfig/20220706-093752-root.json [09:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:35] 10SRE-swift-storage, 10Observability-Metrics: Investigate HW requirements for Thanos frontend - https://phabricator.wikimedia.org/T312201 (10LSobanski) [09:41:42] (03CR) 10JMeybohm: Alert on helm releases in bad state (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/808968 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [09:41:42] It looks like the +group2 bit isn't being merged in whereas the +enwiki bit is [09:42:47] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:43:31] (03CR) 10Gergő Tisza: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [09:43:38] (03PS5) 10Aqu: [WIP] Build spark assembly for Spark3 [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) [09:45:12] (03CR) 10David Caro: novafullstack: Refactor and minor fix (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/811316 (owner: 10David Caro) [09:46:02] phuedx: yeah. not sure why though. 
i recommend trying to debug it locally (you can run `composer buildConfigCache` in your config repo, and files like `wmf-config/config-cache/conf-labs-enwiki.json` will then have the configuration as it will be seen by MediaWiki) [09:46:25] (03CR) 10David Caro: [C: 03+2] Remove hiera files for nonexistent Cloud VPS instances [puppet] - 10https://gerrit.wikimedia.org/r/767925 (owner: 10Majavah) [09:48:17] (03CR) 10JMeybohm: k8s: Retry checks for expected pods on drain (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/811331 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:50:18] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. [09:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:23] my testing shows that the group2 bit is just ignored (might be because group2 doesn't really make any sense when talking about beta, but that's just a guess) [09:51:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: After restart', diff saved to https://phabricator.wikimedia.org/P30919 and previous config saved to /var/cache/conftool/dbconfig/20220706-095138-root.json [09:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:14] (03PS4) 10David Caro: wmcs.openstack: move libs to it's own module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/809543 [09:52:18] (03PS3) 10David Caro: alerts: add a default duration of 1h [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810367 [09:52:22] (03PS3) 10David Caro: wmcs.lib.openstack: move to a directory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810451 [09:52:26] (03PS9) 10David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368 [09:52:44] we just lost wikibugs [09:53:17] poor bot :( [09:55:51] 
urbanecm: Agreed. Changing +group2 to +wikipedia, for example, fixes the problem [09:56:03] Also, poor wikibugs [09:58:16] phuedx: in that case, an easy workaround would be group2 => wikipedia (has roughly the same meaning anyway). an alternative solution would be to introduce `wmgExtraEventStreams` and call `$wgEventStreams = array_merge( $wgEventStreams, $wmgExtraEventStreams )` in CommonSettings-labs.php (which'd be an equivalent of `+default`, if it was supported). [09:59:27] (03PS1) 10Ayounsi: Remove cloudstore hosts from cloud ACL [homer/public] - 10https://gerrit.wikimedia.org/r/811676 [09:59:34] !log restarted wikibugs [09:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:40] welcome back, wikibugs [09:59:57] and thanks volans for resuscitating them [10:00:33] np :) [10:00:38] (03CR) 10CI reject: [V: 04-1] wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914 (owner: 10David Caro) [10:02:30] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. 
[10:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:41] (03CR) 10David Caro: [C: 03+2] cloudvirt.safe_reboot: remove non-used openstack_api property [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/811675 (owner: 10David Caro) [10:04:43] (03CR) 10David Caro: [C: 03+2] wmcs.openstack: move libs to it's own module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/809543 (owner: 10David Caro) [10:04:49] (03CR) 10David Caro: [C: 03+2] alerts: add a default duration of 1h [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810367 (owner: 10David Caro) [10:04:58] (03PS1) 10Phuedx: beta: Correctly add mediawiki.web_ui.interactions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811678 [10:05:00] (03CR) 10David Caro: [C: 03+2] wmcs.lib.openstack: move to a directory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810451 (owner: 10David Caro) [10:05:26] urbanecm: ^^ I've also tried to explain why I took the short route :) [10:05:50] (03CR) 10Urbanecm: [C: 03+1] "should work" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811678 (owner: 10Phuedx) [10:06:31] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/811676 (owner: 10Ayounsi) [10:06:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After restart', diff saved to https://phabricator.wikimedia.org/P30920 and previous config saved to /var/cache/conftool/dbconfig/20220706-100642-root.json [10:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:08] (03CR) 10Phuedx: "I've confirmed that the mediawiki.web_ui.interactions stream shows up in wgEventStreams and wgEventLoggingStreamNames in wmf-config/config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811678 (owner: 10Phuedx) [10:07:35] Right. 
I'll merge that and get it on the deployment host [10:08:24] (03CR) 10Phuedx: [C: 03+2] beta: Correctly add mediawiki.web_ui.interactions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811678 (owner: 10Phuedx) [10:09:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet [10:09:11] (03Merged) 10jenkins-bot: beta: Correctly add mediawiki.web_ui.interactions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811678 (owner: 10Phuedx) [10:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:07] (03Merged) 10jenkins-bot: wmcs.openstack: move libs to it's own module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/809543 (owner: 10David Caro) [10:11:12] (03Merged) 10jenkins-bot: cloudvirt.safe_reboot: remove non-used openstack_api property [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/811675 (owner: 10David Caro) [10:11:14] (03Merged) 10jenkins-bot: alerts: add a default duration of 1h [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810367 (owner: 10David Caro) [10:11:37] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [10:11:47] (03CR) 10Jelto: "Added some comment inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [10:11:56] (03Merged) 10jenkins-bot: wmcs.lib.openstack: move to a directory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810451 (owner: 10David Caro) [10:12:48] (03PS1) 10Muehlenhoff: bigtop::hadoop: All hosts use the new GID/UID scheme by now [puppet] - 10https://gerrit.wikimedia.org/r/811680 [10:12:51] (03CR) 10Volans: [C: 03+1] "I'm not familiar with the underlying issue, but python wise LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/811331 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [10:13:21] (03CR) 10Ayounsi: [C: 03+2] Remove cloudstore hosts from cloud ACL [homer/public] - 10https://gerrit.wikimedia.org/r/811676 (owner: 10Ayounsi) [10:13:32] (03CR) 10Slavina Stefanova: novafullstack: Refactor and minor fix (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/811316 (owner: 10David Caro) [10:13:54] (03Merged) 10jenkins-bot: Remove cloudstore hosts from cloud ACL [homer/public] - 10https://gerrit.wikimedia.org/r/811676 (owner: 10Ayounsi) [10:14:53] RECOVERY - Check systemd state on ganeti2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:57] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/811674 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [10:15:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:16:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:21] (03PS2) 10Majavah: openstack: horizon: remove enc url from hiera [puppet] - 10https://gerrit.wikimedia.org/r/800232 [10:19:50] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2024.codfw.wmnet [10:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After restart', diff saved to https://phabricator.wikimedia.org/P30921 and previous config saved to /var/cache/conftool/dbconfig/20220706-102146-root.json [10:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:11] (03PS1) 10Muehlenhoff: installserver: Remove support for pre buster [puppet] - 10https://gerrit.wikimedia.org/r/811681 [10:22:49] !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host dse-k8s-etcd1001.eqiad.wmnet [10:22:50] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [10:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:05] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:27:03] (03PS1) 10Muehlenhoff: ores: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/811682 [10:27:39] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:27:39] !log btullis@cumin1001 START - Cookbook sre.dns.wipe-cache dse-k8s-etcd1001.eqiad.wmnet on all recursors [10:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:42] !log btullis@cumin1001 
END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-etcd1001.eqiad.wmnet on all recursors [10:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:09] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36197/console" [puppet] - 10https://gerrit.wikimedia.org/r/811682 (owner: 10Muehlenhoff) [10:30:18] (03CR) 10Elukey: [V: 03+1 C: 03+1] ores: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/811682 (owner: 10Muehlenhoff) [10:30:56] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe1009.eqiad.wmnet [10:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:34] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe2009.codfw.wmnet [10:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/811682 (owner: 10Muehlenhoff) [10:36:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After restart', diff saved to https://phabricator.wikimedia.org/P30923 and previous config saved to /var/cache/conftool/dbconfig/20220706-103650-root.json [10:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:02] 10SRE, 10Wikimedia-Mailing-lists: MM3/Postorius: Inconsistent translations for "Log In" in Spanish - https://phabricator.wikimedia.org/T312204 (10MarcoAurelio) [10:37:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host 
dse-k8s-etcd1001.eqiad.wmnet [10:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:55] !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host dse-k8s-etcd1002.eqiad.wmnet [10:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:56] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [10:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:57] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1009.eqiad.wmnet [10:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:57] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2009.codfw.wmnet [10:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:11] (03CR) 10Hnowlan: [C: 03+2] api-gateway: allow discovery services to set custom rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [10:40:14] (03PS1) 10Volans: sre.ganeti.*: automatically get default group [cookbooks] - 10https://gerrit.wikimedia.org/r/811684 [10:40:49] (03PS2) 10Volans: sre.ganeti.*: automatically get default group [cookbooks] - 10https://gerrit.wikimedia.org/r/811684 [10:42:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:42:28] !log btullis@cumin1001 START - Cookbook sre.dns.wipe-cache dse-k8s-etcd1002.eqiad.wmnet on all recursors [10:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-etcd1002.eqiad.wmnet on all recursors [10:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:04] (03CR) 
10Giuseppe Lavagetto: [C: 03+1] Switch image builds over to build2001 [puppet] - 10https://gerrit.wikimedia.org/r/811344 (https://phabricator.wikimedia.org/T298463) (owner: 10Muehlenhoff) [10:43:21] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Switch image reports over to build2001 [puppet] - 10https://gerrit.wikimedia.org/r/811324 (https://phabricator.wikimedia.org/T298463) (owner: 10Muehlenhoff) [10:43:36] 10SRE, 10Wikimedia-Mailing-lists: MM3/Postorius: Inconsistent translations for "Log In" in Spanish - https://phabricator.wikimedia.org/T312204 (10MarcoAurelio) [10:44:00] (03Merged) 10jenkins-bot: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [10:44:17] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2028.codfw.wmnet [10:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:04] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1028.eqiad.wmnet [10:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After restart', diff saved to https://phabricator.wikimedia.org/P30925 and previous config saved to /var/cache/conftool/dbconfig/20220706-105154-root.json [10:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dse-k8s-etcd1002.eqiad.wmnet [10:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:09] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for 
host ms-be2028.codfw.wmnet [10:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:35] PROBLEM - Check systemd state on ganeti2024 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:54:53] !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host dse-k8s-etcd1003.eqiad.wmnet [10:54:55] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [10:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:26] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1028.eqiad.wmnet [10:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:37] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2029.codfw.wmnet [10:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:42] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1029.eqiad.wmnet [10:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:49] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:39] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (32) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, elastic2049, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002 [11:01:39] -fe2003, thumbor1002, thumbor1006, thumbor2003, thumbor2004, thumbor2005, thumbor2006 
https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [11:03:21] PROBLEM - MariaDB Replica IO: m2 on db2078 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:03:23] PROBLEM - MariaDB Replica IO: m3 on db2078 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:03:27] PROBLEM - mysqld processes on db2078 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:03:27] PROBLEM - MariaDB Replica IO: m5 on db2078 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:03:51] PROBLEM - Check systemd state on db2078 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@m3.service,wmf_auto_restart_prometheus-mysqld-exporter@m5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:11] PROBLEM - MariaDB Replica Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:04:13] PROBLEM - MariaDB Replica SQL: m3 on db2078 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:04:19] PROBLEM - MariaDB Replica Lag: m1 on db2078 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:04:23] PROBLEM - MariaDB Replica Lag: m2 on db2078 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:04:43] PROBLEM - MariaDB Replica SQL: m1 on db2078 is CRITICAL: CRITICAL slave_sql_state could not connect 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:04:55] PROBLEM - MariaDB Replica SQL: m5 on db2078 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:04:57] PROBLEM - MariaDB read only m1 on db2078 is CRITICAL: Could not connect to localhost:3321 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:05:11] PROBLEM - MariaDB read only m2 on db2078 is CRITICAL: Could not connect to localhost:3322 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:05:23] PROBLEM - MariaDB read only m3 on db2078 is CRITICAL: Could not connect to localhost:3323 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:05:33] PROBLEM - MariaDB read only m5 on db2078 is CRITICAL: Could not connect to localhost:3325 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:05:39] PROBLEM - MariaDB Replica Lag: m5 on db2078 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:05:39] PROBLEM - MariaDB Replica SQL: m2 on db2078 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:05:45] PROBLEM - MariaDB Replica IO: m1 on db2078 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:06:12] marostegui: ^ is expected isn't it? You said you were shutting it down earlier.
[11:06:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After restart', diff saved to https://phabricator.wikimedia.org/P30927 and previous config saved to /var/cache/conftool/dbconfig/20220706-110658-root.json [11:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:09] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2029.codfw.wmnet [11:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:19] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1029.eqiad.wmnet [11:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:21] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:52] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2030.codfw.wmnet [11:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:21] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1030.eqiad.wmnet [11:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:00] (JobUnavailable) firing: Reduced availability for job mysql-misc in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:15:48] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2030.codfw.wmnet [11:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:27] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) 
for host ms-be1030.eqiad.wmnet [11:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:16] (03CR) 10Jelto: [C: 03+2] wikimedia.org: remove gitlab-replica-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/811674 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [11:26:57] (03PS3) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) [11:27:31] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:28:56] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2031.codfw.wmnet [11:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:07] (03CR) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [11:29:15] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1031.eqiad.wmnet [11:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:07] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1031.eqiad.wmnet [11:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:18] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2031.codfw.wmnet [11:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:11] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2032.codfw.wmnet [11:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:24] !log mvernon@cumin1001 START - 
Cookbook sre.hosts.reboot-single for host ms-be1032.eqiad.wmnet [11:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:22] (03Abandoned) 10Jelto: site: remove gitlab2001 [puppet] - 10https://gerrit.wikimedia.org/r/806864 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [11:47:38] (03PS4) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) [11:51:16] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1032.eqiad.wmnet [11:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:25] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2032.codfw.wmnet [11:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:59] (03PS2) 10Jelto: gitlab/acme_chief: remove gitlab1001 from list of (passive) hosts [puppet] - 10https://gerrit.wikimedia.org/r/802822 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [11:54:05] (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [11:56:00] (03CR) 10Jelto: [C: 03+1] "rebase + merge conflict in patch set 2. I removed gitlab2001, which is decommissioned now." 
[puppet] - 10https://gerrit.wikimedia.org/r/802822 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [11:56:45] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:05] (03CR) 10Jelto: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [11:57:14] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [11:57:15] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2033.codfw.wmnet [11:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:28] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1033.eqiad.wmnet [11:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:35] (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [11:59:11] (03PS1) 10Muehlenhoff: Add a helper function to query the disk type of a VM [software/spicerack] - 10https://gerrit.wikimedia.org/r/811693 (https://phabricator.wikimedia.org/T312116) [11:59:35] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Make Spicerack cookbook to resize ganeti VM - https://phabricator.wikimedia.org/T219454 (10MoritzMuehlenhoff) [11:59:39] 10SRE, 10Ganeti: Cookbooks for Ganeti maintenance tasks - https://phabricator.wikimedia.org/T283319 (10MoritzMuehlenhoff) [12:00:33] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:03:47] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The 
following units failed: thumbor@8803.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:20] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2033.codfw.wmnet [12:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:39] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1033.eqiad.wmnet [12:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:03] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2034.codfw.wmnet [12:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:50] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1035.eqiad.wmnet [12:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:51] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1035.eqiad.wmnet [12:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:07] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2034.codfw.wmnet [12:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:41] PROBLEM - Check systemd state on thumbor2003 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8807.service,thumbor@8811.service,thumbor@8814.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:31] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:41] (03PS2) 10Klausman: ml-services: add some more revscoring services to staging 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) [12:20:43] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks, looks great!" [cookbooks] - 10https://gerrit.wikimedia.org/r/811684 (owner: 10Volans) [12:21:02] (03CR) 10Klausman: [C: 03+1] ores: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/811682 (owner: 10Muehlenhoff) [12:21:51] (03CR) 10Muehlenhoff: [C: 03+2] ores: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/811682 (owner: 10Muehlenhoff) [12:26:54] (03PS1) 10Muehlenhoff: profile::rsyslog::kubernetes: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/811699 [12:28:34] (03PS3) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) [12:28:40] !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw [12:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:35] RECOVERY - Check systemd state on thumbor2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:58] (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:36:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:39:55] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:12] !log jayme@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-staging-worker-codfw [12:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:53] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:40:58] (KubernetesCalicoDown) resolved: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:41:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:41:36] !log btullis@cumin1001 START - Cookbook sre.dns.wipe-cache dse-k8s-etcd1003.eqiad.wmnet on all recursors [12:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:39] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-etcd1003.eqiad.wmnet on all recursors [12:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:02] (03CR) 10David Caro: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/800232 (owner: 10Majavah) [12:43:31] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudstore1008.wikimedia.org [12:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:09] (03PS4) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) [12:45:17] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:46:46] (03PS1) 10Kosta Harlan: Add image-suggestion listener to service-proxy [puppet] - 10https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) [12:47:13] (03CR) 10Jbond: [C: 03+1] "sgtm" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/809132 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [12:47:22] (03CR) 10CI reject: [V: 04-1] ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [12:48:10] (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [12:48:19] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:04] (03PS10) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [12:49:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:49:14] (03PS11) 
10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [12:49:18] (03CR) 10Majavah: [C: 04-1] "per https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports, the name should be image-suggestion-api and port 4009" [puppet] - 10https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [12:49:19] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2035.codfw.wmnet [12:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:38] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1036.eqiad.wmnet [12:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:59] (03PS2) 10Vlad.shapik: WIP: Adjust the online tests to new changes in the thumbor functionality [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/811257 [12:50:14] (03PS5) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) [12:50:18] (03CR) 10Alexandros Kosiaris: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [12:50:30] (03CR) 10CI reject: [V: 04-1] GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [12:50:45] (03CR) 10Kosta Harlan: Add image-suggestion listener to service-proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [12:51:25] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48390 bytes in 0.062 
second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:51:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:51:37] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudstore1008.wikimedia.org [12:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:45] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `cloudstore1008.wikimedia.org` - cloudst... [12:53:09] (03CR) 10CI reject: [V: 04-1] ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [12:53:56] (03PS12) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [12:54:09] (03CR) 10Jbond: "can we pause this until im back, i did some refactoring of the raid classes nd have a feeling i was thinking of moving away from using the" [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: 10Muehlenhoff) [12:56:01] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1036.eqiad.wmnet [12:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:09] (03PS13) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [12:56:34] (03Abandoned) 10Jbond: C:monitoring: Add define for creating http checks [puppet] - 
10https://gerrit.wikimedia.org/r/786365 (owner: 10Jbond) [12:56:55] (03PS2) 10Kosta Harlan: Add image-suggestion listener to service-proxy [puppet] - 10https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) [12:57:02] (03CR) 10CI reject: [V: 04-1] GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [12:57:26] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add image-suggestion listener to service-proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [12:57:43] (03CR) 10Kosta Harlan: Add image-suggestion listener to service-proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [12:57:47] (03PS6) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) [12:57:59] (03Abandoned) 10Jbond: O:nrpe: add check_http_wmf script [puppet] - 10https://gerrit.wikimedia.org/r/773272 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [12:58:09] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add image-suggestion listener to service-proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [12:58:19] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2035.codfw.wmnet [12:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:49] (03PS3) 10Kosta Harlan: Add image-suggestion listener to service-proxy [puppet] - 10https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) [12:58:56] (03CR) 10Kosta Harlan: Add image-suggestion listener to service-proxy (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [12:59:22] (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [12:59:24] (03PS1) 10Ayounsi: cr: policy-options add missing return [homer/public] - 10https://gerrit.wikimedia.org/r/811706 (https://phabricator.wikimedia.org/T253194) [12:59:31] (03CR) 10Muehlenhoff: Extend custom raid fact to support Perc 750 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: 10Muehlenhoff) [12:59:33] (03PS14) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [12:59:57] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220706T1300). [13:00:05] kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] o/ [13:00:14] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2036.codfw.wmnet [13:00:17] o/ [13:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:20] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We need to set up a service proxy instance for the image-suggestion service first, then use that port here." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [13:00:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1037.eqiad.wmnet [13:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:31] (03CR) 10CI reject: [V: 04-1] ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:00:33] (03PS15) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [13:01:15] is the config change ready to deploy? it says it Depends-On a puppet change that’s still open [13:01:19] i'm here, but still sorting out some issues with my patch [13:01:23] ok [13:01:27] which I'm doubtful about getting done now, but let's see [13:01:41] (03PS1) 10Muehlenhoff: Remove puppet refs for cloudstore1008/cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/811711 (https://phabricator.wikimedia.org/T311844) [13:02:42] (03CR) 10Ayounsi: [C: 03+2] "noop on the devices." 
[homer/public] - 10https://gerrit.wikimedia.org/r/811706 (https://phabricator.wikimedia.org/T253194) (owner: 10Ayounsi) [13:03:15] (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [13:03:27] (03PS7) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) [13:03:40] (03Merged) 10jenkins-bot: cr: policy-options add missing return [homer/public] - 10https://gerrit.wikimedia.org/r/811706 (https://phabricator.wikimedia.org/T253194) (owner: 10Ayounsi) [13:03:57] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: cloudstore1008.wikimedia.org [13:03:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cloudstore1008.wikimedia.org [13:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:03] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: cloudstore1009.wikimedia.org [13:04:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cloudstore1009.wikimedia.org [13:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:06] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: cloudstore1008.wikimedia.org [13:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:10] (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (031 
comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [13:04:13] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: cloudstore1009.wikimedia.org [13:04:56] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (10Aline_Bruenger_WMDE) [13:05:47] Lucas_WMDE urbanecm: can we give it another 10 minutes or so, as I'm getting some CR comments. [13:06:07] kostajh: no issues at all [13:06:12] I can’t deploy Puppet changes anyways, not sure if urbanecm can (and would be willing to) [13:06:14] (03CR) 10CI reject: [V: 04-1] ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:06:39] i can't do puppet changes (but happy to do the MW counterpart once puppet is resolved) [13:06:55] ok [13:07:09] I'm talking with _j.oe_ about the puppet change in #wikimedia-sre [13:07:46] (03PS2) 10Muehlenhoff: Remove puppet refs for cloudstore1008/cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/811711 (https://phabricator.wikimedia.org/T311844) [13:07:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add image-suggestion listener to service-proxy [puppet] - 10https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [13:08:58] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2036.codfw.wmnet [13:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/810867 
(https://phabricator.wikimedia.org/T311999) (owner: 10Muehlenhoff) [13:09:05] (03PS8) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) [13:09:17] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1037.eqiad.wmnet [13:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:04] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2037.codfw.wmnet [13:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:12] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1038.eqiad.wmnet [13:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:21] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add image-suggestion listener to service-proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [13:10:25] (03CR) 10Jbond: [C: 04-1] P:mediawiki::scap_client: add parameter to indicate scap master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807510 (https://phabricator.wikimedia.org/T310740) (owner: 10Jbond) [13:11:04] 10SRE-swift-storage: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 (10fgiunchedi) I did a brief analysis on space vs retention vs resolution: | resolution | #samples | #series | bytes | -- | -- | -- | -- | | 0s | 29.1B | 4B | 40TB | 5m | 5.8B | 2.6B | 30TB | 1h | 474.5M | 2.4B | 3.7TB... [13:11:16] the puppet patch needs ~30 minutes to propagate. So, that is still within this window, but not sure about whether to go forward with this. 
[13:11:30] (03CR) 10CI reject: [V: 04-1] ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:17:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1132 (T311106)', diff saved to https://phabricator.wikimedia.org/P30930 and previous config saved to /var/cache/conftool/dbconfig/20220706-131715-ladsgroup.json [13:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:19] (03CR) 10Ayounsi: "A couple comments then we're good! I had a look at what's running on netbox-next as well." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [13:17:20] T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 [13:18:20] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (10JMeybohm) [13:18:48] 10SRE-swift-storage, 10Infrastructure-Foundations: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs - https://phabricator.wikimedia.org/T309027 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Remaining nodes done by hand during reboots for T310483: ` mvernon@cumin1001:~$ sudo cumin... 
[13:19:01] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1038.eqiad.wmnet [13:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:11] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1039.eqiad.wmnet [13:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:39] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2037.codfw.wmnet [13:19:46] Lucas_WMDE / urbanecm: there's issues with the proxy, so let's leave this patch out for now and I'll look for another window to deploy it. [13:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:47] (03CR) 10Volans: [C: 03+2] sre.ganeti.*: automatically get default group [cookbooks] - 10https://gerrit.wikimedia.org/r/811684 (owner: 10Volans) [13:19:59] ack [13:20:00] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2038.codfw.wmnet [13:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:23] ok [13:20:29] RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:21:01] RECOVERY - MariaDB read only m1 on db2078 is OK: Version 10.4.25-MariaDB-log, Uptime 61s, read_only: True, event_scheduler: True, 20.73 QPS, connection latency: 0.003852s, query latency: 0.000360s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:21:11] RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:21:17] RECOVERY - MariaDB read only m2 on db2078 is OK: Version 10.4.25-MariaDB-log, Uptime 67s, read_only: True, event_scheduler: True, 11.84 QPS, connection latency: 0.003672s, query latency: 0.000367s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:21:27] RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:21:31] RECOVERY - MariaDB read only m3 on db2078 is OK: Version 10.4.25-MariaDB-log, Uptime 76s, read_only: True, event_scheduler: True, 12.84 QPS, connection latency: 0.003884s, query latency: 0.000317s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:21:37] RECOVERY - MariaDB read only m5 on db2078 is OK: Version 10.4.25-MariaDB-log, Uptime 75s, read_only: True, event_scheduler: True, 14.73 QPS, connection latency: 0.003573s, query latency: 0.000348s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:21:41] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:21:43] (03CR) 10Hnowlan: [C: 03+2] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [13:21:45] (JobUnavailable) resolved: Reduced availability for job mysql-misc in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:22:01] RECOVERY - mysqld processes on db2078 is OK: PROCS OK: 4 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:22:47] RECOVERY - MariaDB Replica SQL: m3 on db2078 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:23:13] RECOVERY - MariaDB Replica SQL: m1 on db2078 is OK: OK slave_sql_state Slave_SQL_Running: Yes 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:23:23] RECOVERY - MariaDB Replica SQL: m5 on db2078 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:23:28] (03Merged) 10jenkins-bot: sre.ganeti.*: automatically get default group [cookbooks] - 10https://gerrit.wikimedia.org/r/811684 (owner: 10Volans) [13:23:30] (03Merged) 10jenkins-bot: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [13:24:07] RECOVERY - MariaDB Replica Lag: m5 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:24:07] RECOVERY - MariaDB Replica SQL: m2 on db2078 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:24:17] RECOVERY - MariaDB Replica IO: m1 on db2078 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:24:23] (03PS9) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) [13:24:25] RECOVERY - MariaDB Replica IO: m2 on db2078 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:24:27] RECOVERY - MariaDB Replica IO: m3 on db2078 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:24:33] RECOVERY - MariaDB Replica IO: m5 on db2078 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:25:25] RECOVERY - MariaDB Replica Lag: m2 on 
db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:27:29] (03PS10) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) [13:28:11] RECOVERY - Check systemd state on ganeti2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:23] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1039.eqiad.wmnet [13:28:23] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2038.codfw.wmnet [13:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:35] (03PS11) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) [13:28:44] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2039.codfw.wmnet [13:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:25] (03PS12) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) [13:30:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dse-k8s-etcd1003.eqiad.wmnet [13:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:19] RECOVERY - MariaDB Replica Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:30:53] (03PS1) 10Ayounsi: Add 
sre.network.configure-switch-interfaces to dcops sudo [puppet] - 10https://gerrit.wikimedia.org/r/811714 [13:31:53] (03PS1) 10Majavah: prometheus: blackbox: don't deploy tls alerts when tls is disabled [puppet] - 10https://gerrit.wikimedia.org/r/811715 [13:31:56] (03PS1) 10Majavah: prometheus: blackbox: support exporting modules for other instances [puppet] - 10https://gerrit.wikimedia.org/r/811716 [13:31:58] (03PS1) 10Majavah: P:toolforge::prometheus: import blackbox checks from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/811717 [13:32:00] (03PS1) 10Majavah: P:toolforge::static: remove HTTPS enforcement [puppet] - 10https://gerrit.wikimedia.org/r/811718 [13:32:02] (03PS1) 10Majavah: P:toolforge::static: add blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/811719 [13:32:53] (03CR) 10CI reject: [V: 04-1] prometheus: blackbox: support exporting modules for other instances [puppet] - 10https://gerrit.wikimedia.org/r/811716 (owner: 10Majavah) [13:34:02] (03PS2) 10Majavah: prometheus: blackbox: support exporting modules for other instances [puppet] - 10https://gerrit.wikimedia.org/r/811716 [13:34:04] (03PS2) 10Majavah: P:toolforge::prometheus: import blackbox checks from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/811717 [13:34:06] (03PS2) 10Majavah: P:toolforge::static: remove HTTPS enforcement [puppet] - 10https://gerrit.wikimedia.org/r/811718 [13:34:08] (03PS2) 10Majavah: P:toolforge::static: add blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/811719 [13:34:23] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36198/console" [puppet] - 10https://gerrit.wikimedia.org/r/811680 (owner: 10Muehlenhoff) [13:35:27] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2039.codfw.wmnet [13:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:42] !log jmm@cumin2002 START - 
Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet [13:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:05] (03CR) 10CI reject: [V: 04-1] P:toolforge::prometheus: import blackbox checks from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/811717 (owner: 10Majavah) [13:38:49] (03CR) 10CI reject: [V: 04-1] P:toolforge::static: remove HTTPS enforcement [puppet] - 10https://gerrit.wikimedia.org/r/811718 (owner: 10Majavah) [13:39:13] (03CR) 10Elukey: [V: 03+1 C: 03+1] bigtop::hadoop: All hosts use the new GID/UID scheme by now [puppet] - 10https://gerrit.wikimedia.org/r/811680 (owner: 10Muehlenhoff) [13:40:06] (03PS3) 10Majavah: P:toolforge::prometheus: import blackbox checks from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/811717 [13:40:08] (03PS3) 10Majavah: P:toolforge::static: remove HTTPS enforcement [puppet] - 10https://gerrit.wikimedia.org/r/811718 [13:40:10] (03PS3) 10Majavah: P:toolforge::static: add blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/811719 [13:41:25] (03PS6) 10Aqu: [WIP] Build spark assembly for Spark3 [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) [13:41:28] (03CR) 10CI reject: [V: 04-1] P:toolforge::prometheus: import blackbox checks from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/811717 (owner: 10Majavah) [13:42:25] (03PS4) 10Majavah: P:toolforge::prometheus: import blackbox checks from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/811717 [13:42:27] (03PS4) 10Majavah: P:toolforge::static: remove HTTPS enforcement [puppet] - 10https://gerrit.wikimedia.org/r/811718 [13:42:29] (03PS4) 10Majavah: P:toolforge::static: add blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/811719 [13:42:35] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) [13:44:18] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet [13:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:22] urbanecm: o/ [13:44:29] hi elukey! [13:44:50] sorry to bother, would you be available to help me to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/810007 if I sneak it in the deployment window? :) [13:44:56] (I haven't done it in a while) [13:45:01] 10SRE, 10Image-Suggestions: Envoy cannot connect to image-suggestion service - https://phabricator.wikimedia.org/T312225 (10kostajh) [13:45:16] 10SRE, 10Image-Suggestions: Envoy cannot connect to image-suggestion service - https://phabricator.wikimedia.org/T312225 (10kostajh) [13:45:41] elukey: sure thing. do you want to try the deployment yourself? https://deploy-commands.toolforge.org/bacc/810007 should be helpful :) [13:46:21] (if not, i can also deploy it for you) [13:46:37] ah wow [13:46:58] if you have time please go ahead, I'll study the link and try the next time :) [13:47:02] okay [13:47:11] <3 thanks [13:47:12] (03PS4) 10Urbanecm: Add a new Eventgate stream for revision-score events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [13:47:16] (03CR) 10Urbanecm: [C: 03+2] Add a new Eventgate stream for revision-score events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [13:48:04] (03Merged) 10jenkins-bot: Add a new Eventgate stream for revision-score events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [13:48:07] RECOVERY - MariaDB Replica Lag: m1 on db2078 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:49:01] elukey: pulled to mwdebug1001 (not sure if it's testable there) [13:49:22] urbanecm: yeah I 
think you can go ahead, it is a event-gate specific thing I am afraid [13:49:28] okay, syncing [13:50:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2024.codfw.wmnet to cluster codfw and group A [13:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:28] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) @Marostegui you welcome [13:51:52] (03PS1) 10Volans: sre.hosts.decommission: fix switch matching [cookbooks] - 10https://gerrit.wikimedia.org/r/811721 [13:52:20] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Papaul) @wiki_willy you welcome [13:53:06] (03CR) 10Ayounsi: [C: 03+1] sre.hosts.decommission: fix switch matching [cookbooks] - 10https://gerrit.wikimedia.org/r/811721 (owner: 10Volans) [13:53:21] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:810007|Add a new Eventgate stream for revision-score events (T301878)]] (duration: 03m 46s) [13:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:25] T301878: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 [13:53:32] elukey: it should be live now. anything else i can help with? 
[13:54:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:41] !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.addnode (exit_code=97) for new host ganeti2024.codfw.wmnet to cluster codfw and group A [13:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:07] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811711 (https://phabricator.wikimedia.org/T311844) (owner: 10Muehlenhoff) [13:55:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:55:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:55] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: fix switch matching [cookbooks] - 10https://gerrit.wikimedia.org/r/811721 (owner: 10Volans) [13:56:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:11] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:57:05] urbanecm: nope thanks a lot!! 
[13:57:10] any time [13:59:22] (03Merged) 10jenkins-bot: sre.hosts.decommission: fix switch matching [cookbooks] - 10https://gerrit.wikimedia.org/r/811721 (owner: 10Volans) [14:05:54] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudstore1008.wikimedia.org [14:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:29] (03CR) 10Andrea Denisse: Add PHP 7.4 dependencies for LibreNMS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810106 (owner: 10Andrea Denisse) [14:07:11] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:07:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:03] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:09:41] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:10:22] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:33] (03CR) 10Muehlenhoff: Add PHP 7.4 dependencies for LibreNMS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810106 (owner: 10Andrea Denisse) [14:11:57] (03CR) 10Volans: [C: 04-1] "One possible bug inline. Also missing tests." [software/spicerack] - 10https://gerrit.wikimedia.org/r/811693 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [14:11:59] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:12:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48391 bytes in 0.299 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:13:13] (03PS1) 10Alexandros Kosiaris: Add conf100[789] in DNS SRV records [dns] - 10https://gerrit.wikimedia.org/r/811728 (https://phabricator.wikimedia.org/T311407) [14:13:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:55] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:07] (03CR) 10Muehlenhoff: Add a helper function to query the disk type of a VM (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/811693 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [14:15:20] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudstore1008.wikimedia.org [14:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:29] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `cloudstore1008.wikimedia.org` - cloudst... 
[14:16:04] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudstore1009.wikimedia.org [14:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:26] (03PS1) 10Alexandros Kosiaris: Assign conf100[789] roles and add them to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/811729 (https://phabricator.wikimedia.org/T311407) [14:16:46] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [14:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:14] !log pool codfw for kartotherian T305845 [14:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:18] T305845: Re-import full planet data into codfw - https://phabricator.wikimedia.org/T305845 [14:18:02] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811714 (owner: 10Ayounsi) [14:19:23] (03CR) 10Andrew Bogott: [C: 03+2] Remove puppet refs for cloudstore1008/cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/811711 (https://phabricator.wikimedia.org/T311844) (owner: 10Muehlenhoff) [14:19:53] (03CR) 10Volans: [C: 04-1] Add a helper function to query the disk type of a VM (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/811693 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [14:20:42] (03PS23) 10Ayounsi: Decom cookbook: configure switches using cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/803262 [14:20:43] RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:44] (03PS1) 10Ayounsi: provision cookbook: configure switches using cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 [14:20:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/809194 (owner: 
10PipelineBot) [14:21:10] (03CR) 10Muehlenhoff: Add a helper function to query the disk type of a VM (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/811693 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [14:21:36] (03CR) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [14:21:43] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:31] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [14:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:43] !log depool eqiad kartotherian T305845 [14:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:46] T305845: Re-import full planet data into codfw - https://phabricator.wikimedia.org/T305845 [14:24:58] (03Merged) 10jenkins-bot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/809194 (owner: 10PipelineBot) [14:26:03] (03CR) 10Ayounsi: [C: 04-1] "-1 for now as not sure if it's a good idea." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/811730 (owner: 10Ayounsi) [14:26:50] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply [14:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:15] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [14:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:19] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8809.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:01] (03CR) 10Muehlenhoff: [C: 03+2] Switch image reports over to build2001 [puppet] - 10https://gerrit.wikimedia.org/r/811324 (https://phabricator.wikimedia.org/T298463) (owner: 10Muehlenhoff) [14:30:06] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply [14:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:52] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [14:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:05] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [14:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:48] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [14:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:45] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:01] PROBLEM - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
An error occured trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:42] 10SRE, 10ops-eqiad: SSH on wtp1040.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T312185 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson replaced the mgmt cable this should take care of the flapping. If the problem persists please re-open and ping me. [14:37:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:00] 10SRE, 10ops-eqiad: SSH on wtp1036.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311761 (10Cmjohnson) 05Open→03Resolved replaced the mgmt cable this should take care of the flapping. If the problem persists please re-open and ping me. [14:38:10] 10SRE, 10ops-eqiad: SSH on wtp1040.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T312185 (10ssingh) Thanks for the help @Cmjohnson! [14:38:24] (03PS2) 10Muehlenhoff: Switch image builds over to build2001 [puppet] - 10https://gerrit.wikimedia.org/r/811344 (https://phabricator.wikimedia.org/T298463) [14:38:31] PROBLEM - Check no envoy runtime configuration is left persistent on mw1414 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:39:09] PROBLEM - Host ores1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:39:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudstore1009.wikimedia.org [14:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:37] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `cloudstore1009.wikimedia.org` - 
cloudst... [14:40:34] (03CR) 10Klausman: ml-services: add some more revscoring services to staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [14:41:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/809270 (https://phabricator.wikimedia.org/T310740) (owner: 10Ahmon Dancy) [14:41:49] (03CR) 10Alexandros Kosiaris: [C: 03+2] scap: make scap::target require the scap class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809270 (https://phabricator.wikimedia.org/T310740) (owner: 10Ahmon Dancy) [14:42:17] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:18] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:30] (03CR) 10Ayounsi: [C: 03+2] Add sre.network.configure-switch-interfaces to dcops sudo [puppet] - 10https://gerrit.wikimedia.org/r/811714 (owner: 10Ayounsi) [14:44:58] (03CR) 10Alexandros Kosiaris: [C: 03+2] Depool poolcounter1005 for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809990 (owner: 10Muehlenhoff) [14:45:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:05] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (10Cmjohnson) [14:47:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10Cmjohnson) 05Open→03Resolved @cmooney the 2nd interface requires manual input, I 
mistakenly connected it to the mgmt port.... [14:47:28] (03CR) 10Muehlenhoff: [C: 03+2] Switch image builds over to build2001 [puppet] - 10https://gerrit.wikimedia.org/r/811344 (https://phabricator.wikimedia.org/T298463) (owner: 10Muehlenhoff) [14:49:40] !log akosiaris@deploy1002 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 03m 33s) [14:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:16] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) Entirely disabling performance_schema on 10.6 got 10.6 and 10.4 (with P_S ON) to die at the same time (more or less) ar... [14:50:37] 10SRE, 10Image-Suggestions: Envoy cannot connect to image-suggestion service - https://phabricator.wikimedia.org/T312225 (10JMeybohm) Ingress needs SNI and Host header to be set properly in order to be able to serve the correct certificate and route the request accordingly. 
[14:51:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:52:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:29] !log reboot poolcounter1005 for kernel upgrades [14:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:52] (03PS1) 10JMeybohm: service_proxy: Set SNI and Host header for ingress services [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225) [14:53:59] PROBLEM - Host poolcounter1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:25] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:59] !log moving switch ports cloudcephosd1021 from cloudsw1-c to cloudsw2-c T310546 [14:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:03] T310546: Recable cloudcephosd1021 from cloudsw1-c8-eqiad to cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T310546 [14:56:53] (03PS1) 10Alexandros Kosiaris: Revert "Depool poolcounter1005 for reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811421 [14:56:59] RECOVERY - Host poolcounter1005 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [14:57:02] (03CR) 
10Alexandros Kosiaris: [V: 03+2 C: 03+2] Revert "Depool poolcounter1005 for reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811421 (owner: 10Alexandros Kosiaris) [14:59:48] (03PS1) 10Ottomata: Upstream release 0.273.3 [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/811735 (https://phabricator.wikimedia.org/T311525) [15:00:19] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Upstream release 0.273.3 [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/811735 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [15:00:47] !log akosiaris@deploy1002 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 03m 28s) [15:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:26] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:03:08] (03PS1) 10Alexandros Kosiaris: Depool poolcounter1004 for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811736 [15:03:16] (03PS1) 10Jgiannelos: maps: Disable tilerator on codfw replicas [puppet] - 10https://gerrit.wikimedia.org/r/811737 [15:03:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:11] (03CR) 10Filippo Giunchedi: Add PHP 7.4 dependencies for LibreNMS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810106 (owner: 10Andrea Denisse) [15:04:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:04:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:04:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:15] !log installing intel-microcode security updates [15:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] Depool poolcounter1004 for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811736 (owner: 10Alexandros Kosiaris) [15:08:13] (03PS2) 10JMeybohm: service_proxy: Set SNI and Host header for ingress services [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225) [15:08:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:52] RECOVERY - Check no envoy runtime configuration is left persistent on mw1414 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [15:09:44] !log akosiaris@deploy1002 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 03m 41s) [15:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:02] (03CR) 10CI reject: [V: 04-1] service_proxy: Set SNI and Host header for ingress services [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225) (owner: 10JMeybohm) [15:13:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:14:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:47] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [15:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:49] (03PS1) 10Ottomata: analytics_cluster presto - reorg settings and unify configs for presto 0.273.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525) [15:17:35] (03CR) 10CI reject: [V: 04-1] analytics_cluster presto - reorg settings and unify configs for presto 0.273.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [15:17:50] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36200/console" [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [15:21:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [15:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:02] (03PS3) 10JMeybohm: service_proxy: Set SNI and Host header for ingress services [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225) [15:24:16] (03CR) 10JMeybohm: [V: 03+1] 
"PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36201/console" [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225) (owner: 10JMeybohm) [15:24:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [15:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:17] (03PS4) 10JMeybohm: service_proxy: Set SNI and Host header for ingress services [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225) [15:28:18] (03PS2) 10Ottomata: analytics_cluster presto - reorg settings and unify configs for presto 0.273.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525) [15:29:11] (03CR) 10CI reject: [V: 04-1] analytics_cluster presto - reorg settings and unify configs for presto 0.273.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [15:29:27] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36202/console" [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [15:30:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [15:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:22] PROBLEM - Check systemd state on mw2387 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:30] !log btullis@cumin1001 START - Cookbook 
sre.ganeti.makevm for new host dse-k8s-ctrl1001.eqiad.wmnet [15:37:32] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [15:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:40:33] (03PS5) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) [15:41:12] (03PS3) 10Ottomata: presto - reorg settings and unify configs for presto 0.273.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525) [15:41:14] (03CR) 10CI reject: [V: 04-1] Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [15:41:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:41:36] !log btullis@cumin1001 START - Cookbook sre.dns.wipe-cache dse-k8s-ctrl1001.eqiad.wmnet on all recursors [15:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:39] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-ctrl1001.eqiad.wmnet on all recursors [15:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:05] (03PS1) 10JMeybohm: service-proxy: Set SNI and Host header for ingress services [deployment-charts] - 10https://gerrit.wikimedia.org/r/811744 (https://phabricator.wikimedia.org/T312225) [15:45:06] RECOVERY - Check systemd 
state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:42] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:51] (03PS6) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) [15:48:48] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [15:50:13] (03PS7) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) [15:51:19] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dse-k8s-ctrl1001.eqiad.wmnet [15:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:45] !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host dse-k8s-ctrl1002.eqiad.wmnet [15:53:46] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [15:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:05] (03CR) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [15:54:18] PROBLEM - Check systemd state on maps2007 is 
CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:44] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:57:44] !log btullis@cumin1001 START - Cookbook sre.dns.wipe-cache dse-k8s-ctrl1002.eqiad.wmnet on all recursors [15:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-ctrl1002.eqiad.wmnet on all recursors [15:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:25] (03PS8) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) [15:59:40] (03PS9) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) [16:00:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:07:27] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dse-k8s-ctrl1002.eqiad.wmnet [16:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:51] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster - 
https://phabricator.wikimedia.org/T311131 (10BTullis) 05Open→03Resolved a:03BTullis All 3 VMs created successfully. I've also mo... [16:15:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:17] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (10BTullis) 05Open→03Resolved a:03BTullis Both VMs successfully created. I'll resolve this ticket an... [16:17:01] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36203/console" [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225) (owner: 10JMeybohm) [16:17:10] (03PS1) 10Btullis: Add DHCP boot entries for new dse-k8s VMs [puppet] - 10https://gerrit.wikimedia.org/r/811747 (https://phabricator.wikimedia.org/T310170) [16:21:42] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225) (owner: 10JMeybohm) [16:22:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:43] (03PS1) 10Btullis: Add the new dse-k8s servers with the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/811749 (https://phabricator.wikimedia.org/T310170) [16:27:22] (03CR) 10Btullis: [C: 03+2] Add DHCP boot entries for new dse-k8s VMs [puppet] - 10https://gerrit.wikimedia.org/r/811747 (https://phabricator.wikimedia.org/T310170) (owner: 10Btullis) [16:29:30] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:35:02] (03PS2) 10Btullis: Add the new dse-k8s servers with the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/811749 (https://phabricator.wikimedia.org/T310170) [16:37:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:29] (03CR) 10Btullis: [C: 03+2] Add the new dse-k8s servers with the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/811749 (https://phabricator.wikimedia.org/T310170) (owner: 10Btullis) [16:41:41] (03PS1) 10JMeybohm: Use the generic service_proxy definition for envoy config [deployment-charts] - 10https://gerrit.wikimedia.org/r/811751 [16:41:43] (03PS1) 10JMeybohm: Remove the need for charts to define services_procxy fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/811752 [16:42:04] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:56] (03PS1) 10Alexandros Kosiaris: Revert "Depool poolcounter1004 for reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811423 [16:59:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "Depool poolcounter1004 
for reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811423 (owner: 10Alexandros Kosiaris) [17:00:09] (03Merged) 10jenkins-bot: Revert "Depool poolcounter1004 for reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811423 (owner: 10Alexandros Kosiaris) [17:00:26] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:04:22] PROBLEM - Check systemd state on poolcounter1004 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:08] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:13] !log bking@cloudelastic1006 "restarting elastic services in preparation for cloudelastic reimage T309343" [17:06:14] !log akosiaris@deploy1002 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 03m 38s) [17:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:17] T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 [17:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:06:56] RECOVERY - Check systemd state on maps2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:07:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply 
[17:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:42] 10SRE, 10Machine-Learning-Team, 10ORES, 10serviceops: Migrate ORES Redis servers to Stretch/Buster - https://phabricator.wikimedia.org/T224569 (10akosiaris) 05Open→03Resolved a:03akosiaris Done a long time ago. Now [misc_redis](https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc) is being us... [17:10:48] 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10akosiaris) [17:17:32] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:43] (03CR) 10BryanDavis: [C: 03+1] Remove the need for charts to define services_procxy fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/811752 (owner: 10JMeybohm) [17:19:19] (03PS2) 10Dzahn: admin: add gitlab-roots group to gitlab_runner role [puppet] - 10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) [17:31:17] (03CR) 10David Caro: novafullstack: Refactor and minor fix (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/811316 (owner: 10David Caro) [17:31:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Recable cloudcephosd1021 from cloudsw1-c8-eqiad to cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T310546 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson cloudcephosd1021 has been moved to cloudsw2, thanks to @cmooney for figuring o... 
[17:31:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Recable cloudcephosd1015 from cloudsw1-d5-eqiad to cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T310547 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson The server has been moved [17:31:17] (03PS1) 10Sergio Gimeno: GrowthExperiments: end mailing list campaign in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985) [17:31:17] (03CR) 10Dzahn: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) (owner: 10Dzahn) [17:32:13] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:32:39] (03CR) 10CI reject: [V: 04-1] admin: add gitlab-roots group to gitlab_runner role [puppet] - 10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) (owner: 10Dzahn) [17:32:59] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10AlexisJazz) [17:33:02] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [17:36:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:38:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [17:41:22] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (10AlexisJazz) [17:42:18] well, jenkins, -1 for reason "aborted" is unusual [17:43:31] 12m 43s runtime suggests it timed out? [17:43:44] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (10AlexisJazz) [17:44:38] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (10AlexisJazz) [17:44:58] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) (owner: 10Dzahn) [17:45:07] mutante: sorry but you taught me this :P [17:46:38] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:59] sukhe: hehe, thank you! 
[17:49:12] and it worked [17:49:18] :P [17:49:39] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr Can you move these servers out of wmcs rack and into a 10G rack. there is space in B2, D2... [17:52:22] (03CR) 10Ottomata: [C: 03+2] "Merging, presto will not be auto-restarted, so I can do that after I upgrade the package." [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [17:53:20] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:58] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (10Cmjohnson) [17:54:57] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (10Cmjohnson) 05Open→03Resolved These servers have been removed along with the storage arrays [17:55:34] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cloudcephmon1002.eqiad.wmnet with reason: Moving racks [17:55:36] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cloudcephmon1002.eqiad.wmnet with reason: Moving racks [17:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:36] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:56:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:50] PROBLEM - Host cloudcephmon1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:00:04] jnuche and dduvall: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220706T1800). [18:00:04] jnuche and dduvall: (Dis)respected human, time to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220706T1800). Please do the needful. [18:02:07] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343 [18:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:11] T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 [18:02:53] (03CR) 10Krinkle: [C: 03+1] site/DHCP: decom doc1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/810400 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [18:03:34] (03CR) 10Urbanecm: [C: 04-1] "This will turn off welcome survey at beta enwiki and eswiki; I don't think that's intended." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985) (owner: 10Sergio Gimeno) [18:03:38] Krinkle: ready to go? 
ok, then I'll delete that whole thing today [18:04:10] like..destroying the VM [18:04:28] (03PS1) 10Ottomata: Enable iceberg hive for presto [puppet] - 10https://gerrit.wikimedia.org/r/811759 (https://phabricator.wikimedia.org/T311525) [18:06:10] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36204/console" [puppet] - 10https://gerrit.wikimedia.org/r/811759 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [18:06:16] (03CR) 10Ottomata: [WIP] Build spark assembly for Spark3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu) [18:06:22] RECOVERY - Host cloudcephmon1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [18:07:01] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Enable iceberg hive for presto [puppet] - 10https://gerrit.wikimedia.org/r/811759 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [18:07:16] PROBLEM - Host ms-be1065.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:59] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for DDesouza - https://phabricator.wikimedia.org/T312271 (10DDeSouza) [18:10:10] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1003.wikimedia.org with OS bullseye [18:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:15] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye [18:11:50] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for DDesouza - https://phabricator.wikimedia.org/T312271 (10DDeSouza) [18:13:42] RECOVERY - Host ms-be1065.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [18:15:45] 10SRE, 10ops-eqiad, 10Cloud-Services, 
10DC-Ops, and 2 others: move cloudcephmon1002.eqiad.wmnet from rack B4 to rack D5 - https://phabricator.wikimedia.org/T304096 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson The server has been moved to D5 and is accessible [18:18:42] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson Management flapping will be an ongoing issue, no need to keep this ticket open. If problems pers... [18:27:43] (03PS1) 10Clare Ming: Enable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) [18:28:15] (03CR) 10CI reject: [V: 04-1] Enable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) (owner: 10Clare Ming) [18:29:12] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Cmjohnson) @RKemper @Marostegui @Dzahn @MoritzMuehlenhoff @ssastry Do any of your servers require 10G? I should be able to keep them all in row D, this would only be an in-row move and woul... 
[18:30:02] (03PS2) 10Clare Ming: Enable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) [18:30:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:34] (03CR) 10CI reject: [V: 04-1] Enable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) (owner: 10Clare Ming) [18:33:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:39:12] (03PS3) 10Clare Ming: Enable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) [18:45:24] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1003.wikimedia.org with OS bullseye [18:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:29] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors... 
[18:45:51] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343 [18:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:55] T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 [18:47:35] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1003.wikimedia.org with OS bullseye [18:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:40] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye [18:47:42] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1003.wikimedia.org with OS bullseye [18:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:44] (03CR) 10Jdlrobson: [C: 03+1] Enable sticky header edit A/B test for pilot wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) (owner: 10Clare Ming) [18:47:47] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors... 
[18:48:30] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1003.wikimedia.org with OS bullseye [18:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:36] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye [18:48:37] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1003.wikimedia.org with OS bullseye [18:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:42] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors... 
[18:51:28] (03PS4) 10Clare Ming: Enable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) [18:52:14] (03CR) 10Clare Ming: Enable sticky header edit A/B test for pilot wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) (owner: 10Clare Ming) [18:56:27] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:25] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1003.wikimedia.org with OS bullseye [19:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:31] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye [19:00:40] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, 10IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (10BCornwall) Thank you for doing that, @Volans ; I apologize for forgetting to run the cookbook. I'm a little confused here regarding onl... 
[19:03:09] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:07:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:25] (03PS1) 10Dbrant: Add sampling to android.breadcrumbs event stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811765 (https://phabricator.wikimedia.org/T310847) [19:11:31] (03PS1) 10Ebernhardson: superset: Turn template processing back on [puppet] - 10https://gerrit.wikimedia.org/r/811766 (https://phabricator.wikimedia.org/T312134) [19:12:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10nskaggs) Thank you @ayounsi ! [19:13:16] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1003.wikimedia.org with OS bullseye [19:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:20] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors... 
[19:15:56] (03CR) 10Ebernhardson: "Patch is based on https://github.com/apache/superset/issues/12487#issuecomment-759390836" [puppet] - 10https://gerrit.wikimedia.org/r/811766 (https://phabricator.wikimedia.org/T312134) (owner: 10Ebernhardson) [19:16:27] (03PS5) 10Clare Ming: Enable sticky header edit A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) [19:32:21] (03PS6) 10Clare Ming: Enable sticky header edit A/B test for pilot wikis excluding idwiki/viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) [19:50:41] (03CR) 10Jdrewniak: [C: 03+1] Enable sticky header edit A/B test for pilot wikis excluding idwiki/viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) (owner: 10Clare Ming) [19:54:01] !log bd808@mwmaint1002 Testing statshbot following deploy of [[gerrit:809732]]. This should be logged in SAL, but stashbot should not say that was done on irc. [19:54:29] Krinkle: ^ seems to have worked -- https://sal.toolforge.org/log/N8sT1YEBa_6PSCT9nOPQ [19:56:20] Stashbot should no longer ack !log messages sent here by logmsgbot. This is hoped to help reduce the noise in this channel a little bit. [19:57:02] if you find this to be a good thing, give Krinkle your praise. If you find it to be horrible, blame me for merging the change. ;) [19:57:14] bd808: I'm not sure that's a good idea. how will i know when stashbot is broken? [19:58:56] +1, seems like we would not notice when logs dont actually get logged [19:59:08] urbanecm: when things stop showing up on https://wikitech.wikimedia.org/wiki/Server_Admin_Log I guess. I'm open to reverting if folks find it actually bad in practice, but maybe we can give it a few days before deciding? [19:59:45] well, I likely won't check that page when doing deployments. 
a missing IRC message is easy to notice, as i monitor -operations during deployments anyway :) [19:59:48] it is currently only omitting the ack message when logmsgbot is the sender of the !log [20:00:05] RoanKattouw, Urbanecm, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220706T2000). [20:00:05] cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:14] o/ [20:00:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:24] i will deploy [20:00:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10nskaggs) Looks like those two tasks are complete (thanks @Cmjohnson !), and it seems netbox plans show connecting to cloudsw1* as suggested. Thanks! [20:00:54] * urbanecm waves to cjming [20:01:12] I wouldn't want to have to open SAL each time to check. [20:01:25] urbanecm: can you point to documentation of a time you noticed that stashbot was down based on activity in this channel? I do understand the concern, but I also can't recall the last report of the bot being broken coming from here. 
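The behaviour described above ("it is currently only omitting the ack message when logmsgbot is the sender of the !log") can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not stashbot's actual code: the `Message` type, the `should_ack` and `handle` helpers, and the in-memory `sal` list are all hypothetical names invented for this sketch. Only the nick `logmsgbot` and the ack text come from the log itself.

```python
from dataclasses import dataclass

# Assumption: automated senders are identified purely by nick.
AUTOMATED_SENDERS = frozenset({"logmsgbot"})

# Ack text as it appears in this channel log.
ACK_TEXT = "Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log"

LOG_PREFIX = "!log "


@dataclass(frozen=True)
class Message:
    """Hypothetical minimal view of an IRC channel message."""
    sender: str
    text: str


def is_log_command(msg: Message) -> bool:
    """True when the message is a !log command."""
    return msg.text.startswith(LOG_PREFIX)


def should_ack(msg: Message) -> bool:
    """Ack !log lines from humans, stay silent for automated senders."""
    return is_log_command(msg) and msg.sender not in AUTOMATED_SENDERS


def handle(msg: Message, sal: list) -> "str | None":
    """Record every !log entry; return the ack text only when one is due."""
    if not is_log_command(msg):
        return None
    # The entry is always recorded, whoever sent it.
    sal.append(msg.text[len(LOG_PREFIX):])
    return ACK_TEXT if should_ack(msg) else None
```

In this sketch the entry is still recorded for every sender and only the channel reply is suppressed, which is the trade-off debated here: less noise, but no in-channel signal that an automated entry was actually logged.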
[20:02:04] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Cmjohnson) @ayounsi confirmed they're all 1G, I added the racks and U# to the timeslots [20:02:32] (03CR) 10Clare Ming: [C: 03+2] Enable sticky header edit A/B test for pilot wikis excluding idwiki/viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) (owner: 10Clare Ming) [20:03:21] https://sal.toolforge.org/tools.stashbot actually shows very few forced restarts of stashbot in general in the last couple of years [20:03:37] (03Merged) 10jenkins-bot: Enable sticky header edit A/B test for pilot wikis excluding idwiki/viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) (owner: 10Clare Ming) [20:03:39] * cjming waves to urbanecm [20:03:51] bd808: i can have a look, but it was in the form of getting someone via IRC to restart it, so that's hard to find. [20:04:43] urbanecm: fair enough. It is always easy to revert that change if there is actual value in the ack messages here. [20:05:42] Even if it netsplits and it's just waiting to come back, now there's no indication [20:05:52] People hide quit messages and it's easier to lose them [20:07:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:09:07] i'll hang around for a bit in case anyone still needs something deployed -- otherwise i'll close the backport window in 10-15 [20:09:26] urbanecm: mutante: this only affects automated messages by logmsgbot, human !log will be ack'ed the same as before, nothing changes. If it's hiding it for humans, I've made a mistake. [20:10:14] e.g.
reimage and db maintenance basically [20:10:20] i understand that, but even seeing the acks by stashbot to logmsgbot's !log is useful to ensure stashbot does log the logs to SAL [20:10:38] Krinkle: isn't it all cookbooks [20:10:38] with this change, i need to either open SAL and check, or issue a manual !log to test it [20:10:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:11:07] cjming: did you ever sync [20:11:23] if we want to hide the automated messages for some reason, I'd suggest merging logmsgbot and stashbot into a single bot, responsible for both logging to SAL and logging to IRC [20:11:28] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:811762|Enable sticky header edit A/B test for pilot wikis excluding idwiki/viwiki (T311144)]] (duration: 03m 25s) [20:11:31] T311144: Enable sticky header A/B test - https://phabricator.wikimedia.org/T311144 [20:11:37] RhinosF1: i'm syncing now - looks like it just finished [20:11:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:11:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:11:52] urbanecm: I [20:12:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:14:17] (03PS1) 10Cmjohnson: adding new wmcs hosts to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/811771 (https://phabricator.wikimedia.org/T304888) [20:15:14] urbanecm: I don't think deployers should worry about SAL and afaik people generally do not look for the ack of automated messages to know that it made it there. [20:15:38] (03CR) 10Cmjohnson: [C: 03+2] adding new wmcs hosts to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/811771 (https://phabricator.wikimedia.org/T304888) (owner: 10Cmjohnson) [20:15:47] I also think for incidents etc we already go off the IRC log anyway, not SAL.
Perhaps an Icinga alert would be useful to check that SAL is up. [20:16:21] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:55] I'm not sure about others, but for my deployments, when stashbot didn't ack the log (or reported an error), i paused to get that fixed somehow (as i don't like making changes that aren't properly logged, so other people know what i did) [20:17:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) [20:18:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) [20:18:45] an Icinga alert needs to actually notify someone though [20:18:52] or it's just going to sit there as unhandled crit [20:21:42] yes, I do look for the ACK and SAL is the place to check what others did on a regular basis [20:23:43] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1001.wikimedia.org with OS bullseye [20:23:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1001.wikime...
[20:33:53] RECOVERY - Check systemd state on thumbor2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:35:40] !log end of UTC late backport window [20:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:11] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudrabbit1001.wikimedia.org with OS bullseye [20:36:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1001.wikimedia.org with OS bullseye e... [20:38:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1001.wikimedia.org with OS bullseye [20:38:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1001.wikimedia.org with OS bullseye [20:41:07] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8806.service,thumbor@8808.service,thumbor@8810.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1002.wikimedia.org with OS bullseye [20:43:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1002.wikimedia.org with OS bullseye [20:43:49] 
!log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.wikimedia.org with OS bullseye [20:43:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1005.eqiad.wmnet with OS bullseye [20:43:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1003.wikimedia.org with OS bullseye [20:44:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1005.eqiad.wmnet with OS bullseye [20:44:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [20:44:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye [20:44:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1005.wikimedia.org with OS bullseye [20:44:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudservices1005.wikimedia.org with OS bull... 
[20:59:26] (03CR) 10Bearloga: [C: 03+1] superset: Turn template processing back on [puppet] - 10https://gerrit.wikimedia.org/r/811766 (https://phabricator.wikimedia.org/T312134) (owner: 10Ebernhardson) [20:59:36] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudrabbit1001.wikimedia.org with OS bullseye [20:59:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1001.wikimedia.org with OS bullseye e... [21:01:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) I am getting this on all but the cloudnets, those are not hitting the installer. ────────────────────┤ [!!] Configure the network ├─... [21:08:43] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10bking) [21:11:43] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:15:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:54] (03PS2) 10Dbrant: Add sampling to android.breadcrumbs event stream. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/811765 (https://phabricator.wikimedia.org/T310847) [21:22:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:37:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:54] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudrabbit1002.wikimedia.org with OS bullseye [21:39:59] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudrabbit1003.wikimedia.org with OS bullseye [21:40:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1002.wikimedia.org with OS bullseye e... [21:40:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1003.wikimedia.org with OS bullseye e... 
[21:40:12] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1005.eqiad.wmnet with OS bullseye [21:40:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1005.eqiad.wmnet with OS bullseye execut... [21:40:19] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [21:40:23] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudservices1005.wikimedia.org with OS bullseye [21:40:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye execut... [21:40:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudservices1005.wikimedia.org with OS bullseye... 
[21:49:21] SRE, Data-Engineering, Event-Platform, serviceops, Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (JArguello-WMF)
[21:49:36] SRE, Data-Engineering, Traffic-Icebox: varnishkafka / ATSkafka should support setting the kafka message timestamp - https://phabricator.wikimedia.org/T277553 (JArguello-WMF)
[21:49:46] SRE, Data-Engineering: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (JArguello-WMF)
[22:00:54] SRE, DNS, Traffic, WMF-Legal, and 4 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (Varnent) >>! In T310738#8035453, @Dzahn wrote: >>>! In T310738#8033789, @LSobanski wrote: >> @Varnent After chatting about this...
[22:13:03] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:21:59] SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, and 4 others: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (Dzahn) The change has been approved and then deployed. On gitlab-runner1002 I saw puppet ad...
[22:26:04] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (Jclark-ctr) @ayounsi host will be moved tomorrow morning When i started racking task i went by Racking Proposal: Place in WMCS racks. Place...
[22:31:23] (PS1) BCornwall: varnish: Enable Prometheus sysctl exporting [puppet] - https://gerrit.wikimedia.org/r/811780
[22:31:48] (PS2) BCornwall: varnish: Enable Prometheus sysctl exporting [puppet] - https://gerrit.wikimedia.org/r/811780 (https://phabricator.wikimedia.org/T311445)
[22:34:18] SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, and 4 others: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (Dzahn) Open→Resolved ` [gitlab-runner1002:~] $ for relenguser in brennen dancy dduv...
[22:35:50] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@debd402]: airflow dags to generate subgraph and query mapping along with their metrics
[22:37:51] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@debd402]: airflow dags to generate subgraph and query mapping along with their metrics (duration: 02m 01s)
[22:50:31] (CR) Dzahn: [C: +2] gitlab/acme_chief: remove gitlab1001 from list of (passive) hosts [puppet] - https://gerrit.wikimedia.org/r/802822 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[22:52:05] !log restart airflow-webserver and airflow-scheduler for plugins update on an-airflow1001
[22:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:39] !log etherpad - deleted 2 pads that had leaked information
[22:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:04] (CR) Dzahn: [C: +2] "this removed snippets from /etc/rsyslog.d/, like /etc/rsyslog.d/20-rsync-data-backup-gitlab1001-wikimedia-org.conf from gitlab1004" [puppet] - https://gerrit.wikimedia.org/r/802822 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[23:00:01] !log gitlab1004 - rm /lib/systemd/system/rsync-config-backup-gitlab1001* T307142
[23:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:04] T307142: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142
[23:00:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:22] (CR) Dzahn: [C: +2] DHCP: remove gitlab1001 [puppet] - https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[23:03:26] (PS2) Dzahn: DHCP: remove gitlab1001 [puppet] - https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142)
[23:07:30] (PS3) Dzahn: DHCP: remove gitlab1001 [puppet] - https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142)
[23:07:43] (PS1) Dzahn: site/gitlab: remove gitlab1001, update comments [puppet] - https://gerrit.wikimedia.org/r/811782 (https://phabricator.wikimedia.org/T307142)
[23:07:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:09:45] (PS4) Dzahn: DHCP: remove gitlab1001 [puppet] - https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142)
[23:10:44] (CR) Dzahn: [C: +2] DHCP: remove gitlab1001 [puppet] - https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[23:14:37] RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:15:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:57] (Abandoned) Dzahn: site: remove gitlab1001, adjust gitlab machine descriptions [puppet] - https://gerrit.wikimedia.org/r/802846 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[23:20:17] (PS4) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[23:21:59] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:22:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:24:30] (PS5) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[23:25:50] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts gitlab1001.wikimedia.org
[23:30:29] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
[23:48:20] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@5082f17]: increase subgraph_mapping_weekly executor memory
[23:50:26] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@5082f17]: increase subgraph_mapping_weekly executor memory (duration: 02m 05s)