[00:30:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[01:42:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[01:47:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[01:52:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[01:57:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[02:02:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[02:07:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[02:12:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[02:16:32] (CR) Krinkle: [C: +1] scap: Drop never-used 'sqldump' tool [puppet] - https://gerrit.wikimedia.org/r/692370 (owner: Zabe)
[02:20:18] (CR) Krinkle: [C: +1] "I've emailed ops-l just in case." [puppet] - https://gerrit.wikimedia.org/r/692370 (owner: Zabe)
[02:21:46] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org
[02:26:46] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org
[03:02:10] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:02:52] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:20:14] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: rsync status alert.
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[04:37:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[04:39:09] SRE, serviceops, Datacenter-Switchover, Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (wkandek)
[04:39:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3311 T163532', diff saved to https://phabricator.wikimedia.org/P16645 and previous config saved to /var/cache/conftool/dbconfig/20210621-043941-marostegui.json
[04:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:39:47] T163532: Drop index rev_page_id (rev_page, rev_id) - https://phabricator.wikimedia.org/T163532
[04:40:36] !log Re-add rev_page_id to db1099:3311 T163532 T285149
[04:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:41] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[04:42:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[04:47:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[04:47:55] marostegui: o/ (FML)
[04:48:42] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 21 hosts with reason: Master switchover s3 T284648
[04:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:48:47] T284648: Switchover s3 from db1123 to db1157 - https://phabricator.wikimedia.org/T284648
[04:48:49] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 21 hosts with reason: Master switchover s3 T284648
[04:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:56] !log kormat@cumin1001 dbctl commit (dc=all): 'Set db1157 with weight 0 T284648', diff saved to
https://phabricator.wikimedia.org/P16646 and previous config saved to /var/cache/conftool/dbconfig/20210621-044955-kormat.json
[04:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:51:25] (CR) Jcrespo: [C: +1] "Not used by backups either." [puppet] - https://gerrit.wikimedia.org/r/692370 (owner: Zabe)
[04:52:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[04:54:10] (PS5) Kormat: mariadb: Promote db1157 as s3 primary [puppet] - https://gerrit.wikimedia.org/r/698981 (https://phabricator.wikimedia.org/T284648)
[04:55:23] (CR) Kormat: [C: +2] mariadb: Promote db1157 as s3 primary [puppet] - https://gerrit.wikimedia.org/r/698981 (https://phabricator.wikimedia.org/T284648) (owner: Kormat)
[04:55:36] (PS3) Jcrespo: dbbackups: Switchover s3 backup source from db1171 to db1102 (buster) [puppet] - https://gerrit.wikimedia.org/r/692845 (https://phabricator.wikimedia.org/T283131)
[04:56:28] (PS3) Kormat: wmnet: Update s3-master to db1157 [dns] - https://gerrit.wikimedia.org/r/698982 (https://phabricator.wikimedia.org/T284648)
[04:57:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[04:57:54] (PS4) KartikMistry: Add support for Elia MT to cxserver [deployment-charts] - https://gerrit.wikimedia.org/r/699089 (https://phabricator.wikimedia.org/T276059)
[04:58:05] (PS1) Marostegui: mariadb: Promote db1130 to s5 master. [puppet] - https://gerrit.wikimedia.org/r/700462 (https://phabricator.wikimedia.org/T284529)
[04:58:26] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/700462 (https://phabricator.wikimedia.org/T284529) (owner: Marostegui)
[04:59:24] (CR) Kormat: [C: +1] mariadb: Promote db1130 to s5 master.
[puppet] - https://gerrit.wikimedia.org/r/700462 (https://phabricator.wikimedia.org/T284529) (owner: Marostegui)
[05:01:54] marostegui: alright, here we go.
[05:01:59] !log Starting s3 eqiad failover from db1123 to db1157 - T284648
[05:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:04] T284648: Switchover s3 from db1123 to db1157 - https://phabricator.wikimedia.org/T284648
[05:02:06] drumroll
[05:02:17] around
[05:02:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[05:03:04] !log kormat@cumin1001 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T284648', diff saved to https://phabricator.wikimedia.org/P16647 and previous config saved to /var/cache/conftool/dbconfig/20210621-050304-kormat.json
[05:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:37] confirmed read only on eswikiquote
[05:03:47] confirmed on ruwikinews
[05:03:54] XDDDDD
[05:04:09] randomly selected, of course.
**cough**
[05:04:15] totally
[05:04:40] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikibase_repo_prune_test.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:05:07] !log kormat@cumin1001 dbctl commit (dc=all): 'Promote db1157 to s3 master and set section read-write T284648', diff saved to https://phabricator.wikimedia.org/P16648 and previous config saved to /var/cache/conftool/dbconfig/20210621-050506-kormat.json
[05:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:28] rw I can see on my local wiki
[05:06:19] critical section done
[05:06:24] same on eswikiquote
[05:07:09] no errors on logstash that I can see
[05:07:29] Tendril shows the new state
[05:07:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[05:07:47] (CR) Kormat: [C: +2] wmnet: Update s3-master to db1157 [dns] - https://gerrit.wikimedia.org/r/698982 (https://phabricator.wikimedia.org/T284648) (owner: Kormat)
[05:07:55] recentchanges seems to be moving on eswikiquote
[05:08:19] "new" systemd service started/stopped correctly?
[05:08:37] jynus: yep
[05:08:42] cool
[05:09:14] huh. db1171 is still hanging off the old primary, at least on tendril
[05:09:34] I can see, I thought it was on purpose
[05:09:40] kormat: that one has replication stopped
[05:09:47] ahh, ok
[05:09:47] Is that the backup host?
[05:09:59] it is, but the old one
[05:10:16] well, I was about to ask if to merge: https://gerrit.wikimedia.org/r/c/operations/puppet/+/692845
[05:10:29] jynus: +1 yeah
[05:11:20] I didn't stop it, did you or is it being backed up right now?
[05:11:28] marostegui: can confirm that db-switchover did update zarcillo successfully
[05:11:38] I guess the backup is running jynus
[05:11:41] kormat: excellent!
[05:11:44] cool
[05:11:50] !log kormat@cumin1001 dbctl commit (dc=all): 'Depool db1123 until it's reimaged to buster T284648', diff saved to https://phabricator.wikimedia.org/P16649 and previous config saved to /var/cache/conftool/dbconfig/20210621-051149-kormat.json
[05:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:54] T284648: Switchover s3 from db1123 to db1157 - https://phabricator.wikimedia.org/T284648
[05:11:55] then the switchover script did the right thing
[05:12:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[05:13:01] let's make sure all of heartbeat, query killer is nice and happy everywhere else
[05:14:36] marostegui: are there any outstanding schema changes that need to be applied to db1123 (old primary)?
[05:15:06] yep, let me list them
[05:15:10] Is it OK to update cxserver now? Anything on deploy1002 that might interfere?
[05:15:46] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:15:59] kormat: https://phabricator.wikimedia.org/T277123 https://phabricator.wikimedia.org/T268392 https://phabricator.wikimedia.org/T266486 I can apply those
[05:16:22] And https://phabricator.wikimedia.org/T276150
[05:16:40] actually, we can switch back to #-databases now
[05:16:46] yep
[05:16:53] kart_: you can proceed
[05:17:46] Thanks.
[05:18:22] (CR) KartikMistry: [C: +2] Add support for Elia MT to cxserver [deployment-charts] - https://gerrit.wikimedia.org/r/699089 (https://phabricator.wikimedia.org/T276059) (owner: KartikMistry)
[05:20:36] (Merged) jenkins-bot: Add support for Elia MT to cxserver [deployment-charts] - https://gerrit.wikimedia.org/r/699089 (https://phabricator.wikimedia.org/T276059) (owner: KartikMistry)
[05:23:33] (PS1) Kormat: db1123: Disable notifications.
[puppet] - https://gerrit.wikimedia.org/r/700463 (https://phabricator.wikimedia.org/T283131)
[05:24:31] (PS4) Jcrespo: dbbackups: Switchover s3 backup source from db1171 to db1102 (buster) [puppet] - https://gerrit.wikimedia.org/r/692845 (https://phabricator.wikimedia.org/T283131)
[05:24:45] (CR) Kormat: [C: +2] db1123: Disable notifications. [puppet] - https://gerrit.wikimedia.org/r/700463 (https://phabricator.wikimedia.org/T283131) (owner: Kormat)
[05:25:27] !log kartik@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[05:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:56] (CR) Jcrespo: [C: +2] dbbackups: Switchover s3 backup source from db1171 to db1102 (buster) [puppet] - https://gerrit.wikimedia.org/r/692845 (https://phabricator.wikimedia.org/T283131) (owner: Jcrespo)
[05:31:19] !log stopping replication on db1123 T283131
[05:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:23] T283131: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131
[05:33:37] !log kartik@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[05:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 25%: Repool db1099:3311 after re-adding rev_page_id index', diff saved to https://phabricator.wikimedia.org/P16650 and previous config saved to /var/cache/conftool/dbconfig/20210621-053645-root.json
[05:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:19] !log kartik@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[05:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:01] PROBLEM - MariaDB Replica Lag: s3 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1067.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:50:31] !log cxserver: Added support for Elia MT + Updated to 2021-06-10-074331-production (T276059, T275803, T276246, T283513, T255231, T237028)
[05:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:44] T237028: Evaluate: