[00:55:05] FIRING: [6x] MysqlReplicationThreadCountTooLow: MySQL instance db1150:13314 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[02:00:05] FIRING: [6x] MysqlReplicationThreadCountTooLow: MySQL instance db1150:13314 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[03:20:05] FIRING: [4x] MysqlReplicationThreadCountTooLow: MySQL instance db1150:13314 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[03:25:05] FIRING: [4x] MysqlReplicationThreadCountTooLow: MySQL instance db1150:13314 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[03:35:05] RESOLVED: [4x] MysqlReplicationThreadCountTooLow: MySQL instance db1150:13314 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[05:20:05] FIRING: MysqlReplicationThreadCountTooLow: MySQL instance db1150:13313 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1150&var-port=13313 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[05:30:05] FIRING: [2x] MysqlReplicationThreadCountTooLow: MySQL instance db1150:13313 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[07:10:05] RESOLVED: MysqlReplicationThreadCountTooLow: MySQL instance db2139:13313 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2139&var-port=13313 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[07:14:27] i'm going to switch codfw es7 master T373168
[07:14:32] T373168: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T373168
[07:38:54] i'm going to switch codfw s1 master T373173
[07:38:55] T373173: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T373173
[07:56:23] there was an uncaught exception during execution https://phabricator.wikimedia.org/P67759
[08:25:05] i've disabled semi-sync on db2176 to fix its replication
[08:41:55] same on db2216
[09:49:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:10:53] started T373174
[12:10:53] T373174: Switchover s6 master (db2214 -> db2129) - https://phabricator.wikimedia.org/T373174
[12:22:28] Amir1: you have a task for the db-switchover tests?
[12:22:38] just knowing if I can reference the puppet patch
[12:22:44] I don't need a task anyway
[12:23:40] marostegui: I actually don't :(
[12:23:46] Amir1: no problem
[12:34:25] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db2230:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:43:41] Amir1: the script failed in the first run
[12:44:05] 💔
[12:44:15] what's the error?
[12:44:25] FIRING: [3x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db2230:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:44:25] db-switchover isn't great with error handling
[12:44:30] [ERROR]: db2232/(none) failed to be moved under the new master
[12:44:32] :)
[12:44:39] ah yeah, classic
[12:46:00] yeah it fails consistently
[12:46:06] db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2230 db2231
[12:49:19] I'm going to debug this
[12:49:21] thanks <3
[12:49:37] Amir1: This is the topology I set up https://orchestrator.wikimedia.org/web/cluster/alias/test-s4
[12:49:50] Amir1: Please test before I am gone, so I can revert all the changes
[12:49:54] So during the week
[12:49:58] sure
[13:03:03] don't ask how I got here but here is the error
[13:03:04] here {'success': False, 'errno': -1, 'errmsg': 'We expected both hosts to be stopped and in sync, but they are not, or other error happened'}
[13:03:49] they are in the same coordinates
[13:04:02] mmmm wait
[13:04:09] I know what the problem is
[13:04:40] Wow the topology looks weird now
[13:04:50] yeah
[13:04:53] I think the command needed db2231 db2232 instead of db2230
[13:04:59] do you need me to fix the topology?
[13:05:43] thanks
[13:05:48] that'd be amazing
[13:06:05] Also why is db2232 not able to reach db1125?
[13:06:09] root@db2232:~# telnet db1125.eqiad.wmnet 3306
[13:06:09] Trying 10.64.48.73...
[13:06:09] telnet: Unable to connect to remote host: Connection refused
[13:07:09] marostegui: I can't reach it from cumin either
[13:07:43] ah wait
[13:07:45] because the master is down
[13:09:14] ok topology fixed
[13:13:17] oh gods
[13:13:19] I'm an idiot
[13:13:29] spot the idiot
[13:13:34] https://www.irccloud.com/pastebin/ZetNAptO/
[13:13:55] XDDDDD
[15:32:45] new version of wmfmariadbpy is pushed to the repo, already upgraded in cumin2002
[15:34:32] the switchover of test-s4 looks fine
[15:48:38] Amir1: can I roll back the changes then?
[16:02:31] marostegui: sure
[16:02:49] Ok
[16:44:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:45:23] I run pt-heartbeat-wikimedia service in db2231 (test-s4) but it automatically unfixes itself and stops it making orchestrator panic
[18:45:51] Aug 26 17:05:55 db2231 systemd[1]: pt-heartbeat-wikimedia.service: Deactivated successfully.
[18:45:51] Aug 26 17:05:55 db2231 systemd[1]: Stopped pt-heartbeat-wikimedia.service - "pt-heartbeat-wikimedia".
[18:59:50] Amir1: don't worry I'm going to roll back the changes tomorrow morning
[19:01:44] thanks. I was just slightly worried about the script messing up with stuff
[19:03:54] The script shouldn't be touching the service
[19:03:57] That's all puppet
[20:46:10] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed