[00:55:05] <jinxer-wm>	 FIRING: [6x] MysqlReplicationThreadCountTooLow: MySQL instance db1150:13314 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica  - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[02:00:05] <jinxer-wm>	 FIRING: [6x] MysqlReplicationThreadCountTooLow: MySQL instance db1150:13314 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica  - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[03:20:05] <jinxer-wm>	 FIRING: [4x] MysqlReplicationThreadCountTooLow: MySQL instance db1150:13314 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica  - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[03:25:05] <jinxer-wm>	 FIRING: [4x] MysqlReplicationThreadCountTooLow: MySQL instance db1150:13314 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica  - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[03:35:05] <jinxer-wm>	 RESOLVED: [4x] MysqlReplicationThreadCountTooLow: MySQL instance db1150:13314 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica  - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[05:20:05] <jinxer-wm>	 FIRING: MysqlReplicationThreadCountTooLow: MySQL instance db1150:13313 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1150&var-port=13313 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[05:30:05] <jinxer-wm>	 FIRING: [2x] MysqlReplicationThreadCountTooLow: MySQL instance db1150:13313 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica  - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[07:10:05] <jinxer-wm>	 RESOLVED: MysqlReplicationThreadCountTooLow: MySQL instance db2139:13313 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2139&var-port=13313 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[07:14:27] <arnaudb>	 i'm going to switch codfw es7 master T373168
[07:14:32] <stashbot>	 T373168: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T373168
[07:38:54] <arnaudb>	 i'm going to switch codfw s1 master T373173
[07:38:55] <stashbot>	 T373173: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T373173
[07:56:23] <arnaudb>	 there was an uncaught exception during execution https://phabricator.wikimedia.org/P67759
[08:25:05] <arnaudb>	 i've disabled semi-sync on db2176 to fix its replication
[08:41:55] <arnaudb>	 same on db2216
[09:49:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:10:53] <arnaudb>	 started T373174
[12:10:53] <stashbot>	 T373174: Switchover s6 master (db2214 -> db2129) - https://phabricator.wikimedia.org/T373174
[12:22:28] <marostegui>	 Amir1: you have a task for the db-switchover tests?
[12:22:38] <marostegui>	 just knowing if I can reference the puppet patch
[12:22:44] <marostegui>	 I don't need a task anyway
[12:23:40] <Amir1>	 marostegui: I actually don't :(
[12:23:46] <marostegui>	 Amir1: no problem
[12:34:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db2230:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:43:41] <marostegui>	 Amir1: the script failed in the first run
[12:44:05] <Amir1>	 💔
[12:44:15] <Amir1>	 what's the error?
[12:44:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db2230:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:44:25] <marostegui>	 db-switchover isn't great with error handling
[12:44:30] <marostegui>	 [ERROR]: db2232/(none) failed to be moved under the new master
[12:44:32] <marostegui>	 :)
[12:44:39] <Amir1>	 ah yeah, classic
[12:46:00] <marostegui>	 yeah it fails consistently
[12:46:06] <marostegui>	 db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2230 db2231
[12:49:19] <Amir1>	 I'm going to debug this
[12:49:21] <Amir1>	 thanks <3
[12:49:37] <marostegui>	 Amir1: This is the topology I set up https://orchestrator.wikimedia.org/web/cluster/alias/test-s4
[12:49:50] <marostegui>	 Amir1: Please test before I am gone, so I can revert all the changes
[12:49:54] <marostegui>	 So during the week
[12:49:58] <Amir1>	 sure
[13:03:03] <Amir1>	 don't ask how I got here but here is the error
[13:03:04] <Amir1>	 here {'success': False, 'errno': -1, 'errmsg': 'We expected both hosts to be stopped and in sync, but they are not, or other error happened'}
[13:03:49] <marostegui>	 they are in the same coordinates
[13:04:02] <marostegui>	 mmmm wait
[13:04:09] <marostegui>	 I know what the problem is
[13:04:40] <marostegui>	 Wow the topology looks weird now
[13:04:50] <Amir1>	 yeah
[13:04:53] <marostegui>	 I think the command needed db2231 db2232 instead of db2230
[13:04:59] <marostegui>	 do you need me to fix the topology?
[13:05:43] <Amir1>	 thanks
[13:05:48] <Amir1>	 that'd be amazing
[13:06:05] <marostegui>	 Also why is db2232 not able to reach db1125?
[13:06:09] <marostegui>	 root@db2232:~# telnet db1125.eqiad.wmnet 3306
[13:06:09] <marostegui>	 Trying 10.64.48.73...
[13:06:09] <marostegui>	 telnet: Unable to connect to remote host: Connection refused
[13:07:09] <Amir1>	 marostegui: I can't reach it from cumin either
[13:07:43] <marostegui>	 ah wait
[13:07:45] <marostegui>	 because the master is down
[13:09:14] <marostegui>	 ok topology fixed
[13:13:17] <Amir1>	 oh gods
[13:13:19] <Amir1>	 I'm an idiot
[13:13:29] <Amir1>	 spot the idiot
[13:13:34] <Amir1>	 https://www.irccloud.com/pastebin/ZetNAptO/
[13:13:55] <marostegui>	 XDDDDD
[15:32:45] <Amir1>	 new version of wmfmariadbpy is pushed to the repo, already upgraded in cumin2002
[15:34:32] <Amir1>	 the switchover of test-s4 looks fine
[15:48:38] <marostegui>	 Amir1: can I roll back the changes then?
[16:02:31] <Amir1>	 marostegui: sure
[16:02:49] <marostegui>	 Ok
[16:44:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:45:23] <Amir1>	 I started the pt-heartbeat-wikimedia service on db2231 (test-s4) but it automatically unfixes itself and stops, making orchestrator panic
[18:45:51] <Amir1>	 Aug 26 17:05:55 db2231 systemd[1]: pt-heartbeat-wikimedia.service: Deactivated successfully.
[18:45:51] <Amir1>	 Aug 26 17:05:55 db2231 systemd[1]: Stopped pt-heartbeat-wikimedia.service - "pt-heartbeat-wikimedia".
[18:59:50] <marostegui>	 Amir1: don't worry I'm going to roll back the changes tomorrow morning 
[19:01:44] <Amir1>	 thanks. I was just slightly worried about the script messing with stuff
[19:03:54] <marostegui>	 The script shouldn't be touching the service 
[19:03:57] <marostegui>	 That's all puppet 
[20:46:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed