[05:04:24] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db2245:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:35:32] <federico3>	 oh wow, transfer.py failed and it might have left nc and pigz running 
[08:35:49] <federico3>	 is it a known bug?
[08:35:54] <marostegui>	 yes
[08:37:41] <marostegui>	 however, there may have been a race condition: I was using es2032 to clone sretest and I saw it had started to get pooled and mariadb got brought up, was that your script?
[08:37:56] <marostegui>	 because I checked and the host was last used 17th Oct
[08:38:34] <federico3>	 the clone_es cookbook was running as part of the setup of es2055.codfw.wmnet
[08:39:27] <federico3>	 anyhow it's depooled at the moment, can I terminate nc/pigz by hand?
[08:40:17] <marostegui>	 yes
[08:40:25] <marostegui>	 but the host didn't finish since 17th?
[08:40:39] <marostegui>	 I am using es2028 now anyway
[08:42:30] <federico3>	 yes but i did not want to pool the host back in during the weekend
[08:42:40] <marostegui>	 got it
[08:52:53] <jynus>	 So if the transfer fails, the script tries to kill the nc and pigz processes, but it may not handle all cases, like the process getting stuck
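A rough sketch of how leftover nc/pigz processes could be spotted by hand after a failed transfer. This is not how transfer.py does its cleanup; the helper names (`parse_ps`, `find_leftovers`, `current_ps_output`) and the one-hour age threshold are assumptions made for illustration. It only lists candidates and kills nothing.

```python
import subprocess
from typing import List, NamedTuple, Sequence


class Proc(NamedTuple):
    pid: int
    age_s: int   # seconds since the process started
    name: str    # command name, without arguments


def parse_ps(ps_output: str) -> List[Proc]:
    """Parse lines of the form '<pid> <elapsed seconds> <command name>'."""
    procs = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        pid, age, name = parts
        if pid.isdigit() and age.isdigit():
            procs.append(Proc(int(pid), int(age), name.strip()))
    return procs


def find_leftovers(
    ps_output: str,
    names: Sequence[str] = ("nc", "pigz"),
    min_age_s: int = 3600,  # hypothetical threshold: older than one hour
) -> List[Proc]:
    """Return processes with a matching name that are at least min_age_s old."""
    return [
        p for p in parse_ps(ps_output)
        if p.name in names and p.age_s >= min_age_s
    ]


def current_ps_output() -> str:
    """Ask procps `ps` for pid, elapsed seconds and command name, no header."""
    result = subprocess.run(
        ["ps", "-eo", "pid=,etimes=,comm="],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

On the host, `find_leftovers(current_ps_output())` would list candidates to review before terminating anything by hand.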
[08:54:47] <federico3>	 the only files modified recently are related to mysql being started:
[08:54:51] <federico3>	 https://www.irccloud.com/pastebin/4YbbvNTW/
[08:55:03] <marostegui>	 that's like 4 minutes ago
[08:55:14] <federico3>	 ...and the wikis are all untouched
[08:55:23] <marostegui>	 maybe the cloning script understood the transfer finished and started mariadb?
[08:57:03] <federico3>	 no, it was a leftover nc/pigz pair from a previous transfer.py crashing out. After that a different transfer.py was run successfully by the clone_es cookbook
[08:57:42] <federico3>	 based on the wiki files being unmodified it seems ok to pool the host in
[08:58:04] <federico3>	 I'll just restart mysql again and monitor how it behaves
[09:08:33] <jynus>	 One way to confirm mysql is healthy is to restart it as you did and confirm it had a healthy shutdown by not starting a recovery process
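A minimal sketch of that check: scan the server log written after the restart for lines suggesting InnoDB started a recovery. The marker strings are assumptions and should be checked against the messages your MariaDB version actually emits; `recovery_lines` is a hypothetical helper, not part of any existing tool.

```python
from typing import Iterable, List

# Assumed substrings indicating an unclean shutdown; adjust per MariaDB version.
RECOVERY_MARKERS = (
    "Starting crash recovery",
    "Database was not shutdown normally",
)


def recovery_lines(
    log_lines: Iterable[str],
    markers: Iterable[str] = RECOVERY_MARKERS,
) -> List[str]:
    """Return the log lines that contain any of the recovery markers."""
    markers = tuple(markers)
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(marker in line for marker in markers)
    ]


def shutdown_looks_clean(log_lines: Iterable[str]) -> bool:
    """True when no recovery marker appears in the given log lines."""
    return not recovery_lines(log_lines)
```

Feeding it only the lines logged since the last restart (for example from the journal or the error log) avoids matching recoveries from older startups.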
[09:09:46] <jynus>	 I am doing a test backup of x1 to test new cumin1003 backup setup
[09:28:53] <federico3>	 yes I was monitoring the logs during restart, the time it takes and the amount of i/o because it felt sluggish, but afaict it's ok
[09:29:30] <federico3>	 unrelated - there's a handful of alerts on https://alerts.wikimedia.org/?q=%40cluster%3Dwikimedia.org&q=instance%3D~%5E(db%7Cpc%7Ces)%5B12%5D.* 
[09:30:00] <federico3>	 did something happen?
[09:30:38] <marostegui>	 those hosts are being set up
[10:01:10] <jynus>	 snapshot worked nicely after migration, a refreshing experience for once
[11:57:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on es2056:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:39:32] <Amir1>	 Straightforward clean up of SqlBagOStuff in mw: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1196055 
[12:39:52] <Amir1>	 if someone feels like reviewing, the mode has been completely disabled in production for weeks now
[13:00:04] <kavitha>	 Hi all, I am running late, will join in 5mins
[14:56:25] <Emperor>	 What I predicted in our team meeting has happened - https://chaos.social/@fleaz/115405485437426139 :)