[05:04:24] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db2245:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:35:32] oh wow, transfer.py failed and it might have left nc and pigz running
[08:35:49] is it a known bug?
[08:35:54] yes
[08:37:41] however, there may have been a race condition because I was using es2032 to clone sretest and I saw it was starting to get pooled and mariadb got brought up, was that your script?
[08:37:56] because I checked and the host was last used 17th Oct
[08:38:34] the clone_es cookbook was running as part of the setup of es2055.codfw.wamnet
[08:39:27] anyhow it's depooled at the moment, can I terminate nc/pigz by hand?
[08:40:17] yes
[08:40:25] but the host didn't finish since 17th?
[08:40:39] I am using es2028 now anyway
[08:42:30] yes but I did not want to pool the host back in during the weekend
[08:42:40] got it
[08:52:53] So if the transfer fails, the script tries to kill the nc and pigz processes, but it may not handle all cases, like the process getting stuck
[08:54:47] the only files modified recently are related to mysql being started:
[08:54:51] https://www.irccloud.com/pastebin/4YbbvNTW/
[08:55:03] that's like 4 minutes ago
[08:55:14] ...and the wikis are all untouched
[08:55:23] maybe the cloning script understood the transfer finished and started mariadb?
[08:57:03] no, it was a leftover nc/pigz pair from a previous transfer.py crashing out. After that a different transfer.py was run successfully by the clone_es cookbook
[08:57:42] based on the wiki files being unmodified it seems ok to pool the host in
[08:58:04] I'll just restart mysql again and monitor how it behaves
[09:08:33] One way to confirm mysql is healthy is to restart it as you did and check that the shutdown was clean, i.e. that it does not start a recovery process on startup
[09:09:46] I am doing a test backup of x1 to test the new cumin1003 backup setup
[09:28:53] yes I was monitoring the logs during restart and the time it takes and amount of i/o because it felt sluggish but afaict it's ok
[09:29:30] unrelated - there's a handful of alerts on https://alerts.wikimedia.org/?q=%40cluster%3Dwikimedia.org&q=instance%3D~%5E(db%7Cpc%7Ces)%5B12%5D.*
[09:30:00] did something happen?
[09:30:38] those hosts are being set up
[10:01:10] snapshot worked nicely after migration, a refreshing experience for once
[11:57:25] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on es2056:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:39:32] Straightforward clean up of SqlBagOStuff in mw: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1196055
[12:39:52] if someone feels like reviewing, the mode has been completely disabled in production for weeks now
[13:00:04] Hi all, I am running late, will join in 5mins
[14:56:25] What I predicted in our team meeting has happened - https://chaos.social/@fleaz/115405485437426139 :)
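Note on the 08:52:53 remark (transfer.py tries to kill nc and pigz on failure but may miss a stuck process): a common way to make that kind of cleanup more robust is to escalate from SIGTERM to SIGKILL after a grace period. The sketch below is only an illustration of that pattern on a POSIX host, not transfer.py's actual cleanup code; `stop_process` and `grace_seconds` are hypothetical names, and the sleeping child is a stand-in for a leftover helper such as nc or pigz.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> str:
    """Stop a child process, escalating from SIGTERM to SIGKILL if it hangs.

    Returns "exited", "terminated" or "killed" so the caller can log what
    happened. Hypothetical helper, not part of transfer.py.
    """
    if proc.poll() is not None:
        return "exited"  # already gone, nothing to clean up
    proc.terminate()  # polite request first (SIGTERM on POSIX)
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGTERM was ignored or the process is hung: force it
        proc.wait()  # reap the child so no zombie is left behind
        return "killed"


if __name__ == "__main__":
    # Stand-in for a leftover helper process: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_process(child, grace_seconds=2.0))
```

This only covers processes the script started itself and still holds a handle to. A leftover nc/pigz pair from an earlier crashed run, as in this incident, would still have to be found and terminated by hand.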