[01:07:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2102:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:19:14] I am going to start with the s4 switchover
[05:07:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2102:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:44:46] jynus: moving all the SystemdUnitFailed alerts to the -feed channel should be doable if it's what we want
[07:46:13] it is what *I* would suggest, alertmanager is very spammy, but I don't get to decide 0:-)
[07:46:59] especially the systemd unit one - I don't want to get notified every X hours that a decommissioned host has a failed service
[07:47:45] dhinus: the performance issues are because the query killers aren't running
[07:47:53] dhinus: Did you stop/remove the service during the reboots?
[07:48:16] I just started it
[08:19:40] marostegui: I think they were running yesterday, because I checked /var/log/wmf-pt-kill/wmf-pt-kill-s1.log and I could see many queries being killed
[08:20:10] I also tried to see if the lag decrease matched the times of the queries being killed, but I couldn't find a clear correlation
[08:20:30] Are you sure? I even had to install the package
[08:21:16] I didn't check the services, but you can check the logs and you'll find many queries logged in the last 3 days
[08:21:30] so I don't know what was logging there, but something was :)
[08:21:56] I also noticed a big increase in the size of the logs since last Friday
[08:22:11] only a few queries per day were logged until Friday, then they suddenly became dozens
[08:24:32] I will check later if I have some time
[08:27:41] thanks, I'll keep an eye on the replag today
[08:48:22] is db1165 meant to be downtimed?
[08:48:49] its alert flapped, it seems; it's back online
[08:49:24] arnaudb: can you troubleshoot it?
[08:49:31] checking rn
[08:49:34] thanks
[08:50:52] I will downtime it as it seems to have hardware issues, and will open a ticket to dc-ops after that, fyi
[08:51:06] probably worth depooling too then
[08:58:12] T367854
[08:58:12] T367854: db1165 network flapping issues - https://phabricator.wikimedia.org/T367854
[08:58:28] thank you!
[10:26:34] Time to reclaim the moss-fe nodes back from ms so they can go into apus as $DEITY intended... https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047033
[12:42:25] FIRING: [2x] SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.26.service on moss-be2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:47:25] RESOLVED: [2x] SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.26.service on moss-be2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:53:31] es6 and es7 dumps failed on both datacenters
[12:56:14] "[ERROR] - Could not read data from arwiki.blobs_cluster30: Lost connection to MySQL server during query"
[12:56:25] that's weird, a network issue on both dcs at the same time?
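(Editor's aside: a minimal Python sketch of the lag-correlation check dhinus describes above at 08:20, matching kill timestamps from the wmf-pt-kill log against replication-lag samples. The log line format and the lag export file used here are assumptions for illustration only, not the real wmf-pt-kill format or a WMF tool.)

```python
#!/usr/bin/env python3
"""Sketch only: correlate assumed wmf-pt-kill kill events with lag samples."""
import re
from datetime import datetime, timedelta

KILL_LOG = "/var/log/wmf-pt-kill/wmf-pt-kill-s1.log"
LAG_CSV = "lag_s1.csv"  # hypothetical export, one "2024-06-18T08:19:40,42.0" per line

# Assumed log line shape: an ISO-ish timestamp at the start of each kill entry.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2})")

def parse_kill_times(path):
    times = []
    with open(path) as fh:
        for line in fh:
            m = TS_RE.match(line)
            if m:
                times.append(datetime.fromisoformat(m.group(1).replace(" ", "T")))
    return times

kills = parse_kill_times(KILL_LOG)

lag = []
with open(LAG_CSV) as fh:
    for line in fh:
        ts, value = line.strip().split(",")
        lag.append((datetime.fromisoformat(ts), float(value)))

# For each kill, report how the lag behaved in the following five minutes.
window = timedelta(minutes=5)
for kill_ts in kills:
    after = [v for ts, v in lag if kill_ts <= ts <= kill_ts + window]
    if after:
        print(f"{kill_ts}  lag after kill: max={max(after):.1f}s min={min(after):.1f}s")
```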
[13:05:12] there are spikes of network errors every 30 minutes, not related to the main service
[13:05:20] puppet? Some other collector?
[13:05:44] They are disk writes at 3 MB/s
[13:07:16] it matches puppet runs
[13:08:43] It is puppet runs
[13:08:49] confirmed on a separate host
[13:13:25] FIRING: SystemdUnitFailed: systemd-timedated.service on backup2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:16:37] ^ the puppet run killed the backup run (!)
[13:19:06] it makes no sense, because it is killed exactly at 13:08:06
[13:19:14] which matches with puppet again
[13:19:30] but the puppet logs say: Jun 18 13:08:00 backup2002 puppet-agent-cronjob: Sleeping 47 for random splay
[13:19:43] Jun 18 13:08:49 backup2002 puppet-agent-cronjob: su: warning: cannot change directory to /nonexistent: No such file or directory
[13:20:30] ok, one more run with puppet disabled, just to prove the correlation 100%
[13:28:25] RESOLVED: SystemdUnitFailed: systemd-timedated.service on backup2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:55:20] hi all - would you have any objections if, as part of verifying the updated dbctl (starting soon), I make a no-op edit to one of the notes fields on a parsercache spare? It's the "safest" edit I can think of to test an end-to-end write :)
[13:56:58] it's also "fine" if a schema change script comes along and commits it for me (e.g., when it depools), and AFAICT the repool script waits for uncommitted changes
[13:57:03] that seems fine to me, but the DBAs should have the say ^
[14:03:12] thanks, jynus - I'll hold until I hear back :)
[16:32:00] FYI, as I've not heard any objections, I'll move ahead with this on cumin2002. Specifically, I'll be editing (and then reverting) the note tag on pc2014 (codfw spare for pc1 and pc2)
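(Editor's aside: a minimal Python sketch of the kind of check used above to tie the 13:08 backup failure to puppet's half-hourly agent runs, i.e. whether a set of failure timestamps clusters around a 30-minute cron cadence. The event timestamps, cadence, and splay window below are made-up illustration values, not data from backup2002.)

```python
#!/usr/bin/env python3
"""Sketch only: test whether failure times line up with a 30-minute cron cadence."""
from datetime import datetime

# Hypothetical failure timestamps pulled from journalctl/syslog by hand.
events = [
    "2024-06-18 12:38:07",
    "2024-06-18 13:08:06",
    "2024-06-18 13:38:05",
]

PERIOD = 30 * 60   # assumed puppet-agent cron cadence, in seconds
MAX_SPLAY = 120    # assumed random splay window, in seconds

for raw in events:
    ts = datetime.fromisoformat(raw)
    secs_into_hour = ts.minute * 60 + ts.second
    offset = secs_into_hour % PERIOD
    near_run = offset <= MAX_SPLAY or (PERIOD - offset) <= MAX_SPLAY
    verdict = "matches puppet cadence" if near_run else "does not match"
    print(f"{raw}: {offset:4d}s into the 30-min slot -> {verdict}")
```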