[01:07:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2102:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:19:14] I am going to start with the s4 switchover
[05:07:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2102:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:44:46] jynus: moving all the SystemdUnitFailed alerts to the -feed channel should be doable if it's what we want
[07:46:13] it is what *I* would suggest, alertmanager is very spammy, but I don't get to decide 0:-)
[07:46:59] especially the systemd unit one - I don't want to get notified every X hours that a decommissioned host has a failed service
[07:47:45] dhinus: the performance issues are because the query killers aren't running
[07:47:53] dhinus: Did you stop/remove the service during the reboots?
[07:48:16] I just started it
[08:19:40] marostegui: I think they were running yesterday, because I checked /var/log/wmf-pt-kill/wmf-pt-kill-s1.log and I could see many queries being killed
[08:20:10] I also tried to see if the lag decrease matched the times of the queries being killed, but I couldn't find a clear correlation
[08:20:30] Are you sure? I even had to install the package
[08:21:16] I didn't check the services, but you can check the logs and you'll find many queries logged in the last 3 days
[08:21:30] so I don't know what was logging there, but something was :)
[08:21:56] I also noticed a big increase in the size of the logs since last Friday
[08:22:11] only a few queries per day were logged until Friday, then they suddenly became dozens
[08:24:32] I will check later if I have some time
[08:27:41] thanks, I'll keep an eye on the replag today
[08:48:22] is db1165 meant to be downtimed?
[08:48:49] its alert flapped, it seems; it's back online
[08:49:24] arnaudb: can you troubleshoot it?
[08:49:31] checking rn
[08:49:34] thanks
[08:50:52] I will downtime it as it seems to have hardware issues, and will open a ticket to dc-ops after that, fyi
[08:51:06] probably worth depooling too then
[08:58:12] T367854
[08:58:12] T367854: db1165 network flapping issues - https://phabricator.wikimedia.org/T367854
[08:58:28] thank you!
[10:26:34] Time to reclaim the moss-fe nodes back from ms so they can go into apus as $DEITY intended... https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047033
[12:42:25] FIRING: [2x] SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.26.service on moss-be2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:47:25] RESOLVED: [2x] SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.26.service on moss-be2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:53:31] es6 and es7 dumps failed on both datacenters
[12:56:14] "[ERROR] - Could not read data from arwiki.blobs_cluster30: Lost connection to MySQL server during query"
[12:56:25] that's weird, a network issue on both dcs at the same time?
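(Editor's aside: a minimal Python sketch of the lag-correlation check dhinus describes above at 08:20, matching kill timestamps from the wmf-pt-kill log against replication-lag samples. The log line format and the lag export file used here are assumptions for illustration only, not the real wmf-pt-kill format or a WMF tool.)

```python
#!/usr/bin/env python3
"""Sketch only: correlate assumed wmf-pt-kill kill events with lag samples."""
import re
from datetime import datetime, timedelta

KILL_LOG = "/var/log/wmf-pt-kill/wmf-pt-kill-s1.log"
LAG_CSV = "lag_s1.csv"  # hypothetical export, one "2024-06-18T08:19:40,42.0" per line

# Assumed log line shape: an ISO-ish timestamp at the start of each kill entry.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2})")

def parse_kill_times(path):
    times = []
    with open(path) as fh:
        for line in fh:
            m = TS_RE.match(line)
            if m:
                times.append(datetime.fromisoformat(m.group(1).replace(" ", "T")))
    return times

kills = parse_kill_times(KILL_LOG)

lag = []
with open(LAG_CSV) as fh:
    for line in fh:
        ts, value = line.strip().split(",")
        lag.append((datetime.fromisoformat(ts), float(value)))

# For each kill, report how the lag behaved in the following five minutes.
window = timedelta(minutes=5)
for kill_ts in kills:
    after = [v for ts, v in lag if kill_ts <= ts <= kill_ts + window]
    if after:
        print(f"{kill_ts}  lag after kill: max={max(after):.1f}s min={min(after):.1f}s")
```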
[13:05:12] there are spikes of network errors every 30 minutes, not related to the main service
[13:05:20] puppet? Some other collector?
[13:05:44] They are disk writes at 3 MB/s
[13:07:16] it matches puppet runs
[13:08:43] It is puppet runs
[13:08:49] confirmed on a separate host
[13:13:25] FIRING: SystemdUnitFailed: systemd-timedated.service on backup2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:16:37] ^ the puppet run killed the backup run (!)
[13:19:06] it makes no sense, because it is killed exactly at 13:08:06
[13:19:14] which matches with puppet again
[13:19:30] but the puppet logs say: Jun 18 13:08:00 backup2002 puppet-agent-cronjob: Sleeping 47 for random splay
[13:19:43] Jun 18 13:08:49 backup2002 puppet-agent-cronjob: su: warning: cannot change directory to /nonexistent: No such file or directory
[13:20:30] ok, one more run with puppet disabled, just to prove the correlation 100%
[13:28:25] RESOLVED: SystemdUnitFailed: systemd-timedated.service on backup2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:55:20] hi all - would you have any objections if, as part of verifying the updated dbctl (starting soon), I make a no-op edit to one of the notes fields on a parsercache spare? It's the "safest" edit I can think of to test an end-to-end write :)
[13:56:58] it's also "fine" if a schema change script comes along and commits it for me (e.g., when it depools), and AFAICT the repool script waits for uncommitted changes
[13:57:03] that seems fine to me, but the DBAs should have the say ^
[14:03:12] thanks, jynus - I'll hold until I hear back :)
[16:32:00] FYI, as I've not heard any objections, I'll move ahead with this on cumin2002. Specifically, I'll be editing (and then reverting) the note tag on pc2014 (codfw spare for pc1 and pc2)
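(Editor's aside: a minimal Python sketch of the kind of check used above to tie the 13:08 backup failure to puppet's half-hourly agent runs, i.e. whether a set of failure timestamps clusters around a 30-minute cron cadence. The event timestamps, cadence, and splay window below are made-up illustration values, not data from backup2002.)

```python
#!/usr/bin/env python3
"""Sketch only: test whether failure times line up with a 30-minute cron cadence."""
from datetime import datetime

# Hypothetical failure timestamps pulled from journalctl/syslog by hand.
events = [
    "2024-06-18 12:38:07",
    "2024-06-18 13:08:06",
    "2024-06-18 13:38:05",
]

PERIOD = 30 * 60   # assumed puppet-agent cron cadence, in seconds
MAX_SPLAY = 120    # assumed random splay window, in seconds

for raw in events:
    ts = datetime.fromisoformat(raw)
    secs_into_hour = ts.minute * 60 + ts.second
    offset = secs_into_hour % PERIOD
    near_run = offset <= MAX_SPLAY or (PERIOD - offset) <= MAX_SPLAY
    verdict = "matches puppet cadence" if near_run else "does not match"
    print(f"{raw}: {offset:4d}s into the 30-min slot -> {verdict}")
```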