[05:47:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2102:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:51:22] remember to migrate long-running sessions/screens to cumin2002, as cumin1002 is going to be rebooted at the end of this week!
[07:54:02] (there are a couple of old ones right now)
[09:47:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2102:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:09:57] that's a host I am about to decommission, so it won't bother us any more
[12:57:25] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on backup1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:00:15] checking
[13:01:31] weird: "systemd-timedated.service: start operation timed out. Terminating."
[13:01:50] it must be some weird race condition that only happens once in a thousand times
[13:01:55] I just restarted it
[13:02:43] still, I would like systemd-related alerts on -feed, not here
[13:07:25] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on backup1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:09:48] It says "systemd-timedated.service: Succeeded." but the unit shows as "inactive (dead)"
[13:14:07] I think it worked now (?)
[13:47:20] there's one clouddb host that has been struggling with replag since Friday
[13:47:24] https://grafana.wikimedia.org/goto/nZb63Y8Sg?orgId=1
[13:47:51] the load on the host does not seem too high
[13:48:17] do you know what could explain it?
[13:54:32] dhinus: we were in a meeting, but a DBA can surely have a look to see if there is some ongoing maintenance right now
[13:55:25] dhinus: did you check SHOW PROCESSLIST first, to see if the host itself is running any migration? I would check that first, and then orchestrator, to see if replication is stopped somewhere above
[13:55:39] orchestrator looks good, the host above is healthy and has no replag
[13:56:15] also, someone told us that a clouddb host did not come back, not sure if it is this one or another one, and that could be a cause of extra load?
[13:56:25] different one
[13:56:29] sorry, I don't know much about that
[13:56:42] np :) I was actually wondering if that one is getting extra load
[13:56:49] from that graph, however
[13:57:02] I would say it is actually a load issue and not maintenance
[13:57:07] but no, because the host that failed to boot is not hosting s1
[13:57:09] because it seems to be going up and down
[13:57:31] processlist shows a few long-running SELECTs, but it seems odd that they're enough to cause the replag
[13:57:32] while maintenance would usually show replication stopped continuously
[13:57:40] yeah, that could be it
[13:57:58] one change that is suspiciously timed is https://github.com/toolforge/quarry/pull/51
[13:58:13] it might have caused more parallel connections from quarry
[13:58:56] in theory that change should only affect connections from quarry to its own db, though
[13:59:26] yeah, and it is a reasonable change, although it shouldn't affect long-running queries much
[14:00:09] CPU usage and IOPS on the host do not show big spikes
[14:00:30] well, CPU increased quite a bit
[14:00:35] I'm afraid you may be having some performance issue, but that will require debugging
[14:02:07] the replag is not exploding, so we can also wait and see what happens
[14:02:24] I'll try to see if I can find something suspicious in the processlist
[14:02:55] I mean, back in the day, when the overall resources were small, what we did was reduce the per-user resources (i.e. lower the query killer timeout)
[14:03:40] those arcs tell me that it is not maintenance but a performance issue, as the lag is going up and down
[14:06:59] what could the perf bottleneck be? long queries on the slave creating locks?
[14:07:15] CPU and disk don't look saturated
[14:11:23] I would start with SELECT * FROM sys.user_summary_by_statement_latency;
[14:12:27] SELECT * FROM sys.user_summary_by_file_io;
[14:12:39] but there is no magic solution!
[14:12:43] ooh nice, I was using SELECT id,user,host,db,time,info FROM INFORMATION_SCHEMA.PROCESSLIST ORDER BY time DESC
[14:13:08] user_summary_by_file_io is quite neat, never used it before!
[14:14:32] sadly we cannot have those on grafana because of privacy concerns
[14:14:49] but hopefully at some point we'll get them on a private dashboard or something
[14:15:20] that would be nice
[14:52:36] following up from last week: FYI, I plan to update conftool (including dbctl) starting at around 14:00 UTC tomorrow (6/18) for T365123. As noted before, this should have no impact on in-flight schema-change scripts etc.
[14:52:36] I'll ping here as I start work (and track progress in -operations / SAL).
[14:52:36] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123
[15:03:04] dc-ops fixed the issue on clouddb1018
[15:03:48] I will restart the services
[15:06:33] replication restarted on clouddb1018; I will wait before repooling it because it's lagging 3 days behind
[17:07:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2102:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:07:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2102:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
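
(For reference, a minimal sketch of the replica-lag triage discussed between 13:55 and 14:12, collecting the queries quoted in the log plus a standard replication-status check. It assumes direct SQL access to the lagging clouddb replica and a MariaDB/MySQL version that ships the sys schema; it is not a full runbook.)

    -- Replication health on the lagging replica: lag and SQL/IO thread state
    SHOW SLAVE STATUS\G

    -- Long-running queries, longest first (query used above at 14:12:43)
    SELECT id, user, host, db, time, info
      FROM INFORMATION_SCHEMA.PROCESSLIST
     ORDER BY time DESC;

    -- Per-user statement latency and file I/O, to see who is consuming the host
    -- (sys schema views suggested above at 14:11:23 and 14:12:27)
    SELECT * FROM sys.user_summary_by_statement_latency;
    SELECT * FROM sys.user_summary_by_file_io;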