[05:47:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2102:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:51:22] remember to migrate long-running sessions/screens to cumin2002, as cumin1002 is going to be rebooted at the end of this week!
[07:54:02] (there are a couple of old ones right now)
[09:47:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2102:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:09:57] that's a host I am about to decommission, so it won't bother us any more
[12:57:25] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on backup1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:00:15] checking
[13:01:31] weird: "systemd-timedated.service: start operation timed out. Terminating."
[13:01:50] it must be some weird race condition that only happens once in a thousand times
[13:01:55] I just restarted it
[13:02:43] still, I would like systemd-related alerts on -feed, not here
[13:07:25] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on backup1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:09:48] It says "systemd-timedated.service: Succeeded." but the unit shows as "inactive (dead)"
[13:14:07] I think it worked now (?)
[13:47:20] there's one clouddb host that has been struggling with replag since Friday
[13:47:24] https://grafana.wikimedia.org/goto/nZb63Y8Sg?orgId=1
[13:47:51] the load on the host does not seem too high
[13:48:17] do you know what could explain it?
[13:54:32] dhinus: we were in a meeting, but a DBA can surely have a look to see if there is some ongoing maintenance right now
[13:55:25] dhinus: did you check SHOW PROCESSLIST first, to see if the host itself is running any migration? I would check that first, and then orchestrator, to see if replication is stopped somewhere above
[13:55:39] orchestrator looks good, the host above is healthy and has no replag
[13:56:15] also, someone told us that a clouddb host did not come back, not sure if it is this one or another one, and that could be a cause of extra load?
[13:56:25] different one
[13:56:29] sorry, I don't know much about that
[13:56:42] np :) I was actually wondering if that one is getting extra load
[13:56:49] from that graph, however
[13:57:02] I would say it is actually a load issue and not maintenance
[13:57:07] but no, because the host that failed to boot is not hosting s1
[13:57:09] because it seems to be going up and down
[13:57:31] processlist shows a few long-running SELECTs, but it seems odd that they're enough to cause the replag
[13:57:32] while maintenance would usually show replication stopped continuously
[13:57:40] yeah, that could be it
[13:57:58] one change that is suspiciously timed is https://github.com/toolforge/quarry/pull/51
[13:58:13] it might have caused more parallel connections from quarry
[13:58:56] in theory that change should only affect connections from quarry to its own db, though
[13:59:26] yeah, and it is a reasonable change, although it shouldn't affect long-running queries much
[14:00:09] CPU usage and IOPS on the host do not show big spikes
[14:00:30] well, CPU increased quite a bit
[14:00:35] I'm afraid you may be having some performance issue, but that will require debugging
[14:02:07] the replag is not exploding, so we can also wait and see what happens
[14:02:24] I'll try to see if I can find something suspicious in the processlist
[14:02:55] I mean, back in the day, when the overall resources were small, what we did was reduce the per-user resources (i.e. lower the query killer timeout)
[14:03:40] those arcs tell me that it is not maintenance but a performance issue, as the lag is going up and down
[14:06:59] what could the perf bottleneck be? long queries on the slave creating locks?
[14:07:15] CPU and disk don't look saturated
[14:11:23] I would start with SELECT * FROM sys.user_summary_by_statement_latency;
[14:12:27] SELECT * FROM sys.user_summary_by_file_io;
[14:12:39] but there is no magic solution!
[14:12:43] ooh nice, I was using SELECT id,user,host,db,time,info FROM INFORMATION_SCHEMA.PROCESSLIST ORDER BY time DESC
[14:13:08] user_summary_by_file_io is quite neat, never used it before!
[14:14:32] sadly we cannot have those on grafana because of privacy concerns
[14:14:49] but hopefully at some point we'll get them on a private dashboard or something
[14:15:20] that would be nice
[14:52:36] following up from last week: FYI, I plan to update conftool (including dbctl) starting at around 14:00 UTC tomorrow (6/18) for T365123. As noted before, this should have no impact on in-flight schema-change scripts etc.
[14:52:36] I'll ping here as I start work (and track progress in -operations / SAL).
[14:52:36] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123
[15:03:04] dc-ops fixed the issue on clouddb1018
[15:03:48] I will restart the services
[15:06:33] replication restarted on clouddb1018; I will wait before repooling it because it's lagging 3 days behind
[17:07:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2102:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:07:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2102:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
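
(For reference, a minimal sketch of the replica-lag triage discussed between 13:55 and 14:12, collecting the queries quoted in the log plus a standard replication-status check. It assumes direct SQL access to the lagging clouddb replica and a MariaDB/MySQL version that ships the sys schema; it is not a full runbook.)

    -- Replication health on the lagging replica: lag and SQL/IO thread state
    SHOW SLAVE STATUS\G

    -- Long-running queries, longest first (query used above at 14:12:43)
    SELECT id, user, host, db, time, info
      FROM INFORMATION_SCHEMA.PROCESSLIST
     ORDER BY time DESC;

    -- Per-user statement latency and file I/O, to see who is consuming the host
    -- (sys schema views suggested above at 14:11:23 and 14:12:27)
    SELECT * FROM sys.user_summary_by_statement_latency;
    SELECT * FROM sys.user_summary_by_file_io;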