[07:54:23] Are we still expecting a bunch of prometheus-mysqld-exporter failures on alerts.wm.org?
[09:04:08] is there a better grafana dashboard than https://grafana.wikimedia.org/d/000000303/mysql-replication-lag for replication lag that includes only production hosts (no cloud, no dbstore, no backup sources, etc...)
[09:04:12] ?
[09:08:33] we got a small peak of connections on s4 (cc Amir1): https://grafana.wikimedia.org/goto/AgKYnbXSg?orgId=1 open connections more than doubled globally. Master's stats for the last 3h here: https://grafana.wikimedia.org/goto/iaJlnxuIR?orgId=1
[09:09:10] it looks like it had a hiccup on some metrics and there is some missing data for others
[09:10:25] * Amir1 dies inside
[09:10:47] spikes should be fine generally as long as they don't render the host unavailable
[09:19:30] slow queries on the master: https://logstash.wikimedia.org/goto/538e03d7ff95d1b175bb0c5d6acbe983
[09:20:30] ofc it includes unrelated things that got affected because of the mysql slowness
[09:21:35] yeah, the rows written is quite high. Maybe we have a gigantic transaction? We can look in the binlog, it stores exec time
[09:49:54] for the MysqlHostIoPressure alerts for 3 es hosts (es4 and es5), do we need to check anything specific? the runbook link doesn't say anything related, the dashboard links have no data (or no useful data). I'm looking at the standard dashboard
[09:51:51] there is indeed a spike in reads
[09:52:16] That's very likely either dumpers or some maint scripts
[09:52:47] last 2 days stats: https://grafana.wikimedia.org/goto/KjWcNxuSR?orgId=1
[09:53:03] yeah I guessed
[09:53:45] the fun part of this is that since it's a spike, you can't check the host. We could set up some sampling I think
[09:54:02] es4/5 is still ongoing
[09:54:21] oh nice
[09:54:24] eqiad?
[09:54:30] I really miss the old tool for queries (it has been so long that I can't even remember the name :/
[09:54:33] )
[09:55:42] tendril?
[09:55:55] yep :D
[09:55:59] bad memory
[09:56:24] I checked the processlist on es1022, nothing out of the ordinary jumped out
[09:56:34] same for me on es1020
[09:56:51] there are a lot of sleeping threads, which annoys me, but that's mw
[09:56:52] fast queries don't show up
[09:57:01] easily in processlist
[09:57:25] and ES queries are all the same :D
[09:58:19] yeah, the thing is that when a connection is opened in mw, it keeps it open since it might be reused, but it also means it stays open for the whole duration of the mw request
[09:58:44] maaaybe mw got slower and that's why there are a lot more connections piling up
[09:58:50] so each open thread has basically done a query recently or will do one shortly
[09:59:17] by recently it means "in the last 180 seconds" :D
[09:59:26] yes seconds
[09:59:34] lol 180 is a lot
[09:59:42] POST reqs
[09:59:47] GETs are 60
[10:00:01] those are all GETs, RO es cluster
[10:00:05] (POST + jobs)
[10:00:10] smells like some scraper
[10:01:00] volans: not necessarily, if you edit a page (POST), it gets reparsed and it would need to load templates; the content of the templates might be in the RO ES hosts.
[10:01:32] ah the timeout of the session is based on the MW GET/POST, not the DB RO/RW, got it
[10:01:42] yeah
[10:02:58] I wouldn't rule out a scraper though. It can be that. It's just hard to figure it out :(
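The sampling idea raised at [09:53:45], together with the point that fast queries rarely show up in a single processlist snapshot ([09:56:52]-[09:57:01]), could look roughly like the sketch below: poll SHOW PROCESSLIST repeatedly and tally thread observations by command and client host. This is only an illustrative sketch, not anything from the log; the es1022 host name, the read-only credentials, and the sampling interval are assumptions.

```python
#!/usr/bin/env python3
"""Rough processlist sampler: tallies thread observations across samples.

Sketch only. Host, credentials, sample count and interval are placeholders,
not values taken from the incident.
"""
import time
from collections import Counter

import pymysql


def sample_processlist(host="es1022", user="readonly", password="...",
                       samples=120, interval=0.5):
    commands = Counter()   # observations per Command value (Sleep, Query, ...)
    clients = Counter()    # observations per client "host:port"
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        for _ in range(samples):
            with conn.cursor() as cur:
                cur.execute("SHOW PROCESSLIST")
                for row in cur.fetchall():
                    commands[row["Command"]] += 1
                    clients[row["Host"]] += 1
            time.sleep(interval)
    finally:
        conn.close()
    return commands, clients


if __name__ == "__main__":
    cmds, clients = sample_processlist()
    print(cmds.most_common(10))
    print(clients.most_common(10))
```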
[10:03:42] in pre-k8s I would reverse-DNS the IPs and check what cluster they are coming from, but that's not possible anymore :(
[15:00:25] FIRING: SystemdUnitFailed: ceph-3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18@mgr.moss-be1003.yxfdls.service on moss-be1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:12:22] I think that's related to the network work
[15:32:55] RESOLVED: SystemdUnitFailed: ceph-3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18@mgr.moss-be1003.yxfdls.service on moss-be1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
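For reference, the pre-k8s trick mentioned at [10:03:42] (reverse-DNS the client IPs from the processlist to see which cluster the connections come from) would have looked roughly like the sketch below. It is hypothetical: host and credentials are placeholders, and as the log notes, the resolved names no longer identify the source once traffic arrives via Kubernetes.

```python
#!/usr/bin/env python3
"""Reverse-DNS the client IPs seen in the processlist and count them by name.

Sketch of the pre-k8s approach described in the log; host/credentials are
placeholders, and PTR lookups are not meaningful for pod-sourced traffic.
"""
import socket
from collections import Counter

import pymysql


def clients_by_reverse_dns(host="es1022", user="readonly", password="..."):
    names = Counter()
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW PROCESSLIST")
            for row in cur.fetchall():
                ip = row["Host"].rsplit(":", 1)[0]  # Host is usually "ip:port"
                try:
                    name = socket.gethostbyaddr(ip)[0]
                except (socket.herror, socket.gaierror, OSError):
                    name = ip  # no PTR record; keep the raw IP
                names[name] += 1
    finally:
        conn.close()
    return names


if __name__ == "__main__":
    for name, count in clients_by_reverse_dns().most_common(20):
        print(f"{count:5d}  {name}")
```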