[12:14:57] hnowlan: fair enough; any idea what tag(s) to usefully put on that task so it gets some eyeballs on it?
[12:15:22] Keep it on thumbor for now and I'll try to triage it better once I know what's going on
[12:21:44] TY :)
[13:03:08] dbprov2005 is back working as usual and cought up with missed backups
[13:07:34] *caught; it is not as bad yet as to start coughing
[13:39:02] taking a break before fun starts
[16:36:59] Could I get reviews of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128907 (remove ms-be2075 from rings) and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128908 (put it back again), please? It's already drained, but I need to move it to a new-style VLAN which will renumber it, so the old IP needs to go out of the ring entirely before we re-add the node with its new address
[16:48:39] Also https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/11 for the new network
[16:54:25] FIRING: SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.8.service on moss-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:03:09] that's going to be a dead disk. I'll try and get it drained after this meeting
[17:24:41] Heya, it looks like moss-be2002's ceph is in a bad state. Is this known?
[17:25:53] Looks like it's had some kernel hangups
[17:26:40] and lots 'o crashes
[17:27:15] brett: ceph health detail tells me of repeated crashes of osd.8 which is consistent with a dead disk.
[17:27:20] [in meeting currently]
[17:28:01] brett: are you seeing failures beyond sdb ?
[17:29:25] RESOLVED: SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.8.service on moss-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:29:44] looks like you're right! /dev/sdb looks bad
[17:30:09] https://wikitech.wikimedia.org/wiki/Ceph/Cephadm#Disk_Failure is the fix procedure (which I'll get to later)
[17:30:26] You want to handle that yourself or would you like me to?
[17:31:15] if you'd like to try following the instructions and tell me how the docs suck, that'd be great :)
[17:31:23] ...but I can do it myself once not in a meeting
[17:34:25] FIRING: SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.8.service on moss-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:41:01] seeing a bit of an increase in these errors since the switchover, anything familiar? https://logstash.wikimedia.org/goto/a4b518e7fc7b1b1d1e296e21a15b699b
[17:43:36] that's not a direct mariadb error, that's some kind of mw logic "commit critical section while session state is out of sync"
[17:43:39] Emperor: I removed the drive - having never used ceph before I found the docs easy enough to use. Thanks :)
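
For the kind of check Emperor describes at 17:27 (ceph health detail reporting repeated crashes of osd.8), a small script like the one below can tally crash reports per daemon. This is only a sketch: it assumes the ceph CLI is reachable on the host (or via cephadm shell), that the crash module is enabled, and that the crash metadata's entity_name field names the crashing daemon (e.g. "osd.8") - verify those against the cluster before relying on it.

    #!/usr/bin/env python3
    """Sketch: count recent Ceph crash reports per daemon to spot a failing OSD."""
    import json
    import subprocess
    from collections import Counter

    def crash_counts() -> Counter:
        # `ceph crash ls --format json` lists new and archived crash reports.
        out = subprocess.run(
            ["ceph", "crash", "ls", "--format", "json"],
            check=True, capture_output=True, text=True,
        ).stdout
        crashes = json.loads(out) or []
        # entity_name is assumed to hold the crashing daemon, e.g. "osd.8"
        return Counter(c.get("entity_name", "unknown") for c in crashes)

    if __name__ == "__main__":
        for daemon, count in crash_counts().most_common():
            print(f"{daemon}: {count} crash report(s)")

A daemon that dominates the crash count is usually the OSD sitting on the failing disk, at which point the Ceph/Cephadm#Disk_Failure procedure linked above applies.
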
[17:43:46] *zapped
[17:43:55] I'll leave the DC ticket to your capable hands :)
[17:44:35] (Which means that osd.rrd_NVMe is still set to unmanaged)
[17:44:45] cool, thanks
[17:45:53] hnowlan: I would like to check Database.php:3091 to see when that happens
[17:47:00] I guess that's https://github.com/wikimedia/mediawiki/blob/master/includes/libs/rdbms/database/Database.php#L3091
[17:47:55] hnowlan: sadly that is generic enough to not provide any insight as it is an unexpected db error
[17:49:17] I see a call to request-timeout, so maybe the queries are timing out?
[17:49:25] RESOLVED: SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.8.service on moss-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:51:28] the urls would match with more expensive uncached operations (checking histories of pages, unblocking users, etc)
[17:54:46] brett: I've opened T389236
[17:54:47] T389236: disk (sdb) failed in moss-be2002 - https://phabricator.wikimedia.org/T389236
[17:54:58] hnowlan: my guess is that something running in the current mode makes things slower, and it will go away tomorrow, so just something to keep an eye on for now. But maybe the mediawiki database team can provide more insight.
[17:55:05] thanks again both for doing the zapping, and for trying out the docs :)
[18:01:34] And thank you for handling this on the reg :)
[18:07:45] jynus: makes sense, thanks!
[18:23:02] hnowlan: I found someone complaining, so maybe we should take a look: https://wikimedia.slack.com/archives/C05FWANFT8X/p1742318251458209
[18:23:38] those are consistent with strict timeout issues, timings being tuned for multiple dc operation (?)
[18:24:20] but I will have a look to see if some db is hurting or something
[18:26:11] I see an increase in read activity on s6, but that's only recent
[18:29:32] and it's gone, so nothing weird or ongoing
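
For following up on the rdbms errors hnowlan linked at 17:41, a query along these lines would bucket the "session state is out of sync" messages by wiki over the last few hours. Everything specific here is an assumption for illustration: the endpoint, index pattern, and field names ("wiki", "message") would need to match the actual logstash setup, and in practice the linked dashboard already filters the same data.

    #!/usr/bin/env python3
    """Sketch: aggregate the rdbms error messages by wiki via the search API."""
    import requests

    # placeholder endpoint and index pattern
    SEARCH_URL = "https://logstash-query.example.internal/logstash-*/_search"

    query = {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    # phrase taken from the error quoted at 17:43:36
                    {"match_phrase": {"message": "session state is out of sync"}},
                    {"range": {"@timestamp": {"gte": "now-6h"}}},
                ]
            }
        },
        # assumes a keyword-mapped "wiki" field on mediawiki log entries
        "aggs": {"by_wiki": {"terms": {"field": "wiki", "size": 20}}},
    }

    resp = requests.post(SEARCH_URL, json=query, timeout=30)
    resp.raise_for_status()
    for bucket in resp.json()["aggregations"]["by_wiki"]["buckets"]:
        print(f'{bucket["key"]}: {bucket["doc_count"]}')
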
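On the ring changes at 16:36: the ordering Emperor describes (drain the node, take its old IP out of the rings entirely, then re-add it under the new address) can be pictured with swift's RingBuilder API. The snippet below only illustrates the removal step; the real change goes through the swift-ring repo and the swift-ring-builder CLI, and the builder path and IP are placeholders.

    #!/usr/bin/env python3
    """Sketch: drop every device registered under a host's old IP from a ring builder."""
    from swift.common.ring.builder import RingBuilder

    BUILDER_FILE = "object.builder"   # placeholder path
    OLD_IP = "10.192.0.123"           # placeholder: the node's pre-move address

    builder = RingBuilder.load(BUILDER_FILE)
    for dev in builder.devs:
        # devs contains None entries for device ids that were removed earlier
        if dev and dev["ip"] == OLD_IP:
            builder.remove_dev(dev["id"])

    builder.rebalance()               # reassign partitions away from the removed devices
    builder.save(BUILDER_FILE)
    # Re-adding the renumbered node is then a separate add_dev + rebalance step.
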