[02:07:07] FIRING: SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.20.service on moss-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:07:07] FIRING: SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.20.service on moss-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:57:07] FIRING: [4x] SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.20.service on moss-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:59:06] FIRING: MysqlReplicationThreadCountTooLow: MySQL instance pc1013:9104 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=pc1013&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[08:00:10] silence incoming
[08:10:57] arnaudb: Can we repool that and pc1017?
[08:16:40] yes we can but there are pending questions on this, added to all the meeting points we have in common.
[08:24:12] RESOLVED: SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.20.service on moss-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:28:14] arnaudb: As explained before, you don't need anything special. Just truncate all tables and start replicating from the current position.
[08:29:00] If you want to reimage formatting srv, then you'd need to bootstrap MariaDB entirely; if not, just what I mentioned above
[08:30:06] apus dead disk is now ticket T381239
[08:30:06] T381239: Disk (sdg) failed on moss-be2002 - https://phabricator.wikimedia.org/T381239
[08:30:17] (and the unhappy OSD is gone, hence the alert clearing)
[09:40:08] I have installed 10.6.20 on db1198 and I am running a general table rebuild for pagelinks and later for recentchanges on that host
[09:40:14] And see if 10.6.20 doesn't have any corruptions
[10:41:39] Emperor: o/ ms-be2* nodes ready, I am finishing up 2088 but as agreed we'll keep it aside for testing
[10:42:07] so if you want to double check 208[1-7] for anomalies etc.
[10:42:14] after that you are free to add them to prod
[10:42:25] for ms-be1* let's see what dcops says later on
[10:54:19] elukey: cool, thanks, I'll have a look a bit later today
[11:18:12] arnaudb: can you handle this please? https://phabricator.wikimedia.org/T378143#10320743
[11:38:34] The cloning cookbook doesn't work with standalone hosts right?
[11:42:31] marostegui: yeah, I don't think it would
[11:42:40] thanks
[11:42:49] that reminded me, I should push my change on multiinstance
[11:43:04] Ideally we should also make it work with standalone
[11:43:09] Should I create a task for it too?
[11:44:45] sure
[11:45:07] I'll do it thanks
[11:45:10] my only note is that for es hosts, transfer.py might not work well
[11:45:21] yeah I know :(
[11:46:11] I am glad we are using 10G in codfw hosts already
[11:46:27] Cause I need to transfer 8TB
[11:52:07] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on es2041:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:53:26] ^ known
[11:53:34] I am productionizing that host
[11:53:43] I am going to silence it for a few days
[13:00:14] I have optimized db1198 everywhere for pagelinks and recentchanges
[13:00:17] and it is running 10.6.20
[13:01:35] naive question: would there be any benefit to running this kind of operation periodically on a host? (adding an option to optimize every single table before running a clone for instance)
[13:17:07] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:36:36] arnaudb: That's a difficult process for several reasons: first the size of the tables, it can potentially take hours or days for some of them. Secondly, the bug is introduced on a version, so we need to first confirm it doesn't happen in 10.6.20, and then migrate everything and only then we could do that (but refer to the first point).
The bug is introduced in one version, but it doesn't get fixed by another version; it only gets fixed when the table is rebuilt in a version that won't introduce it again
[13:41:47] I see, so it could even have the opposite effect (spreading a potential corruption seed to other servers) depending on which bug we trigger, thanks
[14:34:35] I have an easy question and a hard question! 1) Do y'all have a good way to contact the kiwix team? 2) The data persistence team is managing dumps these days right? Is there a primary contact for that?
[14:36:39] we're in our team meeting now, but no, we're not managing dumps
[14:57:52] I think I've confused -persistence and -engineering yet again.
[14:58:52] andrewbogott: I feel you
[17:17:07] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:17:08] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed