[04:36:05] PROBLEM - MariaDB sustained replica lag on s4 on db1241 is CRITICAL: 36.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1241&var-port=9104
[04:36:15] PROBLEM - MariaDB sustained replica lag on s4 on db1247 is CRITICAL: 64.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1247&var-port=9104
[04:36:15] PROBLEM - MariaDB sustained replica lag on s4 on db1238 is CRITICAL: 56.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1238&var-port=9104
[04:36:29] PROBLEM - MariaDB sustained replica lag on s4 on db1242 is CRITICAL: 16.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1242&var-port=9104
[04:37:11] PROBLEM - MariaDB sustained replica lag on s4 on db1248 is CRITICAL: 56.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[04:37:23] PROBLEM - MariaDB sustained replica lag on s4 on db1221 is CRITICAL: 33 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[04:39:23] RECOVERY - MariaDB sustained replica lag on s4 on db1221 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[04:40:07] RECOVERY - MariaDB sustained replica lag on s4 on db1241 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1241&var-port=9104
[04:40:19] PROBLEM - MariaDB sustained replica lag on s4 on db1190 is CRITICAL: 45.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1190&var-port=9104
[04:41:15] RECOVERY - MariaDB sustained replica lag on s4 on db1247 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1247&var-port=9104
[04:42:13] RECOVERY - MariaDB sustained replica lag on s4 on db1248 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[04:42:21] RECOVERY - MariaDB sustained replica lag on s4 on db1190 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1190&var-port=9104
[04:42:31] RECOVERY - MariaDB sustained replica lag on s4 on db1242 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1242&var-port=9104
[04:43:17] RECOVERY - MariaDB sustained replica lag on s4 on db1238 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1238&var-port=9104
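For context on the alerts above: the "sustained replica lag" check fires when a replica's measured lag meets or exceeds the critical threshold of 2 seconds ("36.6 ge 2"), and the recovery line shows the critical/warning thresholds and the current value ("(C)2 ge (W)1 ge 0"). The snippet below is only an illustrative way to spot-check lag by hand on one of the listed replicas, assuming shell and MySQL access on the host; the alert itself is computed from the Prometheus MySQL exporter on port 9104 linked in each message, not from this command.

```bash
# Hypothetical manual spot-check of replication state on a replica such as
# db1241; not the mechanism the alert uses (that comes from the exporter on :9104).
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
```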
[09:17:42] I see j.ynus is taking offsite backups seriously https://medium.com/arch-mission-foundation/third-times-a-charm-lunar-library-successfully-lands-on-the-moon-backup-of-human-civilization-1ef424ebe4f2 :)
[09:23:04] then now there are 2 copies of wikipedia on the moon
[09:26:53] recovery will be fun
[09:27:05] there's something weird going on
[09:27:08] * Amir1 imagines Jaime with spacesuit
[09:27:17] is db2144 safe to depool?
[09:27:21] no
[09:27:26] why is that being depooled?
[09:27:31] which section?
[09:27:33] then: how can I revert a dbctl commit? :D
[09:27:35] x2 Amir1
[09:27:41] oh have fun
[09:27:49] (I did not commit marostegui, please don't worry)
[09:27:50] You can just do dbctl instance db2144 pool
[09:27:55] ok!
[09:27:56] yeah, dbctl won't let you
[09:28:02] that is x2 master
[09:28:06] I'm cloning stuff in s3, and schema changes in s6, otherwise nothing on my side
[09:28:21] correct, there is one schema change pending in s6, which is blocked by that uncommittable change
[09:28:52] oh fun
[09:29:22] thanks for the help! I did not even try to commit, but I started suspecting something was off when I saw where it was pooled in the diff
[09:29:40] but why was it being depooled?
[09:29:43] hostname mistake?
[09:30:54] https://phabricator.wikimedia.org/T356240 missed annotation in my table
[09:31:16] yep, it is a master
[09:41:53] db1201 may alert soon
[09:43:12] I downtimed it
[09:43:25] it's been rebooted for T356240
[09:45:37] PROBLEM - MariaDB sustained replica lag on s6 on db1201 is CRITICAL: 45 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1201&var-port=9104
[09:46:37] RECOVERY - MariaDB sustained replica lag on s6 on db1201 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1201&var-port=9104
[10:14:58] jynus: could you check which one of these will be backup sources? https://phabricator.wikimedia.org/T355422 The list of hosts being refreshed is at the top of the task description
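For reference on the dbctl exchange above (09:27-09:29): a depool staged with dbctl is not applied until it is committed, so an accidental depool of a master like db2144 can be undone by pooling the instance back before any commit (and dbctl refuses to commit a depool of a section master anyway). A minimal sketch of that workflow, assuming the dbctl subcommands documented on wikitech; the commit message and host name are illustrative:

```bash
# Sketch of reverting an accidental, uncommitted depool with dbctl.
dbctl config diff                # review the staged, uncommitted changes
dbctl instance db2144 pool       # undo the accidental depool (as suggested at 09:27:50)
dbctl config diff                # confirm nothing unexpected remains staged
dbctl config commit -m "Repool db2144 (depooled by mistake)"   # only when the remaining diff is intended
```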
[10:15:04] this is not urgent
[10:15:37] checking
[10:21:26] 5 of those, 97, 98, 99, 100 & 101
[10:22:03] ok, if you want to do them yourself, that'd be nice
[10:22:17] I'd prefer it, yes
[10:22:39] great
[10:47:14] jynus: I just realised that all those hosts are bookworm, which means 10.6
[10:47:19] so maybe you'd need to reimage to bullseye
[10:47:51] well, it actually depends on you, if you want to keep having 10.4 backups or not
[10:48:10] For now yeah
[10:48:17] So maybe let's wait a bit to replace those hosts
[10:48:45] however, reimaging 5 hosts from 0 and populating them seems to me like an OKR-sized task
[10:49:16] definitely not happening this quarter
[10:49:34] sounds good
[10:51:54] as it is setting up and reconfiguring 11 new instances
[11:07:08] PROBLEM - MariaDB sustained replica lag on s8 on db1209 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1209&var-port=9104
[11:08:08] RECOVERY - MariaDB sustained replica lag on s8 on db1209 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1209&var-port=9104
[12:24:48] DP people: Let me know your thoughts on this summary, especially if you think it is incorrect, unfair, or could make you look better: https://phab.wmfusercontent.org/file/data/e3mo5id7icqp4xhcjctc/PHID-FILE-zfpsoz6alyewdex5wsjs/SRE_Data_Persistence_2024_State_of_the_Union_-_Jaime_s_version.png
[12:28:18] I'm not sure what to put for latency in mariadb, we have no good metrics of that
[12:29:04] Average query execution time?
[12:41:26] My inner pedant wants to point out that there are _two_ ms clusters
[12:42:27] eqiad has 4.8PB raw capacity, codfw 4.9PB
[12:46:06] I think 73ms is probably a fairer p75 estimate (from eqiad, which is the cluster doing more work now)
[13:21:07] updating
[13:23:30] Should I leave the unique size around 1 PB? It is ok if it is a very broad approximation - I had just divided by 6
[13:40:46] yeah, I think that's reasonable
[13:41:14] I dunno if you also want to mention the thanos-swift cluster? It's only 384 TB, so quite small
[13:41:28] (that's raw capacity)
[13:42:53] * Emperor has these figures to hand because they feature on their Swift 101 slides :)
[13:48:18] sure, will add them. I just didn't trust the sizes were up to date
[13:49:52] I also added the ceph machines
[13:57:45] ceph machines? I'd not say anything about apus (since it's still WIP)
[14:11:48] I just added the old machine names to the clusters
[14:13:59] I'm not going to mention the table, just added it for context
[14:28:05] I'm happy to look at a revised version when you're done :)
[14:30:25] https://phabricator.wikimedia.org/P58241#234996
[14:31:42] I have a few questions for everybody regarding the future, as I cannot get into your brain, and would benefit from additional context
[14:31:56] I think I'd take moss-* off (yes, we have a couple of moss-fe* nodes in production use), as they're broadly not in service
[14:32:16] ...and I think also just call them Swift clusters for now, we don't have any Ceph in prod
[14:32:58] ok
[14:33:52] And I think total size is about 10PB (or slightly fewer PiB), not 11PiB
[14:34:01] sorry, I'll stop quibbling!
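A worked version of the capacity arithmetic discussed above (12:42-13:23), assuming the "divided by 6" heuristic means six copies of each object across ms-swift (3x replication per site, two sites); whether the divisor was meant to apply per site or to the combined total is not stated in the log, so both readings are shown:

```bash
# Rough unique-data estimates from raw capacity, under the assumed six-copies
# divisor mentioned at 13:23 ("I had just divided by 6").
awk 'BEGIN { printf "eqiad+codfw: %.1f PB unique\n", (4.8 + 4.9) / 6 }'   # ~1.6 PB
awk 'BEGIN { printf "eqiad only:  %.1f PB unique\n", 4.8 / 6 }'           # ~0.8 PB, roughly the "around 1 PB" figure
```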
[14:37:04] I added the 700 TB of thanos
[14:38:01] grafana told me it was 10.1 P for the swift cluster
[14:38:54] https://grafana.wikimedia.org/d/U29lWjTIk/cluster-overview-jaime-s-copy?orgId=1&var-cluster=thanos&var-instance=All&var-site=codfw
[14:45:24] Amir1: db2156 can be repooled?
[15:00:00] marostegui: now yes
[15:00:08] ok, I will do that!
[15:02:22] if you see an alert for db1125, that's me
[15:02:27] it's test-s4
[15:09:28] jynus: thanos has 384TB of raw capacity, so about 128TB of usable capacity (3x replication)
[15:34:44] db1213 is me too
[15:34:47] downtimed
[15:36:26] (SystemdUnitFailed) firing: pt-heartbeat-wikimedia.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:51:41] this should be resolved now ^
[15:52:52] Greetings, and apologies for the noise :)
[15:53:43] swfrench-wmf: welcome
[15:56:26] (SystemdUnitFailed) resolved: pt-heartbeat-wikimedia.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
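On the SystemdUnitFailed alert above: pt-heartbeat-wikimedia.service (presumably the pt-heartbeat writer used for lag measurement) stops when its database instance is taken down, as during the db1125 test-s4 work mentioned at 15:02, which is why the alert fired and then resolved once the work finished. A generic, illustrative first look at such an alert using standard systemd tooling; this is not a WMF-specific runbook, and restarting is only appropriate once the underlying maintenance is done:

```bash
# Generic triage of a failed systemd unit such as pt-heartbeat-wikimedia.service.
systemctl status pt-heartbeat-wikimedia.service
journalctl -u pt-heartbeat-wikimedia.service --since "1 hour ago"
sudo systemctl restart pt-heartbeat-wikimedia.service   # only after the maintenance that stopped it is complete
```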