[04:36:05] <icinga-wm_>	 PROBLEM - MariaDB sustained replica lag on s4 on db1241 is CRITICAL: 36.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1241&var-port=9104
[04:36:15] <icinga-wm_>	 PROBLEM - MariaDB sustained replica lag on s4 on db1247 is CRITICAL: 64.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1247&var-port=9104
[04:36:15] <icinga-wm_>	 PROBLEM - MariaDB sustained replica lag on s4 on db1238 is CRITICAL: 56.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1238&var-port=9104
[04:36:29] <icinga-wm_>	 PROBLEM - MariaDB sustained replica lag on s4 on db1242 is CRITICAL: 16.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1242&var-port=9104
[04:37:11] <icinga-wm_>	 PROBLEM - MariaDB sustained replica lag on s4 on db1248 is CRITICAL: 56.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[04:37:23] <icinga-wm_>	 PROBLEM - MariaDB sustained replica lag on s4 on db1221 is CRITICAL: 33 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[04:39:23] <icinga-wm_>	 RECOVERY - MariaDB sustained replica lag on s4 on db1221 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[04:40:07] <icinga-wm_>	 RECOVERY - MariaDB sustained replica lag on s4 on db1241 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1241&var-port=9104
[04:40:19] <icinga-wm_>	 PROBLEM - MariaDB sustained replica lag on s4 on db1190 is CRITICAL: 45.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1190&var-port=9104
[04:41:15] <icinga-wm_>	 RECOVERY - MariaDB sustained replica lag on s4 on db1247 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1247&var-port=9104
[04:42:13] <icinga-wm_>	 RECOVERY - MariaDB sustained replica lag on s4 on db1248 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[04:42:21] <icinga-wm_>	 RECOVERY - MariaDB sustained replica lag on s4 on db1190 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1190&var-port=9104
[04:42:31] <icinga-wm_>	 RECOVERY - MariaDB sustained replica lag on s4 on db1242 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1242&var-port=9104
[04:43:17] <icinga-wm_>	 RECOVERY - MariaDB sustained replica lag on s4 on db1238 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1238&var-port=9104
[09:17:42] <Emperor>	 I see j.ynus is taking offsite backups seriously https://medium.com/arch-mission-foundation/third-times-a-charm-lunar-library-successfully-lands-on-the-moon-backup-of-human-civilization-1ef424ebe4f2 :)
[09:23:04] <jynus>	 then now there are 2 copies of wikipedia on the moon
[09:26:53] <Amir1>	 recovery will be fun
[09:27:05] <marostegui>	 there's something weird going on
[09:27:08] * Amir1 imagines Jaime with spacesuit 
[09:27:17] <arnaudb>	 is db2144 safe to depool?
[09:27:21] <marostegui>	 no
[09:27:26] <marostegui>	 why is that being depooled?
[09:27:31] <Amir1>	 which section?
[09:27:33] <arnaudb>	 then: how can I revert a dbctl commit? :D
[09:27:35] <arnaudb>	 x2 Amir1 
[09:27:41] <Amir1>	 oh have fun
[09:27:49] <arnaudb>	 (I did not commit marostegui, please don't worry)
[09:27:50] <marostegui>	 You can just do dbctl instance db2144 pool
[09:27:55] <arnaudb>	 ok!
[09:27:56] <marostegui>	 yeah, dbctl won't let you
[09:28:02] <marostegui>	 that is x2 master
[09:28:06] <Amir1>	 I'm cloning stuff in s3, and schema changes in s6, otherwise nothing on my side
[09:28:21] <marostegui>	 correct, there is one schema change pending in s6, which is blocked by that uncommitable change
[09:28:52] <Amir1>	 oh fun
[09:29:22] <arnaudb>	 thanks for the help! I did not even tried to commit but I started suspecting something was off when I saw where it was pooled in the diff
[09:29:40] <marostegui>	 but why was it being depooled?
[09:29:43] <marostegui>	 hostname mistake?
[09:30:54] <arnaudb>	 https://phabricator.wikimedia.org/T356240 missed annotation in my table
[09:31:16] <marostegui>	 yep, it is a master
[09:41:53] <jynus>	 db1201 may alert soon
[09:43:12] <arnaudb>	 I downtimed it
[09:43:25] <arnaudb>	 it's been rebooted for T356240
[09:45:37] <icinga-wm_>	 PROBLEM - MariaDB sustained replica lag on s6 on db1201 is CRITICAL: 45 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1201&var-port=9104
[09:46:37] <icinga-wm_>	 RECOVERY - MariaDB sustained replica lag on s6 on db1201 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1201&var-port=9104
[10:14:58] <marostegui>	 jynus: could you check which one of these will be backup sources? https://phabricator.wikimedia.org/T355422 the list of hosts being refreshed is on top of the task description 
[10:15:04] <marostegui>	 this is not urgen
[10:15:05] <marostegui>	 t
[10:15:37] <jynus>	 checking
[10:21:26] <jynus>	 5 of those, 97, 98, 99, 100 & 101
[10:22:03] <marostegui>	 ok, if you want to do them yourself, that'd be nice
[10:22:17] <jynus>	 I'd prefer it, yes
[10:22:39] <marostegui>	 great
[10:47:14] <marostegui>	 jynus: I just realised that that all those hosts are bookworm, which means 10.6
[10:47:19] <marostegui>	 so maybe you'd need to reimage to bullseye
[10:47:51] <jynus>	 well, it actally depends on you, if you want to keep having 10.4 backups or not
[10:48:10] <marostegui>	 For now yeah
[10:48:17] <marostegui>	 So maybe let's wait a bit to replace those hosts
[10:48:45] <jynus>	 however, reimaging 5 hosts from 0 and populating them seems to me like an OKR-sized task
[10:49:16] <jynus>	 definitely not happening this quarter
[10:49:34] <marostegui>	 sounds good
[10:51:54] <jynus>	 as it is setting up and reconfiguring 11 new instances
[11:07:08] <icinga-wm_>	 PROBLEM - MariaDB sustained replica lag on s8 on db1209 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1209&var-port=9104
[11:08:08] <icinga-wm_>	 RECOVERY - MariaDB sustained replica lag on s8 on db1209 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1209&var-port=9104
[12:24:48] <jynus>	 DP people: Let me know your thoughts on this summary, specially if you think is incorrect, unfair or can be make you look better: https://phab.wmfusercontent.org/file/data/e3mo5id7icqp4xhcjctc/PHID-FILE-zfpsoz6alyewdex5wsjs/SRE_Data_Persistence_2024_State_of_the_Union_-_Jaime_s_version.png
[12:28:18] <jynus>	 I'm not sure what to put for latency in mariadb, we have no good metrics of that
[12:29:04] <jynus>	 Average query execution time?
[12:41:26] <Emperor>	 My inner pedant wants to point out that there are _two_ ms clusters
[12:42:27] <Emperor>	 eqiad has 4.8PB raw capacity, codfw 4.9PB
[12:46:06] <Emperor>	 I think 73ms is probably a more fair p75 estimate (from eqiad which is the cluster doing more work now)
[13:21:07] <jynus>	 updating
[13:23:30] <jynus>	 Should I leave the unique since around 1 PB? It is ok if it is a very broad aproximation- I had just divided by 6
[13:40:46] <Emperor>	 yeah, I think that's reasonable
[13:41:14] <Emperor>	 I dunno if you also want to mention the thanos-swift cluster? It's only 384 TB, so quite small
[13:41:28] <Emperor>	 (that's raw capacity)
[13:42:53] * Emperor has these figures to hand because they feature on their Swift 101 slides :)
[13:48:18] <jynus>	 sure, will add them. I just didn't trust the sizes were up to date
[13:49:52] <jynus>	 I also added the ceph machines
[13:57:45] <Emperor>	 ceph machines? I'd not say anything about apus (since it's still WIP)
[14:11:48] <jynus>	 I just added the old machine names to the clusters
[14:13:59] <jynus>	 I'm not going to mention the table, just added it for context
[14:28:05] <Emperor>	 I'm happy to look at a revised version when you're done :)
[14:30:25] <jynus>	 https://phabricator.wikimedia.org/P58241#234996
[14:31:42] <jynus>	 I have a few questions for everybody regarding future, as I cannot get into your brain, and would benefit from additional context
[14:31:56] <Emperor>	 I think I'd take moss-* off (yes, we have a couple of moss-fe* nodes in production use), as they're broadly not in service
[14:32:16] <Emperor>	 ...and I think also just call them Swift clusters for now, we don't have any Ceph in prod
[14:32:58] <jynus>	 ok
[14:33:52] <Emperor>	 And I think total size is about 10PB (or slightly fewer PiB), not 11Pi8B
[14:34:01] <Emperor>	 sorry, I'll stop quibbling!
[14:37:04] <jynus>	 I added the 700 TB of thanos
[14:38:01] <jynus>	 grafana told me it was 10.1 P for the swift cluster
[14:38:54] <jynus>	 https://grafana.wikimedia.org/d/U29lWjTIk/cluster-overview-jaime-s-copy?orgId=1&var-cluster=thanos&var-instance=All&var-site=codfw
[14:45:24] <marostegui>	 Amir1: db2156 can be repooled?
[15:00:00] <Amir1>	 marostegui: now yes
[15:00:08] <marostegui>	 ok, I will do that!
[15:02:22] <Amir1>	 fi you see an alert for db1125, that's me
[15:02:27] <Amir1>	 it's test-s4
[15:09:28] <Emperor>	 jynus: thanos has 384TB of raw capacity, so about 128TB of usable capacity (3x replication)
[15:34:44] <Amir1>	 db1213 is me too
[15:34:47] <Amir1>	 downtimed
[15:36:26] <jinxer-wm>	 (SystemdUnitFailed) firing: pt-heartbeat-wikimedia.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:51:41] <Amir1>	 this should be resolved now ^
[15:52:52] <swfrench-wmf>	 Greetings, and apologies for the noise :)
[15:53:43] <marostegui>	 swfrench-wmf: welcome
[15:56:26] <jinxer-wm>	 (SystemdUnitFailed) resolved: pt-heartbeat-wikimedia.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed