[05:09:06] Amir1: let me know when you are around and ready to enable writes on es6 and es7
[05:22:27] I am going to switch the s5 codfw master
[05:43:55] hey! I have a slight issue that might be disrupting for the next few days: I have had no internet at home since last Tuesday. My ISP is working to restore the service; someone is scheduled to stop by tomorrow morning to try and fix everything. Meanwhile I have a 4G box so I can work. Timely disruption, as I was on staycation last week x)
[07:32:06] that is a long time without internet :(
[07:34:50] at least it was a good crash test for my Home Assistant setup: I lost my DHCP server for a while and could not control my living room lights anymore because a component was addressed dynamically :D
[07:37:19] ouch
[07:37:37] https://usercontent.irccloud-cdn.com/file/oRUfpygZ/grafik.png
[07:38:00] arnaudb: don't worry :/ let me know if I can help with anything
[07:39:47] :D thanks!
[07:39:54] I am going to enable writes on es6 and es7 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1029109
[08:52:27] PROBLEM - MariaDB sustained replica lag on s1 on db1207 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1207&var-port=9104
[08:53:27] RECOVERY - MariaDB sustained replica lag on s1 on db1207 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1207&var-port=9104
[09:06:34] heads up on alerts caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1030896
[09:07:35] backups are available and were generated over the weekend, but they lacked metadata (and so went undetected). As a failsafe, backups still complete successfully even if metadata cannot be generated
[09:56:45] PROBLEM - MariaDB sustained replica lag on s4 on db1248 is CRITICAL: 6.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[09:58:45] RECOVERY - MariaDB sustained replica lag on s4 on db1248 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[12:43:35] just tried to run db-mysql on localhost...
[12:56:29] urandom: o/ ping me if you have time for the cassandra private cleanup later on :)
[13:10:39] elukey: I have a ~30min meeting in ~20mins, but I am otherwise free
[13:11:18] ok to do it later on!
[14:19:47] elukey: o/
[14:20:10] urandom: sorry, in a meeting now :( Should be free in ~1h
[14:20:26] 👍
[14:34:27] elukey, hnowlan: do either of you know what the `/srv/cassandra-{a,b}/tmp/local_group_default_T_mediarequest_per_file` directories are on aqs1010? The timestamp is March 30, 2021; I think that might roughly correspond to a migration (v2 to v3)?
[14:35:00] that seems extremely likely
[14:35:00] pretty sure this is safe to delete, but you know, measure twice, cut once
[14:35:20] no idea, sorry
[14:35:21] I'd say you're fine to clean it - btullis might be able to confirm
[14:55:29] urandom: ready!
[14:55:30] if you are
[14:55:38] I am!
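For context on the PROBLEM/RECOVERY lines above: the sustained replica lag check warns at 1 second and goes critical at 2 seconds of replication lag (the "(C)2 ge (W)1" fragment in the recovery messages). Below is a minimal sketch of an equivalent check against Prometheus; the endpoint and metric name are assumptions, only the thresholds come from the alert text.

    import json
    import urllib.parse
    import urllib.request

    # Hypothetical Prometheus endpoint; the real alert is driven by the
    # monitoring stack, this only illustrates the threshold logic.
    PROMETHEUS = "http://prometheus.example.org/api/v1/query"
    # Assumed mysqld_exporter metric, averaged to approximate "sustained" lag.
    QUERY = ('avg_over_time(mysql_slave_status_seconds_behind_master'
             '{instance="db1207:9104"}[2m])')
    WARNING, CRITICAL = 1, 2  # seconds, per the "(C)2 ge (W)1" alert output

    url = PROMETHEUS + "?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url, timeout=10) as resp:
        results = json.load(resp)["data"]["result"]
    lag = float(results[0]["value"][1]) if results else 0.0

    if lag >= CRITICAL:
        print(f"CRITICAL: sustained replica lag {lag:.1f} ge {CRITICAL}")
    elif lag >= WARNING:
        print(f"WARNING: sustained replica lag {lag:.1f} ge {WARNING}")
    else:
        print(f"OK: sustained replica lag {lag:.1f}")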
[14:56:25] super :)
[14:56:33] puppet is disabled on all c:cassandra nodes
[14:56:55] if you are ok with it, I'd stage the cassandra dir drop in puppet private for you to review
[14:57:06] then we can merge and slowly run puppet on some nodes
[14:57:09] like one for each cluster
[14:57:41] sgtm
[15:03:07] super, prepping
[15:04:05] urandom: ok, done! Can you review in puppet private?
[15:05:17] elukey: I can confirm, that is a frightening number of deletions :)
[15:05:29] but yes, `modules/secret/secrets/cassandra` recursively, +1
[15:09:04] merged!
[15:09:32] running puppet on one node for each cluster
[15:10:41] wow, 988 deletions
[15:11:04] Delete All The Things \o/
[15:11:06] volans: scary, huh!
[15:11:25] * volans tempted to run git gc
[15:14:06] volans: yes, a big drop :D
[15:15:37] urandom: I think that we are good, re-enabling puppet and forcing a run on all nodes via cumin
[15:15:40] to double check
[15:15:52] kk
[15:17:20] I'll report back when done
[15:32:57] urandom: all no-ops! We are done!
[15:33:00] \o/
[15:34:35] I guess the real test is what happens when the existing certs expire :)
[15:35:06] I mean, if there is any issue lying in wait, it's something else we missed that relies on them, and what it might do when they are no longer valid
[15:36:10] urandom: I tested the cert reload for ml-cache and it worked fine with Cassandra 4.x. Also we have a prom blackbox alert that should fire when we get within a few days of expiry
[15:37:12] I was thinking about something client-oriented
[15:37:22] scripts or something
[15:37:26] ahh okok
[15:37:29] we'll see, yes
[18:40:10] PROBLEM - MariaDB sustained replica lag on s8 on db1214 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1214&var-port=9104
[18:44:10] RECOVERY - MariaDB sustained replica lag on s8 on db1214 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1214&var-port=9104
[19:27:12] PROBLEM - MariaDB sustained replica lag on s4 on db1248 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[19:28:12] RECOVERY - MariaDB sustained replica lag on s4 on db1248 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[19:55:14] PROBLEM - MariaDB sustained replica lag on s8 on db1214 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1214&var-port=9104
[19:57:14] RECOVERY - MariaDB sustained replica lag on s8 on db1214 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1214&var-port=9104
[20:17:16] PROBLEM - MariaDB sustained replica lag on s8 on db1214 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1214&var-port=9104
[20:21:16] RECOVERY - MariaDB sustained replica lag on s8 on db1214 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1214&var-port=9104
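The Cassandra secrets cleanup above follows a simple rollout pattern: disable puppet fleet-wide, stage and review the removal of modules/secret/secrets/cassandra in puppet private, merge, apply on one canary node per cluster, then re-enable puppet and force a run everywhere to confirm it is a no-op. A rough sketch of that flow is below, assuming the cumin CLI and the disable-puppet / enable-puppet / run-puppet-agent host scripts; the canary hostname is a placeholder.

    import subprocess

    CANARY = "cassandra-canary1001.example.org"  # hypothetical; pick one node per cluster
    ALL_CASSANDRA = "C:cassandra"                # "all c:cassandra nodes", per the log
    REASON = "cassandra secrets cleanup"

    def cumin(target: str, command: str) -> None:
        """Run a command on the hosts matching `target` via the cumin CLI (assumed invocation)."""
        subprocess.run(["sudo", "cumin", target, command], check=True)

    # 1. Stop puppet from applying anything while the change is staged and reviewed.
    cumin(ALL_CASSANDRA, f'disable-puppet "{REASON}"')

    # 2. (Manual) stage the recursive removal of modules/secret/secrets/cassandra
    #    in puppet private, get it reviewed, and merge it.

    # 3. Apply on a canary first, one per cluster.
    cumin(CANARY, f'enable-puppet "{REASON}"')
    cumin(CANARY, "run-puppet-agent")

    # 4. Once the canary run looks sane, re-enable and force a run everywhere;
    #    every run should come back as a no-op.
    cumin(ALL_CASSANDRA, f'enable-puppet "{REASON}"')
    cumin(ALL_CASSANDRA, "run-puppet-agent")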