[05:09:06] Amir1: let me know when you are around and ready to enable writes on es6 and es7
[05:22:27] I am going to switch the s5 codfw master
[05:43:55] hey! I have a slight issue that might be disrupting for the next few days: I have had no internet at home since last Tuesday. My ISP is working to restore the service; someone is scheduled to stop by tomorrow morning to try and fix everything. Meanwhile I have a 4G box so I can work. Timely disruption, as I was on staycation last week x)
[07:32:06] that is a long time without internet :(
[07:34:50] at least it was a good crash test for my Home Assistant setup: I lost my DHCP server for a while and could not control my living room lights anymore because a component was addressed dynamically :D
[07:37:19] ouch
[07:37:37] https://usercontent.irccloud-cdn.com/file/oRUfpygZ/grafik.png
[07:38:00] arnaudb: don't worry :/ let me know if I can help with anything
[07:39:47] :D thanks!
[07:39:54] I am going to enable writes on es6 and es7 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1029109
[08:52:27] PROBLEM - MariaDB sustained replica lag on s1 on db1207 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1207&var-port=9104
[08:53:27] RECOVERY - MariaDB sustained replica lag on s1 on db1207 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1207&var-port=9104
[09:06:34] heads up on alerts caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1030896
[09:07:35] backups are available and were generated over the weekend, but they lacked metadata (and so went undetected). As a failsafe, backups still complete successfully even if metadata cannot be generated
[09:56:45] PROBLEM - MariaDB sustained replica lag on s4 on db1248 is CRITICAL: 6.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[09:58:45] RECOVERY - MariaDB sustained replica lag on s4 on db1248 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[12:43:35] just tried to run db-mysql on localhost...
[12:56:29] urandom: o/ ping me if you have time for the cassandra private cleanup later on :)
[13:10:39] elukey: I have a ~30min meeting in ~20mins, but I am otherwise free
[13:11:18] ok to do it later on!
[14:19:47] elukey: o/
[14:20:10] urandom: sorry, in a meeting now :( Should be free in ~1h
[14:20:26] 👍
[14:34:27] elukey, hnowlan: do either of you know what the `/srv/cassandra-{a,b}/tmp/local_group_default_T_mediarequest_per_file` directories are on aqs1010? The timestamp is March 30, 2021; I think that might roughly correspond to a migration (v2 to v3)?
[14:35:00] that seems extremely likely
[14:35:00] pretty sure this is safe to delete, but you know, measure twice, cut once
[14:35:20] no idea, sorry
[14:35:21] I'd say you're fine to clean it - btullis might be able to confirm
[14:55:29] urandom: ready!
[14:55:30] if you are
[14:55:38] I am!
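For context on the PROBLEM/RECOVERY lines above: the sustained replica lag check warns at 1 second and goes critical at 2 seconds of replication lag (the "(C)2 ge (W)1" fragment in the recovery messages). Below is a minimal sketch of an equivalent check against Prometheus; the endpoint and metric name are assumptions, only the thresholds come from the alert text.

    import json
    import urllib.parse
    import urllib.request

    # Hypothetical Prometheus endpoint; the real alert is driven by the
    # monitoring stack, this only illustrates the threshold logic.
    PROMETHEUS = "http://prometheus.example.org/api/v1/query"
    # Assumed mysqld_exporter metric, averaged to approximate "sustained" lag.
    QUERY = ('avg_over_time(mysql_slave_status_seconds_behind_master'
             '{instance="db1207:9104"}[2m])')
    WARNING, CRITICAL = 1, 2  # seconds, per the "(C)2 ge (W)1" alert output

    url = PROMETHEUS + "?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url, timeout=10) as resp:
        results = json.load(resp)["data"]["result"]
    lag = float(results[0]["value"][1]) if results else 0.0

    if lag >= CRITICAL:
        print(f"CRITICAL: sustained replica lag {lag:.1f} ge {CRITICAL}")
    elif lag >= WARNING:
        print(f"WARNING: sustained replica lag {lag:.1f} ge {WARNING}")
    else:
        print(f"OK: sustained replica lag {lag:.1f}")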
[14:56:25] super :)
[14:56:33] puppet is disabled on all c:cassandra nodes
[14:56:55] if you are ok with it, I'd stage the cassandra dir drop in puppet private for you to review
[14:57:06] then we can merge and slowly run puppet on some nodes
[14:57:09] like one for each cluster
[14:57:41] sgtm
[15:03:07] super, prepping
[15:04:05] urandom: ok, done! Can you review in puppet private?
[15:05:17] elukey: I can confirm, that is a frightening number of deletions :)
[15:05:29] but yes, `modules/secret/secrets/cassandra` recursively, +1
[15:09:04] merged!
[15:09:32] running puppet on one node for each cluster
[15:10:41] wow, 988 deletions
[15:11:04] Delete All The Things \o/
[15:11:06] volans: scary, huh!
[15:11:25] * volans tempted to run git gc
[15:14:06] volans: yes, a big drop :D
[15:15:37] urandom: I think that we are good, re-enabling puppet and forcing a run on all nodes via cumin
[15:15:40] to double check
[15:15:52] kk
[15:17:20] I'll report back when done
[15:32:57] urandom: all no-ops! We are done!
[15:33:00] \o/
[15:34:35] I guess the real test is what happens when the existing certs expire :)
[15:35:06] I mean, if there is any issue lying in wait, it's something else we missed that relies on them, and what it might do when they are no longer valid
[15:36:10] urandom: I tested the cert reload for ml-cache and it worked fine with Cassandra 4.x. Also we have a prom blackbox alert that should fire when we get within a few days of expiry
[15:37:12] I was thinking about something client-oriented
[15:37:22] scripts or something
[15:37:26] ahh okok
[15:37:29] we'll see, yes
[18:40:10] PROBLEM - MariaDB sustained replica lag on s8 on db1214 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1214&var-port=9104
[18:44:10] RECOVERY - MariaDB sustained replica lag on s8 on db1214 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1214&var-port=9104
[19:27:12] PROBLEM - MariaDB sustained replica lag on s4 on db1248 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[19:28:12] RECOVERY - MariaDB sustained replica lag on s4 on db1248 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[19:55:14] PROBLEM - MariaDB sustained replica lag on s8 on db1214 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1214&var-port=9104
[19:57:14] RECOVERY - MariaDB sustained replica lag on s8 on db1214 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1214&var-port=9104
[20:17:16] PROBLEM - MariaDB sustained replica lag on s8 on db1214 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1214&var-port=9104
[20:21:16] RECOVERY - MariaDB sustained replica lag on s8 on db1214 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1214&var-port=9104
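The Cassandra secrets cleanup above follows a simple rollout pattern: disable puppet fleet-wide, stage and review the removal of modules/secret/secrets/cassandra in puppet private, merge, apply on one canary node per cluster, then re-enable puppet and force a run everywhere to confirm it is a no-op. A rough sketch of that flow is below, assuming the cumin CLI and the disable-puppet / enable-puppet / run-puppet-agent host scripts; the canary hostname is a placeholder.

    import subprocess

    CANARY = "cassandra-canary1001.example.org"  # hypothetical; pick one node per cluster
    ALL_CASSANDRA = "C:cassandra"                # "all c:cassandra nodes", per the log
    REASON = "cassandra secrets cleanup"

    def cumin(target: str, command: str) -> None:
        """Run a command on the hosts matching `target` via the cumin CLI (assumed invocation)."""
        subprocess.run(["sudo", "cumin", target, command], check=True)

    # 1. Stop puppet from applying anything while the change is staged and reviewed.
    cumin(ALL_CASSANDRA, f'disable-puppet "{REASON}"')

    # 2. (Manual) stage the recursive removal of modules/secret/secrets/cassandra
    #    in puppet private, get it reviewed, and merge it.

    # 3. Apply on a canary first, one per cluster.
    cumin(CANARY, f'enable-puppet "{REASON}"')
    cumin(CANARY, "run-puppet-agent")

    # 4. Once the canary run looks sane, re-enable and force a run everywhere;
    #    every run should come back as a no-op.
    cumin(ALL_CASSANDRA, f'enable-puppet "{REASON}"')
    cumin(ALL_CASSANDRA, "run-puppet-agent")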