[05:14:27] I am starting the s2 primary switch
[07:03:31] I am running a very long schema change on s4 (eqiad only), so coordinate with me if there's any maintenance that needs to be done on s4 eqiad
[07:34:25] "diff saved to https://phabricator.wikimedia.org/P59999" → so close
[07:34:55] dbctl wouldn't have let you depool the master :)
[07:35:28] oh it's not this one, I was talking about the paste ID :D
[07:35:40] aah :)
[07:46:55] OK, going to try and do the TLS renewal stuff for T361844
[07:46:55] T361844: Swift TLS certificates will expire soon (14 April) - https://phabricator.wikimedia.org/T361844
[08:12:41] gah, not a good time for my desktop to lock up
[08:31:09] Could I get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1018190 please? Updating the new cert in puppet.
[08:37:29] ...which I need before I can check that the new cert works OK in codfw, and then repeat the entire process for eqiad...
[08:37:57] Done
[08:39:19] thanks <3
[08:46:39] codfw now 'Not After : Apr 8 08:00:23 2029 GMT'; going to leave that a bit to settle in case of delayed-onset 🔥, then will do eqiad
[10:00:57] Now going to do eqiad
[10:13:19] Can I now get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1018227 please? So I can deploy the updated cert to eqiad-swift
[10:20:21] done
[10:20:31] thanks :)
[10:24:52] weird, db2097 is lagging, but it is replicating and the lag is increasing
[10:26:41] and the events are crazy, prometheus user too, not producing metrics
[10:26:46] yep
[10:26:50] looks stalled
[10:26:57] there's a dump running
[10:26:58] marostegui: if you want to have a look at it?
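The db2097 triage above hinges on telling a replica that is merely lagging (lag grows, but both replication threads still report `Yes`) from one whose SQL thread is broken. A minimal sketch of that check, parsing a captured `SHOW SLAVE STATUS\G` snippet — the field values below are made-up sample data, not db2097's real output:

```shell
# Hypothetical sample of the three interesting fields from SHOW SLAVE STATUS\G
status='Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 1843'

# Extract each field (": " separates key and value in the \G output form).
io=$(printf '%s\n' "$status"  | awk -F': ' '/Slave_IO_Running/{print $2}')
sql=$(printf '%s\n' "$status" | awk -F': ' '/Slave_SQL_Running/{print $2}')
lag=$(printf '%s\n' "$status" | awk -F': ' '/Seconds_Behind_Master/{print $2}')

if [ "$io" = "Yes" ] && [ "$sql" = "Yes" ]; then
  # Both threads "running" but lag climbing -- the db2097 symptom:
  # replication looks alive, yet the SQL thread may be stalled on IO,
  # in which case even STOP SLAVE can hang (as it did here).
  echo "replicating, lag=${lag}s"
else
  echo "replication broken (IO=$io SQL=$sql)"
fi
```

Against a live replica the `status` variable would come from something like `mysql -e 'SHOW SLAVE STATUS\G'`; the point is that "running" threads plus growing lag is not proof of health.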
[10:27:07] otherwise I may decommission it
[10:27:14] +1 to decom
[10:27:18] I am with something else at the moment
[10:27:19] in case it is useful for debugging
[10:37:51] there is something with replication and x1 that doesn't work well
[10:38:09] seems like the same issue from a few weeks ago
[10:38:34] I may recover x1 logically to rebuild the backup source on codfw
[10:39:10] From what I saw when I fixed x1 last Monday, the issue was semi-sync, whereas with this host, the host looks stalled (at least show processlist is)
[10:39:42] but stop slave got stuck
[10:40:00] but that wasn't the case last week
[10:40:12] I think it is stalled and any operation would get stuck
[10:41:33] was GTID trimmed on x1?
[10:41:40] no
[10:41:56] then I have nothing
[10:42:03] that could explain it
[10:42:42] will do the logical load and pray
[10:43:01] I will wait for you at the church
[10:44:29] I had to -9 the process, it was IO-stuck
[10:45:11] expected, I'd say
[10:46:05] actually, I have a lead
[10:46:23] x1 I think is one of the few production sections using ROW
[10:46:33] could be related to that (?)
[10:46:38] we use ROW on all the slaves
[13:18:15] urandom: o/ I think I have found a workaround for cqlshrc - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1018267
[13:20:16] ah also, what is the spark loader?
[13:20:20] I guess it is in refinery land
[13:45:55] elukey: yeah, the if/else was what I was thinking as well. when the dust settles (and all the clusters are running PKI), there are going to be a lot of those conditionals that can be cleaned up.
:)
[13:46:38] the spark loader is, TTBMK, what is used to connect to Cassandra for the bulk imports
[13:47:04] you lost me at TTBMK :D
[13:47:09] and I think it's used for everything batch-oriented, so all the AQS stuff, and image suggestions as well
[13:48:20] you had to know it wasn't a Java-based software acronym, they haven't exhausted the 3- and 4-letter space quite yet :)
[13:49:53] ahhh ok sorry, I never use that so I thought it was some internal thing
[13:49:56] :D
[13:50:26] yes yes, so it must be in DE's refinery, I am downloading the repo now to see if there are traces
[13:58:09] use `git grep` to search for the password, that's probably easiest :/
[15:30:04] urandom: after some scavenging with Ben, https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/config/dag_default_args.py#L182-190 seems to be the config
[15:30:25] that, IIUC, doesn't force any TLS setting, so it uses the unencrypted conn
[15:30:28] does it make sense?
[15:30:46] if so we should be good, we can force TLS right after PKI
[15:36:45] opened T362181
[15:36:46] T362181: Encrypt Airflow connections to AQS Cassandra - https://phabricator.wikimedia.org/T362181
[15:36:54] * urandom groans
[15:57:02] greetings, data-persistence! if we need to arrange a 30m-1h window in the next few weeks where non-emergency conftool actions (e.g., via dbctl) are discouraged, what would be the best way to coordinate that with you folks?
[15:57:02] context: we need to make some changes to how etcd is replicated between sites, which requires replication to be down (but recoverable in an emergency) for a couple of brief periods in that window.
[15:58:29] arnaudb: can you coordinate with swfrench-wmf?
[15:59:15] swfrench-wmf: I personally have a schema change running which is probably not going to finish before the end of the week, so from my side it'd need to be next week
[16:00:56] marostegui: good to know.
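The `git grep` suggestion above works well for this kind of hunt because it searches only tracked files and is fast even on large checkouts like refinery. A runnable illustration in a throwaway repo — the file path and the marker string are invented for the demo, not taken from refinery:

```shell
# Build a scratch repo with one committed file containing the string
# we want to find (hypothetical path and value, for demonstration only).
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
mkdir -p util
printf 'cassandra.password=PLACEHOLDER\n' > util/loader.properties
git add -A
git -c user.email=demo@example.org -c user.name=demo commit -qm init

# -n prints line numbers; output is path:line:matching-line.
git grep -n 'cassandra.password'
```

In a real refinery checkout the same `git grep -n '<password or key name>'` from the repo root would locate whatever config the spark loader reads.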
that should not be a problem - I only plan to do prep work this week, which should be non-disruptive.
[16:01:39] urandom: something very fun https://puppet-compiler.wmflabs.org/output/1018309/1832/aqs1010.eqiad.wmnet/change.aqs1010.eqiad.wmnet.err
[16:01:42] swfrench-wmf: Good, we have multiple schema changes running, so we need to coordinate to make sure they are all stopped for your windows. Please coordinate with arnaudb for next week
[16:02:00] urandom: the change is https://gerrit.wikimedia.org/r/#/c/1018309/
[16:02:11] not sure if it is puppet 7 or not, I didn't see it before for ml-cache
[16:03:10] elukey: you have a strange idea of fun! :)
[16:03:39] a very wrong idea of fun, not strange
[16:03:47] but we all knew that already ;)
[16:03:56] tags were omitted :D
[16:04:46] in the cassandra init.pp line 265 we instantiate the cassandra::instances, and it complains about a duplicate declaration of something
[16:05:23] that is defined only in cassandra::intance
[16:05:26] *instance
[16:05:31] so I am very puzzled
[16:05:32] marostegui: ack, will do. thank you!
[16:05:39] thanks!
[16:06:09] marostegui: Good NetboxEvening
[16:06:18] elukey: cumin do!
[16:07:14] lol
[16:12:05] ahhh wait, yes, now I get it, sigh
[16:12:23] it didn't trigger for ml-cache since we have only one instance running
[16:24:04] urandom: fixed it with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1018311, plus filed the change to move aqs1010 to pki
[16:24:12] hopefully we should be good now :D
[16:24:18] going afk, have a nice rest of the day folks!
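Both TLS threads in this log (the Swift renewal and verifying the new PKI cert on aqs1010) end with the same kind of check: read the certificate's expiry. A runnable stand-in using a throwaway self-signed cert, since the real checks ran against live services:

```shell
# Generate a short-lived self-signed cert to inspect (demo-only subject;
# /tmp paths are arbitrary).
openssl req -x509 -newkey rsa:2048 -nodes -subj '/CN=demo.example' \
  -keyout /tmp/demo.key -out /tmp/demo.crt -days 30 2>/dev/null

# Prints the "Not After" date quoted in the log, in notAfter=... form.
openssl x509 -in /tmp/demo.crt -noout -enddate
```

Against a live endpoint the equivalent would be to pipe the server's cert into the same `x509` call, e.g. `echo | openssl s_client -connect host:443 -servername host 2>/dev/null | openssl x509 -noout -enddate`.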