[00:12:38] <wikibugs>	 10Machine-Learning-Team, 10ORES, 10Wikimedia-production-error: PHP Notice: Trying to access array offset on value of type null (in SpecialORESModels) - https://phabricator.wikimedia.org/T329304 (10Krinkle)
[07:56:57] <elukey>	 hello folks
[07:57:09] <elukey>	 started again another run of the cookbook to upgrade staging to k8s 1.23
[07:57:13] <elukey>	 fingers crossed
[07:57:18] <elukey>	 ottomata: ack thanks!
[08:16:57] <wikibugs>	 10Machine-Learning-Team, 10Patch-For-Review: Implement new mediawiki.revision-score streams with Lift Wing - https://phabricator.wikimedia.org/T328576 (10elukey) Updates:  * Andrew rolled out rc1 page_change stream to Kafka Main, so we can test it with ChangeProp (thanks!). * After a brief chat on IRC, it seem...
[08:41:54] <isaranto>	 o/ 🤞
[08:49:36] <elukey>	 klausman: 
[08:49:37] <elukey>	 o/
[08:49:45] <klausman>	 \o
[08:49:46] <elukey>	 so I think that the current etcd reimage procedure doesn't work
[08:50:21] <elukey>	 I got stuck in https://docs.wire.com/how-to/administrate/etcd.html#troubleshooting with 2001, since IIUC wiping the raft log doesn't work
[08:50:40] <elukey>	 I am trying to remove/add 2001 to the cluster to see if it works
[08:50:59] <elukey>	 but in general I think that the safest way is to preserve the etcd raft log's dir
[08:51:11] <elukey>	 maybe using a dedicated partman recipe
[08:51:14] <elukey>	 does it make sense?
[08:51:44] <klausman>	 hmm. but the raft log is not on a separate filesystem, is it?
[08:52:50] <elukey>	 good point it is on /var/lib/etcd, so not on a separate partition
[08:52:54] <elukey>	 so doesn't work
[08:53:28] <klausman>	 So what did you try yesterday? keep all three running and reimage one at a time?
[08:53:37] <elukey>	 removed 2001, readded, but didn't work
[08:53:59] <klausman>	 Any useful errors?
[08:54:08] <elukey>	 no yesterday it was a mess, I stopped etcd on all, and manually started 2001 without the DISCOVERY variables
[08:54:24] <elukey>	 it doesn't find members and it fails to bootstrap
[08:54:34] <elukey>	 so frustrating
[08:55:14] <elukey>	 stuff like failed to find member b12825ca936a35a6 in cluster 3b33a847854e28f1
[08:55:19] <elukey>	 that is a little cryptic
[08:59:09] <klausman>	 yeah, those IDs are a bit useless
[08:59:14] <elukey>	 klausman: wiped /var/lib/etcd, restarted, and it looks working now
[08:59:39] <elukey>	 totally random
[08:59:41] <klausman>	 But a stop, reimage, start would do that as well, no?
[09:00:21] <elukey>	 in theory yes, but we'd need to add some steps to remove member from cluster, wipe all, re-add it etc..
[09:00:22] <klausman>	 I am wondering about those "initial setup" variables, and whether they maybe get a fresh etcd wedged into a state it can't recover from
[09:01:15] <klausman>	 So reading that wire doc, the steps should be remove, stop, wipe/reimage, start?
[09:01:40] <elukey>	 in theory yes
[09:01:52] <elukey>	 but it is starting to be more complicated than anticipated
[09:02:13] <klausman>	 My best guess then is that a remaining etcd cluster will not admit a member that used to be there, but has now lost its raft logs
[09:02:16] <elukey>	 so maybe upgrading etcd to a new os should be a separate step
[09:02:32] <klausman>	 yeah, I thought about that as well,
[09:02:55] <klausman>	 and I _think_ reimaging an etcd cluster can be done without stopping the associated k8s cluster
[09:03:11] <klausman>	 I wouldn't wanna _guarantee_ it keeps working, but it might.
[09:03:30] <elukey>	 yes yes it should be totally ok, if you do it one node at the time
[09:04:07] <elukey>	 but removing a member in a 3 node cluster could lead to quorum failures? No idea
[09:04:31] <elukey>	 I'll ask to service ops how the did the other etcd reimages
[09:04:37] <klausman>	 That's only a problem during leader election
[09:04:49] <klausman>	 once you have a leader, you can keep running on two nodes
[09:05:04] <klausman>	 kinda brittle, of course, but it should keep working.
[09:05:49] <elukey>	 yeah but very risky
[09:06:09] <elukey>	 if the reimaged node doesn't come up with any problem then it is not great
[09:07:13] <elukey>	 going to proceed with 2002 to see how it goes
[09:07:17] <elukey>	 uff
[09:09:35] <klausman>	 fingers crossed.
[09:11:29] <elukey>	 I expect the same issue, if not I'd be really baffled
[09:12:30] <klausman>	 I can take a look at the logs in a moment
[09:28:04] <klausman>	 brb
[09:39:55] <elukey>	 klausman: had to redo the same procedure but this time it worked fine
[09:40:21] <klausman>	 Weird.
[09:40:55] <klausman>	 Overall, I think decoupling etcd reimage from the rest of k8s reimage is preferable. Fixing a faltering etcd cluster is just too much voodoo.
[09:41:08] <elukey>	 maybe on 2001 the flag to consider the cluster "existing" was not there so it got a little messier
[09:41:39] <klausman>	 possible. I have almost no real idea about what that flag actually does.
[09:42:41] <elukey>	 I think that if we have "new" then etcd assumes that a new cluster is being bootstrapped
[09:43:08] <wikibugs>	 10Machine-Learning-Team, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 10 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10MoritzMuehlenhoff)
[09:43:11] <klausman>	 I meant more concretely/mechanically, like how does election change? etc
[09:45:40] <elukey>	 not sure if the election per se changes, what I think it happens is that etcd assumes that it has to create a new cluster id etc.. and if it finds an existing one from other nodes it complains
[09:46:51] <klausman>	 I need to put some time aside to read up on these things
[09:47:16] <klausman>	 I remember reading the Raft paper way back and understanding how it works, but by now I've forgotten most of it
[09:47:26] <klausman>	 At least it's not as complex as Paxos :D
[10:08:50] <elukey>	 etcd reimages completed, cluster health
[10:08:53] <elukey>	 *healthy
[10:19:14] <klausman>	 nice work!
[10:19:32] <elukey>	 sort of :(
[10:20:13] <klausman>	 Don't shortsell your efforts. The results might not be perfect, but not for lack of effort.
[10:30:19] <elukey>	 the rest of the cookbook works, if used with only etcd wipe it runs fine, at least that
[10:31:20] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: revertrisk: upgrade python 3.9 and debian [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439)
[10:32:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] revertrisk: upgrade python 3.9 and debian [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439) (owner: 10Ilias Sarantopoulos)
[10:35:55] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: revertrisk: upgrade python 3.9 and debian [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439)
[10:36:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] revertrisk: upgrade python 3.9 and debian [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439) (owner: 10Ilias Sarantopoulos)
[10:40:03] <wikibugs>	 (03PS3) 10Ilias Sarantopoulos: WIP: revertrisk: upgrade python 3.9 and debian [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439)
[11:27:08] * elukey lunch
[11:37:37] <klausman>	 same
[12:16:09] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] revertrisk: create model.py and blubberfile for each version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/888824 (owner: 10AikoChou)
[13:10:19] <ottomata>	 elukey:  just so i'm following all the tickets correct
[13:10:29] <ottomata>	 https://phabricator.wikimedia.org/T328576 'new mediawiki.revision-score streams' will use the existent mediawiki.revision-score schema, right?
[13:39:39] <elukey>	 ottomata: o/ yes this is the idea
[13:58:53] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "Nice work! I added 2 comments and the main one is mostly about the duplicate code that is being introduced. Let me know what you think and" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/888824 (owner: 10AikoChou)
[14:41:54] <elukey>	 klausman: o/ I have rolled out some configs to staging but calico pods are acting weirdly
[14:42:27] <elukey>	 I am discussing this in #wikimedia-k8s-sig, the aux cluster seem to show the same issue, but not the wikikube staging one
[14:42:33] <elukey>	 so I bet it is an operator issue :D
[14:45:02] <klausman>	 "operator" is ambiguous in that context :D
[14:45:15] <klausman>	 And let me rejoin that channel. for some reason weechat did not remember it
[15:28:07] <wikibugs>	 10Machine-Learning-Team: [outlink] Upgrade python from 3.7 to 3.9 in docker images - https://phabricator.wikimedia.org/T328438 (10isarantopoulos) a:03isarantopoulos
[15:38:00] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "I have tested the model image locally and works fine!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439) (owner: 10Ilias Sarantopoulos)
[17:06:31] <elukey>	 have a nice rest of the day folks :)
[17:09:43] <klausman>	 \o