[00:12:38] Machine-Learning-Team, ORES, Wikimedia-production-error: PHP Notice: Trying to access array offset on value of type null (in SpecialORESModels) - https://phabricator.wikimedia.org/T329304 (Krinkle)
[07:56:57] hello folks
[07:57:09] started another run of the cookbook to upgrade staging to k8s 1.23
[07:57:13] fingers crossed
[07:57:18] ottomata: ack thanks!
[08:16:57] Machine-Learning-Team, Patch-For-Review: Implement new mediawiki.revision-score streams with Lift Wing - https://phabricator.wikimedia.org/T328576 (elukey) Updates: * Andrew rolled out rc1 page_change stream to Kafka Main, so we can test it with ChangeProp (thanks!). * After a brief chat on IRC, it seem...
[08:41:54] o/ 🤞
[08:49:36] klausman:
[08:49:37] o/
[08:49:45] \o
[08:49:46] so I think that the current etcd reimage procedure doesn't work
[08:50:21] I got stuck in https://docs.wire.com/how-to/administrate/etcd.html#troubleshooting with 2001, since IIUC wiping the raft log doesn't work
[08:50:40] I am trying to remove/add 2001 to the cluster to see if it works
[08:50:59] but in general I think that the safest way is to preserve the etcd raft log's dir
[08:51:11] maybe using a dedicated partman recipe
[08:51:14] does it make sense?
[08:51:44] hmm. but the raft log is not on a separate filesystem, is it?
[08:52:50] good point, it is on /var/lib/etcd, so not on a separate partition
[08:52:54] so that doesn't work
[08:53:28] So what did you try yesterday? keep all three running and reimage one at a time?
[08:53:37] removed 2001, re-added, but didn't work
[08:53:59] Any useful errors?
[08:54:08] no, yesterday it was a mess: I stopped etcd on all nodes, and manually started 2001 without the DISCOVERY variables
[08:54:24] it doesn't find members and it fails to bootstrap
[08:54:34] so frustrating
[08:55:14] stuff like "failed to find member b12825ca936a35a6 in cluster 3b33a847854e28f1"
[08:55:19] that is a little cryptic
[08:59:09] yeah, those IDs are a bit useless
[08:59:14] klausman: wiped /var/lib/etcd, restarted, and it looks like it's working now
[08:59:39] totally random
[08:59:41] But a stop, reimage, start would do that as well, no?
[09:00:21] in theory yes, but we'd need to add some steps to remove the member from the cluster, wipe all, re-add it, etc.
[09:00:22] I am wondering about those "initial setup" variables, and whether they maybe get a fresh etcd wedged into a state it can't recover from
[09:01:15] So reading that wire doc, the steps should be remove, stop, wipe/reimage, start?
[09:01:40] in theory yes
[09:01:52] but it is starting to be more complicated than anticipated
[09:02:13] My best guess then is that a remaining etcd cluster will not admit a member that used to be there, but has now lost its raft logs
[09:02:16] so maybe upgrading etcd to a new OS should be a separate step
[09:02:32] yeah, I thought about that as well
[09:02:55] and I _think_ reimaging an etcd cluster can be done without stopping the associated k8s cluster
[09:03:11] I wouldn't wanna _guarantee_ it keeps working, but it might.
[09:03:30] yes yes, it should be totally ok, if you do it one node at a time
[09:04:07] but removing a member in a 3-node cluster could lead to quorum failures? No idea
[09:04:31] I'll ask service ops how they did the other etcd reimages
[09:04:37] That's only a problem during leader election
[09:04:49] once you have a leader, you can keep running on two nodes
[09:05:04] kinda brittle, of course, but it should keep working.
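The quorum worry above can be made concrete with etcd's majority arithmetic: a write needs floor(n/2) + 1 members. A quick sketch (the `quorum` function name is just for illustration):

```shell
#!/bin/bash
# etcd commits a write only when floor(n/2) + 1 members acknowledge it.
quorum() { echo $(( $1 / 2 + 1 )); }

# 3-node cluster: quorum is 2, so one node can be down/reimaged safely.
echo "3 nodes -> quorum $(quorum 3), tolerates $(( 3 - $(quorum 3) )) failure(s)"

# While one member is removed, the remaining 2-node cluster has quorum 2:
# it keeps serving, but cannot tolerate any further failure ("kinda brittle").
echo "2 nodes -> quorum $(quorum 2), tolerates $(( 2 - $(quorum 2) )) failure(s)"
```

This is why the one-node-at-a-time approach works but leaves no margin: during the reimage the two survivors still form a majority of the (now two-member) cluster, yet losing either one stops writes.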
[09:05:49] yeah, but very risky
[09:06:09] if the reimaged node doesn't come back up cleanly then it is not great
[09:07:13] going to proceed with 2002 to see how it goes
[09:07:17] uff
[09:09:35] fingers crossed.
[09:11:29] I expect the same issue; if not, I'd be really baffled
[09:12:30] I can take a look at the logs in a moment
[09:28:04] brb
[09:39:55] klausman: had to redo the same procedure but this time it worked fine
[09:40:21] Weird.
[09:40:55] Overall, I think decoupling the etcd reimage from the rest of the k8s reimage is preferable. Fixing a faltering etcd cluster is just too much voodoo.
[09:41:08] maybe on 2001 the flag to consider the cluster "existing" was not there, so it got a little messier
[09:41:39] possible. I have almost no real idea about what that flag actually does.
[09:42:41] I think that if we have "new" then etcd assumes that a new cluster is being bootstrapped
[09:43:08] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 10 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (MoritzMuehlenhoff)
[09:43:11] I meant more concretely/mechanically, like how does election change? etc.
[09:45:40] not sure if the election per se changes; what I think happens is that etcd assumes it has to create a new cluster ID etc., and if it finds an existing one from the other nodes it complains
[09:46:51] I need to put some time aside to read up on these things
[09:47:16] I remember reading the Raft paper way back and understanding how it works, but by now I've forgotten most of it
[09:47:26] At least it's not as complex as Paxos :D
[10:08:50] etcd reimages completed, cluster health
[10:08:53] *healthy
[10:19:14] nice work!
[10:19:32] sort of :(
[10:20:13] Don't short-sell your efforts. The results might not be perfect, but not for lack of effort.
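The remove/stop/wipe/re-add sequence the conversation converges on can be sketched with etcd v3's `etcdctl`. This is an operational sketch against a live cluster, not the cookbook's actual steps: the member name, peer URL, and the systemd unit name are placeholders, and the only ID shown is the one from the error message quoted earlier.

```shell
# Per-node etcd reimage sketch (one node at a time, cluster stays up).
# Hostname "2001" and the peer URL are placeholders.

# 1. On a healthy member: drop the node that will be reimaged.
etcdctl member list                       # find the hex ID of the 2001 member
etcdctl member remove b12825ca936a35a6    # ID as seen in the earlier error

# 2. On the node itself: stop etcd and wipe the stale raft log / data dir
#    (a reimage wipes /var/lib/etcd anyway, since it is not a separate partition).
systemctl stop etcd
rm -rf /var/lib/etcd/*

# 3. Re-add it as a member, then start it with the cluster state marked
#    "existing", NOT "new" -- with "new" etcd tries to bootstrap a fresh
#    cluster ID, and the surviving members reject it.
etcdctl member add 2001 --peer-urls=https://2001.example:2380
export ETCD_INITIAL_CLUSTER_STATE=existing
systemctl start etcd

# 4. Verify quorum and health before touching the next node.
etcdctl endpoint health --cluster
```

This would also explain the "failed to find member ... in cluster ..." error: a wiped node started with stale membership expectations (or the wrong initial-cluster-state) cannot reconcile its identity with the cluster that still remembers the old member ID.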
[10:30:19] the rest of the cookbook works; if used with only the etcd wipe, it runs fine, at least that
[10:31:20] (PS1) Ilias Sarantopoulos: revertrisk: upgrade python 3.9 and debian [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439)
[10:32:43] (CR) CI reject: [V: -1] revertrisk: upgrade python 3.9 and debian [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439) (owner: Ilias Sarantopoulos)
[10:35:55] (PS2) Ilias Sarantopoulos: revertrisk: upgrade python 3.9 and debian [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439)
[10:36:57] (CR) CI reject: [V: -1] revertrisk: upgrade python 3.9 and debian [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439) (owner: Ilias Sarantopoulos)
[10:40:03] (PS3) Ilias Sarantopoulos: WIP: revertrisk: upgrade python 3.9 and debian [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439)
[11:27:08] * elukey lunch
[11:37:37] same
[12:16:09] (CR) Klausman: [C: +1] revertrisk: create model.py and blubberfile for each version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (owner: AikoChou)
[13:10:19] elukey: just so i'm following all the tickets correctly
[13:10:29] https://phabricator.wikimedia.org/T328576 'new mediawiki.revision-score streams' will use the existing mediawiki.revision-score schema, right?
[13:39:39] ottomata: o/ yes this is the idea
[13:58:53] (CR) Ilias Sarantopoulos: "Nice work! I added 2 comments and the main one is mostly about the duplicate code that is being introduced.
Let me know what you think and" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (owner: AikoChou)
[14:41:54] klausman: o/ I have rolled out some configs to staging but the calico pods are acting weirdly
[14:42:27] I am discussing this in #wikimedia-k8s-sig; the aux cluster seems to show the same issue, but not the wikikube staging one
[14:42:33] so I bet it is an operator issue :D
[14:45:02] "operator" is ambiguous in that context :D
[14:45:15] And let me rejoin that channel. for some reason weechat did not remember it
[15:28:07] Machine-Learning-Team: [outlink] Upgrade python from 3.7 to 3.9 in docker images - https://phabricator.wikimedia.org/T328438 (isarantopoulos) a: isarantopoulos
[15:38:00] (CR) Ilias Sarantopoulos: "I have tested the model image locally and it works fine!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439) (owner: Ilias Sarantopoulos)
[17:06:31] have a nice rest of the day folks :)
[17:09:43] \o
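For the "calico pods are acting weirdly" report, a hypothetical first-pass triage looks like the following. The namespace and label selector are assumptions (calico is commonly deployed as a `calico-node` DaemonSet in `kube-system`; Wikimedia's clusters may differ), so this is a sketch rather than the team's actual procedure.

```shell
# Hypothetical calico triage sketch; namespace and label are assumptions.
# Pod status and node placement across the cluster:
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide

# Recent events (restarts, readiness-probe failures, image pulls):
kubectl -n kube-system describe pods -l k8s-app=calico-node | grep -A 10 'Events:'

# Tail of the calico-node logs on each pod:
kubectl -n kube-system logs -l k8s-app=calico-node --tail=50
```

Comparing this output between the aux cluster (showing the issue) and wikikube staging (not showing it) would localize whether the problem is in the rolled-out configs or in the cluster itself.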