[01:58:19] (PS1) AikoChou: revertrisk: create model.py and blubberfile for each version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824
[02:18:39] (PS2) AikoChou: revertrisk: create model.py and blubberfile for each version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824
[05:26:43] (CR) AikoChou: events: support multiple source events (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[07:25:27] (CR) Elukey: events: support multiple source events (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[07:51:23] hello folks :)
[08:12:08] klausman: o/ Today I realized that the etcd procedure can probably be simpler, see https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/889048/
[08:12:17] in theory it should avoid yesterday's mess
[08:26:42] elukey: \o
[08:27:35] Having a look at that change. Like it so far :)
[08:29:46] I am reading https://etcd.io/docs/v3.5/upgrades/upgrade_3_3/, it seems more inline
[08:30:29] One thing of note: at some point yesterday, I saw errors about the etcd versions being different. I don't think it would _break_ this scheme, but it might.
[08:31:05] LGTM'd
[08:31:36] yeah I saw the warning as well, but it should be harmless from what upstream suggests
[08:38:08] So the k8s machines are all reimaged, right?
[08:39:03] yeah, finished yesterday
[08:39:17] but I'd like to run another time the cookbook to see if the procedure works better
[08:39:20] If etcd is using TLS, the discovery SRV record (e.g. example.com) must be included in the SSL certificate DNS SAN along with the hostname, or clustering will fail with log messages like the following:
[08:39:24] [...]
rejected connection from "10.0.1.11:53162" (error "remote error: tls: bad certificate", ServerName "example.com")
[08:39:32] https://etcd.io/docs/v3.3/op-guide/clustering/#dns-discovery
[08:39:59] this should explain what we were seeing..
[08:40:06] it is not in 3.2's guide
[08:40:32] in our case it should be ml_staging_etcd.codfw.wmnet
[08:40:50] ah, yes, that is not in our subject altnames
[08:42:02] And let me guess, while it's in the 3.3 docs, it's nowhere in the changelog?
[08:42:36] I haven't really checked the whole changelog
[08:43:32] Just checked. There are two mentions of SRV records, and neither is indicating this
[08:43:47] https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.3.md#security-authentication-3
[08:44:32] mmm or https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.3.md#security-authentication-4
[08:45:35] there is also https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.3.md#improved
[08:46:01] I think the tail end of the 2.2 bullet item may hint at this. But not in a way I would have picked up
[08:46:13] (auth-4)
[08:47:08] And I don't think PR 11196 is what we would have wanted. Definitely not in a steady state
[08:47:50] So you wanna do a whole-reimage of staging again, once 889048 is merged?
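Editor's note: the SAN requirement quoted from the etcd docs above can be checked with a few lines of Python. This is only a minimal sketch, not the team's tooling: `san_covers` is a hypothetical helper, the host name in the usage example is made up, and the comparison is a plain case-insensitive exact match (wildcard SANs are not handled).

```python
from typing import Sequence


def san_covers(discovery_domain: str, dns_sans: Sequence[str]) -> bool:
    """Return True if the SRV discovery domain is listed among the DNS SANs.

    Hypothetical helper: exact, case-insensitive match only; a trailing dot
    on either side is ignored. Wildcard SAN entries are NOT interpreted.
    """
    wanted = discovery_domain.rstrip(".").lower()
    return any(san.rstrip(".").lower() == wanted for san in dns_sans)


if __name__ == "__main__":
    # The discovery domain is the one named in the chat; the host SAN below
    # is an invented placeholder, not a real certificate entry.
    sans = ["etcd-host.example.wmnet"]
    print(san_covers("ml_staging_etcd.codfw.wmnet", sans))
```

With a certificate that only lists the host name, the check comes back false, which matches the "bad certificate" rejection in the quoted log line.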
[08:50:47] yes exactly, but we'd need to add the SAN to the etcd cert before it
[08:52:52] I'll wait Janis' opinion on this to figure out what's best
[09:00:18] (PS6) Elukey: events: support multiple source events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576)
[09:00:43] (CR) Elukey: events: support multiple source events (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[09:04:06] (PS7) Elukey: events: support multiple source events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576)
[09:22:34] elukey: ack. note that I'll be out in the afternoon (after about 1430 or so). visiting you-know-who :)
[09:23:24] definitely, say hello to you-know-who :)
[09:25:49] Will do.
[09:26:59] hey folks, if there is anything I can be of help with the k8s upgrade let me know
[09:52:19] isaranto: thanks! For the moment just chatting with people would probably save our mental sanity a bit :D
[09:52:34] jokes aside, a new version of the cookbook is ready, I'll try to re-run it on staging
[09:52:46] we have a little thing in need to be fixed, but so far everything looks good
[09:52:56] we should have the cluster on 1.23 by end of week
[09:53:05] cool, just wanted to say I'm here (although probably I won't be of much help)
[09:53:14] (waiting for some feedback from serviceops on etcd before proceeding)
[09:54:26] isaranto: if you have questions etc.. about the ORES migration or if you want to have a chat with me about doubts/etc..
lemme know
[09:54:29] I am available anytime
[09:54:36] ack
[10:20:20] Machine-Learning-Team: [nsfw] Upgrade python and debian in docker image - https://phabricator.wikimedia.org/T329612 (isarantopoulos)
[10:21:06] Machine-Learning-Team: [revertrisk] Upgrade python from 3.7 to 3.9 in docker images - https://phabricator.wikimedia.org/T328439 (isarantopoulos) a:isarantopoulos
[10:27:38] hmm, perhaps before upgrading python I could take a look on revertrisk - to create separate deployments for new and old version
[10:28:11] but I'll wait to coordinate with Aiko tomorrow since she has been working on this
[10:31:46] +1 I wanted to suggest it, the code review seems in a good state
[11:00:58] klausman: I moved the etcd staging cluster to PKI, super quick, so we'll be able to add an extra san via puppet
[11:01:26] (CR) Ilias Sarantopoulos: [C: +1] events: support multiple source events (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[11:14:46] elukey: what does "moved to PKI" mean here? Not using the Puppet CA?
[11:15:16] klausman: yep exactly
[11:15:20] and cergen
[11:15:25] Neato. That always seemed hacky
[11:17:48] klausman: downside - we have underscores in our cluster name configs, that don't translate well into DNS SANs
[11:17:55] so we'll have to move to dashes
[11:18:08] Well, at least that's a one-off.
[11:18:16] Ben had to do it in https://phabricator.wikimedia.org/T313129 IIUC
[11:40:24] * elukey lunch!
[12:29:50] \o I'm out for the rest of the day, talk to you tomorrow!
[12:56:33] o/
[13:06:17] elukey: o/ if having page_change in kafka main would help you (and encourage you!) to use it, I think we can expedite getting it there
[13:54:13] ottomata: yep yep if it was on main for testing it would be super great, even rc streams, so we'll be able to start testing the whole pipeline from our side
[13:57:20] Hello.
I have a question about whether anyone still needs support for ORES on the Hadoop cluster: https://phabricator.wikimedia.org/T329363#8614295
[13:58:40] If we do, then I will need to find a fix for the `enchant` package being superseded by `enchant-2`. I thought that people in this channel may know the most about it, so I hope you don't mind my asking.
[13:58:56] btullis: hello! please ask any question, sorry for the trouble
[13:59:23] I don't recall exactly why we added ores::base to hadoop, maybe it was to allow distributed training or similar?
[13:59:37] I can try to figure it out from git blame later on, but in theory we shouldn't need it
[13:59:54] OK, great. Many thanks.
[14:00:01] ores::base may also be used by ores nodes as well so we cannot, for the moment, upgrade to enchant-2 :(
[14:17:29] Ok, maybe the simple thing to do is to try to forward port enchant.
[14:17:58] btullis: I found old-me in https://gerrit.wikimedia.org/r/c/operations/puppet/+/435966/3/modules/profile/manifests/hadoop/common.pp, there is some sense in why we have ores packages on hadoop. We don't really need nor have tried in the past to use ores on hadoop, so I am +1 to remove
[14:18:04] (from the hadoop workers I mean)
[14:18:47] it will remove a ton of old cruft from the hadoop workers, and it is a good occasion to remove them with reimages
[14:18:56] it will make them way faster
[14:19:01] so let's do it :)
[14:20:43] Ok, great. Thanks so much for your help. I will remove the classes from the Hadoop nodes.
[14:23:07] isaranto will be happy as well since we'll hopefully start to remove ores stuff from other systems soon-ish :)
[14:42:02] I am always happy when I remove-deprecate stuff 😜
[14:43:28] isaranto: now that I think about it, we should add a task to the list about cleaning up repos/puppet/etc..
[14:44:20] +1
[14:57:49] Machine-Learning-Team, CirrusSearch, Discovery-Search: Add outlink topic model predictions to CirrusSearch indices - https://phabricator.wikimedia.org/T328276 (dcausse)
[14:58:51] Machine-Learning-Team, CirrusSearch, Discovery-Search: Add outlink topic model predictions to CirrusSearch indices - https://phabricator.wikimedia.org/T328276 (dcausse) Thanks for all the input! I've updated the task description accordingly.
[15:57:47] elukey: isaranto: what about the stat servers? Might ores::base still be used there, do you know?
[16:02:04] I have no idea...perhaps someone more familiar with ores can chip in
[16:23:02] btullis: on stat100x maybe, but probably in the sporadic occasion of retraining an ores model (something that we don't really plan to)
[16:23:09] so I'd say go ahead and clean it up too :)
[16:39:16] elukey: done. rc1.mediawiki.page_change events are in kafka main now.
[16:39:35] ottomata: you rock thanks
[16:39:38] I'll try to use it asap
[16:48:42] thanks ottomata 🤘
[17:29:08] ottomata: one qs - in the revision-score streams case, should I set changeprop to filter events that don't have the "page_change_kind":"edit" flag, or is it not necessary?
[17:45:07] going afk for the evening folks, have a nice rest of the day
[18:34:11] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (JMeybohm)
[19:58:33] elukey: you probably also want page_change_kind: create too. revision-create had page-creates in them as well.
[19:58:59] but elukey let's brainstorm this :)
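
Editor's note: the filter discussed at 17:29 and 19:58 boils down to a predicate on the `page_change_kind` field. The sketch below is not changeprop configuration and not the team's code. It only illustrates the rule in Python, assuming events arrive as one JSON object per line. The names `WANTED_KINDS`, `wants_score` and `filter_events` are hypothetical, and the choice of keeping both `edit` and `create` follows the suggestion in the chat, which was still open for discussion.

```python
import json
from typing import Iterable, Iterator

# Assumption taken from the chat: score edits, and probably page creations too.
WANTED_KINDS = frozenset({"edit", "create"})


def wants_score(event: dict) -> bool:
    """Return True if the event's page_change_kind is one we want to score."""
    return event.get("page_change_kind") in WANTED_KINDS


def filter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON-encoded events (one per line) and yield only the wanted ones."""
    for line in lines:
        event = json.loads(line)
        if wants_score(event):
            yield event


if __name__ == "__main__":
    # Illustrative input only; real page_change events carry many more fields.
    sample = ['{"page_change_kind": "edit"}', '{"page_change_kind": "create"}', "{}"]
    for kept in filter_events(sample):
        print(kept["page_change_kind"])
```

Events without the field are dropped rather than raising, since `dict.get` returns `None` for a missing key.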