[07:16:00] hello folks
[07:16:17] I am trying with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764799 to limit the number of ksvc (knative revisions) that we keep around after each deployment
[07:16:42] the main issue is that, if we keep them around, service ips and other resources will be held
[07:34:40] yeah confirmed
[07:34:59] when we change any isvc (InferenceService) a new knative revision is created
[07:35:31] and every revision gets an IP from the svc pool
[07:35:46] that is a /24
[07:36:35] the pods ip pool is a /23
[07:37:15] 254 svc ips is not enough for our use case, we'd need probably a /22
[07:40:32] opening a task
[07:40:42] definitely a blocker for lift wing
[08:08:43] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (elukey)
[08:13:47] created --^
[08:23:54] tried to find the /22 that included our current /24, but afaics it would also overlap with other k8s && non-ml subnets
[08:24:11] the easy choice would be to pick a new /22 from the available pool
[08:24:16] https://netbox.wikimedia.org/ipam/prefixes/376/prefixes/
[08:24:30] but then we'd need to wipe the clusters (in theory)
[08:24:42] or at least, I don't find another solution on a Monday morning :D
[08:39:42] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (elukey) On ml-serve-eqiad (half way through loading ORES pods): ` root@deploy1002:~# kubectl get svc -A |grep 10. | wc -l 200 `
[08:40:00] on ml-serve-eqiad we are at 200 svc IPs used
[08:40:01] sigh
[08:48:39] Lift-Wing, Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (elukey) a: elukey→None
[08:49:03] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (elukey) a: elukey
[08:55:28] going afk for some errands, bbl!
[10:19:23] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) Thanks a lot @Halfak for the tests!...
[10:27:28] Good morning! :D
[10:28:13] good morning!
[11:28:39] * elukey lunch!
[12:00:52] elukey: could we auto-expire the revisions more aggressively than we do today? At least until we have figured out a new /22 and how to get there.
[12:36:03] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (klausman) Is there currently any kind of auto-expire/auto-clean of old revisions? If not, does kserve have such functionality built-in somewhere? That mi...
[12:37:20] https://knative.dev/development/serving/configuration/revision-gc/#cluster-wide-configuration might be useful
[12:38:13] The "complex example" with numbers tweaked looks very promising
[12:49:31] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (klausman) This might be useful: https://knative.dev/development/serving/configuration/revision-gc/#cluster-wide-configuration The "complex example" wi...
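(For context: the cluster-wide revision GC knobs referenced above live in Knative Serving's config-gc ConfigMap. A minimal sketch of that "complex example" style of configuration follows; the namespace/ConfigMap name follow upstream defaults, and the retention values are illustrative assumptions, not necessarily what the Gerrit change above merged or how it is wired through deployment-charts.)

```yaml
# Sketch of Knative Serving cluster-wide revision garbage collection.
# Values below are illustrative placeholders, not the merged WMF settings.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-gc
  namespace: knative-serving
data:
  retain-since-create-time: "48h"        # never collect revisions younger than 48h
  retain-since-last-active-time: "15h"   # keep revisions that were active in the last 15h
  min-non-active-revisions: "2"          # always keep at least 2 non-active revisions
  max-non-active-revisions: "5"          # collect anything beyond 5 non-active revisions
```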
[12:50:59] klausman: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764799 :)
[12:51:04] it was merged this morning
[12:51:39] :D
[12:52:13] Now I feel bad for having said anything -.-
[12:52:17] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (elukey) Merged this morning https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764799 If this works we should keep only 2 revisions for eac...
[12:52:57] klausman: the main issue is that we have ~200 svc IPs already used :(
[12:53:13] Yeah, as I mentioned, we need a bigger pool soon, either way
[12:53:38] the auto-cleanup is good anyway since even with a bigger pool we'll want to avoid wasting revisions
[12:53:47] but I didn't get at the time that it would be so bad
[12:53:50] And as you mentioned, the fragmentation is annoying. We might as well go ahead and do a wipe while we still can do so without disrupting anything in "prod"
[12:54:17] yeah definitely
[12:54:29] very annoying
[12:54:45] It's a bit weird that you can't have multiple (fragmented) ranges in a cluster, but then I guess routing might become a PITA
[12:54:57] Plus, cleanliness etc etc
[12:56:58] this is the part that I am not super sure about
[12:57:11] calico uses https://projectcalico.docs.tigera.io/reference/resources/ippool
[12:57:38] we can define multiple ones, we have now "ipv4" and "ipv6" defined
[12:58:06] I haven't dived deep into how that resource works, maybe having multiple ippools is fine for calico
[12:58:21] but the "ipvx" naming felt like we needed one for each kind
[13:00:02] I'll take a look later today (planning to do the etcd VMs today first)
[13:11:52] ah also I just realized that the IPPools mentioned in deployment-charts are /23s, so those should be the pod subnets
[13:17:35] ahhh
[13:17:36] profile::kubernetes::master::service_cluster_ip_range: 10.64.77.0/24
[13:17:43] okok so it is defined in puppet
[13:17:59] So are we currently "artificially constrained"?
[13:18:19] what do you mean?
[13:18:33] So that's a /24
[13:18:50] But we have a /22 assigned to us, right?
[13:18:50] it is what we have allocated in IPAM
[13:18:55] https://netbox.wikimedia.org/ipam/prefixes/376/prefixes/
[13:19:00] nope
[13:19:22] Oh, so we only have one /24 each for eqiad and codfw?
[13:19:33] yes correct
[13:19:41] same thing for the other clusters
[13:19:51] Hmm.
[13:19:54] I think we are the first use case that needs more than a few svc ips
[13:20:10] I suspect service IPs are more easily managed if fragmented than pod IPs would be?
[13:20:26] After all, they need to freely move around anyway
[13:23:17] so in puppet the svc subnet range ends up in the kube-api config
[13:23:18] https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
[13:23:30] the parameter is --service-cluster-ip-range
[13:25:13] Now I wonder if that flag can be specified multiple times
[13:25:35] > Max of two dual-stack CIDRs is allowed.
[13:25:37] Hrm.
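(To make the two pools discussed above concrete: pod IPs come from Calico IPPool resources defined in deployment-charts, while service IPs are handed out by kube-apiserver from the --service-cluster-ip-range set in puppet. A rough sketch follows; the pod CIDR and the IPPool spec fields are placeholders/assumptions, only the /24 is the value quoted from puppet above.)

```yaml
# Calico IPPool for pod IPs. The CIDR is a placeholder, not the real
# ml-serve allocation; encapsulation/NAT settings are cluster-dependent.
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: ipv4
spec:
  cidr: 10.67.24.0/23     # pod subnet (placeholder)
  natOutgoing: false
  nodeSelector: all()

# Service IPs are not an IPPool: kube-apiserver allocates them from
#   --service-cluster-ip-range=10.64.77.0/24
# which accepts at most one IPv4 CIDR (plus one IPv6 CIDR for dual-stack),
# hence the need for a larger contiguous range rather than several /24s.
```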
[13:27:41] https://github.com/kubernetes/kubernetes/issues/104088
[13:28:54] so my impression is that it may support only one ipv4 range
[13:29:03] I'll ask Alex in the task
[13:29:07] yeah, at least at the moment
[13:32:17] added a comment
[13:32:32] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (elukey) @akosiaris we'd need an expert opinion on this :) Afaics from puppet, we configure the svc ip range in kube-api's defaults, and from https://git...
[13:42:50] ok let's see what Alex thinks about it, the worst that happens is that we'll wipe what we have
[13:47:32] Ack
[14:29:24] elukey: FYI https://wikitech.wikimedia.org/w/index.php?title=SRE%2FInfrastructure_naming_conventions&type=revision&diff=1952879&oldid=1948572
[14:31:00] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (Ottomata) > The thing that I am not clear on is why we need to use Knative Eventing when we already have Changeprop change-prop is a home grown event notification re...
[14:31:41] ack
[14:34:48] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (elukey) Yep yep got it. As a first iteration, we'll likely add some code to lift wing to send a score to eventgate (following the related schema etc..) when request...
[14:36:39] Lift-Wing, Machine-Learning-Team: Support (or not) the ORES augmented feature output in liftwing - https://phabricator.wikimedia.org/T301766 (elukey) a: achou
[14:37:16] Lift-Wing, Machine-Learning-Team (Active Tasks): Support (or not) the ORES augmented feature output in liftwing - https://phabricator.wikimedia.org/T301766 (elukey)
[14:37:29] Lift-Wing, Machine-Learning-Team (Active Tasks): Implement an online feature store - https://phabricator.wikimedia.org/T294434 (elukey)
[14:42:57] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (Ottomata) I see! So you'd kinda be using ChangeProp as the Knative eventing stand-in for now :) BTW, you may prefer to produce to Kafka directly, rather than t...
[14:43:09] elukey: we talked the other day about naming and I just realized it's a bit inconsistent
[14:43:32] So we'd have ml-etcd-staging200x and ml-staging-ctrl200x
[14:43:56] wouldn't ml-ctrl-staging200x be better?
[14:46:30] klausman: I think that the idea was to have something like ml-{serve,staging}-ctrl
[14:46:36] but we can name it as we want
[14:46:49] Well, I almost created one of the etcd VMs already
[14:46:52] But it failed :-/
[14:47:12] But I guess at this point we could nix it anyway
[14:47:28] so they'd be ml-staging-etcd?
[14:48:19] we have ml-etcd atm, anything is fine
[14:48:28] (for "prod" I mean)
[14:48:34] Yeah
[14:48:39] How does one delete a VM?
[14:49:43] where did the cookbook fail?
[14:49:59] Step 12, authdns
[14:50:14] sure but with what error?
[14:50:35] https://phabricator.wikimedia.org/P21601
[14:50:52] I don't really see any specific error
[14:51:40] Oh hang on, I may have missed some messages in the previous steps
[14:51:49] fatal: unable to access 'https://netbox-exports.wikimedia.org/dns.git/': Operation timed out after 300033 milliseconds with 0 out of 0 bytes received
[14:52:08] Twice, for dns6001 and dns6002
[14:52:11] Riccardo mentioned this morning on #sre IIRC that there is some problem with netbox in the new Marseille DC
[14:52:31] Mh.
[14:52:42] so we may need to wait to create VMs, if they fail in this way
[14:54:09] If we want to use different VM names anyway, should I just decom the half-made one?
[14:54:39] so "delete a VM" in https://wikitech.wikimedia.org/wiki/Ganeti#Delete_a_VM still lists the ganeti command, but if netbox was involved maybe the decom cookbook could be best
[14:54:59] I'll see what colors it explodes with
[14:55:10] wait a sec, did you find anything on netbox about it?
[14:55:43] Yes, two IPs (v6 and v4)
[14:56:04] ah ok if they are there we need to either clean up manually or run the decom cookbook
[14:56:08] the best is to ask Riccardo
[14:56:13] just to be sure
[14:57:55] Morning all!
[14:58:08] morning :)
[14:58:21] 'ello
[15:00:05] (bbiab)
[15:02:22] good morning!!
[15:22:09] (also getting some groceries before peak times, ttl)
[15:49:52] back
[16:19:21] o/
[16:19:37] morning :)
[16:20:22] Heya andy
[16:20:39] accraze: wouldya know it: I got new vinyl on Saturday :D
[16:21:39] https://wikitech.wikimedia.org/wiki/ORES/Deployment#Running_tests was done by Aiko and it is really nice
[16:22:00] the tests are for all model/wiki combinations
[16:23:26] Very nice
[16:24:07] Preventing inadvertent changes is always nice to have
[16:29:17] klausman: nice way to spend a saturday :D
[16:29:26] aiko: nice work on updating the ores docs!
[16:30:31] Yes, spent the afternoon in a 2nd hand shop. Got a nice record (still sealed!) and a handful of very nice RCA cables. The latter were three stereo pairs, at 2 bucks each. Nice metal plugs and thick but pliant insulation
[16:35:42] Glad to send a patch and update the doc :)
[16:52:42] accraze: I think that we are ready to attempt a prod deployment, lemme know if you have ideas/concerns about it
[17:03:08] elukey: i just looked at the task, i think it's probably good to go!
[17:05:47] \o/
[18:31:20] accraze: not sure if you saw my last comment in https://phabricator.wikimedia.org/T302232, but I am wondering if using Cassandra instead of Redis for ml-cache could be good
[18:31:43] I'll talk with Joseph about it tomorrow, but I wanted your opinion as well
[18:32:20] Cassandra is more difficult to manage from the data engineering point of view (the keyspaces need to be designed in a good way, optimized for the performance needed)
[18:32:30] but it would solve a lot of current headaches
[18:32:35] like replication, routing, etc..
[18:32:48] and IIUC Feast supports Cassandra
[18:33:08] we could have a 3-node cluster in eqiad (ml-cache100[1-3]) and an identical one in codfw
[18:33:20] elukey: that does sound interesting. fwiw i have no experience w/ cassandra but have heard good things
[18:33:53] accraze: it is nice since you can replicate data across DCs, but not all
[18:34:10] so we could have one keyspace for feast that is updated in eqiad and replicated in codfw
[18:34:20] and one for the score cache, separated for each cluster
[18:34:31] also we haven't done Redis sentinel here at wmf before so it could wind up being equally difficult to manage
[18:34:57] the tricky bit is designing the keyspace (primary keys etc..) in a way that the lookups that we need are performant enough
[18:35:07] exactly yes
[18:35:18] in Cassandra you can target any node, and routing will be handled
[18:35:31] also with 3 nodes data will likely be replicated on all
[18:35:58] (we could also have an LVS VIP in front to spread the load and ease maintenance)
[18:36:11] anyway, food for thought, let's discuss it on Wed :)
[18:36:33] definitely! i think this seems promising
[18:36:38] * elukey afk for dinner o/
[18:39:37] see ya elukey
[20:24:04] Lift-Wing, Machine-Learning-Team (Active Tasks): Return meaningful HTTP responses in Lift Wing's revscoring backends - https://phabricator.wikimedia.org/T300270 (ACraze) To keep archives happy: We had discussed potentially creating a base 'revscoring' isvc class that would include error handling. After l...