[07:16:00] hello folks
[07:16:17] I am trying with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764799 to limit the number of ksvc (knative revisions) that we keep around after each deployment
[07:16:42] the main issue is that, if we keep them around, service ips and other resources will be held
[07:34:40] yeah confirmed
[07:34:59] when we change any isvc (InferenceService) a new knative revision is created
[07:35:31] and every revision gets an IP from the svc pool
[07:35:46] that is a /24
[07:36:35] the pods ip pool is a /23
[07:37:15] 254 svc ips is not enough for our use case, we'd need probably a /22
[07:40:32] opening a task
[07:40:42] definitely a blocker for lift wing
[08:08:43] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (elukey)
[08:13:47] created --^
[08:23:54] tried to find the /22 that included our current /24, but afaics it would also overlap with other k8s && non-ml subnets
[08:24:11] the easy choice would be to pick a new /22 from the available pool
[08:24:16] https://netbox.wikimedia.org/ipam/prefixes/376/prefixes/
[08:24:30] but then we'd need to wipe the clusters (in theory)
[08:24:42] or at least, I don't find another solution on a Monday morning :D
[08:39:42] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (elukey) On ml-serve-eqiad (half way through loading ORES pods): ` root@deploy1002:~# kubectl get svc -A |grep 10. | wc -l 200 `
[08:40:00] on ml-serve-eqiad we are at 200 svc IPs used
[08:40:01] sigh
[08:48:39] Lift-Wing, Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (elukey) a: elukey→None
[08:49:03] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (elukey) a: elukey
[08:55:28] going afk for some errands, bbl!
[10:19:23] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) Thanks a lot @Halfak for the tests!...
[10:27:28] Good morning! :D
[10:28:13] good morning!
[11:28:39] * elukey lunch!
[12:00:52] elukey: could we auto-expire the revisions more aggressively than we do today? At least until we have figured out a new /22 and how to get there.
[12:36:03] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (klausman) Is there currently any kind of auto-expire/auto-clean of old revisions? If not, does kserve have such functionality built-in somewhere? That mi...
[12:37:20] https://knative.dev/development/serving/configuration/revision-gc/#cluster-wide-configuration might be useful
[12:38:13] The "complex example" with numbers tweaked looks very promising
[12:49:31] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (klausman) This might be useful: https://knative.dev/development/serving/configuration/revision-gc/#cluster-wide-configuration The "complex example" wi...
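(For context: the cluster-wide revision GC knobs referenced above live in Knative Serving's config-gc ConfigMap. A minimal sketch of that "complex example" style of configuration follows; the namespace/ConfigMap name follow upstream defaults, and the retention values are illustrative assumptions, not necessarily what the Gerrit change above merged or how it is wired through deployment-charts.)

```yaml
# Sketch of Knative Serving cluster-wide revision garbage collection.
# Values below are illustrative placeholders, not the merged WMF settings.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-gc
  namespace: knative-serving
data:
  retain-since-create-time: "48h"        # never collect revisions younger than 48h
  retain-since-last-active-time: "15h"   # keep revisions that were active in the last 15h
  min-non-active-revisions: "2"          # always keep at least 2 non-active revisions
  max-non-active-revisions: "5"          # collect anything beyond 5 non-active revisions
```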
[12:50:59] klausman: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764799 :)
[12:51:04] it was merged this morning
[12:51:39] :D
[12:52:13] Now I feel bad for having said anything -.-
[12:52:17] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (elukey) Merged this morning https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764799 If this works we should keep only 2 revisions for eac...
[12:52:57] klausman: the main issue is that we have ~200 svc IPs already used :(
[12:53:13] Yeah, as I mentioned, we need a bigger pool soon, either way
[12:53:38] the auto-cleanup is good anyway since even with a bigger pool we'll want to avoid wasting revisions
[12:53:47] but I didn't get at the time that it would be so bad
[12:53:50] And as you mentioned, the fragmentation is annoying. We might as well go ahead and do a wipe while we still can do so without disrupting anything in "prod"
[12:54:17] yeah definitely
[12:54:29] very annoying
[12:54:45] It's a bit weird that you can't have multiple (fragmented) ranges in a cluster, but then I guess routing might become a PITA
[12:54:57] Plus, cleanliness etc etc
[12:56:58] this is the part that I am not super sure about
[12:57:11] calico uses https://projectcalico.docs.tigera.io/reference/resources/ippool
[12:57:38] we can define multiple ones, we have now "ipv4" and "ipv6" defined
[12:58:06] I haven't dived deep into how that resource works, maybe having multiple ippools is fine for calico
[12:58:21] but the "ipvx" naming felt like we needed one for each kind
[13:00:02] I'll take a look later today (planning to do the etcd VMs today first)
[13:11:52] ah also I just realized that the IPPools mentioned in deployment-charts are /23s, so those should be the pod subnets
[13:17:35] ahhh
[13:17:36] profile::kubernetes::master::service_cluster_ip_range: 10.64.77.0/24
[13:17:43] okok so it is defined in puppet
[13:17:59] So are we currently "artificially constrained"?
[13:18:19] what do you mean?
[13:18:33] So that's a /24
[13:18:50] But we have a /22 assigned to us, right?
[13:18:50] it is what we have allocated in IPAM
[13:18:55] https://netbox.wikimedia.org/ipam/prefixes/376/prefixes/
[13:19:00] nope
[13:19:22] Oh, so we only have one /24 each for eqiad and codfw?
[13:19:33] yes correct
[13:19:41] same thing for the other clusters
[13:19:51] Hmm.
[13:19:54] I think we are the first use case that needs more than a few svc ips
[13:20:10] I suspect service IPs are more easily managed if fragmented than pod IPs would be?
[13:20:26] After all, they need to freely move around anyway
[13:23:17] so in puppet the svc subnet range ends up in the kube-api config
[13:23:18] https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
[13:23:30] the parameter is --service-cluster-ip-range
[13:25:13] Now I wonder if that flag can be specified multiple times
[13:25:35] > Max of two dual-stack CIDRs is allowed.
[13:25:37] Hrm.
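(To make the two pools discussed above concrete: pod IPs come from Calico IPPool resources defined in deployment-charts, while service IPs are handed out by kube-apiserver from the --service-cluster-ip-range set in puppet. A rough sketch follows; the pod CIDR and the IPPool spec fields are placeholders/assumptions, only the /24 is the value quoted from puppet above.)

```yaml
# Calico IPPool for pod IPs. The CIDR is a placeholder, not the real
# ml-serve allocation; encapsulation/NAT settings are cluster-dependent.
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: ipv4
spec:
  cidr: 10.67.24.0/23     # pod subnet (placeholder)
  natOutgoing: false
  nodeSelector: all()

# Service IPs are not an IPPool: kube-apiserver allocates them from
#   --service-cluster-ip-range=10.64.77.0/24
# which accepts at most one IPv4 CIDR (plus one IPv6 CIDR for dual-stack),
# hence the need for a larger contiguous range rather than several /24s.
```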
[13:27:41] https://github.com/kubernetes/kubernetes/issues/104088
[13:28:54] so my impression is that it may support only one ipv4 range
[13:29:03] I'll ask Alex in the task
[13:29:07] yeah, at least at the moment
[13:32:17] added a comment
[13:32:32] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (elukey) @akosiaris we'd need an expert opinion on this :) Afaics from puppet, we configure the svc ip range in kube-api's defaults, and from https://git...
[13:42:50] ok let's see what Alex thinks about it, the worst that happens is that we'll wipe what we have
[13:47:32] Ack
[14:29:24] elukey: FYI https://wikitech.wikimedia.org/w/index.php?title=SRE%2FInfrastructure_naming_conventions&type=revision&diff=1952879&oldid=1948572
[14:31:00] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (Ottomata) > The thing that I am not clear on is why we need to use Knative Eventing when we already have Changeprop change-prop is a home grown event notification re...
[14:31:41] ack
[14:34:48] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (elukey) Yep yep got it. As a first iteration, we'll likely add some code to lift wing to send a score to eventgate (following the related schema etc..) when request...
[14:36:39] Lift-Wing, Machine-Learning-Team: Support (or not) the ORES augmented feature output in liftwing - https://phabricator.wikimedia.org/T301766 (elukey) a: achou
[14:37:16] Lift-Wing, Machine-Learning-Team (Active Tasks): Support (or not) the ORES augmented feature output in liftwing - https://phabricator.wikimedia.org/T301766 (elukey)
[14:37:29] Lift-Wing, Machine-Learning-Team (Active Tasks): Implement an online feature store - https://phabricator.wikimedia.org/T294434 (elukey)
[14:42:57] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (Ottomata) I see! So you'd kinda be using ChangeProp as the Knative eventing stand-in for now :) BTW, you may prefer to produce to Kafka directly, rather than t...
[14:43:09] elukey: we talked the other day about naming and I just realized it's a bit inconsistent
[14:43:32] So we'd have ml-etcd-staging200x and ml-staging-ctrl200x
[14:43:56] wouldn't ml-ctrl-staging200x be better?
[14:46:30] klausman: I think that the idea was to have something like ml-{serve,staging}-ctrl
[14:46:36] but we can name it as we want
[14:46:49] Well, I almost created one of the etcd VMs already
[14:46:52] But it failed :-/
[14:47:12] But I guess at this point we could nix it anyway
[14:47:28] so they'd be ml-staging-etcd?
[14:48:19] we have ml-etcd atm, anything is fine
[14:48:28] (for "prod" I mean)
[14:48:34] Yeah
[14:48:39] How does one delete a VM?
[14:49:43] where did the cookbook fail?
[14:49:59] Step 12, authdns
[14:50:14] sure but with what error?
[14:50:35] https://phabricator.wikimedia.org/P21601
[14:50:52] I don't really see any specific error
[14:51:40] Oh hang on, I may have missed some messages in the previous steps
[14:51:49] fatal: unable to access 'https://netbox-exports.wikimedia.org/dns.git/': Operation timed out after 300033 milliseconds with 0 out of 0 bytes received
[14:52:08] Twice, for dns6001 and dns6002
[14:52:11] Riccardo mentioned this morning on #sre IIRC that there is some problem with netbox in the new Marseille DC
[14:52:31] Mh.
[14:52:42] so we may need to wait to create VMs, if they fail in this way
[14:54:09] If we want to use different VM names anyway, should I just decom the half-made one?
[14:54:39] so "delete a VM" in https://wikitech.wikimedia.org/wiki/Ganeti#Delete_a_VM still lists the ganeti command, but if netbox was involved maybe the decom cookbook could be best
[14:54:59] I'll see what colors it explodes with
[14:55:10] wait a sec, did you find anything on netbox about it?
[14:55:43] Yes, two IPs (v6 and v4)
[14:56:04] ah ok if they are there we need to either clean up manually or run the decom cookbook
[14:56:08] the best is to ask Riccardo
[14:56:13] just to be sure
[14:57:55] Morning all!
[14:58:08] morning :)
[14:58:21] 'ello
[15:00:05] (bbiab)
[15:02:22] good morning!!
[15:22:09] (also getting some groceries before peak times, ttl)
[15:49:52] back
[16:19:21] o/
[16:19:37] morning :)
[16:20:22] Heya andy
[16:20:39] accraze: wouldya know it: I got new vinyl on Saturday :D
[16:21:39] https://wikitech.wikimedia.org/wiki/ORES/Deployment#Running_tests was done by Aiko and it is really nice
[16:22:00] the tests are for all model/wiki combinations
[16:23:26] Very nice
[16:24:07] Preventing inadvertent changes is always nice to have
[16:29:17] klausman: nice way to spend a saturday :D
[16:29:26] aiko: nice work on updating the ores docs!
[16:30:31] Yes, spent the afternoon in a 2nd hand shop. Got a nice record (still sealed!) and a handful of very nice RCA cables. The latter were three stereo pairs, at 2 bucks each. Nice metal plugs and thick but pliant insulation
[16:35:42] Glad to send a patch and update the doc :)
[16:52:42] accraze: I think that we are ready to attempt a prod deployment, lemme know if you have ideas/concerns about it
[17:03:08] elukey: i just looked at the task, i think it's probably good to go!
[17:05:47] \o/
[18:31:20] accraze: not sure if you saw my last comment in https://phabricator.wikimedia.org/T302232, but I am wondering if using Cassandra instead of Redis for ml-cache could be good
[18:31:43] I'll talk with Joseph about it tomorrow, but I wanted your opinion as well
[18:32:20] Cassandra is more difficult to manage from the data engineering point of view (the keyspaces need to be designed in a good way, optimized for the performance needed)
[18:32:30] but it would solve a lot of current headaches
[18:32:35] like replication, routing, etc..
[18:32:48] and IIUC Feast supports Cassandra
[18:33:08] we could have a 3-node cluster in eqiad (ml-cache100[1-3]) and an identical one in codfw
[18:33:20] elukey: that does sound interesting. fwiw i have no experience w/ cassandra but have heard good things
[18:33:53] accraze: it is nice since you can replicate data across DCs, but not all
[18:34:10] so we could have one keyspace for feast that is updated in eqiad and replicated in codfw
[18:34:20] and one for the score cache, separated for each cluster
[18:34:31] also we haven't done Redis sentinel here at wmf before so it could wind up being equally difficult to manage
[18:34:57] the tricky bit is designing the keyspace (primary keys etc..) in a way that the lookups that we need are performant enough
[18:35:07] exactly yes
[18:35:18] in Cassandra you can target any node, and routing will be handled
[18:35:31] also with 3 nodes data will likely be replicated on all
[18:35:58] (we could also have an LVS VIP in front to spread the load and ease maintenance)
[18:36:11] anyway, food for thought, let's discuss it on Wed :)
[18:36:33] definitely! i think this seems promising
[18:36:38] * elukey afk for dinner o/
[18:39:37] see ya elukey
[20:24:04] Lift-Wing, Machine-Learning-Team (Active Tasks): Return meaningful HTTP responses in Lift Wing's revscoring backends - https://phabricator.wikimedia.org/T300270 (ACraze) To keep archives happy: We had discussed potentially creating a base 'revscoring' isvc class that would include error handling. After l...