[05:58:25] good morning!
[05:58:26] https://knative.dev/blog/articles/announcing-knative-1.0/
[05:58:33] this is really interesting :)
[05:58:41] (current upstream is 0.26)
[06:42:56] Woah ... 😲
[06:43:57] They are moving so fast!
[06:47:00] I hope that this will introduce more stability, even for kserve
[07:50:55] SRE work for revscoring-articlequality done, let's wait for Andy to see if we can get another deployment to ml-serve :)
[09:09:57] started the first dashboard for knative, metrics are flowing! https://grafana-rw.wikimedia.org/d/c6GYmqdnz/knative-serving
[09:10:11] (a lot more metrics to add, just started with a few as poc)
[09:14:31] * elukey afk for a bit
[10:35:42] * elukey lunch
[13:37:25] added more graphs to https://grafana-rw.wikimedia.org/d/c6GYmqdnz/knative-serving, it is clear that we'll need both istio and knative metrics to understand how traffic
[13:37:56] knative splits requests also by revisions, so say if we introduce a new model getting 10% of the traffic we'll see it
[13:38:32] (in theory we could even with istio checking the pod's name etc.., but knative seems a little easier to follow)
[13:39:21] another set of metrics that we may want are the ones from the queue-proxy, but those are part of the sidecar that every inference service pod will have
[13:39:34] and it is a little more difficult right now to add prometheus annotations in the helm chart
[13:40:05] (the knative activator buffers requests for a certain inference pod until the queue-proxy returns the green light)
[13:40:21] (and the green light is usually a health check)
[13:45:01] need to run a little errand, bbiab!
[15:51:10] o/
[15:58:30] wow the knative metrics are looking great so far!
[16:02:15] accraze: o/
[16:07:09] we can deploy articlequality anytime
[16:08:28] cool just pushing up a fix for the helmfile name
[16:10:17] ok should be good to go
[16:11:10] Did you mean: hellfile.yaml?
[16:11:57] LOL
[16:12:07] :D
[16:15:03] https://www.featurestore.org/feature-store-summit-videos - interesting
[16:15:52] niiiice - was hoping some videos would show up from that
[16:16:47] accraze: you can deploy anytime!
[16:17:47] ok cool, going to try now
[16:21:31] oof had to switch back to gnome, started trying out xmonad on friday and my ssh config is all messed up. ok trying a deploy now for reals
[16:27:24] lovely, pods ended up in error state
[16:29:11] ahhh shoot, yeah just tried to curl and got 404
[16:29:15] the storage-initializer container says
[16:29:16] [I 211025 16:25:49 storage:85] Successfully copied s3://wmf-ml-models/articlequality/enwiki/20211022183902/model.bin to /mnt/models
[16:29:30] (checking with kubectl logs enwiki-articlequality-predictor-default-cc2dw-deployment-65cgnw -n revscoring-articlequality storage-initializer from ml-serve-ctrl1001)
[16:29:48] (to see all pods, kubectl get pods -n revscoring-articlequality)
[16:30:13] the kserve-container
[16:30:14] FileNotFoundError: [Errno 2] No such file or directory: '/mnt/models/model.bin'
[16:30:32] whaaa
[16:31:36] the last time it was due to the incorrect model uri, but this one seems good
[16:33:26] is kubectl installed on ml-serve? i keep getting command not found
[16:33:42] accraze: only on ml-serve-ctrl1001
[16:34:03] you have it also on deploy1002 but with less powers
[16:34:33] ah ok cool
[16:35:57] could this be an annotations issue?
[16:37:05] annotations as in thanos-swift config?
[16:37:24] yeah
[16:37:42] in theory no, the storage-initializer doesn't report any problem in fetching the model
[16:38:16] the last time with Kevin though it reported a successful copy but in reality the model URI was wrong
[16:38:36] hmm lemme double check that's the right uri (it should be)
[16:38:49] but if there is a problem in connecting to thanos we'd get a loong stacktrace
[16:40:35] verified that the URI is correct
[16:40:51] trying to delete the pod, let's see
[16:41:48] ok it is not a race condition, same error again, so it must be something in the config
[16:53:13] my suspicion is that the storage-initializer is not pulling down the model.bin in the right way, and it reports success anyway
[16:53:35] I checked the annotations etc.., all good
[16:56:08] i thought maybe model.bin was corrupted or something but it's the correct file
[17:33:28] accraze: do you have a min to get back in the meeting?
[17:34:06] elukey: sure!
[17:53:25] accraze: I tried to make a connection from one of the healthy pods to thanos-swift.discovery.wmnet:443 and it works
[17:53:32] so networking rules seem to work
[17:53:35] weird
[17:53:41] I'll dig into this tomorrow morning
[17:53:48] have a good rest of the day folks :)
[17:57:11] okay, request to delay procurement made
[20:24:04] ohh interesting, it looks like we have an older version of the articlequality deployed on the KFv1.3 sandbox cluster (`2021-09-07-234722-production`)
[20:25:02] going to do a test to see if i can recreate the error using the `2021-09-08...` image instead
[20:32:01] aha! yup enwiki-articlequality inference service is crashlooping now on the sandbox using the most recent image
[20:38:03] hmmmm... not entirely sure if anything changed between those two images though
[20:38:32] also noticing i'm running an older version of storage-initializer (v0.5.1) which is using minio and not boto
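
(editor's aside) The symptom above is a download step that logs success while the file the server expects is not on disk. A minimal sketch of an explicit post-download check follows, assuming the model is expected at /mnt/models/model.bin as in the traceback. `verify_model_file` is a hypothetical helper written for illustration; it is not part of kserve or the storage-initializer, and it says nothing about the actual root cause here.

```python
import os
import tempfile


def verify_model_file(model_dir, filename="model.bin"):
    """Return the model path if it is a non-empty regular file, else raise.

    Hypothetical helper: the directory and filename mirror the paths seen
    in the log above, but nothing here comes from kserve itself.
    """
    path = os.path.join(model_dir, filename)
    if not os.path.isfile(path):
        # List what is actually there, which is the useful debugging hint
        # when the copy step claims success.
        found = sorted(os.listdir(model_dir)) if os.path.isdir(model_dir) else []
        raise FileNotFoundError(f"{path} is missing; directory contains: {found}")
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path} exists but is empty")
    return path


if __name__ == "__main__":
    # Demo against a temporary directory standing in for /mnt/models.
    with tempfile.TemporaryDirectory() as tmp:
        try:
            verify_model_file(tmp)
        except FileNotFoundError as err:
            print("check failed:", err)
        with open(os.path.join(tmp, "model.bin"), "wb") as fh:
            fh.write(b"\x00")
        print("check passed:", verify_model_file(tmp))
```

Run right after the copy, a check like this would turn a misleading "Successfully copied" into an early failure that also prints what ended up in the directory.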