[09:44:12] Lift-Wing, Machine-Learning-Team (Active Tasks): API Gateway Integration - https://phabricator.wikimedia.org/T288789 (elukey)
[09:44:14] Machine-Learning-Team, Platform Team Initiatives (API Gateway): Proposal: add a per-service rate limit setting to API Gateway - https://phabricator.wikimedia.org/T295956 (elukey)
[10:01:16] Lift-Wing, Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (elukey)
[10:17:52] Lift-Wing, Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (elukey) I am doing a very simple and base load testing from deploy1002: ` elukey@deploy1002:~$ cat input.json {"rev_id":123456} elukey@deploy1002:~$ siege -c 50 "https://inferen...
[10:36:18] Lift-Wing, Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (elukey) Tornado on each KServer pod (in Kserve 0.7) has two parameters to tune: 1) workers 2) async_io_workers https://www.tornadoweb.org/en/stable/guide/running.html#processes-...
[11:34:52] * elukey lunch!
[14:38:15] Lift-Wing, Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (elukey) There is definitely something weird happening for our pods: ` elukey@ml-serve1002:~$ ps -eLf | grep python | awk '{ print $2" "$10" "$11}' | uniq -c 2 1295 python3...
[14:44:42] folks I think that we should migrate to KServer from kserve==0.7, I suspect that https://github.com/kserve/kserve/commit/c10e6271897d7fd058f5618d5e0e70b31496f64c is very important
[15:21:11] Agreed.
[16:05:59] o/
[16:06:23] elukey: nice catch! i agree we need to upgrade to kserve v0.7.0 asap
[16:06:53] accraze: o/ lemme know if I can help!
[16:07:56] for sure, i'm still finishing up the new ml-sandbox, got much closer using the helm template approach you outlined last week, but running into issues with the cert
[16:08:57] super
[16:08:59] got an error about no PEM found on the kserve controller manager, which is due to probably not having the certs package installed
[16:09:27] ah no wait there is a catch
[16:10:31] if you check `kubectl get secrets -n kserve` on ml-serve-ctrl1001 you'll see kserve-webhook-server-cert
[16:10:56] we store a TLS certificate in there for the webhook, since the kubernetes api needs to call an https endpoint (mandatory)
[16:11:12] we have the TLS cert deployed as private helmfile on deploy1002 via puppet private
[16:12:11] in your case, the certificate can be generated via something like https://github.com/kserve/kserve/blob/master/hack/self-signed-ca.sh
[16:12:26] to mimic production we could generate the cert and upload it as secret
[16:12:46] aha! okok this is making sense with what i was seeing on friday
[16:12:50] or in the bright future we'll have cert-manager taking care of it (Janis in Serviceops is working on it)
[16:13:30] i was guessing i would need to use self-signed-ca.sh but was unsure where
[16:14:13] accraze: so the script takes care of multiple things, not sure if it works with our code, but we can try
[16:14:16] in case we can adapt it
[16:14:42] (the script also takes care of live hack the kubernetes resources to create the secret etc..)
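The "generate the cert and upload it as secret" idea discussed above could be sketched roughly as follows. This only builds a Secret manifest (JSON, which `kubectl apply -f` accepts) from PEM material produced elsewhere, for example by a self-signed CA script. The secret name and namespace come from the conversation; the `kubernetes.io/tls` type with `tls.crt`/`tls.key` keys is an assumption about what the webhook expects, and `build_tls_secret` plus the PEM placeholders are hypothetical.

```python
import base64
import json


def build_tls_secret(name: str, namespace: str, cert_pem: str, key_pem: str) -> dict:
    """Return a Kubernetes Secret manifest of type kubernetes.io/tls.

    Assumption: the webhook reads its certificate from the standard
    tls.crt / tls.key keys. Adjust if the deployment expects other keys.
    """
    def b64(text: str) -> str:
        # Secret "data" values must be base64-encoded strings.
        return base64.b64encode(text.encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "kubernetes.io/tls",
        "metadata": {"name": name, "namespace": namespace},
        "data": {"tls.crt": b64(cert_pem), "tls.key": b64(key_pem)},
    }


if __name__ == "__main__":
    # Placeholder PEM bodies; real values would come from the generated cert.
    cert = "-----BEGIN CERTIFICATE-----\n(placeholder)\n-----END CERTIFICATE-----\n"
    key = "-----BEGIN PRIVATE KEY-----\n(placeholder)\n-----END PRIVATE KEY-----\n"
    manifest = build_tls_secret("kserve-webhook-server-cert", "kserve", cert, key)
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be redirected to a file and applied in the sandbox; it does not replace whatever else the upstream script does to the cluster resources.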
[16:15:41] for knative we have a secret for the webhook too
[17:55:43] well the cert issue is solved, but now got a new error related to IngressConfig
[17:57:10] `services.serving.knative.dev "sklearn-iris-sample-predictor-default" already exists`
[17:57:51] getting closer, will dig in again after the team meeting
[18:11:28] Machine-Learning-Team, Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (elukey) Open→Resolved
[18:16:13] Machine-Learning-Team, Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (colewhite) Resolved→Open > failed to parse field [knative_dev/key] of type [text] is still coming in: https://logstash.wikimedia.org/goto/375bfec5b28ed0614b6...
[18:19:20] Machine-Learning-Team, Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (elukey) >>! In T288549#7520997, @colewhite wrote: >> failed to parse field [knative_dev/key] of type [text] > is still coming in: https://logstash.wikimedia.org/goto...
[18:24:13] Machine-Learning-Team, Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (colewhite) >>! In T288549#7521005, @elukey wrote: > > Cole I am a bit confused, didn't the patch take care of the field that varies type? The patch was for the `er...
[19:55:32] going afk, have a nice rest of the day folks!
[19:55:48] accraze: post in here issue with the sandbox in case, I'll try to work on them tomorrow :)
[19:56:25] elukey: will do! have a good evening :)
[19:57:59] i traced the issue down to the knative service, when i describe the ksvc i see the images cannot be fetched
[19:58:08] ` x509: certificate signed by unknown authority`
[20:00:34] i might be able to get around this by using the `--insecure-registry` flag when starting minikube
[21:29:04] hmm yeah the insecure-registry flag approach didn't change anything, it seems the kserve-controller-manager cannot fetch the images from wmf registry and gets the x509
[21:30:18] this happens when i try to deploy the enwiki-goodfaith inference service via k apply -f ....
[21:35:33] errr its the webhook that is unable to fetch the image from our registry
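One way to narrow down an `x509: certificate signed by unknown authority` error like the one above is to check, from the same environment, whether the registry's certificate chain verifies against the CAs the client trusts, with and without an extra CA bundle. The sketch below does only that and is not a diagnosis of the sandbox issue; `make_context` and `probe` are hypothetical helper names, and the registry hostname and CA file path are supplied by the caller because neither appears in the log.

```python
import socket
import ssl
import sys
from typing import Optional, Tuple


def make_context(cafile: Optional[str] = None) -> ssl.SSLContext:
    """Build a verifying TLS context; cafile is added on top of system CAs."""
    ctx = ssl.create_default_context()
    if cafile:
        ctx.load_verify_locations(cafile=cafile)
    return ctx


def probe(host: str, port: int = 443, cafile: Optional[str] = None,
          timeout: float = 5.0) -> Tuple[bool, str]:
    """Attempt a TLS handshake and report whether the chain verified."""
    ctx = make_context(cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True, "certificate chain verified"
    except ssl.SSLCertVerificationError as exc:
        # Raised when the chain does not lead to a trusted CA (among others).
        return False, f"verification failed: {exc.verify_message}"
    except OSError as exc:
        # DNS failures, refused connections, timeouts, other TLS errors.
        return False, f"connection error: {exc}"


if __name__ == "__main__":
    # Usage: python probe.py <registry-host> [extra-ca-bundle.pem]
    if len(sys.argv) > 1:
        extra_ca = sys.argv[2] if len(sys.argv) > 2 else None
        print(probe(sys.argv[1], cafile=extra_ca))
```

If the probe fails without the extra bundle and succeeds with it, the component pulling the image most likely lacks that CA in its trust store, which is a different fix from marking the registry as insecure.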