[07:16:29] Hi folks! [07:16:37] kalimera :) [07:19:04] kevinbazira: o/ going to deploy the new docker image for rec-api-ng (containing the new Debian version) [07:21:19] elukey: o/ [07:21:35] okok I'll test as soon as you've deployed [07:22:42] done! [07:23:35] testing ... [07:23:49] results: [07:23:49] ``` [07:23:49] $ time curl "https://recommendation-api-ng.discovery.wmnet:31443/api/?s=en&t=fr&n=3&article=Basketball" [07:23:49] [{"pageviews": 0, "title": "Keisei_Tominaga", "wikidata_id": "Q107172388", "rank": 483.0}, {"pageviews": 0, "title": "2006_NCAA_Division_I_women's_basketball_tournament", "wikidata_id": "Q4606663", "rank": 481.0}, {"pageviews": 0, "title": "Earl_Strom", "wikidata_id": "Q735731", "rank": 479.0}] [07:23:50] real 0m4.585s [07:23:50] user 0m0.010s [07:23:51] sys 0m0.004s [07:23:51] ``` [07:26:47] we used the same query in https://phabricator.wikimedia.org/T347475#9268123 and it ran in ~1.5m, today it's running at ~4.5s. Not sure why; maybe it has to do with the external endpoint cache. [07:28:19] kevinbazira: if you query multiple times the latency differs, I think it depends a lot on how slow the various endpoints that we fetch data from are [07:28:34] sure sure [07:28:37] I tried multiple times and got some responses around 2.x seconds [07:28:57] this is why I was suggesting a more in-depth load test to figure out the latency variation of rec-api-ng [07:33:12] 10Machine-Learning-Team: Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing - https://phabricator.wikimedia.org/T348607 (10kevinbazira) To help others in the future who might have trouble resolving the issue above, I have added a note to the envoy proxy docs that...
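The latency spread discussed above (responses ranging from ~2.x to ~4.5 seconds across repeated queries) can be summarized quickly before investing in a full load test. A minimal sketch, assuming timings are collected separately (e.g. from repeated `time curl` runs); the sample numbers below are illustrative, not real rec-api-ng measurements:

```python
import statistics

def latency_summary(samples_s):
    """Summarize a list of request latencies, in seconds."""
    ordered = sorted(samples_s)
    # Nearest-rank p95; good enough for eyeballing variation
    p95_idx = max(0, int(round(0.95 * len(ordered))) - 1)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": ordered[p95_idx],
        "max": ordered[-1],
    }

# Illustrative timings only, not real measurements
samples = [4.585, 2.1, 2.4, 1.9, 4.2, 2.2]
print(latency_summary(samples))
```

A wide gap between median and p95 would support the hypothesis that latency is dominated by the slowest external endpoint rather than by rec-api-ng itself.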
[08:28:35] 10Machine-Learning-Team, 10observability, 10Patch-For-Review: Some Istio recording rules may be missing data - https://phabricator.wikimedia.org/T349072 (10elukey) 05Open→03Resolved After checking https://thanos.wikimedia.org/rules#istio_slos it seems that the new rules are way faster, and I don't see th... [08:59:01] I am cleaning up some old TLS settings from ores-legacy and rec-api-ng [08:59:20] it is basically a no-op, since we are not using a completely automated system [09:08:38] 10Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10elukey) ` elukey@puppetmaster1001:~$ sudo puppet cert clean ores.discovery.wmnet Warning: `puppet cert` is deprecated and will be removed in a future release. (location: /usr/lib/ruby/vendor_ru... [09:09:07] all cleaned up! [09:09:15] lemme know if you see issues [09:10:26] klausman: o/ I removed the cergen/puppet-ca certificates for ores,ores-legacy,rec-api-ng.discovery.wmnet (including revoking them etc..) [09:10:47] we are managing them all via cfssl, so a cleanup was needed [09:11:40] 10Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10elukey) [09:13:35] going to start with the rollout of the new docker images with newer bullseye [09:13:58] so all the revscoring ones should be fine, I'll double check but we shouldn't need to roll out [09:14:57] I'll do ores-legacy, outlink, readability (kserve 0.11 upgrade), revertrisk LA, revertrisk ML (kserve 0.11 upgrade) [09:17:03] elukey: ack. re: ssl. [09:17:25] also: morning. [09:22:31] man, my prodimage git repos are completely messed up. I don't see my own merged changes anymore, no matter where I pull from :-/ [09:23:56] elukey: even in a fresh cline of p-i, I don't see the 1.21 change, am I doing something wrong?
[09:24:12] clone* [09:26:36] ah got it, typo in branch names %) [09:29:09] oohkay, now the change is fired :-/ [09:29:13] fried* [09:31:12] cloning the p-i repo seems to set one up for getting rebases wrong. [09:53:43] klausman: re storage-init - pip should be needed, I was more wondering about cpp [09:54:05] Well, the build seemed to work without it [09:54:32] hang on, I'll pastebin the whole build log (with --debug) [09:54:49] ahh maybe because we install it via python3 -m pip etc.. [09:55:41] https://phabricator.wikimedia.org/P53023 [09:55:50] yes, -m pip doing all that is needed would be my guess [09:56:32] perfect [09:56:39] elukey: I'm back. We can split the deployments. lemme know which ones you want to take [09:57:24] elukey: when we have some time™, we might want to have a quick think about whether we could create a useful test.sh for these images, just covering some basics. [09:57:57] klausman: you can do it as part of the upgrade if you want :) [09:58:42] I'd need some time to figure out the best way of doing it :) [09:59:11] But I'll put it on my secondary todo pile :) [09:59:42] isaranto: o/ so rec-api-ng and ores-legacy are done, I can do outlink now, and if you want you can take readability and RR ML (both contain the kserve 0.11 upgrade so a quick load test in staging is needed) [10:01:02] ok, I'll start with the deploy on staging and try to find past load test results to compare [10:02:02] super [10:02:45] ah, no need to find old ones, I'll just run a load test before and after deployment [10:02:59] but I'll look for the old ones either way to check [10:17:21] I have an issue in ml-staging in the revertrisk namespace and can't figure out what is wrong: helmfile diff shows nothing although the diff in CI showed the image update https://integration.wikimedia.org/ci/job/helm-lint/13576/consoleFull [10:19:32] isaranto: I think that I have updated all staging dirs IIRC, check from kubectl describe pod if the docker image is the right one [10:19:54]
klausman: when you have a moment I think we can give the green light in https://phabricator.wikimedia.org/T342765 [10:21:59] elukey: last multilingual deployment was 12d ago. The image in the values.yaml is the correct one (from 16/10). With kubectl describe I can see a sha256 hash for the image but not the tag. Any easy way to cross-check (other than downloading the image and checking the hash myself)? [10:31:46] isaranto: in theory if you see the 16/10 mark it should be the tag, no need for the sha [10:33:58] there are only sha(s), e.g. `docker-registry.discovery.wmnet/wikimedia/machinelearning-liftwing-inference-services-revertrisk-multilingual@sha256:1856657ddf483c35456e6f3c1837454f60d3c696b0cb61ab2be869791f79aa05` so no way to verify the 16/10 tag. [10:34:22] in any case, this pod was started 12d ago so it doesn't have the latest tag for sure [10:35:39] it must be https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/964559 then [10:36:15] ok lemme think about the sha [10:36:27] yes, that is it. [10:37:20] the problem I have is not the verification for now but that helmfile diff doesn't show a change although the values file has a different image. I'll check the output locally to see [10:37:34] elukey: ah, yes. Do we need to do any more cleanup on our side? Like reconfig of repos or sth? [10:38:43] klausman: we need to archive them all I think [10:38:48] (part of the cleanup) [10:38:58] isaranto: I see 1856657ddf483c35456e6f3c1837454f60d3c696b0cb61ab2be869791f79aa05 on the node [10:39:35] yes me2 [10:40:57] elukey: is there a canonical list of GH repos that were used for ORES and that can now be archived? I suspect https://github.com/wikimedia/ores is one of them? [10:42:18] yes [10:42:49] We should probably add a deprecation warning (and pointer to LW) on that README before archiving.
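As noted later in the log, knative pins pod images by digest, which is why kubectl describe shows `repo@sha256:...` rather than the 16/10 tag from values.yaml. A small helper to pull apart such a reference (a sketch; mapping a digest back to its tag would still require querying the docker registry):

```python
def parse_image_ref(ref):
    """Split a container image reference into (repository, tag, digest)."""
    digest = None
    if "@" in ref:
        ref, digest = ref.split("@", 1)
    tag = None
    # A ':' after the last '/' separates the tag (not the registry port)
    if ":" in ref.rsplit("/", 1)[-1]:
        ref, tag = ref.rsplit(":", 1)
    return ref, tag, digest

pinned = ("docker-registry.discovery.wmnet/wikimedia/"
          "machinelearning-liftwing-inference-services-revertrisk-multilingual"
          "@sha256:1856657ddf483c35456e6f3c1837454f60d3c696b0cb61ab2be869791f79aa05")
print(parse_image_ref(pinned))
```

For a digest-pinned reference the tag comes back as None, which matches what kubectl shows for the knative-managed pods.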
[10:44:45] yep [10:47:14] isaranto: ahh okok try to check "kubectl get revision -n revertrisk" [10:47:27] the new revision is in 'deploying' state [10:48:20] and kubectl get events -n revertrisk says why [10:48:42] 14m Warning FailedCreate replicaset/revertrisk-multilingual-predictor-default-00014-deployment-7b6f88b95f Error creating: pods "revertrisk-multilingual-predictor-default-00014-deployment78m7g" is forbidden: maximum memory usage per Pod is 8Gi, but limit is 8747220992 [10:48:47] isaranto: --^ [10:49:27] oh ok, thanks [10:49:51] should have figured that out! sorry for the hassle. [10:50:07] isaranto: nono please, it took me a while, I was puzzled as well :D [10:50:20] knative forces the sha, I forgot about it [10:50:28] so it is less straightforward [10:50:47] afaics in staging we have only 8Gi for pods in a namespace [10:54:46] and I am not getting why we are crossing it [10:56:24] we have 6Gi as limit [10:56:25] mmm [11:00:14] at this point one of the containers got more memory allocated, and we are crossing the threshold [11:01:48] isaranto: I modified limitranges manually for rr in staging to allow 10GiB [11:01:51] let's see [11:02:20] at some point the controller will retry [11:02:46] Ok got it, thanks [11:03:11] hopefully it is that; if so, I'll file a change to update the thresholds [11:03:36] * elukey lunch! [11:06:15] same. [11:06:24] sent a PR for adding the deprecation warning on GH [11:19:21] * isaranto lunch as well [12:44:55] Morning all [12:45:24] morning! [12:50:31] A full day in the office is ahead of me [13:20:30] \o heyo chris. hope the day won't be too stressful [13:21:56] yeah sounds like a long day, given your head start! [13:22:30] btw folks revertrisk-multilingual isn't working in staging at all. pod is up but can't get a request through. Debugging...
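The FailedCreate event above means the pod's summed container memory limits (8747220992 bytes) exceed the namespace LimitRange of 8Gi per pod. A sketch of the arithmetic, assuming only the binary (Ki/Mi/Gi/Ti) suffixes of Kubernetes quantities need handling:

```python
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}

def k8s_quantity_to_bytes(quantity):
    """Convert a Kubernetes memory quantity such as '8Gi' to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count

limit_range_max = k8s_quantity_to_bytes("8Gi")  # namespace max per pod
pod_total = 8747220992                          # from the FailedCreate event
print(pod_total - limit_range_max)              # bytes over the limit
```

The overshoot works out to 157286400 bytes, i.e. exactly 150Mi, consistent with one extra container's limit pushing the pod past the threshold.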
[13:33:42] isaranto: I think that Aiko is testing the batcher in there [13:33:58] I see 4 pods, and if I make a request it asks for "instances" [13:34:02] err 4 containers [13:34:19] correct! [13:35:07] so we can revert those settings for the moment, test and see [13:35:53] I'm facing a difficult monday figuring out stuff today as you understood already :) [13:38:35] isaranto: you are not the only one, I happened to remember Aiko's code review, otherwise I would have been there staring at my monitor as well :D [13:38:52] thanks for being nice [13:39:27] ok, since it is there I'll try to run a load test with the batcher as well and then I'll send a patch to remove it for now [14:07:09] Seems like it needs some work, as it doesn't pass input validation as it is. Proceeding with reverting the commit for now https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/967773 [14:10:52] isaranto: I commented that we could leave it in there, so Aiko will be able to keep working [14:12:13] elukey: do u mean to have 2 deployments for multilingual? or test it with a manual change? [14:12:32] isaranto: having two for ML exactly [14:12:39] ack [14:13:21] cool, I'm abandoning the change then [14:16:31] only if you think it is ok, otherwise we can proceed [14:16:39] it was meant as a discussion :) [14:18:16] yes, I didn't state it explicitly, I agree! [14:18:39] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/967939 [14:25:48] ahhh okk now I get why we were seeing the limit ranges error before [14:25:52] we had 4 containers [14:27:07] elukey: building the kserve images now on build2001 [14:28:19] okok [14:29:40] 10Machine-Learning-Team: Upgrade the readability model server to KServe 0.11.1 - https://phabricator.wikimedia.org/T348664 (10isarantopoulos) Started to run load testing on staging and figured out the model server is toooo slow. A single request takes 38-46 seconds instead of <1s. ` time curl -s https://inferen...
[14:30:47] isaranto: I think that --^ is the same issue that we saw for revert risk :( [14:30:51] see the cpu throttling in https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=readability&var-pod=readability-predictor-default-00005-deployment-58d55c48d5-76mjq&var-container=All [14:31:18] I can check the threads on the container, but I suspect there are a lot of them [14:32:52] elukey@ml-staging2002:~$ ps -eLf | grep 2686637 | wc -l [14:32:53] 243 [14:32:56] yeah :( [14:33:05] ok. iirc the issue was with xgboost. this one is using catboost [14:33:25] but the issue is the same [14:33:32] I can check via perf what is consuming cpu, maybe it is openmp as well [14:36:15] https://github.com/search?q=repo%3Acatboost%2Fcatboost%20OMP_NUM_THREADS&type=code [14:36:18] it uses openmp as well [14:36:52] isaranto: can you try to make some requests in a row? [14:37:03] that's what I just saw [14:37:08] sure [14:37:22] doing so now. [14:38:01] yep! libgomp-a34b3233.so.1 as top talker via perf [14:38:04] same issue [14:38:19] we need to set OMP_NUM_THREADS=1 [14:38:47] 10Machine-Learning-Team: Test the kserve batcher for Revert Risk multilingual isvc - https://phabricator.wikimedia.org/T348536 (10isarantopoulos) @achou Created a model server named `revertrisk-multilingual-batcher` in staging. Feel free to use that one for working on the batcher. [14:42:22] Successfully published image docker-registry.discovery.wmnet/kserve-agent:0.11.1-1 [14:42:25] Successfully published image docker-registry.discovery.wmnet/kserve-storage-initializer:0.11.1-1 [14:42:27] Successfully published image docker-registry.discovery.wmnet/kserve-build:0.11.1-1 [14:42:29] Successfully published image docker-registry.discovery.wmnet/kserve-controller:0.11.1-1 [14:42:31] \o/ [14:42:39] nice!
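Setting OMP_NUM_THREADS=1, as elukey concludes above, only works if the variable is in the environment before the OpenMP runtime initializes; for a model server that normally means setting it in the container spec rather than in application code. A code-level sketch of the ordering constraint (the catboost import is shown commented out as an illustration):

```python
import os

# OpenMP reads OMP_NUM_THREADS when the runtime initializes, so it must be
# set before importing any library that links libgomp (e.g. catboost).
os.environ["OMP_NUM_THREADS"] = "1"

# import catboost  # would now start with a single OpenMP thread

print(os.environ["OMP_NUM_THREADS"])
```

If the variable is exported only after libgomp has spun up its thread pool, the 200+ threads seen in `ps -eLf` above will already exist.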
[14:43:04] No idea if they actually *work* :D [15:04:54] klausman: re: https://github.com/wikimedia/ores/pull/365 - we should discuss with the team what to add, and then apply it to all the repos.. I would add something related to the fact that the ORES code will not be improved, and the community can fork it etc.. [15:05:21] yes, the PR was meant as a starting point, so to speak [15:05:33] ack ack [15:12:52] readability is back to more normal levels [15:13:19] \o/ [15:31:36] klausman: https://gitlab.wikimedia.org/repos/sre/alerts-triage, really nice [15:33:01] oh, giving that a spin [15:34:48] very neat [15:37:21] * elukey bbiab [15:38:31] 10Machine-Learning-Team: Upgrade Revert Risk Multilingual docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347551 (10isarantopoulos) a:03isarantopoulos [15:40:03] 10Machine-Learning-Team: Upgrade Revert Risk Multilingual docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347551 (10isarantopoulos) Ran some load tests ` isaranto@deploy2002:~/pycharm/test/wrk$ wrk -c 1 -t 1 --timeout 5s -s revertrisk.lua https://inference-staging.svc.codfw.wmnet:30443/v1/model... [15:48:24] 10Machine-Learning-Team: Upgrade the readability model server to KServe 0.11.1 - https://phabricator.wikimedia.org/T348664 (10isarantopoulos) I ran some load tests. There are differences from the [[ https://phabricator.wikimedia.org/P52406 | old load tests done ]] by Aiko. The main difference is when trying to m... [16:06:52] isaranto: very interesting, if I check threads in readability prod I see 6 [16:08:14] weird, in staging I still see a ton of them [16:09:17] yes. I wanted to write a follow-up after pasting the results. [16:10:19] ah snap it must be the same also for revertrisk [16:10:29] the threads are created, but only one is used, and we don't see throttling [16:14:02] isaranto: ok if I try to set also OMP_THREAD_LIMIT?
[16:15:42] we were discussing it with Aiko the other day (https://www.openmp.org/spec-html/5.0/openmpse58.html) [16:16:43] set it to 1? [16:17:11] yeah [16:17:36] I tried and I see fewer threads, but still too many [16:18:36] I think that https://github.com/catboost/catboost/blob/master/util/system/info.cpp is still missing cgroups v2 [16:19:10] like https://github.com/dmlc/xgboost/issues/9622 [16:22:14] opened https://github.com/catboost/catboost/issues/2518 [16:24:02] We'll have to revisit this for sure. setting threads to 1 limits performance. I think there is benefit to >1 thread even with 1 cpu [16:24:46] since setting it to 1 is more stable we can revisit if we need to boost performance [16:25:11] thanks for opening the issue! [16:34:26] 10Machine-Learning-Team, 10Observability-Alerting, 10Patch-For-Review: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (10isarantopoulos) I have tried to set the alerting rule to `sum by (exported_cluster) (kafka_burrow_partition_lag{group="cpjobqueue-ORESFetchScoreJob"}) > 100` which would br... [16:35:48] I had no luck with the kafka alert. I've tried numerous things :( [16:36:36] klausman: could you help me with this one of these days? tomorrow or wednesday would be great! [16:36:51] I'm logging off for the day folks, have a great rest of day! [16:38:01] Sure, I can help [16:38:30] also heading out for today [16:46:18] heading out as well, have a good rest of the day! [18:57:23] 10Machine-Learning-Team, 10Data-Engineering, 10Wikimedia Enterprise, 10Data Engineering and Event Platform Team (Sprint 4), and 2 others: [Event Platform] Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10Ah...
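The catboost issue elukey opened above is about CPU detection not honoring cgroup limits. Under cgroups v2 the CPU quota is exposed in /sys/fs/cgroup/cpu.max as "<quota> <period>"; a sketch of deriving an effective CPU count from that line (parsing a string here rather than reading the file, so the example is self-contained):

```python
def effective_cpus(cpu_max_line, fallback):
    """Derive a usable CPU count from a cgroup v2 cpu.max line.

    cpu.max contains "<quota> <period>"; quota is "max" when unlimited.
    """
    quota, period = cpu_max_line.split()
    if quota == "max":
        return fallback
    return max(1, int(quota) // int(period))

# A container limited to 2 CPUs typically shows "200000 100000"
print(effective_cpus("200000 100000", fallback=8))  # 2
print(effective_cpus("max 100000", fallback=8))     # 8
```

A library that only counts host CPUs (the bug linked above) would use the fallback everywhere, which is how a 1-CPU container ends up with 200+ OpenMP threads and heavy throttling.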