[07:16:29] Hi folks! [07:16:37] kalimera :) [07:19:04] kevinbazira: o/ going to deploy the new docker image for rec-api-ng (containing the new Debian version) [07:21:19] elukey: o/ [07:21:35] okok I'll test as soon as you've deployed [07:22:42] done! [07:23:35] testing ... [07:23:49] results: [07:23:49] ``` [07:23:49] $ time curl "https://recommendation-api-ng.discovery.wmnet:31443/api/?s=en&t=fr&n=3&article=Basketball" [07:23:49] [{"pageviews": 0, "title": "Keisei_Tominaga", "wikidata_id": "Q107172388", "rank": 483.0}, {"pageviews": 0, "title": "2006_NCAA_Division_I_women's_basketball_tournament", "wikidata_id": "Q4606663", "rank": 481.0}, {"pageviews": 0, "title": "Earl_Strom", "wikidata_id": "Q735731", "rank": 479.0}] [07:23:50] real 0m4.585s [07:23:50] user 0m0.010s [07:23:51] sys 0m0.004s [07:23:51] ``` [07:26:47] we used the same query in https://phabricator.wikimedia.org/T347475#9268123 and it ran in ~1.5m, today it's running at ~4.5s. Not sure why; maybe it has to do with the external endpoint cache. [07:28:19] kevinbazira: if you query multiple times the latency differs, I think it depends a lot on how slow the various endpoints that we fetch data from are [07:28:34] sure sure [07:28:37] I tried multiple times and got some responses around 2.x seconds [07:28:57] this is why I was suggesting a more in-depth load test to figure out the latency variation of rec-api-ng [07:33:12] 10Machine-Learning-Team: Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing - https://phabricator.wikimedia.org/T348607 (10kevinbazira) To help others in the future who might have trouble resolving the issue above, I have added a note to the envoy proxy docs that...
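The latency spread discussed above (responses ranging from ~2.x to ~4.5 seconds across repeated queries) can be summarized quickly before investing in a full load test. A minimal sketch, assuming timings are collected separately (e.g. from repeated `time curl` runs); the sample numbers below are illustrative, not real rec-api-ng measurements:

```python
import statistics

def latency_summary(samples_s):
    """Summarize a list of request latencies, in seconds."""
    ordered = sorted(samples_s)
    # Nearest-rank p95; good enough for eyeballing variation
    p95_idx = max(0, int(round(0.95 * len(ordered))) - 1)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": ordered[p95_idx],
        "max": ordered[-1],
    }

# Illustrative timings only, not real measurements
samples = [4.585, 2.1, 2.4, 1.9, 4.2, 2.2]
print(latency_summary(samples))
```

A wide gap between median and p95 would support the hypothesis that latency is dominated by the slowest external endpoint rather than by rec-api-ng itself.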
[08:28:35] 10Machine-Learning-Team, 10observability, 10Patch-For-Review: Some Istio recording rules may be missing data - https://phabricator.wikimedia.org/T349072 (10elukey) 05Open→03Resolved After checking https://thanos.wikimedia.org/rules#istio_slos it seems that the new rules are way faster, and I don't see th... [08:59:01] I am cleaning up some old TLS settings from ores-legacy and rec-api-ng [08:59:20] it is basically a no-op, since we are not using a completely automated system [09:08:38] 10Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10elukey) ` elukey@puppetmaster1001:~$ sudo puppet cert clean ores.discovery.wmnet Warning: `puppet cert` is deprecated and will be removed in a future release. (location: /usr/lib/ruby/vendor_ru... [09:09:07] all cleaned up! [09:09:15] lemme know if you see issues [09:10:26] klausman: o/ I removed the cergen/puppet-ca certificates for ores,ores-legacy,rec-api-ng.discovery.wmnet (including revoking them etc..) [09:10:47] we are managing them all via cfssl, so a cleanup was needed [09:11:40] 10Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10elukey) [09:13:35] going to start with the rollout of the new docker images with newer bullseye [09:13:58] so all the revscoring ones should be fine, I'll double check but we shouldn't need to roll out [09:14:57] I'll do ores-legacy, outlink, readability (kserve 0.11 upgrade), revertrisk LA, revertrisk ML (kserve 0.11 upgrade) [09:17:03] elukey: ack. re: ssl. [09:17:25] also: morning. [09:22:31] man, my prodimage git repos are completely messed up. I don't see my own merged changes anymore, no matter where I pull from :-/ [09:23:56] elukey: even in a fresh cline of p-i, I don't see the 1.21 change, am I doing something wrong?
[09:24:12] clone* [09:26:36] ah got it, typo in branch names %) [09:29:09] oohkay, now the change is fired :-/ [09:29:13] fried* [09:31:12] cloning the p-i repo seems to set one up for getting rebases wrong. [09:53:43] klausman: re storage-init - pip should be needed, I was more wondering about cpp [09:54:05] Well, the build seemed to work without it [09:54:32] hang on, I'll pastebin the whole build log (with --debug) [09:54:49] ahh maybe because we install it via python3 -m pip etc.. [09:55:41] https://phabricator.wikimedia.org/P53023 [09:55:50] yes, -m pip doing all that is needed would be my guess [09:56:32] perfect [09:56:39] elukey: I'm back. We can split the deployments. lemme know which ones you want to take [09:57:24] elukey: when we have some time™, we might want to have a quick think about whether we could create a useful test.sh for these images, just covering some basics. [09:57:57] klausman: you can do it as part of the upgrade if you want :) [09:58:42] I'd need some time to figure out the best way of doing it :) [09:59:11] But I'll put it on my secondary todo pile :) [09:59:42] isaranto: o/ so rec-api-ng and ores-legacy are done, I can do outlink now, and if you want you can take readability and RR ML (both contain the kserve 0.11 upgrade so a quick load test in staging is needed) [10:01:02] ok, I'll start with the deploy on staging and try to find past load test results to compare [10:02:02] super [10:02:45] ah, no need to find old ones, I'll just run a load test before and after deployment [10:02:59] but I'll look for the old ones either way to check [10:17:21] I have an issue in ml-staging in the revertrisk namespace and can't figure out what is wrong: helmfile diff shows nothing although the diff in CI showed the image update https://integration.wikimedia.org/ci/job/helm-lint/13576/consoleFull [10:19:32] isaranto: I think that I have updated all staging dirs IIRC, check from kubectl describe pod if the docker image is the right one [10:19:54]
klausman: when you have a moment I think we can give the green light in https://phabricator.wikimedia.org/T342765 [10:21:59] elukey: last multilingual deployment was 12d ago. The image in the values.yaml is the correct one (from 16/10). With kubectl describe I can see a sha256 hash for the image but not the tag. Any easy way to cross-check (other than downloading the image and checking the hash myself)? [10:31:46] isaranto: in theory if you see the 16/10 mark it should be the tag, no need for the sha [10:33:58] there are only sha(s), e.g. `docker-registry.discovery.wmnet/wikimedia/machinelearning-liftwing-inference-services-revertrisk-multilingual@sha256:1856657ddf483c35456e6f3c1837454f60d3c696b0cb61ab2be869791f79aa05` so no way to verify the 16/10 tag. [10:34:22] in any case, this pod was started 12d ago so it doesn't have the latest tag for sure [10:35:39] it must be https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/964559 then [10:36:15] ok lemme think about the sha [10:36:27] yes, that is it. [10:37:20] the problem I have is not the verification for now but that helmfile diff doesn't show a change although the values file has a different image. I'll check the output locally to see [10:37:34] elukey: ah, yes. Do we need to do any more cleanup on our side? Like reconfig of repos or sth? [10:38:43] klausman: we need to archive them all I think [10:38:48] (part of the cleanup) [10:38:58] isaranto: I see 1856657ddf483c35456e6f3c1837454f60d3c696b0cb61ab2be869791f79aa05 on the node [10:39:35] yes me2 [10:40:57] elukey: is there a canonical list of GH repos that were used for ORES and that can now be archived? I suspect https://github.com/wikimedia/ores is one of them? [10:42:18] yes [10:42:49] We should probably add a deprecation warning (and pointer to LW) on that README before archiving.
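As noted later in the log, knative pins pod images by digest, which is why kubectl describe shows `repo@sha256:...` rather than the 16/10 tag from values.yaml. A small helper to pull apart such a reference (a sketch; mapping a digest back to its tag would still require querying the docker registry):

```python
def parse_image_ref(ref):
    """Split a container image reference into (repository, tag, digest)."""
    digest = None
    if "@" in ref:
        ref, digest = ref.split("@", 1)
    tag = None
    # A ':' after the last '/' separates the tag (not the registry port)
    if ":" in ref.rsplit("/", 1)[-1]:
        ref, tag = ref.rsplit(":", 1)
    return ref, tag, digest

pinned = ("docker-registry.discovery.wmnet/wikimedia/"
          "machinelearning-liftwing-inference-services-revertrisk-multilingual"
          "@sha256:1856657ddf483c35456e6f3c1837454f60d3c696b0cb61ab2be869791f79aa05")
print(parse_image_ref(pinned))
```

For a digest-pinned reference the tag comes back as None, which matches what kubectl shows for the knative-managed pods.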
[10:44:45] yep [10:47:14] isaranto: ahh okok try to check "kubectl get revision -n revertrisk" [10:47:27] the new revision is in 'deploying' state [10:48:20] and kubectl get events -n revertrisk says why [10:48:42] 14m Warning FailedCreate replicaset/revertrisk-multilingual-predictor-default-00014-deployment-7b6f88b95f Error creating: pods "revertrisk-multilingual-predictor-default-00014-deployment78m7g" is forbidden: maximum memory usage per Pod is 8Gi, but limit is 8747220992 [10:48:47] isaranto: --^ [10:49:27] oh ok, thanks [10:49:51] should have figured that out! sorry for the hassle. [10:50:07] isaranto: nono please, it took me a while, I was puzzled as well :D [10:50:20] knative forces the sha, I forgot about it [10:50:28] so it is less straightforward [10:50:47] afaics in staging we have only 8Gi for pods in a namespace [10:54:46] and I am not getting why we are crossing it [10:56:24] we have 6Gi as limit [10:56:25] mmm [11:00:14] at this point one of the containers got more memory allocated, and we are crossing the threshold [11:01:48] isaranto: I modified limitranges manually for rr in staging to allow 10GiB [11:01:51] let's see [11:02:20] at some point the controller will retry [11:02:46] Ok got it, thanks [11:03:11] hopefully it is that; if so, I'll file a change to update the thresholds [11:03:36] * elukey lunch! [11:06:15] same. [11:06:24] sent a PR for adding the deprecation warning on GH [11:19:21] * isaranto lunch as well [12:44:55] Morning all [12:45:24] morning! [12:50:31] A full day in the office is ahead of me [13:20:30] \o heyo chris. hope the day won't be too stressful [13:21:56] yeah sounds like a long day, given your head start! [13:22:30] btw folks revertrisk-multilingual isn't working in staging at all. pod is up but can't get a request through. Debugging...
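The FailedCreate event above means the pod's summed container memory limits (8747220992 bytes) exceed the namespace LimitRange of 8Gi per pod. A sketch of the arithmetic, assuming only the binary (Ki/Mi/Gi/Ti) suffixes of Kubernetes quantities need handling:

```python
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}

def k8s_quantity_to_bytes(quantity):
    """Convert a Kubernetes memory quantity such as '8Gi' to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count

limit_range_max = k8s_quantity_to_bytes("8Gi")  # namespace max per pod
pod_total = 8747220992                          # from the FailedCreate event
print(pod_total - limit_range_max)              # bytes over the limit
```

The overshoot works out to 157286400 bytes, i.e. exactly 150Mi, consistent with one extra container's limit pushing the pod past the threshold.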
[13:33:42] isaranto: I think that Aiko is testing the batcher in there [13:33:58] I see 4 pods, and if I make a request it asks for "instances" [13:34:02] err 4 containers [13:34:19] correct! [13:35:07] so we can revert those settings for the moment, test and see [13:35:53] I'm facing a difficult monday figuring out stuff today as you understood already :) [13:38:35] isaranto: you are not the only one, I happened to remember Aiko's code review, otherwise I would have been there staring at my monitor as well :D [13:38:52] thanks for being nice [13:39:27] ok, since it is there I'll try to run a load test with the batcher as well and then I'll send a patch to remove it for now [14:07:09] Seems like it needs some work, as it doesn't pass input validation as it is. Proceeding with reverting the commit for now https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/967773 [14:10:52] isaranto: I commented that we could leave it in there, so Aiko will be able to keep working [14:12:13] elukey: do u mean to have 2 deployments for multilingual? or test it with a manual change? [14:12:32] isaranto: having two for ML exactly [14:12:39] ack [14:13:21] cool, I'm abandoning the change then [14:16:31] only if you think it is ok, otherwise we can proceed [14:16:39] it was meant as a discussion :) [14:18:16] yes, I didn't state it explicitly, I agree! [14:18:39] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/967939 [14:25:48] ahhh okk now I get why we were seeing the limit ranges error before [14:25:52] we had 4 containers [14:27:07] elukey: building the kserve images now on build2001 [14:28:19] okok [14:29:40] 10Machine-Learning-Team: Upgrade the readability model server to KServe 0.11.1 - https://phabricator.wikimedia.org/T348664 (10isarantopoulos) Started to run load testing on staging and figured out the model server is toooo slow. A single request takes 38-46 seconds instead of <1s. ` time curl -s https://inferen...
[14:30:47] isaranto: I think that --^ is the same issue that we saw for revert risk :( [14:30:51] see the cpu throttling in https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=readability&var-pod=readability-predictor-default-00005-deployment-58d55c48d5-76mjq&var-container=All [14:31:18] I can check the threads on the container, but I suspect there are a lot of them [14:32:52] elukey@ml-staging2002:~$ ps -eLf | grep 2686637 | wc -l [14:32:53] 243 [14:32:56] yeah :( [14:33:05] ok. iirc the issue was with xgboost. this one is using catboost [14:33:25] but the issue is the same [14:33:32] I can check via perf what is consuming cpu, maybe it is openmp as well [14:36:15] https://github.com/search?q=repo%3Acatboost%2Fcatboost%20OMP_NUM_THREADS&type=code [14:36:18] it uses openmp as well [14:36:52] isaranto: can you try to make some requests in a row? [14:37:03] that's what I just saw [14:37:08] sure [14:37:22] doing so now. [14:38:01] yep! libgomp-a34b3233.so.1 as top talker via perf [14:38:04] same issue [14:38:19] we need to set OMP_NUM_THREADS=1 [14:38:47] 10Machine-Learning-Team: Test the kserve batcher for Revert Risk multilingual isvc - https://phabricator.wikimedia.org/T348536 (10isarantopoulos) @achou Created a model server named `revertrisk-multilingual-batcher` in staging. Feel free to use that one for working on the batcher. [14:42:22] Successfully published image docker-registry.discovery.wmnet/kserve-agent:0.11.1-1 [14:42:25] Successfully published image docker-registry.discovery.wmnet/kserve-storage-initializer:0.11.1-1 [14:42:27] Successfully published image docker-registry.discovery.wmnet/kserve-build:0.11.1-1 [14:42:29] Successfully published image docker-registry.discovery.wmnet/kserve-controller:0.11.1-1 [14:42:31] \o/ [14:42:39] nice!
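Setting OMP_NUM_THREADS=1, as elukey concludes above, only works if the variable is in the environment before the OpenMP runtime initializes; for a model server that normally means setting it in the container spec rather than in application code. A code-level sketch of the ordering constraint (the catboost import is shown commented out as an illustration):

```python
import os

# OpenMP reads OMP_NUM_THREADS when the runtime initializes, so it must be
# set before importing any library that links libgomp (e.g. catboost).
os.environ["OMP_NUM_THREADS"] = "1"

# import catboost  # would now start with a single OpenMP thread

print(os.environ["OMP_NUM_THREADS"])
```

If the variable is exported only after libgomp has spun up its thread pool, the 200+ threads seen in `ps -eLf` above will already exist.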
[14:43:04] No idea if they actually *work* :D [15:04:54] klausman: re: https://github.com/wikimedia/ores/pull/365 - we should discuss with the team what to add, and then apply it to all the repos.. I would add something related to the fact that the ORES code will not be improved, and the community can fork it etc.. [15:05:21] yes, the PR was meant as a starting point, so to speak [15:05:33] ack ack [15:12:52] readability is back to more normal levels [15:13:19] \o/ [15:31:36] klausman: https://gitlab.wikimedia.org/repos/sre/alerts-triage, really nice [15:33:01] oh, giving that a spin [15:34:48] very neat [15:37:21] * elukey bbiab [15:38:31] 10Machine-Learning-Team: Upgrade Revert Risk Multilingual docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347551 (10isarantopoulos) a:03isarantopoulos [15:40:03] 10Machine-Learning-Team: Upgrade Revert Risk Multilingual docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347551 (10isarantopoulos) Ran some load tests ` isaranto@deploy2002:~/pycharm/test/wrk$ wrk -c 1 -t 1 --timeout 5s -s revertrisk.lua https://inference-staging.svc.codfw.wmnet:30443/v1/model... [15:48:24] 10Machine-Learning-Team: Upgrade the readability model server to KServe 0.11.1 - https://phabricator.wikimedia.org/T348664 (10isarantopoulos) I ran some load tests. There are differences from the [[ https://phabricator.wikimedia.org/P52406 | old load tests done ]] by Aiko. The main difference is when trying to m... [16:06:52] isaranto: very interesting, if I check threads in readability prod I see 6 [16:08:14] weird, in staging I still see a ton of them [16:09:17] yes. I wanted to write a follow-up after pasting the results. [16:10:19] ah snap it must be the same also for revertrisk [16:10:29] the threads are created, but only one is used, and we don't see throttling [16:14:02] isaranto: ok if I try to set also OMP_THREAD_LIMIT?
[16:15:42] we were discussing it with Aiko the other day (https://www.openmp.org/spec-html/5.0/openmpse58.html) [16:16:43] set it to 1? [16:17:11] yeah [16:17:36] I tried and I see fewer threads, but still too many [16:18:36] I think that https://github.com/catboost/catboost/blob/master/util/system/info.cpp is still missing cgroups v2 [16:19:10] like https://github.com/dmlc/xgboost/issues/9622 [16:22:14] opened https://github.com/catboost/catboost/issues/2518 [16:24:02] We'll have to revisit this for sure. setting threads to 1 limits performance. I think there is benefit to >1 thread even with 1 cpu [16:24:46] since setting it to 1 is more stable we can revisit if we need to boost performance [16:25:11] thanks for opening the issue! [16:34:26] 10Machine-Learning-Team, 10Observability-Alerting, 10Patch-For-Review: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (10isarantopoulos) I have tried to set the alerting rule to `sum by (exported_cluster) (kafka_burrow_partition_lag{group="cpjobqueue-ORESFetchScoreJob"}) > 100` which would br... [16:35:48] I had no luck with the kafka alert. I've tried numerous things :( [16:36:36] klausman: could you help me with this one of these days? tomorrow or wednesday would be great! [16:36:51] I'm logging off for the day folks, have a great rest of day! [16:38:01] Sure, I can help [16:38:30] also heading out for today [16:46:18] heading out as well, have a good rest of the day! [18:57:23] 10Machine-Learning-Team, 10Data-Engineering, 10Wikimedia Enterprise, 10Data Engineering and Event Platform Team (Sprint 4), and 2 others: [Event Platform] Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10Ah...
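The catboost issue elukey opened above is about CPU detection not honoring cgroup limits. Under cgroups v2 the CPU quota is exposed in /sys/fs/cgroup/cpu.max as "<quota> <period>"; a sketch of deriving an effective CPU count from that line (parsing a string here rather than reading the file, so the example is self-contained):

```python
def effective_cpus(cpu_max_line, fallback):
    """Derive a usable CPU count from a cgroup v2 cpu.max line.

    cpu.max contains "<quota> <period>"; quota is "max" when unlimited.
    """
    quota, period = cpu_max_line.split()
    if quota == "max":
        return fallback
    return max(1, int(quota) // int(period))

# A container limited to 2 CPUs typically shows "200000 100000"
print(effective_cpus("200000 100000", fallback=8))  # 2
print(effective_cpus("max 100000", fallback=8))     # 8
```

A library that only counts host CPUs (the bug linked above) would use the fallback everywhere, which is how a 1-CPU container ends up with 200+ OpenMP threads and heavy throttling.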