[01:17:22] 10artificial-intelligence, 10Structured-Data-Backlog: Implement NSFW image classifier using Open NSFW - https://phabricator.wikimedia.org/T214201 (10Frostly) https://github.com/infinitered/nsfwjs might be interesting for implementation too (it can be run on Node)
[05:43:44] ragesoss: We'll take a look! afaik CORS wouldn't be enabled on LW
[05:44:26] I made an effort here based on Luca's previous patches for ores-legacy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/964625
[05:44:40] but no diff! so something is missing
[06:46:52] isaranto: o/
[06:46:54] almost!
[06:47:14] ingress is not under networkpolicy, but at the same level
[06:47:21] if you move it, it should work :)
[06:48:52] ah no wait, still sleeping :D
[06:49:26] we cannot use that one for inference, since the "ingress" module is only available for services like ores-legacy and rec-api-ng (that use the "serviceops" template)
[06:49:42] so we'll need to inject those values into our istio configs
[06:49:48] via isvc probably
[06:50:31] https://github.com/kserve/kserve/issues/721 doesn't look good
[06:51:32] people in https://github.com/kserve/kserve/issues/1902 suggest setting headers via uvicorn, sigh
[06:58:24] ok will check!
[06:58:31] afk, back online later!
[08:26:52] elukey: o/
[08:26:52] are you able to check rec-api-ng logs on LiftWing staging?
[08:26:52] on my end, running `kubectl logs -p recommendation-api-ng-main-675599698-t8nl8 -c recommendation-api-ng-main`
[08:26:52] returns `Error from server (BadRequest): previous terminated container "recommendation-api-ng-main" in pod "recommendation-api-ng-main-675599698-t8nl8" not found`
[08:26:53] yet the pod shows that it is running `kubectl get pods
NAME                                         READY   STATUS    RESTARTS   AGE
recommendation-api-ng-main-675599698-t8nl8   2/2     Running   0          18m`
[08:28:54] kevinbazira: `kubectl logs recommendation-api-ng-main-675599698-t8nl8 -n recommendation-api-ng recommendation-api-ng-main` works
[08:29:14] in your case you may not need the -n etc..
[08:30:34] I am testing the endpoint, I see
[08:30:35] Tue Oct 10 08:29:46 2023 - *** HARAKIRI ON WORKER 2 (pid: 139, try: 1) ***
[08:31:13] "Every request that will take longer than the seconds specified in the harakiri timeout will be dropped and the corresponding worker recycled."
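For anyone wanting to reproduce that locally: harakiri is a plain uwsgi timeout, so a deliberately slow handler triggers it. A minimal sketch, assuming a uwsgi setup roughly like the rec-api's (the app and flags below are illustrative, not the service's actual config):

```python
# app.py - tiny WSGI app whose only job is to outlive the harakiri timeout,
# so the worker serving it gets killed and recycled mid-request.
import time

def application(environ, start_response):
    time.sleep(30)  # deliberately longer than the harakiri timeout below
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"done\n"]
```

Running it with something like `uwsgi --http :8000 --wsgi-file app.py --harakiri 5` and curling it should produce a HARAKIRI log line like the one above instead of a response.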
[08:31:16] ah okok
[08:31:59] mornin'
[08:32:15] morning :)
[08:32:20] elukey: I do find it puzzling that kubectl logs -n recommendation-api-ng -p recommendation-api-ng-main-675599698-t8nl8 -c recommendation-api-ng-main doesn't work
[08:32:36] And the error message is confusing, too:
[08:32:36] klausman: no idea
[08:32:40] Error from server (BadRequest): previous terminated container "recommendation-api-ng-main" in pod "recommendation-api-ng-main-675599698-t8nl8" not found
[08:33:08] kevinbazira: I checked the pods' cpu consumption etc.., nothing weird registered https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=recommendation-api-ng&var-pod=recommendation-api-ng-main-675599698-t8nl8&var-container=All
[08:33:28] but I suspect that the 10 processes spawned to fetch data from wikidata are the culprit
[08:33:33] Oh, I know: I thought -p selected the pod, but it means "previous"
[08:33:39] The pod is just a plain arg
[08:34:02] So this works: kubectl logs -n recommendation-api-ng recommendation-api-ng-main-675599698-t8nl8 -c recommendation-api-ng-main
[08:34:17] -n for namespace, -c for the container, but no flag for the pod name.
[08:34:22] -c is not needed
[08:34:38] and Kevin doesn't use admin, so the -n is not needed either
[08:34:44] oh alright
[08:34:51] (it is already assumed)
[08:35:11] (03CR) 10AikoChou: [C: 03+1] events: drop support for /mediawiki/revision/create#1.x events [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930665 (https://phabricator.wikimedia.org/T267648) (owner: 10DCausse)
[08:36:01] It's a bit odd that the help text does not mention that -c is optional
[08:36:21] thanks elukey and klausman. I am now able to see the logs :)
[08:42:51] (03CR) 10AikoChou: [C: 03+1] "LGTM! I like you moved everything to the same place, so they don't scatter around different folders." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/963367 (https://phabricator.wikimedia.org/T347404) (owner: 10Ilias Sarantopoulos)
[08:43:10] 10Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (10elukey) >>! In T347475#9235450, @Isaac wrote: >> @Isaac do you reckon if we could use multi-threading instead of multiprocessing? Are those all HTTP-like calls (hence preemptable) o...
[08:43:47] kevinbazira: from https://github.com/wikimedia/research-recommendation-api/blob/master/recommendation/api/external_data/wikidata.py#L105 it seems that we use a process pool, but the import is multiprocess.dummy, which afaics from the docs uses a thread pool
[08:43:53] so it shouldn't be a problem in theory
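For context on that last point: despite the name, the dummy module mirrors the Pool API on top of threads. A minimal sketch with the stdlib's multiprocessing.dummy (the multiprocess package's dummy module behaves the same way; the fetch function is a stand-in):

```python
# multiprocessing.dummy exposes the Pool API but backs it with threads,
# so the "workers" share the interpreter and preempt on I/O instead of
# forking extra processes that uwsgi would have to manage.
from multiprocessing.dummy import Pool as ThreadPool
import threading

def fetch(item):
    # stand-in for an HTTP call to the wikidata/mediawiki APIs
    return (item, threading.current_thread().name)

with ThreadPool(10) as pool:
    results = pool.map(fetch, range(5))

# every result carries a thread name, not a PID: no fork() happened
print(results)
```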
[08:47:03] but I have a suspicion on what is happening
[08:47:06] checking
[08:47:36] elukey: sure, I am running tests locally to see whether the harakiri is affecting the workers, i.e. if a uwsgi worker is killed and respawned, does the request continue or does it get affected? Trying to get an answer to this.
[08:49:48] yes, it does.
[08:51:01] kevinbazira: I think I found the problem :)
[08:52:18] we totally forgot one thing, and we are battling against it
[08:52:30] what's that? :)
[08:53:06] kevinbazira: Let's try to brainbounce/debug it, it will surely be helpful for the next time
[08:53:20] so what I did was asking myself - what is the current issue?
[08:54:16] and afaics what is happening at the moment is that we try to hit the endpoint, and the response never comes; instead we end up in a timeout
[08:54:37] in our case it is the envoy proxy that tells us "look, I am giving up waiting, here's a 50x"
[08:55:23] second question - why don't we get any response? Is the container trying to get something that never returns as well?
[08:55:33] or is its connectivity ok?
[08:56:18] one of the things we have to remember is that a container cannot, by default, make calls to any endpoint without specific allowance
[08:56:41] the third step was checking the liftwing.ini config
[08:56:42] https://github.com/wikimedia/research-recommendation-api/blob/master/recommendation/data/recommendation_liftwing.ini#L5
[08:57:17] same thing for the line above (#4)
[08:58:02] kevinbazira: if you recall, we have a specific envoy proxy to use for calls to external endpoints; for example we worked to allow calls to swift via localhost:port
[08:59:50] great. so envoy should allow access to the apis that the container tries to access right? https://github.com/wikimedia/research-recommendation-api/blob/master/recommendation/data/recommendation_liftwing.ini#L1-L7
[09:00:00] exactly yes :)
[09:00:06] in our case, it should be the mw api
[09:00:29] so we have to add the 'discovery' config in values.yaml for the mw api (I can take care of it)
[09:00:29] ok, thank you for the clarification. let me prepare a patch for this.
[09:00:35] kevinbazira: wait wait :)
[09:00:39] there is another bit missing
[09:00:44] ok ok :)
[09:01:19] if we add as endpoint something like "http://localhost:port/api/etc.." it will not work, because the HTTP Host header will not be set correctly
[09:03:10] kevinbazira: for example, I think that https://github.com/wikimedia/research-recommendation-api/blob/master/recommendation/api/external_data/wikidata.py#L119
[09:03:49] calls https://github.com/wikimedia/research-recommendation-api/blob/master/recommendation/api/external_data/fetcher.py#L25
[09:04:18] mmmm I am now wondering though if envoy sets the Host header for us
[09:04:23] kevinbazira: --^
[09:04:32] great. please add me as a reviewer to the patch. I'd like to see the fix.
[09:05:02] no, I think it will probably set the .discovery name as Host header
[09:05:20] in our case, we'll need to explicitly set stuff like "wikidata.org" or "en.wikipedia.org"
[09:05:35] and force the post() method to use them
[09:14:49] kevinbazira: this is the first bit https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/964859
[09:15:03] but I think we'll also need to change the python code to set the host header where needed
[09:15:10] * elukey bbiab
[09:15:21] klausman: --^
[09:15:35] Looking
[09:17:11] Was the intent for me to +2 and deploy it?
[09:44:52] back
[09:45:24] klausman: nono, just to get your opinion; maybe you and kevin can work together on it?
[09:46:45] I am unsure how the string mw-api-int-async-ro ties to specific egress rules.
[09:47:33] Is it via the fixtures files?
[09:48:19] (I may also be misunderstanding the problem)
[09:48:31] fixtures are only used when CI renders the charts' diffs
[09:48:40] in this case, it is related to helmfiles
[09:49:01] it is all related to the relevant module, the mesh one
[09:49:14] there are some configs rendered on deploy2002 via puppet
[09:49:19] containing the various IPs etc..
[09:49:31] klausman: check envoy.yaml in puppet
[09:49:48] https://gerrit.wikimedia.org/g/operations/puppet/+/refs/heads/production/hieradata/common/profile/services_proxy/envoy.yaml#304
[09:49:52] (under profile/services_proxy)
[09:49:54] yes
[09:50:23] Ah, I wasn't aware the name spanned repos
[09:50:24] so the mesh module adds two things
[09:50:38] 1) tls terminator for ingress traffic
[09:50:50] 2) tls proxy to $services
[09:51:04] for 2) we need to explicitly add what services we want to contact
[09:51:13] and it adds networkpolicies, configs, etc..
[09:51:50] the result of adding mw-api-etc.. is that a localhost:6500 endpoint will be available in the pod
[09:52:00] for stuff like /w/api.php etc..
[09:52:43] klausman: the main issue atm is that we configure stuff like https://github.com/wikimedia/research-recommendation-api/blob/master/recommendation/data/recommendation_liftwing.ini#L4
[09:52:48] (see also the line for wikidata)
[09:53:06] in there we should add http://localhost:6500/w/api.php
[09:53:20] but, we'd also need to set the Host header
[09:53:20] ah, right, so I understood that part correctly
[09:53:34] and I believe we'll need to modify the code to allow it
[09:53:47] yes, localhost will likely not work as a Host: header
[09:53:48] so merging the above patch and changing the .ini file is not enough
[09:54:11] klausman: the envoy proxy may set stuff like mw-api.discovery.wmnet etc.., but it is not correct either
[09:54:26] so, having said this, klausman and kevinbazira, ok to work together on it?
[09:54:34] Does it leave already-set headers alone?
[09:54:49] I believe so, but it needs to be verified
[09:55:31] yeah, sure, though I have a question.
[09:55:54] How are the names in the ini mapped to e.g. localhost:6500?
[09:56:30] the port is specified in envoy.yaml
[09:56:36] with the endpoint etc..
[09:56:47] there is a template that renders it all
[09:56:53] it should be in the mesh module
[09:57:15] before proceeding with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/964859 we need to get the sign-off from serviceops
[09:57:34] because the endpoint is the mw-on-k8s stuff, and they are selecting what to onboard etc..
[09:57:54] (someone that is not me should follow up, to be clear :)
[09:59:42] I'm still a bit lost re: mesh
[09:59:59] please ask all the questions that you want :)
[10:00:12] but part of the answer is probably buried inside deployment-chart
[10:00:14] *charts
[10:00:46] modules/mesh contains a bunch of files, but I don't understand their significance here
[10:03:04] --verbose :)
[10:06:38] So the envoy.yaml stuff I get.
[10:07:27] The INI file... well, how does the tool know to talk to something on port 6500?
[10:08:29] For example, the ini file has `wikipedia = https://{source}.wikipedia.org/w/api.php`. How does that become localhost:6500?
[10:08:37] no, that needs to be changed
[10:08:43] this is what I was writing above
[10:08:57] < in there we should add http://localhost:6500/w/api.php
[10:09:19] the main problem is that only changing the ini is not enough
[10:09:29] we need to add the Host header in the python code
[10:09:54] so we'll call http://localhost:6500/w/api.php + Host: en.wikipedia.org or Host: wikidata.org
[10:09:57] for example
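A minimal sketch of that idea (illustrative only: the 6500 listener comes from the discussion above, and the helper name is made up, not the actual patch to fetcher.py):

```python
# Talk to the local envoy listener and pin the Host header, so envoy
# forwards to the mw-api backend and MediaWiki still routes the request
# to the right wiki.
import requests

MESH_ENDPOINT = "http://localhost:6500/w/api.php"  # assumed mesh listener

def mesh_post(host, data):
    # the Host header, not the URL, now decides which wiki answers
    response = requests.post(MESH_ENDPOINT, data=data, headers={"Host": host})
    response.raise_for_status()
    return response.json()

# e.g. mesh_post("www.wikidata.org", {"action": "query", "format": "json"})
```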
[10:11:22] o/ back here
[10:11:55] wow many messages. catching up!
[10:12:05] Ok, so `wikipedia = ...` changes to `http://localhost:6500/w/api.php` and the code needs to add the relevant Host header
[10:12:16] correct, this is my theory
[10:12:33] plus we need to add the support for the 6500 port on the pod
[10:12:42] And we can deploy that after change 964859, which needs serviceops sync
[10:12:54] that is the deployment-charts patch (that needs a follow-up with serviceops because we'd use the mw k8s api)
[10:12:58] exactly
[10:13:04] Ok, now I get it :)
[10:13:22] Is there someone in serviceops that is already familiar with the topic of moving rec-api to LW?
[10:13:48] sort of, but I think we can just tell them that it will be a low traffic volume
[10:13:56] (also, Kevin had a question about wikidata on the change, I dunno if that has been answered)
[10:15:44] yep, in the recommendation_liftwing.ini I can change `wikipedia = https://{source}.wikipedia.org/w/api.php` to `wikipedia = http://localhost:6500/w/api.php` but what does `wikidata = https://www.wikidata.org/w/api.php` change to?
[10:16:11] this bit is left as an exercise :D
[10:16:32] (but it is a very good question)
[10:17:04] kevinbazira and klausman, if you can pair up to answer the question and follow up etc..
[10:17:16] ack
[10:17:21] I am available for clarifications, but my goal is to start stepping aside
[10:17:24] as much as possible
[10:19:23] I am looking at the envoy.yaml and it doesn't seem to have a listener for wikidata: https://gerrit.wikimedia.org/g/operations/puppet/+/refs/heads/production/hieradata/common/profile/services_proxy/envoy.yaml
[10:22:20] 10Lift-Wing, 10Machine-Learning-Team: kserve CORS error - https://phabricator.wikimedia.org/T348511 (10isarantopoulos)
[10:23:06] 10Lift-Wing, 10Machine-Learning-Team: kserve CORS error - https://phabricator.wikimedia.org/T348511 (10isarantopoulos) At the moment it seems that we can modify the fastAPI app but if I am not mistaken it is more difficult to do the fix directly on istio
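If the app-level route wins, FastAPI/Starlette already ships a CORS middleware, so a sketch could look like this (the origin list is a placeholder, not a proposal for Lift Wing):

```python
# Set the CORS headers in the kserve model server's FastAPI app itself,
# instead of fixing it at the istio layer.
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://example.org"],  # placeholder origin
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)
```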
[10:28:00] kevinbazira: hint - you should focus on which servers render www.wikidata.org
[10:44:45] klausman: I had already created the two tasks for the decoms
[10:44:51] oops.
[10:44:57] I'll close mine, then
[10:44:58] let's merge them in
[10:46:18] 10Machine-Learning-Team, 10SRE, 10decommission-hardware, 10ops-codfw: decommission ores{2001..2009}.codfw.wmnet - https://phabricator.wikimedia.org/T348462 (10klausman)
[10:46:45] Done
[10:47:06] ack thanks
[10:47:08] * elukey lunch
[10:47:26] 10Machine-Learning-Team, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission ores{1001..1009}.eqiad.wmnet - https://phabricator.wikimedia.org/T348144 (10klausman)
[10:51:11] I've come across this: https://github.com/wikimedia/operations-deployment-charts/blob/9da9b1874ea44363ef1a8a2979bed473f2129487/helmfile.d/admin_ng/values/ml-serve.yaml#L370-L382
[10:51:11] It looks like they were added to resolve timeouts similar to what we are experiencing with the rec-api: https://github.com/wikimedia/operations-deployment-charts/commit/14d5aa89602a4c1b2c907cb734fb610c49d6922d
[10:51:11] Would adding hosts to the mesh resolve our issue vs adding envoy listeners?
[11:00:34] Morning all
[11:01:07] hi Chris o/
[11:13:10] I need coffee so bad
[11:14:41] o/ hey
[11:29:48] Anything I can help with?
[11:34:25] hi Chris!
[11:35:00] * aiko lunch+coffee
[11:47:28] chrisalbon: I think Kevin and I got it covered
[11:48:48] Thanks Klausman and Kevinbazira for working on that.
[11:51:53] as for the serviceops side: they're fine with us sending them the traffic.
[12:47:44] kevinbazira: very good point! So for isvcs we use istio/envoy, but they are set up in a way that we don't need to specify the localhost:port endpoint, since they are (sort of) transparent proxies/sidecars
[12:48:18] kevinbazira: for ores-legacy and rec-api-ng we use the serviceops template, which uses istio/envoy in a different way (with an explicit proxy, namely something that your code needs to be aware of)
[12:48:47] very interesting: https://grafana-rw.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?forceLogin&from=now-6h&orgId=1&to=now&var-cluster=codfw%20prometheus%2Fk8s-mlstaging&var-component=All&var-namespace=revscoring-editquality-goodfaith
[12:49:09] this is only available in staging at the moment, for goodfaith; I am going to roll it out everywhere :)
[12:49:25] there are also some python-gc metrics IIUC
[12:49:45] sweeet
[12:57:54] elukey: neat!
[12:58:49] elukey what am I looking at? Is this preprocessing the features or a cache?
[13:01:29] the latency for the preprocess and predict steps in the inference service(s)
[13:02:22] updated the dashboard with RSS, CPU time, GC, etc..
[13:02:30] (you can refresh it)
[13:02:57] chrisalbon: we are now collecting metrics from kserve related to the preprocess and predict python methods (so before cache etc..)
[13:03:10] so we will have an idea of where we spend the time
[13:03:21] ah got it thanks
[13:03:29] also good morning :)
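Conceptually the new numbers boil down to histogram timers around the two kserve methods. A rough sketch of the idea (illustrative, not kserve's actual instrumentation; both helpers below are stand-ins):

```python
# Time preprocess() and predict() separately so the dashboard can show
# where a request actually spends its time.
import time
from prometheus_client import Histogram

PREPROCESS_SECONDS = Histogram("preprocess_seconds", "preprocess() latency")
PREDICT_SECONDS = Histogram("predict_seconds", "predict() latency")

def extract_features(payload):
    time.sleep(0.05)  # stand-in for feature extraction (API calls etc.)
    return [float(len(str(payload)))]

def run_model(features):
    time.sleep(0.005)  # stand-in for the (fast) model scoring step
    return sum(features)

def score(payload):
    with PREPROCESS_SECONDS.time():  # Histogram.time() works as a context manager
        features = extract_features(payload)
    with PREDICT_SECONDS.time():
        return run_model(features)
```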
[13:09:07] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/964899
[13:09:14] ready to roll it out everywhere :)
[13:09:44] Ship It!
[13:10:18] danke
[13:10:51] nice!
[13:10:57] And good afternoon!
[13:16:42] 10Machine-Learning-Team, 10SRE, 10decommission-hardware, 10ops-codfw: decommission ores{2001..2009}.codfw.wmnet - https://phabricator.wikimedia.org/T348462 (10Papaul) a:03Jhancock.wm
[13:17:26] rolling out the new metrics in staging
[13:35:25] isaranto: very interesting data for goodfaith in prod - https://grafana-rw.wikimedia.org/d/n3LJdTGIk/kserve-inference-services
[13:35:49] we kinda knew it, but preprocess is really the bulk of the time
[13:36:11] so maybe we could think about offloading only that part to a process
[13:36:13] and not predict
[13:40:02] That fits my mental model. These are small xgboost models, they should be blazing fast, but they aren't.
[13:40:23] So if it isn't the model, it has to be the preprocessing
[13:40:32] * elukey nods
[13:41:06] the last time we tried to offload both predict and preprocess to a process, we saw a big price in terms of latency for some rev-ids
[13:41:27] I guess that serializing/deserializing features to and from processes running predict() is not worth it
[13:41:37] but it may be worth it only for features
[13:42:23] 10Machine-Learning-Team: Test the kserve batcher for Revert Risk multilingual isvc - https://phabricator.wikimedia.org/T348536 (10achou)
[13:49:54] damaging seems to be way worse than goodfaith
[13:52:21] ack
[13:53:13] will try using processes in preprocess :P
[13:53:19] process process process
[13:53:44] ahahahha
[13:54:03] https://imgflip.com/memegenerator/Inception :D
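One shape the "processes only in preprocess" idea could take, as a sketch under the assumptions above (only the small feature vector crosses the pickle boundary, while predict() stays in the main process; all names are illustrative):

```python
# Push only the CPU-heavy feature extraction into a worker process; the
# tiny feature list is cheap to serialize, unlike offloading predict() too.
import asyncio
from concurrent.futures import ProcessPoolExecutor

POOL = ProcessPoolExecutor(max_workers=2)

def extract_features(rev_id):
    # revscoring-style feature extraction would live here
    return [float(rev_id % 97), float(rev_id % 7)]

async def preprocess(rev_id):
    loop = asyncio.get_running_loop()
    # keeps the event loop free while the worker grinds through features
    return await loop.run_in_executor(POOL, extract_features, rev_id)

async def predict(features):
    return {"score": sum(features)}  # stand-in for the xgboost call
```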
[13:54:50] klausman: with the new k8s alarms I don't see a ton of latency alerts anymore for the k8s control plane
[13:54:55] fingers crossed
[13:55:03] ahhaha just said it, alarm fired
[13:55:12] Classic :)
[13:55:31] but it makes sense
[13:55:38] https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH&orgId=1
[13:55:49] there is a sustained latency for isvcs
[13:56:03] maybe we could think about adding more resources to the control plane
[13:57:26] You mean the VMs? atm they have 2 cpus, and msc1001 shows a load of 3/4
[13:57:32] (as in 0.75)
[13:58:57] the vms yes
[13:59:56] the RAM is definitely used, and I am not sure if the goroutines usage is reflected by the load
[14:00:36] https://grafana.wikimedia.org/d/000000342/node-exporter-server-metrics?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-node=ml-serve-ctrl1001:9100&var-disk_device=All&var-net_dev=All looks relatively tame. But switching to 4 cores might be a good idea.
[14:01:47] memory usage is slightly more than half the available RAM, so I doubt that is an actual problem. The control plane should not really need large amounts of page cache.
[14:02:00] in theory :)
[14:18:30] 10Machine-Learning-Team, 10Patch-For-Review: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (10Isaac) > After a closer look I think that we are already using a thread pool: > https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.d...
[14:43:29] 10Machine-Learning-Team, 10SRE, 10decommission-hardware, 10ops-codfw: decommission ores{2001..2009}.codfw.wmnet - https://phabricator.wikimedia.org/T348462 (10Jhancock.wm) 05Open→03Resolved
[14:53:56] 10Lift-Wing, 10Machine-Learning-Team: kserve CORS error - https://phabricator.wikimedia.org/T348511 (10calbon) a:03isarantopoulos
[14:58:09] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 (10klausman)
[15:05:51] 10Machine-Learning-Team, 10ORES: ORES extremely slow when to return when asking for multiple scores. - https://phabricator.wikimedia.org/T347612 (10calbon) a:03isarantopoulos
[15:06:53] 10Machine-Learning-Team, 10Patch-For-Review: Upgrade Revert Risk Multilingual docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347551 (10calbon) a:03elukey
[15:26:58] so I rolled out the metrics change in all isvcs
[15:27:14] except drafttopic eqiad, which for some reason ends up in a helmfile error
[15:27:23] Error: query: failed to query with labels: proto: Unknown: illegal tag 0 (wire type 0)
[15:27:26] codfw is all good
[15:27:49] aiko: https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-component=All&var-namespace=revertrisk
[15:27:52] looks nice :)
[15:28:41] 10Machine-Learning-Team: Visualize KServe latency metrics in a dashboard - https://phabricator.wikimedia.org/T348456 (10elukey) Created the first dashboard: https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services
[15:31:52] elukey: wow, nice dashboard and numbers :)
[15:32:24] elukey: <3
[15:32:35] thanks!
[15:34:45] 10Machine-Learning-Team: Upgrade outlink docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347549 (10achou) a:03achou
[15:36:19] (03CR) 10AikoChou: [C: 03+1] revert-risk: upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964559 (https://phabricator.wikimedia.org/T347550) (owner: 10Elukey)
[15:40:52] if there isn't any objection to this patch I plan to merge it tomorrow https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/963367
[15:46:42] (03CR) 10Ilias Sarantopoulos: revscoring: customize kserve logs (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964568 (https://phabricator.wikimedia.org/T333804) (owner: 10Ilias Sarantopoulos)
[15:57:44] isaranto: didn't have time to fully review; if Aiko already did it, go ahead!
[15:58:29] sry, I don't want to rush things, just that it enables local runs and allows debugging other things easily (like I did with mp and logging)
[15:58:55] so it doesn't have actual changes BUT moves things around (which may cause issues sometimes ofc)
[15:59:29] I think we can start adding better unit tests now also for the model servers :)
[16:00:01] isaranto: definitely, if you want I can review it tomorrow morning
[16:00:06] or you can proceed, as you wish
[16:00:58] tomorrow is fine, even thursday
[16:03:17] (03PS3) 10Ilias Sarantopoulos: revscoring: customize kserve logs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964568 (https://phabricator.wikimedia.org/T333804)
[16:03:49] (03CR) 10Ilias Sarantopoulos: [C: 03+1] revert-risk: upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964559 (https://phabricator.wikimedia.org/T347550) (owner: 10Elukey)
[16:07:25] (03PS7) 10Ilias Sarantopoulos: revscoring: allow local runs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/963367 (https://phabricator.wikimedia.org/T347404)
[16:08:22] 10Machine-Learning-Team, 10Epic, 10Patch-For-Review: Add meaningful access logs to KServe's pods - https://phabricator.wikimedia.org/T333804 (10isarantopoulos) Since asgi-logger can only be used if we specify the `access_log_format`, we defined the environment variable `LOGGING_FORMAT` in the above patch to a...
:)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964559 (https://phabricator.wikimedia.org/T347550) (owner: 10Elukey) [16:24:31] * elukey afk! [16:24:38] have a good rest of the day folks :) [16:25:29] bye luca! [16:29:03] ciao and enjoy the evening [16:32:40] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141 (10Sgs) a:05kostajh→03Sgs I ran this script for adding the link-recommendation task type and populating the excluded sections entries: `lang=bash PHA... [16:38:51] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10Patch-For-Review, 10User-notice: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141 (10Sgs) [17:31:37] (03PS1) 10Ilias Sarantopoulos: revscoring: bump scipy version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964954 [17:32:30] aiko: I managed to fix the issue we had with articlequality and apple silicon --^ [17:32:36] hope it indeed works [17:39:37] isaranto: wow nice, thank you <3 I'll let you know if it works on my end [18:16:46] (03PS1) 10Varnent: Update link to privacy policy. [services/ores] - 10https://gerrit.wikimedia.org/r/964960 (https://phabricator.wikimedia.org/T331680) [19:43:02] (03PS1) 10Ladsgroup: Migrate away from LB/LBF to ICP [extensions/ORES] - 10https://gerrit.wikimedia.org/r/964969 (https://phabricator.wikimedia.org/T330641) [19:44:59] (03CR) 10CI reject: [V: 04-1] Migrate away from LB/LBF to ICP [extensions/ORES] - 10https://gerrit.wikimedia.org/r/964969 (https://phabricator.wikimedia.org/T330641) (owner: 10Ladsgroup) [19:46:07] (03PS2) 10Ladsgroup: Migrate away from LB/LBF to ICP [extensions/ORES] - 10https://gerrit.wikimedia.org/r/964969 (https://phabricator.wikimedia.org/T330641) [21:41:03] (03CR) 10Jforrester: [C: 03+2] Migrate away from LB/LBF to ICP [extensions/ORES] - 10https://gerrit.wikimedia.org/r/964969 (https://phabricator.wikimedia.org/T330641) (owner: 10Ladsgroup) [21:54:00] (03Merged) 10jenkins-bot: Migrate away from LB/LBF to ICP [extensions/ORES] - 10https://gerrit.wikimedia.org/r/964969 (https://phabricator.wikimedia.org/T330641) (owner: 10Ladsgroup)