[00:26:44] FIRING: LiftWingServiceErrorRate: ...
[00:26:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=huwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[00:36:44] RESOLVED: LiftWingServiceErrorRate: ...
[00:36:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=huwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:30:06] /me lunch
[12:45:41] hello folks!
[12:46:40] 'ello
[12:50:35] thanks for the reviews, I am going to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1028552 later on
[12:50:42] to change the autoscaling settings
[12:51:26] :+1:
[13:03:29] Morning all
[13:07:35] o/
[13:16:28] chrisalbon: qq - are we doing the team meeting? (not sure if you need to run away for the meetings etc..)
[13:16:46] Yeah yeah
[13:16:53] This other meeting is like hours away
[13:17:32] okok :)
[13:23:59] the new autoscaling settings have been deployed
[13:27:33] (PS2) Elukey: huggingface: upgrade base image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1028393 (https://phabricator.wikimedia.org/T362984)
[13:27:45] hopefully this time it will work :D
[13:30:53] TIL https://docs.python.org/3.12/howto/perf_profiling.html
[13:30:58] this is really nice
[13:36:57] oh, yeah. Python profiling has come a long way since I last had to use it
[13:37:32] (CR) Klausman: [C:+1] huggingface: upgrade base image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1028393 (https://phabricator.wikimedia.org/T362984) (owner: Elukey)
[13:37:44] It would be very handy right now for what I am doing, but 3.12 will be on Trixie so for the moment we'll have to wait
[13:48:14] (CR) Elukey: [C:+2] huggingface: upgrade base image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1028393 (https://phabricator.wikimedia.org/T362984) (owner: Elukey)
[13:48:56] (Merged) jenkins-bot: huggingface: upgrade base image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1028393 (https://phabricator.wikimedia.org/T362984) (owner: Elukey)
[13:56:57] Machine-Learning-Team, ORES, Patch-For-Review: Add slow-logs for ML isvcs - https://phabricator.wikimedia.org/T362663#9777512 (elukey) All the revscoring Docker images running in production now log the request id (associated with the related x-request-id header). This turned out to be sufficient to f...
[14:00:18] Machine-Learning-Team, Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9777519 (kevinbazira) @mfossati and I were able to download images that we uploaded to the commons stash by using the stash file URL and a cookie as shown in the fun...
[14:08:40] Machine-Learning-Team, Goal: 2024 Q4 Goal: Revert Risk models are supported by caching in production - https://phabricator.wikimedia.org/T362672#9777539 (calbon) Update: - Working on plumbing on staging, should be done within week - Feeling good about it
[14:10:21] Machine-Learning-Team, Goal: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU - https://phabricator.wikimedia.org/T362670#9777544 (calbon) Update: - Wait for vendor (Supermicro) to finalize order of 2x for ml-staging.
[14:19:22] Machine-Learning-Team, Goal: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services - https://phabricator.wikimedia.org/T362674#9777567 (calbon) - Narrowed down cause of symptoms of spike in CPU usage to feature extraction in revscoring isvc. Might be c...
[14:23:04] Machine-Learning-Team, ORES: Create basic alerts for isvcs to catch outages - https://phabricator.wikimedia.org/T362661#9777584 (klausman) Open→Resolved
[14:23:06] Machine-Learning-Team, Cassandra: Figure out a way to query Cassandra node IPs from `profile::kubernetes::deployment_server::global_config` - https://phabricator.wikimedia.org/T362649#9777587 (klausman) Open→Resolved
[14:23:45] Machine-Learning-Team, Research: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#9777590 (calbon) a:kevinbazira
[14:24:12] Machine-Learning-Team, Cassandra: Figure out a way to query Cassandra node IPs from `profile::kubernetes::deployment_server::global_config` - https://phabricator.wikimedia.org/T362649#9777588 (klausman)
[14:24:55] Machine-Learning-Team, Research: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#9777593 (calbon) a:kevinbazira→None
[14:32:19] Machine-Learning-Team, Patch-For-Review: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons - https://phabricator.wikimedia.org/T363449#9777620 (klausman) >>! In T363449#9773855, @elukey wrote: > @klausman leaving the decision to you :) You can file pat...
[14:37:39] (CR) Kevin Bazira: "We were able to download images uploaded to the stash using the file URL and a cookie as shown in: https://phabricator.wikimedia.org/T3627" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023542 (https://phabricator.wikimedia.org/T363449) (owner: Kevin Bazira)
[14:45:20] Lift-Wing, Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9777685 (elukey) Still seeing the old issue with ROCm 5.6: ` amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)...
[15:00:35] klausman and elukey: I've added the service entry as discussed in the meeting: https://gerrit.wikimedia.org/r/1027484
[15:00:35] please review whenever you get a minute. thanks!
[15:00:51] :+1:
[15:02:08] the only thing that I would do is to check if the MW API can serve the Host header
[15:02:16] I mean proxying one curl request, just to be sure
[15:02:28] in theory yes, but it would be best to check to avoid issues later on
[15:02:33] but looks good :)
[15:03:33] +1 from me, with the same suggestions as Luca did :)
[15:20:52] if we were to run a curl request like this one:
[15:20:52] ```
[15:20:52] curl -v -H "Host: commons.wikimedia.org" "http://mw-api-int-ro.discovery.wmnet:4680/w/index.php?title=Special:FilePath&file=Cambia_logo.png&width=224"
[15:20:53] ```
[15:20:53] wouldn't this patch have to be merged first? or am I missing something?
[15:22:46] nono I mean from a host like stat100x or even deploy1002, just to verify that what you wrote above works
[15:23:10] we are assuming that the mw-api-int-ro supports commons.wikimedia.org, but it is better to verify
[15:23:34] also you'll need to use the https:// protocol and the 4446 port
[15:25:24] for example, I am running the following from stat1004
[15:25:37] curl -v -H "Host: commons.wikimedia.org" "https://mw-api-int-ro.discovery.wmnet:4446/w/index.php?title=Special:FilePath&file=Cambia_logo.png&width=224"
[15:25:45] afaics there are multiple HTTP redirects
[15:27:11] this is a problem due to https://phabricator.wikimedia.org/T363725
[15:27:36] or better, it is something that we can handle in the code, but it needs to be kept in mind
[15:30:11] looks like even `http://commons.wikimedia.org/w/index.php?title=Special:FilePath&file=Cambia_logo.png&width=224`
[15:30:11] eventually redirects to `https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Cambia_logo.png/224px-Cambia_logo.png`
[15:30:32] in the browser
[15:33:52] yep
[15:35:35] we can handle it in the code, aiohttp supports not following redirects
[15:35:43] but we need to fix the Location header etc..
[15:43:33] okok I'll proceed to merge the service entry patch above and prepare to handle redirections within the model-server
[15:45:35] kevinbazira: wait a sec :)
[15:45:42] okok ...
[15:45:53] if you see above you also have upload.wikimedia.org
[15:46:19] so I think it will need to be added as well
[15:49:06] done :)
[15:50:09] super
[15:50:32] also remember that Tobias has to deploy it, only sres can deploy stuff to admin_ng
[15:51:37] sure sure. I'll ask Tobias to deploy as soon as it's been merged
[15:53:03] Will staging be enough for now? There is still a big unrelated diff on admin_ng in serve-eqiad. I'll push that tomorrow morning, so it'd be with/behind that.
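Editor's note on the redirect handling discussed above ("aiohttp supports not following redirects, but we need to fix the Location header"): a minimal sketch of the Location rewriting step, using only the standard library. The `INTERNAL_ENDPOINTS` mapping and the `next_hop` helper are hypothetical names made up for illustration. The single mapping entry mirrors the curl example in the log, and the choice to request any unmapped host (such as upload.wikimedia.org) directly is an assumption, not something the log confirms.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumption: public hosts that have to be reached through an internal endpoint.
# The single entry mirrors the curl example in the log; it is not a confirmed config.
INTERNAL_ENDPOINTS = {
    "commons.wikimedia.org": "https://mw-api-int-ro.discovery.wmnet:4446",
}


def next_hop(location, public_url, endpoints=INTERNAL_ENDPOINTS):
    """Return (url_to_request, extra_headers) for a redirect's Location value.

    public_url is the public form of the URL that produced the redirect; it is
    only used to resolve a relative Location.
    """
    target = urljoin(public_url, location)
    parts = urlsplit(target)
    endpoint = endpoints.get(parts.hostname)
    if endpoint is None:
        # Host is not mapped: request the redirect target as-is (assumption).
        return target, {}
    internal = urlsplit(endpoint)
    # Keep the path and query, swap scheme/host/port for the internal endpoint,
    # and carry the public host in the Host header, as in the curl example.
    url = urlunsplit((internal.scheme, internal.netloc, parts.path, parts.query, ""))
    return url, {"Host": parts.hostname}
```

With aiohttp, one way to use this would be to call `session.get(url, headers=headers, allow_redirects=False)`, read `resp.headers["Location"]` whenever the status is a 3xx, pass it through `next_hop`, and repeat with a small cap on the number of hops.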
[16:08:56] Lift-Wing, Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9778054 (elukey) A lot of useful info in https://en.wikipedia.org/wiki/Direct_Rendering_Manager, it is also mentioned DRM-Auth and what it does.
[16:16:32] Lift-Wing, Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9778080 (elukey) Tried https://wikitech.wikimedia.org/wiki/Machine_Learning/AMD_GPU#Reset_the_GPU_state and killed/restarted the mistral pod, just as a test to see if anything was in a weird...
[16:52:24] all right logging off folks!
[16:52:27] have a nice rest of the day
[17:35:23] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.43-notes (1.43.0-wmf.4; 2024-05-07), Wikimedia-production-error: UnexpectedValueException: Wikimedia\Rdbms\InsertQueryBuilder::execute can't have empty $rows value - https://phabricator.wikimedia.org/T364218#9778423 (Zabe)
[17:35:30] (PS1) Zabe: Avoid empty insert in SqlScoreStorage::storeScores [extensions/ORES] (wmf/1.43.0-wmf.3) - https://gerrit.wikimedia.org/r/1028583 (https://phabricator.wikimedia.org/T364218)
[18:05:39] Machine-Learning-Team: Allow to set Catboost's threads in readability-liftwing - https://phabricator.wikimedia.org/T353461#9778540 (leila)
[18:05:51] Machine-Learning-Team: Allow to set Catboost's threads in readability-liftwing - https://phabricator.wikimedia.org/T353461#9778532 (leila) It looks like this task was a request by the ML team and they're calling it Done on their end. I'm going to remove Research's tag on our end and the ML team can close or...
[18:28:08] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.43-notes (1.43.0-wmf.4; 2024-05-07), Patch-For-Review, Wikimedia-production-error: UnexpectedValueException: Wikimedia\Rdbms\InsertQueryBuilder::execute can't have empty $rows value (via ORE... - https://phabricator.wikimedia.org/T364218#9778658
[18:33:24] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.43-notes (1.43.0-wmf.4; 2024-05-07), Patch-For-Review, Wikimedia-production-error: UnexpectedValueException: Wikimedia\Rdbms\InsertQueryBuilder::execute can't have empty $rows value (via ORE... - https://phabricator.wikimedia.org/T364218#9778697
[18:36:10] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.43-notes (1.43.0-wmf.4; 2024-05-07), Patch-For-Review, Wikimedia-production-error: UnexpectedValueException: Wikimedia\Rdbms\InsertQueryBuilder::execute can't have empty $rows value (via ORE... - https://phabricator.wikimedia.org/T364218#9778720
[20:20:54] (CR) Zabe: [C:+2] Avoid empty insert in SqlScoreStorage::storeScores [extensions/ORES] (wmf/1.43.0-wmf.3) - https://gerrit.wikimedia.org/r/1028583 (https://phabricator.wikimedia.org/T364218) (owner: Zabe)
[20:23:17] (Merged) jenkins-bot: Avoid empty insert in SqlScoreStorage::storeScores [extensions/ORES] (wmf/1.43.0-wmf.3) - https://gerrit.wikimedia.org/r/1028583 (https://phabricator.wikimedia.org/T364218) (owner: Zabe)
[20:29:14] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.43-notes (1.43.0-wmf.4; 2024-05-07), Patch-For-Review, Wikimedia-production-error: UnexpectedValueException: Wikimedia\Rdbms\InsertQueryBuilder::execute can't have empty $rows value (via OR... - https://phabricator.wikimedia.org/T364218#9779065