[01:18:15] FIRING: ErrorBudgetBurn: - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:18:30] FIRING: ErrorBudgetBurn: - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:23:06] o/
[05:23:32] I'm looking at the alerts --^
[05:50:32] there was a spike in preprocess latencies for enwiki-articlequality a couple of hours ago which resulted in the latency budget burn in the SLO dashboards
[05:50:32] https://grafana.wikimedia.org/goto/BvJYw6lIR?orgId=1
[05:50:32] https://grafana.wikimedia.org/d/slo-Lift_Wing_Revscoring/lift-wing-revscoring-slo-s?orgId=1
[05:51:24] things are fine now regarding the latencies. I'm gathering some stuff in a doc so that we can discuss all the alerts together in today's/tomorrow's meeting
[06:33:24] Machine-Learning-Team, Patch-For-Review: Reorganize LiftWing isvcs repo structure to improve maintainability - https://phabricator.wikimedia.org/T369344#9984000 (kevinbazira)
[06:33:47] o/
[06:34:10] the readability model-server migrated to the src dir is up and running in prod (both eqiad and codfw): https://phabricator.wikimedia.org/P66582
[06:38:15] RESOLVED: ErrorBudgetBurn: - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:43:53] o/ kevin, nice!
[08:39:44] Morning!
[08:41:22] Guten Morgen!
[08:41:54] I'll be pushing the securityContext update for Istio to prod-codfw in a moment. Shouldn't break anything
[08:45:12] and done. keeping an eye on things for a bit
[08:46:51] ack, I'll be deploying ores-legacy and rec-api first in staging
[08:48:04] Roger
[08:48:11] Machine-Learning-Team, Cloud-VPS (Debian Buster Deprecation): Cloud VPS "machine-learning" project Buster deprecation - https://phabricator.wikimedia.org/T367537#9984273 (klausman) Open→Resolved
[09:53:37] Machine-Learning-Team, Observability-Metrics, SRE Observability (FY2024/2025-Q1): Gap in metrics rendered from Thanos Rules - https://phabricator.wikimedia.org/T352756#9984495 (fgiunchedi) >>! In T352756#9965404, @gerritbot wrote: > Change #1052784 **merged** by jenkins-bot: > %%%[operations/alerts@m...
[09:56:35] after testing that everything was fine in staging, I've also deployed both to prod. everything works as expected
[10:01:54] nice!
[10:04:47] thanks :)
[10:13:48] :)
[10:43:41] * klausman lunch
[10:48:01] * isaranto ditto!
[11:35:20] (PS1) Kevin Bazira: outlink_topic_model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054526 (https://phabricator.wikimedia.org/T369344)
[11:56:43] (CR) Ilias Sarantopoulos: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054526 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[12:53:55] Machine-Learning-Team: [LLM] Use vllm with rocm in huggingface image - https://phabricator.wikimedia.org/T370149 (isarantopoulos) NEW
[13:02:03] I'll now push the above-mentioned Istio changes to eqiad as well. Again, should be without faults.
[13:05:24] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#9985121 (klausman) All pushed (staging, 2x prod)
[13:07:01] roger that!
[13:08:30] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#9985124 (klausman) Open→Resolved
[13:33:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[13:33:49] Deployment enwiki-articlequality-predictor-default-00020-deployment in revscoring-articlequality at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[13:33:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revscoring-articlequality&var-deployment=enwiki-articlequality-predictor-default-00020-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:38:19] ouch
[13:38:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[13:38:49] Deployment enwiki-articlequality-predictor-default-00020-deployment in revscoring-articlequality at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[13:38:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revscoring-articlequality&var-deployment=enwiki-articlequality-predictor-default-00020-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:41:28] Lift-Wing, Machine-Learning-Team: Use vllm for ROCm in huggingface image - https://phabricator.wikimedia.org/T370149#9985362 (isarantopoulos)
[13:45:00] Machine-Learning-Team, Patch-For-Review: Deploy 7b parameter models from HF - https://phabricator.wikimedia.org/T354870#9985384 (isarantopoulos) Open→Resolved The current work can be marked done as we can now deploy images using the huggingfaceserver.
[13:45:34] Lift-Wing, Machine-Learning-Team: Investigate deployment of gemma2 on LiftWing - https://phabricator.wikimedia.org/T369055#9985420 (isarantopoulos) Open→Resolved
[13:45:37] Machine-Learning-Team: Investigate inference optimization frameworks for Large Language Models (LLMs) - https://phabricator.wikimedia.org/T354257#9985402 (isarantopoulos) Open→Resolved The current task can be marked done as after investigation vllm seems to be the most prominent solution for an...
[13:46:15] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9985425 (isarantopoulos) Open→Resolved The current work can be marked done as we can now deploy images using the huggingfaceserver and in a stable way after comp...
[16:18:27] Machine-Learning-Team: Upgrade Knative control plane Docker images to Bullseye/Bookworm - https://phabricator.wikimedia.org/T368359#9986535 (isarantopoulos) p:Triage→Medium
[16:18:37] Machine-Learning-Team, Patch-For-Review: Reorganize LiftWing isvcs repo structure to improve maintainability - https://phabricator.wikimedia.org/T369344#9986536 (isarantopoulos) p:Triage→Medium
[16:18:42] Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9986537 (isarantopoulos) p:Triage→Medium
[16:18:51] Machine-Learning-Team, Patch-For-Review: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons - https://phabricator.wikimedia.org/T363449#9986538 (isarantopoulos) p:Triage→Medium
[16:18:55] Machine-Learning-Team: Run load tests for the rec-api-ng and update production resources to meet expected load - https://phabricator.wikimedia.org/T365554#9986539 (isarantopoulos) p:Triage→Medium
[16:19:43] Machine-Learning-Team: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9986546 (isarantopoulos) p:Triage→Medium
[16:19:47] Machine-Learning-Team: Investigate kserve 0.13.0 upgrade - https://phabricator.wikimedia.org/T367048#9986548 (isarantopoulos) p:Triage→Medium
[16:19:53] Machine-Learning-Team: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#9986551 (isarantopoulos) p:Triage→Medium
[16:20:00] Machine-Learning-Team, Epic: Epic: Implement prototype inference service that uses Cassandra for request caching - https://phabricator.wikimedia.org/T356256#9986552 (isarantopoulos) p:Triage→Medium
[16:20:06] Machine-Learning-Team: Investigate how to improve model card integration with existing user flows - https://phabricator.wikimedia.org/T353025#9986554 (isarantopoulos) p:Triage→Medium
[16:20:17] artificial-intelligence, Machine-Learning-Team: LLM that specializes in assisting Wikimedia/MediaWiki technical contributors - https://phabricator.wikimedia.org/T353974#9986555 (isarantopoulos) p:Triage→Medium
[16:20:38] logging off folks, have a nice evening/rest of day o/
[16:26:36] \o
[17:22:34] (CR) Kevin Bazira: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054526 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[17:23:16] (CR) CI reject: [V:-1] outlink_topic_model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054526 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)