[02:14:48] (InfServiceHighMemoryUsage) firing: (2) High Memory usage detected in Inference Service - https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Alerts#Inference_Services_High_Memory_Usage_-_InfServiceHighMemoryUsage_alert - https://alerts.wikimedia.org/?q=alertname%3DInfServiceHighMemoryUsage
[06:14:48] (InfServiceHighMemoryUsage) firing: (2) High Memory usage detected in Inference Service - https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Alerts#Inference_Services_High_Memory_Usage_-_InfServiceHighMemoryUsage_alert - https://alerts.wikimedia.org/?q=alertname%3DInfServiceHighMemoryUsage
[07:14:04] (PS1) MPGuy2824: Replace makeList() with ExpressionGroups [extensions/ORES] - https://gerrit.wikimedia.org/r/1007485
[07:55:29] Machine-Learning-Team: Investigate InfServiceHighMemoryUsage for article-descriptions - https://phabricator.wikimedia.org/T358742 (klausman)
[08:58:47] Machine-Learning-Team: Investigate InfServiceHighMemoryUsage for article-descriptions - https://phabricator.wikimedia.org/T358742#9586312 (klausman) I have found this: > Memory Utilization, Saturation and Errors > The memory metrics that are tracked in the cAdvisor are a subset of the 43 memory metrics expo...
[08:58:49] Morning!
[08:59:11] kevinbazira: I think I have found the root cause for the memory alerts: we're using the wrong metric.
[08:59:37] klausman: o/
[09:00:12] ok, is this metric used only by article-descriptions?
[09:00:50] it's used on all kserve containers, AIUI
[09:01:12] why it doesn't fire for other services is unclear, but at any rate, it is the wrong metric to alert on
[09:01:38] `(container_memory_usage_bytes{container="kserve-container"} / container_spec_memory_limit_bytes{container="kserve-container"}) > 0.9` is the current rule
[09:02:38] I will dig a bit deeper to see if the alerting rule can be improved further
[09:05:32] thank you for digging into this klausman. it makes sense that the wrong metric is the one firing the alert because we used the same memory resources on staging and didn't run into any OOM issues.
[09:06:27] I think it doesn't fire in staging because we only alert for the prod clusters (note the deploy tags in the inf_services.yaml file)
[09:08:20] yes, we alert in prod but if indeed there were OOM issues the service would have failed and we would have seen them in staging logs.
[09:08:40] Ack.
[09:32:43] Machine-Learning-Team: Investigate how to implement batch inference for revertrisk-multilingual - https://phabricator.wikimedia.org/T355656#9586392 (achou)
[09:36:07] I have a hypothesis why the other services never alerted: the base usage is much lower than the limit, and they don't do enough disk I/O to fill the page cache to the point the combined metric (working set + page cache etc.) gets close to the limit
[09:38:49] Machine-Learning-Team, Patch-For-Review: Investigate InfServiceHighMemoryUsage for article-descriptions - https://phabricator.wikimedia.org/T358742#9586403 (klausman) Hypothesis why the other services never alerted: their base usage (`container_memory_working_set_bytes`) is much lower than the limit, and...
[09:50:30] Machine-Learning-Team: Deploy RR-language-agnostic batch version to prod - https://phabricator.wikimedia.org/T358744 (achou)
[09:55:53] Machine-Learning-Team: Deploy RR-language-agnostic batch version to prod - https://phabricator.wikimedia.org/T358744#9586434 (achou)
[09:55:57] Machine-Learning-Team, Goal: Goal: Lift Wing users can request multiple predictions using a single request. - https://phabricator.wikimedia.org/T348153#9586435 (achou)
[10:12:38] Machine-Learning-Team: Improving error message for Revertrisk models - https://phabricator.wikimedia.org/T351278#9586500 (achou) Knowledge Integrity v0.6.0 improved error representations by introducing an Error data class and different error codes for various situations when fetching MediaWiki API for revisi...
[10:35:54] Machine-Learning-Team: Prep work for (re)training workflow sprint - https://phabricator.wikimedia.org/T358748 (achou)
[10:36:51] klausman: nice work, I think you are on the right path, left a comment in the code review
[10:37:05] and I already replied :)
[10:37:10] ah nice :)
[10:38:52] One difference would be that the SRE/generic is at 95% and ours is at 90%, but I don't think that's a problem.
[10:39:37] yeah exactly
[10:41:32] But as mentioned, I think we already get the benefit of the team-sre alerts by virtue of the deploy: tag. I've asked in the k8s-sig IRC channel for clarification
[10:42:48] If that is true, we can just remove the two files and all's good. I'd also add a README pointing out the deploy: tag and that we (ML) might get alerts from other subdirs by that mechanism
[10:45:27] the only question mark is how to get alerts for, say, kserve containers here, that would be nice
[10:45:35] but not sure if possible with the current state of configs
[10:45:41] not mandatory, but it would be nice
[10:47:17] klausman: another big difference is that those alerts are warnings
[10:47:25] not sure if they are posted on IRC anywhere
[10:47:37] Good point
[10:47:54] also, it looks like the alerts with that deploy tag from sre-team don't make it to our config for some reason
[10:49:53] ah, nvm, it lives in /srv/alerts not /srv/prometheus
[10:50:06] (since AM is a separate process/app)
[10:51:00] elukey: So do we make a copy of the sre-team alert for ourselves, as sev:crit and drop the deploy:ml tag from theirs?
[10:51:14] I think that might be the best strategy
[10:51:43] I'd also drop our site: tag, since it currently does nothing
[10:53:21] it is still a duplication of effort, I'd prefer to keep those alerts in one place if possible.. We can duplicate if serviceops thinks that a broad alert would be too spammy though, but first let's try to find a common solution (you are already doing the work posting to the k8s chan, just wait for a consensus is my suggestion)
[10:53:36] ack
[11:19:48] * klausman lunch
[11:30:33] (PS15) MPGuy2824: Replace makeList() with ExpressionGroups [extensions/ORES] - https://gerrit.wikimedia.org/r/1007485 (https://phabricator.wikimedia.org/T350986)
[11:31:41] Machine-Learning-Team, MediaWiki-extensions-ORES, 1.42.0-wmf.21; 2024-03-05, Patch-For-Review, Technical-Debt: Use expression builder instead of raw SQL in ORES - https://phabricator.wikimedia.org/T350986#9586669 (MPGuy2824)
[11:45:35] * elukey lunch
[12:29:56] Morning all
[12:30:45] Morning, Chris!
[12:55:03] hello hello
[13:43:29] Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#9586940 (diego) @kostajh, to the best of my knowledge @KStoller-WMF is leading this project. We had...
[14:32:14] We really need to either pass this extension to another team or rename it.
[14:32:28] Ideally the first
[14:33:51] in theory it has been renamed but we cannot change the name of the extension (repo etc..), maybe we could try and change the labels in phab (but it would add confusion to people for sure)
[14:34:43] folks in https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1007400/3/images/kserve/storage-initializer/Dockerfile.template I filed a proposal to migrate the storage-initializer's image to Debian Bookworm
[14:34:48] this has some implications, namely:
[14:35:10] 1) Use of python 3.11 (but limited to the storage-init container, so nothing that can affect the kserve one)
[14:35:42] 2) A different use of pip, due to how Debian now suggests installing packages (see https://www.debian.org/releases/bookworm/amd64/release-notes/ch-information.en.html#python3-pep-668)
[14:36:09] it may be a good test for future upgrades, for example when we'll move all containers to bookworm
[14:36:19] if you have ideas/thoughts please chime in :)
[14:36:49] The solution that I chose/tested uses pipx but we can do otherwise
[14:39:17] I think decoupling Py version in the storage init from what we run elsewhere in the same pod is fine. pipx is something I have been using for private stuff for a while and I think it's a good fit for us. Having a single user (instead of nobody) has some security implications, but I don't think they outweigh the benefits. So overall: +1 from me
[14:39:50] lemme c&p that into the review for posterity
[15:02:28] what are the security implications that you are worried about?
[15:02:49] (just curious)
[15:03:00] the main one is probably having the home dir writable
[15:42:38] mmm maybe with bare venvs we could keep using nobody
[15:52:49] testing it :)
[15:54:46] yep!
[15:54:48] (InfServiceHighMemoryUsage) firing: (2) High Memory usage detected in Inference Service - https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Alerts#Inference_Services_High_Memory_Usage_-_InfServiceHighMemoryUsage_alert - https://alerts.wikimedia.org/?q=alertname%3DInfServiceHighMemoryUsage
[16:04:01] but I don't love the idea of having the entire venv available in the container
[16:04:04] mmm
[16:13:47] yes, my concern was a request "somehow" being able to make the isvc overwrite one of its py modules
[16:21:44] the main issue is that even now, with a venv, the nobody user is kinda able to run some build commands
[16:21:54] it is true that it is an init container etc..
[16:22:21] but I have an idea, will do some testing
[16:22:32] I think the writability of the venv is actually a smaller concern with a storage initializer, since nothing talks to it.
[16:24:05] yes yes but since we are doing the refactoring..
[16:24:34] I am thinking to basically use a multi-stage build, create the venv + install kserve, and copy only the libs to the final image
[16:24:44] site-packages basically
[16:24:57] That would also work.
[16:50:19] logging off, see you tomorrow folks!
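The multi-stage idea from 16:24:34 (build a venv, install kserve into it, copy only site-packages to the final image) can be approximated outside Docker. The following is a minimal Python sketch under stated assumptions, not the Dockerfile from the review: the directory names `builder-venv` and `final-image` are hypothetical, `fake_kserve.py` is a made-up stand-in for a real `pip install kserve`, and the `lib/python*/site-packages` glob assumes a POSIX venv layout.

```python
import shutil
import tempfile
import venv
from pathlib import Path


def build_and_extract(workdir: Path) -> Path:
    """Create a venv in a 'builder' dir and copy only its site-packages out."""
    builder = workdir / "builder-venv"  # hypothetical name for the build stage
    # with_pip=False keeps the sketch fast; a real build stage would install kserve.
    venv.create(builder, with_pip=False, symlinks=True)

    # Assumes the POSIX venv layout: lib/pythonX.Y/site-packages.
    site_packages = next(builder.glob("lib/python*/site-packages"))

    # Hypothetical stand-in for the packages a real install would place here.
    (site_packages / "fake_kserve.py").write_text("VERSION = 'hypothetical'\n")

    # The 'final image' receives the libraries only: no bin/, no pyvenv.cfg.
    final_libs = workdir / "final-image" / "site-packages"
    shutil.copytree(site_packages, final_libs)
    return final_libs


with tempfile.TemporaryDirectory() as tmp:
    libs = build_and_extract(Path(tmp))
    copied = sorted(p.name for p in libs.iterdir())
    has_bin = (libs.parent / "bin").exists()
    print("copied:", copied)
    print("bin/ present in final dir:", has_bin)
```

The point of the split is that the build tooling and the venv's activation scripts stay in the throwaway stage, so whatever user runs in the final image has only the library files available.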
[16:53:13] \o
[17:42:00] night elukey
[19:47:59] (CR) Ladsgroup: Replace makeList() with ExpressionGroups (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1007485 (https://phabricator.wikimedia.org/T350986) (owner: MPGuy2824)
[19:48:58] (CR) Ladsgroup: [C: +2] Replace makeList() with ExpressionGroups (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1007485 (https://phabricator.wikimedia.org/T350986) (owner: MPGuy2824)
[19:52:31] (Merged) jenkins-bot: Replace makeList() with ExpressionGroups [extensions/ORES] - https://gerrit.wikimedia.org/r/1007485 (https://phabricator.wikimedia.org/T350986) (owner: MPGuy2824)
[19:54:49] (InfServiceHighMemoryUsage) firing: (2) High Memory usage detected in Inference Service - https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Alerts#Inference_Services_High_Memory_Usage_-_InfServiceHighMemoryUsage_alert - https://alerts.wikimedia.org/?q=alertname%3DInfServiceHighMemoryUsage
[23:54:48] (InfServiceHighMemoryUsage) firing: (2) High Memory usage detected in Inference Service - https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Alerts#Inference_Services_High_Memory_Usage_-_InfServiceHighMemoryUsage_alert - https://alerts.wikimedia.org/?q=alertname%3DInfServiceHighMemoryUsage
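The alert that keeps firing is evaluated by the rule quoted at 09:01:38, which divides `container_memory_usage_bytes` by `container_spec_memory_limit_bytes`. The hypothesis from 09:36:07 is that the usage metric includes page cache on top of the working set, so an I/O-heavy container can cross 90% while its working set stays far below the limit. A minimal Python sketch of that arithmetic follows; all byte counts are made-up illustrative values (not measurements from article-descriptions), and the `MemorySample` class and `fires` helper are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class MemorySample:
    """One scrape of a container's memory metrics, in bytes (illustrative values)."""
    usage_bytes: int        # container_memory_usage_bytes (working set + page cache etc.)
    working_set_bytes: int  # container_memory_working_set_bytes
    limit_bytes: int        # container_spec_memory_limit_bytes


def fires(value_bytes: int, limit_bytes: int, threshold: float = 0.9) -> bool:
    """Mirror the shape of the current rule: (metric / limit) > threshold."""
    return value_bytes / limit_bytes > threshold


# Hypothetical container: 4 GiB limit, 1.5 GiB working set, page cache pushing
# the combined usage metric up to 3.8 GiB.
sample = MemorySample(
    usage_bytes=int(3.8 * GIB),
    working_set_bytes=int(1.5 * GIB),
    limit_bytes=4 * GIB,
)

usage_alert = fires(sample.usage_bytes, sample.limit_bytes)
working_set_alert = fires(sample.working_set_bytes, sample.limit_bytes)

print(f"usage / limit       = {sample.usage_bytes / sample.limit_bytes:.2f} -> fires: {usage_alert}")
print(f"working set / limit = {sample.working_set_bytes / sample.limit_bytes:.2f} -> fires: {working_set_alert}")
```

With these assumed numbers the usage-based check fires and the working-set-based check does not, which is consistent with the observation at 09:05:32 that no OOM issues were seen despite the alert.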