[05:56:16] o/ Have a great week every1!
[05:56:40] I started creating some patches to start adding alerts https://gerrit.wikimedia.org/r/c/operations/puppet/+/958072
[05:58:29] (CR) Kevin Bazira: [C: +2] "Thanks for the reviews :)" [research/recommendation-api] - https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T346218) (owner: Kevin Bazira)
[06:00:18] (Merged) jenkins-bot: Load preprocessed numpy arrays from swift [research/recommendation-api] - https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T346218) (owner: Kevin Bazira)
[06:00:19] I was thinking to start adding these alerts for ml-staging.
[06:11:48] morning :)
[06:12:45] I didn't see https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/958052 yesterday sorry, but we already use those limits (the CI diff is empty to confirm)
[06:12:59] one thing that we could do is to raise the limit to 4G
[06:13:03] leaving request to 2
[06:13:25] (killed eswiki damaging again in the meantime)
[06:14:26] I was surprised to see the diff empty as I did a describe on the pod and saw 1G but probably I was looking at another container or sth
[06:15:13] ok, I raised the limit to 4G
[06:16:30] ack :)
[06:16:41] for the alerts, let's not add them to staging, prod is fine
[06:17:37] https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-7d&orgId=1&to=now&var-cluster=main-eqiad&var-consumer_group=cpjobqueue-ORESFetchScoreJob&var-datasource=eqiad%20prometheus%2Fops&var-topic=All&viewPanel=1 is also another thing to alert on, namely if lag starts to build up
[06:21:03] the diff is again empty
[06:21:13] mmmm maybe it is an issue with the template
[06:24:33] update the cr, let's see
[06:25:45] I'll check the template as it is different for revscoring than other isvc
[06:26:12] Commuting to coworking. afk 30'!
[06:30:23] now it works!
I think we should apply it to all isvcs, damaging and goodfaith
[06:30:39] it affects eswiki the most for some reason, but also others are failing
[06:30:57] I am not 100% sure if this will fix the OOMs though
[06:37:13] Machine-Learning-Team, Patch-For-Review: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (elukey) Error while uploading the new revscoring to Pypi: ` WARNING Error during upload. Retry with the --verbose option for more details....
[06:41:02] commuting as well
[06:59:15] ok so I am checking eswiki-goodfaith, and it is kinda dead as well
[06:59:33] but it is not showing the 2/3 indicator in kubectl get pods
[07:00:18] ah no it is, I was looking with the wrong filter
[07:00:41] same pattern, OOM and then this weird status
[07:04:46] ok finally I don't see events in eqiad.cpjobqueue.retry.mediawiki.job.ORESFetchScoreJob
[07:06:48] Machine-Learning-Team, Patch-For-Review: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (elukey) Killed eswiki damaging/goodfaith, same pattern.
[07:16:49] for some pods there seems to be a gradual leak over time, eswiki damaging being the most affected
[07:17:59] one enwiki pod moved from ~700 MB of RSS to ~1350 MB in the course of 5 hours
[07:23:00] I am pretty sure that the issue is in our code, now that I recall we tried to use some preprocess caches here and there
[07:24:29] interesting...
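Editor's aside on the 07:17:59 numbers: a rough sketch of the arithmetic, assuming the growth is linear (the log does not establish that), reading "1.350MB" as 1350 MB, and treating the 4G limit as 4096 MB. The function names are hypothetical helpers, not part of any Lift Wing tooling.

```python
def growth_rate_mb_per_hour(start_mb: float, end_mb: float, elapsed_hours: float) -> float:
    """Average RSS growth between two observations, in MB per hour."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return (end_mb - start_mb) / elapsed_hours


def hours_until_limit(start_mb: float, end_mb: float, elapsed_hours: float, limit_mb: float) -> float:
    """Linear extrapolation of how long until RSS reaches the container limit."""
    rate = growth_rate_mb_per_hour(start_mb, end_mb, elapsed_hours)
    if rate <= 0:
        # Flat or shrinking memory usage never reaches the limit under this model.
        return float("inf")
    return max(0.0, (limit_mb - end_mb) / rate)


# Figures quoted in the chat: ~700 MB -> ~1350 MB over 5 hours.
rate = growth_rate_mb_per_hour(700, 1350, 5)
remaining = hours_until_limit(700, 1350, 5, limit_mb=4096)  # 4096 MB is an assumed reading of "4G"
print(f"{rate:.0f} MB/h, about {remaining:.1f} h of headroom left")
```

Under those assumptions a higher limit only delays the OOM rather than preventing it, which fits the doubt voiced at 06:30:57.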
[07:24:42] well not so interesting but you get what I mean :P
[07:28:25] (PS1) Elukey: python: remove unnecessary self attributes in revscoring's model svc [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958393 (https://phabricator.wikimedia.org/T346445)
[07:28:33] first one is --^
[07:29:48] mmm lemme re-read the code
[07:30:52] (PS2) Elukey: python: remove unnecessary self attributes in revscoring's model svc [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958393 (https://phabricator.wikimedia.org/T346445)
[07:30:52] yes okok
[07:30:58] I renamed the function too
[07:31:26] the other candidate that I have in mind is the "cache" parameter that we use in fetch_features
[07:32:23] uff I need await
[07:32:59] (PS3) Elukey: python: remove unnecessary self attributes in revscoring's model svc [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958393 (https://phabricator.wikimedia.org/T346445)
[07:45:20] trying to hit eswiki with a lot of requests using "extended_output": "True"
[07:46:37] (CR) Ilias Sarantopoulos: [C: +1] python: remove unnecessary self attributes in revscoring's model svc [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958393 (https://phabricator.wikimedia.org/T346445) (owner: Elukey)
[07:48:18] ok, I'm checking the helm template to see if it is ok
[07:56:42] (CR) Elukey: [C: +2] python: remove unnecessary self attributes in revscoring's model svc [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958393 (https://phabricator.wikimedia.org/T346445) (owner: Elukey)
[07:56:54] quickly tested it locally --^
[07:57:17] I have a meeting now, if you guys have time to deploy/test it to staging etc..
please go ahead
[07:57:20] otherwise I'll do it
[07:57:29] it will likely not solve it but we can start from something
[07:57:44] I wasn't able to raise the RSS usage of eswiki with the extended output
[08:01:01] I'll deploy it in staging!
[08:01:05] test it etc
[08:02:13] (Merged) jenkins-bot: python: remove unnecessary self attributes in revscoring's model svc [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958393 (https://phabricator.wikimedia.org/T346445) (owner: Elukey)
[08:31:06] good morning o/
[08:32:54] morning!
[08:40:57] hey!
[08:58:28] o/
[08:58:38] isaranto: thanks a lot! Just finished, can I help in the rollout?
[09:01:01] I have just deployed in staging for now
[09:01:24] tested it and all is ok
[09:01:31] rolling out to eqiad and codfw now
[09:03:13] <3
[09:03:28] actually I am deploying to all staging namespaces first
[09:03:36] and then to prod
[09:38:05] thanks a lot for the rollout isaranto
[09:38:35] I am reviewing the code that we use for the revscoring's extractor, maybe the leak is in the cache injected
[09:38:46] np! don't thank me <3
[09:39:06] I'll be rolling out all namespaces - now I just did eqiad damaging in prod
[09:39:39] you can do only eqiad goodfaith atm, we can observe and refine
[09:39:47] we'll likely have to rollout multiple fixes
[09:39:54] so you don't get crazy :D
[09:41:58] the other big thing for today - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/957687
[09:42:03] ready to go?
[09:42:21] we have the leak in Lift Wing but it seems completely separate etcd..
[09:42:25] *etc
[09:42:39] we may need to revert if people complain, but it is relatively straightforward
[09:42:42] The changeprop thing LGTM
[09:43:16] all right proceeding :)
[09:43:55] historic moment :)
[09:44:45] historic moments can be either a good and a bad thing 😛
[09:44:50] *or
[09:47:40] please don't ruin my rare optimistic moment :D
[09:53:19] sry, didn't have such an intention.
lets gooooo
[09:54:59] doneee
[09:55:14] Now we wait for the impact and explosion ;)
[09:58:41] yeees! The logstash dashboard looks nice
[09:59:07] got a link? (I am inept at navigating logstash)
[09:59:28] https://logstash.wikimedia.org/goto/84285068e68e31c43ae8a30f120c82e0
[09:59:59] --^ I have a filter so you only see changeprop traffic
[10:00:11] very nice
[10:04:04] klausman: in case we need to rollback: revert + deploy of changeprop
[10:04:10] nothing more
[10:04:14] ack
[10:04:39] I have an appointment at 13:00, so if you want to do lunch now, I can cover any mishaps
[10:05:04] nono it is fine, I'll leave in a bit as well
[10:05:16] I am also watching https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&refresh=1m
[10:05:25] as expected now we see score cache misses
[10:06:10] from https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=ores&var-instance=All it looks good
[10:07:03] the ores nodes are basically doing nothing
[10:10:42] even the temp is dropping slightly :D
[10:23:12] elukey: do you have a link with the dashboard where we can see container resources? cpu/ram. I can't find iiiit
[10:24:27] https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1 This?
[10:25:00] o yes. thank you! gotta fix my bookmarks
[10:26:04] yep!
[10:26:20] so far eswiki's memory seems stable
[10:29:16] yesterday it peaked after running for 5-6 hours. don't know about the previous pods
[10:30:31] on Saturday it was after approx 11-12h
[10:31:54] but memory usage started to go up in the first 2 hours so we'll find out soon-ish enough
[10:32:04] * isaranto going for lunch!
[10:34:06] email sent to wikitech-l for ores changeprop
[10:34:40] Machine-Learning-Team: Deprecate mediawiki revision-score stream - https://phabricator.wikimedia.org/T342116 (elukey) Email sent to Wikitech-l, the task is completed.
Let's leave it open for a couple of days to see if everything works as expected.
[10:42:54] * elukey lunch!
[11:09:04] * klausman lunch as well (and errand)
[11:51:36] Morning all!
[11:57:30] Monring Chris!
[12:00:57] *morning
[12:15:35] Machine-Learning-Team, Patch-For-Review: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (isarantopoulos) It seems that a direct reference to a commit in this way is not allowed according to [[ https://peps.python.org/pep-0440/#direct-references | PEP 440 ]]. I...
[12:56:51] Machine-Learning-Team, Patch-For-Review: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (isarantopoulos) Created https://github.com/wikimedia/yamlconf/pull/. The fork has been set up to publish yamlconf-wmf package as a wmf maintained version of the package.
[13:04:09] morning chrisalbon!
[13:04:17] isaranto: I don't see any traces of the leak! \o/
[13:05:45] yes! it seems nothing compared to the previous one
[13:05:58] fingers crossed
[13:06:31] i created https://github.com/wikimedia/yamlconf/pull/1 lemme know what u think
[13:07:01] nice!
[13:21:22] Amir1: we stopped changeprop's ores-cache stream, if you are free next morning we try to move traffic away from ORES
[13:21:31] *next monday
[13:21:46] sounds good to me
[13:27:46] ZOMG
[13:27:51] its happening
[13:32:54] Machine-Learning-Team, Observability-Alerting, Patch-For-Review: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (elukey) @RLazarus thanks a lot! We can wait and be the first beta-testers of the new alerts if you are ok! @isarantopoulos the first alert that I would add is based on [[...
[13:33:02] chrisalbon: yessssssssssss
[13:34:17] chrisalbon: I suspect that the first switch will not be successful, maybe some corner case that we didn't anticipate etc.. but we'll be able to quickly revert with one dns change.
[13:34:26] after that it's just a fix/retry loop
[13:35:01] I get that, but at least it is happening. This is huge.
[13:35:34] looking forward to it
[13:35:42] how long it has been? 2.5y ?
[13:36:10] Right?!
[13:38:00] Machine-Learning-Team, Patch-For-Review: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (elukey) Ilias rolled out https://gerrit.wikimedia.org/r/958393 to damaging/goodfaith pods in ml-serve-eqiad, so far we haven't seen any occurrence of...
[13:51:09] Machine-Learning-Team, Patch-For-Review: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (isarantopoulos) The new yamlconf-wmf package has been published. I created a PR for revscoring https://github.com/wikimedia/revscoring/pull/548. There have been some cha...
[14:10:48] Machine-Learning-Team, serviceops: Rename the envoy's uses_ingress option to sets_sni - https://phabricator.wikimedia.org/T346638 (elukey)
[14:37:17] (PS2) Ilias Sarantopoulos: WIP - Upgrade revscoring images to KServe 0.11 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/957948 (https://phabricator.wikimedia.org/T346446) (owner: Elukey)
[14:37:48] (PS3) Ilias Sarantopoulos: WIP - Upgrade revscoring images to KServe 0.11 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/957948 (https://phabricator.wikimedia.org/T346446) (owner: Elukey)
[14:39:36] Machine-Learning-Team, Patch-For-Review: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (isarantopoulos) Turns out the old GH repo worked and revscoring 2.11.12 was published. I updated the [[ https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference...
[14:44:22] logging off folks, have a great rest of the day/evening!
[14:57:52] have a nice evening isaranto!
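Editor's aside on the PEP 440 remark at 12:15:35: a package whose dependency list points straight at a URL or commit (the `name @ url` form) is what the task comment says was not allowed on upload, which is presumably why a separately published yamlconf-wmf was needed. Below is a heuristic sketch for spotting such entries before uploading. The helper name and the example requirement strings are hypothetical, and this is plain string matching, not a full requirement parser.

```python
def is_direct_reference(requirement: str) -> bool:
    """Heuristically detect a 'name @ url' style dependency.

    Ordinary version specifiers (==, >=, ~= ...) never contain '@',
    so an '@' before any ';' environment marker suggests a direct reference.
    """
    spec = requirement.split(";", 1)[0]  # drop an environment marker, if present
    name, sep, url = spec.partition("@")
    return bool(sep) and bool(name.strip()) and bool(url.strip())


# Hypothetical requirement strings; the URL is a placeholder, not the real repository.
requirements = [
    "yamlconf @ git+https://example.org/yamlconf.git@abc123",
    "yamlconf-wmf",
    "numpy>=1.20; python_version >= '3.8'",
]
for req in requirements:
    print(req, "->", "direct reference" if is_direct_reference(req) else "ok")
```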
[14:58:08] Machine-Learning-Team, serviceops, Patch-For-Review: Rename the envoy's uses_ingress option to sets_sni - https://phabricator.wikimedia.org/T346638 (JMeybohm)
[15:38:59] the fix for the memory leak seems to be working, I do see some increase in memory usage over time (very mild) for eswiki, but no more regular spikes like before
[15:40:14] Hooray! And nice work!
[16:28:58] snap the leak seems to be back
[16:29:03] sigh
[16:29:10] it is maybe less pronounced
[16:29:32] but eswiki doubled its usage
[16:32:21] goodfaith and damaging ramp up at the same time, so I suppose it may be a client calling them at the same time
[16:53:22] we'll see tomorrow, there is surely another thing to tune
[17:00:55] Machine-Learning-Team, Patch-For-Review: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (Isaac) Hey folks -- not to take away from the good work by @kevinbazira but I wanted to flag that I don't think it makes sense to port the embeddings component over to L...
[18:27:15] (PS1) Ladsgroup: tests: Migrate to use SelectQueryBuilder [extensions/ORES] - https://gerrit.wikimedia.org/r/958539 (https://phabricator.wikimedia.org/T312454)
[19:49:26] fixed the resources patch (it had wrong indentation) and redeployed eswiki goodfaith and damaging on eqiad with increased memory limits (4Gi) as it was going to reach the limit anytime soon
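Editor's aside on the remaining leak candidate: earlier in the day the "cache" parameter used in fetch_features was named as a suspect. The sketch below is not the inference-services code. It only illustrates, with hypothetical class names and a made-up size cap, why a per-process cache keyed by revision id grows with every distinct request unless entries are evicted.

```python
from collections import OrderedDict


class UnboundedCache:
    """Keeps every computed value forever, so memory grows with distinct keys."""

    def __init__(self):
        self._data = {}

    def get_or_compute(self, key, compute):
        if key not in self._data:
            self._data[key] = compute(key)
        return self._data[key]

    def __len__(self):
        return len(self._data)


class BoundedCache:
    """Least-recently-used cache that evicts the oldest entry past max_entries."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._data = OrderedDict()
        self._max_entries = max_entries

    def get_or_compute(self, key, compute):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        value = compute(key)
        self._data[key] = value
        if len(self._data) > self._max_entries:
            self._data.popitem(last=False)  # drop the least recently used entry
        return value

    def __len__(self):
        return len(self._data)


def fake_features(rev_id):
    # Stand-in for an expensive feature extraction call (hypothetical).
    return [float(rev_id)] * 8


unbounded = UnboundedCache()
bounded = BoundedCache(max_entries=100)
for rev_id in range(1000):
    unbounded.get_or_compute(rev_id, fake_features)
    bounded.get_or_compute(rev_id, fake_features)

print(len(unbounded), len(bounded))
```

If the real cache is created per request and discarded afterwards, this pattern does not apply, so it is one thing to rule out rather than a diagnosis.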