[06:39:18] (PS3) Elukey: editquality: refactor preprocess common code [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/829847 (https://phabricator.wikimedia.org/T313915)
[06:40:13] (CR) Elukey: "Tobias thanks for the review, I had to add a parameter to the extractor_utils' function to ease the port of the articlequality code. Hopef" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/829847 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[06:40:56] (PS1) Elukey: articlequality: refactor code to use the new extractor_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830058 (https://phabricator.wikimedia.org/T313915)
[06:51:38] good morning folks :)
[06:52:07] with the new code split, moving {draft,article}quality to async preprocess should be relatively easy
[06:52:16] less copy/paste
[07:19:19] (PS1) Elukey: draftquality: move to async preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830061 (https://phabricator.wikimedia.org/T313915)
[07:40:54] (PS1) Elukey: drafttopic: move preprocess to async [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830084 (https://phabricator.wikimedia.org/T313915)
[07:41:08] all right all revscoring models have their new code review :)
[07:42:16] going afk for ~1 hour or a little more for errands, ttl!
[09:34:12] back!
[09:37:16] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move revscoring isvcs to async architecture - https://phabricator.wikimedia.org/T313915 (elukey) All code reviews out, plus a refactoring of the existing code for edit and article quality to reduce duplication as much as possible. Ne...
[10:17:04] elukey: are you aware of anything that would explain why ml-cache1001's mgmt interface would be down?
Icinga says it can't ping it (for 1d18h now)
[10:18:43] From inside the host, IPMI commands work, and the configured IP address looks correct
[10:21:48] mmm so `ping ml-cache1001.mgmt.eqiad.wmnet` from cumin1001 doesn't work
[10:22:04] it works for 1002 for example
[10:22:23] I've been browsing around in Netbox, and when I looked at 1001's interfaces, there was no cable connection configured. *But* that is also the case for 1002, which is fine
[10:22:28] so it may be that the cable is faulty, or that we have to reboot the BMC
[10:23:11] ack, I'll see how to do that
[10:23:35] all the commands in https://wikitech.wikimedia.org/wiki/Management_Interfaces
[10:23:43] ack
[10:24:26] but the fact that we cannot ping it smells like a faulty cable
[10:24:42] reset done. Any idea how long a reset like that usually takes for the mgmt card to boot?
[10:24:51] some minutes IIRC
[10:25:10] Ok, I'll see if there's a change in 10m from now. Otherwise I'll ping DCops about it
[10:25:10] maybe a couple, not a lot
[10:25:14] super
[10:25:49] going afk for lunch, ttl!
[10:27:31] yeah, same
[10:42:37] Lift-Wing, Data Engineering Planning, Event-Platform Value Stream, Epic, and 2 others: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (EChetty)
[11:44:13] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Support pre-transformed inputs for Outlink topic model - https://phabricator.wikimedia.org/T315998 (achou) @Isaac The change (the first option) has been deployed to the production. :) > So long as LiftWing isn't taking some sort of a...
[12:13:02] Morning all
[12:22:36] morning!
[12:24:23] Hey Elukey!
[12:30:26] \o
[12:34:08] one thing that I have realized is that we don't really have a dashboard for the Lift Wing traffic (a logstash one I mean)
[12:35:25] we probably need one for all kubernetes logs that are shipped via rsyslog
[12:35:32] and one with the pods' traffic
[12:39:04] I have no idea what's involved in setting something like that up. Is it much work?
[12:39:24] But yes, we probably want that once prod traffic hits. Probably before.
[12:40:38] in theory it should be traffic already shipped by rsyslog
[12:45:26] yeah I see that container logs are stored under /var/log/containers and there is a rule for mmkubernetes in rsyslog's config
[12:48:55] Would the log format need to be configured? Or does Logstash understand them magically?
[12:49:40] not sure
[12:51:51] in theory the log entries are shipped to kafka and then logstash pulls from the related topic
[12:52:11] the message is sent to kafka in a pre-defined json format, that should be parsable by logstash
[12:57:12] https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2022.09.06?id=Jz3fEoMBzGlbejpUpSv1
[12:57:20] this is from the kserve pod
[12:57:29] err container
[12:57:51] so we can filter for kubernetes.container_name kserve-container
[12:58:11] the log is something like
[12:58:12] [I 220906 12:56:28 web:2243] 200 POST /v1/models/enwiki-articlequality:predict (127.0.0.1) 155.53ms
[12:58:21] that is not what we want, needs to be tuned
[13:01:35] * elukey opens a task
[13:04:18] Machine-Learning-Team: Create logstash dashboard(s) for Lift Wing - https://phabricator.wikimedia.org/T317105 (elukey)
[13:07:34] (CR) Klausman: [C: +1] drafttopic: move preprocess to async [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830084 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[13:07:50] (CR) Klausman: [C: +1] draftquality: move to async preprocess [machinelearning/liftwing/inference-services] -
https://gerrit.wikimedia.org/r/830061 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[13:08:09] (CR) Klausman: [C: +1] articlequality: refactor code to use the new extractor_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830058 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[13:08:36] klausman: <3
[13:08:41] (for the reviews)
[13:09:01] I'll wait for Kevin and Aiko to take a look as well since the change is a bit invasive
[13:10:37] ack!
[13:11:32] but in theory the code to make the revscoring models' preprocess fun async is all out
[13:12:12] (back in a few)
[13:12:47] "fun async" as opposed to "boring sync" :)
[13:32:52] Oh man I just had a major moment of cognitive confusion. I was updating my laptop, and apt-listchanges shows a message about systemd. Its maintainer is Luca Boccassi, but of course my brain ignored his last name, and for several seconds I wondered why Luca was sending me messages about systemd on my laptop.
[13:34:23] There is only one Luca in the world
[13:35:18] Well, only one that matters :D
[13:36:51] lol
[13:51:29] also, ml-cache.mgmt is now pinging and can be ssh'd into (it was a bad switch port)
[13:51:34] 1001*
[13:51:44] nice
[15:02:38] aiko, kevinbazira: forgot to ask, but from https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/829847 onwards there is a refactoring of the current async preprocess fun for ores-models. When you have a moment lemme know what you think about it (even tomorrow, no rush)
[15:05:28] Tobias already reviewed but I wanted your opinion too because it is a little invasive (new subdir in the python common dir etc..)
[15:29:56] elukey: I am debugging some weird 400s between APIGW and k8s in codfw, where would you expect a routing error by k8s to be handled (i.e. which pod logs should I look at? The node-level calicos don't see anything)
[15:30:26] istio-ingressgateway?
[15:45:29] Hrm. Nothing to be found.
Breakage is likely in the gw, then
[16:12:55] elukey: ok, I'll have a look!
[16:22:18] klausman: sorry I was in meetings, didn't see the ping
[16:22:21] still having the issue?
[16:22:39] a 400 probably is returned by istio itself, maybe the gateway pod logs could help
[16:22:45] thanks aiko!
[16:35:04] * elukey afk for the evening o/
[20:00:02] It's likely a Host rewrite problem. Hugh and I will figure it out
[22:50:16] python people who use linux... what tools do you use to manage different versions of python? pyenv? something else?
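
Note on the kserve access log line quoted at 12:58:12: the log says the line "needs to be tuned" before it is useful in a dashboard, but does not say how. As a rough illustration only, the sketch below splits a line of that shape into named fields in Python. The function name `parse_access_line` and the output field names are hypothetical, chosen for this example; they are not taken from the Lift Wing code or from any Logstash schema, and the real tuning may well happen in the logging or Logstash configuration instead.

```python
import re
from datetime import datetime

# Pattern for lines shaped like the sample in the log, e.g.
# [I 220906 12:56:28 web:2243] 200 POST /v1/models/x:predict (127.0.0.1) 155.53ms
# Assumption: every access line follows this exact layout.
LINE_RE = re.compile(
    r"^\[(?P<level>[A-Z]) (?P<date>\d{6}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<module>\w+):(?P<lineno>\d+)\] "
    r"(?P<status>\d{3}) (?P<method>[A-Z]+) (?P<path>\S+) "
    r"\((?P<remote_ip>[^)]+)\) (?P<duration_ms>\d+(?:\.\d+)?)ms$"
)


def parse_access_line(line):
    """Return a dict of fields for one access line, or None if it does not match.

    The field names are hypothetical and only meant to show the idea.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # The date is written as YYMMDD, followed by a wall-clock time.
    timestamp = datetime.strptime(
        fields["date"] + " " + fields["time"], "%y%m%d %H:%M:%S"
    )
    return {
        "level": fields["level"],
        "timestamp": timestamp.isoformat(),
        "status": int(fields["status"]),
        "method": fields["method"],
        "path": fields["path"],
        "remote_ip": fields["remote_ip"],
        "duration_ms": float(fields["duration_ms"]),
    }
```

With fields like these, a dashboard could filter on the status code or aggregate latency per model path rather than searching a free-text message.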