[06:39:18] <wikibugs>	 (PS3) Elukey: editquality: refactor preprocess common code [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/829847 (https://phabricator.wikimedia.org/T313915)
[06:40:13] <wikibugs>	 (CR) Elukey: "Tobias thanks for the review, I had to add a parameter to the extractor_utils' function to ease the port of the articlequality code. Hopef" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/829847 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[06:40:56] <wikibugs>	 (PS1) Elukey: articlequality: refactor code to use the new extractor_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830058 (https://phabricator.wikimedia.org/T313915)
[06:51:38] <elukey>	 good morning folks :)
[06:52:07] <elukey>	 with the new code split moving {draft,article}quality to async preprocess should be relatively easy
[06:52:16] <elukey>	 less copy/paste
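(Editor's sketch, not code from the patches above.) A minimal illustration of why a shared module makes the async move cheaper: once the non-blocking feature fetch lives in one place, each model class only inherits it. All names here (`fetch_features`, `BaseRevscoringModel`, the returned fields) are hypothetical and are not taken from the inference-services repository.

```python
import asyncio


async def fetch_features(rev_id: int) -> dict:
    """Hypothetical shared helper standing in for a common utility module."""
    # Placeholder for a non-blocking HTTP call to fetch revision data.
    await asyncio.sleep(0)
    return {"rev_id": rev_id, "features": [rev_id % 7, rev_id % 11]}


class BaseRevscoringModel:
    """Hypothetical base class: the async preprocess is written only once."""

    name = "base"

    async def preprocess(self, inputs: dict) -> dict:
        features = await fetch_features(inputs["rev_id"])
        return {"model": self.name, **features}


class DraftQuality(BaseRevscoringModel):
    name = "draftquality"


class ArticleQuality(BaseRevscoringModel):
    name = "articlequality"


async def main() -> list:
    # Both preprocess calls run concurrently on the same event loop.
    return await asyncio.gather(
        DraftQuality().preprocess({"rev_id": 12345}),
        ArticleQuality().preprocess({"rev_id": 12345}),
    )


results = asyncio.run(main())
print(results)
```

Each subclass differs only in its `name`, which is the "less copy/paste" effect mentioned above.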
[07:19:19] <wikibugs>	 (PS1) Elukey: draftquality: move to async preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830061 (https://phabricator.wikimedia.org/T313915)
[07:40:54] <wikibugs>	 (PS1) Elukey: drafttopic: move preprocess to async [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830084 (https://phabricator.wikimedia.org/T313915)
[07:41:08] <elukey>	 all right, all revscoring models have their new code review :)
[07:42:16] <elukey>	 going afk for ~1 hour or a little more for errands, ttl!
[09:34:12] <elukey>	 back!
[09:37:16] <wikibugs>	 Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move revscoring isvcs to async architecture - https://phabricator.wikimedia.org/T313915 (elukey) All code reviews out, plus a refactoring of the existing code for edit and article quality to reduce duplication as much as possible.  Ne...
[10:17:04] <klausman>	 elukey: are you aware of anything that would explain why ml-cache1001's mgmt interface would be down? Icinga says it can't ping it (for 1d18h now)
[10:18:43] <klausman>	 From inside the host, IPMI commands work, and the configured IP address looks correct
[10:21:48] <elukey>	 mmm so `ping ml-cache1001.mgmt.eqiad.wmnet` from cumin1001 doesn't work
[10:22:04] <elukey>	 it works for 1002 for example
[10:22:23] <klausman>	 I've been browsing around in Netbox, and when I looked at 1001's interfaces, there was no cable connection configured. *But* that is also the case for 1002, which is fine
[10:22:28] <elukey>	 so it may be that the cable is faulty, or that we have to reboot BMC
[10:23:11] <klausman>	 ack, I'll see how to do that
[10:23:35] <elukey>	 all the commands in https://wikitech.wikimedia.org/wiki/Management_Interfaces
[10:23:43] <klausman>	 ack
[10:24:26] <elukey>	 but the fact that we cannot ping it smells like a faulty cable
[10:24:42] <klausman>	 reset done. Any idea how long a reset like that usually takes for the mgmt card to boot?
[10:24:51] <elukey>	 some minutes IIRC
[10:25:10] <klausman>	 Ok, I'll see if there's a change in 10m from now. Otherwise I'll ping DCops about it
[10:25:10] <elukey>	 maybe a couple, not a lot
[10:25:14] <elukey>	 super
[10:25:49] <elukey>	 going afk for lunch, ttl!
[10:27:31] <klausman>	 yeah, same
[10:42:37] <wikibugs>	 Lift-Wing, Data Engineering Planning, Event-Platform Value Stream, Epic, and 2 others: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (EChetty)
[11:44:13] <wikibugs>	 Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Support pre-transformed inputs for Outlink topic model - https://phabricator.wikimedia.org/T315998 (achou) @Isaac The change (the first option) has been deployed to the production. :)  > So long as LiftWing isn't taking some sort of a...
[12:13:02] <chrisalbon>	 Morning all
[12:22:36] <elukey>	 morning!
[12:24:23] <chrisalbon>	 Hey Elukey!
[12:30:26] <klausman>	 \o
[12:34:08] <elukey>	 one thing that I have realized is that we don't really have a dashboard for the Lift Wing traffic (a logstash one I mean)
[12:35:25] <elukey>	 we probably need one for all kubernetes logs that are shipped via rsyslog
[12:35:32] <elukey>	 and one with the pods' traffic
[12:39:04] <klausman>	 I have no idea what's involved in setting something like that up. Is it much work?
[12:39:24] <klausman>	 But yes, we probably want that once prod traffic hits. Probably before.
[12:40:38] <elukey>	 in theory it should be traffic already shipped by rsyslog
[12:45:26] <elukey>	 yeah I see that containers logs are stored under /var/log/containers and there is a rule for mmkubernetes in rsyslog's config
[12:48:55] <klausman>	 Would the log format need to be configured? Or does Logstash understand them magically?
[12:49:40] <elukey>	 not sure
[12:51:51] <elukey>	 in theory the log entries are shipped to kafka and then logstash pulls from the related topic
[12:52:11] <elukey>	 the message is sent to kafka in a pre-defined json format, that should be parsable by logstash
[12:57:12] <elukey>	 https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2022.09.06?id=Jz3fEoMBzGlbejpUpSv1
[12:57:20] <elukey>	 this is from the kserve pod
[12:57:29] <elukey>	 err container
[12:57:51] <elukey>	 so we can filter for kubernetes.container_name kserve-container
[12:58:11] <elukey>	 the log is something like
[12:58:12] <elukey>	 [I 220906 12:56:28 web:2243] 200 POST /v1/models/enwiki-articlequality:predict (127.0.0.1) 155.53ms
[12:58:21] <elukey>	 that is not what we want, needs to be tuned
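(Editor's sketch.) One way the access-log line above could be turned into structured fields before it reaches a dashboard. The regular expression is written only against the single sample line quoted in this log, and the field names (`status`, `path`, `latency_ms`, and so on) are assumptions, not an existing Logstash schema or the format eventually chosen in the task.

```python
import json
import re
from typing import Optional

# Pattern derived from the one sample line above; other log lines may differ.
ACCESS_LOG_RE = re.compile(
    r"^\[(?P<level>[A-Z]) (?P<date>\d{6}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<source>\w+):(?P<lineno>\d+)\] "
    r"(?P<status>\d{3}) (?P<method>[A-Z]+) (?P<path>\S+) "
    r"\((?P<client_ip>[^)]+)\) (?P<latency_ms>\d+(?:\.\d+)?)ms$"
)


def parse_access_log(line: str) -> Optional[dict]:
    """Return the fields of an access-log line, or None if it does not match."""
    match = ACCESS_LOG_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert numeric fields so they can be aggregated downstream.
    fields["status"] = int(fields["status"])
    fields["lineno"] = int(fields["lineno"])
    fields["latency_ms"] = float(fields["latency_ms"])
    return fields


sample = (
    "[I 220906 12:56:28 web:2243] 200 POST "
    "/v1/models/enwiki-articlequality:predict (127.0.0.1) 155.53ms"
)
parsed = parse_access_log(sample)
print(json.dumps(parsed, indent=2))
```

Lines that do not match return `None`, so a caller can pass them through unparsed instead of dropping them.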
[13:01:35] * elukey opens a task
[13:04:18] <wikibugs>	 Machine-Learning-Team: Create logstash dashboard(s) for Lift Wing - https://phabricator.wikimedia.org/T317105 (elukey)
[13:07:34] <wikibugs>	 (CR) Klausman: [C: +1] drafttopic: move preprocess to async [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830084 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[13:07:50] <wikibugs>	 (CR) Klausman: [C: +1] draftquality: move to async preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830061 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[13:08:09] <wikibugs>	 (CR) Klausman: [C: +1] articlequality: refactor code to use the new extractor_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830058 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[13:08:36] <elukey>	 klausman: <3
[13:08:41] <elukey>	 (for the reviews)
[13:09:01] <elukey>	 I'll wait for Kevin and Aiko to take a look as well since the change is a bit invasive
[13:10:37] <klausman>	 ack!
[13:11:32] <elukey>	 but in theory the code to make the revscoring models' preprocess fun async is all out
[13:12:12] <elukey>	 (back in a few)
[13:12:47] <klausman>	 "fun async" as opposed to "boring sync" :)
[13:32:52] <klausman>	 Oh man I just had a major moment of cognitive confusion. I was updating my laptop, and apt-listchanges shows a message about systemd. Its maintainer is Luca Boccassi, but of course my brain ignored his last name, and for several seconds I wondered why Luca was sending me messages about systemd on my laptop.
[13:34:23] <chrisalbon>	 There is only one Luca in the world
[13:35:18] <klausman>	 Well, only one that matters :D
[13:36:51] <elukey>	 lol
[13:51:29] <klausman>	 also, ml-cache.mgmt is now pinging and can be ssh'd into (it was a bad switch port)
[13:51:34] <klausman>	 1001*
[13:51:44] <elukey>	 nice
[15:02:38] <elukey>	 aiko, kevinbazira: forgot to ask, but from https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/829847 onwards there is a refactoring of the current async preprocess fun for ores-models. When you have a moment lemme know what you think about it (even tomorrow, no rush)
[15:05:28] <elukey>	 Tobias already reviewed but I wanted your opinion too because it is a little invasive (new subdir in the python common dir etc..)
[15:29:56] <klausman>	 elukey: I am debugging some weird 400s between APIGW and k8s in codfw, where would you expect a routing error by k8s to be handled (i.e. which pod logs should I look at)? The node-level calicos don't see anything
[15:30:26] <klausman>	 istio-ingressgateway?
[15:45:29] <klausman>	 Hrm. Nothing to be found. Breakage is likely in the gw, then
[16:12:55] <aiko>	 elukey: ok, I'll have a look!
[16:22:18] <elukey>	 klausman: sorry I was in meetings, didn't see the ping
[16:22:21] <elukey>	 still having the issue?
[16:22:39] <elukey>	 a 400 probably is returned by istio itself, maybe the gateway pod logs could help
[16:22:45] <elukey>	 thanks aiko!
[16:35:04] * elukey afk for the evening o/
[20:00:02] <klausman>	 It's likely a Host rewrite problem. Hugh and I will figure it out
[22:50:16] <ragesoss>	 python people who use linux... what tools do you use to manage different versions of python? pyenv? something else?