[07:01:14] klausman: o/
[07:01:30] if you want I can let you do the roll out!
[07:01:34] there is no real hurry
[07:01:43] lemme know what you prefer
[07:01:47] also good morning :)
[08:02:22] \o
[08:02:32] Morning!
[08:04:29] elukey: Let me set up a few tmux's and watch's, and we can start
[08:08:32] you can go ahead anytime, we need to rollout to ml-serve-{eqiad,codfw}
[08:08:54] admin_ng's knative-serving config first, then the revscoring helmfiles (except outlink IIRC)
[08:09:06] ack
[08:09:13] I'll start with codfw
[08:11:39] diff looks ok, syncing
[08:18:20] revscoring-articletopic has no diff, everything else does.
[08:18:25] So syncing them now
[08:26:19] Ok, everything restarted and old pods terminated. I'll watch things a bit before continuing
[08:26:54] I have tested articlequality with events sent to eventgate, all godo
[08:26:57] good
[08:27:08] going to check the sidecar metrics as well to see if they are populated
[08:29:21] yep I see metrics
[08:29:37] I've spot-checked a few services with curl, everything seems to be fine
[08:30:31] we still have these alerts when deploying
[08:30:31] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw
[08:30:56] but all temporary things, not sure if it is related to the volume of pods changed or something else
[08:31:13] the control plane works fine
[08:31:27] We might also want to look at the thresholds there. Maybe they're too narrow
[08:31:27] maybe for ml-serve-eqiad let's try a slower rollout to see if it fires
[08:31:38] Will do.
[08:32:54] istio diff looks the same, syncing
[08:33:54] (CR) Elukey: editquality: align ORES prediction output with Lift Wing's one (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/839516 (https://phabricator.wikimedia.org/T318932) (owner: AikoChou)
[08:33:59] (CR) Elukey: [C: +1] editquality: align ORES prediction output with Lift Wing's one [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/839516 (https://phabricator.wikimedia.org/T318932) (owner: AikoChou)
[08:34:50] * elukey bbiab
[08:35:13] Istio push done. Will now sync the services. I'll wait until no Terminating/Init pods are visible before going from one subdir to the next
[09:03:21] And all rolled out
[09:04:24] nice!
[09:06:38] I think waiting for all the terminations to be complete helped with avoiding latency warnings
[09:11:25] ack so the kube api may be a little overwhelmed if there are a lot of pods
[09:14:52] klausman: do we have remaining blockers in T288789 ?
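Note on the gating mentioned at 08:35:13 (waiting until no Terminating/Init pods are visible before syncing the next subdir): a minimal sketch of how that check could be scripted, not the procedure actually used in the rollout. The function names, the polling interval and the timeout are assumptions for illustration; the only external command is `kubectl get pods -n <namespace> --no-headers`, and the namespace is whatever the caller passes in.

```python
import subprocess
import time
from typing import Callable, List


def unsettled_pods(kubectl_output: str) -> List[str]:
    """Return names of pods whose STATUS is Terminating or an init phase.

    Expects the plain-text table printed by `kubectl get pods`
    (columns: NAME READY STATUS RESTARTS AGE), with or without headers.
    """
    pending = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line
        name, status = fields[0], fields[2]
        if (
            status == "Terminating"
            or status == "PodInitializing"
            or status.startswith("Init")
        ):
            pending.append(name)
    return pending


def fetch_pods(namespace: str) -> str:
    """Run kubectl and return its stdout (requires kubectl on PATH)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "--no-headers"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def wait_until_settled(
    namespace: str,
    fetch: Callable[[str], str] = fetch_pods,
    poll_seconds: float = 10.0,
    timeout_seconds: float = 900.0,
) -> bool:
    """Poll until no pod is Terminating/Init; return False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if not unsettled_pods(fetch(namespace)):
            return True
        time.sleep(poll_seconds)
    return False
```

A rollout wrapper would call `wait_until_settled(namespace)` after each sync and only move on when it returns True. The `fetch` parameter exists so the logic can be exercised without a cluster.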
[09:19:33] The only blockers are deciding the format of the URLs externally and then setting up the apigw config to implement that
[09:22:14] (CR) AikoChou: [C: +1] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/839516 (https://phabricator.wikimedia.org/T318932) (owner: AikoChou)
[09:22:21] (CR) AikoChou: [C: +2] editquality: align ORES prediction output with Lift Wing's one [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/839516 (https://phabricator.wikimedia.org/T318932) (owner: AikoChou)
[09:22:49] (PS1) Elukey: extractor_utils: fix incorrect usage of logging.error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840080
[09:23:02] this is a quick one
[09:23:03] --^
[09:23:23] spotted while reviewing some errors from benthos
[09:24:36] (CR) AikoChou: [C: +1] extractor_utils: fix incorrect usage of logging.error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840080 (owner: Elukey)
[09:28:05] (CR) Klausman: [C: +1] extractor_utils: fix incorrect usage of logging.error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840080 (owner: Elukey)
[09:33:32] (Merged) jenkins-bot: editquality: align ORES prediction output with Lift Wing's one [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/839516 (https://phabricator.wikimedia.org/T318932) (owner: AikoChou)
[09:39:24] (CR) Elukey: [C: +2] extractor_utils: fix incorrect usage of logging.error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840080 (owner: Elukey)
[09:39:59] it is a very simple fix but since it is code shared we'll need to roll it out to all revscoring isvcs
[09:40:14] ...and I just restarted most of them %-)
[09:41:02] hopefully at some point in the future we'll drop revscoring models
[09:42:03] klausman: ack for T288789, can you make a summary at the end of the task when you have a moment? (so others will know it as well etc..)
[09:42:32] yep
[09:57:45] interesting, I noticed an error case triggered by benthos for some revision-create events, namely that rev_parent_id seems not in the event
[09:57:49] so our code fails
[10:00:50] and it also fails in extractor_info when we try to get the revision_info
[10:01:17] but to know what the error is, we need the logging bits to work correctly :D
[10:01:27] very good though, some issues are popping up
[10:33:10] (PS2) Elukey: extractor_utils: fix incorrect usage of logging.error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840080
[10:33:18] (CR) Elukey: [V: +2] extractor_utils: fix incorrect usage of logging.error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840080 (owner: Elukey)
[10:33:32] (PS1) Elukey: events.py: set some revision-score fields as optional [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840095
[10:33:37] another one :D
[10:34:29] (CR) Elukey: "Need to test it properly locally on Docker, but lemme know if you are ok with the idea or not :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840095 (owner: Elukey)
[10:34:56] * elukey lunch!
[11:46:11] (CR) AikoChou: "I have a suggestion on the optional fields. :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840095 (owner: Elukey)
[11:57:34] <- lunch
[13:14:29] Morning all
[13:26:41] (CR) Elukey: events.py: set some revision-score fields as optional (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840095 (owner: Elukey)
[13:29:43] (CR) Elukey: events.py: set some revision-score fields as optional (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840095 (owner: Elukey)
[14:15:26] (CR) AikoChou: events.py: set some revision-score fields as optional (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840095 (owner: Elukey)
[14:18:05] aiko: I think I found another little bug while testing
[14:18:10] in various models.py we have
[14:18:11] if extended_output:
[14:18:14] [..]
[14:18:23] but we don't check if the value is true or false :D
[14:18:45] so if we set extended_output: false it is considered like a true value (because the flag is set)
[14:22:56] Mmm? If we set extended_output: false, then in the case of if extended_output: [..], it won't go into [..]
[14:25:45] aiko: you are right, for some reason my tests were showing a different thing, but now everything works, so I may have messed up some code when testing. Nevermind, friday :)
[14:26:11] ok ok :D
[14:27:00] (PS2) Elukey: events.py: set some revision-score fields as optional [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840095
[14:27:54] (CR) Elukey: events.py: set some revision-score fields as optional (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840095 (owner: Elukey)
[14:45:12] aiko: confirmed that the field needs to be an array
[14:45:14] crazy
[14:51:06] really! 🫠
[14:58:03] (PS3) Elukey: events.py: improve revision score generation code [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840095
[14:58:15] aiko: tested --^ and the event generated is validated correctly by eventgate
[14:59:27] (CR) Elukey: events.py: improve revision score generation code (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840095 (owner: Elukey)
[15:01:39] * elukey back in a bit
[15:15:15] (CR) AikoChou: [C: +1] events.py: improve revision score generation code (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840095 (owner: Elukey)
[15:28:19] (CR) Klausman: [C: +1] events.py: improve revision score generation code [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840095 (owner: Elukey)
[15:29:19] thanks both :)
[15:29:34] (CR) Elukey: [C: +2] events.py: improve revision score generation code [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840095 (owner: Elukey)
[15:40:27] (Merged) jenkins-bot: events.py: improve revision score generation code [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/840095 (owner: Elukey)
[15:46:30] (PS3) AikoChou: outlink: increase the number of links returned for MW API call [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/837642
[15:47:41] (PS4) AikoChou: outlink: increase the number of links returned for MW API call [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/837642
[15:49:34] (CR) AikoChou: outlink: increase the number of links returned for MW API call (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/837642 (owner: AikoChou)
[15:50:18] (CR) Elukey: [C: +1] outlink: increase the number of links returned for MW API call [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/837642 (owner: AikoChou)
[15:57:01] (CR) AikoChou: [C: +2] outlink: increase the number of links returned for MW API call [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/837642 (owner: AikoChou)
[16:03:29] going afk for the weekend folks! o/
[16:03:31] have a nice weekend :)
[16:03:44] (Merged) jenkins-bot: outlink: increase the number of links returned for MW API call [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/837642 (owner: AikoChou)
[16:04:43] bye Luca :) Have a nice weekend too!
[16:13:06] \o
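Note on the extended_output exchange (14:18 to 14:25): the conclusion reached in the chat, that `if extended_output:` does not enter the branch when the value is a boolean false, can be checked with a few lines of Python. This is only an illustrative sketch. The helper names are hypothetical, and the string case is an assumption added for contrast: nothing in the log says the flag ever arrives as a string.

```python
def takes_extended_branch(extended_output) -> bool:
    """Mirror the bare `if extended_output:` check quoted in the chat."""
    if extended_output:
        return True
    return False


def parse_flag(value) -> bool:
    """Hypothetical helper: interpret a flag that may be a bool or a string.

    Strings are compared case-insensitively against a small allow-list,
    because any non-empty string (including "false") is truthy in Python.
    """
    if isinstance(value, str):
        return value.strip().lower() in ("true", "1", "yes")
    return bool(value)


# A real boolean behaves as concluded in the chat.
print(takes_extended_branch(False))    # False
print(takes_extended_branch(True))     # True

# Assumed case: the value reaches the code as the string "false".
print(takes_extended_branch("false"))  # True, the string is non-empty
print(parse_flag("false"))             # False
```

If the flag were ever read from a query string or an environment variable instead of parsed JSON, normalising it with something like `parse_flag` before the `if` would keep the check meaningful.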