[07:18:49] Hello!
[07:40:17] FYI, ml-etcd2002 will go down for a few minutes for a reboot of the underlying Ganeti node
[07:53:10] ack!
[08:34:01] likewise, ml-etcd1002 will go down for a few minutes for a reboot of the underlying Ganeti node
[08:36:04] ack!
[08:36:10] Also; morning!
[08:48:18] o/ Tobias!
[08:56:42] Morning, Ilias!
[10:13:26] * klausman lunch and errands
[10:17:34] likewise, ml-etcd1001 will go down for a few minutes for a reboot of the underlying Ganeti node
[10:23:12] ack!
[10:26:25] and ml-etcd2003 as well now
[10:26:42] just deployed the use of envoy proxy for ores-legacy, works fine!
[11:03:39] nice work!
[12:18:02] Machine-Learning-Team: Investigate kserve 0.13.0 upgrade - https://phabricator.wikimedia.org/T367048 (isarantopoulos) NEW
[12:19:07] (PS1) Ilias Sarantopoulos: huggingface: kserve 0.13.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1041082 (https://phabricator.wikimedia.org/T367048)
[12:28:14] this is WIP --^
[12:28:58] Machine-Learning-Team, Patch-For-Review: Investigate kserve 0.13.0 upgrade - https://phabricator.wikimedia.org/T367048#9875115 (elukey)
[12:32:58] Machine-Learning-Team, Patch-For-Review: Investigate kserve 0.13.0 upgrade - https://phabricator.wikimedia.org/T367048#9875141 (elukey)
[12:33:45] and ml-etcd2001 as well now
[12:34:15] Machine-Learning-Team, Patch-For-Review: Investigate kserve 0.13.0 upgrade - https://phabricator.wikimedia.org/T367048#9875156 (elukey) Please also review T367050 :)
[12:35:15] Hi folks!
[12:35:29] If you are ok I am going to rollout the new versions of eventrouter, k8s-controller-sidecars and kube-state-metrics (see https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1040153)
[12:35:38] nothing should impact ongoing traffic
[12:40:11] SGTM!
[12:53:06] o/ Luca. Ok by me as well!
[12:53:11] * isaranto afk - I have a physio appt - be back online in ~1.5h
[12:54:26] https://phabricator.wikimedia.org/T363336#9875162
[12:55:11] ---^ updated the task for the investigation on revscoring
[12:56:48] very nice :)
[12:58:21] Excellent research
[13:19:47] all deployed
[15:23:35] o/
[15:23:36] back!
[15:25:15] o/
[15:25:37] I am reviewing the SLO dashboards, and the last month worth of data looks good https://grafana.wikimedia.org/d/slo-ORES_Legacy/ores-legacy-slo-s?orgId=1
[15:25:43] no more weird values and holes
[15:26:19] https://grafana.wikimedia.org/d/slo-Lift_Wing_Revert_Risk_LA/lift-wing-revert-risk-la-slo-s?orgId=1&from=2024-03-01%2000:00:00&to=2024-05-31%2023:59:59
[15:26:51] \o/
[15:27:45] no idea why though
[15:27:57] I'll file a change to update the time window for the new quarter
[15:27:58] we'll see
[15:32:17] https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/1041159
[15:36:55] Machine-Learning-Team, Observability-Metrics: SLO dashboards for Lift Wing showing unexpected values - https://phabricator.wikimedia.org/T359879#9875881 (elukey) I rechecked our dashboards, and after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017873 I don't see anymore the weird values above 1...
[15:37:03] updated https://phabricator.wikimedia.org/T359879#9875881, I think that a change happened on April 8th fixed the weird values
[15:46:03] ack
[15:47:38] Bummer that it always takes at least a month to see if a change had the desired effect
[16:01:02] the holes are still there I am afraid
[16:01:08] for example https://grafana.wikimedia.org/dashboard/snapshot/MEfeYAWsKvOk8rA5SY4wrfWY27d1j7s1?orgId=1
[16:01:26] from the first of the month I don't see metrics, until the 8th
[16:04:10] Machine-Learning-Team, Observability-Metrics, SRE Observability (FY2023/2024-Q4): Gap in metrics rendered from Thanos Rules - https://phabricator.wikimedia.org/T352756#9876049 (elukey) Rechecking this - after the usual time window update for the quarter, I see that we don't have metrics from the firs...
[16:12:40] Machine-Learning-Team, observability: Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390#9876102 (elukey) a:elukey→None
[16:12:53] Machine-Learning-Team, Observability-Metrics, SRE Observability (FY2023/2024-Q4): Gap in metrics rendered from Thanos Rules - https://phabricator.wikimedia.org/T352756#9876105 (elukey) a:elukey→None
[16:13:03] Machine-Learning-Team, Observability-Metrics: SLO dashboards for Lift Wing showing unexpected values - https://phabricator.wikimedia.org/T359879#9876106 (elukey) a:elukey→None
[16:13:10] Machine-Learning-Team, observability: Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390#9876100 (elukey) Status: * The timings in https://thanos.wikimedia.org/rules#istio_slos are good, max is less than 20s and the latency improved a lot after the performance tuning...
[16:58:49] :(
[16:59:42] same thing happens to outlink https://grafana.wikimedia.org/d/slo-Lift_Wing_Article_Topic_Outlink/lift-wing-article-topic-outlink-slo-s?orgId=1
[16:59:46] it starts from 8th of June
[17:00:38] no idea why
[17:00:54] but at least one problem is solved :)
[17:01:22] I removed myself as owner of the task, I'll keep checking but if anybody wants to work on it feel free
[17:12:08] ok! thanks for all work and the updates!
[17:38:35] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144#9876703 (KStoller-WMF)
[17:38:45] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144#9876708 (KStoller-WMF)
[18:21:36] this is a very minimal way to add a model to the liftwing python package https://github.com/wikimedia/liftwing-python/pull/10
[18:21:47] (PS2) Ilias Sarantopoulos: huggingface: kserve 0.13.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1041082 (https://phabricator.wikimedia.org/T367048)
[18:23:43] (PS3) Ilias Sarantopoulos: huggingface: kserve 0.13.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1041082 (https://phabricator.wikimedia.org/T367048)
[18:37:26] (CR) Ilias Sarantopoulos: "Compressed image is 3.14GB so this is still good with the registry" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1041082 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos)
[18:37:40] logging off folks! nightyyy