[06:23:30] Hola o/
[07:42:22] morning!
[07:56:08] o/ aiko
[09:51:08] (PS1) AikoChou: reference-quality: add reference-risk model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1076163 (https://phabricator.wikimedia.org/T372405)
[11:33:19] (PS2) AikoChou: reference-quality: add reference-risk model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1076163 (https://phabricator.wikimedia.org/T372405)
[11:40:03] (PS3) AikoChou: reference-quality: add reference-risk model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1076163 (https://phabricator.wikimedia.org/T372405)
[12:24:02] (PS3) Nik Gkountas: Initialize campaign cache and update it every 1 hour [research/recommendation-api] - https://gerrit.wikimedia.org/r/1075974
[13:03:05] (PS4) Nik Gkountas: Initialize campaign cache and update it every 1 hour [research/recommendation-api] - https://gerrit.wikimedia.org/r/1075974
[13:03:09] (PS2) Nik Gkountas: Use category search to find campaign pages instead of template [research/recommendation-api] - https://gerrit.wikimedia.org/r/1076020 (https://phabricator.wikimedia.org/T373132)
[13:09:05] I was looking at the latency SLO for articlequality and the query we use in the Grafana dashboards for the preprocess latency graphs.
[13:09:51] I modified this query to find the 99th percentile over the previous 90d period.
[13:11:48] I want to get just the upper bound, since our initial query refers to multiple pods; looking at that over a 90d period is a disaster.
[13:11:48] I ended up with this:
[13:11:48] ```
[13:11:48] histogram_quantile(0.99, sum(rate(request_preprocess_seconds_bucket{kubernetes_namespace=~"revscoring-articlequality", component=~".*", model_name=~"enwiki-articlequality", app_wmf="kserve-inference"}[90d])) by (le))
[13:11:48] ```
[13:12:45] (PS3) Nik Gkountas: Use category search to find campaign pages instead of template [research/recommendation-api] - https://gerrit.wikimedia.org/r/1076020 (https://phabricator.wikimedia.org/T373132)
[13:13:05] (CR) Nik Gkountas: Fetch campaign metadata and return them with recommendations (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1070308 (https://phabricator.wikimedia.org/T373132) (owner: Nik Gkountas)
[13:14:12] (PS7) Nik Gkountas: Fetch campaign metadata and return them with recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1070308 (https://phabricator.wikimedia.org/T373132)
[13:17:45] which gives me 6.18s, and that is just for preprocess. Predict seems to always be in the ms range, so I was thinking of updating the latency SLO for this model to 7s in Pyrra and monitoring it.
[13:35:19] I'll open a task about this so that we keep track over there, but the SLO definitions are in the puppet repo. For example, here is the articlequality one -> https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/pyrra/filesystem/slos.pp#68
[13:50:27] (PS8) Nik Gkountas: Fetch campaign metadata and return them with recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1070308 (https://phabricator.wikimedia.org/T373132)
[13:54:42] (PS3) Kevin Bazira: article-country: initial commit [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1075033 (https://phabricator.wikimedia.org/T371897)
[13:55:48] (CR) Kevin Bazira: "thanks. fixed!"
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1075033 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[14:55:42] * isaranto afk
[15:00:51] isaranto: ack! thanks for working on the SLO
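A note on the `histogram_quantile` query discussed above: Prometheus estimates the quantile by linear interpolation within the cumulative `le` buckets of the histogram. A minimal Python sketch of that interpolation, using hypothetical bucket bounds and counts (not real preprocess-latency data), looks roughly like this:

```python
# Sketch of Prometheus's histogram_quantile interpolation over
# cumulative bucket counts. Bucket data below is hypothetical,
# not taken from the actual request_preprocess_seconds metric.

def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted by bound."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total  # how many observations fall at or below the quantile
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linearly interpolate within the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical example: 1000 requests, p99 falls in the 5s-10s bucket.
buckets = [(0.5, 200), (1.0, 600), (5.0, 950), (10.0, 1000)]
print(histogram_quantile(0.99, buckets))  # -> 9.0
```

This is only an illustration of the mechanics; the real query additionally does `sum(rate(...[90d])) by (le)` to aggregate per-pod bucket rates before the quantile is computed, which is why the result is a single upper-bound estimate rather than one value per pod.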