[06:10:14] Machine-Learning-Team, Wikimedia Enterprise: Elevate LiftWing access to WME tier for development and production environment - https://phabricator.wikimedia.org/T346032 (elukey) @prabhat It shouldn't be a problem, so let's keep both dev and prod accounts for the moment. The total traffic per second should...
[06:15:31] hello folks!
[06:23:05] Good morning! o/
[06:37:49] filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/956775
[06:38:02] to remove the revision-score stream from eventstreams
[06:40:57] Machine-Learning-Team, Patch-For-Review: Deprecate mediawiki revision-score stream - https://phabricator.wikimedia.org/T342116 (elukey) From [[ https://thanos.wikimedia.org/graph?g0.expr=sum(eventstreams_connected_clients%7Bstream%3D%22mediawiki.revision-score%22%7D)%20by%20(client_ip)&g0.tab=0&g0.stacke...
[06:43:18] Nice!
[06:44:14] I'm off to donate some blood. bbl
[06:44:22] nice!
[06:44:29] thanks for doing it :)
[06:52:16] <3
[07:25:10] * elukey afk for some errands
[07:39:58] morning :)
[08:03:52] Machine-Learning-Team, Patch-For-Review: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (kevinbazira) As suggested in T339890#9156420, I ran the `load_raw_embedding` method and saved both `wikidata_ids` and `decoded_lines` numpy arrays: ` ... np.save('wikida...
[08:30:22] Machine-Learning-Team, Patch-For-Review, Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (achou) I conducted some load tests on the readability model in staging using the same input and script as we did for re...
[08:36:08] (PS4) AikoChou: test: add load test script and input for outlink [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/954613
[08:36:30] (CR) AikoChou: [V: +2] test: add load test script and input for outlink [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/954613 (owner: AikoChou)
[08:44:08] Machine-Learning-Team, Data Engineering and Event Platform Team, Data-Engineering, Wikimedia Enterprise, and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (achou) a:achou
[08:55:40] Morning!
[09:02:55] o/
[09:03:38] i'm back
[09:03:53] welcome back klausman !
[09:08:26] It's good to be back, though confusing :D
[09:12:20] elukey: I saw you did quite some work on the SLOs (and added ORES SLOs). We should sync up on that
[09:15:10] klausman: sure! I wanted to have a chat with you about the metrics, I added a note in the task since the avg(rate) seems odd
[09:15:37] ah yes, that is about converting a 5m rate to a 90d one. I can explain in more detail
[09:15:40] I proposed a new formula, but it needs recording rules etc.. and I wanted to know your opinion first
[09:16:05] klausman: yeah but we are not counting all the events as requested, at least IIUC
[09:16:10] the numbers are very different
[09:16:19] Hmm. I'll do some reading
[09:17:08] I may be wrong so I added all my thoughts to the task
[09:18:02] but basically all the dashboards are up, we decided to just set SLO 95% for latency and availability for any new service (ores-legacy, rec-api-ng, etc..)
[09:18:08] and then refine after a quarter
[09:18:23] revertrisk has now separate SLOs for every model type (after a chat with the team)
[09:18:32] That sounds good. Due to needing the RRs, the time depth isn't great yet
[09:20:00] The calculations/formulas aren't helped by PromQL being really hard to read due to long metric names and labels
[09:21:49] my line of thinking for the new proposal (sum of increase basically) is due to the fact that we are dealing with counters, so summing up all the events is easily doable with increase()
[09:22:25] but we need recording rules otherwise it is super slow
[09:24:32] yes
[09:24:48] fwiw, this is how I ended up with the avg_over_time(...[90d]): https://phabricator.wikimedia.org/P52462
[09:25:49] (SuperQ is one of the main Prom devs and super helpful, over in #prometheus)
[09:28:43] klausman: yeah makes sense, my point is more related to the premise - IIUC the SLO calculations require divisions between good/bad events and total events
[09:29:02] and avg(rate:5m) seems to be different
[09:29:28] (but again I could be wrong this is why I am asking/proposing to use increase)
[09:30:09] Oh you mean for the recorded metric, not the compute-90d-from-5m part?
[09:30:26] exactly yes!
[09:30:41] I totally understood why you used that formula
[09:30:59] ah, so that is why (in contrast to the metrics mentioned in the phab ticket) my change had response_code=~"2.."
[09:31:48] I don't follow sorry
[09:32:16] Let me try to re-understand my initial patch set :)
[09:32:54] I compared numbers to understand the difference, and IIUC in one case we avg the rate of events over 90d
[09:33:03] in the other one we *count* all the events
[09:33:35] avg+rate removes precision in my opinion
[09:34:12] you mean in latency vs. error rate?
[09:34:38] in both use cases
[09:34:57] then what do you mean by "in one case vs the other"?
[09:35:21] avg(rate:5m) vs sum(increase)
[09:36:22] I don't think I used increase() at all, originally.
[09:36:37] exactly, it is my proposal :)
[09:36:48] you used avg(rate), the current implementation
[09:37:05] I left it as it is, modulo changes required by olly and the team
[09:37:49] reading up on the semantic differences between increase and rate
[09:39:04] increase() just counts events, that is very nice for a counter
[09:39:23] rate is already an avg (loss of info), plus avg_over_time another one
[09:39:59] The docs say that increase(foo[1m]) is just rate(foo[1m])*60
[09:40:25] And they specifically say to use rate() for recording rules
[09:40:30] https://prometheus.io/docs/prometheus/latest/querying/functions/#increase
[09:41:10] ah right didn't see it
[09:41:16] mm interesting
[09:41:39] not what I expected, but anyway, let's review one of the SLO calculations
[09:41:40] 1 - (bad_events / total_events) / error_budget
[09:42:06] in here it is requested to use the number of events that happened in a range, not the rate
[09:42:15] this is why I thought increase() was better
[09:42:27] Mh, yes, it's definitely more intuitive
[09:42:28] (plus sum() over increase eliminates the need for avg_over_time)
[09:42:47] it returns completely different numbers
[09:43:04] I am willing to use increase() instead if that makes the computations more readable/maintainable
[09:43:34] I'd also drop avg_over_time too
[09:43:37] How credible my initial computations were was hard to say because of too little history
[09:44:31] One note though: if we made a recording rule for 90d, that would be quite CPU intensive, since at every scrape, the whole 90d of history of all metrics would need to be (re)examined.
[09:44:32] my doubts are related to having a rate (say, 5 rps) averaged over time in the calculation I showed up above
[09:44:42] where a number of events is needed
[09:45:30] I think it is sufficient to use a rule that is similar to the ones we already have
[09:45:40] using increase instead of rate5m
[09:45:43] then we apply the sum
[09:45:51] does it make sense?
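The formula quoted at 09:41:40 can be tried out with plain numbers. The following is a minimal sketch, not the team's actual recording rules: the per-window counts, the 5-minute window length and all function names are hypothetical, and the 0.05 error budget is assumed from the 95% SLO mentioned earlier in the log. It shows that dividing summed event counts matches dividing averaged rates when the windows are equally long, while averaging per-window ratios gives a very different answer.

```python
from fractions import Fraction

WINDOW_SECONDS = 300  # assumed 5-minute windows, as in rate:5m

# Hypothetical counter increases per window (not real Lift Wing data).
total_increase = [1000, 10, 1000]
bad_increase = [0, 5, 0]


def remaining_budget(bad_ratio, error_budget):
    """1 - (bad_events / total_events) / error_budget, as quoted in the chat."""
    return 1 - bad_ratio / error_budget


def ratio_from_counts(bad, total):
    """Count every event in the range, then divide once."""
    return Fraction(sum(bad), sum(total))


def ratio_from_avg_rates(bad, total):
    """Average the per-window rates of each counter, then divide the averages."""
    bad_rates = [Fraction(b, WINDOW_SECONDS) for b in bad]
    total_rates = [Fraction(t, WINDOW_SECONDS) for t in total]
    avg_bad = sum(bad_rates) / len(bad_rates)
    avg_total = sum(total_rates) / len(total_rates)
    return avg_bad / avg_total


def avg_of_window_ratios(bad, total):
    """Divide inside each window first, then average the ratios."""
    ratios = [Fraction(b, t) for b, t in zip(bad, total)]
    return sum(ratios) / len(ratios)


error_budget = Fraction(5, 100)  # assumed: 95% SLO target
for name, fn in [
    ("counts", ratio_from_counts),
    ("avg of rates", ratio_from_avg_rates),
    ("avg of ratios", avg_of_window_ratios),
]:
    ratio = fn(bad_increase, total_increase)
    print(name, float(ratio), float(remaining_budget(ratio, error_budget)))
```

In this toy data the low-traffic window dominates the averaged ratio, so the order of division and averaging matters more than the choice between a rate and a count.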
[09:45:54] sum_over_time, that is
[09:46:26] correct
[09:46:48] We would need to change the RR and then remake the dashboard computations to fit :-/
[09:47:28] yes but I already sent a patch for one SLO in the list, I can extend it to all
[09:47:34] and the recording rules should be quick
[09:47:39] 955958?
[09:47:49] I am not 100% sure avg_over_time(rate:5m) is correct, this is why
[09:48:10] yep
[09:48:24] (need to run errand, bbl)
[09:48:40] ttyl. will do some more reading
[10:14:48] * klausman lunch
[10:22:47] Machine-Learning-Team: Outlink returns 500 when EventGate returns 503 Service Unavailable - https://phabricator.wikimedia.org/T346136 (achou)
[10:34:58] * aiko lunch
[11:09:30] o/
[11:10:42] elukey: klausman: I need some help with revscoring pods. there are 3 pods in damaging namespace (eswiki, zhwiki and ruwiki) where the queue-proxy container is failing
[11:16:37] I don't see any non-running pods in the revscoring-editquality-damaging NSes in eqiad, codfw, or staging
[11:17:40] ah hang on.
[11:18:09] I see 2/3 containers running for es, nl, ru and zh
[11:24:41] Hmm, not sure what is going on. the logs are not useful
[11:24:48] Machine-Learning-Team, Data Engineering and Event Platform Team, Data-Engineering, Wikimedia Enterprise, and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (lbowmaker) @AikoChou - we hav...
[11:26:48] I can try and delete one of the pods, see if restarting it helps. We'd still have the others to see what the actual bug is
[11:27:20] deleting eswiki-damaging-predictor-default-00011-deployment-754dcd8h4mtb
[11:27:35] thanks! yes that would probably help.
[11:27:54] eswiki-damaging-predictor-default-00011-deployment-754dcd86j6kb is up at 3/3
[11:28:38] nice!
[11:29:14] codfw is ok on the other hand.
[11:31:00] I should request access to be able to do these kinds of operations (delete pod, resource etc)
[11:31:31] You think just bouncing the other pods the same way is the right thing to do? Feels a bit like papering over a problem.
[11:31:45] (and destroying "evidence")
[11:31:49] yes please do, for now
[11:31:54] alright
[11:32:38] it is production traffic, otherwise we would keep it like that. We can look at the queue-proxy logs and keep an eye if it happens again
[11:32:41] Thanks a lot!
[11:32:58] it already impacted our error budget/SLI
[11:33:00] np, all restarted (nl, es, ru and zh)
[11:33:28] the nl, ru, and zh ones still show as terminating, that always takes a while
[11:33:37] just tested, all work great!
[11:33:44] :+1:
[11:34:01] * isaranto lunch
[11:58:30] Morning all
[11:59:00] Hey Klausman
[12:11:49] morning!
[12:12:37] isaranto: related to the SLO (don't stress a lot about it) - I had an interesting chat with Reuven and the time window to set now would be 2023-09-01 00:00:00 to 2023-11-30 23:59:59
[12:13:11] the idea is that we have a sliding window of 3 months, offset 1 month earlier, so if anything horrible happens and we burn the slo down we have a month before end of quarter to plan accordingly
[12:13:31] every three months we'll have to "fix" the dashboards
[12:20:12] Morning Chris!
[12:21:35] elukey: Cool! I'm not stressing but I really like the aggregated monitoring we can get from the dashboard
[12:26:37] (PS1) Kevin Bazira: load_raw_embedding: Downcast float to np.float32 [research/recommendation-api] - https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T339890)
[12:28:14] (CR) CI reject: [V: -1] load_raw_embedding: Downcast float to np.float32 [research/recommendation-api] - https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[12:40:34] Machine-Learning-Team, Data Engineering and Event Platform Team, Data-Engineering, Wikimedia Enterprise, and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (achou) Hi @lbowmaker, thanks...
[13:33:52] (PS2) Kevin Bazira: load_raw_embedding: Downcast float to np.float32 [research/recommendation-api] - https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T339890)
[13:34:24] (CR) CI reject: [V: -1] load_raw_embedding: Downcast float to np.float32 [research/recommendation-api] - https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[13:35:09] Machine-Learning-Team: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (isarantopoulos)
[13:37:11] kevinbazira: o/
[13:37:36] before attempting the road of truncating the float, wouldn't it be better to try to load the np array from file?
[13:37:47] and do the related performance tests etc..
[13:38:06] we can always use the truncation later on, but we'd need the sign-off of Research before proceeding
[13:38:42] (PS3) Kevin Bazira: load_raw_embedding: Downcast float to np.float32 [research/recommendation-api] - https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T339890)
[13:39:13] (CR) CI reject: [V: -1] load_raw_embedding: Downcast float to np.float32 [research/recommendation-api] - https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[13:40:35] elukey: o/
[13:42:01] Machine-Learning-Team, Patch-For-Review: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (elukey) Opened T346144 as related change for the dashboards :)
[13:42:21] elukey: I did run the test and shared it in the task, here are the results of loading the np array from a file: https://phabricator.wikimedia.org/T339890#9159257
[13:43:43] kevinbazira: very nice didn't see it! Let's start with that change if you agree, it seems the safest
[13:43:55] in the meantime we can ask for Research's sign off for the other one
[13:44:00] have you tried both changes combined?
[13:44:22] or is the one linked above already running both changes?
[13:45:38] 3 tests were made: T339890#9155985, T339890#9156601,
[13:45:38] and T339890#9159257. the 2nd test had the best results.
[13:47:46] kevinbazira: yeah but you can easily combine the last two
[13:48:19] change load_raw_embeddings and use np.float32, this would be another interesting result
[13:48:53] I agree that the 2nd is the best in terms of raw numbers, but it is the more invasive one
[13:49:11] changing load_raw_embeddings doesn't really have any effect (at least on paper)
[13:49:22] this is why I was suggesting the "safest" approach first
[13:56:56] that would be great. I wonder what chrisalbon would recommend between T339890#9156601 and T339890#9159257.
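The trade-off behind the np.float32 downcast discussed above (less memory, slightly less precision, hence the request for a Research sign-off) can be illustrated without NumPy. This is a rough sketch under stated assumptions: it uses the standard-library `array` module as a stand-in for NumPy arrays, the embedding values are made up, and the helper names are hypothetical. It is not the code from the Gerrit patch.

```python
from array import array

# Hypothetical embedding values (not taken from the real model file).
embedding = [0.123456789012345, -1.5, 3.141592653589793]

# "d" stores C doubles (64-bit), "f" stores C floats (32-bit).
as_float64 = array("d", embedding)
as_float32 = array("f", embedding)


def memory_bytes(arr):
    """Bytes used by the raw element buffer of an array."""
    return arr.itemsize * len(arr)


def max_abs_error(original, downcast):
    """Largest absolute difference introduced by the downcast."""
    return max(abs(a - b) for a, b in zip(original, downcast))


print("float64 bytes:", memory_bytes(as_float64))
print("float32 bytes:", memory_bytes(as_float32))
print("max abs error:", max_abs_error(as_float64, as_float32))
```

Whether a rounding error of this size matters for the recommendations is a question about the model, not about the storage format, which is presumably why the chat treats loading the array from a file as the safer first step.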
[13:58:15] Machine-Learning-Team, Patch-For-Review, Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (elukey) Great results @achou! @MGerlach before proceeding, do you have any plan for the model? I mean, are there any k...
[13:58:37] chrisalbon and isaranto: your recommendations on T339890#9156601 and T339890#9159257 are welcome.
[13:58:56] Machine-Learning-Team: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (isarantopoulos)
[14:07:12] (PS4) Kevin Bazira: load_raw_embedding: Downcast float to np.float32 [research/recommendation-api] - https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T339890)
[14:07:44] (CR) CI reject: [V: -1] load_raw_embedding: Downcast float to np.float32 [research/recommendation-api] - https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[14:16:02] Machine-Learning-Team, Goal: Zero traffic on bare metal ORES servers - https://phabricator.wikimedia.org/T341696 (isarantopoulos) All MediaWiki traffic has been redirected to use Lift Wing starting from 11th of September.
[14:16:43] Machine-Learning-Team, Goal: Lift Wing announced as MVP to the public - https://phabricator.wikimedia.org/T341703 (calbon) update: no update
[14:19:47] Machine-Learning-Team, Goal: Defined and measured SLO for every production service - https://phabricator.wikimedia.org/T341693 (klausman) a:klausman
[14:31:52] Machine-Learning-Team, Goal: Content Recommendation API migration completed - https://phabricator.wikimedia.org/T341704 (kevinbazira) We are working on making adjustments to the rec-api deployment settings until we get to a state that can run on LiftWing. Below are the settings configured so far: 1. a...
[14:58:14] something not great - https://grafana.wikimedia.org/d/slo-ORES_Legacy/ores-legacy-slo-s?orgId=1 - the latency SLO may not be perfect :D
[14:58:33] not sure if it is because we have zero traffic, but IIRC last week it wasn't like that
[14:59:04] https://grafana.wikimedia.org/d/slo-Lift_Wing_Revert_Risk_LA/lift-wing-revert-risk-la-slo-s?orgId=1 is better but there are holes in metrics
[14:59:29] same in https://grafana.wikimedia.org/d/slo-Lift_Wing_Revscoring/lift-wing-revscoring-slo-s?orgId=1
[15:01:26] Machine-Learning-Team, Wikimedia Enterprise: Elevate LiftWing access to WME tier for development and production environment - https://phabricator.wikimedia.org/T346032 (klausman) One thing of note: after elevating the tier like Luca did yesterday, the token has to be re-issued using the webui to have the...
[15:03:37] elukey: WoW about ores-legacy. regarding the other 2 (revertrisk and revscoring) I don't see an issue.
[15:04:02] isaranto: yeah ores-legacy is probably my fault, for the other you don't see holes in grafana?
[15:04:19] other than the 79% remaining error budget for damaging model, but that was caused by the downtime in the pods that we resolved today
[15:04:26] (zhwiki, eswiki and ruwiki)
[15:04:27] ah right wait I still had the 90d time window
[15:04:28] okok
[15:04:34] with 7 days it is better
[15:05:09] Machine-Learning-Team, Wikimedia Enterprise: Elevate LiftWing access to WME tier for development and production environment - https://phabricator.wikimedia.org/T346032 (prabhat) @elukey Sure, sounds good. Thanks. @LDlulisa-WMF Could you pls re-issue the tokens and update in our infra?
[15:06:53] when u say "holes" do u mean the gaps that exist in the SLI chart?
[15:06:59] exactly
[15:07:23] I'm not sure why these are caused, but I don't see them affecting the error budget
[15:10:07] Machine-Learning-Team, Wikimedia Enterprise: Elevate LiftWing access to WME tier for development and production environment - https://phabricator.wikimedia.org/T346032 (LDlulisa-WMF) @klausman, @prabhat Thanks, will do.
[15:37:31] Machine-Learning-Team, Goal: Order 2-4 GPU for Lift Wing and Statbox - https://phabricator.wikimedia.org/T341699 (elukey) Some info: * Current GPU https://www.techpowerup.com/gpu-specs/radeon-pro-wx-9100.c2989 * Radeon MI50 https://www.amd.com/en/products/professional-graphics/instinct-mi50-32gb * Measu...
[15:48:16] (PS5) Ilias Sarantopoulos: load_raw_embedding: Downcast float to np.float32 [research/recommendation-api] - https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[15:52:50] fixed the unit test --^
[15:53:23] stepping away from the keyboard for today folks.
[15:54:04] I feel drained from the blood donation. cu tomorrow!
[15:57:30] \o ttyl
[15:58:14] Thank you for fixing this isaranto.
[15:58:14] Hope you feel better soon!
[15:58:14] I will test the combination of T339890#9156601 and T339890#9159257 as suggested in the meeting then share the findings on the task.
[15:58:14] If the 4th test is better than the 2nd then we might opt for that one. I will keep you posted.
[15:59:51] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES, MW-1.41-notes (1.41.0-wmf.26; 2023-09-12), Moderator-Tools-Team (Kanban): ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (jsn.sherman) As a starting point, the documented defaults were incomplet...
[16:01:07] bye Ilias! have a nice rest of the day :)
[16:30:57] going afk as well, have a good one folks!
[16:40:39] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES, MW-1.41-notes (1.41.0-wmf.26; 2023-09-12), Moderator-Tools-Team (Kanban): ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (jsn.sherman) With other models left as above, I can enable reverted and...
[17:01:35] logging off as well!
[17:45:47] Machine-Learning-Team: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared - https://phabricator.wikimedia.org/T346175 (calbon)
[18:00:59] Machine-Learning-Team, Data-Engineering, Wikimedia Enterprise, Data Engineering and Event Platform Team (Sprint 2), and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (lbowmaker)
[18:17:17] hello ML! so I take it the new LiftWing API does not accept multiple revision IDs?
[19:34:54] Machine-Learning-Team, ORES, Growth-Team, MediaWiki-Recent-changes: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared - https://phabricator.wikimedia.org/T346175 (kostajh)
[19:47:09] Machine-Learning-Team, Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123 (Isaac) > If I foresee a potential engineering challenge, it's that the model inference code currently...