[05:55:44] (PS1) Santhosh: Optimize page collection metadata fetching with batch processing and concurrency limits [research/recommendation-api] - https://gerrit.wikimedia.org/r/1130348
[05:57:22] (CR) CI reject: [V:-1] Optimize page collection metadata fetching with batch processing and concurrency limits [research/recommendation-api] - https://gerrit.wikimedia.org/r/1130348 (owner: Santhosh)
[06:00:16] (CR) Santhosh: "recheck" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1130348 (owner: Santhosh)
[07:33:20] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10666558 (ayounsi) There was an outstanding diff from automation trying to add the BGP config for ml-serve2004 on the switch. As I see that there is a provisi...
[08:25:23] (CR) Nik Gkountas: [C:-1] "It looks good overall. Just a comment that we may want to address." [research/recommendation-api] - https://gerrit.wikimedia.org/r/1130348 (owner: Santhosh)
[09:02:29] morning folks o/
[09:03:49] 早上好! (Good morning!)
[09:06:41] elukey: 早啊 (casual "mornin'!") :)))
[09:09:15] all right, 啊
[09:09:19] is new to me :D
[09:09:34] is that like "morning as well"?
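The "batch processing and concurrency limits" change above could look roughly like the following — a minimal asyncio sketch, not the actual patch. All names here (fetch_metadata, fetch_all_metadata, BATCH_SIZE, MAX_CONCURRENCY) and the specific limits are illustrative assumptions:

```python
# Sketch: fetch page metadata in batches, with a cap on how many
# batches are in flight at once. Names and limits are assumptions,
# not taken from the real recommendation-api change.
import asyncio

BATCH_SIZE = 50        # titles per API request (assumed)
MAX_CONCURRENCY = 5    # simultaneous in-flight batches (assumed)

async def fetch_metadata(batch):
    """Placeholder for the real per-batch API call."""
    await asyncio.sleep(0)  # stands in for network I/O
    return {title: {"title": title} for title in batch}

async def fetch_all_metadata(titles):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def limited(batch):
        # At most MAX_CONCURRENCY batches run concurrently.
        async with sem:
            return await fetch_metadata(batch)

    batches = [titles[i:i + BATCH_SIZE]
               for i in range(0, len(titles), BATCH_SIZE)]
    results = await asyncio.gather(*(limited(b) for b in batches))
    merged = {}
    for partial in results:
        merged.update(partial)
    return merged
```

The semaphore is what keeps a large page collection from turning into an unbounded burst of concurrent API requests, which is presumably the failure mode the patch addresses.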
[09:11:06] morning morning
[09:14:50] (PS2) Kevin Bazira: RRLA: send prediction results to output event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129755 (https://phabricator.wikimedia.org/T326179)
[09:14:59] o/
[09:15:05] ml-serve2004 back in service with containerd
[09:15:56] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10666707 (elukey)
[09:20:45] (CR) CI reject: [V:-1] RRLA: send prediction results to output event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129755 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[09:20:45] elukey: 啊 is typically used as a sentence-final interjection; it adds a tone of urgency, exclamation, or excitement. 早啊 is more casual than 早上好 :D
[09:21:12] (PS1) DCausse: search weighted_tags: allow producing to the "v1" stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130529 (https://phabricator.wikimedia.org/T375821)
[09:21:13] (PS1) DCausse: search weighted_tags: drop BC for rc0 weighted_tag stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130530 (https://phabricator.wikimedia.org/T375821)
[09:21:43] aiko: ahhh TIL thanks!
[09:22:56] (CR) DCausse: [C:-1] search weighted_tags: drop BC for rc0 weighted_tag stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130530 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[09:30:55] (PS3) Kevin Bazira: RRLA: send prediction results to output event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129755 (https://phabricator.wikimedia.org/T326179)
[09:46:01] (CR) Kevin Bazira: RRLA: send prediction results to output event stream (4 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129755 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[09:53:08] (CR) Kevin Bazira: [C:+1] "Thank you for working on this, David! LGTM :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130529 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[10:02:56] Lift-Wing: LiftWing articlecountry model logs improper json in stderr - https://phabricator.wikimedia.org/T389768 (dcausse) NEW
[10:15:47] (CR) Santhosh: Optimize page collection metadata fetching with batch processing and concurrency limits (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1130348 (owner: Santhosh)
[10:27:28] (CR) AikoChou: [C:+1] "One little thing about the parameters. Other than that, LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129755 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[10:54:20] Machine-Learning-Team, EditCheck: Evaluate the existing peacock detection model - https://phabricator.wikimedia.org/T386645#10667152 (achou) **[Correction] The evaluation results in T386645#10596450 were not from the "all_bert-base-multilingual-cased_peacock_512" model, but from the monolingual model tha...
[10:59:35] klausman: o/ I am going to reimage 1 ml-serve host every day; if you want we can split it (I start from the lowest IDs and you from the highest, like 2001 and 2001). It is not urgent, but let's set something in motion so we slowly get the whole fleet done
[10:59:54] *2001 and 2011
[11:00:17] yeah sounds good. Did you encounter any trouble with the eqiad reimages?
[11:01:05] I am doing codfw atm; all the issues are in the task. So far only hiccups, plus ml-serve2002 with a bad DIMM
[11:01:41] I also added what to do for the network/BGP config
[11:01:55] so we can freely move-vlan anytime
[11:02:09] (finishing ml-serve2005 as we speak)
[11:02:19] nice
[11:02:51] for some reason these hosts are a little bit weird when rebooting; they sometimes get stuck/frozen and a racadm serveraction powercycle is needed
[11:02:54] no idea why
[11:03:08] ah, and I encountered one time the issue with booting and grub not finding the root partition
[11:03:12] Hmm, odd. Only the Dell ones, or the SMC ones as well?
[11:03:13] reimaged again and it worked
[11:03:33] I think Dell for the moment
[11:03:39] Well, there has been only one SMC one, of course
[11:03:46] I did 2001->2005
[11:03:55] Yeah, the first SMC one will be 2009
[11:03:59] (outside of staging)
[11:04:59] Let me do 2011 after lunch, see if I am capable of following instructions ;)
[11:08:55] anytime, even later in the week; didn't mean to derail your plans
[11:09:05] just keeping you informed since I am messing with the ML infra :)
[11:09:19] aye cap'n
[11:09:54] I think doing up to 2-3 machines a day should be feasible, as long as nothing goes wrong. Maybe let the first 1-2 SMC machines soak a bit longer just to be sure.
[11:10:53] we are not in a hurry; even 1 host for me and one for you is a lot, we'll likely finish in a couple of weeks
[11:12:02] I also don't want to burn all your time on this, when nominally it's my job to do it :)
[11:18:07] Machine-Learning-Team, EditCheck: Evaluate the existing peacock detection model - https://phabricator.wikimedia.org/T386645#10667304 (achou) **Evaluation results on Spanish (es), Japanese (ja), Arabic (ar), and Portuguese (pt)** eswiki: - Data source: peacock reverts - Total evaluation examples: 353 (18...
[11:18:08] (PS4) Kevin Bazira: RRLA: send prediction results to output event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129755 (https://phabricator.wikimedia.org/T326179)
[11:23:33] 2005 up and running
[11:36:10] (CR) AikoChou: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129755 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[12:08:46] (CR) Kevin Bazira: [C:+2] "谢谢 (thanks) :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129755 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[12:16:59] (Merged) jenkins-bot: RRLA: send prediction results to output event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129755 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[12:46:32] :D
[12:47:15] (CR) Sbisson: [C:+1] Optimize page collection metadata fetching with batch processing and concurrency limits (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1130348 (owner: Santhosh)
[13:30:45] elukey: I've started working on 2011 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1130595
[13:45:31] klausman: LGTM!
Please run pcc on various nodes to confirm that only 2011 is changed
[13:45:42] ack, will do
[13:45:42] just to be very defensive
[13:46:01] if puppet starts to add containerd on other nodes we are [censored] :D
[13:46:52] yeah, I had that lovely experience with staging :)
[13:47:07] 2011 is already drained & cordoned, btw.
[13:47:45] You ran with --move-vlan, I presume? Any other extra options for the reimage?
[13:47:53] (PS2) Kevin Bazira: search weighted_tags: allow producing to the "v1" stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130529 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[13:54:25] (CR) Kevin Bazira: [C:+2] search weighted_tags: allow producing to the "v1" stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130529 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[13:55:50] Starting reimage
[13:56:13] (Merged) jenkins-bot: search weighted_tags: allow producing to the "v1" stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130529 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[13:56:32] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10668134 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-serve2011.codfw.wmnet with OS bookworm
[14:00:25] np, --move-vlan is fine!
[14:15:09] does that also do the Homer bits or is that still a manual step?
[14:20:03] still manual, yes
[14:20:25] first you need to run it on the cr*-codfw routers (right after move-vlan, basically)
[14:20:38] and then on the target L3 switch (you can find it in Netbox)
[14:20:50] so the old BGP session is removed, and the new one is added
[14:21:04] and after that calico will peer etc.
[14:21:14] Ack.
[14:21:33] doing that while the first puppet run happens.
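The manual Homer steps described above (core routers first, then the target L3 switch) might look roughly like this sketch. It assumes Homer's usual diff/commit invocation against a device query; the switch name and commit messages are hypothetical placeholders, not from the actual migration:

```sh
# Sketch of the manual Homer run after a --move-vlan reimage (T387854).
# Device names below are hypothetical; look up the real ToR switch in Netbox.

# 1. Drop the old BGP session from the core routers:
homer 'cr*-codfw*' diff                                        # review first
homer 'cr*-codfw*' commit "Remove old BGP session for ml-serve2011 - T387854"

# 2. Add the new session on the target L3 (ToR) switch:
homer 'lsw1-xyz-codfw*' diff
homer 'lsw1-xyz-codfw*' commit "Add BGP session for ml-serve2011 - T387854"

# After both runs, Calico on the host can peer with the new switch.
```

As the later messages note, on newer machines already on the right VLAN both diffs can come back empty, in which case there is nothing to commit.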
[14:21:57] Hmmm, the cr*-codfw homer diff is empty.
[14:22:13] ahhh, this is a new-enough machine that the VLAN move was not actually necessary
[14:22:37] the ToR switch diff should also be empty; checking now
[14:23:20] yep, both empty
[14:25:30] nice :)
[14:31:37] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10668460 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-serve2011.codfw.wmnet with OS bookworm completed:...
[14:39:31] ok, machine is up, back in the cluster, and pods on it are serving traffic normally \o/
[14:46:09] nice!
[14:50:27] No trouble whatsoever, no boot failures or anything. Very smooth :)
[14:54:19] are you implying some correlation between me and failures? :D :D
[14:59:21] nooo. Correlation between you writing good docs and working out the bugs
[16:50:35] Machine-Learning-Team, collaboration-services, Discovery-Search (2025.03.22 - 2025.04.11), Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10669735 (Gehel)
[17:26:45] Machine-Learning-Team, EditCheck, VisualEditor, Editing-team (Tracking): Evaluate efficacy of Peacock Check model output (internal review) - https://phabricator.wikimedia.org/T384651#10670100 (ppelberg)
[17:27:10] Machine-Learning-Team, EditCheck, VisualEditor, Editing-team (Tracking): Evaluate efficacy of Peacock Check model output (internal review) - https://phabricator.wikimedia.org/T384651#10670106 (ppelberg) a: ppelberg → SSalgaonkar-WMF
[18:35:15] Machine-Learning-Team, ORES, Testing Support, VisualEditor, Continuous-Integration-Config: Audit tests/selenium/LocalSettings.php file aiming at possibly deprecating the feature - https://phabricator.wikimedia.org/T199939#10670611 (zeljkofilipin)