[00:35:16] Lift-Wing, Machine-Learning-Team: Request to host the Reference Need Model on LiftWing - https://phabricator.wikimedia.org/T371902#10098476 (Aitolkyn) Hi Aiko! The location on the stat1010 is `/home/aitolkyn/temp/reference-quality/pretrained_models/multilingual_reference_need_128_v0.pkl` sha512 is in `/h...
[07:12:28] inflatador: oddly enough the list produced by sudo cumin 'C:amd_rocm' is incomplete. E.g. ml-staging2003 is missing, as are ml-serve2009-2011
[07:13:39] My suspicion is that this is because the field used is only needed for hosts that get the userspace tools for rocm installed. For bookworm k8s GPU hosts, this is not necessary (all the bits needed are shipped with the default Debian kernel)
[07:16:29] The metrics browser approach should work, if one uses Thanos
[07:17:35] I'll update the page
[08:53:46] good morning :)
[09:00:15] (CR) AikoChou: [C:+1] "Thanks for working on this, Kevin! After merge, we should deploy and test this in staging to make sure everything works without any issues" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1067216 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[09:23:28] (CR) Kevin Bazira: [C:+2] "Thanks for the review, Aiko!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1067216 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[09:43:41] (Merged) jenkins-bot: revert_risk_model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1067216 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[11:17:57] * klausman lunch
[12:39:14] klausman: I think the easiest query atm is 'C:prometheus::node_amd_rocm'
[12:40:08] C:amd_rocm doesn't hit the k8s nodes since profile::amd_gpu deploys amd_gpu only on non-k8s nodes
[12:40:20] but all of them have monitoring, so the list should be complete
[12:40:27] 11 hosts will be targeted:
[12:40:27] an-worker[1100-1101].eqiad.wmnet,dse-k8s-worker1001.eqiad.wmnet,ml-serve[2009-2011].codfw.wmnet,ml-serve1001.eqiad.wmnet,ml-staging[2001,2003].codfw.wmnet,stat[1008,1010].eqiad.wmnet
[12:40:34] looks correct afaics
[12:40:42] cc: inflatador: --^
[12:45:44] I'll add the cumin approach to the wiki page, since it's a bit more copy&paste-able than the Thanos stuff
[12:51:46] I've also added ml-staging2001's MI100 to the list
[13:21:23] good morning all
[13:24:01] if y'all have any dashboards w/GPU info LMK. I'm working on a dashboard for the stat hosts as they've been falling over a lot lately
[13:25:02] https://grafana.wikimedia.org/goto/pS0_doqIR?orgId=1 very much WIP
[13:43:16] I got this https://grafana-rw.wikimedia.org/d/d10408b0-518d-47d5-a879-81884b73d7dc/klausman-ml-amd-rocm-gpu?orgId=1
[13:48:31] excellent! will def "borrow" that
[13:55:48] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099735 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
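The host list in the 12:40:27 cumin output uses a compact bracket notation (`ml-serve[2009-2011]`, `ml-staging[2001,2003]`). As a rough sketch for checking the "11 hosts" count by hand, the expression can be expanded into individual hostnames. The function `expand_nodeset` below is a hypothetical, hand-rolled helper written for illustration; it is not Cumin's own parser and only handles the simple single-bracket form seen in this log.

```python
import re


def expand_nodeset(expr: str) -> list[str]:
    """Expand a comma-separated host expression with [a-b,c] groups.

    Hypothetical helper for illustration only: it supports at most one
    bracket group per host pattern, as in the cumin output quoted above.
    """
    hosts = []
    # Split on commas that are NOT inside a [...] group.
    for part in re.split(r",(?![^\[]*\])", expr):
        match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", part)
        if match is None:
            # Plain hostname without a bracket group.
            hosts.append(part)
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            if "-" in item:
                # Numeric range such as 2009-2011 (inclusive on both ends).
                low, high = item.split("-")
                width = len(low)
                for number in range(int(low), int(high) + 1):
                    hosts.append(f"{prefix}{number:0{width}d}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts


# Host expression copied from the 12:40:27 message in this log.
TARGETS = (
    "an-worker[1100-1101].eqiad.wmnet,dse-k8s-worker1001.eqiad.wmnet,"
    "ml-serve[2009-2011].codfw.wmnet,ml-serve1001.eqiad.wmnet,"
    "ml-staging[2001,2003].codfw.wmnet,stat[1008,1010].eqiad.wmnet"
)

if __name__ == "__main__":
    for host in expand_nodeset(TARGETS):
        print(host)
```

Expanding the list this way makes it easy to confirm that the hosts reported missing at 07:12:28 (ml-staging2003 and ml-serve2009-2011) are covered by the 'C:prometheus::node_amd_rocm' query.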
[14:17:44] Machine-Learning-Team, Data-Platform-SRE, Infrastructure-Foundations, serviceops, Security: Migrate the ownership of Docker images in production-images repo to mailing lists - https://phabricator.wikimedia.org/T373526#10099799 (elukey)
[14:25:18] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099847 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[14:36:25] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099860 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[14:50:32] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[14:54:48] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099910 (Jclark-ctr)
[15:20:34] Machine-Learning-Team, Data-Platform-SRE, Infrastructure-Foundations, serviceops: Migrate the ownership of Docker images in production-images repo to mailing lists - https://phabricator.wikimedia.org/T373526#10100057 (elukey)
[15:23:12] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100078 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[15:31:48] Machine-Learning-Team, Data-Platform-SRE, Infrastructure-Foundations, serviceops, Security: Migrate the ownership of DPE-Owned Docker images in production-images repo to mailing lists - https://phabricator.wikimedia.org/T373534 (bking) NEW
[15:45:25] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100206 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[15:47:54] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100223 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[15:49:36] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100228 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[16:00:47] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100254 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[16:20:24] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100330 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[16:20:54] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100333 (Jclark-ctr)
[16:35:25] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100388 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[16:38:36] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[16:44:37] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100481 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[16:44:40] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100483 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[17:09:05] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100575 (Jclark-ctr)
[17:19:54] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100609 (Jclark-ctr)
[17:22:32] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100613 (Jclark-ctr) a:klausman @klausman. If you can update preseed.yaml file for thes...
[17:29:59] Machine-Learning-Team, Data-Platform-SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Migrate the ownership of Docker images in production-images repo to mailing lists - https://phabricator.wikimedia.org/T373526#10100630 (akosiaris) I see one problem with this approach. Teams ch...
[17:43:11] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100691 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[18:00:54] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100775 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...