[06:52:33] hello folks
[08:24:33] klausman: o/
[08:24:47] I have updated the code reviews for ml-serve-codfw, and also the kubernetes docs
[08:24:53] \o morning
[08:24:55] in theory there should be all the steps documented now
[08:26:43] Roger
[08:32:45] good morning :)
[08:32:51] heyo :)
[08:32:55] o/
[08:33:46] elukey: do we have a writeup anywhere that's basically a rundown of "how to re-number a k8s cluster"?
[08:34:10] (i.e. a writeup of what we did, minus the detours)
[08:35:14] https://phabricator.wikimedia.org/T304673
[08:38:35] but nothing on wikitech
[08:38:51] I updated it with the steps that we missed, and Kubernetes/New for the missing sync commands etc.
[08:39:02] so it should be complete
[08:39:05] one thing still missing there is to depool nodes
[08:39:32] mmm from where? Pybal?
[08:39:42] yes
[08:39:51] Janis mentions it as well
[08:40:05] I am not sure if we can avoid an alarm, we can try setting them as inactive
[08:40:17] but with 0 hosts behind it, Pybal might be upset as well
[08:40:19] not sure
[08:40:28] hm, good point
[08:40:49] Icinga checks/alerts are not as granular as I'd like for that
[08:40:50] but it is a good try, worst case we get an alarm as happened yesterday :)
[08:41:42] we can do ml-serve-codfw today if you want, or the lvs change for ml-staging
[08:41:52] (going to take a little break, will read in a few)
[08:42:36] I'd rather finish up staging. We're currently not blocking anyone, but it feels like serve-codfw can wait now that eqiad is all fresh™
[09:01:11] (PS1) AikoChou: draftquality: add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/778225 (https://phabricator.wikimedia.org/T301766)
[09:05:46] sure
[09:15:27] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (elukey) Open→Resolved Host reimaged correctly, all done!
[09:16:56] all ml-cache nodes are ready
[09:17:11] I'll start working on adding cassandra to them
[09:31:38] there is an alert in AM about ml-serve-eqiad, related to prometheus not being able to scrape port 15020
[09:31:43] I think we are missing https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/778247
[09:32:42] I think you're right
[09:33:14] +1d
[09:34:06] (PS1) AikoChou: articlequality: add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/778248 (https://phabricator.wikimedia.org/T301766)
[09:51:13] (PS1) AikoChou: topic: add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/778250 (https://phabricator.wikimedia.org/T301766)
[10:01:11] interesting issue while deploying damaging and goodfaith
[10:01:12] Error creating: pods "nlwiki-damaging-predictor-default-w57tc-deployment-65fd584vhdqq" is forbidden: exceeded quota: quota-compute-resources, requested: limits.cpu=4, used: limits.cpu=88, limited: limits.cpu=90
[10:01:31] Quota exceeded, huh
[10:01:56] yes it has to create the new pods and then terminate the others
[10:02:02] so it temporarily breaches the limits
[10:02:20] I think we can tweak those, but the large number of pods doesn't help
[10:02:22] Ah, so it doesn't have enough temp quota. Does k8s have a mechanism for that kind of transition?
[10:02:42] Basically "don't enforce quotas during a roll-forward or roll-back"
[10:02:46] no idea
[10:03:49] How did we arrive at the current quotas/limits?
[10:04:05] there is a default in common.yaml
[10:04:07] resourcequota:
[10:04:07]   pods: {}
[10:04:07]   compute:
[10:04:07]     requests:
[10:04:09]       cpu: "90"
[10:04:12]       memory: "100Gi"
[10:04:14]     limits:
[10:04:17]       cpu: "90"
[10:04:19]       memory: "100Gi"
[10:04:37] I think it is a failsafe to keep a namespace from eating a ton of resources if pods are created by mistake
[10:04:54] we can override it for the moment
[10:05:05] Probably a good idea.
[10:08:46] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/778254/
[10:09:23] LGTM
[10:09:47] I'll wait for the CI diff before proceeding
[10:10:36] ack
[10:28:44] nothing really happened after changing the limits for namespaces
[10:29:08] I tried to delete a couple of pods, they are recreated but it seems that they have problems fetching from thanos
[10:31:22] going afk for lunch, will check later
[12:37:33] good news is that all the pods have been recycled as expected
[12:37:42] 3 of them are still erroring out when initing, weird
[12:37:49] and they have duplicates in "Running" state
[12:38:05] I guess that they would be killed by kubernetes if they reached the Running state
[12:40:16] I'll try to delete (via kubectl) the related ReplicaSet and see if newer pods will be created and the killed
[13:00:37] seems to have worked :)
[13:07:01] excellent!
[15:34:06] Morning all!
[15:47:56] o/
[16:08:24] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit...
[16:18:01] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS...
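A minimal sketch of the quota arithmetic behind the 10:01 "exceeded quota" error, assuming CPU limits counted in whole cores and a rollout that starts one replacement pod before the old one is terminated. The names CpuQuota, fits and min_limit_for_rollout are hypothetical helpers for illustration, not part of Kubernetes or deployment-charts; only the numbers 88, 4 and 90 come from the error message in the log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuQuota:
    """Namespace-wide limits.cpu quota, in whole cores (a simplification)."""

    limited: int  # hard cap set by the ResourceQuota
    used: int     # sum of limits.cpu over the pods that already exist

    def headroom(self) -> int:
        """Cores still available before the cap is reached."""
        return self.limited - self.used

    def fits(self, requested: int) -> bool:
        """True if a new pod asking for `requested` cores stays under the cap."""
        return self.used + requested <= self.limited


def min_limit_for_rollout(used: int, per_pod: int, surge_pods: int = 1) -> int:
    """Smallest cap that lets `surge_pods` replacement pods coexist with the old ones."""
    return used + per_pod * surge_pods


# Values taken from the error message: requested=4, used=88, limited=90.
quota = CpuQuota(limited=90, used=88)
print(quota.headroom())                            # 2
print(quota.fits(4))                               # False
print(min_limit_for_rollout(used=88, per_pod=4))   # 92
```

With the values from the log, the new pod asks for 4 cores while only 2 are free, so under these assumptions the cap would need to be at least 92 for a single replacement pod to start while the old pods are still running.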
[16:18:22] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit...
[16:48:54] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[16:56:07] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye
[17:06:37] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye...
[17:06:46] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye...
[17:09:59] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye
[17:13:36] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS...
[17:16:19] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye
[17:17:01] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye
[17:21:45] folks logging off, talk with you in a few days! o/
[17:22:05] bye!
[17:26:47] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye
[17:38:44] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (Cmjohnson)
[17:40:04] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (Cmjohnson) Open→Resolved on-site work completed
[17:58:13] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1001.eqiad.wmnet with OS bullseye...
[18:01:51] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1003.eqiad.wmnet with OS bullseye...
[18:05:02] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1002.eqiad.wmnet with OS bullseye...
[18:12:56] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1004.eqiad.wmnet with OS bullseye...
[18:13:41] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (Cmjohnson)
[18:13:54] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (Cmjohnson) Open→Resolved
[20:10:28] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Support (or not) the ORES augmented feature output in liftwing - https://phabricator.wikimedia.org/T301766 (achou) The ORES augmented feature output seems to be inconsistent with the output in liftwing for articlequality and editquali...