[06:59:20] good morning!
[06:59:41] I am trying to roll out calico on ml-serve-eqiad but it hangs
[06:59:47] very weird
[06:59:59] the docs say to sync namespaces first in that case (already done), but nothing
[07:00:36] to clear out state I've run "helm3 uninstall calico -n kube-system" and retried helmfile, same thing
[07:00:39] it is currently hanging
[07:02:43] ahhh maybe it is because of the extra istio cni config
[07:09:01] mmm maybe due to coredns missing
[07:09:02] dial tcp: lookup ml-ctrl.svc.eqiad.wmnet on 10.3.0.1:53: read udp 10.67.21.65:44115->10.3.0.1:53: i/o timeout
[07:51:21] tried to sync coredns but it hung as well, calico should definitely go first
[08:32:20] so 10.3.0.1 is our internal VIP for DNS
[08:32:24] that prod uses too
[08:32:32] (internal recursors)
[09:05:33] Morning \o
[09:05:52] elukey: I presume you did the master labels?
[09:06:42] i.e. https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Label_Kubernetes_Masters
[09:08:15] klausman: o/ the master labels are applied when the kubelet starts IIRC
[09:08:23] Automatically?
[09:08:55] yes
[09:09:02] root@deploy1002:~# kubectl get node --selector=node-role.kubernetes.io/master
[09:09:04] No resources found in default namespace.
[09:09:16] Not with -A, either
[09:10:08] ah ok so I think we needed to add them via puppet, in the kubelet config, now I recall
[09:10:14] but they are not really used
[09:11:04] so yes we can add them via kubectl label if you want
[09:11:06] it doesn't hurt
[09:11:21] I'll do that now
[09:11:29] (I misremembered the previous config, if we add the label to the kubelet's defaults it fails when starting on 1.16)
[09:11:41] I know it's a stretch that this has any impact on the istio/net thing, but covering bases and all that
[09:12:36] and done
[09:12:57] Which specific step did hang for you?
[09:13:12] the calico sync
[09:13:43] https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Calico_node/controllers this, right?
[09:13:56] yep
[09:14:01] Mind if I give it a try?
[09:15:08] if you try now it will fail due to
[09:15:09] root@deploy1002:~# helm3 history calico --namespace kube-system
[09:15:09] REVISION UPDATED STATUS CHART APP VERSION DESCRIPTION
[09:15:12] 1 Wed Apr 6 07:21:03 2022 pending-install calico-0.1.17 3.17.0 Initial install underway
[09:15:30] so we have to either uninstall the release for calico via helm command, or roll back
[09:15:39] tried this morning, redid the sync, still hanging
[09:15:53] It... completed for me
[09:16:50] https://phabricator.wikimedia.org/P24149
[09:17:11] yes you did the crds
[09:17:14] that works
[09:17:22] ah, the second step fails
[09:18:41] thanks for the trust :D
[09:18:52] Hey, sometimes the environments differ
[09:19:51] Remember that "/usr is unreadable" outage I have spoken about? That only happened because of a slightly different umask
[09:20:15] so the interesting bit, after a chat with Janis, is that calico-kube-controllers is the only pod for the moment that uses the CNI bootstrap chain
[09:20:28] since the other pods, for the moment, have all the hostNetwork flag on
[09:20:37] so they bypass the cni
[09:20:48] maybe the istio-cni config is problematic before calico?
[09:21:10] You mean we should set up "base calico" before doing the "cni upgrade"?
[09:23:00] That sounds like a plan. But I have no idea how to cancel/rollback a helm sync
[09:26:58] so my idea is to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/777753
[09:27:10] let the kubelet refresh, and then see if anything changes
[09:27:42] the CR is wrong sorry, fixing :D
[09:27:51] still sgtm in spirit
[09:31:59] ok done!
[09:33:17] so deleted the pod, but same error
[09:33:19] we can try to sync again
[09:33:50] so with `helm3 uninstall calico --namespace kube-system` we should now be able to sync
[09:33:54] do you want to try klausman ?
[09:34:55] yup
[09:36:43] just normally from deploy1002 in the right env, right?
[09:36:57] yep
[09:37:11] `Error: uninstall: Release not loaded: calico: release: not found`
[09:37:32] nono just proceed with the regular helmfile sync
[09:37:36] the uninstall is already done
[09:37:41] ah :D
[09:38:49] Might hang again... how long does it usually take?
[09:40:40] Yeah, I think this is hanging again
[09:43:19] lovely
[09:44:19] just ctrl-c it?
[09:46:14] killed it
[09:46:41] I wonder if re-running it (after deleting again) with --debug would provide any insight
[09:48:58] the thing that puzzles me is why the calico kube controller pod fails to contact ml-ctrl
[09:53:26] I am trying to think of something we did to the clusters, but did not document
[10:15:32] very weird
[10:18:28] It's also a bit odd that it doesn't hit a deadline/timeout, but hangs
[10:22:09] elukey: also note the icinga UNKNOWNs https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ml-staging-ctrl2001
[10:23:49] for codfw? :(
[10:23:50] oh wait, that's codfw :D
[10:24:01] _and_ staging
[10:24:01] I think I found the issue for eqiad though, lemme send a code change
[10:24:08] then we can cry in a corner together
[10:25:13] What was it?
[10:27:41] https://gerrit.wikimedia.org/r/c/operations/puppet/+/777764
[10:28:26] I think it then triggers some firewall rule changes or similar
[10:31:47] aaaah
[10:32:12] +1'd
[10:32:29] thanks :)
[10:32:48] I have been wondering about that file and its purpose
[10:32:52] need to go now, if you want to roll it out now please go ahead, I'll be back after lunch!
[10:33:21] Should we wait for Alex's LGTM?
[10:34:13] (also: buon appetito)
[10:34:15] too curious, merged, you and Janis +1ed so we should be ok :)
[10:34:21] :D
[10:40:35] ok still not working, but I have only rolled it out to the ml-serve nodes
[10:40:44] maybe the firewall rule needs to propagate elsewhere
[10:40:47] we'll see after lunch :)
[10:42:24] aye. I am noticing I'm hungry too
[13:02:28] let's try also with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/777802/
[13:06:26] +1d
[13:12:02] np failed
[13:12:04] uffff
[13:12:10] np?
[13:12:40] "nope" failed :(
[13:14:19] ah.
[13:14:38] Should we try to sync with --debug? (or maybe you already did)
[13:14:52] is there a debug option for helmfile?
[13:15:00] yes
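(Recapping the clear-and-retry loop above as shell commands. The `helm3` invocations are taken from the session; the admin_ng path, the `name=calico` selector and the use of `--debug` are assumptions about how the sync would be re-run, not commands shown verbatim in the log.)

```
# Inspect the stuck release: "pending-install" means the previous sync never finished
helm3 history calico --namespace kube-system

# Clear the half-finished release so a new sync can start cleanly
helm3 uninstall calico --namespace kube-system

# Retry from the admin_ng helmfile (path and selector assumed);
# --debug prints the rendered values and the underlying helm calls
cd /srv/deployment-charts/helmfile.d/admin_ng
helmfile -e ml-serve-eqiad -l name=calico --debug sync
```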
[13:15:49] there is a tmux under root on deploy1002
[13:16:08] listening in :)
[13:16:13] but it says calico-kube-controllers-c755df5f-7hjq7
[13:16:14] err
[13:16:27] that the calico-kube-controllers pod is not ready
[13:18:50] 2022-04-06 13:17:37.284 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://ml-ctrl.svc.eqiad.wmnet:6443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp: lookup ml-ctrl.svc.eqiad.wmnet on 10.3.0.1:53: read udp 10.67.21.194:55127->10.3.0.1:53: i/o timeout
[13:19:03] (that's from kubectl -n kube-system logs --all-containers=true calico-kube-controllers-c755df5f-zfq4m)
[13:19:23] Plus some other messages
[13:20:54] yep
[13:27:03] trying with coredns, let's see
[13:27:10] it failed this morning
[13:28:31] ah yes
[13:28:32] connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "calico-cni" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope,
[13:29:19] it will roll back in some mins by itself
[13:39:25] mh. cni. didn't we remove that?
[13:42:07] the calico cni is still there, it is needed for ipam etc.., but I think this is due to calico being rolled back when I tried coredns
[13:42:58] it seems as if the kube-controllers pod doesn't get a valid networking setup, and it fails at the first dns request
[13:43:08] (for calico I mean)
[13:43:55] Makes a certain amount of sense
[13:44:23] what could cause something like that?
[13:44:52] just reasoning out loud, maybe it could give us some hint
[13:46:23] So if we can rule out iptables/netfilter/firewall
[13:46:39] and DNS fails, that means that we're either using the wrong one
[13:46:47] the one we try to use is down
[13:47:01] or it is misconfigured and gives wrong answers/does not allow us to query
[13:47:39] 10.3.0.1
[13:47:43] is up and running etc..
[13:48:22] I can query it from the master, too
[13:48:24] I am wondering if iptables is somehow having some weird config from the previous attempts to deploy
[13:48:33] (shell, so not 100% the same etc)
[13:48:48] Is there a way to reset the rules?
[13:51:06] no idea
[13:54:17] One last-ditch measure would also be rebooting the ctrl nodes.
[13:58:42] and https://gerrit.wikimedia.org/r/c/operations/homer/public/+/777811/
[14:00:08] Did we miss all those spots where IPs are configured?
[14:00:23] I think so yes, didn't really think about the BGP config
[14:04:12] going to wait for a netops just in case
[14:04:42] Morning all!
[14:04:50] \o
[14:04:56] Is our scipy from 2016 or 2006?
[14:05:13] 2016 IIRC
[14:05:17] 0.18.1
[14:05:23] awesome thanks
[14:18:38] kube-system calico-kube-controllers-c755df5f-v49mk 1/1 Running 4 5m3s
[14:18:41] klausman: --^
[14:18:42] \o/
[14:18:46] yay!
[14:19:00] So it was all those ip ranges hiding in (to us) obscure repos
[14:19:29] yeah the routers were not configured to accept the new ip ranges from the ml-serve nodes
[14:19:50] This is going to be important for staging as well
[14:20:05] (and serve-codfw, of course)
[14:20:12] yep yep
[14:20:18] doing coredns now, pods are coming up
[14:20:44] Also: very glad we could do this without people with torches and pitchforks demanding prod coming back
[14:21:23] definitely
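(A sketch of the diagnosis that led to the homer change: the calico-kube-controllers pod came up without working networking, so its first DNS lookup against the 10.3.0.1 recursors timed out. The `kubectl logs` command is from the session; the `get pods` label selector and the `dig` check are illustrative assumptions.)

```
# Find the failing pod and read all of its container logs
kubectl -n kube-system get pods -l k8s-app=calico-kube-controllers   # label assumed
kubectl -n kube-system logs --all-containers=true calico-kube-controllers-c755df5f-zfq4m

# From a node shell, confirm the recursor itself answers (it did)
dig @10.3.0.1 ml-ctrl.svc.eqiad.wmnet +short

# The recursor was healthy; the pod network simply could not reach it, because the
# core routers did not yet accept the new ml-serve IP ranges (fixed in the homer CR).
```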
[14:21:29] I am also rolling back the istio cni change
[14:22:29] +1d
[14:22:42] do you want to roll it out?
[14:24:33] Sure
[14:24:39] just the usual sync, right?
[14:26:26] this is in puppet, needs to be rolled out to ml-serve100[1-4] nodes
[14:26:32] the kubelet in theory doesn't need a restart
[14:26:47] so just a puppet-merge and a forced agent run?
[14:26:50] yep
[14:27:37] Running now
[14:27:52] Oh wait, not merged yet :D
[14:29:15] chrisalbon: today there's the Monthly Tech Department Updates meeting colliding with our two meetings. do we cancel? Ignore?
[14:29:31] (I feel like back in the DOS era: Abort, Retry, Cancel?)
[14:31:32] elukey: sorry, did I misunderstand? do you want me to merge 777770?
[14:31:37] yep yep
[14:31:39] (so close to a cool number!)
[14:31:54] doing it now
[14:32:26] super
[14:32:33] after this, the next step is finally istio!
[14:32:49] merged, now doing agent run
[14:33:45] l successful
[14:33:47] All+
[14:34:08] (and diff for /etc/cni/net.d/10-calico.conflist looks good)
[14:34:52] So now we do another helmfile...calico sync?
[14:35:12] in theory no, all good, the only thing that is affected by the last change is the kubelet
[14:35:25] basically it will now invoke the istio-cni binary when spinning up pods
[14:35:34] to see if they need to get the iptables redirs injected etc..
[14:35:44] after calico's, that is needed to get the pod IP
[14:35:56] The output for deployment/daemonset looks good
[14:36:35] for istio, we can now use istioctl
[14:36:41] istioctl manifest apply -f ...
[14:36:52] and instead of ..., the custom.d mlserve config yaml
[14:37:03] (that is under /srv/deployment-charts/etc..)
[14:37:13] ah and before that, kube_env etc.. ml-serve-eqiad
[14:37:23] to deploy to the right cluster :D
[14:37:41] (I usually check `env | grep ml-serve` before running istioctl to check)
[14:38:18] Looks good? root@deploy1002:/srv/deployment-charts/custom_deploy.d/istio/ml-serve# istioctl manifest apply -f config.yaml
[14:38:36] (env I already checked)
[14:38:51] yep!
[14:39:27] ah. Istioctl has multiple versions
[14:39:37] klausman go to the tech dept meeting
[14:39:38] 1.6.14 and 1.9.5
[14:39:46] roger, chris
[14:39:47] 1.9.5 is ours
[14:40:04] We need to reschedule our meeting, it keeps getting stepped on by bigger meetings
[14:40:32] Running.
[14:43:15] Hmm. Still at:
[14:43:38] `- Processing resources for Istiod. Waiting for Deployment/istio-system/istiod`
[14:43:45] yeah istiod's logs show
[14:43:46] "pkica","msg":"Failed to get secret (error: Get \"https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/istio-system/secrets/istio-ca-secret\": dial tcp 10.67.0.1:443: i/o timeout
[14:44:21] 10.67.0.1 is the right IP, tho
[14:46:06] Welp, it timed out on that, installed CNI, now waiting on Ingress gw resources
[14:47:17] I'll let it run, at least doing what works
[14:50:24] ah I forgot one thing, silly me
[14:50:34] the network policy rules
[14:50:52] we need to do two helmfile syncs
[14:51:21] -l name=istio-gateways-networkpolicies
[14:51:22] where do those files live?
[14:51:33] -l name=istio-proxy-settings
[14:51:37] in deployment-charts
[14:52:57] both done, now redoing the istio push
[14:53:07] lol did the same as well
[14:53:13] istiod now works
[14:53:29] not doing the istio push, then
[14:53:42] the other pods are still in containerCreating phase
[14:54:30] all up!!
[14:54:36] Noic
[14:54:38] +e
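(Condensed from the exchange above, the istio step boils down to the following. The istioctl invocation and the two networkpolicy selectors are taken from the session; the exact kube_env arguments and the admin_ng path are assumptions.)

```
# Point the shell at the right cluster before anything else (arguments assumed)
kube_env admin ml-serve-eqiad
env | grep ml-serve        # sanity check, as done in the session

# NetworkPolicies first, from the admin_ng helmfile (path assumed)
cd /srv/deployment-charts/helmfile.d/admin_ng
helmfile -e ml-serve-eqiad -l name=istio-gateways-networkpolicies sync
helmfile -e ml-serve-eqiad -l name=istio-proxy-settings sync

# Then the control plane, with the 1.9.5 istioctl and the ml-serve custom config
cd /srv/deployment-charts/custom_deploy.d/istio/ml-serve
istioctl manifest apply -f config.yaml
```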
[14:55:00] super, next step is knative-crds and knative, from admin-ng
[14:55:07] helmfile sync
[14:55:31] on it
[14:56:06] err: no releases found that matches specified selector(name=knative-crds) and environment(ml-serve-eqiad), in any helmfile
[14:56:47] knative-serving-crds and knative-serving sorry
[14:56:49] knative-serving-crds
[14:56:51] ha
[14:57:18] crds done
[14:57:53] kn-srv done
[14:58:15] all pods up
[14:58:44] and now kserve
[14:59:29] Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "Certificate" in version "cert-manager.io/v1alpha2"
[14:59:39] ah yes we need cert manager
[15:00:17] is that just a `certmanager` helm sync?
[15:01:42] we need cert-manager, cfssl-issuer-crds, cfssl-issuer, cert-manager-networkpolicies
[15:02:05] ok, doing them
[15:02:58] and then in theory the kserve sync should work
[15:03:42] Is certmanager typically slow? It's been sitting for a while
[15:04:15] A couple of pods errored
[15:04:50] same
[15:04:51] error creating manager: Get "https://kubernetes.default.svc.cluster.local:443/api?timeout=32s": dial tcp 10.67.0.1:443: i/o timeout
[15:04:59] maybe the network policies go first
[15:05:02] Do we maybe need the ... yes
[15:05:24] klausman: trying to sync them
[15:05:56] running now
[15:06:10] already deployed
[15:06:13] doing cfssl stuff now
[15:06:26] sorry, running now as in meaning they're running now
[15:07:04] cfssl-issuer is up
[15:07:18] all up afaics
[15:07:31] now kserve :)
[15:07:59] running as well
[15:08:12] We should maybe automate/script this :D
[15:08:36] in theory helmfile sync should take care of all, if all the deps are good
[15:08:55] AH, theory.
[15:08:59] exactly :D
[15:09:16] so now we can try to deploy something like ml-services/revscoring-articlequality
[15:10:05] Sure
[15:10:23] root@deploy1002:/srv/deployment-charts/helmfile.d/ml-services# helmfile -e ml-serve-eqiad -l name=revscoring-articlequality sync
[15:10:27] ^^^ looking good?
[15:10:47] nono this is different from admin-ng
[15:10:56] you need to cd to revscoring-articlequality
[15:11:03] and sync without the -l name etc..
[15:11:12] in admin-ng we have a single helmfile
[15:11:19] righto.
[15:11:19] meanwhile we have one for each of the ml-services dirs
[15:11:27] The -e is still needed, tho?
[15:11:33] correct
[15:11:43] ok, running now
[15:12:12] in init 1/2
[15:12:30] I presume those are "a bit" chunkier than the base services :)
[15:12:59] also we need to pull the images
[15:13:04] And running!
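(The remaining admin-ng pieces and the first model service, in the order that ended up working. Release names come from the log; the helmfile paths and the exact ordering of the cert-manager pieces are assumptions based on the "network policies go first" conclusion above.)

```
cd /srv/deployment-charts/helmfile.d/admin_ng       # path assumed

# Knative: the releases are knative-serving-crds / knative-serving, not knative-crds
helmfile -e ml-serve-eqiad -l name=knative-serving-crds sync
helmfile -e ml-serve-eqiad -l name=knative-serving sync

# cert-manager stack (kserve needs its Certificate CRDs), networkpolicies first
helmfile -e ml-serve-eqiad -l name=cert-manager-networkpolicies sync
helmfile -e ml-serve-eqiad -l name=cert-manager sync
helmfile -e ml-serve-eqiad -l name=cfssl-issuer-crds sync
helmfile -e ml-serve-eqiad -l name=cfssl-issuer sync
helmfile -e ml-serve-eqiad -l name=kserve sync

# ml-services have one helmfile per service directory: cd in and sync, no -l selector
cd /srv/deployment-charts/helmfile.d/ml-services/revscoring-articlequality
helmfile -e ml-serve-eqiad sync
```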
[15:13:13] let's test a score!
[15:13:20] What's the easiest way to do that?
[15:13:48] I had a command in ml-serve-ctrl1001's history that of course now is gone :D
[15:14:34] I always thought the WMF default of 500 lines of history was... suboptimal
[15:14:56] well we reimaged it so even without that limit :D
[15:15:01] And of course wiping a machine also nukes stuff
[15:16:27] it should be something like
[15:16:27] curl "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-articlequality:predict" -X POST -d @input.json -i -H "Host: enwiki-articlequality.revscoring-articlequality.wikimedia.org" --http1.1
[15:16:42] but I see that something must be off with the endpoint, connection reset
[15:17:07] The log does show requests, but I think that's just monitoring
[15:18:08] Do you still have the input json?
[15:18:35] I don't, good point
[15:20:31] openssl s_client -connect inference.svc.eqiad.wmnet:30443 doesn't work
[15:20:47] so I think there is an issue with the tls cert, that in theory should come from cert-manager
[15:22:53] But I see no errors on the services themselves, so it must be an envoy/network problem
[15:24:23] ah we need to sync in admin-ng namespace-certificates
[15:24:26] doing it now
[15:25:02] yep now works!
[15:25:48] Hrm. I tried with the input.json from your homedir on m-s-c-1001, but it says that's malformed
[15:25:59] just fixed it, now it works
[15:26:18] Confirmed
[15:26:38] (we might wanna make that test something self-contained for easy copying)
[15:27:01] about the namespace-certificates - this is something relatively new that Janis added, we configure in our ml-serve values.yaml what TLS secrets we want, and cfssl-issuer/cert-manager are then configured automatically to create the cert
[15:27:20] so now I think we can deploy the other revscoring namespaces
[15:27:49] can do!
[15:27:52] Going to take a little break in the meantime, if you want to go ahead please do, otherwise we can finish in a bit :)
[15:28:08] I'll go ahead and break shit :)
[15:28:18] go go go
[15:29:24] draftquality is up.
[15:29:28] (and works)
[15:31:36] editquality is up, but not sure how to test that
[15:32:10] eq-damaging, that is
[15:32:19] now doing eq-goodfaith
[15:33:44] eq-gf up and running
[15:34:06] eq-reverted coming up
[15:34:24] all running
[15:41:28] Taking a quick break
[16:04:27] super!
[16:04:50] so to test them, just swap the various occurrences of "articlequality" in the curl above
[16:05:46] like
[16:05:47] curl "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict" -X POST -d @input.json -i -H "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" --http1.1
[16:05:51] that works fine afaics
[16:07:38] curl "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-damaging:predict" -d @input.json -i -H "Host: enwiki-damaging.revscoring-editquality-damaging.wikimedia.org" --http1.1
[16:07:42] works
[16:08:46] I think that the eqiad cluster is definitely healthy and done :)
[16:08:53] !!!
[16:09:01] yep!
[16:14:00] so we need to file the missing ml-serve-codfw changes, and then schedule the codfw re-init
[16:14:10] but for the moment we can deploy to eqiad and keep loading models to it
[16:14:12] so no rush
[16:14:23] we can do it next week and prep tomorrow
[16:15:49] leaving for the evening folks, talk with you tomorrow!
[16:16:07] \o
[16:33:50] bye!
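(Putting the smoke test above in one place. The openssl and curl commands are taken from the session; note that TLS on the ingress only works after the namespace-certificates sync in admin-ng. The contents of input.json were never pasted in the channel, so the rev_id body below is an assumption about what the revscoring services expect.)

```
# TLS check against the ingress (fails until namespace-certificates has been synced)
openssl s_client -connect inference.svc.eqiad.wmnet:30443 </dev/null

# Hypothetical request body -- the real input.json is not shown in the log
echo '{ "rev_id": 12345 }' > input.json

# Score a revision; swap the model name and Host header to hit the other namespaces
curl "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict" \
  -X POST -d @input.json -i --http1.1 \
  -H "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org"
```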
[18:44:57] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10Cmjohnson) [18:49:13] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Cmjohnson) [19:06:22] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Cmjohnson) [19:17:41] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Cmjohnson) [19:31:30] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye [19:31:38] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye executed wit... [19:42:02] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10Cmjohnson) [19:44:26] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit... [19:45:54] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1006.eqiad.wmnet wit... [19:48:34] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1007.eqiad.wmnet wit... [19:49:00] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1008.eqiad.wmnet wit... [19:50:36] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1008.eqiad.wmnet with OS... [19:50:59] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1007.eqiad.wmnet with OS... 
[19:51:11] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1006.eqiad.wmnet with OS... [19:51:23] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS... [19:56:28] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Cmjohnson) ml-serve1005 E4:3D:1A:A2:BF:FC ml-serve1006 E4:3D:1A:AD:D7:A2 ml-serve1007 E4:3D:1A:AC:8F:D6 ml-serve1008 E4:3D:1A:AD:... [19:57:39] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Cmjohnson) @Jclark-ctr These are erroring during the installation with the media failure, suggesting that there isn't a cable con... [20:39:14] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Jclark-ctr) port was not set to pxe fixed setting for all 4 host [20:43:46] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit... [21:03:48] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS... [21:04:11] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit... [21:31:05] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS... [21:34:32] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit... [21:40:24] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1006.eqiad.wmnet wit... 
[21:41:15] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1007.eqiad.wmnet wit... [21:41:23] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1008.eqiad.wmnet wit... [21:57:35] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS... [21:57:47] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit... [22:05:15] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS... [22:05:36] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit... [22:14:26] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS... [22:30:27] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1007.eqiad.wmnet with OS... [22:31:46] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1006.eqiad.wmnet with OS... [22:33:39] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1008.eqiad.wmnet with OS...