[06:27:29] good morning :)
[08:11:43] need to go afk for some errands, ttl!
[08:39:20] \o
[08:40:09] elukey: there is one more thing missing from the staging CL: it doesn't assign the role to any machine. I left that out just in case I accidentally submit it early. Want me to do it in that CL or make another?
[09:24:15] klausman: o/
[09:24:33] +1 to add it in the same code change, should be fine
[09:31:18] (let's run pcc just in case)
[09:31:30] ok if I go forward with the 3 code changes for istio cni?
[09:56:32] sec...
[09:56:47] Yes!
[09:58:52] running pcc now
[09:59:50] :)
[09:59:52] Ah. Missing bgp peers
[10:00:28] Should I omit cr1/cr2 for now?
[10:01:37] Eh, it shouldn't break anything
[10:03:37] yeah it shouldn't break in theory, does the kubernetes new cluster tutorial suggest any particular order?
[10:03:46] if not we can go forward
[10:03:55] some alerts may fire, so we could add some downtime
[10:09:02] https://puppet-compiler.wmflabs.org/pcc-worker1001/34642/ml-staging-ctrl2001.codfw.wmnet/index.html PCC run
[10:11:03] rebasing and merging now
[10:11:54] klausman: I left a nit, maybe the commit msg could be generalized a bit
[10:11:56] but +1 to proceed
[10:12:26] oops
[10:12:36] puppet-merge is also done
[10:13:45] augh. ctokens missing on private repo
[10:19:05] Still something missing :-/
[10:20:23] I can check the node if you want
[10:20:55] https://phabricator.wikimedia.org/P24008
[10:21:05] It's the same lookup that broke a few days ago
[10:21:41] yep
[10:22:55] But what was the fix for that again... -.-
[10:25:30] ah, found it
[10:30:57] elukey: btw I noticed that the actual-private tree has tokens for istio-cni, but the dummy private/labs repo does not. Does that matter?
[10:32:08] klausman: mmm I added some yesterday, is it only for staging?
[10:32:22] I didn't see it for prod, either
[10:32:26] but maybe my tree is old
[10:33:18] ah, for prod it's there
[10:33:34] I already added it for staging on the actual repo, will add it to the labs one now as well
[10:35:24] super thanks
[10:35:43] Oh, also, no actual tokens for the revscoring-*-deploy keys. Should I add that?
[10:36:06] (and would they be the same as for prod?)
[10:37:32] the tokens can be added later on, I'd use different ones for staging though
[10:37:48] aye
[10:39:54] klausman: there is a k8s alert firing for one of the staging nodes, I'd add downtime in https://alerts.wikimedia.org/
[10:40:01] thanks
[10:40:11] Not sure what's going on
[10:40:12] (hit the bell icon on the top-right corner)
[10:40:47] it is complaining about rsyslog, it is surely something related to the node being new
[10:41:05] Mar 31 10:40:56 ml-staging-ctrl2001 kube-controller-manager[1603411]: E0331 10:40:56.099237 1603411 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://ml-staging-ctrl.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp: lookup ml-staging-ctrl.svc.codfw.wmnet on 10.3.0.1:53: no such host
[10:41:20] Is this because we don't have workers yet?
[10:41:49] we don't really have a LVS endpoint either right?
[10:41:56] Also correct
[10:42:04] yeah so all expected
[10:42:14] Ok.
[10:45:21] hrm
[10:45:25] Mar 31 10:43:28 ml-staging-ctrl2001 kubelet[1604936]: F0331 10:43:28.985098 1604936 server.go:271] failed to run Kubelet: mountpoint for cpu not found
[10:45:36] did some cgroup kernel opt not get applied?
[10:45:46] I'll try a reboot
[10:47:01] Yep, that was it
[10:47:52] :)
[10:48:26] ok, both kubelets now run correctly
[10:52:35] I need to get something to eat, then I'll make the worker CL (unless you want to do pybal or something else first)
[10:59:57] me too!
[12:47:29] aiko: o/ thanks for the patches! When you have a moment can you update https://phabricator.wikimedia.org/T302851 with the pull requests and the next steps? (afaics finding how to push a new release to PyPI and deploy it)
[13:01:14] ok so articlequality on ml-serve-eqiad has the sidecar containers, and it seems working!
[13:01:22] nice!
[13:39:24] elukey: I made two CLs for you: https://gerrit.wikimedia.org/r/c/labs/private/+/775823 (istio-cni token in labs, for the staging cluster) and https://gerrit.wikimedia.org/r/c/operations/puppet/+/775860 (worker setup for staging)
[13:45:02] klausman: reviewed, one little nit and I think we are ok
[13:45:18] alrighty
[13:51:20] lgtm :)
[14:09:49] reboots and puppet runs done
[14:12:30] super
[14:12:56] I am going afk for a quick walk, bbiab
[14:17:04] elukey: o/ Ok I'll update the task :)
[14:43:52] sneaky bug found in the secrets charts https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/775870, it was blocking deployments
[14:44:31] going to migrate the other pods to the sidecar scheme
[14:58:31] all revscoring namespaces done in ml-serve-eqiad, it seems working!
[15:48:39] ORES, Machine-Learning-Team (Active Tasks): revscoring feature extraction error for wikitext papes in Wikidata - https://phabricator.wikimedia.org/T302851 (achou) Current status: * pull request https://github.com/wikimedia/revscoring/pull/518 has been merged. Next steps: * release revscoring 2.11.2 to p...
[15:51:41] I found a little issue with networkpolicies in ml-serve-codfw, there is one last thing that I need to fi
[15:51:44] *fix
[15:52:23] we have policies to allow traffic to port 15012 of pods labeled "istiod", the most recent ones are the ones allowing pods with sidecars to contact it
[15:52:52] but it seems as if the old policies (mentioning only gateways) are the only ones evaluated
[15:55:56] mmm no the policy seems not working :D
[15:57:28] (basically the istio proxy on the sidecar cannot contact istiod to get routes etc..)
[15:59:51] Machine-Learning-Team, artificial-intelligence, articlequality-modeling: Articlequality model for nlwiki doesn't seem to track images correctly. - https://phabricator.wikimedia.org/T304973 (Halfak) I can't seem to replicate the issue with the current version of the feature in the articlequality repo....
[16:03:23] will need to take a better look on monday, so for the moment don't use ml-serve-codfw (eqiad works :)
[16:09:20] * elukey afk!
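
Note on the kubelet failure at 10:45:25 ("mountpoint for cpu not found"): the log only says that a cgroup kernel option had not been applied and that a reboot fixed it. A plausible reading, and it is an assumption here, is that the kubelet was looking for a cgroup v1 hierarchy with the cpu controller and none was mounted before the reboot. The sketch below shows one way to check that from the contents of /proc/mounts. The function name `cgroup_v1_cpu_mountpoints` is hypothetical and is not part of any tooling mentioned in the log.

```python
def cgroup_v1_cpu_mountpoints(mounts_text):
    """Return mount points of cgroup v1 hierarchies that carry the cpu controller.

    mounts_text is the raw text of /proc/mounts, where each line has the form:
    <device> <mountpoint> <fstype> <options> <dump> <pass>
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, fstype, options = fields[:4]
        # cgroup v1 mounts have fstype "cgroup" and list their controllers
        # among the mount options; cgroup v2 uses fstype "cgroup2" instead.
        if fstype == "cgroup" and "cpu" in options.split(","):
            found.append(mountpoint)
    return found


def host_has_cgroup_v1_cpu(path="/proc/mounts"):
    """Read the mounts file on the local host and report whether cpu is mounted."""
    with open(path) as handle:
        return bool(cgroup_v1_cpu_mountpoints(handle.read()))
```

Calling `host_has_cgroup_v1_cpu()` on a node before and after the reboot would show whether the cpu hierarchy appeared. An empty result on a host that only mounts `cgroup2` would be consistent with the error in the log, though it does not by itself prove that was the cause.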