[07:57:21] starting deployment for draftquality inference services
[08:00:23] both eqiad and codfw deployments have been completed successfully.
[08:00:23] checking pods now ...
[08:10:29] all new pods are up and running.
[08:10:30] NAME READY STATUS RESTARTS AGE
[08:10:30] enwiki-draftquality-predictor-default-rzctc-deployment-7c4nfgjv 3/3 Running 0
[08:10:30] ptwiki-draftquality-predictor-default-frwng-deployment-8fd9jwcx 3/3 Running 0
[08:19:30] Lift-Wing, artificial-intelligence, Machine-Learning-Team (Active Tasks): Create draftquality inference services - https://phabricator.wikimedia.org/T310704 (kevinbazira) Inference services were created for all the 2 draftquality models and they are all up and running in KServe on both eqiad and codf...
[08:23:14] Lift-Wing, artificial-intelligence, Machine-Learning-Team (Active Tasks): Migrate draftquality models - https://phabricator.wikimedia.org/T310698 (kevinbazira) a:kevinbazira
[08:25:35] Lift-Wing, artificial-intelligence, Machine-Learning-Team (Active Tasks): Migrate draftquality models - https://phabricator.wikimedia.org/T310698 (kevinbazira) The migration of draftquality models has been completed. 2/2 draftquality [[ https://phabricator.wikimedia.org/T310701 | models were uploade...
[08:53:12] nice kevinbazira :)
[09:30:11] Lift-Wing, Machine-Learning-Team (Active Tasks): Use async http client of Tornado to get outlinks from the article - https://phabricator.wikimedia.org/T311043 (achou)
[09:31:56] Lift-Wing, Machine-Learning-Team (Active Tasks): Deploy Outlinks topic model to production - https://phabricator.wikimedia.org/T287056 (achou)
[09:31:58] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (achou)
[09:56:49] klausman: o/ I have merged your service change and synced some missing cfssl-issuer resources on ml-staging, but in one of the new pods I see
[09:56:52] failed POST to https://pki.discovery.wmnet:8443/api/v1/cfssl/info: Post "https://pki.discovery.wmnet:8443/api/v1/cfssl/info": dial tcp [2620:0:861:101:10:64:0:10]:8443: connect: no route to host
[09:58:07] NAMESPACE NAME READY SECRET AGE
[09:58:10] istio-system knative-serving False knative-serving-tls-certificate 7d19h
[09:59:43] hmm
[09:59:48] (thanks re: merge)
[10:00:14] Does using v4 work?
[10:01:02] didn't try, from the host works even with v6, it may be a pod-level issue
[10:01:14] Or I messed up some pool?
[10:02:13] other pods should have showed up some trouble as well in theory
[10:02:28] Oh wait, that is the connect to the PKI that is failing.
[10:03:38] going to take a quick break :)
[10:03:51] Sure. I'll rummage a bit
[10:38:48] elukey: one thing I never verified was CR-side BGP status (I can't login to the core routers)
[10:49:54] klausman: all ESTABLISHED, v4 and v6
[10:50:17] I am wondering if it is related to firewall rules though, on the pki intermediate nodes
[10:50:30] the "no route to host" is weird of course but maybe it is misleading
[10:50:35] I think I may have found the problem
[10:50:43] ah!
[10:51:20] https://phabricator.wikimedia.org/P29933
[10:52:05] That is the only meaningful difference in the values dir between serv-codfw and staging-codfw
[10:52:31] The tlsHostnames bit is correct, but I am not sure the global network policies are
[10:53:02] this is definitely a good point!
[10:53:14] yes yes I think you are right, it needs to be fixed
[10:53:34] e.g. I dunno if allow-all-icmp would make it to the staging config
[10:54:09] We probably need to copy over the a-a-i and d-d subsections
[10:54:18] should I make a PR?
[10:54:52] yep! I think that you should be able to override only allow-pod-to-pod no?
[10:55:06] CI will emit a diff, we can see what will change
[10:55:08] well, no, we already do that :)
[10:55:21] note that the green bits in the diff are _staging_
[10:56:08] yes sure
[10:56:25] maybe I didn't get your point about allow-all-icmp then
[10:56:31] anyway, let's do the CR :)
[10:56:50] I have to go in a few, but feel free to proceed if the CI's diff looks sane
[10:56:51] I am not sure we're on the same page :)
[10:57:09] So the pod-to-pod stuff is there and configured for staging, as far as I can tell
[10:57:49] I was looking in the wrong file yes
[10:58:05] But the "no route" is about something _external_ anyway, the PKI
[10:58:15] so it would not care about pod-to-pod config
[10:58:42] I didn't see the `diff` etc.. in the paste, now I understand, apologies
[10:59:40] I would expect the config to work as-is, i.e. the ml-serve config for icmp (allow-all-icmp) and default-deny would apply correctly
[10:59:55] But I am not sure it does.
[11:00:34] The thing is that the two serve setups do not have a global network policy stanza at all
[11:01:16] So those do not work as examples of the setup working as expected.
[11:03:41] one thing to add, that may be not related but needed for sure, is https://gerrit.wikimedia.org/r/c/operations/puppet/+/724933
[11:03:53] we'd need to add a note in the documentation
[11:04:25] That is more likely to be the culprit, IMO
[11:05:35] "no route to host" is classic "iptables says no"
[11:05:57] added https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Puppet
[11:05:57] Want me to make a similar CR and we'll wait with the helm stuff?
[11:06:02] yep yep
[11:06:06] ok, on it
[11:07:11] on pki1001, iptables -L shows some entries for port 8443 like kubernetes-pod-10-64-64-0.eqiad.wmnet/21
[11:07:21] but afaics there is nothing for the ml-staging cluster
[11:07:46] let's see on codfw
[11:08:14] yep nothing
[11:08:26] have to go now, but I'll recheck later :)
[11:10:47] ack
[11:11:39] when you're back: https://gerrit.wikimedia.org/r/c/operations/puppet/+/807096
[12:04:57] Morning all
[12:13:39] \o
[13:26:38] klausman: commented!
[13:26:53] I think one bit is missing, plus I think I found where the firewall rules are for pki
[13:26:57] all added to the CR
[13:37:04] merci
[13:51:43] elukey: given https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetmaster1001&service=Confd+template+for+%2Fsrv%2Fconfig-master%2Fpybal%2Fcodfw%2Finference-staging we should probably do the pybal dance soon.
[13:52:25] klausman: sure, it should be ok even without the TLS config
[13:52:35] can you file a change to switch to lvs_setup?
[13:52:49] yarp
[13:54:56] (PS1) AikoChou: outlink: use tornado async http client to fetch outlinks [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/807135 (https://phabricator.wikimedia.org/T311043)
[14:25:18] Lift-Wing, artificial-intelligence, Machine-Learning-Team (Active Tasks): Upload draftquality model binaries to storage - https://phabricator.wikimedia.org/T310701 (calbon) Open→Resolved
[14:25:20] Lift-Wing, artificial-intelligence, Machine-Learning-Team (Active Tasks): Migrate draftquality models - https://phabricator.wikimedia.org/T310698 (calbon)
[14:25:43] Lift-Wing, artificial-intelligence, Machine-Learning-Team (Active Tasks): Create draftquality inference services - https://phabricator.wikimedia.org/T310704 (calbon) Open→Resolved
[14:25:45] Lift-Wing, artificial-intelligence, Machine-Learning-Team (Active Tasks): Migrate draftquality models - https://phabricator.wikimedia.org/T310698 (calbon)
[14:25:52] Lift-Wing, artificial-intelligence, Machine-Learning-Team (Active Tasks): Migrate draftquality models - https://phabricator.wikimedia.org/T310698 (calbon) Open→Resolved
[14:47:44] elukey: can you lgtm https://gerrit.wikimedia.org/r/c/operations/puppet/+/807133 ?
[14:48:05] (I know you'll soon be in more meetings end-to-end, so I'll do the pybal dance with Valentin)
[14:59:15] done :)
[14:59:23] grazie mille
[15:16:20] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Use async http client of Tornado to get outlinks from the article - https://phabricator.wikimedia.org/T311043 (achou) Some test results for model using async http calls: ` aikochou@ml-sandbox:~/isvcs/outlink$ wrk -c 1 -t 1 --timeout 1...
[15:18:10] Lift-Wing, Machine-Learning-Team (Active Tasks): Deploy Outlinks topic model to production - https://phabricator.wikimedia.org/T287056 (achou)
[15:18:12] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Use async http client of Tornado to get outlinks from the article - https://phabricator.wikimedia.org/T311043 (achou) Open→In progress
[15:43:42] elukey: staging now has an inference endpoint \o/ Now we only need that firewall change reviewed and merged and we can test whether that cluster actually can serve something :)
[15:47:45] jbond has just LGTM'd it, but I dunno what kind of rollout procedure (if any) there is
[15:54:25] klausman: ah ok so the procedure is simply to merge and let puppet run on all nodes, because it will update ferm rules everywhere. So maybe pinging the serviceops team and the #sre chan first is a good idea
[15:54:35] we'll also need an extra patch though
[15:54:51] I added a comment in the code review, the pki profile needs to include the new network constants that you created
[15:55:04] (and puppet needs to run on pki nodes to update the ferm/iptables rules)
[15:55:13] ah right, yes, I have that file edit, but no CR for it yet.
[15:55:17] once that is done, we should see our TLS cert created
[15:55:25] super
[15:56:26] We'll do it tomorrow, I have to run to dinner with Stevie and a friend :)
[16:02:15] elukey and aiko Let's move your ITCs to next week. I think elukey's internet is broken and I could use the time to work on stuff for tomorrow
[16:04:11] chrisalbon: sure!
[16:04:14] thanks :)
[16:04:19] all right logging off earlier then!
[16:04:21] no problem!
[16:27:24] chrisalbon: o/ ok!
[17:32:45] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (lbowmaker) Adding some notes from meeting on 6/21 with Chris Albon, Eric Evans, Luca Toscano, Lukasz Sobanski, Matthew Vernon and Luke Bowmaker.
**1....