[02:41:43] serviceops, SRE, Patch-For-Review: Delay spinner showing for graphs for 1s - https://phabricator.wikimedia.org/T256641 (Seddon) a: Seddon→None
[06:55:52] <_joe_> so, given I'm working today
[06:56:37] <_joe_> jelto/jayme/mutante/effie: I'm going to go through how I looked at the logs the other day during the apcu outage to nail down problematic requests later in the morning
[06:56:46] <_joe_> let's say in a couple hours?
[06:57:41] joe: I have a meeting the next 1.5hr, then I'm available and would be happy to join the session :)
[06:58:04] depends on definition of couple for me :) I'll be in transit from ~10:30Z for 4h
[07:01:08] hello folks, I have another istio-related question (please be patient)
[07:01:32] after reading a ton of logs to find why the current setup is not working, I saw this on the kube apiserver
[07:01:35] failed calling webhook "validation.istio.io": Post https://istiod.istio-system.svc:443/validate?timeout=30s: dial tcp 10.64.77.73:443: i/o timeout
[07:02:06] so istiod (the only pod up, the control plane) tries at first to validate a wrong config via the validation webhook, and fails (so all the rest is not created)
[07:02:35] in theory the kube api should be able to call the istiod validation webhook when needed
[07:02:38] but this is not working
[07:02:58] is there a special calico config that I should look up, or something else in your opinion?
[07:05:05] serviceops, SRE, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (Legoktm) a: Legoktm I've tried to summarize a combination of what I did and the feedback here into https://wikitech.wikimedia.org/wiki/Switch_Dat...
[07:06:57] elukey: the FQDN looks weird in the first place but let's ignore that as it is looked up. Is 10.64.77.73 the IP of your istiod pod?
[07:07:20] serviceops, MW-on-K8s, SRE, Shellbox, and 3 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (Samwilson) The 1.36 release notes say that "Command::execute() now returns a Shellbox\Command\UnboxedResult instead of a MediaWiki\Shell\Result....
[07:07:56] elukey: do you have proper ingress policies in place for the apiserver(s) to reach the pod?
[07:10:40] jayme: for the IP it should be, there is also an svc called "istiod" with the following
[07:10:43] Port: https-webhook 443/TCP
[07:10:45] TargetPort: 15017/TCP
[07:11:13] about the ingress policies, probably not, I have never added them (this is what I was asking about)
[07:11:36] ah now I remember that our GlobalNetworkPolicies are empty, Alex at the time said that we should have done it in a second step
[07:12:21] ok so I guess I have to mess with that
[07:13:10] I'm not 100% sure what the calico default is when you've not defined anything
[07:13:20] misery probably
[07:13:35] I would have assumed "allow all" tbh
[07:13:42] serviceops, SRE, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (Joe) After talking off-phabricator with a few people, I think what we have seen is more of a failure of coordination between affected SRE teams than...
[07:14:57] but take a look at the rules we have in main.yaml. You'll get an idea
[07:16:43] maybe https://docs.projectcalico.org/security/app-layer-policy would be a good read as well. Don't know if that applies to you or is just about sidecar injection stuff
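For reference, the "ingress policies" jayme is asking about here would be Calico (Global)NetworkPolicy objects. A minimal sketch of the shape such a rule could take, assuming a hypothetical apiserver address range and the istio=pilot label that istiod pods usually carry — the real conventions live in the main.yaml mentioned at 07:14, and as the rest of the log shows, the actual culprit turned out to be routing rather than policy:

    apiVersion: crd.projectcalico.org/v1
    kind: GlobalNetworkPolicy
    metadata:
      name: allow-apiserver-to-istiod-webhook
    spec:
      # Select the istiod pods; verify this label against the running pods.
      selector: istio == 'pilot'
      types:
        - Ingress
      ingress:
        - action: Allow
          protocol: TCP
          source:
            # Hypothetical: the kube apiserver host IPs (the apiservers are
            # not pods here, so a host range rather than a pod selector).
            nets:
              - 10.64.0.0/24
          destination:
            ports:
              - 15017  # the webhook TargetPort behind service port 443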
[07:18:47] I'll try to read it, in theory at the moment we don't need the sidecar injection magic since we don't use it, but anything can be true at this point :D
[07:19:30] I am pretty sure that if I unblock the kube api -> istiod webhook comms it should work, but there may be more to do
[07:19:44] I'll try to also check what defaults the calico global net policies have
[07:20:13] in issues like https://github.com/istio/istio/issues/19532 they say "update the firewall rules", which is of course very helpful :D
[07:22:34] <_joe_> elukey: so before going to blindly change stuff, I'd just go and do some blackbox debugging the old-fashioned way
[07:22:48] <_joe_> I can help with that if you want
[07:23:19] <_joe_> elukey: how do I select your cluster/namespace with kube_env?
[07:23:44] <_joe_> ml-serve-eqiad, I see
[07:25:01] _joe_ I am not going to blindly change stuff, I am following what's suggested by upstream and what the logs are pointing to :)
[07:25:16] <_joe_> which seem quite unclear
[07:25:41] <_joe_> that's why I said "blindly", as in "we're not sure what's not working exactly"
[07:26:09] <_joe_> sorry I wasn't suggesting you were acting recklessly :)
[07:27:16] the thing that is not working, IIUC, is that when istiod tries to validate a "wrong" config as a pre-check, it calls the kube-api, which in turn has to call the istiod validation webhook, and this times out for some reason
[07:27:44] I don't have a solid idea whether it is Calico not allowing it or something else
[07:27:51] <_joe_> so calling istiod from the kube-api seems to be the issue
[07:31:17] the IP mentioned in the errors (to follow up on what Janis asked, I was wrong) is
[07:31:20] elukey@ml-serve-ctrl1001:~$ kubectl get svc -n istio-system
[07:31:23] NAME     TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                                 AGE
[07:31:26] istiod   ClusterIP   10.64.77.73   <none>        15010/TCP,15012/TCP,443/TCP,15014/TCP   40h
[07:31:50] and it does
[07:31:51] Port: https-webhook 443/TCP
[07:31:51] TargetPort: 15017/TCP
[07:32:10] (selector is istio, so it should map to the istiod pod's 15017 in theory)
[07:33:07] makes sense to be a service ip rather than a pod ip, sorry
[07:33:24] nono, my bad, I should have checked better :)
[07:33:43] I tried nsenter to check the port on the pod etc.. and it seems to be working
[07:35:53] kubectl get ep -n istio-system gives you the correct endpoints as well, right?
[07:36:38] ah nice TIL!
[07:36:46] yes, the target is the istiod pod ip
[07:40:37] I think it's a firewall issue
[07:41:15] "Calico network policy is default deny." the docs say
[07:41:53] <_joe_> jayme: I don't think so
[07:42:00] me neither
[07:42:09] maybe different for k8s
[07:42:11] <_joe_> from ml-serve1004 I can reach the ip:port
[07:42:23] <_joe_> same from ml-serve1005
[07:42:30] <_joe_> sorry 1003
[07:42:52] <_joe_> because calico's running there
[07:43:01] ah, makes sense
[07:43:13] we're not running it at all on the masters
[07:43:15] I imagine that I am the first one testing a validation webhook
[07:43:22] <_joe_> while I can't reach it from e.g. the ml-serve-ctrl1001 server, where calico isn't running
[07:43:24] you are, elukey
[07:43:25] <_joe_> elukey: yes
[07:43:45] <_joe_> elukey: I think it was even a deliberate choice of ours at the time, but alex would remember better
[07:44:03] * elukey takes notes "Blame Alex"
[07:46:26] <_joe_> elukey: always blame alex and istio
[07:47:28] what what about vo lans?
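A rough replay of the blackbox checks from this exchange, for anyone retracing it (the hostnames are the ones from the log; _joe_ doesn't say which tool he used to "reach the ip:port", so curl here is an assumption — nc -z would do just as well):

    # service and endpoints both look healthy:
    kubectl get svc -n istio-system istiod   # ClusterIP 10.64.77.73, port 443
    kubectl get ep  -n istio-system istiod   # should list the istiod pod IP on 15017

    # from a worker node running calico-node (e.g. ml-serve1003) this connects:
    curl -vk --max-time 5 https://10.64.77.73:443/

    # from a control-plane host without calico-node (e.g. ml-serve-ctrl1001)
    # the same call times out, matching the webhook error:
    curl -vk --max-time 5 https://10.64.77.73:443/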
[07:47:32] *wait what
[07:48:06] of course we don't forget about Riccardo
[07:48:09] <_joe_> apergos: we only blame volans when python is involved
[07:48:24] oooohhh my bad, I was just blaming all the time tbh :-P
[07:48:24] <_joe_> this is golang all the way down
[07:48:43] ohdear :-D
[07:48:44] _joe_ I think that he got 3 pings for highlights (username, Riccardo, python)
[07:48:54] <_joe_> apergos: also you can assume blaming him is ok if you get frustrated at any linter
[07:49:00] * elukey would love to see Riccardo's IRC client
[07:49:05] whew gtk :-D
[07:49:06] <_joe_> elukey: and we didn't put any cumin in the spice mix
[07:49:16] so this situation is the same on staging clusters (where the master is unable to reach service/pod IPs)
[07:49:38] confirmed it's not a policy issue then
[07:49:39] ahahha yes we were derailing the conversation a bit :D
[07:50:47] <_joe_> elukey: we were establishing ground rules for blaming, it's an important detour
[07:50:58] <_joe_> jayme: yes, they need to run calico on the masters
[07:51:10] I assume it is not as simple as including profile::calico::kubernetes
[07:51:10] <_joe_> now, how can they do so? aren't we running calico in-cluster now?
[07:51:25] <_joe_> I completely lost track tbh
[07:51:42] yep. calico-node is running as a daemonset
[07:52:51] and for the masters to run it, in theory they should bgp-peer with the routers as well
[07:52:59] damn...I totally did not think about this at the time of building the new calico stuff
[07:55:02] <_joe_> jayme: well we can run it as a docker process maybe on the masters?
[07:55:37] <_joe_> jayme: do we need calico-node for being able to route to calico addresses though?
[07:55:46] <_joe_> I think it's only needed to set up local IP addresses
[07:56:11] I'm not sure tbh
[07:56:48] I am glad to deliver joy to your team folks
[07:57:03] the docs do hide that fact pretty well (they only talk about calico-node needing to run on every node)
[07:57:13] I'll tell Chris to prep something to ship as a gift :D
[07:57:34] but the k8s manifests do state that they want the daemonset on the masters as well
[07:57:58] <_joe_> but I don't think we run the kubelet on the master, do we?
[07:58:02] nono
[07:58:16] we don't but it seems they assume we do
[07:58:20] which is weird
[07:58:32] <_joe_> so yeah, we might need to run calico-node on the masters as well somehow
[08:00:07] maybe a silly question, but wouldn't it be sufficient to allow the masters to bgp-peer with the routers + adding profile::calico::kubernetes to them?
[08:00:12] or is there something more?
[08:03:09] * elukey sees Janis building a voodoo doll with "Luca" written on top
[08:06:15] unfortunately all that stuff is in calico-node I guess
[08:06:34] <_joe_> no
[08:06:48] <_joe_> calico-bird, typha and the other thing are still on the servers I think?
[08:06:54] <_joe_> looking at the puppet class
[08:07:07] <_joe_> elukey: that was also my hope
[08:07:23] <_joe_> elukey: given your cluster is still experimental, you can try :)
[08:07:50] no
[08:08:04] bird is running inside the node container
[08:08:12] which is hostNetwork: true
[08:09:26] so is felix
[08:09:39] <_joe_> jayme: ok so we're not installing profile::calico::kubernetes anymore? because it seems to install bird
[08:10:35] where do you see that?
[08:11:05] is installs calicoctl and calico-cni
[08:11:07] *it
[08:11:45] <_joe_> jayme: right it's just ferm
[08:11:49] <_joe_> sigh.
[08:12:12] <_joe_> ok, so... do we have any chance to make it work as a simple docker container launched by systemd?
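To make _joe_'s question concrete: a docker-under-systemd calico-node would look roughly like the sketch below (the unit layout, image tag and environment are guesses modelled on Calico's documented "non-cluster host" install, not anything that exists in puppet). As the next exchange shows, the team leaned towards running kubelet on the masters instead.

    [Unit]
    Description=calico-node via docker (sketch; not the approach chosen below)
    After=docker.service
    Requires=docker.service

    [Service]
    # --net=host and --privileged mirror how the DaemonSet runs calico-node
    # (hostNetwork: true); the NODENAME/image values are placeholders.
    ExecStartPre=-/usr/bin/docker rm -f calico-node
    ExecStart=/usr/bin/docker run --name calico-node --net=host --privileged \
        -e NODENAME=%H \
        -e CALICO_NETWORKING_BACKEND=bird \
        -v /var/run/calico:/var/run/calico \
        -v /lib/modules:/lib/modules:ro \
        calico/node:v3.18
    ExecStop=/usr/bin/docker stop calico-node
    Restart=always

    [Install]
    WantedBy=multi-user.target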
[08:13:31] I'm not sure that is simpler than running kubelet on the masters with the master taint on
[08:14:21] running it manually via docker/systemd would mean another way of launching, and potentially configuring, it
[08:14:37] + we would need docker on the masters (which we don't have currently)
[08:14:54] in that case, I think we could also add kubelet
[08:19:19] <_joe_> yeah you're probably right
[08:19:32] <_joe_> probably the best solution
[08:20:08] <_joe_> we also need to add a toleration to all helm charts though in that case
[08:20:17] <_joe_> which we don't have rn
[08:20:22] yeah
[08:20:39] wait, no :)
[08:21:01] we would just need to add something for calico to allow it to run on master nodes as well
[08:22:50] and that we already have in calico-node
[08:23:11] * _joe_ goes to look at the charts
[08:32:52] I can open a task with a summary of everything that was discussed, if it helps
[08:34:59] that would be nice indeed
[08:36:54] will do it in a bit :)
[08:37:35] <_joe_> thanks
[08:51:05] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jelto@cumin1001 for hosts: `mw1261.eqiad.wmnet` - m...
[08:54:45] jelto: mw1261+ are canary hosts, have we decided which ones are going to replace them?
[08:56:48] <_joe_> it's about time for the debugging session, but AIUI jayme isn't available
[08:57:05] <_joe_> also I didn't see mutante around today
[08:59:05] joe: mutante was in a decom session with me this morning. How do we join the debugging session? Meet? tmux?
[09:03:30] effie: I talked with mutante; we have to decide how to replace the canaries. I will make sure that there is a ticket in Phabricator for the new canaries so we don't forget about replacing them.
[09:05:41] cool, just remember to copy over the hieradata/hosts/.yaml for each canary
[09:05:48] thank you !
[09:06:12] _joe_: I can't join as I am working on tegola's deployment
[09:14:22] I'm t
[09:14:55] I'm still around for like an hour, but it looks like a bad slot anyways :)
[09:24:21] <_joe_> ok
[09:24:43] <_joe_> sorry I was reading a couple tasks
[09:25:09] <_joe_> ok let's do tomorrow at 9:00Z
[09:25:35] <_joe_> but at that point who's in is in, and that's going to be final
[09:26:01] sgtm
[09:30:25] _joe_: thanks for looking at dragonfly. Unfortunately I forgot to add a supernode role
[09:30:39] <_joe_> lol see?
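The "something for calico" from 08:21 that lets it land on tainted master nodes is a toleration on the calico-node DaemonSet; upstream calico manifests ship something along these lines (the wmf chart may phrase it differently):

    spec:
      template:
        spec:
          tolerations:
            # Tolerate every NoSchedule/NoExecute taint, which covers the
            # node-role.kubernetes.io/master taint on control-plane nodes.
            - effect: NoSchedule
              operator: Exists
            - effect: NoExecute
              operator: Exists
            - key: CriticalAddonsOnly
              operator: Exists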
[09:30:51] <_joe_> I was concentrating on reading the code and seeing if there was some error
[09:38:12] done
[09:41:32] serviceops, Machine-Learning-Team, SRE: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (elukey)
[09:41:54] serviceops, Machine-Learning-Team, SRE: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (elukey)
[09:41:58] ok, tried to summarize it all in --^
[09:42:44] serviceops, Machine-Learning-Team, SRE, Kubernetes: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (JMeybohm)
[09:43:15] looking
[09:44:39] serviceops, Machine-Learning-Team, SRE, Kubernetes: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (JMeybohm)
[09:52:12] serviceops, Machine-Learning-Team, SRE, Kubernetes: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (JMeybohm) I don't like the idea of having another way of how calico-node is run (it's already complex enough). Because of that I'll sugg...
[09:58:33] serviceops, Machine-Learning-Team, SRE, Kubernetes: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (elukey) Definitely, it seems a good way to proceed. The only concern that I have is that our kube masters are lightweight VMs (1 virtual...
[10:00:17] does anyone know what "Error: release main failed: timed out waiting for the condition" means?
[10:00:43] I run helm apply
[10:00:53] and after waiting for ages, I got this error
[10:03:01] _joe_: ^
[10:03:33] <_joe_> effie: do you have very fat containers to deploy?
[10:03:51] nemo-yiannis: do I have very fat containers to deploy?
[10:03:52] <_joe_> effie: it generically means that some operation lasted longer than 2 minutes
[10:04:08] <_joe_> one such case could be pulling from the registry - it happens with the mediawiki images
[10:04:16] <_joe_> effie: look at the kubernetes events
[10:04:24] ah right, let me look
[10:05:22] ah !
[10:05:25] Error creating: pods "tegola-vector-tiles-main-69d844cf76-8llw2" is forbidden: minimum cpu usage per Container is 100m, but request is 1m
[10:05:36] ok let me fix that
[10:05:42] thanx joe
[10:07:30] effie: I pulled it locally and it's ~700MB so not really a lightweight image.
[10:07:58] ok let me fix the first issue, and we will see about its size
[10:15:00] Yeah, I just checked blubber. We have a step to build tegola and then we copy the whole intermediate image, including all the dependent packages that we don't necessarily need on prod.
[10:15:02] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/tegola/+/refs/heads/wmf/v0.14.x/.pipeline/blubber.yaml#27
[10:17:08] so, we can make the image smaller I reckon?
[10:18:52] yes, the vast majority of the files are go packages
[10:18:58] serviceops, Machine-Learning-Team, SRE, Kubernetes: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (JMeybohm) Yeah, maybe. Calico-node runs with a memory limit of 400Mi and CPU requests of 350m but the other components will also take up...
[10:23:28] <_joe_> oh indeed, that's also a security issue tbh
[10:24:04] <_joe_> nemo-yiannis: do you already know what you need to do to improve this, or do you need assistance?
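The event at 10:05 is a namespace LimitRange rejecting a 1m CPU request (the minimum per container is 100m), so the fix is raising the request in the chart values. A sketch against hypothetical values.yaml keys — the tegola chart's actual structure and the memory/limit numbers here are not from the log:

    resources:
      requests:
        cpu: 100m      # was 1m, below the LimitRange minimum of 100m
        memory: 128Mi  # hypothetical, not part of the reported error
      limits:
        cpu: "1"       # hypothetical
        memory: 256Mi  # hypothetical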
[10:24:56] <_joe_> you just need to copy /srv/service/cmd/tegola/tegola over I think
[10:25:01] Yeah I am working on it, I am testing the patch locally at the moment.
[10:33:46] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (fgiunchedi) Another data point, as expected post-switchover the high latency uploads from jobrunners moved from codfw to eqiad since codfw is now active.
[10:35:07] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (fgiunchedi) Also to avoid confusion I'd like to clarify that on the swift side I can't find anything obviously wrong though I don't have the bandwidth to investiga...
[10:59:34] <_joe_> ugh this is the mwdebug pod cpu usage *without any load besides readiness probes* https://grafana-rw.wikimedia.org/d/U7JT--knk/joe-k8s-mwdebug?viewPanel=28&orgId=1&refresh=1m
[10:59:43] <_joe_> I would say it's quite under-resourced :P
[10:59:51] <_joe_> I'll dig deeper later
[11:05:58] which are the mwdebug pods in codfw for deploy testing anyhow?
[11:07:10] nm
[11:46:33] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jelto@cumin1001 for hosts: `mw1262.eqiad.wmnet` - m...
[12:09:31] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jelto@cumin1001 for hosts: `mw1263.eqiad.wmnet` - m...
[12:09:55] jelto: FYI the decommission cookbook can be run on multiple hosts at once (limited to 5 by default, to 20 with --force)
[12:10:32] this also makes the dns cookbook part of the run quicker because it runs only once at the end
[12:11:36] but of course be careful on which hosts you run it
[12:11:52] *run it on
[12:12:52] volans: thanks for the hint! As this is my first time running this cookbook I wanted to get a feeling for what it is doing. But I will batch the last two remaining mw canary hosts together :)
[12:13:31] ack! feel free to ping me if you have any question about it :)
[12:13:56] volans: thanks a lot, I will do
[12:30:06] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jelto@cumin1001 for hosts: `mw[1264-1265].eqiad.wmn...
[12:49:57] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jelto@cumin1001 for hosts: `mw1266.eqiad.wmnet` - m...
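What _joe_ suggests at 10:24 is the usual multi-stage trick in blubber: a final variant that copies only the compiled binary out of the build variant instead of inheriting the whole intermediate image. A sketch — the variant names, base images and build command are guessed rather than taken from the real blubber.yaml linked above:

    version: v4
    variants:
      build:
        # hypothetical builder variant: compiles tegola under /srv/service
        base: docker-registry.wikimedia.org/golang:latest
        builder:
          requirements: ["."]
          command: ["make", "build"]
      production:
        # slim runtime image: no go toolchain, no module cache
        base: docker-registry.wikimedia.org/wikimedia-buster:latest
        copies:
          - from: build
            source: /srv/service/cmd/tegola/tegola
            destination: /srv/service/tegola
        entrypoint: ["/srv/service/tegola"]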
[13:29:10] tried to come up with https://gerrit.wikimedia.org/r/702645
[13:29:47] the pcc diff looks reasonable, then there will be the calico part in case (plus I imagine the router part to enable BGP peering)
[13:50:46] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn)
[13:54:41] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn) @Jclark-ctr @wiki_willy The 6 servers at the bottom of rack A5 (mw1261 through mw1266) have been decomed and...
[13:57:32] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Jelto)
[14:00:42] serviceops, SRE: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Jelto)
[17:13:43] serviceops, SRE, Patch-For-Review: Delay spinner showing for graphs for 1s - https://phabricator.wikimedia.org/T256641 (herron) p: Triage→Medium
[17:18:34] serviceops, Machine-Learning-Team, SRE, Kubernetes, Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (herron) p: Triage→Medium
[19:18:33] serviceops, SRE, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (wkandek) Thanks everybody for the feedback on the communications for the DC switchover process. We will spend some time this quarter (Q1) in working...
[20:25:28] serviceops, Performance-Team, SRE, MW-1.36-notes, and 3 others: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (Krinkle) So the problem appears to be bad interactions between WANCache's "pre-emptive regeneration" feature (as prompted by...
[22:56:09] serviceops, SRE, Services, Wikibase-Quality-Constraints, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Addshore) Any idea on a timeline for being able to get this ticket moving? It's blocking T176312 whic...