[03:21:34] 10serviceops, 10Icinga, 10SRE, 10observability: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (10lmata) p:05Medium→03High Apologies i seem to have been confused. Scheduling for review.
[07:42:40] 10serviceops, 10MW-on-K8s: Kubernetes timeing out befor pulling the mediawiki-multiversion image - https://phabricator.wikimedia.org/T284628 (10jijiki)
[07:45:02] 10serviceops, 10MW-on-K8s: Kubernetes timeing out befor pulling the mediawiki-multiversion image - https://phabricator.wikimedia.org/T284628 (10jijiki)
[07:45:04] 10serviceops, 10MW-on-K8s, 10Release Pipeline, 10Patch-For-Review: Run stress tests on docker images infrastructure - https://phabricator.wikimedia.org/T264209 (10jijiki)
[08:01:02] <_joe_> mw.config.set({"wgBackendResponseTime":10150,"wgHostname":"mediawiki-pinkunicorn-778cfcff7-x2rqq"});
[08:01:11] <_joe_> effie: ^^ VICTORY
[08:01:37] <_joe_> curl -H 'Host: en.wikipedia.org' https://staging.svc.eqiad.wmnet:4444/wiki/Main_Page
[08:14:56] ah that's nice
[08:14:59] 10serviceops, 10SRE, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10jbond) @JMeybohm [[ https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-report/+/608889 | the patch to filter ]] still required considering https...
[08:15:52] 10serviceops, 10MW-on-K8s: Kubernetes timeing out before pulling the mediawiki-multiversion image - https://phabricator.wikimedia.org/T284628 (10jijiki) p:05Triage→03Medium
[08:37:06] 10serviceops, 10SRE, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10JMeybohm) >>! In T251918#7144893, @jbond wrote: > @JMeybohm [[ https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-report/+/608889 | the patch to...
[08:53:54] 10serviceops, 10SRE, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10jbond) linked the wrong CR earlier i meant https://gerrit.wikimedia.org/r/698763, however assuming you saw past my error ill revert that now, thanks :)
[09:06:06] 10serviceops, 10MW-on-K8s, 10Kubernetes: Kubernetes timeing out before pulling the mediawiki-multiversion image - https://phabricator.wikimedia.org/T284628 (10JMeybohm) The default timeout is 2min here, unfortunately that is not configurable for pull only but for all runtime requests. See https://v1-17.docs....
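[Editor's note: a minimal sketch of the knob T284628 is talking about, assuming the 2min limit being hit is the kubelet's runtime request timeout; this is not something decided in this channel.]
# hypothetical: raise the kubelet's runtime request timeout from the 2m default;
# as noted above, it applies to all runtime requests, not to image pulls alone
kubelet --runtime-request-timeout=10m ...
# or, when the kubelet reads a KubeletConfiguration file, the equivalent field is:
#   runtimeRequestTimeout: "10m"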
[09:26:47] hello folks
[09:27:11] I have some istio/knative thoughts to bring up, let me know what you think when you have a moment
[09:28:07] the summary: istio and knative prefer the k8s operator as deployment mechanism, and they don't officially support helm
[09:28:34] istio is special since it supports two options: either istioctl or istio-operator
[09:28:42] https://istio.io/latest/docs/setup/install/operator/
[09:29:41] in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/697938 I tried to come up with a strategy for istioctl, which seemed less invasive, but of course the downside is that it would require a .deb package with istioctl binaries deployed on the deployment server
[09:30:13] they suggest deploying the istio-operator via istioctl if needed, but we could just use some helm charts instead
[09:30:31] knative is simpler - https://knative.dev/docs/install/knative-with-operators/
[09:30:54] at this point, I am curious what the best strategy to follow is
[09:31:17] 1) istioctl + knative operator
[09:31:28] 2) istio operator + knative operator
[09:31:55] 3) helm charts only for both (may be a little painful for istio, less for knative)
[09:32:31] all the operators would need basic helm charts + RBAC rules (I suppose like we do for tiller)
[09:32:32] <_joe_> I am against investing too much effort right now, while we're in the exploration phase, so I would advise against 3)
[09:34:52] 3) is a little cumbersome to maintain in the istio use case for sure
[09:35:45] jayme: let me know your thoughts when you have a moment (even tomorrow)
[09:41:53] effie: I'm not sure how to address https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/671204/20/charts/rdf-streaming-updater/values.yaml#4 related to helm_scaffold_version 0.1 > 0.3
[09:43:15] I don't understand what this var is supposed to do nor what reads it
[09:44:55] dcausse: sorry for the mixup, this was kind of an internal note for us, there is nothing actionable from your end
[09:45:09] ok thanks :)
[11:03:59] elukey: I'm not sure I get the istio thing right. When using istioctl, do you still need the operator? If not, why use the operator at all (as you would need istioctl to deploy the operator :))
[11:06:41] so some istioctl workflow seems okay I guess. Packaging istioctl is probably easier than creating working helm-charts for all those components istioctl bootstraps
[11:07:03] (and keeping the helm charts up to date)
[11:07:28] for knative I guess some kind of helm chart would be needed to install the operator, right? But that is probably pretty easy to do
[11:07:49] <_joe_> yeah also my take more or less
[11:08:03] <_joe_> go with istioctl and for knative install the operator with a helm chart
[11:08:12] <_joe_> gosh this stuff seems so absolutely brittle
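[Editor's note: to make option 1) above concrete, a rough sketch of the istioctl flow, assuming a packaged istioctl binary on the deployment host; the profile and file name are illustrative and not taken from the change under review.]
# illustrative only: install/upgrade the control plane with the default profile
istioctl install --set profile=default -y
# a fuller setup would pass an IstioOperator manifest (apiVersion install.istio.io/v1alpha1)
# with per-cluster overrides instead:
#   istioctl install -f istio-config.yaml -y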
[12:57:57] jayme: hello! So istioctl doesn't need the operator, but if you want to deploy the operator the suggested way seems to be to use istioctl (or to adapt their "hidden" helm charts to our use case)
[12:58:25] for knative yes, we'd need a basic helm chart for the operator, then a manifest to apply the standard config
[12:58:49] IIUC we go for 1)
[12:58:55] (so part of it is already in code review :)
[13:01:21] I was trying to get everybody's opinion on 2) in case we wanted more consistency in the way we deploy istio and knative
[13:01:33] but those are definitely separate, so istioctl + knative-operator seems ok
[13:02:24] I'll wait for comments in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/697938 and in the meantime I'll test/create a helm chart for knative-operator
[13:58:13] https://github.com/knative/operator/blob/release-0.18/go.mod#L35
[13:58:18] * elukey flips table
[13:58:47] so knative serving 0.18.x is the last version that supports k8s 1.16, but it seems that the operator wants k8s 1.18+
[14:01:06] sooooo back to helm charts I guess :D
[15:05:43] elukey: but that's just a go lib dependency. It's absolutely possible that it does not use API features of k8s 1.18
[15:06:34] just because I'm curious: is there any benefit to using the istio operator, compared to just istioctl I mean?
[15:11:04] jayme: yeah I agree, but I was reading https://github.com/knative/operator/releases/tag/v0.18.1, which mentions k8s 1.18, and the pull request seems to indicate that it is a necessary step. I am not sure if I am reading it correctly or not, but https://github.com/knative/operator/pull/275/commits/81d3b4b3419e5850fb7d65d5f6759d134ffafb6c worries me a bit.
[15:11:52] jayme: for the istio operator - what they advertise is that once you have it up and running (even if the initial deployment seems to require `istioctl operator init`) then you don't need to keep track of istioctl versions
[15:12:12] in our case it may be nice to avoid a .deb package for istioctl, this was my thinking
[15:12:32] (assuming that istioctl operator init could be translated into a simple helm chart of course)
[15:14:08] hm..but how would you upgrade the operator then?
[15:15:00] 10serviceops, 10decommission-hardware: decommission thumbor200[12].codfw.wmnet - https://phabricator.wikimedia.org/T273141 (10wkandek) 05Open→03Stalled We are waiting for Thumbor to migrate to k8s before retiring these servers. New servers that were originally been purchased for thumbor have been repurpose...
[15:15:23] 10serviceops, 10decommission-hardware: decommission thumbor100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T273137 (10wkandek) 05Open→03Stalled We are waiting for Thumbor to migrate to k8s before retiring these servers. New servers that were originally been purchased for thumbor have been repurpose...
[15:16:26] jayme: I guess via helm, bumping up the docker image, I imagine that it should keep what has been deployed if you don't send other commands (like kubectl apply -f some-manifest.yaml)
[15:17:28] https://knative.dev/docs/upgrade/upgrade-installation-with-operator/
[15:17:31] :)
[15:17:54] ah snap of course wrong link
[15:18:43] https://istio.io/latest/docs/setup/install/operator/#update
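[Editor's note: for the knative side of the upgrade discussion above, the operator is driven through a custom resource; a minimal sketch of what it consumes, with illustrative values not taken from any reviewed chart.]
kubectl apply -f - <<'EOF'
apiVersion: operator.knative.dev/v1alpha1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  version: "0.18.0"   # illustrative pin; bumping this field is the operator's upgrade path
EOF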
[15:20:05] And yes. If it's easily possible to create a helm chart for the istio controller and that means you do not need istioctl at all, that would be a pretty nice idea as well
[15:21:08] my concern would be (regarding operator updates) that they might change subtle things later on (not only the image they are running) and that that is not going to be properly documented because they only rely on istioctl as the upgrade path
[15:21:38] yes I'd be worried too, I think that istioctl is better
[15:21:51] the .deb is very simple, we can try with it and see how it goes
[15:23:00] for knative, the helm chart seems to be very easy, they provide a list of CRDs in one yaml and some core resources in another, so I could do something like calico-crds/calico in a couple of hours probably (prototype, not saying that it would work :D)
[15:23:03] you could do it like we do with helm, so that we can install multiple versions of it in parallel
[15:23:23] (for the deb, I mean)
[15:23:29] yep yep definitely
[15:23:43] then update-alternatives could take care of the "canonical" version
[15:25:48] in case "canonical" exists, as we will potentially end up with clusters running different versions of istio at some point
[15:27:41] what a joyful world ahead of us
[15:38:04] <_joe_> multiple istios
[15:38:10] <_joe_> on the same cluster
[15:39:16] even that...probably. Canary operators
[15:39:35] (multiple istia!)
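[Editor's note: a sketch of the versioned .deb plus update-alternatives idea above; the binary paths and priorities are made up for illustration.]
# hypothetical layout: each .deb ships its binary as /usr/bin/istioctl-<version>
update-alternatives --install /usr/bin/istioctl istioctl /usr/bin/istioctl-1.9.5 10
update-alternatives --install /usr/bin/istioctl istioctl /usr/bin/istioctl-1.10.0 20
# pin the "canonical" version explicitly instead of relying on priority
update-alternatives --set istioctl /usr/bin/istioctl-1.9.5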
[17:38:07] elukey: should we abandon https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/693826 ?
[17:39:21] jayme: yep I think so! I was able to deploy kfserving without cert-manager, so feel free to skip it
[17:40:26] we may need it in the future if we want to have a more complex TLS setup for service -> service comms in istio for example
[17:40:35] elukey: cool. Feel free to re-add me as reviewer in case you figure you need it
[17:41:13] jayme: <3
[18:26:35] I need to get the static-bugzilla HTML files from the miscweb ganeti VMs somewhere else, but we are already low on space there, so I can't make a local copy and gzip it anymore.. partially because static-codereview also uses a couple of GB now and before that most sites were tiny
[18:26:57] also can't directly dump it on the dumps servers since they have their own space issues right now
[18:27:11] and was about to pull all that via my personal phone hotspot.. but lost patience
[18:27:30] since I see the deployment server has more than enough, I'll use puppetized rsync to dump it there
[18:28:51] and then try to make my docker build pull it from there and put it inside a container
[18:31:21] we still don't have a nice all-purpose way to send data between hosts in different dcs (securely), right?
[18:33:47] you can use rsync::quickdatacopy with stunnel
[18:34:02] chris added that feature if you need secure
[18:34:33] I won't bother encrypting it in this case though, since it's public data anyways
[18:35:47] # [*server_uses_stunnel*]
[18:35:47] # For TLS-wrapping rsync. Must be set here, and must be set true on rsync::server::wrap_with_stunnel
[18:35:50] # in the server's hiera.
[18:35:52] define rsync::quickdatacopy(
[18:37:15] also, i just received my replacement Thinkpad, refresh after ~ 4 years, so will spend some time installing OS, copying data, sending old laptop back if I have to
[18:38:06] trying to transfer stickers :P
[18:41:56] is static-bugzilla actually all public? (it doesn't have old procurement stuff?)
[18:43:30] bblack: yes, it is, and because we also don't have procurement tickets imported as private tickets in phab.. we still run RT
[18:43:58] the other plan was always to make static-rt and shut down the Perl app
[18:44:30] unless "import private RT tickets into Phab" still has a chance in the future, but I think not
[18:45:33] we are providing the sanitized Bugzilla DB dump at https://dumps.wikimedia.org/other/bugzilla/
[18:45:51] but we have not put the HTML files there.. and that is what I would want to do
[18:53:29] ah no not for you
[18:53:33] I was asking generally
[18:54:09] oh right, I kinda replaced RT with bugzilla in my head there
[18:55:55] hey so that stanza that's created from the quickdatacopy, that would be just added to any existing rsync config, right?
[18:55:57] mutante:
[18:57:23] apergos: yes, it dumps fragments into /etc/rsync.d/
[18:57:28] ah perfect!
[18:57:30] for each module used in puppet
[18:58:29] the usual thing that you have to absent or manually clean up if you stop using them though
[18:59:58] yeah that's as expected
[19:00:19] tbh I almost always just manually remove and skip the absenting step, too impatient
[19:01:18] yea, if the number of servers is 1 or 2, drawing the line at clusters above 2, kind of
[19:02:06] 👍
[19:09:38] uses the $auto_ferm parameter of rsync::server::module to get the firewall hole "for free"
[19:10:18] no stopping puppet and manually messing with iptables.. but also not adding a ferm::service for a one-time copy
[19:15:49] pretty grand
[19:16:00] I gotta try to remember that because this issue comes up regularly enough
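[Editor's note: for reference, a hand-rolled sketch of what the copy boils down to once quickdatacopy has dropped the module fragment into /etc/rsync.d/ on the source host and auto_ferm has opened the port; the host and module names here are invented for illustration.]
# on the destination host, pull from the rsync daemon module defined on the source;
# plaintext is fine here since the data is public - it would need the stunnel wrapping otherwise
rsync -av rsync://miscweb1002.eqiad.wmnet/static-bugzilla/ /srv/static-bugzilla/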