[06:33:17] serviceops, SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (Joe) After collecting some correct data, and discussing the matter with @Krinkle, we don't think we have a strict need for onhost memcached at the moment if not for releivin...
[06:40:09] serviceops, MW-on-K8s: Allow php-fpm to read environment variables from the system, not just from the fcgi request - https://phabricator.wikimedia.org/T326705 (Joe)
[06:40:49] serviceops, MW-on-K8s: Allow php-fpm to read environment variables from the system, not just from the fcgi request - https://phabricator.wikimedia.org/T326705 (Joe) p:Triage→Medium
[08:10:51] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update staging-codfw to k8s 1.23 - https://phabricator.wikimedia.org/T326340 (JMeybohm)
[08:11:15] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update staging-codfw to k8s 1.23 - https://phabricator.wikimedia.org/T326340 (JMeybohm)
[10:25:09] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=edb03633-d9b6-4a06-849d-2c3da0e62688) set by cgoubert@cumin1001 for 7 days,...
[10:41:11] claime: one thing to be aware of when deploying the ingress things/doing the LVS dance for it (and I was reminded by face punch about that yesterday) is that the ingressgateway will not bind to its nodeports unless there is at least one backend. Which means LVS will not go green
[10:42:11] jayme: That's once you go to lvs_setup right?
[10:42:50] claime: yes. Basically health checks from pybal will fail as long as there is no backend service for your ingress in k8s
[10:43:13] which kinda makes sense apart from that it is also quite counter intuitive :)
[10:43:48] And if I don't have a service to deploy yet?
It'll just stay "broken"
[10:43:50] ?
[10:43:54] yep
[10:45:07] Ok, so in your opinion, would it be better to wait until I actually have an ingress backend to deploy there to advance with the ingress LVS setup?
[10:51:53] serviceops, Service-deployment-requests: New Service Request 'security-api' - https://phabricator.wikimedia.org/T325147 (STran) @Joe Suman said you were the person to talk to regarding next steps?
[10:56:37] claime: yeah, that would make it more evident that you did the right thing during lvs dance
[10:57:28] alternatively you could deploy some dummy thing...but otoh you can do and test everything regarding ingress even without the LVS setup
[10:57:42] jayme: ack. I'm trying to figure out why my change didn't add the service ip on the loopback of the k8s worker
[10:57:56] I suspect puppet shenanigans
[10:58:37] claime: did you add profile::lvs::realserver::pools ?
[10:58:52] jayme: https://gerrit.wikimedia.org/r/c/operations/puppet/+/868101/6/hieradata/role/common/aux_k8s/worker.yaml yep
[10:59:17] But maybe the profile isn't included, I didn't touch that part. That's what I'm looking at now
[10:59:33] <_joe_> that's very possible but would be weird
[10:59:40] +1 :)
[11:01:07] 18 # LVS configuration, for service VIPs
[11:01:09] 19 # include ::profile::lvs::realserver
[11:01:13] Well there's my problem.
[11:01:20] It's commented out :')
[11:01:21] <_joe_> ahah
[11:01:58] <_joe_> yeah that would cause what you're observing
[11:04:39] <_joe_> I am shopping for reviewers for a small patch...
https://gitlab.wikimedia.org/repos/sre/sextant/-/merge_requests/2
[11:06:06] <_joe_> ottomata: ^^ this fixes the couple issues you had last week
[11:09:28] I can take a look after lunch
[11:10:11] <_joe_> thanks, the patch is really trivial but fixes a couple annoying things
[11:21:50] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc1038.eqiad.wmnet with OS bullseye
[11:41:32] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Jclark-ctr) @clements_goubert I checked yesterday afternoon did not see any alerts. Let’s repool server close ticket
[11:41:42] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Jclark-ctr) In progress→Resolved
[11:51:49] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Clement_Goubert) Server repooled, thanks a bunch.
[11:53:59] serviceops, Kubernetes: Show less diff context by default on helm apply - https://phabricator.wikimedia.org/T326205 (Clement_Goubert) If it's added to the helmDefaults args, helmfile will give it as argument to any downstream helm command, which results in a smaller diff when running `helmfile -e eqiad -...
[12:58:33] _joe_: +1'ed. Would you mind taking a look at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/875947
[13:01:05] <_joe_> jayme: I'm currently at lunch, I'll take a look later
[13:01:14] sure, no rush!
[13:03:56] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc1038.eqiad.wmnet with OS bullseye executed with errors: - mc1038 (**FAIL**) - Downtimed on Icinga/Alertmanag...
[13:04:16] also: deployment-charts CI is constantly complaining about the mw-debug traindev release not rendering correctly. Is that expected?
[13:07:33] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc1038.eqiad.wmnet with OS bullseye
[13:07:48] <_joe_> jayme: sigh, no that's probably on me
[13:17:22] serviceops, Prod-Kubernetes, Kubernetes: Remove the .Values.kubernetesApi hack - https://phabricator.wikimedia.org/T326729 (JMeybohm)
[15:50:46] ahoyhoy - given thumbor's significant (and soon to grow) resource requirements, our default of 25% maxsurge is too much for a rollout. I'd like to set it to 1 for thumbor: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878957
[15:55:08] We should probably provide something to override this directly in the scaffold
[15:55:17] Seems like something we'd want easily accessible
[15:56:12] I think something like the above would fit pretty well, maybe just the whole dict akin to how we include limits
[15:58:08] +1'd, in any case
[15:58:38] thanks! I'll make a separate CR for the scaffold
[15:59:40] I didn't mean that you had to do it :D
[16:01:32] ah but it'd be nice for me to :D
[16:01:40] It would <3
[16:01:52] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc1038.eqiad.wmnet with OS bullseye completed: - mc1038 (**FAIL**) - Downtimed on Icinga/Alertmanager - //Un...
[16:01:54] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc1038.eqiad.wmnet with OS bullseye executed with errors: - mc1038 (**FAIL**) - Downtimed on Icinga/Alertmanag...
[16:10:18] jayme: how's https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878210 look?
[16:10:26] cc also btullis ^
[16:11:03] I looked, but hadn't worked out why CI failed yet.
[16:11:19] apparently it is not related ^^^^ :)
[16:12:03] Ah great, I hadn't checked the backscroll here. Will look again after meeting.
[16:21:12] I'm not convinced the CI failure is unrelated ;)
[16:22:31] Also I'm not sure if we should put that under helmfile.d/services tbh
[16:23:16] the ml clusters do have a separate directory (helmfile.d/ml-services) and I would think that dse should have one as well
[16:24:23] all the things below "services" are deployed to wikikube clusters and it would probably be smart to keep it that way so we know what actually needs to be deployed there in case of havoc or whatever
[16:29:36] jayme: okay, hm. for this one, it might be nice to be able to deploy it to wikikube too if/when it is time to test deploying there?
[16:29:45] but i guess we can copy the helmfile and adjust
[16:30:01] so. do we not have any helmfile based 'services' deployed to dse-k8s?
[16:30:09] i thought maybe datahub was deployed there?
[16:30:17] jayme: That scheme (`helmfile.d/dse-services` or similar) would work for me, I think.
[16:30:40] ottomata: We don't have any services deployed to dse-k8s yet.
[16:31:22] k
[16:33:22] btullis: jayme done
[16:34:38] hm jayme you are right about CI:
[16:34:39] 11:30:44 err: no releases found that matches specified selector() and environment(ml-staging-codfw), in any helmfile
[16:35:32] maybe i need to add the codfw releases even if we don't use them
[16:35:50] ottomata: I *think* you might need to add the new environment (dse-... ...) to the ENV_EXPLORE in .rake_modules/tester/asset.rb
[16:35:50] btullis: there isn't a dse staging, right?
[16:35:56] k
[16:36:15] Correct, not yet anyway. Not planned immediately either.
[16:37:06] * jayme https://media.giphy.com/media/COYGe9rZvfiaQ/giphy.gif
[16:38:32] Do we need to bikeshed whether it should be `helmfile.d/dse-services` or `helmfile.d/dse-k8s-services` ? Now is the time to do it if we need to.
[16:39:44] shed ahead. I'll be in the hedgerow :)
[16:40:59] ah
[16:41:00] hm
[16:41:23] perhaps dse-k8s, to match the admin_ng stuff?
[16:44:57] done ^
[16:45:20] jayme: like this? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878210/7/.rake_modules/tester/asset.rb
[16:45:51] yes. did it work?
[16:51:18] I'm clogging the CI a bit, sorry. Thought you had maybe verified it locally. But at first glance it seemed to have worked
[16:51:56] OH HO! just glanced again. it did!
[16:52:05] or...did it just stop checking my helmfile because i moved it
[16:52:06] looking
[16:52:23] nono, you now have a proper diff
[16:52:26] nope, it did lint. gr8
[16:54:27] so, if we merge, to deploy this would just be a cd ...../flink-app-example; helmfile -e dse-k8s-eqiad -i apply?
[16:54:44] yup
[16:54:47] the stream-enrichment-poc namespace exists, and the operator should be able to do its thang
[16:56:53] Yep, I think so. I +1d it and I'm happy to jump on a call in 15 mins if you'd like to try it today.
[16:57:50] I think I managed to merge conflict your CR in the meantime, but apart from that it looks good to me
[17:04:37] btullis: would in 1h work?
[17:07:27] ottomata: Sorry, not really. I'll be out of time in 1h. Nothing stopping you doing it without me though, or we could work together tomorrow?
[17:31:58] btullis: okay what about RIGHT NOW!?
[17:40:24] nm! i'll give it a go without ya and see how it goes...
[17:40:26] in a bit
[17:45:17] ottomata: Sorry, I missed the ping. Still available right now.
[18:36:28] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update staging-codfw to k8s 1.23 - https://phabricator.wikimedia.org/T326340 (JMeybohm)
[18:39:38] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update staging-codfw to k8s 1.23 - https://phabricator.wikimedia.org/T326340 (JMeybohm)
[18:41:28] serviceops, Data-Engineering-Planning, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Hm, trying to deploy flink-app-example is erroring, I think we need some ex...
[18:43:19] serviceops, Data-Engineering-Planning, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (JMeybohm) Oops, yeah. We are pretty restrictive with permissions for deployment users...
[19:06:51] serviceops, Data-Engineering-Planning, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Or, hm, @JMeybohm @BTullis, is this because I am deploying into a namespac...
[19:07:06] jayme: you still around?
[19:07:12] yep
[19:08:41] jayme: i wonder, it looks like the error is in the stream-enrichment-poc namespace, trying to query or flinkdeployments resource
[19:09:00] oh, but, hm. install_flink_operator will be true anyway? because we are in dse-k8s-eqiad.
[19:09:01] right.
[19:09:01] okay
[19:09:02] serviceops, Data-Engineering-Planning, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (JMeybohm) >>! In T324576#8517486, @Ottomata wrote: > Or, hm, @JMeybohm @BTullis, is...
[19:09:22] yeah. It's basically "you" not having permissions to access them
[19:09:24] okay
[19:09:45] jayme: you want to merge and apply that or shall I?
[19:09:59] I want to wait for CI first :)
[19:10:05] ya ya
[19:11:35] CI happy
[19:12:53] yeah, diff looks good
[19:14:07] ottomata: I've merged the patch. Feel free to apply to the dse cluster. I'll take care of the others tomorrow when I have the follow up patch ready
[19:14:15] okay
[19:19:54] can diff now...
[19:23:40] serviceops, Data-Engineering-Planning, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Getting somewhere! App was deployed, but: `lang=json { "@timestamp": "2...
[20:41:55] serviceops, Data-Engineering-Planning, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > Need more networkpolicy? ... Hm no. I think 'flink-app-main.stream-enric...
[22:03:35] serviceops, Infrastructure-Foundations, SRE: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Dzahn) Alright, finally getting back to this. So the request is that the group "deployment", which is already on the canary_appserver role on mwdebug hosts...
[22:17:59] serviceops, Infrastructure-Foundations, SRE, Patch-For-Review: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Dzahn) @daniel This was for you, remember that?
[22:26:32] serviceops, Infrastructure-Foundations, SRE, Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Dzahn)
[22:27:49] serviceops, Infrastructure-Foundations, SRE, SRE-Access-Requests, Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Dzahn)
[22:27:58] serviceops, Infrastructure-Foundations, SRE, SRE-Access-Requests, Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Dzahn) >>! In T305979#7976119, @MoritzMuehlenhoff wrote: > This was discussed in the Infrastructure Foundation...
[22:28:04] serviceops, Infrastructure-Foundations, SRE, SRE-Access-Requests, Patch-For-Review: allow mw-deployers to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (Dzahn) a:Dzahn→None