[10:12:43] serviceops, Kubernetes: Show less diff context by default on helm apply - https://phabricator.wikimedia.org/T326205 (fgiunchedi)
[10:40:12] serviceops, Kubernetes: Show less diff context by default on helm apply - https://phabricator.wikimedia.org/T326205 (Clement_Goubert) You can add `--args '--context n'` to any command that uses helm-diff (basically any command that can be run with `-i`, and `diff`) I'll try and find if there's a way to...
[11:03:16] serviceops, Kubernetes: Show less diff context by default on helm apply - https://phabricator.wikimedia.org/T326205 (Clement_Goubert) p:Triage→Low Ok, got it, it can be added to the helmfile in `helmDefaults['args']` I'll make a CR for it
[13:20:00] serviceops, MediaWiki-Shell, SRE: Update limit.sh to support systemd-based cgroup management - https://phabricator.wikimedia.org/T136603 (LSobanski)
[13:52:55] serviceops, MediaWiki-Shell, SRE: Update limit.sh to support systemd-based cgroup management - https://phabricator.wikimedia.org/T136603 (Joe) Open→Invalid Since then we've moved to using remote shellbox in production, so I'm not strictly interested anymore in any solution compatible with cgr...
[14:00:33] serviceops, Data-Engineering, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (JMeybohm) >>! In T324576#8464284, @Ottomata wrote: > **Ingress**: I don't think we //need// an...
[14:12:59] serviceops, Content-Transform-Team-WIP, Maps: OSM import fails on both eqiad/codfw because of wrong data input - https://phabricator.wikimedia.org/T325293 (TheDJ) Maybe we should add imposm to release monitoring ?
https://phabricator.wikimedia.org/diffusion/LLIC/browse/master/monitoring.json
[14:14:10] serviceops, Content-Transform-Team-WIP, Maps: OSM import fails on both eqiad/codfw because of wrong data input - https://phabricator.wikimedia.org/T325293 (Jgiannelos) It looks like OSM syncing is catching up with all the old diffs on eqiad after bumping imposm version.
[14:34:36] jayme: ty for reviews and comments so far
[14:34:47] trying to understand how the egress serviceproxy bit works here.
[14:35:01] the flink app isn't really a 'service' so it isn't going to have a 'public_port' (usually)
[14:35:24] ottomata: you can ignore that side of the thing
[14:35:52] the service-proxy has two jobs currently
[14:36:12] one is tls-termination for its backend service (this is the part you can ignore)
[14:36:53] the other is proxying, tls-termination etc. for connections to other services (from your/the backend service)
[14:36:54] hm okay, so I just remove the public_port part in the chart values.yaml?
[14:36:59] right.
[14:37:40] or do I just make up a dummy port?
[14:38:04] I think the module might require a value here so you'd have to go with a dummy I think.
[14:38:20] but tbh. I'm not sure because this is a first :)
[14:38:39] (ab)using the service-proxy for just the service-proxying part
[14:38:44] ;)
[14:38:53] :)
[14:38:54] oiay
[14:38:56] okay
[14:45:32] jayme: it looks like kafka clusters are not connected to via service proxy, right?
[14:47:42] what is tcp_proxy.listeners? I guess just manually defined proxy endpoints for the service proxy? vs discovery.listeners whose values come from config management?
[14:54:04] ottomata: yeah I think the connection to kafka brokers is a direct one
[14:55:00] tcp_proxy probably came in for psql... effie?
[14:58:13] for kafka direct connections are probably still required as we don't want load-balancing there I suppose :) but other stuff (like calling mw-api) should go via the proxy
[14:58:36] right, okay
[14:59:00] mmmmm I am not on my computer so I will have to come back to you on that, but I take full responsibility whatever it ia
[14:59:02] is
[14:59:05] next q: the 'more complex' setup idea is nice for us here because then we don't need to deploy discovery endpoints for each flink-app?
[14:59:29] unless it involves a dead body, you'd be on your own there
[14:59:52] effie: too late...responsibility already taken :-)
[15:00:22] but there is a bullet wound and I am holding an axe, it is impossible
[15:01:25] ottomata: compared to our usual ingress setup, yes
[15:03:19] plus less TLS certificates (there might even be work needed to get that right with the Kubernetes native Ingress objects, I'm not sure)
[15:04:37] okay, i think i can maybe test this fancy ingress stuff locally...but i think I can't test the service proxy mesh container locally?
[15:04:39] or can I?
[15:04:45] i think it needs extra config from prod?
[15:06:13] actually, you know, maybe we can punt on the fancy ingress for the jobmanager for now? it isn't a requirement. as long as I can access the job manager port via an ssh tunnel, that's good enough for now
[15:09:33] serviceops, Data-Engineering, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Ingress: Okay, let's put off working on ingress for the jobmanager UI port until lat...
[15:10:05] jayme: okay, cool. added mesh.deployment.container to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/866510/. Couple of outstanding comments there still.
[15:10:47] also responded to your comments on operator at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865158/
[16:10:54] <_joe_> I strongly suggest NOT to use envoy's TCP proxies
[16:11:17] <_joe_> my experience with them is that they're ok for light loads, but horrible for anything else
[16:13:17] k, don't need them, was just wondering what they were for
[16:13:21] <_joe_> as for flink: yes if they're not exposing any port then you only need the service proxy, and we might need to tweak the module
[16:13:42] <_joe_> so that it's possible not to add the service and local TLS listener
[16:13:45] _joe_: should I modify the copy of the module vendor file in my chart?
[16:13:51] or should I make a patch to make a new version of the module?
[16:13:52] <_joe_> ottomata: no!
[16:14:06] <_joe_> make a patch for that module 1.0.1 :)
[16:14:08] the dummy port is probably fine too
[16:14:13] okay if you would prefer that i can do it!
[16:14:30] <_joe_> <3
[16:14:47] <_joe_> I mean if you don't feel like it, just use the dummy port for now and wait for me to get around implementing it
[16:14:53] <_joe_> it should work too
[16:16:21] i'm waiting for more reviews rn anyway...
[16:28:14] ottomata: sorry, had to make a call. Yeah we can totally kick the ingress stuff down the road as people can do port-forward to the UI
[16:28:54] gr8
[16:30:08] well..that's people with deployment access. But I think for now that's still okay
[16:31:48] yeah that's totally fine, it would only be those folks
[16:31:58] admins of the flink app
[16:32:20] _joe_: should I also conditionally include the tls-proxy-certs? if no public_port i guess no certs too?
[16:32:38] <_joe_> ottomata: correct
[16:32:45] best practice for doing that?
I'd guess i'd wrap all the usages of that in
[16:32:47] {{- if or .Values.mesh.certs .Values.puppet_ca_crt }}
[16:33:19] <_joe_> I would rather add all the logic based on if a tls port is offered
[16:33:23] should I just repeat that everywhere, or make some define mesh.service.enabled or something?
[16:33:31] okay, so everywhere, if public_port
[16:33:38] <_joe_> if the port is nil or 0, something like that
[16:33:41] great
[16:33:48] <_joe_> if I don't have a public port, I don't have a tls terminator
[16:33:55] <_joe_> then I don't have certs either
[16:33:58] right cool
[16:34:07] <_joe_> but if I have a public port and no certs, I want the chart to fail
[16:38:27] _joe_: do I copy the new tpl files to the scaffold?
[16:39:01] <_joe_> ottomata: it's enough for now that you just commit the module change, we'll apply to the charts in subsequent changes
[16:39:06] okay
[16:39:11] i wanted to add some fixtures or something
[16:39:16] that's just in the scaffold?
[16:40:10] <_joe_> uhm good idea
[16:40:27] <_joe_> fixtures are in all the charts but yes, updating the scaffold would be good too
[16:40:38] <_joe_> the best way to do it is to do as follows:
[16:40:47] <_joe_> * pip install sextant
[16:40:59] <_joe_> * sextant vendor --force _scaffold
[16:41:10] interesting!
[16:41:11] okay...
[16:41:15] <_joe_> (not sure if force is needed :P)
[16:42:16] <_joe_> yeah sorry, the documentation is wip :P
[16:42:17] _joe_: should all mesh tpls get a new version, even if no change?
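Editor's note: the rule _joe_ spells out between 16:33:19 and 16:34:07 (no public port means no TLS terminator and no certs, while a public port without certs should make the chart fail) can be sketched in Python. This only illustrates the decision logic; it is not the mesh module's Helm template, and the function name `tls_terminator_enabled` and its arguments are hypothetical.

```python
from typing import Optional


def tls_terminator_enabled(public_port: Optional[int], certs: Optional[dict]) -> bool:
    """Decide whether the local TLS terminator (and its certs) should be rendered.

    Hypothetical helper: mirrors the rule discussed in the chat, not real chart code.
    """
    # Port is nil or 0: no TLS terminator, hence no certs are needed either.
    if not public_port:
        return False
    # A public port without certs is a configuration error, so fail loudly.
    if not certs:
        raise ValueError("public_port is set but no TLS certs were provided")
    return True
```

In a Helm template the same branch structure would presumably live in a single named define, so that every usage checks one condition instead of repeating it.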
[16:42:25] i don't have to change name_1.0.0.tpl
[16:42:33] <_joe_> no, just the one you changed
[16:42:34] but all the others have new version
[16:42:36] okay
[16:44:13] <_joe_> FTR, sextant is https://gitlab.wikimedia.org/repos/sre/sextant
[16:44:22] sextant could use some docs on the pypi side -- https://pypi.org/project/sextant/
[16:44:44] <_joe_> bd808: yeah on both sides heh
[16:44:52] <_joe_> I only have so much time :/
[16:45:12] <_joe_> I am currently writing a mcrouter module, btw, expect a patch for toolhub to use it to land soon
[16:45:14] <_joe_> :)
[16:45:20] * bd808 orders a timespinner for _joe_
[16:45:34] serviceops, Event-Platform Value Stream (Sprint 05): k8s deployment-charts mesh module should allow use of mesh without public_port Service - https://phabricator.wikimedia.org/T326252 (Ottomata)
[16:45:38] *time turner
[16:45:51] <_joe_> it's a HP thing isn't it?
[16:46:33] <_joe_> I watched them all dubbed in italian; by the time Alice was able to understand the movies in english I could get away with not rewatching them
[16:46:41] yeah and also jk rawling is trash.
[16:46:56] _joe_: do I replace the existing modules in modules.json with new versions, or do I add new module entries
[16:47:08] e.g. do I add a new {
[16:47:08] "name": "deployment",
[16:47:08] "version": "1.1",
[16:47:08] ...
[16:47:09] }
[16:47:21] or just bump the version in the existing one?
[16:47:29] <_joe_> ottomata: you should not bump a minor, you're only adding a new switch
[16:47:43] <_joe_> but if you do, just add it
[16:47:45] minor is for new feature, no? patch is for bugfixes etc?
[16:47:58] <_joe_> right yes, so this is indeed more of a feature
[16:48:37] <_joe_> and it could be disruptive, potentially
[16:48:45] k i've added, but sextant didn't seem to find 1.1.0
[16:48:48] well it's backwards compatible :)
[16:48:56] as long as public_port is defined everywhere, which ...it should be?
[16:49:06] <_joe_> who knows!
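Editor's note: the minor-versus-patch exchange at 16:47:29 to 16:47:58 follows the usual semantic-versioning convention, where a backwards-compatible feature bumps the minor number and a bugfix bumps only the patch number. A minimal sketch, assuming plain `MAJOR.MINOR.PATCH` strings; the helper `bump` is hypothetical and not part of sextant.

```python
def bump(version: str, kind: str) -> str:
    """Return the next version string for a 'major', 'minor' or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if kind == "major":
        # Breaking change: reset minor and patch.
        return f"{major + 1}.0.0"
    if kind == "minor":
        # Backwards-compatible feature: reset patch.
        return f"{major}.{minor + 1}.0"
    if kind == "patch":
        # Bugfix only.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump kind: {kind}")
```

Under this convention, the new switch discussed here takes a module from 1.0.0 to 1.1.0, while a pure bugfix would give 1.0.1.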
[16:49:17] <_joe_> so, you should also update package.json in the chart
[16:49:26] <_joe_> it still points to 1.0 I guess?
[16:49:30] oh package.json, right
[16:49:30] ka
[16:50:25] <_joe_> in theory, you should be able to do $ sextant update _scaffold mesh.service
[16:50:31] <_joe_> but... try it :P
[16:51:24] <_joe_> I have a working implementation on my computer, but it needs a couple finishing touches
[16:51:57] hm, _joe_ i think i need to make a new name_1.1.0 after all, just to fix dependency issues
[16:52:03] <_joe_> basically it would go around a charts tree and find all charts using that module, and update the package.json and then vendor dependencies
[16:52:04] Module mesh.configuration:1.0.0 (required by mesh.name:1.0.0) is incompatible with module mesh.configuration:1.1.0
[16:52:13] <_joe_> ah uh, yes
[16:52:19] k
[16:52:48] <_joe_> this isn't great for code review, heh
[16:53:09] _joe_: reminds me of our versioned event schema repos :p
[16:53:32] <_joe_> yeah whatever way you go, it's gonna suck
[16:53:34] <_joe_> but you know
[16:53:39] except instead of Jsonschema $ref pointers we've got tpl defines
[16:53:41] <_joe_> we can just download the patch and use diff
[16:54:08] hehe, at least with our eventschemas we only have one file (current.yaml) to edit, and the rest is 'generated' from that? :p
[16:54:10] but yeah.
[16:54:41] <_joe_> I'm open to ideas to improve upon this btw
[16:54:59] <_joe_> this is all an attempt to fit a square in a round peg
[16:55:08] <_joe_> use a templating language like it was software libraries
[16:55:13] which is the hole and which is the peg ? :p
[16:55:31] <_joe_> and the square is fighting back :P
[16:56:45] ooo, we got a circular dependency tho
[16:56:52] Module mesh.name:1.0.0 (required by base.meta:1.0.0) is incompatible with module mesh.name:1.1.0
[16:56:58] any way we can avoid using mesh from base?
[16:57:47] <_joe_> uhm wait
[16:58:26] <_joe_> why are you updating mesh.name though?
[16:59:05] <_joe_> and yes, probably we can move some functions to avoid that dependency
[17:00:10] serviceops, DC-Ops, ops-eqiad, Patch-For-Review: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (Clement_Goubert) >>! In T326119#8499305, @gerritbot wrote: > Change 875360 **merged** by Clément Goubert: > %%%[operations/puppet@pr...
[17:00:14] <_joe_> also, sorry, I really GTG
[17:00:24] if i don't update mesh.name
[17:00:25] Module mesh.configuration:1.0.0 (required by mesh.name:1.0.0) is incompatible with module mesh.configuration:1.1.0
[17:00:35] okay no worries!
[17:03:11] <_joe_> if you have something, I can take a look
[17:03:27] k will push with name 1.1.0
[17:03:29] <_joe_> tomorrow morning I mean
[17:03:32] yaya
[17:03:34] ty
[17:04:31] serviceops, Data-Engineering-Radar, MW-on-K8s, Patch-For-Review: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (Clement_Goubert) PSP needs to be updated before we can deploy.
[21:15:41] serviceops, SRE, Thumbor, Thumbor Migration, User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (VirginiaPoundstone)
[21:59:46] serviceops, GitLab, serviceops-collab, Kubernetes: Trusted gitlab runner containers need access to staging k8s cluster - https://phabricator.wikimedia.org/T325385 (dancy) I verified today that trusted runners can now complete a network connection to kubestagemaster.svc.eqiad.wmnet:6443 so that pa...
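Editor's note: the "is incompatible with" errors quoted at 16:52:04, 16:56:52 and 17:00:25 are what one would expect if each module pins an exact version of its dependencies. The sketch below reproduces that behaviour under that assumption only. It is not sextant's actual implementation, it checks direct dependencies only, and `find_conflicts` and the shape of its inputs are hypothetical.

```python
from typing import Dict, List, Tuple

Module = Tuple[str, str]  # (name, version), e.g. ("mesh.name", "1.0.0")


def find_conflicts(requested: List[Module],
                   requires: Dict[Module, List[Module]]) -> List[str]:
    """Report dependencies whose pinned version differs from the requested one.

    Assumption: any version difference counts as incompatible.
    """
    # Version the chart itself asks for, keyed by module name.
    selected: Dict[str, str] = dict(requested)
    conflicts: List[str] = []
    for parent_name, parent_version in requested:
        for dep_name, dep_version in requires.get((parent_name, parent_version), []):
            chosen = selected.get(dep_name)
            if chosen is not None and chosen != dep_version:
                conflicts.append(
                    f"Module {dep_name}:{dep_version} "
                    f"(required by {parent_name}:{parent_version}) "
                    f"is incompatible with module {dep_name}:{chosen}"
                )
    return conflicts
```

With exact pins, bumping mesh.configuration to 1.1.0 forces a new mesh.name that depends on it, which in turn conflicts with whatever pins mesh.name:1.0.0. That is the cascade seen in the log.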