[07:11:37] good morning folks
[07:11:53] I created https://gerrit.wikimedia.org/r/c/operations/puppet/+/769616/ (and the next one) for kubernetes2009/2010, so I'll reimage them today
[07:12:34] one thing that I am wondering is when/if we should rebalance the pods in the codfw cluster, maybe after x rounds of node drain actions
[08:11:40] <_joe_> elukey: in theory it should rebalance over the mid term
[08:11:52] <_joe_> but let's keep an eye on that, yes
[08:12:05] <_joe_> actually we probably need some heatmap in a dashboard
[08:54:49] It tends to rebalance quickly on its own indeed, due to the large pool of servers and frequent deployments
[08:55:17] We can create a heatmap indeed, though
[08:58:11] ack, makes sense, so k8s itself will try to shift pods depending on an avg across the cluster? Or something else?
[09:26:01] akosiaris: do you prefer me to work on kubernetes10[18-22]? (the new eqiad nodes, I am doing some scheduling organization about when/what to reimage :D)
[09:29:48] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Move kubernetes workers to bullseye and docker to overlayfs - https://phabricator.wikimedia.org/T300744 (10elukey)
[09:30:58] elukey: k8s will not self-rebalance stuff. But after all the reimaging we should ultimately only end up with one "empty" node. I'd say that does not need manual intervention, especially because we're going to decom 2001-2004 at some point (which will shift pods around anyways)
[09:31:52] it might make sense to roll-restart the sessionstore deployment after reimage of the ganeti-vm nodes, though (as there are only 4 of them per cluster and sessionstore is not deployed very often)
[09:33:00] jayme: thanks, I misread what Alex wrote, I now get the whole sense of his sentence. I was just worried that some k8s nodes would become overloaded due to an aggressive reimage schedule
[09:34:20] nah. We're super fine there with just one node down at a time
[09:54:07] elukey: +1 on kubernetes2009-kubernetes2010 btw
[09:55:06] elukey: also, k8s workers can't easily get overloaded in our setup right now. the node pressure threshold will tend to evict pods to other nodes if that happens: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
[09:55:41] it does rely on setting up memory and cpu limits for pods ofc, which we've gone to quite a bit of trouble to make sure we have for almost everything
[10:03:13] ahh nice
[10:05:15] 2009 done, proceeding with 2010
[10:05:42] based on my calculations, by the end of March we should have (close to) all worker nodes on bullseye
[10:17:28] 10serviceops, 10Data-Catalog, 10Data-Engineering, 10SRE, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10akosiaris) >>! In T303049#7753274, @BTullis wrote: > How can I tell what the source IP address(es) of my services will be, as seen by the bac...
[10:20:55] 10serviceops, 10Data-Catalog, 10Data-Engineering, 10SRE, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) Great. Thanks both. I'm now working through the first set of comments left by @JMeybohm on the patch, trying to make it use the scaf...
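For the pods-per-node heatmap and rebalancing question discussed above, a minimal sketch of how the pod spread could be checked, assuming kubectl access to the cluster and that kube-state-metrics is scraped by Prometheus; the node name below is illustrative:

    # Pods currently scheduled on a single worker (node name is an example):
    kubectl get pods --all-namespaces -o wide --no-headers \
      --field-selector spec.nodeName=kubernetes2009.codfw.wmnet | wc -l

    # Pod count per node across the cluster (NODE is column 8 of the wide output):
    kubectl get pods --all-namespaces -o wide --no-headers \
      | awk '{print $8}' | sort | uniq -c | sort -rn

    # For a Grafana heatmap, a PromQL query along the lines of
    #   count by (node) (kube_pod_info)
    # would give the same per-node distribution over time.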
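The per-node drain and the suggested sessionstore roll-restart map onto standard kubectl operations roughly as below; this is only a sketch (the actual reimages here go through the usual reimage automation), and the node name, namespace and deployment name are assumptions:

    # Cordon and drain a worker before reimaging it; on older kubectl versions
    # --delete-emptydir-data is spelled --delete-local-data.
    kubectl drain kubernetes2010.codfw.wmnet --ignore-daemonsets --delete-emptydir-data

    # ...reimage to bullseye, then let the node take pods again...
    kubectl uncordon kubernetes2010.codfw.wmnet

    # sessionstore is deployed rarely, so after the (only 4) ganeti-vm workers are
    # reimaged a roll-restart respreads its pods (namespace and name assumed here):
    kubectl -n sessionstore rollout restart deployment sessionstore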
[10:30:23] 10serviceops, 10Data-Catalog, 10Data-Engineering, 10SRE, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10akosiaris) As far as I am concerned, this service request LGTM. Thanks for the very detailed diagram (including a link to the source), repos...
[10:44:01] aaand 2010 done as well
[10:47:54] 10serviceops, 10Data-Catalog, 10Data-Engineering, 10SRE, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) I'd like to add the proposal of using Ingress (T290966) for the frontend (to not have to configure LVS for that). For the consumers...
[11:06:08] 10serviceops, 10Data-Catalog, 10Data-Engineering, 10SRE, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) > I'd like to add the proposal of using Ingress (T290966) for the frontend (to not have to configure LVS for that). Sounds good to m...
[11:08:21] 10serviceops, 10Data-Catalog, 10Data-Engineering, 10SRE, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) > Per my understanding the service will reside in the wikikube cluster for the MVP phase, despite being a bad fit for it per https://...
[11:53:09] 10serviceops, 10Data-Catalog, 10Data-Engineering, 10SRE, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) >>! In T303049#7766702, @BTullis wrote: >> I'd like to add the proposal of using Ingress (T290966) for the frontend (to not have to...
[12:09:32] 10serviceops, 10Data-Catalog, 10Data-Engineering, 10SRE, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) > Sorry, totally my fault! I meant the GMS, not consumer. From what you wrote in T301454#7741876 it sounds like you just don't want...
[12:14:51] 10serviceops, 10Data-Catalog, 10Data-Engineering, 10SRE, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) >>! In T303049#7766821, @BTullis wrote: > Yes, that's right. Great! >>! In T303049#7766821, @BTullis wrote: > So I'll change the `...
[12:42:00] 10serviceops, 10Data-Catalog, 10Data-Engineering, 10SRE, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) > Actually I was just referring to the diagram, as it mentions specific ports and I wanted to make sure that's not a fixed requirem...
[15:17:02] https://github.com/istio/istio/issues/23802#issuecomment-628035658
[15:17:08] I love CNI and istio
[15:52:42] lol
[17:10:49] 10serviceops, 10SRE, 10Traffic, 10envoy, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10RLazarus)
[18:06:40] 10serviceops, 10envoy, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: Using port in Host header for thanos-swift / thanos-query breaks vhost selection - https://phabricator.wikimedia.org/T300119 (10RLazarus) I just upgraded thanos-fe to envoy 1.18.3, but out of the box I see...
[18:09:17] 10serviceops, 10envoy, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: Using port in Host header for thanos-swift / thanos-query breaks vhost selection - https://phabricator.wikimedia.org/T300119 (10RLazarus) Oh, yep, it's strip_matching_host_port in the HTTP connection manag...
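On the thanos-fe Host-header issue above: strip_matching_host_port is a flag on envoy's HTTP connection manager that, when true, drops the port from the Host/authority header if it matches the listener's own port, so "host:443" then matches a plain "host" virtual host domain. One way to check whether a running instance has it set, assuming the envoy admin endpoint is reachable (9901 is a common default admin port and may differ locally):

    # Prints the value if strip_matching_host_port appears anywhere in the running config:
    curl -s http://localhost:9901/config_dump \
      | jq '.. | .strip_matching_host_port? // empty'

    # In the connection manager config it would look roughly like:
    #   http_connection_manager:
    #     strip_matching_host_port: true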
[18:12:38] 10serviceops, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review, 10Sustainability (Incident Followup): eventgate-* tls telemetry is disabled - https://phabricator.wikimedia.org/T303042 (10odimitrijevic) Updating to the latest helm chart template would allow for the settings to be picked up automat...
[18:14:33] 10serviceops, 10SRE, 10good first task: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Ottomata)
[18:17:20] 10serviceops, 10SRE, 10Patch-For-Review, 10good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (10Ottomata) BTW, I made a specific task to track the work to make eventgate chart use common_templates: {T303543} cc @BTullis
[19:11:29] 10serviceops, 10Data-Engineering, 10Event-Platform, 10Sustainability (Incident Followup): eventgate-* tls telemetry is disabled - https://phabricator.wikimedia.org/T303042 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Change is applied and rolled out to all clusters. Data incoming.
[19:14:26] 10serviceops, 10Data-Engineering, 10Event-Platform, 10Sustainability (Incident Followup): eventgate-* tls telemetry is disabled - https://phabricator.wikimedia.org/T303042 (10Ottomata) Thank you
[20:36:01] 10serviceops, 10PHP 7.2 support, 10Patch-For-Review: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.4 - https://phabricator.wikimedia.org/T261872 (10Reedy)
[21:13:57] 10serviceops, 10Release-Engineering-Team, 10Scap: Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (10dancy)
[21:20:25] 10serviceops, 10Release-Engineering-Team, 10Scap: Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (10dancy)
[22:49:37] 10serviceops, 10Release-Engineering-Team, 10Scap: Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (10dancy)
[23:12:00] 10serviceops, 10SRE, 10Release-Engineering-Team (Doing): Reduce latency of new Scap releases - https://phabricator.wikimedia.org/T292646 (10dancy)
[23:13:25] 10serviceops, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (10dancy) Hi @MoritzMuehlenhoff and @Volans. Can you comment on how this part of the proposal could be achieved? ` Find a way to query the li...
[23:14:25] 10serviceops, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (10dancy) Hi @MoritzMuehlenhoff and @Volans. Can you comment on how this part of the proposal could be achieved? ` Find a way to query the li...
[23:14:50] 10serviceops, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (10dancy)
[23:14:55] 10serviceops, 10SRE, 10Release-Engineering-Team (Doing): Reduce latency of new Scap releases - https://phabricator.wikimedia.org/T292646 (10dancy)