[06:43:33] 10serviceops, 10Release-Engineering-Team, 10Scap: scap sync failure - https://phabricator.wikimedia.org/T324023 (10Joe) >>! In T324023#8430333, @dancy wrote: >>>! In T324023#8428647, @Jdforrester-WMF wrote: >> Probably not a scap bug but a new server config issue, then. > > A little of both in the end. A d... [07:44:55] <_joe_> elukey: can I ask you to take a look and merge/deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/860830, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/860829, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/860715 and https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/860717 ? [07:45:36] <_joe_> btullis / ottomata: can I ask you to take a look and merge/deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/860518 ? [07:45:50] <_joe_> please all note that there might be some changes to secrets needed [07:46:56] <_joe_> happy to assist in case [07:47:53] _joe_ sure! The first two are a bit new to me, wrong list? [07:47:54] <_joe_> (yes that means mostly you have to change the "tls:" stanzas in puppet-private to "mesh:" [07:48:26] <_joe_> elukey: oh sorry I guess that's abstract wikipedia stuff not yours [07:48:28] <_joe_> sorry [07:48:33] <_joe_> then nevermind :) [07:48:37] ack :) [08:22:34] <_joe_> btullis, ottomata please see https://phabricator.wikimedia.org/T324074 [08:22:51] <_joe_> we need to re-deploy all eventstreams after fixing its issues. [08:52:25] 10serviceops, 10API Platform, 10Foundational Technology Requests, 10Image-Suggestions, and 4 others: Public-facing API for image suggestions data - https://phabricator.wikimedia.org/T306349 (10kostajh) >>! In T306349#8429727, @VirginiaPoundstone wrote: > @LGoto this has API Platform sign off (via Bill's co... [09:05:26] 10serviceops, 10API Platform, 10Foundational Technology Requests, 10Image-Suggestions, and 4 others: Public-facing API for image suggestions data - https://phabricator.wikimedia.org/T306349 (10kostajh) [09:15:22] 10serviceops, 10Release-Engineering-Team, 10Scap: scap sync failure - https://phabricator.wikimedia.org/T324023 (10taavi) [09:19:24] 10serviceops, 10Release-Engineering-Team, 10Scap: scap sync failure - https://phabricator.wikimedia.org/T324023 (10cmooney) For the record we've merged the patch now to ensure those switches are correctly sending back TTL exceeded messages for all protocols. Apologies for the confusion. Overall it's a quir... [09:26:36] 10serviceops, 10API Platform, 10Foundational Technology Requests, 10Image-Suggestions, and 4 others: Public-facing API for image suggestions data - https://phabricator.wikimedia.org/T306349 (10Joe) Hi everyone, as I understand it, the public API for this service isn't just using the service itself, but rat... [09:36:31] I'm back with more noob questions about k8s! I failed over statsd to graphite1005 - though I'm still seeing traffic to graphite1004 from a bunch of addresses belonging to k8s pods, e.g. kubernetes-pod-10-64-66-64.eqiad.wmnet or kubernetes-pod-10-192-67-129.codfw.wmnet, how do I go about identifying those and roll-restarting what's running there? [09:37:36] _joe_: ^ sorry for the ping [09:38:13] <_joe_> godog: I assume it's the damn mediawiki deployments [09:38:50] <_joe_> can this wait ~ 30 minutes? [09:39:15] I think so yeah, it is only debug and non-prod traffic on mw-k8s now?
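For godog's question above about tracing those kubernetes-pod-* addresses back to a workload, a rough sketch of one way to do it, assuming a kubeconfig with cluster-wide read access; nothing here is a documented WMF procedure, just plain kubectl:

```
# Map a kubernetes-pod-* address back to a pod and its namespace.
# kubernetes-pod-10-64-66-64.eqiad.wmnet corresponds to pod IP 10.64.66.64.
IP=10.64.66.64
kubectl get pods --all-namespaces -o wide | grep -w "$IP"
# The NAMESPACE column identifies which service owns the pod; the roll-restart
# is then done per release (a normal helmfile rolling restart of that service),
# not per individual pod.
```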
[09:39:43] <_joe_> yes [09:39:44] I'll look around in the meantime [09:40:02] <_joe_> sadly I can't think of a way to go pod IP => deployment [09:40:35] <_joe_> godog: when did you merge your change to add graphite1005? [09:40:43] <_joe_> the mw one I mean [09:41:25] _joe_: ~ 9:19 UTC I ran scap backport [09:41:49] <_joe_> ah, today? [09:42:03] <_joe_> then yes virtually every deployment minus mw-debug will need to be performed manually [09:42:18] <_joe_> this should change today or tomorrow AIUI [09:43:16] _joe_: Ack to both the eventgate and eventstreams tickets. I'll look at the eventgate CR now. How would you like to go about handling the eventstreams one? [09:43:17] ah ok got it [09:43:33] <_joe_> btullis: so we have two ways [09:43:56] <_joe_> 1) we just hardcode chart: eventstreams-0.5.0 in the deployment matchLabels [09:44:04] <_joe_> and we go on with our lives [09:44:24] <_joe_> 2) we remove that part of matchLabels completely, which is how things should be, and destroy/recreate the deployments [09:44:31] <_joe_> like claime did last time for you [09:44:42] <_joe_> if I have to deal with it, I'll surely go with 1) [09:47:14] I prefer 2) to keep things neat and tidy. I'll look into the steps that you've outlined on the ticket, but would appreciate a glance over my shoulder from someone if possible. [09:52:25] godog: Want me to go and update the mw-on-k8s deployments? [09:53:07] claime: sure! thank you, I'm sure _joe_ will appreciate it too [09:53:12] <_joe_> claime: <3 [09:53:20] on it [09:53:26] or I'm happy to copy/pasta commands and pretend to know what I'm doing [09:53:31] <_joe_> btullis: sure, happy to help [09:56:14] godog: Since scap now builds the image, iirc I just need to helmfile apply for every deployment. [09:56:37] hah! nice [09:58:46] <_joe_> claime: correct [10:10:54] I think we're all good, this is the last packet I got from k8s on graphite1004 [10:10:57] 10:09:42.245041 IP kubernetes-pod-10-64-68-205.eqiad.wmnet.45348 > graphite1004.eqiad.wmnet.8125: UDP, length 201 [10:11:16] the rest is traffic from mwmaint but that's expected from long-running maint jobs [10:12:00] thank you claime _joe_ ! [10:12:05] taking a break, bbiab [10:15:30] godog: Yep, all done on my end [10:16:19] btullis: Do you need assistance with the eventgate/eventstream redeploy, or is j.oe assisting you? [10:17:23] <_joe_> claime: eventstreams need to do the dance we did last time [10:17:30] Yep, I figured [10:17:41] <_joe_> or the dirty hack I proposed :P [10:28:31] claime: I would appreciate your assistance please. Just preparing the CR now. Then I'll practice on staging, but presumably I can't depool that? [10:29:12] btullis: Yeah, you can't depool staging [10:29:43] Gimme a sec, I'll paste you my procedure from last time [10:32:07] Thanks. Here is the CR for eventstreams: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/862222 [10:35:49] btullis: https://phabricator.wikimedia.org/P41870 [10:42:29] claime: Perfect, thanks. So that `--selector name=production` will be required after the CR is merged, otherwise it wouldn't match. Is that right? [10:45:00] btullis: Actually you have canary releases right, so you'd first do destroy/apply with `--selector name=canary`, then with `--selector name=production` [10:45:49] Or if you don't care about breaking the service completely if something goes wrong, omit the selector and it will destroy/apply all releases in the environment specified by -e [10:54:40] Yes I see. Thanks. I will begin the dancing now.
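Since P41870 itself is not quoted in the log, here is a hedged sketch of the "dance" claime describes, one environment at a time; the helmfile.d path and the exact ordering are assumptions based on the discussion above, not a copy of the paste:

```
# Run on a deployment server; path assumed, adjust to the actual service dir.
cd /srv/deployment-charts/helmfile.d/services/eventstreams
# One environment at a time: canary release first, then production.
helmfile -e codfw --selector name=canary destroy
helmfile -e codfw --selector name=canary -i apply
helmfile -e codfw --selector name=production destroy
helmfile -e codfw --selector name=production -i apply
# Omitting --selector destroys/applies every release in the environment at once,
# which is only reasonable if the service is depooled in that datacenter first.
```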
[10:56:04] <_joe_> the important thing to do is to first depool the service in the datacenter you're destroying, then you can go with the brutal option, really [10:56:12] Yeah [10:59:31] OK, thanks both. [11:02:38] 10serviceops, 10Kubernetes: Possible improvements to kube_env - https://phabricator.wikimedia.org/T324091 (10fgiunchedi) [11:03:01] I left some feedback re: kube_env ^ let me know what you think! [11:03:10] here and/or on task either works [11:06:30] <_joe_> thanks godog it all seems sensible stuff :) [11:07:18] cheers [11:16:46] I'm going to be testing some SAL logging improvements for helmfile on a mw k8s deployment (mw-jobrunner), so disregard the possible noise [11:22:18] folks one qs - I have an error for k8s 1.16 in https://integration.wikimedia.org/ci/job/helm-lint/8542/consoleFull, PodDisruptionBudget has been added to k8s 1.21 so it makes sense that alerts. We use it elsewhere, I am wondering how to skip the validation for 1.16 (I thought to use if semverCompare ">=1.21-0" etc.. but of course it wasn't the right choice) [11:25:13] 10serviceops, 10API Platform, 10Foundational Technology Requests, 10Image-Suggestions, and 3 others: Public-facing API for image suggestions data - https://phabricator.wikimedia.org/T306349 (10akosiaris) >>! In T306349#8429727, @VirginiaPoundstone wrote: > @LGoto this has API Platform sign off (via Bill's... [11:38:36] _joe_: claime : I have completed eventstreams on staging and codfw. Proceeding to eqiad soon. Thanks again. [11:39:03] btullis: ack, did everything go ok? [11:39:23] <_joe_> btullis: <3 [11:42:33] So far, thanks. I am extremely grateful to whoever built this safety net into `confctl` to avoid accidentally depooling both data centres. I presume that's you _joe_:? [11:42:37] https://www.irccloud.com/pastebin/9QdjAz3E/ [11:43:19] <_joe_> btullis: yes, btw now there should be a cookbook that does all the needed safety checks for you [11:48:02] _joe_: Thanks. This one? `sre.discovery.service-route` [11:48:11] <_joe_> btullis: yes [11:48:37] Great, will read it and try it now. [11:56:53] _joe: I'll add some information on this cookbook to https://wikitech.wikimedia.org/wiki/DNS/Discovery#How_to_manage_a_DNS_Discovery_service if that's ok, or unless there is a better place for it [11:57:18] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Helmfile !log messages do not indicate failed deployments - https://phabricator.wikimedia.org/T303900 (10Clement_Goubert) 05Open→03In progress [11:57:28] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Helmfile !log messages do not indicate failed deployments - https://phabricator.wikimedia.org/T303900 (10Clement_Goubert) a:03Clement_Goubert [12:08:23] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Helmfile !log messages do not indicate failed deployments - https://phabricator.wikimedia.org/T303900 (10Clement_Goubert) In order to log error, we need to move on from `prepare` and `cleanup` global hooks (scoped to the helmfile) to release hooks, because... [12:08:40] ^ This is a quick-win to have better helmfile log messages in SAL btw [12:25:44] claime: _joe_: Error applying eventstreams in eqiad `Error: release production failed, and has been uninstalled due to atomic being set: Service "eventstreams-production-tls-service" is invalid: spec.ports[0].nodePort: Invalid value: 4892: provided port is already allocated` [12:26:21] The canary was deployed, but production wasn't. [12:27:49] I tried it a second time with `helmfile -e eqiad --selector name=production -i apply` and it worked. 
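The `nodePort ... already allocated` error above is the kind of failure you get when the re-apply races the teardown of the release that was just destroyed. A rough check before retrying; the service name is taken from the error message, and `kube_env` is the deploy-host helper discussed in T324091:

```
# Point kubectl at the eventstreams namespace on the eqiad cluster.
kube_env eventstreams eqiad
# The old NodePort service from the destroyed release must be gone first:
kubectl get svc eventstreams-production-tls-service
# once this returns NotFound, re-run the apply:
helmfile -e eqiad --selector name=production -i apply
```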
[12:28:11] https://www.irccloud.com/pastebin/pcNHHyVR/ [12:48:18] <_joe_> btullis: uhm [12:50:04] <_joe_> it's strange, the service is up and running only since you deployed to production [13:01:57] That's when you go too fast with destroy/apply [13:02:12] It hasn't finished tearing down everything in kubernetes when you try to recreate [13:02:26] And helmfile has no way to know since you destroyed the deployment [13:02:49] btullis: ^ [13:03:13] Ah, many thanks. Will be more patient next time. [14:17:37] Self-merging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/862232 to test the hooks [14:18:04] (just changing the logging hooks for an unused service to get better SAL logging) [14:22:52] <_joe_> claime: have you checked with jelto? [14:23:07] <_joe_> there were a few consideration on why we were using those hooks and not others [14:23:21] <_joe_> one was to avoid logging like 2 times per canary, 2 per main [14:23:47] <_joe_> presync and postsync will log twice I think [14:23:52] <_joe_> jelto: ^^ [14:23:54] Right, but that's at the cost of not logging errors [14:24:02] Which IMO is bad [14:24:48] But I can do it differently, and only make postsync log in case of an error [14:24:55] And keep everything else just the same [14:25:15] <_joe_> or we can write a wrapper around helmfile that does all this shit much better, just saying [14:25:34] The issue with that is that in case of error, you'll get one log saying "FAIL" and one saying "DONE" which is also confusing [14:26:28] Yes, sure, but if I can get away with not creating another layer... [14:26:34] (re: wrapper) [14:30:42] <_joe_> I just spent the day wanting one :P [14:34:42] So it logs start and finish for each release, which is what I expected [14:35:08] Given that we usually roll canaries before main/production, it makes sense to log that way imo [14:38:42] 10serviceops, 10Prod-Kubernetes: Helmfile !log messages do not indicate failed deployments - https://phabricator.wikimedia.org/T303900 (10Clement_Goubert) That makes helmfile log for each release: ` 14:31 <+logmsgbot> !log cgoubert@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sy... [14:43:40] 10serviceops, 10Prod-Kubernetes: Helmfile !log messages do not indicate failed deployments - https://phabricator.wikimedia.org/T303900 (10Clement_Goubert) Other possible approaches are: - Only calling the `postsync` hook and passing an argument to `helmfile_log_sal` so it logs only if there is an error, leavin... [14:48:23] question: restbase-dev in eqiad needs to be decommissioned, we have 3 new machines to take its place, but they are in codfw. We use the cluster for sessionstore and echostore staging; Will it be a problem to use a database cluster in the codfw for (staging) services running in eqiad? [14:49:12] Fun fact, helmfile linter does not catch go templating errors in hooks [15:05:08] 10serviceops, 10Observability-Tracing: Helmchart for OpenTelemetry Collector - https://phabricator.wikimedia.org/T324117 (10Clement_Goubert) [15:24:52] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Daimona) @Clement_Goubert Hi, and thank you for the deployment :) The extension is now enabled on... 
[15:37:11] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Clement_Goubert) @Daimona The change is in review, I will deploy it when it's been +1'd. Thanks fo... [15:54:27] 10serviceops, 10SRE, 10ops-codfw: codfw: ManagementSSHDown for ores2009 and thumbor2004 - https://phabricator.wikimedia.org/T323925 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8b8e8a4d-71f2-462d-8e1f-ff904f7e3ed4) set by akosiaris@cumin1001 for 1:00:00 on 1 host(s) and their services... [15:56:55] 10serviceops, 10Prod-Kubernetes: Helmfile !log messages do not indicate failed deployments - https://phabricator.wikimedia.org/T303900 (10Clement_Goubert) p:05Triage→03Low [15:58:10] 10serviceops, 10Observability-Tracing: Helmchart for OpenTelemetry Collector - https://phabricator.wikimedia.org/T324117 (10Clement_Goubert) p:05Triage→03Medium [15:58:17] 10serviceops, 10Observability-Tracing: OpenTelemetry Collector running as a DaemonSet on Wikikube - https://phabricator.wikimedia.org/T320564 (10Clement_Goubert) p:05Triage→03Medium [15:58:19] 10serviceops, 10Observability-Tracing, 10Patch-For-Review: Package OpenTelemetry Collector atop our own base Docker images - https://phabricator.wikimedia.org/T320552 (10Clement_Goubert) 05Open→03In progress p:05Triage→03High [15:58:29] 10serviceops, 10SRE, 10ops-codfw: codfw: ManagementSSHDown for ores2009 and thumbor2004 - https://phabricator.wikimedia.org/T323925 (10klausman) ores2009 is shutting down & powering off now [15:59:02] 10serviceops, 10Release-Engineering-Team, 10Scap: scap sync failure - https://phabricator.wikimedia.org/T324023 (10dancy) 05Open→03Resolved a:03dancy Noting for the record that @Joe made the fix via https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/36 and scap 4.29.3 was deployed yesterda... [16:18:14] 10serviceops, 10GitLab (CI & Job Runners), 10Release-Engineering-Team (Priority Backlog 📥): Upgrade jwt-authorizer on all registry hosts - https://phabricator.wikimedia.org/T324037 (10brennen) [16:23:41] 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Priority Backlog 📥): Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push - https://phabricator.wikimedia.org/T322453 (10dduvall) [16:26:07] 10serviceops, 10Content-Transform-Team-WIP, 10Maps: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 (10Jgiannelos) [16:32:01] self-answering myself - I get "PodDisruptionBudget webhook-pdb failed validation: could not find schema for PodDisruptionBudget" from CI (validation step) because the apiVersion is policy/v1, meanwhile if I set it to policy/v1beta1 (we have some calico resources with that) it doesn't error out [16:37:45] do I need to add the schema by any chance? 
I checked the jsonschema dir in deployment-charts but it is not clear to me what I should do [16:46:13] TIL https://gitlab.wikimedia.org/repos/sre/kubernetes-json-schema [16:50:55] in there I see something like ./v1.23.6/poddisruptionbudget-policy-v1.json [16:51:50] ah ok right, self-self-answer - the error msg from CI explicitly states k8s v1.16.15 [16:52:51] and in fact [16:52:52] ./v1.16.15-standalone-strict/poddisruptionbudget-policy-v1beta1.json [16:53:11] so I am the first one that needs to skip one of the k8s versions probably [17:29:01] 10serviceops, 10SRE, 10ops-codfw: codfw: ManagementSSHDown for ores2009 and thumbor2004 - https://phabricator.wikimedia.org/T323925 (10Papaul) thumbor2004 had a broken iDRAC card. I replaced it. [18:41:03] 10serviceops, 10SRE, 10ops-codfw: codfw: ManagementSSHDown for ores2009 and thumbor2004 - https://phabricator.wikimedia.org/T323925 (10Papaul) 05Open→03Resolved ores2009 mgmt is back up [22:31:51] 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Priority Backlog 📥): Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push - https://phabricator.wikimedia.org/T322453 (10Dzahn) [22:41:18] 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Priority Backlog 📥): Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push - https://phabricator.wikimedia.org/T322453 (10Dzahn) jwt-authorizer has been upgraded to 1.1.0 on the 4 registry hosts. P... [23:44:51] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10serviceops-collab, 10Puppet: Create a puppet define for systemd timers - https://phabricator.wikimedia.org/T111031 (10Dzahn) [23:45:27] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10serviceops-collab, 10Puppet: Create a puppet define for systemd timers - https://phabricator.wikimedia.org/T111031 (10Dzahn) Nowadays in 2022 we have one, it's called `systemd::timer::job`, was created by @joe in https://gerrit.wikimedia.org/r/c/operat... [23:46:05] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10serviceops-collab, 10Puppet: Create a puppet define for systemd timers - https://phabricator.wikimedia.org/T111031 (10Dzahn) 05Open→03Resolved a:03Dzahn closing out old SRE tickets. seems resolved to me. please reopen if you disagree.
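Closing the loop on the PodDisruptionBudget validation question: the schema repository linked above shows directly which apiVersions each Kubernetes version can be validated against. A small sketch, using the paths quoted in the log:

```
git clone https://gitlab.wikimedia.org/repos/sre/kubernetes-json-schema
# 1.16 only ships a v1beta1 schema for PDBs:
ls kubernetes-json-schema/v1.16.15-standalone-strict/ | grep poddisruptionbudget
# -> poddisruptionbudget-policy-v1beta1.json
# while newer versions also carry the policy/v1 schema, e.g.:
ls kubernetes-json-schema/v1.23.6/poddisruptionbudget-policy-v1.json
# So a policy/v1 PDB cannot be validated against 1.16.15: either the chart keeps
# emitting policy/v1beta1, or the 1.16 validation run has to be skipped for it.
```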