[06:43:33] 10serviceops, 10Release-Engineering-Team, 10Scap: scap sync failure - https://phabricator.wikimedia.org/T324023 (10Joe) >>! In T324023#8430333, @dancy wrote: >>>! In T324023#8428647, @Jdforrester-WMF wrote: >> Probably not a scap bug but a new server config issue, then. > > A little of both in the end. A d... [07:44:55] <_joe_> elukey: can I ask you to take a look and merge/deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/860830, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/860829, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/860715 and https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/860717 ? [07:45:36] <_joe_> btullis / ottomata: can I ask you to take a look and merge/deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/860518 ? [07:45:50] <_joe_> please all note that there might be some changes to secrets needed [07:46:56] <_joe_> happy to assist in case [07:47:53] _joe_ sure! The first two are a bit new to me, wrong list? [07:47:54] <_joe_> (yes that means mostly you have to change the "tls:" stanzas in puppet-private to "mesh:" [07:48:26] <_joe_> elukey: oh sorry I guess that's abstract wikipedia stuff not yours [07:48:28] <_joe_> sorry [07:48:33] <_joe_> then nevermind :) [07:48:37] ack :) [08:22:34] <_joe_> btullis, ottomata please see https://phabricator.wikimedia.org/T324074 [08:22:51] <_joe_> we need to re-deploy all eventstreams after fixing its issues. [08:52:25] 10serviceops, 10API Platform, 10Foundational Technology Requests, 10Image-Suggestions, and 4 others: Public-facing API for image suggestions data - https://phabricator.wikimedia.org/T306349 (10kostajh) >>! In T306349#8429727, @VirginiaPoundstone wrote: > @LGoto this has API Platform sign off (via Bill's co... [09:05:26] 10serviceops, 10API Platform, 10Foundational Technology Requests, 10Image-Suggestions, and 4 others: Public-facing API for image suggestions data - https://phabricator.wikimedia.org/T306349 (10kostajh) [09:15:22] 10serviceops, 10Release-Engineering-Team, 10Scap: scap sync failure - https://phabricator.wikimedia.org/T324023 (10taavi) [09:19:24] 10serviceops, 10Release-Engineering-Team, 10Scap: scap sync failure - https://phabricator.wikimedia.org/T324023 (10cmooney) For the record we've merged the patch now to ensure those switches are correctly sending back TTL exceeded messages for all protocols. Apologies for the confusion. Overall it's a quir... [09:26:36] 10serviceops, 10API Platform, 10Foundational Technology Requests, 10Image-Suggestions, and 4 others: Public-facing API for image suggestions data - https://phabricator.wikimedia.org/T306349 (10Joe) Hi everyone, as I understand it, the public API for this service isn't just using the service itself, but rat... [09:36:31] I'm back with more noob questions about k8s! I failed over statsd to graphite1005 - though I'm still seeing traffic to graphite1004 from a bunch of addresses belonging to k8s pods, e.g. kubernetes-pod-10-64-66-64.eqiad.wmnet or kubernetes-pod-10-192-67-129.codfw.wmnet, how do I go about identifying those and roll-restarting what's running there? [09:37:36] _joe_: ^ sorry for the ping [09:38:13] <_joe_> godog: I assume it's the damn mediawiki deployments [09:38:50] <_joe_> can this wait ~ 30 minutes? [09:39:15] I think so yeah, it is only debug and non-prod traffic on mw-k8s now?
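For godog's question above about tracing those kubernetes-pod-* addresses back to a workload, a rough sketch of one way to do it, assuming a kubeconfig with cluster-wide read access; nothing here is a documented WMF procedure, just plain kubectl:

```
# Map a kubernetes-pod-* address back to a pod and its namespace.
# kubernetes-pod-10-64-66-64.eqiad.wmnet corresponds to pod IP 10.64.66.64.
IP=10.64.66.64
kubectl get pods --all-namespaces -o wide | grep -w "$IP"
# The NAMESPACE column identifies which service owns the pod; the roll-restart
# is then done per release (a normal helmfile rolling restart of that service),
# not per individual pod.
```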
[09:39:43] <_joe_> yes [09:39:44] I'll look around in the meantime [09:40:02] <_joe_> sadly I can't think of a way to go pod IP => deployment [09:40:35] <_joe_> godog: when did you merge your change to add graphite1005? [09:40:43] <_joe_> the mw one I mean [09:41:25] _joe_: ~ 9:19 UTC I ran scap backport [09:41:49] <_joe_> ah, today? [09:42:03] <_joe_> then yes virtually every deployment minus mw-debug will need to be performed manually [09:42:18] <_joe_> this should change today or tomorrow AIUI [09:43:16] _joe_: Ack to both the eventgate and eventstreams tickets. I'll look at the eventgate CR now. How would you like to go about handling the eventstreams one? [09:43:17] ah ok got it [09:43:33] <_joe_> btullis: so we have two ways [09:43:56] <_joe_> 1) we just hardcode chart: eventstreams-0.5.0 in the deployment matchLabels [09:44:04] <_joe_> and we go on with our lives [09:44:24] <_joe_> 2) we remove that part of matchLabels completely, which is how things should be, and destroy/recreate the deployments [09:44:31] <_joe_> like claime did last time for you [09:44:42] <_joe_> if I have to deal with it, I'll surely go with 1) [09:47:14] I prefer 2) to keep things neat and tidy. I'll look into the steps that you've outlined on the ticket, but would appreciate a glance over my shoulder from someone if possible. [09:52:25] godog: Want me to go and update the mw-on-k8s deployments? [09:53:07] claime: sure! thank you, I'm sure _joe_ will appreciate it too [09:53:12] <_joe_> claime: <3 [09:53:20] on it [09:53:26] or I'm happy to copy/pasta commands and pretend to know what I'm doing [09:53:31] <_joe_> btullis: sure, happy to help [09:56:14] godog: Since scap now builds the image, iirc I just need to helmfile apply for every deployment. [09:56:37] hah! nice [09:58:46] <_joe_> claime: correct [10:10:54] I think we're all good, this is the last packet I got from k8s on graphite1004 [10:10:57] 10:09:42.245041 IP kubernetes-pod-10-64-68-205.eqiad.wmnet.45348 > graphite1004.eqiad.wmnet.8125: UDP, length 201 [10:11:16] the rest is traffic from mwmaint but that's expected from long-running maint jobs [10:12:00] thank you claime _joe_ ! [10:12:05] taking a break, bbiab [10:15:30] godog: Yep, all done on my end [10:16:19] btullis: Do you need assistance with the eventgate/eventstream redeploy, or is j.oe assisting you? [10:17:23] <_joe_> claime: eventstreams need to do the dance we did last time [10:17:30] Yep, I figured [10:17:41] <_joe_> or the dirty hack I proposed :P [10:28:31] claime: I would appreciate your assistance please. Just preparing the CR now. Then I'll practice on staging, but presumably I can't depool that? [10:29:12] btullis: Yeah, you can't depool staging [10:29:43] Gimme a sec, I'll paste you my procedure from last time [10:32:07] Thanks. Here is the CR for eventstreams: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/862222 [10:35:49] btullis: https://phabricator.wikimedia.org/P41870 [10:42:29] claime: Perfect, thanks. So that `--selector name=production` will be required after the CR is merged, otherwise it wouldn't match. Is that right? [10:45:00] btullis: Actually you have canary releases right, so you'd first do destroy/apply with `--selector name=canary`, then with `--selector name=production` [10:45:49] Or if you don't care about breaking the service completely if something goes wrong, omit the selector and it will destroy/apply all releases in the environment specified by -e [10:54:40] Yes I see. Thanks. I will begin the dancing now.
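Since P41870 itself is not quoted in the log, here is a hedged sketch of the "dance" claime describes, one environment at a time; the helmfile.d path and the exact ordering are assumptions based on the discussion above, not a copy of the paste:

```
# Run on a deployment server; path assumed, adjust to the actual service dir.
cd /srv/deployment-charts/helmfile.d/services/eventstreams
# One environment at a time: canary release first, then production.
helmfile -e codfw --selector name=canary destroy
helmfile -e codfw --selector name=canary -i apply
helmfile -e codfw --selector name=production destroy
helmfile -e codfw --selector name=production -i apply
# Omitting --selector destroys/applies every release in the environment at once,
# which is only reasonable if the service is depooled in that datacenter first.
```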
[10:56:04] <_joe_> the important thing to do is to first depool the service in the datacenter you're destroying, then you can go with the brutal option, really [10:56:12] Yeah [10:59:31] OK, thanks both. [11:02:38] 10serviceops, 10Kubernetes: Possible improvements to kube_env - https://phabricator.wikimedia.org/T324091 (10fgiunchedi) [11:03:01] I left some feedback re: kube_env ^ let me know what you think! [11:03:10] here and/or on task either works [11:06:30] <_joe_> thanks godog it all seems sensible stuff :) [11:07:18] cheers [11:16:46] I'm going to be testing some SAL logging improvements for helmfile on a mw k8s deployment (mw-jobrunner), so disregard the possible noise [11:22:18] folks one qs - I have an error for k8s 1.16 in https://integration.wikimedia.org/ci/job/helm-lint/8542/consoleFull, PodDisruptionBudget has been added to k8s 1.21 so it makes sense that alerts. We use it elsewhere, I am wondering how to skip the validation for 1.16 (I thought to use if semverCompare ">=1.21-0" etc.. but of course it wasn't the right choice) [11:25:13] 10serviceops, 10API Platform, 10Foundational Technology Requests, 10Image-Suggestions, and 3 others: Public-facing API for image suggestions data - https://phabricator.wikimedia.org/T306349 (10akosiaris) >>! In T306349#8429727, @VirginiaPoundstone wrote: > @LGoto this has API Platform sign off (via Bill's... [11:38:36] _joe_: claime : I have completed eventstreams on staging and codfw. Proceeding to eqiad soon. Thanks again. [11:39:03] btullis: ack, did everything go ok? [11:39:23] <_joe_> btullis: <3 [11:42:33] So far, thanks. I am extremely grateful to whoever built this safety net into `confctl` to avoid accidentally depooling both data centres. I presume that's you _joe_:? [11:42:37] https://www.irccloud.com/pastebin/9QdjAz3E/ [11:43:19] <_joe_> btullis: yes, btw now there should be a cookbook that does all the needed safety checks for you [11:48:02] _joe_: Thanks. This one? `sre.discovery.service-route` [11:48:11] <_joe_> btullis: yes [11:48:37] Great, will read it and try it now. [11:56:53] _joe: I'll add some information on this cookbook to https://wikitech.wikimedia.org/wiki/DNS/Discovery#How_to_manage_a_DNS_Discovery_service if that's ok, or unless there is a better place for it [11:57:18] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Helmfile !log messages do not indicate failed deployments - https://phabricator.wikimedia.org/T303900 (10Clement_Goubert) 05Open→03In progress [11:57:28] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Helmfile !log messages do not indicate failed deployments - https://phabricator.wikimedia.org/T303900 (10Clement_Goubert) a:03Clement_Goubert [12:08:23] 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: Helmfile !log messages do not indicate failed deployments - https://phabricator.wikimedia.org/T303900 (10Clement_Goubert) In order to log error, we need to move on from `prepare` and `cleanup` global hooks (scoped to the helmfile) to release hooks, because... [12:08:40] ^ This is a quick-win to have better helmfile log messages in SAL btw [12:25:44] claime: _joe_: Error applying eventstreams in eqiad `Error: release production failed, and has been uninstalled due to atomic being set: Service "eventstreams-production-tls-service" is invalid: spec.ports[0].nodePort: Invalid value: 4892: provided port is already allocated` [12:26:21] The canary was deployed, but production wasn't. [12:27:49] I tried it a second time with `helmfile -e eqiad --selector name=production -i apply` and it worked. 
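The `nodePort ... already allocated` error above is the kind of failure you get when the re-apply races the teardown of the release that was just destroyed. A rough check before retrying; the service name is taken from the error message, and `kube_env` is the deploy-host helper discussed in T324091:

```
# Point kubectl at the eventstreams namespace on the eqiad cluster.
kube_env eventstreams eqiad
# The old NodePort service from the destroyed release must be gone first:
kubectl get svc eventstreams-production-tls-service
# once this returns NotFound, re-run the apply:
helmfile -e eqiad --selector name=production -i apply
```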
[12:28:11] https://www.irccloud.com/pastebin/pcNHHyVR/ [12:48:18] <_joe_> btullis: uhm [12:50:04] <_joe_> it's strange, the service is up and running only since you deployed to production [13:01:57] That's when you go too fast with destroy/apply [13:02:12] It hasn't finished tearing down everything in kubernetes when you try to recreate [13:02:26] And helmfile has no way to know since you destroyed the deployment [13:02:49] btullis: ^ [13:03:13] Ah, many thanks. Will be more patient next time. [14:17:37] Self-merging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/862232 to test the hooks [14:18:04] (just changing the logging hooks for an unused service to get better SAL logging) [14:22:52] <_joe_> claime: have you checked with jelto? [14:23:07] <_joe_> there were a few consideration on why we were using those hooks and not others [14:23:21] <_joe_> one was to avoid logging like 2 times per canary, 2 per main [14:23:47] <_joe_> presync and postsync will log twice I think [14:23:52] <_joe_> jelto: ^^ [14:23:54] Right, but that's at the cost of not logging errors [14:24:02] Which IMO is bad [14:24:48] But I can do it differently, and only make postsync log in case of an error [14:24:55] And keep everything else just the same [14:25:15] <_joe_> or we can write a wrapper around helmfile that does all this shit much better, just saying [14:25:34] The issue with that is that in case of error, you'll get one log saying "FAIL" and one saying "DONE" which is also confusing [14:26:28] Yes, sure, but if I can get away with not creating another layer... [14:26:34] (re: wrapper) [14:30:42] <_joe_> I just spent the day wanting one :P [14:34:42] So it logs start and finish for each release, which is what I expected [14:35:08] Given that we usually roll canaries before main/production, it makes sense to log that way imo [14:38:42] 10serviceops, 10Prod-Kubernetes: Helmfile !log messages do not indicate failed deployments - https://phabricator.wikimedia.org/T303900 (10Clement_Goubert) That makes helmfile log for each release: ` 14:31 <+logmsgbot> !log cgoubert@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sy... [14:43:40] 10serviceops, 10Prod-Kubernetes: Helmfile !log messages do not indicate failed deployments - https://phabricator.wikimedia.org/T303900 (10Clement_Goubert) Other possible approaches are: - Only calling the `postsync` hook and passing an argument to `helmfile_log_sal` so it logs only if there is an error, leavin... [14:48:23] question: restbase-dev in eqiad needs to be decommissioned, we have 3 new machines to take its place, but they are in codfw. We use the cluster for sessionstore and echostore staging; Will it be a problem to use a database cluster in the codfw for (staging) services running in eqiad? [14:49:12] Fun fact, helmfile linter does not catch go templating errors in hooks [15:05:08] 10serviceops, 10Observability-Tracing: Helmchart for OpenTelemetry Collector - https://phabricator.wikimedia.org/T324117 (10Clement_Goubert) [15:24:52] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Daimona) @Clement_Goubert Hi, and thank you for the deployment :) The extension is now enabled on... 
[15:37:11] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Clement_Goubert) @Daimona The change is in review, I will deploy it when it's been +1'd. Thanks fo... [15:54:27] 10serviceops, 10SRE, 10ops-codfw: codfw: ManagementSSHDown for ores2009 and thumbor2004 - https://phabricator.wikimedia.org/T323925 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8b8e8a4d-71f2-462d-8e1f-ff904f7e3ed4) set by akosiaris@cumin1001 for 1:00:00 on 1 host(s) and their services... [15:56:55] 10serviceops, 10Prod-Kubernetes: Helmfile !log messages do not indicate failed deployments - https://phabricator.wikimedia.org/T303900 (10Clement_Goubert) p:05Triage→03Low [15:58:10] 10serviceops, 10Observability-Tracing: Helmchart for OpenTelemetry Collector - https://phabricator.wikimedia.org/T324117 (10Clement_Goubert) p:05Triage→03Medium [15:58:17] 10serviceops, 10Observability-Tracing: OpenTelemetry Collector running as a DaemonSet on Wikikube - https://phabricator.wikimedia.org/T320564 (10Clement_Goubert) p:05Triage→03Medium [15:58:19] 10serviceops, 10Observability-Tracing, 10Patch-For-Review: Package OpenTelemetry Collector atop our own base Docker images - https://phabricator.wikimedia.org/T320552 (10Clement_Goubert) 05Open→03In progress p:05Triage→03High [15:58:29] 10serviceops, 10SRE, 10ops-codfw: codfw: ManagementSSHDown for ores2009 and thumbor2004 - https://phabricator.wikimedia.org/T323925 (10klausman) ores2009 is shutting down & powering off now [15:59:02] 10serviceops, 10Release-Engineering-Team, 10Scap: scap sync failure - https://phabricator.wikimedia.org/T324023 (10dancy) 05Open→03Resolved a:03dancy Noting for the record that @Joe made the fix via https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/36 and scap 4.29.3 was deployed yesterda... [16:18:14] 10serviceops, 10GitLab (CI & Job Runners), 10Release-Engineering-Team (Priority Backlog 📥): Upgrade jwt-authorizer on all registry hosts - https://phabricator.wikimedia.org/T324037 (10brennen) [16:23:41] 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Priority Backlog 📥): Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push - https://phabricator.wikimedia.org/T322453 (10dduvall) [16:26:07] 10serviceops, 10Content-Transform-Team-WIP, 10Maps: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 (10Jgiannelos) [16:32:01] self-answering myself - I get "PodDisruptionBudget webhook-pdb failed validation: could not find schema for PodDisruptionBudget" from CI (validation step) because the apiVersion is policy/v1, meanwhile if I set it to policy/v1beta1 (we have some calico resources with that) it doesn't error out [16:37:45] do I need to add the schema by any chance? 
I checked the jsonschema dir in deployment-charts but it is not clear to me what I should do [16:46:13] TIL https://gitlab.wikimedia.org/repos/sre/kubernetes-json-schema [16:50:55] in there I see something like ./v1.23.6/poddisruptionbudget-policy-v1.json [16:51:50] ah ok right, self-self-answer - the error msg from CI explicitly states k8s v1.16.15 [16:52:51] and in fact [16:52:52] ./v1.16.15-standalone-strict/poddisruptionbudget-policy-v1beta1.json [16:53:11] so I am the first one that needs to skip one of the k8s versions probably [17:29:01] 10serviceops, 10SRE, 10ops-codfw: codfw: ManagementSSHDown for ores2009 and thumbor2004 - https://phabricator.wikimedia.org/T323925 (10Papaul) thumbor2004 had a broken iDRAC card. I replaced it. [18:41:03] 10serviceops, 10SRE, 10ops-codfw: codfw: ManagementSSHDown for ores2009 and thumbor2004 - https://phabricator.wikimedia.org/T323925 (10Papaul) 05Open→03Resolved ores2009 mgmt is back up [22:31:51] 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Priority Backlog 📥): Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push - https://phabricator.wikimedia.org/T322453 (10Dzahn) [22:41:18] 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Priority Backlog 📥): Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push - https://phabricator.wikimedia.org/T322453 (10Dzahn) jwt-authorizer has been upgraded to 1.1.0 on the 4 registry hosts. P... [23:44:51] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10serviceops-collab, 10Puppet: Create a puppet define for systemd timers - https://phabricator.wikimedia.org/T111031 (10Dzahn) [23:45:27] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10serviceops-collab, 10Puppet: Create a puppet define for systemd timers - https://phabricator.wikimedia.org/T111031 (10Dzahn) Nowadays in 2022 we have one, it's called `systemd::timer::job`, was created by @joe in https://gerrit.wikimedia.org/r/c/operat... [23:46:05] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10serviceops-collab, 10Puppet: Create a puppet define for systemd timers - https://phabricator.wikimedia.org/T111031 (10Dzahn) 05Open→03Resolved a:03Dzahn closing out old SRE tickets. seems resolved to me. please reopen if you disagree.
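Closing the loop on the PodDisruptionBudget validation question: the schema repository linked above shows directly which apiVersions each Kubernetes version can be validated against. A small sketch, using the paths quoted in the log:

```
git clone https://gitlab.wikimedia.org/repos/sre/kubernetes-json-schema
# 1.16 only ships a v1beta1 schema for PDBs:
ls kubernetes-json-schema/v1.16.15-standalone-strict/ | grep poddisruptionbudget
# -> poddisruptionbudget-policy-v1beta1.json
# while newer versions also carry the policy/v1 schema, e.g.:
ls kubernetes-json-schema/v1.23.6/poddisruptionbudget-policy-v1.json
# So a policy/v1 PDB cannot be validated against 1.16.15: either the chart keeps
# emitting policy/v1beta1, or the 1.16 validation run has to be skipped for it.
```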