[05:47:35] <_joe_> jelto, mutante, arnoldokoth, eoghan: I don't know how many of you are subscribed to security@, but one of our volunteers has mail coming from gitlab classified as spam. I think it's because we don't have an SPF record for gitlab.wikimedia.org
[05:47:59] <_joe_> if you're not subscribed to security@, I can share the relevant headers
[08:08:28] hello folks, I'd need to roll-restart eventgate-main's pods in both DCs to pick up new streams (see https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#EventStreamConfig_change)
[08:08:33] lemme know if I can proceed
[08:09:40] elukey: +1
[08:14:27] super, thanks, I'll do it in a bit :)
[08:35:43] _joe_: thanks, I'll take a look later today!
[09:11:02] 10serviceops, 10ChangeProp, 10Content-Transform-Team-WIP, 10Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (10Jgiannelos)
[09:15:54] mmmm so I see that there is a diff for eqiad/codfw wikikube
[09:15:59] for eventgate-main I mean
[09:16:10] it seems ok, maybe only the tls config changed
[09:16:31] qq - if I use sync with roll_restart=1, I also deploy those changes, right?
[09:16:51] they seem fine but no idea if they are meant to go out or not
[09:18:21] elukey: yes, you will deploy the change
[09:21:11] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Clement_Goubert) >>! In T328287#8579474, @Trizek-WMF wrote: > As you gave 3 dates in the task description, can you...
[09:21:11] elukey: but I'd agree that they look fine.
IIRC there were some whitespace-like fixes to the chart modules - that's probably them
[09:21:30] jayme: I think it is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/860518
[09:21:36] judging from the version of the chart, it matches
[09:21:59] indeed
[09:22:32] the diff looks harmless to me as well...
[09:23:41] also, it was already deployed in staging and I didn't see anything weird in the logs (not a great assurance, but better than nothing)
[09:24:05] You can use --context 5 to get a more digestible diff, elukey, btw
[09:24:35] claime: TIL <3 thanks
[09:25:05] the only thing that got my attention was checksum/tls-config, but I think it is just due to the refactoring
[09:25:19] I'll deploy to codfw if everybody agrees
[09:30:20] go ahead!
[09:33:40] https://knowyourmeme.com/memes/this-is-fine
[09:41:54] codfw deployed, will check metrics etc.. and then deploy to eqiad
[09:57:50] deploy completed!
[10:05:55] Sorry if I bother people this morning, I promise I'll stop after this :)
[10:06:29] I used toYaml in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/885991 and I stumbled on https://github.com/helm/helm/issues/4262
[10:06:51] and indeed I see in the helm diff something like
[10:06:51] 10:52:02 - database: '/.*/'
[10:06:52] 10:52:02 + database: /.*/
[10:07:46] have you encountered the same issue before? Does it matter or not? (I mean, IIRC a string will be ok even without quotes in yaml, but I'm not sure about side effects etc..)
[10:08:56] (toYaml gives me a lot of flexibility for the config)
[10:25:28] https://github.com/helm/helm/issues/4262#issuecomment-1311121411 < :|
[10:26:01] elukey: I think it'll only matter if it can't unambiguously be recognized as a string
[10:26:07] So in your example you're ok
[10:26:57] But if, like in the github issue, you're quoting big integers in hex notation or something else that's ambiguous, or if you need to preserve the single quotes, it becomes...
ugh
[10:27:22] claime: in theory no, it should always be a regex on a field (in theory)
[10:28:18] There's a very good point in the issue: "To my knowledge, the current state of toYaml is, that its behaviour is undefined. So, there is no breaking it, as it is already broken."
[10:28:27] Basically, test it and see what happens...
[10:29:17] ahahahhahah
[10:29:25] (for the quote)
[10:29:37] Seriously, that bug report is maddening
[10:29:41] yeah, I agree
[10:48:34] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (10cmooney) > BGP is smart about it (see '"first party" NEXT_HOP' in section 5.1.3.2 of the RFC), so it should just work on the router side. TIL didn't realise EBGP...
[11:10:05] claime: I ended up adding explicit ranges for the use cases that I have, so I have control on | quote.. Not as flexible as toYaml, but better for regexes in my opinion. Thanks a lot for the brainbounce :)
[11:10:52] ack, you're welcome
[12:01:11] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Clement_Goubert) >>! In T328287#8579336, @Trizek-WMF wrote: > @Clement_Goubert Has anything major changed in your p...
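The toYaml quoting question above can be sketched outside Helm. A minimal Python illustration (my own simplification, not Helm's or YAML's actual resolver) of when an unquoted scalar is still read back as a string, using a rough subset of the YAML 1.1 resolution rules:

```python
import re

def reparses_as_string(scalar: str) -> bool:
    """Return True if an unquoted YAML scalar would still load as a string.

    Simplified subset of YAML 1.1 implicit typing, for demonstration only.
    """
    non_string_patterns = [
        r"[+-]?\d+",                  # plain integer
        r"0x[0-9a-fA-F]+",            # hex integer (the GitHub issue's case)
        r"[+-]?(\d+\.\d*|\.\d+)",     # float
        r"(true|false|yes|no|on|off|null|~)",  # bool/null (YAML 1.1)
    ]
    return not any(re.fullmatch(p, scalar, re.IGNORECASE)
                   for p in non_string_patterns)

# '/.*/' survives without quotes: it can only be a string.
assert reparses_as_string("/.*/")
# '0x1234' does not: unquoted, it would load as an integer.
assert not reparses_as_string("0x1234")
```

This matches the conclusion in the chat: a regex like `/.*/` is safe unquoted, while hex integers or bare `yes`/`no` values would silently change type when toYaml drops the quotes.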
[13:55:58] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Metrics changes with Kubernetes v1.23 - https://phabricator.wikimedia.org/T322919 (10JMeybohm)
[13:59:18] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Jhancock.wm)
[14:01:08] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Metrics changes with Kubernetes v1.23 - https://phabricator.wikimedia.org/T322919 (10JMeybohm) a:05JMeybohm→03None I think I'm done (and need a week of vacation ;)). I'll leave this open an...
[14:01:26] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, 10Kubernetes: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (10JMeybohm)
[14:01:28] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Metrics changes with Kubernetes v1.23 - https://phabricator.wikimedia.org/T322919 (10JMeybohm)
[14:01:33] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[14:13:14] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Metrics changes with Kubernetes v1.23 - https://phabricator.wikimedia.org/T322919 (10JMeybohm) p:05High→03Low
[14:26:40] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[14:28:42] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) a:03Jhancock.wm
[15:35:01] folks, I am going to deploy changeprop
(already had a chat with Hugh about it)
[15:36:14] starting from codfw, which seems to be handling fewer events
[15:37:48] ack
[15:38:42] main thing to look out for is large, persistent (after say 15-30 minutes) spikes in https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&viewPanel=10
[15:38:58] I'm in meetings but still here if something goes wrong
[15:39:17] ok, I'm putting it up in a corner and keeping an eye on it
[15:39:32] me too, yes
[15:39:52] the only weird thing that I can see in the diff right now is a change in the blacklist
[15:44:26] ohhh crap - that was me. I can take a look in 15 minutes
[15:45:32] I see something like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/803877 but it doesn't match, basically all of the en.wikipedia.org blacklist
[15:45:46] hnowlan: okok don't worry, even tomorrow works, I can skip the deploy
[15:45:50] and do it on monday
[15:46:07] 10serviceops, 10Prod-Kubernetes, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes - https://phabricator.wikimedia.org/T293063 (10JMeybohm) Hey @dcausse, I'm reading this again because of the upcoming k8s 1.23 upg...
[15:48:12] * elukey bbiab
[15:48:38] 10serviceops, 10ChangeProp, 10Content-Transform-Team-WIP, 10Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (10akosiaris) >>! In T226931#8581445, @Jgiannelos wrote: > From a quick look I can consistently repro...
[15:55:14] elukey: I would actually prefer to revert those blocklist changes and test their rollout a little more carefully, if you still have time/will to deploy your change https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/886086
[15:57:42] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Calico 3.17.1 kube-controllers fail to reach apiserver at startup - https://phabricator.wikimedia.org/T271422 (10JMeybohm) 05Open→03Invalid Non-issue with Calico 3.23 on k8s 1.23
[16:11:00] 10serviceops, 10Prod-Kubernetes, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes - https://phabricator.wikimedia.org/T293063 (10dcausse)
[16:13:09] 10serviceops, 10Prod-Kubernetes, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes - https://phabricator.wikimedia.org/T293063 (10dcausse) >>! In T293063#8582548, @JMeybohm wrote: > Hey @dcausse, I'm reading this...
[16:20:20] 10serviceops, 10Prod-Kubernetes, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes - https://phabricator.wikimedia.org/T293063 (10JMeybohm) >>! In T293063#8582600, @dcausse wrote: > Hey, clarified this a bit, rena...
[16:27:41] 10serviceops, 10Prod-Kubernetes, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes - https://phabricator.wikimedia.org/T293063 (10dcausse) >>! In T293063#8582625, @JMeybohm wrote: > Anyhow. AIUI this process will...
[16:32:53] hnowlan: back, sorry. Yes, I have time, just +1ed the code review.. shall I +2 and deploy to staging?
[16:36:15] elukey: just needed to bump the chart version - once CI passes it's safe to +2.
The blocklist itself will be a noop in prod
[16:37:48] hnowlan: ack, super
[16:46:36] <_joe_> hnowlan, jayme, rzl: envoy security release incoming
[16:46:44] <_joe_> we might need to upgrade to a new version
[16:47:12] argh, okay
[16:47:19] probably istio included :(
[16:47:23] <_joe_> yeah, boringssl
[16:47:25] the last time I tried to do this, I couldn't get the debian build to work
[16:47:26] <_joe_> elukey: for sure
[16:47:34] we weren't in a hurry then, but I guess we'll have to figure it out this time
[16:47:38] <_joe_> rzl: we can cut it short and just make a binary debian package
[16:47:59] hnowlan: changeprop deployed in codfw :)
[16:48:04] elukey: nice
[16:48:12] _joe_: less nice ;[
[16:48:37] <_joe_> elukey: and the worst part - the problem is a boringssl vulnerability
[16:48:44] <_joe_> so I doubt we can gloss over it
[16:48:51] <_joe_> we'll know more on the 7th
[16:49:22] lovely
[16:56:05] hnowlan: the only weird metric is processing time, which went up, but I think it is only a side effect of the deployment; I'd expect it to return to normal levels in a bit
[16:56:42] nothing in the pod logs indicating errors
[17:02:01] elukey: looks good so far
[17:02:06] elukey: you mean the rule exec time?
[17:02:08] looks like https://phabricator.wikimedia.org/T328683 :/
[17:02:52] hnowlan: lol yes, exactly
[17:03:34] changeprop is in painful need of someone who can do active dev on it and knows node.
I can understand it, but I think I need to set aside time and health assistance to be able to actually modify it
[17:08:31] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10RZamora-WMF) a:05Trizek-WMF→03None
[17:08:37] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10RZamora-WMF) a:03Trizek-WMF
[17:09:02] hnowlan: I can definitely try to help on the SRE side, the more I know about it the better.. One thing that may really help a ton is to increase the logging level like we did in staging, to have some visibility into what change-prop does.. It may help explain weird things like why metrics go away
[17:09:38] not sure how big the logs would become though; maybe we could try to have a canary node or similar (IIRC eventgate has a canary release) to experiment with stuff
[17:10:01] or we could just try with the increased level for a certain amount of time and see
[17:10:33] logs at trace level in changeprop are more or less what people expect from an info/debug-ish level imho
[17:10:52] anyway, deploying to eqiad :)
[17:20:19] cool!
[17:20:42] yep, the trace logging is definitely the only useful level - far too much of a firehose for prod, but there are some ERROR-level things in there :/
[17:20:50] that change shouldn't be super huge though
[17:21:20] eqiad looking good so far
[17:21:40] Petr repeatedly explained to me why we can't discover which workers are processing which jobs, but I really want that as a basic feature also
[17:23:45] ah ok, so we already know that trace for production is not viable
[17:26:25] yeah, things like deduped and old messages will be logged at trace (which is perfectly reasonable - for trace :|)
[17:28:39] okok, I thought there were fewer things logged, sigh
[17:31:03] step 1 is configuring what is logged at which level, I guess, and then tuning that
[17:31:38] the hidden TLS error we got was at trace, but on a bunch of different paths - some of those I think could be enabled at another level without flooding us
[17:31:45] stuff like sampled events we don't need
[17:31:56] gonna look at this more tomorrow
[17:32:07] still watching the graphs for now though
[17:32:48] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc1037.eqiad.wmnet with OS bullseye
[17:33:15] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc2043.codfw.wmnet with OS bullseye
[17:34:08] hnowlan: ack, I am interested if you need anybody to hack/brainbounce!
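The "configure what is logged at which level, then tune it" idea above can be sketched generically. This is a Python illustration of the pattern, not changeprop's actual node.js code; the `changeprop.*` logger names are hypothetical:

```python
import logging

# Root logger at INFO: the debug/trace firehose is suppressed by default.
logging.basicConfig(level=logging.INFO,
                    format="%(name)s %(levelname)s %(message)s")

# Per-path loggers: noisy paths (dedupe, sampled events) inherit INFO and
# stay quiet; interesting paths (e.g. TLS errors) log at a visible level.
dedupe_log = logging.getLogger("changeprop.dedupe")
tls_log = logging.getLogger("changeprop.tls")

dedupe_log.debug("duplicate message dropped")      # suppressed: DEBUG < INFO
tls_log.warning("TLS handshake to broker failed")  # emitted
```

The point is that promoting one code path (the hidden TLS error) to WARNING does not require turning on the whole trace firehose for everything else.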
[17:35:08] going afk for a bit, will check metrics again in a couple of hours :)
[18:03:23] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc1037.eqiad.wmnet with OS bullseye completed: - mc1037 (**PASS**) - Downtimed on Icinga/Alertmanager - Disa...
[18:03:38] 10serviceops, 10ChangeProp, 10Content-Transform-Team-WIP, 10Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (10TheDJ) Looking at the same day for the repo, they were working on {T169939}, which details that th...
[18:08:27] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc2043.codfw.wmnet with OS bullseye completed: - mc2043 (**PASS**) - Downtimed on Icinga/Alertmanager - Disa...
[19:11:02] how does the 'site' prometheus label get added to metrics scraped from k8s pods?
[19:11:21] I'm running flink in dse-k8s-eqiad
[19:11:26] and lable
[19:11:27] label
[19:11:31] site="eqiad"
[19:11:58] it might be useful to inject a label for the k8s cluster name / environment name
[19:12:24] the only thing indicating dse is the prometheus source "k8s-dse"
[19:12:59] https://grafana.wikimedia.org/goto/yN96Ac04z?orgId=1
[19:15:46] ottomata: it's added by Thanos (https://wikitech.wikimedia.org/wiki/Thanos) at query time
[19:17:03] hm, okay, so that is not from k8s or the prometheus scraper.
[19:17:13] i think i want a metric added that indicates the k8s cluster.
[19:17:25] s/metric/label
[19:18:03] indeed, prometheus info not added if I select the specific prometheus datasource
[19:18:04] hm
[19:18:38] okay so, where are the other k8s-specific labels added?
[19:18:50] e.g.
[19:18:51] kubernetes_pod_name="flink-app-main-869f9b59cc-65z9r",
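For context on the `site` question above: Prometheus "external labels" are attached when results are federated or queried through Thanos, which is why they disappear when querying a single per-DC datasource directly. Below is a hypothetical sketch of the kind of `external_labels` stanza involved, and where a cluster label could likewise be injected. This is illustrative only, not the actual WMF Prometheus configuration:

```yaml
# Hypothetical prometheus.yml fragment (assumed label names, not the real config)
global:
  external_labels:
    site: eqiad          # surfaced via Thanos at query time as site="eqiad"
    prometheus: k8s-dse  # a k8s cluster/environment label could be added the same way
```

Labels like `kubernetes_pod_name`, by contrast, come from scrape-time relabeling of Kubernetes service-discovery metadata, which is why they are present even on the raw per-cluster datasource.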