[09:05:37] 10serviceops, 10CX-cxserver, 10Language-Team, 10Kubernetes, 10Patch-For-Review: cxserver: Section Mapping Database (m5) not accessible by certain region - https://phabricator.wikimedia.org/T341117 (10Nikerabbit)
[09:23:28] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Reserve resources for system daemons on kubernetes nodes - https://phabricator.wikimedia.org/T277876 (10JMeybohm)
[09:23:38] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10JMeybohm)
[10:11:00] 10serviceops, 10MW-on-K8s, 10Observability-Logging, 10Patch-For-Review: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10Joe) 05Open→03In progress a:03Joe Summarizing the research I've done up to now: * It's impossible to "fix" the escaped strings inside httpd *...
[12:37:04] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Reserve resources for system daemons on kubernetes nodes - https://phabricator.wikimedia.org/T277876 (10JMeybohm) `--enforce-node-allocatable=pods` is already enabled (by default) but the design document says: "This flag will be a no-op unless --kube-reserved an...
[13:10:13] 10serviceops, 10Kubernetes: Allow parallel image pulls in k8s - https://phabricator.wikimedia.org/T344154 (10JMeybohm) p:05Triage→03Low
[14:01:22] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Reserve resources for system daemons on kubernetes nodes - https://phabricator.wikimedia.org/T277876 (10JMeybohm) Also I re-read all the things and I think we got it wrong. AIUI now `--system-reserved` and `--kube-reserved` will not be enforced by default (e.g....
[14:08:28] godog: let's use this channel since it is less noisy
[14:08:47] sure, that works for me
[14:09:19] it looks like tegola on codfw is getting all the traffic while it shouldn't, so I'm looking into the configs for the time being
[14:10:11] interesting, ok
[14:14:37] It looks like eqiad kartotherian is pointing to codfw tegola. Something must be wrong in the scap config templates
[14:18:04] I'm looking at tegola-vector-tiles and it looks like it is having trouble talking to thanos-swift.discovery.wmnet
[14:18:18] godog: that is a side effect, I reckon
[14:19:47] effie: that might be, though a side effect of what?
[14:20:05] of the fact that tegola on codfw is overloaded
[14:21:18] possible too, yeah
[14:22:19] the recent thing that changed on thanos-swift is the switch to cfssl, that happened today
[14:22:38] godog: yeah, I was about to ask if it could be related to cfssl?
[14:23:24] since this deployment https://gerrit.wikimedia.org/r/c/maps/kartotherian/deploy/+/941956, tegola on eqiad stopped receiving traffic
[14:23:26] it might, so far this is what I could find from the tegola logs
[14:23:28] caused by: Get https://thanos-swift.discovery.wmnet/tegola-swift-codfw-v002/shared-cache/osm/11/1012/706: dial tcp: i/o timeout
[14:23:29] godog: what time?
[14:24:39] afaics the patch was merged at around 13:15, so an outage starting ~30 min later could be a puppet run
[14:25:31] that makes sense too, on top of the other problem, which is not related, but related
[14:26:02] let me have a look at those logs, since I was looking at kartotherian logs
[14:26:25] I was looking at https://logstash.wikimedia.org/goto/18cf5a9578783a2ede7ca00be5bfa090 FWIW
[14:27:31] there isn't very much info to go by AFAICS except generic errors, I guess we could try to cycle the tegola pods?
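(For context on the node-allocatable discussion in T277876 above: the kubelet computes the resources it hands out to pods by subtracting the reserved budgets from node capacity, and `--enforce-node-allocatable=pods` only changes anything once `--kube-reserved` / `--system-reserved` or eviction thresholds are actually set. A minimal sketch of that arithmetic, with made-up numbers rather than the real values on WMF nodes:)

```go
// Sketch of the kubelet's node-allocatable formula, not the real kubelet API:
// Allocatable = Capacity - KubeReserved - SystemReserved - HardEvictionThreshold.
package main

import "fmt"

type resources struct {
	milliCPU int64 // millicores
	memoryMi int64 // MiB
}

func allocatable(capacity, kubeReserved, systemReserved, evictionHard resources) resources {
	return resources{
		milliCPU: capacity.milliCPU - kubeReserved.milliCPU - systemReserved.milliCPU - evictionHard.milliCPU,
		memoryMi: capacity.memoryMi - kubeReserved.memoryMi - systemReserved.memoryMi - evictionHard.memoryMi,
	}
}

func main() {
	// Hypothetical 48-core / 128 GiB worker with illustrative reservations.
	capacity := resources{milliCPU: 48000, memoryMi: 131072}
	kubeReserved := resources{milliCPU: 500, memoryMi: 1024}
	systemReserved := resources{milliCPU: 500, memoryMi: 1024}
	evictionHard := resources{milliCPU: 0, memoryMi: 100}

	fmt.Printf("allocatable: %+v\n", allocatable(capacity, kubeReserved, systemReserved, evictionHard))
}
```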
I doubt that'll do anything significant heh
[14:29:49] and we definitely need to get more details on why tegola doesn't like to talk TLS to thanos-swift anymore
[14:31:23] it doesn't seem to be a generic golang thing, i.e. thanos components are fine talking to thanos-swift; my next best guess is that maybe tegola doesn't trust the cfssl CA?
[14:31:45] I just get a caused by: Get https://thanos-swift.discovery.wmnet/tegola-swift-codfw-v002/shared-cache/osm/13/4111/2868: net/http: TLS handshake timeout
[14:31:47] TLS timeout, i/o timeout, broken pipe all seem odd for a cert change, but maybe yeah it's just generic logging?
[14:32:49] could be, yeah
[14:33:35] nemo-yiannis: do we have anything to do with the agent "tilelive-http/0.13.0"?
[14:33:43] cc jbond
[14:33:59] this is the agent used by the HTTP requests from kartotherian to tegola
[14:35:26] ok so that is ours, ok
[14:36:12] do all pods in k8s trust cfssl by default?
[14:36:15] I will restart all pods on codfw for starters
[14:36:52] <_joe_> effie: does tegola use the service mesh?
[14:36:57] yes
[14:37:06] oh wait
[14:37:13] <_joe_> to talk to swift?
[14:37:30] <_joe_> else we need to add wmf-certificates to the container of the application
[14:37:33] <_joe_> jayme: ^^
[14:38:36] no
[14:38:41] <_joe_> ok then
[14:38:49] <_joe_> can someone point me to the repo of tegola?
[14:38:51] we are using it directly, since at the time, it was not possible
[14:38:52] uhm... lacking context
[14:38:54] <_joe_> it needs wmf-certificates
[14:39:11] <_joe_> effie: ok, can you point me to the repo?
[14:39:14] * jbond here reading (but as joe said, wmf-certificates includes the PKI root if it's not already there)
[14:39:25] yes, if you're talking to swift directly and that changed to PKI, you'll need wmf-certificates
[14:39:42] this is the repo of our tegola fork: https://gerrit.wikimedia.org/g/operations/software/tegola
[14:39:57] <_joe_> uhm, and how do we build the container for it?
[14:40:07] <_joe_> effie: you should know this, right?
[14:41:29] <_joe_> I don't see a blubber file, so I'm not sure how we build and deploy this
[14:42:06] <_joe_> nemo-yiannis: you, maybe?
[14:42:19] _joe_: I see there is a wmf/master branch that has a blubber.yaml
[14:42:24] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/tegola/+/refs/heads/wmf/master/.pipeline/blubber.yaml
[14:42:26] the previous cergen certs are still on the puppetmaster, so reverting the thanos-fe cfssl switchover to buy some more time is an option as well
[14:42:55] it does not install wmf-certificates, though
[14:43:24] in the wmf/v0.14.x branch it does: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/tegola/+/refs/heads/wmf/v0.14.x/.pipeline/blubber.yaml
[14:43:36] there is a blubber file in wmf/v0.14.x
[14:43:59] <_joe_> people, this is a mess :)
[14:44:13] herron: is it possible to revert?
[14:44:33] <_joe_> nemo-yiannis: and do you know what version is currently in production?
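(Tegola is a Go service talking to thanos-swift directly rather than through the mesh, so after the switch to PKI/cfssl-issued certificates its container needs the new root CA in its trust store, which is what the wmf-certificates package provides. A rough sketch of how a Go client like tegola ends up trusting, or not trusting, that CA; this is not tegola's actual code and the PEM path below is a placeholder, since wmf-certificates would normally install the root into the system trust store:)

```go
// Rough sketch, not WMF code: a Go HTTP client only completes the TLS handshake
// with thanos-swift if the issuing root CA is in its trust pool. crypto/tls uses
// the system pool by default; the explicit pool here just makes the dependency visible.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	pool, err := x509.SystemCertPool() // the container's trust store
	if err != nil {
		pool = x509.NewCertPool()
	}

	// Hypothetical path to the internal PKI root; in production this file would be
	// shipped by the wmf-certificates package rather than loaded by hand.
	if pem, err := os.ReadFile("/usr/share/ca-certificates/wikimedia/pki_root.crt"); err == nil {
		pool.AppendCertsFromPEM(pem)
	}

	client := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}

	resp, err := client.Get("https://thanos-swift.discovery.wmnet/")
	if err != nil {
		// A missing root would surface as "x509: certificate signed by unknown authority".
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```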
[14:44:33] effie: yes, the change was https://gerrit.wikimedia.org/r/c/operations/puppet/+/946559
[14:44:36] in the meantime we can work on the rest
[14:44:44] <_joe_> herron: yeah, let's revert then
[14:44:54] sure, going ahead with a revert now
[14:45:00] +1
[14:45:16] please do, in the meantime nemo-yiannis and I will sort out the rest
[14:45:17] <_joe_> I would've loved a heads up as the oncall person, it would've taken a fraction of the time to realize what was wrong
[14:45:34] <_joe_> effie: you only need an image with wmf-certificates, I think
[14:45:48] yiannis and I will take care of that
[14:46:56] <_joe_> effie: it would also be useful to have a "master" or "main" branch that is what will be deployed in production
[14:48:39] maybe I missed it, but have we seen TLS validation errors? I've just seen i/o timeout and handshake timeout
[14:49:45] _joe_: I will let the devs know
[14:49:56] jayme: same here, I haven't seen TLS validation errors myself yet either
[14:50:44] the puppet run for the revert is just wrapping up now
[14:51:57] <_joe_> jayme: the times coincide too well though
[14:52:29] sure, just wanted to clarify for myself
[14:53:53] effie: regarding scap, it's still problematic because eqiad kartotherian talks to codfw, but it shouldn't be related to the maps2009 outage because this is not related to the master node
[14:55:32] no, it is an unhappy coincidence indeed
[15:00:51] it appears that the production image has wmf-certificates 0~20211110-1
[15:01:44] and in the blubber file it is included
[15:01:59] so something else might still not be right
[15:04:35] incident doc: https://docs.google.com/document/d/1LaXqMyT1tYZjzoX2quLUEaTgeSLvM-udfIL6zR_l4Lk/edit
[15:04:53] <_joe_> 2021?
[15:06:36] <_joe_> we are running
[15:06:39] <_joe_> docker-registry.discovery.wmnet/wikimedia/operations-software-tegola:2021-11-18-210902-production
[15:08:31] 10serviceops, 10Data-Platform-SRE, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10pfischer) a:03pfischer
[15:45:36] herron: we will create a followup task to make any changes we need on tegola; it is a bank holiday in multiple parts of Europe
[15:46:37] effie: sounds good, thanks
[15:52:13] herron: Wednesday is the earliest we can sort this part, or we could go ahead and have another go
[15:53:44] effie: sure, I'll just keep an eye out for feedback on the new patch when ready and go from there
[15:54:24] cheers
[16:37:36] 10serviceops, 10MW-on-K8s, 10Observability-Logging, 10Patch-For-Review: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10CodeReviewBot) oblivian opened https://gitlab.wikimedia.org/repos/sre/glogger/-/merge_requests/1 Introduce glogger
[16:43:26] 10serviceops, 10Data-Platform-SRE, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10pfischer) @elukey, @Joe, thank you for your feedback! I revisited the size estimations, here are the updated n...
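(The logs above only show "i/o timeout" and "TLS handshake timeout" rather than explicit validation errors, so a small standalone probe can help tell a trust problem from plain connectivity or backend overload. This is a hypothetical diagnostic sketch, not something used during the incident:)

```go
// Hypothetical diagnostic: dial thanos-swift and classify the failure.
// An x509.UnknownAuthorityError would support the "doesn't trust the cfssl CA"
// theory; a timeout points at connectivity or an overloaded backend instead.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net"
	"time"
)

func main() {
	addr := "thanos-swift.discovery.wmnet:443"
	dialer := &net.Dialer{Timeout: 5 * time.Second}

	conn, err := tls.DialWithDialer(dialer, "tcp", addr, &tls.Config{})
	if err == nil {
		fmt.Println("handshake OK, peer CN:", conn.ConnectionState().PeerCertificates[0].Subject.CommonName)
		conn.Close()
		return
	}

	var unknownCA x509.UnknownAuthorityError
	var netErr net.Error
	switch {
	case errors.As(err, &unknownCA):
		fmt.Println("trust problem (CA not in the pool):", err)
	case errors.As(err, &netErr) && netErr.Timeout():
		fmt.Println("timeout (connectivity/overload, not a cert issue):", err)
	default:
		fmt.Println("other TLS/dial error:", err)
	}
}
```

(Since the production image already ships wmf-certificates, a probe like this would mainly help confirm whether the CA theory or the overload theory better matches the timeouts seen above.)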
[16:43:55] 10serviceops, 10Data-Platform-SRE, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10pfischer) a:05pfischer→03None
[17:12:22] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Remove allow-pod-to-pod GlobalNetworkPolicy - https://phabricator.wikimedia.org/T344177 (10JMeybohm) p:05Triage→03Medium
[17:12:58] ^ that will be a fun one