[09:05:37] 10serviceops, 10CX-cxserver, 10Language-Team, 10Kubernetes, 10Patch-For-Review: cxserver: Section Mapping Database (m5) not accessible by certain region - https://phabricator.wikimedia.org/T341117 (10Nikerabbit)
[09:23:28] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Reserve resources for system daemons on kubernetes nodes - https://phabricator.wikimedia.org/T277876 (10JMeybohm)
[09:23:38] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10JMeybohm)
[10:11:00] 10serviceops, 10MW-on-K8s, 10Observability-Logging, 10Patch-For-Review: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10Joe) 05Open→03In progress a:03Joe Summarizing the research I've done up to now: * It's impossible to "fix" the escaped strings inside httpd *...
[12:37:04] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Reserve resources for system daemons on kubernetes nodes - https://phabricator.wikimedia.org/T277876 (10JMeybohm) `--enforce-node-allocatable=pods` is already enabled (by default) but the design document says: "This flag will be a no-op unless --kube-reserved an...
[13:10:13] 10serviceops, 10Kubernetes: Allow parallel image pulls in k8s - https://phabricator.wikimedia.org/T344154 (10JMeybohm) p:05Triage→03Low
[14:01:22] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Reserve resources for system daemons on kubernetes nodes - https://phabricator.wikimedia.org/T277876 (10JMeybohm) Also I re-read all the things and I think we got it wrong. AIUI now `--system-reserved` and `--kube-reserved` will not be enforced by default (e.g....
[14:08:28] godog: let's use this channel since it is less noisy
[14:08:47] sure, that works for me
[14:09:19] it looks like tegola on codfw is getting all the traffic while it shouldn't, so I'm looking into the configs for the time being
[14:10:11] interesting, ok
[14:14:37] It looks like eqiad kartotherian is pointing to codfw tegola. Something must be wrong in the scap config templates
[14:18:04] I'm looking at tegola-vector-tiles and it looks like it is having trouble talking to thanos-swift.discovery.wmnet
[14:18:18] godog: that is a side effect, I reckon
[14:19:47] effie: that might be, though a side effect of what?
[14:20:05] of the fact that tegola on codfw is overloaded
[14:21:18] possible too, yeah
[14:22:19] the recent thing that changed on thanos-swift is the switch to cfssl, that happened today
[14:22:38] godog: yeah, I was about to ask if it could be related to cfssl?
[14:23:24] since this deployment https://gerrit.wikimedia.org/r/c/maps/kartotherian/deploy/+/941956, tegola on eqiad stopped receiving traffic
[14:23:26] it might, so far this is what I could find from the tegola logs
[14:23:28] caused by: Get https://thanos-swift.discovery.wmnet/tegola-swift-codfw-v002/shared-cache/osm/11/1012/706: dial tcp: i/o timeout
[14:23:29] godog: what time?
[14:24:39] afaics the patch was merged at around 13:15, so an outage starting ~30 min later could be a puppet run
[14:25:31] that makes sense too, on top of the other problem, which is not related, but related
[14:26:02] let me have a look at those logs, since I was looking at kartotherian logs
[14:26:25] I was looking at https://logstash.wikimedia.org/goto/18cf5a9578783a2ede7ca00be5bfa090 FWIW
[14:27:31] there isn't very much info to go by AFAICS except generic errors, I guess we could try to cycle the tegola pods?
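(For context on the node-allocatable discussion in T277876 above: the kubelet computes the resources it hands out to pods by subtracting the reserved budgets from node capacity, and `--enforce-node-allocatable=pods` only changes anything once `--kube-reserved` / `--system-reserved` or eviction thresholds are actually set. A minimal sketch of that arithmetic, with made-up numbers rather than the real values on WMF nodes:)

```go
// Sketch of the kubelet's node-allocatable formula, not the real kubelet API:
// Allocatable = Capacity - KubeReserved - SystemReserved - HardEvictionThreshold.
package main

import "fmt"

type resources struct {
	milliCPU int64 // millicores
	memoryMi int64 // MiB
}

func allocatable(capacity, kubeReserved, systemReserved, evictionHard resources) resources {
	return resources{
		milliCPU: capacity.milliCPU - kubeReserved.milliCPU - systemReserved.milliCPU - evictionHard.milliCPU,
		memoryMi: capacity.memoryMi - kubeReserved.memoryMi - systemReserved.memoryMi - evictionHard.memoryMi,
	}
}

func main() {
	// Hypothetical 48-core / 128 GiB worker with illustrative reservations.
	capacity := resources{milliCPU: 48000, memoryMi: 131072}
	kubeReserved := resources{milliCPU: 500, memoryMi: 1024}
	systemReserved := resources{milliCPU: 500, memoryMi: 1024}
	evictionHard := resources{milliCPU: 0, memoryMi: 100}

	fmt.Printf("allocatable: %+v\n", allocatable(capacity, kubeReserved, systemReserved, evictionHard))
}
```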
I doubt that'll do anything significant heh
[14:29:49] and we definitely need to get more details on why tegola doesn't like to talk TLS to thanos-swift anymore
[14:31:23] it doesn't seem to be a generic golang thing, i.e. thanos components are fine talking to thanos-swift; my next best guess is that maybe tegola doesn't trust the cfssl CA?
[14:31:45] I just get a caused by: Get https://thanos-swift.discovery.wmnet/tegola-swift-codfw-v002/shared-cache/osm/13/4111/2868: net/http: TLS handshake timeout
[14:31:47] TLS timeout, i/o timeout, broken pipe all seem odd for a cert change, but maybe yeah it's just generic logging?
[14:32:49] could be, yeah
[14:33:35] nemo-yiannis: do we have anything to do with the agent "tilelive-http/0.13.0"?
[14:33:43] cc jbond
[14:33:59] this is the agent used by the HTTP requests from kartotherian to tegola
[14:35:26] ok so that is ours, ok
[14:36:12] do all pods in k8s trust cfssl by default?
[14:36:15] I will restart all pods on codfw for starters
[14:36:52] <_joe_> effie: does tegola use the service mesh?
[14:36:57] yes
[14:37:06] oh wait
[14:37:13] <_joe_> to talk to swift?
[14:37:30] <_joe_> else we need to add wmf-certificates to the container of the application
[14:37:33] <_joe_> jayme: ^^
[14:38:36] no
[14:38:41] <_joe_> ok then
[14:38:49] <_joe_> can someone point me to the repo of tegola?
[14:38:51] we are using it directly, since at the time, it was not possible
[14:38:52] uhm... lacking context
[14:38:54] <_joe_> it needs wmf-certificates
[14:39:11] <_joe_> effie: ok, can you point me to the repo?
[14:39:14] * jbond here reading (but as joe said, wmf-certificates includes the PKI root if it's not already there)
[14:39:25] yes, if you're talking to swift directly and that changed to PKI, you'll need wmf-certificates
[14:39:42] this is the repo of our tegola fork: https://gerrit.wikimedia.org/g/operations/software/tegola
[14:39:57] <_joe_> uhm, and how do we build the container for it?
[14:40:07] <_joe_> effie: you should know this, right?
[14:41:29] <_joe_> I don't see a blubber file, so I'm not sure how we build and deploy this
[14:42:06] <_joe_> nemo-yiannis: you, maybe?
[14:42:19] _joe_: I see there is a wmf/master branch that has a blubber.yaml
[14:42:24] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/tegola/+/refs/heads/wmf/master/.pipeline/blubber.yaml
[14:42:26] the previous cergen certs are still on the puppetmaster, so reverting the thanos-fe cfssl switchover to buy some more time is an option as well
[14:42:55] it does not install wmf-certificates, though
[14:43:24] in the wmf/v0.14.x branch it does: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/tegola/+/refs/heads/wmf/v0.14.x/.pipeline/blubber.yaml
[14:43:36] there is a blubber file in wmf/v0.14.x
[14:43:59] <_joe_> people, this is a mess :)
[14:44:13] herron: is it possible to revert?
[14:44:33] <_joe_> nemo-yiannis: and do you know what version is currently in production?
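(Tegola is a Go service talking to thanos-swift directly rather than through the mesh, so after the switch to PKI/cfssl-issued certificates its container needs the new root CA in its trust store, which is what the wmf-certificates package provides. A rough sketch of how a Go client like tegola ends up trusting, or not trusting, that CA; this is not tegola's actual code and the PEM path below is a placeholder, since wmf-certificates would normally install the root into the system trust store:)

```go
// Rough sketch, not WMF code: a Go HTTP client only completes the TLS handshake
// with thanos-swift if the issuing root CA is in its trust pool. crypto/tls uses
// the system pool by default; the explicit pool here just makes the dependency visible.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	pool, err := x509.SystemCertPool() // the container's trust store
	if err != nil {
		pool = x509.NewCertPool()
	}

	// Hypothetical path to the internal PKI root; in production this file would be
	// shipped by the wmf-certificates package rather than loaded by hand.
	if pem, err := os.ReadFile("/usr/share/ca-certificates/wikimedia/pki_root.crt"); err == nil {
		pool.AppendCertsFromPEM(pem)
	}

	client := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}

	resp, err := client.Get("https://thanos-swift.discovery.wmnet/")
	if err != nil {
		// A missing root would surface as "x509: certificate signed by unknown authority".
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```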
[14:44:33] effie: yes, the change was https://gerrit.wikimedia.org/r/c/operations/puppet/+/946559
[14:44:36] in the meantime we can work on the rest
[14:44:44] <_joe_> herron: yeah, let's revert then
[14:44:54] sure, going ahead with a revert now
[14:45:00] +1
[14:45:16] please do, in the meantime nemo-yiannis and I will sort out the rest
[14:45:17] <_joe_> I would've loved a heads up as the oncall person, it would've taken a fraction of the time to realize what was wrong
[14:45:34] <_joe_> effie: you only need an image with wmf-certificates, I think
[14:45:48] yiannis and I will take care of that
[14:46:56] <_joe_> effie: it would also be useful to have a "master" or "main" branch that is what will be deployed in production
[14:48:39] maybe I missed it, but have we seen TLS validation errors? I've just seen i/o timeout and handshake timeout
[14:49:45] _joe_: I will let the devs know
[14:49:56] jayme: same here, I haven't seen TLS validation errors myself yet either
[14:50:44] the puppet run for the revert is just wrapping up now
[14:51:57] <_joe_> jayme: the times coincide too well though
[14:52:29] sure, just wanted to clarify for myself
[14:53:53] effie: regarding scap, it's still problematic because eqiad kartotherian talks to codfw, but it shouldn't be related to the maps2009 outage because this is not related to the master node
[14:55:32] no, it is an unhappy coincidence indeed
[15:00:51] it appears that the production image has wmf-certificates 0~20211110-1
[15:01:44] and in the blubber file it is included
[15:01:59] so something else might still not be right
[15:04:35] incident doc: https://docs.google.com/document/d/1LaXqMyT1tYZjzoX2quLUEaTgeSLvM-udfIL6zR_l4Lk/edit
[15:04:53] <_joe_> 2021?
[15:06:36] <_joe_> we are running
[15:06:39] <_joe_> docker-registry.discovery.wmnet/wikimedia/operations-software-tegola:2021-11-18-210902-production
[15:08:31] 10serviceops, 10Data-Platform-SRE, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10pfischer) a:03pfischer
[15:45:36] herron: we will create a followup task to make any changes we need on tegola; it is a bank holiday in multiple parts of Europe
[15:46:37] effie: sounds good, thanks
[15:52:13] herron: Wednesday is the earliest we can sort this part, or we could go ahead and have another go
[15:53:44] effie: sure, I'll just keep an eye out for feedback on the new patch when ready and go from there
[15:54:24] cheers
[16:37:36] 10serviceops, 10MW-on-K8s, 10Observability-Logging, 10Patch-For-Review: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10CodeReviewBot) oblivian opened https://gitlab.wikimedia.org/repos/sre/glogger/-/merge_requests/1 Introduce glogger
[16:43:26] 10serviceops, 10Data-Platform-SRE, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10pfischer) @elukey, @Joe, thank you for your feedback! I revisited the size estimations, here are the updated n...
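(The logs above only show "i/o timeout" and "TLS handshake timeout" rather than explicit validation errors, so a small standalone probe can help tell a trust problem from plain connectivity or backend overload. This is a hypothetical diagnostic sketch, not something used during the incident:)

```go
// Hypothetical diagnostic: dial thanos-swift and classify the failure.
// An x509.UnknownAuthorityError would support the "doesn't trust the cfssl CA"
// theory; a timeout points at connectivity or an overloaded backend instead.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net"
	"time"
)

func main() {
	addr := "thanos-swift.discovery.wmnet:443"
	dialer := &net.Dialer{Timeout: 5 * time.Second}

	conn, err := tls.DialWithDialer(dialer, "tcp", addr, &tls.Config{})
	if err == nil {
		fmt.Println("handshake OK, peer CN:", conn.ConnectionState().PeerCertificates[0].Subject.CommonName)
		conn.Close()
		return
	}

	var unknownCA x509.UnknownAuthorityError
	var netErr net.Error
	switch {
	case errors.As(err, &unknownCA):
		fmt.Println("trust problem (CA not in the pool):", err)
	case errors.As(err, &netErr) && netErr.Timeout():
		fmt.Println("timeout (connectivity/overload, not a cert issue):", err)
	default:
		fmt.Println("other TLS/dial error:", err)
	}
}
```

(Since the production image already ships wmf-certificates, a probe like this would mainly help confirm whether the CA theory or the overload theory better matches the timeouts seen above.)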
[16:43:55] 10serviceops, 10Data-Platform-SRE, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10pfischer) a:05pfischer→03None
[17:12:22] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Remove allow-pod-to-pod GlobalNetworkPolicy - https://phabricator.wikimedia.org/T344177 (10JMeybohm) p:05Triage→03Medium
[17:12:58] ^ that will be a fun one