[08:58:43] serviceops, Content-Transform-Team: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597 (Joe) NEW
[08:58:58] serviceops, Content-Transform-Team: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9648363 (Joe) p:Triage→High
[09:14:54] serviceops, Data-Engineering: kafka-main certificates expiring on 2024-04-04 - https://phabricator.wikimedia.org/T360598 (akosiaris) NEW
[09:16:38] serviceops, Data-Engineering: kafka-main certificates expiring on 2024-04-04 - https://phabricator.wikimedia.org/T360598#9648430 (akosiaris) p:Triage→High Adding @brouberol as they probably have way more experience than serviceops on refreshing kafka certificates than anyone in #serviceops
[09:17:39] serviceops, Data-Engineering: kafka-main certificates expiring on 2024-04-04 - https://phabricator.wikimedia.org/T360598#9648445 (brouberol) Let me have a look at how these certificates are generated. I'm thinking we should renew them and trigger a rolling-restart of the cluster.
[09:25:36] serviceops, Data-Engineering: kafka-main certificates expiring on 2024-04-04 - https://phabricator.wikimedia.org/T360598#9648509 (brouberol) ` brouberol@kafka-main2001:~$ echo y | openssl s_client -connect $(hostname -f):9093 | openssl x509 -issuer -nout x509: Unrecognized flag nout x509: Use -help for...
[09:31:56] serviceops, Data-Engineering: kafka-main certificates expiring on 2024-04-04 - https://phabricator.wikimedia.org/T360598#9648523 (akosiaris) >>! In T360598#9648509, @brouberol wrote: > ` > brouberol@kafka-main2001:~$ echo y | openssl s_client -connect $(hostname -f):9093 | openssl x509 -issuer -nout > x...
[09:33:29] serviceops, Data-Engineering: kafka-main certificates expiring on 2024-04-04 - https://phabricator.wikimedia.org/T360598#9648526 (akosiaris) So, since I 've never done this before (that I remember of), double check me on this please. Is it just enough to issue ` sudo cookbook sre.kafka.roll-restart-reb...
[09:37:14] serviceops, Data-Engineering: kafka-main certificates expiring on 2024-04-04 - https://phabricator.wikimedia.org/T360598#9648538 (MoritzMuehlenhoff) Luca migrated kafka/main to the PKI in https://phabricator.wikimedia.org/T319372 and he left a comment to that regard on the task: > What is it going to ch...
[09:51:15] serviceops, Data-Engineering: kafka-main certificates expiring on 2024-04-04 - https://phabricator.wikimedia.org/T360598#9648584 (brouberol) >>! In T360598#9648526, @akosiaris wrote: > So, since I 've never done this before (that I remember of), double check me on this please. Is it just enough to issue...
[10:13:33] serviceops, Prod-Kubernetes, Data-Platform-SRE (2024.03.04 - 2024.03.24), Kubernetes: Add redis (rdb) instances to external-services - https://phabricator.wikimedia.org/T360612 (JMeybohm) NEW
[11:17:43] how is build --select meant to work for docker-pkg? docker-pkg build images/ --select '*debci*' isn't attempting to build any images (and I don't already have them built per docker images), and I'd expect that to try and build the three images under images/wmf-debci in production-images
[11:21:15] serviceops, Content-Transform-Team: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9648874 (kamila) Note also the increase in RX (but not really TX) traffic that coincides with these: {F42940190} Also, this is only visible in eqiad.
[11:24:14] or is it that build will not attempt to build something if the equivalent image is already in our registry?
[11:24:36] I think that's the case
[11:24:42] Try and bump the changelog
[11:25:05] gah, I just wanted a test build! I'll try that though
[11:25:23] But it's weird, in my memory it used to work
[11:26:31] Yeah, if I bump the changelog it now tries to build wmf-debci-bookworm
[11:26:51] IWBNI there was some "build this image even though the registry has it" option
[11:29:46] You can do the changelog bump automatically with build --nightly
[11:30:49] Doesn't do exactly what you want but easier than manually updating the changelog
[11:31:06] Mmm, still leaves me with working-dir changes I have to undo later
[11:31:07] maybe _joe_ knows the way to force build
[11:31:29] if not, open a task against docker-pkg, we'll try to find a way to make it work
[11:32:47] <_joe_> Emperor: correct, if the image with the current debian changelog entry is in the registry, docker-pkg will refuse to build it
[11:32:56] <_joe_> just add a new dch entry
[11:33:26] <_joe_> Emperor: or, just don't add the registry to the config :P
[11:48:04] well, I opened a couple of phab tasks about the bits that caused me pain :) Now I think I have building ceph images though :)
[13:20:57] serviceops, Data-Engineering, Data-Platform-SRE: kafka-main certificates expiring on 2024-04-04 - https://phabricator.wikimedia.org/T360598#9649020 (lbowmaker)
[13:22:14] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9649036 (Clement_Goubert)
[13:25:03] serviceops, Data-Engineering, Data-Platform-SRE: kafka-main certificates expiring on 2024-04-04 - https://phabricator.wikimedia.org/T360598#9649069 (akosiaris) Open→Resolved a:akosiaris Alerts gone, I 'll resolve this. As a note to anyone seeing this in the future, it's `kafka...
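Side note on T360598 above: the remaining lifetime of a certificate can be worked out from its notAfter string. This is a minimal sketch, assuming the string is in the form printed by `openssl x509 -enddate`; the time of day and the "now" value are hypothetical, since the log only gives the expiry date 2024-04-04.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate notAfter string.

    `not_after` is expected in the form 'Apr  4 00:00:00 2024 GMT'.
    """
    # cert_time_to_seconds converts the notAfter string to seconds since the epoch (UTC).
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


# Hypothetical inputs: midnight UTC is assumed for both timestamps.
now = datetime(2024, 3, 21, tzinfo=timezone.utc)
print(days_until_expiry("Apr  4 00:00:00 2024 GMT", now))
```

Passing `now` explicitly keeps the calculation reproducible; a monitoring check would pass `datetime.now(timezone.utc)` instead.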
[13:27:43] serviceops, ChangeProp, MW-on-K8s, SRE, WMF-JobQueue: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625 (Clement_Goubert) NEW
[13:27:55] serviceops, ChangeProp, MW-on-K8s, SRE, WMF-JobQueue: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625#9649117 (Clement_Goubert) p:Triage→High
[13:31:49] serviceops, ChangeProp, MW-on-K8s, SRE, WMF-JobQueue: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625#9649168 (Clement_Goubert)
[13:54:03] serviceops, ChangeProp, MW-on-K8s, SRE, and 2 others: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625#9649410 (Joe) There is a few reasons why we didn't migrate changeprop to use the service mesh, first of all the fact we don't want to define timeouts ou...
[14:01:34] jelto: eoghan: hello, may you puppet-merge a Gerrit config change for us? It is to slightly change the template used when it comments to Phabricator: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013097
[14:01:51] I will run puppet on the host and Gerrit then takes the update into account automatically
[14:02:29] let me take a look, one sec
[14:06:47] hashar: merged
[14:06:54] awesome thank you!
[14:08:40] serviceops, Machine-Learning-Team: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637 (elukey) NEW
[14:28:54] serviceops, Machine-Learning-Team: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9649735 (JMeybohm) Sounds good to me. I'd say you can just depool one of the active registry nodes and restart that VM for the RAM increase.
No need for extra steps
[14:43:58] serviceops, ChangeProp, MW-on-K8s, SRE, and 2 others: Alter changeprop chart to use the service mesh - https://phabricator.wikimedia.org/T360625#9649779 (Clement_Goubert)
[14:47:18] serviceops, Machine-Learning-Team: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9649789 (MoritzMuehlenhoff) >>! In T360637#9649735, @JMeybohm wrote: > Sounds good to me. I'd say you can just depool one of the active registry nodes and restart that VM for the RAM incr...
[14:49:21] serviceops, Machine-Learning-Team: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9649813 (akosiaris)
[15:24:12] Emperor: the docker-pkg build `--select` option is ALWAYS hitting me. Since it uses fnmatch and I think that is against the full image name (registry hostname, namespace, image name, colon, tag) I end up wrapping my search with * and :*
[15:24:25] Emperor: so I'd use: `--select '*debci:*'`
[15:24:44] serviceops, SRE: VRT wiki fails to create account - https://phabricator.wikimedia.org/T359901#9649944 (thcipriani)
[15:24:49] that is certainly improvable :)
[15:25:31] and yeah changelog bump :/
[15:25:46] but you can `docker rmi` the image before invoking docker-pkg
[15:25:56] though if it exists in the registry it will download it instead of building
[15:50:29] serviceops, Datacenter-Switchover: imagecatalog_record.service fails due to read-only sqlite database - https://phabricator.wikimedia.org/T360652 (Clement_Goubert) NEW
[15:50:43] serviceops, Datacenter-Switchover: imagecatalog_record.service fails due to read-only sqlite database - https://phabricator.wikimedia.org/T360652#9650116 (Clement_Goubert) p:Triage→High
[15:52:38] claime: ^ interesting, thanks for the chown, I assume that fixed it for now? I have no memory of how we puppetized it but I'll refresh myself on that today
[15:52:45] yeah
[15:53:00] what was the owner/group before?
[15:53:04] mwbuilder
[15:53:10] okay thanks
[15:54:03] We exec the creation of the empty db if it isn't there with the right user, we have an ensure on the directory with the right user, but not on the file itself and no recurse
[15:54:11] modules/imagecatalog/manifests/init.pp
[15:55:43] yeah I think the script creates the file (still from memory, might be wrong)
[15:55:52] but that's why I'm surprised it comes out read-only
[15:57:19] serviceops, Datacenter-Switchover: imagecatalog_record.service fails due to read-only sqlite database - https://phabricator.wikimedia.org/T360652#9650178 (Clement_Goubert) ` cgoubert@deploy1002:~$ sudo chown imagecatalog:imagecatalog /srv/deployment/imagecatalog/catalog.sqlite cgoubert@deploy1002:~$ sudo...
[15:58:47] serviceops, Datacenter-Switchover: imagecatalog_record.service fails due to read-only sqlite database - https://phabricator.wikimedia.org/T360652#9650190 (Clement_Goubert) p:High→Low As the action taken in production fixed the immediate problem, lowering priority.
[16:00:07] serviceops, Data-Engineering, Data-Platform-SRE: kafka-main certificates expiring on 2024-04-04 - https://phabricator.wikimedia.org/T360598#9650211 (herron) FWIW I just went through a similar triage and broker restart process in T358870 It wasn't super obvious at first that all was needed...
[16:14:41] hi folks! If you are ok I'd like to bump the docker registry's vram (one node at a time, depooling etc..)
[16:14:44] ok if I proceed?
[16:16:25] serviceops, Machine-Learning-Team: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9650370 (elukey)
[16:21:19] serviceops, collaboration-services, Data-Persistence, DC-Ops, and 4 others: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9650407 (jijiki)
[16:21:48] serviceops, collaboration-services, Data-Persistence, DC-Ops, and 4 others: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9650413 (jijiki)
[16:23:05] minor hiccup on our side which is related to switch of deployment server. because releases* machines pull /srv/patches from deployment servers every 10 minutes.. at some point when the switch happens the rsync config on deployment* changes and it's "unknown module" when trying to pull from it.. which means failed systemd unit which in our case means a ticket gets auto-created. but then it
[16:23:11] also fixes itself on the next run 10 minutes later. so it's super minor and just sharing
[16:28:19] serviceops, Machine-Learning-Team: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9650464 (ops-monitoring-bot) VM registry1003.eqiad.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM
[16:29:01] (proceeding with eqiad nodes that are not dnsdisc pooled)
[16:39:58] serviceops, Machine-Learning-Team: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9650535 (ops-monitoring-bot) VM registry1004.eqiad.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM
[17:03:46] serviceops, Machine-Learning-Team: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9650697 (elukey) ` elukey@ganeti1027:~$ sudo gnt-instance list | grep registry registry1003.eqiad.wmnet kvm debootstrap+default ganeti1026.eqiad.wmnet running 6.0G reg...
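
Side note on the `--select` discussion at 15:24: the sketch below shows how shell-style globs behave when matched against a whole image name. It assumes, as suggested in the chat but not confirmed there, that the pattern is applied with fnmatch to the full "registry/name:tag" string. The registry host, image names and tags are hypothetical, and this is plain Python `fnmatch`, not docker-pkg code.

```python
from fnmatch import fnmatchcase

# Hypothetical full image names; host and tags are made up for illustration.
IMAGES = [
    "registry.example/wmf-debci-bookworm:1.0.0",
    "registry.example/wmf-debci-bullseye:1.0.0",
    "registry.example/debci:0.1.0",
    "registry.example/ceph:18.2.0",
]


def select(pattern, names=IMAGES):
    """Return the names whose entire string matches the glob pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(select("debci"))            # no wildcards: must equal the whole name
print(select("*debci*"))          # "debci" anywhere in the name
print(select("*debci:*"))         # "debci" directly before the tag separator
print(select("*/wmf-debci-*:*"))  # any wmf-debci-* image, any tag
```

Because the glob has to cover the whole string, a bare image name selects nothing, and `*debci:*` only selects an image whose name ends in "debci" right before the colon. Under these assumptions, a pattern such as `*debci*` or `*/wmf-debci-*:*` would be the one to use for images named like wmf-debci-bookworm.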