[11:11:19] serviceops, Machine-Learning-Team, Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9664877 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=110ad5f3-e41f-4f7d-a5d0-3343dc9fca15) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and...
[11:11:48] serviceops, Machine-Learning-Team, Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9664878 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1d34de6a-7fb2-4477-984a-7dcc642d43b2) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and...
[11:17:05] serviceops, Machine-Learning-Team, Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9664883 (ops-monitoring-bot) VM registry2003.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM
[11:25:58] serviceops, Deployments, Release-Engineering-Team, Patch-For-Review: httpbb appserver test breaks deployment of the week due to a timeout parsing page - https://phabricator.wikimedia.org/T360867#9664889 (Clement_Goubert) Some context given by @RLazarus from the CR: > At the time we added this tes...
[11:33:38] serviceops, Machine-Learning-Team, Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9664919 (ops-monitoring-bot) VM registry2004.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM
[11:54:55] serviceops, MW-on-K8s, RESTBase, SRE, Patch-For-Review: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213#9664964 (Clement_Goubert) Things to keep an eye on: - Upstream error rate is higher on `mw-api-int` than bare-metal {F43515489} - Connection esta...
[11:58:52] hey folks, the registry nodes have been upgraded
[11:58:59] now the tmpfs for nginx is 4G
[11:59:11] nothing weird afaics, ping me if you see anything out of the ordinary
[12:02:34] elukey: awesome, tysm <3
[12:05:12] serviceops, Machine-Learning-Team, Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9664981 (elukey) Open→Resolved Everything done!
[12:19:35] serviceops, iPoid-Service, Observability-Logging, Patch-For-Review: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616#9665035 (JMeybohm) This seems to be an issue with how imfile implements inotify watches for symlinks (or symlinks to symlinks mayb...
[12:31:01] serviceops, iPoid-Service, Observability-Logging, Patch-For-Review: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616#9665119 (kostajh) btw, in case this is relevant, https://logstash.wikimedia.org/goto/b060c8f0c137245fc0d63b9329583abe shows a spik...
[12:50:49] serviceops, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate charts to Calico Network Policies - https://phabricator.wikimedia.org/T359423#9665185 (brouberol)
[13:15:38] serviceops, Data Products (Data Products Sprint 11): Service Ops Review of Metrics Platform Configuration Management UI - https://phabricator.wikimedia.org/T358577#9665271 (phuedx) >>! In T358577#9653832, @akosiaris wrote: > * There are multiple caches mentioned in the design doc. It is my current unders...
[13:45:41] serviceops, iPoid-Service, Observability-Logging, Patch-For-Review: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616#9665374 (JMeybohm) >>! In T357616#9665119, @kostajh wrote: > btw, in case this is relevant, https://logstash.wikimedia.org/goto/b0...
[14:21:38] serviceops, collaboration-services, Data-Persistence, DC-Ops, and 5 others: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9665563 (ops-monitoring-bot) jiji@cumin1002 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codf...
[14:41:09] serviceops, collaboration-services, Data-Persistence, DC-Ops, and 5 others: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9665627 (ops-monitoring-bot) jiji@cumin1002 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codf...
[14:44:19] serviceops, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate charts to Calico Network Policies - https://phabricator.wikimedia.org/T359423#9665648 (kamila)
[14:48:05] is people.wikimedia still in CODFW? The banner for people1004 says it's there but dyna.wikimedia.org is pointing to text-lb.eqiad.wikimedia.org's IP?
[14:59:09] inflatador: dyna. points to the CDN frontend address based on GeoDNS, for me it's esams not eqiad
[14:59:22] peopleweb.discovery.wmnet is what the CDN backend uses, and it's a CNAME to people2003
[14:59:25] cgoubert@cumin1002:~$ dig peopleweb.discovery.wmnet +short
[14:59:28] people2003.codfw.wmnet.
[14:59:43] serviceops: Improve readability of Switchover documentation - https://phabricator.wikimedia.org/T361113 (jijiki) NEW p:Triage→Low
[15:00:00] serviceops: Improve readability of Switchover documentation - https://phabricator.wikimedia.org/T361113#9665752 (jijiki)
[15:00:06] serviceops, collaboration-services, Data-Persistence, DC-Ops, and 5 others: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9665753 (jijiki)
[15:04:28] taavi, claime: thanks, got it
[15:35:27] serviceops, Deployments, Release-Engineering-Team, Patch-For-Review: httpbb appserver test breaks deployment of the week due to a timeout parsing page - https://phabricator.wikimedia.org/T360867#9665942 (RLazarus) >>! In T360867#9664888, @Clement_Goubert wrote: > The reason we're catching it now,...
[15:42:41] serviceops, Datacenter-Switchover: SRE comms for Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T358286#9665967 (jijiki) Open→Resolved p:Triage→Medium
[15:46:54] serviceops, MW-on-K8s, RESTBase, SRE, Patch-For-Review: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213#9665984 (Clement_Goubert)
[15:47:10] serviceops, collaboration-services, Data-Persistence, DC-Ops, and 5 others: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9665972 (jijiki) Open→Resolved a:jijiki Switchover is done, it is Day 8, and we are back to Multi-DC. Than...
[15:47:19] serviceops, MW-on-K8s, RESTBase, SRE, Patch-For-Review: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213#9665983 (Clement_Goubert) 50% {F43529353}
[15:52:18] serviceops, Deployments, Release-Engineering-Team, Patch-For-Review: httpbb appserver test breaks deployment of the week due to a timeout parsing page - https://phabricator.wikimedia.org/T360867#9665997 (CodeReviewBot) dancy opened https://gitlab.wikimedia.org/repos/releng/train-dev/-/...
[15:52:24] serviceops, Deployments, Release-Engineering-Team, Patch-For-Review: httpbb appserver test breaks deployment of the week due to a timeout parsing page - https://phabricator.wikimedia.org/T360867#9665992 (Clement_Goubert) Open→Resolved a:Clement_Goubert `--retry_on_timeout`...
[15:52:51] serviceops, Deployments, Release-Engineering-Team, Patch-For-Review: httpbb appserver test breaks deployment of the week due to a timeout parsing page - https://phabricator.wikimedia.org/T360867#9666018 (CodeReviewBot) dancy merged https://gitlab.wikimedia.org/repos/releng/train-dev/-/...
[16:16:38] serviceops, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate charts to Calico Network Policies - https://phabricator.wikimedia.org/T359423#9666129 (JMeybohm)
[16:37:20] serviceops, iPoid-Service, Observability-Logging, Patch-For-Review: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616#9666268 (JMeybohm) A systemd timer has been deployed to all kubernetes nodes that will check every hour if rsyslog has accumulated...