[08:43:19] serviceops, Data-Persistence: Sessionstore's discovery TLS cert will expire before end of May 2024 - https://phabricator.wikimedia.org/T363996#9793527 (jijiki) a:JMeybohm→hnowlan
[08:49:47] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793530 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemast...
[08:52:52] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793542 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemast...
[09:14:22] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793566 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemast...
[09:19:04] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793580 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster10...
[09:20:15] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793581 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster10...
[09:45:28] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793680 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster10...
[09:48:20] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793708 (ops-monitoring-bot) VM kubestagemaster1003.eqiad.wmnet switching disk type to plain
[09:48:58] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793711 (ops-monitoring-bot) VM kubestagemaster1004.eqiad.wmnet switching disk type to plain
[09:50:03] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793712 (ops-monitoring-bot) VM kubestagemaster1005.eqiad.wmnet switching disk type to plain
[09:56:00] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Site: eqiad 3 VM request for staging-eqiad kube-apiserver - https://phabricator.wikimedia.org/T364746#9793752 (JMeybohm) Open→Resolved
[10:05:55] re: deployment-charts run_locally + podman from yesterday, I switched to docker and it works :|
[10:46:37] serviceops, [DEPRECATED] wdwb-tech, Citoid, Content-Transform-Team-WIP, and 10 others: Upgrade mobileapps to node 18 - https://phabricator.wikimedia.org/T363168#9793941 (Dibohwendy377) 8141803742 opay Wendy chinasa Diboh
[11:25:36] serviceops, Data-Persistence: Sessionstore's discovery TLS cert will expire before end of May 2024 - https://phabricator.wikimedia.org/T363996#9794048 (hnowlan) Steps that I see: - Renewing the existing cergen cert to give us breathing room just in case. We're looking at less than 2 weeks of headroom f...
[11:48:24] serviceops, Data-Persistence: Sessionstore's discovery TLS cert will expire before end of May 2024 - https://phabricator.wikimedia.org/T363996#9794143 (JMeybohm) Also worth noting that mediawiki already calls sessionstore via its envoy sidecar, so we do have telemetry data from prod and we should be abl...
[12:08:06] serviceops, MoveComms-Support, MW-on-K8s, SRE, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9794186 (Clement_Goubert) We are currently holding at 85% of global traffic, and as such not reimaging anymore serv...
[12:21:17] claime: with increased wikikube numbers we're naturally hitting the ferm/kube-proxy race more often and it feels a little noisy. Given that it's now remediated, let's maybe tweak the alert so that it only alerts if the unavailability exceeds the time frame within which the toil class would otherwise have fixed it, or IOW make it only alert if the toil class failed for some reason?
[12:21:52] moritzm: yeah, that's a good idea
[12:22:29] We'd need to overload the existing alert
[12:23:07] something like that, so that it only gets applied to k8s workers
[12:23:17] and masters presumably
[12:23:30] Wait that's the icinga check being noisy...
[12:23:35] ugh
[12:24:47] check_ferm is shipped from Puppet and a shell script, we can e.g. add a check in there whether kube-proxy is running or similar?
[12:26:20] yeah, or maybe better would be a role check in profile::firewall?
[12:26:51] that would also work, yes
[12:29:13] given that it's an icinga check, can't we just add a hiera key for the number of retries before it alerts and then increase that for the k8s worker nodes?
[12:49:36] serviceops, Content-Transform-Team, Essential-Work, Wikimedia-Incident: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324#9794319 (elukey) I also found this interesting project that explains the issue very well: https://github.com/Kri...
[12:59:57] There should be 3 retries, at 30-minute intervals, before it alerts
[13:00:26] And puppet runs every 30 minutes, and it's the puppet run that should restart ferm if it's in a bad state
[13:02:19] serviceops, Content-Transform-Team, Essential-Work, Wikimedia-Incident: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324#9794372 (elukey) I took the time to re-read the whole task, and one thing that I missed was the fact that after...
[13:06:46] Ah no, retry interval is 1, so it tries 3 times, 1 minute apart
[13:06:53] so yeah, not enough time for puppet to run
[13:52:18] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Allow to address Kubernets API servers from NetworkPolicy - https://phabricator.wikimedia.org/T287491#9794702 (jijiki) @dcausse and I will deploy [[ https://gerrit.wikimedia.org/r/1029573 | 1029573 ]] tomorrow EU morning
[13:54:14] effie: o/ around? Do you want to test thanos?
[13:54:51] nevermind sorry I need to check some pods first :(
[13:55:08] flink and others call thanos, need to check if they use the right ca bundle
[13:55:12] tomorrow :)
[13:55:54] hehe
[14:42:07] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9794972 (Jclark-ctr) kafka-main1010 Rack: E 5 U 26 Cableid : 2013339101771 Port : 6
[15:24:13] there are disk space alerts for the registry* hosts, I'm pruning a few older kernels
[15:28:17] all done
[15:28:44] thanks
[15:29:04] on Bullseye and later we're using automated removals of old kernel images, but the underlying feature doesn't exist in buster
[16:27:48] serviceops, SRE, Traffic-Icebox, Trust and Safety Product Team: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933#9795878 (TAdeleye_WMF)
[16:41:13] serviceops, ops-codfw, SRE: InterfaceSpeedError - mw2286 - https://phabricator.wikimedia.org/T364863#9796105 (Dzahn)
[16:44:57] serviceops, ops-codfw, SRE: InterfaceSpeedError - mw2286 - https://phabricator.wikimedia.org/T364863#9796143 (Dzahn) @Jhancock.wm cc: @RLazarus I depooled the server and set a downtime of 24 hours.
[16:51:13] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9796178 (VRiley-WMF)
[20:27:50] serviceops, Cassandra, SRE, Data Products (Data Products Sprint 13), Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921 (Eevans) NEW
[20:47:09] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9797373 (VRiley-WMF)
[21:16:20] serviceops, Cassandra, SRE, Data Products (Data Products Sprint 13), Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797524 (Scott_French) Thanks, @Eevans. If you can drive development of the new data gateway (i.e., base...
[21:22:13] serviceops: docker-reporter-base-images.service failed on build2001 - https://phabricator.wikimedia.org/T364931#9797604 (Volans)
[21:24:52] serviceops, Cassandra, SRE, Data Products (Data Products Sprint 13), Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797608 (Scott_French)
[21:36:35] serviceops, Cassandra, SRE, Data Products (Data Products Sprint 13), Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797656 (Eevans)
[21:47:06] serviceops, Cassandra, SRE, Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797672 (Eevans)
[21:48:33] serviceops, Cassandra, SRE, Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797670 (CodeReviewBot) eevans opened https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/merge_requests/74...
[21:58:07] serviceops, Cassandra, SRE, Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9797746 (CodeReviewBot) dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/merge_requests/74 A...
[22:25:57] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9797806 (Jclark-ctr) @akosiaris could you please update the preseed.yaml file? I did take care of the site.pp file for codfw and eqiad