[01:28:31] 10serviceops, 10ChangeProp, 10Content-Transform-Team-WIP, 10Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (10Brycehughes) @akosiaris The enwikivoyage community [[ https://en.wikivoyage.org/wiki/Wikivoyage:Tr...
[08:37:28] 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-notice: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (10Ladsgroup) >>! In T308932#8627787, @Urbanecm wrote: >>>! In T308932#8603337, @gerritbot wrote...
[09:14:02] 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-notice: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (10Ladsgroup) I'll do permissions tomorrow.
[09:20:02] 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Jelto)
[09:51:05] folks, there are some alerts about k8s latencies for several wikikube nodes, is it known/wip?
[09:52:45] hmmm
[09:53:03] ranging from a couple of minutes up to 1H
[09:53:21] just for list_images ?
[09:53:40] looks like it
[09:55:17] I don't see a change in the number of list_images operations
[09:55:25] at the rate, that is.
[09:56:03] and the actual increase in latency in grafana is ... minuscule ?
[09:56:17] I think the probe is a bit too sensitive
[09:56:26] s/is/may be/
[09:57:04] some image pulls happened about 1H ago
[09:57:25] both clusters, nothing particularly worrying though
[10:00:43] https://lounge.uname.gr/uploads/b9bae4cb92539eb9/image.png
[10:00:51] yeah, not particularly worrying right now
[10:01:21] looking a bit more into it
[10:04:26] there were also some issues on wikikube staging over the weekend which are the result of me leaving puppet disabled there. I'll clean that up
[10:20:45] histogram_quantile(0.99, rate(kubelet_runtime_operations_duration_seconds_bucket{job="k8s-node"}[5m])) > 0.6
[10:20:47] for 5m
[10:20:51] that's the alert definition
[10:21:33] interestingly... I don't see more than 507ms in the last 1H
[10:21:46] how did 0.99 end up being > 600ms ...
[10:22:38] also interestingly, more and more hosts are alerting... what the
[10:24:39] ah no, they are actually flapping
[10:26:33] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[10:32:18] 10serviceops, 10Scap: Kubernetes configuration file is group-readable - https://phabricator.wikimedia.org/T329899 (10Clement_Goubert)
[10:32:36] per https://w.wiki/6MbV this should have been flapping on and off for the last week
[10:33:00] and the baseline is apparently ~500ms anyway
[10:34:15] 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-notice: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (10Urbanecm) >>! In T308932#8628692, @Ladsgroup wrote: > [...] >> Can we agree on dblists a new...
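A minimal sketch of how the alert expression quoted at 10:20:45 could be checked by hand, restricted to the list_images operation the alerts fire on; the Prometheus URL below is a placeholder, and the operation_type label is assumed to be the one the kubelet exports for these histograms:

```bash
# Query the same p99 the alert uses, but only for list_images operations.
# prometheus.example.internal is a placeholder, not the real Prometheus host.
curl -sG 'http://prometheus.example.internal/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, rate(kubelet_runtime_operations_duration_seconds_bucket{job="k8s-node", operation_type="list_images"}[5m]))' \
  | jq -r '.data.result[] | [.metric.instance, .value[1]] | @tsv'
```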
[10:34:18] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, 10Kubernetes: etcd cluster reimage strategies to use with the K8s upgrade cookbook - https://phabricator.wikimedia.org/T330060 (10elukey)
[10:36:44] I wonder what happened in december, we doubled the baseline from 250ms to 500ms after a few days of spikes, then it's fairly stable until ~11 days ago
[10:39:49] testing a small theory right now
[10:43:40] well, theory proven
[10:44:08] akosiaris@kubernetes1009:~$ time sudo docker image list > /dev/null
[10:44:08] real 0m0.597s
[10:44:16] yada yada on multiple runs
[10:44:18] then...
[10:44:28] sudo docker image prune -a, wait 4 minutes
[10:44:30] and ...
[10:44:38] akosiaris@kubernetes1009:~$ time sudo docker image list > /dev/null
[10:44:38] real 0m0.094s
[10:44:52] O(n) anyone ?
[10:44:55] Well time to put that in a crontab
[10:45:08] systemd-timer, whatever
[10:45:14] no, kubelet has high and low watermarks for GCing images
[10:45:16] k8s will take care of image purges only if there is pressure
[10:45:34] Ah, didn't know that
[10:45:44] yeah, I think we just want to adapt our threshold
[10:46:04] the kubelet will take care of the underlying issue, which is how much space we end up giving to images
[10:46:24] feel free to adapt the thresholds...I more or less made those values up when creating the alerts
[10:46:50] I was about to ask. I remember something like that, but thanks for confirming before I even asked
[10:47:05] time for my first alertmanager gerrit change I guess
[10:47:22] 🍿
[10:47:37] you're slowly getting out of your manager role :-)
[10:48:00] claime: btw --image-gc-high-threshold and --image-gc-low-threshold for the high/low watermarks I mentioned before
[10:48:10] ===== NODE GROUP =====
[10:48:12] (1) kubernetes1019.eqiad.wmnet
[10:48:14] ----- OUTPUT of 'docker image list | wc -l' -----
[10:48:15] defaults are 85% and 80%
[10:48:16] 918
[10:48:18] ===== NODE GROUP =====
[10:48:20] (1) kubernetes1011.eqiad.wmnet
[10:48:22] ----- OUTPUT of 'docker image list | wc -l' -----
[10:48:24] 746
[10:48:26] ===== NODE GROUP =====
[10:48:28] (1) kubernetes1008.eqiad.wmnet
[10:48:30] ----- OUTPUT of 'docker image list | wc -l' -----
[10:48:32] 552
[10:48:34] ===== NODE GROUP =====
[10:48:36] (1) kubernetes1018.eqiad.wmnet
[10:48:38] ----- OUTPUT of 'docker image list | wc -l' -----
[10:48:40] 743
[10:48:42] ===== NODE GROUP =====
[10:48:44] (1) kubernetes1021.eqiad.wmnet
[10:48:46] ----- OUTPUT of 'docker image list | wc -l' -----
[10:48:48] 577
[10:48:50] Oof sorry
[10:49:07] well, your bouncer saved you from a kick :P
[10:49:10] it was slow enough
[10:49:49] note btw that we are at 30% currently
[10:50:12] and defaults for --image-gc-high-threshold and --image-gc-low-threshold are 85% and 80% respectively
[10:50:18] so GC won't run
[10:52:06] my guess btw is that somewhere around the start of Feb, we moved to the point where some cache (which one...) isn't large enough any more
[10:52:48] anyway, I'll start with an alertmanager change and then dive a bit more into those GC numbers. Got to refresh my memory
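For reference, a quick way to see which image GC watermarks a kubelet is actually running with, and how full the image filesystem is; the --image-gc-* flags and the 85%/80% defaults are the ones named above, while the df target is an assumption about where images live on these nodes:

```bash
# Show the kubelet's image GC flags, if any were set explicitly
# (defaults are --image-gc-high-threshold=85 and --image-gc-low-threshold=80).
ps -o args= -C kubelet | tr ' ' '\n' | grep -E 'image-gc-(high|low)-threshold' \
  || echo "no explicit flags, kubelet defaults in use"

# How full the image/container filesystem is; image GC only starts above the
# high watermark, so at ~30% usage it never kicks in.
df -h /var/lib/docker
```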
[10:53:31] I think it may be better to change the thresholds for GC than the alarm, we generate like 10 mediawiki-webserver images a day
[10:54:04] on cgoubert@kubernetes1011:~$ sudo docker image list | awk '{print $1}' | sort | uniq -c | sort -hr | head -n2
[10:54:06] 554 docker-registry.discovery.wmnet/restricted/mediawiki-webserver
[10:54:08] 38 docker-registry.discovery.wmnet/restricted/mediawiki-multiversion
[10:55:19] more like 3, claime
[10:55:36] on kubernetes1009, they've been piling up since August
[10:55:41] so, ~6 months
[10:56:19] not the best resolution btw, I just divided 554/180
[10:56:28] it might have increased lately
[10:56:33] the rate, that is
[10:57:07] Depends on the day
[10:57:26] I have days with 16 images, others with 1
[10:59:25] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[10:59:43] ouch
[10:59:45] good point
[10:59:54] told you my resolution wasn't good enough
[11:00:00] around 6 a day
[11:00:08] btw, we caught this extra early. The alert looks at p99
[11:00:33] and admittedly p99 of 600ms for list_images is a bit aggressive.
[11:00:43] a tad
[11:00:57] cgoubert@kubernetes1011:~$ sudo docker image list | grep mediawiki-webserver | awk '{print $2}' | cut -d- -f1,2,3 | uniq -c | awk '{ print $1}' | ./avg.awk
[11:00:59] 6.6747
[11:01:06] Yeah almost 7 a day average, 6 median
[11:14:45] akosiaris: re: k8s pod ip ranges (https://phabricator.wikimedia.org/T326617#8575213) - I'm not super familiar with netbox: Do I just click "Add Prefix" on https://netbox.wikimedia.org/ipam/prefixes/379/prefixes and add the 10.194.128.0/18 there?
[11:24:18] akosiaris: should I downtime the mw hosts that are insetup? there's like 90 smart warnings in alertmanager
[11:24:44] meeting, be with both of you in a bit.
[11:24:48] ack
[11:54:01] 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Clement_Goubert)
[12:12:24] jayme: yes: https://netbox.wikimedia.org/ipam/prefixes/636/ and https://netbox.wikimedia.org/ipam/prefixes/635/ are ready for usage
[12:12:50] I've just created them (by just clicking add prefix), set them as active, added tags and fixed the description
[12:13:14] I also removed the "ask alex" from the parent prefixes and added "New" to differentiate them from the old ones
[12:14:48] we are now at 75% allocated space from the originally reserved /16s, but that should cover use cases for the immediate future quite fine.
[12:15:00] claime: yes, please do downtime them.
[12:15:08] ack
[12:16:03] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc-gp2002.codfw.wmnet with OS bullseye
[12:16:44] akosiaris: ack, downtiming for 2 weeks
[12:19:31] 10serviceops, 10SRE: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7c189d79-c66e-4544-923a-2145f8cedf2f) set by cgoubert@cumin1001 for 14 days, 0:00:00 on 32 host(s) and their services with reason: I...
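The downtime recorded in the bot message above was set from cumin1001; a sketch of what that invocation could look like follows, with the caveat that the exact cookbook flags are assumed and the reason string here is a placeholder (the real one is truncated in the bot message):

```bash
# Sketch only: sre.hosts.downtime flag names may differ from these,
# and the reason text is illustrative, not the one actually used.
sudo cookbook sre.hosts.downtime --days 14 \
  --reason "hosts still in insetup, silencing SMART warnings" \
  'mw24[20-51].codfw.wmnet'
```

Going back to the image build rate discussed earlier (around 10:53-11:01): dividing a grand total by ~180 days gives poor resolution, so a rough per-day breakdown can be pulled from the image creation dates instead. A sketch run on a node, assuming Docker's default CreatedAt format (date as the second field):

```bash
# Count mediawiki-webserver images per creation date on this node,
# instead of dividing the total by the number of days.
sudo docker image list --format '{{.Repository}} {{.CreatedAt}}' \
  | grep mediawiki-webserver \
  | awk '{print $2}' \
  | sort | uniq -c
```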
[12:20:02] akosiaris: do all these prefixes need the dns records like the 10.64.* ones?
[12:22:14] btw AFAICT https://netbox.wikimedia.org/ipam/prefixes/128/ is missing from the forward records
[12:22:49] volans: need? No. It's just courtesy for people when debugging
[12:23:12] also for that prefix, it's reserved as far as I see
[12:23:16] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10jijiki)
[12:23:18] so, not present ?
[12:23:40] also, these are service IPs. You will never see them used outside the cluster
[12:23:49] so, DNS is kinda useless
[12:24:10] 10serviceops, 10DC-Ops, 10ops-eqiad: Reset management module of mc1039 - https://phabricator.wikimedia.org/T330072 (10jijiki)
[12:24:15] but if this triggers some alert or something, we can add dummy RRs
[12:27:19] no, I don't think it triggers anything, I see that 10.64.72.0/24 that is also reserved is defined in the forward records (but empty in the reverse zone) and so I mentioned the 10.64.76.0/24 missing
[12:30:33] I guess we can add all of those. Won't hurt
[12:51:02] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc-gp2002.codfw.wmnet with OS bullseye completed: - mc-gp2002 (**PASS**) - Downtimed on Icinga/Alertmanager...
[13:05:49] 10serviceops, 10SRE: Connecting to https://api.svc.codfw.wmnet/ does not work - https://phabricator.wikimedia.org/T285517 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert This has been resolved in the meantime. ` cgoubert@cumin1001:~/cookbooks$ curl https://api.svc.eqiad.wmnet/ -vI 2>&1 | grep...
[13:05:53] 10serviceops, 10SRE, 10Datacenter-Switchover: Siteinfo timeout during switch datacenter - https://phabricator.wikimedia.org/T266618 (10Clement_Goubert)
[13:10:25] 10serviceops, 10SRE, 10Datacenter-Switchover: Siteinfo timeout during switch datacenter - https://phabricator.wikimedia.org/T266618 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert This has been fixed in https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/779841
[13:21:10] 10serviceops, 10Observability-Logging, 10SRE, 10Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10Clement_Goubert) @fgiunchedi Is this still relevant? Are there some specific steps to be taken for {T327920} ?
[13:50:43] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc-gp2003.codfw.wmnet with OS bullseye
[14:25:25] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc-gp2003.codfw.wmnet with OS bullseye completed: - mc-gp2003 (**PASS**) - Downtimed on Icinga/Alertmanager...
[14:25:45] akosiaris: thanks for taking that! <3
[14:29:02] 10serviceops, 10Observability-Logging, 10SRE, 10Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10fgiunchedi) >>! In T261274#8629518, @Clement_Goubert wrote: > @fgiunchedi Is this still relevant? Are there some specific steps to be taken for {...
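On the courtesy DNS records discussed above (12:20-12:30), a couple of dig lookups are enough to spot-check whether a prefix such as 10.64.76.0/24 has forward and reverse entries; the hostname in the second lookup is a placeholder:

```bash
# Reverse (PTR) lookup for an address in the prefix reported as missing
# from the forward records.
dig +short -x 10.64.76.1

# Forward check of whatever name the PTR returns (placeholder name here).
dig +short some-dummy-name.eqiad.wmnet
```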
[14:31:07] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, 10Kubernetes: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (10JMeybohm)
[14:40:30] 10serviceops, 10Scap: Kubernetes configuration file is group-readable - https://phabricator.wikimedia.org/T329899 (10JMeybohm) This has been discussed in helm a couple of times already (latest incarnation is https://github.com/helm/helm/issues/11083). We're having those files group-readable on purpose to allow...
[14:43:32] 10serviceops, 10Scap: Kubernetes configuration file is group-readable - https://phabricator.wikimedia.org/T329899 (10Clement_Goubert) This is not a big deal, but I understand the confusion. As mentioned in the irc discussion, a possible course of action is to make these files `0600` and have `scap` use `sudo`...
[14:44:07] 10serviceops, 10MW-on-K8s, 10Scap: Kubernetes configuration file is group-readable - https://phabricator.wikimedia.org/T329899 (10Clement_Goubert)
[14:48:17] 10serviceops, 10Observability-Logging, 10SRE, 10Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10Clement_Goubert) >>! In T261274#8629767, @fgiunchedi wrote: >>>! In T261274#8629518, @Clement_Goubert wrote: >> @fgiunchedi Is this still relevan...
[18:59:25] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10RZamora-WMF)
[20:27:02] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc-gp1002.eqiad.wmnet with OS bullseye
[20:58:27] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc-gp1002.eqiad.wmnet with OS bullseye completed: - mc-gp1002 (**PASS**) - Downtimed on Icinga/Alertmanager...
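On the group-readable kubeconfig question (T329899, 14:40-14:44 above), the course of action floated in the task would look roughly like this on a deployment host; the path is illustrative, not the actual location of the files:

```bash
# Inspect current mode and ownership of the kubeconfig files
# (path is illustrative, not the real one).
stat -c '%a %U:%G %n' /etc/kubernetes/*.config

# Tighten to 0600 as discussed; scap would then need sudo to read them.
sudo chmod 0600 /etc/kubernetes/*.config
```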