[01:28:31] 10serviceops, 10ChangeProp, 10Content-Transform-Team-WIP, 10Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (10Brycehughes) @akosiaris The enwikivoyage community [[ https://en.wikivoyage.org/wiki/Wikivoyage:Tr...
[08:37:28] 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-notice: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (10Ladsgroup) >>! In T308932#8627787, @Urbanecm wrote: >>>! In T308932#8603337, @gerritbot wrote...
[09:14:02] 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-notice: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (10Ladsgroup) I'll do permissions tomorrow.
[09:20:02] 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Jelto)
[09:51:05] folks, there are some alerts about k8s latencies for several wikikube nodes, is it known/wip?
[09:52:45] hmmm
[09:53:03] ranging from a couple of minutes up to 1H
[09:53:21] just for list_images ?
[09:53:40] looks like it
[09:55:17] I don't see a change in the number of list_images operations
[09:55:25] at the rate, that is.
[09:56:03] and the actual increase in latency in grafana is ... minuscule ?
[09:56:17] I think the probe is a bit too sensitive
[09:56:26] s/is/may be/
[09:57:04] some image pulls happened about 1H ago
[09:57:25] both clusters, nothing particularly worrying though
[10:00:43] https://lounge.uname.gr/uploads/b9bae4cb92539eb9/image.png
[10:00:51] yeah, not particularly worrying right now
[10:01:21] looking a bit more into it
[10:04:26] there were also some issues on wikikube staging over the weekend which are the result of me leaving puppet disabled there. I'll clean that up
[10:20:45] histogram_quantile(0.99, rate(kubelet_runtime_operations_duration_seconds_bucket{job="k8s-node"}[5m])) > 0.6
[10:20:47] for 5m
[10:20:51] that's the alert definition
[10:21:33] interestingly... I don't see more than 507ms in the last 1H
[10:21:46] how did 0.99 end up being > 600ms ...
[10:22:38] also interestingly, more and more hosts are alerting... what the
[10:24:39] ah no, they are actually flapping
[10:26:33] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[10:32:18] 10serviceops, 10Scap: Kubernetes configuration file is group-readable - https://phabricator.wikimedia.org/T329899 (10Clement_Goubert)
[10:32:36] per https://w.wiki/6MbV this should have been flapping on and off for the last week
[10:33:00] and the baseline is apparently ~500ms anyway
[10:34:15] 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-notice: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (10Urbanecm) >>! In T308932#8628692, @Ladsgroup wrote: > [...] >> Can we agree on dblists a new...
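A minimal sketch of how the alert expression quoted at 10:20:45 could be checked by hand, restricted to the list_images operation the alerts fire on; the Prometheus URL below is a placeholder, and the operation_type label is assumed to be the one the kubelet exports for these histograms:

```bash
# Query the same p99 the alert uses, but only for list_images operations.
# prometheus.example.internal is a placeholder, not the real Prometheus host.
curl -sG 'http://prometheus.example.internal/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, rate(kubelet_runtime_operations_duration_seconds_bucket{job="k8s-node", operation_type="list_images"}[5m]))' \
  | jq -r '.data.result[] | [.metric.instance, .value[1]] | @tsv'
```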
[10:34:18] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, 10Kubernetes: etcd cluster reimage strategies to use with the K8s upgrade cookbook - https://phabricator.wikimedia.org/T330060 (10elukey)
[10:36:44] I wonder what happened in december, we doubled the baseline from 250ms to 500ms after a few days of spikes, then it's fairly stable until ~11 days ago
[10:39:49] testing a small theory right now
[10:43:40] well, theory proven
[10:44:08] akosiaris@kubernetes1009:~$ time sudo docker image list > /dev/null
[10:44:08] real 0m0.597s
[10:44:16] yada yada on multiple runs
[10:44:18] then...
[10:44:28] sudo docker image prune -a, wait 4 minutes
[10:44:30] and ...
[10:44:38] akosiaris@kubernetes1009:~$ time sudo docker image list > /dev/null
[10:44:38] real 0m0.094s
[10:44:52] O(n) anyone ?
[10:44:55] Well time to put that in a crontab
[10:45:08] systemd-timer, whatever
[10:45:14] no, kubelet has high and low watermarks for GCing images
[10:45:16] k8s will take care of image purges only if there is pressure
[10:45:34] Ah, didn't know that
[10:45:44] yeah, I think we just want to adapt our threshold
[10:46:04] the kubelet will take care of the underlying issue, which is how much space we end up giving to images
[10:46:24] feel free to adapt the thresholds...I more or less made those values up when creating the alerts
[10:46:50] I was about to ask. I remember something like that, but thanks for confirming before I even asked
[10:47:05] time for my first alertmanager gerrit change I guess
[10:47:22] 🍿
[10:47:37] you're slowly getting out of your manager role :-)
[10:48:00] claime: btw --image-gc-high-threshold and --image-gc-low-threshold for the high/low watermarks I mentioned before
[10:48:10] ===== NODE GROUP =====
[10:48:12] (1) kubernetes1019.eqiad.wmnet
[10:48:14] ----- OUTPUT of 'docker image list | wc -l' -----
[10:48:15] defaults are 85% and 80%
[10:48:16] 918
[10:48:18] ===== NODE GROUP =====
[10:48:20] (1) kubernetes1011.eqiad.wmnet
[10:48:22] ----- OUTPUT of 'docker image list | wc -l' -----
[10:48:24] 746
[10:48:26] ===== NODE GROUP =====
[10:48:28] (1) kubernetes1008.eqiad.wmnet
[10:48:30] ----- OUTPUT of 'docker image list | wc -l' -----
[10:48:32] 552
[10:48:34] ===== NODE GROUP =====
[10:48:36] (1) kubernetes1018.eqiad.wmnet
[10:48:38] ----- OUTPUT of 'docker image list | wc -l' -----
[10:48:40] 743
[10:48:42] ===== NODE GROUP =====
[10:48:44] (1) kubernetes1021.eqiad.wmnet
[10:48:46] ----- OUTPUT of 'docker image list | wc -l' -----
[10:48:48] 577
[10:48:50] Oof sorry
[10:49:07] well, your bouncer saved you from a kick :P
[10:49:10] it was slow enough
[10:49:49] note btw that we are at 30% currently
[10:50:12] and defaults for --image-gc-high-threshold and --image-gc-low-threshold are 85% and 80% respectively
[10:50:18] so GC won't run
[10:52:06] my guess btw is that somewhere around the start of Feb, we moved to the point where some cache (which one...) isn't large enough any more
[10:52:48] anyway, I'll start with an alertmanager change and then dive a bit more into those GC numbers. Got to refresh my memory
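For reference, a quick way to see which image GC watermarks a kubelet is actually running with, and how full the image filesystem is; the --image-gc-* flags and the 85%/80% defaults are the ones named above, while the df target is an assumption about where images live on these nodes:

```bash
# Show the kubelet's image GC flags, if any were set explicitly
# (defaults are --image-gc-high-threshold=85 and --image-gc-low-threshold=80).
ps -o args= -C kubelet | tr ' ' '\n' | grep -E 'image-gc-(high|low)-threshold' \
  || echo "no explicit flags, kubelet defaults in use"

# How full the image/container filesystem is; image GC only starts above the
# high watermark, so at ~30% usage it never kicks in.
df -h /var/lib/docker
```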
[10:53:31] I think it may be better to change the thresholds for GC than the alarm, we generate like 10 mediawiki-webserver images a day
[10:54:04] on cgoubert@kubernetes1011:~$ sudo docker image list | awk '{print $1}' | sort | uniq -c | sort -hr | head -n2
[10:54:06] 554 docker-registry.discovery.wmnet/restricted/mediawiki-webserver
[10:54:08] 38 docker-registry.discovery.wmnet/restricted/mediawiki-multiversion
[10:55:19] more like 3, claime
[10:55:36] on kubernetes1009, they've been piling up since August
[10:55:41] so, ~6 months
[10:56:19] not the best resolution btw, I just divided 554/180
[10:56:28] it might have increased lately
[10:56:33] the rate, that is
[10:57:07] Depends on the day
[10:57:26] I have days with 16 images, others with 1
[10:59:25] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664 (10JMeybohm)
[10:59:43] ouch
[10:59:45] good point
[10:59:54] told you my resolution wasn't good enough
[11:00:00] around 6 a day
[11:00:08] btw, we caught this extra early. The alert looks at p99
[11:00:33] and admittedly p99 of 600ms for list_images is a bit aggressive.
[11:00:43] a tad
[11:00:57] cgoubert@kubernetes1011:~$ sudo docker image list | grep mediawiki-webserver | awk '{print $2}' | cut -d- -f1,2,3 | uniq -c | awk '{ print $1}' | ./avg.awk
[11:00:59] 6.6747
[11:01:06] Yeah almost 7 a day average, 6 median
[11:14:45] akosiaris: re: k8s pod ip ranges (https://phabricator.wikimedia.org/T326617#8575213) - I'm not super familiar with netbox: Do I just click "Add Prefix" on https://netbox.wikimedia.org/ipam/prefixes/379/prefixes and add the 10.194.128.0/18 there?
[11:24:18] akosiaris: should I downtime the mw hosts that are insetup? there's like 90 smart warnings in alertmanager
[11:24:44] meeting, be with both of you in a bit.
[11:24:48] ack
[11:54:01] 10serviceops, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Clement_Goubert)
[12:12:24] jayme: yes: https://netbox.wikimedia.org/ipam/prefixes/636/ and https://netbox.wikimedia.org/ipam/prefixes/635/ are ready for usage
[12:12:50] I've just created them (by just clicking add prefix), set them as active, added tags and fixed the description
[12:13:14] I also removed the "ask alex" from the parent prefixes and added "New" to differentiate them from the old ones
[12:14:48] we are now at 75% allocated space from the originally reserved /16s, but that should cover use cases for the immediate future quite fine.
[12:15:00] claime: yes, please do downtime them.
[12:15:08] ack
[12:16:03] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc-gp2002.codfw.wmnet with OS bullseye
[12:16:44] akosiaris: ack, downtiming for 2 weeks
[12:19:31] 10serviceops, 10SRE: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7c189d79-c66e-4544-923a-2145f8cedf2f) set by cgoubert@cumin1001 for 14 days, 0:00:00 on 32 host(s) and their services with reason: I...
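The downtime recorded in the bot message above was set from cumin1001; a sketch of what that invocation could look like follows, with the caveat that the exact cookbook flags are assumed and the reason string here is a placeholder (the real one is truncated in the bot message):

```bash
# Sketch only: sre.hosts.downtime flag names may differ from these,
# and the reason text is illustrative, not the one actually used.
sudo cookbook sre.hosts.downtime --days 14 \
  --reason "hosts still in insetup, silencing SMART warnings" \
  'mw24[20-51].codfw.wmnet'
```

Going back to the image build rate discussed earlier (around 10:53-11:01): dividing a grand total by ~180 days gives poor resolution, so a rough per-day breakdown can be pulled from the image creation dates instead. A sketch run on a node, assuming Docker's default CreatedAt format (date as the second field):

```bash
# Count mediawiki-webserver images per creation date on this node,
# instead of dividing the total by the number of days.
sudo docker image list --format '{{.Repository}} {{.CreatedAt}}' \
  | grep mediawiki-webserver \
  | awk '{print $2}' \
  | sort | uniq -c
```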
[12:20:02] akosiaris: do all these prefixes need the dns records like the 10.64.* ones?
[12:22:14] btw AFAICT https://netbox.wikimedia.org/ipam/prefixes/128/ is missing from the forward records
[12:22:49] volans: need? No. It's just courtesy for people when debugging
[12:23:12] also for that prefix, it's reserved as far as I see
[12:23:16] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10jijiki)
[12:23:18] so, not present ?
[12:23:40] also, these are service IPs. You will never see them used outside the cluster
[12:23:49] so, DNS is kinda useless
[12:24:10] 10serviceops, 10DC-Ops, 10ops-eqiad: Reset management module of mc1039 - https://phabricator.wikimedia.org/T330072 (10jijiki)
[12:24:15] but if this triggers some alert or something, we can add dummy RRs
[12:27:19] no, I don't think it triggers anything, I see that 10.64.72.0/24 that is also reserved is defined in the forward records (but empty in the reverse zone) and so I mentioned the 10.64.76.0/24 missing
[12:30:33] I guess we can add all of those. Won't hurt
[12:51:02] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc-gp2002.codfw.wmnet with OS bullseye completed: - mc-gp2002 (**PASS**) - Downtimed on Icinga/Alertmanager...
[13:05:49] 10serviceops, 10SRE: Connecting to https://api.svc.codfw.wmnet/ does not work - https://phabricator.wikimedia.org/T285517 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert This has been resolved in the meantime. ` cgoubert@cumin1001:~/cookbooks$ curl https://api.svc.eqiad.wmnet/ -vI 2>&1 | grep...
[13:05:53] 10serviceops, 10SRE, 10Datacenter-Switchover: Siteinfo timeout during switch datacenter - https://phabricator.wikimedia.org/T266618 (10Clement_Goubert)
[13:10:25] 10serviceops, 10SRE, 10Datacenter-Switchover: Siteinfo timeout during switch datacenter - https://phabricator.wikimedia.org/T266618 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert This has been fixed in https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/779841
[13:21:10] 10serviceops, 10Observability-Logging, 10SRE, 10Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10Clement_Goubert) @fgiunchedi Is this still relevant? Are there some specific steps to be taken for {T327920} ?
[13:50:43] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc-gp2003.codfw.wmnet with OS bullseye
[14:25:25] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc-gp2003.codfw.wmnet with OS bullseye completed: - mc-gp2003 (**PASS**) - Downtimed on Icinga/Alertmanager...
[14:25:45] akosiaris: thanks for taking that! <3
[14:29:02] 10serviceops, 10Observability-Logging, 10SRE, 10Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10fgiunchedi) >>! In T261274#8629518, @Clement_Goubert wrote: > @fgiunchedi Is this still relevant? Are there some specific steps to be taken for {...
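On the courtesy DNS records discussed above (12:20-12:30), a couple of dig lookups are enough to spot-check whether a prefix such as 10.64.76.0/24 has forward and reverse entries; the hostname in the second lookup is a placeholder:

```bash
# Reverse (PTR) lookup for an address in the prefix reported as missing
# from the forward records.
dig +short -x 10.64.76.1

# Forward check of whatever name the PTR returns (placeholder name here).
dig +short some-dummy-name.eqiad.wmnet
```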
[14:31:07] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, 10Kubernetes: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (10JMeybohm)
[14:40:30] 10serviceops, 10Scap: Kubernetes configuration file is group-readable - https://phabricator.wikimedia.org/T329899 (10JMeybohm) This has been discussed in helm a couple of times already (latest incarnation is https://github.com/helm/helm/issues/11083). We're having those files group-readable on purpose to allow...
[14:43:32] 10serviceops, 10Scap: Kubernetes configuration file is group-readable - https://phabricator.wikimedia.org/T329899 (10Clement_Goubert) This is not a big deal, but I understand the confusion. As mentioned in the irc discussion, a possible course of action is to make these files `0600` and have `scap` use `sudo`...
[14:44:07] 10serviceops, 10MW-on-K8s, 10Scap: Kubernetes configuration file is group-readable - https://phabricator.wikimedia.org/T329899 (10Clement_Goubert)
[14:48:17] 10serviceops, 10Observability-Logging, 10SRE, 10Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10Clement_Goubert) >>! In T261274#8629767, @fgiunchedi wrote: >>>! In T261274#8629518, @Clement_Goubert wrote: >> @fgiunchedi Is this still relevan...
[18:59:25] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10RZamora-WMF)
[20:27:02] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc-gp1002.eqiad.wmnet with OS bullseye
[20:58:27] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc-gp1002.eqiad.wmnet with OS bullseye completed: - mc-gp1002 (**PASS**) - Downtimed on Icinga/Alertmanager...
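On the group-readable kubeconfig question (T329899, 14:40-14:44 above), the course of action floated in the task would look roughly like this on a deployment host; the path is illustrative, not the actual location of the files:

```bash
# Inspect current mode and ownership of the kubeconfig files
# (path is illustrative, not the real one).
stat -c '%a %U:%G %n' /etc/kubernetes/*.config

# Tighten to 0600 as discussed; scap would then need sudo to read them.
sudo chmod 0600 /etc/kubernetes/*.config
```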