[00:17:42] serviceops: mw-api-ext unavailability 2024-05-22 18:30 UTC - https://phabricator.wikimedia.org/T365655#9823954 (Scott_French) The sflow data is now available in turnilo. The high network tx from mw-api-ext does indeed appear directed to mwlog1002 (https://w.wiki/AAA2). Specifically, the peak around ~ 18:35...
[07:49:18] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571#9824394 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by ayounsi@cumin1002 from kubernetes2023 to wikikube-worker2001 completed:...
[08:02:26] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571#9824426 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by ayounsi@cumin1002 from kubernetes2023 to wikikube-worker2001 completed:...
[08:07:58] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571#9824455 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2001.codfw.wmnet with OS b...
[08:33:49] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9824581 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin2002 for host mc1054.eqiad.wmnet with OS bookworm
[08:33:50] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9824582 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host mc2054.codfw.wmnet with OS bookworm
[08:45:06] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571#9824629 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2001.codfw.wmnet with OS bulls...
[09:08:58] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9824721 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin2002 for host mc1054.eqiad.wmnet with OS bookworm completed: - mc1054 (**PASS**) - Dow...
[09:12:55] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9824729 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host mc2054.codfw.wmnet with OS bookworm completed: - mc2054 (**PASS**) - Dow...
[09:31:42] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571#9824864 (JMeybohm) After the reimage I needed to run the following for calico to start up properly: ` kubectl delete node kubernetes2023.codfw.wmn...
[09:34:39] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571#9824895 (JMeybohm) `kubernetes2023` is still cordoned and depooled for additional tests of the move v-lan process
[10:44:40] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9825061 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host mc1053.eqiad.wmnet with OS bookworm
[10:44:53] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9825062 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin2002 for host mc2053.codfw.wmnet with OS bookworm
[11:08:32] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9825093 (jijiki)
[11:19:09] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9825116 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host mc1053.eqiad.wmnet with OS bookworm completed: - mc1053 (**PASS**) - Dow...
[11:24:35] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9825130 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin2002 for host mc2053.codfw.wmnet with OS bookworm completed: - mc2053 (**PASS**) - Dow...
[11:43:24] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9825170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host mc1052.eqiad.wmnet with OS bookworm
[11:43:33] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9825171 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin2002 for host mc2052.codfw.wmnet with OS bookworm
[12:17:48] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9825280 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host mc1052.eqiad.wmnet with OS bookworm completed: - mc1052 (**PASS**) - Dow...
[12:21:51] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9825290 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin2002 for host mc2052.codfw.wmnet with OS bookworm completed: - mc2052 (**PASS**) - Dow...
[12:47:18] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571#9825430 (ayounsi) Before I forget, please notify DCops so they update the physical labels on the server.
[12:57:38] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9825465 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host mc1051.eqiad.wmnet with OS bookworm
[12:58:41] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9825468 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin2002 for host mc2051.codfw.wmnet with OS bookworm
[12:59:28] serviceops, Machine-Learning-Team, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9825487 (elukey)
[13:00:34] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Rename X-Wikimedia-Debug k8s-experimental option - https://phabricator.wikimedia.org/T362662#9825484 (Jdforrester-WMF) In progress→Resolved
[13:01:20] serviceops, Machine-Learning-Team, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9825495 (elukey) I had to copy more packages (updated the task's description), but everything worked fine on ml-staging2001. The ML team is unblocked and can...
[13:20:56] serviceops, Machine-Learning-Team, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9825618 (MoritzMuehlenhoff) Dragonfly is an internally built golang package, it would be better if we properly rebuilt it on bookworm with current Go, otherwi...
[13:22:11] FYI, I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1035365 in 10-15 mins
[13:26:53] serviceops, DC-Ops, ops-eqiad: Relabel eqiad Kubernetes hosts - https://phabricator.wikimedia.org/T365711 (hnowlan) NEW
[13:29:18] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9825672 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host mc1051.eqiad.wmnet with OS bookworm completed: - mc1051 (**PASS**) - Dow...
[13:30:55] serviceops, DC-Ops, ops-codfw: Relabel codfw Kubernetes hosts - https://phabricator.wikimedia.org/T365712 (hnowlan) NEW
[13:41:23] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9825730 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin2002 for host mc2051.codfw.wmnet with OS bookworm completed: - mc2051 (**PASS**) - Dow...
[13:47:37] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1035365 is deployed on a few hosts, I'm gradually enabling Puppet on workers
[13:56:16] serviceops, Machine-Learning-Team, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9825766 (elukey) >>! In T365253#9825618, @MoritzMuehlenhoff wrote: > Dragonfly is an internally built golang package, it would be better if we properly rebuil...
[13:56:37] hi serviceops folks, is there an easy way to get the WikimediaDebug browser extension to be able to route a request to a *particular* k8s mw-debug instance?
like on codfw for instance :)
[14:13:27] serviceops, DC-Ops, ops-eqiad, SRE: Relabel eqiad Kubernetes hosts - https://phabricator.wikimedia.org/T365711#9825819 (Jclark-ctr) a:Jclark-ctr
[14:44:00] serviceops, DC-Ops, ops-codfw, SRE: Relabel codfw Kubernetes hosts - https://phabricator.wikimedia.org/T365712#9826025 (Jhancock.wm) Open→Resolved a:Jhancock.wm completed
[14:59:49] serviceops, Content-Transform-Team, Essential-Work, Patch-For-Review, Wikimedia-Incident: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324#9826058 (elukey) The new code is deployed in staging, and it uses the envoy sidecar to con...
[15:15:23] tegola is running in staging using the envoy sidecar for Thanos Swift!
[15:15:42] !
[15:15:43] nice!
[15:16:17] I filed a patch upstream to hopefully get it integrated in their code
[15:16:45] do we have a way to test maps in staging by any chance?
[15:27:31] not really afaik given the way it works, but it's push rather than pull - nemo-yiannis might know better
[15:29:54] i don't think we have staging connected to a kartotherian instance so there is no straightforward way to test actual map tiles
[15:32:22] we can make a list of valid /z/x/y tiles though, purge them and then GET them via HTTP from staging
[15:35:42] nemo-yiannis: if you have time to do it I can surely help in testing etc..!
[15:36:05] the task is https://phabricator.wikimedia.org/T344324#
[15:37:18] hm, maybe another easier way to test things is to create a new empty swift container for staging
[15:37:28] namely: I have no idea how to find valid tiles, purge them, etc.., but I can do it if I read how to do it :D
[15:37:29] and then see if it gets populated for the requests we send
[15:37:53] staging should have its own bucket in theory
[15:38:00] lemme double check
[15:38:44] yep bucket = "tegola-swift-staging-container"
[15:52:45] elukey: https://phabricator.wikimedia.org/P63007
[15:53:01] something like that should do the trick
[15:53:19] if we have an empty bucket for staging we should see it getting populated
[15:53:38] serviceops, Language-Technical Support, SRE, Wikimedia-Site-requests, Patch-For-Review: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9826396 (cscott) So, having written the above two patches to replace byte-size limits with character-size...
[15:53:39] and then on the second GET we should see in the headers: < tegola-cache: HIT
[15:54:58] the initial GET should be:
serviceops, MW-Interfaces-Team, Traffic: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400#9826433 (FJoseph-WMF) p:Triage→High
[15:59:10] thanks!
[15:59:21] so the first one is a hit, the second one (https://staging.svc.eqiad.wmnet:4105/maps/osm/1/0/0) hangs
[15:59:30] not sure why, I don't see any clear indication in the logs
[16:02:04] I am wondering if it is trying to contact postgres and it doesn't work
[16:03:17] I can try to quickly deploy the pod without the proxy settings, and see if it works
[16:03:49] isn't staging connected to a postgres instance?
[16:04:22] i think it should be
[16:04:26] in theory yes, not sure if it works though
[16:04:34] hm
[16:04:42] i think it wouldn't work without a working postgres instance
[16:04:51] yep it hangs even without the proxy settings
[16:05:12] yep yep what I mean is that a postgres instance is configured etc.., but maybe we never really tested it
[16:05:20] so it is missing something causing the hang
[16:05:57] if i remember correctly, on boot it creates a test tile so it should have a connection to postgres (but it's been a while so not 100% sure)
[16:05:59] * nemo-yiannis checks logs
[16:07:32] maybe it actually hangs because it tries to compute a lot of stuff
[16:07:42] for example 10/0/0 (zoomed in) works
[16:10:17] nemo-yiannis: right good point! So 11/0/0 works as well, and I see a tegola-cache MISS
[16:10:24] so probably 1/0/0 is huge
[16:10:28] lemme check one thing
[16:12:36] I don't see a big spike in CPU usage in https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s-staging&var-namespace=tegola-vector-tiles&var-pod=All&var-container=All&from=now-15m&to=now
[16:12:58] so it is surely doing something but it is unclear what
[16:15:57] this could become a good acceptance test for staging when we deploy new versions of tegola :D
[16:16:51] tried to use perf but the tegola binary is probably built with symbols stripped
[16:18:01] serviceops, DC-Ops, ops-eqiad, SRE: Relabel eqiad Kubernetes hosts - https://phabricator.wikimedia.org/T365711#9826540 (Jclark-ctr) Open→Resolved relabeled servers
[16:18:51] nemo-yiannis: if I remove the 1/X/X ones the rest seem to work fine
[16:19:32] ok
[16:20:39] a stream of MISS followed by a stream of HIT
[16:20:51] i am double-checking the patches but as long as we are only changing S3-related stuff and this worked we should be good to go for prod
[16:21:33] nemo-yiannis: this is the pull request to upstream https://github.com/go-spatial/tegola/pull/992
[16:21:45]
that basically has the two patches that I merged in our 0.19.x branch
[16:21:52] cool, i was checking our merged patches
[16:22:28] i think it's ok to try it in prod
[16:22:48] did you actually verify that there is traffic in the envoy sidecar?
[16:23:33] I did yes! https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s-staging&var-app=tegola-vector-tiles&var-kubernetes_namespace=tegola-vector-tiles&var-destination=All
[16:27:48] serviceops, Content-Transform-Team, Essential-Work, Patch-For-Review, Wikimedia-Incident: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324#9826584 (elukey) I had a chat with @Jgiannelos on IRC, and this was the testing done in s...
[16:27:59] nemo-yiannis: posted in the task the summary of what we did
[16:28:03] thanks a lot for the help
[16:28:17] 👍
[16:28:23] I'd say that we can wait for upstream to validate the code, and then deploy to prod?
[16:29:08] that's ok
[16:29:32] super
[16:29:34] Cc: effie: --^
[16:29:43] we had our own patches in prod before so we can deploy our latest image to prod if we are confident
[16:31:39] yes definitely, it is not super urgent so if we can be consistent with upstream it is better
[16:31:54] so when we import new branches we'll just use the vanilla upstream code
[16:31:57] (hopefully)
[16:35:47] ok
[16:36:21] serviceops, Language-Technical Support, SRE, Wikimedia-Site-requests, Patch-For-Review: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9826632 (Fuzzy) @cscott, thank you very much for your work on this issue. I completely agree that changing...
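The staging check discussed above (GET each tile twice against staging, expect a tegola-cache MISS on the first request and a HIT on the second) could be sketched roughly as follows. This is only an illustration, not the script from the paste linked in the chat: the function names and the tile list are made up for the example, a MISS on the first request assumes the staging bucket is empty or the tiles were purged, and reaching the staging endpoint over HTTPS presumably needs the internal CA to be trusted on the client.

```python
import urllib.request
from typing import Callable, Dict, Iterable

# Assumption: base URL copied from the chat; the tile list is illustrative only.
BASE_URL = "https://staging.svc.eqiad.wmnet:4105/maps/osm"
TILES = ["10/0/0", "11/0/0"]  # zoom-1 tiles hung in staging, so they are left out


def fetch_cache_header(url: str, timeout: float = 30.0) -> str:
    """GET the URL and return the tegola-cache response header ('' if absent)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.headers.get("tegola-cache", "")


def check_tiles(
    tiles: Iterable[str],
    fetch: Callable[[str], str] = fetch_cache_header,
    base_url: str = BASE_URL,
) -> Dict[str, bool]:
    """Request every tile twice and record whether we saw MISS followed by HIT."""
    results = {}
    for tile in tiles:
        url = f"{base_url}/{tile}"
        first = fetch(url).strip().upper()   # expected MISS: tile not cached yet
        second = fetch(url).strip().upper()  # expected HIT: tile now in the bucket
        results[tile] = first == "MISS" and second == "HIT"
    return results
```

Calling `check_tiles(TILES)` from a host that can reach staging returns a dict mapping each tile to True or False. The `fetch` parameter is injectable so the MISS/HIT logic can be exercised without network access.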
[17:53:27] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9827014 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host mc1050.eqiad.wmnet with OS bookworm
[17:53:30] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9827015 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin2002 for host mc2050.codfw.wmnet with OS bookworm
[18:26:34] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9827147 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host mc1050.eqiad.wmnet with OS bookworm completed: - mc1050 (**PASS**) - Dow...
[18:32:22] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9827159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin2002 for host mc2050.codfw.wmnet with OS bookworm completed: - mc2050 (**PASS**) - Dow...
[19:12:32] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9827314 (VRiley-WMF) After a very rigorous amount of troubleshooting, Dell will be sending out a replacement motherboard for kafka-main1009.
[20:05:16] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9827478 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye