[08:22:03] serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10166596 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jiji@cumin1002 from mw2424 to wikikube-worker2124 completed: - mw2424 (**PASS**) - ✔️...
[08:30:24] serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10166605 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jiji@cumin1002 Renumbering for host wikikube-worker2124.codfw.wmnet
[08:30:59] serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10166609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host wikikube-worker2124.codfw.wmnet with OS bullseye
[08:42:53] serviceops: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015#10166701 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1002 for hosts: `poolcounter1004.eqiad.wmnet` - poolcounter1004.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager...
[08:43:44] hello folks!
[08:44:13] if anybody could test registry2005 this week I'd be very grateful, so we'll be able to plan the bookworm migration for the docker registry :)
[08:44:34] I am aware of the switchover so I am not planning to do it tomorrow
[08:50:05] elukey: did you already run the httpbb tests? I think that's mostly what we would do :)
[08:50:20] (if not, I can totally do it ofc.!)
[08:55:54] jayme: nope I haven't since I didn't know if it was the correct/final list etc..
[08:56:31] serviceops: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015#10166737 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1002 for hosts: `poolcounter1005.eqiad.wmnet` - poolcounter1005.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager...
[08:57:02] elukey: ack, I'll take care of it and let you know
[08:59:21] <3
[09:07:46] serviceops: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015#10166747 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1002 for hosts: `poolcounter2003.codfw.wmnet` - poolcounter2003.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager...
[09:13:34] serviceops: Migrate docker registry hosts to bookworm - https://phabricator.wikimedia.org/T332016#10166751 (JMeybohm) ` $ sudo httpbb /srv/deployment/httpbb-tests/docker-registry/*.yaml --host registry2005.codfw.wmnet Sending to registry2005.codfw.wmnet... PASS: 14 requests sent to registry2005.codfw.wmnet....
[09:29:46] serviceops, Patch-For-Review: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015#10166768 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1002 for hosts: `poolcounter2004.codfw.wmnet` - poolcounter2004.codfw.wmnet (**PASS**) - Downtimed host...
[09:30:27] serviceops: Migrate docker registry hosts to bookworm - https://phabricator.wikimedia.org/T332016#10166772 (JMeybohm) I had 2003 and 2004 depooled for a couple of minutes without spotting problems, image uploads from build2001 (debian-weekly-build) work fine as well. Repooled 2004, so current state is: ` {"r...
[09:31:35] elukey: looks good I'd say.
Let's keep 2004 and 2005 pooled for the next couple of days to see if anything comes up
[09:34:41] serviceops, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10166787 (MoritzMuehlenhoff)
[09:34:44] jayme: perfect! So 2003 is depooled, and 2004/5 are pooled
[09:35:07] I am going to create registry1005 in eqiad and keep it inactive, so we can switch it on anytime as ell
[09:35:10] *well
[09:35:17] (and hostname numbering stays the same)
[09:36:16] in other news, poolcounter nodes on bookworm and old vms deleted
[09:36:41] the fact that I didn't cause a mw outage in the process is something that I am really proud of
[09:36:46] I didn't count on it :D
[09:40:10] serviceops, Patch-For-Review: Migrate poolcounter hosts to bookworm - https://phabricator.wikimedia.org/T332015#10166789 (elukey) Open→Resolved a:elukey
[09:48:30] serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10166800 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host wikikube-worker2124.codfw.wmnet with OS bullseye completed: - wikikube-worker2124.co...
[09:59:42] serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10166841 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jiji@cumin1002 Renumbering for host wikikube-worker2124.codfw.wmnet completed: - wikikube-worker2124.cod...
[10:27:06] o/ tried to deploy calico network policies for another job (cirrus-streaming-updater), worked well in staging but now in codfw seems like the dns is not reachable
[10:38:01] dcausse: that seems odd, as DNS policies are not part of the service deploy
[10:38:28] but I had a rolling restart of all kafka brokers running earlier today - maybe not DNS but a down broker?
[10:39:46] jayme: tried to resolve two addresses flink-zk2002 and kafka-main2005 both failed with ";; connection timed out; no servers could be reached" running the "host" command on the container
[10:40:32] the job recovered now... not sure what happened
[10:45:05] serviceops, Prod-Kubernetes, Discovery-Search (Current work), Kubernetes: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729#10166978 (JMeybohm) Restarts are all done. You may give i...
[10:45:43] dcausse: odd..which container?
[10:47:03] jayme: that was on both flink-app-producer-dc8c5cf8-jncc5 & flink-app-consumer-search-7665fc8cd9-x266j
[10:47:40] dcausse: in the codfw wikikube cluster I guess?
[10:47:47] jayme: yes
[10:49:04] but not it's all good... host flink-zk2002.codfw.wmnet works well and host kafka-main2005.codfw.wmnet now properly says host not found
[10:49:11] s/not/now/
[10:50:57] ack. I wanted to double check if that might have been related to some firewalling issue we've seen
[10:51:31] but no evidence...
[10:58:46] serviceops, Datacenter-Switchover, Patch-For-Review: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962#10166996 (jcrespo)
[11:03:59] serviceops, Datacenter-Switchover, Patch-For-Review: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962#10167014 (Dreamy_Jazz) >>! In T370962#10164176, @Dreamy_Jazz wrote: > Noting that I am running an import script for T375203 on a `tmux` session. The tmux s...
[11:18:08] hey folks I created registry1005 but missed a step for puppet 7 and now it is in a broken state, but downtime/depooled properly. I'll reimage after lunch!
[12:19:05] serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10167232 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jiji@cumin1002 from mw2425 to wikikube-worker2125 completed: - mw2425 (**PASS**) - ✔️...
[12:25:28] elukey: cheers, give me a shout when it is done
[12:27:12] serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10167259 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jiji@cumin1002 Renumbering for host wikikube-worker2125.codfw.wmnet
[12:27:41] serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10167260 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host wikikube-worker2125.codfw.wmnet with OS bullseye
[12:36:36] serviceops, Prod-Kubernetes, Discovery-Search (Current work), Kubernetes: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729#10167274 (dcausse) >>! In T374729#10166978, @JMeybohm wro...
[12:37:40] serviceops, Datacenter-Switchover, Patch-For-Review: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962#10167277 (Clement_Goubert) Calling to attention {T375382}, failover may need to be done by #dba before the switchover.
[13:18:22] serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10167395 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host wikikube-worker2125.codfw.wmnet with OS bullseye completed: -...
[13:22:41] serviceops, Prod-Kubernetes, Discovery-Search (Current work), Kubernetes: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729#10167410 (JMeybohm) Open→Resolved >>! In T374...
[13:31:26] serviceops, Prod-Kubernetes, Data-Platform-SRE (2024.09.06 - 2024.09.27), Kubernetes, Patch-For-Review: Migrate Search Platform-owned helm charts to Calico Network Policies - https://phabricator.wikimedia.org/T373195#10167445 (dcausse)
[13:33:55] serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10167462 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jiji@cumin1002 Renumbering for host wikikube-worker2125.codfw.wmnet completed: - wikikube-worker2125.cod...
[13:33:58] serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10167463 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jiji@cumin1002 Renumbering for host wikikube-worker2125.codfw.wmnet completed: - wikikube-worker2125.cod...
[14:06:35] serviceops, Infrastructure-Foundations, SRE: confd setup left without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296#10167577 (MoritzMuehlenhoff) p:Triage→Medium
[14:08:18] serviceops, LPL Essential, MinT, Community Wishlist (Translations), Community-Tech (Jackal (not a fox) Fox (Sept 23 - Oct 4)): Caching service request for MinT - https://phabricator.wikimedia.org/T370755#10167586 (JWheeler-WMF)
[14:40:18] ok registry1005 up and running!
[14:40:49] jayme: ok if I pool it like you did for 2005?
[14:42:14] elukey: sure
[15:01:28] serviceops, Data-Engineering, Data-Platform-SRE, SRE, Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#10167841 (brouberol) `cirrus-streaming-updater` is replacing the list of brokers by the external services service name: https://gerrit.wi...
[15:11:34] serviceops, CirrusSearch, Discovery-Search, MediaWiki-Platform-Team (Radar): PHP web requests running for multiple hours - https://phabricator.wikimedia.org/T374662#10167860 (Gehel)
[15:14:43] serviceops, MoveComms-Support, Datacenter-Switchover: MoveComms support for Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T371130#10167868 (Trizek-WMF)
[15:56:58] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes, Kubernetes: Relabel codfw kubernetes nodes mw2424 and mw2425 - https://phabricator.wikimedia.org/T375398 (jijiki) NEW
[15:58:36] serviceops, LPL Essential, MinT, Community Wishlist (Translations), Community-Tech (Jackal (not a fox) Fox (Sept 23 - Oct 4)): Caching service request for MinT - https://phabricator.wikimedia.org/T370755#10168167 (jijiki) >>! In T370755#10156131, @santhosh wrote: > Hi @jijiki , any updates on...
[16:09:40] serviceops, Patch-For-Review: Migrate docker registry hosts to bookworm - https://phabricator.wikimedia.org/T332016#10168233 (elukey) Also added registry1005 to the active pool of servers: ` elukey@puppetserver1001:~$ sudo -i confctl select service=docker-registry get {"registry1003.eqiad.wmnet": {"weig...
[16:09:51] registry1005 took the place of registry1003 in eqiad
[16:09:57] the latter is depooled
[16:10:15] if there are issues lemme know
[17:09:39] serviceops, Datacenter-Switchover, Patch-For-Review: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962#10168475 (Scott_French) @Dreamy_Jazz - ack, thank you! @Clement_Goubert - thanks for flagging!
pc1015 has now been swapped in for pc3, so this should be s...
[20:41:59] serviceops, MW-on-K8s, Abstract Wikipedia team (25Q1 (Jul–Sep)): Some wikifunctions calls end up served by mw-web - https://phabricator.wikimedia.org/T374556#10169325 (Jdforrester-WMF) Since the resolution of T374241 via editing the on-wiki content, [[https://logstash.wikimedia.org/app/dashboards#/vi...
[22:34:35] serviceops, MW-on-K8s, Traffic, Patch-For-Review: Some sites try and fail to serve favicon.ico - https://phabricator.wikimedia.org/T374997#10169615 (matmarex) Thanks for investigating, and for the patch. That list is definitely not complete, it's just the top entries I saw in the logs last Tuesda...
[22:37:55] serviceops, MW-on-K8s, Traffic, Patch-For-Review: Some sites try and fail to serve favicon.ico - https://phabricator.wikimedia.org/T374997#10169618 (matmarex) > nothing is really broken because of this bug To be clear, our wikis are not even using the broken favicon URLs. If you go to https://do...