[09:17:24] fabfur: Hi! if you have a bit of time, I have a ticket related to ATS on which I'd appreciate your input (T358470)
[09:17:25] T358470: Some Spark History links redirect to the service internal DNS - https://phabricator.wikimedia.org/T358470
[09:17:41] brouberol: sure, let me have a look
[09:17:50] thanks
[13:20:04] Traffic, MW-on-K8s, SRE, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9576217 (Clement_Goubert)
[13:20:14] Traffic, MW-on-K8s, Release-Engineering-Team, SRE, and 2 others: Move 50% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357507#9576216 (Clement_Goubert) Open→Resolved
[13:30:44] Traffic, MW-on-K8s, Release-Engineering-Team, SRE, serviceops: Move 50% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357507#9576241 (Clement_Goubert)
[13:31:11] Traffic, MW-on-K8s, Release-Engineering-Team, SRE, serviceops: Move 60% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357508#9576243 (Clement_Goubert) Stalled→In progress
[13:31:23] Traffic, MW-on-K8s, SRE, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9576245 (Clement_Goubert)
[13:47:42] netops, Infrastructure-Foundations, SRE: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9576320 (cmooney) p:Triage→Medium
[14:52:23] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: Change HAProxy log-format to support missing information - https://phabricator.wikimedia.org/T358105#9576534 (Fabfur) Applied to cp4037 the new log format to check eventual errors
[15:34:46] Traffic, DC-Ops, Data-Persistence, collaboration-services, and 4 others: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9576708 (joanna_borun)
[15:44:22] Traffic, netops, Infrastructure-Foundations, Sustainability (Incident Followup): Primary outbound port utilisation over 80% alert muted - https://phabricator.wikimedia.org/T358455#9576748 (joanna_borun) Open→Resolved
[15:45:43] Traffic, netops, Infrastructure-Foundations, Sustainability (Incident Followup): Primary outbound port utilisation over 80% alert muted - https://phabricator.wikimedia.org/T358455#9576750 (CDanis) This would best be fixed by extending the haproxy bwlim work done in T317799 -- we've talked about h...
[16:42:04] netops, Infrastructure-Foundations, SRE: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577083 (cmooney) Digging a little deeper on this the source IP of the packets hitting the install server don't really matter, what is mo...
[16:49:10] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9577115 (Clement_Goubert)
[16:50:34] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#8728141 (Clement_Goubert)
[17:03:27] Traffic, collaboration-services: Consider separating Gitlab code management and deb building management - https://phabricator.wikimedia.org/T357719#9577217 (LSobanski) Open→Declined Right now I'll say the Debian way is the way we'll do things. The problems so far were teething issues that should...
[18:05:44] netops, Infrastructure-Foundations, SRE: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577561 (cmooney) Juniper seem to document this scenario here, and advise using the "link-selection" keyword: https://www.juniper.net/do...
[18:19:30] netops, Infrastructure-Foundations, SRE: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577613 (cmooney) After issuing a manual release of the IP and trying again things seem to be working as expected: ` cmooney@install2004:...
[18:23:13] netops, Infrastructure-Foundations, SRE: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577627 (cmooney) So I think the solution is: # Add the "link-selection" command to the config on EVPN switches to add the IRB interface...
[18:51:19] netops, Infrastructure-Foundations, SRE: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577701 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2003.codfw.wmnet with...
[19:23:48] netops, Infrastructure-Foundations, SRE: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9577866 (cmooney) p:Low→Medium Actually a different need to upgrade has now become clear, relating to the issue detailed in T358488 The solution to that requ...
[19:27:35] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577879 (cmooney) >>! In T358488#9577627, @cmooney wrote: > # Add the "link-selection" command to the config on EVP...
[19:30:29] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577883 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2003...
[19:45:23] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577919 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest...
[20:19:05] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577971 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2003...
[21:25:13] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 5 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9578211 (Jdlrobson) Providing engineering perspective on behalf of the WMF web team, I agree that if we want to make this change in English we should d...
[23:40:59] Traffic, Data-Persistence, conftool, serviceops: Switch conftool to use the version 3 etcd datastore - https://phabricator.wikimedia.org/T350565#9578631 (Scott_French) a:Scott_French
[23:49:20] Traffic, SRE: Cannot edit wikipedia from my work computer - https://phabricator.wikimedia.org/T356799#9578640 (Dzahn) @Rijikk The footer would be right under the "If you report this error to the Wikimedia System Administrators, please include the details below." message you quoted in the error page itsel...