[03:50:17] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[07:50:17] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[08:04:20] good morning, I have a puppet patch pending to add support for tox v4 (while keeping back compat with tox v3)
[08:04:25] https://gerrit.wikimedia.org/r/c/operations/puppet/+/977223
[08:04:57] that will allow to switch the CI job to use an image based on tox v4
[08:05:17] and unlock future updates of the docker image ;)
[08:27:22] CAS-SSO, Infrastructure-Foundations, GitLab (Auth & Access), Release-Engineering-Team (Priority Backlog 📥), User-brennen: GitLab sessions expire frequently - https://phabricator.wikimedia.org/T330359 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @brennen: This is an older...
[10:03:28] SRE-tools, Infrastructure-Foundations, Observability-Alerting, observability: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (Volans) Open→Resolved p:Triage→Medium a:Volans As this ticket is few months old...
[10:05:49] SRE-tools, Infrastructure-Foundations: Add --depool-sleep runtime argument when using SRELBBatchRunner class - https://phabricator.wikimedia.org/T339151 (Volans) Is this request still current or given John's explanation could it be closed as declined?
[10:08:26] Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (Aklapper) Thanks.
Please share the number (usually five or six digits) of the ZenDesk request here
[10:20:28] SRE-tools, Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (Volans) @fgiunchedi I noticed the pontoon name in the logs, so I guess you're running it in an environment where debmonitor is not present. So instead o...
[11:38:07] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c70b0979-84e8-4fe7-8682-45d50615a587) set by cmooney@cumin1002 f...
[11:50:17] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[13:15:40] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (cmooney) Ok I have made the Netbox changes and pushed the resulting config to lsw1-b8-codfw now, and the port it up (note the por...
[13:32:44] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (cmooney)
[13:34:56] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) @MoritzMuehlenhoff what is the plan with cumin1002? @ABran-WMF has fixed db-mysql on cumin1001 so we can connect from there...
[13:40:49] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (MoritzMuehlenhoff) >>! In T352974#9445758, @Marostegui wrote: > @MoritzMuehlenhoff what is the plan with cumin1002? @ABran-WMF has fixed...
[13:41:49] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) >>! In T352974#9445809, @MoritzMuehlenhoff wrote: >>>! In T352974#9445758, @Marostegui wrote: >> @MoritzMuehlenhoff what is...
[13:51:00] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) @Ladsgroup I have merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/989144 - can you please deploy the user for cu...
[13:53:55] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) @MoritzMuehlenhoff I'm currently trying to trace how orchestrator connects to databases to manage them, to identify which cer...
[14:11:24] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (akosiaris) >>! In T352883#9445709, @cmooney wrote: > Ok I have made the Netbox changes and pushed the resulting config to lsw1-b8...
[14:14:40] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (ayounsi) > This is the thing we need to get fixed, I see Yep, that's {T352893} and its 2 CRs.
[14:28:00] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (MoritzMuehlenhoff) >>! In T352974#9445831, @ABran-WMF wrote: > @MoritzMuehlenhoff I'm currently trying to trace how orchestrator connect...
[14:34:49] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Clement_Goubert) ` # kubectl describe nodes kubestage2002.codfw.wmnet | grep -A3 Addresses Addresses: InternalIP: 10.192.22.13...
[14:35:02] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (akosiaris) >>! In T352883#9445907, @ayounsi wrote: >> This is the thing we need to get fixed, I see > Yep, that's {T352893} and i...
[14:42:41] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) >>! In T352974#9446004, @MoritzMuehlenhoff wrote: >>>! In T352974#9445831, @ABran-WMF wrote: >> @MoritzMuehlenhoff I'm curren...
[14:45:28] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (BTullis) I see these entries in the logs from orchestrator. ` Jan 09 14:43:15 dborch1001 orchestrator[3587041]: 2024-01-09 14:43:15 WARN...
[14:56:16] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (MoritzMuehlenhoff) >>! In T352974#9446072, @ABran-WMF wrote: >>>! In T352974#9446004, @MoritzMuehlenhoff wrote: >>>>! In T352974#9445831...
[15:35:13] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 3 others: Revert dbstore migration from puppet7 to puppet5 - https://phabricator.wikimedia.org/T354411 (BTullis) Just for reference, I think that we are still undecided on whether this roll-back is necessary, or whether we will be able...
[15:36:02] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 3 others: Revert dbstore migration from puppet7 to puppet5 - https://phabricator.wikimedia.org/T354411 (BTullis) Open→Stalled p:Triage→High
[15:36:10] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (BTullis)
[15:44:39] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (BTullis) In case it helps, this is also a useful command for showing the certificate chain that is presented by the dbstore servers. ` b...
[15:50:17] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[16:12:25] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi)
[16:18:14] Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (RoySmith) Sadly, I didn't get any such number. I got a pretty page with a drawing of a bunch of pastel-colored houses and a messag...
[17:06:44] Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (Dzahn) @RoySmith The way it's expected to work is that you would get an email that says "Your request (XXXXX) has been received.."...
[17:41:31] Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (RoySmith) @Dzahn I received no such email. Yes, I checked my spam folder.
[17:49:00] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (RobH)
[18:06:49] Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (Dzahn) That's unfortunate :( Maybe it's set to only respond to @wikimedia.org emails then.
[18:22:18] netops, Infrastructure-Foundations, SRE, ops-codfw, Patch-For-Review: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (cmooney) @papaul hey the link from mr1-codfw ge-0/0/3 to lsw1-a2-codfw ge-0/0/47 is now configured, but it's down both sides....
[18:57:33] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi) > I was surprised we don't have the same need for public IPs there. We will at some point, the thought process was that I assigned both public/private for codf...
[19:50:17] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[21:04:45] SRE-tools, Infrastructure-Foundations: Add --depool-sleep runtime argument when using SRELBBatchRunner class - https://phabricator.wikimedia.org/T339151 (BCornwall) Stalled→Declined Will do. @BBlack, if you still feel strongly about this please reopen :)
[21:49:31] Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (jhathaway) p:Low→Medium
[21:50:44] Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (jhathaway) >>! In T338032#9442426, @RoySmith wrote: > @jhathaway I'm going to respectfully push back on the idea of prioritizing t...
[23:50:17] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk