[00:38:22] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[04:38:22] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[07:01:40] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui)
[08:08:58] (SystemdUnitFailed) firing: (3) prometheus_puppetmerge_puppet.service Failed on puppetmaster1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:12:11] (SystemdUnitFailed) firing: (5) prometheus_puppetmerge_puppet.service Failed on puppetmaster1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:18:57] (SystemdUnitFailed) firing: (5) prometheus_puppetmerge_puppet.service Failed on puppetmaster1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:25:24] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (MoritzMuehlenhoff) >>! In T300152#9437438, @ayounsi wrote: > On naming I didn't use `private1-ganeti-codfw` as I didn't want to tie the IPs to a specific tool. On the ot...
[08:27:47] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (MoritzMuehlenhoff) >>! In T352974#9437823, @jbond wrote: >>>! In T352974#9392688, @ABran-WMF wrote: >> it appears that most of our hosts...
[08:28:53] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) >>! In T352974#9440440, @MoritzMuehlenhoff wrote: >>>! In T352974#9437823, @jbond wrote: >>>>! In T352974#9392688, @ABran-WM...
[08:38:37] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[08:38:57] (SystemdUnitFailed) firing: (3) prometheus_puppetmerge_puppet.service Failed on puppetmaster1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:42:11] (SystemdUnitFailed) resolved: (3) prometheus_puppetmerge_puppet.service Failed on puppetmaster1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:48:58] (SystemdUnitFailed) firing: (4) prometheus_puppetmerge_puppet.service Failed on puppetmaster1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:52:11] (SystemdUnitFailed) firing: (6) prometheus_puppetmerge_puppet.service Failed on puppetmaster1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:02:13] (SystemdUnitFailed) firing: (5) prometheus_puppetmerge_labs_private.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:05:28] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6bec1528-7372-478d-856a-a08325eb04f0) set by ayounsi@cumin1002 for 2:00:00 on 35 host(s) and their services w...
[09:07:11] (SystemdUnitFailed) firing: (5) prometheus_puppetmerge_labs_private.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:12] (SystemdUnitFailed) firing: (5) prometheus_puppetmerge_labs_private.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:22:11] (SystemdUnitFailed) firing: (5) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:24:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on debmonitor2003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:32:11] (SystemdUnitFailed) firing: (5) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:46:59] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395 (ayounsi) Open→Resolved All done. ~10min downtime.
[10:52:08] netops, Infrastructure-Foundations, Traffic: Network issues for users in the UK and Ireland - https://phabricator.wikimedia.org/T354065 (cmooney) Open→Resolved a:cmooney Great @Sideswipe9th thanks for the feedback. Definitely was a strange one, glad you could shed a bit more light on it...
[11:31:47] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (BTullis) Could it be this reference in [[https://gitlab.wikimedia.org/repos/sre/wmfdb|wmfdb]] that should be updated to `/etc/ssl/certs/...
[11:37:14] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) That's a very good point @BTullis. I'd leave this to @ABran-WMF and @MoritzMuehlenhoff. Orchestrator is still an issue thoug...
[12:38:37] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[13:22:58] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) it seems that orchestrator follows the same pattern as the one @Marostegui identified here: >>! In T352974#9389945, @Marosteg...
[13:54:25] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) for the orchestrator part, it seems that mariadb client [[ https://github.com/wikimedia/operations-puppet/blob/6d6dc6f4cae913...
[14:24:34] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on debmonitor2003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:53:04] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (cmooney) >>! In T300152#9440438, @MoritzMuehlenhoff wrote: >>>! In T300152#9437438, @ayounsi wrote: >> On naming I didn't use `private1-ganeti-codfw` as I didn't want to...
[15:06:00] topranks, jobo, talking about interconnection: https://en.wikipedia.org/wiki/Celtic_Interconnector (it's not internet, but still nice)
[15:06:35] EU<->EU it's a very good project :)
[15:22:05] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (MoritzMuehlenhoff) >>! In T300152#9441778, @cmooney wrote: >>>! In T300152#9440438, @MoritzMuehlenhoff wrote: >>>>! In T300152#9437438, @ayounsi wrote: >>> On naming I d...
[15:37:50] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Clement_Goubert)
[15:38:08] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Clement_Goubert) Open→In progress a:Clement_Goubert→Papaul Host is now drained and cordoned. It is in codfw rack...
[15:40:56] Mail, Infrastructure-Foundations, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Upgrade Exim to 4.96 - https://phabricator.wikimedia.org/T310836 (jhathaway) p:Triage→Low
[15:41:13] SRE-tools, Infrastructure-Foundations: Read Ganeti cluster config for cookbooks from Netbox - https://phabricator.wikimedia.org/T340015 (Volans) p:Triage→Medium
[15:43:22] (MDRAIDFailedDisk) resolved: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[15:49:22] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[16:00:53] Mail, Infrastructure-Foundations, Sustainability (Incident Followup): Follow up for mx1001 incident: 2023-05-17 MXQueueHigh on mx1001 - https://phabricator.wikimedia.org/T337257 (jhathaway) p:Triage→Low a:jhathaway
[16:02:39] SRE-tools, Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (fgiunchedi) It is still the case with `debmonitor-client` `0.3.2-1+deb11u1` when `debmonitor.discovery.wmnet` is unreachable (no ping, even connection r...
[16:05:27] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) wmfdb [[ https://gitlab.wikimedia.org/repos/sre/wmfdb/-/jobs/186537 | has been released ]], I'll move on to orchestrator test...
[16:06:20] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) \o/ thanks!
[16:07:34] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) @ABran-WMF would you deploy that new version to cumin1001?
[16:09:36] Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (jhathaway) p:Triage→Low
[16:10:07] SRE-tools, Infrastructure-Foundations: Add GraphQL support to wmflib - https://phabricator.wikimedia.org/T341968 (Volans) p:Triage→Low
[16:11:34] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) oh shoot I have to build it to bullseye as well! let me check
[16:11:38] netops, Infrastructure-Foundations, Observability-Metrics, SRE, observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (cmooney) p:Triage→Low
[16:18:58] netbox, Infrastructure-Foundations: Netbox report test_mgmt_dns_hostname - rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843 (Volans) Open→Stalled p:Triage→Medium
[16:20:20] Hello, I have two hosts (dbstore100[89]) that were accidentally specified as requiring AAAA records, when they should have been ipv4 only. So I need to remove them.
[16:21:21] SRE-tools, Infrastructure-Foundations: Package pyGNMI and dictdiffer to be used by cookbooks - https://phabricator.wikimedia.org/T340045 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[16:21:27] Is this still the best procedure to follow in netbox? https://phabricator.wikimedia.org/T270101#6688993 - If so, can I subsequently just edit `/etc/network/interfaces` and reboot, or is it better to reimage? Thanks?
[16:22:13] netbox, Infrastructure-Foundations: Markdown bug in Netbox-next - https://phabricator.wikimedia.org/T340444 (ayounsi)
[16:22:18] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.7.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[16:23:47] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Papaul) @Clement_Goubert thanks will work on it in a minute
[16:23:57] btullis: yes, and no
[16:24:07] btullis: those steps are still valid, if the DNS records have been already queried you'll need to also run the wipe-cache cookbook to clear the recursors
[16:24:09] btullis: yes it's the good procedure, no you don't need to edit interfaces
[16:24:17] as for reboot/reimage it's not needed, the IPv6 will still be there
[16:24:23] just no AAAA record pointing to it
[16:25:04] *those I meant the ones in the task comment
[16:25:29] no change on the host is needed
[16:25:38] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (cmooney) @papaul let me know what port is used on lsw1-b8-codfw once done and I will make the Netbox changes and assign new IPs f...
[16:26:16] OK, thanks both, that's great.
[16:26:23] o/ I broke something again: mw1377 was originally on puppet7, then I applied https://gerrit.wikimedia.org/r/c/operations/puppet/+/988507 which did not have `profile::puppet::agent::force_puppet7: true` and that upset puppet very much
[16:26:39] so now the reimage cookbook doesn't work either
[16:27:16] how do I get it unstuck?
[16:28:11] kamila_: in a meeting right now and until 18 UTC...
[16:28:19] volans: ack
[16:28:20] do you want the host in puppet7 or not?
[16:28:27] yes
[16:28:46] I forgot to put it in the role (sorry...)
[16:28:55] so add the patch to puppet to have in puppet7, clear the certificate on puppetmaster1001 and reimage again
[16:29:03] ok, thank you!
[16:30:22] I hope it works, not 100% sure, maybe mor.itz can help if he's not in the next meeting ;)
[16:50:15] Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (RoySmith) @jhathaway I'm going to respectfully push back on the idea of prioritizing this as "low". Emergency@ is used to report...
[17:52:09] Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (Dzahn) This should likely be escalated to the ITS team, since they handle the Google mailbox this is about. Since that team doesn'...
[17:58:04] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Papaul) @cmooney xe-0/0/26
[17:58:56] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Papaul)
[19:37:26] Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032 (RoySmith) I just opened a zendesk request, briefly describing the problem and linking to this phab ticket. Unfortunately it looks...
[19:50:16] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[23:50:16] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk