[00:38:22] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[04:38:22] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[07:01:40] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui)
[08:08:58] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) prometheus_puppetmerge_puppet.service Failed on puppetmaster1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:12:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) prometheus_puppetmerge_puppet.service Failed on puppetmaster1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:18:57] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) prometheus_puppetmerge_puppet.service Failed on puppetmaster1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:25:24] <wikibugs>	 netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (MoritzMuehlenhoff) >>! In T300152#9437438, @ayounsi wrote: > On naming I didn't use `private1-ganeti-codfw` as I didn't want to tie the IPs to a specific tool. On the ot...
[08:27:47] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (MoritzMuehlenhoff) >>! In T352974#9437823, @jbond wrote: >>>! In T352974#9392688, @ABran-WMF wrote: >> it appears that most of our hosts...
[08:28:53] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) >>! In T352974#9440440, @MoritzMuehlenhoff wrote: >>>! In T352974#9437823, @jbond wrote: >>>>! In T352974#9392688, @ABran-WM...
[08:38:37] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[08:38:57] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) prometheus_puppetmerge_puppet.service Failed on puppetmaster1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:42:11] <jinxer-wm>	 (SystemdUnitFailed) resolved: (3) prometheus_puppetmerge_puppet.service Failed on puppetmaster1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:48:58] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) prometheus_puppetmerge_puppet.service Failed on puppetmaster1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:52:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) prometheus_puppetmerge_puppet.service Failed on puppetmaster1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:02:13] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) prometheus_puppetmerge_labs_private.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:05:28] <wikibugs>	 netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6bec1528-7372-478d-856a-a08325eb04f0) set by ayounsi@cumin1002 for 2:00:00 on 35 host(s) and their services w...
[09:07:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) prometheus_puppetmerge_labs_private.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) prometheus_puppetmerge_labs_private.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:22:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:24:14] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on debmonitor2003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:32:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:46:59] <wikibugs>	 netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade asw1-eqsin - https://phabricator.wikimedia.org/T332395 (ayounsi) Open→Resolved All done. ~10min downtime.
[10:52:08] <wikibugs>	 netops, Infrastructure-Foundations, Traffic: Network issues for users in the UK and Ireland - https://phabricator.wikimedia.org/T354065 (cmooney) Open→Resolved a:cmooney Great @Sideswipe9th thanks for the feedback.  Definitely was a strange one, glad you could shed a bit more light on it...
[11:31:47] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (BTullis) Could it be this reference in [[https://gitlab.wikimedia.org/repos/sre/wmfdb|wmfdb]] that should be updated to `/etc/ssl/certs/...
[11:37:14] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) That's a very good point @BTullis. I'd leave this to @ABran-WMF and @MoritzMuehlenhoff. Orchestrator is still an issue thoug...
[12:38:37] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[13:22:58] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) it seems that orchestrator follows the same pattern as the one @Marostegui identified here: >>! In T352974#9389945, @Marosteg...
[13:54:25] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) for the orchestrator part, it seems that mariadb client [[ https://github.com/wikimedia/operations-puppet/blob/6d6dc6f4cae913...
[14:24:34] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on debmonitor2003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:53:04] <wikibugs>	 netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (cmooney) >>! In T300152#9440438, @MoritzMuehlenhoff wrote: >>>! In T300152#9437438, @ayounsi wrote: >> On naming I didn't use `private1-ganeti-codfw` as I didn't want to...
[15:06:00] <XioNoX>	 topranks, jobo, talking about interconnection: https://en.wikipedia.org/wiki/Celtic_Interconnector (it's not internet, but still nice)
[15:06:35] <topranks>	 EU<->EU it's a very good project :)
[15:22:05] <wikibugs>	 netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (MoritzMuehlenhoff) >>! In T300152#9441778, @cmooney wrote: >>>! In T300152#9440438, @MoritzMuehlenhoff wrote: >>>>! In T300152#9437438, @ayounsi wrote: >>> On naming I d...
[15:37:50] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Clement_Goubert)
[15:38:08] <wikibugs>	 netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Clement_Goubert) Open→In progress a:Clement_Goubert→Papaul Host is now drained and cordoned. It is in codfw rack...
[15:40:56] <wikibugs>	 Mail, Infrastructure-Foundations, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Upgrade Exim to 4.96 - https://phabricator.wikimedia.org/T310836 (jhathaway) p:Triage→Low
[15:41:13] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Read Ganeti cluster config for cookbooks from Netbox - https://phabricator.wikimedia.org/T340015 (Volans) p:Triage→Medium
[15:43:22] <jinxer-wm>	 (MDRAIDFailedDisk) resolved: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[15:49:22] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[16:00:53] <wikibugs>	 Mail, Infrastructure-Foundations, Sustainability (Incident Followup): Follow up for mx1001 incident: 2023-05-17 MXQueueHigh on mx1001 - https://phabricator.wikimedia.org/T337257 (jhathaway) p:Triage→Low a:jhathaway
[16:02:39] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (fgiunchedi) It is still the case with `debmonitor-client` `0.3.2-1+deb11u1` when `debmonitor.discovery.wmnet` is unreachable (no ping, even connection r...
[16:05:27] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) wmfdb [[ https://gitlab.wikimedia.org/repos/sre/wmfdb/-/jobs/186537 | has been released ]], I'll move on to orchestrator test...
[16:06:20] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) \o/ thanks!
[16:07:34] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) @ABran-WMF would you deploy that new version to cumin1001?
[16:09:36] <wikibugs>	 Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org  is possibly getting lost - https://phabricator.wikimedia.org/T338032 (jhathaway) p:Triage→Low
[16:10:07] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Add GraphQL support to wmflib - https://phabricator.wikimedia.org/T341968 (Volans) p:Triage→Low
[16:11:34] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) oh shoot I have to build it to bullseye as well! let me check
[16:11:38] <wikibugs>	 netops, Infrastructure-Foundations, Observability-Metrics, SRE, observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (cmooney) p:Triage→Low
[16:18:58] <wikibugs>	 netbox, Infrastructure-Foundations: Netbox report test_mgmt_dns_hostname - rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843 (Volans) Open→Stalled p:Triage→Medium
[16:20:20] <btullis>	 Hello, I have two hosts (dbstore100[89]) that were accidentally specified as requiring AAAA records, when they should have been ipv4 only. So I need to remove them.
[16:21:21] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Package pyGNMI and dictdiffer to be used by cookbooks - https://phabricator.wikimedia.org/T340045 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[16:21:27] <btullis>	 Is this still the best procedure to follow in netbox? https://phabricator.wikimedia.org/T270101#6688993 - If so, can I subsequently just edit `/etc/network/interfaces` and reboot, or is it better to reimage? Thanks!
[16:22:13] <wikibugs>	 netbox, Infrastructure-Foundations: Markdown bug in Netbox-next - https://phabricator.wikimedia.org/T340444 (ayounsi)
[16:22:18] <wikibugs>	 netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.7.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[16:23:47] <wikibugs>	 netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Papaul) @Clement_Goubert thanks will work on it in a minute
[16:23:57] <XioNoX>	 btullis: yes, and no
[16:24:07] <volans>	 btullis: those steps are still valid, if the DNS records have been already queried you'll need to also run the wipe-cache cookbook to clear the recursors
[16:24:09] <XioNoX>	 btullis: yes it's the right procedure, no you don't need to edit interfaces
[16:24:17] <volans>	 as for reboot/reimage it's not needed, the IPv6 will still be there
[16:24:23] <volans>	 just no AAAA record pointing to it
[16:25:04] <volans>	 *by "those" I meant the ones in the task comment
[16:25:29] <volans>	 no change on the host is needed
[16:25:38] <wikibugs>	 netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (cmooney) @papaul let me know what port is used on lsw1-b8-codfw once done and I will make the Netbox changes and assign new IPs f...
[16:26:16] <btullis>	 OK, thanks both, that's great.
[16:26:23] <kamila_>	 o/ I broke something again: mw1377 was originally on puppet7, then I applied https://gerrit.wikimedia.org/r/c/operations/puppet/+/988507 which did not have `profile::puppet::agent::force_puppet7: true` and that upset puppet very much
[16:26:39] <kamila_>	 so now the reimage cookbook doesn't work either
[16:27:16] <kamila_>	 how do I get it unstuck?
[16:28:11] <volans>	 kamila_: in a meeting right now and until 18 UTC... 
[16:28:19] <kamila_>	 volans: ack
[16:28:20] <volans>	 do you want the host in puppet7 or not?
[16:28:27] <kamila_>	 yes
[16:28:46] <kamila_>	 I forgot to put it in the role (sorry...)
[16:28:55] <volans>	 so add the patch to puppet to have it in puppet7, clear the certificate on puppetmaster1001 and reimage again
[16:29:03] <kamila_>	 ok, thank you!
[16:30:22] <volans>	 I hope it works, not 100% sure, maybe mor.itz can help if he's not in the next meeting ;)
[16:50:15] <wikibugs>	 Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org  is possibly getting lost - https://phabricator.wikimedia.org/T338032 (RoySmith) @jhathaway  I'm going to respectfully push back on the idea of prioritizing this as "low".  Emergency@ is used to report...
[17:52:09] <wikibugs>	 Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org  is possibly getting lost - https://phabricator.wikimedia.org/T338032 (Dzahn) This should likely be escalated to the ITS team, since they handle the Google mailbox this is about.  Since that team doesn'...
[17:58:04] <wikibugs>	 netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Papaul) @cmooney xe-0/0/26
[17:58:56] <wikibugs>	 netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Papaul)
[19:37:26] <wikibugs>	 Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org  is possibly getting lost - https://phabricator.wikimedia.org/T338032 (RoySmith) I just opened a zendesk request, briefly describing the problem and linking to this phab ticket.  Unfortunately it looks...
[19:50:16] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[23:50:16] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk