[03:01:04] (PuppetDisabled) firing: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[07:01:04] (PuppetDisabled) firing: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[07:44:07] netbox, Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, bacula: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655#9562400 (jcrespo) an-db backups looking good: ` ✔️ r...
[08:08:13] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9562410 (MoritzMuehlenhoff)
[08:43:16] netops, Ganeti, Infrastructure-Foundations, SRE, Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152#9562456 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `sretest2005.codfw.wmnet` - sretest2005.codfw...
[09:05:49] (PuppetDisabled) resolved: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[10:08:36] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:08:36] (SystemdUnitFailed) firing: (3) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:13:55] (SystemdUnitFailed) firing: (3) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:23:01] netbox, Infrastructure-Foundations, SRE, cloud-services-team, and 2 others: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428#9563208 (ayounsi) {T358096} for the Cassandra/extra IPs usecase.
[12:53:07] netops, Infrastructure-Foundations, SRE, cloud-services-team, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9563291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1033.eqiad.wmnet with OS book...
[13:29:46] Mail, Infrastructure-Foundations, Toolforge: Set up alerts for mail queue - https://phabricator.wikimedia.org/T60871#9563397 (dcaro) p:Medium→Low
[13:31:07] Mail, Infrastructure-Foundations, Toolforge: Set up alerts for mail queue - https://phabricator.wikimedia.org/T60871#606527 (dcaro) This would be now on prometheus + alertmanager/metricsinfra
[13:31:40] Mail, Infrastructure-Foundations, Toolforge: [toolforge.infra] Set up alerts for mail queue - https://phabricator.wikimedia.org/T60871#9563411 (dcaro)
[13:37:10] netops, Infrastructure-Foundations, SRE, cloud-services-team, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9563419 (aborrero)
[13:38:17] netops, Infrastructure-Foundations, SRE, cloud-services-team, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#8416726 (aborrero)
[13:50:56] netops, Infrastructure-Foundations, SRE, cloud-services-team, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9563463 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1033.eqiad.wmnet with OS...
[14:27:44] I've created a new VM to serve as the fallback node for the new bookworm-based apt server (apt2002.wikimedia.org), but there are connection errors towards puppetserver2003: https://paste.debian.net/hidden/2519bb9e/
[14:28:40] similar to what we had seen (and resolved) for codfw/cloud, any idea what might be causing these? is there some other firewall term specific to public IPs or even apt* in particular?
[14:30:07] moritzm: not on the network devices at least
[14:33:07] moritzm: the service seems to be attached only to the v4 IP
[14:33:22] puppetserver2003:~$ sudo netstat -nlpt
[14:33:22] tcp6 0 0 10.192.14.6:8140 :::* LISTEN 3945530/java
[14:35:43] for v4 there is something odd, the server sends back the syn-ack, but the client never gets it
[14:40:22] netops, Infrastructure-Foundations, SRE, cloud-services-team, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9563666 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1033.eqiad.wmnet with OS book...
[14:40:33] easiest to replicate is that an mtr on puppetserver2003 to the v4 IP of apt doesn't complete, while the v6 does
[14:41:13] the sockets on 2003 appear to be identical to the ones on 2002, though?
[14:41:42] what do you mean?
[14:42:48] what I mean is: the way puppetserver listens to incoming connections isn't any different on 2003 compared to the other existing puppet servers
[14:45:37] ah yeah
[14:45:50] it's not fully related to the issue
[14:46:56] if it was listening over v6 as well, it would be working, but also hiding the IPv4 issue
[14:47:10] I'm still looking into the v4 issue
[14:48:48] ack, thx
[15:13:56] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:16:09] I have to step away for an hour or so dunno if topranks have some time to look into it, looks like there is something funky going on with the fabric
[15:16:38] * topranks looking
[15:16:48] could be related to changes I made
[15:17:20] ssw1-a8 seems to think that 208.80.153.11 is directly connected
[15:17:33] you can compare it with 2620:0:860:1:208:80:153:11
[15:20:49] there's no urgency for this, BTW. this is a new WIP host, nothing currently in production
[15:23:07] Yeah I'm not detecting an issue as such, both devices can ping each other over v4 and v6 no problem
[15:24:02] sorry that was from puppetserver2002, from 2003 I see an issue yeah
[15:24:05] topranks: `apt2002:~$ ping 10.192.14.6` doesn't seem to work for me
[15:44:46] topranks: all serviceops hosts for A8 depooled btw
[15:45:02] claime: thanks, appreciated :)
[15:47:34] XioNoX, moritzm: the issue with the public vlan is strange alright. Seems to be affecting traffic from devices in codfw row B to the public1-a-codfw subnet.
[15:47:50] I'll need to dig further, it seems unrelated to any of the changes I was doing
[15:49:12] stranger still it's only affecting v4
[15:49:44] I'll get today's switch migration done then double-back to it, I expect the issue occurred when we moved ganeti2028 to lsw1-a7-codfw yesterday (apt2002 is on that)
[15:58:01] great, thanks
[15:58:22] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874#9563911 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c42ddc7f-d7d7-4ebc-9852-d3a5c7882e71) set by cmoon...
[15:59:13] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874#9563912 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=da675508-2cc3-4974-a4ca-677deefc2dff) set by cmoon...
[16:11:24] claime: hosts moved now thanks for the help
[16:20:03] topranks: ack, thanks
[16:23:02] XioNoX, moritzm: just to update it would appear localised to lsw1-a7-codfw. Other IPs on the same subnet, connected via other lsw's, are reachable from the places apt2002 isn't.
[16:24:40] I'll open a task shortly, unsure if it's a bug or if there is some subtle config issue going on
[16:26:00] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874#9564146 (cmooney) All hosts moved without issue, thanks Jenn!
[16:33:13] XioNoX, moritzm: running "clear ethernet-switching mac-ip-table " seems to have cleared the problem
[16:33:47] will do more digging, but definitely some kind of bug, the local arp cache for the apt2002 IP wasn't getting populated
[16:34:15] this is similar to the bug we saw when the DHCP process on those switches deletes the ARP entry but not the "mac-ip" table entry, and causes arp to fail
[16:35:19] moritzm: I presume apt2002 wasn't reimaged today right? not sure what triggered the errored state if it wasn't the DHCP process
[16:37:04] it was reimaged today
[16:37:24] or rather installed, the VM was only created today and then installed right away
[16:55:32] moritzm: ok thanks, yeah I think it makes sense
[16:55:49] so there is a "known bug" (which juniper won't fix) that I mentioned above
[16:56:17] for physical servers the reimage cookbook manually issues the command I ended up running on the attached switch (if the attached switch is running vxlan)
[16:56:40] I think what we hit here is the first instance of a VM on a ganeti host connected to one of those switches
[16:57:59] I'll update the existing task in a while with detail, we'll need to work out a way to issue that command when doing VM reimages if needed I think
[16:58:00] we might need to run that line on vms too?
[16:58:33] volans: yeah, but obviously there is the challenge of finding out what hypervisor it's running on, and from there if the attached switch is an lsw
[16:58:59] easy peasy
[16:59:00] :
[16:59:02] :D
[16:59:09] haha
[16:59:15] first bit of good news I've had all day :)
[17:00:15] lldp is one option, ganeti spicerack module maybe another
[17:01:04] I'd go with the latter, it might not have the info already there but easy to add I guess
[19:18:36] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:01:00] Mail: Allow users to receive change notification email if edit is done by a bot - https://phabricator.wikimedia.org/T358087#9562763 (AlbanGeller) This should be the default. Bot edits aren't infallible, and it's only reasonable to expect watchers to review bot activity.
[20:46:25] Mail: Allow users to receive change notification email if edit is done by a bot - https://phabricator.wikimedia.org/T358087#9565151 (Novem_Linguae) > This should be the default. Not sure I agree that it should be the default. But making it opt-in would probably be reasonable, especially if we can find someo...
[20:46:47] Mail: Create user preference to receive change notification emails for bot edits - https://phabricator.wikimedia.org/T358087#9565156 (Novem_Linguae)
[20:50:21] Mail: Create user preference to receive change notification emails for bot edits - https://phabricator.wikimedia.org/T358087#9565172 (Novem_Linguae) The setting to be added would probably best fit in Special:Preferences -> User profile -> Email options. Right beneath "Email me also for minor edits of pages a...
[20:59:43] Mail: Create user preference to receive change notification emails for bot edits - https://phabricator.wikimedia.org/T358087#9565217 (Primefac) Having the choice is all that I ask.
[21:03:36] (SystemdUnitFailed) firing: (2) check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:43:50] (PuppetZeroResources) firing: Puppet has failed generate resources on apt2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:23:50] (PuppetZeroResources) resolved: Puppet has failed generate resources on apt2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
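
Note on the 16:56-17:01 discussion: the idea floated there (find the hypervisor a VM runs on, then check whether the attached switch is an lsw, then issue the clear command) could be sketched roughly as below. This is only an illustration and not the reimage cookbook's code. The lookup tables, the function names, and the assumption that an "lsw" name prefix identifies the affected switches are all hypothetical; the log only suggests LLDP or the ganeti spicerack module as possible data sources. The command string is copied from the 16:33:13 message, and any argument it took is not shown in the log and is left out.

```python
from typing import Optional, Tuple

# Hypothetical lookup data. In the log, apt2002 runs on ganeti2028, which is
# attached to lsw1-a7-codfw; the "example-*" entries are made up for contrast.
VM_TO_HYPERVISOR = {"apt2002": "ganeti2028", "example-vm": "example-hv"}
HYPERVISOR_TO_SWITCH = {"ganeti2028": "lsw1-a7-codfw", "example-hv": "asw-example"}

# Command quoted at 16:33:13; its argument (if any) is not given in the log.
CLEAR_COMMAND = "clear ethernet-switching mac-ip-table"


def is_affected_switch(switch: str) -> bool:
    """Assumption: switches named 'lsw...' are the ones that need the workaround."""
    return switch.startswith("lsw")


def clear_command_for_vm(vm: str) -> Optional[Tuple[str, str]]:
    """Return (switch, command) if the VM's hypervisor sits on an lsw, else None."""
    hypervisor = VM_TO_HYPERVISOR.get(vm)
    if hypervisor is None:
        return None  # unknown VM, nothing to do
    switch = HYPERVISOR_TO_SWITCH.get(hypervisor)
    if switch is None or not is_affected_switch(switch):
        return None  # switch unknown or not an lsw
    return switch, CLEAR_COMMAND


if __name__ == "__main__":
    for name in ("apt2002", "example-vm"):
        print(name, clear_command_for_vm(name))
```

Returning the switch and command instead of running anything keeps the sketch free of side effects; an actual cookbook would need a real inventory lookup and a way to run the command on the switch.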