[06:53:21] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (ayounsi)
[07:04:44] (SystemdUnitFailed) firing: expire_bitu_signups.service Failed on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:12:23] netbox, Infrastructure-Foundations: Netbox report test_matching_vlan - AttributeError: 'NoneType' object has no attribute 'prefixes' - https://phabricator.wikimedia.org/T339078 (ayounsi)
[09:19:44] (SystemdUnitFailed) firing: (2) httpbb_hourly_appserver.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:22:16] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) >>! In T307357#8928451, @cmooney wrote: > @aborrero I discussed the idea of a [[ https://wikitech.wikimedia.or...
[09:28:55] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (cmooney) >>! In T307357#8930600, @aborrero wrote: > I think that's the `query-local-address`option. Upstream docs: Tha...
[09:29:44] (SystemdUnitFailed) firing: (2) httpbb_hourly_appserver.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:46:09] netops, Infrastructure-Foundations, SRE-Sprint-Week-Sustainability-March2023, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi)
[09:46:33] netops, Infrastructure-Foundations, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (ayounsi) In progress→Resolved a:ayounsi With row D upgraded, I couldn...
[09:46:51] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) >>! In T307357#8930604, @cmooney wrote: > >> Could you describe the setup you have in mind? Would it be a sta...
[10:34:52] CAS-SSO, Infrastructure-Foundations, SRE, serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) GitLab replica service urls seem to be allowed for OIDC admin login now. However I get the same fronted error `Application Not A...
[10:34:55] netops, Infrastructure-Foundations, SRE-Sprint-Week-Sustainability-March2023, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi) cr1<->row D is now operational on the new 40G link @Jclark-ctr Those 4 SMF cables can now be remo...
[11:11:44] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) >>! In T307357#8930653, @aborrero wrote: > > I think I'm proposing this: > Talking with @taavi on IRC, he p...
[11:12:09] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (cmooney) >>! In T307357#8930653, @aborrero wrote: > My point is that we could go with the 2 public IPv4 addresses for bo...
[11:13:51] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) Ok, I think we are in the same page!
[11:14:15] puppet-compiler, Infrastructure-Foundations: Puppet compiler fails due to unset fact wmflib.is_container - https://phabricator.wikimedia.org/T338961 (jbond) > However, for the cloud hosts, the documentation did not work. can your expand? FYI the normal timer jobs should have run now
[11:17:16] netops, Infrastructure-Foundations, SRE-Sprint-Week-Sustainability-March2023, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi)
[11:45:33] topranks: XioNoX: im trying to track down some bad data that has made it into the netbox hiera
[11:45:41] the data was updated with the following commit
[11:45:42] Triggered by cookbooks.sre.dns.netbox: ms-be[1040-1043].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin1001
[11:45:55] however the bad data is
[11:45:56] {'name': 'mgmt', 'ip_addresses': [{'dns_name': ''}], 'device': {'status': 'FAILED', 'site': {'slug': 'codfw'}, 'tenant': None, 'rack': {'name': 'B1', 'location': {'slug': 'codfw-row-b'}}}}
[11:46:13] specifically the dns_name is an empty string
[11:46:25] i can fix this in the cookbook by just ignoring empty strings
[11:46:47] but i think it would be good to clean it in the source data, but im struggling to work out how to track that down
[11:47:08] The decom cookbook removes the dns_name against the mgmt_ip as far as I recall
[11:47:32] ftr i dont think the decom cookbook caused the issue
[11:47:44] as that was working on nodes in eqiad and this affected nodes in codfw
[11:47:58] i suspect possibly a manual change that got pushed along with the decom changes
[11:48:25] also the dns_name is there, i.e. it has not been removed, it's set to blank, which is the problem
[11:48:36] if it was removed i think the cookbook already deals with that
[11:49:01] yeah the decom cookbook should make it null, it's strange we see the empty string
[11:49:46] i also notice that the status is set to FAILED, any idea what would set that?
[11:51:06] it's usually manually set afaik
[11:51:47] when something doesn't work and it needs attention by dcops
[11:52:43] ack im rerunning the query to try and get some more info
[11:53:20] Decom was run by Matthew earlier on
[11:53:21] ok seems to be cloudservices2004-dev
[11:53:22] 10:03 mvernon@cumin1001: START - Cookbook sre.hosts.decommission for hosts ms-be[1040-1043].eqiad.wmnet
[11:53:40] Arzhel set that one to failed earlier, to quiet a report
[11:53:47] yes, like i said, i think that just brought in the change, as it's working on eqiad hosts not codfw
[11:53:58] Ok
[11:54:18] ah sorry I'd missed that, this isn't related to ms-be104x
[11:54:39] ok, well there were manual changes that caused it for sure
[11:54:58] so at least this is something that ought not to happen in normal process
[11:55:13] ok cool ill take a quick look at the record see if i can fix it
[11:55:51] jbond: I'll have a look
[11:55:55] cheers
[11:58:31] jbond: the cabling that we need done to reimage that server isn't done yet, so I can't progress it out of "failed" state right now
[11:58:45] I added back in the hostname for the mgmt_ip, which hopefully will clear your error
[11:59:05] topranks: can we either delete the mgmt interface or add a dns name to it
[11:59:15] ahh thanks let me check
[12:01:29] topranks: looks good thanks
[12:02:09] ok great
[12:04:09] jbond: so what happened there was it went from state decom -> failed
[12:04:56] I think the cookbook is probably ignoring servers in state decom? I suspect it should also do so for those in state failed?
[12:05:18] SRE-tools, netbox, Infrastructure-Foundations: netbox: decided how to deal with blank mgmt dns_names - https://phabricator.wikimedia.org/T339121 (jbond) p:Triage→Medium
[12:06:07] topranks: ack ill check through the original CR to see if there was a reason we left it out originally and if not add that
[12:06:36] i also created ^^^^ this task to fix the main issue
[12:07:40] I'm wondering if we're at the point where we need a cookbook for any server state change
[12:08:08] for safeguards and other actions (like the hiera cookbook)
[12:10:41] SRE-tools, netbox, Infrastructure-Foundations: netbox: decided how to deal with blank mgmt dns_names - https://phabricator.wikimedia.org/T339121 (cmooney) I'm somewhat assuming what happened here but I think this is correct. The background here is that the decom cookbook removes the DNS name against...
[12:12:43] XioNoX: possible but i think this issue is also easy to fix elsewhere so not sure
[12:12:47] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/930165
[12:13:24] I think most state changes are done via a cookbook. "Failed" seems to be somewhat special in that it's normally something that is manually set when people hit problems?
[12:14:44] jbond: +1 for that change I think it makes sense
[12:16:44] SRE-tools, netbox, Infrastructure-Foundations, Patch-For-Review: netbox: decided how to deal with blank mgmt dns_names - https://phabricator.wikimedia.org/T339121 (jbond) > As discussed on irc the answer might be for the sync cookbook to ignore devices in 'failed' i have sent a change to do this...
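For readers following the thread, here is a minimal sketch of the two skips discussed in the channel: ignoring devices in 'failed' state and ignoring blank mgmt dns_name values. It is not the code from the Gerrit change. The function name `mgmt_dns_names`, the `IGNORED_STATUSES` set, and the `DECOMMISSIONING` status string are assumptions made for illustration; only the record shape and the `FAILED` value come from the data pasted at 11:45:56.

```python
# Hypothetical sketch; not the actual cookbook code.
# 'FAILED' appears in the pasted record; 'DECOMMISSIONING' is an assumed
# spelling for the "decom" state mentioned in the channel.
IGNORED_STATUSES = {"FAILED", "DECOMMISSIONING"}


def mgmt_dns_names(interfaces):
    """Collect usable dns_name values from mgmt interface records.

    Each record is a dict shaped like the one pasted in the channel:
    {'name': 'mgmt', 'ip_addresses': [{'dns_name': ...}], 'device': {'status': ...}}
    """
    names = []
    for iface in interfaces:
        status = (iface.get("device") or {}).get("status")
        if status in IGNORED_STATUSES:
            # Skip devices that are failed or being decommissioned.
            continue
        for ip in iface.get("ip_addresses") or []:
            dns_name = ip.get("dns_name")
            if not dns_name:
                # Covers both None (name removed) and '' (name blanked).
                continue
            names.append(dns_name)
    return names
```

Treating `None` and the empty string the same way means the result does not depend on whether the name was removed or blanked by hand, which is the distinction that caused the confusion above.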
[12:18:05] topranks: can you give it another look, i have also added a skip for the blank dns_name
[12:24:27] jbond: +1
[12:24:48] cheers
[12:42:17] SRE-tools, Infrastructure-Foundations, SRE: reimage cookbook should exit cleanly if no puppet role is applied to a node - https://phabricator.wikimedia.org/T338990 (jbond) p:Triage→Medium
[12:47:51] SRE-tools, Infrastructure-Foundations, SRE: reimage cookbook should exit cleanly if no puppet role is applied to a node - https://phabricator.wikimedia.org/T338990 (jbond) The ultimate failure here is that the first puppet run failed. i belive that @Volans has looked at this in the past and its not...
[13:29:58] (SystemdUnitFailed) firing: expire_bitu_signups.service Failed on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:33:41] slyngs: FYI ^^
[13:34:33] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (Jhancock.wm) @aborrero the patch changes have been made and the server is currently connected from eno1 to cloudsw ge-0/0/11
[13:36:09] jbond: Thanks, I created a silencer, it's related to me trying to fix firewall access to the database server
[13:36:24] ack
[14:14:43] netops, Infrastructure-Foundations, SRE: test_matching_vlan() function crashig in Netbox network report - https://phabricator.wikimedia.org/T339133 (cmooney) p:Triage→Low
[14:39:44] (SystemdUnitFailed) firing: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:44] (SystemdUnitFailed) resolved: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:30:12] puppet-compiler, Infrastructure-Foundations: Puppet compiler fails due to unset fact wmflib.is_container - https://phabricator.wikimedia.org/T338961 (jhathaway) >>! In T338961#8930997, @jbond wrote: >> However, for the cloud hosts, the documentation did not work. > can your expand? FYI the normal timer...
[15:57:52] SRE-tools, Infrastructure-Foundations: Add --depool-sleep runtime argument when using SRELBBatchRunner class - https://phabricator.wikimedia.org/T339151 (BCornwall)
[16:21:00] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero) a:Papaul→aborrero
[16:31:46] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero) a:aborrero→Jhancock.wm Hey @Jhancock.wm and @Papaul : https://netbox.wikimedia.org/dcim/devices/4143/interfaces/...
[16:39:53] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero) a:Jhancock.wm→aborrero >>! In T338778#8932260, @aborrero wrote: > Hey @Jhancock.wm and @Papaul : > > https://netb...
[16:46:37] puppet-compiler, Infrastructure-Foundations: Puppet compiler fails due to unset fact wmflib.is_container - https://phabricator.wikimedia.org/T338961 (hashar) `puppetmaster.cloudinfra.wmflabs.org` looks like it is the Puppet master for WMCS which I guess might be limited to the cloud services team? ---...
[17:11:47] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (cmooney) >>! In T338778#8932283, @aborrero wrote: >>>! In T338778#8932260, @aborrero wrote: >> Hey @Jhancock.wm and @Papaul : >> >>...
[17:14:55] netbox, Infrastructure-Foundations: Netbox report test_matching_vlan - AttributeError: 'NoneType' object has no attribute 'prefixes' - https://phabricator.wikimedia.org/T339078 (cmooney) Open→Declined Closing as we created duplicates, I'll have a look and update T339133
[17:16:12] netops, Infrastructure-Foundations, SRE: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (cmooney)
[17:16:36] netops, Infrastructure-Foundations, SRE: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney) Open→Resolved a:cmooney Change is now live on all relevant Juniper devices.
[18:41:44] SRE-tools, Infrastructure-Foundations, SRE, Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (BCornwall) Is this to say that the existing cookbook already suffices? If so, it sounds like the actionable here is to update the documentation to reflect Clement'...
[21:52:18] SRE-tools, Infrastructure-Foundations, SRE, Traffic: Write a cookbook to roll reboot cache hosts - https://phabricator.wikimedia.org/T338783 (BCornwall)