[00:19:50] (SystemdUnitFailed) firing: (14) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:23:23] (SystemdUnitFailed) firing: (14) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:29:52] (SystemdUnitFailed) firing: (14) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:33:24] (SystemdUnitFailed) firing: (14) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:33:59] (PuppetFailure) firing: (2) Puppet has failed on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:34:50] (SystemdUnitFailed) firing: (14) export_smart_data_dump.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:23] (SystemdUnitFailed) firing: (13) export_smart_data_dump.service Failed on ganeti1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:39:50] (SystemdUnitFailed) firing: (13) export_smart_data_dump.service Failed on ganeti1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:42:59] (PuppetZeroResources) firing: Puppet has failed generate resources on puppetserver1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[00:43:23] (SystemdUnitFailed) firing: (12) export_smart_data_dump.service Failed on ganeti1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:43:59] (PuppetFailure) firing: (2) Puppet has failed on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:44:50] (SystemdUnitFailed) firing: (12) export_smart_data_dump.service Failed on ganeti1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:48:23] (SystemdUnitFailed) firing: (12) export_smart_data_dump.service Failed on ganeti1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:53:23] (SystemdUnitFailed) firing: (9) export_smart_data_dump.service Failed on ganeti1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:57:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on puppetserver1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[04:44:14] (PuppetFailure) firing: Puppet has failed on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:53:23] (SystemdUnitFailed) firing: stunnel4.service Failed on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:44:14] (PuppetFailure) firing: Puppet has failed on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:49:48] XioNoX: do you know anything about this puppet failure on netmon1003? Binding service [rsync] to :::1873: Address already in use (98)
[08:50:17] * volans running puppet again to see if it repros
[08:51:40] volans: no idea
[08:52:06] the port is already in use by rsync, so maybe some race condition?
[08:53:25] (SystemdUnitFailed) firing: stunnel4.service Failed on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:53:33] stunnel is running rsync, or at least trying to
[08:54:48] a change from jesse ce0d3b009c887cdbf96d2fef9d54b5cdd456b5d7
[08:54:55] - Exec['compile fragments'],
[08:54:55] + Concat[$rsync::server::rsync_conf],
[08:55:03] I think it might be that, not 100% sure
[08:56:14] it's a bit in-depth into the puppet logic, I think jbond might answer much quicker than me digging :)
[08:57:49] quick link to the change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/976284
[09:20:47] ack cheers volans, I'll take a look
[09:21:12] thx
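For reference, the "(98)" in that stunnel message is errno EADDRINUSE from bind(): something else still holds the listening socket. A minimal, hypothetical Python probe along these lines shows the symptom (1873 is the port from the log above; this is illustrative, not WMF tooling):

```python
import errno
import socket

def port_in_use(port: int) -> bool:
    """Try to bind the port on all IPv6 addresses; report EADDRINUSE."""
    with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as sock:
        try:
            sock.bind(("::", port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:  # errno 98 on Linux
                return True
            raise
    return False

if __name__ == "__main__":
    print(port_in_use(1873))  # the rsync-over-stunnel port from the log
```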
[09:26:29] volans: maybe a silly question, but could the cookbook automatically start a tmux session instead of failing with "wmflib.exceptions.WmflibError: Must be run in non-interactive mode or inside a screen or tmux."?
[09:26:55] some people use tmux, some screen
[09:27:10] and they are comfortable with one or the other, maybe not both
[09:28:24] that wouldn't prevent people from running it in their preferred tool, just set a default if it's not already set
[09:29:02] I understand, I'm just afraid that it could start the one you don't know how to use :D
[09:29:28] and I'm not sure we can do that without some hack
[09:29:34] because:
[09:29:56] 1) currently the requirement for the tmux is in the cookbook's code, opt-in, so we don't know if it's needed until later on
[09:29:58] yeah I was mostly wondering about technical limitations
[09:30:31] 2) the easiest way would be to ship a bash wrapper for the cookbook binary to run the tmux before calling python
[09:30:49] so I think it would be feasible if we make it mandatory for *all* cookbooks
[09:31:12] with the addition of detecting if you're already in one and not starting it
[09:31:15] in that case
[09:32:27] thx for the info!
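That WmflibError is the durability check refusing to run outside screen/tmux; both multiplexers mark their child shells with an environment variable ($STY for screen, $TMUX for tmux), which is also how a wrapper could detect them. A minimal sketch of the wrapper idea from point 2 above, assuming a `cookbook` entry point by that name; this is illustrative, not the actual tooling:

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: re-exec the cookbook entry point inside tmux
unless a screen or tmux session is already detected."""
import os
import shlex
import sys

def in_multiplexer() -> bool:
    # screen exports $STY and tmux exports $TMUX in their child shells.
    return bool(os.environ.get("STY") or os.environ.get("TMUX"))

def main() -> None:
    argv = ["cookbook", *sys.argv[1:]]  # assumed name of the real entry point
    if in_multiplexer():
        os.execvp(argv[0], argv)  # already durable: run directly
    # Not in tmux/screen: start a tmux session running the same command,
    # quoted into a single shell-command argument.
    os.execvp("tmux", ["tmux", "new-session", shlex.join(argv)])

if __name__ == "__main__":
    main()
```

As noted above, a wrapper like this only helps if it becomes mandatory for all cookbooks, since the opt-in requirement currently lives inside each cookbook's own code.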
[09:34:50] (SystemdUnitFailed) resolved: stunnel4.service Failed on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:36:25] 10Puppet, 10Infrastructure-Foundations, 10SRE: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293 (10Vgutierrez) 05Resolved→03Open It looks like we are having some issues with the raid fact: ` Nov 22 23:47:01 cp4037 smart-data-dump[748598]: Command '['/usr/bin/timeout', '120', '/us...
[09:38:59] (PuppetFailure) resolved: Puppet has failed on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:44:28] volans: XioNoX: fyi netmon is sorted, I think stunnel just got itself into a strange state. Killing it and restarting fixed the issue
[09:44:42] thx
[09:44:57] jbond: did you kill -9? because I tried systemctl stop and start and it didn't work
[09:45:07] thanks for fixing
[09:45:20] volans: yes i did kill -9
[09:45:26] ... eventually
[09:47:56] lol ok
[09:51:05] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[10:09:36] hello, hello - how are things on the reimage front? 😇
[10:35:38] jbond: can I reenable puppet on prometheus codfw?
[10:39:02] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[10:42:15] jhathaway: thanks for https://ipng.ch/s/articles/2023/11/11/mellanox-sn2700.html it's a really nice read! Not sure it would fit in our infra (especially in terms of automation) but it would be nice to have a 100% Debian network stack (cc topranks if you haven't seen it)
[10:48:15] wow that's a really interesting post, thanks
[10:49:18] It's definitely not prime-time for production, but good to see switchdev isn't dead; I wonder will nvidia/mellanox continue developing it
[10:50:39] godog: yes
[10:50:44] sorry, forgot about that
[10:58:06] cheers, no problem
[11:17:52] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[11:53:43] jayme: I've been in meetings most of the morning, I'll work on that now; we agreed on the strategy with john yesterday afternoon
[11:54:29] tl;dr: always clean up both v5 and v7, use the new hiera_lookup() for existing hosts to detect the puppet version to install, and keep more or less the same existing logic for --new as we can't use hiera_lookup()
[11:54:47] I should get something testable by today
[12:28:00] sweet, thanks! I've some nodes available for testing
[12:29:23] great
[14:35:08] 10netops, 10Infrastructure-Foundations, 10SRE: Migrate IP gateway for private1-b-codfw to spine switches - https://phabricator.wikimedia.org/T351534 (10cmooney)
[14:52:49] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10ayounsi)
[15:16:56] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10Volans) When we introduced the sre.hosts.provision cookbook we envision Piling many changes together simplifies the user interaction but leaves a lot of open questions...
[15:25:39] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10cmooney) Great work! I had some thoughts on this, more around the latter pieces than the workflow itself. In terms of the proposed cookbook, do you envision it running...
[15:29:20] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10cmooney) >>! In T351891#9355490, @Volans wrote: > In addition I think that we need to solve first another problem, that is a pre-requisite for this and other similar req...
[15:30:57] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10Volans) >>! In T351891#9355523, @cmooney wrote: >>>! In T351891#9355490, @Volans wrote: >> In addition I think that we need to solve first another problem, that is a pre...
[15:35:39] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10cmooney) >>! In T351891#9355525, @Volans wrote: > How does the cookbook know which spec table to use for a given host? User-input? Then we're back to square one. As Arz...
[15:49:02] XioNoX: if nothing else, it is at least an interesting thought exercise
[15:55:23] jayme, jbond: my proposal https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/977094
[15:59:59] lgtm thanks volans
[16:02:55] volans: one'ish typo... and a question: from the code it seems I would have to run it with --new in my use case
[16:03:17] I did not feel like the mw appservers would be --new
[16:03:33] are they, because of the role change?
[16:05:00] jayme: why --new?
[16:05:05] no you shouldn't
[16:05:10] if they are still in puppetdb
[16:05:26] ah, I misread. I no longer have to provide --puppet-version
[16:06:15] no, it would fail if you do without --new
[16:06:18] got caught up in "if not args.new and args.puppet is not None:"
[16:06:35] while with --new either you pass it or the cookbook asks for it
[16:06:38] yes, yes. Makes sense. :)
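A sketch of that decision logic: the hiera key and the hiera_lookup() call are taken from this discussion and from the session pasted further below, while the function name, return values, and error handling are illustrative assumptions, not the actual patch.

```python
from typing import Optional

HIERA_KEY = "profile::puppet::agent::force_puppet7"

def pick_puppet_version(spicerack, fqdn: str, new: bool,
                        cli_version: Optional[int]) -> int:
    """Illustrative: choose the Puppet version to install for a reimage."""
    if new:
        # A new host has no facts in PuppetDB yet, so hiera cannot compile
        # anything for it: the version must come from the operator.
        if cli_version is None:
            # Per the discussion above, the cookbook would ask interactively
            # here; raising keeps this sketch short.
            raise ValueError("--puppet-version is required with --new")
        return cli_version
    if cli_version is not None:
        raise ValueError("--puppet-version is only accepted with --new")
    # Existing host: ask the puppetserver directly.
    value = spicerack.puppet_server().hiera_lookup(fqdn, HIERA_KEY)
    return 7 if value == "true" else 5
```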
[16:07:20] jbond: quick q you might know, I'm checking a dhcp problem for dc-ops, going to reimage new server an-worker1160, do you know if those hosts are on puppet7 yet?
[16:07:28] updated CR
[16:08:54] topranks: new host?
[16:09:21] an-worker1150 is on puppet7
[16:09:21] volans: yeah
[16:09:49] hosts/an-worker1160.yaml:profile::puppet::agent::force_puppet7: true
[16:09:50] ok cool, I'll assume it's the same for all of them so; anyway if I'm wrong it can be done again, won't affect the DHCP part
[16:09:52] so yes
[16:09:57] not sure why they are set on a per-host basis
[16:09:57] thanks :)
[16:09:59] and not at a role basis
[16:10:05] indeed yeah
[16:10:37] git grep profile::puppet::agent::force_puppet7 hieradata/ | grep an-worker
[16:11:10] ah ok
[16:11:17] thanks, should really have just done that
[16:11:18] git grep profile::puppet::agent::force_puppet7 hieradata/ | grep hadoop
[16:11:20] hieradata/role/common/analytics_cluster/hadoop/worker.yaml:profile::puppet::agent::force_puppet7: true
[16:11:23] is also at the role level
[16:11:30] weird
[16:11:39] the reimage cookbook is prompting to add just that key actually, probably why it's being done on a per-host basis for them
[16:11:41] ah no
[16:11:43] ahhh got it
[16:11:46] they are insetup
[16:11:56] mmmh
[16:11:59] https://www.irccloud.com/pastebin/xNGY2OmN/
[16:12:02] jbond: maybe the workflow is wrong
[16:12:26] topranks: it's not needed
[16:12:27] $ git grep profile::puppet::agent::force_puppet7 hieradata/ | grep data_engineering
[16:12:30] hieradata/role/common/insetup/data_engineering.yaml:profile::puppet::agent::force_puppet7: true
[16:12:34] is already done at role level
[16:12:45] in my last patch I changed the prompt to
[16:12:51] Unless the host's role has been already migrated to Puppet 7,
[16:12:56] to migrate this host change its hiera values to:
[16:12:57] volans: yeah exactly, what I mean is I suspect that warning is causing people to add it even though it's already at role level
[16:13:02] yeah :/
[16:13:35] unfortunately we don't have an authoritative way to detect that until the puppet facts are exported to the puppetmaster
[16:13:47] because a hiera lookup needs the facts to compile the catalog
[16:14:30] yeah, I guess worst case we can just remove them all once the migration is done
[16:15:01] yep
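Since there is no authoritative check until the facts exist, the eventual cleanup of redundant per-host keys can be done textually. A throwaway sketch, assuming a puppet.git checkout with the hieradata layout shown in the greps above:

```python
from pathlib import Path

KEY = "profile::puppet::agent::force_puppet7"
hieradata = Path("hieradata")  # run from the root of a puppet.git checkout

# Per-host overrides of the key, to cross-check against role-level settings
# once the migration is done (see the discussion above).
host_files = sorted(p for p in (hieradata / "hosts").glob("*.yaml")
                    if KEY in p.read_text())
role_files = sorted(p for p in (hieradata / "role").rglob("*.yaml")
                    if KEY in p.read_text())

print(f"{len(host_files)} per-host overrides, {len(role_files)} role-level settings")
for path in host_files:
    print(path)
```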
[16:21:07] jayme: do you want to test it from gerrit or should I merge and let you test it after?
[16:21:58] volans: just +1ed - I'll test after merge
[16:24:20] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10Volans) >>! In T351891#9355538, @cmooney wrote: > As Arzhel defined it there would be one table, and the host the script (be that existing Netbox ProvisionServerNetwork...
[16:24:50] thx
[16:30:59] jayme: all yours, merged and deployed. I'm around for hotfixes if needed.
[16:31:12] ack, I'll kick off one reimage
[16:32:20] volans: immediate fail unfortunately :/
[16:33:17] what did I do?
[16:33:46] maybe unrelated.. RemoteExecutionError: Cumin execution failed (exit_code=2)
[16:33:58] ah, mw2420 vs fqdn
[16:34:12] ah
[16:35:10] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/977100
[16:35:29] +1
[16:35:30] I'll wait for CI and then quick-merge
[16:35:33] and deploy
[16:35:37] ack
[16:39:12] merged, running puppet
[16:40:57] jayme: all yours, sorry for the trouble
[16:41:14] npnp, thanks for implementing it :)
[16:41:55] and
[16:41:55] >>> spicerack.puppet_server().hiera_lookup("mw2420.codfw.wmnet", "profile::puppet::agent::force_puppet7")
[16:41:58] 'true'
[16:42:08] so it should do the right thing
[16:42:09] :D
[16:42:12] * volans fingers crossed
[16:42:14] yup. asked for the management password
[16:43:01] might have pasted garbage
[16:43:44] yep... it started its thing
[16:43:51] I'm tailing the logs
[16:44:12] yep, puppet node clean/deactivate on both
[16:44:58] jayme: what's the timeline for those repurposes? to understand if we can still automate the other bits and make you rename them, or if it's too late ;)
[16:45:36] volans: we need to start now, but we will repurpose in batches while we increase traffic to k8s
[16:45:45] so it really depends on how well it goes
[16:46:18] but we will start now... yesterday :)
[16:49:28] ack, and what about the renumbering for the new codfw stack in rows A-B?
[16:49:51] it could be less disruptive to do both at the same time for the hosts in those rows
[16:50:15] I think I lack context
[16:51:34] T327938 and all related
[16:51:35] T327938: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938
[16:51:53] basically the new switches will have the new network setup like in drmrs or esams
[16:52:06] 10SRE-tools, 10Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (10cmooney) >>! In T351891#9355575, @Volans wrote: >>>! In T351891#9355538, @cmooney wrote: >> As Arzhel defined it there would be one table, and the host the script (be th...
[16:52:32] so hosts at some point will need to migrate to the new VLANs; the plan is yet to be defined on how and when, and surely we'll have some automation for that
[16:53:07] but given it's anyway a bit disruptive, maybe given the re-purposing we could catch two birds with one commit
[16:53:59] uh, I wasn't aware. Does not sound like fun :D
[16:54:10] yeah :D
[16:55:41] I guess we can rename them then - if we need to renumber anyway
[16:56:20] but AFAIK we have not at all thought about the switch refresh yet
[17:03:25] I'll leave netops to comment on the timeline for that and whether we're already ready to move hosts or not (potentially not yet)
[17:03:56] * volans brb in 10
[17:09:00] is the percentage of traffic sent to mw per-DC, or are you keeping the size of the k8s clusters similar between the 2 DCs?
[17:12:28] jayme: I have been working on the codfw row move a lot, but I am the reverse of you: I think I lack context about these hosts being "repurposed"
[17:13:04] to your point it is probably not going to be fun, and to volans' point perhaps we can make it less bad if we somehow combine
[17:13:48] is there a task about the repurposing or somewhere I can get up to speed on what's involved?
[17:15:15] topranks: mw hosts becoming k8s hosts
[17:15:41] buster -> bullseye, puppet5 -> puppet7, for now same name same IP
[17:16:07] I did notice mw hosts added to the k8s BGP groups (sticking out cos of the different hostnames)
[17:16:24] volans: ok, so buster -> bullseye involves a reimage right?
[17:16:54] yep
[17:17:14] we should indeed potentially try to reimage them onto the new vlans/IPs during that process
[17:17:25] that was my thought
[17:17:49] I had been holding off cos not all of the first phase is done, but there is nothing stopping us doing that at this stage; enough steps are completed that they can be connected to new leaf switches etc.
[17:18:06] Do we have a list of hosts being repurposed?
[17:18:35] in the end, all of them :D
[17:18:55] that's awesome tbh :)
[17:18:57] we're talking potentially 65 in row A and 50 in row B
[17:19:08] from 'P{P:netbox::host%location ~ "A.*codfw"} and A:all-mw-codfw'
[17:19:18] but I'm not sure which roles are migrated when
[17:19:32] so maybe I should just select api or app
[17:19:45] ok yep, a lot all in all
[17:20:25] better to change IP when reimaging, and set the calico BGP up to peer with the new switch from day 1
[17:21:48] are we already ready to host them in the new VLANs?
[17:22:17] this is the typical example in which some better cross-team coordination would have helped us plan better :( (cc jobo)
[17:23:15] indeed
[17:24:09] volans: yeah I'd hoped to move the GWs and a few other bits first, but in theory we're ready to go on the new vlans
[17:24:44] we can move their uplink to the new switches (and stay on the existing vlan), then reimage onto the new vlan, won't be a problem
[17:30:51] 10SRE-tools, 10Infrastructure-Foundations: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (10Volans) Some random additions: * I would probably add a grep for the IP on at least `/etc` on the host too to check if it's hardcoded somewhere else in addition to `/etc/network/interf...
[17:36:02] sorry, I had to take a call. But I see v.olans cleared it up already :)
[17:36:25] btw. the reimage was a success
[17:37:23] yay
[17:37:51] * volans errand to run
[17:38:01] I made lists of all the nodes ordered by purchase date (as we're going from most to least recent). See https://phabricator.wikimedia.org/T351074 topranks
[17:38:22] jayme: that's great, thanks!
[17:39:21] I assume any work on this will be paused for the change freeze in December?
[17:42:10] yep
[17:42:58] got to go - ttyl o/
[20:56:26] 10netbox, 10Infrastructure-Foundations, 10SRE: Error creating device in netbox - https://phabricator.wikimedia.org/T336547 (10Volans) 05Open→03Resolved p:05Triage→03High Thanks for reporting this. The issue was caused by a bug in one of the new custom validators that was hit only during the creation...
[21:00:08] 10SRE-tools, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (10Volans) Trying to run the import puppetdb script on `cloudgw1002` is now a noop, but for `cloudgw2002-dev` fails with this exception: `lang=python...
[21:01:25] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.decorators.retry: dynamic_params_callbacks=(set_tries,) dosn't seem to work as expected - https://phabricator.wikimedia.org/T346134 (10Volans) 05Open→03Resolved The change has been merged and released with Spicerack v7.3.0 on Oct. 4th. Res...
[21:03:57] 10SRE-tools, 10Infrastructure-Foundations: wmflib: improve interactive.ask_input to support free-form responses - https://phabricator.wikimedia.org/T327408 (10Volans) 05Open→03Resolved This was fixed in wmflib v1.2.1 released on Feb. 2nd.
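For context on that last task: wmflib.interactive.ask_input originally only accepted a fixed list of choices, roughly as below; the free-form support is what T327408 added. Basic usage, for illustration only:

```python
from wmflib.interactive import ask_input

# ask_input() keeps prompting until the answer matches one of the choices;
# T327408 was about also allowing free-form responses.
answer = ask_input("Proceed with the reimage?", choices=["yes", "no"])
print(f"user said: {answer}")
```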