[03:40:54] Traffic, SRE, SRE-swift-storage, Thumbor: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334#9751601 (tstarling) >>! In T345334#9654752, @Ladsgroup wrote: > If we do extrapolation after 10,000th hit. The Theil-Sen extrapolation becomes more useful: >...
[09:02:19] netops, Infrastructure-Foundations: mr1-eqsin performance issue - https://phabricator.wikimedia.org/T362522#9752157 (cmooney) >>! In T362522#9751106, @ayounsi wrote: > Good idea, worth trying ! If it's enough it would be less of a pain than changing the SSH port. My guess is we probably need to change t...
[09:05:42] Traffic, SRE, SRE-swift-storage, Thumbor: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334#9752173 (Ladsgroup) That'd work on overall hits, as you said "sort images by popularity". That's not the case here. Front caches absorb all of the hits and c...
[09:40:08] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117#9752321 (gmodena) >> - there's couple of CRs pending (linked to this phab) and I'd like to have a second run on the event s...
[11:31:44] (VarnishHighThreadCount) firing: (9) Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:36:40] (VarnishHighThreadCount) firing: (20) Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:41:40] (VarnishHighThreadCount) firing: (21) Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:46:40] (VarnishHighThreadCount) firing: (29) Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:51:40] (VarnishHighThreadCount) firing: (31) Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:56:40] (VarnishHighThreadCount) firing: (36) Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:01:40] (VarnishHighThreadCount) firing: (35) Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:06:40] (VarnishHighThreadCount) firing: (26) Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:11:40] (VarnishHighThreadCount) firing: (19) Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:16:40] (VarnishHighThreadCount) resolved: (17) Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:31:24] Traffic, Data-Platform-SRE: Pybal: Depool nodes outside broadcast domain - https://phabricator.wikimedia.org/T363697 (bking) NEW
[13:31:36] Traffic, Data-Platform-SRE: Pybal: Depool nodes outside broadcast domain - https://phabricator.wikimedia.org/T363697#9753084 (bking)
[13:32:29] Traffic, Data-Platform-SRE, PyBal: Pybal: Depool nodes outside broadcast domain - https://phabricator.wikimedia.org/T363697#9753090 (taavi)
[13:43:47] Traffic, Infrastructure-Foundations, SRE: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9753177 (ssingh)
[13:48:54] Traffic, Data-Platform-SRE: LVS hosts: Monitor/alert on when pooled nodes are outside broadcast domain - https://phabricator.wikimedia.org/T363702 (bking) NEW
[14:03:39] Traffic, Data-Platform-SRE: LVS hosts: Monitor/alert on when pooled nodes are outside broadcast domain - https://phabricator.wikimedia.org/T363702#9753317 (bking)
[14:30:59] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902#9753457 (Volans) p:Triage→Medium
[14:39:59] Traffic, Data-Platform-SRE: LVS hosts: Monitor/alert on when pooled nodes are outside broadcast domain - https://phabricator.wikimedia.org/T363702#9753479 (cmooney) Thanks Brian. Yeah I think the thing the check would need to do is: 1) Get the current list of active back-end IPs I'm not at all sure ho...
[14:42:33] Traffic, Data-Platform-SRE: LVS hosts: Monitor/alert on when pooled nodes are outside broadcast domain - https://phabricator.wikimedia.org/T363702#9753495 (cmooney) Actually it may be just easier to check the route for each pooled IP and make sure the check doesn't return saying it's using the default as...
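The check discussed in T363702 (find pooled back-ends that are not in the load balancer's broadcast domain) can be sketched roughly as follows. This is a minimal illustration only: it compares addresses against directly connected subnets instead of inspecting the routing table as suggested in the task, and the function name, the subnet, and the IPs are all made up for the example.

```python
import ipaddress


def outside_broadcast_domain(pooled_ips, local_networks):
    """Return the pooled IPs not covered by any directly connected network.

    pooled_ips: iterable of IP address strings (hypothetical back-ends).
    local_networks: iterable of CIDR strings the LVS host is attached to.
    """
    networks = [ipaddress.ip_network(net) for net in local_networks]
    outside = []
    for ip in pooled_ips:
        addr = ipaddress.ip_address(ip)
        # An address is "on-link" if at least one local subnet contains it.
        if not any(addr in net for net in networks):
            outside.append(ip)
    return outside


if __name__ == "__main__":
    # Invented values: one back-end inside 10.64.0.0/22, one outside it.
    print(outside_broadcast_domain(["10.64.0.10", "10.64.16.5"], ["10.64.0.0/22"]))
```

A real check would still need the list of currently pooled back-ends and the host's attached subnets from somewhere; the task leaves open how to obtain them.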
[15:32:26] Traffic, Data-Platform-SRE, PyBal: Pybal: Depool nodes outside broadcast domain - https://phabricator.wikimedia.org/T363697#9753811 (bking) a:bking→None
[15:41:14] Traffic: Craft geo-maps file to create lowest-latency routes from south america - https://phabricator.wikimedia.org/T363722 (BCornwall) NEW
[15:41:27] Traffic: Craft geo-maps file to create lowest-latency routes from south america - https://phabricator.wikimedia.org/T363722#9753871 (BCornwall) p:Triage→Low
[15:56:39] Traffic, DC-Ops, ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9753930 (RobH)
[16:06:59] Traffic, DC-Ops, ops-magru, Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9753982 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp7002.magru.wmnet with OS bullseye
[16:10:30] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117#9754000 (gmodena) >> I like the overall idea, but I'd prefer to proceed DC-by-DC, in switching topics and shutting down Var...
[16:18:57] Traffic, DC-Ops, ops-magru, Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754072 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp7002.magru.wmnet with OS bullseye executed with errors: - cp7002 (**FA...
[16:28:59] 06Traffic, 06DC-Ops, 10ops-magru, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754123 (10RobH) [16:30:02] 06Traffic, 06DC-Ops, 10ops-magru, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754139 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp7002.magru.wmnet with OS bullseye [16:34:53] 06Traffic, 06DC-Ops, 10ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754177 (10RobH) [16:37:56] 06Traffic: Craft geo-maps file to create lowest-latency routes from south america - https://phabricator.wikimedia.org/T363722#9754192 (10BCornwall) [16:46:37] 06Traffic, 06DC-Ops, 10ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754260 (10RobH) [16:50:36] 06Traffic, 06DC-Ops, 10ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754274 (10RobH) [17:01:25] 06Traffic, 06DC-Ops, 10ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754341 (10RobH) [17:06:35] 06Traffic, 06DC-Ops, 10ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754367 (10RobH) [17:08:40] (VarnishHighThreadCount) firing: (7) Varnish's thread count on cp1110:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [17:13:40] (VarnishHighThreadCount) firing: (11) Varnish's thread count on cp1106:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [17:13:58] 06Traffic, 06DC-Ops, 10ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754403 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp7002.magru.wmnet with OS bullseye executed with 
errors: - cp7002 (**FAIL**) - Removed from... [17:17:12] hi traffic friends - quick DNS zonefile update question for you: are there any "manual" coordination steps around running authdns-update? (e.g., in this channel) [17:17:12] I've had the pleasure of watching sukhe demo this process, but this is my first time doing it myself :) [17:17:57] swfrench-wmf: basically merge and then log that you are running authdns-update from any DNS host :) [17:18:40] (VarnishHighThreadCount) firing: (11) Varnish's thread count on cp1106:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [17:18:45] no coordination required here unless you want us to review something basically [17:19:14] sukhe: great, thank you very much! [17:20:48] 06Traffic, 06DC-Ops, 10ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754449 (10RobH) [17:23:40] (VarnishHighThreadCount) firing: (15) Varnish's thread count on cp1106:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [17:28:40] (VarnishHighThreadCount) firing: (19) Varnish's thread count on cp1106:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [17:43:40] (VarnishHighThreadCount) firing: (10) Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [17:46:31] I'm checking if there are some other places where magru is missing but honestly don't understand why it worked in the first place [17:46:39] fabfur: 7001 also fails [17:46:41] let's try 7003 [17:46:45] patching to add to site.pp [17:46:46] I'll do [17:46:48] ok [17:46:49] ok please do [17:48:07] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1025430 [17:48:40] (VarnishHighThreadCount) resolved: (8) Varnish's thread count on 
cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [17:50:59] merging [17:51:06] cookbook command is ready [17:51:20] if this works, we will need to "purge" cp700[12] somehow from puppetdb and try again :) [17:51:25] yep [17:51:35] while you reimage, I will try to figure out that [17:51:59] ok [17:52:01] User input is: "go" [17:52:01] Starting reimage on cp7003. You can check progress via serial console or by running `install-console cp7 [17:52:01] 003.magru.wmnet` on any cumin host [17:52:01] ==> Select puppet version to install with [17:52:07] yeah [17:52:10] interesting [17:52:18] https://www.irccloud.com/pastebin/6lcFMR7x/ [17:52:34] I'll abort this [17:52:38] no no [17:52:41] why are you aborting it? [17:52:52] just finish reimaging it either way [17:52:54] apparently there's some issue, in hiera that values are present [17:53:01] this output is expected [17:53:15] what I mean is [17:53:16] really? because they are in the insetup/traffic.yaml role [17:53:22] *present [17:53:27] it just asks you to check if they are present [17:53:30] but doesn't actually check [17:53:36] mmm [17:53:38] let's try [17:53:41] so you should say 7 and hit enter [17:53:58] 06Traffic, 06DC-Ops, 10ops-magru, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754648 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7003.magru.wmnet with OS bullseye [17:53:58] yep it's proceedintg [17:54:20] ok [17:54:27] so now we need to figure out what to do about 70012 [17:55:01] anyway, why it gives you this message? 
[17:55:04] ==> Unless the host's role has been already migrated to Puppet 7, [17:55:04] to migrate this host change its hiera values to: [17:55:24] it should read the relative hiera key and not asking [17:55:59] my idea is that isn't reading that [17:56:03] anyway, booting to PXE [17:59:00] booting into d-i [17:59:22] nice [17:59:28] on the above, https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hosts/reimage.py#L210 [17:59:33] if self.args.new: [17:59:33] if self.args.puppet_version == 7: [17:59:33] ask_confirmation(dedent( [17:59:34] 06Traffic, 06DC-Ops, 10ops-magru, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754700 (10RobH) [17:59:49] so this is expected, it's asking for confirmation if version is 7 [17:59:57] why it doesn't work the second time -- that I don't know :) [18:00:04] the answer is probably here somewhere [18:00:18] has_puppet7 = self.puppet_server.hiera_lookup(self.fqdn, "profile::puppet::agent::force_puppet7") [18:00:23] this fails the second time basically [18:00:30] couldn't we blame volan.s as usual? [18:00:41] what's up? 
[18:00:42] fabfur: don't know about you but I value my life [18:00:43] (it's taking forever to show the d-i) [18:00:43] hahah [18:00:48] he has a regex for his name [18:00:49] ach [18:01:05] we should start using other words to refer to him [18:01:05] I don't, I sense it though :D [18:01:19] volans: so cookbook running the second time [18:01:21] fails on [18:01:23] has_puppet7 = self.puppet_server.hiera_lookup(self.fqdn, "profile::puppet::agent::force_puppet7") [18:01:45] https://www.irccloud.com/pastebin/Wj8Opbwz/ [18:01:49] (context: we're talking about provisioning with the insetup::traffic role new magru cp hosts) [18:02:54] so force_puppet7 lookup fails [18:02:56] the question is why [18:03:00] (only on the second time) [18:03:19] that happens only if the host already exists in puppetdb [18:03:21] for normal reimaged [18:03:33] for new hosts the --new is passed and the user is asked to pick a version [18:03:54] yeah, we tried that [18:03:55] same thing [18:04:01] Error: Could not run: Function lookup() did not find a value for the name 'prometheus_nodes' (file: /srv/puppet_code/environments/production/modules/profile/manifests/firewall.pp, line: 21) [18:04:28] if I try that with cp7001 or 7002 [18:04:48] with "that" being: sudo puppet lookup --render-as s --compile --node "cp7001.magru.wmnet" "profile::puppet::agent::force_puppet7" [18:05:17] so it seems that the catalog is still not compiling? [18:05:27] oh I see it now [18:05:39] but why would it fail at this stage and not when it actually attemps the puppet run? [18:06:15] it fails to compile the catalog, doesn't matter if you try to lookup hiera (see the --compile) or a --noop run or a real run [18:06:22] the server must compile the catalog in all cases [18:06:44] ah [18:06:47] ok :) that adds up then [18:06:54] thanks! 
we will take it from here, we need to patch this [18:06:59] where "server" I meant puppetserver [18:07:08] sorry for the confusion [18:07:13] yes, I ran it on the puppetmaster and see the output [18:07:29] thanks volans [18:07:30] and also why it's failing [18:07:32] which makes sense [18:07:32] you need to run it on puppetserver if it's p7 [18:07:41] not puppetmaster ;) [18:07:51] sukhe@puppetmaster1001:~$ sudo puppet lookup --render-as s --compile --node "cp7001.magru.wmnet" "profile::puppet::agent::force_puppet7" [18:07:54] Error: Could not run: Function lookup() did not find a value for the name 'prometheus_nodes' (file: /etc/puppet/modules/profile/manifests/firewall.pp, line: 21) [18:08:14] although might work too but at reimage time fo puppet7 will use the puppetservers [18:08:22] the puppetmasters are there for puppet5 [18:08:27] got it [18:08:29] plus private repo and CA for now [18:13:58] grazie [18:14:50] fabfur: 7003 should have failed for you ? [18:15:06] not really, it's at the end of the d-i but I can consider it "failed" [18:15:14] end of d-i [18:15:16] that's nice though [18:15:16] (still no first puppet run) [18:15:30] let it run and see [18:15:37] I will try 7002 again [18:15:43] ack! [18:17:46] did you fix the puppet side? 
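The behaviour described above (new hosts get a prompt, existing hosts get a hiera lookup that only works if the catalog compiles) can be modelled with a small sketch. This is not the cookbook's code: every name here is hypothetical, the `catalog_compiles` flag stands in for the puppetserver, and falling back to Puppet 5 when the key is false is an assumption based on the remark that the puppetmasters serve puppet5.

```python
class CatalogCompileError(Exception):
    """Raised when the (simulated) puppetserver cannot compile the catalog."""


def hiera_lookup(hiera, key, catalog_compiles):
    # Like `puppet lookup --compile`: the catalog has to compile before any
    # value is returned, so an unrelated compile error aborts the lookup.
    if not catalog_compiles:
        raise CatalogCompileError(f"cannot compile catalog, lookup of {key!r} aborted")
    return hiera.get(key, False)


def choose_puppet_version(is_new, operator_choice, hiera, catalog_compiles):
    """Pick the Puppet version for a reimage (hypothetical helper)."""
    if is_new:
        # New host: nothing is looked up, the operator's answer is used.
        return operator_choice
    # Existing host: the version comes from hiera, which needs a compile.
    forced = hiera_lookup(hiera, "profile::puppet::agent::force_puppet7", catalog_compiles)
    return 7 if forced else 5


if __name__ == "__main__":
    hiera = {"profile::puppet::agent::force_puppet7": True}
    print(choose_puppet_version(True, 7, hiera, catalog_compiles=False))
    try:
        choose_puppet_version(False, None, hiera, catalog_compiles=False)
    except CatalogCompileError as exc:
        print("lookup failed:", exc)
```

In this model the first attempt on a host never reaches the lookup, while a retry on a host that is already known does, which matches the "only the second time" symptom discussed in the channel.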
[18:17:53] lookup still failing [18:17:58] yep [18:18:55] maybe we should try to remove it entirely from puppetdb [18:20:13] prometheus_nodes for magru certainly exist now [18:20:21] as in at least the string literal :) [18:22:27] trying one more thing [18:24:15] in the meantime [18:24:26] `[12/50, retrying in 36.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb [18:24:27] ..poll_puppetdb' raised: Nagios_host resource with title cp7003 not found yet` [18:24:31] :D [18:24:33] so there's something not working here [18:25:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1025441 [18:25:49] after this I am going to manually force an agent run on C:prometheus [18:26:24] is that host existing already? [18:26:32] no [18:26:40] but we need the string to exist clearly [18:27:06] it shouldn't matter for a site that is not up anyway (I think :)) [18:27:13] let's try [18:31:49] I'll go afk for some minutes [18:31:55] yeah [18:31:58] I am trying [18:32:02] take a break [18:36:35] news? [18:36:49] no didn't help [18:36:54] still looking [18:37:22] I think the explanation is simpler [18:37:53] seems like the cookbook is stuck at ` Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title cp7003 not found yet [18:37:54] ` [18:37:58] yeah that makes sense [18:38:01] since puppet is failing [18:39:32] you mean host isn't able to reach puppetserver/puppetdb ? [18:39:42] no, puppet agent run failure [18:39:42] (why worked first time? 
) [18:39:43] this [18:39:46] Error: Could not run: Function lookup() did not find a value for the name 'prometheus_nodes' (file: /srv/puppet_code/environments/production/modules/profile/manifests/firewall.pp, line: 21) [18:39:57] fabfur: good question, I think probably because we were hitting other errors [18:40:00] and didn't come to this [18:40:03] the error makes sense in a way [18:40:07] but I am not sure what the right fix is [18:40:09] so looking at that [18:52:06] my suspicion is that since we haven't used this for a while, it has been borked for some time now [18:52:17] since we directly reimaged to cache::text [18:52:29] I think that's worth a try but I want to see if this fixes it or not [18:55:39] yeah I think I found it [18:55:39] - Array[Stdlib::Host] $prometheus_hosts = [], [18:55:49] this was later changed to what it is now [18:55:56] Array[Stdlib::Host] $prometheus_nodes = lookup('prometheus_nodes'), [18:56:07] and since it isn't being set in the lookup, it fails [18:56:29] yep [18:56:30] it worked [18:56:40] sukhe@puppetserver1001:~$ sudo puppet lookup --render-as s --compile --node "cp7003.magru.wmnet" "profile::puppet::agent::force_puppet7" [18:56:43] Warning: Undefined variable '::_role'; (file & line not available) [18:56:45] Warning: Scope(Class[Profile::Netbox::Host]): cp7003.magru.wmnet is unknown in Netbox [18:56:48] Warning: Scope(Class[Profile::Netbox::Host]): cp7003.magru.wmnet: no Netbox location found [18:56:51] true [18:57:15] cool [18:57:17] trying again now [18:57:27] this needs to be patched for all other roles [18:57:29] I will file a task later [18:57:51] (all other insetup roles) [18:58:37] 06Traffic, 06DC-Ops, 10ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9754976 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7002.magru.wmnet with OS bullseye [19:06:40] I'll stop my cookbook and we try to reimage all to cache::text ? 
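The difference between the old parameter default (`= []`) and the current `lookup('prometheus_nodes')` can be illustrated with a toy lookup function. This only imitates the behaviour described in the channel; it is not Puppet's implementation, and the host name in the example data is invented.

```python
_MISSING = object()


class LookupFailure(Exception):
    """Stand-in for Puppet's 'did not find a value for the name' error."""


def lookup(data, key, default=_MISSING):
    """Return data[key]; fail loudly unless a default was supplied."""
    if key in data:
        return data[key]
    if default is _MISSING:
        raise LookupFailure(f"Function lookup() did not find a value for the name {key!r}")
    return default


if __name__ == "__main__":
    hiera = {}  # no prometheus_nodes visible for the new site

    # Old style: an explicit default keeps the catalog compiling.
    print(lookup(hiera, "prometheus_nodes", default=[]))

    # Current style: no default, so a missing key is a hard error.
    try:
        lookup(hiera, "prometheus_nodes")
    except LookupFailure as exc:
        print("Error:", exc)

    # The workaround applied to the role: define the key in its hiera data.
    hiera["prometheus_nodes"] = ["prometheus.example.org"]  # invented value
    print(lookup(hiera, "prometheus_nodes"))
```

Defining the key in the role's hiera data, as done for insetup::traffic, makes the lookup succeed without changing the profile, at the cost of having to repeat it for each role that hits the same error.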
[19:07:09] so I fixed it for insetup::traffic [19:07:13] and now running cp7002 [19:07:16] let's see if it finishes [19:08:03] ack [19:08:08] I'll stop this anyway [19:08:18] it's useless [19:08:18] 06Traffic, 06DC-Ops, 10ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9755009 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7003.magru.wmnet with OS bullseye executed with errors: - cp7003 (**FAIL**) - Removed fr... [19:08:38] fabfur: :P [19:08:54] I am still not sure if this will work and if we have everything we need [19:08:57] we will find it [19:08:58] I mean other than this [19:13:01] (non-urgent) DNS-update follow-up: I've seen docs indicating sre.dns.netbox needs run after my change, but I can't quite square that with the contents of my change (it just adds a couple of CNAMEs [0], so should not result in PTR records, for example). [19:13:02] is that actually needed in practice? [19:13:02] [0] https://gerrit.wikimedia.org/r/c/operations/dns/+/1023964 [19:13:42] swfrench-wmf: so the short answer is that if you just edited the zone files directly in the repo, authdns-update is enough [19:13:54] if you made changes to DNS names on netbox, then you need to run sre.dns.netbox [19:14:03] you can run it with -d 'test' to see if there are pending changes (dry-run) [19:14:22] but in this case, I don't see it can be required [19:14:52] sukhe: great, thank you! I'll update our docs [19:14:53] sre.dns.netbox by default also runs authdns-update [19:15:03] ah, TIL :) [19:15:06] but authdns-update does not run sre.dns.netbox if it makes sense [19:15:13] yup, makes sense [19:15:21] sukhe: I want to try to reimage 7003 to cache::text [19:15:21] also thanks for the tip to dry-run to configm [19:15:29] *confirm [19:15:32] just for test [19:16:08] fabfur: I think we should confirm this first because otherwise we might have to do double the work for it... 
[19:16:35] but if with cache::text works we can rule out some issue [19:17:06] for cache::text [19:17:29] we don't include profile::firewall so it will work [19:17:33] since that's where it is failing [19:17:37] or did you mean some other issue? [19:17:47] right now 7002 is reimaging fine but not until we see it complete I guess :> [19:17:59] the issue with nagios I pasted before [19:18:08] after the reinstallation is complete [19:19:00] the mystery I am trying to solve is why existing includes of profile::firewall work [19:19:13] because they are looking up prometheus_nodes from somewhere [19:19:22] ok [19:19:37] considering that we need prometheus_nodes sooner or later [19:21:06] site.yaml in hiera lookup [19:21:10] is that a deprecated thing or what [19:21:27] sukhe: is that not working with `sudo puppet lookup --compile` or on the actual host during a puppet run? [19:21:37] taavi: both [19:21:56] for the insetup role we fixed it with https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/67eda9cb0a891289c18d53d3208864b7c1b7b139%5E%21/#F0 so far [19:22:13] with cp7003? [19:22:17] 7002 [19:22:22] currenlty reimaging [19:22:42] also for 7003, both are insetup::traffic [19:25:00] I wonder how many things would break if we ever reached five digits (e.g. cp10001) [19:26:29] right now the override in insetup::traffic is making the `--explain` flag in `puppet lookup` not much helpful in figuring out why the site file did not work [19:26:59] that's what is bothering me as well, why can't it read site.yaml [19:27:55] > It should also be noted that unqualified global values i.e. those without a ::, will be looked up in common.yaml and $site.yaml [19:29:05] how's going w/ 7002? [19:29:27] going fine so far [19:29:32] first puppet run coming up [19:29:50] let's hope [19:30:35] stepping away for a bit, will come back later. maybe touching grass might help! 
[19:31:07] ok
[19:31:13] I'll stay here a little more
[19:31:59] Traffic, DC-Ops, ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9755136 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7003.magru.wmnet with OS bullseye
[19:33:15] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9755130 (RobH) Open→In progress a:RobH Stealing this task to do the network provision, bios provision, dns setup, and firmware setup...
[19:48:39] I'm reimaging 7003, it's me or it's slower than other hosts done in the past (I mean the d-i)
[19:49:56] yeah, slower because of installserver in eqiad
[19:50:18] yeah but also some tasks that doesn't require network
[19:50:29] like fs creation
[19:50:36] but maybe it's my impression
[20:01:18] first puppet run, I'll go washing dishes
[20:03:08] It says to go enjoy the wait, not do dishes
[20:07:05] you are right, now I'll send a patch to change the phrase
[20:07:31] "Sit back, relax and REMEMBER THAT YOU HAVEN'T DONE DISHES TODAY!!!"
[20:09:12] Traffic, DC-Ops, ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9755247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7002.magru.wmnet with OS bullseye completed: - cp7002 (**WARN**) - Downtimed on Icinga/Al...
[20:09:32] cp7002 finished successfully
[20:09:35] minor warning, not related
[20:10:09] so no more pending issues, hopefully
[20:10:21] I guess we can now either dig into the prometheus_nodes thing or simply ignore it and move on
[20:10:27] the latter might come back to bite us
[20:10:31] as it always does :)
[20:14:11] i would be interested in at least seeing if it's still broken if the definition is removed from the role hiera
[20:15:19] yeah we can try that once the current fabfur reimage finishes
[20:15:23] (again)
[20:16:07] but what would have changed now?
[20:16:22] all previous reimages failed without the override
[20:16:25] anyway we will try!
[20:18:00] I like the idea of berating the operator to go do some household chores while waiting
[20:35:59] https://www.irccloud.com/pastebin/wXEb0k4m/
[20:36:08] it should be ok but I ask just in case
[20:37:05] yeah
[20:37:07] add them
[20:37:24] tnx
[20:37:38] I assume is something just for the very first time
[20:37:41] Traffic, DC-Ops, ops-magru: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9755368 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7003.magru.wmnet with OS bullseye completed: - cp7003 (**WARN**) - Downtimed on Icinga/A...
[20:37:53] anyway, it passed so party time
[20:39:58] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9755387 (RobH)
[20:41:22] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9755389 (RobH)
[21:00:21] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9755454 (RobH)
[22:03:23] netops, Traffic, DC-Ops, Infrastructure-Foundations, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9755661 (RobH) In progress→Open a:RobH→None All of the misc hosts have had network provisioning, firmware, and bios provisioning...
[23:56:57] Traffic, DC-Ops, ops-magru, Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9755997 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7004.magru.wmnet with OS bullseye