[00:48:34] so I thought I had fixed it but it's still failing with the same error https://puppetboard.wikimedia.org/report/cp3081.esams.wmnet/e04360a7e067193af28f28bbd6971c893dcceb4f
[00:48:46] which is weird because the ipaddress definitely is in the subnet specified in realm.pp
[00:49:47] any ideas on what can be wrong here?
[00:49:57] still getting
[00:49:57] Site (undefined) not found in cluster cache_upload
[00:49:58] …in /etc/puppet/modules/profile/manifests/base.pp, line: 45, column: 9.
[05:20:19] sukhe: https://gerrit.wikimedia.org/r/c/operations/puppet/+/948713
[07:18:35] (SystemdUnitFailed) firing: clean-confd-rundir.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:33:35] (SystemdUnitFailed) resolved: clean-confd-rundir.service Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:57] XioNoX: ha, thank you
[09:17:02] XioNoX: such a weird oversight on my part :]
[09:17:24] sukhe: parsing IPs as strings is a bad idea from the get go...
[09:17:40] :)
[09:21:29] >>> netaddr.IPAddress("10.80.1.2") in netaddr.IPNetwork("10.80.0.0/16")
[09:21:32] True
[09:21:40] when you don't see something, you don't see it (this is from last night)
[09:21:46] anyway, let's finish this up
[09:22:16] XioNoX: while I have you here, I wanted to see if you will add the anycast stuff back to homer for esams
[09:22:21] anycast_neighbors, or if I should do it
[09:22:57] sukhe: sure, do you have the IPs handy?
[09:23:28] only dns3003 racked so far, 185.15.59.34
[09:24:11] sukhe: should we just add all the IPs directly?
[09:26:09] yep, that should be it
[09:26:12] except just one more thing I think
[09:26:54] checking
[09:27:17] profile::bird::neighbors_list:
[09:27:24] - 91.198.174.245 # cr3-esams loopback v4
[09:27:24] - 2620:0:862:ffff::5 # cr3-esams loopback v6
[09:27:24] - 91.198.174.244 # cr2-esams loopback v4
[09:27:25] - 2620:0:862:ffff::3 # cr2-esams loopback v6
[09:27:36] for esams, this also needs an update
[09:30:52] sukhe: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/948999
[09:31:14] looking
[09:31:59] (updated)
[09:34:59] sukhe: and https://gerrit.wikimedia.org/r/c/operations/puppet/+/949000
[09:43:33] sukhe: thx for pcc
[09:43:48] switch side is ready
[09:43:53] np, just wanted to be sure given it's anycast :)
[09:45:09] sukhe: puppet side deployed too
[09:46:38] thanks! going to bring up the DNS shortly and see how that goes
[10:43:35] (SystemdUnitFailed) firing: prometheus-ipmi-exporter.service Failed on ganeti3005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:44:29] (SystemdUnitFailed) resolved: prometheus-ipmi-exporter.service Failed on ganeti3005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:16:20] CAS-SSO, Data-Platform-SRE, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) Actively working on this, thus moving it back in progress as we plan on implementing the solutions defined on https://phabricator.wikimedia.org/T3432...
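Editor's note on the 09:17 remark about parsing IPs as strings: a minimal sketch using Python's standard `ipaddress` module instead of the `netaddr` call pasted at 09:21. The helper names `naive_in_subnet` and `in_subnet` are hypothetical, and the string-prefix comparison is only an assumed illustration of how text-based matching can go wrong; it is not the logic in realm.pp.

```python
import ipaddress


def naive_in_subnet(ip: str, prefix: str) -> bool:
    # Hypothetical string-based check: compares characters, not address bits.
    return ip.startswith(prefix)


def in_subnet(ip: str, cidr: str) -> bool:
    # Parse both sides so membership is decided on the numeric address.
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)


if __name__ == "__main__":
    # The text prefix "10.8" also matches 10.80.x.x, which is a different /16.
    print(naive_in_subnet("10.80.1.2", "10.8"))
    print(in_subnet("10.80.1.2", "10.8.0.0/16"))
    print(in_subnet("10.80.1.2", "10.80.0.0/16"))
```

The parsed comparison gives the same answer as the `netaddr` one-liner for the 10.80.0.0/16 case and rejects the lookalike 10.8.0.0/16 network that the prefix match accepts.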
[13:29:29] (SystemdUnitFailed) firing: httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:23:36] (SystemdUnitFailed) firing: (2) httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:29] (SystemdUnitFailed) resolved: httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:09:08] hi folks
[15:09:16] wondering if someone has seen this and can help save some time with debugging
[15:09:19] RuntimeError: Expected 1 result from PuppetDB got 3
[15:09:22] full error here https://www.irccloud.com/pastebin/N4ZDj7sG/
[15:10:15] fairly obvious in the cookbook but can't figure out what is failing
[15:10:19] if len(json_response) != 1:
[15:10:21] raise RuntimeError(f'Expected 1 result from PuppetDB got {len(json_response)}')
[15:14:36] https://puppetboard.wikimedia.org/report/dns3003.wikimedia.org/969ca190ce468ba649b69c301542cf86a2f94371
[15:14:39] full error here
[15:14:49] er, puppet run
[15:21:45] hmm, did you run the noop three times sukhe?
[15:22:22] if you mean the cookbook, just twice
[15:22:34] https://puppetboard.wikimedia.org/node/dns3003.wikimedia.org
[15:23:34] okay
[15:25:41] let me try to reproduce from the cli
[15:26:14] I tried that
[15:26:15] and I get
[15:26:15] '[{"certname":"dns3003.wikimedia.org"},{"certname":"dns3003.wikimedia.org"},{"certname":"dns3003.wikimedia.org"}]\n'
[15:27:12] which then matches the cookbook as expected
[15:27:21] but the question is why three items here though
[15:28:21] ah, well you are few steps ahead of me!
[15:30:09] so I guess the question remains: why are we seeing three here :)
[15:30:30] in fact, irrespective of hostname, I get three items
[15:30:37] >>> m = { "query": [ "and", ["=", "certname", "dns3002.wikimedia.org"], ["=", "type", "Nagios_host"], ["=", "exported", True] ] }
[15:30:40] >>> print(requests.post('https://puppetdb-api.discovery.wmnet:8090/pdb/query/v4/resources', json=m).json())
[15:30:43] [{'certname': 'dns3002.wikimedia.org'}, {'certname': 'dns3002.wikimedia.org'}, {'certname': 'dns3002.wikimedia.org'}]
[15:32:04] but non-dns one
[15:32:05] >>> m = { "query": [ "and", ["=", "certname", "cp3081.esams.wmnet"], ["=", "type", "Nagios_host"], ["=", "exported", True] ] }
[15:32:08] >>> print(requests.post('https://puppetdb-api.discovery.wmnet:8090/pdb/query/v4/resources', json=m).json())
[15:32:11] [{'certname': 'cp3081.esams.wmnet'}]
[15:33:11] so yeah I wonder what's up with dns here?
[15:34:19] weird
[15:41:20] $ curl -sX GET http://localhost:8080/pdb/query/v4 --data-urlencode "query=$( "185.15.59.34"
[15:41:24] "2a02:ec80:300:2:185:15:59:34"
[15:41:26] "dns3003"
[15:41:51] so it looks like we have three different titles in puppetdb for the same puppet type
[15:41:59] sukhe: ^
[15:42:05] ok
[15:42:08] how did this happen I guess
[15:42:11] and why just for the DNS hosts
[15:42:16] but more importantly how to resolve this :)
[15:43:05] all good questions
[15:43:58] can we just clear puppetdb for this host somehow?
[15:45:02] I assume so, but I don't know how offhand, looking
[15:45:08] thanks for the help
[15:45:54] my puppetdb knowledge would fit in a thimble
[15:47:07] just very weird that the dns hosts return three entries
[15:47:15] and everything else one, which is what the cookbook wants
[15:47:27] were they provisioned differently?
[15:47:38] no, same cookbook and everything else
[15:47:48] and dns3002 is the old host
[15:47:54] so it's been quite some time since it was provisioned anyway
[15:47:58] years
[15:48:17] cp3081, which I shared above, was provisioned this morning
[15:49:29] (SystemdUnitFailed) firing: (2) prometheus-ganeti-exporter.service Failed on ganeti3005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:59:54] sukhe: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/dns/recursor.pp#85
[15:59:59] pretty sure that is the issue
[16:00:29] in what way though
[16:00:33] not seeing it
[16:01:43] that defined type ultimately creates a nagios_host with its title being the ipv4 and ipv6 address
[16:02:10] I see
[16:02:16] which then explains why we see it for the dns hosts only
[16:02:16] I haven't yet found where the hostname gets created
[16:02:22] but also then why was it working before I guess
[16:02:27] this line has been there all this time
[16:02:31] I assume it is part of our standard config, but I haven't found it yet
[16:02:35] I think you are on the right track, just dumping my thoughts
[16:02:47] perhaps the reimage check is newish?
[16:03:54] blame shows that line to be at least two years old
[16:04:02] https://github.com/wikimedia/operations-cookbooks/commit/3de2f255fc6139200c9f9f7651630a21b009dd28
[16:04:13] but we have done a host reimage as recently as last month
[16:04:47] hmm
[16:05:17] sorry quite the rabbit hole
[16:06:47] what's the issue?
[16:06:57] this "Could not evaluate: Working directory /srv/git/netbox_dns_snippets does not exist!" ?
[16:07:05] looking at https://puppetboard.wikimedia.org/report/dns3003.wikimedia.org/969ca190ce468ba649b69c301542cf86a2f94371
[16:07:27] no because this is kinda expected
[16:07:44] but more so because the cookbook fails with an obvious issue which we can reproduce on puppetdb hosts
[16:08:30] if len(json_response) != 1:
[16:08:31] raise RuntimeError(f'Expected 1 result from PuppetDB got {len(json_response)}')
[16:08:34] failure line is this
[16:08:35] (SystemdUnitFailed) resolved: prometheus-ganeti-exporter.service Failed on ganeti3007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:08:56] jhathaway: if len(json_response) != 1:
[16:08:59] sorry
[16:09:02] https://github.com/wikimedia/operations-cookbooks/commit/cd72b2539062328ed6a98fd84f868bff7df7e1c1
[16:09:54] XioNoX:
[16:09:55] 11:26:15 < sukhe> '[{"certname":"dns3003.wikimedia.org"},{"certname":"dns3003.wikimedia.org"},{"certname":"dns3003.wikimedia.org"}]\n'
[16:10:05] this is what is returned from puppetdb, it should be 1 item here
[16:10:08] and hence the cookbook fails
[16:10:10] it's 1 for lvs, cp
[16:10:14] ohh ok
[16:10:16] but for some reason three for the DNS hosts, even the old ones
[16:13:18] I think we should remove that ::dnsrecursor::monitor line, and run a noop to see if that removes the exported resources
[16:14:20] or comment out https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hosts/reimage.py#L422
[16:15:03] if we have healthy hosts in prod with multiple resources, maybe we can skip that check for those being provisioned
[16:15:05] XioNoX: I thought of that but don't want to do that, just in case it exists for a reason
[16:15:13] also I am afraid of volan.s :)
[16:16:35] I am pretty sure that puppet line is causing the issue, whether it also has some benefit I am not totally sure
[16:17:03] I can try removing it
[16:17:15] jhathaway: yeah it's two sides of the same coin
[16:17:17] but I am also not sure because it has been there for some time
[16:17:21] and we didn't have an issue
[16:17:24] but
[16:17:25] it's there for monitoring
[16:17:32] I am wondering if it wasn't an issue before and now is
[16:17:32] so if we remove it we lose some monitoring afaik
[16:17:47] yes, I think the intention was to try and remove it see if it helps
[16:17:51] not take it out because we can't do that
[16:17:53] we would lose the ping on the ipv6 ip at least
[16:18:30] the ipv4 ip is already on the resource with the hostname as the title
[16:21:04] sukhe: cumin1001: curl -k -X POST https://puppetdb-api.discovery.wmnet/pdb/query/v4/resources/Nagios_host/dns3003 | jq
[16:21:09] returns only 1 element
[16:21:42] so you're right the bug has been introduced in https://github.com/wikimedia/operations-cookbooks/commit/cd72b2539062328ed6a98fd84f868bff7df7e1c1
[16:21:52] right, but there is also 185.15.59.34 & 2a02:ec80:300:2:185:15:59:34
[16:22:29] yeah but the commit is the more recent one
[16:22:38] and thus that causes three items instead of 1
[16:22:43] XioNoX: yep, nice confirmation
[16:23:22] we can try rolling back that commit
[16:23:34] the curl seems to confirm that the old behavior is still working
[16:23:54] yeah that makes sense
[16:24:14] would that break something else?
[16:24:19] seems fairly self-contained
[16:24:21] but want to be sure
[16:24:30] yeah and easy to revert if needed
[16:24:44] I am fine with that
[16:24:49] +1 from my side then
[16:25:16] +1 as well
[16:25:22] on it
[16:25:25] seems the least intrusive of the two options
[16:26:06] I would reference this line in the revert https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/dns/recursor.pp#85 XioNoX
[16:27:41] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/948604
[16:28:22] ha
[16:28:23] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/948603
[16:28:26] we will go with yours
[16:28:53] ok
[16:29:07] if CI wakes up...
[16:30:44] thanks XioNoX and jhathaway
[16:30:59] thank you!
[16:31:04] yup, sorry for the breakage
[16:31:21] all good
[16:31:25] we found a solution :D
[16:31:36] if CI agrees to merge it :)
[16:32:58] finally... running puppet to deploy it
[16:33:05] thanks!
[16:33:53] sukhe: alright, give it a try
[16:33:59] XioNoX: thanks, trying shortly
[16:34:03] should be fine now!
[16:34:11] moving on to the other rack after this
[17:29:29] (SystemdUnitFailed) firing: httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:29:29] (SystemdUnitFailed) resolved: httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
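Editor's note: a rough sketch of why the count-based check tripped on the DNS hosts only. It assumes each object returned by the PuppetDB resources endpoint carries a `title` field (the output pasted at 15:26 shows only `certname`), and the sample data simply mirrors the three titles reported at 15:41. `build_query`, `titles_for` and `has_hostname_resource` are hypothetical helpers, not the cookbook's code; the fix applied in the log was a revert of the cookbook commit, not this.

```python
from typing import Dict, List


def build_query(certname: str) -> Dict[str, list]:
    # Same AST query as the one pasted at 15:30.
    return {
        "query": [
            "and",
            ["=", "certname", certname],
            ["=", "type", "Nagios_host"],
            ["=", "exported", True],
        ]
    }


def titles_for(resources: List[dict], certname: str) -> List[str]:
    # Collect the titles of all resources exported by this certname.
    return [r["title"] for r in resources if r.get("certname") == certname]


def has_hostname_resource(resources: List[dict], certname: str) -> bool:
    # Look for the resource titled with the short hostname instead of
    # requiring exactly one result.
    hostname = certname.split(".", 1)[0]
    return hostname in titles_for(resources, certname)


# Assumed sample response, modelled on the titles seen at 15:41.
SAMPLE = [
    {"certname": "dns3003.wikimedia.org", "title": "185.15.59.34"},
    {"certname": "dns3003.wikimedia.org", "title": "2a02:ec80:300:2:185:15:59:34"},
    {"certname": "dns3003.wikimedia.org", "title": "dns3003"},
]

if __name__ == "__main__":
    print(len(SAMPLE) == 1)  # the count check that raised in the cookbook
    print(has_hostname_resource(SAMPLE, "dns3003.wikimedia.org"))
```

With this data the length check fails while the title lookup succeeds, which matches the 16:21 observation that querying `Nagios_host/dns3003` directly returns a single element.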