[08:31:40] thanks for the warning :) [09:25:44] do we plan to set stuff as ready (eg. green) on the Codfw Switch Migration planning spreadsheet? Or not needed? [09:26:48] <_joe_> ? [09:31:31] jynus: each migration as a sub tab, if you e.g depooled something you can simply update the text under "Action" [09:32:28] So do I change it to "Ready" after doing the thing? [09:35:06] yeah, e.g. I'm currently draining ganeti2031 and when it's completed I'll simply change "Migrate VMs to other hosts" to "VMs have been migrated to other hosts" [09:35:25] or simply Ready :-) [09:35:32] thanks, that was exactly my question, thank you [09:36:02] I will wait a bit more as mine takes seconds, so I will do it later [09:36:43] ack, thx [11:00:32] apergos: in this specific case, it should have been primary, and using replica was wrong, it's rather simple, the primary might have not had the correct schema, so the writes would have failed, generally speaking fieldExists() is a bad idea in a general [11:07:03] hm ok I will bear that in mind, thanks [12:10:29] ok it seems the reimage issues have been fixed, if you encounter new issues please let us know (context in T356709 ) [12:10:30] T356709: Debian installer waits for input for network config during host reimage - https://phabricator.wikimedia.org/T356709 [12:14:00] volans: oh crap -_- thank you and sorry [12:42:42] it happens :) [12:42:58] kamila_: I suggest you reimage all of them [12:43:12] btw if we were renaming them to k8s hosts this would have not happened :D [12:43:16] volans: yep, on it, just staggering it a bit [12:43:23] valid point about the rename '^^ [13:17:53] akosiaris: I see a diff on deployment-charts/helmfile.d/admin_ng related to helm-state-metrics when running `helmfile -e dse-k8s-eqiad diff` on the deployment host. Is this safe to deploy? [13:37:56] brouberol: yes it is. [13:38:04] thanks, deploying [15:01:25] Folks I'm going to restore the Netbox db from a backup to reverse change that was done in error [15:01:51] Can I ask people not to run any decommission, provision or reimage cookbooks for the next 30 mins or so until I confirm? [15:01:52] thanks. [15:01:58] ack [15:19:42] Thanks for the patience everyone, netbox db is restored now so normal activity can resume [15:45:08] one last question, ¿what is the aprox window for the today's maintenance? I'm not in a hurry but wondering if to keep around or restart stuff tomorrow? [15:51:30] (also I should stop using the Spanish inverse question mark in Spanish LUL) [15:51:39] *English [15:51:56] XDD [16:03:09] based on the downtime, I guess only 30 minutes? [16:03:19] !incidents [16:03:19] 4424 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [16:03:26] * Emperor around [16:03:32] !ack 4424 [16:03:32] 4424 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [16:04:26] was it the prometheus backend? 
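A quick way to confirm the OOM kills described above (a sketch; it assumes journalctl access on the titan hosts and that the unit is thanos-query.service, as shown in the pasted kern.log lines):

    # Recent kernel OOM-killer events and which cgroup they were charged to
    sudo journalctl -k --since "1 hour ago" | grep -iE 'out of memory|oom-kill'
    # Narrow to kills attributed to the thanos-query unit (cgroup path as in the log above)
    sudo journalctl -k --since "1 hour ago" | grep 'task_memcg=/system.slice/thanos-query.service'
    # Check whether systemd restarted the service afterwards
    systemctl status thanos-query.service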
[16:05:18] * Emperor still looking [16:05:57] thanos-query was recently restarted, hmm [16:07:16] titan1001 and titan1002 recently alerted for ssh down briefly [16:07:54] titan1001 had page faults in kern.log [16:08:03] 2024-02-06T16:02:42.108834+00:00 titan1001 kernel: [534723.800913] Out of memory: Killed process 935 (thanos) total-vm:25080384kB, anon-rss:24252060kB, file-rss:0kB, shmem-rss:0kB, UID:111 pgtables:47820kB oom_score_adj:0 [16:08:25] ah, that would explain it and it would make ssh unresponsive [16:08:42] 2024-02-06T16:02:48.599092+00:00 titan1002 kernel: [529285.100104] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/thanos-query.service,task=thanos,pid=944,uid=111 [16:08:43] [16:08:48] I've seen ssh timeout during a OOM before [16:09:04] it's really odd that both eqiad titan nodes when OOM at the same time [16:09:16] I wonder if an exciting query toppled them? [16:09:35] let me search, there was a similar issue some months ago [16:10:10] !incidents [16:10:11] 4424 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [16:11:12] I cannot find a doc, maybe it wasn't done as it was like this a bried overload [16:11:21] *brief [16:13:34] godog: titan is your service, would you like a phab item opened for the OOM? [16:13:59] (we might want to adjust the docs to point more at titan for this too) [16:15:01] Ah, I remember, I did this edit: https://wikitech.wikimedia.org/w/index.php?title=Runbook&diff=prev&oldid=2115826 [16:15:36] actually, this one: https://wikitech.wikimedia.org/w/index.php?title=Thanos&diff=prev&oldid=2115837 [16:15:39] with a TODO [16:16:35] so the previous alert was just before that date [16:16:59] Mmm, in any case the service seems to self-recover after an OOM, which is nice. Presumably until systemd gets bored and marks it as failed... [16:24:16] I think I will open a phab, it can always be closed if it's not interesting. [16:24:59] thanos-query has DeadlineExceeded warnings leading up to the OOM, but they aren't unique to today, plenty of instances of them occurring in the past as well [16:25:04] SGTM [16:36:02] T356788 opened [16:36:02] T356788: thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 [16:36:48] I will add the potentially related previous case [16:37:07] if anyone wants to bet against that being a query-of-death, I'll take you up on it [16:37:28] I wouldn't like to bet against it :-D [16:37:33] <_joe_> cdanis: lol "you like winning easy" [16:37:44] :) [16:37:54] but there could be things, like docs that could be better, as Mathew said [16:38:23] I was super lost when I saw a "Titan" alerting 4-5 months ago [16:38:45] cdanis: I speculated as much in the ticket :) [16:38:49] yeah :) [17:04:41] cheers folks, yeah we did put in place some safeguards and thank you for the task [17:05:01] query of death still a thing as you have figured out by now [17:14:34] cdanis: arnaudb, herron: We are having an issue with the wikikube cluster [17:16:34] claime: that sounds potentially bad [17:16:50] Yeah [17:17:01] basically it looks like the new k8s nodes didn't get added to lvs for some reason [17:17:11] do you need any help claime? [17:17:23] so all the traffic is going to the old nodes (and then being re-loadbalanced by nodeport...?) 
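One way the "not added to LVS" suspicion can be checked and fixed from a cumin host is via confctl; a sketch, with selector tags matching the conftool object quoted a little further down and an illustrative weight value:

    # List kubernetes-cluster backends that conftool still has as inactive
    sudo confctl select 'dc=codfw,cluster=kubernetes,service=kubesvc' get | grep '"pooled": "inactive"'
    # Repool a single node once it is confirmed healthy
    sudo confctl select 'name=mw2292.codfw.wmnet,service=kubesvc' set/pooled=yes:weight=10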
[17:17:34] yeah, and now it's flapping in the wind [17:17:36] <_joe_> cdanis: yes, but also we're seeing intermittent failures [17:17:51] I'm trying to figure out what we missed in the configuration between old and new nodes [17:18:42] akosiaris: Which hosts haven't had runc restarted yet? [17:18:53] there is no restart of runc [17:19:00] claime: shall I begin an incident doc? [17:19:15] all have had their runc binary upgraded [17:19:24] new pods instantiated on those nodes will run under the new runc [17:19:41] s/under/started using/ [17:20:24] I find it very improbable btw that runc is to blame here btw. It doesn't even handle networking [17:20:40] <_joe_> ok the problem for "nodes not in pybal" is just [17:20:43] <_joe_> {"mw2292.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=kubernetes,service=kubesvc"} [17:20:46] <_joe_> fixing it myself [17:21:02] this shouldn't cause issues though [17:21:10] if those are new nodes that is [17:21:15] <_joe_> no [17:21:21] <_joe_> all nodes that are mw* [17:21:28] <_joe_> so half the cluster? [17:22:08] <_joe_> I can enable them, but this hasn't changed from before your cordon/uncordon processes [17:22:28] yup, I am not messing at all with this [17:22:33] ok so we're missing a step in the node add procedure is what you're saying? [17:22:38] <_joe_> yes [17:22:46] <_joe_> anyways, unrelated to the problem [17:23:00] <_joe_> what pybal sees is k8s nodes killing connections [17:23:05] <_joe_> and nto accepting new ones [17:23:09] <_joe_> on the nodeports [17:23:31] <_joe_> this must generate some event/logs somewhere [17:24:01] <_joe_> claime/akosiaris do you want me to fix the issue with conftool? [17:24:22] killing as in sending RST? and not accepting as in also sending RST? [17:24:24] or something ese [17:24:25] I assumed you were, you said fixing it myself, but I can do it [17:24:28] cdanis: timing out [17:25:06] hmm [17:25:17] <_joe_> cdanis: connection time out is what we see when using curl; but the idleconnection monitor failing also means persistent connections are severed [17:25:21] yeah [17:25:37] just wondering if we knew if that was due to timeout of keepalives, or an explicit RST or something [17:26:00] <_joe_> no it's too frequent [17:26:12] <_joe_> to be keepalive timeouts [17:26:22] <_joe_> but I'm not 100% sure [17:29:59] ok pooling them correctly is done, still no connection between for example lvs2013 and kubernetes2010.codfw.wmnet port 4447 [17:30:17] what can I do to help? [17:35:23] <_joe_> claime, cdanis one thing is to look at logstash for the k8s events and logs [17:35:38] <_joe_> another is to find a host where you can connect to, one where you can't, and see what's the difference [17:35:47] I was tcpdumping on the kubernetes node, I can see traffic but it's illegible [17:35:51] <_joe_> and maybe reboot one host? [17:36:34] (the "not in pybal" for the new nodes probably happened because the reimages didn't complete and something ended up inconsistent, no idea if that'd affect other things though) [17:36:50] kamila_: The reimage won't pool the ndoes [17:36:52] nodes [17:37:00] That's a manual step we forgot [17:37:14] I'm gonna reboot kubernetes2010 [17:37:23] claime: oh, interesting, was that always a thing? because I never did that '^^ [17:37:31] 😅 [17:37:59] <_joe_> ok, I can't be IC nor help much, I'm dealing with another UBN! 
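A sketch of cross-checking what pybal sees against a direct connection, as being attempted above (it assumes pybal's instrumentation endpoint on its usual port 9090; service and host names are the ones from the log):

    # On the LVS host: pybal's own view of the pool, showing only unhealthy members
    curl -s http://localhost:9090/pools/mw-api-ext_4447 | grep -v 'enabled/up/pooled'
    # Kernel IPVS state for comparison
    sudo ipvsadm -L -n | grep -A5 ':4447'
    # Direct check against one backend with an explicit timeout, to rule out tool quirks
    curl -vsk --connect-timeout 3 -o /dev/null https://kubernetes2010.codfw.wmnet:4447/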
[17:38:49] ok fuck it, I have no idea what's happening but I am IC now [17:38:49] <_joe_> claime: how are you testing "connect from lvs2013 and kubernetes2010.codfw.wmnet port 4447" ?" [17:39:04] kamila_: I can take it if you'd rather hands on keyboard [17:39:04] _joe_: netcat [17:39:12] <_joe_> claime: full command please [17:39:19] just nc ip port [17:39:28] <_joe_> claime: have you tried telnet? [17:39:52] <_joe_> nc has some quirks [17:39:56] ... [17:40:02] telnet connects [17:40:04] wtf. [17:40:04] cdanis: ok, that might be a better idea, you "win" :D [17:40:17] uhh [17:40:29] wat [17:40:33] is this an ipv6 thing [17:40:39] I have to put the kids to sleep btw, I have to leave, but I 'll check up on this later. Ping me if this becomes worse (my current understanding says we aren't in deep trouble) [17:40:45] <_joe_> cdanis: no, ipv6 just doesn't work [17:40:48] I'm full ipv4 on this [17:40:55] nc knows that too? [17:41:01] I'm using the ipv4 [17:41:15] I'm gonna drain and reboot one node to see something [17:41:35] <_joe_> claime: rebooting was my next suggestion yes [17:41:44] jayme: you around? [17:41:45] runc related pod restarts are btw paused in codfw, eqiad is continuing [17:43:34] if reboot doesn't fix it, can some call jayme please [17:44:58] can I ask, what led to the detection of this, and, do we have any graphs showing the issue [17:45:15] cdanis: httpbb failures on mw-on-k8s [17:45:17] <_joe_> cdanis: https://grafana.wikimedia.org/d/000000422/pybal-service?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-server=All&var-service=mw-api-ext_4447&from=now-3h&to=now&viewPanel=4 [17:45:20] thanks [17:45:53] <_joe_> the problem is only in codfw and not eqiad [17:48:09] Timing coincides with the network migration but I don't see what could be the link, there's nothing related in the rack [17:49:16] <_joe_> claime: I was about to ask when the network migration happened [17:49:27] <_joe_> because this feels a lot like network [17:50:18] Actually it coincides more with the netbox restore than the actual migration I think [17:50:22] topranks you still around? [17:50:26] claime: yeah [17:50:46] reboot makes the server up in pybal [17:50:49] what. [17:50:56] https://i.imgur.com/oqAelXg.png [17:50:57] Ah not for long [17:51:00] Feb 6 17:50:53 lvs2013 pybal[5930]: [mw-api-int_4446] ERROR: Monitoring instance IdleConnection reports server kubernetes2010.codfw.wmnet (enabled/up/pooled) down: User timeout caused connection failure. [17:51:17] yeah it's back to flapping [17:51:57] I need to afk for just a few minutes, really need some quick food [17:52:03] Depool the nodes that were part of the migration? [17:52:05] I uncordoned the new nodes sometime around that, trying to find the exact timing [17:52:30] <_joe_> akosiaris: more like remove from k8s [17:52:40] <_joe_> kamila_: yeah that too [17:52:47] There was no k8s nodes in the migration [17:52:54] <_joe_> let's cordon the hosts kamila added then [17:53:06] on it [17:53:23] <_joe_> claime: do you agree it's worth a try? [17:53:38] what, cordoning the added nodes ? [17:53:53] oops, I kinda did it already '^^ [17:53:54] cdanis: can i get the actual link for that graph plz? [17:54:06] I mean it can't hurt anyways [17:54:18] <_joe_> kamila_: it's ok [17:54:20] (cordoned the new ones in codfw, didn't touch eqiad) [17:54:33] ah it's the pybal graph [17:55:23] <_joe_> kamila_: please !log your actions in #-operations [17:55:31] thanks, sorry [17:57:07] <_joe_> it keeps getting worse, btw [17:57:12] Cordon or drain? 
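For the question just asked, a minimal sketch of the difference (node name as used elsewhere in this log):

    kubectl cordon kubernetes2010.codfw.wmnet    # marks the node unschedulable; existing pods stay
    kubectl drain kubernetes2010.codfw.wmnet \
        --ignore-daemonsets --delete-emptydir-data    # cordons the node AND evicts its pods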
[17:57:18] I suggest drain [17:57:26] Move the pods away [17:57:29] <_joe_> you also need to drain them right [17:57:37] drain will cordon [17:57:45] kubectl drain kubernetes2010.codfw.wmnet --delete-emptydir-data --ignore-daemonsets [17:57:56] oh, ack, thanks [17:57:57] sorry [17:58:43] <_joe_> ok, I need to take care of mediawiki, then I need to go back to life a bit. I will just say - if things keep worsening, please consider an emergency mw switchover and complete depool of codfw [17:59:51] claime: https://grafana.wikimedia.org/goto/fDq2Jm2Sk?orgId=1 [17:59:53] I don't understand why I can telnet, openssl s_client, whatever on the ip and nodeport of a service on kubernetes2010 [17:59:58] but pybal keeps failing [18:00:10] claime: where are you doing it from [18:00:17] the lvs host [18:00:46] I can curl aswell I mean wth [18:00:56] <_joe_> claime: but httpbb keeps failing too [18:01:15] <_joe_> claime: q: have you tried running httpbb against a k8s host directly? [18:01:22] not yet [18:01:27] <_joe_> because if that doesn't fail, then we have another issue at hand [18:02:26] it's running... [18:02:39] Do we have anything about this impacts users? [18:02:42] <_joe_> cdanis: do you want me to hold the mw deployment until the outage is resolved? [18:02:43] will tell you in a minute [18:02:47] _joe_: no [18:02:49] How* [18:03:02] akosiaris: as I understand all the user impact is potential, if this keeps happening [18:03:05] or spreads [18:03:15] _joe_: please proceed with your ubn deploy' [18:03:28] <_joe_> cdanis: well connecting to mediawiki via the load balancers does experience timeouts [18:03:40] <_joe_> I guess that ATS does retry on connection failure [18:03:42] which has some user impact but also those will get retried a bit [18:03:44] yes [18:03:51] if I go by ATS and wiki status we're ok [18:04:03] so, it might just be noise, but it sorta looks like it's going down [18:04:06] Ok so retries on various levels are still covering this [18:04:28] <_joe_> kamila_: uhm indeed [18:04:40] <_joe_> so maybe the draining "solved" the issue? [18:04:44] I drained the new nodes at :58 [18:05:13] kinda looks like it, yes [18:05:14] okay so [18:05:32] adding together both up and down transitions *and* using rate rather than irate() in the graph produces hard to read results imo [18:05:48] https://i.imgur.com/M80nqb3.png https://grafana.wikimedia.org/goto/Jy7_1ihSk?orgId=1 [18:05:53] this is so much clearer imo [18:06:16] kamila_: did we miss a bgp step for the new nodes or something [18:07:21] <_joe_> yeah the query is over 1 hour [18:07:22] cdanis: I did the BGP and homer commit step [18:07:25] <_joe_> so it's gone doewn [18:07:43] _joe_: irate() isn't :) [18:08:02] oh shit? [18:08:03] <_joe_> cdanis: yeah the current dashboard, that's why the increase/decrease smooths down [18:08:08] yeah [18:08:24] <_joe_> kamila_: ok next thing is, depool (setting pooled=inactive) the nodes you drained :) [18:08:28] What happens to calico when you drain? [18:08:40] Because sudo cumin 'mw[2318-2319,2350,2352,2354,2356].codfw.wmnet' 'calicoctl node status' [18:08:44] says Connect for everything [18:08:45] a 1d rate() is ok to reason about rare events but is horrible to see things happening in near real time [18:08:47] not established [18:09:31] ack _joe_ , just checking something... [18:10:14] ok, I fucked up [18:10:18] oh [18:10:20] or something [18:10:46] not sure... 
in any case [18:10:54] homer commit on codfw is showing the new nodes in the diff [18:12:06] i'm wondering about that, or about the netbox restore [18:12:43] fyi eqiad nodes have established BGP sessions [18:12:44] I propose that I commit it and re-enable the nodes [18:12:46] yes [18:12:49] kamila_: +1 [18:12:50] commit [18:13:57] sry was afk, just back in - checking backscroll [18:14:25] topranks: tldr for right now is that a bunch of new k8s nodes either weren't ever put in properly to the codfw crs, or, were but got removed at some point [18:14:39] hmm ok [18:14:45] you ran Homer I gather? [18:14:56] kamila is either doing that now or is about to aiui [18:15:06] cgoubert@cumin2002:~$ /usr/bin/httpbb /srv/deployment/httpbb-tests/appserver/*.yaml --host kubernetes2010.codfw.wmnet --https_port 4447 [18:15:07] running now [18:15:08] Sending to kubernetes2010.codfw.wmnet... [18:15:10] PASS: 126 requests sent to kubernetes2010.codfw.wmnet. All assertions passed. [18:15:12] In case we were wondering [18:15:22] completed [18:15:39] ^ homer commit [18:16:16] I'll uncordon the nodes now [18:16:17] All BGP sessions established [18:16:32] kamila_: I'm adding a few steps to the procedure after this [18:16:36] Adding to the pool [18:16:37] thanks claime [18:16:43] And cumin a calicoctl node status [18:16:44] <_joe_> so the problem was a race condition with our modifications of netbox and the restore from db? [18:16:52] prolly [18:16:58] _joe_: I don't think we know that for sure yet but it is certainly likely [18:17:02] I think so [18:17:15] I was hoping perhaps topranks could go through the config commit history on the router [18:17:22] or if homer keeps all of its diffs centrally? [18:17:26] sire [18:17:30] but it's possible I didn't run homer on codfw [18:17:30] *sure [18:17:51] cdanis: no the router / rancid diffs would be the best audit log, homer doesn't save the configs it generates [18:17:58] got it [18:18:25] I remember running homer at least once, but it's possible I only did it for eqiad '^^ [18:19:02] (I'm fairly sure I did run it twice, but not 100%) [18:19:06] 13:08:40 Because sudo cumin 'mw[2318-2319,2350,2352,2354,2356].codfw.wmnet' 'calicoctl node status' [18:19:13] I think this is a list of new nodes topranks ^ [18:19:17] yes [18:19:18] <_joe_> kamila_: it should be in SAL [18:19:54] _joe_: you're thinking of dbctl and conftool not homer I think :) [18:20:09] <_joe_> cdanis: doesn't homer log its runs like cumin? [18:20:18] <_joe_> IIRC yes? [18:20:28] https://sal.toolforge.org/production?p=0&q=homer&d= [18:20:46] <_joe_> uhm [18:20:50] <_joe_> well :) [18:21:08] Adding to log homer run to the procedure as well [18:21:18] ty claime [18:21:29] where are you adding it btw? docs? [18:21:53] I'll make a patch for the script tomorrow if so [18:22:20] in terms of those nodes - mw[2318-2319,2350,2352,2354,2356] - they appear to have only been added to the config in the most recent run [18:22:54] claime: do we do any monitoring of calico bgp status? 
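A sketch of the extra procedure steps being discussed: verify there is no pending router diff after adding the nodes, and spot-check BGP session state across them (device selector and host list follow the log; exact cumin aliases may differ):

    # Any un-pushed BGP neighbor config for the codfw core routers shows up here
    homer 'cr*codfw*' diff
    # If there is a diff, push it and !log the run so it lands in SAL
    homer 'cr*codfw*' commit 'Add BGP sessions for new wikikube worker nodes'
    # Fleet-wide spot check: every new node should report Established sessions
    sudo cumin 'mw[2318-2319,2350,2352,2354,2356].codfw.wmnet' \
        'calicoctl node status | grep -c Established'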
[18:23:18] the previous change was to add mw244[7-9] last Thursday [18:23:52] kamila_: script [18:24:03] claime: ok, thank you [18:24:06] cdanis: I don't think so, I'd have to check [18:24:26] it's possible I accidentally ran homer twice for eqiad and my brain was too fried to realise there wasn't a diff [18:25:43] maybe I should not be adding k8s nodes when I feel too fried for real work -.- [18:25:55] kamila_: I mean, we need some automated consistency checking [18:26:41] yes, or even manual consistency checking :D thanks claime for adding it <3 [18:27:53] we also didn't get warned about a diff for the CR config by email like we normally would [18:28:07] my command history doesn't have a recent codfw commit (other than the one now), so that happened [18:28:19] hm, good point about the diff [18:28:28] I haven't seen one of those in a long time [18:28:36] <_joe_> kamila_: that's a much better outcome than the race condition with the restore of netbox, tbh [18:28:36] hmmmmmm [18:28:42] and yes what _joe_ said [18:28:53] <_joe_> human errors happen [18:29:17] we don't have any prometheus metrics from calico huh? [18:29:45] a graph of pooled-for-pybal nodes vs bgp status counts would have made this pretty clear immediately [18:31:39] <_joe_> yes [18:31:59] even just a textfile exporter that wraps calicoctl node status ;) [18:34:06] calico has metrics, they just need to be enabled, no? [18:34:35] reading https://docs.tigera.io/calico/latest/operations/monitor/monitor-component-metrics [18:34:36] kamila_: yes, I don't know if we've done that or not [18:35:38] I see some felix_ metrics in our prometheus but I am not yet smart enough to interpret them [18:36:15] and typha_ [18:37:26] so, afaict we're good now, right? [18:37:38] can I get a sticker? XD [18:37:52] cgoubert@kubernetes2010:~$ calicoctl --version [18:37:54] Run 'calicoctl version' to see version information. [18:37:58] if you know that's what I want [18:38:00] GIVE IT [18:38:00] why XD [18:39:02] kamila_: yes I think we're good now [18:39:16] Also why it doesn't have a machine readable output is a myste [18:39:18] I started an incident doc but I don't think I'll finish it honestly, it never got populated with much and this wasn't actually user-impactful in a big way [18:39:21] mystery to me [18:39:51] action items besides adding the steps would be the calico metrics then? [18:40:14] kamila_: not just enabled, but with a few key stats pulled out somewhere [18:40:18] ack [18:40:58] it looks like we're scraping felix and typha metrics but not doing much with them? [18:41:10] also I don't actually know if either of those components is the one that is talking BGP [18:42:51] Neither [18:43:00] It's bird [18:43:17] Same container as Felix though [18:44:11] do we actually scrape those bird metrics? Looking at a few things like `bird_protocol_up` and I don't see anything from anything k8s related [18:47:34] pooling and calico check steps added to the procedure [18:47:41] Are those perhaps from some anycast DNS setup? Those have dedicated bird instances [18:47:45] akosiaris: yep [18:47:47] thank you claime <3 [18:47:50] exactly [18:47:54] I'll add them to wikitech tomorrow [18:48:10] But https://docs.tigera.io/calico/latest/reference/felix/prometheus says nothing about bgp metrics [18:48:30] So it's not that we aren't scraping, they just don't exist? 
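A minimal sketch of the textfile-exporter idea floated above, wrapping calicoctl node status for node_exporter's textfile collector (metric names and the output path are assumptions, not an existing exporter; it needs to run as root, e.g. from a systemd timer):

    #!/bin/bash
    # Sketch: export Calico BGP session counts for the node_exporter textfile collector.
    set -euo pipefail
    OUT=/var/lib/prometheus/node.d/calico_bgp.prom   # assumed textfile collector directory
    status=$(calicoctl node status 2>/dev/null || true)
    established=$(printf '%s\n' "$status" | grep -c 'Established' || true)
    peers=$(printf '%s\n' "$status" | grep -cE 'node-to-node mesh|global|node specific' || true)
    {
      echo '# HELP calico_bgp_sessions_established Calico BGP sessions in Established state'
      echo '# TYPE calico_bgp_sessions_established gauge'
      echo "calico_bgp_sessions_established ${established}"
      echo '# HELP calico_bgp_peers Calico BGP peers configured on this node'
      echo '# TYPE calico_bgp_peers gauge'
      echo "calico_bgp_peers ${peers}"
    } > "${OUT}.tmp" && mv "${OUT}.tmp" "${OUT}"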
[18:48:41] https://i.imgur.com/kFYgPPM.png [18:48:43] perhaps :) [18:49:13] (tip of the day, matching based on `__name__` in grafana explore is very powerful) [18:50:07] also calicoctl has no machine readable options and it *sucks* [18:50:34] ah actually it may [18:52:46] -o json iirc [18:52:48] I guess if calicoctl get bgpPeers isn´t 50+ line longs then it's not right maybe [18:53:10] not for node status [18:53:12] But it's 8pm your time, get away from the keyboard [18:53:21] But I guess you can deduct it from get bgpPeer [18:53:28] That is right. [18:53:31] I should do that. [18:53:37] * claime SIGKILL [18:53:46] Same here, ciao [18:54:00] I'll file a task for the calico metrics [18:54:02] akosiaris: I think -o json is only for `calicoctl get` specifically [18:54:04] thanks y'all <3 [18:54:07] cdanis: yep [18:54:17] <3 thanks kamila_ claime akosiaris [18:58:03] claime: I don't see the operational status returned in calicoctl get bgpPeers [18:59:25] sry, what Alex said, we can pick it up again :) [19:01:57] Also while on the subject, we in IF do get mails about pending diffs for the network devices [19:03:24] but it only runs twice daily, so we didn't get anything between those MW nodes setting bgp=true in Netbox and the problems this evening [19:06:02] topranks: thanks for checking [19:12:15] has anyone ever gotten "Unable to find fact file for: ${HOST} under directory /var/lib/catalog-differ/puppet" from a PCC run? I just migrated this host from public to private IPs, but this didn't happen with the last host I worked on. Full PCC output: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-puppet7-test/592/console [19:12:40] the host looks OK in Puppetboard and Netbox AFAICT [19:19:30] inflatador: nothing jumping out at me, Puppetboard and Netbox both look correct, I don't see any references to the old fqdn either [19:22:00] topranks ah, I think I got it, my reimage cook_book was still waiting for a prompt [19:23:40] pcc host facts are updated nightly, so there is a delay of a host being added/renamed and it being available for PCC runs [19:24:30] I was kinda thinking that too...ran puppet on a cloud of the cloud runners but it didn't seem to help [19:24:42] errr.."a couple" not "a cloud" [20:23:53] inflatador: yeah the mechanism taavi was referring to is documented at https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Updating_nodes [20:26:17] * inflatador bookmarks [20:28:19] it used to need poking by hand very often heh [20:43:02] sign of progress...now we take your hard work for granted ;P
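Following up on the calicoctl output-format question above, a sketch of what each command can and cannot give you (assumes jq is available; as noted in the log, the configured-peer objects do not carry operational session state):

    # Configured BGP peers as JSON -- definitions only, no session state;
    # empty when only the node-to-node mesh is in use
    calicoctl get bgpPeers -o json | jq '.items[].spec'
    # Operational state (Established / Connect / ...) still comes from node status,
    # run on the node itself
    sudo calicoctl node status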