[07:08:24] https://www.irccloud.com/pastebin/bg9asqQa/
[07:33:33] moritz and I are switching Gerrit from Java 11 to Java 17
[07:34:20] given upstream Gerrit have been supporting Java 17 for quite a while, I don't anticipate issues
[07:45:29] <_joe_> hashar: don't jinx it
[07:57:01] Should I fret about "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048"
[07:57:14] ?
[07:57:41] <_joe_> Emperor: that doesn't look good at all, no
[07:57:52] <_joe_> open a task for I/F with a high priority, I'd say
[07:58:54] ack
[07:58:54] <_joe_> this isn't critical because AIUI that's a soft limit
[07:59:04] <_joe_> but we should probably raise that soft limit
[08:01:12] T366563 now exists
[08:01:19] T366563: moss-be1003 "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563
[09:16:20] IIRC, all our private IPs are in 10.0.0/8 ; what IPv6 networks are equivalent? Is there, for example, a v6 network that contains all of codfw? [I found https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations but the private networks are just "refer to netbox", and I'm not sure where I should be looking]
[09:18:24] (context is harking back to the discussion of Ceph networking the other week; I notice that while it doesn't dual-stack, it would do v6-only for cluster traffic)
[09:19:59] Emperor: https://netbox.wikimedia.org/ipam/prefixes/783/ codfw private v6
[09:20:54] XioNoX: thanks, and presumably thus https://netbox.wikimedia.org/ipam/prefixes/252/ for eqiad?
[09:21:18] Emperor: yep :)
[09:21:29] ta
[09:21:41] it would be awesome to do v6 only :)
[09:24:18] No promises! But it might be doable
[09:24:45] (and obviously doing so now before it's in service would be the time)
[09:26:19] yep, happy to help if needed
[09:38:56] https://toot.bike/@demoographics/112557677316569378 ;-)
[10:19:15] hmm, somehow mail from phab is now failing SPF checks when being delivered to wikimedia.org addresses? ('domain of transitioning no-reply@phabricator.wikimedia.org does not designate 2620:0:861:102:10:64:16:101 as permitted sender')
[10:24:39] SPF is v=spf1 ip4:208.80.152.0/22 ip6:2620:0:860::/56 ip6:2620:0:861::/56 ~all
[10:27:59] which if I'm reading that correctly should contain 2620:0:861:102:10:64:16:101 ?
[11:14:39] Emperor: yeah, jhathaway and I discussed this above in this channel. (see scrollback). I think he will put a fix today
[11:28:50] oh, yes, thanks
[13:14:20] btullis: datahub-mce-consumer (production) is at it again with the log spamming :( https://phabricator.wikimedia.org/T366596
[13:16:07] godog: On it now. Thanks. For reference, we are deleting datahub from the wikikube clusters at the moment, which may help. Not sure which deployment is spamming yet. https://phabricator.wikimedia.org/T366338
[13:16:35] cc stevemunene: ^
[13:16:50] btullis: ack thank you, this is from the dse cluster
[13:17:04] pasting a sample logstash message in the task now
[13:19:32] btullis: the wikikube deletion is done
[13:19:43] https://logstash.wikimedia.org/app/dashboards#/view/8e16cbb0-1f2e-11ef-a8f9-a3d578de2750?_g=(filters%3A!()%2CrefreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow-2h%2Cto%3Anow))
[13:21:15] stevemunene: Ack, thanks.
[14:08:50] Emperor: fix should roll out today, spent too much time debugging PCC issues yesterday :(
[14:15:24] jhathaway: thanks (and, err, sorry for the PCC stuff!)
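
For reference on the SPF question above: a quick containment check (a throwaway sketch using only Python's standard ipaddress module, not part of the original discussion) shows the sending address is in fact outside both ip6 ranges in that record, which would explain the failures:

    python3 -c 'import ipaddress as i; a = i.ip_address("2620:0:861:102:10:64:16:101"); print([p for p in ("2620:0:860::/56", "2620:0:861::/56") if a in i.ip_network(p)])'
    # prints [] -- 2620:0:861:102::/64 lies outside 2620:0:861::/56 (a /56 only
    # spans 2620:0:861:0:: through 2620:0:861:ff::), so the ~all fallback applies;
    # covering this host needs a wider range (e.g. a /48) or an extra ip6: entry.
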
[14:16:04] no need to apologize, it is just one of those tools that could use some more love
[15:10:01] <_joe_> jhathaway: the puppet compiler is a typical case of organic growth out of a small program I wrote for a specific purpose, it has never been developed as a proper software with a roadmap, and it shows :)
[15:10:22] <_joe_> jhathaway: but if you need some insight when debugging, I might be able to help now that I'm back
[15:15:00] thanks _joe_ I figured out how to at least unblock my patch, by purging 9000 or so nodes in PCC's puppetdb that were causing it to time out on simple queries. There's still a periodic job that causes the db to respond really slowly, but I ran out of time debugging the nature of that capacity problem. So far the debugging has served the purpose of educating myself a bit on how the pieces fit
[15:15:02] together. That said I may ping you at some point when we find some time to prioritize PCC maintenance and improvements.
[15:15:37] <_joe_> nod
[15:58:37] anything happening in the job queues? https://logstash.wikimedia.org/goto/78c6806f944f18c4daeef90d1d14d030
[15:58:46] an alert for mw errors was raised
[16:01:49] swfrench-wmf deployed the PSP change but timing doesn't line up
[16:02:15] claime: yeah it more lines up with what you and kamila_ were doing, if anything
[16:02:22] yeaaaah
[16:02:32] not that I really see how it could be related, but :)
[16:02:33] what did I break?
[16:02:53] the error msg is not very descriptive
[16:03:03] "Could not enqueue jobs" :D
[16:03:03] it's all eqiad
[16:03:14] I was rebooting stuff in codfw
[16:03:39] but I don't see how moving k8s control planes around would cause jobqueue errors :/
[16:03:45] seems to have started around 14:20 UTC
[16:03:55] no, it's a separate trigger, I think
[16:03:59] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&from=now-6h&to=now
[16:04:06] many spikes of refreshLinks near those times
[16:05:09] well, like half an hour earlier?
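
(Aside on the PCC/PuppetDB cleanup jhathaway describes above: a minimal sketch of the usual way stale nodes are cleared out of a PuppetDB instance, assuming shell access to the host it runs on — the certname and TTL values here are purely illustrative.)

    # Deactivating a certname tells PuppetDB to stop including that node in
    # queries; the purge TTL then removes it entirely after the configured delay.
    sudo puppet node deactivate stale-host1001.example.wmnet
    # For bulk/automatic cleanup, PuppetDB's [database] settings can expire and
    # then purge nodes that have stopped reporting, e.g. in database.ini:
    #   node-ttl = 14d
    #   node-purge-ttl = 7d
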
[16:05:12] sorry, just catching up - taking a look
[16:05:44] that starts before I started poking at k8s control plane
[16:05:59] whole bunch of refreshlinks produced at around 14:10
[16:06:31] 5x the usual number, but that shouldn't block new jobs being enqueued
[16:06:39] failure to enqueue is likely an eventgate-side thing
[16:07:31] yes I agree
[16:07:50] hnowlan: https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=30s&var-service=eventgate-main&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&var-site=All&viewPanel=82
[16:08:12] matches very well
[16:09:19] we've seen some of this before
[16:09:25] but never had a good explanation :(
[16:10:34] this is quite a lot of jobs though
[16:11:02] hnowlan: the other thing that correlates is this https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=eventgate&var-kubernetes_namespace=eventgate-main&var-destination=All&from=now-6h&to=now&viewPanel=28
[16:12:18] wondering if it is a specific pod or all of them
[16:12:55] seems like we're hitting some kind of saturation/starvation within eventgate that we're not observing or is being hidden
[16:13:10] we've been trying to track that down for years I think
[16:13:21] bit weird that we *never* see any sign of issues with the actual eventgate service
[16:14:37] which makes me think there's some kind of connection limit or something that it's happily hitting and going about its business
[16:14:50] eventgate logs never have anything useful
[16:15:09] do we have any logging from Envoy about what these active requests are
[16:15:59] no, the proxy envoy doesn't really log anything
[16:16:21] so IIUC this is inbound traffic to eventgate-main that fails right?
[16:16:28] yep
[16:16:36] (ime for envoy outside of custom logging you have two choices between nothing and firehose)
[16:16:38] elukey: yes and it fails once it has reached the envoy TLS listener for eventgate-main
[16:16:58] * elukey nods
[16:17:30] how many eventgate-mains do we have?
[16:17:40] Do I need to make some state change to a drive in state 'Firmware state: Unconfigured(good), Spun Up' before I can set it as JBOD/Non-RAID? 'sudo megacli -pdmakegood -physdrv [32:10] -a0' just says 'Adapter: 0: Failed to change PD state at EnclId-32 SlotId-10.'. I can presumably go and poke via the iDRAC, but it'd be nice to be able to CLI it
[16:17:50] cdanis: 10 afaics, + canary (pods)
[16:18:05] elukey: ok well good news bad news it isn't really different across pods then
[16:18:22] https://grafana.wikimedia.org/goto/2Te38fsSR?orgId=1
[16:18:41] bah, sorry, wrong command, 'sudo megacli -pdmakejbod -physdrv [32:10] -a0' (still that error message though)
[16:19:51] hnowlan: and IIRC the issue goes away recycling pods right? Or do I misremember?
[16:20:38] I'm not sure but I wouldn't be surprised
[16:20:51] I'd be tempted to just add more pods but that'll hide the bug rather than give us more data
[16:21:09] what about turning on firehose logging for the canary eventgate-main
[16:23:51] I think that it is a good approach, and IIRC we should be able to do it dynamically via nsenter / localhost:port/command right?
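
(For the record, the dynamic route being floated above would look roughly like the following — a sketch that assumes the Envoy admin interface is actually reachable; the port number and socket path are illustrative, and per the later discussion it may only be exposed on a unix socket.)

    # Raise all Envoy loggers to debug at runtime via the admin API
    curl -s -X POST 'http://127.0.0.1:9901/logging?level=debug'
    # or just one logger (e.g. connection handling) to keep the firehose manageable
    curl -s -X POST 'http://127.0.0.1:9901/logging?connection=debug'
    # if the admin interface is bound to a unix socket rather than a TCP port:
    curl -s -X POST --unix-socket /var/run/envoy/admin.sock 'http://localhost/logging?level=debug'
    # and back to the default once done
    curl -s -X POST 'http://127.0.0.1:9901/logging?level=info'
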
[16:24:47] I'd also recycle 3/4 pods to see if the error rate goes down, if it happens it may be some weird state that envoy enters
[16:25:12] cdanis: trying to enable debug on canary
[16:26:02] cool
[16:26:04] since it seems the PSS changes aren't implicated, any objections to proceeding to update changeprop-jobqueue in eqiad? (it's the only thing left to update)
[16:26:59] swfrench-wmf: I'd wait a bit for the moment, we are not sure what's happening yet and it is closely related to the job queues
[16:27:44] mmm now that I think about it, do we have the admin port available?
[16:27:51] sure can do - just trying not to leave surprise diffs around :)
[16:27:54] I seem to recall a patch to disable it
[16:27:57] sure sure :)
[16:29:49] unrelated? https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?orgId=1&from=now-6h&to=now
[16:30:41] yeah I think we use the unix socket
[16:30:55] cdanis: I think so
[16:31:32] I am deleting 3 pods to see if anything changes in the metrics
[16:33:21] done
[16:35:50] I am trying to find a way to enable debug logging on the fly but so far I am not having luck
[16:35:55] if you have ideas lemme know :)
[16:36:54] I am going to keep recycling pods since the error trend looks better now
[16:40:58] looks better now yeah. don't think you can get debug enabled without changing the configmap
[16:42:11] errors seems zero now, I just cleaned 5 out of 11 pods thoug
[16:42:14] *though
[16:43:20] hmm
[16:43:39] yep I didn't expect this :D
[16:46:17] it's interesting that it's a correlated event across pods (e.g., low probability of a resource leak or something, which would be uncorrelated) ... so it has to be either (a) a property of the traffic it's getting or (b) a dependency
[16:46:28] do we still think that this may be related to envoy getting into a weird state? If so I wouldn't be able to explain why it showed errors on all pods (see Chris' graph) but then went away when I cleared only half the flet
[16:46:32] *fleet
[16:46:57] yes good point as well
[16:47:43] I think that a good next step could be to have documented how/where to enable debug logging for the eventgate canary, and then turn it on the next time
[16:48:16] hnowlan: what do you think if we add commented code in eventgate's values.yaml to be turned on if needed? So next time it will be quicker
[16:48:25] not now I mean, during the next days :)
[16:48:42] swfrench-wmf: at this point I think you can proceed
[16:49:00] elukey: sgtm
[16:49:55] elukey: ack, thanks will do
[16:50:37] super, logging off folks, have a nice rest of the day!
[18:47:51] elukey: hnowlan maybe useful: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/eventgate/values.yaml#32
[18:48:56] will let you connect to a node inspector in browser https://nodejs.org/en/learn/getting-started/debugging
[18:50:02] increasing log level is same as most charts
[18:50:02] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/eventgate/templates/_config.yaml#22
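
(On the node-inspector suggestion just above: a rough sketch of one way to attach it to a running eventgate pod next time this happens — the namespace, pod name and process match are all illustrative, and it assumes pgrep/kill exist in the container image. SIGUSR1 makes Node.js start listening for inspector connections on 127.0.0.1:9229.)

    # find the node process inside the (hypothetical) canary pod and signal it
    kubectl -n eventgate-main exec eventgate-main-canary-abc123 -- pgrep -f node
    kubectl -n eventgate-main exec eventgate-main-canary-abc123 -- kill -USR1 <pid>
    # forward the inspector port locally and attach via chrome://inspect
    kubectl -n eventgate-main port-forward eventgate-main-canary-abc123 9229:9229
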