[07:08:24] https://www.irccloud.com/pastebin/bg9asqQa/
[07:33:33] moritz and I are switching Gerrit from Java 11 to Java 17
[07:34:20] given upstream Gerrit have been supporting Java 17 for quite a while, I don't anticipate issues
[07:45:29] <_joe_> hashar: don't jinx it
[07:57:01] Should I fret about "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048"
[07:57:14] ?
[07:57:41] <_joe_> Emperor: that doesn't look good at all, no
[07:57:52] <_joe_> open a task for I/F with a high priority, I'd say
[07:58:54] ack
[07:58:54] <_joe_> this isn't critical because AIUI that's a soft limit
[07:59:04] <_joe_> but we should probably raise that soft limit
[08:01:12] T366563 now exists
[08:01:19] T366563: moss-be1003 "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563
[09:16:20] IIRC, all our private IPs are in 10.0.0/8 ; what IPv6 networks are equivalent? Is there, for example, a v6 network that contains all of codfw? [I found https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations but the private networks are just "refer to netbox", and I'm not sure where I should be looking]
[09:18:24] (context is harking back to the discussion of Ceph networking the other week; I notice that while it doesn't dual-stack, it would do v6-only for cluster traffic)
[09:19:59] Emperor: https://netbox.wikimedia.org/ipam/prefixes/783/ codfw private v6
[09:20:54] XioNoX: thanks, and presumably thus https://netbox.wikimedia.org/ipam/prefixes/252/ for eqiad?
[09:21:18] Emperor: yep :)
[09:21:29] ta
[09:21:41] it would be awesome to do v6 only :)
[09:24:18] No promises! But it might be doable
[09:24:45] (and obviously doing so now before it's in service would be the time)
[09:26:19] yep, happy to help if needed
[09:38:56] https://toot.bike/@demoographics/112557677316569378 ;-)
[10:19:15] hmm, somehow mail from phab is now failing SPF checks when being delivered to wikimedia.org addresses? ('domain of transitioning no-reply@phabricator.wikimedia.org does not designate 2620:0:861:102:10:64:16:101 as permitted sender')
[10:24:39] SPF is v=spf1 ip4:208.80.152.0/22 ip6:2620:0:860::/56 ip6:2620:0:861::/56 ~all
[10:27:59] which if I'm reading that correctly should contain 2620:0:861:102:10:64:16:101 ?
[11:14:39] Emperor: yeah, jhathaway and I discussed this above in this channel. (see scrollback). I think he will put a fix today
[11:28:50] oh, yes, thanks
[13:14:20] btullis: datahub-mce-consumer (production) is at it again with the log spamming :( https://phabricator.wikimedia.org/T366596
[13:16:07] godog: On it now. Thanks. For reference, we are deleting datahub from the wikikube clusters at the moment, which may help. Not sure which deployment is spamming yet. https://phabricator.wikimedia.org/T366338
[13:16:35] cc stevemunene: ^
[13:16:50] btullis: ack thank you, this is from the dse cluster
[13:17:04] pasting a sample logstash message in the task now
[13:19:32] btullis: the wikikube deletion is done
[13:19:43] https://logstash.wikimedia.org/app/dashboards#/view/8e16cbb0-1f2e-11ef-a8f9-a3d578de2750?_g=(filters%3A!()%2CrefreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow-2h%2Cto%3Anow))
[13:21:15] stevemunene: Ack, thanks.
[14:08:50] Emperor: fix should roll out today, spent too much time debugging PCC issues yesterday :(
[14:15:24] jhathaway: thanks (and, err, sorry for the PCC stuff!)
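
For reference on the SPF question above: a quick containment check (a throwaway sketch using only Python's standard ipaddress module, not part of the original discussion) shows the sending address is in fact outside both ip6 ranges in that record, which would explain the failures:

    python3 -c 'import ipaddress as i; a = i.ip_address("2620:0:861:102:10:64:16:101"); print([p for p in ("2620:0:860::/56", "2620:0:861::/56") if a in i.ip_network(p)])'
    # prints [] -- 2620:0:861:102::/64 lies outside 2620:0:861::/56 (a /56 only
    # spans 2620:0:861:0:: through 2620:0:861:ff::), so the ~all fallback applies;
    # covering this host needs a wider range (e.g. a /48) or an extra ip6: entry.
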
[14:16:04] no need to apologize, it is just one of those tools that could use some more love
[15:10:01] <_joe_> jhathaway: the puppet compiler is a typical case of organic growth out of a small program I wrote for a specific purpose, it has never been developed as a proper software with a roadmap, and it shows :)
[15:10:22] <_joe_> jhathaway: but if you need some insight when debugging, I might be able to help now that I'm back
[15:15:00] thanks _joe_ I figured out how to at least unblock my patch, by purging 9000 or so nodes in PCC's puppetdb that were causing it to time out on simple queries. There's still a periodic job that causes the db to respond really slowly, but I ran out of time debugging the nature of that capacity problem. So far the debugging has served the purpose of educating myself a bit on how the pieces fit
[15:15:02] together. That said I may ping you at some point when we find some time to prioritize PCC maintenance and improvements.
[15:15:37] <_joe_> nod
[15:58:37] anything happening in the job queues? https://logstash.wikimedia.org/goto/78c6806f944f18c4daeef90d1d14d030
[15:58:46] an alert for mw errors was raised
[16:01:49] swfrench-wmf deployed the PSP change but timing doesn't line up
[16:02:15] claime: yeah it more lines up with what you and kamila_ were doing, if anything
[16:02:22] yeaaaah
[16:02:32] not that I really see how it could be related, but :)
[16:02:33] what did I break?
[16:02:53] the error msg is not very descriptive
[16:03:03] "Could not enqueue jobs" :D
[16:03:03] it's all eqiad
[16:03:14] I was rebooting stuff in codfw
[16:03:39] but I don't see how moving k8s control planes around would cause jobqueue errors :/
[16:03:45] seems to have started around 14:20 UTC
[16:03:55] no, it's a separate trigger, I think
[16:03:59] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&from=now-6h&to=now
[16:04:06] many spikes of refreshLinks near those times
[16:05:09] well, like half an hour earlier?
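
(Aside on the PCC/PuppetDB cleanup jhathaway describes above: a minimal sketch of the usual way stale nodes are cleared out of a PuppetDB instance, assuming shell access to the host it runs on — the certname and TTL values here are purely illustrative.)

    # Deactivating a certname tells PuppetDB to stop including that node in
    # queries; the purge TTL then removes it entirely after the configured delay.
    sudo puppet node deactivate stale-host1001.example.wmnet
    # For bulk/automatic cleanup, PuppetDB's [database] settings can expire and
    # then purge nodes that have stopped reporting, e.g. in database.ini:
    #   node-ttl = 14d
    #   node-purge-ttl = 7d
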
[16:05:12] sorry, just catching up - taking a look
[16:05:44] that starts before I started poking at k8s control plane
[16:05:59] whole bunch of refreshlinks produced at around 14:10
[16:06:31] 5x the usual number, but that shouldn't block new jobs being enqueued
[16:06:39] failure to enqueue is likely an eventgate-side thing
[16:07:31] yes I agree
[16:07:50] hnowlan: https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=30s&var-service=eventgate-main&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&var-site=All&viewPanel=82
[16:08:12] matches very well
[16:09:19] we've seen some of this before
[16:09:25] but never had a good explanation :(
[16:10:34] this is quite a lot of jobs though
[16:11:02] hnowlan: the other thing that correlates is this https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=eventgate&var-kubernetes_namespace=eventgate-main&var-destination=All&from=now-6h&to=now&viewPanel=28
[16:12:18] wondering if it is a specific pod or all of them
[16:12:55] seems like we're hitting some kind of saturation/starvation within eventgate that we're not observing or is being hidden
[16:13:10] we've been trying to track that down for years I think
[16:13:21] bit weird that we *never* see any sign of issues with the actual eventgate service
[16:14:37] which makes me think there's some kind of connection limit or something that it's happily hitting and going about its business
[16:14:50] eventgate logs never have anything useful
[16:15:09] do we have any logging from Envoy about what these active requests are
[16:15:59] no, the proxy envoy doesn't really log anything
[16:16:21] so IIUC this is inbound traffic to eventgate-main that fails right?
[16:16:28] yep
[16:16:36] (ime for envoy outside of custom logging you have two choices between nothing and firehose)
[16:16:38] elukey: yes and it fails once it has reached the envoy TLS listener for eventgate-main
[16:16:58] * elukey nods
[16:17:30] how many eventgate-mains do we have?
[16:17:40] Do I need to make some state change to a drive in state 'Firmware state: Unconfigured(good), Spun Up' before I can set it as JBOD/Non-RAID? 'sudo megacli -pdmakegood -physdrv [32:10] -a0' just says 'Adapter: 0: Failed to change PD state at EnclId-32 SlotId-10.'. I can presumably go and poke via the iDRAC, but it'd be nice to be able to CLI it
[16:17:50] cdanis: 10 afaics, + canary (pods)
[16:18:05] elukey: ok well good news bad news it isn't really different across pods then
[16:18:22] https://grafana.wikimedia.org/goto/2Te38fsSR?orgId=1
[16:18:41] bah, sorry, wrong command, 'sudo megacli -pdmakejbod -physdrv [32:10] -a0' (still that error message though)
[16:19:51] hnowlan: and IIRC the issue goes away recycling pods right? Or do I misremember?
[16:20:38] I'm not sure but I wouldn't be surprised
[16:20:51] I'd be tempted to just add more pods but that'll hide the bug rather than give us more data
[16:21:09] what about turning on firehose logging for the canary eventgate-main
[16:23:51] I think that it is a good approach, and IIRC we should be able to do it dynamically via nsenter / localhost:port/command right?
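
(For the record, the dynamic route being floated above would look roughly like the following — a sketch that assumes the Envoy admin interface is actually reachable; the port number and socket path are illustrative, and per the later discussion it may only be exposed on a unix socket.)

    # Raise all Envoy loggers to debug at runtime via the admin API
    curl -s -X POST 'http://127.0.0.1:9901/logging?level=debug'
    # or just one logger (e.g. connection handling) to keep the firehose manageable
    curl -s -X POST 'http://127.0.0.1:9901/logging?connection=debug'
    # if the admin interface is bound to a unix socket rather than a TCP port:
    curl -s -X POST --unix-socket /var/run/envoy/admin.sock 'http://localhost/logging?level=debug'
    # and back to the default once done
    curl -s -X POST 'http://127.0.0.1:9901/logging?level=info'
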
[16:24:47] I'd also recycle 3/4 pods to see if the error rate goes down, if it happens it may be some weird state that envoy enters
[16:25:12] cdanis: trying to enable debug on canary
[16:26:02] cool
[16:26:04] since it seems the PSS changes aren't implicated, any objections to proceeding to update changeprop-jobqueue in eqiad? (it's the only thing left to update)
[16:26:59] swfrench-wmf: I'd wait a bit for the moment, we are not sure what's happening yet and it is closely related to the job queues
[16:27:44] mmm now that I think about it, do we have the admin port available?
[16:27:51] sure can do - just trying not to leave surprise diffs around :)
[16:27:54] I seem to recall a patch to disable it
[16:27:57] sure sure :)
[16:29:49] unrelated? https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?orgId=1&from=now-6h&to=now
[16:30:41] yeah I think we use the unix socket
[16:30:55] cdanis: I think so
[16:31:32] I am deleting 3 pods to see if anything changes in the metrics
[16:33:21] done
[16:35:50] I am trying to find a way to enable debug logging on the fly but so far I am not having luck
[16:35:55] if you have ideas lemme know :)
[16:36:54] I am going to keep recycling pods since the error trend looks better now
[16:40:58] looks better now yeah. don't think you can get debug enabled without changing the configmap
[16:42:11] errors seems zero now, I just cleaned 5 out of 11 pods thoug
[16:42:14] *though
[16:43:20] hmm
[16:43:39] yep I didn't expect this :D
[16:46:17] it's interesting that it's a correlated event across pods (e.g., low probability of a resource leak or something, which would be uncorrelated) ... so it has to be either (a) a property of the traffic it's getting or (b) a dependency
[16:46:28] do we still think that this may be related to envoy getting into a weird state? If so I wouldn't be able to explain why it showed errors on all pods (see Chris' graph) but then went away when I cleared only half the flet
[16:46:32] *fleet
[16:46:57] yes good point as well
[16:47:43] I think that a good next step could be to have documented how/where to enable debug logging for the eventgate canary, and then turn it on the next time
[16:48:16] hnowlan: what do you think if we add commented code in eventgate's values.yaml to be turned on if needed? So next time it will be quicker
[16:48:25] not now I mean, during the next days :)
[16:48:42] swfrench-wmf: at this point I think you can proceed
[16:49:00] elukey: sgtm
[16:49:55] elukey: ack, thanks will do
[16:50:37] super, logging off folks, have a nice rest of the day!
[18:47:51] elukey: hnowlan maybe useful: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/eventgate/values.yaml#32
[18:48:56] will let you connect to a node inspector in browser https://nodejs.org/en/learn/getting-started/debugging
[18:50:02] increasing log level is same as most charts
[18:50:02] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/eventgate/templates/_config.yaml#22
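
(On the node-inspector suggestion just above: a rough sketch of one way to attach it to a running eventgate pod next time this happens — the namespace, pod name and process match are all illustrative, and it assumes pgrep/kill exist in the container image. SIGUSR1 makes Node.js start listening for inspector connections on 127.0.0.1:9229.)

    # find the node process inside the (hypothetical) canary pod and signal it
    kubectl -n eventgate-main exec eventgate-main-canary-abc123 -- pgrep -f node
    kubectl -n eventgate-main exec eventgate-main-canary-abc123 -- kill -USR1 <pid>
    # forward the inspector port locally and attach via chrome://inspect
    kubectl -n eventgate-main port-forward eventgate-main-canary-abc123 9229:9229
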