[01:28:21] (SystemdUnitFailed) firing: upload_puppet_facts.service Failed on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:28:22] (SystemdUnitFailed) firing: upload_puppet_facts.service Failed on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:55:35] 10SRE-tools, 10Infrastructure-Foundations, 10SRE, 10Puppet (Puppet 7.0): Cumin: update config to use new puppet7 infrastructure - https://phabricator.wikimedia.org/T341497 (10MoritzMuehlenhoff) If it's helpful for the rampup and/or early testing we can also go ahead and point cuminunpriv1001 to the Puppet 7...
[07:01:47] 10CAS-SSO, 10Infrastructure-Foundations: Better handling of memcached service - https://phabricator.wikimedia.org/T255132 (10MoritzMuehlenhoff) >>! In T255132#9002094, @akosiaris wrote: > @dzahn, Judging from the content of the task, this is for #infrastructure-foundations, not #serviceops, retagging. > > Th...
[08:31:30] I'm back! almost caught up with email, let me know if there is anything pressing I should look at
[08:31:39] welcome back :)
[08:32:29] XioNoX: welcome back! Maybe the notes from yesterday's meeting for the knams migration
[08:32:45] alright
[08:45:51] welcome back
[09:01:06] 10netops, 10Ganeti, 10Infrastructure-Foundations, 10SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) a:03ayounsi
[09:22:29] Morning. Has anyone here got much experience with burrow on kafkamon servers? There's a burrow service failing to start on kafkamon1003 and therefore not monitoring kafka-jumbo: T341551 Any ideas?
[09:22:30] T341551: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551
[09:28:21] (SystemdUnitFailed) resolved: upload_puppet_facts.service Failed on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:28:38] yay, I got added as a contributor to Aerleon, the Python library we use to automate network firewall rules, and my first 2 patches got merged! - https://github.com/aerleon/aerleon/pull/315
[09:32:01] yay
[09:44:24] nice work
[11:35:35] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (10cmooney) 05Open→03Resolved a:03cmooney Still stable so I will close this for now, if it re-occurs we can engage Juniper.
[11:42:03] btullis: for burrow/kafkamon it's best to add Keith and Cole to the task, they might have ideas or can have a closer look
[12:50:55] 10Puppet, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10fnegri)
[13:32:10] volans: (or anyone else) can I get a second set of eyes on the new puppetdb stuff?
[13:32:26] puppetdb1003 is working fine and outperforming puppetdb1002 (which makes sense)
[13:32:47] however puppetdb2003 is getting very large submit times and an ever-growing queue
[13:32:54] e.g.
2023-07-11T13:32:06.153Z INFO [p.p.command] [1118-1689081916585-1689081915432] [363831 ms] 'replace catalog' command (4f664c65) processed for kubernetes2024.codfw.wmnet
[13:33:18] interesting, that's also something we've seen with the current infra
[13:33:25] we generally see a "LOG: duration: 42597.582 ms execute ..."
[13:33:33] that codfw puppetdb had slowness issues at times, right?
[13:33:41] entry in postgres, so it looks like an issue in postgres
[13:33:48] codfw is always a bit slower
[13:34:08] ~500-1000ms vs 80-500ms
[13:34:30] but not normally this far apart
[13:34:32] remind me, codfw puppetdb connects to postgres in eqiad with TLS?
[13:34:37] yes
[13:34:49] for writes; for reads it connects locally
[13:35:15] blindly, regardless of replication lag?
[13:35:40] yes, it's something I was wondering if we should change, but that's how it's configured now
[13:35:58] (this was a change I made during the last db slowness issues)
[13:37:24] ah ok
[13:37:46] as in, I added the read-only database config when we had the last perf issue, but perhaps we should roll that back
[13:38:02] I don't think this is the issue as it's all writes currently, but it could be
[13:38:07] got it, and possibly explore the possibility of independent local dbs?
[13:38:41] yes, I think that is an option for the future but not explored yet
[13:39:20] ack
[13:57:35] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi)
[14:01:53] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi) a:03BBlack Assigning the task to @BBlack for when he comes back.
[14:11:22] 10netops, 10Infrastructure-Foundations, 10SRE: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10RobH)
[14:30:16] jbond, moritzm: would you have some time tomorrow to have a quick chat about DSCP marking / nftables?
[14:32:41] topranks: I could do 15:00 UTC -> 16:00, otherwise it's Friday for me
[14:33:48] jbond: thanks, that's good with me. probably even 30 mins is long enough
[14:33:56] ack
[14:34:33] I've no doubt I'll have additional puppet things to get your advice on, but I want to get a high-level view with moritz before he is away if possible
[14:34:49] sure sgtm
[14:34:58] 15:00 UTC works for me tomorrow (I'm off Thursday to the 28th)
[14:35:15] great thanks, I'll put something in the calendar
[14:37:46] sgtm
[14:37:52] +1
[15:14:27] 10Puppet, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Dzahn) A "cron" (timer) has been created. So it could be called resolved. The only thing is that this is opt-in and not automatically fo...
[15:26:42] 10CAS-SSO, 10Infrastructure-Foundations: Better handling of memcached service - https://phabricator.wikimedia.org/T255132 (10Dzahn) ACK, thanks Alex, you are right. And Moritz, sounds good to me to close it then.
[15:30:15] 10CAS-SSO, 10Infrastructure-Foundations: Better handling of memcached service - https://phabricator.wikimedia.org/T255132 (10jcrespo) Just returned today. If this was fixed, not a blocker on my side, I had not had the issue I commented on myself recently.
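For context on the split read/write setup discussed around 13:34 above (writes going to the postgres primary in eqiad over TLS, reads served from the local replica regardless of replication lag), PuppetDB supports a separate [read-database] section alongside [database] in database.ini. The sketch below only illustrates the shape of such a config; the hostnames, credentials and pool size are placeholders, not the actual production values.

```ini
# /etc/puppetdb/conf.d/database.ini -- illustrative sketch, not the production config
[database]
# writes: primary postgres in eqiad, reached over TLS (hostname is a placeholder)
subname = //puppetdb-primary.eqiad.wmnet:5432/puppetdb?ssl=true
username = puppetdb
password = REDACTED
maximum-pool-size = 100

[read-database]
# reads: local replica in the same DC, queried blindly even if replication lags
subname = //localhost:5432/puppetdb
username = puppetdb_ro
password = REDACTED
```

Rolling back to a single [database] block, or pointing each site at an independent local database, are the two alternatives floated in the conversation above.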
[15:32:44] (SystemdUnitCrashLoop) firing: kube-controller-manager.service crashloop on aux-k8s-ctrl1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:32:59] 10CAS-SSO, 10Infrastructure-Foundations: Better handling of memcached service - https://phabricator.wikimedia.org/T255132 (10jcrespo) 05Open→03Resolved a:03MoritzMuehlenhoff
[15:42:44] (SystemdUnitCrashLoop) resolved: kube-controller-manager.service crashloop on aux-k8s-ctrl1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:50:03] jhathaway: slyngs: do you have a sec to hang back after the meeting so I can talk a bit more about where I am with puppetdb?
[15:50:14] Sure
[15:50:32] sure
[15:50:35] cool
[15:50:58] actually I misspoke, I have an onfire meeting right after this :(
[15:51:39] I'll hang though if we finish early
[15:52:11] ack, sounds good. if not there is some info in the backlog; also the processing queue is a bit more stable right now so it can wait
[15:52:30] if not, and assuming I'm still having issues tomorrow, I'll go over it then
[15:53:15] the one thing I will say is that I have updated the /etc/puppetdb/conf.d/command-processing.ini settings manually with puppet disabled
[15:54:12] I also increased maximum-pool-size = 100 in database.ini, and max_connections in /etc/postgresql/15/main/tuning.conf
[15:55:36] unfortunately I have guests tonight so I won't be about later, but if anything explodes please ping me, I'm at home just not at the laptop
[15:55:52] will do
[16:26:14] looking at https://wikitech.wikimedia.org/wiki/User:Jbond/debugging#show_blocked_by_waiting_on_lock it's all facts_path and faces causing the issue (which was similar to before)
[17:46:57] jhathaway: in relation to puppetdb2003 I had a chance to take a bit more of a look at things; with the current settings command processing runs fine for about ~10 mins, but then we start getting large command processing times, which causes a thundering herd issue and puppetdb2003 can't catch up
[17:47:02] https://phabricator.wikimedia.org/P49553
[17:47:16] this is what it starts looking like, the previous 10 mins have times of 300-800ms
[17:47:53] deleting the stockpile queue /var/lib/puppetdb/stockpile/cmd/q and a restart will get things to recover
[17:48:43] also we have a systemd timer which also deletes this queue if it gets too big
[17:49:06] which means that for now puppetdb2003 is sort of struggling along OK but definitely missing reports
[17:49:37] anyway I'm signing off for now, if you find anything either mail me or perhaps add it to T263578
[17:49:38] T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578
[17:50:01] it's probably good to keep all of the debugging around this attached to that task somehow for discoverability
[18:02:30] anyway I'm signing off for now, if you find anything either mail me or perhaps add it to T263578
[18:02:31] T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578
[18:03:51] * jbond possibly not a thundering herd but a definite cascading effect
[20:04:20] 10CAS-SSO, 10Data-Platform-SRE, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10BTullis) p:05Medium→03High Raising the priority of this task.
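The log above (15:53) doesn't say which command-processing.ini values were changed on puppetdb2003; the usual knobs in that file are the command-processing thread count and the stockpile concurrent-writes limit, so a minimal sketch under that assumption, with purely illustrative numbers, would be:

```ini
# /etc/puppetdb/conf.d/command-processing.ini -- sketch only; which values were
# actually changed on puppetdb2003 is not recorded in this log
[command-processing]
# worker threads draining /var/lib/puppetdb/stockpile/cmd/q into postgres
threads = 16
# how many commands may be written to the stockpile queue concurrently
concurrent-writes = 16
```

The maximum-pool-size = 100 change mentioned at 15:54 lives in the [database] section of database.ini, and max_connections on the postgres side (in /etc/postgresql/15/main/tuning.conf) generally has to be raised in step with the combined pool sizes of the connecting puppetdb hosts, otherwise the extra workers just queue behind the connection limit.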
[20:12:37] 10CAS-SSO, 10Data-Platform-SRE, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10BTullis) a:05Stevemunene→03BTullis
[21:13:48] 10Mail, 10Data-Platform-SRE, 10Infrastructure-Foundations: kerberos manage_principals.py emails go to spam - https://phabricator.wikimedia.org/T318155 (10BTullis)
[21:47:29] 10netops, 10Infrastructure-Foundations: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10BTullis) Removing #data-engineering as I think that #infrastructure-foundations is on top of it.