[01:28:21] (SystemdUnitFailed) firing: upload_puppet_facts.service Failed on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:28:22] (SystemdUnitFailed) firing: upload_puppet_facts.service Failed on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:55:35] 10SRE-tools, 10Infrastructure-Foundations, 10SRE, 10Puppet (Puppet 7.0): Cumin: update config to use new puppet7 infrastructure - https://phabricator.wikimedia.org/T341497 (10MoritzMuehlenhoff) If it's helpful for the rampup and/or early testing we can also go ahead and point cuminunpriv1001 to the Puppet 7...
[07:01:47] 10CAS-SSO, 10Infrastructure-Foundations: Better handling of memcached service - https://phabricator.wikimedia.org/T255132 (10MoritzMuehlenhoff) >>! In T255132#9002094, @akosiaris wrote: > @dzahn, Judging from the content of the task, this is for #infrastructure-foundations, not #serviceops, retagging. > > Th...
[08:31:30] I'm back! almost caught up with email, let me know if there is anything pressing I should look at
[08:31:39] welcome back :)
[08:32:29] XioNoX: welcome back! Maybe the notes from yesterday's meeting for the knams migration
[08:32:45] alright
[08:45:51] welcome back
[09:01:06] 10netops, 10Ganeti, 10Infrastructure-Foundations, 10SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) a:03ayounsi
[09:22:29] Morning. Has anyone here got much experience with burrow on kafkamon servers? There's a burrow service failing to start on kafkamon1003 and therefore not monitoring kafka-jumbo: T341551 Any ideas?
[09:22:30] T341551: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551
[09:28:21] (SystemdUnitFailed) resolved: upload_puppet_facts.service Failed on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:28:38] yay, I got added as a contributor to Aerleon, the Python library we use to automate network firewall rules, and my first 2 patches got merged! - https://github.com/aerleon/aerleon/pull/315
[09:32:01] yay
[09:44:24] nice work
[11:35:35] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (10cmooney) 05Open→03Resolved a:03cmooney Still stable so I will close this for now, if it re-occurs we can engage Juniper.
[11:42:03] btullis: for burrow/kafkamon it's best to add Keith and Cole to the task, they might have ideas or can have a closer look
[12:50:55] 10Puppet, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10fnegri)
[13:32:10] volans: (or anyone else) can I get a second set of eyes on the new puppetdb stuff?
[13:32:26] puppetdb1003 is working fine and outperforming puppetdb1002 (which makes sense)
[13:32:47] however puppetdb2003 is getting very large submit times and an ever-growing queue
[13:32:54] e.g.
2023-07-11T13:32:06.153Z INFO [p.p.command] [1118-1689081916585-1689081915432] [363831 ms] 'replace catalog' command (4f664c65) processed for kubernetes2024.codfw.wmnet
[13:33:18] interesting, that's also something we've seen with the current infra
[13:33:25] we generally see a "LOG: duration: 42597.582 ms execute ..."
[13:33:33] that codfw puppetdb had slowness issues at times, right?
[13:33:41] entry in postgres, so it looks like an issue in postgres
[13:33:48] codfw is always a bit slower
[13:34:08] ~500-1000ms vs 80-500ms
[13:34:30] but not normally this far apart
[13:34:32] remind me, codfw puppetdb connects to postgres in eqiad with TLS?
[13:34:37] yes
[13:34:49] for writes; for reads it connects locally
[13:35:15] blindly, regardless of replication lag?
[13:35:40] yes, it's something I was wondering if we should change, but that's how it's configured now
[13:35:58] (this was a change I made during the last db slowness issues)
[13:37:24] ah ok
[13:37:46] as in, I added the read-only database config when we had the last perf issue, but perhaps we should roll that back
[13:38:02] I don't think this is the issue as it's all writes currently, but it could be
[13:38:07] got it, and possibly explore the possibility of independent local dbs?
[13:38:41] yes, I think that is an option for the future but not explored yet
[13:39:20] ack
[13:57:35] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi)
[14:01:53] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi) a:03BBlack Assigning the task to @BBlack for when he comes back.
[14:11:22] 10netops, 10Infrastructure-Foundations, 10SRE: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10RobH)
[14:30:16] jbond, moritzm: would you have some time tomorrow to have a quick chat about DSCP marking / nftables?
[14:32:41] topranks: I could do 15:00 UTC -> 16:00, otherwise it's Friday for me
[14:33:48] jbond: thanks, that's good with me. probably even 30 mins is long enough
[14:33:56] ack
[14:34:33] I've no doubt I'll have additional puppet things to get your advice on, but I want to get a high-level view with moritz before he is away if possible
[14:34:49] sure sgtm
[14:34:58] 15:00 UTC works for me tomorrow (I'm off Thursday to the 28th)
[14:35:15] great thanks, I'll put something in the calendar
[14:37:46] sgtm
[14:37:52] +1
[15:14:27] 10Puppet, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Dzahn) A "cron" (timer) has been created. So it could be called resolved. The only thing is that this is opt-in and not automatically fo...
[15:26:42] 10CAS-SSO, 10Infrastructure-Foundations: Better handling of memcached service - https://phabricator.wikimedia.org/T255132 (10Dzahn) ACK, thanks Alex, you are right. And Moritz, sounds good to me to close it then.
[15:30:15] 10CAS-SSO, 10Infrastructure-Foundations: Better handling of memcached service - https://phabricator.wikimedia.org/T255132 (10jcrespo) Just returned today. If this was fixed, not a blocker on my side, I had not had the issue I commented on myself recently.
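For context on the split read/write setup discussed around 13:34 above (writes going to the postgres primary in eqiad over TLS, reads served from the local replica regardless of replication lag), PuppetDB supports a separate [read-database] section alongside [database] in database.ini. The sketch below only illustrates the shape of such a config; the hostnames, credentials and pool size are placeholders, not the actual production values.

```ini
# /etc/puppetdb/conf.d/database.ini -- illustrative sketch, not the production config
[database]
# writes: primary postgres in eqiad, reached over TLS (hostname is a placeholder)
subname = //puppetdb-primary.eqiad.wmnet:5432/puppetdb?ssl=true
username = puppetdb
password = REDACTED
maximum-pool-size = 100

[read-database]
# reads: local replica in the same DC, queried blindly even if replication lags
subname = //localhost:5432/puppetdb
username = puppetdb_ro
password = REDACTED
```

Rolling back to a single [database] block, or pointing each site at an independent local database, are the two alternatives floated in the conversation above.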
[15:32:44] (SystemdUnitCrashLoop) firing: kube-controller-manager.service crashloop on aux-k8s-ctrl1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:32:59] 10CAS-SSO, 10Infrastructure-Foundations: Better handling of memcached service - https://phabricator.wikimedia.org/T255132 (10jcrespo) 05Open→03Resolved a:03MoritzMuehlenhoff
[15:42:44] (SystemdUnitCrashLoop) resolved: kube-controller-manager.service crashloop on aux-k8s-ctrl1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:50:03] jhathaway: slyngs: do you have a sec to hang back after the meeting so I can talk a bit more about where I am with puppetdb?
[15:50:14] Sure
[15:50:32] sure
[15:50:35] cool
[15:50:58] actually I misspoke, I have an onfire meeting right after this :(
[15:51:39] I'll hang though if we finish early
[15:52:11] ack, sounds good. if not there is some info in the backlog; also the processing queue is a bit more stable right now so it can wait
[15:52:30] if not, and assuming I'm still having issues tomorrow, I'll go over it then
[15:53:15] the one thing I will say is that I have updated the /etc/puppetdb/conf.d/command-processing.ini settings manually with puppet disabled
[15:54:12] I also increased maximum-pool-size = 100 in database.ini, and max_connections in /etc/postgresql/15/main/tuning.conf
[15:55:36] unfortunately I have guests tonight so I won't be about later, but if anything explodes please ping me, I'm at home just not at the laptop
[15:55:52] will do
[16:26:14] looking at https://wikitech.wikimedia.org/wiki/User:Jbond/debugging#show_blocked_by_waiting_on_lock it's all facts_path and faces causing the issue (which was similar to before)
[17:46:57] jhathaway: in relation to puppetdb2003 I had a chance to take a bit more of a look at things; with the current settings command processing runs fine for about ~10 mins, but then we start getting large command processing times, which causes a thundering herd issue and puppetdb2003 can't catch up
[17:47:02] https://phabricator.wikimedia.org/P49553
[17:47:16] this is what it starts looking like, the previous 10 mins have times of 300-800ms
[17:47:53] deleting the stockpile queue /var/lib/puppetdb/stockpile/cmd/q and a restart will get things to recover
[17:48:43] also we have a systemd timer which also deletes this queue if it gets too big
[17:49:06] which means that for now puppetdb2003 is sort of struggling along OK but definitely missing reports
[17:49:37] anyway I'm signing off for now, if you find anything either mail me or perhaps add it to T263578
[17:49:38] T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578
[17:50:01] it's probably good to keep all of the debugging around this attached to that task somehow for discoverability
[18:02:30] anyway I'm signing off for now, if you find anything either mail me or perhaps add it to T263578
[18:02:31] T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578
[18:03:51] * jbond possibly not a thundering herd but a definite cascading effect
[20:04:20] 10CAS-SSO, 10Data-Platform-SRE, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10BTullis) p:05Medium→03High Raising the priority of this task.
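The log above (15:53) doesn't say which command-processing.ini values were changed on puppetdb2003; the usual knobs in that file are the command-processing thread count and the stockpile concurrent-writes limit, so a minimal sketch under that assumption, with purely illustrative numbers, would be:

```ini
# /etc/puppetdb/conf.d/command-processing.ini -- sketch only; which values were
# actually changed on puppetdb2003 is not recorded in this log
[command-processing]
# worker threads draining /var/lib/puppetdb/stockpile/cmd/q into postgres
threads = 16
# how many commands may be written to the stockpile queue concurrently
concurrent-writes = 16
```

The maximum-pool-size = 100 change mentioned at 15:54 lives in the [database] section of database.ini, and max_connections on the postgres side (in /etc/postgresql/15/main/tuning.conf) generally has to be raised in step with the combined pool sizes of the connecting puppetdb hosts, otherwise the extra workers just queue behind the connection limit.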
[20:12:37] 10CAS-SSO, 10Data-Platform-SRE, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10BTullis) a:05Stevemunene→03BTullis
[21:13:48] 10Mail, 10Data-Platform-SRE, 10Infrastructure-Foundations: kerberos manage_principals.py emails go to spam - https://phabricator.wikimedia.org/T318155 (10BTullis)
[21:47:29] 10netops, 10Infrastructure-Foundations: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10BTullis) Removing #data-engineering as I think that #infrastructure-foundations is on top of it.