[02:12:27] FIRING: SystemdUnitCrashLoop: routinator.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[02:18:28] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:22:27] RESOLVED: SystemdUnitCrashLoop: routinator.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[06:18:28] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:38:28] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:43:28] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:58:28] FIRING: [3x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:36:51] hey folks!
[09:37:02] postgres on puppetdb2003 seems in trouble
[09:37:15] replication is falling behind, it seems it all started from
[09:37:16] 2025-01-06 22:56:09 GMT FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000804300000091 has already been removed
[09:39:19] and on 1003 I see
[09:39:20] 2025-01-06 22:33:01.568 GMT [db:puppetdb,sess:677c5153.2527cd,pid:2435021,vtid:46/497365,tid:173777325] ERROR: duplicate key value violates unique constraint "resource_params_cache_pkey"
[09:40:57] probably worth a task, creating one
[09:44:15] the last error is a red herring, I see the same on 1003 and 2003 at the same time.
[09:45:12] https://phabricator.wikimedia.org/T383114
[10:08:28] FIRING: [3x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:12:19] I am not 100% sure what the best course of action is, whether we need to stop postgres and resync 2003's data
[10:16:41] there is a cookbook for it that basically runs "/usr/local/bin/pg-resync-replica"
[10:19:49] found a nice article https://www.crunchydata.com/blog/how-to-recover-when-postgresql-is-missing-a-wal-file
[10:20:08] but yeah, given the amount of time the replica on 2003 was down, I think that we should resync
[10:20:15] never really done it with postgres
[10:21:30] if the WAL was deleted already there is no other way
[10:21:45] 10netops, 06Infrastructure-Foundations, 06SRE: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116 (10cmooney) 03NEW p:05Triage→03Medium
[10:22:21] so, yeah, if there is a cookbook already, go for that.
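The FATAL error above names the missing segment as 000000010000804300000091. A WAL segment file name is three 8-hex-digit fields (timeline, "log" number, segment within the log), so it can be decoded with plain shell arithmetic; this is a small sketch for reading such names, not anything run during the incident:

```shell
# Decode a PostgreSQL WAL segment file name (standard 16 MB segments).
# Layout: 8 hex digits timeline ID + 8 hex digits log number + 8 hex digits segment.
seg="000000010000804300000091"
timeline=$((16#${seg:0:8}))
log=$((16#${seg:8:8}))
segno=$((16#${seg:16:8}))
echo "timeline=$timeline log=$log seg=$segno"
```

Here the segment decodes to timeline 1, which tells you the replica was on the primary's original timeline and simply fell too far behind, rather than having diverged after a failover.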
It ain't gonna catch up
[10:22:29] 10netops, 06Infrastructure-Foundations, 06SRE: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10436696 (10cmooney)
[10:22:38] on its own, that is
[10:38:23] akosiaris: ack thanks for confirming! I'd say that we just need to disable puppet on 2003, stop postgres (and I guess puppetdb as well?) and then start the replication
[10:38:28] does it make sense?
[10:38:35] moritzm: --^
[10:39:41] yes. IIRC puppetserver doesn't use the local db because it can't write to it
[10:39:56] I don't know if they added read query routing support recently
[10:40:21] 10netops, 06Infrastructure-Foundations, 06SRE: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10436738 (10cmooney)
[10:40:27] it should be easy to check that though
[10:40:50] not sure if we need to stop puppetdb. when e.g. upgrading postgresql on puppetdb hosts it also handles a temporary unavailability during the upgrade/restart transparently
[10:40:55] okok thanks for the pointers, I know a few things about puppetdb
[10:44:37] on puppetdb2003 I see
[10:44:38] subname = //puppetdb1003.eqiad.wmnet:5432
[10:44:46] in /etc/puppetdb/conf.d/database.ini
[10:45:52] yeah, so it doesn't even use the local postgres
[10:46:57] netstat seems to agree, but I also see
[10:46:58] puppetdb2003.codf:33212 puppetdb2003:postgresql ESTABLISHED 2108566/java
[10:47:03] (a lot of them)
[10:47:11] and that pid is puppetdb
[10:47:39] ah yes
[10:47:40] conf.d/read-database.ini:3:subname = //puppetdb2003.codfw.wmnet:5432
[10:47:51] okok so it uses the local db for reads IIUC
[10:47:59] and 1003 for writes (if any)
[10:48:21] so puppetdb on 2003 should be stopped as well
[10:49:21] ack
[10:49:23] and on 1003 I see the local db mentioned for reads as well, that makes sense
[10:49:41] moritzm: ignorant question - what is the fallout when stopping puppetdb on 2003?
[10:49:54] ah, so they added that support.
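The two config files quoted above show the read/write split: writes go to the primary in eqiad, reads are served by the local replica. A minimal reconstruction of the relevant fragments follows; only the `subname` lines appear verbatim in the log, so the section headers are assumptions based on PuppetDB's standard `database`/`read-database` config sections:

```ini
; /etc/puppetdb/conf.d/database.ini on puppetdb2003 - write traffic to the primary
[database]
subname = //puppetdb1003.eqiad.wmnet:5432

; /etc/puppetdb/conf.d/read-database.ini on puppetdb2003 - reads from the local replica
[read-database]
subname = //puppetdb2003.codfw.wmnet:5432
```

This is why the replica cannot simply be resynced under a running PuppetDB: the local read connections would break, so puppetdb on 2003 has to be stopped as well.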
Good to know
[10:50:15] * akosiaris jotting it in their notes, there is no way I'll remember it next time
[10:50:56] akosiaris: thanks a lot for all the info, I'll try to add everything to wikitech. There is a postgres page; maybe it is not the best, but ok-ish for the moment (it mentions only maps stuff)
[10:52:45] elukey: when doing maintenance on puppetdb (e.g. reboots or impactful config changes), puppet has usually been disabled fleet-wide (or we disable it in codfw and edges in this case)
[10:53:26] otherwise I think we'd see (temporary) puppet failures
[11:02:07] okok makes sense
[11:02:24] I am going to prep a procedure in the task so people can review
[11:02:50] +1
[11:07:47] moritzm, akosiaris - https://phabricator.wikimedia.org/T383114#10436798 (when you have a moment)
[11:08:22] I have to run an early errand for family reasons; if anybody wants to do it before me please go ahead, otherwise I'll get to it in ~2h
[11:09:46] my main assumption is that 2003 is the replica and 1003 the master, but please confirm it to be sure :D
[11:18:04] the assumption is correct
[11:28:28] RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:27:16] back!
[13:29:17] moritzm: ok if I proceed?
[13:30:24] yes, the procedure from https://phabricator.wikimedia.org/T383114#10436798 seems fine!
[14:46:30] I created https://wikitech.wikimedia.org/wiki/Postgres#requested_WAL_segment_XXXXXXXXXXXXXXXX_has_already_been_removed to add some info about what happened today
[14:46:52] ah snap, I didn't see Moritz's suggestion about the target page
[14:48:55] added https://wikitech.wikimedia.org/wiki/Puppet#PuppetDB
[14:50:40] cheers!
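The procedure agreed in the conversation above (disable puppet on the replica, stop puppetdb and postgres there, resync from the primary, bring everything back) could be sketched roughly as below. Only `/usr/local/bin/pg-resync-replica` is quoted in the log; the `disable-puppet`/`enable-puppet` wrappers and unit names are assumptions, and in practice the spicerack cookbook wrapping the helper would be preferred over running it by hand:

```shell
# Sketch of the replica resync on puppetdb2003 (the replica), assuming
# standard unit names; NOT a verbatim transcript of what was run.
sudo disable-puppet "resync postgres replica - T383114"  # keep puppet from restarting things
sudo systemctl stop puppetdb.service                     # reads go to the local db (read-database.ini)
sudo systemctl stop postgresql.service
sudo /usr/local/bin/pg-resync-replica                    # re-seed the replica from the primary
sudo systemctl start postgresql.service
sudo systemctl start puppetdb.service
sudo enable-puppet "resync postgres replica - T383114"
```

Per moritzm's note earlier in the log, puppet may also need to be disabled more widely (codfw/edges) during the window, since puppetdb being down would otherwise surface as transient puppet failures.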
[14:56:24] I'll try to see if we can add some WAL-related settings for puppetdb's postgres master
[14:56:39] not really sure what the current status is, but it may happen again
[14:59:44] FIRING: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:09:44] RESOLVED: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:39:08] moritzm: hi, do you think I can archive cergen in Gerrit? I know it is being phased out and the repo has had no commits for 3 years, so my guess is that it is not going to be further updated and can be archived now :)
[15:43:18] it's indeed unlikely that we'll need further changes, but you never know. can't this simply wait until it's properly phased out?
[15:43:36] cergen certs are still used by our main etcd installation e.g.
[15:58:43] moritzm: yes, that can wait :)
[15:58:49] 10netops, 06Infrastructure-Foundations, 06SRE: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10437563 (10cmooney)
[15:58:53] it does not hurt anything
[15:59:29] thank you!
[16:03:29] I'll ping you when it's fully undeployed
[16:07:14] https://phabricator.wikimedia.org/T383114 is hopefully closed; the WAL size is now 12G (it was 2G) as Alex suggested
[16:07:28] if you see anything else left to do please let me know
[16:19:37] lgtm!
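The follow-up above says the WAL retention on the master was raised from 2G to 12G, but not which setting was changed; the fragment below is a hedged guess at the relevant postgresql.conf knobs (PostgreSQL 13+ names), not a dump of the actual config:

```ini
# postgresql.conf sketch - which GUC was actually raised to 12G in T383114
# is not stated in the log, so treat these names/values as assumptions.
wal_keep_size = '12GB'            # retain this much WAL so a lagging replica can still catch up
# The alternative is a physical replication slot: the primary then keeps WAL
# until the standby has consumed it, bounded by:
# max_slot_wal_keep_size = '20GB' # cap disk usage if the standby stays down
```

Raising `wal_keep_size` is the setting that directly prevents the "requested WAL segment ... has already been removed" failure mode seen at the start of this incident, at the cost of extra disk on the primary.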
[20:51:55] FIRING: MaxConntrack: Max conntrack at 83.53% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[20:56:55] RESOLVED: MaxConntrack: Max conntrack at 83.53% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack