[02:12:27] FIRING: SystemdUnitCrashLoop: routinator.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[02:18:28] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:22:27] RESOLVED: SystemdUnitCrashLoop: routinator.service crashloop on rpki2003:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[06:18:28] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:38:28] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:43:28] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:58:28] FIRING: [3x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:36:51] hey folks!
[09:37:02] postgres on puppetdb2003 seems in trouble
[09:37:15] replication is falling behind, it seems it all started from
[09:37:16] 2025-01-06 22:56:09 GMT FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000804300000091 has already been removed
[09:39:19] and on 1003 I see
[09:39:20] 2025-01-06 22:33:01.568 GMT [db:puppetdb,sess:677c5153.2527cd,pid:2435021,vtid:46/497365,tid:173777325] ERROR: duplicate key value violates unique constraint "resource_params_cache_pkey"
[09:40:57] probably worth a task, creating one
[09:44:15] the last error is a red herring, I see the same on 1003 and 2003 at the same time.
[09:45:12] https://phabricator.wikimedia.org/T383114
[10:08:28] FIRING: [3x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:12:19] I am not 100% sure what the best course of action is, whether we need to stop postgres and resync 2003's data
[10:16:41] there is a cookbook for it that basically runs "/usr/local/bin/pg-resync-replica"
[10:19:49] found a nice article https://www.crunchydata.com/blog/how-to-recover-when-postgresql-is-missing-a-wal-file
[10:20:08] but yeah, given the amount of time the replica on 2003 was down, I think that we should resync
[10:20:15] never really done it with postgres
[10:21:30] if the WAL was deleted already there is no other way
[10:21:45] 10netops, 06Infrastructure-Foundations, 06SRE: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116 (10cmooney) 03NEW p:05Triage→03Medium
[10:22:21] so, yeah, if there is a cookbook already, go for that.
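The FATAL error above names the missing segment as 000000010000804300000091. A WAL segment file name is three 8-hex-digit fields (timeline, "log" number, segment within the log), so it can be decoded with plain shell arithmetic; this is a small sketch for reading such names, not anything run during the incident:

```shell
# Decode a PostgreSQL WAL segment file name (standard 16 MB segments).
# Layout: 8 hex digits timeline ID + 8 hex digits log number + 8 hex digits segment.
seg="000000010000804300000091"
timeline=$((16#${seg:0:8}))
log=$((16#${seg:8:8}))
segno=$((16#${seg:16:8}))
echo "timeline=$timeline log=$log seg=$segno"
```

Here the segment decodes to timeline 1, which tells you the replica was on the primary's original timeline and simply fell too far behind, rather than having diverged after a failover.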
It ain't gonna catch up
[10:22:29] 10netops, 06Infrastructure-Foundations, 06SRE: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10436696 (10cmooney)
[10:22:38] on its own, that is
[10:38:23] akosiaris: ack thanks for confirming! I'd say that we just need to disable puppet on 2003, stop postgres (and I guess puppetdb as well?) and then start the replication
[10:38:28] does it make sense?
[10:38:35] moritzm: --^
[10:39:41] yes. IIRC puppetserver doesn't use the local db because it can't write to it
[10:39:56] I don't know if they added read query routing support recently
[10:40:21] 10netops, 06Infrastructure-Foundations, 06SRE: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10436738 (10cmooney)
[10:40:27] it should be easy to check that though
[10:40:50] not sure if we need to stop puppetdb. when e.g. upgrading postgresql on puppetdb hosts it also handles a temporary unavailability during the upgrade/restart transparently
[10:40:55] okok thanks for the pointers, I know a few things about puppetdb
[10:44:37] on puppetdb2003 I see
[10:44:38] subname = //puppetdb1003.eqiad.wmnet:5432
[10:44:46] in /etc/puppetdb/conf.d/database.ini
[10:45:52] yeah, so it doesn't even use the local postgres
[10:46:57] netstat seems to agree, but I also see
[10:46:58] puppetdb2003.codf:33212 puppetdb2003:postgresql ESTABLISHED 2108566/java
[10:47:03] (a lot of them)
[10:47:11] and that pid is puppetdb
[10:47:39] ah yes
[10:47:40] conf.d/read-database.ini:3:subname = //puppetdb2003.codfw.wmnet:5432
[10:47:51] okok so it uses the local db for reads IIUC
[10:47:59] and 1003 for writes (if any)
[10:48:21] so puppetdb on 2003 should be stopped as well
[10:49:21] ack
[10:49:23] and on 1003 I see the local db mentioned for reads as well, that makes sense
[10:49:41] moritzm: ignorant question - what is the fallout when stopping puppetdb on 2003?
[10:49:54] ah, so they added that support.
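The two config files quoted above show the read/write split: writes go to the primary in eqiad, reads are served by the local replica. A minimal reconstruction of the relevant fragments follows; only the `subname` lines appear verbatim in the log, so the section headers are assumptions based on PuppetDB's standard `database`/`read-database` config sections:

```ini
; /etc/puppetdb/conf.d/database.ini on puppetdb2003 - write traffic to the primary
[database]
subname = //puppetdb1003.eqiad.wmnet:5432

; /etc/puppetdb/conf.d/read-database.ini on puppetdb2003 - reads from the local replica
[read-database]
subname = //puppetdb2003.codfw.wmnet:5432
```

This is why the replica cannot simply be resynced under a running PuppetDB: the local read connections would break, so puppetdb on 2003 has to be stopped as well.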
Good to know
[10:50:15] * akosiaris jotting it in their notes, there is no way I'll remember it next time
[10:50:56] akosiaris: thanks a lot for all the info, I'll try to add everything to wikitech. There is a postgres page; maybe it is not the best, but ok-ish for the moment (it mentions only maps stuff)
[10:52:45] elukey: when doing maintenance on puppetdb (e.g. reboots or impactful config changes), puppet has usually been disabled fleet-wide (or we disable it in codfw and edges in this case)
[10:53:26] otherwise I think we'd see (temporary) puppet failures
[11:02:07] okok makes sense
[11:02:24] I am going to prep a procedure in the task so people can review
[11:02:50] +1
[11:07:47] moritzm, akosiaris - https://phabricator.wikimedia.org/T383114#10436798 (when you have a moment)
[11:08:22] I have to run an early errand for family reasons; if anybody wants to do it before me please go ahead, otherwise I'll get to it in ~2h
[11:09:46] my main assumption is that 2003 is the replica and 1003 the master, but please confirm it to be sure :D
[11:18:04] the assumption is correct
[11:28:28] RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:27:16] back!
[13:29:17] moritzm: ok if I proceed?
[13:30:24] yes, the procedure from https://phabricator.wikimedia.org/T383114#10436798 seems fine!
[14:46:30] I created https://wikitech.wikimedia.org/wiki/Postgres#requested_WAL_segment_XXXXXXXXXXXXXXXX_has_already_been_removed to add some info about what happened today
[14:46:52] ah snap, I didn't see Moritz's suggestion about the target page
[14:48:55] added https://wikitech.wikimedia.org/wiki/Puppet#PuppetDB
[14:50:40] cheers!
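The procedure agreed in the conversation above (disable puppet on the replica, stop puppetdb and postgres there, resync from the primary, bring everything back) could be sketched roughly as below. Only `/usr/local/bin/pg-resync-replica` is quoted in the log; the `disable-puppet`/`enable-puppet` wrappers and unit names are assumptions, and in practice the spicerack cookbook wrapping the helper would be preferred over running it by hand:

```shell
# Sketch of the replica resync on puppetdb2003 (the replica), assuming
# standard unit names; NOT a verbatim transcript of what was run.
sudo disable-puppet "resync postgres replica - T383114"  # keep puppet from restarting things
sudo systemctl stop puppetdb.service                     # reads go to the local db (read-database.ini)
sudo systemctl stop postgresql.service
sudo /usr/local/bin/pg-resync-replica                    # re-seed the replica from the primary
sudo systemctl start postgresql.service
sudo systemctl start puppetdb.service
sudo enable-puppet "resync postgres replica - T383114"
```

Per moritzm's note earlier in the log, puppet may also need to be disabled more widely (codfw/edges) during the window, since puppetdb being down would otherwise surface as transient puppet failures.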
[14:56:24] I'll try to see if we can add some WAL-related settings for puppetdb's postgres master
[14:56:39] not really sure what the current status is, but it may happen again
[14:59:44] FIRING: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:09:44] RESOLVED: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:39:08] moritzm: hi, do you think I can archive cergen in Gerrit? I know it is being phased out and the repo has had no commits for 3 years, so my guess is that it is not going to be further updated and can be archived now :)
[15:43:18] it's indeed unlikely that we'll need further changes, but you never know. can't this simply wait until it's properly phased out?
[15:43:36] cergen certs are still used by our main etcd installation e.g.
[15:58:43] moritzm: yes, that can wait :)
[15:58:49] 10netops, 06Infrastructure-Foundations, 06SRE: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10437563 (10cmooney)
[15:58:53] it does not hurt anything
[15:59:29] thank you!
[16:03:29] I'll ping you when it's fully undeployed
[16:07:14] https://phabricator.wikimedia.org/T383114 is hopefully closed; the WAL size is now 12G (it was 2G) as Alex suggested
[16:07:28] if you see anything else left to do please let me know
[16:19:37] lgtm!
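The follow-up above says the WAL retention on the master was raised from 2G to 12G, but not which setting was changed; the fragment below is a hedged guess at the relevant postgresql.conf knobs (PostgreSQL 13+ names), not a dump of the actual config:

```ini
# postgresql.conf sketch - which GUC was actually raised to 12G in T383114
# is not stated in the log, so treat these names/values as assumptions.
wal_keep_size = '12GB'            # retain this much WAL so a lagging replica can still catch up
# The alternative is a physical replication slot: the primary then keeps WAL
# until the standby has consumed it, bounded by:
# max_slot_wal_keep_size = '20GB' # cap disk usage if the standby stays down
```

Raising `wal_keep_size` is the setting that directly prevents the "requested WAL segment ... has already been removed" failure mode seen at the start of this incident, at the cost of extra disk on the primary.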
[20:51:55] FIRING: MaxConntrack: Max conntrack at 83.53% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[20:56:55] RESOLVED: MaxConntrack: Max conntrack at 83.53% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack