[07:00:47] if someone could please puppet-merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/906307/ , that will remove an erroneous alarm for the contint2002 zuul-merger service, which is intentionally disabled but still has monitoring enabled :)
[07:01:10] it started alarming last night and I could not figure out why since the service has always been masked there
[07:01:17] the alarm shows up on https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=contint2002
[07:18:16] hashar: I'll have a look in approx. 15m
[07:26:31] hashar: merged, the logic is a bit convoluted (service_enable, service_ensure, etc..) but it seems to work so +1, but maybe a little refactor of the code would be good (it is not super intuitive how it works right now)
[07:34:08] It just got refactored! 😄
[07:35:00] or well, I have at least aligned the zuul
[07:35:16] service classes to look alike
[07:50:09] :)
[07:50:24] for the on-callers - I am going to kick off a roll restart of kafka main codfw
[07:50:31] to pick up the pki tls certs
[07:50:37] (via cookbook this time)
[07:50:41] ack, thanks for the heads-up
[08:16:42] there are 3 hosts down in icinga, un-acked, are you aware of them (steve_munene || btullis) && (elukey || klausman)?
[08:19:05] I need to check ml-serve2004, but it is only calico afaics
[08:19:11] what are the others?
[08:19:20] (Ben and Tobias are on PTO)
[08:20:57] an-worker1132 and analytics1069
[08:22:49] ah lovely, so an-worker1132 needs to be reimaged https://phabricator.wikimedia.org/T333091
[08:22:54] the other one I don't know
[08:22:58] let's wait for steve_munene
[08:49:09] thanks for fixing ml-serve2004!
[08:49:57] volans: I noticed some multi-bit DIMM issues in racadm getsel, I am inclined to let it run again and see if the issue was a one-time thing or not
[08:50:08] IIRC we had similar issues with another node, need to find which one
[08:51:27] ack, I can give you the link to the Dell reference guide if you want to check those specific errors
[09:39:08] kafka main codfw is on PKI as well, all clusters in prod are on PKI
[09:39:10] \o/
[09:41:44] elukey: great job <3
[09:46:21] \o/
[09:50:39] nicely done!
[10:41:29] Alright elukey
[13:22:46] !incidents
[13:22:47] 3530 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (asw2-a-eqiad.mgmt.eqiad.wmnet)
[13:22:50] !ack 3530
[13:22:51] 3530 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (asw2-a-eqiad.mgmt.eqiad.wmnet)
[13:28:13] !incidents
[13:28:14] 3530 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (asw2-a-eqiad.mgmt.eqiad.wmnet)
[13:47:16] apergos: clouddumps1001 is overwhelmed right now, nginx logs say things like "[alert] 1555383#1555383: 768 worker_connections are not enough"
[13:47:48] where are the connections coming from, andrewbogott?
[13:48:08] apergos: I haven't gotten that far yet, hang on...
[13:48:16] let's take this to -private, shall we
[13:48:32] -sre-private? or cloud-private?
[13:48:37] sre
[13:48:51] ok, only if you invite me :)
[13:49:44] oh. ok them mediawiki_security
[13:49:47] *then
[13:49:49] you're there surely
[14:07:20] moritzm: o/ I am building docker images, I noticed the openjdk-11 one as well
[14:08:03] yeah, I bumped that one for the new JRE
[14:08:42] but I thought these get auto-built by some systemd timer, would I have needed to do anything other than bumping the version?
[14:12:48] afaik we need to jump on build2001:/srv/images/production-images, do git fetch/pull and run a script
[14:13:28] "build-production-images"
[14:13:58] ah, ok. I missed that
[14:14:33] no no, no problem, it was just a "they are ready if you are waiting for them" kind of ping :)
[14:15:03] ack, thx
[15:00:08] Any quick ideas why wikitech might be having cache invalidation issues? T333925 and T334102 seem to show the same underlying issue of mismatched cache vs db state.
[15:00:09] T333925: Error during MFA setup for wikitech.wikimedia.org: MWException: CAS update failed on user_touched. The version of the user to be saved is older than the current version. - https://phabricator.wikimedia.org/T333925
[15:00:09] T334102: Wikitech: Preferences not updating after email change - https://phabricator.wikimedia.org/T334102
[15:16:22] I made T334232 to track the trending wikitech issue.
[15:16:23] T334232: Wikitech experiencing a spike of stale cache errors since 2023-03-15 - https://phabricator.wikimedia.org/T334232
[18:31:39] please hold off making any netbox changes for now
[18:49:15] has anyone done a restore of Netbox data?
[18:49:19] https://wikitech.wikimedia.org/wiki/Netbox#Restore
[18:49:44] need to restore a deleted-by-mistake-host-instead-of-an-old-interface
[18:49:52] I have manually recreated it but I think restoring is better
[18:50:01] not breaking prod or anything, but it would be good to have it in a clean state
[18:54:56] sukhe: have never done it, but if you look at the logs and can find the deletion.. then do the "manual re-play in reverse order" just for the cabling part..
[18:55:20] the bacula and db restore seems a bit much for this case
[18:55:24] it's just one host, right?
[18:55:33] yep just one
[18:55:42] and as such nothing breaks, I would just like to fix it :P
[18:55:58] yea, let's try to find in the changelog when it got removed
[18:56:10] and then the cable IDs should follow that I guess
[18:56:15] https://netbox.wikimedia.org/extras/changelog/113125/
[18:56:16] all here
[18:56:56] "Manually (or via the API) re-play the actions"
[18:57:11] I wonder if "via API" just means telling it the Request IDs from your screenshot
[18:57:16] and to revert them
[18:57:26] no details on the API though yet?
[18:57:33] can't find the revert one :)
[18:57:53] looks at https://netbox.wikimedia.org/api/docs/
[18:59:39] well, hmm.. but you have the host restored.. just not the cables?
[18:59:49] just the cable for the mgmt interface left
[18:59:54] the others are there!
[18:59:56] and at least there are the cable object IDs
[19:00:00] yep
[19:00:06] going to file a task for it
[19:00:15] I was about to suggest that
[19:00:21] sounds good
[19:03:30] monitoring in -dcops channel did detect it
[19:04:13] yep! missing the rack data
[19:04:22] that seems to be the only alert so far
[19:15:42] the above alert was fixed
[19:16:42] great! dealing with an invasion of flying ants in my living space
[19:17:13] mutante: duck!
[19:19:08] literally asking ChatGPT
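[editor's note] The "manually (or via the API) re-play the actions" idea discussed above can be sketched against the standard NetBox REST API. The snippet below is a rough, hypothetical illustration only: the token, the placeholder request ID, and the cable payload fields are assumptions (NetBox changes its cable/termination serialization between versions), and it is not the documented restore procedure from the wikitech page linked in the log.

```python
# Hypothetical sketch: re-play "delete" changelog entries for one request ID
# by recreating the deleted cables from their prechange snapshots.
# Endpoint paths, filter parameters and payload fields are assumptions about
# the NetBox REST API version in use; verify against /api/docs/ before use.
import requests

NETBOX = "https://netbox.wikimedia.org/api"
HEADERS = {"Authorization": "Token REDACTED", "Accept": "application/json"}

def deleted_changes(request_id: str) -> list[dict]:
    """Fetch the changelog entries recorded under one request ID, deletions only."""
    resp = requests.get(
        f"{NETBOX}/extras/object-changes/",
        headers=HEADERS,
        params={"request_id": request_id, "action": "delete"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]

def replay_cable(change: dict) -> dict:
    """Recreate a deleted cable from the snapshot stored in its changelog entry."""
    snapshot = change["prechange_data"]  # object state before the deletion
    payload = {
        # Field names assume a recent NetBox 3.x cable object; older versions
        # use termination_a_* / termination_b_* instead.
        "a_terminations": snapshot["a_terminations"],
        "b_terminations": snapshot["b_terminations"],
        "status": snapshot.get("status", "connected"),
    }
    resp = requests.post(
        f"{NETBOX}/dcim/cables/", headers=HEADERS, json=payload, timeout=30
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # Re-play deletions in reverse order so dependent objects come back first.
    for change in reversed(deleted_changes("00000000-0000-0000-0000-000000000000")):
        if change["changed_object_type"] == "dcim.cable":
            print("restored cable", replay_cable(change)["id"])
```

In this case only the mgmt-interface cable was still missing, so a single POST (or the manual recreation already done) covers it; the loop form only matters when a whole host's worth of objects has to be replayed.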