[06:43:02] greetings
[07:07:52] will start the neutron maint shortly
[07:08:05] I've silenced pages to team=wmcs and will keep an eye
[07:09:32] ack
[07:09:39] ping if you need a hand
[07:11:09] cheers
[07:26:40] ok all done, I could still get into a vm with both neutron-l3-agent down, not sure what else was down tho
[07:27:44] great
[07:29:58] indeed! good times
[07:30:07] I've removed the silence
[07:40:22] 🎉
[08:09:27] morning
[09:39:47] there's a paging alert about harbor, looking
[09:41:09] everything looks ok :/
[09:41:40] ack, dcaro please LMK if it is db-related, I'm on harbordb1 testing T421857 though that's for toolsbeta not tools
[09:41:41] T421857: Move trove DB instances to rabbitmq transient quorum queues - https://phabricator.wikimedia.org/T421857
[09:41:44] it seems prometheus has no harbor metrics
[09:41:58] yep toolsbeta
[09:42:20] toolsbeta harbor is borked yep xd
[09:43:17] dammit, can it access the db? what I did is add a security group per https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Trove#Accessing_Trove_guest_VMs
[09:43:28] now I'm thinking that took over the trove security group
[09:44:14] connection timed out to the db
[09:44:44] to the host w6pgt6thhot.svc.trove.eqiad1.wikimedia.cloud
[09:44:56] dcaro: ok, is it back now? I added the 'postgres' security group
[09:45:21] still erroring, let me try to restart the container
[09:45:57] yep, now it works
[09:46:13] sigh ok
[09:46:24] I'll restart all the harbor containers just in case
[09:46:57] ok toolsbeta only should be affected
[09:47:07] so the issue is that it did not have any security group, and when adding the one for the access it started filtering all the traffic?
[09:48:56] or the order of the security groups?
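A sketch of the failure mode being puzzled over here, using the standard OpenStack CLI (the guest and group names are placeholders, not the actual resources): Trove guests rely on a trove-generated security group for their DB-port rules, so if an edit replaces the attached group list instead of appending to it, that group and its rules silently disappear and all DB traffic gets filtered.

```shell
# List the groups currently attached to the trove guest (placeholder name):
openstack server show <trove-guest> -c security_groups

# Appending a group leaves the existing list, including the
# trove-generated group, intact:
openstack server add security group <trove-guest> postgres

# If the trove-generated group has gone missing, re-attach it:
openstack server add security group <trove-guest> <trove-generated-group>
```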
(I don't remember having issues with that before though :/, maybe some behavior changed)
[09:49:22] ish, it did have the trove-generated security group and when I added the security group to allow ssh access then I think the trove one wasn't there anymore
[09:49:50] I feel like that bit where peter griffin struggles with the blinds
[09:50:00] https://media1.tenor.com/m/30_6MJ3Py9EAAAAC/family-guy-struggle.gif
[09:50:52] but anyways, I wasn't expecting toolsbeta to page
[09:59:14] dcaro: I'll need to reboot that instance for a test, I'm leaving it as is for now and will resume after lunch
[10:23:59] I don't think it paged? I see an alert with "severity=critical" but none with "severity=page"
[10:47:15] it might have not, it did have the #page comment somewhere
[10:48:00] or maybe I hallucinated it :/, not sure now
[10:48:07] [protip] use # page when quoting paging alerts to not trigger unwanted highlights :P
[10:48:32] oh sorry
[10:48:33] xd
[10:50:59] hmm... toolsbeta harbor does not let me log in now :/
[10:52:12] yep, can't pull either
[10:53:19] restarted the containers, and it's back up :S
[10:53:46] anyhow, I'll go grab some lunch before it breaks again xd
[11:54:18] the toolsbeta prometheus rewrites the severity label so that it doesn't send actual pages, but that doesn't change the key word in the alert description
[12:28:37] hah! the rewrite explains it
[12:44:04] ok for me to test a reboot of harbordb1 dcaro?
[12:44:38] can you wait 20min?
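A minimal sketch of the severity rewrite described above, assuming it is done with Prometheus's alert relabeling (the actual toolsbeta mechanism may differ): `alert_relabel_configs` rewrites `severity=page` on outgoing alerts so Alertmanager never routes them as pages, while the "#page" keyword baked into the alert description text is left untouched.

```yaml
# prometheus.yml fragment (hypothetical): downgrade paging alerts
# before they reach Alertmanager.
alerting:
  alert_relabel_configs:
    - source_labels: [severity]
      regex: page
      action: replace
      target_label: severity
      replacement: critical
```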
(in a meeting)
[12:44:42] yes totally
[13:19:28] okok, I'm around now
[13:19:30] if you want
[13:20:42] dcaro: sure, doing
[13:21:56] ok we're back, far faster than I thought
[13:23:20] huh, I don't think harbor noticed at all xd
[13:24:07] lol
[13:49:30] alerts on clouddb1023 can be ignored, manuel is working on it
[13:49:42] yeah I assumed it's the repro host
[13:50:10] I think the repro host is 1022, which is why I was confused by the alerts, so I asked in -data-persistence
[13:52:37] I'm done with the harbordb1 testing btw
[13:52:40] dcaro: ^
[14:22:36] godog: ack thanks!
[14:27:26] andrewbogott: hmm, we have yet another case of ldap and openstack disagreeing on membership of the bastion project
[14:27:39] in particular, https://ldap.toolforge.org/user/atsuko is a member in openstack but that's not visible in ldap
[14:28:10] Do you know, is that a new account or an old one?
[14:28:25] new
[14:32:46] and as always there are no errors in the logs...
[14:34:42] but they're in the analytics project in ldap. So some manner of ldap syncing worked properly...
[14:36:44] * andrewbogott retrying the ldap sync
[14:41:14] taavi: I fixed it for that user but still don't have a theory, updated T421911
[14:41:14] T421911: Keystone logs no longer appearing in logstash - https://phabricator.wikimedia.org/T421911
[14:48:16] uh, that should've been T379550
[14:48:16] T379550: openstack: keystone may be failing to add users to the bastion project in Keystone and/or LDAP - https://phabricator.wikimedia.org/T379550
[15:00:12] i wonder if it's worth extending the bastionless script to alert for this scenario too
[15:00:26] seems to be recurring enough
[15:03:12] might be, although it's at least something that likely fixes itself next time keystone gets nudged
[16:19:03] I have a couple of MRs for my long-running project of improving the wikireplicas scripts: https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/11
[16:19:19] if anybody is interested in reviewing those
[16:57:19] * dhinus off
[19:02:27] * dcaro off
[19:03:01] fyi I did bump the default buildpack on tools, found an issue on toolsbeta, looking into it, but will finish up tomorrow
[19:03:08] the flag on the cli is there and working though
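The bastionless-script extension being floated above could boil down to a set diff; a minimal sketch, assuming the script can already fetch member lists from Keystone and from LDAP (the function name and the sample data here are made up, not the real script):

```python
def membership_drift(openstack_members, ldap_members):
    """Return (openstack_only, ldap_only): users present on one side but
    not the other, e.g. to alert on drift like the 'atsuko' case above."""
    os_set, ldap_set = set(openstack_members), set(ldap_members)
    return sorted(os_set - ldap_set), sorted(ldap_set - os_set)

# Hypothetical inputs: what Keystone reports vs. what the LDAP group holds.
openstack_only, ldap_only = membership_drift(
    ["taavi", "atsuko"],  # e.g. from `openstack role assignment list`
    ["taavi"],            # e.g. members of the project's LDAP group
)
print(openstack_only)  # → ['atsuko']  (in Keystone but missing from LDAP)
```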