[06:43:02] <godog>	 greetings
[07:07:52] <godog>	 will start the neutron maint shortly
[07:08:05] <godog>	 I've silenced pages to team=wmcs and will keep an eye
[07:09:32] <volans>	 ack
[07:09:39] <volans>	 ping if you need a hand
[07:11:09] <godog>	 cheers
[07:26:40] <godog>	 ok all done, I could still get into a vm with both neutron-l3-agent down, not sure what else was down tho
[07:27:44] <volans>	 great
[07:29:58] <godog>	 indeed! good times
[07:30:07] <godog>	 I've removed the silence
[07:40:22] <dcaro>	 🎉 
[08:09:27] <dhinus>	 morning
[09:39:47] <dcaro>	 there's a paging alert about harbor, looking
[09:41:09] <dcaro>	 everything looks ok :/v
[09:41:40] <godog>	 ack, dcaro please LMK if it is db-related, I'm on harbordb1 testing T421857 though that's for toolsbeta not tools
[09:41:41] <stashbot>	 T421857: Move trove DB instances to rabbitmq transient quorum queues - https://phabricator.wikimedia.org/T421857
[09:41:44] <dcaro>	 it seems prometheus has no harbor metrics
[09:41:58] <dcaro>	 yep toolsbeta
[09:42:20] <dcaro>	 toolsbeta harbor is borked yep xd
[09:43:17] <godog>	 dammit, can it access the db ? what I did is add a security group per https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Trove#Accessing_Trove_guest_VMs
[09:43:28] <godog>	 now I'm thinking that took over the trove security group
[09:44:14] <dcaro>	 connection timed out to the db
[09:44:44] <dcaro>	 to the host w6pgt6thhot.svc.trove.eqiad1.wikimedia.cloud 
[09:44:56] <godog>	 dcaro: ok, is it back now ? I added the 'postgres' security group
[09:45:21] <dcaro>	 still erroring, let me try to restart the container
[09:45:57] <dcaro>	 yep, now it works
[09:46:13] <godog>	 sigh ok
[09:46:24] <dcaro>	 I'll restart all the harbor containers just in case
[09:46:57] <godog>	 ok toolsbeta only should be affected
[09:47:07] <dcaro>	 so the issue is that it did not have any security group, and when adding the one for the access it started filtering all the traffic?
[09:48:56] <dcaro>	 or the order of the security groups? (I don't remember having issues with that before though :/, maybe some behavior changed)
[09:49:22] <godog>	 ish, it did have the trove-generated security group and when I added the security group to allow ssh access then I think the trove one wasn't there anymore
[09:49:50] <godog>	 I feel like that bit where peter griffin struggles with the blinds
[09:50:00] <godog>	 https://media1.tenor.com/m/30_6MJ3Py9EAAAAC/family-guy-struggle.gif
[09:50:52] <godog>	 but anyways, I wasn't expecting toolsbeta to page
[09:59:14] <godog>	 dcaro: I'll need to reboot that instance for a test, I'm leaving it as is for now and will resume after lunch
[10:23:59] <dhinus>	 I don't think it paged? I see an alert with "severity=critical" but none with "severity=page"
[10:47:15] <dcaro>	 it might have not, it did have the #page comment somewhere
[10:48:00] <dcaro>	 or maybe I hallucinated it :/, not sure now
[10:48:07] <volans>	 [protip] use # page when quoting paging alerts to not trigger unwanted highlights :=P
[10:48:32] <dcaro>	 oh sorry
[10:48:33] <dcaro>	 xd
[10:50:59] <dcaro>	 hmm... toolsbeta harbor does not let me log in now :/
[10:52:12] <dcaro>	 yep, can't pull either
[10:53:19] <dcaro>	 restarted the containers, and it's back up :S
[10:53:46] <dcaro>	 anyhow, I'll go grab some lunch before it breaks again xd
[11:54:18] <taavi>	 the toolsbeta prometheus rewrites the severity label so that it doesn't send actual pages, but that doesn't change the key word in the alert description
[12:28:37] <godog>	 hah! the rewrite explains
[12:44:04] <godog>	 ok for me to test a reboot of harbordb1 dcaro ?
[12:44:38] <dcaro>	 can you wait 20min? (in a meeting)
[12:44:42] <godog>	 yes totally
[13:19:28] <dcaro>	 okok, I'm around now
[13:19:30] <dcaro>	 if you wanat
[13:20:42] <godog>	 dcaro: sure, doing
[13:21:56] <godog>	 ok we're back, far faster than I thought 
[13:23:20] <dcaro>	 huh, I don't think harbor noticed at all xd
[13:24:07] <godog>	 lol
[13:49:30] <dhinus>	 alerts on clouddb1023 can be ignored, manuel is working on it
[13:49:42] <volans>	 yeah I assumed is the repro host
[13:50:10] <dhinus>	 I think the repro host is 1022, which is why I was confused by the alerts, so I asked in -data-persistence
[13:52:37] <godog>	 I'm done with the harbordb1 testing btw
[13:52:40] <godog>	 dcaro: ^
[14:22:36] <dcaro>	 godog: ack thanks!
[14:27:26] <taavi>	 andrewbogott: hmm, we have a yet another case of ldap and openstack disagreeing on membership of the bastion project
[14:27:39] <taavi>	 in particular, https://ldap.toolforge.org/user/atsuko is a member in openstack but that's not visible in ldap
[14:28:10] <andrewbogott>	 Do you know, is that a new account or an old one?
[14:28:25] <taavi>	 new
[14:32:46] <andrewbogott>	 and as always there are no errors in the logs...
[14:34:42] <andrewbogott>	 but they're in the analytics project in ldap. So some manner of ldap syncing worked properly...
[14:36:44] * andrewbogott retrying the ldap sync
[14:41:14] <andrewbogott>	 taavi: I fixed for that user but still don't have a theory, updated T421911
[14:41:14] <stashbot>	 T421911: Keystone logs no longer appearing in logstash - https://phabricator.wikimedia.org/T421911
[14:48:16] <andrewbogott>	 uh, that should've been T379550
[14:48:16] <stashbot>	 T379550: openstack: keystone may be failing to add users to the bastion project in Keystone and/or LDAP - https://phabricator.wikimedia.org/T379550
[15:00:12] <taavi>	 i wonder if it's worth extending the bastionless script to alert for this scenario too
[15:00:26] <taavi>	 seems to be recurring enough
[15:03:12] <andrewbogott>	 might be although it's at least something that likely fixes itself next time keystone gets nudged
[16:19:03] <dhinus>	 I have a couple of MRs for my long-running project of improving the wikireplicas scripts: https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/11
[16:19:19] <dhinus>	 if anybody is interested in reviewing those
[16:57:19] * dhinus off
[19:02:27] * dcaro off
[19:03:01] <dcaro>	 fyi. I did bump yet the default buildpack on tools, found an issue on toolsbeta looking, but will finish up tomorrow
[19:03:08] <dcaro>	 the flag on the cli is there and working though