[09:56:20] I upgraded mariadb-server in cloudcontrol1007, with the hacky procedure described in T345811
[09:56:20] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[09:57:00] the output of 'SHOW STATUS LIKE "wsrep_local_state_comment";' is 'Synced' so things seem to be fine
[09:58:37] I am doing the same on cloudcontrol1006 and cloudcontrol1005, then reimaging all 3 to bookworm
[10:01:35] hmm I see cloudcontrol1007 shows as "unpollable" in https://grafana.wikimedia.org/d/8KPwK6GMk/cloudcontrol-mysql-aggregated
[10:02:21] "systemctl start prometheus-mysqld-exporter" might fix it
[10:02:21] that usually means prometheus-mysqld-exporter needs a restart
[10:02:31] :)
[10:18:59] mariadb-server is upgraded on all 3 nodes, the mariadb cluster looks fine
[10:19:26] I will start with the reimages, shall I send an email to cloud-admin maybe? who is gonna be affected by 1 cloudcontrol at a time being down?
[10:28:42] I checked the archives of cloud-admin@ and I didn't find any similar messages in the past, so I will proceed with the reimages without sending an email... I assume when we have clusters people are used to trying another host if one is down.
[10:29:14] taavi: I saw your comment on the patch I submitted but I'm not sure I agree
[10:29:48] in terms of creating definitions separately for the cloud-public VIP IPs, versus IPs in cloud-public ranges used for other things, such as NAT for instances
[10:29:57] ultimately I think that filter should be:
[10:30:00] 1- block from RFC1918
[10:30:00] 2- allow from cloud-public
[10:30:27] The cloud-public ranges are assigned and routed to WMCS, it's kind of for you guys to decide what they get used for and how they are carved up
[10:30:40] At the CR level it makes more sense to me that we allow traffic from the public ranges and leave it at that
[10:31:16] ok, that makes sense
[10:31:28] can you at least clarify the comment to make it clear it includes traffic from instances too?
[10:32:40] sure
[10:36:38] thanks!
[10:54:21] topranks: https://phabricator.wikimedia.org/T350130
[11:11:45] taavi: I commented back there, good spot!
[11:19:42] sent a patch, mind having a look?
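For anyone repeating the upgrade on the remaining cloudcontrols, a minimal sketch of the two checks mentioned above (the upgrade steps themselves are the ones in T345811; running the mysql client as root over the local socket is an assumption):

    # Galera health after the mariadb-server upgrade: the node should report
    # "Synced" and the cluster size should still be 3 (the three cloudcontrols)
    sudo mysql -e 'SHOW STATUS LIKE "wsrep_local_state_comment";'
    sudo mysql -e 'SHOW STATUS LIKE "wsrep_cluster_size";'

    # If the host shows as "unpollable" in the Grafana dashboard, restarting
    # the exporter is usually enough
    sudo systemctl restart prometheus-mysqld-exporter
    systemctl is-active prometheus-mysqld-exporter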
[11:20:03] taavi: looking at the cloudgw rules too I see it's NATing the traffic going to the cloud-private from cloud-instances
[11:20:55] that's not necessarily a bad thing I guess, but I can definitely see an argument that it'd be better if the various services on cloud-private saw the real VM IP (for instance with DNS queries, if something is looking up malware domains)
[11:21:01] we probably don't need to look right now at that
[11:21:59] yeah, that's something to look at later I'd say
[11:26:32] +1 on the patch
[11:26:48] I also did a quick scan, the patch will block off most of what's exposed
[11:26:49] https://phabricator.wikimedia.org/P53100
[11:27:10] I still think it's fairly important we tighten up that cloudgw policy, the "everything is allowed" approach is not good
[11:34:23] agreed, I'll make a task
[11:38:03] cloudcontrol1007 is reimaged to bookworm, and the cluster is looking fine
[11:38:13] filed T350132
[11:38:13] T350132: Restrict traffic from instances to private IPs on cloudgw level - https://phabricator.wikimedia.org/T350132
[11:38:18] I'll wait for a moment for things to settle, then proceed with reimaging the other two cloudcontrols
[12:27:17] I'm applying this change now btw
[12:27:18] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/970279
[12:58:28] dhinus: I think prometheus-openstack-exporter is broken on bookworm
[14:15:08] blancadesal: photos of the spooky displays?
[14:17:51] RhinosF1: not yet! everything is ready to go but I'm waiting for it to get (at least a little bit) dark before deploying
[14:19:20] it's very sunny today, we don't want passersby to be able to mentally prepare themselves, do we...
[14:19:43] blancadesal: ha!
[14:30:25] taavi: looking
[14:50:17] so prometheus-openstack-exporter is definitely broken, is there a reason we don't run it in codfw?
[14:50:42] I'm creating a task to debug the error and fix it
[14:51:39] dhinus: we historically didn't have the hardware to run prometheus in codfw, T350010
[14:51:40] T350010: Evaluate whether to deploy cloud Prometheus instance to codfw - https://phabricator.wikimedia.org/T350010
[14:51:57] iirc we had a custom package for bullseye or something similar
[14:52:03] maybe we could still run the exporter, to spot errors like this one
[14:56:20] I am still doing my drain/undrain dance in ceph and it's super slow. Is anyone seeing side-effects that users might notice? (I also got pinged in -operations about the network being saturated, which is probably related)
[14:58:02] I created T350154
[14:58:03] T350154: [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154
[14:58:25] # of objects is decreasing but time to complete is increasing :/
[14:58:51] andrewbogott: I have not seen side effects so far, but I haven't looked closely
[14:59:15] ok, if no one is screaming I'll just leave this to rebalance. I don't really know what to do as an alternative anyway
[14:59:30] andrewbogott: FWIW traffic from d5 to f5 dropped from peak (40Gb/sec) shortly after the alerts fired
[14:59:34] https://usercontent.irccloud-cdn.com/file/XoBEpmUT/image.png
[14:59:44] still higher than normal, but settled down quite a bit
[14:59:52] *f4
[15:01:59] there are probably a million "tweak the host/linux network stack and tcp params" type things that could be done to fully optimize the transfer - keeping that link saturated, and lowering the time till it finishes - but just waiting is probably the easiest thing
[15:03:36] topranks: that is reassuring, thank you.
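A rough way to keep an eye on that rebalance from the ceph side (generic ceph commands run on a mon/admin host, not a specific wmcs runbook):

    # Overall cluster state, including degraded/misplaced objects and
    # the current recovery rate
    sudo ceph -s
    # Per-pool recovery/backfill rates, to see whether the move is progressing
    sudo ceph osd pool stats
    # How full each OSD and host is, useful while draining and undraining
    sudo ceph osd df tree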
[15:03:49] I am going to be doing stuff like this for the next few days so that graph won't be fully happy for quite a while
[15:04:13] cool, if we find it alerting too much we can look at the thresholds or even disable it
[15:04:42] but ultimately there was no error here, and probably if we moved from 40 to 100G it'd saturate that briefly
[15:04:42] yep
[15:04:58] I suspect all the hosts fire up at once and start sending/receiving, competing for the available bandwidth
[15:05:38] and probably TCP kicks in then and regulates the flows down to reliable levels, too much as often happens (that's where the tweaking could help), but either way there is no problem with either hosts or network
[15:16:59] yeah, that all seems fine. I pooled one host while depooling another host and that seems to have made ceph extra busy, so I guess I know not to do that again
[16:58:54] hmm I think there is something wrong in cloudvps, I can't login to Horizon
[16:59:16] and I see a suspicious drop in traffic here https://grafana.wikimedia.org/d/8KPwK6GMk/cloudcontrol-mysql-aggregated?orgId=1&from=now-6h&to=now
[16:59:38] also, T350172
[16:59:39] T350172: Unable to retrieve instances - https://phabricator.wikimedia.org/T350172
[17:00:21] galera is up and "Synced" on all 3 cloudcontrols
[17:11:06] openstack cli calls fail with 500
[17:11:14] on that note, I struggled to log in as well but put it down to mistyping password/2FA token
[17:12:42] I too cannot login to horizon
[17:13:23] And will be sufficiently egotistical to declare that I am not mistyping
[17:15:34] I will look but am also about to head out the door
[17:15:51] dhinus: which host did you reimage? 1007?
[17:16:04] yes
[17:16:08] keystone_rotate_keys.service is down
[17:16:20] Here is where the trail starts:
[17:16:20] on 1007
[17:16:23] https://www.irccloud.com/pastebin/vCrZXSjP/
[17:18:01] so yeah, likely we need to refresh fernet keys
[17:18:15] I will follow https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Procedures_and_operations#Rotating_or_revoking_keystone_fernet_tokens
[17:18:22] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Procedures_and_operations#Rotating_or_revoking_keystone_fernet_tokens
[17:18:25] heh, yes :)
[17:18:28] that will likely do it
[17:19:13] And you can use that 'project list' command as a test, it's a lot easier than trying horizon login
[17:19:34] dhinus: my guess is that it synced the lack of keys on 1007 to the other hosts and wiped them all out
[17:19:59] thanks, I see there's an exception when generating keys on 1007, so I'll try generating on 1006 and syncing to the others
[17:23:25] now openstack project list works in 1007 but not on the other two
[17:25:27] I can login again to horizon
[17:25:40] but listing instances is still broken
[17:27:11] now "openstack project list" is working from all cloudcontrols
[17:27:20] great!
[17:27:23] and I can list instances again!
[17:27:31] now we just wait for the key sync job to run and we'll see if it breaks everything again
[17:27:41] I think we're back /cc TheresNoTime
[17:27:48] * andrewbogott keeps saying 'key' instead of 'token', I hope that's not confusing
[17:34:13] and now toolsdb is failing again :D
[17:34:16] I will restart it
[17:35:27] done
[17:40:53] we also have an alert in cloudservices100[56]: "The following units failed: labs-ip-alias-dump.service"
[18:04:47] dhinus: FYI getting `Bad Gateway (HTTP 502)` error pop-up intermittently on horizon now, but when it *doesn't* appear, the instances do :)
[18:06:55] hmm annoying, thanks for the update. I also got some 502s earlier when I was testing via the cli, but they stopped after a while
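Roughly what the key refresh followed above looks like, as a sketch: the paths and flags are the stock keystone-manage defaults, and the rsync step between cloudcontrols is an assumption rather than the exact procedure from the wiki page:

    # On a healthy cloudcontrol: recreate the fernet key set if the directory
    # was wiped, then rotate (keystone default path: /etc/keystone/fernet-keys)
    sudo keystone-manage fernet_setup --keystone-user keystone --keystone-group keystone
    sudo keystone-manage fernet_rotate --keystone-user keystone --keystone-group keystone

    # Hypothetical sync of the resulting keys to the other cloudcontrols
    # (hostname is a placeholder; the real setup has its own key sync job)
    sudo rsync -a /etc/keystone/fernet-keys/ root@<other-cloudcontrol>:/etc/keystone/fernet-keys/

    # Quick functional test, as suggested above, easier than a Horizon login
    # (assumes admin credentials are already loaded in the environment)
    openstack project list | head -n 5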
[18:07:07] slightly more verbose error message in T350172 now
[18:07:07] T350172: Unable to retrieve instances - https://phabricator.wikimedia.org/T350172
[18:13:17] this is on cloudcontrol1007: 2023-10-31 18:13:00.079 236796 ERROR keystone PermissionError: [Errno 13] Permission denied: '/etc/keystone/fernet-keys/0'
[18:15:11] permissions are definitely very different in 1007 and 1006
[18:15:23] yeah, I fixed that by hand. let's see what happens
[18:16:15] (horizon is now working as expected for me, fwiw)
[18:16:24] woot
[18:37:48] dhinus: you should basically always ignore the ip-alias-dump message, it tends to be transient
[18:38:57] it basically means 'the apis were broken a little while ago'
[18:46:11] I think it might have been caused by the fernet errors
[18:51:18] likely!
[19:25:02] * dhinus off
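A guess at what the by-hand fix for that PermissionError looks like; the keystone:keystone ownership and 0700/0600 modes are the usual Keystone defaults, not values copied from 1006:

    # Make the fernet key directory readable by the keystone service user again
    sudo chown -R keystone:keystone /etc/keystone/fernet-keys
    sudo chmod 700 /etc/keystone/fernet-keys
    sudo chmod 600 /etc/keystone/fernet-keys/*
    # Then re-run the 'openstack project list' test above to confirm the API is happy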