[07:36:30] I'm rolling out a quick patch for idp-test to fix some styling
[07:39:30] slyngs: easy +1 if you have time https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052851 <3
[07:39:48] I like easy
[07:40:46] Done
[07:40:49] thx
[07:42:28] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9964127 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `netbox-dev2002.codfw.wmnet` - netbox-dev2002.codfw.wmnet (**PASS**) - Downt...
[07:49:14] netops, Infrastructure-Foundations, SRE, Traffic: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809#9964152 (ayounsi) Open→Declined Closing this task as afaik we haven't seen any issue in esams, and the proper path forward is tracked in {T367973}...
[08:01:15] SRE-tools, Infrastructure-Foundations, Spicerack: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#9964208 (elukey) >>! In T354410#9961563, @Volans wrote: > @elukey do you know how much of a...
[08:34:43] hey folks, I know that https://wikitech.wikimedia.org/wiki/Ganeti#Renumber_(aka_change_network)_a_VM is discouraged, but I am wondering if we could try it for https://phabricator.wikimedia.org/T344230
[08:35:00] to avoid re-creating new VMs etc.. that may be a little more painful
[08:37:18] elukey: what's the goal? I don't understand the task
[08:38:36] nevermind, got it
[08:38:58] why was it created that way? And what's the issue with creating a new VM in the proper location ?
[08:56:36] not sure, maybe at the beginning it was more a test than something fully prod-ready
[08:56:52] creating a new VM is not an issue, only a lot more work
[08:57:08] there is the etcd ensemble to respect, the k8s mgmt control plane, etc..
[08:57:29] SRE-tools, Infrastructure-Foundations, Spicerack: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#9964436 (Volans) Ok, sounds good to me. Thanks for looking into this and yes there is no re...
[09:00:49] I have never swapped an etcd cluster, maybe there is the possibility of expanding the cluster 3->6 and then cutting off the oldest VMs
[09:01:03] elukey: the cumin broken aliases email boils down to O:etcd::v3::kubernetes returning no hosts, by any chance do you know the status of that one?
[09:01:47] checking
[09:04:07] volans: ah maybe I know, wikikube now co-locates etcd and control plane daemons on wikikube-ctrl*
[09:04:45] so I think that the kubeetcd nodes are not used anymore
[09:05:00] so probably the cumin alias is obsoled
[09:05:04] we can also just ask serviceops to fix ti
[09:05:06] *obsoleted
[09:05:28] yep
[09:08:00] Packaging, Infrastructure-Foundations: Package ipxe-qemu - https://phabricator.wikimedia.org/T369136#9964495 (aborrero) hey @ayounsi, I have reviewed the .deb packages that you built. They LGTM. I even installed them on my laptop :-P So from my point of view, you have a +1 to put them on reprepro. Pleas...
[09:34:50] elukey: {done} (see -serivceops :D )
[09:50:11] Today after a bit of scavenging I created this revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052934
[09:50:32] added some of you in Cc, I think we can safely revert but lemme know
[11:55:24] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9965147 (Marostegui) databases are ready
[13:49:05] XioNoX: it seems ok to me judging from the metrics, maybe a little bit more disk space could give us some room for extra logs of netbox 4 that we didn't anticipate, but I am not strongly for it
[13:49:39] elukey: sounds good, like 20G?
[13:50:10] yep I'd say it is good, even 25G, we are not asking a lot :)
[13:50:29] memory/cpu is always something that we can change, the disk is a problem
[13:50:38] rgr
[13:50:46] elukey: forthe DB too?
[13:51:07] https://grafana.wikimedia.org/d/000000377/host-overview?var-server=netboxdb1002&orgId=1&refresh=5m&var-datasource=thanos&var-cluster=misc
[13:51:55] yep I think so
[13:52:13] Maybe a bit more disk for the database?
[13:52:27] 25 too so they're similar then
[13:52:28] thx!
[13:52:28] Otherwise I think memory and CPU looks fine
[13:55:21] in term of git, I want to move all the changes that are on the dev branch into main, what's the cleanest way?
[13:55:48] squash them all into one commit ? regular rebase ?
[13:58:15] I'm going with a regular rebase, wish me luck
[14:12:34] rebase is nice to preserve history, good luck :)
[14:26:42] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9965587 (ops-monitoring-bot) Deployed netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.6 to netbox-next - ayounsi@cumin1002 - T336275
[14:33:47] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9965612 (hnowlan) kubernetes* and mw* are ready
[14:54:07] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9965690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6a298ae5-e736-4051-8220-9ec4f352950a) set...
[14:59:29] slyngs: do you remember how we fixed the "'Group' instance expected, got " error ? it's back on netbox next after the upgrade
[15:00:26] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9965719 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=39fcbcd0-8c16-4208-ac06-f4b442e55a54) set...
[15:01:48] ah, the pipeline module needed updating, so it's not picking up the new wheels from the deploy server
[15:03:19] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9965737 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2a5cb43e-793c-4103-9499-369354315479) set...
[15:14:12] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9965772 (ops-monitoring-bot) Deployed netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.6 to netbox-next - ayounsi@cumin1002 - T336275
[15:23:29] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9965800 (cmooney) Upgrade complete, all looks good network side at first glance, all online hosts are pingable again.
[15:30:16] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9965849 (cmooney) Switch upgrade completed without issue. All connected hosts are back online and responding to p...
[15:30:30] who manages https://docker-registry.wikimedia.org/python3-build-bookworm ? seems like there is a regression in the latest version
[15:30:40] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9965851 (MatthewVernon) Swift looks OK, thanks.
[15:34:02] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9965869 (Marostegui) Repooling databases
[15:36:36] XioNoX: it should be a package in production-images, we (SRE) manage it
[15:36:41] what is the regression?
[15:37:01] sorry, an image not a package
[15:38:28] `netbox-deploy$ make freeze` works only if I pin the version to `python3-build-bookworm:0.1.0-20240623` in `Dockerfile.build`
[15:39:55] --verbose :)
[15:40:09] similarly in the Makefile with "latest" vs. "0.1.0-20240623"
[15:40:20] it finishes cleanly but doesn't generate the required file
[15:41:07] does it work with the 20240630?
[15:41:24] (just to narrow down)
[15:41:29] haven't tested with that one yet
[15:41:37] I want to unbreak netbox-next first
[15:41:59] anyway, the images with -date are weekly rebuild that we do, basically to refresh the OS + packages installed
[15:42:44] I have to go in a few but if you want to open a task with how to repro I'll work on it tomorrow morning
[15:43:02] yeah, I'll test more and report back
[15:43:03] thx!
[15:43:07] np!
[15:44:28] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9965916 (ops-monitoring-bot) Deployed netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.6 to netbox-next - ayounsi@cumin1002 - T336275
[16:53:02] XioNoX: We update the ApereoCAS pipeline thingy to 0.0.3
[16:53:35] That uses the new Netbox groups, rather than the Django default one
[17:09:17] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:05:35] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:31:10] Netbox 4.0.7 released, right on time
[19:50:00] :)