[06:15:25] topranks: yep, people are not happy about it https://www.reddit.com/r/networking/comments/vc4gei/juniper_mx204_end_of_life_announcement/
[06:15:39] and they say it's because of the chip shortage as well?
[07:19:26] Hmm yeah
[07:19:58] Juniper making hard choices on what lines to keep?
[07:20:30] At least end of support is some time away but still
[09:12:50] SRE-tools, Infrastructure-Foundations, SRE, Spicerack, and 2 others: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (joanna_borun)
[09:15:53] ms-be1059 has had a new motherboard put in, and now the HTML5 console says "License Required\nThis iLO is not licensed to use the Integrated Remote Console after server POST is complete." Do I presume correctly that's a DC-ops thing to fix?
[09:16:34] (if so I'll reopen T307667 with a note, but wanted to check that this wasn't something I could fix myself)
[09:16:35] T307667: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667
[09:16:57] I eventually managed to get the VSP to com2 rather than com1 so I can at least use it now...
[09:20:53] Emperor: yes i would say DC ops, i dont see anything related to licences on the ilo wikitech pages
[09:24:31] Emperor: if any of the embedded NICs changed their name, please run this netbox script once it's all back up and running
[09:24:34] https://netbox.wikimedia.org/extras/scripts/interface_automation.ImportPuppetDB/
[09:24:53] (before this afternoon or tomorrow, after the netbox migration ;) )
[09:26:06] volans: OOI, how would I tell?
[09:26:33] 1) don't care and just re-run it, it's idempotent
[09:26:37] :)
[09:26:42] 2) check if netbox names differ from host names
[09:26:56] 3) run it in dry-run mode and see what it tells
[09:27:40] luckily we don't track MACs anymore... so no need to update them anywhere ;)
[09:28:06] currently playing "try and get the installer to actually pick the right drives", but will get to that when/if successfully reimaged
[09:49:25] OK, the problem is '[ 12.234765] sd 0:0:0:0: [sda] Attached SCSI removable disk'
[09:49:42] which breaks everything
[09:50:22] the iLO doesn't think it has a license for virtual media, and AFAICS /dev/sda doesn't have any media in it
[09:50:45] I had that error before, when someone in DC ops had forgotten a USB pen drive in a slot
[09:51:01] which was used for diagnosing/updating or similar
[09:51:18] or does /proc/partitions point to a real disk?
[09:51:56] /proc/partitions doesn't even have sda in
[09:52:53] but this sda thing means the 14 drives are sdb->sdo rather than sda->sdn
[09:53:04] ok
[09:53:06] The web-ILO thinks there are 14 drives
[09:54:03] I'd expect an actual device to appear in /proc/partitions ? But I don't have any better ideas
[09:55:32] ah, though /dev/disk/by-path/pci-0000\:00\:14.0-usb-0\:4\:1.0-scsi-0\:0\:0\:0 -> sda might indeed point to some USB thing left connected
[09:55:56] (or I've screwed something up in the bios, but given the iLO says no licence for virtual media that seems less likely?)
[09:59:07] lsscsi has [0:0:0:0] disk Generic- SD/MMC CRW 1.00
[09:59:43] I'll ask DC ops nicely to check :)
[10:42:37] SRE-tools, Infrastructure-Foundations, SRE, Spicerack, and 2 others: Create a spicerack cookbook for restoring an etcd cluster from backups - https://phabricator.wikimedia.org/T203944 (joanna_borun)
[10:42:41] SRE-tools, Infrastructure-Foundations, SRE, Spicerack, User-Joe: Covert deploy_apache_change.sh to a spicerack cookbook - https://phabricator.wikimedia.org/T203948 (joanna_borun)
[10:43:49] SRE-tools, Infrastructure-Foundations, SRE, Spicerack, and 4 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (joanna_borun)
[10:44:03] SRE-tools, Infrastructure-Foundations, SRE, Spicerack, User-Joe: Create a spicerack cookbook to empty a ganeti node from VMs - https://phabricator.wikimedia.org/T203964 (joanna_borun)
[10:45:56] SRE-tools, Infrastructure-Foundations, SRE, Spicerack, and 4 others: Create a cookbook to copy data between WDQS servers - https://phabricator.wikimedia.org/T213401 (joanna_borun)
[10:46:01] SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Create Spicerack cookbook to drain/reboot/uncordon a Kubernetes worker - https://phabricator.wikimedia.org/T212866 (joanna_borun)
[10:46:40] SRE-tools, Discovery-Search, SRE, Spicerack, and 2 others: Migrate elasticsearch scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T202885 (joanna_borun)
[10:46:58] SRE-tools, Elasticsearch, SRE, Spicerack, and 2 others: Refactor current code base to support multiple elasticsearch instances/multiple elasticsearch clusters - https://phabricator.wikimedia.org/T207918 (joanna_borun)
[10:47:22] SRE-tools, SRE, Spicerack, Discovery-Search (Current work), Patch-For-Review: Write cookbooks to support spicerack's elasticsearch multi cluster/instance - https://phabricator.wikimedia.org/T207919 (joanna_borun)
[10:47:37] SRE-tools, Infrastructure-Foundations, Maps, SRE, and 3 others: Create cookbook for postgres initialization on maps cluster - https://phabricator.wikimedia.org/T220946 (joanna_borun)
[10:47:39] SRE-tools, Discovery-Search, SRE, Spicerack: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (joanna_borun)
[10:47:53] SRE-tools, Elasticsearch, SRE, Spicerack, and 2 others: Test spicerack elasticsearch module - https://phabricator.wikimedia.org/T207920 (joanna_borun)
[10:48:01] SRE-tools, SRE, Spicerack, Wikidata, and 2 others: Create Cookbook to restart WDQS - https://phabricator.wikimedia.org/T221832 (joanna_borun)
[10:48:15] SRE-tools, Infrastructure-Foundations, SRE, Spicerack, User-Joe: Create cookbook to do `nodetool repair` across cassandra cluster - https://phabricator.wikimedia.org/T225694 (joanna_borun)
[10:48:44] SRE-tools, Infrastructure-Foundations, SRE, Spicerack, and 3 others: Create WDQS reboot cookbook - https://phabricator.wikimedia.org/T224385 (joanna_borun)
[10:48:52] SRE-tools, Infrastructure-Foundations, Maps, SRE, and 3 others: Create cookbook to reboot Maps - https://phabricator.wikimedia.org/T224072 (joanna_borun)
[10:48:58] SRE-tools, Infrastructure-Foundations, SRE, Spicerack: Create a cookbook to restart the jvms on a Cassandra cluster - https://phabricator.wikimedia.org/T230022 (joanna_borun)
[10:49:04] SRE-tools, Infrastructure-Foundations, SRE, Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (joanna_borun)
[10:49:32] SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (joanna_borun)
[10:49:52] SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (joanna_borun)
[12:33:09] XioNoX, jbond: I'm in the call
[12:33:46] volans: there is a call?
[12:35:46] volans: let's do it over IRC and call if there are any issues?
[12:58:42] Mail, Infrastructure-Foundations, SRE, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (MoritzMuehlenhoff) I also removed logsteralarms@ earlier the day, it's no longer needed.
[13:00:20] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.2 - https://phabricator.wikimedia.org/T296452 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b899820b-e817-4d4e-af91-553e8854cf5d) set by volans@cumin1001 for 4:00:00 on 1 host(s) and their services with reas...
[13:00:43] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.2 - https://phabricator.wikimedia.org/T296452 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1e0ccaf6-24e0-41c6-8aa6-1269763ad443) set by volans@cumin1001 for 4:00:00 on 1 host(s) and their services with reas...
[13:05:57] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF
[13:28:54] ah, we don't have HTML5 console paid for on the HP systems :(
[13:35:12] Emperor: you should be able to use `VSP` once ssh'ed to get a serial console. perhaps check with dc-ops if we should have the licence for that or not
[13:36:18] jbond: yeah, it does have that (once wrestled into the right com port)
[13:36:34] darn it, this system still has the mystery '[ 11.846400] sd 0:0:0:0: [sda] Attached SCSI removable disk'
[13:37:08] DC team say nothing attached, but the Bios thinks there is and that's breaking everything :(
[14:27:03] Ah, turning off the internal SD drive seems to have helped, now it's just doing the usual trick of randomly trashing filesystems in the installer
[15:05:38] jbond, XioNoX: last thing, when you have time, is to get back netbox-next into a clean state of repos and data I guess
[15:05:55] that would also allow to do a final test of the ganeti group script migration
[15:06:37] ack sgtm
[15:08:37] no hurry on that ofc
[15:21:06] volans: reimage of ms-be1059 failed on getting results from Netbox
[15:21:19] Emperor: checking
[15:21:51] * volans suggests to use the local cumin when possible
[15:22:19] ( https://netbox.wikimedia.org/api/extras/job-results/3333237/ LGTM, mind, though playbook was trying https://netbox.discovery.wmnet/api/extras/job-results/3333237/ )
[15:23:11] yes that's the internal address
[15:25:31] there's something weird about the prometheus state on that box too, trying a reboot
[15:27:33] jbond: interesting failure... the cookbook polled the results for quite a while
[15:27:48] and it seems it took 50 seconds
[15:27:50] "created": "2022-06-15T15:20:06.324296Z", "completed": "2022-06-15T15:20:56.444293Z",
[15:28:07] while the cookbook tries only for ~30 seconds
[15:28:34] it shouldn't need to run for that long
[15:29:59] (we're also not far from job 3333333)
[15:30:01] could it be some effect of the CDN in some way?
[15:30:41] once the job is run, it's not using the CDN
[15:31:07] at least not the netbox cdn, maybe the puppet one if there is
[15:31:34] right it's using the discovery record anyway
[15:34:17] i dont think anything internal is using the CDN, they should be pointing to netbox.discovery.wmnet which is just a dns discovery record and bypasses the cdn
[15:34:26] the cdn is only hit via netbox.wikimedia.org
[15:34:50] (just to be specific) the *cache* is only hit via netbox.wikimedia.org
[15:35:10] yes yes
[15:37:19] so at the same time there were various job polling happening
[15:37:20] volans: XioNoX: so jobs on 3.2 seem to be taking longer to run, is that a fair summary?
[15:37:33] jbond: so far, yes
[15:37:58] and multiple POST to GetDeviceStats
[15:39:15] that's a quick one, but still I think we're calling it too frequently
[15:40:04] like every ~10 seconds
[15:40:34] volans: for its current usage we could call it once a week it would be enough
[15:40:35] :)
[15:40:45] yep, my point exactly :D
[15:43:29] ack
[15:48:30] we do set it to run with AccuracySec=15sec
[15:48:37] OnCalendar=minutely
[15:48:38] RandomizedDelaySec=0
[15:51:12] ignore me, wrong file
[15:58:03] Emperor: IIRC HTML5 is iLO advanced license, which *I think* we were buying when we were still buying HP(E)s (you should be able to see that from netbox -> procurement ticket -> quote)
[16:07:09] (I was just relying on what DC team tell me :) )
[19:23:13] SRE-tools, netbox, Infrastructure-Foundations, Patch-For-Review: Complete Netbox prometheus scraping - https://phabricator.wikimedia.org/T243928 (ayounsi) a:ayounsi
[19:33:07] netbox, Infrastructure-Foundations: Netbox: use Provider Networks - https://phabricator.wikimedia.org/T310591 (ayounsi) Assigning a provider network to a circuit causes: ` Traceback (most recent call last): File "/srv/deployment/homer/venv-1655301229/lib/python3.9/site-packages/homer/netbox.py", line 46...
[19:54:39] netbox, Infrastructure-Foundations: Move AS allocations to Netbox - https://phabricator.wikimedia.org/T310744 (ayounsi) p:Triage→Low
[20:23:18] netbox, Infrastructure-Foundations: Upgrade pynetbox - https://phabricator.wikimedia.org/T310745 (ayounsi) p:Triage→Medium
[20:49:40] netbox, Infrastructure-Foundations: Upgrade pynetbox - https://phabricator.wikimedia.org/T310745 (Volans) +1 for me