[06:13:29] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (Marostegui)
[06:13:39] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (Marostegui)
[06:15:05] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (Marostegui) Both masters, s7 and x1 have been switched over and no longer live in this rack.
[06:15:11] netbox, Infrastructure-Foundations: Netbox: use Custom Model Validation - https://phabricator.wikimedia.org/T310590 (ayounsi) Agreed, we need to keep a close look at any risk of performance hit (eg. the ones that iterate over all objects), but a lot of reports/tests could be replaced by those validators....
[06:39:30] netops, Infrastructure-Foundations, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi) p:Triage→Medium
[06:41:16] netops, Infrastructure-Foundations, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi)
[06:41:24] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi)
[06:42:22] netops, Infrastructure-Foundations, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi)
[06:42:30] netops, Infrastructure-Foundations, SRE: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (ayounsi)
[07:08:54] netops, Infrastructure-Foundations, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi)
[08:10:27] SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (JMeybohm)
[08:50:54] netbox, Infrastructure-Foundations, SRE, Patch-For-Review: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (ayounsi) Send the above patch to grant access (`is_active`). The permissions page though seems to involve quite a lot of manual work see fo...
[09:18:28] netbox, Infrastructure-Foundations, SRE, Patch-For-Review: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (ayounsi) I had a quick look at demo.netbox.dev and created a test user there (you can try with user foobar/foobar, the DB is reset every day...
[13:14:43] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (Cmjohnson)
[13:14:51] netops, Infrastructure-Foundations, SRE: Move asw2-d5-eqiad to spares - https://phabricator.wikimedia.org/T313115 (Cmjohnson)
[13:53:10] moritzm: is there any WIP on ganeti01.svc.eqiad.wmnet:5080 APIs?
[13:53:36] it timed out for the netbox_ganeti_eqiad_sync.service from netbox1002
[13:58:14] first time I hear of this, having a look now
[13:59:35] first failure today at 13:35
[13:59:43] and second one at 13:50
[13:59:47] both UTC
[14:22:13] the certs are all valid until 2027 and the RAPI port is also running fine, but there's a traceback in rapi-daemon.log starting at 13:35.
[14:22:32] there are no changes on the server itself, did anything change in the queries made by the netbox report?
[14:23:22] not that I know of
[14:23:56] mysterious, I'm going to open a task
[14:24:17] could be that the move to drbd of some VMs somehow affected the API response time? maybe it's just a timeout too short
[14:24:22] I'll test the call manually
[14:25:49] all the etcds which temporarily switched to DRBD are rolled back by now
[14:27:44] ok
[14:29:18] it seems the call that times out is /2/instances?bulk=1, testing with a longer timeout
[14:30:31] moritzm: confirmed
[14:30:34] just a timeout issue
[14:30:46] the default 5s we're currently using is not enough
[14:31:08] I'll send a patch to increase it, took short enough that I don't care if it's 5 or 10s for the API call
[14:31:19] doesn't seem worrying for now at least to me
[14:34:16] sounds good!
[14:36:53] moritzm: https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/815997
[14:53:56] all fixed
[14:59:23] nice!
[17:44:57] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (BCornwall)
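
Note on the 14:29-14:31 exchange: the fix discussed there amounts to giving the `/2/instances?bulk=1` call more than the 5s default. The sketch below is only an illustration of that idea and is not the code from the linked Gerrit change. The function names, the 10s value, and the injectable `opener` parameter are all assumptions made for the example; authentication and TLS settings are left out.

```python
import json
import socket
import urllib.request
from urllib.parse import urlencode

# Assumed value: the log only establishes that 5s was too short.
RAPI_TIMEOUT_SECONDS = 10.0


def instances_url(host: str, port: int = 5080) -> str:
    """Build the bulk instance-listing URL mentioned in the log."""
    return f"https://{host}:{port}/2/instances?{urlencode({'bulk': 1})}"


def fetch_instances(url, timeout=RAPI_TIMEOUT_SECONDS,
                    opener=urllib.request.urlopen):
    """Fetch and decode the JSON body, giving up after `timeout` seconds.

    `opener` is a hypothetical injection point so the call can be tested
    without a real RAPI endpoint.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except socket.timeout as exc:
        # Re-raise with the URL and limit so the failure is easy to read in logs.
        raise TimeoutError(f"{url} did not answer within {timeout}s") from exc
```

Keeping the timeout as a single named constant (or a config value) makes a later change like the one above a one-line patch.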