[06:47:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:02:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[11:53:24] volans: mmandere is trying to allocate IPv4 and IPv6 addresses for bast6001 in netbox, we should be able to allocate them without assigning them to bast6001 (cause it isn't on netbox yet cause it doesn't exist at all), right?
[11:56:15] vgutierrez: physical or vm?
[11:57:28] volans: vm
[11:57:38] then no need, makevm will take care of that
[11:57:39] see https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM
[11:58:24] you just need to wait for me to release a new spicerack that has the drmrs settings in ganeti
[11:58:35] <3 thanks
[11:58:38] I was already in the process of releasing it
[11:59:45] volans: got it, thanks :)
[13:34:22] mmandere, vgutierrez: spicerack v1.1.0 is deployed on cumin[1001,2002] and has the definitions for ganeti on drmrs
[13:34:34] so the makevm cookbook should work as usual
[13:34:44] let me know if you encounter any issue
[13:41:25] ack, thanks volans
[13:57:20] volans: BTW, we have 2 clusters in drmrs, I guess that we can target one or the other by targeting the specific row/vlan?
[13:57:49] let me check, I didn't follow that part too much
[13:58:05] ganeti6001/6003 are on row B12 and 6002/6004 on B13 IIRC
[13:58:15] (as two separate per-rack clusters)
[13:58:27] we have just one cluster setup in netbox atm
[13:58:34] ah no sorry
[13:58:35] my bad
[13:59:02] and yes both are supported
[13:59:07] ganeti01.svc.drmrs.wmnet => b12
[13:59:15] ganeti02.svc.drmrs.wmnet => b13
[13:59:27] how do we pick one on the cookbook CLI?
[13:59:37] yeah we're just trying to square that with the "drmrs" option on the CLI
[14:00:03] vgutierrez: the positional argument
[14:00:13] yeah
[14:00:16] let me see how it works
[14:00:21] this is the first multi-cluster one
[14:00:26] might not work as expected ;)
[14:00:33] got it
[14:00:36] fixing
[14:01:09] thanks!
[14:01:25] (and apologies for always pushing the edge cases!)
[14:01:31] ahahah no worries
[14:01:43] we are the edge^Wtraffic team after all
[14:02:15] indeed!
[14:08:10] fix is https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/747852
[14:08:19] if you have a sec to review
[14:08:58] this is the quickest fix I could think of without changing the existing use cases too much
[14:09:15] assumes we'll not have multiple ganeti clusters per rack
[14:09:37] makes sense!
[14:09:48] we can ofc change that too, but would become a bit more verbose, like drmrs_01_B12 (for ganeti01)
[14:09:50] I guess for the other edges, it will name it for the first rack that has a ganeti?
[14:09:59] yes would be
[14:09:59] like esams_oe13 or something?
[14:10:00] 'esams_OE': ('ganeti01.svc.esams.wmnet', 'OE', 'esams'),
[14:10:19] oh, "row"
[14:10:49] it's what is in netbox, which is what is in Ganeti's 'group'
[14:10:57] so that's a cluster per row
[14:11:13] if we move it to rack we'll have them as esams_OE14, etc...
[14:11:40] so where does the _B12 come from for the drmrs case?
[14:12:04] oh, not that block, got it
[14:12:09] currently
[14:12:09] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/spicerack/ganeti.py#19
[14:12:26] but I hope we can move it to be gathered dynamically from netbox or ganeti APIs
[14:12:52] the ganeti module pre-dates the netbox one IIRC :)
[14:13:04] got it
[14:17:19] ok change merged and deployed, vgutierrez, mmandere you should be good to go choosing between: drmrs_B12,drmrs_B13
[14:17:40] <3 awesome
[14:18:05] I've updated the --help example in https://wikitech.wikimedia.org/wiki/Ganeti#Create_the_VM
[14:18:52] let's see what's the next chicken-edge case :-P
[14:19:42] {codfw_A,codfw_B,codfw_C,codfw_D,codfw_test,drmrs_B12,drmrs_B13,eqiad_A,eqiad_B,eqiad_C,eqiad_D,eqsin_1,esams_OE,ulsfo_1}
[14:19:45] looks nice to me!
[14:23:19] volans: ack
[16:58:09] volans: trying to run the makevm to create a bastion host in DC drmrs and it seems to fail with the following error `Failed to find VLAN with name public1-drmrs`
[16:58:56] `sudo cookbook sre.ganeti.makevm --network public --disk 40 drmrs_B12 bast6001` executed that for the cookbook
[16:59:14] mmandere: checking
[16:59:25] volans: ack
[16:59:51] so, we don't have a public1-drmrs vlan
[16:59:55] but one per rack
[17:00:03] I bet the cookbook needs a fix too for this
[17:00:06] I will have a look shortly
[17:00:32] No problem 👍
[17:10:12] ah yeah
[17:10:23] we have vlans per-rack in drmrs
[17:10:39] basically a drmrs rack is, in most virtual senses, a lot like a core DC row
[17:10:47] yeah I know fix is easy
[17:10:56] just checking other consequences
[17:14:58] moritzm: so, currently in the other ganeti clusters we have as group row_$ROW
[17:15:06] while in drmrs we have just the rack name
[17:15:16] 'row_OE' in esams for example
[17:15:39] would it be much of a trouble to rename the ganeti group in drmrs for now from 'B12' to 'row_B12' ?
[17:16:25] that would simplify things and make the fix for makevm very easy, then we could work on a proper fix next week that takes better care of the different setups
[17:17:00] there's a "gnt-group rename" command apparently, so I bet it's not bad
[17:17:34] not sure how that will impact any existing group config that's imported to netbox
[17:17:48] (but we could manually edit it there too I guess, if needed)
[17:17:50] we don't import groups yet
[17:17:53] oh ok
[17:18:05] want me to try the renames?
[17:18:05] it's WIP on my laptop
[17:18:15] the group import
[17:19:24] being bold!
[17:19:41] ack
[17:19:43] +1
[17:20:21] they're renamed to row_B12 and row_B13 now
[17:20:25] (on the actual clusters)
[17:20:37] I don't think that metadata exists anywhere else, so should be good to go
[17:20:39] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/747882 should be the fix
[17:20:45] yep
[17:21:10] row will be B12 or B13 in this case
[17:21:13] and the ganeti module does
[17:21:14] f" -g row_{row}"
[17:21:26] in the options passed to gnt-instance add
[17:21:47] got it
[17:26:23] Traffic, Foundational Technology Requests, SRE, Wikimedia Enterprise, Wikimedia Enterprise Discussion: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (DAbad) Open→In progress p:Triage→High
[17:26:52] deployed! mmandere you can retry now
[17:27:41] volans: ack
[17:55:09] damn it failed for another thing, checking
[17:55:40] the ganeti->netbox sync, it seems
[17:55:59] ahhh right, we have netbox_ganeti_drmrs01_sync.timer and 02
[17:56:03] instead of a flat one per dc
[17:57:53] * bblack makes plans for next week to switch to 1 vlan per machine and 3 ganeti clusters per pair of racks
[17:58:34] thanks :D
[18:00:06] off for lunch, will catch up on scrollbacks later!
[18:08:47] ack, I'm sending a temporary fix https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/747890
[18:09:10] then next week I'll look into streamlining the naming across ganeti/spicerack/cookbook/systemd units
[18:09:14] so that they are all consistent
[18:09:45] feel free to merge it and run puppet on the cumin host to get it updated
[18:09:54] I might reconnect later on, but have to log off now
[18:27:35] volans: yeah, we can easily rename those with "sudo gnt-group rename OLD NEW"
[18:27:56] actually, that's what was already done before, since they come up as "default" in the initial install
[18:28:27] and there's no backwards connection to Puppet, so they can be renamed at will (since we don't track them in Netbox yet either)
[18:46:32] ack, thx
[18:51:36] I think that marc left the cookbook at the prompt for the dns cookbook to finish the rollback of the changes for bast6001 due to the failure
[18:51:57] FYI mmandere I've attached myself to the tmux and typed 'go' and enter, to let it complete
[18:52:16] to recover the icinga alert "PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL"
[18:52:20] no big deal, just fyi
[21:20:02] Traffic, SRE: Enterprise redirects from .Org sites - https://phabricator.wikimedia.org/T296445 (BBlack) Open→Resolved These changes should be live now, please let me know if anything's amiss!
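
Note on the cluster naming discussed above: a minimal sketch of how a CLI location such as drmrs_B12 could map to a cluster endpoint and a Ganeti group. It is not the actual spicerack or cookbook code. Only the esams_OE tuple and the f" -g row_{row}" fragment are quoted in the log; the two drmrs entries are inferred from the 13:59 messages, and the names CLUSTERS, resolve_location and group_option are hypothetical.

```python
# Hypothetical table in the shape quoted at 14:10:00:
# CLI location -> (cluster FQDN, row/group, datacenter).
# Only the esams_OE entry is verbatim from the log; the drmrs ones are inferred.
CLUSTERS = {
    "esams_OE": ("ganeti01.svc.esams.wmnet", "OE", "esams"),
    "drmrs_B12": ("ganeti01.svc.drmrs.wmnet", "B12", "drmrs"),
    "drmrs_B13": ("ganeti02.svc.drmrs.wmnet", "B13", "drmrs"),
}


def resolve_location(location):
    """Return (cluster FQDN, row, datacenter) for a CLI location name."""
    try:
        return CLUSTERS[location]
    except KeyError:
        choices = ",".join(sorted(CLUSTERS))
        raise ValueError(f"unknown location {location!r}, choose from: {choices}")


def group_option(row):
    """Build the group option fragment quoted at 17:21:14."""
    return f" -g row_{row}"


if __name__ == "__main__":
    fqdn, row, datacenter = resolve_location("drmrs_B12")
    print(fqdn, datacenter, group_option(row))
```

With the groups renamed to row_B12 and row_B13, the same row_ prefix works for drmrs as for the per-row clusters, which is why the rename made the makevm fix simple.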