[06:49:15] CAS-SSO, Infrastructure-Foundations, SRE: CAS should link to account creation tutorial - https://phabricator.wikimedia.org/T297524 (Majavah) Open→Resolved a:jbond thanks!
[08:13:08] Mail, Infrastructure-Foundations, SRE, Znuny, serviceops: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (akosiaris) Had a quick look at that. It is true that we never have r...
[10:00:26] puppet-compiler, Infrastructure-Foundations, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 3 others: [pcc] Release the latest version - https://phabricator.wikimedia.org/T297356 (dcaro) In progress→Resolved
[10:04:08] puppet-compiler, Infrastructure-Foundations, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 3 others: [pcc] Allow passing the fail-fast option on the jenkins job + cli - https://phabricator.wikimedia.org/T296984 (dcaro) Open→Resolved
[10:19:47] moritzm, volans, I got this error while running the decom cookbook on a VM https://www.irccloud.com/pastebin/N4uF0dFK/
[10:20:06] XioNoX: did you run it from cumin1001 or 2002?
[10:20:30] says connect timeout, did anything change in Ganeti API / firewall rules recently moritzm ?
[10:20:48] 1001
[10:21:36] $ telnet ganeti01.svc.codfw.wmnet 5080
[10:21:36] Trying 10.192.16.131...
[10:21:41] from 2002
[10:21:42] too
[10:22:07] so yeah it seems that the API are not reachable
[10:22:44] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade netflow VMs to Bullseye - https://phabricator.wikimedia.org/T297595 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `netflow2001.codfw.wmnet` - netflow2001.codfw.wmnet (**FAIL**) -...
[10:24:09] hmmh, some issue with the VIP? I failed over the ganeti master from 2019 to 2016 about an hour ago
[10:25:27] yeah, you moved it from row C to row B
[10:25:49] but the VIP is from the row C subnet
[10:26:15] afaik it's a limitation of our ganeti setup
[10:26:38] indeed, I picked what ganeti offered as candidate, but it should only really offer from C, let me check the other candidates
[10:27:47] let me know when it's back to normal, I need to delete netflow2001.codfw.wmnet
[10:28:05] volans: or should I be able to re-run the cookbook?
[10:28:39] the decom one should be pretty much idempotent
[10:28:55] so yes retry that and if it doesn't work would be a good test to fix it
[10:29:02] cool!
[10:29:22] the former master was ganeti2019 which is from row B
[10:31:07] ah right, B1->D3
[10:31:07] D3: test - ignore - https://phabricator.wikimedia.org/D3
[10:31:45] yeah, VIP is in private1-b-codfw
[10:32:34] I'll switch it back to 2019 for now, it's the only current master candidate from B, the others are from C and D
[10:32:50] and need to figure out what changed there, with 2.15 they all stuck to their row
[10:33:57] moritzm: maybe silly question, but can we put them behind LVS?
[10:35:04] (with only 1 backend for the LVS VIP, the master)
[10:36:16] should we have 1 cluster per failure-domain? I know for drmrs we went for 1 cluster per rack with the new network design, but that would not apply well to eqiad/codfw
[10:36:46] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade netflow VMs to Bullseye - https://phabricator.wikimedia.org/T297595 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `netflow4001.ulsfo.wmnet` - netflow4001.ulsfo.wmnet (**PASS**) -...
[10:37:03] now that everything is done with automation, it should be much easier
[10:37:55] dunno, would need a closer look
[10:38:26] I've switched back to 2019, cookbook should work again after the TTL expires
[10:41:10] TTL of what?
[10:41:20] already working
[10:41:31] (telnet)
[10:42:28] I assumed ganeti does its own ARP announcements when moving the IP
[10:45:14] ok
[10:45:26] XioNoX: ^^^
[10:45:33] lmk if it fails
[10:59:18] I assume Moritz meant ARP timeout? As you say volan.s Ganeti likely sends a RARP or similar when a machine is moved/new one comes up, which forces a refresh on any device that receives/processes the rarp.
[11:01:10] I had to step out for a bit will do the decom when I'm back
[11:01:39] k
[11:02:51] yeah, that's what I meant, good to know thanks
[11:19:11] volans: quick one if you know off hand, I've been trying to work out what sets up /etc/network/interfaces file when we reimage a host, any suggestions on where to look? Re-image cookbook does not seem to add it, and if we're creating with puppet I've not found the code.
[11:19:38] specifically I'm interested in how the vlan interfaces are added on, for instance, LVS boxes, but that may be something else I can ask traffic.
[11:20:47] topranks: the basic setting is done by the debian-installer automatically using the assigned IP during installation, that's why we set in the DHCP (both manual and automated) the same IP/netmask we have in netbox
[11:20:55] so it does end up with the same data
[11:21:05] in addition we do some minor tweaking in ./modules/install_server/files/autoinstall/scripts/late_command.sh
[11:21:18] yeah I was looking at that, for the IPv6 stuff.
[11:21:20] that is executed in the chroot environment by the debian-installer after it has completed the installation
[11:21:46] Ok so Debian installer takes care of it, and just makes a static file based on the IP already assigned by DHCP?
[11:21:57] sorry I mistyped, it's executed in the d-i environment, where /target is the new OS
[11:22:35] as for the LVS specific and VLAN stuff I think we do that in puppet
[11:22:41] ok thanks for the info!
[11:22:42] makes sense why I couldn't find it :)
[11:23:40] Yeah, trying to help WMCS work out how to do similar (trunked/vlan interfaces) and wondering how it's done.
[11:24:08] On Ganeti hosts there is an init script in puppet that runs, can't seem to find similar for LVS but I'll ask b.black later on.
[11:24:38] he needs to set up the new LVSes today so he surely has to look at that code anyway, seems a perfect day to ask ;)
[11:26:02] topranks: profile::lvs::tagged_interface seems to do the tric
[11:26:04] *trick
[11:26:15] haha the stars have aligned perfectly
[11:26:18] it's called in modules/profile/manifests/lvs.pp
[11:26:19] ahh... ok thanks for the pointer :)
[11:29:45] And ultimately the data comes from hieradata/common/lvs/interfaces.yaml, not from Netbox. I guess the process is assign the IPs in netbox, create the YAML file in hieradata, then the interface->ip bindings are added to Netbox by the puppet import script.
[11:35:17] topranks: the data doesn't come to netbox because it gets to netbox from puppetdb
[11:35:36] ofc we could aim to have the inverse process at some point... not sure there are pro/cons
[11:35:59] I guess the initial assignment has to be in Netbox, cos you won't know what IPs are free.
[11:36:28] And a drawback is having to do that, and then also add the IP to the YAML file. But I'm sure there are drawbacks doing the other way too, it doesn't seem too bad a workflow.
[11:37:51] I think those new LVSes that brandon is about to set up (and will re-use the same IPs fwiw) are the first new ones that we're setting up since netbox automation was introduced
[11:38:18] we might need to review the workflow and decide what to do
[11:38:31] the drmrs ones don't have the same settings, right?
[11:38:41] I don't see the VLAN-based subnets in netbox
[11:38:54] or maybe just "not yet" because of the router issue
[11:44:18] I think in drmrs each will have a physical connection to each switch, and only need connection to a single Vlan on each of those ports.
[11:45:07] But looking now Netbox isn't complete (still has ##PRIMARY## interface on for instance lvs6001), and the physical hosts only have one of their physical NICs configured.
[11:45:11] So perhaps it's a work in progress.
[11:45:31] But you're right - I don't think they'll have the sub-interfaces.
[11:46:32] ack
[11:50:42] topranks: in drmrs they will need both private and public on each switch trunked
[11:51:03] ah yeah we've got a public vlan there don't we? ok yep thanks.
[12:01:34] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade netflow VMs to Bullseye - https://phabricator.wikimedia.org/T297595 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `netflow2001.codfw.wmnet` - netflow2001.codfw.wmnet (**PASS**) -...
[12:03:14] volans: decom script worked fine!
[12:04:10] great, thx
[12:09:46] XioNoX: I'm actually wondering why the LVS needs access to the public vlan? Does it load-balance to real-servers on there with public IPv4 addresses?
[12:11:06] topranks: not sure about there specifically, but by default they get access to private + public
[12:11:46] ok yeah. I guess ideally they'd only sit on private, announcing publics on the front side via BGP, and talking to realservers on the private too.
[12:12:14] topranks: if the real servers need direct internet access it's sometimes required
[12:12:51] I think we do have some in eqiad/codfw but I don't have specific examples in mind right now
[12:13:28] but yeah I agree ideally the back-end servers would be in private only and use the proxies to fetch stuff from the internet if needed
[12:14:30] cool thanks.
[12:15:25] topranks: I made that some time ago https://wikitech.wikimedia.org/wiki/File:New_service_IP_flow_chart.png
[12:16:17] and now I'm waiting for a "I want to deploy a new service" wikitech page to include it :)
[12:17:01] ok yeah. I presume Squid is only for "outbound" internet requests they need to make? "direct return" traffic from a public source IP is sent via private vlan no?
[12:17:17] yep
[12:17:55] ok yeah that makes more sense. But yeah servers might need internet access for more than just answering requests sent to them by LVS, hence proxy or in some edge cases being on public vlan.
[12:18:04] thanks.
[12:24:10] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade netflow VMs to Bullseye - https://phabricator.wikimedia.org/T297595 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `netflow1001.eqiad.wmnet` - netflow1001.eqiad.wmnet (**PASS**) -...
[12:44:53] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade netflow VMs to Bullseye - https://phabricator.wikimedia.org/T297595 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `netflow3001.esams.wmnet` - netflow3001.esams.wmnet (**PASS**) -...
[12:57:56] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade netflow VMs to Bullseye - https://phabricator.wikimedia.org/T297595 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `netflow5001.eqsin.wmnet` - netflow5001.eqsin.wmnet (**PASS**) -...
[13:54:49] Mail, Infrastructure-Foundations, SRE, Znuny, serviceops: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (akosiaris) p:Triage→Low Code found. https://github.com/znuny...
[13:56:13] netops, Infrastructure-Foundations, SRE, ops-codfw, cloud-services-team (Kanban): connect 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (Papaul) @aborrero are we doing trunk so i can assign this task to netops?
[14:15:51] netops, Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi)
[14:15:58] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade netflow VMs to Bullseye - https://phabricator.wikimedia.org/T297595 (ayounsi) Open→Resolved a:ayounsi All done!
[14:19:05] Mail, Infrastructure-Foundations, SRE, observability, Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (Marostegui) p:Triage→Medium
[14:19:31] netops, Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) Tests are successful: I tested it by configuring sflow on the non-yet-prod asw1-b12-drmrs switch: `lang=diff [edit protoc...
[14:52:14] CAS-SSO, Infrastructure-Foundations, User-jbond: Deprecation of U2F API in Chrome / Enable web auth in CAS - https://phabricator.wikimedia.org/T296629 (MatthewVernon)
[15:31:11] CAS-SSO, Infrastructure-Foundations: Cookbook to manage 2FA state for a user - https://phabricator.wikimedia.org/T295579 (MatthewVernon)
[15:31:51] SRE-tools, netops, Infrastructure-Foundations, serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (MatthewVernon)
[15:41:44] Puppet, Infrastructure-Foundations: Role hieradata for non-existent roles - https://phabricator.wikimedia.org/T296533 (MatthewVernon)
[15:45:17] Mail, Infrastructure-Foundations, Sustainability (Incident Followup): Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (MatthewVernon)
[15:52:15] Mail, Infrastructure-Foundations, Sustainability (Incident Followup): Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (herron) Open→Resolved a:herron Resolving as everything in the description has now been done. Please reopen if anything else is needed!
[16:04:27] CAS-SSO, Infrastructure-Foundations: Cookbook to manage 2FA state for a user - https://phabricator.wikimedia.org/T295579 (jbond) Open→Resolved a:jbond
[17:20:25] Mail, Infrastructure-Foundations, Sustainability (Incident Followup): Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (Dzahn) Somebody clicked "disable active checks" on Icinga for mx2001. The opposite of "active checks" is "expect passive checks" though. That's not the...
[17:24:32] netops, Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou) Am I right in assuming that this data has the same schema as the original `netflow`?
[17:54:31] netops, Infrastructure-Foundations, SRE, ops-codfw, cloud-services-team (Kanban): connect 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (aborrero) Open→Stalled Yes, we will be doing trunk. Thanks @Papaul I think we're fine here from DCops side f...
[17:54:54] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (aborrero)
[17:55:56] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (aborrero) Open→Stalled We just re-shifted team priorities...
[17:58:15] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (aborrero) Open→Stalled FYI network details for these servers are blocked on {T296411}, which is in turn stalled, so marking th...
[21:04:37] netops, Infrastructure-Foundations, SRE, ops-codfw, cloud-services-team (Kanban): connect 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (Papaul) a:Papaul→None
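
Note on the 10:24 to 10:38 exchange: the Ganeti master VIP stopped answering because the master role had moved to a host outside the VIP's subnet (private1-b-codfw). A minimal Python sketch of that constraint follows. Only the VIP address 10.192.16.131 and the name ganeti2019 come from the log. The /22 prefix, the host addresses, and the names other-candidate-c and other-candidate-d are assumptions for illustration, and eligible_masters is a hypothetical helper, not part of Ganeti or any cookbook.

```python
import ipaddress

# VIP address as printed by telnet in the log.
VIP = ipaddress.ip_address("10.192.16.131")

# ASSUMPTION: the log names the VLAN (private1-b-codfw) but not its prefix.
VIP_SUBNET = ipaddress.ip_network("10.192.16.0/22")

# ASSUMPTION: illustrative addresses only. ganeti2019 is the row B host named
# in the log; the other two entries are placeholder master candidates.
CANDIDATES = {
    "ganeti2019": "10.192.16.50",
    "other-candidate-c": "10.192.32.50",
    "other-candidate-d": "10.192.48.50",
}


def eligible_masters(candidates, vip_subnet):
    """Return the candidates whose address is inside the VIP's subnet.

    A host outside that subnet cannot usefully hold the VIP, because
    neighbours would never reach it at layer 2.
    """
    return sorted(
        name
        for name, addr in candidates.items()
        if ipaddress.ip_address(addr) in vip_subnet
    )


if __name__ == "__main__":
    print("VIP inside subnet:", VIP in VIP_SUBNET)
    print("safe failover targets:", eligible_masters(CANDIDATES, VIP_SUBNET))
```

A check like this could run before a failover to refuse targets from another row, which is the behaviour the channel expected from Ganeti itself.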