[06:43:37] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6008.drmrs.wmnet with OS buster
[06:49:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:52:57] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6009.drmrs.wmnet with OS buster
[06:59:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:23:05] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6008.drmrs.wmnet with OS buster completed: - cp6008 (**WARN**)...
[07:31:09] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6010.drmrs.wmnet with OS buster
[07:34:32] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6009.drmrs.wmnet with OS buster completed: - cp6009 (**WARN**)...
[07:44:39] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6011.drmrs.wmnet with OS buster
[08:11:44] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6010.drmrs.wmnet with OS buster completed: - cp6010 (**WARN**)...
[08:14:59] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6012.drmrs.wmnet with OS buster
[08:24:37] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6011.drmrs.wmnet with OS buster completed: - cp6011 (**WARN**)...
[08:30:26] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6013.drmrs.wmnet with OS buster
[08:51:33] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-ulsfo: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10ayounsi) a:03RobH `mgmt` ports to the `mgmt` switch please :) Once we have this and console, we can check and upgrade them.
[08:54:23] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6012.drmrs.wmnet with OS buster completed: - cp6012 (**WARN**)...
[08:56:31] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6014.drmrs.wmnet with OS buster
[09:11:15] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6013.drmrs.wmnet with OS buster completed: - cp6013 (**WARN**)...
[09:13:43] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6015.drmrs.wmnet with OS buster
[09:35:56] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6014.drmrs.wmnet with OS buster completed: - cp6014 (**WARN**)...
[09:39:44] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6016.drmrs.wmnet with OS buster
[09:53:52] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6015.drmrs.wmnet with OS buster completed: - cp6015 (**WARN**)...
[10:07:00] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10ayounsi) This will cause a hard downtime for 6 servers (rack [[ https://netbox.wikimedia.org/dcim/racks/57/ | B7 ]]), for up to 1h, but most likely less: (1) thanos-be2002...
[10:13:37] Heads up - I am de-pooling ulsfo in DNS to drain it of traffic before rolling out some changes to CR routers there (T295672).
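(Editor's note: for reference, de-pooling a site in DNS is done via the admin_state mechanism in the operations/dns repo; the sketch below is from memory of the wikitech "Global traffic routing" docs, so treat the exact syntax and the deploy step as assumptions rather than the literal change made here.)

    # in a checkout of the operations/dns repo, mark the site DOWN in admin_state
    $ grep ulsfo admin_state
    geoip/generic-map/ulsfo => DOWN    # removes ulsfo from the user-facing geo map
    $ authdns-update                   # deploy to the authoritative nameservers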
[10:13:37] T295672: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672
[10:19:07] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6016.drmrs.wmnet with OS buster completed: - cp6016 (**WARN**)...
[10:22:49] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10LSobanski) Adding @MatthewVernon for the Swift hosts.
[10:22:56] (EdgeTrafficDrop) firing: 61% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[10:25:23] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-swift-storage, 10ops-codfw: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10LSobanski)
[10:48:11] 10netops, 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) If we look at another host that is not in the list, but was purchased and installed at the same time as an-...
[10:53:05] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-tools: Netbox - PuppetDB audit 2021-11 - https://phabricator.wikimedia.org/T295762 (10BTullis)
[10:53:30] 10netops, 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) 05Open→03Resolved Committed. The results are here: https://netbox.wikimedia.org/extras/scripts/results/...
[10:53:54] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-swift-storage, 10ops-codfw: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10MatthewVernon) I don't think so, no - the frontends will not route requests to down servers (at least in theory!); we'll be more vulnerable to failur...
[11:06:05] 10netops, 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10Volans) Thanks a lot!
[11:25:26] (EdgeTrafficDrop) resolved: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[11:38:56] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-swift-storage, 10ops-codfw: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10BTullis) I don't believe that we need to do any prep or depooling work for furud.codfw.wmnet We can downtime it in Icinga, but I think that's the lim...
[12:30:56] (EdgeTrafficDrop) firing: 64% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[13:00:56] (EdgeTrafficDrop) resolved: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[13:44:08] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-swift-storage, 10ops-codfw: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10Gehel) The elasticsearch cluster should be able to cope with losing 2 nodes with no issues. Thanks for flagging this, and please ping @RKemper and m...
[14:48:46] moritzm: let me know if you have time to poke at ganeti topics with me today. I did reimage the servers to bullseye (but still insetup without any ganeti config yet)
[14:59:26] sure, we can do that now. circling back to what you wrote yesterday, we'd in fact create two separate Ganeti node groups Rack_B12 and Rack_B13. as for the master failover, I had a look at all master candidates in the two main clusters and they're in fact all from the same row
[14:59:59] right
[15:00:10] but this might simply be the default allocation strategy, AFAICS we should be able to designate additional nodes outside the row with
[15:00:17] which means if that whole row were to vanish, there's no simple master failover
[15:00:25] (because the existing master IP can't exist in another row)
[15:00:38] gnt-node modify FQDN --master-candidate no
[15:00:59] (because the IP for the master svc hostname is in a row-specific subnet)
[15:02:02] ah, indeed
[15:02:03] the risks are different/lower there, and priorities, etc
[15:02:24] but for the edge site, each "row" is just 1x rack with 1x switch, etc. So we do want to still be able to operate the cluster(s) if we lose a rack.
[15:03:43] this is basically why I'm leaning towards 2x separate per-rack clusters, and then for important instances which have redundant active/active sets of nodes, putting nodes for those services in both clusters.
[15:04:24] but: I have not been through the rest of this and don't understand the puppetization all that well. I'm not sure if there are some baked-in "one cluster per site" assumptions in the structure of things.
[15:05:05] yeah, that would also work, we might need a few tweaks to spicerack/cookbooks, but I think from the puppet level it should be fine
[15:05:19] and if there's anything standing in the way we can fix it
[15:05:50] ok, sounds like a plan then! :)
[15:06:25] the basic hieradata/config for the puppet part is pretty easy to figure out.
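(Editor's note: a minimal sketch of the master-candidate commands discussed above, run as root on the cluster master; the node names are hypothetical examples, not hosts from this conversation.)

    gnt-node list                                        # overview of the cluster's nodes
    # demote a master candidate that sits in the same row as the master...
    gnt-node modify --master-candidate=no ganeti1009.eqiad.wmnet
    # ...and designate one outside the row instead (subject to the caveat above:
    # the master service IP lives in a row-specific subnet, so failover across
    # rows still isn't simple)
    gnt-node modify --master-candidate=yes ganeti1022.eqiad.wmnet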
[15:06:34] from a quick glance we'd simply need two separate RAPI certificates and two separate SVC entries for the master IP
[15:06:38] the wikitech steps include a bunch of manual things that happen on cluster bringup
[15:06:55] and then selection of nodes towards a cluster is based on Hiera anyway
[15:07:03] right, I'm assuming ganeti01.svc.drmrs.wmnet + ganeti02.svc.drmrs.wmnet
[15:07:38] I can take a stab at it all and see how I fare anyways, maybe a good 3rd-party pov on whether wikitech captures everything or not :)
[15:08:12] the only less elegant aspect is maybe
[15:09:26] that we're no longer able to use the DC-specific Hiera selectors, but instead need to specify profile::ganeti::rapi::certificate on a per-host level, but given that we're only speaking of 4 nodes (and in the future 6) that seems fine
[15:10:10] and yes, ganeti01.svc.drmrs.wmnet + ganeti02.svc.drmrs.wmnet seems fine
[15:10:12] yeah, I think at least for now that would be ok.
[15:11:12] ideally we should not use .svc. to not make https://phabricator.wikimedia.org/T270071#7404620 worse
[15:11:16] I updated the wikitech docs when creating the new Ganeti test cluster, so it should in fact be complete
[15:12:17] maybe ganeti-b12 and ganeti-b13 ?
[15:12:41] .drmrs.wmnet
[15:12:54] I wouldn't tie this to a specific rack name, the cluster name is likely to stay around longer than the rack it currently resides in
[15:12:58] that's a huge ticket, what's the TL;DR?
[15:13:10] (about why we don't want ganeti in the .svc. domains?)
[15:14:13] XioNoX: let's not tie the change from 270071 into the new setup; when we fix it we'll first test it in the ganeti test cluster, and when it's in place and tested we can easily adapt the remaining installations
[15:15:17] bblack: tldr is to use .svc. only for LVS VIPs
[15:15:28] moritzm: ok, no pb!
[15:16:15] do you really mean only for IPs serviced by LVS, or only for IPs in those subnets that are allocated to LVS?
[15:16:47] I mean, I think we've sometimes had services in the LVS subnet that aren't LVS'd, although I donno what other cases remain.
[15:17:27] but I guess you want netbox automation of .svc. to have a simple mapping of subdomain<->subnets?
[15:18:09] anyways, not terribly important at the moment, just lingering curiosity! :)
[15:18:48] bblack: SVC subnet indeed
[15:19:34] there are a handful of outliers blocking managing the bulk of records using netbox
[15:20:01] ok
[15:20:46] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-swift-storage, 10ops-codfw: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10lmata) @ayounsi after a chat with the team we think we should be fine, we will monitor and be available should something happen.
[15:32:21] moritzm: https://gerrit.wikimedia.org/r/c/operations/puppet/+/739553/ ? There's a "profile" name we didn't discuss yet, which looks like it should be distinct per cluster for netbox integration.
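(Editor's note: a rough illustration of the per-host Hiera selection mentioned above. Only the profile::ganeti::rapi::certificate key and the svc hostnames come from the conversation; the file paths, host names, and node-to-cluster assignment are assumptions, not taken from the actual patch.)

    $ cat hieradata/hosts/ganeti6001.yaml    # hypothetical path and host
    profile::ganeti::rapi::certificate: 'ganeti01.svc.drmrs.wmnet'
    $ cat hieradata/hosts/ganeti6002.yaml
    profile::ganeti::rapi::certificate: 'ganeti02.svc.drmrs.wmnet'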
[15:33:44] let me have a look
[15:36:56] +1d, the title in the netbox sync definition seems only used to declare a unique name for the systemd timer, but Riccardo can also have a second look if needed
[15:41:19] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-ulsfo: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10RobH)
[15:43:05] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-ulsfo: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10ayounsi) If you can take pictures of the front panels that could be useful to instruct remote hands when they get to drmrs too.
[16:26:06] XioNoX: do we want to implement also T262446 for drmrs?
[16:26:06] T262446: Import row information into Netbox for Ganeti instances - https://phabricator.wikimedia.org/T262446
[16:26:30] that would be nice :)
[16:26:31] given we probably need to touch the automation bit anyway to make it work for multiple clusters, it might be a good opportunity to have all that in one go
[16:26:39] moritzm: ack, I'll have a look
[16:29:43] FYI in case it affects whatever else - I did the netbox basic data for ganeti0[12] clusters in netbox already and assigned hosts to them
[16:30:22] and I'm on the drmrs switches now setting up the vlan-trunking bits for the ganeti+lvs interfaces
[16:32:12] bblack: the switches are now managed by Homer
[16:32:22] as of this morning
[16:33:09] hmmm ok
[16:33:56] will look at doing it via homer then!
[16:34:40] 10netops, 10Infrastructure-Foundations, 10SRE: Rebuild Routinator (rpki) VMs with larger disk - https://phabricator.wikimedia.org/T292503 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin2002 for hosts: `rpki2001.codfw.wmnet` - rpki2001.codfw.wmnet (**FAIL**) - **Host steps...
[16:34:46] it should be faster, as you can mass-edit interfaces in netbox
[16:35:14] ah I was going to say
[16:35:22] I guess this really isn't a homer change I need, but a netbox one
[16:37:34] XioNoX: now I have a question - when I was looking at the config manually, the lvs secondary interfaces had "mtu 9192" statements (which probably don't need to be there, they're not present in other sites)
[16:37:50] but the lvs primary interfaces don't have that config on the switch
[16:37:58] but in netbox, both cases have mtu set to 9192 explicitly
[16:38:40] (the ganeti interfaces do seem to need it, but don't get it, also)
[16:39:00] I donno, I'll fix up the vlan stuff first and then see
[16:39:05] bblack: all that work was not needed, it's automatically synced
[16:39:46] all what work?
[16:39:48] (altering the clusters and host allocation for ganeti in netbox)
[16:39:58] imported you mean?
[16:40:06] MTU 9192 is applied regardless on all access ports
[16:40:27] so for those it bypasses what's set in Netbox
[16:40:48] XioNoX: yeah I see that in the netbox data, but in the actual live switch config it was only on the non-access ports
[16:41:19] oh I get it now, it's set by the vlan interface-group in those cases
[16:41:22] ok
[16:44:47] so after editing tagged vlans in netbox, do I run homer or a cookbook or?
[16:45:17] homer I think
[16:49:29] volans: I guess, I tend to expect everything goes into netbox manually as it's the source
[16:49:36] I never expect data going the other way! :)
[16:49:55] for VMs for now it's the other way around
[16:50:11] we sync all virtualization clusters and VMs from the Ganeti API to Netbox
[16:50:13] yeah `homer asw1*drmrs* commit "my message"` for both
[16:50:19] XioNoX: done!
[16:50:23] cool!
[16:53:16] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-ulsfo: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10RobH)
[18:02:50] I've kept a few notes along the way, to go back and clarify some minor spots in the docs
[18:03:14] but overall, going good so far: through all the manual host config parts, cergen parts done, netbox VIPs created (being pushed out now)
[18:03:25] about ready for the "role(ganeti)" part and so on
[18:06:05] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2042.codfw.wmnet with OS buster
[18:12:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2042:9331 is unreachable - https://alerts.wikimedia.org
[18:16:00] ^^ the host is being reimaged
[18:17:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2042:9331 is unreachable - https://alerts.wikimedia.org
[18:17:58] ah, does the reimage cookbook not know how to do alertmanager silences?
[18:21:25] not yet
[18:21:39] T293209
[18:21:39] T293209: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209
[18:23:04] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) Change went well in ulsfo earlier. De-pooled the site in DNS first and then proceeded with steps as outlined above. All went as expected. Did tak...
[18:23:14] ty
[18:31:55] moritzm: if you're still around, I have an odd issue going with the ssh keys for ganeti
[18:32:17] I've just done the "gnt-cluster init ..." part
[18:32:56] but when I try to do the "gnt-node add ...", I get an ssh key error
[18:33:28] it's erroring on the key for ganeti01.svc, and it is the host key for the master node, and it is the same one that's present in /var/lib/ganeti/known_hosts too
[18:33:54] it almost seems like I need to manually sync ssh keys, like what's been lined-out in:
[18:33:57] https://wikitech.wikimedia.org/wiki/Ganeti#Synchronize_ssh_host_rsa_key_across_cluster_nodes
[18:34:26] (I assume it's trying to connect to the new node, using the key that only belongs to the master node)
[18:35:14] I think this might have to do with --no-ssh-init in the gnt-cluster init?
[18:35:23] (maybe I need to destroy and re-create without that?)
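(Editor's note: for reference, a minimal sketch of the bring-up sequence being debugged here, using only the commands and flags that appear in this conversation; the ganeti6001/ganeti6003 membership of the ganeti01 cluster is assumed from the later mention of ganeti02 being 6002 + 6004, and all other init flags are omitted.)

    # on the designated first node, as root; other init flags omitted
    gnt-cluster init --no-ssh-init ganeti01.svc.drmrs.wmnet
    gnt-node list                              # the init node should list itself as master
    # join the second node; --no-ssh-key-check skips the interactive host-key prompt
    gnt-node add --no-ssh-key-check ganeti6003.drmrs.wmnet
    gnt-cluster verify                         # the step that surfaced the ssh key errors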
[18:35:47] trying that for now, what's the worst that can happen
[18:40:00] bblack: I know nothing about it but maybe the code here might help (to see what the steps are)
[18:40:03] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/ganeti/addnode.py
[18:40:33] (not sure if the cookbook is supposed to work for the first node of a cluster)
[18:42:46] ah gnt-node add --no-ssh-key-check "{node}"
[18:42:58] still not sure if --no-ssh-init is correct or not in the cluster init
[18:43:43] icinga is also not super happy fwiw :)
[18:44:42] icinga probably won't be until I get past N hurdles or whatever
[18:56:17] bblack: If possible I intend to depool eqiad in global DNS tomorrow morning (8am UTC, so quiet time in the US).
[18:56:27] does that sound reasonable? or like madness?
[18:56:34] Reason is to reconfigure the iBGP between CR routers there, and not have external traffic coming in while that is done.
[18:56:42] It would be possible to do this without de-pooling, but egress routing will swing one way and then the other for certain destinations during the change.
[18:56:53] (as the policy change is processed / iBGP converges, routes learnt from the neighbor CR stop being used and local transit is preferred, then it reverts).
[18:58:11] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2042.codfw.wmnet with OS buster c...
[19:03:59] topranks: out of curiosity, would any internal routing be affected?
[19:04:31] No, both CRs have local routes to all internal resources, so the only iBGP routes that get used are from peering / transit.
[19:05:07] And no policy changes are being done on the BGP sessions over our transport links either.
[19:05:07] topranks: what about backhaul service traffic from edge sites?
[19:05:19] hmmm ok
[19:05:24] That would be unaffected.
[19:06:06] in any case, depooling eqiad in DNS is not madness, but there are always risks to keep an eye on
[19:06:28] I did this in ulsfo this morning and it went well. Even if we don't depool, neither router will lose reachability to any external network; it's just that for 2-3 mins the route they choose will probably change and change back.
[19:06:41] So.... I want to do the least disruptive thing.
[19:06:47] a lot right
[19:06:54] keep one eye on how hot the codfw/eqdfw/eqord transit and peering ports are
[19:06:57] sorry that was two halves of two different partial responses
[19:07:04] but yeah, that was what I was going to say
[19:07:04] (although at that time of day I'm not too worried)
[19:07:09] I did it in ulsfo earlier and de-pooled, but obviously that's a different scenario.
[19:07:23] all the eqiad public traffic will end up on codfw, so transit/peer saturation is at least a remote possibility
[19:07:51] cdanis: indeed yes I'll keep a good eye on all that, will make sure it's settled into the new pattern after de-pool and it's healthy before touching anything.
[19:08:02] (we could probably do better these days, and have some of eqiad fail over to ulsfo too, esp as ulsfo now has a full complement of 16 cache machines, but not for tomorrow!)
[19:08:09] 0800 UTC is like... 3am eqiad local time? should be fine
[19:08:20] bblack: yes exactly, but I think at that time it should be ok.
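(Editor's note: for context, a generic sketch of what a next-hop-self iBGP change like T295672 looks like in JunOS set-command terms; the group name "ibgp" is assumed and this is not the actual reviewed change.)

    set policy-options policy-statement next-hop-self then next-hop self
    set protocols bgp group ibgp export next-hop-self
    # after commit, iBGP-learnt routes carry this router's own address as next hop,
    # so each CR briefly re-converges - hence the plan to de-pool the site first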
[19:08:50] correct yeah 03:00 EST
[19:17:49] the use of --no-ssh-key-check is only really needed for the cookbook since the command otherwise asks for interactive confirmation, it should print something like this: https://paste.debian.net/hidden/271fa71c/
[19:18:46] but gnt-node add is only needed for the first node to get added to the cluster, after the gnt-cluster init the initial host is the master and should print itself in "sudo gnt-node list"
[19:19:52] yes
[19:20:01] I've gotten the first cluster working with two nodes
[19:20:15] through some random combination I can't reproduce, of destroy/recreate and various flags and whatever
[19:20:46] the second cluster (ganeti02, which is 6002 + 6004), I can init the cluster and add the second node, but verify keeps failing with an ssh key error talking to the second node
[19:20:57] (even though ssh straight from 6002 to 6004 as root works fine with no complaints)
[19:21:51] I'm not really sure which thing got me past it on the first set
[19:21:58] hmmh, ok, doesn't ring a bell but I'll debug those tomorrow morning, need to leave in a few
[19:22:04] ok
[19:22:45] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-swift-storage, 10ops-codfw: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10ayounsi) For the record, there is also a link to lvs2007, after chatting with @bblack on irc, the usual `disable puppet then stop pybal` is to do bef...
[19:23:05] could be a range of things, ferm or ganeti config or an artefact in Puppet where it only expects a single cluster or whatever
[19:27:58] it's a "wrong host key" sort of issue, the connection is successful
[19:28:39] but /var/lib/ganeti/known_hosts has the same correct key on both sides, and I can ssh manually from either node to either node as root, and from both nodes to the VIP hostname ganeti02.svc.drmrs.wmnet, all works without any prompts or warnings.
[19:29:48] basically the verify output has two errors of the form:
[19:29:49] Wed Nov 17 19:29:31 2021 - ERROR: node ganeti6004.drmrs.wmnet: ssh communication with node 'ganeti6004.drmrs.wmnet': ssh problem: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!
[19:30:02] Wed Nov 17 19:29:31 2021 - ERROR: node ganeti6002.drmrs.wmnet: ssh communication with node 'ganeti6004.drmrs.wmnet': ssh problem: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!
[19:30:48] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-ulsfo: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10ayounsi)
[19:31:17] it's getting a different key for ganeti02.svc.drmrs.wmnet than it expects, when connecting to ganeti6004
[19:42:05] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Rebuild Routinator (rpki) VMs with larger disk - https://phabricator.wikimedia.org/T292503 (10cmooney) 05Open→03Resolved Ok both VMs have been rebuilt with 20GB disk and updated to version 0.10.2. rpki1001 remains with the same name, r...
[19:50:24] from another ticket that was about upgrading ganeti:
[19:50:26] " (Following the reimage, the ganeti VG needs to be re-created, the network bridges setup and ssh_host_rsa_key/ssh_host_rsa_key.pub/known_hosts synced.)"
[19:51:49] bblack: this sounds like it could be related: https://wikitech.wikimedia.org/wiki/Ganeti#Synchronize_ssh_host_rsa_key_across_cluster_nodes but interestingly it's all crossed out
[19:51:54] yeah
[19:52:11] I think that's basically what it boils down to. It's supposedly not necessary anymore, except when it is :)
[19:53:07] I guess it's about /var/lib/ganeti/known_hosts then
[19:53:21] oh, you already mentioned that, oops
[19:55:11] copying the rsa host key from one machine to another has done the trick
[19:55:35] bblack: I found this in the wiki edit history: " Mark as obsolete, with the version of Ganeti in Stretch and later, the SSH keys get synched by "gnt-node add""
[19:58:14] yeah, it just didn't seem to work in this case
[19:58:27] *nod*
[19:58:41] possibly from non-idempotent orderings of commands (the cluster was created, destroyed, created again, while changing the creation params related to ssh keys)
[19:59:03] I got some ordering to work on one cluster without manual key copying, but couldn't reproduce the same on the other :)
[20:00:14] I see.. yea, good enough I guess.. given that we just do this every, eh, couple years
[20:07:08] moritzm: gotten past the ssh key issue, nothing to look at tomorrow :)
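(Editor's note: for posterity, a minimal sketch of the manual sync that "did the trick", based on the three files named in the quoted ticket; run as root from the master node, and treat it as an approximation of the old, crossed-out wikitech section rather than the exact commands used.)

    # copy the master's RSA host key pair and Ganeti's known_hosts to the other node
    scp /etc/ssh/ssh_host_rsa_key /etc/ssh/ssh_host_rsa_key.pub ganeti6004.drmrs.wmnet:/etc/ssh/
    scp /var/lib/ganeti/known_hosts ganeti6004.drmrs.wmnet:/var/lib/ganeti/
    ssh ganeti6004.drmrs.wmnet 'systemctl restart ssh'   # pick up the replaced host key
    gnt-cluster verify   # the "REMOTE HOST IDENTIFICATION HAS CHANGED" errors should clear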