[08:41:47] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney)
[08:42:07] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) a:cmooney
[11:14:28] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney)
[11:25:10] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) >>! In T327919#8664016, @aborrero wrote: > Please let me know if there is something I can do t...
[11:46:49] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (aborrero) >>! In T327919#8679314, @cmooney wrote: >>>! In T327919#8664016, @aborrero wrote: >> Please l...
[11:54:30] I'd be grateful if someone from this team could sanity check this for me please: https://gerrit.wikimedia.org/r/c/operations/puppet/+/894017
[11:55:43] I'm not expecting it to make any changes to any traffic flow, but it will fix a puppet compilation error on 12 servers (aqs2*) which are currently marked inactive in conftool.
[11:57:25] * vgutierrez looking
[11:58:56] btullis: so no aqs server is pooled in codfw?
[11:59:16] that will trigger some alerts
[12:00:51] vgutierrez: Correct. There never has been. We now have 12 realservers there, since December 2022, but we're still not planning to pool codfw for the time being.
[12:01:50] vgutierrez: OK, even with the servers marked inactive? (https://phabricator.wikimedia.org/T331115#8663364) If so, how can we deal with the alerts ahead of time?
[12:02:16] btullis: so pybal and the alerting system expects you to play nice and meet the depool threshold
[12:05:46] vgutierrez: Oh, I see. Can we simply downtime/silence the alert, or is it more problematic than that?
[12:06:16] btullis: any issues on your side if you pool the servers?
[12:06:52] I mean, you still point the traffic to aqs.svc.eqiad.wmnet, right?
[12:07:12] I don't see any discovery.wmnet record linked to aqs
[12:08:26] vgutierrez: One main issue, which is that one of the back-end datastores for aqs (druid-public) is only in eqiad and it doesn't yet use TLS, so cross-dc requests would occur and they wouldn't be encrypted.
[12:08:38] btullis: how?
[12:09:19] I mean.. how those cross-dc requests will happen?
[12:09:33] vgutierrez: Correct, there is no discovery record yet either. I'm only just adding the aqs.svc.codfw.wmnet DNS record now too. https://gerrit.wikimedia.org/r/c/operations/dns/+/894024
[12:09:40] right
[12:09:58] so as long as you don't feed aqs.svc.codfw.wmnet to some service, it won't be used anywhere
[12:11:26] https://gerrit.wikimedia.org/r/c/operations/puppet/+/894017 is a NOOP at DNS level
[12:12:42] Great. I don't think we are feeding that address anywhere. It would probably need a change to restbase, which I haven't looked at.
[12:13:32] In terms of how the requests happen, we have the (legacy) nodejs based aqs app deployed to the new aqs hosts on codfw, but part of their configuration is this: https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/aqs.yaml#L211
[12:14:19] github.com won't open here this morning
[12:14:27] do you have the gerrit link handy?
[12:14:46] (fixed it)
[12:15:29] btullis: I see.. I'd recommend blocking those requests at the app layer on aqs2 instances
[12:15:49] So if traffic *did* arrive at aqs.svc.codfw.wmnet and did get through to the realservers, the aqs app would send certain requests through over http to druid-public-broker.svc.eqiad.wmnet:
[12:15:50] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/aqs.yaml#217
[12:16:01] this is a current limitation in our pybal puppetization
[12:16:13] you need the service to be at production in eqiad and in service_setup in codfw
[12:19:03] OK, so I think I /could/ pool the servers and they still wouldn't receive traffic because of this: https://gerrit.wikimedia.org/g/operations/puppet/+/production/hieradata/common/profile/restbase.yaml
[12:19:35] yep
[12:19:56] to be clear, pybal won't allow you to not have any pooled servers
[12:20:18] so that's happening either way if you merge that CR
[12:21:38] Got it! Thanks. But I don't need to create a discovery record for aqs.discovery.wmnet, right?
[12:21:43] if that's a big issue I'd recommend adding an unconditional 200 response on aqs.svc.codfw.wmnet:73232
[12:21:47] btullis: nope
[12:21:58] *7232
[12:22:27] after merging it you need to restart pybal on lvs2010 and lvs2009
[12:22:53] I can handle that if needed
[12:26:16] OK, thanks. So I think I'll go for it. 1) finish adding the DNS record for aqs.svc.codfw.wmnet 2) Update netbox and run the sre.dns.netbox cookbook 3) pool the 12 servers in codfw with confctl 4) merge the CR and deploy, including lvs2010 and lvs2009, 5) restart pybal on lvs2010 and lvs2009 (or ask you to do it) 6) Check that there is still no traffic going to aqs.svc.codfw.wmnet or any realservers
[12:28:37] Ack
[12:28:52] Set a weight >0 for the servers as well
[12:31:01] Ack, thanks. Writing this up now on T331115 and then I'll go for it and keep you posted.
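The depool-threshold point made at 12:02 and 12:19 can be illustrated with a minimal sketch. This is a simplified model, not PyBal's actual code: the function names and the 0.5 threshold are assumptions made for illustration, and only the count of 12 servers comes from the conversation.

```python
import math


def min_pooled(total_servers: int, depool_threshold: float) -> int:
    """Smallest number of servers that must stay pooled (hypothetical model)."""
    return math.ceil(total_servers * depool_threshold)


def meets_depool_threshold(pooled: int, total_servers: int,
                           depool_threshold: float) -> bool:
    """True if the number of pooled servers satisfies the threshold."""
    return pooled >= min_pooled(total_servers, depool_threshold)


# 12 codfw realservers as in the log; 0.5 is an assumed threshold value.
TOTAL = 12
THRESHOLD = 0.5

# With every server left unpooled the threshold cannot be met, which is
# the situation the alerting would complain about.
print(meets_depool_threshold(0, TOTAL, THRESHOLD))
# With all 12 servers pooled (step 3 of the plan) the threshold is met.
print(meets_depool_threshold(12, TOTAL, THRESHOLD))
```

Under this model, pooling the servers before merging the change is what keeps the service above the threshold, even though no traffic is routed to aqs.svc.codfw.wmnet.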
[12:31:02] T331115: Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115
[12:56:59] Have set weight to 10. Pooling the servers now.
[13:11:54] Steps 1-4 are complete. puppet has run cleanly on lvs2009 and lvs2010
[13:12:48] vgutierrez: Would you be so kind as to restart pybal for me please?
[13:15:24] vgutierrez: Uhoh, there are some alerts
[13:15:59] one sec :)
[13:17:35] (PurgedHighEventLag) firing: High event process lag with purged on cp5028:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5028 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[13:22:35] (PurgedHighEventLag) resolved: High event process lag with purged on cp5028:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5028 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[13:22:45] btullis: alert should clear soon
[13:23:34] vgutierrez: Thanks. I saw there was a p.age just then. Was checking to see if those probes should be ok.
[13:24:03] yeah.. that's the alerting trying to healthcheck the service before lvs was aware of it
[13:24:46] Cool. Many thanks for your help.
[14:15:38] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) >>! In T327919#8679398, @aborrero wrote: > In the past we had problems with DHCP forwarding be...
[14:22:27] netops, Infrastructure-Foundations, SRE, ops-codfw: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (Jhancock.wm) @cmooney I got these repatched as depicted in the links. Thanks for waiting. Please let me know if you need anything else!
[14:30:47] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney)
[14:31:03] netops, Infrastructure-Foundations, SRE, ops-codfw: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (cmooney) Open→Resolved That's great Jenn thanks! All looking good and working now :) ` cmooney@cloudsw1-b1-codfw> show interfaces descrip...
[15:29:15] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host acmechief1001.eqiad.wmnet with OS bullseye
[15:37:27] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) Some updates on the physicals for the new cloudsw. The links to core routers are now up and c...
[15:57:22] netops, Infrastructure-Foundations, SRE, netbox: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (jbond)
[16:09:51] Traffic, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[16:18:43] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host acmechief1001.eqiad.wmnet with OS bullseye completed: - acmechief1001 (**PASS**) - Downtimed on Icinga/Alertmanager...
[16:23:32] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[16:24:13] Traffic, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[17:46:01] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ssingh) `cr2-eqiad` (replicated to `cr1-eqiad` as well): ` /* ns0 */ route 208.80.154.238/32 { next-hop 208.80.154.10; readvertise; no-reso...
[17:56:50] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ssingh) `cr2-codfw` (replicated to `cr1-codfw` as well): ` /* ns1 */ route 208.80.153.231/32 { next-hop 208.80.153.77; readvertise; no-reso...
[18:26:05] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `authdns[1001,2001].wikimedia.org` - authdns1001.wikimedia.o...
[19:41:27] Acme-chief, SRE, Traffic-Icebox: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (BCornwall) @Vgutierrez It looks like the work you've done means that this can be closed. Is that the case?
[20:13:04] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye
[20:25:51] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye executed with errors...
[20:53:31] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye
[21:07:14] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye executed with errors...
[21:07:50] sukhe: what is the current situation with the dns servers?
[21:08:08] I made some changes that need running the Netbox DNS update cookbook, but unsure if that's wise right now
[21:10:14] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye
[21:24:34] I took a chance on it and ran the cookbook, seemed to go ok :)
[21:25:13] topranks: sorry missed your message
[21:25:20] imaging dns1003
[21:25:33] cool, how's it going?
[21:28:15] just finished doing the firmwares :P
[21:28:18] and also had to run the
[21:28:38] sre.network.configure-switch-interfaces cookbook
[21:29:01] authdns servers should work since till we actually reimage and everything is fine, we won't add dns1003 and dns2003 to the authdns_servers list
[21:35:45] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye executed with errors...
[21:35:57] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye
[21:36:21] sukhe: cool good stuff
[21:39:15] thanks for the help!
[21:39:21] also yes, we were missing the DNS entries for the mgmt interface
[21:39:28] I ran the provisioning thing again and it worked quite nicely
[21:39:35] (didn't delete the mgmt interface)
[21:56:31] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns2003.wikimedia.org with OS bullseye
[22:03:00] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns2003.wikimedia.org with OS bullseye executed with errors...
[22:03:14] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns2003.wikimedia.org with OS bullseye
[22:20:34] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye completed: - dns1003...
[22:24:11] sukhe: FYI running the sre.dns.netbox gave me some problems just now
[22:24:19] oh?
[22:24:22] "7.1% (1/14) of nodes failed to execute command 'cd /srv/authdns/...nippets --deploy': dns2003.wikimedia.org"
[22:24:32] oh that's weird
[22:24:34] The rest updated ok, so I think (hope!) it's not an issue
[22:24:54] but I wonder why is it getting this? it's not in the authdns_servers list
[22:25:13] ok
[22:25:23] ohhh
[22:25:40] because it probably does A:dns-auth
[22:25:42] which does return 2003
[22:25:47] but that also returns 1003
[22:26:13] that would make sense alright
[22:26:47] I guess 1003 worked because the reimaging finished successfully
[22:26:56] though I thought we had updated the active dns server list to authdns_servers
[22:26:59] but maybe I misunderstood
[22:27:01] My updates are showing anyway
[22:27:05] cmooney@cumin1001:~$ dig +short irb-1120.cloudsw1-b1-codfw.wikimedia.org @10.3.0.1
[22:27:05] 208.80.153.185
[22:27:09] yeah
[22:27:24] so no major issue probably
[22:27:34] yeah I am going to make sure these hosts are functional today
[22:27:40] as soon as dns2003 finishes
[22:28:20] cool, 1003 went ok?
[22:28:31] looks correct in Netbox anyway
[22:28:47] topranks: thanks for letting me know. it seems like it's fairly obvious that it's getting A:dns-auth
[22:28:53] https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/889601/ I *assumed* that would fix it
[22:28:56] yep, 1003 went fine!
[22:29:02] going to check bird, etc., and then do BGP
[22:29:02] great!
[22:29:25] ok cool, should be ok, all my homer changes are done
[22:29:37] I'm gonna step away but ping me if there's any problem on that score
[22:29:53] thanks for the help today, even though it's late for you
[22:30:01] enjoy!
[22:30:40] next sre summit for sure!!
[22:30:55] haha... wrong channel :P
[22:30:58] :D
[22:43:56] Traffic, SRE, Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns2003.wikimedia.org with OS bullseye completed: - dns2003...
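The 22:24 to 22:28 exchange suggests the cookbook targeted hosts through a query alias (A:dns-auth) rather than the authdns_servers list, so hosts still being reimaged were included. A minimal sketch of that mismatch follows, assuming plain Python sets stand in for both sources. Only dns1003, dns2003 and the 1/14 figure come from the log; the other hostnames and both function names are hypothetical, and this does not reproduce cumin or spicerack behaviour.

```python
def unexpected_targets(alias_hosts, configured_hosts):
    """Hosts matched by the alias that are absent from the configured list."""
    return sorted(set(alias_hosts) - set(configured_hosts))


def failure_summary(failed_hosts, targeted_hosts):
    """Format a failure ratio in the same style as the quoted error."""
    pct = 100 * len(failed_hosts) / len(targeted_hosts)
    return f"{pct:.1f}% ({len(failed_hosts)}/{len(targeted_hosts)})"


# Hypothetical stand-ins for the 12 hosts already in the configured list.
configured = [f"host{n:02d}" for n in range(1, 13)]
# The alias additionally matches the two hosts that were being reimaged.
alias_matches = configured + ["dns1003", "dns2003"]

print(unexpected_targets(alias_matches, configured))
print(failure_summary(["dns2003"], alias_matches))
```

With 14 targeted hosts and one failure, the summary reproduces the "7.1% (1/14)" figure from the error message.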