[08:55:01] Traffic, Automoderator, Data Products, Product-Analytics, and 2 others: Add revision ID to X-Analytics header - https://phabricator.wikimedia.org/T346350#9656746 (phuedx) >>! In T346350#9655625, @mpopov wrote: > So if I'm interpreting that table correctly, we can trust `rev_id` in X-An...
[10:05:50] Traffic, Data-Engineering, Observability-Logging, Event-Platform, Patch-For-Review: Remove extra fields currently sent to Kafka - https://phabricator.wikimedia.org/T360642#9656880 (Fabfur) >>! In T360642#9655231, @Ottomata wrote: >> meta.id and meta.request_id > > `meta.id` is used to unique...
[10:47:45] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Migrate changeprop to mw-api-int - https://phabricator.wikimedia.org/T360767#9657017 (Clement_Goubert) Open→In progress
[11:34:30] Traffic, Data-Engineering, Observability-Logging, Patch-For-Review: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109#9657223 (Fabfur)
[11:49:12] Traffic, Automoderator, Data Products, Product-Analytics, and 2 others: Add revision ID to X-Analytics header - https://phabricator.wikimedia.org/T346350#9657270 (Samwalton9-WMF) Thank you both for confirming! :)
[12:14:55] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Migrate changeprop to mw-api-int - https://phabricator.wikimedia.org/T360767#9657385 (Clement_Goubert) `mw-api-int` is now receiving all calls to `mwapi_uri` from changeprop {F43323601} There are still calls coming from the `ChangePropagation/WM...
[12:16:08] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152#9657390 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `testvm2006.codfw.wmnet` - testvm2006.codfw.wmnet (**FAIL**) - Do...
[12:18:13] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Migrate changeprop to mw-api-int - https://phabricator.wikimedia.org/T360767#9657393 (Clement_Goubert) In progress→Resolved
[12:25:07] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9657411 (Clement_Goubert)
[13:10:42] netops, Infrastructure-Foundations, SRE: Move public-vlan host BGP peerings from CRs to top-of-rack switches in codfw - https://phabricator.wikimedia.org/T360772#9657554 (ayounsi) > So we need to decide if this imbalance for local queries is going to be an issue. I think load is the main thing to loo...
[14:23:31] sukhe: Hi! If you're around, I'd appreciate some pairing on decommissioning the aqs LVS service, as you kindly offered last week. I've prepped the necessary patches; I can add you as a reviewer if needed
[14:23:33] Thank you!
[14:26:13] brouberol: hi!
[14:26:16] yeah, let's do it
[14:27:28] alright, so I've prepped https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013501 to remove the realserver pool and move the service back to state: lvs_setup
[14:27:44] and https://gerrit.wikimedia.org/r/c/operations/dns/+/1013500 to remove the DNS records
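
Before the DNS change lands, it can be worth capturing a baseline. A minimal sketch, using the service records that come up later in this log; they should resolve now and return NXDOMAIN once the change is merged and authdns-update has run:

    # Pre-flight: confirm the aqs service records still resolve before removal;
    # after the DNS change is deployed, both should return NXDOMAIN.
    host aqs.svc.eqiad.wmnet
    host aqs.svc.codfw.wmnet
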
[14:29:17] brouberol: thanks, we will go over them
[14:29:21] have you silenced the network probes?
[14:30:41] I've silenced everything having to do with the AQS service being down under https://alerts.wikimedia.org/?q=aqs
[14:30:56] ah thanks
[14:30:58] yep
[14:31:20] looking at the DNS changes now since that's the next step
[14:33:08] brouberol: looks good, please merge it and run authdns-update (or let me know if you want me to :)
[14:33:28] ack! I've had to rebase the patch, as gerrit was complaining about some conflicts
[14:33:35] yeah
[14:33:48] I'm just waiting on the CI; after that I'll merge and deploy
[14:36:39] all done
[14:37:11] I am assuming you ran authdns-update as well?
[14:37:31] Host aqs.svc.eqiad.wmnet not found: 3(NXDOMAIN)
[14:37:31] Host aqs.svc.codfw.wmnet not found: 3(NXDOMAIN)
[14:37:31] I did
[14:37:46] for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013501, I left a comment; I think you should just change state: lvs_setup and then we will remove the pools later
[14:37:50] brouberol: thanks
[14:37:53] 👍
[14:41:24] done. I've stacked 2 changes
[14:42:24] +1
[14:47:11] 1st CR is merged. I'm running puppet-merge and then `sudo cumin 'A:dnsbox' run-puppet-agent` from cumin
[14:47:16] thanks
[14:51:49] all done
[14:51:58] ok thanks! +1ed the other change
[14:52:14] next step is to merge that and run cumin 'O:lvs::balancer' 'run-puppet-agent'
[14:52:32] ack, doing!
[14:54:05] then we will need to restart pybal on the backup and primary servers in both eqiad and codfw (first the backup)
[14:54:26] in eqiad the order is lvs1020 (wait for 300 seconds, give or take), then lvs1019
[14:54:33] in codfw, lvs2014 (wait), then lvs2013
[14:55:37] so, ssh onto the host, sudo systemctl restart pybal.service, wait, rinse, repeat?
[14:55:43] yep
[14:55:53] and log the commands as well, so we know the timing
[14:56:01] <_joe_> there's a cookbook that does things correctly
[14:56:13] and out of curiosity, are we waiting due to BGP?
[14:56:14] <_joe_> in terms of checking all the bgp sessions are back up
[14:56:22] <_joe_> brouberol: yes
[14:56:40] _joe_: the cookbook has been broken for a while (because, well, we never fixed the broken aliases)
[14:56:54] <_joe_> sukhe: you can use it on the individual hosts though
[14:57:02] <_joe_> and it waits for the bgp reestablishment
[14:57:04] <_joe_> anyways
[14:57:12] yeah, I guess. I am just more used to doing it manually
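
A sketch of the manual rolling restart described above, assuming passwordless sudo over ssh; the host order and the rough 300-second settle time come from the instructions in this exchange, and the cookbook, when healthy, does the same with proper BGP session checks:

    # eqiad: restart pybal on the backup balancer first, then the primary
    # once the BGP sessions have had time to re-establish.
    ssh lvs1020.eqiad.wmnet sudo systemctl restart pybal.service
    sleep 300    # rough settle time from the chat; watch the pybal-bgp dashboard
    ssh lvs1019.eqiad.wmnet sudo systemctl restart pybal.service
    # codfw follows the same pattern: lvs2014 (backup), then lvs2013.
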
[14:57:20] <_joe_> sukhe: I came around to ask for a link to the esams maint task
[14:57:26] <_joe_> for wednesday
[14:57:36] https://phabricator.wikimedia.org/T360430
[14:57:39] <_joe_> as we're still pooled with applayer traffic in a single dc
[14:57:45] timing is still not decided but once it is, I will send an email
[14:57:49] it's kinda up in the air right now
[14:57:52] <_joe_> well yes
[14:57:57] <_joe_> pump the brakes a sec :)
[14:58:04] do you want us to delay it?
[14:58:12] <_joe_> 1 sec, let me read the task
[14:58:16] <_joe_> hopefully not
[14:58:17] take your time
[14:58:43] https://phabricator.wikimedia.org/T360430#9642677 is the most relevant comment fwiw
[14:59:33] <_joe_> sukhe: one thing I am not sure about is
[14:59:41] <_joe_> if you're reimaging the hosts and swapping disks
[14:59:49] <_joe_> you do start with a 100% cold cache anyways
[14:59:55] <_joe_> or am I missing something?
[15:00:15] we are putting in the disks first, repooling and then reimaging one by one
[15:01:09] <_joe_> ok, that still sounds like a repool with cold cache, right?
[15:01:33] <_joe_> or are we *adding* the disks?
[15:01:47] ATS cache won't be cold though?
[15:02:07] adding additional new disks
[15:02:10] to text
[15:02:11] <_joe_> ah ok sorry, the text of the task seemed to imply a *swap*
[15:02:15] <_joe_> hence my confusion
[15:02:33] <_joe_> ok, then it all makes sense
[15:02:35] no, just bringing the configuration up to dual NVMes for text, just like upload
[15:02:40] ok :)
[15:02:54] <_joe_> I would mostly ask to wait with the reimaging until codfw is repooled
[15:03:06] sure, that is fine
[15:03:08] <_joe_> so the afternoon of the 27th in eu time, presumably
[15:03:21] also the timing is not confirmed, and I am not sure given that this Friday is Good Friday
[15:03:30] so let's see; I will send an email and we can coordinate
[15:03:54] <_joe_> the only thing to keep in mind is that currently we're running at 75% utilization in eqiad
[15:04:02] <_joe_> so additional load isn't great
[15:06:49] (PyBalBGPUnstable) firing: PyBal BGP sessions on instance lvs2014 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=codfw%20prometheus/ops&var-server=lvs2014 - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[15:06:51] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776#9657884 (Papaul)
[15:07:44] yeah, fair. we can delay reimaging the disks for a while, not an issue
[15:09:11] sukhe: I take it I have to wait for the BGP established sessions to go back to 1 in https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=codfw+prometheus%2Fops&var-server=lvs2014&orgId=1&from=now-1h&to=now ?
[15:11:57] brouberol: all good, please proceed with lvs2019!
[15:12:04] er, 2013 :)
[15:12:25] ack, onto 2013
[15:18:31] ok to proceed with lvs1020?
[15:18:40] brouberol: checking
[15:21:49] (PyBalBGPUnstable) firing: (2) PyBal BGP sessions on instance lvs2013 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[15:22:37] brouberol: some unrelated cleanup because of the restart, gimme a sec
[15:22:53] sure, thanks
[15:26:14] brouberol: so now we have to do some manual IPVS cleanup
[15:26:40] ipvsadm --delete-service --tcp-service addr:port for the aqs servers in codfw
[15:26:54] aqs2007.codfw.wmnet, aqs2005.codfw.wmnet, aqs2003.codfw.wmnet, aqs2008.codfw.wmnet, aqs2002.codfw.wmnet, aqs2012.codfw.wmnet
[15:27:31] on lvs2013?
[15:28:03] on both, actually
[15:28:08] understood
[15:28:22] you can get the IPs from netbox since we just killed them in the DNS repo
[15:28:28] but basically
[15:28:49] ipvsadm --delete-service --tcp-service 10.192.16.169:7232 for aqs2007.codfw.wmnet
[15:28:53] as an example
[15:29:04] you can paste them here if you want a review before running them
[15:29:10] this step is manual
[15:30:02] once this is done on both (same on both in codfw), we can move on to eqiad
[15:30:18] going into a meeting but will keep an eye here
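
Before deleting anything, it helps to see how the service is actually keyed in the kernel's IPVS table. A minimal sketch to run on each balancer; the grep context width is arbitrary:

    # List the live IPVS table numerically; the TCP line on port 7232 is the
    # virtual service (the 10.2.x.x VIP), and the "->" lines are its realservers.
    sudo ipvsadm --list --numeric | grep -A 12 ':7232'
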
[15:37:16] ipvsadm --delete-service --tcp-service 10.192.0.111:7232 # aqs2001.codfw.wmnet
[15:37:16] ipvsadm --delete-service --tcp-service 10.192.0.210:7232 # aqs2002.codfw.wmnet
[15:37:16] ipvsadm --delete-service --tcp-service 10.192.0.211:7232 # aqs2003.codfw.wmnet
[15:37:16] ipvsadm --delete-service --tcp-service 10.192.0.212:7232 # aqs2004.codfw.wmnet
[15:37:16] ipvsadm --delete-service --tcp-service 10.192.16.42:7232 # aqs2005.codfw.wmnet
[15:37:17] ipvsadm --delete-service --tcp-service 10.192.16.168:7232 # aqs2006.codfw.wmnet
[15:37:17] ipvsadm --delete-service --tcp-service 10.192.16.169:7232 # aqs2007.codfw.wmnet
[15:37:18] ipvsadm --delete-service --tcp-service 10.192.16.17:7232 # aqs2008.codfw.wmnet
[15:37:18] ipvsadm --delete-service --tcp-service 10.192.48.186:7232 # aqs2009.codfw.wmnet
[15:37:19] ipvsadm --delete-service --tcp-service 10.192.48.187:7232 # aqs2010.codfw.wmnet
[15:37:19] ipvsadm --delete-service --tcp-service 10.192.48.188:7232 # aqs2011.codfw.wmnet
[15:37:20] ipvsadm --delete-service --tcp-service 10.192.48.189:7232 # aqs2012.codfw.wmnet
[15:37:26] (none of them have been run as of now)
[15:39:00] brouberol: as you have verified the hosts and the IPs, the rest looks good
[15:39:41] let me know which LVS host you run it on first and we can do the next
[15:39:56] sorry, I can't verify the IPs right now (meeting) but if you want to wait, we can do that as well
[15:40:12] ah, actually, it seems that these are the wrong IPs?
[15:40:13] root@lvs2014:~# ipvsadm --delete-service --tcp-service 10.192.0.111:7232
[15:40:13] No such service
[15:40:23] probably because aqs2001 is not there in the list
[15:40:31] Hosts in IPVS but unknown to PyBal: set(['aqs2007.codfw.wmnet', 'aqs2005.codfw.wmnet', 'aqs2003.codfw.wmnet', 'aqs2008.codfw.wmnet', 'aqs2002.codfw.wmnet', 'aqs2012.codfw.wmnet'])")
[15:40:37] just do these
[15:40:45] oh, I assumed they all needed to be removed
[15:41:25] I assumed so too, but I don't see aqs2001 anywhere?
[15:42:08] hmm, I do see it. are these hosts active?
[15:42:22] AFAICT yes, I got their IPs from cumin
[15:44:46] now that's odd
[15:44:46] root@lvs2014:~# host aqs2002.codfw.wmnet
[15:44:46] aqs2002.codfw.wmnet has address 10.192.0.210
[15:44:46] root@lvs2014:~# ipvsadm --delete-service --tcp-service 10.192.0.210:7232
[15:44:46] No such service
[15:45:05] and I see it as part of
[15:45:05] ipvsadm --list | grep 7232
[15:45:05] TCP 10.2.1.12:7232 wrr
[15:45:05] -> aqs2002.codfw.wmnet:7232 Route 10 0 0
[15:45:05] ...
[15:46:01] let's look when I finish this call (~15 mins)
[15:46:06] ack
[15:47:26] according to https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service, "Run ipvsadm --delete-service --tcp-service addr:port on the LVS servers, where addr needs to match the service IP of the datacenter the LVS server is in"
[15:47:42] I understand that as addr = VIP local to the DC
[15:48:06] so in the case of aqs/codfw, that'd be 10.2.1.12
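
This is also why the earlier deletions failed with "No such service": ipvsadm keys --delete-service by the virtual service address (the VIP), while individual realservers are removed with --delete-server. A sketch of both forms, reusing addresses from this log:

    # Remove a single realserver from a virtual service (the VIP stays up);
    # 10.192.0.210 is aqs2002's address from the paste above.
    sudo ipvsadm --delete-server --tcp-service 10.2.1.12:7232 --real-server 10.192.0.210:7232

    # Remove the whole virtual service and all of its realservers at once,
    # which is what a full decommission wants:
    sudo ipvsadm --delete-service --tcp-service 10.2.1.12:7232
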
[15:59:58] brouberol: back
[16:00:30] yeah, I guess let's remove the service IP directly, just that, and we will see what ipvsadm says
[16:00:38] so just 10.2.1.12:7232
[16:00:57] that worked
[16:01:10] root@lvs2014:~# ipvsadm --list | grep 7232
[16:01:10] root@lvs2014:~#
[16:01:12] cool :) that was 2014?
[16:01:12] ok
[16:01:22] now 2013?
[16:01:26] let me check what the Icinga check thinks
[16:01:30] sg
[16:01:31] yeah, just a sec
[16:02:11] yeah, let's do it!
[16:02:24] 2013
[16:03:32] done!
[16:04:08] cool, thanks :)
[16:04:31] now eqiad, same process: restart pybal on the backup (1020)
[16:04:33] then 1019
[16:05:04] then remove the service IP directly, I guess, for eqiad, which is 10.2.2.12:7232
[16:05:17] on it
[16:06:33] done for 1020. Waiting 5 min
[16:10:41] cool. I am a bit puzzled about why we didn't see aqs2002, for example, above, but hmm, I guess as long as the diff check is fine
[16:11:30] in this case removing the service IP is fine since we are removing the entire service (and really, what I should have suggested instead of removing individual backends)
[16:11:44] gotcha
[16:11:53] are we good to proceed to 1019?
[16:12:08] yeah, please do
[16:12:47] should I wait 5 min before removing the VIP from IPVS btw?
[16:14:19] no, once you have restarted pybal on 1019, let me know and we can remove the IP from both without any delay
[16:14:47] it's been restarted
[16:15:20] I'll remove the VIP from IPVS on 1020 and then 1019
[16:15:42] `sudo ipvsadm --delete-service --tcp-service 10.2.2.12:7232` on both
[16:15:51] all good
[16:17:25] cool, let's check
[16:19:33] brouberol: looking good!
[16:19:37] nice job and thanks :)
[16:19:44] thank _you_
[16:19:53] the last step, as per the LVS page, is removing it from service::catalog and conftool as you please
[16:19:59] and also cleaning up site.pp etc. if required
[16:20:01] but all done from our end
[16:25:03] If I could impose just a bit more: I'm not (yet) familiar with conftool. Does that entail running `confctl select name= set/pooled=no` for all aqs hosts?
[16:25:40] yep
[16:25:46] or directly for the service in this case
[16:27:01] so you can set it to pooled=no, but you should also remove it from conftool-data and hieradata/common/service.yaml completely if desired
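
A sketch of that conftool cleanup; the selector here is an assumption (check conftool-data for the actual tag names before running anything):

    # Review what the selector matches first, then depool everything in it.
    # 'cluster=aqs' is a guess at the conftool object tags; verify it.
    sudo confctl select 'cluster=aqs' get
    sudo confctl select 'cluster=aqs' set/pooled=no

    # Permanent removal also means deleting the hosts from conftool-data and
    # the service entry from hieradata/common/service.yaml in operations/puppet.
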
[16:57:02] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776#9658513 (Papaul)
[17:02:45] netops, Infrastructure-Foundations, ops-codfw, SRE: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803#9658554 (Jhancock.wm) sretest2003 and 2004 have been renamed to their original server names and been offlined (including ssd removal).
[17:15:03] Traffic, Data Products, SRE: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9658600 (Milimetric) @VirginiaPoundstone: Looks like Giuseppe patched varnish to send more requestctls, so maybe that completely or partially solves the problem. I'd have to look throug...
[17:31:27] Traffic, SRE, Data Products (Data Products Sprint 13): Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9658684 (VirginiaPoundstone)
[19:22:04] (PyBalBGPUnstable) firing: (2) PyBal BGP sessions on instance lvs2013 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[19:25:05] what now
[19:40:49] ^ elastic2037 was pooled but down, should resolve
[20:26:11] Traffic, cloud-services-team, VPS-Projects, Puppet (Puppet 7.0): Update traffic project puppetmaster - https://phabricator.wikimedia.org/T360940#9659356 (Andrew) Open→Resolved a: Andrew I've built this new server (traffic-puppetserver-bookworm) and moved hosts over to it fro...
[21:17:27] Traffic, Security-Team, ContentSecurityPolicy, SecTeam-Processed: Stop sending X-Webkit-CSP and X-Webkit-CSP-Report-Only headers - https://phabricator.wikimedia.org/T357479#9659579 (sbassett) a: sbassett→TheDJ
[23:22:04] (PyBalBGPUnstable) firing: (2) PyBal BGP sessions on instance lvs2013 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable