[08:55:01] Traffic, Automoderator, Data Products, Product-Analytics, and 2 others: Add revision ID to X-Analytics header - https://phabricator.wikimedia.org/T346350#9656746 (phuedx) >>! In T346350#9655625, @mpopov wrote: > So if I'm interpreting that table correctly, we can trust `rev_id` in X-An...
[10:05:50] Traffic, Data-Engineering, Observability-Logging, Event-Platform, Patch-For-Review: Remove extra fields currently sent to Kafka - https://phabricator.wikimedia.org/T360642#9656880 (Fabfur) >>! In T360642#9655231, @Ottomata wrote: >> meta.id and meta.request_id > > `meta.id` is used to unique...
[10:47:45] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Migrate changeprop to mw-api-int - https://phabricator.wikimedia.org/T360767#9657017 (Clement_Goubert) Open→In progress
[11:34:30] Traffic, Data-Engineering, Observability-Logging, Patch-For-Review: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109#9657223 (Fabfur)
[11:49:12] Traffic, Automoderator, Data Products, Product-Analytics, and 2 others: Add revision ID to X-Analytics header - https://phabricator.wikimedia.org/T346350#9657270 (Samwalton9-WMF) Thank you both for confirming! :)
[12:14:55] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Migrate changeprop to mw-api-int - https://phabricator.wikimedia.org/T360767#9657385 (Clement_Goubert) `mw-api-int` is now receiving all calls to `mwapi_uri` from changeprop {F43323601} There are still calls coming from the `ChangePropagation/WM...
[12:16:08] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152#9657390 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `testvm2006.codfw.wmnet` - testvm2006.codfw.wmnet (**FAIL**) - Do...
[12:18:13] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Migrate changeprop to mw-api-int - https://phabricator.wikimedia.org/T360767#9657393 (Clement_Goubert) In progress→Resolved
[12:25:07] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9657411 (Clement_Goubert)
[13:10:42] netops, Infrastructure-Foundations, SRE: Move public-vlan host BGP peerings from CRs to top-of-rack switches in codfw - https://phabricator.wikimedia.org/T360772#9657554 (ayounsi) > So we need to decide if this imbalance for local queries is going to be an issue. I think load is the main thing to loo...
[14:23:31] sukhe: Hi! If you're around, I'd appreciate some pairing on decommissioning the aqs LVS service, as you kindly offered last week. I've prepped the necessary patches; I can add you as a reviewer if needed
[14:23:33] Thank you!
[14:26:13] brouberol: hi!
[14:26:16] yeah, let's do it
[14:27:28] alright, so I've prepped https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013501 to remove the realserver pool and move the service back to state: lvs_setup
[14:27:44] and https://gerrit.wikimedia.org/r/c/operations/dns/+/1013500 to remove the DNS records
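
Before the DNS change lands, it can be worth capturing a baseline. A minimal sketch, using the service records that come up later in this log; they should resolve now and return NXDOMAIN once the change is merged and authdns-update has run:

    # Pre-flight: confirm the aqs service records still resolve before removal;
    # after the DNS change is deployed, both should return NXDOMAIN.
    host aqs.svc.eqiad.wmnet
    host aqs.svc.codfw.wmnet
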
[14:29:17] brouberol: thanks, we will go over them
[14:29:21] have you silenced the network probes?
[14:30:41] I've silenced everything having to do with the AQS service being down under https://alerts.wikimedia.org/?q=aqs
[14:30:56] ah thanks
[14:30:58] yep
[14:31:20] looking at the DNS changes now since that's the next step
[14:33:08] brouberol: looks good, please merge it and run authdns-update (or let me know if you want me to :)
[14:33:28] ack! I've had to rebase the patch, as gerrit was complaining about some conflicts
[14:33:35] yeah
[14:33:48] I'm just waiting on the CI; after that I'll merge and deploy
[14:36:39] all done
[14:37:11] I am assuming you ran authdns-update as well?
[14:37:31] Host aqs.svc.eqiad.wmnet not found: 3(NXDOMAIN)
[14:37:31] Host aqs.svc.codfw.wmnet not found: 3(NXDOMAIN)
[14:37:31] I did
[14:37:46] for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013501, I left a comment; I think you should just change state: lvs_setup and then we will remove the pools later
[14:37:50] brouberol: thanks
[14:37:53] 👍
[14:41:24] done. I've stacked 2 changes
[14:42:24] +1
[14:47:11] 1st CR is merged. I'm running puppet-merge and then `sudo cumin 'A:dnsbox' run-puppet-agent` from cumin
[14:47:16] thanks
[14:51:49] all done
[14:51:58] ok thanks! +1ed the other change
[14:52:14] next step is to merge that and run cumin 'O:lvs::balancer' 'run-puppet-agent'
[14:52:32] ack, doing!
[14:54:05] then we will need to restart pybal on the backup and primary servers in both eqiad and codfw (first the backup)
[14:54:26] in eqiad the order is lvs1020 (wait for 300 seconds, give or take), then lvs1019
[14:54:33] in codfw, lvs2014 (wait), then lvs2013
[14:55:37] so, ssh onto the host, sudo systemctl restart pybal.service, wait, rinse, repeat?
[14:55:43] yep
[14:55:53] and log the commands as well, so we know the timing
[14:56:01] <_joe_> there's a cookbook that does things correctly
[14:56:13] and out of curiosity, are we waiting due to BGP?
[14:56:14] <_joe_> in terms of checking all the bgp sessions are back up
[14:56:22] <_joe_> brouberol: yes
[14:56:40] _joe_: the cookbook has been broken for a while (because, well, we never fixed the broken aliases)
[14:56:54] <_joe_> sukhe: you can use it on the individual hosts though
[14:57:02] <_joe_> and it waits for the bgp reestablishment
[14:57:04] <_joe_> anyways
[14:57:12] yeah, I guess. I am just more used to doing it manually
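
A sketch of the manual rolling restart described above, assuming passwordless sudo over ssh; the host order and the rough 300-second settle time come from the instructions in this exchange, and the cookbook, when healthy, does the same with proper BGP session checks:

    # eqiad: restart pybal on the backup balancer first, then the primary
    # once the BGP sessions have had time to re-establish.
    ssh lvs1020.eqiad.wmnet sudo systemctl restart pybal.service
    sleep 300    # rough settle time from the chat; watch the pybal-bgp dashboard
    ssh lvs1019.eqiad.wmnet sudo systemctl restart pybal.service
    # codfw follows the same pattern: lvs2014 (backup), then lvs2013.
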
[14:57:20] <_joe_> sukhe: I came around to ask for a link to the esams maint task
[14:57:26] <_joe_> for wednesday
[14:57:36] https://phabricator.wikimedia.org/T360430
[14:57:39] <_joe_> as we're still pooled with applayer traffic in a single dc
[14:57:45] timing is still not decided but once it is, I will send an email
[14:57:49] it's kinda up in the air right now
[14:57:52] <_joe_> well yes
[14:57:57] <_joe_> pump the brakes a sec :)
[14:58:04] do you want us to delay it?
[14:58:12] <_joe_> 1 sec, let me read the task
[14:58:16] <_joe_> hopefully not
[14:58:17] take your time
[14:58:43] https://phabricator.wikimedia.org/T360430#9642677 is the most relevant comment fwiw
[14:59:33] <_joe_> sukhe: one thing I am not sure about is
[14:59:41] <_joe_> if you're reimaging the hosts and swapping disks
[14:59:49] <_joe_> you do start with a 100% cold cache anyways
[14:59:55] <_joe_> or am I missing something?
[15:00:15] we are putting in the disks first, repooling and then reimaging one by one
[15:01:09] <_joe_> ok, that still sounds like a repool with cold cache, right?
[15:01:33] <_joe_> or are we *adding* the disks?
[15:01:47] ATS cache won't be cold though?
[15:02:07] adding additional new disks
[15:02:10] to text
[15:02:11] <_joe_> ah ok sorry, the text of the task seemed to imply a *swap*
[15:02:15] <_joe_> hence my confusion
[15:02:33] <_joe_> ok, then it all makes sense
[15:02:35] no, just bringing the configuration up to dual NVMes for text, just like upload
[15:02:40] ok :)
[15:02:54] <_joe_> I would mostly ask to wait with the reimaging until codfw is repooled
[15:03:06] sure, that is fine
[15:03:08] <_joe_> so the afternoon of the 27th in eu time, presumably
[15:03:21] also the timing is not confirmed, and I am not sure given that this Friday is Good Friday
[15:03:30] so let's see; I will send an email and we can coordinate
[15:03:54] <_joe_> the only thing to keep in mind is that currently we're running at 75% utilization in eqiad
[15:04:02] <_joe_> so additional load isn't great
[15:06:49] (PyBalBGPUnstable) firing: PyBal BGP sessions on instance lvs2014 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=codfw%20prometheus/ops&var-server=lvs2014 - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[15:06:51] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776#9657884 (Papaul)
[15:07:44] yeah, fair. we can delay reimaging the disks for a while, not an issue
[15:09:11] sukhe: I take it I have to wait for the BGP established sessions to go back to 1 in https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=codfw+prometheus%2Fops&var-server=lvs2014&orgId=1&from=now-1h&to=now ?
[15:11:57] brouberol: all good, please proceed with lvs2019!
[15:12:04] er, 2013 :)
[15:12:25] ack, onto 2013
[15:18:31] ok to proceed with lvs1020?
[15:18:40] brouberol: checking
[15:21:49] (PyBalBGPUnstable) firing: (2) PyBal BGP sessions on instance lvs2013 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[15:22:37] brouberol: some unrelated cleanup because of the restart, gimme a sec
[15:22:53] sure, thanks
[15:26:14] brouberol: so now we have to do some manual IPVS cleanup
[15:26:40] ipvsadm --delete-service --tcp-service addr:port for the aqs servers in codfw
[15:26:54] aqs2007.codfw.wmnet, aqs2005.codfw.wmnet, aqs2003.codfw.wmnet, aqs2008.codfw.wmnet, aqs2002.codfw.wmnet, aqs2012.codfw.wmnet
[15:27:31] on lvs2013?
[15:28:03] on both, actually
[15:28:08] understood
[15:28:22] you can get the IPs from netbox since we just killed them in the DNS repo
[15:28:28] but basically
[15:28:49] ipvsadm --delete-service --tcp-service 10.192.16.169:7232 for aqs2007.codfw.wmnet
[15:28:53] as an example
[15:29:04] you can paste them here if you want a review before running them
[15:29:10] this step is manual
[15:30:02] once this is done on both (same on both in codfw), we can move on to eqiad
[15:30:18] going into a meeting but will keep an eye here
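
Before deleting anything, it helps to see how the service is actually keyed in the kernel's IPVS table. A minimal sketch to run on each balancer; the grep context width is arbitrary:

    # List the live IPVS table numerically; the TCP line on port 7232 is the
    # virtual service (the 10.2.x.x VIP), and the "->" lines are its realservers.
    sudo ipvsadm --list --numeric | grep -A 12 ':7232'
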
[15:37:16] ipvsadm --delete-service --tcp-service 10.192.0.111:7232 # aqs2001.codfw.wmnet
[15:37:16] ipvsadm --delete-service --tcp-service 10.192.0.210:7232 # aqs2002.codfw.wmnet
[15:37:16] ipvsadm --delete-service --tcp-service 10.192.0.211:7232 # aqs2003.codfw.wmnet
[15:37:16] ipvsadm --delete-service --tcp-service 10.192.0.212:7232 # aqs2004.codfw.wmnet
[15:37:16] ipvsadm --delete-service --tcp-service 10.192.16.42:7232 # aqs2005.codfw.wmnet
[15:37:17] ipvsadm --delete-service --tcp-service 10.192.16.168:7232 # aqs2006.codfw.wmnet
[15:37:17] ipvsadm --delete-service --tcp-service 10.192.16.169:7232 # aqs2007.codfw.wmnet
[15:37:18] ipvsadm --delete-service --tcp-service 10.192.16.17:7232 # aqs2008.codfw.wmnet
[15:37:18] ipvsadm --delete-service --tcp-service 10.192.48.186:7232 # aqs2009.codfw.wmnet
[15:37:19] ipvsadm --delete-service --tcp-service 10.192.48.187:7232 # aqs2010.codfw.wmnet
[15:37:19] ipvsadm --delete-service --tcp-service 10.192.48.188:7232 # aqs2011.codfw.wmnet
[15:37:20] ipvsadm --delete-service --tcp-service 10.192.48.189:7232 # aqs2012.codfw.wmnet
[15:37:26] (none of them have been run as of now)
[15:39:00] brouberol: as you have verified the hosts and the IPs, the rest looks good
[15:39:41] let me know which LVS host you run it on first and we can do the next
[15:39:56] sorry, I can't verify the IPs right now (meeting) but if you want to wait, we can do that as well
[15:40:12] ah, actually, it seems that these are the wrong IPs?
[15:40:13] root@lvs2014:~# ipvsadm --delete-service --tcp-service 10.192.0.111:7232
[15:40:13] No such service
[15:40:23] probably because aqs2001 is not there in the list
[15:40:31] Hosts in IPVS but unknown to PyBal: set(['aqs2007.codfw.wmnet', 'aqs2005.codfw.wmnet', 'aqs2003.codfw.wmnet', 'aqs2008.codfw.wmnet', 'aqs2002.codfw.wmnet', 'aqs2012.codfw.wmnet'])")
[15:40:37] just do these
[15:40:45] oh, I assumed they all needed to be removed
[15:41:25] I assumed so too, but I don't see aqs2001 anywhere?
[15:42:08] hmm, I do see it. are these hosts active?
[15:42:22] AFAICT yes, I got their IPs from cumin
[15:44:46] now that's odd
[15:44:46] root@lvs2014:~# host aqs2002.codfw.wmnet
[15:44:46] aqs2002.codfw.wmnet has address 10.192.0.210
[15:44:46] root@lvs2014:~# ipvsadm --delete-service --tcp-service 10.192.0.210:7232
[15:44:46] No such service
[15:45:05] and I see it as part of
[15:45:05] ipvsadm --list | grep 7232
[15:45:05] TCP 10.2.1.12:7232 wrr
[15:45:05] -> aqs2002.codfw.wmnet:7232 Route 10 0 0
[15:45:05] ...
[15:46:01] let's look when I finish this call (~15 mins)
[15:46:06] ack
[15:47:26] according to https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service, "Run ipvsadm --delete-service --tcp-service addr:port on the LVS servers, where addr needs to match the service IP of the datacenter the LVS server is in"
[15:47:42] I understand that as addr = VIP local to the DC
[15:48:06] so in the case of aqs/codfw, that'd be 10.2.1.12
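
This is also why the earlier deletions failed with "No such service": ipvsadm keys --delete-service by the virtual service address (the VIP), while individual realservers are removed with --delete-server. A sketch of both forms, reusing addresses from this log:

    # Remove a single realserver from a virtual service (the VIP stays up);
    # 10.192.0.210 is aqs2002's address from the paste above.
    sudo ipvsadm --delete-server --tcp-service 10.2.1.12:7232 --real-server 10.192.0.210:7232

    # Remove the whole virtual service and all of its realservers at once,
    # which is what a full decommission wants:
    sudo ipvsadm --delete-service --tcp-service 10.2.1.12:7232
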
[15:59:58] brouberol: back
[16:00:30] yeah, I guess let's remove the service IP directly, just that, and we will see what ipvsadm says
[16:00:38] so just 10.2.1.12:7232
[16:00:57] that worked
[16:01:10] root@lvs2014:~# ipvsadm --list | grep 7232
[16:01:10] root@lvs2014:~#
[16:01:12] cool :) that was 2014?
[16:01:12] ok
[16:01:22] now 2013?
[16:01:26] let me check what the Icinga check thinks
[16:01:30] sg
[16:01:31] yeah, just a sec
[16:02:11] yeah, let's do it!
[16:02:24] 2013
[16:03:32] done!
[16:04:08] cool, thanks :)
[16:04:31] now eqiad, same process: restart pybal on the backup (1020)
[16:04:33] then 1019
[16:05:04] then remove the service IP directly, I guess, for eqiad, which is 10.2.2.12:7232
[16:05:17] on it
[16:06:33] done for 1020. Waiting 5 min
[16:10:41] cool. I am a bit puzzled about why we didn't see aqs2002, for example, above, but hmm, I guess as long as the diff check is fine
[16:11:30] in this case removing the service IP is fine since we are removing the entire service (and really, what I should have suggested instead of removing individual backends)
[16:11:44] gotcha
[16:11:53] are we good to proceed to 1019?
[16:12:08] yeah, please do
[16:12:47] should I wait 5 min before removing the VIP from IPVS btw?
[16:14:19] no, once you have restarted pybal on 1019, let me know and we can remove the IP from both without any delay
[16:14:47] it's been restarted
[16:15:20] I'll remove the VIP from IPVS on 1020 and then 1019
[16:15:42] `sudo ipvsadm --delete-service --tcp-service 10.2.2.12:7232` on both
[16:15:51] all good
[16:17:25] cool, let's check
[16:19:33] brouberol: looking good!
[16:19:37] nice job and thanks :)
[16:19:44] thank _you_
[16:19:53] the last step, as per the LVS page, is removing it from service::catalog and conftool as you please
[16:19:59] and also cleaning up site.pp etc. if required
[16:20:01] but all done from our end
[16:25:03] If I could impose just a bit more: I'm not (yet) familiar with conftool. Does that entail running `confctl select name= set/pooled=no` for all aqs hosts?
[16:25:40] yep
[16:25:46] or directly for the service in this case
[16:27:01] so you can set it to pooled=no, but you should also remove it from conftool-data and hieradata/common/service.yaml completely if desired
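
A sketch of that conftool cleanup; the selector here is an assumption (check conftool-data for the actual tag names before running anything):

    # Review what the selector matches first, then depool everything in it.
    # 'cluster=aqs' is a guess at the conftool object tags; verify it.
    sudo confctl select 'cluster=aqs' get
    sudo confctl select 'cluster=aqs' set/pooled=no

    # Permanent removal also means deleting the hosts from conftool-data and
    # the service entry from hieradata/common/service.yaml in operations/puppet.
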
[16:57:02] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Decom asw-b-codfw switch stack - https://phabricator.wikimedia.org/T360776#9658513 (Papaul)
[17:02:45] netops, Infrastructure-Foundations, ops-codfw, SRE: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803#9658554 (Jhancock.wm) sretest2003 and 2004 have been renamed to their original server names and been offlined (including ssd removal).
[17:15:03] Traffic, Data Products, SRE: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9658600 (Milimetric) @VirginiaPoundstone: Looks like Giuseppe patched varnish to send more requestctls, so maybe that completely or partially solves the problem. I'd have to look throug...
[17:31:27] Traffic, SRE, Data Products (Data Products Sprint 13): Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577#9658684 (VirginiaPoundstone)
[19:22:04] (PyBalBGPUnstable) firing: (2) PyBal BGP sessions on instance lvs2013 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[19:25:05] what now
[19:40:49] ^ elastic2037 was pooled but down, should resolve
[20:26:11] Traffic, cloud-services-team, VPS-Projects, Puppet (Puppet 7.0): Update traffic project puppetmaster - https://phabricator.wikimedia.org/T360940#9659356 (Andrew) Open→Resolved a: Andrew I've built this new server (traffic-puppetserver-bookworm) and moved hosts over to it fro...
[21:17:27] Traffic, Security-Team, ContentSecurityPolicy, SecTeam-Processed: Stop sending X-Webkit-CSP and X-Webkit-CSP-Report-Only headers - https://phabricator.wikimedia.org/T357479#9659579 (sbassett) a: sbassett→TheDJ
[23:22:04] (PyBalBGPUnstable) firing: (2) PyBal BGP sessions on instance lvs2013 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable