[00:48:12] Traffic: Enterprise redirects from .Org sites - https://phabricator.wikimedia.org/T296445 (RBrounley_WMF)
[06:57:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:02:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[14:41:05] vgutierrez: o/
[14:41:22] hey
[14:41:53] elukey: what do you need?
[14:42:49] I am working with Tobias on the inference.svc.codfw.wmnet LVS endpoint (the eqiad one has been working fine so far) and I was wondering if later on (in say 30 mins) we could roll out the change on pybals, or if you prefer to skip to next week
[14:48:12] vgutierrez: --^
[14:48:34] sure
[14:48:44] in 30 minutes works for me
[14:48:48] <3
[14:49:04] ok we'll finish pre-steps and submit the code change asap
[15:04:27] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `ping2001.codfw.wmnet` - ping2001.codfw.wmnet (**PASS**) - Dow...
[15:09:36] Traffic, SRE, Wikimedia Enterprise, Wikimedia Enterprise Discussion: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (MNadrofsky) Adding to the Foundational Tech Requests board for Steering Committee intake. This will help us prioritize/resource this work effectively.
[15:10:26] Traffic, Foundational Technology Requests, SRE, Wikimedia Enterprise, Wikimedia Enterprise Discussion: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (MNadrofsky) a:MNadrofsky
[15:12:53] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `ping3001.esams.wmnet` - ping3001.esams.wmnet (**PASS**) - Dow...
[15:15:38] vgutierrez: something interesting - in the service::catalog bits we already have the codfw endpoint (but we added the IP to DNS only today)
[15:15:51] in state lvs_setup
[15:15:59] hmm
[15:17:01] vgutierrez@lvs2009:~$ fgrep inference /etc/pybal/pybal.conf
[15:17:01] vgutierrez@lvs2009:~$ echo $?
[15:17:01] 1
[15:17:08] it doesn't seem to be there
[15:17:45] elukey: on sites only eqiad is listed
[15:17:56] so you need to add codfw there
[15:18:21] ahhh right right
[15:18:23] perfect :)
[15:21:21] you can check with pcc and it should trigger the expected changes on lvs2009 and lvs2010
[15:21:44] I missed that bit, Tobias will file a change in a bit
[15:22:00] \o
[15:22:23] :D
[15:22:37] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `ping1001.eqiad.wmnet` - ping1001.eqiad.wmnet (**PASS**) - Dow...
[15:28:16] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (ayounsi) a:ayounsi All 3 VMs got rebuilt with larger disks, but with the default Debian Buster. @MoritzMuehlenhoff let me know if they need to be re-rebu...
[15:33:12] ETA for that CR? :)
[15:33:20] klausman: ---^
[15:33:36] Soon™
[15:34:06] git is not my friend today
[15:35:23] google says that I should expect rain around 5PM local time.. I guess I'm gonna run under the rain
[15:35:30] no pressure klausman ;P
[15:35:52] ah snap sorry Valentin, we can do tomorrow if you want to go now
[15:35:59] nah, no problem
[15:36:28] https://gerrit.wikimedia.org/r/c/operations/puppet/+/741934 (I think)
[15:36:31] I'm not allergic to the rain like southern Spaniards
[15:36:48] As long as the rain is on the plain
[15:39:42] the CR looks good
[15:40:04] Merged
[15:40:24] do you want to handle the pybal restart or should I?
[15:40:50] Judging from Luca's view of the matter, I am hesitant :D
[15:41:02] let's make klausman do it
[15:41:02] But also curious
[15:41:21] klausman: all the bits are in https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers
[15:41:52] to keep things short (and allow Valentin to run :D) we need to run puppet on lvs2009 and 2010
[15:42:03] 2009 is the primary and 2010 the secondary
[15:42:11] The setup phase is not needed, I suspect?
[15:42:28] yeah it is not needed in this case
[15:42:50] we already did it for the eqiad use case
[15:43:23] hieradata/hosts/lvs2010.yaml:profile::pybal::primary: false
[15:43:31] to double check --^
[15:44:03] so after puppet, we'd need to restart pybal (systemctl restart pybal - nothing fancy) on lvs2010, and check the ipvsadm status
[15:44:16] if everything looks good, same thing on the lvs primary (2009)
[15:44:46] Is there a preferred order for the puppet runs between 9 and 10?
[15:45:33] no no it is fine any order
[15:45:41] puppet will only update the pybal.conf
[15:45:49] 9 running
[15:46:16] any pybal restart needs to be !log-ed in #operations of course
[15:48:12] aye
[15:48:32] 9 completed with success, now running 10
[15:50:04] 10 is also complete
[15:50:13] lovely
[15:50:28] The docs mention something about ACKing etcd alerts in icinga, but I don't see any. Is that a matter of entirely-new services as well?
[15:50:54] it's a question of time actually :)
[15:51:28] you can continue though
[15:51:33] Ok, so should I wait for something to fire? Or proceed with the pybal restart on 10?
[15:51:37] but the alerts will be there soon
[15:51:38] Ok, continuing
[15:52:49] restarted
[15:53:16] yep.. BGP sessions restored
[15:53:58] https://phabricator.wikimedia.org/P17855 lgtm
[15:54:16] nope
[15:54:25] it needs to contain the ml-serve200X nodes
[15:54:27] eqiad missing?
[15:54:30] ah
[15:55:20] lol
[15:55:25] sorry, but you didn't run puppet-merge
[15:55:35] I'm merging it for you :)
[15:55:39] ta
[15:55:52] done
[15:55:56] thanks vgutierrez
[15:55:58] re-running puppet on lvs2010
[15:57:43] klausman: did you run authdns update for the codfw vip?
[15:57:46] Looks better now
[15:58:10] https://phabricator.wikimedia.org/P17857
[15:58:12] https://www.irccloud.com/pastebin/K41woo2b/
[15:58:29] Ah, so phab's pastebin isn't good enough for you, eh? ;)
[15:58:40] hmm I'm lazier than that
[15:58:55] I just paste it on irccloud and it offers me the snippet option automagically
[15:59:15] klausman: we need to authdns-update on the dns hosts
[15:59:15] so lvs2010 is ready
[15:59:19] Ah, right. I'm one of those old geezers that insists on running an actual text IRC client
[15:59:36] elukey: before or after lvs2009?
[15:59:57] before 2010 :D
[16:00:04] well.. 2010 is ready
[16:00:18] we can proceed with 2009, and then we can merge the dns change
[16:00:23] so Valentin will be free to go
[16:00:25] Alrighty
[16:00:27] remember to re-run puppet in lvs2009 :)
[16:01:44] Ok, I'm about to restart pybal on 2010
[16:01:49] no need
[16:01:53] klausman: 2009
[16:01:53] I did it already
[16:01:55] se the log :)
[16:01:57] *see
[16:02:13] Oh. Ok :)
[16:02:38] So after the AuthDNS change, do puppet runs for both 2009 and 2010 or just 2009?
[16:02:46] 2009 :)
[16:02:56] Roger
[16:09:08] ok, inference config snippet is already on pybal.conf @ lvs2009
[16:09:59] About to restart pybal on 2009
[16:10:15] ack
[16:10:38] then we check again ipvsadm and we should be done
[16:11:39] BGP sessions restored and inference is showing up on ipvsadm
[16:12:02] https://www.irccloud.com/pastebin/NiwUpDwM/
[16:12:03] https://phabricator.wikimedia.org/P17858
[16:12:11] thanks for choosing the Traffic edge network services
[16:12:29] <3
[16:12:33] It was an exciting but ultimately rewarding ride
[16:12:52] klausman: you should grep on the service port and not hostname
[16:13:06] from your output it is hard to tell if inference is properly set or not :)
[16:14:25] I wish ipvsadm had better output filtering, but I requested that ca. 2004 and well...
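
Editor's note on the [16:12:52] advice to filter ipvsadm output on the service port rather than the hostname: a minimal Python sketch of such a filter follows. It assumes the usual text layout of `ipvsadm -Ln` (a virtual-service line starting with the protocol, followed by indented `->` real-server lines). The sample addresses and the port 30443 are hypothetical placeholders, not values from this log, and the helper name `filter_ipvsadm_by_port` is made up for illustration.

```python
# Sketch only: filters saved `ipvsadm -Ln` text by virtual-service port.
# The sample below uses documentation addresses and a hypothetical port.

SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.0.2.10:443 wrr
  -> 198.51.100.1:443             Route   10     0          0
TCP  192.0.2.20:30443 wrr
  -> 198.51.100.11:30443          Route   10     0          0
  -> 198.51.100.12:30443          Route   10     0          0
"""


def filter_ipvsadm_by_port(output: str, port: int) -> list[str]:
    """Return each virtual service on `port` followed by its real servers."""
    kept = []
    keep_block = False
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("->"):
            # Real-server line: belongs to the most recent virtual service.
            if keep_block:
                kept.append(line)
            continue
        fields = stripped.split()
        # Virtual-service lines start with the protocol, then address:port.
        if len(fields) >= 2 and fields[0] in ("TCP", "UDP", "SCTP"):
            # Compare the whole port so that 443 does not match 30443.
            keep_block = fields[1].rsplit(":", 1)[-1] == str(port)
            if keep_block:
                kept.append(line)
        else:
            # Header or unrelated line: close any open block.
            keep_block = False
    return kept


if __name__ == "__main__":
    for entry in filter_ipvsadm_by_port(SAMPLE, 30443):
        print(entry)
```

Matching on the full port field, rather than a plain substring grep, keeps the real-server lines attached to their virtual service and avoids a short port number matching inside a longer one.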