[07:11:26] legoktm: I'm back :)
[07:12:22] ema: welcome back!
[07:12:33] haha ty
[07:15:31] ema: hi!
[07:16:00] I'm ready to add the shellbox LVS if now is a good time
[07:17:01] legoktm: go ahead
[07:20:11] ok, I just merged state: service_setup
[07:20:59] ack
[07:21:45] let me look up what the lvs backup servers are
[07:23:01] actually, I'm not really sure where to look
[07:23:04] legoktm: lvs1016 and lvs2010
[07:23:10] see modules/lvs/manifests/configuration.pp
[07:24:03] TODO: add to wikitech how to find out
[07:24:47] ok, and then lvs1015/lvs2009 are the active ones?
[07:24:52] meanwhile shellbox data is now in etcd: https://config-master.wikimedia.org/pybal/eqiad/shellbox
[07:25:38] legoktm: correct, lvs1015 and lvs2009 are the low-traffic primaries
[07:26:12] ok, I'm going to go ahead with https://gerrit.wikimedia.org/r/c/operations/puppet/+/693960/ now, switching to lvs_setup
[07:27:16] other than checking with dstat that indeed the primaries are getting much more traffic than the secondaries, you can `grep bgp-med /etc/pybal/pybal.conf`
[07:27:38] on the primary it's 0, on the secondary it must be > 0 (100 currently)
[07:28:07] * legoktm nods
[07:28:14] puppet is running on the lvs nodes rn
[07:29:17] ack
[07:30:06] waiting for the expected icinga alerts before I restart pybal on the backups
[07:30:11] excellent
[07:30:24] I see the new service in /etc/pybal/pybal.conf as expected
[07:31:39] so...it's not alerting
[07:31:55] oh, there it is
[07:32:13] yep!
[07:33:38] ema: ok, all good to restart pybal now?
[07:33:59] legoktm: on the secondaries, yes
[07:35:49] nit: next time you may want to mention that these are the secondaries and add the task number to your !log lines :)
[07:36:39] oops
[07:37:01] I see the instances being found by pybal: `journalctl -u pybal --since today | grep shellbox`
[07:37:19] the output of `sudo ipvsadm -L -n` looks correct, I see port 4008 and it mapping to 10.2.1.51
[07:37:25] very nice
[07:38:57] ok, it's been over 2 minutes now, ok to restart on the active ones?
[07:39:08] here's the relevant grafana dashboard: https://grafana.wikimedia.org/d/000000421/pybal?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-server=lvs1016&var-service=shellbox_4008&from=now-1h&to=now
[07:39:40] ooo
[07:42:04] ema: ok on restarting pybal on the active lvs servers? or do we need to wait/check anything else?
[07:42:31] legoktm: I'm just double-checking ipvsadm -L to err on the side of caution, but we can proceed soon
[07:42:46] ok
[07:43:36] legoktm: looks fine, please go ahead!
[07:45:09] restarted
[07:45:57] journalctl and ipvsadm look good
[07:47:22] and so does curl -v https://shellbox.svc.eqiad.wmnet:4008/healthz
[07:47:57] ditto codfw
[07:48:01] wheee
[07:48:10] nice :)
[07:49:01] time to switch to monitoring_setup now?
[07:49:26] sounds good to me
[07:55:47] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=shellbox.svc.eqiad.wmnet and https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=shellbox.svc.codfw.wmnet
[07:56:55] cool
[07:59:28] could you double-check at https://gerrit.wikimedia.org/r/c/operations/dns/+/693965 ? I followed https://wikitech.wikimedia.org/wiki/LVS#Add_the_dns_discovery_record for it
[08:01:54] legoktm: lgtm
[08:03:09] ok, switching to "production" state now
[08:03:28] ack
[08:03:48] er, do I run puppet on A:dns-auth before or after merging the DNS change?
[08:04:42] ema: ^
[08:04:48] after
[08:07:34] legoktm: you also have to run `sudo -i authdns-update` on one authdns server after merging the DNS change, not sure if that's documented properly
[08:07:48] it didn't like it
[08:07:49] error: plugin_geoip: Invalid resource name 'disc-shellbox' detected from zonefile lookup
[08:07:49] error: Name 'shellbox.discovery.wmnet.': resolver plugin 'geoip' rejected resource name 'disc-shellbox'
[08:07:55] (that was authdns-update)
[08:08:20] on which authdns did you run the command?
[08:08:31] authdns1001
[08:11:59] let's see, maybe puppet had to run before merging the dns change actually
[08:13:02] looks like https://phabricator.wikimedia.org/T263518
[08:13:38] ok, let me do a puppet run then
[08:13:46] legoktm: I did that already
[08:13:54] oh, ty :)
[08:14:02] do I still need to authdns-update or did you do that as well?
[08:14:05] trying authdns-update again, still fails
[08:14:09] :|
[08:14:22] we may have to revert the dns change and run authdns-update
[08:14:52] which is the procedure followed by vo.lans to fix T263518
[08:14:52] T263518: dns repository left in a broken state - https://phabricator.wikimedia.org/T263518
[08:15:22] legoktm: do you want to do that or should I?
[08:15:29] I can do it
[08:15:34] ack
[08:16:12] (1) revert the dns change (2) run authdns-update (3) apply the dns change again (4) run authdns-update
[08:16:32] sorry about that, my bad
[08:17:58] > OK - authdns-update successful on all nodes!
[08:18:02] now re-reverting
[08:18:12] alright
[08:19:19] now `grep shellbox /etc/gdnsd/discovery-states` looks fine, which means that applying the DNS patch should work properly
[08:20:27] ok, authdns-update seems happy now
[08:21:12] I spoke too soon
[08:21:16] same issue again
[08:21:57] dns[2001-2002,4001-4002,5001-5002].wikimedia.org (6) updated properly
[08:22:43] the failing ones didn't get a puppet run?
[08:22:54] on authdns2001 it says: Last puppet commit: (762fd1c14a) Legoktm - service: Switch shellbox to monitoring_setup
[08:23:03] while it should be "Switch shellbox to production"
[08:23:57] oh, I thought we did run puppet on all the authdns
[08:24:34] I assumed you had run puppet everywhere, I guess you just meant that one server
[08:24:45] yep
[08:25:07] let's see if we get it right at the 3rd attempt
[08:27:30] ok, puppet running on A:dns-auth now :)
[08:27:35] excellent :)
[08:28:05] we should confirm with `cumin 'A:dns-auth' 'grep shellbox /etc/gdnsd/discovery-states'` before merging the dns change this time
[08:28:14] hehe, I was thinking the exact same :D
[08:31:20] legoktm: I see that eqsin is going down again, is the DNS repo in a proper state?
[08:31:47] yes, nothing undeployed right now
[08:31:58] ack
[08:32:13] see -operations too
[08:33:05] though my cumin puppet run is hanging at authdns5001.wikimedia.org, shockingly
[08:34:06] legoktm: I think puppet *did* manage to run there earlier on, now discovery-state-shellbox is on all authdns' discovery-states
[08:34:39] yep, the grep looks good
[08:38:11] https://en.wiktionary.org/wiki/non_c%27%C3%A8_due_senza_tre we should have known
[08:41:23] that's a new one for me :)
[08:42:55] legoktm: I'd say the instructions were alright except for (1) finding out the active/passive LVS and (2) applying the puppet change to all authdns before merging the dns change
[08:43:00] anything else?
[08:43:17] no, that was it
[08:43:55] for (2) I was going to add the cumin grep to verify puppet ran everywhere, and that if you end up in the busted state, you have to revert the DNS patch first, then run puppet, then re-apply DNS
[08:44:20] oh, also your `journalctl -u pybal --since today | grep shellbox` command in addition to the ipvsadm one
[08:44:36] ...and a link to the grafana dashboard
[08:44:49] perfect
[09:15:44] Traffic, DNS, SRE, netbox, cloud-services-team (Kanban): Move some of wikimediacloud.org 185.15.56.0/23 to Netbox - https://phabricator.wikimedia.org/T268621 (ayounsi) @nskaggs I'm triaging the #netbox tasks. Does WMCS has an opinion on that task or it's fine to proceed?
[09:26:16] netops, SRE: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (aborrero) cool, thanks!
[10:02:15] netops, DC-Ops: Collect and archive KML/KMZ fiber path files for new and existing network circuits - https://phabricator.wikimedia.org/T285136 (faidon)
[10:04:12] Traffic, DNS, SRE, netbox, cloud-services-team (Kanban): Move some of wikimediacloud.org 185.15.56.0/23 to Netbox - https://phabricator.wikimedia.org/T268621 (aborrero) It is fine to proceed. Moreover, after the cloudgw project, some of this may be already on netbox anyway! see https://netbox...
[10:05:59] netops, DC-Ops: Collect and archive KML/KMZ fiber path files for new and existing network circuits - https://phabricator.wikimedia.org/T285136 (Volans) Related netbox feature request with alternative options: https://github.com/netbox-community/netbox/issues/2253
[10:07:52] Traffic, DNS, SRE, netbox, cloud-services-team (Kanban): Move some of wikimediacloud.org 185.15.56.0/23 to Netbox - https://phabricator.wikimedia.org/T268621 (aborrero) The other topics you mentioned: * Regarding the service FQDNs. We don't need them. These FQDNs related to the edge network...
[10:09:41] Traffic, DNS, SRE, netbox, cloud-services-team (Kanban): Move some of wikimediacloud.org 185.15.56.0/23 to Netbox - https://phabricator.wikimedia.org/T268621 (aborrero) * regarding the DNS server addresses. You are right, an intermediate service FQDN might be in order here.
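
Note on the 08:28:05 check: the lesson from the failed authdns-update runs was to confirm that every authdns host already has the shellbox entry in /etc/gdnsd/discovery-states before merging the DNS change. A minimal sketch of that check is below, assuming the per-host grep output has already been collected into a dict. The function names and the sample file content are hypothetical; the real format of discovery-states is not shown in the log, so the check is a plain substring match, the same as the grep.

```python
def hosts_missing_record(grep_output_by_host, record="shellbox"):
    """Return the sorted hosts whose collected output never mentions `record`.

    `grep_output_by_host` maps a hostname to the text collected from that
    host (hypothetical input shape; the log only shows the cumin command).
    """
    missing = []
    for host, output in grep_output_by_host.items():
        # Same test as `grep shellbox`: any line containing the substring.
        if not any(record in line for line in output.splitlines()):
            missing.append(host)
    return sorted(missing)


def safe_to_merge_dns(grep_output_by_host, record="shellbox"):
    """True only if at least one host was checked and none is missing the record."""
    return bool(grep_output_by_host) and not hosts_missing_record(
        grep_output_by_host, record
    )


if __name__ == "__main__":
    # Placeholder content: the real discovery-states lines are an assumption.
    collected = {
        "authdns1001": "placeholder line mentioning shellbox\n",
        "authdns2001": "",  # puppet has not run here yet
    }
    print(hosts_missing_record(collected))
    print(safe_to_merge_dns(collected))
```

If any host is reported missing, the sequence from 08:43:55 applies: revert the DNS patch, run puppet on all authdns hosts, then re-apply the DNS change.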