[01:35:30] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) p:05Triage→03Medium
[02:17:10] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul)
[06:08:35] (PurgedHighBacklogQueue) firing: Large backlog queue for purged on cp5025:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5025 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[06:13:35] (PurgedHighBacklogQueue) firing: (2) Large backlog queue for purged on cp5025:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[06:18:35] (PurgedHighBacklogQueue) resolved: (2) Large backlog queue for purged on cp5025:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[07:19:15] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10ayounsi) asw2-b-eqiad:fpc1:1/1 is still showing errors... Next step will be to replace the fiber between the two (already replaced) optics. @Jclark-ctr let me know when woul...
[09:14:55] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-OnFire, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) Same issue with `rcp: /var/run/./vjunos-install.sh: Read-only file system` and then `mount: /dev/ad0s1a : Resource temporarily unavailable`, which...
[09:18:22] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-OnFire, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) Note that removing ` [edit system] - internet-options { - tcp-drop-synfin-set; - no-tcp-reset drop-all-tcp; - } ` is needed otherwi...
[09:52:29] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-OnFire, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) The pre-upgrade went fine on asw1-eqsin, so I guess the ulsfo issue is corrupted storage. The last step for eqsin is a reboot, so I'll maintain...
[14:37:35] 10Traffic, 10SRE: Package and deploy varnish 6.0.11 - https://phabricator.wikimedia.org/T326634 (10ssingh)
[14:47:01] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) @ayounsi @cmooney I have 2 questions 1- I have a total of 17 switches received so 1 is going to be used as the cloudsw in r...
[15:04:57] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10Jclark-ctr) asw2-b-eqiad: fpc1:1/1 Cleaned fiber and replaced optic
[15:05:38] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10ayounsi) 1/ 1 ToR per rack = 8x2 + 1 spare = 17, so indeed 1 dedicated to WMCS 2/ A1 and B1 would make sense, and would match eqiad...
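[Editor's note: the `internet-options` stanza quoted in the 09:18:22 comment above was flattened by the log bot. A hypothetical sketch of the Junos CLI steps that would delete it before the upgrade; statement paths are assumptions and should be verified against T316532 and the device itself:]

```
# Sketch only, not the documented procedure from T316532.
# Removes the internet-options statements quoted above.
[edit]
delete system internet-options tcp-drop-synfin-set
delete system internet-options no-tcp-reset
commit confirmed 5
```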
[17:01:16] (PurgedHighEventLag) firing: (16) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:09:05] (PurgedHighEventLag) resolved: (32) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[18:31:48] 10Traffic, 10Observability-Metrics, 10Patch-For-Review: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 (10herron) #traffic could I ask for your support in deploying this? I'm happy to execute it, but would like to coordinate for awareness and in case of unexpected issues. Tha...
[18:36:23] 10Traffic, 10SRE, 10Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (10BCornwall) Unfortunately, merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/863406/ has caused logspam every ten minutes in /var/log/messages. ` 03:27 brett: BTW.....
[20:47:57] bblack: sukhe: I'd like to restart lvs2009 and 2010 because of https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=lvs2009 - do you have any objections?
[20:49:42] do we know the cause?
[20:50:21] I've reimaged kubestage200* and I set them pooled=inactive at some point
[20:50:21] sorry I just don't have a lot of information to go on. Is this a service in the process of decom or something?
[20:51:23] I'm not sure really why the service is not in IPVS, actually
[20:52:37] the service is not to be decommed
[20:54:28] I think you've just fallen out of the depool threshold or something
[20:54:40] are there any live endpoints for the service?
[20:55:18] which lvs service is this?
[20:55:31] live in terms of production: no. Live in terms of healthy: yes
[20:56:10] like what's the confctl name of the service?
[20:56:50] oh I see it in the alert now
[20:57:03] the lvs service is k8s-ingress-staging
[20:58:58] bblack@cumin1001:~$ confctl select cluster=kubernetes-staging,service=kubesvc,dc=codfw get
[20:59:01] {"kubestage2001.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=kubernetes-staging,service=kubesvc"}
[20:59:04] {"kubestage2002.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=kubernetes-staging,service=kubesvc"}
[20:59:07] so they're both pooled in confctl
[20:59:32] indeed. I've repooled them at some point already
[20:59:34] but both are failing healthchecks
[20:59:37] Jan 10 20:57:42 lvs2009 pybal[10528]: [k8s-ingress-staging_30443 IdleConnection] WARN: kubestage2001.codfw.wmnet (enabled/down/not pooled): Connection to 10.192.0.195:30443 failed.
[20:59:46] Jan 10 20:55:50 lvs2009 pybal[10528]: [k8s-ingress-staging_30443 IdleConnection] WARN: kubestage2002.codfw.wmnet (enabled/down/not pooled): Connection to 10.192.16.137:30443 failed.
[21:00:27] arg, dammit. sorry :/
[21:01:13] is the "PyBal IPVS diff check" a followup of that?
[21:15:32] I assume so
[21:15:36] sorry lots of multitasking today!
[21:16:40] if you look directly on lvs2009 for example, at that IP:port in LVS:
[21:17:05] root@lvs2009:~# ipvsadm -Lnt 10.2.1.69:30443
[21:17:06] Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn
[21:17:08] TCP 10.2.1.69:30443 wrr
[21:17:19] np. Thanks for helping me out!
[21:17:21] ^ there are no lines after that for individually pooled servers at all
[21:17:58] normally depool_threshold would prevent removing the last server in this case, but I suspect because they came up from inactive straight to failing healthchecks, they were never pooled in the first place or something.
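[Editor's note: the confctl `get` output pasted above is one JSON object per line, each with a hostname key plus a "tags" key. A minimal sketch for summarizing pooled state from such a paste; this is an illustrative helper assuming that format, not part of the conftool CLI:]

```python
import json


def pooled_hosts(confctl_lines):
    """Parse confctl 'get' output (one JSON object per line) and
    return {hostname: pooled_state}.

    Assumes the format pasted in the log above: each object has
    exactly one host key alongside a "tags" key."""
    state = {}
    for line in confctl_lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        for key, value in obj.items():
            if key != "tags":
                state[key] = value["pooled"]
    return state
```

Run against the two lines pasted at 20:59:01 and 20:59:04, this reports both kubestage hosts as pooled "yes", matching the reading at 20:59:07 -- which is exactly why the pybal WARN lines (enabled/down/not pooled) point at failing healthchecks rather than confctl state.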
[21:18:07] the state machine for managing this stuff is not ideal in all edge cases :)
[21:18:29] yeah, it's just the two nodes and I probably made it worse by setting them to inactive
[21:19:05] but I think if you fix the healthchecks, the pybal alerts should resolve
[21:19:45] ack. I wasn't sure about the diff check (if that resolves on its own)
[21:19:53] do you think it would be okay to leave it like this for a couple of hours (my night)?
[21:20:17] I probably need a fresh start to figure out what I broke exactly on the kubernetes side of things
[21:21:33] I'm refraining from ack'ing because AIUI that would ack other (real) issues as well that might arise
[21:28:26] yeah I'm not actually sure how ACK works with this. The message would change if another one arose.
[21:28:31] hmmm
[21:29:08] in any case, it's not the only CRIT we have active.
[21:29:20] that makes me feel better :D
[21:29:25] I assume even with an ACK, it will continue alerting on IRC if the text changes to add new failed services
[21:30:20] fwiw I think I know what the problem is, but it will take some time to fix unfortunately
[21:30:56] I'll let the oncall people know and will get back to it first thing tomorrow
[21:31:57] ok sounds good!
[21:32:09] thanks for your help!
[21:32:12] np!
[21:52:56] 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10nskaggs) 05In progress→03Resolved As https://wikitech.wikimed...
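[Editor's note: the depool_threshold behavior discussed above ("normally depool_threshold would prevent removing the last server") can be sketched as a toy invariant. This is an illustrative simplification, not PyBal's actual code; the function name and exact semantics are assumptions:]

```python
def can_depool(total_servers, currently_pooled, depool_threshold):
    """Return True if one more server may be depooled without the
    pooled fraction dropping below the threshold.

    Hypothetical simplification of the depool-threshold idea: the
    load balancer refuses to depool a failing server when doing so
    would leave fewer than depool_threshold * total_servers pooled.
    """
    return (currently_pooled - 1) >= depool_threshold * total_servers
```

With two servers and a 0.5 threshold, depooling the first is allowed (one remains, satisfying the floor of one), but depooling the last is refused. As bblack notes at 21:17:58, though, hosts that come back from inactive already failing healthchecks were arguably never pooled, so this guard never engages, which is how the service ended up with an empty real-server list in ipvsadm.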
[22:22:57] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH)
[22:23:04] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH)
[22:24:26] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH) @bblack, The ordering task had the racking details populated by @kofori but I suspect there is a mistake in them. This order and racking is to replace dns100[12] and authdns1001...
[22:42:41] 10Traffic, 10DC-Ops, 10ops-codfw: Q3:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH)
[22:42:51] 10Traffic, 10DC-Ops, 10ops-codfw: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH)
[22:43:35] 10Traffic, 10DC-Ops, 10ops-codfw: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH)
[22:44:56] 10Traffic, 10DC-Ops, 10ops-codfw: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH) a:03BBlack @bblack, The racking details provided on ordering task T325230 list hostnames dns200[345] for this, but they are replacing dns200[12] and authdns2001. Should these instead b...
[23:04:23] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH) a:05BBlack→03Jclark-ctr >>! In T325231#8514793, @KOfori wrote: >>>! In T325231#8514647, @RobH wrote: >>>>! In T325231#8514232, @KOfori wrote: >>> @RobH looks good. Approved. >...
[23:04:42] 10Traffic, 10DC-Ops, 10SRE, 10ops-codfw: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH) a:05BBlack→03Papaul >>! In T325231#8514793, @KOfori wrote: >>>! In T325231#8514647, @RobH wrote: >>>>! In T325231#8514232, @KOfori wrote: >>> @RobH looks good. Approved. >> >...