[01:35:30] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) p:05Triage→03Medium
[02:17:10] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul)
[06:08:35] (PurgedHighBacklogQueue) firing: Large backlog queue for purged on cp5025:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin%20prometheus/ops&var-instance=cp5025 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[06:13:35] (PurgedHighBacklogQueue) firing: (2) Large backlog queue for purged on cp5025:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[06:18:35] (PurgedHighBacklogQueue) resolved: (2) Large backlog queue for purged on cp5025:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[07:19:15] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10ayounsi) asw2-b-eqiad:fpc1:1/1 is still showing errors... Next step will be to replace the fiber between the two (already replaced) optics. @Jclark-ctr let me know when woul...
[09:14:55] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-OnFire, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) Same issue with `rcp: /var/run/./vjunos-install.sh: Read-only file system` and then `mount: /dev/ad0s1a : Resource temporarily unavailable`, which...
[09:18:22] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-OnFire, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) Note that removing ` [edit system] - internet-options { - tcp-drop-synfin-set; - no-tcp-reset drop-all-tcp; - } ` is needed otherwi...
[09:52:29] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-OnFire, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) The pre-upgrade went fine on asw1-eqsin, so I guess the ulsfo issue is corrupted storage. The last step for eqsin is a reboot, so I'll maintain...
[14:37:35] 10Traffic, 10SRE: Package and deploy varnish 6.0.11 - https://phabricator.wikimedia.org/T326634 (10ssingh)
[14:47:01] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) @ayounsi @cmooney I have 2 questions 1- I have a total of 17 switches received so 1 is going to be used as the cloudsw in r...
[15:04:57] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10Jclark-ctr) asw2-b-eqiad: fpc1:1/1 Cleaned fiber and replaced optic
[15:05:38] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10ayounsi) 1/ 1 ToR per rack = 8x2 + 1 spare = 17, so indeed 1 dedicated to WMCS 2/ A1 and B1 would make sense, and would match eqiad...
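[Editor's note: the `internet-options` stanza quoted in the 09:18:22 comment above was flattened by the log bot. A hypothetical sketch of the Junos CLI steps that would delete it before the upgrade; statement paths are assumptions and should be verified against T316532 and the device itself:]

```
# Sketch only, not the documented procedure from T316532.
# Removes the internet-options statements quoted above.
[edit]
delete system internet-options tcp-drop-synfin-set
delete system internet-options no-tcp-reset
commit confirmed 5
```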
[17:01:16] (PurgedHighEventLag) firing: (16) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:09:05] (PurgedHighEventLag) resolved: (32) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[18:31:48] 10Traffic, 10Observability-Metrics, 10Patch-For-Review: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 (10herron) #traffic could I ask for your support in deploying this? I'm happy to execute it, but would like to coordinate for awareness and in case of unexpected issues. Tha...
[18:36:23] 10Traffic, 10SRE, 10Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (10BCornwall) Unfortunately, merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/863406/ has caused logspam every ten minutes in /var/log/messages. ` 03:27 brett: BTW.....
[20:47:57] bblack: sukhe: I'd like to restart lvs2009 and 2010 because of https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=lvs2009 - do you have any objections?
[20:49:42] do we know the cause?
[20:50:21] I've reimaged kubestage200* and I set them pooled=inactive at some point
[20:50:21] sorry I just don't have a lot of information to go on. Is this a service in the process of decom or something?
[20:51:23] I'm not sure really why the service is not in IPVS, actually
[20:52:37] the service is not to be decommed
[20:54:28] I think you've just fallen out of the depool threshold or something
[20:54:40] are there any live endpoints for the service?
[20:55:18] which lvs service is this?
[20:55:31] live in terms of production: no. Live in terms of healthy: yes
[20:56:10] like what's the confctl name of the service?
[20:56:50] oh I see it in the alert now
[20:57:03] the lvs service is k8s-ingress-staging
[20:58:58] bblack@cumin1001:~$ confctl select cluster=kubernetes-staging,service=kubesvc,dc=codfw get
[20:59:01] {"kubestage2001.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=kubernetes-staging,service=kubesvc"}
[20:59:04] {"kubestage2002.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=kubernetes-staging,service=kubesvc"}
[20:59:07] so they're both pooled in confctl
[20:59:32] indeed. I've repooled them at some point already
[20:59:34] but both are failing healthchecks
[20:59:37] Jan 10 20:57:42 lvs2009 pybal[10528]: [k8s-ingress-staging_30443 IdleConnection] WARN: kubestage2001.codfw.wmnet (enabled/down/not pooled): Connection to 10.192.0.195:30443 failed.
[20:59:46] Jan 10 20:55:50 lvs2009 pybal[10528]: [k8s-ingress-staging_30443 IdleConnection] WARN: kubestage2002.codfw.wmnet (enabled/down/not pooled): Connection to 10.192.16.137:30443 failed.
[21:00:27] arg, dammit. sorry :/
[21:01:13] is the "PyBal IPVS diff check" a followup of that?
[21:15:32] I assume so
[21:15:36] sorry lots of multitasking today!
[21:16:40] if you look directly on lvs2009 for example, at that IP:port in LVS:
[21:17:05] root@lvs2009:~# ipvsadm -Lnt 10.2.1.69:30443
[21:17:06] Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn
[21:17:08] TCP 10.2.1.69:30443 wrr
[21:17:19] np. Thanks for helping me out!
[21:17:21] ^ there are no lines after that for individually pooled servers at all
[21:17:58] normally depool_threshold would prevent removing the last server in this case, but I suspect because they came up from inactive straight to failing healthchecks, they were never pooled in the first place or something.
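[Editor's note: the confctl `get` output pasted above is one JSON object per line, each with a hostname key plus a "tags" key. A minimal sketch for summarizing pooled state from such a paste; this is an illustrative helper assuming that format, not part of the conftool CLI:]

```python
import json


def pooled_hosts(confctl_lines):
    """Parse confctl 'get' output (one JSON object per line) and
    return {hostname: pooled_state}.

    Assumes the format pasted in the log above: each object has
    exactly one host key alongside a "tags" key."""
    state = {}
    for line in confctl_lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        for key, value in obj.items():
            if key != "tags":
                state[key] = value["pooled"]
    return state
```

Run against the two lines pasted at 20:59:01 and 20:59:04, this reports both kubestage hosts as pooled "yes", matching the reading at 20:59:07 -- which is exactly why the pybal WARN lines (enabled/down/not pooled) point at failing healthchecks rather than confctl state.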
[21:18:07] the state machine for managing this stuff is not ideal in all edge cases :)
[21:18:29] yeah, it's just the two nodes and I probably made it worse by setting them to inactive
[21:19:05] but I think if you fix the healthchecks, the pybal alerts should resolve
[21:19:45] ack. I wasn't sure about the diff check (if that resolves on its own)
[21:19:53] do you think it would be okay to leave it like this for a couple of hours (my night)?
[21:20:17] I probably need a fresh start to figure out what I broke exactly on the kubernetes side of things
[21:21:33] I'm refraining from ack'ing because AIUI that would ack other (real) issues as well that might arise
[21:28:26] yeah I'm not actually sure how ACK works with this. The message would change if another one arose.
[21:28:31] hmmm
[21:29:08] in any case, it's not the only CRIT we have active.
[21:29:20] that makes me feel better :D
[21:29:25] I assume even with an ACK, it will continue alerting on IRC if the text changes to add new failed services
[21:30:20] fwiw I think I know what the problem is, but it will take some time to fix unfortunately
[21:30:56] I'll let the oncall people know and will get back to it first thing tomorrow
[21:31:57] ok sounds good!
[21:32:09] thanks for your help!
[21:32:12] np!
[21:52:56] 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10nskaggs) 05In progress→03Resolved As https://wikitech.wikimed...
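[Editor's note: the depool_threshold behavior discussed above ("normally depool_threshold would prevent removing the last server") can be sketched as a toy invariant. This is an illustrative simplification, not PyBal's actual code; the function name and exact semantics are assumptions:]

```python
def can_depool(total_servers, currently_pooled, depool_threshold):
    """Return True if one more server may be depooled without the
    pooled fraction dropping below the threshold.

    Hypothetical simplification of the depool-threshold idea: the
    load balancer refuses to depool a failing server when doing so
    would leave fewer than depool_threshold * total_servers pooled.
    """
    return (currently_pooled - 1) >= depool_threshold * total_servers
```

With two servers and a 0.5 threshold, depooling the first is allowed (one remains, satisfying the floor of one), but depooling the last is refused. As bblack notes at 21:17:58, though, hosts that come back from inactive already failing healthchecks were arguably never pooled, so this guard never engages, which is how the service ended up with an empty real-server list in ipvsadm.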
[22:22:57] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH)
[22:23:04] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH)
[22:24:26] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH) @bblack, The ordering task had the racking details populated by @kofori but I suspect there is a mistake in them. This order and racking is to replace dns100[12] and authdns1001...
[22:42:41] 10Traffic, 10DC-Ops, 10ops-codfw: Q3:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH)
[22:42:51] 10Traffic, 10DC-Ops, 10ops-codfw: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH)
[22:43:35] 10Traffic, 10DC-Ops, 10ops-codfw: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH)
[22:44:56] 10Traffic, 10DC-Ops, 10ops-codfw: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH) a:03BBlack @bblack, The racking details provided on ordering task T325230 list hostnames dns200[345] for this, but they are replacing dns200[12] and authdns2001. Should these instead b...
[23:04:23] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH) a:05BBlack→03Jclark-ctr >>! In T325231#8514793, @KOfori wrote: >>>! In T325231#8514647, @RobH wrote: >>>>! In T325231#8514232, @KOfori wrote: >>> @RobH looks good. Approved. >...
[23:04:42] 10Traffic, 10DC-Ops, 10SRE, 10ops-codfw: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH) a:05BBlack→03Papaul >>! In T325231#8514793, @KOfori wrote: >>>! In T325231#8514647, @RobH wrote: >>>>! In T325231#8514232, @KOfori wrote: >>> @RobH looks good. Approved. >> >...