[03:07:31] (VarnishChildRestarted) firing: varnish-upload restarted on cp4047 - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&viewPanel=66&var-server=cp4047&datasource=ulsfo%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[06:19:48] Traffic: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (Vgutierrez)
[06:21:49] Traffic: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (Vgutierrez) p:Triage→High
[06:27:16] (VarnishChildRestarted) resolved: varnish-upload restarted on cp4047 - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&viewPanel=66&var-server=cp4047&datasource=ulsfo%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[07:11:33] Traffic, SRE: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (Vgutierrez) Free memory on NUMA Node 0 got below the min threshold (1028416 < 1041448): `Node 0 Normal free:1028416kB min:1041448kB low:1303560kB high:1565672kB reserved_highatomic:2048KB active_anon:1800292kB inactiv...
[07:12:25] Traffic, SRE: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (Vgutierrez)
[09:46:11] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudvirt2002-dev.codfw.wmnet wi...
[10:40:43] Traffic, SRE, Patch-For-Review: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (Vgutierrez) After further inspection I don't think that ATS memory increase is enough to explain what we are seeing here, text nodes in ulsfo are using around 326G of RAM but upload ones are usin...
[10:52:44] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudvirt2002-dev.codfw.wmnet with O...
[10:56:49] Traffic, SRE, Patch-For-Review: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (Vgutierrez) In fact it seems like varnish is the one eating the extra memory... in cp4045 (upload) with the following malloc specific config: `-s malloc,283G -s Transient=malloc,10G` varnish is c...
[11:42:49] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudvirt2002-dev.codfw.wmnet wi...
[11:51:40] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudvirt2002-dev.codfw.wmnet with O...
[11:54:12] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudvirt2002-dev.codfw.wmnet wi...
[12:37:55] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudvirt2002-dev.codfw.wmnet with O...
[13:18:15] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[13:30:16] (VarnishChildRestarted) firing: varnish-upload restarted on cp4050 - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&viewPanel=66&var-server=cp4050&datasource=ulsfo%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[13:30:43] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudvirt2003-dev.codfw.wmnet with OS bullseye
[14:12:40] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudvirt2003-dev.codfw.wmnet with OS bullseye completed:...
[15:08:31] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[15:50:21] Traffic, SRE, Patch-For-Review: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (Vgutierrez)
[16:42:25] Traffic, ops-eqiad: Host lvs1014.mgmt is down - https://phabricator.wikimedia.org/T322933 (ssingh)
[16:42:35] Traffic, SRE, Patch-For-Review: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (Vgutierrez) p:High→Medium Lowering the priority after deploying several experiments in upload@ulsfo that could mitigate the issue; see the task description for more details
[16:42:39] Traffic, ops-eqiad: Host lvs1014.mgmt is down - https://phabricator.wikimedia.org/T322933 (ssingh) p:Triage→Medium
[17:27:12] netops, Infrastructure-Foundations: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (cmooney) p:Triage→Medium
[17:30:31] (VarnishChildRestarted) firing: varnish-upload restarted on cp4050 - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&viewPanel=66&var-server=cp4050&datasource=ulsfo%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[17:38:33] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (cmooney) @Jclark-ctr bit of a heads up I'm hoping to get the migration kicked off for those Juniper Spine devices now that we've got the lic...
[17:45:16] (VarnishChildRestarted) resolved: varnish-upload restarted on cp4050 - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&viewPanel=66&var-server=cp4050&datasource=ulsfo%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted