[00:07:44] (VarnishHighThreadCount) resolved: (14) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[01:17:18] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (Jhancock.wm) @Papaul dns2003 already exists in netbox. It's in A2.
[01:18:39] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (Papaul) @Jhancock.wm go from dns2004 up
[05:31:18] netops, Infrastructure-Foundations, SRE, netbox, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (Volans) >>! In T296832#8791457, @cmooney wrote: > In terms of next steps we obviously need to keep things consistent....
[07:25:55] netops, Infrastructure-Foundations, SRE: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (ayounsi) Slightly relevant - https://wikitech.wikimedia.org/wiki/Juniper_TLS_certificate_install
[08:42:45] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (cmooney)
[09:14:42] Hi! We turned off cassandra restbase storage for PCS endpoints (mobile-html, summary) for a few weeks for a few wikis just to get a better understanding of how the API would perform. There are some insights about caching on edge that we (content transform team, api platform) would like to get feedback on. Here is the report of what we found out: https://phabricator.wikimedia.org/T314770#8776938
[09:16:14] This effort is related to restbase sunset. Is there anyone available to consult us about next steps related to varnish/ats?
[09:16:30] cc joe
[09:41:35] nemo-yiannis: the graphs showing TTFB data.. where is that TTFB measured?
[09:52:45] it's the number from webrequests data (data lake)
[09:53:06] i am not sure at which layer we measure but should be either varnish or ats
[09:53:22] our main concern is caching hit rates
[09:53:52] one of the assumptions we tried to validate before this experiment was if edge caching was enough to disable pregeneration/storage on restbase
[09:54:26] but i think (?) the draft numbers we had were inflated from /summary/ hits
[09:55:30] (summary is heavily used from wikipedias, previews, bots etc)
[09:59:41] nemo-yiannis: varnish then
[10:04:59] netops, Infrastructure-Foundations, observability: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (ayounsi)
[11:07:18] netops, Infrastructure-Foundations, SRE: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (ayounsi) {P47077}
[11:08:11] netops, Infrastructure-Foundations: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi)
[11:08:40] netops, Infrastructure-Foundations: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi)
[11:08:49] netops, Infrastructure-Foundations, SRE, observability, Patch-For-Review: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (ayounsi)
[11:09:23] netops, Infrastructure-Foundations: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi)
[11:09:31] netops, Infrastructure-Foundations, SRE: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (ayounsi)
[11:09:42] netops, Infrastructure-Foundations, SRE: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (ayounsi)
[11:09:50] netops, Infrastructure-Foundations, SRE: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (ayounsi)
[11:10:19] netops,
Infrastructure-Foundations: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi)
[13:34:42] (SystemdUnitFailed) firing: (16) varnishmtail@default.service Failed on cp5017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:35:33] uh?
[13:37:01] yeah a bunch of them
[13:37:58] all of them in eqsin actually
[13:38:07] https://phabricator.wikimedia.org/T253093
[13:38:07] 16 servers impacted
[13:38:16] Assert error in vslc_vtx_next(), vsl_dispatch.c line 290:
[13:38:21] Condition(c->offset <= c->vtx->len) not true.
[13:38:28] and then subsequent quick restarts
[13:38:38] recoveries coming in now
[13:39:30] yeah we've gotta up the shm space
[13:39:42] (SystemdUnitFailed) firing: (16) varnishmtail@default.service Failed on cp5017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:51] not seeing the recoveries :/
[13:40:03] just 5021 so far
[13:41:45] there's a CLI arg "-l 80M" (that's the default, we're not setting it)
[13:42:07] let's bump it up today I think? we have seen it quite a few times now
[13:42:20] it will require restarts, so may take a little while to safely deploy
[13:42:32] I have no idea how to properly tune it (maybe there's no rational way other than bump+try)
[13:42:36] yeah
[13:42:48] but maybe just try doubling the default as a first step?
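The "double the default" step discussed above can be sketched as a small helper. This is only an illustration, not what was deployed: the function names `scale_size` and `vsl_args` are hypothetical, and the only facts taken from the discussion are the `-l` argument and its `80M` default.

```python
import re


def scale_size(size: str, factor: int) -> str:
    """Multiply a size string such as '80M' by an integer factor.

    Hypothetical helper: accepts a whole number followed by a K, M or G
    suffix and keeps the same suffix in the result.
    """
    match = re.fullmatch(r"(\d+)([KMG])", size.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"unsupported size string: {size!r}")
    value = int(match.group(1))
    unit = match.group(2).upper()
    return f"{value * factor}{unit}"


def vsl_args(size: str) -> list[str]:
    """Build the '-l <size>' argument pair mentioned in the discussion."""
    return ["-l", size]


if __name__ == "__main__":
    default = "80M"                  # default quoted in the discussion
    bumped = scale_size(default, 2)  # first tuning step: double it
    print(" ".join(vsl_args(bumped)))
```

Keeping the value as a string with its suffix avoids unit conversions; a change of unit (for example from M to G) would need extra handling that this sketch deliberately leaves out.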
[13:44:42] (SystemdUnitFailed) firing: (16) varnishmtail@default.service Failed on cp5017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:47:15] might need some extra coffee but I'm not seeing that -l 80M
[13:47:28] I think it's the default so we are not setting it explicitly
[13:47:29] it's not in our unit file yet, found it in docs
[13:47:47] yeah yeah, but failing to see it in the docs as well
[13:47:49] oh, but I'm looking at really old docs, thanks google
[13:48:05] that was for 3.0
[13:49:42] (SystemdUnitFailed) firing: (16) varnishmtail@default.service Failed on cp5017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:50:01] -l shorthand for -p vsl_space, which is a default of 80M
[13:50:05] https://varnish-cache.org/docs/6.0/reference/varnishd.html#vsl-space
[13:50:31] oh ok
[13:50:47] yeah...
2x sounds good as an initial tuning
[13:59:42] (SystemdUnitFailed) firing: (11) varnishmtail@internal.service Failed on cp5017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:04:42] (SystemdUnitFailed) resolved: (7) varnishmtail@internal.service Failed on cp5017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:04:59] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs1019.eqiad.wmnet with OS bullseye
[14:21:47] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (Jhancock.wm)
[14:36:25] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (Jhancock.wm)
[14:40:19] https://gerrit.wikimedia.org/r/c/operations/puppet/+/910005 for the varnish thing above
[14:40:22] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs1019.eqiad.wmnet with OS bullseye completed: - lvs1019 (**PASS**) - Downtimed on Icinga/Aler...
[14:40:38] not sure what we prefer for varnish so I went with string for the size. I can do an int and append the M manually
[14:41:46] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (Jhancock.wm) according to the change log on dns2003, it was the old authdns.
updated the ticket to reflect the new naming
[14:43:11] string seems fine here
[14:44:26] sukhe: but maybe make the class-level default use the varnish default (80M), since we're setting the new value via hiera?
[14:44:33] bblack: sure
[14:44:52] either that, or have it not set the param in the unit file at all unless overridden
[14:45:07] I will set it to 80M so that we can match the default then
[14:45:15] it just seems weird to have it defaulting to "-l 160M" and also have a hiera setting it to the same
[14:46:05] updated
[14:46:32] yeah fair enough! I thought but maybe 80M is too low, let's set to 160M and then we can do per-host overrides if required
[14:49:48] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (Jhancock.wm)
[14:58:04] Traffic, netops, Data-Engineering, Data-Persistence, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi)
[14:58:18] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[14:58:52] Traffic, netops, Data-Engineering, Data-Persistence, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi)
[15:00:03] Traffic, netops, DBA, Data-Engineering, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Marostegui)
[15:01:25] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[15:09:09] netops, Infrastructure-Foundations, SRE, ops-codfw: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (Papaul) p:Triage→Medium
[15:30:19] Traffic, netops, DBA, Data-Engineering, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[15:48:19] Traffic,
SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs1018.eqiad.wmnet with OS bullseye
[16:23:15] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs1018.eqiad.wmnet with OS bullseye completed: - lvs1018 (**PASS**) - Downtimed on Icinga/Aler...
[16:44:36] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[17:09:07] Traffic, netops, DBA, Data-Engineering, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (colewhite)
[17:09:45] Traffic, netops, DBA, Data-Engineering, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (colewhite)
[17:33:52] netops, Infrastructure-Foundations, SRE, observability, Patch-For-Review: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (colewhite) Getting Prometheus to scrape a new metrics endpoint is pretty straightforward. When the exporter is up and running and firewall r...
[17:46:36] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs1017.eqiad.wmnet with OS bullseye
[18:00:24] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dns2004.wikimedia.org with OS bullseye
[18:01:19] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (Papaul)
[18:22:49] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs1017.eqiad.wmnet with OS bullseye completed: - lvs1017 (**PASS**) - Downtimed on Icinga/Aler...
[18:23:45] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye
[18:26:30] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[18:41:28] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh) This is now complete and we have upgraded all 176 Traffic hosts to bullseye. We would like to thank @MoritzMuehlenhoff for helping with the Pybal backport that made the LVS reimaging...
[18:46:24] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh) @jbond for the firmware reimaging cookbook that saved us a lot of time by automating the iDRAC and NIC firmwares and deferring having the defer reboot option.
[18:52:14] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2004.wikimedia.org with OS bullseye completed: - dns2004 (**PASS**) - Removed from Pup...
[18:53:24] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dns2006.wikimedia.org with OS bullseye
[19:01:15] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (Papaul)
[19:02:02] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (Papaul) @Jhancock.wm hey if you have a chance can you please check network cable on dns2006? link is showing down Thanks ` ge-1/0/8 up down dns2006
[19:04:25] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye completed: - dns2005 (**FAIL**) - Removed from Pup...
[19:04:28] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye executed with errors: - dns2005 (**FAIL**) - Remov...
[19:04:35] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (Papaul)
[19:05:04] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (Papaul)
[19:05:47] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye
[19:18:42] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2005.wikimedia.org with OS bullseye completed: - dns2005 (**PASS**) - Downtimed on Ici...
[19:49:42] Traffic, DC-Ops, SRE, ops-codfw: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dns2006.wikimedia.org with OS bullseye executed with errors: - dns2006 (**FAIL**) - Remov...