[00:03:20] Traffic, DC-Ops: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (wiki_willy) Hi @BCornwall - one other note, aren't these servers EOL? Based on last year's CapEx doc for annual planning on line 55 below, it looks like these were early refreshed by lvs10[17-20] during FY21-22....
[00:46:02] Traffic, Upstream: HAProxy 2.6.12 segfaults on cp2033 - https://phabricator.wikimedia.org/T334448 (ssingh) Now also observed on cp2035: ` Apr 11 22:00:08 cp2035 haproxy[2532735]: [ALERT] (2532735) : A bogus STREAM [0x7f1a2834d450] is spinning at 193580 calls per second and refuses to die, aborting no...
[01:07:12] Traffic, Infrastructure-Foundations, SRE, Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (Krinkle)
[02:02:00] (HAProxyRestarted) firing: HAProxy server restarted on cp2035:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2035&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[06:02:00] (HAProxyRestarted) firing: HAProxy server restarted on cp2035:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2035&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[06:41:45] (HAProxyRestarted) resolved: HAProxy server restarted on cp2035:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2035&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[06:42:46] Traffic, Upstream: HAProxy 2.6.12 segfaults - https://phabricator.wikimedia.org/T334448 (Vgutierrez)
[07:19:54] Traffic, netops, Infrastructure-Foundations, SRE: Adjust routing policy to increase SSH session speed from East Asia to toolforge - https://phabricator.wikimedia.org/T334530 (ayounsi) Thanks for the report. This is because we advertise our "customer" prefixes from all our POPs to improve the use...
[08:42:55] Traffic, MW-on-K8s, SRE, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[10:35:59] Traffic, netops, DBA, Data-Engineering, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui) @ayounsi we are placing new DB hosts in production, can you run the same query you ran to gather the affected DBs just in case we have new ones being affected?
[12:31:28] Traffic, netops, DBA, Data-Engineering, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ayounsi)
[12:31:38] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[12:31:56] Traffic, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (ayounsi)
[12:32:12] Traffic, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (ayounsi)
[12:34:53] Traffic, netops, DBA, Data-Engineering, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ayounsi)
[12:35:46] Traffic, netops, DBA, Data-Engineering, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ayounsi) >>! In T333377#8775126, @Marostegui wrote: > @ayounsi we are placing new DB hosts in production, can you run the same query you ran to gather the affected DBs j...
[12:38:32] Traffic, netops, DBA, Data-Engineering, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui) Thank you, nothing changes from our DB side!
[13:17:37] netops, Infrastructure-Foundations: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (ayounsi) p:Triage→Low
[13:26:03] netops, Infrastructure-Foundations: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (ayounsi)
[13:31:40] netops, Infrastructure-Foundations, SRE: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (jbond) the logic we use in puppet is mostly the same as [[ https://phabricator.wikimedia.org/P46511 | this script ]] which would be a good template to use for a cookbook
[14:30:36] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (cmooney) FWIW I've submitted a new patchset with a different format for defining the routes in YAML (at Arzhel's suggestion). ` static...
[15:16:06] Traffic, DC-Ops: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (BCornwall) @Papaul I'm unable to use the cookbook since the iDRAC version scheme falls below the threshold checks (2.x.x.x). All of the servers are running 2.83.83.83 @wiki_willy These are indeed EOL but I'm told...
[15:18:43] Traffic, DC-Ops: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (ssingh) > @wiki_willy These are indeed EOL but I'm told by others on my team that they will be used for internal testing - @BBlack / @ssingh can you confirm? That is correct, we plan to use these for L4LB testing...
[15:23:01] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2010.codfw.wmnet with OS bullseye
[16:04:20] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2010.codfw.wmnet with OS bullseye completed: - lvs2010 (**PASS**) - Downtimed on Icinga/Aler...
[16:19:00] netops, Infrastructure-Foundations, SRE: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (cmooney) I'd consider client auth a "stretch goal" for now, nice to have but not sure we want to have all that extra complexity. In terms of an intermediate CA just for network...
[16:22:30] Traffic, Infrastructure-Foundations, SRE: Set NEL `success_fraction: 1.0` on HTTP responses for measurement domains - https://phabricator.wikimedia.org/T334608 (CDanis)
[16:36:07] Traffic, Infrastructure-Foundations, SRE: Set NEL `success_fraction: 1.0` on HTTP responses for measurement domains - https://phabricator.wikimedia.org/T334608 (CDanis)
[16:45:17] Traffic, DC-Ops: Upgrade lvs1013-1016 firmware - https://phabricator.wikimedia.org/T334259 (Papaul) @BCornwall all those servers have QLogic 577xx/578xx, I don't think we had any issues on those NICs. The NICs we have been having issues with are Broadcom 10G running firmware version 22.
[17:24:49] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[17:29:00] Traffic, Commons, SRE: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (Umar) For more than a month I have not seen new versions of files. https://commons.wikimedia.org/wiki/File:Vake_District.svg
[18:23:29] netops, Infrastructure-Foundations, SRE: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (jbond) >I would worry about how we deal with the security / key management aspects of it. Just to expand on this a bit the reason why there may be a need for an additional inte...
[19:19:35] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[20:57:10] Traffic, Commons, SRE: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (Lionel_Scheepmans) It seems that here in Phabricator, no news is bad news.
[21:01:53] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs2007.codfw.wmnet with OS bullseye
[21:02:05] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs2007.codfw.wmnet with OS bullseye executed with errors: - lvs2007 (**FAIL**) - **The reimage...
[21:16:18] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs2007.codfw.wmnet with OS bullseye
[21:56:10] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs2007.codfw.wmnet with OS bullseye completed: - lvs2007 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled...