[08:13:42] Traffic, SRE: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (TheDJ) This problem was also pretty visible on the wikimediastatus.net graph, I just noticed. {F37143438}
[08:13:47] Traffic, SRE: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (TheDJ) a:TheDJ→cmooney
[08:42:16] Traffic, SRE, Incident Tooling: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (Vgutierrez) >>! In T318804#8639175, @BCornwall wrote: > Looking into it further, it seems this is a very possible change! nginx mappings/site names support wildcard...
[09:22:24] Traffic, SRE, ops-eqiad: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (Vgutierrez) @RobH I'm seeing on cumin1001 logs, that you interrupted the reimage of lvs1013 by pressing Ctrl+C: ` 2023-07-18 16:01:28,549 robh 2034852 [INFO] Completed command '/usr/local/sbin...
[10:06:47] Traffic: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (Fabfur)
[10:16:37] Traffic: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (Vgutierrez) p:Triage→Medium Let's research the impact of disabling KA on text@esams and evaluate if we roll it out globally based on that (we could also take into account text@eqsin). I suggest adding a hiera...
[10:19:53] Traffic: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (Fabfur)
[13:18:21] Traffic, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (RobH)
[13:28:10] /win 2
[13:33:03] sukhe: go public! ;P
[13:34:32] vgutierrez: 2 is -ops, so hardly anything exciting there :P
[13:40:10] Traffic, SRE: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (Fabfur) That was released @Wed 19 Jul 2023 01:32:50 PM UTC on cp3052.esams.wmnet to test. The results matches what we were expecting, so we'll deploy on all text@esams
[14:39:42] (SystemdUnitFailed) firing: anycast-healthchecker.service Failed on dns5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:39:53] ^ should be resolving
[14:44:42] (SystemdUnitFailed) resolved: anycast-healthchecker.service Failed on dns5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:02:15] Traffic, WMF-Legal, Patch-For-Review, Performance-Team (Radar), Privacy: Add no-transform to Cache-Control header - https://phabricator.wikimedia.org/T218618 (Isaac) @BCornwall and others -- I want to connect a separate but related thread here because we have all the right people to make thes...
[15:07:18] Traffic, SRE: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (TheDJ) Resolved→Open Hmm. actually.. Seems there is also an exceptional amount of 4xx errors ? Especially today it seems to have exploded. https://grafana.wikimedia.org/d/000000479/cdn-fronte...
[17:17:42] Traffic, SRE, ops-eqiad: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (ayounsi) FYI this Netbox report is alerting: https://netbox.wikimedia.org/extras/reports/results/4808787/#test_port_block_consistency ` xe-0/0/41 [eqiad] Interface type '10gbase-x-sfpp' does n...
[17:20:16] Traffic, SRE, ops-eqiad: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (cmooney) Thanks @ayounsi @RobH you can probably connect them to 44 and 45 instead.
[17:23:12] (SystemdUnitFailed) firing: (4) anycast-healthchecker.service Failed on dns5004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:25:03] ^ expected and resolving soon
[17:33:12] (SystemdUnitFailed) resolved: (4) anycast-healthchecker.service Failed on dns5004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:41:24] Traffic, SRE, ops-eqiad: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye
[17:43:26] Traffic, SRE, ops-eqiad: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (RobH) >>! In T341992#9026925, @Vgutierrez wrote: > @RobH I'm seeing on cumin1001 logs, that you interrupted the reimage of lvs1013 by pressing Ctrl+C: > ` > 2023-07-18 16:01:28,549 robh 203485...
[17:45:03] Traffic, SRE, ops-eqiad: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (RobH) a:RobH >>! In T341992#9029076, @ayounsi wrote: > FYI this Netbox report is alerting: > https://netbox.wikimedia.org/extras/reports/results/4808787/#test_port_block_consistency > ` >...
[17:46:02] Traffic, SRE, ops-eqiad: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (RobH) Ok, the Bullseye OS has issues with the drivers for some of the hardware... Considering these are R430s, I don't think it is worth putting in time to install support for them in Bullsey...
[17:49:12] (SystemdUnitFailed) firing: (4) anycast-healthchecker.service Failed on doh2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:49:27] ^ expected, self resolving soon
[17:51:12] Traffic, SRE, ops-eqiad: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (ayounsi) @RobH they will need to have their switch port moved. On QFX5120s, if one port is configured at 1G, the 3 other adjacent ports can only be 1G. Here port 40 and port 42 are configure...
[17:54:12] (SystemdUnitFailed) resolved: (2) anycast-healthchecker.service Failed on doh3001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:11:56] Traffic, SRE, ops-eqiad: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (RobH) >>! In T341992#9029218, @ayounsi wrote: > @RobH they will need to have their switch port moved. > > On QFX5120s, if one port is configured at 1G, the 3 other adjacent ports can only be...
[19:31:27] (SystemdUnitFailed) firing: (6) anycast-healthchecker.service Failed on durum1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:36:12] (SystemdUnitFailed) resolved: (5) anycast-healthchecker.service Failed on durum3001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:36:27] (SystemdUnitFailed) firing: (5) anycast-healthchecker.service Failed on durum3001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:37:03] ^ expected, resolving
[19:41:12] (SystemdUnitFailed) resolved: (5) anycast-healthchecker.service Failed on durum3001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:39:05] netops, Infrastructure-Foundations, SRE, ops-eqiad: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (cmooney) @Jclark-ctr my apologies for some reason I thought these links had been cabled but seems from T338789 I didn't update the optic type so we need got them...
[21:34:12] Traffic, Observability-Metrics, Patch-For-Review: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 (BCornwall) Stalled→In progress
[21:42:02] Traffic, SRE, cloud-services-team: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463 (BCornwall) I would think that this needs to be followed since it's technically a new service even it's a rename. For instance, the dns repo still has "labweb" in templates/wmnet. A...
[21:44:42] Traffic, SRE, cloud-services-team: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463 (taavi) Yeah, the above patches were just getting rid of the non-TLS endpoint so we have one service to rename instead of two. The actual rename still needs to be done.