[00:05:25] FIRING: SystemdUnitFailed: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:22] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9814762 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2002.codf...
[00:21:21] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9814765 (Jhancock.wm) @cmooney I put the server in the wrong vlan. can you fix it for me. private1-a8 to private-a-codfw. th...
[04:05:40] FIRING: SystemdUnitFailed: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:48:47] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289#9815149 (ayounsi)
[08:05:40] FIRING: SystemdUnitFailed: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:37] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: HAProxy log format doesn't support "invalid" request path - https://phabricator.wikimedia.org/T365117#9815424 (Fabfur) Update: opened [[ https://github.com/haproxy/haproxy/issues/2573 | this issue ]] upstream t...
[08:10:03] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9815439 (cmooney) @Jhancock.wm @Papaul I'd been using the server in b7 for testing already, but I should be able to move over...
[08:26:48] Traffic, Observability-Logging: Add metrics to Benthos - https://phabricator.wikimedia.org/T361845#9815499 (Fabfur)
[08:36:36] Traffic, Data-Engineering, Observability-Logging: Umbrella task for Benthos parsing error - https://phabricator.wikimedia.org/T365441 (Fabfur) NEW
[08:36:57] Traffic, Data-Engineering, Observability-Logging: Umbrella task for Benthos parsing error - https://phabricator.wikimedia.org/T365441#9815580 (Fabfur)
[08:36:58] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: HAProxy log format doesn't support "invalid" request path - https://phabricator.wikimedia.org/T365117#9815579 (Fabfur)
[08:38:44] Traffic, Data-Engineering, Observability-Logging: Umbrella task for Benthos parsing error - https://phabricator.wikimedia.org/T365441#9815588 (Fabfur) A missing Host header in the request results in a 400 from Varnish and a parsing error from varnish: `json { "$schema": "/webrequest/1.0.0", "back...
[09:11:10] Traffic: Consider preferring TLS_AES_128_GCM_SHA256 over TLS_AES_256_GCM_SHA384 - https://phabricator.wikimedia.org/T365327#9815692 (Ladsgroup) Does reducing the key sizes (from 256 to 128 for AES and 384 to 256 for SHA) sound good to people? We can deploy it temporarily and measure the impact.
[09:42:26] Traffic: Consider preferring TLS_AES_128_GCM_SHA256 over TLS_AES_256_GCM_SHA384 - https://phabricator.wikimedia.org/T365327#9815782 (Vgutierrez) Personally I'm not sold on the idea of decreasing the key size, @BBlack what are your thoughts?
[10:43:37] Traffic: Move HTTP/1.0 requests rejections at HAProxy level - https://phabricator.wikimedia.org/T365456 (Fabfur) NEW
[12:05:40] FIRING: SystemdUnitFailed: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:36:42] that's still spamming us :|
[12:43:11] Traffic, Patch-For-Review: Move HTTP/1.0 requests rejections at HAProxy level - https://phabricator.wikimedia.org/T365456#9816399 (Vgutierrez) Scope of the task should be rejecting invalid HTTP requests on HAProxy rather than varnish as soon as we have analytics moved to HAProxy (and not only HTTP/1.0 ones)
[12:43:14] (new silence submitted)
[13:14:32] Acme-chief: acme-chief: add support for serving individual files over the puppet file system api - https://phabricator.wikimedia.org/T364589#9816477 (Vgutierrez) Open→Resolved a:Vgutierrez acme-chief 0.37 deployed shipping https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/8
[13:22:32] Traffic: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257#9816485 (Vgutierrez)
[14:04:39] Traffic, MW-on-K8s, serviceops: Allow choosing datacentre in XWD - https://phabricator.wikimedia.org/T365478 (jijiki) NEW
[14:05:07] Traffic, MW-on-K8s, serviceops: XWD: Allow choosing datacentre in k8s-mwdebug - https://phabricator.wikimedia.org/T365478#9816613 (jijiki)
[14:06:22] Traffic, DNS, SRE, WikiLearn: DNS records for WikiLearn - https://phabricator.wikimedia.org/T365435#9816626 (ssingh) Hi @Asaf: The CNAME here specifies wikimedia.org but I //think// it should be learn.wiki here. So instead of: ` Cname: v3o5dlecov5umdnmw3mx7kh4x52e2kfh._domainkey.wikimedia.org V...
[14:33:20] Traffic, MW-on-K8s, serviceops: XWD: Allow choosing datacentre in k8s-mwdebug - https://phabricator.wikimedia.org/T365478#9816834 (akosiaris) Suggestion LGTM, it will be helpful
[14:38:50] Acme-chief: acme-chief: add support for serving individual files over the puppet file system api - https://phabricator.wikimedia.org/T364589#9816899 (jhathaway) >>! In T364589#9816477, @Vgutierrez wrote: > acme-chief 0.37 deployed shipping https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_request...
[14:53:15] Traffic: Consider preferring TLS_AES_128_GCM_SHA256 over TLS_AES_256_GCM_SHA384 - https://phabricator.wikimedia.org/T365327#9817012 (ssingh) Some numbers from `cp7001`, with the usual caveats around measuring this with `openssl speed` and commenting on this simply to understand the above-mentioned performanc...
[15:01:09] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289#9817104 (Jclark-ctr) a:Jclark-ctr
[15:09:07] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289#9817181 (VRiley-WMF) a:Jclark-ctr→VRiley-WMF
[15:10:48] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289#9817194 (VRiley-WMF) Checked the switch, and reseated the cable. It seems to have come back up with no issues. Everything running normally.
[15:11:26] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289#9817199 (VRiley-WMF) Open→Resolved
[15:30:38] following up on some magru latency stuff -- so, overall Peru latency looks like this: https://i.imgur.com/jbsDLYa.png -- for some users magru is narrowly the best choice, but for others it's awful
[15:31:00] but then if you look at the top 3 ISPs in Peru according to our measurements, https://i.imgur.com/PHwZPys.png
[15:32:43] clearly a matter of peering and preferred networks
[15:32:52] and might also change with time...
[15:33:36] cdanis: we are in a meeting so will look shortly but thanks in the meantime!
[15:34:03] volans: for sure, we would have to be continuously re-mapping
[15:36:41] if latency is similar then traffic balancing can come into play
[15:44:47] we try not to (use balancing as a factor), but in this case if latency is a toss-up, moving more load to magru is probably a win (because it will be lightly-loaded in general, I tend to think)
[15:45:27] also: better caching for languages that are common in the region (not as important as latency, but for edge cases, sure)
[15:45:42] I see what you did there
[15:45:51] :P
[16:05:40] FIRING: SystemdUnitFailed: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:06:00] >:(
[16:06:10] Okay, this is ridiculous. I'm just gonna disable the timer in puppet for now
[16:07:30] it seems to have just expired
[16:07:51] 851e224c-1fc0-4e5e-8029-196f4dcc8ac7
[16:08:23] I did four days
[16:08:31] Thank you, sir
[16:08:39] brett: karma will get to me eventually
[16:08:52] That sounds very morbid out of context
[16:09:11] I should have said karma police
[16:43:51] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Rename X-Wikimedia-Debug k8s-experimental option - https://phabricator.wikimedia.org/T362662#9817838 (Jdforrester-WMF)
[16:43:55] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Rename X-Wikimedia-Debug k8s-experimental option - https://phabricator.wikimedia.org/T362662#9817839 (Jdforrester-WMF) a:Jdforrester-WMF
[16:44:03] Traffic, MW-on-K8s, serviceops, SRE, and 2 others: Rename X-Wikimedia-Debug k8s-experimental option - https://phabricator.wikimedia.org/T362662#9817841 (Jdforrester-WMF) Open→In progress