[05:04:13] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9955625 (Marostegui)
[09:06:46] friday's riddle! NEL have basically stopped being reported https://phabricator.wikimedia.org/T369345
[09:07:11] speculation on my part and I haven't confirmed it, though I have the impression that it isn't on our side
[09:09:02] our side?
[09:09:50] for example if we stopped sending out the NEL headers
[09:10:05] we?
[09:10:32] as the CDN?
[09:10:43] mmm let me think about what has been changed on our side
[09:10:49] yeah
[09:11:11] yesterday I rebooted all cp hosts in eqiad, don't think this is related anyway
[09:11:28] and switched to b64 encoded headers in haproxy logging (only in ulsfo)
[09:11:46] I doubt it too, varnish would be sending the header out as far as I understand it
[09:11:52] fabfur: check the task... since Tuesday :)
[09:12:13] * vgutierrez drinks coffee
[09:12:20] since today apparently
[09:12:29] ah, sorry, my memory is limited to ~24h, even less if I see a squirrel somewhere
[09:12:44] messages/sec dropped around June 27th
[09:12:52] yes exactly, and some more today
[09:13:01] more == enough to trigger the alerts
[09:13:35] so we configure NEL via Report-To header
[09:14:02] or not...
[09:14:19] yes that's my understanding, NEL is configured via Report-To
[09:14:24] sybil:~ vgutierrez$ curl https://en.wikipedia.org/wiki/Main_Page -v -o /dev/null -s 2>&1 |grep -i report-to
[09:14:24] sybil:~ vgutierrez$
[09:14:31] and https_deliver_networkerrorlogging varnish function
[09:14:51] intake-logging.w.o seems to be still working :)
[09:15:06] lol
[09:16:26] oh.. I think I broke it
[09:17:10] 0b393d0e160397c61d931542d79006ff9fb3f8de is the culprit
[09:17:39] godog: thx for the report
[09:17:55] ah, now I see it vgutierrez, no XFP hence we don't deliver the header
[09:18:58] or do we remove the XFP condition on the header set?
[09:19:06] Traffic: NEL almost not reported anymore / very infrequently - https://phabricator.wikimedia.org/T369345#9956017 (Vgutierrez) p:Triage→High a:Vgutierrez
[09:19:37] godog: that's not entirely true though
[09:19:49] haproxy sets X-Forwarded-Proto to https
[09:20:29] see https://github.com/wikimedia/operations-puppet/blob/8a82461c968c7ba44e786ccdbc05f240369a9d57/hieradata/common/profile/cache/haproxy.yaml#L147
[09:21:16] mmhh ok I don't know enough at this point to actually understand what's going wrong
[09:22:00] I've an idea of what's going on
[09:22:50] we have two conflicting rules in haproxy impacting XFP
[09:23:04] one says set XFP to "https"
[09:23:11] and the other one says delete XFP
[09:24:14] since on HAProxy template we delete headers AFTER adding them, haproxy is adding the header and deleting afterwards
[09:25:53] we use set-header
[09:26:27] and per HAProxy documentation: "This does the same as "http-request add-header" except that the header name is first removed if it existed"
[09:26:35] so it was a L8 issue on my side
[09:26:38] * vgutierrez fixing it
[09:31:27] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052271
[09:32:06] +1
[09:32:54] fabfur: could you run puppet-merge for me?
[09:32:59] sure
[09:33:25] done
[09:33:29] please merge it :)
[09:33:44] done
[09:33:53] thx
[09:33:59] do you want me to run puppet on all cp hosts?
[09:34:06] go ahead please
[09:34:14] maybe on batches of 8 hosts
[09:34:17] yep
[09:36:24] so, there must be somewhere something that still passes XFP because NELs were reduced but not zero
[09:36:43] some requests that doesn't pass through HAP?
[09:36:43] nah
[09:37:11] internal requests shouldn't get back to external users
[09:37:33] Report-To header is cached by the UAs
[09:37:49] and we set a TTL of 7 days (604800 seconds)
[09:37:55] ah ok
[09:38:15] we're definitely back btw, I'm looking at the increase in messages on https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=logging-eqiad&var-kafka_broker=All&from=now-30m&to=now&var-topic=codfw.w3c.reportingapi.network_error&var-topic=eqiad.w3c.reportingapi.network_error&refresh=1m
[09:38:20] $ curl https://en.wikipedia.org/wiki/Main_Page -v -o /dev/null -s 2>&1 |grep -i report-to
[09:38:20] < report-to: { "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
[09:38:28] report-to is back on the drmrs hosts I'm hitting at the moment
[09:38:44] first batch is done
[09:39:17] godog: BTW it looks like that dashboard needs to be migrated to thanos? ;P
[09:40:50] vgutierrez: heheh you are right
[09:41:50] Traffic, Patch-For-Review: NEL almost not reported anymore / very infrequently - https://phabricator.wikimedia.org/T369345#9956131 (Vgutierrez) Open→Resolved Fixed by restoring `X-Forwarded-Proto` header on haproxy -> varnish traffic.
[09:42:01] ok I've un-silenced the nelnotreported alerts too
[09:42:06] thank you folks
[09:43:11] thanks for noticing!
[10:36:52] netops, Infrastructure-Foundations, SRE: Model GRE tunnels in Netbox - https://phabricator.wikimedia.org/T369351 (cmooney) NEW p:Triage→Low
[11:39:35] Traffic, SRE, Patch-For-Review: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260#9956433 (cmooney) Bit of an update on this one. We had a problem recently after lvs2011 was rebooted which is related, which we need to address. *...
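Note on the 09:37 exchange (Report-To cached by the UAs, TTL of 7 days): a minimal Python sketch of why reports tapered off rather than stopping at once. It parses the `report-to` value pasted at 09:38:20 and converts `max_age` to days. The function names and the "client keeps reporting until `max_age` elapses since it last saw the header" rule are illustrative assumptions, a simplified model and not a description of any particular browser's behaviour.

```python
import json

# Header value copied from the curl output at 09:38:20 in the log above.
REPORT_TO = (
    '{ "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": '
    '"https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error'
    '&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }'
)

SECONDS_PER_DAY = 86400


def policy_lifetime_days(header_value: str) -> float:
    """Return the max_age of a Report-To policy expressed in days."""
    policy = json.loads(header_value)
    return policy["max_age"] / SECONDS_PER_DAY


def still_reporting(seconds_since_header_seen: int, header_value: str) -> bool:
    """Hypothetical model: a client that cached the policy keeps reporting
    until max_age seconds have passed since it last received the header."""
    policy = json.loads(header_value)
    return seconds_since_header_seen < policy["max_age"]


if __name__ == "__main__":
    print(policy_lifetime_days(REPORT_TO))                   # lifetime in days
    print(still_reporting(3 * SECONDS_PER_DAY, REPORT_TO))   # header last seen 3 days ago
    print(still_reporting(8 * SECONDS_PER_DAY, REPORT_TO))   # header last seen 8 days ago
```

Under this model, clients that had seen the header shortly before it stopped being delivered would keep sending reports for up to a week, which is consistent with "NELs were reduced but not zero".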
[14:53:48] Traffic, SRE: Migrate DNS depooling of sites from operation/dns (git) to confctl - https://phabricator.wikimedia.org/T369366 (ssingh) NEW
[14:56:18] Traffic, SRE, Patch-For-Review: Migrate DNS depooling of sites from operation/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9956889 (ssingh) p:Triage→Medium
[14:56:39] Traffic, SRE, Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9956890 (ssingh)
[15:02:19] Traffic, SRE, Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9956911 (ssingh)
[16:02:49] Traffic, Data-Engineering, Observability-Logging, Patch-For-Review: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718#9957051 (Vgutierrez) Open→Resolved
[16:03:08] ^^ fabfur I've closed that one for you as you already finished it :P
[16:42:15] "resolved" :)
[17:12:40] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384 (cmooney) NEW p:Triage→Medium
[17:12:42] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#9957289 (cmooney)
[17:16:18] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9957290 (cmooney) Open→Resolved I'm going to close this task now, the current gnmic collection is providing what we need i...
[17:17:59] netops, Infrastructure-Foundations, observability, Observability-Metrics, SRE: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210#9957316 (cmooney) Open→Resolved Seems like a great tool, but we are going to move forward with pulling these stats using...
[17:20:58] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#9957330 (cmooney)
[19:43:26] Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9957575 (Jdlrobson) We ran into this again this week during the dark mode roll out. We deployed on the 2nd. According to data - 1 in 5 pages a...