[03:46:08] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9993353 (10Papaul) I do agree with the 2 options however there is a possibility too that Frack will be taking a new rack if we do the codfw...
[05:08:11] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:11:42] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:13:11] RESOLVED: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:14:16] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:15:04] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:20:04] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:21:42] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:24:16] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:25:04] RESOLVED: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:38:44] 10netops, 06Infrastructure-Foundations, 06SRE: Issue creating GNMI telemetry subscription to certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9993525 (10ayounsi) 05Open→03Resolved a:03ayounsi Thanks for the investigation ! Seems like the last step was : ` asw1-b3-magru> restart a...
[07:51:50] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9993574 (10ayounsi) Or we could just use a IPv6 /64 and stop worrying about space :) Thinking more globally, if we were to redo the product...
[08:53:02] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9993751 (10cmooney) >>! In T370164#9993574, @ayounsi wrote: > Or we could just use a IPv6 /64 and stop worrying about space :) One day :)...
[11:55:19] FIRING: [2x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp2033:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[11:56:10] FIRING: [3x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp2033:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[11:56:26] FIRING: [5x] PurgedHighEventLag: High event process lag with purged on cp2035:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:56:33] FIRING: [4x] LVSHighCPU: The host lvs2012:9100 has at least its CPU 1 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[11:56:43] FIRING: [12x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:02:32] FIRING: [8x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp2027:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[12:03:08] FIRING: [16x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp2027:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[12:03:24] FIRING: [21x] PurgedHighEventLag: High event process lag with purged on cp2035:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:03:32] FIRING: [9x] LVSHighCPU: The host lvs2011:9100 has at least its CPU 13 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:03:36] FIRING: [29x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:03:47] FIRING: [10x] LVSHighCPU: The host lvs2011:9100 has at least its CPU 13 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:10:14] RESOLVED: [16x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp2027:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[12:11:04] RESOLVED: [52x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:11:08] RESOLVED: [9x] LVSHighCPU: The host lvs2011:9100 has at least its CPU 1 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[13:55:03] 06Traffic, 13Patch-For-Review: [ncmonitor] ncredir should check whether second-level domains are used - https://phabricator.wikimedia.org/T369114#9994782 (10BCornwall) 05In progress→03Resolved
[14:49:44] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9995033 (10ABran-WMF) data-persistence hosts handled, ready whenever you are @cmooney
[15:12:42] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995231 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8062b5f0-d6f0-401c-9dfd-590a5facd0ad) set by cmooney@cumin...
[15:54:00] hello traffic folks - anyone have thoughts on my proposal yesterday to set the DYNA record for a discovery service to `geoip!disc-failoid` as a way to achieve failoid behavior on an active/active service? :)
[15:59:02] on a second check of the gdnsd configs, I can't come up with a reason why it wouldn't. I also have a deprecated discovery service we can try it on :)
[16:04:12] <_joe_> +1 from me
[16:04:33] <_joe_> volans, bblack you implemented that system right? why wasn't it applied to A/A services?
[16:06:28] eh, that a 1M$ question for my memory :D one option could have been to conceptually replicate pybal's logic of too many depooled = none depooled? I can try to dig into commits
[16:06:49] as Scott can testify the other day my brain assumed we used failoid for A/A too
[16:07:38] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995538 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fdebcc6c-adaa-42f3-809d-4ec381a4798d) set by cmooney@cumin...
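[Editor's note] The "too many depooled = none depooled" behavior mentioned at 16:06:28 is pybal's depool-threshold safety valve: if removing every unhealthy backend would leave too little capacity, it keeps everything pooled instead. A minimal sketch of that idea (hypothetical function, not pybal's actual code; the 0.5 threshold is illustrative):

```python
def effective_pool(servers: dict[str, bool], depool_threshold: float = 0.5) -> list[str]:
    """Return the backends to actually route traffic to.

    ``servers`` maps hostname -> healthy?. If depooling all unhealthy
    backends would leave fewer than ``depool_threshold`` of the total
    in service, keep everything pooled instead (fail "open" rather than
    concentrating all load on a tiny remainder). This mirrors the idea
    of pybal's depool threshold, not its implementation.
    """
    healthy = [host for host, ok in servers.items() if ok]
    if len(healthy) < depool_threshold * len(servers):
        # Too many backends would be depooled: safer to serve from all.
        return sorted(servers)
    return sorted(healthy)
```

With two of three backends healthy, the unhealthy one is depooled; with only one of three healthy, the threshold trips and all three stay pooled.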
[16:12:36] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995556 (10cmooney)
[16:12:41] We can try it, worst case a deprecated service is unavailable (oh no, it was supposed to go to failoid) and we do a revert?
[16:12:57] I don't think it would break gdnsd, would it?
[16:13:02] (I just jinxed it)
[16:13:09] lol
[16:13:35] that's my assumption as well, but I'm not 100% sure :)
[16:13:37] https://gerrit.wikimedia.org/r/c/operations/dns/+/1055256
[16:13:45] see the comment in https://gerrit.wikimedia.org/r/c/operations/dns/+/341574/2/templates/wmnet
[16:14:33] that's 7y+ ago, you're asking too much from my brain :D
[16:14:58] heh
[16:16:30] would anyone from traffic feel comfortable reviewing https://gerrit.wikimedia.org/r/c/operations/dns/+/1055256?
[16:16:42] FTR, disc-failoid is statically configured here: https://gerrit.wikimedia.org/g/operations/puppet/+/779721b86a69619ec45429156065913fc2cceb2d/modules/profile/templates/dns/auth/discovery-geo-resources.erb#19
[16:21:15] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995596 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1b177f94-1995-41ab-90b9-673cef9dbf94) set by cmooney@cumin...
[16:34:48] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995659 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f32e4714-9c03-456e-bc05-238c01bacbca) set by cmooney@cumin...
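[Editor's note] The change under review amounts to pointing a service's `DYNA` record at the statically defined `disc-failoid` geoip resource, so that gdnsd answers with the failoid hosts instead of the real backends. A minimal sketch of such a zone-template line (the service name and TTL are illustrative, not taken from the actual patch):

```
; illustrative zone-template line (service name and TTL are made up);
; gdnsd's geoip plugin resolves the disc-failoid resource, which is
; statically mapped to the failoid hosts in discovery-geo-resources.erb
some-service-ro  300  IN  DYNA  geoip!disc-failoid
```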
[16:46:40] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995727 (10cmooney)
[17:11:38] swfrench-wmf: reviewed
[17:12:49] bblack: thank you! are you comfortable with me giving this a try in a moment?
[17:29:23] or, to make things more concrete, are there specific risks you have in mind that I should watch out for in terms of gdnsd in general?
[17:30:09] if this specific service is borked, that's totally fine / intended, but if there's a wider risk I want to make sure I understand that :)
[17:43:43] swfrench-wmf: no real unknown risks I don't think. Just whether authdns-update of your patch is successful or not (if not, it will fail on the starting node and not go broader, in which case revert and re-run authdns-update)
[17:52:59] awesome, thank you bblack! yeah, hopefully if this isn't valid, the checkconf or what have you would fail
[17:53:09] (on the first host)
[17:55:32] moving forward
[18:01:46] $ dig +short appservers-ro.discovery.wmnet
[18:01:46] 10.192.32.20
[18:01:46] $ dig +short -x 10.192.32.20
[18:01:46] failoid2002.codfw.wmnet.
[18:05:03] nice
[18:06:28] thank you so much for your help
[18:08:10] np!
[18:08:32] if I could ask you to take a quick look at the other one, that would be greatly appreciated: https://gerrit.wikimedia.org/r/c/operations/dns/+/1055268
[18:08:48] (same change, different service, no mock changes needed)
[18:11:49] thank you :)
[18:13:11] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9996129 (10Papaul) ok +1 for /25 so we all okay thanks
[18:24:30] 06Traffic, 10MW-on-K8s, 06serviceops, 06SRE, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9996177 (10Scott_French) `appservers-ro.discovery.wmnet` and `api-ro.discovery.wmnet` now resolve to failoid, by way of manually updating their `DYNA` record...
[18:41:12] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9996287 (10cmooney) 05Open→03Resolved Work completed, traffic is currently bridged through the two spine switches over the AEs...
[18:44:36] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9996322 (10Jhancock.wm) ++ for /25 from me as well
[18:53:28] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9996358 (10cmooney)
[18:56:37] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9996362 (10cmooney) GNMI stats proved very helpful to keep an eye on the bandwidth shifting around {F56509244 width=600} {F56509...
[19:32:21] 10netops, 06Infrastructure-Foundations, 06SRE: Move IP gateways for codfw row C/D vlans to EVPN Anycast GW - https://phabricator.wikimedia.org/T369274#9996630 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d6a640fd-d19e-4aa8-930d-6c260b51a4c3) set by cmooney@cumin1002 for 3:00:00 on 4 ho...
[20:28:34] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475 (10cmooney) 03NEW p:05Triage→03Medium
[20:56:45] 06Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9996869 (10Jdlrobson) I'm seeing HTML older than 24hrs as we speak. When I visit the page https://en.wikipedia.org/wiki/Harmon_S._Cutting in Ch...
[21:15:56] 06Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9996913 (10Vgutierrez) I just replicated your findings on esams: my request from my computer looks like this: `$ curl -v -s --connect-to en.wik...
[21:29:00] 06Traffic, 10MediaWiki-Parser: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9996938 (10Vgutierrez) so the `NewPP limit report` refers to mediawiki parsing cache, given that https://en.wikipedia.org/...
[21:38:59] 06Traffic, 10MediaWiki-Parser: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9996970 (10Jdlrobson) We can ignore the `NewPP limit report ` comment for now! I am getting a different response to you t...
[21:44:20] 06Traffic, 10MediaWiki-Parser: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9996976 (10Vgutierrez) Your request is hitting the same cp node in esams that I hit a few minutes ago (cp3073). My request...
[21:50:34] 06Traffic, 10MediaWiki-Parser: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9996982 (10bd808) >>! In T366517#9996869, @Jdlrobson wrote: > This was cached on 13th July but still being served on 18th...
[22:13:24] 06Traffic, 10MediaWiki-Parser: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9997091 (10Tgr) I believe `OutputPage::checkLastModified()` sets the Last-Modified header to the date if the last edit (is...
[23:21:24] 06Traffic, 10MediaWiki-Parser: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9997195 (10Vgutierrez) ok, I've reproduced the issue and catch the request on ATS after a few attempts: `counterexample vg...
[23:32:45] 06Traffic, 10MediaWiki-Parser: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9997236 (10Krinkle) >>! In T366517#9997164, @Jdlrobson wrote: > […] we've been telling editors that this should only be a...