[03:46:08] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9993353 (10Papaul) I do agree with the 2 options however there is a possibility too that Frack will be taking a new rack if we do the codfw...
[05:08:11] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:11:42] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:13:11] RESOLVED: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:14:16] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:15:04] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:20:04] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:21:42] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:24:16] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:25:04] RESOLVED: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:38:44] 10netops, 06Infrastructure-Foundations, 06SRE: Issue creating GNMI telemetry subscription to certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9993525 (10ayounsi) 05Open→03Resolved a:03ayounsi Thanks for the investigation ! Seems like the last step was : ` asw1-b3-magru> restart a...
[07:51:50] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9993574 (10ayounsi) Or we could just use a IPv6 /64 and stop worrying about space :) Thinking more globally, if we were to redo the product...
[08:53:02] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9993751 (10cmooney) >>! In T370164#9993574, @ayounsi wrote: > Or we could just use a IPv6 /64 and stop worrying about space :) One day :)...
[11:55:19] FIRING: [2x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp2033:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[11:56:10] FIRING: [3x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp2033:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[11:56:26] FIRING: [5x] PurgedHighEventLag: High event process lag with purged on cp2035:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[11:56:33] FIRING: [4x] LVSHighCPU: The host lvs2012:9100 has at least its CPU 1 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[11:56:43] FIRING: [12x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:02:32] FIRING: [8x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp2027:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[12:03:08] FIRING: [16x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp2027:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[12:03:24] FIRING: [21x] PurgedHighEventLag: High event process lag with purged on cp2035:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:03:32] FIRING: [9x] LVSHighCPU: The host lvs2011:9100 has at least its CPU 13 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:03:36] FIRING: [29x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:03:47] FIRING: [10x] LVSHighCPU: The host lvs2011:9100 has at least its CPU 13 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:10:14] RESOLVED: [16x] VarnishPrometheusExporterDown: Varnish Exporter on instance cp2027:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[12:11:04] RESOLVED: [52x] PurgedHighEventLag: High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[12:11:08] RESOLVED: [9x] LVSHighCPU: The host lvs2011:9100 has at least its CPU 1 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[13:55:03] 06Traffic, 13Patch-For-Review: [ncmonitor] ncredir should check whether second-level domains are used - https://phabricator.wikimedia.org/T369114#9994782 (10BCornwall) 05In progress→03Resolved
[14:49:44] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9995033 (10ABran-WMF) data-persistence hosts handled, ready whenever you are @cmooney
[15:12:42] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995231 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8062b5f0-d6f0-401c-9dfd-590a5facd0ad) set by cmooney@cumin...
[15:54:00] hello traffic folks - anyone have thoughts on my proposal yesterday to set the DYNA record for a discovery service to `geoip!disc-failoid` as a way to achieve failoid behavior on an active/active service? :)
[15:59:02] on a second check of the gdnsd configs, I can't come up with a reason why it wouldn't. I also have a deprecated discovery service we can try it on :)
[16:04:12] <_joe_> +1 from me
[16:04:33] <_joe_> volans, bblack you implemented that system right? why wasn't it applied to A/A services?
[16:06:28] eh, that a 1M$ question for my memory :D one option could have been to conceptually replicate pybal's logic of too many depooled = none depooled? I can try to dig into commits
[16:06:49] as Scott can testify the other day my brain assumed we used failoid for A/A too
[16:07:38] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995538 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fdebcc6c-adaa-42f3-809d-4ec381a4798d) set by cmooney@cumin...
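[Editor's note] The "too many depooled = none depooled" behavior mentioned at 16:06:28 is pybal's depool-threshold safety valve: if removing every unhealthy backend would leave too little capacity, it keeps everything pooled instead. A minimal sketch of that idea (hypothetical function, not pybal's actual code; the 0.5 threshold is illustrative):

```python
def effective_pool(servers: dict[str, bool], depool_threshold: float = 0.5) -> list[str]:
    """Return the backends to actually route traffic to.

    ``servers`` maps hostname -> healthy?. If depooling all unhealthy
    backends would leave fewer than ``depool_threshold`` of the total
    in service, keep everything pooled instead (fail "open" rather than
    concentrating all load on a tiny remainder). This mirrors the idea
    of pybal's depool threshold, not its implementation.
    """
    healthy = [host for host, ok in servers.items() if ok]
    if len(healthy) < depool_threshold * len(servers):
        # Too many backends would be depooled: safer to serve from all.
        return sorted(servers)
    return sorted(healthy)
```

With two of three backends healthy, the unhealthy one is depooled; with only one of three healthy, the threshold trips and all three stay pooled.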
[16:12:36] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995556 (10cmooney)
[16:12:41] We can try it, worst case a deprecated service is unavailable (oh no, it was supposed to go to failoid) and we do a revert?
[16:12:57] I don't think it would break gdnsd, would it?
[16:13:02] (I just jinxed it)
[16:13:09] lol
[16:13:35] that's my assumption as well, but I'm not 100% sure :)
[16:13:37] https://gerrit.wikimedia.org/r/c/operations/dns/+/1055256
[16:13:45] see the comment in https://gerrit.wikimedia.org/r/c/operations/dns/+/341574/2/templates/wmnet
[16:14:33] that's 7y+ ago, you're asking too much from my brain :D
[16:14:58] heh
[16:16:30] would anyone from traffic feel comfortable reviewing https://gerrit.wikimedia.org/r/c/operations/dns/+/1055256?
[16:16:42] FTR, disc-failoid is statically configured here: https://gerrit.wikimedia.org/g/operations/puppet/+/779721b86a69619ec45429156065913fc2cceb2d/modules/profile/templates/dns/auth/discovery-geo-resources.erb#19
[16:21:15] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995596 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1b177f94-1995-41ab-90b9-673cef9dbf94) set by cmooney@cumin...
[16:34:48] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995659 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f32e4714-9c03-456e-bc05-238c01bacbca) set by cmooney@cumin...
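[Editor's note] The change under review amounts to pointing a service's `DYNA` record at the statically defined `disc-failoid` geoip resource, so that gdnsd answers with the failoid hosts instead of the real backends. A minimal sketch of such a zone-template line (the service name and TTL are illustrative, not taken from the actual patch):

```
; illustrative zone-template line (service name and TTL are made up);
; gdnsd's geoip plugin resolves the disc-failoid resource, which is
; statically mapped to the failoid hosts in discovery-geo-resources.erb
some-service-ro  300  IN  DYNA  geoip!disc-failoid
```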
[16:46:40] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995727 (10cmooney)
[17:11:38] swfrench-wmf: reviewed
[17:12:49] bblack: thank you! are you comfortable with me giving this a try in a moment?
[17:29:23] or, to make things more concrete, are there specific risks you have in mind that I should watch out for in terms of gdnsd in general?
[17:30:09] if this specific service is borked, that's totally fine / intended, but if there's a wider risk I want to make sure I understand that :)
[17:43:43] swfrench-wmf: no real unknown risks I don't think. Just whether authdns-update of your patch is successful or not (if not, it will fail on the starting node and not go broader, in which case revert and re-run authdns-update)
[17:52:59] awesome, thank you bblack! yeah, hopefully if this isn't valid, the checkconf or what have you would fail
[17:53:09] (on the first host)
[17:55:32] moving forward
[18:01:46] $ dig +short appservers-ro.discovery.wmnet
[18:01:46] 10.192.32.20
[18:01:46] $ dig +short -x 10.192.32.20
[18:01:46] failoid2002.codfw.wmnet.
[18:05:03] nice
[18:06:28] thank you so much for your help
[18:08:10] np!
[18:08:32] if I could ask you to take a quick look at the other one, that would be greatly appreciated: https://gerrit.wikimedia.org/r/c/operations/dns/+/1055268
[18:08:48] (same change, different service, no mock changes needed)
[18:11:49] thank you :)
[18:13:11] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9996129 (10Papaul) ok +1 for /25 so we all okay thanks
[18:24:30] 06Traffic, 10MW-on-K8s, 06serviceops, 06SRE, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9996177 (10Scott_French) `appservers-ro.discovery.wmnet` and `api-ro.discovery.wmnet` now resolve to failoid, by way of manually updating their `DYNA` record...
[18:41:12] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9996287 (10cmooney) 05Open→03Resolved Work completed, traffic is currently bridged through the two spine switches over the AEs...
[18:44:36] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9996322 (10Jhancock.wm) ++ for /25 from me as well
[18:53:28] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9996358 (10cmooney)
[18:56:37] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9996362 (10cmooney) GNMI stats proved very helpful to keep an eye on the bandwidth shifting around {F56509244 width=600} {F56509...
[19:32:21] 10netops, 06Infrastructure-Foundations, 06SRE: Move IP gateways for codfw row C/D vlans to EVPN Anycast GW - https://phabricator.wikimedia.org/T369274#9996630 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d6a640fd-d19e-4aa8-930d-6c260b51a4c3) set by cmooney@cumin1002 for 3:00:00 on 4 ho...
[20:28:34] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475 (10cmooney) 03NEW p:05Triage→03Medium
[20:56:45] 06Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9996869 (10Jdlrobson) I'm seeing HTML older than 24hrs as we speak. When I visit the page https://en.wikipedia.org/wiki/Harmon_S._Cutting in Ch...
[21:15:56] 06Traffic: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9996913 (10Vgutierrez) I just replicated your findings on esams: my request from my computer looks like this: `$ curl -v -s --connect-to en.wik...
[21:29:00] 06Traffic, 10MediaWiki-Parser: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9996938 (10Vgutierrez) so the `NewPP limit report` refers to mediawiki parsing cache, given that https://en.wikipedia.org/...
[21:38:59] 06Traffic, 10MediaWiki-Parser: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9996970 (10Jdlrobson) We can ignore the `NewPP limit report ` comment for now! I am getting a different response to you t...
[21:44:20] 06Traffic, 10MediaWiki-Parser: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9996976 (10Vgutierrez) Your request is hitting the same cp node in esams that I hit a few minutes ago (cp3073). My request...
[21:50:34] 06Traffic, 10MediaWiki-Parser: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9996982 (10bd808) >>! In T366517#9996869, @Jdlrobson wrote: > This was cached on 13th July but still being served on 18th...
[22:13:24] 06Traffic, 10MediaWiki-Parser: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9997091 (10Tgr) I believe `OutputPage::checkLastModified()` sets the Last-Modified header to the date if the last edit (is...
[23:21:24] 06Traffic, 10MediaWiki-Parser: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9997195 (10Vgutierrez) ok, I've reproduced the issue and catch the request on ATS after a few attempts: `counterexample vg...
[23:32:45] 06Traffic, 10MediaWiki-Parser: Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css - https://phabricator.wikimedia.org/T366517#9997236 (10Krinkle) >>! In T366517#9997164, @Jdlrobson wrote: > […] we've been telling editors that this should only be a...