[00:00:37] 10netops, 06Infrastructure-Foundations, 06SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10005374 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=23e26d8b-bf98-4528-9f4f-f796eb123261) set by cmooney@cumin1002 for 0:15:00 on 1 host(s) and th...
[00:02:19] 10netops, 06Infrastructure-Foundations, 06SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10005377 (10ops-monitoring-bot) VM netflow2003.codfw.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM
[04:48:23] 06Traffic, 06Data Products, 06Data-Engineering, 10Observability-Logging: New software: haproxykafka - https://phabricator.wikimedia.org/T370668#10005622 (10Fabfur) >>! In T370668#10003489, @Ottomata wrote: > I might be out of my league here, but have yall considered the [[ https://www.haproxy.com/blog/exte...
[04:55:13] 06Traffic, 06Data Products, 06Data-Engineering, 10Observability-Logging: Remove Benthos from ulsfo hosts - https://phabricator.wikimedia.org/T370741 (10Fabfur) 03NEW
[06:05:16] 06Traffic, 10conftool, 13Patch-For-Review: Allow integrating requestctl rules into haproxy - https://phabricator.wikimedia.org/T369606#10005674 (10Joe)
[07:20:29] 06Traffic, 10conftool, 13Patch-For-Review: Allow integrating requestctl rules into haproxy - https://phabricator.wikimedia.org/T369606#10005723 (10Joe) >>! In T369606#9985617, @CDanis wrote: > As @Fabfur points out, in haproxy 3.0+ (but not haproxy 2.8.x) we have the option of evaluating many ACLs together w...
[07:28:21] 06Traffic, 10conftool: Integrate requestctl haproxy rules into our TLS terminator - https://phabricator.wikimedia.org/T370745 (10Joe) 03NEW
[07:42:07] 06Traffic, 10conftool: Integrate requestctl haproxy rules into our TLS terminator - https://phabricator.wikimedia.org/T370745#10005760 (10Fabfur) If the requestctl rules are defined in a separate backend, they obviously need to be evaluated strictly after the ones in frontend (so, they are necessarily last on...
[08:28:35] 10Wikimedia-Apache-configuration, 06collaboration-services, 10Phabricator, 10Release-Engineering-Team (Priority Backlog 📥), and 3 others: Apache 2.4.61 throws a 403 Forbidden for links containing %3F - https://phabricator.wikimedia.org/T370110#10005864 (10hashar) Is the `B` flag the reason the issue trigge...
[08:58:09] 06Traffic, 06Movement-Insights: Disable Chrome Private Prefetch Proxy - https://phabricator.wikimedia.org/T364126#10005922 (10OSefu-WMF) Despite disabling prefetch using Google's methodology, we continue to receive ~150-200k requests per day that have Google's prefetch header. Many of these requests come from...
[09:07:12] 06Traffic, 06Movement-Insights: Disable Chrome Private Prefetch Proxy - https://phabricator.wikimedia.org/T364126#10005950 (10OSefu-WMF)
[09:10:10] 06Traffic, 06Movement-Insights: Disable Chrome Private Prefetch Proxy - https://phabricator.wikimedia.org/T364126#10005956 (10OSefu-WMF) 05In progress→03Resolved Closing this task as implementation is complete. Continuing impact monitoring here - T370750
[09:36:18] 06Traffic, 10conftool: Integrate requestctl haproxy rules into our TLS terminator - https://phabricator.wikimedia.org/T370745#10006006 (10Vgutierrez) requestctl integration could be a great candidate to write a custom SPOA (Stream Processing Offload Agent) that handles all the requestctl rules and returns an...
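The Icinga downtime / Alertmanager silence entries above (e.g. the 00:00:37 one) are normally produced by the sre.hosts.downtime cookbook run from a cumin host. A minimal sketch, assuming the flag names shown (--minutes, --reason); check `cookbook sre.hosts.downtime --help` before relying on them:

    # Set a 15-minute Icinga downtime + Alertmanager silence for one host,
    # mirroring the 00:00:37 log entry (flag names are assumptions).
    sudo cookbook sre.hosts.downtime --minutes 15 \
        --reason "increase VM RAM" 'netflow2003.codfw.wmnet'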
[11:18:44] 10netops, 06Infrastructure-Foundations, 06SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006345 (10cmooney) So, we hit a bit of a speed-bump in codfw with the gnmic stats once the new switches were made live there. We now have 36 active gnmic subscriptions...
[13:11:50] 10netops, 06Infrastructure-Foundations, 06SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006695 (10ops-monitoring-bot) VM netflow1002.eqiad.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM
[13:16:50] 10netops, 06Infrastructure-Foundations, 06SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006699 (10ops-monitoring-bot) VM netflow3003.esams.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM
[13:24:10] 10netops, 06Infrastructure-Foundations, 06SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006727 (10ops-monitoring-bot) VM netflow4002.ulsfo.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM
[13:24:27] 10netops, 06Infrastructure-Foundations, 06SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006728 (10ops-monitoring-bot) VM netflow5002.eqsin.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM
[13:30:48] 10netops, 06Infrastructure-Foundations, 06SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006766 (10ops-monitoring-bot) VM netflow6001.drmrs.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM
[13:33:50] 10netops, 06Infrastructure-Foundations, 06SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006779 (10cmooney) In Eqiad our netflow VM was also running a little hot, and swapping to disk. I've now increased the resources for it and also the other netflow VMs i...
[13:34:50] 10netops, 06Infrastructure-Foundations, 06SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006783 (10ops-monitoring-bot) VM netflow7001.magru.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM
[13:37:47] 10netops, 06Infrastructure-Foundations, 06SRE: Set Leaf switches in Codfw rows C & D to active and make new vlans live - https://phabricator.wikimedia.org/T370629#10006786 (10cmooney) 05Open→03Resolved All actions complete. @papaul, @Jhancock.wm please note that after this change if running the netb...
[14:33:00] 06Traffic, 06SRE, 13Patch-For-Review: Show a better error page when returning an HTTP 429, not the "Our servers are currently under maintenance" one for 5xxs - https://phabricator.wikimedia.org/T354718#10007033 (10CDobbins) This has been deployed as of 14:25 on 7/23/24, with CR #1041705. 1. I added a n...
[14:33:18] 06Traffic, 06SRE, 13Patch-For-Review: Show a better error page when returning an HTTP 429, not the "Our servers are currently under maintenance" one for 5xxs - https://phabricator.wikimedia.org/T354718#10007036 (10CDobbins) 05Open→03Resolved a:03CDobbins
[14:46:25] 10netops, 06Traffic, 06Infrastructure-Foundations, 06SRE: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068#10007060 (10ssingh) On `dns6001`, we have anycast-hc 0.9.8 running with the patch to change the logging level to WARN for when a service is dow...
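The "increase VM RAM" reboots above correspond to resizing Ganeti VMs. A minimal sketch of the usual sequence on a Ganeti cluster; the instance name is from the log, but the memory target is an assumption (the actual value used is not recorded here):

    # Grow a Ganeti instance's memory, then reboot so the new size takes effect.
    # memory is given in MiB; 8192 is an assumed target, not from the log.
    sudo gnt-instance modify -B memory=8192 netflow1002.eqiad.wmnet
    sudo gnt-instance reboot netflow1002.eqiad.wmnet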
[14:49:00] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007067 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=85a0a04b-e091-4107-9bc3-7c9ca22300c8) se...
[14:57:07] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007073 (10MatthewVernon) @cmooney Swift (ms-be) and Ceph (moss-be) ready when you are.
[15:01:15] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007080 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=71f4229e-483c-4848-9bc3-6926b62b02ae) se...
[15:01:45] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007081 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=18d9056a-9166-4006-b516-a07496523fd2) se...
[15:21:36] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007228 (10cmooney) Upgrade complete, things look ok network-wise and all hosts are back pinging again. Thanks all f...
[15:26:44] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007276 (10MatthewVernon) Both Ceph and Swift back to normal, thanks.
[16:05:41] hello traffic - FYI, in about an hour, I'll be starting to turn down the api-https and appservers-https LVS services following the instructions in https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service.
[16:05:41] I plan to use the restart-pybal cookbook in step #4, which looks pretty straightforward, but I might pop in here occasionally to confirm some things as I go :)
[16:06:29] swfrench-wmf: works for us, happy to
[16:18:07] 06Traffic, 10MW-on-K8s, 06serviceops, 06SRE, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10007582 (10Scott_French) Silenced ProbeDown for api-https:443 and appservers-https:443 for 24h: * f6f67d8d-6381-43b3-9262-9a8cf58f2b19 * ed0d352b-fb83-4bd4-...
[16:22:09] swfrench-wmf: I updated the instructions a bit, please try them and let me know if they can be improved
[16:22:20] so please refresh the page above (https://wikitech.wikimedia.org/wiki/LVS#Remove_the_service_from_the_load-balancers_and_the_backend_servers) :)
[16:23:32] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007599 (10cmooney) 05Open→03Resolved
[16:24:27] 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#10007601 (10cmooney) 05Open→03Resolved
[16:30:06] sukhe: ah, thanks for letting me know! I'll take a look now
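For readers following along, the turn-down sequence discussed here is roughly: merge the change, run the puppet agent on the LVS hosts, then restart pybal one site at a time, secondary/backup balancer first. A minimal sketch from a cumin host; the cookbook path, its arguments, and the lvs2013/lvs2011 backup/primary pair are all assumptions, so treat the wikitech page above as authoritative:

    # One site at a time, backup balancer before primary.
    sudo cumin 'lvs2*' 'run-puppet-agent'   # pick up the de-configured services
    sudo cookbook sre.loadbalancer.restart-pybal 'lvs2013.codfw.wmnet'   # backup first
    sudo cookbook sre.loadbalancer.restart-pybal 'lvs2011.codfw.wmnet'   # then primary
    # then repeat the same order for eqiad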
[16:36:31] sukhe: cool, so you transposed it, so that it's only one DC at a time (i.e., rather than both backups in both DCs, then primaries in both DCs)
[16:36:42] that makes sense and SGTM :)
[16:37:45] would it be alright to flip the order, so that I'm doing codfw first, then eqiad?
[16:37:47] swfrench-wmf: yep.
[16:38:01] swfrench-wmf: as long as you hit the secondary/backup first, I think either is fine
[16:38:17] sukhe: great, thanks!
[16:40:47] sukhe: one additional question, for the `ipvsadm` cleanup step (#6), presumably it makes sense to do this with the same ordering, right?
[16:41:13] swfrench-wmf: yes, good point. I will add it to the notes
[16:41:52] sukhe: oh, great - thanks! I can also do so after the dust settles :)
[16:42:25] so in theory, it shouldn't matter as long as the agent run and pybal restart have been completed already for that host/site, but might as well be consistent
[16:42:49] right, yeah - the only trick is remembering to vary the service IP across DCs
[16:42:56] yep
[16:43:11] I just wasn't sure if the `ipvsadm` invocation was risky enough that it warranted the same sequencing
[16:43:20] https://sal.toolforge.org/log/JtwswI8BhuQtenzvYXCr example
[16:43:50] nice, thank you!
[17:26:11] sukhe: FYI, I'll be starting the pybal restarts soon - currently waiting for run-puppet-agent on LVS hosts
[17:26:34] ok!
[17:50:37] 10Wikimedia-Apache-configuration, 06collaboration-services, 10Phabricator, 13Patch-For-Review, and 4 others: Apache 2.4.61 throws a 403 Forbidden for links containing %3F - https://phabricator.wikimedia.org/T370110#10007941 (10Dzahn) >>! In T370110#10005864, @hashar wrote: > * **The `phorge` module in Pupp...
[17:54:31] sukhe: any guidance on how long to wait between the secondary and low-traffic primary to make sure things are "good"?
[17:55:22] swfrench-wmf: I think you should feel free to move on to lt primary
[17:55:31] if secondary looks good
[17:56:49] sukhe: is there a better litmus test for "good" than the service seems to have successfully restarted? (e.g., pybal on lvs1020 is up per grafana)
[17:57:22] swfrench-wmf: in this case, pybal successfully having restarted with no other errors and the IPVS diff check indicating that the service was removed is more than enough
[17:57:34] great, thank you!
[17:57:53] there is another WARNING on icinga for lvs1020
[17:57:55] Check if Pybal has been restarted after pybal.conf was changed
[17:58:01] but this is expected
[17:58:14] it is saying that pybal has not been restarted after the conf file was changed and so it should be
[17:58:24] which you already did, so running this again to see if it clears up is all we need to do
[17:58:37] sukhe@lvs1020:~$ /usr/local/lib/nagios/plugins/check_pybal_restart --service pybal.service --file /etc/pybal/pybal.conf
[17:58:40] OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed.
[17:58:58] so all good IMO
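The `ipvsadm` cleanup step discussed above amounts to deleting the now-unconfigured virtual services from the kernel IPVS table on each balancer, remembering that the service IP differs per DC. A minimal sketch; the VIP:port shown is illustrative only:

    # On the LVS host, after the puppet run + pybal restart for that site:
    sudo ipvsadm -L -n                 # confirm the stale virtual service is still listed
    sudo ipvsadm -D -t 10.2.2.22:443   # delete it (example VIP:port, varies per DC)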
[18:01:34] sukhe: great, thank you very much
[18:02:23] I'll move ahead with the `ipvsadm` cleanups
[18:03:03] thanks :)
[18:05:56] happy to review and +1 here if need be since I always get icky doing manual ipvs cleans as well :P
[18:09:43] looks good fwiw, saw SAL
[18:26:57] 06Traffic, 13Patch-For-Review: Improve HAProxy unexpected restart alert - https://phabricator.wikimedia.org/T362833#10008131 (10BCornwall) 05In progress→03Resolved
[18:28:29] <+jinxer-wm> FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_api-https.toml has errors
[18:28:38] ^ this one usually happens after pybal restarts
[18:28:45] or it can, in some cases
[18:29:08] we fixed them in the past by manually deleting some .err files
[18:29:22] yeah it's definitely related to the recent cleanup
[18:29:29] swfrench-wmf: ^ or I can take care of it, let me know
[18:29:36] but since this is confd, all yours :)
[18:30:31] if you can find the path to the .err files...
[18:30:40] then deleting them should fix that
[18:31:39] ah, missed that! what channel was that in?
[18:31:39] I can take a look, but I'm trying to debug a puppet issue as a result of the cleanup =/
[18:31:47] -operations
[18:31:57] swfrench-wmf: what kind of issue?
[18:32:07] puppet one
[18:32:32] puppet failures on the bare-metal mwdebug hosts
[18:32:42] looking
[18:33:33] 06Traffic: prometheus-lvs-realserver-mss crashed on ncredir2002 - https://phabricator.wikimedia.org/T354721#10008157 (10BCornwall) 05Open→03Stalled
[18:34:18] mutante: ack, thanks! I was looking at the backscroll, and didn't realize it had just started :)
[18:34:31] I can take care of that once I sort out the puppet issues
[18:35:19] swfrench-wmf: thanks, one by one. puppet is fixed though!
[18:35:25] mwmaint1002 works
[18:35:39] oh, mwdebug
[18:35:39] mutante: it's a different one, this https://puppetboard.wikimedia.org/report/mw2268.codfw.wmnet/5278161e99f7c075e854022d63ea5f5867c9a636
[18:35:53] swfrench-wmf: we are including the lvs profile, simply removing that should fix it
[18:35:55] it's a result of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050382
[18:36:04] 06Traffic, 06Data Products, 06Data-Engineering, 10Observability-Logging: New software: haproxykafka - https://phabricator.wikimedia.org/T370668#10008166 (10Ottomata) > I would need a serious help w/ C. Ya, me too! Perhaps the SPOE go lib @Vgutierrez mentioned might be easier? > ATM we decided to go down...
[18:37:04] 06Traffic, 10observability, 06SRE: HAProxy metrics go down on config reload - https://phabricator.wikimedia.org/T343000#10008167 (10BCornwall) 05In progress→03Stalled
[18:39:15] sukhe: yeah, I think you're right
[18:39:33] let me try to figure out a safe place to peel that out
[18:39:52] I can take the icinga alert
[18:39:59] you do that other thing
[18:40:44] mutante: thank you!
[18:46:25] 06Traffic, 10Sustainability (Incident Followup): cp3050 seemd more affected then otheres in recent incident - https://phabricator.wikimedia.org/T330682#10008202 (10BCornwall) @CDanis Friendly ping.
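The manual fix alluded to above (deleting stale confd `.err` files) looks roughly like this; the directory is an assumption, so confirm the path against the Confd wikitech page referenced just below:

    # Remove stale per-template error markers left behind after the confd
    # resource itself has recovered (path is an assumption).
    sudo find /var/run/confd-template -name '*.err' -delete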
[18:48:12] 16x RESOLVED - fix documented at https://wikitech.wikimedia.org/wiki/Confd#Stale_template_error_files_present
[18:48:27] no issues I can see in the actual confd.log on alert1001
[18:48:37] so it was broken but now it's not
[18:49:03] swfrench-wmf: so in case you haven't fixed it already
[18:49:11] modules/role/manifests/mediawiki/appserver.pp has
[18:49:17] which has include ::profile::mediawiki::webserver
[18:49:22] 06Traffic, 10Sustainability (Incident Followup): Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106#10008209 (10BCornwall) 05Open→03Stalled
[18:49:28] and then that has a require on lvs::realservers
[18:49:49] which is then controlled by has_lvs
[18:49:59] as in hieradata/role/common/mediawiki/appserver/api.yaml, has_lvs: true
[18:50:05] so it needs to be false here
[18:50:24] and similarly for the rest of the stuff that was removed
[18:50:30] +1
[18:50:38] sukhe: thanks! yes, that's exactly the patch I'm writing a commit message for :)
[18:50:53] nice
[18:51:21] btw, I find it weird that it's just "has_lvs" and not profile::mediawiki::appserver::has_lvs or something, but I digress
[18:51:38] ditto :)
[18:52:57] it's bad puppet style, but it's so old we didn't have a style guide when it was added.. afaict
[18:53:05] mutante: yep
[18:53:11] I guess that's it
[18:54:43] every time we do some kind of style improvement across all modules.. it's like "yea, I'm fine merging all these, but I won't touch LVS"
[18:57:35] 06Traffic, 10Sustainability (Incident Followup): cp3050 seemd more affected then otheres in recent incident - https://phabricator.wikimedia.org/T330682#10008250 (10Vgutierrez) 05Stalled→03Invalid cp3050 is no longer being used, definitely this task can be closed now
[19:03:07] thank you both for the reviews - initially, I misread an instance of has_lvs: false elsewhere in hieradata, thinking it provided a default :)
[19:04:54] IMO the PCC failure is expected since the catalog is failing to compile
[19:04:57] merging should fix it
[19:05:01] exactly, yeah
[19:05:18] all the hosts are in "or compile correctly only with the change"
[19:10:06] alright, that seems to do the trick
[19:10:28] sukhe: mutante: thank you both again for your help :)
[19:10:52] yw, this was an epic one
[19:10:57] np, glad to see it all done :)
[19:37:21] 06Traffic, 10MW-on-K8s, 06serviceops, 06SRE, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10008340 (10Volans)
[19:41:34] 06Traffic, 10MW-on-K8s, 06serviceops, 06SRE, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10008343 (10Volans) I took the liberty of adding a cleanup item to the task description. If that should be part of another task, feel free to move it around.
[19:57:14] 10netops, 06Infrastructure-Foundations, 06SRE: Add data to automation for new switches in codfw C/D - https://phabricator.wikimedia.org/T369106#10008422 (10cmooney) 05Open→03Resolved
[20:06:08] 06Traffic, 06SRE: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821 (10ssingh) 03NEW
[20:12:26] 06Traffic, 06SRE: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821#10008529 (10BBlack) Firefox has historically been the reason we've been stapling OCSP for the past many years. If our certificate has an OCSP URI in its metadata, th...
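Two quick openssl checks are handy for the OCSP discussion above: whether a certificate advertises an OCSP responder URI at all, and whether a server is currently stapling a response. The commands use standard openssl options; the certificate file and hostname are illustrative:

    # Does the cert carry an OCSP responder URI in its AIA extension?
    openssl x509 -in cert.pem -noout -ocsp_uri
    # Is the server stapling an OCSP response during the TLS handshake?
    openssl s_client -connect en.wikipedia.org:443 -status </dev/null 2>/dev/null \
        | grep -i 'OCSP'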
[20:14:29] 06Traffic, 06SRE: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821#10008544 (10BBlack) Note also Digicert's annual renewal is coming soon in T368560. We should maybe look at whether the OCSP URI is optional in the form for making t...
[20:56:18] 06Traffic, 10MW-on-K8s, 06serviceops, 06SRE, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10008652 (10Scott_French) Many thanks, all who helped get this out the door. At this point, the LVS service turndown is done, and we've shaken out a handful...
[20:56:41] 06Traffic, 10MW-on-K8s, 06serviceops, 06SRE, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10008653 (10Scott_French)
[21:13:51] 06Traffic: prometheus-lvs-realserver-mss crashed on ncredir2002 - https://phabricator.wikimedia.org/T354721#10008696 (10Vgutierrez) 05Stalled→03Resolved This was solved a long time ago, but I never got around to closing the task.
[21:41:16] 06Traffic, 06SRE, 13Patch-For-Review: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078#10008822 (10BCornwall) 05Open→03Stalled
[21:41:30] 06Traffic, 10MW-on-K8s, 06serviceops, 06SRE, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#10008833 (10Krinkle)
[21:43:12] 06Traffic, 06SRE: Webrequests live data shows traffic without TLS on varnish for upload.w.o - https://phabricator.wikimedia.org/T340097#10008836 (10BCornwall) 05In progress→03Stalled
[21:43:35] 06Traffic: Clean up Varnish VCL - https://phabricator.wikimedia.org/T370200#10008842 (10BCornwall) a:03BCornwall
[21:55:06] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10008870 (10Ladsgroup) I'm repooling the replicas now.
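The repooling mentioned in the final entry is typically done with dbctl from a cumin host. A minimal sketch, assuming a hypothetical replica name (the exact hosts repooled for T365998 are not in this log):

    # Repool a database replica and commit the change (db1234 is hypothetical).
    sudo dbctl instance db1234 pool
    sudo dbctl config commit -m "Repool db1234 after lsw1-f3-eqiad upgrade T365998"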