[09:29:13] Traffic, Data-Engineering, Patch-For-Review: Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578#10499751 (Fabfur)
[10:23:06] topranks: I was checking https://grafana.wikimedia.org/goto/IBMIeUOHg?orgId=1, cr2-magru is still happy?
[10:24:13] vgutierrez: yeah I was also just finishing up checking everything
[10:24:41] both routers seem healthy, the IBGP session has been stable since the reset
[10:26:01] no signs of any issue - the transport circuit that flapped a few times Sat is also stable I'm setting it back to normal preference
[10:26:22] vgutierrez: I was gonna feed back on the task to say I think we can re-pool
[10:26:26] what are your thoughts?
[10:26:37] I agree
[10:26:54] let's repool and keep an eye in case the issue reappears
[10:26:55] ok
[10:28:55] netops, Infrastructure-Foundations, ops-magru: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10499926 (cmooney) Everything remains stable since the upgrade/reset of the routers yesterday. All protocol adjacencies, interfaces etc look good as are the gene...
[10:34:27] netops, Infrastructure-Foundations, ops-magru: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10499947 (Vgutierrez) thanks @cmooney, I'll re-pool the site
[10:36:29] topranks: done
[11:26:18] vgutierrez: was in a meeting last while thanks for that
[11:26:24] all looks to be ok from what I can tell
[11:27:27] vgutierrez: can you remind me what the test Liberica instance is? if it's running at the moment?
[11:30:28] topranks: lvs1013
[11:30:44] thanks!
[11:33:17] vgutierrez: you are using IPv4 transport for both address families in BGP is that right?
[11:33:22] same as PyBal currently?
[11:36:22] topranks: yes
[11:37:15] https://www.irccloud.com/pastebin/8fajIs1n/
[11:37:48] https://www.irccloud.com/pastebin/LxHwDseZ/
[11:37:49] cool yep that should be fine.
I was looking at a potential optimisation for collecting bgp stats from the routers but it wouldn't work well with this scenario (only if each neighbour only had one address family used)
[11:38:03] it's no big deal though probably better we support this use-case anyway
[11:38:24] those are good commands to know btw - thanks!
[11:38:51] topranks: there is any benefit of sending IPv6 prefixes via an IPv6 BGP session?
[11:39:23] I could add it to my TODO list
[11:40:45] vgutierrez: not really, generally I prefer it as the "v6 has to be working" on the host for the v6 routes to be announced. but there are very limited edge cases that would not be the case (I assume gobgp wouldn't announce the route if the v6 IP wasn't configured on the interface?)
[11:41:10] nah I was just curious really
[11:49:48] ack, thanks
[12:19:01] Traffic, Commons, MediaWiki-Uploading, SRE: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10500234 (Vgutierrez) {F58297395} This high TTFB values make me suspect of some kind of connectivity issue. Could you try to reproduce this behavior o...
[12:22:22] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10500289 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=892c37cf-859a-4da6-8f59-c75b5d153219) set by cmooney@cumin1002 for 3:00:00 on 1 host(s) and th...
[12:40:42] netops, Infrastructure-Foundations, SRE: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10500343 (MoritzMuehlenhoff) Open→Resolved After running 0.14.1 for five days, we can confirm this fixed, disk usage of /var/lib/routinator/repository...
[12:51:12] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10500385 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7b04d5bf-ab80-4626-96ba-3c376dfc52c2) set by cmooney@cumin1002 for 3:00:00 on 1 host(s) and th...
[13:34:24] Traffic, Data-Engineering, Patch-For-Review: Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578#10500539 (Fabfur)
[14:51:25] FIRING: SystemdUnitFailed: liberica-cp.service on lvs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:52:07] ^^ that's me, already working on a fix
[15:06:25] RESOLVED: SystemdUnitFailed: liberica-cp.service on lvs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:15:11] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10500956 (MatthewVernon) @ovasileva any update on progress on this, please? I see a bunch of changes (e.g. Incoming -> Freezer) that suggests this is ma...
[16:04:36] topranks: I want to proceed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1113478 today
[16:04:59] topranks: do we need to update anything on ulsfo routers to accept the BGP communities?
[16:05:12] mainly `14907:1`
[16:05:20] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10501179 (cmooney) I'm very happy to say Karim Radhouani, one of the gnmic devs, has been extremely helpful in response to the github issue I poste...
[16:05:31] vgutierrez: nope the config is pushed out so it should be fine
[16:05:37] topranks: thx <3
[16:05:40] but if you want to ping me when it's done to validate it looks ok please do
[16:15:17] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10501235 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1002 for host lvs4010.ulsfo.wmnet with OS bookworm
[16:15:23] topranks: cool, moving forward now
[16:22:34] Traffic, Commons, MediaWiki-Uploading, SRE: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10501271 (Underbar_dk) I disabled IPv6 and the multiple uploads went through! I then switched it back on and the uploads also went through no problem....
[16:27:58] hi traffic -- can anyone review tgr's https://gerrit.wikimedia.org/r/1114070? scheduled for the puppet request window but I don't know ATS Lua and want to make sure I'm not missing anything
[16:28:29] * vgutierrez 👀
[16:32:46] rzl: makes sense in ATS Lua terms
[16:33:16] vgutierrez: thanks -- anything I should know about deploying?
[16:33:51] including "yeah, you should let someone from traffic do it," that's a perfectly good answe
[16:33:54] r
[16:36:15] puppet should take care of that unless we want to apply globally roughly at the same time
[16:37:00] nope sounds perfect
[16:37:02] thank you!
[16:37:48] my .01 cents would be to disable puppet in A:cp-text, apply to one host to test it out and then apply to everywhere else quickly
[16:47:50] that sounds good to me
[16:51:59] (is one of you willing to hang around at the top of this hour, in case I end up needing an adult?)
[16:52:32] rzl: most of us should be around I think but I will be
[16:53:47] rad, appreciate it
[16:57:49] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10501367 (ovasileva) a:ovasileva
[16:58:00] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10501368 (ovasileva) a:ovasileva→None
[17:05:56] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10501393 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e5ab529a-1fb4-461d-b85a-a2d5a66a020a) set by cmooney@cumin1002 for 1:00:...
[17:10:56] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10501422 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1002 for host lvs4010.ulsfo.wmnet with OS bookworm completed: - lvs4010 (**WARN**) - Downtimed on...
[17:12:30] topranks: got some issues with the BGP config
[17:12:56] vgutierrez: what's up?
[17:12:58] topranks: probably related to the cr routers not expecting a BGP peer connecting to the vrrp-gw interface?
[17:13:33] no it needs to be to a real IP so the session establishes to both all the time
[17:13:48] ack, I'll fix liberica config
[17:14:00] it's statically configured to be the CR loopback
[17:14:26] cmooney@cr3-ulsfo> show configuration protocols bgp group PyBal local-address
[17:14:26] local-address 198.35.26.192;
[17:33:54] fix on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114783
[17:34:02] currently blocked updating PCC facts /o\
[17:44:42] PCC is happy now, at least for puppet 7
[17:50:33] topranks: looking better now https://www.irccloud.com/pastebin/MA0JPAiF/
[17:53:11] vgutierrez: yep looks good on the CR side too
[17:53:17] community is received
[17:53:24] https://www.irccloud.com/pastebin/lDy3hXHK/
[17:53:47] topranks: local-pref seems applied as expected.. ipvsadm isn't reporting any incoming traffic
[17:53:53] And we can see the resulting action that "metric1" (MED) is set to 70 as a result
[17:53:58] https://www.irccloud.com/pastebin/VjjKEFnx/
[17:54:27] let's not call it MED please
[17:54:37] cause MED is 0 for primaries and 100 for the secondary
[17:54:42] 🤯
[17:54:52] oh sorry yeah MED=Metric2 here lol
[17:55:04] brain fart, local-pref is what it is
[17:55:53] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10501575 (Vgutierrez) Open→In progress p:Triage→Medium
[19:50:00] Traffic, Infrastructure-Foundations, SRE: Console domain and property access request - https://phabricator.wikimedia.org/T381904#10502013 (Scott_French) Tagging #traffic in hopes that someone (especially with expertise in our DNS configuration) may be able to help advance the request in T381904#10464...
[23:52:13] netops, Infrastructure-Foundations, observability, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10502663 (cmooney) I was able to run a manual poller command with the updated 'lmns' command and it shows errors pro...