[11:53:10] netops, Infrastructure-Foundations: mr1-eqsin performance issue - https://phabricator.wikimedia.org/T362522 (ayounsi) NEW p:Triage→High
[11:56:08] netops, Infrastructure-Foundations: mr1-eqsin performance issue - https://phabricator.wikimedia.org/T362522#9712932 (ayounsi)
[12:11:29] netops, Infrastructure-Foundations: Juniper: use export-format state-data json compact - https://phabricator.wikimedia.org/T362523 (ayounsi) NEW
[12:31:15] netops, Infrastructure-Foundations: mr1-eqsin performance issue - https://phabricator.wikimedia.org/T362522#9713006 (ayounsi) Opened JTAC 2024-0415-128563 and attached logs/RSI/coredump.
[13:13:54] Traffic: Upgrade to HAProxy 2.6.17 - https://phabricator.wikimedia.org/T362063#9713285 (Vgutierrez)
[13:16:10] Traffic, DC-Ops, ops-codfw, ops-eqiad, SRE-swift-storage: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9713291 (ssingh) >>! In T350179#9711211, @Papaul wrote: > @ssingh one thing that I found between the server NiC and the switch interfa...
[13:38:20] Traffic: Upgrade to HAProxy 2.6.17 - https://phabricator.wikimedia.org/T362063#9713369 (Vgutierrez)
[13:41:13] Traffic: Upgrade to HAProxy 2.6.17 - https://phabricator.wikimedia.org/T362063#9713377 (Vgutierrez) Open→Resolved a:Vgutierrez
[13:41:18] Traffic, Upstream: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066 - https://phabricator.wikimedia.org/T354424#9713381 (Vgutierrez) Open→Resolved a:Vgutierrez
[13:51:47] Traffic, Content-Transform-Team-WIP, MW-on-K8s, serviceops, and 4 others: A lot of `[info] Wikitext for this page has duplicate ids:` in logstash for mw-parsoid. Possibly related to PageBundle - https://phabricator.wikimedia.org/T358588#9713422 (MSantos)
[13:51:49] Traffic: Update certspotter - https://phabricator.wikimedia.org/T204993#9713420 (ssingh) a:ssingh→None
[14:26:00] netops, Infrastructure-Foundations: Juniper: use export-format state-data json compact - https://phabricator.wikimedia.org/T362523#9713593 (ayounsi) p:Triage→Low a:ayounsi
[14:27:21] netops, Infrastructure-Foundations, SRE, Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9713590 (ayounsi) p:Triage→High a:ayounsi
[14:40:28] netops, Infrastructure-Foundations, ops-codfw, SRE: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871#9713684 (cmooney) p:Triage→Low @papaul yeah I think if we want to go this route we can just set them up the same as w...
[15:20:06] Traffic, MW-on-K8s, serviceops, SRE, and 3 others: A lot of `[info] Wikitext for this page has duplicate ids:` in logstash for mw-parsoid. Possibly related to PageBundle - https://phabricator.wikimedia.org/T358588#9713923 (MSantos)
[15:21:01] sukhe: re https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1019825 the file is also missing codfw's new per rack ranges
[15:21:43] using the v6 /56 and checking what fits best for v4 would remove the need to update it that often
[15:22:14] yeah, I will update the current CR for magru, drmrs
[15:22:24] and then will do eqiad/codfw in another one, given we have both public/private there
[15:24:09] I am not sure why we were using the smaller /24s there vs the /16s but I don't see why it won't work
[15:25:08] I'd bet on a copy/paste from network/data.yaml
[15:25:23] probably!
[16:08:33] netops, DC-Ops, Infrastructure-Foundations: Take advantage of 10Gb NICs in the new network stack - https://phabricator.wikimedia.org/T360297#9714232 (ayounsi) I started implementing a fix for that but it quickly gets complex as it means shutting down a port, and fully setting up another one. Before g...
[18:18:16] netops, Infrastructure-Foundations, SRE: Move public-vlan host BGP peerings from CRs to top-of-rack switches in codfw - https://phabricator.wikimedia.org/T360772#9714985 (cmooney) >>! In T360772#9657554, @ayounsi wrote: > We can define per host hiera keys, and empty lists as well, so to be tested but...
[18:19:32] (SystemdUnitFailed) firing: prometheus_lvs_realserver_mss.service on ncredir1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:49:55] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117#9715083 (CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/65...
[20:04:25] (SystemdUnitFailed) firing: (2) prometheus_lvs_realserver_mss.service on ncredir1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:22:31] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Move management routers ssh port - https://phabricator.wikimedia.org/T277438#9715614 (ayounsi) We might have to re-prioritize this task because of {T362522}
[21:23:00] netops, Infrastructure-Foundations: mr1-eqsin performance issue - https://phabricator.wikimedia.org/T362522#9715618 (ayounsi) > I have checked the logs and it looks like the issue we are facing with the slowness on the device and the reboots is product of a brute force SSH attack on the SRX. > The login...
[22:20:59] netops, Infrastructure-Foundations: mr1-eqsin performance issue - https://phabricator.wikimedia.org/T362522#9715694 (cmooney) >>! In T362522#9715615, @ayounsi wrote: > it looks like the issue we are facing with the slowness on the device and the reboots is product of a brute force SSH attack on the SRX...