[03:11:04] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[03:11:05] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[04:11:04] RESOLVED: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[04:11:05] RESOLVED: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[04:19:04] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[05:19:04] RESOLVED: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[07:43:30] we'll need to add a 3rd service IP for gerrit (to assign to gerrit2003, gerrit-spare.wm.o probably, I'll confirm quickly) similar to https://netbox.wikimedia.org/ipam/ip-addresses/11649/ and https://netbox.wikimedia.org/ipam/ip-addresses/12958/. I found there were 2 differences (role and assignment) between those two. I'm wondering if there is
[07:43:30] something to fix/harmonize?
Also, I was wondering if I need to pick a specific address or "just assign one from the free codfw pile"
[08:43:19] arnaudb: o/ I think that the assignment is info that can be added anytime, IIRC it shouldn't really trigger anything behind the scenes. Same for the VIP role, it is mostly for our tracking purposes in Netbox. Maybe VIP was added to gerrit.wikimedia.org since the DNS could vary depending on what Gerrit instance is active at any given time? Not sure, maybe topranks and XioNoX have an opinion on this. For the IP to pick, I'd suggest
[08:43:19] to find a candidate and then ask in this channel for a review, public IPs need extra care :)
[08:46:15] arnaudb: yeah I think it's just that they got assigned at different times and in different ways, but they shouldn't be different.
[08:48:20] elukey XioNoX ack, thanks for the info!
[08:48:33] the current setup is not ideal as they use extra IPs from the hosts vlans (eg. public1-b-eqiad) so they're tied to the row
[08:49:16] the ideal would be to use a LVS (or bird) to advertise a public IP that is location agnostic (more info in https://wikitech.wikimedia.org/wiki/Wikimedia_network_guidelines#Public_IPs )
[08:55:07] Gerrit will move behind CDN (https://phabricator.wikimedia.org/T365259) it'll greatly simplify things imho
[08:55:32] I forgot, nice, yeah for sure!
[09:26:16] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:29:15] SRE-tools, collaboration-services, Infrastructure-Foundations, serviceops, SRE: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#11038391 (ABran-WMF) Open→Resolved this can be considered as done with the merge of https://gerrit.wikimedi...
[09:31:15] RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:45:33] CAS-SSO, Infrastructure-Foundations, Release-Engineering-Team (Radar): Document how to authenticate a bot account through CAS-SSO - https://phabricator.wikimedia.org/T377372#11038476 (fgiunchedi)
[09:50:16] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:00:15] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:09:15] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:14:16] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:29:16] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:29:36] netops, Infrastructure-Foundations, SRE: Homer: PyEz "ignore_warnings" does not work for port-block speed change warning - https://phabricator.wikimedia.org/T400261#11038624 (cmooney) Open→Resolved a:cmooney
[10:34:16] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:39:15] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:42:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:52:15] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:54:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:14:16]
RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:15:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:20:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:25:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:10:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:12:46] so, about that service IP, I was thinking about 208.80.153.111/27 as it is in public-1-d-codfw and it's the first available: https://netbox.wikimedia.org/ipam/vlans/105/
[12:15:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:15:55] oh I also need a v6, same reasoning: 2620:0:860::1/48
[12:16:58] oh I can even reduce it to 2620:0:860::1/56
[12:20:03] arnaudb: yes 208.80.153.111/27 should be ok if you need another IP bridged on the same vlan
[12:20:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:20:42] for IPv6 you should use something from 2620:0:860:4::/64 however, perhaps 2620:0:860:4:208:80:153:111/64 to match the usual convention
[12:20:57] makes sense, on it!
[12:49:53] slyngs: hello, you're back? looks like something's funky with idp
[12:50:20] https://idp.wikimedia.org/login?service=https%3a%2f%2flibrenms.wikimedia.org%2f -> "upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure"
[13:05:27] XioNoX: he's out on PTO
[13:05:47] I am going to bump tomcat
[13:10:16] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:16:27] nicely done :)
[14:20:42] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Link errors: ssw1-d1-codfw <-> ssw1-f1-codfw - https://phabricator.wikimedia.org/T400253#11039411 (Jhancock.wm) Open→Resolved a:Jhancock.wm looks like errors ceased after cleaning. no increments since friday.
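Note on the 12:20:42 suggestion: the proposed 2620:0:860:4:208:80:153:111/64 appears to reuse the decimal octets of the IPv4 address 208.80.153.111 as the last four groups of the IPv6 address inside the host /64. A minimal sketch of that reading, using only Python's standard `ipaddress` module; the function name `mapped_v6` is hypothetical and the convention itself is inferred from the single example in this log, not from any documented tooling:

```python
import ipaddress


def mapped_v6(v4: str, v6_subnet: str) -> ipaddress.IPv6Interface:
    """Hypothetical helper: embed the IPv4 octets in a /64 IPv6 subnet.

    Each decimal octet is reused verbatim as one IPv6 group
    (208 becomes the group "208"), as in the example from the log.
    """
    net = ipaddress.IPv6Network(v6_subnet)
    if net.prefixlen != 64:
        raise ValueError("this sketch only handles /64 subnets")
    # Validate the IPv4 address, then split it into its four octets.
    octets = str(ipaddress.IPv4Address(v4)).split(".")
    # The first four groups of the exploded network address are the /64 prefix.
    prefix_groups = net.network_address.exploded.split(":")[:4]
    addr = ipaddress.IPv6Address(":".join(prefix_groups + octets))
    return ipaddress.IPv6Interface(f"{addr}/{net.prefixlen}")


candidate = mapped_v6("208.80.153.111", "2620:0:860:4::/64")
print(candidate)  # 2620:0:860:4:208:80:153:111/64

# The earlier candidate 2620:0:860::1 is not inside the suggested /64.
print(ipaddress.ip_address("2620:0:860::1") in ipaddress.ip_network("2620:0:860:4::/64"))  # False
```

The second check illustrates why 2620:0:860::1 was redirected: it falls outside 2620:0:860:4::/64, the range suggested at 12:20:42.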
[15:05:17] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Link errors: ssw1-d1-codfw <-> ssw1-f1-codfw - https://phabricator.wikimedia.org/T400253#11039647 (cmooney) Awesome, thank you!
[16:47:10] SRE-tools, Infrastructure-Foundations, Spicerack, Data-Platform-SRE (2025.07.26 - 2025.08.15), Patch-For-Review: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#11040026 (BTullis)
[21:52:55] FIRING: MaxConntrack: Max conntrack at 85.52% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[21:57:55] RESOLVED: MaxConntrack: Max conntrack at 82.36% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack