[00:12:03] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:06:50] Mail, Infrastructure-Foundations, MediaWiki-Email, SRE: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9691323 (DannyS712) @Xover I created https://phabricator.wikimedia.org/P59624 that is restricted to WMF-NDA members and you, which should be secure...
[02:39:54] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on testvm2006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[04:12:03] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:13:39] Mail, Infrastructure-Foundations, MediaWiki-Email, SRE: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9691600 (Xover) >>! In T361860#9691323, @DannyS712 wrote: > @Xover I created https://phabricator.wikimedia.org/P59624 that is restricted to WMF-NDA...
[06:22:29] Mail, Infrastructure-Foundations, MediaWiki-Email, SRE: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9691625 (Peachey88) For creations, I think we just need to add you into phab trusted-contribs, I can do it when I'm back at my laptop if someone doe...
[06:35:19] netops, Infrastructure-Foundations: eqiad-drmrs transport down (April 2024) - https://phabricator.wikimedia.org/T361825#9691663 (ops-monitoring-bot) ===== Automated diagnostic for Netbox circuit ID 108 --- **Interface cr1-drmrs:xe-0/1/2** - admin-status: up - oper-status: up - interface-flapped: 2024-...
[06:36:09] netops, Infrastructure-Foundations: eqiad-drmrs transport down (April 2024) - https://phabricator.wikimedia.org/T361825#9691664 (ayounsi) Open→Resolved a:ayounsi > RFO: The unavailability of the link was due to problems with optical modules and cards at the Marseille and Paris, Fra...
[06:39:54] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on testvm2006:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:50:00] slyngs: I'm having an IDP redirection error on https://people.wikimedia.org/~ayounsi/circuits/ (it keeps alternating between idp and people)
[06:50:51] Weird, works for me
[06:51:12] yeah it's not the first time it happened to me and others, clearing cookies usually solves it
[06:51:13] I'll just check the logs
[06:51:27] if it helps : https://people.wikimedia.org/~ayounsi/circuits/?ticket=ST-4101-7UfAuf7Gpmgq0VfhvVxK3RaiLaE-idp2003
[06:51:41] No still works
[06:53:57] https://usercontent.irccloud-cdn.com/file/kvG5fGfB/Screenshot%20from%202024-04-05%2008-53-44.png
[06:55:45] yeah I removed the cookies specific to people.wikimedia.org and now it loads
[06:57:31] Did you happen to see how old that cookie was?
[06:58:13] slyngs: it said from today, or was it last accessed?
[06:58:19] I wonder if it's older than last Monday, but now the ticket in it has expired and in the meantime I switched the IDP servers
[07:01:09] Apache on people says: AH00011: ap_cookie: client submitted cookie 'MOD_AUTH_CAS' more than once: /~ayounsi/circuits/
[07:04:54] XioNoX: https://github.com/apereo/mod_auth_cas/issues/186 <- Maybe this ?
[07:05:11] The bug report / ticket is even from someone at the foundation :-)
[07:06:17] slyngs: hahaha, that's John
[07:06:25] Oh right
[07:07:06] but from that ticket the issue is solved?
[07:08:04] Yes, but because you do the CAS stuff in a .htaccess I think maybe we need the CASScope in there. E
[07:09:16] I'll look for the documentation, maybe it can just go into the virtualhost config
[07:10:03] thx
[07:14:02] CASScope is only valid in directory directives or .htaccess, so it needs to go into your .htaccess
[07:16:25] people2003 is still the primary, shouldn't it have moved to eqiad with the switchover?
[07:18:19] Not sure, the IDP servers also don't necessarily follow the datacenter switches
[07:20:25] You're missing the /~ayounsi/circuits/files/pinTeliaCarrierLogoPop96x96.png file ... or are you worried about copyright :-)
[07:22:14] slyngs: I just copied the kmz without caring too much about the linked resources
[07:22:33] slyngs: I updated the .htaccess, let's see
[07:23:26] We need to break it again :-)
[07:36:44] Telxius link to Sao Paulo added to the map
[07:57:05] Looking at the map, I wonder if it would be shorter to Marseille, if a cable existed, or it's just due to the warped map
[07:59:39] Mail, Infrastructure-Foundations, MediaWiki-Email, SRE: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9691793 (Aklapper) >>! In T361860#9691600, @Xover wrote: > @Aklapper is this by design, or just permissions accidentally set too tightly? @Xover: H...
[08:03:47] netops, Infrastructure-Foundations, ops-codfw, SRE: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871#9691816 (ayounsi) We first need to discuss if we want to start using managed switches for management switches (except the agg...
[08:05:40] slyngs: latency will give us the final answer :)
[08:12:03] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:11:31] netops, Infrastructure-Foundations, ops-codfw, SRE: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871#9692392 (cmooney) >>! In T361871#9691816, @ayounsi wrote: > We first need to discuss if we want to start using managed switch...
[11:54:10] (SystemdUnitFailed) resolved: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:58:58] ^ now gone for good
[12:10:38] netops, Ganeti, Infrastructure-Foundations, SRE, Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152#9692520 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `testvm2006.codfw.wmnet` - testvm2...
[12:11:10] slyngs: issue with https://people.wikimedia.org/~ayounsi/circuits/ is happening again
[12:11:49] Broken for me as well this time
[12:12:04] "cool"
[12:12:05] :)
[12:16:38] works for me?
[12:17:00] topranks: it works at first, then you leave and come back later on and it fails to load
[12:17:06] I am slightly concerned the Sao Paulo link goes right through the Bermuda triangle though
[12:17:10] will packets just disappear?
[12:17:16] until the cookie is cleared
[12:17:18] hahahaha
[12:25:36] Okay, I see the issue, I think
[12:25:41] I now have two cookies
[12:26:46] XioNoX: Okay to mess around with your .htaccess, just a little
[12:27:29] sure
[12:33:53] Okay, I'm a little unsure if this is a valid fix. I've set the circuits .htaccess to have a custom cookie name, to avoid collisions, but if both the cookies that show up are from the same "mod_auth_cas" session it will just break again later
[12:36:04] hm
[12:36:08] not sure I understand
[12:39:51] I do wonder if it's a bug in mod_auth_cas. I found both my cookies, and the cookie storage on the server has /~ayounsi/circuits/, but the one in my browser just says Path: /
[12:41:39] Oh, okay, I think it should have been CASScope /~ayounsi/circuits/
[12:42:26] slyngs: whatever the fix is, we should add it to https://wikitech.wikimedia.org/wiki/People.wikimedia.org#How_To_add_SSO_Authentication
[12:52:45] Yup, I'll just check again in a few hours and see if the renewal request is correct
[12:53:46] thx!
[12:55:23] Fun little issue, even if it's somewhat difficult to debug :-)
[13:23:14] netops, Infrastructure-Foundations, ops-codfw, SRE: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871#9692687 (Papaul) @ayounsi @cmooney thanks for all the inputs. What I am asking is to use the Juniper old switches as dummies...
[13:45:14] Mail, Infrastructure-Foundations, MediaWiki-Email, SRE: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9692785 (Xover) And now another two ticked in.
[13:59:20] moritzm no rush, but I added you to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016425 ...getting rid of a really old 3rd party Debian repo
[14:00:04] already left a comment two minutes ago :-)
[14:05:33] excellent, thanks. I take that to mean we don't need to mess with it?
[14:14:42] yes, we can ignore it
[14:16:56] ACK, just abandoned the patch
[14:22:04] hello folks!
[14:22:09] I'd need some input for https://phabricator.wikimedia.org/T353705 from people managing the AUX k8s cluster
[14:22:59] I can take care of the changes in netbox and puppet but I'd love to get a green light before proceeding
[14:46:19] elukey: it sounds safe but I'd defer to the experts for a green light also
[14:46:30] interesting issue though - strange it insists on those irregular subnets
[14:47:14] ack thanks!
[14:47:36] It must create some data structure with every possible IP in it in etcd or something?
[14:47:52] if you are ok with the delete/recreation as /116 I think that I have an authoritative +1, so I'll proceed :)
[14:48:01] the ipv6 subnets are not used by default, only if we turn a hiera flag
[14:48:13] so it will be enabled only if needed by aux basically
[14:48:55] yeah it's safe to change the mask on them in Netbox
[14:49:06] the only caveat is anywhere in the puppet repo they might be referenced we need to also change
[14:49:18] but I guess you changed the other clusters already so know where those bits are
[14:49:32] you don't need to delete/recreate in Netbox btw
[14:49:42] you can just 'edit' the existing prefix and change the netmask
[14:49:50] ah yes yes there is only one place, I'll take care of the change in puppet
[14:50:43] cool, thanks!
[15:07:06] Mail, Infrastructure-Foundations, MediaWiki-Email, SRE: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9693054 (DannyS712) >>! In T361860#9691600, @Xover wrote: >>>! In T361860#9691323, @DannyS712 wrote: >> @Xover I created https://phabricator.wikimed...
[17:40:22] Mail, Infrastructure-Foundations, MediaWiki-Email, SRE: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#9693475 (Xover) >>! In T361860#9693188, @jhathaway wrote: > @Xover if you could paste the headers of two of the messages that would help, the whole...
[19:17:48] (PuppetZeroResources) firing: Puppet has failed generate resources on mx-out1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:21:25] (SystemdUnitFailed) firing: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:40:48] (PuppetFailure) firing: Puppet has failed on mx-out2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:46:25] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:35:48] (PuppetFailure) resolved: Puppet has failed on mx-out2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure