[00:03:13] (DiskSpace) resolved: Disk space puppetmaster1001:9100:/ 5.821% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:16:07] netops, Infrastructure-Foundations, SRE, ops-codfw: Upgrade new codfw switches to Juniper recommended - https://phabricator.wikimedia.org/T341670 (ayounsi)
[07:16:15] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (ayounsi)
[08:00:08] netops, Infrastructure-Foundations, SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (ayounsi) This got promoted to major. ` cr2-esams> show system alarms 2 alarms currently active Alarm time Class Description 2023-07-28 23:46:09 UTC Major FPC 0 Major Err...
[08:06:28] CAS-SSO, Infrastructure-Foundations, SRE, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) Thanks for the troubleshooting @brennen and @bd808 . I've done some tests changing oidc settings on the test instance, mo...
[08:23:36] CAS-SSO, Infrastructure-Foundations, SRE, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (MoritzMuehlenhoff) I think cn and uid are equally stable in practice: - Our current account handling doesn't allow to change eith...
[08:43:17] netops, Infrastructure-Foundations, SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (ayounsi) Looking more into the alert and status, both ports on FPC0 PIC2 are down, one of which is the link to asw2-esams, so we have a loss of redundancy (traffic now only goes through c...
[10:06:40] XioNoX, topranks: could either of you please rearm the keyholder on cumin1001? (the change to add me to pwstore for homer is still TBD)
[10:08:15] for some reason now when I do `ssh cumin1` it doesn't autocomplete all the way to `cumin1001.eqiad.wmnet` but only to `cumin1001`
[10:08:53] it's a bit annoying, someone knows what I could have changed to break it?
[10:10:21] moritzm: done
[10:10:27] cheers
[10:10:45] bash completion for cumin1 still works for me, though
[10:11:07] with bash-completion 2.11-6 and bash 5.2.15
[10:12:05] hmm, looks like .ssh/known_hosts.d/wmf-prod had both the short and long name
[10:12:26] running wmf-update-known-hosts-production fixed it
[11:03:28] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:23:28] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:14:05] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team: wmcs.spicerack: Setup a host to run cookbooks from prod network - https://phabricator.wikimedia.org/T276440 (fnegri)
[12:14:13] SRE-tools, Infrastructure-Foundations, Goal, cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (fnegri)
[12:16:21] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team: wmcs.spicerack: Setup a host to run cookbooks from prod network - https://phabricator.wikimedia.org/T276440 (fnegri) Open→Resolved a:fnegri Two hosts have been created (cloudcumin1001.eqiad.wmnet and cloudcumin2001...
[12:17:07] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Spicerack, cloud-services-team: spicerack: sal_logger does not work when running from a laptop - https://phabricator.wikimedia.org/T343336 (fnegri)
[12:21:03] SRE-tools, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: Spicerack: Add CI step to test with wmcs cookbooks - https://phabricator.wikimedia.org/T325758 (fnegri)
[12:33:59] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team (FY2023/2024-Q1): [spicerack] support including {project} in SAL messages - https://phabricator.wikimedia.org/T341793 (fnegri) In progress→Resolved Logs are now working correctly, though the fact they are going through...
[12:34:07] SRE-tools, Infrastructure-Foundations, cloud-services-team (FY2023/2024-Q1): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (fnegri)
[12:34:18] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Spicerack, cloud-services-team: spicerack: sal_logger does not work when running from a laptop - https://phabricator.wikimedia.org/T343336 (fnegri)
[12:34:47] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Goal, cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (fnegri)
[12:35:45] SRE-tools, Cloud-VPS, Infrastructure-Foundations, cloud-services-team (FY2023/2024-Q1): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (fnegri)
[12:36:08] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Goal, cloud-services-team (FY2023/2024-Q1): cloudcumin: decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (fnegri)
[15:08:28] (SystemdUnitFailed) firing: prometheus_puppet_agent_stats.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:33:28] (SystemdUnitFailed) resolved: prometheus_puppet_agent_stats.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:23:29] (SystemdUnitFailed) firing: dump_ip_reputation.service Failed on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:38:28] (SystemdUnitFailed) firing: (3) dump_ip_reputation.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:53:28] (SystemdUnitFailed) firing: (3) dump_ip_reputation.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:08:28] (SystemdUnitFailed) resolved: (3) dump_ip_reputation.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:11:17] CAS-SSO, Infrastructure-Foundations, SRE, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (bd808) >>! In T320390#9068533, @MoritzMuehlenhoff wrote: > I think cn and uid are equally stable in practice: > - Our current acc...