[06:58:00] CAS-SSO, Infrastructure-Foundations: Login attempts from bd808 get 500 on Debmonitor - https://phabricator.wikimedia.org/T369205#10128626 (SLyngshede-WMF) @bd808 we'll add the same filter to Turnilo, that also doesn't need to be concerned with Toolforge groups. I'll let you know when the filter is in pla...
[07:08:39] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Login attempts from bd808 get 500 on Debmonitor - https://phabricator.wikimedia.org/T369205#10128664 (SLyngshede-WMF) In progress→Resolved Turnilo is now also limited in the number of LDAP groups returned in the CAS token.
[09:00:53] netops, collaboration-services, DC-Ops, Infrastructure-Foundations, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10129063 (Jelto) I depooled `gitlab-runner2003` for tomorrows maintenance
[09:15:00] FIRING: [2x] ProbeDown: Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:25:00] FIRING: [4x] ProbeDown: Service idm1001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:25:25] FIRING: SystemdUnitFailed: rq-bitu.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:30:25] FIRING: [2x] SystemdUnitFailed: rq-bitu.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:40:00] FIRING: [4x] ProbeDown: Service idm1001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:40:25] FIRING: [2x] SystemdUnitFailed: rq-bitu.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:45:00] RESOLVED: [4x] ProbeDown: Service idm1001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:02:21] Mail, Infrastructure-Foundations: Alert email sent from backupmon1001 didn't reach engineer's google inbox (was: check-dbbackup-time sometimes doesn't send email alerts) - https://phabricator.wikimedia.org/T369253#10129225 (jcrespo) Thanks, everone. I think @MatthewVernon 's suggestion is fair, and somet...
[11:05:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:37:00] SRE-tools, Infrastructure-Foundations, serviceops-radar: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351 (Clement_Goubert) NEW
[12:05:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:22:38] SRE-tools, Infrastructure-Foundations, Spicerack: Unified pattern for RemoteHosts accessors in Spicerack - https://phabricator.wikimedia.org/T374073#10130120 (elukey) p:Triage→Medium
[14:36:55] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10130193 (joanna_borun) p:Triage→Medium
[14:37:18] SRE-tools, Infrastructure-Foundations, serviceops-radar, Patch-For-Review: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351#10130196 (MoritzMuehlenhoff) p:Triage→Medium
[14:41:38] SRE-tools, Infrastructure-Foundations, serviceops-radar, Patch-For-Review: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351#10130225 (Clement_Goubert) Tested via `test-cookbook` on `mw2428` and `mw2429` and they seem to have been correctly remove...
[14:47:54] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10130269 (CDanis) >>! In T374272#10127785, @cmooney wrote: > Thanks @cdanis and @Southparkfan for the task! > > Logs relate to [[ https://n...
[14:52:32] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10130289 (cmooney) >>! In T374272#10130269, @CDanis wrote: > The timestamps in the description come from LibreNMS's logs viewer for asw2-d-e...
[14:54:59] SRE-tools, Infrastructure-Foundations, serviceops-radar, Patch-For-Review: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351#10130304 (Clement_Goubert) Correction, it worked for `puppetdb`, but they got added back to `debmonitor`. Will investigate...
[15:02:29] SRE-tools, Infrastructure-Foundations, serviceops-radar, Patch-For-Review: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351#10130342 (MoritzMuehlenhoff) >>! In T374351#10130304, @Clement_Goubert wrote: > Correction, it worked for `puppetdb`, but...
[15:27:44] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10130497 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=81e99a80-f593-4494-a565-ea730a19fbc7) set by cmooney@cumin1002 fo...
[15:30:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:39:27] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10130595 (cmooney) Ok link was replaced: ` Sep 9 15:36:56 asw2-d-eqiad vccpd[2257]: VCCPD_PROTOCOL_INTF_STATE_CHANGED: Member 4, interface...
[15:41:10] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10130602 (VRiley-WMF) Thank you! I appreciate it. Will be relabeling the new cable as 0325. Feel free to reach out if anything else happens.
[15:48:19] Mail, collaboration-services, Infrastructure-Foundations, vrts, Patch-For-Review: generate_vrts_aliases failing on mx-in1001 - https://phabricator.wikimedia.org/T368257#10130617 (jhathaway) Open→Resolved retry logic has been added which should resolve the issue, please reopen if i...
[16:05:28] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10130724 (Papaul) @cmooney thanks for the feedback. The discussion about not using virtual-chassis was it a final decision or j...
[16:24:14] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: (2) new singlemode fiber patches from dmarc to routers for IX ports - https://phabricator.wikimedia.org/T373376#10130820 (RobH) IRC Update: All DC Ops related items are complete and Cathal is currently working with EQ to schedul...
[16:28:52] I noticed that puppet is failing on some of the stat hosts, because manuel-wmde has processes running, so their user can't be deleted.
[16:29:07] Doesn't the offboarding script stop any user processes?
[16:30:18] In theory yes, but sometimes I recall to have killed dangling processes to let puppet do its work
[16:32:51] elukey: nod
[16:34:32] netops, Infrastructure-Foundations, SRE: BFD won't esablish between QFX in VRF and host from IPv6 link-local - https://phabricator.wikimedia.org/T374379 (cmooney) NEW p:Triage→Low
[16:50:16] hmmh, this offboarding didn't follow the regular procedure
[16:50:31] for all WMF staffers, contractors and researchers Simon and myself handle it
[16:50:57] but in this case of a former WMDE employee WMDE opened a task which ended up in clinic duty
[16:51:10] I'll run the logout script now to rectify that
[16:55:54] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10131007 (Papaul)
[16:56:29] done, I've just forced a puppet run on stat* hosts and it's working again
[16:57:12] I'll reach out to WMDE managers tomorrow to ask them to send all future WMDE offboardings to Simon and myself and not via clinic duty
[17:40:23] thanks moritzm
[18:37:26] netops, Infrastructure-Foundations, SRE, Patch-For-Review: BFD won't esablish between QFX in VRF and host from IPv6 link-local - https://phabricator.wikimedia.org/T374379#10131477 (cmooney) Ok through trial and error it would appear the issue is something to do with the switch not dealing well wi...
[18:53:57] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10131535 (cmooney) So far things seem stable with this. I will leave task open to review as the week goes on, also considering if we need t...
[19:55:18] netops, Infrastructure-Foundations, SRE: Routed Ganeti: Add support for VM QoS marking - https://phabricator.wikimedia.org/T374392 (cmooney) NEW p:Triage→Medium
[22:42:15] Mail, Infrastructure-Foundations, User-notice-archive: Notifications stop after bot edits until page is manually viewed or watchlist is marked as read - https://phabricator.wikimedia.org/T374404 (AlbanGeller) NEW