[00:46:19] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:14:50] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1025.eqiad.wmnet with OS bullseye [01:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:57] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye executed... [01:15:40] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10Papaul) IDRAC and BIOS are up to date on cloudvirt1016 [01:15:45] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:16:12] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye [01:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:24] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1016.eqiad.wmnet with O... [01:22:55] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10Papaul) @Andrew 1016 is now able to PXE boot i stop the OS install because i am having the error below. I think you can fix this.... 
[01:23:01] !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1016.eqiad.wmnet with OS bullseye [01:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:23:13] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1016.eqiad.wmnet with OS bu... [01:37:45] (JobUnavailable) firing: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:42:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:48:23] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:51:15] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:45:37] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:34:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [05:34:39] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [05:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298557)', diff saved to https://phabricator.wikimedia.org/P22800 and previous config saved to /var/cache/conftool/dbconfig/20220318-053443-marostegui.json [05:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:47] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [05:35:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22801 and previous config saved to /var/cache/conftool/dbconfig/20220318-053508-root.json [05:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:38] Did this week's deployment make any changes to how action=parse works? [05:36:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 10%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22802 and previous config saved to /var/cache/conftool/dbconfig/20220318-053615-root.json [05:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:32] See https://www.irccloud.com/pastebin/0jkSPFLI/ compared to https://www.irccloud.com/pastebin/GnsqjN38/ [05:37:02] In one, the value for "level" is a string ("3"), in the other an integer (3). [05:37:24] which is breaking some of our code. 
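A client can absorb the mismatch reported above (a "level" of "3" in one response, 3 in the other) by normalizing the value before using it. This is a minimal Python sketch, assuming each affected object is a JSON dict with a "level" key. The sample payloads and the helper name `section_level` are hypothetical and are not taken from the pastebins or from MediaWiki.

```python
import json


def section_level(section):
    """Return the "level" of a section dict as an int.

    Accepts both shapes reported in the channel: the string "3" and the
    integer 3. Anything else is rejected rather than guessed at.
    """
    raw = section["level"]
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(raw, bool):
        raise TypeError(f"unexpected type for level: {type(raw).__name__}")
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str):
        # int() raises ValueError if the string is not a number.
        return int(raw.strip())
    raise TypeError(f"unexpected type for level: {type(raw).__name__}")


# Hypothetical payloads that mirror the two shapes described in the chat.
string_shape = json.loads('{"line": "History", "level": "3"}')
integer_shape = json.loads('{"line": "History", "level": 3}')

levels = [section_level(string_shape), section_level(integer_shape)]
```

Normalizing at the point of reading keeps the rest of the client indifferent to which shape the API returns, so it keeps working whether or not the change is later reverted.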
[05:38:11] (03PS1) 10Marostegui: db1179: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/771765 (https://phabricator.wikimedia.org/T300600) [05:38:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1179 reimage T300600', diff saved to https://phabricator.wikimedia.org/P22803 and previous config saved to /var/cache/conftool/dbconfig/20220318-053832-marostegui.json [05:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:36] T300600: Upgrade s3 to Bullseye - https://phabricator.wikimedia.org/T300600 [05:39:14] !log dbmaint on s3@eqiad T300600 [05:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:31] (03CR) 10Marostegui: [C: 03+2] db1179: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/771765 (https://phabricator.wikimedia.org/T300600) (owner: 10Marostegui) [05:42:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1179.eqiad.wmnet with OS bullseye [05:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22804 and previous config saved to /var/cache/conftool/dbconfig/20220318-055012-root.json [05:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 25%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22805 and previous config saved to /var/cache/conftool/dbconfig/20220318-055119-root.json [05:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1179.eqiad.wmnet with reason: host reimage [05:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[05:57:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1179.eqiad.wmnet with reason: host reimage [05:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22806 and previous config saved to /var/cache/conftool/dbconfig/20220318-060516-root.json [06:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 50%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22807 and previous config saved to /var/cache/conftool/dbconfig/20220318-060623-root.json [06:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:18] (03PS1) 10Marostegui: Revert "db1179: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/771715 [06:12:08] (03CR) 10Marostegui: [C: 03+2] Revert "db1179: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/771715 (owner: 10Marostegui) [06:12:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1179.eqiad.wmnet with OS bullseye [06:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22808 and previous config saved to /var/cache/conftool/dbconfig/20220318-061332-root.json [06:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22809 and previous config saved to /var/cache/conftool/dbconfig/20220318-062020-root.json 
[06:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 75%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22810 and previous config saved to /var/cache/conftool/dbconfig/20220318-062127-root.json [06:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:39] (03PS1) 10Marostegui: mariadb: Set event_scheduler = ON by default [puppet] - 10https://gerrit.wikimedia.org/r/771768 [06:28:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22811 and previous config saved to /var/cache/conftool/dbconfig/20220318-062836-root.json [06:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:59] (03PS2) 10Marostegui: mariadb: Set event_scheduler = ON by default [puppet] - 10https://gerrit.wikimedia.org/r/771768 [06:32:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298557)', diff saved to https://phabricator.wikimedia.org/P22812 and previous config saved to /var/cache/conftool/dbconfig/20220318-063235-marostegui.json [06:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:39] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [06:35:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22813 and previous config saved to /var/cache/conftool/dbconfig/20220318-063524-root.json [06:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 100%: After schema change ', diff saved to https://phabricator.wikimedia.org/P22814 and 
previous config saved to /var/cache/conftool/dbconfig/20220318-063631-root.json [06:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22815 and previous config saved to /var/cache/conftool/dbconfig/20220318-064340-root.json [06:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P22816 and previous config saved to /var/cache/conftool/dbconfig/20220318-064740-marostegui.json [06:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22817 and previous config saved to /var/cache/conftool/dbconfig/20220318-065844-root.json [06:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220318T0700) [07:02:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P22818 and previous config saved to /var/cache/conftool/dbconfig/20220318-070245-marostegui.json [07:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22819 and previous config saved to /var/cache/conftool/dbconfig/20220318-071348-root.json [07:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:08] (03CR) 10Ayounsi: [C: 03+1] "We also added capacity in eqiad since, so even without drmrs that file isn't needed." [dns] - 10https://gerrit.wikimedia.org/r/771631 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [07:17:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298557)', diff saved to https://phabricator.wikimedia.org/P22820 and previous config saved to /var/cache/conftool/dbconfig/20220318-071750-marostegui.json [07:17:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [07:17:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [07:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:55] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [07:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298557)', diff saved to https://phabricator.wikimedia.org/P22821 and previous config saved to 
/var/cache/conftool/dbconfig/20220318-071758-marostegui.json [07:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:02] (03CR) 10Ayounsi: "I'm wondering if we shouldn't have a bit more traffic to drmrs before we consider it ready to take over a esams depool." [dns] - 10https://gerrit.wikimedia.org/r/771632 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [07:25:36] 10SRE, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Add linecard diversity to the router-to-router interconnect in codfw - https://phabricator.wikimedia.org/T248506 (10ayounsi) 05Open→03Resolved a:03ayounsi Child task is completed enough so this is not an issue anymore. [07:25:36] (03CR) 10Majavah: geodns: add drmrs fallback for esams to whole map (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/771632 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [07:28:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P22822 and previous config saved to /var/cache/conftool/dbconfig/20220318-072852-root.json [07:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:08] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:55:38] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:59:16] (03PS1) 10Elukey: Fix root PKI CA CN for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/771816 (https://phabricator.wikimedia.org/T300130) [08:00:36] (03Abandoned) 10Elukey: Fix root PKI CA CN for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/771816 
(https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [08:10:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298557)', diff saved to https://phabricator.wikimedia.org/P22823 and previous config saved to /var/cache/conftool/dbconfig/20220318-081002-marostegui.json [08:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:07] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [08:16:32] (03PS1) 10Muehlenhoff: Add deployment-docker to whitelist of groups for ops users [puppet] - 10https://gerrit.wikimedia.org/r/771819 [08:17:46] (03PS3) 10Elukey: Set bullseye + overlay settings for kubernetes10[01][56] nodes [puppet] - 10https://gerrit.wikimedia.org/r/771600 (https://phabricator.wikimedia.org/T300744) [08:18:49] (03CR) 10Muehlenhoff: [C: 03+2] Add deployment-docker to whitelist of groups for ops users [puppet] - 10https://gerrit.wikimedia.org/r/771819 (owner: 10Muehlenhoff) [08:19:44] (03CR) 10Elukey: [C: 03+2] Set bullseye + overlay settings for kubernetes10[01][56] nodes [puppet] - 10https://gerrit.wikimedia.org/r/771600 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [08:25:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P22824 and previous config saved to /var/cache/conftool/dbconfig/20220318-082507-marostegui.json [08:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:09] (03CR) 10Elukey: [C: 03+2] Set overlay settings for kubernetes1005 [puppet] - 10https://gerrit.wikimedia.org/r/771601 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [08:26:15] (03PS3) 10Elukey: Set overlay settings for kubernetes1005 [puppet] - 10https://gerrit.wikimedia.org/r/771601 (https://phabricator.wikimedia.org/T300744) [08:38:01] (03PS1) 10Muehlenhoff: Restrict sourcing of systemd 
environent.d to Buster/Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/771821 [08:39:20] (03CR) 10jerkins-bot: [V: 04-1] Restrict sourcing of systemd environent.d to Buster/Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/771821 (owner: 10Muehlenhoff) [08:40:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P22825 and previous config saved to /var/cache/conftool/dbconfig/20220318-084012-marostegui.json [08:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:04] (03PS2) 10Muehlenhoff: Restrict sourcing of systemd environent.d to Buster/Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/771821 [08:54:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/771821 (owner: 10Muehlenhoff) [08:54:45] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/771823 (https://phabricator.wikimedia.org/T279519) [08:55:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298557)', diff saved to https://phabricator.wikimedia.org/P22826 and previous config saved to /var/cache/conftool/dbconfig/20220318-085517-marostegui.json [08:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:22] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [08:56:54] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:57:24] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:00] (03PS3) 10Muehlenhoff: Restrict sourcing of systemd environent.d to Buster/Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/771821 
[09:01:39] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/771823 (https://phabricator.wikimedia.org/T279519) (owner: 10Kosta Harlan) [09:02:19] (03CR) 10jerkins-bot: [V: 04-1] Restrict sourcing of systemd environent.d to Buster/Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/771821 (owner: 10Muehlenhoff) [09:03:40] (03PS4) 10Muehlenhoff: Restrict sourcing of systemd environent.d to Buster/Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/771821 [09:06:10] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/771823 (https://phabricator.wikimedia.org/T279519) (owner: 10Kosta Harlan) [09:08:46] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [09:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:04] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [09:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/771821 (owner: 10Muehlenhoff) [09:12:02] (03PS4) 10MSantos: WIP: introduce geoshapes service [deployment-charts] - 10https://gerrit.wikimedia.org/r/768678 [09:13:11] (03CR) 10jerkins-bot: [V: 04-1] WIP: introduce geoshapes service [deployment-charts] - 10https://gerrit.wikimedia.org/r/768678 (owner: 10MSantos) [09:33:21] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:33:29] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:35:04] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: name=kubernetes1018.eqiad.wmnet [09:35:04] !log akosiaris@cumin1001 conftool action : 
set/weight=10; selector: name=kubernetes1019.eqiad.wmnet [09:35:05] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: name=kubernetes1020.eqiad.wmnet [09:35:05] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: name=kubernetes1021.eqiad.wmnet [09:35:06] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: name=kubernetes1022.eqiad.wmnet [09:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:07] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:13] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [09:35:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [09:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T300775)', diff saved to https://phabricator.wikimedia.org/P22827 and previous config saved to /var/cache/conftool/dbconfig/20220318-093543-marostegui.json [09:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:48] T300775: Add tl_target_id column to templatelinks - 
https://phabricator.wikimedia.org/T300775 [09:37:15] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes1018.eqiad.wmnet [09:37:16] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes1019.eqiad.wmnet [09:37:16] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes1020.eqiad.wmnet [09:37:16] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes1021.eqiad.wmnet [09:37:17] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes1022.eqiad.wmnet [09:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:35] !log pool kubernetes1018-1022 in pybal. [09:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:44] !log pool kubernetes1018-1022 in pybal. T293728 [09:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:48] T293728: setup/install kubernetes10[18-22] - https://phabricator.wikimedia.org/T293728 [09:38:08] (03PS6) 10Giuseppe Lavagetto: cache: turn on dynamic bans on all of eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769389 (https://phabricator.wikimedia.org/T302471) [09:38:17] 10SRE, 10Product-Infrastructure-Team-Backlog, 10serviceops, 10Maps (Geoshapes), and 2 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10MSantos) @akosiaris the initial geoshapes deployment-charts is created and ready to move forward: https://gerrit.wikimedia.org/r/c/opera... 
[09:38:33] 10SRE, 10Product-Infrastructure-Team-Backlog, 10serviceops, 10Maps (Geoshapes), and 2 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10MSantos) [09:39:59] (03PS3) 10DCausse: team-search-platform: relax RdfStreamingUpdaterFlinkProcessingLatencyIsHigh [alerts] - 10https://gerrit.wikimedia.org/r/770982 [09:40:15] (03CR) 10DCausse: [C: 03+2] team-search-platform: relax RdfStreamingUpdaterFlinkProcessingLatencyIsHigh [alerts] - 10https://gerrit.wikimedia.org/r/770982 (owner: 10DCausse) [09:40:29] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:41:44] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: remove duplicate ferm rule for AAAA [puppet] - 10https://gerrit.wikimedia.org/r/771633 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [09:42:00] !log uncordon kubernetes1018-1022. T293728. Nodes are live, ready to receive workloads and traffic. 
[09:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:55] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:47:23] This seems to be the SingTel transport to ulsfo --^ [09:49:28] (03Merged) 10jenkins-bot: team-search-platform: relax RdfStreamingUpdaterFlinkProcessingLatencyIsHigh [alerts] - 10https://gerrit.wikimedia.org/r/770982 (owner: 10DCausse) [09:51:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10akosiaris) [09:51:55] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [09:52:34] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes1001.eqiad.wmnet [09:52:34] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes1002.eqiad.wmnet [09:52:35] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes1003.eqiad.wmnet [09:52:35] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes1004.eqiad.wmnet [09:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:39] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:54:03] !log depool kubernetes100[1-4] from pybal T303044 [09:54:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:07] T303044: decommission kubernetes100[1-4] - https://phabricator.wikimedia.org/T303044 [09:54:39] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:00:47] (03PS5) 10Jbond: Restrict sourcing of systemd environent.d to Buster/Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/771821 (owner: 10Muehlenhoff) [10:01:07] !log drain kubernetes100[1-4] T303044 [10:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:11] T303044: decommission kubernetes100[1-4] - https://phabricator.wikimedia.org/T303044 [10:01:28] (03CR) 10Jbond: [C: 03+1] "LGTM just running pcc" [puppet] - 10https://gerrit.wikimedia.org/r/771821 (owner: 10Muehlenhoff) [10:01:35] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34408/console" [puppet] - 10https://gerrit.wikimedia.org/r/771821 (owner: 10Muehlenhoff) [10:03:28] (03CR) 10Jbond: [V: 03+1 C: 03+1] "pcc looks good will merge thanks" [puppet] - 10https://gerrit.wikimedia.org/r/771821 (owner: 10Muehlenhoff) [10:03:32] (03CR) 10Jbond: [V: 03+1 C: 03+2] Restrict sourcing of systemd environent.d to Buster/Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/771821 (owner: 10Muehlenhoff) [10:11:41] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:12:23] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:15:13] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:16:08] (03PS1) 10Jbond: P:environment: fix zshrc systemd env injection [puppet] - 
10https://gerrit.wikimedia.org/r/771846 [10:16:48] (03CR) 10jerkins-bot: [V: 04-1] P:environment: fix zshrc systemd env injection [puppet] - 10https://gerrit.wikimedia.org/r/771846 (owner: 10Jbond) [10:17:31] (03PS2) 10Jbond: P:environment: fix zshrc systemd env injection [puppet] - 10https://gerrit.wikimedia.org/r/771846 [10:18:11] (03CR) 10jerkins-bot: [V: 04-1] P:environment: fix zshrc systemd env injection [puppet] - 10https://gerrit.wikimedia.org/r/771846 (owner: 10Jbond) [10:18:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34410/console" [puppet] - 10https://gerrit.wikimedia.org/r/771846 (owner: 10Jbond) [10:19:01] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:19:07] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:21:08] (03PS3) 10Jbond: P:environment: fix zshrc systemd env injection [puppet] - 10https://gerrit.wikimedia.org/r/771846 [10:21:53] (03CR) 10Jbond: [C: 03+2] P:environment: fix zshrc systemd env injection [puppet] - 10https://gerrit.wikimedia.org/r/771846 (owner: 10Jbond) [10:21:59] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:26:01] (03PS4) 10Jcrespo: bacula: Add mixed priority to all jobs [puppet] - 10https://gerrit.wikimedia.org/r/771551 (https://phabricator.wikimedia.org/T95705) [10:26:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:27:25] (03CR) 10Jcrespo: [C: 03+2] bacula: Add mixed priority to all jobs [puppet] - 10https://gerrit.wikimedia.org/r/771551 (https://phabricator.wikimedia.org/T95705) (owner: 10Jcrespo) [10:29:11] (03PS1) 10Jbond: 
P:environment: move zshenv to a template [puppet] - https://gerrit.wikimedia.org/r/771848 [10:29:25] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:29:46] I got a puppet error: Error: /Stage[main]/Profile::Environment/File[/etc/zsh/zshenv]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/profile/environment/zshrc [10:30:01] maybe it was just because I was running it manually [10:30:21] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.1057 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:30:31] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:30:40] I think not [10:30:44] (CR) jerkins-bot: [V: -1] P:environment: move zshenv to a template [puppet] - https://gerrit.wikimedia.org/r/771848 (owner: Jbond) [10:30:57] jbond: aware?
[10:31:04] (PS2) Jbond: P:environment: move zshenv to a template [puppet] - https://gerrit.wikimedia.org/r/771848 [10:31:52] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34412/console" [puppet] - https://gerrit.wikimedia.org/r/771848 (owner: Jbond) [10:32:15] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:32:19] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:32:26] (CR) Jbond: [V: +2 C: +2] P:environment: move zshenv to a template [puppet] - https://gerrit.wikimedia.org/r/771848 (owner: Jbond) [10:33:19] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:33:23] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:34:53] (PS1) Jbond: P:environment: fix zshenv permissions [puppet] - https://gerrit.wikimedia.org/r/771849 [10:36:03] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [10:37:19] (CR) Jbond: [C: +2] P:environment: fix zshenv permissions [puppet] - https://gerrit.wikimedia.org/r/771849 (owner: Jbond) [10:37:43] SRE, Patch-For-Review: bacula restore job waiting on higher jobs - https://phabricator.wikimedia.org/T95705 (jcrespo) I will now run a regular priority backup (priority 10) and seconds later run a recovery (which should run by default with priority 1) and the recovery should start immediately. Recovery...
[10:44:43] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:46:25] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:46:30] SRE, Patch-For-Review: bacula restore job waiting on higher jobs - https://phabricator.wikimedia.org/T95705 (jcrespo) Open→Resolved It worked, restore running despite being executed later with a different priority: ` JobId Type Level Files Bytes Name Status ===============... [10:48:35] SRE, Data-Persistence-Backup, bacula, Patch-For-Review: bacula restore job waiting on higher jobs - https://phabricator.wikimedia.org/T95705 (jcrespo) [10:50:13] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:50:40] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2001.codfw.wmnet [10:50:40] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2002.codfw.wmnet [10:50:41] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2003.codfw.wmnet [10:50:41] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2004.codfw.wmnet [10:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:24] !log depool kubernetes200[1-4] T303045 [10:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:28] T303045: decommission kubernetes200[1-4] -
https://phabricator.wikimedia.org/T303045 [10:51:49] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:52:48] !log drain kubernetes200[1-4] T303045 [10:52:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet [10:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:55] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:54:23] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [10:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:31] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:55:40] (PS1) Alexandros Kosiaris: decommission kubernetes[12]00[1-4] [puppet] - https://gerrit.wikimedia.org/r/771850 (https://phabricator.wikimedia.org/T303044) [10:58:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet [10:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:59] (PS6) Jbond: P:scap::dsh: Add scap targets as a dsh group [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) [11:04:01] (PS1) Jbond: P:scap::dsh: add documentation [puppet] - https://gerrit.wikimedia.org/r/771853 [11:05:19] !log restarting acme-chief and acme-chief API services to catch up on OpenSSL updates [11:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:51] RECOVERY - Widespread puppet agent failures on alert1001
is OK: (C)0.01 ge (W)0.006 ge 0.002667 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:06:15] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (MoritzMuehlenhoff) dumpsdata1007 is now running 5.16.11, can you please retest? I'm not familiar with perccli myself, if there's any uncertainties with the docs/setup let's clarify with Dell? [11:07:51] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:09:13] (PS10) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [11:09:48] !log rolling restart of nginx on ncredir instances to catch up on OpenSSL updates [11:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:15] (CR) jerkins-bot: [V: -1] swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon) [11:11:29] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:13:01] (PS11) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [11:13:09] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs.
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:13:11] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:13:13] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:15:13] (PS1) Jbond: P:pki::client: add documentation [puppet] - https://gerrit.wikimedia.org/r/771854 [11:16:34] (PS2) Jbond: P:pki::client: add documentation [puppet] - https://gerrit.wikimedia.org/r/771854 [11:17:15] (PS1) Vgutierrez: ncredir: Use profile::lvs::realserver [puppet] - https://gerrit.wikimedia.org/r/771855 [11:18:07] (CR) jerkins-bot: [V: -1] P:pki::client: add documentation [puppet] - https://gerrit.wikimedia.org/r/771854 (owner: Jbond) [11:18:35] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:19:28] (PS1) Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/771856 (https://phabricator.wikimedia.org/T279519) [11:19:37] (CR) Kosta Harlan: [C: +2] linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/771856 (https://phabricator.wikimedia.org/T279519) (owner: Kosta Harlan) [11:22:37] (PS12) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [11:22:39] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet,
wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled ht [11:22:39] kitech.wikimedia.org/wiki/PyBal [11:22:53] looking ^ [11:23:05] (PS3) Jbond: P:pki::client: add documentation [puppet] - https://gerrit.wikimedia.org/r/771854 [11:23:25] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs20 [11:23:25] .wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:23:49] (Merged) jenkins-bot: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/771856 (https://phabricator.wikimedia.org/T279519) (owner: Kosta Harlan) [11:24:18] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34416/console" [puppet] - https://gerrit.wikimedia.org/r/771854 (owner: Jbond) [11:24:23] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:35] (CR) Jbond: [V: +1 C: +2] P:pki::client: add documentation [puppet] - https://gerrit.wikimedia.org/r/771854 (owner: Jbond) [11:24:50] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34417/console" [puppet] - https://gerrit.wikimedia.org/r/771855 (owner: Vgutierrez) [11:25:05] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:25:11] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:25:29] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:25:33] someone's hammering wdqs [11:26:15] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:26:33] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon) [11:26:57] (CR) Filippo Giunchedi: "Thank you for following up!" [puppet] - https://gerrit.wikimedia.org/r/771610 (owner: Ssingh) [11:27:13] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:13] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [11:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:41] RECOVERY - Confd vcl based reload on cp5011 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:29:11] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [11:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:26] SRE, ops-eqiad, DC-Ops, serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (akosiaris) >>! In T299573#7785936, @Cmjohnson wrote: > @akosiaris I can spread the other 3 between B and D if that works better for you? Yeah that sounds fine. T...
[11:30:14] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [11:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:51] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:30:53] (CR) Vgutierrez: [V: +1 C: +2] ncredir: Use profile::lvs::realserver [puppet] - https://gerrit.wikimedia.org/r/771855 (owner: Vgutierrez) [11:32:32] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [11:32:33] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:19] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [11:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:12] SRE, ops-eqiad, DC-Ops, serviceops: Q3:(Need By: TBD) rack/setup/install conf100[789] - https://phabricator.wikimedia.org/T301272 (akosiaris) >>! In T301272#7782148, @cmooney wrote: > FYI I don't believe there is any reason E/F would be ruled out for these, if space/power is tight in the existing...
[11:35:13] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [11:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:33] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:37:03] (PS1) Jbond: P:pki::client: make the bundle source location configurable [puppet] - https://gerrit.wikimedia.org/r/771859 [11:40:01] (CR) Jbond: "see pcc full diff for an example of what this will look like" [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: Jbond) [11:40:11] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34419/console" [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: Jbond) [11:40:14] (CR) Jbond: [V: +1] P:scap::dsh: Add scap targets as a dsh group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: Jbond) [11:41:07] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:42:09] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs.
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:42:24] (PS13) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [11:42:55] (CR) Majavah: P:scap::dsh: Add scap targets as a dsh group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: Jbond) [11:44:59] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:45:11] (PS2) Ladsgroup: idp: Open up orchestrator to cumin host, take III [puppet] - https://gerrit.wikimedia.org/r/771642 (https://phabricator.wikimedia.org/T281249) (owner: Jbond) [11:45:16] (CR) Ladsgroup: [V: +2 C: +2] idp: Open up orchestrator to cumin host, take III [puppet] - https://gerrit.wikimedia.org/r/771642 (https://phabricator.wikimedia.org/T281249) (owner: Jbond) [11:45:40] SRE, Traffic: Clean up Traffic Grafana dashboards to reflect HA-Proxy metrics - https://phabricator.wikimedia.org/T304153 (MMandere) [11:45:45] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (MMandere) [11:45:47] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon) [11:49:41] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:52:04] SRE-tools, DBA, Infrastructure-Foundations, Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (Ladsgroup) The patch broke orch in two ways.
- The first being mismatch... [11:53:35] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:54:36] (PS1) Ladsgroup: orchestrator: Fix Apache settings [puppet] - https://gerrit.wikimedia.org/r/771861 (https://phabricator.wikimedia.org/T281249) [11:56:17] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:56:25] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:57:46] (CR) Ladsgroup: [C: +2] orchestrator: Fix Apache settings [puppet] - https://gerrit.wikimedia.org/r/771861 (https://phabricator.wikimedia.org/T281249) (owner: Ladsgroup) [11:59:43] SRE-tools, DBA, Infrastructure-Foundations, Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (Ladsgroup) orchestrator is now online with correct access (including AP... [12:01:40] (CR) Ladsgroup: idp: Open up orchestrator to cumin host, take III (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771642 (https://phabricator.wikimedia.org/T281249) (owner: Jbond) [12:01:55] (PS3) Krinkle: Migrate reads from wmfDbconfigFromEtcd to wmgDbconfigFromEtcd [mediawiki-config] - https://gerrit.wikimedia.org/r/768256 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [12:01:57] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:02:07] (CR) Krinkle: [C: +1] "LGTM." [mediawiki-config] - https://gerrit.wikimedia.org/r/768256 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [12:02:45] (CR) Krinkle: [C: +1] "Good to go!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/768256 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [12:06:29] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:06:35] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:09:17] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:09:23] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:09:59] (PS7) Jbond: P:scap::dsh: Add scap targets as a dsh group [puppet] - https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) [12:10:25] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs.
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:11:33] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 301): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34418/console" [puppet] - https://gerrit.wikimedia.org/r/771859 (owner: Jbond) [12:11:45] (PS1) Giuseppe Lavagetto: varnish/tests: improve UX, refactor run.py [puppet] - https://gerrit.wikimedia.org/r/771863 [12:16:02] (PS1) Muehlenhoff: Add component/python35 [puppet] - https://gerrit.wikimedia.org/r/771865 (https://phabricator.wikimedia.org/T303801) [12:16:15] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:16:56] (PS1) Jbond: idp: Open up orchestrator to cumin host, take IV [puppet] - https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) [12:17:31] (PS2) Jbond: idp: Open up orchestrator to cumin host, take IV [puppet] - https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) [12:18:22] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34421/console" [puppet] - https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: Jbond) [12:19:07] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:20:33] (CR) Jbond: [V: +1 C: +2] P:pki::client: make the bundle source location configurable [puppet] - https://gerrit.wikimedia.org/r/771859 (owner: Jbond) [12:20:42] (CR) Muehlenhoff: [C: +2] Add component/python35 [puppet] - https://gerrit.wikimedia.org/r/771865 (https://phabricator.wikimedia.org/T303801) (owner: Muehlenhoff) [12:21:49] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:22:50] (CR) Ladsgroup: "PCC has the wrong element closure but I can't find it in the code. https://puppet-compiler.wmflabs.org/pcc-worker1002/34421/dborch1001.wik" [puppet] - https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: Jbond) [12:23:10] (CR) Ladsgroup: [C: +1] "aah fixed in PS2. LGTM, do you want to deploy it?" [puppet] - https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: Jbond) [12:29:23] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:33:11] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:34:04] (PS1) Jbond: deployment-prep: move use discover ca by default in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/771867 [12:34:15] (CR) Jbond: [V: +2 C: +2] deployment-prep: move use discover ca by default in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/771867 (owner: Jbond) [12:35:00] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [12:35:01] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:29] (PS1) Jcrespo: wmfbackups: Add manpages for all executables and config files [software/wmfbackups] - https://gerrit.wikimedia.org/r/771868 (https://phabricator.wikimedia.org/T138562) [12:46:15] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs.
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:49:33] (CR) Jbond: idp: Open up orchestrator to cumin host, take IV (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: Jbond) [12:50:25] (CR) Ladsgroup: [C: +1] idp: Open up orchestrator to cumin host, take IV (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: Jbond) [12:52:03] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:52:41] (CR) Ayounsi: [C: -1] Add ACL filter to Spine switch interface connecting CR routers Eqiad (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/771461 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [12:54:54] (CR) Jbond: idp: Open up orchestrator to cumin host, take IV (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: Jbond) [12:55:55] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:57:37] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:57:45] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:02:45] !log imported python3.5 3.5.3-1+deb9u5+wmf1 to component/python35 T303801 [13:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:50] T303801: Upgrade ORES to Debian Buster or Bullseye - https://phabricator.wikimedia.org/T303801 [13:06:09] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:06:17] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:07:56] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test sync - jbond@cumin1001" [13:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:34] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "test sync - jbond@cumin1001" [13:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:09] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:09:53] (PS11) Jbond: P:base::production: Add profile::netbox::host [puppet] - https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) [13:10:15] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:11:18] (PS4) Jcrespo: Improve logic and quality of life for remote backups [software/wmfbackups] - https://gerrit.wikimedia.org/r/770021 (https://phabricator.wikimedia.org/T138562) [13:11:27] (PS1) Esanders: Disable autotopicsub user option by default [mediawiki-config] - https://gerrit.wikimedia.org/r/771872 (https://phabricator.wikimedia.org/T297966) [13:12:57] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs.
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:15:49] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:15:55] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:17:33] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:19:40] (PS2) Jcrespo: wmfbackups: Add manpages for all executables and config files [software/wmfbackups] - https://gerrit.wikimedia.org/r/771868 (https://phabricator.wikimedia.org/T138562) [13:23:21] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:23:47] (CR) Marostegui: [C: +2] mariadb: Set event_scheduler = ON by default [puppet] - https://gerrit.wikimedia.org/r/771768 (owner: Marostegui) [13:24:23] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:26:13] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:27:06] (PS1) Marostegui: instance.pp: Enable event_scheduler by default [puppet] - https://gerrit.wikimedia.org/r/771875 [13:27:49] (CR) Marostegui: [C: +2] instance.pp: Enable event_scheduler by default [puppet] - https://gerrit.wikimedia.org/r/771875 (owner: Marostegui) [13:35:23] (CR) Jbond: [C: +2] P:base::production: Add profile::netbox::host [puppet] - https://gerrit.wikimedia.org/r/769983 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [13:40:19] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs.
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:40:27] (CR) Elukey: "<3" [puppet] - https://gerrit.wikimedia.org/r/771865 (https://phabricator.wikimedia.org/T303801) (owner: Muehlenhoff) [13:40:29] (PS1) Filippo Giunchedi: sre: add dashboard to network probes alerts [alerts] - https://gerrit.wikimedia.org/r/771883 (https://phabricator.wikimedia.org/T291946) [13:40:53] (PS3) Matthias Mullie: Remove unused WikibaseMediaInfo & MediaSearch config vars [mediawiki-config] - https://gerrit.wikimedia.org/r/737379 [13:42:59] (PS1) Marostegui: mariadb: Enable event_scheduler [puppet] - https://gerrit.wikimedia.org/r/771884 (https://phabricator.wikimedia.org/T266119) [13:43:17] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:44:15] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:45:15] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:45:24] (CR) Filippo Giunchedi: "I don't feel like I meaningfully vote but LGTM!"
[puppet] - 10https://gerrit.wikimedia.org/r/769749 (https://phabricator.wikimedia.org/T300119) (owner: 10Herron) [13:46:52] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/771884 (https://phabricator.wikimedia.org/T266119) (owner: 10Marostegui) [13:48:49] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:49:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/771415 (owner: 10Jbond) [13:49:45] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01173 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:51:25] looking [13:51:58] (03CR) 10Marostegui: [C: 03+2] mariadb: Enable event_scheduler [puppet] - 10https://gerrit.wikimedia.org/r/771884 (https://phabricator.wikimedia.org/T266119) (owner: 10Marostegui) [13:52:51] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:53:26] (03PS1) 10Kosta Harlan: linkrecommendation: Rollback to known good version [deployment-charts] - 10https://gerrit.wikimedia.org/r/771887 (https://phabricator.wikimedia.org/T279519) [13:53:41] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Rollback to known good version [deployment-charts] - 10https://gerrit.wikimedia.org/r/771887 (https://phabricator.wikimedia.org/T279519) (owner: 10Kosta Harlan) [13:54:57] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [13:56:37] (03PS1) 10Jbond: P:netbox::host: relx type checking as we have racks like \d.\d.\d [puppet] - 10https://gerrit.wikimedia.org/r/771889 [13:56:49] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:netbox::host: relx type checking as we have racks like \d.\d.\d [puppet] - 10https://gerrit.wikimedia.org/r/771889 (owner: 10Jbond) [13:57:23] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:58:15] (03Merged) 10jenkins-bot: linkrecommendation: Rollback to known good version [deployment-charts] - 10https://gerrit.wikimedia.org/r/771887 (https://phabricator.wikimedia.org/T279519) (owner: 10Kosta Harlan) [13:59:39] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [13:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:56] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [13:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:13] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:01:01] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [14:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:23] (03PS1) 10Jbond: motd::script: script names need to be lower case to be detected [puppet] - 10https://gerrit.wikimedia.org/r/771890 [14:01:25] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:01:33] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [14:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:40] (03PS1) 10Ssingh: test_dns: add drmrs doh* hosts and their IPv4 addresses [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/771891 [14:02:16] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [14:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:29] (03CR) 10Jbond: [C: 03+2] motd::script: script names need to be lower case to be detected [puppet] - 
10https://gerrit.wikimedia.org/r/771890 (owner: 10Jbond) [14:02:45] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [14:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:07] (03PS1) 10Filippo Giunchedi: alertmanager: use alert-specific link to dashboard [puppet] - 10https://gerrit.wikimedia.org/r/771892 [14:03:15] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:04:26] (03PS2) 10Filippo Giunchedi: alertmanager: use alert-specific link to dashboard [puppet] - 10https://gerrit.wikimedia.org/r/771892 [14:05:43] (03CR) 10Ssingh: [C: 03+2] test_dns: add drmrs doh* hosts and their IPv4 addresses [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/771891 (owner: 10Ssingh) [14:05:58] (03PS5) 10Jcrespo: Improve logic and quality of life for remote backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/770021 (https://phabricator.wikimedia.org/T138562) [14:05:59] (03PS3) 10Jcrespo: wmfbackups: Add manpages for all executables and config files [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/771868 (https://phabricator.wikimedia.org/T138562) [14:06:49] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003198 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:07:13] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:08:00] (03PS1) 10Jbond: P:netbox::host: fix motd title [puppet] - 10https://gerrit.wikimedia.org/r/771893 [14:08:36] (03CR) 10jerkins-bot: [V: 04-1] P:netbox::host: fix motd title [puppet] - 10https://gerrit.wikimedia.org/r/771893 (owner: 10Jbond) [14:09:48] (03PS2) 10Jbond: P:netbox::host: fix motd title [puppet] - 
10https://gerrit.wikimedia.org/r/771893 [14:10:03] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:12:54] !log configure NAT for civi1002 - T304098 [14:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34434/console" [puppet] - 10https://gerrit.wikimedia.org/r/771893 (owner: 10Jbond) [14:17:06] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:netbox::host: fix motd title [puppet] - 10https://gerrit.wikimedia.org/r/771893 (owner: 10Jbond) [14:18:37] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:20:23] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [14:21:53] (03PS1) 10Ladsgroup: Don't pass the revision to PO access service [extensions/FlaggedRevs] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771907 (https://phabricator.wikimedia.org/T304127) [14:21:58] (03CR) 10Ladsgroup: [C: 03+2] Don't pass the revision to PO access service [extensions/FlaggedRevs] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771907 (https://phabricator.wikimedia.org/T304127) (owner: 10Ladsgroup) [14:25:27] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:25:35] (03Merged) 10jenkins-bot: Don't pass the revision to PO access service [extensions/FlaggedRevs] (wmf/1.38.0-wmf.26) - 10https://gerrit.wikimedia.org/r/771907 (https://phabricator.wikimedia.org/T304127) (owner: 10Ladsgroup) [14:26:43] (03PS1) 10Ssingh: dnsdist: remove redundant rate limits [puppet] - 10https://gerrit.wikimedia.org/r/771902 [14:26:56] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.26/extensions/FlaggedRevs/backend/FlaggedRevs.php: Backport: [[gerrit:771907|Don't pass the revision to PO access service (T304127)]] (duration: 00m 49s) [14:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:01] T304127: InvalidArgumentException: The revision does not belong to the given page. - https://phabricator.wikimedia.org/T304127 [14:30:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:25] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:33:57] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:35:09] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:37:44] (03PS1) 10Elukey: profile::rsyslog: add new cabundle paths for omkafka [puppet] - 10https://gerrit.wikimedia.org/r/771905 (https://phabricator.wikimedia.org/T300130) [14:38:01] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:39:44] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34438/console" [puppet] - 10https://gerrit.wikimedia.org/r/771905 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [14:41:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/771905 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [14:43:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:49] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:46:27] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:47:15] (03CR) 10Ssingh: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/771902 (owner: 10Ssingh) [14:49:00] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/pcc-worker1001/34437/doh1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/771902 (owner: 10Ssingh) [14:49:21] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:50:49] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:50:57] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:21] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) @colewhite I was able to move the deployment-prep's kafka logging host to PKI, the new TLS settings seem to work but lemme know if you see anything weird on the lo... [14:53:29] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:54:45] (03CR) 10Ssingh: [C: 03+2] dnsdist: remove redundant rate limits [puppet] - 10https://gerrit.wikimedia.org/r/771902 (owner: 10Ssingh) [14:54:45] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:55:06] (03CR) 10Muehlenhoff: P:environment: Add no_proxy values to the default environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771411 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [14:55:07] ^ XioNoX any idea what's happening with these? [14:55:22] sukhe: it's flapping :) [14:55:31] my worry is they might be the IPv6 bird stuff we were seeing on drmrs [14:56:08] sukhe: it's the Singtel link between ulsfo and eqsin [14:56:13] ah! thanks! 
[14:56:17] * sukhe phew [14:56:26] don't want *that much fun* on a Friday [14:56:28] sukhe: elukey sent an email to their NOC, but no replies so far [14:56:37] PROBLEM - Host kubernetes1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:56:48] thanks elukey [14:56:58] it's not the primary link, so I prefer to see it flap than disable it [14:57:17] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:01:26] (KubernetesCalicoDown) firing: kubernetes1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:01:33] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:02:38] XioNoX, sukhe o/ Chris sent an email for a similar issue earlier on in March, and afaics they didn't answer :( [15:03:46] oh right I remember that one [15:04:25] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:05:21] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:07:09] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:11:06] (03PS1) 10Jbond: R:tlsproxy: Drop version 3 support and add missing docs [puppet] - 10https://gerrit.wikimedia.org/r/771930 [15:11:51] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::rsyslog: add new cabundle paths for omkafka [puppet] - 10https://gerrit.wikimedia.org/r/771905 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [15:12:41] (03CR) 10jerkins-bot: [V: 04-1] R:tlsproxy: Drop version 3 support and add missing docs [puppet] - 10https://gerrit.wikimedia.org/r/771930 (owner: 10Jbond) [15:13:47] (03CR)
10Ssingh: P:icinga: add profile for performance tweaking (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/771610 (owner: 10Ssingh) [15:20:36] (03PS2) 10Jbond: R:tlsproxy: Drop version 3 support and add missing docs [puppet] - 10https://gerrit.wikimedia.org/r/771930 [15:22:05] (03CR) 10jerkins-bot: [V: 04-1] R:tlsproxy: Drop version 3 support and add missing docs [puppet] - 10https://gerrit.wikimedia.org/r/771930 (owner: 10Jbond) [15:23:03] (03PS3) 10Jbond: R:tlsproxy: Drop version 3 support and add missing docs [puppet] - 10https://gerrit.wikimedia.org/r/771930 [15:23:37] (03CR) 10JMeybohm: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/771892 (owner: 10Filippo Giunchedi) [15:24:31] (03CR) 10jerkins-bot: [V: 04-1] R:tlsproxy: Drop version 3 support and add missing docs [puppet] - 10https://gerrit.wikimedia.org/r/771930 (owner: 10Jbond) [15:27:57] (03CR) 10BBlack: geodns: add drmrs fallback for esams to whole map (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/771632 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [15:28:22] (03PS14) 10MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [15:28:45] (03PS3) 10BBlack: geodns: add drmrs fallback for esams to whole map [dns] - 10https://gerrit.wikimedia.org/r/771632 (https://phabricator.wikimedia.org/T304089) [15:29:49] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:38:55] !log powercycle kubernetes1002 [15:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:19] RECOVERY - Host kubernetes1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [15:46:26] (KubernetesCalicoDown) resolved: kubernetes1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:48:19] 
RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:48:35] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:49:41] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:51:23] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:54:57] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:56:23] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:56:55] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:51] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:59:07] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:09:09] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:12:21] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:13:59] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:15:19] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:18:09] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:23:43] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:26:31] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:28:11] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:31:01] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:34:49] RECOVERY - Device not healthy -SMART- on ganeti2013 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ganeti2013&var-datasource=codfw+prometheus/ops [16:36:59] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:37:57] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:39:35] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:40:45] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:40:53] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:42:25] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:42:41] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:49:17] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:49:27] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:50:57] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:51:15] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:52:09] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:56:37] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:59:19] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:41] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:03:39] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:05:11] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:06:25] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:07:51] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:13] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:13:59] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:16:51] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:18:07] PROBLEM - IPMI Sensor Status on mc1053 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:20:37] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:25:23] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:27:55] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:31:01] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:39:21] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:41:27] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:46:15] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:49:13] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:50:43] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:52:25] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:57:37] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:03:17] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:10:39] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:11:55] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:13:29] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:16:37] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:24:55] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:27:45] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:28:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_navigationtiming_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:51] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:39:07] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:41:59] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:43:09] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:47:59] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:50:29] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:51:47] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:51:47] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:53:21] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:56:29] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:57:29] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:02:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:27] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:11:43] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:14:33] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:14:33] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:16:07] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:17:21] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:23:51] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:23:57] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:24:59] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:26:01] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:29:59] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:34:51] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:38:15] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:40:23] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:40:35] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:42:39] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:55:31] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:01:05] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:03:33] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:06:45] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:12:33] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:18:07] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:18:21] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:23:59] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:29:27] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:32:19] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:37:55] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:50:47] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:57:59] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:00:53] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:02:07] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:02:19] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye [21:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:15] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:06:33] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 
AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:12:45] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1016.eqiad.wmnet with reason: host reimage [21:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:57] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:16:11] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1016.eqiad.wmnet with reason: host reimage [21:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:26] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:18:36] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:20:56] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:24:26] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:24:36] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:24:54] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:25:54] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:29:20] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status 
[21:31:06] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:34:46] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:35:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:10] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:44:28] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:44:48] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:46:56] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:47:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:50:20] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:02:23] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:05:47] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:11:25] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:14:39] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:15:45] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:20:43] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:23:21] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:37:25] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:40:27] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:43:05] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:44:47] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:44:53] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:47:41] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:48:55] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:50:29] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:04:43] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:07:33] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:08:43] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:11:27] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:14:23] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:18:49] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:18:55] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:18:57] PROBLEM - ensure kvm processes are running on cloudvirt1016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:22:45] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:22:55] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:24:31] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:27:37] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:28:33] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:30:13] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:32:57] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:36:05] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:03] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:39:51] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:48:17] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:51:09] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status