[00:03:24] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:09:52] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:12:13] (PS2) STran: Add IPInfo viewing rights for certain groups [mediawiki-config] - https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) [01:40:30] (JobUnavailable) firing: (4) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [03:13:22] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:55:36] (PS2) Ladsgroup: db1147: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/767797 (https://phabricator.wikimedia.org/T302950) (owner: Gerrit maintenance bot) [04:55:41] (CR) Ladsgroup: [V: +2 C: +2] db1147: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/767797 (https://phabricator.wikimedia.org/T302950) (owner: Gerrit maintenance bot) [05:14:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [05:14:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [05:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [05:15:33] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [05:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T302950)', diff saved to https://phabricator.wikimedia.org/P21858 and previous config saved to /var/cache/conftool/dbconfig/20220307-051537-ladsgroup.json [05:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:40] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [05:18:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [05:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [05:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T300992)', diff saved to https://phabricator.wikimedia.org/P21859 and previous config saved to /var/cache/conftool/dbconfig/20220307-051807-ladsgroup.json [05:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:10] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [05:22:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1147.eqiad.wmnet with OS bullseye [05:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T300992)', diff saved to https://phabricator.wikimedia.org/P21860 and previous config saved to 
/var/cache/conftool/dbconfig/20220307-052257-ladsgroup.json [05:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:57] (PS1) Juan90264: Revert "Change temporary logo for slwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/768155 [05:27:52] (PS2) Juan90264: Revert "Change temporary logo for slwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/768155 (https://phabricator.wikimedia.org/T302661) [05:33:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1147.eqiad.wmnet with reason: host reimage [05:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1147.eqiad.wmnet with reason: host reimage [05:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P21861 and previous config saved to /var/cache/conftool/dbconfig/20220307-053802-ladsgroup.json [05:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [05:51:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1147.eqiad.wmnet with OS bullseye [05:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P21862 and previous config saved to /var/cache/conftool/dbconfig/20220307-055307-ladsgroup.json [05:53:08] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:02] SRE, ops-codfw, DBA: db2147 SMART error - https://phabricator.wikimedia.org/T302951 (Marostegui) I don't think we need to swap any disks here, they all seem fine to me: ` Media Error Count: 0 Other Error Count: 0 Media Error Count: 0 Other Error Count: 0 Media Error Count: 0 Other Error Count: 0 Medi... [06:08:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T300992)', diff saved to https://phabricator.wikimedia.org/P21863 and previous config saved to /var/cache/conftool/dbconfig/20220307-060811-ladsgroup.json [06:08:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [06:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [06:08:15] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [06:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T300992)', diff saved to https://phabricator.wikimedia.org/P21864 and previous config saved to /var/cache/conftool/dbconfig/20220307-060819-ladsgroup.json [06:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T300992)', diff saved to https://phabricator.wikimedia.org/P21865 and previous config saved to /var/cache/conftool/dbconfig/20220307-061318-ladsgroup.json [06:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:22] T300992: Add linter_template and
linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [06:27:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T302950)', diff saved to https://phabricator.wikimedia.org/P21866 and previous config saved to /var/cache/conftool/dbconfig/20220307-062713-ladsgroup.json [06:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:17] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [06:28:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P21867 and previous config saved to /var/cache/conftool/dbconfig/20220307-062823-ladsgroup.json [06:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:39] (PS1) Ladsgroup: Revert "db1147: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/768157 [06:40:50] (CR) Ladsgroup: [V: +2 C: +2] Revert "db1147: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/768157 (owner: Ladsgroup) [06:42:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P21868 and previous config saved to /var/cache/conftool/dbconfig/20220307-064217-ladsgroup.json [06:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:21] (PS2) Urbanecm: ThrottleTest: Cast strtotime to bool before comparing [mediawiki-config] - https://gerrit.wikimedia.org/r/767887 [06:43:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P21869 and previous config saved to /var/cache/conftool/dbconfig/20220307-064327-ladsgroup.json [06:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:05] (CR) Urbanecm: [C: +2] "no-op for prod" [mediawiki-config] -
https://gerrit.wikimedia.org/r/767887 (owner: Urbanecm) [06:44:44] (Merged) jenkins-bot: ThrottleTest: Cast strtotime to bool before comparing [mediawiki-config] - https://gerrit.wikimedia.org/r/767887 (owner: Urbanecm) [06:44:59] (PS3) Urbanecm: throttle: Add rule for Wikigap 2022 in CZ [mediawiki-config] - https://gerrit.wikimedia.org/r/767883 (https://phabricator.wikimedia.org/T303002) [06:45:13] (CR) Urbanecm: [C: +2] throttle: Add rule for Wikigap 2022 in CZ [mediawiki-config] - https://gerrit.wikimedia.org/r/767883 (https://phabricator.wikimedia.org/T303002) (owner: Urbanecm) [06:45:24] (PS4) Urbanecm: throttle: Add rule for arwiki Wikigap [mediawiki-config] - https://gerrit.wikimedia.org/r/767885 (https://phabricator.wikimedia.org/T302973) [06:45:28] (CR) Urbanecm: [C: +2] throttle: Add rule for arwiki Wikigap [mediawiki-config] - https://gerrit.wikimedia.org/r/767885 (https://phabricator.wikimedia.org/T302973) (owner: Urbanecm) [06:45:53] (Merged) jenkins-bot: throttle: Add rule for Wikigap 2022 in CZ [mediawiki-config] - https://gerrit.wikimedia.org/r/767883 (https://phabricator.wikimedia.org/T303002) (owner: Urbanecm) [06:46:09] (Merged) jenkins-bot: throttle: Add rule for arwiki Wikigap [mediawiki-config] - https://gerrit.wikimedia.org/r/767885 (https://phabricator.wikimedia.org/T302973) (owner: Urbanecm) [06:49:51] !log urbanecm@deploy1002 Synchronized wmf-config/throttle.php: 2e9fdd4: 867bb7b: Add throttle rules (T302973; T303002) (duration: 00m 49s) [06:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:56] T302973: Temporary lift IP cap for WikiGap edit-a-thon at Khawarizmi College in 7 March 2022 - https://phabricator.wikimedia.org/T302973 [06:49:56] T303002: Request a throttle lift for Wikigap 2022 in Czech Republic – March 10, 2022 - https://phabricator.wikimedia.org/T303002 [06:50:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START
helmfile.d/services/mwdebug: apply [06:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:00] !log Reset authentication throttle for 217.23.37.10 via resetAuthenticationThrottle.php (T302973) [06:52:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P21870 and previous config saved to /var/cache/conftool/dbconfig/20220307-065722-ladsgroup.json [06:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T300992)', diff saved to https://phabricator.wikimedia.org/P21871 and previous config saved to /var/cache/conftool/dbconfig/20220307-065832-ladsgroup.json [06:58:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [06:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:35] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [06:58:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [06:58:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T300992)', diff saved to https://phabricator.wikimedia.org/P21872 and previous config saved to /var/cache/conftool/dbconfig/20220307-065839-ladsgroup.json [06:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T300992)', diff saved to https://phabricator.wikimedia.org/P21873 and previous config saved to /var/cache/conftool/dbconfig/20220307-070355-ladsgroup.json [07:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:58] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [07:05:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1179', diff saved to https://phabricator.wikimedia.org/P21874 and previous config saved to /var/cache/conftool/dbconfig/20220307-070537-marostegui.json [07:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:00] !log dbmaint on db1179 s3@eqiad T302222 [07:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:02] T302222: Check and fix compressed mismatched tables - https://phabricator.wikimedia.org/T302222 [07:07:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:07:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21875 and previous config saved to /var/cache/conftool/dbconfig/20220307-070953-root.json [07:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T302950)', diff saved to https://phabricator.wikimedia.org/P21876 and previous config saved to /var/cache/conftool/dbconfig/20220307-071227-ladsgroup.json [07:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:30] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [07:14:36] !log kill tmux sessions of user 'zpapierski' on wdqs[1004,2002,2003] (puppet broken, offboarded user) [07:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:36] !log `elukey@ml-staging-ctrl2002:~$ sudo systemctl reset-failed ifup@ens13.service` [07:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P21877 and previous config saved to /var/cache/conftool/dbconfig/20220307-071900-ladsgroup.json [07:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:43] (PS2) Ladsgroup: db1144: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/768054 (https://phabricator.wikimedia.org/T302950) (owner: Gerrit maintenance bot) [07:23:47] (CR) Ladsgroup: [V: +2 C: +2] db1144: Disable notifications [puppet] -
https://gerrit.wikimedia.org/r/768054 (https://phabricator.wikimedia.org/T302950) (owner: Gerrit maintenance bot) [07:24:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [07:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [07:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T302950)', diff saved to https://phabricator.wikimedia.org/P21878 and previous config saved to /var/cache/conftool/dbconfig/20220307-072453-ladsgroup.json [07:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:56] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [07:24:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21879 and previous config saved to /var/cache/conftool/dbconfig/20220307-072457-root.json [07:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:24] (PS1) Marostegui: Revert "db2077: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/768158 [07:26:04] (CR) Marostegui: [C: +2] Revert "db2077: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/768158 (owner: Marostegui) [07:26:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T302950)', diff saved to https://phabricator.wikimedia.org/P21880 and previous config saved to /var/cache/conftool/dbconfig/20220307-072624-ladsgroup.json [07:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:35] ops-eqiad: analytics10[63,67] mgmt interfaces seem
flapping from time to time - https://phabricator.wikimedia.org/T303151 (elukey) [07:32:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1144.eqiad.wmnet with OS bullseye [07:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:43] (PS1) Gerrit maintenance bot: db1143: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/768649 (https://phabricator.wikimedia.org/T302950) [07:34:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P21881 and previous config saved to /var/cache/conftool/dbconfig/20220307-073405-ladsgroup.json [07:34:06] (PS1) Gerrit maintenance bot: db1142: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/768650 (https://phabricator.wikimedia.org/T302950) [07:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:19] (PS1) Gerrit maintenance bot: db1141: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/768651 (https://phabricator.wikimedia.org/T302950) [07:34:32] (PS1) Gerrit maintenance bot: db1125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/768652 (https://phabricator.wikimedia.org/T302950) [07:34:45] (PS1) Gerrit maintenance bot: db1124: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/768653 (https://phabricator.wikimedia.org/T302950) [07:40:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21882 and previous config saved to /var/cache/conftool/dbconfig/20220307-074001-root.json [07:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1144.eqiad.wmnet with reason: host reimage [07:44:29] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:40] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:47:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1144.eqiad.wmnet with reason: host reimage [07:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:46] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:48:59] (CR) Ladsgroup: [C: +1] "LGTM but this is something Ariel needs to approve."
[puppet] - https://gerrit.wikimedia.org/r/768032 (https://phabricator.wikimedia.org/T300255) (owner: Hoo man) [07:49:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T300992)', diff saved to https://phabricator.wikimedia.org/P21883 and previous config saved to /var/cache/conftool/dbconfig/20220307-074909-ladsgroup.json [07:49:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [07:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:13] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [07:49:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [07:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T300992)', diff saved to https://phabricator.wikimedia.org/P21884 and previous config saved to /var/cache/conftool/dbconfig/20220307-074923-ladsgroup.json [07:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:04] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:51:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1181', diff saved to https://phabricator.wikimedia.org/P21885 and previous config saved to /var/cache/conftool/dbconfig/20220307-075120-marostegui.json [07:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:20] (CR) MMandere: prometheus:rules_global: Provide HAProxy availability metrics (2 comments) [puppet] - https://gerrit.wikimedia.org/r/768057 (owner: Vgutierrez) [07:53:34] (PS1) Marostegui: db1181: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/768654 [07:53:41] !log dbmaint on db1181 s7@eqiad T276150 [07:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:44] T276150: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 [07:54:18] (CR) Marostegui: [C: +2] db1181: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/768654 (owner: Marostegui) [07:54:27] (CR) MMandere: [C: +1] prometheus:rules_ops: Provide HAProxy total responses metrics [puppet] - https://gerrit.wikimedia.org/r/768056 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez) [07:54:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T300992)', diff saved to https://phabricator.wikimedia.org/P21886 and previous config saved to /var/cache/conftool/dbconfig/20220307-075433-ladsgroup.json [07:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:37] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [07:55:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21887 and previous config saved to
/var/cache/conftool/dbconfig/20220307-075504-root.json [07:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1175', diff saved to https://phabricator.wikimedia.org/P21888 and previous config saved to /var/cache/conftool/dbconfig/20220307-075523-marostegui.json [07:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21889 and previous config saved to /var/cache/conftool/dbconfig/20220307-075708-root.json [07:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:04] Amir1, awight, Urbanecm, and taavi: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220307T0800). Please do the needful. [08:00:04] No Gerrit patches in the queue for this window AFAICS. [08:00:08] o/ [08:00:26] looks like nothing to do [08:01:25] Indeed. 
[08:03:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1144.eqiad.wmnet with OS bullseye [08:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:18] (PS1) Marostegui: Revert "db1181: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/768159 [08:05:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21890 and previous config saved to /var/cache/conftool/dbconfig/20220307-080545-root.json [08:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:52] (CR) Marostegui: [C: +2] Revert "db1181: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/768159 (owner: Marostegui) [08:08:18] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:08:54] (CR) Muehlenhoff: "No need, certspotter is an edge package and only used by the alert* hosts, you can instead simply upload the 0.10 package to buster-wikime" [puppet] - https://gerrit.wikimedia.org/r/768058 (owner: Ssingh) [08:09:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P21891 and previous config saved to /var/cache/conftool/dbconfig/20220307-080938-ladsgroup.json [08:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:38] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs5002 is CRITICAL: cpu={0,10,12,14,2,4,6,8} https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5002&var-datasource=eqsin+prometheus/ops [08:12:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: After mysql restart', diff saved to
https://phabricator.wikimedia.org/P21892 and previous config saved to /var/cache/conftool/dbconfig/20220307-081212-root.json [08:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:18] RECOVERY - At least one CPU core of an LVS is saturated- packet drops are likely on lvs5002 is OK: All metrics within thresholds. https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5002&var-datasource=eqsin+prometheus/ops [08:17:46] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [08:17:54] yo [08:18:48] _security [08:19:02] yep [08:20:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 20%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21893 and previous config saved to /var/cache/conftool/dbconfig/20220307-082049-root.json [08:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:46] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [08:24:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P21894 and previous config saved to /var/cache/conftool/dbconfig/20220307-082443-ladsgroup.json [08:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21895 and previous config saved to /var/cache/conftool/dbconfig/20220307-082716-root.json [08:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:09] (CR) DCausse: "needs to be rebased on the patch pulling the new image with s3 client drivers" [deployment-charts] - https://gerrit.wikimedia.org/r/766123 (https://phabricator.wikimedia.org/T302494) (owner: ZPapierski) [08:35:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21896 and previous config saved to /var/cache/conftool/dbconfig/20220307-083553-root.json [08:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T300992)', diff saved to https://phabricator.wikimedia.org/P21897 and previous config saved to /var/cache/conftool/dbconfig/20220307-083948-ladsgroup.json [08:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:51] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [08:39:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [08:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [08:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [08:39:59] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [08:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:11] jouncebot: nowandnext [08:40:12] For the next 0 hour(s) and 19 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220307T0800) [08:40:12] In 5 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220307T1400) [08:40:33] since nothing's happening in B&C, let me sneak in one patch [08:40:50] (PS2) Urbanecm: enwiki: Deploy Growth features to 100% of users [mediawiki-config] - https://gerrit.wikimedia.org/r/767525 (https://phabricator.wikimedia.org/T302846) [08:40:54] (CR) Urbanecm: [C: +2] enwiki: Deploy Growth features to 100% of users [mediawiki-config] - https://gerrit.wikimedia.org/r/767525 (https://phabricator.wikimedia.org/T302846) (owner: Urbanecm) [08:41:22] (CR) DCausse: [C: -1] "the chart also needs to be updated to declare the new S3 config in the flink config file" [deployment-charts] - https://gerrit.wikimedia.org/r/766123 (https://phabricator.wikimedia.org/T302494) (owner: ZPapierski) [08:41:35] (Merged) jenkins-bot: enwiki: Deploy Growth features to 100% of users [mediawiki-config] - https://gerrit.wikimedia.org/r/767525 (https://phabricator.wikimedia.org/T302846) (owner: Urbanecm) [08:42:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21898 and previous config saved to /var/cache/conftool/dbconfig/20220307-084219-root.json [08:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166', diff saved to
https://phabricator.wikimedia.org/P21899 and previous config saved to /var/cache/conftool/dbconfig/20220307-084235-marostegui.json [08:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:45] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e3f70f699e37a27872b73f6483f6d27c669bb520: enwiki: Deploy Growth features to 100% of users (T302846) (duration: 00m 50s) [08:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:48] T302846: Scale: increase share on English Wikipedia to 100% / 10% - https://phabricator.wikimedia.org/T302846 [08:43:08] * urbanecm done [08:43:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [08:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [08:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21900 and previous config saved to /var/cache/conftool/dbconfig/20220307-084413-root.json [08:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T302950)', diff saved to https://phabricator.wikimedia.org/P21901 and previous config saved to /var/cache/conftool/dbconfig/20220307-084516-ladsgroup.json [08:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:19] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [08:46:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: 
Maintenance [08:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [08:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T300992)', diff saved to https://phabricator.wikimedia.org/P21902 and previous config saved to /var/cache/conftool/dbconfig/20220307-084641-ladsgroup.json [08:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:44] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [08:46:58] !log `kafka configs --alter --entity-type topics --entity-name udp_localhost-info --add-config retention.bytes=300000000000` on kafka-logging to reduce the size of the biggest topic partitions [08:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:20] (PS1) Muehlenhoff: Remove cumin2001 from list of Cumin hosts [puppet] - https://gerrit.wikimedia.org/r/768656 [08:47:22] (PS1) Muehlenhoff: Remove cumin2001 from mysql root clients and related grants [puppet] - https://gerrit.wikimedia.org/r/768657 (https://phabricator.wikimedia.org/T276589) [08:48:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:15] (CR) Volans: [C: -1] "I did some testing and found a couple of issues, see inline." [software/spicerack] - https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) (owner: Filippo Giunchedi) [08:50:09] (CR) Volans: [C: +1] "LGTM, but I'm not sure about the order of things."
[puppet] - https://gerrit.wikimedia.org/r/768656 (owner: Muehlenhoff) [08:50:26] (CR) Volans: [C: +1] "LGTM, but the grants should also be removed from mysql." [puppet] - https://gerrit.wikimedia.org/r/768657 (https://phabricator.wikimedia.org/T276589) (owner: Muehlenhoff) [08:50:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 40%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21903 and previous config saved to /var/cache/conftool/dbconfig/20220307-085056-root.json [08:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:28] (Abandoned) Ladsgroup: Depool esams [dns] - https://gerrit.wikimedia.org/r/767817 (owner: Ladsgroup) [08:51:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T300992)', diff saved to https://phabricator.wikimedia.org/P21904 and previous config saved to /var/cache/conftool/dbconfig/20220307-085139-ladsgroup.json [08:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:52:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:54] SRE, SRE-Access-Requests: Request Administrator Access to Google Search Console - https://phabricator.wikimedia.org/T302625 (JMeybohm) >>! In T302625#7754020, @dr0ptp4kt wrote: > To answer the question on the creds, no, they don't need to be shared. But delegated access will need to be established. An SR...
[08:54:01] (PS1) Gerrit maintenance bot: db1121: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/768658 (https://phabricator.wikimedia.org/T302950) [08:56:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:55] (PS3) Vgutierrez: prometheus:rules_ops: Provide HAProxy total responses metrics [puppet] - https://gerrit.wikimedia.org/r/768056 (https://phabricator.wikimedia.org/T290005) [08:57:57] (PS2) Vgutierrez: prometheus:rules_global: Provide HAProxy availability metrics [puppet] - https://gerrit.wikimedia.org/r/768057 [08:58:36] (CR) Vgutierrez: prometheus:rules_global: Provide HAProxy availability metrics (2 comments) [puppet] - https://gerrit.wikimedia.org/r/768057 (owner: Vgutierrez) [08:59:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21905 and previous config saved to /var/cache/conftool/dbconfig/20220307-085917-root.json [08:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P21906 and previous config saved to /var/cache/conftool/dbconfig/20220307-090021-ladsgroup.json [09:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:29] SRE, ops-codfw, DBA: db2147 SMART error - https://phabricator.wikimedia.org/T302951 (Marostegui) Open→Invalid Closing for now, if the RAID finally fails, we can replace the failed disk.
[09:01:52] !log restarting blazegraph on wdqs1013 (jvm stuck for 6hours) [09:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:07] PROBLEM - SSH on bast4003 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:06:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21907 and previous config saved to /var/cache/conftool/dbconfig/20220307-090600-root.json [09:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:43] RECOVERY - SSH on bast4003 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:06:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P21908 and previous config saved to /var/cache/conftool/dbconfig/20220307-090644-ladsgroup.json [09:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:18] (PS1) Jcrespo: puppet: Print nodes that change on every puppet run, sorted [puppet] - https://gerrit.wikimedia.org/r/768659 [09:10:48] (CR) Jcrespo: "What do you think? Small improvement (and easy to review) that has improves a lot the quality of life/observability."
[puppet] - https://gerrit.wikimedia.org/r/768659 (owner: Jcrespo) [09:12:10] (CR) Jcrespo: "cumin1001" [puppet] - https://gerrit.wikimedia.org/r/768659 (owner: Jcrespo) [09:14:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21909 and previous config saved to /var/cache/conftool/dbconfig/20220307-091421-root.json [09:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P21910 and previous config saved to /var/cache/conftool/dbconfig/20220307-091527-ladsgroup.json [09:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:01] (CR) Hokwelum: [C: +1] "we looked at this together" [puppet] - https://gerrit.wikimedia.org/r/768045 (https://phabricator.wikimedia.org/T302930) (owner: ArielGlenn) [09:16:32] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM. Please collect +1 from Filippo before merging."
[puppet] - https://gerrit.wikimedia.org/r/768294 (owner: Majavah) [09:17:58] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: fix remaining http keystone urls [puppet] - https://gerrit.wikimedia.org/r/768293 (https://phabricator.wikimedia.org/T267194) (owner: Majavah) [09:20:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [09:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [09:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T300381)', diff saved to https://phabricator.wikimedia.org/P21911 and previous config saved to /var/cache/conftool/dbconfig/20220307-092034-marostegui.json [09:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:38] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [09:21:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 60%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21912 and previous config saved to /var/cache/conftool/dbconfig/20220307-092103-root.json [09:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P21913 and previous config saved to /var/cache/conftool/dbconfig/20220307-092148-ladsgroup.json [09:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:37] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@19520c1]: (no justification provided) [09:22:39] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:42] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@19520c1]: (no justification provided) (duration: 00m 04s) [09:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:29] (CR) Majavah: [C: +1] wikitech_private: write to wmg* constants [puppet] - https://gerrit.wikimedia.org/r/768260 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [09:24:42] (PS1) Vgutierrez: site: Reimage cp2036 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/768661 (https://phabricator.wikimedia.org/T290005) [09:26:59] (CR) Vgutierrez: [C: +2] site: Reimage cp2036 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/768661 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez) [09:28:19] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp2036.codfw.wmnet with OS buster [09:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:32] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2036.codfw.wmnet with OS buster [09:29:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21914 and previous config saved to /var/cache/conftool/dbconfig/20220307-092924-root.json [09:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1123', diff saved to https://phabricator.wikimedia.org/P21915 and previous config saved to /var/cache/conftool/dbconfig/20220307-093013-marostegui.json [09:30:14] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T302950)', diff saved to https://phabricator.wikimedia.org/P21916 and previous config saved to /var/cache/conftool/dbconfig/20220307-093032-ladsgroup.json [09:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:35] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [09:31:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 25%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21917 and previous config saved to /var/cache/conftool/dbconfig/20220307-093146-root.json [09:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:20] (CR) Jcrespo: [C: +2] Add DNS verification records for Bing and Yandex Webmaster tools [dns] - https://gerrit.wikimedia.org/r/768037 (https://phabricator.wikimedia.org/T302617) (owner: SCherukuwada) [09:35:29] !log updated non-A wikipedia.org DNS records [09:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21918 and previous config saved to /var/cache/conftool/dbconfig/20220307-093607-root.json [09:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T302950)', diff saved to https://phabricator.wikimedia.org/P21919 and previous config saved to /var/cache/conftool/dbconfig/20220307-093615-ladsgroup.json [09:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:18] !log updated non-A wikipedia.org DNS records T302617 [09:36:18] T302950: Upgrade s4 to bullseye -
https://phabricator.wikimedia.org/T302950 [09:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:20] T302617: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 [09:36:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T300992)', diff saved to https://phabricator.wikimedia.org/P21920 and previous config saved to /var/cache/conftool/dbconfig/20220307-093653-ladsgroup.json [09:36:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [09:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:56] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [09:36:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [09:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T300992)', diff saved to https://phabricator.wikimedia.org/P21921 and previous config saved to /var/cache/conftool/dbconfig/20220307-093701-ladsgroup.json [09:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:42] (CR) Muehlenhoff: [C: +2] Remove cumin2001 from list of Cumin hosts [puppet] - https://gerrit.wikimedia.org/r/768656 (owner: Muehlenhoff) [09:40:09] SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (jcrespo) Looking good: `lines=10,lang=bash root@authdns1001:~$ for i in 0 1 2 ; do dig @ns${i}.wikimedia.org -t txt wikipedia.org ; done ; <<>> DiG 9.11.5-P4-5.1+deb10u6-Debian <<>> @ns0.wikimedia.o...
[09:40:11] (PS2) Muehlenhoff: Remove cumin2001 from mysql root clients and related grants [puppet] - https://gerrit.wikimedia.org/r/768657 (https://phabricator.wikimedia.org/T276589) [09:40:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:41:49] (CR) Filippo Giunchedi: misc: search-grafana-dashboards.js (1 comment) [software] - https://gerrit.wikimedia.org/r/767118 (owner: Filippo Giunchedi) [09:42:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T300992)', diff saved to https://phabricator.wikimedia.org/P21922 and previous config saved to /var/cache/conftool/dbconfig/20220307-094216-ladsgroup.json [09:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:20] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [09:42:39] SRE, Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (Marostegui) >>! In T276589#7755980, @gerritbot wrote: > Change 768657 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff): > %%%[operations/puppet@production] Remove cumin20...
[09:43:21] (CR) Marostegui: "Commented about it here too: https://phabricator.wikimedia.org/T276589#7756067" [puppet] - https://gerrit.wikimedia.org/r/768657 (https://phabricator.wikimedia.org/T276589) (owner: Muehlenhoff) [09:44:04] (PS1) Majavah: P:tcpircbot: cleanup allowed hosts [puppet] - https://gerrit.wikimedia.org/r/768662 [09:44:32] (PS1) Btullis: Add a record for datahubsearch service [dns] - https://gerrit.wikimedia.org/r/768663 (https://phabricator.wikimedia.org/T301458) [09:46:32] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2036.codfw.wmnet with reason: host reimage [09:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 50%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21923 and previous config saved to /var/cache/conftool/dbconfig/20220307-094649-root.json [09:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:56] (PS1) SCherukuwada: Add Yandex's TXT verification entry to www. [dns] - https://gerrit.wikimedia.org/r/768664 [09:47:08] SRE, SRE-OnFire (FY2021/2022-Q3), Data-Engineering, Event-Platform, and 2 others: Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (Ladsgroup) [09:47:21] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/768294 (owner: Majavah) [09:47:32] (PS2) SCherukuwada: Add Yandex's TXT verification entry to www. [dns] - https://gerrit.wikimedia.org/r/768664 [09:48:00] SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (SCherukuwada) I confirm that Bing.com verification has worked properly. However, for Yandex it seems they need the TXT entry to be under www.wikipedia.org and not wikipedia.org. Sent out patch https:/...
[09:48:19] (CR) jerkins-bot: [V: -1] Add Yandex's TXT verification entry to www. [dns] - https://gerrit.wikimedia.org/r/768664 (owner: SCherukuwada) [09:49:04] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2036.codfw.wmnet with reason: host reimage [09:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:34] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (JMeybohm) Thanks @MGerlach. Could you please also provide an expiry/end date for this contract/agreement? [09:49:40] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ShubhankarP - https://phabricator.wikimedia.org/T303032 (JMeybohm) Thanks @MGerlach. Could you please also provide an expiry/end date for this contract/agreement? [09:50:08] (CR) jerkins-bot: [V: -1] Add a record for datahubsearch service [dns] - https://gerrit.wikimedia.org/r/768663 (https://phabricator.wikimedia.org/T301458) (owner: Btullis) [09:51:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21924 and previous config saved to /var/cache/conftool/dbconfig/20220307-095111-root.json [09:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21925 and previous config saved to /var/cache/conftool/dbconfig/20220307-095120-ladsgroup.json [09:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:35] SRE, Data-Catalog, Data-Engineering, serviceops, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (JMeybohm) p:Triage→Medium [09:52:38]
SRE, SRE-Access-Requests, Infrastructure-Foundations, netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (JMeybohm) p:Triage→Medium [09:52:59] (CR) Jcrespo: "10:48:16 error: Name 'www.wikipedia.org.': CNAME not allowed alongside other data" [dns] - https://gerrit.wikimedia.org/r/768664 (owner: SCherukuwada) [09:53:11] SRE, SRE-Access-Requests: Request Administrator Access to Google Search Console - https://phabricator.wikimedia.org/T302625 (JMeybohm) a:JMeybohm→None [09:53:59] (PS3) SCherukuwada: Add Yandex's TXT verification entry to www. [dns] - https://gerrit.wikimedia.org/r/768664 (https://phabricator.wikimedia.org/T302617) [09:54:48] (CR) jerkins-bot: [V: -1] Add Yandex's TXT verification entry to www. [dns] - https://gerrit.wikimedia.org/r/768664 (https://phabricator.wikimedia.org/T302617) (owner: SCherukuwada) [09:55:23] (CR) Volans: Add a record for datahubsearch service (1 comment) [dns] - https://gerrit.wikimedia.org/r/768663 (https://phabricator.wikimedia.org/T301458) (owner: Btullis) [09:56:00] (CR) Volans: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/768659 (owner: Jcrespo) [09:57:06] SRE, SRE-Access-Requests, Infrastructure-Foundations, netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (Ladsgroup) >>! In T302870#7750922, @Dzahn wrote: > Are hardware serial numbers more abusable / serious than other things we give NDAed pe...
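Note on the 09:52:59 CI error above ("CNAME not allowed alongside other data"): a name that owns a CNAME record should not carry any other record type (RFC 1034, section 3.6.2), so a TXT record cannot be added next to a www CNAME. The check can be sketched roughly as below. This is only an illustration, not the linter that the dns repository CI actually runs; the function name `cname_conflicts`, the tuple layout, and the TXT values are hypothetical, and DNSSEC record types that may accompany a CNAME are deliberately ignored.

```python
from collections import defaultdict


def cname_conflicts(records):
    """Return the names that hold a CNAME together with any other record type.

    `records` is an iterable of (name, rtype, value) tuples. This layout is an
    assumption made for the sketch, not a real zone-file parser.
    """
    types_by_name = defaultdict(set)
    for name, rtype, _value in records:
        # Normalise case and the trailing dot so "WWW.example." == "www.example".
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


# Hypothetical records mirroring the situation discussed in the log.
records = [
    ("www.wikipedia.org.", "CNAME", "dyna.wikimedia.org."),
    ("www.wikipedia.org.", "TXT", "yandex-verification: <hypothetical-token>"),
    ("wikipedia.org.", "TXT", "yandex-verification: <hypothetical-token>"),
]
conflicts = cname_conflicts(records)
print(conflicts)
```

With these sample records only the www name is flagged; the TXT record on the bare domain is fine because that name has no CNAME.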
[09:57:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P21926 and previous config saved to /var/cache/conftool/dbconfig/20220307-095720-ladsgroup.json [09:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:56] (PS2) Btullis: Add a record for datahubsearch service [dns] - https://gerrit.wikimedia.org/r/768663 (https://phabricator.wikimedia.org/T301458) [09:58:11] (CR) Btullis: Add a record for datahubsearch service (1 comment) [dns] - https://gerrit.wikimedia.org/r/768663 (https://phabricator.wikimedia.org/T301458) (owner: Btullis) [09:58:47] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [09:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:53] SRE, DNS, Traffic, Wikimedia Enterprise: 301 redirect setup for wikimediaenterprise - https://phabricator.wikimedia.org/T302756 (Vgutierrez) Open→Stalled we cannot perform that redirect cause we don't handle the DNS for that domain: `$ host -t ns wikimediaenterprise.org wikimediaenterpris... [10:00:06] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [10:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 75%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21927 and previous config saved to /var/cache/conftool/dbconfig/20220307-100153-root.json [10:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:24] (CR) Jcrespo: "www is a CNAME to dyna.wikimedia.org, as such it is my understanding that it cannot have further data (TXT or other)."
[dns] - https://gerrit.wikimedia.org/r/768664 (https://phabricator.wikimedia.org/T302617) (owner: SCherukuwada) [10:03:38] (PS2) Jcrespo: puppet: Print nodes that change on every puppet run, sorted [puppet] - https://gerrit.wikimedia.org/r/768659 [10:04:06] (CR) SCherukuwada: "Yup. What I'm trying here is actually incompatible with www being a CNAME. https://www.rfc-editor.org/rfc/rfc1034" [dns] - https://gerrit.wikimedia.org/r/768664 (https://phabricator.wikimedia.org/T302617) (owner: SCherukuwada) [10:04:27] !log pool cp2036 with HAProxy as TLS termination layer - T290005 [10:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:31] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:05:31] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [10:06:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21928 and previous config saved to /var/cache/conftool/dbconfig/20220307-100624-ladsgroup.json [10:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:08:53] (CR) Jcrespo: "bblack: We were sent the following requirement for Yandex search console authentication.
Apparently, they require domain validation by add" [dns] - 10https://gerrit.wikimedia.org/r/768664 (https://phabricator.wikimedia.org/T302617) (owner: 10SCherukuwada) [10:10:22] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2036.codfw.wmnet with OS buster [10:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:33] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2036.codfw.wmnet with OS buster c... [10:12:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P21929 and previous config saved to /var/cache/conftool/dbconfig/20220307-101225-ladsgroup.json [10:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:11] (03PS1) 10Vgutierrez: site: Reimage cp1084 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768667 (https://phabricator.wikimedia.org/T290005) [10:14:36] (03PS2) 10Ladsgroup: db1143: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/768649 (https://phabricator.wikimedia.org/T302950) (owner: 10Gerrit maintenance bot) [10:14:42] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1143: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/768649 (https://phabricator.wikimedia.org/T302950) (owner: 10Gerrit maintenance bot) [10:16:01] (03PS1) 10Btullis: Added config for the datahubsearch LVS service [puppet] - 10https://gerrit.wikimedia.org/r/768668 [10:16:53] PROBLEM - Check systemd state on kubernetes1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:57] !log marostegui@cumin1001 dbctl commit (dc=all): 
'db1123 (re)pooling @ 100%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21930 and previous config saved to /var/cache/conftool/dbconfig/20220307-101657-root.json [10:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:08] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp1084 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768667 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:17:51] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp1084.eqiad.wmnet with OS buster [10:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:03] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1084.eqiad.wmnet with OS buster [10:18:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1162', diff saved to https://phabricator.wikimedia.org/P21931 and previous config saved to /var/cache/conftool/dbconfig/20220307-101824-marostegui.json [10:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:20] (03CR) 10Jcrespo: "Note the Yandex disclaimer: "If you chose DNS as your verification method, it may take up to 72 hours (three days) to verify your domain"" [dns] - 10https://gerrit.wikimedia.org/r/768664 (https://phabricator.wikimedia.org/T302617) (owner: 10SCherukuwada) [10:20:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300381)', diff saved to https://phabricator.wikimedia.org/P21932 and previous config saved to /var/cache/conftool/dbconfig/20220307-102054-marostegui.json [10:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:57] T300381: Make page_props.pp_page unsigned on wmf 
wikis - https://phabricator.wikimedia.org/T300381 [10:21:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T302950)', diff saved to https://phabricator.wikimedia.org/P21933 and previous config saved to /var/cache/conftool/dbconfig/20220307-102129-ladsgroup.json [10:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:33] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [10:21:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [10:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [10:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T302950)', diff saved to https://phabricator.wikimedia.org/P21934 and previous config saved to /var/cache/conftool/dbconfig/20220307-102158-ladsgroup.json [10:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1146:3312', diff saved to https://phabricator.wikimedia.org/P21935 and previous config saved to /var/cache/conftool/dbconfig/20220307-102209-marostegui.json [10:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:08] (03CR) 10SCherukuwada: Add Yandex's TXT verification entry to www. 
(031 comment) [dns] - 10https://gerrit.wikimedia.org/r/768664 (https://phabricator.wikimedia.org/T302617) (owner: 10SCherukuwada) [10:26:57] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:27:29] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10dom_walden) >>! In T302699#7740763, @dom_walden wrote: > ` > AH00288: scoreboard is full, not at MaxRequestWorker... [10:27:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T300992)', diff saved to https://phabricator.wikimedia.org/P21936 and previous config saved to /var/cache/conftool/dbconfig/20220307-102730-ladsgroup.json [10:27:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [10:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [10:27:34] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [10:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T300992)', diff saved to https://phabricator.wikimedia.org/P21937 and previous config saved to /var/cache/conftool/dbconfig/20220307-102737-ladsgroup.json [10:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:09] 10SRE, 10Patch-For-Review: migrate 
services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10Kormat) >>! In T276589#7756067, @Marostegui wrote: >>>! In T276589#7755980, @gerritbot wrote: >> Change 768657 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff): >> %%%[op... [10:31:09] (03PS1) 10Muehlenhoff: Switch cumin2001 to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/768670 (https://phabricator.wikimedia.org/T276589) [10:31:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1143.eqiad.wmnet with OS bullseye [10:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:18] (03PS1) 10Vgutierrez: site: Reimage cp4036 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768671 (https://phabricator.wikimedia.org/T290005) [10:32:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T300992)', diff saved to https://phabricator.wikimedia.org/P21938 and previous config saved to /var/cache/conftool/dbconfig/20220307-103253-ladsgroup.json [10:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:58] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [10:33:14] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/768657 (https://phabricator.wikimedia.org/T276589) (owner: 10Muehlenhoff) [10:33:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 25%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21939 and previous config saved to /var/cache/conftool/dbconfig/20220307-103323-root.json [10:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:54] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp4036 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768671 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:34:04] 
!log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1084.eqiad.wmnet with reason: host reimage [10:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:08] (03CR) 10Jcrespo: Add Yandex's TXT verification entry to www. (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/768664 (https://phabricator.wikimedia.org/T302617) (owner: 10SCherukuwada) [10:34:17] RECOVERY - Check systemd state on kubernetes1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:33] !log (re)started ferm on kubernetes1001 [10:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:34] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4036.ulsfo.wmnet with OS buster [10:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:47] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4036.ulsfo.wmnet with OS buster [10:37:31] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1084.eqiad.wmnet with reason: host reimage [10:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:28] (03PS1) 10Muehlenhoff: Switch TGC same site cookie to strict [puppet] - 10https://gerrit.wikimedia.org/r/768673 [10:43:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1143.eqiad.wmnet with reason: host reimage [10:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:38] PROBLEM - SSH on analytics1067.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:46:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1143.eqiad.wmnet with reason: host reimage [10:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:49] (03CR) 10Vgutierrez: [C: 03+2] prometheus:rules_ops: Provide HAProxy total responses metrics [puppet] - 10https://gerrit.wikimedia.org/r/768056 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:46:56] 10SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (10SCherukuwada) It got a bit trickier. You can't add a TXT entry for www when www exists as a CNAME. And it might not even help in that case. Here's why: On Yandex, when you request verification for... [10:48:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P21940 and previous config saved to /var/cache/conftool/dbconfig/20220307-104759-ladsgroup.json [10:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 50%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21941 and previous config saved to /var/cache/conftool/dbconfig/20220307-104826-root.json [10:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [10:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [10:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:07] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T300381)', diff saved to https://phabricator.wikimedia.org/P21942 and previous config saved to /var/cache/conftool/dbconfig/20220307-104906-marostegui.json [10:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:10] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [10:49:28] (03PS1) 10Muehlenhoff: Remove LDAP access for coreyfloyd [puppet] - 10https://gerrit.wikimedia.org/r/768677 [10:49:32] PROBLEM - Host ripe-atlas-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:51:48] PROBLEM - Host ripe-atlas-esams is DOWN: PING CRITICAL - Packet loss = 100% [10:52:07] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4036.ulsfo.wmnet with reason: host reimage [10:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:28] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for coreyfloyd [puppet] - 10https://gerrit.wikimedia.org/r/768677 (owner: 10Muehlenhoff) [10:55:33] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4036.ulsfo.wmnet with reason: host reimage [10:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:57] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:59:20] !log pool cp1084 with HAProxy as TLS termination layer - T290005 [10:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:23] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:00:46] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1084.eqiad.wmnet with OS buster [11:00:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:58] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1084.eqiad.wmnet with OS buster c... [11:02:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1143.eqiad.wmnet with OS bullseye [11:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:27] (03PS1) 10MSantos: WIP: introduce geoshapes service [deployment-charts] - 10https://gerrit.wikimedia.org/r/768678 [11:03:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P21943 and previous config saved to /var/cache/conftool/dbconfig/20220307-110304-ladsgroup.json [11:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:19] (03CR) 10Volans: [C: 04-1] "forgot to add the response message" [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) (owner: 10Filippo Giunchedi) [11:03:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 75%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21944 and previous config saved to /var/cache/conftool/dbconfig/20220307-110330-root.json [11:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:25] (03PS1) 10Vgutierrez: site: Reimage cp5016 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768679 (https://phabricator.wikimedia.org/T290005) [11:06:00] (03PS1) 10Kosta Harlan: GrowthExperiments: Add image experiment for fa/fr/pt/trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768680 (https://phabricator.wikimedia.org/T302828) [11:08:00] (03CR) 
10Vgutierrez: [C: 03+2] site: Reimage cp5016 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768679 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:08:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T300381)', diff saved to https://phabricator.wikimedia.org/P21945 and previous config saved to /var/cache/conftool/dbconfig/20220307-110823-marostegui.json [11:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:27] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [11:10:12] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp5016.eqsin.wmnet with OS buster [11:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:23] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5016.eqsin.wmnet with OS buster [11:11:19] (03PS1) 10Elukey: WIP - calico,cfssl-issuer,knative-serving: fix dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/768681 [11:12:33] (03CR) 10jerkins-bot: [V: 04-1] WIP - calico,cfssl-issuer,knative-serving: fix dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/768681 (owner: 10Elukey) [11:12:42] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10AlexisJazz) >>! In T302699#7756154, @dom_walden wrote: >>>! In T302699#7740763, @dom_walden wrote: >> ` >> AH0028... 
[11:12:50] !log pool cp4036 with HAProxy as TLS termination layer - T290005 [11:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:53] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:14:21] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10Majavah) [11:17:25] (03PS1) 10Vgutierrez: site: Reimage cp3060 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768682 (https://phabricator.wikimedia.org/T290005) [11:17:39] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10AlexisJazz) @Majavah are you sure T303165 is a dupe? That task is about api.php (and nothing else!) **consistentl... [11:18:00] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4036.ulsfo.wmnet with OS buster [11:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T300992)', diff saved to https://phabricator.wikimedia.org/P21946 and previous config saved to /var/cache/conftool/dbconfig/20220307-111809-ladsgroup.json [11:18:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [11:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:12] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [11:18:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [11:18:12] 10SRE, 10Traffic, 
10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4036.ulsfo.wmnet with OS buster c... [11:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T300992)', diff saved to https://phabricator.wikimedia.org/P21947 and previous config saved to /var/cache/conftool/dbconfig/20220307-111816-ladsgroup.json [11:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:20] (03PS1) 10Jelto: gitlab_runner: add gitlab-runner to docker group, change folder permissions [puppet] - 10https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) [11:18:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 100%: After mysql restart', diff saved to https://phabricator.wikimedia.org/P21948 and previous config saved to /var/cache/conftool/dbconfig/20220307-111834-root.json [11:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:57] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10Majavah) >>! In T302699#7756393, @AlexisJazz wrote: > @Majavah are you sure T303165 is a dupe? That task is about... 
[11:20:00] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp3060 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768682 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:20:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp3060.esams.wmnet with OS buster [11:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:55] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3060.esams.wmnet with OS buster [11:21:15] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34091/console" [puppet] - 10https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [11:22:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T302950)', diff saved to https://phabricator.wikimedia.org/P21949 and previous config saved to /var/cache/conftool/dbconfig/20220307-112207-ladsgroup.json [11:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:11] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [11:23:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300992)', diff saved to https://phabricator.wikimedia.org/P21950 and previous config saved to /var/cache/conftool/dbconfig/20220307-112307-ladsgroup.json [11:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P21951 and previous config saved to /var/cache/conftool/dbconfig/20220307-112328-marostegui.json 
[11:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:01] (03PS2) 10Ladsgroup: db1142: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/768650 (https://phabricator.wikimedia.org/T302950) (owner: 10Gerrit maintenance bot) [11:28:13] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1142: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/768650 (https://phabricator.wikimedia.org/T302950) (owner: 10Gerrit maintenance bot) [11:29:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0) [11:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:55] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0) [11:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:15] (03PS1) 10Btullis: Failover the active hive services to the standby server [dns] - 10https://gerrit.wikimedia.org/r/768686 (https://phabricator.wikimedia.org/T303168) [11:33:18] (03PS2) 10Jelto: gitlab_runner: add gitlab-runner to docker group, change folder permissions [puppet] - 10https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) [11:35:23] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10AlexisJazz) >>! In T302699#7756404, @Majavah wrote: >>>! In T302699#7756393, @AlexisJazz wrote: >> @Majavah are y... [11:35:35] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [11:35:39] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10LSobanski) dumpsdata1007 is running Bullseye BTW for anyone else watching from the sidelines. 
[11:36:23] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [11:36:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5016.eqsin.wmnet with reason: host reimage [11:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P21952 and previous config saved to /var/cache/conftool/dbconfig/20220307-113712-ladsgroup.json [11:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:21] (03CR) 10Btullis: [C: 03+2] Failover the active hive services to the standby server [dns] - 10https://gerrit.wikimedia.org/r/768686 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [11:38:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P21953 and previous config saved to /var/cache/conftool/dbconfig/20220307-113811-ladsgroup.json [11:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P21954 and previous config saved to /var/cache/conftool/dbconfig/20220307-113833-marostegui.json [11:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:04] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5016.eqsin.wmnet with reason: host reimage [11:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:19] (03PS1) 10Ladsgroup: dbtools: Add db_maint_mapper_sal.py [software] - 10https://gerrit.wikimedia.org/r/768687 [11:41:54] 10SRE, 
10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10Vgutierrez) in this case a 502 is emitted by ats-backend cause it isn't able to reach its backend server. The 503... [11:45:42] !log remove MTU1400 on drmrs GTT links [11:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:57] RECOVERY - SSH on analytics1067.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:48:15] (03PS3) 10Vgutierrez: prometheus:rules_global: Provide HAProxy availability metrics [puppet] - 10https://gerrit.wikimedia.org/r/768057 [11:48:36] (03CR) 10Vgutierrez: prometheus:rules_global: Provide HAProxy availability metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768057 (owner: 10Vgutierrez) [11:49:48] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3060.esams.wmnet with reason: host reimage [11:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:35] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:52:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P21955 and previous config saved to /var/cache/conftool/dbconfig/20220307-115217-ladsgroup.json [11:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:39] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:53:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P21956 and previous config saved to 
/var/cache/conftool/dbconfig/20220307-115316-ladsgroup.json [11:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T300381)', diff saved to https://phabricator.wikimedia.org/P21957 and previous config saved to /var/cache/conftool/dbconfig/20220307-115337-marostegui.json [11:53:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:40] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [11:53:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:16] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3060.esams.wmnet with reason: host reimage [11:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:13] (03PS3) 10Volans: sre.hosts.provision: always set the BiosBootSeq [cookbooks] - 10https://gerrit.wikimedia.org/r/767074 [11:59:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:25] 10Puppet, 10Infrastructure-Foundations, 10netbox: puppet lookup causes spurious puppetdb entries - 
https://phabricator.wikimedia.org/T303170 (10jbond) p:05Triage→03Low [12:02:40] (03CR) 10Urbanecm: [C: 03+1] "SGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768680 (https://phabricator.wikimedia.org/T302828) (owner: 10Kosta Harlan) [12:03:14] !log pool cp5016 with HAProxy as TLS termination layer - T290005 [12:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:18] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:03:54] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5016.eqsin.wmnet with OS buster [12:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:06] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5016.eqsin.wmnet with OS buster c... 
[12:05:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [12:05:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [12:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T300381)', diff saved to https://phabricator.wikimedia.org/P21958 and previous config saved to /var/cache/conftool/dbconfig/20220307-120532-marostegui.json [12:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:35] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [12:06:52] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [12:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:11] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: always set the BiosBootSeq [cookbooks] - 10https://gerrit.wikimedia.org/r/767074 (owner: 10Volans) [12:07:13] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [12:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:22] (03PS1) 10Vgutierrez: site: Reimage cp2037 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768689 (https://phabricator.wikimedia.org/T290005) [12:07:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T302950)', diff saved to https://phabricator.wikimedia.org/P21959 and previous config saved to /var/cache/conftool/dbconfig/20220307-120722-ladsgroup.json [12:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:25] T302950: Upgrade s4 to bullseye - 
https://phabricator.wikimedia.org/T302950 [12:08:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300992)', diff saved to https://phabricator.wikimedia.org/P21960 and previous config saved to /var/cache/conftool/dbconfig/20220307-120821-ladsgroup.json [12:08:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:08:24] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [12:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:03] !log reboot cr2-drmrs for software upgrade [12:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:10] (03Merged) 10jenkins-bot: sre.hosts.provision: always set the BiosBootSeq [cookbooks] - 10https://gerrit.wikimedia.org/r/767074 (owner: 10Volans) [12:10:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T300381)', diff saved to https://phabricator.wikimedia.org/P21961 and previous config saved to /var/cache/conftool/dbconfig/20220307-121018-marostegui.json [12:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:26] (03CR) 10Urbanecm: [C: 04-1] Add IPInfo viewing rights for certain groups (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) (owner: 10STran) [12:11:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [12:11:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [12:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T300775)', diff saved to https://phabricator.wikimedia.org/P21962 and previous config saved to /var/cache/conftool/dbconfig/20220307-121122-marostegui.json [12:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:25] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [12:11:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [12:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [12:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:37] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:12:56] 10SRE, 10Observability-Metrics, 10Traffic: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266 (10MMandere) [12:13:50] !log pool cp3060 with HAProxy as TLS termination layer - T290005 [12:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:53] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:14:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [12:14:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [12:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T300992)', diff saved to https://phabricator.wikimedia.org/P21963 and previous config saved to /var/cache/conftool/dbconfig/20220307-121443-ladsgroup.json [12:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:46] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [12:15:07] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:15:07] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:16:07] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:17:07] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:17:25] that's expected ^ (cr2-drmrs upgrade) [12:17:33] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:17:34] ack [12:18:15] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3060.esams.wmnet with OS buster [12:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:28] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3060.esams.wmnet with OS buster c... [12:18:35] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:18:55] (03CR) 10Kosta Harlan: GrowthExperiments: Add image experiment for fa/fr/pt/trwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768680 (https://phabricator.wikimedia.org/T302828) (owner: 10Kosta Harlan) [12:19:33] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2037 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768689 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:19:43] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:19:45] (03CR) 10Urbanecm: "Security review passed (T260822), but perf review (T260821) is currently opened. 
Per https://www.mediawiki.org/wiki/Writing_an_extension_f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) (owner: 10Tchanders) [12:19:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T300992)', diff saved to https://phabricator.wikimedia.org/P21964 and previous config saved to /var/cache/conftool/dbconfig/20220307-121958-ladsgroup.json [12:20:00] (03PS1) 10Btullis: Move some common resources to the opensearch::server profile [puppet] - 10https://gerrit.wikimedia.org/r/768702 (https://phabricator.wikimedia.org/T301382) [12:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:02] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [12:20:07] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:20:15] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp2037.codfw.wmnet with OS buster [12:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:28] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2037.codfw.wmnet with OS buster [12:22:07] (03CR) 10Urbanecm: [C: 04-1] Autopromote-once users to the 'ipinfo' group after one edit (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767845 (https://phabricator.wikimedia.org/T296184) (owner: 10Tchanders) [12:25:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P21965 and previous config saved to /var/cache/conftool/dbconfig/20220307-122523-marostegui.json 
[12:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:29] (03CR) 10Urbanecm: [C: 04-2] "Per Sammy. -2'ing to prevent accidental merge. IMO, most important thing is to have a test plan (how and when to evaluate whether this wa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767912 (https://phabricator.wikimedia.org/T43479) (owner: 10Samtar) [12:32:21] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:33:52] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10Gehel) 05Open→03Resolved [12:35:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P21966 and previous config saved to /var/cache/conftool/dbconfig/20220307-123503-ladsgroup.json [12:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:29] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34093/console" [puppet] - 10https://gerrit.wikimedia.org/r/768702 (https://phabricator.wikimedia.org/T301382) (owner: 10Btullis) [12:37:49] !log restart cr1-drmrs for software upgrade [12:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:15] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [12:38:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2037.codfw.wmnet with reason: host reimage [12:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:05] (03CR) 10Urbanecm: [C: 03+1] "looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747186 (https://phabricator.wikimedia.org/T203941) (owner: 10Kosta Harlan) [12:40:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff
saved to https://phabricator.wikimedia.org/P21967 and previous config saved to /var/cache/conftool/dbconfig/20220307-124028-marostegui.json [12:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:32] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2037.codfw.wmnet with reason: host reimage [12:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:50] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:45:02] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:46:26] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:46:37] (03PS1) 10Ayounsi: drmrs: add ORIGIN for v6 PTR LVS [dns] - 10https://gerrit.wikimedia.org/r/768709 [12:48:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [12:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [12:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T302950)', diff saved to https://phabricator.wikimedia.org/P21968 and previous config saved to /var/cache/conftool/dbconfig/20220307-124815-ladsgroup.json [12:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:18] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [12:48:22] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for 
the fix" [dns] - 10https://gerrit.wikimedia.org/r/768709 (owner: 10Ayounsi) [12:48:49] (03CR) 10Ayounsi: [C: 03+2] drmrs: add ORIGIN for v6 PTR LVS [dns] - 10https://gerrit.wikimedia.org/r/768709 (owner: 10Ayounsi) [12:49:47] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@46d88a2]: Migrate wikidata/item_page_link/weekly [12:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:55] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@46d88a2]: Migrate wikidata/item_page_link/weekly (duration: 00m 07s) [12:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P21969 and previous config saved to /var/cache/conftool/dbconfig/20220307-125007-ladsgroup.json [12:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:15] (03CR) 10Jbond: [C: 03+1] "lgtm thx" [puppet] - 10https://gerrit.wikimedia.org/r/768659 (owner: 10Jcrespo) [12:51:54] (03PS1) 104nn1l2: etwikiquote: Update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768710 (https://phabricator.wikimedia.org/T302683) [12:53:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1142.eqiad.wmnet with OS bullseye [12:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T300381)', diff saved to https://phabricator.wikimedia.org/P21970 and previous config saved to /var/cache/conftool/dbconfig/20220307-125532-marostegui.json [12:55:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [12:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:36] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [12:55:36] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [12:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T300381)', diff saved to https://phabricator.wikimedia.org/P21971 and previous config saved to /var/cache/conftool/dbconfig/20220307-125540-marostegui.json [12:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:58] (03CR) 10Jcrespo: [C: 03+2] puppet: Print nodes that change on every puppet run, sorted [puppet] - 10https://gerrit.wikimedia.org/r/768659 (owner: 10Jcrespo) [13:00:03] (03PS1) 10Btullis: Failback the hive services to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/768712 (https://phabricator.wikimedia.org/T303168) [13:03:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T300381)', diff saved to https://phabricator.wikimedia.org/P21972 and previous config saved to /var/cache/conftool/dbconfig/20220307-130326-marostegui.json [13:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:30] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [13:05:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1142.eqiad.wmnet with reason: host reimage [13:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T300992)', diff saved to https://phabricator.wikimedia.org/P21973 and previous config saved to /var/cache/conftool/dbconfig/20220307-130512-ladsgroup.json [13:05:14] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [13:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:15] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [13:05:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [13:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T300992)', diff saved to https://phabricator.wikimedia.org/P21974 and previous config saved to /var/cache/conftool/dbconfig/20220307-130520-ladsgroup.json [13:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1142.eqiad.wmnet with reason: host reimage [13:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:24] (03CR) 10Jcrespo: "They are now showing in order:" [puppet] - 10https://gerrit.wikimedia.org/r/768659 (owner: 10Jcrespo) [13:08:26] (03CR) 10Jbond: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/768662 (owner: 10Majavah) [13:09:51] !log About to deploy analytics/refinery - Migrate wikidata/item_page_link/weekly from Oozie to Airflow [13:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:30] (03PS3) 10Jelto: gitlab_runner: add gitlab-runner to docker group, change folder permissions [puppet] - 10https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) [13:11:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T300992)', diff saved to https://phabricator.wikimedia.org/P21975 and previous 
config saved to /var/cache/conftool/dbconfig/20220307-131100-ladsgroup.json [13:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:03] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [13:12:03] !log aqu@deploy1002 Started deploy [analytics/refinery@51d074b]: Migrate wikidata/item_page_link/weekly from Oozie to Airflow [analytics/refinery@51d074b] [13:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:16] 10SRE, 10Traffic-Icebox, 10Performance-Team (Radar): Consider collecting more timestamp milestones from ATS-TLS - https://phabricator.wikimedia.org/T265869 (10Aklapper) a:05ema→03None Resetting inactive assignee [13:12:17] 10SRE, 10Traffic-Icebox, 10SecTeam-Processed: Consider removing X-Wikimedia-Security-Audit VCL support - https://phabricator.wikimedia.org/T229320 (10Aklapper) a:05ema→03None Resetting inactive assignee [13:12:37] 10SRE, 10Pybal, 10Traffic-Icebox: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 (10Aklapper) a:05ema→03None Resetting inactive assignee [13:12:58] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34094/console" [puppet] - 10https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:16:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096 (s5,s6)', diff saved to https://phabricator.wikimedia.org/P21976 and previous config saved to /var/cache/conftool/dbconfig/20220307-131606-marostegui.json [13:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:16] (03PS1) 104nn1l2: fawiki: Disable creating community books and remove "Create a book" link from sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768718 (https://phabricator.wikimedia.org/T303173) [13:18:31] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P21977 and previous config saved to /var/cache/conftool/dbconfig/20220307-131830-marostegui.json [13:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21978 and previous config saved to /var/cache/conftool/dbconfig/20220307-131857-root.json [13:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:30] 10SRE, 10Traffic-Icebox, 10Performance-Team (Radar): Consider collecting more timestamp milestones from ATS-TLS - https://phabricator.wikimedia.org/T265869 (10Krinkle) 05Open→03Resolved a:03Krinkle [13:21:37] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Krinkle) [13:22:12] 10SRE, 10Traffic-Icebox, 10Performance-Team (Radar): Consider collecting more timestamp milestones from ATS-TLS - https://phabricator.wikimedia.org/T265869 (10Krinkle) a:05Krinkle→03ema [13:22:28] (03CR) 10Jelto: [V: 03+1] "@Jbond this change adds gitlab-runner user to the docker group by setting the id and primary group. 
However the id needs to be numeric and" [puppet] - 10https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:22:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1142.eqiad.wmnet with OS bullseye [13:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:24] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: facter3: use structured facts - https://phabricator.wikimedia.org/T222160 (10jbond) 05Declined→03Open p:05Medium→03Low Re-Opening this task as it seems from FACT-2913 and https://github.com/puppetlabs/puppet/pull/8868#issuecomment-1059388... [13:25:30] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Packaging, and 2 others: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) [13:26:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P21979 and previous config saved to /var/cache/conftool/dbconfig/20220307-132605-ladsgroup.json [13:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:27] (03CR) 10Kormat: Refactor check_mariadb_backups.py and add enough tests for it (031 comment) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/767844 (owner: 10Jcrespo) [13:33:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P21980 and previous config saved to /var/cache/conftool/dbconfig/20220307-133335-marostegui.json [13:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21981 and previous config saved to /var/cache/conftool/dbconfig/20220307-133400-root.json [13:34:02] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:56] (03CR) 10Jcrespo: "Thank you." [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/767844 (owner: 10Jcrespo) [13:35:09] (03CR) 10Bartosz Dziewoński: Enable reply tool by default on enwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758988 (https://phabricator.wikimedia.org/T296645) (owner: 10Esanders) [13:35:16] (03PS2) 10Bartosz Dziewoński: Enable reply tool by default on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758988 (https://phabricator.wikimedia.org/T296645) (owner: 10Esanders) [13:36:39] (03CR) 10Kormat: Refactor check_mariadb_backups.py and add enough tests for it (031 comment) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/767844 (owner: 10Jcrespo) [13:37:07] !log aqu@deploy1002 Finished deploy [analytics/refinery@51d074b]: Migrate wikidata/item_page_link/weekly from Oozie to Airflow [analytics/refinery@51d074b] (duration: 25m 04s) [13:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:28] PROBLEM - SSH on kubernetes2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:39:28] !log aqu@deploy1002 Started deploy [analytics/refinery@51d074b] (thin): Migrate wikidata/item_page_link/weekly from Oozie to Airflow [analytics/refinery@51d074b] [13:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:36] !log aqu@deploy1002 Finished deploy [analytics/refinery@51d074b] (thin): Migrate wikidata/item_page_link/weekly from Oozie to Airflow [analytics/refinery@51d074b] (duration: 00m 08s) [13:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:55] !log aqu@deploy1002 Started deploy [analytics/refinery@51d074b] (hadoop-test): Migrate wikidata/item_page_link/weekly from Oozie to Airflow [analytics/refinery@51d074b] [13:39:56] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:41:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P21982 and previous config saved to /var/cache/conftool/dbconfig/20220307-134109-ladsgroup.json [13:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:41] (03PS13) 10Filippo Giunchedi: Introduce 'alertmanager' and 'alerting' modules [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) [13:43:38] (03CR) 10Filippo Giunchedi: "Thanks for the review, with the last PS I was able to create a silence as expected (with 'alertmanager.py' in my home on cumin1001)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) (owner: 10Filippo Giunchedi) [13:44:05] (03CR) 10Jbond: "see comments" [puppet] - 10https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:46:24] (03PS3) 10Jcrespo: Refactor check_mariadb_backups.py and add enough tests for it [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/767844 [13:46:26] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:47:13] !log aqu@deploy1002 Finished deploy [analytics/refinery@51d074b] (hadoop-test): Migrate wikidata/item_page_link/weekly from Oozie to Airflow [analytics/refinery@51d074b] (duration: 07m 17s) [13:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1142 (T302950)', diff saved to https://phabricator.wikimedia.org/P21983 and previous config saved to /var/cache/conftool/dbconfig/20220307-134715-ladsgroup.json [13:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:18] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [13:48:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T300381)', diff saved to https://phabricator.wikimedia.org/P21984 and previous config saved to /var/cache/conftool/dbconfig/20220307-134840-marostegui.json [13:48:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [13:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [13:48:43] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [13:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T300381)', diff saved to https://phabricator.wikimedia.org/P21985 and previous config saved to /var/cache/conftool/dbconfig/20220307-134848-marostegui.json [13:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21986 and previous config saved to /var/cache/conftool/dbconfig/20220307-134904-root.json [13:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:44] (03PS4) 10Jelto: gitlab_runner: add gitlab-runner to docker 
group, change folder permissions [puppet] - 10https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) [13:51:47] (03PS4) 10Jcrespo: Refactor check_mariadb_backups.py and add enough tests for it [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/767844 (https://phabricator.wikimedia.org/T138562) [13:52:09] (03PS2) 10Jbond: utils: create blame-stats script [puppet] - 10https://gerrit.wikimedia.org/r/768114 (https://phabricator.wikimedia.org/T67270) [13:52:45] (03CR) 10jerkins-bot: [V: 04-1] utils: create blame-stats script [puppet] - 10https://gerrit.wikimedia.org/r/768114 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [13:56:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T300992)', diff saved to https://phabricator.wikimedia.org/P21987 and previous config saved to /var/cache/conftool/dbconfig/20220307-135614-ladsgroup.json [13:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:19] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [13:59:16] (03CR) 10Kormat: [C: 03+2] Remove cumin2001 from mysql root clients and related grants [puppet] - 10https://gerrit.wikimedia.org/r/768657 (https://phabricator.wikimedia.org/T276589) (owner: 10Muehlenhoff) [13:59:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people2002.codfw.wmnet [13:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220307T1400). [14:00:05] cscott, Juan_90264, nn1l2, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:09] hi [14:00:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2002.codfw.wmnet [14:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:13] !log pool cp2037 with HAProxy as TLS termination layer - T290005 [14:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:16] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [14:00:24] hello [14:00:29] !log removing cumin2001 grants from all db sections T276589 [14:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:31] T276589: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 [14:00:47] i can deploy today (unless someone else wants to) [14:01:14] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:34] (03CR) 10Urbanecm: [C: 03+2] etwikiquote: Update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768710 (https://phabricator.wikimedia.org/T302683) (owner: 104nn1l2) [14:02:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2002.codfw.wmnet [14:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:15] (03Merged) 10jenkins-bot: etwikiquote: Update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768710 (https://phabricator.wikimedia.org/T302683) (owner: 104nn1l2) [14:02:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P21988 and previous config saved to /var/cache/conftool/dbconfig/20220307-140219-ladsgroup.json [14:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:56] nn1l2: pulled the logo to mwdebug1001, can you check? 
[14:02:58] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2037.codfw.wmnet with OS buster [14:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:01] ok [14:03:14] * urbanecm doesn't see cscott or Juan_90264 here, so I'll skip the patches [14:03:20] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2037.codfw.wmnet with OS buster c... [14:03:31] (03PS1) 10Jbond: varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [14:04:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21989 and previous config saved to /var/cache/conftool/dbconfig/20220307-140408-root.json [14:04:09] LGTM [14:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:52] nn1l2: syncing [14:05:08] (03CR) 10Urbanecm: [C: 03+2] fawiki: Disable creating community books and remove "Create a book" link from sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768718 (https://phabricator.wikimedia.org/T303173) (owner: 104nn1l2) [14:05:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2002.codfw.wmnet [14:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2005.wikimedia.org [14:05:47] (03PS1) 10Vgutierrez: site: Reimage cp1085 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768724 (https://phabricator.wikimedia.org/T290005) [14:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:51] (03Merged) 
10jenkins-bot: fawiki: Disable creating community books and remove "Create a book" link from sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768718 (https://phabricator.wikimedia.org/T303173) (owner: 104nn1l2) [14:06:48] (03CR) 10Jelto: "If code changes are needed in systemd::sysuser, I would prefer to implement a additional_groups parameter in a related change. I'll upload" [puppet] - 10https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [14:07:07] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: 8619f5933966071cdb39097a3e0d38fdead40b66: etwikiquote: Update logo (T302683; 1/3) (duration: 00m 50s) [14:07:09] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10MatthewVernon) Thanks. Yes, the ms-fe* nodes will end up behind LVS; but they're not in service at that point. So from my POV, whenever you (or DC team) are... [14:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:10] T302683: Requesting logo change for et.wikiquote.org - https://phabricator.wikimedia.org/T302683 [14:07:48] PROBLEM - Check systemd state on ml-serve2004 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:52] !log Purge https://en.wikipedia.org/static/images/project-logos/etwikiquote.png (T302683) [14:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:57] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: 8619f5933966071cdb39097a3e0d38fdead40b66: etwikiquote: Update logo (T302683; 2/3) (duration: 00m 49s) [14:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:21] MatmaRex: just double checking, it's okay to ignore Ed's -1 on your patch, right? 
[14:08:25] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp1085 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768724 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:08:25] sounds to be about scheduling only [14:08:32] urbanecm: yeah [14:08:36] okay [14:08:46] !log urbanecm@deploy1002 Synchronized logos/config.yaml: 8619f5933966071cdb39097a3e0d38fdead40b66: etwikiquote: Update logo (T302683; 3/3) (duration: 00m 49s) [14:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:54] urbanecm: i wanted to remove that -1 vote, but i can't [14:09:03] yeah, only those with +2 and voters can :) [14:09:05] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp1085.eqiad.wmnet with OS buster [14:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:23] nn1l2: should be live [14:09:24] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1085.eqiad.wmnet with OS buster [14:09:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2005.wikimedia.org [14:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:39] Thanks! [14:09:40] nn1l2: your second patch is at mwdebug1001, please test [14:09:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:53] (03PS3) 10Urbanecm: Enable reply tool by default on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758988 (https://phabricator.wikimedia.org/T296645) (owner: 10Esanders) [14:10:12] (03CR) 10Urbanecm: [C: 03+2] "Ignoring Ed's scheduling -1 per MatmaRex. Deploying." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/758988 (https://phabricator.wikimedia.org/T296645) (owner: 10Esanders) [14:10:38] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10Kormat) Granted removed: [] es1 [] es2 [] es3 [x] es4 [x] es5 [x] m1 (root@10.% seems to supersede it?) [x] m2 (root@10.% seems to supersede it?) [] m3 [] m5 [] s1 [] s2 [] s3 [] s4 [] s5 [] s... [14:10:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:10:57] (03Merged) 10jenkins-bot: Enable reply tool by default on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758988 (https://phabricator.wikimedia.org/T296645) (owner: 10Esanders) [14:10:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:45] (03PS1) 10Vgutierrez: site: Reimage cp4030 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768729 (https://phabricator.wikimedia.org/T290005) [14:13:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2006.wikimedia.org [14:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:25] nn1l2: how is the testing going? [14:13:41] I did not receive ping! [14:13:49] I will test it now! 
[14:14:12] okay :) [14:14:22] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp4030 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768729 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:14:37] It's okay [14:14:43] Good to go [14:14:54] syncing [14:15:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2006.wikimedia.org [14:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:20] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4030.ulsfo.wmnet with OS buster [14:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:33] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4030.ulsfo.wmnet with OS buster [14:16:06] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8f20ec9f5bd7f507580d8e8860116e3b1842ac9a: fawiki: Disable creating community books and remove "Create a book" link from sidebar (T303173) (duration: 00m 49s) [14:16:07] (03PS1) 10Volans: prospector: update config for latest version [software/homer] - 10https://gerrit.wikimedia.org/r/768731 [14:16:09] (03PS1) 10Volans: homer: expand user paths when reading ssh_config [software/homer] - 10https://gerrit.wikimedia.org/r/768732 [14:16:11] should be live nn1l2 [14:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:14] T303173: Disable creating community books and remove "Create a book" link from the sidebar on Farsi Wikipedia - https://phabricator.wikimedia.org/T303173 [14:16:21] MatmaRex: your patch is at mwdebug1001 [14:16:23] can you check? 
[14:16:45] looking [14:16:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P21990 and previous config saved to /var/cache/conftool/dbconfig/20220307-141724-ladsgroup.json [14:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:28] urbanecm: looks fine! [14:17:30] Thanks again! [14:17:34] syncing! [14:17:35] np nn1l2 [14:18:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:18:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:44] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 64b128459f04514cd0093745d7a83166555449b2: Enable reply tool by default on enwiki (T296645) (duration: 00m 49s) [14:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:47] T296645: Config change: Deploy Reply Tool as opt-out preference at en.wiki - https://phabricator.wikimedia.org/T296645 [14:19:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21991 and previous config saved to /var/cache/conftool/dbconfig/20220307-141911-root.json [14:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:16] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'db1096:3316 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21992 and previous config saved to /var/cache/conftool/dbconfig/20220307-141915-root.json [14:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1003.wikimedia.org [14:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1002.eqiad.wmnet [14:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1003.wikimedia.org [14:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1002.eqiad.wmnet [14:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:27] Sorry for the delay, shall we deploy? [14:24:05] (03CR) 10Vgutierrez: varnish: Rate limit hotlinking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768723 (owner: 10Jbond) [14:24:48] Hello? [14:25:11] urbanecm ? 
[14:25:30] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1085.eqiad.wmnet with reason: host reimage [14:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:08] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:26:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1004.wikimedia.org [14:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:29] Juan_90264: sure, wait on line [14:27:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet [14:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:12] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:28:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1004.wikimedia.org [14:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:51] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1085.eqiad.wmnet with reason: host reimage [14:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet [14:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:15] (03PS1) 10Jbond: R:systemd::sysuser: add support for id => "-:groupname" [puppet] - 10https://gerrit.wikimedia.org/r/768735 [14:29:22] (03PS3) 10Urbanecm: Revert "Change temporary logo for slwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768155 (https://phabricator.wikimedia.org/T302661) (owner: 10Juan90264) [14:29:27] (03CR) 10Urbanecm: [C: 03+2] Revert "Change temporary logo for slwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768155 (https://phabricator.wikimedia.org/T302661) (owner: 10Juan90264) [14:29:54] (03CR) 10jerkins-bot: [V: 04-1] R:systemd::sysuser: add support for id => "-:groupname" [puppet] - 10https://gerrit.wikimedia.org/r/768735 (owner: 10Jbond) [14:29:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34096/console" [puppet] - 10https://gerrit.wikimedia.org/r/768735 (owner: 10Jbond) [14:30:12] (03Merged) 10jenkins-bot: Revert "Change temporary logo for slwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768155 (https://phabricator.wikimedia.org/T302661) (owner: 10Juan90264) [14:30:30] Juan_90264: please test at mwdebug1001 [14:30:32] Okay merged 
[14:30:56] !log rebooting etherpad1003 (running etherpad1003) for kernel update [14:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host etherpad1003.eqiad.wmnet [14:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:05] (thanks for deploying!) [14:31:10] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4030.ulsfo.wmnet with reason: host reimage [14:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:17] urbanecm What's there to test on that, I'm just reversing the logo usage [14:31:25] yeah, test it was reversed ;) [14:31:31] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve2004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:31:59] (03PS2) 10Jbond: R:systemd::sysuser: add support for id => "-:groupname" [puppet] - 10https://gerrit.wikimedia.org/r/768735 [14:32:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T302950)', diff saved to https://phabricator.wikimedia.org/P21993 and previous config saved to /var/cache/conftool/dbconfig/20220307-143229-ladsgroup.json [14:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:33] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [14:33:06] (03PS1) 10Btullis: Add a profile specific to datahubsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/768736 (https://phabricator.wikimedia.org/T301382) [14:33:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host etherpad1003.eqiad.wmnet [14:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:24] 10SRE, 10DC-Ops: Confirm support of PERC 750 
raid controller - https://phabricator.wikimedia.org/T297913 (10MoritzMuehlenhoff) megacli hasn't changed since a long time. I also tried perccli, but it also fails, the issue is rather on the kernel driver side. But I think I have identified the commits which need... [14:33:52] (03CR) 10Btullis: [V: 03+1] Move some common resources to the opensearch::server profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768702 (https://phabricator.wikimedia.org/T301382) (owner: 10Btullis) [14:34:07] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34097/console" [puppet] - 10https://gerrit.wikimedia.org/r/768736 (https://phabricator.wikimedia.org/T301382) (owner: 10Btullis) [14:34:16] Juan_90264: how is it going? [14:34:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21994 and previous config saved to /var/cache/conftool/dbconfig/20220307-143419-root.json [14:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:25] Urbanecm: I tested and approved [14:34:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:29] syncing [14:34:32] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4030.ulsfo.wmnet with reason: host reimage [14:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet [14:35:23] !log ntsako@deploy1002 Started deploy [airflow-dags/analytics@46d88a2]: (no justification provided) [14:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:27] !log ntsako@deploy1002 Finished deploy [airflow-dags/analytics@46d88a2]: (no justification provided) (duration: 00m 04s) [14:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:35:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:16] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: f50c4746c5fa733929b80b036eef4eee84cf17d1: Revert "Change temporary logo for slwiki" (T302661; 1/2) (duration: 00m 49s) [14:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:19] T302661: Requesting temporary logo change for sl.wikipedia.org - https://phabricator.wikimedia.org/T302661 [14:36:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:04] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: f50c4746c5fa733929b80b036eef4eee84cf17d1: Revert "Change temporary logo for slwiki" (T302661; 2/2) (duration: 00m 48s) [14:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1002.eqiad.wmnet [14:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:32] RECOVERY - SSH on kubernetes2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:39:17] Working, thanks Urbanecm for deploying! 
[14:40:39] np [14:42:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet [14:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:11] (03CR) 10Volans: "Thanks for the fixes! I've tested all the functionalities and all looks good." [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) (owner: 10Filippo Giunchedi) [14:42:24] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:42:56] (03CR) 10Btullis: [V: 03+1] "This may be a bit liberal, opening up port 9200 to all DOMAIN_NETWORKS, but I've put in a parameter so that we can restrict it further lat" [puppet] - 10https://gerrit.wikimedia.org/r/768736 (https://phabricator.wikimedia.org/T301382) (owner: 10Btullis) [14:43:23] (03CR) 10Btullis: [C: 03+2] Failback the hive services to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/768712 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [14:45:32] (03CR) 10Ayounsi: [C: 03+1] homer: expand user paths when reading ssh_config [software/homer] - 10https://gerrit.wikimedia.org/r/768732 (owner: 10Volans) [14:45:43] !log pool cp1085 with HAProxy as TLS termination layer - T290005 [14:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:49] (03CR) 10Ayounsi: [C: 03+1] prospector: update config for latest version [software/homer] - 10https://gerrit.wikimedia.org/r/768731 (owner: 10Volans) [14:46:24] (03CR) 10Volans: [C: 03+2] prospector: update config for latest version [software/homer] - 10https://gerrit.wikimedia.org/r/768731 (owner: 10Volans) [14:46:29] (03CR) 10Volans: [C: 03+2] homer: expand user paths when reading ssh_config [software/homer] - 10https://gerrit.wikimedia.org/r/768732 (owner: 10Volans) [14:46:43] 
(03PS1) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [14:46:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:46:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [14:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [14:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:33] (03CR) 10jerkins-bot: [V: 04-1] C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 (owner: 10Jbond) [14:48:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300381)', diff saved to https://phabricator.wikimedia.org/P21995 and previous config saved to /var/cache/conftool/dbconfig/20220307-144829-marostegui.json [14:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:33] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [14:48:58] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:00] RECOVERY - Check systemd state on ml-serve2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:23] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'db1096:3316 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21996 and previous config saved to /var/cache/conftool/dbconfig/20220307-144922-root.json [14:49:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2001.codfw.wmnet [14:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:33] (03Merged) 10jenkins-bot: prospector: update config for latest version [software/homer] - 10https://gerrit.wikimedia.org/r/768731 (owner: 10Volans) [14:49:35] (03Merged) 10jenkins-bot: homer: expand user paths when reading ssh_config [software/homer] - 10https://gerrit.wikimedia.org/r/768732 (owner: 10Volans) [14:50:46] (03CR) 10Joal: [C: 03+1] "Thanks @phuedx - Good for me on the webrequest logging side." [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [14:53:07] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10Kormat) >>! In T276589#7756918, @Kormat wrote: > Granted removed: Alright, that should be all the grants cleaned up. 
[14:53:13] (03PS2) 10Elukey: calico,cfssl-issuer,knative-serving: fix dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/768681 [14:56:27] !log depool cp1085 [14:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:38] cp1085 is having some issues :/ [14:58:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host theemin.codfw.wmnet [14:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:37] (03PS5) 10Jelto: gitlab_runner: add gitlab-runner to docker group, change folder permissions [puppet] - 10https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) [15:01:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2088.codfw.wmnet with reason: Maintenance [15:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2088.codfw.wmnet with reason: Maintenance [15:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10Vgutierrez) [15:02:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance [15:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance [15:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:08] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve2004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:02:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: 
cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10Vgutierrez) p:05Triage→03Medium [15:02:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance [15:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance [15:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:35] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4030.ulsfo.wmnet with OS buster [15:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host theemin.codfw.wmnet [15:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance [15:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance [15:02:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [15:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [15:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:47] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's 
CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4030.ulsfo.wmnet with OS buster c... [15:03:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [15:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [15:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [15:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [15:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21997 and previous config saved to /var/cache/conftool/dbconfig/20220307-150334-marostegui.json [15:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:36] !log pool cp4030 with HAProxy as TLS termination layer - T290005 [15:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:39] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:03:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2088.codfw.wmnet with reason: Maintenance [15:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:58] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2088.codfw.wmnet with reason: Maintenance [15:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21998 and previous config saved to /var/cache/conftool/dbconfig/20220307-150426-root.json [15:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance [15:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance [15:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:48] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:08:12] (03CR) 10Elukey: "The diff looks long, I am wondering if it is only a consequence of the new deps being added or if it will translate to some prod changes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/768681 (owner: 10Elukey) [15:08:28] (03CR) 10Andrew Bogott: [C: 03+2] toolserver_legacy: add a block-all robots.txt [puppet] - 10https://gerrit.wikimedia.org/r/756126 (owner: 10Majavah) [15:08:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance [15:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance [15:08:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:46] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:09:32] !log ntsako@deploy1002 Started deploy [airflow-dags/analytics_test@7642d65]: (no justification provided) [15:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:41] !log ntsako@deploy1002 Finished deploy [airflow-dags/analytics_test@7642d65]: (no justification provided) (duration: 00m 09s) [15:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance [15:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance [15:11:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [15:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [15:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [15:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [15:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:35] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [15:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [15:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:50] (03CR) 10Andrew Bogott: [C: 03+2] openstack: haproxy site definition is not a profile [puppet] - 10https://gerrit.wikimedia.org/r/756982 (owner: 10Majavah) [15:18:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21999 and previous config saved to /var/cache/conftool/dbconfig/20220307-151839-marostegui.json [15:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance [15:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance [15:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P22000 and previous config saved to /var/cache/conftool/dbconfig/20220307-151929-root.json [15:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:17] !log ntsako@deploy1002 Started deploy [airflow-dags/analytics@7642d65]: (no justification provided) [15:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:24] !log ntsako@deploy1002 Finished deploy 
[airflow-dags/analytics@7642d65]: (no justification provided) (duration: 00m 07s) [15:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:12] (03PS14) 10Filippo Giunchedi: Introduce 'alertmanager' and 'alerting' modules [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) [15:25:18] (03CR) 10Filippo Giunchedi: Introduce 'alertmanager' and 'alerting' modules (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) (owner: 10Filippo Giunchedi) [15:29:21] (03CR) 10Filippo Giunchedi: "LGTM, though I'll let Cole vote" [puppet] - 10https://gerrit.wikimedia.org/r/768702 (https://phabricator.wikimedia.org/T301382) (owner: 10Btullis) [15:29:44] (03CR) 10Ssingh: aptrepo: add a component for certspotter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768058 (owner: 10Ssingh) [15:30:03] (03Abandoned) 10Ssingh: aptrepo: add a component for certspotter [puppet] - 10https://gerrit.wikimedia.org/r/768058 (owner: 10Ssingh) [15:33:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300381)', diff saved to https://phabricator.wikimedia.org/P22001 and previous config saved to /var/cache/conftool/dbconfig/20220307-153343-marostegui.json [15:33:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [15:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:48] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [15:33:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [15:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 
clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T300381)', diff saved to https://phabricator.wikimedia.org/P22002 and previous config saved to /var/cache/conftool/dbconfig/20220307-153357-marostegui.json [15:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T300381)', diff saved to https://phabricator.wikimedia.org/P22003 and previous config saved to /var/cache/conftool/dbconfig/20220307-153641-marostegui.json [15:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:40] !log vgutierrez@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1085.eqiad.wmnet with OS buster [15:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:52] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1085.eqiad.wmnet with OS buster e... 
[15:39:15] (03PS1) 10Jelto: isystemd::sysuser: create option to add additional groups to user [puppet] - 10https://gerrit.wikimedia.org/r/768743 (https://phabricator.wikimedia.org/T295481) [15:39:58] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cp1085.eqiad.wmnet with reason: HW issues see T303183 [15:40:01] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cp1085.eqiad.wmnet with reason: HW issues see T303183 [15:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:03] T303183: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 [15:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 30 days, 0:00:00 1 host(s) and their services with reason: HW issues see T303183 ` cp1085.eqiad.wmnet ` [15:40:10] (03CR) 10Andrew Bogott: [C: 03+2] openstack::haproxy: add more flexibility for frontends [puppet] - 10https://gerrit.wikimedia.org/r/756983 (owner: 10Majavah) [15:40:18] (03PS4) 10Andrew Bogott: openstack::haproxy: add more flexibility for frontends [puppet] - 10https://gerrit.wikimedia.org/r/756983 (owner: 10Majavah) [15:40:24] (03CR) 10Volans: [C: 03+1] "Ship it!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) (owner: 10Filippo Giunchedi) [15:40:33] (03PS2) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [15:40:35] (03PS2) 10Jbond: varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [15:40:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/768057 (owner: 10Vgutierrez) [15:44:18] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:22] (03PS3) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [15:45:20] (03PS1) 10Vgutierrez: site: Reimage cp5010 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768744 (https://phabricator.wikimedia.org/T290005) [15:49:05] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp5010 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768744 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:49:56] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp5010.eqsin.wmnet with OS buster [15:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:11] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5010.eqsin.wmnet with OS buster [15:51:23] (03PS1) 10Majavah: P:wmcs::prometheus: use a single entry for openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/768747 [15:51:35] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: cloud: Set monitoring_hosts as empty [puppet] - 10https://gerrit.wikimedia.org/r/757014 (owner: 
10Majavah) [15:51:40] (03PS2) 10Andrew Bogott: hieradata: cloud: Set monitoring_hosts as empty [puppet] - 10https://gerrit.wikimedia.org/r/757014 (owner: 10Majavah) [15:51:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P22004 and previous config saved to /var/cache/conftool/dbconfig/20220307-155146-marostegui.json [15:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:28] (03PS4) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [15:56:48] !log eqiad: kubectl -n istio-system delete po istiod-69d679d8b5-hm64j - T303184 [15:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:51] T303184: High API server request latencies (LIST) for istio API groups - https://phabricator.wikimedia.org/T303184 [15:58:01] jouncebot: next [15:58:01] In 0 hour(s) and 31 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220307T1630) [15:58:14] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet [15:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet [15:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:06] 10SRE, 10serviceops: enhance otrs alerting - https://phabricator.wikimedia.org/T303190 (10Arnoldokoth) [16:01:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet [16:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:23] 10SRE, 10serviceops: investigate otrs database grants - https://phabricator.wikimedia.org/T303191 (10Arnoldokoth) [16:02:34] (03CR) 10Jbond: [C: 03+2] R:systemd::sysuser: add support for id => "-:groupname" 
[puppet] - 10https://gerrit.wikimedia.org/r/768735 (owner: 10Jbond) [16:02:54] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:03:01] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1001.eqiad.wmnet [16:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [16:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:35] (03CR) 10Jbond: gitlab_runner: add gitlab-runner to docker group, change folder permissions (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:03:45] (03PS6) 10Jbond: gitlab_runner: add gitlab-runner to docker group, change folder permissions [puppet] - 10https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:04:34] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet [16:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet [16:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:57] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2001.codfw.wmnet [16:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34103/console" [puppet] - 10https://gerrit.wikimedia.org/r/768683 
(https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:05:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2002.codfw.wmnet [16:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2002.codfw.wmnet [16:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:29] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet [16:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P22005 and previous config saved to /var/cache/conftool/dbconfig/20220307-160650-marostegui.json [16:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:26] (03PS1) 10Vgutierrez: site: Reimage cp3058 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768751 (https://phabricator.wikimedia.org/T290005) [16:07:43] (03CR) 10Jbond: "did an early pass" [puppet] - 10https://gerrit.wikimedia.org/r/768743 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:09:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2002.codfw.wmnet [16:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:49] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1001.eqiad.wmnet [16:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:30] (JobUnavailable) firing: (4) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:10:48] 10SRE, 
10ops-eqsin, 10Traffic: SMART error (CurrentPendingSector) detected on host: cp5004 - https://phabricator.wikimedia.org/T303043 (10Vgutierrez) p:05Triage→03Medium @wiki_willy how should we handle this HW issue on eqsin? [16:10:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2002.codfw.wmnet [16:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:50] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2001.codfw.wmnet [16:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:55] (03CR) 10Krinkle: [C: 03+1] misc: search-grafana-dashboards.js (031 comment) [software] - 10https://gerrit.wikimedia.org/r/767118 (owner: 10Filippo Giunchedi) [16:14:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3002.esams.wmnet [16:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:20] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2001.codfw.wmnet [16:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:28] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5010.eqsin.wmnet with reason: host reimage [16:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:30] (JobUnavailable) firing: (5) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:15:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host gerrit2002.wikimedia.org [16:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:24] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [16:16:26] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:40] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp3058 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/768751 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [16:16:43] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1002.eqiad.wmnet [16:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:48] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [16:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:04] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2002.codfw.wmnet [16:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:09] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5010.eqsin.wmnet with reason: host reimage [16:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:16] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp3058.esams.wmnet with OS buster [16:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:29] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3058.esams.wmnet with OS buster [16:18:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3002.esams.wmnet [16:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit2002.wikimedia.org [16:19:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet [16:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:41] (03PS1) 10Ayounsi: Set fr-ops to operations [homer/public] - 10https://gerrit.wikimedia.org/r/768756 (https://phabricator.wikimedia.org/T302992) [16:21:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T300381)', diff saved to https://phabricator.wikimedia.org/P22006 and previous config saved to /var/cache/conftool/dbconfig/20220307-162157-marostegui.json [16:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:02] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [16:22:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [16:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [16:22:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [16:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [16:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:23] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1002.eqiad.wmnet [16:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:30] !log filippo@cumin1001 START - 
Cookbook sre.hosts.reboot-single for host thanos-fe1003.eqiad.wmnet [16:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:40] (03PS5) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [16:22:47] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2002.codfw.wmnet [16:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34104/console" [puppet] - 10https://gerrit.wikimedia.org/r/768739 (owner: 10Jbond) [16:24:15] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2003.codfw.wmnet [16:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:03] (03PS6) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [16:27:47] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1003.eqiad.wmnet [16:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34105/console" [puppet] - 10https://gerrit.wikimedia.org/r/768739 (owner: 10Jbond) [16:28:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [16:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [16:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T300381)', 
diff saved to https://phabricator.wikimedia.org/P22007 and previous config saved to /var/cache/conftool/dbconfig/20220307-162821-marostegui.json [16:28:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet [16:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:25] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [16:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:04] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite2003.codfw.wmnet [16:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:05] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2003.codfw.wmnet [16:29:06] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host graphite2003.codfw.wmnet [16:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:15] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite2003.codfw.wmnet [16:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:05] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220307T1630). 
[16:34:41] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2003.codfw.wmnet [16:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:52] jouncebot: next [16:34:52] In 1 hour(s) and 25 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220307T1800) [16:35:49] (03PS7) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [16:36:12] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet [16:36:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T300381)', diff saved to https://phabricator.wikimedia.org/P22008 and previous config saved to /var/cache/conftool/dbconfig/20220307-163612-marostegui.json [16:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:17] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [16:36:20] (03CR) 10jerkins-bot: [V: 04-1] C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 (owner: 10Jbond) [16:36:53] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet [16:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:08] (03PS8) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [16:37:46] (03CR) 10jerkins-bot: [V: 04-1] C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 (owner: 10Jbond) [16:38:54] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet [16:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:37] PROBLEM - Check 
unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:39:57] (03PS9) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [16:41:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34108/console" [puppet] - 10https://gerrit.wikimedia.org/r/768739 (owner: 10Jbond) [16:41:42] !log pool cp5010 with HAProxy as TLS termination layer - T290005 [16:41:44] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5010.eqsin.wmnet with OS buster [16:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:45] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [16:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:51] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?panelId=8&fullscreen&orgId=1 [16:41:58] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5010.eqsin.wmnet with OS buster c... 
[16:42:09] the statograph_port error is due to 502 from graphite.w.o FYI (cc cdanis) [16:42:16] statograph_post error even [16:42:28] JFYI though, it'll recover soon [16:43:02] (03CR) 10Jbond: [V: 03+1] "PCC is essentially a noop" [puppet] - 10https://gerrit.wikimedia.org/r/768739 (owner: 10Jbond) [16:43:08] 10SRE, 10ops-eqiad: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (10Cmjohnson) @elukey analytics1063 and 1067 idrac's are stuck and each server needs to be physically powered off and unplugged for 20-30 secs [16:43:27] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1004.eqiad.wmnet [16:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [16:43:36] godog: ah that's fine, ty [16:44:15] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/768702 (https://phabricator.wikimedia.org/T301382) (owner: 10Btullis) [16:44:31] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet [16:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:55] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:59] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3058.esams.wmnet with reason: host reimage [16:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:07] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host 
prometheus1006.eqiad.wmnet [16:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:09] 10SRE, 10ops-eqiad: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (10elukey) @BTullis can you coordinate with @Cmjohnson to shut down these nodes? [16:46:33] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet [16:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:44] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet [16:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet [16:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:28] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3058.esams.wmnet with reason: host reimage [16:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:38] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:49:58] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T302169 (10Cmjohnson) 05Open→03Resolved Disk has been replaced and is rebuilding cmjohnson@es1029:~$ sudo megacli -PDList -aALL |grep "Firmware state" Firmware state: Online, Spun Up Firmware state: Online, Spun Up F... 
[16:50:30] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:51:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P22009 and previous config saved to /var/cache/conftool/dbconfig/20220307-165117-marostegui.json [16:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:58] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet [16:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:43] (03CR) 10Cwhite: Add a profile specific to datahubsearch servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768736 (https://phabricator.wikimedia.org/T301382) (owner: 10Btullis) [16:52:52] !log depool cp5004 - T303043 [16:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:54] T303043: SMART error (CurrentPendingSector) detected on host: cp5004 - https://phabricator.wikimedia.org/T303043 [16:54:16] (ThanosSidecarPrometheusDown) firing: Thanos Sidecar cannot connect to Prometheus - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [16:54:16] (ThanosSidecarUnhealthy) firing: Thanos Sidecar is unhealthy. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [16:54:41] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1004.wikimedia.org with OS bullseye [16:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: move cloudcephmon1003.eqiad.wmnet from rack B2 to rack C8 - https://phabricator.wikimedia.org/T303058 (10Cmjohnson) @dcaro @nskaggs Can I do this anytime? [16:55:30] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:55:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet [16:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10Cmjohnson) @Vgutierrez @wiki_willy This server is out of warranty. Expired June 2021 [16:58:11] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2006.codfw.wmnet [16:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [16:58:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: move cloudcephmon1003.eqiad.wmnet from rack B2 to rack C8 - https://phabricator.wikimedia.org/T303058 (10dcaro) @Cmjohnson feel free to move it yes, just make sure to ping us when you start/end (@nskaggs is clinic duty thi... [16:58:46] (03CR) 10Jelto: [V: 03+1] "great thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/768743 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:59:16] (ThanosSidecarPrometheusDown) resolved: Thanos Sidecar cannot connect to Prometheus - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [16:59:16] (ThanosSidecarUnhealthy) resolved: Thanos Sidecar is unhealthy. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [16:59:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet [16:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:50] (03PS1) 10Hnowlan: jobqueue: set CPU request [deployment-charts] - 10https://gerrit.wikimedia.org/r/768760 (https://phabricator.wikimedia.org/T300914) [17:03:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [17:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:24] 10SRE, 10ops-eqsin, 10Traffic: SMART error (CurrentPendingSector) detected on host: cp5004 - https://phabricator.wikimedia.org/T303043 (10wiki_willy) a:03RobH Hi @Vgutierrez - it's due to be refreshed towards the end of this calendar year (and will be on next FY's budget). 
Would you be able to go that lon... [17:06:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P22010 and previous config saved to /var/cache/conftool/dbconfig/20220307-170622-marostegui.json [17:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:56] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1004.wikimedia.org with reason: host reimage [17:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:14] !log pool cp3058 with HAProxy as TLS termination layer - T290005 [17:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:18] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [17:07:46] PROBLEM - Check systemd state on netflow6001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:50] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3058.esams.wmnet with OS buster [17:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:01] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [17:08:09] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3058.esams.wmnet with OS buster c... 
[17:08:41] (03PS1) 10Jbond: P:cache::varnish::frontend: Update lookup keys [puppet] - 10https://gerrit.wikimedia.org/r/768762 [17:09:36] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1004.wikimedia.org with reason: host reimage [17:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10Vgutierrez) could we replace the faulty DIMM somehow? missing one server on text@eqiad is far from a ideal scenario [17:10:14] (03CR) 10jerkins-bot: [V: 04-1] P:cache::varnish::frontend: Update lookup keys [puppet] - 10https://gerrit.wikimedia.org/r/768762 (owner: 10Jbond) [17:14:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10wiki_willy) [17:15:10] (03PS2) 10Jbond: P:cache::varnish::frontend: Update lookup keys [puppet] - 10https://gerrit.wikimedia.org/r/768762 [17:15:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10wiki_willy) No problem @Vgutierrez. 
I just created T303203 with @RobH to procure a replacement DIMM Thanks, Willy [17:16:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34111/console" [puppet] - 10https://gerrit.wikimedia.org/r/768762 (owner: 10Jbond) [17:20:48] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cp5004.eqsin.wmnet with reason: HW issues see T303043 [17:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:51] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cp5004.eqsin.wmnet with reason: HW issues see T303043 [17:20:52] T303043: SMART error (CurrentPendingSector) detected on host: cp5004 - https://phabricator.wikimedia.org/T303043 [17:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:58] 10SRE, 10ops-eqsin, 10Traffic: SMART error (CurrentPendingSector) detected on host: cp5004 - https://phabricator.wikimedia.org/T303043 (10ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 30 days, 0:00:00 1 host(s) and their services with reason: HW issues see T303043 ` cp5004.eqsin.wmnet ` [17:21:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T300381)', diff saved to https://phabricator.wikimedia.org/P22011 and previous config saved to /var/cache/conftool/dbconfig/20220307-172126-marostegui.json [17:21:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:21:30] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [17:21:31] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T300381)', diff saved to https://phabricator.wikimedia.org/P22012 and previous config saved to /var/cache/conftool/dbconfig/20220307-172134-marostegui.json [17:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:08] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes1022.eqiad.wmnet [17:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:37] (03CR) 10Muehlenhoff: puppet: Print nodes that change on every puppet run, sorted (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768659 (owner: 10Jcrespo) [17:27:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T300381)', diff saved to https://phabricator.wikimedia.org/P22013 and previous config saved to /var/cache/conftool/dbconfig/20220307-172755-marostegui.json [17:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:59] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [17:29:50] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1022.eqiad.wmnet [17:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:19] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices1004.wikimedia.org with OS bullseye [17:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:16] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage2001.codfw.wmnet [17:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:01] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P22014 and previous config saved to /var/cache/conftool/dbconfig/20220307-174300-marostegui.json [17:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:04] (03PS1) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/768766 [17:44:05] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage2001.codfw.wmnet [17:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:10] (03PS1) 10Majavah: P:wmcs::prometheus: update pdns ports [puppet] - 10https://gerrit.wikimedia.org/r/768767 (https://phabricator.wikimedia.org/T281276) [17:46:00] (03PS3) 10Jbond: varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [17:46:25] (03PS3) 10Tchanders: Enable IPInfo on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) [17:46:27] (03PS2) 10Tchanders: Autopromote-once users to the 'ipinfo' group after one edit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767845 (https://phabricator.wikimedia.org/T296184) [17:47:41] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage2002.codfw.wmnet [17:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:44] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kubestage2002.codfw.wmnet [17:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:11] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage2002.codfw.wmnet [17:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:05] (03PS4) 10Jbond: varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [17:50:54] 
(03PS2) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/768766 [17:51:28] (03PS10) 10Jbond: C:varnish::common: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/768739 [17:51:37] (03PS3) 10Jbond: P:cache::varnish::frontend: Update lookup keys [puppet] - 10https://gerrit.wikimedia.org/r/768762 [17:51:44] (03PS3) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/768766 [17:51:52] (03PS5) 10Jbond: varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [17:52:35] (03PS4) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/768766 [17:52:55] (03PS6) 10Jbond: varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [17:53:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34114/console" [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond) [17:55:15] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage2002.codfw.wmnet [17:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:11] (03PS2) 10Majavah: P:wmcs::prometheus: update pdns ports [puppet] - 10https://gerrit.wikimedia.org/r/768767 (https://phabricator.wikimedia.org/T281276) [17:56:13] (03PS1) 10Majavah: P:prometheus::ops: fix powerdns-auth port [puppet] - 10https://gerrit.wikimedia.org/r/768770 (https://phabricator.wikimedia.org/T300254) [17:58:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P22015 and previous config saved to /var/cache/conftool/dbconfig/20220307-175805-marostegui.json [17:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] ryankemper: Time to 
snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220307T1800). [18:06:06] (03PS5) 10Jcrespo: Refactor check_mariadb_backups.py and add enough tests for it [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/767844 (https://phabricator.wikimedia.org/T138562) [18:07:34] (03CR) 10Jcrespo: [C: 03+2] Refactor check_mariadb_backups.py and add enough tests for it [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/767844 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [18:07:44] (03CR) 10Jcrespo: [C: 03+2] Use yaml safeloader to parse config files [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/767716 (owner: 10Jcrespo) [18:09:13] (03Merged) 10jenkins-bot: Use yaml safeloader to parse config files [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/767716 (owner: 10Jcrespo) [18:09:15] (03Merged) 10jenkins-bot: Refactor check_mariadb_backups.py and add enough tests for it [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/767844 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [18:13:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T300381)', diff saved to https://phabricator.wikimedia.org/P22016 and previous config saved to /var/cache/conftool/dbconfig/20220307-181310-marostegui.json [18:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:14] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [18:13:22] (03Abandoned) 10Dduvall: Move Redis server definitions to services files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726660 (owner: 10Dduvall) [18:13:51] (03CR) 10Andrew Bogott: [C: 03+2] P:prometheus::ops: fix powerdns-auth port [puppet] - 10https://gerrit.wikimedia.org/r/768770 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [18:16:11] (03CR) 10JMeybohm: 
[C: 03+1] calico,cfssl-issuer,knative-serving: fix dependencies (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/768681 (owner: 10Elukey) [18:20:49] (03PS1) 10Clare Ming: Fix language alert regression [skins/Vector] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/768786 (https://phabricator.wikimedia.org/T302018) [18:31:02] (03PS1) 10Ottomata: Revert "Hive - set hive.warehouse.subdir.inherit.perms = false" [puppet] - 10https://gerrit.wikimedia.org/r/768787 [18:31:21] (03PS2) 10Ottomata: Revert "Hive - set hive.warehouse.subdir.inherit.perms = false" [puppet] - 10https://gerrit.wikimedia.org/r/768787 (https://phabricator.wikimedia.org/T291664) [18:31:29] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Revert "Hive - set hive.warehouse.subdir.inherit.perms = false" [puppet] - 10https://gerrit.wikimedia.org/r/768787 (https://phabricator.wikimedia.org/T291664) (owner: 10Ottomata) [18:39:08] (03PS1) 10Dduvall: Revert "Revert "contint: Install docker 20.10 from thirdparty/ci on buster"" [puppet] - 10https://gerrit.wikimedia.org/r/768774 [18:39:47] (03CR) 10Jdlrobson: [C: 03+1] Fix language alert regression [skins/Vector] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/768786 (https://phabricator.wikimedia.org/T302018) (owner: 10Clare Ming) [18:40:36] (03PS2) 10Dduvall: Revert "Revert "contint: Install docker 20.10 from thirdparty/ci on buster"" [puppet] - 10https://gerrit.wikimedia.org/r/768774 (https://phabricator.wikimedia.org/T300682) [18:44:00] (03CR) 10RLazarus: "On including $site: LGTM, thanks!" 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/768108 (https://phabricator.wikimedia.org/T302842) (owner: 10Herron) [18:55:30] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:02:34] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:09:08] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:15:56] (03PS1) 10Majavah: prometheus: include number of changes on puppet run metrics [puppet] - 10https://gerrit.wikimedia.org/r/768776 [19:17:19] (03CR) 10jerkins-bot: [V: 04-1] prometheus: include number of changes on puppet run metrics [puppet] - 10https://gerrit.wikimedia.org/r/768776 (owner: 10Majavah) [19:17:59] (03PS2) 10Majavah: prometheus: include number of changes on puppet run metrics [puppet] - 10https://gerrit.wikimedia.org/r/768776 [19:23:19] (03PS1) 10Ebernhardson: Prevent caching of auth redirect [puppet] - 10https://gerrit.wikimedia.org/r/768777 (https://phabricator.wikimedia.org/T301650) [19:34:21] (03PS2) 10Ebernhardson: Prevent caching of auth redirect [puppet] - 10https://gerrit.wikimedia.org/r/768777 (https://phabricator.wikimedia.org/T301650) [19:34:27] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/768777 (https://phabricator.wikimedia.org/T301650) (owner: 10Ebernhardson) [19:38:23] (03CR) 10Ebernhardson: "Tested with the docker-compose environment in mw-oauth-proxy, verified the header is emitted by nginx from the sub-request when a redirect" [puppet] - 10https://gerrit.wikimedia.org/r/768777 (https://phabricator.wikimedia.org/T301650) (owner: 10Ebernhardson) 
[19:41:30] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:49:54] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1003.wikimedia.org with OS bullseye [19:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:20] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:52:22] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:10:40] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:11:28] PROBLEM - Host checker.tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - checker.tools.wmflabs.org [20:12:34] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [20:13:43] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1003.wikimedia.org with reason: host reimage [20:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:19] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1003.wikimedia.org with reason: host reimage [20:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:55] 10SRE, 10Znuny, 10serviceops: investigate otrs database grants - https://phabricator.wikimedia.org/T303191 (10Peachey88) [20:23:04] 10SRE, 10Znuny, 10serviceops: enhance otrs alerting - https://phabricator.wikimedia.org/T303190 (10Peachey88) [20:39:48] !log andrew@cumin1001 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host cloudservices1003.wikimedia.org with OS bullseye [20:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:43] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:51:13] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [20:54:01] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:56:39] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [21:00:07] RoanKattouw and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220307T2100). [21:00:07] nray: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:30] o/ here [21:06:48] is anyone available to deploy for the backport window rn? 
[21:11:42] SRE, Znuny, serviceops: enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (Aklapper) [21:11:58] SRE, Znuny, serviceops: enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (Aklapper) Hi @Arnoldokoth, the lack of a task description makes is hard for others to help or contribute, for a triager/tester to figure out at some point in the future whether this is still a valid tas... [21:15:51] nray: i am, sorry for being late [21:15:55] are you still around? [21:16:50] yes I'm here! [21:17:04] let's start then [21:17:13] sweet, thank you! [21:17:15] (CR) Urbanecm: [C: +2] Fix language alert regression [skins/Vector] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/768786 (https://phabricator.wikimedia.org/T302018) (owner: Clare Ming) [21:32:37] (Merged) jenkins-bot: Fix language alert regression [skins/Vector] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/768786 (https://phabricator.wikimedia.org/T302018) (owner: Clare Ming) [21:33:16] nray: should be pulled to mwdebug1001. Can you have a look? [21:33:25] yes, thank you [21:33:55] let me know how it looks like :) [21:34:05] will do [21:35:25] things look good urbanecm , you can proceed! [21:35:30] syncing! 
[21:35:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:36:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:36] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.24/skins/Vector/includes/SkinVector.php: eac551c: Fix language alert regression (T302018) (duration: 00m 50s) [21:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:40] T302018: [Regression] Language in sidebar should not show on pages without languages - https://phabricator.wikimedia.org/T302018 [21:37:43] nray: and should be live! [21:37:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:37:45] anything else? [21:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:57] urbanecm: that's all, thanks so much for your help! [21:38:05] happy to help [21:38:11] !log UTC late B&C window done [21:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:20] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1003.wikimedia.org with OS bullseye [21:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:05] Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220307T2200). 
[22:18:11] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1003.wikimedia.org with reason: host reimage [22:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:42] (PS1) Razzi: elasticsearch: move cluster configuration to puppet [puppet] - https://gerrit.wikimedia.org/r/768816 (https://phabricator.wikimedia.org/T278378) [22:20:26] (PS2) Razzi: elasticsearch: move cluster configuration to puppet [puppet] - https://gerrit.wikimedia.org/r/768816 (https://phabricator.wikimedia.org/T278378) [22:20:38] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1003.wikimedia.org with reason: host reimage [22:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:20] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudservices1003.wikimedia.org with OS bullseye [22:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:28] (PS1) Ebernhardson: icinga: Move cirrus check into cirrus_cluster_checks [puppet] - https://gerrit.wikimedia.org/r/768818 [22:21:44] (CR) Razzi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34115/console" [puppet] - https://gerrit.wikimedia.org/r/768816 (https://phabricator.wikimedia.org/T278378) (owner: Razzi) [22:23:04] (CR) jerkins-bot: [V: -1] icinga: Move cirrus check into cirrus_cluster_checks [puppet] - https://gerrit.wikimedia.org/r/768818 (owner: Ebernhardson) [22:23:30] (PS3) Ryan Kemper: elasticsearch: move cluster configuration to puppet [puppet] - https://gerrit.wikimedia.org/r/768816 (https://phabricator.wikimedia.org/T278378) (owner: Razzi) [22:23:32] (CR) Ebernhardson: "I'm a bit indecisive on what is appropriate here. 
I don't see any obvious reason this check should be in either file, and I'm left wonderi" [puppet] - https://gerrit.wikimedia.org/r/768818 (owner: Ebernhardson) [22:25:31] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1003.wikimedia.org with OS bullseye [22:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:28] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1003.wikimedia.org with reason: host reimage [22:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:39] (CR) Ryan Kemper: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: Ryan Kemper) [22:28:55] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1003.wikimedia.org with reason: host reimage [22:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:30] (PS10) Razzi: elasticsearch: load config from yaml [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: Ryan Kemper) [22:33:39] (CR) jerkins-bot: [V: -1] elasticsearch: load config from yaml [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: Ryan Kemper) [22:33:46] (PS2) Ebernhardson: icinga: Move cirrus check into cirrus_cluster_checks [puppet] - https://gerrit.wikimedia.org/r/768818 [22:35:15] (CR) Ebernhardson: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/768818 (owner: Ebernhardson) [22:36:56] (PS11) Razzi: elasticsearch: load config from yaml [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: Ryan Kemper) [22:37:49] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudservices1003.wikimedia.org with OS bullseye [22:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:43] SRE, ops-eqiad: analytics10[63,67] mgmt interfaces seem flapping from time to time - https://phabricator.wikimedia.org/T303151 (wiki_willy) a:Cmjohnson [22:42:57] (CR) jerkins-bot: [V: -1] elasticsearch: load config from yaml [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: Ryan Kemper) [22:46:22] (PS12) Ryan Kemper: elasticsearch: load config from yaml [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) [22:49:18] (PS13) Ryan Kemper: elasticsearch: load config from yaml [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) [22:51:21] (PS14) Ryan Kemper: elasticsearch: load config from yaml [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) [22:52:19] (PS15) Ryan Kemper: elasticsearch: load config from yaml [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) [22:53:47] SRE, Traffic, envoy, serviceops: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (RLazarus) [22:55:31] SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus) [22:55:39] SRE, Traffic, envoy, serviceops: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (RLazarus) Open→Stalled p:Triage→Low [22:55:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org [22:59:31] (CR) jerkins-bot: [V: -1] elasticsearch: load config from yaml [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: Ryan Kemper) [23:02:32] SRE, Traffic, envoy, serviceops: Refactor envoy access_log_path to access loggers - https://phabricator.wikimedia.org/T303231 (RLazarus) [23:05:05] SRE, Traffic, envoy, serviceops: Refactor envoy access_log_path to access loggers - https://phabricator.wikimedia.org/T303231 (RLazarus) p:Triage→Medium [23:38:19] (CR) Cwhite: Added config for the datahubsearch LVS service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/768668 (owner: Btullis) [23:40:31] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on mirror1001.wikimedia.org with reason: reboot [23:40:32] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on mirror1001.wikimedia.org with reason: reboot [23:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:00] (PS1) Andrew Bogott: Add files and templates for OpenStack Wallaby [puppet] - https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275) [23:44:02] (PS1) Andrew Bogott: OpenStack: add manifests for openstack wallaby [puppet] - https://gerrit.wikimedia.org/r/768830 (https://phabricator.wikimedia.org/T281275) [23:44:39] (CR) jerkins-bot: [V: -1] Add files and templates for OpenStack Wallaby [puppet] - https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275) (owner: Andrew Bogott) [23:45:01] (CR) jerkins-bot: [V: -1] OpenStack: add manifests for openstack wallaby [puppet] - https://gerrit.wikimedia.org/r/768830 (https://phabricator.wikimedia.org/T281275) (owner: Andrew Bogott) [23:49:45] !log jhathaway@cumin1001 
START - Cookbook sre.hosts.downtime for 0:15:00 on mx1001.wikimedia.org with reason: reboot [23:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:46] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on mx1001.wikimedia.org with reason: reboot [23:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:01] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on mx2001.wikimedia.org with reason: reboot [23:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:03] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on mx2001.wikimedia.org with reason: reboot [23:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:46] (PS2) Andrew Bogott: Add files and templates for OpenStack Wallaby [puppet] - https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275) [23:54:48] (PS2) Andrew Bogott: OpenStack: add manifests for openstack wallaby [puppet] - https://gerrit.wikimedia.org/r/768830 (https://phabricator.wikimedia.org/T281275) [23:55:59] (CR) jerkins-bot: [V: -1] OpenStack: add manifests for openstack wallaby [puppet] - https://gerrit.wikimedia.org/r/768830 (https://phabricator.wikimedia.org/T281275) (owner: Andrew Bogott) [23:57:43] (CR) jerkins-bot: [V: -1] Add files and templates for OpenStack Wallaby [puppet] - https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275) (owner: Andrew Bogott) [23:59:11] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook