[00:01:48] (03CR) 10Cwhite: [C: 03+1] "The ES clients will one day refuse to work with OpenSearch instances when compatibility mode is disabled or removed. There are no immedia" [puppet] - 10https://gerrit.wikimedia.org/r/777421 (https://phabricator.wikimedia.org/T255864) (owner: 10Herron) [00:02:12] (03CR) 10Cwhite: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/777375 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [00:05:20] RECOVERY - SSH on wtp1041.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:05:56] RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:10:06] PROBLEM - dump of es4 in codfw on alert1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than 8 days ago: Most recent backup 2022-03-29 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:50:20] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:04:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24140 and previous config saved to /var/cache/conftool/dbconfig/20220406-010410-ladsgroup.json [01:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:14] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:13:48] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: 
The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:14:16] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:15:40] 10SRE, 10MediaWiki-General, 10Performance-Team, 10Platform Engineering Code Jam, and 3 others: Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (10Krinkle) [01:17:21] 10SRE, 10MediaWiki-General, 10Performance-Team, 10Platform Engineering Code Jam, and 3 others: Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (10Krinkle) //(Triaging on perf board as prio, given subt... [01:19:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24141 and previous config saved to /var/cache/conftool/dbconfig/20220406-011915-ladsgroup.json [01:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24142 and previous config saved to /var/cache/conftool/dbconfig/20220406-013420-ladsgroup.json [01:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:38:32] PROBLEM - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:49:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24143 and previous config saved to /var/cache/conftool/dbconfig/20220406-014925-ladsgroup.json [01:49:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [01:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:49:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [01:49:31] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:02:52] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:05:06] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:12:25] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:22:56] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:23:14] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:25:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47965 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:25:18] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.343 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [02:43:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [02:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [02:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:07:53] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (10paramita_das) @MoritzMuehlenhoff, I am new to Kerberos settings. Any documentation available how to login to Kerberos account via ssh tunneling? My system is win... 
[03:11:16] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:12:04] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:18:22] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:20:30] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 4.798 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:30:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [03:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:30:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [03:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:31:32] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:33:36] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.086 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:36:04] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:17:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [04:17:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [04:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:36:26] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:04:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [05:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [05:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:16] (03PS1) 10KartikMistry: Update cxserver to 2022-04-05-070409-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/777694 (https://phabricator.wikimedia.org/T305397) [05:14:36] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Marostegui) The databases can be done anytime, we just need to downtime them (and any possible slave), stop mysqld and power them off so Papaul can move them. Make sure to run an `apt full... [05:19:43] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (10Aklapper) @paramita_das: Hi, basically https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Kerberos - could you elaborate at which step you are stuck? Than... 
[05:35:34] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (phab1001), Fresh: 107 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:39:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:51:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [05:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [05:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:33] (03CR) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [06:11:42] (03PS27) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) [06:11:56] (03PS7) 10Winston Sung: Rearrange zh namespace names and namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 [06:12:25] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:16:22] (03CR) 10Filippo Giunchedi: "Thank you for following up !" 
[alerts] - 10https://gerrit.wikimedia.org/r/777432 (owner: 10RLazarus) [06:17:55] (03CR) 10Func: [C: 03+1] "Ok, I think this is good to go now since it should be a NOOP. You can try to apply for deployment under the instruction of https://wikitec" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [06:19:43] (03CR) 10Filippo Giunchedi: [C: 03+1] sre.kafka.reboot-workers: add logging-codfw targets [cookbooks] - 10https://gerrit.wikimedia.org/r/777375 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [06:20:26] (03CR) 10Filippo Giunchedi: [C: 03+1] spicerack: add logging clusters to elasticsearch config [puppet] - 10https://gerrit.wikimedia.org/r/777421 (https://phabricator.wikimedia.org/T255864) (owner: 10Herron) [06:21:55] (03CR) 10Filippo Giunchedi: [C: 04-1] "Thank you for following up on this, LGTM overall and see inline" [puppet] - 10https://gerrit.wikimedia.org/r/777453 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [06:27:26] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [06:39:13] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Upgrade IDPs to CAS 6.5/Bullseye and enable webauthn - https://phabricator.wikimedia.org/T305518 (10MoritzMuehlenhoff) [06:39:41] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Upgrade IDPs to CAS 6.5/Bullseye and enable webauthn - https://phabricator.wikimedia.org/T305518 (10MoritzMuehlenhoff) p:05Triage→03Medium [06:40:59] (03CR) 10Muehlenhoff: [C: 03+2] Don't make apt.wikimedia.org page [puppet] - 
10https://gerrit.wikimedia.org/r/777357 (owner: 10Muehlenhoff) [06:44:18] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:46:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [06:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [06:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24144 and previous config saved to /var/cache/conftool/dbconfig/20220406-064633-ladsgroup.json [06:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:36] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:56:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [06:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [06:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:44] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [06:56:45] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[06:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:06] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [06:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] Amir1, awight, Urbanecm, and taavi: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220406T0700) [07:00:05] kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:15] hi [07:00:31] still no sticker ¯\_(ツ)_/¯ [07:01:02] kostajh: hey! [07:01:15] Do you want to self-service? [07:01:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host webperf2003.codfw.wmnet [07:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:17] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [07:01:18] Or should I deploy? [07:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:29] urbanecm: either way, do you have a preference? [07:02:28] kostajh: if possible please self-service :) [07:02:52] ok, I'm on it [07:03:06] (y) lmk if I can help in some way [07:03:44] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[07:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:32] !log depool cp5001 for reimage - T290005 [07:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:34] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [07:05:49] (03CR) 10Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) (owner: 10Kosta Harlan) [07:05:53] (03PS10) 10Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) [07:06:22] Hello, I would like to ask to apply for deployment for this change: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/747913 . This is just for clean-up and shouldn't affect anything outside this file. 
[07:06:35] (03CR) 10Kosta Harlan: [C: 03+2] "Backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) (owner: 10Kosta Harlan) [07:07:02] (03PS2) 10MMandere: site: Reimage cp5001 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777321 (https://phabricator.wikimedia.org/T290005) [07:07:19] (03Merged) 10jenkins-bot: GrowthExperiments: Add mailing list question for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) (owner: 10Kosta Harlan) [07:08:03] (03CR) 10MMandere: [C: 03+2] site: Reimage cp5001 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777321 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [07:08:06] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:08:09] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [07:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:17] urbanecm: the change is on mwdebug1002, I'm verifying now [07:08:38] Sounds good [07:10:25] Hello, I would like to ask to apply for deployment for this change: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/747913 . This is just for clean-up and shouldn't affect anything outside this file. Related task: https://phabricator.wikimedia.org/T298308 . 
[07:10:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:11:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:49] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp5001.eqsin.wmnet with OS buster [07:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:54] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:12:57] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5001.eqsin.wmnet with OS buster [07:13:14] urbanecm: hmm, doesn't seem to work. Need a few minutes to look into it [07:13:58] (KubernetesCalicoDown) firing: (4) ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:14:13] Winston_Sung[m]: Is it on the deployment calendar? 
I can try to look at it when I'm done with the current patch [07:14:43] (03CR) 10Elukey: [C: 03+2] Update ml-serve-eqiad's dnscore pod IPs after cluster reinit [puppet] - 10https://gerrit.wikimedia.org/r/777413 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [07:15:33] kostajh: I looked at the instructions at https://wikitech.wikimedia.org/wiki/Backport_windows#How_to_submit_a_patch_for_backport but I'm not sure which time I should fill in. [07:16:05] (03CR) 10Elukey: [C: 03+2] Change ml-serve-eqiad coredns' pod IP after cluster reinit [deployment-charts] - 10https://gerrit.wikimedia.org/r/777414 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [07:16:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:57] Winston_Sung[m]: ok, give me a few minutes please [07:17:08] Ok. Thanks. [07:17:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:34] urbanecm: ah, I think the mailing list question doesn't show because I'm also removing the config that currently shows/hides it, and the updated code is riding the wmf.6 train https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/773203 [07:17:50] urbanecm: so, I think it's safe to sync, but we could also wait until after wmf.6 is in group2, what do you think? 
[07:17:58] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:18:06] kostajh: i don't have any issues with syncing [07:18:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:19] you can also backport it to .5 [07:18:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:18:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:58] (KubernetesCalicoDown) firing: (6) ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:19:02] or do `sudo -u mwdeploy vim /srv/mediawiki/wikiversions.php` at mwdebug100X and locally & temporarily promote eswiki to .6 [07:19:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:21] urbanecm: is there an issue with backporting, since extension.json is modified to remove a config that is accessed in a PHP file? 
[07:20:26] Winston_Sung[m]: you would add your patch to the calendar in this section https://wikitech.wikimedia.org/w/index.php?title=Deployments&action=edit&section=13 [07:20:27] !log depool cp4035 for reimage - T290005 [07:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:29] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [07:21:02] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:09] kostajh: good question. assuming you sync WelcomeSurvey.php as the very first thing, it should be fine IMO [07:22:04] (03PS1) 10Kosta Harlan: WelcomeSurvey: Use experiment groups for showing/hiding mailing list question [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/777397 (https://phabricator.wikimedia.org/T303240) [07:22:51] (03CR) 10Kosta Harlan: [C: 03+2] "Backport" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/777397 (https://phabricator.wikimedia.org/T303240) (owner: 10Kosta Harlan) [07:22:58] urbanecm: ok, I'll backport that one too then; added to the calendar [07:22:58] (KubernetesRsyslogDown) resolved: (6) rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:23:04] sounds good [07:23:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:19] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[07:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:27] (03PS2) 10MMandere: site: Reimage cp4035 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777322 (https://phabricator.wikimedia.org/T290005) [07:23:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:23:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:52] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [07:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:58] (KubernetesCalicoDown) resolved: (6) ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:24:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:24:34] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[07:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:23] (03CR) 10Kosta Harlan: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [07:26:25] (03CR) 10MMandere: [C: 03+2] site: Reimage cp4035 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777322 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [07:27:29] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:27:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1002.wikimedia.org [07:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:15] (03PS28) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) [07:28:22] (03PS8) 10Winston Sung: Rearrange zh namespace names and namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 [07:28:38] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4035.ulsfo.wmnet with OS buster [07:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:47] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4035.ulsfo.wmnet with OS buster [07:29:59] (03CR) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" (031 
comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [07:30:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1002.wikimedia.org [07:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:32] urbanecm: so, I already did the "git rebase" step on mediawiki-staging for the config patch, do I need to back out of that somehow since I now want to do the wmf.5 backport first? Or is it OK to leave it in place, I just need to run scap sync-file for the wmf.5 code first? [07:30:49] kostajh: it's the order of scap sync-file commands that matters [07:31:06] it's fine to leave it there (as long as you don't sync a different IS.php change in the meantime) [07:32:34] Published. Did I do something wrong in this edit: https://wikitech.wikimedia.org/wiki/Special:Diff/1964082 ? [07:32:40] (03PS29) 10Kosta Harlan: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [07:33:12] Winston_Sung[m]: lgtm [07:33:22] 'morning RhinosF1 :) [07:33:45] hey urbanecm [07:35:03] (03PS30) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) [07:35:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast2002.wikimedia.org [07:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host webperf2003.codfw.wmnet [07:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:30] (03PS31) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk 
aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) [07:36:08] (03PS9) 10Winston Sung: Rearrange zh namespace names and namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 [07:36:51] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5001.eqsin.wmnet with reason: host reimage [07:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host webperf2004.codfw.wmnet [07:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:54] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [07:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:09] (03CR) 10Kosta Harlan: [C: 03+2] WelcomeSurvey: Use experiment groups for showing/hiding mailing list question [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/777397 (https://phabricator.wikimedia.org/T303240) (owner: 10Kosta Harlan) [07:40:21] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5001.eqsin.wmnet with reason: host reimage [07:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:32] sigh, flaky localisation cache issue so I had to restart the job (https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-php72-docker/94745/console) [07:42:01] T304515 ^ [07:42:02] T304515: PHP Warning: Cannot use a scalar value as an array - https://phabricator.wikimedia.org/T304515 [07:43:38] (03PS10) 10Winston Sung: Rearrange zh namespace names and namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 (https://phabricator.wikimedia.org/T286291) [07:43:48] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4035.ulsfo.wmnet with reason: host reimage [07:43:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:14] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4035.ulsfo.wmnet with reason: host reimage [07:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:36] (03CR) 10jerkins-bot: [V: 04-1] WelcomeSurvey: Use experiment groups for showing/hiding mailing list question [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/777397 (https://phabricator.wikimedia.org/T303240) (owner: 10Kosta Harlan) [07:49:13] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 19.63 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [07:49:27] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 48.75 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [07:53:50] urbanecm: I restarted the job, but Zuul is busy with 119 jobs for Translate extension [07:53:57] :-( [07:54:06] so, I think I will forget about the wmf.5 backport and just sync the config patch [07:54:14] works for me [07:54:27] (03Abandoned) 10Kosta Harlan: WelcomeSurvey: Use experiment groups for showing/hiding mailing list question [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/777397 (https://phabricator.wikimedia.org/T303240) (owner: 10Kosta Harlan) [07:55:58] !log kharlan@deploy1002 Synchronized wmf-config: Config: [[gerrit:773204|GrowthExperiments: Add mailing list question for eswiki (T303240 T305015)]] (duration: 00m 56s) [07:56:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:02] T303240: Welcome emails: opt-in checkbox - https://phabricator.wikimedia.org/T303240 [07:56:03] T305015: Welcome emails: reserve control group - https://phabricator.wikimedia.org/T305015 [07:57:46] Winston_Sung[m]: are you still around? [07:57:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cinderutils: remove absented file [puppet] - 10https://gerrit.wikimedia.org/r/777456 (owner: 10Zabe) [07:58:03] What's up? [07:58:18] Winston_Sung[m]: I can sync your patch, are you able to verify that it works properly? [07:58:32] Yes. [07:58:45] (03CR) 10Kosta Harlan: [C: 03+2] "backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [07:59:28] (03Merged) 10jenkins-bot: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [07:59:47] (03CR) 10Func: Rearrange zh namespace names and namespace aliases (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [08:00:03] Winston_Sung[m]: OK, your patch is on mwdebug1002, please have a look [08:00:05] jnuche and hashar: That opportune time is upon us again. Time for a MediaWiki train - Utc-0 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220406T0800). 
[08:00:25] jnuche hashar: just finishing up with a config patch sync, need a few more minutes please [08:00:42] kostajh: ack, np [08:00:52] (03PS4) 10Filippo Giunchedi: WIP test replacing smokeping with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) [08:02:13] take your time no worries :] [08:03:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host webperf2004.codfw.wmnet [08:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:44] Winston_Sung[m]: I'm not sure how to verify this patch. urbanecm are you familiar with this? I saw your name on one of the associated tasks [08:06:13] (03PS1) 10Jelto: gitlab: add missing chown to restore script [puppet] - 10https://gerrit.wikimedia.org/r/777745 (https://phabricator.wikimedia.org/T274463) [08:06:50] kostajh: https://zh.wikipedia.org/wiki/模块:NoteTA , https://zh.wikipedia.org/wiki/模组:NoteTA , https://zh.wikipedia.org/wiki/模組:NoteTA should redirect to https://zh.wikipedia.org/wiki/Module:NoteTA . That's all [08:06:54] * That's all. [08:06:57] yup, that [08:07:04] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1001.eqiad.wmnet with OS bullseye [08:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:08] AFAICS it works [08:07:18] Winston_Sung[m] urbanecm: I see that happening without the patch [08:07:33] Yes, because that's a clean-up. [08:07:40] kostajh: the patch is no-op. we should test it (the redirect) doesn't break [08:07:41] ok [08:07:56] then I'll sync this change [08:07:59] sounds good :) [08:08:12] (the patch is no-op. we should test it (the redirect) doesn't break) This. 
[08:08:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:55] !log kharlan@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:747913|Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" (T286291 T298308 T165593 T286105)]] (duration: 00m 56s) [08:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:01] T298308: Clean up zh/zh-* namespace aliases in operations/mediawiki-config /wmf-config/InitialiseSettings.php - https://phabricator.wikimedia.org/T298308 [08:10:01] T286105: Update zh namespace names and adding namespace aliases in Scribunto - https://phabricator.wikimedia.org/T286105 [08:10:02] T286291: Clean up, merge and update zh/zh-* translations - https://phabricator.wikimedia.org/T286291 [08:10:02] T165593: Modification of the default alias for namespace 828 "模块:" of Zh Projects - https://phabricator.wikimedia.org/T165593 [08:10:18] Winston_Sung[m]: done. We'll have to leave the "rearrange" patch for another time, unless someone else wants to pick it up now [08:10:32] Ok. Thanks. [08:10:56] This means I have to refill it on the calendar for another time, right? 
[08:10:58] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34721/console" [puppet] - 10https://gerrit.wikimedia.org/r/777745 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [08:11:07] Winston_Sung[m]: yes [08:12:40] Would it be fine if I move it to the "UTC afternoon backport window" today? [08:13:08] Winston_Sung[m]: yes, fine with me [08:13:23] Edited. Thanks. [08:13:40] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: add missing chown to restore script [puppet] - 10https://gerrit.wikimedia.org/r/777745 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [08:17:10] (03PS2) 10KartikMistry: Update cxserver to 2022-04-06-080942-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/777694 (https://phabricator.wikimedia.org/T300958) [08:18:36] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage [08:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:19:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24145 and previous config saved to /var/cache/conftool/dbconfig/20220406-081934-ladsgroup.json [08:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:37] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - 
https://phabricator.wikimedia.org/T298565 [08:20:25] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:20:37] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4035.ulsfo.wmnet with OS buster [08:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:48] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4035.ulsfo.wmnet with OS buster com... [08:21:36] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cp5001.eqsin.wmnet with OS buster [08:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:44] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5001.eqsin.wmnet with OS buster com... [08:21:50] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5001.eqsin.wmnet with OS buster exe... 
[08:21:51] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 104.8 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:23:03] PROBLEM - Host ping3002 is DOWN: PING CRITICAL - Packet loss = 100% [08:23:26] kostajh: urbanecm: Winston_Sung[m]: can we run the train? :) [08:23:38] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage [08:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:39] no objections from me (but kostajh was running the window) [08:23:50] ah wasn't sure thx [08:23:57] RECOVERY - Host ping3002 is UP: PING OK - Packet loss = 0%, RTA = 81.16 ms [08:23:58] * urbanecm was just advising this time :) [08:24:03] looks like one of the patches hasn't completed and is postponed to another window [08:25:56] i don't see kosta logged on the deployment srv anymore, and AFAICS everything (except the rescheduled patch) was deployed, so I'm 95% sure it's clear now [08:26:57] hashar: yes go ahead please [08:27:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:27:48] !log pool cp4035 with HAProxy as TLS termination layer - T290005 [08:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:58] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [08:28:02] PROBLEM - Host ncredir-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [08:28:42] mmhh not expected I suppose, that paged [08:28:44] !log jmm@cumin2002 START - Cookbook 
sre.ganeti.makevm for new host webperf1004.eqiad.wmnet [08:28:45] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:46] RECOVERY - Host ncredir-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 82.16 ms [08:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:57] yeah I tried to ping it and it worked right away [08:28:57] phab is very slow for me godog [08:29:12] * RhinosF1 is esams [08:29:18] http requests going down [08:29:35] RhinosF1: thank you for the feedback! [08:30:14] https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&from=1649212196305&to=1649233796305&viewPanel=4 [08:30:28] yeah esams kinda slow for me too [08:31:18] godog: running ping/traceroute etc [08:31:39] there's a spike of NEL timeouts, seems recovered tho [08:31:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:45] (JobUnavailable) resolved: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:33:11] godog: see noc@, i sent mtr [08:33:16] i still very slow [08:33:20] PROBLEM - Too high an incoming rate of browser-reported Network Error Logging events #page on alert1001 is CRITICAL: type=tcp.timed_out https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [08:33:28] looks like high packet loss at knams [08:33:43] onwards [08:34:24] huge spike of connections on lvs3005 (text-esams) [08:34:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to 
https://phabricator.wikimedia.org/P24146 and previous config saved to /var/cache/conftool/dbconfig/20220406-083439-ladsgroup.json [08:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:44] smaller spike on upload as well [08:34:57] !log jnuche@deploy1002 deploy-promote aborted: (duration: 00m 40s) [08:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:21] !log ayounsi@cumin2002 START - Cookbook sre.network.cf [08:35:21] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1001.eqiad.wmnet with OS bullseye [08:35:22] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.cf (exit_code=0) [08:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:53] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:36:18] (03PS1) 10Majavah: hieradata: add cloudinfra repl pass [labs/private] - 10https://gerrit.wikimedia.org/r/777746 [08:36:42] (03PS9) 10Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) [08:36:44] (03PS3) 10Majavah: O:openstack: add new encapi roles [puppet] - 10https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) [08:36:46] (03PS1) 10Majavah: openstack: remove 'labs' term from new enc servers [puppet] - 10https://gerrit.wikimedia.org/r/777747 [08:37:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: make rsync use new compress algorithm [puppet] - 10https://gerrit.wikimedia.org/r/774824 (https://phabricator.wikimedia.org/T252540) (owner: 10Hashar) [08:37:29] (03PS1) 10Jaime Nuche: group1 wikis to 
1.39.0-wmf.6 refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777748 [08:37:31] (03CR) 10Jaime Nuche: [C: 03+2] group1 wikis to 1.39.0-wmf.6 refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777748 (owner: 10Jaime Nuche) [08:37:32] godog: looks better here [08:37:46] RhinosF1: cheers, we're investigating [08:38:13] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.6 refs T305212 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777748 (owner: 10Jaime Nuche) [08:38:13] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet [08:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:17] cool, let me know if you need anything. the packet loss has stopped now on mtr. [08:38:27] * Emperor here [08:39:56] ah, you got to the problem before my MUA got the page [08:40:17] (03CR) 10David Caro: [C: 03+2] hieradata: add cloudinfra repl pass [labs/private] - 10https://gerrit.wikimedia.org/r/777746 (owner: 10Majavah) [08:40:43] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:41:14] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2] hieradata: add cloudinfra repl pass [labs/private] - 10https://gerrit.wikimedia.org/r/777746 (owner: 10Majavah) [08:42:02] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet [08:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:02] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34722/console" [puppet] - 10https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [08:45:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:45:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:15] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:46:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:46:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:46:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:48:00] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:49:29] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.6 refs T305212 [08:49:31] (03PS1) 10Btullis: Add a dummy datahub_encryption_key value [labs/private] - 10https://gerrit.wikimedia.org/r/777752 (https://phabricator.wikimedia.org/T301454) [08:49:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24147 and previous config saved to /var/cache/conftool/dbconfig/20220406-084944-ladsgroup.json [08:50:14] RECOVERY - Too high an incoming rate of browser-reported Network Error Logging events #page on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [08:50:22] !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.6 refs T305212 (duration: 00m 53s) [08:50:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host webperf1004.eqiad.wmnet [08:54:07] !log pool cp5001 with HAProxy as TLS termination layer - T290005 [08:55:36] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-druid1002.eqiad.wmnet [08:56:09] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:57:32] !log btullis@cumin1001 START - Cookbook sre.presto.reboot-workers for Presto analytics cluster: Reboot Presto nodes [08:58:44] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [08:59:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host webperf1003.eqiad.wmnet [08:59:05] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:00:31] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:01:01] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:03:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-druid1002.eqiad.wmnet [09:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:17] !log btullis@cumin1001 
START - Cookbook sre.hosts.reboot-single for host an-druid1001.eqiad.wmnet [09:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:38] !log force-started update-openstack-mirror.service on mirror1001 for python3-eventlet (T305157) [09:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:41] T305157: Openstack Wallaby on Debian 11 Bullseye problems because eventlet and dnspython - https://phabricator.wikimedia.org/T305157 [09:04:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24148 and previous config saved to /var/cache/conftool/dbconfig/20220406-090449-ladsgroup.json [09:04:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:04:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:16] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10LSobanski) @Papaul Is the April 11th date fixed? 
[09:08:55] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [09:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:38] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on es[2023-2025].codfw.wmnet with reason: Rebooting es2023 T303174 [09:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:41] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2023-2025].codfw.wmnet with reason: Rebooting es2023 T303174 [09:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:50] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2023.codfw.wmnet with reason: Rebooting for T303174 [09:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:51] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2023.codfw.wmnet with reason: Rebooting for T303174 [09:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:32] (03Abandoned) 10Jaime Nuche: test [mediawiki-config] (sandbox/jnuche) - 10https://gerrit.wikimedia.org/r/777370 (owner: 10Jaime Nuche) [09:11:41] 10SRE, 10ops-eqiad: Degraded RAID on thanos-be1003 - https://phabricator.wikimedia.org/T304873 (10fgiunchedi) Thanks @Cmjohnson much appreciated! [09:11:50] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-druid1001.eqiad.wmnet [09:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:40] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:43] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[09:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:09] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:11] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [09:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.reboot-workers (exit_code=0) for Presto analytics cluster: Reboot Presto nodes [09:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:33] (03CR) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [09:21:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host webperf1003.eqiad.wmnet [09:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:22] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1002.eqiad.wmnet with OS bullseye [09:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:33] (03PS11) 10Winston Sung: Rearrange zh namespace names and namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 (https://phabricator.wikimedia.org/T286291) [09:26:44] (03PS1) 10Elukey: Remove the istio-cni config from Calico's for ml-serve-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/777753 (https://phabricator.wikimedia.org/T304673) [09:26:53] (03PS12) 10Winston Sung: Rearrange zh namespace names and namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 (https://phabricator.wikimedia.org/T286291) [09:27:05] (03CR) 10Winston Sung: Rearrange zh namespace names and namespace aliases 
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [09:27:28] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:27:43] (03CR) 10Klausman: [C: 03+1] Remove the istio-cni config from Calico's for ml-serve-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/777753 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [09:28:59] (03PS2) 10Elukey: Remove the istio-cni config from Calico's for ml-serve-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/777753 (https://phabricator.wikimedia.org/T304673) [09:29:22] (03PS4) 10Majavah: O:openstack: add new encapi roles [puppet] - 10https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) [09:30:05] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34723/console" [puppet] - 10https://gerrit.wikimedia.org/r/777753 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [09:30:14] (03CR) 10Elukey: [V: 03+1 C: 03+2] Remove the istio-cni config from Calico's for ml-serve-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/777753 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [09:31:24] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:32:18] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:34:59] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudgw1002.eqiad.wmnet with OS bullseye [09:36:37] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1002.eqiad.wmnet with OS bullseye [09:36:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2001.wikimedia.org [09:36:53] (03PS5) 10Majavah: O:openstack: add new encapi roles [puppet] - 10https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) [09:38:14] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 108 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:38:14] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:38:17] (03PS13) 10Winston Sung: Rearrange zh namespace names and namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 (https://phabricator.wikimedia.org/T286291) [09:38:17] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [09:38:23] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[09:38:57] (03Abandoned) 10Jelto: gitlab: reduce backup_keep_time to save disk space [puppet] - 10https://gerrit.wikimedia.org/r/775265 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [09:39:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2001.wikimedia.org [09:40:15] (03PS14) 10Winston Sung: Rearrange zh namespace names and namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 (https://phabricator.wikimedia.org/T286291) [09:41:30] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [09:41:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host xhgui2001.codfw.wmnet [09:42:56] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudgw1002.eqiad.wmnet with OS bullseye [09:43:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host xhgui2001.codfw.wmnet [09:44:48] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:28] !log installing mariadb-10.3 updates from buster 10.12 point released (different from wmf-mariadb packages) [09:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:35] !log depool cp3052 for reimage - T290005 [09:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:38] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:51:13] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1002.eqiad.wmnet with OS bullseye [09:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [09:52:39] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [09:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:55] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [09:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host xhgui1001.eqiad.wmnet [09:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:14] (03CR) 10Kormat: [C: 03+1] mariadb: Rename wikiuser in db_kill [puppet] - 10https://gerrit.wikimedia.org/r/777754 (owner: 10Ladsgroup) [09:56:23] (03CR) 10MMandere: [C: 03+2] site: Reimage cp3052 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777323 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [09:56:30] (03PS2) 10Mvolz: citoid: switch to native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/776233 (https://phabricator.wikimedia.org/T205870) (owner: 10PipelineBot) [09:57:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host xhgui1001.eqiad.wmnet [09:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:50] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:54] (03CR) 10Ladsgroup: [C: 03+2] mariadb: Rename wikiuser in db_kill [puppet] - 10https://gerrit.wikimedia.org/r/777754 (owner: 10Ladsgroup) [09:58:01] (03PS2) 10Ladsgroup: mariadb: Rename wikiuser in db_kill [puppet] - 10https://gerrit.wikimedia.org/r/777754 [09:58:03] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Rename wikiuser in db_kill [puppet] - 
10https://gerrit.wikimedia.org/r/777754 (owner: 10Ladsgroup) [09:58:22] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp3052.esams.wmnet with OS buster [09:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:31] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3052.esams.wmnet with OS buster [09:58:41] (03CR) 10Mvolz: [C: 04-1] "Not exactly sure what I was doing but made an attempt, review appreciated :)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/776233 (https://phabricator.wikimedia.org/T205870) (owner: 10PipelineBot) [09:58:51] (03PS2) 10Vgutierrez: traffic: Add HAProxyEdgeTrafficDrop [alerts] - 10https://gerrit.wikimedia.org/r/776890 (https://phabricator.wikimedia.org/T290005) [10:00:35] (03PS1) 10Majavah: hieradata: use ntp servers private ip addresses [puppet] - 10https://gerrit.wikimedia.org/r/777755 [10:01:56] (03CR) 10Vgutierrez: [C: 03+2] traffic: Add HAProxyEdgeTrafficDrop [alerts] - 10https://gerrit.wikimedia.org/r/776890 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:02:43] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: host reimage [10:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:31] !log depool cp4027 for reimage - T290005 [10:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:34] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:03:59] (03PS1) 10Kormat: depool-and-wait: Check for dumps, rename wikiuser. 
[software] - 10https://gerrit.wikimedia.org/r/777756 [10:04:26] (03PS2) 10Kormat: depool-and-wait: Check for dumps, rename wikiuser. [software] - 10https://gerrit.wikimedia.org/r/777756 [10:05:03] (03PS1) 10Mvolz: Update zotero to include get endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) [10:05:49] (03CR) 10Mvolz: "Are there any additional changes to the chart that are needed to expose the get endpoint to monitoring?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) (owner: 10Mvolz) [10:06:07] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: host reimage [10:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:25] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34724/console" [puppet] - 10https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [10:06:27] (03PS2) 10MMandere: site: Reimage cp4027 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777324 (https://phabricator.wikimedia.org/T290005) [10:07:33] (03CR) 10MMandere: [C: 03+2] site: Reimage cp4027 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777324 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:07:58] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:22] (03CR) 10Mvolz: [C: 04-1] citoid: switch to native prometheus metrics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/776233 (https://phabricator.wikimedia.org/T205870) (owner: 10PipelineBot) [10:10:14] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4027.ulsfo.wmnet 
with OS buster [10:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:23] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4027.ulsfo.wmnet with OS buster [10:12:25] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:13:06] (03CR) 10Func: Rearrange zh namespace names and namespace aliases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [10:13:36] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:51] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [10:14:35] Emperor: could you look into the alerts for ms-be hosts ? e.g. the one that just fired [10:15:02] (03PS1) 10Ladsgroup: dbtools: Update wikiuser username [software] - 10https://gerrit.wikimedia.org/r/777760 [10:15:07] 10SRE, 10Infrastructure-Foundations, 10observability, 10Patch-For-Review, 10User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (10MoritzMuehlenhoff) >>! In T305147#7833154, @herron wrote: >>>! In T305147#7824394, @MoritzMuehlenhoff wrote: >> There's some thi... 
[10:15:38] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for parsoid::testing [puppet] - 10https://gerrit.wikimedia.org/r/769725 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:15:53] (03CR) 10Jcrespo: [C: 03+1] dbtools: Update wikiuser username [software] - 10https://gerrit.wikimedia.org/r/777760 (owner: 10Ladsgroup) [10:16:20] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:16:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host deploy2002.codfw.wmnet [10:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:40] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147) (owner: 10Herron) [10:18:40] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:19:06] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[10:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:26] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:20:16] (03PS3) 10Zabe: toil: migrate systemd_scope_cleanup cron to a systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/777453 (https://phabricator.wikimedia.org/T273673) [10:20:22] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:20:38] (03CR) 10Zabe: toil: migrate systemd_scope_cleanup cron to a systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777453 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:21:23] godog: ack [10:21:27] (03CR) 10Kormat: [C: 03+1] dbtools: Update wikiuser username [software] - 10https://gerrit.wikimedia.org/r/777760 (owner: 10Ladsgroup) [10:21:41] (03PS1) 10Arturo Borrero Gonzalez: cloudgw1002: rename interface names [puppet] - 10https://gerrit.wikimedia.org/r/777763 (https://phabricator.wikimedia.org/T304598) [10:22:18] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/34726/" [puppet] - 10https://gerrit.wikimedia.org/r/777453 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:23:42] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:23:44] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:09] (03PS2) 10Arturo Borrero Gonzalez: cloudgw1002: rename interface names [puppet] - 10https://gerrit.wikimedia.org/r/777763 (https://phabricator.wikimedia.org/T304598) [10:24:12] !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudgw1002.eqiad.wmnet with OS 
bullseye [10:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:32] godog: looks like the smart collection has wedged on that system [10:24:46] Active: active (running) since Tue 2022-03-22 06:01:00 UTC; 2 weeks 1 days ago [10:24:49] I'll give it a kick [10:25:01] Emperor: cheers! SGTM [10:25:48] (03CR) 10Ladsgroup: [C: 03+2] dbtools: Update wikiuser username [software] - 10https://gerrit.wikimedia.org/r/777760 (owner: 10Ladsgroup) [10:25:57] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4027.ulsfo.wmnet with reason: host reimage [10:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:00] (03CR) 10Ladsgroup: [C: 03+1] depool-and-wait: Check for dumps, rename wikiuser. [software] - 10https://gerrit.wikimedia.org/r/777756 (owner: 10Kormat) [10:26:44] (03CR) 10David Caro: [C: 03+1] cloudgw1002: rename interface names [puppet] - 10https://gerrit.wikimedia.org/r/777763 (https://phabricator.wikimedia.org/T304598) (owner: 10Arturo Borrero Gonzalez) [10:26:46] (03Merged) 10jenkins-bot: dbtools: Update wikiuser username [software] - 10https://gerrit.wikimedia.org/r/777760 (owner: 10Ladsgroup) [10:26:51] Hm, smartctl is stuck unkillably in state D [10:27:04] (03CR) 10David Caro: [C: 03+1] "Got a PCC though? (/me is really bad with typos)" [puppet] - 10https://gerrit.wikimedia.org/r/777763 (https://phabricator.wikimedia.org/T304598) (owner: 10Arturo Borrero Gonzalez) [10:27:14] godog: I think reboot is likely best for this host; if I downtime it, is just bouncing it OK? It's a swift backend... 
[10:27:28] (03PS1) 10Elukey: network::data: update IP ranges for ml-serve-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/777764 (https://phabricator.wikimedia.org/T304673) [10:27:55] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host deploy2002.codfw.wmnet [10:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:59] Emperor: yeah +1 downtime and reboot is fine at any time [10:28:28] (03PS3) 10Kormat: depool-and-wait: Check for backups, rename wikiuser. [software] - 10https://gerrit.wikimedia.org/r/777756 [10:28:33] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3052.esams.wmnet with reason: host reimage [10:28:34] PROBLEM - Keyholder SSH agent on deploy2002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [10:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:09] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[10:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:11] Emperor: I just added you on https://gerrit.wikimedia.org/r/c/operations/puppet/+/777453 as a FYI mostly, since that cron/timer is deployed on swift hosts too, we've done the same conversion in the past and LGTM [10:29:16] (03CR) 10Filippo Giunchedi: [C: 03+1] toil: migrate systemd_scope_cleanup cron to a systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/777453 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:29:23] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4027.ulsfo.wmnet with reason: host reimage [10:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:32] (03CR) 10JMeybohm: [C: 03+1] network::data: update IP ranges for ml-serve-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/777764 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [10:30:01] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:37] !log rerunning es4 dump on backup2002 [10:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:10] (NodeTextfileStale) resolved: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:31:58] (03CR) 10Klausman: [C: 03+1] network::data: update IP ranges for ml-serve-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/777764 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [10:32:15] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3052.esams.wmnet with reason: host reimage [10:32:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:45] (03CR) 10Kormat: [C: 03+2] depool-and-wait: Check for backups, rename wikiuser. [software] - 10https://gerrit.wikimedia.org/r/777756 (owner: 10Kormat) [10:33:15] I don't think that is actually fixed; host is still rebooting [10:33:32] RECOVERY - Keyholder SSH agent on deploy2002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [10:33:35] (03Merged) 10jenkins-bot: depool-and-wait: Check for backups, rename wikiuser. [software] - 10https://gerrit.wikimedia.org/r/777756 (owner: 10Kormat) [10:33:47] (03CR) 10Elukey: [C: 03+2] network::data: update IP ranges for ml-serve-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/777764 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [10:33:48] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:34:58] (KubernetesCalicoDown) firing: (4) ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:35:40] ACKNOWLEDGEMENT - dump of es4 in codfw on alert1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than 8 days ago: Most recent backup 2022-03-29 00:00:01 Jcrespo re-running it now after failure - The acknowledgement expires at: 2022-04-08 10:34:59. 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:35:45] (JobUnavailable) firing: Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:36:40] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [10:38:25] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:28] Oh systemd, you say "Reached target Shutdown.", but the host is not yet obviously rebooting :( [10:38:51] (03CR) 10Winston Sung: Rearrange zh namespace names and namespace aliases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [10:38:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:39:06] I think “reached target shutdown” effectively indicates the point where it hands control back to the initramfs? 
[10:39:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [10:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [10:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24150 and previous config saved to /var/cache/conftool/dbconfig/20220406-103929-ladsgroup.json [10:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:32] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:39:47] Not sure, the only other thing it's said thereafter is '[3538099.727424] watchdog: watchdog0: watchdog did not stop!' [10:39:55] hm [10:40:02] sometime soon I'll get bored of waiting and look up how to powercycle [10:40:30] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:40:45] (JobUnavailable) resolved: Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:41:42] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:42:01] As Willow would say, "bored now" [10:43:56] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:47:16] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:47:18] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:20] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:48:28] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:47] PROBLEM - exim queue #page on mx1001 is CRITICAL: CRITICAL: 13375 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim [10:49:05] (03PS1) 10Muehlenhoff: Add webperf[12]00[34] to DHCP config/site.pp [puppet] - 10https://gerrit.wikimedia.org/r/777787 (https://phabricator.wikimedia.org/T305460) [10:49:09] uh :) [10:49:11] godog: host is back up, the smart data export should fire in 12m [10:49:27] OK, I'm here for the mail thing too. [10:49:31] same [10:49:37] Emperor: thank you! 
[10:49:37] I'm not sure if I can do much [10:49:42] and indeed, the mail thing now [10:49:43] 13375 mails spells leet [10:50:00] lol [10:50:32] Acked the page [10:50:41] <_joe_> it's all mails to wikimedia.org [10:50:46] there are indeed 13416 mails in the queue [10:50:50] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:50:59] <_joe_> 11779 are to wikimedia.org [10:51:40] has someone just forced a retry? lots of deliveries to wikimedia.org going through now [10:52:15] emails alerting about email doesn't help email :-) [10:52:19] <_joe_> no, it's a specific set of email [10:52:22] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:52:30] (also a lot of routing defer (-51) DT=0s: retry time not reached ) [10:52:51] <_joe_> Emperor: that's a specific email, let's move in private where we can discuss details [10:53:00] ok [10:53:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] Update zotero to include get endpoint (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) (owner: 10Mvolz) [10:53:24] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:54:45] (JobUnavailable) firing: Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:57:24] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:56] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:00:05] (03PS1) 10Nikerabbit: Revert "PageTranslationHooks: Don't kick in during interface message parsing" [extensions/Translate] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/777767 [11:00:15] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:33] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3052.esams.wmnet with OS buster [11:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:42] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3052.esams.wmnet with OS buster com... [11:02:27] (03PS2) 10Nikerabbit: Revert "PageTranslationHooks: Don't kick in during interface message parsing" [extensions/Translate] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/777767 (https://phabricator.wikimedia.org/T305531) [11:02:48] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:03:37] !log pool cp3052 with HAProxy as TLS termination layer - T290005 [11:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:40] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:04:26] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:09:49] (03PS2) 10Mvolz: Update zotero to include get endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) [11:10:21] !log btullis@deploy1002 
helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:44] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:10:58] (03CR) 10jerkins-bot: [V: 04-1] Update zotero to include get endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) (owner: 10Mvolz) [11:12:05] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4027.ulsfo.wmnet with OS buster [11:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:13] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4027.ulsfo.wmnet with OS buster com... [11:12:32] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:14:26] (03CR) 10Mvolz: Update zotero to include get endpoint (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) (owner: 10Mvolz) [11:17:19] (03PS3) 10Mvolz: Update zotero to include get endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) [11:18:28] (03CR) 10jerkins-bot: [V: 04-1] Update zotero to include get endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) (owner: 10Mvolz) [11:20:17] !log pool cp4027 with HAProxy as TLS termination layer - T290005 [11:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:20] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:20:35] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC as expected https://puppet-compiler.wmflabs.org/pcc-worker1001/34727/" [puppet] - 10https://gerrit.wikimedia.org/r/777763 (https://phabricator.wikimedia.org/T304598) (owner: 10Arturo Borrero Gonzalez) [11:22:52] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:43] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw1002.eqiad.wmnet with OS bullseye [11:23:44] !log dbmaint s3@eqiad T297189 [11:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:49] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [11:24:33] !log depool cp4033 for reimage - T290005 [11:24:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:17] 10SRE, 10LDAP-Access-Requests: Requesting access to LDAP group NDA for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10TomekSikora.Monsoon) https://wikitech.wikimedia.org username: TomekSikora.Monsoon [11:25:19] (03CR) 10Muehlenhoff: [C: 03+2] Add webperf[12]00[34] to DHCP config/site.pp [puppet] - 10https://gerrit.wikimedia.org/r/777787 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [11:27:42] (03PS4) 10Mvolz: Update zotero to include get endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) [11:28:58] (03PS2) 10MMandere: site: Reimage cp4033 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777325 (https://phabricator.wikimedia.org/T290005) [11:29:30] (03PS1) 10Marostegui: mariadb: Disable notifications on db hosts for B1 [puppet] - 10https://gerrit.wikimedia.org/r/777791 (https://phabricator.wikimedia.org/T305469) [11:30:30] (03CR) 10Marostegui: [C: 03+2] mariadb: Disable notifications on db hosts for B1 [puppet] - 10https://gerrit.wikimedia.org/r/777791 (https://phabricator.wikimedia.org/T305469) (owner: 10Marostegui) [11:30:48] (03CR) 10MMandere: [C: 03+2] site: Reimage cp4033 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777325 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [11:32:06] !log installing wavpack security updates [11:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:34] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4033.ulsfo.wmnet with OS buster [11:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:43] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4033.ulsfo.wmnet with OS buster [11:33:36] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:35:16] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: host reimage [11:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:22] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update zotero to include get endpoint (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) (owner: 10Mvolz) [11:37:24] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:07] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: host reimage [11:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:42:37] 10SRE, 10LDAP-Access-Requests: Requesting access to LDAP group NDA for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10Aklapper) @TomekSikora.Monsoon: Could you please answer all questions in 
the previous comment? Thanks. [11:43:43] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [11:46:55] (03CR) 10Wangombe: [C: 03+1] Revert "PageTranslationHooks: Don't kick in during interface message parsing" [extensions/Translate] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/777767 (https://phabricator.wikimedia.org/T305531) (owner: 10Nikerabbit) [11:47:51] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4033.ulsfo.wmnet with reason: host reimage [11:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:31] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1002.eqiad.wmnet with OS bullseye [11:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:16] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:22] (03CR) 10Mvolz: Update zotero to include get endpoint (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/777758 (https://phabricator.wikimedia.org/T291707) (owner: 10Mvolz) [11:51:14] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4033.ulsfo.wmnet with reason: host reimage [11:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:29] * kart_ updating cxserver.. 
[11:54:26] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-04-06-080942-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/777694 (https://phabricator.wikimedia.org/T300958) (owner: 10KartikMistry) [11:57:13] (03PS1) 10Filippo Giunchedi: mail: link grafana dashboard from alerts [puppet] - 10https://gerrit.wikimedia.org/r/777795 [11:57:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24151 and previous config saved to /var/cache/conftool/dbconfig/20220406-115717-ladsgroup.json [11:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:21] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:57:33] (03PS2) 10Filippo Giunchedi: mail: link grafana dashboard from exim queue alerts [puppet] - 10https://gerrit.wikimedia.org/r/777795 [11:59:15] (03Merged) 10jenkins-bot: Update cxserver to 2022-04-06-080942-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/777694 (https://phabricator.wikimedia.org/T300958) (owner: 10KartikMistry) [11:59:56] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:01:24] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:01:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:51] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:50] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:24] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:22] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:40] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:11:56] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet [12:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24152 and previous config saved to /var/cache/conftool/dbconfig/20220406-121222-ladsgroup.json [12:12:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:36] (03PS1) 10Muehlenhoff: Failover url downloaders [dns] - 10https://gerrit.wikimedia.org/r/777796 [12:12:44] RECOVERY - ensure kvm processes are running on cloudvirt-wdqs1001 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:15:43] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:16:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24153 and previous config saved to /var/cache/conftool/dbconfig/20220406-121606-ladsgroup.json [12:16:38] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:16:56] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet [12:16:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1163 T300775', diff saved to https://phabricator.wikimedia.org/P24154 and previous config saved to /var/cache/conftool/dbconfig/20220406-121657-root.json [12:17:10] (03CR) 10Muehlenhoff: [C: 03+2] Failover url downloaders [dns] - 10https://gerrit.wikimedia.org/r/777796 (owner: 10Muehlenhoff) [12:18:53] (03PS1) 10Marostegui: db1163: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/777798 (https://phabricator.wikimedia.org/T300775) [12:18:56] !log Updated cxserver to 2022-04-06-080942-production (T300958, T305397, T305281, T303762) [12:19:56] (03CR) 10Marostegui: [C: 03+2] db1163: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/777798 (https://phabricator.wikimedia.org/T300775) (owner: 10Marostegui) [12:22:02] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4033.ulsfo.wmnet with OS buster [12:22:11] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with 
real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4033.ulsfo.wmnet with OS buster com... [12:23:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1163', diff saved to https://phabricator.wikimedia.org/P24155 and previous config saved to /var/cache/conftool/dbconfig/20220406-122318-root.json [12:24:55] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [12:27:17] (03CR) 10JMeybohm: [C: 04-1] Add the networkpolicy for the setups as a pre-install hook (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/777419 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [12:27:24] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:28:44] (03CR) 10Filippo Giunchedi: [C: 03+2] toil: migrate systemd_scope_cleanup cron to a systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/777453 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:31:19] (03PS3) 10Btullis: Add the networkpolicy for the setups as a pre-install hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/777419 (https://phabricator.wikimedia.org/T301454) [12:32:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[12:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:44] (03CR) 10JMeybohm: [C: 03+1] Add the networkpolicy for the setups as a pre-install hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/777419 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [12:33:59] (03PS1) 10Ayounsi: DHCP: add option 97 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/777799 (https://phabricator.wikimedia.org/T304677) [12:34:45] (JobUnavailable) resolved: Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:35:02] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [12:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1163', diff saved to https://phabricator.wikimedia.org/P24156 and previous config saved to /var/cache/conftool/dbconfig/20220406-123505-root.json [12:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:45] (03CR) 10Btullis: Add the networkpolicy for the setups as a pre-install hook (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/777419 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [12:35:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1099.eqiad.wmnet with reason: Maintenance [12:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1099.eqiad.wmnet with reason: Maintenance [12:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:04] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Depooling db1099:3311 (T300775)', diff saved to https://phabricator.wikimedia.org/P24157 and previous config saved to /var/cache/conftool/dbconfig/20220406-123603-marostegui.json [12:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:06] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [12:38:56] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [12:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:06] (03PS1) 10Elukey: Change calico's pod-to-pod ip subnet value [deployment-charts] - 10https://gerrit.wikimedia.org/r/777802 (https://phabricator.wikimedia.org/T304673) [12:41:26] (03CR) 10jerkins-bot: [V: 04-1] DHCP: add option 97 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/777799 (https://phabricator.wikimedia.org/T304677) (owner: 10Ayounsi) [12:42:20] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[12:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:00] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:43:14] (03CR) 10JMeybohm: [C: 04-1] Change calico's pod-to-pod ip subnet value (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/777802 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [12:44:48] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:45:16] !log pool cp4033 with HAProxy as TLS termination layer - T290005 [12:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:19] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:48:45] (JobUnavailable) firing: Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:49:05] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [12:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:57] (03PS1) 10Ayounsi: DHCP: use option 97 by default [cookbooks] - 10https://gerrit.wikimedia.org/r/777805 (https://phabricator.wikimedia.org/T304677) [12:51:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:13] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24158 and previous config saved to /var/cache/conftool/dbconfig/20220406-125117-marostegui.json [12:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:22] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [12:53:41] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [12:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:13] (03CR) 10jerkins-bot: [V: 04-1] DHCP: use option 97 by default [cookbooks] - 10https://gerrit.wikimedia.org/r/777805 (https://phabricator.wikimedia.org/T304677) (owner: 10Ayounsi) [12:56:00] (03CR) 10Ottomata: [C: 03+1] zookeeper: migrate zookeeper-cleanup cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/777451 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220406T1300). Please do the needful. [13:00:04] winston_sung, kart_, and zabe_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:16] o/ [13:00:36] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (10Ottomata) Hi @paramita_das, I can try and help., If you are using ssh tunneling (I assume to access JupyterHub?), maybe https://wikitech.wikimedia.org/wiki/Anal... [13:00:59] (03CR) 10Elukey: Change calico's pod-to-pod ip subnet value (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/777802 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [13:01:16] * kart_ is here [13:01:25] (03PS2) 10Elukey: Change calico's pod-to-pod ip subnet value [deployment-charts] - 10https://gerrit.wikimedia.org/r/777802 (https://phabricator.wikimedia.org/T304673) [13:01:53] I'm lurking around (to support kart_ if needed) [13:02:45] Thanks Nikerabbit [13:03:32] Winston_Sung[m]: around? [13:03:50] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [13:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:38] (03CR) 10Btullis: [C: 03+2] Add the networkpolicy for the setups as a pre-install hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/777419 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [13:04:42] OK. I can self-deploy patch I listed first and look at other patches if no deployers around. 
[13:06:21] (03CR) 10Klausman: [C: 03+1] Change calico's pod-to-pod ip subnet value [deployment-charts] - 10https://gerrit.wikimedia.org/r/777802 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [13:06:23] (03CR) 10KartikMistry: [C: 03+2] "Backport to wmf.6" [extensions/Translate] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/777767 (https://phabricator.wikimedia.org/T305531) (owner: 10Nikerabbit) [13:06:38] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale-full only: 1 (doc1001), Fresh: 107 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:07:24] (03CR) 10Elukey: [C: 03+2] Change calico's pod-to-pod ip subnet value [deployment-charts] - 10https://gerrit.wikimedia.org/r/777802 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [13:07:45] !log depool cp4021 for reimage - T290005 [13:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:48] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [13:08:58] zabe_: Probably https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/776254 need rebase. OK to do? We can deploy this while CI runs on Translate. [13:09:19] (03Merged) 10jenkins-bot: Add the networkpolicy for the setups as a pre-install hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/777419 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [13:09:48] kart_, yes, thx [13:09:50] zabe_: also, it has one CI test failure. [13:10:00] Is it OK? [13:10:29] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[13:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:34] (03PS2) 10MMandere: site: Reimage cp4021 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777326 (https://phabricator.wikimedia.org/T290005) [13:10:36] (03PS2) 10KartikMistry: Start writing to $wmgUsingKubernetes the same value as to $wmfUsingKubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776254 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:10:43] yes [13:11:25] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:28] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:36] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:10] (03CR) 10MMandere: [C: 03+2] site: Reimage cp4021 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777326 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [13:13:16] zabe_: OK. Deploying. 
[13:13:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2002.wikimedia.org [13:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:25] (03CR) 10KartikMistry: [C: 03+2] Start writing to $wmgUsingKubernetes the same value as to $wmfUsingKubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776254 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:13:28] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:45] (JobUnavailable) resolved: Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:14:36] (03Merged) 10jenkins-bot: Start writing to $wmgUsingKubernetes the same value as to $wmfUsingKubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776254 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:15:27] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4021.ulsfo.wmnet with OS buster [13:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[13:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:35] zabe_: available to test on mwdebug1001 [13:15:37] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4021.ulsfo.wmnet with OS buster [13:16:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2002.wikimedia.org [13:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:38] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:17:32] zabe_: Please test and let me know.. [13:17:38] doing [13:17:52] cool. [13:18:33] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2094.codfw.wmnet with reason: Rebooting for T303174 [13:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:35] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2094.codfw.wmnet with reason: Rebooting for T303174 [13:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:16] kart_, lgtm [13:19:27] zabe_: Thanks. Deploying. 
[13:19:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:50] !log kartik@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:776254|Start writing to $wmgUsingKubernetes the same value as to $wmfUsingKubernetes (T45956)]] (duration: 00m 55s) [13:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:53] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [13:21:04] zabe_: Done :) [13:21:17] thanks :) [13:21:23] I'll wait for CI to finish for Translate backport patch. [13:22:22] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:23:00] (03Merged) 10jenkins-bot: Revert "PageTranslationHooks: Don't kick in during interface message parsing" [extensions/Translate] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/777767 (https://phabricator.wikimedia.org/T305531) (owner: 10Nikerabbit) [13:23:45] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2095.codfw.wmnet with reason: Rebooting for T303174 [13:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:47] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2095.codfw.wmnet with reason: Rebooting for T303174 [13:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:24:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:22] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:25:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:37] @Nikerabbit mwdebug1001 updated with Translate fix. I'm also testing. [13:26:20] And, it looks good. tag is no longer appearing at: https://www.wikidata.org/wiki/Special:NewLexeme [13:26:29] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:31] oops, I don't have mwdebug on this browser, installing... [13:26:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:52] (03PS2) 10Zabe: Migrate $wmfUsingKubernetes to $wmgUsingKubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776255 (https://phabricator.wikimedia.org/T45956) [13:26:58] @Nikerabbit I'll do more testing meanwhile.. [13:27:24] kart_: installed, and can confirm it looks ok on mwdebug1001 [13:28:02] Nikerabbit: Thanks. Deploying.. [13:28:41] I'm here now. 
[13:29:24] !log kartik@deploy1002 Synchronized php-1.39.0-wmf.6/extensions/Translate/tag/PageTranslationHooks.php: Backport: [[gerrit:777767|Revert "PageTranslationHooks: Don't kick in during interface message parsing" (T305531)]] (duration: 00m 57s) [13:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:27] T305531: Several messages on Wikidata now show tags - https://phabricator.wikimedia.org/T305531 [13:29:43] @Nikerabbit Deployed! [13:29:54] @Winston_Sung[m] OK. Let me check patch. [13:30:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1001.wikimedia.org [13:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:18] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4021.ulsfo.wmnet with reason: host reimage [13:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:36] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2107.codfw.wmnet with reason: Rebooting for T303174 [13:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:37] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2107.codfw.wmnet with reason: Rebooting for T303174 [13:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:45] (JobUnavailable) 
firing: Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:31:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10cmooney) @Cmjohnson no idea unfortunately, it should match the partman config so my guess is you are right, but I can't really confirm. Perha... [13:32:19] Winston_Sung[m]: OK. Let's deploy. [13:33:26] (03CR) 10KartikMistry: [C: 03+2] "UTC afternoon backport." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [13:33:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1001.wikimedia.org [13:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:34] (03Merged) 10jenkins-bot: Rearrange zh namespace names and namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [13:34:43] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4021.ulsfo.wmnet with reason: host reimage [13:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:58] Ok. [13:36:01] Winston_Sung[m]: Patch is available to test on mwdebug1001. Please test and let me know. 
[13:36:36] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [13:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:42] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:47] SRE, ops-codfw, DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul) @LSobanski yes it is.
[13:38:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2125.codfw.wmnet with reason: Rebooting for T303174 [13:39:03] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2125.codfw.wmnet with reason: Rebooting for T303174 [13:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:32] SRE, ops-codfw, DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul) [13:40:50] (PS1) Btullis: Disable the use of SSL/TLS in datahub's MySQL connection in staging [deployment-charts] - https://gerrit.wikimedia.org/r/777810 (https://phabricator.wikimedia.org/T301454) [13:40:57] Winston_Sung[m]: Are we good? [13:41:27] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:41:57] SRE, ops-codfw, DC-Ops, RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (Papaul) [13:42:18] Testing... [13:42:30] SRE, ops-codfw, DC-Ops, RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (Papaul) @hnowlan any update on this?
[13:43:33] (PS2) Btullis: Disable the use of SSL/TLS in datahub's MySQL connection in staging [deployment-charts] - https://gerrit.wikimedia.org/r/777810 (https://phabricator.wikimedia.org/T301454) [13:43:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:37] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2126.codfw.wmnet with reason: Rebooting for T303174 [13:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:38] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2126.codfw.wmnet with reason: Rebooting for T303174 [13:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:57] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:03] Everything is fine for now, checking more coverage. [13:46:21] Sure! Will wait.. [13:46:45] (JobUnavailable) resolved: Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:47:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:51] Confirmed. Everything is good.
[13:47:57] (PS3) Zabe: toil: remove absented systemd_scope_cleanup cron [puppet] - https://gerrit.wikimedia.org/r/777454 (https://phabricator.wikimedia.org/T273673) [13:48:14] (CR) Btullis: [C: +2] Disable the use of SSL/TLS in datahub's MySQL connection in staging [deployment-charts] - https://gerrit.wikimedia.org/r/777810 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [13:48:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:35] Winston_Sung[m]: Cool. Deploying.. [13:51:05] (PS3) Esanders: Disable autotopicsub user option by default [mediawiki-config] - https://gerrit.wikimedia.org/r/771872 (https://phabricator.wikimedia.org/T297966) [13:51:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24159 and previous config saved to /var/cache/conftool/dbconfig/20220406-135132-marostegui.json [13:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:35] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [13:51:48] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:776031|Rearrange zh namespace names and namespace aliases (T286291 T298308)]] (duration: 00m 53s) [13:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:51] T298308: Clean up zh/zh-* namespace aliases in operations/mediawiki-config /wmf-config/InitialiseSettings.php - https://phabricator.wikimedia.org/T298308 [13:51:52] T286291: Clean up, merge and update zh/zh-* translations - https://phabricator.wikimedia.org/T286291 [13:52:00] Winston_Sung[m]: Done.
[13:52:04] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1004.eqiad.wmnet [13:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:32] !log UTC afternoon backport window - Done. [13:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:01] (Merged) jenkins-bot: Disable the use of SSL/TLS in datahub's MySQL connection in staging [deployment-charts] - https://gerrit.wikimedia.org/r/777810 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [13:53:58] !log installing webperf2003 T305460 [13:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:02] T305460: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 [13:54:44] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:01] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:48] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:55:53] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:56] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[13:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:17] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2138.codfw.wmnet with reason: Rebooting for T303174 [13:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2138.codfw.wmnet with reason: Rebooting for T303174 [13:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:30] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:58:31] (03PS1) 10Elukey: Change POD IPv4 subnet for ml-serve-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/777811 (https://phabricator.wikimedia.org/T304673) [14:00:01] (03CR) 10JMeybohm: [C: 03+1] Change POD IPv4 subnet for ml-serve-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/777811 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [14:00:15] (03CR) 10Klausman: [C: 03+1] Change POD IPv4 subnet for ml-serve-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/777811 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [14:01:20] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4021.ulsfo.wmnet with OS buster [14:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:29] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4021.ulsfo.wmnet with OS buster com... 
[14:01:45] (JobUnavailable) firing: Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:02:07] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host aqs1004.eqiad.wmnet [14:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:40] PROBLEM - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is CRITICAL: connect to address 10.64.0.127 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:02:40] PROBLEM - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is CRITICAL: connect to address 10.64.0.126 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:04:20] RECOVERY - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is OK: TCP OK - 0.000 second response time on 10.64.0.127 port 9042 https://phabricator.wikimedia.org/T93886 [14:04:20] RECOVERY - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is OK: TCP OK - 0.001 second response time on 10.64.0.126 port 9042 https://phabricator.wikimedia.org/T93886 [14:05:08] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P24160 and previous config saved to /var/cache/conftool/dbconfig/20220406-140637-marostegui.json [14:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:52] !log installing webperf2004 T305460 [14:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:54] T305460: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 [14:07:42] (03CR) 
10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/777811 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [14:08:36] !log pool cp4021 with HAProxy as TLS termination layer - T290005 [14:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:39] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [14:08:52] (03CR) 10Elukey: [C: 03+2] Change POD IPv4 subnet for ml-serve-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/777811 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [14:13:27] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:28] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:15:33] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:59] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1005.eqiad.wmnet [14:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:45] (JobUnavailable) resolved: Reduced availability for job calico-felix in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:28] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:19:55] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:10] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:23] (03CR) 10David Caro: [C: 03+1] P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [14:20:54] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:04] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P24162 and previous config saved to /var/cache/conftool/dbconfig/20220406-142142-marostegui.json [14:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:05] (03PS1) 10Elukey: Revert "Remove the istio-cni config from Calico's for ml-serve-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/777770 [14:22:27] (03CR) 10Klausman: [C: 03+1] Revert "Remove the istio-cni config from Calico's for ml-serve-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/777770 (owner: 10Elukey) [14:22:49] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2148.codfw.wmnet with reason: Rebooting for T303174 [14:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:51] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2148.codfw.wmnet with reason: Rebooting for T303174 [14:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:50] (03CR) 10Muehlenhoff: mail: link grafana dashboard from exim queue alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777795 (owner: 10Filippo Giunchedi) [14:27:19] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:27:27] PROBLEM - Disk space on mx1001 is CRITICAL: DISK CRITICAL - free space: / 568 MB (3% inode=92%): /tmp 568 MB (3% inode=92%): /var/tmp 568 MB (3% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops [14:27:38] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for 
host aqs1005.eqiad.wmnet [14:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:45] PROBLEM - cassandra-a CQL 10.64.32.189:9042 on aqs1005 is CRITICAL: connect to address 10.64.32.189 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:28:57] RECOVERY - cassandra-a CQL 10.64.32.189:9042 on aqs1005 is OK: TCP OK - 0.000 second response time on 10.64.32.189 port 9042 https://phabricator.wikimedia.org/T93886 [14:30:45] (03CR) 10Filippo Giunchedi: [C: 03+2] toil: remove absented systemd_scope_cleanup cron [puppet] - 10https://gerrit.wikimedia.org/r/777454 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:31:48] (03CR) 10Klausman: [C: 03+2] Revert "Remove the istio-cni config from Calico's for ml-serve-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/777770 (owner: 10Elukey) [14:32:48] (03PS3) 10Filippo Giunchedi: mail: link grafana dashboard from exim queue alerts [puppet] - 10https://gerrit.wikimedia.org/r/777795 [14:32:54] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2109.codfw.wmnet with reason: Rebooting for T303174 [14:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:56] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2109.codfw.wmnet with reason: Rebooting for T303174 [14:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:27] (03PS1) 10Btullis: Update the public ports for TLS for the datahub service [deployment-charts] - 10https://gerrit.wikimedia.org/r/777818 (https://phabricator.wikimedia.org/T301454) [14:35:54] (03CR) 10David Caro: "Got a question about python stuff and maybe a missing rename, but looks nice, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/777747 (owner: 10Majavah) [14:36:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24163 and previous config saved to /var/cache/conftool/dbconfig/20220406-143647-marostegui.json [14:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:51] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [14:36:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 8 hosts with reason: Maintenance [14:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 8 hosts with reason: Maintenance [14:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [14:38:47] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2127.codfw.wmnet with reason: Rebooting for T303174 [14:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:48] !log kormat@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2127.codfw.wmnet with reason: Rebooting for T303174 [14:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:52] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (10Jclark-ctr) [14:43:01] (03CR) 10David Caro: "Just one question here about missing $, otherwise :+1:" [puppet] - 10https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [14:43:07] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (10Jclark-ctr) netmon1003 B1 U32 Port30 Cableid 23000067 [14:44:18] 10SRE, 10LDAP-Access-Requests: Requesting access to LDAP group NDA for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10soworu) Hello Andre. This request emanated from the project we are running with Monsoon. I'm the project lead, and would be the approving party. The request is primarily lim... [14:46:17] 10SRE, 10LDAP-Access-Requests: Requesting access to LDAP group NDA for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10RhinosF1) I'm not sure 'nda' lets you login to google search console. 
[14:46:33] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2149.codfw.wmnet with reason: Rebooting for T303174 [14:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:35] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2149.codfw.wmnet with reason: Rebooting for T303174 [14:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:36] (03PS2) 10Btullis: Update the public ports for TLS for the datahub service [deployment-charts] - 10https://gerrit.wikimedia.org/r/777818 (https://phabricator.wikimedia.org/T301454) [14:51:54] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1006.eqiad.wmnet [14:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:31] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:33] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:52:34] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:36] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:45] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:47] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:52:48] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[14:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:57] PROBLEM - exim queue #page on mx1001 is CRITICAL: CRITICAL: 10994 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim [14:54:42] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2106.codfw.wmnet with reason: Rebooting for T303174 [14:54:43] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2106.codfw.wmnet with reason: Rebooting for T303174 [14:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:36] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:55:37] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:47] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:13] (03CR) 10JMeybohm: [C: 03+1] Update the public ports for TLS for the datahub service [deployment-charts] - 10https://gerrit.wikimedia.org/r/777818 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [14:57:00] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:09] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:15] (03CR) 10Btullis: [C: 03+2] Update the public ports for TLS for the datahub service [deployment-charts] - 10https://gerrit.wikimedia.org/r/777818 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [14:57:24] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:47] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:56] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:58] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:22] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:01:43] !log mforns@deploy1002 Started deploy [airflow-dags/analytics_test@dc748fb]: (no justification provided) [15:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:52] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_test@dc748fb]: (no justification provided) (duration: 00m 08s) [15:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:00] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[15:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:05] (03Merged) 10jenkins-bot: Update the public ports for TLS for the datahub service [deployment-charts] - 10https://gerrit.wikimedia.org/r/777818 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [15:02:24] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host aqs1006.eqiad.wmnet [15:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:41] ACKNOWLEDGEMENT - exim queue #page on mx1001 is CRITICAL: CRITICAL: 10994 mails in exim queue. Herron T305553 https://wikitech.wikimedia.org/wiki/Exim [15:04:32] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:56] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:51] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [15:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:32] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:34] !log mforns@deploy1002 Started deploy [airflow-dags/analytics_test@3018fdb]: (no justification provided) [15:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:40] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[15:06:41] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_test@3018fdb]: (no justification provided) (duration: 00m 07s) [15:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:49] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:10] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [15:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:40] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:54] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [15:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:02] PROBLEM - Disk space on mx1001 is CRITICAL: DISK CRITICAL - free space: / 591 MB (3% inode=92%): /tmp 591 MB (3% inode=92%): /var/tmp 591 MB (3% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops [15:11:47] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[15:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:28] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/777795 (owner: 10Filippo Giunchedi) [15:14:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [15:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [15:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:34] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:02] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:16:12] (03PS10) 10Majavah: P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) [15:16:14] (03PS2) 10Majavah: openstack: remove 'labs' term from new enc servers [puppet] - 10https://gerrit.wikimedia.org/r/777747 [15:16:16] (03PS6) 10Majavah: O:openstack: add new encapi roles [puppet] - 10https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) [15:18:25] (03CR) 10Majavah: openstack: remove 'labs' term from new enc servers (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/777747 (owner: 10Majavah) [15:19:47] (03CR) 10Majavah: O:openstack: add new encapi roles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [15:20:37] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34728/console" [puppet] - 
10https://gerrit.wikimedia.org/r/777747 (owner: 10Majavah) [15:21:36] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34729/console" [puppet] - 10https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [15:24:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:41] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [15:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:19] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [15:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:22] RECOVERY - LVS inference codfw port 30443/tcp - Inference ML service IPv4 on inference.svc.codfw.wmnet is OK: TCP OK - 0.006 second response time on inference.discovery.wmnet port 30443 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:28:29] \o/ [15:29:39] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [15:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:20] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:31:48] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[15:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:27] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2119.codfw.wmnet with reason: Rebooting for T303174 [15:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2119.codfw.wmnet with reason: Rebooting for T303174 [15:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:33:48] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[15:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:47] (03CR) 10David Caro: [C: 03+1] openstack: remove 'labs' term from new enc servers [puppet] - 10https://gerrit.wikimedia.org/r/777747 (owner: 10Majavah) [15:37:53] (03CR) 10David Caro: [C: 03+1] O:openstack: add new encapi roles [puppet] - 10https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [15:37:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:38:10] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2136.codfw.wmnet with reason: Rebooting for T303174 [15:38:12] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2136.codfw.wmnet with reason: Rebooting for T303174 [15:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:45] (JobUnavailable) firing: Reduced availability for job k8s-pods in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:06] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:42:38] (03CR) 10David Caro: [C: 03+2] P:openstack::puppetmaster: split ENC api to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/777341 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [15:42:41] (03CR) 10David Caro: [C: 03+2] openstack: remove 'labs' term from new enc servers [puppet] - 10https://gerrit.wikimedia.org/r/777747 (owner: 10Majavah) [15:42:44] (03CR) 10David Caro: [C: 03+2] O:openstack: add new encapi roles [puppet] - 10https://gerrit.wikimedia.org/r/777385 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [15:43:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:09] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2137.codfw.wmnet with reason: Rebooting for T303174 [15:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:11] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2137.codfw.wmnet with reason: Rebooting for T303174 [15:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:00] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@3018fdb]: (no justification provided) [15:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:08] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@3018fdb]: (no justification provided) (duration: 00m 07s) [15:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:14] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (10Jclark-ctr) [15:52:42] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: 
instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:54:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc2040.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:08] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc2040.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host netmon1003.mgmt.eqiad.wmnet with reboot policy FORCED [15:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:35] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2140.codfw.wmnet with reason: Rebooting for T303174 [15:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:37] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2140.codfw.wmnet with reason: Rebooting for T303174 [15:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:12] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:57:12] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:57:40] (03PS1) 10Btullis: Update the port number for the datahub-gms service using TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/777831 (https://phabricator.wikimedia.org/T301454) [16:00:48] (03CR) 10Filippo Giunchedi: [C: 03+2] mail: link grafana dashboard from exim queue alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777795 (owner: 10Filippo Giunchedi) [16:01:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:02:09] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@b029f10]: (no justification provided) [16:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:17] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@b029f10]: (no justification provided) (duration: 00m 08s) [16:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:06:34] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2147.codfw.wmnet with reason: Rebooting for T303174 [16:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:36] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2147.codfw.wmnet with reason: Rebooting for T303174 [16:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:31] (03PS2) 10Btullis: Update the port number for the datahub-gms service using TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/777831 (https://phabricator.wikimedia.org/T301454) [16:11:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:56] (03CR) 10RLazarus: [C: 03+2] external_clouds_vendors: Add Linode [puppet] - 10https://gerrit.wikimedia.org/r/775360 (https://phabricator.wikimedia.org/T270391) (owner: 10RLazarus) [16:13:50] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:14:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host netmon1003.mgmt.eqiad.wmnet with reboot policy FORCED [16:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:34] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2111.codfw.wmnet with reason: Rebooting for T303174 [16:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:35] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2111.codfw.wmnet with 
reason: Rebooting for T303174 [16:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:56] (03CR) 10Herron: [C: 03+2] spicerack: add logging clusters to elasticsearch config [puppet] - 10https://gerrit.wikimedia.org/r/777421 (https://phabricator.wikimedia.org/T255864) (owner: 10Herron) [16:19:08] (03PS2) 10Herron: sre.kafka.reboot-workers: add logging-codfw targets [cookbooks] - 10https://gerrit.wikimedia.org/r/777375 (https://phabricator.wikimedia.org/T279342) [16:20:35] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2113.codfw.wmnet with reason: Rebooting for T303174 [16:20:36] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2113.codfw.wmnet with reason: Rebooting for T303174 [16:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:22] (03CR) 10Btullis: [C: 03+2] Update the port number for the datahub-gms service using TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/777831 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [16:24:09] RECOVERY - exim queue #page on mx1001 is OK: OK: Less than 2000 mails in exim queue. 
https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail [16:26:56] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2128.codfw.wmnet with reason: Rebooting for T303174 [16:26:57] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2128.codfw.wmnet with reason: Rebooting for T303174 [16:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:13] (03Merged) 10jenkins-bot: Update the port number for the datahub-gms service using TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/777831 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [16:27:52] (03CR) 10Dzahn: [C: 03+1] gitlab: add missing chown to restore script [puppet] - 10https://gerrit.wikimedia.org/r/777745 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [16:28:06] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:28:48] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:29:20] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:29:58] (03CR) 10Herron: [C: 03+2] sre.kafka.reboot-workers: add logging-codfw targets [cookbooks] - 10https://gerrit.wikimedia.org/r/777375 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [16:31:34] RECOVERY - Disk space on mx1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops [16:32:52] (03Merged) 10jenkins-bot: sre.kafka.reboot-workers: add logging-codfw targets [cookbooks] - 10https://gerrit.wikimedia.org/r/777375 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [16:36:10] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2114.codfw.wmnet with reason: Rebooting for T303174 [16:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:12] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2114.codfw.wmnet with reason: Rebooting for T303174 [16:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:46] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2117.codfw.wmnet with reason: Rebooting for T303174 [16:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:48] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2117.codfw.wmnet with reason: Rebooting for T303174 [16:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:32] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:45:21] * Krinkle experimenting on mwdebug1002 
[16:46:30] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:47:34] 10SRE, 10Infrastructure-Foundations, 10Mail: MX: increasing disks space - https://phabricator.wikimedia.org/T305567 (10herron) [16:47:44] 10SRE, 10Infrastructure-Foundations, 10Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (10herron) [16:51:34] (03CR) 10Herron: [C: 03+2] ipmiseld: ensure service enabled and running [puppet] - 10https://gerrit.wikimedia.org/r/775875 (https://phabricator.wikimedia.org/T305147) (owner: 10Herron) [16:51:54] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.6774 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:52:00] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2124.codfw.wmnet with reason: Rebooting for T303174 [16:52:02] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2124.codfw.wmnet with reason: Rebooting for T303174 [16:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:44] * Krinkle done [16:57:50] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2097.codfw.wmnet with OS bullseye [16:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:38] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.04839 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:01:05] !log kormat@cumin1001 START - Cookbook 
sre.hosts.downtime for 1:30:00 on db2108.codfw.wmnet with reason: Rebooting for T303174 [17:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:06] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2108.codfw.wmnet with reason: Rebooting for T303174 [17:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:21] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [17:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:54] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [17:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance [17:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance [17:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:02:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T297189)', diff saved to https://phabricator.wikimedia.org/P24164 and previous config saved to /var/cache/conftool/dbconfig/20220406-170223-marostegui.json [17:02:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:41] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [17:04:20] 10ops-codfw, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10RobH) [17:04:27] 10ops-codfw, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10RobH) [17:05:36] PROBLEM - Check systemd state on dns1001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:48] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:06:18] PROBLEM - Check systemd state on mw1375 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:20] PROBLEM - Check systemd state on mw1448 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:23] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2097.codfw.wmnet with reason: host reimage [17:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:14] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2118.codfw.wmnet with reason: Rebooting for T303174 [17:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:15] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2118.codfw.wmnet with reason: Rebooting for T303174 [17:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:24] PROBLEM - 
Check systemd state on dns2002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:21] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2097.codfw.wmnet with reason: host reimage [17:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:57] PROBLEM - Check systemd state on mw1414 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:59] 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10RobH) [17:13:09] 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10RobH) [17:13:55] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2120.codfw.wmnet with reason: Rebooting for T303174 [17:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:57] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2120.codfw.wmnet with reason: Rebooting for T303174 [17:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:03] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:18:15] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:22:59] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:45] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2097.codfw.wmnet with OS bullseye [17:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:35] RECOVERY - Check systemd state on mw1375 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:39] RECOVERY - Check systemd state on mw1448 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:41] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:36:13] RECOVERY - Check systemd state on dns1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:13] RECOVERY - Check systemd state on dns2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:27] RECOVERY - Check systemd state on mw1414 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:37] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:37:31] (03PS1) 10MMandere: site: Reimage cp3050 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777846 (https://phabricator.wikimedia.org/T290005) [17:37:33] (03PS1) 10MMandere: site: Reimage cp6014 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777847 (https://phabricator.wikimedia.org/T290005) [17:37:35] (03PS1) 10MMandere: site: Reimage cp3053 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777848 (https://phabricator.wikimedia.org/T290005) [17:37:37] (03PS1) 10MMandere: site: Reimage cp6006 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777849 (https://phabricator.wikimedia.org/T290005) [17:37:39] (03PS1) 10MMandere: site: Reimage cp3051 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777850 (https://phabricator.wikimedia.org/T290005) [17:37:41] (03PS1) 10MMandere: site: Reimage cp6013 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777851 (https://phabricator.wikimedia.org/T290005) [17:37:43] (03PS1) 10MMandere: site: Reimage cp6005 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777852 (https://phabricator.wikimedia.org/T290005) [17:37:45] (03PS1) 10MMandere: site: Reimage cp6012 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777853 (https://phabricator.wikimedia.org/T290005) [17:37:47] (03PS1) 10MMandere: site: Reimage cp6004 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777854 (https://phabricator.wikimedia.org/T290005) [17:38:26] 10SRE, 10Infrastructure-Foundations, 10Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (10MoritzMuehlenhoff) We could add a second disk to the mx* VMs and move /var or to it, but this sounds rather something to factor it for the new VMs running the new setup? (The immediate log c... 
[17:42:50] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [17:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:47] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:29] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.9032 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:45:50] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.1837 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [17:46:05] looking [17:46:13] * volans here if needed [17:47:02] here [17:48:55] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:58] the good news is we aren't fully saturated [17:54:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T297189)', diff saved to https://phabricator.wikimedia.org/P24165 and previous config saved to /var/cache/conftool/dbconfig/20220406-175403-marostegui.json [17:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:08] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [17:55:56] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:57:19] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Review filtering for 
cloud-hosts on CR routers eqiad - https://phabricator.wikimedia.org/T285461 (10cmooney) Unfortunately the uRPF exception command is not supported on the QFX platform, which means configuring it on top-of-rac... [17:58:18] (03PS1) 10Cathal Mooney: Add inbound filter to analytics IRB interfaces on EVPN switches Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/777855 (https://phabricator.wikimedia.org/T299758) [17:58:24] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [17:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] jnuche and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220406T1800). [18:01:58] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot [18:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P24166 and previous config saved to /var/cache/conftool/dbconfig/20220406-180909-marostegui.json [18:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:48] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 46.85 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [18:15:10] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:29] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5541 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [18:19:22] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:20:08] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10Eevans) [18:22:12] PROBLEM - Check systemd state on cp6014 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_varnish-frontend-hospital.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:38] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:24:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P24167 and previous config saved to /var/cache/conftool/dbconfig/20220406-182414-marostegui.json [18:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:24] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:48] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:27:51] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10Eevans) [18:30:16] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10Eevans) [18:30:30] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:36:07] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [18:39:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T297189)', diff saved to https://phabricator.wikimedia.org/P24168 and previous config saved to /var/cache/conftool/dbconfig/20220406-183919-marostegui.json [18:39:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance [18:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance [18:39:23] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [18:39:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T297189)', diff saved to https://phabricator.wikimedia.org/P24169 and previous config saved to /var/cache/conftool/dbconfig/20220406-183927-marostegui.json [18:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:39] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10Cmjohnson) [18:45:10] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:56] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp3050 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777846 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [18:49:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Cmjohnson) [18:49:29] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp6014 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777847 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [18:50:07] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp3053 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777848 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [18:50:43] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp6006 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777849 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [18:51:08] 
(03CR) 10Ssingh: [C: 03+1] site: Reimage cp6013 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777851 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [18:51:31] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp3051 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777850 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [18:51:52] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp6005 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777852 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [18:52:24] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp6012 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777853 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [18:54:00] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp6004 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/777854 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [18:54:30] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [18:56:10] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:30] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [19:06:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Cmjohnson) [19:06:29] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Volans) If I may add my use case too, I would like to be able to restrict the access to the webproxies from the cumin host... [19:07:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host dse-k8s-worker1001.mgmt.eqiad.wmnet with reboot policy FORCED [19:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host dse-k8s-worker1002.mgmt.eqiad.wmnet with reboot policy FORCED [19:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:35] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host dse-k8s-worker1003.mgmt.eqiad.wmnet with reboot policy FORCED [19:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host dse-k8s-worker1004.mgmt.eqiad.wmnet with reboot policy FORCED [19:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:07] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [19:13:08] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T297189)', diff saved to https://phabricator.wikimedia.org/P24170 and previous config saved to /var/cache/conftool/dbconfig/20220406-191402-marostegui.json [19:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:05] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [19:14:17] (03PS1) 10Cmjohnson: update site.pp with new ml-serve100[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/777860 (https://phabricator.wikimedia.org/T294949) [19:15:32] (03CR) 10Cmjohnson: [C: 03+2] update site.pp with new ml-serve100[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/777860 (https://phabricator.wikimedia.org/T294949) (owner: 10Cmjohnson) [19:15:51] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) @MoritzMuehlenhoff. Dell is stating they want us to upgrade to the newest revision of the utility for them to offer support. I'm pushing back, but they state we should be using: Version 7.1623.0... 
[19:16:12] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Cmjohnson) [19:21:54] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_imagecatalog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1003.mgmt.eqiad.wmnet with reboot policy FORCED [19:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:16] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1001.mgmt.eqiad.wmnet with reboot policy FORCED [19:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:38] !log rook@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1016.eqiad.wmnet [19:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:50] !log rook@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudvirt1016.eqiad.wmnet [19:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1002.mgmt.eqiad.wmnet with reboot policy FORCED [19:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1004.mgmt.eqiad.wmnet with reboot policy FORCED [19:25:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:41] jouncebot: nowandnext [19:27:41] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [19:27:41] In 0 hour(s) and 32 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220406T2000) [19:29:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P24171 and previous config saved to /var/cache/conftool/dbconfig/20220406-192907-marostegui.json [19:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS bullseye [19:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye [19:31:30] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-cache1002.eqiad.wmnet with OS bullseye [19:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye executed wit... 
[19:39:00] (JobUnavailable) firing: Reduced availability for job k8s-pods in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:39:53] (03PS1) 10Cmjohnson: add new dse-k8s-worker hosts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/777863 (https://phabricator.wikimedia.org/T291579) [19:40:48] (03CR) 10Cmjohnson: [C: 03+2] add new dse-k8s-worker hosts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/777863 (https://phabricator.wikimedia.org/T291579) (owner: 10Cmjohnson) [19:42:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10Cmjohnson) [19:42:06] 10SRE, 10conftool: requestctl v1 improvements - https://phabricator.wikimedia.org/T305580 (10CDanis) p:05Triage→03High [19:44:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P24172 and previous config saved to /var/cache/conftool/dbconfig/20220406-194412-marostegui.json [19:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1005.eqiad.wmnet with OS bullseye [19:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit... 
[19:45:28] 10SRE, 10conftool: ipblocks support for other "entities" (not clouds, not abuse nets) - https://phabricator.wikimedia.org/T305581 (10CDanis) [19:45:44] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1006.eqiad.wmnet with OS bullseye [19:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1006.eqiad.wmnet wit... [19:48:13] 10SRE, 10conftool: Annotate X-Analytics header with any matching actions - https://phabricator.wikimedia.org/T305582 (10CDanis) [19:48:24] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1007.eqiad.wmnet with OS bullseye [19:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1007.eqiad.wmnet wit... [19:48:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1008.eqiad.wmnet with OS bullseye [19:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1008.eqiad.wmnet wit... 
[19:50:27] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1008.eqiad.wmnet with OS bullseye [19:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1008.eqiad.wmnet with OS... [19:50:49] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1007.eqiad.wmnet with OS bullseye [19:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1007.eqiad.wmnet with OS... [19:50:59] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1006.eqiad.wmnet with OS bullseye [19:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1006.eqiad.wmnet with OS... 
[19:51:12] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1005.eqiad.wmnet with OS bullseye [19:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS... [19:52:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:53:00] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:54:06] ^ ruh roh, mailman does seem down [19:54:44] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:46] oh nope there it is again [19:54:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47966 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:54:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.334 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:56:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Cmjohnson) ml-serve1005 E4:3D:1A:A2:BF:FC ml-serve1006 E4:3D:1A:AD:D7:A2 ml-serve1007 E4:3D:1A:AC:8F:D6 ml-serve1008 E4:3D:1A:AD:... 
[19:57:15] rzl: it was slow for me [19:57:22] i think it's one of them nights [19:57:25] an event like this happened the other day [19:57:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Cmjohnson) @Jclark-ctr These are erroring during the installation with the media failure, suggesting that there isn't a cable con... [19:58:42] 10SRE, 10conftool: Annotate X-Analytics header with any matching actions - https://phabricator.wikimedia.org/T305582 (10CDanis) [19:58:44] 10SRE, 10conftool: ipblocks support for other "entities" (not clouds, not abuse nets) - https://phabricator.wikimedia.org/T305581 (10CDanis) [19:58:46] 10SRE, 10conftool: requestctl v1 improvements - https://phabricator.wikimedia.org/T305580 (10CDanis) [19:58:59] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1001.eqiad.wmnet with reason: reboot for maintenance [19:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:02] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1001.eqiad.wmnet with reason: reboot for maintenance [19:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T297189)', diff saved to https://phabricator.wikimedia.org/P24173 and previous config saved to /var/cache/conftool/dbconfig/20220406-195917-marostegui.json [19:59:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [19:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:20] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [19:59:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [19:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24174 and previous config saved to /var/cache/conftool/dbconfig/20220406-195925-marostegui.json [19:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:39] (03CR) 10JHathaway: [C: 03+1] "looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/777795 (owner: 10Filippo Giunchedi) [20:00:04] RoanKattouw and Urbanecm: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220406T2000). [20:00:04] jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:00] o/ [20:01:05] i can deploy [20:01:37] really? thank you cjming! [20:01:46] np! [20:02:25] Yay thank you cjming ! 
[20:03:14] (03CR) 10Clare Ming: [C: 03+2] Update to 78eef14, rename viewportSize to viewportSizeBucket [extensions/WikimediaEvents] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/777389 (https://phabricator.wikimedia.org/T301391) (owner: 10Jdlrobson) [20:03:45] !log phabricator about to be rebooted - hang on [20:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:16] (03Merged) 10jenkins-bot: Update to 78eef14, rename viewportSize to viewportSizeBucket [extensions/WikimediaEvents] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/777389 (https://phabricator.wikimedia.org/T301391) (owner: 10Jdlrobson) [20:05:54] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:20] for backports to wmf.6 - is that still testable on the mwdebug servers? [20:08:18] cjming: honestly I'm not sure how to test this on mwdebug. I can check the data after it's been deployed to production though. [20:08:36] ok - i'll go ahead and sync then [20:09:06] given the bucketing and all... sounds good [20:09:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10Cmjohnson) [20:10:06] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.6/extensions/WikimediaEvents/modules/ext.wikimediaEvents/desktopWebUIActions.js: Backport: [[gerrit:777389|Update to 78eef14, rename viewportSize to viewportSizeBucket (T301391)]] (duration: 00m 55s) [20:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:09] T301391: Update click tracking to take into account screen resolution - https://phabricator.wikimedia.org/T301391 [20:10:21] jan_drewniak: alrighty - your change is live [20:10:52] thanks! 
I hope it works this time :P [20:11:00] 🤞 [20:12:30] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:41] i'll hang out for a bit and if no one else shows up, i'll log a msg closing the window if that seems reasonable [20:13:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:13:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:42] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:24:40] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:24:53] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T304849 (10phaultfinder) [20:27:18] 10SRE, 10Traffic: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (10ssingh) [20:29:06] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:30:12] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The 
following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:20] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:34:41] pybal alerts were due to phab reboot and git-ssh , but it's fixed [20:35:16] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security updates - bking@cumin1001 - T304938 [20:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:16] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security updates - bking@cumin1001 - T304938 [20:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:13] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security updates - bking@cumin1001 - T304938 [20:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:19] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security updates - bking@cumin1001 - T304938 [20:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:55] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security updates - bking@cumin1001 - T304938 [20:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 
(10Jclark-ctr) port was not set to pxe fixed setting for all 4 host [20:42:16] (03PS1) 10Vivian Rook: add chunkeddriver.py.patch to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) [20:42:57] (03CR) 10Vivian Rook: "Not sure how this is tested." [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [20:43:34] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:35] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1005.eqiad.wmnet with OS bullseye [20:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit... 
[20:46:18] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security updates - bking@cumin1001 - T304938 [20:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:02] (03CR) 10Razzi: [C: 03+2] aqs: update mediawiki history snapshot for March 2022 [puppet] - 10https://gerrit.wikimedia.org/r/777407 (owner: 10Razzi) [20:50:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24176 and previous config saved to /var/cache/conftool/dbconfig/20220406-205040-marostegui.json [20:50:44] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:48] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [20:50:50] !log end of UTC late backport & config window [20:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:24] !log razzi@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[20:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:09] (03PS1) 10Cwhite: logstash: replace all instances of @metadata.partition [puppet] - 10https://gerrit.wikimedia.org/r/777874 (https://phabricator.wikimedia.org/T305175) [20:54:24] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: security updates - bking@cumin1001 - T304938 [20:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:33] (03PS1) 10Ryan Kemper: elastic: relforge needs --without-lvs [cookbooks] - 10https://gerrit.wikimedia.org/r/777875 [20:56:07] !log razzi@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [20:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:04] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:57:06] (03CR) 10Bking: [C: 03+1] elastic: relforge needs --without-lvs [cookbooks] - 10https://gerrit.wikimedia.org/r/777875 (owner: 10Ryan Kemper) [20:58:54] 10SRE, 10Traffic: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (10ssingh) [20:59:00] 10SRE, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [21:01:36] (03PS1) 10Cwhite: logstash: bugfix remove rsyslog-set log.level from blackbox_exporter events [puppet] - 10https://gerrit.wikimedia.org/r/777877 [21:02:42] PROBLEM - Check systemd state on cloudelastic1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:03:38] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1005.eqiad.wmnet with OS bullseye [21:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS... [21:04:01] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1005.eqiad.wmnet with OS bullseye [21:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit... 
[21:05:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P24177 and previous config saved to /var/cache/conftool/dbconfig/20220406-210545-marostegui.json [21:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:46] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:09:20] PROBLEM - Check systemd state on cloudelastic1004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:45] (03CR) 10Cwhite: "I ran into inconsistent level rendering while testing the diagnostics feature." [puppet] - 10https://gerrit.wikimedia.org/r/777877 (owner: 10Cwhite) [21:14:56] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:02] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:16] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=wtp1037.wmnet [21:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:24] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1037.wmnet [21:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:32] PROBLEM - Check systemd state on cloudelastic1006 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:32] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1037.eqiad.wmnet [21:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:26] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:20:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P24178 and previous config saved to /var/cache/conftool/dbconfig/20220406-212052-marostegui.json [21:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:55] !log wtp1037,wtp1038,wtp1039 - rebooting sequentially [21:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:55] PROBLEM - Host wtp1037 is DOWN: PING CRITICAL - Packet loss = 100% [21:23:35] RECOVERY - Host wtp1037 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [21:24:38] (03PS1) 10Cwhite: logstash: set partition on legacy indexes [puppet] - 10https://gerrit.wikimedia.org/r/777880 (https://phabricator.wikimedia.org/T305175) [21:25:35] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:26:32] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1037.eqiad.wmnet [21:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:38] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1038.eqiad.wmnet [21:26:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:54] (03CR) 10jerkins-bot: [V: 04-1] logstash: set partition on legacy indexes [puppet] - 10https://gerrit.wikimedia.org/r/777880 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [21:28:37] RECOVERY - Check systemd state on cloudelastic1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:59] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:30:27] PROBLEM - Check systemd state on cloudelastic1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:55] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1005.eqiad.wmnet with OS bullseye [21:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS... 
[21:32:51] RECOVERY - Check systemd state on cloudelastic1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:37] PROBLEM - Host wtp1038 is DOWN: PING CRITICAL - Packet loss = 100% [21:33:48] 10SRE, 10Infrastructure-Foundations, 10Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (10jhathaway) I rotated the log file and then compressed it on another host for this specific incident, but it was cumbersome. I think we should definitely embiggen the disks for the new Postfix... [21:34:17] RECOVERY - Host wtp1038 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [21:34:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1005.eqiad.wmnet with OS bullseye [21:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit... 
[21:35:13] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1038.eqiad.wmnet [21:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:19] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=wtp1039.eqiad.wmnet [21:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24179 and previous config saved to /var/cache/conftool/dbconfig/20220406-213557-marostegui.json [21:35:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [21:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [21:36:01] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [21:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:04] (03PS1) 10Cwhite: logstash: transform human-friendly values to bucket date format [puppet] - 10https://gerrit.wikimedia.org/r/777882 (https://phabricator.wikimedia.org/T305175) [21:36:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24180 and previous config saved to /var/cache/conftool/dbconfig/20220406-213605-marostegui.json [21:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:24] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster 
cloudelastic: security updates - bking@cumin1001 - T304938 [21:37:25] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:20] !log razzi@deploy1002 Started deploy [analytics/turnilo/deploy@a1c5c6f]: (no justification provided) [21:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:14] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1006.eqiad.wmnet with OS bullseye [21:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1006.eqiad.wmnet wit... [21:41:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1007.eqiad.wmnet with OS bullseye [21:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1008.eqiad.wmnet with OS bullseye [21:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1007.eqiad.wmnet wit... 
[21:41:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1008.eqiad.wmnet wit... [21:42:54] !log razzi@deploy1002 Finished deploy [analytics/turnilo/deploy@a1c5c6f]: (no justification provided) (duration: 04m 34s) [21:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:11] PROBLEM - Host wtp1039 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:47] RECOVERY - Host wtp1039 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [21:44:11] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:44:15] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:07] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:46:17] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1039.eqiad.wmnet [21:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:25] PROBLEM - Check systemd state on cloudelastic1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:51:26] !log parse2013, parse2014 - rebooting [21:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:49] PROBLEM - Host parse2014 is DOWN: PING CRITICAL - Packet loss = 100% [21:54:05] PROBLEM - Host parse2013 is DOWN: PING CRITICAL - Packet loss = 100% [21:54:15] RECOVERY - Host parse2013 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [21:54:19] RECOVERY - Host parse2014 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [21:56:33] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:03] PROBLEM - Host parse2012 is DOWN: PING CRITICAL - Packet loss = 100% [21:57:04] !log parse2011, parse2012 - rebooting [21:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:25] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1005.eqiad.wmnet with OS bullseye [21:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS... [21:57:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1005.eqiad.wmnet with OS bullseye [21:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit... [21:58:27] PROBLEM - Host parse2011 is DOWN: PING CRITICAL - Packet loss = 100% [21:58:51] RECOVERY - Host parse2012 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [21:59:23] RECOVERY - Host parse2011 is UP: PING WARNING - Packet loss = 77%, RTA = 34.24 ms [22:01:21] (03PS1) 10Jdlrobson: Convert performanceNow datatype to Integer in QuickSurvey Initiation in order to resolve data type mismatch in schema. 
[extensions/QuickSurveys] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/777775 (https://phabricator.wikimedia.org/T305171) [22:01:25] RECOVERY - Check systemd state on cloudelastic1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:02:29] RECOVERY - Check systemd state on cloudelastic1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:03:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1007.eqiad.wmnet with reason: host reimage [22:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:59] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1008.eqiad.wmnet with reason: host reimage [22:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1006.eqiad.wmnet with reason: host reimage [22:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:05] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1005.eqiad.wmnet with OS bullseye [22:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS... 
[22:05:26] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve1005.eqiad.wmnet with OS bullseye [22:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet wit... [22:07:17] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1007.eqiad.wmnet with reason: host reimage [22:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:22] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1006.eqiad.wmnet with reason: host reimage [22:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1008.eqiad.wmnet with reason: host reimage [22:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:07] (03PS1) 10Cwhite: logstash: rewrite ecs output, bucket, and template_version settings [puppet] - 10https://gerrit.wikimedia.org/r/777887 (https://phabricator.wikimedia.org/T305013) [22:14:09] (03PS1) 10Cwhite: logstash: set dlq output and template_version [puppet] - 10https://gerrit.wikimedia.org/r/777888 (https://phabricator.wikimedia.org/T305088) [22:14:13] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1005.eqiad.wmnet with OS bullseye [22:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - 
https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1005.eqiad.wmnet with OS... [22:15:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T300775)', diff saved to https://phabricator.wikimedia.org/P24181 and previous config saved to /var/cache/conftool/dbconfig/20220406-221555-marostegui.json [22:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:58] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [22:16:04] !log parse2009, parse2010 - rebooting [22:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:45] PROBLEM - Host parse2010 is DOWN: PING CRITICAL - Packet loss = 100% [22:18:21] PROBLEM - Host parse2009 is DOWN: PING CRITICAL - Packet loss = 100% [22:19:13] RECOVERY - Host parse2010 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [22:19:15] RECOVERY - Host parse2009 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms [22:26:03] 10SRE, 10Traffic: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (10Dzahn) My 2 cents: cookbook not worth it in this case, likely more work to create and debug it than the actual time savings with installs because it will just happen like once every 2 years or less a... 
[22:26:56] !log parse2007, parse2008 - rebooting [22:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:13] PROBLEM - Host parse2008 is DOWN: PING CRITICAL - Packet loss = 100% [22:29:17] PROBLEM - Host parse2007 is DOWN: PING CRITICAL - Packet loss = 100% [22:29:45] RECOVERY - Host parse2008 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms [22:30:01] RECOVERY - Host parse2007 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [22:30:17] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1007.eqiad.wmnet with OS bullseye [22:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1007.eqiad.wmnet with OS... [22:31:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24182 and previous config saved to /var/cache/conftool/dbconfig/20220406-223100-marostegui.json [22:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:35] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1006.eqiad.wmnet with OS bullseye [22:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1006.eqiad.wmnet with OS... 
[22:33:28] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1008.eqiad.wmnet with OS bullseye [22:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-serve1008.eqiad.wmnet with OS... [22:36:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24183 and previous config saved to /var/cache/conftool/dbconfig/20220406-223603-marostegui.json [22:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:06] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [22:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [22:42:50] !log parse2006, parse2005 - rebooting [22:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:13] PROBLEM - Host parse2005 is DOWN: PING CRITICAL - Packet loss = 100% [22:45:13] RECOVERY - Host parse2005 is UP: PING OK - Packet loss = 0%, RTA = 31.70 ms [22:45:35] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:45:39] PROBLEM - Host parse2006 is DOWN: PING CRITICAL - Packet loss = 100% [22:45:53] RECOVERY - Host parse2006 is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms 
[22:46:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24184 and previous config saved to /var/cache/conftool/dbconfig/20220406-224605-marostegui.json [22:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:50] !log parse2004, parse2003 - rebooting [22:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P24185 and previous config saved to /var/cache/conftool/dbconfig/20220406-225108-marostegui.json [22:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:53] (03PS4) 10Krinkle: static.php: Fold "current" handling into "nohash" and extend TTL to 1y [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771357 (https://phabricator.wikimedia.org/T302465) [22:56:26] (03PS5) 10Krinkle: static.php: Fold "current" handling into "nohash" and extend TTL to 1y [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771357 (https://phabricator.wikimedia.org/T302465) [22:58:38] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T300775)', diff saved to https://phabricator.wikimedia.org/P24186 and previous config saved to /var/cache/conftool/dbconfig/20220406-230110-marostegui.json [23:01:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance [23:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1163.eqiad.wmnet with 
reason: Maintenance
[23:01:15] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[23:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T300775)', diff saved to https://phabricator.wikimedia.org/P24187 and previous config saved to /var/cache/conftool/dbconfig/20220406-230118-marostegui.json
[23:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:45] (CR) Krinkle: [C: +2] static.php: Fold "current" handling into "nohash" and extend TTL to 1y [mediawiki-config] - https://gerrit.wikimedia.org/r/771357 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle)
[23:03:24] (Merged) jenkins-bot: static.php: Fold "current" handling into "nohash" and extend TTL to 1y [mediawiki-config] - https://gerrit.wikimedia.org/r/771357 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle)
[23:06:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P24188 and previous config saved to /var/cache/conftool/dbconfig/20220406-230613-marostegui.json
[23:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:06:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:10] !log krinkle@deploy1002 Synchronized w/static: I5a05f4728 (duration: 00m 54s)
[23:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:34] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:13:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:13:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:02] (PS1) Cwhite: logstash: add target index validation step [puppet] - https://gerrit.wikimedia.org/r/777891 (https://phabricator.wikimedia.org/T305175)
[23:19:46] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:21:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T297189)', diff saved to https://phabricator.wikimedia.org/P24189 and previous config saved to /var/cache/conftool/dbconfig/20220406-232118-marostegui.json
[23:21:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[23:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[23:21:22] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[23:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T297189)', diff saved to https://phabricator.wikimedia.org/P24190 and previous config saved to /var/cache/conftool/dbconfig/20220406-232126-marostegui.json
[23:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:58] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:31:31] (PS1) Krinkle: static.php: Remove peeking at current-wiki $IP [mediawiki-config] - https://gerrit.wikimedia.org/r/777893 (https://phabricator.wikimedia.org/T302465)
[23:35:38] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:38:05] (CR) Andrew Bogott: [C: +1] "The PCC isn't good at testing patch application, so merging and checking in codfw1dev is the best plan." [puppet] - https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: Vivian Rook)
[23:39:00] (JobUnavailable) firing: Reduced availability for job k8s-pods in k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:44:56] (CR) Krinkle: [C: +2] static.php: Remove peeking at current-wiki $IP (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/777893 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle)
[23:45:34] (Merged) jenkins-bot: static.php: Remove peeking at current-wiki $IP [mediawiki-config] - https://gerrit.wikimedia.org/r/777893 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle)
[23:46:04] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:47:46] !log krinkle@deploy1002 Synchronized w/static.php: Ic87a8a3d00db (duration: 00m 53s)
[23:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:51:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:08] RECOVERY - Check systemd state on mw1308 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state