[00:00:04] RoanKattouw and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T0000).
[00:00:04] No Gerrit patches in the queue for this window AFAICS.
[00:06:57] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:08:57] SRE, Infrastructure-Foundations, Mail, Sustainability (Incident Followup): Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (Dzahn) re: making current kernel version persistent The one running now was selected in grub but wasn't the default selection. Either edit gru...
[00:10:18] !log end codfw opensearch upgrade T288621
[00:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:24] T288621: Logs and events produced by the WMF are consumed using the Elastic Common Schema by OpenSearch - https://phabricator.wikimedia.org/T288621
[00:20:48] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/743352 (owner: Filippo Giunchedi)
[00:21:54] (CR) Cwhite: [C: +1] prometheus: bump logging level for blackbox-exporter [puppet] - https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[00:24:43] (CR) Cwhite: "LGTM, but not sure if you mean to include the commented out tests file" [alerts] - https://gerrit.wikimedia.org/r/743394 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[00:25:05] (CR) Cwhite: [C: +1] prometheus: remove textfile stale alert [puppet] - https://gerrit.wikimedia.org/r/743395 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[00:26:39] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/743921 (owner: Filippo Giunchedi)
[00:26:53] (CR) Cwhite: [C: +1] service: add
public alias for grafana-rw [puppet] - https://gerrit.wikimedia.org/r/743922 (owner: Filippo Giunchedi)
[00:27:51] (CR) Cwhite: [C: +1] wmflib: add 'probes' to service::catalog type [puppet] - https://gerrit.wikimedia.org/r/743358 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[00:50:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:30:26] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (Papaul) This was shipped today.
[01:30:35] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:20] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (Papaul)
[01:36:53] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:53] (PS1) TrainBranchBot: Branch commit for wmf/1.38.0-wmf.12 [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/744113
[02:06:55] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.38.0-wmf.12 [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/744113 (owner: TrainBranchBot)
[02:27:59] (Merged) jenkins-bot: Branch commit for wmf/1.38.0-wmf.12 [core] (wmf/1.38.0-wmf.12) -
https://gerrit.wikimedia.org/r/744113 (owner: TrainBranchBot)
[02:51:03] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:57:31] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T0300)
[03:38:11] PROBLEM - snapshot of s6 in codfw on alert1001 is CRITICAL: snapshot for s6 at codfw taken more than 3 days ago: Most recent backup 2021-12-04 03:31:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[05:07:21] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:32:09] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:35] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:45:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1100.eqiad.wmnet with reason: Maintenance T277354
[05:45:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1100.eqiad.wmnet with reason: Maintenance T277354
[05:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:06] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[05:45:07] !log
marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T277354)', diff saved to https://phabricator.wikimedia.org/P18031 and previous config saved to /var/cache/conftool/dbconfig/20211207-054506-marostegui.json
[05:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1100 (T277354)', diff saved to https://phabricator.wikimedia.org/P18032 and previous config saved to /var/cache/conftool/dbconfig/20211207-054625-marostegui.json
[05:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:25] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:58:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2074 and db2130 T296930', diff saved to https://phabricator.wikimedia.org/P18033 and previous config saved to /var/cache/conftool/dbconfig/20211207-055808-marostegui.json
[05:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:13] T296930: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930
[06:01:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1100', diff saved to https://phabricator.wikimedia.org/P18034 and previous config saved to /var/cache/conftool/dbconfig/20211207-060130-marostegui.json
[06:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:25] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:08:31] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:09:03] PROBLEM -
Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:08] !log Apply SET GLOBAL innodb_checksum_algorithm=full_crc32; on db1107 T287244
[06:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:13] T287244: Considering switching innodb_checksum_algorithm=full_crc32 - https://phabricator.wikimedia.org/T287244
[06:16:32] (PS1) Marostegui: Revert "install_server: Reimage db1125 deleting /srv" [puppet] - https://gerrit.wikimedia.org/r/743942
[06:16:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1100', diff saved to https://phabricator.wikimedia.org/P18035 and previous config saved to /var/cache/conftool/dbconfig/20211207-061635-marostegui.json
[06:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:49] (CR) Marostegui: [C: +2] Revert "install_server: Reimage db1125 deleting /srv" [puppet] - https://gerrit.wikimedia.org/r/743942 (owner: Marostegui)
[06:22:07] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[06:31:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1100 (T277354)', diff saved to https://phabricator.wikimedia.org/P18036 and previous config saved to /var/cache/conftool/dbconfig/20211207-063140-marostegui.json
[06:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:45] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[06:32:57] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the
threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[06:35:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance T277354
[06:35:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance T277354
[06:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1105.eqiad.wmnet with reason: Maintenance T277354
[06:36:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1105.eqiad.wmnet with reason: Maintenance T277354
[06:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18037 and previous config saved to /var/cache/conftool/dbconfig/20211207-063621-marostegui.json
[06:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18038 and previous config saved to /var/cache/conftool/dbconfig/20211207-063756-marostegui.json
[06:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:01] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[06:53:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after
maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P18039 and previous config saved to /var/cache/conftool/dbconfig/20211207-065301-marostegui.json
[06:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:53] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:08:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P18040 and previous config saved to /var/cache/conftool/dbconfig/20211207-070806-marostegui.json
[07:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:29] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:03] !log power off db2074, db2078, db2101, db2130, dbproxy2004 T296930
[07:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:08] T296930: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930
[07:20:42] SRE, ops-codfw, DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (Marostegui) All hosts are now down and powered off. @Papaul you can proceed as needed. @Kormat I have upgraded mysql on all hosts, so please run `mysql_upgrade` once you bring them back up (some of th...
[07:21:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-misc site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:22:35] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:22:49] ^ this is known
[07:23:11] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18041 and previous config saved to /var/cache/conftool/dbconfig/20211207-072311-marostegui.json
[07:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 8 hosts with reason: Maintenance T277354
[07:23:16] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[07:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 8 hosts with reason: Maintenance T277354
[07:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:45] ACKNOWLEDGEMENT - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui https://phabricator.wikimedia.org/T296930 https://wikitech.wikimedia.org/wiki/HAProxy
[07:29:41] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:32:46] !log marostegui@cumin1001 START -
Cookbook sre.hosts.downtime for 4:00:00 on db1162.eqiad.wmnet with reason: Maintenance T277354
[07:32:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1162.eqiad.wmnet with reason: Maintenance T277354
[07:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:50] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[07:32:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T277354)', diff saved to https://phabricator.wikimedia.org/P18042 and previous config saved to /var/cache/conftool/dbconfig/20211207-073252-marostegui.json
[07:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:00] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:05] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:34:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T277354)', diff saved to https://phabricator.wikimedia.org/P18043 and previous config saved to /var/cache/conftool/dbconfig/20211207-073413-marostegui.json
[07:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:39] jouncebot: nowandnext
[07:36:39] No deployments scheduled for the next 4 hour(s) and 23 minute(s)
[07:36:39] In 4 hour(s) and 23 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T1200)
[07:36:46] (PS3) Urbanecm: Deploy Growth mentor dashboard to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/743602
(https://phabricator.wikimedia.org/T278920)
[07:36:54] (CR) Urbanecm: [C: +2] Deploy Growth mentor dashboard to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/743602 (https://phabricator.wikimedia.org/T278920) (owner: Urbanecm)
[07:37:31] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:40] (Merged) jenkins-bot: Deploy Growth mentor dashboard to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/743602 (https://phabricator.wikimedia.org/T278920) (owner: Urbanecm)
[07:39:34] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 2178202b86acd50b713d939c4bcfedf7d2fa93e7: Deploy Growth mentor dashboard to all wikis (T278920) (duration: 00m 58s)
[07:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:38] * urbanecm done
[07:39:39] T278920: Mentor dashboard: V1 desktop - https://phabricator.wikimedia.org/T278920
[07:43:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P18044 and previous config saved to /var/cache/conftool/dbconfig/20211207-074919-marostegui.json
[07:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:41] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:54:42] (CR) Muehlenhoff: [C: +2] Add current OS upgrade estimation for restbase/sessionstore [puppet] - https://gerrit.wikimedia.org/r/744046 (owner: Muehlenhoff)
[07:56:09] (CR) RhinosF1: Add current OS upgrade estimation for restbase/sessionstore (1 comment) [puppet] - https://gerrit.wikimedia.org/r/744046 (owner: Muehlenhoff)
[07:56:30] moritzm: you've got target-q twice
[07:56:35] Check line above your change
[07:57:54] RhinosF1: good catch, thanks :-)
[07:58:06] Np
[08:00:01] (PS1) Muehlenhoff: stretch.yaml: Fix duplicated line [puppet] - https://gerrit.wikimedia.org/r/744755
[08:00:45] (CR) RhinosF1: [C: +1] stretch.yaml: Fix duplicated line [puppet] - https://gerrit.wikimedia.org/r/744755 (owner: Muehlenhoff)
[08:03:42] (CR) RhinosF1: Add current OS upgrade estimation for restbase/sessionstore (1 comment) [puppet] - https://gerrit.wikimedia.org/r/744046 (owner: Muehlenhoff)
[08:04:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P18045 and previous config saved to /var/cache/conftool/dbconfig/20211207-080424-marostegui.json
[08:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:35] SRE, Infrastructure-Foundations, netops, Patch-For-Review, Sustainability (Incident Followup): Use
next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (ayounsi) As a general note we need to be careful with rolling out config fixes in reaction to unexpected issues. Even...
[08:05:00] SRE, Infrastructure-Foundations, netops, Patch-For-Review, Sustainability (Incident Followup): Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (ayounsi) Open→In progress
[08:05:03] (CR) Filippo Giunchedi: [C: +2] prometheus: change blackbox jobs to use / as separator [puppet] - https://gerrit.wikimedia.org/r/743352 (owner: Filippo Giunchedi)
[08:05:23] (PS3) Filippo Giunchedi: prometheus: change blackbox jobs to use / as separator [puppet] - https://gerrit.wikimedia.org/r/743352
[08:07:18] (CR) Muehlenhoff: [C: +2] stretch.yaml: Fix duplicated line [puppet] - https://gerrit.wikimedia.org/r/744755 (owner: Muehlenhoff)
[08:07:38] SRE, Infrastructure-Foundations, Mail, Patch-Needs-Improvement, User-herron: Outdated TLS config for MXes - https://phabricator.wikimedia.org/T203260 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This has been resolved with the update of the mail servers to Bullseye in the...
[08:12:52] (CR) Filippo Giunchedi: [C: +2] team-sre: port node-exporter textfile stale alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/743394 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[08:13:04] (CR) Filippo Giunchedi: [C: +2] prometheus: remove textfile stale alert [puppet] - https://gerrit.wikimedia.org/r/743395 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[08:14:30] SRE, Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (MoritzMuehlenhoff)
[08:18:29] (CR) Filippo Giunchedi: [C: +2] service: add public_aliases list [puppet] - https://gerrit.wikimedia.org/r/743921 (owner: Filippo Giunchedi)
[08:18:35] (PS2) Filippo Giunchedi: service: add public_aliases list [puppet] - https://gerrit.wikimedia.org/r/743921
[08:19:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T277354)', diff saved to https://phabricator.wikimedia.org/P18046 and previous config saved to /var/cache/conftool/dbconfig/20211207-081928-marostegui.json
[08:19:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1129.eqiad.wmnet with reason: Maintenance T277354
[08:19:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1129.eqiad.wmnet with reason: Maintenance T277354
[08:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:34] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[08:19:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T277354)', diff saved to https://phabricator.wikimedia.org/P18047 and previous config saved to /var/cache/conftool/dbconfig/20211207-081936-marostegui.json
[08:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:41] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:50] SRE, Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (MoritzMuehlenhoff)
[08:21:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T277354)', diff saved to https://phabricator.wikimedia.org/P18048 and previous config saved to /var/cache/conftool/dbconfig/20211207-082059-marostegui.json
[08:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:06] (PS2) Filippo Giunchedi: service: add public alias for grafana-rw [puppet] - https://gerrit.wikimedia.org/r/743922
[08:26:50] (CR) Filippo Giunchedi: [C: +2] service: add public alias for grafana-rw [puppet] - https://gerrit.wikimedia.org/r/743922 (owner: Filippo Giunchedi)
[08:36:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P18049 and previous config saved to /var/cache/conftool/dbconfig/20211207-083604-marostegui.json
[08:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:27] (CR) Filippo Giunchedi: [C: +2] wmflib: add 'probes' to service::catalog type [puppet] - https://gerrit.wikimedia.org/r/743358 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[08:38:32] (PS3) Filippo Giunchedi: wmflib: add 'probes' to service::catalog type [puppet] - https://gerrit.wikimedia.org/r/743358 (https://phabricator.wikimedia.org/T291946)
[08:41:39] SRE, Infrastructure-Foundations, netops: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (ayounsi)
[08:41:47] SRE, Infrastructure-Foundations, netops: Create an alert for output discards on network devices - https://phabricator.wikimedia.org/T284593 (ayounsi) In
progress→Resolved This is now set to alert to NOC through alertmanager. Added a quick mention in https://wikitech.wikimedia.org/wiki/Networ...
[08:45:16] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2016.codfw.wmnet with OS buster
[08:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:21] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2016.codfw.wmnet with OS buster executed with errors: - ganeti2016 (**FAIL**) - Downtimed...
[08:47:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2016.codfw.wmnet with OS buster
[08:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:10] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2016.codfw.wmnet with OS buster
[08:51:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P18050 and previous config saved to /var/cache/conftool/dbconfig/20211207-085108-marostegui.json
[08:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:43] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:55:50] !log draining primary/secondary instances off ganeti2013 T296622
[08:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:54] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622
[09:06:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after
maintenance db1129 (T277354)', diff saved to https://phabricator.wikimedia.org/P18051 and previous config saved to /var/cache/conftool/dbconfig/20211207-090613-marostegui.json
[09:06:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T277354
[09:06:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T277354
[09:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:20] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[09:06:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18052 and previous config saved to /var/cache/conftool/dbconfig/20211207-090620-marostegui.json
[09:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18053 and previous config saved to /var/cache/conftool/dbconfig/20211207-090758-marostegui.json
[09:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:38] PROBLEM - Host mr1-drmrs is DOWN: CRITICAL - Time to live exceeded (185.15.58.130)
[09:22:20] RECOVERY - Host mr1-drmrs is UP: PING OK - Packet loss = 0%, RTA = 85.60 ms
[09:23:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P18054 and previous config saved to /var/cache/conftool/dbconfig/20211207-092302-marostegui.json
[09:23:05] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2016.codfw.wmnet with OS buster
[09:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:25] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2016.codfw.wmnet with OS buster completed: - ganeti2016 (**WARN**) - Removed from Puppet...
[09:26:51] (PS1) Majavah: discovery: switchover doc to doc1002 [dns] - https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653)
[09:27:03] (PS1) Majavah: hieradata: switchover doc to doc1002 [puppet] - https://gerrit.wikimedia.org/r/744763 (https://phabricator.wikimedia.org/T247653)
[09:27:49] !log move all VRRP primary to cr2-codfw - https://phabricator.wikimedia.org/T289241
[09:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2012.codfw.wmnet with OS buster
[09:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:20] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2012.codfw.wmnet with OS buster
[09:30:41] SRE, ops-eqiad, cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (Volans) >>! In T296906#7550529, @Dzahn wrote: > Try if the server can talk http to apt1001.wikimedia.org / apt2001.wikimedia.org. > > After getting an IP from DHCP but...
[09:31:58] !log cr1-codfw - FPC 1 PIC 0 Need bounce - T289241 [09:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:16] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10Volans) >>! In T296906#7550913, @cmooney wrote: > Looking at the packet captures either side (install1003 and cloudvirt1028) they packets are they same. I realise, how... [09:33:15] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change: Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Marostegui) m5 hosts downtimed for 2h. Reminder: db2078 is down due to T296930, the schema change will arrive there via replication once... [09:33:33] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change: Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Marostegui) a:03Marostegui [09:34:17] !log move all VRRP primary to cr1-codfw - T289241 [09:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:06] !log cr2-codfw - FPC 1 PIC 1 Need bounce - T289241 [09:38:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P18055 and previous config saved to /var/cache/conftool/dbconfig/20211207-093807-marostegui.json [09:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:18] !log codfw, normalize VRRP - T289241 [09:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:42] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: icinga Blocked by X-Frame-Options Policy - https://phabricator.wikimedia.org/T251513 (10jbond) 05Open→03Resolved a:03jbond Going to resolve this this as the current fix seems to 
iliviate the majority of the pain points and proivng further fixs dosn;t feel... [09:47:00] 10SRE, 10Infrastructure-Foundations, 10observability, 10CAS-SSO, 10User-jbond: Icinga Monitoring for CAS - https://phabricator.wikimedia.org/T233935 (10jbond) 05In progress→03Resolved We currently monitor the tomcat process and further have monitoring for now this is adequate [09:47:06] 10SRE, 10Infrastructure-Foundations, 10Security-Team, 10CAS-SSO, 10User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (10jbond) [09:49:29] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change: Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Marostegui) Current table schema: ` CREATE TABLE `mailinglist` ( `id` int(11) NOT NULL AUTO_INCREMENT, `list_name` varchar(255) CHARA... [09:53:07] (03CR) 10David Caro: [C: 03+1] prometheus: bump logging level for blackbox-exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:53:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18056 and previous config saved to /var/cache/conftool/dbconfig/20211207-095312-marostegui.json [09:53:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance T277354 [09:53:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance T277354 [09:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:17] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [09:53:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T277354)', diff saved to 
https://phabricator.wikimedia.org/P18057 and previous config saved to /var/cache/conftool/dbconfig/20211207-095319-marostegui.json [09:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18058 and previous config saved to /var/cache/conftool/dbconfig/20211207-095456-marostegui.json [09:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:46] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:57:46] (03PS1) 10Giuseppe Lavagetto: mediawiki: workaround the sadness of php/rsyslog interactions [deployment-charts] - 10https://gerrit.wikimedia.org/r/744764 [09:59:42] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:00:22] Amir1: so you setting up mailman in maintenance mode? 
[10:00:34] yup [10:00:47] ok, let me know when done so I can deploy the change [10:00:55] let me know when I need to do hit the button [10:01:02] !log Deploy schema change on mailman (m5) T286552 [10:01:04] Amir1: go for it [10:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:06] T286552: Mailman3 schema change: Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 [10:01:13] done [10:01:15] go [10:01:22] deployed [10:01:43] back up [10:02:21] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change: Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Marostegui) ` root@db1132.eqiad.wmnet[mailman3]> ALTER TABLE mailinglist MODIFY autoresponse_owner_text TEXT COLLATE utf8mb4_bin NULL; AL... [10:02:25] looks okay [10:02:32] great! [10:03:38] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Internal server error (with ugly html tags) when changing Autoresponse postings text - https://phabricator.wikimedia.org/T286269 (10Marostegui) [10:03:48] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change: Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Marostegui) 05Open→03Resolved All done! [10:05:59] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) >>! In T272559#7546852, @Dzahn wrote: > icinga::nsca::client is used in fundraising. so there are special cases that can be in use but this audit scri... [10:06:36] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Internal server error (with ugly html tags) when changing Autoresponse postings text - https://phabricator.wikimedia.org/T286269 (10Ladsgroup) 05Open→03Resolved Fixed now. [10:08:21] (03PS1) 10Kormat: Drop py35 support, and various cfg cleanups. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744767 [10:10:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P18059 and previous config saved to /var/cache/conftool/dbconfig/20211207-101001-marostegui.json [10:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2012.codfw.wmnet with OS buster [10:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:09] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2012.codfw.wmnet with OS buster completed: - ganeti2012 (**PASS**) - Downtimed on Icinga... [10:12:58] 10SRE, 10Traffic: Upgrade pybal-test200[23] from Stretch to Buster - https://phabricator.wikimedia.org/T297187 (10ema) [10:13:26] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti2013.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [10:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti2013.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [10:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:46] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Followup): Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) > To be clear, I agree that your proposal is a good solution however I'm wondering what's most future-proof.... 
[10:17:50] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) [10:18:01] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) One more; ganeti2013. Ready to be powered off any time. [10:18:04] (03CR) 10Ladsgroup: [C: 03+2] noc: Make colors consistent with WikimediaUI style guide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742443 (owner: 10Ladsgroup) [10:18:47] (03Merged) 10jenkins-bot: noc: Make colors consistent with WikimediaUI style guide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742443 (owner: 10Ladsgroup) [10:21:21] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:21:34] (03PS1) 10Jelto: gitlab: disable restore timer to perform upgrade [puppet] - 10https://gerrit.wikimedia.org/r/744768 (https://phabricator.wikimedia.org/T297183) [10:22:00] (03CR) 10David Caro: "I'll remove my vote as probably someone that's directly affected by these alerts should +1 instead xd" [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:22:20] (03CR) 10jerkins-bot: [V: 04-1] gitlab: disable restore timer to perform upgrade [puppet] - 10https://gerrit.wikimedia.org/r/744768 (https://phabricator.wikimedia.org/T297183) (owner: 10Jelto) [10:23:12] (03PS2) 10Jelto: gitlab: disable restore timer to perform upgrade [puppet] - 10https://gerrit.wikimedia.org/r/744768 (https://phabricator.wikimedia.org/T297183) [10:24:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P18060 and previous config saved to /var/cache/conftool/dbconfig/20211207-102505-marostegui.json [10:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet [10:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:35] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32837/console" [puppet] - 10https://gerrit.wikimedia.org/r/744768 (https://phabricator.wikimedia.org/T297183) (owner: 10Jelto) [10:27:26] (03CR) 10David Caro: "LGTM, I'll leave for someone else to do the +1 though." 
[puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:27:30] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: disable restore timer to perform upgrade [puppet] - 10https://gerrit.wikimedia.org/r/744768 (https://phabricator.wikimedia.org/T297183) (owner: 10Jelto) [10:28:53] (03CR) 10Filippo Giunchedi: prometheus: bump logging level for blackbox-exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:29:40] dcaro: thanks for the reviews ^ if you have time I'd like your input on https://gerrit.wikimedia.org/r/c/operations/puppet/+/743359 too (also that's going on cloudmetrics hosts as well) [10:32:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet [10:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:21] (03CR) 10David Caro: "Some question, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:33:44] godog: /me looking [10:35:10] dcaro: cheers, appreciate it [10:36:19] godog: quick question, what do the comments with XXX mean? Todo? 
[10:36:34] dcaro: lol yes, they do [10:36:49] 👍 [10:36:52] I'll switch to TODO in the future, much clearer [10:39:05] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [10:40:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18061 and previous config saved to /var/cache/conftool/dbconfig/20211207-104010-marostegui.json [10:40:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1182.eqiad.wmnet with reason: Maintenance T277354 [10:40:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1182.eqiad.wmnet with reason: Maintenance T277354 [10:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:15] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [10:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T277354)', diff saved to https://phabricator.wikimedia.org/P18062 and previous config saved to /var/cache/conftool/dbconfig/20211207-104018-marostegui.json [10:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T277354)', diff saved to https://phabricator.wikimedia.org/P18063 and previous config saved to /var/cache/conftool/dbconfig/20211207-104153-marostegui.json [10:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:20] (03CR) 10Filippo Giunchedi: prometheus: add alerts for network probes (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:51:36] (03PS2) 10Kormat: Various cfg cleanups. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744767 [10:52:22] 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10MoritzMuehlenhoff) [10:55:23] (03PS10) 10Jbond: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [10:55:56] (03PS17) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [10:56:09] (03PS3) 10Jbond: C:ldap::client::utils: Update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/743387 (https://phabricator.wikimedia.org/T247364) [10:56:45] (03PS18) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [10:56:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P18064 and previous config saved to /var/cache/conftool/dbconfig/20211207-105658-marostegui.json [10:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:33] (03PS6) 10Jbond: P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366 [11:01:18] (03PS11) 10Jbond: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [11:02:11] (03CR) 10jerkins-bot: [V: 04-1] P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366 (owner: 10Jbond) [11:03:04] (03CR) 10jerkins-bot: [V: 04-1] Switch profile::openldap::management to use profile::openldap::client [puppet] - 
10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [11:03:58] (03PS7) 10Jbond: P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366 [11:04:23] (03PS8) 10Jbond: P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366 [11:05:06] (03PS9) 10Jbond: P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366 [11:05:24] (03PS12) 10Jbond: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [11:06:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet [11:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32839/console" [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [11:07:42] (03PS19) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [11:08:08] (03PS4) 10Jbond: C:ldap::client::utils: Update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/743387 (https://phabricator.wikimedia.org/T247364) [11:09:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32840/console" [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [11:10:31] (03CR) 10David Caro: "One comment about a file->exec relationship, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:10:35] (03PS8) 10Majavah: openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 
(https://phabricator.wikimedia.org/T295247) [11:10:37] (03PS2) 10Majavah: openstack: enc: properly fail on server error [puppet] - 10https://gerrit.wikimedia.org/r/742424 [11:11:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet [11:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P18065 and previous config saved to /var/cache/conftool/dbconfig/20211207-111203-marostegui.json [11:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:25] (03PS13) 10Jbond: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [11:13:31] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Sustainability (Incident Followup): Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (10MoritzMuehlenhoff) >>! In T297128#7551879, @Dzahn wrote: > re: making current kernel version persistent > > The one running now was selected i... 
[11:13:52] (03PS20) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) [11:14:03] (03PS5) 10Jbond: C:ldap::client::utils: Update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/743387 (https://phabricator.wikimedia.org/T247364) [11:15:19] (03CR) 10Jbond: P:openldap::client: Add ldap::client::utils (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/743366 (owner: 10Jbond) [11:15:21] (03CR) 10Jbond: [C: 03+2] P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366 (owner: 10Jbond) [11:15:24] (03CR) 10Jbond: [C: 03+2] Switch profile::openldap::management to use profile::openldap::client (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [11:15:32] (03CR) 10Jbond: [C: 03+2] modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond) [11:15:43] (03CR) 10Jbond: [C: 03+2] C:ldap::client::utils: Update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/743387 (https://phabricator.wikimedia.org/T247364) (owner: 10Jbond) [11:19:04] (03CR) 10Btullis: [C: 03+2] Refactor superset caching to enable dual caches [puppet] - 10https://gerrit.wikimedia.org/r/743386 (https://phabricator.wikimedia.org/T295295) (owner: 10Btullis) [11:19:58] btullis: you happy for me to merge ^^ [11:20:05] (03PS2) 10Majavah: acme_chief: add -rw to ldap certs [puppet] - 10https://gerrit.wikimedia.org/r/739283 (https://phabricator.wikimedia.org/T295150) [11:21:00] (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/739283 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [11:21:13] (03PS2) 10Giuseppe Lavagetto: mediawiki: workaround the sadness of php/rsyslog interactions [deployment-charts] - 10https://gerrit.wikimedia.org/r/744764 [11:21:37] majavah: see above will 
merge, once btul.lis confirms [11:21:43] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [11:24:33] thx [11:26:15] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2004.codfw.wmnet with reason: switch to drbd storage [11:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2004.codfw.wmnet with reason: switch to drbd storage [11:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T277354)', diff saved to https://phabricator.wikimedia.org/P18066 and previous config saved to /var/cache/conftool/dbconfig/20211207-112707-marostegui.json [11:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:12] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [11:28:22] btullis: See above is it ok to merge your change [11:29:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155-1156].eqiad.wmnet with reason: Maintenance T277354 [11:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155-1156].eqiad.wmnet with reason: Maintenance T277354 [11:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T277354)', diff saved to https://phabricator.wikimedia.org/P18067 and previous config saved to /var/cache/conftool/dbconfig/20211207-113005-marostegui.json 
[11:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T277354)', diff saved to https://phabricator.wikimedia.org/P18068 and previous config saved to /var/cache/conftool/dbconfig/20211207-113140-marostegui.json [11:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:47] !log removing IP addressing on cloudvirt1028 manually and forcing DHCP to debug reimage failure (T296906) [11:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:51] T296906: reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 [11:32:28] !log cmooney@cumin1001 START - Cookbook sre.hosts.dhcp for host cloudvirt1028.eqiad.wmnet [11:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:11] 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10fgiunchedi) [11:35:37] 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10fgiunchedi) I chatted with @MoritzMuehlenhoff re: the rollback, apt won't let you remove a running kernel though there's a way to ask `grub` to reboot into another menu entry (the second entry... 
[11:37:54] majavah: you change is merged [11:38:07] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudvirt1028.eqiad.wmnet [11:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P18069 and previous config saved to /var/cache/conftool/dbconfig/20211207-114645-marostegui.json [11:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:51] !log draining primary/secondary instances off ganeti2014 T296622 [11:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:56] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T1200). [12:00:05] MatmaRex: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:10] o/ [12:00:25] hi [12:01:04] I can deploy today :) [12:01:25] ooh, reply tool \o/ [12:01:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P18070 and previous config saved to /var/cache/conftool/dbconfig/20211207-120150-marostegui.json [12:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:56] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable reply tool by default on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744043 (https://phabricator.wikimedia.org/T296444) (owner: 10Esanders) [12:04:00] (03PS2) 10Lucas Werkmeister (WMDE): Enable reply tool by default on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744043 (https://phabricator.wikimedia.org/T296444) (owner: 10Esanders) [12:05:19] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable reply tool by default on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744043 (https://phabricator.wikimedia.org/T296444) (owner: 10Esanders) [12:06:06] (03Merged) 10jenkins-bot: Enable reply tool by default on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744043 (https://phabricator.wikimedia.org/T296444) (owner: 10Esanders) [12:06:26] MatmaRex: the change is on mwdebug1001, please test! [12:06:45] looking [12:06:46] (I’m not sure how to test it myself tbh, since the two random talk pages I looked at used Flow ^^) [12:07:05] yeah, you need to find or create a non-flow one [12:07:29] e.g. https://www.mediawiki.org/wiki/Talk:Talk_pages_project/Usability - seems good on this page :) [12:07:40] Lucas_WMDE: seems good [12:07:58] ack, looks good here too [12:08:06] why is https://www.mediawiki.org/wiki/Talk:Talk_pages_project still a Flow board? :-) [12:08:25] 🤷‍♂️ [12:09:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:11] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:744043|Enable reply tool by default on mediawikiwiki (T296444)]] (duration: 00m 57s) [12:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:15] T296444: Config change: Deploy Reply Tool as opt-out preference at mediawiki.org - https://phabricator.wikimedia.org/T296444 [12:09:32] thanks [12:10:06] np [12:10:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:14] anything else to deploy? [12:10:36] I might deploy a service update in a few minutes (after testing it some more locally first), not sure if I should consider that part of the window ^^ [12:13:10] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T277354)', diff saved to https://phabricator.wikimedia.org/P18071 and previous config saved to /var/cache/conftool/dbconfig/20211207-121655-marostegui.json [12:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:00] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [12:21:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:04] (03PS2) 10Lucas Werkmeister (WMDE): Update termbox to 2021-12-06-171243-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/744071 
(https://phabricator.wikimedia.org/T297006) [12:21:15] ^ I’ll start deploying this in a moment [12:22:06] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "I’ve tested locally that the older Termbox pin of Wikibase as of 1.38.0-wmf.9 is compatible with the newer SSR, both without and with Java" [deployment-charts] - 10https://gerrit.wikimedia.org/r/744071 (https://phabricator.wikimedia.org/T297006) (owner: 10Lucas Werkmeister (WMDE)) [12:22:24] flawless message cutoff [12:23:36] (03PS42) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 [12:23:58] (03PS2) 10Btullis: Remove the HDFS corrupt blocks check from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/732922 (https://phabricator.wikimedia.org/T293399) [12:24:12] !log merge refactor of monitoring classes 725045 [12:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:54] (03CR) 10Jbond: [C: 03+2] monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond) [12:25:18] (03CR) 10Michael Große: [C: 03+1] Update termbox to 2021-12-06-171243-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/744071 (https://phabricator.wikimedia.org/T297006) (owner: 10Lucas Werkmeister (WMDE)) [12:25:52] (03Merged) 10jenkins-bot: Update termbox to 2021-12-06-171243-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/744071 (https://phabricator.wikimedia.org/T297006) (owner: 10Lucas Werkmeister (WMDE)) [12:26:08] (03CR) 10Btullis: [C: 03+2] Remove the HDFS corrupt blocks check from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/732922 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [12:26:58] btullis: keep clashing today, you happy for me to merge [12:27:10] jbond: Yes please :-) [12:27:27] merged [12:27:40] Many thanks. 
[12:28:08] deploy1002 /srv/deployment-charts has uncommitted changes (mwdebug/values-eqiad) :/ [12:28:13] I assume I can deploy termbox anyways [12:28:18] but does anyone know who’s responsible for those? [12:31:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:52] hmm, there’s more diff than I expected in the `helmfile -e staging -i apply` for termbox [12:32:05] the chart in a bunch of places changes from termbox-0.0.19 to termbox-0.0.20 [12:32:55] looks like that comes from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/742166 [12:33:07] jelto: are you around? [12:34:57] * Lucas_WMDE looks through the Dec 1 SAL [12:35:25] (03PS1) 10Jbond: Revert "monitoring: refactor class" [puppet] - 10https://gerrit.wikimedia.org/r/744786 [12:35:39] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "monitoring: refactor class" [puppet] - 10https://gerrit.wikimedia.org/r/744786 (owner: 10Jbond) [12:36:10] or maybe akosiaris can help? [12:36:19] (I didn’t find anything enlightening in the SAL) [12:36:25] (03PS1) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/744787 [12:36:42] (03PS2) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/744787 [12:37:47] (03PS3) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/744787 [12:38:40] Lucas_WMDE: I'm around. I bumped some chart version to fix a minor bug in multiple charts. Apart from label bump 0.0.19 to 0.0.20 there should be any other change [12:38:59] and it’s fine to apply that together with the other change I’m deploying? 
[12:39:03] shouldn't * [12:39:37] (03CR) 10jerkins-bot: [V: 04-1] monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/744787 (owner: 10Jbond) [12:39:39] yes that's fine, apart from two charts this feature was not used by any other chart, so should be noop anyway [12:39:46] ok thanks [12:39:50] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [12:39:50] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [12:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:44] alright, testing on test.wikidata.org [12:41:51] all working as far as I can tell [12:42:07] let’s do codfw and eqiad [12:42:31] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' . [12:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:15] by the way, is there a way to add a custom message to these logs, like with scap? [12:43:18] (e.g. a task ID) [12:44:22] !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' . 
[12:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:08] !log deployed [[gerrit:744071|Update termbox to 2021-12-06-171243-production (T297006)]] [12:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:12] T297006: Migrate Termbox to Node 12 - https://phabricator.wikimedia.org/T297006 [12:46:17] that works, I guess ;) [12:46:24] !log UTC morning backport+config window done [12:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:24] (03PS4) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/744787 [12:47:52] (03PS1) 10Ladsgroup: auto_schema: Stop adding ticket to downtime cookbook [software] - 10https://gerrit.wikimedia.org/r/744778 (https://phabricator.wikimedia.org/T288235) [12:48:30] Lucas_WMDE: I don't think custom messages are supported yet in helmfile apply (apart from logging here manually). Technically it should be quite easy to have an optional value to append to the SAL log. 
Let me think a little bit about that and I may create a low-prio task :) [12:48:38] ok :) [12:49:30] (03CR) 10Filippo Giunchedi: [V: 03+1] prometheus: add blackbox/discovery jobs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [12:49:47] (03PS3) 10Filippo Giunchedi: prometheus: support for blackbox configuration fragments [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) [12:49:49] (03PS2) 10Filippo Giunchedi: prometheus: bump logging level for blackbox-exporter [puppet] - 10https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946) [12:49:51] (03PS7) 10Filippo Giunchedi: prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) [12:49:53] (03PS7) 10Filippo Giunchedi: prometheus: add alerts for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) [12:49:55] (03PS7) 10Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) [12:49:59] sorry about the gerrit spam [12:52:54] (03CR) 10Marostegui: [C: 03+1] auto_schema: Stop adding ticket to downtime cookbook [software] - 10https://gerrit.wikimedia.org/r/744778 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [12:55:09] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:56:42] (03CR) 10Kormat: [C: 03+2] Various cfg cleanups. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744767 (owner: 10Kormat) [12:56:51] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Stop adding ticket to downtime cookbook [software] - 10https://gerrit.wikimedia.org/r/744778 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [13:00:59] (03Merged) 10jenkins-bot: Various cfg cleanups. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744767 (owner: 10Kormat) [13:01:01] (03Merged) 10jenkins-bot: auto_schema: Stop adding ticket to downtime cookbook [software] - 10https://gerrit.wikimedia.org/r/744778 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [13:01:37] (03CR) 10Filippo Giunchedi: "Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:01:52] (03PS4) 10Filippo Giunchedi: prometheus: support for blackbox configuration fragments [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) [13:01:54] (03PS3) 10Filippo Giunchedi: prometheus: bump logging level for blackbox-exporter [puppet] - 10https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946) [13:01:56] (03PS8) 10Filippo Giunchedi: prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) [13:01:58] (03PS8) 10Filippo Giunchedi: prometheus: add alerts for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) [13:02:00] (03PS8) 10Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) [13:04:52] (03PS2) 10Kormat: dbutil: Make testing easier [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029 [13:07:31] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti2014.codfw.wmnet with reason: 
Temporarily remove node from Ganeti for reimage [13:07:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti2014.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [13:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:38] (03CR) 10Kormat: dbutil: Make testing easier (032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029 (owner: 10Kormat) [13:08:47] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) [13:08:59] (03CR) 10Filippo Giunchedi: alertmanager: add inhibit rules for network probes (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:09:03] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) One more; ganeti2014. Ready to be powered off any time. 
[13:09:09] (03PS9) 10Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) [13:12:31] (03CR) 10Kormat: [C: 03+2] dbutil: Make testing easier [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029 (owner: 10Kormat) [13:15:55] (03Merged) 10jenkins-bot: dbutil: Make testing easier [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029 (owner: 10Kormat) [13:16:37] !log update GitLab to 14.4.4-ce.0 [13:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:34] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) > What would be the next steps? Here is a proposal: # [DE, SRE]Agree on the name of the flow :) Will it be `sflow`... [13:23:59] (03PS1) 10Kormat: setup.cfg: Improve dir excludes [software/wmfdb] - 10https://gerrit.wikimedia.org/r/744780 [13:26:06] (03PS1) 10Jelto: Revert "gitlab: disable restore timer to perform upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/744789 [13:26:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2026.codfw.wmnet with OS buster [13:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:45] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin200... 
[13:29:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, PCC link by John https://puppet-compiler.wmflabs.org/compiler1002/32843/" [puppet] - 10https://gerrit.wikimedia.org/r/744787 (owner: 10Jbond) [13:31:04] (03PS2) 10Kormat: setup.cfg: Improve dir excludes, upgrade black [software/wmfdb] - 10https://gerrit.wikimedia.org/r/744780 [13:34:04] (03CR) 10Kormat: [V: 03+2 C: 03+2] setup.cfg: Improve dir excludes, upgrade black [software/wmfdb] - 10https://gerrit.wikimedia.org/r/744780 (owner: 10Kormat) [13:39:02] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32844/console" [puppet] - 10https://gerrit.wikimedia.org/r/744789 (owner: 10Jelto) [13:39:06] !log disable puppet fleet wide to rollout 744787 [13:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:54] !log reboot graphite2003 - T297180 [13:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:58] T297180: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 [13:41:05] (03CR) 10Jelto: [V: 03+1 C: 03+2] Revert "gitlab: disable restore timer to perform upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/744789 (owner: 10Jelto) [13:42:03] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2004.codfw.wmnet with reason: switch to drbd storage [13:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2004.codfw.wmnet with reason: switch to drbd storage [13:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:08] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Followup): Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10ayounsi) Ok, sounds good to me! 
[13:42:14] PROBLEM - Host graphite2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:44:12] RECOVERY - Host graphite2003 is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms [13:44:42] (03CR) 10Jbond: [C: 03+2] monitoring: refactor class (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/744787 (owner: 10Jbond) [13:45:04] (03PS1) 10Ayounsi: Deprecate interface-range external [homer/public] - 10https://gerrit.wikimedia.org/r/744782 (https://phabricator.wikimedia.org/T296935) [13:45:25] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2026.codfw.wmnet with OS buster [13:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:30] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 fo... 
[13:45:43] (03PS2) 10Ayounsi: Deprecate interface-range external [homer/public] - 10https://gerrit.wikimedia.org/r/744782 (https://phabricator.wikimedia.org/T296935) [13:49:05] (03CR) 10Ayounsi: "Example diff on cr2-esams:" [homer/public] - 10https://gerrit.wikimedia.org/r/744782 (https://phabricator.wikimedia.org/T296935) (owner: 10Ayounsi) [13:52:11] !log removing wikiuser@localhost on s6 (T296537) [13:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:19] T296537: Check and fix GRANT issues of wikiuser - https://phabricator.wikimedia.org/T296537 [14:00:38] (03PS1) 10Arturo Borrero Gonzalez: ceph: mgr: fix typo in relationship [puppet] - 10https://gerrit.wikimedia.org/r/744784 (https://phabricator.wikimedia.org/T293752) [14:02:59] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) Sounds good! 1. we can use "internal_flows" (not _netflow as netflow is a protocol). 2. can I start this anytime, or we... 
[14:07:40] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2006.codfw.wmnet with reason: switch to drbd storage [14:07:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2006.codfw.wmnet with reason: switch to drbd storage [14:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:28] PROBLEM - ganeti-confd running on ganeti2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [14:08:48] PROBLEM - ganeti-mond running on ganeti2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [14:09:34] PROBLEM - ganeti-noded running on ganeti2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [14:09:43] ^ expected, silencing [14:11:43] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti[2013-2014].codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [14:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti[2013-2014].codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [14:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:56] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:15:18] !log fixing heartbeat grants for wikiuser across the cluster (T296537) [14:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:23] T296537: Check and fix 
GRANT issues of wikiuser - https://phabricator.wikimedia.org/T296537 [14:21:56] (03PS4) 10Ayounsi: Pmacct add sflow listener [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) [14:28:30] !log reboot graphite1004 - T297180 [14:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:35] T297180: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 [14:29:27] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10Ottomata) > Agree on the name of the flow : Some guidelines: https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelin... [14:30:12] (03PS5) 10Ayounsi: Pmacct add sflow listener [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) [14:30:34] PROBLEM - Host graphite1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:31:30] RECOVERY - Host graphite1004 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [14:31:53] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) >>! In T263277#7552972, @ayounsi wrote: > Sounds good! > 1. we can use "internal_flows" (not _netflow as netflow is... 
[14:32:38] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 96.11% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:32:49] (03CR) 10JMeybohm: imagecatalog: Install and configure OCI image catalog on deploy hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [14:34:36] (03PS1) 10Arturo Borrero Gonzalez: ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) [14:35:31] (03CR) 10jerkins-bot: [V: 04-1] ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [14:36:21] PROBLEM - carbon-cache@d service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@d is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:36:26] PROBLEM - carbon-cache@e service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:36:26] PROBLEM - carbon-local-relay service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-local-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:36:36] PROBLEM - Check systemd state on graphite1004 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:36] PROBLEM - carbon-cache@b service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:36:38] PROBLEM - carbon-cache@g service on graphite1004 is 
CRITICAL: CRITICAL - Expecting active but unit carbon-cache@g is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:36:42] PROBLEM - carbon-frontend-relay service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-frontend-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:36:42] PROBLEM - carbon-cache@c service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:37:00] that's me ^ [14:37:26] PROBLEM - carbon-cache@h service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:37:31] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1003/32846/netflow1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi) [14:37:34] jbond: I've reenabled puppet on graphite1004 btw [14:37:36] PROBLEM - carbon-cache@a service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:37:48] PROBLEM - carbon-cache@f service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@f is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:37:55] 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q2), 10Sustainability (Incident Followup): Alert that should have paged via VictorOps was delayed because of partial networking outage - https://phabricator.wikimedia.org/T294166 (10herron) [14:37:57] godog: ack thanks im re-enabling everywhere now [14:38:08] everything looks good so far [14:38:28] RECOVERY - carbon-cache@d service on graphite1004 is OK: OK - carbon-cache@d is active 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:38:30] jbond: kk [14:38:31] !log renable puppet fleet wide post monitoring refactor 744787 [14:38:32] RECOVERY - carbon-cache@e service on graphite1004 is OK: OK - carbon-cache@e is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:38:34] RECOVERY - carbon-local-relay service on graphite1004 is OK: OK - carbon-local-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:36] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2] "PCC, diff as expected https://puppet-compiler.wmflabs.org/compiler1003/32847/" [puppet] - 10https://gerrit.wikimedia.org/r/744784 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [14:38:44] RECOVERY - Check systemd state on graphite1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:44] RECOVERY - carbon-cache@b service on graphite1004 is OK: OK - carbon-cache@b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:38:46] RECOVERY - carbon-cache@g service on graphite1004 is OK: OK - carbon-cache@g is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:38:50] RECOVERY - carbon-frontend-relay service on graphite1004 is OK: OK - carbon-frontend-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:38:50] RECOVERY - carbon-cache@c service on graphite1004 is OK: OK - carbon-cache@c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:38:58] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] ceph: mgr: fix typo in relationship [puppet] - 10https://gerrit.wikimedia.org/r/744784 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [14:39:34] RECOVERY - carbon-cache@h service on graphite1004 is OK: OK - 
carbon-cache@h is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:39:44] RECOVERY - carbon-cache@a service on graphite1004 is OK: OK - carbon-cache@a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:39:58] RECOVERY - carbon-cache@f service on graphite1004 is OK: OK - carbon-cache@f is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:40:27] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10Ottomata) > can I start this anytime, or we need to create the kafka topic somewhere? Not really needed, unless you need to set s... [14:41:48] 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10fgiunchedi) [14:43:06] 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10MoritzMuehlenhoff) [14:43:18] 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10MoritzMuehlenhoff) [14:43:41] 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10MoritzMuehlenhoff) [14:44:47] (03PS2) 10Majavah: Remove UserMerge rights from labswiki (wikitech) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743659 [14:51:15] (03PS1) 10Btullis: Remove more alerts that have moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/744809 (https://phabricator.wikimedia.org/T293399) [14:58:34] jbond: hi! 
profile::trafficserver::monitoring seems to include profile::monitoring even in cloud (normally profile::base::production includes it in prod only), and now the deployment-prep cache hosts are failing to run puppet due to missing hiera keys [15:01:37] majavah: ack looking now [15:02:07] (03PS1) 10Majavah: ldap: Do not install py2.7 files on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/744810 [15:02:58] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 8.201 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:02:58] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: show if the user is using DoH or DoT [puppet] - 10https://gerrit.wikimedia.org/r/744095 (owner: 10Ssingh) [15:03:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-misc site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:03:22] jbond: also https://gerrit.wikimedia.org/r/c/operations/puppet/+/744810/ fixes a recent change to the ldap module which broke some of our bullseye hosts [15:04:34] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:04:37] majavah: ack looks good will merge that one in a sec thanks <3 [15:05:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [15:05:40] (03PS3) 10Ssingh: dnsdist: refactor the configuration template for updates to durum [puppet] - 10https://gerrit.wikimedia.org/r/744087 [15:06:45] (03CR) 10Ssingh: [C: 03+2] dnsdist: refactor the configuration template for updates to durum [puppet] - 10https://gerrit.wikimedia.org/r/744087 (owner: 10Ssingh) [15:06:46] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 102.8 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:07:04] (03PS1) 10Btullis: Increase the timeout for Druid on the analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/744811 (https://phabricator.wikimedia.org/T297148) [15:07:14] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum1002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:07:29] ^ that's me, checking [15:07:40] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:08:00] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:08:05] ^ related [15:09:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2026.codfw.wmnet with OS buster [15:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:15] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - 
https://phabricator.wikimedia.org/T294377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin200... [15:09:53] (03PS1) 10Jbond: P:monitoring: add defaults for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/744812 [15:10:15] (03CR) 10Jbond: [C: 03+2] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/744810 (owner: 10Majavah) [15:10:28] (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [15:11:02] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) [15:11:43] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32849/console" [puppet] - 10https://gerrit.wikimedia.org/r/744811 (https://phabricator.wikimedia.org/T297148) (owner: 10Btullis) [15:13:43] confirmed that the ldap fix is working [15:13:50] great thanks [15:14:31] (03CR) 10Ssingh: [C: 03+2] wikimedia-dns: refactor for durum update [dns] - 10https://gerrit.wikimedia.org/r/744094 (owner: 10Ssingh) [15:14:57] !log running authdns-update for Gerrit:744094 [15:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:16] ^ complete [15:19:42] PROBLEM - Check systemd state on durum2002 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:20:04] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:20:10] PROBLEM - BGP 
status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:20:16] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum2002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:20:22] ^ this is related to the durum change, I am looking at why it's affecting one these two hosts [15:20:43] (03PS2) 10Jbond: P:monitoring: add defaults for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/744812 [15:20:44] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:20:52] PROBLEM - Bird Internet Routing Daemon on durum2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:21:02] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:16] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:21:37] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on durum2002.codfw.wmnet with reason: debugging bird/anycast-hc issues [15:21:39] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on durum2002.codfw.wmnet with reason: debugging bird/anycast-hc issues [15:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:44] PROBLEM - Bird Internet Routing Daemon on durum5002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird 
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:21:49] ^ same [15:21:56] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum5002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:22:00] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:22:02] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:22:12] (03PS3) 10Jbond: P:monitoring: add defaults for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/744812 [15:22:20] PROBLEM - Check systemd state on durum5002 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:52] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:23:34] (03PS4) 10Jbond: P:monitoring: add defaults for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/744812 [15:24:13] (03CR) 10Jbond: [C: 03+2] P:monitoring: add defaults for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/744812 (owner: 10Jbond) [15:24:26] PROBLEM - Bird Internet Routing Daemon on durum3001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:24:34] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:24:48] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum3001 is CRITICAL: 
CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:24:52] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:25:02] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:25:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2026.codfw.wmnet with OS buster [15:25:22] PROBLEM - Check systemd state on durum3001 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service,nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:22] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:27] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 fo... 
[15:25:50] PROBLEM - Check systemd state on durum2001 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service,nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:02] yeah thinking of reverting this one to figure out what went wrong [15:26:04] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum2001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:26:14] PROBLEM - Bird Internet Routing Daemon on durum2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:26:52] PROBLEM - Bird Internet Routing Daemon on durum1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:27:27] (03PS1) 10Btullis: Remove duplicate cluster variable from Druid check [alerts] - 10https://gerrit.wikimedia.org/r/744813 (https://phabricator.wikimedia.org/T293399) [15:28:36] (03PS1) 10Ssingh: Revert "durum: show if the user is using DoH or DoT" [puppet] - 10https://gerrit.wikimedia.org/r/744793 [15:29:02] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:29:03] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:29:12] PROBLEM - Check systemd state on durum4002 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service,nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:20] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: 
Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:30:02] PROBLEM - Bird Internet Routing Daemon on durum4002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:30:23] (03CR) 10Ssingh: [C: 03+2] Revert "durum: show if the user is using DoH or DoT" [puppet] - 10https://gerrit.wikimedia.org/r/744793 (owner: 10Ssingh) [15:30:34] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum4002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:31:08] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:32:30] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum1001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:32:39] sigh [15:32:44] PROBLEM - Bird Internet Routing Daemon on durum1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:33:00] PROBLEM - Check systemd state on durum1001 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service,nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:21] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 10 hosts with reason: debugging bird/anycast-hc issues [15:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:26] (03PS1) 10Ladsgroup: Do not inject rev id of template when it's empty [extensions/FlaggedRevs] 
(wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744796 [15:33:29] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 10 hosts with reason: debugging bird/anycast-hc issues [15:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:37] (03PS1) 10Ladsgroup: Do not inject rev id of template when it's empty [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/744797 [15:33:48] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [15:35:02] (03PS1) 10Jbond: P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 [15:35:16] PROBLEM - Host check.wikimedia-dns.org is DOWN: PING CRITICAL - Packet loss = 100% [15:35:53] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) `internal_network_flows` works, `network.flows.internal` too. @Ottomata indeed we do have restriction on the producer s... 
[15:36:00] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:37:38] jouncebot: nowandnext [15:37:38] No deployments scheduled for the next 1 hour(s) and 22 minute(s) [15:37:38] In 1 hour(s) and 22 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T1700) [15:37:43] coool [15:38:04] (03CR) 10Ladsgroup: [C: 03+2] Do not inject rev id of template when it's empty [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/744797 (owner: 10Ladsgroup) [15:38:07] (03CR) 10Ladsgroup: [C: 03+2] Do not inject rev id of template when it's empty [extensions/FlaggedRevs] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744796 (owner: 10Ladsgroup) [15:38:21] wmf.12 doesn't need deployment, just to catch the train [15:38:38] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum1002 is OK: OK: UP (pid=10734) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:40:00] RECOVERY - Bird Internet Routing Daemon on durum1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:40:52] ^ XioNoX: discovered an interesting bug with anycast-hc today, let me share it once I resolve it. essentially, it doesn't remove the older conf files, resulting in an error like this, "Dec 07 15:36:27 durum1002 anycast-healthchecker[10438]: Invalid configuration: 185.71.138.140/32 is used by 2 service checks " [15:41:30] RECOVERY - Host check.wikimedia-dns.org is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [15:41:31] (03PS1) 10Ssingh: Revert "Revert "durum: show if the user is using DoH or DoT"" [puppet] - 10https://gerrit.wikimedia.org/r/744798 [15:41:43] sukhe: is it an anycast-hc bug or a puppet oversight? 
[15:42:08] that's true, more like a Puppet oversight I guess but I will share a patch once I resolve the durum error [15:42:18] (03Merged) 10jenkins-bot: Do not inject rev id of template when it's empty [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/744797 (owner: 10Ladsgroup) [15:42:22] (03Merged) 10jenkins-bot: Do not inject rev id of template when it's empty [extensions/FlaggedRevs] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744796 (owner: 10Ladsgroup) [15:42:35] cool, thanks! [15:43:30] (03CR) 10Ssingh: [C: 03+2] Revert "Revert "durum: show if the user is using DoH or DoT"" [puppet] - 10https://gerrit.wikimedia.org/r/744798 (owner: 10Ssingh) [15:44:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2006.wikimedia.org [15:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:44] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum1001 is OK: OK: UP (pid=31997) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:45:50] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:45:56] RECOVERY - Bird Internet Routing Daemon on durum1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:46:14] RECOVERY - Check systemd state on durum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:18] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:46:38] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 96, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:47:35] !log ladsgroup@deploy1002 
Synchronized php-1.38.0-wmf.9/extensions/FlaggedRevs/backend/FlaggedRevision.php: Backport: [[gerrit:744797|Do not inject rev id of template when it's empty]] (duration: 00m 57s) [15:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:40] RECOVERY - Check systemd state on durum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:44] RECOVERY - Bird Internet Routing Daemon on durum5002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:47:56] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum5002 is OK: OK: UP (pid=9923) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:47:56] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum2001 is OK: OK: UP (pid=19117) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:48:06] RECOVERY - Bird Internet Routing Daemon on durum2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:48:18] RECOVERY - Check systemd state on durum5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2006.wikimedia.org [15:50:34] RECOVERY - Bird Internet Routing Daemon on durum3001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:42] PROBLEM - Host db2074.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:52:00] PROBLEM - Host db2130.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:52:06] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2084.codfw.wmnet with reason: Reracking T296930 [15:52:08] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2084.codfw.wmnet with reason: Reracking T296930 [15:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:10] T296930: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 [15:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2005.wikimedia.org [15:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:42] PROBLEM - Host db2101.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:54:55] XioNoX: ok since it's all done, so what happened was that I renamed an existing vip_fqdn and it created the new one but didn't remove the old one, which in hindsight I should have expected (?) 
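The failure described above (a renamed vip_fqdn leaves the old service-check file in place, so two checks claim 185.71.138.140/32) can be caught before the daemon is restarted. The following is only a minimal sketch, not the fix that was deployed: it assumes each service check is an INI section carrying an `ip_prefix` option, and the file names and section names are hypothetical.

```python
import configparser
from collections import defaultdict


def find_duplicate_prefixes(configs):
    """Return {ip_prefix: [(file, section), ...]} for prefixes claimed more than once.

    `configs` maps a file name to the INI text of that file. The `ip_prefix`
    option name is an assumption about the service-check file layout.
    """
    users = defaultdict(list)
    for filename, text in sorted(configs.items()):
        # interpolation=None so a '%' inside a check command is not an error
        parser = configparser.ConfigParser(interpolation=None)
        parser.read_string(text, source=filename)
        for section in parser.sections():
            prefix = parser.get(section, "ip_prefix", fallback=None)
            if prefix:
                users[prefix.strip()].append((filename, section))
    return {prefix: refs for prefix, refs in users.items() if len(refs) > 1}


# Hypothetical leftover from a rename: both files claim the same VIP.
sample = {
    "old-vip.conf": "[old-vip.example.org]\nip_prefix = 185.71.138.140/32\n",
    "new-vip.conf": "[new-vip.example.org]\nip_prefix = 185.71.138.140/32\n",
}
duplicates = find_duplicate_prefixes(sample)
for prefix, refs in duplicates.items():
    print(f"{prefix} is used by {len(refs)} service checks: {refs}")
```

A check along these lines could run as a validation step before the configuration is applied; removing the stale file still has to happen in configuration management.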
[15:54:58] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum4002 is OK: OK: UP (pid=19952) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:55:19] I am not sure of what a good solution to this is, perhaps we should append the VIP to the vip_fqdn and then check if that exists? [15:55:20] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10BTullis) In case it helps, I came across this abandoned change from 2020: https://gerrit.wikimedia.org/r/c/schemas/event/secondar... [15:55:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2005.wikimedia.org [15:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:56] PROBLEM - Host dbproxy2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:56:08] 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10MoritzMuehlenhoff) [15:56:59] (03PS2) 10Jbond: P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 [15:58:56] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [15:59:36] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:59:56] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:59:56] PROBLEM - Host db2084.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:00:03] (03PS3) 10Jbond: WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 [16:00:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single 
for host ldap-replica1003.wikimedia.org [16:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:14] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:01:26] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 333, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:01:36] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 79, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:02:08] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:02:30] (03CR) 10jerkins-bot: [V: 04-1] WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 (owner: 10Jbond) [16:02:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1003.wikimedia.org [16:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:28] 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10MoritzMuehlenhoff) [16:04:20] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 73, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:04:28] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum3001 is OK: OK: UP (pid=8954) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [16:04:32] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum2002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [16:04:49] !log 
jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet [16:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:48] RECOVERY - Host db2074.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.08 ms [16:07:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet [16:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:31] !log root@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [16:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:12] RECOVERY - Host db2130.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.72 ms [16:08:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host copernicium.wikimedia.org [16:08:54] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:58] RECOVERY - Check systemd state on durum3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:40] (03PS4) 10Jbond: WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 [16:10:08] RECOVERY - Host db2101.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.36 ms [16:10:27] (03PS1) 10Andrew Bogott: cinder-backup: generate backup_file_size relative to available RAM [puppet] - 10https://gerrit.wikimedia.org/r/744821 (https://phabricator.wikimedia.org/T292546) [16:11:08] RECOVERY - Host db2084.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [16:11:41] (03PS2) 10Andrew Bogott: cinder-backup: generate backup_file_size relative to available RAM [puppet] - 10https://gerrit.wikimedia.org/r/744821 (https://phabricator.wikimedia.org/T292546) [16:12:09] (03CR) 10jerkins-bot: 
[V: 04-1] WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 (owner: 10Jbond) [16:13:13] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:13:14] 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10MoritzMuehlenhoff) [16:13:20] RECOVERY - Bird Internet Routing Daemon on durum4002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [16:14:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host copernicium.wikimedia.org [16:14:02] (03PS3) 10Andrew Bogott: cinder-backup: generate backup_file_size relative to available RAM [puppet] - 10https://gerrit.wikimedia.org/r/744821 (https://phabricator.wikimedia.org/T292546) [16:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:42] RECOVERY - Check systemd state on durum4002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:28] (03PS4) 10Andrew Bogott: cinder-backup: generate backup_file_size relative to available RAM [puppet] - 10https://gerrit.wikimedia.org/r/744821 (https://phabricator.wikimedia.org/T292546) [16:17:52] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:20:18] (03PS5) 10Jbond: WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 [16:22:48] (03CR) 10jerkins-bot: [V: 04-1] WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 (owner: 10Jbond) [16:23:13] (03PS5) 10Andrew Bogott: cinder-backup: generate backup_file_size relative to available RAM [puppet] - 10https://gerrit.wikimedia.org/r/744821 (https://phabricator.wikimedia.org/T292546) [16:24:30] (03PS6) 
10Jbond: WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 [16:25:28] !log deleting broken flaggedtemplates rows on dewiki (T297094) [16:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:33] T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094 [16:26:11] wrong ticket, T296380 [16:26:11] T296380: flaggedtemplates table is still too big - https://phabricator.wikimedia.org/T296380 [16:26:45] kormat: db2074 ready [16:26:58] (03CR) 10jerkins-bot: [V: 04-1] WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 (owner: 10Jbond) [16:28:27] 10SRE-swift-storage: swift-proxy not starting on ms-fe2009 due to missing python-monotonic - https://phabricator.wikimedia.org/T296289 (10MatthewVernon) OK, I know what the problem is (at least at one level). Our swift front-ends use a bit of middleware wmf.rewrite which is shipped by us from puppet; that calls... [16:28:29] (03CR) 10David Caro: alertmanager: add inhibit rules for network probes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:40:27] (03PS1) 10Giuseppe Lavagetto: miscweb: add volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/744826 [16:41:23] (03CR) 10jerkins-bot: [V: 04-1] miscweb: add volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/744826 (owner: 10Giuseppe Lavagetto) [16:41:37] (03PS1) 10C. 
Scott Ananian: Bump wikimedia/parsoid to 0.15.0-a12 [vendor] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744800 (https://phabricator.wikimedia.org/T263203) [16:42:02] (03PS2) 10Arturo Borrero Gonzalez: ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) [16:44:08] RECOVERY - Host dbproxy2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [16:44:26] (03CR) 10jerkins-bot: [V: 04-1] ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [16:45:16] 10SRE, 10ops-codfw, 10DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (10Papaul) [16:46:41] (03PS1) 10MVernon: swift::proxy: install python{3,}-monotonic [puppet] - 10https://gerrit.wikimedia.org/r/744827 (https://phabricator.wikimedia.org/T296289) [16:46:48] (03PS3) 10Arturo Borrero Gonzalez: ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) [16:46:55] 10SRE, 10ops-codfw, 10DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (10Papaul) 05Open→03Resolved @Marostegui @Kormat all the servers are back up online from my end. Thanks for helping [16:47:28] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/744827 (https://phabricator.wikimedia.org/T296289) (owner: 10MVernon) [16:47:40] (03CR) 10jerkins-bot: [V: 04-1] swift::proxy: install python{3,}-monotonic [puppet] - 10https://gerrit.wikimedia.org/r/744827 (https://phabricator.wikimedia.org/T296289) (owner: 10MVernon) [16:48:02] There are a ton of jsonTruncated mediawiki errors being logged in the last hour. I don't know what that's about. 
[16:48:43] The message part begins `"Search backend error during sending 1 documents to the commonswiki_content_1617495209 index(s) after 7: bulk: Error in one or more bulk request actions:\n\nupdate: /commonswiki_content_1617495209/page/10465338 caused [commonswiki_content_1617495209][0] primary shard is not active Timeout: [1ms].....` [16:49:09] (03CR) 10jerkins-bot: [V: 04-1] ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [16:50:58] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10Papaul) [16:51:17] (03PS4) 10Arturo Borrero Gonzalez: ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) [16:51:29] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10Papaul) 05Open→03Resolved This is complete [16:52:07] (03PS2) 10MVernon: swift::proxy: install python{3,}-monotonic [puppet] - 10https://gerrit.wikimedia.org/r/744827 (https://phabricator.wikimedia.org/T296289) [16:54:09] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/744827 (https://phabricator.wikimedia.org/T296289) (owner: 10MVernon) [16:56:56] We have a pre-train backport for Parsoid [16:57:01] (03PS1) 10Ssingh: bird: add validate_cmd for anycast-healthchecker.conf [puppet] - 10https://gerrit.wikimedia.org/r/744830 [16:57:02] it's cherry-picked on mediawiki-vendor to the wmf.12 branch, but i haven't merged it yet [16:57:14] cscott: Go ahead and merge. 
[16:57:25] last time there was an issue where y'all had already staged the wmf.12 release pre-train and our backport didn't "take" [16:57:40] I haven't checked out wmf.12 yet so you should be good to go. [16:57:45] dancy: is there a recommended process I should document, for the future? [16:58:12] i pinged on the blocker phab task for wmf.12, is "ping ops on #wikimedia-operations before merge" good enough documentation for the future? [16:58:34] Yes, that is sufficient. [16:58:51] (03PS6) 10Andrew Bogott: cinder-backup: generate backup_file_size relative to available RAM [puppet] - 10https://gerrit.wikimedia.org/r/744821 (https://phabricator.wikimedia.org/T292546) [16:59:03] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32863/console" [puppet] - 10https://gerrit.wikimedia.org/r/744830 (owner: 10Ssingh) [16:59:23] (03CR) 10C. Scott Ananian: [C: 03+2] "Pinged dancy on #wikimedia-operations and confirmed that wmf.12 hasn't been checked out yet so this is safe to merge." [vendor] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744800 (https://phabricator.wikimedia.org/T263203) (owner: 10C. Scott Ananian) [16:59:40] RECOVERY - Bird Internet Routing Daemon on durum2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [16:59:44] cscott: Thanks for making the mods to reduce the risk. [17:00:04] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:00:04] jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. 
[17:00:18] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:00:28] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 107, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:01:06] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:01:12] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum2002 is OK: OK: UP (pid=13491) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:01:16] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:02:18] (03CR) 10Hnowlan: cassandra: load grants files upon change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [17:04:10] (03PS4) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) [17:04:41] (03CR) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [17:07:02] PROBLEM - graphite.wikimedia.org render on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [17:07:07] (03PS2) 10Giuseppe Lavagetto: miscweb: add volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/744826 [17:08:44] PROBLEM - graphite.wikimedia.org api on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [17:10:53] (03CR) 10Andrew Bogott: [C: 03+2] cinder-backup: generate backup_file_size relative to available RAM [puppet] - 10https://gerrit.wikimedia.org/r/744821 (https://phabricator.wikimedia.org/T292546) (owner: 10Andrew Bogott) [17:11:09] ^ that graphite alert may or may not explain the Wikidata alert (edit rate below x/min) [17:11:40] (03CR) 10Dzahn: [C: 03+1] "THANKS!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/744826 (owner: 10Giuseppe Lavagetto) [17:12:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] miscweb: add volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/744826 (owner: 10Giuseppe Lavagetto) [17:14:02] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:01] (03Merged) 10jenkins-bot: miscweb: add volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/744826 (owner: 10Giuseppe Lavagetto) [17:19:14] !log root@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [17:19:17] is anyone looking into Graphite? [17:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:40] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.15.0-a12 [vendor] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744800 (https://phabricator.wikimedia.org/T263203) (owner: 10C. Scott Ananian) [17:21:22] Lucas_WMDE: i know nothing about graphite, but that host seems.. Busy. 
load avg is 139 [17:21:27] sheesh [17:22:28] grafana has no metrics for the last 20 mins for it [17:22:50] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:23:47] godog: ^ [17:25:03] kormat: see pm [17:25:06] !log root@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' . [17:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:16] PROBLEM - Host db2078.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:26:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:52] (03PS7) 10Jbond: WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 [17:26:54] (03PS1) 10Jbond: nrep::monitoring: nrpe checks should be disabled by default in cloud [puppet] - 10https://gerrit.wikimedia.org/r/744833 [17:27:34] !log root@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' . [17:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:49] 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10MoritzMuehlenhoff) [17:28:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32867/console" [puppet] - 10https://gerrit.wikimedia.org/r/744833 (owner: 10Jbond) [17:29:46] (03CR) 10jerkins-bot: [V: 04-1] WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 (owner: 10Jbond) [17:30:27] ACKNOWLEDGEMENT - MariaDB Replica Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 10882.43 seconds Kormat Catching up on replication https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:31:33] !log root@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' . [17:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:30] RECOVERY - Host db2078.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms [17:32:40] !log root@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[17:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:56] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - miscweb_4111: Servers kubernetes2002.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:33:24] ACKNOWLEDGEMENT - MariaDB Replica Lag: s1 on db2130 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 12036.15 seconds Kormat Catching up on replication https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:33:30] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - miscweb_4111: Servers kubernetes2013.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:33:32] !log root@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' . [17:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet [17:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:24] ACKNOWLEDGEMENT - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Kormat db1178 is getting an idrac update https://wikitech.wikimedia.org/wiki/HAProxy [17:35:16] !log root@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[17:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:42] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:36:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [17:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet [17:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:20] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:38:10] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:02] (03PS1) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835 [17:40:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [17:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:10] !log graphite1004.mgmt: racadm serveraction powercycle [17:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:54] (03PS2) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835 [17:43:12] PROBLEM - Mediawiki CirrusSearch update rate - eqiad on alert1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [17:43:36] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 96.11% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook 
https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [17:43:44] Possibly related to the CirrusSearch alert: https://phabricator.wikimedia.org/T297221 [17:43:52] RECOVERY - graphite.wikimedia.org api on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [17:44:26] RECOVERY - graphite.wikimedia.org render on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1594 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [17:44:43] (03PS3) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835 [17:44:48] PROBLEM - Mediawiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [17:44:58] looking ^ [17:45:10] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:45:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32871/console" [puppet] - 10https://gerrit.wikimedia.org/r/744835 (owner: 10Jbond) [17:46:11] (03PS4) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835 [17:46:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32872/console" [puppet] - 10https://gerrit.wikimedia.org/r/744835 (owner: 10Jbond) [17:47:51] (03PS5) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835 [17:47:58] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 96.11% of data under the 
critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [17:48:51] (03PS6) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835 [17:51:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2002.codfw.wmnet [17:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:33] (03PS7) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835 [17:52:10] 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10MoritzMuehlenhoff) [17:54:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2002.codfw.wmnet [17:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:05] (03PS4) 10Ahmon Dancy: Choose wikiversions.php file relative to MWMultiVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743038 [17:55:07] (03PS1) 10Ahmon Dancy: MWMultiVersion.php: Reverse logic for wikiversions file selection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744836 [17:55:23] (03CR) 10Ahmon Dancy: [C: 03+1] Choose wikiversions.php file relative to MWMultiVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743038 (owner: 10Ahmon Dancy) [17:56:47] (03PS1) 10Jgiannelos: proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/744837 [18:00:04] chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T1800). 
[18:03:18] (03CR) 10CDanis: [C: 03+1] bird: add validate_cmd for anycast-healthchecker.conf [puppet] - 10https://gerrit.wikimedia.org/r/744830 (owner: 10Ssingh) [18:03:40] (03CR) 10Andrew Bogott: [C: 03+2] P::trafficserver: use https for cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/742214 (https://phabricator.wikimedia.org/T263829) (owner: 10Majavah) [18:03:43] (03CR) 10Andrew Bogott: [C: 03+2] set up tls termination on cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/742213 (https://phabricator.wikimedia.org/T263829) (owner: 10Majavah) [18:04:31] (03PS4) 10Andrew Bogott: P::trafficserver: use https for cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/742214 (https://phabricator.wikimedia.org/T263829) (owner: 10Majavah) [18:05:47] (03PS7) 10Hnowlan: partman: add reuse partman profile for cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) [18:05:54] (03CR) 10Hnowlan: partman: add reuse partman profile for cassandra hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [18:07:08] Test for auto_schema! [18:07:13] xD [18:07:18] :D [18:07:57] I ran it with cumin from cumin1001 on mwmaint1002 as cumin doesn't have dologmsg [18:09:51] Test again for auto_schema! 
[18:09:59] cool [18:10:11] (03CR) 10Jgiannelos: [C: 03+2] proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/744837 (owner: 10Jgiannelos) [18:11:15] (03PS8) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835 [18:11:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32876/console" [puppet] - 10https://gerrit.wikimedia.org/r/744835 (owner: 10Jbond) [18:12:10] (03CR) 10CDanis: "I'm a bit unsure about this as a threshold -- it's imaginable to me that we have some well-behaved clients that would exceed this limit." [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [18:12:12] (03PS9) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835 [18:13:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32877/console" [puppet] - 10https://gerrit.wikimedia.org/r/744835 (owner: 10Jbond) [18:13:46] (03Merged) 10jenkins-bot: proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/744837 (owner: 10Jgiannelos) [18:14:24] (03PS10) 10Jbond: P:trafficserver: add a hiera guard for checking [puppet] - 10https://gerrit.wikimedia.org/r/744835 [18:14:40] (03PS11) 10Jbond: P:trafficserver: add a hiera guard for checking [puppet] - 10https://gerrit.wikimedia.org/r/744835 [18:14:49] (03PS1) 10Jgiannelos: tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/744838 [18:15:55] (03CR) 10Subramanya Sastry: [C: 03+1] tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/744838 (owner: 10Jgiannelos) [18:16:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32878/console" [puppet] - 
10https://gerrit.wikimedia.org/r/744835 (owner: 10Jbond) [18:17:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32879/console" [puppet] - 10https://gerrit.wikimedia.org/r/744835 (owner: 10Jbond) [18:18:22] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:trafficserver: add a hiera guard for checking [puppet] - 10https://gerrit.wikimedia.org/r/744835 (owner: 10Jbond) [18:18:53] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/744838 (owner: 10Jgiannelos) [18:20:07] 10SRE, 10Traffic-Icebox, 10HTTPS: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (10Majavah) [18:20:32] (03CR) 10Dzahn: "While I appreciate you are making these, this is a duplicate of https://gerrit.wikimedia.org/r/c/operations/dns/+/650625 which I abandoned" [dns] - 10https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [18:20:47] 10SRE, 10Cloud-Services, 10Traffic-Icebox, 10HTTPS, 10cloud-services-team (Kanban): cloudweb2001-dev: add TLS termination - https://phabricator.wikimedia.org/T263829 (10Majavah) 05Open→03Resolved a:03Majavah [18:22:35] (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/744838 (owner: 10Jgiannelos) [18:22:37] (03CR) 10Dzahn: "it's similar here, see comments on https://gerrit.wikimedia.org/r/c/operations/dns/+/650625/" [puppet] - 10https://gerrit.wikimedia.org/r/744763 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [18:23:31] (03Abandoned) 10Dzahn: Revert "Revert "Revert "Revert "Revert "mx2001: disable ldap validation""""" [puppet] - 10https://gerrit.wikimedia.org/r/743424 (owner: 10Dzahn) [18:23:35] (03CR) 10Majavah: discovery: switchover doc to doc1002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653) (owner: 
10Majavah) [18:24:39] (03CR) 10Dzahn: discovery: switchover doc to doc1002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [18:25:56] RECOVERY - Mediawiki CirrusSearch update rate - codfw on alert1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [18:26:31] (03Abandoned) 10Jbond: nrep::monitoring: nrpe checks should be disabled by default in cloud [puppet] - 10https://gerrit.wikimedia.org/r/744833 (owner: 10Jbond) [18:26:34] RECOVERY - Mediawiki CirrusSearch update rate - eqiad on alert1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [18:27:09] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . 
[18:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:46] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 14 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:27:48] (03Abandoned) 10Jbond: WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 (owner: 10Jbond) [18:28:34] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 30.83 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:28:46] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 50.81 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:31:10] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 32.87 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:33:00] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 84.05 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:33:12] ummm [18:33:12] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 82.34 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:33:13] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' 
command on namespace 'proton' for release 'production' . [18:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:22] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:33:54] looks like we had a request spike? [18:34:26] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:36:48] (03PS1) 10Dzahn: contint: delete deployment_dir class [puppet] - 10https://gerrit.wikimedia.org/r/744839 (https://phabricator.wikimedia.org/T272559) [18:37:14] Grafana security update out [18:37:24] (03CR) 10Ssingh: [V: 03+1 C: 03+2] bird: add validate_cmd for anycast-healthchecker.conf [puppet] - 10https://gerrit.wikimedia.org/r/744830 (owner: 10Ssingh) [18:37:30] Bsadowski1: we're already aware, but thanks [18:37:36] k lol [18:37:41] I saw it on Twitter :P [18:37:45] sorry [18:38:13] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [18:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:56] (03Abandoned) 10Jbond: puppet_compiler:puppetdb: We only need one puppetdb for all compilers [puppet] - 10https://gerrit.wikimedia.org/r/739808 (owner: 10Jbond) [18:42:16] (03PS1) 10Dzahn: contint: delete the proxy_gerrit class [puppet] - 10https://gerrit.wikimedia.org/r/744840 (https://phabricator.wikimedia.org/T272559) [18:42:54] Amir1: re logmsgbot message... 
you can use https://doc.wikimedia.org/wmflib/master/api/wmflib.irc.html [18:43:36] oh nice [18:43:44] better than running cumin on mwmaint [18:44:13] 10SRE, 10observability: Remove Diamond from production - https://phabricator.wikimedia.org/T212231 (10Dzahn) Even though T210993 is open? Thanks! I am uploading a change to delete them. [18:44:33] we use this for irc logging on wmcs cookbooks: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/wmcs/cookbooks/wmcs/do_log_msg.py [18:44:54] (03PS1) 10Dzahn: diamond: delete collector::servicestats* [puppet] - 10https://gerrit.wikimedia.org/r/744841 (https://phabricator.wikimedia.org/T272559) [18:44:58] essentially https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/wmcs/cookbooks/wmcs/__init__.py#1073 [18:45:10] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Majavah) @hashar @Krinkle Content sync between instances, the je... [18:45:12] (copying the `dologmsg` util in the machines) [18:45:13] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [18:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:51] (03CR) 10Majavah: discovery: switchover doc to doc1002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [18:46:08] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [18:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:18] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[18:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:28] (03PS1) 10Ssingh: test_dns: update tests for new durum features [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/744843 [18:49:31] (03CR) 10Ssingh: [C: 03+2] test_dns: update tests for new durum features [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/744843 (owner: 10Ssingh) [18:55:28] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) [18:55:42] (03PS4) 10Jbond: C:puppet_compiler: add uploader class [puppet] - 10https://gerrit.wikimedia.org/r/743224 [18:58:47] (03PS1) 10Cwhite: opensearch_dashboards: allow up to 64mb restore payload [puppet] - 10https://gerrit.wikimedia.org/r/744845 (https://phabricator.wikimedia.org/T288621) [19:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T1900) [19:00:55] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:53] (03PS1) 10Jgiannelos: tegola-vector-tiles: Use versioned base paths for caches [deployment-charts] - 10https://gerrit.wikimedia.org/r/744846 [19:04:49] (03CR) 10Jgiannelos: [C: 04-1] "This patch introduces versioning in the name of the cache base paths on swift. 
Heads up this needs to be deployed at the same time we star" [deployment-charts] - 10https://gerrit.wikimedia.org/r/744846 (owner: 10Jgiannelos) [19:07:29] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:09] (03PS5) 10Jbond: C:puppet_compiler: add uploader class [puppet] - 10https://gerrit.wikimedia.org/r/743224 [19:09:09] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:09:37] (03CR) 10Krinkle: "Interesting. I vaguely recall there being an operational reason to favour the deployed version. I don't recall the specifics though, but s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743038 (owner: 10Ahmon Dancy) [19:10:05] 10SRE-Access-Requests: Add Lucas_WMDE to #mediawiki_security - https://phabricator.wikimedia.org/T297226 (10Legoktm) [19:10:11] Lucas_WMDE: ^^ [19:10:27] thanks \o/ [19:10:36] lmao csrf hunter [19:11:39] I'm still somewhat confused by the difference of #wikimedia-security and #mediawiki_security [19:11:55] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Krinkle) 05Stalled→03Open [19:12:05] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Krinkle) [19:12:17] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Krinkle) a:05Dzahn→03hashar [19:12:40] (03PS6) 10Jbond: C:puppet_compiler: add uploader class [puppet] - 
10https://gerrit.wikimedia.org/r/743224 [19:13:49] !log start outage recovery for commonswiki against eqiad cirrus cluster after snapshot restore [19:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:29] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:14:54] 10SRE, 10Readers-Web-Backlog, 10Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Jdlrobson) [19:15:01] PROBLEM - graphite.wikimedia.org api on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [19:15:55] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:16:28] hmm that's looking like graphite1004 again statograph[13838]: requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://graphite.wikimedia.org//render?target=MediaWiki.timing.editResponseT [19:17:23] PROBLEM - graphite.wikimedia.org render on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [19:18:11] !log graphite1004.mgmt: racadm serveraction powercycle [19:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:19] (03CR) 10Jbond: [C: 03+2] C:puppet_compiler: add uploader class [puppet] - 10https://gerrit.wikimedia.org/r/743224 (owner: 10Jbond) [19:19:10] (03CR) 10Jbond: C:puppet_compiler: add uploader class (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/743224 (owner: 10Jbond) [19:19:23] PROBLEM - Host graphite1004 is DOWN: PING 
CRITICAL - Packet loss = 100% [19:19:51] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:21] RECOVERY - Host graphite1004 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [19:20:29] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 96.11% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [19:20:31] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:47] PROBLEM - carbon-cache@e service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:22:16] RECOVERY - carbon-cache@e service on graphite1004 is OK: OK - carbon-cache@e is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:24:54] (03PS3) 10Andrew Bogott: encapi: Remove statsd metrics [puppet] - 10https://gerrit.wikimedia.org/r/740307 (owner: 10Majavah) [19:25:45] (03PS2) 10Ebernhardson: query_service: Provide return-to url with auth checks [puppet] - 10https://gerrit.wikimedia.org/r/739942 (https://phabricator.wikimedia.org/T295676) [19:26:00] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:26:10] (03CR) 10Ebernhardson: "no reason i can think of, updated commit message." 
[puppet] - 10https://gerrit.wikimedia.org/r/739942 (https://phabricator.wikimedia.org/T295676) (owner: 10Ebernhardson) [19:26:15] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [19:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:31] !log upgrading scap to 4.1.0 everywhere (T296867) [19:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:35] T296867: Deploy Scap version 4.1.0 - https://phabricator.wikimedia.org/T296867 [19:28:38] (03PS1) 10Ladsgroup: auto_schema: Add logging on file [software] - 10https://gerrit.wikimedia.org/r/744850 (https://phabricator.wikimedia.org/T288235) [19:29:34] (03CR) 10Andrew Bogott: [C: 03+2] encapi: Remove statsd metrics [puppet] - 10https://gerrit.wikimedia.org/r/740307 (owner: 10Majavah) [19:29:44] (03CR) 10jerkins-bot: [V: 04-1] auto_schema: Add logging on file [software] - 10https://gerrit.wikimedia.org/r/744850 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [19:34:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:00] RECOVERY - graphite.wikimedia.org api on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [19:37:44] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) [19:39:16] RECOVERY - graphite.wikimedia.org render on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1594 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [19:41:12] (03PS1) 10Cathal Mooney: Allow cloud-hosts1-eqiad DHCP responses to eqiad CRs [homer/public] - 
10https://gerrit.wikimedia.org/r/744854 (https://phabricator.wikimedia.org/T296906) [19:41:56] (03PS1) 10Jbond: P:puppetdb::microsite: just ensure package [puppet] - 10https://gerrit.wikimedia.org/r/744856 [19:42:02] (03PS1) 10Ebernhardson: Revert "Move cirrus traffic to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744857 (https://phabricator.wikimedia.org/T296897) [19:42:39] (03PS2) 10Ladsgroup: auto_schema: Add logging on file [software] - 10https://gerrit.wikimedia.org/r/744850 (https://phabricator.wikimedia.org/T288235) [19:42:47] 10SRE, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech: Add HTTPS support to wdqs-internal service - https://phabricator.wikimedia.org/T193473 (10RKemper) [19:42:53] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10Ottomata) Ah, right! https://phabricator.wikimedia.org/T248865#6289287 So yeah, unless we can at least control the event format... 
[19:43:24] (03CR) 10Andrew Bogott: [C: 04-1] "Looks like there are some missing pieces" [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [19:43:44] (03CR) 10jerkins-bot: [V: 04-1] auto_schema: Add logging on file [software] - 10https://gerrit.wikimedia.org/r/744850 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [19:43:53] (03CR) 10Eevans: cassandra: load grants files upon change (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [19:44:02] (03CR) 10Jbond: [C: 03+2] P:puppetdb::microsite: just ensure package [puppet] - 10https://gerrit.wikimedia.org/r/744856 (owner: 10Jbond) [19:45:55] 10Puppet, 10Infrastructure-Foundations, 10Readers-Web-Backlog, 10MobileFrontend (Tracking), 10User-Jdlrobson: Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425 (10Jdlrobson) [19:46:36] 10SRE, 10WMF-Legal, 10SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (10Jdlrobson) [19:46:42] 10SRE, 10Analytics-Radar, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10Jdlrobson) [19:46:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:28] (03PS1) 10Jbond: C:puppet_compiler::uploader: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/744859 [19:48:44] (03PS3) 10Ladsgroup: auto_schema: Add logging on file [software] - 10https://gerrit.wikimedia.org/r/744850 (https://phabricator.wikimedia.org/T288235) [19:51:29] (03CR) 10Jbond: [C: 03+2] C:puppet_compiler::uploader: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/744859 (owner: 10Jbond) 
[19:52:05] (03PS2) 10Ryan Kemper: rdf-query-service: Allow logback config to load outside the blazegraph war [puppet] - 10https://gerrit.wikimedia.org/r/743499 (owner: 10Ebernhardson) [19:52:17] (03PS1) 10Jbond: C:puppet_compiler::uploader: pass params as array [puppet] - 10https://gerrit.wikimedia.org/r/744861 [19:52:53] (03CR) 10Jbond: [C: 03+2] C:puppet_compiler::uploader: pass params as array [puppet] - 10https://gerrit.wikimedia.org/r/744861 (owner: 10Jbond) [19:53:14] (03PS3) 10Ryan Kemper: rdf-query-service: Allow logback config to load outside the blazegraph war [puppet] - 10https://gerrit.wikimedia.org/r/743499 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [19:53:32] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] rdf-query-service: Allow logback config to load outside the blazegraph war [puppet] - 10https://gerrit.wikimedia.org/r/743499 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [19:54:05] (03PS2) 10Cathal Mooney: Allow cloud-hosts1-eqiad DHCP responses to eqiad CRs [homer/public] - 10https://gerrit.wikimedia.org/r/744854 (https://phabricator.wikimedia.org/T296906) [19:54:57] (03PS9) 10Majavah: openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) [19:54:59] (03PS3) 10Majavah: openstack: enc: properly fail on server error [puppet] - 10https://gerrit.wikimedia.org/r/742424 [19:56:31] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster [19:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:47] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/744862 [19:56:59] !log ebernhardson@deploy1002 Started deploy [wdqs/wdqs@c21117f] (wcqs): Deploy version 0.3.95 to wcqs [19:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:48] (03CR) 10Majavah: openstack: refactor puppetmaster access 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [19:57:58] (03PS1) 10Jbond: C:puppet_compiler: add configurable port for uploader [puppet] - 10https://gerrit.wikimedia.org/r/744863 [19:58:48] !log ebernhardson@deploy1002 Finished deploy [wdqs/wdqs@c21117f] (wcqs): Deploy version 0.3.95 to wcqs (duration: 01m 48s) [19:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:41] (03CR) 10Jbond: [C: 03+2] C:puppet_compiler: add configurable port for uploader [puppet] - 10https://gerrit.wikimedia.org/r/744863 (owner: 10Jbond) [20:00:05] dancy and brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T2000). [20:00:32] o/ [20:00:49] here as backup, but i'm under the impression we're still blocked atm. [20:00:49] I'll start (if unblocked) in about 30 minutes [20:00:55] ack [20:05:00] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10cmooney) @volans many thanks for the info, that is super handy :) Using that cookbook I got the same results as my previous attempt. However I n... [20:07:59] (03Abandoned) 10Ryan Kemper: wcqs: enable oauth [puppet] - 10https://gerrit.wikimedia.org/r/724821 (https://phabricator.wikimedia.org/T280006) (owner: 10Ryan Kemper) [20:08:57] (03PS1) 10Jbond: puppet_compiler: fix folder names [puppet] - 10https://gerrit.wikimedia.org/r/744865 [20:10:43] (03CR) 10Jbond: [C: 03+2] puppet_compiler: fix folder names [puppet] - 10https://gerrit.wikimedia.org/r/744865 (owner: 10Jbond) [20:13:38] (03CR) 10Dzahn: "hmm.. let's get back to this one way or another. 
see https://phabricator.wikimedia.org/T265864#6995415 as a reminder where this was" [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [20:17:17] (03CR) 10Dzahn: "I removed my -1 based on latest comment on the ticket from legoktm. That's been also a while ago though." [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [20:17:25] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10Andrew) I am fine with wrangling with the disk partitioning pieces if you don't feel like it; IIRC the cloudvirts often prompt for a keypress at s... [20:23:23] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10cmooney) @Andrew thanks yeah. I have the screen open here still and can do that if you wish: {F34856312} I suspected the issue may be that the... [20:37:04] (03PS1) 10Jbond: pcc: need to seek back t the beginning of the file before we write it [puppet] - 10https://gerrit.wikimedia.org/r/744873 [20:39:06] 10SRE, 10Discovery-Search, 10Elasticsearch, 10SRE Observability, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10herron) These logs have been migrated to kafka-logging with the deployment of gelf_relay... 
[20:40:07] 10SRE, 10Wikimedia-Logstash, 10observability: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline - https://phabricator.wikimedia.org/T225122 (10herron) [20:40:20] 10SRE, 10Discovery-Search, 10Elasticsearch, 10SRE Observability, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10herron) 05Open→03Resolved a:03herron [20:40:27] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:41:19] (03CR) 10Jbond: [C: 03+2] pcc: need to seek back t the beginning of the file before we write it [puppet] - 10https://gerrit.wikimedia.org/r/744873 (owner: 10Jbond) [20:41:58] (03PS1) 10Dzahn: mgmt: delete the entire module and role::mgmt::drac_ilo [puppet] - 10https://gerrit.wikimedia.org/r/744874 (https://phabricator.wikimedia.org/T272559) [20:42:13] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:04] (03CR) 10Dzahn: "CCing more dcops just in case anyone happens to use these shell scripts to change mgmt password, probably not but making sure" [puppet] - 10https://gerrit.wikimedia.org/r/744874 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [20:45:36] (03PS1) 10Jbond: C:puppet_compiler: cast pathlike object to string [puppet] - 10https://gerrit.wikimedia.org/r/744875 [20:47:28] (03CR) 10Jbond: [C: 03+2] C:puppet_compiler: cast pathlike object to string [puppet] - 10https://gerrit.wikimedia.org/r/744875 (owner: 10Jbond) [20:49:29] 10Puppet, 10SRE, 
10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) >>! In T272559#7552446, @jbond wrote: >>>! In T272559#7546852, @Dzahn wrote: >> icinga::nsca::client is used in fundraising. so there are special case... [20:51:35] PROBLEM - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100% [20:52:33] 10SRE, 10SRE Observability, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10herron) 05Open→03Resolved a:03herron Looking at the dashboard linked in the description there have been no logs received via... [20:54:15] RECOVERY - Host elastic2037 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [20:54:27] 10SRE, 10vm-requests, 10Patch-For-Review: Site: (2) VM request for DMARC - https://phabricator.wikimedia.org/T169566 (10Dzahn) Hey @herron @akosiaris you might be surprised to see a notification on this ticket from 2017 but .. I just found it by digging backwards in history to find out why we have a "**role::... 
[20:56:47] (03PS1) 10Dzahn: delete role::dmarc [puppet] - 10https://gerrit.wikimedia.org/r/744877 (https://phabricator.wikimedia.org/T272559) [20:57:12] (03CR) 10Herron: [C: 03+1] delete role::dmarc [puppet] - 10https://gerrit.wikimedia.org/r/744877 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [20:58:51] (03CR) 10Dzahn: [C: 03+2] delete role::dmarc [puppet] - 10https://gerrit.wikimedia.org/r/744877 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [21:04:01] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 96.11% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [21:06:26] (03CR) 10RLazarus: imagecatalog: Install and configure OCI image catalog on deploy hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [21:06:41] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) > profile::beta::motd This isn't instantiated and does not have any include line elsewhere but it shows up like this: hieradata/cloud/eqiad1/deploy... [21:07:56] I have returned. 
[21:10:05] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:11:22] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) xdummy: T133183#7554483 [21:11:30] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) [21:15:33] 10SRE, 10Observability-Logging: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) p:05Triage→03Medium [21:15:47] 10SRE, 10Observability-Logging: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) [21:15:50] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10herron) [21:16:12] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10herron) 05Resolved→03Open Reopening this as progress has been made retiring legacy log inputs and now we're ready to move on to decom of the Ganeti VMs.... [21:17:10] Starting train stuff now. 
testwikis first [21:17:11] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/744862 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [21:18:35] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [21:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:35] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/744862 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [21:22:09] (03PS1) 10Ahmon Dancy: testwikis wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744878 [21:22:11] (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744878 (owner: 10Ahmon Dancy) [21:23:39] (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744878 (owner: 10Ahmon Dancy) [21:23:43] !log dancy@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.12 refs T293953 [21:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:49] T293953: 1.38.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T293953 [21:25:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[21:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:47] (03CR) 10Cwhite: "I would rather we not move api-feature-usage into the elk7 cluster for several reasons:" [puppet] - 10https://gerrit.wikimedia.org/r/744862 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [21:56:36] 10SRE, 10Observability-Logging, 10Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) > Cwhite 4:43 PM > I would rather we not move api-feature-usage into the elk7 cluster for several reasons: > > 1. We've wanted to move it... [21:56:56] 10SRE, 10Observability-Logging, 10Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) [21:57:03] 10SRE, 10Elasticsearch, 10SRE Observability, 10Wikimedia-Logstash, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335 (10herron) [21:57:48] (03CR) 10Herron: logstash: move api-feature-usage outputs to elk7 cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/744862 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [22:06:43] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1028.eqiad.wmnet with OS buster [22:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:22] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster [22:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:58] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.12 refs T293953 (duration: 44m 14s) [22:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:02] T293953: 1.38.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T293953 [22:11:07] 
RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:12:33] !log dancy@deploy1002 Pruned MediaWiki: 1.38.0-wmf.7 (duration: 04m 18s) [22:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:21] (03PS1) 10Ahmon Dancy: group0 wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744886 [22:13:23] (03PS2) 10Jdlrobson: Clean up readers web team config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 [22:13:25] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744886 (owner: 10Ahmon Dancy) [22:14:11] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.12 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744886 (owner: 10Ahmon Dancy) [22:15:24] (03CR) 10Jdlrobson: Clean up readers web team config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 (owner: 10Jdlrobson) [22:15:24] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.12 refs T293953 [22:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:29] T293953: 1.38.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T293953 [22:17:53] The train has been rolled out to group0 wikis. I will check on logs periodically for a bit. [22:18:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[22:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:34] (03PS1) 10Ebernhardson: rdf query service: limit namespace aliasing to /bigdata/namespace [puppet] - 10https://gerrit.wikimedia.org/r/744892 [22:49:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:13] 10SRE, 10Observability-Logging, 10Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) > We have an ingester installed with appropriate access on the cirrus cluster which can do this via the work from https://phabricator.wikimed... [23:07:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10Jclark-ctr) lvs1017 A7 U9 id# 1206202101 Port#26 lvs1018 B7 U29 id# 1206202102 Port#4 lvs1019 C7 U25 id# 1206202103 Port#30 lvs1020 D7 U41 id# 120620... 
[23:08:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10Jclark-ctr) [23:08:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [23:21:49] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster [23:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:15] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:44:18] (03PS1) 10MewOphaswongse: Add an image: Only validate caption if the recommendation is accepted [extensions/GrowthExperiments] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744896 (https://phabricator.wikimedia.org/T297250) [23:53:04] (03PS1) 10Jforrester: Fix invalid reference to core resources/ directory [extensions/CodeMirror] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744803 (https://phabricator.wikimedia.org/T296639) [23:55:53] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:56:13] (03CR) 10jerkins-bot: [V: 04-1] Fix invalid reference to core resources/ directory [extensions/CodeMirror] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744803 (https://phabricator.wikimedia.org/T296639) (owner: 10Jforrester) [23:57:57] (03CR) 10Jforrester: "recheck" [extensions/CodeMirror] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744803 (https://phabricator.wikimedia.org/T296639) (owner: 10Jforrester)