[00:00:04] <jouncebot>	 RoanKattouw and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T0000).
[00:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[00:06:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:08:57] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Sustainability (Incident Followup): Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (Dzahn) re: making current kernel version persistent  The one running now was selected in grub but wasn't the default selection. Either edit gru...
[00:10:18] <cwhite>	 !log end codfw opensearch upgrade T288621
[00:10:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:24] <stashbot>	 T288621: Logs and events produced by the WMF are consumed using the Elastic Common Schema by OpenSearch - https://phabricator.wikimedia.org/T288621
[00:20:48] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/743352 (owner: Filippo Giunchedi)
[00:21:54] <wikibugs>	 (CR) Cwhite: [C: +1] prometheus: bump logging level for blackbox-exporter [puppet] - https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[00:24:43] <wikibugs>	 (CR) Cwhite: "LGTM, but not sure if you mean to include the commented out tests file" [alerts] - https://gerrit.wikimedia.org/r/743394 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[00:25:05] <wikibugs>	 (CR) Cwhite: [C: +1] prometheus: remove textfile stale alert [puppet] - https://gerrit.wikimedia.org/r/743395 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[00:26:39] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/743921 (owner: Filippo Giunchedi)
[00:26:53] <wikibugs>	 (CR) Cwhite: [C: +1] service: add public alias for grafana-rw [puppet] - https://gerrit.wikimedia.org/r/743922 (owner: Filippo Giunchedi)
[00:27:51] <wikibugs>	 (CR) Cwhite: [C: +1] wmflib: add 'probes' to service::catalog type [puppet] - https://gerrit.wikimedia.org/r/743358 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[00:50:29] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:31] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:30:26] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (Papaul) This was shipped today.
[01:30:35] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:20] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (Papaul)
[01:36:53] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:53] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.38.0-wmf.12 [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/744113
[02:06:55] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.38.0-wmf.12 [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/744113 (owner: TrainBranchBot)
[02:27:59] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.38.0-wmf.12 [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/744113 (owner: TrainBranchBot)
[02:51:03] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:57:31] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T0300)
[03:38:11] <icinga-wm>	 PROBLEM - snapshot of s6 in codfw on alert1001 is CRITICAL: snapshot for s6 at codfw taken more than 3 days ago: Most recent backup 2021-12-04 03:31:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[05:07:21] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:32:09] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:35] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:45:00] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1100.eqiad.wmnet with reason: Maintenance T277354
[05:45:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1100.eqiad.wmnet with reason: Maintenance T277354
[05:45:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:06] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[05:45:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T277354)', diff saved to https://phabricator.wikimedia.org/P18031 and previous config saved to /var/cache/conftool/dbconfig/20211207-054506-marostegui.json
[05:45:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1100 (T277354)', diff saved to https://phabricator.wikimedia.org/P18032 and previous config saved to /var/cache/conftool/dbconfig/20211207-054625-marostegui.json
[05:46:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:25] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:58:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2074 and db2130 T296930', diff saved to https://phabricator.wikimedia.org/P18033 and previous config saved to /var/cache/conftool/dbconfig/20211207-055808-marostegui.json
[05:58:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:13] <stashbot>	 T296930: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930
[06:01:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1100', diff saved to https://phabricator.wikimedia.org/P18034 and previous config saved to /var/cache/conftool/dbconfig/20211207-060130-marostegui.json
[06:01:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:25] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:08:31] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:09:03] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:08] <marostegui>	 !log Apply SET GLOBAL innodb_checksum_algorithm=full_crc32; on db1107 T287244
[06:14:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:13] <stashbot>	 T287244: Considering switching innodb_checksum_algorithm=full_crc32 - https://phabricator.wikimedia.org/T287244
[06:16:32] <wikibugs>	 (PS1) Marostegui: Revert "install_server: Reimage db1125 deleting /srv" [puppet] - https://gerrit.wikimedia.org/r/743942
[06:16:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1100', diff saved to https://phabricator.wikimedia.org/P18035 and previous config saved to /var/cache/conftool/dbconfig/20211207-061635-marostegui.json
[06:16:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:49] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "install_server: Reimage db1125 deleting /srv" [puppet] - https://gerrit.wikimedia.org/r/743942 (owner: Marostegui)
[06:22:07] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[06:31:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1100 (T277354)', diff saved to https://phabricator.wikimedia.org/P18036 and previous config saved to /var/cache/conftool/dbconfig/20211207-063140-marostegui.json
[06:31:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:45] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[06:32:57] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[06:35:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance T277354
[06:35:42] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance T277354
[06:35:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:15] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1105.eqiad.wmnet with reason: Maintenance T277354
[06:36:17] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1105.eqiad.wmnet with reason: Maintenance T277354
[06:36:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18037 and previous config saved to /var/cache/conftool/dbconfig/20211207-063621-marostegui.json
[06:36:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18038 and previous config saved to /var/cache/conftool/dbconfig/20211207-063756-marostegui.json
[06:38:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:01] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[06:53:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P18039 and previous config saved to /var/cache/conftool/dbconfig/20211207-065301-marostegui.json
[06:53:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:53] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:08:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P18040 and previous config saved to /var/cache/conftool/dbconfig/20211207-070806-marostegui.json
[07:08:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:29] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:03] <marostegui>	 !log power off db2074, db2078, db2101, db2130, dbproxy2004 T296930
[07:16:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:08] <stashbot>	 T296930: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930
[07:20:42] <wikibugs>	 SRE, ops-codfw, DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (Marostegui) All hosts are now down and powered off. @Papaul you can proceed as needed. @Kormat I have upgraded mysql on all hosts, so please run `mysql_upgrade` once you bring them back up (some of th...
[07:21:07] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-misc site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:22:35] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:22:49] <marostegui>	 ^ this is known
[07:23:11] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18041 and previous config saved to /var/cache/conftool/dbconfig/20211207-072311-marostegui.json
[07:23:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:15] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 8 hosts with reason: Maintenance T277354
[07:23:16] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[07:23:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 8 hosts with reason: Maintenance T277354
[07:23:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:45] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui https://phabricator.wikimedia.org/T296930 https://wikitech.wikimedia.org/wiki/HAProxy
[07:29:41] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:32:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1162.eqiad.wmnet with reason: Maintenance T277354
[07:32:47] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1162.eqiad.wmnet with reason: Maintenance T277354
[07:32:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:50] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[07:32:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T277354)', diff saved to https://phabricator.wikimedia.org/P18042 and previous config saved to /var/cache/conftool/dbconfig/20211207-073252-marostegui.json
[07:32:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:00] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:05] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:34:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T277354)', diff saved to https://phabricator.wikimedia.org/P18043 and previous config saved to /var/cache/conftool/dbconfig/20211207-073413-marostegui.json
[07:34:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:39] <urbanecm>	 jouncebot: nowandnext
[07:36:39] <jouncebot>	 No deployments scheduled for the next 4 hour(s) and 23 minute(s)
[07:36:39] <jouncebot>	 In 4 hour(s) and 23 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T1200)
[07:36:46] <wikibugs>	 (PS3) Urbanecm: Deploy Growth mentor dashboard to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/743602 (https://phabricator.wikimedia.org/T278920)
[07:36:54] <wikibugs>	 (CR) Urbanecm: [C: +2] Deploy Growth mentor dashboard to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/743602 (https://phabricator.wikimedia.org/T278920) (owner: Urbanecm)
[07:37:31] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:37:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:40] <wikibugs>	 (Merged) jenkins-bot: Deploy Growth mentor dashboard to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/743602 (https://phabricator.wikimedia.org/T278920) (owner: Urbanecm)
[07:39:34] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 2178202b86acd50b713d939c4bcfedf7d2fa93e7: Deploy Growth mentor dashboard to all wikis (T278920) (duration: 00m 58s)
[07:39:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:38] * urbanecm done
[07:39:39] <stashbot>	 T278920: Mentor dashboard: V1 desktop - https://phabricator.wikimedia.org/T278920
[07:43:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:43:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:46:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P18044 and previous config saved to /var/cache/conftool/dbconfig/20211207-074919-marostegui.json
[07:49:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:41] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:54:42] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add current OS upgrade estimation for restbase/sessionstore [puppet] - https://gerrit.wikimedia.org/r/744046 (owner: Muehlenhoff)
[07:56:09] <wikibugs>	 (CR) RhinosF1: Add current OS upgrade estimation for restbase/sessionstore (1 comment) [puppet] - https://gerrit.wikimedia.org/r/744046 (owner: Muehlenhoff)
[07:56:30] <RhinosF1>	 moritzm: you've got target-q twice
[07:56:35] <RhinosF1>	 Check line above your change
[07:57:54] <moritzm>	 RhinosF1: good catch, thanks :-)
[07:58:06] <RhinosF1>	 Np
[08:00:01] <wikibugs>	 (PS1) Muehlenhoff: stretch.yaml: Fix duplicated line [puppet] - https://gerrit.wikimedia.org/r/744755
[08:00:45] <wikibugs>	 (CR) RhinosF1: [C: +1] stretch.yaml: Fix duplicated line [puppet] - https://gerrit.wikimedia.org/r/744755 (owner: Muehlenhoff)
[08:03:42] <wikibugs>	 (CR) RhinosF1: Add current OS upgrade estimation for restbase/sessionstore (1 comment) [puppet] - https://gerrit.wikimedia.org/r/744046 (owner: Muehlenhoff)
[08:04:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P18045 and previous config saved to /var/cache/conftool/dbconfig/20211207-080424-marostegui.json
[08:04:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:35] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, Sustainability (Incident Followup): Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (ayounsi) As a general note we need to be careful with rolling out config fixes in reaction to unexpected issues. Even...
[08:05:00] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, Sustainability (Incident Followup): Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (ayounsi) Open→In progress
[08:05:03] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: change blackbox jobs to use / as separator [puppet] - https://gerrit.wikimedia.org/r/743352 (owner: Filippo Giunchedi)
[08:05:23] <wikibugs>	 (PS3) Filippo Giunchedi: prometheus: change blackbox jobs to use / as separator [puppet] - https://gerrit.wikimedia.org/r/743352
[08:07:18] <wikibugs>	 (CR) Muehlenhoff: [C: +2] stretch.yaml: Fix duplicated line [puppet] - https://gerrit.wikimedia.org/r/744755 (owner: Muehlenhoff)
[08:07:38] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Patch-Needs-Improvement, User-herron: Outdated TLS config for MXes - https://phabricator.wikimedia.org/T203260 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This has been resolved with the update of the mail servers to Bullseye in the...
[08:12:52] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] team-sre: port node-exporter textfile stale alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/743394 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[08:13:04] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: remove textfile stale alert [puppet] - https://gerrit.wikimedia.org/r/743395 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[08:14:30] <wikibugs>	 SRE, Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (MoritzMuehlenhoff)
[08:18:29] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] service: add public_aliases list [puppet] - https://gerrit.wikimedia.org/r/743921 (owner: Filippo Giunchedi)
[08:18:35] <wikibugs>	 (PS2) Filippo Giunchedi: service: add public_aliases list [puppet] - https://gerrit.wikimedia.org/r/743921
[08:19:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T277354)', diff saved to https://phabricator.wikimedia.org/P18046 and previous config saved to /var/cache/conftool/dbconfig/20211207-081928-marostegui.json
[08:19:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1129.eqiad.wmnet with reason: Maintenance T277354
[08:19:32] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1129.eqiad.wmnet with reason: Maintenance T277354
[08:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:34] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[08:19:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T277354)', diff saved to https://phabricator.wikimedia.org/P18047 and previous config saved to /var/cache/conftool/dbconfig/20211207-081936-marostegui.json
[08:19:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:50] <wikibugs>	 SRE, Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (MoritzMuehlenhoff)
[08:21:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T277354)', diff saved to https://phabricator.wikimedia.org/P18048 and previous config saved to /var/cache/conftool/dbconfig/20211207-082059-marostegui.json
[08:21:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:06] <wikibugs>	 (PS2) Filippo Giunchedi: service: add public alias for grafana-rw [puppet] - https://gerrit.wikimedia.org/r/743922
[08:26:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] service: add public alias for grafana-rw [puppet] - https://gerrit.wikimedia.org/r/743922 (owner: Filippo Giunchedi)
[08:36:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P18049 and previous config saved to /var/cache/conftool/dbconfig/20211207-083604-marostegui.json
[08:36:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:27] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] wmflib: add 'probes' to service::catalog type [puppet] - https://gerrit.wikimedia.org/r/743358 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[08:38:32] <wikibugs>	 (PS3) Filippo Giunchedi: wmflib: add 'probes' to service::catalog type [puppet] - https://gerrit.wikimedia.org/r/743358 (https://phabricator.wikimedia.org/T291946)
[08:41:39] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (ayounsi)
[08:41:47] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Create an alert for output discards on network devices - https://phabricator.wikimedia.org/T284593 (ayounsi) In progress→Resolved This is now set to alert to NOC through alertmanager.  Added a quick mention in https://wikitech.wikimedia.org/wiki/Networ...
[08:45:16] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2016.codfw.wmnet with OS buster
[08:45:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:21] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2016.codfw.wmnet with OS buster executed with errors: - ganeti2016 (**FAIL**)   - Downtimed...
[08:47:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2016.codfw.wmnet with OS buster
[08:47:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:10] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2016.codfw.wmnet with OS buster
[08:51:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P18050 and previous config saved to /var/cache/conftool/dbconfig/20211207-085108-marostegui.json
[08:51:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:43] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:55:50] <moritzm>	 !log draining primary/secondary instances off ganeti2013 T296622
[08:55:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:54] <stashbot>	 T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622
[09:06:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T277354)', diff saved to https://phabricator.wikimedia.org/P18051 and previous config saved to /var/cache/conftool/dbconfig/20211207-090613-marostegui.json
[09:06:15] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T277354
[09:06:16] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T277354
[09:06:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:20] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[09:06:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18052 and previous config saved to /var/cache/conftool/dbconfig/20211207-090620-marostegui.json
[09:06:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18053 and previous config saved to /var/cache/conftool/dbconfig/20211207-090758-marostegui.json
[09:08:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:38] <icinga-wm>	 PROBLEM - Host mr1-drmrs is DOWN: CRITICAL - Time to live exceeded (185.15.58.130)
[09:22:20] <icinga-wm>	 RECOVERY - Host mr1-drmrs is UP: PING OK - Packet loss = 0%, RTA = 85.60 ms
[09:23:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P18054 and previous config saved to /var/cache/conftool/dbconfig/20211207-092302-marostegui.json
[09:23:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2016.codfw.wmnet with OS buster
[09:26:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2016.codfw.wmnet with OS buster completed: - ganeti2016 (**WARN**)   - Removed from Puppet...
[09:26:51] <wikibugs>	 (03PS1) 10Majavah: discovery: switchover doc to doc1002 [dns] - 10https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653)
[09:27:03] <wikibugs>	 (03PS1) 10Majavah: hieradata: switchover doc to doc1002 [puppet] - 10https://gerrit.wikimedia.org/r/744763 (https://phabricator.wikimedia.org/T247653)
[09:27:49] <XioNoX>	 !log move all VRRP primary to cr2-codfw - https://phabricator.wikimedia.org/T289241
[09:27:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2012.codfw.wmnet with OS buster
[09:29:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:20] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2012.codfw.wmnet with OS buster
[09:30:41] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10Volans) >>! In T296906#7550529, @Dzahn wrote: > Try if the server can talk http to apt1001.wikimedia.org / apt2001.wikimedia.org. >  > After getting an IP from DHCP but...
[09:31:58] <XioNoX>	 !log cr1-codfw - FPC 1 PIC 0 Need bounce - T289241
[09:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:16] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10Volans) >>! In T296906#7550913, @cmooney wrote: > Looking at the packet captures either side (install1003 and cloudvirt1028) the packets are the same.  I realise, how...
[09:33:15] <wikibugs>	 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change:  Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Marostegui) m5 hosts downtimed for 2h. Reminder: db2078 is down due to T296930, the schema change will arrive there via replication once...
[09:33:33] <wikibugs>	 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change:  Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Marostegui) a:03Marostegui
[09:34:17] <XioNoX>	 !log move all VRRP primary to cr1-codfw - T289241
[09:34:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:06] <XioNoX>	 !log cr2-codfw - FPC 1 PIC 1 Need bounce - T289241
[09:38:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P18055 and previous config saved to /var/cache/conftool/dbconfig/20211207-093807-marostegui.json
[09:38:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:18] <XioNoX>	 !log codfw, normalize VRRP - T289241
[09:40:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: icinga Blocked by X-Frame-Options Policy - https://phabricator.wikimedia.org/T251513 (10jbond) 05Open→03Resolved a:03jbond Going to resolve this as the current fix seems to alleviate the majority of the pain points and providing further fixes doesn't feel...
[09:47:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10observability, 10CAS-SSO, 10User-jbond: Icinga Monitoring for CAS - https://phabricator.wikimedia.org/T233935 (10jbond) 05In progress→03Resolved We currently monitor the tomcat process and further have monitoring for now this is adequate
[09:47:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Security-Team, 10CAS-SSO, 10User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (10jbond)
[09:49:29] <wikibugs>	 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change:  Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Marostegui) Current table schema: ` CREATE TABLE `mailinglist` (   `id` int(11) NOT NULL AUTO_INCREMENT,   `list_name` varchar(255) CHARA...
[09:53:07] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] prometheus: bump logging level for blackbox-exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[09:53:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18056 and previous config saved to /var/cache/conftool/dbconfig/20211207-095312-marostegui.json
[09:53:14] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance T277354
[09:53:15] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance T277354
[09:53:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:17] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[09:53:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18057 and previous config saved to /var/cache/conftool/dbconfig/20211207-095319-marostegui.json
[09:53:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18058 and previous config saved to /var/cache/conftool/dbconfig/20211207-095456-marostegui.json
[09:55:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:46] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:57:46] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: workaround the sadness of php/rsyslog interactions [deployment-charts] - 10https://gerrit.wikimedia.org/r/744764
[09:59:42] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:00:22] <marostegui>	 Amir1: so are you setting up mailman in maintenance mode?
[10:00:34] <Amir1>	 yup
[10:00:47] <marostegui>	 ok, let me know when done so I can deploy the change
[10:00:55] <Amir1>	 let me know when I need to hit the button
[10:01:02] <marostegui>	 !log Deploy schema change on mailman (m5) T286552
[10:01:04] <marostegui>	 Amir1: go for it
[10:01:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:06] <stashbot>	 T286552: Mailman3 schema change:  Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552
[10:01:13] <Amir1>	 done
[10:01:15] <Amir1>	 go
[10:01:22] <marostegui>	 deployed
[10:01:43] <Amir1>	 back up
[10:02:21] <wikibugs>	 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change:  Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Marostegui) ` root@db1132.eqiad.wmnet[mailman3]> ALTER TABLE mailinglist MODIFY autoresponse_owner_text TEXT COLLATE utf8mb4_bin NULL; AL...
[10:02:25] <Amir1>	 looks okay
[10:02:32] <marostegui>	 great!
[10:03:38] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Internal server error (with ugly html tags) when changing Autoresponse postings text - https://phabricator.wikimedia.org/T286269 (10Marostegui)
[10:03:48] <wikibugs>	 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change:  Switch autoresponse_text fields to Text - https://phabricator.wikimedia.org/T286552 (10Marostegui) 05Open→03Resolved All done!
[10:05:59] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) >>! In T272559#7546852, @Dzahn wrote: > icinga::nsca::client is used in fundraising. so there are special cases that can be in use but this audit scri...
[10:06:36] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Internal server error (with ugly html tags) when changing Autoresponse postings text - https://phabricator.wikimedia.org/T286269 (10Ladsgroup) 05Open→03Resolved Fixed now.
[10:08:21] <wikibugs>	 (03PS1) 10Kormat: Drop py35 support, and various cfg cleanups. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744767
[10:10:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P18059 and previous config saved to /var/cache/conftool/dbconfig/20211207-101001-marostegui.json
[10:10:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2012.codfw.wmnet with OS buster
[10:11:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:09] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2012.codfw.wmnet with OS buster completed: - ganeti2012 (**PASS**)   - Downtimed on Icinga...
[10:12:58] <wikibugs>	 10SRE, 10Traffic: Upgrade pybal-test200[23] from Stretch to Buster - https://phabricator.wikimedia.org/T297187 (10ema)
[10:13:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti2013.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
[10:13:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti2013.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
[10:13:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Followup): Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) > To be clear, I agree that your proposal is a good solution however I'm wondering what's most future-proof....
[10:17:50] <wikibugs>	 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff)
[10:18:01] <wikibugs>	 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) One more; ganeti2013. Ready to be powered off any time.
[10:18:04] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] noc: Make colors consistent with WikimediaUI style guide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742443 (owner: 10Ladsgroup)
[10:18:47] <wikibugs>	 (03Merged) 10jenkins-bot: noc: Make colors consistent with WikimediaUI style guide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742443 (owner: 10Ladsgroup)
[10:21:21] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[10:21:34] <wikibugs>	 (03PS1) 10Jelto: gitlab: disable restore timer to perform upgrade [puppet] - 10https://gerrit.wikimedia.org/r/744768 (https://phabricator.wikimedia.org/T297183)
[10:22:00] <wikibugs>	 (03CR) 10David Caro: "I'll remove my vote as probably someone that's directly affected by these alerts should +1 instead xd" [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[10:22:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] gitlab: disable restore timer to perform upgrade [puppet] - 10https://gerrit.wikimedia.org/r/744768 (https://phabricator.wikimedia.org/T297183) (owner: 10Jelto)
[10:23:12] <wikibugs>	 (03PS2) 10Jelto: gitlab: disable restore timer to perform upgrade [puppet] - 10https://gerrit.wikimedia.org/r/744768 (https://phabricator.wikimedia.org/T297183)
[10:24:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[10:24:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P18060 and previous config saved to /var/cache/conftool/dbconfig/20211207-102505-marostegui.json
[10:25:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[10:25:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet
[10:26:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:35] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32837/console" [puppet] - 10https://gerrit.wikimedia.org/r/744768 (https://phabricator.wikimedia.org/T297183) (owner: 10Jelto)
[10:27:26] <wikibugs>	 (03CR) 10David Caro: "LGTM, I'll leave for someone else to do the +1 though." [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[10:27:30] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: disable restore timer to perform upgrade [puppet] - 10https://gerrit.wikimedia.org/r/744768 (https://phabricator.wikimedia.org/T297183) (owner: 10Jelto)
[10:28:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: prometheus: bump logging level for blackbox-exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[10:29:40] <godog>	 dcaro: thanks for the reviews ^ if you have time I'd like your input on https://gerrit.wikimedia.org/r/c/operations/puppet/+/743359 too (also that's going on cloudmetrics hosts as well)
[10:32:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet
[10:32:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:21] <wikibugs>	 (03CR) 10David Caro: "Some question, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[10:33:44] <dcaro>	 godog: /me looking
[10:35:10] <godog>	 dcaro: cheers, appreciate it
[10:36:19] <dcaro>	 godog: quick question, what do the comments with XXX mean? Todo?
[10:36:34] <godog>	 dcaro: lol yes, they do
[10:36:49] <dcaro>	 👍
[10:36:52] <godog>	 I'll switch to TODO in the future, much clearer
[10:39:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff)
[10:40:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T277354)', diff saved to https://phabricator.wikimedia.org/P18061 and previous config saved to /var/cache/conftool/dbconfig/20211207-104010-marostegui.json
[10:40:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1182.eqiad.wmnet with reason: Maintenance T277354
[10:40:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1182.eqiad.wmnet with reason: Maintenance T277354
[10:40:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:15] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[10:40:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T277354)', diff saved to https://phabricator.wikimedia.org/P18062 and previous config saved to /var/cache/conftool/dbconfig/20211207-104018-marostegui.json
[10:40:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T277354)', diff saved to https://phabricator.wikimedia.org/P18063 and previous config saved to /var/cache/conftool/dbconfig/20211207-104153-marostegui.json
[10:41:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: prometheus: add alerts for network probes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[10:51:36] <wikibugs>	 (03PS2) 10Kormat: Various cfg cleanups. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744767
[10:52:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10MoritzMuehlenhoff)
[10:55:23] <wikibugs>	 (03PS10) 10Jbond: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff)
[10:55:56] <wikibugs>	 (03PS17) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150)
[10:56:09] <wikibugs>	 (03PS3) 10Jbond: C:ldap::client::utils: Update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/743387 (https://phabricator.wikimedia.org/T247364)
[10:56:45] <wikibugs>	 (03PS18) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150)
[10:56:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P18064 and previous config saved to /var/cache/conftool/dbconfig/20211207-105658-marostegui.json
[10:57:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:33] <wikibugs>	 (03PS6) 10Jbond: P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366
[11:01:18] <wikibugs>	 (03PS11) 10Jbond: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff)
[11:02:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366 (owner: 10Jbond)
[11:03:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff)
[11:03:58] <wikibugs>	 (03PS7) 10Jbond: P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366
[11:04:23] <wikibugs>	 (03PS8) 10Jbond: P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366
[11:05:06] <wikibugs>	 (03PS9) 10Jbond: P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366
[11:05:24] <wikibugs>	 (03PS12) 10Jbond: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff)
[11:06:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet
[11:06:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:28] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32839/console" [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff)
[11:07:42] <wikibugs>	 (03PS19) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150)
[11:08:08] <wikibugs>	 (03PS4) 10Jbond: C:ldap::client::utils: Update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/743387 (https://phabricator.wikimedia.org/T247364)
[11:09:27] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32840/console" [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond)
[11:10:31] <wikibugs>	 (03CR) 10David Caro: "One comment about a file->exec relationship, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[11:10:35] <wikibugs>	 (03PS8) 10Majavah: openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247)
[11:10:37] <wikibugs>	 (03PS2) 10Majavah: openstack: enc: properly fail on server error [puppet] - 10https://gerrit.wikimedia.org/r/742424
[11:11:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet
[11:11:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P18065 and previous config saved to /var/cache/conftool/dbconfig/20211207-111203-marostegui.json
[11:12:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:25] <wikibugs>	 (03PS13) 10Jbond: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff)
[11:13:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Sustainability (Incident Followup): Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 (10MoritzMuehlenhoff) >>! In T297128#7551879, @Dzahn wrote: > re: making current kernel version persistent >  > The one running now was selected i...
[11:13:52] <wikibugs>	 (03PS20) 10Jbond: modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150)
[11:14:03] <wikibugs>	 (03PS5) 10Jbond: C:ldap::client::utils: Update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/743387 (https://phabricator.wikimedia.org/T247364)
[11:15:19] <wikibugs>	 (03CR) 10Jbond: P:openldap::client: Add ldap::client::utils (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/743366 (owner: 10Jbond)
[11:15:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:openldap::client: Add ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/743366 (owner: 10Jbond)
[11:15:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Switch profile::openldap::management to use profile::openldap::client (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff)
[11:15:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] modify-mfa.py: update modify-mfa to use ldap-rw [puppet] - 10https://gerrit.wikimedia.org/r/743205 (https://phabricator.wikimedia.org/T295150) (owner: 10Jbond)
[11:15:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:ldap::client::utils: Update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/743387 (https://phabricator.wikimedia.org/T247364) (owner: 10Jbond)
[11:19:04] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Refactor superset caching to enable dual caches [puppet] - 10https://gerrit.wikimedia.org/r/743386 (https://phabricator.wikimedia.org/T295295) (owner: 10Btullis)
[11:19:58] <jbond>	 btullis: you happy for me to merge ^^
[11:20:05] <wikibugs>	 (03PS2) 10Majavah: acme_chief: add -rw to ldap certs [puppet] - 10https://gerrit.wikimedia.org/r/739283 (https://phabricator.wikimedia.org/T295150)
[11:21:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/739283 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah)
[11:21:13] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki: workaround the sadness of php/rsyslog interactions [deployment-charts] - 10https://gerrit.wikimedia.org/r/744764
[11:21:37] <jbond>	 majavah: see above will merge, once btul.lis confirms
[11:21:43] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff)
[11:24:33] <majavah>	 thx
[11:26:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2004.codfw.wmnet with reason: switch to drbd storage
[11:26:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2004.codfw.wmnet with reason: switch to drbd storage
[11:26:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T277354)', diff saved to https://phabricator.wikimedia.org/P18066 and previous config saved to /var/cache/conftool/dbconfig/20211207-112707-marostegui.json
[11:27:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:12] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[11:28:22] <jbond>	 btullis: See above, is it ok to merge your change?
[11:29:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155-1156].eqiad.wmnet with reason: Maintenance T277354
[11:30:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155-1156].eqiad.wmnet with reason: Maintenance T277354
[11:30:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T277354)', diff saved to https://phabricator.wikimedia.org/P18067 and previous config saved to /var/cache/conftool/dbconfig/20211207-113005-marostegui.json
[11:30:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T277354)', diff saved to https://phabricator.wikimedia.org/P18068 and previous config saved to /var/cache/conftool/dbconfig/20211207-113140-marostegui.json
[11:31:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:47] <topranks>	 !log removing IP addressing on cloudvirt1028 manually and forcing DHCP to debug reimage failure (T296906)
[11:31:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:51] <stashbot>	 T296906: reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906
[11:32:28] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.dhcp for host cloudvirt1028.eqiad.wmnet
[11:32:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10fgiunchedi)
[11:35:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10fgiunchedi) I chatted with @MoritzMuehlenhoff re: the rollback, apt won't let you remove a running kernel though there's a way to ask `grub` to reboot into another menu entry (the second entry...
[11:37:54] <jbond>	 majavah: your change is merged
[11:38:07] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudvirt1028.eqiad.wmnet
[11:38:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P18069 and previous config saved to /var/cache/conftool/dbconfig/20211207-114645-marostegui.json
[11:46:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:51] <moritzm>	 !log draining primary/secondary instances off ganeti2014 T296622
[11:51:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:56] <stashbot>	 T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622
[12:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T1200).
[12:00:05] <jouncebot>	 MatmaRex: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:10] <Lucas_WMDE>	 o/
[12:00:25] <MatmaRex>	 hi
[12:01:04] <Lucas_WMDE>	 I can deploy today :)
[12:01:25] <Lucas_WMDE>	 ooh, reply tool \o/
[12:01:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P18070 and previous config saved to /var/cache/conftool/dbconfig/20211207-120150-marostegui.json
[12:01:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:56] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable reply tool by default on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744043 (https://phabricator.wikimedia.org/T296444) (owner: 10Esanders)
[12:04:00] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Enable reply tool by default on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744043 (https://phabricator.wikimedia.org/T296444) (owner: 10Esanders)
[12:05:19] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable reply tool by default on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744043 (https://phabricator.wikimedia.org/T296444) (owner: 10Esanders)
[12:06:06] <wikibugs>	 (03Merged) 10jenkins-bot: Enable reply tool by default on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744043 (https://phabricator.wikimedia.org/T296444) (owner: 10Esanders)
[12:06:26] <Lucas_WMDE>	 MatmaRex: the change is on mwdebug1001, please test!
[12:06:45] <MatmaRex>	 looking
[12:06:46] <Lucas_WMDE>	 (I’m not sure how to test it myself tbh, since the two random talk pages I looked at used Flow ^^)
[12:07:05] <MatmaRex>	 yeah, you need to find or create a non-flow one
[12:07:29] <MatmaRex>	 e.g. https://www.mediawiki.org/wiki/Talk:Talk_pages_project/Usability - seems good on this page :)
[12:07:40] <MatmaRex>	 Lucas_WMDE: seems good
[12:07:58] <Lucas_WMDE>	 ack, looks good here too
[12:08:06] <majavah>	 why is https://www.mediawiki.org/wiki/Talk:Talk_pages_project still a Flow board? :-)
[12:08:25] <MatmaRex>	 🤷‍♂️
[12:09:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[12:09:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:11] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:744043|Enable reply tool by default on mediawikiwiki (T296444)]] (duration: 00m 57s)
[12:09:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:15] <stashbot>	 T296444: Config change: Deploy Reply Tool as opt-out preference at mediawiki.org - https://phabricator.wikimedia.org/T296444
[12:09:32] <MatmaRex>	 thanks
[12:10:06] <Lucas_WMDE>	 np
[12:10:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[12:10:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:14] <Lucas_WMDE>	 anything else to deploy?
[12:10:36] <Lucas_WMDE>	 I might deploy a service update in a few minutes (after testing it some more locally first), not sure if I should consider that part of the window ^^
[12:13:10] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:16:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T277354)', diff saved to https://phabricator.wikimedia.org/P18071 and previous config saved to /var/cache/conftool/dbconfig/20211207-121655-marostegui.json
[12:16:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:00] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[12:21:04] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:21:04] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Update termbox to 2021-12-06-171243-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/744071 (https://phabricator.wikimedia.org/T297006)
[12:21:15] <Lucas_WMDE>	 ^ I’ll start deploying this in a moment
[12:22:06] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "I’ve tested locally that the older Termbox pin of Wikibase as of 1.38.0-wmf.9 is compatible with the newer SSR, both without and with Java" [deployment-charts] - 10https://gerrit.wikimedia.org/r/744071 (https://phabricator.wikimedia.org/T297006) (owner: 10Lucas Werkmeister (WMDE))
[12:22:24] <Lucas_WMDE>	 flawless message cutoff
[12:23:36] <wikibugs>	 (03PS42) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045
[12:23:58] <wikibugs>	 (03PS2) 10Btullis: Remove the HDFS corrupt blocks check from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/732922 (https://phabricator.wikimedia.org/T293399)
[12:24:12] <jbond>	 !log merge refactor of monitoring classes 725045
[12:24:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/725045 (owner: 10Jbond)
[12:25:18] <wikibugs>	 (03CR) 10Michael Große: [C: 03+1] Update termbox to 2021-12-06-171243-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/744071 (https://phabricator.wikimedia.org/T297006) (owner: 10Lucas Werkmeister (WMDE))
[12:25:52] <wikibugs>	 (03Merged) 10jenkins-bot: Update termbox to 2021-12-06-171243-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/744071 (https://phabricator.wikimedia.org/T297006) (owner: 10Lucas Werkmeister (WMDE))
[12:26:08] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Remove the HDFS corrupt blocks check from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/732922 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis)
[12:26:58] <jbond>	 btullis: we keep clashing today, are you happy for me to merge?
[12:27:10] <btullis>	 jbond: Yes please :-)
[12:27:27] <jbond>	 merged
[12:27:40] <btullis>	 Many thanks.
[12:28:08] <Lucas_WMDE>	 deploy1002 /srv/deployment-charts has uncommitted changes (mwdebug/values-eqiad) :/
[12:28:13] <Lucas_WMDE>	 I assume I can deploy termbox anyways
[12:28:18] <Lucas_WMDE>	 but does anyone know who’s responsible for those?
[12:31:24] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:52] <Lucas_WMDE>	 hmm, there’s more diff than I expected in the `helmfile -e staging -i apply` for termbox
[12:32:05] <Lucas_WMDE>	 the chart in a bunch of places changes from termbox-0.0.19 to termbox-0.0.20
[12:32:55] <Lucas_WMDE>	 looks like that comes from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/742166
[12:33:07] <Lucas_WMDE>	 jelto: are you around?
[12:34:57] * Lucas_WMDE looks through the Dec 1 SAL
[12:35:25] <wikibugs>	 (03PS1) 10Jbond: Revert "monitoring: refactor class" [puppet] - 10https://gerrit.wikimedia.org/r/744786
[12:35:39] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "monitoring: refactor class" [puppet] - 10https://gerrit.wikimedia.org/r/744786 (owner: 10Jbond)
[12:36:10] <Lucas_WMDE>	 or maybe akosiaris can help?
[12:36:19] <Lucas_WMDE>	 (I didn’t find anything enlightening in the SAL)
[12:36:25] <wikibugs>	 (03PS1) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/744787
[12:36:42] <wikibugs>	 (03PS2) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/744787
[12:37:47] <wikibugs>	 (03PS3) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/744787
[12:38:40] <jelto>	 Lucas_WMDE: I'm around. I bumped some chart versions to fix a minor bug in multiple charts. Apart from label bump 0.0.19 to 0.0.20 there should be any other change
[12:38:59] <Lucas_WMDE>	 and it’s fine to apply that together with the other change I’m deploying?
[12:39:03] <jelto>	 shouldn't *
[12:39:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/744787 (owner: 10Jbond)
[12:39:39] <jelto>	 yes that's fine, apart from two charts this feature was not used by any other chart, so it should be a noop anyway
[12:39:46] <Lucas_WMDE>	 ok thanks
[12:39:50] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[12:39:50] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
[12:39:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:44] <Lucas_WMDE>	 alright, testing on test.wikidata.org
[12:41:51] <Lucas_WMDE>	 all working as far as I can tell
[12:42:07] <Lucas_WMDE>	 let’s do codfw and eqiad
[12:42:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' .
[12:42:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:15] <Lucas_WMDE>	 by the way, is there a way to add a custom message to these logs, like with scap?
[12:43:18] <Lucas_WMDE>	 (e.g. a task ID)
[12:44:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' .
[12:44:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:08] <Lucas_WMDE>	 !log deployed [[gerrit:744071|Update termbox to 2021-12-06-171243-production (T297006)]]
[12:46:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:12] <stashbot>	 T297006: Migrate Termbox to Node 12 - https://phabricator.wikimedia.org/T297006
[12:46:17] <Lucas_WMDE>	 that works, I guess ;)
[12:46:24] <Lucas_WMDE>	 !log UTC morning backport+config window done
[12:46:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:24] <wikibugs>	 (03PS4) 10Jbond: monitoring: refactor class [puppet] - 10https://gerrit.wikimedia.org/r/744787
[12:47:52] <wikibugs>	 (03PS1) 10Ladsgroup: auto_schema: Stop adding ticket to downtime cookbook [software] - 10https://gerrit.wikimedia.org/r/744778 (https://phabricator.wikimedia.org/T288235)
[12:48:30] <jelto>	 Lucas_WMDE: I don't think custom messages are supported yet in helmfile apply (apart from logging here manually). Technically it should be quite easy to have an optional value to append to the SAL log. Let me think a little bit about that and I may create a low-prio task :)
[12:48:38] <Lucas_WMDE>	 ok :)
[12:49:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] prometheus: add blackbox/discovery jobs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[12:49:47] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: support for blackbox configuration fragments [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946)
[12:49:49] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: bump logging level for blackbox-exporter [puppet] - 10https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946)
[12:49:51] <wikibugs>	 (03PS7) 10Filippo Giunchedi: prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946)
[12:49:53] <wikibugs>	 (03PS7) 10Filippo Giunchedi: prometheus: add alerts for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946)
[12:49:55] <wikibugs>	 (03PS7) 10Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946)
[12:49:59] <godog>	 sorry about the gerrit spam
[12:52:54] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] auto_schema: Stop adding ticket to downtime cookbook [software] - 10https://gerrit.wikimedia.org/r/744778 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup)
[12:55:09] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[12:56:42] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] Various cfg cleanups. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744767 (owner: 10Kormat)
[12:56:51] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Stop adding ticket to downtime cookbook [software] - 10https://gerrit.wikimedia.org/r/744778 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup)
[13:00:59] <wikibugs>	 (03Merged) 10jenkins-bot: Various cfg cleanups. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744767 (owner: 10Kormat)
[13:01:01] <wikibugs>	 (03Merged) 10jenkins-bot: auto_schema: Stop adding ticket to downtime cookbook [software] - 10https://gerrit.wikimedia.org/r/744778 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup)
[13:01:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[13:01:52] <wikibugs>	 (03PS4) 10Filippo Giunchedi: prometheus: support for blackbox configuration fragments [puppet] - 10https://gerrit.wikimedia.org/r/743359 (https://phabricator.wikimedia.org/T291946)
[13:01:54] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: bump logging level for blackbox-exporter [puppet] - 10https://gerrit.wikimedia.org/r/743388 (https://phabricator.wikimedia.org/T291946)
[13:01:56] <wikibugs>	 (03PS8) 10Filippo Giunchedi: prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946)
[13:01:58] <wikibugs>	 (03PS8) 10Filippo Giunchedi: prometheus: add alerts for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946)
[13:02:00] <wikibugs>	 (03PS8) 10Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946)
[13:04:52] <wikibugs>	 (03PS2) 10Kormat: dbutil: Make testing easier [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029
[13:07:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ganeti2014.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
[13:07:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ganeti2014.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
[13:07:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:38] <wikibugs>	 (03CR) 10Kormat: dbutil: Make testing easier (032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029 (owner: 10Kormat)
[13:08:47] <wikibugs>	 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff)
[13:08:59] <wikibugs>	 (03CR) 10Filippo Giunchedi: alertmanager: add inhibit rules for network probes (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[13:09:03] <wikibugs>	 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) One more; ganeti2014. Ready to be powered off any time.
[13:09:09] <wikibugs>	 (03PS9) 10Filippo Giunchedi: alertmanager: add inhibit rules for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946)
[13:12:31] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] dbutil: Make testing easier [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029 (owner: 10Kormat)
[13:15:55] <wikibugs>	 (03Merged) 10jenkins-bot: dbutil: Make testing easier [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/744029 (owner: 10Kormat)
[13:16:37] <jelto>	 !log update GitLab to 14.4.4-ce.0
[13:16:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:34] <wikibugs>	 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) > What would be the next steps?  Here is a proposal:  # [DE, SRE]Agree on the name of the flow :) Will it be `sflow`...
[13:23:59] <wikibugs>	 (03PS1) 10Kormat: setup.cfg: Improve dir excludes [software/wmfdb] - 10https://gerrit.wikimedia.org/r/744780
[13:26:06] <wikibugs>	 (03PS1) 10Jelto: Revert "gitlab: disable restore timer to perform upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/744789
[13:26:38] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2026.codfw.wmnet with OS buster
[13:26:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:45] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin200...
[13:29:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, PCC link by John https://puppet-compiler.wmflabs.org/compiler1002/32843/" [puppet] - 10https://gerrit.wikimedia.org/r/744787 (owner: 10Jbond)
[13:31:04] <wikibugs>	 (03PS2) 10Kormat: setup.cfg: Improve dir excludes, upgrade black [software/wmfdb] - 10https://gerrit.wikimedia.org/r/744780
[13:34:04] <wikibugs>	 (03CR) 10Kormat: [V: 03+2 C: 03+2] setup.cfg: Improve dir excludes, upgrade black [software/wmfdb] - 10https://gerrit.wikimedia.org/r/744780 (owner: 10Kormat)
[13:39:02] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32844/console" [puppet] - 10https://gerrit.wikimedia.org/r/744789 (owner: 10Jelto)
[13:39:06] <jbond>	 !log disable puppet fleet wide to rollout 744787
[13:39:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:54] <godog>	 !log reboot graphite2003 - T297180
[13:40:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:58] <stashbot>	 T297180: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180
[13:41:05] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] Revert "gitlab: disable restore timer to perform upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/744789 (owner: 10Jelto)
[13:42:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2004.codfw.wmnet with reason: switch to drbd storage
[13:42:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2004.codfw.wmnet with reason: switch to drbd storage
[13:42:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Followup): Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10ayounsi) Ok, sounds good to me!
[13:42:14] <icinga-wm>	 PROBLEM - Host graphite2003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:44:12] <icinga-wm>	 RECOVERY - Host graphite2003 is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms
[13:44:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] monitoring: refactor class (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/744787 (owner: 10Jbond)
[13:45:04] <wikibugs>	 (03PS1) 10Ayounsi: Deprecate interface-range external [homer/public] - 10https://gerrit.wikimedia.org/r/744782 (https://phabricator.wikimedia.org/T296935)
[13:45:25] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2026.codfw.wmnet with OS buster
[13:45:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:30] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 fo...
[13:45:43] <wikibugs>	 (03PS2) 10Ayounsi: Deprecate interface-range external [homer/public] - 10https://gerrit.wikimedia.org/r/744782 (https://phabricator.wikimedia.org/T296935)
[13:49:05] <wikibugs>	 (03CR) 10Ayounsi: "Example diff on cr2-esams:" [homer/public] - 10https://gerrit.wikimedia.org/r/744782 (https://phabricator.wikimedia.org/T296935) (owner: 10Ayounsi)
[13:52:11] <Amir1>	 !log removing wikiuser@localhost on s6 (T296537)
[13:52:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:19] <stashbot>	 T296537: Check and fix GRANT issues of wikiuser - https://phabricator.wikimedia.org/T296537
[14:00:38] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: ceph: mgr: fix typo in relationship [puppet] - 10https://gerrit.wikimedia.org/r/744784 (https://phabricator.wikimedia.org/T293752)
[14:02:59] <wikibugs>	 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) Sounds good!  1. we can use "internal_flows" (not _netflow as netflow is a protocol). 2. can I start this anytime, or we...
[14:07:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2006.codfw.wmnet with reason: switch to drbd storage
[14:07:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2006.codfw.wmnet with reason: switch to drbd storage
[14:07:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:28] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[14:08:48] <icinga-wm>	 PROBLEM - ganeti-mond running on ganeti2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti
[14:09:34] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[14:09:43] <moritzm>	 ^ expected, silencing
[14:11:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti[2013-2014].codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
[14:11:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti[2013-2014].codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
[14:11:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:56] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:15:18] <Amir1>	 !log fixing heartbeat grants for wikiuser across the cluster (T296537)
[14:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:23] <stashbot>	 T296537: Check and fix GRANT issues of wikiuser - https://phabricator.wikimedia.org/T296537
[14:21:56] <wikibugs>	 (03PS4) 10Ayounsi: Pmacct add sflow listener [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277)
[14:28:30] <godog>	 !log reboot graphite1004 - T297180
[14:28:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:35] <stashbot>	 T297180: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180
[14:29:27] <wikibugs>	 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10Ottomata) > Agree on the name of the flow :  Some guidelines: https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelin...
[14:30:12] <wikibugs>	 (03PS5) 10Ayounsi: Pmacct add sflow listener [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277)
[14:30:34] <icinga-wm>	 PROBLEM - Host graphite1004 is DOWN: PING CRITICAL - Packet loss = 100%
[14:31:30] <icinga-wm>	 RECOVERY - Host graphite1004 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[14:31:53] <wikibugs>	 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) >>! In T263277#7552972, @ayounsi wrote: > Sounds good!  > 1. we can use "internal_flows" (not _netflow as netflow is...
[14:32:38] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 96.11% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[14:32:49] <wikibugs>	 (03CR) 10JMeybohm: imagecatalog: Install and configure OCI image catalog on deploy hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus)
[14:34:36] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752)
[14:35:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez)
[14:36:21] <icinga-wm>	 PROBLEM - carbon-cache@d service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@d is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:36:26] <icinga-wm>	 PROBLEM - carbon-cache@e service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:36:26] <icinga-wm>	 PROBLEM - carbon-local-relay service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-local-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:36:36] <icinga-wm>	 PROBLEM - Check systemd state on graphite1004 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:36:36] <icinga-wm>	 PROBLEM - carbon-cache@b service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:36:38] <icinga-wm>	 PROBLEM - carbon-cache@g service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@g is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:36:42] <icinga-wm>	 PROBLEM - carbon-frontend-relay service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-frontend-relay is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:36:42] <icinga-wm>	 PROBLEM - carbon-cache@c service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:37:00] <godog>	 that's me ^
[14:37:26] <icinga-wm>	 PROBLEM - carbon-cache@h service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@h is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:37:31] <wikibugs>	 (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1003/32846/netflow1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi)
[14:37:34] <godog>	 jbond: I've reenabled puppet on graphite1004 btw
[14:37:36] <icinga-wm>	 PROBLEM - carbon-cache@a service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:37:48] <icinga-wm>	 PROBLEM - carbon-cache@f service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@f is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:37:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q2), 10Sustainability (Incident Followup): Alert that should have paged via VictorOps was delayed because of partial networking outage - https://phabricator.wikimedia.org/T294166 (10herron)
[14:37:57] <jbond>	 godog: ack thanks, I'm re-enabling everywhere now
[14:38:08] <jbond>	 everything looks good so far
[14:38:28] <icinga-wm>	 RECOVERY - carbon-cache@d service on graphite1004 is OK: OK - carbon-cache@d is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:38:30] <godog>	 jbond: kk
[14:38:31] <jbond>	 !log re-enable puppet fleet wide post monitoring refactor 744787
[14:38:32] <icinga-wm>	 RECOVERY - carbon-cache@e service on graphite1004 is OK: OK - carbon-cache@e is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:38:34] <icinga-wm>	 RECOVERY - carbon-local-relay service on graphite1004 is OK: OK - carbon-local-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:38:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:36] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+2] "PCC, diff as expected https://puppet-compiler.wmflabs.org/compiler1003/32847/" [puppet] - 10https://gerrit.wikimedia.org/r/744784 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez)
[14:38:44] <icinga-wm>	 RECOVERY - Check systemd state on graphite1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:38:44] <icinga-wm>	 RECOVERY - carbon-cache@b service on graphite1004 is OK: OK - carbon-cache@b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:38:46] <icinga-wm>	 RECOVERY - carbon-cache@g service on graphite1004 is OK: OK - carbon-cache@g is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:38:50] <icinga-wm>	 RECOVERY - carbon-frontend-relay service on graphite1004 is OK: OK - carbon-frontend-relay is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:38:50] <icinga-wm>	 RECOVERY - carbon-cache@c service on graphite1004 is OK: OK - carbon-cache@c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:38:58] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] ceph: mgr: fix typo in relationship [puppet] - https://gerrit.wikimedia.org/r/744784 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[14:39:34] <icinga-wm>	 RECOVERY - carbon-cache@h service on graphite1004 is OK: OK - carbon-cache@h is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:39:44] <icinga-wm>	 RECOVERY - carbon-cache@a service on graphite1004 is OK: OK - carbon-cache@a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:39:58] <icinga-wm>	 RECOVERY - carbon-cache@f service on graphite1004 is OK: OK - carbon-cache@f is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:40:27] <wikibugs>	 SRE, Data-Engineering, Infrastructure-Foundations, Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (Ottomata) > can I start this anytime, or we need to create the kafka topic somewhere? Not really needed, unless you need to set s...
[14:41:48] <wikibugs>	 SRE, Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (fgiunchedi)
[14:43:06] <wikibugs>	 SRE, Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (MoritzMuehlenhoff)
[14:43:18] <wikibugs>	 SRE, Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (MoritzMuehlenhoff)
[14:43:41] <wikibugs>	 SRE, Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (MoritzMuehlenhoff)
[14:44:47] <wikibugs>	 (PS2) Majavah: Remove UserMerge rights from labswiki (wikitech) [mediawiki-config] - https://gerrit.wikimedia.org/r/743659
[14:51:15] <wikibugs>	 (PS1) Btullis: Remove more alerts that have moved to alertmanager [puppet] - https://gerrit.wikimedia.org/r/744809 (https://phabricator.wikimedia.org/T293399)
[14:58:34] <majavah>	 jbond: hi! profile::trafficserver::monitoring seems to include profile::monitoring even in cloud (normally profile::base::production includes it in prod only), and now the deployment-prep cache hosts are failing to run puppet due to missing hiera keys
[15:01:37] <jbond>	 majavah: ack looking now
[15:02:07] <wikibugs>	 (PS1) Majavah: ldap: Do not install py2.7 files on bullseye [puppet] - https://gerrit.wikimedia.org/r/744810
[15:02:58] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 8.201 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:02:58] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] durum: show if the user is using DoH or DoT [puppet] - https://gerrit.wikimedia.org/r/744095 (owner: Ssingh)
[15:03:08] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-misc site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:03:22] <majavah>	 jbond: also https://gerrit.wikimedia.org/r/c/operations/puppet/+/744810/ fixes a recent change to the ldap module which broke some of our bullseye hosts
[15:04:34] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:04:37] <jbond>	 majavah: ack looks good will merge that one in a sec thanks <3
[15:05:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[15:05:40] <wikibugs>	 (PS3) Ssingh: dnsdist: refactor the configuration template for updates to durum [puppet] - https://gerrit.wikimedia.org/r/744087
[15:06:45] <wikibugs>	 (CR) Ssingh: [C: +2] dnsdist: refactor the configuration template for updates to durum [puppet] - https://gerrit.wikimedia.org/r/744087 (owner: Ssingh)
[15:06:46] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 102.8 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:07:04] <wikibugs>	 (PS1) Btullis: Increase the timeout for Druid on the analytics cluster [puppet] - https://gerrit.wikimedia.org/r/744811 (https://phabricator.wikimedia.org/T297148)
[15:07:14] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum1002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[15:07:29] <sukhe>	 ^ that's me, checking
[15:07:40] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:08:00] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:08:05] <sukhe>	 ^ related
[15:09:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2026.codfw.wmnet with OS buster
[15:09:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:15] <wikibugs>	 SRE, ops-codfw, DC-Ops, RESTBase, Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin200...
[15:09:53] <wikibugs>	 (PS1) Jbond: P:monitoring: add defaults for deployment-prep [puppet] - https://gerrit.wikimedia.org/r/744812
[15:10:15] <wikibugs>	 (CR) Jbond: [C: +2] "lgtm thanks" [puppet] - https://gerrit.wikimedia.org/r/744810 (owner: Majavah)
[15:10:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[15:11:02] <wikibugs>	 SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (JMeybohm)
[15:11:43] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32849/console" [puppet] - https://gerrit.wikimedia.org/r/744811 (https://phabricator.wikimedia.org/T297148) (owner: Btullis)
[15:13:43] <majavah>	 confirmed that the ldap fix is working
[15:13:50] <jbond>	 great thanks
[15:14:31] <wikibugs>	 (CR) Ssingh: [C: +2] wikimedia-dns: refactor for durum update [dns] - https://gerrit.wikimedia.org/r/744094 (owner: Ssingh)
[15:14:57] <sukhe>	 !log running authdns-update for Gerrit:744094
[15:14:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:16] <sukhe>	 ^ complete
[15:19:42] <icinga-wm>	 PROBLEM - Check systemd state on durum2002 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:19:50] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:20:04] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:20:10] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:20:16] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum2002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[15:20:22] <sukhe>	 ^ this is related to the durum change, I am looking at why it's affecting only these two hosts
[15:20:43] <wikibugs>	 (PS2) Jbond: P:monitoring: add defaults for deployment-prep [puppet] - https://gerrit.wikimedia.org/r/744812
[15:20:44] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:20:52] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[15:21:02] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:21:16] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:21:37] <logmsgbot>	 !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on durum2002.codfw.wmnet with reason: debugging bird/anycast-hc issues
[15:21:39] <logmsgbot>	 !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on durum2002.codfw.wmnet with reason: debugging bird/anycast-hc issues
[15:21:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:44] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum5002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[15:21:49] <sukhe>	 ^ same
[15:21:56] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum5002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[15:22:00] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:22:02] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:22:12] <wikibugs>	 (PS3) Jbond: P:monitoring: add defaults for deployment-prep [puppet] - https://gerrit.wikimedia.org/r/744812
[15:22:20] <icinga-wm>	 PROBLEM - Check systemd state on durum5002 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:22:52] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:23:34] <wikibugs>	 (PS4) Jbond: P:monitoring: add defaults for deployment-prep [puppet] - https://gerrit.wikimedia.org/r/744812
[15:24:13] <wikibugs>	 (CR) Jbond: [C: +2] P:monitoring: add defaults for deployment-prep [puppet] - https://gerrit.wikimedia.org/r/744812 (owner: Jbond)
[15:24:26] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum3001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[15:24:34] <icinga-wm>	 PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:24:48] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum3001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[15:24:52] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:25:02] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:25:21] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2026.codfw.wmnet with OS buster
[15:25:22] <icinga-wm>	 PROBLEM - Check systemd state on durum3001 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service,nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:25:22] <icinga-wm>	 PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:25:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:27] <wikibugs>	 SRE, ops-codfw, DC-Ops, RESTBase, Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 fo...
[15:25:50] <icinga-wm>	 PROBLEM - Check systemd state on durum2001 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service,nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:26:02] <sukhe>	 yeah thinking of reverting this one to figure out what went wrong
[15:26:04] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum2001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[15:26:14] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[15:26:52] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[15:27:27] <wikibugs>	 (PS1) Btullis: Remove duplicate cluster variable from Druid check [alerts] - https://gerrit.wikimedia.org/r/744813 (https://phabricator.wikimedia.org/T293399)
[15:28:36] <wikibugs>	 (PS1) Ssingh: Revert "durum: show if the user is using DoH or DoT" [puppet] - https://gerrit.wikimedia.org/r/744793
[15:29:02] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:29:03] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:29:12] <icinga-wm>	 PROBLEM - Check systemd state on durum4002 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service,nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:29:20] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:30:02] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum4002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[15:30:23] <wikibugs>	 (CR) Ssingh: [C: +2] Revert "durum: show if the user is using DoH or DoT" [puppet] - https://gerrit.wikimedia.org/r/744793 (owner: Ssingh)
[15:30:34] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum4002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[15:31:08] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:32:30] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum1001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[15:32:39] <sukhe>	 sigh
[15:32:44] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[15:33:00] <icinga-wm>	 PROBLEM - Check systemd state on durum1001 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service,nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:33:21] <logmsgbot>	 !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 10 hosts with reason: debugging bird/anycast-hc issues
[15:33:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:26] <wikibugs>	 (PS1) Ladsgroup: Do not inject rev id of template when it's empty [extensions/FlaggedRevs] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/744796
[15:33:29] <logmsgbot>	 !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 10 hosts with reason: debugging bird/anycast-hc issues
[15:33:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:37] <wikibugs>	 (PS1) Ladsgroup: Do not inject rev id of template when it's empty [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/744797
[15:33:48] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[15:35:02] <wikibugs>	 (PS1) Jbond: P:monitoring: fix ordering [puppet] - https://gerrit.wikimedia.org/r/744814
[15:35:16] <icinga-wm>	 PROBLEM - Host check.wikimedia-dns.org is DOWN: PING CRITICAL - Packet loss = 100%
[15:35:53] <wikibugs>	 SRE, Data-Engineering, Infrastructure-Foundations, Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) `internal_network_flows` works, `network.flows.internal` too.  @Ottomata indeed we do have restriction on the producer s...
[15:36:00] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:37:38] <Amir1>	 jouncebot: nowandnext
[15:37:38] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 22 minute(s)
[15:37:38] <jouncebot>	 In 1 hour(s) and 22 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T1700)
[15:37:43] <Amir1>	 coool
[15:38:04] <wikibugs>	 (CR) Ladsgroup: [C: +2] Do not inject rev id of template when it's empty [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/744797 (owner: Ladsgroup)
[15:38:07] <wikibugs>	 (CR) Ladsgroup: [C: +2] Do not inject rev id of template when it's empty [extensions/FlaggedRevs] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/744796 (owner: Ladsgroup)
[15:38:21] <Amir1>	 wmf.12 doesn't need deployment, just to catch the train
[15:38:38] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum1002 is OK: OK: UP (pid=10734) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[15:40:00] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[15:40:52] <sukhe>	 ^ XioNoX: discovered an interesting bug with anycast-hc today, let me share it once I resolve it. essentially, it doesn't remove the older conf files, resulting in an error like this, "Dec 07 15:36:27 durum1002 anycast-healthchecker[10438]: Invalid configuration: 185.71.138.140/32 is used by 2 service checks "
[15:41:30] <icinga-wm>	 RECOVERY - Host check.wikimedia-dns.org is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[15:41:31] <wikibugs>	 (PS1) Ssingh: Revert "Revert "durum: show if the user is using DoH or DoT"" [puppet] - https://gerrit.wikimedia.org/r/744798
[15:41:43] <XioNoX>	 sukhe: is it an anycast-hc bug or a puppet oversight?
[15:42:08] <sukhe>	 that's true, more like a Puppet oversight I guess but I will share a patch once I resolve the durum error 
[15:42:18] <wikibugs>	 (Merged) jenkins-bot: Do not inject rev id of template when it's empty [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/744797 (owner: Ladsgroup)
[15:42:22] <wikibugs>	 (Merged) jenkins-bot: Do not inject rev id of template when it's empty [extensions/FlaggedRevs] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/744796 (owner: Ladsgroup)
[15:42:35] <XioNoX>	 cool, thanks!
[15:43:30] <wikibugs>	 (CR) Ssingh: [C: +2] Revert "Revert "durum: show if the user is using DoH or DoT"" [puppet] - https://gerrit.wikimedia.org/r/744798 (owner: Ssingh)
[15:44:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2006.wikimedia.org
[15:44:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:44] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum1001 is OK: OK: UP (pid=31997) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[15:45:50] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:45:56] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[15:46:14] <icinga-wm>	 RECOVERY - Check systemd state on durum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:18] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:46:38] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 96, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:47:35] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/FlaggedRevs/backend/FlaggedRevision.php: Backport: [[gerrit:744797|Do not inject rev id of template when it's empty]] (duration: 00m 57s)
[15:47:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:40] <icinga-wm>	 RECOVERY - Check systemd state on durum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:44] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum5002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[15:47:56] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum5002 is OK: OK: UP (pid=9923) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[15:47:56] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum2001 is OK: OK: UP (pid=19117) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[15:48:06] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[15:48:18] <icinga-wm>	 RECOVERY - Check systemd state on durum5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:48:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:48:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2006.wikimedia.org
[15:50:34] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum3001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[15:50:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:42] <icinga-wm>	 PROBLEM - Host db2074.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:52:00] <icinga-wm>	 PROBLEM - Host db2130.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:52:06] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2084.codfw.wmnet with reason: Reracking T296930
[15:52:08] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2084.codfw.wmnet with reason: Reracking T296930
[15:52:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:10] <stashbot>	 T296930: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930
[15:52:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:53:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2005.wikimedia.org
[15:53:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:42] <icinga-wm>	 PROBLEM - Host db2101.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:54:55] <sukhe>	 XioNoX: ok since it's all done, so what happened was that I renamed an existing vip_fqdn and it created the new one but didn't remove the old one, which in hindsight I should have expected (?)
[15:54:58] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum4002 is OK: OK: UP (pid=19952) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[15:55:19] <sukhe>	 I am not sure of what a good solution to this is, perhaps we should append the VIP to the vip_fqdn and then check if that exists?
[15:55:20] <wikibugs>	 SRE, Data-Engineering, Infrastructure-Foundations, Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (BTullis) In case it helps, I came across this abandoned change from 2020: https://gerrit.wikimedia.org/r/c/schemas/event/secondar...
[15:55:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2005.wikimedia.org
[15:55:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:56] <icinga-wm>	 PROBLEM - Host dbproxy2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:56:08] <wikibugs>	 SRE, Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (MoritzMuehlenhoff)
[15:56:59] <wikibugs>	 (PS2) Jbond: P:monitoring: fix ordering [puppet] - https://gerrit.wikimedia.org/r/744814
[15:58:56] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[15:59:36] <icinga-wm>	 RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:59:56] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:59:56] <icinga-wm>	 PROBLEM - Host db2084.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:00:03] <wikibugs>	 (PS3) Jbond: WIP P:monitoring: fix ordering [puppet] - https://gerrit.wikimedia.org/r/744814
[16:00:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1003.wikimedia.org
[16:00:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:14] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:01:26] <icinga-wm>	 RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 333, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:01:36] <icinga-wm>	 RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 79, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:02:08] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:02:30] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP P:monitoring: fix ordering [puppet] - https://gerrit.wikimedia.org/r/744814 (owner: Jbond)
[16:02:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1003.wikimedia.org
[16:02:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:28] <wikibugs>	 SRE, Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (MoritzMuehlenhoff)
[16:04:20] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 73, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:04:28] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum3001 is OK: OK: UP (pid=8954) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[16:04:32] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum2002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[16:04:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet
[16:04:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:48] <icinga-wm>	 RECOVERY - Host db2074.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.08 ms
[16:07:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet
[16:07:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:31] <logmsgbot>	 !log root@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[16:07:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:12] <icinga-wm>	 RECOVERY - Host db2130.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.72 ms
[16:08:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host copernicium.wikimedia.org
[16:08:54] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:08:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:58] <icinga-wm>	 RECOVERY - Check systemd state on durum3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:09:40] <wikibugs>	 (03PS4) 10Jbond: WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814
[16:10:08] <icinga-wm>	 RECOVERY - Host db2101.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.36 ms
[16:10:27] <wikibugs>	 (03PS1) 10Andrew Bogott: cinder-backup: generate backup_file_size relative to available RAM [puppet] - 10https://gerrit.wikimedia.org/r/744821 (https://phabricator.wikimedia.org/T292546)
[16:11:08] <icinga-wm>	 RECOVERY - Host db2084.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms
[16:11:41] <wikibugs>	 (03PS2) 10Andrew Bogott: cinder-backup: generate backup_file_size relative to available RAM [puppet] - 10https://gerrit.wikimedia.org/r/744821 (https://phabricator.wikimedia.org/T292546)
[16:12:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 (owner: 10Jbond)
[16:13:13] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:13:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10MoritzMuehlenhoff)
[16:13:20] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum4002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[16:14:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host copernicium.wikimedia.org
[16:14:02] <wikibugs>	 (03PS3) 10Andrew Bogott: cinder-backup: generate backup_file_size relative to available RAM [puppet] - 10https://gerrit.wikimedia.org/r/744821 (https://phabricator.wikimedia.org/T292546)
[16:14:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:42] <icinga-wm>	 RECOVERY - Check systemd state on durum4002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:17:28] <wikibugs>	 (03PS4) 10Andrew Bogott: cinder-backup: generate backup_file_size relative to available RAM [puppet] - 10https://gerrit.wikimedia.org/r/744821 (https://phabricator.wikimedia.org/T292546)
[16:17:52] <icinga-wm>	 RECOVERY - BFD status on cr3-esams is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:20:18] <wikibugs>	 (03PS5) 10Jbond: WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814
[16:22:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 (owner: 10Jbond)
[16:23:13] <wikibugs>	 (03PS5) 10Andrew Bogott: cinder-backup: generate backup_file_size relative to available RAM [puppet] - 10https://gerrit.wikimedia.org/r/744821 (https://phabricator.wikimedia.org/T292546)
[16:24:30] <wikibugs>	 (03PS6) 10Jbond: WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814
[16:25:28] <Amir1>	 !log deleting broken flaggedtemplates rows on dewiki (T297094)
[16:25:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:33] <stashbot>	 T297094: Add globaluser.gu_hidden_level column to production - https://phabricator.wikimedia.org/T297094
[16:26:11] <Amir1>	 wrong ticket, T296380
[16:26:11] <stashbot>	 T296380: flaggedtemplates table is still too big - https://phabricator.wikimedia.org/T296380
[16:26:45] <papaul>	 kormat: db2074 ready
[16:26:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 (owner: 10Jbond)
[16:28:27] <wikibugs>	 10SRE-swift-storage: swift-proxy not starting on ms-fe2009 due to missing python-monotonic - https://phabricator.wikimedia.org/T296289 (10MatthewVernon) OK, I know what the problem is (at least at one level). Our swift front-ends use a bit of middleware wmf.rewrite which is shipped by us from puppet; that calls...
[16:28:29] <wikibugs>	 (03CR) 10David Caro: alertmanager: add inhibit rules for network probes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/743981 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[16:40:27] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: miscweb: add volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/744826
[16:41:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] miscweb: add volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/744826 (owner: 10Giuseppe Lavagetto)
[16:41:37] <wikibugs>	 (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.15.0-a12 [vendor] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744800 (https://phabricator.wikimedia.org/T263203)
[16:42:02] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752)
[16:44:08] <icinga-wm>	 RECOVERY - Host dbproxy2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms
[16:44:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez)
[16:45:16] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (10Papaul)
[16:46:41] <wikibugs>	 (03PS1) 10MVernon: swift::proxy: install python{3,}-monotonic [puppet] - 10https://gerrit.wikimedia.org/r/744827 (https://phabricator.wikimedia.org/T296289)
[16:46:48] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752)
[16:46:55] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: codfw: relocate servers in rack D6 - https://phabricator.wikimedia.org/T296930 (10Papaul) 05Open→03Resolved @Marostegui  @Kormat all the servers are back up online from my end.  Thanks for helping
[16:47:28] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/744827 (https://phabricator.wikimedia.org/T296289) (owner: 10MVernon)
[16:47:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] swift::proxy: install python{3,}-monotonic [puppet] - 10https://gerrit.wikimedia.org/r/744827 (https://phabricator.wikimedia.org/T296289) (owner: 10MVernon)
[16:48:02] <dancy>	 There are a ton of jsonTruncated mediawiki errors being logged in the last hour.  I don't know what that's about.
[16:48:43] <dancy>	 The message part begins `"Search backend error during sending 1 documents to the commonswiki_content_1617495209 index(s) after 7: bulk: Error in one or more bulk request actions:\n\nupdate: /commonswiki_content_1617495209/page/10465338 caused [commonswiki_content_1617495209][0] primary shard is not active Timeout: [1ms].....`
[16:49:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez)
[16:50:58] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10Papaul)
[16:51:17] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: ceph: mgr: migrate keyring to new auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/744808 (https://phabricator.wikimedia.org/T293752)
[16:51:29] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10Papaul) 05Open→03Resolved This is complete
[16:52:07] <wikibugs>	 (03PS2) 10MVernon: swift::proxy: install python{3,}-monotonic [puppet] - 10https://gerrit.wikimedia.org/r/744827 (https://phabricator.wikimedia.org/T296289)
[16:54:09] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/744827 (https://phabricator.wikimedia.org/T296289) (owner: 10MVernon)
[16:56:56] <cscott>	 We have a pre-train backport for Parsoid
[16:57:01] <wikibugs>	 (03PS1) 10Ssingh: bird: add validate_cmd for anycast-healthchecker.conf [puppet] - 10https://gerrit.wikimedia.org/r/744830
[16:57:02] <cscott>	 it's cherry-picked on mediawiki-vendor to the wmf.12 branch, but i haven't merged it yet
[16:57:14] <dancy>	 cscott: Go ahead and merge.
[16:57:25] <cscott>	 last time there was an issue where y'all had already staged the wmf.12 release pre-train and our backport didn't "take"
[16:57:40] <dancy>	 I haven't checked out wmf.12 yet so you should be good to go.
[16:57:45] <cscott>	 dancy: is there a recommended process I should document, for the future?
[16:58:12] <cscott>	 i pinged on the blocker phab task for wmf.12, is "ping ops on #wikimedia-operations before merge" good enough documentation for the future?
[16:58:34] <dancy>	 Yes, that is sufficient.
[16:58:51] <wikibugs>	 (03PS6) 10Andrew Bogott: cinder-backup: generate backup_file_size relative to available RAM [puppet] - 10https://gerrit.wikimedia.org/r/744821 (https://phabricator.wikimedia.org/T292546)
[16:59:03] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32863/console" [puppet] - 10https://gerrit.wikimedia.org/r/744830 (owner: 10Ssingh)
[16:59:23] <wikibugs>	 (03CR) 10C. Scott Ananian: [C: 03+2] "Pinged dancy on #wikimedia-operations and confirmed that wmf.12 hasn't been checked out yet so this is safe to merge." [vendor] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744800 (https://phabricator.wikimedia.org/T263203) (owner: 10C. Scott Ananian)
[16:59:40] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[16:59:44] <dancy>	 cscott: Thanks for making the mods to reduce the risk.
[17:00:04] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[17:00:04] <jouncebot>	 jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T1700).
[17:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:18] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:00:28] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 107, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:01:06] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:01:12] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum2002 is OK: OK: UP (pid=13491) and all threads (4) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[17:01:16] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:02:18] <wikibugs>	 (03CR) 10Hnowlan: cassandra: load grants files upon change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan)
[17:04:10] <wikibugs>	 (03PS4) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956)
[17:04:41] <wikibugs>	 (03CR) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan)
[17:07:02] <icinga-wm>	 PROBLEM - graphite.wikimedia.org render on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[17:07:07] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: miscweb: add volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/744826
[17:08:44] <icinga-wm>	 PROBLEM - graphite.wikimedia.org api on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[17:10:53] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cinder-backup: generate backup_file_size relative to available RAM [puppet] - 10https://gerrit.wikimedia.org/r/744821 (https://phabricator.wikimedia.org/T292546) (owner: 10Andrew Bogott)
[17:11:09] <Lucas_WMDE>	 ^ that graphite alert may or may not explain the Wikidata alert (edit rate below x/min)
[17:11:40] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "THANKS!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/744826 (owner: 10Giuseppe Lavagetto)
[17:12:02] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] miscweb: add volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/744826 (owner: 10Giuseppe Lavagetto)
[17:14:02] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:16:01] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: add volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/744826 (owner: 10Giuseppe Lavagetto)
[17:19:14] <logmsgbot>	 !log root@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[17:19:17] <Lucas_WMDE>	 is anyone looking into Graphite?
[17:19:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:40] <wikibugs>	 (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.15.0-a12 [vendor] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744800 (https://phabricator.wikimedia.org/T263203) (owner: 10C. Scott Ananian)
[17:21:22] <kormat>	 Lucas_WMDE: i know nothing about graphite, but that host seems.. Busy. load avg is 139
[17:21:27] <Lucas_WMDE>	 sheesh
[17:22:28] <kormat>	 grafana has no metrics for the last 20 mins for it
[17:22:50] <icinga-wm>	 PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:23:47] <RhinosF1>	 godog: ^
[17:25:03] <RhinosF1>	 kormat: see pm
[17:25:06] <logmsgbot>	 !log root@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[17:25:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:16] <icinga-wm>	 PROBLEM - Host db2078.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:26:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[17:26:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:52] <wikibugs>	 (03PS7) 10Jbond: WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814
[17:26:54] <wikibugs>	 (03PS1) 10Jbond: nrep::monitoring: nrpe checks should be disabled by default in cloud [puppet] - 10https://gerrit.wikimedia.org/r/744833
[17:27:34] <logmsgbot>	 !log root@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[17:27:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[17:27:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10MoritzMuehlenhoff)
[17:28:07] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32867/console" [puppet] - 10https://gerrit.wikimedia.org/r/744833 (owner: 10Jbond)
[17:29:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 (owner: 10Jbond)
[17:30:27] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 10882.43 seconds Kormat Catching up on replication https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:31:33] <logmsgbot>	 !log root@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[17:31:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:30] <icinga-wm>	 RECOVERY - Host db2078.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms
[17:32:40] <logmsgbot>	 !log root@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[17:32:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:56] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - miscweb_4111: Servers kubernetes2002.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[17:33:24] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica Lag: s1 on db2130 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 12036.15 seconds Kormat Catching up on replication https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:33:30] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - miscweb_4111: Servers kubernetes2013.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[17:33:32] <logmsgbot>	 !log root@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[17:33:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet
[17:34:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:24] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Kormat db1178 is getting an idrac update https://wikitech.wikimedia.org/wiki/HAProxy
[17:35:16] <logmsgbot>	 !log root@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[17:35:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:42] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:36:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet
[17:36:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet
[17:36:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:20] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:38:10] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:40:02] <wikibugs>	 (03PS1) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835
[17:40:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet
[17:40:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:10] <herron>	 !log graphite1004.mgmt: racadm serveraction powercycle
[17:41:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:54] <wikibugs>	 (03PS2) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835
[17:43:12] <icinga-wm>	 PROBLEM - Mediawiki CirrusSearch update rate - eqiad on alert1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[17:43:36] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 96.11% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[17:43:44] <dancy>	 Possibly related to the CirrusSearch alert: https://phabricator.wikimedia.org/T297221
[17:43:52] <icinga-wm>	 RECOVERY - graphite.wikimedia.org api on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[17:44:26] <icinga-wm>	 RECOVERY - graphite.wikimedia.org render on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1594 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[17:44:43] <wikibugs>	 (03PS3) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835
[17:44:48] <icinga-wm>	 PROBLEM - Mediawiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[17:44:58] <dcausse>	 looking ^
[17:45:10] <icinga-wm>	 RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:45:25] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32871/console" [puppet] - 10https://gerrit.wikimedia.org/r/744835 (owner: 10Jbond)
[17:46:11] <wikibugs>	 (03PS4) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835
[17:46:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32872/console" [puppet] - 10https://gerrit.wikimedia.org/r/744835 (owner: 10Jbond)
[17:47:51] <wikibugs>	 (03PS5) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835
[17:47:58] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 96.11% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[17:48:51] <wikibugs>	 (03PS6) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835
[17:51:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2002.codfw.wmnet
[17:51:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:33] <wikibugs>	 (03PS7) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835
[17:52:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Revert 5.10.70 from bullseye hosts - https://phabricator.wikimedia.org/T297180 (10MoritzMuehlenhoff)
[17:54:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2002.codfw.wmnet
[17:54:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:05] <wikibugs>	 (03PS4) 10Ahmon Dancy: Choose wikiversions.php file relative to MWMultiVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743038
[17:55:07] <wikibugs>	 (03PS1) 10Ahmon Dancy: MWMultiVersion.php: Reverse logic for wikiversions file selection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744836
[17:55:23] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] Choose wikiversions.php file relative to MWMultiVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743038 (owner: 10Ahmon Dancy)
[17:56:47] <wikibugs>	 (03PS1) 10Jgiannelos: proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/744837
[18:00:04] <jouncebot>	 chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T1800).
[18:03:18] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] bird: add validate_cmd for anycast-healthchecker.conf [puppet] - 10https://gerrit.wikimedia.org/r/744830 (owner: 10Ssingh)
[18:03:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] P::trafficserver: use https for cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/742214 (https://phabricator.wikimedia.org/T263829) (owner: 10Majavah)
[18:03:43] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] set up tls termination on cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/742213 (https://phabricator.wikimedia.org/T263829) (owner: 10Majavah)
[18:04:31] <wikibugs>	 (03PS4) 10Andrew Bogott: P::trafficserver: use https for cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/742214 (https://phabricator.wikimedia.org/T263829) (owner: 10Majavah)
[18:05:47] <wikibugs>	 (03PS7) 10Hnowlan: partman: add reuse partman profile for cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375)
[18:05:54] <wikibugs>	 (03CR) 10Hnowlan: partman: add reuse partman profile for cassandra hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan)
[18:07:08] <logmsgbot>	 Test for auto_schema!
[18:07:13] <addshore>	 xD
[18:07:18] <Amir1>	 :D
[18:07:57] <Amir1>	 I ran it with cumin from cumin1001 on mwmaint1002 as cumin doesn't have dologmsg
[18:09:51] <logmsgbot>	 Test again for auto_schema!
[18:09:59] <Amir1>	 cool
[18:10:11] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/744837 (owner: 10Jgiannelos)
[18:11:15] <wikibugs>	 (03PS8) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835
[18:11:55] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32876/console" [puppet] - 10https://gerrit.wikimedia.org/r/744835 (owner: 10Jbond)
[18:12:10] <wikibugs>	 (03CR) 10CDanis: "I'm a bit unsure about this as a threshold -- it's imaginable to me that we have some well-behaved clients that would exceed this limit." [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond)
[18:12:12] <wikibugs>	 (03PS9) 10Jbond: WIP: find the dependency [puppet] - 10https://gerrit.wikimedia.org/r/744835
[18:13:29] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32877/console" [puppet] - 10https://gerrit.wikimedia.org/r/744835 (owner: 10Jbond)
[18:13:46] <wikibugs>	 (03Merged) 10jenkins-bot: proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/744837 (owner: 10Jgiannelos)
[18:14:24] <wikibugs>	 (03PS10) 10Jbond: P:trafficserver: add a hiera guard for checking [puppet] - 10https://gerrit.wikimedia.org/r/744835
[18:14:40] <wikibugs>	 (03PS11) 10Jbond: P:trafficserver: add a hiera guard for checking [puppet] - 10https://gerrit.wikimedia.org/r/744835
[18:14:49] <wikibugs>	 (03PS1) 10Jgiannelos: tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/744838
[18:15:55] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+1] tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/744838 (owner: 10Jgiannelos)
[18:16:27] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32878/console" [puppet] - 10https://gerrit.wikimedia.org/r/744835 (owner: 10Jbond)
[18:17:34] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32879/console" [puppet] - 10https://gerrit.wikimedia.org/r/744835 (owner: 10Jbond)
[18:18:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:trafficserver: add a hiera guard for checking [puppet] - 10https://gerrit.wikimedia.org/r/744835 (owner: 10Jbond)
[18:18:53] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/744838 (owner: 10Jgiannelos)
[18:20:07] <wikibugs>	 10SRE, 10Traffic-Icebox, 10HTTPS: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (10Majavah)
[18:20:32] <wikibugs>	 (03CR) 10Dzahn: "While I appreciate you are making these, this is a duplicate of https://gerrit.wikimedia.org/r/c/operations/dns/+/650625 which I abandoned" [dns] - 10https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[18:20:47] <wikibugs>	 10SRE, 10Cloud-Services, 10Traffic-Icebox, 10HTTPS, 10cloud-services-team (Kanban): cloudweb2001-dev: add TLS termination - https://phabricator.wikimedia.org/T263829 (10Majavah) 05Open→03Resolved a:03Majavah
[18:22:35] <wikibugs>	 (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/744838 (owner: 10Jgiannelos)
[18:22:37] <wikibugs>	 (03CR) 10Dzahn: "it's similar here, see comments on https://gerrit.wikimedia.org/r/c/operations/dns/+/650625/" [puppet] - 10https://gerrit.wikimedia.org/r/744763 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[18:23:31] <wikibugs>	 (03Abandoned) 10Dzahn: Revert "Revert "Revert "Revert "Revert "mx2001: disable ldap validation""""" [puppet] - 10https://gerrit.wikimedia.org/r/743424 (owner: 10Dzahn)
[18:23:35] <wikibugs>	 (03CR) 10Majavah: discovery: switchover doc to doc1002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[18:24:39] <wikibugs>	 (03CR) 10Dzahn: discovery: switchover doc to doc1002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[18:25:56] <icinga-wm>	 RECOVERY - Mediawiki CirrusSearch update rate - codfw on alert1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[18:26:31] <wikibugs>	 (03Abandoned) 10Jbond: nrep::monitoring: nrpe checks should be disabled by default in cloud [puppet] - 10https://gerrit.wikimedia.org/r/744833 (owner: 10Jbond)
[18:26:34] <icinga-wm>	 RECOVERY - Mediawiki CirrusSearch update rate - eqiad on alert1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[18:27:09] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
[18:27:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:46] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 14 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[18:27:48] <wikibugs>	 (03Abandoned) 10Jbond: WIP P:monitoring: fix ordering [puppet] - 10https://gerrit.wikimedia.org/r/744814 (owner: 10Jbond)
[18:28:34] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 30.83 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[18:28:46] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 50.81 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[18:31:10] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 32.87 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[18:33:00] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 84.05 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[18:33:12] <majavah>	 ummm
[18:33:12] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 82.34 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[18:33:13] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' .
[18:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:22] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[18:33:54] <majavah>	 looks like we had a request spike?
[18:34:26] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[18:36:48] <wikibugs>	 (03PS1) 10Dzahn: contint: delete deployment_dir class [puppet] - 10https://gerrit.wikimedia.org/r/744839 (https://phabricator.wikimedia.org/T272559)
[18:37:14] <Bsadowski1>	 Grafana security update out
[18:37:24] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] bird: add validate_cmd for anycast-healthchecker.conf [puppet] - 10https://gerrit.wikimedia.org/r/744830 (owner: 10Ssingh)
[18:37:30] <majavah>	 Bsadowski1: we're already aware, but thanks
[18:37:36] <Bsadowski1>	 k lol
[18:37:41] <Bsadowski1>	 I saw it on Twitter :P
[18:37:45] <Bsadowski1>	 sorry
[18:38:13] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
[18:38:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:56] <wikibugs>	 (03Abandoned) 10Jbond: puppet_compiler:puppetdb: We only need one puppetdb for all compilers [puppet] - 10https://gerrit.wikimedia.org/r/739808 (owner: 10Jbond)
[18:42:16] <wikibugs>	 (03PS1) 10Dzahn: contint: delete the proxy_gerrit class [puppet] - 10https://gerrit.wikimedia.org/r/744840 (https://phabricator.wikimedia.org/T272559)
[18:42:54] <volans>	 Amir1: re logmsgbot message... you can use https://doc.wikimedia.org/wmflib/master/api/wmflib.irc.html
[18:43:36] <Amir1>	 oh nice
[18:43:44] <Amir1>	 better than running cumin on mwmaint
[18:44:13] <wikibugs>	 10SRE, 10observability: Remove Diamond from production - https://phabricator.wikimedia.org/T212231 (10Dzahn) Even though T210993  is open?   Thanks! I am uploading a change to delete them.
[18:44:33] <dcaro_away>	 we use this for irc logging on wmcs cookbooks: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/wmcs/cookbooks/wmcs/do_log_msg.py
[18:44:54] <wikibugs>	 (03PS1) 10Dzahn: diamond: delete collector::servicestats* [puppet] - 10https://gerrit.wikimedia.org/r/744841 (https://phabricator.wikimedia.org/T272559)
[18:44:58] <dcaro_away>	 essentially https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/wmcs/cookbooks/wmcs/__init__.py#1073
[18:45:10] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Majavah) @hashar @Krinkle Content sync between instances, the je...
[18:45:12] <dcaro_away>	 (copying the `dologmsg` util in the machines)
[18:45:13] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[18:45:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:51] <wikibugs>	 (03CR) 10Majavah: discovery: switchover doc to doc1002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[18:46:08] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[18:46:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:18] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[18:47:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:28] <wikibugs>	 (03PS1) 10Ssingh: test_dns: update tests for new durum features [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/744843
[18:49:31] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] test_dns: update tests for new durum features [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/744843 (owner: 10Ssingh)
[18:55:28] <wikibugs>	 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul)
[18:55:42] <wikibugs>	 (03PS4) 10Jbond: C:puppet_compiler: add uploader class [puppet] - 10https://gerrit.wikimedia.org/r/743224
[18:58:47] <wikibugs>	 (03PS1) 10Cwhite: opensearch_dashboards: allow up to 64mb restore payload [puppet] - 10https://gerrit.wikimedia.org/r/744845 (https://phabricator.wikimedia.org/T288621)
[19:00:05] <jouncebot>	 Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T1900)
[19:00:55] <icinga-wm>	 RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:02:53] <wikibugs>	 (03PS1) 10Jgiannelos: tegola-vector-tiles: Use versioned base paths for caches [deployment-charts] - 10https://gerrit.wikimedia.org/r/744846
[19:04:49] <wikibugs>	 (03CR) 10Jgiannelos: [C: 04-1] "This patch introduces versioning in the name of the cache base paths on swift. Heads up this needs to be deployed at the same time we star" [deployment-charts] - 10https://gerrit.wikimedia.org/r/744846 (owner: 10Jgiannelos)
[19:07:29] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:08:09] <wikibugs>	 (03PS5) 10Jbond: C:puppet_compiler: add uploader class [puppet] - 10https://gerrit.wikimedia.org/r/743224
[19:09:09] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:09:37] <wikibugs>	 (03CR) 10Krinkle: "Interesting. I vaguely recall there being an operational reason to favour the deployed version. I don't recall the specifics though, but s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743038 (owner: 10Ahmon Dancy)
[19:10:05] <wikibugs>	 10SRE-Access-Requests: Add Lucas_WMDE to #mediawiki_security - https://phabricator.wikimedia.org/T297226 (10Legoktm)
[19:10:11] <legoktm>	 Lucas_WMDE: ^^
[19:10:27] <Lucas_WMDE>	 thanks \o/
[19:10:36] <Lucas_WMDE>	 lmao csrf hunter
[19:11:39] <majavah>	 I'm still somewhat confused by the difference between #wikimedia-security and #mediawiki_security
[19:11:55] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Krinkle) 05Stalled→03Open
[19:12:05] <wikibugs>	 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Krinkle)
[19:12:17] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Krinkle) a:05Dzahn→03hashar
[19:12:40] <wikibugs>	 (03PS6) 10Jbond: C:puppet_compiler: add uploader class [puppet] - 10https://gerrit.wikimedia.org/r/743224
[19:13:49] <ebernhardson>	 !log start outage recovery for commonswiki against eqiad cirrus cluster after snapshot restore
[19:13:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:29] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:14:54] <wikibugs>	 10SRE, 10Readers-Web-Backlog, 10Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Jdlrobson)
[19:15:01] <icinga-wm>	 PROBLEM - graphite.wikimedia.org api on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:15:55] <icinga-wm>	 PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:16:28] <herron>	 hmm, that's looking like graphite1004 again: statograph[13838]: requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://graphite.wikimedia.org//render?target=MediaWiki.timing.editResponseT
[19:17:23] <icinga-wm>	 PROBLEM - graphite.wikimedia.org render on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:18:11] <herron>	 !log graphite1004.mgmt: racadm serveraction powercycle
[19:18:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:puppet_compiler: add uploader class [puppet] - 10https://gerrit.wikimedia.org/r/743224 (owner: 10Jbond)
[19:19:10] <wikibugs>	 (03CR) 10Jbond: C:puppet_compiler: add uploader class (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/743224 (owner: 10Jbond)
[19:19:23] <icinga-wm>	 PROBLEM - Host graphite1004 is DOWN: PING CRITICAL - Packet loss = 100%
[19:19:51] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:20:21] <icinga-wm>	 RECOVERY - Host graphite1004 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms
[19:20:29] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 96.11% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[19:20:31] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:20:47] <icinga-wm>	 PROBLEM - carbon-cache@e service on graphite1004 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:22:16] <icinga-wm>	 RECOVERY - carbon-cache@e service on graphite1004 is OK: OK - carbon-cache@e is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:24:54] <wikibugs>	 (03PS3) 10Andrew Bogott: encapi: Remove statsd metrics [puppet] - 10https://gerrit.wikimedia.org/r/740307 (owner: 10Majavah)
[19:25:45] <wikibugs>	 (03PS2) 10Ebernhardson: query_service: Provide return-to url with auth checks [puppet] - 10https://gerrit.wikimedia.org/r/739942 (https://phabricator.wikimedia.org/T295676)
[19:26:00] <icinga-wm>	 RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:26:10] <wikibugs>	 (03CR) 10Ebernhardson: "no reason i can think of, updated commit message." [puppet] - 10https://gerrit.wikimedia.org/r/739942 (https://phabricator.wikimedia.org/T295676) (owner: 10Ebernhardson)
[19:26:15] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
[19:26:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:31] <legoktm>	 !log upgrading scap to 4.1.0 everywhere (T296867)
[19:26:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:35] <stashbot>	 T296867: Deploy Scap version 4.1.0 - https://phabricator.wikimedia.org/T296867
[19:28:38] <wikibugs>	 (03PS1) 10Ladsgroup: auto_schema: Add logging on file [software] - 10https://gerrit.wikimedia.org/r/744850 (https://phabricator.wikimedia.org/T288235)
[19:29:34] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] encapi: Remove statsd metrics [puppet] - 10https://gerrit.wikimedia.org/r/740307 (owner: 10Majavah)
[19:29:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] auto_schema: Add logging on file [software] - 10https://gerrit.wikimedia.org/r/744850 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup)
[19:34:54] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:37:00] <icinga-wm>	 RECOVERY - graphite.wikimedia.org api on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:37:44] <wikibugs>	 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul)
[19:39:16] <icinga-wm>	 RECOVERY - graphite.wikimedia.org render on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1594 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[19:41:12] <wikibugs>	 (03PS1) 10Cathal Mooney: Allow cloud-hosts1-eqiad DHCP responses to eqiad CRs [homer/public] - 10https://gerrit.wikimedia.org/r/744854 (https://phabricator.wikimedia.org/T296906)
[19:41:56] <wikibugs>	 (03PS1) 10Jbond: P:puppetdb::microsite: just ensure package [puppet] - 10https://gerrit.wikimedia.org/r/744856
[19:42:02] <wikibugs>	 (03PS1) 10Ebernhardson: Revert "Move cirrus traffic to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744857 (https://phabricator.wikimedia.org/T296897)
[19:42:39] <wikibugs>	 (03PS2) 10Ladsgroup: auto_schema: Add logging on file [software] - 10https://gerrit.wikimedia.org/r/744850 (https://phabricator.wikimedia.org/T288235)
[19:42:47] <wikibugs>	 10SRE, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech: Add HTTPS support to wdqs-internal service - https://phabricator.wikimedia.org/T193473 (10RKemper)
[19:42:53] <wikibugs>	 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10Ottomata) Ah, right!  https://phabricator.wikimedia.org/T248865#6289287  So yeah, unless we can at least control the event format...
[19:43:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 04-1] "Looks like there are some missing pieces" [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah)
[19:43:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] auto_schema: Add logging on file [software] - 10https://gerrit.wikimedia.org/r/744850 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup)
[19:43:53] <wikibugs>	 (03CR) 10Eevans: cassandra: load grants files upon change (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan)
[19:44:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:puppetdb::microsite: just ensure package [puppet] - 10https://gerrit.wikimedia.org/r/744856 (owner: 10Jbond)
[19:45:55] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Readers-Web-Backlog, 10MobileFrontend (Tracking), 10User-Jdlrobson: Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425 (10Jdlrobson)
[19:46:36] <wikibugs>	 10SRE, 10WMF-Legal, 10SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (10Jdlrobson)
[19:46:42] <wikibugs>	 10SRE, 10Analytics-Radar, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10Jdlrobson)
[19:46:50] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:47:28] <wikibugs>	 (03PS1) 10Jbond: C:puppet_compiler::uploader: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/744859
[19:48:44] <wikibugs>	 (03PS3) 10Ladsgroup: auto_schema: Add logging on file [software] - 10https://gerrit.wikimedia.org/r/744850 (https://phabricator.wikimedia.org/T288235)
[19:51:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:puppet_compiler::uploader: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/744859 (owner: 10Jbond)
[19:52:05] <wikibugs>	 (03PS2) 10Ryan Kemper: rdf-query-service: Allow logback config to load outside the blazegraph war [puppet] - 10https://gerrit.wikimedia.org/r/743499 (owner: 10Ebernhardson)
[19:52:17] <wikibugs>	 (03PS1) 10Jbond: C:puppet_compiler::uploader: pass params as array [puppet] - 10https://gerrit.wikimedia.org/r/744861
[19:52:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:puppet_compiler::uploader: pass params as array [puppet] - 10https://gerrit.wikimedia.org/r/744861 (owner: 10Jbond)
[19:53:14] <wikibugs>	 (03PS3) 10Ryan Kemper: rdf-query-service: Allow logback config to load outside the blazegraph war [puppet] - 10https://gerrit.wikimedia.org/r/743499 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson)
[19:53:32] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] rdf-query-service: Allow logback config to load outside the blazegraph war [puppet] - 10https://gerrit.wikimedia.org/r/743499 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson)
[19:54:05] <wikibugs>	 (03PS2) 10Cathal Mooney: Allow cloud-hosts1-eqiad DHCP responses to eqiad CRs [homer/public] - 10https://gerrit.wikimedia.org/r/744854 (https://phabricator.wikimedia.org/T296906)
[19:54:57] <wikibugs>	 (03PS9) 10Majavah: openstack: refactor puppetmaster access [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247)
[19:54:59] <wikibugs>	 (03PS3) 10Majavah: openstack: enc: properly fail on server error [puppet] - 10https://gerrit.wikimedia.org/r/742424
[19:56:31] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster
[19:56:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:56:47] <wikibugs>	 (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/744862
[19:56:59] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wdqs/wdqs@c21117f] (wcqs): Deploy version 0.3.95 to wcqs
[19:57:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:48] <wikibugs>	 (03CR) 10Majavah: openstack: refactor puppetmaster access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah)
[19:57:58] <wikibugs>	 (03PS1) 10Jbond: C:puppet_compiler: add configurable port for uploader [puppet] - 10https://gerrit.wikimedia.org/r/744863
[19:58:48] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wdqs/wdqs@c21117f] (wcqs): Deploy version 0.3.95 to wcqs (duration: 01m 48s)
[19:58:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:puppet_compiler: add configurable port for uploader [puppet] - 10https://gerrit.wikimedia.org/r/744863 (owner: 10Jbond)
[20:00:05] <jouncebot>	 dancy and brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211207T2000).
[20:00:32] <brennen>	 o/
[20:00:49] <brennen>	 here as backup, but i'm under the impression we're still blocked atm.
[20:00:49] <dancy>	 I'll start (if unblocked) in about 30 minutes 
[20:00:55] <brennen>	 ack
[20:05:00] <wikibugs>	 10SRE, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10cmooney) @volans many thanks for the info, that is super handy :)  Using that cookbook I got the same results as my previous attempt.  However I n...
[20:07:59] <wikibugs>	 (03Abandoned) 10Ryan Kemper: wcqs: enable oauth [puppet] - 10https://gerrit.wikimedia.org/r/724821 (https://phabricator.wikimedia.org/T280006) (owner: 10Ryan Kemper)
[20:08:57] <wikibugs>	 (03PS1) 10Jbond: puppet_compiler: fix folder names [puppet] - 10https://gerrit.wikimedia.org/r/744865
[20:10:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet_compiler: fix folder names [puppet] - 10https://gerrit.wikimedia.org/r/744865 (owner: 10Jbond)
[20:13:38] <wikibugs>	 (03CR) 10Dzahn: "hmm.. let's get back to this one way or another. see https://phabricator.wikimedia.org/T265864#6995415 as a reminder where this was" [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi)
[20:17:17] <wikibugs>	 (03CR) 10Dzahn: "I removed my -1 based on latest comment on the ticket from legoktm. That's been also a while ago though." [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi)
[20:17:25] <wikibugs>	 10SRE, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10Andrew) I am fine with wrangling with the disk partitioning pieces if you don't feel like it; IIRC the cloudvirts often prompt for a keypress at s...
[20:23:23] <wikibugs>	 10SRE, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906 (10cmooney) @Andrew thanks yeah.  I have the screen open here still and can do that if you wish:  {F34856312}  I suspected the issue may be that the...
[20:37:04] <wikibugs>	 (03PS1) 10Jbond: pcc: need to seek back t the beginning of the file before we write it [puppet] - 10https://gerrit.wikimedia.org/r/744873
[20:39:06] <wikibugs>	 10SRE, 10Discovery-Search, 10Elasticsearch, 10SRE Observability, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10herron) These logs have been migrated to kafka-logging with the deployment of gelf_relay...
[20:40:07] <wikibugs>	 10SRE, 10Wikimedia-Logstash, 10observability: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline - https://phabricator.wikimedia.org/T225122 (10herron)
[20:40:20] <wikibugs>	 10SRE, 10Discovery-Search, 10Elasticsearch, 10SRE Observability, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10herron) 05Open→03Resolved a:03herron
[20:40:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:41:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pcc: need to seek back t the beginning of the file before we write it [puppet] - 10https://gerrit.wikimedia.org/r/744873 (owner: 10Jbond)
[20:41:58] <wikibugs>	 (03PS1) 10Dzahn: mgmt: delete the entire module and role::mgmt::drac_ilo [puppet] - 10https://gerrit.wikimedia.org/r/744874 (https://phabricator.wikimedia.org/T272559)
[20:42:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:04] <wikibugs>	 (03CR) 10Dzahn: "CCing more dcops just in case anyone happens to use these shell scripts to change mgmt password, probably not but making sure" [puppet] - 10https://gerrit.wikimedia.org/r/744874 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn)
[20:45:36] <wikibugs>	 (03PS1) 10Jbond: C:puppet_compiler: cast pathlike object to string [puppet] - 10https://gerrit.wikimedia.org/r/744875
[20:47:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:puppet_compiler: cast pathlike object to string [puppet] - 10https://gerrit.wikimedia.org/r/744875 (owner: 10Jbond)
[20:49:29] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) >>! In T272559#7552446, @jbond wrote: >>>! In T272559#7546852, @Dzahn wrote: >> icinga::nsca::client is used in fundraising. so there are special case...
[20:51:35] <icinga-wm>	 PROBLEM - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100%
[20:52:33] <wikibugs>	 10SRE, 10SRE Observability, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10herron) 05Open→03Resolved a:03herron Looking at the dashboard linked in the description there have been no logs received via...
[20:54:15] <icinga-wm>	 RECOVERY - Host elastic2037 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms
[20:54:27] <wikibugs>	 10SRE, 10vm-requests, 10Patch-For-Review: Site: (2) VM request for DMARC - https://phabricator.wikimedia.org/T169566 (10Dzahn) Hey @herron @akosiaris you might be surprised to see a notification on this ticket from 2017 but .. I just found it by digging backwards in history to find out why we have a "**role::...
[20:56:47] <wikibugs>	 (03PS1) 10Dzahn: delete role::dmarc [puppet] - 10https://gerrit.wikimedia.org/r/744877 (https://phabricator.wikimedia.org/T272559)
[20:57:12] <wikibugs>	 (03CR) 10Herron: [C: 03+1] delete role::dmarc [puppet] - 10https://gerrit.wikimedia.org/r/744877 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn)
[20:58:51] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] delete role::dmarc [puppet] - 10https://gerrit.wikimedia.org/r/744877 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn)
[21:04:01] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 96.11% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[21:06:26] <wikibugs>	 (03CR) 10RLazarus: imagecatalog: Install and configure OCI image catalog on deploy hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus)
[21:06:41] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) >  profile::beta::motd  This isn't instantiated and does not have any include line elsewhere but it shows up like this:  hieradata/cloud/eqiad1/deploy...
[21:07:56] <dancy>	 I have returned.
[21:10:05] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:11:22] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) xdummy: T133183#7554483
[21:11:30] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn)
[21:15:33] <wikibugs>	 10SRE, 10Observability-Logging: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) p:05Triage→03Medium
[21:15:47] <wikibugs>	 10SRE, 10Observability-Logging: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron)
[21:15:50] <wikibugs>	 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10herron)
[21:16:12] <wikibugs>	 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10herron) 05Resolved→03Open Reopening this as progress has been made retiring legacy log inputs and now we're ready to move on to decom of the Ganeti VMs....
[21:17:10] <dancy>	 Starting train stuff now.  testwikis first
[21:17:11] <wikibugs>	 (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/744862 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron)
[21:18:35] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
[21:18:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:35] <wikibugs>	 (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/744862 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron)
[21:22:09] <wikibugs>	 (03PS1) 10Ahmon Dancy: testwikis wikis to 1.38.0-wmf.12  refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744878
[21:22:11] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.38.0-wmf.12  refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744878 (owner: 10Ahmon Dancy)
[21:23:39] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.12  refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744878 (owner: 10Ahmon Dancy)
[21:23:43] <logmsgbot>	 !log dancy@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.12  refs T293953
[21:23:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:49] <stashbot>	 T293953: 1.38.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T293953
[21:25:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[21:25:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[21:26:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:47] <wikibugs>	 (03CR) 10Cwhite: "I would rather we not move api-feature-usage into the elk7 cluster for several reasons:" [puppet] - 10https://gerrit.wikimedia.org/r/744862 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron)
[21:56:36] <wikibugs>	 10SRE, 10Observability-Logging, 10Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) > Cwhite  4:43 PM >  I would rather we not move api-feature-usage into the elk7 cluster for several reasons: >  >  1. We've wanted to move it...
[21:56:56] <wikibugs>	 10SRE, 10Observability-Logging, 10Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron)
[21:57:03] <wikibugs>	 10SRE, 10Elasticsearch, 10SRE Observability, 10Wikimedia-Logstash, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335 (10herron)
[21:57:48] <wikibugs>	 (03CR) 10Herron: logstash: move api-feature-usage outputs to elk7 cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/744862 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron)
[22:06:43] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1028.eqiad.wmnet with OS buster
[22:06:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:22] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster
[22:07:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:58] <logmsgbot>	 !log dancy@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.12  refs T293953 (duration: 44m 14s)
[22:08:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:02] <stashbot>	 T293953: 1.38.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T293953
[22:11:07] <icinga-wm>	 RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:12:33] <logmsgbot>	 !log dancy@deploy1002 Pruned MediaWiki: 1.38.0-wmf.7 (duration: 04m 18s)
[22:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:21] <wikibugs>	 (03PS1) 10Ahmon Dancy: group0 wikis to 1.38.0-wmf.12  refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744886
[22:13:23] <wikibugs>	 (03PS2) 10Jdlrobson: Clean up readers web team config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051
[22:13:25] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.38.0-wmf.12  refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744886 (owner: 10Ahmon Dancy)
[22:14:11] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.12  refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744886 (owner: 10Ahmon Dancy)
[22:15:24] <wikibugs>	 (03CR) 10Jdlrobson: Clean up readers web team config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 (owner: 10Jdlrobson)
[22:15:24] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.12  refs T293953
[22:15:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:29] <stashbot>	 T293953: 1.38.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T293953
[22:17:53] <dancy>	 The train has been rolled out to group0 wikis.  I will check on logs periodically for a bit.
[22:18:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[22:18:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[22:19:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[22:27:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[22:33:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:34] <wikibugs>	 (03PS1) 10Ebernhardson: rdf query service: limit namespace aliasing to /bigdata/namespace [puppet] - 10https://gerrit.wikimedia.org/r/744892
[22:49:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[22:49:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[22:56:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:13] <wikibugs>	 10SRE, 10Observability-Logging, 10Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) >  We have an ingester installed with appropriate access on the cirrus cluster which can do this via the work from https://phabricator.wikimed...
[23:07:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10Jclark-ctr) lvs1017 A7 U9    id# 1206202101   Port#26 lvs1018 B7 U29  id# 1206202102   Port#4 lvs1019 C7 U25  id# 1206202103   Port#30 lvs1020 D7 U41  id# 120620...
[23:08:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10Jclark-ctr)
[23:08:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson
[23:21:49] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster
[23:21:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:15] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:44:18] <wikibugs>	 (03PS1) 10MewOphaswongse: Add an image: Only validate caption if the recommendation is accepted [extensions/GrowthExperiments] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744896 (https://phabricator.wikimedia.org/T297250)
[23:53:04] <wikibugs>	 (03PS1) 10Jforrester: Fix invalid reference to core resources/ directory [extensions/CodeMirror] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744803 (https://phabricator.wikimedia.org/T296639)
[23:55:53] <icinga-wm>	 PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:56:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Fix invalid reference to core resources/ directory [extensions/CodeMirror] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744803 (https://phabricator.wikimedia.org/T296639) (owner: 10Jforrester)
[23:57:57] <wikibugs>	 (03CR) 10Jforrester: "recheck" [extensions/CodeMirror] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/744803 (https://phabricator.wikimedia.org/T296639) (owner: 10Jforrester)