[00:02:47] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:03:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:09:05] <wikibugs>	 SRE, DBA, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (tstarling)
[00:09:40] <wikibugs>	 SRE, DBA, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (tstarling) Task description edit: added plan for direct TLS, no connection pooling or tunnel.
[00:20:29] <wikibugs>	 SRE, SRE-OnFire, Release-Engineering-Team, Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (Dzahn) list of repos that exist on deployment servers but do not appear in the kubernetes.yaml. (just using the string that is the first level of th...
[00:22:43] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:27:05] <icinga-wm>	 PROBLEM - Disk space on gitlab1001 is CRITICAL: DISK CRITICAL - free space: /mnt/gitlab-backup 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1001&var-datasource=eqiad+prometheus/ops
[00:28:08] <mutante>	 ^ arr. checking that
[00:28:25] <mutante>	 it's "just" the backups but we made changes to avoid this
[00:28:53] <mutante>	 the good part is.. it didn't take the service down because that's a dedicated mount 
[00:33:01] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service,rsync-config-backup-gitlab1003.wikimedia.org.service daniel_zahn https://phabricator.wikimedia.org/T308089 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:01] <icinga-wm>	 ACKNOWLEDGEMENT - Disk space on gitlab1001 is CRITICAL: DISK CRITICAL - free space: /mnt/gitlab-backup 0 MB (0% inode=99%): daniel_zahn https://phabricator.wikimedia.org/T308089 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1001&var-datasource=eqiad+prometheus/ops
[00:33:01] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service daniel_zahn https://phabricator.wikimedia.org/T308089 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:48:19] <icinga-wm>	 RECOVERY - Disk space on gitlab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1001&var-datasource=eqiad+prometheus/ops
[00:50:49] <wikibugs>	 SRE: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (Dzahn)
[00:52:02] <wikibugs>	 SRE: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (Dzahn) We can just fix them but we can also question if they should/can be removed on non-active hosts (via puppet changes), whether they should really be CRIT etc.
[00:52:47] <wikibugs>	 SRE: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (Dzahn)
[00:56:51] <mutante>	 !log gitlab1001 - T308089 T274463 - '<+icinga-wm> PROBLEM - Disk space on gitlab1001 is CRITICAL: DISK CRITICAL - free space: /mnt/gitlab-backup 0 MB' - manually deleted 1653294190_2022_05_23_14.10.2_gitlab_backup.tar (we have May 24 and 25, 26 could not finish writing backup) - RECOVERY - Disk space on gitlab1001 is OK
[00:56:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:59] <stashbot>	 T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[00:56:59] <stashbot>	 T308089: gitlab-restore: version detection fail / restore fail - https://phabricator.wikimedia.org/T308089
[00:57:39] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:58:18] <mutante>	 !log gitlab1001 - T308089 T274463 - gitlab1001 - systemctl start full-backup
[00:58:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:21] <mutante>	 !log gitlab1003 - T308089 T274463 - gitlab1003 - systemctl status backup-restore is failed because it's looking for /mnt/gitlab-backup/latest/latest.tar needs gerrit:799016
[01:01:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:44] <wikibugs>	 (PS3) Dzahn: gitlab: switch backup location to /srv, don't use /etc [puppet] - https://gerrit.wikimedia.org/r/799016 (https://phabricator.wikimedia.org/T274463)
[01:02:59] <wikibugs>	 (CR) Dzahn: [C: +2] gitlab: switch backup location to /srv, don't use /etc [puppet] - https://gerrit.wikimedia.org/r/799016 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[01:05:59] <wikibugs>	 (CR) Dzahn: [C: +2] "noop on gitlab1001, change on gitlab2001, re-enabling puppet on gitlab1003 (puppet was stopped by restore script but could not finish)" [puppet] - https://gerrit.wikimedia.org/r/799016 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[01:10:00] <wikibugs>	 SRE, GitLab, serviceops: gitlab1004 - puppet cert revoked? - https://phabricator.wikimedia.org/T309259 (Dzahn)
[01:10:47] <wikibugs>	 (PS2) Dzahn: gitlab: rsync config and data backup to same folder on replica [puppet] - https://gerrit.wikimedia.org/r/799280 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[01:12:20] <wikibugs>	 SRE, serviceops, GitLab (Infrastructure): gitlab1004 - puppet cert revoked? - https://phabricator.wikimedia.org/T309259 (Dzahn)
[01:13:07] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:13:49] <wikibugs>	 (CR) Dzahn: [C: +2] gitlab: rsync config and data backup to same folder on replica [puppet] - https://gerrit.wikimedia.org/r/799280 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[01:20:04] <mutante>	 !log gitlab1003 - T308089 T274463 - gitlab1001 - deleted backups from April 4 and April 5 from /srv/gitlab-backup   AND  deleted partial failed backups from May 26 from /mnt/gitlab-backup; deployed both gerrit:799016 and  gerrit:799280 ; restarting full-backup service
[01:20:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:12] <stashbot>	 T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[01:20:12] <stashbot>	 T308089: gitlab-restore: version detection fail / restore fail - https://phabricator.wikimedia.org/T308089
[01:24:51] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:26:09] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:27:01] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:51] <mutante>	 !log T308089 T274463 - gitlab1001 - systemctl start rsync-config-backup-gitlab1003.wikimedia.org - Succeeded - RECOVERY - Check systemd state on gitlab1001 is OK
[01:27:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:59] <stashbot>	 T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[01:27:59] <stashbot>	 T308089: gitlab-restore: version detection fail / restore fail - https://phabricator.wikimedia.org/T308089
[01:34:55] <icinga-wm>	 PROBLEM - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298555)', diff saved to https://phabricator.wikimedia.org/P28566 and previous config saved to /var/cache/conftool/dbconfig/20220526-013741-ladsgroup.json
[01:37:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:48] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[01:40:33] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:43:51] <wikibugs>	 (PS10) Tim Starling: Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz)
[01:45:49] <wikibugs>	 (CR) Tim Starling: [C: +1] "PS10: globalKeyLB -> cluster, globalKeyLbDomain -> dbDomain, add Depends-On." [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz)
[01:46:39] <mutante>	 !log T308089 T274463 - gitlab1001 - still not enough disk space to finish full backup. moved backup of May 24th to /root/ . deleted latest.tar; started full-backup service once again 
[01:46:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:46:46] <stashbot>	 T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[01:46:46] <stashbot>	 T308089: gitlab-restore: version detection fail / restore fail - https://phabricator.wikimedia.org/T308089
[01:47:17] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:29] <wikibugs>	 (PS1) Tim Starling: Switch wgMainStash to db-mainstash [mediawiki-config] - https://gerrit.wikimedia.org/r/799433
[01:51:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[01:51:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[01:51:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:59] <wikibugs>	 (CR) Tim Starling: "Please give +1 for deployment after eval.php testing of db-mainstash." [mediawiki-config] - https://gerrit.wikimedia.org/r/799433 (owner: Tim Starling)
[01:52:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P28567 and previous config saved to /var/cache/conftool/dbconfig/20220526-015247-ladsgroup.json
[01:52:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P28568 and previous config saved to /var/cache/conftool/dbconfig/20220526-020752-ladsgroup.json
[02:07:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:51] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (backup1002, ...), Fresh: 109 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[02:14:21] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:23:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298555)', diff saved to https://phabricator.wikimedia.org/P28569 and previous config saved to /var/cache/conftool/dbconfig/20220526-022259-ladsgroup.json
[02:23:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[02:23:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[02:23:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:06] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[02:23:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298555)', diff saved to https://phabricator.wikimedia.org/P28570 and previous config saved to /var/cache/conftool/dbconfig/20220526-022307-ladsgroup.json
[02:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:25] <icinga-wm>	 PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:27:39] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:44:23] <icinga-wm>	 PROBLEM - Disk space on gitlab1001 is CRITICAL: DISK CRITICAL - free space: /mnt/gitlab-backup 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1001&var-datasource=eqiad+prometheus/ops
[02:52:31] <wikibugs>	 (CR) TsepoThoabala: [C: +1] Assign similareditors right to the checkuser group [mediawiki-config] - https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205) (owner: AGueyte)
[03:05:21] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:21:27] <icinga-wm>	 RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:05] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 31.32 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:33:07] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 37.6 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:35:41] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 45.34 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:37:37] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 108.6 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:37:49] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:37:57] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 87.45 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[04:02:29] <wikibugs>	 (PS5) Abijeet Patro: Add namespaces to Punjabi wikisource default search [mediawiki-config] - https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887)
[04:02:47] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:03:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:31:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298555)', diff saved to https://phabricator.wikimedia.org/P28571 and previous config saved to /var/cache/conftool/dbconfig/20220526-043126-ladsgroup.json
[04:31:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:31:34] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[04:33:05] <wikibugs>	 SRE, DBA, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (tstarling) Stalled→Open
[04:42:00] <wikibugs>	 SRE, DBA, Patch-For-Review, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (tstarling) a:tstarling
[04:42:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:45:31] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:45:43] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:46:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P28572 and previous config saved to /var/cache/conftool/dbconfig/20220526-044631-ladsgroup.json
[04:56:42] <wikibugs>	 (PS1) Tim Starling: Enable SSL for master DB connections in the secondary datacenter [mediawiki-config] - https://gerrit.wikimedia.org/r/799437 (https://phabricator.wikimedia.org/T134809)
[04:56:57] <icinga-wm>	 PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:01:21] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:01:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P28573 and previous config saved to /var/cache/conftool/dbconfig/20220526-050136-ladsgroup.json
[05:01:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:21] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:07:45] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:10:01] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:10:17] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:16:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298555)', diff saved to https://phabricator.wikimedia.org/P28574 and previous config saved to /var/cache/conftool/dbconfig/20220526-051641-ladsgroup.json
[05:16:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[05:16:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[05:16:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:49] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[05:16:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298555)', diff saved to https://phabricator.wikimedia.org/P28575 and previous config saved to /var/cache/conftool/dbconfig/20220526-051649-ladsgroup.json
[05:16:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111', diff saved to https://phabricator.wikimedia.org/P28576 and previous config saved to /var/cache/conftool/dbconfig/20220526-053155-marostegui.json
[05:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:54] <wikibugs>	 (PS1) Marostegui: db1111: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/799443 (https://phabricator.wikimedia.org/T308915)
[05:42:15] <wikibugs>	 (CR) Marostegui: [C: +2] db1111: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/799443 (https://phabricator.wikimedia.org/T308915) (owner: Marostegui)
[05:47:07] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2022-05-26-052433-production [deployment-charts] - https://gerrit.wikimedia.org/r/799638 (https://phabricator.wikimedia.org/T309161)
[05:53:24] <wikibugs>	 (PS1) Marostegui: db1111: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/799647
[05:54:09] <wikibugs>	 (CR) Marostegui: [C: +2] db1111: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/799647 (owner: Marostegui)
[05:59:13] * kart_ updating cxserver..
[05:59:39] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T0600).
[06:00:37] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:00:52] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2022-05-26-052433-production [deployment-charts] - https://gerrit.wikimedia.org/r/799638 (https://phabricator.wikimedia.org/T309161) (owner: KartikMistry)
[06:02:08] <kart_>	 marostegui: oops. I missed the switchover window as I was looking at May 26 in the deployment calendar.. my deployment will take few minutes only..
[06:03:17] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:04:29] <marostegui>	 kart_: no, no, go for it
[06:04:41] <marostegui>	 kart_: it is a predefined window each Tuesday and Thursday 
[06:04:46] <marostegui>	 but it is empty this week 
[06:05:06] <kart_>	 Ok. Thanks marostegui
[06:05:14] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2022-05-26-052433-production [deployment-charts] - https://gerrit.wikimedia.org/r/799638 (https://phabricator.wikimedia.org/T309161) (owner: KartikMistry)
[06:05:25] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 25 Jun 2022 07:55:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:05:41] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:06:01] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:06:11] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48107 bytes in 0.210 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:06:32] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:06:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:58] <RhinosF1>	 kart_: that's weird, the pin to US hours makes it show on the wrong day for us
[06:07:05] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:07:09] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.403 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:07:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:57] <kart_>	 RhinosF1: yes!
[06:10:18] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:10:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:10] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:11:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:23] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:12:43] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:12:51] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:12:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:46] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:13:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:06] <kart_>	 Anyone know why Grafana stopped showing restart/deploys in the graphs?
[06:14:25] <kart_>	 Button is there, but it has no effect.
[06:15:42] <kart_>	 !log Updated cxserver to 2022-05-26-052433-production (T309161, T308829, T308834)
[06:15:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:50] <stashbot>	 T308834: Enable Section Translation on some wikis while Content Translation remains in beta - https://phabricator.wikimedia.org/T308834
[06:15:50] <stashbot>	 T308829: Enable Section Translation on 10 Wikipedias where Content Translaiton is available by default - https://phabricator.wikimedia.org/T308829
[06:15:51] <stashbot>	 T309161: Infoxbox Writer template fails to translate with Google MT - https://phabricator.wikimedia.org/T309161
[06:19:03] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:19:25] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:21:39] <wikibugs>	 SRE, Deployments, Parsoid, bacula, Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (jcrespo) Incident lightweight report: https://wikitech.wikimedia.org/wiki/Incidents/2022-05-2_deployment
[06:21:45] <wikibugs>	 SRE: ProdPasteBot uses deprecated certificate auth - https://phabricator.wikimedia.org/T242857 (Aklapper) @DSharpe: Do you maybe know the answer to my last comment (or know someone who could)? Thanks!
[06:31:14] <wikibugs>	 SRE, SRE-OnFire, Release-Engineering-Team, Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (jcrespo) @Dzahn That doesn't seem right- mediawiki-staging is the current main method of deploying mediawiki, and httpbb-tests seems in active usage...
[06:32:30] <wikibugs>	 (PS1) Tim Starling: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809)
[06:44:24] <icinga-wm>	 PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[06:46:36] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:48:17] <wikibugs>	 (03CR) 10Marostegui: "Does this change in anyway the way we do operations on the stand-by DC at the moment? ie: right now we don't even have to depool the codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling)
[07:00:04] <jouncebot>	 Amir1 and apergos: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T0700).
[07:00:04] <jouncebot>	 samwilson: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:11] <wikibugs>	 10SRE, 10SRE-OnFire, 10Release-Engineering-Team, 10Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (10jcrespo) This is a list of resources configured on puppet, but I am not sure if the list is exhaustive: ` File[/srv/deployment/scap] from /etc/puppe...
[07:00:11] <apergos>	 hello!
[07:00:30] <samwilson>	 hello :)
[07:00:47] <apergos>	 no trainees are scheduled for today's window
[07:01:12] <apergos>	 there are two patches in the window only, and they are yours, samwilson
[07:01:20] <apergos>	 are you doing self deploy?
[07:02:08] <samwilson>	 no (although I guess I could be a trainee!). Can you deploy? I'm here to test, and Satdeep is going to help test too.
[07:02:13] <apergos>	 ah ok
[07:02:25] <apergos>	 I'll do that then
[07:02:44] <samwilson>	 :) thanks
[07:03:19] <apergos>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/796385 this says there's a merge conflict, can you sort that?
[07:03:40] <apergos>	 samwilson: 
[07:03:44] <samwilson>	 sure, doing now
[07:03:47] <apergos>	 ty
[07:04:18] <wikibugs>	 (03PS6) 10Samwilson: Enable Realtime Preview on more pilot wikis: huwiki and fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/796385 (https://phabricator.wikimedia.org/T303961)
[07:05:16] <apergos>	 (I don't feel comfortable doing both  a training and the deploy by myself, so deploy it is)
[07:06:02] <samwilson>	 sure!
[07:06:19] <samwilson>	 I really should re-learn deployment stuff. I did do it once, years ago.
[07:06:58] <apergos>	 you should. sign up for a training!
[07:07:40] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] Enable Realtime Preview on more pilot wikis: huwiki and fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/796385 (https://phabricator.wikimedia.org/T303961) (owner: 10Samwilson)
[07:09:25] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Realtime Preview on more pilot wikis: huwiki and fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/796385 (https://phabricator.wikimedia.org/T303961) (owner: 10Samwilson)
[07:10:33] <apergos>	 samwilson: live on mwdebug1002, please test
[07:10:45] <samwilson>	 thanks. testing now.
[07:12:45] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update docker image for revscoring-editquality-* [deployment-charts] - 10https://gerrit.wikimedia.org/r/799349 (https://phabricator.wikimedia.org/T309102) (owner: 10Elukey)
[07:12:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:12:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:13:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:13:52] <samwilson>	 apergos: looks good, go for it
[07:13:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:14:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:21] <logmsgbot>	 !log ariel@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:796385|Enable Realtime Preview on more pilot wikis: huwiki and fiwiki (T303961)]] (duration: 00m 51s)
[07:15:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:26] <stashbot>	 T303961: Rollout plan for real-time preview - https://phabricator.wikimedia.org/T303961
[07:15:42] <apergos>	 samwilson: it's live, please do any followup testing
[07:16:38] <samwilson>	 yep, all looks as it should.
[07:17:52] <icinga-wm>	 RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[07:18:17] <apergos>	 seems ok, proceeding
[07:18:24] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[07:18:26] <apergos>	 heh merge conflict
[07:18:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:32] <apergos>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/793799 
[07:18:48] <apergos>	 please sort, samwilson
[07:19:03] <wikibugs>	 (03PS6) 10Samwilson: Add namespaces to Punjabi wikisource default search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887) (owner: 10Abijeet Patro)
[07:19:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:19:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:16] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] Add namespaces to Punjabi wikisource default search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887) (owner: 10Abijeet Patro)
[07:20:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:20:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:20:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:06] <wikibugs>	 (03Merged) 10jenkins-bot: Add namespaces to Punjabi wikisource default search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887) (owner: 10Abijeet Patro)
[07:22:27] <apergos>	 samwilson: live on mwdebug1002, please test.
[07:22:42] <samwilson>	 testing now
[07:23:12] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[07:23:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:39] <samwilson>	 apergos: looks great, is working.
[07:24:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:24:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:12] <logmsgbot>	 !log ariel@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793799|Add namespaces to Punjabi wikisource default search (T287887)]] (duration: 00m 50s)
[07:25:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:18] <stashbot>	 T287887: Optimize default search namespaces for Punjabi wikisources - https://phabricator.wikimedia.org/T287887
[07:25:22] <apergos>	 samwilson: live, please do followup testing
[07:25:57] <samwilson>	 testing now. satdeep is also testing.
[07:29:33] <samwilson>	 apergos: all good!
[07:29:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:29:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:50] <apergos>	 looks good from here too
[07:30:17] <apergos>	 thank you for choosing us as your deployment providers today, do come back again!
[07:30:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:30:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:30:38] <samwilson>	 :-) no, thank you!
[07:30:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:51] <samwilson>	 and I will try to do the training at some point 
[07:31:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:31:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:11] <apergos>	 https://wikitech.wikimedia.org/wiki/Deployments/Training  https://phabricator.wikimedia.org/maniphest/task/edit/form/96/  how to sign up, samwilson
[07:32:14] <apergos>	 see you there!
[07:41:05] <wikibugs>	 (03PS1) 10Marostegui: es2030,es2022: Install 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/799789 (https://phabricator.wikimedia.org/T309265)
[07:44:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298555)', diff saved to https://phabricator.wikimedia.org/P28577 and previous config saved to /var/cache/conftool/dbconfig/20220526-074436-ladsgroup.json
[07:44:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:43] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[07:45:18] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] es2030,es2022: Install 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/799789 (https://phabricator.wikimedia.org/T309265) (owner: 10Marostegui)
[07:47:40] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:50:04] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:55:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[07:55:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[07:55:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298560)', diff saved to https://phabricator.wikimedia.org/P28578 and previous config saved to /var/cache/conftool/dbconfig/20220526-075525-ladsgroup.json
[07:55:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:34] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[07:59:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P28579 and previous config saved to /var/cache/conftool/dbconfig/20220526-075941-ladsgroup.json
[07:59:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] <jouncebot>	 dancy and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T0800).
[08:02:47] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:03:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:09:12] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:10:04] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268 (10Majavah)
[08:14:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P28580 and previous config saved to /var/cache/conftool/dbconfig/20220526-081446-ladsgroup.json
[08:14:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:19] <wikibugs>	 (03PS2) 10Majavah: hieradata: purge stale sudoers.d entries in production [puppet] - 10https://gerrit.wikimedia.org/r/799268 (https://phabricator.wikimedia.org/T309268)
[08:18:21] <wikibugs>	 (03PS1) 10Majavah: Remove some unmanaged files from sudoers.d [puppet] - 10https://gerrit.wikimedia.org/r/799820 (https://phabricator.wikimedia.org/T309268)
[08:20:54] <wikibugs>	 (03CR) 10Majavah: hieradata: purge stale sudoers.d entries in production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799268 (https://phabricator.wikimedia.org/T309268) (owner: 10Majavah)
[08:21:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Fix problems found by github.com/cloudflare/pint [alerts] - 10https://gerrit.wikimedia.org/r/799285 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[08:28:12] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v2.5.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/799831
[08:29:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298555)', diff saved to https://phabricator.wikimedia.org/P28581 and previous config saved to /var/cache/conftool/dbconfig/20220526-082951-ladsgroup.json
[08:29:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[08:29:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[08:29:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:58] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[08:30:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:39] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v2.5.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/799831 (owner: 10Volans)
[08:40:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1032 T309265', diff saved to https://phabricator.wikimedia.org/P28582 and previous config saved to /var/cache/conftool/dbconfig/20220526-084009-marostegui.json
[08:40:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:16] <stashbot>	 T309265: Migrate 4 DB ES hosts to 10.6 - https://phabricator.wikimedia.org/T309265
[08:41:25] <wikibugs>	 (03PS1) 10Majavah: P:openstack::pdns: remove unused sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/799839
[08:42:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:42:48] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35551/console" [puppet] - 10https://gerrit.wikimedia.org/r/799839 (owner: 10Majavah)
[08:43:34] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v2.5.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/799831 (owner: 10Volans)
[08:44:38] <wikibugs>	 (03PS1) 10Marostegui: es1032: Install MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/799841 (https://phabricator.wikimedia.org/T309265)
[08:44:56] <wikibugs>	 (03CR) 10Tim Starling: Add the master from the primary DC to the secondary DC load arrays (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling)
[08:45:18] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] es1032: Install MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/799841 (https://phabricator.wikimedia.org/T309265) (owner: 10Marostegui)
[08:52:19] <wikibugs>	 (03PS1) 10Volans: Upstream release v2.5.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/799844
[08:52:56] <wikibugs>	 (03CR) 10Marostegui: Add the master from the primary DC to the secondary DC load arrays (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling)
[08:55:32] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[08:55:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:53] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[08:56:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:08] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:57:59] <wikibugs>	 (03PS1) 10Majavah: P:openstack::puppetmaster: remove unused stuff [puppet] - 10https://gerrit.wikimedia.org/r/799845
[08:58:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:58:55] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35552/console" [puppet] - 10https://gerrit.wikimedia.org/r/799845 (owner: 10Majavah)
[09:01:28] <icinga-wm>	 RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:03:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sudo: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799371 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[09:03:40] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:03:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:04:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] toolforge: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/797339 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[09:05:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] statograph: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799373 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[09:05:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] statsite: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799372 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[09:06:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] squid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799377 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[09:06:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "merging all the ones i +2'ed thanks <3" [puppet] - 10https://gerrit.wikimedia.org/r/799377 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[09:08:15] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Remove prod-specific bits from cloud puppetmasters - https://phabricator.wikimedia.org/T309281 (10Majavah)
[09:08:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/799381 (owner: 10Volans)
[09:09:32] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v2.5.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/799844 (owner: 10Volans)
[09:09:36] <wikibugs>	 (03PS2) 10Majavah: P:openstack::puppetmaster: remove conftool client [puppet] - 10https://gerrit.wikimedia.org/r/799845 (https://phabricator.wikimedia.org/T309281)
[09:11:52] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM should help with future changes on these thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/799381 (owner: 10Volans)
[09:12:31] <wikibugs>	 (03CR) 10Volans: [C: 03+2] transports: allow to set a global timeout [software/homer] - 10https://gerrit.wikimedia.org/r/799375 (owner: 10Volans)
[09:12:34] <wikibugs>	 (03CR) 10Volans: [C: 03+2] devices: allow to pass additional metadata [software/homer] - 10https://gerrit.wikimedia.org/r/799376 (owner: 10Volans)
[09:15:42] <wikibugs>	 (03Merged) 10jenkins-bot: transports: allow to set a global timeout [software/homer] - 10https://gerrit.wikimedia.org/r/799375 (owner: 10Volans)
[09:15:50] <wikibugs>	 (03Merged) 10jenkins-bot: devices: allow to pass additional metadata [software/homer] - 10https://gerrit.wikimedia.org/r/799376 (owner: 10Volans)
[09:16:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:16:31] <wikibugs>	 (03CR) 10Jbond: "Thanks LGTM, missed from the last one but could you add an entry to the change log e.g." [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall)
[09:16:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:46] <wikibugs>	 (03PS5) 10MSantos: WIP: introduce geoshapes service [deployment-charts] - 10https://gerrit.wikimedia.org/r/768678 (https://phabricator.wikimedia.org/T302967)
[09:17:05] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v2.5.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/799844 (owner: 10Volans)
[09:17:24] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Sgs) >>! In T309045#7954779, @Dzahn wrote: >>>! In T309045#7950982, @MShilova_WMF wrote: >> I confirm that @sgs needs access to a production server and it...
[09:18:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: introduce geoshapes service [deployment-charts] - 10https://gerrit.wikimedia.org/r/768678 (https://phabricator.wikimedia.org/T302967) (owner: 10MSantos)
[09:19:01] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:19:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:40] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:21:20] <volans>	 !log uploaded spicerack_2.5.0 to apt.wikimedia.org bullseye-wikimedia
[09:21:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:22:01] <wikibugs>	 (03PS1) 10Majavah: puppetmaster: remove 'allow_from' [puppet] - 10https://gerrit.wikimedia.org/r/799859
[09:24:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM will leave for WMCS to merge" [puppet] - 10https://gerrit.wikimedia.org/r/799845 (https://phabricator.wikimedia.org/T309281) (owner: 10Majavah)
[09:26:03] <wikibugs>	 (03PS2) 10Jbond: Remove some unmanaged files from sudoers.d [puppet] - 10https://gerrit.wikimedia.org/r/799820 (https://phabricator.wikimedia.org/T309268) (owner: 10Majavah)
[09:26:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/799820 (https://phabricator.wikimedia.org/T309268) (owner: 10Majavah)
[09:26:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:29:42] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35556/console" [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah)
[09:33:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35557/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799386 (owner: 10Jbond)
[09:34:05] <wikibugs>	 (03CR) 10Majavah: "see also https://gerrit.wikimedia.org/r/c/operations/puppet/+/799344" [puppet] - 10https://gerrit.wikimedia.org/r/799386 (owner: 10Jbond)
[09:38:33] <icinga-wm>	 PROBLEM - Host db1128 is DOWN: PING CRITICAL - Packet loss = 100%
[09:38:43] <marostegui>	 what
[09:38:49] <marostegui>	 it is indeed down
[09:39:00] <volans>	 I did get paged but this is not a #page alert, weird too
[09:39:04] <jynus>	 connectivity or hw, you know?
[09:39:04] <volans>	 marostegui: need  a hand?
[09:39:05] <_joe_>	 uh
[09:39:05] <marostegui>	 It is a master
[09:39:14] <marostegui>	 m1 master, let me check
[09:39:15] <volans>	 I can ssh to the mgmt
[09:39:17] <marostegui>	 RO should be fine
[09:39:25] <marostegui>	 volans: can you reboot or check what happened?
[09:39:29] <_joe_>	 marostegui: m1 is what?
[09:39:37] <jynus>	 wait before rebooting
[09:39:39] <marostegui>	 _joe_: misc services
[09:39:41] <jynus>	 it may be network
[09:39:53] <jynus>	 volans: is he host up?
[09:39:55] <jynus>	 *the
[09:39:58] <volans>	 host is up
[09:40:03] <volans>	  09:39:56 up 0 min,  1 user,  load average: 0.36, 0.09, 0.03
[09:40:03] <sobanski>	 _joe_: misc utilitie: Bacula, Etherpad, etc.
[09:40:05] <volans>	 but just rebooted
[09:40:06] <marostegui>	 so rebooted
[09:40:10] <sobanski>	 utilities even
[09:40:10] <_joe_>	 ahh just rebooted
[09:40:12] <jynus>	 "just"
[09:40:13] <_joe_>	 sobanski: thanks
[09:40:15] <icinga-wm>	 RECOVERY - Host db1128 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[09:40:22] <marostegui>	 jynus: ..
[09:40:27] <volans>	 I'm checking hardware logs
[09:40:32] <volans>	 given I'm already in
[09:40:52] <jbond>	 fyi pki is also on misc
[09:40:55] <_joe_>	 well whatever the reason, I guess we're in for a master switchover in m1?
[09:40:59] <marostegui>	 no
[09:41:00] <Amir1>	 I woke up to the page
[09:41:14] <vgutierrez>	 which page?
[09:41:15] <vgutierrez>	 :)
[09:41:17] <_joe_>	 Amir1: who needs an alarm clock when you have pages?
[09:41:30] <Amir1>	 It's probably a loose cable again 
[09:41:35] <wikibugs>	 (03PS1) 10Majavah: hieradata: set swift_clusters: {} on cloud [puppet] - 10https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281)
[09:41:40] <vgutierrez>	 oh right, VO pages on the host being down, funny :)
[09:41:43] <marostegui>	 I am starting mariadb
[09:41:46] <marostegui>	 Storage seems ok 
[09:42:00] <icinga-wm>	 PROBLEM - MariaDB Replica IO: m1 on db1117 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1128.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1128.eqiad.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:42:02] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:42:07] <volans>	 marostegui: Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.
[09:42:12] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268 (10jbond) similar to the previous task on [[ https://phabricator.wikimedia.org/T214605#5756945 | apt directories ]], i have queried the repo for ma...
[09:42:13] <jynus>	 File /var/log/journal/d2918de808fb4bc5ba5ad42f3e7b95c5/system.journal corrupted or uncleanly shut down, renaming and replacing
[09:42:41] <volans>	 same error happened on the 2022-03-17 and back in 2022-02-27, but this first one was a correctable error
[09:42:44] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35558/console" [puppet] - 10https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281) (owner: 10Majavah)
[09:42:49] <marostegui>	 volans: we need new memory then
[09:42:50] <volans>	 so it seems a bad DIMM
[09:42:59] <volans>	 if we have a task I can paste the logs
[09:43:09] <marostegui>	 I am checking the data before failing the proxy back
[09:43:24] <volans>	 k
[09:43:39] <_joe_>	 marostegui: shouldn't we switch masters if this server has faulty dimms?
[09:43:52] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:44:10] <icinga-wm>	 RECOVERY - MariaDB Replica IO: m1 on db1117 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:44:16] <_joe_>	 also, I'd be happy to do any operational step myself for failovers / etc
[09:44:42] <marostegui>	 I prefer not if we can avoid it, I can prepare a proper host today and then switch it, but I prefer not to switch to db1117:3321 for now
[09:45:01] <_joe_>	 ack
[09:45:05] <_joe_>	 it's your call 
[09:45:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298560)', diff saved to https://phabricator.wikimedia.org/P28583 and previous config saved to /var/cache/conftool/dbconfig/20220526-094509-ladsgroup.json
[09:45:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:15] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[09:45:25] <marostegui>	 I am going to failback the proxies again
[09:45:26] <_joe_>	 I am happy with whatever you think is best
[09:45:37] <_joe_>	 marostegui: is there a runbook for the failback?
[09:45:40] <marostegui>	 I will have a replacement ready in a few hours
[09:46:01] <wikibugs>	 (03PS2) 10Majavah: hieradata: set swift_clusters: {} on cloud [puppet] - 10https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281)
[09:46:02] <_joe_>	 sorry, being oncall, I'd prefer to be able to perform at least the failbacks myself
[09:46:06] <Amir1>	 marostegui: do you need me for anything?
[09:46:13] <marostegui>	 sorry _joe_ just did it
[09:46:23] <marostegui>	 But it is basically reloading the proxies
[09:46:26] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:46:28] <_joe_>	 marostegui: you need it in a runbook
[09:46:41] <_joe_>	 about dbproxy specifically
[09:46:46] <marostegui>	 _joe_: for later I guess, let me address all this first
[09:46:48] <_joe_>	 the page linked in the alert isn't helpful
[09:46:59] <_joe_>	 marostegui: sure sorry I wasn't implying for now
[09:47:06] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35559/console" [puppet] - 10https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281) (owner: 10Majavah)
[09:47:11] <wikibugs>	 (03CR) 10Majavah: hieradata: set swift_clusters: {} on cloud [puppet] - 10https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281) (owner: 10Majavah)
[09:47:21] <marostegui>	 _joe_: https://wikitech.wikimedia.org/wiki/HAProxy this tells what to do
[09:47:24] <jynus>	 https://phabricator.wikimedia.org/T309286
[09:47:25] <marostegui>	 not very clearly, but it does
[09:47:31] <marostegui>	 jynus: thanks
[09:47:57] <marostegui>	 we need to restart etherpad
[09:48:14] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:48:25] <marostegui>	 can someone do that?
[09:48:30] <jynus>	 doing
[09:48:34] <marostegui>	 thanks jynus 
[09:48:50] <marostegui>	 services we need to check: https://phabricator.wikimedia.org/P28584
[09:49:01] <marostegui>	 especially writes
[09:49:21] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268 (10jbond) removing `nagios_long_procs` as it was dropped in https://gerrit.wikimedia.org/r/c/operations/puppet/+/723543/4/modules/base/manifests/mo...
[09:49:30] <jynus>	 marostegui: I did it but it didn't work
[09:49:39] <jynus>	 do I need to restart apache or something?
[09:50:11] <jynus>	 or maybe it works now, just took a minute?
[09:50:19] <jynus>	 or maybe connections were killed?
[09:50:28] <wikibugs>	 10ops-eqiad, 10DBA, 10Data-Persistence: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 (10Volans) According to `racadm lclog view` it's a bad DIMM, `DIMM_A6` in particular, and it happened already on 2022-03-17 (but it didn't trigger a reboot) and on 2022-02-27 (alt...
[09:50:33] <sobanski>	 Etherpad loads for me
[09:50:39] <sobanski>	 And I just created a new pad
[09:50:47] <jynus>	 sobanski:  it didn't work immediately after restart
[09:51:04] <marostegui>	 jynus: usually it is the etherpad service (at least what I have seen before)
[09:51:09] * volans updated task with HW logs
[09:51:10] <vgutierrez>	 jynus: if apache2 tried a graceful restart that could explain the delay
[09:51:15] <marostegui>	 jynus: can you run a quick bacula test?
[09:51:27] <marostegui>	 I am going to prepare a new host, it will take a few hours
[09:51:33] <vgutierrez>	 etherpad keeps WS open to every client.. those are long lived connections that will keep the workers busy for a while
[09:51:43] <_joe_>	 do we still have racktables?
[09:52:03] <jynus>	 ongoing es backups failed
[09:52:14] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:52:16] <jynus>	 7TB to retry
[09:52:19] <marostegui>	 _joe_: only on RO mode as far as  I remember
[09:52:21] <volans>	 _joe_: yes, in RO mode
[09:52:22] <wikibugs>	 10ops-eqiad, 10DBA, 10Data-Persistence: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 (10Marostegui) p:05Triage→03High We need to build a new host and switchover db1128 so we can replace its memory.
[09:52:39] <_joe_>	 so nothing to verify there really
[09:52:48] <icinga-wm>	 PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service,cfssl-ocsprefresh-cloud_wmnet_ca.service,cfssl-ocsprefresh-kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:53:10] <wikibugs>	 10ops-eqiad, 10DBA, 10Data-Persistence: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 (10Volans) As an action item for later, we should check why the page didn't have the `#page` hashtag on the IRC alert:  ` icinga-wm| PROBLEM - Host db1128 is DOWN: PING CRITICAL -...
[09:54:02] <marostegui>	 volans: maybe we need to review https://phabricator.wikimedia.org/T233684 and check if something is missing
[09:54:14] <marostegui>	 I am going to look for a host to replace db1128
[09:54:45] <volans>	 marostegui: are you replying to my comment on the task about the #-page hashtag?
[09:55:22] <marostegui>	 volans: yeah :)
[09:55:40] <volans>	 ack
[09:55:52] <wikibugs>	 10ops-eqiad, 10DBA, 10Data-Persistence: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 (10Marostegui) a:03Marostegui
[09:56:04] <wikibugs>	 10ops-eqiad, 10DBA, 10Data-Persistence: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 (10Marostegui) I am going to replace db1128 with a s4 host for now.
[09:56:42] <wikibugs>	 10ops-eqiad, 10DBA, 10Data-Persistence: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 (10Marostegui) Should take a few hours and later I will do an emergency m1 switchover, don't want to leave db1128 running like this for the weekend
[09:56:54] <wikibugs>	 (03PS1) 10Filippo Giunchedi: rsyslog: bound disk-assisted queues [puppet] - 10https://gerrit.wikimedia.org/r/799870 (https://phabricator.wikimedia.org/T308439)
[09:57:04] <wikibugs>	 10ops-eqiad, 10DBA, 10Data-Persistence: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 (10Marostegui)
[09:57:27] <jynus>	 marostegui: backups and restores seem to work well, but I have to retry what were ongoing backups
[09:57:38] <volans>	 marostegui: I know the issue now, I'll send a patch 
[09:57:47] <wikibugs>	 (03PS1) 10Majavah: monitoring::icinga::git_merge: use sudo::rule [puppet] - 10https://gerrit.wikimedia.org/r/799871 (https://phabricator.wikimedia.org/T309268)
[09:57:53] <marostegui>	 volans: <3
[09:57:59] <marostegui>	 jynus: maybe it is worth waiting for db1128 replacement?
[09:58:07] <marostegui>	 cause I will kill connections to run the switchover today
[09:58:10] <jynus>	 yeah, I was about to say that
[09:58:17] <marostegui>	 yeah, worth waiting then
[09:58:21] <marostegui>	 I will try to get it done fast
[09:58:22] <jynus>	 but we should do it shortly
[09:58:32] <jynus>	 like, before the end of the week
[09:58:42] <marostegui>	 jynus: I am planning to do it in a few hours
[09:58:52] <marostegui>	 it shouldn't take long, I am deciding which host to pick now
[09:59:08] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35561/console" [puppet] - 10https://gerrit.wikimedia.org/r/799871 (https://phabricator.wikimedia.org/T309268) (owner: 10Majavah)
[09:59:15] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268 (10jbond) new updated list with removed nagios_long_procs and also with a fixed file list  ` sudo cumin '*' 'ls -1  /etc/sudoers.d/ | grep -Ev "mw-...
[09:59:46] <volans>	 _joe_, vgutierrez: for you I have a different problem, it seems to me (at least from my VO app) that the incident was not auto-resolved on VO, if you could have a look
[10:00:04] <jouncebot>	 mvolz: (Dis)respected human, time to deploy Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T1000). Please do the needful.
[10:00:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1164 T309286', diff saved to https://phabricator.wikimedia.org/P28585 and previous config saved to /var/cache/conftool/dbconfig/20220526-100013-marostegui.json
[10:00:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P28586 and previous config saved to /var/cache/conftool/dbconfig/20220526-100020-ladsgroup.json
[10:00:22] <stashbot>	 T309286: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286
[10:00:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:13] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove db1164 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/799876 (https://phabricator.wikimedia.org/T309286)
[10:01:15] <_joe_>	 volans: yes it wasn't
[10:01:19] <_joe_>	 I'll resolve it
[10:02:08] <godog>	 indeed, we have looked into why host pages don't resolve automatically but IIRC found no smoking gun yet, thanks for resolving though
[10:02:11] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1164 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/799876 (https://phabricator.wikimedia.org/T309286) (owner: 10Marostegui)
[10:02:11] <godog>	 I'll find the task
[10:02:36] <godog>	 T264016
[10:02:37] <stashbot>	 T264016: Host page did not auto-resolve in VO - https://phabricator.wikimedia.org/T264016
[10:03:47] <jynus>	 expect backup check alerts in the next hours due to backup failures and probable delays
[10:05:15] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Move db1164 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/799883 (https://phabricator.wikimedia.org/T309286)
[10:05:17] <marostegui>	 jynus: can you review  ^
[10:05:35] <jynus>	 doing
[10:05:38] <marostegui>	 !log Stop mysql on db1117:3321 to clone db1164 T309286
[10:05:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:44] <stashbot>	 T309286: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286
[10:06:57] <jynus>	 marostegui: only one suggestion- let's add monitoring enabled:false to db1128?
[10:07:06] <marostegui>	 sounds good, let me do it
[10:07:23] <jynus>	 maybe to the new one, temporarily
[10:07:41] <marostegui>	 jynus: the new one has it set to false on that patch
[10:07:44] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Move db1164 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/799883 (https://phabricator.wikimedia.org/T309286)
[10:07:49] <jynus>	 ah, sorry, I missed that
[10:07:59] <marostegui>	 or you mean fully disable it?
[10:08:04] <marostegui>	 let's fully disable it instead
[10:08:22] <jynus>	 yeah, is_critical + enabled false until fully setup
[10:08:32] <wikibugs>	 (03PS3) 10Marostegui: mariadb: Move db1164 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/799883 (https://phabricator.wikimedia.org/T309286)
[10:08:34] <marostegui>	 done ^
[10:08:47] <jynus>	 trying to avoid more pages
[10:09:29] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] mariadb: Move db1164 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/799883 (https://phabricator.wikimedia.org/T309286) (owner: 10Marostegui)
[10:09:41] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1164 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/799883 (https://phabricator.wikimedia.org/T309286) (owner: 10Marostegui)
[10:09:44] <jynus>	 you merge, start preparing everything while I prepare the backups patch
[10:10:02] <marostegui>	 yep, starting the cloning now
[10:10:14] <jynus>	 once monitoring is in place (maybe except read only) we reenable notifications
[10:11:07] <marostegui>	 dbproxy irc alerts might trigger as db1117 might flap (network saturation)
[10:11:26] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:11:28] <marostegui>	 Updating zarcillo now
[10:12:14] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[10:12:26] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[10:12:36] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[10:12:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1164 from dbctl', diff saved to https://phabricator.wikimedia.org/P28588 and previous config saved to /var/cache/conftool/dbconfig/20220526-101250-marostegui.json
[10:12:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:50] <wikibugs>	 (03PS1) 10Jcrespo: Update dbackups check and statistics to use db1164 instead of db1128 [puppet] - 10https://gerrit.wikimedia.org/r/799894 (https://phabricator.wikimedia.org/T309286)
[10:14:47] <wikibugs>	 (03CR) 10Marostegui: "Looks good, I will let you know once the host is up" [puppet] - 10https://gerrit.wikimedia.org/r/799894 (https://phabricator.wikimedia.org/T309286) (owner: 10Jcrespo)
[10:14:50] <jynus>	 ^marostegui low prio
[10:15:02] <jynus>	 will be done once things have been stable for a while
[10:15:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P28589 and previous config saved to /var/cache/conftool/dbconfig/20220526-101525-ladsgroup.json
[10:15:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:32] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "Not yet until failover is done and things are stable." [puppet] - 10https://gerrit.wikimedia.org/r/799894 (https://phabricator.wikimedia.org/T309286) (owner: 10Jcrespo)
[10:17:00] <wikibugs>	 10ops-eqiad: db1128 faulty memory - https://phabricator.wikimedia.org/T309291 (10Marostegui)
[10:17:07] <wikibugs>	 10ops-eqiad: db1128 faulty memory - https://phabricator.wikimedia.org/T309291 (10Marostegui) p:05Triage→03Medium
[10:17:13] <marostegui>	 volans: I have created a specific task for dcops
[10:18:40] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[10:19:34] <volans>	 marostegui: ack, sorry if mixed the dcops data in that task
[10:19:42] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268 (10jbond) i also slightly update the script io used for apt to hanlde if the path parameter is used and also use the simpler pql syntax  ` lang=pyt...
[10:19:49] <volans>	 *if I
[10:21:24] <marostegui>	 volans: not a problem, I create a new one so they don't get lost in all the comments 
[10:21:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35562/console" [puppet] - 10https://gerrit.wikimedia.org/r/799870 (https://phabricator.wikimedia.org/T308439) (owner: 10Filippo Giunchedi)
[10:23:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[10:23:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[10:23:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298555)', diff saved to https://phabricator.wikimedia.org/P28590 and previous config saved to /var/cache/conftool/dbconfig/20220526-102308-ladsgroup.json
[10:23:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:15] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[10:23:22] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Failover m1 primary from db1128 to db1164 [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286)
[10:23:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/799871 (https://phabricator.wikimedia.org/T309268) (owner: 10Majavah)
[10:23:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] monitoring::icinga::git_merge: use sudo::rule [puppet] - 10https://gerrit.wikimedia.org/r/799871 (https://phabricator.wikimedia.org/T309268) (owner: 10Majavah)
[10:24:28] <jynus>	 marostegui: will check on misc docs if we have a checklist of all changes needed and the order, I've created so far https://gerrit.wikimedia.org/r/c/operations/puppet/+/799901
[10:24:56] <jynus>	 but you will likely have more experience on that
[10:25:18] <jynus>	 ah, it is well documented at https://wikitech.wikimedia.org/wiki/MariaDB#Misc_section_failover_checklist_(example_with_m2)
[10:26:35] <wikibugs>	 (03PS1) 10Volans: Icinga: add page hashtag to paging host alerts [puppet] - 10https://gerrit.wikimedia.org/r/799903
[10:26:56] <jynus>	 I think I have to add the haproxy change on the same change, control the sequence of deployment with puppet
[10:27:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Icinga: add page hashtag to paging host alerts [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans)
[10:28:24] <wikibugs>	 (03PS2) 10Volans: Icinga: add page hashtag to paging host alerts [puppet] - 10https://gerrit.wikimedia.org/r/799903
[10:28:44] <marostegui>	 jynus: yeah, we can do everything in a single patch, that patch + changing haproxy ips and databases
[10:28:48] <marostegui>	  I can do that, no problem
[10:28:55] <jynus>	 well, let me try
[10:28:58] <marostegui>	 Maybe we should even create a subtask for the switchover
[10:28:59] <jynus>	 and you review of course
[10:29:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Icinga: add page hashtag to paging host alerts [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans)
[10:29:02] <marostegui>	 with all the steps
[10:29:17] <jynus>	 the docs have some outdated stuff that is now on hiera/zarcillo
[10:29:20] <marostegui>	 jynus: https://phabricator.wikimedia.org/T302190
[10:29:31] <jynus>	 ah, will copy that
[10:29:54] <jynus>	 should we add db1164 as a temporary secondary, to check haproxy works as intended?
[10:30:30] <marostegui>	 yeah, but let's wait until it is up
[10:30:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298560)', diff saved to https://phabricator.wikimedia.org/P28591 and previous config saved to /var/cache/conftool/dbconfig/20220526-103030-ladsgroup.json
[10:30:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[10:30:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[10:30:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:38] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[10:30:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:52] <jynus>	 marostegui: yeah, it is to prepare doc/patches, not doing anything without your ok
[10:31:03] <marostegui>	 jynus: thanks, appreciate it!
[10:31:12] <marostegui>	 the transfer is half way done
[10:39:25] <jynus>	 T309296 but it is a quick copy and paste, will review now
[10:39:26] <stashbot>	 T309296: Failover m1 primary db from db1128 to db1164 - https://phabricator.wikimedia.org/T309296
[10:41:10] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (10jbond) >>! In T221529#5143984, @jbond wrote:  > The error happened as puppet-merge was rolling out changes.  I have not looked at how puppet-merge works but this looks like i...
[10:41:21] <jynus>	 marostegui: I am not so sure about those steps, since haproxy started being used by traffic, reload may happen automatically?
[10:41:39] <wikibugs>	 (03Abandoned) 10Jbond: nrpe: move plugins off the base nrpe class [puppet] - 10https://gerrit.wikimedia.org/r/799386 (owner: 10Jbond)
[10:42:39] <marostegui>	 jynus: no, it was disabled for us I believe
[10:42:45] <jynus>	 ah, ok
[10:43:11] <jynus>	 I am also not sure about the db-switchover parameter order
[10:44:29] <marostegui>	 what are you not sure about?
[10:44:48] <marostegui>	 transfer finished
[10:44:49] <jynus>	 not sure how db-switchover works
[10:45:03] <jynus>	 so just asking for you to review them
[10:45:08] <marostegui>	 yeah, no worries
[10:46:24] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Failover m1 primary from db1128 to db1164 [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286)
[10:46:38] <jynus>	 added the haproxy change^
[10:48:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "LGTM but see inline comments.  also see the following for why the previous issue failed" [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah)
[10:48:37] <wikibugs>	 (03CR) 10Marostegui: mariadb: Failover m1 primary from db1128 to db1164 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286) (owner: 10Jcrespo)
[10:48:58] <wikibugs>	 (03PS1) 10Jcrespo: dbproxy: Add db1164 as the m1 eqiad secondary [puppet] - 10https://gerrit.wikimedia.org/r/799915 (https://phabricator.wikimedia.org/T309286)
[10:49:24] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Failover m1 primary from db1128 to db1164 [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286)
[10:49:29] <wikibugs>	 (03CR) 10Jcrespo: mariadb: Failover m1 primary from db1128 to db1164 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286) (owner: 10Jcrespo)
[10:49:54] <jynus>	 let me rebase on top of the latest patch
[10:50:34] <jynus>	 uh, it says conflict, will have to rebase manually
[10:51:02] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:52:28] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Failover m1 primary from db1128 to db1164 [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286)
[10:52:49] <marostegui>	 https://phabricator.wikimedia.org/T309286#7960042
[10:52:52] <marostegui>	 jynus: ^
[10:53:16] <jynus>	 oh, I had created T309296
[10:53:17] <stashbot>	 T309296: Failover m1 primary db from db1128 to db1164 - https://phabricator.wikimedia.org/T309296
[10:53:21] <marostegui>	 ah sorry
[10:53:24] <marostegui>	 didn't see it
[10:53:24] <jynus>	 let's compare :-)
[10:54:28] <icinga-wm>	 RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:55:32] <jynus>	 the ticket on my log was wrong
[10:55:46] <wikibugs>	 (03PS4) 10Majavah: nrpe: manage sudo rules via nrpe::check (try 2) [puppet] - 10https://gerrit.wikimedia.org/r/799344
[10:56:25] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] nrpe: manage sudo rules via nrpe::check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah)
[10:57:32] <marostegui>	 jynus: I am going to start then
[10:57:51] <jynus>	 wait
[10:58:28] <jynus>	 let's add / do https://gerrit.wikimedia.org/r/c/operations/puppet/+/799915/1 somewhere beforehand?
[10:58:49] <marostegui>	 Ah, I was going to create a patch for it
[10:58:51] <marostegui>	 didn't see that one
[10:58:51] <marostegui>	 yeah
[10:58:53] <marostegui>	 let's start with that
[10:58:56] <marostegui>	 let me review and merge
[10:58:58] <jynus>	 not super important
[10:59:07] <jynus>	 but because I rebased the important patch on top of that
[10:59:13] <jynus>	 so to avoir conflicts on merge
[10:59:20] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] dbproxy: Add db1164 as the m1 eqiad secondary [puppet] - 10https://gerrit.wikimedia.org/r/799915 (https://phabricator.wikimedia.org/T309286) (owner: 10Jcrespo)
[10:59:20] <jynus>	 *avoid
[10:59:31] <marostegui>	 merged, you can rebase while I test
[10:59:46] <jynus>	 it was already rebased, that is why :-)
[10:59:53] <jynus>	 let me just add it to the list as done
[11:00:28] <marostegui>	 yep
[11:01:10] <jynus>	 so from now on, I will not touch the description or perform anything unless you tell me to
[11:01:14] <jynus>	  will just monitor
[11:01:22] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:01:22] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:01:22] <marostegui>	 ok
[11:01:26] <marostegui>	 the proxy test looks good
[11:01:45] <jynus>	 so, on your time
[11:01:59] <marostegui>	 let's wait for db1117:3321 to catch up
[11:01:59] <marostegui>	 it shouldn't take long
[11:02:08] <jynus>	 ah, right
[11:02:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (10cmooney) @ayounsi I think based on the above we should proceed with https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/769...
[11:03:25] <wikibugs>	 (03PS5) 10Jcrespo: mariadb: Failover m1 primary from db1128 to db1164 [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286)
[11:03:33] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792622 (owner: 10Ayounsi)
[11:03:38] <jynus>	 ^I actually had to rebase, marostegui sorry
[11:03:49] <jynus>	 saw it has a conflict (but an automatic resolved one)
[11:03:51] <marostegui>	 db1117 is in sync
[11:04:13] <marostegui>	 Going to start moving the topology
[11:04:41] <jynus>	 there is an extra space on one of the hiera, but to resolve later
[11:05:35] <jynus>	 ah, no, the space is removed on the patch, I got it wrong, all good
[11:06:23] <jynus>	 db1117 seems moved ok?
[11:06:35] <jynus>	 according to orchestrator
[11:06:53] <jynus>	 and codfw servers
[11:07:52] <wikibugs>	 (03PS3) 10Jbond: Icinga: add page hashtag to paging host alerts [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans)
[11:09:42] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:09:57] <marostegui>	 yeah it looks good
[11:11:20] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35563/console" [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans)
[11:11:26] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Failover m1 primary from db1128 to db1164 [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286) (owner: 10Jcrespo)
[11:13:03] <marostegui>	 jynus: I am going for it now
[11:13:09] <jynus>	 ok
[11:13:12] <marostegui>	 !log Failover m1 from db1128 to db1164 - T309296
[11:13:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:18] <stashbot>	 T309296: Failover m1 primary db from db1128 to db1164 - https://phabricator.wikimedia.org/T309296
[11:13:33] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Thanks John for the workaround, if that works on PCC for both icinga and a normal host with exported resources it seems ok to me. But let'" [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans)
[11:13:37] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35564/console" [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans)
[11:14:15] <marostegui>	 all done
[11:14:56] <jynus>	 cleanup of heartbeat needed?
[11:15:03] <marostegui>	 that is done too
[11:15:06] <jynus>	 at least for orchestrator, not sure if for the check
[11:15:16] <jynus>	 ah, it took some time
[11:15:19] <jynus>	 to get it
[11:15:22] <marostegui>	 yeah
[11:15:26] <jynus>	 (orchestrator)
[11:15:32] <marostegui>	 let's check services
[11:15:42] <marostegui>	 etherpad might need the restart
[11:16:02] <marostegui>	 yeah it does
[11:16:02] <jynus>	 let me do that
[11:16:05] <marostegui>	 ok
[11:16:12] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35565/console" [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans)
[11:16:40] <marostegui>	 db1164 looks green on icinga
[11:16:43] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Also master works fine:" [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans)
[11:17:21] <jynus>	 etherpad took close to 1 minute to get back
[11:17:29] <jynus>	 it must overload or something
[11:17:30] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35567/console" [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans)
[11:18:00] <marostegui>	 yeah I can write now
[11:18:05] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Update db1128 and db1164 status [puppet] - 10https://gerrit.wikimedia.org/r/799930 (https://phabricator.wikimedia.org/T309296)
[11:18:13] <marostegui>	 jynus: ^ for later
[11:18:35] <jynus>	 ah, true, I forgot
[11:19:19] <jynus>	 s4- you like danger :-D
[11:19:24] <marostegui>	 haha
[11:19:36] <marostegui>	 I stole db1164 from it :)
[11:19:38] <jynus>	 let me get a snapshot of that before you move it, ok?
[11:19:43] <marostegui>	 ah no wait
[11:19:44] <marostegui>	 I took it from s1
[11:20:34] <jynus>	 I would trust the data on the current primary more than the old one, but I want to have it around (the data, not the server) for some time
[11:20:42] <marostegui>	 jynus: sure, no rush
[11:20:45] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Update db1128 and db1164 status [puppet] - 10https://gerrit.wikimedia.org/r/799930 (https://phabricator.wikimedia.org/T309296)
[11:21:10] <marostegui>	 I am merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/799930 which is a noop for the data
[11:21:12] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Update db1128 and db1164 status [puppet] - 10https://gerrit.wikimedia.org/r/799930 (https://phabricator.wikimedia.org/T309296) (owner: 10Marostegui)
[11:21:14] <marostegui>	 just notifications
[11:21:29] <jynus>	 a space that bothered me
[11:21:35] <marostegui>	 xd
[11:21:38] <marostegui>	 Good to merge?
[11:21:49] <jynus>	 one say, was giving it a last look
[11:21:52] <jynus>	 *second
[11:21:59] <marostegui>	 yep just +1 when ready
[11:22:46] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] mariadb: Update db1128 and db1164 status [puppet] - 10https://gerrit.wikimedia.org/r/799930 (https://phabricator.wikimedia.org/T309296) (owner: 10Marostegui)
[11:22:50] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:23:31] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Update db1128 and db1164 status [puppet] - 10https://gerrit.wikimedia.org/r/799930 (https://phabricator.wikimedia.org/T309296) (owner: 10Marostegui)
[11:24:18] <wikibugs>	 (03CR) 10Jbond: "sorry E_TOOMANYCHANGES was meant to leave this remark." [puppet] - 10https://gerrit.wikimedia.org/r/799344 (owner: 10Majavah)
[11:25:03] <jynus>	 let me know when deployed to run puppet on alert1001
[11:25:08] <marostegui>	 jynus: done
[11:25:30] <jynus>	 one last check when finished to icinga
[11:25:43] <jynus>	 tendril was done automatically, right?
[11:25:54] <marostegui>	 tendril?
[11:25:57] <jynus>	 sorry
[11:25:59] <jynus>	 zarcillo
[11:26:02] <marostegui>	 yes
[11:26:06] <marostegui>	 it should be, let me double check
[11:26:23] <marostegui>	 yep it is good
[11:26:36] <wikibugs>	 (03PS5) 10Majavah: nrpe: manage sudo rules via nrpe::check (try 2) [puppet] - 10https://gerrit.wikimedia.org/r/799344
[11:26:52] <wikibugs>	 (03CR) 10Majavah: nrpe: manage sudo rules via nrpe::check (try 2) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/799344 (owner: 10Majavah)
[11:27:19] <jynus>	 let me force a rerun on prometheus of config jobs
[11:27:45] <marostegui>	 ok
[11:28:08] <jynus>	 as grafana didn't show any primary dbs at the moment
[11:30:13] <jynus>	 it doesn't work
[11:30:14] <marostegui>	 writes on librenms work fine
[11:30:26] <marostegui>	 prometheus?
[11:30:27] <jynus>	 something could be wrong on zarcillo
[11:30:35] <jynus>	 I mean, it doesn't give errors
[11:30:47] <jynus>	 but it doesn't detect any m1-master
[11:30:55] <marostegui>	 let me see
[11:31:12] <jynus>	 which host is the main zarcillo db?
[11:31:22] <marostegui>	 db1115
[11:31:27] <jynus>	 ah, I think I know what happened
[11:31:35] <jynus>	 the script changes who is the master
[11:31:41] <jynus>	 but doesn't update the section
[11:31:47] <marostegui>	 aaah right
[11:31:48] <marostegui>	 let me fix that
[11:31:49] <jynus>	 that should have been done beforehand
[11:31:54] <jynus>	 not a big deal
[11:33:06] <marostegui>	 mmm but I did update section_instances
[11:33:12] <marostegui>	 before the switchover
[11:33:30] <jynus>	 then it could be something else
[11:33:36] <jynus>	 the group?
[11:33:43] <jynus>	 core -> misc, maybe?
[11:33:52] <marostegui>	 but db1128 is also showing core
[11:33:58] <marostegui>	 which is wrong
[11:34:00] <jynus>	 that is the one that is missing
[11:34:08] <marostegui>	 but db1128 the previous master was in core
[11:34:14] <marostegui>	 let me update it anyways
[11:34:22] <wikibugs>	 (03PS1) 10Jbond: puppet-merge: Add logging so we know when changes where merged [puppet] - 10https://gerrit.wikimedia.org/r/799943 (https://phabricator.wikimedia.org/T221529)
[11:34:28] <jynus>	 and that is probably why it didn't show up in the aggregated graphs :-)
[11:34:38] <marostegui>	 ah it wasn't working before either?
[11:34:55] <jynus>	 not sure, but right now, db1117:13321 only shows up on m1 misc
[11:35:02] <marostegui>	 then it must be that
[11:35:02] <jynus>	 probably the others are on core
[11:35:07] <marostegui>	 codfw failing too?
[11:35:11] <jynus>	 let me see
[11:35:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet-merge: Add logging so we know when changes where merged [puppet] - 10https://gerrit.wikimedia.org/r/799943 (https://phabricator.wikimedia.org/T221529) (owner: 10Jbond)
[11:36:14] <marostegui>	 updated db1164 to misc
[11:36:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "thanks lgtm, have another change to deploy this afternoon which also needs to be rolled out carefully so will include this one with that:" [puppet] - 10https://gerrit.wikimedia.org/r/799344 (owner: 10Majavah)
[11:37:52] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] Icinga: add page hashtag to paging host alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans)
[11:38:19] <jynus>	 on codfw db2132:9104 shows up as master
[11:38:27] <marostegui>	 that's correct yep
[11:38:28] <jynus>	 let me rerun eqiad
[11:38:49] <jynus>	 now db1164:9104 shows up as master
[11:38:59] <jynus>	 but db1128 doesn't show up on misc m1
[11:38:59] <marostegui>	 I will leave db1128 as core as it will be core in s1
[11:39:02] <marostegui>	 yeah
[11:39:05] <marostegui>	 I will leave it as core
[11:39:12] <marostegui>	 I will reclone it to s1
[11:39:16] <jynus>	 let me check that it at least is on core
[11:39:17] <marostegui>	 (once you give me green light)
[11:39:26] <jynus>	 no problem as long as we get metrics from it
[11:40:25] <jynus>	 yeah, it is on "m1-core"
[11:40:44] <jynus>	 I saw m2 core, too more issues
[11:40:46] <jynus>	 for another time
[11:41:05] <marostegui>	 | db1159              | db1159.eqiad.wmnet                | 3306 | NULL    | NULL                | core        |
[11:41:06] <marostegui>	 fixing
[11:41:17] <marostegui>	 fixed, that is m2
[11:41:29] <jynus>	 it is ok, as long as there are metrics it is just a label
[11:41:35] <marostegui>	 m3 seems to be ok
[11:41:36] <jynus>	 there will be likely many other issues
[11:41:52] <marostegui>	 m5 is ok too
[11:43:07] <jynus>	 puppet run on alert host but didn't disable the issue -probably will require puppet on the hosts to run first
[11:43:14] <marostegui>	 yeah
[11:43:15] <jynus>	 * the alert
[11:43:36] <marostegui>	 I am running it now
[11:44:32] <jynus>	 once alerts are ok, I will deploy the dbbackups patch, create a snapshot of db1128 and then unblock you
[11:45:15] <wikibugs>	 (03PS2) 10Jcrespo: Update dbackups check and statistics to use db1164 instead of db1128 [puppet] - 10https://gerrit.wikimedia.org/r/799894 (https://phabricator.wikimedia.org/T309286)
[11:46:21] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:46:24] <marostegui>	 jynus: db1164 notifications enabled
[11:46:31] <marostegui>	 and db1128 disabled
[11:46:33] <jynus>	 and all good?
[11:46:38] <marostegui>	 yep
[11:46:51] <marostegui>	 I am going to remove the downtimes from db1164
[11:47:01] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Update dbackups check and statistics to use db1164 instead of db1128 [puppet] - 10https://gerrit.wikimedia.org/r/799894 (https://phabricator.wikimedia.org/T309286) (owner: 10Jcrespo)
[11:47:49] <marostegui>	 jynus: I am going to get some food
[11:47:54] <marostegui>	 Thanks for all the help <3 <3
[11:48:04] <jynus>	 ok for me to do the intended things left?
[11:48:06] <jynus>	 the backup?
[11:48:13] <wikibugs>	 10SRE, 10Data-Engineering, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10Isaac)
[11:48:17] <jynus>	 (not the move, that is for you)
[11:48:20] <jynus>	 0:-)
[11:48:30] <marostegui>	 jynus: yep, go for your tests
[11:48:42] <marostegui>	 and let me know if I can proceed further with recloning db1128
[11:48:45] <jynus>	 have a nice lunch
[11:48:53] <marostegui>	 But we can also leave it running, the memory won't be changed today anyways, so no rush
[11:48:53] <jynus>	 it will take probably 2-3 hours
[11:48:57] <marostegui>	 np
[11:48:59] <marostegui>	 see you later
[11:49:06] <jynus>	 I will have lunch also when it starts
[11:49:20] <jynus>	 think either later on the day or tomorrow for the move
[11:50:37] <jynus>	 (there is also some chance that the host could fail again as backups touch all memory)
[11:54:58] <jynus>	 !log Running XtraBackup at db1128.eqiad.wmnet:3306 and sending it to dbprov1001.eqiad.wmnet
[11:55:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:47] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:02:59] <wikibugs>	 10SRE, 10Data-Engineering, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10Isaac) Thanks all for the input on this task and @BBlack especially for digging up what was happening. I finally updated the task description to reflect what I think is the...
[12:03:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:09:24] <wikibugs>	 (03PS2) 10Hnowlan: service: image-suggestion state to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/799358 (https://phabricator.wikimedia.org/T304891)
[12:21:50] <wikibugs>	 (03PS1) 10Majavah: P:puppetmaster::common: drop support for activerecord [puppet] - 10https://gerrit.wikimedia.org/r/799956
[12:23:50] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35568/console" [puppet] - 10https://gerrit.wikimedia.org/r/799956 (owner: 10Majavah)
[12:34:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298555)', diff saved to https://phabricator.wikimedia.org/P28593 and previous config saved to /var/cache/conftool/dbconfig/20220526-123413-ladsgroup.json
[12:34:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:21] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[12:41:31] <icinga-wm>	 PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:42:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:43:01] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:47:09] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] memcached: Untangle TLS/1.6 options [puppet] - 10https://gerrit.wikimedia.org/r/799348 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff)
[12:47:23] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:49:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P28594 and previous config saved to /var/cache/conftool/dbconfig/20220526-124918-ladsgroup.json
[12:49:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T1300).
[13:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:02:15] <wikibugs>	 (03CR) 10Physikerwelt: "See discussion in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/792625 (probably needs some updates to work with the new" [deployment-charts] - 10https://gerrit.wikimedia.org/r/798394 (owner: 10PipelineBot)
[13:04:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P28596 and previous config saved to /var/cache/conftool/dbconfig/20220526-130423-ladsgroup.json
[13:04:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:55] <icinga-wm>	 PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:14:34] <wikibugs>	 (03PS1) 10Majavah: wip [puppet] - 10https://gerrit.wikimedia.org/r/799976
[13:14:53] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:15:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wip [puppet] - 10https://gerrit.wikimedia.org/r/799976 (owner: 10Majavah)
[13:16:39] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 5 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35569/console" [puppet] - 10https://gerrit.wikimedia.org/r/799976 (owner: 10Majavah)
[13:17:26] <wikibugs>	 (03CR) 10Tchanders: Assign similareditors right to the checkuser group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205) (owner: 10AGueyte)
[13:18:09] <wikibugs>	 (03PS1) 10Jbond: Rakefie: Add URI.escape monkey patch [puppet] - 10https://gerrit.wikimedia.org/r/799982
[13:19:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Rakefie: Add URI.escape monkey patch [puppet] - 10https://gerrit.wikimedia.org/r/799982 (owner: 10Jbond)
[13:19:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298555)', diff saved to https://phabricator.wikimedia.org/P28597 and previous config saved to /var/cache/conftool/dbconfig/20220526-131928-ladsgroup.json
[13:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2103.codfw.wmnet with reason: Maintenance
[13:19:35] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[13:19:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2103.codfw.wmnet with reason: Maintenance
[13:19:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 14 hosts with reason: Maintenance
[13:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 14 hosts with reason: Maintenance
[13:19:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:49] <wikibugs>	 (03PS2) 10Majavah: add wmflib::is_active to pick a single active host [puppet] - 10https://gerrit.wikimedia.org/r/799976
[13:21:59] <wikibugs>	 (03CR) 10Tchanders: Deploy SimilarEditors to the beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte)
[13:22:09] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35570/console" [puppet] - 10https://gerrit.wikimedia.org/r/799976 (owner: 10Majavah)
[13:24:59] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:27:05] <wikibugs>	 (03PS3) 10Majavah: add wmflib::is_active to pick a single active host [puppet] - 10https://gerrit.wikimedia.org/r/799976
[13:28:45] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 5 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35571/console" [puppet] - 10https://gerrit.wikimedia.org/r/799976 (owner: 10Majavah)
[13:29:13] <wikibugs>	 (03PS2) 10Jbond: Rakefie: Add URI.escape monkey patch [puppet] - 10https://gerrit.wikimedia.org/r/799982
[13:34:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/799956 (owner: 10Majavah)
[13:36:07] <wikibugs>	 (03PS1) 10Hnowlan: service: image-suggestion state to production [puppet] - 10https://gerrit.wikimedia.org/r/799998 (https://phabricator.wikimedia.org/T304891)
[13:42:39] <icinga-wm>	 RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:44:09] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:48:18] <wikibugs>	 (03PS1) 10Majavah: P:openstack::nova: remove stretch specific code [puppet] - 10https://gerrit.wikimedia.org/r/800009
[13:49:39] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1025.mgmt.eqiad.wmnet with reboot policy FORCED
[13:49:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:31] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:54:59] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: wmflib::service::get_url: avoid using monitoring to find the url. [puppet] - 10https://gerrit.wikimedia.org/r/800010
[13:56:15] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:58:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35572/console" [puppet] - 10https://gerrit.wikimedia.org/r/800010 (owner: 10Giuseppe Lavagetto)
[14:07:05] <icinga-wm>	 RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:08:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/800010 (owner: 10Giuseppe Lavagetto)
[14:09:39] <wikibugs>	 10SRE: ProdPasteBot uses deprecated certificate auth - https://phabricator.wikimedia.org/T242857 (10Dsharpe) I don't know who owns or maintains this.  https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+log/refs/heads/production/modules/base/files/phaste.py shows some folks who have touched the cod...
[14:10:02] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] cassandra-http-gateway: add missing log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/799283 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan)
[14:15:14] <wikibugs>	 (03Merged) 10jenkins-bot: cassandra-http-gateway: add missing log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/799283 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan)
[14:18:52] <marostegui>	 volans: just checking you'll send a patch (not now, not today) for the #p.age thing that didn't work with this master crash?
[14:19:23] <volans>	 marostegui: I have already sent it, then there was an issue with puppet reserved words and jbond kindly patched it with a workaround
[14:19:37] <marostegui>	 volans: ah, I don't see it on my reviews
[14:20:08] <marostegui>	 I was curious about what it was
[14:20:56] <volans>	 I'm adding more people now
[14:21:49] <volans>	 marostegui: added people anyway it's https://gerrit.wikimedia.org/r/c/operations/puppet/+/799903
[14:21:54] <wikibugs>	 (03CR) 10Jbond: "done first pass" [puppet] - 10https://gerrit.wikimedia.org/r/799976 (owner: 10Majavah)
[14:22:03] <volans>	 alias is a reserved meta-parameter in puppet
[14:24:52] <marostegui>	 volans: ah thank you <3
[14:26:15] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:26:33] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2035 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:35:52] <jynus>	 marostegui: backups were retried automatically and still failed
[14:35:57] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1025.mgmt.eqiad.wmnet with reboot policy FORCED
[14:35:59] <jynus>	 looking on what could be the reason
[14:36:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:45] <marostegui>	 jynus: let me know if I can help
[14:36:54] <jynus>	 I am checking the logs
[14:37:17] <jynus>	 I may do a next trial with replication stopped
[14:37:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10cmooney)
[14:38:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10nskaggs) a:05nskaggs→03Andrew
[14:40:30] <jynus>	 transfer was successful twice, but prepare failed, looking at the xtrabackup logs
[14:41:36] <jynus>	 ERROR - xtrabackup version mismatch- xtrabackup version: {'major': '10.4', 'minor': 22, 'vendor': 'MariaDB'}, backup version: {'major': '10.4', 'minor': 22, 'vendor': 'MariaDB-log'}
[14:42:29] <wikibugs>	 (03PS1) 10Filippo Giunchedi: cfssl::db require sqlite3 package [puppet] - 10https://gerrit.wikimedia.org/r/800027
[14:42:31] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Thumbor, 10affects-Kiwix-and-openZIM: HTTP Mime-Type now always returned properly if "If-None-Match" request header used - https://phabricator.wikimedia.org/T265006 (10Kelson) @Krinkle I have rechecked this bug/ticket with the given example and now it works. Might that be...
[14:42:34] <wikibugs>	 (03PS1) 10Filippo Giunchedi: cfssl: add missing migrations [puppet] - 10https://gerrit.wikimedia.org/r/800028
[14:42:38] <wikibugs>	 (03PS1) 10Filippo Giunchedi: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029
[14:42:42] <wikibugs>	 (03PS1) 10Filippo Giunchedi: cfssl: fix sqlite3 path selection [puppet] - 10https://gerrit.wikimedia.org/r/800030
[14:42:46] <wikibugs>	 (03PS1) 10Filippo Giunchedi: puppetdb: create dbs before grants [puppet] - 10https://gerrit.wikimedia.org/r/800031
[14:42:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10cmooney) Quick update - I've been trying to image cloudcephosd1025 to make sure all is ok, and completed some operations.  Not being comp...
[14:43:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cfssl: add missing migrations [puppet] - 10https://gerrit.wikimedia.org/r/800028 (owner: 10Filippo Giunchedi)
[14:44:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi)
[14:44:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cfssl: fix sqlite3 path selection [puppet] - 10https://gerrit.wikimedia.org/r/800030 (owner: 10Filippo Giunchedi)
[14:45:03] <godog>	 ooof
[14:45:14] <godog>	 volans: looking at the icinga host change now
[14:47:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetdb: create dbs before grants [puppet] - 10https://gerrit.wikimedia.org/r/800031 (owner: 10Filippo Giunchedi)
[14:53:11] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:55:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans)
[14:57:51] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2035 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:59:35] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Thumbor, 10Performance-Team (Radar), 10affects-Kiwix-and-openZIM: HTTP Mime-Type now always returned properly if "If-None-Match" request header used - https://phabricator.wikimedia.org/T265006 (10Krinkle) 05Open→03Resolved Yep, it would appear so. I suspect this is l...
[15:00:04] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Thumbor, 10Performance-Team (Radar), 10affects-Kiwix-and-openZIM: upload.wikimedia.org HTTP 304 responses lack a Content-Type header - https://phabricator.wikimedia.org/T265006 (10Krinkle)
[15:02:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Not quite sure how to fix test failures at https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/45287/console" [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi)
[15:02:56] <wikibugs>	 (03PS2) 10Filippo Giunchedi: puppetdb: create dbs before grants [puppet] - 10https://gerrit.wikimedia.org/r/800031
[15:02:58] <wikibugs>	 (03PS2) 10Filippo Giunchedi: cfssl::db require sqlite3 package [puppet] - 10https://gerrit.wikimedia.org/r/800027
[15:03:00] <wikibugs>	 (03PS2) 10Filippo Giunchedi: cfssl: add missing migrations [puppet] - 10https://gerrit.wikimedia.org/r/800028
[15:03:02] <wikibugs>	 (03PS2) 10Filippo Giunchedi: cfssl: fix sqlite3 path selection [puppet] - 10https://gerrit.wikimedia.org/r/800030
[15:03:04] <wikibugs>	 (03PS2) 10Filippo Giunchedi: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029
[15:04:26] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic-Icebox, 10Performance-Team (Radar), 10affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (10Krinkle)
[15:04:30] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "Am I understanding correctly that the limit in practice would be 40 Mb with our current queue length settings?  +1 for giving it a try" [puppet] - 10https://gerrit.wikimedia.org/r/799870 (https://phabricator.wikimedia.org/T308439) (owner: 10Filippo Giunchedi)
[15:05:01] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic-Icebox, 10Performance-Team (Radar), 10affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (10Krinkle) I'm not sure since when, but based on us having <14 days ats-be storage, and based on there still beeing ET...
[15:05:17] <wikibugs>	 (03CR) 10Herron: [C: 03+1] rsyslog: bound disk-assisted queues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799870 (https://phabricator.wikimedia.org/T308439) (owner: 10Filippo Giunchedi)
[15:05:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetdb: create dbs before grants [puppet] - 10https://gerrit.wikimedia.org/r/800031 (owner: 10Filippo Giunchedi)
[15:05:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi)
[15:09:15] <wikibugs>	 (03PS3) 10Filippo Giunchedi: puppetdb: create dbs before grants [puppet] - 10https://gerrit.wikimedia.org/r/800031
[15:09:17] <wikibugs>	 (03PS3) 10Filippo Giunchedi: cfssl::db require sqlite3 package [puppet] - 10https://gerrit.wikimedia.org/r/800027
[15:09:19] <wikibugs>	 (03PS3) 10Filippo Giunchedi: cfssl: add missing migrations [puppet] - 10https://gerrit.wikimedia.org/r/800028
[15:09:21] <wikibugs>	 (03PS3) 10Filippo Giunchedi: cfssl: fix sqlite3 path selection [puppet] - 10https://gerrit.wikimedia.org/r/800030
[15:09:23] <wikibugs>	 (03PS3) 10Filippo Giunchedi: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029
[15:11:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "Thank you for the review! Given that this change reloads rsyslog across the fleet I'll deploy early next week" [puppet] - 10https://gerrit.wikimedia.org/r/799870 (https://phabricator.wikimedia.org/T308439) (owner: 10Filippo Giunchedi)
[15:12:14] <wikibugs>	 (03CR) 10jenkins-bot: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi)
[15:14:11] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:15:30] <wikibugs>	 (03PS1) 10Jbond: sqlite: update packages and add dependency [puppet] - 10https://gerrit.wikimedia.org/r/800048
[15:17:00] <wikibugs>	 (03PS8) 10BCornwall: cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459
[15:17:09] <wikibugs>	 (03CR) 10Jbond: "don't have a big issue with this but see comment and proposed alternative" [puppet] - 10https://gerrit.wikimedia.org/r/800027 (owner: 10Filippo Giunchedi)
[15:17:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall)
[15:17:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[15:17:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[15:17:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298555)', diff saved to https://phabricator.wikimedia.org/P28599 and previous config saved to /var/cache/conftool/dbconfig/20220526-151723-ladsgroup.json
[15:18:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:11] <wikibugs>	 (03PS1) 10Cathal Mooney: Re-add urpf check to cloudsw -> cr interfaces Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/800053 (https://phabricator.wikimedia.org/T304989)
[15:18:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:13] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[15:18:27] <wikibugs>	 (03PS9) 10BCornwall: cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459
[15:19:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Re-add urpf check to cloudsw -> cr interfaces Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/800053 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney)
[15:19:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/800028 (owner: 10Filippo Giunchedi)
[15:20:04] <wikibugs>	 (03CR) 10BCornwall: cli: Add support for XDG Base Directory spec (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall)
[15:20:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM thx" [puppet] - 10https://gerrit.wikimedia.org/r/800030 (owner: 10Filippo Giunchedi)
[15:21:26] <wikibugs>	 (03PS2) 10Cathal Mooney: Re-add urpf check to cloudsw -> cr interfaces Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/800053 (https://phabricator.wikimedia.org/T304989)
[15:24:01] <wikibugs>	 (03CR) 10Herron: [V: 03+2 C: 03+2] Add HAProxy SLO dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/790672 (https://phabricator.wikimedia.org/T307898) (owner: 10Vgutierrez)
[15:24:03] <wikibugs>	 (03CR) 10Jbond: "did you test this?  It's been a while since I delved into the postgres module. also please tag with the following task" [puppet] - 10https://gerrit.wikimedia.org/r/800031 (owner: 10Filippo Giunchedi)
[15:29:03] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] changeprop: Remove WP:ANI from page blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/797354 (https://phabricator.wikimedia.org/T274359) (owner: 10Samtar)
[15:31:02] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Re-add urpf check to cloudsw -> cr interfaces Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/800053 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney)
[15:32:10] <wikibugs>	 (03Merged) 10jenkins-bot: Re-add urpf check to cloudsw -> cr interfaces Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/800053 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney)
[15:34:38] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop: Remove WP:ANI from page blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/797354 (https://phabricator.wikimedia.org/T274359) (owner: 10Samtar)
[15:37:30] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.5.0 [software/homer] - 10https://gerrit.wikimedia.org/r/800067
[15:39:17] <wikibugs>	 (03CR) 10BryanDavis: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/799427 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis)
[15:42:25] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[15:42:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:44] <topranks>	 bd808: hey, apologies, I made a typo in netbox that I believe is messing up your deploy (re: T297140)
[15:44:45] <stashbot>	 T297140: New Service Request: developer-portal - https://phabricator.wikimedia.org/T297140
[15:44:49] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:44:52] <topranks>	 I'm correcting now
[15:44:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:57] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.5.0 [software/homer] - 10https://gerrit.wikimedia.org/r/800067 (owner: 10Volans)
[15:45:00] <wikibugs>	 (03CR) 10BBlack: Add dumps mapping to cache_upload (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793525 (https://phabricator.wikimedia.org/T306550) (owner: 10BBlack)
[15:45:38] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync
[15:45:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:49] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[15:45:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:38] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[15:46:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:50] <wikibugs>	 (03PS4) 10Jbond: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi)
[15:46:52] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[15:46:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:10] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[15:47:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:21] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[15:47:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:17] <wikibugs>	 (03PS5) 10Jbond: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi)
[15:48:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi)
[15:49:01] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.5.0 [software/homer] - 10https://gerrit.wikimedia.org/r/800067 (owner: 10Volans)
[15:49:08] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[15:49:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:11] <volans>	 !log upgrading spicerack on cumin2002 to 2.5.0-1+deb11u1
[15:51:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:33] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:53:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:41] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:00:05] <jouncebot>	 jbond and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T1600).
[16:00:05] <jouncebot>	 bd808: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall)
[16:00:47] <wikibugs>	 (03CR) 10Cathal Mooney: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/799427 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis)
[16:00:52] <rzl>	 bd808: hi! looking
[16:01:14] <bd808>	 rzl: awesome. I think it's pretty trivial
[16:01:45] <rzl>	 haha I saw the filename and got nervous but then I saw the diff :D yep no worries, merging
[16:01:51] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] base: remove "managed by puppet" notice on /etc/skel/.bashrc [puppet] - 10https://gerrit.wikimedia.org/r/798874 (owner: 10BryanDavis)
[16:02:10] <wikibugs>	 (03Merged) 10jenkins-bot: cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall)
[16:02:19] <wikibugs>	 (03PS4) 10Filippo Giunchedi: cfssl: add missing migrations [puppet] - 10https://gerrit.wikimedia.org/r/800028
[16:03:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:03:39] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update hieradata from Netbox - volans@cumin2002"
[16:03:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] cfssl: add missing migrations [puppet] - 10https://gerrit.wikimedia.org/r/800028 (owner: 10Filippo Giunchedi)
[16:04:01] <rzl>	 bd808: done!
[16:04:15] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: puppet documentation generation is missing some compnets - https://phabricator.wikimedia.org/T271909 (10jbond)
[16:04:58] <bd808>	 thanks rzl 
[16:05:24] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update hieradata from Netbox - volans@cumin2002"
[16:05:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:51] <wikibugs>	 (03PS4) 10Filippo Giunchedi: cfssl: fix sqlite3 path selection [puppet] - 10https://gerrit.wikimedia.org/r/800030
[16:07:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] cfssl: fix sqlite3 path selection [puppet] - 10https://gerrit.wikimedia.org/r/800030 (owner: 10Filippo Giunchedi)
[16:13:41] <wikibugs>	 (03PS2) 10Jbond: puppet-merge: Add logging so we know when changes where merged [puppet] - 10https://gerrit.wikimedia.org/r/799943 (https://phabricator.wikimedia.org/T221529)
[16:14:25] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/799427 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis)
[16:18:29] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:21:03] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: cfssl::db require sqlite3 package [puppet] - 10https://gerrit.wikimedia.org/r/800027 (owner: 10Filippo Giunchedi)
[16:22:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[16:22:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[16:22:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[16:22:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[16:22:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: sqlite: update packages and add dependency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800048 (owner: 10Jbond)
[16:22:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298560)', diff saved to https://phabricator.wikimedia.org/P28601 and previous config saved to /var/cache/conftool/dbconfig/20220526-162212-ladsgroup.json
[16:22:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:24] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[16:22:44] <wikibugs>	 (03PS3) 10Ahmon Dancy: mwdebug service: Add traindev environment support [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883
[16:22:46] <wikibugs>	 (03PS1) 10Ahmon Dancy: mediawiki 0.2.1: Add a helm test [deployment-charts] - 10https://gerrit.wikimedia.org/r/800118
[16:22:57] <wikibugs>	 (03PS6) 10Filippo Giunchedi: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029
[16:23:49] <wikibugs>	 (03PS3) 10Jbond: puppet-merge: Add logging so we know when changes where merged [puppet] - 10https://gerrit.wikimedia.org/r/799943 (https://phabricator.wikimedia.org/T221529)
[16:25:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: cfssl::db require sqlite3 package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800027 (owner: 10Filippo Giunchedi)
[16:26:02] <wikibugs>	 (03PS2) 10Jbond: sqlite: update packages and add dependency [puppet] - 10https://gerrit.wikimedia.org/r/800048
[16:26:14] <wikibugs>	 (03CR) 10Jbond: sqlite: update packages and add dependency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800048 (owner: 10Jbond)
[16:27:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: puppetdb: create dbs before grants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800031 (owner: 10Filippo Giunchedi)
[16:28:31] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, just make sure it works as expected :D" [puppet] - 10https://gerrit.wikimedia.org/r/799943 (https://phabricator.wikimedia.org/T221529) (owner: 10Jbond)
[16:29:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/800048 (owner: 10Jbond)
[16:29:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sqlite: update packages and add dependency [puppet] - 10https://gerrit.wikimedia.org/r/800048 (owner: 10Jbond)
[16:30:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet-merge: Add logging so we know when changes where merged (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799943 (https://phabricator.wikimedia.org/T221529) (owner: 10Jbond)
[16:33:17] <wikibugs>	 (03PS1) 10Jbond: README: minor commit to test new puppet merge logging [puppet] - 10https://gerrit.wikimedia.org/r/800120
[16:34:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] README: minor commit to test new puppet merge logging [puppet] - 10https://gerrit.wikimedia.org/r/800120 (owner: 10Jbond)
[16:36:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet-merge: Add logging so we know when changes where merged (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799943 (https://phabricator.wikimedia.org/T221529) (owner: 10Jbond)
[16:36:36] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on cuminunpriv1001.eqiad.wmnet with reason: Testing new Ganeti features on Spicerack
[16:36:37] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on cuminunpriv1001.eqiad.wmnet with reason: Testing new Ganeti features on Spicerack
[16:36:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:25] <wikibugs>	 (03PS1) 10Jbond: puppet-merge: include repo name in log messages [puppet] - 10https://gerrit.wikimedia.org/r/800121
[16:39:59] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet-merge: include repo name in log messages [puppet] - 10https://gerrit.wikimedia.org/r/800121 (owner: 10Jbond)
[16:41:49] <wikibugs>	 (03PS1) 10Jbond: README: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/800125
[16:42:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:43:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] README: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/800125 (owner: 10Jbond)
[16:43:49] <wikibugs>	 (03CR) 10Ori: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/800122 (owner: 10Ori)
[16:47:59] <wikibugs>	 (03CR) 10Ori: developer-portal: add to service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799429 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis)
[16:48:01] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:51:16] <wikibugs>	 (03CR) 10BryanDavis: developer-portal: add to service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799429 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis)
[16:51:32] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[16:51:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Andrew)
[16:52:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Andrew) Note that I'm renaming these two hosts to clouddumps100[12]
[16:53:14] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
[16:53:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:51] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:56:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:21] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:57:39] <wikibugs>	 (03PS1) 10Andrew Bogott: Rename cloudstore101[01] to clouddumps100[12] [puppet] - 10https://gerrit.wikimedia.org/r/800152 (https://phabricator.wikimedia.org/T302981)
[16:57:49] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[16:57:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Rename cloudstore101[01] to clouddumps100[12] [puppet] - 10https://gerrit.wikimedia.org/r/800152 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott)
[17:01:16] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:01:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:56] <wikibugs>	 10SRE-tools, 10Discovery, 10Infrastructure-Foundations, 10Discovery-Search (Current work), 10IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (10Volans) Eqiad is also done, pasting only the differences with the above snippet:  `lang=python >>> devices = De...
[17:04:01] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[17:04:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:33] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[17:06:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:32] <wikibugs>	 (03PS1) 10Vlad.shapik: WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719)
[17:13:25] <wikibugs>	 (03PS1) 10Andrew Bogott: Added partman recipe 'hwraid-2dev.cfg' [puppet] - 10https://gerrit.wikimedia.org/r/800172 (https://phabricator.wikimedia.org/T302981)
[17:14:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Added partman recipe 'hwraid-2dev.cfg' [puppet] - 10https://gerrit.wikimedia.org/r/800172 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott)
[17:16:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298555)', diff saved to https://phabricator.wikimedia.org/P28602 and previous config saved to /var/cache/conftool/dbconfig/20220526-171638-ladsgroup.json
[17:16:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:47] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[17:16:57] <wikibugs>	 (03PS2) 10Andrew Bogott: Added partman recipe 'hwraid-2dev.cfg' [puppet] - 10https://gerrit.wikimedia.org/r/800172 (https://phabricator.wikimedia.org/T302981)
[17:18:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Added partman recipe 'hwraid-2dev.cfg' [puppet] - 10https://gerrit.wikimedia.org/r/800172 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott)
[17:20:12] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster
[17:20:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:37] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: add developer.wikimedia.org to CDN config [puppet] - 10https://gerrit.wikimedia.org/r/800181 (https://phabricator.wikimedia.org/T297140)
[17:20:56] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
[17:21:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:01] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:22:02] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[17:22:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[17:23:08] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: curator support new and legacy index patterns [puppet] - 10https://gerrit.wikimedia.org/r/798982 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite)
[17:24:06] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd1026.mgmt.eqiad.wmnet with reboot policy FORCED
[17:24:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: re-label cloudstore101[01] to clouddumps100[12] - https://phabricator.wikimedia.org/T309338 (10Andrew)
[17:25:25] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[17:25:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:48] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1026.mgmt.eqiad.wmnet with reboot policy FORCED
[17:25:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:12] <wikibugs>	 (03PS1) 10Ladsgroup: Add drop_page_restrictions_T60674.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/800183 (https://phabricator.wikimedia.org/T60674)
[17:30:10] <wikibugs>	 (03PS1) 10Andrew Bogott: hwraid-2dev.cfg: Throw in a few more autoconfirms [puppet] - 10https://gerrit.wikimedia.org/r/800184 (https://phabricator.wikimedia.org/T302981)
[17:30:12] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:30:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:30] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd1026.mgmt.eqiad.wmnet with reboot policy FORCED
[17:31:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] hwraid-2dev.cfg: Throw in a few more autoconfirms [puppet] - 10https://gerrit.wikimedia.org/r/800184 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott)
[17:31:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P28603 and previous config saved to /var/cache/conftool/dbconfig/20220526-173143-ladsgroup.json
[17:31:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:54] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[17:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:10] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.wipe-cache cloudcephosd1026.mgmt.eqiad.wmnet on all recursors
[17:32:14] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd1026.mgmt.eqiad.wmnet on all recursors
[17:32:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:02] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage
[17:34:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:19] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[17:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:47] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1002.wikimedia.org with OS bullseye
[17:34:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1002.wikimedia.org w...
[17:35:18] <jinxer-wm>	 (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:37:26] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage
[17:37:28] <wikibugs>	 (03PS1) 10Zabe: Revert "Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800190 (https://phabricator.wikimedia.org/T233004)
[17:37:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:18] <jinxer-wm>	 (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:41:04] <wikibugs>	 (03PS1) 10Andrew Bogott: hwraid-2dev.cfg partman: reorder again [puppet] - 10https://gerrit.wikimedia.org/r/800196
[17:41:53] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] hwraid-2dev.cfg partman: reorder again [puppet] - 10https://gerrit.wikimedia.org/r/800196 (owner: 10Andrew Bogott)
[17:44:55] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/799870 (https://phabricator.wikimedia.org/T308439) (owner: 10Filippo Giunchedi)
[17:45:38] <wikibugs>	 (03PS1) 10Andrew Bogott: hwraid-2dev.cfg partman: add 1G swap [puppet] - 10https://gerrit.wikimedia.org/r/800197
[17:46:05] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster
[17:46:07] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1026.mgmt.eqiad.wmnet with reboot policy FORCED
[17:46:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P28604 and previous config saved to /var/cache/conftool/dbconfig/20220526-174648-ladsgroup.json
[17:46:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:04] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd1026.mgmt.eqiad.wmnet with reboot policy FORCED
[17:47:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] hwraid-2dev.cfg partman: add 1G swap [puppet] - 10https://gerrit.wikimedia.org/r/800197 (owner: 10Andrew Bogott)
[17:49:45] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1001.wikimedia.org with OS bullseye
[17:49:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[17:55:08] <wikibugs>	 (03CR) 10Jbond: "This looks good to me, however let's get WMCS to look as well.  In theory this could remove some protections from a WMCS stand-alone puppet" [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah)
[17:58:37] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[17:58:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[18:00:04] <jouncebot>	 dancy and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T1800).
[18:00:28] <dancy>	 o/
[18:01:18] <wikibugs>	 (PS1) Ahmon Dancy: group2 wikis to 1.39.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/800212 (https://phabricator.wikimedia.org/T305219)
[18:01:20] <wikibugs>	 (CR) Ahmon Dancy: [C: +2] group2 wikis to 1.39.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/800212 (https://phabricator.wikimedia.org/T305219) (owner: Ahmon Dancy)
[18:01:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298555)', diff saved to https://phabricator.wikimedia.org/P28605 and previous config saved to /var/cache/conftool/dbconfig/20220526-180153-ladsgroup.json
[18:01:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[18:01:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[18:02:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T298555)', diff saved to https://phabricator.wikimedia.org/P28606 and previous config saved to /var/cache/conftool/dbconfig/20220526-180201-ladsgroup.json
[18:02:02] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[18:02:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:53] <wikibugs>	 (PS1) Majavah: Provide a python3-bullseye image [docker-images/production-images] - https://gerrit.wikimedia.org/r/800213
[18:03:41] <wikibugs>	 (Merged) jenkins-bot: group2 wikis to 1.39.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/800212 (https://phabricator.wikimedia.org/T305219) (owner: Ahmon Dancy)
[18:04:36] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1002.wikimedia.org with OS bullseye
[18:04:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1002.wikimedia.org with...
[18:04:51] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1002.wikimedia.org with OS bullseye
[18:04:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:57] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.39.0-wmf.13  refs T305219
[18:05:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1002.wikimedia.org w...
[18:05:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:03] <stashbot>	 T305219: 1.39.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T305219
[18:08:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:08:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:09:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:09:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:10:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:35] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage
[18:10:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:26] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage
[18:13:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:31] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1002.wikimedia.org with reason: host reimage
[18:16:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:32] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1002.wikimedia.org with reason: host reimage
[18:19:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:47] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1001.wikimedia.org with OS bullseye
[18:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:58] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[18:32:21] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1002.wikimedia.org with OS bullseye
[18:32:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1002.wikimedia.org with...
[18:33:50] <wikibugs>	 (PS1) Jbond: service::node::config: drop merge_config [puppet] - https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639)
[18:34:34] <wikibugs>	 (CR) CI reject: [V: -1] service::node::config: drop merge_config [puppet] - https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: Jbond)
[18:37:24] <wikibugs>	 (PS2) Jbond: service::node::config: drop merge_config [puppet] - https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639)
[18:38:08] <wikibugs>	 (CR) CI reject: [V: -1] service::node::config: drop merge_config [puppet] - https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: Jbond)
[18:40:17] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 15): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35576/console" [puppet] - https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: Jbond)
[18:42:37] <wikibugs>	 (PS3) Jbond: service::node::config: drop merge_config [puppet] - https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639)
[18:46:53] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (RKemper)
[18:47:08] <wikibugs>	 (CR) Jbond: "ready for review" [puppet] - https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: Jbond)
[18:48:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298560)', diff saved to https://phabricator.wikimedia.org/P28609 and previous config saved to /var/cache/conftool/dbconfig/20220526-184824-ladsgroup.json
[18:48:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:30] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[18:50:17] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35577/testreduce1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/799306 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[18:51:28] <wikibugs>	 (PS1) BCornwall: turnilo: Fix port variable dererence for monitor [puppet] - https://gerrit.wikimedia.org/r/800231
[18:53:23] <wikibugs>	 (CR) Dzahn: [V: +1 C: +1] "thanks! good catch. confirmed this is currently "check_http_on_port!${port}" in Icinga config. it will fix turnilo monitoring (https://pha" [puppet] - https://gerrit.wikimedia.org/r/800231 (owner: BCornwall)
[18:53:39] <wikibugs>	 (PS2) Dzahn: turnilo: Fix port variable dererence for monitor [puppet] - https://gerrit.wikimedia.org/r/800231 (https://phabricator.wikimedia.org/T277729) (owner: BCornwall)
[18:55:29] <wikibugs>	 (CR) Dzahn: "Feel free to merge or I can. If you do, please run puppet on alert1001 afterwards. Then let's see what happens at https://icinga.wikimedia" [puppet] - https://gerrit.wikimedia.org/r/800231 (https://phabricator.wikimedia.org/T277729) (owner: BCornwall)
[18:55:32] <wikibugs>	 (PS1) Majavah: openstack: horizon: remove enc url from hiera [puppet] - https://gerrit.wikimedia.org/r/800232
[18:55:52] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:56:20] <wikibugs>	 (PS1) Ryan Kemper: elasticsearch: add more reimage usage examples [cookbooks] - https://gerrit.wikimedia.org/r/800233
[18:56:25] <wikibugs>	 (CR) Dzahn: [C: +1] "This should fix the UNKNOWN at https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=turnilo" [puppet] - https://gerrit.wikimedia.org/r/800231 (https://phabricator.wikimedia.org/T277729) (owner: BCornwall)
[18:56:46] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35578/console" [puppet] - https://gerrit.wikimedia.org/r/800232 (owner: Majavah)
[18:56:56] <wikibugs>	 (PS2) Ryan Kemper: elasticsearch: add more reimage usage examples [cookbooks] - https://gerrit.wikimedia.org/r/800233
[18:57:32] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:00:09] <wikibugs>	 (CR) Dzahn: [C: +2] "noop on parse*, testreduce1001 looks fine (besides unrelated issue that those wmf_auto_restart systemd units fail because some servers are" [puppet] - https://gerrit.wikimedia.org/r/799306 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[19:02:53] <wikibugs>	 (CR) Bking: [C: +1] elasticsearch: add more reimage usage examples [cookbooks] - https://gerrit.wikimedia.org/r/800233 (owner: Ryan Kemper)
[19:03:04] <wikibugs>	 (PS3) Ryan Kemper: elasticsearch: add more reimage usage examples [cookbooks] - https://gerrit.wikimedia.org/r/800233 (https://phabricator.wikimedia.org/T308606)
[19:03:28] <wikibugs>	 (CR) Ryan Kemper: [V: +2 C: +2] elasticsearch: add more reimage usage examples [cookbooks] - https://gerrit.wikimedia.org/r/800233 (https://phabricator.wikimedia.org/T308606) (owner: Ryan Kemper)
[19:03:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P28610 and previous config saved to /var/cache/conftool/dbconfig/20220526-190329-ladsgroup.json
[19:03:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:13] <wikibugs>	 (PS4) AGueyte: Add SimilarEditors extension – I: Add to i18n [mediawiki-config] - https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: Tchanders)
[19:04:15] <wikibugs>	 (PS2) AGueyte: Add SimilarEditors extension – II: Add to InitialiseSettings, default off [mediawiki-config] - https://gerrit.wikimedia.org/r/793500 (https://phabricator.wikimedia.org/T306909) (owner: Jforrester)
[19:04:17] <wikibugs>	 (PS2) AGueyte: Add SimilarEditors extension – III: Add to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/793501 (https://phabricator.wikimedia.org/T306909) (owner: Jforrester)
[19:04:19] <wikibugs>	 (PS4) AGueyte: Deploy SimilarEditors to the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908)
[19:05:07] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage - bking@cumin1001 - T309343
[19:05:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:14] <stashbot>	 T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343
[19:05:35] <wikibugs>	 (PS5) AGueyte: Add SimilarEditors extension – I: Add to i18n [mediawiki-config] - https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: Tchanders)
[19:05:37] <wikibugs>	 (PS3) AGueyte: Add SimilarEditors extension – II: Add to InitialiseSettings, default off [mediawiki-config] - https://gerrit.wikimedia.org/r/793500 (https://phabricator.wikimedia.org/T306909) (owner: Jforrester)
[19:05:39] <wikibugs>	 (PS3) AGueyte: Add SimilarEditors extension – III: Add to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/793501 (https://phabricator.wikimedia.org/T306909) (owner: Jforrester)
[19:05:41] <wikibugs>	 (PS2) AGueyte: Assign similareditors right to the checkuser group [mediawiki-config] - https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205)
[19:06:25] <wikibugs>	 (PS1) Dzahn: profile::auto_restarts: make comment match class name, minor grammar [puppet] - https://gerrit.wikimedia.org/r/800235
[19:06:26] <wikibugs>	 SRE, Infrastructure-Foundations, Mail: [mitigated] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (jhathaway) I sent a message to the Exim mailing list, https://www.mail-archive.com/exim-users@exim.org/msg57216.html.  Jeremy Harris suggeste...
[19:06:37] <wikibugs>	 SRE, Infrastructure-Foundations, Mail: [mitigated] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (jhathaway) a:jhathaway
[19:06:50] <wikibugs>	 (CR) CI reject: [V: -1] profile::auto_restarts: make comment match class name, minor grammar [puppet] - https://gerrit.wikimedia.org/r/800235 (owner: Dzahn)
[19:07:17] <wikibugs>	 (PS2) Dzahn: profile::auto_restarts: make comment match class name, minor grammar [puppet] - https://gerrit.wikimedia.org/r/800235
[19:08:26] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1004.wikimedia.org with OS bullseye
[19:08:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:31] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye
[19:09:28] <wikibugs>	 (PS3) Dzahn: profile::auto_restarts: make comment match class name, minor grammar [puppet] - https://gerrit.wikimedia.org/r/800235
[19:11:33] <wikibugs>	 (CR) Jforrester: [C: +1] "This stack should now be good to deploy." [mediawiki-config] - https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: Tchanders)
[19:16:04] <wikibugs>	 (PS1) Dzahn: parsoid::testing: remove auto_restart for apache, it uses nginx instead [puppet] - https://gerrit.wikimedia.org/r/800237
[19:16:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (Andrew)
[19:16:44] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:18:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P28611 and previous config saved to /var/cache/conftool/dbconfig/20220526-191834-ladsgroup.json
[19:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:49] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (Andrew) a:Andrew→ArielGlenn @ArielGlenn these two new servers should be ready; I'm hoping that you have the time to move the data a...
[19:19:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Andrew)
[19:19:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Andrew) Open→Resolved
[19:19:22] <wikibugs>	 (CR) Dzahn: "it works but the issue is that scandium DOES have an apache while testreduce1001 does not.. but both are using parsoid::testing" [puppet] - https://gerrit.wikimedia.org/r/800237 (owner: Dzahn)
[19:19:54] <wikibugs>	 (CR) Dzahn: [C: -1] "for now -1, need a different approach to separate scandium/testreduce1001" [puppet] - https://gerrit.wikimedia.org/r/800237 (owner: Dzahn)
[19:27:22] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:28:10] <wikibugs>	 SRE, DC-Ops, Infrastructure-Foundations: private repo deployment - perccli implementation - https://phabricator.wikimedia.org/T308027 (RobH)
[19:28:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (RobH)
[19:28:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (RobH)
[19:28:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (Jclark-ctr) stat1010 E1  u24  cableid #  20220077   port24
[19:29:15] <wikibugs>	 (PS1) Dzahn: parsoid::testing: add an auto_restart service for nginx [puppet] - https://gerrit.wikimedia.org/r/800241
[19:30:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (Jclark-ctr)
[19:30:37] <wikibugs>	 (PS1) Jbond: wmflib::clusters::fetch: possible replacement for cluster_config [puppet] - https://gerrit.wikimedia.org/r/800242 (https://phabricator.wikimedia.org/T308639)
[19:30:57] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (Jclark-ctr) a:Jclark-ctr→Cmjohnson
[19:33:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wqds101[4,5,6] - https://phabricator.wikimedia.org/T307138 (Jclark-ctr) wqds1014 wqds1015
[19:33:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298560)', diff saved to https://phabricator.wikimedia.org/P28612 and previous config saved to /var/cache/conftool/dbconfig/20220526-193339-ladsgroup.json
[19:33:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[19:33:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[19:33:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:46] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[19:33:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298560)', diff saved to https://phabricator.wikimedia.org/P28613 and previous config saved to /var/cache/conftool/dbconfig/20220526-193347-ladsgroup.json
[19:33:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:05] <wikibugs>	 (CR) CI reject: [V: -1] wmflib::clusters::fetch: possible replacement for cluster_config [puppet] - https://gerrit.wikimedia.org/r/800242 (https://phabricator.wikimedia.org/T308639) (owner: Jbond)
[19:35:37] <wikibugs>	 SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (RobH) > Jin, >  > When you were last onsite, I neglected to include the swap of a problematic optic we have. >  > Can you quote us for an on-site to swap the optic in cr3-eqsin:xe-0/1/1 located in 603, U40....
[19:36:28] <wikibugs>	 SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T298459 (RobH) Open→Declined same as T300485
[19:36:33] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1004.wikimedia.org with OS bullseye
[19:36:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:38] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors: - cloudela...
[19:40:27] <tgr>	 !log T304548 running extensions/GrowthExperiments/maintenance/changeWikiConfig.php on tier4 Growth wikis
[19:40:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:32] <stashbot>	 T304548: Deploy "add a link" to 4th round of wikis - https://phabricator.wikimedia.org/T304548
[19:40:58] <wikibugs>	 (PS1) Ryan Kemper: elasticsearch: add ANSI color codes [cookbooks] - https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606)
[19:41:25] <wikibugs>	 (PS2) Ryan Kemper: elasticsearch: add ANSI color codes [cookbooks] - https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606)
[19:42:08] <wikibugs>	 (PS3) Ryan Kemper: elasticsearch: add ANSI color codes [cookbooks] - https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606)
[19:44:10] <wikibugs>	 (PS1) Dzahn: parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - https://gerrit.wikimedia.org/r/800245
[19:45:06] <wikibugs>	 (CR) CI reject: [V: -1] parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - https://gerrit.wikimedia.org/r/800245 (owner: Dzahn)
[19:45:27] <wikibugs>	 (PS2) Dzahn: parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - https://gerrit.wikimedia.org/r/800245
[19:46:20] <wikibugs>	 (CR) CI reject: [V: -1] parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - https://gerrit.wikimedia.org/r/800245 (owner: Dzahn)
[19:46:44] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage - bking@cumin1001 - T309343
[19:46:44] <wikibugs>	 (CR) CI reject: [V: -1] elasticsearch: add ANSI color codes [cookbooks] - https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606) (owner: Ryan Kemper)
[19:46:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:50] <stashbot>	 T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343
[19:49:31] <wikibugs>	 (CR) Dzahn: "hah! jerkins already gives -1 for " The following are missing a SPDX licence header:". nice" [puppet] - https://gerrit.wikimedia.org/r/800245 (owner: Dzahn)
[19:49:49] <mutante>	 it's not called jerkins anymore? :o
[19:52:06] <wikibugs>	 (PS3) Dzahn: parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - https://gerrit.wikimedia.org/r/800245
[19:53:06] <wikibugs>	 (Abandoned) Dzahn: parsoid::testing: remove auto_restart for apache, it uses nginx instead [puppet] - https://gerrit.wikimedia.org/r/800237 (owner: Dzahn)
[19:53:17] <wikibugs>	 (CR) CI reject: [V: -1] parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - https://gerrit.wikimedia.org/r/800245 (owner: Dzahn)
[19:54:52] <wikibugs>	 (PS4) Dzahn: parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - https://gerrit.wikimedia.org/r/800245
[19:55:18] <rzl>	 end of an era!
[19:55:35] <wikibugs>	 SRE, ops-codfw, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Jgreen)
[19:56:00] <mutante>	 hahaa, yea
[19:58:33] <wikibugs>	 (PS1) Gergő Tisza: Enable GrowthExperiments link recommendations, round 4 [mediawiki-config] - https://gerrit.wikimedia.org/r/800247 (https://phabricator.wikimedia.org/T304548)
[19:58:48] <wikibugs>	 (PS4) Ryan Kemper: elasticsearch: add ANSI color codes [cookbooks] - https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606)
[19:59:43] <wikibugs>	 (PS1) Zabe: snmp: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/800248 (https://phabricator.wikimedia.org/T308013)
[20:00:05] <jouncebot>	 brennen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T2000).
[20:00:05] <jouncebot>	 zabe and tgr: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:25] <brennen>	 o/
[20:00:31] <tgr>	 o/
[20:00:38] <wikibugs>	 (PS5) Ryan Kemper: elasticsearch: add ANSI color codes [cookbooks] - https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606)
[20:00:56] <brennen>	 zabe: about?
[20:01:12] <zabe>	 hey
[20:01:22] <wikibugs>	 (CR) Dzahn: "traffic team, you should just decide how you prefer it. I don't know how often this happens currently and how urgent it really is. Maybe w" [puppet] - https://gerrit.wikimedia.org/r/788312 (owner: David Caro)
[20:01:48] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] Acquire fresh actor id [extensions/CheckUser] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/798818 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[20:02:22] <brennen>	 zabe: anything to test with this first one?
[20:02:26] <zabe>	 no
[20:03:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:03:28] <wikibugs>	 (PS1) Zabe: shiny_server: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/800249 (https://phabricator.wikimedia.org/T308013)
[20:03:32] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] Fix phan failure PhanPluginSimplifyExpressionBool [extensions/CheckUser] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/798819 (owner: Zabe)
[20:04:29] <wikibugs>	 (CR) Dzahn: "Has been answered on ticket. While it could be automated they do want shell access at first at least to understand the full process. The r" [puppet] - https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: Alexandros Kosiaris)
[20:06:40] <brennen>	 zabe, tgr - any reason not to deploy these config patches while waiting on the checkuser ones?
[20:06:57] <tgr>	 mine can be deployed without testing
[20:06:58] <wikibugs>	 (PS1) Zabe: sbuild: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/800250 (https://phabricator.wikimedia.org/T308013)
[20:07:39] <zabe>	 brennen, mine can't. The checkuser patches fix a production error that needs to be fixed for that config patch.
[20:07:52] <brennen>	 zabe: ack, cool.  will go ahead with tgr's then.
[20:08:58] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] Enable GrowthExperiments link recommendations, round 4 [mediawiki-config] - https://gerrit.wikimedia.org/r/800247 (https://phabricator.wikimedia.org/T304548) (owner: Gergő Tisza)
[20:09:50] <wikibugs>	 (PS1) Zabe: samplicator: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/800251 (https://phabricator.wikimedia.org/T308013)
[20:09:52] <wikibugs>	 (Merged) jenkins-bot: Enable GrowthExperiments link recommendations, round 4 [mediawiki-config] - https://gerrit.wikimedia.org/r/800247 (https://phabricator.wikimedia.org/T304548) (owner: Gergő Tisza)
[20:10:40] <wikibugs>	 (PS2) Dzahn: admin: Add sgimeno to restricted [puppet] - https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: Alexandros Kosiaris)
[20:12:28] <wikibugs>	 SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Kelson)
[20:12:30] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Kelson)
[20:12:44] <logmsgbot>	 !log brennen@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:800247|Enable GrowthExperiments link recommendations, round 4 (T304548)]] (duration: 00m 56s)
[20:12:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:51] <stashbot>	 T304548: Deploy "add a link" to 4th round of wikis - https://phabricator.wikimedia.org/T304548
[20:13:03] <wikibugs>	 (03PS1) 10Zabe: rsyslog: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800252 (https://phabricator.wikimedia.org/T308013)
[20:13:05] <brennen>	 tgr: synched.
[20:16:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Incident: 2022-05-09 Exim BDAT Errors incident - https://phabricator.wikimedia.org/T309238 (10jhathaway)
[20:16:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:16:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:38] <wikibugs>	 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Kelson) @ArielGlenn It seems that T302981 has just been implemented. Does that mean you have no blocker anymore for this task?
[20:17:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:17:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:18:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:16] <wikibugs>	 (03PS1) 10Zabe: r_lang: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800254 (https://phabricator.wikimedia.org/T308013)
[20:18:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Kelson) @Andrew  Thank you for finally completing this task!
[20:18:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:18:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:00] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] turnilo: Fix port variable dererence for monitor [puppet] - 10https://gerrit.wikimedia.org/r/800231 (https://phabricator.wikimedia.org/T277729) (owner: 10BCornwall)
[20:19:23] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] "PCC indicates this will alter /etc/default/opensearch but it does not notify the opensearch service.  LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/799310 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[20:19:49] <wikibugs>	 (03PS1) 10Zabe: resolvconf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800255 (https://phabricator.wikimedia.org/T308013)
[20:20:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] resolvconf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800255 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[20:21:09] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1004.wikimedia.org with OS bullseye
[20:21:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:14] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye
[20:21:31] <wikibugs>	 (03Merged) 10jenkins-bot: Fix phan failure PhanPluginSimplifyExpressionBool [extensions/CheckUser] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798819 (owner: 10Zabe)
[20:22:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wqds101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson wqds1014  E2          cableid 20220072    port   30  wqds1015  E3          cableid 20220071    port...
[20:22:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wqds101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10Jclark-ctr)
[20:23:38] <wikibugs>	 10SRE, 10ops-eqiad: db1128 faulty memory - https://phabricator.wikimedia.org/T309291 (10wiki_willy) a:05wiki_willy→03Cmjohnson Hi @Marostegui - @Cmjohnson is going to check if we can pull one of the DIMMs from one of these retired pc* hosts:  https://netbox.wikimedia.org/dcim/devices/?q=&site=eqiad&status=...
[20:23:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Acquire fresh actor id [extensions/CheckUser] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798818 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[20:24:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:24:22] <logmsgbot>	 !log brennen@deploy1002 Synchronized php-1.39.0-wmf.13/extensions/CheckUser/src/Specials/SpecialCheckUser.php: Backport: [[gerrit:798819|Fix phan failure PhanPluginSimplifyExpressionBool]] (duration: 00m 52s)
[20:24:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:05] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] "recheck" [extensions/CheckUser] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798818 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[20:26:20] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1004.wikimedia.org with OS bullseye
[20:26:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:25] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors: - cloudela...
[20:26:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T298555)', diff saved to https://phabricator.wikimedia.org/P28614 and previous config saved to /var/cache/conftool/dbconfig/20220526-202625-ladsgroup.json
[20:26:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:31] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[20:27:55] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1004.wikimedia.org with OS bullseye
[20:28:00] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye
[20:28:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:03] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1004.wikimedia.org with OS bullseye
[20:28:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:07] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors: - cloudela...
[20:28:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:29:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:29:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:30:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:03] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1004.wikimedia.org with OS bullseye
[20:30:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:07] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye
[20:30:10] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1004.wikimedia.org with OS bullseye
[20:30:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:15] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors: - cloudela...
[20:30:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:31:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10Jclark-ctr) @BTullis please confirm if New rows E- F are ok for this host.
[20:40:54] <inflatador>	 !log bking@install1003 removed cloudelastic1004.conf pxe config file T309343
[20:40:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:03] <stashbot>	 T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343
[20:41:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P28615 and previous config saved to /var/cache/conftool/dbconfig/20220526-204130-ladsgroup.json
[20:41:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Jclark-ctr) @BTullis  please confirm racking instructions and if New rows E- F are ok racking
[20:42:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:42:22] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1004.wikimedia.org with OS bullseye
[20:42:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:28] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye
[20:44:49] <wikibugs>	 (03Merged) 10jenkins-bot: Acquire fresh actor id [extensions/CheckUser] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798818 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[20:45:23] <wikibugs>	 (03CR) 10Bking: [C: 03+1] elasticsearch: add ANSI color codes [cookbooks] - 10https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper)
[20:45:45] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] Revert "Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800190 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[20:45:51] <wikibugs>	 (03PS2) 10Brennen Bearnes: Revert "Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800190 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[20:46:03] <wikibugs>	 (03CR) 10Brennen Bearnes: Revert "Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800190 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[20:47:57] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800190 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[20:48:27] <brennen>	 zabe: "acquire fresh actor id" and the config revert-revert are on mwdebug1002 if there's anything testable
[20:48:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10cmooney) a:05Cmjohnson→03Jclark-ctr @Cmjohnson apologies I assigned this to you in error (blind as a bat), I see @Jclark-ctr actually...
[20:48:42] <zabe>	 looking
[20:49:59] <zabe>	 brennen, looks good
[20:50:03] <brennen>	 zabe: cool, syncing.
[20:51:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:51:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:52:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:52:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:53:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:39] <logmsgbot>	 !log brennen@deploy1002 Synchronized php-1.39.0-wmf.13/extensions/CheckUser/src/Hooks.php: Backport: [[gerrit:798818|Acquire fresh actor id (T233004 T309148)]] (duration: 00m 51s)
[20:53:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:46] <stashbot>	 T309148: Wikimedia\Rdbms\DBQueryError: Error 1048: Column 'cuc_actor' cannot be null Function: MediaWiki\CheckUser\Hooks::updateCheckUserData Query: INSERT INTO `cu_changes` (cuc_namespace,cuc_title,cuc_minor,cuc_user,cuc_user_text,cuc_ - https://phabricator.wikimedia.org/T309148
[20:53:46] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[20:55:05] <logmsgbot>	 !log brennen@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:800190|Revert "Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki"" (T233004)]] (duration: 00m 50s)
[20:55:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:35] <brennen>	 !log end of utc late backport and config window
[20:55:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:47] <brennen>	 zabe: done, thx.
[20:55:56] <zabe>	 thanks for your help :)
[20:56:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P28616 and previous config saved to /var/cache/conftool/dbconfig/20220526-205635-ladsgroup.json
[20:56:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:04] <wikibugs>	 (03PS1) 10Cathal Mooney: Modify Eqiad CR labs-in filter to allow BGP and ICMP [homer/public] - 10https://gerrit.wikimedia.org/r/800270 (https://phabricator.wikimedia.org/T304989)
[21:02:07] <wikibugs>	 (03CR) 10Dzahn: "I know Alexandros is currently out so I am being bold and just amend here and use "restricted". That is a subset of deployment and ensures" [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: 10Alexandros Kosiaris)
[21:02:33] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Modify Eqiad CR labs-in filter to allow BGP and ICMP [homer/public] - 10https://gerrit.wikimedia.org/r/800270 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney)
[21:03:09] <wikibugs>	 (03Merged) 10jenkins-bot: Modify Eqiad CR labs-in filter to allow BGP and ICMP [homer/public] - 10https://gerrit.wikimedia.org/r/800270 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney)
[21:03:21] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Dzahn)
[21:03:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Jclark-ctr) @cmooney  Apologize for that not sure how that changed when i copied it from excel to here i noticed a few other mistakes dow...
[21:03:33] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Dzahn) a:03Dzahn
[21:03:38] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Dzahn) 05Open→03In progress
[21:05:15] <wikibugs>	 (03PS1) 10Zabe: Start writing to cuc_actor everywhere except s4 and s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800278 (https://phabricator.wikimedia.org/T233004)
[21:06:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10cmooney) @Jclark-ctr  ok thanks for the clarification.  I've only put the port details for 1025 and 1026 into Netbox so far, ports 21 and...
[21:09:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Jclark-ctr) cloudcephosd1030 f4 21u 20 20220087 ; 21 20220081 cloudcephosd1031 f4 22u 22 20220075 ; 23 20220083 cloudcephosd1032 f4 23u 2...
[21:10:30] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1004.wikimedia.org with OS bullseye
[21:10:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:34] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors: - cloudela...
[21:11:26] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Dzahn) @thcipriani Your approval is requested as group approver for "restricted" (just like for 'deployment').
[21:11:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T298555)', diff saved to https://phabricator.wikimedia.org/P28617 and previous config saved to /var/cache/conftool/dbconfig/20220526-211140-ladsgroup.json
[21:11:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:47] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[21:12:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Jclark-ctr) a:05Jclark-ctr→03cmooney
[21:12:56] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Dzahn) @Sgs Understood! I'll move this forward to get you your access to unblock you. Automating it as a systemd timer would be nice indeed and we can help...
[21:15:10] <mutante>	 !log puppetmaster1001 - sudo puppet cert clean gitlab1004.wikimedia.org revoked cert with serial 9600 AND cert with serial 9694 - somehow agent got "cert revoked" before I did anything (T309259)
[21:15:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:17] <stashbot>	 T309259: gitlab1004 - puppet cert revoked? - https://phabricator.wikimedia.org/T309259
[21:16:11] <mutante>	 !log gitlab1004 - rm -rf /var/lib/puppet/ssl (T309259)
[21:16:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:07] <mutante>	 !log gitlab1004/puppetmaster1001 - create new signing request, sign new cert for puppet, fixed puppet run - T309259
[21:17:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10cmooney) Should be good for rows E and F if that works for the team.
[21:17:33] <wikibugs>	 10SRE, 10serviceops, 10GitLab (Infrastructure): gitlab1004 - puppet cert revoked? - https://phabricator.wikimedia.org/T309259 (10Dzahn) 05Open→03Resolved a:03Dzahn Notice: /Stage[main]/Ferm/Service[ferm]/ensure: ensure changed 'stopped' to 'running' (corrective) Info: /Stage[main]/Ferm/Service[ferm]: U...
[21:17:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10cmooney) These should be ok for rows E/F if that suits the team.
[21:18:59] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:21:47] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Connect - wmf_public_asn, AS14907/IPv4: Connect - wmf_public_asn, AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:24:48] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:24:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:12] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1026.mgmt.eqiad.wmnet with reboot policy FORCED
[21:27:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:48] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10bking) Looking at the console, the installer keeps coming up in non-interactive mode. I tried clicking through, but it said it couldn't download the preseed file. Will raise...
[21:33:37] <wikibugs>	 10SRE: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (10mdedul.islam.16)
[21:33:38] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang)
[21:34:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:40:11] <wikibugs>	 (03CR) 10Ori: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/800282 (https://phabricator.wikimedia.org/T207200) (owner: 10Ori)
[21:42:08] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:42:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:31] <wikibugs>	 (03PS3) 10Ori: Beta Cluster: ship logs from docker services to logstash [puppet] - 10https://gerrit.wikimedia.org/r/800282 (https://phabricator.wikimedia.org/T309319)
[21:42:44] <wikibugs>	 10SRE: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (10TheresNoTime) 05duplicate→03Open
[21:43:34] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] Add language fallback support for wmgSiteLogoVariants (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang)
[21:43:49] <mutante>	 phab spammer trying to merge stuff into his spam task.. someone already blocked them. good
[21:44:31] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:45:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Beta Cluster: ship logs from docker services to logstash [puppet] - 10https://gerrit.wikimedia.org/r/800282 (https://phabricator.wikimedia.org/T309319) (owner: 10Ori)
[21:46:26] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Jclark-ctr) Yea looks right  i just need it for setting up servers
[21:49:25] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] "This seems sane to me, forcing revalidation. Unfortunately while I've adjusted this file I'm also far from an expert on these things." [puppet] - 10https://gerrit.wikimedia.org/r/799297 (https://phabricator.wikimedia.org/T289243) (owner: 10Lucas Werkmeister (WMDE))
[21:52:03] <wikibugs>	 (03PS4) 10Ori: Beta Cluster: ship logs from docker services to logstash [puppet] - 10https://gerrit.wikimedia.org/r/800282 (https://phabricator.wikimedia.org/T309319)
[21:53:30] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:53:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:13] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn, AS14907/IPv6: Connect - wmf_public_asn, AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:54:46] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "thanks again. fixed https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=an-tool1007&service=Check+Turnilo+node+appserver" [puppet] - 10https://gerrit.wikimedia.org/r/800231 (https://phabricator.wikimedia.org/T277729) (owner: 10BCornwall)
[21:57:05] <wikibugs>	 (03CR) 10Dzahn: "adding group approver" [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: 10Alexandros Kosiaris)
[21:58:22] <wikibugs>	 (03PS1) 10Dzahn: admin: add mabualruz to ldap_only admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/800284 (https://phabricator.wikimedia.org/T309215)
[21:58:49] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:59:08] <wikibugs>	 (03CR) 10Dzahn: "does anyone think the ' in the realname field will be an issue?" [puppet] - 10https://gerrit.wikimedia.org/r/800284 (https://phabricator.wikimedia.org/T309215) (owner: 10Dzahn)
[22:00:05] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Mabualruz - https://phabricator.wikimedia.org/T309215 (10Dzahn) 05Open→03In progress
[22:00:21] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:02:21] <wikibugs>	 (03PS1) 10Cathal Mooney: Change cloud uplink interface definition for VRF [homer/public] - 10https://gerrit.wikimedia.org/r/800285 (https://phabricator.wikimedia.org/T304989)
[22:02:42] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "My 2 cents, if I can (non blocking):" [cookbooks] - 10https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper)
[22:04:32] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "This already has approval from group_approver and manager and Alex uploaded it.. so I'll go ahead and close this out. Easy one too since b" [puppet] - 10https://gerrit.wikimedia.org/r/798664 (https://phabricator.wikimedia.org/T308308) (owner: 10Alexandros Kosiaris)
[22:09:00] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (10Dzahn) 05Open→03In progress
[22:10:12] <wikibugs>	 (03CR) 10Stang: Add language fallback support for wmgSiteLogoVariants (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang)
[22:10:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10cmooney) @Jclark-ctr I'm not really able to progress this.  I was gonna try one reimage but given the disk / RAID config needs to be done...
[22:10:46] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] admin: add mabualruz to ldap_only admins (wmf) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800284 (https://phabricator.wikimedia.org/T309215) (owner: 10Dzahn)
[22:12:20] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (10Dzahn) deployed / resolved.  Both users exist on the deployment server now.  They will also e...
[22:12:55] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:13:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:41] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Change cloud uplink interface definition for VRF [homer/public] - https://gerrit.wikimedia.org/r/800285 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[22:16:54] <wikibugs>	 SRE, SRE-Access-Requests, Machine-Learning-Team (Active Tasks), Patch-For-Review: Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (Dzahn) In progress→Resolved a:Dzahn
[22:17:07] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (Dzahn) a:Dzahn→thcipriani
[22:18:23] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[22:18:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:18] <mutante>	 !log phabricator adding mabualruz to WMF-NDA group for access to private tickets T309215
[22:20:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:24] <stashbot>	 T309215: Grant Access to ldap/wmf for Mabualruz - https://phabricator.wikimedia.org/T309215
[22:20:59] <wikibugs>	 (CR) Dzahn: [C: +2] "thanks for review, going ahead" [puppet] - https://gerrit.wikimedia.org/r/800284 (https://phabricator.wikimedia.org/T309215) (owner: Dzahn)
[22:21:04] <wikibugs>	 (PS2) Dzahn: admin: add mabualruz to ldap_only admins (wmf) [puppet] - https://gerrit.wikimedia.org/r/800284 (https://phabricator.wikimedia.org/T309215)
[22:22:00] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:22:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:34] <wikibugs>	 SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (ArielGlenn) >>! In T57503#7961680, @Kelson wrote: > @ArielGlenn It seems that T302981 has just been implemented. Does that mean you have...
[22:23:32] <wikibugs>	 (Merged) jenkins-bot: Change cloud uplink interface definition for VRF [homer/public] - https://gerrit.wikimedia.org/r/800285 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[22:25:19] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for Mabualruz - https://phabricator.wikimedia.org/T309215 (Dzahn) In progress→Resolved a:Dzahn @Mabualruz Welcome! You have been added to the "wmf" LDAP group and the "WMF-NDA" Phabricator group.  This means you can now...
[22:31:34] <wikibugs>	 (PS1) Cwhite: aptrepo: add opensearch2 thirdparty component [puppet] - https://gerrit.wikimedia.org/r/800294 (https://phabricator.wikimedia.org/T304440)
[22:35:07] <mutante>	 win 14
[22:49:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:49:23] <icinga-wm>	 PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:52:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (BTullis) Yes, rows E and F are fine for this, thanks.
[22:53:23] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (BTullis) Yes, rows E and F are fine for these presto servers, thanks.
[22:53:57] <icinga-wm>	 RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:54:05] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (BTullis)
[23:04:04] <wikibugs>	 (PS1) Dzahn: gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463)
[23:05:07] <wikibugs>	 (PS2) Dzahn: gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463)
[23:05:57] <wikibugs>	 (CR) CI reject: [V: -1] gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[23:08:00] <wikibugs>	 (CR) Dzahn: "the part that I am also including backup::host means one step to being able to let Bacula fetch from it too. but the second step needed wi" [puppet] - https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[23:08:28] <wikibugs>	 (CR) Brennen Bearnes: [C: +1] "Discussed this with @Dzahn, seems like a good stop-gap." [puppet] - https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[23:10:37] <wikibugs>	 (PS3) Dzahn: gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463)
[23:12:10] <wikibugs>	 (CR) Dzahn: "btw this only works because https://phabricator.wikimedia.org/T309259 is resolved since earlier today" [puppet] - https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[23:14:04] <wikibugs>	 SRE, serviceops, GitLab (Infrastructure): gitlab1004 - puppet cert revoked? - https://phabricator.wikimedia.org/T309259 (Dzahn) Now using this machine for https://gerrit.wikimedia.org/r/c/operations/puppet/+/800308 and setting it active in netbox.
[23:14:08] <wikibugs>	 (PS4) Dzahn: gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463)
[23:16:48] <wikibugs>	 (PS3) Cwhite: opensearch_dashboards: add backup script enable job [puppet] - https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224)
[23:18:34] <wikibugs>	 (PS5) Dzahn: gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463)
[23:19:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[23:20:28] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35581/" [puppet] - https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[23:20:37] <wikibugs>	 (PS6) Dzahn: gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463)
[23:23:45] <wikibugs>	 (CR) Brennen Bearnes: [C: +1] gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[23:28:34] <wikibugs>	 (CR) Dzahn: "noop on gitlab1001 and on gitlab1003 puppet is disabled because it was trying to run the automatic restore.. which disabled puppet.. and t" [puppet] - https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[23:33:09] <wikibugs>	 (PS1) Dzahn: gitlab::dump: ensure /srv/gitlab-backup exists so that rsync starts [puppet] - https://gerrit.wikimedia.org/r/800312 (https://phabricator.wikimedia.org/T274463)
[23:33:58] <wikibugs>	 (CR) CI reject: [V: -1] gitlab::dump: ensure /srv/gitlab-backup exists so that rsync starts [puppet] - https://gerrit.wikimedia.org/r/800312 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[23:34:46] <wikibugs>	 (PS2) Dzahn: gitlab::dump: ensure /srv/gitlab-backup exists so that rsync starts [puppet] - https://gerrit.wikimedia.org/r/800312 (https://phabricator.wikimedia.org/T274463)
[23:37:17] <wikibugs>	 (CR) CI reject: [V: -1] gitlab::dump: ensure /srv/gitlab-backup exists so that rsync starts [puppet] - https://gerrit.wikimedia.org/r/800312 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[23:42:27] <wikibugs>	 (PS3) Dzahn: gitlab::dump: ensure /srv/gitlab-backup exists so that rsync starts [puppet] - https://gerrit.wikimedia.org/r/800312 (https://phabricator.wikimedia.org/T274463)
[23:45:24] <wikibugs>	 (CR) Dzahn: [C: +2] gitlab::dump: ensure /srv/gitlab-backup exists so that rsync starts [puppet] - https://gerrit.wikimedia.org/r/800312 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[23:52:45] <icinga-wm>	 PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:54:39] <wikibugs>	 (PS4) Cwhite: opensearch_dashboards: add backup script enable job [puppet] - https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224)
[23:54:41] <wikibugs>	 (PS1) Dzahn: gitlab::dump: use rsync::module directly, not quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/800329 (https://phabricator.wikimedia.org/T274463)
[23:55:14] <wikibugs>	 (CR) Cwhite: opensearch_dashboards: add backup script enable job (5 comments) [puppet] - https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) (owner: Cwhite)
[23:55:51] <wikibugs>	 (PS2) Dzahn: gitlab::dump: use rsync::module directly, not quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/800329 (https://phabricator.wikimedia.org/T274463)
[23:58:49] <wikibugs>	 (CR) Dzahn: [C: +2] gitlab::dump: use rsync::module directly, not quickdatacopy [puppet] - https://gerrit.wikimedia.org/r/800329 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)