[00:02:47] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:03:13] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:09:05] SRE, DBA, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (tstarling)
[00:09:40] SRE, DBA, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (tstarling) Task description edit: added plan for direct TLS, no connection pooling or tunnel.
[00:20:29] SRE, SRE-OnFire, Release-Engineering-Team, Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (Dzahn) list of repos that exist on deployment servers but do not appear in the kubernetes.yaml. (just using the string that is the first level of th...
[00:22:43] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:27:05] PROBLEM - Disk space on gitlab1001 is CRITICAL: DISK CRITICAL - free space: /mnt/gitlab-backup 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1001&var-datasource=eqiad+prometheus/ops
[00:28:08] ^ arr. checking that
[00:28:25] it's "just" the backups but we made changes to avoid this
[00:28:53] the good part is..
it didn't take the service down because that's a dedicated mount
[00:33:01] ACKNOWLEDGEMENT - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service,rsync-config-backup-gitlab1003.wikimedia.org.service daniel_zahn https://phabricator.wikimedia.org/T308089 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:01] ACKNOWLEDGEMENT - Disk space on gitlab1001 is CRITICAL: DISK CRITICAL - free space: /mnt/gitlab-backup 0 MB (0% inode=99%): daniel_zahn https://phabricator.wikimedia.org/T308089 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1001&var-datasource=eqiad+prometheus/ops
[00:33:01] ACKNOWLEDGEMENT - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service daniel_zahn https://phabricator.wikimedia.org/T308089 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:48:19] RECOVERY - Disk space on gitlab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1001&var-datasource=eqiad+prometheus/ops
[00:50:49] SRE: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (Dzahn)
[00:52:02] SRE: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (Dzahn) We can just fix them but we can also question if they should/can be removed on non-active hosts (via puppet changes), whether they should really be CRIT etc.
[00:52:47] SRE: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (Dzahn)
[00:56:51] !log gitlab1001 - T308089 T274463 - '<+icinga-wm> PROBLEM - Disk space on gitlab1001 is CRITICAL: DISK CRITICAL - free space: /mnt/gitlab-backup 0 MB' - manually deleted 1653294190_2022_05_23_14.10.2_gitlab_backup.tar (we have May 24 and 25, 26 could not finish writing backup) - RECOVERY - Disk space on gitlab1001 is OK
[00:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:59] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[00:56:59] T308089: gitlab-restore: version detection fail / restore fail - https://phabricator.wikimedia.org/T308089
[00:57:39] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:58:18] !log gitlab1001 - T308089 T274463 - gitlab1001 - systemctl start full-backup
[00:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:21] !log gitlab1003 - T308089 T274463 - gitlab1003 - systemctl status backup-restore is failed because it's looking for /mnt/gitlab-backup/latest/latest.tar needs gerrit:799016
[01:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:44] (PS3) Dzahn: gitlab: switch backup location to /srv, don't use /etc [puppet] - https://gerrit.wikimedia.org/r/799016 (https://phabricator.wikimedia.org/T274463)
[01:02:59] (CR) Dzahn: [C: +2] gitlab: switch backup location to /srv, don't use /etc [puppet] - https://gerrit.wikimedia.org/r/799016 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[01:05:59] (CR) Dzahn: [C: +2] "noop on gitlab1001, change on gitlab2001, re-enabling puppet on gitlab1003 (puppet was stopped by restore script but could not finish)" [puppet] - https://gerrit.wikimedia.org/r/799016
(https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[01:10:00] SRE, GitLab, serviceops: gitlab1004 - puppet cert revoked? - https://phabricator.wikimedia.org/T309259 (Dzahn)
[01:10:47] (PS2) Dzahn: gitlab: rsync config and data backup to same folder on replica [puppet] - https://gerrit.wikimedia.org/r/799280 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[01:12:20] SRE, serviceops, GitLab (Infrastructure): gitlab1004 - puppet cert revoked? - https://phabricator.wikimedia.org/T309259 (Dzahn)
[01:13:07] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:13:49] (CR) Dzahn: [C: +2] gitlab: rsync config and data backup to same folder on replica [puppet] - https://gerrit.wikimedia.org/r/799280 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[01:20:04] !log gitlab1003 - T308089 T274463 - gitlab1001 - deleted backups from April 4 and April 5 from /srv/gitlab-backup AND deleted partial failed backups from May 26 from /mnt/gitlab-backup; deployed both gerrit:799016 and gerrit:799280 ; restarting full-backup service
[01:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:12] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[01:20:12] T308089: gitlab-restore: version detection fail / restore fail - https://phabricator.wikimedia.org/T308089
[01:24:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:26:09] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:27:01] RECOVERY - Check systemd state on
gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:51] !log T308089 T274463 - gitlab1001 - systemctl start rsync-config-backup-gitlab1003.wikimedia.org - Succeeded - RECOVERY - Check systemd state on gitlab1001 is OK
[01:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:59] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[01:27:59] T308089: gitlab-restore: version detection fail / restore fail - https://phabricator.wikimedia.org/T308089
[01:34:55] PROBLEM - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298555)', diff saved to https://phabricator.wikimedia.org/P28566 and previous config saved to /var/cache/conftool/dbconfig/20220526-013741-ladsgroup.json
[01:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:48] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[01:40:33] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:43:51] (PS10) Tim Starling: Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz)
[01:45:49] (CR) Tim Starling: [C: +1] "PS10: globalKeyLB -> cluster, globalKeyLbDomain -> dbDomain, add Depends-On."
[mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz)
[01:46:39] !log T308089 T274463 - gitlab1001 - still not enough disk space to finish full backup. moved backup of May 24th to /root/ . deleted latest.tar; started full-backup service once again
[01:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:46:46] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[01:46:46] T308089: gitlab-restore: version detection fail / restore fail - https://phabricator.wikimedia.org/T308089
[01:47:17] RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:29] (PS1) Tim Starling: Switch wgMainStash to db-mainstash [mediawiki-config] - https://gerrit.wikimedia.org/r/799433
[01:51:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[01:51:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[01:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:59] (CR) Tim Starling: "Please give +1 for deployment after eval.php testing of db-mainstash."
[mediawiki-config] - https://gerrit.wikimedia.org/r/799433 (owner: Tim Starling)
[01:52:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P28567 and previous config saved to /var/cache/conftool/dbconfig/20220526-015247-ladsgroup.json
[01:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P28568 and previous config saved to /var/cache/conftool/dbconfig/20220526-020752-ladsgroup.json
[02:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:51] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (backup1002, ...), Fresh: 109 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[02:14:21] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:23:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298555)', diff saved to https://phabricator.wikimedia.org/P28569 and previous config saved to /var/cache/conftool/dbconfig/20220526-022259-ladsgroup.json
[02:23:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[02:23:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[02:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:06] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[02:23:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298555)', diff saved to https://phabricator.wikimedia.org/P28570 and previous config saved
to /var/cache/conftool/dbconfig/20220526-022307-ladsgroup.json
[02:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:25] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:27:39] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:44:23] PROBLEM - Disk space on gitlab1001 is CRITICAL: DISK CRITICAL - free space: /mnt/gitlab-backup 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1001&var-datasource=eqiad+prometheus/ops
[02:52:31] (CR) TsepoThoabala: [C: +1] Assign similareditors right to the checkuser group [mediawiki-config] - https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205) (owner: AGueyte)
[03:05:21] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:21:27] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:05] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 31.32 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:33:07] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 37.6 le
60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:35:41] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 45.34 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:37:37] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 108.6 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:37:49] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:37:57] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 87.45 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[04:02:29] (PS5) Abijeet Patro: Add namespaces to Punjabi wikisource default search [mediawiki-config] - https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887)
[04:02:47] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:03:13] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:31:26] !log ladsgroup@cumin1001 dbctl
commit (dc=all): 'Repooling after maintenance db1119 (T298555)', diff saved to https://phabricator.wikimedia.org/P28571 and previous config saved to /var/cache/conftool/dbconfig/20220526-043126-ladsgroup.json
[04:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:31:34] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[04:33:05] SRE, DBA, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (tstarling) Stalled→Open
[04:42:00] SRE, DBA, Patch-For-Review, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (tstarling) a:tstarling
[04:42:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:45:31] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:45:43] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:46:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P28572 and previous config saved to /var/cache/conftool/dbconfig/20220526-044631-ladsgroup.json
[04:56:42] (PS1) Tim Starling: Enable SSL for master DB connections in the secondary datacenter [mediawiki-config] -
https://gerrit.wikimedia.org/r/799437 (https://phabricator.wikimedia.org/T134809)
[04:56:57] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:01:21] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:01:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P28573 and previous config saved to /var/cache/conftool/dbconfig/20220526-050136-ladsgroup.json
[05:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:21] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:07:45] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:10:01] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:10:17] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:16:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298555)', diff saved to https://phabricator.wikimedia.org/P28574 and previous config saved to /var/cache/conftool/dbconfig/20220526-051641-ladsgroup.json
[05:16:43] !log ladsgroup@cumin1001 START -
Cookbook sre.hosts.downtime for 10:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[05:16:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[05:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:49] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[05:16:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298555)', diff saved to https://phabricator.wikimedia.org/P28575 and previous config saved to /var/cache/conftool/dbconfig/20220526-051649-ladsgroup.json
[05:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111', diff saved to https://phabricator.wikimedia.org/P28576 and previous config saved to /var/cache/conftool/dbconfig/20220526-053155-marostegui.json
[05:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:54] (PS1) Marostegui: db1111: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/799443 (https://phabricator.wikimedia.org/T308915)
[05:42:15] (CR) Marostegui: [C: +2] db1111: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/799443 (https://phabricator.wikimedia.org/T308915) (owner: Marostegui)
[05:47:07] (PS1) KartikMistry: Update cxserver to 2022-05-26-052433-production [deployment-charts] - https://gerrit.wikimedia.org/r/799638 (https://phabricator.wikimedia.org/T309161)
[05:53:24] (PS1) Marostegui: db1111: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/799647
[05:54:09] (CR) Marostegui: [C: +2] db1111: Disable notifications [puppet] -
https://gerrit.wikimedia.org/r/799647 (owner: Marostegui)
[05:59:13] * kart_ updating cxserver..
[05:59:39] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:00:05] kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T0600).
[06:00:37] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:00:52] (CR) KartikMistry: [C: +2] Update cxserver to 2022-05-26-052433-production [deployment-charts] - https://gerrit.wikimedia.org/r/799638 (https://phabricator.wikimedia.org/T309161) (owner: KartikMistry)
[06:02:08] marostegui: oops. I missed the switchover window as I was looking at May 26 in the deployment calendar.. my deployment will take few minutes only..
[06:03:17] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:04:29] kart_: no, no, go for it
[06:04:41] kart_: it is a predefined window each Tuesday and Thursday
[06:04:46] but it is empty this week
[06:05:06] Ok. Thanks marostegui
[06:05:14] (Merged) jenkins-bot: Update cxserver to 2022-05-26-052433-production [deployment-charts] - https://gerrit.wikimedia.org/r/799638 (https://phabricator.wikimedia.org/T309161) (owner: KartikMistry)
[06:05:25] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 25 Jun 2022 07:55:09 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:05:41] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:06:01] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:06:11] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48107 bytes in 0.210 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:06:32] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:58] kart_: that's weird, the pin to US hours makes it show on the wrong day for us
[06:07:05] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:07:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.403 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:57] RhinosF1: yes!
[06:10:18] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:10] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:23] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:12:43] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:12:51] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:46] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:06] Anyone know why Grafana stopped showing restart/deploys in the graphs?
[06:14:25] Button is there, but it has no effect.
[06:15:42] !log Updated cxserver to 2022-05-26-052433-production (T309161, T308829, T308834)
[06:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:50] T308834: Enable Section Translation on some wikis while Content Translation remains in beta - https://phabricator.wikimedia.org/T308834
[06:15:50] T308829: Enable Section Translation on 10 Wikipedias where Content Translaiton is available by default - https://phabricator.wikimedia.org/T308829
[06:15:51] T309161: Infoxbox Writer template fails to translate with Google MT - https://phabricator.wikimedia.org/T309161
[06:19:03] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:19:25] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:21:39] SRE, Deployments, Parsoid, bacula, Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (jcrespo) Incident lightweight report: https://wikitech.wikimedia.org/wiki/Incidents/2022-05-2_deployment
[06:21:45] SRE: ProdPasteBot uses deprecated certificate auth - https://phabricator.wikimedia.org/T242857 (Aklapper) @DSharpe: Do you maybe know the answer to my last comment (or know someone who could)? Thanks!
[06:31:14] SRE, SRE-OnFire, Release-Engineering-Team, Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (jcrespo) @Dzahn That doesn't seem right- mediawiki-staging is the current main method of deploying mediawiki, and httpbb-tests seems in active usage...
[06:32:30] (PS1) Tim Starling: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809)
[06:44:24] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[06:46:36] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:48:17] (CR) Marostegui: "Does this change in anyway the way we do operations on the stand-by DC at the moment? ie: right now we don't even have to depool the codfw" [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: Tim Starling)
[07:00:04] Amir1 and apergos: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T0700).
[07:00:04] samwilson: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:11] SRE, SRE-OnFire, Release-Engineering-Team, Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (jcrespo) This is a list of resources configured on puppet, but I am not sure if the list is exhaustive: ` File[/srv/deployment/scap] from /etc/puppe...
[07:00:11] hello!
[07:00:30] hello :)
[07:00:47] no trainees are scheduled for today's window
[07:01:12] there are two patches in the window only, and they are yours, samwilson
[07:01:20] are you doing self deploy?
[07:02:08] no (although I guess I could be a trainee!). Can you deploy?
I'm here to test, and Satdeep is going to help test too.
[07:02:13] ah ok
[07:02:25] I'll do that then
[07:02:44] :) thanks
[07:03:19] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/796385 this says there's a merge conflict, can you sort that?
[07:03:40] samwilson:
[07:03:44] sure, doing now
[07:03:47] ty
[07:04:18] (PS6) Samwilson: Enable Realtime Preview on more pilot wikis: huwiki and fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/796385 (https://phabricator.wikimedia.org/T303961)
[07:05:16] (I don't feel comfortable doing both a training and the deploy by myself, so deploy it is)
[07:06:02] sure!
[07:06:19] I really should re-learn deployment stuff. I did do it once, years ago.
[07:06:58] you should. sign up for a training!
[07:07:40] (CR) ArielGlenn: [C: +2] Enable Realtime Preview on more pilot wikis: huwiki and fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/796385 (https://phabricator.wikimedia.org/T303961) (owner: Samwilson)
[07:09:25] (Merged) jenkins-bot: Enable Realtime Preview on more pilot wikis: huwiki and fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/796385 (https://phabricator.wikimedia.org/T303961) (owner: Samwilson)
[07:10:33] samwilson: live on mwdebug1002, please test
[07:10:45] thanks. testing now.
[07:12:45] (CR) Elukey: [C: +2] ml-services: update docker image for revscoring-editquality-* [deployment-charts] - https://gerrit.wikimedia.org/r/799349 (https://phabricator.wikimedia.org/T309102) (owner: Elukey) [07:12:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:13:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:13:52] apergos: looks good, go for it [07:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:21] !log ariel@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:796385|Enable Realtime Preview on more pilot wikis: huwiki and fiwiki (T303961)]] (duration: 00m 51s) [07:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:26] T303961: Rollout plan for real-time preview - https://phabricator.wikimedia.org/T303961 [07:15:42] samwilson: it's live, please do any followup testing [07:16:38] yep, all looks as it should. [07:17:52] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [07:18:17] seems ok, proceeding [07:18:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[07:18:26] heh merge conflict [07:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:32] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/793799 [07:18:48] please sort, samwilson [07:19:03] (PS6) Samwilson: Add namespaces to Punjabi wikisource default search [mediawiki-config] - https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887) (owner: Abijeet Patro) [07:19:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:16] (CR) ArielGlenn: [C: +2] Add namespaces to Punjabi wikisource default search [mediawiki-config] - https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887) (owner: Abijeet Patro) [07:20:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:20:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:06] (Merged) jenkins-bot: Add namespaces to Punjabi wikisource default search [mediawiki-config] - https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887) (owner: Abijeet Patro) [07:22:27] samwilson: live on mwdebug1002, please test. [07:22:42] testing now [07:23:12] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [07:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:39] apergos: looks great, is working.
[07:24:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:12] !log ariel@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793799|Add namespaces to Punjabi wikisource default search (T287887)]] (duration: 00m 50s) [07:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:18] T287887: Optimize default search namespaces for Punjabi wikisources - https://phabricator.wikimedia.org/T287887 [07:25:22] samwilson: live, please do followup testing [07:25:57] testing now. satdeep is also testing. [07:29:33] apergos: all good! [07:29:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:50] looks good from here too [07:30:17] thank you for choosing us as your deployment providers today, do come back again! [07:30:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:30:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:30:38] :-) no, thank you! [07:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:51] and I will try to do the training at some point [07:31:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:11] https://wikitech.wikimedia.org/wiki/Deployments/Training https://phabricator.wikimedia.org/maniphest/task/edit/form/96/ how to sign up, samwilson [07:32:14] see you there! 
[07:41:05] (PS1) Marostegui: es2030,es2022: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/799789 (https://phabricator.wikimedia.org/T309265) [07:44:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298555)', diff saved to https://phabricator.wikimedia.org/P28577 and previous config saved to /var/cache/conftool/dbconfig/20220526-074436-ladsgroup.json [07:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:43] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [07:45:18] (CR) Marostegui: [C: +2] es2030,es2022: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/799789 (https://phabricator.wikimedia.org/T309265) (owner: Marostegui) [07:47:40] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:50:04] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:55:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [07:55:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [07:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298560)', diff saved to https://phabricator.wikimedia.org/P28578 and previous config saved to /var/cache/conftool/dbconfig/20220526-075525-ladsgroup.json [07:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:34] T298560: Fix mismatching field type of
revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [07:59:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P28579 and previous config saved to /var/cache/conftool/dbconfig/20220526-075941-ladsgroup.json [07:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:04] dancy and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T0800). [08:02:47] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:03:13] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:09:12] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:10:04] Puppet, SRE, Infrastructure-Foundations: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268 (Majavah) [08:14:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P28580 and previous config saved to /var/cache/conftool/dbconfig/20220526-081446-ladsgroup.json [08:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:19] (PS2) Majavah: hieradata: purge stale sudoers.d entries in production [puppet] - https://gerrit.wikimedia.org/r/799268
(https://phabricator.wikimedia.org/T309268) [08:18:21] (PS1) Majavah: Remove some unmanaged files from sudoers.d [puppet] - https://gerrit.wikimedia.org/r/799820 (https://phabricator.wikimedia.org/T309268) [08:20:54] (CR) Majavah: hieradata: purge stale sudoers.d entries in production (1 comment) [puppet] - https://gerrit.wikimedia.org/r/799268 (https://phabricator.wikimedia.org/T309268) (owner: Majavah) [08:21:03] (CR) Filippo Giunchedi: [C: +2] Fix problems found by github.com/cloudflare/pint [alerts] - https://gerrit.wikimedia.org/r/799285 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi) [08:28:12] (PS1) Volans: CHANGELOG: add changelogs for release v2.5.0 [software/spicerack] - https://gerrit.wikimedia.org/r/799831 [08:29:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298555)', diff saved to https://phabricator.wikimedia.org/P28581 and previous config saved to /var/cache/conftool/dbconfig/20220526-082951-ladsgroup.json [08:29:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [08:29:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [08:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:58] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [08:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:39] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v2.5.0 [software/spicerack] - https://gerrit.wikimedia.org/r/799831 (owner: Volans) [08:40:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1032 T309265', diff saved to
https://phabricator.wikimedia.org/P28582 and previous config saved to /var/cache/conftool/dbconfig/20220526-084009-marostegui.json [08:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:16] T309265: Migrate 4 DB ES hosts to 10.6 - https://phabricator.wikimedia.org/T309265 [08:41:25] (PS1) Majavah: P:openstack::pdns: remove unused sudo rules [puppet] - https://gerrit.wikimedia.org/r/799839 [08:42:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:42:48] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35551/console" [puppet] - https://gerrit.wikimedia.org/r/799839 (owner: Majavah) [08:43:34] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v2.5.0 [software/spicerack] - https://gerrit.wikimedia.org/r/799831 (owner: Volans) [08:44:38] (PS1) Marostegui: es1032: Install MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/799841 (https://phabricator.wikimedia.org/T309265) [08:44:56] (CR) Tim Starling: Add the master from the primary DC to the secondary DC load arrays (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: Tim Starling) [08:45:18] (CR) Marostegui: [C: +2] es1032: Install MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/799841 (https://phabricator.wikimedia.org/T309265) (owner: Marostegui) [08:52:19] (PS1) Volans: Upstream release v2.5.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/799844 [08:52:56] (CR) Marostegui: Add the master from the primary DC to the secondary DC load arrays (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809)
(owner: Tim Starling) [08:55:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [08:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:53] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [08:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:08] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:57:59] (PS1) Majavah: P:openstack::puppetmaster: remove unused stuff [puppet] - https://gerrit.wikimedia.org/r/799845 [08:58:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:58:55] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35552/console" [puppet] - https://gerrit.wikimedia.org/r/799845 (owner: Majavah) [09:01:28] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:03:28] (CR) Jbond: [C: +2] sudo: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/799371 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [09:03:40] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:03:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:04:46] (CR) Jbond: [C: +2] toolforge: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/797339 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [09:05:11] (CR) Jbond: [C: +2] statograph: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/799373 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [09:05:31] (CR) Jbond: [C: +2] statsite: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/799372 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [09:06:15] (CR) Jbond: [C: +2] squid: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/799377 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [09:06:46] (CR) Jbond: [C: +2] "merging all the ones i +2'ed thanks <3" [puppet] - https://gerrit.wikimedia.org/r/799377 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [09:08:15] Puppet, Cloud-VPS, Infrastructure-Foundations, cloud-services-team (Kanban): Remove prod-specific bits from cloud puppetmasters - https://phabricator.wikimedia.org/T309281 (Majavah) [09:08:34] (CR) Jbond: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/799381 (owner: Volans) [09:09:32] (CR) Volans: [C: +2] Upstream release v2.5.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/799844 (owner: Volans) [09:09:36] (PS2) Majavah: P:openstack::puppetmaster: remove conftool client [puppet] - https://gerrit.wikimedia.org/r/799845 (https://phabricator.wikimedia.org/T309281) [09:11:52] (CR) Cathal
Mooney: [C: +1] "LGTM should help with future changes on these thanks!" [homer/public] - https://gerrit.wikimedia.org/r/799381 (owner: Volans) [09:12:31] (CR) Volans: [C: +2] transports: allow to set a global timeout [software/homer] - https://gerrit.wikimedia.org/r/799375 (owner: Volans) [09:12:34] (CR) Volans: [C: +2] devices: allow to pass additional metadata [software/homer] - https://gerrit.wikimedia.org/r/799376 (owner: Volans) [09:15:42] (Merged) jenkins-bot: transports: allow to set a global timeout [software/homer] - https://gerrit.wikimedia.org/r/799375 (owner: Volans) [09:15:50] (Merged) jenkins-bot: devices: allow to pass additional metadata [software/homer] - https://gerrit.wikimedia.org/r/799376 (owner: Volans) [09:16:28] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [09:16:31] (CR) Jbond: "Thanks LGMT, missed form the last one but could you add an entry to the change log e.g." [software/puppet-compiler] - https://gerrit.wikimedia.org/r/791459 (owner: BCornwall) [09:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:46] (PS5) MSantos: WIP: introduce geoshapes service [deployment-charts] - https://gerrit.wikimedia.org/r/768678 (https://phabricator.wikimedia.org/T302967) [09:17:05] (Merged) jenkins-bot: Upstream release v2.5.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/799844 (owner: Volans) [09:17:24] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (Sgs) >>! In T309045#7954779, @Dzahn wrote: >>>! In T309045#7950982, @MShilova_WMF wrote: >> I confirm that @sgs needs access to a production server and it...
[09:18:30] (CR) CI reject: [V: -1] WIP: introduce geoshapes service [deployment-charts] - https://gerrit.wikimedia.org/r/768678 (https://phabricator.wikimedia.org/T302967) (owner: MSantos) [09:19:01] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [09:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:40] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:21:20] !log uploaded spicerack_2.5.0 to apt.wikimedia.org bullseye-wikimedia [09:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:22:01] (PS1) Majavah: puppetmaster: remove 'allow_from' [puppet] - https://gerrit.wikimedia.org/r/799859 [09:24:31] (CR) Jbond: [C: +1] "LGTM will leave for WMCS to merge" [puppet] - https://gerrit.wikimedia.org/r/799845 (https://phabricator.wikimedia.org/T309281) (owner: Majavah) [09:26:03] (PS2) Jbond: Remove some unmanaged files from sudoers.d [puppet] - https://gerrit.wikimedia.org/r/799820 (https://phabricator.wikimedia.org/T309268) (owner: Majavah) [09:26:10] (CR) Jbond: [C: +2] "LGTM thanks" [puppet] - https://gerrit.wikimedia.org/r/799820 (https://phabricator.wikimedia.org/T309268) (owner: Majavah) [09:26:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw -
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:29:42] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35556/console" [puppet] - https://gerrit.wikimedia.org/r/799859 (owner: Majavah) [09:33:02] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35557/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/799386 (owner: Jbond) [09:34:05] (CR) Majavah: "see also https://gerrit.wikimedia.org/r/c/operations/puppet/+/799344" [puppet] - https://gerrit.wikimedia.org/r/799386 (owner: Jbond) [09:38:33] PROBLEM - Host db1128 is DOWN: PING CRITICAL - Packet loss = 100% [09:38:43] what [09:38:49] it is indeed down [09:39:00] I did get paged but this is not a #page alert, weird too [09:39:04] connectivity or hw, you know? [09:39:04] marostegui: need a hand? [09:39:05] <_joe_> uh [09:39:05] It is a master [09:39:14] m1 master, let me check [09:39:15] I can ssh to the mgmt [09:39:17] RO should be fine [09:39:25] volans: can you reboot or check what happened? [09:39:29] <_joe_> marostegui: m1 is what? [09:39:37] wait before rebooting [09:39:39] _joe_: misc services [09:39:41] it may be network [09:39:53] volans: is he host up? [09:39:55] *the [09:39:58] host is up [09:40:03] 09:39:56 up 0 min, 1 user, load average: 0.36, 0.09, 0.03 [09:40:03] _joe_: misc utilitie: Bacula, Etherpad, etc. [09:40:05] but just rebooted [09:40:06] so rebooted [09:40:10] utilities even [09:40:10] <_joe_> ahh just rebooted [09:40:12] "just" [09:40:13] <_joe_> sobanski: thanks [09:40:15] RECOVERY - Host db1128 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [09:40:22] jynus: .. [09:40:27] I'm checking hardware logs [09:40:32] given I'm already in [09:40:52] fyi pki is also on misc [09:40:55] <_joe_> well whatever the reason, I guess we're in for a master switchover in m1?
[09:40:59] no [09:41:00] I woke up to the page [09:41:14] which page? [09:41:15] :) [09:41:17] <_joe_> Amir1: who needs an alarm clock when you have pages? [09:41:30] It's probably a loose cable again [09:41:35] (PS1) Majavah: hieradata: set swift_clusters: {} on cloud [puppet] - https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281) [09:41:40] oh right, VO pages on the host being down, funny :) [09:41:43] I am starting mariadb [09:41:46] Storage seems ok [09:42:00] PROBLEM - MariaDB Replica IO: m1 on db1117 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1128.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1128.eqiad.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:42:02] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:42:07] marostegui: Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A6. [09:42:12] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268 (jbond) similar to the previous task on [[ https://phabricator.wikimedia.org/T214605#5756945 | apt directories ]], i have queried the repo for ma...
[09:42:13] File /var/log/journal/d2918de808fb4bc5ba5ad42f3e7b95c5/system.journal corrupted or uncleanly shut down, renaming and replacing [09:42:41] same error happened on the 2022-03-17 and back in 2022-02-27, but this first one was a correctable error [09:42:44] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35558/console" [puppet] - https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281) (owner: Majavah) [09:42:49] volans: we need new memory then [09:42:50] so it seems a bad DIMM [09:42:59] if we have a task I can paste the logs [09:43:09] I am checking the data before failing the proxy back [09:43:24] k [09:43:39] <_joe_> marostegui: shouldn't we switch masters if this server has faulty dimms? [09:43:52] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:44:10] RECOVERY - MariaDB Replica IO: m1 on db1117 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:44:16] <_joe_> also, I'd be happy to do any operational step myself for failovers / etc [09:44:42] I prefer not if we can avoid it, I can prepare a proper host today and then switch it, but I prefer not to switch to db1117:3321 for now [09:45:01] <_joe_> ack [09:45:05] <_joe_> it's your call [09:45:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298560)', diff saved to https://phabricator.wikimedia.org/P28583 and previous config saved to /var/cache/conftool/dbconfig/20220526-094509-ladsgroup.json [09:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:15] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [09:45:25] I am going to failback the proxies again [09:45:26] <_joe_> I
am happy with whatever you think is best [09:45:37] <_joe_> marostegui: is there a runbook for the failback? [09:45:40] I will have a replacement ready in a few hours [09:46:01] (PS2) Majavah: hieradata: set swift_clusters: {} on cloud [puppet] - https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281) [09:46:02] <_joe_> sorry, being oncall, I'd prefer to be able to perform at least the failbacks myself [09:46:06] marostegui: do you need me for anything? [09:46:13] sorry _joe_ just did it [09:46:23] But it is basically reloading the proxies [09:46:26] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:46:28] <_joe_> marostegui: you need it in a runbook [09:46:41] <_joe_> about dbproxy specifically [09:46:46] _joe_: for later I guess, let me address all this first [09:46:48] <_joe_> the page linked in the alert isn't helpful [09:46:59] <_joe_> marostegui: sure sorry I wasn't implying for now [09:47:06] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35559/console" [puppet] - https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281) (owner: Majavah) [09:47:11] (CR) Majavah: hieradata: set swift_clusters: {} on cloud [puppet] - https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281) (owner: Majavah) [09:47:21] _joe_: https://wikitech.wikimedia.org/wiki/HAProxy this tells what to do [09:47:24] https://phabricator.wikimedia.org/T309286 [09:47:25] not very clearly, but it does [09:47:31] jynus: thanks [09:47:57] we need to restart etherpad [09:48:14] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:48:25] can someone do that?
[09:48:30] doing [09:48:34] thanks jynus [09:48:50] services we need to check: https://phabricator.wikimedia.org/P28584 [09:49:01] especially writes [09:49:21] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268 (jbond) removing `nagios_long_procs` as it was dropped in https://gerrit.wikimedia.org/r/c/operations/puppet/+/723543/4/modules/base/manifests/mo... [09:49:30] marostegui: I did it but it didn't work [09:49:39] do I need to restart apache or something? [09:50:11] or maybe it works now, just took a minute? [09:50:19] or maybe connections were killed? [09:50:28] ops-eqiad, DBA, Data-Persistence: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 (Volans) According to `racadm lclog view` it's a bad DIMM, `DIMM_A6` in particular, and it happened already on 2022-03-17 (but it didn't trigger a reboot) and on 2022-02-27 (alt... [09:50:33] Etherpad loads for me [09:50:39] And I just created a new pad [09:50:47] sobanski: it didn't work immediately after restart [09:51:04] jynus: usually it is the etherpad service (at least what I have seen before) [09:51:09] * volans updated task with HW logs [09:51:10] jynus: if apache2 tried a graceful restart that could explain the delay [09:51:15] jynus: can you run a quick bacula test? [09:51:27] I am going to prepare a new host, it will take a few hours [09:51:33] etherpad keeps WS open to every client.. those are long lived connections that will keep the workers busy for a while [09:51:43] <_joe_> do we still have racktables?
[09:52:03] ongoing es backups failed [09:52:14] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:52:16] 7TB to retry [09:52:19] _joe_: only on RO mode as far as I remember [09:52:21] _joe_: yes, in RO mode [09:52:22] ops-eqiad, DBA, Data-Persistence: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 (Marostegui) p:Triage→High We need to build a new host and switchover db1128 so we can replace its memory. [09:52:39] <_joe_> so nothing to verify there really [09:52:48] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service,cfssl-ocsprefresh-cloud_wmnet_ca.service,cfssl-ocsprefresh-kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:10] ops-eqiad, DBA, Data-Persistence: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 (Volans) As an action item for later, we should check why the page didn't have the `#page` hashtag on the IRC alert: ` icinga-wm| PROBLEM - Host db1128 is DOWN: PING CRITICAL -... [09:54:02] volans: maybe we need to review https://phabricator.wikimedia.org/T233684 and check if something is missing [09:54:14] I am going to look for a host to replace db1128 [09:54:45] marostegui: are you replying to my comment on the task about the #-page hashtag? [09:55:22] volans: yeah :) [09:55:40] ack [09:55:52] ops-eqiad, DBA, Data-Persistence: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 (Marostegui) a:Marostegui [09:56:04] ops-eqiad, DBA, Data-Persistence: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 (Marostegui) I am going to replace db1128 with a s4 host for now.
[09:56:42] 10ops-eqiad, 10DBA, 10Data-Persistence: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 (10Marostegui) Should take a few hours and later I will do an emergency m1 switchover, don't want to leave db1128 running like this for the weekend [09:56:54] (03PS1) 10Filippo Giunchedi: rsyslog: bound disk-assisted queues [puppet] - 10https://gerrit.wikimedia.org/r/799870 (https://phabricator.wikimedia.org/T308439) [09:57:04] 10ops-eqiad, 10DBA, 10Data-Persistence: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 (10Marostegui) [09:57:27] marostegui: backups and restores seem to work well, but I have to retry what were ongoing backups [09:57:38] marostegui: I know the issue now, I'll send a patch [09:57:47] (03PS1) 10Majavah: monitoring::icinga::git_merge: use sudo::rule [puppet] - 10https://gerrit.wikimedia.org/r/799871 (https://phabricator.wikimedia.org/T309268) [09:57:53] volans: <3 [09:57:59] jynus: maybe it is worth waiting for db1128 replacement? 
[09:58:07] cause I will kill connections to run the switchover today [09:58:10] yeah, I was about to say that [09:58:17] yeah, worth waiting then [09:58:21] I will try to get it done fast [09:58:22] but we should do it shortly [09:58:32] like, before the end of the week [09:58:42] jynus: I am planning to do it in a few hours [09:58:52] it shouldn't take long, I am deciding which host to pick now [09:59:08] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35561/console" [puppet] - 10https://gerrit.wikimedia.org/r/799871 (https://phabricator.wikimedia.org/T309268) (owner: 10Majavah) [09:59:15] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268 (10jbond) new updated list with removed nagios_long_procs and also with a fixed file list ` sudo cumin '*' 'ls -1 /etc/sudoers.d/ | grep -Ev "mw-... [09:59:46] _joe_, vgutierrez: for you I have a different problem, it seems to me (at least from my VO app) that the incident was not auto-resolved on VO, if you could have a look [10:00:04] mvolz: (Dis)respected human, time to deploy Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T1000). Please do the needful. 
[10:00:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1164 T309286', diff saved to https://phabricator.wikimedia.org/P28585 and previous config saved to /var/cache/conftool/dbconfig/20220526-100013-marostegui.json [10:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P28586 and previous config saved to /var/cache/conftool/dbconfig/20220526-100020-ladsgroup.json [10:00:22] T309286: db1128 host (containing m1 databases) crashed - https://phabricator.wikimedia.org/T309286 [10:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:13] (03PS1) 10Marostegui: instances.yaml: Remove db1164 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/799876 (https://phabricator.wikimedia.org/T309286) [10:01:15] <_joe_> volans: yes it wasn't [10:01:19] <_joe_> I'll resolve it [10:02:08] indeed, we have looked into why host pages don't resolve automatically but IIRC found no smoking gun yet, thanks for resolving though [10:02:11] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1164 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/799876 (https://phabricator.wikimedia.org/T309286) (owner: 10Marostegui) [10:02:11] I'll find the task [10:02:36] T264016 [10:02:37] T264016: Host page did not auto-resolve in VO - https://phabricator.wikimedia.org/T264016 [10:03:47] expect backup check alerts in the next hours due to backup failures and probable delays [10:05:15] (03PS1) 10Marostegui: mariadb: Move db1164 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/799883 (https://phabricator.wikimedia.org/T309286) [10:05:17] jynus: can you review ^ [10:05:35] doing [10:05:38] !log Stop mysql on db1117:3321 to clone db1164 T309286 [10:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:44] T309286: db1128 host (containing m1 
databases) crashed - https://phabricator.wikimedia.org/T309286 [10:06:57] marostegui: only one suggetion- let's add monitoring enabled:false to db1128? [10:07:06] sounds good, let me do it [10:07:23] maybe to the new one, temporarilly [10:07:41] jynus: the new one has it set to false on that patch [10:07:44] (03PS2) 10Marostegui: mariadb: Move db1164 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/799883 (https://phabricator.wikimedia.org/T309286) [10:07:49] ah, sorry, I missed that [10:07:59] or you mean fully disable it? [10:08:04] let's fully disable it instead [10:08:22] yeah, is_critical + enabled false until fully setup [10:08:32] (03PS3) 10Marostegui: mariadb: Move db1164 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/799883 (https://phabricator.wikimedia.org/T309286) [10:08:34] done ^ [10:08:47] trying to avoid more pages [10:09:29] (03CR) 10Jcrespo: [C: 03+1] mariadb: Move db1164 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/799883 (https://phabricator.wikimedia.org/T309286) (owner: 10Marostegui) [10:09:41] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1164 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/799883 (https://phabricator.wikimedia.org/T309286) (owner: 10Marostegui) [10:09:44] you merge, start preparing everything while I prepare the backups patch [10:10:02] yep, starting the cloning now [10:10:14] once monitoring is in place (maybe except read only) we reenable notifications [10:11:07] dbproxy irc alerts migh trigger as db1117 might flap (network saturation) [10:11:26] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:11:28] Updating zarcillo now [10:12:14] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:12:26] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - 
Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:12:36] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:12:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1164 from dbctl', diff saved to https://phabricator.wikimedia.org/P28588 and previous config saved to /var/cache/conftool/dbconfig/20220526-101250-marostegui.json [10:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:50] (03PS1) 10Jcrespo: Update dbackups check and statistics to use db1164 instead of db1128 [puppet] - 10https://gerrit.wikimedia.org/r/799894 (https://phabricator.wikimedia.org/T309286) [10:14:47] (03CR) 10Marostegui: "Looks good, I will let you know once the host is up" [puppet] - 10https://gerrit.wikimedia.org/r/799894 (https://phabricator.wikimedia.org/T309286) (owner: 10Jcrespo) [10:14:50] ^marostegui low prio [10:15:02] will done once things have been stable for a while [10:15:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P28589 and previous config saved to /var/cache/conftool/dbconfig/20220526-101525-ladsgroup.json [10:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:32] (03CR) 10Jcrespo: [C: 04-1] "Not yet until failover is done and things are stable." 
[puppet] - 10https://gerrit.wikimedia.org/r/799894 (https://phabricator.wikimedia.org/T309286) (owner: 10Jcrespo) [10:17:00] 10ops-eqiad: db1128 faulty memory - https://phabricator.wikimedia.org/T309291 (10Marostegui) [10:17:07] 10ops-eqiad: db1128 faulty memory - https://phabricator.wikimedia.org/T309291 (10Marostegui) p:05Triage→03Medium [10:17:13] volans: I have created an specific task for dcops [10:18:40] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:19:34] marostegui: ack, sorry if mixed the dcops data in that task [10:19:42] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268 (10jbond) i also slightly update the script io used for apt to hanlde if the path parameter is used and also use the simpler pql syntax ` lang=pyt... [10:19:49] *if I [10:21:24] volans: not a problem, I create a new one so they don't get lost in all the comments [10:21:55] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35562/console" [puppet] - 10https://gerrit.wikimedia.org/r/799870 (https://phabricator.wikimedia.org/T308439) (owner: 10Filippo Giunchedi) [10:23:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance [10:23:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance [10:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298555)', diff saved to https://phabricator.wikimedia.org/P28590 and previous config saved to /var/cache/conftool/dbconfig/20220526-102308-ladsgroup.json 
[10:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:15] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [10:23:22] (03PS1) 10Jcrespo: mariadb: Failover m1 primary from db1128 to db1164 [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286) [10:23:42] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/799871 (https://phabricator.wikimedia.org/T309268) (owner: 10Majavah) [10:23:44] (03CR) 10Jbond: [C: 03+2] monitoring::icinga::git_merge: use sudo::rule [puppet] - 10https://gerrit.wikimedia.org/r/799871 (https://phabricator.wikimedia.org/T309268) (owner: 10Majavah) [10:24:28] marostegui: will check on misc docs if we have a checklist of all changes needed and the order, I've created so far https://gerrit.wikimedia.org/r/c/operations/puppet/+/799901 [10:24:56] but you will likely have more experience on that [10:25:18] ah, it is well documented at https://wikitech.wikimedia.org/wiki/MariaDB#Misc_section_failover_checklist_(example_with_m2) [10:26:35] (03PS1) 10Volans: Icinga: add page hashtag to paging host alerts [puppet] - 10https://gerrit.wikimedia.org/r/799903 [10:26:56] I think I have to add the haproxy change on the same change, control the sequence of deployment with puppet [10:27:13] (03CR) 10CI reject: [V: 04-1] Icinga: add page hashtag to paging host alerts [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans) [10:28:24] (03PS2) 10Volans: Icinga: add page hashtag to paging host alerts [puppet] - 10https://gerrit.wikimedia.org/r/799903 [10:28:44] jynus: yeah, we can do everything in a single patch, that patch + changing haproxy ips and databases [10:28:48] I can do that, no problem [10:28:55] well, let me try [10:28:58] Maybe we should even create a subtask for the switchover 
[10:28:59] and you review of course [10:29:00] (03CR) 10CI reject: [V: 04-1] Icinga: add page hashtag to paging host alerts [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans) [10:29:02] with all the steps [10:29:17] the docs have some outadated stuff that is now on hiera/zarcillo [10:29:20] jynus: https://phabricator.wikimedia.org/T302190 [10:29:31] ah, will copy that [10:29:54] should we add db1164 as a temporary secondary, to check haproxy works as intended? [10:30:30] yeah, but let's wait until it is up [10:30:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298560)', diff saved to https://phabricator.wikimedia.org/P28591 and previous config saved to /var/cache/conftool/dbconfig/20220526-103030-ladsgroup.json [10:30:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:30:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:38] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [10:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:52] marostegui: yeah, it is to prepare doc/patches, not doing nothing without your ok [10:31:03] jynus: thanks, appreciate it! [10:31:12] the transfer is half way done [10:39:25] T309296 but it is a quick copy and paste, will review now [10:39:26] T309296: Failover m1 primary db from db1128 to db1164 - https://phabricator.wikimedia.org/T309296 [10:41:10] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (10jbond) >>! 
In T221529#5143984, @jbond wrote: > The error happened as puppet-merge was rolling out changes. I have not looked at how puppet-merge works but this looks like i... [10:41:21] marostegui: I am not so sure about those steps, since haproxy started being used by traffic, reload my happen automatically? [10:41:39] (03Abandoned) 10Jbond: nrpe: move plugins off the base nrpe class [puppet] - 10https://gerrit.wikimedia.org/r/799386 (owner: 10Jbond) [10:42:39] jynus: no, it was disabled for us I believe [10:42:45] ah, ok [10:43:11] I am also not sure about the db-switchover parameter order [10:44:29] what are you not sure about? [10:44:48] transfer finished [10:44:49] not sure how db-switchover works [10:45:03] so just asking for you to review them [10:45:08] yeah, no worries [10:46:24] (03PS2) 10Jcrespo: mariadb: Failover m1 primary from db1128 to db1164 [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286) [10:46:38] added the haproxy change^ [10:48:25] (03CR) 10Jbond: [C: 03+2] "LGTM but see inline comments. 
also see the following for why the previous issue failed" [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah) [10:48:37] (03CR) 10Marostegui: mariadb: Failover m1 primary from db1128 to db1164 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286) (owner: 10Jcrespo) [10:48:58] (03PS1) 10Jcrespo: dbproxy: Add db1164 as the m1 eqiad secondary [puppet] - 10https://gerrit.wikimedia.org/r/799915 (https://phabricator.wikimedia.org/T309286) [10:49:24] (03PS3) 10Jcrespo: mariadb: Failover m1 primary from db1128 to db1164 [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286) [10:49:29] (03CR) 10Jcrespo: mariadb: Failover m1 primary from db1128 to db1164 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286) (owner: 10Jcrespo) [10:49:54] let me rebase on top of the latest patch [10:50:34] uh, it says conflict, will have to rebase manually [10:51:02] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:52:28] (03PS4) 10Jcrespo: mariadb: Failover m1 primary from db1128 to db1164 [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286) [10:52:49] https://phabricator.wikimedia.org/T309286#7960042 [10:52:52] jynus: ^ [10:53:16] oh, I had created T309296 [10:53:17] T309296: Failover m1 primary db from db1128 to db1164 - https://phabricator.wikimedia.org/T309296 [10:53:21] ah sorry [10:53:24] didn't see it [10:53:24] let's compare :-) [10:54:28] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:32] the ticket on my log was wrong [10:55:46] (03PS4) 10Majavah: nrpe: manage sudo rules via nrpe::check (try 2) [puppet] - 
10https://gerrit.wikimedia.org/r/799344 [10:56:25] (03CR) 10Majavah: [V: 03+1] nrpe: manage sudo rules via nrpe::check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah) [10:57:32] jynus: I am going to start then [10:57:51] wait [10:58:28] let's add / do https://gerrit.wikimedia.org/r/c/operations/puppet/+/799915/1 somewhere beforehand? [10:58:49] Ah, I was going to create a patch for it [10:58:51] didn't see that one [10:58:51] yeah [10:58:53] let's start with that [10:58:56] let me review and merge [10:58:58] not super important [10:59:07] but because I rebased the important patch on top of that [10:59:13] so to avoir conflicts on merge [10:59:20] (03CR) 10Marostegui: [C: 03+2] dbproxy: Add db1164 as the m1 eqiad secondary [puppet] - 10https://gerrit.wikimedia.org/r/799915 (https://phabricator.wikimedia.org/T309286) (owner: 10Jcrespo) [10:59:20] *avoid [10:59:31] merged, you can rebase while I test [10:59:46] it was already rebased, that is why :-) [10:59:53] let me just add it to the list as done [11:00:28] yep [11:01:10] so from now on, I will not touch the description or perform anything unless you tell me to [11:01:14] will just monitor [11:01:22] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:01:22] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:01:22] ok [11:01:26] the proxy test looks good [11:01:45] so, on your time [11:01:59] let's wait for db1117:3321 to catch up [11:01:59] it shouldn't take long [11:02:08] ah, right [11:02:31] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (10cmooney) @ayounsi I think based on the above we should proceed with https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/769... 
[11:03:25] (03PS5) 10Jcrespo: mariadb: Failover m1 primary from db1128 to db1164 [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286) [11:03:33] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792622 (owner: 10Ayounsi) [11:03:38] ^I actually had to rebase, marostegui sorry [11:03:49] saw it has a conflict (but an automatic resolved one) [11:03:51] db1117 is in sync [11:04:13] Going to start moving the topology [11:04:41] there is an extra space on one of the hiera, but to resolve later [11:05:35] ah, no, the space is removed on the patch, I got it wrong, all good [11:06:23] db1117 seems moved ok? [11:06:35] according to orchestrator [11:06:53] and codfw servers [11:07:52] (03PS3) 10Jbond: Icinga: add page hashtag to paging host alerts [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans) [11:09:42] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:09:57] yeah it looks good [11:11:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35563/console" [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans) [11:11:26] (03CR) 10Marostegui: [C: 03+2] mariadb: Failover m1 primary from db1128 to db1164 [puppet] - 10https://gerrit.wikimedia.org/r/799901 (https://phabricator.wikimedia.org/T309286) (owner: 10Jcrespo) [11:13:03] jynus: I am going for it now [11:13:09] ok [11:13:12] !log Failover m1 from db1128 to db1164 - T309296 [11:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:18] T309296: Failover m1 primary db from db1128 to db1164 - https://phabricator.wikimedia.org/T309296 [11:13:33] (03CR) 10Volans: [C: 03+1] "Thanks John for the workaround, if that works on PCC for both icinga and a normal host with exported 
resources it seems ok to me. But let'" [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans) [11:13:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35564/console" [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans) [11:14:15] all done [11:14:56] cleanup of heartbeat needed? [11:15:03] that is done too [11:15:06] at least for orchestrator, not sure if for the check [11:15:16] ah, it took some time [11:15:19] to get it [11:15:22] yeah [11:15:26] (orchestrator) [11:15:32] let's check services [11:15:42] etherpad might need the restart [11:16:02] yeah it does [11:16:02] let me dod that [11:16:05] ok [11:16:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35565/console" [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans) [11:16:40] db1164 looks green on icinga [11:16:43] (03CR) 10Volans: [C: 03+1] "Also master works fine:" [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans) [11:17:21] etherpad took close to 1 minute to get back [11:17:29] it must overload or something [11:17:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35567/console" [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans) [11:18:00] yeah I can write now [11:18:05] (03PS1) 10Marostegui: mariadb: Update db1128 and db1164 status [puppet] - 10https://gerrit.wikimedia.org/r/799930 (https://phabricator.wikimedia.org/T309296) [11:18:13] jynus: ^ for later [11:18:35] ah, true, I forgot [11:19:19] s4- you like danger :-D [11:19:24] haha [11:19:36] I stole db1164 from it :) [11:19:38] let me get a snapshot of that before you move it, ok? 
[11:19:43] ah no wait [11:19:44] I took it from s1 [11:20:34] I would trust the data on the current primary more than the old one, but I want to have it around (the data, not the server) for some time [11:20:42] jynus: sure, no rush [11:20:45] (03PS2) 10Marostegui: mariadb: Update db1128 and db1164 status [puppet] - 10https://gerrit.wikimedia.org/r/799930 (https://phabricator.wikimedia.org/T309296) [11:21:10] I am merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/799930 which is a noop for the data [11:21:12] (03PS3) 10Jcrespo: mariadb: Update db1128 and db1164 status [puppet] - 10https://gerrit.wikimedia.org/r/799930 (https://phabricator.wikimedia.org/T309296) (owner: 10Marostegui) [11:21:14] just notifications [11:21:29] a space that bothered me [11:21:35] xd [11:21:38] Good to merge? [11:21:49] one say, was giving it a last look [11:21:52] *second [11:21:59] yep just +1 when ready [11:22:46] (03CR) 10Jcrespo: [C: 03+1] mariadb: Update db1128 and db1164 status [puppet] - 10https://gerrit.wikimedia.org/r/799930 (https://phabricator.wikimedia.org/T309296) (owner: 10Marostegui) [11:22:50] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:23:31] (03CR) 10Marostegui: [C: 03+2] mariadb: Update db1128 and db1164 status [puppet] - 10https://gerrit.wikimedia.org/r/799930 (https://phabricator.wikimedia.org/T309296) (owner: 10Marostegui) [11:24:18] (03CR) 10Jbond: "sorry E_TOOMANYCHANGES was meant to leave this remark." [puppet] - 10https://gerrit.wikimedia.org/r/799344 (owner: 10Majavah) [11:25:03] let me know when deployed to run puppet on alert1001 [11:25:08] jynus: done [11:25:30] one last check when finished to icinga [11:25:43] tendril was done automatically, right? [11:25:54] tendril? 
[11:25:57] sorry [11:25:59] zarcillo [11:26:02] yes [11:26:06] it should be, let me double check [11:26:23] yep it is good [11:26:36] (03PS5) 10Majavah: nrpe: manage sudo rules via nrpe::check (try 2) [puppet] - 10https://gerrit.wikimedia.org/r/799344 [11:26:52] (03CR) 10Majavah: nrpe: manage sudo rules via nrpe::check (try 2) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/799344 (owner: 10Majavah) [11:27:19] let me force a rerun on prometheus of config jobs [11:27:45] ok [11:28:08] as grafana didn't show any primary dbs at the moment [11:30:13] it doesn't work [11:30:14] writes on librenms work fine [11:30:26] prometheus? [11:30:27] something could be wrong on zarcillo [11:30:35] I mean, it doesn't give errors [11:30:47] but it doesn't detect any m1-master [11:30:55] let me see [11:31:12] which host is the main zarcillo db? [11:31:22] db1115 [11:31:27] ah, I think I know what it happend [11:31:35] the script changes who is the master [11:31:41] but doesn't update the section [11:31:47] aaah right [11:31:48] let me fix that [11:31:49] that should have been done beforehand [11:31:54] not a big deal [11:33:06] mmm but I did update section_instances [11:33:12] before the switchover [11:33:30] then it could be something else [11:33:36] the group? [11:33:43] core -> misc, maybe? [11:33:52] but db1128 is also showing core [11:33:58] which is wrong [11:34:00] that is the one that is missing [11:34:08] but db1128 the previous master was in core [11:34:14] let me update it anyways [11:34:22] (03PS1) 10Jbond: puppet-merge: Add logging so we know when changes where merged [puppet] - 10https://gerrit.wikimedia.org/r/799943 (https://phabricator.wikimedia.org/T221529) [11:34:28] and that is why probably didn't show up in the aggregated graphs :-) [11:34:38] ah it wasn't working before either? [11:34:55] not sure, but right now, db1117:13321 only show up on m1 misc [11:35:02] then it must be that [11:35:02] probably the others are on core [11:35:07] codfw failing too? 
[11:35:11] let me seee [11:35:19] (03CR) 10CI reject: [V: 04-1] puppet-merge: Add logging so we know when changes where merged [puppet] - 10https://gerrit.wikimedia.org/r/799943 (https://phabricator.wikimedia.org/T221529) (owner: 10Jbond) [11:36:14] updated db1164 to misc [11:36:58] (03CR) 10Jbond: [C: 03+1] "thanks lgtm, have another change to deploy this afternoon which also needs to be rolled out carefully so will include this one with that:" [puppet] - 10https://gerrit.wikimedia.org/r/799344 (owner: 10Majavah) [11:37:52] (03CR) 10Jbond: [V: 03+1] Icinga: add page hashtag to paging host alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans) [11:38:19] on codfw db2132:9104 shows up as master [11:38:27] that's correct yep [11:38:28] let me rerun eqiad [11:38:49] now db1164:9104 shows up as master [11:38:59] but db1128 doesn't show up on misc m1 [11:38:59] I will leave db1128 as core as it will be core in s1 [11:39:02] yeah [11:39:05] I will leave it as core [11:39:12] I will reclone it to s1 [11:39:16] let me check that it at least is on core [11:39:17] (once you give me green light) [11:39:26] no problem as long as we get metrics from it [11:40:25] yeah, it is on "m1-core" [11:40:44] I saw m2 core, too more issues [11:40:46] for another time [11:41:05] | db1159 | db1159.eqiad.wmnet | 3306 | NULL | NULL | core | [11:41:06] fixing [11:41:17] fixed, that is m2 [11:41:29] it is ok, as long as there are metrics it is just a label [11:41:35] m3 seems to be ok [11:41:36] there will be likely many other issues [11:41:52] m5 is ok too [11:43:07] puppet run on alert host but didn't disable the issue -probably will require puppet on the hosts to run first [11:43:14] yeah [11:43:15] * the alert [11:43:36] I am running it now [11:44:32] once alerts are ok, I will deploy the dbbackups patch, create a snapshot of db1128 and then unblock you [11:45:15] (03PS2) 10Jcrespo: Update dbackups check and statistics to use db1164 instead of db1128 
[puppet] - 10https://gerrit.wikimedia.org/r/799894 (https://phabricator.wikimedia.org/T309286) [11:46:21] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:46:24] jynus: db1164 notifications enabled [11:46:31] and db1128 disabled [11:46:33] and all good? [11:46:38] yep [11:46:51] I am going to remove the downtimes from db1164 [11:47:01] (03CR) 10Jcrespo: [C: 03+2] Update dbackups check and statistics to use db1164 instead of db1128 [puppet] - 10https://gerrit.wikimedia.org/r/799894 (https://phabricator.wikimedia.org/T309286) (owner: 10Jcrespo) [11:47:49] jynus: I am going to get some food [11:47:54] Thanks for all the help <3 <3 [11:48:04] ok for me to do the intended things left? [11:48:06] the backup? [11:48:13] 10SRE, 10Data-Engineering, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10Isaac) [11:48:17] (not the move, that is for you) [11:48:20] 0:-) [11:48:30] jynus: yep, go for your tests [11:48:42] and let me know if I can proceed further with recloning db1128 [11:48:45] have a nice lunch [11:48:53] But we can also leave it running, the memory won't be changed today anyways, so no rush [11:48:53] it will take probably 2-3 hours [11:48:57] np [11:48:59] see you later [11:49:06] I will have lunch also when it starts [11:49:20] think either later on the day or tomorrow for the move [11:50:37] (there is also some chance that the host could fail again as backups touch all memory) [11:54:58] !log Running XtraBackup at db1128.eqiad.wmnet:3306 and sending it to dbprov1001.eqiad.wmnet [11:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:47] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:02:59] SRE, Data-Engineering, Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (Isaac) Thanks all for the input on this task and @BBlack especially for digging up what was happening. I finally updated the task description to reflect what I think is the... [12:03:13] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:09:24] (PS2) Hnowlan: service: image-suggestion state to monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/799358 (https://phabricator.wikimedia.org/T304891) [12:21:50] (PS1) Majavah: P:puppetmaster::common: drop support for activerecord [puppet] - https://gerrit.wikimedia.org/r/799956 [12:23:50] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35568/console" [puppet] - https://gerrit.wikimedia.org/r/799956 (owner: Majavah) [12:34:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298555)', diff saved to https://phabricator.wikimedia.org/P28593 and previous config saved to /var/cache/conftool/dbconfig/20220526-123413-ladsgroup.json [12:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:21] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [12:41:31] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:42:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:43:01] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:47:09] (03CR) 10Elukey: [C: 03+1] memcached: Untangle TLS/1.6 options [puppet] - 10https://gerrit.wikimedia.org/r/799348 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [12:47:23] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:49:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P28594 and previous config saved to /var/cache/conftool/dbconfig/20220526-124918-ladsgroup.json [12:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. 
[13:02:15] (03CR) 10Physikerwelt: "See discussion in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/792625 (probably needs some updates to work with the new" [deployment-charts] - 10https://gerrit.wikimedia.org/r/798394 (owner: 10PipelineBot) [13:04:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P28596 and previous config saved to /var/cache/conftool/dbconfig/20220526-130423-ladsgroup.json [13:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:55] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:14:34] (03PS1) 10Majavah: wip [puppet] - 10https://gerrit.wikimedia.org/r/799976 [13:14:53] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:15:25] (03CR) 10CI reject: [V: 04-1] wip [puppet] - 10https://gerrit.wikimedia.org/r/799976 (owner: 10Majavah) [13:16:39] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 5 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35569/console" [puppet] - 10https://gerrit.wikimedia.org/r/799976 (owner: 10Majavah) [13:17:26] (03CR) 10Tchanders: Assign similareditors right to the checkuser group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205) (owner: 10AGueyte) [13:18:09] (03PS1) 10Jbond: Rakefie: Add URI.escape monkey patch [puppet] - 10https://gerrit.wikimedia.org/r/799982 [13:19:13] (03CR) 10CI reject: [V: 04-1] Rakefie: Add URI.escape monkey patch [puppet] - 10https://gerrit.wikimedia.org/r/799982 (owner: 10Jbond) [13:19:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298555)', diff saved to 
https://phabricator.wikimedia.org/P28597 and previous config saved to /var/cache/conftool/dbconfig/20220526-131928-ladsgroup.json [13:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2103.codfw.wmnet with reason: Maintenance [13:19:35] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [13:19:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2103.codfw.wmnet with reason: Maintenance [13:19:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 14 hosts with reason: Maintenance [13:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 14 hosts with reason: Maintenance [13:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:49] (03PS2) 10Majavah: add wmflib::is_active to pick a single active host [puppet] - 10https://gerrit.wikimedia.org/r/799976 [13:21:59] (03CR) 10Tchanders: Deploy SimilarEditors to the beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte) [13:22:09] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35570/console" [puppet] - 10https://gerrit.wikimedia.org/r/799976 (owner: 10Majavah) [13:24:59] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:27:05] (03PS3) 
10Majavah: add wmflib::is_active to pick a single active host [puppet] - 10https://gerrit.wikimedia.org/r/799976 [13:28:45] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 5 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35571/console" [puppet] - 10https://gerrit.wikimedia.org/r/799976 (owner: 10Majavah) [13:29:13] (03PS2) 10Jbond: Rakefie: Add URI.escape monkey patch [puppet] - 10https://gerrit.wikimedia.org/r/799982 [13:34:24] (03CR) 10Jbond: [C: 03+2] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/799956 (owner: 10Majavah) [13:36:07] (03PS1) 10Hnowlan: service: image-suggestion state to production [puppet] - 10https://gerrit.wikimedia.org/r/799998 (https://phabricator.wikimedia.org/T304891) [13:42:39] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:44:09] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:48:18] (03PS1) 10Majavah: P:openstack::nova: remove stretch specific code [puppet] - 10https://gerrit.wikimedia.org/r/800009 [13:49:39] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1025.mgmt.eqiad.wmnet with reboot policy FORCED [13:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:31] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:54:59] (03PS1) 10Giuseppe Lavagetto: wmflib::service::get_url: avoid using monitoring to find the url. 
[puppet] - 10https://gerrit.wikimedia.org/r/800010 [13:56:15] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:58:40] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35572/console" [puppet] - 10https://gerrit.wikimedia.org/r/800010 (owner: 10Giuseppe Lavagetto) [14:07:05] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:08:29] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/800010 (owner: 10Giuseppe Lavagetto) [14:09:39] 10SRE: ProdPasteBot uses deprecated certificate auth - https://phabricator.wikimedia.org/T242857 (10Dsharpe) I don't know who owns or maintains this. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+log/refs/heads/production/modules/base/files/phaste.py shows some folks who have touched the cod... [14:10:02] (03CR) 10Hnowlan: [C: 03+2] cassandra-http-gateway: add missing log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/799283 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [14:15:14] (03Merged) 10jenkins-bot: cassandra-http-gateway: add missing log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/799283 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [14:18:52] volans: just checking you'll send a patch (not now, not today) for the #p.age thing that didn't work with this master crash? 
[14:19:23] marostegui: I have already sent it, then there was an issue with puppet reserved words and jbond kindly patched it with a workaround [14:19:37] volans: ah, I don't see it on my reviews [14:20:08] I was curious about what it was [14:20:56] I'm adding more people now [14:21:49] marostegui: added people anyway it's https://gerrit.wikimedia.org/r/c/operations/puppet/+/799903 [14:21:54] (03CR) 10Jbond: "done first pass" [puppet] - 10https://gerrit.wikimedia.org/r/799976 (owner: 10Majavah) [14:22:03] alias is a reserved meta-parameter in puppet [14:24:52] volans: ah thank you <3 [14:26:15] PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:33] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2035 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:35:52] marostegui: backups were retried automatically and still failed [14:35:57] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1025.mgmt.eqiad.wmnet with reboot policy FORCED [14:35:59] looking into what could be the reason [14:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:45] jynus: let me know if I can help [14:36:54] I am checking the logs [14:37:17] I may do another trial with replication stopped [14:37:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10cmooney) [14:38:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10nskaggs) a:05nskaggs→03Andrew [14:40:30] transfer was successful
twice, but prepare failed, looking at the xtrabackup logs [14:41:36] ERROR - xtrabackup version mismatch- xtrabackup version: {'major': '10.4', 'minor': 22, 'vendor': 'MariaDB'}, backup version: {'major': '10.4', 'minor': 22, 'vendor': 'MariaDB-log'} [14:42:29] (03PS1) 10Filippo Giunchedi: cfssl::db require sqlite3 package [puppet] - 10https://gerrit.wikimedia.org/r/800027 [14:42:31] 10SRE-swift-storage, 10Commons, 10Thumbor, 10affects-Kiwix-and-openZIM: HTTP Mime-Type now always returned properly if "If-None-Match" request header used - https://phabricator.wikimedia.org/T265006 (10Kelson) @Krinkle I have rechecked this bug/ticket with the given example and now it works. Might that be... [14:42:34] (03PS1) 10Filippo Giunchedi: cfssl: add missing migrations [puppet] - 10https://gerrit.wikimedia.org/r/800028 [14:42:38] (03PS1) 10Filippo Giunchedi: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 [14:42:42] (03PS1) 10Filippo Giunchedi: cfssl: fix sqlite3 path selection [puppet] - 10https://gerrit.wikimedia.org/r/800030 [14:42:46] (03PS1) 10Filippo Giunchedi: puppetdb: create dbs before grants [puppet] - 10https://gerrit.wikimedia.org/r/800031 [14:42:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10cmooney) Quick update - I've been trying to image cloudcephosd1025 to make sure all is ok, and completed some operations. Not being comp... 
[14:43:45] (03CR) 10CI reject: [V: 04-1] cfssl: add missing migrations [puppet] - 10https://gerrit.wikimedia.org/r/800028 (owner: 10Filippo Giunchedi) [14:44:12] (03CR) 10CI reject: [V: 04-1] cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi) [14:44:59] (03CR) 10CI reject: [V: 04-1] cfssl: fix sqlite3 path selection [puppet] - 10https://gerrit.wikimedia.org/r/800030 (owner: 10Filippo Giunchedi) [14:45:03] ooof [14:45:14] volans: looking at the icinga host change now [14:47:18] (03CR) 10CI reject: [V: 04-1] puppetdb: create dbs before grants [puppet] - 10https://gerrit.wikimedia.org/r/800031 (owner: 10Filippo Giunchedi) [14:53:11] RECOVERY - Check systemd state on ms-be2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/799903 (owner: 10Volans) [14:57:51] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2035 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:59:35] 10SRE-swift-storage, 10Commons, 10Thumbor, 10Performance-Team (Radar), 10affects-Kiwix-and-openZIM: HTTP Mime-Type now always returned properly if "If-None-Match" request header used - https://phabricator.wikimedia.org/T265006 (10Krinkle) 05Open→03Resolved Yep, it would appear so. I suspect this is l... 
[15:00:04] 10SRE-swift-storage, 10Commons, 10Thumbor, 10Performance-Team (Radar), 10affects-Kiwix-and-openZIM: upload.wikimedia.org HTTP 304 responses lack a Content-Type header - https://phabricator.wikimedia.org/T265006 (10Krinkle) [15:02:32] (03CR) 10Filippo Giunchedi: "Not quite sure how to fix test failures at https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/45287/console" [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi) [15:02:56] (03PS2) 10Filippo Giunchedi: puppetdb: create dbs before grants [puppet] - 10https://gerrit.wikimedia.org/r/800031 [15:02:58] (03PS2) 10Filippo Giunchedi: cfssl::db require sqlite3 package [puppet] - 10https://gerrit.wikimedia.org/r/800027 [15:03:00] (03PS2) 10Filippo Giunchedi: cfssl: add missing migrations [puppet] - 10https://gerrit.wikimedia.org/r/800028 [15:03:02] (03PS2) 10Filippo Giunchedi: cfssl: fix sqlite3 path selection [puppet] - 10https://gerrit.wikimedia.org/r/800030 [15:03:04] (03PS2) 10Filippo Giunchedi: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 [15:04:26] 10SRE, 10SRE-swift-storage, 10Traffic-Icebox, 10Performance-Team (Radar), 10affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (10Krinkle) [15:04:30] (03CR) 10Herron: [C: 03+1] "Am I understanding correctly that the limit in practice would be 40 Mb with our current queue length settings? +1 for giving it a try" [puppet] - 10https://gerrit.wikimedia.org/r/799870 (https://phabricator.wikimedia.org/T308439) (owner: 10Filippo Giunchedi) [15:05:01] 10SRE, 10SRE-swift-storage, 10Traffic-Icebox, 10Performance-Team (Radar), 10affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (10Krinkle) I'm not sure since when, but based on us having <14 days ats-be storage, and based on there still beeing ET... 
[15:05:17] (03CR) 10Herron: [C: 03+1] rsyslog: bound disk-assisted queues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799870 (https://phabricator.wikimedia.org/T308439) (owner: 10Filippo Giunchedi) [15:05:45] (03CR) 10CI reject: [V: 04-1] puppetdb: create dbs before grants [puppet] - 10https://gerrit.wikimedia.org/r/800031 (owner: 10Filippo Giunchedi) [15:05:59] (03CR) 10CI reject: [V: 04-1] cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi) [15:09:15] (03PS3) 10Filippo Giunchedi: puppetdb: create dbs before grants [puppet] - 10https://gerrit.wikimedia.org/r/800031 [15:09:17] (03PS3) 10Filippo Giunchedi: cfssl::db require sqlite3 package [puppet] - 10https://gerrit.wikimedia.org/r/800027 [15:09:19] (03PS3) 10Filippo Giunchedi: cfssl: add missing migrations [puppet] - 10https://gerrit.wikimedia.org/r/800028 [15:09:21] (03PS3) 10Filippo Giunchedi: cfssl: fix sqlite3 path selection [puppet] - 10https://gerrit.wikimedia.org/r/800030 [15:09:23] (03PS3) 10Filippo Giunchedi: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 [15:11:37] (03CR) 10Filippo Giunchedi: [V: 03+1] "Thank you for the review! 
Given that this change reloads rsyslog across the fleet I'll deploy early next week" [puppet] - 10https://gerrit.wikimedia.org/r/799870 (https://phabricator.wikimedia.org/T308439) (owner: 10Filippo Giunchedi) [15:12:14] (03CR) 10jenkins-bot: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi) [15:14:11] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:15:30] (03PS1) 10Jbond: sqlite: update packages and add dependency [puppet] - 10https://gerrit.wikimedia.org/r/800048 [15:17:00] (03PS8) 10BCornwall: cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 [15:17:09] (03CR) 10Jbond: "don't have a big issue with this but see comment and proposed alternative" [puppet] - 10https://gerrit.wikimedia.org/r/800027 (owner: 10Filippo Giunchedi) [15:17:12] (03CR) 10CI reject: [V: 04-1] cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [15:17:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:17:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298555)', diff saved to https://phabricator.wikimedia.org/P28599 and previous config saved to /var/cache/conftool/dbconfig/20220526-151723-ladsgroup.json [15:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:11] (03PS1) 10Cathal Mooney: Re-add urpf check to cloudsw -> cr interfaces Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/800053 
(https://phabricator.wikimedia.org/T304989) [15:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:13] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [15:18:27] (03PS9) 10BCornwall: cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 [15:19:01] (03CR) 10CI reject: [V: 04-1] Re-add urpf check to cloudsw -> cr interfaces Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/800053 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [15:19:05] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/800028 (owner: 10Filippo Giunchedi) [15:20:04] (03CR) 10BCornwall: cli: Add support for XDG Base Directory spec (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [15:20:12] (03CR) 10Jbond: [C: 03+1] "LGTM thx" [puppet] - 10https://gerrit.wikimedia.org/r/800030 (owner: 10Filippo Giunchedi) [15:21:26] (03PS2) 10Cathal Mooney: Re-add urpf check to cloudsw -> cr interfaces Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/800053 (https://phabricator.wikimedia.org/T304989) [15:24:01] (03CR) 10Herron: [V: 03+2 C: 03+2] Add HAProxy SLO dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/790672 (https://phabricator.wikimedia.org/T307898) (owner: 10Vgutierrez) [15:24:03] (03CR) 10Jbond: "did you test this? Its been a while since i delved into the postgress module. 
also please tag with the following task" [puppet] - 10https://gerrit.wikimedia.org/r/800031 (owner: 10Filippo Giunchedi) [15:29:03] (03CR) 10Hnowlan: [C: 03+2] changeprop: Remove WP:ANI from page blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/797354 (https://phabricator.wikimedia.org/T274359) (owner: 10Samtar) [15:31:02] (03CR) 10Cathal Mooney: [C: 03+2] Re-add urpf check to cloudsw -> cr interfaces Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/800053 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [15:32:10] (03Merged) 10jenkins-bot: Re-add urpf check to cloudsw -> cr interfaces Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/800053 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [15:34:38] (03Merged) 10jenkins-bot: changeprop: Remove WP:ANI from page blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/797354 (https://phabricator.wikimedia.org/T274359) (owner: 10Samtar) [15:37:30] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.5.0 [software/homer] - 10https://gerrit.wikimedia.org/r/800067 [15:39:17] (03CR) 10BryanDavis: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/799427 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [15:42:25] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:44] bd808: hey, apologies, I made a typo in netbox that I believe is messing up your deploy (re: T297140) [15:44:45] T297140: New Service Request: developer-portal - https://phabricator.wikimedia.org/T297140 [15:44:49] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:44:52] I'm correcting now [15:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:57] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.5.0 [software/homer] - 10https://gerrit.wikimedia.org/r/800067 (owner: 10Volans) [15:45:00]
(03CR) 10BBlack: Add dumps mapping to cache_upload (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793525 (https://phabricator.wikimedia.org/T306550) (owner: 10BBlack) [15:45:38] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [15:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:49] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [15:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:38] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync [15:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:50] (03PS4) 10Jbond: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi) [15:46:52] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [15:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:10] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [15:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:21] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [15:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:17] (03PS5) 10Jbond: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi) [15:48:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/800029 (owner: 10Filippo Giunchedi) [15:49:01] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.5.0 [software/homer] - 10https://gerrit.wikimedia.org/r/800067 (owner: 10Volans) [15:49:08] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:11] !log upgrading 
spicerack on cumin2002 to (2.5.0-1+deb11u1 [15:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:33] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:41] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:00:05] jbond and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T1600). [16:00:05] bd808: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:29] (03CR) 10Jbond: [C: 03+2] cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [16:00:47] (03CR) 10Cathal Mooney: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/799427 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [16:00:52] bd808: hi! looking [16:01:14] rzl: awesome. 
I think it's pretty trivial [16:01:45] haha I saw the filename and got nervous but then I saw the diff :D yep no worries, merging [16:01:51] (03CR) 10RLazarus: [C: 03+2] base: remove "managed by puppet" notice on /etc/skel/.bashrc [puppet] - 10https://gerrit.wikimedia.org/r/798874 (owner: 10BryanDavis) [16:02:10] (03Merged) 10jenkins-bot: cli: Add support for XDG Base Directory spec [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [16:02:19] (03PS4) 10Filippo Giunchedi: cfssl: add missing migrations [puppet] - 10https://gerrit.wikimedia.org/r/800028 [16:03:13] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:03:39] !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update hieradata from Netbox - volans@cumin2002" [16:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:56] (03CR) 10Filippo Giunchedi: [C: 03+2] cfssl: add missing migrations [puppet] - 10https://gerrit.wikimedia.org/r/800028 (owner: 10Filippo Giunchedi) [16:04:01] bd808: done! 
[16:04:15] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: puppet documentation generation is missing some compnets - https://phabricator.wikimedia.org/T271909 (10jbond) [16:04:58] thanks rzl [16:05:24] !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update hieradata from Netbox - volans@cumin2002" [16:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:51] (03PS4) 10Filippo Giunchedi: cfssl: fix sqlite3 path selection [puppet] - 10https://gerrit.wikimedia.org/r/800030 [16:07:27] (03CR) 10Filippo Giunchedi: [C: 03+2] cfssl: fix sqlite3 path selection [puppet] - 10https://gerrit.wikimedia.org/r/800030 (owner: 10Filippo Giunchedi) [16:13:41] (03PS2) 10Jbond: puppet-merge: Add logging so we know when changes where merged [puppet] - 10https://gerrit.wikimedia.org/r/799943 (https://phabricator.wikimedia.org/T221529) [16:14:25] (03CR) 10Cathal Mooney: [C: 03+2] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/799427 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [16:18:29] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:21:03] (03Abandoned) 10Filippo Giunchedi: cfssl::db require sqlite3 package [puppet] - 10https://gerrit.wikimedia.org/r/800027 (owner: 10Filippo Giunchedi) [16:22:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [16:22:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [16:22:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:22:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:22:08] (03CR) 10Filippo Giunchedi: sqlite: update packages and add dependency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800048 (owner: 10Jbond) [16:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298560)', diff saved to https://phabricator.wikimedia.org/P28601 and previous config saved to /var/cache/conftool/dbconfig/20220526-162212-ladsgroup.json [16:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:24] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [16:22:44] (03PS3) 10Ahmon Dancy: mwdebug service: Add traindev environment support [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883 [16:22:46] (03PS1) 10Ahmon Dancy: mediawiki 0.2.1: Add a helm test [deployment-charts] - 10https://gerrit.wikimedia.org/r/800118 [16:22:57] (03PS6) 10Filippo Giunchedi: cfssl: write pretty json [puppet] - 10https://gerrit.wikimedia.org/r/800029 [16:23:49] (03PS3) 10Jbond: puppet-merge: Add logging so we know when changes where merged [puppet] - 10https://gerrit.wikimedia.org/r/799943 (https://phabricator.wikimedia.org/T221529) [16:25:19] (03CR) 10Filippo Giunchedi: cfssl::db require sqlite3 package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800027 (owner: 10Filippo Giunchedi) [16:26:02] (03PS2) 10Jbond: sqlite: update packages and add dependency [puppet] - 10https://gerrit.wikimedia.org/r/800048 [16:26:14] (03CR) 10Jbond: sqlite: update packages and add 
dependency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800048 (owner: 10Jbond) [16:27:48] (03CR) 10Filippo Giunchedi: puppetdb: create dbs before grants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800031 (owner: 10Filippo Giunchedi) [16:28:31] (03CR) 10Volans: [C: 03+1] "LGTM, just make sure it works as expected :D" [puppet] - 10https://gerrit.wikimedia.org/r/799943 (https://phabricator.wikimedia.org/T221529) (owner: 10Jbond) [16:29:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/800048 (owner: 10Jbond) [16:29:53] (03CR) 10Jbond: [C: 03+2] sqlite: update packages and add dependency [puppet] - 10https://gerrit.wikimedia.org/r/800048 (owner: 10Jbond) [16:30:56] (03CR) 10Jbond: [C: 03+2] puppet-merge: Add logging so we know when changes where merged (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799943 (https://phabricator.wikimedia.org/T221529) (owner: 10Jbond) [16:33:17] (03PS1) 10Jbond: README: minor commit to test new puppet merge logging [puppet] - 10https://gerrit.wikimedia.org/r/800120 [16:34:33] (03CR) 10Jbond: [C: 03+2] README: minor commit to test new puppet merge logging [puppet] - 10https://gerrit.wikimedia.org/r/800120 (owner: 10Jbond) [16:36:02] (03CR) 10Jbond: [C: 03+2] puppet-merge: Add logging so we know when changes where merged (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799943 (https://phabricator.wikimedia.org/T221529) (owner: 10Jbond) [16:36:36] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on cuminunpriv1001.eqiad.wmnet with reason: Testing new Ganeti features on Spicerack [16:36:37] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on cuminunpriv1001.eqiad.wmnet with reason: Testing new Ganeti features on Spicerack [16:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:39:25] (03PS1) 10Jbond: puppet-merge: include repo name in log messages [puppet] - 10https://gerrit.wikimedia.org/r/800121 [16:39:59] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet-merge: include repo name in log messages [puppet] - 10https://gerrit.wikimedia.org/r/800121 (owner: 10Jbond) [16:41:49] (03PS1) 10Jbond: README: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/800125 [16:42:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:43:04] (03CR) 10Jbond: [C: 03+2] README: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/800125 (owner: 10Jbond) [16:43:49] (03CR) 10Ori: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/800122 (owner: 10Ori) [16:47:59] (03CR) 10Ori: developer-portal: add to service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799429 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [16:48:01] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:51:16] (03CR) 10BryanDavis: developer-portal: add to service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799429 (https://phabricator.wikimedia.org/T297140) (owner: 10BryanDavis) [16:51:32] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [16:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Andrew) [16:52:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - 
https://phabricator.wikimedia.org/T302981 (10Andrew) Note that I'm renaming these two hosts to clouddumps100[12] [16:53:14] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [16:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:51] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:21] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:57:39] (03PS1) 10Andrew Bogott: Rename cloudstore101[01] to clouddumps100[12] [puppet] - 10https://gerrit.wikimedia.org/r/800152 (https://phabricator.wikimedia.org/T302981) [16:57:49] !log volans@cumin1001 START - Cookbook sre.dns.netbox [16:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:05] (03CR) 10Andrew Bogott: [C: 03+2] Rename cloudstore101[01] to clouddumps100[12] [puppet] - 10https://gerrit.wikimedia.org/r/800152 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott) [17:01:16] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:56] 10SRE-tools, 10Discovery, 10Infrastructure-Foundations, 10Discovery-Search (Current work), 10IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (10Volans) Eqiad is also done, pasting only the differences with the above snippet: `lang=python >>> devices = De... 
[17:04:01] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [17:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:33] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [17:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:32] (03PS1) 10Vlad.shapik: WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [17:13:25] (03PS1) 10Andrew Bogott: Added partman recipe 'hwraid-2dev.cfg' [puppet] - 10https://gerrit.wikimedia.org/r/800172 (https://phabricator.wikimedia.org/T302981) [17:14:36] (03CR) 10CI reject: [V: 04-1] Added partman recipe 'hwraid-2dev.cfg' [puppet] - 10https://gerrit.wikimedia.org/r/800172 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott) [17:16:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298555)', diff saved to https://phabricator.wikimedia.org/P28602 and previous config saved to /var/cache/conftool/dbconfig/20220526-171638-ladsgroup.json [17:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:47] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [17:16:57] (03PS2) 10Andrew Bogott: Added partman recipe 'hwraid-2dev.cfg' [puppet] - 10https://gerrit.wikimedia.org/r/800172 (https://phabricator.wikimedia.org/T302981) [17:18:40] (03CR) 10Andrew Bogott: [C: 03+2] Added partman recipe 'hwraid-2dev.cfg' [puppet] - 10https://gerrit.wikimedia.org/r/800172 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott) [17:20:12] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster 
[17:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:37] (03PS1) 10BryanDavis: developer-portal: add developer.wikimedia.org to CDN config [puppet] - 10https://gerrit.wikimedia.org/r/800181 (https://phabricator.wikimedia.org/T297140) [17:20:56] !log volans@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [17:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:01] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:22:02] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [17:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w... 
[17:23:08] (03CR) 10Cwhite: [C: 03+2] logstash: curator support new and legacy index patterns [puppet] - 10https://gerrit.wikimedia.org/r/798982 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [17:24:06] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd1026.mgmt.eqiad.wmnet with reboot policy FORCED [17:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: re-label cloudstore101[01] to clouddumps100[12] - https://phabricator.wikimedia.org/T309338 (10Andrew) [17:25:25] !log volans@cumin1001 START - Cookbook sre.dns.netbox [17:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:48] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1026.mgmt.eqiad.wmnet with reboot policy FORCED [17:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:12] (03PS1) 10Ladsgroup: Add drop_page_restrictions_T60674.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/800183 (https://phabricator.wikimedia.org/T60674) [17:30:10] (03PS1) 10Andrew Bogott: hwraid-2dev.cfg: Throw in a few more autoconfirms [puppet] - 10https://gerrit.wikimedia.org/r/800184 (https://phabricator.wikimedia.org/T302981) [17:30:12] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:30] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd1026.mgmt.eqiad.wmnet with reboot policy FORCED [17:31:05] (03CR) 10Andrew Bogott: [C: 03+2] hwraid-2dev.cfg: Throw in a few more autoconfirms [puppet] - 10https://gerrit.wikimedia.org/r/800184 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott) [17:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:44] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P28603 and previous config saved to /var/cache/conftool/dbconfig/20220526-173143-ladsgroup.json [17:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:54] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [17:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:10] !log volans@cumin1001 START - Cookbook sre.dns.wipe-cache cloudcephosd1026.mgmt.eqiad.wmnet on all recursors [17:32:14] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd1026.mgmt.eqiad.wmnet on all recursors [17:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:02] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [17:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:19] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [17:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:47] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1002.wikimedia.org with OS bullseye [17:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1002.wikimedia.org w... 
[17:35:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:37:26] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [17:37:28] (03PS1) 10Zabe: Revert "Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800190 (https://phabricator.wikimedia.org/T233004) [17:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:41:04] (03PS1) 10Andrew Bogott: hwraid-2dev.cfg partman: reorder again [puppet] - 10https://gerrit.wikimedia.org/r/800196 [17:41:53] (03CR) 10Andrew Bogott: [C: 03+2] hwraid-2dev.cfg partman: reorder again [puppet] - 10https://gerrit.wikimedia.org/r/800196 (owner: 10Andrew Bogott) [17:44:55] (03CR) 10Cwhite: [C: 03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/799870 (https://phabricator.wikimedia.org/T308439) (owner: 10Filippo Giunchedi) [17:45:38] (03PS1) 10Andrew Bogott: hwraid-2dev.cfg partman: add 1G swap [puppet] - 10https://gerrit.wikimedia.org/r/800197 [17:46:05] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster [17:46:07] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1026.mgmt.eqiad.wmnet with reboot policy FORCED [17:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P28604 and previous config saved to /var/cache/conftool/dbconfig/20220526-174648-ladsgroup.json [17:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:04] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd1026.mgmt.eqiad.wmnet with reboot policy FORCED [17:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:08] (03CR) 10Andrew Bogott: [C: 03+2] hwraid-2dev.cfg partman: add 1G swap [puppet] - 10https://gerrit.wikimedia.org/r/800197 (owner: 10Andrew Bogott) [17:49:45] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1001.wikimedia.org with OS bullseye [17:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with... 
[17:55:08] (03CR) 10Jbond: "This looks good to me, however lets get WMCS to look as well. In theory this could remove some protections from a WMCS stand-alone puppet" [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah) [17:58:37] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [17:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w... [18:00:04] dancy and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T1800). [18:00:28] o/ [18:01:18] (03PS1) 10Ahmon Dancy: group2 wikis to 1.39.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800212 (https://phabricator.wikimedia.org/T305219) [18:01:20] (03CR) 10Ahmon Dancy: [C: 03+2] group2 wikis to 1.39.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800212 (https://phabricator.wikimedia.org/T305219) (owner: 10Ahmon Dancy) [18:01:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298555)', diff saved to https://phabricator.wikimedia.org/P28605 and previous config saved to /var/cache/conftool/dbconfig/20220526-180153-ladsgroup.json [18:01:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1132.eqiad.wmnet with reason: Maintenance [18:01:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1132.eqiad.wmnet with reason: Maintenance [18:02:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T298555)', diff saved to https://phabricator.wikimedia.org/P28606 and previous config saved to /var/cache/conftool/dbconfig/20220526-180201-ladsgroup.json [18:02:02] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [18:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:53] (03PS1) 10Majavah: Provide a python3-bullseye image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/800213 [18:03:41] (03Merged) 10jenkins-bot: group2 wikis to 1.39.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800212 (https://phabricator.wikimedia.org/T305219) (owner: 10Ahmon Dancy) [18:04:36] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1002.wikimedia.org with OS bullseye [18:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1002.wikimedia.org with... 
[18:04:51] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1002.wikimedia.org with OS bullseye [18:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:57] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.39.0-wmf.13 refs T305219 [18:05:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1002.wikimedia.org w... [18:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:03] T305219: 1.39.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T305219 [18:08:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:09:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:35] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [18:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:26] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [18:13:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:31] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1002.wikimedia.org with reason: host reimage [18:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:32] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1002.wikimedia.org with reason: host reimage [18:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:47] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1001.wikimedia.org with OS bullseye [18:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with... [18:32:21] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1002.wikimedia.org with OS bullseye [18:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1002.wikimedia.org with... 
[18:33:50] (03PS1) 10Jbond: service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) [18:34:34] (03CR) 10CI reject: [V: 04-1] service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [18:37:24] (03PS2) 10Jbond: service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) [18:38:08] (03CR) 10CI reject: [V: 04-1] service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [18:40:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 15): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35576/console" [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [18:42:37] (03PS3) 10Jbond: service::node::config: drop merge_config [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) [18:46:53] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10RKemper) [18:47:08] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [18:48:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298560)', diff saved to https://phabricator.wikimedia.org/P28609 and previous config saved to /var/cache/conftool/dbconfig/20220526-184824-ladsgroup.json [18:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:30] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [18:50:17] (03CR) 10Dzahn: [C: 03+2] 
"https://puppet-compiler.wmflabs.org/pcc-worker1003/35577/testreduce1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/799306 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:51:28] (03PS1) 10BCornwall: turnilo: Fix port variable dererence for monitor [puppet] - 10https://gerrit.wikimedia.org/r/800231 [18:53:23] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "thanks! good catch. confirmed this is currently "check_http_on_port!${port}" in Icinga config. it will fix turnilo monitoring (https://pha" [puppet] - 10https://gerrit.wikimedia.org/r/800231 (owner: 10BCornwall) [18:53:39] (03PS2) 10Dzahn: turnilo: Fix port variable dererence for monitor [puppet] - 10https://gerrit.wikimedia.org/r/800231 (https://phabricator.wikimedia.org/T277729) (owner: 10BCornwall) [18:55:29] (03CR) 10Dzahn: "Feel free to merge or I can. If you do, please run puppet on alert1001 afterwards. Then let's see what happens at https://icinga.wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/800231 (https://phabricator.wikimedia.org/T277729) (owner: 10BCornwall) [18:55:32] (03PS1) 10Majavah: openstack: horizon: remove enc url from hiera [puppet] - 10https://gerrit.wikimedia.org/r/800232 [18:55:52] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:56:20] (03PS1) 10Ryan Kemper: elasticsearch: add more reimage usage examples [cookbooks] - 10https://gerrit.wikimedia.org/r/800233 [18:56:25] (03CR) 10Dzahn: [C: 03+1] "This should fix the UNKNOWN at https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=turnilo" [puppet] - 10https://gerrit.wikimedia.org/r/800231 (https://phabricator.wikimedia.org/T277729) (owner: 10BCornwall) [18:56:46] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35578/console" [puppet] - 10https://gerrit.wikimedia.org/r/800232 (owner: 10Majavah) 
[18:56:56] (03PS2) 10Ryan Kemper: elasticsearch: add more reimage usage examples [cookbooks] - 10https://gerrit.wikimedia.org/r/800233 [18:57:32] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:00:09] (03CR) 10Dzahn: [C: 03+2] "noop on parse*, testreduce1001 looks fine (besides unrelated issue that those wmf_auto_restart systemd units fail because some servers are" [puppet] - 10https://gerrit.wikimedia.org/r/799306 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [19:02:53] (03CR) 10Bking: [C: 03+1] elasticsearch: add more reimage usage examples [cookbooks] - 10https://gerrit.wikimedia.org/r/800233 (owner: 10Ryan Kemper) [19:03:04] (03PS3) 10Ryan Kemper: elasticsearch: add more reimage usage examples [cookbooks] - 10https://gerrit.wikimedia.org/r/800233 (https://phabricator.wikimedia.org/T308606) [19:03:28] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] elasticsearch: add more reimage usage examples [cookbooks] - 10https://gerrit.wikimedia.org/r/800233 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper) [19:03:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P28610 and previous config saved to /var/cache/conftool/dbconfig/20220526-190329-ladsgroup.json [19:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:13] (03PS4) 10AGueyte: Add SimilarEditors extension – I: Add to i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: 10Tchanders) [19:04:15] (03PS2) 10AGueyte: Add SimilarEditors extension – II: Add to InitialiseSettings, default off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793500 (https://phabricator.wikimedia.org/T306909) (owner: 10Jforrester) [19:04:17] (03PS2) 10AGueyte: Add SimilarEditors extension – III: Add to 
CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793501 (https://phabricator.wikimedia.org/T306909) (owner: 10Jforrester) [19:04:19] (03PS4) 10AGueyte: Deploy SimilarEditors to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) [19:05:07] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage - bking@cumin1001 - T309343 [19:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:14] T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 [19:05:35] (03PS5) 10AGueyte: Add SimilarEditors extension – I: Add to i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: 10Tchanders) [19:05:37] (03PS3) 10AGueyte: Add SimilarEditors extension – II: Add to InitialiseSettings, default off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793500 (https://phabricator.wikimedia.org/T306909) (owner: 10Jforrester) [19:05:39] (03PS3) 10AGueyte: Add SimilarEditors extension – III: Add to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793501 (https://phabricator.wikimedia.org/T306909) (owner: 10Jforrester) [19:05:41] (03PS2) 10AGueyte: Assign similareditors right to the checkuser group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205) [19:06:25] (03PS1) 10Dzahn: profile::auto_restarts: make comment match class name, minor grammar [puppet] - 10https://gerrit.wikimedia.org/r/800235 [19:06:26] 10SRE, 10Infrastructure-Foundations, 10Mail: [mitigated] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10jhathaway) I sent a message to the Exim mailing list, 
https://www.mail-archive.com/exim-users@exim.org/msg57216.html. Jeremy Harris suggeste... [19:06:37] 10SRE, 10Infrastructure-Foundations, 10Mail: [mitigated] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10jhathaway) a:03jhathaway [19:06:50] (03CR) 10CI reject: [V: 04-1] profile::auto_restarts: make comment match class name, minor grammar [puppet] - 10https://gerrit.wikimedia.org/r/800235 (owner: 10Dzahn) [19:07:17] (03PS2) 10Dzahn: profile::auto_restarts: make comment match class name, minor grammar [puppet] - 10https://gerrit.wikimedia.org/r/800235 [19:08:26] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1004.wikimedia.org with OS bullseye [19:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:31] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye [19:09:28] (03PS3) 10Dzahn: profile::auto_restarts: make comment match class name, minor grammar [puppet] - 10https://gerrit.wikimedia.org/r/800235 [19:11:33] (03CR) 10Jforrester: [C: 03+1] "This stack should now be good to deploy." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: 10Tchanders) [19:16:04] (03PS1) 10Dzahn: parsoid::testing: remove auto_restart for apache, it uses nginx instead [puppet] - 10https://gerrit.wikimedia.org/r/800237 [19:16:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew) [19:16:44] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:18:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P28611 and previous config saved to /var/cache/conftool/dbconfig/20220526-191834-ladsgroup.json [19:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew) a:05Andrew→03ArielGlenn @ArielGlenn these two new servers should be ready; I'm hoping that you have the time to move the data a... [19:19:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Andrew) [19:19:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Andrew) 05Open→03Resolved [19:19:22] (03CR) 10Dzahn: "it works but the issue is that scandium DOES have an apache while testreduce1001 does not.. 
but both are using parsoid::testing" [puppet] - 10https://gerrit.wikimedia.org/r/800237 (owner: 10Dzahn) [19:19:54] (03CR) 10Dzahn: [C: 04-1] "for now -1, need a different approach to separate scandium/testreduce1001" [puppet] - 10https://gerrit.wikimedia.org/r/800237 (owner: 10Dzahn) [19:27:22] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:28:10] 10SRE, 10DC-Ops, 10Infrastructure-Foundations: private repo deployment - perccli implementation - https://phabricator.wikimedia.org/T308027 (10RobH) [19:28:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [19:28:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [19:28:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10Jclark-ctr) stat1010 E1 u24 cableid # 20220077 port24 [19:29:15] (03PS1) 10Dzahn: parsoid::testing: add an auto_restart service for nginx [puppet] - 10https://gerrit.wikimedia.org/r/800241 [19:30:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10Jclark-ctr) [19:30:37] (03PS1) 10Jbond: wmflib::clusters::fetch: possible replacement for cluster_config [puppet] - 10https://gerrit.wikimedia.org/r/800242 (https://phabricator.wikimedia.org/T308639) [19:30:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [19:33:37] 10SRE, 
10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wqds101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10Jclark-ctr) wqds1014 wqds1015 [19:33:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298560)', diff saved to https://phabricator.wikimedia.org/P28612 and previous config saved to /var/cache/conftool/dbconfig/20220526-193339-ladsgroup.json [19:33:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [19:33:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [19:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:46] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [19:33:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298560)', diff saved to https://phabricator.wikimedia.org/P28613 and previous config saved to /var/cache/conftool/dbconfig/20220526-193347-ladsgroup.json [19:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:05] (03CR) 10CI reject: [V: 04-1] wmflib::clusters::fetch: possible replacement for cluster_config [puppet] - 10https://gerrit.wikimedia.org/r/800242 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [19:35:37] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) > Jin, > > When you were last onsite, I neglected to include the swap of a problematic optic we have. > > Can you quote us for an on-site to swap the optic in cr3-eqsin:xe-0/1/1 located in 603, U40.... 
[19:36:28] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T298459 (10RobH) 05Open→03Declined same as T300485 [19:36:33] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1004.wikimedia.org with OS bullseye [19:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:38] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors: - cloudela... [19:40:27] !log T304548 running extensions/GrowthExperiments/maintenance/changeWikiConfig.php on tier4 Growth wikis [19:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:32] T304548: Deploy "add a link" to 4th round of wikis - https://phabricator.wikimedia.org/T304548 [19:40:58] (03PS1) 10Ryan Kemper: elasticsearch: add ANSI color codes [cookbooks] - 10https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606) [19:41:25] (03PS2) 10Ryan Kemper: elasticsearch: add ANSI color codes [cookbooks] - 10https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606) [19:42:08] (03PS3) 10Ryan Kemper: elasticsearch: add ANSI color codes [cookbooks] - 10https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606) [19:44:10] (03PS1) 10Dzahn: parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - 10https://gerrit.wikimedia.org/r/800245 [19:45:06] (03CR) 10CI reject: [V: 04-1] parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - 10https://gerrit.wikimedia.org/r/800245 (owner: 10Dzahn) [19:45:27] (03PS2) 10Dzahn: parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - 10https://gerrit.wikimedia.org/r/800245 
[19:46:20] (03CR) 10CI reject: [V: 04-1] parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - 10https://gerrit.wikimedia.org/r/800245 (owner: 10Dzahn) [19:46:44] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage - bking@cumin1001 - T309343 [19:46:44] (03CR) 10CI reject: [V: 04-1] elasticsearch: add ANSI color codes [cookbooks] - 10https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper) [19:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:50] T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 [19:49:31] (03CR) 10Dzahn: "hah! jerkins already gives -1 for " The following are missing a SPDX licence header:". nice" [puppet] - 10https://gerrit.wikimedia.org/r/800245 (owner: 10Dzahn) [19:49:49] it's not called jerkins anymore? :o [19:52:06] (03PS3) 10Dzahn: parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - 10https://gerrit.wikimedia.org/r/800245 [19:53:06] (03Abandoned) 10Dzahn: parsoid::testing: remove auto_restart for apache, it uses nginx instead [puppet] - 10https://gerrit.wikimedia.org/r/800237 (owner: 10Dzahn) [19:53:17] (03CR) 10CI reject: [V: 04-1] parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - 10https://gerrit.wikimedia.org/r/800245 (owner: 10Dzahn) [19:54:52] (03PS4) 10Dzahn: parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - 10https://gerrit.wikimedia.org/r/800245 [19:55:18] end of an era! 
[19:55:35] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (10Jgreen) [19:56:00] hahaa, yea [19:58:33] (03PS1) 10Gergő Tisza: Enable GrowthExperiments link recommendations, round 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800247 (https://phabricator.wikimedia.org/T304548) [19:58:48] (03PS4) 10Ryan Kemper: elasticsearch: add ANSI color codes [cookbooks] - 10https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606) [19:59:43] (03PS1) 10Zabe: snmp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800248 (https://phabricator.wikimedia.org/T308013) [20:00:05] brennen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220526T2000). [20:00:05] zabe and tgr: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:25] o/ [20:00:31] o/ [20:00:38] (03PS5) 10Ryan Kemper: elasticsearch: add ANSI color codes [cookbooks] - 10https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606) [20:00:56] zabe: about? [20:01:12] hey [20:01:22] (03CR) 10Dzahn: "traffic team, you should just decide how you prefer it. I don't know how often this happens currently and how urgent it really is. Maybe w" [puppet] - 10https://gerrit.wikimedia.org/r/788312 (owner: 10David Caro) [20:01:48] (03CR) 10Brennen Bearnes: [C: 03+2] Acquire fresh actor id [extensions/CheckUser] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798818 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:02:22] zabe: anything to test with this first one? 
[20:02:26] no [20:03:13] (KubernetesRsyslogDown) firing: (8) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:03:28] (03PS1) 10Zabe: shiny_server: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800249 (https://phabricator.wikimedia.org/T308013) [20:03:32] (03CR) 10Brennen Bearnes: [C: 03+2] Fix phan failure PhanPluginSimplifyExpressionBool [extensions/CheckUser] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798819 (owner: 10Zabe) [20:04:29] (03CR) 10Dzahn: "Has been answered on ticket. While it could be automated they do want shell access at first at least to understand the full process. The r" [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: 10Alexandros Kosiaris) [20:06:40] zabe, tgr - any reason not to deploy these config patches while waiting on the checkuser ones? [20:06:57] mine can be deployed without testing [20:06:58] (03PS1) 10Zabe: sbuild: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800250 (https://phabricator.wikimedia.org/T308013) [20:07:39] brennen, mine can't. The checkuser patches fix a production error that needs to be fixed for that config patch. [20:07:52] zabe: ack, cool. will go ahead with tgr's then. 
[20:08:58] (03CR) 10Brennen Bearnes: [C: 03+2] Enable GrowthExperiments link recommendations, round 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800247 (https://phabricator.wikimedia.org/T304548) (owner: 10Gergő Tisza) [20:09:50] (03PS1) 10Zabe: samplicator: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800251 (https://phabricator.wikimedia.org/T308013) [20:09:52] (03Merged) 10jenkins-bot: Enable GrowthExperiments link recommendations, round 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800247 (https://phabricator.wikimedia.org/T304548) (owner: 10Gergő Tisza) [20:10:40] (03PS2) 10Dzahn: admin: Add sgimeno to restricted [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: 10Alexandros Kosiaris) [20:12:28] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Kelson) [20:12:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Kelson) [20:12:44] !log brennen@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:800247|Enable GrowthExperiments link recommendations, round 4 (T304548)]] (duration: 00m 56s) [20:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:51] T304548: Deploy "add a link" to 4th round of wikis - https://phabricator.wikimedia.org/T304548 [20:13:03] (03PS1) 10Zabe: rsyslog: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800252 (https://phabricator.wikimedia.org/T308013) [20:13:05] tgr: synched. 
[20:16:24] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Incident: 2022-05-09 Exim BDAT Errors incident - https://phabricator.wikimedia.org/T309238 (10jhathaway) [20:16:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:38] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Kelson) @ArielGlenn It seems that T302981 has just been implemented. Does that mean you have no blocker anymore for this task? [20:17:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:17:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:16] (03PS1) 10Zabe: r_lang: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800254 (https://phabricator.wikimedia.org/T308013) [20:18:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Kelson) @Andrew Thank you for finally completing this task! [20:18:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:00] (03CR) 10BCornwall: [C: 03+2] turnilo: Fix port variable dererence for monitor [puppet] - 10https://gerrit.wikimedia.org/r/800231 (https://phabricator.wikimedia.org/T277729) (owner: 10BCornwall) [20:19:23] (03CR) 10Cwhite: [C: 03+2] "PCC indicates this will alter /etc/default/opensearch but it does not notify the opensearch service. 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/799310 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [20:19:49] (03PS1) 10Zabe: resolvconf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800255 (https://phabricator.wikimedia.org/T308013) [20:20:39] (03CR) 10CI reject: [V: 04-1] resolvconf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/800255 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [20:21:09] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1004.wikimedia.org with OS bullseye [20:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:14] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye [20:21:31] (03Merged) 10jenkins-bot: Fix phan failure PhanPluginSimplifyExpressionBool [extensions/CheckUser] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798819 (owner: 10Zabe) [20:22:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wqds101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson wqds1014 E2 cableid 20220072 port 30 wqds1015 E3 cableid 20220071 port... [20:22:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wqds101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10Jclark-ctr) [20:23:38] 10SRE, 10ops-eqiad: db1128 faulty memory - https://phabricator.wikimedia.org/T309291 (10wiki_willy) a:05wiki_willy→03Cmjohnson Hi @Marostegui - @Cmjohnson is going to check if we can pull one of the DIMMs from one of these retired pc* hosts: https://netbox.wikimedia.org/dcim/devices/?q=&site=eqiad&status=... 
[20:23:45] (03CR) 10CI reject: [V: 04-1] Acquire fresh actor id [extensions/CheckUser] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798818 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:24:01] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:24:22] !log brennen@deploy1002 Synchronized php-1.39.0-wmf.13/extensions/CheckUser/src/Specials/SpecialCheckUser.php: Backport: [[gerrit:798819|Fix phan failure PhanPluginSimplifyExpressionBool]] (duration: 00m 52s) [20:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:05] (03CR) 10Brennen Bearnes: [C: 03+2] "recheck" [extensions/CheckUser] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798818 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:26:20] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1004.wikimedia.org with OS bullseye [20:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:25] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors: - cloudela... 
[20:26:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T298555)', diff saved to https://phabricator.wikimedia.org/P28614 and previous config saved to /var/cache/conftool/dbconfig/20220526-202625-ladsgroup.json [20:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:31] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [20:27:55] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1004.wikimedia.org with OS bullseye [20:28:00] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye [20:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:03] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1004.wikimedia.org with OS bullseye [20:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:07] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors: - cloudela... 
[20:28:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:29:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:03] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1004.wikimedia.org with OS bullseye [20:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:07] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye [20:30:10] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1004.wikimedia.org with OS bullseye [20:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:15] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors: - cloudela... 
[20:30:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10Jclark-ctr) @BTullis please confirm if New rows E- F are ok for this host. [20:40:54] !log bking@install1003 removed cloudelastic1004.conf pxe config file T309343 [20:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:03] T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 [20:41:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P28615 and previous config saved to /var/cache/conftool/dbconfig/20220526-204130-ladsgroup.json [20:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Jclark-ctr) @BTullis please confirm racking instructions and if New rows E- F are ok racking [20:42:14] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:42:22] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1004.wikimedia.org with OS bullseye [20:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:28] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host 
cloudelastic1004.wikimedia.org with OS bullseye [20:44:49] (03Merged) 10jenkins-bot: Acquire fresh actor id [extensions/CheckUser] (wmf/1.39.0-wmf.13) - 10https://gerrit.wikimedia.org/r/798818 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:45:23] (03CR) 10Bking: [C: 03+1] elasticsearch: add ANSI color codes [cookbooks] - 10https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper) [20:45:45] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800190 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:45:51] (03PS2) 10Brennen Bearnes: Revert "Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800190 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:46:03] (03CR) 10Brennen Bearnes: Revert "Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800190 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:47:57] (03Merged) 10jenkins-bot: Revert "Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800190 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:48:27] zabe: "acquire fresh actor id" and the config revert-revert are on mwdebug1002 if there's anything testable [20:48:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10cmooney) a:05Cmjohnson→03Jclark-ctr @Cmjohnson apologies I assigned this to you in error (blind as a bat), I see @Jclark-ctr actually... [20:48:42] looking [20:49:59] brennen, looks good [20:50:03] zabe: cool, syncing. 
[20:51:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:52:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:39] !log brennen@deploy1002 Synchronized php-1.39.0-wmf.13/extensions/CheckUser/src/Hooks.php: Backport: [[gerrit:798818|Acquire fresh actor id (T233004 T309148)]] (duration: 00m 51s) [20:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:46] T309148: Wikimedia\Rdbms\DBQueryError: Error 1048: Column 'cuc_actor' cannot be nullFunction: MediaWiki\CheckUser\Hooks::updateCheckUserDataQuery: INSERT INTO `cu_changes` (cuc_namespace,cuc_title,cuc_minor,cuc_user,cuc_user_text,cuc_ - https://phabricator.wikimedia.org/T309148 [20:53:46] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [20:55:05] !log brennen@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:800190|Revert "Revert "Start writing to cuc_actor in s3, kcgwiki and labtestwiki"" (T233004)]] (duration: 00m 50s) [20:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:35] !log end of utc late backport and config window [20:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:47] zabe: done, thx. 
[20:55:56] thanks for your help :) [20:56:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P28616 and previous config saved to /var/cache/conftool/dbconfig/20220526-205635-ladsgroup.json [20:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] (03PS1) 10Cathal Mooney: Modify Eqiad CR labs-in filter to allow BGP and ICMP [homer/public] - 10https://gerrit.wikimedia.org/r/800270 (https://phabricator.wikimedia.org/T304989) [21:02:07] (03CR) 10Dzahn: "I know Alexandros is currently out so I am being bold and just amend here and use "restricted". That is a subset of deployment and ensures" [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: 10Alexandros Kosiaris) [21:02:33] (03CR) 10Cathal Mooney: [C: 03+2] Modify Eqiad CR labs-in filter to allow BGP and ICMP [homer/public] - 10https://gerrit.wikimedia.org/r/800270 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [21:03:09] (03Merged) 10jenkins-bot: Modify Eqiad CR labs-in filter to allow BGP and ICMP [homer/public] - 10https://gerrit.wikimedia.org/r/800270 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [21:03:21] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Dzahn) [21:03:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Jclark-ctr) @cmooney Apologize for that not sure how that changed when i copied it from excel to here i noticed a few other mistakes dow... 
[21:03:33] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Dzahn) a:03Dzahn [21:03:38] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Dzahn) 05Open→03In progress [21:05:15] (03PS1) 10Zabe: Start writing to cuc_actor everywhere except s4 and s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800278 (https://phabricator.wikimedia.org/T233004) [21:06:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10cmooney) @Jclark-ctr ok thanks for the clarification. I've only put the port details for 1025 and 1026 into Netbox so far, ports 21 and... [21:09:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Jclark-ctr) cloudcephosd1030 f4 21u 20 20220087 ; 21 20220081 cloudcephosd1031 f4 22u 22 20220075 ; 23 20220083 cloudcephosd1032 f4 23u 2... [21:10:30] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1004.wikimedia.org with OS bullseye [21:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:34] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors: - cloudela... 
[21:11:26] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Dzahn) @thcipriani Your approval is requested as group approver for "restricted" (just like for 'deployment'). [21:11:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T298555)', diff saved to https://phabricator.wikimedia.org/P28617 and previous config saved to /var/cache/conftool/dbconfig/20220526-211140-ladsgroup.json [21:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:47] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [21:12:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Jclark-ctr) a:05Jclark-ctr→03cmooney [21:12:56] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Dzahn) @Sgs Understood! I'll move this forward to get you your access to unblock you. Automating it as a systemd timer would be nice indeed and we can help... [21:15:10] !log puppetmaster1001 - sudo puppet cert clean gitlab1004.wikimedia.org revoked cert with serial 9600 AND cert with serial 9694 - somehow agent got "cert revoked" before I did anything (T309259) [21:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:17] T309259: gitlab1004 - puppet cert revoked? 
- https://phabricator.wikimedia.org/T309259 [21:16:11] !log gitlab1004 - rm -rf /var/lib/puppet/ssl (T309259) [21:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:07] !log gitlab1004/puppetmaster1001 - create new signing request, sign new cert for puppet, fixed puppet run - T309259 [21:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10cmooney) Should be good for rows E and F if that works for the team. [21:17:33] 10SRE, 10serviceops, 10GitLab (Infrastructure): gitlab1004 - puppet cert revoked? - https://phabricator.wikimedia.org/T309259 (10Dzahn) 05Open→03Resolved a:03Dzahn Notice: /Stage[main]/Ferm/Service[ferm]/ensure: ensure changed 'stopped' to 'running' (corrective) Info: /Stage[main]/Ferm/Service[ferm]: U... [21:17:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10cmooney) These should be ok for rows E/F if that suits the team. 
[21:18:59] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:21:47] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Connect - wmf_public_asn, AS14907/IPv4: Connect - wmf_public_asn, AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:24:48] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:12] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1026.mgmt.eqiad.wmnet with reboot policy FORCED [21:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:48] 10SRE, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10bking) Looking at the console, the installer keeps coming up in non-interactive mode. I tried clicking through, but it said it couldn't download the preseed file. Will raise... [21:33:37] 10SRE: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (10mdedul.islam.16) [21:33:38] (03CR) 10Jdlrobson: [C: 03+1] Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang) [21:34:01] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:40:11] (03CR) 10Ori: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/800282 (https://phabricator.wikimedia.org/T207200) (owner: 10Ori) [21:42:08] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:31] (03PS3) 10Ori: Beta Cluster: ship logs from docker services to logstash [puppet] - 10https://gerrit.wikimedia.org/r/800282 (https://phabricator.wikimedia.org/T309319) [21:42:44] 10SRE: a couple longer running icinga alerts to be fixed - https://phabricator.wikimedia.org/T309257 (10TheresNoTime) 05duplicate→03Open [21:43:34] (03CR) 10Krinkle: [C: 04-1] Add language fallback support for wmgSiteLogoVariants (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang) [21:43:49] phab spammer trying to merge stuff into his spam task.. someone already blocked them. good [21:44:31] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:45:05] (03CR) 10CI reject: [V: 04-1] Beta Cluster: ship logs from docker services to logstash [puppet] - 10https://gerrit.wikimedia.org/r/800282 (https://phabricator.wikimedia.org/T309319) (owner: 10Ori) [21:46:26] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Jclark-ctr) Yea looks right i just need it for setting up servers [21:49:25] (03CR) 10Ebernhardson: [C: 03+1] "This seems sane to me, forcing revalidation. Unfortunately while I've adjusted this file I'm also far from an expert on these things." 
[puppet] - 10https://gerrit.wikimedia.org/r/799297 (https://phabricator.wikimedia.org/T289243) (owner: 10Lucas Werkmeister (WMDE)) [21:52:03] (03PS4) 10Ori: Beta Cluster: ship logs from docker services to logstash [puppet] - 10https://gerrit.wikimedia.org/r/800282 (https://phabricator.wikimedia.org/T309319) [21:53:30] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:13] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn, AS14907/IPv6: Connect - wmf_public_asn, AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:54:46] (03CR) 10Dzahn: [C: 03+1] "thanks again. fixed https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=an-tool1007&service=Check+Turnilo+node+appserver" [puppet] - 10https://gerrit.wikimedia.org/r/800231 (https://phabricator.wikimedia.org/T277729) (owner: 10BCornwall) [21:57:05] (03CR) 10Dzahn: "adding group approver" [puppet] - 10https://gerrit.wikimedia.org/r/798667 (https://phabricator.wikimedia.org/T309045) (owner: 10Alexandros Kosiaris) [21:58:22] (03PS1) 10Dzahn: admin: add mabualruz to ldap_only admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/800284 (https://phabricator.wikimedia.org/T309215) [21:58:49] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:59:08] (03CR) 10Dzahn: "does anyone think the ' in the realname field will be an issue?" 
[puppet] - 10https://gerrit.wikimedia.org/r/800284 (https://phabricator.wikimedia.org/T309215) (owner: 10Dzahn) [22:00:05] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Mabualruz - https://phabricator.wikimedia.org/T309215 (10Dzahn) 05Open→03In progress [22:00:21] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:02:21] (03PS1) 10Cathal Mooney: Change cloud uplink interface definition for VRF [homer/public] - 10https://gerrit.wikimedia.org/r/800285 (https://phabricator.wikimedia.org/T304989) [22:02:42] (03CR) 10Volans: [C: 04-1] "My 2 cents, if I can (non blocking):" [cookbooks] - 10https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper) [22:04:32] (03CR) 10Dzahn: [C: 03+2] "This already has approval from group_approver and manager and Alex uploaded it.. so I'll go ahead and close this out. Easy one too since b" [puppet] - 10https://gerrit.wikimedia.org/r/798664 (https://phabricator.wikimedia.org/T308308) (owner: 10Alexandros Kosiaris) [22:09:00] 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (10Dzahn) 05Open→03In progress [22:10:12] (03CR) 10Stang: Add language fallback support for wmgSiteLogoVariants (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang) [22:10:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10cmooney) @Jclark-ctr I'm not really able to progress this. I was gonna try one reimage but given the disk / RAID config needs to be done... 
[22:10:46] (03CR) 10RLazarus: [C: 03+1] admin: add mabualruz to ldap_only admins (wmf) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/800284 (https://phabricator.wikimedia.org/T309215) (owner: 10Dzahn) [22:12:20] 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (10Dzahn) deployed / resolved. Both users exist on the deployment server now. They will also e... [22:12:55] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:41] (03CR) 10Cathal Mooney: [C: 03+2] Change cloud uplink interface definition for VRF [homer/public] - 10https://gerrit.wikimedia.org/r/800285 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [22:16:54] 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (10Dzahn) 05In progress→03Resolved a:03Dzahn [22:17:07] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Dzahn) a:05Dzahn→03thcipriani [22:18:23] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [22:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:18] !log phabricator adding mabualruz to WMF-NDA group for access to private tickets T309215 [22:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:24] T309215: Grant Access to ldap/wmf for Mabualruz - https://phabricator.wikimedia.org/T309215 [22:20:59] (03CR) 10Dzahn: [C: 03+2] "thanks for review, going ahead" [puppet] - 10https://gerrit.wikimedia.org/r/800284 (https://phabricator.wikimedia.org/T309215) (owner: 
10Dzahn) [22:21:04] (03PS2) 10Dzahn: admin: add mabualruz to ldap_only admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/800284 (https://phabricator.wikimedia.org/T309215) [22:22:00] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:34] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10ArielGlenn) >>! In T57503#7961680, @Kelson wrote: > @ArielGlenn It seems that T302981 has just been implemented. Does that mean you have... [22:23:32] (03Merged) 10jenkins-bot: Change cloud uplink interface definition for VRF [homer/public] - 10https://gerrit.wikimedia.org/r/800285 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [22:25:19] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Mabualruz - https://phabricator.wikimedia.org/T309215 (10Dzahn) 05In progress→03Resolved a:03Dzahn @Mabualruz Welcome! You have been added to the "wmf" LDAP group and the "WMF-NDA" Phabricator group. This means you can now... 
[22:31:34] (03PS1) 10Cwhite: aptrepo: add opensearch2 thirdparty component [puppet] - 10https://gerrit.wikimedia.org/r/800294 (https://phabricator.wikimedia.org/T304440) [22:35:07] win 14 [22:49:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:49:23] PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:52:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) Yes, rows E and F are fine for this, thanks. [22:53:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10BTullis) Yes, rows E and F are fine for these presto servers, thanks. 
[22:53:57] RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:54:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10BTullis) [23:04:04] (03PS1) 10Dzahn: gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - 10https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) [23:05:07] (03PS2) 10Dzahn: gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - 10https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) [23:05:57] (03CR) 10CI reject: [V: 04-1] gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - 10https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [23:08:00] (03CR) 10Dzahn: "the part that I am also including backup::host means one step to being able to let Bacula fetch from it too. but the second step needed wi" [puppet] - 10https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [23:08:28] (03CR) 10Brennen Bearnes: [C: 03+1] "Discussed this with @Dzahn, seems like a good stop-gap." [puppet] - 10https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [23:10:37] (03PS3) 10Dzahn: gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - 10https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) [23:12:10] (03CR) 10Dzahn: "btw this only works because https://phabricator.wikimedia.org/T309259 is resolved since earlier today" [puppet] - 10https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [23:14:04] 10SRE, 10serviceops, 10GitLab (Infrastructure): gitlab1004 - puppet cert revoked? 
- https://phabricator.wikimedia.org/T309259 (10Dzahn) Now using this machine for https://gerrit.wikimedia.org/r/c/operations/puppet/+/800308 and setting it active in netbox. [23:14:08] (03PS4) 10Dzahn: gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - 10https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) [23:16:48] (03PS3) 10Cwhite: opensearch_dashboards: add backup script enable job [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) [23:18:34] (03PS5) 10Dzahn: gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - 10https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) [23:19:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:20:28] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35581/" [puppet] - 10https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [23:20:37] (03PS6) 10Dzahn: gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - 10https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) [23:23:45] (03CR) 10Brennen Bearnes: [C: 03+1] gitlab: create role/profile to temp use gitlab1004 to store backups [puppet] - 10https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [23:28:34] (03CR) 10Dzahn: "noop on gitlab1001 and on gitlab1003 puppet is disabled because it was trying to run the automatic restore.. which disabled puppet.. 
and t" [puppet] - 10https://gerrit.wikimedia.org/r/800308 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [23:33:09] (03PS1) 10Dzahn: gitlab::dump: ensure /srv/gitlab-backup exists so that rsync starts [puppet] - 10https://gerrit.wikimedia.org/r/800312 (https://phabricator.wikimedia.org/T274463) [23:33:58] (03CR) 10CI reject: [V: 04-1] gitlab::dump: ensure /srv/gitlab-backup exists so that rsync starts [puppet] - 10https://gerrit.wikimedia.org/r/800312 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [23:34:46] (03PS2) 10Dzahn: gitlab::dump: ensure /srv/gitlab-backup exists so that rsync starts [puppet] - 10https://gerrit.wikimedia.org/r/800312 (https://phabricator.wikimedia.org/T274463) [23:37:17] (03CR) 10CI reject: [V: 04-1] gitlab::dump: ensure /srv/gitlab-backup exists so that rsync starts [puppet] - 10https://gerrit.wikimedia.org/r/800312 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [23:42:27] (03PS3) 10Dzahn: gitlab::dump: ensure /srv/gitlab-backup exists so that rsync starts [puppet] - 10https://gerrit.wikimedia.org/r/800312 (https://phabricator.wikimedia.org/T274463) [23:45:24] (03CR) 10Dzahn: [C: 03+2] gitlab::dump: ensure /srv/gitlab-backup exists so that rsync starts [puppet] - 10https://gerrit.wikimedia.org/r/800312 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [23:52:45] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:54:39] (03PS4) 10Cwhite: opensearch_dashboards: add backup script enable job [puppet] - 10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) [23:54:41] (03PS1) 10Dzahn: gitlab::dump: use rsync::module directly, not quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/800329 (https://phabricator.wikimedia.org/T274463) [23:55:14] (03CR) 10Cwhite: opensearch_dashboards: add backup script enable job (035 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/798886 (https://phabricator.wikimedia.org/T237224) (owner: 10Cwhite) [23:55:51] (03PS2) 10Dzahn: gitlab::dump: use rsync::module directly, not quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/800329 (https://phabricator.wikimedia.org/T274463) [23:58:49] (03CR) 10Dzahn: [C: 03+2] gitlab::dump: use rsync::module directly, not quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/800329 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn)