[00:05:33] (PS3) Andrew Bogott: Add files and templates for OpenStack Wallaby [puppet] - https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275)
[00:05:35] (PS3) Andrew Bogott: OpenStack: add manifests for openstack wallaby [puppet] - https://gerrit.wikimedia.org/r/768830 (https://phabricator.wikimedia.org/T281275)
[00:06:05] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:33] (CR) jerkins-bot: [V: -1] OpenStack: add manifests for openstack wallaby [puppet] - https://gerrit.wikimedia.org/r/768830 (https://phabricator.wikimedia.org/T281275) (owner: Andrew Bogott)
[00:06:51] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:09] PROBLEM - Check systemd state on mx1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:54] (CR) jerkins-bot: [V: -1] Add files and templates for OpenStack Wallaby [puppet] - https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275) (owner: Andrew Bogott)
[00:08:27] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@b5f7840]: (no justification provided)
[00:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:35] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@b5f7840]: (no justification provided) (duration: 00m 08s)
[00:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:41] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:33] (CR) Samtar: [C: +1] Increase AbuseFilter's emergency disable threshold for fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/763982 (https://phabricator.wikimedia.org/T302227) (owner: Huji)
[00:24:25] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:25:06] (PS4) Andrew Bogott: Add files and templates for OpenStack Wallaby [puppet] - https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275)
[00:25:08] (PS4) Andrew Bogott: OpenStack: add manifests for openstack wallaby [puppet] - https://gerrit.wikimedia.org/r/768830 (https://phabricator.wikimedia.org/T281275)
[00:26:04] (CR) jerkins-bot: [V: -1] OpenStack: add manifests for openstack wallaby [puppet] - https://gerrit.wikimedia.org/r/768830 (https://phabricator.wikimedia.org/T281275) (owner: Andrew Bogott)
[00:27:22] (CR) jerkins-bot: [V: -1] Add files and templates for OpenStack Wallaby [puppet] - https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275) (owner: Andrew Bogott)
[00:34:12] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@c8a753b]: (no justification provided)
[00:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:20] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@c8a753b]: (no justification provided) (duration: 00m 07s)
[00:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:35] (CR) Cwhite: [C: +2] zuul: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763824 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite)
[00:49:24] (CR) Cwhite: [C: +2] maps: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763822 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite)
[00:49:31] (PS3) Cwhite: maps: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763822 (https://phabricator.wikimedia.org/T211982)
[01:00:49] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:07:50] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@c47e886]: (no justification provided)
[01:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:58] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@c47e886]: (no justification provided) (duration: 00m 08s)
[01:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:45] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@c47e886]: (no justification provided)
[01:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:50] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@c47e886]: (no justification provided) (duration: 00m 04s)
[01:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:09] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:39] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:22:37] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@21af07c]: (no justification provided)
[01:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:45] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@21af07c]: (no justification provided) (duration: 00m 07s)
[01:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:53] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@1c598f5]: (no justification provided)
[01:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:00] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@1c598f5]: (no justification provided) (duration: 00m 08s)
[01:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:30] (JobUnavailable) firing: (4) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:57:24] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@1c598f5]: (no justification provided)
[01:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:57:28] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@1c598f5]: (no justification provided) (duration: 00m 04s)
[01:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220308T0200)
[02:05:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:06:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:22] (PS1) TrainBranchBot: Branch commit for wmf/1.38.0-wmf.25 [core] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/768845
[02:07:28] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.38.0-wmf.25 [core] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/768845 (owner: TrainBranchBot)
[02:07:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:17] (Merged) jenkins-bot: Branch commit for wmf/1.38.0-wmf.25 [core] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/768845 (owner: TrainBranchBot)
[02:27:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:28:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:30] (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[02:35:30] (JobUnavailable) firing: (5) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[02:40:30] (JobUnavailable) firing: (5) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[02:50:30] (JobUnavailable) firing: (5) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[03:16:31] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:22:01] RECOVERY - MegaRAID on es1029 is OK: OK: optimal, 1 logical, 12 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:34:23] PROBLEM - Check unit status of geoip_update_legacy on puppetmaster1001 is CRITICAL: CRITICAL: Status of the systemd unit geoip_update_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:35:53] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: geoip_update_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:46:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P22019 and previous config saved to /var/cache/conftool/dbconfig/20220308-054602-root.json
[05:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:41] SRE, ops-eqiad, DBA: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T302169 (Marostegui) Thank you Chris, it looks good now!
[05:53:27] (PS5) Andrew Bogott: OpenStack: add manifests for openstack wallaby [puppet] - https://gerrit.wikimedia.org/r/768830 (https://phabricator.wikimedia.org/T281275)
[05:53:29] (PS1) Andrew Bogott: Update hacked nova/api/openstack/compute/servers.py for Wallaby [puppet] - https://gerrit.wikimedia.org/r/768852 (https://phabricator.wikimedia.org/T281275)
[05:53:31] (PS1) Andrew Bogott: Update trove/instance/models.py for wallaby [puppet] - https://gerrit.wikimedia.org/r/768853 (https://phabricator.wikimedia.org/T281275)
[05:53:33] (PS1) Andrew Bogott: Update trove/instance/models.py for wallaby [puppet] - https://gerrit.wikimedia.org/r/768854 (https://phabricator.wikimedia.org/T281275)
[05:54:17] (CR) jerkins-bot: [V: -1] OpenStack: add manifests for openstack wallaby [puppet] - https://gerrit.wikimedia.org/r/768830 (https://phabricator.wikimedia.org/T281275) (owner: Andrew Bogott)
[06:01:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P22020 and previous config saved to /var/cache/conftool/dbconfig/20220308-060106-root.json
[06:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:40] (PS1) Marostegui: core_multiinstance.my.cnf: innodb_adaptive_hash_index = OFF [puppet] - https://gerrit.wikimedia.org/r/768855 (https://phabricator.wikimedia.org/T268869)
[06:04:13] (CR) Marostegui: "PCC looks good: https://puppet-compiler.wmflabs.org/pcc-worker1002/34116/" [puppet] - https://gerrit.wikimedia.org/r/768855 (https://phabricator.wikimedia.org/T268869) (owner: Marostegui)
[06:04:16] (CR) Marostegui: [C: +2] core_multiinstance.my.cnf: innodb_adaptive_hash_index = OFF [puppet] - https://gerrit.wikimedia.org/r/768855 (https://phabricator.wikimedia.org/T268869) (owner: Marostegui)
[06:16:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P22021 and previous config saved to /var/cache/conftool/dbconfig/20220308-061609-root.json
[06:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[06:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[06:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T300381)', diff saved to https://phabricator.wikimedia.org/P22022 and previous config saved to /var/cache/conftool/dbconfig/20220308-061700-marostegui.json
[06:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:03] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[06:18:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[06:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[06:18:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T300775)', diff saved to https://phabricator.wikimedia.org/P22023 and previous config saved to /var/cache/conftool/dbconfig/20220308-061842-marostegui.json
[06:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:46] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[06:20:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[06:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[06:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T298294)', diff saved to https://phabricator.wikimedia.org/P22024 and previous config saved to /var/cache/conftool/dbconfig/20220308-062100-marostegui.json
[06:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:03] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[06:22:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298294)', diff saved to https://phabricator.wikimedia.org/P22025 and previous config saved to /var/cache/conftool/dbconfig/20220308-062206-marostegui.json
[06:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:20] (PS1) Marostegui: mariadb: Promote db1107 to m5 master [puppet] - https://gerrit.wikimedia.org/r/768954 (https://phabricator.wikimedia.org/T302190)
[06:27:07] (PS1) Marostegui: db1107: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/768955 (https://phabricator.wikimedia.org/T302190)
[06:30:04] (CR) Marostegui: [C: +2] db1107: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/768955 (https://phabricator.wikimedia.org/T302190) (owner: Marostegui)
[06:30:29] (PS2) Marostegui: mariadb: Promote db1107 to m5 master [puppet] - https://gerrit.wikimedia.org/r/768954 (https://phabricator.wikimedia.org/T302190)
[06:31:12] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/768954 (https://phabricator.wikimedia.org/T302190) (owner: Marostegui)
[06:32:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T300381)', diff saved to https://phabricator.wikimedia.org/P22026 and previous config saved to /var/cache/conftool/dbconfig/20220308-063210-marostegui.json
[06:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:13] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[06:37:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P22027 and previous config saved to /var/cache/conftool/dbconfig/20220308-063711-marostegui.json
[06:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P22028 and previous config saved to /var/cache/conftool/dbconfig/20220308-064714-marostegui.json
[06:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[06:52:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P22029 and previous config saved to /var/cache/conftool/dbconfig/20220308-065216-marostegui.json
[06:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:11] (CR) Majavah: Use namespaced ApiFeatureUsageQueryEngineElastica (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/767596 (https://phabricator.wikimedia.org/T302907) (owner: Reedy)
[07:02:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P22030 and previous config saved to /var/cache/conftool/dbconfig/20220308-070219-marostegui.json
[07:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298294)', diff saved to https://phabricator.wikimedia.org/P22031 and previous config saved to /var/cache/conftool/dbconfig/20220308-070721-marostegui.json
[07:07:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[07:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[07:07:24] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[07:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T298294)', diff saved to https://phabricator.wikimedia.org/P22032 and previous config saved to /var/cache/conftool/dbconfig/20220308-070728-marostegui.json
[07:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298294)', diff saved to https://phabricator.wikimedia.org/P22033 and previous config saved to /var/cache/conftool/dbconfig/20220308-070824-marostegui.json
[07:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T300381)', diff saved to https://phabricator.wikimedia.org/P22034 and previous config saved to /var/cache/conftool/dbconfig/20220308-071724-marostegui.json
[07:17:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[07:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[07:17:28] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[07:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:52] (CR) Andrew Bogott: [C: +2] P:wmcs::prometheus: update pdns ports [puppet] - https://gerrit.wikimedia.org/r/768767 (https://phabricator.wikimedia.org/T281276) (owner: Majavah)
[07:19:35] (PS2) Majavah: P:wmcs::prometheus: update blackbox urls [puppet] - https://gerrit.wikimedia.org/r/768294
[07:19:51] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:20:30] (PS2) Majavah: P:toolforge::static: publish SSH fingerprints under /admin [puppet] - https://gerrit.wikimedia.org/r/766292
[07:21:00] (PS1) Marostegui: production.my.cnf: innodb_adaptive_hash_index = OFF [puppet] - https://gerrit.wikimedia.org/r/768959 (https://phabricator.wikimedia.org/T268869)
[07:22:23] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:23:05] (CR) Marostegui: "PCC looks good: https://puppet-compiler.wmflabs.org/pcc-worker1002/34117/" [puppet] - https://gerrit.wikimedia.org/r/768959 (https://phabricator.wikimedia.org/T268869) (owner: Marostegui)
[07:23:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P22035 and previous config saved to /var/cache/conftool/dbconfig/20220308-072329-marostegui.json
[07:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:49] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:26:31] (CR) Marostegui: [C: +2] production.my.cnf: innodb_adaptive_hash_index = OFF [puppet] - https://gerrit.wikimedia.org/r/768959 (https://phabricator.wikimedia.org/T268869) (owner: Marostegui)
[07:34:01] (PS1) Marostegui: db1124: Install 10.6 on db1124 [puppet] - https://gerrit.wikimedia.org/r/768961 (https://phabricator.wikimedia.org/T301879)
[07:35:23] (CR) Marostegui: [C: +2] db1124: Install 10.6 on db1124 [puppet] - https://gerrit.wikimedia.org/r/768961 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[07:38:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P22036 and previous config saved to /var/cache/conftool/dbconfig/20220308-073833-marostegui.json
[07:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[07:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[07:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T300381)', diff saved to https://phabricator.wikimedia.org/P22037 and previous config saved to /var/cache/conftool/dbconfig/20220308-074136-marostegui.json
[07:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:39] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[07:42:51] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:43:33] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:53:20] (CR) JMeybohm: [C: +1] jobqueue: set CPU request [deployment-charts] - https://gerrit.wikimedia.org/r/768760 (https://phabricator.wikimedia.org/T300914) (owner: Hnowlan)
[07:53:33] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage1003.eqiad.wmnet
[07:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298294)', diff saved to https://phabricator.wikimedia.org/P22038 and previous config saved to /var/cache/conftool/dbconfig/20220308-075338-marostegui.json
[07:53:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[07:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[07:53:42] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[07:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T298294)', diff saved to https://phabricator.wikimedia.org/P22039 and previous config saved to /var/cache/conftool/dbconfig/20220308-075345-marostegui.json
[07:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T300381)', diff saved to https://phabricator.wikimedia.org/P22040 and previous config saved to /var/cache/conftool/dbconfig/20220308-075634-marostegui.json
[07:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:38] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[08:00:04] Amir1, awight, Urbanecm, and taavi: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220308T0800). Please do the needful.
[08:00:05] kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:57] good morning kostajh -- do you want to self-serve? [08:01:23] hi [08:01:26] sure, I can do that [08:01:29] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage1003.eqiad.wmnet [08:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:38] I need a couple minutes, just got back to my computer [08:01:59] sure. Go ahead when ready. I'm around (ping me if I'm needed) [08:03:48] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestage1004.eqiad.wmnet [08:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:50] ok getting started [08:08:32] (03PS2) 10Kosta Harlan: GrowthExperiments: Add image experiment for fa/fr/pt/trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768680 (https://phabricator.wikimedia.org/T302828) [08:08:43] (03CR) 10Kosta Harlan: [C: 03+2] "Backport/Config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768680 (https://phabricator.wikimedia.org/T302828) (owner: 10Kosta Harlan) [08:09:28] (03Merged) 10jenkins-bot: GrowthExperiments: Add image experiment for fa/fr/pt/trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768680 (https://phabricator.wikimedia.org/T302828) (owner: 10Kosta Harlan) [08:11:00] the patch is on mwdebug1002, I'll do a little testing [08:11:16] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage1004.eqiad.wmnet [08:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P22041 and previous config saved to 
/var/cache/conftool/dbconfig/20220308-081138-marostegui.json [08:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:04] urbanecm: it seems fine to me, I'll sync it [08:13:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:10] sounds good! [08:14:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298294)', diff saved to https://phabricator.wikimedia.org/P22042 and previous config saved to /var/cache/conftool/dbconfig/20220308-081407-marostegui.json [08:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:10] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [08:14:16] !log kharlan@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:768680|GrowthExperiments: Add image experiment for fa/fr/pt/trwiki (T302828)]] (duration: 00m 49s) [08:14:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:14:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:19] T302828: Scale: deploy "add an image" to pt, fa, fr, tr - https://phabricator.wikimedia.org/T302828 [08:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:15] (03PS16) 10Ryan Kemper: elasticsearch: load config from yaml [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 
(https://phabricator.wikimedia.org/T278378)
[08:16:28] (PS17) Ryan Kemper: elasticsearch: load config from yaml [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378)
[08:16:57] (PS18) Ryan Kemper: elasticsearch: load config from yaml [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378)
[08:24:01] (CR) jerkins-bot: [V: -1] elasticsearch: load config from yaml [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: Ryan Kemper)
[08:24:05] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:26:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P22043 and previous config saved to /var/cache/conftool/dbconfig/20220308-082643-marostegui.json
[08:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P22044 and previous config saved to /var/cache/conftool/dbconfig/20220308-082912-marostegui.json
[08:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:26] (PS2) Muehlenhoff: Switch cumin2001 to insetup role [puppet] - https://gerrit.wikimedia.org/r/768670 (https://phabricator.wikimedia.org/T276589)
[08:32:14] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2018.codfw.wmnet
[08:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:31] (CR) Ayounsi: [C: +2] Set fr-ops to operations [homer/public] - https://gerrit.wikimedia.org/r/768756 (https://phabricator.wikimedia.org/T302992) (owner: Ayounsi)
[08:39:06] (Merged)
jenkins-bot: Set fr-ops to operations [homer/public] - https://gerrit.wikimedia.org/r/768756 (https://phabricator.wikimedia.org/T302992) (owner: Ayounsi)
[08:39:42] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2018.codfw.wmnet
[08:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:24] (CR) Ladsgroup: [C: +1] mariadb: Promote db1107 to m5 master [puppet] - https://gerrit.wikimedia.org/r/768954 (https://phabricator.wikimedia.org/T302190) (owner: Marostegui)
[08:41:37] (CR) Muehlenhoff: [C: +2] Switch cumin2001 to insetup role [puppet] - https://gerrit.wikimedia.org/r/768670 (https://phabricator.wikimedia.org/T276589) (owner: Muehlenhoff)
[08:41:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T300381)', diff saved to https://phabricator.wikimedia.org/P22045 and previous config saved to /var/cache/conftool/dbconfig/20220308-084148-marostegui.json
[08:41:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[08:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[08:41:52] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[08:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P22046 and previous config saved to /var/cache/conftool/dbconfig/20220308-084416-marostegui.json
[08:44:18] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:57] ops-esams, DC-Ops: ripe-atlas-esams down - https://phabricator.wikimedia.org/T303242 (ayounsi)
[08:52:31] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:52:54] ACKNOWLEDGEMENT - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 664 probes of 665 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map ayounsi https://phabricator.wikimedia.org/T303242 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:52:54] ACKNOWLEDGEMENT - Host ripe-atlas-esams IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:201:91:198:174:132) ayounsi https://phabricator.wikimedia.org/T303242
[08:52:54] ACKNOWLEDGEMENT - Host ripe-atlas-esams is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T303242
[08:53:27] RECOVERY - Check systemd state on netflow6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:33] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:54:44] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2019.codfw.wmnet
[08:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P22047 and previous config saved to /var/cache/conftool/dbconfig/20220308-085644-root.json
[08:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:45] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:58:09] thats me
[08:59:03] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:59:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298294)', diff saved to https://phabricator.wikimedia.org/P22048 and previous config saved to /var/cache/conftool/dbconfig/20220308-085921-marostegui.json
[08:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[08:59:25] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[08:59:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[08:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T298294)', diff saved to https://phabricator.wikimedia.org/P22049 and previous config saved to /var/cache/conftool/dbconfig/20220308-085934-marostegui.json
[08:59:37] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298294)', diff saved to https://phabricator.wikimedia.org/P22050 and previous config saved to /var/cache/conftool/dbconfig/20220308-090051-marostegui.json
[09:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:54] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2019.codfw.wmnet
[09:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:42] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2020.codfw.wmnet
[09:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:22] (Abandoned) Hashar: logging: set canary field when host is a canary [mediawiki-config] - https://gerrit.wikimedia.org/r/724696 (https://phabricator.wikimedia.org/T291870) (owner: Hashar)
[09:05:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[09:05:25] (CR) Filippo Giunchedi: "I like the idea overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/768776 (owner: Majavah)
[09:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[09:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T300381)', diff saved to https://phabricator.wikimedia.org/P22051 and previous config saved to /var/cache/conftool/dbconfig/20220308-090531-marostegui.json
[09:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:34] T300381: Make page_props.pp_page unsigned on wmf wikis -
https://phabricator.wikimedia.org/T300381
[09:05:57] (CR) Filippo Giunchedi: [C: +2] Introduce 'alertmanager' and 'alerting' modules [software/spicerack] - https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) (owner: Filippo Giunchedi)
[09:06:54] (PS1) Marostegui: mariadb: Remove innodb_thread_concurrency variable [puppet] - https://gerrit.wikimedia.org/r/769019 (https://phabricator.wikimedia.org/T301879)
[09:08:05] (CR) Filippo Giunchedi: [V: +2 C: +2] Introduce 'alertmanager' and 'alerting' modules [software/spicerack] - https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) (owner: Filippo Giunchedi)
[09:08:42] (PS2) Marostegui: mariadb: Remove innodb_thread_concurrency variable [puppet] - https://gerrit.wikimedia.org/r/769019 (https://phabricator.wikimedia.org/T301879)
[09:10:01] (PS3) Majavah: prometheus: include number of changes on puppet run metrics [puppet] - https://gerrit.wikimedia.org/r/768776
[09:10:14] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2020.codfw.wmnet
[09:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:17] (CR) Majavah: prometheus: include number of changes on puppet run metrics (2 comments) [puppet] - https://gerrit.wikimedia.org/r/768776 (owner: Majavah)
[09:10:26] SRE-tools, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (fgiunchedi) The code for silencing itself is merged now, I'd imagine there are other followu...
[09:10:41] (PS3) Marostegui: mariadb: Promote db1107 to m5 master [puppet] - https://gerrit.wikimedia.org/r/768954 (https://phabricator.wikimedia.org/T302190)
[09:10:49] (CR) Marostegui: [C: +2] mariadb: Remove innodb_thread_concurrency variable [puppet] - https://gerrit.wikimedia.org/r/769019 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[09:11:02] (CR) Filippo Giunchedi: [C: +1] "LGTM, would like John's opinion/vote too" [puppet] - https://gerrit.wikimedia.org/r/768776 (owner: Majavah)
[09:11:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P22052 and previous config saved to /var/cache/conftool/dbconfig/20220308-091147-root.json
[09:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:46] (CR) Arturo Borrero Gonzalez: [C: +2] P:wmcs::prometheus: update blackbox urls [puppet] - https://gerrit.wikimedia.org/r/768294 (owner: Majavah)
[09:15:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P22053 and previous config saved to /var/cache/conftool/dbconfig/20220308-091556-marostegui.json
[09:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:55] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2021.codfw.wmnet
[09:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T300381)', diff saved to https://phabricator.wikimedia.org/P22054 and previous config saved to /var/cache/conftool/dbconfig/20220308-092045-marostegui.json
[09:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:49] T300381: Make page_props.pp_page unsigned on wmf wikis -
https://phabricator.wikimedia.org/T300381
[09:26:24] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2021.codfw.wmnet
[09:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P22055 and previous config saved to /var/cache/conftool/dbconfig/20220308-092651-root.json
[09:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:28] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2022.codfw.wmnet
[09:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:45] (CR) Btullis: Added config for the datahubsearch LVS service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/768668 (owner: Btullis)
[09:31:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P22056 and previous config saved to /var/cache/conftool/dbconfig/20220308-093101-marostegui.json
[09:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:14] (PS2) Ayounsi: Cleanup transport-in filters for codfw/eqiad [homer/public] - https://gerrit.wikimedia.org/r/747551
[09:34:57] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2022.codfw.wmnet
[09:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:12] (CR) Vgutierrez: P:varnish::common: Add support for passing wikimedia_domains (1 comment) [puppet] - https://gerrit.wikimedia.org/r/768766 (owner: Jbond)
[09:35:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P22057 and previous config saved to
/var/cache/conftool/dbconfig/20220308-093550-marostegui.json
[09:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:33] (CR) Volans: [C: -1] "As asked on IRC I did a pass, there are some things to fix, see inline for the details." [software/spicerack] - https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: Ryan Kemper)
[09:39:04] SRE, WMF-General-or-Unknown, WMF-Legal, Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (BTullis) I hereby license all my existing contributions to the operations/puppet under the Apache 2.0 license.
[09:39:53] /7
[09:39:56] ufff
[09:41:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P22058 and previous config saved to /var/cache/conftool/dbconfig/20220308-094155-root.json
[09:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[09:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[09:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T300775)', diff saved to https://phabricator.wikimedia.org/P22059 and previous config saved to /var/cache/conftool/dbconfig/20220308-094354-marostegui.json
[09:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:57] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[09:46:06] !log marostegui@cumin1001 dbctl commit
(dc=all): 'Repooling after maintenance db1165 (T298294)', diff saved to https://phabricator.wikimedia.org/P22060 and previous config saved to /var/cache/conftool/dbconfig/20220308-094605-marostegui.json
[09:46:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[09:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[09:46:09] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[09:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298294)', diff saved to https://phabricator.wikimedia.org/P22061 and previous config saved to /var/cache/conftool/dbconfig/20220308-094613-marostegui.json
[09:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298294)', diff saved to https://phabricator.wikimedia.org/P22062 and previous config saved to /var/cache/conftool/dbconfig/20220308-094730-marostegui.json
[09:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P22063 and previous config saved to /var/cache/conftool/dbconfig/20220308-095055-marostegui.json
[09:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:04] (PS2) Majavah: P:wmcs::prometheus: use a single entry for openstack-exporter [puppet] - https://gerrit.wikimedia.org/r/768747
[09:54:06] (PS1) Vgutierrez: site: Reimage cp2035 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/769022 (https://phabricator.wikimedia.org/T290005)
[09:55:25] (CR) Vgutierrez: [C: +2] site: Reimage cp2035 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/769022 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:56:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp2035.codfw.wmnet with OS buster
[09:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:55] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2035.codfw.wmnet with OS buster
[10:02:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P22064 and previous config saved to /var/cache/conftool/dbconfig/20220308-100234-marostegui.json
[10:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:02] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet
[10:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T300381)', diff saved to https://phabricator.wikimedia.org/P22065 and previous config saved to /var/cache/conftool/dbconfig/20220308-100559-marostegui.json
[10:06:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[10:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on
db1102.eqiad.wmnet with reason: Maintenance
[10:06:03] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[10:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:36] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet
[10:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:26] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin2002.codfw.wmnet
[10:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:23] PROBLEM - Keyholder SSH agent on cumin2002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder
[10:14:59] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2035.codfw.wmnet with reason: host reimage
[10:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:56] (PS5) Jbond: systemd::sysuser: create option to add additional groups to user [puppet] - https://gerrit.wikimedia.org/r/768743 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[10:17:30] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2035.codfw.wmnet with reason: host reimage
[10:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P22066 and previous config saved to /var/cache/conftool/dbconfig/20220308-101739-marostegui.json
[10:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host
ml-serve2001.codfw.wmnet
[10:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:54] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2002.codfw.wmnet
[10:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:03] RECOVERY - Keyholder SSH agent on cumin2002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[10:21:49] (PS1) Marostegui: mariadb: Remove innodb_file_format [puppet] - https://gerrit.wikimedia.org/r/769025 (https://phabricator.wikimedia.org/T301879)
[10:22:55] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet
[10:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:13] (CR) Marostegui: [C: +2] mariadb: Remove innodb_file_format [puppet] - https://gerrit.wikimedia.org/r/769025 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[10:26:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2002.codfw.wmnet
[10:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:28] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet
[10:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:45] (PS3) Ayounsi: Cleanup transport-in filters for codfw/eqiad [homer/public] - https://gerrit.wikimedia.org/r/747551
[10:27:54] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-staging2002.codfw.wmnet
[10:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:07] (PS6) Jbond: isystemd::sysuser: create option to add additional groups to user [puppet] - https://gerrit.wikimedia.org/r/768743 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[10:28:10]
(PS7) Jbond: gitlab_runner: add gitlab-runner to docker group, change folder permissions [puppet] - https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[10:28:26] (PS7) Jbond: systemd::sysuser: create option to add additional groups to user [puppet] - https://gerrit.wikimedia.org/r/768743 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[10:28:27] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2003.codfw.wmnet
[10:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:02] (PS8) Jbond: gitlab_runner: add gitlab-runner to docker group, change folder permissions [puppet] - https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[10:30:02] (CR) Jbond: "LGTM, i added a spec test and fixed a minor issue. i also change the patch set order so this is applied before the gitlab-runner cr" [puppet] - https://gerrit.wikimedia.org/r/768743 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[10:30:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[10:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[10:30:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[10:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason:
Maintenance
[10:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T300381)', diff saved to https://phabricator.wikimedia.org/P22067 and previous config saved to /var/cache/conftool/dbconfig/20220308-103017-marostegui.json
[10:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:21] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[10:32:06] (CR) Jbond: [C: +1] systemd::sysuser: create option to add additional groups to user [puppet] - https://gerrit.wikimedia.org/r/768743 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[10:32:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298294)', diff saved to https://phabricator.wikimedia.org/P22068 and previous config saved to /var/cache/conftool/dbconfig/20220308-103243-marostegui.json
[10:32:45] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:32:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[10:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[10:32:48] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[10:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298294)', diff saved to
https://phabricator.wikimedia.org/P22069 and previous config saved to /var/cache/conftool/dbconfig/20220308-103251-marostegui.json
[10:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P22070 and previous config saved to /var/cache/conftool/dbconfig/20220308-103409-root.json
[10:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:27] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:34:53] !log pool cp2035 with HAProxy as TLS termination layer - T290005
[10:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:55] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[10:35:38] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2003.codfw.wmnet
[10:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:41] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2002.codfw.wmnet
[10:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:06] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2004.codfw.wmnet
[10:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:20] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2035.codfw.wmnet with OS buster
[10:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:32] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic -
https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2035.codfw.wmnet with OS buster c...
[10:39:43] (CR) Jbond: [V: +1] "thanks see comment" [puppet] - https://gerrit.wikimedia.org/r/768766 (owner: Jbond)
[10:41:12] (CR) Jbond: varnish: Rate limit hotlinking (2 comments) [puppet] - https://gerrit.wikimedia.org/r/768723 (owner: Jbond)
[10:42:39] (PS1) Vgutierrez: site: Reimage cp1083 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/769027 (https://phabricator.wikimedia.org/T290005)
[10:42:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298294)', diff saved to https://phabricator.wikimedia.org/P22072 and previous config saved to /var/cache/conftool/dbconfig/20220308-104250-marostegui.json
[10:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:53] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[10:43:35] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1001.eqiad.wmnet
[10:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:38] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-cache2001.codfw.wmnet
[10:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T300381)', diff saved to https://phabricator.wikimedia.org/P22073 and previous config saved to /var/cache/conftool/dbconfig/20220308-104548-marostegui.json
[10:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:52] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[10:46:11] (CR) Vgutierrez: [C: +2] site: Reimage cp1083 as
cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/769027 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:46:41] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp1083.eqiad.wmnet with OS buster [10:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:53] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1083.eqiad.wmnet with OS buster [10:47:09] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2004.codfw.wmnet [10:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:30] (03PS9) 10Jelto: gitlab_runner: add gitlab-runner to docker group, change folder permissions [puppet] - 10https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) [10:49:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P22074 and previous config saved to /var/cache/conftool/dbconfig/20220308-104913-root.json [10:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:01] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34119/console" [puppet] - 10https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [10:50:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [10:51:01] !log btullis@datahubsearch1001:~$ sudo systemctl 
reset-failed ifup@ens13.service T273026 [10:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:03] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1001.eqiad.wmnet [10:51:04] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [10:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:26] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1002.eqiad.wmnet [10:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:19] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2001.codfw.wmnet [10:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:52] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-cache2002.codfw.wmnet [10:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:56] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2005.codfw.wmnet [10:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:10] (03CR) 10Hnowlan: [C: 03+2] jobqueue: set CPU request [deployment-charts] - 10https://gerrit.wikimedia.org/r/768760 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [10:57:27] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2002.codfw.wmnet [10:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P22075 and previous config saved to /var/cache/conftool/dbconfig/20220308-105754-marostegui.json [10:57:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:43] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:58:50] (03Merged) 10jenkins-bot: jobqueue: set CPU request [deployment-charts] - 10https://gerrit.wikimedia.org/r/768760 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [10:58:56] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1002.eqiad.wmnet [10:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:59:09] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1003.eqiad.wmnet [10:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:22] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-cache2002.codfw.wmnet [10:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:25] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [10:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:49] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[10:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:37] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:00:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P22076 and previous config saved to /var/cache/conftool/dbconfig/20220308-110053-marostegui.json [11:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:47] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:02:04] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2005.codfw.wmnet [11:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:41] (03CR) 10Muehlenhoff: "Looks good, two nits inline." [puppet] - 10https://gerrit.wikimedia.org/r/768736 (https://phabricator.wikimedia.org/T301382) (owner: 10Btullis) [11:02:53] !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[11:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:06] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1083.eqiad.wmnet with reason: host reimage [11:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:24] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2002.codfw.wmnet [11:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:58] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-cache2003.codfw.wmnet [11:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P22077 and previous config saved to /var/cache/conftool/dbconfig/20220308-110416-root.json [11:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:42] (03PS1) 10Marostegui: mariadb: Remove innodb_buffer_pool_instances flag [puppet] - 10https://gerrit.wikimedia.org/r/769028 (https://phabricator.wikimedia.org/T301879) [11:05:33] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2006.codfw.wmnet [11:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:03] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove innodb_buffer_pool_instances flag [puppet] - 10https://gerrit.wikimedia.org/r/769028 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui) [11:06:28] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1083.eqiad.wmnet with reason: host reimage [11:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:39] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
datahubsearch1003.eqiad.wmnet [11:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:52] 10Puppet, 10SRE, 10Infrastructure-Foundations: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (10fgiunchedi) [11:07:37] PROBLEM - Check systemd state on datahubsearch1003 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:21] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2003.codfw.wmnet [11:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:49] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:09:51] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [11:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:03] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:10:22] btullis: FYI ^^^ datahubsearch1003's ifup failed [11:10:23] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1083.eqiad.wmnet with OS buster [11:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:35] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1083.eqiad.wmnet with OS buster e... 
[11:11:12] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp1083.eqiad.wmnet with OS buster [11:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:16] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [11:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:24] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1083.eqiad.wmnet with OS buster [11:11:27] Ah, thanks volans. I'll reset it and add it to T273026 [11:11:27] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [11:11:28] (03PS3) 10Kormat: mariadb: Switch s6 primary db1173 -> db1131 [puppet] - 10https://gerrit.wikimedia.org/r/764784 (https://phabricator.wikimedia.org/T300471) [11:12:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2006.codfw.wmnet [11:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:12] np [11:12:21] RECOVERY - Check systemd state on datahubsearch1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P22078 and previous config saved to /var/cache/conftool/dbconfig/20220308-111259-marostegui.json [11:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:04] (03CR) 10Ayounsi: [C: 03+2] Cleanup transport-in filters for codfw/eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/747551 (owner: 10Ayounsi) [11:13:14] (03CR) 10Filippo 
Giunchedi: [C: 03+1] misc: search-grafana-dashboards.js (031 comment) [software] - 10https://gerrit.wikimedia.org/r/767118 (owner: 10Filippo Giunchedi) [11:13:37] (03Merged) 10jenkins-bot: Cleanup transport-in filters for codfw/eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/747551 (owner: 10Ayounsi) [11:15:12] !log Cleanup transport-in filters for codfw/eqiad (CR747551) [11:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P22079 and previous config saved to /var/cache/conftool/dbconfig/20220308-111558-marostegui.json [11:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:03] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [11:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:00] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (10MGerlach) >>! In T303031#7756076, @JMeybohm wrote: Could you please also provide an expiry/end date for this contract/agreement? MOU/NDA are valid until July 21, 2022. [11:18:17] (03PS2) 10Btullis: Add a profile specific to datahubsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/768736 (https://phabricator.wikimedia.org/T301382) [11:18:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ShubhankarP - https://phabricator.wikimedia.org/T303032 (10MGerlach) >>! In T303032#7756077, @JMeybohm wrote: Could you please also provide an expiry/end date for this contract/agreement? MOU/NDA are valid until July 21, 2022. 
[11:18:51] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2007.codfw.wmnet [11:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P22080 and previous config saved to /var/cache/conftool/dbconfig/20220308-111920-root.json [11:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:06] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid test cluster: Roll restart of Druid jvm daemons. [11:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:09] (03CR) 10Vgutierrez: [C: 03+2] prometheus:rules_global: Provide HAProxy availability metrics [puppet] - 10https://gerrit.wikimedia.org/r/768057 (owner: 10Vgutierrez) [11:20:30] (03PS1) 10Volans: CHANGELOG: add changelogs for release v2.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/769029 [11:21:26] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10Volans) >>! In T293209#7759313, @fgiunchedi wrote: > The code for silencing itself is merged... 
[11:23:20] (03PS4) 10Filippo Giunchedi: misc: search-grafana-dashboards.js [software] - 10https://gerrit.wikimedia.org/r/767118 [11:23:56] (03CR) 10Btullis: Add a profile specific to datahubsearch servers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/768736 (https://phabricator.wikimedia.org/T301382) (owner: 10Btullis) [11:23:58] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34120/console" [puppet] - 10https://gerrit.wikimedia.org/r/768736 (https://phabricator.wikimedia.org/T301382) (owner: 10Btullis) [11:24:04] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] "I went with GPL-2 for the license, merging! Thanks all" [software] - 10https://gerrit.wikimedia.org/r/767118 (owner: 10Filippo Giunchedi) [11:24:34] (03Merged) 10jenkins-bot: misc: search-grafana-dashboards.js [software] - 10https://gerrit.wikimedia.org/r/767118 (owner: 10Filippo Giunchedi) [11:24:49] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1107 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/768954 (https://phabricator.wikimedia.org/T302190) (owner: 10Marostegui) [11:25:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2007.codfw.wmnet [11:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:33] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve2008.codfw.wmnet [11:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:48] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add a profile specific to datahubsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/768736 (https://phabricator.wikimedia.org/T301382) (owner: 10Btullis) [11:26:37] (03CR) 10Btullis: [V: 03+1 C: 03+2] Move some common resources to the opensearch::server profile [puppet] - 10https://gerrit.wikimedia.org/r/768702 (https://phabricator.wikimedia.org/T301382) (owner: 10Btullis) [11:27:22] (03PS2) 
10Btullis: Added config for the datahubsearch LVS service [puppet] - 10https://gerrit.wikimedia.org/r/768668 [11:27:32] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1083.eqiad.wmnet with reason: host reimage [11:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:33] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v2.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/769029 (owner: 10Volans) [11:28:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298294)', diff saved to https://phabricator.wikimedia.org/P22081 and previous config saved to /var/cache/conftool/dbconfig/20220308-112804-marostegui.json [11:28:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1131.eqiad.wmnet with reason: Maintenance [11:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1131.eqiad.wmnet with reason: Maintenance [11:28:08] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [11:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T298294)', diff saved to https://phabricator.wikimedia.org/P22082 and previous config saved to /var/cache/conftool/dbconfig/20220308-112811-marostegui.json [11:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid test cluster: Roll restart of Druid jvm daemons. 
[11:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298294)', diff saved to https://phabricator.wikimedia.org/P22083 and previous config saved to /var/cache/conftool/dbconfig/20220308-112929-marostegui.json [11:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:43] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:30:54] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1083.eqiad.wmnet with reason: host reimage [11:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T300381)', diff saved to https://phabricator.wikimedia.org/P22084 and previous config saved to /var/cache/conftool/dbconfig/20220308-113102-marostegui.json [11:31:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1123.eqiad.wmnet with reason: Maintenance [11:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1123.eqiad.wmnet with reason: Maintenance [11:31:06] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [11:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T300381)', diff saved to https://phabricator.wikimedia.org/P22085 and previous config saved to 
/var/cache/conftool/dbconfig/20220308-113110-marostegui.json [11:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:15] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:31:17] (03CR) 10Kormat: [C: 03+1] auto_schema: Add support for --check in running schema changes [software] - 10https://gerrit.wikimedia.org/r/767554 (https://phabricator.wikimedia.org/T301896) (owner: 10Ladsgroup) [11:31:54] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2008.codfw.wmnet [11:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:03] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34121/console" [puppet] - 10https://gerrit.wikimedia.org/r/768668 (owner: 10Btullis) [11:32:39] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:33:29] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v2.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/769029 (owner: 10Volans) [11:34:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P22086 and previous config saved to /var/cache/conftool/dbconfig/20220308-113424-root.json [11:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:29] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Add support for --check in running schema changes [software] - 10https://gerrit.wikimedia.org/r/767554 (https://phabricator.wikimedia.org/T301896) (owner: 10Ladsgroup) [11:34:59] (03Merged) 10jenkins-bot: auto_schema: Add support for --check in running schema changes [software] - 
10https://gerrit.wikimedia.org/r/767554 (https://phabricator.wikimedia.org/T301896) (owner: 10Ladsgroup) [11:36:34] (03PS1) 10Volans: Upstream release v2.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/769031 [11:37:01] (03CR) 10Volans: [C: 03+2] Upstream release v2.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/769031 (owner: 10Volans) [11:39:17] (03CR) 10Btullis: [V: 03+1] "Adding Vgutierrez and Ssingh from Traffic to reviewers." [puppet] - 10https://gerrit.wikimedia.org/r/768668 (owner: 10Btullis) [11:41:04] (03CR) 10Vgutierrez: Added config for the datahubsearch LVS service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768668 (owner: 10Btullis) [11:41:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1123.eqiad.wmnet with reason: Maintenance [11:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1123.eqiad.wmnet with reason: Maintenance [11:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P22088 and previous config saved to /var/cache/conftool/dbconfig/20220308-114434-marostegui.json [11:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:01] PROBLEM - Check systemd state on datahubsearch1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-elasticsearch-exporter-9200.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:39] PROBLEM - Check systemd state on datahubsearch1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-elasticsearch-exporter-9200.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:25] !log pool 
cp1083 with HAProxy as TLS termination layer - T290005 [11:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:28] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:49:11] (03CR) 10Btullis: [V: 03+1] Added config for the datahubsearch LVS service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768668 (owner: 10Btullis) [11:50:40] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1083.eqiad.wmnet with OS buster [11:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:53] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1083.eqiad.wmnet with OS buster c... [11:51:19] !log btullis@cumin2002 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. 
[11:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:34] (03PS3) 10Reedy: Use namespaced ApiFeatureUsageQueryEngineElastica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767596 (https://phabricator.wikimedia.org/T302907) [11:53:06] jouncebot: nowandnext [11:53:06] No deployments scheduled for the next 2 hour(s) and 6 minute(s) [11:53:06] In 2 hour(s) and 6 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220308T1400) [11:53:39] (03CR) 10Reedy: [C: 03+2] Use namespaced ApiFeatureUsageQueryEngineElastica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767596 (https://phabricator.wikimedia.org/T302907) (owner: 10Reedy) [11:53:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [11:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [11:54:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: Maintenance [11:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: Maintenance [11:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:29] (03Merged) 10jenkins-bot: Use namespaced ApiFeatureUsageQueryEngineElastica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767596 (https://phabricator.wikimedia.org/T302907) (owner: 10Reedy) [11:55:05] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34122/console" [puppet] - 10https://gerrit.wikimedia.org/r/768668 (owner: 10Btullis) [11:55:57] !log reedy@deploy1002 Synchronized wmf-config/CommonSettings.php: Use namespaced ApiFeatureUsageQueryEngineElastica T302907 (duration: 00m 49s) [11:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:00] T302907: Error: Class 'ApiFeatureUsageQueryEngineElastica' not found - https://phabricator.wikimedia.org/T302907 [11:58:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:35] !log uploaded spicerack_2.2.0 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [11:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:59:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P22089 and previous config saved to /var/cache/conftool/dbconfig/20220308-115938-marostegui.json [11:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:11] (03PS3) 10Btullis: Added config for the datahubsearch LVS service [puppet] - 10https://gerrit.wikimedia.org/r/768668 [12:00:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:14] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34123/console" [puppet] - 10https://gerrit.wikimedia.org/r/768668 (owner: 10Btullis) [12:01:19] (03CR) 10Vgutierrez: Added config for the datahubsearch LVS service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768668 (owner: 10Btullis) [12:04:13] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [12:06:41] (03CR) 10Btullis: [V: 03+1] Added config for the datahubsearch LVS service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768668 (owner: 10Btullis) [12:06:48] (03PS4) 10Btullis: Added config for the datahubsearch LVS service [puppet] - 10https://gerrit.wikimedia.org/r/768668 [12:08:51] PROBLEM - Check systemd state on datahubsearch1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-elasticsearch-exporter-9200.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298294)', diff saved to https://phabricator.wikimedia.org/P22090 and previous config saved to /var/cache/conftool/dbconfig/20220308-121443-marostegui.json [12:14:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:14:47] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [12:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:39] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [12:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [12:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance [12:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance [12:16:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 8 hosts with reason: Maintenance [12:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 8 hosts with reason: Maintenance [12:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P22091 and previous config saved to /var/cache/conftool/dbconfig/20220308-122752-root.json [12:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:51] (PS1) Btullis: Update the linting requirements to allow for local dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/769037 (https://phabricator.wikimedia.org/T301454) [12:32:54] (PS2) Jbond: WIP: Early start on firmware cookbook [cookbooks] - https://gerrit.wikimedia.org/r/763215 
[12:33:26] SRE, SRE-Access-Requests: Request Administrator Access to Google Search Console - https://phabricator.wikimedia.org/T302625 (SCherukuwada) @JMeybohm Given that we didn't actually add each language domain one by one, there should be, a "wikipedia.org" entry listed as a "Domain Property" along with all of... [12:34:26] (CR) Jbond: [C: +2] "LGTm will merge" [puppet] - https://gerrit.wikimedia.org/r/768776 (owner: Majavah) [12:35:48] (CR) Btullis: "I believe that this linting check is causing my deployment charts patch for DataHub to fail because I have used a technique of locally def" [deployment-charts] - https://gerrit.wikimedia.org/r/769037 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [12:42:21] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:23] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:57] !log btullis@cumin2002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons. 
[12:42:58] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1096:3315 (T298294)', diff saved to https://phabricator.wikimedia.org/P22092 and previous config saved to /var/cache/conftool/dbconfig/20220308-124257-marostegui.json [12:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:01] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [12:43:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P22093 and previous config saved to /var/cache/conftool/dbconfig/20220308-124302-root.json [12:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:41] (PS1) Hnowlan: jobqueue: use guaranteed QoS strategy [deployment-charts] - https://gerrit.wikimedia.org/r/769038 (https://phabricator.wikimedia.org/T300914) [12:46:11] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298294)', diff saved to https://phabricator.wikimedia.org/P22094 and previous config saved to /var/cache/conftool/dbconfig/20220308-124610-marostegui.json [12:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:44] !log btullis@cumin2002 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[12:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:18] !log aborrero@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1021.eqiad.wmnet [12:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:21] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudcephosd1021.eqiad.wmnet [12:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:40] !log aborrero@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1021.eqiad.wmnet [12:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:12] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1021.eqiad.wmnet [12:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:13] !log aborrero@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1005.wikimedia.org [12:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P22095 and previous config saved to /var/cache/conftool/dbconfig/20220308-125806-root.json [12:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:47] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P22096 and previous config saved to /var/cache/conftool/dbconfig/20220308-130145-marostegui.json [13:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:19] (CR) JMeybohm: [C: +1] Update the linting requirements to allow for local dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/769037 
(https://phabricator.wikimedia.org/T301454) (owner: Btullis) [13:04:02] (PS10) Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [13:07:07] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1005.wikimedia.org [13:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:15] !log aborrero@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1004.wikimedia.org [13:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:01] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1001.eqiad.wmnet [13:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:33] * arturo high-fives elukey, rebooting partner [13:10:34] (CR) ArielGlenn: [C: +1] "Hannah and I had a look at this together. I'll take your word on the run times, looks good. Thanks for the file sizes update too!" 
[puppet] - https://gerrit.wikimedia.org/r/768032 (https://phabricator.wikimedia.org/T300255) (owner: Hoo man) [13:10:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_navigationtiming_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P22097 and previous config saved to /var/cache/conftool/dbconfig/20220308-131309-root.json [13:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:12] * elukey waves to arturo back :D [13:14:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [13:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [13:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T300775)', diff saved to https://phabricator.wikimedia.org/P22098 and previous config saved to /var/cache/conftool/dbconfig/20220308-131420-marostegui.json [13:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:23] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [13:15:20] (CR) Jelto: [V: +1 C: +2] gitlab_runner: add gitlab-runner to docker group, change folder permissions [puppet] - https://gerrit.wikimedia.org/r/768683 (https://phabricator.wikimedia.org/T295481) (owner: Jelto) [13:16:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1001.eqiad.wmnet 
[13:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:21] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P22099 and previous config saved to /var/cache/conftool/dbconfig/20220308-131720-marostegui.json [13:17:22] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34124/console" [puppet] - https://gerrit.wikimedia.org/r/768743 (https://phabricator.wikimedia.org/T295481) (owner: Jelto) [13:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:44] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1002.eqiad.wmnet [13:17:45] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1004.wikimedia.org [13:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:01] (CR) Jelto: [V: +1 C: +2] systemd::sysuser: create option to add additional groups to user [puppet] - https://gerrit.wikimedia.org/r/768743 (https://phabricator.wikimedia.org/T295481) (owner: Jelto) [13:25:58] (PS1) Arturo Borrero Gonzalez: wikimediacloud.org: temporal failover of primary cloudcontrol server [dns] - https://gerrit.wikimedia.org/r/769043 [13:26:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1002.eqiad.wmnet [13:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:33] (CR) David Caro: [C: +1] wikimediacloud.org: temporal failover of primary cloudcontrol server [dns] - https://gerrit.wikimedia.org/r/769043 (owner: Arturo Borrero Gonzalez) [13:27:35] (CR) Arturo Borrero Gonzalez: [C: +2] wikimediacloud.org: temporal failover of primary 
cloudcontrol server [dns] - https://gerrit.wikimedia.org/r/769043 (owner: Arturo Borrero Gonzalez) [13:31:41] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1003.eqiad.wmnet [13:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:56] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298294)', diff saved to https://phabricator.wikimedia.org/P22100 and previous config saved to /var/cache/conftool/dbconfig/20220308-133255-marostegui.json [13:32:58] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [13:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:00] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [13:33:01] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [13:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:11] (PS11) Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [13:33:36] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1144:3315 (T298294)', diff saved to https://phabricator.wikimedia.org/P22101 and previous config saved to /var/cache/conftool/dbconfig/20220308-133335-marostegui.json [13:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:31] !log btullis@cumin2002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[13:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:48] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298294)', diff saved to https://phabricator.wikimedia.org/P22102 and previous config saved to /var/cache/conftool/dbconfig/20220308-133647-marostegui.json [13:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:39] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1003.eqiad.wmnet [13:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:04] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1004.eqiad.wmnet [13:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:27] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@725f528]: Set wikidata/item_page_link/weekly start date in production [13:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:34] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@725f528]: Set wikidata/item_page_link/weekly start date in production (duration: 00m 07s) [13:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:44] (PS1) Bartosz Dziewoński: Fix logic for finding the oldest comment in a bundle [extensions/DiscussionTools] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/768796 (https://phabricator.wikimedia.org/T302014) [13:41:48] (CR) Btullis: [C: +2] Update the linting requirements to allow for local dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/769037 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [13:41:51] (PS1) Bartosz Dziewoński: Fix logic for finding the oldest comment in a bundle [extensions/DiscussionTools] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/768797 (https://phabricator.wikimedia.org/T302014) 
[13:45:39] (Merged) jenkins-bot: Update the linting requirements to allow for local dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/769037 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [13:46:20] !log dcaro@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1003.wikimedia.org [13:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:32] (CR) jerkins-bot: [V: -1] Fix logic for finding the oldest comment in a bundle [extensions/DiscussionTools] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/768796 (https://phabricator.wikimedia.org/T302014) (owner: Bartosz Dziewoński) [13:48:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1004.eqiad.wmnet [13:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:46] (PS1) Bartosz Dziewoński: Disable failing talk page tests temporarily [skins/MinervaNeue] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/768798 (https://phabricator.wikimedia.org/T302993) [13:49:05] (CR) Bartosz Dziewoński: "Test failure is T302993, and unrelated to this patch." 
[extensions/DiscussionTools] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/768796 (https://phabricator.wikimedia.org/T302014) (owner: Bartosz Dziewoński) [13:49:17] (PS2) Bartosz Dziewoński: Fix logic for finding the oldest comment in a bundle [extensions/DiscussionTools] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/768796 (https://phabricator.wikimedia.org/T302014) [13:52:24] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P22103 and previous config saved to /var/cache/conftool/dbconfig/20220308-135223-marostegui.json [13:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:34] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@d1c8ae0]: Fix wikidata_item_page_link destination table after tests [13:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:41] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@d1c8ae0]: Fix wikidata_item_page_link destination table after tests (duration: 00m 07s) [13:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:47] (PS1) 4nn1l2: fawiki: Add patrolmarks right to autopatrolled group [mediawiki-config] - https://gerrit.wikimedia.org/r/769044 (https://phabricator.wikimedia.org/T303269) [14:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220308T1400). Please do the needful. [14:00:04] MatmaRex and nn1l2: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:09] hi [14:00:10] i can deploy today [14:00:15] hello nn1l2 and MatmaRex [14:00:23] hey [14:00:23] hello [14:00:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:31] (CR) Urbanecm: [C: +2] Fix logic for finding the oldest comment in a bundle [extensions/DiscussionTools] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/768796 (https://phabricator.wikimedia.org/T302014) (owner: Bartosz Dziewoński) [14:01:33] (CR) Urbanecm: [C: +2] Fix logic for finding the oldest comment in a bundle [extensions/DiscussionTools] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/768797 (https://phabricator.wikimedia.org/T302014) (owner: Bartosz Dziewoński) [14:01:59] (CR) Urbanecm: [C: +2] fawiki: Add patrolmarks right to autopatrolled group [mediawiki-config] - https://gerrit.wikimedia.org/r/769044 (https://phabricator.wikimedia.org/T303269) (owner: 4nn1l2) [14:02:37] (PS19) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [14:03:12] (CR) jerkins-bot: [V: -1] Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [14:03:37] (CR) Btullis: [C: +1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/767709 (owner: Muehlenhoff) [14:03:46] (Merged) jenkins-bot: fawiki: Add patrolmarks right to autopatrolled group [mediawiki-config] - https://gerrit.wikimedia.org/r/769044 (https://phabricator.wikimedia.org/T303269) (owner: 4nn1l2) [14:04:04] nn1l2: pulled to mwdebug1001, can you have a look? 
[14:04:08] ok [14:04:24] LGTM [14:04:28] syncing [14:04:29] urbanecm: my patch for wmf.24 also requires https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/768798 , otherwise the tests are failing for unrelated reasons. i updated the table [14:04:41] oh, okay, good to know [14:04:53] (CR) Urbanecm: [C: +2] Disable failing talk page tests temporarily [skins/MinervaNeue] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/768798 (https://phabricator.wikimedia.org/T302993) (owner: Bartosz Dziewoński) [14:05:07] (CR) Urbanecm: [C: +2] Fix logic for finding the oldest comment in a bundle [extensions/DiscussionTools] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/768796 (https://phabricator.wikimedia.org/T302014) (owner: Bartosz Dziewoński) [14:06:03] (Merged) jenkins-bot: Fix logic for finding the oldest comment in a bundle [extensions/DiscussionTools] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/768797 (https://phabricator.wikimedia.org/T302014) (owner: Bartosz Dziewoński) [14:06:33] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 75465dd0288b998ee6d4668e87e57b0d7961471a: fawiki: Add patrolmarks right to autopatrolled group (T303269) (duration: 00m 49s) [14:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:37] T303269: Add patrolmarks right to autopatrolled group on Farsi Wikipedia - https://phabricator.wikimedia.org/T303269 [14:06:44] !log dcaro@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1003.wikimedia.org [14:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:28] nn1l2: should be live [14:07:38] Thanks! 
[14:07:59] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P22104 and previous config saved to /var/cache/conftool/dbconfig/20220308-140758-marostegui.json [14:08:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:03] MatmaRex: still waiting for wmf.24 merge, will ping you once it's ready for testing [14:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:08] wmf.25 is not anywhere yet, so can't be tested [14:08:12] sure, thanks [14:09:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:45] (PS1) Jbond: P:spicerack::reposync: add netbox frontends as remotes [puppet] - https://gerrit.wikimedia.org/r/769046 (https://phabricator.wikimedia.org/T229397) [14:15:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:15:43] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34125/console" [puppet] - https://gerrit.wikimedia.org/r/769046 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [14:15:44] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:01] (PS3) Btullis: Add a record for datahubsearch service [dns] - https://gerrit.wikimedia.org/r/768663 (https://phabricator.wikimedia.org/T301458) [14:17:16] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34126/console" [puppet] - https://gerrit.wikimedia.org/r/769046 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [14:18:31] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34127/console" [puppet] - https://gerrit.wikimedia.org/r/769046 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [14:20:51] (Merged) jenkins-bot: Disable failing talk page tests temporarily [skins/MinervaNeue] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/768798 (https://phabricator.wikimedia.org/T302993) (owner: Bartosz Dziewoński) [14:20:54] (Merged) jenkins-bot: Fix logic for finding the oldest comment in a bundle [extensions/DiscussionTools] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/768796 (https://phabricator.wikimedia.org/T302014) (owner: Bartosz Dziewoński) [14:21:05] (CR) Jbond: [V: +1 C: +2] P:spicerack::reposync: add netbox frontends as remotes [puppet] - https://gerrit.wikimedia.org/r/769046 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [14:22:20] SRE, SRE-OnFire (FY2021/2022-Q3), Infrastructure-Foundations, SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (CDanis) After some more experience with actually updating the page during incidents... 
[14:23:34] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298294)', diff saved to https://phabricator.wikimedia.org/P22107 and previous config saved to /var/cache/conftool/dbconfig/20220308-142332-marostegui.json [14:23:35] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1110.eqiad.wmnet with reason: Maintenance [14:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:37] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [14:23:38] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1110.eqiad.wmnet with reason: Maintenance [14:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:13] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1110 (T298294)', diff saved to https://phabricator.wikimedia.org/P22108 and previous config saved to /var/cache/conftool/dbconfig/20220308-142412-marostegui.json [14:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:22] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298294)', diff saved to https://phabricator.wikimedia.org/P22109 and previous config saved to /var/cache/conftool/dbconfig/20220308-142721-marostegui.json [14:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:27:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:27:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:18] i see it merged a while back [14:28:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:58] MatmaRex: pulled to mwdebug1001, can you check? [14:29:08] looking [14:30:07] urbanecm: looks good on enwiki [14:30:12] so, let's sync? [14:30:17] yeah. thanks [14:30:20] syncing [14:31:53] (PS1) Jbond: P:netbox::automation: use srv/reposyn [puppet] - https://gerrit.wikimedia.org/r/769049 (https://phabricator.wikimedia.org/T229397) [14:32:04] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.24/extensions/DiscussionTools/includes/Notifications/DiscussionToolsEventTrait.php: 23939c7: Fix logic for finding the oldest comment in a bundle (T302014) (duration: 00m 50s) [14:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:07] T302014: Highlight all new comments since a specific timestamp when opening a new comment notification - https://phabricator.wikimedia.org/T302014 [14:32:27] (CR) jerkins-bot: [V: -1] P:netbox::automation: use srv/reposyn [puppet] - https://gerrit.wikimedia.org/r/769049 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [14:32:32] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34128/console" [puppet] - https://gerrit.wikimedia.org/r/769049 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [14:32:35] MatmaRex: should be live! [14:32:37] anything else? 
[14:32:53] that's all, thanks [14:33:38] no problem [14:33:46] !log UTC afternoon B&C window done [14:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:05] (PS2) Jbond: P:netbox::automation: use srv/reposyn [puppet] - https://gerrit.wikimedia.org/r/769049 (https://phabricator.wikimedia.org/T229397) [14:34:41] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34130/console" [puppet] - https://gerrit.wikimedia.org/r/769049 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [14:35:06] SRE, serviceops: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (Kormat) [14:35:23] (CR) jerkins-bot: [V: -1] P:netbox::automation: use srv/reposyn [puppet] - https://gerrit.wikimedia.org/r/769049 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [14:35:30] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34131/console" [puppet] - https://gerrit.wikimedia.org/r/769049 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [14:35:33] (PS3) Kormat: mariadb: Reference the actual OTRS passwords in the m2 grants file. 
[puppet] - https://gerrit.wikimedia.org/r/764744 (https://phabricator.wikimedia.org/T303272) [14:39:27] (PS3) Jbond: P:netbox::automation: use srv/reposyn [puppet] - https://gerrit.wikimedia.org/r/769049 (https://phabricator.wikimedia.org/T229397) [14:41:45] (PS1) Btullis: Update helm linting again to allow local dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/769050 (https://phabricator.wikimedia.org/T301454) [14:42:58] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P22110 and previous config saved to /var/cache/conftool/dbconfig/20220308-144256-marostegui.json [14:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:47] (PS5) Andrew Bogott: Add templates for OpenStack Wallaby [puppet] - https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275) [14:45:52] (PS2) Btullis: Update helm linting again to allow local dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/769050 (https://phabricator.wikimedia.org/T301454) [14:46:09] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/769049 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [14:46:50] (CR) jerkins-bot: [V: -1] Add templates for OpenStack Wallaby [puppet] - https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275) (owner: Andrew Bogott) [14:47:38] (PS6) Andrew Bogott: Add templates for OpenStack Wallaby [puppet] - https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275) [14:50:04] (CR) jerkins-bot: [V: -1] Add templates for OpenStack Wallaby [puppet] - https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275) (owner: Andrew Bogott) [14:50:27] (CR) Btullis: [C: +2] Update helm linting again to allow local dependencies [deployment-charts] - 
10https://gerrit.wikimedia.org/r/769050 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [14:50:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [14:52:26] (03PS7) 10Andrew Bogott: Add templates for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275) [14:54:00] (03PS8) 10Andrew Bogott: Add templates for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275) [14:54:02] (03PS1) 10Andrew Bogott: Add files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/769051 (https://phabricator.wikimedia.org/T281275) [14:54:18] (03Merged) 10jenkins-bot: Update helm linting again to allow local dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/769050 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [14:54:35] (03PS20) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [14:55:33] (03PS4) 10Jbond: P:netbox::automation: use /srv/reposync [puppet] - 10https://gerrit.wikimedia.org/r/769049 (https://phabricator.wikimedia.org/T229397) [14:55:48] (03CR) 10Btullis: [C: 03+2] Added config for the datahubsearch LVS service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768668 (owner: 10Btullis) [14:55:55] (03CR) 10jerkins-bot: [V: 04-1] Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [14:56:08] (03CR) 10Btullis: [C: 03+2] Added config for the datahubsearch LVS service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768668 (owner: 10Btullis) 
[14:56:20] (03CR) 10jerkins-bot: [V: 04-1] Add files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/769051 (https://phabricator.wikimedia.org/T281275) (owner: 10Andrew Bogott) [14:57:11] (03CR) 10Andrew Bogott: [C: 03+2] Add templates for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275) (owner: 10Andrew Bogott) [14:57:23] (03PS9) 10Andrew Bogott: Add templates for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768829 (https://phabricator.wikimedia.org/T281275) [14:57:25] (03CR) 10Jbond: [C: 03+2] P:netbox::automation: use /srv/reposync [puppet] - 10https://gerrit.wikimedia.org/r/769049 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [14:57:36] (03CR) 10Jbond: [C: 03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/769049 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [14:58:32] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P22111 and previous config saved to /var/cache/conftool/dbconfig/20220308-145831-marostegui.json [14:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:57] (03PS2) 10Andrew Bogott: Add files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/769051 (https://phabricator.wikimedia.org/T281275) [15:03:00] (03CR) 10jerkins-bot: [V: 04-1] Add files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/769051 (https://phabricator.wikimedia.org/T281275) (owner: 10Andrew Bogott) [15:03:11] (03PS1) 10Majavah: policies/cr-labs: Add firewall rules for clouddb2001-dev return flows [homer/public] - 10https://gerrit.wikimedia.org/r/769052 (https://phabricator.wikimedia.org/T303248) [15:03:47] 10SRE, 10ops-esams, 10DC-Ops: ripe-atlas-esams down - https://phabricator.wikimedia.org/T303242 (10RobH) Rebooted the atlas via the PDU, please check and let me know if it comes back! 
If not we can look at replacing it. [15:04:01] (03PS1) 10Kormat: admin: [kormat] Change history handling. [puppet] - 10https://gerrit.wikimedia.org/r/769053 [15:05:14] (03PS3) 10Andrew Bogott: Add some files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/769051 (https://phabricator.wikimedia.org/T281275) [15:05:48] (03CR) 10Kormat: [C: 03+2] admin: [kormat] Change history handling. [puppet] - 10https://gerrit.wikimedia.org/r/769053 (owner: 10Kormat) [15:07:17] (03CR) 10jerkins-bot: [V: 04-1] Add some files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/769051 (https://phabricator.wikimedia.org/T281275) (owner: 10Andrew Bogott) [15:09:05] 10SRE, 10ops-esams, 10DC-Ops: ripe-atlas-esams down - https://phabricator.wikimedia.org/T303242 (10RobH) Watching on icinga, it hasn't come back, so something is wrong with it. Next steps: * full power removal via PDU command for 30 seconds * connect via scs and see if it has output on power return [15:10:21] (03PS4) 10Andrew Bogott: Add some files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/769051 (https://phabricator.wikimedia.org/T281275) [15:11:34] (03CR) 10Andrew Bogott: [C: 03+2] Add some files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/769051 (https://phabricator.wikimedia.org/T281275) (owner: 10Andrew Bogott) [15:14:07] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298294)', diff saved to https://phabricator.wikimedia.org/P22112 and previous config saved to /var/cache/conftool/dbconfig/20220308-151406-marostegui.json [15:14:09] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1100.eqiad.wmnet with reason: Maintenance [15:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:12] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1100.eqiad.wmnet with reason: Maintenance [15:14:12] T298294: 
Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [15:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:48] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1100 (T298294)', diff saved to https://phabricator.wikimedia.org/P22113 and previous config saved to /var/cache/conftool/dbconfig/20220308-151446-marostegui.json [15:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:55] (03PS4) 10Kormat: mariadb: Reference the actual VRTS passwords in the m2 grants file. [puppet] - 10https://gerrit.wikimedia.org/r/764744 (https://phabricator.wikimedia.org/T303272) [15:15:08] (03CR) 10Elukey: [C: 03+2] calico,cfssl-issuer,knative-serving: fix dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/768681 (owner: 10Elukey) [15:15:38] (03PS6) 10Andrew Bogott: OpenStack: add manifests for openstack wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768830 (https://phabricator.wikimedia.org/T281275) [15:15:40] (03PS2) 10Andrew Bogott: Update hacked nova/api/openstack/compute/servers.py for Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768852 (https://phabricator.wikimedia.org/T281275) [15:15:42] (03PS2) 10Andrew Bogott: Update trove/instance/models.py for wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768853 (https://phabricator.wikimedia.org/T281275) [15:15:49] (03PS2) 10Andrew Bogott: Update trove/instance/models.py for wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768854 (https://phabricator.wikimedia.org/T281275) [15:15:53] (03PS1) 10Andrew Bogott: Add more files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/769054 (https://phabricator.wikimedia.org/T281275) [15:16:25] (03CR) 10jerkins-bot: [V: 04-1] OpenStack: add manifests for openstack wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768830 
(https://phabricator.wikimedia.org/T281275) (owner: 10Andrew Bogott) [15:16:52] (03PS2) 10Hnowlan: jobqueue: use guaranteed QoS strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/769038 (https://phabricator.wikimedia.org/T300914) [15:17:32] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T298294)', diff saved to https://phabricator.wikimedia.org/P22114 and previous config saved to /var/cache/conftool/dbconfig/20220308-151731-marostegui.json [15:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:54] (03CR) 10JMeybohm: [C: 03+1] jobqueue: use guaranteed QoS strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/769038 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [15:19:55] (03PS1) 10Jbond: C:reposync: Add ability to control user and group permissions [puppet] - 10https://gerrit.wikimedia.org/r/769055 (https://phabricator.wikimedia.org/T229397) [15:19:57] (03CR) 10jerkins-bot: [V: 04-1] Add more files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/769054 (https://phabricator.wikimedia.org/T281275) (owner: 10Andrew Bogott) [15:20:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34132/console" [puppet] - 10https://gerrit.wikimedia.org/r/769055 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [15:20:52] (03PS1) 10Majavah: Revert "smart: Fix quotes in tests" [puppet] - 10https://gerrit.wikimedia.org/r/768799 [15:23:19] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:reposync: Add ability to control user and group permissions [puppet] - 10https://gerrit.wikimedia.org/r/769055 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [15:23:21] (03CR) 10Hnowlan: [C: 03+2] jobqueue: use guaranteed QoS strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/769038 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [15:25:46] (03PS1) 10Elukey: 
Revert "calico,cfssl-issuer,knative-serving: fix dependencies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/768800 [15:27:20] (03Merged) 10jenkins-bot: jobqueue: use guaranteed QoS strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/769038 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [15:29:06] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [15:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:54] (03CR) 10Elukey: [C: 03+2] Revert "calico,cfssl-issuer,knative-serving: fix dependencies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/768800 (owner: 10Elukey) [15:31:41] (03PS1) 10Jbond: P:netbox::automation: Correct reposync path [puppet] - 10https://gerrit.wikimedia.org/r/769057 [15:32:59] (03CR) 10jerkins-bot: [V: 04-1] P:netbox::automation: Correct reposync path [puppet] - 10https://gerrit.wikimedia.org/r/769057 (owner: 10Jbond) [15:33:07] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P22115 and previous config saved to /var/cache/conftool/dbconfig/20220308-153306-marostegui.json [15:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34134/console" [puppet] - 10https://gerrit.wikimedia.org/r/769057 (owner: 10Jbond) [15:33:56] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:16] (03PS2) 10Jbond: P:netbox::automation: Correct reposync path [puppet] - 10https://gerrit.wikimedia.org/r/769057 (https://phabricator.wikimedia.org/T229397) [15:35:19] (03CR) 10Jbond: [C: 03+2] P:netbox::automation: Correct reposync path [puppet] - 10https://gerrit.wikimedia.org/r/769057 
(https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [15:36:19] (03CR) 10Hashar: [C: 04-1] "I have a patch for it ;)" [puppet] - 10https://gerrit.wikimedia.org/r/768799 (owner: 10Majavah) [15:37:52] (03CR) 10Marostegui: [C: 03+1] mariadb: Reference the actual VRTS passwords in the m2 grants file. [puppet] - 10https://gerrit.wikimedia.org/r/764744 (https://phabricator.wikimedia.org/T303272) (owner: 10Kormat) [15:38:46] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:39:32] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [15:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:56] (03PS1) 10Hashar: smart: use a regex to check command error [puppet] - 10https://gerrit.wikimedia.org/r/769060 [15:39:58] (03PS1) 10Hashar: smart: set LC_MESSAGES when testing output [puppet] - 10https://gerrit.wikimedia.org/r/769061 [15:40:39] (03PS1) 10Elukey: calico,cfssl-issuer,knative-serving: bump chart's version [deployment-charts] - 10https://gerrit.wikimedia.org/r/769062 (https://phabricator.wikimedia.org/T303279) [15:40:39] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [15:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:44] (03CR) 10Andrew Bogott: [C: 03+1] smart: use a regex to check command error [puppet] - 10https://gerrit.wikimedia.org/r/769060 (owner: 10Hashar) [15:41:04] (03Abandoned) 10Hashar: Revert "smart: Fix quotes in tests" [puppet] - 10https://gerrit.wikimedia.org/r/768799 (owner: 10Majavah) [15:41:25] (03CR) 10Andrew Bogott: [C: 03+1] smart: set LC_MESSAGES when testing output [puppet] - 10https://gerrit.wikimedia.org/r/769061 (owner: 10Hashar) [15:41:51] (03PS1) 10Volans: alertmanager: catch already deleted silence [software/spicerack] - 
10https://gerrit.wikimedia.org/r/769063 (https://phabricator.wikimedia.org/T293209) [15:42:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P22116 and previous config saved to /var/cache/conftool/dbconfig/20220308-154232-root.json [15:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:55] !log update capirca hosts definitions [15:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1181', diff saved to https://phabricator.wikimedia.org/P22117 and previous config saved to /var/cache/conftool/dbconfig/20220308-154507-marostegui.json [15:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:41] (03PS1) 10Jelto: gitlab_runner: add dedicated service unit file [puppet] - 10https://gerrit.wikimedia.org/r/769065 (https://phabricator.wikimedia.org/T295481) [15:46:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Jclark-ctr) @Cmjohnson These are using sfp-t adapter and are only 1g name rack Unit Port CableID ms-be1068 e2 25u 25 2013339101799 ms-be1069 e2 25u 25... [15:46:35] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/769063 (https://phabricator.wikimedia.org/T293209) (owner: 10Volans) [15:47:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [15:48:20] 10SRE, 10ops-eqiad: 8 x SMF Patches between cages Eqiad - LVS & WMCS - https://phabricator.wikimedia.org/T301419 (10Jclark-ctr) These runs have been completed and netbox has been updated with all cableids [15:48:28] 10SRE, 10ops-eqiad: 8 x SMF Patches between cages Eqiad - LVS & WMCS - https://phabricator.wikimedia.org/T301419 (10Jclark-ctr) 05Open→03Resolved [15:48:42] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P22118 and previous config saved to /var/cache/conftool/dbconfig/20220308-154841-marostegui.json [15:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:02] RECOVERY - Check systemd state on mx1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:03] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [15:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:29] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34135/console" [puppet] - 10https://gerrit.wikimedia.org/r/769065 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:51:53] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [15:52:17] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34136/console" [puppet] - 10https://gerrit.wikimedia.org/r/764744 (https://phabricator.wikimedia.org/T303272) (owner: 10Kormat) [15:53:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P22119 and previous config saved to /var/cache/conftool/dbconfig/20220308-155312-root.json [15:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:46] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [15:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:23] (03PS2) 10Jelto: gitlab_runner: add dedicated service unit file [puppet] - 10https://gerrit.wikimedia.org/r/769065 (https://phabricator.wikimedia.org/T295481) [15:54:35] (03CR) 10Elukey: [C: 03+2] calico,cfssl-issuer,knative-serving: bump chart's version [deployment-charts] - 10https://gerrit.wikimedia.org/r/769062 (https://phabricator.wikimedia.org/T303279) (owner: 10Elukey) [15:55:30] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [15:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:31] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/datahubsearch on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/datahubsearch https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:56:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] calico,cfssl-issuer,knative-serving: bump chart's version [deployment-charts] - 10https://gerrit.wikimedia.org/r/769062 (https://phabricator.wikimedia.org/T303279) (owner: 10Elukey) [15:58:48] (03PS1) 10Volans: sre.hosts.downtime: conver to use the new alerting [cookbooks] - 10https://gerrit.wikimedia.org/r/769067 (https://phabricator.wikimedia.org/T293209) [15:59:57] 10SRE, 10ops-eqiad, 10DC-Ops: 
Q3:(Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [16:00:07] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:Row E/F temp/humid probe installation - https://phabricator.wikimedia.org/T296424 (10Jclark-ctr) Racks e1-4 and f1-4 have been installed [16:00:23] (03CR) 10Volans: [C: 03+2] alertmanager: catch already deleted silence [software/spicerack] - 10https://gerrit.wikimedia.org/r/769063 (https://phabricator.wikimedia.org/T293209) (owner: 10Volans) [16:01:46] (03CR) 10Kormat: [V: 03+1 C: 04-2] "Blocked by https://phabricator.wikimedia.org/T303272 for now" [puppet] - 10https://gerrit.wikimedia.org/r/764744 (https://phabricator.wikimedia.org/T303272) (owner: 10Kormat) [16:02:10] !log bking@deneb manually installed openjdk-11-jdk for T293862 . moritzm will add puppet patch for this [16:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:14] T293862: Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead - https://phabricator.wikimedia.org/T293862 [16:04:18] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T298294)', diff saved to https://phabricator.wikimedia.org/P22120 and previous config saved to /var/cache/conftool/dbconfig/20220308-160416-marostegui.json [16:04:20] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance [16:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:21] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [16:04:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Jclark-ctr) @BTullis Are we able to rack these in new cage Row E and F [16:04:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:22] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance [16:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:24] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 16:00:00 on 8 hosts with reason: Maintenance [16:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:36] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 8 hosts with reason: Maintenance [16:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:05] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [16:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:08] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance [16:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:43] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1113:3315 (T298294)', diff saved to https://phabricator.wikimedia.org/P22121 and previous config saved to /var/cache/conftool/dbconfig/20220308-160542-marostegui.json [16:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:57] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [16:06:24] (03Merged) 10jenkins-bot: alertmanager: catch already deleted silence [software/spicerack] - 10https://gerrit.wikimedia.org/r/769063 (https://phabricator.wikimedia.org/T293209) (owner: 10Volans) [16:06:27] (03CR) 10Jelto: [V: 
03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34138/console" [puppet] - 10https://gerrit.wikimedia.org/r/769065 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:07:52] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298294)', diff saved to https://phabricator.wikimedia.org/P22122 and previous config saved to /var/cache/conftool/dbconfig/20220308-160751-marostegui.json [16:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P22123 and previous config saved to /var/cache/conftool/dbconfig/20220308-160815-root.json [16:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:02] (03PS1) 10Majavah: P:wmcs::prometheus: set protocol for https scrapes [puppet] - 10https://gerrit.wikimedia.org/r/769069 [16:10:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:wmcs::prometheus: set protocol for https scrapes [puppet] - 10https://gerrit.wikimedia.org/r/769069 (owner: 10Majavah) [16:22:07] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/datahubsearch on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/datahubsearch https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:23:27] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P22124 and previous config saved to /var/cache/conftool/dbconfig/20220308-162326-marostegui.json [16:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P22125 and previous config saved to 
/var/cache/conftool/dbconfig/20220308-162331-root.json [16:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:46] (03PS1) 10Jbond: C:puppetmaster: clean up and fix up docuemtnation [puppet] - 10https://gerrit.wikimedia.org/r/769072 [16:25:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10BTullis) Hi @Jclark-ctr, yes that would be fine. Many thanks. [16:25:54] (03CR) 10Klausman: [C: 03+2] ml-services: add lvwiki, nlwiki & nowiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/766565 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [16:26:26] (03CR) 10Ayounsi: [C: 03+2] policies/cr-labs: Include cloudbackup-dev hosts [homer/public] - 10https://gerrit.wikimedia.org/r/767487 (owner: 10Majavah) [16:28:38] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/769060 (owner: 10Hashar) [16:29:28] (03CR) 10Cwhite: [C: 03+1] "Good catch, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/769061 (owner: 10Hashar) [16:29:43] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, just a nit" [cookbooks] - 10https://gerrit.wikimedia.org/r/769067 (https://phabricator.wikimedia.org/T293209) (owner: 10Volans) [16:30:18] (03Merged) 10jenkins-bot: ml-services: add lvwiki, nlwiki & nowiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/766565 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [16:30:44] btullis, vgutierrez: FYI ^^^ (16:22:07 UTC) Confd template missing on puppetmaster seems related to the datahubsearch changes [16:32:45] (03CR) 10Ayounsi: policies/cr-labs: Add firewall rules for clouddb2001-dev return flows (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/769052 (https://phabricator.wikimedia.org/T303248) (owner: 10Majavah) [16:32:47] (03PS2) 10Volans: sre.hosts.downtime: conver to use the new alerting [cookbooks] - 10https://gerrit.wikimedia.org/r/769067 (https://phabricator.wikimedia.org/T293209) [16:33:03] (03CR) 10Cwhite: [C: 03+2] smart: use a regex to check command error [puppet] - 10https://gerrit.wikimedia.org/r/769060 (owner: 10Hashar) [16:33:17] (03CR) 10Cwhite: [C: 03+2] smart: set LC_MESSAGES when testing output [puppet] - 10https://gerrit.wikimedia.org/r/769061 (owner: 10Hashar) [16:33:19] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/769067 (https://phabricator.wikimedia.org/T293209) (owner: 10Volans) [16:33:37] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:34:33] !log rzl@apt1001:~$ sudo -i reprepro -C main includedeb buster-wikimedia /home/rzl/envoyproxy_1.18.3-1_amd64.deb # reimporting from component/envoy-future into main, for T300324 [16:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:38] T300324: Upgrade Envoy to supported version - 
https://phabricator.wikimedia.org/T300324 [16:36:10] (03CR) 10Filippo Giunchedi: [C: 03+1] sre.hosts.downtime: conver to use the new alerting [cookbooks] - 10https://gerrit.wikimedia.org/r/769067 (https://phabricator.wikimedia.org/T293209) (owner: 10Volans) [16:36:55] 10SRE, 10ops-esams, 10DC-Ops: ripe-atlas-esams down - https://phabricator.wikimedia.org/T303242 (10RobH) Ok, logged into scs-oe16-esams, port 12 and powered off the atlas via PDU, then after 30 seconds, powered it back up. There is no output on scs for that port. When troubleshooting this with remote hands... [16:37:35] !log bking@deneb manually installed tox for T293862 . moritzm will add puppet patch for this [16:37:36] T293862: Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead - https://phabricator.wikimedia.org/T293862 [16:38:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P22126 and previous config saved to /var/cache/conftool/dbconfig/20220308-163835-root.json [16:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:02] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P22127 and previous config saved to /var/cache/conftool/dbconfig/20220308-163901-marostegui.json [16:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:10] 10SRE, 10ops-esams, 10DC-Ops: ripe-atlas-esams down - https://phabricator.wikimedia.org/T303242 (10ayounsi) > confirm the atlas is plugged into ps2-oe16, port BA36 (36th port on the ps2 tower) You can probably check if the switch port goes down (or went down) when the power outlet is off. > confirm scs is p... 
[16:39:21] (03PS2) 10Jbond: C:puppetmaster: clean up and fix up docuemtnation [puppet] - 10https://gerrit.wikimedia.org/r/769072 [16:39:52] (03CR) 10jerkins-bot: [V: 04-1] C:puppetmaster: clean up and fix up docuemtnation [puppet] - 10https://gerrit.wikimedia.org/r/769072 (owner: 10Jbond) [16:40:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34140/console" [puppet] - 10https://gerrit.wikimedia.org/r/769072 (owner: 10Jbond) [16:40:51] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Milimetric) > Perhaps a way forward would be to find a way to serve those use cases by design instead of by accident. +1,... [16:45:06] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/769054 (https://phabricator.wikimedia.org/T281275) (owner: 10Andrew Bogott) [16:45:53] (03PS1) 10DCausse: flink-session-cluster: add thanos S3 config [deployment-charts] - 10https://gerrit.wikimedia.org/r/769075 (https://phabricator.wikimedia.org/T302494) [16:46:00] (03PS19) 10Razzi: elasticsearch: load config from yaml [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: 10Ryan Kemper) [16:46:23] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye [16:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:33] (03PS2) 10Majavah: policies/cr-labs: Add firewall rules for clouddb2001-dev return flows [homer/public] - 10https://gerrit.wikimedia.org/r/769052 (https://phabricator.wikimedia.org/T303248) [16:47:09] inflatador: your !log wasn't recorded because of the leading space [16:47:15] (03CR) 10Majavah: policies/cr-labs: Add firewall rules for clouddb2001-dev return flows (031 comment) [homer/public] - 
10https://gerrit.wikimedia.org/r/769052 (https://phabricator.wikimedia.org/T303248) (owner: 10Majavah) [16:48:25] (03CR) 10Andrew Bogott: [C: 03+2] Add more files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/769054 (https://phabricator.wikimedia.org/T281275) (owner: 10Andrew Bogott) [16:50:14] (03PS1) 10Majavah: P:wmcs::prometheus: drop blackbox checks for services in the cloud realm [puppet] - 10https://gerrit.wikimedia.org/r/769077 [16:52:13] (03CR) 10Ayounsi: [C: 03+2] policies/cr-labs: Add firewall rules for clouddb2001-dev return flows [homer/public] - 10https://gerrit.wikimedia.org/r/769052 (https://phabricator.wikimedia.org/T303248) (owner: 10Majavah) [16:52:22] (03PS7) 10Andrew Bogott: OpenStack: add manifests for openstack wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768830 (https://phabricator.wikimedia.org/T281275) [16:52:24] (03PS3) 10Andrew Bogott: Update hacked nova/api/openstack/compute/servers.py for Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768852 (https://phabricator.wikimedia.org/T281275) [16:52:26] (03PS3) 10Andrew Bogott: Update trove/instance/models.py for wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768853 (https://phabricator.wikimedia.org/T281275) [16:52:28] (03PS3) 10Andrew Bogott: Update trove/instance/models.py for wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768854 (https://phabricator.wikimedia.org/T281275) [16:52:43] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: load config from yaml [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: 10Ryan Kemper) [16:52:47] (03Merged) 10jenkins-bot: policies/cr-labs: Add firewall rules for clouddb2001-dev return flows [homer/public] - 10https://gerrit.wikimedia.org/r/769052 (https://phabricator.wikimedia.org/T303248) (owner: 10Majavah) [16:53:47] Thanks arturo [16:53:49] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2121.codfw.wmnet with 
reason: Maintenance [16:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:52] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [16:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:53] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on 10 hosts with reason: Maintenance [16:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:57] !log bking@deneb manually installed tox for T293862 . moritzm will add puppet patch for this [16:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:00] T293862: Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead - https://phabricator.wikimedia.org/T293862 [16:54:08] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on 10 hosts with reason: Maintenance [16:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:16] (03PS20) 10Razzi: elasticsearch: load config from yaml [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: 10Ryan Kemper) [16:54:20] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1016.eqiad.wmnet with OS bullseye [16:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:37] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298294)', diff saved to https://phabricator.wikimedia.org/P22128 and previous config saved to /var/cache/conftool/dbconfig/20220308-165436-marostegui.json [16:54:38] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:54:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:40] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [16:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:41] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:52] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye [16:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:23] (03PS1) 10Ayounsi: Fix typo mysql -> mySQL [homer/public] - 10https://gerrit.wikimedia.org/r/769078 [16:55:56] (03Abandoned) 10Andrew Bogott: Add parking domain for wmfcloud.org [dns] - 10https://gerrit.wikimedia.org/r/762075 (https://phabricator.wikimedia.org/T301592) (owner: 10Andrew Bogott) [16:56:06] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack: add manifests for openstack wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768830 (https://phabricator.wikimedia.org/T281275) (owner: 10Andrew Bogott) [16:56:18] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:21] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:48] (03CR) 10Ayounsi: [C: 03+2] Fix typo mysql -> mySQL [homer/public] - 10https://gerrit.wikimedia.org/r/769078 (owner: 10Ayounsi) [16:57:04] (03CR) 10David Caro: [C: 03+1] P:wmcs::prometheus: drop blackbox checks for services in the cloud realm [puppet] - 10https://gerrit.wikimedia.org/r/769077 (owner: 
10Majavah) [16:57:27] (03CR) 10Andrew Bogott: [C: 03+2] Update hacked nova/api/openstack/compute/servers.py for Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768852 (https://phabricator.wikimedia.org/T281275) (owner: 10Andrew Bogott) [16:57:40] (03CR) 10Andrew Bogott: [C: 03+2] Update trove/instance/models.py for wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768853 (https://phabricator.wikimedia.org/T281275) (owner: 10Andrew Bogott) [16:57:53] (03CR) 10Andrew Bogott: [C: 03+2] Update trove/instance/models.py for wallaby [puppet] - 10https://gerrit.wikimedia.org/r/768854 (https://phabricator.wikimedia.org/T281275) (owner: 10Andrew Bogott) [16:57:58] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:57:59] (03CR) 10David Caro: [C: 03+2] P:wmcs::prometheus: drop blackbox checks for services in the cloud realm [puppet] - 10https://gerrit.wikimedia.org/r/769077 (owner: 10Majavah) [16:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:01] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:02] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:09] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:44] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1161 (T298294)', diff saved to https://phabricator.wikimedia.org/P22129 and previous config saved 
to /var/cache/conftool/dbconfig/20220308-165843-marostegui.json [16:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220308T1700). [17:00:05] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:18] hmm [17:00:19] btullis around? https://phabricator.wikimedia.org/T303151 [17:00:26] dancy: 👋 [17:00:51] Hey rzl. Looks like I put an entry in the wrong section. My change was merged last week. [17:01:02] oh that's the- yep cool [17:01:07] anything to merge this time around? [17:01:14] Nope! [17:01:24] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:01:34] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1016.eqiad.wmnet with OS bullseye [17:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:48] 👍 [17:01:54] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298294)', diff saved to https://phabricator.wikimedia.org/P22130 and previous config saved to /var/cache/conftool/dbconfig/20220308-170153-marostegui.json [17:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:57] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [17:02:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34146/console" [puppet] - 10https://gerrit.wikimedia.org/r/769072 (owner: 10Jbond) [17:02:56] (03CR) 10jerkins-bot: [V: 04-1] 
elasticsearch: load config from yaml [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: 10Ryan Kemper) [17:03:01] (03PS3) 10Jbond: C:puppetmaster: clean up and fix up docuemtnation [puppet] - 10https://gerrit.wikimedia.org/r/769072 [17:04:42] 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1016 fails to PXE boot - https://phabricator.wikimedia.org/T303296 (10Andrew) [17:04:46] (03CR) 10Brennen Bearnes: [C: 03+1] gitlab_runner: add dedicated service unit file [puppet] - 10https://gerrit.wikimedia.org/r/769065 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [17:05:05] (03PS4) 10Jbond: C:puppetmaster: clean up and fix up docuemtnation [puppet] - 10https://gerrit.wikimedia.org/r/769072 [17:05:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34147/console" [puppet] - 10https://gerrit.wikimedia.org/r/769072 (owner: 10Jbond) [17:05:42] (03PS21) 10Razzi: elasticsearch: load config from yaml [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: 10Ryan Kemper) [17:06:59] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/datahubsearch on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/datahubsearch Btullis T301458 Investigating. 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [17:07:45] !log deploy minor clean up of puppetmaster classes gerrit:769072 [17:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:48] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:puppetmaster: clean up and fix up docuemtnation [puppet] - 10https://gerrit.wikimedia.org/r/769072 (owner: 10Jbond) [17:12:40] (03PS1) 10PipelineBot: rdf-streaming-updater: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/769081 [17:13:23] rzl, I would have a patch when you still have time [17:17:29] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P22131 and previous config saved to /var/cache/conftool/dbconfig/20220308-171728-marostegui.json [17:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:47] (03CR) 10Razzi: "Thanks for the review Volans. Let me know how this looks." [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: 10Ryan Kemper) [17:18:45] (03PS4) 10Razzi: elasticsearch: move cluster configuration to puppet [puppet] - 10https://gerrit.wikimedia.org/r/768816 (https://phabricator.wikimedia.org/T278378) [17:20:47] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10AlexisJazz) >>! In T302699#7756446, @Vgutierrez wrote: > in this case a 502 is emitted by ats-backend cause it is... 
[17:23:36] (03PS2) 10DCausse: flink-session-cluster: add thanos S3 config [deployment-charts] - 10https://gerrit.wikimedia.org/r/769075 (https://phabricator.wikimedia.org/T302494) [17:23:50] (03Abandoned) 10DCausse: rdf-streaming-updater: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/769081 (owner: 10PipelineBot) [17:25:09] (03CR) 10Volans: [C: 03+1] "LGTM, reply to the open question inline. No blockers." [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: 10Ryan Kemper) [17:27:29] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1017.eqiad.wmnet with OS bullseye [17:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:04] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P22132 and previous config saved to /var/cache/conftool/dbconfig/20220308-173302-marostegui.json [17:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:43] 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10Andrew) [17:33:53] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1017.eqiad.wmnet with OS bullseye [17:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:40] (03PS1) 10Volans: spicerack: use the private property for config dir [software/spicerack] - 10https://gerrit.wikimedia.org/r/769084 [17:37:39] (03CR) 10Ahmon Dancy: [C: 03+1] gerrit: prevent 'null' entry in email [puppet] - 10https://gerrit.wikimedia.org/r/768005 (https://phabricator.wikimedia.org/T288312) (owner: 10Hashar) [17:40:01] (03CR) 10DCausse: [C: 04-1] "uploaded I3c94b23e6970568c6ae5afd6b8bb74022ec69910 that has the proper config, I'll 
rebase this patch to use it as the last cleanup to ful" [deployment-charts] - 10https://gerrit.wikimedia.org/r/766123 (https://phabricator.wikimedia.org/T302494) (owner: 10ZPapierski) [17:41:28] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:41:57] (03CR) 10Brennen Bearnes: [C: 03+1] gerrit: prevent 'null' entry in email [puppet] - 10https://gerrit.wikimedia.org/r/768005 (https://phabricator.wikimedia.org/T288312) (owner: 10Hashar) [17:42:46] (03CR) 10jerkins-bot: [V: 04-1] spicerack: use the private property for config dir [software/spicerack] - 10https://gerrit.wikimedia.org/r/769084 (owner: 10Volans) [17:43:05] is there an outage? i see 503 Service Unavailable when trying to view any wikimedia site [17:43:10] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:44:05] never mind, things work now. it was down for a minute or two though [17:44:32] i'm still getting 503 [17:44:33] yeah, I was getting `upstream connect error or disconnect/reset before headers. reset reason: overflow` on all wikis for about a minute [17:44:44] (03PS3) 10STran: Add IPInfo viewing rights for certain groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) [17:45:05] (03PS1) 10Elukey: Set Bullseye + overlayfs settings for kubernetes2005 [puppet] - 10https://gerrit.wikimedia.org/r/769085 (https://phabricator.wikimedia.org/T300744) [17:45:05] might be related to the above BGP alerts ,the time that BGP reconverged, checking [17:45:12] XioNoX in case you're around ^^^6 [17:45:17] we are still at 10k 503s per second [17:45:32] https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&viewPanel=14 [17:45:40] what's up? 
[17:45:51] (03CR) 10jerkins-bot: [V: 04-1] Add IPInfo viewing rights for certain groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) (owner: 10STran) [17:45:55] (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [17:46:25] connectivity issue between eqiad and eqsin https://smokeping.wikimedia.org/?displaymode=n;start=2022-03-08%2014:46;end=now;target=eqsin.Core.cr3-eqsin [17:46:39] PROBLEM - LVS text eqsin port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqsin.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:46:42] please prepare a patch to depool I'll force a transport failover [17:46:44] esams is affected as well [17:46:48] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_80: Servers cp1087.eqiad.wmnet are marked down but pooled: testlb_443: Servers cp1083.eqiad.wmnet, cp1087.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1083.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1087.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1 [17:46:48] d.wmnet, cp1089.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:46:55] (ProbeHttpFailed) firing: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [17:46:58] here [17:47:01] here [17:47:02] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5009 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:47:04] yeah esams too seems [17:47:04] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:47:05] o/ [17:47:10] XioNoX: can do, depool which? [17:47:12] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:47:12] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:47:12] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:47:18] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:47:21] * volans preparing patch to depool eqsin [17:47:22] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:47:26] volans: ack, all yours [17:47:29] here [17:47:34] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:47:34] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:47:41] I'm around if needed [17:47:41] PROBLEM - LVS text eqiad port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:47:44] PROBLEM - Varnish HTTP text-frontend - port 
3123 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:47:44] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:47:47] why are eqiad cp's alerting if it's a connectivity issue? [17:47:50] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:47:58] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:48:04] (03PS1) 10Volans: Emergency depool of eqsin [dns] - 10https://gerrit.wikimedia.org/r/769086 [17:48:08] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:48:09] patch here ^^^ vgutierrez [17:48:10] PROBLEM - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are ma [17:48:10] n but pooled: testlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:48:16] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5015 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:48:16] (03CR) 10STran: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) (owner: 10STran) [17:48:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_80: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb_443: Servers cp1087.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1087.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: text [17:48:18] Servers cp1083.eqiad.wmnet, cp1087.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:48:20] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:48:32] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled [17:48:32] 6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:48:34] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Varnish [17:48:34] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:48:39] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298294)', diff saved to https://phabricator.wikimedia.org/P22133 and previous config saved to /var/cache/conftool/dbconfig/20220308-174838-marostegui.json [17:48:41] (03CR) 10Vgutierrez: [C: 03+1] Emergency depool of eqsin [dns] - 10https://gerrit.wikimedia.org/r/769086 (owner: 10Volans) [17:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:43] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [17:48:46] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:48:50] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:49:09] RECOVERY - LVS text eqsin port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 610 bytes in 0.909 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:49:22] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:49:22] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:49:26] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:49:44] PROBLEM - Varnish HTTP 
text-frontend - port 80 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:49:58] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:50:04] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:50:04] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:50:06] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [17:50:17] RECOVERY - LVS text eqiad port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 610 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:50:30] (JobUnavailable) firing: (5) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:50:55] (ProbeHttpFailed) firing: (37) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [17:52:07] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.524 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:52:07] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5006 is CRITICAL: 6.67e+05 gt 5000 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006 [17:52:13] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:52:13] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5012 is CRITICAL: 6.999e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012 [17:52:17] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5013 is CRITICAL: 6.959e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5013 [17:52:27] RECOVERY - PyBal backends health check on lvs5001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:52:39] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5007 is CRITICAL: 6.264e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [17:52:51] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:52:51] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.449 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:52:55] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.464 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:53:03] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 
bytes in 0.447 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:53:03] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5011 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:53:03] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:53:03] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:53:19] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5012 is OK: (C)5000 gt (W)3000 gt 229.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012 [17:53:23] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5013 is OK: (C)5000 gt (W)3000 gt 361.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5013 [17:53:38] (03CR) 10Ahmon Dancy: [C: 03+1] "It would be good to get this merged before the train starts today, otherwise we're in for another round of spam in #wikimedia-operations." 
[puppet] - 10https://gerrit.wikimedia.org/r/767242 (https://phabricator.wikimedia.org/T302832) (owner: 10Ahmon Dancy) [17:53:43] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5007 is OK: (C)5000 gt (W)3000 gt 359.6 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [17:54:15] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5006 is OK: (C)5000 gt (W)3000 gt 357.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006 [17:55:09] jouncebot nowandnext [17:55:09] For the next 0 hour(s) and 4 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220308T1700) [17:55:09] In 1 hour(s) and 4 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220308T1900) [17:55:19] (03CR) 10Zabe: Depool esams (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/767817 (owner: 10Ladsgroup) [17:55:30] (JobUnavailable) firing: (5) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:55:55] (ProbeHttpFailed) firing: (37) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [17:56:55] (ProbeHttpFailed) firing: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [17:57:30] !log cmjohnson@cumin1001 START 
- Cookbook sre.hosts.reimage for host ms-fe1012.eqiad.wmnet with OS stretch [17:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:36] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1012.eqiad.wmnet with OS stretch [17:58:15] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:58:15] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.447 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:58:15] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:58:15] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:58:15] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.465 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:58:15] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.451 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:58:15] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:58:17] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:58:23] RECOVERY - Varnish HTTP text-frontend 
- port 80 on cp5012 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:58:36] (03CR) 10Bking: [C: 03+2] elasticsearch: move cluster configuration to puppet [puppet] - 10https://gerrit.wikimedia.org/r/768816 (https://phabricator.wikimedia.org/T278378) (owner: 10Razzi) [17:58:43] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1032.eqiad.wmnet with OS buster [17:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:45] I'm getting 503 Service Unavailable in Google Chrome but not Firefox on all wikipedias... anyone else seeing this? [17:58:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin... [17:58:56] Jdlrobson: still? 
[17:59:03] yeh logged in and anon [17:59:24] oh and now we're back [18:00:41] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:00:55] (ProbeHttpFailed) resolved: (37) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [18:00:59] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:01:07] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:01:08] (03CR) 10RhinosF1: Depool esams (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/767817 (owner: 10Ladsgroup) [18:01:35] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:01:43] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:01:55] (ProbeHttpFailed) resolved: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [18:05:23] (03CR) 10STran: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) (owner: 10STran) [18:05:31] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:07:27] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5009 is OK: HTTP OK: 
HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:07:46] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1012.eqiad.wmnet with reason: host reimage [18:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:37] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.453 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:09:39] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.451 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:10:59] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1032.eqiad.wmnet with reason: host reimage [18:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:07] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1012.eqiad.wmnet with reason: host reimage [18:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:27] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:11:27] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:11:44] jouncebot now [18:11:44] No deployments scheduled for the next 0 hour(s) and 48 minute(s) [18:13:07] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:13:09] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.451 second response time 
https://wikitech.wikimedia.org/wiki/Varnish [18:13:57] rzl: Are you still around? [18:14:14] dancy: here but still firefighting a little :) what's up? [18:14:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1032.eqiad.wmnet with reason: host reimage [18:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:39] I need assistance getting https://gerrit.wikimedia.org/r/c/operations/puppet/+/767242 rolled out. but I can wait if you're busy [18:17:35] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:17:35] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:17:35] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:17:41] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:17:43] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:17:53] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:17:59] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:18:03] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:04] PROBLEM - Varnish HTTP text-frontend - port 80 on 
cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:04] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:04] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:04] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:04] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:04] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:05] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:05] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:13] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:21] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:21] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:25] PROBLEM - LVS text eqsin port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqsin.wikimedia.org is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:18:25] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:27] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:27] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:37] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1087.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1087.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1083.eqiad.wmnet, cp1087.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1087.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled htt [18:18:37] itech.wikimedia.org/wiki/PyBal [18:18:47] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:47] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:47] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:59] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:59] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:18:59] PROBLEM - Varnish HTTP text-frontend - port 3126 on 
cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:19:31] RECOVERY - LVS text eqsin port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 610 bytes in 0.533 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:19:34] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:19:43] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5008 is CRITICAL: 3.042e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [18:19:56] (03PS1) 10RLazarus: varnish: URL normalization [puppet] - 10https://gerrit.wikimedia.org/r/769089 [18:20:17] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1087.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1083.eqiad.wmnet, cp1087.eq [18:20:17] t are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:20:30] (JobUnavailable) firing: (4) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:20:37] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:20:47] PROBLEM - Time elapsed since the last 
kafka event processed by purged on cp5012 is CRITICAL: 3.183e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012 [18:20:55] (ProbeHttpFailed) firing: (8) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [18:20:55] (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [18:21:09] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 1.300 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:21:16] (03CR) 10CDanis: [C: 03+1] varnish: URL normalization [puppet] - 10https://gerrit.wikimedia.org/r/769089 (owner: 10RLazarus) [18:21:17] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 1.274 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:21:22] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1012.eqiad.wmnet with OS stretch [18:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:27] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1012.eqiad.wmnet with OS stretch completed: - ms-fe1012... 
[18:21:31] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5007 is CRITICAL: 4.085e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [18:21:41] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5010 is CRITICAL: 3.936e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5010 [18:21:47] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5016 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 1.854 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:21:53] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.490 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:22:19] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.532 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:22:23] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.533 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:22:26] (03CR) 10BBlack: [C: 03+1] varnish: URL normalization [puppet] - 10https://gerrit.wikimedia.org/r/769089 (owner: 10RLazarus) [18:22:27] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5015 is CRITICAL: 4.172e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015 [18:22:33] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.532 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:23:37] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5008 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [18:24:59] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.532 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:25:03] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.496 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:25:19] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 9.226 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:25:29] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5012 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.532 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:25:30] (JobUnavailable) firing: (6) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:25:35] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.532 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:25:39] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.534 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:25:44] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 1.276 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:25:55] (ProbeHttpFailed) firing: (19) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [18:26:11] RECOVERY - Varnish HTTP text-frontend - port 3126 
on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.463 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:26:11] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.464 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:26:13] (03CR) 10RLazarus: [C: 03+2] varnish: URL normalization [puppet] - 10https://gerrit.wikimedia.org/r/769089 (owner: 10RLazarus) [18:26:27] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:26:27] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:27:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1032.eqiad.wmnet with OS buster [18:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001... 
[18:27:13] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5008 is OK: (C)5000 gt (W)3000 gt 405 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [18:27:21] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5007 is OK: (C)5000 gt (W)3000 gt 471.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [18:27:33] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.452 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:27:33] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5010 is OK: (C)5000 gt (W)3000 gt 353.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5010 [18:28:21] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5015 is OK: (C)5000 gt (W)3000 gt 354.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015 [18:28:27] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5012 is OK: (C)5000 gt (W)3000 gt 238 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012 [18:28:27] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:29:11] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.513 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:29:37] 
RECOVERY - Varnish HTTP text-frontend - port 3122 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.514 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:29:37] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.514 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:29:37] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.513 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:29:37] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.514 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:29:37] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.514 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:29:37] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.514 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:29:57] PROBLEM - Check if active EventStreams endpoint is delivering messages. on alert1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. 
https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [18:29:58] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1021.eqiad.wmnet with OS bullseye [18:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:30] (JobUnavailable) firing: (6) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:30:55] (ProbeHttpFailed) resolved: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [18:30:55] (ProbeHttpFailed) resolved: (19) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [18:35:12] !log cp10[3579] - restarting varnish-fe [18:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:21] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:35:39] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:35:52] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:27] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:38:27] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 
465 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:40:30] (JobUnavailable) firing: (6) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:40:47] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:42:07] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:42:29] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:43:11] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 471 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:44:45] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:02] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=cp1085.eqiad.wmnet [18:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:35] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance [18:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:38] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance [18:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:39] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 16:00:00 on 8 hosts with reason: Maintenance [18:47:40] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:51] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 8 hosts with reason: Maintenance [18:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:14] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [18:48:17] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [18:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:57] 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10Andrew) It's hanging on the dhcp request: ` Booting from QLogic MBA Slot 0100 v7.14.2 QLogic UNDI PXE-2.1 v7.14.2 Co... 
[18:49:14] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1021.eqiad.wmnet with OS bullseye [18:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:33] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=cp5004.eqsin.wmnet [18:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:57] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance [18:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:59] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance [18:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:34] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1129 (T298294)', diff saved to https://phabricator.wikimedia.org/P22134 and previous config saved to /var/cache/conftool/dbconfig/20220308-185033-marostegui.json [18:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:37] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [18:50:44] 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet and cloudvirt1021.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10Andrew) [18:52:43] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298294)', diff saved to https://phabricator.wikimedia.org/P22135 and previous config saved to /var/cache/conftool/dbconfig/20220308-185242-marostegui.json [18:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:11] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:55:12] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] dancy and brennen: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220308T1900). [19:00:19] Wiki seems to be down here [19:00:20] O_o [19:00:22] RECOVERY - Check if active EventStreams endpoint is delivering messages. on alert1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [19:00:45] o/ [19:00:55] see channel topic - train will be held for the time being. [19:05:11] 👍🏾 [19:07:44] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:08:19] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P22136 and previous config saved to /var/cache/conftool/dbconfig/20220308-190818-marostegui.json [19:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:01] AWS is having significant issues, probably related: https://downdetector.com/status/aws-amazon-web-services/ [19:16:15] Though I'm not seeing anything on AWS's health status pages so it could also be incorrectly reported er... 
reports [19:17:55] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools autotopicsub on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769096 (https://phabricator.wikimedia.org/T302256) [19:18:38] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:19:12] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:19:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:09] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:55] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P22137 and previous config saved to /var/cache/conftool/dbconfig/20220308-192354-marostegui.json [19:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:33] 10SRE, 10SRE-Access-Requests: Request Administrator Access to Google Search Console - https://phabricator.wikimedia.org/T302625 (10dr0ptp4kt) That's correct. I'll follow up under separate cover. [19:25:01] 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet and cloudvirt1021.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10Andrew) @MoritzMuehlenhoff says this will be fixed with firmware updates; I'd suggest th... 
[19:25:11] Starting train operations [19:27:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:22] (PS1) Ahmon Dancy: testwikis wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - https://gerrit.wikimedia.org/r/769100 [19:31:24] (CR) Ahmon Dancy: [C: +2] testwikis wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - https://gerrit.wikimedia.org/r/769100 (owner: Ahmon Dancy) [19:32:21] (Merged) jenkins-bot: testwikis wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - https://gerrit.wikimedia.org/r/769100 (owner: Ahmon Dancy) [19:32:25] !log dancy@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.25 refs T300201 [19:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:28] T300201: 1.38.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T300201 [19:37:30] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (Jclark-ctr) name rack Unit Port CableID an-worker1142 e1 27u 27 an-worker1143 e2 27u 27 an-worker1144 f1 27u 27... 
[19:38:42] (03CR) 10Tchanders: Autopromote-once users to the 'ipinfo' group after one edit (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767845 (https://phabricator.wikimedia.org/T296184) (owner: 10Tchanders) [19:38:53] (03CR) 10Tchanders: Enable IPInfo on testwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) (owner: 10Tchanders) [19:39:12] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:39:21] (03CR) 10STran: [C: 03+1] Autopromote-once users to the 'ipinfo' group after one edit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767845 (https://phabricator.wikimedia.org/T296184) (owner: 10Tchanders) [19:39:31] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298294)', diff saved to https://phabricator.wikimedia.org/P22138 and previous config saved to /var/cache/conftool/dbconfig/20220308-193930-marostegui.json [19:39:32] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [19:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:35] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [19:39:35] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [19:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:55] (03CR) 10STran: [C: 03+1] Enable IPInfo on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) (owner: 10Tchanders) [19:40:13] 10SRE, 10ops-ulsfo, 10DC-Ops, 
10Traffic: ganeti4002 dimm error - https://phabricator.wikimedia.org/T303318 (10RobH) [19:40:58] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:41:10] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:41:23] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [19:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:25] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [19:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:00] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1146:3312 (T298294)', diff saved to https://phabricator.wikimedia.org/P22139 and previous config saved to /var/cache/conftool/dbconfig/20220308-194159-marostegui.json [19:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:05] !log !log push DHCP term to labs-in filters on eqiad cr [19:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:44] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:43:56] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:44:06] (03PS4) 10STran: Add IPInfo viewing rights for certain groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) [19:45:28] (03PS5) 
10STran: Add IPInfo viewing rights for certain groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) [19:45:45] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye [19:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:52] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:46:10] (03PS1) 10Ayounsi: Re-add DHCP term to labs-in filter [homer/public] - 10https://gerrit.wikimedia.org/r/769102 (https://phabricator.wikimedia.org/T303296) [19:47:13] (03PS2) 10Ayounsi: Re-add DHCP term to labs-in filter [homer/public] - 10https://gerrit.wikimedia.org/r/769102 (https://phabricator.wikimedia.org/T303296) [19:48:10] (03PS4) 10STran: Enable IPInfo on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) (owner: 10Tchanders) [19:48:34] (03PS3) 10STran: Autopromote-once users to the 'ipinfo' group after one edit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767845 (https://phabricator.wikimedia.org/T296184) (owner: 10Tchanders) [19:50:57] (03CR) 10Tchanders: [C: 03+1] Add IPInfo viewing rights for certain groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) (owner: 10STran) [19:50:59] (03PS1) 10BBlack: cut default per-IP burst from 1000 to 500 [puppet] - 10https://gerrit.wikimedia.org/r/769103 [19:51:56] (03CR) 10Ayounsi: [C: 03+2] Re-add DHCP term to labs-in filter [homer/public] - 10https://gerrit.wikimedia.org/r/769102 (https://phabricator.wikimedia.org/T303296) (owner: 10Ayounsi) [19:52:43] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1016.eqiad.wmnet with OS bullseye [19:52:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:47] (03Merged) 10jenkins-bot: Re-add DHCP term to labs-in filter [homer/public] - 10https://gerrit.wikimedia.org/r/769102 (https://phabricator.wikimedia.org/T303296) (owner: 10Ayounsi) [19:53:10] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1021.eqiad.wmnet with OS bullseye [19:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:09] (03CR) 10Jbond: [C: 03+1] "lets give it a shot" [puppet] - 10https://gerrit.wikimedia.org/r/769103 (owner: 10BBlack) [19:54:18] PROBLEM - Ensure local MW versions match expected deployment on deploy2002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:55:30] huh [19:58:22] Ignore that. [19:58:48] The fix for that hasn't been deployed yet [19:58:53] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban: Increase max.incremental.fetch.session.cache.slots on Kafka jumbo eqiad - https://phabricator.wikimedia.org/T303324 (10Ottomata) [20:02:19] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/769084 (owner: 10Volans) [20:11:54] 10SRE, 10ops-eqiad: 8 x SMF Patches between cages Eqiad - LVS & WMCS - https://phabricator.wikimedia.org/T301419 (10Jclark-ctr) [20:13:15] (03CR) 10Volans: [C: 03+2] spicerack: use the private property for config dir [software/spicerack] - 10https://gerrit.wikimedia.org/r/769084 (owner: 10Volans) [20:16:59] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10jbond) brandon also just pointed me to `git grep netmapper` and https://gerrit.wikimedia.org/g/operations/softw... 
[20:18:32] (CR) Jbond: [C: +1] spicerack: use the private property for config dir [software/spicerack] - https://gerrit.wikimedia.org/r/769084 (owner: Volans) [20:18:34] SRE, ops-eqiad, DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (Jclark-ctr) [20:19:32] (Merged) jenkins-bot: spicerack: use the private property for config dir [software/spicerack] - https://gerrit.wikimedia.org/r/769084 (owner: Volans) [20:19:48] (PS3) Ryan Kemper: elasticsearch: upgrade relforge to elasticsearch 6.8 [puppet] - https://gerrit.wikimedia.org/r/763479 (https://phabricator.wikimedia.org/T301955) (owner: Gehel) [20:20:06] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/763479 (https://phabricator.wikimedia.org/T301955) (owner: Gehel) [20:21:20] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1021.eqiad.wmnet with OS bullseye [20:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:33] !log dancy@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.25 refs T300201 [20:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:36] T300201: 1.38.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T300201 [20:32:53] (Traffic bill over quota) firing: Alert for device cr2-eqord.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org [20:33:06] (PS1) Ryan Kemper: elastic: relax & restore perms during upgrade [cookbooks] - https://gerrit.wikimedia.org/r/769109 (https://phabricator.wikimedia.org/T301955) [20:36:25] !log rzl@apt1001:~$ sudo -i reprepro copy stretch-wikimedia buster-wikimedia envoyproxy # T300324 [20:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:29] T300324: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 [20:36:34] !log 
rzl@apt1001:~$ sudo -i reprepro copy bullseye-wikimedia buster-wikimedia envoyproxy # T300324 [20:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:53] (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org [20:38:02] RECOVERY - Check systemd state on cp6010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:38:55] (03PS1) 10RLazarus: envoy: Update to 1.18.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/769110 (https://phabricator.wikimedia.org/T300324) [20:41:33] (03CR) 10RLazarus: [V: 03+2 C: 03+2] envoy: Update to 1.18.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/769110 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [20:41:44] (03PS2) 10Ryan Kemper: elastic: relax & restore perms during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/769109 (https://phabricator.wikimedia.org/T301955) [20:42:55] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298294)', diff saved to https://phabricator.wikimedia.org/P22142 and previous config saved to /var/cache/conftool/dbconfig/20220308-204254-marostegui.json [20:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:59] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [20:43:56] PROBLEM - Ensure local MW versions match expected deployment on mw1418 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:46:32] PROBLEM - Check systemd state on cp6010 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_exim4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:40] PROBLEM - Ensure local MW versions match expected deployment 
on mw1415 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:47:16] PROBLEM - Ensure local MW versions match expected deployment on mw1448 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:47:16] PROBLEM - Ensure local MW versions match expected deployment on mw1450 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:47:16] PROBLEM - Ensure local MW versions match expected deployment on mw1313 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:47:37] Gah [20:47:46] Prepare for a flood of spam [20:48:44] RECOVERY - Ensure local MW versions match expected deployment on deploy2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [20:49:24] is the wmf.25 train deployment happening today? [20:50:12] MatmaRex: currently underway. 
[20:50:15] yes [20:50:23] A bit behind schedule [20:50:34] RECOVERY - Ensure local MW versions match expected deployment on mw1418 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [20:52:53] (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org [20:53:20] RECOVERY - Ensure local MW versions match expected deployment on mw1415 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [20:53:56] RECOVERY - Ensure local MW versions match expected deployment on mw1450 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [20:53:56] RECOVERY - Ensure local MW versions match expected deployment on mw1448 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [20:53:56] RECOVERY - Ensure local MW versions match expected deployment on mw1313 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [20:57:53] (Traffic bill over quota) resolved: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org [20:58:30] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P22143 and previous config saved to /var/cache/conftool/dbconfig/20220308-205829-marostegui.json [20:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:46] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.25 refs T300201 (duration: 32m 13s) [20:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:51] T300201: 1.38.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T300201 [21:00:05] RoanKattouw and Urbanecm: Dear deployers, time to do the UTC late backport window deploy. Dont look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220308T2100). [21:00:05] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:17] I can deploy today! [21:00:40] hi [21:00:43] but i see dancy's full scap just finished -- so just confirming i can start with B&C? [21:00:47] hi MatmaRex [21:01:03] MatmaRex: you listed a "maybe" item -- can you clarify it please? [21:01:06] i also have some patches i might want to backport to wmf.25, but i need to test if that version is affected first [21:01:11] and wmf.25 is not deployed yet [21:01:21] https://www.mediawiki.org/wiki/Special:Version still says wmf.24 [21:01:32] I haven't deployed to group0 yet [21:01:35] !log dancy@deploy1002 Pruned MediaWiki: 1.38.0-wmf.23 (duration: 01m 46s) [21:01:36] MatmaRex: it's at testwiki [21:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:40] https://test.wikipedia.org/wiki/Special:Version says wmf.25 [21:01:56] ah, that will work [21:02:01] i thought they go at the same time [21:02:02] thanks [21:02:12] scap clean is running at the moment. Should be done in a few. [21:02:21] dancy: please ping me when i can start with B&C then :?) [21:02:26] OK [21:02:49] and it looks like i do need to backport. i'll prepare patches in a second [21:03:04] !log dancy@deploy1002 Pruned MediaWiki: 1.38.0-wmf.22 (duration: 01m 28s) [21:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:28] MatmaRex: okay, sounds good to me. [21:03:29] urbanecm: Done w/ deployment to testwikis. You can step in now. [21:03:34] thanks! 
[21:03:43] (PS1) Bartosz Dziewoński: Fix handling of disabled 'mobileformat' [extensions/DiscussionTools] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/768802 (https://phabricator.wikimedia.org/T303262) [21:03:45] (PS1) Bartosz Dziewoński: Fix handling of disabled 'mobileformat' [extensions/VisualEditor] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/768803 (https://phabricator.wikimedia.org/T303262) [21:03:48] (PS2) Urbanecm: Enable DiscussionTools autotopicsub on MediaWiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/769096 (https://phabricator.wikimedia.org/T302256) (owner: Bartosz Dziewoński) [21:03:50] (CR) Urbanecm: [C: +2] Enable DiscussionTools autotopicsub on MediaWiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/769096 (https://phabricator.wikimedia.org/T302256) (owner: Bartosz Dziewoński) [21:05:12] (Merged) jenkins-bot: Enable DiscussionTools autotopicsub on MediaWiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/769096 (https://phabricator.wikimedia.org/T302256) (owner: Bartosz Dziewoński) [21:05:46] updated table [21:06:16] thanks, let me +2 em [21:06:52] MatmaRex: do they depend on the other backport? [21:07:03] (CR) Urbanecm: [C: +2] Fix handling of disabled 'mobileformat' [extensions/VisualEditor] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/768803 (https://phabricator.wikimedia.org/T303262) (owner: Bartosz Dziewoński) [21:07:12] (CR) Urbanecm: [C: +2] Fix handling of disabled 'mobileformat' [extensions/DiscussionTools] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/768802 (https://phabricator.wikimedia.org/T303262) (owner: Bartosz Dziewoński) [21:07:22] no [21:07:25] urbanecm, brennen, et al: I'm going to take a break now. I'll come back in an hour to check status and roll forward to group0. [21:07:29] okay, thanks [21:07:38] ack, sounds good dancy. 
[21:07:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:50] ack too [21:08:11] MatmaRex: config patch is at mwdebug1001 -- can you check? [21:08:52] urbanecm: yep. looks good [21:08:58] syncing! [21:10:29] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 3132fca7b3b078155fa406339d05286ca6e0797b: Enable DiscussionTools autotopicsub on MediaWiki.org (T302256) (duration: 00m 49s) [21:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:33] T302256: Config Change: offer Reply Tool, New Discussion Tool, Topic Subscriptions as Opt-Out at mediawiki.org - https://phabricator.wikimedia.org/T302256 [21:10:35] config patch live [21:10:40] waiting on CI for the backports [21:11:36] i'm likewise taking a short break, back in 25m or so. [21:14:05] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P22144 and previous config saved to /var/cache/conftool/dbconfig/20220308-211404-marostegui.json [21:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:14:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:12] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye [21:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:14] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 
0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:18:58] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:19:34] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 43, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:20:00] ehm...is this ^^a reason to worry (during a MW deployment)? [21:20:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:41] (03Merged) 10jenkins-bot: Fix handling of disabled 'mobileformat' [extensions/VisualEditor] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/768803 (https://phabricator.wikimedia.org/T303262) (owner: 10Bartosz Dziewoński) [21:21:43] (03Merged) 10jenkins-bot: Fix handling of disabled 'mobileformat' [extensions/DiscussionTools] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/768802 (https://phabricator.wikimedia.org/T303262) (owner: 10Bartosz Dziewoński) [21:21:50] can't comment on if that's worrying, but probably not caused by your deployments [21:22:22] that's true [21:23:09] * urbanecm decides to continue unless he's told to stop [21:23:37] MatmaRex: pulled to mwdebug1001, can you check? 
[21:23:41] (both of the backports are there) [21:24:29] urbanecm: looks good [21:24:34] syncing [21:25:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:19] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/VisualEditor/includes/ApiVisualEditorEdit.php: a5c6d06714d76d84fac270e3dac4a4a0a2d83927: Fix handling of disabled mobileformat (T303262) (duration: 00m 49s) [21:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:24] T303262: Headings and other elements look different after saving the page using VisualEditor or DiscussionTools - https://phabricator.wikimedia.org/T303262 [21:28:09] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/DiscussionTools/includes/ApiDiscussionToolsEdit.php: cc5acc27ad02f473b001d088dba454e013129cb2: Fix handling of disabled mobileformat (T303262) (duration: 00m 49s) [21:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:17] MatmaRex: should be live. anything else? [21:29:01] thanks urbanecm, that's all from me [21:29:12] okay. 
In that case, have a nice day MatmaRex
[21:29:40] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298294)', diff saved to https://phabricator.wikimedia.org/P22145 and previous config saved to /var/cache/conftool/dbconfig/20220308-212939-marostegui.json
[21:29:42] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[21:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:44] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[21:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:45] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[21:29:46] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 16:00:00 on db1155.eqiad.wmnet with reason: Maintenance
[21:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:49] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1155.eqiad.wmnet with reason: Maintenance
[21:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:25] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1156 (T298294)', diff saved to https://phabricator.wikimedia.org/P22146 and previous config saved to /var/cache/conftool/dbconfig/20220308-213024-marostegui.json
[21:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:26] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1016.eqiad.wmnet with OS bullseye
[21:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:00] !log UTC early B&C window done
[21:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:33:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:32] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298294)', diff saved to https://phabricator.wikimedia.org/P22147 and previous config saved to /var/cache/conftool/dbconfig/20220308-213331-marostegui.json
[21:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:17] !log andrew@cumin1001 START - Cookbook sre.hosts.dhcp for host cloudvirt1016.eqiad.wmnet
[21:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:52] (PS1) Ebernhardson: prometheus: Add more per-index metrics for elasticsearch [puppet] - https://gerrit.wikimedia.org/r/769123 (https://phabricator.wikimedia.org/T300295)
[21:41:33] (CR) jerkins-bot: [V: -1] prometheus: Add more per-index metrics for elasticsearch [puppet] - https://gerrit.wikimedia.org/r/769123 (https://phabricator.wikimedia.org/T300295) (owner: Ebernhardson)
[21:46:39] (PS2) Ebernhardson: prometheus: Add more per-index metrics for elasticsearch [puppet] - https://gerrit.wikimedia.org/r/769123 (https://phabricator.wikimedia.org/T300295)
[21:49:07] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P22148 and previous config saved to /var/cache/conftool/dbconfig/20220308-214906-marostegui.json
[21:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:22] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (Cmjohnson)
[21:52:32] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (Cmjohnson) Open→Resolved
[21:53:58] SRE, ops-eqiad, DC-Ops, RESTBase, Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (Cmjohnson)
[21:54:05] SRE, ops-eqiad, DC-Ops, RESTBase, Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (Cmjohnson) Open→Resolved
[21:55:26] SRE, ops-eqiad, decommission-hardware: decommission elastic10[32-47].eqiad.wmnet - https://phabricator.wikimedia.org/T302517 (Cmjohnson) Open→Resolved netbox script ran, removed from rack
[21:55:32] SRE, ops-eqiad, decommission-hardware: decommission rdb100[56].eqiad.wmnet - https://phabricator.wikimedia.org/T273139 (Cmjohnson) Open→Resolved removed from rack
[21:56:10] SRE, ops-eqiad, decommission-hardware: decommission mw130[2-6].eqiad.wmnet - https://phabricator.wikimedia.org/T303027 (Cmjohnson) Open→Resolved removed from rack
[21:56:26] SRE, ops-eqiad, DC-Ops: eqiad: Unrack wmf3570, wmf4579, conf1003, mw1301 - https://phabricator.wikimedia.org/T302034 (Cmjohnson) Open→Resolved removed from the rack netbox updated
[21:56:46] SRE, ops-eqiad, decommission-hardware: decommission prometheus1003.eqiad.wmnet - https://phabricator.wikimedia.org/T301466 (Cmjohnson) Open→Resolved removed from the rack netbox updated
[21:56:58] SRE, ops-eqiad, decommission-hardware: decommission prometheus1004.eqiad.wmnet - https://phabricator.wikimedia.org/T301851 (Cmjohnson) Open→Resolved removed from the rack netbox updated
[22:02:11] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet and cloudvirt1021.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (Cmjohnson) @Andrew safe to do this anytime?
[22:02:32] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:03:18] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:03:54] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:04:42] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P22149 and previous config saved to /var/cache/conftool/dbconfig/20220308-220441-marostegui.json
[22:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:12] I have returned
[22:20:17] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298294)', diff saved to https://phabricator.wikimedia.org/P22150 and previous config saved to /var/cache/conftool/dbconfig/20220308-222016-marostegui.json
[22:20:19] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[22:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:21] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[22:20:22] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[22:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:56] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1162 (T298294)', diff saved to https://phabricator.wikimedia.org/P22151 and previous config saved to /var/cache/conftool/dbconfig/20220308-222055-marostegui.json
[22:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:05] Everything seems chill, so rolling the train to group0
[22:22:36] (PS1) Ahmon Dancy: group0 wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - https://gerrit.wikimedia.org/r/769126
[22:22:38] (CR) Ahmon Dancy: [C: +2] group0 wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - https://gerrit.wikimedia.org/r/769126 (owner: Ahmon Dancy)
[22:23:04] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298294)', diff saved to https://phabricator.wikimedia.org/P22152 and previous config saved to /var/cache/conftool/dbconfig/20220308-222303-marostegui.json
[22:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:22] (Merged) jenkins-bot: group0 wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - https://gerrit.wikimedia.org/r/769126 (owner: Ahmon Dancy)
[22:24:35] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.25 refs T300201
[22:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:38] T300201: 1.38.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T300201
[22:24:55] (PS1) Ebernhardson: Create phab task when indices are too old [alerts] - https://gerrit.wikimedia.org/r/769127 (https://phabricator.wikimedia.org/T300295)
[22:30:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:31:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:38] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:38:39] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P22153 and previous config saved to /var/cache/conftool/dbconfig/20220308-223838-marostegui.json
[22:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[22:43:30] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:44:06] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:44:16] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:44:50] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 43, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:47:00] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet and cloudvirt1021.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (Andrew) >>! In T303296#7761994, @Cmjohnson wrote: > @Andrew safe to do this anytime? All three hos...
[22:54:15] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P22155 and previous config saved to /var/cache/conftool/dbconfig/20220308-225413-marostegui.json
[22:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:08] (PS1) Ebernhardson: alertmanager: Configure task creation for search-platform [puppet] - https://gerrit.wikimedia.org/r/769131 (https://phabricator.wikimedia.org/T300295)
[23:04:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:04:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:02] (PS1) Jbond: (WIP) C:varnish: Add automatic cloud nets update [puppet] - https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391)
[23:07:34] (CR) jerkins-bot: [V: -1] (WIP) C:varnish: Add automatic cloud nets update [puppet] - https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) (owner: Jbond)
[23:09:00] (CR) RLazarus: [C: +2] miscweb: Restore envoy image_version to the inherited default [deployment-charts] - https://gerrit.wikimedia.org/r/766842 (https://phabricator.wikimedia.org/T300324) (owner: RLazarus)
[23:09:50] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298294)', diff saved to https://phabricator.wikimedia.org/P22156 and previous config saved to /var/cache/conftool/dbconfig/20220308-230949-marostegui.json
[23:09:52] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[23:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:54] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[23:09:55] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[23:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:28] (PS2) Jbond: (WIP) C:varnish: Add automatic cloud nets update [puppet] - https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391)
[23:10:29] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1170:3312 (T298294)', diff saved to https://phabricator.wikimedia.org/P22157 and previous config saved to /var/cache/conftool/dbconfig/20220308-231028-marostegui.json
[23:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:52] (PS3) Jbond: (WIP) C:varnish: Add automatic cloud nets update [puppet] - https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391)
[23:13:09] (Merged) jenkins-bot: miscweb: Restore envoy image_version to the inherited default [deployment-charts] - https://gerrit.wikimedia.org/r/766842 (https://phabricator.wikimedia.org/T300324) (owner: RLazarus)
[23:13:42] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298294)', diff saved to https://phabricator.wikimedia.org/P22158 and previous config saved to /var/cache/conftool/dbconfig/20220308-231340-marostegui.json
[23:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:57] (CR) Jbond: "This PS still needs a bit of cleaning up but i think this gets close to what we spoke about on IRC regarding using netmapper for cloud net" [puppet] - https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) (owner: Jbond)
[23:29:16] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P22159 and previous config saved to /var/cache/conftool/dbconfig/20220308-232915-marostegui.json
[23:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:51] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P22160 and previous config saved to /var/cache/conftool/dbconfig/20220308-234450-marostegui.json
[23:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:52] (PS1) Cwhite: grafana ldap users sync: enable retries [puppet] - https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064)