[00:02:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2181.codfw.wmnet with OS bullseye
[00:02:56] SRE, ops-codfw, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2181.codfw.wmnet with OS bullseye completed: - db2...
[00:07:51] RECOVERY - PyBal connections to etcd on lvs6002 is OK: OK: 4 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[00:23:59] PROBLEM - PyBal connections to etcd on lvs6002 is CRITICAL: CRITICAL: 3 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[00:26:53] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:07] (CR) Dzahn: [V: +1 C: +1] "CLOSEDDOWN 208.80.154.14 22 tcp 6 0 gitlab-old.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/811912 (https://phabricator.wikimedia.org/T307142) (owner: Jelto)
[00:34:01] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8808.service,thumbor@8812.service,thumbor@8816.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:39] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:05:01] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed:
rsync-config-backup-gitlab1001.wikimedia.org.service,rsync-config-backup-gitlab2001.wikimedia.org.service,rsync-data-backup-gitlab1001.wikimedia.org.service,rsync-data-backup-gitlab2001.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:07:57] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:12:07] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:12:23] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:12:43] !log gitlab1004 - _still_ icinga alerts about rsync to decom'ed host.
'systemctl daemon-reload' to teach it about deleted units, then systemctl reset failed ..then RECOVERY T307142
[01:12:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2182.codfw.wmnet with OS bullseye
[01:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:49] T307142: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142
[01:12:54] SRE, ops-codfw, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2182.codfw.wmnet with OS bullseye
[01:19:10] SRE, Traffic: Create Ganeti VMs for Wikidough in drmrs - https://phabricator.wikimedia.org/T300156 (ssingh) Open→Resolved
[01:19:45] SRE, Traffic, Patch-For-Review: Create Ganeti VMs for durum in drmrs - https://phabricator.wikimedia.org/T300158 (ssingh) Open→Resolved Resolved for quite a while now.
[01:26:02] (CR) Krinkle: [C: +1] Add ucfirst overrides for the PHP 7.4 migration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/811875 (https://phabricator.wikimedia.org/T271736) (owner: Tim Starling)
[01:27:24] (PS1) Dzahn: vrts/blackbox: adjust monitoring back to port 80, but fix path [puppet] - https://gerrit.wikimedia.org/r/812142
[01:31:27] (PS1) Dzahn: vrts/prometheus: set force_tls to true for check on port 1443 [puppet] - https://gerrit.wikimedia.org/r/812144
[01:31:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2182.codfw.wmnet with reason: host reimage
[01:32:29] (CR) CI reject: [V: -1] vrts/prometheus: set force_tls to true for check on port 1443 [puppet] - https://gerrit.wikimedia.org/r/812144 (owner: Dzahn)
[01:32:33] (CR) Dzahn: [C: +2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/812144/" [puppet] - https://gerrit.wikimedia.org/r/811985 (owner: Majavah)
[01:35:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2182.codfw.wmnet with reason: host reimage
[01:35:48] (PS1) Krinkle: Enable wgResourceLoaderUseObjectCacheForDeps for dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/812146 (https://phabricator.wikimedia.org/T113916)
[01:35:50] (PS1) Krinkle: Enable wgResourceLoaderUseObjectCacheForDeps for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/812147 (https://phabricator.wikimedia.org/T113916)
[01:36:06] (PS2) Dzahn: vrts/prometheus: set force_tls to true for check on port 1443 [puppet] - https://gerrit.wikimedia.org/r/812144
[01:37:14] (CR) Dzahn: [C: +2] "trying it to fix alerts" [puppet] - https://gerrit.wikimedia.org/r/812144 (owner: Dzahn)
[01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:09] (CR) Krinkle: [C: +2] Enable wgResourceLoaderUseObjectCacheForDeps for dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/812146 (https://phabricator.wikimedia.org/T113916) (owner: Krinkle)
[01:47:09] (Merged) jenkins-bot: Enable wgResourceLoaderUseObjectCacheForDeps for dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/812146 (https://phabricator.wikimedia.org/T113916) (owner: Krinkle)
[01:47:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2182.codfw.wmnet with OS bullseye
[01:50:03] SRE, ops-codfw, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2182.codfw.wmnet with OS bullseye completed: - db2...
[01:52:45] (JobUnavailable) resolved: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:53:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:54:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:54:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:54:13] (CR) Dzahn: [C: +2] "after a little while on prometheus1005:" [puppet] - https://gerrit.wikimedia.org/r/812144 (owner: Dzahn)
[01:55:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:02:10] SRE, ops-codfw, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (Papaul)
[02:03:55] SRE, ops-codfw, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (Papaul) @Marostegui all the servers are ready but not db2175 @Volans is running some tests on it. Once he is done I will install the OS on it. Th...
[02:12:35] (CR) Dzahn: [C: +2] "we are now checking https instead of http. just the next issue is ""Received redirect" but "not following redirect". adjusting path ..."
[puppet] - https://gerrit.wikimedia.org/r/812144 (owner: Dzahn)
[02:12:58] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: RL use MainStash on dewiki I1c120d64d226 (duration: 03m 21s)
[02:13:12] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:15:47] (PS1) Dzahn: vrts/prometheus: fix monitored path, avoid redirect [puppet] - https://gerrit.wikimedia.org/r/812152
[02:16:38] (CR) Dzahn: [C: +2] vrts/prometheus: fix monitored path, avoid redirect [puppet] - https://gerrit.wikimedia.org/r/812152 (owner: Dzahn)
[02:25:01] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@c271774]: Update rdf-spark-tools to 0.3.112
[02:25:30] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343
[02:25:33] T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343
[02:26:32] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1004.wikimedia.org with OS bullseye
[02:26:38] SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye
[02:27:15] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@c271774]: Update rdf-spark-tools to 0.3.112 (duration: 02m 13s)
[03:16:18] (PS1) Dzahn: vrts/prometheus: comment out broken check [puppet] - https://gerrit.wikimedia.org/r/812158
[03:17:26] (CR) Dzahn: [C: +2] "I ran out of time.. and don't want more false positives now."
[puppet] - https://gerrit.wikimedia.org/r/812158 (owner: Dzahn)
[03:22:48] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1004.wikimedia.org with OS bullseye
[03:22:53] SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors...
[03:32:59] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343
[03:33:02] T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343
[04:02:10] (PS1) Krinkle: ResourceLoader: Switch Image.php to injected log channel [core] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/812108 (https://phabricator.wikimedia.org/T32956)
[04:06:47] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@b5d49fe]: use mode=reschedule on all airflow sensors
[04:08:51] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@b5d49fe]: use mode=reschedule on all airflow sensors (duration: 02m 03s)
[04:27:43] (PS1) Andrew Bogott: Openstack Heat: standardize on heat_domain_admin name [puppet] - https://gerrit.wikimedia.org/r/812164
[04:31:36] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:42] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:38:58] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The
following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:56] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:49] (PS1) Marostegui: db2161: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/812168 (https://phabricator.wikimedia.org/T311493)
[05:05:36] (CR) Marostegui: [C: +2] db2161: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/812168 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:08:20] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:11:18] (PS1) Marostegui: db2165: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/812169 (https://phabricator.wikimedia.org/T311493)
[05:12:15] (CR) Marostegui: [C: +2] db2165: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/812169 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:18:38] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:18:56] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:21:06] RECOVERY - Router interfaces
on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:21:26] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:23:07] !log dbmaint s3@eqiad T312574
[05:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:12] T312574: Adjust the field type of flow_revision.rev_mod_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312574
[05:26:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2016.codfw.wmnet with OS bullseye
[05:26:32] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2016.codfw.wmnet with OS bullseye
[05:28:12] (PS1) Marostegui: db2076: Remove db2076 from dbctl [puppet] - https://gerrit.wikimedia.org/r/812170 (https://phabricator.wikimedia.org/T312190)
[05:28:59] (CR) Marostegui: [C: +2] db2076: Remove db2076 from dbctl [puppet] - https://gerrit.wikimedia.org/r/812170 (https://phabricator.wikimedia.org/T312190) (owner: Marostegui)
[05:29:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2076 from dbctl T312190', diff saved to https://phabricator.wikimedia.org/P30962 and previous config saved to /var/cache/conftool/dbconfig/20220708-052926-marostegui.json
[05:29:31] T312190: decommission db2076 - https://phabricator.wikimedia.org/T312190
[05:29:38] SRE, ops-codfw, decommission-hardware, Patch-For-Review: decommission db2076 - https://phabricator.wikimedia.org/T312190 (Marostegui)
[05:31:22] !log draining ganeti2027 T311686
[05:31:24] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:25] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[05:33:00] (PS1) Marostegui: mariadb: Decommission db2076 [puppet] - https://gerrit.wikimedia.org/r/812171 (https://phabricator.wikimedia.org/T312190)
[05:34:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2076.codfw.wmnet
[05:38:02] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:38:28] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[05:42:00] (PS1) Muehlenhoff: apparmor: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/812172 (https://phabricator.wikimedia.org/T308013)
[05:42:02] (PS1) Muehlenhoff: memcached: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/812173 (https://phabricator.wikimedia.org/T308013)
[05:42:04] (PS1) Muehlenhoff: nginx: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/812174 (https://phabricator.wikimedia.org/T308013)
[05:42:06] (PS1) Muehlenhoff: alertmanager: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/812175 (https://phabricator.wikimedia.org/T308013)
[05:42:08] (PS1) Muehlenhoff: interface: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/812176 (https://phabricator.wikimedia.org/T308013)
[05:42:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:44:04] (CR) Marostegui: [C: +2] mariadb: Decommission db2076 [puppet] - https://gerrit.wikimedia.org/r/812171 (https://phabricator.wikimedia.org/T312190) (owner: Marostegui)
[05:44:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2076.codfw.wmnet
[05:44:30] SRE, ops-codfw, decommission-hardware, Patch-For-Review: decommission db2076 - https://phabricator.wikimedia.org/T312190 (ops-monitoring-bot)
cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2076.codfw.wmnet` - db2076.codfw.wmnet (**PASS**) - Downtimed host on Ici...
[05:44:46] ops-codfw, decommission-hardware: decommission db2076 - https://phabricator.wikimedia.org/T312190 (Marostegui) @Papaul this is all yours
[05:44:48] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (MoritzMuehlenhoff) >>! In T308331#8061619, @ssastry wrote: > I am hoping @Dzahn can answer the question for me for scandium since I don't know what the 10G requirement means. scandium doesn'...
[05:45:24] ops-codfw, decommission-hardware: decommission db2076 - https://phabricator.wikimedia.org/T312190 (Marostegui) a:Papaul
[05:45:49] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (MoritzMuehlenhoff) ganeti1020 is now emptied of VMs and can be moved.
[05:46:11] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2016.codfw.wmnet with reason: host reimage
[05:49:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2016.codfw.wmnet with reason: host reimage
[05:49:28] (PS2) Muehlenhoff: alertmanager: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/812175 (https://phabricator.wikimedia.org/T308013)
[05:52:26] (PS1) Marostegui: instances.yaml: Remove db2077 from dbctl [puppet] - https://gerrit.wikimedia.org/r/812180 (https://phabricator.wikimedia.org/T312191)
[05:52:28] (PS2) Muehlenhoff: nginx: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/812174 (https://phabricator.wikimedia.org/T308013)
[05:53:10] (CR) Marostegui: [C: +2] instances.yaml: Remove db2077 from dbctl [puppet] - https://gerrit.wikimedia.org/r/812180 (https://phabricator.wikimedia.org/T312191) (owner: Marostegui)
[05:53:35] !log marostegui@cumin1001 dbctl commit (dc=all):
'Remove db2077 from dbctl T312191', diff saved to https://phabricator.wikimedia.org/P30963 and previous config saved to /var/cache/conftool/dbconfig/20220708-055334-marostegui.json
[05:53:38] T312191: decommission db2077 - https://phabricator.wikimedia.org/T312191
[05:53:51] SRE, ops-codfw, decommission-hardware, Patch-For-Review: decommission db2077 - https://phabricator.wikimedia.org/T312191 (Marostegui)
[05:54:33] (CR) Giuseppe Lavagetto: [C: +1] mediawiki: Split updateSpecialPages.php job to be per-shard (1 comment) [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: Legoktm)
[05:56:59] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:57:12] (CR) Giuseppe Lavagetto: [C: -2] "I don't think this is a good idea." [puppet] - https://gerrit.wikimedia.org/r/803560 (owner: Jbond)
[05:58:25] (PS4) Tim Starling: [MultiDC] Switch $wgCentralAuthTokenCacheType to mcrouter-primary-dc [mediawiki-config] - https://gerrit.wikimedia.org/r/809326 (https://phabricator.wikimedia.org/T278392) (owner: Krinkle)
[05:58:27] (PS1) Marostegui: db2080: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/812183 (https://phabricator.wikimedia.org/T312618)
[05:59:03] (CR) Tim Starling: [C: +2] [MultiDC] Switch $wgCentralAuthTokenCacheType to mcrouter-primary-dc [mediawiki-config] - https://gerrit.wikimedia.org/r/809326 (https://phabricator.wikimedia.org/T278392) (owner: Krinkle)
[05:59:27] (CR) Marostegui: [C: +2] db2080: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/812183 (https://phabricator.wikimedia.org/T312618) (owner: Marostegui)
[06:00:06] (Merged) jenkins-bot: [MultiDC] Switch $wgCentralAuthTokenCacheType to mcrouter-primary-dc [mediawiki-config] - https://gerrit.wikimedia.org/r/809326
(https://phabricator.wikimedia.org/T278392) (owner: Krinkle)
[06:00:55] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8808.service,thumbor@8812.service,thumbor@8816.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:55] (CR) Giuseppe Lavagetto: [C: -1] "I like the idea of adding a default (but it should probably be a working, correct one, like /var/lib/operations/private/requestctl), but t" [software/conftool] - https://gerrit.wikimedia.org/r/805338 (owner: Jbond)
[06:02:45] (PS1) Marostegui: install_server: Do not reimage db216[0|1|5] [puppet] - https://gerrit.wikimedia.org/r/812185 (https://phabricator.wikimedia.org/T311493)
[06:03:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[06:04:08] (CR) Marostegui: [C: +2] install_server: Do not reimage db216[0|1|5] [puppet] - https://gerrit.wikimedia.org/r/812185 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[06:04:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[06:04:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[06:05:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[06:05:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2016.codfw.wmnet with OS bullseye
[06:05:30] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2016.codfw.wmnet with OS bullseye completed: - ganeti2016 (**PASS**) - Downtimed on...
[06:05:36] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Switch $wgCentralAuthTokenCacheType to mcrouter-primary-dc (duration: 03m 18s)
[06:06:45] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:39:27] (CR) Filippo Giunchedi: [C: +1] alertmanager: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/812175 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[06:39:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet
[06:45:09] (CR) Filippo Giunchedi: [C: +1] "Just a nit inline, LGTM otherwise" [puppet] - https://gerrit.wikimedia.org/r/811790 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang)
[06:47:35] !log on mwmaint2002: using iptables to simulate cross-DC memcached traffic loss
[06:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet
[06:52:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2016.codfw.wmnet to cluster codfw and group D
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220708T0700)
[07:00:32] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:01:42] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:06:33] (CR) Filippo Giunchedi: "LGTM overall, you mentioned yesterday on IRC concurrency for requests, is the concurrency implicit here (i.e. all parametrized tests will " [alerts] - https://gerrit.wikimedia.org/r/812011 (owner: David Caro)
[07:07:01] (PS5) Slyngshede: Add Nagios script for monitoring the Dell PERC RAID controller. [puppet] - https://gerrit.wikimedia.org/r/811667 (https://phabricator.wikimedia.org/T297913)
[07:07:02] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:07:19] (CR) Slyngshede: Add Nagios script for monitoring the Dell PERC RAID controller. (3 comments) [puppet] - https://gerrit.wikimedia.org/r/811667 (https://phabricator.wikimedia.org/T297913) (owner: Slyngshede)
[07:08:02] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/805873 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[07:13:31] (CR) Filippo Giunchedi: "Thank you Daniel, LGTM overall and see inline" [puppet] - https://gerrit.wikimedia.org/r/807176 (owner: Dzahn)
[07:22:37] !log reboot rdb1010 for kernel upgrades
[07:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:56] !log restart pybal on lvs6002
[07:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:09] PROBLEM - Host rdb1009 is DOWN: PING CRITICAL - Packet loss = 100%
[07:32:13] RECOVERY - Host rdb1009 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[07:33:54] !log reboot rdb1009 for kernel upgrades
[07:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:17] vgutierrez: did I forget to restart lvs6002 yesterday?
I don't remember seeing it in alertmanager
[07:36:05] (CR) Giuseppe Lavagetto: [C: +1] mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: Dzahn)
[07:39:42] (PS1) Muehlenhoff: postgres: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/812230
[07:40:26] (PS3) David Caro: tests: add a test to ensure that the runbook is accessible if there is one [alerts] - https://gerrit.wikimedia.org/r/812011
[07:40:32] (CR) David Caro: tests: add a test to ensure that the runbook is accessible if there is one (2 comments) [alerts] - https://gerrit.wikimedia.org/r/812011 (owner: David Caro)
[07:40:33] akosiaris: apparently yep
[07:40:55] (PS4) David Caro: tests: add test to ensure that the runbook exists if there is one [alerts] - https://gerrit.wikimedia.org/r/812011
[07:41:04] vgutierrez: btw, we are going to have to switch from conf100[456] all those pybal
[07:41:11] I guess next-next week ?
[07:41:29] sounds ogod
[07:41:30] *good
[07:41:54] is icinga super slow or am I super impatient today?
[07:42:05] icinga is always lagging now
[07:42:11] severely
[07:43:36] it doesn't seem slower than it's not-very-excellent base line to me
[07:43:55] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/812230 (owner: Muehlenhoff)
[07:43:59] (CR) David Caro: [C: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/811718 (owner: Majavah)
[07:45:22] (CR) David Caro: P:toolforge::static: add blackbox monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/811719 (owner: Majavah)
[07:45:39] (PS5) David Caro: P:toolforge::static: remove HTTPS enforcement [puppet] - https://gerrit.wikimedia.org/r/811718 (owner: Majavah)
[07:46:17] RECOVERY - PyBal connections to etcd on lvs6002 is OK: OK: 4 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[07:46:27] thx icinga
[07:48:53] (PS1) Muehlenhoff: maps: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/812232
[07:54:31] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/812232 (owner: Muehlenhoff)
[07:54:55] (PS5) Majavah: P:toolforge::static: add blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/811719
[07:55:10] (PS6) Majavah: P:toolforge::static: add blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/811719
[07:55:36] (CR) Majavah: P:toolforge::static: add blackbox monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/811719 (owner: Majavah)
[07:58:34] (PS1) Muehlenhoff: Drop the temporary ifdef before it turns two years old [puppet] - https://gerrit.wikimedia.org/r/812235
[07:58:42] (CR) David Caro: [C: +2] P:toolforge::static: add blackbox monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/811719 (owner: Majavah)
[08:00:04] (CR) Majavah: P:toolforge::static: add blackbox monitoring (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/811719 (owner: Majavah)
[08:01:09] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:56] SRE, MediaWiki-extensions-Score, Shellbox, serviceops, Sustainability (Incident Followup): Reduce Lilypond shellouts from VisualEditor - https://phabricator.wikimedia.org/T312319 (Legoktm)
[08:04:32] (CR) David Caro: [C: +2] "Delete delete delete! ❤️" [puppet] - https://gerrit.wikimedia.org/r/812043 (owner: Majavah)
[08:05:58] SRE, MediaWiki-extensions-Score, Shellbox, serviceops, Sustainability (Incident Followup): Reduce Lilypond shellouts from VisualEditor - https://phabricator.wikimedia.org/T312319 (Legoktm) p:Triage→High I'm triaging this as high priority because it is causing temporary outages of Shel...
[08:08:29] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:09:28] (CR) David Caro: [C: +2] P:wmcs: unify toolsdb profiles [puppet] - https://gerrit.wikimedia.org/r/789611 (owner: Majavah)
[08:10:11] (CR) David Caro: [C: +2] "There's a merge conflict :/" [puppet] - https://gerrit.wikimedia.org/r/789611 (owner: Majavah)
[08:13:22] (CR) Alexandros Kosiaris: [C: +1] mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: Dzahn)
[08:14:44] (CR) David Caro: "Looks good to me, but I'd prefer waiting for Jbond or Moritz to verify before merging."
[puppet] - 10https://gerrit.wikimedia.org/r/797293 (owner: 10Majavah) [08:16:03] (03PS5) 10Majavah: P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 [08:17:13] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36225/console" [puppet] - 10https://gerrit.wikimedia.org/r/789611 (owner: 10Majavah) [08:18:09] (03CR) 10Majavah: [V: 03+1] P:wmcs: unify toolsdb profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789611 (owner: 10Majavah) [08:18:43] (03PS1) 10Muehlenhoff: prometheus::mysqld_exporter: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/812239 [08:21:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/812239 (owner: 10Muehlenhoff) [08:23:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811667 (https://phabricator.wikimedia.org/T297913) (owner: 10Slyngshede) [08:25:33] (03PS1) 10David Caro: Revert "P:toolforge::static: add blackbox monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/812115 [08:26:44] (03CR) 10Majavah: [C: 03+1] Revert "P:toolforge::static: add blackbox monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/812115 (owner: 10David Caro) [08:30:01] (03PS1) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [08:31:22] (03PS2) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [08:33:44] (03CR) 10David Caro: [C: 03+2] Revert "P:toolforge::static: add blackbox monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/812115 (owner: 10David Caro) [08:35:05] (03CR) 10David Caro: "Waiting on this as it requires restarting the toolsdb master and/or manually adjusting the perf_schema variables" [puppet] - 10https://gerrit.wikimedia.org/r/789611 (owner: 
10Majavah) [08:40:13] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:58:54] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10Jgiannelos) >>>! In T307184#8061581, @Jgiannelos wrote: >> From a tegola development point of view I think it will be complicate... [09:09:11] (03PS1) 10Muehlenhoff: Remove deneb from docker registry ACL [puppet] - 10https://gerrit.wikimedia.org/r/812247 (https://phabricator.wikimedia.org/T298463) [09:09:31] (03PS3) 10Vgutierrez: haproxy: Log backend saturation detection [puppet] - 10https://gerrit.wikimedia.org/r/811914 (https://phabricator.wikimedia.org/T306580) [09:12:52] 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (10BTullis) [09:13:35] (03CR) 10JMeybohm: [C: 03+1] Remove deneb from docker registry ACL [puppet] - 10https://gerrit.wikimedia.org/r/812247 (https://phabricator.wikimedia.org/T298463) (owner: 10Muehlenhoff) [09:15:24] (03PS1) 10Slyngshede: c:raid::perccli add PowerEdge RAID Controller monitoring to Icinga. 
[puppet] - 10https://gerrit.wikimedia.org/r/812250 (https://phabricator.wikimedia.org/T297913) [09:23:34] (03CR) 10Muehlenhoff: [C: 03+2] Remove deneb from docker registry ACL [puppet] - 10https://gerrit.wikimedia.org/r/812247 (https://phabricator.wikimedia.org/T298463) (owner: 10Muehlenhoff) [09:25:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2016.codfw.wmnet to cluster codfw and group D [09:28:35] (03CR) 10Jelto: [C: 03+2] wikimedia.org: remove gitlab-old.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/811912 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [09:28:58] (03PS2) 10Jelto: wikimedia.org: remove gitlab-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/811912 (https://phabricator.wikimedia.org/T307142) [09:30:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] k8s: Retry checks for expected pods on drain [software/spicerack] - 10https://gerrit.wikimedia.org/r/811331 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:33:02] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I actually would prefer to always return a list for taints. See rationale inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:36:20] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "A small question, otherwise LGTM." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/811983 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:38:42] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [09:38:56] (03PS5) 10David Caro: tests: add a test to ensure that the runbook is accessible if there [alerts] - 10https://gerrit.wikimedia.org/r/812011 [09:38:58] (03PS2) 10David Caro: wmcs: add test to ensure we add a runbook to each alert [alerts] - 10https://gerrit.wikimedia.org/r/812013 [09:39:00] (03PS3) 10David Caro: wmcs: add openstack nodes down alerts [alerts] - 10https://gerrit.wikimedia.org/r/811999 [09:39:02] (03PS1) 10David Caro: wmcs: add alerts for any node going down [alerts] - 10https://gerrit.wikimedia.org/r/812259 [09:40:04] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti2027.codfw.wmnet with reason: Temporarily remove from Ganeti cluster for reimage [09:40:13] (03PS6) 10David Caro: tests: add test to ensure that runbook existis if set [alerts] - 10https://gerrit.wikimedia.org/r/812011 [09:40:15] (03PS3) 10David Caro: wmcs: add test to ensure we add a runbook to each alert [alerts] - 10https://gerrit.wikimedia.org/r/812013 [09:40:17] (03PS2) 10David Caro: wmcs: add alerts for any node going down [alerts] - 10https://gerrit.wikimedia.org/r/812259 [09:40:19] (03PS4) 10David Caro: wmcs: add openstack nodes down alerts [alerts] - 10https://gerrit.wikimedia.org/r/811999 [09:40:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti2027.codfw.wmnet with reason: Temporarily remove from Ganeti cluster for reimage [09:40:30] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:40:42] (03CR) 10David Caro: "Finally, sorry, rebased to merge first and modified the commit message" 
[alerts] - 10https://gerrit.wikimedia.org/r/812011 (owner: 10David Caro) [09:41:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service-proxy: Set SNI and Host header for ingress services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/811744 (https://phabricator.wikimedia.org/T312225) (owner: 10JMeybohm) [09:43:10] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM, but fix the typo." [deployment-charts] - 10https://gerrit.wikimedia.org/r/811751 (owner: 10JMeybohm) [09:44:27] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Remove the need for charts to define services_proxy fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/811752 (owner: 10JMeybohm) [09:50:55] 10ops-codfw: Reset mgmt password for ganeti2027 - https://phabricator.wikimedia.org/T312627 (10MoritzMuehlenhoff) [09:55:25] (03PS2) 10JMeybohm: k8s: Retry checks for expected pods on drain [software/spicerack] - 10https://gerrit.wikimedia.org/r/811331 (https://phabricator.wikimedia.org/T260661) [09:55:27] (03PS4) 10JMeybohm: k8s: Add KubernetesNode.taints propertry [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) [09:55:29] (03PS2) 10JMeybohm: k8s: Retry pod evictions on HTTP 429 from API server [software/spicerack] - 10https://gerrit.wikimedia.org/r/811983 (https://phabricator.wikimedia.org/T260661) [10:01:35] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36226/console" [puppet] - 10https://gerrit.wikimedia.org/r/809996 (https://phabricator.wikimedia.org/T311652) (owner: 10Btullis) [10:02:06] (03PS1) 10Muehlenhoff: Remove Puppet references for deneb [puppet] - 10https://gerrit.wikimedia.org/r/812261 (https://phabricator.wikimedia.org/T298463) [10:07:32] (03CR) 10Muehlenhoff: [C: 03+2] Remove Puppet references for deneb [puppet] - 10https://gerrit.wikimedia.org/r/812261 (https://phabricator.wikimedia.org/T298463) (owner: 10Muehlenhoff) [10:09:48] 
(ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [10:11:55] 10ops-codfw: Reset mgmt password for ganeti2027 - https://phabricator.wikimedia.org/T312627 (10MoritzMuehlenhoff) a:03Papaul [10:12:26] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts deneb.codfw.wmnet [10:13:00] (03CR) 10Hnowlan: [C: 03+1] "lgtm, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/812232 (owner: 10Muehlenhoff) [10:14:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [10:16:08] (03PS4) 10Btullis: Assign new password to Cassandra superuser [labs/private] - 10https://gerrit.wikimedia.org/r/809639 (https://phabricator.wikimedia.org/T311652) (owner: 10Eevans) [10:16:30] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:16:37] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10cmooney) 05Open→03Resolved Ok thanks @nskaggs. I'm going to close this task now as I believe everything is confi... 
[10:20:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:20:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts deneb.codfw.wmnet [10:20:58] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup a new build host based on bullseye - https://phabricator.wikimedia.org/T298463 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `deneb.codfw.wmnet` - deneb.codfw.wmnet (**PASS**) - Downtimed host on I... [10:21:37] (03PS1) 10Jelto: gitlab_runner: Allow DNS requests from GitLab runner containers in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/812264 (https://phabricator.wikimedia.org/T311241) [10:23:56] (03CR) 10Btullis: [V: 03+2 C: 03+2] Assign new password to Cassandra superuser [labs/private] - 10https://gerrit.wikimedia.org/r/809639 (https://phabricator.wikimedia.org/T311652) (owner: 10Eevans) [10:24:33] (03CR) 10Slyngshede: [C: 03+2] Add Nagios script for monitoring the Dell PERC RAID controller. [puppet] - 10https://gerrit.wikimedia.org/r/811667 (https://phabricator.wikimedia.org/T297913) (owner: 10Slyngshede) [10:25:26] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36227/console" [puppet] - 10https://gerrit.wikimedia.org/r/809996 (https://phabricator.wikimedia.org/T311652) (owner: 10Btullis) [10:29:17] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36228/console" [puppet] - 10https://gerrit.wikimedia.org/r/809996 (https://phabricator.wikimedia.org/T311652) (owner: 10Btullis) [10:34:06] 10SRE, 10Infrastructure-Foundations, 10netops: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (10cmooney) Good suggestion. The discrepancy isn't ideal but I think a little asymmetry is worth it if we can improve performance. 
+1 [10:34:59] (03CR) 10Hnowlan: [C: 03+2] "lgtm, thanks!" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/811257 (https://phabricator.wikimedia.org/T312103) (owner: 10Vlad.shapik) [10:36:46] (03Merged) 10jenkins-bot: Adjust the online tests to new changes in the thumbor functionality [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/811257 (https://phabricator.wikimedia.org/T312103) (owner: 10Vlad.shapik) [10:37:16] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Improve Netbox import script to avoid port-number collisions in JunOS - https://phabricator.wikimedia.org/T301392 (10cmooney) 05Open→03Resolved [10:40:02] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10Jgiannelos) Regarding cache-control. Here is the output of my local setup with swift API running on `http:127.0.0.1:8080` ` bash... [10:41:22] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10cmooney) [10:41:24] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (10cmooney) 05Open→03Resolved [10:42:53] (03PS2) 10Btullis: Add a hiera alias for the cassandra superuser password to AQS [puppet] - 10https://gerrit.wikimedia.org/r/809996 (https://phabricator.wikimedia.org/T311652) [10:44:49] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup a new build host based on bullseye - https://phabricator.wikimedia.org/T298463 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff build2001.codfw.wmnet has been created and image building/reporting was switched to it. The old hos... 
[10:49:13] 10SRE, 10Infrastructure-Foundations, 10netops: Move interface VRF assignment to Netbox - https://phabricator.wikimedia.org/T310715 (10cmooney) [10:51:04] (03PS1) 10Muehlenhoff: druid: Fixed UID/GIDs are universally in use now [puppet] - 10https://gerrit.wikimedia.org/r/812286 [10:53:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/812286 (owner: 10Muehlenhoff) [10:53:06] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:55:16] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:55:32] (03PS1) 10Muehlenhoff: Avoid direct references [puppet] - 10https://gerrit.wikimedia.org/r/812287 [10:56:38] (03PS1) 10Ayounsi: Add parent support for servers interfaces creation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812288 (https://phabricator.wikimedia.org/T296832) [10:57:12] (03CR) 10Ayounsi: [C: 04-1] "reviews welcome but not tested yet." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812288 (https://phabricator.wikimedia.org/T296832) (owner: 10Ayounsi) [10:59:34] (03PS3) 10Giuseppe Lavagetto: mtail: fix regexes due to changes in apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/811934 (https://phabricator.wikimedia.org/T312634) [11:05:40] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10ayounsi) I took a naive approach with the above patch (only checks for the usual "dot" delimiter. Similarly I tested... 
[11:10:26] (03PS2) 10Ayounsi: Add parent support for servers interfaces creation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812288 (https://phabricator.wikimedia.org/T296832) [11:11:31] (03PS1) 10Btullis: Correct the hiera entries for the aqs cassandra super user [labs/private] - 10https://gerrit.wikimedia.org/r/812289 (https://phabricator.wikimedia.org/T311652) [11:12:59] (03PS1) 10Slyngshede: c:ganeti::prometheus Enable Prometheus exporter for Ganeti on Buster [puppet] - 10https://gerrit.wikimedia.org/r/812290 [11:13:32] (03CR) 10Btullis: [V: 03+2 C: 03+2] Correct the hiera entries for the aqs cassandra super user [labs/private] - 10https://gerrit.wikimedia.org/r/812289 (https://phabricator.wikimedia.org/T311652) (owner: 10Btullis) [11:15:48] RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:03] (03CR) 10Filippo Giunchedi: [C: 03+1] Avoid direct references [puppet] - 10https://gerrit.wikimedia.org/r/812287 (owner: 10Muehlenhoff) [11:17:17] (03PS3) 10Btullis: Add a hiera alias for the cassandra superuser password to AQS [puppet] - 10https://gerrit.wikimedia.org/r/809996 (https://phabricator.wikimedia.org/T311652) [11:17:42] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:19:05] (03CR) 10Filippo Giunchedi: "LGTM! 
Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/812235 (owner: 10Muehlenhoff) [11:19:27] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36231/console" [puppet] - 10https://gerrit.wikimedia.org/r/809996 (https://phabricator.wikimedia.org/T311652) (owner: 10Btullis) [11:19:29] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [11:19:38] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:20:05] (03CR) 10Muehlenhoff: [C: 03+2] Drop the temporary ifdef before it turns two years old [puppet] - 10https://gerrit.wikimedia.org/r/812235 (owner: 10Muehlenhoff) [11:20:07] (03PS1) 10Cathal Mooney: Return Vlan object if no IP prefix warning is triggered [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812293 [11:20:46] RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:49] (03CR) 10CI reject: [V: 04-1] Return Vlan object if no IP prefix warning is triggered [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812293 (owner: 10Cathal Mooney) [11:22:58] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8811.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8815.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:33] (03PS1) 10Muehlenhoff: build_envoy_deb.sh: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/812294 [11:27:24] (03CR) 10Filippo Giunchedi: tests: add test to 
ensure that runbook existis if set (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/812011 (owner: 10David Caro) [11:28:00] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8809.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:08] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:38:58] (03CR) 10Muehlenhoff: c:ganeti::prometheus Enable Prometheus exporter for Ganeti on Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812290 (owner: 10Slyngshede) [11:46:19] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10fgiunchedi) >>! In T293209#8048107, @Volans wrote: >>>! In T293209#8043558, @fgiunchedi wrot... [11:47:04] (03PS2) 10Slyngshede: c:ganeti::prometheus Enable Prometheus exporter for Ganeti on Buster [puppet] - 10https://gerrit.wikimedia.org/r/812290 [11:47:35] (03CR) 10Slyngshede: c:ganeti::prometheus Enable Prometheus exporter for Ganeti on Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812290 (owner: 10Slyngshede) [11:50:12] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [11:51:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one question inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/812290 (owner: 10Slyngshede) [11:52:30] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/809996 (https://phabricator.wikimedia.org/T311652) (owner: 10Btullis) [11:52:42] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:55:00] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:56:10] (03CR) 10Btullis: [V: 03+1] Add a hiera alias for the cassandra superuser password to AQS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809996 (https://phabricator.wikimedia.org/T311652) (owner: 10Btullis) [11:56:22] (03Abandoned) 10Btullis: Add a hiera alias for the cassandra superuser password to AQS [puppet] - 10https://gerrit.wikimedia.org/r/809996 (https://phabricator.wikimedia.org/T311652) (owner: 10Btullis) [12:00:14] (03PS2) 10Cathal Mooney: Return Vlan object if no IP prefix warning is triggered [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812293 [12:01:57] (03PS1) 10Muehlenhoff: lxc: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/812303 [12:04:32] (03CR) 10Btullis: "Joe just pointed out a better way of triggering our alerts, which means that we no longer need this functionality right now. 
https://phabr" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [12:04:43] (03Abandoned) 10Btullis: Add a host's conftool pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [12:05:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/812303 (owner: 10Muehlenhoff) [12:17:59] (03CR) 10Jelto: vrts/prometheus: comment out broken check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812158 (owner: 10Dzahn) [12:26:46] (03PS1) 10Kosta Harlan: AddImage: Only process metadata for a single valid suggestion [extensions/GrowthExperiments] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/812279 (https://phabricator.wikimedia.org/T312544) [12:28:56] thcipriani: jnuche: I'd like to do an emergency deploy for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/812279 -- context is T312544. (cc urbanecm) [12:28:57] T312544: Deferred update 'MWCallableUpdate_GrowthExperiments\NewcomerTasks\TaskSetListener->run' failed to run. 
- https://phabricator.wikimedia.org/T312544 [12:29:41] (03PS5) 10JMeybohm: k8s: Add KubernetesNode.taints propertry [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) [12:29:43] (03PS3) 10JMeybohm: k8s: Retry pod evictions on HTTP 429 from API server [software/spicerack] - 10https://gerrit.wikimedia.org/r/811983 (https://phabricator.wikimedia.org/T260661) [12:31:13] (03PS6) 10JMeybohm: k8s: Add KubernetesNode.taints propertry [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) [12:31:15] (03PS4) 10JMeybohm: k8s: Retry pod evictions on HTTP 429 from API server [software/spicerack] - 10https://gerrit.wikimedia.org/r/811983 (https://phabricator.wikimedia.org/T260661) [12:33:19] (03CR) 10JMeybohm: k8s: Add KubernetesNode.taints propertry (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [12:33:45] 10SRE, 10ops-codfw: Reset mgmt password for ganeti2027 - https://phabricator.wikimedia.org/T312627 (10Papaul) 05Open→03Resolved @MoritzMuehlenhoff password reset done [12:33:50] (03CR) 10JMeybohm: k8s: Retry pod evictions on HTTP 429 from API server (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/811983 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [12:35:29] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) @ayounsi yes i can complete cr1 but of course with your help. 
Thanks [12:39:26] (03PS3) 10JMeybohm: service-proxy: Set SNI and Host header for ingress services [deployment-charts] - 10https://gerrit.wikimedia.org/r/811744 (https://phabricator.wikimedia.org/T312225) [12:39:28] (03PS4) 10JMeybohm: Use the generic services_proxy definition for envoy config [deployment-charts] - 10https://gerrit.wikimedia.org/r/811751 [12:39:30] (03PS4) 10JMeybohm: Remove the need for charts to define services_proxy fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/811752 [12:39:56] (03CR) 10JMeybohm: Use the generic services_proxy definition for envoy config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/811751 (owner: 10JMeybohm) [12:40:48] (03PS3) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [12:40:50] (03PS1) 10Alexandros Kosiaris: Port to python3 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812306 [12:41:57] (03CR) 10CI reject: [V: 04-1] Port to python3 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812306 (owner: 10Alexandros Kosiaris) [12:41:59] (03CR) 10CI reject: [V: 04-1] Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 (owner: 10Alexandros Kosiaris) [12:45:17] (03PS1) 10Marostegui: Revert "db1160: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/812280 [12:46:44] (03PS2) 10Alexandros Kosiaris: Port to python3 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812306 [12:46:46] (03PS4) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [12:47:21] (03CR) 10CI reject: [V: 04-1] Port to python3 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812306 (owner: 10Alexandros Kosiaris) [12:47:23] (03CR) 10CI reject: [V: 04-1] Release etcd-mirror 0.0.9, dropping python2 dependencies 
[software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 (owner: 10Alexandros Kosiaris) [12:47:49] (03CR) 10Marostegui: [C: 03+2] Revert "db1160: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/812280 (owner: 10Marostegui) [12:48:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30970 and previous config saved to /var/cache/conftool/dbconfig/20220708-124844-root.json [12:51:08] RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:14] 10SRE-swift-storage: Adjust ms ring min_part_hours to 12 hours - https://phabricator.wikimedia.org/T312643 (10MatthewVernon) [12:54:02] (03PS3) 10Alexandros Kosiaris: Port to python3 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812306 [12:54:04] (03PS5) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [12:54:52] (03CR) 10CI reject: [V: 04-1] Port to python3 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812306 (owner: 10Alexandros Kosiaris) [12:54:56] (03CR) 10CI reject: [V: 04-1] Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 (owner: 10Alexandros Kosiaris) [12:56:42] (03CR) 10David Caro: tests: add test to ensure that runbook existis if set (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/812011 (owner: 10David Caro) [12:57:14] (03CR) 10Kosta Harlan: "I'm stepping away for a while, but in case anyone wants to backport this, here are some testing instructions:" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/812279 (https://phabricator.wikimedia.org/T312544) (owner: 10Kosta Harlan) [12:58:32] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL 
- degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8809.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:11] (03PS4) 10Alexandros Kosiaris: Port to python3 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812306 [13:02:13] (03PS6) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [13:03:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 2%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30971 and previous config saved to /var/cache/conftool/dbconfig/20220708-130348-root.json [13:07:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] Port to python3 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812306 (owner: 10Alexandros Kosiaris) [13:08:37] (03Merged) 10jenkins-bot: Port to python3 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812306 (owner: 10Alexandros Kosiaris) [13:11:39] (03CR) 10Alexandros Kosiaris: "recheck" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 (owner: 10Alexandros Kosiaris) [13:16:02] (03PS1) 10David Caro: wmcs: add systemd unit down alerts [alerts] - 10https://gerrit.wikimedia.org/r/812313 [13:18:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30973 and previous config saved to /var/cache/conftool/dbconfig/20220708-131852-root.json [13:19:10] (03PS7) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [13:22:13] (03CR) 10MVernon: [C: 03+1] "Looks good (and a noop in effect) to me, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/811932 (https://phabricator.wikimedia.org/T311690) (owner: 10Filippo Giunchedi) [13:24:01] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks - are you planning on trimming the 5m samples, too?" [puppet] - 10https://gerrit.wikimedia.org/r/811933 (https://phabricator.wikimedia.org/T311690) (owner: 10Filippo Giunchedi) [13:26:54] (03CR) 10MVernon: "Sorry, probably a stupid question - but have you (automatically) checked that there are no non-WMF contributions to these files, or are yo" [puppet] - 10https://gerrit.wikimedia.org/r/811233 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:27:36] (03CR) 10Nskaggs: wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914 (owner: 10David Caro) [13:28:47] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/810276 (https://phabricator.wikimedia.org/T297959) (owner: 10Filippo Giunchedi) [13:30:56] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36232/console" [puppet] - 10https://gerrit.wikimedia.org/r/811914 (https://phabricator.wikimedia.org/T306580) (owner: 10Vgutierrez) [13:32:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:33:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30974 and previous config saved to /var/cache/conftool/dbconfig/20220708-133356-root.json [13:35:43] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in 
Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) Super work! I'll maybe try to dig into the puppet custom facts stuff, be a chance to learn some Ruby I gues... [13:37:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:41] I am going to deploy a hotfix for GrowthExperiments https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/812279 [13:40:07] (03CR) 10Hashar: [C: 03+2] "Thanks for the detailed test instructions. I am rolling it and will test on mwdebug1001." [extensions/GrowthExperiments] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/812279 (https://phabricator.wikimedia.org/T312544) (owner: 10Kosta Harlan) [13:41:42] (03CR) 10Cathal Mooney: [C: 03+1] "Nice work! We can probably improve with the full data in puppetdb via custom_facts, which will also support cases where the interfaces don" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812288 (https://phabricator.wikimedia.org/T296832) (owner: 10Ayounsi) [13:42:03] (03CR) 10Cathal Mooney: Add parent support for servers interfaces creation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812288 (https://phabricator.wikimedia.org/T296832) (owner: 10Ayounsi) [13:49:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30975 and previous config saved to /var/cache/conftool/dbconfig/20220708-134900-root.json [13:49:38] (03CR) 10Muehlenhoff: swift: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811233 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:49:50] (03PS7) 10David Caro: tests: add test to ensure that runbook existis if set [alerts] - 
10https://gerrit.wikimedia.org/r/812011 [13:49:58] (03PS4) 10David Caro: wmcs: add test to ensure we add a runbook to each alert [alerts] - 10https://gerrit.wikimedia.org/r/812013 [13:50:02] (03PS3) 10David Caro: wmcs: add alerts for any node going down [alerts] - 10https://gerrit.wikimedia.org/r/812259 [13:50:06] (03PS2) 10David Caro: wmcs: add systemd unit down alerts [alerts] - 10https://gerrit.wikimedia.org/r/812313 [13:50:10] (03PS5) 10David Caro: wmcs: add openstack nodes down alerts [alerts] - 10https://gerrit.wikimedia.org/r/811999 [13:50:14] (03PS1) 10Dzahn: Revert "vrts/prometheus: comment out broken check" [puppet] - 10https://gerrit.wikimedia.org/r/812282 [13:50:40] (03CR) 10MVernon: [C: 03+1] swift: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811233 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:51:10] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:51:20] (03CR) 10David Caro: tests: add test to ensure that runbook existis if set (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/812011 (owner: 10David Caro) [13:51:25] (03CR) 10Dzahn: [C: 03+2] "Yes Jelto, I wanted" [puppet] - 10https://gerrit.wikimedia.org/r/812158 (owner: 10Dzahn) [13:51:43] (03CR) 10David Caro: tests: add test to ensure that runbook existis if set (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/812011 (owner: 10David Caro) [13:51:57] (03CR) 10CI reject: [V: 04-1] wmcs: add alerts for any node going down [alerts] - 10https://gerrit.wikimedia.org/r/812259 (owner: 10David Caro) [13:52:05] (03CR) 10CI reject: [V: 04-1] wmcs: add test to ensure we add a runbook to each alert [alerts] - 10https://gerrit.wikimedia.org/r/812013 (owner: 10David Caro) [13:52:11] (03CR) 10Dzahn: [C: 03+2] vrts/prometheus: comment out broken check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812158 
(owner: 10Dzahn) [13:52:42] (03CR) 10CI reject: [V: 04-1] tests: add test to ensure that runbook existis if set [alerts] - 10https://gerrit.wikimedia.org/r/812011 (owner: 10David Caro) [13:53:20] (03CR) 10CI reject: [V: 04-1] wmcs: add systemd unit down alerts [alerts] - 10https://gerrit.wikimedia.org/r/812313 (owner: 10David Caro) [13:53:22] (03CR) 10CI reject: [V: 04-1] wmcs: add openstack nodes down alerts [alerts] - 10https://gerrit.wikimedia.org/r/811999 (owner: 10David Caro) [13:53:28] (03PS8) 10David Caro: tests: add test to ensure that runbook existis if set [alerts] - 10https://gerrit.wikimedia.org/r/812011 [13:53:30] (03PS5) 10David Caro: wmcs: add test to ensure we add a runbook to each alert [alerts] - 10https://gerrit.wikimedia.org/r/812013 [13:53:32] (03PS4) 10David Caro: wmcs: add alerts for any node going down [alerts] - 10https://gerrit.wikimedia.org/r/812259 [13:53:34] (03PS3) 10David Caro: wmcs: add systemd unit down alerts [alerts] - 10https://gerrit.wikimedia.org/r/812313 [13:53:36] (03PS6) 10David Caro: wmcs: add openstack nodes down alerts [alerts] - 10https://gerrit.wikimedia.org/r/811999 [13:56:34] (03PS1) 10JMeybohm: k8s/reboot-nodes: Error if nodes are cordoned [cookbooks] - 10https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661) [13:57:28] (03PS1) 10Dzahn: vrts/prometheus: re-activate commented check after fixing path [puppet] - 10https://gerrit.wikimedia.org/r/812326 [13:58:20] (03CR) 10JMeybohm: [C: 03+2] Remove the need for charts to define services_proxy fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/811752 (owner: 10JMeybohm) [13:58:36] (03CR) 10JMeybohm: [C: 03+2] Use the generic services_proxy definition for envoy config [deployment-charts] - 10https://gerrit.wikimedia.org/r/811751 (owner: 10JMeybohm) [13:58:39] (03CR) 10JMeybohm: [C: 03+2] service-proxy: Set SNI and Host header for ingress services [deployment-charts] - 10https://gerrit.wikimedia.org/r/811744 
(https://phabricator.wikimedia.org/T312225) (owner: 10JMeybohm) [14:00:26] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/812326 (owner: 10Dzahn) [14:00:36] (03PS2) 10JMeybohm: k8s/reboot-nodes: Error if nodes are cordoned [cookbooks] - 10https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661) [14:01:12] (03CR) 10Jelto: [C: 03+1] "lgtm in combination with related change I9cddb2a2e6f98a219d3c88a5d38e288d573cdd06" [puppet] - 10https://gerrit.wikimedia.org/r/812282 (owner: 10Dzahn) [14:01:41] (03Merged) 10jenkins-bot: AddImage: Only process metadata for a single valid suggestion [extensions/GrowthExperiments] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/812279 (https://phabricator.wikimedia.org/T312544) (owner: 10Kosta Harlan) [14:02:46] (03Merged) 10jenkins-bot: service-proxy: Set SNI and Host header for ingress services [deployment-charts] - 10https://gerrit.wikimedia.org/r/811744 (https://phabricator.wikimedia.org/T312225) (owner: 10JMeybohm) [14:03:13] (03Merged) 10jenkins-bot: Use the generic services_proxy definition for envoy config [deployment-charts] - 10https://gerrit.wikimedia.org/r/811751 (owner: 10JMeybohm) [14:03:15] (03Merged) 10jenkins-bot: Remove the need for charts to define services_proxy fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/811752 (owner: 10JMeybohm) [14:03:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:03:27] deploying to mwdebug1001 [14:03:35] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice! Thank you David, ship it!" 
[alerts] - 10https://gerrit.wikimedia.org/r/812011 (owner: 10David Caro) [14:04:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30978 and previous config saved to /var/cache/conftool/dbconfig/20220708-140404-root.json [14:06:14] (03PS3) 10Cathal Mooney: Return Vlan object if no IP prefix warning is triggered [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812293 [14:06:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:07:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:07:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:07:16] (03CR) 10David Caro: [C: 03+2] tests: add test to ensure that runbook existis if set (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/812011 (owner: 10David Caro) [14:08:03] (03PS1) 10Filippo Giunchedi: smokeping: remove DNS targets, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/812329 (https://phabricator.wikimedia.org/T169860) [14:08:05] (03PS1) 10Filippo Giunchedi: prometheus: add support to blackbox icmp probe hosts [puppet] - 10https://gerrit.wikimedia.org/r/812330 (https://phabricator.wikimedia.org/T169860) [14:08:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:08:07] (03PS1) 10Filippo Giunchedi: prometheus: blackbox icmp probes for hosts [puppet] - 10https://gerrit.wikimedia.org/r/812331 (https://phabricator.wikimedia.org/T169860) [14:08:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:09:18] !log bking@cumin1001 START - Cookbook sre.hosts.reimage 
for host cloudelastic1004.wikimedia.org with OS bullseye [14:09:23] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye [14:09:35] (03PS8) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [14:10:03] (03CR) 10JMeybohm: New service: function-evaluator (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [14:10:58] !log hashar@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/GrowthExperiments/includes/NewcomerTasks/AddImage/ServiceImageRecommendationProvider.php: AddImage: Only process metadata for a single valid suggestion - T312544 (duration: 03m 25s) [14:11:06] T312544: Deferred update 'MWCallableUpdate_GrowthExperiments\NewcomerTasks\TaskSetListener->run' failed to run. - https://phabricator.wikimedia.org/T312544 [14:14:00] urbanecm: kostajh: looks like the growth issue ^ has vanished successfully! [14:14:13] great!
[14:14:20] let's hope it stays like that [14:14:52] I love being able to deploy a patch which has a test procedure attached [14:15:04] that makes it a pleasant adventure [14:15:06] (03PS9) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [14:15:08] heh, i can see that [14:18:14] (03PS1) 10JMeybohm: Remove statsd from _scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/812333 [14:18:57] (03CR) 10Filippo Giunchedi: "Basic scaffolding to be able to ping an arbitrary list of hosts" [puppet] - 10https://gerrit.wikimedia.org/r/812330 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [14:19:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30983 and previous config saved to /var/cache/conftool/dbconfig/20220708-141907-root.json [14:19:19] (03CR) 10JMeybohm: New service: function-evaluator (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [14:19:50] (03PS10) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [14:19:54] (03CR) 10Filippo Giunchedi: "Not the perfect solution but good enough for now IMHO, and one step closer to being able to turn down smokeping!" 
[puppet] - 10https://gerrit.wikimedia.org/r/812331 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [14:20:54] (03PS8) 10David Caro: novafullstack: add types and some names refactor [puppet] - 10https://gerrit.wikimedia.org/r/810950 [14:20:56] (03PS6) 10David Caro: novafullstack: Refactor and minor fix [puppet] - 10https://gerrit.wikimedia.org/r/811316 [14:20:58] (03PS2) 10David Caro: novafullstack: generate prometheus stats too [puppet] - 10https://gerrit.wikimedia.org/r/812037 [14:21:31] (03CR) 10JMeybohm: Port to Python 3.5 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801646 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [14:22:20] (03CR) 10David Caro: novafullstack: Refactor and minor fix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811316 (owner: 10David Caro) [14:22:38] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1004.wikimedia.org with reason: host reimage [14:25:24] (03PS11) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [14:26:16] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1004.wikimedia.org with reason: host reimage [14:26:37] (03Abandoned) 10Alexandros Kosiaris: Port to Python 3.5 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801646 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [14:28:11] (03CR) 10CDanis: [C: 03+1] haproxy: Log backend saturation detection [puppet] - 10https://gerrit.wikimedia.org/r/811914 (https://phabricator.wikimedia.org/T306580) (owner: 10Vgutierrez) [14:28:36] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] haproxy: Log backend saturation detection [puppet] - 10https://gerrit.wikimedia.org/r/811914 (https://phabricator.wikimedia.org/T306580) (owner: 10Vgutierrez) [14:28:46] (03PS12) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, 
dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [14:32:04] (03PS13) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [14:34:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30990 and previous config saved to /var/cache/conftool/dbconfig/20220708-143411-root.json [14:38:02] (03CR) 10Majavah: [C: 04-1] "Nice! In openstack::nova::fullstack::service, could you add something to remove the prometheus file in case we change which host this runs" [puppet] - 10https://gerrit.wikimedia.org/r/812037 (owner: 10David Caro) [14:46:57] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1004.wikimedia.org with OS bullseye [14:47:02] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye completed: - cloudel... 
[14:48:09] (03CR) 10David Caro: novafullstack: generate prometheus stats too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812037 (owner: 10David Caro) [14:49:02] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye [14:49:07] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye [14:52:15] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:59:47] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1005.wikimedia.org with OS bullseye [14:59:52] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors... [15:00:03] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye [15:00:09] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye [15:03:55] (03PS14) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [15:06:41] (03CR) 10Lucas Werkmeister (WMDE): "I guess this was superseded by Icab18579bd? 
(I wasn’t aware this change existed at the time, I just found it now.)" [puppet] - 10https://gerrit.wikimedia.org/r/713870 (owner: 10Michael Große) [15:08:31] (03PS9) 10David Caro: novafullstack: add types and some names refactor [puppet] - 10https://gerrit.wikimedia.org/r/810950 [15:08:33] (03PS7) 10David Caro: novafullstack: Refactor and minor fix [puppet] - 10https://gerrit.wikimedia.org/r/811316 [15:08:35] (03PS3) 10David Caro: novafullstack: generate prometheus stats too [puppet] - 10https://gerrit.wikimedia.org/r/812037 [15:08:47] (03CR) 10David Caro: novafullstack: generate prometheus stats too (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/812037 (owner: 10David Caro) [15:09:35] (03CR) 10David Caro: novafullstack: generate prometheus stats too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812037 (owner: 10David Caro) [15:10:08] (03CR) 10David Caro: novafullstack: generate prometheus stats too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812037 (owner: 10David Caro) [15:10:40] (03PS4) 10David Caro: novafullstack: generate prometheus stats too [puppet] - 10https://gerrit.wikimedia.org/r/812037 [15:11:37] (03CR) 10Michael Große: Don't cache Query Builder index.html (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713870 (owner: 10Michael Große) [15:11:50] (03PS15) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [15:11:52] (03Abandoned) 10Michael Große: Don't cache Query Builder index.html [puppet] - 10https://gerrit.wikimedia.org/r/713870 (owner: 10Michael Große) [15:12:36] (03CR) 10David Caro: novafullstack: generate prometheus stats too (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/812037 (owner: 10David Caro) [15:12:51] (03CR) 10David Caro: novafullstack: Refactor and minor fix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811316 (owner: 10David Caro) [15:15:32] !log bking@cumin1001 END 
(ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1005.wikimedia.org with OS bullseye [15:15:37] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors... [15:15:50] (03PS1) 10David Caro: gitignore: add vscode editor [puppet] - 10https://gerrit.wikimedia.org/r/812343 [15:17:03] (03CR) 10Zabe: "see https://gerrit.wikimedia.org/r/c/operations/puppet/+/802630" [puppet] - 10https://gerrit.wikimedia.org/r/812343 (owner: 10David Caro) [15:20:25] (03CR) 10Ayounsi: [C: 03+1] Return Vlan object if no IP prefix warning is triggered [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812293 (owner: 10Cathal Mooney) [15:20:45] (03CR) 10Majavah: novafullstack: generate prometheus stats too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812037 (owner: 10David Caro) [15:24:28] (03PS16) 10Alexandros Kosiaris: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 [15:26:19] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10ayounsi) Depending on the depth of this rabbit hole, it might be better to focus on DHCP option 97 (which solves the... 
[15:27:44] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye [15:27:50] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye [15:27:51] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1005.wikimedia.org with OS bullseye [15:27:55] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors... [15:29:00] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1005.wikimedia.org with OS bullseye [15:29:06] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye [15:30:04] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Finally, debian-glue liked me!" 
[software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 (owner: 10Alexandros Kosiaris) [15:30:38] (03CR) 10David Caro: gitignore: add vscode (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802630 (owner: 10Samtar) [15:30:56] (03Merged) 10jenkins-bot: Release etcd-mirror 0.0.9, dropping python2 dependencies [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/812241 (owner: 10Alexandros Kosiaris) [15:31:35] (03CR) 10David Caro: gitignore: add vscode (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802630 (owner: 10Samtar) [15:33:45] (03PS2) 10David Caro: gitignore: add note to use global ignore file [puppet] - 10https://gerrit.wikimedia.org/r/812343 [15:34:33] (03PS3) 10David Caro: gitignore: add note to use global ignore file [puppet] - 10https://gerrit.wikimedia.org/r/812343 [15:34:35] (03CR) 10David Caro: gitignore: add note to use global ignore file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812343 (owner: 10David Caro) [15:40:14] (03CR) 10BCornwall: [C: 03+2] varnish: add VarnishHighMmapCount [alerts] - 10https://gerrit.wikimedia.org/r/805873 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [15:44:00] (03Merged) 10jenkins-bot: varnish: add VarnishHighMmapCount [alerts] - 10https://gerrit.wikimedia.org/r/805873 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [15:46:55] hashar urbanecm: thank you! [15:52:39] (03CR) 10Andrew Bogott: [C: 03+1] "We should do this! I sort of thought there was already enforcement about this but probably that's for new icinga alerts and not prometheus" [alerts] - 10https://gerrit.wikimedia.org/r/812013 (owner: 10David Caro) [15:53:19] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10Jgiannelos) On a second thought, this is for serving `cache-control` headers so not very relevant to our problem. 
[15:55:15] (03CR) 10BCornwall: [C: 03+1] smokeping: remove DNS targets, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/812329 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [16:25:09] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1005.wikimedia.org with OS bullseye [16:25:14] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors... [16:27:13] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:50] 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul) [16:38:27] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [16:54:29] (03PS1) 10Cathal Mooney: Add function to int_automation to validate QFX5120 port blocks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) [16:55:56] (03CR) 10CI reject: [V: 04-1] Add function to int_automation to validate QFX5120 port blocks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [16:57:30] (03PS2) 10Cathal Mooney: Add function to int_automation to validate QFX5120 port blocks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) [16:58:15] (03CR) 10CI reject: [V: 04-1] Add function to int_automation to validate QFX5120 port blocks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812376 
(https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [16:59:22] (03PS3) 10Cathal Mooney: Add function to int_automation to validate QFX5120 port blocks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) [16:59:29] (03PS1) 10DDesouza: QuickSurveys: Disable 'research-incentive' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812377 (https://phabricator.wikimedia.org/T311015) [17:00:06] (03CR) 10CI reject: [V: 04-1] Add function to int_automation to validate QFX5120 port blocks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [17:00:57] (03PS4) 10Cathal Mooney: Add function to int_automation to validate QFX5120 port blocks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) [17:03:43] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8812.service,thumbor@8816.service,thumbor@8817.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:37] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) I agree it's not worth massive effort, Option 97 is the better way to resolve the initial problem for sure.... [17:13:17] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list for WikiSound Audio Speaks Campaign - https://phabricator.wikimedia.org/T311230 (10jhathaway) @Anasskoko I know this request has been open for some time, do you still need the mailing list created? 
[17:14:02] (03PS1) 10Ayounsi: sre.network.debug: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/812380 [17:17:36] (03CR) 10CI reject: [V: 04-1] sre.network.debug: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/812380 (owner: 10Ayounsi) [17:19:35] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [17:20:54] (03PS1) 10BryanDavis: labweb: point tlsproxy envoy at %{facts.ipaddress}:8080 for striker [puppet] - 10https://gerrit.wikimedia.org/r/812381 (https://phabricator.wikimedia.org/T306469) [17:22:01] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [17:35:01] (03CR) 10Cathal Mooney: [C: 03+2] Return Vlan object if no IP prefix warning is triggered [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812293 (owner: 10Cathal Mooney) [17:35:47] (03Merged) 10jenkins-bot: Return Vlan object if no IP prefix warning is triggered [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812293 (owner: 10Cathal Mooney) [17:39:38] (03CR) 10BryanDavis: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/36233/" [puppet] - 10https://gerrit.wikimedia.org/r/812381 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [17:40:15] (03PS3) 10BryanDavis: hieradata: cloudweb-dev: route striker to the docker port [puppet] - 10https://gerrit.wikimedia.org/r/811332 (https://phabricator.wikimedia.org/T306469) (owner: 10Majavah) [17:43:34] (03CR) 10BryanDavis: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/36234/" [puppet] - 10https://gerrit.wikimedia.org/r/811332 (https://phabricator.wikimedia.org/T306469) (owner: 10Majavah) [17:44:27] (03PS2) 10BryanDavis: Revert "striker: Open firewall for Docker-managed service" [puppet] - 
10https://gerrit.wikimedia.org/r/811274 (https://phabricator.wikimedia.org/T306469) [17:49:33] (03CR) 10BryanDavis: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/36235/" [puppet] - 10https://gerrit.wikimedia.org/r/811274 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [17:56:28] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Support percent-encoded array key syntax [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/810552 (owner: 10Ori) [17:56:49] (03CR) 10Legoktm: [V: 03+2 C: 03+2] ":shipit:" [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/810551 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori) [18:02:17] (03PS2) 10Ayounsi: sre.network.debug: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/812380 [18:03:03] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1001.wikimedia.org with OS bullseye [18:03:08] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1001.wikimedia.org with OS bullseye [18:03:43] (03CR) 10Ori: "This change is ready for review." 
[software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/812389 (owner: 10Ori) [18:04:21] (03PS1) 10Eigyan: [config]: Add click event logging for mobile and desktop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812391 (https://phabricator.wikimedia.org/T310852) [18:18:27] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1001.wikimedia.org with reason: host reimage [18:21:08] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1001.wikimedia.org with reason: host reimage [18:24:56] (03Abandoned) 10Jdlrobson: Enable title above tabs on all opt-in wikis (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808057 (https://phabricator.wikimedia.org/T310054) (owner: 10Jdlrobson) [18:26:41] !log changing Cassandra superuser password, AQS cluster -- T311652 [18:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:30] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:11] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for DDesouza - https://phabricator.wikimedia.org/T312271 (10jhathaway) 05Open→03Resolved a:03jhathaway @DDeSouza you have been added to the wmf group, enjoy! [18:32:22] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8808.service,thumbor@8809.service,thumbor@8812.service,thumbor@8816.service,thumbor@8817.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:33:13] (03CR) 10Andrew Bogott: "Ping me when I'm back on the 18th for a merge if you haven't found someone else to do it by then." 
[puppet] - 10https://gerrit.wikimedia.org/r/812381 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [18:33:19] (03CR) 10Andrew Bogott: [C: 03+1] labweb: point tlsproxy envoy at %{facts.ipaddress}:8080 for striker [puppet] - 10https://gerrit.wikimedia.org/r/812381 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [18:36:12] (03PS1) 10JHathaway: admin: Add Daniel Souza to LDAP only [puppet] - 10https://gerrit.wikimedia.org/r/812394 [18:42:16] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1001.wikimedia.org with OS bullseye [18:42:21] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1001.wikimedia.org with OS bullseye completed: - cloudel... [19:10:16] (03PS2) 10JHathaway: admin: Add Daniel Souza to LDAP only [puppet] - 10https://gerrit.wikimedia.org/r/812394 (https://phabricator.wikimedia.org/T312271) [19:11:25] (03CR) 10CI reject: [V: 04-1] admin: Add Daniel Souza to LDAP only [puppet] - 10https://gerrit.wikimedia.org/r/812394 (https://phabricator.wikimedia.org/T312271) (owner: 10JHathaway) [19:12:15] (03PS3) 10JHathaway: admin: Add Daniel Souza to LDAP only [puppet] - 10https://gerrit.wikimedia.org/r/812394 (https://phabricator.wikimedia.org/T312271) [19:13:13] (03CR) 10CDanis: [C: 03+1] admin: Add Daniel Souza to LDAP only [puppet] - 10https://gerrit.wikimedia.org/r/812394 (https://phabricator.wikimedia.org/T312271) (owner: 10JHathaway) [19:13:50] (03CR) 10JHathaway: [C: 03+2] admin: Add Daniel Souza to LDAP only [puppet] - 10https://gerrit.wikimedia.org/r/812394 (https://phabricator.wikimedia.org/T312271) (owner: 10JHathaway) [19:23:56] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (10jhathaway) 
@Aline_Bruenger_WMDE do you perhaps mean the wmde ldap group? I don't see any other folks from wmde who are part of the wmf group. [19:24:59] 10SRE, 10Observability-Alerting, 10Traffic, 10Patch-For-Review, 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10BCornwall) @fgiunchedi Looks like the rules mentioned in the ticket have all either been ported or confirmed as... [19:28:57] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (10jhathaway) @karapayneWMDE do you happen to know? [19:30:43] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for ddw - https://phabricator.wikimedia.org/T312675 (10Ddwaal-WMF) [19:31:07] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T312676 (10DDeSouza) [19:34:50] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for ddw - https://phabricator.wikimedia.org/T312675 (10jhathaway) @Ddwaal-WMF happy to help grant you access, who is your manager for this contract period? [19:35:23] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T312676 (10DDeSouza) It looks like I will not need the SSH key as my use case fits in "Dashboards in Superset / Hive interfaces (like Hue) that do access private data". https://wikitech.wikimedia.org/wi... [19:37:18] Can I get a +2 for a package version bump https://gerrit.wikimedia.org/r/c/operations/software/varnish/libvmod-querysort/+/812389 ? 
[19:37:27] 10SRE, 10SRE-Access-Requests: Requesting access to to analytics-privatedata-users for DDesouza - https://phabricator.wikimedia.org/T312676 (10DDeSouza) [19:37:45] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DDesouza - https://phabricator.wikimedia.org/T312676 (10DDeSouza) [19:46:50] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Update package for version 0.2 [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/812389 (owner: 10Ori) [19:47:02] ori: done, also I think it's fine to self +2 patches like that [19:48:06] thanks [19:49:05] !log removing 2 files for legal compliance [19:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:08] (03PS1) 10Dduvall: gitlab_runner: Handle changes to runner config [puppet] - 10https://gerrit.wikimedia.org/r/812402 (https://phabricator.wikimedia.org/T311746) [19:53:23] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (39) node(s) change every puppet run: an-test-client1001, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, elastic2049, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, stat1004, stat1005, stat1006, stat1007 [19:53:23] 08, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003, thumbor1002, thumbor1005, thumbor1006, thumbor2003, thumbor2004, thumbor2005, thumbor2006 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [19:53:44] (03CR) 10CI reject: [V: 04-1] gitlab_runner: Handle changes to runner config [puppet] - 10https://gerrit.wikimedia.org/r/812402 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall) [19:55:37] (03PS2) 10Mary Yang: Add alert manager alert receivers for the Abstract Wikipedia team. 
[puppet] - 10https://gerrit.wikimedia.org/r/811790 (https://phabricator.wikimedia.org/T311457) [19:56:20] (03CR) 10Mary Yang: Add alert manager alert receivers for the Abstract Wikipedia team. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811790 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [19:56:26] !log cdanis@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1001.eqiad.wmnet with reason: bug fix [19:56:39] !log cdanis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1001.eqiad.wmnet with reason: bug fix [19:56:51] !log cdanis@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phabricator.wikimedia.org with reason: bug fix [19:57:05] !log cdanis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phabricator.wikimedia.org with reason: bug fix [19:57:08] !log cdanis@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab.wmfusercontent.org with reason: bug fix [19:57:21] !log cdanis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab.wmfusercontent.org with reason: bug fix [19:58:38] !log quick phab downtime for deploy to fix T312614 [19:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:42] T312614: Task Form becomes inaccessible after edit - https://phabricator.wikimedia.org/T312614 [20:04:35] (03PS2) 10Dduvall: gitlab_runner: Handle changes to runner config [puppet] - 10https://gerrit.wikimedia.org/r/812402 (https://phabricator.wikimedia.org/T311746) [20:04:46] 10SRE, 10Campaign-Tools: [Request for Comment] Campaigns Geolocation API proposal - https://phabricator.wikimedia.org/T312677 (10ldelench_wmf) [20:05:20] 10SRE, 10Campaign-Tools: [Request for Comment] Campaigns Geolocation API proposal - https://phabricator.wikimedia.org/T312677 (10ldelench_wmf) [20:11:46] (03PS2) 10Andrew Bogott: Openstack Heat: standardize on heat_domain_admin name [puppet] - 
10https://gerrit.wikimedia.org/r/812164 [20:11:48] (03PS1) 10Andrew Bogott: Keystone: add a no-op userid hash generator [puppet] - 10https://gerrit.wikimedia.org/r/812403 [20:11:59] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - enwiki_content_1617264042[10](2022-07-05T20:05:55.919Z), enwiki_content_1617264042[9](2022-07-05T20:05:55.924Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:13:11] (03PS3) 10Andrew Bogott: Openstack Heat: standardize on heat_domain_admin name [puppet] - 10https://gerrit.wikimedia.org/r/812164 [20:13:13] (03PS2) 10Andrew Bogott: Keystone: add a no-op userid hash generator [puppet] - 10https://gerrit.wikimedia.org/r/812403 [20:16:40] 10SRE, 10Observability-Alerting, 10Traffic, 10Patch-For-Review, 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10BCornwall) Ah, I've since learned where to look and verify where the rules are. Are we comfortable enough with th... [20:18:56] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for ddw - https://phabricator.wikimedia.org/T312675 (10Ddwaal-WMF) Hi @jhathaway @dr0ptp4kt (Adam Baso) [20:23:37] (03PS4) 10Dbrant: Add sampling to android.breadcrumbs event stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811765 (https://phabricator.wikimedia.org/T310847) [20:32:23] (03CR) 10Dzahn: "compiling would be nice, it's just that we have to sync puppet compiler facts first again, which is a little painful.. 
but i'll look http" [puppet] - 10https://gerrit.wikimedia.org/r/812264 (https://phabricator.wikimedia.org/T311241) (owner: 10Jelto) [20:34:37] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8808.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service,thumbor@8814.service,thumbor@8818.service,thumbor@8833.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_s [20:49:07] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for ddw - https://phabricator.wikimedia.org/T312675 (10jhathaway) @dr0ptp4kt so I assume they need the wmf group, anything else? [21:00:01] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [21:02:23] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [21:16:16] (03PS1) 10Andrew Bogott: Keystone: rearrange how domain drivers are defined [puppet] - 10https://gerrit.wikimedia.org/r/812405 [21:16:18] (03PS1) 10Andrew Bogott: Keystone: rearrange how service domains are configured. [puppet] - 10https://gerrit.wikimedia.org/r/812406 [21:18:45] (03CR) 10CI reject: [V: 04-1] Keystone: rearrange how service domains are configured. [puppet] - 10https://gerrit.wikimedia.org/r/812406 (owner: 10Andrew Bogott) [21:21:03] (03PS3) 10Andrew Bogott: Keystone: add a no-op userid hash generator [puppet] - 10https://gerrit.wikimedia.org/r/812403 [21:21:05] (03PS2) 10Andrew Bogott: Keystone: rearrange how service domains are configured. 
[puppet] - 10https://gerrit.wikimedia.org/r/812406 [21:21:19] (03Abandoned) 10Andrew Bogott: Keystone: rearrange how domain drivers are defined [puppet] - 10https://gerrit.wikimedia.org/r/812405 (owner: 10Andrew Bogott) [21:22:37] (03CR) 10CI reject: [V: 04-1] Keystone: rearrange how service domains are configured. [puppet] - 10https://gerrit.wikimedia.org/r/812406 (owner: 10Andrew Bogott) [21:24:06] (03PS3) 10Andrew Bogott: Keystone: rearrange how service domains are configured. [puppet] - 10https://gerrit.wikimedia.org/r/812406 [21:28:49] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Heat: standardize on heat_domain_admin name [puppet] - 10https://gerrit.wikimedia.org/r/812164 (owner: 10Andrew Bogott) [21:32:27] !log apt1001: reprepro -C main include buster-wikimedia libvmod-querysort_0.2_amd64.changes [21:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:19] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:41:41] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (10wiki_willy) a:05BTullis→03Cmjohnson Looks like it's a R730 that's out of warranty. @Cmjohnson or @Jclark-ctr - do we still have any extra RAID controller batteries lying around?... [21:42:21] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DDesouza - https://phabricator.wikimedia.org/T312676 (10Bethany) Approved [21:44:21] !log [Elastic] Reshuffled shards on eqiad to get cluster back into green status (from yellow): https://phabricator.wikimedia.org/P30995#130117 [21:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:39] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [21:56:13] jhathaway: andrewbogott: ^ 2 changes are waiting on the master [22:04:54] mutante: mine can go, thanks [22:06:29] thanks, I can only say multiple or no though in this case [22:08:07] Please merge mine [22:08:27] alright, merging both at once [22:08:30] done [22:09:36] Thx [22:09:49] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [22:18:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DDesouza - https://phabricator.wikimedia.org/T312676 (10jhathaway) [22:18:50] (03PS1) 10Zabe: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812412 [22:19:31] (03PS2) 10Zabe: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812412 [22:21:13] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DDesouza - https://phabricator.wikimedia.org/T312676 (10jhathaway) @Ottomata kindly approve [22:51:22] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list for WikiSound Audio Speaks Campaign - https://phabricator.wikimedia.org/T311230 (10Anasskoko) Thank you Jhathaway, Yes of course we still need mailing list created, as we are almost 80% done with creating the page for The WikiSound Audio Speaks Campaign.... [22:55:11] (03CR) 10RLazarus: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/812294 (owner: 10Muehlenhoff) [22:57:25] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list for WikiSound Audio Speaks Campaign - https://phabricator.wikimedia.org/T311230 (10jhathaway) a:03jhathaway [23:03:27] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8811.service,thumbor@8818.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:25] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list for WikiSound Audio Speaks Campaign - https://phabricator.wikimedia.org/T311230 (10Anasskoko) Thank you Jhathaway for claiming the task [23:13:33] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (39) node(s) change every puppet run: an-test-client1001, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, elastic2049, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, stat1004, stat1005, stat1006, stat1007 [23:13:33] 08, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003, thumbor1002, thumbor1005, thumbor1006, thumbor2003, thumbor2004, thumbor2005, thumbor2006 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [23:26:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:43:57] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:52:59] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8810.service,thumbor@8812.service,thumbor@8813.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state