[00:38:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:18] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:54:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:55:00] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:13:04] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:34:12] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:26:38] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:57:32] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:00:50] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:10:04] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:14:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:19:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:27:54] (PS5) Ori: Initial Debian packaging [software/varnish/libvmod-querysort] - https://gerrit.wikimedia.org/r/810551 (https://phabricator.wikimedia.org/T138093)
[03:58:54] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:19:16] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:38:10] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:55:16] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:55:58] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:29:42] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:39:47] (PS1) Marostegui: db2153: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/810706 (https://phabricator.wikimedia.org/T311493)
[05:40:57] (CR) Marostegui: [C: +2] db2153: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/810706 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:47:56] (PS1) Marostegui: db2154: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/810707 (https://phabricator.wikimedia.org/T311493)
[05:49:06] (CR) Marostegui: [C: +2] db2154: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/810707 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:51:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 14 hosts with reason: codfw s4 sanitarium master switch
[05:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: codfw s4 sanitarium master switch
[05:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:24] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:04:33] (PS1) Marostegui: db2155: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/810711 (https://phabricator.wikimedia.org/T311493)
[06:07:21] (CR) Marostegui: [C: +2] db2155: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/810711 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[06:10:09] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:10:11] (PS1) Marostegui: mariadb: db2073 no longer sanitarium master [puppet] - https://gerrit.wikimedia.org/r/810713 (https://phabricator.wikimedia.org/T311493)
[06:11:32] (CR) Marostegui: [C: +2] mariadb: db2073 no longer sanitarium master [puppet] - https://gerrit.wikimedia.org/r/810713 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[06:13:33] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:18:41] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:20:46] (PS1) Marostegui: mariadb: Decommission db2091 [puppet] - https://gerrit.wikimedia.org/r/810715 (https://phabricator.wikimedia.org/T311803)
[06:24:12] !log marostegui@cumin2002 START - Cookbook sre.hosts.decommission for hosts db2091.codfw.wmnet
[06:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:44] (CR) Ayounsi: "Thanks for the cleanup, lgtm overall with 2 comments." [puppet] - https://gerrit.wikimedia.org/r/810323 (owner: Muehlenhoff)
[06:28:23] !log marostegui@cumin2002 START - Cookbook sre.dns.netbox
[06:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:34] (CR) Marostegui: [C: +2] mariadb: Decommission db2091 [puppet] - https://gerrit.wikimedia.org/r/810715 (https://phabricator.wikimedia.org/T311803) (owner: Marostegui)
[06:32:21] !log marostegui@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:41] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2091.codfw.wmnet
[06:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:02] ops-codfw, decommission-hardware, Patch-For-Review: decommission db2091 - https://phabricator.wikimedia.org/T311803 (Marostegui) a:Papaul
[06:36:23] ops-codfw, decommission-hardware, Patch-For-Review: decommission db2091 - https://phabricator.wikimedia.org/T311803 (Marostegui) Ready for you @Papaul!
[06:38:52] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445) (owner: BCornwall)
[06:39:37] !log marostegui@cumin2002 START - Cookbook sre.hosts.decommission for hosts db2092.codfw.wmnet
[06:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:57] (PS1) Marostegui: mariadb: Decommission db2092 [puppet] - https://gerrit.wikimedia.org/r/810819 (https://phabricator.wikimedia.org/T311802)
[06:41:28] (CR) Giuseppe Lavagetto: mediawiki: add scap restarts script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810027 (owner: Giuseppe Lavagetto)
[06:43:48] !log marostegui@cumin2002 START - Cookbook sre.dns.netbox
[06:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:10] (CR) Filippo Giunchedi: "LGTM, adding Andrea since she's been working on Bullseye support. To avoid the obvious conflict with https://gerrit.wikimedia.org/r/c/oper" [puppet] - https://gerrit.wikimedia.org/r/810323 (owner: Muehlenhoff)
[06:47:19] (CR) Marostegui: [C: +2] mariadb: Decommission db2092 [puppet] - https://gerrit.wikimedia.org/r/810819 (https://phabricator.wikimedia.org/T311802) (owner: Marostegui)
[06:47:44] !log marostegui@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:55] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:49:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:49:41] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2092.codfw.wmnet
[06:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:48] ops-codfw, decommission-hardware: decommission db2092 - https://phabricator.wikimedia.org/T311802 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin2002 for hosts: `db2092.codfw.wmnet` - db2092.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Found phys...
[06:49:54] ops-codfw, decommission-hardware: decommission db2092 - https://phabricator.wikimedia.org/T311802 (Marostegui) @Papaul this is all yours!
[06:50:01] ops-codfw, decommission-hardware: decommission db2092 - https://phabricator.wikimedia.org/T311802 (Marostegui) a:Papaul
[06:51:17] (PS5) Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - https://gerrit.wikimedia.org/r/810027
[06:51:19] (PS4) Giuseppe Lavagetto: scap: use the new script to restart php-fpm [puppet] - https://gerrit.wikimedia.org/r/810031
[06:51:21] (PS3) Giuseppe Lavagetto: scap: drop unused parameters from the configuration [puppet] - https://gerrit.wikimedia.org/r/810048
[06:53:01] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:54:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:55:23] (PS1) Muehlenhoff: Remove access for mewoph [puppet] - https://gerrit.wikimedia.org/r/810822
[06:56:39] (CR) Filippo Giunchedi: "LGTM overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/810106 (owner: Andrea Denisse)
[06:57:39] (CR) Muehlenhoff: [C: +2] Remove access for mewoph [puppet] - https://gerrit.wikimedia.org/r/810822 (owner: Muehlenhoff)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220704T0700)
[07:00:39] (CR) Muehlenhoff: "Sure thing, I'll rebase the patch when https://gerrit.wikimedia.org/r/c/operations/puppet/+/802593 is merged." [puppet] - https://gerrit.wikimedia.org/r/810323 (owner: Muehlenhoff)
[07:02:46] (CR) Muehlenhoff: [C: +2] "Looks good, merging" [puppet] - https://gerrit.wikimedia.org/r/802146 (https://phabricator.wikimedia.org/T286856) (owner: Majavah)
[07:04:24] (CR) Muehlenhoff: "We can skip this, the entire aptrepo config for stretch will be removed at large once all Stretch hosts are gone (2-3 months)." [puppet] - https://gerrit.wikimedia.org/r/810459 (owner: Majavah)
[07:10:10] (CR) Muehlenhoff: [C: +2] snapshot: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810319 (owner: Muehlenhoff)
[07:10:19] (PS6) Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - https://gerrit.wikimedia.org/r/810027
[07:10:21] (PS5) Giuseppe Lavagetto: scap: use the new script to restart php-fpm [puppet] - https://gerrit.wikimedia.org/r/810031
[07:10:22] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:10:23] (PS4) Giuseppe Lavagetto: scap: drop unused parameters from the configuration [puppet] - https://gerrit.wikimedia.org/r/810048
[07:11:57] (PS5) Majavah: kubeadm: drop support for 1.20 [puppet] - https://gerrit.wikimedia.org/r/802143
[07:11:59] (PS5) Majavah: aptrepo: add thirdparty/kubeadm-k8s-1-22 [puppet] - https://gerrit.wikimedia.org/r/802146 (https://phabricator.wikimedia.org/T286856)
[07:12:01] (Abandoned) Majavah: aptrepo: drop kubeadm components from stretch [puppet] - https://gerrit.wikimedia.org/r/810459 (owner: Majavah)
[07:12:16] (PS1) Marostegui: mariadb: Productionize db2157 [puppet] - https://gerrit.wikimedia.org/r/810826 (https://phabricator.wikimedia.org/T311493)
[07:12:27] (PS3) Muehlenhoff: graphite: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810306
[07:14:13] (PS7) Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - https://gerrit.wikimedia.org/r/810027
[07:14:15] (PS6) Giuseppe Lavagetto: scap: use the new script to restart php-fpm [puppet] - https://gerrit.wikimedia.org/r/810031
[07:14:17] (PS5) Giuseppe Lavagetto: scap: drop unused parameters from the configuration [puppet] - https://gerrit.wikimedia.org/r/810048
[07:15:12] (CR) Marostegui: [C: +2] mariadb: Productionize db2157 [puppet] - https://gerrit.wikimedia.org/r/810826 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:17:07] (PS8) Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - https://gerrit.wikimedia.org/r/810027
[07:17:35] (CR) Muehlenhoff: [C: +2] graphite: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810306 (owner: Muehlenhoff)
[07:19:49] (PS9) Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - https://gerrit.wikimedia.org/r/810027
[07:20:23] (PS1) Marostegui: mariadb: Productionize db2156 [puppet] - https://gerrit.wikimedia.org/r/810829 (https://phabricator.wikimedia.org/T311493)
[07:20:42] (CR) Muehlenhoff: [C: +2] prometheus::postgres_exporter: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810318 (owner: Muehlenhoff)
[07:21:46] (PS2) Marostegui: mariadb: Productionize db2156 [puppet] - https://gerrit.wikimedia.org/r/810829 (https://phabricator.wikimedia.org/T311493)
[07:22:42] (CR) Marostegui: [C: +2] mariadb: Productionize db2156 [puppet] - https://gerrit.wikimedia.org/r/810829 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:25:03] (PS1) Marostegui: site.pp: Remove insetup from db215[6-7] [puppet] - https://gerrit.wikimedia.org/r/810830 (https://phabricator.wikimedia.org/T311493)
[07:28:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1001.eqiad.wmnet
[07:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:53] (PS10) Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - https://gerrit.wikimedia.org/r/810027
[07:32:01] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36175/console" [puppet] - https://gerrit.wikimedia.org/r/810027 (owner: Giuseppe Lavagetto)
[07:32:56] SRE, Infrastructure-Foundations, netops: DHCPd: update config to log more info - https://phabricator.wikimedia.org/T309524 (ayounsi)
[07:34:15] (CR) WMDE-Fisch: [C: +1] "Good to go now." [mediawiki-config] - https://gerrit.wikimedia.org/r/808803 (https://phabricator.wikimedia.org/T310684) (owner: Awight)
[07:34:24] (CR) WMDE-Fisch: [C: +1] "Good to go now." [mediawiki-config] - https://gerrit.wikimedia.org/r/804609 (https://phabricator.wikimedia.org/T310684) (owner: Awight)
[07:34:37] (PS2) WMDE-Fisch: Drop dependent feature flags [mediawiki-config] - https://gerrit.wikimedia.org/r/808803 (https://phabricator.wikimedia.org/T310684) (owner: Awight)
[07:35:49] (PS8) WMDE-Fisch: Drop deprecated feature flags [mediawiki-config] - https://gerrit.wikimedia.org/r/804609 (https://phabricator.wikimedia.org/T310684) (owner: Awight)
[07:37:04] (CR) Marostegui: [C: +2] site.pp: Remove insetup from db215[6-7] [puppet] - https://gerrit.wikimedia.org/r/810830 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:37:23] SRE, Infrastructure-Foundations: DHCPd: update config to log more info - https://phabricator.wikimedia.org/T309524 (ayounsi)
[07:38:34] (PS1) Muehlenhoff: Remove access for mattcleinman [puppet] - https://gerrit.wikimedia.org/r/810832
[07:38:41] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:38:56] SRE, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (ayounsi)
[07:39:35] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin1001.eqiad.wmnet
[07:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:21] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:40:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:43:03] SRE, ops-esams, DC-Ops, Infrastructure-Foundations, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (ayounsi)
[07:44:17] SRE, ops-esams, DC-Ops, Infrastructure-Foundations, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (ayounsi) a:ayounsi→RobH Thanks Faidon. I ended up emailing them as the dashboard seems limited. @robh: we can proceed with the decom of that box, followi...
[07:44:25] SRE, ops-esams, DC-Ops, Infrastructure-Foundations, decommission-hardware: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (ayounsi)
[07:44:54] (CR) Muehlenhoff: [C: +2] Remove access for mattcleinman [puppet] - https://gerrit.wikimedia.org/r/810832 (owner: Muehlenhoff)
[07:45:03] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:33] (CR) Muehlenhoff: [C: +2] Raise profile::cumin::monitoring_agentrun::crit [puppet] - https://gerrit.wikimedia.org/r/807497 (owner: Muehlenhoff)
[07:46:56] (CR) Vgutierrez: prometheus: Add custom vm.max_map_count metric (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445) (owner: BCornwall)
[07:47:11] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.774 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:56:45] (CR) Slyngshede: [V: +1 C: +2] define osm::planet_sync move from cron to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:57:21] (CR) Slyngshede: [V: +1] define osm::planet_sync move from cron to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/810304 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:58:24] (CR) Slyngshede: [C: +2] P:aptrepo::wikimedia move private repo to nginx and uninstall apache [puppet] - https://gerrit.wikimedia.org/r/809969 (owner: Slyngshede)
[08:02:34] SRE, Infrastructure-Foundations, netbox, netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (ayounsi) @jbond, I was wondering if the upgrade went as expected. And if there was a timeline for OIDC. No urgency at all, even if it's 1, 6 or more months, just doing some planning :)
[08:04:12] !log kill leftover processes of user `mewoph` on stat100x to allow puppet runs
[08:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:36] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging MewOphaswongse out of all services on: 1299 hosts
[08:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging MewOphaswongse out of all services on: 1299 hosts
[08:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:21] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging MewOphaswongse out of all services on: 634 hosts
[08:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging MewOphaswongse out of all services on: 634 hosts
[08:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:00] SRE, Infrastructure-Foundations, Traffic, netops: drmrs: primary software task - https://phabricator.wikimedia.org/T282788 (ayounsi)
[08:10:26] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:11:03] SRE, Infrastructure-Foundations, Traffic, netops, Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (ayounsi) Open→Resolved a:ayounsi I think everything here is done, and follow up is in {T311472} Feel free to re-open if needed.
[08:16:19] (PS1) Marostegui: instances.yaml: Add db2157 to dbctl [puppet] - https://gerrit.wikimedia.org/r/810838 (https://phabricator.wikimedia.org/T311493)
[08:17:26] (CR) Volans: [C: +2] sre.ganeti.makevm: Remove duplicated space [cookbooks] - https://gerrit.wikimedia.org/r/810338 (owner: David Caro)
[08:17:35] (CR) Volans: [C: +2] "Thanks for the fix" [cookbooks] - https://gerrit.wikimedia.org/r/810338 (owner: David Caro)
[08:18:05] (CR) Marostegui: [C: +2] instances.yaml: Add db2157 to dbctl [puppet] - https://gerrit.wikimedia.org/r/810838 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[08:18:07] SRE, Infrastructure-Foundations, netops: Validate new (anycast) IPv6 /48 announcement being accepted by transits - https://phabricator.wikimedia.org/T301900 (ayounsi) @cmooney I had a quick look at NTT/Telia/Lumen looking glass and the prefix seems to be accepted properly. Longer term, this could ma...
[08:18:43] SRE, Infrastructure-Foundations, netbox, netops: Improve Netbox import script to avoid port-number collisions in JunOS - https://phabricator.wikimedia.org/T301392 (ayounsi)
[08:21:09] (Merged) jenkins-bot: sre.ganeti.makevm: Remove duplicated space [cookbooks] - https://gerrit.wikimedia.org/r/810338 (owner: David Caro)
[08:22:07] (PS11) Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - https://gerrit.wikimedia.org/r/810027
[08:23:36] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36176/console" [puppet] - https://gerrit.wikimedia.org/r/810027 (owner: Giuseppe Lavagetto)
[08:23:44] (CR) Elukey: [C: +1] uwsgi: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810321 (owner: Muehlenhoff)
[08:24:05] SRE, Icinga, Observability-Alerting: PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (Volans) The alert has fired 22 times over the weekend. AFAICT all of them around the :34 minute mark, that seems suspiciously close to the start time of the timer f...
[08:24:07] !log marostegui@cumin2002 dbctl commit (dc=all): 'Add db2157 to s5 T311493', diff saved to https://phabricator.wikimedia.org/P30758 and previous config saved to /var/cache/conftool/dbconfig/20220704-082406-marostegui.json
[08:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:12] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[08:24:33] SRE, Infrastructure-Foundations, netbox, netops: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (ayounsi)
[08:26:01] (PS12) Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - https://gerrit.wikimedia.org/r/810027
[08:31:34] SRE-tools, Infrastructure-Foundations, netops, serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (ayounsi) Pinging @jelto for gitlab (not sure if the issue is still present or relevant) and @akosiaris for lists1001 (I can confirm the mi...
[08:32:10] SRE, Icinga, Observability-Alerting: PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (jcrespo) Thanks, @Volans, to me that would indicate a problem in the way the alert is setup/coordinated, and less of the infrastructure itself (even if latency is a...
[08:32:59] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36177/console" [puppet] - https://gerrit.wikimedia.org/r/810027 (owner: Giuseppe Lavagetto)
[08:33:46] (CR) Giuseppe Lavagetto: [V: +1 C: +2] mediawiki: add scap restarts script [puppet] - https://gerrit.wikimedia.org/r/810027 (owner: Giuseppe Lavagetto)
[08:34:03] (PS13) Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - https://gerrit.wikimedia.org/r/810027
[08:34:32] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:35:40] (PS1) Elukey: profile::thanos::swift: add a read only account for ml-serve [puppet] - https://gerrit.wikimedia.org/r/810840 (https://phabricator.wikimedia.org/T311628)
[08:38:34] SRE-tools, DC-Ops, Infrastructure-Foundations: sre.hosts.reimage: wait reboot time timeout on aqs nodes - https://phabricator.wikimedia.org/T307260 (Volans) Open→Resolved This should be resolved. Feel free to reopen it in case it's not.
[08:45:10] (CR) Slavina Stefanova: "out of curiosity, why run black and isort via bash scripts instead of using pre-commit?" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810295 (owner: David Caro)
[08:48:05] SRE, Infrastructure-Foundations, netops: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (ayounsi) With {T304712} will give us the possibility to move to 40G uplinks (instead of 4x10G) for some rows: C as of now, and D once {T308331} is done. This could be a good t...
[08:50:13] ACKNOWLEDGEMENT - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Investigating in T311991 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:52:01] (CR) David Caro: Add mypy, black and isort tests (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810295 (owner: David Caro)
[08:53:19] SRE, SRE-tools, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Allow idrac ftp fetching of firmware updates (either to existing ftp or new solution) - https://phabricator.wikimedia.org/T283771 (ayounsi)
[08:53:23] (PS1) Elukey: Upgrade kserve images to upstream release 0.8 [docker-images/production-images] - https://gerrit.wikimedia.org/r/810841 (https://phabricator.wikimedia.org/T311982)
[08:53:26] SRE-tools, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2022/2023-Q1), User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (Volans) >>! In T293209#8043558, @fgiunchedi wrote: > I don't know offhand how to best achiev...
[08:53:32] (PS3) Majavah: keyholder::monitoring: drop absented resources [puppet] - https://gerrit.wikimedia.org/r/810041
[08:53:41] (CR) Lucas Werkmeister (WMDE): [C: +1] Drop dependent feature flags (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/808803 (https://phabricator.wikimedia.org/T310684) (owner: Awight)
[08:55:04] (CR) Slavina Stefanova: alerts: add a default duration of 1h (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810367 (owner: David Caro)
[08:56:59] (CR) David Caro: Add mypy, black and isort tests (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810295 (owner: David Caro)
[08:59:51] (PS7) David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810368
[09:00:29] (PS4) Jcrespo: bacula::director: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810325 (owner: Muehlenhoff)
[09:02:28] (CR) Slavina Stefanova: Add mypy, black and isort tests (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810295 (owner: David Caro)
[09:04:07] (CR) Slavina Stefanova: Add mypy, black and isort tests (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810295 (owner: David Caro)
[09:04:52] (CR) Jcrespo: [C: +2] bacula::director: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810325 (owner: Muehlenhoff)
[09:05:04] (CR) CI reject: [V: -1] cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810368 (owner: David Caro)
[09:29:48] (CR) Slavina Stefanova: [C: +1] Add mypy, black and isort tests [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810295 (owner: David Caro)
[09:30:26] (PS1) Giuseppe Lavagetto: conftool::scripts: fix restart of multiple services [puppet] - https://gerrit.wikimedia.org/r/810850
[09:31:06] (PS2) Muehlenhoff: profile::mariadb::packages_client: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810847
[09:34:15] (Merged) jenkins-bot: Use our own alert managing [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805108 (https://phabricator.wikimedia.org/T309789) (owner: David Caro)
[09:34:16] (Merged) jenkins-bot: wmcs: added vm_console runbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805316 (https://phabricator.wikimedia.org/T309930) (owner: David Caro)
[09:34:16] (Merged) jenkins-bot: wmcs.ceph: don't use sre upgrade-and-reboot [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805327 (https://phabricator.wikimedia.org/T309786) (owner: David Caro)
[09:34:18] (Merged) jenkins-bot: wmcs: move alerting code to a library [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805376 (https://phabricator.wikimedia.org/T309786) (owner: David Caro)
[09:34:20] (Merged) jenkins-bot: wmcs.ceph.upgrade*: add sal logs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805377 (https://phabricator.wikimedia.org/T309786) (owner: David Caro)
[09:34:22] (Merged) jenkins-bot: wmcs.ceph: move core code to a library [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805741 (https://phabricator.wikimedia.org/T309786) (owner: David Caro)
[09:34:24] (Merged) jenkins-bot: wmcs.alert/ceph: allow downtiming alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805742 (owner: David Caro)
[09:34:26] (Merged) jenkins-bot: wmcs.openstack: Add runbook to increase the quotas [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/806429 (https://phabricator.wikimedia.org/T297606) (owner: David Caro)
[09:34:30] (CR) CI reject: [V: -1] profile::mariadb::packages_client: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810847 (owner: Muehlenhoff)
[09:36:22] (PS3) Muehlenhoff: profile::mariadb::packages_client: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810847
[09:42:20] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/810847 (owner: Muehlenhoff)
[09:47:04] (CR) jenkins-bot: profile::mariadb::packages_client: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810847 (owner: Muehlenhoff)
[09:50:27] (CR) David Caro: alerts: add a default duration of 1h (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810367 (owner: David Caro)
[09:53:46] (CR) LSobanski: [C: +1] typos: add "vtrs" [puppet] - https://gerrit.wikimedia.org/r/810403 (owner: Dzahn)
[09:56:46] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:57:29] (PS3) David Caro: wmcs.openstack: move libs to it's own module [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/809543
[09:57:31] (PS2) David Caro: alerts: add a default duration of 1h [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810367
[09:57:33] (CR) David Caro: alerts: add a default duration of 1h (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810367 (owner: David Caro)
[09:57:35] (PS2) David Caro: wmcs.lib.openstack: move to a directory module [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810451
[09:57:37] (PS8) David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810368
[09:57:39] (PS1) David Caro: openstack: move known nodes to the openstack lib [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810854
[09:59:14] (CR) Slavina Stefanova: [C: +1] alerts: add a default duration of 1h [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810367 (owner: David Caro)
[10:12:02] (CR) Urbanecm: [C: +2] "beta-only, no-op for prod" [mediawiki-config] - https://gerrit.wikimedia.org/r/810452
(https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [10:13:01] jouncebot: nowandnext [10:13:01] For the next 20 hour(s) and 46 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220704T0700) [10:13:01] In 20 hour(s) and 46 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220705T0700) [10:13:23] sigh [10:13:35] I'm going to backport the meta rc fix regardless [10:13:43] (03Merged) 10jenkins-bot: [beta] Temporarily allow everyone to enroll as mentor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810452 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [10:13:51] (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "RecentChange: Straight join to actor table when needed"" [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810139 (https://phabricator.wikimedia.org/T311360) (owner: 10Zabe) [10:13:57] (03CR) 10Ladsgroup: [C: 03+2] RecentChange: Make join to comment table also straight [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810138 (https://phabricator.wikimedia.org/T311360) (owner: 10Zabe) [10:14:16] (03CR) 10Urbanecm: [C: 03+2] "beta-only, no-op for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808268 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [10:14:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, note that you'll have to set permissions/ACL on the containers using the admin account" [puppet] - 10https://gerrit.wikimedia.org/r/810840 (https://phabricator.wikimedia.org/T311628) (owner: 10Elukey) [10:15:35] (03Merged) 10jenkins-bot: [beta] Growth: Enable structured mentor list at enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808268 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [10:15:50] * urbanecm done [10:15:53] (03PS2) 10Hnowlan: similar-users: make max queries per account configurable
[deployment-charts] - 10https://gerrit.wikimedia.org/r/808923 (https://phabricator.wikimedia.org/T310646) [10:16:20] (03CR) 10Filippo Giunchedi: [C: 03+2] keyholder::monitoring: drop absented resources [puppet] - 10https://gerrit.wikimedia.org/r/810041 (owner: 10Majavah) [10:16:22] (03CR) 10Hnowlan: similar-users: make max queries per account configurable (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/808923 (https://phabricator.wikimedia.org/T310646) (owner: 10Hnowlan) [10:17:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:18] <_joe_> !log upgraded etcdmirror to 0.0.7 on conf2006, now going with the rest of codfw [10:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:18:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:30] !log installing gnupg2 security updates [10:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:35] <_joe_> !log restarting etcdmirror on conf2005 [10:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply 
[10:24:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:25] !log silence etcd p a g e [10:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:52] <_joe_> !log rollback etcdmirror to 0.0.6 on conf2005 [10:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:04] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:30:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:18] (03Merged) 10jenkins-bot: Revert "Revert "RecentChange: Straight join to actor table when needed"" [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810139 (https://phabricator.wikimedia.org/T311360) (owner: 10Zabe) [10:34:24] (03Merged) 10jenkins-bot: RecentChange: Make join to comment table also straight [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810138 (https://phabricator.wikimedia.org/T311360) (owner: 10Zabe) [10:35:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:35:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:05] 
!log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:41:49] (03CR) 10Jaime Nuche: "Thanks for the fix Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/808061 (https://phabricator.wikimedia.org/T310740) (owner: 10Dzahn) [10:44:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:48:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:48:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:55] (03PS1) 10Ladsgroup: Add statsd metric collection on db calls [extensions/GlobalBlocking] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810518 (https://phabricator.wikimedia.org/T307648) [10:52:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:52:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:19] Amir1: if you're doing backports, can we deploy a GrowthExperiments patch that fixes a bad breakage? [10:52:30] kostajh: I accept bribe [10:52:42] hehe [10:52:50] (I can do the backport, can you test it?) [10:52:56] it's pretty minor https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/810509. And it's mitigated in that we have switched off the feature [10:53:23] sure, let's do it [10:53:23] so it could wait another ~20 hours or whatever [10:53:27] alright [10:53:33] I can test it, yes [10:53:40] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool::scripts: fix restart of multiple services [puppet] - 10https://gerrit.wikimedia.org/r/810850 (owner: 10Giuseppe Lavagetto) [10:54:55] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.18/includes: Backport: [[gerrit:810139|Revert "Revert "RecentChange: Straight join to actor table when needed"" (T311360)]] (duration: 03m 49s) [10:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:58] T311360: RecentChanges timing out - https://phabricator.wikimedia.org/T311360 [10:55:57] (03PS1) 10Filippo Giunchedi: rest: fix getLag typo and add test [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/810864 (https://phabricator.wikimedia.org/T309546) [10:58:22] (03CR) 10Vlad.shapik: "recheck" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [10:59:00] urbanecm: maybe you are also around to help verify it on cswiki? 
[10:59:05] Amir1: ready when you are :) [10:59:15] kostajh: sure thing [10:59:21] I am ready, let me know when it's merged [10:59:32] (03CR) 10Ladsgroup: [C: 03+2] AddImageArticleTarget: Update to new mediaClass/mediaTag format [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810509 (https://phabricator.wikimedia.org/T311916) (owner: 10Urbanecm) [11:01:01] (03CR) 10Giuseppe Lavagetto: [C: 03+2] rest: fix getLag typo and add test [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/810864 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [11:02:50] (03Merged) 10jenkins-bot: rest: fix getLag typo and add test [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/810864 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [11:06:29] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Enable OIDC - https://phabricator.wikimedia.org/T311999 (10MoritzMuehlenhoff) [11:06:45] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:08:25] (03PS1) 10Muehlenhoff: Enable OIDC in Gradle build [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/810867 (https://phabricator.wikimedia.org/T311999) [11:08:32] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10MoritzMuehlenhoff) >>! In T306238#8047863, @ayounsi wrote: > @jbond, I was wondering if the upgrade went as expected. And if there was a timeline for OIDC. > No urgency at all, even... 
[11:10:35] (03CR) 10Hashar: "recheck after adding libexiv2-dev to the CI image ( https://gerrit.wikimedia.org/r/c/integration/config/+/810866 )" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [11:11:24] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:20:01] (03PS1) 10Marostegui: instances.yaml: Add db2156 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/810874 (https://phabricator.wikimedia.org/T311493) [11:21:06] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2156 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/810874 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [11:22:45] (03Merged) 10jenkins-bot: AddImageArticleTarget: Update to new mediaClass/mediaTag format [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810509 (https://phabricator.wikimedia.org/T311916) (owner: 10Urbanecm) [11:22:53] Amir1: it's merged now [11:23:54] will do it soon [11:24:46] urbanecm: for testing, I guess we re-enable the tasktype on e.g. cswiki, then switch to mwdebug and verify an edit? Or do you think verifying an edit on testwiki is enough, then we can re-enable the task type on the wikis?
[11:25:30] kostajh: I'd test at testwiki via mwdebug, if it works, re-enable at cswiki, test there too (still at mwdebug) and if it works, sync (and re-enable everywhere) [11:26:24] ok [11:27:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:31:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:17] (03CR) 10Hashar: "recheck after adding libboost-python-dev to the CI image https://gerrit.wikimedia.org/r/810871" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [11:33:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:53] (03CR) 10Marostegui: [C: 03+1] profile::mariadb::packages_wmf: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810846 (owner: 10Muehlenhoff) [11:35:03] (03CR) 10Marostegui: [C: 03+1] profile::mariadb::packages_client: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810847 (owner: 10Muehlenhoff) [11:36:41] !log marostegui@cumin2002 dbctl commit (dc=all): 'Add db2156 to s3 T311493', diff saved to https://phabricator.wikimedia.org/P30774 and previous config saved to /var/cache/conftool/dbconfig/20220704-113640-marostegui.json [11:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:44] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493 [11:39:54] (03PS1) 
10Matthias Mullie: Retrieve pages-with-suggestion via Elastic scroll directly [extensions/ImageSuggestions] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810889 (https://phabricator.wikimedia.org/T311476) [11:40:57] (03PS1) 10Marostegui: mariadb: Promote db1173 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/810878 (https://phabricator.wikimedia.org/T311522) [11:41:24] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/810878 (https://phabricator.wikimedia.org/T311522) (owner: 10Marostegui) [11:41:45] urbanecm: kostajh live in mwdebug1002 [11:41:54] looking [11:42:11] (03PS1) 10Marostegui: wmnet: Update s6-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/810881 (https://phabricator.wikimedia.org/T311522) [11:42:23] (03CR) 10Marostegui: "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/810881 (https://phabricator.wikimedia.org/T311522) (owner: 10Marostegui) [11:43:14] https://test.wikipedia.org/w/index.php?title=Brudalen&diff=516168&oldid=485928: looks like an image was added. testing at cswiki now. 
[11:43:22] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Promote db1173 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/810878 (https://phabricator.wikimedia.org/T311522) (owner: 10Marostegui) [11:43:31] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Update s6-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/810881 (https://phabricator.wikimedia.org/T311522) (owner: 10Marostegui) [11:43:40] urbanecm: ack, testwiki looks good for my edit as well [11:44:39] (03CR) 10Ladsgroup: [C: 03+2] Add statsd metric collection on db calls [extensions/GlobalBlocking] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810518 (https://phabricator.wikimedia.org/T307648) (owner: 10Ladsgroup) [11:45:40] cswiki works too: https://cs.wikipedia.org/w/index.php?title=Politick%C3%A1_a_pr%C3%A1vn%C3%AD_komise_%C3%BAst%C5%99edn%C3%ADho_v%C3%BDboru_Komunistick%C3%A9_strany_%C4%8C%C3%ADny&diff=21439374&oldid=19920770 [11:46:02] urbanecm: lgtm [11:46:11] Amir1: let's sync please! [11:46:18] awesome [11:48:08] (03Merged) 10jenkins-bot: Add statsd metric collection on db calls [extensions/GlobalBlocking] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810518 (https://phabricator.wikimedia.org/T307648) (owner: 10Ladsgroup) [11:50:00] (03PS1) 10Jcrespo: Add new user for dbbackups database for django dashboard [puppet] - 10https://gerrit.wikimedia.org/r/810885 (https://phabricator.wikimedia.org/T283017) [11:50:03] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/GrowthExperiments/modules/ext.growthExperiments.StructuredTask/addimage/AddImageArticleTarget.js: Backport: [[gerrit:810509|AddImageArticleTarget: Update to new mediaClass/mediaTag format (T311916)]] (duration: 03m 33s) [11:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:08] T311916: "Add an image" structured edits add a blank line instead of an image - https://phabricator.wikimedia.org/T311916 [11:54:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [11:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:54:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:22] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/GlobalBlocking/includes/GlobalBlocking.php: Backport: [[gerrit:810518|Add statsd metric collection on db calls (T307648)]] (duration: 03m 26s) [11:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:25] T307648: Audit database usage of GlobalBlocking extension - https://phabricator.wikimedia.org/T307648 [11:55:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:57] (03PS2) 10Jcrespo: Add new user for dbbackups database for django dashboard [puppet] - 10https://gerrit.wikimedia.org/r/810885 (https://phabricator.wikimedia.org/T283017) [11:58:02] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:05:02] (03CR) 10Jcrespo: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/810885 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [12:10:25] (03CR) 10Matěj Suchánek: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810890 (owner: 10Matěj Suchánek) [12:12:52] (03CR) 10Urbanecm: [C: 03+1] "lgtm. needs to be scheduled via a backport window." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/810890 (owner: 10Matěj Suchánek) [12:13:34] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:15:54] (03PS1) 10Ladsgroup: Switchover s4 master [puppet] - 10https://gerrit.wikimedia.org/r/810908 (https://phabricator.wikimedia.org/T311611) [12:17:08] (03PS1) 10Ladsgroup: Switchover s4 master [dns] - 10https://gerrit.wikimedia.org/r/810909 (https://phabricator.wikimedia.org/T311611) [12:17:36] !log installing 4.9.320 on stretch hosts [12:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:03] (03PS1) 10Filippo Giunchedi: prometheus: remove distro-based conditionals for blackbox [puppet] - 10https://gerrit.wikimedia.org/r/810910 (https://phabricator.wikimedia.org/T305847) [12:20:22] (03CR) 10Marostegui: [C: 03+1] Switchover s4 master [puppet] - 10https://gerrit.wikimedia.org/r/810908 (https://phabricator.wikimedia.org/T311611) (owner: 10Ladsgroup) [12:20:43] (03CR) 10Marostegui: [C: 03+1] Switchover s4 master [dns] - 10https://gerrit.wikimedia.org/r/810909 (https://phabricator.wikimedia.org/T311611) (owner: 10Ladsgroup) [12:20:53] (03CR) 10Ladsgroup: [C: 04-2] "until the day of switchover" [puppet] - 10https://gerrit.wikimedia.org/r/810908 (https://phabricator.wikimedia.org/T311611) (owner: 10Ladsgroup) [12:21:20] (03CR) 10Ladsgroup: [C: 04-2] "until the day of switchover" [dns] - 10https://gerrit.wikimedia.org/r/810909 (https://phabricator.wikimedia.org/T311611) (owner: 10Ladsgroup) [12:21:56] (03CR) 10Marostegui: "Was the "-" allowed in usernames?" [puppet] - 10https://gerrit.wikimedia.org/r/810885 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [12:22:21] (03CR) 10Filippo Giunchedi: "This change will require some coordination between merging and upgrading blackbox exporter on buster hosts. 
I can take care of production," [puppet] - 10https://gerrit.wikimedia.org/r/810910 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:22:29] (03CR) 10Vlad.shapik: "Thank you for the patch. The changes are used in the patch(Icabc39dab7347ac5c6d75f834a06ddfca5c4ca09) which this is partially based on." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/806333 (owner: 10Hnowlan) [12:23:23] (03CR) 10Majavah: prometheus: remove distro-based conditionals for blackbox (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810910 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:23:44] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10JoKalliauer) [12:27:31] <_joe_> !log updated etcdmirror to 0.0.8 everywhere [12:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:34] (03CR) 10Jcrespo: "All characters are technically allowed, it was _ the one that gave us issues in the past (because it was a wildcard) not '-'." 
[puppet] - 10https://gerrit.wikimedia.org/r/810885 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [12:37:50] (03PS1) 10David Caro: wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914 [12:37:52] (03PS1) 10David Caro: wmcs.openstack: Use the known cloudcontrols instead of asking [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810915 [12:38:22] !log running alter table on dbbackups db T283017 [12:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:26] T283017: Create a dashboard for database backups monitoring/reporting - https://phabricator.wikimedia.org/T283017 [12:40:47] (03CR) 10Matěj Suchánek: Don't call deprecated IContextSource::getStats (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810890 (owner: 10Matěj Suchánek) [12:43:03] (03CR) 10CI reject: [V: 04-1] wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914 (owner: 10David Caro) [12:43:39] (03CR) 10CI reject: [V: 04-1] wmcs.openstack: Use the known cloudcontrols instead of asking [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810915 (owner: 10David Caro) [12:51:15] taavi: thanks for assisting re: blackbox-exporter, would you have time today ? 
[12:52:48] (03CR) 10Filippo Giunchedi: prometheus: remove distro-based conditionals for blackbox (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810910 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:59:51] (03PS1) 10Stang: zh(wikiversity|wiktionary): Disable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810916 (https://phabricator.wikimedia.org/T312012) [13:08:32] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:10:44] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2064.codfw.wmnet [13:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:48] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1068.eqiad.wmnet [13:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:58] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:15:06] (03PS1) 10Filippo Giunchedi: sre: add etcd-mirror lag page [alerts] - 10https://gerrit.wikimedia.org/r/810918 (https://phabricator.wikimedia.org/T309546) [13:17:54] (03PS1) 10Filippo Giunchedi: etcd: remove paging alert, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/810919 (https://phabricator.wikimedia.org/T309546) [13:21:12] (03PS2) 10David Caro: wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914 [13:21:14] (03PS2) 10David Caro: wmcs.openstack: Use the known cloudcontrols instead of asking [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810915 [13:22:59] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2064.codfw.wmnet [13:23:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:48] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1068.eqiad.wmnet [13:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:15] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2065.codfw.wmnet [13:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:57] (03CR) 10Jaime Nuche: "We don't need this change after Ahmon added this: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production" [puppet] - 10https://gerrit.wikimedia.org/r/807510 (https://phabricator.wikimedia.org/T310740) (owner: 10Jbond) [13:27:03] (03CR) 10CI reject: [V: 04-1] wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914 (owner: 10David Caro) [13:27:13] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1069.eqiad.wmnet [13:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:30] (03CR) 10CI reject: [V: 04-1] wmcs.openstack: Use the known cloudcontrols instead of asking [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810915 (owner: 10David Caro) [13:34:36] godog: sure, what about right now? [13:35:09] taavi: SGTM! 
[13:35:14] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2065.codfw.wmnet [13:35:42] taavi: see the procedure outlined in my comment, I'm upgrading blackbox-exporter now [13:37:05] the plan looks fine, lmk when I should start upgrading the toolforge hosts [13:38:21] taavi: yeah you can upgrade, I used this fwiw apt -yq -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="- [13:38:24] -force-confold" install prometheus-blackbox-exporter [13:38:31] sigh, ok you get the idea, there's a dpkg prompt involved [13:38:40] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet [13:39:03] !log upgrade prometheus-blackbox-exporter to 0.18.0+ds-3~bpo10+1 on prometheus and metricsinfra Buster hosts - T305847 [13:39:18] (03PS1) 10Zabe: base: remove absented files [puppet] - 10https://gerrit.wikimedia.org/r/810925 [13:39:52] mmhh stashbot doesn't want to play with us atm [13:40:15] taavi: I'm going to merge the change [13:40:21] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: remove distro-based conditionals for blackbox [puppet] - 10https://gerrit.wikimedia.org/r/810910 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:40:28] (03PS2) 10Filippo Giunchedi: prometheus: remove distro-based conditionals for blackbox [puppet] - 10https://gerrit.wikimedia.org/r/810910 (https://phabricator.wikimedia.org/T305847) [13:40:51] (03CR) 10Filippo Giunchedi: [V: 03+2] prometheus: remove distro-based conditionals for blackbox [puppet] - 10https://gerrit.wikimedia.org/r/810910 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:43:09] ok, all done on my side [13:43:45] (JobUnavailable) firing: (8) Reduced availability for job blackbox/pingthing_proxied in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:43:56] PROBLEM - Check systemd state on prometheus3001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blackbox-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:24] PROBLEM - Check systemd state on prometheus4001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blackbox-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:24] PROBLEM - Check systemd state on prometheus6001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blackbox-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:38] PROBLEM - Check systemd state on prometheus5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blackbox-exporter.service,wmf_auto_restart_prometheus-blackbox-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:43] expected? :) [13:44:45] (JobUnavailable) firing: (4) Reduced availability for job blackbox/pingthing in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:44:53] yes [13:44:59] kinda, I'm looking into it [13:46:03] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) Just talked to Amir about the next test. Tomorrow morning we'll: - Depool db1132 - Enable performance schema - Repool i... 
[13:46:32] ah yes, I goofed that one up, fixing [13:46:53] (03PS1) 10Elukey: profile::thanos::swift: add mlserve_ro account [labs/private] - 10https://gerrit.wikimedia.org/r/810926 (https://phabricator.wikimedia.org/T311628) [13:47:22] (03CR) 10Elukey: [V: 03+2 C: 03+2] profile::thanos::swift: add mlserve_ro account [labs/private] - 10https://gerrit.wikimedia.org/r/810926 (https://phabricator.wikimedia.org/T311628) (owner: 10Elukey) [13:47:50] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/809338 (https://phabricator.wikimedia.org/T306654) (owner: 10Ssingh) [13:48:45] (JobUnavailable) firing: (19) Reduced availability for job blackbox/icmp in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:49:00] (03PS1) 10Filippo Giunchedi: wmflib: remove distro conditionals from blackbox http module options [puppet] - 10https://gerrit.wikimedia.org/r/810927 (https://phabricator.wikimedia.org/T309546) [13:49:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:49:29] (03PS2) 10Filippo Giunchedi: wmflib: remove distro conditionals from blackbox http module options [puppet] - 10https://gerrit.wikimedia.org/r/810927 (https://phabricator.wikimedia.org/T309546) [13:49:48] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet [13:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:53] serves me well for not testing changes in pontoon first [13:51:10] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] 
wmflib: remove distro conditionals from blackbox http module options [puppet] - 10https://gerrit.wikimedia.org/r/810927 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [13:51:12] (03CR) 10Ssingh: [C: 03+2] admin: allow sudo for jclark-ctr for cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/809338 (https://phabricator.wikimedia.org/T306654) (owner: 10Ssingh) [13:51:20] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:35] 10SRE, 10Wikimedia-Mailing-lists: Create stewards-usergroup private mailing list - https://phabricator.wikimedia.org/T312018 (10MarcoAurelio) [13:51:40] (03PS2) 10Ssingh: admin: allow sudo for jclark-ctr for cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/809338 (https://phabricator.wikimedia.org/T306654) [13:52:28] (03CR) 10Ssingh: "rebased, no code change" [puppet] - 10https://gerrit.wikimedia.org/r/809338 (https://phabricator.wikimedia.org/T306654) (owner: 10Ssingh) [13:52:33] (03CR) 10Ssingh: [V: 03+2 C: 03+2] admin: allow sudo for jclark-ctr for cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/809338 (https://phabricator.wikimedia.org/T306654) (owner: 10Ssingh) [13:53:50] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4), 10Patch-For-Review: Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10ssingh) [13:54:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:54:18] RECOVERY - Check systemd state on prometheus4001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:20] RECOVERY - Check systemd state on prometheus6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:46] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:52] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1069.eqiad.wmnet [13:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:56] ok we're back [13:56:22] RECOVERY - Check systemd state on prometheus3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:53] (03PS3) 10David Caro: wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914 [13:56:55] (03PS3) 10David Caro: wmcs.openstack: Use the known cloudcontrols instead of asking [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810915 [13:57:13] taavi: should be all good, i.e. 
upgrading blackbox-exporter and then running puppet, confirmed it works in production [13:58:26] (Abandoned) Filippo Giunchedi: prometheus: adjust check::http params based on distro [puppet] - https://gerrit.wikimedia.org/r/809586 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi) [13:58:28] great [13:58:45] (JobUnavailable) resolved: (19) Reduced availability for job blackbox/icmp in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:59:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:59:45] (JobUnavailable) resolved: (8) Reduced availability for job blackbox/pingthing in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:01:07] SRE, Wikimedia-Mailing-lists: Create stewards-usergroup private mailing list - https://phabricator.wikimedia.org/T312018 (Ladsgroup) Open→Resolved a:Ladsgroup [14:01:30] (PS6) Filippo Giunchedi: icinga: check commons.w.o with blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/804274 (https://phabricator.wikimedia.org/T305847) [14:01:32] (PS5) Filippo Giunchedi: WIP irc check via blackbox [puppet] - https://gerrit.wikimedia.org/r/805815 [14:01:34] (PS1) Filippo Giunchedi: prometheus: disable protocol fallback for blackbox::check::http [puppet] - https://gerrit.wikimedia.org/r/810929 (https://phabricator.wikimedia.org/T305847) [14:01:38] (PS3) 
Ladsgroup: Set GlobalBlockingAllowedRanges for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/810055 (https://phabricator.wikimedia.org/T307648) [14:01:49] (CR) Ladsgroup: [C: +2] Set GlobalBlockingAllowedRanges for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/810055 (https://phabricator.wikimedia.org/T307648) (owner: Ladsgroup) [14:03:00] (CR) Giuseppe Lavagetto: scap: use the new script to restart php-fpm (3 comments) [puppet] - https://gerrit.wikimedia.org/r/810031 (owner: Giuseppe Lavagetto) [14:03:12] (CR) Elukey: "Found the following after PCC:" [puppet] - https://gerrit.wikimedia.org/r/810840 (https://phabricator.wikimedia.org/T311628) (owner: Elukey) [14:03:15] (Merged) jenkins-bot: Set GlobalBlockingAllowedRanges for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/810055 (https://phabricator.wikimedia.org/T307648) (owner: Ladsgroup) [14:04:45] (CR) Filippo Giunchedi: [C: +2] prometheus: add initial blackbox dns probes for wikipedia [puppet] - https://gerrit.wikimedia.org/r/809535 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi) [14:05:06] (PS2) Elukey: profile::thanos::swift: add a read only account for ml-serve [puppet] - https://gerrit.wikimedia.org/r/810840 (https://phabricator.wikimedia.org/T311628) [14:05:09] (CR) Filippo Giunchedi: [C: +2] "Thank you Brett for the review! 
Merging, we can add/iterate on the probes at a later stage too" [puppet] - 10https://gerrit.wikimedia.org/r/809535 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [14:05:10] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet [14:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:13] <_joe_> jouncebot: next [14:05:13] In 16 hour(s) and 54 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220705T0700) [14:05:15] (03PS2) 10Filippo Giunchedi: prometheus: add initial blackbox dns probes for wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/809535 (https://phabricator.wikimedia.org/T169860) [14:05:20] <_joe_> ok there's time :) [14:05:50] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1070.eqiad.wmnet [14:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:14] PROBLEM - puppet last run on ms-be2028 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:06:22] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: use the new script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/810031 (owner: 10Giuseppe Lavagetto) [14:06:30] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36179/console" [puppet] - 10https://gerrit.wikimedia.org/r/810840 (https://phabricator.wikimedia.org/T311628) (owner: 10Elukey) [14:06:32] (03PS7) 10Giuseppe Lavagetto: scap: use the new script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/810031 [14:07:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [14:08:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:09:59] (03CR) 10Filippo Giunchedi: "My bad, account_name is AUTH_mlserve (i.e. same as the rw user) though you need stats_enabled: no for this account" [puppet] - 10https://gerrit.wikimedia.org/r/810840 (https://phabricator.wikimedia.org/T311628) (owner: 10Elukey) [14:10:03] (03CR) 10David Caro: [C: 03+2] "LGTM Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/810925 (owner: 10Zabe) [14:10:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:42] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:810055|Set GlobalBlockingAllowedRanges for testwiki (T307648)]] (duration: 03m 39s) [14:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:46] T307648: Audit database usage of GlobalBlocking extension - https://phabricator.wikimedia.org/T307648 [14:10:56] dcaro: merged your change too [14:11:20] godog: thanks! 
I was trying to find out the user<->irc name xd [14:11:29] (03PS1) 10Ladsgroup: Excempt WMCS ranges from globalblocking everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810932 (https://phabricator.wikimedia.org/T307648) [14:11:31] (03PS3) 10Elukey: profile::thanos::swift: add a read only account for ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/810840 (https://phabricator.wikimedia.org/T311628) [14:11:47] lolz [14:12:06] (03PS3) 10Filippo Giunchedi: prometheus: probe DNS for (www).wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/809536 (https://phabricator.wikimedia.org/T169860) [14:12:32] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36180/console" [puppet] - 10https://gerrit.wikimedia.org/r/810840 (https://phabricator.wikimedia.org/T311628) (owner: 10Elukey) [14:13:34] (03PS2) 10Ladsgroup: Exempt WMCS ranges from globalblocking everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810932 (https://phabricator.wikimedia.org/T307648) [14:14:15] <_joe_> I'll do a null deploy to verify the new php restart script works [14:14:25] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::thanos::swift: add a read only account for ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/810840 (https://phabricator.wikimedia.org/T311628) (owner: 10Elukey) [14:14:41] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2067.codfw.wmnet [14:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:07] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::thanos::swift: add a read only account for ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/810840 (https://phabricator.wikimedia.org/T311628) (owner: 10Elukey) [14:15:36] (03PS4) 10David Caro: P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (owner: 10Majavah) [14:15:48] RECOVERY - MegaRAID on an-worker1082 is OK: 
OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:15:55] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: probe DNS for (www).wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/809536 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [14:16:26] Emperor, godog - o/ I am going to roll restart thanos-fe's swift-proxy for https://gerrit.wikimedia.org/r/c/operations/puppet/+/810840/ [14:16:29] (enable a new account) [14:16:38] elukey: ack [14:17:06] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1070.eqiad.wmnet [14:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:37] (03CR) 10Ladsgroup: [C: 03+2] Exempt WMCS ranges from globalblocking everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810932 (https://phabricator.wikimedia.org/T307648) (owner: 10Ladsgroup) [14:17:50] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: disable protocol fallback for blackbox::check::http [puppet] - 10https://gerrit.wikimedia.org/r/810929 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [14:18:07] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet [14:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:30] 10SRE, 10SRE-OnFire, 10Shellbox, 10serviceops, 10Sustainability (Incident Followup): Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10LSobanski) [14:18:41] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1071.eqiad.wmnet [14:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:11] (03Merged) 10jenkins-bot: Exempt WMCS ranges from globalblocking everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810932 (https://phabricator.wikimedia.org/T307648) (owner: 10Ladsgroup) [14:19:43] !log 
roll restart of thanos-fe's proxy to pick up a new account - T311628 [14:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:47] T311628: Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 [14:20:54] !log oblivian@deploy1002 Synchronized README: testing new php restart script (duration: 03m 23s) [14:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:10] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet [14:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:41] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2069.codfw.wmnet [14:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:16] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:810932|Exempt WMCS ranges from globalblocking everywhere (T307648)]] (duration: 03m 26s) [14:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:19] T307648: Audit database usage of GlobalBlocking extension - https://phabricator.wikimedia.org/T307648 [14:28:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:28:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:41] (03PS4) 10Giuseppe Lavagetto: mediawiki: install php7.4 on the canaries [puppet] - 10https://gerrit.wikimedia.org/r/808909 (https://phabricator.wikimedia.org/T311386) 
[14:29:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:21] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1071.eqiad.wmnet [14:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:09] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: install php7.4 on the canaries [puppet] - https://gerrit.wikimedia.org/r/808909 (https://phabricator.wikimedia.org/T311386) (owner: Giuseppe Lavagetto) [14:35:20] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2069.codfw.wmnet [14:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:16] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:54:00] (CR) Ayounsi: reports.network: improve IPv6 AAAA records checks (4 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/807986 (owner: Volans) [14:54:18] (CR) Ayounsi: [C: +1] "Small suggestion, LGTM otherwise!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807986 (owner: 10Volans) [15:00:27] (03CR) 10Giuseppe Lavagetto: [C: 03+1] sre: add etcd-mirror lag page [alerts] - 10https://gerrit.wikimedia.org/r/810918 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [15:01:30] (03CR) 10Giuseppe Lavagetto: [C: 03+1] etcd: remove paging alert, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/810919 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [15:02:50] (03CR) 10Filippo Giunchedi: [C: 03+2] etcd: remove paging alert, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/810919 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [15:02:55] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add etcd-mirror lag page [alerts] - 10https://gerrit.wikimedia.org/r/810918 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [15:09:19] 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (10akosiaris) >>! In T295793#8047985, @ayounsi wrote: > Pinging @jelto for gitlab (not sure if the issue is still present or relevant) and @a... [15:10:03] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:10:06] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10akosiaris) Many thanks for the work on this one @Dzahn! [15:13:47] PROBLEM - puppet last run on deneb is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:14:04] 4 days ago? 
[15:15:05] I am more worried about the ns2-v4 Auth DNS check being in soft state [15:15:39] I'm not seeing anything related on /alerts [15:21:17] I disabled puppet on deneb for some debugging on Friday and just re-enabled it [15:21:31] all benign :-) [15:23:05] jynus: forced a recheck.. looking good again [15:23:22] ok, so then it was just a fluke + long time between rechecks [15:23:25] yup [15:23:30] ns2 looks good [15:23:47] (03PS2) 10Volans: reports.network: improve IPv6 AAAA records checks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807986 [15:24:13] (03CR) 10Volans: "addressed comment" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807986 (owner: 10Volans) [15:28:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:28:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T305300)', diff saved to https://phabricator.wikimedia.org/P30781 and previous config saved to /var/cache/conftool/dbconfig/20220704-152931-ladsgroup.json [15:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:29:34] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [15:30:47] 10SRE-swift-storage, 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (10elukey) The new mlserve:ro account has been added, but if I try to use s3cmd with the new credentials I get an error: ` elukey@stat10... [15:30:53] RECOVERY - puppet last run on deneb is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:31:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:31:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1144.eqiad.wmnet with reason: Maintenance [15:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1144.eqiad.wmnet with reason: Maintenance [15:32:14] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T312027)', diff saved to https://phabricator.wikimedia.org/P30782 and previous config saved to /var/cache/conftool/dbconfig/20220704-153218-ladsgroup.json [15:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:22] T312027: Fix enwikivoyage drifts on pagelinks - https://phabricator.wikimedia.org/T312027 [15:33:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P30783 and previous config saved to /var/cache/conftool/dbconfig/20220704-153306-ladsgroup.json [15:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:45] (03CR) 10Muehlenhoff: doc: remove support for stretch, add support for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [15:34:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T312027)', diff saved to https://phabricator.wikimedia.org/P30784 and previous config saved to /var/cache/conftool/dbconfig/20220704-153428-ladsgroup.json [15:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:23] volans: :34 again, you were into something... 
[15:36:29] :) [15:37:21] (03CR) 10Volans: [C: 03+2] reports.network: improve IPv6 AAAA records checks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807986 (owner: 10Volans) [15:37:26] for context for others, I was talking about insightful suggestion (I wouldn't have noticed myself) T311926#8047952 [15:37:26] T311926: PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 [15:37:59] (03Abandoned) 10Volans: Netbox location: fix naming [puppet] - 10https://gerrit.wikimedia.org/r/807997 (owner: 10Volans) [15:38:01] hopefully tomorrow someone can have a look [15:38:07] (03Merged) 10jenkins-bot: reports.network: improve IPv6 AAAA records checks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807986 (owner: 10Volans) [15:48:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Maint done', diff saved to https://phabricator.wikimedia.org/P30785 and previous config saved to /var/cache/conftool/dbconfig/20220704-154810-ladsgroup.json [15:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P30786 and previous config saved to /var/cache/conftool/dbconfig/20220704-154933-ladsgroup.json [15:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] scap: Make scap3 provider packages depend on /usr/bin/scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809270 (https://phabricator.wikimedia.org/T310740) (owner: 10Ahmon Dancy) [16:03:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P30787 and previous config saved to /var/cache/conftool/dbconfig/20220704-160314-ladsgroup.json [16:03:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:45] PROBLEM - puppet last run on ms-be2030 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:04:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P30788 and previous config saved to /var/cache/conftool/dbconfig/20220704-160439-ladsgroup.json [16:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:51] (03PS1) 10David Caro: novafullstack: black and isort [puppet] - 10https://gerrit.wikimedia.org/r/810949 [16:09:53] (03PS1) 10David Caro: novafullstack: add types and some names refactor [puppet] - 10https://gerrit.wikimedia.org/r/810950 [16:10:16] (03PS1) 10Aqu: [WIP] Build spark assembly for Spark3 [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) [16:10:58] (03CR) 10CI reject: [V: 04-1] novafullstack: add types and some names refactor [puppet] - 10https://gerrit.wikimedia.org/r/810950 (owner: 10David Caro) [16:15:11] RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:18:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P30789 and previous config saved to /var/cache/conftool/dbconfig/20220704-161817-ladsgroup.json [16:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T312027)', diff saved to https://phabricator.wikimedia.org/P30790 and previous config 
saved to /var/cache/conftool/dbconfig/20220704-161944-ladsgroup.json [16:19:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1110.eqiad.wmnet with reason: Maintenance [16:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:49] T312027: Fix enwikivoyage drifts on pagelinks - https://phabricator.wikimedia.org/T312027 [16:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1110.eqiad.wmnet with reason: Maintenance [16:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T312027)', diff saved to https://phabricator.wikimedia.org/P30791 and previous config saved to /var/cache/conftool/dbconfig/20220704-162015-ladsgroup.json [16:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T312027)', diff saved to https://phabricator.wikimedia.org/P30792 and previous config saved to /var/cache/conftool/dbconfig/20220704-162225-ladsgroup.json [16:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P30793 and previous config saved to /var/cache/conftool/dbconfig/20220704-163730-ladsgroup.json [16:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:29] (PS1) Volans: scripts/hiera_export: add ganeti group [software/netbox-extras] - https://gerrit.wikimedia.org/r/810955 [16:48:02] (PS1) Volans: netbox::host: adapt to new Netbox data [puppet] - https://gerrit.wikimedia.org/r/810956 [16:48:29] (CR) Volans: "See also the related 
I4a857d6c14c227a810233ff1259d5b01635005b0" [software/netbox-extras] - https://gerrit.wikimedia.org/r/810955 (owner: Volans) [16:48:39] SRE, Wikimedia-Mailing-lists: mailman3: First/Last name should not be mandatory fields - https://phabricator.wikimedia.org/T312020 (ssingh) p:Triage→Medium [16:51:03] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:52:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P30795 and previous config saved to /var/cache/conftool/dbconfig/20220704-165235-ladsgroup.json [16:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:11] (PS1) Zabe: vrts: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/810957 (https://phabricator.wikimedia.org/T308013) [16:55:13] (PS1) Zabe: sudo: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/810958 (https://phabricator.wikimedia.org/T308013) [16:55:15] (PS1) Zabe: vagrant: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/810959 (https://phabricator.wikimedia.org/T308013) [16:55:17] (PS1) Zabe: utils: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/810960 (https://phabricator.wikimedia.org/T308013) [17:01:36] (PS1) Muehlenhoff: bacula::storage: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810961 [17:07:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T312027)', diff saved to https://phabricator.wikimedia.org/P30796 and previous config saved to /var/cache/conftool/dbconfig/20220704-170740-ladsgroup.json [17:07:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1113.eqiad.wmnet with reason: Maintenance [17:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:46] T312027: 
Fix enwikivoyage drifts on pagelinks - https://phabricator.wikimedia.org/T312027 [17:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1113.eqiad.wmnet with reason: Maintenance [17:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T312027)', diff saved to https://phabricator.wikimedia.org/P30797 and previous config saved to /var/cache/conftool/dbconfig/20220704-170800-ladsgroup.json [17:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T312027)', diff saved to https://phabricator.wikimedia.org/P30798 and previous config saved to /var/cache/conftool/dbconfig/20220704-170910-ladsgroup.json [17:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P30799 and previous config saved to /var/cache/conftool/dbconfig/20220704-172415-ladsgroup.json [17:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P30800 and previous config saved to /var/cache/conftool/dbconfig/20220704-173920-ladsgroup.json [17:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T312027)', diff saved to https://phabricator.wikimedia.org/P30801 and previous config saved to /var/cache/conftool/dbconfig/20220704-175425-ladsgroup.json [17:54:27] 
!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1100.eqiad.wmnet with reason: Maintenance [17:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:31] T312027: Fix enwikivoyage drifts on pagelinks - https://phabricator.wikimedia.org/T312027 [17:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1100.eqiad.wmnet with reason: Maintenance [17:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T312027)', diff saved to https://phabricator.wikimedia.org/P30802 and previous config saved to /var/cache/conftool/dbconfig/20220704-175446-ladsgroup.json [17:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T312027)', diff saved to https://phabricator.wikimedia.org/P30803 and previous config saved to /var/cache/conftool/dbconfig/20220704-175655-ladsgroup.json [17:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:51] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:07:13] (PS1) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [18:09:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:12:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P30804 and previous config saved to /var/cache/conftool/dbconfig/20220704-181200-ladsgroup.json [18:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:02] (PS4) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [18:20:21] (CR) CI reject: [V: -1] Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [18:22:07] (CR) Ori: "This change is ready for review." [software/varnish/libvmod-querysort] - https://gerrit.wikimedia.org/r/810552 (owner: Ori) [18:27:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P30805 and previous config saved to /var/cache/conftool/dbconfig/20220704-182706-ladsgroup.json [18:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T312027)', diff saved to https://phabricator.wikimedia.org/P30806 and previous config saved to /var/cache/conftool/dbconfig/20220704-184211-ladsgroup.json [18:42:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1096.eqiad.wmnet with reason: Maintenance [18:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:16] T312027: Fix enwikivoyage drifts on pagelinks - https://phabricator.wikimedia.org/T312027 [18:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:00:00 on db1096.eqiad.wmnet with reason: Maintenance [18:42:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T312027)', diff saved to https://phabricator.wikimedia.org/P30807 and previous config saved to /var/cache/conftool/dbconfig/20220704-184231-ladsgroup.json [18:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:06] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices1003.wikimedia.org [18:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:50] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices2004-dev.wikimedia.org [18:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T312027)', diff saved to https://phabricator.wikimedia.org/P30808 and previous config saved to /var/cache/conftool/dbconfig/20220704-184440-ladsgroup.json [18:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:59] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1003.wikimedia.org [18:52:01] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices1003.wikimedia.org [18:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:17] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices1004.wikimedia.org [18:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:12] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
cloudservices2004-dev.wikimedia.org [18:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:49] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices2005-dev.wikimedia.org [18:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:31] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices1004.wikimedia.org [18:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P30809 and previous config saved to /var/cache/conftool/dbconfig/20220704-185945-ladsgroup.json [18:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:42] (PS5) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [19:01:33] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2001-dev.wikimedia.org [19:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:42] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2005-dev.wikimedia.org [19:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:23] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol1003.wikimedia.org [19:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P30810 and previous config saved to /var/cache/conftool/dbconfig/20220704-191450-ladsgroup.json [19:14:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:02] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol2001-dev.wikimedia.org [19:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:25] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1004.wikimedia.org [19:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:09] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2003-dev.wikimedia.org [19:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:39] (PS6) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [19:26:51] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1005.wikimedia.org [19:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:57] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1004.wikimedia.org [19:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:25] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2004-dev.wikimedia.org [19:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:15] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2003-dev.wikimedia.org [19:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T312027)', diff saved to https://phabricator.wikimedia.org/P30811 and previous config saved to /var/cache/conftool/dbconfig/20220704-192955-ladsgroup.json [19:29:59] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2123.codfw.wmnet with reason: Maintenance [19:30:00] T312027: Fix enwikivoyage drifts on pagelinks - https://phabricator.wikimedia.org/T312027 [19:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2123.codfw.wmnet with reason: Maintenance [19:30:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 8 hosts with reason: Maintenance [19:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 8 hosts with reason: Maintenance [19:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1150.eqiad.wmnet with reason: Maintenance [19:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1150.eqiad.wmnet with reason: Maintenance [19:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:35] !log andrew@cumin1001 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1005.wikimedia.org [19:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:47] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2004-dev.wikimedia.org [19:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:23] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:43:24] PROBLEM - nova-compute proc minimum on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:43:24] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:43:43] PROBLEM - nova-compute proc minimum on cloudvirt1022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:43:43] PROBLEM - nova-compute proc minimum on cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:43:45] PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:43:47] PROBLEM - nova-compute proc minimum on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:11] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:13] PROBLEM - nova-compute proc minimum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:17] PROBLEM - nova-compute proc minimum on cloudvirt1026 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:23] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:24] PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:31] PROBLEM - nova-compute proc minimum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:31] PROBLEM - nova-compute proc minimum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:32] PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:33] PROBLEM - nova-compute proc minimum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:43] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:43] PROBLEM - nova-compute proc minimum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:44] PROBLEM - nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:53] PROBLEM - nova-compute proc minimum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:54] PROBLEM - nova-compute proc minimum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:45:01] PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:45:01] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:45:17] 
PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:45:23] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:45:33] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:45:34] PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:45:34] PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:45:35] PROBLEM - nova-compute proc minimum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:45:41] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:47:33] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:47:55] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:49:19] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:49:20] RECOVERY - nova-compute proc minimum on cloudvirt1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:49:25] RECOVERY - nova-compute proc minimum on cloudvirt1026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:49:31] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:49:31] RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:49:40] RECOVERY - nova-compute proc minimum on cloudvirt1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:49:40] RECOVERY - nova-compute proc minimum on cloudvirt1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:49:40] RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:49:43] RECOVERY - nova-compute proc minimum on cloudvirt1020 is OK: PROCS OK: 1 process with regex args 
^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:49:51] RECOVERY - nova-compute proc minimum on cloudvirt1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:49:51] RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:50:01] RECOVERY - nova-compute proc minimum on cloudvirt1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:50:10] RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:50:40] RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:50:40] RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:50:40] RECOVERY - nova-compute proc minimum on cloudvirt1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:50:47] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:51:07] RECOVERY - nova-compute proc minimum on cloudvirt1023 is OK: 
PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:51:08] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:51:25] RECOVERY - nova-compute proc minimum on cloudvirt1029 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:51:29] RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:51:30] RECOVERY - nova-compute proc minimum on cloudvirt1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:52:05] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:52:15] PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:52:25] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:52:35] RECOVERY - nova-compute proc minimum on cloudvirt1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:52:43] 
PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:52:57] RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:53:25] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1004.wikimedia.org [19:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:41] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:53:43] PROBLEM - nova-compute proc maximum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:53:59] andrewbogott: I don't know if you're aware of this flood^ [19:54:00] RECOVERY - nova-compute proc minimum on cloudvirt1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:54:12] I am, I'm working on it [19:54:23] ok, I'll get out of the way [19:54:40] PROBLEM - nova-compute proc maximum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:54:44] ACKNOWLEDGEMENT - nova-compute proc maximum on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute andrew bogott rabbitmq is losing its mind after some routine 
reboots https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:54:44] ACKNOWLEDGEMENT - nova-compute proc maximum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute andrew bogott rabbitmq is losing its mind after some routine reboots https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:54:45] ACKNOWLEDGEMENT - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute andrew bogott rabbitmq is losing its mind after some routine reboots https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:54:46] ACKNOWLEDGEMENT - nova-compute proc maximum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute andrew bogott rabbitmq is losing its mind after some routine reboots https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:54:47] ACKNOWLEDGEMENT - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute andrew bogott rabbitmq is losing its mind after some routine reboots https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:54:48] ACKNOWLEDGEMENT - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute andrew bogott rabbitmq is losing its mind after some routine reboots https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:54:49] ACKNOWLEDGEMENT - nova-compute proc maximum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute andrew bogott rabbitmq is losing its mind after some routine reboots 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:54:50] ACKNOWLEDGEMENT - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute andrew bogott rabbitmq is losing its mind after some routine reboots https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:54:51] ACKNOWLEDGEMENT - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute andrew bogott rabbitmq is losing its mind after some routine reboots https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:54:52] ACKNOWLEDGEMENT - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute andrew bogott rabbitmq is losing its mind after some routine reboots https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:54:53] ACKNOWLEDGEMENT - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute andrew bogott rabbitmq is losing its mind after some routine reboots https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:54:54] ACKNOWLEDGEMENT - nova-compute proc maximum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute andrew bogott rabbitmq is losing its mind after some routine reboots https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:54:55] ACKNOWLEDGEMENT - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute andrew bogott rabbitmq is losing its mind after some routine reboots https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting 
[19:55:01] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:55:01] PROBLEM - nova-compute proc minimum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:55:02] PROBLEM - nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:55:13] PROBLEM - nova-compute proc minimum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:55:14] PROBLEM - nova-compute proc minimum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:55:14] PROBLEM - puppet last run on ms-be2031 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:55:20] PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:55:35] PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:55:51] PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:55:52] PROBLEM - nova-compute proc minimum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:55:52] PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:55:59] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:56:20] PROBLEM - nova-compute proc minimum on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:56:20] RECOVERY - nova-compute proc maximum on cloudvirt1028 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:56:37] PROBLEM - nova-compute proc minimum on cloudvirt1022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:56:38] PROBLEM - nova-compute proc minimum on cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:56:41] PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:56:42] PROBLEM - nova-compute proc minimum on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:57:07] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:57:11] PROBLEM - nova-compute proc minimum on cloudvirt1026 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:57:17] PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:57:25] PROBLEM - nova-compute proc minimum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:57:27] RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:59:37] PROBLEM - nova-compute proc minimum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:59:55] PROBLEM - nova-compute proc minimum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:59:57] PROBLEM - 
nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:59:58] PROBLEM - nova-compute proc minimum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:03:30] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:03:51] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:05:05] RECOVERY - nova-compute proc minimum on cloudvirt1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:05:39] PROBLEM - nova-compute proc maximum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:05:55] PROBLEM - nova-compute proc maximum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:05:55] RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:06:01] PROBLEM - nova-compute proc maximum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args 
^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:06:03] PROBLEM - nova-compute proc maximum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:06:20] PROBLEM - nova-compute proc maximum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:06:21] PROBLEM - nova-compute proc maximum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:06:40] PROBLEM - nova-compute proc maximum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:06:41] RECOVERY - nova-compute proc minimum on cloudvirt1029 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:06:45] RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:06:46] PROBLEM - nova-compute proc maximum on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:07:07] PROBLEM - nova-compute proc maximum on cloudvirt1022 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:07:09] PROBLEM - nova-compute proc maximum on cloudvirt1026 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:07:11] PROBLEM - nova-compute proc maximum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:07:11] PROBLEM - nova-compute proc maximum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:07:15] PROBLEM - nova-compute proc maximum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:07:23] PROBLEM - nova-compute proc maximum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:07:25] PROBLEM - nova-compute proc maximum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:07:30] RECOVERY - nova-compute proc minimum on cloudvirt1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:07:30] RECOVERY - nova-compute proc minimum on cloudvirt1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:07:33] PROBLEM - nova-compute proc maximum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:07:43] RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:08:15] RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:08:15] RECOVERY - nova-compute proc maximum on cloudvirt1030 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:08:23] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01115 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:08:55] RECOVERY - nova-compute proc maximum on cloudvirt1040 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:08:57] RECOVERY - nova-compute proc maximum on cloudvirt1021 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:09:00] PROBLEM - nova-compute proc maximum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:09:19] !log andrew@cumin1001 END (FAIL) - Cookbook 
sre.hosts.reboot-single (exit_code=1) for host cloudcontrol1004.wikimedia.org [20:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:27] (03PS7) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [20:09:41] PROBLEM - nova-compute proc maximum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:09:45] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:09:57] RECOVERY - nova-compute proc maximum on cloudvirt1033 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:10:00] RECOVERY - nova-compute proc maximum on cloudvirt1024 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:10:25] RECOVERY - nova-compute proc minimum on cloudvirt1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:11:03] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:13:00] RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:13:31] PROBLEM - 
nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:13:35] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:13:40] RECOVERY - nova-compute proc maximum on cloudvirt1036 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:14:17] RECOVERY - nova-compute proc maximum on cloudvirt1020 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:14:20] PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:15:05] RECOVERY - nova-compute proc minimum on cloudvirt1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:16:30] RECOVERY - nova-compute proc minimum on cloudvirt1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:16:47] RECOVERY - nova-compute proc minimum on cloudvirt1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:16:50] RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:17:13] RECOVERY - nova-compute proc maximum on cloudvirt1022 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:17:17] RECOVERY - nova-compute proc maximum on cloudvirt1045 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:17:35] PROBLEM - nova-compute proc minimum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:17:35] PROBLEM - nova-compute proc minimum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:17:45] PROBLEM - nova-compute proc minimum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:17:47] PROBLEM - nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:17:57] PROBLEM - nova-compute proc minimum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:18:03] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting 
[20:18:04] PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:18:21] PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:18:37] PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:18:45] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:19:05] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:19:23] PROBLEM - nova-compute proc minimum on cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:19:24] PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:19:51] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:19:55] RECOVERY - nova-compute proc maximum on cloudvirt1038 is OK: PROCS OK: 1 
process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:20:03] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:20:03] RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:20:13] RECOVERY - nova-compute proc maximum on cloudvirt1034 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:20:21] RECOVERY - nova-compute proc minimum on cloudvirt1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:20:23] RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:20:40] RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:20:57] RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:21:13] RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:21:14] RECOVERY - 
nova-compute proc minimum on cloudvirt1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:21:14] RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:21:20] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:21:40] PROBLEM - nova-compute proc minimum on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:22:00] RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:22:23] RECOVERY - nova-compute proc maximum on cloudvirt1019 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:22:27] RECOVERY - nova-compute proc maximum on cloudvirt1037 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:22:27] RECOVERY - nova-compute proc minimum on cloudvirt1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:22:57] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:23:07] RECOVERY - nova-compute proc minimum on cloudvirt1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:23:47] RECOVERY - nova-compute proc maximum on cloudvirt1043 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:23:55] RECOVERY - nova-compute proc maximum on cloudvirt1025 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:24:35] RECOVERY - nova-compute proc maximum on cloudvirt1027 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:24:35] RECOVERY - nova-compute proc minimum on cloudvirt1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:24:57] RECOVERY - nova-compute proc maximum on cloudvirt1026 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:25:05] RECOVERY - nova-compute proc minimum on cloudvirt1026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:25:20] RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:26:06] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: 
PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:26:45] RECOVERY - nova-compute proc maximum on cloudvirt1028 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:26:57] PROBLEM - puppet last run on ms-be2033 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:27:45] PROBLEM - nova-compute proc maximum on cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:27:46] PROBLEM - nova-compute proc maximum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:27:57] PROBLEM - nova-compute proc maximum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:28:15] PROBLEM - nova-compute proc maximum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:29:11] PROBLEM - nova-compute proc maximum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:29:13] PROBLEM - nova-compute proc maximum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:31:17] PROBLEM - nova-compute proc maximum on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:31:41] RECOVERY - nova-compute proc maximum on cloudvirt1040 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:31:43] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:31:44] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:32:31] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:32:43] RECOVERY - nova-compute proc maximum on cloudvirt1044 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:33:10] RECOVERY - nova-compute proc minimum on cloudvirt1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:33:15] RECOVERY - nova-compute proc maximum on cloudvirt1042 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:33:47] RECOVERY - nova-compute proc minimum on 
cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:33:49] RECOVERY - nova-compute proc maximum on cloudvirt1023 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:34:15] RECOVERY - nova-compute proc maximum on cloudvirt1021 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:34:16] RECOVERY - nova-compute proc minimum on cloudvirt1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:34:35] RECOVERY - nova-compute proc minimum on cloudvirt1029 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:35:17] RECOVERY - nova-compute proc maximum on cloudvirt1029 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:35:18] RECOVERY - nova-compute proc maximum on cloudvirt1024 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:35:23] RECOVERY - nova-compute proc minimum on cloudvirt1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:35:23] RECOVERY - nova-compute proc minimum on cloudvirt1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:35:31] RECOVERY - nova-compute proc maximum on cloudvirt1017 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:36:13] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003393 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:44:31] (PS1) Zabe: utils: chmod +x setup_rake.sh and vcl_ec2_nets.py [puppet] - https://gerrit.wikimedia.org/r/810973 [20:49:23] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:50:15] SRE-tools, Infrastructure-Foundations, Wikimedia-Mailing-lists, netops, serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (Legoktm) Any IP (mis)configuration most likely predates Amir's and my involvement with mailman, we never touc...
[21:09:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:14:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:07:35] PROBLEM - puppet last run on ms-be2036 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:12:47] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:27:11] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:56:06] (PS3) Krinkle: Update call to deprecated IContextSource::getStats [mediawiki-config] - https://gerrit.wikimedia.org/r/810890 (owner: Matěj Suchánek) [22:56:29] (PS4) Krinkle: static.php: Update call to deprecated IContextSource::getStats [mediawiki-config] - https://gerrit.wikimedia.org/r/810890 (owner: Matěj Suchánek) [22:56:33] (PS5) Krinkle: static.php: Update call to deprecated IContextSource::getStats [mediawiki-config] - https://gerrit.wikimedia.org/r/810890 (owner: Matěj Suchánek) [23:28:35] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook