[00:04:51] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-06-14 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:05:17] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:05] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:14:57] PROBLEM - Check systemd state on an-conf1001 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:59] PROBLEM - Check systemd state on conf1004 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:25] PROBLEM - Check systemd state on conf2006 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:27] PROBLEM - Check systemd state on druid1006 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:31] PROBLEM - Check systemd state on conf2005 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:31] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-06-14 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:15:31] PROBLEM - Check systemd state on an-conf1003 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:01] PROBLEM - Check systemd state on druid1004 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:19] PROBLEM - Check systemd state on zookeeper-test1002 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:21] PROBLEM - Check systemd state on an-conf1002 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:43] PROBLEM - Check systemd state on conf1005 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:51] PROBLEM - Check systemd state on conf1006 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:57] PROBLEM - Check systemd state on an-druid1001 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:57] PROBLEM - Check systemd state on druid1005 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:57] PROBLEM - Check systemd state on conf2004 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:59] PROBLEM - Check systemd state on an-druid1002 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:17:05] PROBLEM - Check systemd state on an-druid1003 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:03] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:25:07] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-06-14 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:25:22] (CR) Tim Starling: [C: +2] "Puppet compiler result https://puppet-compiler.wmflabs.org/pcc-worker1002/35940/" [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[00:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:26:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[00:28:23] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:31] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:51] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-06-14 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:39:39] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:43:20] (PS3) DDesouza: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015)
[00:44:23] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:51:40] (PS2) Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[00:53:26] (CR) CI reject: [V: -1] [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[00:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:57:27] (CR) Tim Starling: [C: +2] mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[00:59:09] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:32] (Merged) jenkins-bot: mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[01:01:44] (PS1) Tim Starling: Revert "mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches" [mediawiki-config] - https://gerrit.wikimedia.org/r/807156
[01:01:51] (CR) Tim Starling: [C: +2] Revert "mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches" [mediawiki-config] - https://gerrit.wikimedia.org/r/807156 (owner: Tim Starling)
[01:02:45] (Merged) jenkins-bot: Revert "mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches" [mediawiki-config] - https://gerrit.wikimedia.org/r/807156 (owner: Tim Starling)
[01:05:35] (PS1) Tim Starling: Reapply "mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches" [mediawiki-config] - https://gerrit.wikimedia.org/r/807158
[01:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:06:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:59] (PS2) Tim Starling: Reapply "mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches" [mediawiki-config] - https://gerrit.wikimedia.org/r/807158
[01:07:26] (CR) Tim Starling: "wmf -> wmg" [mediawiki-config] - https://gerrit.wikimedia.org/r/807158 (owner: Tim Starling)
[01:07:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:07:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:21] (CR) Tim Starling: [C: +2] Reapply "mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches" [mediawiki-config] - https://gerrit.wikimedia.org/r/807158 (owner: Tim Starling)
[01:09:08] (Merged) jenkins-bot: Reapply "mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches" [mediawiki-config] - https://gerrit.wikimedia.org/r/807158 (owner: Tim Starling)
[01:11:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:55] (CR) Krinkle: [C: -1] "Need to update the mtime invalidator in getConfigGlobals() as well, at least until T169821 is resolved (which is currently blocked on me a" [mediawiki-config] - https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[01:13:38] !log tstarling@deploy1002 Synchronized wmf-config/mc.php: g 807158 T278392 (duration: 03m 35s)
[01:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:42] T278392: Storage solution for cross-datacenter tokens - https://phabricator.wikimedia.org/T278392
[01:16:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:17:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:52] !log tstarling@deploy1002 Synchronized wmf-config/mc-labs.php: for completeness (duration: 03m 41s)
[01:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:45] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:38:25] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:40:15] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:46:55] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:04:57] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:07:45] (Memory over 85%) firing: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85% - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25
[02:17:55] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:20:11] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:23:35] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:25:53] RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:43:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:06:05] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:15:19] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:33] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:22:37] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:38:47] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:43:17] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:26:42] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[04:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:27:00] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:58] (CR) Majavah: icinga::monitor::toollabs: replace stretch with buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653) (owner: Majavah)
[05:52:30] !log dbmaint s8@eqiad T310011
[05:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:35] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[06:07:45] (Memory over 85%) firing: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85% - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25
[06:15:51] (CR) Labdajiwa: "SVG already optimized. Ran svgo with a config from https://www.mediawiki.org/wiki/Manual:Coding_conventions/SVG#Exemplified_safe_configura" [mediawiki-config] - https://gerrit.wikimedia.org/r/807247 (https://phabricator.wikimedia.org/T311104) (owner: Labdajiwa)
[06:30:09] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:35:32] (NodeTextfileStale) resolved: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:37:15] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:43:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:52:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Switchover es1, es2 and es3 masters', diff saved to https://phabricator.wikimedia.org/P29941 and previous config saved to /var/cache/conftool/dbconfig/20220622-065208-marostegui.json
[06:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1027 es1026 es1031 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P29942 and previous config saved to /var/cache/conftool/dbconfig/20220622-065507-root.json
[06:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:50] (PS1) Marostegui: es1026,1027,1031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/807473
[06:59:04] (CR) Marostegui: [C: +2] es1026,1027,1031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/807473 (owner: Marostegui)
[07:00:04] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T0700)
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:02:50] (CR) Slyngshede: [C: +2] prometheus: remove absented prometheus_directorysize cron [puppet] - https://gerrit.wikimedia.org/r/782360 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:03:06] (CR) Slyngshede: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/782360 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:03:14] (CR) Slyngshede: [C: +2] prometheus: remove absented prometheus_directorysize cron [puppet] - https://gerrit.wikimedia.org/r/782360 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:04:42] (PS6) Hashar: zuul: disable core.logAllRefUpdates at clone time [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620)
[07:05:33] (CR) Hashar: zuul: disable core.logAllRefUpdates at clone time (1 comment) [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar)
[07:06:45] ACKNOWLEDGEMENT - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-06-14 00:00:02 Jcrespo failed soon after starting - The acknowledgement expires at: 2022-06-23 09:06:10. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:06:45] ACKNOWLEDGEMENT - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-06-14 00:00:01 Jcrespo failed soon after starting - The acknowledgement expires at: 2022-06-23 09:06:10. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:06:45] ACKNOWLEDGEMENT - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-06-14 00:00:02 Jcrespo failed soon after starting - The acknowledgement expires at: 2022-06-23 09:06:10. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:06:45] ACKNOWLEDGEMENT - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-06-14 00:00:01 Jcrespo failed soon after starting - The acknowledgement expires at: 2022-06-23 09:06:10. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:11:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29943 and previous config saved to /var/cache/conftool/dbconfig/20220622-071143-root.json
[07:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29944 and previous config saved to /var/cache/conftool/dbconfig/20220622-071201-root.json
[07:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29945 and previous config saved to /var/cache/conftool/dbconfig/20220622-071210-root.json
[07:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:15] (PS1) Marostegui: Revert "es1026,1027,1031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/807251
[07:13:57] (CR) Marostegui: [C: +2] Revert "es1026,1027,1031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/807251 (owner: Marostegui)
[07:19:22] (CR) Muehlenhoff: cumin: add alias for hosts with sensitive sysctl settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806486 (https://phabricator.wikimedia.org/T287081) (owner: Dzahn)
[07:19:38] (PS1) Slyngshede: profile::zookeeper::server remove cron mail spam hack [puppet] - https://gerrit.wikimedia.org/r/807475
[07:20:44] (CR) Muehlenhoff: [C: +2] aptly: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/807127 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:23:19] (PS2) Muehlenhoff: aptrepo: Add a few missing SPDX headers [puppet] - https://gerrit.wikimedia.org/r/807128 (https://phabricator.wikimedia.org/T308013)
[07:25:31] (CR) Muehlenhoff: [C: +2] aptrepo: Add a few missing SPDX headers [puppet] - https://gerrit.wikimedia.org/r/807128 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:25:43] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar)
[07:26:36] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:26:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29946 and previous config saved to /var/cache/conftool/dbconfig/20220622-072647-root.json
[07:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29947 and previous config saved to /var/cache/conftool/dbconfig/20220622-072705-root.json
[07:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29948 and previous config saved to /var/cache/conftool/dbconfig/20220622-072714-root.json
[07:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:04] (PS2) Muehlenhoff: grafana: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/807129 (https://phabricator.wikimedia.org/T308013)
[07:31:12] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet
[07:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:35] (CR) Muehlenhoff: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/807475 (owner: Slyngshede)
[07:33:37] (CR) Slyngshede: "Looks good." [puppet] - https://gerrit.wikimedia.org/r/781051 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:33:39] (CR) Slyngshede: [C: +2] osm: remove absented import_waterlines cron [puppet] - https://gerrit.wikimedia.org/r/781051 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:33:53] (CR) Slyngshede: [C: +2] profile::zookeeper::server remove cron mail spam hack [puppet] - https://gerrit.wikimedia.org/r/807475 (owner: Slyngshede)
[07:38:19] (PS4) Slyngshede: zookeeper: remove absented zookeeper-cleanup cron [puppet] - https://gerrit.wikimedia.org/r/777452 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:38:21] (PS7) Hashar: zuul: disable core.logAllRefUpdates at clone time [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620)
[07:38:36] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar)
[07:39:17] (CR) Hashar: zuul: disable core.logAllRefUpdates at clone time (1 comment) [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar)
[07:39:22] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet
[07:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:40:06] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet
[07:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29949 and previous config saved to /var/cache/conftool/dbconfig/20220622-074151-root.json
[07:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29950 and previous config saved to /var/cache/conftool/dbconfig/20220622-074209-root.json
[07:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29951 and previous config saved to /var/cache/conftool/dbconfig/20220622-074217-root.json
[07:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:38] RECOVERY - Check systemd state on zookeeper-test1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:05] (CR) Marostegui: [C: +2] zuul: disable core.logAllRefUpdates at clone time [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar)
[07:47:54] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:49:29] !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet
[07:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:02] !log Upgrade kernel and reboot on db[2145-2150].codfw.wmnet
[07:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:52] PROBLEM - Keyholder SSH agent on cumin2002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder
[07:51:52] moritzm: ^
[07:52:49] yeah, that is the homer keyholder, needs someone from netops to rearm it, only they have the passphrase
[07:53:33] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1056.eqiad.wmnet
[07:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:00] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet
[07:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:20] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:55:58] (CR) Muehlenhoff: C:snapshot::dumps::timechecker convert cron to timer. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:56:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29952 and previous config saved to /var/cache/conftool/dbconfig/20220622-075655-root.json
[07:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29953 and previous config saved to /var/cache/conftool/dbconfig/20220622-075713-root.json
[07:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29954 and previous config saved to /var/cache/conftool/dbconfig/20220622-075721-root.json
[07:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:42] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:00:04] hashar and brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T0800).
[08:00:52] (CR) Slyngshede: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/784324 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[08:00:56] (CR) Slyngshede: [C: +2] memcached: remove absented memkeys cron [puppet] - https://gerrit.wikimedia.org/r/784324 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[08:01:28] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2022-06-25 07:55:09 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:01:45] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:02:21] (CR) Slyngshede: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/779033 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[08:02:23] (CR) Slyngshede: [C: +2] acme_chief: remove absented acme-chief-designate-tidyup cron [puppet] - https://gerrit.wikimedia.org/r/779033 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[08:04:33] !log Updating operations-puppet-tests-buster-docker Jenkins job to use the latest Docker image (rebuild to catch up with latest defined gems).
https://gerrit.wikimedia.org/r/c/integration/config/+/807478 [08:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:51] if CI complains on `operations/puppet` that might be due to the new docker image [08:04:55] (03CR) 10Slyngshede: "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/778493 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [08:04:57] (03CR) 10Slyngshede: [C: 03+2] sslcert: remove absented update-ocsp-all cron [puppet] - 10https://gerrit.wikimedia.org/r/778493 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [08:04:58] I will run the train [08:05:10] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1056.eqiad.wmnet [08:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:52] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2022-08-24 07:48:40 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:05:55] (03PS1) 10Hashar: group1 wikis to 1.39.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807483 (https://phabricator.wikimedia.org/T308070) [08:05:57] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.39.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807483 (https://phabricator.wikimedia.org/T308070) (owner: 10Hashar) [08:06:04] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1057.eqiad.wmnet [08:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:11] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807483 (https://phabricator.wikimedia.org/T308070) (owner: 10Hashar) [08:07:56] 10SRE, 10Infrastructure-Foundations, 10Parsoid: Retire the old Parsoid deb repository? 
- https://phabricator.wikimedia.org/T309765 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Since there were no further objections, the repository has now been removed. [08:08:23] (03CR) 10Slyngshede: [C: 03+1] "LGTM, matches recommendations on: https://github.com/squid-cache/squid/security/advisories/GHSA-f5cp-6rh3-284w" [puppet] - 10https://gerrit.wikimedia.org/r/807094 (owner: 10Muehlenhoff) [08:09:25] (03CR) 10Slyngshede: [C: 03+1] "LGTM, as recommended on https://github.com/squid-cache/squid/security/advisories/GHSA-f5cp-6rh3-284w" [puppet] - 10https://gerrit.wikimedia.org/r/807093 (owner: 10Muehlenhoff) [08:11:28] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:33] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1057.eqiad.wmnet [08:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:37] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.17 refs T308070 [08:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:42] T308070: 1.39.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T308070 [08:11:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29955 and previous config saved to /var/cache/conftool/dbconfig/20220622-081159-root.json [08:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:14] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:12:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 50%: 
After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29956 and previous config saved to /var/cache/conftool/dbconfig/20220622-081217-root.json [08:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29957 and previous config saved to /var/cache/conftool/dbconfig/20220622-081227-root.json [08:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:13:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:35] not sure why the php restart takes longer nowadays [08:14:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:21] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.17 refs T308070 (duration: 03m 43s) [08:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:07] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: add job to cleanup old docker volumes/cache [puppet] - 10https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593) (owner: 10Jelto) [08:16:10] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:16:31] (03PS2) 
10Jelto: gitlab_runner: add job to cleanup old docker volumes/cache [puppet] - 10https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593) [08:16:51] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2051.codfw.wmnet [08:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:12] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:45] !log Upgrade kernel and reboot on db[1111,1132,1143,1127].eqiad.wmnet [08:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:34] RECOVERY - Keyholder SSH agent on cumin2002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [08:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:26:05] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2052.codfw.wmnet [08:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:34] (03CR) 10Lucas Werkmeister (WMDE): "🤦 sorry about that…" [puppet] - 10https://gerrit.wikimedia.org/r/807200 (https://phabricator.wikimedia.org/T289243) (owner: 10Ryan Kemper) [08:26:42] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [08:26:46] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1058.eqiad.wmnet 
[08:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29959 and previous config saved to /var/cache/conftool/dbconfig/20220622-082702-root.json [08:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29960 and previous config saved to /var/cache/conftool/dbconfig/20220622-082721-root.json [08:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29961 and previous config saved to /var/cache/conftool/dbconfig/20220622-082730-root.json [08:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:08] (03CR) 10Jbond: [C: 03+1] prometheus: Add ipmi_exporter to bullseye+ [puppet] - 10https://gerrit.wikimedia.org/r/807124 (owner: 10Alexandros Kosiaris) [08:30:38] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/806373 (owner: 10Filippo Giunchedi) [08:32:03] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2052.codfw.wmnet [08:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:52] well train looks fine so far [08:37:02] (03PS4) 10Itamar Givon: [cirrus] Add a custom profile for the wikibase language selector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [08:42:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: After kernel upgrade', diff saved to 
https://phabricator.wikimedia.org/P29962 and previous config saved to /var/cache/conftool/dbconfig/20220622-084206-root.json [08:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29963 and previous config saved to /var/cache/conftool/dbconfig/20220622-084225-root.json [08:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29964 and previous config saved to /var/cache/conftool/dbconfig/20220622-084234-root.json [08:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:46] (03PS1) 10Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807254 (https://phabricator.wikimedia.org/T304328) [08:44:02] (03PS2) 10Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807254 (https://phabricator.wikimedia.org/T304328) [08:44:17] (03Abandoned) 10Muehlenhoff: Remove obsolete webperf hosts [puppet] - 10https://gerrit.wikimedia.org/r/785118 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [08:44:59] (03PS1) 10Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807255 (https://phabricator.wikimedia.org/T304328) [08:45:09] (03PS2) 10Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807255 (https://phabricator.wikimedia.org/T304328) [08:45:24] (03PS3) 10Lucas Werkmeister (WMDE): 
Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803496 (https://phabricator.wikimedia.org/T304328) [08:46:52] (03PS1) 10Stang: logos: Update phpcs comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807486 [08:47:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [08:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:43] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1058.eqiad.wmnet [08:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [08:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:56:16] (03CR) 10Ayounsi: [C: 04-1] Add sukhe to super-user for router configuration (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/807145 (owner: 10Ssingh) [09:00:54] (03CR) 10Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:01:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35978/console" [puppet] - 10https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: 10Hashar) [09:09:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [09:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:46] (03PS1) 10Stang: specieswiki: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807491 (https://phabricator.wikimedia.org/T310961) [09:11:32] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:49] (03PS1) 10Ayounsi: eqsin: disable Telia transit [homer/public] - 10https://gerrit.wikimedia.org/r/807492 (https://phabricator.wikimedia.org/T300485) [09:15:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1059.eqiad.wmnet [09:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:01] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2053.codfw.wmnet [09:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [09:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:12] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [09:17:13] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [09:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:26] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [09:17:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [09:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:04] (03PS1) 10Jbond: puppet_compiler: upload facts [puppet] - 10https://gerrit.wikimedia.org/r/807493 [09:18:39] (03CR) 10CI reject: [V: 04-1] puppet_compiler: upload facts [puppet] - 10https://gerrit.wikimedia.org/r/807493 (owner: 10Jbond) [09:19:05] (03CR) 10Ayounsi: [C: 03+2] eqsin: disable Telia transit [homer/public] - 10https://gerrit.wikimedia.org/r/807492 (https://phabricator.wikimedia.org/T300485) (owner: 10Ayounsi) [09:19:42] (03Merged) 10jenkins-bot: eqsin: disable Telia transit [homer/public] - 10https://gerrit.wikimedia.org/r/807492 (https://phabricator.wikimedia.org/T300485) (owner: 10Ayounsi) [09:23:57] (03PS2) 10Jbond: puppet_compiler: upload facts [puppet] - 10https://gerrit.wikimedia.org/r/807493 [09:25:18] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:01] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [09:26:06] (03CR) 10Jbond: [C: 03+2] puppet_compiler: upload facts [puppet] - 10https://gerrit.wikimedia.org/r/807493 (owner: 10Jbond) [09:27:53] (03PS6) 10Alexandros Kosiaris: prometheus: Add ipmi_exporter to bullseye+ [puppet] - 10https://gerrit.wikimedia.org/r/807124 [09:27:55] (03PS1) 10Alexandros Kosiaris: prometheus: Enable scraping of the ipmi exporter [puppet] - 10https://gerrit.wikimedia.org/r/807494 [09:29:25] (03PS2) 10JMeybohm: Allow to dry-run SREBatchRunnerBase [cookbooks] - 
10https://gerrit.wikimedia.org/r/806285 [09:29:27] (03PS2) 10JMeybohm: SREBatchBase: Fix broken batchsize argument [cookbooks] - 10https://gerrit.wikimedia.org/r/806286 [09:29:29] (03PS3) 10JMeybohm: sre.k8s.reboot-nodes: Fix errors identified during dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) [09:29:31] (03PS4) 10JMeybohm: sre.k8s.reboot-node: Dynamically adjust batchsize [cookbooks] - 10https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) [09:30:05] 10SRE, 10Traffic, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q4), 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10fgiunchedi) >>! In T300723#8017017, @BCornwall wrote: > the varnish-mmap-count situation could be res... [09:30:18] PROBLEM - ganeti-wconfd running on ganeti-test2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [09:31:13] ^ ganeti-test2003 is expected, master was failed over [09:31:25] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/807214 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [09:31:43] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga::monitor::toollabs: replace stretch with buster [puppet] - 10https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [09:33:12] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 34.78 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [09:33:28] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 59.51 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [09:34:28] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:34] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2053.codfw.wmnet [09:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:43] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10ayounsi) a:05ayounsi→03RobH @RobH BGP disabled. 
[09:34:59] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on netbox1002.eqiad.wmnet with reason: Adding support for Ganeti groups [09:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:01] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on netbox1002.eqiad.wmnet with reason: Adding support for Ganeti groups [09:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:30] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [09:35:46] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 100.3 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [09:35:58] (03CR) 10Volans: [C: 03+2] Revert "ganeti-netbox-sync: Add netbox 3.2 support" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/805869 (https://phabricator.wikimedia.org/T296452) (owner: 10Volans) [09:36:02] jbond: merging your changes too [09:36:04] (03CR) 10Volans: [C: 03+2] ganeti-netbox-sync: refactor into classes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802178 (owner: 10Volans) [09:36:09] (03CR) 10Volans: [C: 03+2] Netbox Ganeti sync: add groups support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [09:36:40] (03Merged) 10jenkins-bot: Revert "ganeti-netbox-sync: Add netbox 3.2 support" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/805869 (https://phabricator.wikimedia.org/T296452) (owner: 10Volans) [09:36:43] (03Merged) 10jenkins-bot: ganeti-netbox-sync: refactor into classes 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802178 (owner: 10Volans) [09:36:51] (03Merged) 10jenkins-bot: Netbox Ganeti sync: add groups support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [09:37:17] (03CR) 10Volans: [C: 03+2] Netbox: adapt ganeti-sync config file [puppet] - 10https://gerrit.wikimedia.org/r/805337 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [09:39:32] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35979/console" [puppet] - 10https://gerrit.wikimedia.org/r/807494 (owner: 10Alexandros Kosiaris) [09:45:45] (03PS2) 10JMeybohm: Initial commit of helm-state-metrics [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) [09:45:47] (03PS2) 10JMeybohm: Add vendor dir [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806889 (https://phabricator.wikimedia.org/T310714) [09:46:37] (03CR) 10Ayounsi: [C: 03+1] smokeping: stop targetting cr devices, moved to Prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807100 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [09:48:06] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1059.eqiad.wmnet [09:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:28] (03CR) 10JMeybohm: Initial commit of helm-state-metrics (034 comments) [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [09:49:46] (03CR) 10Filippo Giunchedi: "See inline for my opinions (none blocking) on the review. 
LGTM though, happy to discuss more depending on what you (the team) prefer" [puppet] - 10https://gerrit.wikimedia.org/r/807201 (owner: 10Dzahn) [09:49:54] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (16) node(s) change every puppet run: cloudservices1003, cloudservices1004, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, netboxdb2001, netboxdb2002, puppetdb2002, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_c [09:51:08] (03PS4) 10Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. [puppet] - 10https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) [09:51:24] (03CR) 10Filippo Giunchedi: smokeping: stop targetting cr devices, moved to Prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807100 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [09:51:27] (03PS2) 10Filippo Giunchedi: smokeping: stop targetting cr devices, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/807100 (https://phabricator.wikimedia.org/T169860) [09:51:29] (03PS3) 10Filippo Giunchedi: prometheus: ping access switches and FR firewalls [puppet] - 10https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) [09:52:18] (03CR) 10Filippo Giunchedi: [C: 03+2] smokeping: stop targetting cr devices, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/807100 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [09:52:28] (03CR) 10CI reject: [V: 04-1] prometheus: ping access switches and FR firewalls [puppet] - 10https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [09:53:00] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: rework prometheus settings in its own file [puppet] - 10https://gerrit.wikimedia.org/r/806379 
(owner: 10Filippo Giunchedi) [09:53:28] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add profile::pontoon::base [puppet] - 10https://gerrit.wikimedia.org/r/806373 (owner: 10Filippo Giunchedi) [09:53:56] (03PS1) 10Muehlenhoff: Raise profile::cumin::monitoring_agentrun::crit [puppet] - 10https://gerrit.wikimedia.org/r/807497 [09:54:09] (03PS3) 10JMeybohm: Deploy helm-state-metrics to staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) [09:55:42] (03PS5) 10Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. [puppet] - 10https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) [09:56:08] (03CR) 10JMeybohm: Deploy helm-state-metrics to staging-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [09:56:12] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:57:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [09:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:29] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [09:58:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807497 (owner: 10Muehlenhoff) [10:02:55] (03CR) 10JMeybohm: [C: 03+2] sre.k8s.reboot-node: Dynamically adjust batchsize [cookbooks] - 10https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [10:03:05] (03CR) 10JMeybohm: [C: 03+2] sre.k8s.reboot-nodes: Fix errors identified during dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) (owner: 
10JMeybohm) [10:03:13] (03CR) 10JMeybohm: [C: 03+2] SREBatchBase: Fix broken batchsize argument [cookbooks] - 10https://gerrit.wikimedia.org/r/806286 (owner: 10JMeybohm) [10:03:18] (03CR) 10JMeybohm: [C: 03+2] Allow to dry-run SREBatchRunnerBase [cookbooks] - 10https://gerrit.wikimedia.org/r/806285 (owner: 10JMeybohm) [10:04:34] !log installing vim security updates [10:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:39] jouncebot: now [10:04:39] No deployments scheduled for the next 2 hour(s) and 55 minute(s) [10:05:43] (03CR) 10Jbond: "this seems fine to me but adding riccardo who i think has more historical context with this repo" [dns] - 10https://gerrit.wikimedia.org/r/807229 (owner: 10BCornwall) [10:06:29] (03CR) 10Jbond: [C: 03+1] squid/url downloaders: Drop Gopher in ACLs, not used anywhere [puppet] - 10https://gerrit.wikimedia.org/r/807094 (owner: 10Muehlenhoff) [10:06:31] (03Merged) 10jenkins-bot: Allow to dry-run SREBatchRunnerBase [cookbooks] - 10https://gerrit.wikimedia.org/r/806285 (owner: 10JMeybohm) [10:06:35] (03Merged) 10jenkins-bot: SREBatchBase: Fix broken batchsize argument [cookbooks] - 10https://gerrit.wikimedia.org/r/806286 (owner: 10JMeybohm) [10:06:37] (03Merged) 10jenkins-bot: sre.k8s.reboot-nodes: Fix errors identified during dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [10:06:39] (03Merged) 10jenkins-bot: sre.k8s.reboot-node: Dynamically adjust batchsize [cookbooks] - 10https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [10:06:53] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti-test2003.codfw.wmnet [10:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:18] PROBLEM - ganeti-wconfd running on ganeti-test2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command 
name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [10:07:45] (Memory over 85%) firing: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85% - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25 [10:08:45] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2054.codfw.wmnet [10:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:14] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1060.eqiad.wmnet [10:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:47] (03PS2) 10Matthias Mullie: [ImageSuggestions] Enable extension on beta testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807049 (https://phabricator.wikimedia.org/T302711) [10:12:08] (03PS1) 10Jbond: puppet_compiler: fix pcc_facts_processor script [puppet] - 10https://gerrit.wikimedia.org/r/807500 [10:13:26] (03CR) 10Matthias Mullie: [C: 03+2] [ImageSuggestions] Enable extension on beta testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807049 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie) [10:14:08] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2054.codfw.wmnet [10:14:09] (03Merged) 10jenkins-bot: [ImageSuggestions] Enable extension on beta testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807049 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie) [10:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:36] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [10:15:08] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service,swift_ring_manager.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:12] (03CR) 10Jbond: [C: 03+2] puppet_compiler: fix pcc_facts_processor script [puppet] - 10https://gerrit.wikimedia.org/r/807500 (owner: 10Jbond) [10:16:46] (03CR) 10Muehlenhoff: [C: 03+2] grafana: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/807129 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:17:55] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1060.eqiad.wmnet [10:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:58] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2055.codfw.wmnet [10:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:36] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1061.eqiad.wmnet [10:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:13] (03CR) 10Jbond: "lgtm but a couple of nits to make sure things work on the first run" [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche) [10:21:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:22:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:48] !log mvernon@cumin1001 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1061.eqiad.wmnet [10:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:54] (03CR) 10Ayounsi: [C: 03+1] "Had a look at the latest PCC output as well (including centrallog) and it lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [10:28:04] (03CR) 10Ayounsi: [C: 03+1] prometheus: ping access switches and FR firewalls [puppet] - 10https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [10:28:56] (03CR) 10Klausman: [C: 03+2] net: Add network config setup for ML staging k8s [puppet] - 10https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [10:30:42] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1062.eqiad.wmnet [10:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:51] (03PS1) 10Klausman: pki: Add ML staging k8s to list of CAs [puppet] - 10https://gerrit.wikimedia.org/r/807502 (https://phabricator.wikimedia.org/T302195) [10:32:55] (03CR) 10Volans: "reply inline" [dns] - 10https://gerrit.wikimedia.org/r/807229 (owner: 10BCornwall) [10:33:32] (03PS2) 10Muehlenhoff: smokeping: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/807130 (https://phabricator.wikimedia.org/T308013) [10:35:40] (03CR) 10Mark Bergsma: [C: 03+1] Delete git-setup script [dns] - 10https://gerrit.wikimedia.org/r/807229 (owner: 10BCornwall) [10:36:09] (03CR) 10Muehlenhoff: [C: 03+2] smokeping: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/807130 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:36:14] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org [10:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:55] 
(LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:37:06] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1062.eqiad.wmnet [10:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:45] (03PS1) 10JMeybohm: k8s.reboot-nodes: Fix call to super()._batchsize [cookbooks] - 10https://gerrit.wikimedia.org/r/807504 (https://phabricator.wikimedia.org/T260661) [10:38:23] RECOVERY - k8s requests count to the API on ml-serve-ctrl2001 is OK: (C)100 ge (W)50 ge 43.78 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [10:40:36] (03PS2) 10Muehlenhoff: squid/racktables: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806406 (https://phabricator.wikimedia.org/T308013) [10:41:45] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org [10:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:42:33] (03CR) 10Muehlenhoff: [C: 03+2] squid: Harden config, we don't use Gopher anywhere [puppet] - 10https://gerrit.wikimedia.org/r/807093 (owner: 10Muehlenhoff) [10:42:57] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1063.eqiad.wmnet [10:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:13] (KubernetesRsyslogDown) 
firing: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:47:30] certifi did scare me because it seemed from last year ;) [10:50:48] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2055.codfw.wmnet [10:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:03] (03PS1) 10Volans: Add wmflib as additional dependency [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/807507 (https://phabricator.wikimedia.org/T262446) [10:52:40] (03CR) 10Ayounsi: [C: 03+1] Add wmflib as additional dependency [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/807507 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [10:52:50] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/807504 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [10:53:49] !log systemctl restart rsyslog on kubernetes2008 [10:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:59] (03CR) 10JMeybohm: [C: 03+2] k8s.reboot-nodes: Fix call to super()._batchsize [cookbooks] - 10https://gerrit.wikimedia.org/r/807504 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [10:55:22] (03CR) 10Volans: [V: 03+2 C: 03+2] Add wmflib as additional dependency [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/807507 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [10:56:42] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1063.eqiad.wmnet [10:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:57:58] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:58:30] (03Merged) 10jenkins-bot: k8s.reboot-nodes: Fix call to super()._batchsize [cookbooks] - 10https://gerrit.wikimedia.org/r/807504 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [11:02:52] 10SRE, 10SRE-tools, 10Icinga, 10Infrastructure-Foundations, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (10MoritzMuehlenhoff) @Volans: Can this task be closed with https://gerrit.wikimedia.org/r/803317 merged? [11:03:47] 10SRE, 10SRE-tools, 10Icinga, 10Infrastructure-Foundations, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (10Volans) I was planning to close it when the new spicerack will be released with the patch... is not yet deployed to prod. But... 
[11:05:04] !log volans@deploy1002 Started deploy [netbox/deploy@7bbf659]: Adding wmflib to venv deps [11:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:39] RECOVERY - Check systemd state on an-conf1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:45] (03PS1) 10Jbond: P:mediawiki::scap_client: add paremeter to indicate scap master [puppet] - 10https://gerrit.wikimedia.org/r/807510 (https://phabricator.wikimedia.org/T310740) [11:07:58] !log volans@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Adding wmflib to venv deps (duration: 02m 54s) [11:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:11] !log volans@deploy1002 Started deploy [netbox/deploy@7bbf659]: Adding wmflib to venv deps [11:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:23] !log volans@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Adding wmflib to venv deps (duration: 01m 11s) [11:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:13] !log volans@deploy1002 Started deploy [netbox/deploy@7bbf659]: Adding wmflib to venv deps [11:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:34] !log volans@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Adding wmflib to venv deps (duration: 01m 20s) [11:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:45] (Memory over 85%) resolved: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85% got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25 [11:14:29] (03PS2) 10Jbond: P:mediawiki::scap_client: add paremeter to indicate scap master [puppet] - 10https://gerrit.wikimedia.org/r/807510 (https://phabricator.wikimedia.org/T310740) [11:14:34] (03CR) 10EllenR: "Code looks good, I am seeing a merge 
conflict tag and not sure if that needs to give a ding or not." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) (owner: 10Eigyan) [11:17:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:20:12] (03PS5) 10Vlad.shapik: WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [11:20:46] (03CR) 10CI reject: [V: 04-1] WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [11:20:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/807502 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [11:22:03] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/806406 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:22:23] RECOVERY - Check systemd state on an-conf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:15] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:24:29] RECOVERY - Check systemd state on an-druid1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:39] RECOVERY - Check systemd state on an-druid1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:41] RECOVERY - Check systemd state on an-conf1003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:45] RECOVERY - Check systemd state on druid1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:53] RECOVERY - Check systemd state on conf1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:55] RECOVERY - Check systemd state on conf1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:05] RECOVERY - Check systemd state on conf2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:11] RECOVERY - Check systemd state on druid1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:23] RECOVERY - Check systemd state on druid1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:39] RECOVERY - Check systemd state on conf2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:57] RECOVERY - Check systemd state on conf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:59] RECOVERY - Check systemd state on conf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:49] RECOVERY - Check systemd state on an-druid1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:23] (03CR) 10Muehlenhoff: [C: 03+2] squid/racktables: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806406 (https://phabricator.wikimedia.org/T308013) (owner: 
10Muehlenhoff) [11:41:30] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox [11:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:03] (03PS1) 10Jbond: P::base: allow useres to configure enable_unpriv_userns via hiera [puppet] - 10https://gerrit.wikimedia.org/r/807515 (https://phabricator.wikimedia.org/T287081) [11:43:05] (03PS1) 10Jbond: cumin: add alias for hosts with sensitive sysctl settings [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) [11:43:12] (03PS1) 10Slyngshede: C:osm::import_waterlines remove logrotate configuration. [puppet] - 10https://gerrit.wikimedia.org/r/807517 [11:44:04] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2056.codfw.wmnet [11:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:15] (03PS1) 10Jelto: gitlab_runner: add docker-registry.discovery.wmnet to allowed_images [puppet] - 10https://gerrit.wikimedia.org/r/807518 [11:44:18] (03CR) 10Jbond: base: create profile to allow unprivileged userns, use it on gitlab_runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn) [11:45:16] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35987/console" [puppet] - 10https://gerrit.wikimedia.org/r/807518 (owner: 10Jelto) [11:45:33] (03CR) 10Alexandros Kosiaris: [V: 03+1] "Adding observability team." 
[puppet] - 10https://gerrit.wikimedia.org/r/807494 (owner: 10Alexandros Kosiaris) [11:45:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35986/console" [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond) [11:46:25] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35989/console" [puppet] - 10https://gerrit.wikimedia.org/r/807515 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond) [11:47:00] (03CR) 10Jbond: P::base: allow useres to configure enable_unpriv_userns via hiera [puppet] - 10https://gerrit.wikimedia.org/r/807515 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond) [11:48:22] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [11:48:32] (03PS1) 10Klausman: Add dummy secrets for ML staging k8s CA [labs/private] - 10https://gerrit.wikimedia.org/r/807520 [11:48:45] (03PS2) 10Slyngshede: C:osm::import_waterlines remove logrotate configuration. 
[puppet] - 10https://gerrit.wikimedia.org/r/807517 [11:49:10] (03PS2) 10Klausman: Add dummy secrets for ML staging k8s CA [labs/private] - 10https://gerrit.wikimedia.org/r/807520 (https://phabricator.wikimedia.org/T302195) [11:49:48] (03CR) 10Jbond: "i think the https://gerrit.wikimedia.org/r/c/operations/puppet/+/807516/1 may be a better way to go as it relies on what is actually set a" [puppet] - 10https://gerrit.wikimedia.org/r/806486 (https://phabricator.wikimedia.org/T287081) (owner: 10Dzahn) [11:50:03] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35991/console" [puppet] - 10https://gerrit.wikimedia.org/r/807517 (owner: 10Slyngshede) [11:50:08] (03CR) 10Klausman: [C: 03+2] Add dummy secrets for ML staging k8s CA [labs/private] - 10https://gerrit.wikimedia.org/r/807520 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [11:50:33] (03CR) 10Klausman: [V: 03+2 C: 03+2] Add dummy secrets for ML staging k8s CA [labs/private] - 10https://gerrit.wikimedia.org/r/807520 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [11:52:02] (03CR) 10Jbond: [V: 03+1] cumin: add alias for hosts with sensitive sysctl settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond) [11:58:21] (03CR) 10Slyngshede: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/777452 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:58:23] (03CR) 10Slyngshede: [C: 03+2] zookeeper: remove absented zookeeper-cleanup cron [puppet] - 10https://gerrit.wikimedia.org/r/777452 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:59:37] (03PS1) 10Klausman: pki: Fix wrong cluster name for ML staging k8s [labs/private] - 10https://gerrit.wikimedia.org/r/807524 [12:00:01] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:dumps::web::dumpstatusfiles, convert to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/807057 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:00:05] (03CR) 10Klausman: [V: 03+2 C: 03+2] pki: Fix wrong cluster name for ML staging k8s [labs/private] - 10https://gerrit.wikimedia.org/r/807524 (owner: 10Klausman) [12:02:45] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2056.codfw.wmnet [12:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:04] (03CR) 10Kosta Harlan: Structured task: enable free text for "other" rejection reason (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) (owner: 10MewOphaswongse) [12:05:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Cmjohnson) @ayounsi lsw1-e4 and f4 do not show up as options in netbox in the provision network script. [12:06:07] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet [12:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:48] (03PS2) 10Klausman: pki: Add ML staging k8s to list of CAs [puppet] - 10https://gerrit.wikimedia.org/r/807502 (https://phabricator.wikimedia.org/T302195) [12:08:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10Cmjohnson) @cmooney the switches do not show up in netbox as an option for the provisioning script. I tagged Arzhel in a differe... 
[12:11:11] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2022-06-25 07:55:09 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:12:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS buster [12:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster [12:12:12] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet [12:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:51] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2022-08-24 07:48:40 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:17:03] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet [12:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:26] (03PS1) 10Cmjohnson: Adding backup1009 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/807530 (https://phabricator.wikimedia.org/T307048) [12:18:37] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1016.eqiad.wmnet with OS buster [12:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:15] !log mvernon@cumin1001 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet [12:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster executed w... [12:20:34] (03PS5) 10Jgiannelos: Improve performance of Tegola tile pregeneration [puppet] - 10https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) (owner: 10Isabelle Hurbain-Palatin) [12:22:02] (03CR) 10Cmjohnson: [C: 03+2] Adding backup1009 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/807530 (https://phabricator.wikimedia.org/T307048) (owner: 10Cmjohnson) [12:22:34] (03CR) 10Jgiannelos: Improve performance of Tegola tile pregeneration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) (owner: 10Isabelle Hurbain-Palatin) [12:23:13] (03CR) 10CI reject: [V: 04-1] Improve performance of Tegola tile pregeneration [puppet] - 10https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) (owner: 10Isabelle Hurbain-Palatin) [12:23:32] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2058.codfw.wmnet [12:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:01] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host backup1009.eqiad.wmnet with OS bullseye [12:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:09] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1009.eqiad.wmnet with OS bullseye [12:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:13] 10SRE, 
10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host backup1009.eqiad.wmn... [12:24:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host backup1009.eqiad.wmnet w... [12:25:35] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:26:42] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [12:26:45] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: ping access switches and FR firewalls [puppet] - 10https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:26:52] (03PS4) 10Filippo Giunchedi: prometheus: ping access switches and FR firewalls [puppet] - 10https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) [12:27:12] 10SRE, 10Commons, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, and 5 others: Enable statement usage tracking on Commons and Co - 
https://phabricator.wikimedia.org/T188730 (10Manuel) Hi @ItamarWMDE this seems to be on the tech board already, right? [12:27:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10Cmjohnson) @jcrespo can you confirm how you want the raid, it is failing during the installation. I have it as Each SS... [12:27:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/807517 (owner: 10Slyngshede) [12:30:33] (03PS1) 10Slyngshede: C:dumps::web::dumpstatusfiles run every five minutes. [puppet] - 10https://gerrit.wikimedia.org/r/807533 [12:31:16] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1065.eqiad.wmnet [12:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10jcrespo) the SSDs should be a single *software* raid0. If the remainder is HDs, those should be on RAID 6. The installation should succeed- but... [12:32:24] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35999/console" [puppet] - 10https://gerrit.wikimedia.org/r/807533 (owner: 10Slyngshede) [12:32:59] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2058.codfw.wmnet [12:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:10] (03CR) 10Filippo Giunchedi: [C: 03+1] "This LGTM, thank you Alex for metrics estimates, super useful!" [puppet] - 10https://gerrit.wikimedia.org/r/807494 (owner: 10Alexandros Kosiaris) [12:36:30] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:osm::import_waterlines remove logrotate configuration. 
[puppet] - 10https://gerrit.wikimedia.org/r/807517 (owner: 10Slyngshede) [12:38:34] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2059.codfw.wmnet [12:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:26] (03CR) 10Muehlenhoff: [C: 03+1] C:dumps::web::dumpstatusfiles run every five minutes. [puppet] - 10https://gerrit.wikimedia.org/r/807533 (owner: 10Slyngshede) [12:39:31] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:dumps::web::dumpstatusfiles run every five minutes. [puppet] - 10https://gerrit.wikimedia.org/r/807533 (owner: 10Slyngshede) [12:40:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/807124 (owner: 10Alexandros Kosiaris) [12:42:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ayounsi) @Cmjohnson they're named "cloudsw1-e4/f4" [12:48:29] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.04354 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:48:52] (03PS1) 10Alexandros Kosiaris: prometheus: Fixes for I0c1a0b9ef2a1310fa5d0c9 [puppet] - 10https://gerrit.wikimedia.org/r/807540 [12:49:52] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] prometheus: Fixes for I0c1a0b9ef2a1310fa5d0c9 [puppet] - 10https://gerrit.wikimedia.org/r/807540 (owner: 10Alexandros Kosiaris) [12:50:55] PROBLEM - Check systemd state on ms-be2044 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:55] PROBLEM - Check systemd state on ms-be1050 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:03] PROBLEM - Check systemd state on ms-be1062 
is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:11] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:24] (03CR) 10Andrew Bogott: "One thing to remember about these settings (which I forget) is that the VM doesn't GET the settings until after the VM is able to contact " [puppet] - 10https://gerrit.wikimedia.org/r/807174 (https://phabricator.wikimedia.org/T309930) (owner: 10David Caro) [12:53:09] PROBLEM - Check systemd state on thanos-fe1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:14] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:54:43] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:03] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:45] (03PS1) 10Majavah: openstack::nova::monitor: improve check_flavor_properties performance [puppet] - 10https://gerrit.wikimedia.org/r/807541 [12:56:13] (03CR) 10Vgutierrez: Implement MediaWiki multi-DC traffic component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling) [12:57:07] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36002/console" [puppet] - 10https://gerrit.wikimedia.org/r/807541 (owner: 10Majavah) [12:57:35] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2059.codfw.wmnet [12:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:19] !log fix MTU on codfw switches access ports [12:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:02] Greetings Everyone! [12:59:16] (03CR) 10Elukey: [C: 03+1] pki: Add ML staging k8s to list of CAs [puppet] - 10https://gerrit.wikimedia.org/r/807502 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T1300). [13:00:05] eigyan, itamarWMDE, and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:05] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:00:28] o/ [13:01:22] (03PS5) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 - BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) [13:01:23] I can deploy :) [13:01:52] (03PS2) 10Majavah: openstack::nova::monitor: improve check_flavor_properties performance [puppet] - 10https://gerrit.wikimedia.org/r/807541 [13:01:59] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 3 VMs requested for dse_etcd cluster - https://phabricator.wikimedia.org/T311131 (10BTullis) [13:02:29] Thank you Lucas_WMDE [13:02:40] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 3 VMs requested for dse_etcd cluster - https://phabricator.wikimedia.org/T311131 (10BTullis) [13:02:58] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 3 VMs requested for dse_etcd cluster - https://phabricator.wikimedia.org/T311131 (10BTullis) [13:03:00] (03CR) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 - BETA (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) (owner: 10Eigyan) [13:03:03] (03CR) 10Lucas Werkmeister (WMDE): [wmf-config]: Deploy GDI Survey Wave 2 - BETA (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) (owner: 10Eigyan) [13:03:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] [wmf-config]: Deploy GDI Survey Wave 2 - BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) (owner: 10Eigyan) [13:03:51] (03CR) 10Vgutierrez: "wmf-tls log format could be dropped altogether considering that we've adopted HAProxy as our TLS terminator" [puppet] - 10https://gerrit.wikimedia.org/r/803301 
(https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:04:12] (03PS1) 10Elukey: profile::pki::multirootca: fix ml_staging key [labs/private] - 10https://gerrit.wikimedia.org/r/807544 [13:04:30] (03Merged) 10jenkins-bot: [wmf-config]: Deploy GDI Survey Wave 2 - BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) (owner: 10Eigyan) [13:04:34] (03CR) 10Klausman: [C: 03+1] profile::pki::multirootca: fix ml_staging key [labs/private] - 10https://gerrit.wikimedia.org/r/807544 (owner: 10Elukey) [13:04:35] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36003/console" [puppet] - 10https://gerrit.wikimedia.org/r/807541 (owner: 10Majavah) [13:04:58] (03CR) 10Elukey: [V: 03+2 C: 03+2] profile::pki::multirootca: fix ml_staging key [labs/private] - 10https://gerrit.wikimedia.org/r/807544 (owner: 10Elukey) [13:05:33] PROBLEM - Check systemd state on ms-fe1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:40] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36004/console" [puppet] - 10https://gerrit.wikimedia.org/r/807541 (owner: 10Majavah) [13:05:45] PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:49] alright, syncing the goddammit survey ;) [13:05:55] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:06:01] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36005/console" [puppet] - 10https://gerrit.wikimedia.org/r/807502 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:06:01] :) [13:06:21] (if it’s only for beta at the moment, there’s no point testing it on mwdebug) [13:06:35] Agreed Lucas_WMDE [13:06:41] (03CR) 10Klausman: [V: 03+1 C: 03+2] pki: Add ML staging k8s to list of CAs [puppet] - 10https://gerrit.wikimedia.org/r/807502 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:06:48] Thank you very much Lucas_WMDE [13:06:53] np! [13:07:13] (03PS3) 10Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807254 (https://phabricator.wikimedia.org/T304328) [13:07:25] PROBLEM - Check systemd state on ms-be2047 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:00] (03CR) 10Andrew Bogott: [C: 03+2] openstack::nova::monitor: improve check_flavor_properties performance [puppet] - 10https://gerrit.wikimedia.org/r/807541 (owner: 10Majavah) [13:08:25] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (10BTullis) [13:08:42] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 3 VMs requested for dse_etcd cluster - https://phabricator.wikimedia.org/T311131 (10BTullis) [13:09:00] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (10BTullis) [13:09:10] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:807211|[wmf-config]: Deploy GDI Survey Wave 2 - BETA (T311079)]] 
(duration: 03m 29s) [13:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:15] T311079: Deploy GDI Safety Survey Wave 2 on EN, ES, FA, FR, and PT wikis - https://phabricator.wikimedia.org/T311079 [13:09:20] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807254 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [13:09:30] eigyan: done, it should show up on beta soon [13:09:43] Excellent! [13:09:49] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster - https://phabricator.wikimedia.org/T311131 (10BTullis) [13:09:53] (03CR) 10Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:10:06] (03Merged) 10jenkins-bot: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807254 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [13:10:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:19] (03PS1) 10Volans: Rename cluster to ganeti_cluster [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807545 [13:10:36] !log fix MTU in drmrs [13:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:41] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1065.eqiad.wmnet [13:10:43] (03PS1) 10Volans: netbox::host: rename cluster to ganeti_cluster [puppet] - 10https://gerrit.wikimedia.org/r/807546 [13:10:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:08] syncing the first wmgWikibaseTermboxEnabled change directly, it only adds a new variable and I don’t think it makes sense to test it on mwdebug [13:11:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:11:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:11:11] (03CR) 10Jgreen: [C: 03+1] Delete git-setup script (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/807229 (owner: 10BCornwall) [13:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:31] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:12] (03CR) 10Hokwelum: [C: 04-1] "The interval key is missing here" [puppet] - 10https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:12:15] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:47] (03PS3) 10Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807255 (https://phabricator.wikimedia.org/T304328) [13:12:57] (CertManagerCertNotReady) resolved: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [13:13:00] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (10BTullis) 
[13:14:13] PROBLEM - Check systemd state on ms-be1045 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:18] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:807254|Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) (T304328)]] (duration: 03m 35s) [13:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:23] T304328: Move Termbox SSR for Beta Wikidata into deployment-prep project - https://phabricator.wikimedia.org/T304328 [13:14:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807255 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [13:14:42] (03PS6) 10Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. [puppet] - 10https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) [13:14:51] (03CR) 10Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:14:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:29] (03Merged) 10jenkins-bot: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807255 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [13:15:53] okay, change 2/3 is on mwdebug1001 [13:15:57] RECOVERY - Check systemd state on ms-be2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:59] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/807515 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond) [13:16:00] fyi itamarWMDE (but I’ll also take a look myself) [13:17:00] termbox looks fine on my end [13:17:56] I’ll go ahead and sync that [13:18:14] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2060.codfw.wmnet [13:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:44] (03PS4) 10Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803496 (https://phabricator.wikimedia.org/T304328) [13:19:02] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:18] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:13] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:21:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:21:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:43] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:807255|Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) (T304328)]] (duration: 03m 35s) [13:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:48] T304328: Move Termbox SSR for Beta Wikidata into deployment-prep project - https://phabricator.wikimedia.org/T304328 [13:22:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:13] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803496 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [13:22:55] (03PS2) 10Majavah: wmcs: neutron: use min_over_time [alerts] - 10https://gerrit.wikimedia.org/r/805783 [13:23:00] (03Merged) 10jenkins-bot: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/803496 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [13:23:17] (03CR) 10Muehlenhoff: profile::aptrepo::wikimedia test public apt repo on Apache (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/803506 (owner: 10Slyngshede) [13:24:21] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:24:47] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] wmcs: neutron: use min_over_time [alerts] - 10https://gerrit.wikimedia.org/r/805783 (owner: 10Majavah) [13:25:03] wmgWikibaseTermboxEnabled change 3/3 is on mwdebug1001 (cc itamarWMDE) [13:25:05] testing again… [13:25:46] still looks okay to me, I’ll sync [13:26:01] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:22] (03CR) 10JHathaway: [C: 03+1] "looks good!" 
[software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [13:26:52] (03Merged) 10jenkins-bot: wmcs: neutron: use min_over_time [alerts] - 10https://gerrit.wikimedia.org/r/805783 (owner: 10Majavah) [13:27:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:17] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2060.codfw.wmnet [13:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:27:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:16] (03CR) 10Vgutierrez: "generally speaking it looks good but we should move towards setting this to ENFORCED rather than PERMISSIVE." [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:28:25] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:28:56] !log fix MTU on eqiad server facing switch ports [13:28:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:08] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2061.codfw.wmnet [13:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:56] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1066.eqiad.wmnet [13:29:56] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:803496|Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) (T304328)]] (1/2) (duration: 03m 35s) [13:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:04] T304328: Move Termbox SSR for Beta Wikidata into deployment-prep project - https://phabricator.wikimedia.org/T304328 [13:30:36] (03CR) 10Vgutierrez: [C: 03+1] "even if this CR isn't backwards compatible it isn't a big deal cause ats-be doesn't use parent proxies (and we don't run ats-tls anymore)" [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:31:59] (03CR) 10Vgutierrez: "looks good, should we consider backwards compatibility to let 8.x and 9.x coexist?" 
[puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:32:13] PROBLEM - Check systemd state on ms-fe1011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:15] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: remove redundant metrics [puppet] - 10https://gerrit.wikimedia.org/r/803297 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:33:21] RECOVERY - Check systemd state on ms-be2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:46] (03PS5) 10Lucas Werkmeister (WMDE): [cirrus] Add a custom profile for the wikibase language selector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [13:33:52] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:803496|Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) (T304328)]] (2/2) (duration: 03m 39s) [13:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:59] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] [cirrus] Add a custom profile for the wikibase language selector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [13:35:45] (03PS10) 10Slyngshede: profile::aptrepo::wikimedia test public apt repo on Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506 [13:35:50] (03Merged) 10jenkins-bot: [cirrus] Add a custom profile for the wikibase language selector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [13:35:51] PROBLEM - nova-compute proc minimum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex 
args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:35:52] PROBLEM - nova-compute proc minimum on cloudvirt1026 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:35:53] PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:35:58] (03CR) 10Slyngshede: profile::aptrepo::wikimedia test public apt repo on Apache (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/803506 (owner: 10Slyngshede) [13:35:59] PROBLEM - nova-compute proc minimum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:35:59] PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:36:07] PROBLEM - nova-compute proc minimum on cloudvirt1022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:36:07] PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:36:13] PROBLEM - nova-compute proc minimum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:36:19] PROBLEM - 
nova-compute proc minimum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:36:35] PROBLEM - nova-compute proc minimum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:36:36] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:36:39] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:36:49] PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:36:49] PROBLEM - nova-compute proc minimum on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:36:50] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:36:50] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:36:51] PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes 
with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:36:59] PROBLEM - nova-compute proc minimum on cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:37:00] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:37:07] RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:47] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:37:59] RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:38:05] RECOVERY - Check systemd state on ms-be1045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:13] RECOVERY - nova-compute proc minimum on cloudvirt1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:38:29] (03PS1) 10Lucas Werkmeister (WMDE): Revert "[cirrus] Add a custom profile for the wikibase language selector" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807265 [13:38:35] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "[cirrus] Add a custom profile for 
the wikibase language selector" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807265 (owner: 10Lucas Werkmeister (WMDE)) [13:38:45] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:39:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:22] (03Merged) 10jenkins-bot: Revert "[cirrus] Add a custom profile for the wikibase language selector" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807265 (owner: 10Lucas Werkmeister (WMDE)) [13:39:47] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:39:53] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:40:01] koi: your turn :) are you there? 
[13:40:01] PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:40:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:40:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:40:13] PROBLEM - nova-compute proc minimum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:19] oh hi, I'm here [13:40:25] (03PS6) 10Vlad.shapik: WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [13:40:29] RECOVERY - nova-compute proc minimum on cloudvirt1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:40:31] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:40:33] ok [13:40:35] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:40:57] (03CR) 10CI reject: [V: 04-1] WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [13:41:07] PROBLEM 
- nova-compute proc minimum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:41:08] PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:41:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:41:12] let’s do the logos change first in case we don’t have time for both [13:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:20] (03PS26) 10Filippo Giunchedi: Add a host's conftool pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [13:41:31] PROBLEM - nova-compute proc minimum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:41:32] PROBLEM - nova-compute proc minimum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:42:08] (03CR) 10Herron: [C: 03+1] prometheus: Enable scraping of the ipmi exporter [puppet] - 10https://gerrit.wikimedia.org/r/807494 (owner: 10Alexandros Kosiaris) [13:42:12] (03PS2) 10Lucas Werkmeister (WMDE): specieswiki: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807491 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [13:42:33] PROBLEM - nova-compute proc minimum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:42:35] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:42:37] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:42:39] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004559 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:43:03] (03CR) 10Vgutierrez: "Looks good, should we consider an approach that allows 8.x and 9.x to coexist?" [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:43:49] PROBLEM - nova-compute proc maximum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:43:51] PROBLEM - nova-compute proc maximum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:43:53] PROBLEM - nova-compute proc maximum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:43:55] PROBLEM - nova-compute proc minimum on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:43:55] PROBLEM - 
nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:43:56] PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:44:05] PROBLEM - nova-compute proc maximum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:44:11] PROBLEM - nova-compute proc maximum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:44:12] PROBLEM - nova-compute proc maximum on cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:44:19] PROBLEM - nova-compute proc maximum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:44:24] (03CR) 10Filippo Giunchedi: "I finally was able to test this patch in Pontoon (great job Ben!)" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [13:44:29] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] specieswiki: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807491 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [13:44:57] PROBLEM - nova-compute proc maximum on cloudvirt1023 is CRITICAL: 
PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:45:01] PROBLEM - nova-compute proc maximum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:45:03] PROBLEM - nova-compute proc maximum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:45:04] PROBLEM - nova-compute proc maximum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:45:13] PROBLEM - nova-compute proc maximum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:45:13] PROBLEM - nova-compute proc maximum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:45:37] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2061.codfw.wmnet [13:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:50] ^ is being worked on in the -cloud-admin channel [13:45:53] (03Merged) 10jenkins-bot: specieswiki: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807491 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [13:46:08] ack [13:46:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [13:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:19] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1066.eqiad.wmnet [13:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:25] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:46:26] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:46:51] RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:11] koi: the logos change is on mwdebug1001, can you test it? 
[13:47:17] looking [13:47:21] (03CR) 10Vgutierrez: trafficserver: 9.x upgrade: switch ip_allow.config to YAML format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:47:25] PROBLEM - nova-compute proc maximum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:47:40] (I need a Ctrl+F5 but after that it actually seems to have loaded the new logo from mwdebug) [13:47:42] (03CR) 10Majavah: [V: 03+1] P:openstack::puppetmaster: alert for puppet certs for deleted instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806433 (owner: 10Majavah) [13:47:44] (*needed) [13:48:04] LGTM [13:48:07] ack [13:48:08] (03CR) 10Filippo Giunchedi: Add a host's conftool pooled status and weight per service to prometheus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [13:48:27] ok, so I guess I should sync the PNGs, then the yaml, then the PHP, and then finally purge the PNGs from the cache [13:48:37] probably doesn’t matter in practice but that order feels sensible to me ^^ [13:48:39] PROBLEM - nova-compute proc maximum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:48:59] PROBLEM - nova-compute proc maximum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:48:59] PROBLEM - nova-compute proc maximum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:49:01] yeah it makes sense [13:49:13] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:13] just do it in the way you like :) [13:49:38] ok :) [13:49:55] but I’m syncing project-logos/ as a whole, I don’t want to wait for the php-fpm restarts three times by syncing the three PNGs individually [13:49:57] PROBLEM - nova-compute proc maximum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:50:11] (if scap sync-file has a flag to skip the restarts then it’s not in the --help output) [13:50:21] PROBLEM - nova-compute proc maximum on cloudvirt1022 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:50:22] PROBLEM - nova-compute proc maximum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:50:22] PROBLEM - nova-compute proc maximum on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:50:23] PROBLEM - nova-compute proc maximum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:50:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:50:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] 
START helmfile.d/services/mwdebug: apply [13:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:57] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:51:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:04] PROBLEM - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:53:16] !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/project-logos/: Config: [[gerrit:807491|specieswiki: Adjust width-height ratio of logo to fix display issue (T310961)]] (1/3) (duration: 03m 46s) [13:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:21] PROBLEM - nova-compute proc maximum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:53:21] PROBLEM - nova-compute proc maximum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:53:21] T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961 [13:54:22] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:54:52] RECOVERY - nova-compute proc maximum on cloudvirt1038 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:55:04] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2062.codfw.wmnet [13:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:08] RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:55:26] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:55:31] RECOVERY - nova-compute proc maximum on cloudvirt1029 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:56:09] RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:56:09] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet [13:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:14] PROBLEM - nova-compute proc maximum on cloudvirt1026 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:56:18] RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:56:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:56:31] RECOVERY - nova-compute proc maximum on cloudvirt1041 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:07] RECOVERY - nova-compute proc minimum on cloudvirt1026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:57:14] RECOVERY - nova-compute proc minimum on cloudvirt1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:57:18] RECOVERY - nova-compute proc maximum on cloudvirt1026 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:57:23] !log lucaswerkmeister-wmde@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:807491|specieswiki: Adjust width-height ratio of logo to fix display issue (T310961)]] (2/3) (duration: 03m 29s) [13:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:57:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:41] RECOVERY - nova-compute proc minimum on cloudvirt1022 is OK: PROCS OK: 1 process with regex args 
^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:57:44] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:58:19] RECOVERY - nova-compute proc maximum on cloudvirt1042 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:58:24] RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:58:26] PROBLEM - nova-compute proc maximum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:58:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:41] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:58:51] RECOVERY - nova-compute proc maximum on cloudvirt1033 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:59:54] (03PS1) 10Ssingh: dnsdist: add spec tests [puppet] - 
10https://gerrit.wikimedia.org/r/807551 [14:00:26] RECOVERY - nova-compute proc maximum on cloudvirt1031 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:01:04] RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:01:07] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:807491|specieswiki: Adjust width-height ratio of logo to fix display issue (T310961)]] (3/3) (duration: 03m 30s) [14:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:14] T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961 [14:01:33] !log lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/project-logos/%s\n' specieswiki{,-{1.5,2}x}.png | mwscript purgeList.php # T310961 [14:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:58] RECOVERY - nova-compute proc minimum on cloudvirt1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:02:31] RECOVERY - nova-compute proc maximum on cloudvirt1019 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:03:08] (03PS2) 10Lucas Werkmeister (WMDE): logos: Update phpcs comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807486 (owner: 10Stang) [14:03:16] let’s do this one as well, it shouldn’t wait for too long [14:04:03] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet 
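(Annotation on the 14:01:33 purge command: the shell brace expansion `specieswiki{,-{1.5,2}x}.png` yields three file names, which `printf` turns into one URL per line for `purgeList.php`. The sketch below only reproduces that list in Python so the purged URLs are easy to see; the helper name `logo_purge_urls` and the fixed suffix list are illustrative assumptions, not part of any Wikimedia tooling.)

```python
# Illustrative only: mirrors what the logged shell command feeds to purgeList.php.
# The base URL is taken from the log; the helper itself is hypothetical.
BASE_URL = "https://en.wikipedia.org/static/images/project-logos/"


def logo_purge_urls(wiki):
    """Return the 1x, 1.5x and 2x logo URLs for the given wiki name."""
    # Same order as the brace expansion wiki{,-{1.5,2}x}.png
    suffixes = ["", "-1.5x", "-2x"]
    return [f"{BASE_URL}{wiki}{suffix}.png" for suffix in suffixes]


if __name__ == "__main__":
    # One URL per line, matching the printf '%s\n' format in the log.
    for url in logo_purge_urls("specieswiki"):
        print(url)
```

(The purge ran only after all three syncs had finished, in line with the order proposed at 13:48:27: PNGs, then yaml, then PHP, then the cache purge.)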
[14:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:16] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] logos: Update phpcs comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807486 (owner: 10Stang) [14:04:20] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:05:08] (03PS1) 10Jbond: C:postgresql: grab the data directory from postgresql [puppet] - 10https://gerrit.wikimedia.org/r/807553 [14:05:10] (03Merged) 10jenkins-bot: logos: Update phpcs comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807486 (owner: 10Stang) [14:05:52] RECOVERY - nova-compute proc maximum on cloudvirt1027 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:06:16] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:06:21] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:06:53] (03PS2) 10Jbond: P::base: allow useres to configure enable_unpriv_userns via hiera [puppet] - 10https://gerrit.wikimedia.org/r/807515 (https://phabricator.wikimedia.org/T287081) [14:07:08] RECOVERY - nova-compute proc maximum on cloudvirt1044 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting 
[14:07:09] (03PS2) 10Jbond: cumin: add alias for hosts with sensitive sysctl settings [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) [14:07:20] RECOVERY - nova-compute proc maximum on cloudvirt1040 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:07:35] (03PS1) 10Ssingh: Release 9.1.2-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/807554 [14:08:18] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see inline too" [puppet] - 10https://gerrit.wikimedia.org/r/807546 (owner: 10Volans) [14:08:18] RECOVERY - nova-compute proc maximum on cloudvirt1028 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:08:19] RECOVERY - nova-compute proc maximum on cloudvirt1039 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:08:26] (03CR) 10CI reject: [V: 04-1] Release 9.1.2-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/807554 (owner: 10Ssingh) [14:08:30] ha [14:08:34] (03CR) 10Filippo Giunchedi: [C: 03+1] Rename cluster to ganeti_cluster [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807545 (owner: 10Volans) [14:08:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:06] !log lucaswerkmeister-wmde@deploy1002 Synchronized logos/manage.py: Config: [[gerrit:807486|logos: Update phpcs comment]] (should be a no-op but syncing just in case) (duration: 03m 19s) [14:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:11] !log UTC afternoon backport+config window done 
[14:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:20] thanks a lot! [14:09:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:09:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:54] RECOVERY - nova-compute proc maximum on cloudvirt1020 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:10:22] RECOVERY - nova-compute proc minimum on cloudvirt1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:10:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:44] RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:10:46] RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:10:46] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:11:31] RECOVERY - nova-compute proc maximum on cloudvirt1034 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:11:52] RECOVERY - nova-compute proc maximum on cloudvirt1036 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:12:01] RECOVERY - nova-compute proc minimum on cloudvirt1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:12:06] (03PS6) 10Jgiannelos: Improve performance of Tegola tile pregeneration [puppet] - 10https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) (owner: 10Isabelle Hurbain-Palatin) [14:12:19] (03PS2) 10Ssingh: trafficserver: 9.x upgrade: remove wmf-tls log format [puppet] - 10https://gerrit.wikimedia.org/r/803301 (https://phabricator.wikimedia.org/T309651) [14:13:08] PROBLEM - nova-compute proc maximum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:13:08] RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:13:09] RECOVERY - nova-compute proc minimum on cloudvirt1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:13:09] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:13:32] RECOVERY - nova-compute proc maximum on cloudvirt1045 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:13:56] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:14:18] (03CR) 10Ssingh: [V: 03+1] trafficserver: 9.x upgrade: rename max_connections_active_in (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:15:00] (03PS1) 10Ayounsi: Network check MTU report: improve log messages [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807556 [14:15:04] RECOVERY - nova-compute proc minimum on cloudvirt1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:15:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:15:38] RECOVERY - nova-compute proc maximum on cloudvirt1017 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:42] RECOVERY - nova-compute proc maximum on cloudvirt1025 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:16:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:16:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:06] RECOVERY - nova-compute proc maximum on 
cloudvirt1021 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:17:31] RECOVERY - nova-compute proc minimum on cloudvirt1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:31] RECOVERY - nova-compute proc minimum on cloudvirt1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:45] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2062.codfw.wmnet [14:17:48] (03PS2) 10Ssingh: trafficserver: 9.x upgrade: switch ip_allow.config to YAML format [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) [14:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:58] RECOVERY - nova-compute proc minimum on cloudvirt1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:58] RECOVERY - nova-compute proc minimum on cloudvirt1029 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:58] RECOVERY - nova-compute proc maximum on cloudvirt1037 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:59] RECOVERY - nova-compute proc maximum on cloudvirt1047 is OK: PROCS OK: 1 
process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:18:06] RECOVERY - nova-compute proc maximum on cloudvirt1023 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:18:18] (03CR) 10Ssingh: trafficserver: 9.x upgrade: switch ip_allow.config to YAML format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:19:31] RECOVERY - nova-compute proc maximum on cloudvirt1030 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:19:41] RECOVERY - nova-compute proc maximum on cloudvirt1035 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:19:54] RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:20:07] (03PS1) 10Ayounsi: Netbox: don't alert for the accounting report [puppet] - 10https://gerrit.wikimedia.org/r/807558 [14:20:26] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:20:26] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:20:54] RECOVERY - Check systemd state on ms-be1050 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:10] RECOVERY - nova-compute proc minimum on cloudvirt1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:22:07] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:24:32] (03CR) 10Jgreen: [C: 03+1] vrts: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:24:33] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:24:45] (03CR) 10JMeybohm: [C: 03+2] Initial commit of helm-state-metrics (031 comment) [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [14:25:18] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Initial commit of helm-state-metrics [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [14:25:26] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add vendor dir [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806889 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [14:26:47] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2063.codfw.wmnet [14:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:21] (03PS2) 10Ssingh: trafficserver: 9.x upgrade: replace client.verify.server [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) [14:31:47] (03PS1) 
10Jgiannelos: tegola: Re-enable tile pregeneration on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/807561 (https://phabricator.wikimedia.org/T305845) [14:32:12] (03CR) 10Ssingh: trafficserver: 9.x upgrade: replace client.verify.server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:32:14] (03CR) 10Muehlenhoff: vrts: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:33:30] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/803506 (owner: 10Slyngshede) [14:34:55] (03PS1) 10Majavah: P:(toolforge|wmcs::paws)::prometheus: improve namespace filtering [puppet] - 10https://gerrit.wikimedia.org/r/807562 [14:36:51] (03CR) 10Jbond: [C: 03+2] P::base: allow useres to configure enable_unpriv_userns via hiera [puppet] - 10https://gerrit.wikimedia.org/r/807515 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond) [14:37:26] (03PS1) 10Muehlenhoff: Record new MOU date for aarora [puppet] - 10https://gerrit.wikimedia.org/r/807563 [14:37:27] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:38:34] RECOVERY - nova-compute proc maximum on cloudvirt1022 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:39:13] (03CR) 10Muehlenhoff: cumin: add alias for hosts with sensitive sysctl settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond) [14:40:17] (03PS1) 10Jbond: base:sysctl: rename sysctl value as it could be enabled or disabled [puppet] - 10https://gerrit.wikimedia.org/r/807564 [14:41:01] RECOVERY - nova-compute proc maximum on cloudvirt1043 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:41:37] (03PS2) 10Ssingh: Add sukhe to super-user for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/807145 [14:42:24] (03CR) 10Ssingh: "Thanks for the review; addressed the comments and updated the CR." 
[homer/public] - 10https://gerrit.wikimedia.org/r/807145 (owner: 10Ssingh) [14:44:17] (03CR) 10Jbond: cumin: add alias for hosts with sensitive sysctl settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond) [14:44:19] (03CR) 10Muehlenhoff: cumin: add alias for hosts with sensitive sysctl settings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond) [14:44:21] (03PS3) 10Jbond: cumin: add alias for hosts with sensitive sysctl settings [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) [14:44:24] (03CR) 10Ssingh: "recheck" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/807554 (owner: 10Ssingh) [14:45:32] (03CR) 10Muehlenhoff: [C: 03+2] Record new MOU date for aarora [puppet] - 10https://gerrit.wikimedia.org/r/807563 (owner: 10Muehlenhoff) [14:47:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2005.wikimedia.org [14:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:41] (03PS4) 10DDesouza: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015) [14:49:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2005.wikimedia.org [14:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:52] (03CR) 10Ssingh: "I guess this is expected since it's actually trying to patch 9.1.2-1wm1~bpo10+1 but we haven't updated our Deb yet?" 
[debs/trafficserver] - 10https://gerrit.wikimedia.org/r/807554 (owner: 10Ssingh) [14:51:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2006.wikimedia.org [14:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:28] (03CR) 10Jbond: [C: 03+1] "LGTM very minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/807551 (owner: 10Ssingh) [14:53:29] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/807545 (owner: 10Volans) [14:53:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2006.wikimedia.org [14:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:53] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/807546 (owner: 10Volans) [14:54:23] (03CR) 10Jbond: [C: 03+2] base:sysctl: rename sysctl value as it could be enabled or disabled [puppet] - 10https://gerrit.wikimedia.org/r/807564 (owner: 10Jbond) [14:55:08] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add helm-state-metrics image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/806879 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [14:56:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1003.wikimedia.org [14:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:38] (03PS4) 10Ahmon Dancy: scap bootstrap: refactor [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche) [14:58:10] (03CR) 10Ahmon Dancy: scap bootstrap: refactor (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche) [14:58:31] (03PS4) 10Jbond: cumin: add alias for hosts with sensitive sysctl settings [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) 
[14:58:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1003.wikimedia.org [14:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1004.wikimedia.org [14:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:36] (03CR) 10Jbond: cumin: add alias for hosts with sensitive sysctl settings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond) [14:59:48] (03PS2) 10Ssingh: Release 9.1.2-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/807554 [14:59:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/807558 (owner: 10Ayounsi) [15:00:14] (03CR) 10CI reject: [V: 04-1] Release 9.1.2-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/807554 (owner: 10Ssingh) [15:00:34] !log published docker-registry.discovery.wmnet/helm-state-metrics:0.1.0-1 - T310714 [15:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:39] T310714: Detect and alert on helm releases in unclean state - https://phabricator.wikimedia.org/T310714 [15:01:04] (03PS2) 10Eevans: AQS: Use data-center apropos host list [puppet] - 10https://gerrit.wikimedia.org/r/805883 (https://phabricator.wikimedia.org/T307641) [15:01:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1004.wikimedia.org [15:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:23] (03CR) 10JMeybohm: [C: 03+2] Add helm-state-metrics helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/806870 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [15:01:44] (03CR) 10Ayounsi: [C: 03+2] Netbox: don't alert for the accounting report [puppet] - 
10https://gerrit.wikimedia.org/r/807558 (owner: 10Ayounsi) [15:02:01] (03PS2) 10Ayounsi: Netbox: don't alert for the accounting report [puppet] - 10https://gerrit.wikimedia.org/r/807558 [15:02:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond) [15:04:07] (03CR) 10Eevans: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/36007/" [puppet] - 10https://gerrit.wikimedia.org/r/805883 (https://phabricator.wikimedia.org/T307641) (owner: 10Eevans) [15:04:32] (03CR) 10Ayounsi: [C: 03+1] "all good! rolling it out is time consuming as we need to check the diff and say "yes" for every single device. Let me know if you need hel" [homer/public] - 10https://gerrit.wikimedia.org/r/807145 (owner: 10Ssingh) [15:05:21] (03Merged) 10jenkins-bot: Add helm-state-metrics helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/806870 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [15:07:27] (03CR) 10Ssingh: Add sukhe to super-user for router configuration (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/807145 (owner: 10Ssingh) [15:08:00] (03CR) 10Ssingh: [C: 03+2] dnsdist: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/807551 (owner: 10Ssingh) [15:08:24] (03CR) 10Ssingh: [C: 03+2] dnsdist: add spec tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807551 (owner: 10Ssingh) [15:08:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet [15:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [15:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:12] (03PS1) 10Jgiannelos: tegola: Point tegola to the latest swift container [deployment-charts] - 
10https://gerrit.wikimedia.org/r/807567 [15:13:40] (03CR) 10Jbond: [C: 03+2] cumin: add alias for hosts with sensitive sysctl settings [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond) [15:15:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2002.codfw.wmnet [15:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:45] (JobUnavailable) firing: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:43] PROBLEM - Check systemd state on ms-be2059 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:51] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche) [15:18:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2002.codfw.wmnet [15:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:33] PROBLEM - Host ms-be2063 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:03] RECOVERY - Host ms-be2063 is UP: PING OK - Packet loss = 0%, RTA = 31.54 ms [15:22:23] RECOVERY - Check systemd state on ms-be2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:24:20] (03PS6) 10MewOphaswongse: Structured task: enable free text for "other" rejection reason in betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) [15:24:47] (03CR) 10MewOphaswongse: Structured task: enable free text for "other" rejection reason in betalabs (031 comment) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) (owner: 10MewOphaswongse) [15:25:45] (JobUnavailable) firing: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:26:20] (03PS1) 10MewOphaswongse: Structured task: enable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807576 (https://phabricator.wikimedia.org/T304099) [15:27:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10Cmjohnson) @jcrespo there are 3 drives and I did make it raid 6 [15:28:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:28:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:00] 10SRE, 10serviceops, 10Patch-For-Review: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10JMeybohm) I would assume we can reuse the `pwstore/pw.git/deployment-key-passphrase` for this as the audience is the same as well? 
[15:31:15] RECOVERY - Check systemd state on ms-fe1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:42] (03CR) 10JMeybohm: [C: 03+2] Deploy helm-state-metrics to staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [15:32:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:45] (JobUnavailable) resolved: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:21] !log upload jenkins 2.332.4 to apt.wikimedia.org T311068 [15:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:02] (03Merged) 10jenkins-bot: Deploy helm-state-metrics to staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [15:40:09] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:40:27] RECOVERY - Check systemd state on ms-fe1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:40:30] (03CR) 10Ssingh: "recheck" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/807554 (owner: 10Ssingh) [15:40:33] PROBLEM - Host ms-be2063 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10jcrespo) So I checked the recipe and it didn't 
change since april 2020 (except briefly in may for 3 days for some old/bad hardware). In particu... [15:42:03] RECOVERY - Host ms-be2063 is UP: PING WARNING - Packet loss = 77%, RTA = 34.19 ms [15:44:01] (03PS3) 10Ssingh: Release 9.1.2-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/807554 [15:50:33] 10Puppet, 10Infrastructure-Foundations, 10netbox, 10PostgreSQL: Puppet change at each run on postgres replicas - https://phabricator.wikimedia.org/T311156 (10ayounsi) p:05Triage→03Medium [15:51:05] PROBLEM - Host ms-be2063 is DOWN: PING CRITICAL - Packet loss = 100% [15:51:21] RECOVERY - Host ms-be2063 is UP: PING OK - Packet loss = 0%, RTA = 31.55 ms [15:53:24] (03PS9) 10Ayounsi: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 [15:53:44] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/807581 (https://phabricator.wikimedia.org/T308244) [15:54:03] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/807581 (https://phabricator.wikimedia.org/T308244) (owner: 10Kosta Harlan) [15:59:52] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/807581 (https://phabricator.wikimedia.org/T308244) (owner: 10Kosta Harlan) [16:00:30] (03PS2) 10Alexandros Kosiaris: prometheus: Enable scraping of the ipmi exporter [puppet] - 10https://gerrit.wikimedia.org/r/807494 [16:00:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] prometheus: Enable scraping of the ipmi exporter [puppet] - 10https://gerrit.wikimedia.org/r/807494 (owner: 10Alexandros Kosiaris) [16:01:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] Keyholder: add new agent for trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: 10Thcipriani) [16:04:02] !log kharlan@deploy1002 helmfile [staging] START 
helmfile.d/services/linkrecommendation: apply [16:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:08] (03PS10) 10Ayounsi: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 [16:04:10] (03PS18) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [16:05:19] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [16:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:31] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [16:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:38] (03CR) 10BCornwall: [C: 03+2] Delete git-setup script [dns] - 10https://gerrit.wikimedia.org/r/807229 (owner: 10BCornwall) [16:07:55] (03CR) 10Ayounsi: Initial support for servers switch interfaces (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [16:08:47] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [16:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:01] (03CR) 10CI reject: [V: 04-1] Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [16:09:03] (03PS3) 10Zabe: vrts: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013) [16:09:41] (03CR) 10Zabe: vrts: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [16:09:51] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [16:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:01] (03PS19) 10Ayounsi: Initial support for servers switch interfaces 
[cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [16:11:46] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [16:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:53] PROBLEM - Host ms-be2063 is DOWN: PING CRITICAL - Packet loss = 100% [16:12:14] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:13:55] RECOVERY - Host ms-be2063 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms [16:13:55] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1009.eqiad.wmnet with OS bullseye [16:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1009.eqiad.wmnet with OS bullseye [16:14:02] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1009.eqiad.wmnet with OS bullseye [16:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1009.eqiad.wmnet with OS bullseye executed w... [16:14:21] PROBLEM - Keyholder SSH agent on deploy1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [16:14:21] PROBLEM - Keyholder SSH agent on deploy2002 is CRITICAL: CRITICAL: Keyholder is not armed. 
Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [16:14:31] PROBLEM - Check systemd state on ms-be2063 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:51] RECOVERY - Check systemd state on ms-be2063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:31] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1009.eqiad.wmnet with OS bullseye [16:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1009.eqiad.wmnet with OS bullseye [16:20:07] they keyholder thing is expected (e.g. maintenance, reboot)? [16:20:34] jynus: akosiaris just merged a change [16:20:44] ok [16:20:49] It's likely them adding the gerrit/scap stuff [16:20:56] ok, cool [16:21:16] with so much noise it is not easy to track all changes :-D [16:21:23] RECOVERY - Keyholder SSH agent on deploy1002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [16:21:25] RECOVERY - Keyholder SSH agent on deploy2002 is OK: OK: Keyholder is armed with all configured keys. 
https://wikitech.wikimedia.org/wiki/Keyholder [16:21:32] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster - https://phabricator.wikimedia.org/T311131 (10BTullis) [16:23:46] 10SRE, 10serviceops, 10Patch-For-Review: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10akosiaris) 05Open→03Resolved [16:23:50] 10SRE, 10serviceops, 10Patch-For-Review: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10akosiaris) >>! In T310620#8020287, @JMeybohm wrote: > I would assume we can reuse the `pwstore/pw.git/deployment-key-passphrase` for this as the aud... [16:24:34] 10SRE, 10serviceops, 10Patch-For-Review: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10akosiaris) a:03akosiaris key generated, change merged, keyholder and keyholder-proxy restart and rearmed. I think we are done on this front! I am... 
[16:25:13] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:29:25] PROBLEM - Host ms-be2063 is DOWN: PING CRITICAL - Packet loss = 100% [16:29:46] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1009.eqiad.wmnet with reason: host reimage [16:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:09] (03CR) 10David Caro: [C: 03+2] openstack.vendordata: reduce timeout so it retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807174 (https://phabricator.wikimedia.org/T309930) (owner: 10David Caro) [16:30:21] RECOVERY - Host ms-be2063 is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms [16:33:02] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1009.eqiad.wmnet with reason: host reimage [16:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:12] (03PS2) 10Matthias Mullie: [ImageSuggestions] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807050 (https://phabricator.wikimedia.org/T302711) [16:37:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10jcrespo) [16:37:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - 
https://phabricator.wikimedia.org/T307048 (10jcrespo) 05Open→03Resolved It turned out everything was perfectly configured, we just needed to retry (e.g. for puppet to apply the new con... [16:42:37] jouncebot: now [16:42:37] No deployments scheduled for the next 1 hour(s) and 17 minute(s) [16:43:49] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2063.codfw.wmnet [16:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:24] (03PS4) 10Ssingh: Release 9.1.2-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/807554 [16:45:04] !log Restarting CI Jenkins [16:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:15] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1009.eqiad.wmnet with OS bullseye [16:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1009.eqiad.wmnet with OS bullseye completed:... [16:54:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10BTullis) @Cmjohnson yes please, that's perfect. 
[16:54:05] (03PS2) 10BCornwall: traffic: Port over ATS restart alert [alerts] - 10https://gerrit.wikimedia.org/r/807214 (https://phabricator.wikimedia.org/T300723) [16:54:14] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:54:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:56:50] 10SRE, 10serviceops: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10dancy) Thank you @akosiaris ! What's the official way to collect the public key? [16:57:23] (03CR) 10BCornwall: [C: 03+2] traffic: Port over ATS restart alert [alerts] - 10https://gerrit.wikimedia.org/r/807214 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [16:58:44] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:59:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:02:05] (03PS1) 10MarcoAurelio: gawiki: Set category collation from `uppercase` to `uca-ga-u-kn` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807593 (https://phabricator.wikimedia.org/T311136) [17:04:10] 10SRE, 10serviceops: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10akosiaris) >>! In T310620#8020587, @dancy wrote: > Thank you @akosiaris ! > > What's the official way to collect the public key? Can't say we have an official way to... [17:04:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:08:13] 10SRE-tools, 10Spicerack: spicerack.redfish: Add handle for when job returns - "Job for this device is already present" - https://phabricator.wikimedia.org/T311162 (10jbond) p:05Triage→03Medium [17:09:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:10:39] (03PS2) 10Jbond: C:postgresql: grab the data directory from postgresql [puppet] - 10https://gerrit.wikimedia.org/r/807553 (https://phabricator.wikimedia.org/T311156) 
[17:13:36] 10Puppet, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review, 10PostgreSQL: Puppet change at each run on postgres replicas - https://phabricator.wikimedia.org/T311156 (10jbond) I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/807553 should fix this issue > How to know if it's safe to... [17:15:45] jouncebot: nowandnext [17:15:45] No deployments scheduled for the next 0 hour(s) and 44 minute(s) [17:15:45] In 0 hour(s) and 44 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T1800) [17:15:45] In 0 hour(s) and 44 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T1800) [17:17:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) @Cmjohnson yes please, let's use hardware RAID for this please. As @RobH suggested in the parent task, let's... > use the flex bays as a raid1 for the OS data, and the... [17:37:12] jouncebot: now [17:37:12] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [17:37:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:39:56] 10SRE, 10serviceops: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10dancy) Beautiful. I added the public key to Gerrit's trainbranchbot using the following command: ` echo ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIA9PnDpx0+F5mgJUbLxiCOFm2G5an... 
[17:42:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:44:48] (03PS2) 10MarcoAurelio: gawiki: Change category collation from `uppercase` to `uca-ga-u-kn` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807593 (https://phabricator.wikimedia.org/T311136) [18:00:04] hashar and brennen: Time to snap out of that daydream and deploy Train log triage with CPT. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T1800). [18:00:04] hashar and brennen: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T1800) [18:04:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:09:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:12:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:17:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:33:39] (03PS1) 10Jcrespo: install_server: Move backup1009, backup2009 to the list of manual partitioning [puppet] - 10https://gerrit.wikimedia.org/r/807602 [18:41:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:41:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:41:40] here [18:41:42] and ACKed [18:42:07] here as well [18:42:25] (03CR) 10Jcrespo: [C: 03+2] install_server: Move backup1009, backup2009 to the list of manual partitioning [puppet] - 10https://gerrit.wikimedia.org/r/807602 (owner: 10Jcrespo) [18:44:21] increase in thumbor latency, but I don't see anything particular strange in the thumbor dashboard [18:44:27] yeah [18:44:45] I am trying to find the resolution just in case it gets worse [18:45:39] that seems to be a recovery unless I am reading it incorrectly [18:46:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:46:24] oh yeah [18:46:25] hm [18:47:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) 
#page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:47:40] 👀 [18:47:47] yeah needs to be resolved [18:48:13] ah, it will keep firing with an ack? [18:48:48] jhathaway: I think it resolved and happened again, hence the separate alert [18:48:58] ah, ok that makes sense [18:49:25] I have ACKed this one again but yeah, this not the solution clearly [18:51:17] (03PS3) 10Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [18:51:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:51:19] (03PS1) 10Krinkle: buildConfigCache,buildDBLists: Remove redundant defines.php include [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807604 [18:51:21] (03PS1) 10Krinkle: build: Make config gen signature for prod compatible with test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807605 [18:51:33] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:51:48] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:52:06] (03CR) 10CI reject: [V: 04-1] [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [18:52:08] (03CR) 10CI reject: [V: 04-1] buildConfigCache,buildDBLists: Remove redundant defines.php include [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807604 (owner: 10Krinkle) [18:52:11] (03CR) 10CI reject: [V: 04-1] build: Make config gen signature for prod compatible with test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807605 (owner: 10Krinkle) [18:54:24] (03PS2) 10Krinkle: buildConfigCache,buildDBLists: Remove redundant defines.php include [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807604 [18:54:26] (03PS2) 10Krinkle: build: Make config gen signature for prod compatible with test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807605 [18:54:28] (03PS4) 10Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [18:54:49] (03PS3) 10Krinkle: build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807604 [18:54:51] (03PS3) 10Krinkle: build: Make config gen signature for prod compatible with test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807605 [18:54:53] (03PS5) 10Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [18:55:47] (03CR) 10CI reject: [V: 04-1] build: Make config gen signature for prod compatible with test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807605 (owner: 10Krinkle) [18:55:52] (03CR) 10CI reject: [V: 04-1] build: Remove redundant defines.php includes from CI build scripts 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/807604 (owner: 10Krinkle) [18:56:05] (03CR) 10CI reject: [V: 04-1] [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [18:56:33] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:58:47] (03PS4) 10Krinkle: build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807604 [18:58:49] (03PS4) 10Krinkle: build: Make config gen signature for prod compatible with test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807605 [18:58:51] (03PS6) 10Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [19:00:11] (03CR) 10CI reject: [V: 04-1] build: Make config gen signature for prod compatible with test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807605 (owner: 10Krinkle) [19:00:14] (03CR) 10CI reject: [V: 04-1] build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807604 (owner: 10Krinkle) [19:00:16] (03CR) 10jenkins-bot: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [19:00:33] (03PS5) 10Krinkle: build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807604 [19:00:35] (03PS5) 10Krinkle: build: Make config gen signature for prod compatible with test 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/807605 [19:00:37] (03PS7) 10Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [19:01:17] (03CR) 10CI reject: [V: 04-1] build: Make config gen signature for prod compatible with test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807605 (owner: 10Krinkle) [19:01:23] (03CR) 10CI reject: [V: 04-1] [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [19:02:51] (03PS6) 10Krinkle: build: Make config gen signature for prod compatible with test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807605 [19:02:53] (03PS8) 10Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [19:03:57] (03CR) 10CI reject: [V: 04-1] [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [19:05:31] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:05:43] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:06:29] !log Restarting CI Jenkins [19:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:59] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:11:03] (03PS9) 10Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [19:11:17] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2022-06-25 07:55:09 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:11:59] ^ this is not a problem problem [19:12:16] (03CR) 10CI reject: [V: 04-1] [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [19:13:11] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2022-08-24 07:48:40 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:14:17] !log bounced apache on lists1001 [19:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:39] (03PS10) 10Krinkle: noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [19:15:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1006.eqiad.wmnet with OS buster [19:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster [19:16:48] (03Abandoned) 10Hashar: Add SonarQube scanner [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791692 (owner: 
10Hashar) [19:16:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Cmjohnson) [19:23:56] (03CR) 10Dzahn: alertmanager: create receivers for serviceops-collab (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/807201 (owner: 10Dzahn) [19:31:10] !log Deploying analytics/refinery (weekly train) [19:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:51] Is there anyone around that could spare a few minutes to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/805883 for me? [19:32:05] It's pretty trivial [19:32:07] !log aqu@deploy1002 Started deploy [analytics/refinery@99cca44]: Regular analytics weekly train [analytics/refinery@99cca44] [19:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:46] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (10phaultfinder) [19:37:06] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@1f2f286]: namespace maps: Exclude labtest database group from data collection [19:37:10] urandom: I can merge it [19:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:24] ryankemper: awesome, thank you! 
[19:37:45] (03CR) 10Ryan Kemper: [C: 03+2] AQS: Use data-center apropos host list [puppet] - 10https://gerrit.wikimedia.org/r/805883 (https://phabricator.wikimedia.org/T307641) (owner: 10Eevans) [19:38:49] urandom: just merged (haven't manually ran puppet yet) [19:38:59] ryankemper: I can take care of that [19:39:02] FWIW there's another patch that was waiting that I puppet-merged as well: `Jcrespo: install_server: Move backup1009, backup2009 to the list of manual partitioning (8eded6f9e9)` [19:39:10] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@1f2f286]: namespace maps: Exclude labtest database group from data collection (duration: 02m 03s) [19:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:14] not sure who jcrespo is on IRC exactly but the change looked minor [19:39:23] urandom: cool, puppet merge done so feel free to proceed [19:39:32] ryankemper: thanks again! [19:41:06] np! [19:41:15] jynus: ^ [19:41:26] ryankemper: jynus is jcrespo [19:42:08] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1006.eqiad.wmnet with OS buster [19:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:13] jynus: just to save you a small backlog scroll, I merged `Jcrespo: install_server: Move backup1009, backup2009 to the list of manual partitioning (8eded6f9e9)` [19:42:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster exec... 
[19:42:26] RhinosF1: tyvm [19:42:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1006.eqiad.wmnet with OS bullseye [19:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye [19:43:04] ryankemper: /who *jcrespo* should work as will any first letter, surname for anyone who has a WMF cloak [19:43:34] oh neat, thanks [19:43:53] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:45:37] (03PS1) 10Krinkle: missing.php: Update docs and add test plan [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807609 (https://phabricator.wikimedia.org/T308932) [19:45:39] (03PS1) 10Krinkle: multiversion: Move missing.php from wmf-config/ to /multiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807610 (https://phabricator.wikimedia.org/T308932) [19:46:35] RECOVERY - AQS root url on aqs2003 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:46:35] RECOVERY - AQS root url on aqs2004 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:48:23] RECOVERY - Check systemd state on aqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:21] RECOVERY - Check systemd state on aqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:27] RECOVERY - AQS root url 
on aqs2005 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:53:45] RECOVERY - AQS root url on aqs2006 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:54:29] RECOVERY - AQS root url on aqs2007 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:55:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:55:45] RECOVERY - AQS root url on aqs2009 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:55:52] 👀 [19:56:57] RECOVERY - Check systemd state on aqs2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:58:03] RECOVERY - AQS root url on aqs2012 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:58:05] RECOVERY - AQS root url on aqs2011 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [19:58:11] RECOVERY - Check systemd state on aqs2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:05] RoanKattouw, Urbanecm, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T2000). [20:00:05] hauskatze: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:51] * hauskatze reporting for backport window, sorry I'm late [20:02:16] hi - i can deploy [20:03:06] !log aqu@deploy1002 Finished deploy [analytics/refinery@99cca44]: Regular analytics weekly train [analytics/refinery@99cca44] (duration: 30m 58s) [20:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:40] (03CR) 10Clare Ming: [C: 03+2] gawiki: Change category collation from `uppercase` to `uca-ga-u-kn` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807593 (https://phabricator.wikimedia.org/T311136) (owner: 10MarcoAurelio) [20:03:58] !log aqu@deploy1002 Started deploy [analytics/refinery@99cca44]: Regular analytics weekly train retry [analytics/refinery@99cca44] [20:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:25] (03Merged) 10jenkins-bot: gawiki: Change category collation from `uppercase` to `uca-ga-u-kn` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807593 (https://phabricator.wikimedia.org/T311136) (owner: 10MarcoAurelio) [20:05:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:06:09] hi cjming - thanks for deploying today. 
My patch cannot really be tested on mwdebug [20:06:25] it needs a maintenance script run after deployment to fully apply [20:06:30] hi hauskatze: i was just gonna ask you about that -- ok -- so i'll sync and then run the script [20:06:33] see https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#updateCollation for details :) [20:07:01] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:07:24] left the command on the calendar, in this case: mwscript updateCollation.php --wiki=gawiki --previous-collation=uppercase [20:07:40] i should run that on the deployment server right? [20:07:52] on mwmaint yep [20:07:56] urbanecm: right? [20:08:17] cjming: all maintenance scripts should be ran from mwmaint1002.eqiad.wmnet (ie. _not_ deployment srv) [20:08:18] not sure which mwmaint100x are we on nowadays :-) [20:08:19] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:08:33] thanks urbanecm: got it [20:08:49] np [20:09:06] otherwise, the cmdline hauskatze quoted should work fine [20:09:09] urbancecm: so the process is 1. sync on deployment server 2. 
run mwscript on maintenance server [20:09:43] I think we need the change fully deployed first [20:10:00] ok - syncing now [20:10:01] cjming: correct [20:10:15] !log aqu@deploy1002 Finished deploy [analytics/refinery@99cca44]: Regular analytics weekly train retry [analytics/refinery@99cca44] (duration: 06m 16s) [20:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:17] !log aqu@deploy1002 Started deploy [analytics/refinery@99cca44] (thin): Regular analytics weekly train THIN [analytics/refinery@99cca44] [20:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:25] !log aqu@deploy1002 Finished deploy [analytics/refinery@99cca44] (thin): Regular analytics weekly train THIN [analytics/refinery@99cca44] (duration: 00m 07s) [20:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:41] !log aqu@deploy1002 Started deploy [analytics/refinery@99cca44] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@99cca44] [20:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:14] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:13:12] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1006.eqiad.wmnet with OS bullseye [20:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye ex... [20:13:51] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:807593|gawiki: Change category collation from `uppercase` to `uca-ga-u-kn` (T311136)]] (duration: 03m 39s) [20:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:54] T311136: Set $wgCategoryCollation for the Irish language Wikipedia, gawiki - https://phabricator.wikimedia.org/T311136 [20:14:02] running maint script now [20:14:09] Great, thanks :) [20:14:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:14:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:11] hauskatze: any idea how many rows in total? it's still running - at about ~100k rows now [20:17:40] The requestor mentioned some 50k articles [20:19:16] !log aqu@deploy1002 Finished deploy [analytics/refinery@99cca44] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@99cca44] (duration: 07m 36s) [20:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:34] alrighty - just finished -- processed ~189k rows [20:20:16] does the output look alright? [20:20:31] I'm seeing no havoc on wiki so it should be okay :) [20:20:39] hauskatze: should be live - script is done [20:21:36] thanks cjming - I'll let our requestor know, so she can check as well [20:21:37] ya - i'm not sure what to look for other than gawiki is still up and not blowing up [20:21:49] np! 
[20:22:00] definitely not setting the wiki ablaze today :) [20:22:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service,refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1006.eqiad.wmnet with OS buster [20:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster [20:22:51] RECOVERY - Check systemd state on mw1406 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:24:09] !log end of UTC late backport window [20:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:27:41] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1006.eqiad.wmnet with OS buster [20:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host 
an-presto1006.eqiad.wmnet with OS buster exec... [20:28:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1006.eqiad.wmnet with OS bullseye [20:28:04] (CR) Kosta Harlan: [C: +1] Structured task: enable free text for "other" rejection reason in betalabs [mediawiki-config] - https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) (owner: MewOphaswongse) [20:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:06] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye [20:45:06] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1007.eqiad.wmnet with OS bullseye [20:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:11] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye [20:48:15] !log aqu@deploy1002 Started deploy [analytics/refinery@99cca44]: Regular analytics weekly train retry force [analytics/refinery@99cca44] [20:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:34] !log aqu@deploy1002 Finished deploy [analytics/refinery@99cca44]: Regular analytics weekly train retry force [analytics/refinery@99cca44] (duration: 01m 18s) [20:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:14] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs -
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:58:12] (PS2) Cwhite: profile: add kibana to dashboards rewrite rule [puppet] - https://gerrit.wikimedia.org/r/806451 (https://phabricator.wikimedia.org/T310360) [20:58:51] (CR) Cwhite: [C: +2] "tested redirect on beta" [puppet] - https://gerrit.wikimedia.org/r/806451 (https://phabricator.wikimedia.org/T310360) (owner: Cwhite) [21:09:05] (PS2) Cwhite: opensearch: disable compatibility mode [puppet] - https://gerrit.wikimedia.org/r/803588 (https://phabricator.wikimedia.org/T301017) [21:10:19] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:21] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:42] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1006.eqiad.wmnet with OS bullseye [21:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:48] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye ex... [21:33:25] (CR) Dzahn: alertmanager: create receivers for serviceops-collab (2 comments) [puppet] - https://gerrit.wikimedia.org/r/807201 (owner: Dzahn) [21:37:39] (CR) Dzahn: [C: +2] "just merging - since we don't actually use this yet and we can always amend. I'll bring it up in the next team meeting."
[puppet] - https://gerrit.wikimedia.org/r/807201 (owner: Dzahn) [21:38:59] (Abandoned) Dzahn: docker::network: refresh service docker after adding a docker network [puppet] - https://gerrit.wikimedia.org/r/806341 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn) [21:40:15] (Abandoned) Dzahn: cumin: add alias for hosts with sensitive sysctl settings [puppet] - https://gerrit.wikimedia.org/r/806486 (https://phabricator.wikimedia.org/T287081) (owner: Dzahn) [21:40:39] (PS2) Dzahn: gitlab: add prometheus blackbox http monitor [puppet] - https://gerrit.wikimedia.org/r/806476 [21:44:21] !log restart elasticsearch_6@cloudelastic-chi-eqiad to resolve Old GC Hell alert [21:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:35] !log restart elasticsearch_6@cloudelastic-chi-eqiad on cloudelastic1003 to resolve Old GC Hell alert [21:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:16] (PS1) Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) [21:45:50] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1007.eqiad.wmnet with OS bullseye [21:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:55] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye ex...
[21:46:33] (CR) CI reject: [V: -1] [wip] elastic: temp keystore for index restoration [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [21:46:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:43] SRE, WMF-Annual-Report (Policy site): migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203 (Dzahn) In T310738 there is a request to revert this and move the domain back to WMF infra. [21:48:24] (PS2) Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) [21:48:58] (CR) CI reject: [V: -1] [wip] elastic: temp keystore for index restoration [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [21:50:32] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:51:37] (PS1) Ahmon Dancy: safe-service-restart.py: Ensure 'status' always has a value [puppet] - https://gerrit.wikimedia.org/r/807624 (https://phabricator.wikimedia.org/T311182) [21:56:49] (PS3) Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) [21:57:22] (CR) CI reject: [V: -1] [wip] elastic: temp keystore for index restoration [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [22:00:16] (PS4) Ryan
Kemper: [wip] elastic: temp keystore for index restoration [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) [22:00:52] (CR) CI reject: [V: -1] [wip] elastic: temp keystore for index restoration [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [22:02:33] SRE, DSE-Kubernetes-Cluster, Infrastructure-Foundations, vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (Dzahn) fyi: The design document isn't accessible and from the tickets alone it's unclear what this is ab... [22:09:53] (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:16:17] (PS5) Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) [22:17:36] (CR) CI reject: [V: -1] [wip] elastic: temp keystore for index restoration [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [22:21:09] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:16] (PS6) Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) [22:22:19] (CR) CI reject: [V: -1] [wip] elastic: temp keystore for index restoration [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: Ryan
Kemper) [22:23:57] SRE, Infrastructure-Foundations, netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (Dzahn) Before we talk about technical implementation and putting this on ice. I am wondering..has anyone even had specific concerns or data fields in mind that sh... [22:27:58] (PS1) Ryan Kemper: elastic: add fake elasticsearch.keystore [labs/private] - https://gerrit.wikimedia.org/r/807650 (https://phabricator.wikimedia.org/T309648) [22:28:15] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:21] (CR) Brennen Bearnes: [C: +1] "Pattern looks correct." [puppet] - https://gerrit.wikimedia.org/r/807518 (owner: Jelto) [22:29:33] (CR) Bking: [C: +2] elastic: add fake elasticsearch.keystore [labs/private] - https://gerrit.wikimedia.org/r/807650 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [22:29:39] (CR) Dzahn: [C: +2] gitlab_runner: add docker-registry.discovery.wmnet to allowed_images [puppet] - https://gerrit.wikimedia.org/r/807518 (owner: Jelto) [22:30:03] (CR) Ryan Kemper: [V: +2] elastic: add fake elasticsearch.keystore [labs/private] - https://gerrit.wikimedia.org/r/807650 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [22:30:51] (PS7) Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) [22:31:52] (CR) Ryan Kemper: "recheck" [puppet] - https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [22:33:48] (CR) Dzahn: [C: +1] "we may be able to deploy this during phab maintenance window in a bit" [puppet] - https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847)
(owner: 10Filippo Giunchedi) [22:35:32] (03PS8) 10Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) [22:37:41] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:37] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:55:18] (ProbeDown) firing: (7) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:56:15] !log removing 1 file for legal compliance [22:56:18] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:56:35] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [22:56:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [22:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[22:57:01] PROBLEM - Apache HTTP on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [22:57:18] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:57:21] PROBLEM - Apache HTTP on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [22:57:21] PROBLEM - Apache HTTP on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [22:57:27] PROBLEM - Apache HTTP on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [22:57:35] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:58:11] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.9726 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [22:58:29] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [22:59:01] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:59:15] RECOVERY - Apache HTTP on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:59:33] RECOVERY - Apache HTTP on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:59:33] RECOVERY - Apache HTTP on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:59:41] RECOVERY - Apache HTTP on mw1365 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:00:18] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:00:21] (03PS3) 10Labdajiwa: Add wordmark and tagline for jvwiki, jvwikt, and jvws [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807247 (https://phabricator.wikimedia.org/T311104) [23:00:33] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. 
https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [23:00:51] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [23:01:18] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:01:35] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [23:01:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [23:02:17] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [23:02:18] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:16:29] 10SRE, 10MediaWiki-General, 10Traffic: Advance declaration of query parameters - https://phabricator.wikimedia.org/T310087 (10ori) [23:17:57] 10SRE, 10DNS, 10Traffic, 10WMF-Legal, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Dzahn) >>! In T310738#8018422, @Varnent wrote: > We are "closing" this site on the VIP site. So, essentially whenever we want on... [23:22:35] 10SRE, 10MediaWiki-General, 10Traffic: Advance declaration of query parameters - https://phabricator.wikimedia.org/T310087 (10ori) [23:22:45] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10ori) [23:23:05] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) Ok now SG3 staff are telling me my ticket isn't valid for this type of thing, despite telling me on a voice call yesterday they'd place it today, and require me to raise a trouble ticket, not a remote h... 
[23:27:00] SRE, MediaWiki-General, Traffic: Advance declaration of query parameters - https://phabricator.wikimedia.org/T310087 (ori) Re-scoping this to be about advanced declaration of query parameters, and moving discussion of parameter ordering to T302459. [23:35:34] (CR) Brennen Bearnes: phabricator: get envoy to listen on ipv6 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)