[00:04:51] <icinga-wm>	 PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-06-14 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:05:17] <icinga-wm>	 PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:05] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:14:57] <icinga-wm>	 PROBLEM - Check systemd state on an-conf1001 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:59] <icinga-wm>	 PROBLEM - Check systemd state on conf1004 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:25] <icinga-wm>	 PROBLEM - Check systemd state on conf2006 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:27] <icinga-wm>	 PROBLEM - Check systemd state on druid1006 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:31] <icinga-wm>	 PROBLEM - Check systemd state on conf2005 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:31] <icinga-wm>	 PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-06-14 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:15:31] <icinga-wm>	 PROBLEM - Check systemd state on an-conf1003 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:01] <icinga-wm>	 PROBLEM - Check systemd state on druid1004 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:19] <icinga-wm>	 PROBLEM - Check systemd state on zookeeper-test1002 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:21] <icinga-wm>	 PROBLEM - Check systemd state on an-conf1002 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:43] <icinga-wm>	 PROBLEM - Check systemd state on conf1005 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:51] <icinga-wm>	 PROBLEM - Check systemd state on conf1006 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:57] <icinga-wm>	 PROBLEM - Check systemd state on an-druid1001 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:57] <icinga-wm>	 PROBLEM - Check systemd state on druid1005 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:57] <icinga-wm>	 PROBLEM - Check systemd state on conf2004 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:59] <icinga-wm>	 PROBLEM - Check systemd state on an-druid1002 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:17:05] <icinga-wm>	 PROBLEM - Check systemd state on an-druid1003 is CRITICAL: CRITICAL - degraded: The following units failed: zookeeper-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:03] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:25:07] <icinga-wm>	 PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-06-14 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:25:22] <wikibugs>	 (CR) Tim Starling: [C: +2] "Puppet compiler result https://puppet-compiler.wmflabs.org/pcc-worker1002/35940/" [puppet] - https://gerrit.wikimedia.org/r/806975 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[00:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[00:28:23] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:31] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:51] <icinga-wm>	 PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-06-14 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:39:39] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:43:20] <wikibugs>	 (PS3) DDesouza: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015)
[00:44:23] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:51:40] <wikibugs>	 (PS2) Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[00:53:26] <wikibugs>	 (CR) CI reject: [V: -1] [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[00:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:57:27] <wikibugs>	 (CR) Tim Starling: [C: +2] mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[00:59:09] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:32] <wikibugs>	 (Merged) jenkins-bot: mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/683022 (https://phabricator.wikimedia.org/T278392) (owner: Aaron Schulz)
[01:01:44] <wikibugs>	 (PS1) Tim Starling: Revert "mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches" [mediawiki-config] - https://gerrit.wikimedia.org/r/807156
[01:01:51] <wikibugs>	 (CR) Tim Starling: [C: +2] Revert "mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches" [mediawiki-config] - https://gerrit.wikimedia.org/r/807156 (owner: Tim Starling)
[01:02:45] <wikibugs>	 (Merged) jenkins-bot: Revert "mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches" [mediawiki-config] - https://gerrit.wikimedia.org/r/807156 (owner: Tim Starling)
[01:05:35] <wikibugs>	 (PS1) Tim Starling: Reapply "mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches" [mediawiki-config] - https://gerrit.wikimedia.org/r/807158
[01:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:06:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:59] <wikibugs>	 (PS2) Tim Starling: Reapply "mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches" [mediawiki-config] - https://gerrit.wikimedia.org/r/807158
[01:07:26] <wikibugs>	 (CR) Tim Starling: "wmf -> wmg" [mediawiki-config] - https://gerrit.wikimedia.org/r/807158 (owner: Tim Starling)
[01:07:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:07:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:07:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:21] <wikibugs>	 (CR) Tim Starling: [C: +2] Reapply "mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches" [mediawiki-config] - https://gerrit.wikimedia.org/r/807158 (owner: Tim Starling)
[01:09:08] <wikibugs>	 (Merged) jenkins-bot: Reapply "mc.php: Add "mcrouter-primary-dc" to $wgObjectCaches" [mediawiki-config] - https://gerrit.wikimedia.org/r/807158 (owner: Tim Starling)
[01:11:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:11:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:55] <wikibugs>	 (CR) Krinkle: [C: -1] "Need to update the mtime invalidator in getConfigGlobals() as well, at least until T169821 is resolved (which is currently blocked on me a" [mediawiki-config] - https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[01:13:38] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/mc.php: g 807158 T278392 (duration: 03m 35s)
[01:13:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:42] <stashbot>	 T278392: Storage solution for cross-datacenter tokens - https://phabricator.wikimedia.org/T278392
[01:16:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:16:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:17:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:17:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:52] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/mc-labs.php: for completeness (duration: 03m 41s)
[01:17:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:18:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:38:25] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:40:15] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:46:55] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:04:57] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:07:45] <jinxer-wm>	 (Memory over 85%) firing: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85%   - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25
[02:17:55] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 3 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:20:11] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:23:35] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:25:53] <icinga-wm>	 RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:43:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:06:05] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:15:19] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:33] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:22:37] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:38:47] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:43:17] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:26:42] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[04:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:27:00] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:58] <wikibugs>	 (CR) Majavah: icinga::monitor::toollabs: replace stretch with buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653) (owner: Majavah)
[05:52:30] <marostegui>	 !log dbmaint s8@eqiad T310011
[05:52:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:35] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[06:07:45] <jinxer-wm>	 (Memory over 85%) firing: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85%   - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25
[06:15:51] <wikibugs>	 (CR) Labdajiwa: "SVG already optimized. Ran svgo with a config from https://www.mediawiki.org/wiki/Manual:Coding_conventions/SVG#Exemplified_safe_configura" [mediawiki-config] - https://gerrit.wikimedia.org/r/807247 (https://phabricator.wikimedia.org/T311104) (owner: Labdajiwa)
[06:30:09] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:35:32] <jinxer-wm>	 (NodeTextfileStale) resolved: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:37:15] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:43:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:52:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Switchover es1, es2 and es3 masters', diff saved to https://phabricator.wikimedia.org/P29941 and previous config saved to /var/cache/conftool/dbconfig/20220622-065208-marostegui.json
[06:52:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1027 es1026 es1031 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P29942 and previous config saved to /var/cache/conftool/dbconfig/20220622-065507-root.json
[06:55:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:50] <wikibugs>	 (PS1) Marostegui: es1026,1027,1031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/807473
[06:59:04] <wikibugs>	 (CR) Marostegui: [C: +2] es1026,1027,1031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/807473 (owner: Marostegui)
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T0700)
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:02:50] <wikibugs>	 (CR) Slyngshede: [C: +2] prometheus: remove absented prometheus_directorysize cron [puppet] - https://gerrit.wikimedia.org/r/782360 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:03:06] <wikibugs>	 (CR) Slyngshede: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/782360 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:03:14] <wikibugs>	 (CR) Slyngshede: [C: +2] prometheus: remove absented prometheus_directorysize cron [puppet] - https://gerrit.wikimedia.org/r/782360 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:04:42] <wikibugs>	 (PS6) Hashar: zuul: disable core.logAllRefUpdates at clone time [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620)
[07:05:33] <wikibugs>	 (CR) Hashar: zuul: disable core.logAllRefUpdates at clone time (1 comment) [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar)
[07:06:45] <icinga-wm>	 ACKNOWLEDGEMENT - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-06-14 00:00:02 Jcrespo failed soon after starting - The acknowledgement expires at: 2022-06-23 09:06:10. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:06:45] <icinga-wm>	 ACKNOWLEDGEMENT - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-06-14 00:00:01 Jcrespo failed soon after starting - The acknowledgement expires at: 2022-06-23 09:06:10. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:06:45] <icinga-wm>	 ACKNOWLEDGEMENT - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-06-14 00:00:02 Jcrespo failed soon after starting - The acknowledgement expires at: 2022-06-23 09:06:10. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:06:45] <icinga-wm>	 ACKNOWLEDGEMENT - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-06-14 00:00:01 Jcrespo failed soon after starting - The acknowledgement expires at: 2022-06-23 09:06:10. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:11:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29943 and previous config saved to /var/cache/conftool/dbconfig/20220622-071143-root.json
[07:11:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29944 and previous config saved to /var/cache/conftool/dbconfig/20220622-071201-root.json
[07:12:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 2%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29945 and previous config saved to /var/cache/conftool/dbconfig/20220622-071210-root.json
[07:12:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:15] <wikibugs>	 (PS1) Marostegui: Revert "es1026,1027,1031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/807251
[07:13:57] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es1026,1027,1031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/807251 (owner: Marostegui)
[07:19:22] <wikibugs>	 (CR) Muehlenhoff: cumin: add alias for hosts with sensitive sysctl settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806486 (https://phabricator.wikimedia.org/T287081) (owner: Dzahn)
[07:19:38] <wikibugs>	 (PS1) Slyngshede: profile::zookeeper::server remove cron mail spam hack [puppet] - https://gerrit.wikimedia.org/r/807475
[07:20:44] <wikibugs>	 (CR) Muehlenhoff: [C: +2] aptly: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/807127 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:23:19] <wikibugs>	 (PS2) Muehlenhoff: aptrepo: Add a few missing SPDX headers [puppet] - https://gerrit.wikimedia.org/r/807128 (https://phabricator.wikimedia.org/T308013)
[07:25:31] <wikibugs>	 (CR) Muehlenhoff: [C: +2] aptrepo: Add a few missing SPDX headers [puppet] - https://gerrit.wikimedia.org/r/807128 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:25:43] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar)
[07:26:36] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:26:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29946 and previous config saved to /var/cache/conftool/dbconfig/20220622-072647-root.json
[07:26:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29947 and previous config saved to /var/cache/conftool/dbconfig/20220622-072705-root.json
[07:27:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29948 and previous config saved to /var/cache/conftool/dbconfig/20220622-072714-root.json
[07:27:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:04] <wikibugs>	 (PS2) Muehlenhoff: grafana: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/807129 (https://phabricator.wikimedia.org/T308013)
[07:31:12] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet
[07:31:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:35] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/807475 (owner: Slyngshede)
[07:33:37] <wikibugs>	 (CR) Slyngshede: "Looks good." [puppet] - https://gerrit.wikimedia.org/r/781051 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:33:39] <wikibugs>	 (CR) Slyngshede: [C: +2] osm: remove absented import_waterlines cron [puppet] - https://gerrit.wikimedia.org/r/781051 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:33:53] <wikibugs>	 (CR) Slyngshede: [C: +2] profile::zookeeper::server remove cron mail spam hack [puppet] - https://gerrit.wikimedia.org/r/807475 (owner: Slyngshede)
[07:38:19] <wikibugs>	 (PS4) Slyngshede: zookeeper: remove absented zookeeper-cleanup cron [puppet] - https://gerrit.wikimedia.org/r/777452 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:38:21] <wikibugs>	 (PS7) Hashar: zuul: disable core.logAllRefUpdates at clone time [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620)
[07:38:36] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar)
[07:39:17] <wikibugs>	 (CR) Hashar: zuul: disable core.logAllRefUpdates at clone time (1 comment) [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar)
[07:39:22] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet
[07:39:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:54] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:40:06] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet
[07:40:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29949 and previous config saved to /var/cache/conftool/dbconfig/20220622-074151-root.json
[07:41:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29950 and previous config saved to /var/cache/conftool/dbconfig/20220622-074209-root.json
[07:42:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29951 and previous config saved to /var/cache/conftool/dbconfig/20220622-074217-root.json
[07:42:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:38] <icinga-wm>	 RECOVERY - Check systemd state on zookeeper-test1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:05] <wikibugs>	 (CR) Marostegui: [C: +2] zuul: disable core.logAllRefUpdates at clone time [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar)
[07:47:54] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:49:29] <logmsgbot>	 !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet
[07:49:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:02] <marostegui>	 !log Upgrade kernel and reboot on db[2145-2150].codfw.wmnet
[07:50:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:52] <icinga-wm>	 PROBLEM - Keyholder SSH agent on cumin2002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder
[07:51:52] <marostegui>	 moritzm: ^
[07:52:49] <moritzm>	 yeah, that is the homer keyholder, needs someone from netops to rearm it, only they have the passphrase
[07:53:33] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1056.eqiad.wmnet
[07:53:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:00] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet
[07:54:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:20] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:55:58] <wikibugs>	 (CR) Muehlenhoff: C:snapshot::dumps::timechecker convert cron to timer. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:56:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29952 and previous config saved to /var/cache/conftool/dbconfig/20220622-075655-root.json
[07:56:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29953 and previous config saved to /var/cache/conftool/dbconfig/20220622-075713-root.json
[07:57:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 25%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29954 and previous config saved to /var/cache/conftool/dbconfig/20220622-075721-root.json
[07:57:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:42] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:00:04] <jouncebot>	 hashar and brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T0800).
[08:00:52] <wikibugs>	 (CR) Slyngshede: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/784324 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[08:00:56] <wikibugs>	 (CR) Slyngshede: [C: +2] memcached: remove absented memkeys cron [puppet] - https://gerrit.wikimedia.org/r/784324 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[08:01:28] <icinga-wm>	 PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2022-06-25 07:55:09 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:01:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:02:21] <wikibugs>	 (CR) Slyngshede: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/779033 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[08:02:23] <wikibugs>	 (CR) Slyngshede: [C: +2] acme_chief: remove absented acme-chief-designate-tidyup cron [puppet] - https://gerrit.wikimedia.org/r/779033 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[08:04:33] <hashar>	 !log Updating operations-puppet-tests-buster-docker Jenkins job to use the latest Docker image (rebuild to catch up with latest defined gems). https://gerrit.wikimedia.org/r/c/integration/config/+/807478
[08:04:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:51] <hashar>	 if CI complains on `operations/puppet`  that might be due to the new docker image
[08:04:55] <wikibugs>	 (CR) Slyngshede: "Looks good" [puppet] - https://gerrit.wikimedia.org/r/778493 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[08:04:57] <wikibugs>	 (CR) Slyngshede: [C: +2] sslcert: remove absented update-ocsp-all cron [puppet] - https://gerrit.wikimedia.org/r/778493 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[08:04:58] <hashar>	 I will run the train
[08:05:10] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1056.eqiad.wmnet
[08:05:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:52] <icinga-wm>	 RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2022-08-24 07:48:40 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:05:55] <wikibugs>	 (PS1) Hashar: group1 wikis to 1.39.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/807483 (https://phabricator.wikimedia.org/T308070)
[08:05:57] <wikibugs>	 (CR) Hashar: [C: +2] group1 wikis to 1.39.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/807483 (https://phabricator.wikimedia.org/T308070) (owner: Hashar)
[08:06:04] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1057.eqiad.wmnet
[08:06:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:11] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.39.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/807483 (https://phabricator.wikimedia.org/T308070) (owner: Hashar)
[08:07:56] <wikibugs>	 SRE, Infrastructure-Foundations, Parsoid: Retire the old Parsoid deb repository? - https://phabricator.wikimedia.org/T309765 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Since there were no further objections, the repository has now been removed.
[08:08:23] <wikibugs>	 (CR) Slyngshede: [C: +1] "LGTM, matches recommendations on: https://github.com/squid-cache/squid/security/advisories/GHSA-f5cp-6rh3-284w" [puppet] - https://gerrit.wikimedia.org/r/807094 (owner: Muehlenhoff)
[08:09:25] <wikibugs>	 (CR) Slyngshede: [C: +1] "LGTM, as recommended on https://github.com/squid-cache/squid/security/advisories/GHSA-f5cp-6rh3-284w" [puppet] - https://gerrit.wikimedia.org/r/807093 (owner: Muehlenhoff)
[08:11:28] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:11:33] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1057.eqiad.wmnet
[08:11:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:37] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.17  refs T308070
[08:11:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:42] <stashbot>	 T308070: 1.39.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T308070
[08:11:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29955 and previous config saved to /var/cache/conftool/dbconfig/20220622-081159-root.json
[08:12:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:12:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29956 and previous config saved to /var/cache/conftool/dbconfig/20220622-081217-root.json
[08:12:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 50%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29957 and previous config saved to /var/cache/conftool/dbconfig/20220622-081227-root.json
[08:12:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:12:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:13:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:13:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:35] <hashar>	 not sure why the php restart takes longer nowadays
[08:14:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:14:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:21] <logmsgbot>	 !log hashar@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.17  refs T308070 (duration: 03m 43s)
[08:15:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:07] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] gitlab_runner: add job to cleanup old docker volumes/cache [puppet] - https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593) (owner: Jelto)
[08:16:10] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:16:31] <wikibugs>	 (PS2) Jelto: gitlab_runner: add job to cleanup old docker volumes/cache [puppet] - https://gerrit.wikimedia.org/r/807103 (https://phabricator.wikimedia.org/T310593)
[08:16:51] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2051.codfw.wmnet
[08:16:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:12] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:18:45] <marostegui>	 !log Upgrade kernel and reboot on db[1111,1132,1143,1127].eqiad.wmnet
[08:18:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:34] <icinga-wm>	 RECOVERY - Keyholder SSH agent on cumin2002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[08:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:26:05] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2052.codfw.wmnet
[08:26:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:34] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "🤦 sorry about that…" [puppet] - https://gerrit.wikimedia.org/r/807200 (https://phabricator.wikimedia.org/T289243) (owner: Ryan Kemper)
[08:26:42] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[08:26:46] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1058.eqiad.wmnet
[08:26:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29959 and previous config saved to /var/cache/conftool/dbconfig/20220622-082702-root.json
[08:27:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29960 and previous config saved to /var/cache/conftool/dbconfig/20220622-082721-root.json
[08:27:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 75%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29961 and previous config saved to /var/cache/conftool/dbconfig/20220622-082730-root.json
[08:27:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:08] <wikibugs>	 (CR) Jbond: [C: +1] prometheus: Add ipmi_exporter to bullseye+ [puppet] - https://gerrit.wikimedia.org/r/807124 (owner: Alexandros Kosiaris)
[08:30:38] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/806373 (owner: Filippo Giunchedi)
[08:32:03] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2052.codfw.wmnet
[08:32:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:52] <hashar>	 well train looks fine so far
[08:37:02] <wikibugs>	 (PS4) Itamar Givon: [cirrus] Add a custom profile for the wikibase language selector [mediawiki-config] - https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: DCausse)
[08:42:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29962 and previous config saved to /var/cache/conftool/dbconfig/20220622-084206-root.json
[08:42:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29963 and previous config saved to /var/cache/conftool/dbconfig/20220622-084225-root.json
[08:42:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 100%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29964 and previous config saved to /var/cache/conftool/dbconfig/20220622-084234-root.json
[08:42:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:46] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/807254 (https://phabricator.wikimedia.org/T304328)
[08:44:02] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/807254 (https://phabricator.wikimedia.org/T304328)
[08:44:17] <wikibugs>	 (Abandoned) Muehlenhoff: Remove obsolete webperf hosts [puppet] - https://gerrit.wikimedia.org/r/785118 (https://phabricator.wikimedia.org/T305460) (owner: Muehlenhoff)
[08:44:59] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/807255 (https://phabricator.wikimedia.org/T304328)
[08:45:09] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/807255 (https://phabricator.wikimedia.org/T304328)
[08:45:24] <wikibugs>	 (PS3) Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/803496 (https://phabricator.wikimedia.org/T304328)
[08:46:52] <wikibugs>	 (PS1) Stang: logos: Update phpcs comment [mediawiki-config] - https://gerrit.wikimedia.org/r/807486
[08:47:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet
[08:47:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:43] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1058.eqiad.wmnet
[08:49:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet
[08:53:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:56:16] <wikibugs>	 (CR) Ayounsi: [C: -1] Add sukhe to super-user for router configuration (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/807145 (owner: Ssingh)
[09:00:54] <wikibugs>	 (CR) Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[09:01:17] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35978/console" [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar)
[09:09:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet
[09:09:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:46] <wikibugs>	 (PS1) Stang: specieswiki: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - https://gerrit.wikimedia.org/r/807491 (https://phabricator.wikimedia.org/T310961)
[09:11:32] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:49] <wikibugs>	 (PS1) Ayounsi: eqsin: disable Telia transit [homer/public] - https://gerrit.wikimedia.org/r/807492 (https://phabricator.wikimedia.org/T300485)
[09:15:25] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1059.eqiad.wmnet
[09:15:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:01] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2053.codfw.wmnet
[09:16:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet
[09:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:12] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[09:17:13] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[09:17:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:26] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[09:17:28] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[09:17:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:04] <wikibugs>	 (PS1) Jbond: puppet_compiler: upload facts [puppet] - https://gerrit.wikimedia.org/r/807493
[09:18:39] <wikibugs>	 (CR) CI reject: [V: -1] puppet_compiler: upload facts [puppet] - https://gerrit.wikimedia.org/r/807493 (owner: Jbond)
[09:19:05] <wikibugs>	 (CR) Ayounsi: [C: +2] eqsin: disable Telia transit [homer/public] - https://gerrit.wikimedia.org/r/807492 (https://phabricator.wikimedia.org/T300485) (owner: Ayounsi)
[09:19:42] <wikibugs>	 (Merged) jenkins-bot: eqsin: disable Telia transit [homer/public] - https://gerrit.wikimedia.org/r/807492 (https://phabricator.wikimedia.org/T300485) (owner: Ayounsi)
[09:23:57] <wikibugs>	 (PS2) Jbond: puppet_compiler: upload facts [puppet] - https://gerrit.wikimedia.org/r/807493
[09:25:18] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:01] <wikibugs>	 (CR) JMeybohm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[09:26:06] <wikibugs>	 (CR) Jbond: [C: +2] puppet_compiler: upload facts [puppet] - https://gerrit.wikimedia.org/r/807493 (owner: Jbond)
[09:27:53] <wikibugs>	 (PS6) Alexandros Kosiaris: prometheus: Add ipmi_exporter to bullseye+ [puppet] - https://gerrit.wikimedia.org/r/807124
[09:27:55] <wikibugs>	 (PS1) Alexandros Kosiaris: prometheus: Enable scraping of the ipmi exporter [puppet] - https://gerrit.wikimedia.org/r/807494
[09:29:25] <wikibugs>	 (PS2) JMeybohm: Allow to dry-run SREBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/806285
[09:29:27] <wikibugs>	 (PS2) JMeybohm: SREBatchBase: Fix broken batchsize argument [cookbooks] - https://gerrit.wikimedia.org/r/806286
[09:29:29] <wikibugs>	 (PS3) JMeybohm: sre.k8s.reboot-nodes: Fix errors identified during dry-run [cookbooks] - https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661)
[09:29:31] <wikibugs>	 (PS4) JMeybohm: sre.k8s.reboot-node: Dynamically adjust batchsize [cookbooks] - https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661)
[09:30:05] <wikibugs>	 SRE, Traffic, Patch-For-Review, SRE Observability (FY2021/2022-Q4), User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (fgiunchedi) >>! In T300723#8017017, @BCornwall wrote: > the varnish-mmap-count situation could be res...
[09:30:18] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti-test2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[09:31:13] <moritzm>	 ^ ganeti-test2003 is expected, master was failed over
[09:31:25] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, thank you!" [alerts] - https://gerrit.wikimedia.org/r/807214 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[09:31:43] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] icinga::monitor::toollabs: replace stretch with buster [puppet] - https://gerrit.wikimedia.org/r/807169 (https://phabricator.wikimedia.org/T277653) (owner: Majavah)
[09:33:12] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 34.78 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[09:33:28] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 59.51 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[09:34:28] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:34] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2053.codfw.wmnet
[09:34:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:43] <wikibugs>	 SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (ayounsi) a:ayounsi→RobH @RobH BGP disabled.
[09:34:59] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on netbox1002.eqiad.wmnet with reason: Adding support for Ganeti groups
[09:35:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:01] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on netbox1002.eqiad.wmnet with reason: Adding support for Ganeti groups
[09:35:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:30] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[09:35:46] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 100.3 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[09:35:58] <wikibugs>	 (CR) Volans: [C: +2] Revert "ganeti-netbox-sync: Add netbox 3.2 support" [software/netbox-extras] - https://gerrit.wikimedia.org/r/805869 (https://phabricator.wikimedia.org/T296452) (owner: Volans)
[09:36:02] <godog>	 jbond: merging your changes too
[09:36:04] <wikibugs>	 (CR) Volans: [C: +2] ganeti-netbox-sync: refactor into classes [software/netbox-extras] - https://gerrit.wikimedia.org/r/802178 (owner: Volans)
[09:36:09] <wikibugs>	 (CR) Volans: [C: +2] Netbox Ganeti sync: add groups support [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: Volans)
[09:36:40] <wikibugs>	 (Merged) jenkins-bot: Revert "ganeti-netbox-sync: Add netbox 3.2 support" [software/netbox-extras] - https://gerrit.wikimedia.org/r/805869 (https://phabricator.wikimedia.org/T296452) (owner: Volans)
[09:36:43] <wikibugs>	 (Merged) jenkins-bot: ganeti-netbox-sync: refactor into classes [software/netbox-extras] - https://gerrit.wikimedia.org/r/802178 (owner: Volans)
[09:36:51] <wikibugs>	 (Merged) jenkins-bot: Netbox Ganeti sync: add groups support [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: Volans)
[09:37:17] <wikibugs>	 (CR) Volans: [C: +2] Netbox: adapt ganeti-sync config file [puppet] - https://gerrit.wikimedia.org/r/805337 (https://phabricator.wikimedia.org/T262446) (owner: Volans)
[09:39:32] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35979/console" [puppet] - https://gerrit.wikimedia.org/r/807494 (owner: Alexandros Kosiaris)
[09:45:45] <wikibugs>	 (PS2) JMeybohm: Initial commit of helm-state-metrics [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714)
[09:45:47] <wikibugs>	 (PS2) JMeybohm: Add vendor dir [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/806889 (https://phabricator.wikimedia.org/T310714)
[09:46:37] <wikibugs>	 (CR) Ayounsi: [C: +1] smokeping: stop targetting cr devices, moved to Prometheus (1 comment) [puppet] - https://gerrit.wikimedia.org/r/807100 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[09:48:06] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1059.eqiad.wmnet
[09:48:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:28] <wikibugs>	 (CR) JMeybohm: Initial commit of helm-state-metrics (4 comments) [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm)
[09:49:46] <wikibugs>	 (CR) Filippo Giunchedi: "See inline for my opinions (none blocking) on the review. LGTM though, happy to discuss more depending on what you (the team) prefer" [puppet] - https://gerrit.wikimedia.org/r/807201 (owner: Dzahn)
[09:49:54] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (16) node(s) change every puppet run: cloudservices1003, cloudservices1004, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, netboxdb2001, netboxdb2002, puppetdb2002, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_c
[09:51:08] <wikibugs>	 (PS4) Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673)
[09:51:24] <wikibugs>	 (CR) Filippo Giunchedi: smokeping: stop targetting cr devices, moved to Prometheus (1 comment) [puppet] - https://gerrit.wikimedia.org/r/807100 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[09:51:27] <wikibugs>	 (PS2) Filippo Giunchedi: smokeping: stop targetting cr devices, moved to Prometheus [puppet] - https://gerrit.wikimedia.org/r/807100 (https://phabricator.wikimedia.org/T169860)
[09:51:29] <wikibugs>	 (PS3) Filippo Giunchedi: prometheus: ping access switches and FR firewalls [puppet] - https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860)
[09:52:18] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] smokeping: stop targetting cr devices, moved to Prometheus [puppet] - https://gerrit.wikimedia.org/r/807100 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[09:52:28] <wikibugs>	 (CR) CI reject: [V: -1] prometheus: ping access switches and FR firewalls [puppet] - https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[09:53:00] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] pontoon: rework prometheus settings in its own file [puppet] - https://gerrit.wikimedia.org/r/806379 (owner: Filippo Giunchedi)
[09:53:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] pontoon: add profile::pontoon::base [puppet] - https://gerrit.wikimedia.org/r/806373 (owner: Filippo Giunchedi)
[09:53:56] <wikibugs>	 (PS1) Muehlenhoff: Raise profile::cumin::monitoring_agentrun::crit [puppet] - https://gerrit.wikimedia.org/r/807497
[09:54:09] <wikibugs>	 (PS3) JMeybohm: Deploy helm-state-metrics to staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714)
[09:55:42] <wikibugs>	 (PS5) Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673)
[09:56:08] <wikibugs>	 (03CR) 10JMeybohm: Deploy helm-state-metrics to staging-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm)
[09:56:12] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:57:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet
[09:57:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi)
[09:58:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/807497 (owner: 10Muehlenhoff)
[10:02:55] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] sre.k8s.reboot-node: Dynamically adjust batchsize [cookbooks] - 10https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[10:03:05] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] sre.k8s.reboot-nodes: Fix errors identified during dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[10:03:13] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] SREBatchBase: Fix broken batchsize argument [cookbooks] - 10https://gerrit.wikimedia.org/r/806286 (owner: 10JMeybohm)
[10:03:18] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Allow to dry-run SREBatchRunnerBase [cookbooks] - 10https://gerrit.wikimedia.org/r/806285 (owner: 10JMeybohm)
[10:04:34] <moritzm>	 !log installing vim security updates
[10:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:39] <Lucas_WMDE>	 jouncebot: now
[10:04:39] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 55 minute(s)
[10:05:43] <wikibugs>	 (03CR) 10Jbond: "this seems fine to me but adding riccardo who i think has more historical context with this repo" [dns] - 10https://gerrit.wikimedia.org/r/807229 (owner: 10BCornwall)
[10:06:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] squid/url downloaders: Drop Gopher in ACLs, not used anywhere [puppet] - 10https://gerrit.wikimedia.org/r/807094 (owner: 10Muehlenhoff)
[10:06:31] <wikibugs>	 (03Merged) 10jenkins-bot: Allow to dry-run SREBatchRunnerBase [cookbooks] - 10https://gerrit.wikimedia.org/r/806285 (owner: 10JMeybohm)
[10:06:35] <wikibugs>	 (03Merged) 10jenkins-bot: SREBatchBase: Fix broken batchsize argument [cookbooks] - 10https://gerrit.wikimedia.org/r/806286 (owner: 10JMeybohm)
[10:06:37] <wikibugs>	 (03Merged) 10jenkins-bot: sre.k8s.reboot-nodes: Fix errors identified during dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[10:06:39] <wikibugs>	 (03Merged) 10jenkins-bot: sre.k8s.reboot-node: Dynamically adjust batchsize [cookbooks] - 10https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[10:06:53] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti-test2003.codfw.wmnet
[10:06:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:18] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti-test2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[10:07:45] <jinxer-wm>	 (Memory over 85%) firing: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85%   - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25
[10:08:45] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2054.codfw.wmnet
[10:08:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:14] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1060.eqiad.wmnet
[10:10:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:47] <wikibugs>	 (03PS2) 10Matthias Mullie: [ImageSuggestions] Enable extension on beta testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807049 (https://phabricator.wikimedia.org/T302711)
[10:12:08] <wikibugs>	 (03PS1) 10Jbond: puppet_compiler: fix pcc_facts_processor script [puppet] - 10https://gerrit.wikimedia.org/r/807500
[10:13:26] <wikibugs>	 (03CR) 10Matthias Mullie: [C: 03+2] [ImageSuggestions] Enable extension on beta testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807049 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie)
[10:14:08] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2054.codfw.wmnet
[10:14:09] <wikibugs>	 (03Merged) 10jenkins-bot: [ImageSuggestions] Enable extension on beta testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807049 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie)
[10:14:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[10:15:08] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:15:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet_compiler: fix pcc_facts_processor script [puppet] - 10https://gerrit.wikimedia.org/r/807500 (owner: 10Jbond)
[10:16:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] grafana: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/807129 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[10:17:55] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1060.eqiad.wmnet
[10:17:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:58] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2055.codfw.wmnet
[10:18:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:36] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1061.eqiad.wmnet
[10:18:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:13] <wikibugs>	 (03CR) 10Jbond: "lgtm but a couple of nits to make sure things work on the first run" [puppet] - 10https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: 10Jaime Nuche)
[10:21:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[10:21:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[10:22:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[10:22:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[10:23:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:48] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1061.eqiad.wmnet
[10:24:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:54] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Had a look at the latest PCC output as well (including centrallog) and it lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[10:28:04] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] prometheus: ping access switches and FR firewalls [puppet] - 10https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi)
[10:28:56] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] net: Add network config setup for ML staging k8s [puppet] - 10https://gerrit.wikimedia.org/r/807096 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[10:30:42] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1062.eqiad.wmnet
[10:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:51] <wikibugs>	 (03PS1) 10Klausman: pki: Add ML staging k8s to list of CAs [puppet] - 10https://gerrit.wikimedia.org/r/807502 (https://phabricator.wikimedia.org/T302195)
[10:32:55] <wikibugs>	 (03CR) 10Volans: "reply inline" [dns] - 10https://gerrit.wikimedia.org/r/807229 (owner: 10BCornwall)
[10:33:32] <wikibugs>	 (03PS2) 10Muehlenhoff: smokeping: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/807130 (https://phabricator.wikimedia.org/T308013)
[10:35:40] <wikibugs>	 (03CR) 10Mark Bergsma: [C: 03+1] Delete git-setup script [dns] - 10https://gerrit.wikimedia.org/r/807229 (owner: 10BCornwall)
[10:36:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] smokeping: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/807130 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[10:36:14] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org
[10:36:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:37:06] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1062.eqiad.wmnet
[10:37:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:45] <wikibugs>	 (03PS1) 10JMeybohm: k8s.reboot-nodes: Fix call to super()._batchsize [cookbooks] - 10https://gerrit.wikimedia.org/r/807504 (https://phabricator.wikimedia.org/T260661)
[10:38:23] <icinga-wm>	 RECOVERY - k8s requests count to the API on ml-serve-ctrl2001 is OK: (C)100 ge (W)50 ge 43.78 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[10:40:36] <wikibugs>	 (03PS2) 10Muehlenhoff: squid/racktables: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806406 (https://phabricator.wikimedia.org/T308013)
[10:41:45] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org
[10:41:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:42:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] squid: Harden config, we don't use Gopher anywhere [puppet] - 10https://gerrit.wikimedia.org/r/807093 (owner: 10Muehlenhoff)
[10:42:57] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1063.eqiad.wmnet
[10:42:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:47:30] <volans>	 certifi did scare me because it seemed from last year ;)
[10:50:48] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2055.codfw.wmnet
[10:50:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:03] <wikibugs>	 (03PS1) 10Volans: Add wmflib as additional dependency [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/807507 (https://phabricator.wikimedia.org/T262446)
[10:52:40] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add wmflib as additional dependency [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/807507 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans)
[10:52:50] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/807504 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[10:53:49] <jayme>	 !log systemctl restart rsyslog on kubernetes2008
[10:53:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:59] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] k8s.reboot-nodes: Fix call to super()._batchsize [cookbooks] - 10https://gerrit.wikimedia.org/r/807504 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[10:55:22] <wikibugs>	 (03CR) 10Volans: [V: 03+2 C: 03+2] Add wmflib as additional dependency [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/807507 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans)
[10:56:42] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1063.eqiad.wmnet
[10:56:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:57:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:58:30] <wikibugs>	 (03Merged) 10jenkins-bot: k8s.reboot-nodes: Fix call to super()._batchsize [cookbooks] - 10https://gerrit.wikimedia.org/r/807504 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[11:02:52] <wikibugs>	 10SRE, 10SRE-tools, 10Icinga, 10Infrastructure-Foundations, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (10MoritzMuehlenhoff) @Volans: Can this task be closed with https://gerrit.wikimedia.org/r/803317 merged?
[11:03:47] <wikibugs>	 10SRE, 10SRE-tools, 10Icinga, 10Infrastructure-Foundations, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (10Volans) I was planning to close it when the new spicerack will be released with the patch... is not yet deployed to prod. But...
[11:05:04] <logmsgbot>	 !log volans@deploy1002 Started deploy [netbox/deploy@7bbf659]: Adding wmflib to venv deps
[11:05:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:39] <icinga-wm>	 RECOVERY - Check systemd state on an-conf1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:45] <wikibugs>	 (03PS1) 10Jbond: P:mediawiki::scap_client: add paremeter to indicate scap master [puppet] - 10https://gerrit.wikimedia.org/r/807510 (https://phabricator.wikimedia.org/T310740)
[11:07:58] <logmsgbot>	 !log volans@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Adding wmflib to venv deps (duration: 02m 54s)
[11:08:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:11] <logmsgbot>	 !log volans@deploy1002 Started deploy [netbox/deploy@7bbf659]: Adding wmflib to venv deps
[11:08:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:23] <logmsgbot>	 !log volans@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Adding wmflib to venv deps (duration: 01m 11s)
[11:09:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:13] <logmsgbot>	 !log volans@deploy1002 Started deploy [netbox/deploy@7bbf659]: Adding wmflib to venv deps
[11:10:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:34] <logmsgbot>	 !log volans@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Adding wmflib to venv deps (duration: 01m 20s)
[11:11:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:45] <jinxer-wm>	 (Memory over 85%) resolved: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Memory over 85% got acknowledged   - https://alerts.wikimedia.org/?q=alertname%3DMemory+over+85%25
[11:14:29] <wikibugs>	 (03PS2) 10Jbond: P:mediawiki::scap_client: add paremeter to indicate scap master [puppet] - 10https://gerrit.wikimedia.org/r/807510 (https://phabricator.wikimedia.org/T310740)
[11:14:34] <wikibugs>	 (03CR) 10EllenR: "Code looks good, I am seeing a merge conflict tag and not sure if that needs to give a ding or not." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) (owner: 10Eigyan)
[11:17:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:20:12] <wikibugs>	 (03PS5) 10Vlad.shapik: WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719)
[11:20:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik)
[11:20:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/807502 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[11:22:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/806406 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[11:22:23] <icinga-wm>	 RECOVERY - Check systemd state on an-conf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:24:15] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:24:29] <icinga-wm>	 RECOVERY - Check systemd state on an-druid1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:24:39] <icinga-wm>	 RECOVERY - Check systemd state on an-druid1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:24:41] <icinga-wm>	 RECOVERY - Check systemd state on an-conf1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:24:45] <icinga-wm>	 RECOVERY - Check systemd state on druid1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:24:53] <icinga-wm>	 RECOVERY - Check systemd state on conf1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:24:55] <icinga-wm>	 RECOVERY - Check systemd state on conf1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:05] <icinga-wm>	 RECOVERY - Check systemd state on conf2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:11] <icinga-wm>	 RECOVERY - Check systemd state on druid1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:23] <icinga-wm>	 RECOVERY - Check systemd state on druid1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:39] <icinga-wm>	 RECOVERY - Check systemd state on conf2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:57] <icinga-wm>	 RECOVERY - Check systemd state on conf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:59] <icinga-wm>	 RECOVERY - Check systemd state on conf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:26:49] <icinga-wm>	 RECOVERY - Check systemd state on an-druid1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:32:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] squid/racktables: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806406 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[11:41:30] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox
[11:41:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:03] <wikibugs>	 (03PS1) 10Jbond: P::base: allow useres to configure enable_unpriv_userns via hiera [puppet] - 10https://gerrit.wikimedia.org/r/807515 (https://phabricator.wikimedia.org/T287081)
[11:43:05] <wikibugs>	 (03PS1) 10Jbond: cumin: add alias for hosts with sensitive sysctl settings [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081)
[11:43:12] <wikibugs>	 (03PS1) 10Slyngshede: C:osm::import_waterlines remove logrotate configuration. [puppet] - 10https://gerrit.wikimedia.org/r/807517
[11:44:04] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2056.codfw.wmnet
[11:44:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:15] <wikibugs>	 (03PS1) 10Jelto: gitlab_runner: add docker-registry.discovery.wmnet to allowed_images [puppet] - 10https://gerrit.wikimedia.org/r/807518
[11:44:18] <wikibugs>	 (03CR) 10Jbond: base: create profile to allow unprivileged userns, use it on gitlab_runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: 10Dzahn)
[11:45:16] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35987/console" [puppet] - 10https://gerrit.wikimedia.org/r/807518 (owner: 10Jelto)
[11:45:33] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "Adding observability team." [puppet] - 10https://gerrit.wikimedia.org/r/807494 (owner: 10Alexandros Kosiaris)
[11:45:51] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35986/console" [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond)
[11:46:25] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:46:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:48] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35989/console" [puppet] - 10https://gerrit.wikimedia.org/r/807515 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond)
[11:47:00] <wikibugs>	 (03CR) 10Jbond: P::base: allow useres to configure enable_unpriv_userns via hiera [puppet] - 10https://gerrit.wikimedia.org/r/807515 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond)
[11:48:22] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm)
[11:48:32] <wikibugs>	 (03PS1) 10Klausman: Add dummy secrets for ML staging k8s CA [labs/private] - 10https://gerrit.wikimedia.org/r/807520
[11:48:45] <wikibugs>	 (03PS2) 10Slyngshede: C:osm::import_waterlines remove logrotate configuration. [puppet] - 10https://gerrit.wikimedia.org/r/807517
[11:49:10] <wikibugs>	 (03PS2) 10Klausman: Add dummy secrets for ML staging k8s CA [labs/private] - 10https://gerrit.wikimedia.org/r/807520 (https://phabricator.wikimedia.org/T302195)
[11:49:48] <wikibugs>	 (03CR) 10Jbond: "i think the https://gerrit.wikimedia.org/r/c/operations/puppet/+/807516/1 may be a better way to go as it relies on what is actually set a" [puppet] - 10https://gerrit.wikimedia.org/r/806486 (https://phabricator.wikimedia.org/T287081) (owner: 10Dzahn)
[11:50:03] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35991/console" [puppet] - 10https://gerrit.wikimedia.org/r/807517 (owner: 10Slyngshede)
[11:50:08] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] Add dummy secrets for ML staging k8s CA [labs/private] - 10https://gerrit.wikimedia.org/r/807520 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[11:50:33] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] Add dummy secrets for ML staging k8s CA [labs/private] - 10https://gerrit.wikimedia.org/r/807520 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[11:52:02] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] cumin: add alias for hosts with sensitive sysctl settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: 10Jbond)
[11:58:21] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/777452 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[11:58:23] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] zookeeper: remove absented zookeeper-cleanup cron [puppet] - 10https://gerrit.wikimedia.org/r/777452 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[11:59:37] <wikibugs>	 (03PS1) 10Klausman: pki: Fix wrong cluster name for ML staging k8s [labs/private] - 10https://gerrit.wikimedia.org/r/807524
[12:00:01] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:dumps::web::dumpstatusfiles, convert to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/807057 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[12:00:05] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] pki: Fix wrong cluster name for ML staging k8s [labs/private] - 10https://gerrit.wikimedia.org/r/807524 (owner: 10Klausman)
[12:02:45] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2056.codfw.wmnet
[12:02:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:04] <wikibugs>	 (03CR) 10Kosta Harlan: Structured task: enable free text for "other" rejection reason (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) (owner: 10MewOphaswongse)
[12:05:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Cmjohnson) @ayounsi lsw1-e4 and f4 do not show up as options in netbox in the provision network script.
[12:06:07] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet
[12:06:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:48] <wikibugs>	 (03PS2) 10Klausman: pki: Add ML staging k8s to list of CAs [puppet] - 10https://gerrit.wikimedia.org/r/807502 (https://phabricator.wikimedia.org/T302195)
[12:08:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10Cmjohnson) @cmooney the switches do not show up in netbox as an option for the provisioning script. I tagged Arzhel in a differe...
[12:11:11] <icinga-wm>	 PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2022-06-25 07:55:09 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:12:00] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS buster
[12:12:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster
[12:12:12] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet
[12:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:12:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:51] <icinga-wm>	 RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2022-08-24 07:48:40 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:17:03] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet
[12:17:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:26] <wikibugs>	 (PS1) Cmjohnson: Adding backup1009 to netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/807530 (https://phabricator.wikimedia.org/T307048)
[12:18:37] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1016.eqiad.wmnet with OS buster
[12:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:15] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet
[12:19:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:25] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster executed w...
[12:20:34] <wikibugs>	 (PS5) Jgiannelos: Improve performance of Tegola tile pregeneration [puppet] - https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) (owner: Isabelle Hurbain-Palatin)
[12:22:02] <wikibugs>	 (CR) Cmjohnson: [C: +2] Adding backup1009 to netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/807530 (https://phabricator.wikimedia.org/T307048) (owner: Cmjohnson)
[12:22:34] <wikibugs>	 (CR) Jgiannelos: Improve performance of Tegola tile pregeneration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) (owner: Isabelle Hurbain-Palatin)
[12:23:13] <wikibugs>	 (CR) CI reject: [V: -1] Improve performance of Tegola tile pregeneration [puppet] - https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) (owner: Isabelle Hurbain-Palatin)
[12:23:32] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2058.codfw.wmnet
[12:23:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:01] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host backup1009.eqiad.wmnet with OS bullseye
[12:24:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:09] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1009.eqiad.wmnet with OS bullseye
[12:24:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:13] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, Patch-For-Review: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host backup1009.eqiad.wmn...
[12:24:22] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, Patch-For-Review: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host backup1009.eqiad.wmnet w...
[12:25:35] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:26:42] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[12:26:45] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: ping access switches and FR firewalls [puppet] - https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[12:26:52] <wikibugs>	 (PS4) Filippo Giunchedi: prometheus: ping access switches and FR firewalls [puppet] - https://gerrit.wikimedia.org/r/807179 (https://phabricator.wikimedia.org/T169860)
[12:27:12] <wikibugs>	 SRE, Commons, Data-Persistence (Consultation), MediaWiki-extensions-WikibaseClient, and 5 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (Manuel) Hi @ItamarWMDE this seems to be on the tech board already, right?
[12:27:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, Patch-For-Review: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (Cmjohnson) @jcrespo can you confirm how you want the raid, it is failing during the installation.    I have it as Each SS...
[12:27:28] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/807517 (owner: Slyngshede)
[12:30:33] <wikibugs>	 (PS1) Slyngshede: C:dumps::web::dumpstatusfiles run every five minutes. [puppet] - https://gerrit.wikimedia.org/r/807533
[12:31:16] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1065.eqiad.wmnet
[12:31:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:27] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (jcrespo) the SSDs should be a single *software* raid0. If the remainder is HDs, those should be on RAID 6. The installation should succeed- but...
[12:32:24] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35999/console" [puppet] - https://gerrit.wikimedia.org/r/807533 (owner: Slyngshede)
[12:32:59] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2058.codfw.wmnet
[12:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:10] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "This LGTM, thank you Alex for metrics estimates, super useful!" [puppet] - https://gerrit.wikimedia.org/r/807494 (owner: Alexandros Kosiaris)
[12:36:30] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] C:osm::import_waterlines remove logrotate configuration. [puppet] - https://gerrit.wikimedia.org/r/807517 (owner: Slyngshede)
[12:38:34] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2059.codfw.wmnet
[12:38:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:26] <wikibugs>	 (CR) Muehlenhoff: [C: +1] C:dumps::web::dumpstatusfiles run every five minutes. [puppet] - https://gerrit.wikimedia.org/r/807533 (owner: Slyngshede)
[12:39:31] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] C:dumps::web::dumpstatusfiles run every five minutes. [puppet] - https://gerrit.wikimedia.org/r/807533 (owner: Slyngshede)
[12:40:32] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/807124 (owner: Alexandros Kosiaris)
[12:42:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (ayounsi) @Cmjohnson they're named "cloudsw1-e4/f4"
[12:48:29] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.04354 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[12:48:52] <wikibugs>	 (PS1) Alexandros Kosiaris: prometheus: Fixes for I0c1a0b9ef2a1310fa5d0c9 [puppet] - https://gerrit.wikimedia.org/r/807540
[12:49:52] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +2 C: +2] prometheus: Fixes for I0c1a0b9ef2a1310fa5d0c9 [puppet] - https://gerrit.wikimedia.org/r/807540 (owner: Alexandros Kosiaris)
[12:50:55] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2044 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:51:55] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1050 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:52:03] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:52:11] <icinga-wm>	 RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:52:24] <wikibugs>	 (CR) Andrew Bogott: "One thing to remember about these settings (which I forget) is that the VM doesn't GET the settings until after the VM is able to contact " [puppet] - https://gerrit.wikimedia.org/r/807174 (https://phabricator.wikimedia.org/T309930) (owner: David Caro)
[12:53:09] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:54:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:54:43] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:55:03] <icinga-wm>	 PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:55:45] <wikibugs>	 (PS1) Majavah: openstack::nova::monitor: improve check_flavor_properties performance [puppet] - https://gerrit.wikimedia.org/r/807541
[12:56:13] <wikibugs>	 (CR) Vgutierrez: Implement MediaWiki multi-DC traffic component (1 comment) [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[12:57:07] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36002/console" [puppet] - https://gerrit.wikimedia.org/r/807541 (owner: Majavah)
[12:57:35] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2059.codfw.wmnet
[12:57:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:19] <XioNoX>	 !log fix MTU on codfw switches access ports
[12:58:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:02] <eigyan>	 Greetings Everyone!
[12:59:16] <wikibugs>	 (CR) Elukey: [C: +1] pki: Add ML staging k8s to list of CAs [puppet] - https://gerrit.wikimedia.org/r/807502 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T1300).
[13:00:05] <jouncebot>	 eigyan, itamarWMDE, and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:00:28] <Lucas_WMDE>	 o/
[13:01:22] <wikibugs>	 (PS5) Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 - BETA [mediawiki-config] - https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079)
[13:01:23] <Lucas_WMDE>	 I can deploy :)
[13:01:52] <wikibugs>	 (PS2) Majavah: openstack::nova::monitor: improve check_flavor_properties performance [puppet] - https://gerrit.wikimedia.org/r/807541
[13:01:59] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: Site: eqiad : 3 VMs requested for dse_etcd cluster - https://phabricator.wikimedia.org/T311131 (BTullis)
[13:02:29] <eigyan>	 Thank you Lucas_WMDE
[13:02:40] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: Site: eqiad : 3 VMs requested for dse_etcd cluster - https://phabricator.wikimedia.org/T311131 (BTullis)
[13:02:58] <wikibugs>	 SRE, DSE-Kubernetes-Cluster, Infrastructure-Foundations, vm-requests: Site: eqiad : 3 VMs requested for dse_etcd cluster - https://phabricator.wikimedia.org/T311131 (BTullis)
[13:03:00] <wikibugs>	 (CR) Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 - BETA (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) (owner: Eigyan)
[13:03:03] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [wmf-config]: Deploy GDI Survey Wave 2 - BETA (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) (owner: Eigyan)
[13:03:32] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] [wmf-config]: Deploy GDI Survey Wave 2 - BETA [mediawiki-config] - https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) (owner: Eigyan)
[13:03:51] <wikibugs>	 (CR) Vgutierrez: "wmf-tls log format could be dropped altogether considering that we've adopted HAProxy as our TLS terminator" [puppet] - https://gerrit.wikimedia.org/r/803301 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[13:04:12] <wikibugs>	 (PS1) Elukey: profile::pki::multirootca: fix ml_staging key [labs/private] - https://gerrit.wikimedia.org/r/807544
[13:04:30] <wikibugs>	 (Merged) jenkins-bot: [wmf-config]: Deploy GDI Survey Wave 2 - BETA [mediawiki-config] - https://gerrit.wikimedia.org/r/807211 (https://phabricator.wikimedia.org/T311079) (owner: Eigyan)
[13:04:34] <wikibugs>	 (CR) Klausman: [C: +1] profile::pki::multirootca: fix ml_staging key [labs/private] - https://gerrit.wikimedia.org/r/807544 (owner: Elukey)
[13:04:35] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36003/console" [puppet] - https://gerrit.wikimedia.org/r/807541 (owner: Majavah)
[13:04:58] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] profile::pki::multirootca: fix ml_staging key [labs/private] - https://gerrit.wikimedia.org/r/807544 (owner: Elukey)
[13:05:33] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:40] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36004/console" [puppet] - https://gerrit.wikimedia.org/r/807541 (owner: Majavah)
[13:05:45] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:49] <Lucas_WMDE>	 alright, syncing the goddammit survey ;)
[13:05:55] <icinga-wm>	 PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:06:01] <wikibugs>	 (CR) Klausman: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36005/console" [puppet] - https://gerrit.wikimedia.org/r/807502 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[13:06:01] <eigyan>	 :)
[13:06:21] <Lucas_WMDE>	 (if it’s only for beta at the moment, there’s no point testing it on mwdebug)
[13:06:35] <eigyan>	 Agreed Lucas_WMDE
[13:06:41] <wikibugs>	 (CR) Klausman: [V: +1 C: +2] pki: Add ML staging k8s to list of CAs [puppet] - https://gerrit.wikimedia.org/r/807502 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[13:06:48] <eigyan>	 Thank you very much Lucas_WMDE
[13:06:53] <Lucas_WMDE>	 np!
[13:07:13] <wikibugs>	 (PS3) Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/807254 (https://phabricator.wikimedia.org/T304328)
[13:07:25] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2047 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:00] <wikibugs>	 (CR) Andrew Bogott: [C: +2] openstack::nova::monitor: improve check_flavor_properties performance [puppet] - https://gerrit.wikimedia.org/r/807541 (owner: Majavah)
[13:08:25] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (BTullis)
[13:08:42] <wikibugs>	 SRE, DSE-Kubernetes-Cluster, Infrastructure-Foundations, vm-requests: Site: eqiad : 3 VMs requested for dse_etcd cluster - https://phabricator.wikimedia.org/T311131 (BTullis)
[13:09:00] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (BTullis)
[13:09:10] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:807211|[wmf-config]: Deploy GDI Survey Wave 2 - BETA (T311079)]] (duration: 03m 29s)
[13:09:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:15] <stashbot>	 T311079: Deploy GDI Safety Survey Wave 2 on EN, ES, FA, FR, and PT wikis - https://phabricator.wikimedia.org/T311079
[13:09:20] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/807254 (https://phabricator.wikimedia.org/T304328) (owner: Lucas Werkmeister (WMDE))
[13:09:30] <Lucas_WMDE>	 eigyan: done, it should show up on beta soon
[13:09:43] <eigyan>	 Excellent!
[13:09:49] <wikibugs>	 SRE, DSE-Kubernetes-Cluster, Infrastructure-Foundations, vm-requests: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster - https://phabricator.wikimedia.org/T311131 (BTullis)
[13:09:53] <wikibugs>	 (CR) Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[13:10:06] <wikibugs>	 (Merged) jenkins-bot: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/807254 (https://phabricator.wikimedia.org/T304328) (owner: Lucas Werkmeister (WMDE))
[13:10:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:10:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:19] <wikibugs>	 (PS1) Volans: Rename cluster to ganeti_cluster [software/netbox-extras] - https://gerrit.wikimedia.org/r/807545
[13:10:36] <XioNoX>	 !log fix MTU in drmrs
[13:10:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:41] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1065.eqiad.wmnet
[13:10:43] <wikibugs>	 (PS1) Volans: netbox::host: rename cluster to ganeti_cluster [puppet] - https://gerrit.wikimedia.org/r/807546
[13:10:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:08] <Lucas_WMDE>	 syncing the first wmgWikibaseTermboxEnabled change directly, it only adds a new variable and I don’t think it makes sense to test it on mwdebug
[13:11:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:11:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:11:11] <wikibugs>	 (CR) Jgreen: [C: +1] Delete git-setup script (1 comment) [dns] - https://gerrit.wikimedia.org/r/807229 (owner: BCornwall)
[13:11:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:31] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:12:12] <wikibugs>	 (CR) Hokwelum: [C: -1] "The interval key is missing here" [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[13:12:15] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:12:47] <wikibugs>	 (PS3) Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/807255 (https://phabricator.wikimedia.org/T304328)
[13:12:57] <jinxer-wm>	 (CertManagerCertNotReady) resolved: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[13:13:00] <wikibugs>	 SRE, DSE-Kubernetes-Cluster, Infrastructure-Foundations, vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (BTullis)
[13:14:13] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1045 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:18] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:807254|Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) (T304328)]] (duration: 03m 35s)
[13:14:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:23] <stashbot>	 T304328: Move Termbox SSR for Beta Wikidata into deployment-prep project - https://phabricator.wikimedia.org/T304328
[13:14:32] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/807255 (https://phabricator.wikimedia.org/T304328) (owner: Lucas Werkmeister (WMDE))
[13:14:42] <wikibugs>	 (PS6) Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673)
[13:14:51] <wikibugs>	 (CR) Slyngshede: C:snapshot::dumps::timechecker convert cron to timer. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806366 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[13:14:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:14:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:29] <wikibugs>	 (Merged) jenkins-bot: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/807255 (https://phabricator.wikimedia.org/T304328) (owner: Lucas Werkmeister (WMDE))
[13:15:53] <Lucas_WMDE>	 okay, change 2/3 is on mwdebug1001
[13:15:57] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:15:59] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/807515 (https://phabricator.wikimedia.org/T287081) (owner: Jbond)
[13:16:00] <Lucas_WMDE>	 fyi itamarWMDE (but I’ll also take a look myself)
[13:17:00] <Lucas_WMDE>	 termbox looks fine on my end
[13:17:56] <Lucas_WMDE>	 I’ll go ahead and sync that
[13:18:14] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2060.codfw.wmnet
[13:18:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:44] <wikibugs>	 (PS4) Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/803496 (https://phabricator.wikimedia.org/T304328)
[13:19:02] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:19:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:18] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:19:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:19:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:13] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:21:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:21:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:21:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:43] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:807255|Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) (T304328)]] (duration: 03m 35s)
[13:21:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:48] <stashbot>	 T304328: Move Termbox SSR for Beta Wikidata into deployment-prep project - https://phabricator.wikimedia.org/T304328
[13:22:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:22:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:13] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/803496 (https://phabricator.wikimedia.org/T304328) (owner: Lucas Werkmeister (WMDE))
[13:22:55] <wikibugs>	 (PS2) Majavah: wmcs: neutron: use min_over_time [alerts] - https://gerrit.wikimedia.org/r/805783
[13:23:00] <wikibugs>	 (Merged) jenkins-bot: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/803496 (https://phabricator.wikimedia.org/T304328) (owner: Lucas Werkmeister (WMDE))
[13:23:17] <wikibugs>	 (CR) Muehlenhoff: profile::aptrepo::wikimedia test public apt repo on Apache (3 comments) [puppet] - https://gerrit.wikimedia.org/r/803506 (owner: Slyngshede)
[13:24:21] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:24:47] <wikibugs>	 (CR) Andrew Bogott: [V: +2 C: +2] wmcs: neutron: use min_over_time [alerts] - https://gerrit.wikimedia.org/r/805783 (owner: Majavah)
[13:25:03] <Lucas_WMDE>	 wmgWikibaseTermboxEnabled change 3/3 is on mwdebug1001 (cc itamarWMDE)
[13:25:05] <Lucas_WMDE>	 testing again…
[13:25:46] <Lucas_WMDE>	 still looks okay to me, I’ll sync
[13:26:01] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:26:22] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good!" [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm)
[13:26:52] <wikibugs>	 (Merged) jenkins-bot: wmcs: neutron: use min_over_time [alerts] - https://gerrit.wikimedia.org/r/805783 (owner: Majavah)
[13:27:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:27:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:17] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2060.codfw.wmnet
[13:27:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:27:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:16] <wikibugs>	 (CR) Vgutierrez: "generally speaking it looks good but we should move towards setting this to ENFORCED rather than PERMISSIVE." [puppet] - https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[13:28:25] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:28:56] <XioNoX>	 !log fix MTU on eqiad server facing switch ports
[13:28:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:28:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:08] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2061.codfw.wmnet
[13:29:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:56] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1066.eqiad.wmnet
[13:29:56] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:803496|Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) (T304328)]] (1/2) (duration: 03m 35s)
[13:29:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:04] <stashbot>	 T304328: Move Termbox SSR for Beta Wikidata into deployment-prep project - https://phabricator.wikimedia.org/T304328
[13:30:36] <wikibugs>	 (CR) Vgutierrez: [C: +1] "even if this CR isn't backwards compatible it isn't a big deal cause ats-be doesn't use parent proxies (and we don't run ats-tls anymore)" [puppet] - https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[13:31:59] <wikibugs>	 (CR) Vgutierrez: "looks good, should we consider backwards compatibility to let 8.x and 9.x coexist?" [puppet] - https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[13:32:13] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:15] <wikibugs>	 (CR) Vgutierrez: [C: +1] trafficserver: 9.x upgrade: remove redundant metrics [puppet] - https://gerrit.wikimedia.org/r/803297 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[13:33:21] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:46] <wikibugs>	 (PS5) Lucas Werkmeister (WMDE): [cirrus] Add a custom profile for the wikibase language selector [mediawiki-config] - https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: DCausse)
[13:33:52] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:803496|Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) (T304328)]] (2/2) (duration: 03m 39s)
[13:33:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:59] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] [cirrus] Add a custom profile for the wikibase language selector [mediawiki-config] - https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: DCausse)
[13:35:45] <wikibugs>	 (PS10) Slyngshede: profile::aptrepo::wikimedia test public apt repo on Apache [puppet] - https://gerrit.wikimedia.org/r/803506
[13:35:50] <wikibugs>	 (Merged) jenkins-bot: [cirrus] Add a custom profile for the wikibase language selector [mediawiki-config] - https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: DCausse)
[13:35:51] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:35:52] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1026 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:35:53] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:35:58] <wikibugs>	 (CR) Slyngshede: profile::aptrepo::wikimedia test public apt repo on Apache (3 comments) [puppet] - https://gerrit.wikimedia.org/r/803506 (owner: Slyngshede)
[13:35:59] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:35:59] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:36:07] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1022 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:36:07] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:36:13] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:36:19] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:36:35] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:36:36] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:36:39] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:36:49] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:36:49] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:36:50] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:36:50] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:36:51] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:36:59] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:37:00] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:37:07] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:37:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:37:59] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:38:05] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:38:13] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:38:29] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Revert "[cirrus] Add a custom profile for the wikibase language selector" [mediawiki-config] - https://gerrit.wikimedia.org/r/807265
[13:38:35] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Revert "[cirrus] Add a custom profile for the wikibase language selector" [mediawiki-config] - https://gerrit.wikimedia.org/r/807265 (owner: Lucas Werkmeister (WMDE))
[13:38:45] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:39:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:39:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:22] <wikibugs>	 (Merged) jenkins-bot: Revert "[cirrus] Add a custom profile for the wikibase language selector" [mediawiki-config] - https://gerrit.wikimedia.org/r/807265 (owner: Lucas Werkmeister (WMDE))
[13:39:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:53] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:40:01] <Lucas_WMDE>	 koi: your turn :) are you there?
[13:40:01] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:40:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:40:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:40:13] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:40:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:19] <koi>	 oh hi, I'm here
[13:40:25] <wikibugs>	 (PS6) Vlad.shapik: WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719)
[13:40:29] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:40:31] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:40:33] <Lucas_WMDE>	 ok
[13:40:35] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:40:57] <wikibugs>	 (CR) CI reject: [V: -1] WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: Vlad.shapik)
[13:41:07] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:41:08] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:41:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:41:12] <Lucas_WMDE>	 let’s do the logos change first in case we don’t have time for both
[13:41:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:20] <wikibugs>	 (PS26) Filippo Giunchedi: Add a host's conftool pooled status and weight per service to prometheus [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: Btullis)
[13:41:31] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:41:32] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:42:08] <wikibugs>	 (CR) Herron: [C: +1] prometheus: Enable scraping of the ipmi exporter [puppet] - https://gerrit.wikimedia.org/r/807494 (owner: Alexandros Kosiaris)
[13:42:12] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): specieswiki: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - https://gerrit.wikimedia.org/r/807491 (https://phabricator.wikimedia.org/T310961) (owner: Stang)
[13:42:33] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:42:35] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:42:37] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:42:39] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004559 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:43:03] <wikibugs>	 (CR) Vgutierrez: "Looks good, should we consider an approach that allows 8.x and 9.x to coexist?" [puppet] - https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[13:43:49] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:43:51] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:43:53] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:43:55] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:43:55] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:43:56] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:44:05] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:44:11] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:44:12] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:44:19] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:44:24] <wikibugs>	 (CR) Filippo Giunchedi: "I finally was able to test this patch in Pontoon (great job Ben!)" [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: Btullis)
[13:44:29] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] specieswiki: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - https://gerrit.wikimedia.org/r/807491 (https://phabricator.wikimedia.org/T310961) (owner: Stang)
[13:44:57] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:45:01] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:45:03] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:45:04] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:45:13] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:45:13] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:45:37] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2061.codfw.wmnet
[13:45:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:50] <RhinosF1>	 ^ is being worked on in the -cloud-admin channel
[13:45:53] <wikibugs>	 (Merged) jenkins-bot: specieswiki: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - https://gerrit.wikimedia.org/r/807491 (https://phabricator.wikimedia.org/T310961) (owner: Stang)
[13:46:08] <Lucas_WMDE>	 ack
[13:46:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:46:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:19] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1066.eqiad.wmnet
[13:46:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:25] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:46:26] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:46:51] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:11] <Lucas_WMDE>	 koi: the logos change is on mwdebug1001, can you test it?
[13:47:17] <koi>	 looking
[13:47:21] <wikibugs>	 (CR) Vgutierrez: trafficserver: 9.x upgrade: switch ip_allow.config to YAML format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[13:47:25] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:47:40] <Lucas_WMDE>	 (I need a Ctrl+F5 but after that it actually seems to have loaded the new logo from mwdebug)
[13:47:42] <wikibugs>	 (CR) Majavah: [V: +1] P:openstack::puppetmaster: alert for puppet certs for deleted instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806433 (owner: Majavah)
[13:47:44] <Lucas_WMDE>	 (*needed)
[13:48:04] <koi>	 LGTM
[13:48:07] <Lucas_WMDE>	 ack
[13:48:08] <wikibugs>	 (CR) Filippo Giunchedi: Add a host's conftool pooled status and weight per service to prometheus (2 comments) [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: Btullis)
[13:48:27] <Lucas_WMDE>	 ok, so I guess I should sync the PNGs, then the yaml, then the PHP, and then finally purge the PNGs from the cache
[13:48:37] <Lucas_WMDE>	 probably doesn’t matter in practice but that order feels sensible to me ^^
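The deployment order proposed above (PNGs, then the YAML, then the PHP, then the cache purge) can be sketched as a small ordered plan. This is only an illustration: the `Step` type, the action labels, and the target descriptions are hypothetical names invented for the sketch, not real paths or scap commands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str  # "sync" or "purge" (labels invented for this sketch)
    target: str  # descriptive placeholder, not a real path


# Order as described in the chat: PNGs, YAML, PHP, then purge the PNGs.
PLAN = [
    Step("sync", "logo PNGs"),
    Step("sync", "logos YAML config"),
    Step("sync", "PHP config"),
    Step("purge", "logo PNGs"),
]


def purge_follows_sync(plan):
    """Return True if every purge comes after a sync of the same target."""
    synced = set()
    for step in plan:
        if step.action == "sync":
            synced.add(step.target)
        elif step.action == "purge" and step.target not in synced:
            return False
    return True


print(purge_follows_sync(PLAN))
```

The check encodes the one ordering constraint that plausibly matters here: purging the cached PNGs before the new files are synced would let the cache refill with the old images.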
[13:48:39] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:48:59] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:48:59] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:49:01] <koi>	 yeah it makes sense
[13:49:13] <icinga-wm>	 RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:13] <koi>	 just do it in the way you like :)
[13:49:38] <Lucas_WMDE>	 ok :)
[13:49:55] <Lucas_WMDE>	 but I’m syncing project-logos/ as a whole, I don’t want to wait for the php-fpm restarts three times by syncing the three PNGs individually
[13:49:57] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:50:11] <Lucas_WMDE>	 (if scap sync-file has a flag to skip the restarts then it’s not in the --help output)
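A rough estimate of what syncing the directory as a whole saves, using the two sync durations reported earlier in this log (03m 35s and 03m 39s). The sketch assumes every sync costs about the same regardless of payload, which is a simplification and not something the log confirms.

```python
import re


def parse_duration(text):
    """Convert a duration such as '03m 35s' into seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


# Durations reported for the two earlier syncs in this log.
OBSERVED = ["03m 35s", "03m 39s"]

# Assumption: each sync costs roughly the same, whatever it transfers.
per_sync = sum(parse_duration(d) for d in OBSERVED) / len(OBSERVED)

individual_syncs = 3 * per_sync  # three PNGs synced one at a time
directory_sync = 1 * per_sync    # project-logos/ synced as a whole
saved_seconds = individual_syncs - directory_sync

print(f"estimated saving: {saved_seconds:.0f}s")
```

Under that assumption the single directory sync saves about seven minutes compared with three separate file syncs.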
[13:50:21] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1022 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:50:22] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:50:22] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:50:23] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:50:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:50:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:50:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:57] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:51:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:51:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:04] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:53:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/project-logos/: Config: [[gerrit:807491|specieswiki: Adjust width-height ratio of logo to fix display issue (T310961)]] (1/3) (duration: 03m 46s)
[13:53:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:21] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:53:21] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:53:21] <stashbot>	 T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961
[13:54:22] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:54:52] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1038 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:55:04] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2062.codfw.wmnet
[13:55:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:08] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:55:26] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:55:31] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1029 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:56:09] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1046 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:56:09] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet
[13:56:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:14] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1026 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:56:18] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:56:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:56:31] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1041 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:56:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:07] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:57:14] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:57:18] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1026 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:57:23] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:807491|specieswiki: Adjust width-height ratio of logo to fix display issue (T310961)]] (2/3) (duration: 03m 29s)
[13:57:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:57:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:57:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:41] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:57:44] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[13:58:19] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1042 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:58:24] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:58:26] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:58:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:58:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:41] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:58:51] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1033 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:59:54] <wikibugs>	 (PS1) Ssingh: dnsdist: add spec tests [puppet] - https://gerrit.wikimedia.org/r/807551
[14:00:26] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1031 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:01:04] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:01:07] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:807491|specieswiki: Adjust width-height ratio of logo to fix display issue (T310961)]] (3/3) (duration: 03m 30s)
[14:01:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:14] <stashbot>	 T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961
[14:01:33] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/project-logos/%s\n' specieswiki{,-{1.5,2}x}.png | mwscript purgeList.php # T310961
[14:01:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:58] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:02:31] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1019 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:03:08] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): logos: Update phpcs comment [mediawiki-config] - https://gerrit.wikimedia.org/r/807486 (owner: Stang)
[14:03:16] <Lucas_WMDE>	 let’s do this one as well, it shouldn’t wait for too long
[14:04:03] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet
[14:04:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:16] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] logos: Update phpcs comment [mediawiki-config] - https://gerrit.wikimedia.org/r/807486 (owner: Stang)
[14:04:20] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:05:08] <wikibugs>	 (PS1) Jbond: C:postgresql: grab the data directory from postgresql [puppet] - https://gerrit.wikimedia.org/r/807553
[14:05:10] <wikibugs>	 (Merged) jenkins-bot: logos: Update phpcs comment [mediawiki-config] - https://gerrit.wikimedia.org/r/807486 (owner: Stang)
[14:05:52] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1027 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:06:16] <icinga-wm>	 RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:06:21] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:06:53] <wikibugs>	 (PS2) Jbond: P::base: allow useres to configure enable_unpriv_userns via hiera [puppet] - https://gerrit.wikimedia.org/r/807515 (https://phabricator.wikimedia.org/T287081)
[14:07:08] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1044 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:07:09] <wikibugs>	 (PS2) Jbond: cumin: add alias for hosts with sensitive sysctl settings [puppet] - https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081)
[14:07:20] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1040 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:07:35] <wikibugs>	 (PS1) Ssingh: Release 9.1.2-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/807554
[14:08:18] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, see inline too" [puppet] - https://gerrit.wikimedia.org/r/807546 (owner: Volans)
[14:08:18] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1028 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:08:19] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1039 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:08:26] <wikibugs>	 (CR) CI reject: [V: -1] Release 9.1.2-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/807554 (owner: Ssingh)
[14:08:30] <sukhe>	 ha
[14:08:34] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Rename cluster to ganeti_cluster [software/netbox-extras] - https://gerrit.wikimedia.org/r/807545 (owner: Volans)
[14:08:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:08:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:06] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized logos/manage.py: Config: [[gerrit:807486|logos: Update phpcs comment]] (should be a no-op but syncing just in case) (duration: 03m 19s)
[14:09:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:11] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:09:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:20] <koi>	 thanks a lot!
[14:09:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:09:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:09:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:54] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1020 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:22] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:44] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:46] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:10:46] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:11:31] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1034 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:11:52] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1036 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:12:01] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:12:06] <wikibugs>	 (PS6) Jgiannelos: Improve performance of Tegola tile pregeneration [puppet] - https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) (owner: Isabelle Hurbain-Palatin)
[14:12:19] <wikibugs>	 (PS2) Ssingh: trafficserver: 9.x upgrade: remove wmf-tls log format [puppet] - https://gerrit.wikimedia.org/r/803301 (https://phabricator.wikimedia.org/T309651)
[14:13:08] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:13:08] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:13:09] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:13:09] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:13:32] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1045 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:13:56] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:14:18] <wikibugs>	 (CR) Ssingh: [V: +1] trafficserver: 9.x upgrade: rename max_connections_active_in (1 comment) [puppet] - https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[14:15:00] <wikibugs>	 (PS1) Ayounsi: Network check MTU report: improve log messages [software/netbox-extras] - https://gerrit.wikimedia.org/r/807556
[14:15:04] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:15:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:15:38] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1017 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:15:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:42] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1025 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:16:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:16:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:16:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:06] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1021 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:17:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:17:31] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:17:31] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:17:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:45] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2062.codfw.wmnet
[14:17:48] <wikibugs>	 (PS2) Ssingh: trafficserver: 9.x upgrade: switch ip_allow.config to YAML format [puppet] - https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651)
[14:17:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:58] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:17:58] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1029 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:17:58] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1037 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:17:59] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1047 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:18:06] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1023 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:18:18] <wikibugs>	 (CR) Ssingh: trafficserver: 9.x upgrade: switch ip_allow.config to YAML format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[14:19:31] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1030 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:19:41] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1035 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:19:54] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:07] <wikibugs>	 (PS1) Ayounsi: Netbox: don't alert for the accounting report [puppet] - https://gerrit.wikimedia.org/r/807558
[14:20:26] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:26] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:20:54] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:21:10] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:22:07] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:24:32] <wikibugs>	 (CR) Jgreen: [C: +1] vrts: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[14:24:33] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:24:45] <wikibugs>	 (CR) JMeybohm: [C: +2] Initial commit of helm-state-metrics (1 comment) [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm)
[14:25:18] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] Initial commit of helm-state-metrics [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm)
[14:25:26] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] Add vendor dir [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/806889 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm)
[14:26:47] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2063.codfw.wmnet
[14:26:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:21] <wikibugs>	 (PS2) Ssingh: trafficserver: 9.x upgrade: replace client.verify.server [puppet] - https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651)
[14:31:47] <wikibugs>	 (PS1) Jgiannelos: tegola: Re-enable tile pregeneration on codfw [deployment-charts] - https://gerrit.wikimedia.org/r/807561 (https://phabricator.wikimedia.org/T305845)
[14:32:12] <wikibugs>	 (CR) Ssingh: trafficserver: 9.x upgrade: replace client.verify.server (2 comments) [puppet] - https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[14:32:14] <wikibugs>	 (CR) Muehlenhoff: vrts: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[14:33:30] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/803506 (owner: Slyngshede)
[14:34:55] <wikibugs>	 (PS1) Majavah: P:(toolforge|wmcs::paws)::prometheus: improve namespace filtering [puppet] - https://gerrit.wikimedia.org/r/807562
[14:36:51] <wikibugs>	 (CR) Jbond: [C: +2] P::base: allow useres to configure enable_unpriv_userns via hiera [puppet] - https://gerrit.wikimedia.org/r/807515 (https://phabricator.wikimedia.org/T287081) (owner: Jbond)
[14:37:26] <wikibugs>	 (PS1) Muehlenhoff: Record new MOU date for aarora [puppet] - https://gerrit.wikimedia.org/r/807563
[14:37:27] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:38:34] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1022 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:39:13] <wikibugs>	 (CR) Muehlenhoff: cumin: add alias for hosts with sensitive sysctl settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: Jbond)
[14:40:17] <wikibugs>	 (PS1) Jbond: base:sysctl: rename sysctl value as it could be enabled or disabled [puppet] - https://gerrit.wikimedia.org/r/807564
[14:41:01] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1043 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:41:37] <wikibugs>	 (PS2) Ssingh: Add sukhe to super-user for router configuration [homer/public] - https://gerrit.wikimedia.org/r/807145
[14:42:24] <wikibugs>	 (CR) Ssingh: "Thanks for the review; addressed the comments and updated the CR." [homer/public] - https://gerrit.wikimedia.org/r/807145 (owner: Ssingh)
[14:44:17] <wikibugs>	 (CR) Jbond: cumin: add alias for hosts with sensitive sysctl settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: Jbond)
[14:44:19] <wikibugs>	 (CR) Muehlenhoff: cumin: add alias for hosts with sensitive sysctl settings (2 comments) [puppet] - https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: Jbond)
[14:44:21] <wikibugs>	 (PS3) Jbond: cumin: add alias for hosts with sensitive sysctl settings [puppet] - https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081)
[14:44:24] <wikibugs>	 (CR) Ssingh: "recheck" [debs/trafficserver] - https://gerrit.wikimedia.org/r/807554 (owner: Ssingh)
[14:45:32] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Record new MOU date for aarora [puppet] - https://gerrit.wikimedia.org/r/807563 (owner: Muehlenhoff)
[14:47:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2005.wikimedia.org
[14:47:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:41] <wikibugs>	 (PS4) DDesouza: QuickSurveys: Deploy research-incentive to jawiki on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/807202 (https://phabricator.wikimedia.org/T311015)
[14:49:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2005.wikimedia.org
[14:49:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:52] <wikibugs>	 (CR) Ssingh: "I guess this is expected since it's actually trying to patch 9.1.2-1wm1~bpo10+1 but we haven't updated our Deb yet?" [debs/trafficserver] - https://gerrit.wikimedia.org/r/807554 (owner: Ssingh)
[14:51:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2006.wikimedia.org
[14:51:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:28] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM very minor nit" [puppet] - https://gerrit.wikimedia.org/r/807551 (owner: Ssingh)
[14:53:29] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/netbox-extras] - https://gerrit.wikimedia.org/r/807545 (owner: Volans)
[14:53:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2006.wikimedia.org
[14:53:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:53] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/807546 (owner: Volans)
[14:54:23] <wikibugs>	 (CR) Jbond: [C: +2] base:sysctl: rename sysctl value as it could be enabled or disabled [puppet] - https://gerrit.wikimedia.org/r/807564 (owner: Jbond)
[14:55:08] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] Add helm-state-metrics image [docker-images/production-images] - https://gerrit.wikimedia.org/r/806879 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm)
[14:56:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1003.wikimedia.org
[14:56:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:38] <wikibugs>	 (PS4) Ahmon Dancy: scap bootstrap: refactor [puppet] - https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: Jaime Nuche)
[14:58:10] <wikibugs>	 (CR) Ahmon Dancy: scap bootstrap: refactor (3 comments) [puppet] - https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: Jaime Nuche)
[14:58:31] <wikibugs>	 (PS4) Jbond: cumin: add alias for hosts with sensitive sysctl settings [puppet] - https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081)
[14:58:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1003.wikimedia.org
[14:58:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1004.wikimedia.org
[14:59:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:36] <wikibugs>	 (CR) Jbond: cumin: add alias for hosts with sensitive sysctl settings (2 comments) [puppet] - https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: Jbond)
[14:59:48] <wikibugs>	 (PS2) Ssingh: Release 9.1.2-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/807554
[14:59:58] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/807558 (owner: Ayounsi)
[15:00:14] <wikibugs>	 (CR) CI reject: [V: -1] Release 9.1.2-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/807554 (owner: Ssingh)
[15:00:34] <jayme>	 !log published docker-registry.discovery.wmnet/helm-state-metrics:0.1.0-1 - T310714
[15:00:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:39] <stashbot>	 T310714: Detect and alert on helm releases in unclean state - https://phabricator.wikimedia.org/T310714
[15:01:04] <wikibugs>	 (PS2) Eevans: AQS: Use data-center apropos host list [puppet] - https://gerrit.wikimedia.org/r/805883 (https://phabricator.wikimedia.org/T307641)
[15:01:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1004.wikimedia.org
[15:01:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:23] <wikibugs>	 (CR) JMeybohm: [C: +2] Add helm-state-metrics helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/806870 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm)
[15:01:44] <wikibugs>	 (CR) Ayounsi: [C: +2] Netbox: don't alert for the accounting report [puppet] - https://gerrit.wikimedia.org/r/807558 (owner: Ayounsi)
[15:02:01] <wikibugs>	 (PS2) Ayounsi: Netbox: don't alert for the accounting report [puppet] - https://gerrit.wikimedia.org/r/807558
[15:02:28] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: Jbond)
[15:04:07] <wikibugs>	 (CR) Eevans: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/36007/" [puppet] - https://gerrit.wikimedia.org/r/805883 (https://phabricator.wikimedia.org/T307641) (owner: Eevans)
[15:04:32] <wikibugs>	 (CR) Ayounsi: [C: +1] "all good! rolling it out is time consuming as we need to check the diff and say "yes" for every single device. Let me know if you need hel" [homer/public] - https://gerrit.wikimedia.org/r/807145 (owner: Ssingh)
[15:05:21] <wikibugs>	 (Merged) jenkins-bot: Add helm-state-metrics helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/806870 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm)
[15:07:27] <wikibugs>	 (CR) Ssingh: Add sukhe to super-user for router configuration (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/807145 (owner: Ssingh)
[15:08:00] <wikibugs>	 (CR) Ssingh: [C: +2] dnsdist: add spec tests [puppet] - https://gerrit.wikimedia.org/r/807551 (owner: Ssingh)
[15:08:24] <wikibugs>	 (CR) Ssingh: [C: +2] dnsdist: add spec tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/807551 (owner: Ssingh)
[15:08:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet
[15:08:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet
[15:10:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:12] <wikibugs>	 (PS1) Jgiannelos: tegola: Point tegola to the latest swift container [deployment-charts] - https://gerrit.wikimedia.org/r/807567
[15:13:40] <wikibugs>	 (CR) Jbond: [C: +2] cumin: add alias for hosts with sensitive sysctl settings [puppet] - https://gerrit.wikimedia.org/r/807516 (https://phabricator.wikimedia.org/T287081) (owner: Jbond)
[15:15:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2002.codfw.wmnet
[15:15:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:17:43] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2059 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:51] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/806397 (https://phabricator.wikimedia.org/T310740) (owner: Jaime Nuche)
[15:18:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2002.codfw.wmnet
[15:18:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:33] <icinga-wm>	 PROBLEM - Host ms-be2063 is DOWN: PING CRITICAL - Packet loss = 100%
[15:22:03] <icinga-wm>	 RECOVERY - Host ms-be2063 is UP: PING OK - Packet loss = 0%, RTA = 31.54 ms
[15:22:23] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:24:20] <wikibugs>	 (PS6) MewOphaswongse: Structured task: enable free text for "other" rejection reason in betalabs [mediawiki-config] - https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099)
[15:24:47] <wikibugs>	 (CR) MewOphaswongse: Structured task: enable free text for "other" rejection reason in betalabs (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) (owner: MewOphaswongse)
[15:25:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:26:20] <wikibugs>	 (PS1) MewOphaswongse: Structured task: enable free text for "other" rejection reason [mediawiki-config] - https://gerrit.wikimedia.org/r/807576 (https://phabricator.wikimedia.org/T304099)
[15:27:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (Cmjohnson) @jcrespo there are 3 drives and I did make it raid 6
[15:28:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:28:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:28:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:00] <wikibugs>	 SRE, serviceops, Patch-For-Review: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (JMeybohm) I would assume we can reuse the `pwstore/pw.git/deployment-key-passphrase` for this as the audience is the same as well?
[15:31:15] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:42] <wikibugs>	 (CR) JMeybohm: [C: +2] Deploy helm-state-metrics to staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm)
[15:32:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:32:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:36:21] <moritzm>	 !log upload jenkins 2.332.4 to apt.wikimedia.org T311068
[15:36:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:02] <wikibugs>	 (Merged) jenkins-bot: Deploy helm-state-metrics to staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm)
[15:40:09] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:40:27] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:40:30] <wikibugs>	 (CR) Ssingh: "recheck" [debs/trafficserver] - https://gerrit.wikimedia.org/r/807554 (owner: Ssingh)
[15:40:33] <icinga-wm>	 PROBLEM - Host ms-be2063 is DOWN: PING CRITICAL - Packet loss = 100%
[15:41:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (jcrespo) So I checked the recipe and it didn't change since april 2020 (except briefly in may for 3 days for some old/bad hardware). In particu...
[15:42:03] <icinga-wm>	 RECOVERY - Host ms-be2063 is UP: PING WARNING - Packet loss = 77%, RTA = 34.19 ms
[15:44:01] <wikibugs>	 (PS3) Ssingh: Release 9.1.2-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/807554
[15:50:33] <wikibugs>	 Puppet, Infrastructure-Foundations, netbox, PostgreSQL: Puppet change at each run on postgres replicas - https://phabricator.wikimedia.org/T311156 (ayounsi) p:Triage→Medium
[15:51:05] <icinga-wm>	 PROBLEM - Host ms-be2063 is DOWN: PING CRITICAL - Packet loss = 100%
[15:51:21] <icinga-wm>	 RECOVERY - Host ms-be2063 is UP: PING OK - Packet loss = 0%, RTA = 31.55 ms
[15:53:24] <wikibugs>	 (PS9) Ayounsi: Add python3.10 support to Tox [cookbooks] - https://gerrit.wikimedia.org/r/803263
[15:53:44] <wikibugs>	 (PS1) Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/807581 (https://phabricator.wikimedia.org/T308244)
[15:54:03] <wikibugs>	 (CR) Kosta Harlan: [C: +2] linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/807581 (https://phabricator.wikimedia.org/T308244) (owner: Kosta Harlan)
[15:59:52] <wikibugs>	 (Merged) jenkins-bot: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/807581 (https://phabricator.wikimedia.org/T308244) (owner: Kosta Harlan)
[16:00:30] <wikibugs>	 (PS2) Alexandros Kosiaris: prometheus: Enable scraping of the ipmi exporter [puppet] - https://gerrit.wikimedia.org/r/807494
[16:00:35] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] prometheus: Enable scraping of the ipmi exporter [puppet] - https://gerrit.wikimedia.org/r/807494 (owner: Alexandros Kosiaris)
[16:01:46] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Keyholder: add new agent for trainbranchbot [puppet] - https://gerrit.wikimedia.org/r/807192 (https://phabricator.wikimedia.org/T310620) (owner: Thcipriani)
[16:04:02] <logmsgbot>	 !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply
[16:04:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:08] <wikibugs>	 (PS10) Ayounsi: Add python3.10 support to Tox [cookbooks] - https://gerrit.wikimedia.org/r/803263
[16:04:10] <wikibugs>	 (PS18) Ayounsi: Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261
[16:05:19] <logmsgbot>	 !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply
[16:05:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:31] <logmsgbot>	 !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply
[16:06:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:38] <wikibugs>	 (CR) BCornwall: [C: +2] Delete git-setup script [dns] - https://gerrit.wikimedia.org/r/807229 (owner: BCornwall)
[16:07:55] <wikibugs>	 (CR) Ayounsi: Initial support for servers switch interfaces (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[16:08:47] <logmsgbot>	 !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply
[16:08:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:01] <wikibugs>	 (CR) CI reject: [V: -1] Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[16:09:03] <wikibugs>	 (PS3) Zabe: vrts: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013)
[16:09:41] <wikibugs>	 (CR) Zabe: vrts: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[16:09:51] <logmsgbot>	 !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply
[16:09:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:01] <wikibugs>	 (PS19) Ayounsi: Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261
[16:11:46] <logmsgbot>	 !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply
[16:11:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:53] <icinga-wm>	 PROBLEM - Host ms-be2063 is DOWN: PING CRITICAL - Packet loss = 100%
[16:12:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:13:55] <icinga-wm>	 RECOVERY - Host ms-be2063 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms
[16:13:55] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1009.eqiad.wmnet with OS bullseye
[16:13:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:00] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1009.eqiad.wmnet with OS bullseye
[16:14:02] <logmsgbot>	 !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1009.eqiad.wmnet with OS bullseye
[16:14:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1009.eqiad.wmnet with OS bullseye executed w...
[16:14:21] <icinga-wm>	 PROBLEM - Keyholder SSH agent on deploy1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder
[16:14:21] <icinga-wm>	 PROBLEM - Keyholder SSH agent on deploy2002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder
[16:14:31] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2063 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:51] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:31] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1009.eqiad.wmnet with OS bullseye
[16:18:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1009.eqiad.wmnet with OS bullseye
[16:20:07] <jynus>	 the keyholder thing is expected (e.g. maintenance, reboot)?
[16:20:34] <RhinosF1>	 jynus: akosiaris just merged a change
[16:20:44] <jynus>	 ok
[16:20:49] <RhinosF1>	 It's likely them adding the gerrit/scap stuff
[16:20:56] <jynus>	 ok, cool
[16:21:16] <jynus>	 with so much noise it is not easy to track all changes :-D
[16:21:23] <icinga-wm>	 RECOVERY - Keyholder SSH agent on deploy1002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[16:21:25] <icinga-wm>	 RECOVERY - Keyholder SSH agent on deploy2002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[16:21:32] <wikibugs>	 SRE, DSE-Kubernetes-Cluster, Infrastructure-Foundations, vm-requests: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster - https://phabricator.wikimedia.org/T311131 (BTullis)
[16:23:46] <wikibugs>	 SRE, serviceops, Patch-For-Review: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (akosiaris) Open→Resolved
[16:23:50] <wikibugs>	 SRE, serviceops, Patch-For-Review: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (akosiaris) >>! In T310620#8020287, @JMeybohm wrote: > I would assume we can reuse the `pwstore/pw.git/deployment-key-passphrase` for this as the aud...
[16:24:34] <wikibugs>	 SRE, serviceops, Patch-For-Review: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (akosiaris) a:akosiaris key generated, change merged, keyholder and keyholder-proxy restart and rearmed. I think we are done on this front! I am...
[16:25:13] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:29:25] <icinga-wm>	 PROBLEM - Host ms-be2063 is DOWN: PING CRITICAL - Packet loss = 100%
[16:29:46] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1009.eqiad.wmnet with reason: host reimage
[16:29:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:09] <wikibugs>	 (CR) David Caro: [C: +2] openstack.vendordata: reduce timeout so it retries (1 comment) [puppet] - https://gerrit.wikimedia.org/r/807174 (https://phabricator.wikimedia.org/T309930) (owner: David Caro)
[16:30:21] <icinga-wm>	 RECOVERY - Host ms-be2063 is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms
[16:33:02] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1009.eqiad.wmnet with reason: host reimage
[16:33:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:12] <wikibugs>	 (PS2) Matthias Mullie: [ImageSuggestions] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/807050 (https://phabricator.wikimedia.org/T302711)
[16:37:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (jcrespo)
[16:37:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (jcrespo) Open→Resolved It turned out everything was perfectly configured, we just needed to retry (e.g. for puppet to apply the new con...
[16:42:37] <hashar>	 jouncebot: now
[16:42:37] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 17 minute(s)
[16:43:49] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2063.codfw.wmnet
[16:43:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:24] <wikibugs>	 (PS4) Ssingh: Release 9.1.2-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/807554
[16:45:04] <hashar>	 !log Restarting CI Jenkins
[16:45:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:15] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1009.eqiad.wmnet with OS bullseye
[16:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:19] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1009.eqiad.wmnet with OS bullseye completed:...
[16:54:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (BTullis) @Cmjohnson yes please, that's perfect.
[16:54:05] <wikibugs>	 (PS2) BCornwall: traffic: Port over ATS restart alert [alerts] - https://gerrit.wikimedia.org/r/807214 (https://phabricator.wikimedia.org/T300723)
[16:54:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:54:19] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:56:50] <wikibugs>	 SRE, serviceops: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (dancy) Thank you @akosiaris !     What's the official way to collect the public key?
[16:57:23] <wikibugs>	 (CR) BCornwall: [C: +2] traffic: Port over ATS restart alert [alerts] - https://gerrit.wikimedia.org/r/807214 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[16:58:44] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:59:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:02:05] <wikibugs>	 (PS1) MarcoAurelio: gawiki: Set category collation from `uppercase` to `uca-ga-u-kn` [mediawiki-config] - https://gerrit.wikimedia.org/r/807593 (https://phabricator.wikimedia.org/T311136)
[17:04:10] <wikibugs>	 SRE, serviceops: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (akosiaris) >>! In T310620#8020587, @dancy wrote: > Thank you @akosiaris !    >  > What's the official way to collect the public key?  Can't say we have an official way to...
[17:04:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:08:13] <wikibugs>	 SRE-tools, Spicerack: spicerack.redfish: Add handle for when job returns - "Job for this device is already present" - https://phabricator.wikimedia.org/T311162 (jbond) p:Triage→Medium
[17:09:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:10:39] <wikibugs>	 (PS2) Jbond: C:postgresql: grab the data directory from postgresql [puppet] - https://gerrit.wikimedia.org/r/807553 (https://phabricator.wikimedia.org/T311156)
[17:13:36] <wikibugs>	 Puppet, Infrastructure-Foundations, netbox, Patch-For-Review, PostgreSQL: Puppet change at each run on postgres replicas - https://phabricator.wikimedia.org/T311156 (jbond) I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/807553 should fix this issue  > How to know if it's safe to...
[17:15:45] <hauskatze>	 jouncebot: nowandnext
[17:15:45] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 44 minute(s)
[17:15:45] <jouncebot>	 In 0 hour(s) and 44 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T1800)
[17:15:45] <jouncebot>	 In 0 hour(s) and 44 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T1800)
[17:17:10] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (BTullis) @Cmjohnson yes please, let's use hardware RAID for this please. As @RobH suggested in the parent task, let's...  >  use the flex bays as a raid1 for the OS data, and the...
[17:37:12] <hashar>	 jouncebot: now
[17:37:12] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 22 minute(s)
[17:37:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:39:56] <wikibugs>	 SRE, serviceops: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (dancy) Beautiful.  I added the public key to Gerrit's trainbranchbot using the following command: ` echo ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIA9PnDpx0+F5mgJUbLxiCOFm2G5an...
[17:42:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:44:48] <wikibugs>	 (PS2) MarcoAurelio: gawiki: Change category collation from `uppercase` to `uca-ga-u-kn` [mediawiki-config] - https://gerrit.wikimedia.org/r/807593 (https://phabricator.wikimedia.org/T311136)
[18:00:04] <jouncebot>	 hashar and brennen: Time to snap out of that daydream and deploy Train log triage with CPT. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T1800).
[18:00:04] <jouncebot>	 hashar and brennen: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T1800)
[18:04:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:09:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:12:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:17:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:33:39] <wikibugs>	 (PS1) Jcrespo: install_server: Move backup1009, backup2009 to the list of manual partitioning [puppet] - https://gerrit.wikimedia.org/r/807602
[18:41:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:41:19] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:41:40] <sukhe>	 here
[18:41:42] <sukhe>	 and ACKed
[18:42:07] <jhathaway>	 here as well
[18:42:25] <wikibugs>	 (CR) Jcrespo: [C: +2] install_server: Move backup1009, backup2009 to the list of manual partitioning [puppet] - https://gerrit.wikimedia.org/r/807602 (owner: Jcrespo)
[18:44:21] <jhathaway>	 increase in thumbor latency, but I don't see anything particularly strange in the thumbor dashboard
[18:44:27] <sukhe>	 yeah 
[18:44:45] <sukhe>	 I am trying to find the resolution just in case it gets worse
[18:45:39] <sukhe>	 that seems to be a recovery unless I am reading it incorrectly
[18:46:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:46:24] <sukhe>	 oh yeah
[18:46:25] <sukhe>	 hm
[18:47:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:47:40] <TheresNoTime>	 👀
[18:47:47] <sukhe>	 yeah needs to be resolved
[18:48:13] <jhathaway>	 ah, it will keep firing with an ack?
[18:48:48] <sukhe>	 jhathaway: I think it resolved and happened again, hence the separate alert
[18:48:58] <jhathaway>	 ah, ok that makes sense
[18:49:25] <sukhe>	 I have ACKed this one again but yeah, this is clearly not the solution
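The exchange above turns on whether the page was one alert that kept firing through an ACK or a new alert after a resolve. A minimal Python sketch of how one might check this from the log itself follows. It assumes the line shape of the jinxer-wm messages seen here; the function names `alert_transitions` and `count_refires` are hypothetical, not part of any Wikimedia tooling.

```python
import re

# Assumed line shape, modeled on the jinxer-wm messages in this log.
ALERT_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] <jinxer-wm>\s+\((\w+)\) (firing|resolved):"
)


def alert_transitions(lines, alertname):
    """Return (time, state) pairs for one alert name, in log order."""
    out = []
    for line in lines:
        m = ALERT_RE.search(line)
        if m and m.group(2) == alertname:
            out.append((m.group(1), m.group(3)))
    return out


def count_refires(transitions):
    """Count how often 'firing' directly follows 'resolved' (a fresh alert)."""
    refires = 0
    prev = None
    for _, state in transitions:
        if state == "firing" and prev == "resolved":
            refires += 1
        prev = state
    return refires


# Shortened copies of the #page lines from this log.
sample = [
    "[18:41:18] <jinxer-wm>\t (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page",
    "[18:46:18] <jinxer-wm>\t (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page",
    "[18:47:18] <jinxer-wm>\t (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page",
]
transitions = alert_transitions(sample, "ProbeDown")
print(transitions, count_refires(transitions))
```

A refire count above zero matches sukhe's reading: the alert resolved and then fired again, so the earlier ACK no longer applied.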
[18:51:17] <wikibugs>	 (PS3) Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[18:51:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:51:19] <wikibugs>	 (PS1) Krinkle: buildConfigCache,buildDBLists: Remove redundant defines.php include [mediawiki-config] - https://gerrit.wikimedia.org/r/807604
[18:51:21] <wikibugs>	 (PS1) Krinkle: build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605
[18:51:33] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:51:48] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:52:06] <wikibugs>	 (CR) CI reject: [V: -1] [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[18:52:08] <wikibugs>	 (CR) CI reject: [V: -1] buildConfigCache,buildDBLists: Remove redundant defines.php include [mediawiki-config] - https://gerrit.wikimedia.org/r/807604 (owner: Krinkle)
[18:52:11] <wikibugs>	 (CR) CI reject: [V: -1] build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605 (owner: Krinkle)
[18:54:24] <wikibugs>	 (PS2) Krinkle: buildConfigCache,buildDBLists: Remove redundant defines.php include [mediawiki-config] - https://gerrit.wikimedia.org/r/807604
[18:54:26] <wikibugs>	 (PS2) Krinkle: build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605
[18:54:28] <wikibugs>	 (PS4) Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[18:54:49] <wikibugs>	 (PS3) Krinkle: build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/807604
[18:54:51] <wikibugs>	 (PS3) Krinkle: build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605
[18:54:53] <wikibugs>	 (PS5) Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[18:55:47] <wikibugs>	 (CR) CI reject: [V: -1] build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605 (owner: Krinkle)
[18:55:52] <wikibugs>	 (CR) CI reject: [V: -1] build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/807604 (owner: Krinkle)
[18:56:05] <wikibugs>	 (CR) CI reject: [V: -1] [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[18:56:33] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:58:47] <wikibugs>	 (PS4) Krinkle: build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/807604
[18:58:49] <wikibugs>	 (PS4) Krinkle: build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605
[18:58:51] <wikibugs>	 (PS6) Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[19:00:11] <wikibugs>	 (CR) CI reject: [V: -1] build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605 (owner: Krinkle)
[19:00:14] <wikibugs>	 (CR) CI reject: [V: -1] build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/807604 (owner: Krinkle)
[19:00:16] <wikibugs>	 (CR) jenkins-bot: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[19:00:33] <wikibugs>	 (PS5) Krinkle: build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/807604
[19:00:35] <wikibugs>	 (PS5) Krinkle: build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605
[19:00:37] <wikibugs>	 (PS7) Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[19:01:17] <wikibugs>	 (CR) CI reject: [V: -1] build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605 (owner: Krinkle)
[19:01:23] <wikibugs>	 (CR) CI reject: [V: -1] [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[19:02:51] <wikibugs>	 (PS6) Krinkle: build: Make config gen signature for prod compatible with test [mediawiki-config] - https://gerrit.wikimedia.org/r/807605
[19:02:53] <wikibugs>	 (PS8) Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[19:03:57] <wikibugs>	 (CR) CI reject: [V: -1] [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[19:05:31] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:05:43] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:06:29] <hashar>	 !log Restarting CI Jenkins
[19:06:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:59] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:11:03] <wikibugs>	 (PS9) Krinkle: [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[19:11:17] <icinga-wm>	 PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2022-06-25 07:55:09 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:11:59] <sukhe>	 ^ this is not a problem problem
[19:12:16] <wikibugs>	 (CR) CI reject: [V: -1] [POC] noc: Add perwiki.php to show per wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[19:13:11] <icinga-wm>	 RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2022-08-24 07:48:40 +0000 (expires in 62 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:14:17] <herron>	 !log bounced apache on lists1001
[19:14:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:39] <wikibugs>	 (PS10) Krinkle: noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[19:15:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1006.eqiad.wmnet with OS buster
[19:16:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster
[19:16:48] <wikibugs>	 (Abandoned) Hashar: Add SonarQube scanner [software/gerrit/jsonschemagenerator] - https://gerrit.wikimedia.org/r/791692 (owner: Hashar)
[19:16:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (Cmjohnson)
[19:23:56] <wikibugs>	 (CR) Dzahn: alertmanager: create receivers for serviceops-collab (2 comments) [puppet] - https://gerrit.wikimedia.org/r/807201 (owner: Dzahn)
[19:31:10] <aqu>	 !log Deploying analytics/refinery (weekly train)
[19:31:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:51] <urandom>	 Is there anyone around that could spare a few minutes to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/805883 for me? 
[19:32:05] <urandom>	 It's pretty trivial
[19:32:07] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@99cca44]: Regular analytics weekly train [analytics/refinery@99cca44]
[19:32:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:46] <wikibugs>	 ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (phaultfinder)
[19:37:06] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@1f2f286]: namespace maps: Exclude labtest database group from data collection
[19:37:10] <ryankemper>	 urandom: I can merge it
[19:37:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:24] <urandom>	 ryankemper: awesome, thank you!
[19:37:45] <wikibugs>	 (CR) Ryan Kemper: [C: +2] AQS: Use data-center apropos host list [puppet] - https://gerrit.wikimedia.org/r/805883 (https://phabricator.wikimedia.org/T307641) (owner: Eevans)
[19:38:49] <ryankemper>	 urandom: just merged (haven't manually run puppet yet)
[19:38:59] <urandom>	 ryankemper: I can take care of that
[19:39:02] <ryankemper>	 FWIW there's another patch that was waiting that I puppet-merged as well: `Jcrespo: install_server: Move backup1009, backup2009 to the list of manual partitioning (8eded6f9e9)`
[19:39:10] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@1f2f286]: namespace maps: Exclude labtest database group from data collection (duration: 02m 03s)
[19:39:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:14] <ryankemper>	 not sure who jcrespo is on IRC exactly but the change looked minor
[19:39:23] <ryankemper>	 urandom: cool, puppet merge done so feel free to proceed
[19:39:32] <urandom>	 ryankemper: thanks again!
[19:41:06] <ryankemper>	 np!
[19:41:15] <RhinosF1>	 jynus: ^
[19:41:26] <RhinosF1>	 ryankemper: jynus is jcrespo
[19:42:08] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1006.eqiad.wmnet with OS buster
[19:42:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:13] <ryankemper>	 jynus: just to save you a small backlog scroll, I merged `Jcrespo: install_server: Move backup1009, backup2009 to the list of manual partitioning (8eded6f9e9)`
[19:42:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster exec...
[19:42:26] <ryankemper>	 RhinosF1: tyvm
[19:42:29] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1006.eqiad.wmnet with OS bullseye
[19:42:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye
[19:43:04] <RhinosF1>	 ryankemper: /who *jcrespo* should work, as will first letter + surname for anyone who has a WMF cloak
[19:43:34] <ryankemper>	 oh neat, thanks
[19:43:53] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:45:37] <wikibugs>	 (PS1) Krinkle: missing.php: Update docs and add test plan [mediawiki-config] - https://gerrit.wikimedia.org/r/807609 (https://phabricator.wikimedia.org/T308932)
[19:45:39] <wikibugs>	 (PS1) Krinkle: multiversion: Move missing.php from wmf-config/ to /multiversion [mediawiki-config] - https://gerrit.wikimedia.org/r/807610 (https://phabricator.wikimedia.org/T308932)
[19:46:35] <icinga-wm>	 RECOVERY - AQS root url on aqs2003 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:46:35] <icinga-wm>	 RECOVERY - AQS root url on aqs2004 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:48:23] <icinga-wm>	 RECOVERY - Check systemd state on aqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:51:21] <icinga-wm>	 RECOVERY - Check systemd state on aqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:52:27] <icinga-wm>	 RECOVERY - AQS root url on aqs2005 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:53:45] <icinga-wm>	 RECOVERY - AQS root url on aqs2006 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:54:29] <icinga-wm>	 RECOVERY - AQS root url on aqs2007 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:55:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:55:45] <icinga-wm>	 RECOVERY - AQS root url on aqs2009 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:55:52] <sukhe>	 👀
[19:56:57] <icinga-wm>	 RECOVERY - Check systemd state on aqs2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:58:03] <icinga-wm>	 RECOVERY - AQS root url on aqs2012 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:58:05] <icinga-wm>	 RECOVERY - AQS root url on aqs2011 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[19:58:11] <icinga-wm>	 RECOVERY - Check systemd state on aqs2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220622T2000).
[20:00:05] <jouncebot>	 hauskatze: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:01:51] * hauskatze reporting for backport window, sorry I'm late
[20:02:16] <cjming>	 hi - i can deploy
[20:03:06] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@99cca44]: Regular analytics weekly train [analytics/refinery@99cca44] (duration: 30m 58s)
[20:03:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:40] <wikibugs>	 (CR) Clare Ming: [C: +2] gawiki: Change category collation from `uppercase` to `uca-ga-u-kn` [mediawiki-config] - https://gerrit.wikimedia.org/r/807593 (https://phabricator.wikimedia.org/T311136) (owner: MarcoAurelio)
[20:03:58] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@99cca44]: Regular analytics weekly train retry [analytics/refinery@99cca44]
[20:04:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:25] <wikibugs>	 (Merged) jenkins-bot: gawiki: Change category collation from `uppercase` to `uca-ga-u-kn` [mediawiki-config] - https://gerrit.wikimedia.org/r/807593 (https://phabricator.wikimedia.org/T311136) (owner: MarcoAurelio)
[20:05:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:06:09] <hauskatze>	 hi cjming - thanks for deploying today. My patch cannot really be tested on mwdebug
[20:06:25] <hauskatze>	 it needs a maintenance script run after deployment to fully apply
[20:06:30] <cjming>	 hi hauskatze: i was just gonna ask you about that -- ok -- so i'll sync and then run the script
[20:06:33] <urbanecm>	 see https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#updateCollation for details :)
[20:07:01] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:07:24] <hauskatze>	 left the command on the calendar, in this case: mwscript updateCollation.php --wiki=gawiki --previous-collation=uppercase
[20:07:40] <cjming>	 i should run that on the deployment server right?
[20:07:52] <hauskatze>	 on mwmaint yep
[20:07:56] <hauskatze>	 urbanecm: right?
[20:08:17] <urbanecm>	 cjming: all maintenance scripts should be run from mwmaint1002.eqiad.wmnet (ie. _not_ deployment srv)
[20:08:18] <hauskatze>	 not sure which mwmaint100x we are on nowadays :-)
[20:08:19] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:08:33] <cjming>	 thanks urbanecm: got it
[20:08:49] <urbanecm>	 np
[20:09:06] <urbanecm>	 otherwise, the cmdline hauskatze quoted should work fine
[20:09:09] <cjming>	 urbanecm: so the process is 1. sync on deployment server 2. run mwscript on maintenance server
[20:09:43] <hauskatze>	 I think we need the change fully deployed first
[20:10:00] <cjming>	 ok - syncing now
[20:10:01] <urbanecm>	 cjming: correct
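The two-step order confirmed above (sync from the deployment server first, then run the maintenance script on the maintenance host) can be sketched as a small Python helper. This is only an illustration under stated assumptions: `update_collation_command` and `backport_plan` are hypothetical names, and the only real command is the `mwscript` line hauskatze quoted earlier in the log.

```python
import shlex


def update_collation_command(wiki, previous_collation):
    """Build the maintenance command quoted in the log as an argv list."""
    return [
        "mwscript",
        "updateCollation.php",
        f"--wiki={wiki}",
        f"--previous-collation={previous_collation}",
    ]


def backport_plan(wiki, previous_collation):
    """Return (where, what) steps in the order agreed in the channel."""
    return [
        # Step 1: the config change must be fully deployed first.
        ("deployment server", "sync the merged config change"),
        # Step 2: maintenance scripts run on the maintenance host, not the deployment server.
        ("maintenance server", shlex.join(update_collation_command(wiki, previous_collation))),
    ]


for where, what in backport_plan("gawiki", "uppercase"):
    print(f"{where}: {what}")
```

Keeping the command as an argv list and joining it only for display avoids quoting mistakes if a wiki name or collation ever contains shell-special characters.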
[20:10:15] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@99cca44]: Regular analytics weekly train retry [analytics/refinery@99cca44] (duration: 06m 16s)
[20:10:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:10:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:17] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@99cca44] (thin): Regular analytics weekly train THIN [analytics/refinery@99cca44]
[20:11:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:25] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@99cca44] (thin): Regular analytics weekly train THIN [analytics/refinery@99cca44] (duration: 00m 07s)
[20:11:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:41] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@99cca44] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@99cca44]
[20:11:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:13:12] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1006.eqiad.wmnet with OS bullseye
[20:13:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye ex...
[20:13:51] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:807593|gawiki: Change category collation from `uppercase` to `uca-ga-u-kn` (T311136)]] (duration: 03m 39s)
[20:13:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:54] <stashbot>	 T311136: Set $wgCategoryCollation for the Irish language Wikipedia, gawiki - https://phabricator.wikimedia.org/T311136
[20:14:02] <cjming>	 running maint script now
[20:14:09] <hauskatze>	 Great, thanks :)
[20:14:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:14:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:14:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:16:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:11] <cjming>	 hauskatze: any idea how many rows in total?  it's still running - at about ~100k rows now
[20:17:40] <hauskatze>	 The requestor mentioned some 50k articles
[20:19:16] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@99cca44] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@99cca44] (duration: 07m 36s)
[20:19:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:34] <cjming>	 alrighty - just finished -- processed ~189k rows
[20:20:16] <hauskatze>	 does the output look alright?
[20:20:31] <hauskatze>	 I'm seeing no havoc on wiki so it should be okay :)
[20:20:39] <cjming>	 hauskatze: should be live - script is done
[20:21:36] <hauskatze>	 thanks cjming - I'll let our requestor know, so she can check as well
[20:21:37] <cjming>	 ya - i'm not sure what to look for other than gawiki is still up and not blowing up
[20:21:49] <cjming>	 np!
[20:22:00] <hauskatze>	 definitely not setting the wiki ablaze today :)
[20:22:01] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service,refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:22:03] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1006.eqiad.wmnet with OS buster
[20:22:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:10] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster
[20:22:51] <icinga-wm>	 RECOVERY - Check systemd state on mw1406 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:24:09] <cjming>	 !log end of UTC late backport window
[20:24:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:27:41] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1006.eqiad.wmnet with OS buster
[20:27:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:46] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster exec...
[20:28:00] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1006.eqiad.wmnet with OS bullseye
[20:28:04] <wikibugs>	 (CR) Kosta Harlan: [C: +1] Structured task: enable free text for "other" rejection reason in betalabs [mediawiki-config] - https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) (owner: MewOphaswongse)
[20:28:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye
[20:45:06] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1007.eqiad.wmnet with OS bullseye
[20:45:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye
[20:48:15] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@99cca44]: Regular analytics weekly train retry force [analytics/refinery@99cca44]
[20:48:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:34] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@99cca44]: Regular analytics weekly train retry force [analytics/refinery@99cca44] (duration: 01m 18s)
[20:49:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:58:12] <wikibugs>	 (PS2) Cwhite: profile: add kibana to dashboards rewrite rule [puppet] - https://gerrit.wikimedia.org/r/806451 (https://phabricator.wikimedia.org/T310360)
[20:58:51] <wikibugs>	 (CR) Cwhite: [C: +2] "tested redirect on beta" [puppet] - https://gerrit.wikimedia.org/r/806451 (https://phabricator.wikimedia.org/T310360) (owner: Cwhite)
[21:09:05] <wikibugs>	 (PS2) Cwhite: opensearch: disable compatibility mode [puppet] - https://gerrit.wikimedia.org/r/803588 (https://phabricator.wikimedia.org/T301017)
[21:10:19] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:17:21] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:28:42] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1006.eqiad.wmnet with OS bullseye
[21:28:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:48] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye ex...
[21:33:25] <wikibugs>	 (CR) Dzahn: alertmanager: create receivers for serviceops-collab (2 comments) [puppet] - https://gerrit.wikimedia.org/r/807201 (owner: Dzahn)
[21:37:39] <wikibugs>	 (CR) Dzahn: [C: +2] "just merging - since we don't actually use this yet and we can always amend. I'll bring it up in the next team meeting." [puppet] - https://gerrit.wikimedia.org/r/807201 (owner: Dzahn)
[21:38:59] <wikibugs>	 (Abandoned) Dzahn: docker::network: refresh service docker after adding a docker network [puppet] - https://gerrit.wikimedia.org/r/806341 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[21:40:15] <wikibugs>	 (Abandoned) Dzahn: cumin: add alias for hosts with sensitive sysctl settings [puppet] - https://gerrit.wikimedia.org/r/806486 (https://phabricator.wikimedia.org/T287081) (owner: Dzahn)
[21:40:39] <wikibugs>	 (PS2) Dzahn: gitlab: add prometheus blackbox http monitor [puppet] - https://gerrit.wikimedia.org/r/806476
[21:44:21] <ebernhardson>	 !log restart elasticsearch_6@cloudelastic-chi-eqiad to resolve Old GC Hell alert
[21:44:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:35] <ebernhardson>	 !log restart elasticsearch_6@cloudelastic-chi-eqiad on cloudelastic1003 to resolve Old GC Hell alert 
[21:44:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:16] <wikibugs>	 (03PS1) 10Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648)
[21:45:50] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1007.eqiad.wmnet with OS bullseye
[21:45:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye ex...
[21:46:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper)
[21:46:53] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:47:43] <wikibugs>	 10SRE, 10WMF-Annual-Report (Policy site): migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203 (10Dzahn) In T310738 there is a request to revert this and move the domain back to WMF infra.
[21:48:24] <wikibugs>	 (03PS2) 10Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648)
[21:48:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper)
[21:50:32] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:51:37] <wikibugs>	 (03PS1) 10Ahmon Dancy: safe-service-restart.py: Ensure 'status' always has a value [puppet] - 10https://gerrit.wikimedia.org/r/807624 (https://phabricator.wikimedia.org/T311182)
[21:56:49] <wikibugs>	 (03PS3) 10Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648)
[21:57:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper)
[22:00:16] <wikibugs>	 (03PS4) 10Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648)
[22:00:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper)
[22:02:33] <wikibugs>	 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (10Dzahn) fyi: The design document isn't accessible and from the tickets alone it's unclear what this is ab...
[22:09:53] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (3) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:16:17] <wikibugs>	 (03PS5) 10Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648)
[22:17:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper)
[22:21:09] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:21:16] <wikibugs>	 (03PS6) 10Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648)
[22:22:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper)
[22:23:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10Dzahn) Before we talk about technical implementation and putting this on ice. I am wondering..has anyone even had specific concerns or data fields in mind that sh...
[22:27:58] <wikibugs>	 (03PS1) 10Ryan Kemper: elastic: add fake elasticsearch.keystore [labs/private] - 10https://gerrit.wikimedia.org/r/807650 (https://phabricator.wikimedia.org/T309648)
[22:28:15] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:29:21] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+1] "Pattern looks correct." [puppet] - 10https://gerrit.wikimedia.org/r/807518 (owner: 10Jelto)
[22:29:33] <wikibugs>	 (03CR) 10Bking: [C: 03+2] elastic: add fake elasticsearch.keystore [labs/private] - 10https://gerrit.wikimedia.org/r/807650 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper)
[22:29:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gitlab_runner: add docker-registry.discovery.wmnet to allowed_images [puppet] - 10https://gerrit.wikimedia.org/r/807518 (owner: 10Jelto)
[22:30:03] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2] elastic: add fake elasticsearch.keystore [labs/private] - 10https://gerrit.wikimedia.org/r/807650 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper)
[22:30:51] <wikibugs>	 (03PS7) 10Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648)
[22:31:52] <wikibugs>	 (03CR) 10Ryan Kemper: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper)
[22:33:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "we may be able to deploy this during phab maintenance window in a bit" [puppet] - 10https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[22:35:32] <wikibugs>	 (03PS8) 10Ryan Kemper: [wip] elastic: temp keystore for index restoration [puppet] - 10https://gerrit.wikimedia.org/r/807623 (https://phabricator.wikimedia.org/T309648)
[22:37:41] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:50:37] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:55:18] <jinxer-wm>	 (ProbeDown) firing: (7) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:56:15] <tzatziki>	 !log removing 1 file for legal compliance
[22:56:18] <jinxer-wm>	 (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:56:35] <jinxer-wm>	 (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[22:56:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[22:56:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:01] <icinga-wm>	 PROBLEM - Apache HTTP on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[22:57:18] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:57:21] <icinga-wm>	 PROBLEM - Apache HTTP on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[22:57:21] <icinga-wm>	 PROBLEM - Apache HTTP on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[22:57:27] <icinga-wm>	 PROBLEM - Apache HTTP on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[22:57:35] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:58:11] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.9726 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[22:58:29] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[22:59:01] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:59:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[22:59:33] <icinga-wm>	 RECOVERY - Apache HTTP on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[22:59:33] <icinga-wm>	 RECOVERY - Apache HTTP on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[22:59:41] <icinga-wm>	 RECOVERY - Apache HTTP on mw1365 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:00:18] <jinxer-wm>	 (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:00:21] <wikibugs>	 (03PS3) 10Labdajiwa: Add wordmark and tagline for jvwiki, jvwikt, and jvws [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807247 (https://phabricator.wikimedia.org/T311104)
[23:00:33] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[23:00:51] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[23:01:18] <jinxer-wm>	 (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:01:35] <jinxer-wm>	 (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[23:01:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[23:02:17] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[23:02:18] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:16:29] <wikibugs>	 10SRE, 10MediaWiki-General, 10Traffic: Advance declaration of query parameters - https://phabricator.wikimedia.org/T310087 (10ori)
[23:17:57] <wikibugs>	 10SRE, 10DNS, 10Traffic, 10WMF-Legal, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Dzahn) >>! In T310738#8018422, @Varnent wrote: > We are "closing" this site on the VIP site. So, essentially whenever we want on...
[23:22:35] <wikibugs>	 10SRE, 10MediaWiki-General, 10Traffic: Advance declaration of query parameters - https://phabricator.wikimedia.org/T310087 (10ori)
[23:22:45] <wikibugs>	 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10ori)
[23:23:05] <wikibugs>	 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) Ok now SG3 staff are telling me my ticket isn't valid for this type of thing, despite telling me on a voice call yesterday they'd place it today, and require me to raise a trouble ticket, not a remote h...
[23:27:00] <wikibugs>	 10SRE, 10MediaWiki-General, 10Traffic: Advance declaration of query parameters - https://phabricator.wikimedia.org/T310087 (10ori) Re-scoping this to be about advance declaration of query parameters, and moving discussion of parameter ordering to T302459.
[23:35:34] <wikibugs>	 (03CR) 10Brennen Bearnes: phabricator: get envoy to listen on ipv6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)