[00:06:28] PROBLEM - Check systemd state on kafkamon1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:30] PROBLEM - Check systemd state on pc2012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:32] PROBLEM - Check systemd state on ganeti2009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:06] RECOVERY - Check systemd state on kafkamon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:42] RECOVERY - Disk space on maps2006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2006&var-datasource=codfw+prometheus/ops
[00:19:48] PROBLEM - Check systemd state on kafkamon1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:10] RECOVERY - Check systemd state on kafkamon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:14] RECOVERY - Check systemd state on pc2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:16] RECOVERY - Check systemd state on ganeti2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:00] PROBLEM - snapshot of s6 in codfw on alert1001 is CRITICAL: snapshot for s6 at codfw taken more than 3 days ago: Most recent backup 2021-09-09 00:39:32 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[01:26:02] PROBLEM - Check systemd state on cp5010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:51:02] RECOVERY - Check systemd state on cp5010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:23:34] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 115.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[03:08:46] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[03:12:34] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 15 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[03:15:22] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 40.68 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[04:30:40] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 105.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[04:35:16] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 127.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[04:52:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[04:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[04:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:16] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: an-web1001, labstore1006, mw2254, ms-be1051, ms-be1062 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[05:25:08] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[06:08:08] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 58.98 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[06:43:14] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 692.5 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210912T0700)
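The recurring "Check systemd state" criticals above fire whenever systemd reports the host as degraded, here because prometheus_puppet_agent_stats.service has failed; they recover once the unit is reset or succeeds on its next run. A minimal sketch of that kind of probe, written against plain systemctl output (this is not the actual plugin behind check_systemd_state, and the Nagios exit-code handling is an assumption):

#!/usr/bin/env python3
# Rough sketch of a "check systemd state" style probe (not the WMF plugin itself).
# Exit codes follow Nagios conventions: 0 = OK, 2 = CRITICAL.
import subprocess
import sys

def main() -> int:
    # `systemctl is-system-running` prints e.g. "running" or "degraded".
    state = subprocess.run(
        ["systemctl", "is-system-running"],
        capture_output=True, text=True,
    ).stdout.strip()

    if state == "running":
        print("OK - running: The system is fully operational")
        return 0

    # Name the failed units so the alert points at the culprit
    # (e.g. prometheus_puppet_agent_stats.service above).
    failed = subprocess.run(
        ["systemctl", "--failed", "--no-legend", "--plain"],
        capture_output=True, text=True,
    ).stdout.splitlines()
    units = [line.split()[0] for line in failed if line.strip()]

    print(f"CRITICAL - {state}: The following units failed: {', '.join(units)}")
    return 2

if __name__ == "__main__":
    sys.exit(main())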
[07:25:58] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:44:30] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 69.15 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[07:48:52] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:52:42] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:59:22] PROBLEM - Too high an incoming rate of browser-reported Network Error Logging events on alert1001 is CRITICAL: type=tcp.timed_out https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28
[08:00:59] SRE, MW-on-K8s, serviceops, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki)
[08:01:16] RECOVERY - Too high an incoming rate of browser-reported Network Error Logging events on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28
[08:16:58] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 120 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[08:25:10] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[08:38:32] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[08:45:06] PROBLEM - Check systemd state on db1109 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. An error occured trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:47:34] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 124.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[09:08:00] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 71.19 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[09:12:54] RECOVERY - Check systemd state on db1109 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:35:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
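The "Rate of JVM GC Old generation-s runs" alerts above and below encode their thresholds directly in the message: a critical like "115.9 gt 100" means the observed old-generation GC run rate exceeded the critical bound of 100, and a recovery like "(C)100 gt (W)80 gt 40.68" shows the value back under both the warning (80) and critical (100) bounds. A small sketch of that comparison, with the thresholds read off the log and all of the surrounding check machinery assumed:

# Sketch of the "(C)100 gt (W)80 gt value" comparison seen in the GC alerts.
# Thresholds (warn=80, crit=100 old-gen GC runs per hour) are read off the log;
# how the value is actually fetched (Prometheus vs. Graphite) is not shown here.
OK, WARNING, CRITICAL = 0, 1, 2

def evaluate(value: float, warn: float = 80.0, crit: float = 100.0):
    """Return (nagios_state, message) for an upper-bound 'gt' threshold pair."""
    if value > crit:
        return CRITICAL, f"{value:g} gt {crit:g}"
    if value > warn:
        return WARNING, f"{value:g} gt {warn:g}"
    return OK, f"(C){crit:g} gt (W){warn:g} gt {value:g}"

# Examples matching messages in the log above:
assert evaluate(115.9) == (CRITICAL, "115.9 gt 100")
assert evaluate(40.68) == (OK, "(C)100 gt (W)80 gt 40.68")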
[11:06:28] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[11:08:18] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 10 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[11:33:14] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 304.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[12:24:12] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:24:12] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:26:42] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 38.64 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[12:53:30] PROBLEM - Check systemd state on cp5005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:06:26] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 138.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[13:21:10] RECOVERY - Check systemd state on cp5005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:52:08] PROBLEM - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1264123 MB (15% inode=78%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1
[13:53:08] ACKNOWLEDGEMENT - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1264123 MB (15% inode=78%): andrew bogott I will find something to delete https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1
[14:01:06] PROBLEM - Check systemd state on cp5011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:12:02] RECOVERY - Check systemd state on cp5011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:17:34] PROBLEM - Check systemd state on cp5011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:25:42] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:28:26] RECOVERY - Check systemd state on cp5011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:30] RECOVERY - snapshot of s6 in codfw on alert1001 is OK: Last snapshot for s6 at codfw (db2141.codfw.wmnet:3316) taken on 2021-09-12 13:18:55 (587 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[15:17:20] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[17:15:51] (PS1) Bstorm: labstore: fix old scratch mountpoint permissions [puppet] - https://gerrit.wikimedia.org/r/720477 (https://phabricator.wikimedia.org/T290825)
[17:18:16] (CR) Bstorm: [C: +2] labstore: fix old scratch mountpoint permissions [puppet] - https://gerrit.wikimedia.org/r/720477 (https://phabricator.wikimedia.org/T290825) (owner: Bstorm)
[17:50:50] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 138.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[18:13:01] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_upload layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[18:13:04] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp3061 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/Varnish
[18:13:04] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp3057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:13:04] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp3057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:13:18] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp3061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:13:18] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp3057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:13:52] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp3057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:13:54] * legoktm looks
[18:14:10] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp3061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
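The flood of "Varnish HTTP upload-frontend - port 31xx" criticals that starts here is the per-port HTTP check timing out against the esams cache_upload frontends (ports 3120-3127 on cp3055, cp3057, cp3061, cp3063, cp3065), which is also why PyBal marks those backends down shortly afterwards. A toy stand-in for that probe, assuming a plain GET with the same 10-second timeout the alerts mention; the real check is Icinga's check_http with its own arguments, not this script:

#!/usr/bin/env python3
# Toy stand-in for the per-port Varnish frontend HTTP probe (10 s timeout).
# Host and port range are taken from the alerts above; the request path and
# the notion of "OK = any status below 400" are assumptions.
import http.client

def probe(host: str, port: int, timeout: float = 10.0) -> str:
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", "/")
        status = conn.getresponse().status
        conn.close()
        return f"HTTP OK: {status}" if status < 400 else f"HTTP CRITICAL: {status}"
    except TimeoutError:
        return "CRITICAL - Socket timeout"
    except OSError as exc:
        return f"CRITICAL - {exc}"

if __name__ == "__main__":
    for port in range(3120, 3128):  # the eight frontend ports in the alerts
        print(port, probe("cp3057.esams.wmnet", port))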
[18:14:10] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp3055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:14:12] upload. indeed seems down for me
[18:14:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_maps_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:14:24] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp3061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:14:24] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp3057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:14:31] looking
[18:14:56] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp3063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:14:58] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp3055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:15:02] PROBLEM - PyBal backends health check on lvs3007 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp3055.esams.wmnet, cp3065.esams.wmnet, cp3063.esams.wmnet are marked down but pooled: uploadlb_443: Servers cp3055.esams.wmnet, cp3065.esams.wmnet, cp3063.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:15:02] PROBLEM - PyBal backends health check on lvs3006 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp3057.esams.wmnet, cp3063.esams.wmnet, cp3065.esams.wmnet, cp3061.esams.wmnet are marked down but pooled: uploadlb_443: Servers cp3055.esams.wmnet, cp3063.esams.wmnet, cp3065.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:15:08] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp3063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:15:26] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp3057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:15:28] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp3055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:15:28] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp3063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:15:36] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp3055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:15:36] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp3061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:15:44] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp3055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:15:56] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp3057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:15:56] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp3063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:15:56] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp3061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:15:56] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp3065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:15:56] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp3057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:16:20] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp3055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:16:22] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp3065 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/Varnish
[18:16:22] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp3065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:16:22] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp3065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:16:22] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp3063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:16:40] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp3061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:16:42] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp3061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:16:56] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp3063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:16:58] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp3055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:17:08] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp3065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:17:38] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp3055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:17:46] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp3063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:17:46] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp3065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:17:46] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp3065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:18:16] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp3063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:18:20] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp3065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[18:26:05] Hmm is that why I just failed to load a thumbnail?
(I thought it was packet loss on my side)
[18:26:16] !log restart varnish on cp3057
[18:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:22] Nemo_bis: probably :) we're working on it
[18:26:25] Nemo_bis, ongoing issue, known
[18:26:34] thanks
[18:26:53] Nemo_bis: you should anyways curse your ISP; they probably deserve it
[18:27:08] definitely
[18:27:30] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp3057 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:27:30] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp3057 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.165 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:27:42] as expected, let's see if it keeps it up
[18:28:11] marostegui: they keep you working on Sundays too? :P
[18:28:30] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp3057 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:28:30] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp3057 is OK: HTTP OK: HTTP/1.1 200 OK - 471 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:28:46] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp3057 is OK: HTTP OK: HTTP/1.1 200 OK - 471 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:29:13] vgutierrez: should we try another server?
[18:29:18] yup
[18:29:20] going with cp3055
[18:29:22] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp3057 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:29:52] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp3057 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:29:53] !log restart varnish on cp3055
[18:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:56] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp3057 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:30:58] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp3055 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:31:04] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp3055 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:31:14] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp3055 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:31:46] still getting a lot of 502s thjough
[18:31:48] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp3055 is OK: HTTP OK: HTTP/1.1 200 OK - 471 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:32:24] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp3055 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:33:06] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp3055 is OK: HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:33:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:33:32] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp3055 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.308 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:33:58] !log restart varnish-fe on cp3061, cp3063 and cp3065
[18:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:20] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp3055 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:34:32] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp3063 is OK: HTTP OK: HTTP/1.1 200 OK - 471 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:34:52] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp3063 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:35:00] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp3061 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:35:20] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp3061 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:35:20] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp3063 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:35:20] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp3065 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:35:42] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp3065 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:35:44] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp3065 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:35:44] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp3065 is OK: HTTP OK: HTTP/1.1 200 OK - 471 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:35:44] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp3063 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:36:04] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp3061 is OK: HTTP OK: HTTP/1.1 200 OK - 471 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:36:06] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp3061 is OK: HTTP OK: HTTP/1.1 200 OK - 471 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:36:08] RECOVERY - PyBal backends health check on lvs3006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:36:08] RECOVERY - PyBal backends health check on lvs3007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:36:18] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp3063 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:36:20] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp3061 is OK: HTTP OK: HTTP/1.1 200 OK - 471 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:36:30] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp3065 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:36:36] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp3061 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:37:10] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp3065 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:37:10] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp3063 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:37:10] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp3065 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:37:26] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp3061 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:37:42] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp3061 is OK: HTTP OK: HTTP/1.1 200 OK - 471 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:37:42] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp3063 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:37:44] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp3065 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:37:47] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[18:38:16] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp3063 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Varnish
[18:43:18] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 53.9 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[19:05:56] SRE-swift-storage: Media storage metadata inconsistent with Swift - https://phabricator.wikimedia.org/T289996 (Aklapper)
[21:16:08] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 213.6 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[21:16:52] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 175.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[22:07:56] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[22:29:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:30:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:46:34] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[23:25:06] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 116.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
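The elastic2044 old-generation GC alert that closes the day is the same check that has been flapping since the early morning; the linked Grafana panel plots the underlying rate. A hedged sketch of pulling that rate straight from Prometheus, assuming the stock prometheus-elasticsearch-exporter metric name and a placeholder Prometheus endpoint (both the URL and the label names would need adjusting for the real setup):

#!/usr/bin/env python3
# Sketch: query the old-gen GC run rate behind the elastic2044 alert.
# Assumes the standard prometheus-elasticsearch-exporter metric
# elasticsearch_jvm_gc_collection_seconds_count{gc="old"} and a guessed
# Prometheus API endpoint; neither is taken from the log above.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.example.org/api/v1/query"  # placeholder URL
QUERY = (
    'increase(elasticsearch_jvm_gc_collection_seconds_count'
    '{gc="old", instance=~"elastic2044.*"}[1h])'
)

def gc_runs_per_hour() -> float:
    url = PROMETHEUS + "?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    results = data["data"]["result"]
    # One instant sample per matching series; the alert fires above 100/h.
    # Raises ValueError if no series match the query.
    return max(float(value) for _, value in (r["value"] for r in results))

if __name__ == "__main__":
    print(f"old-gen GC runs over the last hour: {gc_runs_per_hour():.1f}")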