[00:01:33] <icinga-wm>	 PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: / 2106 MB (3% inode=97%): /tmp 2106 MB (3% inode=97%): /var/tmp 2106 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[00:01:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:05:09] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:31] <icinga-wm>	 RECOVERY - Disk space on thanos-be2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[00:26:35] <icinga-wm>	 RECOVERY - SSH on an-coord1002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:40:13] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:22:53] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:30:07] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:31:17] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:38:46] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:46] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:04:33] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:46] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:09] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:18:46] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:41] <icinga-wm>	 PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:25:01] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:21:35] <icinga-wm>	 RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:17:23] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[04:19:23] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:40:13] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:46:47] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 121 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:41:57] <icinga-wm>	 PROBLEM - Check systemd state on cp2031 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:43:57] <icinga-wm>	 PROBLEM - Check systemd state on cp2035 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:49] <icinga-wm>	 RECOVERY - Check systemd state on cp2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:54:11] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 133 probes of 687 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:59:45] <icinga-wm>	 RECOVERY - Check systemd state on cp2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:11] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 42 probes of 687 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:17:09] <icinga-wm>	 PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:48:41] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 122 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:26:27] <icinga-wm>	 PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:40:13] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:31:22] <elukey>	 !log restart kube-apiserver on ml-staging-ctrl2002
[09:31:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:52] <elukey>	 !log restart kube-apiserver on ml-staging-ctrl2001
[09:32:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:20] <elukey>	 !log reinstall kubernetes-node on ml-staging200[12] to allow puppet to run (cleanup after yesterday issue, worker nodes had master role applied)
[09:39:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:20:07] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:28:17] <icinga-wm>	 RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:36:26] <wikibugs>	 (03PS2) 10Daniel Kinzler: Set VisualEditorDefaultParsoidClient for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845058 (https://phabricator.wikimedia.org/T320531)
[10:43:10] <wikibugs>	 (03PS3) 10Daniel Kinzler: Set VisualEditorDefaultParsoidClient for dewiki-beta mad testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845058 (https://phabricator.wikimedia.org/T320531)
[12:08:57] <icinga-wm>	 PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:22:03] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:55:53] <logmsgbot>	 !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@c849762]: (no justification provided)
[12:56:43] <logmsgbot>	 !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@c849762]: (no justification provided) (duration: 00m 49s)
[12:56:59] <wikibugs>	 (03PS4) 10Daniel Kinzler: Set VisualEditorDefaultParsoidClient for dewiki-beta and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845058 (https://phabricator.wikimedia.org/T320531)
[13:05:43] <wikibugs>	 (03CR) 10LSobanski: [C: 03+1] roles: add/update role contacts for aphlict,miscweb,planet,rt [puppet] - 10https://gerrit.wikimedia.org/r/853454 (owner: 10Dzahn)
[13:06:25] <wikibugs>	 (03CR) 10LSobanski: [C: 03+1] "But let's still discuss :)" [puppet] - 10https://gerrit.wikimedia.org/r/853454 (owner: 10Dzahn)
[13:09:47] <icinga-wm>	 RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:23:01] <icinga-wm>	 PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:24:51] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:46:41] <icinga-wm>	 PROBLEM - SSH on mw1330.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:58:11] <wikibugs>	 (03PS1) 10Vgutierrez: swift: Ramp up ms-be05 draining [puppet] - 10https://gerrit.wikimedia.org/r/853475 (https://phabricator.wikimedia.org/T322231)
[16:04:42] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] swift: Ramp up ms-be05 draining [puppet] - 10https://gerrit.wikimedia.org/r/853475 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez)
[16:47:31] <icinga-wm>	 RECOVERY - SSH on mw1330.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:07:33] <icinga-wm>	 PROBLEM - SSH on mw1329.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:26:49] <icinga-wm>	 PROBLEM - SSH on mw1309.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:08:23] <icinga-wm>	 RECOVERY - SSH on mw1329.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:17:15] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] Do not override wgGEMentorshipUseIsActiveFlag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853482 (https://phabricator.wikimedia.org/T318457)
[18:17:34] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-2] "column population logic needs to be verified first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853482 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm)
[18:27:39] <icinga-wm>	 RECOVERY - SSH on mw1309.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:05:47] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[19:07:41] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[19:28:49] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:34:06] <wikibugs>	 (03PS1) 10Urbanecm: Calculate mentorship-related metrics [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/853440 (https://phabricator.wikimedia.org/T318684)
[20:34:36] <wikibugs>	 (03PS3) 10Urbanecm: growthexperiments.pp: Run updateMetrics.php daily [puppet] - 10https://gerrit.wikimedia.org/r/853265 (https://phabricator.wikimedia.org/T318684)
[20:35:10] <wikibugs>	 (03PS4) 10Urbanecm: growthexperiments.pp: Run updateMetrics.php daily [puppet] - 10https://gerrit.wikimedia.org/r/853265 (https://phabricator.wikimedia.org/T318684)
[20:35:22] <wikibugs>	 (03CR) 10Urbanecm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/853265 (https://phabricator.wikimedia.org/T318684) (owner: 10Urbanecm)
[21:07:11] <icinga-wm>	 PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:08:07] <icinga-wm>	 RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:25:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:30:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:32:39] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[23:33:47] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[23:34:33] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[23:35:41] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29