[00:00:09] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 24.88 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:01:17] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 57.53 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:02:07] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 102.1 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:02:31] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:02:37] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[00:03:39] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:07:17] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:35] RECOVERY - Disk space on ml-etcd2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ml-etcd2002&var-datasource=codfw+prometheus/ops
[00:29:21] RECOVERY - SSH on ms-fe2008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:44:45] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:51:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:08:19] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:10:53] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:53] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:24:08] (PS3) Legoktm: extdist: migrate crons to systemd timers [puppet] - https://gerrit.wikimedia.org/r/751173 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:25:54] (CR) Legoktm: [C: +2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/751173 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:32:56] (CR) Legoktm: "I watched the skindist run, it worked great! This will also let us simplify the extdist code since we no longer need to check for duplicat" [puppet] - https://gerrit.wikimedia.org/r/751173 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:33:10] (PS2) Legoktm: extdist: remove absented crons [puppet] - https://gerrit.wikimedia.org/r/751174 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:34:06] (CR) Legoktm: [C: +2] extdist: remove absented crons [puppet] - https://gerrit.wikimedia.org/r/751174 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[07:44:21] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:46:43] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220116T0800)
[08:13:22] (PS1) Giuseppe Lavagetto: cpjobqueue: double the concurrencty for recordlintjob [deployment-charts] - https://gerrit.wikimedia.org/r/754096 (https://phabricator.wikimedia.org/T299302)
[08:16:05] (CR) Giuseppe Lavagetto: [V: +2 C: +2] cpjobqueue: double the concurrencty for recordlintjob [deployment-charts] - https://gerrit.wikimedia.org/r/754096 (https://phabricator.wikimedia.org/T299302) (owner: Giuseppe Lavagetto)
[08:17:47] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply on production
[08:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:51] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply on staging
[08:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:15] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync on production
[08:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:54] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply on production
[08:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:57] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply on staging
[08:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:17] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync on production
[08:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:57] thanks _joe_
[08:30:40] <_joe_> legoktm: thank *you*
[08:56:47] (PS1) Amire80: Remove kea, nod, and sms from wmgExtraLanguageNames [mediawiki-config] - https://gerrit.wikimedia.org/r/754097 (https://phabricator.wikimedia.org/T299304)
[10:09:37] PROBLEM - Disk space on elastic2035 is CRITICAL: DISK CRITICAL - free space: / 717 MB (2% inode=94%): /tmp 717 MB (2% inode=94%): /var/tmp 717 MB (2% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops
[11:33:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[11:38:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[12:24:39] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 196 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:26:55] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 21 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:44:51] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:46:05] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:36:41] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:45:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:52:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_webrequest_partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:11:17] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:15:59] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (Stanglavine) >>! In T244567#7395913, @St...
[17:18:07] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4516 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:38:14] <_joe_> uhm the graphs seem to indicate a recovery happened, actually
[17:39:11] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:39:36] <_joe_> someone's been hammering the API, but I'd be inclined to let it go
[18:05:19] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[18:39:07] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[18:45:53] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4032 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[18:50:27] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[18:50:49] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:18:57] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.08065 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:48:33] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:49:47] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:53:13] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:34:49] (PS1) JHathaway: format yaml with vinyl [puppet] - https://gerrit.wikimedia.org/r/754114
[22:54:31] (PS1) JHathaway: vinyl rake task [puppet] - https://gerrit.wikimedia.org/r/754116
[22:55:20] (CR) jerkins-bot: [V: -1] vinyl rake task [puppet] - https://gerrit.wikimedia.org/r/754116 (owner: JHathaway)
[22:56:54] (PS2) JHathaway: vinyl rake task [puppet] - https://gerrit.wikimedia.org/r/754116
[23:14:05] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 237, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:16:29] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down