[00:08:34] (CR) BBlack: [C: +2] prometheus6001: add to global node list [puppet] - https://gerrit.wikimedia.org/r/748225 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[00:22:18] (PS1) BBlack: Add prometheus.svc.drmrs.wmnet alias [dns] - https://gerrit.wikimedia.org/r/748227 (https://phabricator.wikimedia.org/T282787)
[00:25:41] (CR) BBlack: [C: +2] Add prometheus.svc.drmrs.wmnet alias [dns] - https://gerrit.wikimedia.org/r/748227 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[00:26:07] (PS1) BBlack: Add drmrs prometheus to various global config [puppet] - https://gerrit.wikimedia.org/r/748228 (https://phabricator.wikimedia.org/T282787)
[00:28:23] (CR) BBlack: [C: +2] Add drmrs prometheus to various global config [puppet] - https://gerrit.wikimedia.org/r/748228 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[00:37:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=drmrs https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:39:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:44:43] just for the record, anything that mentions "drmrs" is non-critical if it alerts. The site isn't active.
[00:44:55] it's just hard to control for all possible spam fallouts as things are being initially configured
[00:50:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:01:16] (CR) Cwhite: [C: +2] add and enable subset filters [software/ecs] - https://gerrit.wikimedia.org/r/747641 (https://phabricator.wikimedia.org/T294581) (owner: Cwhite)
[01:01:49] (Merged) jenkins-bot: add and enable subset filters [software/ecs] - https://gerrit.wikimedia.org/r/747641 (https://phabricator.wikimedia.org/T294581) (owner: Cwhite)
[01:06:11] (PS1) Cwhite: profile: upgrade to ecs 1.11.0-2 [puppet] - https://gerrit.wikimedia.org/r/748230 (https://phabricator.wikimedia.org/T294581)
[01:17:27] SRE: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (AntiCompositeNumber) I don't believe this request meets the criteria in the [[https://foundation.wikimedia.org/wiki/Maps_Terms_of_Use#Using_maps_in_third-party_services|Maps Terms of Use]]. > Wikimedia Maps may no...
[01:18:59] (CR) Samwilson: [C: +2] Move horizontal/vertical layout to CSS only [extensions/ProofreadPage] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/748095 (https://phabricator.wikimedia.org/T297339) (owner: Inductiveload)
[01:46:12] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:04:06] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:05:08] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:43:39] SRE: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (Ed6767) Ignoring that this proposal contradicts the Maps Terms of Service at time of writing, will bbcrewind.co.uk support and benefit Wikimedia projects, other than through providing historical references? We can...
[04:02:09] (PS2) RLazarus: Add a pod_name column to ActiveContainerImage [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/747881 (https://phabricator.wikimedia.org/T287130)
[04:02:11] (PS1) RLazarus: Fix --cluster command line parsing and add tests [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/748232 (https://phabricator.wikimedia.org/T287130)
[04:04:18] (CR) jerkins-bot: [V: -1] Fix --cluster command line parsing and add tests [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/748232 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[04:05:11] (PS2) RLazarus: Fix --clusters command line parsing and add tests [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/748232 (https://phabricator.wikimedia.org/T287130)
[04:06:58] (CR) jerkins-bot: [V: -1] Fix --clusters command line parsing and add tests [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/748232 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[04:11:01] (PS3) RLazarus: Fix --clusters command line parsing and add tests [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/748232 (https://phabricator.wikimedia.org/T287130)
[05:35:56] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:28] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:15:34] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:32:28] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:16:04] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:09:03] SRE: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (LWyatt) For those commenting with concerns about 'slippery slope' and 'mission alignment' - I should clarify some context here: - The Maps API //used// to be available for anyone to use for any purpose, but was r...
[13:17:10] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:57:41] !log restarting blazegraph on wdqs1013 (jvm stuck for 10hours)
[13:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:58] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is CRITICAL: cpu={1,11,13,15,3,5,7,9} https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops
[15:39:14] RECOVERY - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is OK: All metrics within thresholds. https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops
[17:23:14] PROBLEM - SSH on rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:24:14] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:44:03] (PS1) Zabe: Add towiki.ru to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/748305 (https://phabricator.wikimedia.org/T294190)