[00:08:45] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[00:09:08] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 22s)
[00:11:09] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:41] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:24:11] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:25:14] (PS1) Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830955 (https://phabricator.wikimedia.org/T315261)
[00:25:24] (CR) Jdlrobson: [C: -1] EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830955 (https://phabricator.wikimedia.org/T315261) (owner: Jdlrobson)
[00:29:43] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:38:01] (PS1) RLazarus: pcc: Warn before compiling all nodes by default [puppet] - https://gerrit.wikimedia.org/r/830957 (https://phabricator.wikimedia.org/T222075)
[00:45:03] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:56:53] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:01:39] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:04:33] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:13:53] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (phaultfinder)
[01:15:08] (CR) Dzahn: "I like the idea and thank you for it, but can it be interactive like that when started from the web interface?" [puppet] - https://gerrit.wikimedia.org/r/830957 (https://phabricator.wikimedia.org/T222075) (owner: RLazarus)
[01:17:39] (CR) RLazarus: pcc: Warn before compiling all nodes by default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830957 (https://phabricator.wikimedia.org/T222075) (owner: RLazarus)
[01:21:15] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:23:39] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:26:37] (CR) Dzahn: [C: +1] "ACK. (another repo). tested this, the confirmation y/n works, message makes sense" [puppet] - https://gerrit.wikimedia.org/r/830957 (https://phabricator.wikimedia.org/T222075) (owner: RLazarus)
[01:26:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:53] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:30:14] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:36:45] (JobUnavailable) firing: (2) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:44:49] RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:04:55] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:49] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:18:03] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:28:55] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:36:57] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:46:27] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:03:23] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:10:33] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:13:13] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:22:47] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:31:41] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:48:17] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:51:05] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:51:15] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:53:39] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:54:43] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:54:51] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:00:17] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:02:37] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:09:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:43:47] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:53:19] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:01:39] (PS1) Ebernhardson: Disable CirrusSearch completion suggester [mediawiki-config] - https://gerrit.wikimedia.org/r/830962
[05:02:20] There is an autocomplete problem across many wikis, going to deploy a small config patch to switch from the completion suggester, which isn't working 100%, to the prefix search which should be fine
[05:02:46] (PS2) Ebernhardson: Disable CirrusSearch completion suggester [mediawiki-config] - https://gerrit.wikimedia.org/r/830962
[05:03:01] (CR) Ebernhardson: [C: +2] Disable CirrusSearch completion suggester [mediawiki-config] - https://gerrit.wikimedia.org/r/830962 (owner: Ebernhardson)
[05:03:54] (Merged) jenkins-bot: Disable CirrusSearch completion suggester [mediawiki-config] - https://gerrit.wikimedia.org/r/830962 (owner: Ebernhardson)
[05:05:13] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:10:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[05:11:09] <_joe_> ebernhardson: I'm around in case :)
[05:11:12] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: cirrus: Switch all wikis from completion suggester to prefix search, yesterday's completion index builds in codfw weren't all successful and users are getting incomplete results (duration: 04m 01s)
[05:11:23] <_joe_> anything I need to check?
[05:12:25] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:18:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[05:18:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[05:19:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[05:36:23] (PS1) Marostegui: install_server: Do not reimage db1203 [puppet] - https://gerrit.wikimedia.org/r/830963
[05:37:50] (CR) Marostegui: [C: +2] install_server: Do not reimage db1203 [puppet] - https://gerrit.wikimedia.org/r/830963 (owner: Marostegui)
[05:40:27] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:41:29] !log dbmaint s4 testcommonswiki eqiad T317349
[05:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:33] T317349: Add primary key and drop unique index on wb_id_counters on wmf wikis - https://phabricator.wikimedia.org/T317349
[05:42:12] !log dbmaint s4 commonswiki eqiad T317349
[05:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:47] !log dbmaint s3 testwikidatawiki eqiad T317349
[05:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:53] !log dbmaint s8 wikidatawiki eqiad T317349
[05:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:11] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:48:13] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:51:15] _joe_: no all is fine, it's a compatibility issue between mediawiki versions and elasticsearch versions, today's rebuilds should be fine and the patch can be reverted in a few hours when the cronjobs on mwmaint1002 are done
[05:51:47] (i'm assuming someone on my time will follow up, dcausse knows these bits. otherwise i can tomorrow morning)
[05:51:51] s/my time/my team/
[05:52:59] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:13:14] (CR) Ayounsi: [C: +2] Revert "Exclude cloud-eqiad prefix from MXs trusted networks" [puppet] - https://gerrit.wikimedia.org/r/830819 (owner: Ayounsi)
[06:16:49] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:17:45] SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (ayounsi) Open→Resolved a:ayounsi This is all done. Please re-open if there are any issues.
[06:33:49] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:38:17] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:40:16] SRE, ops-eqiad, Infrastructure-Foundations, netops, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (ayounsi)
[06:40:39] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:51:06] (PS13) Ayounsi: sre.network.peering: initial commit [cookbooks] - https://gerrit.wikimedia.org/r/816730
[06:51:08] (CR) Ayounsi: sre.network.peering: initial commit (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/816730 (owner: Ayounsi)
[06:51:18] (PS14) Ayounsi: sre.network.peering: initial commit [cookbooks] - https://gerrit.wikimedia.org/r/816730
[06:54:55] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220909T0700)
[07:05:17] (PS1) Elukey: Revert "Add a kublet node_label to each master of the dse-k8s cluster" [puppet] - https://gerrit.wikimedia.org/r/830967
[07:06:53] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:08:04] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/812239 (owner: Muehlenhoff)
[07:08:06] (CR) Elukey: "--node-labels in the 'kubernetes.io' namespace must begin with an allowed prefix (kubelet.kubernetes.io, node.kubernetes.io) or be in the " [puppet] - https://gerrit.wikimedia.org/r/830967 (owner: Elukey)
[07:10:03] (CR) Filippo Giunchedi: [C: +1] Add the prometheus config to enable scraping from the dse-k8s cluster [puppet] - https://gerrit.wikimedia.org/r/830897 (https://phabricator.wikimedia.org/T310179) (owner: Btullis)
[07:10:13] (CR) Elukey: [C: +2] Revert "Add a kublet node_label to each master of the dse-k8s cluster" [puppet] - https://gerrit.wikimedia.org/r/830967 (owner: Elukey)
[07:10:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet
[07:11:49] (CR) Filippo Giunchedi: [C: +1] "Nice! LGTM" [puppet] - https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[07:12:11] RECOVERY - Check systemd state on dse-k8s-ctrl1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:12:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1137 for upgrade', diff saved to https://phabricator.wikimedia.org/P34318 and previous config saved to /var/cache/conftool/dbconfig/20220909-071255-root.json
[07:13:15] RECOVERY - Check systemd state on dse-k8s-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:03] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:14:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet
[07:16:27] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:18:23] (PS1) Marostegui: db1137: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830965
[07:19:33] (CR) Marostegui: [C: +2] db1137: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830965 (owner: Marostegui)
[07:22:49] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:33:42] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:40:13] (PS2) Majavah: P:wmcs::novaproxy: add prometheus nginx exporter [puppet] - https://gerrit.wikimedia.org/r/830928
[07:40:55] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37181/console" [puppet] - https://gerrit.wikimedia.org/r/830928 (owner: Majavah)
[07:42:13] SRE, SRE-swift-storage, Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (fgiunchedi) I don't have the time/bandwidth to followup with a decent incident report, though yes tl;dr is that reverting https://gerrit.wikimedia.org/r/c/oper...
[07:42:50] SRE, Data Engineering Planning, Foundational Technology Requests, User-fgiunchedi: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (fgiunchedi)
[07:46:45] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[07:46:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[07:47:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[07:47:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T312863)', diff saved to https://phabricator.wikimedia.org/P34319 and previous config saved to /var/cache/conftool/dbconfig/20220909-074710-ladsgroup.json
[07:47:13] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[07:49:54] (PS1) Muehlenhoff: More RAID cleanups [puppet] - https://gerrit.wikimedia.org/r/830987
[07:50:56] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:52:08] (CR) Muehlenhoff: [C: +2] More RAID cleanups [puppet] - https://gerrit.wikimedia.org/r/830987 (owner: Muehlenhoff)
[07:52:32] (CR) Filippo Giunchedi: [C: +1] logstash: reduce webrequest retention to 31 days (1 comment) [puppet] - https://gerrit.wikimedia.org/r/821324 (https://phabricator.wikimedia.org/T313099) (owner: Cwhite)
[07:52:48] (CR) Filippo Giunchedi: [C: +1] logstash: reduce replica count to 1 after 1 day (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830690 (https://phabricator.wikimedia.org/T313099) (owner: Cwhite)
[07:55:10] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:56:38] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:01:58] SRE, Observability-Alerting: Export and share alerts data - https://phabricator.wikimedia.org/T317393 (fgiunchedi)
[08:03:46] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (38) node(s) change every puppet run: an-launcher1002, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephosd1031, cloudcephosd1033, cloudcephosd1034, clouddumps1001, clouddumps1002, labstore1006, labstore1007, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010,
[08:03:46] 1, ms-fe2012, releases1002, releases2002, stat1004, stat1005, stat1006, stat1007, stat1008, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[08:05:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[08:05:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[08:05:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:05:52] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:06:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:06:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T312863)', diff saved to https://phabricator.wikimedia.org/P34320 and previous config saved to /var/cache/conftool/dbconfig/20220909-080609-ladsgroup.json
[08:06:12] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[08:06:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[08:06:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[08:07:08] PROBLEM - cassandra-b CQL 10.64.0.149:9042 on restbase1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[08:07:40] PROBLEM - cassandra-a CQL 10.64.0.148:9042 on restbase1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[08:07:46] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:08:16] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:08:30] PROBLEM - cassandra-c CQL 10.64.0.150:9042 on restbase1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[08:09:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:09:50] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:10:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[08:10:16] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:10:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[08:10:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:10:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:10:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T314041)', diff saved to https://phabricator.wikimedia.org/P34321 and previous config saved to /var/cache/conftool/dbconfig/20220909-081042-ladsgroup.json
[08:10:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance
[08:10:46] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[08:10:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance
[08:11:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T314041)', diff saved to https://phabricator.wikimedia.org/P34322 and previous config saved to /var/cache/conftool/dbconfig/20220909-081103-ladsgroup.json
[08:11:04] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (38) node(s) change every puppet run: an-launcher1002, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephosd1031, cloudcephosd1033, cloudcephosd1034, clouddumps1001, clouddumps1002, labstore1006, labstore1007, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010,
[08:11:04] 1, ms-fe2012, releases1002, releases2002, stat1004, stat1005, stat1006, stat1007, stat1008, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[08:11:39] (PS1) Filippo Giunchedi: prometheus: deploy provision-fs.sh [puppet] - https://gerrit.wikimedia.org/r/831030
[08:12:40] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[08:13:17] (03CR) 10David Caro: [C: 03+2] P:wmcs::novaproxy: add prometheus nginx exporter [puppet] - 10https://gerrit.wikimedia.org/r/830928 (owner: 10Majavah) [08:16:09] !log restarting on blazegraph on wdqs2002 (BlazegraphFreeAllocatorsDecreasingRapidly) [08:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:32:24] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:32:31] !log rebuilding all completion indices in elastic@codfw [08:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:39] (03PS6) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) [08:32:41] (03PS4) 10David Caro: p::wmcs:prometheus: Add cloudvps federation job [puppet] - 10https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) [08:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:37:53] !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp[1029-1033].eqiad.wmnet [08:38:39] (03CR) 10Majavah: [C: 04-1] p::metricsinfra:haproxy: Allow exposing federation endpoints (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: 10David Caro) [08:40:08] PROBLEM - Restbase root url on restbase1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/RESTBase [08:43:08] PROBLEM - SSH on restbase1021 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:43:28] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:43:46] (03PS1) 10Muehlenhoff: ntp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831034 (https://phabricator.wikimedia.org/T308013) [08:44:43] (03PS7) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) [08:44:45] (03CR) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: 10David Caro) [08:44:47] (03PS5) 10David Caro: p::wmcs:prometheus: Add cloudvps federation job [puppet] - 10https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) [08:45:24] (03PS1) 10Muehlenhoff: confd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831035 (https://phabricator.wikimedia.org/T308013) [08:46:07] (03CR) 10CI reject: [V: 04-1] confd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831035 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:47:36] (03PS1) 10David Caro: p::metricsinfra:haproxy: rename some vars to reflect intent [puppet] - 10https://gerrit.wikimedia.org/r/831036 [08:47:51] (03PS1) 10Muehlenhoff: java: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831037 
(https://phabricator.wikimedia.org/T308013) [08:48:04] (03PS2) 10Muehlenhoff: confd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831035 (https://phabricator.wikimedia.org/T308013) [08:48:57] (03PS1) 10Majavah: hieradata: add metricsinfra puppetmaster key to pcc facts submitters [puppet] - 10https://gerrit.wikimedia.org/r/831038 [08:50:19] (03PS1) 10Muehlenhoff: scp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831039 (https://phabricator.wikimedia.org/T308013) [08:51:58] RECOVERY - SSH on restbase1021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:54:59] (03PS2) 10Muehlenhoff: scap: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831039 (https://phabricator.wikimedia.org/T308013) [08:55:19] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (Including license change from GPL to Apache 2.0 for jheapdump which I authored)" [puppet] - 10https://gerrit.wikimedia.org/r/831037 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:56:44] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [08:58:05] (03PS1) 10Majavah: P:mariadb::cloudinfra: remove direct access from puppetmaster hosts [puppet] - 10https://gerrit.wikimedia.org/r/831041 [08:59:00] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:59:01] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wtp[1029-1033].eqiad.wmnet [09:03:05] (03PS1) 10Majavah: hieradata: split pcc entries for cloudinfra [puppet] - 10https://gerrit.wikimedia.org/r/831042 [09:07:11] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:08:53] PROBLEM - SSH on restbase1021 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:11:03] 
RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:11:39] RECOVERY - SSH on restbase1021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:13:51] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:15:13] 10SRE, 10SRE-swift-storage, 10Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (10MatthewVernon) p:05Triage→03Low [09:16:56] (03PS3) 10Muehlenhoff: scap: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831039 (https://phabricator.wikimedia.org/T308013) [09:17:19] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:40] (03PS2) 10Majavah: P:mariadb::cloudinfra: remove direct access from puppetmaster hosts [puppet] - 10https://gerrit.wikimedia.org/r/831041 [09:17:42] (03PS1) 10Majavah: hieradata: add cache_hosts for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/831044 [09:17:44] (03PS1) 10Majavah: P:mariadb::cloudinfra: add web proxy database/grants [puppet] - 10https://gerrit.wikimedia.org/r/831045 (https://phabricator.wikimedia.org/T316982) [09:18:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34323 and previous config saved to /var/cache/conftool/dbconfig/20220909-091809-root.json [09:19:03] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:20:31] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:20:47] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:21:10] (03PS1) 10Muehlenhoff: Remove obsolete owner annotations [puppet] - 10https://gerrit.wikimedia.org/r/831046 [09:22:33] (03CR) 10Vgutierrez: [C: 03+1] "overall the CR reduces complexity and I don't think it's messing with the functionality of the script (or its verbosity) and enables more " [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [09:23:54] (03CR) 10Vgutierrez: [C: 03+1] varnish: Remove extraneous checks for Docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [09:30:31] (03CR) 10Vgutierrez: "Brett, could you check this CR and rebase it on top of https://gerrit.wikimedia.org/r/c/operations/puppet/+/826367/ I think it would be in" [puppet] - 10https://gerrit.wikimedia.org/r/771863 (owner: 10Giuseppe Lavagetto) [09:32:11] (03PS1) 10Btullis: Use the dumpsgen user to mount the NFS dumps directories [puppet] - 10https://gerrit.wikimedia.org/r/831049 (https://phabricator.wikimedia.org/T317359) [09:32:25] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/831049 (https://phabricator.wikimedia.org/T317359) (owner: 10Btullis) [09:32:50] (03PS1) 10Marostegui: Revert "db1137: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830968 [09:33:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 10%: Repooling after upgrade', diff saved to 
https://phabricator.wikimedia.org/P34325 and previous config saved to /var/cache/conftool/dbconfig/20220909-093314-root.json [09:36:19] (03CR) 10Marostegui: [C: 03+2] Revert "db1137: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830968 (owner: 10Marostegui) [09:36:21] (03PS5) 10Clément Goubert: wtp: Purge wtp servers following migration to parse [puppet] - 10https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) [09:38:22] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for phedenskog - https://phabricator.wikimedia.org/T317401 (10Peter) [09:39:15] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37184/console" [puppet] - 10https://gerrit.wikimedia.org/r/831049 (https://phabricator.wikimedia.org/T317359) (owner: 10Btullis) [09:39:50] (03CR) 10Btullis: [V: 03+1 C: 03+2] Use the dumpsgen user to mount the NFS dumps directories [puppet] - 10https://gerrit.wikimedia.org/r/831049 (https://phabricator.wikimedia.org/T317359) (owner: 10Btullis) [09:39:59] (03CR) 10Clément Goubert: [C: 03+2] wtp: Purge wtp servers following migration to parse [puppet] - 10https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: 10Clément Goubert) [09:40:55] (03CR) 10Jbond: [C: 03+1] "lgtm, few nits but nothing blocking" [puppet] - 10https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) (owner: 10JHathaway) [09:42:11] (03CR) 10Btullis: [C: 03+1] "LGTM, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/831030 (owner: 10Filippo Giunchedi) [09:42:51] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:42:54] (03CR) 10Jbond: [C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/831038 (owner: 10Majavah) [09:43:38] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/831046 (owner: 10Muehlenhoff) [09:44:43] (03PS2) 10Jbond: hieradata: split pcc entries for cloudinfra [puppet] - 10https://gerrit.wikimedia.org/r/831042 (owner: 10Majavah) [09:44:55] (03CR) 10Jbond: [C: 03+2] "lgtm thanks will merge" [puppet] - 10https://gerrit.wikimedia.org/r/831042 (owner: 10Majavah) [09:45:05] (03CR) 10Jbond: [V: 03+2 C: 03+2] hieradata: split pcc entries for cloudinfra [puppet] - 10https://gerrit.wikimedia.org/r/831042 (owner: 10Majavah) [09:45:11] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:45:32] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/831035 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:45:55] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/831034 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:47:01] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/831037 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:47:35] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/831039 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:47:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/831046 (owner: 10Muehlenhoff) [09:48:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34326 and previous config saved to /var/cache/conftool/dbconfig/20220909-094819-root.json [09:51:17] (03PS7) 10Vgutierrez: Unlink certificate renewal and OCSP handling [software/acme-chief] - 10https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: 10BCornwall) [09:52:47] PROBLEM - SSH on restbase1021 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:53:34] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete owner annotations [puppet] - 10https://gerrit.wikimedia.org/r/831046 (owner: 10Muehlenhoff) [09:53:46] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided) [09:53:56] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s) [09:54:12] (03CR) 10Muehlenhoff: [C: 03+2] ntp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831034 
(https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:57:36] (03CR) 10Btullis: [C: 03+2] Add the prometheus config to enable scraping from the dse-k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/830897 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis) [09:59:54] 10ops-eqiad, 10decommission-hardware, 10serviceops, 10Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (10Clement_Goubert) [10:01:34] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34327 and previous config saved to /var/cache/conftool/dbconfig/20220909-100324-root.json [10:06:52] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:07:02] 10ops-eqiad, 10decommission-hardware, 10serviceops, 10Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (10Clement_Goubert) `wtp[1029-1033].eqiad.wmnet` didn't power off correctly. [10:08:38] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:13:36] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:15:12] (03PS1) 10Muehlenhoff: memcached: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831055 (https://phabricator.wikimedia.org/T308013) [10:15:14] (03PS1) 10Muehlenhoff: druid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831056 (https://phabricator.wikimedia.org/T308013) [10:18:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34328 and previous config saved to /var/cache/conftool/dbconfig/20220909-101830-root.json [10:19:11] (03CR) 10CI reject: [V: 04-1] memcached: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831055 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:19:36] (03PS2) 10Muehlenhoff: memcached: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831055 (https://phabricator.wikimedia.org/T308013) [10:19:40] (03CR) 10CI reject: [V: 04-1] druid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831056 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:19:50] (03PS2) 10Muehlenhoff: druid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831056 (https://phabricator.wikimedia.org/T308013) [10:20:15] (03CR) 10Vgutierrez: Unlink certificate renewal and OCSP handling (034 comments) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: 10BCornwall) [10:26:47] 10SRE, 10Observability-Alerting: Export and share alerts data - https://phabricator.wikimedia.org/T317393 (10fgiunchedi) p:05Triage→03Medium [10:26:52] (03PS1) 10Slyngshede: C:raid::perccli handle case with no virtual devices. 
[puppet] - 10https://gerrit.wikimedia.org/r/831057 (https://phabricator.wikimedia.org/T317344) [10:27:14] (03PS1) 10Majavah: dynamicproxy: simplify tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/831058 [10:27:29] (03CR) 10CI reject: [V: 04-1] C:raid::perccli handle case with no virtual devices. [puppet] - 10https://gerrit.wikimedia.org/r/831057 (https://phabricator.wikimedia.org/T317344) (owner: 10Slyngshede) [10:27:56] (03CR) 10CI reject: [V: 04-1] dynamicproxy: simplify tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/831058 (owner: 10Majavah) [10:28:41] (03PS2) 10Majavah: dynamicproxy: simplify tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/831058 [10:29:18] (03CR) 10CI reject: [V: 04-1] dynamicproxy: simplify tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/831058 (owner: 10Majavah) [10:29:48] (03PS2) 10Slyngshede: C:raid::perccli handle case with no virtual devices. [puppet] - 10https://gerrit.wikimedia.org/r/831057 (https://phabricator.wikimedia.org/T317344) [10:30:53] (03PS3) 10Majavah: dynamicproxy: simplify tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/831058 [10:31:16] (03PS4) 10Majavah: dynamicproxy: simplify tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/831058 [10:31:18] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided) [10:31:28] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s) [10:33:04] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37185/console" [puppet] - 10https://gerrit.wikimedia.org/r/831058 (owner: 10Majavah) [10:33:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34329 and previous config saved to 
/var/cache/conftool/dbconfig/20220909-103334-root.json [10:37:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/831057 (https://phabricator.wikimedia.org/T317344) (owner: 10Slyngshede) [10:38:21] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10Vgutierrez) [10:38:33] 10SRE, 10Traffic, 10Upstream: metric discrepancies between ATS 9.x and ATS 8.x - https://phabricator.wikimedia.org/T315064 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [10:39:10] 10SRE-OnFire, 10serviceops, 10Sustainability (Incident Followup): Page on etcdmirror critical status - https://phabricator.wikimedia.org/T317402 (10Clement_Goubert) [10:41:49] 10SRE-OnFire, 10serviceops, 10Sustainability (Incident Followup): Add etcdmirror status check to scap - https://phabricator.wikimedia.org/T317403 (10Clement_Goubert) [10:41:54] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: deploy provision-fs.sh [puppet] - 10https://gerrit.wikimedia.org/r/831030 (owner: 10Filippo Giunchedi) [10:42:00] (03PS2) 10Filippo Giunchedi: prometheus: deploy provision-fs.sh [puppet] - 10https://gerrit.wikimedia.org/r/831030 [10:43:55] (03PS1) 10Hashar: devtools: add keyholder agent for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/831062 (https://phabricator.wikimedia.org/T317404) [10:44:40] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:45:05] 10SRE-OnFire, 10serviceops, 10Sustainability (Incident Followup): Add failure rate triggered rollback to scap - https://phabricator.wikimedia.org/T317405 (10Clement_Goubert) [10:45:37] (03PS1) 10Hashar: Add deployment configuration for devtools WMCS project [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831063 
(https://phabricator.wikimedia.org/T317404) [10:46:54] (03CR) 10Hashar: "Looks like that does the right thing on deploy-1004.devtools.eqiad1.wikimedia.cloud:" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831063 (https://phabricator.wikimedia.org/T317404) (owner: 10Hashar) [10:49:54] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:51:46] (03CR) 10Hashar: [C: 04-1] "There are a lot more changes I have to do before keyholder configuration can be merged ;)" [puppet] - 10https://gerrit.wikimedia.org/r/831062 (https://phabricator.wikimedia.org/T317404) (owner: 10Hashar) [10:53:42] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:54:05] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Evaluate xbzrle and/or auto-converge in qemu - https://phabricator.wikimedia.org/T317406 (10MoritzMuehlenhoff) [10:55:44] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:01:31] (03PS1) 10Btullis: Correct a typo in the k8s-dse cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/831066 (https://phabricator.wikimedia.org/T310179) [11:02:34] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [11:02:53] (03CR) 10Majavah: [V: 03+1] Remove support for overriding LDAP client stack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah) [11:06:10] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Evaluate xbzrle and/or 
auto-converge in qemu - https://phabricator.wikimedia.org/T317406 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:06:55] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37186/console" [puppet] - 10https://gerrit.wikimedia.org/r/831066 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis) [11:08:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:08] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:10:40] (03CR) 10Vlad.shapik: [C: 03+1] "I have downloaded this patch and run the online tests. Everything has passed successfully." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830903 (owner: 10Hnowlan) [11:13:03] (03CR) 10Hashar: "The project Puppetmaster points to the WMCS puppet master. 
I am pretty sure last time I switched a puppet master to be served by itself th" [puppet] - 10https://gerrit.wikimedia.org/r/831062 (https://phabricator.wikimedia.org/T317404) (owner: 10Hashar) [11:14:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/831057 (https://phabricator.wikimedia.org/T317344) (owner: 10Slyngshede) [11:14:40] (03CR) 10BCornwall: [C: 03+2] prometheus: Parse ATS config to node exporter text [puppet] - 10https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [11:14:59] (03CR) 10BCornwall: [C: 03+2] ats: Enable node_ats_config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/830686 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [11:15:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T312863)', diff saved to https://phabricator.wikimedia.org/P34330 and previous config saved to /var/cache/conftool/dbconfig/20220909-111509-ladsgroup.json [11:15:14] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [11:16:00] (03CR) 10Jbond: [C: 03+2] devtools: add keyholder agent for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/831062 (https://phabricator.wikimedia.org/T317404) (owner: 10Hashar) [11:16:02] (03CR) 10Btullis: [V: 03+1 C: 03+2] Correct a typo in the k8s-dse cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/831066 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis) [11:16:42] btullis: fyi i merged your type change [11:17:03] jbond: Great, thanks. 
[11:17:11] np [11:18:04] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:18:52] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:20:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] pcc: Warn before compiling all nodes by default [puppet] - 10https://gerrit.wikimedia.org/r/830957 (https://phabricator.wikimedia.org/T222075) (owner: 10RLazarus) [11:23:06] PROBLEM - Check systemd state on cp1081 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:12] PROBLEM - Check systemd state on cp2040 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:14] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:23:28] PROBLEM - Check systemd state on cp2036 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:48] PROBLEM - Check systemd state on cp1087 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:52] PROBLEM - Check systemd state on cp1082 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:22] PROBLEM - Check 
systemd state on cp2031 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:00] PROBLEM - Check systemd state on cp4030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:14] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:28:20] PROBLEM - Check systemd state on cp4025 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:22] PROBLEM - Check systemd state on cp2034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:00] PROBLEM - Check systemd state on cp3061 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:13] whelp [11:29:42] PROBLEM - Check systemd state on cp2042 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:42] PROBLEM - Check systemd state on cp2030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:42] PROBLEM - Check systemd state on cp2027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:58] PROBLEM - Check systemd state on cp5003 is CRITICAL: CRITICAL - degraded: The 
following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P34331 and previous config saved to /var/cache/conftool/dbconfig/20220909-113016-ladsgroup.json [11:30:18] PROBLEM - Check systemd state on cp3051 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:34] PROBLEM - Check systemd state on cp3052 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:36] PROBLEM - Check systemd state on cp4029 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:16] PROBLEM - Check systemd state on cp5002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:32] PROBLEM - Check systemd state on cp4033 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:40] PROBLEM - Check systemd state on cp4021 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:12] PROBLEM - Check systemd state on cp4024 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:16] PROBLEM - Check systemd state on cp3054 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:16] PROBLEM - Check systemd state on cp3056 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:38] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:32:50] PROBLEM - Check systemd state on cp4035 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:20] PROBLEM - Check systemd state on cp3058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:36] PROBLEM - Check systemd state on cp2035 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:18] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] deployment-prep: Add P:beta::mediawiki_packages [puppet] - 10https://gerrit.wikimedia.org/r/830629 (https://phabricator.wikimedia.org/T317128) (owner: 10Clément Goubert) [11:34:19] I am guessing new monitoring deployment, WIP? [11:34:39] Yeah, issue found [11:34:46] Sorry for the spam [11:34:50] ._. 
[11:34:51] no problem [11:35:02] no impact alert > high impact alert :-P [11:35:08] PROBLEM - Check systemd state on cp3063 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:17] These should resolve within a few minutes [11:36:04] PROBLEM - Check systemd state on cp4023 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:04] PROBLEM - Check systemd state on cp4022 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:24] PROBLEM - Check systemd state on cp3055 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:42] PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:58] PROBLEM - Check systemd state on cp1086 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:59] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/831056 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:38:12] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/831055 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:38:58] (03PS1) 10BCornwall: ats: Use variable for ATS 8 in ATS config monitor [puppet] - 10https://gerrit.wikimedia.org/r/831073 (https://phabricator.wikimedia.org/T292815) [11:39:10] PROBLEM - Check systemd state on cp2028 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:20] PROBLEM - Check systemd state on cp1088 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:36] PROBLEM - Check systemd state on cp4036 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:40] PROBLEM - Check systemd state on cp5008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:52] PROBLEM - Check systemd state on cp1075 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:06] PROBLEM - Check systemd state on cp5005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:26] PROBLEM - Check systemd state on cp1080 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:34] (03CR) 10BCornwall: [C: 03+2] ats: Use variable for ATS 8 in ATS config monitor [puppet] - 10https://gerrit.wikimedia.org/r/831073 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [11:40:38] PROBLEM - Check systemd state on cp1083 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:38] PROBLEM - Check systemd state on cp1085 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:54] 
PROBLEM - Check systemd state on cp5013 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:38] PROBLEM - Check systemd state on cp1079 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:17] (03PS1) 10Clément Goubert: beta: don't duplicate fonts install [puppet] - 10https://gerrit.wikimedia.org/r/831076 [11:42:18] PROBLEM - Check systemd state on cp5012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:50] PROBLEM - Check systemd state on cp2041 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:00] PROBLEM - Check systemd state on cp1078 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:12] PROBLEM - Check systemd state on cp4034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:52] RECOVERY - Check systemd state on cp2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:56] 10Puppet, 10SRE, 10Infrastructure-Foundations: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293 (10MoritzMuehlenhoff) Hi Cole, I ran into this porting away things from the "raid" Puppet fact towards the new "raid_mgmt_tools" fact. All the slowness was originally caused by IPMI and th... 
[11:44:02] PROBLEM - Check systemd state on cp3060 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:12] PROBLEM - Check systemd state on cp5009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (10fnegri) @Jclark-ctr right now I cannot connect to cloudcephosd1030.mgmt.eqiad.wmnet with SSH. Icinga is also show... [11:44:40] PROBLEM - Check systemd state on cp2033 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:42] (03CR) 10Clément Goubert: "Small bugfix" [puppet] - 10https://gerrit.wikimedia.org/r/831076 (owner: 10Clément Goubert) [11:45:18] PROBLEM - Check systemd state on cp2038 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P34333 and previous config saved to /var/cache/conftool/dbconfig/20220909-114522-ladsgroup.json [11:45:52] PROBLEM - Check systemd state on cp5011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:54] PROBLEM - Check systemd state on cp4028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:04] PROBLEM - Check systemd state on 
cp1084 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:08] PROBLEM - Check systemd state on cp3050 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:47] (03CR) 10Muehlenhoff: beta: don't duplicate fonts install (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831076 (owner: 10Clément Goubert) [11:47:10] PROBLEM - Check systemd state on cp5006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:38] PROBLEM - Check systemd state on cp3062 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:40] PROBLEM - Check systemd state on cp1077 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:48] PROBLEM - Check systemd state on cp2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:48] (03PS2) 10Clément Goubert: beta: don't duplicate fonts install [puppet] - 10https://gerrit.wikimedia.org/r/831076 (https://phabricator.wikimedia.org/T317128) [11:48:29] (03CR) 10Clément Goubert: beta: don't duplicate fonts install (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831076 (https://phabricator.wikimedia.org/T317128) (owner: 10Clément Goubert) [11:48:58] RECOVERY - Check systemd state on cp1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:03] (03PS3) 10Clément Goubert: beta: don't duplicate fonts 
install [puppet] - 10https://gerrit.wikimedia.org/r/831076 (https://phabricator.wikimedia.org/T317128) [11:49:08] RECOVERY - Check systemd state on cp2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:20] RECOVERY - Check systemd state on cp2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:20] RECOVERY - Check systemd state on cp2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:44] Sorry again! [11:50:12] RECOVERY - Check systemd state on cp4023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:22] RECOVERY - Check systemd state on cp1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:34] RECOVERY - Check systemd state on cp3055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:48] RECOVERY - Check systemd state on cp3050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:34] RECOVERY - Check systemd state on cp4030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:42] RECOVERY - Check systemd state on cp1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:52] RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:54] RECOVERY - Check systemd state on cp4025 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:58] RECOVERY - Check systemd state on cp2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:20] RECOVERY - Check systemd state on cp3062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:22] RECOVERY - Check systemd state on cp1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:30] RECOVERY - Check systemd state on cp2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:18] RECOVERY - Check systemd state on cp2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:18] RECOVERY - Check systemd state on cp2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:44] RECOVERY - Check systemd state on cp4036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:12] RECOVERY - Check systemd state on cp3052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:14] RECOVERY - Check systemd state on cp4029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:14] RECOVERY - Check systemd state on cp5006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:28] RECOVERY - Check systemd state on cp1086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:32] RECOVERY - Check systemd 
state on cp1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:44] RECOVERY - Check systemd state on cp1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:58] RECOVERY - Check systemd state on cp3061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:02] RECOVERY - Check systemd state on cp5013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:22] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10cmooney) @Andrew yep that's what was needed from the zone side so looking good there. It's not actually returning any data for specific IPs in the range though.... [11:55:32] (03CR) 10Muehlenhoff: [C: 03+1] beta: don't duplicate fonts install [puppet] - 10https://gerrit.wikimedia.org/r/831076 (https://phabricator.wikimedia.org/T317128) (owner: 10Clément Goubert) [11:55:38] RECOVERY - Check systemd state on cp2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:38] RECOVERY - Check systemd state on cp2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:46] RECOVERY - Check systemd state on cp4024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:46] RECOVERY - Check systemd state on cp1088 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:50] RECOVERY - Check systemd state on cp3056 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:58] RECOVERY - Check systemd state on cp5003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:10] RECOVERY - Check systemd state on cp5008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:18] RECOVERY - Check systemd state on cp3051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:18] RECOVERY - Check systemd state on cp1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:36] RECOVERY - Check systemd state on cp5005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:58] RECOVERY - Check systemd state on cp2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:00] RECOVERY - Check systemd state on cp3058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:06] RECOVERY - Check systemd state on cp1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:08] RECOVERY - Check systemd state on cp1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:16] RECOVERY - Check systemd state on cp5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:18] RECOVERY - Check systemd state on cp4034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:28] RECOVERY - Check systemd 
state on cp4033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:38] RECOVERY - Check systemd state on cp4021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:04] RECOVERY - Check systemd state on cp1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:12] RECOVERY - Check systemd state on cp3054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:34] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:58:48] RECOVERY - Check systemd state on cp3063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:50] RECOVERY - Check systemd state on cp4035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:52] RECOVERY - Check systemd state on cp5012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:28] RECOVERY - Check systemd state on cp2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:36] RECOVERY - Check systemd state on cp2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:40] RECOVERY - Check systemd state on cp4022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:00] RECOVERY - Check systemd state on cp5011 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:02] RECOVERY - Check systemd state on cp4028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:12] RECOVERY - Check systemd state on cp1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T312863)', diff saved to https://phabricator.wikimedia.org/P34334 and previous config saved to /var/cache/conftool/dbconfig/20220909-120029-ladsgroup.json [12:00:33] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [12:00:34] RECOVERY - Check systemd state on cp3060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:44] RECOVERY - Check systemd state on cp5009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:35] (03PS1) 10Majavah: dynamicproxy: include prometheus redis exporter [puppet] - 10https://gerrit.wikimedia.org/r/831080 [12:02:16] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37187/console" [puppet] - 10https://gerrit.wikimedia.org/r/831080 (owner: 10Majavah) [12:05:40] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:05:59] (03PS1) 10Btullis: Add configuration for k8s-dse in prometheus [puppet] - 10https://gerrit.wikimedia.org/r/831081 (https://phabricator.wikimedia.org/T310179) [12:06:07] (03PS3) 10Vlad.shapik: WP: Remove division operation hack related to 
Python2 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) [12:06:48] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:07:50] (03PS2) 10Majavah: dynamicproxy: include prometheus redis exporter [puppet] - 10https://gerrit.wikimedia.org/r/831080 [12:07:58] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:08:52] 10SRE-swift-storage: swift_ring_manager should be able to rebalance rings without making other changes - https://phabricator.wikimedia.org/T317409 (10MatthewVernon) [12:09:05] 10SRE-swift-storage: swift_ring_manager should be able to rebalance rings without making other changes - https://phabricator.wikimedia.org/T317409 (10MatthewVernon) p:05Triage→03Medium [12:12:19] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37188/console" [puppet] - 10https://gerrit.wikimedia.org/r/831081 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis) [12:15:37] (03PS1) 10Jelto: gitlab: allow git user to access backup folder [puppet] - 10https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) [12:15:50] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37189/console" [puppet] - 10https://gerrit.wikimedia.org/r/831081 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis) [12:20:06] (03CR) 10JMeybohm: [C: 04-1] "Nice one! I've a couple of probably quite unstructured comments (sorry for that). 
Also I must admit that I have not rendered the template " [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:24:03] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [12:24:04] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [12:24:46] (03CR) 10JMeybohm: [C: 04-1] thumbor: new service chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:24:50] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:25:19] (03CR) 10JMeybohm: helmfile.d: add thumbor configuration (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:27:12] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:28:13] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37190/console" [puppet] - 10https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [12:29:22] (03CR) 10Jelto: [V: 03+1] "see also https://phabricator.wikimedia.org/T274463#8224580" [puppet] - 10https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [12:29:56] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:30:57] (03PS2) 10Btullis: Add configuration for k8s-dse in prometheus [puppet] - 10https://gerrit.wikimedia.org/r/831081 (https://phabricator.wikimedia.org/T310179) [12:33:01] (03CR) 10Jbond: [C: 03+1] "LGTm see nit" [puppet] - 10https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [12:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:38:04] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for phedenskog - https://phabricator.wikimedia.org/T317401 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I have investigated this with @Peter via the ldap audit logs and found the following entry, which seems to point to an error while editing records:... 
[12:54:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (10Jclark-ctr) Sorry i had left it in a screen for hardware test that is my mistake [12:58:27] (03PS1) 10Filippo Giunchedi: grafana: audit Grafana API actions [puppet] - 10https://gerrit.wikimedia.org/r/831087 [13:00:00] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:00:22] (03CR) 10Filippo Giunchedi: [C: 03+1] Add configuration for k8s-dse in prometheus [puppet] - 10https://gerrit.wikimedia.org/r/831081 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis) [13:01:47] (03CR) 10Elukey: [C: 03+1] Add configuration for k8s-dse in prometheus [puppet] - 10https://gerrit.wikimedia.org/r/831081 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis) [13:02:15] (03PS4) 10Vlad.shapik: Remove division operation hack related to Python2 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) [13:02:24] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:03:01] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[13:03:32] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:06:30] something going on with kafka [13:07:22] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:07:27] there was a sudden increase in rsyslog-notice requests to process [13:07:32] since 13:05 [13:07:37] seems gone now [13:08:09] I wonder which service as origin? [13:08:26] (03CR) 10Hashar: [C: 03+2] "Tested it and it works!" 
[software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831063 (https://phabricator.wikimedia.org/T317404) (owner: 10Hashar) [13:09:13] (03Merged) 10jenkins-bot: Add deployment configuration for devtools WMCS project [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831063 (https://phabricator.wikimedia.org/T317404) (owner: 10Hashar) [13:10:38] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:11:04] (03CR) 10Clément Goubert: [C: 03+2] beta: don't duplicate fonts install [puppet] - 10https://gerrit.wikimedia.org/r/831076 (https://phabricator.wikimedia.org/T317128) (owner: 10Clément Goubert) [13:12:54] (03PS1) 10David Caro: opensatck: remove some not needed absented resources [puppet] - 10https://gerrit.wikimedia.org/r/831089 [13:16:25] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37192/console" [puppet] - 10https://gerrit.wikimedia.org/r/831089 (owner: 10David Caro) [13:16:52] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:17:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (10fnegri) No worries, now I was able to SSH, I created a test partition /dev/sde1 and indeed `mkfs.ext4 /dev/sde1` d... [13:25:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2003:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [13:28:26] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:32:54] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T315352 (10phaultfinder) [13:33:58] !log restartin blazegraph on wdqs2003 (BlazegraphFreeAllocatorsDecreasingRapidly) [13:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:31] (03PS1) 10DCausse: Revert "Disable CirrusSearch completion suggester" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830978 [13:37:54] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:37:55] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T315352 (10phaultfinder) [13:38:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [13:38:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [13:38:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2112 (T312863)', diff saved to https://phabricator.wikimedia.org/P34336 and previous config saved to /var/cache/conftool/dbconfig/20220909-133846-ladsgroup.json [13:38:50] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [13:40:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2003:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [13:48:40] (03CR) 10Herron: [C: 03+1] pcc: Warn before compiling all nodes by default [puppet] - 10https://gerrit.wikimedia.org/r/830957 (https://phabricator.wikimedia.org/T222075) (owner: 10RLazarus) [13:49:44] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:50:37] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission elastic10[48-52].eqiad.wmnet - https://phabricator.wikimedia.org/T316728 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [13:51:40] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission 
elastic10[48-52].eqiad.wmnet - https://phabricator.wikimedia.org/T316728 (10Jclark-ctr) [13:52:29] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission elastic10[48-52].eqiad.wmnet - https://phabricator.wikimedia.org/T316728 (10Jclark-ctr) 05Open→03Resolved Removed servers from racks and ran Offline script [13:55:04] (03PS1) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) [13:56:44] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:57:28] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (10Jclark-ctr) @Jgreen where the last two steps done before handed over? [13:57:53] (03PS1) 10Ssingh: P:wikidough: update status message for service restart check [puppet] - 10https://gerrit.wikimedia.org/r/831094 [13:59:34] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37193/console" [puppet] - 10https://gerrit.wikimedia.org/r/831094 (owner: 10Ssingh) [14:00:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] "pedantic nitpick, otherwise LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [14:04:06] (03PS7) 10JMeybohm: Alert on high Kubernetes API error rate [alerts] - 10https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) [14:04:08] (03PS5) 10JMeybohm: Alert on high Kubernetes API latency [alerts] - 10https://gerrit.wikimedia.org/r/830637 (https://phabricator.wikimedia.org/T311251) [14:04:31] (03CR) 10JMeybohm: Alert on high Kubernetes API error rate (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/830624 
(https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [14:04:47] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:wikidough: update status message for service restart check [puppet] - 10https://gerrit.wikimedia.org/r/831094 (owner: 10Ssingh) [14:07:40] PROBLEM - Check systemd state on ms-be1035 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:54] (03CR) 10Herron: [C: 03+1] "Good idea!" [puppet] - 10https://gerrit.wikimedia.org/r/831087 (owner: 10Filippo Giunchedi) [14:15:50] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:16:57] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [14:17:16] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:17:57] (03PS1) 10Elukey: Move kafka on kafka-logging2001 to PKI TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/831096 (https://phabricator.wikimedia.org/T300130) [14:19:25] (03PS2) 10Samtar: CommonSettings-labs.php: Set $wgPhonosFileBackend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830657 (https://phabricator.wikimedia.org/T317195) [14:19:58] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37196/console" [puppet] - 10https://gerrit.wikimedia.org/r/831096 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [14:20:34] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:20:44] jouncebot: now [14:20:44] For the next 16 hour(s) and 39 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220909T0700) [14:22:37] thcipriani (or anyone), is there *any* chance that I could deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/830657 today? (production no-op, only modifying `wmf-config/CommonSettings-labs.php`) [14:23:54] TheresNoTime: I think it's perfectly fine to ship "labs" only patches [14:24:03] (03CR) 10DCausse: [C: 03+1] CommonSettings-labs.php: Set $wgPhonosFileBackend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830657 (https://phabricator.wikimedia.org/T317195) (owner: 10Samtar) [14:25:45] dcausse: "think" always worries me :P [14:26:35] TheresNoTime: sure :), I'll have to ship one production patch soon, happy to +2 your patch at this time :) [14:28:16] dcausse: I don't mind doing it if we're sure it's okay :) when are you planning on deploying? [14:28:34] going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/830978 it's a followup of an issue that happened yesterday (related to elastic7 upgrade) [14:28:37] TheresNoTime: now :) [14:29:05] dcausse: Okay :) I'll let you handle it, thank you! 
[14:29:25] (03CR) 10DCausse: [C: 03+2] CommonSettings-labs.php: Set $wgPhonosFileBackend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830657 (https://phabricator.wikimedia.org/T317195) (owner: 10Samtar) [14:30:20] (03Merged) 10jenkins-bot: CommonSettings-labs.php: Set $wgPhonosFileBackend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830657 (https://phabricator.wikimedia.org/T317195) (owner: 10Samtar) [14:31:12] (03CR) 10DCausse: [C: 03+2] Revert "Disable CirrusSearch completion suggester" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830978 (owner: 10DCausse) [14:32:07] (03Merged) 10jenkins-bot: Revert "Disable CirrusSearch completion suggester" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830978 (owner: 10DCausse) [14:32:32] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:33:04] testing my patch mwdebug1002 [14:33:58] (manually triggered a beta sync/scap so I can test fwiw) [14:34:26] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:34:29] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff) [14:35:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:36:17] (all looks good for me, thanks dcausse) [14:36:22] TheresNoTime: yw! 
[14:36:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:36:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:36:54] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:39:18] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:39:56] (03CR) 10JMeybohm: [C: 03+2] Alert on high lateny of kubelet operations [alerts] - 10https://gerrit.wikimedia.org/r/830228 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [14:39:59] (03CR) 10JMeybohm: [C: 03+2] Alert on high Kubernetes API error rate [alerts] - 10https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [14:40:02] (03CR) 10JMeybohm: [C: 03+2] Alert on high Kubernetes API latency [alerts] - 10https://gerrit.wikimedia.org/r/830637 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [14:40:18] !log herron@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons. 
[14:40:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:41:37] (03Merged) 10jenkins-bot: Alert on high lateny of kubelet operations [alerts] - 10https://gerrit.wikimedia.org/r/830228 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [14:42:24] (03Merged) 10jenkins-bot: Alert on high Kubernetes API error rate [alerts] - 10https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [14:42:26] (03Merged) 10jenkins-bot: Alert on high Kubernetes API latency [alerts] - 10https://gerrit.wikimedia.org/r/830637 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [14:43:29] !log dcausse@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T317381: Revert "Disable CirrusSearch completion suggester" (duration: 03m 57s) [14:43:32] T317381: Reduction in helpfulness and quantity of autocomplete search results - https://phabricator.wikimedia.org/T317381 [14:44:09] !log imported jenkins 2.346.3 to thirdparty/ci [14:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:26] (03CR) 10Filippo Giunchedi: [C: 03+1] Move kafka on kafka-logging2001 to PKI TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/831096 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [14:44:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (10fnegri) The "kicked off" part is explained by @Jclark-ctr rebooting the instance. The partition disappearing inste... [14:45:00] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (10Jgreen) >>! In T315924#8224700, @Jclark-ctr wrote: > @Jgreen where the last two steps done before handed over? The unchecked steps were not done. Re. cumin/cookbook is as e... 
[14:45:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10nskaggs) Yes, feel free to coordinate with @fnegri for the depooling portion. Thanks! [14:45:40] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: audit Grafana API actions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831087 (owner: 10Filippo Giunchedi) [14:46:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Marostegui) @nskaggs can this be led by your team, as these proxies are from your service :-) [14:47:35] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:48:11] (03PS1) 10Jgreen: Remove temporary _dmarcian TXT record, it has served its purpose. [dns] - 10https://gerrit.wikimedia.org/r/831101 (https://phabricator.wikimedia.org/T316899) [14:50:14] (03CR) 10Jgreen: [C: 03+2] Remove temporary _dmarcian TXT record, it has served its purpose. [dns] - 10https://gerrit.wikimedia.org/r/831101 (https://phabricator.wikimedia.org/T316899) (owner: 10Jgreen) [14:53:50] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:53:58] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10Andrew) Yep, we're using those IPs for rapid tests so most of the time they're unallocated. 
[14:54:26] PROBLEM - cassandra-c service on restbase1021 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:54:30] PROBLEM - cassandra-b SSL 10.64.0.149:7001 on restbase1021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:54:42] PROBLEM - cassandra-a SSL 10.64.0.148:7001 on restbase1021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:55:00] RECOVERY - SSH on restbase1021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:55:10] PROBLEM - cassandra-b service on restbase1021 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:55:16] PROBLEM - cassandra-c SSL 10.64.0.150:7001 on restbase1021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:55:50] RECOVERY - Restbase root url on restbase1021 is OK: HTTP OK: HTTP/1.1 200 - 17317 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/RESTBase [14:55:58] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:56:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10nskaggs) @Marostegui yes. Sorry, my comment about coordination was directed towards @Cmjohnson. 
Need to pick a convenient time for DCOPs and WMCS. [14:56:12] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:56:30] PROBLEM - cassandra-a service on restbase1021 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:56:40] RECOVERY - cassandra-c service on restbase1021 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:57:30] RECOVERY - cassandra-b service on restbase1021 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:57:38] (03PS2) 10JHathaway: mail::mx: Add support for PLAIN auth over tls [puppet] - 10https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) [14:58:18] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:58:32] RECOVERY - cassandra-a CQL 10.64.0.148:9042 on restbase1021 is OK: TCP OK - 0.000 second response time on 10.64.0.148 port 9042 https://phabricator.wikimedia.org/T93886 [14:58:35] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/831089 (owner: 10David Caro) [14:58:40] RECOVERY - cassandra-a SSL 10.64.0.148:7001 on restbase1021 is OK: SSL OK - Certificate restbase1021-a valid until 2023-04-14 11:20:45 +0000 (expires in 216 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:58:52] RECOVERY - cassandra-a service on restbase1021 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:59:00] (03CR) 10Jbond: [V: 03+1] O:prometheus: use map instead of reduce (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: 10Jbond) [14:59:12] RECOVERY - cassandra-b SSL 10.64.0.149:7001 on restbase1021 is OK: SSL OK - Certificate restbase1021-b valid until 2023-04-14 11:20:48 +0000 (expires in 216 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:59:52] RECOVERY - cassandra-b CQL 10.64.0.149:9042 on restbase1021 is OK: TCP OK - 0.000 second response time on 10.64.0.149 port 9042 https://phabricator.wikimedia.org/T93886 [14:59:58] RECOVERY - cassandra-c SSL 10.64.0.150:7001 on restbase1021 is OK: SSL OK - Certificate restbase1021-c valid until 2023-04-14 11:20:51 +0000 (expires in 216 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:00:22] RECOVERY - cassandra-c CQL 10.64.0.150:9042 on restbase1021 is OK: TCP OK - 0.000 second response time on 10.64.0.150 port 9042 https://phabricator.wikimedia.org/T93886 [15:00:45] (03CR) 10JHathaway: "Thanks again jbond, going ahead with 
merging" [puppet] - 10https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) (owner: 10JHathaway) [15:00:54] (03PS3) 10JHathaway: mail::mx: Add support for PLAIN auth over tls [puppet] - 10https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) [15:01:42] (03CR) 10Andrew Bogott: [C: 03+1] "I think this is fine. Most of the absented things are distracting but harmless files that used to be installed by upstream debian packages" [puppet] - 10https://gerrit.wikimedia.org/r/831089 (owner: 10David Caro) [15:02:14] RECOVERY - Check systemd state on ms-be1035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:17] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) (owner: 10JHathaway) [15:06:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312863)', diff saved to https://phabricator.wikimedia.org/P34338 and previous config saved to /var/cache/conftool/dbconfig/20220909-150651-ladsgroup.json [15:06:55] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [15:06:58] (KubernetesAPILatencySecretsLIST) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatencySecretsLIST [15:07:19] damn :) [15:07:42] 10SRE-OnFire, 10Observability-Alerting, 10Discovery-Search (Current work), 10Sustainability (Incident Followup): Improve Search team alerting - https://phabricator.wikimedia.org/T313095 (10bking) [15:09:12] (03PS1) 10Giuseppe Lavagetto: sre: add paging alert for etcdmirror down [alerts] - 10https://gerrit.wikimedia.org/r/831103 [15:09:32] (03PS1) 10Jgreen: DMARC 
External Domain Verification for wikipedia.org and w.wiki. [dns] - 10https://gerrit.wikimedia.org/r/831104 (https://phabricator.wikimedia.org/T211401) [15:09:57] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) (owner: 10JHathaway) [15:09:58] 10SRE-OnFire, 10Discovery-Search, 10Sustainability (Incident Followup): Better test environments for Elastic - https://phabricator.wikimedia.org/T317420 (10bking) [15:10:44] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:11:58] (KubernetesAPILatencySecretsLIST) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatencySecretsLIST [15:12:49] (03CR) 10Jbond: [C: 03+1] opensatck: remove some not needed absented resources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831089 (owner: 10David Caro) [15:14:15] (03CR) 10Btullis: [C: 03+1] "LGTM, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/831056 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:14:17] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Patch-For-Review: Wikipedia.org DMARC "rua" and "ruf" email addresses need verification - https://phabricator.wikimedia.org/T211401 (10Jgreen) There's a nice summary of this issue here https://dmarcian.com/what-is-external-destination-verification/ [15:16:19] (03PS2) 10Clément Goubert: sre: add paging alert for etcdmirror down [alerts] - 10https://gerrit.wikimedia.org/r/831103 (https://phabricator.wikimedia.org/T317402) (owner: 10Giuseppe Lavagetto) [15:22:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P34339 and previous config saved to /var/cache/conftool/dbconfig/20220909-152159-ladsgroup.json [15:23:19] (03CR) 10JHathaway: [C: 03+2] mail::mx: Add support for PLAIN auth over tls [puppet] - 10https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) (owner: 10JHathaway) [15:23:38] TheresNoTime: looks like you got everything deployed. I'm fine with -labs deployments on Fridays as long as the deployment server stays tidy (i.e., fetch and rebase so the next deployer isn't surprised/confused) [15:25:12] thcipriani: okay, thank you for clarifying :) [15:29:26] (03CR) 10Ahmon Dancy: [C: 03+1] "All files look ok" [puppet] - 10https://gerrit.wikimedia.org/r/831039 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:30:03] 10SRE-OnFire, 10Observability-Alerting, 10Discovery-Search (Current work), 10Sustainability (Incident Followup): Improve Search team alerting - https://phabricator.wikimedia.org/T313095 (10bking) Suggestion from @EBernhardson : "random guesses at what we need, the search reindex process looks at the old i... 
[15:33:18] (03PS1) 10Muehlenhoff: turnilo: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/831111 [15:36:44] (03PS1) 10Cwhite: smart: restore get_fact and deprecate get_raid_drivers [puppet] - 10https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293) [15:36:59] (KubernetesAPILatencySecretsLIST) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatencySecretsLIST [15:37:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P34340 and previous config saved to /var/cache/conftool/dbconfig/20220909-153706-ladsgroup.json [15:39:48] (03PS1) 10Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/831115 [15:41:58] (KubernetesAPILatencySecretsLIST) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatencySecretsLIST [15:42:25] (03PS2) 10Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830955 [15:42:33] (03CR) 10Jdlrobson: [C: 04-1] EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830955 (owner: 10Jdlrobson) [15:42:40] (03Abandoned) 10Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/831115 (owner: 10Jdlrobson) [15:44:27] 10SRE-OnFire, 10serviceops, 10Sustainability (Incident Followup): Add failure rate 
triggered rollback to scap - https://phabricator.wikimedia.org/T317405 (10Clement_Goubert) [15:45:03] (03PS1) 10JMeybohm: Improve alerts on high Kubernetes API latency [alerts] - 10https://gerrit.wikimedia.org/r/831116 (https://phabricator.wikimedia.org/T311251) [15:46:03] (03PS1) 10Jdlrobson: Enable Nearby on Hebrew and French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831117 (https://phabricator.wikimedia.org/T246493) [15:46:13] (03CR) 10CI reject: [V: 04-1] Enable Nearby on Hebrew and French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831117 (https://phabricator.wikimedia.org/T246493) (owner: 10Jdlrobson) [15:47:47] (03CR) 10JMeybohm: [C: 03+2] Improve alerts on high Kubernetes API latency [alerts] - 10https://gerrit.wikimedia.org/r/831116 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [15:50:10] (03Merged) 10jenkins-bot: Improve alerts on high Kubernetes API latency [alerts] - 10https://gerrit.wikimedia.org/r/831116 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [15:50:32] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:50:51] (03PS2) 10Jelto: gitlab: allow git user to access backup folder [puppet] - 10https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) [15:52:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312863)', diff saved to https://phabricator.wikimedia.org/P34341 and previous config saved to /var/cache/conftool/dbconfig/20220909-155213-ladsgroup.json [15:52:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance [15:52:17] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [15:52:28] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance [15:52:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T312863)', diff saved to https://phabricator.wikimedia.org/P34342 and previous config saved to /var/cache/conftool/dbconfig/20220909-155234-ladsgroup.json [15:52:58] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37197/console" [puppet] - 10https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [15:55:20] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:55:58] (03PS3) 10Jelto: gitlab: allow git user to access backup folder [puppet] - 10https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) [15:57:02] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37198/console" [puppet] - 10https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [15:59:00] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: allow git user to access backup folder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [16:05:50] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:50] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:14:20] (03PS1) 10David Caro: bootstrap_and_add: added preflight checks [cookbooks] (wmcs) - 
10https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) [16:16:24] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:18:14] (03Abandoned) 10Jforrester: Check $thumb->isError() before trying to use it [extensions/PageImages] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/720858 (https://phabricator.wikimedia.org/T290973) (owner: 10Jforrester) [16:18:17] (03CR) 10CI reject: [V: 04-1] bootstrap_and_add: added preflight checks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: 10David Caro) [16:18:46] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:21:29] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons. [16:23:06] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.2.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method= [16:25:28] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:37:48] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:42:13] (03PS2) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) [16:44:26] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830197 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [16:44:52] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:15:04] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:19:22] (03CR) 10Dwisehaupt: [C: 03+1] "Talked through this with jgreen and I'm happy with it. I agree that additional SRE eyes on this change would be good." 
[dns] - https://gerrit.wikimedia.org/r/831104 (https://phabricator.wikimedia.org/T211401) (owner: Jgreen) [17:42:05] (Abandoned) Andrew Bogott: dynamic proxy: block a second troublesome UA [puppet] - https://gerrit.wikimedia.org/r/830934 (owner: Andrew Bogott) [17:42:10] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:46:46] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:46:56] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:49:08] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:53:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Replace cloudnet100[34] with cloudnet100[56] - https://phabricator.wikimedia.org/T316284 (10Andrew) a:05Andrew→03aborrero [18:01:02] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:01:36] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 114 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:04:00] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:10:34] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:21:11] !log herron@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons. [18:21:38] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) @TAndic this configuration has now been deployed to prod and tested. I can provide you the credentials so you can setup the Qualtrics side. 
[18:25:00] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:39:10] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:43:54] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:46:18] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:46:26] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:17:16] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:19:38] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:24:32] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:26:56] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:33:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T312863)', diff saved to https://phabricator.wikimedia.org/P34343 and previous config saved to /var/cache/conftool/dbconfig/20220909-193316-ladsgroup.json [19:33:20] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [19:34:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/830923 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [19:43:38] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:44:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:45:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:45:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:46:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:48:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P34344 and previous config saved to 
/var/cache/conftool/dbconfig/20220909-194822-ladsgroup.json [19:50:40] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:50:48] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:55:28] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:00:18] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:02:20] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons. [20:03:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P34345 and previous config saved to /var/cache/conftool/dbconfig/20220909-200329-ladsgroup.json [20:03:51] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm for new host dispatch-be1001.eqiad.wmnet [20:03:52] !log herron@cumin1001 START - Cookbook sre.dns.netbox [20:08:14] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37199/" [puppet] - 10https://gerrit.wikimedia.org/r/830284 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [20:12:33] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "one of the dependecy cycles you don't see in the compiler. and yea, this is because we are using systemd::sysuser..." 
[puppet] - https://gerrit.wikimedia.org/r/830284 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [20:15:05] (PS1) Dzahn: phabricator: remove require for homedir from systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/831144 (https://phabricator.wikimedia.org/T280597) [20:16:01] (CR) Nray: "fyi, I just merged https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/829220 which will cause a number of expected visual changes. " [skins/Vector] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830955 (owner: Jdlrobson) [20:16:54] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:18:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T312863)', diff saved to https://phabricator.wikimedia.org/P34347 and previous config saved to /var/cache/conftool/dbconfig/20220909-201835-ladsgroup.json [20:18:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [20:18:41] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [20:18:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [20:18:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T312863)', diff saved to https://phabricator.wikimedia.org/P34348 and previous config saved to /var/cache/conftool/dbconfig/20220909-201857-ladsgroup.json [20:19:40] (CR) Dzahn: [C: +2] phabricator: remove require for homedir from systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/831144 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [20:19:57] (CR) Dzahn: [V: +1 C: +2] "follow-up
https://gerrit.wikimedia.org/r/c/operations/puppet/+/831144" [puppet] - https://gerrit.wikimedia.org/r/830284 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [20:23:33] (CR) Dzahn: [C: +2] "noop on phab1001,phab2001 and phab2002. phab1004 still pages full of dependency problems. the difference is that here the phd user was not" [puppet] - https://gerrit.wikimedia.org/r/831144 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [20:25:27] (CR) Dzahn: [C: +2] "probably means we would see more errors related to systemd::sysuser if we applied the role on new hosts..and we just don't see those becau" [puppet] - https://gerrit.wikimedia.org/r/831144 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [20:27:14] !log herron@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:27:14] !log herron@cumin1001 START - Cookbook sre.dns.wipe-cache dispatch-be1001.eqiad.wmnet on all recursors [20:27:18] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dispatch-be1001.eqiad.wmnet on all recursors [20:27:21] !log herron@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host dispatch-be1001.eqiad.wmnet [20:36:25] SRE, ops-eqiad, decommission-hardware: decommission elastic10[48-52].eqiad.wmnet - https://phabricator.wikimedia.org/T316728 (herron) Hello, I'm seeing some pending wmfNNNN.mgmt forward/reverse dns record removals which may be related to this task, here's a paste https://phabricator.wikimedia.org/P34...
[20:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:38:24] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:39:38] (03PS3) 10Dzahn: phabricator::phd: actually use $phd_user variable and small improvements [puppet] - 10https://gerrit.wikimedia.org/r/830285 (https://phabricator.wikimedia.org/T280597) [20:40:48] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:03:10] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:16:38] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:17:31] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/37200/" [puppet] - 10https://gerrit.wikimedia.org/r/830285 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:27:48] (03CR) 10Dzahn: [V: 03+1 C: 03+2] phabricator::phd: actually use $phd_user variable and small improvements [puppet] - 10https://gerrit.wikimedia.org/r/830285 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:28:32] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:29:22] RECOVERY - ElasticSearch setting check - 9400 on elastic2052 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [21:29:22] RECOVERY - ElasticSearch setting check - 9200 on elastic2031 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [21:29:22] RECOVERY - ElasticSearch setting check - 9200 on elastic2042 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [21:29:22] RECOVERY - ElasticSearch setting check - 9200 on elastic2025 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [21:29:22] RECOVERY - ElasticSearch setting check - 9400 on elastic2042 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [21:29:22] RECOVERY - ElasticSearch setting check - 9400 on elastic2047 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [21:41:18] (03PS3) 10Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830955 (https://phabricator.wikimedia.org/T315261) [21:42:52] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:42:54] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:52:26] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:54:14] 
PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:54:18] RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T312863)', diff saved to https://phabricator.wikimedia.org/P34349 and previous config saved to /var/cache/conftool/dbconfig/20220909-215704-ladsgroup.json [21:57:08] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [22:02:02] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:06:59] (03PS1) 10Dzahn: phabricator: do not use systemd::sysuser on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/831154 [22:07:40] (03PS2) 10Dzahn: phabricator: do not use systemd::sysuser on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/831154 [22:11:36] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:12:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P34350 and previous config saved to /var/cache/conftool/dbconfig/20220909-221210-ladsgroup.json [22:25:19] (03CR) 10Dzahn: [C: 03+2] "just a test / debugging" [puppet] - 10https://gerrit.wikimedia.org/r/831154 (owner: 10Dzahn) [22:26:54] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission elastic10[48-52].eqiad.wmnet - https://phabricator.wikimedia.org/T316728 (10Jclark-ctr) @herron they are good to be removed if they are related to these servers [22:27:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P34351 and previous config saved to /var/cache/conftool/dbconfig/20220909-222717-ladsgroup.json [22:27:55] (03CR) 10Cwhite: [C: 03+1] Move kafka on kafka-logging2001 to PKI TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/831096 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [22:28:02] RECOVERY - Check that envoy is running on phab1004 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [22:28:34] (03CR) 10Cwhite: [C: 03+2] logstash: reduce replica count to 1 after 1 day [puppet] - 10https://gerrit.wikimedia.org/r/830690 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite) [22:29:10] (03PS4) 10Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830955 (https://phabricator.wikimedia.org/T315261) [22:29:22] (03CR) 10Dzahn: [C: 03+2] "after this the scap system user was created and puppet could do a bunch of things it could not do before. 
there are other errors cause by " [puppet] - 10https://gerrit.wikimedia.org/r/831154 (owner: 10Dzahn) [22:30:38] RECOVERY - Check no envoy runtime configuration is left persistent on phab1004 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [22:42:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T312863)', diff saved to https://phabricator.wikimedia.org/P34352 and previous config saved to /var/cache/conftool/dbconfig/20220909-224223-ladsgroup.json [22:42:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [22:42:27] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [22:42:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [22:42:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T312863)', diff saved to https://phabricator.wikimedia.org/P34353 and previous config saved to /var/cache/conftool/dbconfig/20220909-224245-ladsgroup.json [22:45:05] (03CR) 10Yahya: [C: 03+1] Create six more namespaces (three content namespaces and their corresponding three discussion namespaces) on the bn.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830982 (https://phabricator.wikimedia.org/T317424) (owner: 10Aishik Rehman) [22:54:44] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:01:52] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:15:43] (03PS5) 10Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830955 (https://phabricator.wikimedia.org/T315261) [23:16:14] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:16:14] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:32:52] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:35:18] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:37:42] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:49:40] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:56:52] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:58:15] SRE, DBA, MediaWiki-General, Patch-For-Review: img_metadata queries for PDF files saturates s4 replicas - https://phabricator.wikimedia.org/T147296 (Krinkle) [23:58:23] (CR) RLazarus: [C: +2] pcc: Warn before compiling all nodes by default [puppet] - https://gerrit.wikimedia.org/r/830957 (https://phabricator.wikimedia.org/T222075) (owner: RLazarus) [23:58:43] SRE, DBA, MediaWiki-General: img_metadata queries for PDF files saturates s4 replicas - https://phabricator.wikimedia.org/T147296 (Krinkle) [23:59:01] SRE, DBA, MediaWiki-File-management: img_metadata queries for PDF files saturates s4 replicas - https://phabricator.wikimedia.org/T147296 (Krinkle)