[00:08:45] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[00:09:08] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 22s)
[00:11:09] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:41] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:24:11] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:25:14] <wikibugs>	 (PS1) Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830955 (https://phabricator.wikimedia.org/T315261)
[00:25:24] <wikibugs>	 (CR) Jdlrobson: [C: -1] EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830955 (https://phabricator.wikimedia.org/T315261) (owner: Jdlrobson)
[00:29:43] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:37:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:38:01] <wikibugs>	 (PS1) RLazarus: pcc: Warn before compiling all nodes by default [puppet] - https://gerrit.wikimedia.org/r/830957 (https://phabricator.wikimedia.org/T222075)
[00:45:03] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:56:53] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:01:39] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:04:33] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:13:53] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (phaultfinder)
[01:15:08] <wikibugs>	 (CR) Dzahn: "I like the idea and thank you for it, but can it be interactive like that when started from the web interface?" [puppet] - https://gerrit.wikimedia.org/r/830957 (https://phabricator.wikimedia.org/T222075) (owner: RLazarus)
[01:17:39] <wikibugs>	 (CR) RLazarus: pcc: Warn before compiling all nodes by default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830957 (https://phabricator.wikimedia.org/T222075) (owner: RLazarus)
[01:21:15] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:23:39] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:26:37] <wikibugs>	 (CR) Dzahn: [C: +1] "ACK. (another repo). tested this, the confirmation y/n works, message makes sense" [puppet] - https://gerrit.wikimedia.org/r/830957 (https://phabricator.wikimedia.org/T222075) (owner: RLazarus)
[01:26:49] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:53] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:30:14] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:36:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:44:49] <icinga-wm>	 RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:46:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:04:55] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:49] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:18:03] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:28:55] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:36:57] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:46:27] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:03:23] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:10:33] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:13:13] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:22:47] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:31:41] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:48:17] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:51:05] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:51:15] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:53:39] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:54:43] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:54:51] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:00:17] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:02:37] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:09:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:37:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:43:47] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:53:19] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:01:39] <wikibugs>	 (PS1) Ebernhardson: Disable CirrusSearch completion suggester [mediawiki-config] - https://gerrit.wikimedia.org/r/830962
[05:02:20] <ebernhardson>	 There is an autocomplete problem across many wikis, going to deploy a small config patch to switch from the completion suggester, which isn't working 100%, to the prefix search which should be fine
[05:02:46] <wikibugs>	 (PS2) Ebernhardson: Disable CirrusSearch completion suggester [mediawiki-config] - https://gerrit.wikimedia.org/r/830962
[05:03:01] <wikibugs>	 (CR) Ebernhardson: [C: +2] Disable CirrusSearch completion suggester [mediawiki-config] - https://gerrit.wikimedia.org/r/830962 (owner: Ebernhardson)
[05:03:54] <wikibugs>	 (Merged) jenkins-bot: Disable CirrusSearch completion suggester [mediawiki-config] - https://gerrit.wikimedia.org/r/830962 (owner: Ebernhardson)
[05:05:13] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:10:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[05:11:09] <_joe_>	 ebernhardson: I'm around in case :)
[05:11:12] <logmsgbot>	 !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: cirrus: Switch all wikis from completion suggester to prefix search, yesterday's completion index builds in codfw weren't all successful and users are getting incomplete results (duration: 04m 01s)
[05:11:23] <_joe_>	 anything I need to check?
[05:12:25] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:18:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[05:18:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[05:19:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[05:36:23] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1203 [puppet] - https://gerrit.wikimedia.org/r/830963
[05:37:50] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1203 [puppet] - https://gerrit.wikimedia.org/r/830963 (owner: Marostegui)
[05:40:27] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:41:29] <marostegui>	 !log dbmaint s4 testcommonswiki eqiad T317349
[05:41:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:33] <stashbot>	 T317349: Add primary key and drop unique index on wb_id_counters on wmf wikis - https://phabricator.wikimedia.org/T317349
[05:42:12] <marostegui>	 !log dbmaint s4 commonswiki eqiad T317349
[05:42:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:47] <marostegui>	 !log dbmaint s3 testwikidatawiki eqiad T317349
[05:43:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:53] <marostegui>	 !log dbmaint s8 wikidatawiki eqiad T317349
[05:44:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:11] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:48:13] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:51:15] <ebernhardson>	 _joe_: no, all is fine, it's a compatibility issue between mediawiki versions and elasticsearch versions, today's rebuilds should be fine and the patch can be reverted in a few hours when the cronjobs on mwmaint1002 are done
[05:51:47] <ebernhardson>	 (i'm assuming someone on my time will follow up, dcausse knows these bits. otherwise i can tomorrow morning)
[05:51:51] <ebernhardson>	 s/my time/my team/
[05:52:59] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:13:14] <wikibugs>	 (CR) Ayounsi: [C: +2] Revert "Exclude cloud-eqiad prefix from MXs trusted networks" [puppet] - https://gerrit.wikimedia.org/r/830819 (owner: Ayounsi)
[06:16:49] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:17:45] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (ayounsi) Open→Resolved a:ayounsi This is all done. Please re-open if there are any issues.
[06:33:49] <icinga-wm>	 PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:38:17] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:40:16] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (ayounsi)
[06:40:39] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:51:06] <wikibugs>	 (PS13) Ayounsi: sre.network.peering: initial commit [cookbooks] - https://gerrit.wikimedia.org/r/816730
[06:51:08] <wikibugs>	 (CR) Ayounsi: sre.network.peering: initial commit (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/816730 (owner: Ayounsi)
[06:51:18] <wikibugs>	 (PS14) Ayounsi: sre.network.peering: initial commit [cookbooks] - https://gerrit.wikimedia.org/r/816730
[06:54:55] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220909T0700)
[07:05:17] <wikibugs>	 (PS1) Elukey: Revert "Add a kublet node_label to each master of the dse-k8s cluster" [puppet] - https://gerrit.wikimedia.org/r/830967
[07:06:53] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:08:04] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/812239 (owner: Muehlenhoff)
[07:08:06] <wikibugs>	 (CR) Elukey: "--node-labels in the 'kubernetes.io' namespace must begin with an allowed prefix (kubelet.kubernetes.io, node.kubernetes.io) or be in the " [puppet] - https://gerrit.wikimedia.org/r/830967 (owner: Elukey)
[07:10:03] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Add the prometheus config to enable scraping from the dse-k8s cluster [puppet] - https://gerrit.wikimedia.org/r/830897 (https://phabricator.wikimedia.org/T310179) (owner: Btullis)
[07:10:13] <wikibugs>	 (CR) Elukey: [C: +2] Revert "Add a kublet node_label to each master of the dse-k8s cluster" [puppet] - https://gerrit.wikimedia.org/r/830967 (owner: Elukey)
[07:10:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet
[07:11:49] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Nice! LGTM" [puppet] - https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[07:12:11] <icinga-wm>	 RECOVERY - Check systemd state on dse-k8s-ctrl1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:12:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1137 for upgrade', diff saved to https://phabricator.wikimedia.org/P34318 and previous config saved to /var/cache/conftool/dbconfig/20220909-071255-root.json
[07:13:15] <icinga-wm>	 RECOVERY - Check systemd state on dse-k8s-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:03] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:14:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet
[07:16:27] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:18:23] <wikibugs>	 (PS1) Marostegui: db1137: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830965
[07:19:33] <wikibugs>	 (CR) Marostegui: [C: +2] db1137: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830965 (owner: Marostegui)
[07:22:49] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:33:42] <icinga-wm>	 RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:40:13] <wikibugs>	 (PS2) Majavah: P:wmcs::novaproxy: add prometheus nginx exporter [puppet] - https://gerrit.wikimedia.org/r/830928
[07:40:55] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37181/console" [puppet] - https://gerrit.wikimedia.org/r/830928 (owner: Majavah)
[07:42:13] <wikibugs>	 SRE, SRE-swift-storage, Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (fgiunchedi) I don't have the time/bandwidth to followup with a decent incident report, though yes tl;dr is that reverting https://gerrit.wikimedia.org/r/c/oper...
[07:42:50] <wikibugs>	 SRE, Data Engineering Planning, Foundational Technology Requests, User-fgiunchedi: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (fgiunchedi)
[07:46:45] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[07:46:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[07:47:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[07:47:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T312863)', diff saved to https://phabricator.wikimedia.org/P34319 and previous config saved to /var/cache/conftool/dbconfig/20220909-074710-ladsgroup.json
[07:47:13] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[07:49:54] <wikibugs>	 (PS1) Muehlenhoff: More RAID cleanups [puppet] - https://gerrit.wikimedia.org/r/830987
[07:50:56] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:52:08] <wikibugs>	 (CR) Muehlenhoff: [C: +2] More RAID cleanups [puppet] - https://gerrit.wikimedia.org/r/830987 (owner: Muehlenhoff)
[07:52:32] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: reduce webrequest retention to 31 days (1 comment) [puppet] - https://gerrit.wikimedia.org/r/821324 (https://phabricator.wikimedia.org/T313099) (owner: Cwhite)
[07:52:48] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: reduce replica count to 1 after 1 day (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830690 (https://phabricator.wikimedia.org/T313099) (owner: Cwhite)
[07:55:10] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:56:38] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:01:58] <wikibugs>	 SRE, Observability-Alerting: Export and share alerts data - https://phabricator.wikimedia.org/T317393 (fgiunchedi)
[08:03:46] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (38) node(s) change every puppet run: an-launcher1002, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephosd1031, cloudcephosd1033, cloudcephosd1034, clouddumps1001, clouddumps1002, labstore1006, labstore1007, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, 
[08:03:46] <icinga-wm>	 1, ms-fe2012, releases1002, releases2002, stat1004, stat1005, stat1006, stat1007, stat1008, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[08:05:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[08:05:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[08:05:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:05:52] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:06:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:06:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T312863)', diff saved to https://phabricator.wikimedia.org/P34320 and previous config saved to /var/cache/conftool/dbconfig/20220909-080609-ladsgroup.json
[08:06:12] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[08:06:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[08:06:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[08:07:08] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.0.149:9042 on restbase1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[08:07:40] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.0.148:9042 on restbase1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[08:07:46] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:08:16] <icinga-wm>	 PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:08:30] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.0.150:9042 on restbase1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[08:09:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:09:50] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:10:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[08:10:16] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:10:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[08:10:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:10:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:10:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T314041)', diff saved to https://phabricator.wikimedia.org/P34321 and previous config saved to /var/cache/conftool/dbconfig/20220909-081042-ladsgroup.json
[08:10:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance
[08:10:46] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[08:10:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance
[08:11:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T314041)', diff saved to https://phabricator.wikimedia.org/P34322 and previous config saved to /var/cache/conftool/dbconfig/20220909-081103-ladsgroup.json
[08:11:04] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (38) node(s) change every puppet run: an-launcher1002, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephosd1031, cloudcephosd1033, cloudcephosd1034, clouddumps1001, clouddumps1002, labstore1006, labstore1007, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, 
[08:11:04] <icinga-wm>	 1, ms-fe2012, releases1002, releases2002, stat1004, stat1005, stat1006, stat1007, stat1008, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[08:11:39] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: deploy provision-fs.sh [puppet] - 10https://gerrit.wikimedia.org/r/831030
[08:12:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[08:13:17] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] P:wmcs::novaproxy: add prometheus nginx exporter [puppet] - 10https://gerrit.wikimedia.org/r/830928 (owner: 10Majavah)
[08:16:09] <dcausse>	 !log restarting blazegraph on wdqs2002 (BlazegraphFreeAllocatorsDecreasingRapidly)
[08:16:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2002:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:32:24] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:32:31] <dcausse>	 !log rebuilding all completion indices in elastic@codfw
[08:32:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:39] <wikibugs>	 (03PS6) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031)
[08:32:41] <wikibugs>	 (03PS4) 10David Caro: p::wmcs:prometheus: Add cloudvps federation job [puppet] - 10https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982)
[08:37:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:37:53] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp[1029-1033].eqiad.wmnet
[08:38:39] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] p::metricsinfra:haproxy: Allow exposing federation endpoints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: 10David Caro)
[08:40:08] <icinga-wm>	 PROBLEM - Restbase root url on restbase1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/RESTBase
[08:43:08] <icinga-wm>	 PROBLEM - SSH on restbase1021 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:43:28] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:43:46] <wikibugs>	 (03PS1) 10Muehlenhoff: ntp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831034 (https://phabricator.wikimedia.org/T308013)
[08:44:43] <wikibugs>	 (03PS7) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031)
[08:44:45] <wikibugs>	 (03CR) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: 10David Caro)
[08:44:47] <wikibugs>	 (03PS5) 10David Caro: p::wmcs:prometheus: Add cloudvps federation job [puppet] - 10https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982)
[08:45:24] <wikibugs>	 (03PS1) 10Muehlenhoff: confd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831035 (https://phabricator.wikimedia.org/T308013)
[08:46:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] confd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831035 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[08:47:36] <wikibugs>	 (03PS1) 10David Caro: p::metricsinfra:haproxy: rename some vars to reflect intent [puppet] - 10https://gerrit.wikimedia.org/r/831036
[08:47:51] <wikibugs>	 (03PS1) 10Muehlenhoff: java: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831037 (https://phabricator.wikimedia.org/T308013)
[08:48:04] <wikibugs>	 (03PS2) 10Muehlenhoff: confd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831035 (https://phabricator.wikimedia.org/T308013)
[08:48:57] <wikibugs>	 (03PS1) 10Majavah: hieradata: add metricsinfra puppetmaster key to pcc facts submitters [puppet] - 10https://gerrit.wikimedia.org/r/831038
[08:50:19] <wikibugs>	 (03PS1) 10Muehlenhoff: scp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831039 (https://phabricator.wikimedia.org/T308013)
[08:51:58] <icinga-wm>	 RECOVERY - SSH on restbase1021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:54:59] <wikibugs>	 (03PS2) 10Muehlenhoff: scap: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831039 (https://phabricator.wikimedia.org/T308013)
[08:55:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (Including license change from GPL to Apache 2.0 for jheapdump which I authored)" [puppet] - 10https://gerrit.wikimedia.org/r/831037 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[08:56:44] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[08:58:05] <wikibugs>	 (03PS1) 10Majavah: P:mariadb::cloudinfra: remove direct access from puppetmaster hosts [puppet] - 10https://gerrit.wikimedia.org/r/831041
[08:59:00] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:59:01] <logmsgbot>	 !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wtp[1029-1033].eqiad.wmnet
[09:03:05] <wikibugs>	 (03PS1) 10Majavah: hieradata: split pcc entries for cloudinfra [puppet] - 10https://gerrit.wikimedia.org/r/831042
[09:07:11] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:08:53] <icinga-wm>	 PROBLEM - SSH on restbase1021 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:11:03] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:11:39] <icinga-wm>	 RECOVERY - SSH on restbase1021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:13:51] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:15:13] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (10MatthewVernon) p:05Triage→03Low
[09:16:56] <wikibugs>	 (03PS3) 10Muehlenhoff: scap: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831039 (https://phabricator.wikimedia.org/T308013)
[09:17:19] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:40] <wikibugs>	 (03PS2) 10Majavah: P:mariadb::cloudinfra: remove direct access from puppetmaster hosts [puppet] - 10https://gerrit.wikimedia.org/r/831041
[09:17:42] <wikibugs>	 (03PS1) 10Majavah: hieradata: add cache_hosts for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/831044
[09:17:44] <wikibugs>	 (03PS1) 10Majavah: P:mariadb::cloudinfra: add web proxy database/grants [puppet] - 10https://gerrit.wikimedia.org/r/831045 (https://phabricator.wikimedia.org/T316982)
[09:18:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34323 and previous config saved to /var/cache/conftool/dbconfig/20220909-091809-root.json
[09:19:03] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:20:31] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:20:47] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:21:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete owner annotations [puppet] - 10https://gerrit.wikimedia.org/r/831046
[09:22:33] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "overall the CR reduces complexity and I don't think it's messing with the functionality of the script (or its verbosity) and enables more " [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall)
[09:23:54] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] varnish: Remove extraneous checks for Docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall)
[09:30:31] <wikibugs>	 (03CR) 10Vgutierrez: "Brett, could you check this CR and rebase it on top of https://gerrit.wikimedia.org/r/c/operations/puppet/+/826367/ I think it would be in" [puppet] - 10https://gerrit.wikimedia.org/r/771863 (owner: 10Giuseppe Lavagetto)
[09:32:11] <wikibugs>	 (03PS1) 10Btullis: Use the dumpsgen user to mount the NFS dumps directories [puppet] - 10https://gerrit.wikimedia.org/r/831049 (https://phabricator.wikimedia.org/T317359)
[09:32:25] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/831049 (https://phabricator.wikimedia.org/T317359) (owner: 10Btullis)
[09:32:50] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1137: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830968
[09:33:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34325 and previous config saved to /var/cache/conftool/dbconfig/20220909-093314-root.json
[09:36:19] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1137: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830968 (owner: 10Marostegui)
[09:36:21] <wikibugs>	 (03PS5) 10Clément Goubert: wtp: Purge wtp servers following migration to parse [puppet] - 10https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025)
[09:38:22] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for phedenskog - https://phabricator.wikimedia.org/T317401 (10Peter)
[09:39:15] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37184/console" [puppet] - 10https://gerrit.wikimedia.org/r/831049 (https://phabricator.wikimedia.org/T317359) (owner: 10Btullis)
[09:39:50] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Use the dumpsgen user to mount the NFS dumps directories [puppet] - 10https://gerrit.wikimedia.org/r/831049 (https://phabricator.wikimedia.org/T317359) (owner: 10Btullis)
[09:39:59] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] wtp: Purge wtp servers following migration to parse [puppet] - 10https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: 10Clément Goubert)
[09:40:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm, few nits but nothing blocking" [puppet] - 10https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) (owner: 10JHathaway)
[09:42:11] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/831030 (owner: 10Filippo Giunchedi)
[09:42:51] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:42:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/831038 (owner: 10Majavah)
[09:43:38] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/831046 (owner: 10Muehlenhoff)
[09:44:43] <wikibugs>	 (03PS2) 10Jbond: hieradata: split pcc entries for cloudinfra [puppet] - 10https://gerrit.wikimedia.org/r/831042 (owner: 10Majavah)
[09:44:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "lgtm thanks will merge" [puppet] - 10https://gerrit.wikimedia.org/r/831042 (owner: 10Majavah)
[09:45:05] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] hieradata: split pcc entries for cloudinfra [puppet] - 10https://gerrit.wikimedia.org/r/831042 (owner: 10Majavah)
[09:45:11] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:45:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/831035 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:45:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/831034 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:47:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/831037 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:47:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/831039 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:47:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/831046 (owner: 10Muehlenhoff)
[09:48:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34326 and previous config saved to /var/cache/conftool/dbconfig/20220909-094819-root.json
[09:51:17] <wikibugs>	 (03PS7) 10Vgutierrez: Unlink certificate renewal and OCSP handling [software/acme-chief] - 10https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: 10BCornwall)
[09:52:47] <icinga-wm>	 PROBLEM - SSH on restbase1021 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:53:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete owner annotations [puppet] - 10https://gerrit.wikimedia.org/r/831046 (owner: 10Muehlenhoff)
[09:53:46] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[09:53:56] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s)
[09:54:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] ntp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831034 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:57:36] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add the prometheus config to enable scraping from the dse-k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/830897 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis)
[09:59:54] <wikibugs>	 10ops-eqiad, 10decommission-hardware, 10serviceops, 10Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (10Clement_Goubert)
[10:01:34] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:03:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34327 and previous config saved to /var/cache/conftool/dbconfig/20220909-100324-root.json
[10:06:52] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:07:02] <wikibugs>	 10ops-eqiad, 10decommission-hardware, 10serviceops, 10Patch-For-Review: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (10Clement_Goubert) `wtp[1029-1033].eqiad.wmnet` didn't power off correctly.
[10:08:38] <icinga-wm>	 RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:13:36] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:15:12] <wikibugs>	 (03PS1) 10Muehlenhoff: memcached: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831055 (https://phabricator.wikimedia.org/T308013)
[10:15:14] <wikibugs>	 (03PS1) 10Muehlenhoff: druid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831056 (https://phabricator.wikimedia.org/T308013)
[10:18:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34328 and previous config saved to /var/cache/conftool/dbconfig/20220909-101830-root.json
[10:19:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] memcached: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831055 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[10:19:36] <wikibugs>	 (03PS2) 10Muehlenhoff: memcached: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831055 (https://phabricator.wikimedia.org/T308013)
[10:19:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] druid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831056 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[10:19:50] <wikibugs>	 (03PS2) 10Muehlenhoff: druid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831056 (https://phabricator.wikimedia.org/T308013)
[10:20:15] <wikibugs>	 (03CR) 10Vgutierrez: Unlink certificate renewal and OCSP handling (034 comments) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: 10BCornwall)
[10:26:47] <wikibugs>	 10SRE, 10Observability-Alerting: Export and share alerts data - https://phabricator.wikimedia.org/T317393 (10fgiunchedi) p:05Triage→03Medium
[10:26:52] <wikibugs>	 (03PS1) 10Slyngshede: C:raid::perccli handle case with no virtual devices. [puppet] - 10https://gerrit.wikimedia.org/r/831057 (https://phabricator.wikimedia.org/T317344)
[10:27:14] <wikibugs>	 (03PS1) 10Majavah: dynamicproxy: simplify tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/831058
[10:27:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:raid::perccli handle case with no virtual devices. [puppet] - 10https://gerrit.wikimedia.org/r/831057 (https://phabricator.wikimedia.org/T317344) (owner: 10Slyngshede)
[10:27:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dynamicproxy: simplify tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/831058 (owner: 10Majavah)
[10:28:41] <wikibugs>	 (03PS2) 10Majavah: dynamicproxy: simplify tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/831058
[10:29:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dynamicproxy: simplify tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/831058 (owner: 10Majavah)
[10:29:48] <wikibugs>	 (03PS2) 10Slyngshede: C:raid::perccli handle case with no virtual devices. [puppet] - 10https://gerrit.wikimedia.org/r/831057 (https://phabricator.wikimedia.org/T317344)
[10:30:53] <wikibugs>	 (03PS3) 10Majavah: dynamicproxy: simplify tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/831058
[10:31:16] <wikibugs>	 (03PS4) 10Majavah: dynamicproxy: simplify tls configuration [puppet] - 10https://gerrit.wikimedia.org/r/831058
[10:31:18] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[10:31:28] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s)
[10:33:04] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37185/console" [puppet] - 10https://gerrit.wikimedia.org/r/831058 (owner: 10Majavah)
[10:33:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34329 and previous config saved to /var/cache/conftool/dbconfig/20220909-103334-root.json
[10:37:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/831057 (https://phabricator.wikimedia.org/T317344) (owner: 10Slyngshede)
[10:38:21] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10Vgutierrez)
[10:38:33] <wikibugs>	 10SRE, 10Traffic, 10Upstream: metric discrepancies between ATS 9.x and ATS 8.x - https://phabricator.wikimedia.org/T315064 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez
[10:39:10] <wikibugs>	 10SRE-OnFire, 10serviceops, 10Sustainability (Incident Followup): Page on etcdmirror critical status - https://phabricator.wikimedia.org/T317402 (10Clement_Goubert)
[10:41:49] <wikibugs>	 10SRE-OnFire, 10serviceops, 10Sustainability (Incident Followup): Add etcdmirror status check to scap - https://phabricator.wikimedia.org/T317403 (10Clement_Goubert)
[10:41:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: deploy provision-fs.sh [puppet] - 10https://gerrit.wikimedia.org/r/831030 (owner: 10Filippo Giunchedi)
[10:42:00] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: deploy provision-fs.sh [puppet] - 10https://gerrit.wikimedia.org/r/831030
[10:43:55] <wikibugs>	 (03PS1) 10Hashar: devtools: add keyholder agent for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/831062 (https://phabricator.wikimedia.org/T317404)
[10:44:40] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:45:05] <wikibugs>	 10SRE-OnFire, 10serviceops, 10Sustainability (Incident Followup): Add failure rate triggered rollback to scap - https://phabricator.wikimedia.org/T317405 (10Clement_Goubert)
[10:45:37] <wikibugs>	 (03PS1) 10Hashar: Add deployment configuration for devtools WMCS project [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831063 (https://phabricator.wikimedia.org/T317404)
[10:46:54] <wikibugs>	 (03CR) 10Hashar: "Looks like that does the right thing on deploy-1004.devtools.eqiad1.wikimedia.cloud:" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831063 (https://phabricator.wikimedia.org/T317404) (owner: 10Hashar)
[10:49:54] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:51:46] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "There are a lot more changes I have to do before keyholder configuration can be merged ;)" [puppet] - 10https://gerrit.wikimedia.org/r/831062 (https://phabricator.wikimedia.org/T317404) (owner: 10Hashar)
[10:53:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:54:05] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Evaluate xbzrle and/or auto-converge in qemu - https://phabricator.wikimedia.org/T317406 (10MoritzMuehlenhoff)
[10:55:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:01:31] <wikibugs>	 (03PS1) 10Btullis: Correct a typo in the k8s-dse cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/831066 (https://phabricator.wikimedia.org/T310179)
[11:02:34] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff)
[11:02:53] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] Remove support for overriding LDAP client stack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah)
[11:06:10] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Evaluate xbzrle and/or auto-converge in qemu - https://phabricator.wikimedia.org/T317406 (10MoritzMuehlenhoff) p:05Triage→03Medium
[11:06:55] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37186/console" [puppet] - 10https://gerrit.wikimedia.org/r/831066 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis)
[11:08:50] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:09:08] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:10:40] <wikibugs>	 (03CR) 10Vlad.shapik: [C: 03+1] "I have downloaded this patch and run the online tests. Everything has passed successfully." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830903 (owner: 10Hnowlan)
[11:13:03] <wikibugs>	 (03CR) 10Hashar: "The project Puppetmaster points to the WMCS puppet master. I am pretty sure last time I switched a puppet master to be served by itself th" [puppet] - 10https://gerrit.wikimedia.org/r/831062 (https://phabricator.wikimedia.org/T317404) (owner: 10Hashar)
[11:14:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/831057 (https://phabricator.wikimedia.org/T317344) (owner: 10Slyngshede)
[11:14:40] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] prometheus: Parse ATS config to node exporter text [puppet] - 10https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[11:14:59] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] ats: Enable node_ats_config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/830686 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[11:15:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T312863)', diff saved to https://phabricator.wikimedia.org/P34330 and previous config saved to /var/cache/conftool/dbconfig/20220909-111509-ladsgroup.json
[11:15:14] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[11:16:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] devtools: add keyholder agent for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/831062 (https://phabricator.wikimedia.org/T317404) (owner: 10Hashar)
[11:16:02] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Correct a typo in the k8s-dse cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/831066 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis)
[11:16:42] <jbond>	 btullis: fyi i merged your typo change
[11:17:03] <btullis>	 jbond: Great, thanks.
[11:17:11] <jbond>	 np
[11:18:04] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:18:52] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:20:05] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] pcc: Warn before compiling all nodes by default [puppet] - 10https://gerrit.wikimedia.org/r/830957 (https://phabricator.wikimedia.org/T222075) (owner: 10RLazarus)
[11:23:06] <icinga-wm>	 PROBLEM - Check systemd state on cp1081 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:23:12] <icinga-wm>	 PROBLEM - Check systemd state on cp2040 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:23:14] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:23:28] <icinga-wm>	 PROBLEM - Check systemd state on cp2036 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:48] <icinga-wm>	 PROBLEM - Check systemd state on cp1087 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:26:52] <icinga-wm>	 PROBLEM - Check systemd state on cp1082 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:27:22] <icinga-wm>	 PROBLEM - Check systemd state on cp2031 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:28:00] <icinga-wm>	 PROBLEM - Check systemd state on cp4030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:28:14] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:28:20] <icinga-wm>	 PROBLEM - Check systemd state on cp4025 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:28:22] <icinga-wm>	 PROBLEM - Check systemd state on cp2034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:29:00] <icinga-wm>	 PROBLEM - Check systemd state on cp3061 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:29:13] <brett>	 whelp
[11:29:42] <icinga-wm>	 PROBLEM - Check systemd state on cp2042 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:29:42] <icinga-wm>	 PROBLEM - Check systemd state on cp2030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:29:42] <icinga-wm>	 PROBLEM - Check systemd state on cp2027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:29:58] <icinga-wm>	 PROBLEM - Check systemd state on cp5003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:30:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P34331 and previous config saved to /var/cache/conftool/dbconfig/20220909-113016-ladsgroup.json
[11:30:18] <icinga-wm>	 PROBLEM - Check systemd state on cp3051 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:30:34] <icinga-wm>	 PROBLEM - Check systemd state on cp3052 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:30:36] <icinga-wm>	 PROBLEM - Check systemd state on cp4029 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:16] <icinga-wm>	 PROBLEM - Check systemd state on cp5002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:32] <icinga-wm>	 PROBLEM - Check systemd state on cp4033 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:40] <icinga-wm>	 PROBLEM - Check systemd state on cp4021 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:32:12] <icinga-wm>	 PROBLEM - Check systemd state on cp4024 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:32:16] <icinga-wm>	 PROBLEM - Check systemd state on cp3054 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:32:16] <icinga-wm>	 PROBLEM - Check systemd state on cp3056 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:32:38] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:32:50] <icinga-wm>	 PROBLEM - Check systemd state on cp4035 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:33:20] <icinga-wm>	 PROBLEM - Check systemd state on cp3058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:33:36] <icinga-wm>	 PROBLEM - Check systemd state on cp2035 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:34:18] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] deployment-prep: Add P:beta::mediawiki_packages [puppet] - https://gerrit.wikimedia.org/r/830629 (https://phabricator.wikimedia.org/T317128) (owner: Clément Goubert)
[11:34:19] <jynus>	 I am guessing new monitoring deployment, WIP?
[11:34:39] <brett>	 Yeah, issue found
[11:34:46] <brett>	 Sorry for the spam
[11:34:50] <brett>	 ._.
[11:34:51] <jynus>	 no problem
[11:35:02] <jynus>	 no impact alert > high impact alert :-P
[11:35:08] <icinga-wm>	 PROBLEM - Check systemd state on cp3063 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:35:17] <brett>	 These should resolve within a few minutes
[11:36:04] <icinga-wm>	 PROBLEM - Check systemd state on cp4023 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:04] <icinga-wm>	 PROBLEM - Check systemd state on cp4022 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:24] <icinga-wm>	 PROBLEM - Check systemd state on cp3055 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:37:42] <icinga-wm>	 PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:37:58] <icinga-wm>	 PROBLEM - Check systemd state on cp1086 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:37:59] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/831056 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:38:12] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/831055 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:38:58] <wikibugs>	 (PS1) BCornwall: ats: Use variable for ATS 8 in ATS config monitor [puppet] - https://gerrit.wikimedia.org/r/831073 (https://phabricator.wikimedia.org/T292815)
[11:39:10] <icinga-wm>	 PROBLEM - Check systemd state on cp2028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:20] <icinga-wm>	 PROBLEM - Check systemd state on cp1088 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:36] <icinga-wm>	 PROBLEM - Check systemd state on cp4036 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:40] <icinga-wm>	 PROBLEM - Check systemd state on cp5008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:52] <icinga-wm>	 PROBLEM - Check systemd state on cp1075 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:40:06] <icinga-wm>	 PROBLEM - Check systemd state on cp5005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:40:26] <icinga-wm>	 PROBLEM - Check systemd state on cp1080 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:40:34] <wikibugs>	 (CR) BCornwall: [C: +2] ats: Use variable for ATS 8 in ATS config monitor [puppet] - https://gerrit.wikimedia.org/r/831073 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[11:40:38] <icinga-wm>	 PROBLEM - Check systemd state on cp1083 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:40:38] <icinga-wm>	 PROBLEM - Check systemd state on cp1085 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:40:54] <icinga-wm>	 PROBLEM - Check systemd state on cp5013 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:38] <icinga-wm>	 PROBLEM - Check systemd state on cp1079 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:42:17] <wikibugs>	 (PS1) Clément Goubert: beta: don't duplicate fonts install [puppet] - https://gerrit.wikimedia.org/r/831076
[11:42:18] <icinga-wm>	 PROBLEM - Check systemd state on cp5012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:42:50] <icinga-wm>	 PROBLEM - Check systemd state on cp2041 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:00] <icinga-wm>	 PROBLEM - Check systemd state on cp1078 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:12] <icinga-wm>	 PROBLEM - Check systemd state on cp4034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:52] <icinga-wm>	 RECOVERY - Check systemd state on cp2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:56] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293 (MoritzMuehlenhoff) Hi Cole, I ran into this porting away things from the "raid" Puppet fact towards the new "raid_mgmt_tools" fact. All the slowness was originally caused by IPMI and th...
[11:44:02] <icinga-wm>	 PROBLEM - Check systemd state on cp3060 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:12] <icinga-wm>	 PROBLEM - Check systemd state on cp5009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (fnegri) @Jclark-ctr right now I cannot connect to cloudcephosd1030.mgmt.eqiad.wmnet with SSH.  Icinga is also show...
[11:44:40] <icinga-wm>	 PROBLEM - Check systemd state on cp2033 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:42] <wikibugs>	 (CR) Clément Goubert: "Small bugfix" [puppet] - https://gerrit.wikimedia.org/r/831076 (owner: Clément Goubert)
[11:45:18] <icinga-wm>	 PROBLEM - Check systemd state on cp2038 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P34333 and previous config saved to /var/cache/conftool/dbconfig/20220909-114522-ladsgroup.json
[11:45:52] <icinga-wm>	 PROBLEM - Check systemd state on cp5011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:54] <icinga-wm>	 PROBLEM - Check systemd state on cp4028 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:04] <icinga-wm>	 PROBLEM - Check systemd state on cp1084 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:08] <icinga-wm>	 PROBLEM - Check systemd state on cp3050 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:47] <wikibugs>	 (CR) Muehlenhoff: beta: don't duplicate fonts install (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831076 (owner: Clément Goubert)
[11:47:10] <icinga-wm>	 PROBLEM - Check systemd state on cp5006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:38] <icinga-wm>	 PROBLEM - Check systemd state on cp3062 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:40] <icinga-wm>	 PROBLEM - Check systemd state on cp1077 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:48] <icinga-wm>	 PROBLEM - Check systemd state on cp2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:48] <wikibugs>	 (PS2) Clément Goubert: beta: don't duplicate fonts install [puppet] - https://gerrit.wikimedia.org/r/831076 (https://phabricator.wikimedia.org/T317128)
[11:48:29] <wikibugs>	 (CR) Clément Goubert: beta: don't duplicate fonts install (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831076 (https://phabricator.wikimedia.org/T317128) (owner: Clément Goubert)
[11:48:58] <icinga-wm>	 RECOVERY - Check systemd state on cp1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:03] <wikibugs>	 (PS3) Clément Goubert: beta: don't duplicate fonts install [puppet] - https://gerrit.wikimedia.org/r/831076 (https://phabricator.wikimedia.org/T317128)
[11:49:08] <icinga-wm>	 RECOVERY - Check systemd state on cp2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:20] <icinga-wm>	 RECOVERY - Check systemd state on cp2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:20] <icinga-wm>	 RECOVERY - Check systemd state on cp2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:44] <brett>	 Sorry again!
[11:50:12] <icinga-wm>	 RECOVERY - Check systemd state on cp4023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:22] <icinga-wm>	 RECOVERY - Check systemd state on cp1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:34] <icinga-wm>	 RECOVERY - Check systemd state on cp3055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:48] <icinga-wm>	 RECOVERY - Check systemd state on cp3050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:34] <icinga-wm>	 RECOVERY - Check systemd state on cp4030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:42] <icinga-wm>	 RECOVERY - Check systemd state on cp1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:52] <icinga-wm>	 RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:54] <icinga-wm>	 RECOVERY - Check systemd state on cp4025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:58] <icinga-wm>	 RECOVERY - Check systemd state on cp2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:52:20] <icinga-wm>	 RECOVERY - Check systemd state on cp3062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:52:22] <icinga-wm>	 RECOVERY - Check systemd state on cp1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:52:30] <icinga-wm>	 RECOVERY - Check systemd state on cp2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:18] <icinga-wm>	 RECOVERY - Check systemd state on cp2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:18] <icinga-wm>	 RECOVERY - Check systemd state on cp2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:44] <icinga-wm>	 RECOVERY - Check systemd state on cp4036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:12] <icinga-wm>	 RECOVERY - Check systemd state on cp3052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:14] <icinga-wm>	 RECOVERY - Check systemd state on cp4029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:14] <icinga-wm>	 RECOVERY - Check systemd state on cp5006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:28] <icinga-wm>	 RECOVERY - Check systemd state on cp1086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:32] <icinga-wm>	 RECOVERY - Check systemd state on cp1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:44] <icinga-wm>	 RECOVERY - Check systemd state on cp1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:58] <icinga-wm>	 RECOVERY - Check systemd state on cp3061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:55:02] <icinga-wm>	 RECOVERY - Check systemd state on cp5013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:55:22] <wikibugs>	 SRE, Cloud-Services, Infrastructure-Foundations, netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (cmooney) @Andrew yep that's what was needed from the zone side so looking good there.  It's not actually returning any data for specific IPs in the range though....
[11:55:32] <wikibugs>	 (CR) Muehlenhoff: [C: +1] beta: don't duplicate fonts install [puppet] - https://gerrit.wikimedia.org/r/831076 (https://phabricator.wikimedia.org/T317128) (owner: Clément Goubert)
[11:55:38] <icinga-wm>	 RECOVERY - Check systemd state on cp2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:55:38] <icinga-wm>	 RECOVERY - Check systemd state on cp2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:55:46] <icinga-wm>	 RECOVERY - Check systemd state on cp4024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:55:46] <icinga-wm>	 RECOVERY - Check systemd state on cp1088 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:55:50] <icinga-wm>	 RECOVERY - Check systemd state on cp3056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:55:58] <icinga-wm>	 RECOVERY - Check systemd state on cp5003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:56:10] <icinga-wm>	 RECOVERY - Check systemd state on cp5008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:56:18] <icinga-wm>	 RECOVERY - Check systemd state on cp3051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:56:18] <icinga-wm>	 RECOVERY - Check systemd state on cp1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:56:36] <icinga-wm>	 RECOVERY - Check systemd state on cp5005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:56:58] <icinga-wm>	 RECOVERY - Check systemd state on cp2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:00] <icinga-wm>	 RECOVERY - Check systemd state on cp3058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:06] <icinga-wm>	 RECOVERY - Check systemd state on cp1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:08] <icinga-wm>	 RECOVERY - Check systemd state on cp1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:16] <icinga-wm>	 RECOVERY - Check systemd state on cp5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:18] <icinga-wm>	 RECOVERY - Check systemd state on cp4034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:28] <icinga-wm>	 RECOVERY - Check systemd state on cp4033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:38] <icinga-wm>	 RECOVERY - Check systemd state on cp4021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:04] <icinga-wm>	 RECOVERY - Check systemd state on cp1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:12] <icinga-wm>	 RECOVERY - Check systemd state on cp3054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:34] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:58:48] <icinga-wm>	 RECOVERY - Check systemd state on cp3063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:50] <icinga-wm>	 RECOVERY - Check systemd state on cp4035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:52] <icinga-wm>	 RECOVERY - Check systemd state on cp5012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:28] <icinga-wm>	 RECOVERY - Check systemd state on cp2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:36] <icinga-wm>	 RECOVERY - Check systemd state on cp2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:40] <icinga-wm>	 RECOVERY - Check systemd state on cp4022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:00] <icinga-wm>	 RECOVERY - Check systemd state on cp5011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:02] <icinga-wm>	 RECOVERY - Check systemd state on cp4028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:12] <icinga-wm>	 RECOVERY - Check systemd state on cp1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T312863)', diff saved to https://phabricator.wikimedia.org/P34334 and previous config saved to /var/cache/conftool/dbconfig/20220909-120029-ladsgroup.json
[12:00:33] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[12:00:34] <icinga-wm>	 RECOVERY - Check systemd state on cp3060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:44] <icinga-wm>	 RECOVERY - Check systemd state on cp5009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:35] <wikibugs>	 (PS1) Majavah: dynamicproxy: include prometheus redis exporter [puppet] - https://gerrit.wikimedia.org/r/831080
[12:02:16] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37187/console" [puppet] - https://gerrit.wikimedia.org/r/831080 (owner: Majavah)
[12:05:40] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[12:05:59] <wikibugs>	 (PS1) Btullis: Add configuration for k8s-dse in prometheus [puppet] - https://gerrit.wikimedia.org/r/831081 (https://phabricator.wikimedia.org/T310179)
[12:06:07] <wikibugs>	 (PS3) Vlad.shapik: WP: Remove division operation hack related to Python2 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393)
[12:06:48] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:07:50] <wikibugs>	 (PS2) Majavah: dynamicproxy: include prometheus redis exporter [puppet] - https://gerrit.wikimedia.org/r/831080
[12:07:58] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[12:08:52] <wikibugs>	 SRE-swift-storage: swift_ring_manager should be able to rebalance rings without making other changes - https://phabricator.wikimedia.org/T317409 (MatthewVernon)
[12:09:05] <wikibugs>	 SRE-swift-storage: swift_ring_manager should be able to rebalance rings without making other changes - https://phabricator.wikimedia.org/T317409 (MatthewVernon) p:Triage→Medium
[12:12:19] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37188/console" [puppet] - https://gerrit.wikimedia.org/r/831081 (https://phabricator.wikimedia.org/T310179) (owner: Btullis)
[12:15:37] <wikibugs>	 (PS1) Jelto: gitlab: allow git user to access backup folder [puppet] - https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463)
[12:15:50] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37189/console" [puppet] - https://gerrit.wikimedia.org/r/831081 (https://phabricator.wikimedia.org/T310179) (owner: Btullis)
[12:20:06] <wikibugs>	 (CR) JMeybohm: [C: -1] "Nice one! I've a couple of probably quite unstructured comments (sorry for that). Also I must admit that I have not rendered the template " [deployment-charts] - https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[12:24:03] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[12:24:04] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[12:24:46] <wikibugs>	 (CR) JMeybohm: [C: -1] thumbor: new service chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[12:24:50] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[12:25:19] <wikibugs>	 (CR) JMeybohm: helmfile.d: add thumbor configuration (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[12:27:12] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[12:28:13] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37190/console" [puppet] - https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[12:29:22] <wikibugs>	 (CR) Jelto: [V: +1] "see also https://phabricator.wikimedia.org/T274463#8224580" [puppet] - https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[12:29:56] <icinga-wm>	 PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:30:57] <wikibugs>	 (PS2) Btullis: Add configuration for k8s-dse in prometheus [puppet] - https://gerrit.wikimedia.org/r/831081 (https://phabricator.wikimedia.org/T310179)
[12:33:01] <wikibugs>	 (CR) Jbond: [C: +1] "LGTm see nit" [puppet] - https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[12:37:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[12:38:04] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for phedenskog - https://phabricator.wikimedia.org/T317401 (fgiunchedi) Open→Resolved a:fgiunchedi I have investigated this with @Peter via the ldap audit logs and found the following entry, which seems to point to an error while editing records:...
[12:54:13] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (Jclark-ctr) Sorry i had left it in a screen for hardware test  that is my mistake
[12:58:27] <wikibugs>	 (PS1) Filippo Giunchedi: grafana: audit Grafana API actions [puppet] - https://gerrit.wikimedia.org/r/831087
[13:00:00] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:00:22] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Add configuration for k8s-dse in prometheus [puppet] - https://gerrit.wikimedia.org/r/831081 (https://phabricator.wikimedia.org/T310179) (owner: Btullis)
[13:01:47] <wikibugs>	 (CR) Elukey: [C: +1] Add configuration for k8s-dse in prometheus [puppet] - https://gerrit.wikimedia.org/r/831081 (https://phabricator.wikimedia.org/T310179) (owner: Btullis)
[13:02:15] <wikibugs>	 (PS4) Vlad.shapik: Remove division operation hack related to Python2 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393)
[13:02:24] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:03:01] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[13:03:32] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:06:30] <jynus>	 something going on with kafka
[13:07:22] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:07:27] <elukey>	 there was a sudden increase in rsyslog-notice requests to process
[13:07:32] <jynus>	 since 13:05
[13:07:37] <elukey>	 seems gone now
[13:08:09] <jynus>	 I wonder which service as origin?
[13:08:26] <wikibugs>	 (CR) Hashar: [C: +2] "Tested it and it works!" [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/831063 (https://phabricator.wikimedia.org/T317404) (owner: Hashar)
[13:09:13] <wikibugs>	 (Merged) jenkins-bot: Add deployment configuration for devtools WMCS project [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/831063 (https://phabricator.wikimedia.org/T317404) (owner: Hashar)
[13:10:38] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:11:04] <wikibugs>	 (CR) Clément Goubert: [C: +2] beta: don't duplicate fonts install [puppet] - https://gerrit.wikimedia.org/r/831076 (https://phabricator.wikimedia.org/T317128) (owner: Clément Goubert)
[13:12:54] <wikibugs>	 (PS1) David Caro: opensatck: remove some not needed absented resources [puppet] - https://gerrit.wikimedia.org/r/831089
[13:16:25] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37192/console" [puppet] - https://gerrit.wikimedia.org/r/831089 (owner: David Caro)
[13:16:52] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:17:12] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (fnegri) No worries, now I was able to SSH, I created a test partition /dev/sde1 and indeed `mkfs.ext4 /dev/sde1` d...
[13:25:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2003:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[13:28:26] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:32:54] <wikibugs>	 ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T315352 (phaultfinder)
[13:33:58] <dcausse>	 !log restarting blazegraph on wdqs2003 (BlazegraphFreeAllocatorsDecreasingRapidly)
[13:33:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:31] <wikibugs>	 (PS1) DCausse: Revert "Disable CirrusSearch completion suggester" [mediawiki-config] - https://gerrit.wikimedia.org/r/830978
[13:37:54] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:37:55] <wikibugs>	 ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T315352 (phaultfinder)
[13:38:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance
[13:38:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance
[13:38:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2112 (T312863)', diff saved to https://phabricator.wikimedia.org/P34336 and previous config saved to /var/cache/conftool/dbconfig/20220909-133846-ladsgroup.json
[13:38:50] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[13:40:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2003:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[13:48:40] <wikibugs>	 (CR) Herron: [C: +1] pcc: Warn before compiling all nodes by default [puppet] - https://gerrit.wikimedia.org/r/830957 (https://phabricator.wikimedia.org/T222075) (owner: RLazarus)
[13:49:44] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:50:37] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission elastic10[48-52].eqiad.wmnet - https://phabricator.wikimedia.org/T316728 (Jclark-ctr) a:Cmjohnson→Jclark-ctr
[13:51:40] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission elastic10[48-52].eqiad.wmnet - https://phabricator.wikimedia.org/T316728 (Jclark-ctr)
[13:52:29] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission elastic10[48-52].eqiad.wmnet - https://phabricator.wikimedia.org/T316728 (Jclark-ctr) Open→Resolved Removed servers from racks and ran Offline script
[13:55:04] <wikibugs>	 (PS1) Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412)
[13:56:44] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:57:28] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (Jclark-ctr) @Jgreen   where the last two steps done before handed over?
[13:57:53] <wikibugs>	 (PS1) Ssingh: P:wikidough: update status message for service restart check [puppet] - https://gerrit.wikimedia.org/r/831094
[13:59:34] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37193/console" [puppet] - https://gerrit.wikimedia.org/r/831094 (owner: Ssingh)
[14:00:36] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "pedantic nitpick, otherwise LGTM" [alerts] - https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[14:04:06] <wikibugs>	 (PS7) JMeybohm: Alert on high Kubernetes API error rate [alerts] - https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251)
[14:04:08] <wikibugs>	 (PS5) JMeybohm: Alert on high Kubernetes API latency [alerts] - https://gerrit.wikimedia.org/r/830637 (https://phabricator.wikimedia.org/T311251)
[14:04:31] <wikibugs>	 (CR) JMeybohm: Alert on high Kubernetes API error rate (2 comments) [alerts] - https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[14:04:47] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] P:wikidough: update status message for service restart check [puppet] - https://gerrit.wikimedia.org/r/831094 (owner: Ssingh)
[14:07:40] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1035 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:54] <wikibugs>	 (CR) Herron: [C: +1] "Good idea!" [puppet] - https://gerrit.wikimedia.org/r/831087 (owner: Filippo Giunchedi)
[14:15:50] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:16:57] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (MoritzMuehlenhoff)
[14:17:16] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (MoritzMuehlenhoff) p:Triage→Medium
[14:17:57] <wikibugs>	 (PS1) Elukey: Move kafka on kafka-logging2001 to PKI TLS certificates [puppet] - https://gerrit.wikimedia.org/r/831096 (https://phabricator.wikimedia.org/T300130)
[14:19:25] <wikibugs>	 (PS2) Samtar: CommonSettings-labs.php: Set $wgPhonosFileBackend [mediawiki-config] - https://gerrit.wikimedia.org/r/830657 (https://phabricator.wikimedia.org/T317195)
[14:19:58] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37196/console" [puppet] - https://gerrit.wikimedia.org/r/831096 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[14:20:34] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:20:44] <dcausse>	 jouncebot: now
[14:20:44] <jouncebot>	 For the next 16 hour(s) and 39 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220909T0700)
[14:22:37] <TheresNoTime>	 thcipriani (or anyone), is there *any* chance that I could deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/830657 today? (production no-op, only modifying `wmf-config/CommonSettings-labs.php`)
[14:23:54] <dcausse>	 TheresNoTime: I think it's perfectly fine to ship "labs" only patches
[14:24:03] <wikibugs>	 (CR) DCausse: [C: +1] CommonSettings-labs.php: Set $wgPhonosFileBackend [mediawiki-config] - https://gerrit.wikimedia.org/r/830657 (https://phabricator.wikimedia.org/T317195) (owner: Samtar)
[14:25:45] <TheresNoTime>	 dcausse: "think" always worries me :P
[14:26:35] <dcausse>	 TheresNoTime: sure :), I'll have to ship one production patch soon, happy to +2 your patch at this time :) 
[14:28:16] <TheresNoTime>	 dcausse: I don't mind doing it if we're sure it's okay :) when are you planning on deploying?
[14:28:34] <dcausse>	 going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/830978 it's a followup of an issue that happened yesterday (related to elastic7 upgrade)
[14:28:37] <dcausse>	 TheresNoTime: now :)
[14:29:05] <TheresNoTime>	 dcausse: Okay :) I'll let you handle it, thank you!
[14:29:25] <wikibugs>	 (CR) DCausse: [C: +2] CommonSettings-labs.php: Set $wgPhonosFileBackend [mediawiki-config] - https://gerrit.wikimedia.org/r/830657 (https://phabricator.wikimedia.org/T317195) (owner: Samtar)
[14:30:20] <wikibugs>	 (Merged) jenkins-bot: CommonSettings-labs.php: Set $wgPhonosFileBackend [mediawiki-config] - https://gerrit.wikimedia.org/r/830657 (https://phabricator.wikimedia.org/T317195) (owner: Samtar)
[14:31:12] <wikibugs>	 (CR) DCausse: [C: +2] Revert "Disable CirrusSearch completion suggester" [mediawiki-config] - https://gerrit.wikimedia.org/r/830978 (owner: DCausse)
[14:32:07] <wikibugs>	 (Merged) jenkins-bot: Revert "Disable CirrusSearch completion suggester" [mediawiki-config] - https://gerrit.wikimedia.org/r/830978 (owner: DCausse)
[14:32:32] <icinga-wm>	 RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:33:04] <dcausse>	 testing my patch mwdebug1002
[14:33:58] <TheresNoTime>	 (manually triggered a beta sync/scap so I can test fwiw)
[14:34:26] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:34:29] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[14:35:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:36:17] <TheresNoTime>	 (all looks good for me, thanks dcausse)
[14:36:22] <dcausse>	 TheresNoTime: yw!
[14:36:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:36:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:36:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:39:18] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:39:56] <wikibugs>	 (CR) JMeybohm: [C: +2] Alert on high lateny of kubelet operations [alerts] - https://gerrit.wikimedia.org/r/830228 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[14:39:59] <wikibugs>	 (CR) JMeybohm: [C: +2] Alert on high Kubernetes API error rate [alerts] - https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[14:40:02] <wikibugs>	 (CR) JMeybohm: [C: +2] Alert on high Kubernetes API latency [alerts] - https://gerrit.wikimedia.org/r/830637 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[14:40:18] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons.
[14:40:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:41:37] <wikibugs>	 (Merged) jenkins-bot: Alert on high lateny of kubelet operations [alerts] - https://gerrit.wikimedia.org/r/830228 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[14:42:24] <wikibugs>	 (Merged) jenkins-bot: Alert on high Kubernetes API error rate [alerts] - https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[14:42:26] <wikibugs>	 (Merged) jenkins-bot: Alert on high Kubernetes API latency [alerts] - https://gerrit.wikimedia.org/r/830637 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[14:43:29] <logmsgbot>	 !log dcausse@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T317381: Revert "Disable CirrusSearch completion suggester" (duration: 03m 57s)
[14:43:32] <stashbot>	 T317381: Reduction in helpfulness and quantity of autocomplete search results - https://phabricator.wikimedia.org/T317381
[14:44:09] <moritzm>	 !log imported jenkins 2.346.3 to thirdparty/ci
[14:44:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:26] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Move kafka on kafka-logging2001 to PKI TLS certificates [puppet] - https://gerrit.wikimedia.org/r/831096 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[14:44:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (fnegri) The "kicked off" part is explained by @Jclark-ctr rebooting the instance. The partition disappearing inste...
[14:45:00] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (Jgreen) >>! In T315924#8224700, @Jclark-ctr wrote: > @Jgreen   where the last two steps done before handed over?   The unchecked steps were not done. Re. cumin/cookbook is as e...
[14:45:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (nskaggs) Yes, feel free to coordinate with @fnegri for the depooling portion. Thanks!
[14:45:40] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] grafana: audit Grafana API actions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831087 (owner: Filippo Giunchedi)
[14:46:48] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (Marostegui) @nskaggs can this be led by your team, as these proxies are from your service :-)
[14:47:35] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff) p:Triage→Medium
[14:48:11] <wikibugs>	 (PS1) Jgreen: Remove temporary _dmarcian TXT record, it has served its purpose. [dns] - https://gerrit.wikimedia.org/r/831101 (https://phabricator.wikimedia.org/T316899)
[14:50:14] <wikibugs>	 (CR) Jgreen: [C: +2] Remove temporary _dmarcian TXT record, it has served its purpose. [dns] - https://gerrit.wikimedia.org/r/831101 (https://phabricator.wikimedia.org/T316899) (owner: Jgreen)
[14:53:50] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:53:58] <wikibugs>	 SRE, Cloud-Services, Infrastructure-Foundations, netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (Andrew) Yep, we're using those IPs for rapid tests so most of the time they're unallocated.
[14:54:26] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1021 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:54:30] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.0.149:7001 on restbase1021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:54:42] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.0.148:7001 on restbase1021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:55:00] <icinga-wm>	 RECOVERY - SSH on restbase1021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:55:10] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1021 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:55:16] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.0.150:7001 on restbase1021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:55:50] <icinga-wm>	 RECOVERY - Restbase root url on restbase1021 is OK: HTTP OK: HTTP/1.1 200 - 17317 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[14:55:58] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:56:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (nskaggs) @Marostegui yes.  Sorry, my comment about coordination was directed towards @Cmjohnson.  Need to pick a convenient time for DCOPs and WMCS.
[14:56:12] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:56:30] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1021 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:56:40] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1021 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:57:30] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1021 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:57:38] <wikibugs>	 (PS2) JHathaway: mail::mx: Add support for PLAIN auth over tls [puppet] - https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815)
[14:58:18] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:58:32] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.0.148:9042 on restbase1021 is OK: TCP OK - 0.000 second response time on 10.64.0.148 port 9042 https://phabricator.wikimedia.org/T93886
[14:58:35] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/831089 (owner: David Caro)
[14:58:40] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.0.148:7001 on restbase1021 is OK: SSL OK - Certificate restbase1021-a valid until 2023-04-14 11:20:45 +0000 (expires in 216 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:58:52] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1021 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:59:00] <wikibugs>	 (CR) Jbond: [V: +1] O:prometheus: use map instead of reduce (1 comment) [puppet] - https://gerrit.wikimedia.org/r/817784 (https://phabricator.wikimedia.org/T313910) (owner: Jbond)
[14:59:12] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.0.149:7001 on restbase1021 is OK: SSL OK - Certificate restbase1021-b valid until 2023-04-14 11:20:48 +0000 (expires in 216 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:59:52] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.0.149:9042 on restbase1021 is OK: TCP OK - 0.000 second response time on 10.64.0.149 port 9042 https://phabricator.wikimedia.org/T93886
[14:59:58] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.0.150:7001 on restbase1021 is OK: SSL OK - Certificate restbase1021-c valid until 2023-04-14 11:20:51 +0000 (expires in 216 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:00:22] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.0.150:9042 on restbase1021 is OK: TCP OK - 0.000 second response time on 10.64.0.150 port 9042 https://phabricator.wikimedia.org/T93886
[15:00:45] <wikibugs>	 (CR) JHathaway: "Thanks again jbond, going ahead with merging" [puppet] - https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) (owner: JHathaway)
[15:00:54] <wikibugs>	 (PS3) JHathaway: mail::mx: Add support for PLAIN auth over tls [puppet] - https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815)
[15:01:42] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "I think this is fine. Most of the absented things are distracting but harmless files that used to be installed by upstream debian packages" [puppet] - https://gerrit.wikimedia.org/r/831089 (owner: David Caro)
[15:02:14] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:02:17] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) (owner: JHathaway)
[15:06:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312863)', diff saved to https://phabricator.wikimedia.org/P34338 and previous config saved to /var/cache/conftool/dbconfig/20220909-150651-ladsgroup.json
[15:06:55] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[15:06:58] <jinxer-wm>	 (KubernetesAPILatencySecretsLIST) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatencySecretsLIST
[15:07:19] <jayme>	 damn :)
[15:07:42] <wikibugs>	 10SRE-OnFire, 10Observability-Alerting, 10Discovery-Search (Current work), 10Sustainability (Incident Followup): Improve Search team alerting - https://phabricator.wikimedia.org/T313095 (10bking)
[15:09:12] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: sre: add paging alert for etcdmirror down [alerts] - 10https://gerrit.wikimedia.org/r/831103
[15:09:32] <wikibugs>	 (03PS1) 10Jgreen: DMARC External Domain Verification for wikipedia.org and w.wiki. [dns] - 10https://gerrit.wikimedia.org/r/831104 (https://phabricator.wikimedia.org/T211401)
[15:09:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) (owner: 10JHathaway)
[15:09:58] <wikibugs>	 10SRE-OnFire, 10Discovery-Search, 10Sustainability (Incident Followup): Better test environments for Elastic - https://phabricator.wikimedia.org/T317420 (10bking)
[15:10:44] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:11:58] <jinxer-wm>	 (KubernetesAPILatencySecretsLIST) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatencySecretsLIST
[15:12:49] <wikibugs>	 (CR) Jbond: [C: +1] opensatck: remove some not needed absented resources (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831089 (owner: David Caro)
[15:14:15] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/831056 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:14:17] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Patch-For-Review: Wikipedia.org DMARC "rua" and "ruf" email addresses need verification - https://phabricator.wikimedia.org/T211401 (10Jgreen) There's a nice summary of this issue here https://dmarcian.com/what-is-external-destination-verification/
[15:16:19] <wikibugs>	 (03PS2) 10Clément Goubert: sre: add paging alert for etcdmirror down [alerts] - 10https://gerrit.wikimedia.org/r/831103 (https://phabricator.wikimedia.org/T317402) (owner: 10Giuseppe Lavagetto)
[15:22:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P34339 and previous config saved to /var/cache/conftool/dbconfig/20220909-152159-ladsgroup.json
[15:23:19] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] mail::mx: Add support for PLAIN auth over tls [puppet] - 10https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) (owner: 10JHathaway)
[15:23:38] <thcipriani>	 TheresNoTime: looks like you got everything deployed. I'm fine with -labs deployments on Fridays as long as the deployment server stays tidy (i.e., fetch and rebase so the next deployer isn't surprised/confused)
[15:25:12] <TheresNoTime>	 thcipriani: okay, thank you for clarifying :)
[15:29:26] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] "All files look ok" [puppet] - 10https://gerrit.wikimedia.org/r/831039 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[15:30:03] <wikibugs>	 10SRE-OnFire, 10Observability-Alerting, 10Discovery-Search (Current work), 10Sustainability (Incident Followup): Improve Search team alerting - https://phabricator.wikimedia.org/T313095 (10bking) Suggestion from @EBernhardson :  "random guesses at what we need, the search reindex process looks at the old i...
[15:33:18] <wikibugs>	 (03PS1) 10Muehlenhoff: turnilo: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/831111
[15:36:44] <wikibugs>	 (03PS1) 10Cwhite: smart: restore get_fact and deprecate get_raid_drivers [puppet] - 10https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293)
[15:36:59] <jinxer-wm>	 (KubernetesAPILatencySecretsLIST) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatencySecretsLIST
[15:37:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P34340 and previous config saved to /var/cache/conftool/dbconfig/20220909-153706-ladsgroup.json
[15:39:48] <wikibugs>	 (03PS1) 10Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/831115
[15:41:58] <jinxer-wm>	 (KubernetesAPILatencySecretsLIST) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatencySecretsLIST
[15:42:25] <wikibugs>	 (03PS2) 10Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830955
[15:42:33] <wikibugs>	 (03CR) 10Jdlrobson: [C: 04-1] EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830955 (owner: 10Jdlrobson)
[15:42:40] <wikibugs>	 (03Abandoned) 10Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/831115 (owner: 10Jdlrobson)
[15:44:27] <wikibugs>	 10SRE-OnFire, 10serviceops, 10Sustainability (Incident Followup): Add failure rate triggered rollback to scap - https://phabricator.wikimedia.org/T317405 (10Clement_Goubert)
[15:45:03] <wikibugs>	 (03PS1) 10JMeybohm: Improve alerts on high Kubernetes API latency [alerts] - 10https://gerrit.wikimedia.org/r/831116 (https://phabricator.wikimedia.org/T311251)
[15:46:03] <wikibugs>	 (03PS1) 10Jdlrobson: Enable Nearby on Hebrew and French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831117 (https://phabricator.wikimedia.org/T246493)
[15:46:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable Nearby on Hebrew and French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831117 (https://phabricator.wikimedia.org/T246493) (owner: 10Jdlrobson)
[15:47:47] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Improve alerts on high Kubernetes API latency [alerts] - 10https://gerrit.wikimedia.org/r/831116 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm)
[15:50:10] <wikibugs>	 (03Merged) 10jenkins-bot: Improve alerts on high Kubernetes API latency [alerts] - 10https://gerrit.wikimedia.org/r/831116 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm)
[15:50:32] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:50:51] <wikibugs>	 (03PS2) 10Jelto: gitlab: allow git user to access backup folder [puppet] - 10https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463)
[15:52:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312863)', diff saved to https://phabricator.wikimedia.org/P34341 and previous config saved to /var/cache/conftool/dbconfig/20220909-155213-ladsgroup.json
[15:52:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[15:52:17] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[15:52:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[15:52:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T312863)', diff saved to https://phabricator.wikimedia.org/P34342 and previous config saved to /var/cache/conftool/dbconfig/20220909-155234-ladsgroup.json
[15:52:58] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37197/console" [puppet] - 10https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto)
[15:55:20] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:55:58] <wikibugs>	 (03PS3) 10Jelto: gitlab: allow git user to access backup folder [puppet] - 10https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463)
[15:57:02] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37198/console" [puppet] - 10https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto)
[15:59:00] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] gitlab: allow git user to access backup folder (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831083 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[16:05:50] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:13:50] <icinga-wm>	 PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:14:20] <wikibugs>	 (03PS1) 10David Caro: bootstrap_and_add: added preflight checks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021)
[16:16:24] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:18:14] <wikibugs>	 (03Abandoned) 10Jforrester: Check $thumb->isError() before trying to use it [extensions/PageImages] (wmf/1.37.0-wmf.23) - 10https://gerrit.wikimedia.org/r/720858 (https://phabricator.wikimedia.org/T290973) (owner: 10Jforrester)
[16:18:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] bootstrap_and_add: added preflight checks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: 10David Caro)
[16:18:46] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:21:29] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons.
[16:23:06] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.2.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=
[16:25:28] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[16:37:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:37:48] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:42:13] <wikibugs>	 (03PS2) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412)
[16:44:26] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830197 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno)
[16:44:52] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:15:04] <icinga-wm>	 RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:19:22] <wikibugs>	 (03CR) 10Dwisehaupt: [C: 03+1] "Talked through this with jgreen and I'm happy with it. I agree that additional SRE eyes on this change would be good." [dns] - 10https://gerrit.wikimedia.org/r/831104 (https://phabricator.wikimedia.org/T211401) (owner: 10Jgreen)
[17:42:05] <wikibugs>	 (03Abandoned) 10Andrew Bogott: dynamic proxy: block a second troublesome UA [puppet] - 10https://gerrit.wikimedia.org/r/830934 (owner: 10Andrew Bogott)
[17:42:10] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:46:46] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:46:56] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:49:08] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:53:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Kanban): Replace cloudnet100[34] with cloudnet100[56] - https://phabricator.wikimedia.org/T316284 (Andrew) a:Andrew→aborrero
[18:01:02] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:01:36] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 114 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:04:00] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:10:34] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:21:11] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons.
[18:21:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) @TAndic this configuration has now been deployed to prod and tested. I can provide you the credentials so you can setup the Qualtrics side.
[18:25:00] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:39:10] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:43:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:46:18] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:46:26] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:17:16] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:19:38] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:24:32] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:26:56] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:33:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T312863)', diff saved to https://phabricator.wikimedia.org/P34343 and previous config saved to /var/cache/conftool/dbconfig/20220909-193316-ladsgroup.json
[19:33:20] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[19:34:14] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/830923 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking)
[19:43:38] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:44:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:45:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:45:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:46:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:48:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P34344 and previous config saved to /var/cache/conftool/dbconfig/20220909-194822-ladsgroup.json
[19:50:40] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:50:48] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:55:28] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:00:18] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:02:20] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons.
[20:03:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112', diff saved to https://phabricator.wikimedia.org/P34345 and previous config saved to /var/cache/conftool/dbconfig/20220909-200329-ladsgroup.json
[20:03:51] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.ganeti.makevm for new host dispatch-be1001.eqiad.wmnet
[20:03:52] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.dns.netbox
[20:08:14] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37199/" [puppet] - 10https://gerrit.wikimedia.org/r/830284 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[20:12:33] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "one of the dependency cycles you don't see in the compiler. and yea, this is because we are using systemd::sysuser..." [puppet] - https://gerrit.wikimedia.org/r/830284 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[20:15:05] <wikibugs>	 (03PS1) 10Dzahn: phabricator: remove require for homedir from systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/831144 (https://phabricator.wikimedia.org/T280597)
[20:16:01] <wikibugs>	 (03CR) 10Nray: "fyi, I just merged https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/829220 which will cause a number of expected visual changes. " [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830955 (owner: 10Jdlrobson)
[20:16:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:18:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2112 (T312863)', diff saved to https://phabricator.wikimedia.org/P34347 and previous config saved to /var/cache/conftool/dbconfig/20220909-201835-ladsgroup.json
[20:18:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[20:18:41] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[20:18:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[20:18:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T312863)', diff saved to https://phabricator.wikimedia.org/P34348 and previous config saved to /var/cache/conftool/dbconfig/20220909-201857-ladsgroup.json
[20:19:40] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: remove require for homedir from systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/831144 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[20:19:57] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/831144" [puppet] - 10https://gerrit.wikimedia.org/r/830284 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[20:23:33] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "noop on phab1001,phab2001 and phab2002. phab1004 still pages full of dependency problems. the difference is that here the phd user was not" [puppet] - 10https://gerrit.wikimedia.org/r/831144 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[20:25:27] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "probably means we would see more errors related to systemd::sysuser if we applied the role on new hosts..and we just don't see those becau" [puppet] - 10https://gerrit.wikimedia.org/r/831144 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[20:27:14] <logmsgbot>	 !log herron@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[20:27:14] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.dns.wipe-cache dispatch-be1001.eqiad.wmnet on all recursors
[20:27:18] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dispatch-be1001.eqiad.wmnet on all recursors
[20:27:21] <logmsgbot>	 !log herron@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host dispatch-be1001.eqiad.wmnet
[20:36:25] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission elastic10[48-52].eqiad.wmnet - https://phabricator.wikimedia.org/T316728 (10herron) Hello, I'm seeing some pending wmfNNNN.mgmt forward/reverse dns record removals which may be related to this task, here's a paste https://phabricator.wikimedia.org/P34...
[20:37:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[20:38:24] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[20:39:38] <wikibugs>	 (03PS3) 10Dzahn: phabricator::phd: actually use $phd_user variable and small improvements [puppet] - 10https://gerrit.wikimedia.org/r/830285 (https://phabricator.wikimedia.org/T280597)
[20:40:48] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:03:10] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:16:38] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:17:31] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/37200/" [puppet] - 10https://gerrit.wikimedia.org/r/830285 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[21:27:48] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] phabricator::phd: actually use $phd_user variable and small improvements [puppet] - 10https://gerrit.wikimedia.org/r/830285 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[21:28:32] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:29:22] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic2052 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:29:22] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9200 on elastic2031 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:29:22] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9200 on elastic2042 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:29:22] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9200 on elastic2025 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:29:22] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic2042 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:29:22] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic2047 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:41:18] <wikibugs>	 (03PS3) 10Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830955 (https://phabricator.wikimedia.org/T315261)
[21:42:52] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:42:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:52:26] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:54:14] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:54:18] <icinga-wm>	 RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:57:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T312863)', diff saved to https://phabricator.wikimedia.org/P34349 and previous config saved to /var/cache/conftool/dbconfig/20220909-215704-ladsgroup.json
[21:57:08] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[22:02:02] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:06:59] <wikibugs>	 (PS1) Dzahn: phabricator: do not use systemd::sysuser on phab1004 [puppet] - https://gerrit.wikimedia.org/r/831154
[22:07:40] <wikibugs>	 (PS2) Dzahn: phabricator: do not use systemd::sysuser on phab1004 [puppet] - https://gerrit.wikimedia.org/r/831154
[22:11:36] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:12:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P34350 and previous config saved to /var/cache/conftool/dbconfig/20220909-221210-ladsgroup.json
[22:25:19] <wikibugs>	 (CR) Dzahn: [C: +2] "just a test / debugging" [puppet] - https://gerrit.wikimedia.org/r/831154 (owner: Dzahn)
[22:26:54] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission elastic10[48-52].eqiad.wmnet - https://phabricator.wikimedia.org/T316728 (Jclark-ctr) @herron they are good to be removed if they are related to these servers
[22:27:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P34351 and previous config saved to /var/cache/conftool/dbconfig/20220909-222717-ladsgroup.json
[22:27:55] <wikibugs>	 (CR) Cwhite: [C: +1] Move kafka on kafka-logging2001 to PKI TLS certificates [puppet] - https://gerrit.wikimedia.org/r/831096 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[22:28:02] <icinga-wm>	 RECOVERY - Check that envoy is running on phab1004 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[22:28:34] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: reduce replica count to 1 after 1 day [puppet] - https://gerrit.wikimedia.org/r/830690 (https://phabricator.wikimedia.org/T313099) (owner: Cwhite)
[22:29:10] <wikibugs>	 (PS4) Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830955 (https://phabricator.wikimedia.org/T315261)
[22:29:22] <wikibugs>	 (CR) Dzahn: [C: +2] "after this the scap system user was created and puppet could do a bunch of things it could not do before. there are other errors cause by " [puppet] - https://gerrit.wikimedia.org/r/831154 (owner: Dzahn)
[22:30:38] <icinga-wm>	 RECOVERY - Check no envoy runtime configuration is left persistent on phab1004 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[22:42:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T312863)', diff saved to https://phabricator.wikimedia.org/P34352 and previous config saved to /var/cache/conftool/dbconfig/20220909-224223-ladsgroup.json
[22:42:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[22:42:27] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[22:42:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[22:42:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T312863)', diff saved to https://phabricator.wikimedia.org/P34353 and previous config saved to /var/cache/conftool/dbconfig/20220909-224245-ladsgroup.json
[22:45:05] <wikibugs>	 (CR) Yahya: [C: +1] Create six more namespaces (three content namespaces and their corresponding three discussion namespaces) on the bn.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/830982 (https://phabricator.wikimedia.org/T317424) (owner: Aishik Rehman)
[22:54:44] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:01:52] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:15:43] <wikibugs>	 (PS5) Jdlrobson: EXPECTED VISUAL CHANGES IN origin/wmf/1.39.0-wmf.28 [skins/Vector] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830955 (https://phabricator.wikimedia.org/T315261)
[23:16:14] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:16:14] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:32:52] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:35:18] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:37:42] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:49:40] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:56:52] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:58:15] <wikibugs>	 SRE, DBA, MediaWiki-General, Patch-For-Review: img_metadata queries for PDF files saturates s4 replicas - https://phabricator.wikimedia.org/T147296 (Krinkle)
[23:58:23] <wikibugs>	 (CR) RLazarus: [C: +2] pcc: Warn before compiling all nodes by default [puppet] - https://gerrit.wikimedia.org/r/830957 (https://phabricator.wikimedia.org/T222075) (owner: RLazarus)
[23:58:43] <wikibugs>	 SRE, DBA, MediaWiki-General: img_metadata queries for PDF files saturates s4 replicas - https://phabricator.wikimedia.org/T147296 (Krinkle)
[23:59:01] <wikibugs>	 SRE, DBA, MediaWiki-File-management: img_metadata queries for PDF files saturates s4 replicas - https://phabricator.wikimedia.org/T147296 (Krinkle)