[00:03:37] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:09:47] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:28:09] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:21] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:37:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:46:23] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:53:43] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:14:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:15:23] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:19:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:20:13] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:25:11] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:01] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:47:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:59:29] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.dns.netbox
[02:01:45] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:01:46] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[02:02:16] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db2169
[02:02:17] <logmsgbot>	 !log pt1979@cumin1001 END (ERROR) - Cookbook sre.network.configure-switch-interfaces (exit_code=97) for host db2169
[02:04:07] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kafka-logging1005:
[02:04:27] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-logging1005:
[02:05:19] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (Papaul)
[02:07:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:25:51] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:33:07] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:38:25] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:38:31] <wikibugs>	 (PS14) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040)
[02:46:07] <icinga-wm>	 PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:48:40] <wikibugs>	 (CR) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (11 comments) [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[03:22:55] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:47:25] <icinga-wm>	 RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:10:40] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) Open→Resolved a:tstarling
[04:28:13] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:35:37] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:37:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:51:27] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:01:11] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:09:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 23 hosts with reason: Primary switchover s3 T316622
[05:09:57] <stashbot>	 T316622: Switchover s3 master (db1157 -> db1123) - https://phabricator.wikimedia.org/T316622
[05:10:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s3 T316622
[05:10:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1123 with weight 0 T316622', diff saved to https://phabricator.wikimedia.org/P34128 and previous config saved to /var/cache/conftool/dbconfig/20220908-051043-ladsgroup.json
[05:13:29] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:18:17] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:29:52] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) Here's a model of the benefit of the multi-DC project for users west of codfw. The servers are 30ms closer, but codfw seems a bit slower, so if yo...
[05:30:02] <wikibugs>	 (PS1) Marostegui: db1202: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830717
[05:30:25] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:30:28] <wikibugs>	 (PS2) Ladsgroup: mariadb: Promote db1123 to s3 master [puppet] - https://gerrit.wikimedia.org/r/827860 (https://phabricator.wikimedia.org/T316622) (owner: Gerrit maintenance bot)
[05:30:38] <wikibugs>	 (CR) Ladsgroup: [C: +2] mariadb: Promote db1123 to s3 master [puppet] - https://gerrit.wikimedia.org/r/827860 (https://phabricator.wikimedia.org/T316622) (owner: Gerrit maintenance bot)
[05:30:53] <wikibugs>	 (CR) Marostegui: [C: +2] db1202: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830717 (owner: Marostegui)
[05:32:24] <wikibugs>	 (CR) Marostegui: [C: +1] "Let's document all this in wikitech. There are so many new options that it is hard to follow 😊" [software] - https://gerrit.wikimedia.org/r/830671 (owner: Ladsgroup)
[05:32:48] <wikibugs>	 (CR) Marostegui: [C: +1] auto_schema: Add support for a graceful stop [software] - https://gerrit.wikimedia.org/r/830672 (owner: Ladsgroup)
[05:35:19] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:37:05] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1200 [puppet] - https://gerrit.wikimedia.org/r/830718
[05:37:54] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1200 [puppet] - https://gerrit.wikimedia.org/r/830718 (owner: Marostegui)
[05:41:49] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1202 to dbctl [puppet] - https://gerrit.wikimedia.org/r/830720 (https://phabricator.wikimedia.org/T316342)
[05:42:26] <wikibugs>	 (CR) Ladsgroup: [C: +2] auto_schema: Allow runnint it on one dc only with --dc (1 comment) [software] - https://gerrit.wikimedia.org/r/830671 (owner: Ladsgroup)
[05:42:31] <wikibugs>	 (CR) Ladsgroup: [C: +2] auto_schema: Add support for a graceful stop [software] - https://gerrit.wikimedia.org/r/830672 (owner: Ladsgroup)
[05:43:08] <wikibugs>	 (Merged) jenkins-bot: auto_schema: Allow runnint it on one dc only with --dc [software] - https://gerrit.wikimedia.org/r/830671 (owner: Ladsgroup)
[05:43:13] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1202 to dbctl [puppet] - https://gerrit.wikimedia.org/r/830720 (https://phabricator.wikimedia.org/T316342) (owner: Marostegui)
[05:43:21] <wikibugs>	 (Merged) jenkins-bot: auto_schema: Add support for a graceful stop [software] - https://gerrit.wikimedia.org/r/830672 (owner: Ladsgroup)
[05:44:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1202 to s7, depooled, T316342', diff saved to https://phabricator.wikimedia.org/P34129 and previous config saved to /var/cache/conftool/dbconfig/20220908-054429-marostegui.json
[05:44:33] <stashbot>	 T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342
[05:49:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Pooling back db2140', diff saved to https://phabricator.wikimedia.org/P34130 and previous config saved to /var/cache/conftool/dbconfig/20220908-054921-ladsgroup.json
[05:54:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pooling db1202 for the first time in s7 T316342', diff saved to https://phabricator.wikimedia.org/P34131 and previous config saved to /var/cache/conftool/dbconfig/20220908-055451-marostegui.json
[05:54:55] <stashbot>	 T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342
[05:55:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight for db1194', diff saved to https://phabricator.wikimedia.org/P34132 and previous config saved to /var/cache/conftool/dbconfig/20220908-055546-marostegui.json
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T0600).
[06:00:09] <marostegui>	 o/
[06:00:12] <Amir1>	 o/
[06:00:28] <Amir1>	 I realized testwiki is on s3
[06:00:32] <Amir1>	 that's nice
[06:00:56] <Amir1>	 !log Starting s3 eqiad failover from db1157 to db1123 - T316622
[06:00:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:59] <stashbot>	 T316622: Switchover s3 master (db1157 -> db1123) - https://phabricator.wikimedia.org/T316622
[06:01:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T316622', diff saved to https://phabricator.wikimedia.org/P34133 and previous config saved to /var/cache/conftool/dbconfig/20220908-060110-ladsgroup.json
[06:01:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1123 to s3 primary and set section read-write T316622', diff saved to https://phabricator.wikimedia.org/P34134 and previous config saved to /var/cache/conftool/dbconfig/20220908-060138-ladsgroup.json
[06:02:26] <Amir1>	 writes coming
[06:02:33] <marostegui>	 \o/
[06:03:26] <wikibugs>	 (PS2) Ladsgroup: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/827861 (https://phabricator.wikimedia.org/T316622) (owner: Gerrit maintenance bot)
[06:03:34] <wikibugs>	 (CR) Ladsgroup: [C: +2] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/827861 (https://phabricator.wikimedia.org/T316622) (owner: Gerrit maintenance bot)
[06:03:37] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/827861 (https://phabricator.wikimedia.org/T316622) (owner: Gerrit maintenance bot)
[06:04:27] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 2%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34135 and previous config saved to /var/cache/conftool/dbconfig/20220908-060426-root.json
[06:04:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1157 T316622', diff saved to https://phabricator.wikimedia.org/P34136 and previous config saved to /var/cache/conftool/dbconfig/20220908-060438-ladsgroup.json
[06:07:11] <wikibugs>	 (PS1) Marostegui: db1203: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830721 (https://phabricator.wikimedia.org/T316342)
[06:07:47] <kostajh>	 _joe_: is there a header that shows if the response came from a server using php7.2 or php7.4?
[06:07:48] <wikibugs>	 (Abandoned) Marostegui: mariadb: Switch s6 primary db1173 -> db1131 [puppet] - https://gerrit.wikimedia.org/r/764784 (https://phabricator.wikimedia.org/T300471) (owner: Kormat)
[06:08:01] <wikibugs>	 (Abandoned) Marostegui: wmnet: Update s6-master CNAME. [dns] - https://gerrit.wikimedia.org/r/764786 (https://phabricator.wikimedia.org/T300471) (owner: Kormat)
[06:08:14] <wikibugs>	 (CR) Marostegui: [C: +2] db1203: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830721 (https://phabricator.wikimedia.org/T316342) (owner: Marostegui)
[06:08:21] <_joe_>	 kostajh: X-Powered-By, but it gets stripped in the response to you unless you go via X-Wikimedia-Debug
[06:08:26] <_joe_>	 kostajh: why are you asking?
[06:08:57] <kostajh>	 ah I was hoping I could get it without X-Wikimedia-Debug
[06:09:47] <_joe_>	 kostajh: what are you trying to figure out?
[06:09:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[06:09:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[06:10:10] <_joe_>	 maybe I can help
[06:10:11] <kostajh>	 _joe_: looking into T317187. I recall there was an issue about incompatible serialization formats that AFAIK has been resolved, but it made me wonder if somehow cache entries are being invalidated for visitors to Special:Homepage if their traffic varies from php7.2 to php7.4 from one request to the next
[06:10:12] <stashbot>	 T317187: GrowthExperiments Special:Homepage: investigate performance regression in wmf.28 - https://phabricator.wikimedia.org/T317187
[06:10:47] <_joe_>	 kostajh: cache entries in their browser?
[06:10:56] <kostajh>	 _joe_: WANObjectCache
[06:11:02] <_joe_>	 I doubt that can be the case, but you can check your cookies
[06:11:57] <kostajh>	 before a visitor goes to Special:Homepage, we do an (expensive) set of calls to ElasticSearch. That gets put into a cache entry with WANObjectCache, and we look it up on visit to Special:Homepage. The spikiness seen in T317187 could be explained by that.
[06:12:07] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1203 to dbctl [puppet] - https://gerrit.wikimedia.org/r/830722 (https://phabricator.wikimedia.org/T316342)
[06:12:51] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1203 to dbctl [puppet] - https://gerrit.wikimedia.org/r/830722 (https://phabricator.wikimedia.org/T316342) (owner: Marostegui)
[06:14:14] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'Add db1203 to s8, depooled, T316342', diff saved to https://phabricator.wikimedia.org/P34137 and previous config saved to /var/cache/conftool/dbconfig/20220908-061413-marostegui.json
[06:14:17] <stashbot>	 T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342
[06:14:54] <_joe_>	 kostajh: so if you want to verify this
[06:15:24] <_joe_>	 you can add a specific cookie to force yourself onto one php version
[06:15:54] <_joe_>	 but also, the chances of a single user switching php versions between the visits to that page are small
[06:16:13] <_joe_>	 we try to get people to stick to one interpreter consistently as much as possible
[06:16:41] <kostajh>	 ok, yeah that was my understanding as well.
[06:16:45] <_joe_>	 kostajh: have you tried, going via X-W-D, to visit that page on php 7.4 twice
[06:16:55] <_joe_>	 then go visit it with 7.2 another time
[06:17:04] <_joe_>	 and see if that last call is slower than the second
[06:17:17] <_joe_>	 and if it is, start profiling the thing and see where the time is spent
[06:17:44] <_joe_>	 there's also what Tim just said on the task, image-suggestion being slow to respond at times
[06:17:55] <kostajh>	 do you know offhand which cookie to set to opt-in to 7.2?
[06:18:13] <_joe_>	 one sec, need to look at the code
[06:18:22] <kostajh>	 right, but the image-suggestion thing shouldn't be directly related – that is not blocking page render on Special:Homepage
[06:18:52] <_joe_>	 https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikimediaEvents/+/refs/heads/master/modules/ext.wikimediaEvents/phpEngine.js#34
[06:18:59] <_joe_>	 PHP_ENGINE_STICKY
[06:19:11] <kostajh>	 the calls to image-suggestion service happen in a deferred update. If it causes slow down for the user, it is when they visit an article and need the image suggestion metadata, not on special:homepage which is where we observe the spikiness
[06:19:13] <_joe_>	 see the comment above
[06:19:24] <_joe_>	 ack
[06:19:33] <_joe_>	 brb
[06:19:46] <wikibugs>	 (CR) Slyngshede: [C: +1] "Looks good, no functional changes." [puppet] - https://gerrit.wikimedia.org/r/830184 (https://phabricator.wikimedia.org/T273673) (owner: Muehlenhoff)
[06:19:56] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 3%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34138 and previous config saved to /var/cache/conftool/dbconfig/20220908-061955-root.json
[06:21:38] <wikibugs>	 (CR) Slyngshede: [V: +1] C:spamassassin Allow debugging of why service fails. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: Slyngshede)
[06:21:45] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] C:spamassassin Allow debugging of why service fails. [puppet] - https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: Slyngshede)
[06:22:28] <kostajh>	 _joe_: thanks. it seems like cache lookup is broken somehow. back in a bit...
[06:23:14] <icinga-wm>	 PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2022-10-08 06:21:52 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/
[06:25:19] <_joe_>	 kostajh: if your object is now overflowing a certain size, maybe you're failing to save it to the cache
[06:35:26] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 4%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34139 and previous config saved to /var/cache/conftool/dbconfig/20220908-063525-root.json
[06:37:38] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 2%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34140 and previous config saved to /var/cache/conftool/dbconfig/20220908-063737-root.json
[06:41:26] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Cleanup some more stale references/comments to crons [puppet] - https://gerrit.wikimedia.org/r/830184 (https://phabricator.wikimedia.org/T273673) (owner: Muehlenhoff)
[06:43:03] <wikibugs>	 (PS1) Marostegui: es2026,es2028,es2020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830723
[06:44:21] <wikibugs>	 (PS2) Marostegui: es2026,es2027,es2028: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830723
[06:44:51] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'Depool es2026, es2027, es2028', diff saved to https://phabricator.wikimedia.org/P34141 and previous config saved to /var/cache/conftool/dbconfig/20220908-064450-marostegui.json
[06:44:59] <wikibugs>	 (CR) Marostegui: [C: +2] es2026,es2027,es2028: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830723 (owner: Marostegui)
[06:50:54] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM! Thanks!" [puppet] - https://gerrit.wikimedia.org/r/830684 (https://phabricator.wikimedia.org/T300130) (owner: Cwhite)
[06:50:56] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 5%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34142 and previous config saved to /var/cache/conftool/dbconfig/20220908-065054-root.json
[06:51:03] <icinga-wm>	 PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:51:27] <wikibugs>	 (PS8) Ladsgroup: data-persistence: Add alert for replication lag [alerts] - https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866)
[06:51:32] <wikibugs>	 (CR) Ladsgroup: [C: +2] data-persistence: Add alert for replication lag (1 comment) [alerts] - https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: Ladsgroup)
[06:52:17] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34143 and previous config saved to /var/cache/conftool/dbconfig/20220908-065216-root.json
[06:52:20] <wikibugs>	 (PS1) Marostegui: Revert "es2026,es2027,es2028: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830603
[06:52:23] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34144 and previous config saved to /var/cache/conftool/dbconfig/20220908-065222-root.json
[06:52:30] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34145 and previous config saved to /var/cache/conftool/dbconfig/20220908-065229-root.json
[06:53:05] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2026,es2027,es2028: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830603 (owner: Marostegui)
[06:53:08] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 3%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34146 and previous config saved to /var/cache/conftool/dbconfig/20220908-065306-root.json
[06:53:11] <wikibugs>	 (Merged) jenkins-bot: data-persistence: Add alert for replication lag [alerts] - https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: Ladsgroup)
[06:54:55] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Drop a few now obsolete permission [puppet] - https://gerrit.wikimedia.org/r/830656 (owner: Muehlenhoff)
[06:54:57] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1201 [puppet] - https://gerrit.wikimedia.org/r/830724
[06:55:54] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1201 [puppet] - https://gerrit.wikimedia.org/r/830724 (owner: Marostegui)
[06:57:19] <wikibugs>	 (CR) Elukey: [C: +2] ml-serve: raise connection limit to the MW API [deployment-charts] - https://gerrit.wikimedia.org/r/830661 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[07:00:04] <jouncebot>	 Amir1, apergos, and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T0700).
[07:00:14] <apergos>	 morning! there are no trainees signed up for the morning slot and no patches scheduled in the window
[07:00:35] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[07:00:47] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[07:00:47] <wikibugs>	 (PS1) Ayounsi: Add Junos image validation [cookbooks] - https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539)
[07:00:53] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[07:01:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[07:01:23] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[07:01:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[07:02:22] <wikibugs>	 (PS1) Marostegui: es2029,es2030,es2031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830728
[07:02:24] <wikibugs>	 (PS1) Ayounsi: Depool esams for routers upgrades [dns] - https://gerrit.wikimedia.org/r/830729 (https://phabricator.wikimedia.org/T295690)
[07:02:54] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (Marostegui)
[07:04:30] <wikibugs>	 (PS1) Ayounsi: Disable VRRP auth for esams [homer/public] - https://gerrit.wikimedia.org/r/830730 (https://phabricator.wikimedia.org/T295690)
[07:06:26] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 10%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34147 and previous config saved to /var/cache/conftool/dbconfig/20220908-070625-root.json
[07:07:47] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34148 and previous config saved to /var/cache/conftool/dbconfig/20220908-070746-root.json
[07:07:53] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34149 and previous config saved to /var/cache/conftool/dbconfig/20220908-070752-root.json
[07:08:01] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34150 and previous config saved to /var/cache/conftool/dbconfig/20220908-070800-root.json
[07:08:37] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 4%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34151 and previous config saved to /var/cache/conftool/dbconfig/20220908-070836-root.json
[07:08:41] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:10:09] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:14:08] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [dns] - https://gerrit.wikimedia.org/r/830729 (https://phabricator.wikimedia.org/T295690) (owner: Ayounsi)
[07:14:32] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "lgtm!" [homer/public] - https://gerrit.wikimedia.org/r/830730 (https://phabricator.wikimedia.org/T295690) (owner: Ayounsi)
[07:21:55] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 25%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34152 and previous config saved to /var/cache/conftool/dbconfig/20220908-072154-root.json
[07:23:17] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34153 and previous config saved to /var/cache/conftool/dbconfig/20220908-072316-root.json
[07:23:22] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34154 and previous config saved to /var/cache/conftool/dbconfig/20220908-072321-root.json
[07:23:31] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34155 and previous config saved to /var/cache/conftool/dbconfig/20220908-072330-root.json
[07:23:31] <icinga-wm>	 RECOVERY - Disk space on apt1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=apt1001&var-datasource=eqiad+prometheus/ops
[07:24:06] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 5%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34156 and previous config saved to /var/cache/conftool/dbconfig/20220908-072405-root.json
[07:37:25] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 50%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34157 and previous config saved to /var/cache/conftool/dbconfig/20220908-073724-root.json
[07:38:47] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34158 and previous config saved to /var/cache/conftool/dbconfig/20220908-073846-root.json
[07:38:52] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34159 and previous config saved to /var/cache/conftool/dbconfig/20220908-073851-root.json
[07:39:01] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34160 and previous config saved to /var/cache/conftool/dbconfig/20220908-073900-root.json
[07:39:36] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 10%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34161 and previous config saved to /var/cache/conftool/dbconfig/20220908-073935-root.json
[07:40:26] <wikibugs>	 (CR) Ayounsi: [C: +2] Depool esams for routers upgrades [dns] - https://gerrit.wikimedia.org/r/830729 (https://phabricator.wikimedia.org/T295690) (owner: Ayounsi)
[07:41:06] <XioNoX>	 !log depool esams for routers upgrade - T295690
[07:41:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:08] <stashbot>	 T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690
[07:44:37] <wikibugs>	 (CR) Vgutierrez: [C: -1] "the original idea for supporting docker in these tests is being able to move the execution to our CI at some point. In that environment wh" [puppet] - https://gerrit.wikimedia.org/r/826367 (owner: BCornwall)
[07:52:55] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 75%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34162 and previous config saved to /var/cache/conftool/dbconfig/20220908-075253-root.json
[07:54:17] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34163 and previous config saved to /var/cache/conftool/dbconfig/20220908-075416-root.json
[07:54:22] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34164 and previous config saved to /var/cache/conftool/dbconfig/20220908-075421-root.json
[07:54:30] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34165 and previous config saved to /var/cache/conftool/dbconfig/20220908-075429-root.json
[07:55:05] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 25%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34166 and previous config saved to /var/cache/conftool/dbconfig/20220908-075504-root.json
[07:55:15] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:59:00] <wikibugs>	 (CR) Vgutierrez: [C: -1] "If you need specific support for podman to make it easier to run the tests in your environment please go ahead" [puppet] - https://gerrit.wikimedia.org/r/826367 (owner: BCornwall)
[08:00:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1001.eqiad.wmnet
[08:02:05] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:06:39] <hashar>	 kostajh: is the GrowthExperiments Special:Homepage performance regression only for Growths or is that general? ( T317187 )
[08:06:40] <stashbot>	 T317187: GrowthExperiments Special:Homepage: investigate performance regression in wmf.28 - https://phabricator.wikimedia.org/T317187
[08:07:28] <XioNoX>	 !log drain traffic from cr3-esams - T295690
[08:07:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:31] <stashbot>	 T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690
[08:08:24] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 100%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34167 and previous config saved to /var/cache/conftool/dbconfig/20220908-080823-root.json
[08:08:56] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1019.eqiad.wmnet
[08:09:05] <logmsgbot>	 !log ayounsi@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt with reason: router upgrade
[08:09:24] <logmsgbot>	 !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt with reason: router upgrade
[08:09:32] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3b336fa4-f522-4b10-abdb-d6be83f6a04a) set by ayounsi@cumin2002 for 2:00:00 on 3 host(s) and th...
[08:09:47] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34168 and previous config saved to /var/cache/conftool/dbconfig/20220908-080946-root.json
[08:09:52] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34169 and previous config saved to /var/cache/conftool/dbconfig/20220908-080951-root.json
[08:09:59] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34170 and previous config saved to /var/cache/conftool/dbconfig/20220908-080958-root.json
[08:10:35] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 50%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34171 and previous config saved to /var/cache/conftool/dbconfig/20220908-081034-root.json
[08:10:57] <logmsgbot>	 !log ayounsi@cumin2002 START - Cookbook sre.network.cf
[08:10:58] <logmsgbot>	 !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[08:11:12] <wikibugs>	 (PS1) Vgutierrez: Release 0.18 [software/purged] - https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064)
[08:12:06] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin1001.eqiad.wmnet
[08:12:53] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:12:55] <wikibugs>	 (CR) CI reject: [V: -1] Release 0.18 [software/purged] - https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064) (owner: Vgutierrez)
[08:16:49] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:18:51] <wikibugs>	 (CR) Btullis: [C: +2] Add a kublet node_label to each master of the dse-k8s cluster [puppet] - https://gerrit.wikimedia.org/r/828049 (https://phabricator.wikimedia.org/T310172) (owner: Btullis)
[08:22:47] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.remove-downtime for parse1019.eqiad.wmnet
[08:22:47] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1019.eqiad.wmnet
[08:24:14] <claime>	 !log pooled parse1019.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219
[08:24:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:17] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[08:24:38] <wikibugs>	 (PS1) Muehlenhoff: Update httpbb test to expect PHP 7.4 [puppet] - https://gerrit.wikimedia.org/r/830783
[08:25:01] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34172 and previous config saved to /var/cache/conftool/dbconfig/20220908-082500-root.json
[08:25:05] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) db1127 is being repooled. All the sX 10.6 hosts are now serving traffic with the patch
[08:25:22] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34173 and previous config saved to /var/cache/conftool/dbconfig/20220908-082521-root.json
[08:25:29] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34174 and previous config saved to /var/cache/conftool/dbconfig/20220908-082528-root.json
[08:25:31] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34175 and previous config saved to /var/cache/conftool/dbconfig/20220908-082530-root.json
[08:25:45] <wikibugs>	 (PS2) Vgutierrez: Release 0.18 [software/purged] - https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064)
[08:25:47] <wikibugs>	 (PS1) Vgutierrez: querysort: Backport strings.Cut [software/purged] - https://gerrit.wikimedia.org/r/830784 (https://phabricator.wikimedia.org/T317064)
[08:26:05] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 75%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34176 and previous config saved to /var/cache/conftool/dbconfig/20220908-082604-root.json
[08:26:34] <vgutierrez>	 _joe_: I'm aware that https://gerrit.wikimedia.org/r/830782 is far from ideal but we don't have golang 1.18 available yet
[08:26:40] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "We will update all tests, and remove a few, once we've moved fully to php 7.4." [puppet] - https://gerrit.wikimedia.org/r/830783 (owner: Muehlenhoff)
[08:26:51] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-ctrl1001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:16] <vgutierrez>	 _joe_: hmm sorry, https://gerrit.wikimedia.org/r/c/operations/software/purged/+/830784
[08:27:26] <wikibugs>	 (CR) CI reject: [V: -1] Release 0.18 [software/purged] - https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064) (owner: Vgutierrez)
[08:28:25] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] querysort: Backport strings.Cut [software/purged] - https://gerrit.wikimedia.org/r/830784 (https://phabricator.wikimedia.org/T317064) (owner: Vgutierrez)
[08:28:38] <_joe_>	 vgutierrez: heh sorry about that
[08:30:09] <wikibugs>	 (PS2) Vgutierrez: querysort: Backport strings.Cut [software/purged] - https://gerrit.wikimedia.org/r/830784 (https://phabricator.wikimedia.org/T317064)
[08:30:11] <wikibugs>	 (PS3) Vgutierrez: Release 0.18 [software/purged] - https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064)
[08:31:03] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1020.eqiad.wmnet
[08:31:32] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service Muehlenhoff See https://gerrit.wikimedia.org/r/c/operations/puppet/+/830783/ https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:31:32] <icinga-wm>	 ACKNOWLEDGEMENT - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver Muehlenhoff See https://gerrit.wikimedia.org/r/c/operations/puppet/+/830783/ https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:32:42] <wikibugs>	 (CR) CI reject: [V: -1] Release 0.18 [software/purged] - https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064) (owner: Vgutierrez)
[08:32:50] <vgutierrez>	 sigh³
[08:33:05] <wikibugs>	 (Abandoned) Muehlenhoff: Update httpbb test to expect PHP 7.4 [puppet] - https://gerrit.wikimedia.org/r/830783 (owner: Muehlenhoff)
[08:37:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:37:44] <wikibugs>	 (PS1) Volans: Class-based cookbooks: get parent argument_parser [cookbooks] - https://gerrit.wikimedia.org/r/830786
[08:38:11] <wikibugs>	 (CR) Marostegui: [C: +2] es2029,es2030,es2031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830728 (owner: Marostegui)
[08:38:23] <wikibugs>	 (PS2) Volans: Class-based cookbooks: use parent argument_parser [cookbooks] - https://gerrit.wikimedia.org/r/830786
[08:38:33] <wikibugs>	 (PS4) Vgutierrez: Release 0.18 [software/purged] - https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064)
[08:38:39] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:38:45] <vgutierrez>	 never push code if $coffee < 2
[08:39:41] <wikibugs>	 SRE, Observability-Metrics: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (fgiunchedi)
[08:39:43] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'Depool es2029, es2030, es2031', diff saved to https://phabricator.wikimedia.org/P34178 and previous config saved to /var/cache/conftool/dbconfig/20220908-083941-marostegui.json
[08:40:18] <claime>	 !log depooled wtp1028.eqiad.wmnet from parsoid cluster T307219
[08:40:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:21] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[08:40:30] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34179 and previous config saved to /var/cache/conftool/dbconfig/20220908-084029-root.json
[08:40:35] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.remove-downtime for parse1020.eqiad.wmnet
[08:40:36] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1020.eqiad.wmnet
[08:40:52] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34180 and previous config saved to /var/cache/conftool/dbconfig/20220908-084051-root.json
[08:40:58] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34181 and previous config saved to /var/cache/conftool/dbconfig/20220908-084057-root.json
[08:41:00] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34182 and previous config saved to /var/cache/conftool/dbconfig/20220908-084059-root.json
[08:41:19] <claime>	 !log pooled parse1020.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219
[08:41:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:34] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 100%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34183 and previous config saved to /var/cache/conftool/dbconfig/20220908-084133-root.json
[08:43:25] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:44:25] <wikibugs>	 (PS1) Elukey: ml-services: update Docker images [deployment-charts] - https://gerrit.wikimedia.org/r/830788 (https://phabricator.wikimedia.org/T313915)
[08:44:27] <XioNoX>	 !log reverting cr3-esams changes (JTAC will be needed for a firmware upgrade), and moving on to cr2-esams - T295690
[08:44:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:30] <stashbot>	 T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690
[08:44:53] <wikibugs>	 (PS1) Filippo Giunchedi: librenms: recurse setting permissions on sessions directory [puppet] - https://gerrit.wikimedia.org/r/830789 (https://phabricator.wikimedia.org/T317286)
[08:44:58] <logmsgbot>	 !log ayounsi@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr2-esams,cr2-esams IPv6,re0.cr2-esams.mgmt with reason: router upgrade
[08:45:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34184 and previous config saved to /var/cache/conftool/dbconfig/20220908-084500-root.json
[08:45:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34185 and previous config saved to /var/cache/conftool/dbconfig/20220908-084502-root.json
[08:45:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34186 and previous config saved to /var/cache/conftool/dbconfig/20220908-084503-root.json
[08:45:12] <wikibugs>	 (PS1) Marostegui: Revert "es2029,es2030,es2031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830806
[08:45:17] <logmsgbot>	 !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr2-esams,cr2-esams IPv6,re0.cr2-esams.mgmt with reason: router upgrade
[08:45:23] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e0d9eb2b-5520-4f80-912e-3627c94e9982) set by ayounsi@cumin2002 for 2:00:00 on 3 host(s) and th...
[08:46:01] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2029,es2030,es2031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830806 (owner: Marostegui)
[08:47:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1011.eqiad.wmnet
[08:47:23] <wikibugs>	 (CR) Vgutierrez: [V: +2 C: +2] querysort: Backport strings.Cut [software/purged] - https://gerrit.wikimedia.org/r/830784 (https://phabricator.wikimedia.org/T317064) (owner: Vgutierrez)
[08:47:32] <wikibugs>	 SRE, Thumbor: Thumbor units failing / service general slowness - https://phabricator.wikimedia.org/T312722 (fgiunchedi) p:Triage→Medium @MatthewVernon yes medium works, {{done}}
[08:47:34] <wikibugs>	 (CR) Vgutierrez: [C: +2] Release 0.18 [software/purged] - https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064) (owner: Vgutierrez)
[08:47:44] <wikibugs>	 SRE, Data Engineering Planning, Foundational Technology Requests: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (fgiunchedi) p:Triage→Medium
[08:47:47] <icinga-wm>	 RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 23, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:48:21] <wikibugs>	 (CR) Volans: "minor comments, looks good in general" [cookbooks] - https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) (owner: Ayounsi)
[08:51:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1011.eqiad.wmnet
[08:52:04] <wikibugs>	 (CR) Klausman: [C: +1] ml-services: update Docker images [deployment-charts] - https://gerrit.wikimedia.org/r/830788 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[08:52:30] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to wmf, analytics-privatedata-users for TTaylor - https://phabricator.wikimedia.org/T292299 (Milimetric) Resolved→Open It appears that Tajh never got analytics-privatedata access as requested here (https://github.com/wikimedia/puppet/blob/90a5ff1441598161f3e...
[08:53:09] <claime>	 !log depooled wtp1029.eqiad.wmnet from parsoid cluster T307219
[08:53:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:12] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[08:53:14] <wikibugs>	 (CR) Elukey: [C: +1] "Had a chat (live) with Filippo about the issue. It is not the ideal solution since puppet will change permissions before with scap:target " [puppet] - https://gerrit.wikimedia.org/r/830789 (https://phabricator.wikimedia.org/T317286) (owner: Filippo Giunchedi)
[08:53:31] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] librenms: recurse setting permissions on sessions directory [puppet] - https://gerrit.wikimedia.org/r/830789 (https://phabricator.wikimedia.org/T317286) (owner: Filippo Giunchedi)
[08:55:24] <wikibugs>	 (PS2) Giuseppe Lavagetto: Add cookbook to easily route mediawiki traffic [cookbooks] - https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995)
[08:55:32] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1021.eqiad.wmnet
[08:56:00] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34187 and previous config saved to /var/cache/conftool/dbconfig/20220908-085559-root.json
[08:56:22] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34188 and previous config saved to /var/cache/conftool/dbconfig/20220908-085621-root.json
[08:56:28] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34189 and previous config saved to /var/cache/conftool/dbconfig/20220908-085627-root.json
[08:56:31] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34190 and previous config saved to /var/cache/conftool/dbconfig/20220908-085630-root.json
[08:57:39] <wikibugs>	 SRE, Observability-Metrics, Patch-For-Review: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (fgiunchedi) I have bandaided the issue by recursing chown `www-data` into the sessions directory, though puppet will flip/flop permissions betwe...
[08:59:31] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: update Docker images [deployment-charts] - https://gerrit.wikimedia.org/r/830788 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[09:00:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34192 and previous config saved to /var/cache/conftool/dbconfig/20220908-090005-root.json
[09:00:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34193 and previous config saved to /var/cache/conftool/dbconfig/20220908-090007-root.json
[09:00:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34194 and previous config saved to /var/cache/conftool/dbconfig/20220908-090008-root.json
[09:00:15] <wikibugs>	 SRE, Observability-Metrics, Patch-For-Review: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (jbond) > My hunch is that this is a fallout from forcing owner/group in scap::target in https://gerrit.wikimedia.org/r/c/operations/puppet/+/830...
[09:02:00] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/821324 (https://phabricator.wikimedia.org/T313099) (owner: Cwhite)
[09:02:25] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:03:04] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/830633 (https://phabricator.wikimedia.org/T310179) (owner: Btullis)
[09:03:20] <logmsgbot>	 !log cgoubert@cumin2002 START - Cookbook sre.hosts.remove-downtime for parse1021.eqiad.wmnet
[09:03:21] <logmsgbot>	 !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1021.eqiad.wmnet
[09:03:43] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:04:13] <wikibugs>	 (CR) Filippo Giunchedi: logstash: reduce replica count to 1 after 1 day (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830690 (https://phabricator.wikimedia.org/T313099) (owner: Cwhite)
[09:04:24] <claime>	 !log pooled parse1021.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219
[09:04:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:27] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[09:04:44] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[09:04:54] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:05:51] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1022.eqiad.wmnet
[09:06:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:07:41] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:08:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:09:42] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[09:10:15] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:10:32] <wikibugs>	 (CR) Hashar: [C: -2] "I talked with David about the code I used to extract the list of events from the private field com.google.gerrit.server.events.Event.types" [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: Hashar)
[09:11:31] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34195 and previous config saved to /var/cache/conftool/dbconfig/20220908-091129-root.json
[09:11:52] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34196 and previous config saved to /var/cache/conftool/dbconfig/20220908-091151-root.json
[09:11:58] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34197 and previous config saved to /var/cache/conftool/dbconfig/20220908-091157-root.json
[09:12:01] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34198 and previous config saved to /var/cache/conftool/dbconfig/20220908-091200-root.json
[09:14:25] <claime>	 !log depooled wtp1030.eqiad.wmnet from parsoid cluster T307219
[09:14:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:29] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[09:15:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34199 and previous config saved to /var/cache/conftool/dbconfig/20220908-091510-root.json
[09:15:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34200 and previous config saved to /var/cache/conftool/dbconfig/20220908-091512-root.json
[09:15:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34201 and previous config saved to /var/cache/conftool/dbconfig/20220908-091513-root.json
[09:16:31] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1025-1028].eqiad.wmnet with reason: Downtiming replaced wtp servers
[09:16:35] <wikibugs>	 (PS1) Jbond: C:librenms: fix file permissions [puppet] - https://gerrit.wikimedia.org/r/830792
[09:16:45] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1028.eqiad.wmnet
[09:16:47] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1025-1028].eqiad.wmnet with reason: Downtiming replaced wtp servers
[09:16:52] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1029.eqiad.wmnet
[09:16:53] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] C:librenms: fix file permissions [puppet] - https://gerrit.wikimedia.org/r/830792 (owner: Jbond)
[09:17:19] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=wtp1029.eqiad.wmnet
[09:17:26] <wikibugs>	 SRE, Observability-Metrics, Patch-For-Review: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (jbond) forgot to link https://gerrit.wikimedia.org/r/c/operations/puppet/+/830792
[09:17:34] <icinga-wm>	 RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:59] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1026.eqiad.wmnet
[09:18:07] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1025.eqiad.wmnet
[09:18:09] <wikibugs>	 (CR) Filippo Giunchedi: "Idea LGTM, see inline" [puppet] - https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[09:18:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, though see related change" [puppet] - https://gerrit.wikimedia.org/r/830686 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[09:18:49] <vgutierrez>	 !log testing purged 0.18 in cp4026 and cp4032
[09:18:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:30] <wikibugs>	 (PS1) Jbond: C:librenms: correct typo [puppet] - https://gerrit.wikimedia.org/r/830793 (https://phabricator.wikimedia.org/T317286)
[09:19:44] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1022.eqiad.wmnet
[09:19:44] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1022.eqiad.wmnet
[09:19:50] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] C:librenms: correct typo [puppet] - https://gerrit.wikimedia.org/r/830793 (https://phabricator.wikimedia.org/T317286) (owner: Jbond)
[09:20:22] <wikibugs>	 (CR) Kosta Harlan: "LGTM, but blocked on T305406 and T312686, right? If so, let's set a -2 on here to avoid any confusion." [mediawiki-config] - https://gerrit.wikimedia.org/r/830202 (https://phabricator.wikimedia.org/T305408) (owner: Sergio Gimeno)
[09:21:05] <claime>	 !log pooled parse1022.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219
[09:21:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:10] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[09:21:47] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1023.eqiad.wmnet
[09:22:38] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:23:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2028 to es1 codfw master', diff saved to https://phabricator.wikimedia.org/P34202 and previous config saved to /var/cache/conftool/dbconfig/20220908-092301-marostegui.json
[09:23:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2026 to es2 codfw master', diff saved to https://phabricator.wikimedia.org/P34203 and previous config saved to /var/cache/conftool/dbconfig/20220908-092346-marostegui.json
[09:24:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2027 to es3 codfw master', diff saved to https://phabricator.wikimedia.org/P34204 and previous config saved to /var/cache/conftool/dbconfig/20220908-092436-marostegui.json
[09:25:25] <wikibugs>	 SRE-OnFire, Observability-Alerting: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (LSobanski) The tag combinations are also really confusing when they form a coherent statement, e.g. my first reaction to "page thanos sre" is "we should page Observabili...
[09:26:57] <wikibugs>	 SRE, Observability-Metrics, Patch-For-Review: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (fgiunchedi) >>! In T317286#8220241, @jbond wrote: >> My hunch is that this is a fallout from forcing owner/group in scap::target in https://gerr...
[09:27:01] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34205 and previous config saved to /var/cache/conftool/dbconfig/20220908-092700-root.json
[09:27:37] <wikibugs>	 (PS1) Marostegui: es2032,es2033,es2034: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830794
[09:30:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34206 and previous config saved to /var/cache/conftool/dbconfig/20220908-093015-root.json
[09:30:17] <wikibugs>	 (CR) JMeybohm: Alert on high Kubernetes API error rate (1 comment) [alerts] - https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[09:30:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34207 and previous config saved to /var/cache/conftool/dbconfig/20220908-093017-root.json
[09:30:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34208 and previous config saved to /var/cache/conftool/dbconfig/20220908-093018-root.json
[09:31:14] <vgutierrez>	 !log upload purged 0.18 to apt.wm.o (buster) - T317064
[09:31:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:16] <stashbot>	 T317064: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064
[09:31:17] <logmsgbot>	 !log ayounsi@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr3-knams,cr3-knams IPv6 with reason: router upgrade
[09:31:35] <logmsgbot>	 !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr3-knams,cr3-knams IPv6 with reason: router upgrade
[09:31:42] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ff1db65d-a6ee-4e20-ae07-837bbe264b2f) set by ayounsi@cumin2002 for 2:00:00 on 2 host(s) and th...
[09:31:51] <vgutierrez>	 !log rolling restart of purged - T317064
[09:31:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:00] <claime>	 !log depooled wtp1031.eqiad.wmnet from parsoid cluster T307219
[09:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:03] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[09:33:14] <wikibugs>	 (CR) Marostegui: [C: +2] es2032,es2033,es2034: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830794 (owner: Marostegui)
[09:34:22] <wikibugs>	 (PS1) Filippo Giunchedi: librenms: recursively chwon sessions directory to www-data as a bandaid [puppet] - https://gerrit.wikimedia.org/r/830795 (https://phabricator.wikimedia.org/T317286)
[09:35:03] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:35:31] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] librenms: recursively chwon sessions directory to www-data as a bandaid [puppet] - https://gerrit.wikimedia.org/r/830795 (https://phabricator.wikimedia.org/T317286) (owner: Filippo Giunchedi)
[09:35:48] <XioNoX>	 !log drain traffic from cr3-knams - T295690
[09:35:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:51] <stashbot>	 T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690
[09:36:45] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1023.eqiad.wmnet
[09:36:45] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1023.eqiad.wmnet
[09:36:59] <wikibugs>	 (PS1) Btullis: Grant ttaylor access to PII through Superset [puppet] - https://gerrit.wikimedia.org/r/830796 (https://phabricator.wikimedia.org/T292299)
[09:37:25] <claime>	 !log pooled parse1023.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219
[09:37:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:37] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1024.eqiad.wmnet
[09:38:51] <wikibugs>	 SRE, Observability-Metrics, Patch-For-Review: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (jbond) I have rolled back https://gerrit.wikimedia.org/r/c/830789.  As per the [[ https://github.com/wikimedia/puppet/blob/production/modules/li...
[09:40:16] <wikibugs>	 (PS1) Jbond: Revert "librenms: recursively chwon sessions directory to www-data as a bandaid" [puppet] - https://gerrit.wikimedia.org/r/830807
[09:41:01] <wikibugs>	 (CR) Jbond: [C: +2] Revert "librenms: recursively chwon sessions directory to www-data as a bandaid" [puppet] - https://gerrit.wikimedia.org/r/830807 (owner: Jbond)
[09:41:02] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Grant Access to wmf, analytics-privatedata-users for TTaylor - https://phabricator.wikimedia.org/T292299 (BTullis) I have added a patch to correct Tajh's access permissions, so that he will be able to access PII in Superset. The wikitech reference to this a...
[09:42:31] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34209 and previous config saved to /var/cache/conftool/dbconfig/20220908-094229-root.json
[09:42:51] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on parse1024 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[09:44:52] <wikibugs>	 (PS2) Phuedx: testwiki: Add mediawiki.edit_attempt stream [mediawiki-config] - https://gerrit.wikimedia.org/r/826234 (https://phabricator.wikimedia.org/T309013)
[09:45:09] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:45:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34210 and previous config saved to /var/cache/conftool/dbconfig/20220908-094520-root.json
[09:45:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34211 and previous config saved to /var/cache/conftool/dbconfig/20220908-094522-root.json
[09:45:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34212 and previous config saved to /var/cache/conftool/dbconfig/20220908-094523-root.json
[09:46:51] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:47:09] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:47:09] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:47:11] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:47:13] <claime>	 !log depooled wtp1032.eqiad.wmnet from parsoid cluster T307219
[09:47:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:15] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:47:15] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[09:48:07] <claime>	 Don't bother about wtp1040 ^
[09:48:24] <claime>	 I'll put the management downtime that I forgot in a sec
[09:49:59] <wikibugs>	 SRE, Observability-Metrics, Patch-For-Review: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (jbond) > I have rolled back https://gerrit.wikimedia.org/r/c/830789. As per the comment the session files need to be 0644 This was introduced in...
[09:50:02] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1024.eqiad.wmnet
[09:50:02] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1024.eqiad.wmnet
[09:50:32] <claime>	 !log pooled parse1024.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219
[09:50:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:35] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for pginer - https://phabricator.wikimedia.org/T317291 (BTullis)
[09:51:04] <wikibugs>	 SRE, Observability-Metrics, Patch-For-Review: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (fgiunchedi) Open→Resolved a:fgiunchedi Thank you @jbond for the assistance/fix/investigation, seems all good now! resolving
[09:51:52] <wikibugs>	 SRE, Observability-Metrics: librenms: investigate making the session directory 0660 - https://phabricator.wikimedia.org/T317292 (jbond)
[09:52:49] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:52:49] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:52:51] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:53:34] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (cmooney) Open→Resolved So after quite a bit of back-and-forth with Juniper and pulling logs etc. they say they can't see anything i...
[09:53:54] <wikibugs>	 (CR) Volans: [C: -1] "LGTM, one small bug to fix and a question inline." [cookbooks] - https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995) (owner: Giuseppe Lavagetto)
[09:55:28] <wikibugs>	 (CR) Muehlenhoff: Initially adapt perccli to use the new raid_mgmt_tools fact (1 comment) [puppet] - https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: Muehlenhoff)
[09:57:23] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[09:57:43] <wikibugs>	 (CR) Ayounsi: [C: +2] Disable VRRP auth for esams [homer/public] - https://gerrit.wikimedia.org/r/830730 (https://phabricator.wikimedia.org/T295690) (owner: Ayounsi)
[09:58:00] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34213 and previous config saved to /var/cache/conftool/dbconfig/20220908-095759-root.json
[09:58:14] <wikibugs>	 (CR) Alexandros Kosiaris: Alert on high Kubernetes API error rate (1 comment) [alerts] - https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[09:58:53] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for pginer - https://phabricator.wikimedia.org/T317291 (BTullis) I have now added Pau to the wmf group. ` btullis@mwmaint1002:~$ ldapsearch -x cn=wmf|grep pginer member: uid=pginer,ou=people,dc=wikimedia,dc=org ` Also added to the #wmf-nda group in Phab...
[09:59:08] <wikibugs>	 (PS1) Ayounsi: Revert "Depool esams for routers upgrades" [dns] - https://gerrit.wikimedia.org/r/830808
[09:59:11] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for pginer - https://phabricator.wikimedia.org/T317291 (BTullis) Open→Resolved p:Triage→Medium
[10:00:04] <jouncebot>	 mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T1000).
[10:00:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2032 es2033 es2034 for upgrade', diff saved to https://phabricator.wikimedia.org/P34214 and previous config saved to /var/cache/conftool/dbconfig/20220908-100014-root.json
[10:00:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34215 and previous config saved to /var/cache/conftool/dbconfig/20220908-100025-root.json
[10:00:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34216 and previous config saved to /var/cache/conftool/dbconfig/20220908-100027-root.json
[10:00:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34217 and previous config saved to /var/cache/conftool/dbconfig/20220908-100028-root.json
[10:00:53] <claime>	 !log depooled wtp1033.eqiad.wmnet from parsoid cluster T307219
[10:00:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:55] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[10:01:13] <claime>	 !log Serving 100% of parsoid traffic with php 7.4 T307219
[10:01:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on es[2032-2034].codfw.wmnet with reason: Upgrade
[10:01:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2032-2034].codfw.wmnet with reason: Upgrade
[10:03:11] <wikibugs>	 (PS7) Muehlenhoff: Switch to the new raid_mgmt_tools fact to enable RAID tools [puppet] - https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608)
[10:04:56] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: Muehlenhoff)
[10:05:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ayounsi) cr2-esams and cr3-knams got upgraded as expected. cr3-esams failed as it requires a firmware upgrade, and only JTAC can provide us the firmware. We wi...
[10:05:57] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1029-1031].eqiad.wmnet with reason: Downtiming replaced wtp servers
[10:06:12] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1029-1031].eqiad.wmnet with reason: Downtiming replaced wtp servers
[10:06:34] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1030.eqiad.wmnet
[10:06:36] <wikibugs>	 SRE, MediaWiki-Core-HTTP-Cache, MediaWiki-Page-history, Traffic, and 3 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (Vgutierrez) Open→Resolved a:Joe ` #before the edit vgutierrez@carrot:~$ curl "https://test.wikipedia.org/...
[10:06:46] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1031.eqiad.wmnet
[10:06:59] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1029.eqiad.wmnet
[10:07:01] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ayounsi)
[10:07:27] <wikibugs>	 (CR) Ayounsi: [C: +2] Revert "Depool esams for routers upgrades" [dns] - https://gerrit.wikimedia.org/r/830808 (owner: Ayounsi)
[10:07:42] <XioNoX>	 !log re-pool esams after routers upgrade - T295690
[10:07:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:47] <stashbot>	 T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690
[10:07:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34218 and previous config saved to /var/cache/conftool/dbconfig/20220908-100754-root.json
[10:08:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34219 and previous config saved to /var/cache/conftool/dbconfig/20220908-100759-root.json
[10:08:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34220 and previous config saved to /var/cache/conftool/dbconfig/20220908-100805-root.json
[10:08:19] <wikibugs>	 (CR) Btullis: [C: +2] Add the configuration to create LVM volumes for dse-k8s monitoring [puppet] - https://gerrit.wikimedia.org/r/830633 (https://phabricator.wikimedia.org/T310179) (owner: Btullis)
[10:08:30] <wikibugs>	 (PS1) Marostegui: Revert "es2032,es2033,es2034: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830810
[10:12:05] <wikibugs>	 (PS8) Muehlenhoff: Switch to the new raid_mgmt_tools fact to enable RAID tools [puppet] - https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608)
[10:12:54] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2032,es2033,es2034: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830810 (owner: Marostegui)
[10:13:30] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34221 and previous config saved to /var/cache/conftool/dbconfig/20220908-101329-root.json
[10:13:35] <wikibugs>	 (PS2) Ayounsi: Add Junos image validation [cookbooks] - https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539)
[10:13:38] <wikibugs>	 (CR) Ayounsi: "thanks!" [cookbooks] - https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) (owner: Ayounsi)
[10:16:22] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:16:30] <wikibugs>	 (PS1) Mvolz: Switch Zotero to node 12 [deployment-charts] - https://gerrit.wikimedia.org/r/830799 (https://phabricator.wikimedia.org/T290753)
[10:17:36] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: Muehlenhoff)
[10:18:03] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 7 hosts with reason: Downtiming replaced wtp servers
[10:18:21] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 7 hosts with reason: Downtiming replaced wtp servers
[10:18:34] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 15 hosts with reason: Downtiming replaced wtp servers
[10:18:57] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 15 hosts with reason: Downtiming replaced wtp servers
[10:20:02] <wikibugs>	 (CR) Mvolz: [C: +2] Switch Zotero to node 12 [deployment-charts] - https://gerrit.wikimedia.org/r/830799 (https://phabricator.wikimedia.org/T290753) (owner: Mvolz)
[10:20:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1022, es1025, es2022, es2025 for upgrade', diff saved to https://phabricator.wikimedia.org/P34222 and previous config saved to /var/cache/conftool/dbconfig/20220908-102040-root.json
[10:20:50] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:22:04] <wikibugs>	 (CR) Volans: "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) (owner: Ayounsi)
[10:22:08] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:22:09] <wikibugs>	 (PS1) Marostegui: es1022,es1025,es2022,es2025: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830800
[10:22:50] <wikibugs>	 (CR) Marostegui: [C: +2] es1022,es1025,es2022,es2025: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830800 (owner: Marostegui)
[10:22:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34223 and previous config saved to /var/cache/conftool/dbconfig/20220908-102259-root.json
[10:23:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34224 and previous config saved to /var/cache/conftool/dbconfig/20220908-102304-root.json
[10:23:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34225 and previous config saved to /var/cache/conftool/dbconfig/20220908-102310-root.json
[10:23:32] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[10:23:34] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[10:23:51] <wikibugs>	 (PS9) Muehlenhoff: Switch to the new raid_mgmt_tools fact to enable RAID tools [puppet] - https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608)
[10:26:39] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[10:27:16] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[10:28:32] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply
[10:28:38] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:29:00] <logmsgbot>	 !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34226 and previous config saved to /var/cache/conftool/dbconfig/20220908-102859-root.json
[10:29:16] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[10:29:46] <wikibugs>	 (PS3) Ayounsi: Add Junos image validation [cookbooks] - https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539)
[10:30:07] <wikibugs>	 (CR) Ayounsi: "fair enough :)" [cookbooks] - https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) (owner: Ayounsi)
[10:30:08] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:30:34] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[10:30:55] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) (owner: Ayounsi)
[10:30:57] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: Muehlenhoff)
[10:31:23] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[10:32:43] <wikibugs>	 SRE, Citoid, Editing-team: Migrate citoid and zotero production services to node12 - https://phabricator.wikimedia.org/T290753 (Mvolz) Open→Resolved p:Triage→Medium
[10:32:46] <wikibugs>	 SRE, serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (Mvolz)
[10:33:06] <wikibugs>	 SRE, serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (Mvolz)
[10:34:22] <icinga-wm>	 PROBLEM - cassandra-a service on restbase2014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:36:37] <wikibugs>	 (PS1) Marostegui: Revert "es1022,es1025,es2022,es2025: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830811
[10:37:32] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es1022,es1025,es2022,es2025: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830811 (owner: Marostegui)
[10:38:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34227 and previous config saved to /var/cache/conftool/dbconfig/20220908-103804-root.json
[10:38:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34228 and previous config saved to /var/cache/conftool/dbconfig/20220908-103809-root.json
[10:38:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34229 and previous config saved to /var/cache/conftool/dbconfig/20220908-103815-root.json
[10:38:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34230 and previous config saved to /var/cache/conftool/dbconfig/20220908-103826-root.json
[10:38:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34231 and previous config saved to /var/cache/conftool/dbconfig/20220908-103830-root.json
[10:38:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34232 and previous config saved to /var/cache/conftool/dbconfig/20220908-103836-root.json
[10:38:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34233 and previous config saved to /var/cache/conftool/dbconfig/20220908-103842-root.json
[10:40:00] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:41:22] <wikibugs>	 (PS1) Marostegui: es1026,es1027,es1028: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830801
[10:41:30] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:41:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1027 es1026 es1028 for upgrade', diff saved to https://phabricator.wikimedia.org/P34234 and previous config saved to /var/cache/conftool/dbconfig/20220908-104152-root.json
[10:42:12] <wikibugs>	 (CR) Marostegui: [C: +2] es1026,es1027,es1028: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830801 (owner: Marostegui)
[10:42:26] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:43:49] <wikibugs>	 (CR) JMeybohm: "Adding CC @elukey because of high rate of 409's in ml-serve in eqiad" [alerts] - https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[10:44:04] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:44:34] <icinga-wm>	 RECOVERY - cassandra-a service on restbase2014 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:49:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34235 and previous config saved to /var/cache/conftool/dbconfig/20220908-104902-root.json
[10:49:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34236 and previous config saved to /var/cache/conftool/dbconfig/20220908-104910-root.json
[10:49:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34237 and previous config saved to /var/cache/conftool/dbconfig/20220908-104914-root.json
[10:49:18] <wikibugs>	 (PS1) Marostegui: Revert "Revert "es1022,es1025,es2022,es2025: Disable notifications"" [puppet] - https://gerrit.wikimedia.org/r/830812
[10:50:09] <wikibugs>	 (Abandoned) Marostegui: Revert "Revert "es1022,es1025,es2022,es2025: Disable notifications"" [puppet] - https://gerrit.wikimedia.org/r/830812 (owner: Marostegui)
[10:50:22] <wikibugs>	 (PS1) Marostegui: Revert "es1026,es1027,es1028: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830813
[10:50:53] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney)
[10:51:02] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es1026,es1027,es1028: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830813 (owner: Marostegui)
[10:52:17] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:52:39] <wikibugs>	 (PS1) Clément Goubert: wtp: Purge wtp servers following migration to parse [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025)
[10:52:39] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney) The firmware provided by Juniper seems to be accepted by cr3-esams: ` cmooney@re0.cr3-esams> show system firmware | match "^Part|version|i40" Part...
[10:52:56] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney)
[10:53:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34238 and previous config saved to /var/cache/conftool/dbconfig/20220908-105309-root.json
[10:53:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34239 and previous config saved to /var/cache/conftool/dbconfig/20220908-105314-root.json
[10:53:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34240 and previous config saved to /var/cache/conftool/dbconfig/20220908-105320-root.json
[10:53:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34241 and previous config saved to /var/cache/conftool/dbconfig/20220908-105331-root.json
[10:53:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34242 and previous config saved to /var/cache/conftool/dbconfig/20220908-105335-root.json
[10:53:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34243 and previous config saved to /var/cache/conftool/dbconfig/20220908-105341-root.json
[10:53:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34244 and previous config saved to /var/cache/conftool/dbconfig/20220908-105347-root.json
[10:54:20] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37165/console" [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert)
[10:55:28] <icinga-wm>	 RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:55:29] <wikibugs>	 (PS1) Clément Goubert: wtp: Purge wtp servers following migration to parse [mediawiki-config] - https://gerrit.wikimedia.org/r/830803 (https://phabricator.wikimedia.org/T317025)
[10:55:39] <wikibugs>	 (CR) CI reject: [V: -1] wtp: Purge wtp servers following migration to parse [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert)
[10:57:35] <wikibugs>	 (PS2) Clément Goubert: wtp: Purge wtp servers following migration to parse [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025)
[11:01:36] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:04:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34245 and previous config saved to /var/cache/conftool/dbconfig/20220908-110407-root.json
[11:04:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34246 and previous config saved to /var/cache/conftool/dbconfig/20220908-110415-root.json
[11:04:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34247 and previous config saved to /var/cache/conftool/dbconfig/20220908-110419-root.json
[11:05:11] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm no idea on the sleep question" [cookbooks] - https://gerrit.wikimedia.org/r/830676 (owner: Volans)
[11:08:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34248 and previous config saved to /var/cache/conftool/dbconfig/20220908-110814-root.json
[11:08:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34249 and previous config saved to /var/cache/conftool/dbconfig/20220908-110819-root.json
[11:08:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34250 and previous config saved to /var/cache/conftool/dbconfig/20220908-110825-root.json
[11:08:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34251 and previous config saved to /var/cache/conftool/dbconfig/20220908-110836-root.json
[11:08:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34252 and previous config saved to /var/cache/conftool/dbconfig/20220908-110840-root.json
[11:08:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34253 and previous config saved to /var/cache/conftool/dbconfig/20220908-110846-root.json
[11:08:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34254 and previous config saved to /var/cache/conftool/dbconfig/20220908-110852-root.json
[11:08:58] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:09:00] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/830786 (owner: Volans)
[11:11:18] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/830796 (https://phabricator.wikimedia.org/T292299) (owner: Btullis)
[11:14:52] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37166/console" [puppet] - https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: Muehlenhoff)
[11:15:29] <wikibugs>	 (CR) Hashar: [C: -2] "On hold, pending a change proposed upstream to add getters https://gerrit-review.googlesource.com/c/gerrit/+/345017" [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: Hashar)
[11:16:47] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: Muehlenhoff)
[11:17:00] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:19:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34255 and previous config saved to /var/cache/conftool/dbconfig/20220908-111912-root.json
[11:19:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34256 and previous config saved to /var/cache/conftool/dbconfig/20220908-111920-root.json
[11:19:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34257 and previous config saved to /var/cache/conftool/dbconfig/20220908-111924-root.json
[11:19:42] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm but dont know the history" [puppet] - https://gerrit.wikimedia.org/r/830704 (owner: Andrew Bogott)
[11:23:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34258 and previous config saved to /var/cache/conftool/dbconfig/20220908-112319-root.json
[11:23:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34259 and previous config saved to /var/cache/conftool/dbconfig/20220908-112324-root.json
[11:23:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34260 and previous config saved to /var/cache/conftool/dbconfig/20220908-112329-root.json
[11:23:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34261 and previous config saved to /var/cache/conftool/dbconfig/20220908-112341-root.json
[11:23:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34262 and previous config saved to /var/cache/conftool/dbconfig/20220908-112345-root.json
[11:23:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34263 and previous config saved to /var/cache/conftool/dbconfig/20220908-112351-root.json
[11:23:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34264 and previous config saved to /var/cache/conftool/dbconfig/20220908-112357-root.json
[11:30:41] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch to the new raid_mgmt_tools fact to enable RAID tools [puppet] - https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: Muehlenhoff)
[11:32:27] <icinga-wm>	 RECOVERY - Check systemd state on dse-k8s-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:34:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34265 and previous config saved to /var/cache/conftool/dbconfig/20220908-113417-root.json
[11:34:19] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:34:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34266 and previous config saved to /var/cache/conftool/dbconfig/20220908-113425-root.json
[11:34:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34267 and previous config saved to /var/cache/conftool/dbconfig/20220908-113429-root.json
[11:35:53] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:38:15] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:38:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34268 and previous config saved to /var/cache/conftool/dbconfig/20220908-113846-root.json
[11:38:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34269 and previous config saved to /var/cache/conftool/dbconfig/20220908-113850-root.json
[11:38:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34270 and previous config saved to /var/cache/conftool/dbconfig/20220908-113856-root.json
[11:39:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34271 and previous config saved to /var/cache/conftool/dbconfig/20220908-113902-root.json
[11:41:37] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:49:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34272 and previous config saved to /var/cache/conftool/dbconfig/20220908-114922-root.json
[11:49:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34273 and previous config saved to /var/cache/conftool/dbconfig/20220908-114930-root.json
[11:49:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34274 and previous config saved to /var/cache/conftool/dbconfig/20220908-114934-root.json
[11:50:33] <wikibugs>	 (PS1) Muehlenhoff: Remove absented Raid Icinga checks [puppet] - https://gerrit.wikimedia.org/r/830846
[11:53:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34275 and previous config saved to /var/cache/conftool/dbconfig/20220908-115351-root.json
[11:53:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34276 and previous config saved to /var/cache/conftool/dbconfig/20220908-115355-root.json
[11:54:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34277 and previous config saved to /var/cache/conftool/dbconfig/20220908-115401-root.json
[11:54:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34278 and previous config saved to /var/cache/conftool/dbconfig/20220908-115407-root.json
[11:55:43] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:58:39] <wikibugs>	 (CR) Btullis: [C: +2] Grant ttaylor access to PII through Superset [puppet] - https://gerrit.wikimedia.org/r/830796 (https://phabricator.wikimedia.org/T292299) (owner: Btullis)
[12:03:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[12:04:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34280 and previous config saved to /var/cache/conftool/dbconfig/20220908-120427-root.json
[12:04:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34281 and previous config saved to /var/cache/conftool/dbconfig/20220908-120435-root.json
[12:04:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34282 and previous config saved to /var/cache/conftool/dbconfig/20220908-120439-root.json
[12:05:13] <wikibugs>	 (PS1) Marostegui: es1029,es1030,es1031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830848
[12:05:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1029 es1030 es1031 for upgrade', diff saved to https://phabricator.wikimedia.org/P34283 and previous config saved to /var/cache/conftool/dbconfig/20220908-120528-root.json
[12:06:37] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[12:07:02] <wikibugs>	 (CR) Marostegui: [C: +2] es1029,es1030,es1031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830848 (owner: Marostegui)
[12:08:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:09:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[12:10:44] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] Alert on high Kubernetes API error rate (1 comment) [alerts] - https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[12:11:35] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[12:11:44] <wikibugs>	 (PS1) Jaime Nuche: k8s scap: change format of mediawiki deployment files [puppet] - https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648)
[12:12:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[12:13:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:14:25] <wikibugs>	 (PS1) Marostegui: Revert "es1029,es1030,es1031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830816
[12:14:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34284 and previous config saved to /var/cache/conftool/dbconfig/20220908-121459-root.json
[12:15:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34285 and previous config saved to /var/cache/conftool/dbconfig/20220908-121506-root.json
[12:15:09] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: AOkoth)
[12:15:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34286 and previous config saved to /var/cache/conftool/dbconfig/20220908-121511-root.json
[12:15:35] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es1029,es1030,es1031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830816 (owner: Marostegui)
[12:15:51] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[12:16:06] <wikibugs>	 (PS1) Muehlenhoff: smart: Also use new raid_mgmt_tools fact [puppet] - https://gerrit.wikimedia.org/r/830852 (https://phabricator.wikimedia.org/T313312)
[12:17:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[12:18:08] <wikibugs>	 (CR) Jaime Nuche: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/37167/" [puppet] - https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: Jaime Nuche)
[12:18:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:23:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:25:37] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1032-1033].mgmt with reason: Downtiming replaced wtp servers
[12:25:42] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[12:25:51] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1032-1033].mgmt with reason: Downtiming replaced wtp servers
[12:26:05] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1032-1033].eqiad.wmnet with reason: Downtiming replaced wtp servers
[12:26:20] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1032-1033].eqiad.wmnet with reason: Downtiming replaced wtp servers
[12:26:27] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1032.eqiad.wmnet
[12:26:36] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1033.eqiad.wmnet
[12:29:17] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1202 [puppet] - https://gerrit.wikimedia.org/r/830854
[12:30:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34287 and previous config saved to /var/cache/conftool/dbconfig/20220908-123004-root.json
[12:30:09] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1202 [puppet] - https://gerrit.wikimedia.org/r/830854 (owner: Marostegui)
[12:30:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34288 and previous config saved to /var/cache/conftool/dbconfig/20220908-123011-root.json
[12:30:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34289 and previous config saved to /var/cache/conftool/dbconfig/20220908-123016-root.json
[12:37:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[12:37:44] <wikibugs>	 (CR) Jforrester: [C: +1] scap/dsh: remove parsoid service, replaced by parsoid-php [puppet] - https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207) (owner: Dzahn)
[12:39:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1027 to es1 eqiad master, promote es1026 to es2 eqiad master, promote es1028 to es3 eqiad master', diff saved to https://phabricator.wikimedia.org/P34290 and previous config saved to /var/cache/conftool/dbconfig/20220908-123955-marostegui.json
[12:42:08] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@9e4ed94]: (no justification provided)
[12:42:17] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@9e4ed94]: (no justification provided) (duration: 00m 09s)
[12:43:22] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[12:45:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34291 and previous config saved to /var/cache/conftool/dbconfig/20220908-124509-root.json
[12:45:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34292 and previous config saved to /var/cache/conftool/dbconfig/20220908-124516-root.json
[12:45:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34293 and previous config saved to /var/cache/conftool/dbconfig/20220908-124521-root.json
[12:47:13] <wikibugs>	 (CR) Zabe: wtp: Purge wtp servers following migration to parse (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert)
[12:48:38] <wikibugs>	 (CR) Clément Goubert: wtp: Purge wtp servers following migration to parse (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert)
[12:49:50] <wikibugs>	 (PS3) Clément Goubert: wtp: Purge wtp servers following migration to parse [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025)
[12:50:12] <wikibugs>	 (PS7) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595)
[12:50:16] <wikibugs>	 (CR) Clément Goubert: wtp: Purge wtp servers following migration to parse (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert)
[12:50:39] <wikibugs>	 (PS4) Clément Goubert: wtp: Purge wtp servers following migration to parse [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025)
[12:50:46] <wikibugs>	 (CR) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. (5 comments) [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: Slyngshede)
[12:51:13] <wikibugs>	 (PS1) Muehlenhoff: raid::perccli: Run the correct monitoring tool [puppet] - https://gerrit.wikimedia.org/r/830860 (https://phabricator.wikimedia.org/T315608)
[12:51:23] <wikibugs>	 (PS2) Muehlenhoff: raid::perccli: Run the correct monitoring tool [puppet] - https://gerrit.wikimedia.org/r/830860 (https://phabricator.wikimedia.org/T315608)
[12:51:42] <wikibugs>	 (CR) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. (11 comments) [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: Slyngshede)
[12:53:17] <wikibugs>	 (PS1) Stang: tnwiki: Add extendedconfirmed group [mediawiki-config] - https://gerrit.wikimedia.org/r/830861 (https://phabricator.wikimedia.org/T317276)
[12:54:00] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37168/console" [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert)
[12:54:26] <wikibugs>	 SRE, serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (Jdforrester-WMF)
[12:56:27] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@9e4ed94]: (no justification provided)
[12:56:37] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@9e4ed94]: (no justification provided) (duration: 00m 09s)
[12:57:16] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37169/console" [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T1300)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T1300).
[13:00:05] <jouncebot>	 arlolra and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34294 and previous config saved to /var/cache/conftool/dbconfig/20220908-130014-root.json
[13:00:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34295 and previous config saved to /var/cache/conftool/dbconfig/20220908-130021-root.json
[13:00:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34296 and previous config saved to /var/cache/conftool/dbconfig/20220908-130026-root.json
[13:00:28] <arlolra>	 here
[13:00:50] <Lucas_WMDE>	 I can deploy in maybe 15 minutes or so
[13:01:11] <arlolra>	 thank you
[13:01:53] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:02:23] <phuedx>	 o/
[13:06:09] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:07:19] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48681 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:08:39] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:12:59] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:15:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34297 and previous config saved to /var/cache/conftool/dbconfig/20220908-131519-root.json
[13:15:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34298 and previous config saved to /var/cache/conftool/dbconfig/20220908-131526-root.json
[13:15:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34299 and previous config saved to /var/cache/conftool/dbconfig/20220908-131531-root.json
[13:15:55] <wikibugs>	 (PS1) Muehlenhoff: Remove support for mptsas RAID [puppet] - https://gerrit.wikimedia.org/r/830862
[13:19:51] <wikibugs>	 (CR) Giuseppe Lavagetto: Add cookbook to easily route mediawiki traffic (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995) (owner: Giuseppe Lavagetto)
[13:19:53] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:20:10] <wikibugs>	 (PS3) Giuseppe Lavagetto: Add cookbook to easily route mediawiki traffic [cookbooks] - https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995)
[13:22:17] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995) (owner: Giuseppe Lavagetto)
[13:23:16] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp1034.eqiad.wmnet
[13:23:27] <wikibugs>	 (PS1) Zabe: wikimedia.org: Move nyc to the wikis section [dns] - https://gerrit.wikimedia.org/r/830863
[13:26:45] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:28:50] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[13:29:55] <moritzm>	 !log installing apache2 security updates on Bullseye
[13:29:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34300 and previous config saved to /var/cache/conftool/dbconfig/20220908-133024-root.json
[13:30:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34301 and previous config saved to /var/cache/conftool/dbconfig/20220908-133031-root.json
[13:30:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34302 and previous config saved to /var/cache/conftool/dbconfig/20220908-133036-root.json
[13:30:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 10%: Maint over long time ago', diff saved to https://phabricator.wikimedia.org/P34303 and previous config saved to /var/cache/conftool/dbconfig/20220908-133045-ladsgroup.json
[13:30:48] <wikibugs>	 (CR) JMeybohm: Alert on high Kubernetes API error rate (1 comment) [alerts] - https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm)
[13:31:12] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:31:13] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wtp1034.eqiad.wmnet
[13:34:06] <wikibugs>	 (PS1) Giuseppe Lavagetto: Improve documentation of the CookbookBase classes usage [software/spicerack] - https://gerrit.wikimedia.org/r/830866
[13:35:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: Will do maint later', diff saved to https://phabricator.wikimedia.org/P34304 and previous config saved to /var/cache/conftool/dbconfig/20220908-133514-ladsgroup.json
[13:35:39] <wikibugs>	 (CR) Slyngshede: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/830860 (https://phabricator.wikimedia.org/T315608) (owner: Muehlenhoff)
[13:36:36] <wikibugs>	 (PS1) Vgutierrez: trafficserver: Update to ATS 9 on drmrs [puppet] - https://gerrit.wikimedia.org/r/830867 (https://phabricator.wikimedia.org/T309651)
[13:36:39] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp1035.eqiad.wmnet
[13:37:32] <wikibugs>	 (CR) Ssingh: "Let's go!" [puppet] - https://gerrit.wikimedia.org/r/830867 (https://phabricator.wikimedia.org/T309651) (owner: Vgutierrez)
[13:38:32] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37170/console" [puppet] - https://gerrit.wikimedia.org/r/830867 (https://phabricator.wikimedia.org/T309651) (owner: Vgutierrez)
[13:38:43] <wikibugs>	 (CR) Muehlenhoff: [C: +2] raid::perccli: Run the correct monitoring tool [puppet] - https://gerrit.wikimedia.org/r/830860 (https://phabricator.wikimedia.org/T315608) (owner: Muehlenhoff)
[13:39:32] <wikibugs>	 (CR) Ssingh: [C: +1] trafficserver: Update to ATS 9 on drmrs [puppet] - https://gerrit.wikimedia.org/r/830867 (https://phabricator.wikimedia.org/T309651) (owner: Vgutierrez)
[13:39:46] <vgutierrez>	 !log disable puppet on A:cp-drmrs during the update to ATS 9.1.3 - T309651
[13:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:49] <stashbot>	 T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651
[13:40:43] <wikibugs>	 (PS1) Ayounsi: Move peeringdb token to spicerack namespace [labs/private] - https://gerrit.wikimedia.org/r/830868
[13:40:53] <wikibugs>	 (CR) CI reject: [V: -1] Improve documentation of the CookbookBase classes usage [software/spicerack] - https://gerrit.wikimedia.org/r/830866 (owner: Giuseppe Lavagetto)
[13:41:21] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] trafficserver: Update to ATS 9 on drmrs [puppet] - https://gerrit.wikimedia.org/r/830867 (https://phabricator.wikimedia.org/T309651) (owner: Vgutierrez)
[13:41:27] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[13:41:52] <wikibugs>	 (PS5) JMeybohm: Alert on high Kubernetes API error rate [alerts] - https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251)
[13:41:54] <wikibugs>	 (PS3) JMeybohm: Alert on high Kubernetes API latency [alerts] - https://gerrit.wikimedia.org/r/830637 (https://phabricator.wikimedia.org/T311251)
[13:41:56] <wikibugs>	 (PS5) Ayounsi: Spicerack: add configuration file and API key for PeeringDB [puppet] - https://gerrit.wikimedia.org/r/819562
[13:43:13] <vgutierrez>	 !log rolling upgrade to ats 9 in cp drmrs - T309651
[13:43:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:44] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:43:47] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:43:48] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wtp1035.eqiad.wmnet
[13:45:29] <wikibugs>	 (PS1) Volans: doc: fix sphinx_checker for Python 3.10 [software/spicerack] - https://gerrit.wikimedia.org/r/830871
[13:45:35] <wikibugs>	 (CR) Ayounsi: Spicerack: add configuration file and API key for PeeringDB (2 comments) [puppet] - https://gerrit.wikimedia.org/r/819562 (owner: Ayounsi)
[13:45:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 25%: Maint over long time ago', diff saved to https://phabricator.wikimedia.org/P34305 and previous config saved to /var/cache/conftool/dbconfig/20220908-134550-ladsgroup.json
[13:46:01] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] doc: fix sphinx_checker for Python 3.10 [software/spicerack] - https://gerrit.wikimedia.org/r/830871 (owner: Volans)
[13:47:53] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp1036.eqiad.wmnet
[13:49:11] <wikibugs>	 (PS4) JMeybohm: Alert on high lateny of kubelet operations [alerts] - https://gerrit.wikimedia.org/r/830228 (https://phabricator.wikimedia.org/T311251)
[13:49:13] <wikibugs>	 (PS6) JMeybohm: Alert on high Kubernetes API error rate [alerts] - https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251)
[13:49:15] <wikibugs>	 (PS4) JMeybohm: Alert on high Kubernetes API latency [alerts] - https://gerrit.wikimedia.org/r/830637 (https://phabricator.wikimedia.org/T311251)
[13:49:55] <arlolra>	 Lucas_WMDE: will we not be making this window?
[13:50:06] <Lucas_WMDE>	 oh, sorry, I totally forgot about it :(
[13:50:11] <Lucas_WMDE>	 damn
[13:50:12] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [labs/private] - https://gerrit.wikimedia.org/r/830868 (owner: Ayounsi)
[13:50:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: Will do maint later', diff saved to https://phabricator.wikimedia.org/P34307 and previous config saved to /var/cache/conftool/dbconfig/20220908-135019-ladsgroup.json
[13:50:31] <arlolra>	 I guess I should have spoken up sooner
[13:50:53] <Lucas_WMDE>	 I don’t think there’s time for backports now, no
[13:50:55] <Lucas_WMDE>	 my bad :(
[13:50:59] <wikibugs>	 (CR) Ayounsi: [V: +2 C: +2] Move peeringdb token to spicerack namespace [labs/private] - https://gerrit.wikimedia.org/r/830868 (owner: Ayounsi)
[13:51:00] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/830846 (owner: Muehlenhoff)
[13:51:11] <arlolra>	 ok, no problem, thanks
[13:51:43] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:51:46] <wikibugs>	 SRE, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: icinga raid monitoring inoperable for H750 controllers - https://phabricator.wikimedia.org/T315608 (MoritzMuehlenhoff) The servers with Perc H750 are now correctly detected by Puppet and the respective new monitoring scrip...
[13:52:16] <wikibugs>	 (CR) Jbond: "lgtm" [puppet] - https://gerrit.wikimedia.org/r/830852 (https://phabricator.wikimedia.org/T313312) (owner: Muehlenhoff)
[13:53:33] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[13:54:13] <wikibugs>	 (CR) Volans: [C: +2] doc: fix sphinx_checker for Python 3.10 [software/spicerack] - https://gerrit.wikimedia.org/r/830871 (owner: Volans)
[13:55:40] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:55:41] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wtp1036.eqiad.wmnet
[13:55:49] <wikibugs>	 (PS1) Matthias Mullie: Add SearchVue to extension-list and config var [mediawiki-config] - https://gerrit.wikimedia.org/r/830874 (https://phabricator.wikimedia.org/T310367)
[13:56:13] <wikibugs>	 (Abandoned) Matthias Mullie: Add SearchVue to extension-list and config var [mediawiki-config] - https://gerrit.wikimedia.org/r/830874 (https://phabricator.wikimedia.org/T310367) (owner: Matthias Mullie)
[13:56:19] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/830862 (owner: Muehlenhoff)
[13:57:04] <wikibugs>	 (PS1) Btullis: Attempt to run the MCE and MAE consumers in the GMS container [deployment-charts] - https://gerrit.wikimedia.org/r/830875 (https://phabricator.wikimedia.org/T317053)
[13:57:41] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp1037.eqiad.wmnet
[13:58:12] <wikibugs>	 (PS1) Matthias Mullie: [SearchVue] Enable extension on beta testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/830876 (https://phabricator.wikimedia.org/T310367)
[13:58:13] <wikibugs>	 (PS1) Matthias Mullie: [SearchVue] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/830877
[13:58:16] <icinga-wm>	 PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:58:33] <wikibugs>	 (PS2) Matthias Mullie: [SearchVue] Enable extension on beta ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/830876 (https://phabricator.wikimedia.org/T310367)
[13:59:16] <wikibugs>	 (Abandoned) Matthias Mullie: [SearchVue] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/830877 (owner: Matthias Mullie)
[13:59:19] <wikibugs>	 (Abandoned) Matthias Mullie: [SearchVue] Enable extension on beta ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/830876 (https://phabricator.wikimedia.org/T310367) (owner: Matthias Mullie)
[13:59:23] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM thanks" [puppet] - https://gerrit.wikimedia.org/r/819562 (owner: Ayounsi)
[14:00:13] <wikibugs>	 (CR) Ayounsi: [C: +2] "🚀" [puppet] - https://gerrit.wikimedia.org/r/819562 (owner: Ayounsi)
[14:00:22] <papaul>	 !log ongoing maintenance on mr1-codfw
[14:00:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 75%: Maint over long time ago', diff saved to https://phabricator.wikimedia.org/P34309 and previous config saved to /var/cache/conftool/dbconfig/20220908-140055-ladsgroup.json
[14:01:52] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:03:31] <wikibugs>	 (Merged) jenkins-bot: doc: fix sphinx_checker for Python 3.10 [software/spicerack] - https://gerrit.wikimedia.org/r/830871 (owner: Volans)
[14:04:02] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[14:05:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: Will do maint later', diff saved to https://phabricator.wikimedia.org/P34310 and previous config saved to /var/cache/conftool/dbconfig/20220908-140524-ladsgroup.json
[14:05:32] <wikibugs>	 SRE, Data-Services: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (CDanis) p:Triage→High
[14:06:08] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:06:09] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wtp1037.eqiad.wmnet
[14:06:23] <wikibugs>	 SRE, Observability-Metrics: librenms: investigate making the session directory 0660 - https://phabricator.wikimedia.org/T317292 (jbond) p:Triage→Medium
[14:06:32] <wikibugs>	 SRE, Observability-Metrics: librenms: investigate making the session directory 0660 - https://phabricator.wikimedia.org/T317292 (jbond) p:Medium→Low
[14:07:04] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp[1038-1042].eqiad.wmnet
[14:12:14] <wikibugs>	 (PS2) Volans: Improve documentation of the CookbookBase classes usage [software/spicerack] - https://gerrit.wikimedia.org/r/830866 (owner: Giuseppe Lavagetto)
[14:16:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 100%: Maint over long time ago', diff saved to https://phabricator.wikimedia.org/P34311 and previous config saved to /var/cache/conftool/dbconfig/20220908-141600-ladsgroup.json
[14:20:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Will do maint later', diff saved to https://phabricator.wikimedia.org/P34312 and previous config saved to /var/cache/conftool/dbconfig/20220908-142029-ladsgroup.json
[14:20:31] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[14:21:10] <wikibugs>	 (CR) Ayounsi: sre.network.peering: initial commit (12 comments) [cookbooks] - https://gerrit.wikimedia.org/r/816730 (owner: Ayounsi)
[14:21:12] <wikibugs>	 (PS12) Ayounsi: sre.network.peering: initial commit [cookbooks] - https://gerrit.wikimedia.org/r/816730
[14:22:12] <wikibugs>	 (CR) Muehlenhoff: [C: +2] smart: Also use new raid_mgmt_tools fact [puppet] - https://gerrit.wikimedia.org/r/830852 (https://phabricator.wikimedia.org/T313312) (owner: Muehlenhoff)
[14:22:24] <wikibugs>	 SRE-swift-storage: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand) - https://phabricator.wikimedia.org/T299125 (MatthewVernon) The plan is indeed to replace `swiftrepl` with `rclone`. There are two infelicities with `rclone` for our use case:   # it holds entire container listings...
[14:23:09] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:23:09] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wtp[1038-1042].eqiad.wmnet
[14:25:18] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp[1043-1047].eqiad.wmnet
[14:28:34] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:31:47] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert)
[14:31:51] <wikibugs>	 SRE, Observability-Logging, Observability-Metrics, Performance-Team (Radar): Framework for running experiments on a subset of the app server fleet - https://phabricator.wikimedia.org/T315403 (ori) p:Triage→Low
[14:33:51] <wikibugs>	 SRE-swift-storage: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand) - https://phabricator.wikimedia.org/T299125 (MoritzMuehlenhoff) >>! In T299125#8221206, @MatthewVernon wrote: > Then we need to build a .deb of the patched `rclone` (may be annoying because of the need of newer `go...
[14:34:06] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:35:10] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[14:35:26] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:36:44] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[14:38:00] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove support for mptsas RAID [puppet] - https://gerrit.wikimedia.org/r/830862 (owner: Muehlenhoff)
[14:38:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[14:38:57] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:38:58] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wtp[1043-1047].eqiad.wmnet
[14:39:44] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove absented Raid Icinga checks [puppet] - https://gerrit.wikimedia.org/r/830846 (owner: Muehlenhoff)
[14:39:55] <wikibugs>	 (PS2) Muehlenhoff: Remove absented Raid Icinga checks [puppet] - https://gerrit.wikimedia.org/r/830846
[14:40:22] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:40:57] <wikibugs>	 SRE, Thumbor: Thumbor units failing / service general slowness - https://phabricator.wikimedia.org/T312722 (Krinkle) @Joe @fgiunchedi I wrote a rough draft based on the above. Feel free to expand or correct accordingly:  https://wikitech.wikimedia.org/wiki/Incidents/2022-07-10_thumbor
[14:40:57] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp[1025-1028,1048].eqiad.wmnet
[14:41:53] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] "Definitely easier to understand." [puppet] - https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: Jaime Nuche)
[14:43:24] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:44:42] <wikibugs>	 (CR) AOkoth: vrts: install vrts script (2 comments) [puppet] - https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: AOkoth)
[14:44:49] <wikibugs>	 (CR) AOkoth: [C: +2] vrts: install vrts script [puppet] - https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: AOkoth)
[14:45:10] <wikibugs>	 (CR) AOkoth: [C: +2] vrts: install vrts script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: AOkoth)
[14:45:47] <icinga-wm>	 PROBLEM - Host asw-a-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:45:47] <icinga-wm>	 PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:45:47] <icinga-wm>	 PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:45:47] <icinga-wm>	 PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:47:09] <wikibugs>	 (PS1) Elukey: ml-services: add default in helmfiles for recreatePods [deployment-charts] - https://gerrit.wikimedia.org/r/830883
[14:47:57] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:49:00] <wikibugs>	 (CR) Cwhite: logstash: reduce webrequest retention to 31 days (1 comment) [puppet] - https://gerrit.wikimedia.org/r/821324 (https://phabricator.wikimedia.org/T313099) (owner: Cwhite)
[14:49:05] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:49:27] <wikibugs>	 (CR) CI reject: [V: -1] ml-services: add default in helmfiles for recreatePods [deployment-charts] - https://gerrit.wikimedia.org/r/830883 (owner: Elukey)
[14:49:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:49:45] <elukey>	 uff
[14:49:46] <wikibugs>	 (CR) Cwhite: logstash: reduce replica count to 1 after 1 day (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830690 (https://phabricator.wikimedia.org/T313099) (owner: Cwhite)
[14:49:57] <icinga-wm>	 PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:51:07] <wikibugs>	 (PS2) Elukey: ml-services: add default in helmfiles for recreatePods [deployment-charts] - https://gerrit.wikimedia.org/r/830883
[14:51:09] <icinga-wm>	 PROBLEM - Host mr1-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[14:51:21] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, replies inline" [cookbooks] - https://gerrit.wikimedia.org/r/816730 (owner: Ayounsi)
[14:51:31] <wikibugs>	 (CR) Cwhite: [C: +2] apifeatureusage: use new kafka truststore [puppet] - https://gerrit.wikimedia.org/r/830684 (https://phabricator.wikimedia.org/T300130) (owner: Cwhite)
[14:52:03] <elukey>	 cwhite: \o/
[14:52:30] * cwhite presses thumbs
[14:54:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:58] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[14:55:15] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:55:25] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:14] <wikibugs>	 (CR) Jaime Nuche: k8s scap: change format of mediawiki deployment files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: Jaime Nuche)
[14:56:19] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "I have not vetted every IP in this range but I'm willing to give this a try. Somewhat nervous that it will break unexpected toolforge thin" [puppet] - https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: Ayounsi)
[14:57:14] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:57:15] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wtp[1025-1028,1048].eqiad.wmnet
[14:57:49] <icinga-wm>	 RECOVERY - Host asw-d-codfw is UP: PING WARNING - Packet loss = 50%, RTA = 44.29 ms
[14:57:49] <icinga-wm>	 RECOVERY - Host asw-c-codfw is UP: PING WARNING - Packet loss = 50%, RTA = 44.15 ms
[14:57:51] <icinga-wm>	 RECOVERY - Host asw-a-codfw is UP: PING WARNING - Packet loss = 60%, RTA = 33.88 ms
[14:58:27] <icinga-wm>	 RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.51 ms
[14:58:39] <icinga-wm>	 RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms
[14:58:59] <papaul>	 !log maintenance on mr1-codfw complete 
[14:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:03] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:00:26] <wikibugs>	 SRE, Observability-Logging, Patch-For-Review, SRE Observability (FY2022/2023-Q1): Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (colewhite) apifeatureusage now using the new pki truststore and appears to be working.
[15:00:58] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (Papaul) I was having the issue below upgrading mr1 to version 21 ` Validating against /config/rescue.conf.gz /config/rescue.conf.gz:61:(21) syntax error at 'rfc-co...
[15:01:28] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (Papaul)
[15:01:52] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (Papaul) ` papaul@mr1-codfw> show version Hostname: mr1-codfw Model: srx300 Junos: 21.2R3-S2.9 JUNOS Software Release [21.2R3-S2.9] `
[15:01:52] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:02:32] <moritzm>	 !log installing nginx security updates on bullseye
[15:02:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:34] <icinga-wm>	 RECOVERY - Host mr1-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms
[15:04:12] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] wtp: Purge wtp servers following migration to parse [mediawiki-config] - https://gerrit.wikimedia.org/r/830803 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert)
[15:04:38] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "Overall LGTM, if merged once servers are decommissioned." [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert)
[15:04:58] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: add default in helmfiles for recreatePods [deployment-charts] - https://gerrit.wikimedia.org/r/830883 (owner: Elukey)
[15:05:05] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] wtp: Purge wtp servers following migration to parse [mediawiki-config] - https://gerrit.wikimedia.org/r/830803 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert)
[15:05:32] <wikibugs>	 (CR) Btullis: [C: +2] Attempt to run the MCE and MAE consumers in the GMS container [deployment-charts] - https://gerrit.wikimedia.org/r/830875 (https://phabricator.wikimedia.org/T317053) (owner: Btullis)
[15:06:04] <wikibugs>	 (CR) Volans: [C: +1] "Thanks Giuseppe for improving the existing docs. I'm happy to merge this as-is. If we see that it causes confusion because of the shortene" [software/spicerack] - https://gerrit.wikimedia.org/r/830866 (owner: Giuseppe Lavagetto)
[15:07:16] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] wtp: Purge wtp servers following migration to parse [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert)
[15:08:48] <wikibugs>	 (Merged) jenkins-bot: ml-services: add default in helmfiles for recreatePods [deployment-charts] - https://gerrit.wikimedia.org/r/830883 (owner: Elukey)
[15:09:00] <wikibugs>	 (Merged) jenkins-bot: Attempt to run the MCE and MAE consumers in the GMS container [deployment-charts] - https://gerrit.wikimedia.org/r/830875 (https://phabricator.wikimedia.org/T317053) (owner: Btullis)
[15:10:39] <claime>	 jouncebot: now
[15:10:39] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 49 minute(s)
[15:11:11] <wikibugs>	 (CR) Clément Goubert: [C: +2] wtp: Purge wtp servers following migration to parse [mediawiki-config] - https://gerrit.wikimedia.org/r/830803 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert)
[15:11:55] <wikibugs>	 (Merged) jenkins-bot: wtp: Purge wtp servers following migration to parse [mediawiki-config] - https://gerrit.wikimedia.org/r/830803 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert)
[15:11:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[15:12:44] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:14:22] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:14:55] <wikibugs>	 (PS5) Ayounsi: Remove 185.15.56.0/24 from network::external [puppet] - https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864)
[15:15:45] <papaul>	 volans: hey done with the mr1-codfw upgrade and kafka-logging1005 is ready for testing the provision cookbook
[15:15:49] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/824315 (https://phabricator.wikimedia.org/T315500) (owner: Cwhite)
[15:16:19] <volans>	 papaul: great, thanks, have you seen the patch? there is a question for you too there about how much to sleep
[15:16:44] <papaul>	 volans: looking
[15:16:53] <volans>	 thx
[15:18:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:18:14] <wikibugs>	 (CR) Ayounsi: [C: +2] Remove 185.15.56.0/24 from network::external [puppet] - https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: Ayounsi)
[15:18:33] <wikibugs>	 (CR) Andrew Bogott: [C: +2] prometheus-openstack-stale-puppet-certs: preserve original cert name [puppet] - https://gerrit.wikimedia.org/r/830704 (owner: Andrew Bogott)
[15:19:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:19:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:20:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:20:12] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:21:48] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: sync on main
[15:21:57] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[15:22:02] <icinga-wm>	 PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2005 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:22:02] <wikibugs>	 (CR) Abijeet Patro: [C: -1] "Waiting for community feedback: https://meta.wikimedia.org/wiki/Meta_talk:Babylon#Grant_editcontentmodel_right_for_translation_administrat" [mediawiki-config] - https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587) (owner: Abijeet Patro)
[15:22:32] <icinga-wm>	 PROBLEM - Check systemd state on conf2005 is CRITICAL: CRITICAL - degraded: The following units failed: etcdmirror-conftool-eqiad-wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:22:36] <wikibugs>	 (CR) Papaul: sre.hosts.provision: reboot after RAID changes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/830676 (owner: Volans)
[15:22:59] <wikibugs>	 (CR) Volans: [C: +2] sre.hosts.provision: reboot after RAID changes [cookbooks] - https://gerrit.wikimedia.org/r/830676 (owner: Volans)
[15:23:11] <volans>	 thanks papaul, merging and deploying then we can test it
[15:23:37] <papaul>	 volans: thanks
[15:23:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:24:38] <XioNoX>	 papaul: good job on the upgrade!
[15:24:46] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: sync on main
[15:24:51] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[15:24:53] <papaul>	 XioNoX: thanks
[15:25:11] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[15:25:20] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: sync on main
[15:25:36] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[15:26:28] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:27:39] <wikibugs>	 (CR) BCornwall: varnish: Remove extraneous checks for Docker (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826367 (owner: BCornwall)
[15:27:57] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.provision: reboot after RAID changes [cookbooks] - https://gerrit.wikimedia.org/r/830676 (owner: Volans)
[15:28:22] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:28:44] <logmsgbot>	 !log cgoubert@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:830803|wtp: Purge wtp servers following migration to parse (T317025)]] (duration: 12m 48s)
[15:28:47] <stashbot>	 T317025: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025
[15:30:38] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:33:43] <akosiaris>	 !log restart etcdmirror on conf2005
[15:33:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:50] <icinga-wm>	 RECOVERY - Check systemd state on conf2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:35:30] <icinga-wm>	 RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2005 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:35:36] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=api-https
[15:36:32] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=api_appserver
[15:36:42] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=appserver
[15:37:54] <_joe_>	 sigh
[15:38:16] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.2.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[15:38:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:39:23] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.dns.netbox
[15:39:32] <wikibugs>	 (PS1) Ayounsi: Revert "Exclude cloud-eqiad prefix from VRT trusted networks" [puppet] - https://gerrit.wikimedia.org/r/830818
[15:39:37] <wikibugs>	 (PS1) Ayounsi: Revert "Exclude cloud-eqiad prefix from MXs trusted networks" [puppet] - https://gerrit.wikimedia.org/r/830819
[15:39:43] <wikibugs>	 (PS1) Ayounsi: Revert "Exclude cloud-eqiad prefix from lists trusted networks" [puppet] - https://gerrit.wikimedia.org/r/830820
[15:40:37] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:40:38] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[15:42:12] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:44:38] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:45:42] <logmsgbot>	 !log cgoubert@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:830803|wtp: Purge wtp servers following migration to parse (T317025)]] (duration: 04m 00s)
[15:45:45] <stashbot>	 T317025: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025
[15:48:34] <wikibugs>	 (PS1) Btullis: Add the prometheus config to enable scraping from the dse-k8s cluster [puppet] - https://gerrit.wikimedia.org/r/830897 (https://phabricator.wikimedia.org/T310179)
[15:49:34] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:49] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.dns.netbox
[15:50:32] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid
[15:51:05] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:51:58] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:52:54] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.dns.netbox
[15:55:30] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:55:38] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:57:06] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:57:06] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:57:47] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host kafka-logging1005.mgmt.eqiad.wmnet with reboot policy FORCED
[15:58:38] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:58:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:00:05] <jouncebot>	 jbond and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T1600).
[16:00:05] <jouncebot>	 dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:09] <dancy>	 ]o/
[16:01:16] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:01:24] <jbond>	 dancy: taking a look now
[16:01:43] <wikibugs>	 (CR) Jbond: [C: +2] Revert comment change [puppet] - https://gerrit.wikimedia.org/r/828583 (owner: Ahmon Dancy)
[16:01:54] <wikibugs>	 (PS3) Jbond: Revert comment change [puppet] - https://gerrit.wikimedia.org/r/828583 (owner: Ahmon Dancy)
[16:03:20] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:04:10] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.17.0" for 566 hosts
[16:04:23] <wikibugs>	 SRE, Traffic, Patch-For-Review: Create program to interact with Atlas RIPE API - https://phabricator.wikimedia.org/T315536 (BCornwall) p:Triage→Medium
[16:04:29] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37171/console" [puppet] - https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: Jaime Nuche)
[16:04:38] <wikibugs>	 SRE, Traffic: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (BCornwall) p:Triage→Medium
[16:04:54] <wikibugs>	 SRE, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097 (BCornwall) p:Triage→Medium
[16:04:59] <wikibugs>	 SRE, Traffic, Patch-For-Review: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (BCornwall) p:Triage→Medium
[16:05:39] <wikibugs>	 (CR) Jbond: [V: +1] "> Patch Set 1: Verified+1" [puppet] - https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: Jaime Nuche)
[16:06:17] <wikibugs>	 (CR) Ladsgroup: [C: +1] Revert "Exclude cloud-eqiad prefix from lists trusted networks" [puppet] - https://gerrit.wikimedia.org/r/830820 (owner: Ayounsi)
[16:06:43] <wikibugs>	 (CR) Ssingh: varnish: Remove extraneous checks for Docker (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826367 (owner: BCornwall)
[16:07:27] <jbond>	 dancy: i have merged the revert just checking with service ops on the other one
[16:07:55] <dancy>	 ok.  Note that for https://gerrit.wikimedia.org/r/c/operations/puppet/+/830850/, scap is the only thing that reads the resulting file (and the scap code to do that reading is not enabled yet)
[16:08:24] <dancy>	 And the changes reported in https://puppet-compiler.wmflabs.org/pcc-worker1002/37171/deploy2002.codfw.wmnet/index.html are expected.
[16:08:36] <jbond>	 dancy: ack so its safe to merge and you will update the scap code later?
[16:08:58] <dancy>	 Yep.  I intend to merge the corresponding scap code today if all goes well.
[16:08:59] <wikibugs>	 (CR) BCornwall: varnish: Remove extraneous checks for Docker (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826367 (owner: BCornwall)
[16:09:08] <jbond>	 ack great thanks will merge now
[16:09:12] <dancy>	 thx!
[16:09:16] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] k8s scap: change format of mediawiki deployment files [puppet] - https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: Jaime Nuche)
[16:09:23] <wikibugs>	 (PS2) Jbond: k8s scap: change format of mediawiki deployment files [puppet] - https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: Jaime Nuche)
[16:10:57] <wikibugs>	 (CR) Ahmon Dancy: k8s scap: change format of mediawiki deployment files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: Jaime Nuche)
[16:13:03] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/830690 (https://phabricator.wikimedia.org/T313099) (owner: Cwhite)
[16:13:36] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/821324 (https://phabricator.wikimedia.org/T313099) (owner: Cwhite)
[16:15:39] <jbond>	 dancy: merged and deployed
[16:15:43] <dancy>	 Thanks jbond!
[16:15:47] <jbond>	 np
[16:16:13] <dancy>	 I have verified that the new format file is on deploy1002.
[16:17:43] <wikibugs>	 (CR) BCornwall: varnish: Remove extraneous checks for Docker (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826367 (owner: BCornwall)
[16:18:10] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:22:56] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1005.mgmt.eqiad.wmnet with reboot policy FORCED
[16:23:04] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-Uploading: File upload not working: Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T295343 (Krinkle)
[16:23:38] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-Uploading, Traffic, and 2 others: 502 Server Hangup Error on esams for "Upload a new version of this file" on Special:Upload on Commons - https://phabricator.wikimedia.org/T247454 (Krinkle)
[16:25:19] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (Papaul)
[16:28:43] <wikibugs>	 (PS1) AOkoth: vrts: fix puppet failure on otrs1001 [puppet] - https://gerrit.wikimedia.org/r/830902 (https://phabricator.wikimedia.org/T317059)
[16:29:05] <wikibugs>	 (PS2) AOkoth: vrts: fix puppet failure on otrs1001 [puppet] - https://gerrit.wikimedia.org/r/830902 (https://phabricator.wikimedia.org/T317059)
[16:30:04] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:32:04] <wikibugs>	 (PS3) AOkoth: vrts: fix puppet failure on otrs1001 [puppet] - https://gerrit.wikimedia.org/r/830902 (https://phabricator.wikimedia.org/T317059)
[16:32:25] <wikibugs>	 (PS4) AOkoth: vrts: fix puppet failure on otrs1001 [puppet] - https://gerrit.wikimedia.org/r/830902 (https://phabricator.wikimedia.org/T317059)
[16:34:38] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:37:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:37:50] <wikibugs>	 (CR) AOkoth: [C: +2] vrts: fix puppet failure on otrs1001 [puppet] - https://gerrit.wikimedia.org/r/830902 (https://phabricator.wikimedia.org/T317059) (owner: AOkoth)
[16:39:24] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:42:13] <wikibugs>	 (CR) Andrew Bogott: "https://phabricator.wikimedia.org/T317344" [puppet] - https://gerrit.wikimedia.org/r/830860 (https://phabricator.wikimedia.org/T315608) (owner: Muehlenhoff)
[16:43:17] <wikibugs>	 (PS8) BCornwall: prometheus: Parse ATS config to node exporter text [puppet] - https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815)
[16:43:19] <wikibugs>	 (PS4) BCornwall: ats: Enable node_ats_config monitoring [puppet] - https://gerrit.wikimedia.org/r/830686 (https://phabricator.wikimedia.org/T292815)
[16:43:58] <wikibugs>	 (CR) BCornwall: "This has been tested with pcc as well as running the commands manually on an ATS instance." [puppet] - https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[16:45:42] <wikibugs>	 (PS1) Hnowlan: Fix offline tests [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830903
[16:46:10] <wikibugs>	 (PS2) Hnowlan: Fix online tests [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/830903
[16:49:06] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:49:40] <sukhe>	 ^ misbehaving Google CT log. if it persists, we will remove it
[16:51:18] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:56:06] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:58:14] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:00:04] <jouncebot>	 bd808: May I have your attention please! Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T1700)
[17:00:31] * bd808 makes a patch to update developer portal
[17:00:38] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:03:08] <wikibugs>	 (03CR) 10Ssingh: varnish: Remove extraneous checks for Docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall)
[17:03:54] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump container version to 2022-09-08-111810-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/830906
[17:13:31] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2022-09-08-111810-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/830906 (owner: 10BryanDavis)
[17:13:33] <wikibugs>	 (03PS1) 10Vlad.shapik: Remove division operation hack related to Python2 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393)
[17:14:31] <wikibugs>	 (03PS2) 10Vlad.shapik: WP: Remove division operation hack related to Python2 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393)
[17:15:12] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:15:22] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:17:02] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2022-09-08-111810-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/830906 (owner: 10BryanDavis)
[17:18:24] <wikibugs>	 (03CR) 10Vlad.shapik: [C: 03+1] "Looks reasonable to remove ignoring." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830608 (owner: 10Hnowlan)
[17:20:52] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:21:20] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:21:26] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:22:05] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:22:13] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:22:24] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:22:58] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:25:23] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Revert "Exclude cloud-eqiad prefix from lists trusted networks" [puppet] - 10https://gerrit.wikimedia.org/r/830820 (owner: 10Ayounsi)
[17:26:07] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10Krinkle) 05Open→03Resolved a:03Krinkle
[17:27:20] <joal>	 Hi ops folks - I could do with some help on the stat1008.eqiad.wmnet host - we have a process killing the host CPU - can any of you have a look please?
[17:28:26] <jynus>	 Krinkle: I am not sure we should close T316188 until at least a report is filed
[17:28:27] <stashbot>	 T316188: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188
[17:29:15] <sukhe>	 joal: can't SSH to the host, so I guess we have to force a reboot. if that's OK, I am happy to do that but I also realize there might be other active scripts running and the output might not be saved?
[17:29:24] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10Krinkle) Public placeholder report at: <https://wikitech.wikimedia.org/wiki/Incidents/2022-08-24_swift>
[17:29:38] <Krinkle>	 jynus: the prod error is resolved.
[17:29:57] <Krinkle>	 afaik we don't usually use tasks for the writing of an incident unless the incident has a tracking task which most don't.
[17:30:11] <joal>	 thanks sukhe - indeed there could be other scripts - I wonder if it's worth leaving the host alone and expecting it might come back, or rebooting it now
[17:30:16] <jynus>	 I don't mind actually resolving that, but people will forget to file one if there is nothing on phab encouraging them to do that
[17:30:54] <joal>	 sukhe: let's reboot it please - it's unusable now, so let's bring it back
[17:30:57] <sukhe>	 joal: I saw the conversation in the other channel. if it is simply a matter of a CPU intensive cookbook we can perhaps wait for it but if it is an unknown, then a reboot is probably the only way
[17:31:14] <Krinkle>	 I assume the incident ritual and spreadsheet will eventually make it to this through the "draft" category or whatever we use to track that. There are tons more that don't have a task for it that presumably go through the same process.
[17:31:23] <Krinkle>	 I closed it for the prod error stats :)
[17:31:26] <sukhe>	 joal: sure. OK to reboot it then? I will wait for your definite yes
[17:31:28] <Krinkle>	 for which I'm 64 days overdue.
[17:31:50] <joal>	 Yes sukhe - please reboot - thank you
[17:31:53] <sukhe>	 doing
[17:33:01] <sukhe>	 !log stat1008: sudo ipmitool -I lanplus -H "stat1008.mgmt.eqiad.wmnet" -U root -E chassis power cycle
[17:33:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:44] <wikibugs>	 (03CR) 10BCornwall: Unlink certificate renewal and OCSP handling (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: 10BCornwall)
[17:35:56] <icinga-wm>	 PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100%
[17:36:05] <wikibugs>	 (03CR) 10BCornwall: varnish: Remove extraneous checks for Docker (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall)
[17:38:53] <sukhe>	 joal: doesn't seem to be coming back up so I am guessing it was another issue
[17:39:06] <sukhe>	 https://puppetboard.wikimedia.org/node/stat1008.eqiad.wmnet seems to suggest Puppet was failing for over a day now
[17:39:16] <sukhe>	 Failed to set owner to '0': Read-only file system @ apply2files - /mnt/nfs/dumps-labstore1006.wikimedia.org
[17:39:18] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:39:22] <joal>	 wow
[17:39:25] <sukhe>	 change from 400 to 'root' failed: Failed to set owner to '0': Read-only file system @ apply2files - /mnt/nfs/dumps-labstore1006.wikimedia.org
[17:39:44] <icinga-wm>	 RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[17:39:47] <sukhe>	 ok back up now
[17:40:15] <wikibugs>	 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10Andrew) ` root@cloudcontrol2005-dev:~# dig +noall +answer SOA 16-29.57.15.185.in-addr.arpa. 16-29.57.15.185.in-addr.arpa. 120 IN SOA ns0.openstack.codfw1dev.wikime...
[17:40:26] <sukhe>	 a Puppet run is in progress, let's see if it completes. but yes, probably requires a deeper look
[17:41:04] <joal>	 sukhe: the errors feel related to a change that happened yesterday: migration of labstore to clouddumps, changing nfs
[17:41:12] <joal>	 I hope puppet doesn't fail :S
[17:41:34] <wikibugs>	 (03PS1) 10Dduvall: buildkitd: Bump version to 0.10.4 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/830909
[17:41:51] <sukhe>	 joal: yeah, I see the commit (5dd213d019a012dee344ee2a6a586c0615b7c9dd)
[17:42:01] <bblack>	 with the news, we've got a spike of traffic
[17:42:03] <joal>	 it's been rolled back if I'm not mistaken
[17:42:19] <bblack>	 I'd advise we aim for stability for a little while right now, avoid any risky changes that can be deferred
[17:43:00] <sukhe>	 joal: Puppet failed again, similar error. I am not sure if you can access Puppetboard so I am happy to share the error and you can create a task (not sure who owns the machines but yeah)
[17:43:49] <joal>	 sukhe: I'm no SRE, I can't access - If you could create a task with the error and ping btullis on it that'd be awesome
[17:43:57] <sukhe>	 btullis, sure
[17:44:00] <sukhe>	 happy to do that
[17:44:08] <joal>	 Thank you so much sukhe 
[17:48:35] <andrewbogott>	 joal: how can I best see the issue?
[17:48:48] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (10Krinkle)
[17:49:02] <sukhe>	 andrewbogott: if it helps, see T317359
[17:49:02] <stashbot>	 T317359: Puppet failure on stat1008 - https://phabricator.wikimedia.org/T317359
[17:49:06] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (10Krinkle)
[17:49:10] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10Krinkle)
[17:49:12] <joal>	 andrewbogott: I think sukhe is creating a task - puppet doesn't run on the host anymore
[17:49:47] <andrewbogott>	 ok, looking...
[17:49:52] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 180 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:50:06] <sukhe>	 oh
[17:52:18] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:00:04] <jouncebot>	 jeena and dduvall: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T1800).
[18:00:30] <jeena>	 train deployments are paused for the time being due to high traffic from current events. We will re-assess in an hour
[18:00:50] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:03:38] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (10jcrespo) Adding @MatthewVernon, as he and @fgiunchedi will be the most knowledgeable people to understand what went wrong to add to the doc. Moritz and I can h...
[18:05:26] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:18:28] <icinga-wm>	 RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:19:46] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:22:22] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:22:31] <sukhe>	 sigh, going to remove the Google CT log
[18:33:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:41:00] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:42:02] <wikibugs>	 (03PS1) 10DDesouza: Deploy Research Incentive Survey to idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830917 (https://phabricator.wikimedia.org/T316466)
[18:42:35] <wikibugs>	 (03PS2) 10DDesouza: Deploy Research Incentive Survey to idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830917 (https://phabricator.wikimedia.org/T316466)
[18:47:52] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:48:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (10Jclark-ctr) Dell has requested i run Hardware Diagnostics after Support log showed no errors i have run multiple t...
[18:48:32] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:19] <jeena>	 I will be rolling forward to group 1 in a few minutes
[19:03:07] <wikibugs>	 (03PS1) 10Bking: Revert "elastic: reduce master-eligibles for codfw back down to 2" [puppet] - 10https://gerrit.wikimedia.org/r/830923 (https://phabricator.wikimedia.org/T313431)
[19:04:44] <wikibugs>	 (03PS2) 10Ryan Kemper: Revert "elastic: reduce master-eligibles for codfw back down to 2" [puppet] - 10https://gerrit.wikimedia.org/r/830923 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking)
[19:04:50] <Amir1>	 jeena: I'm around as mw and sre stuff, let me know if you see massive changes
[19:05:11] <wikibugs>	 (03PS1) 10Jgreen: Add a temporary TXT record for Dmarcian account owner change from ccogdill@ to postmaster@. [dns] - 10https://gerrit.wikimedia.org/r/830925 (https://phabricator.wikimedia.org/T316899)
[19:05:14] <Reedy>	 Amir1: all the code changed!!!!
[19:05:26] <Amir1>	 :D
[19:05:50] <jeena>	 Thanks Amir1!
[19:06:06] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37175/console" [puppet] - 10https://gerrit.wikimedia.org/r/830923 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking)
[19:06:16] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830926 (https://phabricator.wikimedia.org/T314189)
[19:06:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830926 (https://phabricator.wikimedia.org/T314189) (owner: 10TrainBranchBot)
[19:06:58] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] Add a temporary TXT record for Dmarcian account owner change from ccogdill@ to postmaster@. [dns] - 10https://gerrit.wikimedia.org/r/830925 (https://phabricator.wikimedia.org/T316899) (owner: 10Jgreen)
[19:07:00] <Amir1>	 jeena: just curious, when is the plan for group2?
[19:07:03] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830926 (https://phabricator.wikimedia.org/T314189) (owner: 10TrainBranchBot)
[19:07:29] <jeena>	 I was going to go ahead and go to all wikis if all seemed fine after 15-30 minutes
[19:07:30] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:11:25] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.28  refs T314189
[19:11:27] <wikibugs>	 (03PS3) 10Ryan Kemper: Revert "elastic: reduce master-eligibles for codfw back down to 2" [puppet] - 10https://gerrit.wikimedia.org/r/830923 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking)
[19:11:28] <stashbot>	 T314189: 1.39.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T314189
[19:12:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:13:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:13:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:14:42] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:14:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:14:52] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:15:05] <logmsgbot>	 !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.28  refs T314189 (duration: 03m 39s)
[19:17:15] <Amir1>	 jeena: read on s8 is quite elevated but it's partially expected, let me see if it recovers 
[19:17:23] <jeena>	 okay
[19:19:24] <Amir1>	 jeena: mostly recovered 
[19:19:29] <jeena>	 nice
[19:19:51] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::novaproxy: add prometheus nginx exporter [puppet] - 10https://gerrit.wikimedia.org/r/830928
[19:25:53] <jeena>	 logs look alright so if all seems fine to you still Amir1 then I'll roll to all wikis soon
[19:26:49] <Amir1>	 we have an uptick in esams, I'm not sure why but it shouldn't affect anything
[19:26:58] <Amir1>	 reading a million graphs at the same time
[19:27:14] <jeena>	 is there a particular dashboard I should be looking at?
[19:28:01] <Amir1>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=now-1h&to=now
[19:28:05] <Amir1>	 This is the most important one
[19:28:26] <Amir1>	 but i check db load as well, as that usually gets upset first https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-1h&to=now
[19:28:45] <jeena>	 thanks!
[19:28:50] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:29:12] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:30:43] <jeena>	 deploying to all wikis now
[19:31:16] <wikibugs>	 (03PS1) 10TrainBranchBot: all wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830929 (https://phabricator.wikimedia.org/T314189)
[19:31:16] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:31:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830929 (https://phabricator.wikimedia.org/T314189) (owner: 10TrainBranchBot)
[19:32:07] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830929 (https://phabricator.wikimedia.org/T314189) (owner: 10TrainBranchBot)
[19:35:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:35:17] * Amir1 makes a cross
[19:35:58] <Amir1>	 the five hundreds from the new train are arriving but natural 
[19:36:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:36:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:36:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr
[19:36:24] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.28  refs T314189
[19:36:27] <stashbot>	 T314189: 1.39.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T314189
[19:36:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:38:23] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::novaproxy: add rate limiter [puppet] - 10https://gerrit.wikimedia.org/r/830932
[19:39:49] <ebernhardson>	 jeena: hmm, elasticsearch (risky patch this week) is rejecting a few requests right now, will wait a minute to see if it subsides (the initial traffic spike onto the db can be a bit rough, needs to pull a lot of content from disk into memory caches), but if it stays will need to roll back
[19:40:04] <jeena>	 ok let me know
[19:40:57] <Amir1>	 500s are fine
[19:42:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:42:49] <wikibugs>	 (03PS1) 10Majavah: hieradata: add fake novaproxy passwords [labs/private] - 10https://gerrit.wikimedia.org/r/830933
[19:43:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:43:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:43:05] <ebernhardson>	 spike in rejections is declining, from 5k/30s down to 3k/30s, if it keeps declining should be fine
[19:43:08] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:43:18] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[19:43:49] <jeena>	 thanks ebernhardson 
[19:43:50] <wikibugs>	 (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] hieradata: add fake novaproxy passwords [labs/private] - 10https://gerrit.wikimedia.org/r/830933 (owner: 10Majavah)
[19:43:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:44:12] <jeena>	 there are quite a few errors contacting parsoid/RESTBase that i'm not sure about
[19:45:32] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:46:04] <wikibugs>	 (03PS1) 10Andrew Bogott: dynamic proxy: block a second troublesome UA [puppet] - 10https://gerrit.wikimedia.org/r/830934
[19:47:09] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37178/console" [puppet] - 10https://gerrit.wikimedia.org/r/830932 (owner: 10Majavah)
[19:48:36] <ebernhardson>	 jeena: hmm, but now it's going back up :( we might need to roll back and rebalance the shards in the cluster, essentially two of the nodes are looking overloaded and elastic doesn't do a good job of routing around struggling nodes
[19:49:08] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:49:11] <jeena>	 okay, roll back to group1 or further?
[19:49:46] <ebernhardson>	 jeena: group1 is fine. hopefully just an hour or so to tell elastic to move some shards away from these two nodes
[19:49:58] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] P:wmcs::novaproxy: add rate limiter [puppet] - 10https://gerrit.wikimedia.org/r/830932 (owner: 10Majavah)
[19:50:18] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37179/console" [puppet] - 10https://gerrit.wikimedia.org/r/830934 (owner: 10Andrew Bogott)
[19:50:44] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:51:22] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830935 (https://phabricator.wikimedia.org/T314189)
[19:51:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830935 (https://phabricator.wikimedia.org/T314189) (owner: 10TrainBranchBot)
[19:51:36] <jeena>	 ebernhardson: rolling back now
[19:52:06] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830935 (https://phabricator.wikimedia.org/T314189) (owner: 10TrainBranchBot)
[19:54:28] <ebernhardson>	 jeena: thanks!
[19:56:19] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.39.0-wmf.27  refs T314189
[19:56:23] <stashbot>	 T314189: 1.39.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T314189
[19:59:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:00:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:00:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:00:04] <jouncebot>	 brennen and TheresNoTime: Time to snap out of that daydream and deploy UTC late backport and config training. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T2000).
[20:00:04] <jouncebot>	 arlolra and danisztls: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:24] <arlolra>	 here
[20:00:24] * TheresNoTime is here!
[20:00:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:00:58] <thcipriani>	 o/
[20:01:02] <Amir1>	 the train is ongoing
[20:01:15] <thcipriani>	 I saw that we just rolled back
[20:01:44] <TheresNoTime>	 I was just looking to see where things were
[20:02:11] <hashar>	 if needed I can roll the train tomorrow morning (in like 12 hours)
[20:02:26] <TheresNoTime>	 are we delaying this deployment window then? :)
[20:02:37] <thcipriani>	 looks like we're staying in the current state for an hour or so at least (is that right ebernhardson ?)
[20:02:37] <ebernhardson>	 train rolled back for elasticsearch, it was struggling with the paired traffic shift from eqiad->codfw, two nodes struggling but one in particular rejected 110k requests over 15 minutes. Optimistically can re-run the train forward in about an hour after we get elastic to shuffle some shards away
[20:02:49] <Sario>	 \o/
[20:03:22] <thcipriani>	 k, sounds like we can backport in the interim
[20:03:48] <TheresNoTime>	 Okay! I can deploy :)
[20:04:26] <hashar>	 thanks Erik!
[20:04:37] <TheresNoTime>	 arlolra: will start with 830702
[20:05:31] <arlolra>	 thank you
[20:05:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/830702 (https://phabricator.wikimedia.org/T317215) (owner: 10Arlolra)
[20:09:43] <TheresNoTime>	 looks like these two backports are going to take ~15 minutes in CI :')
[20:10:00] <arlolra>	 fun
[20:10:54] <thcipriani>	 TheresNoTime: if you want to get out the config change while you're waiting you can ctrl-c and re-run scap backport after it merges and it'll Just Work™
[20:11:21] <TheresNoTime>	 ooh!
[20:11:55] <TheresNoTime>	 doesn't look like danisztls is around to test though
[20:12:00] <thcipriani>	 ah
[20:12:08] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[20:12:32] <thcipriani>	 well nevermind :) for future reference, I guess
[20:12:43] <TheresNoTime>	 useful to know, thank you :)
[20:14:38] <jynus>	 nice, cirrus error rate even lower than 2 hours ago
[20:20:59] <wikibugs>	 (03Merged) 10jenkins-bot: Fix selser on html endpoints [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/830702 (https://phabricator.wikimedia.org/T317215) (owner: 10Arlolra)
[20:21:23] <Amir1>	 I go take a break, keep my phone close
[20:21:25] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:830702|Fix selser on html endpoints (T317215)]]
[20:21:28] <stashbot>	 T317215: Selser broken on html endpoints - https://phabricator.wikimedia.org/T317215
[20:21:49] <logmsgbot>	 !log samtar@deploy1002 samtar and arlolra: Backport for [[gerrit:830702|Fix selser on html endpoints (T317215)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[20:22:10] <TheresNoTime>	 arlolra: can you test on mwdebug1001?
[20:23:23] <wikibugs>	 (03PS2) 10Dzahn: Revert "Exclude cloud-eqiad prefix from VRT trusted networks" [puppet] - 10https://gerrit.wikimedia.org/r/830818 (owner: 10Ayounsi)
[20:24:01] <TheresNoTime>	 (not entirely sure that's testable to be honest)
[20:25:00] <arlolra>	 yes, give me a sec
[20:25:11] <TheresNoTime>	 Okay :)
[20:28:47] <arlolra>	 it seems safe to continue
[20:28:59] <TheresNoTime>	 Syncing :)
[20:31:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:32:43] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "Exclude cloud-eqiad prefix from VRT trusted networks" [puppet] - https://gerrit.wikimedia.org/r/830818 (owner: Ayounsi)
[20:33:31] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:830702|Fix selser on html endpoints (T317215)]] (duration: 12m 06s)
[20:33:34] <stashbot>	 T317215: Selser broken on html endpoints - https://phabricator.wikimedia.org/T317215
[20:33:39] <TheresNoTime>	 arlolra: now doing 830703
[20:33:54] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830703 (https://phabricator.wikimedia.org/T317215) (owner: Arlolra)
[20:34:10] <wikibugs>	 (CR) Dzahn: [C: +2] "noop confirmed" [puppet] - https://gerrit.wikimedia.org/r/830818 (owner: Ayounsi)
[20:34:39] <arlolra>	 28 is only on group0?
[20:35:15] <dancy>	 https://versions.toolforge.org/ shows .28 is on group0 and group1
[20:35:25] <dancy>	 group is on .27
[20:35:28] <dancy>	 *group2
[20:35:37] <arlolra>	 thank you
[20:36:09] <wikibugs>	 (CR) Dzahn: [C: +1] "true, also the current section says "no wikis" and it's a wiki" [dns] - https://gerrit.wikimedia.org/r/830863 (owner: Zabe)
[20:37:01] <wikibugs>	 (CR) Dzahn: [C: +2] wikimedia.org: Move nyc to the wikis section [dns] - https://gerrit.wikimedia.org/r/830863 (owner: Zabe)
[20:37:05] <wikibugs>	 (PS2) Dzahn: wikimedia.org: Move nyc to the wikis section [dns] - https://gerrit.wikimedia.org/r/830863 (owner: Zabe)
[20:37:33] <TheresNoTime>	 While we wait, is there anyone here who is familiar enough with T316466 (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/830917) and wants to take up testing it, otherwise it's unlikely to get deployed
[20:37:34] <stashbot>	 T316466: Deploy Research Incentive Survey on Indonesian Wikipedia - https://phabricator.wikimedia.org/T316466
[20:37:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[20:37:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:38:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:40:26] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:44:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:48:17] <TheresNoTime>	 830703 almost merged :')
[20:49:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:54:21] <wikibugs>	 (Merged) jenkins-bot: Fix selser on html endpoints [core] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830703 (https://phabricator.wikimedia.org/T317215) (owner: Arlolra)
[20:54:25] <wikibugs>	 (CR) Dzahn: [C: +1] "downloaded  wget https://github.com/moby/buildkit/releases/download/v0.10.4/buildkit-v0.10.4.linux-amd64.tar.gz and confirmed SHA256sum" [docker-images/production-images] - https://gerrit.wikimedia.org/r/830909 (owner: Dduvall)
[20:54:39] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830703 (https://phabricator.wikimedia.org/T317215) (owner: Arlolra)
[20:55:01] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:830703|Fix selser on html endpoints (T317215)]]
[20:55:04] <stashbot>	 T317215: Selser broken on html endpoints - https://phabricator.wikimedia.org/T317215
[20:55:24] <logmsgbot>	 !log samtar@deploy1002 samtar and arlolra: Backport for [[gerrit:830703|Fix selser on html endpoints (T317215)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:55:43] <TheresNoTime>	 arlolra: finally merged! can you test on mwdebug1001 please :)
[20:55:50] <arlolra>	 ok
[20:56:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:56:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:57:38] <arlolra>	 ok, please continue
[20:57:47] <TheresNoTime>	 syncing
[20:59:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:01:49] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:830703|Fix selser on html endpoints (T317215)]] (duration: 06m 48s)
[21:01:54] <stashbot>	 T317215: Selser broken on html endpoints - https://phabricator.wikimedia.org/T317215
[21:01:57] <TheresNoTime>	 all done :)
[21:02:04] <TheresNoTime>	 !log closing UTC late backport and config training
[21:02:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:42] <arlolra>	 thanks TheresNoTime, selser is indeed fixed on html endpoints now
[21:02:52] <TheresNoTime>	 ebernhardson: FYI deployments done ref retrying the train
[21:02:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:03:05] <TheresNoTime>	 arlolra: you're very welcome :)
[21:03:05] <ebernhardson>	 we should be ready to retry the train now
[21:03:40] <jeena>	 deploying to all wikis now
[21:03:56] <wikibugs>	 (PS1) TrainBranchBot: all wikis to 1.39.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/830944 (https://phabricator.wikimedia.org/T314189)
[21:03:58] <wikibugs>	 (CR) TrainBranchBot: [C: +2] all wikis to 1.39.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/830944 (https://phabricator.wikimedia.org/T314189) (owner: TrainBranchBot)
[21:04:40] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.39.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/830944 (https://phabricator.wikimedia.org/T314189) (owner: TrainBranchBot)
[21:08:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:08:46] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.28  refs T314189
[21:08:49] <stashbot>	 T314189: 1.39.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T314189
[21:09:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:09:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:09:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:18:45] <ebernhardson>	 so far everything is looking happy on the elastic side. should be safe to leave the train up
[21:18:55] <jeena>	 thanks ebernhardson!
[21:19:33] <wikibugs>	 (CR) Dzahn: [C: +2] buildkitd: Bump version to 0.10.4 [docker-images/production-images] - https://gerrit.wikimedia.org/r/830909 (owner: Dduvall)
[21:20:21] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:24:55] <icinga-wm>	 PROBLEM - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_mjolnir-kafka-msearch-daemon@2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:31:33] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:35:43] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:38:26] <wikibugs>	 (PS1) Ssingh: certspotter: remove misbehaving Google CT log [puppet] - https://gerrit.wikimedia.org/r/830945
[21:40:45] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37180/console" [puppet] - https://gerrit.wikimedia.org/r/830945 (owner: Ssingh)
[21:41:13] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] certspotter: remove misbehaving Google CT log [puppet] - https://gerrit.wikimedia.org/r/830945 (owner: Ssingh)
[21:41:34] <wikibugs>	 (CR) Dzahn: "Thanks for fixing flapping monitoring. =I can confirm this URL is 404. You also seem to have done this before. Only thing is one of the co" [puppet] - https://gerrit.wikimedia.org/r/830945 (owner: Ssingh)
[21:43:23] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:48:48] <sukhe>	 mutante: thanks, I guess I copied the current commit
[21:49:07] <sukhe>	 that's what I get for doing this as the water boils for dinner :)
[21:53:32] <mutante>	 sukhe: thanks for fixing the icinga alert about icinga itself which is actually Google breaking URLs
[21:55:05] <wikibugs>	 (PS1) JHathaway: mail::mx: Add support for PLAIN auth over tls [puppet] - https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815)
[21:55:11] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:56:28] <wikibugs>	 (CR) JHathaway: "Kindly review!" [puppet] - https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) (owner: JHathaway)
[21:56:59] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:59:21] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:02:17] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:03:25] <wikibugs>	 (PS1) BCornwall: ats: Alert on high connection/request count [alerts] - https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815)
[22:12:36] <wikibugs>	 SRE, Infrastructure-Foundations, puppet-compiler, User-herron, User-jbond: Prevent puppet catalog compiler workers from running out of disk space - https://phabricator.wikimedia.org/T222075 (Dzahn) This happened again the other day and made me mail the SRE list.  Then I added docs how to clea...
[22:16:33] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:28:21] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:40:11] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:49:09] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:53:55] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:03:53] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:17:43] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:20:05] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:42:17] <icinga-wm>	 PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:48:13] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:52:57] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:55:42] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[23:56:10] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 27s)