[00:03:37] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:09:47] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:28:09] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:35:21] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [00:46:23] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:53:43] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:14:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:15:23] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:19:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:20:13] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:25:11] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:37:45] (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:01] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:59:29] !log pt1979@cumin1001 START - Cookbook sre.dns.netbox [02:01:45] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:01:46] !log pt1979@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:02:16] !log pt1979@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db2169 [02:02:17] !log pt1979@cumin1001 END (ERROR) - Cookbook sre.network.configure-switch-interfaces (exit_code=97) for host db2169 [02:04:07] !log pt1979@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kafka-logging1005: [02:04:27] !log pt1979@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-logging1005: [02:05:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Papaul) [02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:25:51] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:33:07] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:38:25] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:38:31] (03PS14) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [02:46:07] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:48:40] (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [03:22:55] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:47:25] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:10:40] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) 05Open→03Resolved a:03tstarling [04:28:13] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:35:37] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:51:27] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:01:11] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:09:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 23 hosts with reason: Primary switchover s3 T316622 [05:09:57] T316622: Switchover s3 master (db1157 -> db1123) - https://phabricator.wikimedia.org/T316622 [05:10:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s3 T316622 [05:10:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1123 with weight 0 T316622', diff saved to https://phabricator.wikimedia.org/P34128 and previous config saved to /var/cache/conftool/dbconfig/20220908-051043-ladsgroup.json [05:13:29] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:18:17] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:29:52] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) Here's a model of the benefit of the multi-DC project for users west of codfw. 
The servers are 30ms closer, but codfw seems a bit slower, so if yo... [05:30:02] (03PS1) 10Marostegui: db1202: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830717 [05:30:25] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:30:28] (03PS2) 10Ladsgroup: mariadb: Promote db1123 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/827860 (https://phabricator.wikimedia.org/T316622) (owner: 10Gerrit maintenance bot) [05:30:38] (03CR) 10Ladsgroup: [C: 03+2] mariadb: Promote db1123 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/827860 (https://phabricator.wikimedia.org/T316622) (owner: 10Gerrit maintenance bot) [05:30:53] (03CR) 10Marostegui: [C: 03+2] db1202: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830717 (owner: 10Marostegui) [05:32:24] (03CR) 10Marostegui: [C: 03+1] "Let's document all this in wikitech. There are so many new options that it is hard to follow 😊" [software] - 10https://gerrit.wikimedia.org/r/830671 (owner: 10Ladsgroup) [05:32:48] (03CR) 10Marostegui: [C: 03+1] auto_schema: Add support for a graceful stop [software] - 10https://gerrit.wikimedia.org/r/830672 (owner: 10Ladsgroup) [05:35:19] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:37:05] (03PS1) 10Marostegui: install_server: Do not reimage db1200 [puppet] - 10https://gerrit.wikimedia.org/r/830718 [05:37:54] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1200 [puppet] - 10https://gerrit.wikimedia.org/r/830718 (owner: 10Marostegui) [05:41:49] (03PS1) 10Marostegui: instances.yaml: Add db1202 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/830720 (https://phabricator.wikimedia.org/T316342) [05:42:26] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Allow runnint it on one dc only with --dc (031 comment) [software] - 10https://gerrit.wikimedia.org/r/830671 (owner: 10Ladsgroup) [05:42:31] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Add support for a graceful stop [software] - 10https://gerrit.wikimedia.org/r/830672 (owner: 10Ladsgroup) [05:43:08] (03Merged) 10jenkins-bot: auto_schema: Allow runnint it on one dc only with --dc [software] - 10https://gerrit.wikimedia.org/r/830671 (owner: 10Ladsgroup) [05:43:13] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1202 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/830720 (https://phabricator.wikimedia.org/T316342) (owner: 10Marostegui) [05:43:21] (03Merged) 10jenkins-bot: auto_schema: Add support for a graceful stop [software] - 10https://gerrit.wikimedia.org/r/830672 (owner: 10Ladsgroup) [05:44:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1202 to s7, depooled, T316342', diff saved to https://phabricator.wikimedia.org/P34129 and previous config saved to /var/cache/conftool/dbconfig/20220908-054429-marostegui.json [05:44:33] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342 [05:49:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Pooling back db2140', diff saved to https://phabricator.wikimedia.org/P34130 and previous config saved to /var/cache/conftool/dbconfig/20220908-054921-ladsgroup.json [05:54:52] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Pooling db1202 for the first time in s7 T316342', diff saved to https://phabricator.wikimedia.org/P34131 and previous config saved to /var/cache/conftool/dbconfig/20220908-055451-marostegui.json [05:54:55] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342 [05:55:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight for db1194', diff saved to https://phabricator.wikimedia.org/P34132 and previous config saved to /var/cache/conftool/dbconfig/20220908-055546-marostegui.json [06:00:05] kormat, marostegui, and Amir1: #bothumor I οΏ½ Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T0600). [06:00:09] o/ [06:00:12] o/ [06:00:28] I realized testwiki is on s3 [06:00:32] that\s nice [06:00:56] !log Starting s3 eqiad failover from db1157 to db1123 - T316622 [06:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:59] T316622: Switchover s3 master (db1157 -> db1123) - https://phabricator.wikimedia.org/T316622 [06:01:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T316622', diff saved to https://phabricator.wikimedia.org/P34133 and previous config saved to /var/cache/conftool/dbconfig/20220908-060110-ladsgroup.json [06:01:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1123 to s3 primary and set section read-write T316622', diff saved to https://phabricator.wikimedia.org/P34134 and previous config saved to /var/cache/conftool/dbconfig/20220908-060138-ladsgroup.json [06:02:26] writes coming [06:02:33] \o/ [06:03:26] (03PS2) 10Ladsgroup: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/827861 (https://phabricator.wikimedia.org/T316622) (owner: 10Gerrit maintenance bot) [06:03:34] (03CR) 10Ladsgroup: [C: 03+2] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/827861 
(https://phabricator.wikimedia.org/T316622) (owner: 10Gerrit maintenance bot) [06:03:37] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/827861 (https://phabricator.wikimedia.org/T316622) (owner: 10Gerrit maintenance bot) [06:04:27] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 2%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34135 and previous config saved to /var/cache/conftool/dbconfig/20220908-060426-root.json [06:04:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1157 T316622', diff saved to https://phabricator.wikimedia.org/P34136 and previous config saved to /var/cache/conftool/dbconfig/20220908-060438-ladsgroup.json [06:07:11] (03PS1) 10Marostegui: db1203: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830721 (https://phabricator.wikimedia.org/T316342) [06:07:47] _joe_: is there a header that shows if the response came from a server using php7.2 or php7.4? [06:07:48] (03Abandoned) 10Marostegui: mariadb: Switch s6 primary db1173 -> db1131 [puppet] - 10https://gerrit.wikimedia.org/r/764784 (https://phabricator.wikimedia.org/T300471) (owner: 10Kormat) [06:08:01] (03Abandoned) 10Marostegui: wmnet: Update s6-master CNAME. [dns] - 10https://gerrit.wikimedia.org/r/764786 (https://phabricator.wikimedia.org/T300471) (owner: 10Kormat) [06:08:14] (03CR) 10Marostegui: [C: 03+2] db1203: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830721 (https://phabricator.wikimedia.org/T316342) (owner: 10Marostegui) [06:08:21] <_joe_> kostajh: X-Powered-By, but it gets stripped in the response to you unless you go via X-Wikimedia-Debug [06:08:26] <_joe_> kostajh: why are you asking? [06:08:57] ah I was hoping I could get it without X-Wikimedia-Debug [06:09:47] <_joe_> kostajh: what are you trying to figure out? 
[06:09:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:09:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:10:10] <_joe_> maybe I can help [06:10:11] _joe_: looking into T317187. I recall there was an issue about incompatible serialization formats that AFAIK has been resolved, but it made me wonder if somehow cache entries are being invalidated for visitors to Special:Homepage if their traffic varies from php7.2 to php7.4 from one request to the next [06:10:12] T317187: GrowthExperiments Special:Homepage: investigate performance regression in wmf.28 - https://phabricator.wikimedia.org/T317187 [06:10:47] <_joe_> kostajh: cache entries in their browser? [06:10:56] _joe_: WANObjectCache [06:11:02] <_joe_> I doubt that can be the case, but you can check your cookies [06:11:57] before a visitor goes to Special:Homepage, we do an (expensive) set of calls to ElasticSearch. That gets put into a cache entry with WANObjectCache, and we look it up on visit to Special:Homepage. The spikiness seen in T317187 could be explained by that. 
[06:12:07] (03PS1) 10Marostegui: instances.yaml: Add db1203 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/830722 (https://phabricator.wikimedia.org/T316342) [06:12:51] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1203 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/830722 (https://phabricator.wikimedia.org/T316342) (owner: 10Marostegui) [06:14:14] !log marostegui@cumin2002 dbctl commit (dc=all): 'Add db1203 to s8, depooled, T316342', diff saved to https://phabricator.wikimedia.org/P34137 and previous config saved to /var/cache/conftool/dbconfig/20220908-061413-marostegui.json [06:14:17] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342 [06:14:54] <_joe_> kostajh: so if you want to verify this [06:15:24] <_joe_> you can add a specific cookie to force yourself onto one php version [06:15:54] <_joe_> but also, the chances of a single user switching php versions between the visits to that page are small [06:16:13] <_joe_> we try to get people to stick to one interpreter consistently as much as possible [06:16:41] ok, yeah that was my understanding as well. [06:16:45] <_joe_> kostajh: have you tried, going via X-W-D, to visit that page on php 7.4 twice [06:16:55] <_joe_> then go visit it with 7.2 another time [06:17:04] <_joe_> and see if that last call is slower than the second [06:17:17] <_joe_> and if it is, start profiling the thing and see where the time is spent [06:17:44] <_joe_> there's also what Tim just said on the task, image-suggestion being slow to respond at times [06:17:55] do you know offhand which cookie to set to opt-in to 7.2? 
[06:18:13] <_joe_> one sec, need to look at the code [06:18:22] right, but the image-suggestion thing shouldn't be directly related – that is not blocking page render on Special:Homepage [06:18:52] <_joe_> https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikimediaEvents/+/refs/heads/master/modules/ext.wikimediaEvents/phpEngine.js#34 [06:18:59] <_joe_> PHP_ENGINE_STICKY [06:19:11] the calls to image-suggestion service happen in a deferred update. If it causes slow down for the user, it is when they visit an article and need the image suggestion metadata, not on special:homepage which is where we observe the spikiness [06:19:13] <_joe_> see the comment above [06:19:24] <_joe_> ack [06:19:33] <_joe_> brb [06:19:46] (03CR) 10Slyngshede: [C: 03+1] "Looks good, no functional changes." [puppet] - 10https://gerrit.wikimedia.org/r/830184 (https://phabricator.wikimedia.org/T273673) (owner: 10Muehlenhoff) [06:19:56] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 3%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34138 and previous config saved to /var/cache/conftool/dbconfig/20220908-061955-root.json [06:21:38] (03CR) 10Slyngshede: [V: 03+1] C:spamassassin Allow debugging of why service fails. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: 10Slyngshede) [06:21:45] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:spamassassin Allow debugging of why service fails. [puppet] - 10https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: 10Slyngshede) [06:22:28] _joe_: thanks. it seems like cache lookup is broken somehow. back in a bit... 
[06:23:14] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2022-10-08 06:21:52 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/ [06:25:19] <_joe_> kostajh: if your object is now overflowing a certain size, maybe you're failing to save it to the cache [06:35:26] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 4%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34139 and previous config saved to /var/cache/conftool/dbconfig/20220908-063525-root.json [06:37:38] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 2%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34140 and previous config saved to /var/cache/conftool/dbconfig/20220908-063737-root.json [06:41:26] (03CR) 10Muehlenhoff: [C: 03+2] Cleanup some more stale references/comments to crons [puppet] - 10https://gerrit.wikimedia.org/r/830184 (https://phabricator.wikimedia.org/T273673) (owner: 10Muehlenhoff) [06:43:03] (03PS1) 10Marostegui: es2026,es2028,es2020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830723 [06:44:21] (03PS2) 10Marostegui: es2026,es2027,es2028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830723 [06:44:51] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depool es2026, es2027, es2028', diff saved to https://phabricator.wikimedia.org/P34141 and previous config saved to /var/cache/conftool/dbconfig/20220908-064450-marostegui.json [06:44:59] (03CR) 10Marostegui: [C: 03+2] es2026,es2027,es2028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830723 (owner: 10Marostegui) [06:50:54] (03CR) 10Elukey: [C: 03+1] "LGTM! Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/830684 (https://phabricator.wikimedia.org/T300130) (owner: 10Cwhite) [06:50:56] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 5%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34142 and previous config saved to /var/cache/conftool/dbconfig/20220908-065054-root.json [06:51:03] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:51:27] (03PS8) 10Ladsgroup: data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) [06:51:32] (03CR) 10Ladsgroup: [C: 03+2] data-persistence: Add alert for replication lag (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup) [06:52:17] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34143 and previous config saved to /var/cache/conftool/dbconfig/20220908-065216-root.json [06:52:20] (03PS1) 10Marostegui: Revert "es2026,es2027,es2028: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830603 [06:52:23] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34144 and previous config saved to /var/cache/conftool/dbconfig/20220908-065222-root.json [06:52:30] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34145 and previous config saved to /var/cache/conftool/dbconfig/20220908-065229-root.json [06:53:05] (03CR) 10Marostegui: [C: 03+2] Revert "es2026,es2027,es2028: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830603 (owner: 
10Marostegui) [06:53:08] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 3%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34146 and previous config saved to /var/cache/conftool/dbconfig/20220908-065306-root.json [06:53:11] (03Merged) 10jenkins-bot: data-persistence: Add alert for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup) [06:54:55] (03CR) 10Muehlenhoff: [C: 03+2] Drop a few now obsolete permission [puppet] - 10https://gerrit.wikimedia.org/r/830656 (owner: 10Muehlenhoff) [06:54:57] (03PS1) 10Marostegui: install_server: Do not reimage db1201 [puppet] - 10https://gerrit.wikimedia.org/r/830724 [06:55:54] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1201 [puppet] - 10https://gerrit.wikimedia.org/r/830724 (owner: 10Marostegui) [06:57:19] (03CR) 10Elukey: [C: 03+2] ml-serve: raise connection limit to the MW API [deployment-charts] - 10https://gerrit.wikimedia.org/r/830661 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [07:00:04] Amir1, apergos, and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T0700). [07:00:14] morning! there are no trainees signed up for the morning slot and no patches scheduled in the window [07:00:35] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [07:00:47] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [07:00:47] (03PS1) 10Ayounsi: Add Junos image validation [cookbooks] - 10https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) [07:00:53] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [07:01:05] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[07:01:23] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:01:28] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [07:02:22] (03PS1) 10Marostegui: es2029,es2030,es2031: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830728 [07:02:24] (03PS1) 10Ayounsi: Depool esams for routers upgrades [dns] - 10https://gerrit.wikimedia.org/r/830729 (https://phabricator.wikimedia.org/T295690) [07:02:54] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (10Marostegui) [07:04:30] (03PS1) 10Ayounsi: Disable VRRP auth for esams [homer/public] - 10https://gerrit.wikimedia.org/r/830730 (https://phabricator.wikimedia.org/T295690) [07:06:26] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 10%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34147 and previous config saved to /var/cache/conftool/dbconfig/20220908-070625-root.json [07:07:47] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34148 and previous config saved to /var/cache/conftool/dbconfig/20220908-070746-root.json [07:07:53] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34149 and previous config saved to /var/cache/conftool/dbconfig/20220908-070752-root.json [07:08:01] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34150 and previous config saved to /var/cache/conftool/dbconfig/20220908-070800-root.json [07:08:37] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 4%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34151 
and previous config saved to /var/cache/conftool/dbconfig/20220908-070836-root.json [07:08:41] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:10:09] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:14:08] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/830729 (https://phabricator.wikimedia.org/T295690) (owner: 10Ayounsi) [07:14:32] (03CR) 10Cathal Mooney: [C: 03+1] "lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/830730 (https://phabricator.wikimedia.org/T295690) (owner: 10Ayounsi) [07:21:55] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 25%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34152 and previous config saved to /var/cache/conftool/dbconfig/20220908-072154-root.json [07:23:17] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34153 and previous config saved to /var/cache/conftool/dbconfig/20220908-072316-root.json [07:23:22] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34154 and previous config saved to /var/cache/conftool/dbconfig/20220908-072321-root.json [07:23:31] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34155 and previous config saved to /var/cache/conftool/dbconfig/20220908-072330-root.json [07:23:31] RECOVERY - Disk space on apt1001 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=apt1001&var-datasource=eqiad+prometheus/ops [07:24:06] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 5%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34156 and previous config saved to /var/cache/conftool/dbconfig/20220908-072405-root.json [07:37:25] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 50%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34157 and previous config saved to /var/cache/conftool/dbconfig/20220908-073724-root.json [07:38:47] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34158 and previous config saved to /var/cache/conftool/dbconfig/20220908-073846-root.json [07:38:52] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34159 and previous config saved to /var/cache/conftool/dbconfig/20220908-073851-root.json [07:39:01] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34160 and previous config saved to /var/cache/conftool/dbconfig/20220908-073900-root.json [07:39:36] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 10%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34161 and previous config saved to /var/cache/conftool/dbconfig/20220908-073935-root.json [07:40:26] (03CR) 10Ayounsi: [C: 03+2] Depool esams for routers upgrades [dns] - 10https://gerrit.wikimedia.org/r/830729 (https://phabricator.wikimedia.org/T295690) (owner: 10Ayounsi) [07:41:06] !log depool esams for routers upgrade - T295690 [07:41:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:08] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [07:44:37] (03CR) 10Vgutierrez: [C: 04-1] "the original idea for supporting docker in these tests is being able to move the execution to our CI at some point. In that environment wh" [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [07:52:55] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 75%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34162 and previous config saved to /var/cache/conftool/dbconfig/20220908-075253-root.json [07:54:17] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34163 and previous config saved to /var/cache/conftool/dbconfig/20220908-075416-root.json [07:54:22] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34164 and previous config saved to /var/cache/conftool/dbconfig/20220908-075421-root.json [07:54:30] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34165 and previous config saved to /var/cache/conftool/dbconfig/20220908-075429-root.json [07:55:05] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 25%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34166 and previous config saved to /var/cache/conftool/dbconfig/20220908-075504-root.json [07:55:15] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:59:00] (03CR) 10Vgutierrez: [C: 04-1] "If you need specific support for 
podman to make it easier to run the tests in your environment please go ahead" [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [08:00:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1001.eqiad.wmnet [08:02:05] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:06:39] kostajh: is the GrowthExperiments Special:Homepage performance regression only for Growths or is that general? ( T317187 ) [08:06:40] T317187: GrowthExperiments Special:Homepage: investigate performance regression in wmf.28 - https://phabricator.wikimedia.org/T317187 [08:07:28] !log drain draffic from cr3-esams - T295690 [08:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:31] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [08:08:24] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1202 (re)pooling @ 100%: Pooling for the first time in s7', diff saved to https://phabricator.wikimedia.org/P34167 and previous config saved to /var/cache/conftool/dbconfig/20220908-080823-root.json [08:08:56] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1019.eqiad.wmnet [08:09:05] !log ayounsi@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt with reason: router upgrade [08:09:24] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt with reason: router upgrade [08:09:32] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence 
(ID=3b336fa4-f522-4b10-abdb-d6be83f6a04a) set by ayounsi@cumin2002 for 2:00:00 on 3 host(s) and th... [08:09:47] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34168 and previous config saved to /var/cache/conftool/dbconfig/20220908-080946-root.json [08:09:52] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34169 and previous config saved to /var/cache/conftool/dbconfig/20220908-080951-root.json [08:09:59] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34170 and previous config saved to /var/cache/conftool/dbconfig/20220908-080958-root.json [08:10:35] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 50%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34171 and previous config saved to /var/cache/conftool/dbconfig/20220908-081034-root.json [08:10:57] !log ayounsi@cumin2002 START - Cookbook sre.network.cf [08:10:58] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.cf (exit_code=0) [08:11:12] (03PS1) 10Vgutierrez: Release 0.18 [software/purged] - 10https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064) [08:12:06] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin1001.eqiad.wmnet [08:12:53] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:55] (03CR) 10CI reject: [V: 04-1] Release 0.18 [software/purged] - 10https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064) (owner: 10Vgutierrez) [08:16:49] PROBLEM - k8s API server requests latencies on 
kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:18:51] (03CR) 10Btullis: [C: 03+2] Add a kublet node_label to each master of the dse-k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/828049 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [08:22:47] !log cgoubert@cumin2002 START - Cookbook sre.hosts.remove-downtime for parse1019.eqiad.wmnet [08:22:47] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1019.eqiad.wmnet [08:24:14] !log pooled parse1019.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [08:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:17] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [08:24:38] (03PS1) 10Muehlenhoff: Update httpbb test to expect PHP 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/830783 [08:25:01] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34172 and previous config saved to /var/cache/conftool/dbconfig/20220908-082500-root.json [08:25:05] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) db1127 is being repooled. 
All the sX 10.6 hosts are now serving traffic with the patch [08:25:22] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34173 and previous config saved to /var/cache/conftool/dbconfig/20220908-082521-root.json [08:25:29] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34174 and previous config saved to /var/cache/conftool/dbconfig/20220908-082528-root.json [08:25:31] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34175 and previous config saved to /var/cache/conftool/dbconfig/20220908-082530-root.json [08:25:45] (03PS2) 10Vgutierrez: Release 0.18 [software/purged] - 10https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064) [08:25:47] (03PS1) 10Vgutierrez: querysort: Backport strings.Cut [software/purged] - 10https://gerrit.wikimedia.org/r/830784 (https://phabricator.wikimedia.org/T317064) [08:26:05] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 75%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34176 and previous config saved to /var/cache/conftool/dbconfig/20220908-082604-root.json [08:26:34] _joe_: I'm aware that https://gerrit.wikimedia.org/r/830782 is far from ideal but we don't have golang 1.18 available yet [08:26:40] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We will update all tests, and remove a few, once we've moved fully to php 7.4." 
[puppet] - 10https://gerrit.wikimedia.org/r/830783 (owner: 10Muehlenhoff) [08:26:51] PROBLEM - Check systemd state on dse-k8s-ctrl1001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:27:16] _joe_: hmm sorry, https://gerrit.wikimedia.org/r/c/operations/software/purged/+/830784 [08:27:26] (03CR) 10CI reject: [V: 04-1] Release 0.18 [software/purged] - 10https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064) (owner: 10Vgutierrez) [08:28:25] (03CR) 10Giuseppe Lavagetto: [C: 03+1] querysort: Backport strings.Cut [software/purged] - 10https://gerrit.wikimedia.org/r/830784 (https://phabricator.wikimedia.org/T317064) (owner: 10Vgutierrez) [08:28:38] <_joe_> vgutierrez: heh sorry about that [08:30:09] (03PS2) 10Vgutierrez: querysort: Backport strings.Cut [software/purged] - 10https://gerrit.wikimedia.org/r/830784 (https://phabricator.wikimedia.org/T317064) [08:30:11] (03PS3) 10Vgutierrez: Release 0.18 [software/purged] - 10https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064) [08:31:03] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1020.eqiad.wmnet [08:31:32] ACKNOWLEDGEMENT - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service Muehlenhoff See https://gerrit.wikimedia.org/r/c/operations/puppet/+/830783/ https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:31:32] ACKNOWLEDGEMENT - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver Muehlenhoff See https://gerrit.wikimedia.org/r/c/operations/puppet/+/830783/ https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:32:42] (03CR) 10CI reject: [V: 04-1] Release 0.18 [software/purged] - 
10https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064) (owner: 10Vgutierrez) [08:32:50] sigh³ [08:33:05] (03Abandoned) 10Muehlenhoff: Update httpbb test to expect PHP 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/830783 (owner: 10Muehlenhoff) [08:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:37:44] (03PS1) 10Volans: Class-based cookbooks: get parent argument_parser [cookbooks] - 10https://gerrit.wikimedia.org/r/830786 [08:38:11] (03CR) 10Marostegui: [C: 03+2] es2029,es2030,es2031: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830728 (owner: 10Marostegui) [08:38:23] (03PS2) 10Volans: Class-based cookbooks: use parent argument_parser [cookbooks] - 10https://gerrit.wikimedia.org/r/830786 [08:38:33] (03PS4) 10Vgutierrez: Release 0.18 [software/purged] - 10https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064) [08:38:39] PROBLEM - Check systemd state on dse-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:38:45] never push code if $coffee < 2 [08:39:41] 10SRE, 10Observability-Metrics: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (10fgiunchedi) [08:39:43] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depool es2029, es2030, es2031', diff saved to https://phabricator.wikimedia.org/P34178 and previous config saved to /var/cache/conftool/dbconfig/20220908-083941-marostegui.json [08:40:18] !log depooled wtp1028.eqiad.wmnet from parsoid cluster T307219 [08:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:21] T307219: Put parse parse10[01-24] in production 
- https://phabricator.wikimedia.org/T307219 [08:40:30] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34179 and previous config saved to /var/cache/conftool/dbconfig/20220908-084029-root.json [08:40:35] !log cgoubert@cumin2002 START - Cookbook sre.hosts.remove-downtime for parse1020.eqiad.wmnet [08:40:36] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1020.eqiad.wmnet [08:40:52] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34180 and previous config saved to /var/cache/conftool/dbconfig/20220908-084051-root.json [08:40:58] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34181 and previous config saved to /var/cache/conftool/dbconfig/20220908-084057-root.json [08:41:00] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34182 and previous config saved to /var/cache/conftool/dbconfig/20220908-084059-root.json [08:41:19] !log pooled parse1020.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [08:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:34] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1203 (re)pooling @ 100%: Pooling for the first time in s8', diff saved to https://phabricator.wikimedia.org/P34183 and previous config saved to /var/cache/conftool/dbconfig/20220908-084133-root.json [08:43:25] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:44:25] (03PS1) 10Elukey: ml-services: update Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/830788 (https://phabricator.wikimedia.org/T313915) [08:44:27] !log reverting cr3-esams changes (JTAC will be needed for a firmware upgrade), and moving on to cr2-esams - T295690 [08:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:30] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [08:44:53] (03PS1) 10Filippo Giunchedi: librenms: recurse setting permissions on sessions directory [puppet] - 10https://gerrit.wikimedia.org/r/830789 (https://phabricator.wikimedia.org/T317286) [08:44:58] !log ayounsi@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr2-esams,cr2-esams IPv6,re0.cr2-esams.mgmt with reason: router upgrade [08:45:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34184 and previous config saved to /var/cache/conftool/dbconfig/20220908-084500-root.json [08:45:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34185 and previous config saved to /var/cache/conftool/dbconfig/20220908-084502-root.json [08:45:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34186 and previous config saved to /var/cache/conftool/dbconfig/20220908-084503-root.json [08:45:12] (03PS1) 10Marostegui: Revert "es2029,es2030,es2031: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830806 [08:45:17] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr2-esams,cr2-esams IPv6,re0.cr2-esams.mgmt with 
reason: router upgrade [08:45:23] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e0d9eb2b-5520-4f80-912e-3627c94e9982) set by ayounsi@cumin2002 for 2:00:00 on 3 host(s) and th... [08:46:01] (03CR) 10Marostegui: [C: 03+2] Revert "es2029,es2030,es2031: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830806 (owner: 10Marostegui) [08:47:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1011.eqiad.wmnet [08:47:23] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] querysort: Backport strings.Cut [software/purged] - 10https://gerrit.wikimedia.org/r/830784 (https://phabricator.wikimedia.org/T317064) (owner: 10Vgutierrez) [08:47:32] 10SRE, 10Thumbor: Thumbor units failing / service general slowness - https://phabricator.wikimedia.org/T312722 (10fgiunchedi) p:05Triage→03Medium @MatthewVernon yes medium works, {{done}} [08:47:34] (03CR) 10Vgutierrez: [C: 03+2] Release 0.18 [software/purged] - 10https://gerrit.wikimedia.org/r/830782 (https://phabricator.wikimedia.org/T317064) (owner: 10Vgutierrez) [08:47:44] 10SRE, 10Data Engineering Planning, 10Foundational Technology Requests: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (10fgiunchedi) p:05Triage→03Medium [08:47:47] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 23, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:48:21] (03CR) 10Volans: "minor comments, looks good in general" [cookbooks] - 10https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) (owner: 10Ayounsi) [08:51:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1011.eqiad.wmnet [08:52:04] (03CR) 10Klausman: [C: 03+1] ml-services: update Docker images [deployment-charts] - 
10https://gerrit.wikimedia.org/r/830788 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [08:52:30] 10SRE, 10SRE-Access-Requests: Grant Access to wmf, analytics-privatedata-users for TTaylor - https://phabricator.wikimedia.org/T292299 (10Milimetric) 05Resolved→03Open It appears that Tajh never got analytics-privatedata access as requested here (https://github.com/wikimedia/puppet/blob/90a5ff1441598161f3e... [08:53:09] !log depooled wtp1029.eqiad.wmnet from parsoid cluster T307219 [08:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:12] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [08:53:14] (03CR) 10Elukey: [C: 03+1] "Had a chat (live) with Filippo about the issue. It is not the ideal solution since puppet will change permissions before with scap:target " [puppet] - 10https://gerrit.wikimedia.org/r/830789 (https://phabricator.wikimedia.org/T317286) (owner: 10Filippo Giunchedi) [08:53:31] (03CR) 10Filippo Giunchedi: [C: 03+2] librenms: recurse setting permissions on sessions directory [puppet] - 10https://gerrit.wikimedia.org/r/830789 (https://phabricator.wikimedia.org/T317286) (owner: 10Filippo Giunchedi) [08:55:24] (03PS2) 10Giuseppe Lavagetto: Add cookbook to easily route mediawiki traffic [cookbooks] - 10https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995) [08:55:32] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1021.eqiad.wmnet [08:56:00] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34187 and previous config saved to /var/cache/conftool/dbconfig/20220908-085559-root.json [08:56:22] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34188 and previous config 
saved to /var/cache/conftool/dbconfig/20220908-085621-root.json [08:56:28] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34189 and previous config saved to /var/cache/conftool/dbconfig/20220908-085627-root.json [08:56:31] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34190 and previous config saved to /var/cache/conftool/dbconfig/20220908-085630-root.json [08:57:39] 10SRE, 10Observability-Metrics, 10Patch-For-Review: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (10fgiunchedi) I have bandaided the issue by recursing chown `www-data` into the sessions directory, though puppet will flip/flop permissions betwe... [08:59:31] (03CR) 10Elukey: [C: 03+2] ml-services: update Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/830788 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [09:00:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34192 and previous config saved to /var/cache/conftool/dbconfig/20220908-090005-root.json [09:00:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34193 and previous config saved to /var/cache/conftool/dbconfig/20220908-090007-root.json [09:00:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34194 and previous config saved to /var/cache/conftool/dbconfig/20220908-090008-root.json [09:00:15] 10SRE, 10Observability-Metrics, 10Patch-For-Review: librenms / scap::target change of permissions on every puppet run - 
https://phabricator.wikimedia.org/T317286 (10jbond) > My hunch is that this is a fallout from forcing owner/group in scap::target in https://gerrit.wikimedia.org/r/c/operations/puppet/+/830... [09:02:00] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/821324 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite) [09:02:25] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:03:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/830633 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis) [09:03:20] !log cgoubert@cumin2002 START - Cookbook sre.hosts.remove-downtime for parse1021.eqiad.wmnet [09:03:21] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1021.eqiad.wmnet [09:03:43] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [09:04:13] (03CR) 10Filippo Giunchedi: logstash: reduce replica count to 1 after 1 day (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/830690 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite) [09:04:24] !log pooled parse1021.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [09:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:27] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [09:04:44] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[09:04:54] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:05:51] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1022.eqiad.wmnet [09:06:06] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:07:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [09:08:55] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:09:42] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:10:15] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[09:10:32] (03CR) 10Hashar: [C: 04-2] "I talked with David about the code I used to extract the list of events from the private field com.google.gerrit.server.events.Event.types" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [09:11:31] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34195 and previous config saved to /var/cache/conftool/dbconfig/20220908-091129-root.json [09:11:52] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2027 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34196 and previous config saved to /var/cache/conftool/dbconfig/20220908-091151-root.json [09:11:58] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2028 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34197 and previous config saved to /var/cache/conftool/dbconfig/20220908-091157-root.json [09:12:01] !log marostegui@cumin2002 dbctl commit (dc=all): 'es2026 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34198 and previous config saved to /var/cache/conftool/dbconfig/20220908-091200-root.json [09:14:25] !log depooled wtp1030.eqiad.wmnet from parsoid cluster T307219 [09:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:29] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [09:15:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34199 and previous config saved to /var/cache/conftool/dbconfig/20220908-091510-root.json [09:15:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 25%: Repooling after upgrade', diff saved to 
https://phabricator.wikimedia.org/P34200 and previous config saved to /var/cache/conftool/dbconfig/20220908-091512-root.json [09:15:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34201 and previous config saved to /var/cache/conftool/dbconfig/20220908-091513-root.json [09:16:31] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1025-1028].eqiad.wmnet with reason: Downtiming replaced wtp servers [09:16:35] (03PS1) 10Jbond: C:librenms: fix file permissions [puppet] - 10https://gerrit.wikimedia.org/r/830792 [09:16:45] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1028.eqiad.wmnet [09:16:47] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1025-1028].eqiad.wmnet with reason: Downtiming replaced wtp servers [09:16:52] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1029.eqiad.wmnet [09:16:53] (03CR) 10Jbond: [V: 03+2 C: 03+2] C:librenms: fix file permissions [puppet] - 10https://gerrit.wikimedia.org/r/830792 (owner: 10Jbond) [09:17:19] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=wtp1029.eqiad.wmnet [09:17:26] 10SRE, 10Observability-Metrics, 10Patch-For-Review: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (10jbond) forgot to link https://gerrit.wikimedia.org/r/c/operations/puppet/+/830792 [09:17:34] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:59] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1026.eqiad.wmnet [09:18:07] 
!log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1025.eqiad.wmnet [09:18:09] (03CR) 10Filippo Giunchedi: "Idea LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [09:18:28] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, though see related change" [puppet] - 10https://gerrit.wikimedia.org/r/830686 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [09:18:49] !log testing purged 0.18 in cp4026 and cp4032 [09:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:30] (03PS1) 10Jbond: C:librenms: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/830793 (https://phabricator.wikimedia.org/T317286) [09:19:44] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1022.eqiad.wmnet [09:19:44] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1022.eqiad.wmnet [09:19:50] (03CR) 10Jbond: [V: 03+2 C: 03+2] C:librenms: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/830793 (https://phabricator.wikimedia.org/T317286) (owner: 10Jbond) [09:20:22] (03CR) 10Kosta Harlan: "LGTM, but blocked on T305406 and T312686, right? If so, let's set a -2 on here to avoid any confusion." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/830202 (https://phabricator.wikimedia.org/T305408) (owner: 10Sergio Gimeno) [09:21:05] !log pooled parse1022.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [09:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:10] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [09:21:47] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1023.eqiad.wmnet [09:22:38] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:23:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2028 to es1 codfw master', diff saved to https://phabricator.wikimedia.org/P34202 and previous config saved to /var/cache/conftool/dbconfig/20220908-092301-marostegui.json [09:23:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2026 to es2 codfw master', diff saved to https://phabricator.wikimedia.org/P34203 and previous config saved to /var/cache/conftool/dbconfig/20220908-092346-marostegui.json [09:24:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2027 to es3 codfw master', diff saved to https://phabricator.wikimedia.org/P34204 and previous config saved to /var/cache/conftool/dbconfig/20220908-092436-marostegui.json [09:25:25] 10SRE-OnFire, 10Observability-Alerting: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (10LSobanski) The tag combinations are also really confusing when they form a coherent statement, e.g. my first reaction to "page thanos sre" is "we should page Observabili... 
[09:26:57] 10SRE, 10Observability-Metrics, 10Patch-For-Review: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (10fgiunchedi) >>! In T317286#8220241, @jbond wrote: >> My hunch is that this is a fallout from forcing owner/group in scap::target in https://gerr... [09:27:01] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34205 and previous config saved to /var/cache/conftool/dbconfig/20220908-092700-root.json [09:27:37] (03PS1) 10Marostegui: es2032,es2033,es2034: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830794 [09:30:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34206 and previous config saved to /var/cache/conftool/dbconfig/20220908-093015-root.json [09:30:17] (03CR) 10JMeybohm: Alert on high Kubernetes API error rate (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [09:30:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34207 and previous config saved to /var/cache/conftool/dbconfig/20220908-093017-root.json [09:30:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34208 and previous config saved to /var/cache/conftool/dbconfig/20220908-093018-root.json [09:31:14] !log upload purged 0.18 to apt.wm.o (buster) - T317064 [09:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:16] T317064: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 [09:31:17] !log ayounsi@cumin2002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on cr3-knams,cr3-knams IPv6 with reason: router upgrade [09:31:35] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr3-knams,cr3-knams IPv6 with reason: router upgrade [09:31:42] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ff1db65d-a6ee-4e20-ae07-837bbe264b2f) set by ayounsi@cumin2002 for 2:00:00 on 2 host(s) and th... [09:31:51] !log rolling restart of purged - T317064 [09:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:00] !log depooled wtp1031.eqiad.wmnet from parsoid cluster T307219 [09:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:03] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [09:33:14] (03CR) 10Marostegui: [C: 03+2] es2032,es2033,es2034: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830794 (owner: 10Marostegui) [09:34:22] (03PS1) 10Filippo Giunchedi: librenms: recursively chwon sessions directory to www-data as a bandaid [puppet] - 10https://gerrit.wikimedia.org/r/830795 (https://phabricator.wikimedia.org/T317286) [09:35:03] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:35:31] (03CR) 10Filippo Giunchedi: [C: 03+2] librenms: recursively chwon sessions directory to www-data as a bandaid [puppet] - 10https://gerrit.wikimedia.org/r/830795 (https://phabricator.wikimedia.org/T317286) (owner: 10Filippo Giunchedi) [09:35:48] !log drain draffic from cr3-knams - T295690 [09:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:51] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [09:36:45] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1023.eqiad.wmnet [09:36:45] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1023.eqiad.wmnet [09:36:59] (03PS1) 10Btullis: Grant ttaylor access to PII through Superset [puppet] - 10https://gerrit.wikimedia.org/r/830796 (https://phabricator.wikimedia.org/T292299) [09:37:25] !log pooled parse1023.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [09:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:37] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1024.eqiad.wmnet [09:38:51] 10SRE, 10Observability-Metrics, 10Patch-For-Review: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (10jbond) I have rolled back https://gerrit.wikimedia.org/r/c/830789. As per the [[ https://github.com/wikimedia/puppet/blob/production/modules/li... 
[09:40:16] (03PS1) 10Jbond: Revert "librenms: recursively chwon sessions directory to www-data as a bandaid" [puppet] - 10https://gerrit.wikimedia.org/r/830807 [09:41:01] (03CR) 10Jbond: [C: 03+2] Revert "librenms: recursively chwon sessions directory to www-data as a bandaid" [puppet] - 10https://gerrit.wikimedia.org/r/830807 (owner: 10Jbond) [09:41:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Access to wmf, analytics-privatedata-users for TTaylor - https://phabricator.wikimedia.org/T292299 (10BTullis) I have added a patch to correct Tajh's access permissions, so that he will be able to access PII in Superset. The wikitech reference to this a... [09:42:31] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34209 and previous config saved to /var/cache/conftool/dbconfig/20220908-094229-root.json [09:42:51] RECOVERY - mediawiki-installation DSH group on parse1024 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:44:52] (03PS2) 10Phuedx: testwiki: Add mediawiki.edit_attempt stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826234 (https://phabricator.wikimedia.org/T309013) [09:45:09] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:45:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34210 and previous config saved to /var/cache/conftool/dbconfig/20220908-094520-root.json [09:45:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34211 and previous config saved to /var/cache/conftool/dbconfig/20220908-094522-root.json [09:45:24] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34212 and previous config saved to /var/cache/conftool/dbconfig/20220908-094523-root.json [09:46:51] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:47:09] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:47:09] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:47:11] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:47:13] !log depooled wtp1032.eqiad.wmnet from parsoid cluster T307219 [09:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:15] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:47:15] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [09:48:07] Don't bother about wtp1040 ^ [09:48:24] I'll put the management downtime that I forgot in a sec [09:49:59] 10SRE, 10Observability-Metrics, 10Patch-For-Review: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (10jbond) > I have rolled back https://gerrit.wikimedia.org/r/c/830789. As per the comment the session files need to be 0644 This was introduced in... 
[09:50:02] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1024.eqiad.wmnet [09:50:02] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1024.eqiad.wmnet [09:50:32] !log pooled parse1024.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [09:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:35] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for pginer - https://phabricator.wikimedia.org/T317291 (10BTullis) [09:51:04] 10SRE, 10Observability-Metrics, 10Patch-For-Review: librenms / scap::target change of permissions on every puppet run - https://phabricator.wikimedia.org/T317286 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Thank you @jbond for the assistance/fix/investigation, seems all good now! resolving [09:51:52] 10SRE, 10Observability-Metrics: librenms: investigate making the session directory 0660 - https://phabricator.wikimedia.org/T317292 (10jbond) [09:52:49] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:52:49] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:52:51] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:53:34] 10SRE, 10Infrastructure-Foundations, 10netops: Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (10cmooney) 05Open→03Resolved So after quite a bit of back-and-forth with Juniper and pulling logs etc. they say they can't see anything i... [09:53:54] (03CR) 10Volans: [C: 04-1] "LGTM, one small bug to fix and a question inline."
[cookbooks] - 10https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995) (owner: 10Giuseppe Lavagetto) [09:55:28] (03CR) 10Muehlenhoff: Initially adapt perccli to use the new raid_mgmt_tools fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: 10Muehlenhoff) [09:57:23] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [09:57:43] (03CR) 10Ayounsi: [C: 03+2] Disable VRRP auth for esams [homer/public] - 10https://gerrit.wikimedia.org/r/830730 (https://phabricator.wikimedia.org/T295690) (owner: 10Ayounsi) [09:58:00] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34213 and previous config saved to /var/cache/conftool/dbconfig/20220908-095759-root.json [09:58:14] (03CR) 10Alexandros Kosiaris: Alert on high Kubernetes API error rate (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [09:58:53] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for pginer - https://phabricator.wikimedia.org/T317291 (10BTullis) I have now added Pau to the wmf group. ` btullis@mwmaint1002:~$ ldapsearch -x cn=wmf|grep pginer member: uid=pginer,ou=people,dc=wikimedia,dc=org ` Also added to the #wmf-nda group in Phab... [09:59:08] (03PS1) 10Ayounsi: Revert "Depool esams for routers upgrades" [dns] - 10https://gerrit.wikimedia.org/r/830808 [09:59:11] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for pginer - https://phabricator.wikimedia.org/T317291 (10BTullis) 05Open→03Resolved p:05Triage→03Medium [10:00:04] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T1000).
[10:00:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2032 es2033 es2034 for upgrade', diff saved to https://phabricator.wikimedia.org/P34214 and previous config saved to /var/cache/conftool/dbconfig/20220908-100014-root.json [10:00:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34215 and previous config saved to /var/cache/conftool/dbconfig/20220908-100025-root.json [10:00:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34216 and previous config saved to /var/cache/conftool/dbconfig/20220908-100027-root.json [10:00:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34217 and previous config saved to /var/cache/conftool/dbconfig/20220908-100028-root.json [10:00:53] !log depooled wtp1033.eqiad.wmnet from parsoid cluster T307219 [10:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:55] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [10:01:13] !log Serving 100% of parsoid traffic with php 7.4 T307219 [10:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on es[2032-2034].codfw.wmnet with reason: Upgrade [10:01:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2032-2034].codfw.wmnet with reason: Upgrade [10:03:11] (03PS7) 10Muehlenhoff: Switch to the new raid_mgmt_tools fact to enable RAID tools [puppet] - 10https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) [10:04:56] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: 10Muehlenhoff) [10:05:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ayounsi) cr2-esams and cr3-knams got upgraded as expected. cr3-esams failed as it requires a firmware upgrade, and only JTAC can provide us the firmware. We wi... [10:05:57] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1029-1031].eqiad.wmnet with reason: Downtiming replaced wtp servers [10:06:12] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1029-1031].eqiad.wmnet with reason: Downtiming replaced wtp servers [10:06:34] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1030.eqiad.wmnet [10:06:36] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-Page-history, 10Traffic, and 3 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (10Vgutierrez) 05Open→03Resolved a:03Joe ` #before the edit vgutierrez@carrot:~$ curl "https://test.wikipedia.org/...
[10:06:46] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1031.eqiad.wmnet [10:06:59] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1029.eqiad.wmnet [10:07:01] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ayounsi) [10:07:27] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool esams for routers upgrades" [dns] - 10https://gerrit.wikimedia.org/r/830808 (owner: 10Ayounsi) [10:07:42] !log re-pool esams after routers upgrade - T295690 [10:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:47] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [10:07:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34218 and previous config saved to /var/cache/conftool/dbconfig/20220908-100754-root.json [10:08:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34219 and previous config saved to /var/cache/conftool/dbconfig/20220908-100759-root.json [10:08:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34220 and previous config saved to /var/cache/conftool/dbconfig/20220908-100805-root.json [10:08:19] (03CR) 10Btullis: [C: 03+2] Add the configuration to create LVM volumes for dse-k8s monitoring [puppet] - 10https://gerrit.wikimedia.org/r/830633 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis) [10:08:30] (03PS1) 10Marostegui: Revert "es2032,es2033,es2034: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830810 [10:12:05] (03PS8) 10Muehlenhoff: Switch to 
the new raid_mgmt_tools fact to enable RAID tools [puppet] - 10https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) [10:12:54] (03CR) 10Marostegui: [C: 03+2] Revert "es2032,es2033,es2034: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830810 (owner: 10Marostegui) [10:13:30] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34221 and previous config saved to /var/cache/conftool/dbconfig/20220908-101329-root.json [10:13:35] (03PS2) 10Ayounsi: Add Junos image validation [cookbooks] - 10https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) [10:13:38] (03CR) 10Ayounsi: "thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) (owner: 10Ayounsi) [10:16:22] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:16:30] (03PS1) 10Mvolz: Switch Zotero to node 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/830799 (https://phabricator.wikimedia.org/T290753) [10:17:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: 10Muehlenhoff) [10:18:03] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 7 hosts with reason: Downtiming replaced wtp servers [10:18:21] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 7 hosts with reason: Downtiming replaced wtp servers [10:18:34] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 15 hosts with reason: Downtiming replaced wtp servers [10:18:57] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 7 days, 0:00:00 on 15 hosts with reason: Downtiming replaced wtp servers [10:20:02] (03CR) 10Mvolz: [C: 03+2] Switch Zotero to node 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/830799 (https://phabricator.wikimedia.org/T290753) (owner: 10Mvolz) [10:20:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1022, es1025, es2022, es2025 for upgrade', diff saved to https://phabricator.wikimedia.org/P34222 and previous config saved to /var/cache/conftool/dbconfig/20220908-102040-root.json [10:20:50] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:22:04] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) (owner: 10Ayounsi) [10:22:08] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:22:09] (03PS1) 10Marostegui: es1022,es1025,es2022,es2025: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830800 [10:22:50] (03CR) 10Marostegui: [C: 03+2] es1022,es1025,es2022,es2025: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830800 (owner: 10Marostegui) [10:22:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34223 and previous config saved to /var/cache/conftool/dbconfig/20220908-102259-root.json [10:23:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34224 and previous config saved to /var/cache/conftool/dbconfig/20220908-102304-root.json [10:23:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 10%: 
Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34225 and previous config saved to /var/cache/conftool/dbconfig/20220908-102310-root.json [10:23:32] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [10:23:34] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [10:23:51] (03PS9) 10Muehlenhoff: Switch to the new raid_mgmt_tools fact to enable RAID tools [puppet] - 10https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) [10:26:39] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [10:27:16] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [10:28:32] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [10:28:38] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:29:00] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34226 and previous config saved to /var/cache/conftool/dbconfig/20220908-102859-root.json [10:29:16] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [10:29:46] (03PS3) 10Ayounsi: Add Junos image validation [cookbooks] - 10https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) [10:30:07] (03CR) 10Ayounsi: "fair enough :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) (owner: 10Ayounsi) [10:30:08] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:30:34] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [10:30:55] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/830727 (https://phabricator.wikimedia.org/T316539) (owner: 10Ayounsi) [10:30:57] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: 10Muehlenhoff) [10:31:23] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [10:32:43] 10SRE, 10Citoid, 10Editing-team: Migrate citoid and zotero production services to node12 - https://phabricator.wikimedia.org/T290753 (10Mvolz) 05Open→03Resolved p:05Triage→03Medium [10:32:46] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Mvolz) [10:33:06] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Mvolz) [10:34:22] PROBLEM - cassandra-a service on restbase2014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:36:37] (03PS1) 10Marostegui: Revert "es1022,es1025,es2022,es2025: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830811 [10:37:32] (03CR) 10Marostegui: [C: 03+2] Revert "es1022,es1025,es2022,es2025: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830811 (owner: 10Marostegui) [10:38:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34227 and previous config saved to /var/cache/conftool/dbconfig/20220908-103804-root.json [10:38:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 25%: Repooling after upgrade', diff saved
to https://phabricator.wikimedia.org/P34228 and previous config saved to /var/cache/conftool/dbconfig/20220908-103809-root.json [10:38:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34229 and previous config saved to /var/cache/conftool/dbconfig/20220908-103815-root.json [10:38:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34230 and previous config saved to /var/cache/conftool/dbconfig/20220908-103826-root.json [10:38:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34231 and previous config saved to /var/cache/conftool/dbconfig/20220908-103830-root.json [10:38:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34232 and previous config saved to /var/cache/conftool/dbconfig/20220908-103836-root.json [10:38:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34233 and previous config saved to /var/cache/conftool/dbconfig/20220908-103842-root.json [10:40:00] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:41:22] (03PS1) 10Marostegui: es1026,es1027,es1028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830801 [10:41:30] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:41:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1027 es1026 es1028 for upgrade', diff saved to https://phabricator.wikimedia.org/P34234 and previous config saved to /var/cache/conftool/dbconfig/20220908-104152-root.json [10:42:12] (03CR) 10Marostegui: [C: 03+2] es1026,es1027,es1028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830801 (owner: 10Marostegui) [10:42:26] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:43:49] (03CR) 10JMeybohm: "Adding CC @elukey because of high rate of 409's in ml-serve in eqiad" [alerts] - 10https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [10:44:04] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:44:34] RECOVERY - cassandra-a service on restbase2014 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:49:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34235 and previous config saved to /var/cache/conftool/dbconfig/20220908-104902-root.json [10:49:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34236 and previous config saved to /var/cache/conftool/dbconfig/20220908-104910-root.json [10:49:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34237 and previous config saved to /var/cache/conftool/dbconfig/20220908-104914-root.json [10:49:18] (03PS1) 10Marostegui: Revert "Revert "es1022,es1025,es2022,es2025: Disable notifications"" [puppet] - 10https://gerrit.wikimedia.org/r/830812 [10:50:09] (03Abandoned) 10Marostegui: Revert "Revert "es1022,es1025,es2022,es2025: Disable notifications"" [puppet] - 10https://gerrit.wikimedia.org/r/830812 (owner: 10Marostegui) [10:50:22] (03PS1) 10Marostegui: Revert "es1026,es1027,es1028: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830813 [10:50:53] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10cmooney) [10:51:02] (03CR) 10Marostegui: [C: 03+2] Revert "es1026,es1027,es1028: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830813 (owner: 10Marostegui) [10:52:17] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:52:39] (03PS1) 10ClΓ©ment Goubert: wtp: Purge wtp servers following migration to parse [puppet] - 10https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) [10:52:39] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10cmooney) The firmware provided by Juniper seems to be accepted by cr3-esams: ` cmooney@re0.cr3-esams> show system firmware | match "^Part|version|i40" Part... [10:52:56] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10cmooney) [10:53:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34238 and previous config saved to /var/cache/conftool/dbconfig/20220908-105309-root.json [10:53:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34239 and previous config saved to /var/cache/conftool/dbconfig/20220908-105314-root.json [10:53:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34240 and previous config saved to /var/cache/conftool/dbconfig/20220908-105320-root.json [10:53:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34241 and previous config saved to /var/cache/conftool/dbconfig/20220908-105331-root.json [10:53:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34242 and previous config saved 
to /var/cache/conftool/dbconfig/20220908-105335-root.json [10:53:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34243 and previous config saved to /var/cache/conftool/dbconfig/20220908-105341-root.json [10:53:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34244 and previous config saved to /var/cache/conftool/dbconfig/20220908-105347-root.json [10:54:20] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37165/console" [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert) [10:55:28] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:55:29] (PS1) Clément Goubert: wtp: Purge wtp servers following migration to parse [mediawiki-config] - https://gerrit.wikimedia.org/r/830803 (https://phabricator.wikimedia.org/T317025) [10:55:39] (CR) CI reject: [V: -1] wtp: Purge wtp servers following migration to parse [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert) [10:57:35] (PS2) Clément Goubert: wtp: Purge wtp servers following migration to parse [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) [11:01:36] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:04:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34245 and previous config saved to /var/cache/conftool/dbconfig/20220908-110407-root.json [11:04:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34246 and previous config saved to /var/cache/conftool/dbconfig/20220908-110415-root.json [11:04:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34247 and previous config saved to /var/cache/conftool/dbconfig/20220908-110419-root.json [11:05:11] (CR) Jbond: [C: +1] "lgtm no idea on the sleep question" [cookbooks] - https://gerrit.wikimedia.org/r/830676 (owner: Volans) [11:08:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34248 and previous config saved to /var/cache/conftool/dbconfig/20220908-110814-root.json [11:08:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34249 and previous config saved to /var/cache/conftool/dbconfig/20220908-110819-root.json [11:08:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34250 and previous config saved to /var/cache/conftool/dbconfig/20220908-110825-root.json [11:08:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34251 and previous config saved to 
/var/cache/conftool/dbconfig/20220908-110836-root.json [11:08:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34252 and previous config saved to /var/cache/conftool/dbconfig/20220908-110840-root.json [11:08:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34253 and previous config saved to /var/cache/conftool/dbconfig/20220908-110846-root.json [11:08:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34254 and previous config saved to /var/cache/conftool/dbconfig/20220908-110852-root.json [11:08:58] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:09:00] (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/830786 (owner: Volans) [11:11:18] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/830796 (https://phabricator.wikimedia.org/T292299) (owner: Btullis) [11:14:52] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37166/console" [puppet] - https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: Muehlenhoff) [11:15:29] (CR) Hashar: [C: -2] "On hold, pending a change proposed upstream to add getters https://gerrit-review.googlesource.com/c/gerrit/+/345017" [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: Hashar) [11:16:47] (CR) Jbond: [C: +1] "lgtm" [puppet] - 
https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: Muehlenhoff) [11:17:00] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:19:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34255 and previous config saved to /var/cache/conftool/dbconfig/20220908-111912-root.json [11:19:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34256 and previous config saved to /var/cache/conftool/dbconfig/20220908-111920-root.json [11:19:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34257 and previous config saved to /var/cache/conftool/dbconfig/20220908-111924-root.json [11:19:42] (CR) Jbond: [C: +1] "lgtm but dont know the history" [puppet] - https://gerrit.wikimedia.org/r/830704 (owner: Andrew Bogott) [11:23:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34258 and previous config saved to /var/cache/conftool/dbconfig/20220908-112319-root.json [11:23:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34259 and previous config saved to /var/cache/conftool/dbconfig/20220908-112324-root.json [11:23:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34260 and previous config saved to 
/var/cache/conftool/dbconfig/20220908-112329-root.json [11:23:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34261 and previous config saved to /var/cache/conftool/dbconfig/20220908-112341-root.json [11:23:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34262 and previous config saved to /var/cache/conftool/dbconfig/20220908-112345-root.json [11:23:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34263 and previous config saved to /var/cache/conftool/dbconfig/20220908-112351-root.json [11:23:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34264 and previous config saved to /var/cache/conftool/dbconfig/20220908-112357-root.json [11:30:41] (CR) Muehlenhoff: [C: +2] Switch to the new raid_mgmt_tools fact to enable RAID tools [puppet] - https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: Muehlenhoff) [11:32:27] RECOVERY - Check systemd state on dse-k8s-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34265 and previous config saved to /var/cache/conftool/dbconfig/20220908-113417-root.json [11:34:19] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:34:25] !log marostegui@cumin1001 
dbctl commit (dc=all): 'es1027 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34266 and previous config saved to /var/cache/conftool/dbconfig/20220908-113425-root.json [11:34:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34267 and previous config saved to /var/cache/conftool/dbconfig/20220908-113429-root.json [11:35:53] PROBLEM - Check systemd state on dse-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:15] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:38:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34268 and previous config saved to /var/cache/conftool/dbconfig/20220908-113846-root.json [11:38:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34269 and previous config saved to /var/cache/conftool/dbconfig/20220908-113850-root.json [11:38:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34270 and previous config saved to /var/cache/conftool/dbconfig/20220908-113856-root.json [11:39:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34271 and previous config saved to /var/cache/conftool/dbconfig/20220908-113902-root.json [11:41:37] PROBLEM - k8s API server requests latencies on kubemaster1001 is 
CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:49:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34272 and previous config saved to /var/cache/conftool/dbconfig/20220908-114922-root.json [11:49:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34273 and previous config saved to /var/cache/conftool/dbconfig/20220908-114930-root.json [11:49:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34274 and previous config saved to /var/cache/conftool/dbconfig/20220908-114934-root.json [11:50:33] (PS1) Muehlenhoff: Remove absented Raid Icinga checks [puppet] - https://gerrit.wikimedia.org/r/830846 [11:53:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34275 and previous config saved to /var/cache/conftool/dbconfig/20220908-115351-root.json [11:53:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34276 and previous config saved to /var/cache/conftool/dbconfig/20220908-115355-root.json [11:54:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2022 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34277 and previous config saved to /var/cache/conftool/dbconfig/20220908-115401-root.json [11:54:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2025 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34278 and previous config saved to 
/var/cache/conftool/dbconfig/20220908-115407-root.json [11:55:43] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:58:39] (CR) Btullis: [C: +2] Grant ttaylor access to PII through Superset [puppet] - https://gerrit.wikimedia.org/r/830796 (https://phabricator.wikimedia.org/T292299) (owner: Btullis) [12:03:36] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [12:04:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34280 and previous config saved to /var/cache/conftool/dbconfig/20220908-120427-root.json [12:04:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34281 and previous config saved to /var/cache/conftool/dbconfig/20220908-120435-root.json [12:04:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34282 and previous config saved to /var/cache/conftool/dbconfig/20220908-120439-root.json [12:05:13] (PS1) Marostegui: es1029,es1030,es1031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830848 [12:05:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1029 es1030 es1031 for upgrade', diff saved to https://phabricator.wikimedia.org/P34283 and previous config saved to /var/cache/conftool/dbconfig/20220908-120528-root.json [12:06:37] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[12:07:02] (CR) Marostegui: [C: +2] es1029,es1030,es1031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830848 (owner: Marostegui) [12:08:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:09:55] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [12:10:44] (CR) Alexandros Kosiaris: [C: -1] Alert on high Kubernetes API error rate (1 comment) [alerts] - https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm) [12:11:35] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:11:44] (PS1) Jaime Nuche: k8s scap: change format of mediawiki deployment files [puppet] - https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) [12:12:40] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[12:13:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:14:25] (PS1) Marostegui: Revert "es1029,es1030,es1031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830816 [12:14:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34284 and previous config saved to /var/cache/conftool/dbconfig/20220908-121459-root.json [12:15:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34285 and previous config saved to /var/cache/conftool/dbconfig/20220908-121506-root.json [12:15:09] (CR) Alexandros Kosiaris: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: AOkoth) [12:15:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34286 and previous config saved to /var/cache/conftool/dbconfig/20220908-121511-root.json [12:15:35] (CR) Marostegui: [C: +2] Revert "es1029,es1030,es1031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830816 (owner: Marostegui) [12:15:51] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[12:16:06] (PS1) Muehlenhoff: smart: Also use new raid_mgmt_tools fact [puppet] - https://gerrit.wikimedia.org/r/830852 (https://phabricator.wikimedia.org/T313312) [12:17:28] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:18:08] (CR) Jaime Nuche: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/37167/" [puppet] - https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: Jaime Nuche) [12:18:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:23:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:25:37] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1032-1033].mgmt with reason: Downtiming replaced wtp servers [12:25:42] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:25:51] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1032-1033].mgmt with reason: Downtiming replaced wtp servers [12:26:05] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1032-1033].eqiad.wmnet with reason: Downtiming replaced wtp servers [12:26:20] !log 
cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1032-1033].eqiad.wmnet with reason: Downtiming replaced wtp servers [12:26:27] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1032.eqiad.wmnet [12:26:36] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1033.eqiad.wmnet [12:29:17] (PS1) Marostegui: install_server: Do not reimage db1202 [puppet] - https://gerrit.wikimedia.org/r/830854 [12:30:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34287 and previous config saved to /var/cache/conftool/dbconfig/20220908-123004-root.json [12:30:09] (CR) Marostegui: [C: +2] install_server: Do not reimage db1202 [puppet] - https://gerrit.wikimedia.org/r/830854 (owner: Marostegui) [12:30:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34288 and previous config saved to /var/cache/conftool/dbconfig/20220908-123011-root.json [12:30:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34289 and previous config saved to /var/cache/conftool/dbconfig/20220908-123016-root.json [12:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:37:44] (CR) Jforrester: [C: +1] scap/dsh: remove parsoid service, replaced by parsoid-php [puppet] - https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207) (owner: Dzahn) [12:39:55] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1027 to es1 eqiad master, promote es1026 to es2 eqiad master, promote es1028 to es3 eqiad master', diff saved to https://phabricator.wikimedia.org/P34290 and previous config saved to /var/cache/conftool/dbconfig/20220908-123955-marostegui.json [12:42:08] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@9e4ed94]: (no justification provided) [12:42:17] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@9e4ed94]: (no justification provided) (duration: 00m 09s) [12:43:22] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:45:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34291 and previous config saved to /var/cache/conftool/dbconfig/20220908-124509-root.json [12:45:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34292 and previous config saved to /var/cache/conftool/dbconfig/20220908-124516-root.json [12:45:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34293 and previous config saved to /var/cache/conftool/dbconfig/20220908-124521-root.json [12:47:13] (CR) Zabe: wtp: Purge wtp servers following migration to parse (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert) [12:48:38] (CR) Clément Goubert: wtp: Purge wtp servers following migration to parse (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert) [12:49:50] 
(PS3) Clément Goubert: wtp: Purge wtp servers following migration to parse [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) [12:50:12] (PS7) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) [12:50:16] (CR) Clément Goubert: wtp: Purge wtp servers following migration to parse (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert) [12:50:39] (PS4) Clément Goubert: wtp: Purge wtp servers following migration to parse [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) [12:50:46] (CR) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. (5 comments) [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: Slyngshede) [12:51:13] (PS1) Muehlenhoff: raid::perccli: Run the correct monitoring tool [puppet] - https://gerrit.wikimedia.org/r/830860 (https://phabricator.wikimedia.org/T315608) [12:51:23] (PS2) Muehlenhoff: raid::perccli: Run the correct monitoring tool [puppet] - https://gerrit.wikimedia.org/r/830860 (https://phabricator.wikimedia.org/T315608) [12:51:42] (CR) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. 
(11 comments) [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: Slyngshede) [12:53:17] (PS1) Stang: tnwiki: Add extendedconfirmed group [mediawiki-config] - https://gerrit.wikimedia.org/r/830861 (https://phabricator.wikimedia.org/T317276) [12:54:00] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37168/console" [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert) [12:54:26] SRE, serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (Jdforrester-WMF) [12:56:27] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@9e4ed94]: (no justification provided) [12:56:37] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@9e4ed94]: (no justification provided) (duration: 00m 09s) [12:57:16] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37169/console" [puppet] - https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: Clément Goubert) [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T1300). [13:00:05] arlolra and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34294 and previous config saved to /var/cache/conftool/dbconfig/20220908-130014-root.json [13:00:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34295 and previous config saved to /var/cache/conftool/dbconfig/20220908-130021-root.json [13:00:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34296 and previous config saved to /var/cache/conftool/dbconfig/20220908-130026-root.json [13:00:28] here [13:00:50] I can deploy in maybe 15 minutes or so [13:01:11] thank you [13:01:53] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:02:23] o/ [13:06:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:07:19] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48681 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:08:39] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:12:59] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:15:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 
(re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34297 and previous config saved to /var/cache/conftool/dbconfig/20220908-131519-root.json [13:15:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34298 and previous config saved to /var/cache/conftool/dbconfig/20220908-131526-root.json [13:15:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34299 and previous config saved to /var/cache/conftool/dbconfig/20220908-131531-root.json [13:15:55] (PS1) Muehlenhoff: Remove support for mptsas RAID [puppet] - https://gerrit.wikimedia.org/r/830862 [13:19:51] (CR) Giuseppe Lavagetto: Add cookbook to easily route mediawiki traffic (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995) (owner: Giuseppe Lavagetto) [13:19:53] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:20:10] (PS3) Giuseppe Lavagetto: Add cookbook to easily route mediawiki traffic [cookbooks] - https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995) [13:22:17] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995) (owner: Giuseppe Lavagetto) [13:23:16] !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp1034.eqiad.wmnet [13:23:27] (PS1) Zabe: wikimedia.org: Move nyc to the wikis section [dns] - https://gerrit.wikimedia.org/r/830863 [13:26:45] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:28:50] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [13:29:55] !log installing apache2 security updates on Bullseye [13:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34300 and previous config saved to /var/cache/conftool/dbconfig/20220908-133024-root.json [13:30:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34301 and previous config saved to /var/cache/conftool/dbconfig/20220908-133031-root.json [13:30:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34302 and previous config saved to /var/cache/conftool/dbconfig/20220908-133036-root.json [13:30:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 10%: Maint over long time ago', diff saved to https://phabricator.wikimedia.org/P34303 and previous config saved to /var/cache/conftool/dbconfig/20220908-133045-ladsgroup.json [13:30:48] (CR) JMeybohm: Alert on high Kubernetes API error rate (1 comment) [alerts] - https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: JMeybohm) [13:31:12] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:31:13] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wtp1034.eqiad.wmnet [13:34:06] (PS1) Giuseppe Lavagetto: Improve documentation of the CookbookBase classes usage [software/spicerack] - https://gerrit.wikimedia.org/r/830866 [13:35:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1157 
(re)pooling @ 10%: Will do maint later', diff saved to https://phabricator.wikimedia.org/P34304 and previous config saved to /var/cache/conftool/dbconfig/20220908-133514-ladsgroup.json [13:35:39] (03CR) 10Slyngshede: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/830860 (https://phabricator.wikimedia.org/T315608) (owner: 10Muehlenhoff) [13:36:36] (03PS1) 10Vgutierrez: trafficserver: Update to ATS 9 on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/830867 (https://phabricator.wikimedia.org/T309651) [13:36:39] !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp1035.eqiad.wmnet [13:37:32] (03CR) 10Ssingh: "Let's go!" [puppet] - 10https://gerrit.wikimedia.org/r/830867 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:38:32] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37170/console" [puppet] - 10https://gerrit.wikimedia.org/r/830867 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:38:43] (03CR) 10Muehlenhoff: [C: 03+2] raid::perccli: Run the correct monitoring tool [puppet] - 10https://gerrit.wikimedia.org/r/830860 (https://phabricator.wikimedia.org/T315608) (owner: 10Muehlenhoff) [13:39:32] (03CR) 10Ssingh: [C: 03+1] trafficserver: Update to ATS 9 on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/830867 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:39:46] !log disable puppet on A:cp-drmrs during the update to ATS 9.1.3 - T309651 [13:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:49] T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 [13:40:43] (03PS1) 10Ayounsi: Move peeringdb token to spicerack namespace [labs/private] - 10https://gerrit.wikimedia.org/r/830868 [13:40:53] (03CR) 10CI reject: [V: 04-1] Improve documentation of the CookbookBase classes usage [software/spicerack] - 
10https://gerrit.wikimedia.org/r/830866 (owner: 10Giuseppe Lavagetto) [13:41:21] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Update to ATS 9 on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/830867 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:41:27] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [13:41:52] (03PS5) 10JMeybohm: Alert on high Kubernetes API error rate [alerts] - 10https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) [13:41:54] (03PS3) 10JMeybohm: Alert on high Kubernetes API latency [alerts] - 10https://gerrit.wikimedia.org/r/830637 (https://phabricator.wikimedia.org/T311251) [13:41:56] (03PS5) 10Ayounsi: Spicerack: add configuration file and API key for PeeringDB [puppet] - 10https://gerrit.wikimedia.org/r/819562 [13:43:13] !log rolling upgrade to ats 9 in cp drmrs - T309651 [13:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:44] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:43:47] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:43:48] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wtp1035.eqiad.wmnet [13:45:29] (03PS1) 10Volans: doc: fix sphinx_checker for Python 3.10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/830871 [13:45:35] (03CR) 10Ayounsi: Spicerack: add configuration file and API key for PeeringDB (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/819562 (owner: 10Ayounsi) [13:45:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 25%: Maint over long time ago', diff saved to https://phabricator.wikimedia.org/P34305 and previous config saved to /var/cache/conftool/dbconfig/20220908-134550-ladsgroup.json [13:46:01] (03CR) 
10Giuseppe Lavagetto: [C: 03+1] doc: fix sphinx_checker for Python 3.10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/830871 (owner: 10Volans) [13:47:53] !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp1036.eqiad.wmnet [13:49:11] (03PS4) 10JMeybohm: Alert on high lateny of kubelet operations [alerts] - 10https://gerrit.wikimedia.org/r/830228 (https://phabricator.wikimedia.org/T311251) [13:49:13] (03PS6) 10JMeybohm: Alert on high Kubernetes API error rate [alerts] - 10https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) [13:49:15] (03PS4) 10JMeybohm: Alert on high Kubernetes API latency [alerts] - 10https://gerrit.wikimedia.org/r/830637 (https://phabricator.wikimedia.org/T311251) [13:49:55] Lucas_WMDE: will we not be making this window? [13:50:06] oh, sorry, I totally forgot about it :( [13:50:11] damn [13:50:12] (03CR) 10Jbond: [C: 03+1] "lgtm" [labs/private] - 10https://gerrit.wikimedia.org/r/830868 (owner: 10Ayounsi) [13:50:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: Will do maint later', diff saved to https://phabricator.wikimedia.org/P34307 and previous config saved to /var/cache/conftool/dbconfig/20220908-135019-ladsgroup.json [13:50:31] I guess I should have spoken up sooner [13:50:53] I don’t think there’s time for backports now, no [13:50:55] my bad :( [13:50:59] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Move peeringdb token to spicerack namespace [labs/private] - 10https://gerrit.wikimedia.org/r/830868 (owner: 10Ayounsi) [13:51:00] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/830846 (owner: 10Muehlenhoff) [13:51:11] ok, no problem, thanks [13:51:43] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[13:51:46] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10Patch-For-Review: icinga raid monitoring inoperable for H750 controllers - https://phabricator.wikimedia.org/T315608 (10MoritzMuehlenhoff) The servers with Perc H750 are now correctly detected by Puppet and the respective new monitoring scrip... [13:52:16] (03CR) 10Jbond: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/830852 (https://phabricator.wikimedia.org/T313312) (owner: 10Muehlenhoff) [13:53:33] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [13:54:13] (03CR) 10Volans: [C: 03+2] doc: fix sphinx_checker for Python 3.10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/830871 (owner: 10Volans) [13:55:40] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:55:41] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wtp1036.eqiad.wmnet [13:55:49] (03PS1) 10Matthias Mullie: Add SearchVue to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830874 (https://phabricator.wikimedia.org/T310367) [13:56:13] (03Abandoned) 10Matthias Mullie: Add SearchVue to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830874 (https://phabricator.wikimedia.org/T310367) (owner: 10Matthias Mullie) [13:56:19] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/830862 (owner: 10Muehlenhoff) [13:57:04] (03PS1) 10Btullis: Attempt to run the MCE and MAE consumers in the GMS container [deployment-charts] - 10https://gerrit.wikimedia.org/r/830875 (https://phabricator.wikimedia.org/T317053) [13:57:41] !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp1037.eqiad.wmnet [13:58:12] (03PS1) 10Matthias Mullie: [SearchVue] Enable extension on beta testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830876 (https://phabricator.wikimedia.org/T310367) [13:58:13] (03PS1) 10Matthias Mullie: [SearchVue] Enable extension 
on ptwiki, ruwiki & idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830877 [13:58:16] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:58:33] (03PS2) 10Matthias Mullie: [SearchVue] Enable extension on beta ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830876 (https://phabricator.wikimedia.org/T310367) [13:59:16] (03Abandoned) 10Matthias Mullie: [SearchVue] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830877 (owner: 10Matthias Mullie) [13:59:19] (03Abandoned) 10Matthias Mullie: [SearchVue] Enable extension on beta ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830876 (https://phabricator.wikimedia.org/T310367) (owner: 10Matthias Mullie) [13:59:23] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/819562 (owner: 10Ayounsi) [14:00:13] (03CR) 10Ayounsi: [C: 03+2] "🚀" [puppet] - 10https://gerrit.wikimedia.org/r/819562 (owner: 10Ayounsi) [14:00:22] !log on going maintenance on mr1-codfw [14:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 75%: Maint over long time ago', diff saved to https://phabricator.wikimedia.org/P34309 and previous config saved to /var/cache/conftool/dbconfig/20220908-140055-ladsgroup.json [14:01:52] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:03:31] (03Merged) 10jenkins-bot: doc: fix sphinx_checker for Python 3.10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/830871 (owner: 10Volans) [14:04:02] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [14:05:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: Will do maint later', diff saved to https://phabricator.wikimedia.org/P34310 and previous config saved to /var/cache/conftool/dbconfig/20220908-140524-ladsgroup.json [14:05:32] 10SRE, 10Data-Services: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (10CDanis) p:05Triage→03High [14:06:08] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:06:09] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wtp1037.eqiad.wmnet [14:06:23] 10SRE, 10Observability-Metrics: librenms: investigate making the session directory 0660 - https://phabricator.wikimedia.org/T317292 (10jbond) p:05Triage→03Medium [14:06:32] 10SRE, 10Observability-Metrics: librenms: investigate making the session directory 0660 - https://phabricator.wikimedia.org/T317292 (10jbond) p:05Medium→03Low [14:07:04] !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp[1038-1042].eqiad.wmnet [14:12:14] (03PS2) 10Volans: Improve documentation of the CookbookBase classes usage [software/spicerack] - 10https://gerrit.wikimedia.org/r/830866 (owner: 10Giuseppe Lavagetto) [14:16:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 100%: Maint over long time ago', diff saved to https://phabricator.wikimedia.org/P34311 and previous config saved to /var/cache/conftool/dbconfig/20220908-141600-ladsgroup.json [14:20:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: 
Will do maint later', diff saved to https://phabricator.wikimedia.org/P34312 and previous config saved to /var/cache/conftool/dbconfig/20220908-142029-ladsgroup.json [14:20:31] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [14:21:10] (03CR) 10Ayounsi: sre.network.peering: initial commit (0312 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/816730 (owner: 10Ayounsi) [14:21:12] (03PS12) 10Ayounsi: sre.network.peering: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/816730 [14:22:12] (03CR) 10Muehlenhoff: [C: 03+2] smart: Also use new raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/830852 (https://phabricator.wikimedia.org/T313312) (owner: 10Muehlenhoff) [14:22:24] 10SRE-swift-storage: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand) - https://phabricator.wikimedia.org/T299125 (10MatthewVernon) The plan is indeed to replace `swiftrepl` with `rclone`. There are two infelicities with `rclone` for our use case: # it holds entire container listings... 
[14:23:09] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:23:09] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wtp[1038-1042].eqiad.wmnet [14:25:18] !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp[1043-1047].eqiad.wmnet [14:28:34] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:31:47] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: 10Clément Goubert) [14:31:51] 10SRE, 10Observability-Logging, 10Observability-Metrics, 10Performance-Team (Radar): Framework for running experiments on a subset of the app server fleet - https://phabricator.wikimedia.org/T315403 (10ori) p:05Triage→03Low [14:33:51] 10SRE-swift-storage: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand) - https://phabricator.wikimedia.org/T299125 (10MoritzMuehlenhoff) >>! In T299125#8221206, @MatthewVernon wrote: > Then we need to build a .deb of the patched `rclone` (may be annoying because of the need of newer `go... [14:34:06] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:35:10] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[14:35:26] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:36:44] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [14:38:00] (03CR) 10Muehlenhoff: [C: 03+2] Remove support for mptsas RAID [puppet] - 10https://gerrit.wikimedia.org/r/830862 (owner: 10Muehlenhoff) [14:38:09] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [14:38:57] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:38:58] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wtp[1043-1047].eqiad.wmnet [14:39:44] (03CR) 10Muehlenhoff: [C: 03+2] Remove absented Raid Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/830846 (owner: 10Muehlenhoff) [14:39:55] (03PS2) 10Muehlenhoff: Remove absented Raid Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/830846 [14:40:22] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:40:57] 10SRE, 10Thumbor: Thumbor units failing / service general slowness - https://phabricator.wikimedia.org/T312722 (10Krinkle) @Joe @fgiunchedi I wrote a rough draft based on the above. Feel free to expand or correct accordingly: https://wikitech.wikimedia.org/wiki/Incidents/2022-07-10_thumbor [14:40:57] !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts wtp[1025-1028,1048].eqiad.wmnet [14:41:53] (03CR) 10Ahmon Dancy: [C: 03+1] "Definitely easier to understand." [puppet] - 10https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: 10Jaime Nuche) [14:43:24] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:44:42] (03CR) 10AOkoth: vrts: install vrts script (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [14:44:49] (03CR) 10AOkoth: [C: 03+2] vrts: install vrts script [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [14:45:10] (03CR) 10AOkoth: [C: 03+2] vrts: install vrts script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [14:45:47] PROBLEM - Host asw-a-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:45:47] PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:45:47] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:45:47] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:09] (03PS1) 10Elukey: ml-services: add default in helmfiles for recreatePods [deployment-charts] - 10https://gerrit.wikimedia.org/r/830883 [14:47:57] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:49:00] (03CR) 10Cwhite: logstash: reduce webrequest retention to 31 days (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821324 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite) [14:49:05] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:49:27] (03CR) 10CI reject: [V: 04-1] ml-services: add default in helmfiles for recreatePods [deployment-charts] - 10https://gerrit.wikimedia.org/r/830883 (owner: 10Elukey) [14:49:45] (JobUnavailable) firing: Reduced availability for job pdu in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:49:45] uff [14:49:46] (03CR) 10Cwhite: logstash: reduce replica count to 1 after 1 day (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/830690 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite) [14:49:57] PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:51:07] (03PS2) 10Elukey: ml-services: add default in helmfiles for recreatePods [deployment-charts] - 10https://gerrit.wikimedia.org/r/830883 [14:51:09] PROBLEM - Host mr1-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:51:21] (03CR) 10Volans: [C: 03+1] "LGTM, replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/816730 (owner: 10Ayounsi) [14:51:31] (03CR) 10Cwhite: [C: 03+2] apifeatureusage: use new kafka truststore [puppet] - 10https://gerrit.wikimedia.org/r/830684 (https://phabricator.wikimedia.org/T300130) (owner: 10Cwhite) [14:52:03] cwhite: \o/ [14:52:30] * cwhite presses thumbs [14:54:45] (JobUnavailable) firing: (2) Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:58] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [14:55:15] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:25] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:14] (03CR) 10Jaime Nuche: k8s scap: change format of mediawiki deployment files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/830850 
(https://phabricator.wikimedia.org/T299648) (owner: 10Jaime Nuche) [14:56:19] (03CR) 10Andrew Bogott: [C: 03+1] "I have not vetted every IP in this range but I'm willing to give this a try. Somewhat nervous that it will break unexpected toolforge thin" [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [14:57:14] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:57:15] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wtp[1025-1028,1048].eqiad.wmnet [14:57:49] RECOVERY - Host asw-d-codfw is UP: PING WARNING - Packet loss = 50%, RTA = 44.29 ms [14:57:49] RECOVERY - Host asw-c-codfw is UP: PING WARNING - Packet loss = 50%, RTA = 44.15 ms [14:57:51] RECOVERY - Host asw-a-codfw is UP: PING WARNING - Packet loss = 60%, RTA = 33.88 ms [14:58:27] RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.51 ms [14:58:39] RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms [14:58:59] !log maintenance on mr1-codfw complete [14:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:03] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:00:26] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1): Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10colewhite) apifeatureusage now using the new pki truststore and appears to be working. [15:00:58] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (10Papaul) I was having the issue below upgrading mr1 to version 21 ` Validating against /config/rescue.conf.gz /config/rescue.conf.gz:61:(21) syntax error at 'rfc-co... 
[15:01:28] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (10Papaul) [15:01:52] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (10Papaul) ` papaul@mr1-codfw> show version Hostname: mr1-codfw Model: srx300 Junos: 21.2R3-S2.9 JUNOS Software Release [21.2R3-S2.9] ` [15:01:52] (JobUnavailable) resolved: (2) Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:32] !log installing nginx security updates on bullseye [15:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:34] RECOVERY - Host mr1-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [15:04:12] (03CR) 10Alexandros Kosiaris: [C: 03+1] wtp: Purge wtp servers following migration to parse [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830803 (https://phabricator.wikimedia.org/T317025) (owner: 10Clément Goubert) [15:04:38] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Overall LGTM, if merged once servers are decommissioned." 
[puppet] - 10https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: 10Clément Goubert) [15:04:58] (03CR) 10Elukey: [C: 03+2] ml-services: add default in helmfiles for recreatePods [deployment-charts] - 10https://gerrit.wikimedia.org/r/830883 (owner: 10Elukey) [15:05:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] wtp: Purge wtp servers following migration to parse [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830803 (https://phabricator.wikimedia.org/T317025) (owner: 10Clément Goubert) [15:05:32] (03CR) 10Btullis: [C: 03+2] Attempt to run the MCE and MAE consumers in the GMS container [deployment-charts] - 10https://gerrit.wikimedia.org/r/830875 (https://phabricator.wikimedia.org/T317053) (owner: 10Btullis) [15:06:04] (03CR) 10Volans: [C: 03+1] "Thanks Giuseppe for improving the existing docs. I'm happy to merge this as-is. If we see that it causes confusion because of the shortene" [software/spicerack] - 10https://gerrit.wikimedia.org/r/830866 (owner: 10Giuseppe Lavagetto) [15:07:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] wtp: Purge wtp servers following migration to parse [puppet] - 10https://gerrit.wikimedia.org/r/830802 (https://phabricator.wikimedia.org/T317025) (owner: 10Clément Goubert) [15:08:48] (03Merged) 10jenkins-bot: ml-services: add default in helmfiles for recreatePods [deployment-charts] - 10https://gerrit.wikimedia.org/r/830883 (owner: 10Elukey) [15:09:00] (03Merged) 10jenkins-bot: Attempt to run the MCE and MAE consumers in the GMS container [deployment-charts] - 10https://gerrit.wikimedia.org/r/830875 (https://phabricator.wikimedia.org/T317053) (owner: 10Btullis) [15:10:39] jouncebot: now [15:10:39] No deployments scheduled for the next 0 hour(s) and 49 minute(s) [15:11:11] (03CR) 10Clément Goubert: [C: 03+2] wtp: Purge wtp servers following migration to parse [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830803 (https://phabricator.wikimedia.org/T317025) (owner: 10Clément Goubert) [15:11:55] 
(03Merged) 10jenkins-bot: wtp: Purge wtp servers following migration to parse [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830803 (https://phabricator.wikimedia.org/T317025) (owner: 10Clément Goubert) [15:11:55] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [15:12:44] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:14:22] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:14:55] (03PS5) 10Ayounsi: Remove 185.15.56.0/24 from network::external [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) [15:15:45] volans: hey done with the mr1-codfw upgrade and kafka-logging1005 is ready for testing the provision cookbook [15:15:49] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/824315 (https://phabricator.wikimedia.org/T315500) (owner: 10Cwhite) [15:16:19] papaul: great, thanks, have you seen the patch? 
there is a question for you too there about how much to sleep [15:16:44] volans: looking [15:16:53] thx [15:18:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:18:14] (03CR) 10Ayounsi: [C: 03+2] Remove 185.15.56.0/24 from network::external [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [15:18:33] (03CR) 10Andrew Bogott: [C: 03+2] prometheus-openstack-stale-puppet-certs: preserve original cert name [puppet] - 10https://gerrit.wikimedia.org/r/830704 (owner: 10Andrew Bogott) [15:19:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:19:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:20:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:20:12] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:48] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: sync on main [15:21:57] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:22:02] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2005 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:22:02] (03CR) 10Abijeet Patro: [C: 04-1] "Waiting for community feedback: https://meta.wikimedia.org/wiki/Meta_talk:Babylon#Grant_editcontentmodel_right_for_translation_administrat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587) (owner: 10Abijeet Patro) [15:22:32] PROBLEM - Check systemd state on conf2005 is CRITICAL: CRITICAL - degraded: The following units failed: etcdmirror-conftool-eqiad-wmnet.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:36] (03CR) 10Papaul: sre.hosts.provision: reboot after RAID changes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/830676 (owner: 10Volans) [15:22:59] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: reboot after RAID changes [cookbooks] - 10https://gerrit.wikimedia.org/r/830676 (owner: 10Volans) [15:23:11] thanks papaul, merging and deploying then we can test it [15:23:37] volans: thanks [15:23:45] (JobUnavailable) firing: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:38] papaul: good job on the upgrade! [15:24:46] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: sync on main [15:24:51] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [15:24:53] XioNoX: thanks [15:25:11] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [15:25:20] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: sync on main [15:25:36] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [15:26:28] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:27:39] (03CR) 10BCornwall: varnish: Remove extraneous checks for Docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [15:27:57] (03Merged) 10jenkins-bot: sre.hosts.provision: reboot after RAID changes [cookbooks] - 10https://gerrit.wikimedia.org/r/830676 (owner: 10Volans) [15:28:22] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:28:44] !log cgoubert@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:830803|wtp: Purge wtp servers following migration to parse (T317025)]] (duration: 12m 48s) [15:28:47] T317025: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 [15:30:38] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:43] !log restart etcdmirror on conf2005 [15:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:50] RECOVERY - Check systemd state on conf2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:30] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2005 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:35:36] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=api-https [15:36:32] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=api_appserver [15:36:42] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=appserver [15:37:54] <_joe_> sigh [15:38:16] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.2.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [15:38:45] (JobUnavailable) resolved: Reduced availability for job etcdmirror in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:23] !log pt1979@cumin1001 START - Cookbook sre.dns.netbox [15:39:32] (03PS1) 10Ayounsi: Revert "Exclude cloud-eqiad prefix from VRT trusted networks" [puppet] - 10https://gerrit.wikimedia.org/r/830818 [15:39:37] (03PS1) 10Ayounsi: Revert "Exclude cloud-eqiad prefix from MXs trusted networks" [puppet] - 10https://gerrit.wikimedia.org/r/830819 [15:39:43] (03PS1) 10Ayounsi: Revert "Exclude cloud-eqiad prefix from lists trusted networks" [puppet] - 10https://gerrit.wikimedia.org/r/830820 [15:40:37] !log pt1979@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:40:38] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [15:42:12] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:44:38] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:45:42] !log cgoubert@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:830803|wtp: Purge wtp servers following migration to parse (T317025)]] (duration: 04m 00s) [15:45:45] T317025: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 [15:48:34] (03PS1) 10Btullis: Add the prometheus config to enable scraping from the dse-k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/830897 (https://phabricator.wikimedia.org/T310179) [15:49:34] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:49] !log pt1979@cumin1001 START - Cookbook sre.dns.netbox [15:50:32] !log cgoubert@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid [15:51:05] !log pt1979@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:51:58] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:54] !log pt1979@cumin1001 START - Cookbook sre.dns.netbox [15:55:30] !log pt1979@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:55:38] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:57:06] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:57:06] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:47] !log pt1979@cumin1001 START - Cookbook 
sre.hosts.provision for host kafka-logging1005.mgmt.eqiad.wmnet with reboot policy FORCED [15:58:38] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:58:54] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:00:05] jbond and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T1600). [16:00:05] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:09] ]o/ [16:01:16] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:01:24] dancy: taking a look now [16:01:43] (03CR) 10Jbond: [C: 03+2] Revert comment change [puppet] - 10https://gerrit.wikimedia.org/r/828583 (owner: 10Ahmon Dancy) [16:01:54] (03PS3) 10Jbond: Revert comment change [puppet] - 10https://gerrit.wikimedia.org/r/828583 (owner: 10Ahmon Dancy) [16:03:20] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:04:10] !log dancy@deploy1002 Installing scap version "4.17.0" for 566 hosts [16:04:23] 10SRE, 10Traffic, 10Patch-For-Review: Create program to interact with Atlas RIPE API - https://phabricator.wikimedia.org/T315536 (10BCornwall) p:05Triage→03Medium [16:04:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37171/console" [puppet] - 10https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: 10Jaime Nuche) [16:04:38] 10SRE, 10Traffic: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (10BCornwall) p:05Triage→03Medium [16:04:54] 10SRE, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097 (10BCornwall) p:05Triage→03Medium [16:04:59] 10SRE, 10Traffic, 10Patch-For-Review: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (10BCornwall) p:05Triage→03Medium [16:05:39] (03CR) 10Jbond: [V: 03+1] "> Patch Set 1: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: 10Jaime Nuche) [16:06:17] (03CR) 10Ladsgroup: [C: 03+1] Revert "Exclude cloud-eqiad prefix from lists trusted networks" [puppet] - 10https://gerrit.wikimedia.org/r/830820 (owner: 10Ayounsi) [16:06:43] (03CR) 10Ssingh: varnish: Remove extraneous checks for Docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [16:07:27] dancy: i have merged the revert just checking with service ops on the other one [16:07:55] ok.
Note that for https://gerrit.wikimedia.org/r/c/operations/puppet/+/830850/, scap is the only thing that reads the resulting file (and the scap code to do that reading is not enabled yet) [16:08:24] And the changes reported in https://puppet-compiler.wmflabs.org/pcc-worker1002/37171/deploy2002.codfw.wmnet/index.html are expected. [16:08:36] dancy: ack so its safe to merge and you will update the scap code later? [16:08:58] Yep. I intend to merge the corresponding scap code today if all goes well. [16:08:59] (03CR) 10BCornwall: varnish: Remove extraneous checks for Docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [16:09:08] ack great thanks will merge now [16:09:12] thx! [16:09:16] (03CR) 10Jbond: [V: 03+1 C: 03+2] k8s scap: change format of mediawiki deployment files [puppet] - 10https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: 10Jaime Nuche) [16:09:23] (03PS2) 10Jbond: k8s scap: change format of mediawiki deployment files [puppet] - 10https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: 10Jaime Nuche) [16:10:57] (03CR) 10Ahmon Dancy: k8s scap: change format of mediawiki deployment files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/830850 (https://phabricator.wikimedia.org/T299648) (owner: 10Jaime Nuche) [16:13:03] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/830690 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite) [16:13:36] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/821324 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite) [16:15:39] dancy: merged and deployed [16:15:43] Thanks jbond! [16:15:47] np [16:16:13] I have verified that the new format file is on deploy1002. 
[16:17:43] (03CR) 10BCornwall: varnish: Remove extraneous checks for Docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [16:18:10] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:56] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1005.mgmt.eqiad.wmnet with reboot policy FORCED [16:23:04] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading: File upload not working: Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T295343 (10Krinkle) [16:23:38] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10Traffic, and 2 others: 502 Server Hangup Error on esams for "Upload a new version of this file" on Special:Upload on Commons - https://phabricator.wikimedia.org/T247454 (10Krinkle) [16:25:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Papaul) [16:28:43] (03PS1) 10AOkoth: vrts: fix puppet failure on otrs1001 [puppet] - 10https://gerrit.wikimedia.org/r/830902 (https://phabricator.wikimedia.org/T317059) [16:29:05] (03PS2) 10AOkoth: vrts: fix puppet failure on otrs1001 [puppet] - 10https://gerrit.wikimedia.org/r/830902 (https://phabricator.wikimedia.org/T317059) [16:30:04] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:04] (03PS3) 10AOkoth: vrts: fix puppet failure on otrs1001 [puppet] - 10https://gerrit.wikimedia.org/r/830902 (https://phabricator.wikimedia.org/T317059) [16:32:25] (03PS4) 10AOkoth: vrts: fix puppet failure on otrs1001 [puppet] - 10https://gerrit.wikimedia.org/r/830902 (https://phabricator.wikimedia.org/T317059) [16:34:38] PROBLEM - k8s API server requests 
latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:37:50] (03CR) 10AOkoth: [C: 03+2] vrts: fix puppet failure on otrs1001 [puppet] - 10https://gerrit.wikimedia.org/r/830902 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [16:39:24] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:42:13] (03CR) 10Andrew Bogott: "https://phabricator.wikimedia.org/T317344" [puppet] - 10https://gerrit.wikimedia.org/r/830860 (https://phabricator.wikimedia.org/T315608) (owner: 10Muehlenhoff) [16:43:17] (03PS8) 10BCornwall: prometheus: Parse ATS config to node exporter text [puppet] - 10https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) [16:43:19] (03PS4) 10BCornwall: ats: Enable node_ats_config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/830686 (https://phabricator.wikimedia.org/T292815) [16:43:58] (03CR) 10BCornwall: "This has been tested with pcc as well as running the commands manually on an ATS instance." 
[puppet] - 10https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [16:45:42] (03PS1) 10Hnowlan: Fix offline tests [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830903 [16:46:10] (03PS2) 10Hnowlan: Fix online tests [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830903 [16:49:06] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:40] ^ misbehaving Google CT log. if it persists, we will remove it [16:51:18] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:56:06] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:58:14] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:00:04] bd808: May I have your attention please! Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T1700) [17:00:31] * bd808 makes a patch to update developer portal [17:00:38] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:03:08] (03CR) 10Ssingh: varnish: Remove extraneous checks for Docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [17:03:54] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2022-09-08-111810-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/830906 [17:13:31] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2022-09-08-111810-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/830906 (owner: 10BryanDavis) [17:13:33] (03PS1) 10Vlad.shapik: Remove division operation hack related to Python2 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) [17:14:31] (03PS2) 10Vlad.shapik: WP: Remove division operation hack related to Python2 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830907 (https://phabricator.wikimedia.org/T314393) [17:15:12] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:15:22] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:02] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2022-09-08-111810-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/830906 (owner: 10BryanDavis) [17:18:24] (03CR) 10Vlad.shapik: [C: 03+1] "Looks reasonable to remove ignoring." 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830608 (owner: 10Hnowlan) [17:20:52] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:21:20] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:21:26] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:22:05] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:22:13] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:22:24] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:22:58] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:25:23] (03CR) 10Ayounsi: [C: 03+2] Revert "Exclude cloud-eqiad prefix from lists trusted networks" [puppet] - 10https://gerrit.wikimedia.org/r/830820 (owner: 10Ayounsi) [17:26:07] 10SRE, 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10Krinkle) 05Open→03Resolved a:03Krinkle [17:27:20] Hi ops folks - I could do with some help on the stat1008.eqiad.wmnet host - we have a process killing the host CPU - can any of you have a look please? [17:28:26] Krinkle: I am not sure we should close T316188 until at least a report is filed [17:28:27] T316188: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 [17:29:15] joal: can't SSH to the host, so I guess we have to force a reboot. if that's OK, I am happy to do that but I also realize there might be other active scripts running and the output might not be saved?
[17:29:24] 10SRE, 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10Krinkle) Public placeholder report at: [17:29:38] jynus: the prod error is resolved. [17:29:57] afaik we don't usually use tasks for the writing of an incident unless the incident has a tracking task which most don't. [17:30:11] thank sukhe - indeed there could be other sccripts - I wonder if it's worth letting the host alone and expect it might come, or reboot it now [17:30:16] I don't mind actually resolving that, but people will forget to file one if there is nothing on phab encouraging to dod that [17:30:54] sukhe: let's reboot it please - it's unusable now, so let's make it back [17:30:57] joal: I saw the conversation in the other channel. if it is simply a matter of a CPU intensive cookbook we can perhaps wait for it but if it is an unknown, then a reboot is probably the only way [17:31:14] I assume the incident ritual and spreadsheet will eventually make it to this through the "draft" category or whatever we use to track that. There are tons more that don't have a task for it that presumably go through the same process. [17:31:23] I closed it for the prod error stats :) [17:31:26] joal: sure. OK to reboot it then? I will wait for your definite yes [17:31:28] for which I'm 64 days overdue. 
[17:31:50] Yes sukhe - please reboot - thank you [17:31:53] doing [17:33:01] !log stat1008: sudo ipmitool -I lanplus -H "stat1008.mgmt.eqiad.wmnet" -U root -E chassis power cycle [17:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:44] (03CR) 10BCornwall: Unlink certificate renewal and OCSP handling (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: 10BCornwall) [17:35:56] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100% [17:36:05] (03CR) 10BCornwall: varnish: Remove extraneous checks for Docker (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [17:38:53] joal: doesn't seem to be coming back up so I am guessing it was another issue [17:39:06] https://puppetboard.wikimedia.org/node/stat1008.eqiad.wmnet seems to suggest Puppet was failing for over a day now [17:39:16] Failed to set owner to '0': Read-only file system @ apply2files - /mnt/nfs/dumps-labstore1006.wikimedia.org [17:39:18] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:22] wow [17:39:25] change from 400 to 'root' failed: Failed to set owner to '0': Read-only file system @ apply2files - /mnt/nfs/dumps-labstore1006.wikimedia.org [17:39:44] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [17:39:47] ok back up now [17:40:15] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10Andrew) ` root@cloudcontrol2005-dev:~# dig +noall +answer SOA 16-29.57.15.185.in-addr.arpa. 16-29.57.15.185.in-addr.arpa. 120 IN SOA ns0.openstack.codfw1dev.wikime... [17:40:26] a Puppet run is in progress, let's see if it completes. 
but yesh, probably requires a deeper look [17:40:42] s/yesh/yes [17:41:04] sukhe: the errors feels related to a change that has happened yesterday: migration of labstore to clouddumps, changing nfs [17:41:12] I hope puppet doesn't fail :S [17:41:34] (03PS1) 10Dduvall: buildkitd: Bump version to 0.10.4 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/830909 [17:41:51] joal: yeah, I see the commit (5dd213d019a012dee344ee2a6a586c0615b7c9dd) [17:42:01] with the news, we've got a spike of traffic [17:42:03] it's been rolledback if I don't mistake [17:42:19] I'd advise we aim for stability for a little while right now, avoid any risky changes that can be deferred [17:43:00] joal: Puppet failed again, similar error. I am not sure if you can access Puppetboard so I am happy to share the error and you can create a task (not sure who owns the machines but yeah) [17:43:49] sukhe: I'm no SRE, I can't access - If ou could create a task with the error and ping btullis on it that'd be awesome [17:43:57] btullis, sure [17:44:00] happy to do that [17:44:08] Thank you so much sukhe [17:48:35] joal: how can I best see the issue? [17:48:48] 10SRE, 10SRE-swift-storage, 10Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (10Krinkle) [17:49:02] andrewbogott: if it helps, see T317359 [17:49:02] T317359: Puppet failure on stat1008 - https://phabricator.wikimedia.org/T317359 [17:49:06] 10SRE, 10SRE-swift-storage, 10Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (10Krinkle) [17:49:10] 10SRE, 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10Krinkle) [17:49:12] andrewbogott: I think sukhe is creating a task - puppet doesn't run on the host anymore [17:49:47] ok, looking... 
[17:49:52] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 180 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:50:06] oh [17:52:18] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:00:04] jeena and dduvall: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T1800). [18:00:30] train deployments are paused for the time being due to high traffic from current events. We will re-asses in an hour [18:00:50] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:03:38] 10SRE, 10SRE-swift-storage, 10Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (10jcrespo) Adding @MatthewVernon, as he and @fgiunchedi will be the most knowledgeable people to understand what went wrong to add to the doc. Moritz and I can h... [18:05:26] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:18:28] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:19:46] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:22:22] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:31] sigh, going to remove the Google CT log [18:33:54] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:41:00] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:42:02] (03PS1) 10DDesouza: Deploy Research Incentive Survey to idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830917 (https://phabricator.wikimedia.org/T316466) [18:42:35] (03PS2) 10DDesouza: Deploy Research Incentive Survey to idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830917 (https://phabricator.wikimedia.org/T316466) [18:47:52] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:48:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (10Jclark-ctr) Dell has requested i run Hardware Diagnostics after Support log showed no errors i have run multiple t... 
[18:48:32] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:19] I will be rolling forward to group 1 in a few minutes [19:03:07] (03PS1) 10Bking: Revert "elastic: reduce master-eligibles for codfw back down to 2" [puppet] - 10https://gerrit.wikimedia.org/r/830923 (https://phabricator.wikimedia.org/T313431) [19:04:44] (03PS2) 10Ryan Kemper: Revert "elastic: reduce master-eligibles for codfw back down to 2" [puppet] - 10https://gerrit.wikimedia.org/r/830923 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [19:04:50] jeena: I'm around as mw and sre stuff, let me know if you see massive changes [19:05:11] (03PS1) 10Jgreen: Add a temporary TXT record for Dmarcian account owner change from ccogdill@ to postmaster@. [dns] - 10https://gerrit.wikimedia.org/r/830925 (https://phabricator.wikimedia.org/T316899) [19:05:14] Amir1: all the code changed!!!! [19:05:26] :D [19:05:50] Thanks Amir1! [19:06:06] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37175/console" [puppet] - 10https://gerrit.wikimedia.org/r/830923 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [19:06:16] (03PS1) 10TrainBranchBot: group1 wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830926 (https://phabricator.wikimedia.org/T314189) [19:06:20] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830926 (https://phabricator.wikimedia.org/T314189) (owner: 10TrainBranchBot) [19:06:58] (03CR) 10Jgreen: [C: 03+2] Add a temporary TXT record for Dmarcian account owner change from ccogdill@ to postmaster@. [dns] - 10https://gerrit.wikimedia.org/r/830925 (https://phabricator.wikimedia.org/T316899) (owner: 10Jgreen) [19:07:00] jeena: just curios, when is the plan for group2? 
[19:07:03] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830926 (https://phabricator.wikimedia.org/T314189) (owner: 10TrainBranchBot) [19:07:29] I was going to go ahead and go to all wikis if all seemed fine after 15-30 minutes [19:07:30] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:11:25] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.28 refs T314189 [19:11:27] (03PS3) 10Ryan Kemper: Revert "elastic: reduce master-eligibles for codfw back down to 2" [puppet] - 10https://gerrit.wikimedia.org/r/830923 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [19:11:28] T314189: 1.39.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T314189 [19:12:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:13:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:13:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:14:42] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:14:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:14:52] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:05] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.28 refs T314189 (duration: 03m 39s) [19:17:15] jeena: read on s8 is quite elevated but it's partially expected, let me see if it recovers [19:17:23] okay [19:19:24] jeena: mostly recovered [19:19:29] nice [19:19:51] (03PS1) 10Majavah: P:wmcs::novaproxy: add prometheus nginx exporter [puppet] - 10https://gerrit.wikimedia.org/r/830928 [19:25:53] logs look alright so if all seems fine to you still Amir1 then I'll roll to all wikis soon [19:26:49] we have an uptick in esams, I'm not sure why but it shouldn't affect anything [19:26:58] reading a million graphs at the same time [19:27:14] is there a particular dashboard I should be looking at? [19:28:01] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=now-1h&to=now [19:28:05] This is the most important one [19:28:26] but i check db load as well, as that usually gets upset first https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-1h&to=now [19:28:45] thanks! 
[19:28:50] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:29:12] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:43] deploying to all wikis now [19:31:16] (03PS1) 10TrainBranchBot: all wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830929 (https://phabricator.wikimedia.org/T314189) [19:31:16] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:31:18] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830929 (https://phabricator.wikimedia.org/T314189) (owner: 10TrainBranchBot) [19:32:07] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830929 (https://phabricator.wikimedia.org/T314189) (owner: 10TrainBranchBot) [19:35:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:35:17] * Amir1 makes a cross [19:35:58] the five hundreds from the new train are arriving but natural [19:36:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:36:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:36:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [19:36:24] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all
wikis to 1.39.0-wmf.28 refs T314189 [19:36:27] T314189: 1.39.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T314189 [19:36:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:38:23] (03PS1) 10Majavah: P:wmcs::novaproxy: add rate limiter [puppet] - 10https://gerrit.wikimedia.org/r/830932 [19:39:49] jeena: hmm, elasticsearch (risky patch this week) is rejecting a few requests right now, will wait a minute to see if it subsides (the initial traffic spike onto the db can be a bit rough, needs to pull a lot of content from disk into memory caches), but if it stays will need to roll back [19:40:04] ok let me know [19:40:57] 500s are fine [19:42:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:42:49] (03PS1) 10Majavah: hieradata: add fake novaproxy passwords [labs/private] - 10https://gerrit.wikimedia.org/r/830933 [19:43:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:43:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:43:05] spike in rejections is declining, from 5k/30s down to 3k/30s, if it keeps declining should be fine [19:43:08] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:43:18] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [19:43:49] thanks ebernhardson [19:43:50] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] hieradata: add 
fake novaproxy passwords [labs/private] - 10https://gerrit.wikimedia.org/r/830933 (owner: 10Majavah) [19:43:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:44:12] there are quite a few errors contacting parsoid/RESTBase that i'm not sure about [19:45:32] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:46:04] (03PS1) 10Andrew Bogott: dynamic proxy: block a second troublesome UA [puppet] - 10https://gerrit.wikimedia.org/r/830934 [19:47:09] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37178/console" [puppet] - 10https://gerrit.wikimedia.org/r/830932 (owner: 10Majavah) [19:48:36] jeena: hmm, but now its going back up :( we might need to roll back and rebalance the shards in the cluster, essentially two of the nodes are looking overloaded and elastic doesn't do a good job of routing around struggling nodes [19:49:08] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:49:11] okay, roll back to group1 or further? [19:49:46] jeena: group1 is fine. 
hopefully just an hour or so to tell elastic to move some shards away from these two nodes [19:49:58] (03CR) 10Andrew Bogott: [C: 03+2] P:wmcs::novaproxy: add rate limiter [puppet] - 10https://gerrit.wikimedia.org/r/830932 (owner: 10Majavah) [19:50:18] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37179/console" [puppet] - 10https://gerrit.wikimedia.org/r/830934 (owner: 10Andrew Bogott) [19:50:44] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:22] (03PS1) 10TrainBranchBot: group2 wikis to 1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830935 (https://phabricator.wikimedia.org/T314189) [19:51:24] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830935 (https://phabricator.wikimedia.org/T314189) (owner: 10TrainBranchBot) [19:51:36] ebernhardson: rolling back now [19:52:06] (03Merged) 10jenkins-bot: group2 wikis to 1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830935 (https://phabricator.wikimedia.org/T314189) (owner: 10TrainBranchBot) [19:54:28] jeena: thanks! [19:56:19] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.39.0-wmf.27 refs T314189 [19:56:23] T314189: 1.39.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T314189 [19:59:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:00:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:00:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:00:04] brennen and TheresNoTime: Time to snap out of that daydream and deploy UTC late backport and config training. Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220908T2000). [20:00:04] arlolra and danisztls: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:24] here [20:00:24] * TheresNoTime is here! [20:00:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:00:58] o/ [20:01:02] the train is ongoing [20:01:15] I saw that we just rolled back [20:01:44] I was just looking to see where things were [20:02:11] if needed I can roll the train tomorrow morning (in like 12 hours) [20:02:26] are we delaying this deployment window then? :) [20:02:37] looks like we're staying in the current state for an hour or so at least (is that right ebernhardson ?) [20:02:37] train rolled back for elasticsearch, it was struggling with the paired traffic shift from eqiad->codfw, two nodes struggling but one in particular rejected 110k requests over 15 minutes. Optimistically can re-run the train forward in about an hour after we get elastic to shuffle some shards away [20:02:49] \o/ [20:03:22] k, sounds like we can backport in the interim [20:03:48] Okay! I can deploy :) [20:04:26] thanks Erik! [20:04:37] arlolra: will start with 830702 [20:05:31] thank you [20:05:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/830702 (https://phabricator.wikimedia.org/T317215) (owner: 10Arlolra) [20:09:43] looks like these two backports are going to take ~15 minutes in CI :') [20:10:00] fun [20:10:54] TheresNoTime: if you want to get out the config change while you're waiting you can ctrl-c and re-run scap backport after it merges and it'll Just Work™ [20:11:21] ooh!
[20:11:55] doesn't look like danisztls is around to test though [20:12:00] ah [20:12:08] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [20:12:32] well nevermind :) for future reference, I guess [20:12:43] useful to know, thank you :) [20:14:38] nice, cirrus error rate even lower than 2 hours ago [20:20:59] (03Merged) 10jenkins-bot: Fix selser on html endpoints [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/830702 (https://phabricator.wikimedia.org/T317215) (owner: 10Arlolra) [20:21:23] I go take a break, keep my phone close [20:21:25] !log samtar@deploy1002 Started scap: Backport for [[gerrit:830702|Fix selser on html endpoints (T317215)]] [20:21:28] T317215: Selser broken on html endpoints - https://phabricator.wikimedia.org/T317215 [20:21:49] !log samtar@deploy1002 samtar and arlolra: Backport for [[gerrit:830702|Fix selser on html endpoints (T317215)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:22:10] arlolra: can you test on mwdebug1001? 
[20:23:23] (03PS2) 10Dzahn: Revert "Exclude cloud-eqiad prefix from VRT trusted networks" [puppet] - 10https://gerrit.wikimedia.org/r/830818 (owner: 10Ayounsi) [20:24:01] (not entirely sure that's testable to be honest) [20:25:00] yes, give me a sec [20:25:11] Okay :) [20:28:47] it seems safe to continue [20:28:59] Syncing :) [20:31:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:32:43] (03CR) 10Dzahn: [C: 03+2] Revert "Exclude cloud-eqiad prefix from VRT trusted networks" [puppet] - 10https://gerrit.wikimedia.org/r/830818 (owner: 10Ayounsi) [20:33:31] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:830702|Fix selser on html endpoints (T317215)]] (duration: 12m 06s) [20:33:34] T317215: Selser broken on html endpoints - https://phabricator.wikimedia.org/T317215 [20:33:39] arlolra: now doing 830703 [20:33:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830703 (https://phabricator.wikimedia.org/T317215) (owner: 10Arlolra) [20:34:10] (03CR) 10Dzahn: [C: 03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/830818 (owner: 10Ayounsi) [20:34:39] 28 is only on group0? 
[20:35:15] https://versions.toolforge.org/ shows .28 is on group0 and group1 [20:35:25] group is on .27 [20:35:28] *group2 [20:35:37] thank you [20:36:09] (03CR) 10Dzahn: [C: 03+1] "true, also the current section says "no wikis" and it's a wiki" [dns] - 10https://gerrit.wikimedia.org/r/830863 (owner: 10Zabe) [20:37:01] (03CR) 10Dzahn: [C: 03+2] wikimedia.org: Move nyc to the wikis section [dns] - 10https://gerrit.wikimedia.org/r/830863 (owner: 10Zabe) [20:37:05] (03PS2) 10Dzahn: wikimedia.org: Move nyc to the wikis section [dns] - 10https://gerrit.wikimedia.org/r/830863 (owner: 10Zabe) [20:37:33] While we wait, is there anyone here who is familiar enough with T316466 (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/830917) and wants to take up testing it, otherwise it's unlikely to get deployed [20:37:34] T316466: Deploy Research Incentive Survey on Indonesian Wikipedia - https://phabricator.wikimedia.org/T316466 [20:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:37:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:38:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:40:26] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:44:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:48:17] 830703 almost merged :') [20:49:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:54:21] (03Merged) 10jenkins-bot: Fix selser on html endpoints [core] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830703 
(https://phabricator.wikimedia.org/T317215) (owner: 10Arlolra) [20:54:25] (03CR) 10Dzahn: [C: 03+1] "downloaded wget https://github.com/moby/buildkit/releases/download/v0.10.4/buildkit-v0.10.4.linux-amd64.tar.gz and confirmed SHA256sum" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/830909 (owner: 10Dduvall) [20:54:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830703 (https://phabricator.wikimedia.org/T317215) (owner: 10Arlolra) [20:55:01] !log samtar@deploy1002 Started scap: Backport for [[gerrit:830703|Fix selser on html endpoints (T317215)]] [20:55:04] T317215: Selser broken on html endpoints - https://phabricator.wikimedia.org/T317215 [20:55:24] !log samtar@deploy1002 samtar and arlolra: Backport for [[gerrit:830703|Fix selser on html endpoints (T317215)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:55:43] arlolra: finally merged! 
can you test on mwdebug1001 please :) [20:55:50] ok [20:56:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:56:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:57:38] ok, please continue [20:57:47] syncing [20:59:45] (JobUnavailable) firing: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:01:49] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:830703|Fix selser on html endpoints (T317215)]] (duration: 06m 48s) [21:01:54] T317215: Selser broken on html endpoints - https://phabricator.wikimedia.org/T317215 [21:01:57] all done :) [21:02:04] !log closing UTC late backport and config training [21:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:42] thanks TheresNoTime, selser is indeed fixed on html endpoints now [21:02:52] ebernhardson: FYI deployments done ref retrying the train [21:02:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:03:05] arlolra: you're very welcome :) [21:03:05] we should be ready to retry the train now [21:03:40] deploying to all wikis now [21:03:56] (03PS1) 10TrainBranchBot: all wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830944 (https://phabricator.wikimedia.org/T314189) [21:03:58] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830944 (https://phabricator.wikimedia.org/T314189) (owner: 10TrainBranchBot) [21:04:40] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830944 (https://phabricator.wikimedia.org/T314189) (owner: 10TrainBranchBot) [21:08:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] 
START helmfile.d/services/mwdebug: apply [21:08:46] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.28 refs T314189 [21:08:49] T314189: 1.39.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T314189 [21:09:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:09:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:09:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:18:45] so far everything is looking happy on the elastic side. should be safe to leave the train up [21:18:55] thanks ebernhardson! [21:19:33] (03CR) 10Dzahn: [C: 03+2] buildkitd: Bump version to 0.10.4 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/830909 (owner: 10Dduvall) [21:20:21] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:24:55] PROBLEM - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_mjolnir-kafka-msearch-daemon@2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:31:33] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:35:43] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:38:26] (03PS1) 10Ssingh: certspotter: remove misbehaving Google CT log [puppet] - 10https://gerrit.wikimedia.org/r/830945 [21:40:45] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37180/console" [puppet] - 10https://gerrit.wikimedia.org/r/830945 (owner: 10Ssingh) [21:41:13] (03CR) 10Ssingh: [V: 03+1 C: 03+2] certspotter: remove misbehaving Google CT log [puppet] - 10https://gerrit.wikimedia.org/r/830945 (owner: 10Ssingh) [21:41:34] (03CR) 10Dzahn: "Thanks for fixing flapping monitoring. =I can confirm this URL is 404. You also seem to have done this before. Only thing is one of the co" [puppet] - 10https://gerrit.wikimedia.org/r/830945 (owner: 10Ssingh) [21:43:23] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:48:48] mutante: thanks, I guess I copied the current commit [21:49:07] that's what I get for doing this as the water boils for dinner :) [21:53:32] sukhe: thanks for fixing the icinga alert about icinga itself which is actually Google breaking URLs [21:55:05] (03PS1) 10JHathaway: mail::mx: Add support for PLAIN auth over tls [puppet] - 10https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) [21:55:11] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:56:28] (03CR) 10JHathaway: "Kindly review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/830948 (https://phabricator.wikimedia.org/T314815) (owner: 10JHathaway) [21:56:59] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:21] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:02:17] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:03:25] (03PS1) 10BCornwall: ats: Alert on high connection/request count [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) [22:12:36] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10User-herron, 10User-jbond: Prevent puppet catalog compiler workers from running out of disk space - https://phabricator.wikimedia.org/T222075 (10Dzahn) This happened again the other day and made me mail the SRE list. Then I added docs how to clea... [22:16:33] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:28:21] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:40:11] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:49:09] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:53:55] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:03:53] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:17:43] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:20:05] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:42:17] PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:48:13] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:52:57] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:55:42] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided) [23:56:10] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 27s)