[00:14:37] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[00:14:54] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 17s)
[00:17:02] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:17:26] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:19:12] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.217 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:19:34] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48681 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:22:12] SRE, conftool, Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794 (CDanis) Going to be bold and append to the task description with what we discussed in the cachebust WG meeting today (so that anyone can update it and tick boxes as we go).
[00:24:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:27:59] (CR) CDanis: [C: -1] C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines (2 comments) [puppet] - https://gerrit.wikimedia.org/r/832268 (https://phabricator.wikimedia.org/T317799) (owner: Jbond)
[00:29:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:39:35] SRE, Traffic, conftool, Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794 (CDanis)
[00:49:31] (CR) CDanis: C:varnish: Rate limit hotlinking (1 comment) [puppet] - https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) (owner: Jbond)
[00:51:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:46] (CR) CDanis: C:varnish: Rate limit hotlinking (1 comment) [puppet] - https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) (owner: Jbond)
[00:54:51] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:56:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48682 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:36:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:57] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:57] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:54:23] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[01:54:33] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 10s)
[01:57:33] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b9be20d]: (no justification provided)
[01:57:42] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b9be20d]: (no justification provided) (duration: 00m 09s)
[02:06:45] (JobUnavailable) resolved: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:37] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 875135 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[02:41:15] RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[02:50:03] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:51:56] SRE, Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (3245) >>! In T317851#8240817, @Masssly wrote: > @3245 As the primary admin, can you please check your email and proceed with the rest of the settings? Well noted!
[03:06:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:11:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:19:56] (PS1) Dduvall: buildkitd: Support configuration of OCI executor nameservers [puppet] - https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904)
[03:20:33] (PS2) Dduvall: buildkitd: Support configuration of OCI executor nameservers [puppet] - https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904)
[03:21:27] (CR) CI reject: [V: -1] buildkitd: Support configuration of OCI executor nameservers [puppet] - https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: Dduvall)
[03:26:13] (PS3) Dduvall: buildkitd: Support configuration of OCI executor nameservers [puppet] - https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904)
[03:29:18] (CR) CI reject: [V: -1] buildkitd: Support configuration of OCI executor nameservers [puppet] - https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: Dduvall)
[03:30:38] (CR) Dduvall: "Our wmf-style linter does not seem to like the cross profile lookup(). I'm not sure how to restructure this change to appease the linter." [puppet] - https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: Dduvall)
[03:31:28] (CR) Dzahn: "Hey, kudos for tracking all this down and explaining it in such detail. This looks all good to me. Just one thing, please move the "lookup" [puppet] - https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: Dduvall)
[03:33:10] (CR) Dzahn: "My comment may make no sense anymore because I review a previous PS" [puppet] - https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: Dduvall)
[03:34:19] (CR) Dzahn: buildkitd: Support configuration of OCI executor nameservers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: Dduvall)
[03:34:55] dduvall: try moving the lookup() to the parameter section and later just use the variable
[03:35:36] I got a little confused about review, will look again first thing in the morning. and kudos tracking this down!
[03:35:42] bbl
[03:36:23] thanks! I’ll come back to it in the morning as well
[03:37:57] dduvall: good morning ;)
[03:38:15] have a good night Californians
[03:51:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:22:07] (PS1) MarkAHershberger: Planet: Add my blog [puppet] - https://gerrit.wikimedia.org/r/832585
[04:44:43] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:45:09] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:51:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:52:11] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:21] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:49] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:20:51] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:42:40] SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Marostegui) Thanks - I will reclone this host now and put it back in production
[05:44:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1198', diff saved to https://phabricator.wikimedia.org/P34795 and previous config saved to /var/cache/conftool/dbconfig/20220916-054438-root.json
[05:45:23] (PS1) Marostegui: db1198: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/832587
[05:46:29] (CR) Marostegui: [C: +2] db1198: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/832587 (owner: Marostegui)
[05:50:06] (PS1) Marostegui: db1168: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/832588 (https://phabricator.wikimedia.org/T301879)
[05:50:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1168', diff saved to https://phabricator.wikimedia.org/P34797 and previous config saved to /var/cache/conftool/dbconfig/20220916-055031-root.json
[05:51:24] !log Install 10.6 on db1168 T301879
[05:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:27] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879
[05:51:55] (CR) Marostegui: [C: +2] db1168: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/832588 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[05:54:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34798 and previous config saved to /var/cache/conftool/dbconfig/20220916-055424-root.json
[05:55:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1168', diff saved to https://phabricator.wikimedia.org/P34799 and previous config saved to /var/cache/conftool/dbconfig/20220916-055542-root.json
[05:57:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34800 and previous config saved to /var/cache/conftool/dbconfig/20220916-055717-root.json
[06:05:05] (PS2) JMeybohm: Add missing dashboard links to k8s related alerts [alerts] - https://gerrit.wikimedia.org/r/832517
[06:12:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34801 and previous config saved to /var/cache/conftool/dbconfig/20220916-061222-root.json
[06:27:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34802 and previous config saved to /var/cache/conftool/dbconfig/20220916-062727-root.json
[06:42:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34803 and previous config saved to /var/cache/conftool/dbconfig/20220916-064232-root.json
[06:53:30] (CR) Muehlenhoff: Add a cookbook to restart/reboot logstash collector/Kibana nodes (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/832447 (owner: Muehlenhoff)
[06:54:43] (CR) Muehlenhoff: [C: +2] Add cookbook to restart/reboot the Docker registry (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/832241 (owner: Muehlenhoff)
[06:55:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:57:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34804 and previous config saved to /var/cache/conftool/dbconfig/20220916-065737-root.json
[06:58:56] (PS7) Muehlenhoff: Add a cookbook to restart/reboot logstash collector nodes [cookbooks] - https://gerrit.wikimedia.org/r/832447
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220916T0700)
[07:00:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:04:20] (PS1) Muehlenhoff: sre.misc-clusters.roll-restart-reboot-docker-registry: Also restart docker-registry itself [cookbooks] - https://gerrit.wikimedia.org/r/832591
[07:10:43] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:12:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34805 and previous config saved to /var/cache/conftool/dbconfig/20220916-071241-root.json
[07:13:06] (CR) Volans: "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/832447 (owner: Muehlenhoff)
[07:17:41] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:19:16] * jayme checking deploy_to_mwdebug
[07:21:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:22:21] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:25:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:25:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:26:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:27:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34806 and previous config saved to /var/cache/conftool/dbconfig/20220916-072746-root.json
[07:29:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2180', diff saved to https://phabricator.wikimedia.org/P34807 and previous config saved to /var/cache/conftool/dbconfig/20220916-072958-root.json
[07:30:59] (PS1) Marostegui: db2180: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/832607 (https://phabricator.wikimedia.org/T301879)
[07:32:08] (CR) Marostegui: [C: +2] db2180: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/832607 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[07:41:46] (PS1) Marostegui: Revert "db1198: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/832561
[07:42:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34808 and previous config saved to /var/cache/conftool/dbconfig/20220916-074251-root.json
[07:45:38] (CR) Marostegui: [C: +2] Revert "db1198: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/832561 (owner: Marostegui)
[07:45:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 1%: Repooling after cloning db1189', diff saved to https://phabricator.wikimedia.org/P34809 and previous config saved to /var/cache/conftool/dbconfig/20220916-074548-root.json
[07:46:41] (PS1) Marostegui: Revert "db1189: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/832562
[07:50:44] (CR) Marostegui: [C: +2] Revert "db1189: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/832562 (owner: Marostegui)
[07:51:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 1%: After being recloned', diff saved to https://phabricator.wikimedia.org/P34810 and previous config saved to /var/cache/conftool/dbconfig/20220916-075100-root.json
[07:51:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:00:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 3%: Repooling after cloning db1189', diff saved to https://phabricator.wikimedia.org/P34811 and previous config saved to /var/cache/conftool/dbconfig/20220916-080052-root.json
[08:06:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 3%: After being recloned', diff saved to https://phabricator.wikimedia.org/P34812 and previous config saved to /var/cache/conftool/dbconfig/20220916-080605-root.json
[08:06:18] (Abandoned) Jforrester: Restore compatibility with overrides for IndexPager::makeLink() [core] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/831215 (https://phabricator.wikimedia.org/T317477) (owner: Jforrester)
[08:06:58] (PS2) Jforrester: ExtensionDistributor: Add REL1_39 [mediawiki-config] - https://gerrit.wikimedia.org/r/829877 (https://phabricator.wikimedia.org/T313925)
[08:07:18] (CR) Jforrester: "Oops, I go away for two weeks and forget to get this deployed. Sorry!" [mediawiki-config] - https://gerrit.wikimedia.org/r/829877 (https://phabricator.wikimedia.org/T313925) (owner: Jforrester)
[08:15:22] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 149 probes of 680 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:15:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 5%: Repooling after cloning db1189', diff saved to https://phabricator.wikimedia.org/P34813 and previous config saved to /var/cache/conftool/dbconfig/20220916-081557-root.json
[08:21:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 50 probes of 680 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:21:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 5%: After being recloned', diff saved to https://phabricator.wikimedia.org/P34814 and previous config saved to /var/cache/conftool/dbconfig/20220916-082110-root.json
[08:21:53] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:22:06] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:23:52] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:31:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 10%: Repooling after cloning db1189', diff saved to https://phabricator.wikimedia.org/P34815 and previous config saved to /var/cache/conftool/dbconfig/20220916-083102-root.json
[08:36:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 10%: After being recloned', diff saved to https://phabricator.wikimedia.org/P34816 and previous config saved to /var/cache/conftool/dbconfig/20220916-083615-root.json
[08:46:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 25%: Repooling after cloning db1189', diff saved to https://phabricator.wikimedia.org/P34817 and previous config saved to /var/cache/conftool/dbconfig/20220916-084607-root.json
[08:47:32] (PS1) Elukey: admin_ng: use TLS origination for ml-serve sidecar configs [deployment-charts] - https://gerrit.wikimedia.org/r/832609 (https://phabricator.wikimedia.org/T313915)
[08:51:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 25%: After being recloned', diff saved to https://phabricator.wikimedia.org/P34818 and previous config saved to /var/cache/conftool/dbconfig/20220916-085120-root.json
[08:51:55] (CR) Filippo Giunchedi: "Nice! Replace grafana-rw.w.o with grafana.w.o (I'll add a CI check) but other than that LGTM" [alerts] - https://gerrit.wikimedia.org/r/832517 (owner: JMeybohm)
[09:01:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 50%: Repooling after cloning db1189', diff saved to https://phabricator.wikimedia.org/P34819 and previous config saved to /var/cache/conftool/dbconfig/20220916-090111-root.json
[09:04:55] (CR) Klausman: [C: +1] admin_ng: use TLS origination for ml-serve sidecar configs [deployment-charts] - https://gerrit.wikimedia.org/r/832609 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[09:06:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 50%: After being recloned', diff saved to https://phabricator.wikimedia.org/P34820 and previous config saved to /var/cache/conftool/dbconfig/20220916-090625-root.json
[09:10:47] (PS1) Filippo Giunchedi: Fail tests on links to grafana-rw.w.o [alerts] - https://gerrit.wikimedia.org/r/832613
[09:12:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[09:12:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[09:12:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T314041)', diff saved to https://phabricator.wikimedia.org/P34821 and previous config saved to /var/cache/conftool/dbconfig/20220916-091234-ladsgroup.json
[09:12:37] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[09:13:14] (CR) CI reject: [V: -1] Fail tests on links to grafana-rw.w.o [alerts] - https://gerrit.wikimedia.org/r/832613 (owner: Filippo Giunchedi)
[09:13:36] PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:14:52] PROBLEM - Check systemd state on analytics1076 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:06] (CR) Filippo Giunchedi: rewrite.py: changes for Phonos deployment (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: MusikAnimal)
[09:16:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 75%: Repooling after cloning db1189', diff saved to https://phabricator.wikimedia.org/P34822 and previous config saved to /var/cache/conftool/dbconfig/20220916-091616-root.json
[09:17:16] PROBLEM - Check systemd state on an-worker1136 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:18:18] PROBLEM - Hadoop NodeManager on an-worker1136 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:20:42] (PS3) JMeybohm: Add missing dashboard links to k8s related alerts [alerts] - https://gerrit.wikimedia.org/r/832517
[09:21:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 75%: After being recloned', diff saved to https://phabricator.wikimedia.org/P34823 and previous config saved to /var/cache/conftool/dbconfig/20220916-092130-root.json
[09:21:52] RECOVERY - Check systemd state on an-worker1136 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:22:28] (PS2) JMeybohm: Fail tests on links to grafana-rw.w.o [alerts] - https://gerrit.wikimedia.org/r/832613 (owner: Filippo Giunchedi)
[09:22:56] RECOVERY - Hadoop NodeManager on an-worker1136 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:24:22] (CR) JMeybohm: [C: +1] Fail tests on links to grafana-rw.w.o [alerts] - https://gerrit.wikimedia.org/r/832613 (owner: Filippo Giunchedi)
[09:24:41] (CR) Clément Goubert: profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) (owner: Samtar)
[09:26:07] (CR) JMeybohm: [C: +2] Add missing dashboard links to k8s related alerts [alerts] - https://gerrit.wikimedia.org/r/832517 (owner: JMeybohm)
[09:27:16] PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:27:30] PROBLEM - Check systemd state on an-worker1124 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:28:32] (Merged) jenkins-bot: Add missing dashboard links to k8s related alerts [alerts] - https://gerrit.wikimedia.org/r/832517 (owner: JMeybohm)
[09:31:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 100%: Repooling after cloning db1189', diff saved to https://phabricator.wikimedia.org/P34824 and previous config saved to /var/cache/conftool/dbconfig/20220916-093121-root.json
[09:33:07] (CR) Filippo Giunchedi: [C: +2] Fail tests on links to grafana-rw.w.o [alerts] - https://gerrit.wikimedia.org/r/832613 (owner: Filippo Giunchedi)
[09:34:04] RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:35:18] RECOVERY - Check systemd state on analytics1076 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 100%: After being recloned', diff saved to https://phabricator.wikimedia.org/P34825 and previous config saved to /var/cache/conftool/dbconfig/20220916-093635-root.json
[09:36:40] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:38:00] PROBLEM - Hadoop NodeManager on an-worker1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:38:21] (CR) MVernon: "Hi," [puppet] - https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: MusikAnimal)
[09:39:46] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:42:26] <_joe_> apergos: ^^
[09:42:40] <_joe_> the systemd unit logs just say
[09:42:43] <_joe_> Sep 16 09:33:12 snapshot1008 dumpcirrussearch.sh[9752]: <13>Sep 13 14:03:26 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20220912/commonswiki-20220912-cirrussearch-file.json.gz
[09:45:10] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:45:26] RECOVERY - Check systemd state on an-worker1124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:47:14] PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:51:57] (PS3) Jbond: C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines [puppet] - https://gerrit.wikimedia.org/r/832268 (https://phabricator.wikimedia.org/T317799)
[09:52:58] (CR) Jbond: C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines (2 comments) [puppet] - https://gerrit.wikimedia.org/r/832268 (https://phabricator.wikimedia.org/T317799) (owner: Jbond)
[09:58:42] (CR) Jbond: [C: +1] sre.misc-clusters.roll-restart-reboot-docker-registry: Also restart docker-registry itself [cookbooks] - https://gerrit.wikimedia.org/r/832591 (owner: Muehlenhoff)
[10:00:32] RECOVERY - Hadoop NodeManager on an-worker1108 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:01:32] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:04:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34826 and previous config saved to /var/cache/conftool/dbconfig/20220916-100400-root.json
[10:12:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T314041)', diff saved to https://phabricator.wikimedia.org/P34827 and previous config saved to /var/cache/conftool/dbconfig/20220916-101250-ladsgroup.json
[10:12:54] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[10:19:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34828 and previous config saved to /var/cache/conftool/dbconfig/20220916-101905-root.json
[10:27:03] (PS1) Raymond Ndibe: toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953)
[10:27:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P34829 and previous config saved to /var/cache/conftool/dbconfig/20220916-102756-ladsgroup.json
[10:27:59] (CR) CI reject: [V: -1] toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) (owner: Raymond Ndibe)
[10:33:35] (PS27) Jbond: C:varnish: Rate limit hotlinking [puppet] - https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799)
[10:33:37] (PS1) Jbond: C:varnish: Rate limit hotlinking [puppet] - https://gerrit.wikimedia.org/r/832621 (https://phabricator.wikimedia.org/T317799)
[10:34:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34830 and previous config saved to /var/cache/conftool/dbconfig/20220916-103411-root.json
[10:39:55] (CR) Jbond: "thanks" [puppet] - https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) (owner: Jbond)
[10:40:29] (PS28) Jbond: C:varnish: Rate limit hotlinking dry-run [puppet] - https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799)
[10:42:22] (PS29) Jbond: C:varnish: Rate
limit hotlinking dry-run [puppet] - 10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) [10:42:33] (03PS2) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/832621 (https://phabricator.wikimedia.org/T317799) [10:43:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P34831 and previous config saved to /var/cache/conftool/dbconfig/20220916-104303-ladsgroup.json [10:48:01] (03PS1) 10Jbond: raid: add Broadcom / LSI MegaRAID SAS-3 3324 [Intruder] (rev 01) [puppet] - 10https://gerrit.wikimedia.org/r/832622 (https://phabricator.wikimedia.org/T317924) [10:48:26] RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:49:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34832 and previous config saved to /var/cache/conftool/dbconfig/20220916-104916-root.json [10:50:23] (03CR) 10Jbond: [C: 03+2] raid: add Broadcom / LSI MegaRAID SAS-3 3324 [Intruder] (rev 01) [puppet] - 10https://gerrit.wikimedia.org/r/832622 (https://phabricator.wikimedia.org/T317924) (owner: 10Jbond) [10:56:11] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293 (10jbond) [10:56:16] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: raid_mgmt_tools cannot detect raid on clouddb1021 - https://phabricator.wikimedia.org/T317924 (10jbond) 05Open→03Resolved a:03jbond thanks updated the new fact to recognise this controler ` lang=shell $ sudo /usr/bin/facter --puppet --json raid_... 
[10:57:29] (03PS1) 10Zabe: Regenerate ukwikivoyage logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832623 (https://phabricator.wikimedia.org/T317718) [10:58:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T314041)', diff saved to https://phabricator.wikimedia.org/P34834 and previous config saved to /var/cache/conftool/dbconfig/20220916-105809-ladsgroup.json [10:58:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [10:58:13] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [10:58:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [10:58:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T314041)', diff saved to https://phabricator.wikimedia.org/P34835 and previous config saved to /var/cache/conftool/dbconfig/20220916-105819-ladsgroup.json [11:04:12] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:04:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34836 and previous config saved to /var/cache/conftool/dbconfig/20220916-110420-root.json [11:04:32] dbproxy alerts are expected [11:05:27] (03CR) 10Jbond: [C: 03+2] spec_helper: include the monkey patch for the actual spec tests [puppet] - 10https://gerrit.wikimedia.org/r/832469 (owner: 10Jbond) [11:07:46] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:09:54] (03PS7) 10Jbond: C:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) [11:10:16] (03CR) 10CI reject: [V: 04-1] 
C:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) (owner: 10Jbond) [11:13:47] (03PS2) 10Raymond Ndibe: toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) [11:14:42] (03CR) 10CI reject: [V: 04-1] toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) (owner: 10Raymond Ndibe) [11:17:15] (03PS8) 10Jbond: C:rsync::server: convert to concat [puppet] - 10https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618) [11:18:20] (03PS3) 10Raymond Ndibe: toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) [11:19:14] (03CR) 10jenkins-bot: toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) (owner: 10Raymond Ndibe) [11:19:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34837 and previous config saved to /var/cache/conftool/dbconfig/20220916-111925-root.json [11:19:50] (03PS4) 10Raymond Ndibe: toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) [11:20:44] (03CR) 10CI reject: [V: 04-1] toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) (owner: 10Raymond Ndibe) [11:22:53] _joe_: sorry, was fighting with the greek bureaucracy as we do every year or two around this time, every time it's an even bigger time sink [11:23:33] the cirrussearch stuff is known, there's an open task for 
the search folks, I just copy paste the errors from failures in there when they show up, so they know the errors aren't gone, that's it [11:26:55] (03PS5) 10Raymond Ndibe: toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) [11:27:40] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:27:49] (03CR) 10CI reject: [V: 04-1] toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) (owner: 10Raymond Ndibe) [11:27:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1114', diff saved to https://phabricator.wikimedia.org/P34838 and previous config saved to /var/cache/conftool/dbconfig/20220916-112750-root.json [11:27:55] <_joe_> apergos: I feel you [11:32:43] (03PS1) 10Jbond: P:swift::proxy: initiate the rsycn::server explicitly [puppet] - 10https://gerrit.wikimedia.org/r/832628 (https://phabricator.wikimedia.org/T311066) [11:33:14] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: rsync::server::module installs an rsync server even when $ensure is absent - https://phabricator.wikimedia.org/T311066 (10jbond) ultimately i think moving to [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/703452 | concat ]] is t... 
[11:33:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34839 and previous config saved to /var/cache/conftool/dbconfig/20220916-113325-root.json [11:34:22] (03PS6) 10Raymond Ndibe: toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) [11:34:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34840 and previous config saved to /var/cache/conftool/dbconfig/20220916-113431-root.json [11:35:06] (03PS2) 10Jbond: P:swift::proxy: initiate the rsycn::server explicitly [puppet] - 10https://gerrit.wikimedia.org/r/832628 (https://phabricator.wikimedia.org/T311066) [11:35:17] (03CR) 10CI reject: [V: 04-1] toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) (owner: 10Raymond Ndibe) [11:35:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134', diff saved to https://phabricator.wikimedia.org/P34841 and previous config saved to /var/cache/conftool/dbconfig/20220916-113543-root.json [11:35:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37291/console" [puppet] - 10https://gerrit.wikimedia.org/r/832628 (https://phabricator.wikimedia.org/T311066) (owner: 10Jbond) [11:37:37] (03PS3) 10Jbond: P:swift::proxy: initiate the rsycn::server explicitly [puppet] - 10https://gerrit.wikimedia.org/r/832628 (https://phabricator.wikimedia.org/T311066) [11:37:39] (03CR) 10Majavah: [C: 04-1] "-1 as I don't think this should be run on the web proxy logrotate hook since it doesn't need access to the web proxy logs. 
Instead it seem" [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) (owner: 10Raymond Ndibe) [11:38:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37292/console" [puppet] - 10https://gerrit.wikimedia.org/r/832628 (https://phabricator.wikimedia.org/T311066) (owner: 10Jbond) [11:42:38] (03PS7) 10Raymond Ndibe: toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) [11:43:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34842 and previous config saved to /var/cache/conftool/dbconfig/20220916-114316-root.json [11:43:31] (03CR) 10CI reject: [V: 04-1] toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) (owner: 10Raymond Ndibe) [11:45:50] (03PS8) 10Raymond Ndibe: toolforge: add on-wiki edits of toolforge tools to toolsview [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) [11:46:00] (03PS4) 10Jbond: P:swift::proxy: initiate the rsync::server explicitly [puppet] - 10https://gerrit.wikimedia.org/r/832628 (https://phabricator.wikimedia.org/T311066) [11:47:18] (03PS1) 10Jbond: P:thanos::swift::frontend: initiate the rsync::server explicitly [puppet] - 10https://gerrit.wikimedia.org/r/832630 (https://phabricator.wikimedia.org/T311066) [11:48:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37293/console" [puppet] - 10https://gerrit.wikimedia.org/r/832630 (https://phabricator.wikimedia.org/T311066) (owner: 10Jbond) [11:48:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 3%: After upgrade', diff saved 
to https://phabricator.wikimedia.org/P34843 and previous config saved to /var/cache/conftool/dbconfig/20220916-114831-root.json [11:49:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34844 and previous config saved to /var/cache/conftool/dbconfig/20220916-114935-root.json [11:51:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [11:52:53] (03PS4) 10Jbond: C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines [puppet] - 10https://gerrit.wikimedia.org/r/832268 (https://phabricator.wikimedia.org/T317799) [11:58:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34845 and previous config saved to /var/cache/conftool/dbconfig/20220916-115821-root.json [12:01:48] (03PS1) 10Jbond: P:cache::varnish::frontend: Add parameter to enable requestctl on hits [puppet] - 10https://gerrit.wikimedia.org/r/832631 (https://phabricator.wikimedia.org/T317794) [12:02:43] (03PS1) 10Muehlenhoff: Ship WMF-specific systemd unit parts as systemd override [puppet] - 10https://gerrit.wikimedia.org/r/832632 (https://phabricator.wikimedia.org/T317746) [12:02:56] (03CR) 10Jbond: [C: 04-1] "self -1 as this is dependent on the following tasks" [puppet] - 10https://gerrit.wikimedia.org/r/832631 (https://phabricator.wikimedia.org/T317794) (owner: 10Jbond) [12:03:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34846 and previous config saved to /var/cache/conftool/dbconfig/20220916-120336-root.json [12:04:31] (03CR) 10Jbond: [C: 04-1] P:cache::varnish::frontend: Add parameter to 
enable requestctl on hits (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832631 (https://phabricator.wikimedia.org/T317794) (owner: 10Jbond) [12:04:39] (03CR) 10Jbond: [V: 03+1 C: 04-1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37294/console" [puppet] - 10https://gerrit.wikimedia.org/r/832631 (https://phabricator.wikimedia.org/T317794) (owner: 10Jbond) [12:08:41] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: read firmware_store from config [cookbooks] - 10https://gerrit.wikimedia.org/r/831919 [12:08:42] (03CR) 10Jbond: sre.hardware.upgrade-firmware: read firmware_store from config (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/831919 (owner: 10Jbond) [12:08:53] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: create subfolderes for firmware type [cookbooks] - 10https://gerrit.wikimedia.org/r/831920 [12:08:59] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: read firmware_store from config [cookbooks] - 10https://gerrit.wikimedia.org/r/831919 [12:09:03] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: create subfolderes for firmware type [cookbooks] - 10https://gerrit.wikimedia.org/r/831920 [12:09:07] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: create subfolderes for firmware type [cookbooks] - 10https://gerrit.wikimedia.org/r/831920 (owner: 10Jbond) [12:09:12] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: read firmware_store from config [cookbooks] - 10https://gerrit.wikimedia.org/r/831919 (owner: 10Jbond) [12:10:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/832632 (https://phabricator.wikimedia.org/T317746) (owner: 10Muehlenhoff) [12:13:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34847 and previous config saved to /var/cache/conftool/dbconfig/20220916-121326-root.json [12:13:50] (03Merged) 10jenkins-bot: 
sre.hardware.upgrade-firmware: read firmware_store from config [cookbooks] - 10https://gerrit.wikimedia.org/r/831919 (owner: 10Jbond) [12:13:52] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: create subfolderes for firmware type [cookbooks] - 10https://gerrit.wikimedia.org/r/831920 (owner: 10Jbond) [12:18:13] (03CR) 10David Caro: [C: 03+1] "LGTM, you should probably keep an eye for the traffic cross-dc, as that might be a bottleneck." [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson) [12:18:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34848 and previous config saved to /var/cache/conftool/dbconfig/20220916-121841-root.json [12:24:20] (03CR) 10David Caro: [C: 04-1] "Actually, you should be using clouddumps instead, labstore1006/7 have been deprecated." [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson) [12:27:03] (03CR) 10JMeybohm: [C: 03+1] haproxy: use haproxy24 component [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/832235 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:27:37] (03CR) 10JMeybohm: [C: 03+1] sre.misc-clusters.roll-restart-reboot-docker-registry: Also restart docker-registry itself [cookbooks] - 10https://gerrit.wikimedia.org/r/832591 (owner: 10Muehlenhoff) [12:28:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34849 and previous config saved to /var/cache/conftool/dbconfig/20220916-122831-root.json [12:33:37] (03CR) 10CDanis: [C: 03+1] "+1 from me, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/832268 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) [12:33:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34850 and previous config saved to /var/cache/conftool/dbconfig/20220916-123346-root.json [12:36:33] (03CR) 10Majavah: "if the problem is that an internal server is rate limiting internal clients.. why don't we adjust those rate limits instead?" [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson) [12:37:00] (03CR) 10CDanis: C:varnish: Rate limit hotlinking dry-run (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) [12:43:02] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [12:43:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34851 and previous config saved to /var/cache/conftool/dbconfig/20220916-124336-root.json [12:48:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34852 and previous config saved to /var/cache/conftool/dbconfig/20220916-124850-root.json [12:58:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34853 and previous config saved to /var/cache/conftool/dbconfig/20220916-125841-root.json [13:03:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34854 and previous config saved to /var/cache/conftool/dbconfig/20220916-130357-root.json [13:14:29] (03CR) 10CDanis: victorps.py: add 
print_weekly_schedule command (033 comments) [software/klaxon] - 10https://gerrit.wikimedia.org/r/827562 (https://phabricator.wikimedia.org/T309115) (owner: 10Herron) [13:19:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P34855 and previous config saved to /var/cache/conftool/dbconfig/20220916-131902-root.json [13:30:16] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:33:45] (03CR) 10Hashar: gerrit: remove unused mysql-connector-java lib (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832344 (owner: 10Hashar) [13:47:06] (03CR) 10Elukey: [C: 03+2] admin_ng: use TLS origination for ml-serve sidecar configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/832609 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [13:48:58] (03PS2) 10JMeybohm: admin_ng: Allow to pin calico chart versions per environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/826268 (https://phabricator.wikimedia.org/T307943) [13:49:00] (03PS2) 10JMeybohm: calico-crd: Split crds.yaml into multiple files [deployment-charts] - 10https://gerrit.wikimedia.org/r/826269 [13:49:02] (03PS4) 10JMeybohm: Update calico-crds to v3.23.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826270 (https://phabricator.wikimedia.org/T307943) [13:49:04] (03PS4) 10JMeybohm: Update calico to v3.23.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) [13:49:53] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:50:04] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:50:35] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[13:51:31] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:51:54] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:52:10] sorry some k8s-related-sync spam is coming :) [13:52:29] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:52:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:54:06] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:54:24] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:56:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:57:28] uff [14:01:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:06:09] (03CR) 10Vgutierrez: [C: 03+1] C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832268 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) [14:17:00] !log add 100G to prometheus/eqiad instance k8s-mlserve [14:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:33] (03CR) 10JMeybohm: 
admin_ng: Allow to pin calico chart versions per environment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/826268 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:22:24] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:22:36] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:23:07] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:25:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:30:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:36:04] (03PS1) 10Elukey: admin_ng: fix Istio ServiceEntry for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/832644 [14:39:14] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:41:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:42:50] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[14:43:00] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (29) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephosd1031, cloudcephosd1033, cloudcephosd1034, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, phab1004, releases1002, releases2002, thanos-fe1002, th [14:43:00] 003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [14:45:39] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:45:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:45:59] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:46:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:47:22] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [14:48:01] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[14:48:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:49:20] yeah this is definitely me deploying --^ [14:50:01] (03CR) 10Elukey: [C: 03+2] admin_ng: fix Istio ServiceEntry for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/832644 (owner: 10Elukey) [14:52:40] :| [14:53:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:55:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:56:33] (03PS1) 10Vivian Rook: update link [puppet] - 10https://gerrit.wikimedia.org/r/832646 (https://phabricator.wikimedia.org/T317987) [14:56:50] godog: I think that istio/knative generate a ton of spam when deploying (at least, their controllers). Cole already warned me about a ton of messages, I am trying to see if I can reduce them, sorry :( [14:57:39] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:57:51] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:58:11] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:58:23] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:58:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:19] elukey: no worries, it is the codfw logstash consumers that lag behind IIRC, eqiad is fine and they can keep up [15:00:20] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10Gehel) 05Open→03Resolved [15:01:06] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1002.eqiad.wmnet [15:01:44] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [15:01:50] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts sretest1002.eqiad.wmnet [15:02:25] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:02:30] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [15:02:51] last burst of deployments I promise [15:03:01] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [15:04:26] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [15:05:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [15:05:44] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [15:06:14] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (10Papaul) @jcrespo @Marostegui do we need IPV6 on those hosts? 
[15:06:30] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [15:06:34] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (10Papaul) [15:06:40] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:08:43] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:10:54] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (29) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephosd1031, cloudcephosd1033, cloudcephosd1034, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, phab1004, releases1002, releases2002, thanos-fe1002, th [15:10:54] 003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [15:11:40] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:13:28] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:15:19] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10Milimetric) yep, all set @BCornwall, thank you!! [15:21:24] 10SRE, 10ops-codfw, 10DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10Papaul) [15:21:36] 10SRE: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (10Papaul) [15:21:57] 10SRE, 10ops-codfw, 10DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10Papaul) 05Open→03Resolved Complete [15:28:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T314041)', diff saved to https://phabricator.wikimedia.org/P34856 and previous config saved to /var/cache/conftool/dbconfig/20220916-152827-ladsgroup.json [15:28:32] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [15:35:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (10Papaul) [15:39:39] !log dancy@deploy1002 Started scap: testing [15:40:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (10Marostegui) They should be installed as normal DBs [15:43:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P34857 and previous config saved to /var/cache/conftool/dbconfig/20220916-154333-ladsgroup.json [15:44:32] !log dancy@deploy1002 Finished scap: testing (duration: 04m 53s) [15:51:11] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[15:51:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:51:46] !log dancy@deploy1002 Installing scap version "4.20.0" for 561 hosts [15:52:05] !log dancy@deploy1002 Installation of scap version "4.20.0" completed for 561 hosts [15:52:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [15:53:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:55:57] (03PS1) 10Andrew Bogott: toolviews: remove tool name from prometheus metric name [puppet] - 10https://gerrit.wikimedia.org/r/832661 (https://phabricator.wikimedia.org/T317714) [15:57:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:58:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P34858 and previous config saved to /var/cache/conftool/dbconfig/20220916-155840-ladsgroup.json [15:58:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:02:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:05:58] deployments done :) [16:06:26] (03PS2) 10Andrew Bogott: toolviews: remove tool name from prometheus metric name [puppet] - 10https://gerrit.wikimedia.org/r/832661 (https://phabricator.wikimedia.org/T317714) [16:07:56] (03CR) 10Andrew Bogott: [C: 03+2] toolviews: remove tool name from prometheus metric name [puppet] - 10https://gerrit.wikimedia.org/r/832661 (https://phabricator.wikimedia.org/T317714) (owner: 10Andrew Bogott) [16:09:01] (03PS4) 10Dduvall: buildkitd: Support configuration of OCI executor nameservers [puppet] - 10https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) [16:11:09] (03CR) 10Dduvall: buildkitd: Support configuration of OCI executor nameservers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: 10Dduvall) [16:12:16] (03CR) 10CI reject: [V: 04-1] buildkitd: Support configuration of OCI executor nameservers [puppet] - 10https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: 10Dduvall) [16:13:33] (03CR) 10Dduvall: buildkitd: Support configuration of OCI executor nameservers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: 10Dduvall) [16:13:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T314041)', diff saved to https://phabricator.wikimedia.org/P34859 and previous config saved to /var/cache/conftool/dbconfig/20220916-161346-ladsgroup.json [16:13:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 
0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [16:13:52] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [16:14:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [16:14:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T314041)', diff saved to https://phabricator.wikimedia.org/P34860 and previous config saved to /var/cache/conftool/dbconfig/20220916-161409-ladsgroup.json [16:20:43] (03PS5) 10Dduvall: buildkitd: Support configuration of OCI executor nameservers [puppet] - 10https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) [16:25:00] (03CR) 10Dduvall: buildkitd: Support configuration of OCI executor nameservers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: 10Dduvall) [16:27:21] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (10CMyrick-WMF) [16:34:24] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:41:12] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 14.76 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007 [16:41:37] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2183 [16:42:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2183 [16:42:17] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2184 [16:42:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
db2184 [16:43:00] (03CR) 10Dzahn: "ah, it was picky about the lookup key not matching the profile name as well? nice work around" [puppet] - 10https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: 10Dduvall) [16:43:31] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:45:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:45:56] RECOVERY - Persistent high iowait on clouddumps1001 is OK: (C)10 ge (W)5 ge 2.386 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007 [16:46:32] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (10YLiou_WMF) Not sure whether this is the place to do it, but on behalf of GDI, I approve @CMyrick-WMF 's request [16:46:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2183.mgmt.codfw.wmnet with reboot policy FORCED [16:48:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (10CMyrick-WMF) Additional note: I noticed in my Superset user info that my email address is listed as **cmyrick@email.notfound**. I'm not sure if tha... 
[16:53:16] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2183.mgmt.codfw.wmnet with reboot policy FORCED [16:53:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2183.mgmt.codfw.wmnet with reboot policy FORCED [16:54:52] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37296/gitlab-runner1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: 10Dduvall) [16:55:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2183.mgmt.codfw.wmnet with reboot policy FORCED [16:55:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2183.mgmt.codfw.wmnet with reboot policy FORCED [16:56:40] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2183.mgmt.codfw.wmnet with reboot policy FORCED [16:59:51] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (10BCornwall) p:05Triage→03Medium a:03BCornwall [17:00:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (10Papaul) @Volans when running "sudo cookbook sre.hosts.provision" on db2183, it prompted me to set up the RAID; after setting up the RAID I entered modified and it failed. 
[17:01:45] !log gitlab-runner*: deployed gerrit:832584 and systemctl restart buildkitd on 6 hosts for T317904 [17:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:49] T317904: Explicitly config buildkitd with internal DNS nameserver - https://phabricator.wikimedia.org/T317904 [17:04:58] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (10BCornwall) [17:06:13] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (10BCornwall) @odimitrijevic or @Ottomata Do you approve of this and verify that this is the access they need? [17:06:45] In the namespace setting of The Bangla Wikipedia, can anyone give any reference to where the description of the namespace of a language that is not Bengali comes from in the InitialiseSettings.php? It's in lines ‌3309, 3310, 3312 and 3313. [17:09:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (10Marostegui) Were the hosts added to the correct partman recipe? [17:10:10] (03CR) 10JMeybohm: thumbor: new service chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [17:12:13] (03PS1) 10Dduvall: buildkitd: Add missing `--config` argument [puppet] - 10https://gerrit.wikimedia.org/r/832694 (https://phabricator.wikimedia.org/T317904) [17:14:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (10BCornwall) @CMyrick-WMF It appears that your SSH key is missing the header. It looks like it is ssh-ed25519 but I'd like to be sure. 
:) [17:17:37] (03CR) 10Dzahn: [C: 03+2] buildkitd: Add missing `--config` argument [puppet] - 10https://gerrit.wikimedia.org/r/832694 (https://phabricator.wikimedia.org/T317904) (owner: 10Dduvall) [17:18:34] MdsShakil: I replied on the other channel (-tech) [17:22:04] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:22:41] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (10Papaul) @Marostegui not at that step yet so i didn't check i am just doing the IDRAC and BIOS configuration that is what the cookbook sre.hosts.provision does [17:30:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (10CMyrick-WMF) @BCornwall The file name was "id_ed25519.pub" so I believe that's correct [17:41:00] (03PS2) 10Dzahn: Planet: Add hexmode.com blog [puppet] - 10https://gerrit.wikimedia.org/r/832585 (owner: 10MarkAHershberger) [17:41:31] (03CR) 10Dzahn: [C: 03+2] "greetings hexmode, welcome to planet :)" [puppet] - 10https://gerrit.wikimedia.org/r/832585 (owner: 10MarkAHershberger) [17:59:22] (03PS10) 10Ebernhardson: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) [17:59:24] (03CR) 10Ebernhardson: Mount labstore to wcqs/wdqs instance for dumps reload (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson) [18:02:18] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:03:36] (03CR) 10Ebernhardson: [V: 03+1] "PCC SUCCESS (DIFF 6): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37297/console" [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson) [18:05:52] (03CR) 10Dzahn: "Makes sense. I just did not want to merge it on Friday before travel and being out for a while. If anyone else feels up to do it together " [puppet] - 10https://gerrit.wikimedia.org/r/832344 (owner: 10Hashar) [18:05:54] (03PS1) 10Bking: update-known-hosts-production: Capture all fingerprints [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/832698 (https://phabricator.wikimedia.org/T318006) [18:06:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET namespaces) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:11:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET namespaces) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:18:22] (03CR) 10Raymond Ndibe: toolforge: add on-wiki edits of toolforge tools to toolsview (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/832620 (https://phabricator.wikimedia.org/T317953) (owner: 10Raymond Ndibe) [18:18:46] (03PS2) 10Bking: update-known-hosts-production: Capture all fingerprints [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/832698 (https://phabricator.wikimedia.org/T318006) [18:22:40] (03CR) 10Thcipriani: [C: 03+1] add error.stack.previous_trace field [software/ecs] - 10https://gerrit.wikimedia.org/r/831943 (https://phabricator.wikimedia.org/T314098) (owner: 10Cwhite) [19:00:28] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1081 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.org has 86371 seconds left 
https://wikitech.wikimedia.org/wiki/HTTPS [19:00:34] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp1081 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.org has 86365 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [19:03:14] ^ hmm, that's a day [19:03:30] Saturday expiration probably isn't a recipe for a good time [19:04:42] (03PS3) 10Bking: update-known-hosts-production: Capture all fingerprints [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/832698 (https://phabricator.wikimedia.org/T318006) [19:15:00] rzl: it's supposed to update every 12 hours. how about we try manually running /usr/local/sbin/update-ocsp on cp1081 [19:15:11] I was about to do that [19:15:26] oh okay! sure, go for it [19:16:21] (03CR) 10Ryan Kemper: update-known-hosts-production: Capture all fingerprints (032 comments) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/832698 (https://phabricator.wikimedia.org/T318006) (owner: 10Bking) [19:16:34] !log cp1081 /usr/local/sbin/update-ocsp-all [19:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:45] actually that -all script is different [19:17:02] the one without -all wanted me to set parameters [19:17:27] this does not and it's also being run by trafficserver.systemd_override.erb:ExecStartPre=/usr/local/sbin/update-ocsp-all [19:18:16] interestingly on this host also puppet status is "unknown" [19:18:27] and wasnt 1081 that pink unicorn host.. [19:18:54] yea, puppet is disabled there [19:19:01] Reason: '' [19:19:47] hmm [19:21:06] rescheduling checks [19:21:23] have not enabled puppet so far [19:21:42] it still seems pooled so we might want to either find out what's going on with it or depool it [19:21:58] the number 1081 sounds familiar [19:22:55] looking at SAL [19:23:42] hmm.nothing. 
so maybe puppet is not supposed to be disabled [19:24:37] nobody has logged in since..July [19:24:44] lastlog | grep -v Never [19:25:12] think we should just enable it [19:25:34] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:25:37] depool, enable puppet, leave a ticket? [19:25:47] I'll at least ask in -traffic first, hate to disrupt somebody's work if it was on purpose [19:25:57] sounds good [19:27:15] (03PS4) 10Ryan Kemper: update-known-hosts-production: Capture all fingerprints [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/832698 (https://phabricator.wikimedia.org/T318006) (owner: 10Bking) [19:27:44] (03CR) 10Ryan Kemper: update-known-hosts-production: Capture all fingerprints (032 comments) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/832698 (https://phabricator.wikimedia.org/T318006) (owner: 10Bking) [19:51:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:04:13] (03PS5) 10Ryan Kemper: update-known-hosts-production: Capture all fingerprints [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/832698 (https://phabricator.wikimedia.org/T318006) (owner: 10Bking) [20:06:18] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (10Volans) @Papaul I will have to investigate as this is a new unexpected error as it failed to reboot the host. I will need to check what changed in the Redfish support in t... 
[20:08:52] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1081 is OK: SSL OK - OCSP staple validity for wikipedia.org has 294667 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2022-12-07 05:25:18 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/HTTPS [20:08:54] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp1081 is OK: SSL OK - OCSP staple validity for wikipedia.org has 294666 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2022-12-07 05:25:32 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/HTTPS [20:09:09] 👍 [20:09:47] :) [20:10:18] (03CR) 10Ryan Kemper: [C: 03+1] "Works well, and shellcheck is totally happy. I ran this from macos so we should have someone on a linux setup run this to verify out of an" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/832698 (https://phabricator.wikimedia.org/T318006) (owner: 10Bking) [20:15:10] PROBLEM - SSH on db1098.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:17:10] mutante: I read 1098 as 1089 :P [20:19:15] heh, makes sense [20:28:54] (03CR) 10Ryan Kemper: "Realized I was reviewing the implementation but not the actual goal, so retracting my +1 for the timebeing because it's not clear that cap" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/832698 (https://phabricator.wikimedia.org/T318006) (owner: 10Bking) [20:38:30] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:43:32] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:43:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 
(T314041)', diff saved to https://phabricator.wikimedia.org/P34861 and previous config saved to /var/cache/conftool/dbconfig/20220916-204345-ladsgroup.json [20:43:51] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [20:58:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P34862 and previous config saved to /var/cache/conftool/dbconfig/20220916-205852-ladsgroup.json [20:58:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:03:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:13:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P34863 and previous config saved to /var/cache/conftool/dbconfig/20220916-211358-ladsgroup.json [21:15:43] RECOVERY - SSH on db1098.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:29:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T314041)', diff saved to https://phabricator.wikimedia.org/P34864 and previous config saved to /var/cache/conftool/dbconfig/20220916-212905-ladsgroup.json [21:29:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [21:29:09] T314041: Drop old templatelinks columns and indexes - 
https://phabricator.wikimedia.org/T314041 [21:29:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [21:56:01] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:56:30] win 79 [22:12:55] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:14:47] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:22:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs2001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [22:24:09] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:37:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs2001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [22:37:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs2001:9195 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [22:38:41] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:40:13] PROBLEM - Check systemd state on mw1314 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:57:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs2001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [23:07:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs2001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [23:07:03] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10Dzahn) 05Open→03Resolved [23:31:59] RECOVERY - Check systemd state on mw1314 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs2001:9195 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [23:47:37] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:51:46] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert