[00:11:28] 10SRE-swift-storage, 10Community-Tech, 10MediaWiki-extensions-Phonos, 10MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (10Eevans) Ok, so let's try to move this in a (more) constructive direction: If I'm being honest, I... [00:43:36] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Create mw-web helmfile deployment - https://phabricator.wikimedia.org/T321900 (10Krinkle) [00:43:46] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Create mw-jobrunner helmfile deployment - https://phabricator.wikimedia.org/T321897 (10Krinkle) [00:43:56] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Create mw-videoscaler helmfile deployment - https://phabricator.wikimedia.org/T321899 (10Krinkle) [00:44:05] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Create mw-api-int helmfile deployment - https://phabricator.wikimedia.org/T321895 (10Krinkle) [00:44:27] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Create mw-api-ext helmfile deployment - https://phabricator.wikimedia.org/T321896 (10Krinkle) [01:19:15] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:28:32] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [01:29:18] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: (no justification provided) (duration: 00m 46s) [01:38:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:48:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:23] PROBLEM - SSH on mw1309.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:53:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:35] 10SRE, 10MediaWiki-Authentication-and-authorization, 10Platform Engineering, 10serviceops: Increased session loss since 20221001 - https://phabricator.wikimedia.org/T319279 (10jijiki) p:05Triage→03High [02:48:38] PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:52:14] RECOVERY - SSH on mw1309.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:52:24] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:02:08] PROBLEM - High average POST latency for mw requests on 
api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [03:04:08] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [03:22:08] RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:28:02] PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:45:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:46:27] (03Abandoned) 10Andrew Bogott: codfw1dev horizon back to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/853000 (https://phabricator.wikimedia.org/T322359) (owner: 10Andrew Bogott) [03:50:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:51:46] RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:53:22] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:57:40] PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:21:18] RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:27:16] PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:29:00] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:51:33] RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:56:41] PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:21:48] RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:27:02] PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:36:08] (PS1) Marostegui: report_users: Replace dbproxy1019's IP [software] - https://gerrit.wikimedia.org/r/853083
[05:36:53] (CR) Marostegui: [C: +2] report_users: Replace dbproxy1019's IP [software] - https://gerrit.wikimedia.org/r/853083 (owner: Marostegui)
[05:37:50] (Merged) jenkins-bot: report_users: Replace dbproxy1019's IP [software] - https://gerrit.wikimedia.org/r/853083 (owner: Marostegui)
[05:39:15] SRE, ops-eqiad, DBA, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (Marostegui) I have cleaned up the old IP and changed the report users script.
[05:41:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 1%: After schema change', diff saved to https://phabricator.wikimedia.org/P38104 and previous config saved to /var/cache/conftool/dbconfig/20221104-054147-root.json
[05:56:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 3%: After schema change', diff saved to https://phabricator.wikimedia.org/P38105 and previous config saved to /var/cache/conftool/dbconfig/20221104-055652-root.json
[06:11:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P38106 and previous config saved to /var/cache/conftool/dbconfig/20221104-061157-root.json
[06:13:38] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:15:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.244 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:22:26] RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:26:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T322389
[06:26:38] T322389: Switchover es4 codfw master (es2020 -> es2021) - https://phabricator.wikimedia.org/T322389
[06:27:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P38107 and previous config saved to /var/cache/conftool/dbconfig/20221104-062702-root.json
[06:27:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T322389
[06:27:40] (PS1) Marostegui: mariadb: Promote es2021 to es4 codfw master [puppet] - https://gerrit.wikimedia.org/r/853085 (https://phabricator.wikimedia.org/T322389)
[06:27:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es2021 with weight 0 T322389', diff saved to https://phabricator.wikimedia.org/P38108 and previous config saved to /var/cache/conftool/dbconfig/20221104-062740-root.json
[06:28:26] PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:28:38] (CR) Marostegui: [C: +2] mariadb: Promote es2021 to es4 codfw master [puppet] - https://gerrit.wikimedia.org/r/853085 (https://phabricator.wikimedia.org/T322389) (owner: Marostegui)
[06:30:32] !log Starting es4 codfw failover from es2020 to es2021 - T322389
[06:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2021 to es4 primary and set section read-write T322389', diff saved to https://phabricator.wikimedia.org/P38109 and previous config saved to /var/cache/conftool/dbconfig/20221104-063128-root.json
[06:32:25] !log
marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2020 T322389', diff saved to https://phabricator.wikimedia.org/P38110 and previous config saved to /var/cache/conftool/dbconfig/20221104-063224-root.json
[06:32:27] T322389: Switchover es4 codfw master (es2020 -> es2021) - https://phabricator.wikimedia.org/T322389
[06:32:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give weight to es2021', diff saved to https://phabricator.wikimedia.org/P38111 and previous config saved to /var/cache/conftool/dbconfig/20221104-063250-root.json
[06:34:47] (PS1) Marostegui: es2020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/853086
[06:42:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P38112 and previous config saved to /var/cache/conftool/dbconfig/20221104-064207-root.json
[06:42:44] (CR) Marostegui: [C: +2] es2020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/853086 (owner: Marostegui)
[06:57:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P38113 and previous config saved to /var/cache/conftool/dbconfig/20221104-065712-root.json
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221104T0700)
[07:12:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P38114 and previous config saved to /var/cache/conftool/dbconfig/20221104-071217-root.json
[07:14:27] SRE, SRE-Access-Requests, Patch-For-Review: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (SLyngshede-WMF)
[07:14:37] (PS4) Slyngshede: data.yaml: Move user mfossati from restricted to deployment [puppet] - https://gerrit.wikimedia.org/r/850409 (https://phabricator.wikimedia.org/T321772)
[07:17:15] (CR) Slyngshede: [C: +2] data.yaml: Move user mfossati from restricted to deployment [puppet] - https://gerrit.wikimedia.org/r/850409 (https://phabricator.wikimedia.org/T321772) (owner: Slyngshede)
[07:18:18] SRE, SRE-Access-Requests, Patch-For-Review: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (SLyngshede-WMF) Open→Resolved
[07:18:48] SRE, SRE-Access-Requests, Patch-For-Review: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (SLyngshede-WMF) The user has been moved to deployment, and removed from restricted.
[07:23:22] SRE, LDAP-Access-Requests: Grant Access to wmf for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322347 (elukey)
[07:23:31] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (elukey)
[07:23:51] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (elukey) Ilias will also need access to the `wmf` LDAP group as well :)
[07:27:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2121 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P38115 and previous config saved to /var/cache/conftool/dbconfig/20221104-072722-root.json
[07:37:43] SRE, Infrastructure-Foundations: IDM: Central logging on all changes - https://phabricator.wikimedia.org/T320431 (SLyngshede-WMF) Resolved→In progress
[07:37:45] SRE, Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (SLyngshede-WMF)
[07:43:40] (PS1) Elukey: Add prod access for Ilias Sarantopoulos [puppet] - https://gerrit.wikimedia.org/r/853090 (https://phabricator.wikimedia.org/T322350)
[08:03:56] (PS1) Elukey: admin_ng: extend retry to include HTTP 503s for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/853184 (https://phabricator.wikimedia.org/T322196)
[08:09:25] (CR) JMeybohm: [C: -1] Enable profile::auto_restarts::service for jwt-authorizer on docker registry (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852831 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:10:58] (CR) JMeybohm: [C: +2] k8s::package: only install the apt source once [puppet] - https://gerrit.wikimedia.org/r/852922
(https://phabricator.wikimedia.org/T270271) (owner: 10Jbond) [08:16:20] (03CR) 10JMeybohm: [C: 03+1] Import istioctl 1.15.3 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/852921 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [08:17:40] (03CR) 10JMeybohm: [C: 03+1] istio: upgrade to 1.15.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/852842 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [08:21:01] (03PS3) 10JMeybohm: Rename ml_k8s staging roles to match naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/852158 [08:29:15] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:52:41] (03CR) 10Hashar: [C: 03+1] "I don't know what `profile::contacts::role_contacts` is for, I am assuming it is mostly informational? 
At least releng is still a contac" [puppet] - 10https://gerrit.wikimedia.org/r/852832 (owner: 10Muehlenhoff) [09:10:23] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Move kube-scheduler config to file [puppet] - 10https://gerrit.wikimedia.org/r/852908 (https://phabricator.wikimedia.org/T300499) (owner: 10JMeybohm) [09:10:51] (03CR) 10Hashar: [C: 03+1] ci: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850476 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:16:22] (03PS1) 10JMeybohm: Actually remove --kubeconfig flag from kube-scheduler [puppet] - 10https://gerrit.wikimedia.org/r/853226 (https://phabricator.wikimedia.org/T300499) [09:17:06] (03CR) 10JMeybohm: [C: 03+2] Actually remove --kubeconfig flag from kube-scheduler [puppet] - 10https://gerrit.wikimedia.org/r/853226 (https://phabricator.wikimedia.org/T300499) (owner: 10JMeybohm) [09:30:01] (03CR) 10Vgutierrez: prometheus: Rename ats_ metrics to trafficserver_ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [09:30:34] (03CR) 10Vgutierrez: [C: 03+1] Set profile::contacts::role_contacts for role::dns::auth [puppet] - 10https://gerrit.wikimedia.org/r/852918 (owner: 10Muehlenhoff) [09:32:24] (03PS6) 10Jelto: gitlab_runner: enable restrict_firewall for Shared Runners [puppet] - 10https://gerrit.wikimedia.org/r/849499 (https://phabricator.wikimedia.org/T317341) [09:34:25] (03PS1) 10Klausman: ml_k8s: move config for ML staging master/worker to be more consistent [labs/private] - 10https://gerrit.wikimedia.org/r/853235 [09:34:28] (03CR) 10Vgutierrez: [C: 04-1] prometheus: Add ats header/body size total metrics (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) (owner: 10BCornwall) [09:35:51] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37951/console" [puppet] - 
10https://gerrit.wikimedia.org/r/849499 (https://phabricator.wikimedia.org/T317341) (owner: 10Jelto) [09:36:05] (03CR) 10Klausman: [C: 03+2] Rename ml_k8s staging roles to match naming scheme [labs/private] - 10https://gerrit.wikimedia.org/r/852196 (owner: 10JMeybohm) [09:36:10] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] onSpecialSearchCreateLink: Handle another null from Title::newFromText (031 comment) [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851017 (https://phabricator.wikimedia.org/T320736) (owner: 10Jforrester) [09:36:13] (03CR) 10Klausman: [V: 03+2 C: 03+2] Rename ml_k8s staging roles to match naming scheme [labs/private] - 10https://gerrit.wikimedia.org/r/852196 (owner: 10JMeybohm) [09:37:20] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] onSpecialSearchCreateLink: Handle null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851016 (https://phabricator.wikimedia.org/T320736) (owner: 10Jforrester) [09:38:28] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37952/console" [puppet] - 10https://gerrit.wikimedia.org/r/852158 (owner: 10JMeybohm) [09:39:07] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37953/console" [puppet] - 10https://gerrit.wikimedia.org/r/849180 (owner: 10BBlack) [09:40:18] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: enable restrict_firewall for Shared Runners [puppet] - 10https://gerrit.wikimedia.org/r/849499 (https://phabricator.wikimedia.org/T317341) (owner: 10Jelto) [09:43:34] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: Fix usage [cookbooks] - 10https://gerrit.wikimedia.org/r/852773 (owner: 10Muehlenhoff) [09:43:39] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: Fix usage [cookbooks] - 10https://gerrit.wikimedia.org/r/852773 (owner: 10Muehlenhoff) [09:44:08] (03CR) 
10Jbond: [V: 03+2] sre.hardware.upgrade-firmware: Fix usage [cookbooks] - 10https://gerrit.wikimedia.org/r/852773 (owner: 10Muehlenhoff) [09:47:28] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37955/console" [puppet] - 10https://gerrit.wikimedia.org/r/852158 (owner: 10JMeybohm) [09:48:37] (03PS14) 10Jbond: sre.hardware.upgrade-firmware: use packaging.version.Version [cookbooks] - 10https://gerrit.wikimedia.org/r/835579 [09:48:39] (03PS34) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [09:48:42] (03PS27) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [09:48:44] (03PS11) 10Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 [09:49:16] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] "looking good, see inline comment" [puppet] - 10https://gerrit.wikimedia.org/r/849180 (owner: 10BBlack) [09:50:40] (03PS35) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [09:50:46] (03PS28) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [09:50:50] (03PS12) 10Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 [09:51:04] (03PS4) 10Klausman: Rename ml_k8s staging roles to match naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/852158 (owner: 10JMeybohm) [09:51:18] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37956/console" [puppet] - 10https://gerrit.wikimedia.org/r/852158 (owner: 10JMeybohm) [09:51:53] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37957/console" [puppet] - 10https://gerrit.wikimedia.org/r/852158 (owner: 10JMeybohm) [09:55:56] 10SRE, 10SRE-Access-Requests: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (10Stevemunene) @BTullis [09:57:06] (03PS6) 10Filippo Giunchedi: prometheus: probe SSH on mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) [09:59:08] (03PS15) 10Jbond: sre.hardware.upgrade-firmware: use packaging.version.Version [cookbooks] - 10https://gerrit.wikimedia.org/r/835579 [09:59:10] (03PS36) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [09:59:12] (03PS29) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [09:59:14] (03PS13) 10Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 [10:03:15] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37958/console" [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [10:08:06] 10SRE, 10Traffic: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10Vgutierrez) I believe this is not affecting cp instances. In your log, systemd is complaining about several notifications: ` systemd[1]: haproxy.service: Got notification message from... [10:09:37] (03CR) 10Klausman: [V: 03+1 C: 03+2] Rename ml_k8s staging roles to match naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/852158 (owner: 10JMeybohm) [10:09:57] (03CR) 10Filippo Giunchedi: [V: 03+1] "This is ready for review, let me know!" 
[puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [10:10:28] (03PS30) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [10:10:30] (03PS14) 10Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 [10:11:33] 10SRE: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10Vgutierrez) I'm removing traffic from this task cause we don't own the HAProxy puppetization, happy to help here as one of the main users within SRE. Our custom bits for HAProxy are shipped on `pr... [10:13:42] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10fnegri) Thanks @Marostegui! @Jclark-ctr I think this task can be resolved. [10:14:27] (03PS31) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [10:14:29] (03PS15) 10Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 [10:15:06] PROBLEM - SSH on an-coord1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:17:03] (03CR) 10Hnowlan: [C: 03+2] requirements: add missing pycurl package [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/850453 (owner: 10Hnowlan) [10:17:16] (03CR) 10Vgutierrez: [C: 03+1] Clean up monitor metrics on stop() [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/844469 (https://phabricator.wikimedia.org/T321191) (owner: 10Filippo Giunchedi) [10:21:10] (03CR) 10Klausman: [C: 03+1] admin_ng: extend retry to include HTTP 503s for ml-serve clusters [deployment-charts] - 
10https://gerrit.wikimedia.org/r/853184 (https://phabricator.wikimedia.org/T322196) (owner: 10Elukey) [10:21:47] (03CR) 10Klausman: [C: 03+1] Add prod access for Ilias Sarantopoulos [puppet] - 10https://gerrit.wikimedia.org/r/853090 (https://phabricator.wikimedia.org/T322350) (owner: 10Elukey) [10:22:09] (03CR) 10Klausman: [C: 03+1] Import istioctl 1.15.3 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/852921 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [10:30:24] (03Merged) 10jenkins-bot: requirements: add missing pycurl package [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/850453 (owner: 10Hnowlan) [10:31:32] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10jbond) @Ottomata / @odimitrijevic are you able to approve for access to anal... [10:34:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/852955 (https://phabricator.wikimedia.org/T320721) (owner: 10Volans) [10:35:16] (03CR) 10Jbond: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/835579 (owner: 10Jbond) [10:35:23] (03CR) 10Jbond: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond) [10:35:29] (03CR) 10Jbond: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [10:37:40] (03PS1) 10Jbond: P:spicerack: remove whitespace in package name [puppet] - 10https://gerrit.wikimedia.org/r/853250 [10:38:50] 10SRE, 10Dumps-Generation, 10serviceops, 10MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) [10:39:12] (03PS1) 10Marostegui: Revert "es2020: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/853038 [10:40:05] 10SRE, 10SRE-Access-Requests: Requesting access to 
analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10jbond) [10:42:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance [10:42:15] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10jbond) [10:42:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance [10:42:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T318955)', diff saved to https://phabricator.wikimedia.org/P38116 and previous config saved to /var/cache/conftool/dbconfig/20221104-104227-ladsgroup.json [10:42:33] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:43:25] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10jbond) >>! In T322147#8368318, @Dzahn wrote: > confirmed L3 signature > > @jbond fwiw, can't find on Namely though, unlike other users on current req... [10:43:27] 10SRE, 10Dumps-Generation, 10serviceops, 10MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10ArielGlenn) After the patch https://gerrit.wikimedia.org/r/c/mediawiki/core/+/852990 was backported and deplo... 
[10:45:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T318955)', diff saved to https://phabricator.wikimedia.org/P38117 and previous config saved to /var/cache/conftool/dbconfig/20221104-104508-ladsgroup.json [10:47:30] (03PS1) 10Jbond: admin: add Hghani and Ilooremeta to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/853255 (https://phabricator.wikimedia.org/T322147) [10:47:50] (03CR) 10Jbond: [C: 03+2] P:spicerack: remove whitespace in package name [puppet] - 10https://gerrit.wikimedia.org/r/853250 (owner: 10Jbond) [10:48:01] (03CR) 10Jbond: [C: 03+2] admin: add Hghani and Ilooremeta to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/853255 (https://phabricator.wikimedia.org/T322147) (owner: 10Jbond) [10:48:46] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [10:49:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1135.eqiad.wmnet with reason: Maintenance [10:49:09] (03CR) 10Marostegui: [C: 03+2] Revert "es2020: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/853038 (owner: 10Marostegui) [10:49:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1135.eqiad.wmnet with reason: Maintenance [10:49:25] jbond: ok to merge? 
[10:49:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T318955)', diff saved to https://phabricator.wikimedia.org/P38119 and previous config saved to /var/cache/conftool/dbconfig/20221104-104927-ladsgroup.json [10:49:50] yes please [10:50:01] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [10:50:08] jbond: Done, it was 3a99eb0af7 for what is worth [10:50:13] ack thanks [10:50:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 1%: After reboot', diff saved to https://phabricator.wikimedia.org/P38120 and previous config saved to /var/cache/conftool/dbconfig/20221104-105031-root.json [10:50:46] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:51:59] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10jbond) 05In progress→03Resolved a:03jbond This has been completed let me know it there are any issues [10:52:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T318955)', diff saved to https://phabricator.wikimedia.org/P38121 and previous config saved to /var/cache/conftool/dbconfig/20221104-105205-ladsgroup.json [10:52:07] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10jbond) 05In progress→03Resolved a:03jbond This has been completed let me know it there are any issues [10:54:40] 10SRE, 10Dumps-Generation, 10serviceops, 10MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) [10:57:46] (03PS32) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates 
[cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [11:00:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P38122 and previous config saved to /var/cache/conftool/dbconfig/20221104-110014-ladsgroup.json [11:04:16] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [11:05:53] 10SRE, 10SRE-Access-Requests: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (10jbond) 05Open→03Stalled [11:05:59] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10jbond) 05In progress→03Stalled p:05Triage→03Medium [11:06:05] 10SRE, 10Dumps-Generation, 10serviceops, 10MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) DB maintenance is back to normal/no longer affected, as far as I understood from @Marostegui and @Lad... [11:06:08] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [11:06:57] (03PS1) 10Slyngshede: Initial checkin [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) [11:07:01] (03PS2) 10Hnowlan: Encode messages written to poolcounter stream [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/852958 (https://phabricator.wikimedia.org/T233196) [11:07:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P38123 and previous config saved to /var/cache/conftool/dbconfig/20221104-110712-ladsgroup.json [11:07:56] (03CR) 10Hnowlan: Encode messages written to poolcounter stream (034 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/852958 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:08:15] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10jbond) @calbon are yuo able to approve as the manager, in this case i think y... [11:08:22] 10SRE, 10Dumps-Generation, 10serviceops, 10MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) A small incident report summary should happen soon at: https://wikitech.wikimedia.org/wiki/Incident_s... [11:08:47] (03Abandoned) 10Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. 
[debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [11:12:32] 10SRE, 10Dumps-Generation, 10serviceops, 10MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) [11:13:00] 10SRE, 10Dumps-Generation, 10serviceops, 10MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) [11:13:15] 10SRE, 10Dumps-Generation, 10serviceops, 10MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jcrespo) p:05Triage→03High [11:15:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P38124 and previous config saved to /var/cache/conftool/dbconfig/20221104-111521-ladsgroup.json [11:15:30] (03PS1) 10JMeybohm: pontoon: Make poonton lb a central DNS server for the stack [puppet] - 10https://gerrit.wikimedia.org/r/853259 [11:16:07] (03CR) 10CI reject: [V: 04-1] pontoon: Make poonton lb a central DNS server for the stack [puppet] - 10https://gerrit.wikimedia.org/r/853259 (owner: 10JMeybohm) [11:17:33] (03PS2) 10JMeybohm: pontoon: Make poonton lb a central DNS server for the stack [puppet] - 10https://gerrit.wikimedia.org/r/853259 [11:18:46] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10elukey) >>! In T322350#8369133, @elukey wrote: > Ilias will also need access... 
[11:19:08] 10SRE: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10jbond) > cause we don't own the HAProxy puppetization, @Vgutierrez do you know who does? the CP servers are the biggest user of this class making up 75% of users. with cloud-control, dbproxy and... [11:20:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 5%: After reboot', diff saved to https://phabricator.wikimedia.org/P38125 and previous config saved to /var/cache/conftool/dbconfig/20221104-112041-root.json [11:21:34] 10SRE, 10Data-Persistence: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10jbond) [11:22:07] (03CR) 10Hnowlan: [C: 03+2] Show deprecation warnings [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830608 (owner: 10Hnowlan) [11:22:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P38127 and previous config saved to /var/cache/conftool/dbconfig/20221104-112218-ladsgroup.json [11:22:51] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10elukey) Next steps: * Get approvals from @calbon @Ottomata @odimitrijevic *... 
[11:27:13] !log restart kube-apiserver on ml-serve-ctrl2002 - high latencies for LIST (knative resources) [11:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:16] 10SRE, 10ops-codfw: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T322042 (10jbond) [11:30:20] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10jbond) [11:30:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T318955)', diff saved to https://phabricator.wikimedia.org/P38128 and previous config saved to /var/cache/conftool/dbconfig/20221104-113027-ladsgroup.json [11:30:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance [11:30:32] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:30:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance [11:30:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T318955)', diff saved to https://phabricator.wikimedia.org/P38129 and previous config saved to /var/cache/conftool/dbconfig/20221104-113048-ladsgroup.json [11:31:04] (03Merged) 10jenkins-bot: Show deprecation warnings [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830608 (owner: 10Hnowlan) [11:31:25] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10jbond) FYI we are still reciving alerts for this disk to root ` mdadm monitoring root@elastic2052.codfw.wmnet via wikimedia.org 7:25 AM (5 hours ago) to root This is an aut... 
[11:32:59] (03PS1) 10Urbanecm: growthexperiments.pp: Run updateMetrics.php daily [puppet] - 10https://gerrit.wikimedia.org/r/853265 (https://phabricator.wikimedia.org/T318684) [11:33:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T318955)', diff saved to https://phabricator.wikimedia.org/P38130 and previous config saved to /var/cache/conftool/dbconfig/20221104-113329-ladsgroup.json [11:33:36] (03CR) 10Filippo Giunchedi: "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/853259 (owner: 10JMeybohm) [11:34:00] (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:34:10] (03CR) 10CI reject: [V: 04-1] growthexperiments.pp: Run updateMetrics.php daily [puppet] - 10https://gerrit.wikimedia.org/r/853265 (https://phabricator.wikimedia.org/T318684) (owner: 10Urbanecm) [11:35:05] (03CR) 10Elukey: [C: 03+2] admin_ng: extend retry to include HTTP 503s for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/853184 (https://phabricator.wikimedia.org/T322196) (owner: 10Elukey) [11:35:34] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:35:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P38131 and previous config saved to /var/cache/conftool/dbconfig/20221104-113546-root.json [11:36:06] 10Puppet, 10Infrastructure-Foundations, 10Sustainability (Incident Followup): Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters - https://phabricator.wikimedia.org/T161145 (10jbond) [11:37:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 
(T318955)', diff saved to https://phabricator.wikimedia.org/P38132 and previous config saved to /var/cache/conftool/dbconfig/20221104-113725-ladsgroup.json [11:37:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [11:37:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [11:38:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:38:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:39:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [11:39:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [11:39:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T318955)', diff saved to https://phabricator.wikimedia.org/P38133 and previous config saved to /var/cache/conftool/dbconfig/20221104-113929-ladsgroup.json [11:40:41] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:42:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T318955)', diff saved to https://phabricator.wikimedia.org/P38134 and previous config saved to /var/cache/conftool/dbconfig/20221104-114207-ladsgroup.json [11:42:36] urbanecm: as far as I can see, skin.json and extension.json get cached in APC for 24h. There probably is an invalidation mechanism somewhere, but I don't see it... [11:45:59] 10SRE, 10Data-Persistence: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10Vgutierrez) >>! 
In T321684#8369557, @jbond wrote: >> cause we don't own the HAProxy puppetization, > @Vgutierrez do you know who does? the CP servers are the biggest user of... [11:48:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P38135 and previous config saved to /var/cache/conftool/dbconfig/20221104-114835-ladsgroup.json [11:50:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 25%: After reboot', diff saved to https://phabricator.wikimedia.org/P38136 and previous config saved to /var/cache/conftool/dbconfig/20221104-115051-root.json [11:51:44] 10Puppet, 10Cloud-Services, 10Data-Persistence, 10Infrastructure-Foundations, 10Thumbor: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10jbond) >>! In T321684#8369618, @Vgutierrez wrote: >>>! In T321684#8369557, @jbond wrote: >>> cause we don'... [11:52:40] (03CR) 10Vlad.shapik: [C: 03+1] "Looks good." 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/852958 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:52:46] (03PS3) 10Hnowlan: changeprop: make num_workers configurable for jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/826570 (https://phabricator.wikimedia.org/T233196) [11:52:57] (03CR) 10Hnowlan: [C: 03+2] Encode messages written to poolcounter stream [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/852958 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:54:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:54:40] 10Puppet, 10Cloud-Services, 10Data-Persistence, 10Infrastructure-Foundations, 10Thumbor: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10Vgutierrez) It's my understanding that right now the haproxy class is flexible enough for the current use... 
[11:57:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P38137 and previous config saved to /var/cache/conftool/dbconfig/20221104-115713-ladsgroup.json [11:58:32] (03PS1) 10Jcrespo: bacula: Add missing fileset definition dispatch-postgres [puppet] - 10https://gerrit.wikimedia.org/r/853268 (https://phabricator.wikimedia.org/T313229) [11:59:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:59:08] (03CR) 10CI reject: [V: 04-1] bacula: Add missing fileset definition dispatch-postgres [puppet] - 10https://gerrit.wikimedia.org/r/853268 (https://phabricator.wikimedia.org/T313229) (owner: 10Jcrespo) [11:59:10] (03Merged) 10jenkins-bot: Encode messages written to poolcounter stream [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/852958 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:59:19] (03PS16) 10Jbond: sre.hardware.upgrade-firmware: use packaging.version.Version [cookbooks] - 10https://gerrit.wikimedia.org/r/835579 [11:59:42] (03PS2) 10Jcrespo: bacula: Add missing fileset definition dispatch-postgres [puppet] - 10https://gerrit.wikimedia.org/r/853268 (https://phabricator.wikimedia.org/T313229) [11:59:44] (03PS37) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [11:59:48] (03PS33) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [11:59:52] (03PS16) 10Jbond: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 [12:02:10] (03PS3) 10Jcrespo: bacula: Add 
missing fileset definition dispatch-postgres [puppet] - 10https://gerrit.wikimedia.org/r/853268 (https://phabricator.wikimedia.org/T313229) [12:03:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P38138 and previous config saved to /var/cache/conftool/dbconfig/20221104-120342-ladsgroup.json [12:05:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 50%: After reboot', diff saved to https://phabricator.wikimedia.org/P38139 and previous config saved to /var/cache/conftool/dbconfig/20221104-120556-root.json [12:07:59] (03PS1) 10Hnowlan: thumbor: image bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/853271 (https://phabricator.wikimedia.org/T233196) [12:09:08] (03Abandoned) 10Klausman: ml_k8s: move config for ML staging master/worker to be more consistent [labs/private] - 10https://gerrit.wikimedia.org/r/853235 (owner: 10Klausman) [12:11:00] (03CR) 10Slyngshede: "Reworked LDAP wrapper library. Now based on ldap3 to avoid doing to much work ourselves." 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [12:11:45] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 (owner: 10Jbond) [12:11:54] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: use packaging.version.Version [cookbooks] - 10https://gerrit.wikimedia.org/r/835579 (owner: 10Jbond) [12:11:57] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond) [12:11:59] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [12:12:12] (03PS2) 10Slyngshede: Initial checkin - LDAP wrapper library based on ldap3, to make everyday operations on LDAP users and groups more convenient. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) [12:12:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P38140 and previous config saved to /var/cache/conftool/dbconfig/20221104-121219-ladsgroup.json [12:12:24] (03PS3) 10JMeybohm: pontoon: Make poonton lb a central DNS server for the stack [puppet] - 10https://gerrit.wikimedia.org/r/853259 [12:12:43] (03PS4) 10JMeybohm: pontoon: Make poonton lb a central DNS server for the stack [puppet] - 10https://gerrit.wikimedia.org/r/853259 [12:13:39] (03CR) 10Hnowlan: [C: 03+2] thumbor: image bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/853271 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:14:06] (03CR) 10JMeybohm: pontoon: Make poonton lb a central DNS server for the stack (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/853259 (owner: 10JMeybohm) [12:16:34] (03Merged) 10jenkins-bot: 
sre.hardware.upgrade-firmware: use packaging.version.Version [cookbooks] - 10https://gerrit.wikimedia.org/r/835579 (owner: 10Jbond) [12:16:36] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond) [12:17:10] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 (owner: 10Jbond) [12:17:12] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: add support for storage devices [cookbooks] - 10https://gerrit.wikimedia.org/r/836226 (owner: 10Jbond) [12:17:33] (03Merged) 10jenkins-bot: thumbor: image bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/853271 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:18:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T318955)', diff saved to https://phabricator.wikimedia.org/P38141 and previous config saved to /var/cache/conftool/dbconfig/20221104-121848-ladsgroup.json [12:18:54] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:18:54] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:19:43] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:21:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 75%: After reboot', diff saved to https://phabricator.wikimedia.org/P38142 and previous config saved to /var/cache/conftool/dbconfig/20221104-122101-root.json [12:22:29] (03PS6) 10Slyngshede: Bitu IDM, initial checkin [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) [12:24:38] (03CR) 10Filippo Giunchedi: pontoon: Make poonton lb a central DNS server for the stack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/853259 (owner: 10JMeybohm) [12:25:52] (03CR) 
10Filippo Giunchedi: [C: 03+1] "LGTM, thank you !" [puppet] - 10https://gerrit.wikimedia.org/r/853268 (https://phabricator.wikimedia.org/T313229) (owner: 10Jcrespo) [12:27:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T318955)', diff saved to https://phabricator.wikimedia.org/P38143 and previous config saved to /var/cache/conftool/dbconfig/20221104-122726-ladsgroup.json [12:27:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance [12:27:32] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:27:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance [12:27:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T318955)', diff saved to https://phabricator.wikimedia.org/P38144 and previous config saved to /var/cache/conftool/dbconfig/20221104-122747-ladsgroup.json [12:30:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T318955)', diff saved to https://phabricator.wikimedia.org/P38145 and previous config saved to /var/cache/conftool/dbconfig/20221104-123026-ladsgroup.json [12:36:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2020 (re)pooling @ 100%: After reboot', diff saved to https://phabricator.wikimedia.org/P38146 and previous config saved to /var/cache/conftool/dbconfig/20221104-123606-root.json [12:40:05] 10Puppet, 10Cloud-Services, 10Data-Persistence, 10Infrastructure-Foundations, 10Thumbor: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10fgiunchedi) Thank you for the investigation and the context -- appreciate it! >>! In T321684#8369634, @Vg... 
[12:45:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P38147 and previous config saved to /var/cache/conftool/dbconfig/20221104-124533-ladsgroup.json [12:46:10] (03CR) 10Jon Harald Søby: [C: 03+1] onSpecialSearchCreateLink: Handle another null from Title::newFromText (031 comment) [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851017 (https://phabricator.wikimedia.org/T320736) (owner: 10Jforrester) [12:46:19] (03CR) 10Jon Harald Søby: onSpecialSearchCreateLink: Handle another null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851017 (https://phabricator.wikimedia.org/T320736) (owner: 10Jforrester) [12:52:57] (03PS2) 10Hashar: Move custom CSS style to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/853057 (https://phabricator.wikimedia.org/T319378) [12:53:03] (03PS2) 10Hashar: Move custom links to a standalone plugin [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/853058 (https://phabricator.wikimedia.org/T319378) [12:55:04] (03PS2) 10Hashar: gerrit: remove gerrit-theme.js [puppet] - 10https://gerrit.wikimedia.org/r/853061 (https://phabricator.wikimedia.org/T319378) [12:58:01] (03PS5) 10JMeybohm: pontoon: Make poonton lb a central DNS server for the stack [puppet] - 10https://gerrit.wikimedia.org/r/853259 [13:00:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P38148 and previous config saved to /var/cache/conftool/dbconfig/20221104-130039-ladsgroup.json [13:01:27] PROBLEM - SSH on mw1309.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:01:28] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice, LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/853259 (owner: 10JMeybohm) [13:02:55] (03PS2) 10Thiemo Kreuz (WMDE): onSpecialSearchCreateLink: Handle another null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851017 (https://phabricator.wikimedia.org/T320736) (owner: 10Jforrester) [13:03:55] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] onSpecialSearchCreateLink: Handle another null from Title::newFromText (031 comment) [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851017 (https://phabricator.wikimedia.org/T320736) (owner: 10Jforrester) [13:05:02] (03CR) 10Jcrespo: [C: 03+2] bacula: Add missing fileset definition dispatch-postgres [puppet] - 10https://gerrit.wikimedia.org/r/853268 (https://phabricator.wikimedia.org/T313229) (owner: 10Jcrespo) [13:06:57] (03CR) 10Jon Harald Søby: [C: 03+1] onSpecialSearchCreateLink: Handle another null from Title::newFromText (031 comment) [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851017 (https://phabricator.wikimedia.org/T320736) (owner: 10Jforrester) [13:09:56] !log reprepro -C main include bullseye-wikimedia fifo-log-demux_0.6.3_amd64.changes: T321309 [13:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:59] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [13:10:13] (03CR) 10JMeybohm: [C: 03+2] pontoon: Make poonton lb a central DNS server for the stack (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/853259 (owner: 10JMeybohm) [13:10:39] !log reprepro -C main include bullseye-wikimedia file-read-backwards_2.0.0-3_amd64.changes: T321309 [13:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:28] !log reprepro -C main include bullseye-wikimedia prometheus-rdkafka-exporter_0.3_amd64.changes: T321309 [13:11:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:44] (03PS1) 10Slyngshede: C:idm::deployment logrotation for Django logs. [puppet] - 10https://gerrit.wikimedia.org/r/853283 (https://phabricator.wikimedia.org/T320431) [13:14:29] (03CR) 10CI reject: [V: 04-1] C:idm::deployment logrotation for Django logs. [puppet] - 10https://gerrit.wikimedia.org/r/853283 (https://phabricator.wikimedia.org/T320431) (owner: 10Slyngshede) [13:15:07] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:15:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T318955)', diff saved to https://phabricator.wikimedia.org/P38149 and previous config saved to /var/cache/conftool/dbconfig/20221104-131546-ladsgroup.json [13:15:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance [13:15:50] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:15:59] (03PS2) 10Slyngshede: C:idm::deployment logrotation for Django logs. 
[puppet] - 10https://gerrit.wikimedia.org/r/853283 (https://phabricator.wikimedia.org/T320431) [13:16:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance [13:16:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T318955)', diff saved to https://phabricator.wikimedia.org/P38150 and previous config saved to /var/cache/conftool/dbconfig/20221104-131607-ladsgroup.json [13:17:04] !log reprepro -C main include bullseye-wikimedia python-logstash_0.4.6-3_amd64.changes: T321309 [13:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:07] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [13:17:58] RECOVERY - SSH on an-coord1002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:18:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T318955)', diff saved to https://phabricator.wikimedia.org/P38151 and previous config saved to /var/cache/conftool/dbconfig/20221104-131846-ladsgroup.json [13:21:11] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:22:29] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:26:42] 10SRE, 10Release-Engineering-Team, 10serviceops, 10Continuous-Integration-Config: operations/docker-images/production-images has no CI - https://phabricator.wikimedia.org/T283855 (10jbond) [13:27:18] 10SRE, 10SRE-tools, 10Discovery-Search, 10Infrastructure-Foundations, 10Spicerack: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (10jbond) [13:29:13] RECOVERY - Backup freshness on backup1001 is OK: 
Fresh: 122 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:29:45] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 109 probes of 694 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:30:53] 10Puppet, 10Infrastructure-Foundations, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (10jbond) [13:31:17] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:31:27] 10Puppet, 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10jbond) [13:33:02] 10SRE, 10Data-Services, 10Traffic: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (10jbond) [13:33:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P38152 and previous config saved to /var/cache/conftool/dbconfig/20221104-133353-ladsgroup.json [13:34:20] 10SRE, 10Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (10jbond) p:05High→03Medium [13:35:18] (03PS3) 10Btullis: Update the spark and spark-operator images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) [13:35:47] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 65 probes of 694 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:37:19] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP 
CRITICAL - AS6939/IPv4: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:38:50] (CR) Slyngshede: C:idm::deployment add required packages for testing. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852890 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [13:39:02] (CR) Slyngshede: [C: +2] C:idm::deployment add required packages for testing. [puppet] - https://gerrit.wikimedia.org/r/852890 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [13:39:20] (PS3) Slyngshede: C:idm::deployment logrotation for Django logs. [puppet] - https://gerrit.wikimedia.org/r/853283 (https://phabricator.wikimedia.org/T320431) [13:42:39] SRE: SSL address space separation - https://phabricator.wikimedia.org/T83736 (jbond) Open→Resolved a:jbond I'm going to boldly close this; I think the infrastructure has moved on significantly from this, so I'm not sure the task is still valid, but please re-open and update if you disagree [13:43:52] (CR) Btullis: Update the spark and spark-operator images (5 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: Btullis) [13:44:25] SRE: Allow strace/gdb attachment to processes running as a user one can sudo as - https://phabricator.wikimedia.org/T84257 (jbond) Open→Resolved a:jbond As there has been no update on this for ~8 years and given filipo's last comment, I'm going to close, but please re-open and update if there is sti...
[13:45:53] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp4052'] [13:46:33] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:47:32] SRE, serviceops, Documentation: document redis upgrade/restart procedures - https://phabricator.wikimedia.org/T101585 (jbond) [13:49:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P38153 and previous config saved to /var/cache/conftool/dbconfig/20221104-134859-ladsgroup.json [13:49:43] SRE: Track systems/roles for which intentionally no firewall rules are applied - https://phabricator.wikimedia.org/T104958 (jbond) Open→Resolved a:jbond Going to boldly close this; we can get this list using cumin with `sudo cumin "A:all and not P{C:ferm}" ` [13:51:19] (PS1) Slyngshede: Add RQ support to Django [software/bitu] - https://gerrit.wikimedia.org/r/853290 [13:51:23] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:52:11] SRE: Adapt all the things to localized Special: namespaces - https://phabricator.wikimedia.org/T105434 (jbond) Is this still an issue? If so, can you document the specific error and desired action so we can assign the task correctly? Thanks [13:52:24] SRE, conftool, serviceops: Not all confd errors throw icinga alerts - https://phabricator.wikimedia.org/T110933 (jbond) [13:54:58] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [13:56:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:57:17] SRE, cloud-services-team (Kanban): Booleans in hiera may be harmful - https://phabricator.wikimedia.org/T114018 (jbond) Open→Resolved a:jbond The two links are dead, I'm going to close this, but FTR if
its a bare string it will be boolean if its in quotes it will be a string, as defined by the... [13:57:28] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dbprov2004 [13:57:49] 10SRE, 10Infrastructure-Foundations, 10Packaging: Create an upload queue for reprepro - https://phabricator.wikimedia.org/T115349 (10jbond) [13:58:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbprov2004 [13:58:18] 10SRE, 10Infrastructure-Foundations, 10Packaging: Debian repository supporting multiple package versions - https://phabricator.wikimedia.org/T115758 (10jbond) [13:58:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [13:59:49] 10Puppet, 10Cloud-Services, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Untangle labs/production roles from labs/instance roles - https://phabricator.wikimedia.org/T119401 (10jbond) 05Open→03Resolved a:03jbond boldly closing but please re-open and update if its still actionable [14:01:31] 10SRE, 10SRE-Misc, 10Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (10LSobanski) [14:02:07] 10SRE-Misc, 10PM: SRE Clinic duty - triage query review - https://phabricator.wikimedia.org/T320959 (10LSobanski) [14:02:19] RECOVERY - SSH on mw1309.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:02:20] (03CR) 10Herron: "Very nice! 
Please see initial thoughts inline" [puppet] - 10https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [14:04:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T318955)', diff saved to https://phabricator.wikimedia.org/P38154 and previous config saved to /var/cache/conftool/dbconfig/20221104-140405-ladsgroup.json [14:04:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance [14:04:10] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:04:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance [14:04:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T318955)', diff saved to https://phabricator.wikimedia.org/P38155 and previous config saved to /var/cache/conftool/dbconfig/20221104-140427-ladsgroup.json [14:05:02] 10SRE, 10Education-Program-Dashboard, 10Programs-and-Events-Dashboard-Sprint 2, 10Spike: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295 (10jbond) 05Open→03Resolved a:03jbond im going to boldly close this, debian st... [14:05:25] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:07:01] 10Puppet, 10Cloud-Services, 10Data-Persistence, 10Infrastructure-Foundations, 10Thumbor: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10Vgutierrez) >>! In T321684#8369754, @fgiunchedi wrote: > however if someone is introducing `haproxy::site`... 
[14:07:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T318955)', diff saved to https://phabricator.wikimedia.org/P38156 and previous config saved to /var/cache/conftool/dbconfig/20221104-140705-ladsgroup.json [14:07:09] 10SRE, 10Wikimedia-IRC-RC-Server, 10Patch-For-Review: RC stream is broken over IRC - https://phabricator.wikimedia.org/T134247 (10jbond) [14:09:39] 10SRE, 10Data-Persistence-Backup: Deploy etcddump (or another etcd dump & load tool) to production - https://phabricator.wikimedia.org/T135124 (10jbond) Is this still required [14:10:16] (03PS2) 10Slyngshede: Add RQ support to Django [software/bitu] - 10https://gerrit.wikimedia.org/r/853290 [14:10:18] (03CR) 10Herron: [C: 03+1] prometheus: probe SSH on mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [14:10:33] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Vgutierrez) Their TLS termination has been improved over time but they still don't meet the requirements listed on https://wikitech.wikimed... 
[14:13:21] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:14:30] SRE, serviceops: Turn on etcd TLS for intra-cluster communications - https://phabricator.wikimedia.org/T135128 (jbond) Open→Resolved a:jbond I believe this is now in place, but please re-open if I'm wrong [14:14:33] SRE, Technical-Debt: Reduce etcd technical debt - https://phabricator.wikimedia.org/T135122 (jbond) [14:15:27] SRE, observability: mod_deflate + mod_uwsgi causing mangled apache responses - https://phabricator.wikimedia.org/T135595 (jbond) Open→Resolved a:jbond I'm going to boldly close this with the hope that we don't still need to support HTTP/1.0 [14:17:21] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: OpenSent - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:18:43] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:19:10] SRE, Data-Persistence, MediaWiki-Maintenance-system: Separate host lookup from the sql shell script - https://phabricator.wikimedia.org/T141255 (jbond) Open→Resolved a:jbond I'm boldly closing this task as I believe the infrastructure has moved on since this was raised (terbium no longer exi... [14:20:37] Puppet, SRE, Cloud-Services, Infrastructure-Foundations, netops: Moving network::external to hiera broke much of labs - https://phabricator.wikimedia.org/T141959 (jbond) Open→Resolved a:jbond I'm going to close this ticket assuming that the issue has been resolved in the meantime...
[14:21:28] 10SRE, 10Infrastructure-Foundations: create notifications about user accounts that have not been used for a long time - https://phabricator.wikimedia.org/T146657 (10jbond) [14:22:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P38157 and previous config saved to /var/cache/conftool/dbconfig/20221104-142212-ladsgroup.json [14:23:12] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp4052'] [14:23:52] 10SRE, 10Arc-Lamp, 10Performance-Team (Radar): Expand RAM on arclamp hosts and move them to baremetal - https://phabricator.wikimedia.org/T316223 (10Jclark-ctr) [14:24:05] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: create notifications about user accounts that have not been used for a long time - https://phabricator.wikimedia.org/T146657 (10jbond) By used do you mean to accessing to some service via ldap auth/cas sso, ssh, both or something else? Further what would the... [14:24:32] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [14:25:36] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp4052'] [14:26:07] 10SRE: Puppet fails only once when restarting ferm is not successful - https://phabricator.wikimedia.org/T157972 (10jbond) I believe this has been fixed by the addition of the `ferm-stastus` script. 
but please re-open if im missing something https://gerrit.wikimedia.org/r/c/operations/puppet/+/576101 [14:26:13] 10Puppet, 10SRE, 10Infrastructure-Foundations: Puppet fails only once when restarting ferm is not successful - https://phabricator.wikimedia.org/T157972 (10jbond) 05Open→03Resolved a:03jbond [14:26:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp4052'] [14:28:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [14:28:27] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Jclark-ctr) 05Open→03Resolved [14:29:05] 10SRE, 10Infrastructure-Foundations, 10Packaging: make apt.wikimedia.org HA - https://phabricator.wikimedia.org/T158022 (10jbond) 05Open→03Resolved a:03jbond apt is [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/730523 | now configured ]] as a dns discover services in active/passive [14:30:02] (03PS1) 10AikoChou: ml-services: update docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/853298 (https://phabricator.wikimedia.org/T320374) [14:30:13] 10SRE, 10Infrastructure-Foundations, 10LDAP: Cross-check disabled accounts from corp LDAP against data.yaml - https://phabricator.wikimedia.org/T161003 (10jbond) wonder if this is still needed? 
[14:30:18] (03PS3) 10Filippo Giunchedi: dispatch: sync user role and info from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) [14:30:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:28] (03CR) 10Filippo Giunchedi: "Thank you for the quick review!" [puppet] - 10https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [14:30:37] uh.. ^^ [14:30:59] 10SRE, 10conftool, 10serviceops: confctl no longer logs a non-changing state change - https://phabricator.wikimedia.org/T161096 (10jbond) [14:31:16] it looks like thumbor has been struggling for a while in eqiad [14:31:29] 10SRE, 10SRE Program Management, 10Documentation, 10PM: Create a Clinic Duty roster process - https://phabricator.wikimedia.org/T244266 (10LSobanski) 05Open→03Resolved a:03LSobanski Clinic Duty is now tracked in Splunk On-Call and publicly available here: https://wikitech.wikimedia.org/wiki/SRE/Oncal... [14:32:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:32:45] 10SRE, 10SRE Program Management, 10Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (10LSobanski) [14:32:45] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:33:07] uh.. 
Emperor ^^ [14:33:24] (03PS1) 10Stevemunene: Add stevemunene to ops and analytics [puppet] - 10https://gerrit.wikimedia.org/r/853300 (https://phabricator.wikimedia.org/T322339) [14:33:26] didn't we just have this a couple days ago [14:33:27] acking [14:33:51] nevermind denisse|m already on the ack :) [14:33:53] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry1004.eqiad.wmnet:443/v2/wikimedia-stretch/manifests/latest - 362 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Docker [14:33:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1012.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1011.eqiad.wmnet are marked down but pooled: swift_80: Servers ms-fe1010.eqiad.wmnet, ms-fe1011.eqiad.wmnet are marked down but pooled: thumbor_8800: Servers thumbor1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:34:21] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:34:23] IIRC last time there were docker alerts around the same time as the thumbor/swift stuff as well, which is odd [14:34:34] bblack: Yes, I'm taking a look at it. 
:) [14:34:35] (FrontendUnavailable) firing: HAProxy (cache_upload) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [14:34:44] lovely [14:34:46] uhm [14:34:53] ms-be1064.eqiad.wmnet seems to be down [14:34:53] it's because of swift/thumbor, presumably [14:34:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:35:00] (the cache_upload alert) [14:35:04] Nov 4 14:34:16 ms-fe1010 proxy-server: ERROR with Account server 10.64.0.71:6002/sdb3 re: Trying to HEAD /v1/AUTH_mw: ConnectionTimeout (0.5s) [14:35:13] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 408 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:35:18] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:19] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:35:27] Puppet, SRE, Infrastructure-Foundations: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667 (jbond) Is this still an issue? We can now create daemon users via admin.yaml with persistent UIDs; see reprepro as an example [14:35:31] two frontends down...
hmmm [14:35:37] let's depool swift@eqiad [14:35:38] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [14:35:47] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1012.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1011.eqiad.wmnet are marked down but pooled: swift_80: Servers ms-fe1012.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:35:57] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.263 second response time https://wikitech.wikimedia.org/wiki/Swift [14:35:57] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.129 second response time https://wikitech.wikimedia.org/wiki/Swift [14:36:03] the MW fatals last time were an effect from swift as well [14:36:20] bblack: ok with depooling swift@eqiad? [14:36:20] the root problem is most likely in the swift world, as I vaguely recall the backscroll from the other recent incident following this pattern [14:36:36] vgutierrez: yeah. I don't think it truly fixes things, but I think it helps with impact? 
[14:36:43] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 6.290 second response time https://wikitech.wikimedia.org/wiki/Swift [14:37:04] yes [14:37:12] it should mitigate the CDN impact [14:37:14] looking for prev doc info [14:37:18] (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:37:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P38158 and previous config saved to /var/cache/conftool/dbconfig/20221104-143718-ladsgroup.json [14:37:27] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:37:39] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=eqiad [14:37:46] vgutierrez: bblack Here's the document for the issue: https://docs.google.com/document/d/1Gd98aR28A4dw6dsXf0lMpHOWZoBwEzgTieR_5m7NSuk/edit?usp=sharing [14:38:21] here's the old one (by the alert patterns, I suspect it's a repeat): https://docs.google.com/document/d/1gAJhqBnDCQK6bLz61w1Mr9z2wYiSwMRuFkU3EYQgNyw/edit [14:38:31] it says "error with account server" afaics in the swift proxy logs [14:38:39] PROBLEM - Docker registry health on registry1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 224 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Docker [14:38:58] I think that's largely a "whups I'm overloaded", BICBW [14:38:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s 
- https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:40:18] (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:40:36] o/ [14:40:49] should we all move to #sre to avoid the alert spam? [14:40:51] o/ [14:40:55] there is a proxy-server deadlock bug fixed in 2.27 https://bugs.launchpad.net/swift/+bug/1895739 [14:41:08] so seems like last time around, our mitigation in the short term was the eqiad-side depool for traffic [done already this time], and then [permanently?] depooling ms-fe1009? [14:41:19] no sorry, ms-fe2009 [14:41:35] (CR) Jforrester: Remove logo setting in YAML files (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/843514 (owner: Jdlrobson) [14:41:41] ms-fe2009 was depooled because it's running stretch [14:41:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [14:41:53] but today bullseye nodes are failing in eqiad as well, right?
[14:42:01] vgutierrez, bblack folks let's move to #sre [14:42:18] (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:42:34] (FrontendUnavailable) firing: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [14:42:47] PROBLEM - Docker registry health on registry1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 228 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Docker [14:43:49] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [14:43:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:44:09] PROBLEM - Check systemd state on registry1003 is CRITICAL: CRITICAL - degraded: The following units failed: build-homepage.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:15] !log restart swift-proxy on ms-fe1010 [14:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:35] (FrontendUnavailable) resolved: HAProxy (cache_upload) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [14:44:35] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: 
HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift [14:44:35] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Swift [14:44:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:45:18] (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:45:47] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:45:51] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3755 bytes in 0.212 second response time https://wikitech.wikimedia.org/wiki/Docker [14:46:21] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10LSobanski) a:05LSobanski→03MatthewVernon Reassigning to @MatthewVernon as I'm not adding much value as a middleman. 
[14:46:37] RECOVERY - Docker registry health on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Docker [14:47:07] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:47:18] (ProbeDown) firing: (5) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:47:34] (FrontendUnavailable) resolved: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [14:47:55] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.347 second response time https://wikitech.wikimedia.org/wiki/Swift [14:47:55] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:48:38] !log restart swift-proxy on ms-fe1011 [14:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:47] RECOVERY - Docker registry health on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Docker [14:48:47] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:48:57] PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [14:49:13] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes 
in 0.034 second response time https://wikitech.wikimedia.org/wiki/Swift [14:49:17] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled: swift_80: Servers ms-fe2010.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled: thumbor_8800: Servers thumbor2004.codfw.wmnet, thumbor2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:49:21] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Swift [14:49:27] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:49:27] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3755 bytes in 0.165 second response time https://wikitech.wikimedia.org/wiki/Docker [14:49:34] (FrontendUnavailable) firing: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [14:49:39] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:49:47] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:49:49] PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [14:49:59] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:50:05] PROBLEM - PyBal 
backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled: swift_80: Servers ms-fe2010.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled: thumbor_8800: Servers thumbor2004.codfw.wmnet, thumbor2005.codfw.wmnet, thumbor2006.codfw.wmnet, thumbor2003.codfw.wmnet are marked down but [14:50:05] https://wikitech.wikimedia.org/wiki/PyBal [14:50:09] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:50:18] (ProbeDown) firing: (5) Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:50:35] (FrontendUnavailable) firing: HAProxy (cache_upload) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [14:51:13] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 39 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:51:25] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:51:42] !log restart swift-proxy on ms-fe1012 [14:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:02] !log vgutierrez@puppetmaster1001 conftool action : 
set/pooled=true; selector: dnsdisc=swift,name=eqiad [14:52:18] (ProbeDown) firing: (5) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:52:19] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 7 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:52:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T318955)', diff saved to https://phabricator.wikimedia.org/P38159 and previous config saved to /var/cache/conftool/dbconfig/20221104-145225-ladsgroup.json [14:52:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [14:52:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [14:52:32] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:54:16] (03PS2) 10Urbanecm: growthexperiments.pp: Run updateMetrics.php daily [puppet] - 10https://gerrit.wikimedia.org/r/853265 (https://phabricator.wikimedia.org/T318684) [14:54:35] 10SRE, 10DC-Ops: Audit / document reasons for not enabling HT? - https://phabricator.wikimedia.org/T165618 (10jbond) i have run this again and we only have the following nodes with HT disabled * an-druid1003.eqiad.wmnet, * conf[2004-2005].codfw.wmnet, * db[1103,1154].eqiad.wmnet, * ms-be1056.eqiad.wmnet, * sre... 
[14:57:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:58:54] (03PS1) 10Hashar: build: add eslint for JavaScript plugins [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/853306 (https://phabricator.wikimedia.org/T319378) [14:58:55] RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3755 bytes in 5.909 second response time https://wikitech.wikimedia.org/wiki/Docker [14:59:12] 10SRE, 10Infrastructure-Foundations, 10Packaging, 10User-Joe: Sync internal nutcracker package with Debian package - https://phabricator.wikimedia.org/T166038 (10jbond) 05Open→03Resolved a:03jbond Closing; from what I can see, we are now using the Debian packages. Please reopen if this is not the cas... 
[14:59:34] (FrontendUnavailable) resolved: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [15:00:18] (ProbeDown) firing: (6) Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:00:55] !log `elukey@cumin1001:~$ sudo cumin 'ms-fe2*' 'systemctl restart swift-proxy' -b 1 -s 20` [15:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:22] (03CR) 10Urbanecm: [C: 04-1] "needs the script deployed to production first" [puppet] - 10https://gerrit.wikimedia.org/r/853265 (https://phabricator.wikimedia.org/T318684) (owner: 10Urbanecm) [15:01:25] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:01:29] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Swift [15:01:35] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Swift [15:01:41] \o/ [15:01:47] RECOVERY - Docker registry HTTPS interface on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 3755 bytes in 9.600 second response time https://wikitech.wikimedia.org/wiki/Docker [15:01:51] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.150 second response time 
https://wikitech.wikimedia.org/wiki/Swift [15:01:59] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Swift [15:02:05] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:02:18] (ProbeDown) resolved: (6) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:02:39] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Sustainability (Incident Followup): Review sizing of maps cluster - https://phabricator.wikimedia.org/T228497 (10LSobanski) p:05High→03Medium Considering this task is over 3 years old and there have been changes to the Maps infrastructure, I'll lo... [15:03:17] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:03:34] 10SRE, 10Data-Persistence-Backup, 10IPv6: update bacula-sd config so that it listens on IPv6 - https://phabricator.wikimedia.org/T253986 (10LSobanski) [15:05:17] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Swift [15:05:18] (ProbeDown) resolved: (5) Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:05:35] (FrontendUnavailable) resolved: HAProxy (cache_upload) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [15:06:41] RECOVERY - Swift https backend on 
ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Swift [15:08:20] (03PS1) 10Jbond: C:raid::mdadm: remove daily cron job [puppet] - 10https://gerrit.wikimedia.org/r/853307 (https://phabricator.wikimedia.org/T169564) [15:09:15] 10SRE, 10observability, 10Patch-For-Review: MD RAID: remove mdadm daily check - https://phabricator.wikimedia.org/T169564 (10jbond) [15:11:00] 10Puppet, 10Cloud-Services, 10Data-Persistence, 10Infrastructure-Foundations, 10Thumbor: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10taavi) Looking at the logs, it seems that haproxy.service is started when the package is installed, a... [15:12:25] (03CR) 10Elukey: [C: 03+2] ml-services: update docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/853298 (https://phabricator.wikimedia.org/T320374) (owner: 10AikoChou) [15:12:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:13:16] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10observability: run-no-puppet leave puppet disabled on kill/crash - https://phabricator.wikimedia.org/T182228 (10jbond) What exactly does this script do? [15:14:45] 10SRE, 10LDAP: Update certificates on productions replicas of corp.wikimedia.org LDAP - https://phabricator.wikimedia.org/T168460 (10LSobanski) 05Open→03Resolved a:03LSobanski This looks long done - here's the task for the 2018 renewal: T197840. Resolving. [15:15:46] 10SRE, 10Observability-Logging, 10Wikimedia-Apache-configuration, 10User-Joe: Gain visibility into httpd mod_proxy actions - https://phabricator.wikimedia.org/T188601 (10jbond) Is this related to some specific service, or is it perhaps no longer valid? 
[15:17:43] 10SRE, 10Beta-Cluster-Infrastructure: "Obama" page on Beta Cluster often responds with 500 or 503 - https://phabricator.wikimedia.org/T188913 (10jbond) 05Open→03Resolved a:03jbond Boldly closing this task as we have not had any recent reports. Please re-open if this is still an issue. [15:19:12] 10ops-codfw: Toubleshoot why latest idrac version is not working on Dell servers - https://phabricator.wikimedia.org/T322419 (10Papaul) [15:19:52] 10SRE, 10User-jbond: Discussion: Explore push notifications options - https://phabricator.wikimedia.org/T221265 (10LSobanski) 05Open→03Resolved a:03LSobanski I believe this is addressed by Splunk On-Call. Resolving. [15:19:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:20:07] 10SRE: Access requests process: People sometimes specify hostnames instead of admin groups in access requests - https://phabricator.wikimedia.org/T207754 (10jbond) 05Open→03Resolved a:03jbond I'm closing this request; it seems like the process for access requests has changed somewhat since this task was cr... 
[15:20:25] 10SRE, 10SRE Observability, 10UI-Standardization: Use white version of Wikimedia logo for grafana - https://phabricator.wikimedia.org/T226970 (10LSobanski) [15:20:37] 10SRE, 10serviceops, 10Security: Filter potentially harmful PostScript commands in Commons upload/thumbor - https://phabricator.wikimedia.org/T210833 (10jbond) [15:23:10] 10SRE, 10Cassandra, 10RESTBase-Cassandra: restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424 (10LSobanski) [15:23:36] 10SRE, 10Observability-Metrics, 10UI-Standardization: Use white version of Wikimedia logo for grafana - https://phabricator.wikimedia.org/T226970 (10colewhite) [15:23:40] 10SRE, 10Infrastructure-Foundations: keyholder: continue to arm keys if one fails - https://phabricator.wikimedia.org/T227272 (10LSobanski) [15:26:59] 10SRE, 10Traffic: ATS flags origin servers as down during 60 seconds after a connect timeout - https://phabricator.wikimedia.org/T322420 (10Vgutierrez) [15:27:15] 10SRE, 10Traffic: ATS flags origin servers as down during 60 seconds after a connect timeout - https://phabricator.wikimedia.org/T322420 (10Vgutierrez) p:05Triage→03Medium a:03Vgutierrez [15:28:16] 10Puppet, 10SRE, 10Infrastructure-Foundations: Usual git mechanism for aborting commit does not work on the private puppet repo - https://phabricator.wikimedia.org/T211121 (10jbond) Can you confirm if this is still an issue? I just tried to recreate it and it looks to work as expected now. [15:28:42] 10SRE, 10serviceops, 10Release Pipeline (Blubber): blubber template for nodejs should allow defining configuration files to copy to the container - https://phabricator.wikimedia.org/T211580 (10jbond) [15:29:26] 10SRE, 10Cassandra, 10Observability-Logging, 10RESTBase-Cassandra: restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424 (10jbond) [15:29:47] RECOVERY - Check systemd state on registry1003 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:15] 10SRE, 10Traffic: ATS flags origin servers as down during 60 seconds after a connect timeout - https://phabricator.wikimedia.org/T322420 (10BBlack) Arguably we want this down server cache time to be very low or even disabled in the general case. It's not likely that caching the origin outage is going to help... [15:32:41] 10SRE, 10Infrastructure-Foundations, 10Packaging: Add support for temporary chroots to boron - https://phabricator.wikimedia.org/T219977 (10jbond) @MoritzMuehlenhoff is this still worth exploring or has it perhaps been superseded ? [15:34:24] (03PS1) 10Jelto: gitlab_runner: run cleanup of docker cache twice daily [puppet] - 10https://gerrit.wikimedia.org/r/853312 [15:36:03] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10LSobanski) p:05Triage→03High [15:37:17] 10SRE: run-no-puppet: rewrite using puppet-common.sh - https://phabricator.wikimedia.org/T223937 (10jbond) [15:37:22] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10observability: run-no-puppet leave puppet disabled on kill/crash - https://phabricator.wikimedia.org/T182228 (10jbond) [15:37:25] (03CR) 10Ahmon Dancy: gitlab_runner: run cleanup of docker cache twice daily (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/853312 (owner: 10Jelto) [15:38:42] (03PS2) 10Jelto: gitlab_runner: run cleanup of docker cache twice daily [puppet] - 10https://gerrit.wikimedia.org/r/853312 [15:38:48] 10SRE, 10Observability-Logging, 10WMF-General-or-Unknown, 10serviceops: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078 (10jbond) [15:38:52] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & 
ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10calbon) I approve! [15:39:16] (03CR) 10Jelto: gitlab_runner: run cleanup of docker cache twice daily (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/853312 (owner: 10Jelto) [15:39:18] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab_runner: run cleanup of docker cache twice daily [puppet] - 10https://gerrit.wikimedia.org/r/853312 (owner: 10Jelto) [15:39:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:40:09] 10SRE, 10Scap: scap sudo violation on first puppet run - https://phabricator.wikimedia.org/T185189 (10jbond) [15:41:12] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:42:22] 10SRE, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10jbond) 05Open→03In progress a:03LSobanski @LSobanski i think this could possibly be closed as part of the recent work you have been doing? [15:43:01] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [15:43:40] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10hashar) @Jnuche we will have to setup a spare Jenkins and a Zuul merger on this new host contint1002 :-) [15:44:45] 10SRE, 10WMF-General-or-Unknown: Request for information about hosting services for WM-ES - https://phabricator.wikimedia.org/T211414 (10jbond) 05Open→03Resolved a:03jbond Resolving as there has been no response since the last question/. 
please re-open and update if there is still something we can help... [15:47:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:48:15] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp4052'] [15:48:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:49:42] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10jnuche) @hashar sounds like a good opportunity to pair! 
[15:52:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:54:40] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-fe1001.eqiad.wmnet with OS bullseye [15:55:13] (03PS3) 10Jelto: gitlab_runner: run cleanup of docker cache twice daily [puppet] - 10https://gerrit.wikimedia.org/r/853312 (https://phabricator.wikimedia.org/T310593) [15:55:26] (03PS2) 10Jbond: Add prod access for Ilias Sarantopoulos [puppet] - 10https://gerrit.wikimedia.org/r/853090 (https://phabricator.wikimedia.org/T322350) (owner: 10Elukey) [15:55:39] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-fe2001.codfw.wmnet with OS bullseye [15:55:44] (03CR) 10Jbond: [C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/853090 (https://phabricator.wikimedia.org/T322350) (owner: 10Elukey) [15:56:09] (03CR) 10CI reject: [V: 04-1] Add prod access for Ilias Sarantopoulos [puppet] - 10https://gerrit.wikimedia.org/r/853090 (https://phabricator.wikimedia.org/T322350) (owner: 10Elukey) [15:56:50] (03PS3) 10Jbond: Add prod access for Ilias Sarantopoulos [puppet] - 10https://gerrit.wikimedia.org/r/853090 (https://phabricator.wikimedia.org/T322350) (owner: 10Elukey) [15:57:25] !log repool ms-fe{1,2}009 [15:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:31] (03CR) 10Jbond: [C: 03+2] Add prod access for Ilias Sarantopoulos [puppet] - 10https://gerrit.wikimedia.org/r/853090 (https://phabricator.wikimedia.org/T322350) (owner: 10Elukey) [16:00:04] 10SRE, 10Traffic: ATS flags origin servers as down during 60 seconds after a connect timeout - https://phabricator.wikimedia.org/T322420 (10Vgutierrez) I think `proxy.config.http.connect.dead.policy` is also 
interesting for us: ` Controls what origin server connection failures contribute to marking a server de... [16:00:09] slyngs: happy for me to merge [16:00:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10jbond) [16:03:59] (03PS1) 10Vgutierrez: trafficserver: Avoid marking origin servers down/dead [puppet] - 10https://gerrit.wikimedia.org/r/853321 (https://phabricator.wikimedia.org/T322420) [16:05:17] slyngs: change looks safe enough and affects nothing in prod, will merge [16:05:48] (03CR) 10BBlack: [C: 03+1] "Seems like the right thing to me! Maybe we should hold for next week though, just in case" [puppet] - 10https://gerrit.wikimedia.org/r/853321 (https://phabricator.wikimedia.org/T322420) (owner: 10Vgutierrez) [16:06:00] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37960/console" [puppet] - 10https://gerrit.wikimedia.org/r/853321 (https://phabricator.wikimedia.org/T322420) (owner: 10Vgutierrez) [16:06:11] bblack: agreed, I'll deploy it on Monday [16:06:20] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [16:06:23] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[16:07:53] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: host reimage [16:10:20] (03PS1) 10Filippo Giunchedi: swift: add moss-fe[12]001 to swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/853324 (https://phabricator.wikimedia.org/T322417) [16:10:22] (03PS1) 10Filippo Giunchedi: hieradata: add moss-fe[12]001 to swift memcached servers [puppet] - 10https://gerrit.wikimedia.org/r/853325 (https://phabricator.wikimedia.org/T322417) [16:10:37] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage [16:11:16] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: host reimage [16:11:28] (03CR) 10CI reject: [V: 04-1] swift: add moss-fe[12]001 to swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/853324 (https://phabricator.wikimedia.org/T322417) (owner: 10Filippo Giunchedi) [16:13:20] (03PS2) 10Filippo Giunchedi: swift: add moss-fe[12]001 to swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/853324 (https://phabricator.wikimedia.org/T322417) [16:13:22] (03PS2) 10Filippo Giunchedi: hieradata: add moss-fe[12]001 to swift memcached servers [puppet] - 10https://gerrit.wikimedia.org/r/853325 (https://phabricator.wikimedia.org/T322417) [16:13:52] (03PS5) 10BCornwall: prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) [16:13:56] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage [16:13:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:14:36] (03CR) 10CI reject: [V: 04-1] prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [16:16:22] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ilias Sarantopoulos - https://phabricator.wikimedia.org/T322350 (10jbond) 05Stalled→03Resolved a:05calbon→03jbond Access has been confug... [16:16:42] 10SRE, 10SRE-swift-storage: Repeated swift (cascading?) failures, late 2022 - https://phabricator.wikimedia.org/T322424 (10CDanis) [16:16:54] brett: just keep the syntax of the original line but add ensure => 'absent' :) [16:17:00] 10SRE, 10SRE-swift-storage: Repeated swift (cascading?) failures, late 2022 - https://phabricator.wikimedia.org/T322424 (10CDanis) [16:17:55] (03PS3) 10Filippo Giunchedi: swift: add moss-fe[12]001 to swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/853324 (https://phabricator.wikimedia.org/T322424) [16:17:57] (03PS3) 10Filippo Giunchedi: hieradata: add moss-fe[12]001 to swift memcached servers [puppet] - 10https://gerrit.wikimedia.org/r/853325 (https://phabricator.wikimedia.org/T322424) [16:19:08] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37962/console" [puppet] - 10https://gerrit.wikimedia.org/r/853324 (https://phabricator.wikimedia.org/T322424) (owner: 10Filippo Giunchedi) [16:19:41] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:19:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:20:03] (03CR) 10Abijeet Patro: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853357 (owner: 10Abijeet Patro) [16:20:09] (03PS2) 10Abijeet Patro: Enable logging for UpdateMessageBundleJob [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853357 [16:20:46] (03PS6) 10BCornwall: prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) [16:20:56] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37963/console" [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [16:22:57] (03CR) 10MVernon: [C: 03+1] "LGTM thank you" [puppet] - 10https://gerrit.wikimedia.org/r/853324 (https://phabricator.wikimedia.org/T322424) (owner: 10Filippo Giunchedi) [16:23:07] (03CR) 10MVernon: [C: 03+1] "LGTM thank you" [puppet] - 10https://gerrit.wikimedia.org/r/853325 (https://phabricator.wikimedia.org/T322424) (owner: 10Filippo Giunchedi) [16:25:59] (03PS1) 10Ahmon Dancy: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/853369 [16:26:34] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe1001.eqiad.wmnet with OS bullseye [16:26:42] (03CR) 10Vgutierrez: [C: 04-1] prometheus: Rename ats_ metrics to trafficserver_ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [16:27:57] (03PS1) 10Klausman: ml-staging: fix wrong role assignment of staging workers [puppet] - 10https://gerrit.wikimedia.org/r/853370 [16:28:03] (03CR) 10Ahmon Dancy: [C: 03+2] Merge branch 
'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/853369 (owner: 10Ahmon Dancy) [16:29:08] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022 - https://phabricator.wikimedia.org/T322424 (10CDanis) [16:29:31] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe2001.codfw.wmnet with OS bullseye [16:29:32] (03Merged) 10jenkins-bot: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/853369 (owner: 10Ahmon Dancy) [16:29:34] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37965/console" [puppet] - 10https://gerrit.wikimedia.org/r/853370 (owner: 10Klausman) [16:30:15] (03CR) 10Elukey: [C: 03+1] ml-staging: fix wrong role assignment of staging workers [puppet] - 10https://gerrit.wikimedia.org/r/853370 (owner: 10Klausman) [16:30:20] (03CR) 10MVernon: [C: 03+2] swift: add moss-fe[12]001 to swift frontends [puppet] - 10https://gerrit.wikimedia.org/r/853324 (https://phabricator.wikimedia.org/T322424) (owner: 10Filippo Giunchedi) [16:30:25] (03CR) 10Klausman: [V: 03+1 C: 03+2] ml-staging: fix wrong role assignment of staging workers [puppet] - 10https://gerrit.wikimedia.org/r/853370 (owner: 10Klausman) [16:31:34] (03PS2) 10Klausman: ml-staging: fix wrong role assignment of staging workers [puppet] - 10https://gerrit.wikimedia.org/r/853370 [16:32:30] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10Dzahn) >>! 
In T322147#8369396, @jbond wrote: > thanks daniel, im not sure the process regarding namley on contractors however approval by Christina is... [16:33:15] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp4052'] [16:34:37] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp4052'] [16:34:51] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host moss-fe1001.eqiad.wmnet [16:35:13] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022 - https://phabricator.wikimedia.org/T322424 (10ops-monitoring-bot) Host rebooted by mvernon@cumin1001 with reason: None [16:35:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp4052'] [16:35:57] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-fe2001.codfw.wmnet [16:36:20] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022 - https://phabricator.wikimedia.org/T322424 (10ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: None [16:39:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:40:51] (03PS11) 10Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. 
[puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) [16:41:04] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-fe1001.eqiad.wmnet [16:41:10] (03PS7) 10BCornwall: prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) [16:41:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-fe2001.codfw.wmnet [16:43:00] 10SRE, 10SRE Program Management, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10LSobanski) 05In progress→03Resolved Agreed. [16:43:09] (03CR) 10Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [16:44:11] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37966/console" [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [16:44:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:45:35] (03CR) 10Xcollazo: "PPC runs cleanly and with easier to follow changes: https://puppet-compiler.wmflabs.org/pcc-worker1001/37967/" [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [16:46:24] (03CR) 10MVernon: [C: 03+2] hieradata: add moss-fe[12]001 to swift memcached servers [puppet] - 10https://gerrit.wikimedia.org/r/853325 (https://phabricator.wikimedia.org/T322424) (owner: 
10Filippo Giunchedi) [16:48:12] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host moss-fe1001.eqiad.wmnet [16:48:22] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022 - https://phabricator.wikimedia.org/T322424 (10ops-monitoring-bot) Host rebooted by mvernon@cumin1001 with reason: None [16:48:23] (03PS1) 10Arturo Borrero Gonzalez: cr-cloud: enable openstack heat API TCP port [homer/public] - 10https://gerrit.wikimedia.org/r/853374 (https://phabricator.wikimedia.org/T309407) [16:48:26] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-fe2001.codfw.wmnet [16:48:48] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022 - https://phabricator.wikimedia.org/T322424 (10ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: None [16:49:07] (03CR) 10BCornwall: prometheus: Rename ats_ metrics to trafficserver_ (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/851139 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [16:50:57] (03CR) 10Majavah: [C: 04-1] cr-cloud: enable openstack heat API TCP port (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/853374 (https://phabricator.wikimedia.org/T309407) (owner: 10Arturo Borrero Gonzalez) [16:51:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:52:57] (03PS2) 10Arturo Borrero Gonzalez: cr-cloud: enable openstack heat API TCP port [homer/public] - 10https://gerrit.wikimedia.org/r/853374 (https://phabricator.wikimedia.org/T309407) [16:53:42] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) 
for host moss-fe1001.eqiad.wmnet [16:54:57] 10SRE, 10ops-codfw: Toubleshoot why latest idrac version is not working on Dell servers - https://phabricator.wikimedia.org/T322419 (10Papaul) p:05Triage→03Medium [16:55:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-fe2001.codfw.wmnet [16:58:01] !log rolling restart of swift-proxies to bring moss-fe{1,2}001 into service T322424 [16:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:05] T322424: Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022 - https://phabricator.wikimedia.org/T322424 [16:58:24] (03CR) 10Arturo Borrero Gonzalez: cr-cloud: enable openstack heat API TCP port (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/853374 (https://phabricator.wikimedia.org/T309407) (owner: 10Arturo Borrero Gonzalez) [17:00:35] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=moss-fe1001.eqiad.wmnet [17:00:46] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=moss-fe1001.eqiad.wmnet [17:01:00] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=moss-fe2001.codfw.wmnet [17:01:20] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=moss-fe2001.codfw.wmnet [17:04:34] !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host dbproxy1018.eqiad.wmnet [17:04:48] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host dbproxy1018.eqiad.wmnet [17:05:19] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: Commons/multimedia errors, caused by repeated swift (cascading?) 
failures, late 2022 - https://phabricator.wikimedia.org/T322424 (10MatthewVernon) [17:06:08] (03PS3) 10Abijeet Patro: Enable logging for UpdateMessageBundleJob [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430) [17:06:21] !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host dbproxy1018.eqiad.wmnet [17:06:23] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host dbproxy1018.eqiad.wmnet [17:06:37] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022 - https://phabricator.wikimedia.org/T322424 (10MatthewVernon) moss-fe{1,2}001 now in service as extra proxies. [17:07:08] ^ cookbook is failing with "wmflib.phabricator.PhabricatorError: Unable to update Phabricator task T316195" [17:07:29] I'll try without --task-id, but if anyone knows the reason of that error let me know [17:08:00] dhinus: it's a protected task, the bot has no access to it [17:08:12] ha, thanks! 
[17:08:15] should've spotted it :) [17:08:21] !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host dbproxy1018.eqiad.wmnet [17:08:33] I added the task id to --reason instead [17:08:36] there was a discussion in I/F to give it access to NDA tasks [17:09:30] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [17:10:10] 10SRE, 10ops-codfw: Toubleshoot why latest idrac version is not working on Dell servers - https://phabricator.wikimedia.org/T322419 (10Papaul) I used cp4052 to test out this issue so what i can tell so for is when the IDRAC is at version 5.10.30, the IDRAC.WeServer.HostHeaderCheck value is 0=disable and when... [17:10:16] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10ARM support: SRE Summit 2022 Outcome of Session "Adoption of aarch64 (aka arm64) in WMF production?" 
- https://phabricator.wikimedia.org/T320811 (10jbond) [17:10:44] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:11:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:14:54] PROBLEM - PyBal IPVS diff check on lvs1018 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.242:3316, 208.80.154.242:3317, 208.80.154.242:3314, 208.80.154.242:3315, 208.80.154.242:3312, 208.80.154.242:3313, 208.80.154.242:3311, 208.80.154.242:3318]) https://wikitech.wikimedia.org/wiki/PyBal [17:15:16] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.242:3316, 208.80.154.242:3317, 208.80.154.242:3314, 208.80.154.242:3315, 208.80.154.242:3312, 208.80.154.242:3313, 208.80.154.242:3311, 208.80.154.242:3318]) https://wikitech.wikimedia.org/wiki/PyBal [17:15:47] dhinus: ^ I think these are caused from the dbproxy1018 restart? 
[17:16:21] probably, hopefully they will go away shortly [17:16:35] I also added --depool to the cookbook but that wasn't enough [17:17:51] the cookbook is still running [17:18:57] dhinus: it's because everything's depooled [17:18:58] {"dbproxy1018.eqiad.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=eqiad,cluster=wikireplicas-a,service=wikireplicas-a"} [17:19:02] {"dbproxy1019.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=wikireplicas-a,service=wikireplicas-a"} [17:19:12] depooling is to remove 1/N, but things will break if you leave none pooled [17:19:21] hmm right, I should've pooled 1019 on replicas-a [17:19:33] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host dbproxy1018.eqiad.wmnet [17:19:38] I learned a lot about LVS in the past 24 hours, but definitely not enough :D [17:20:34] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 5: https://wikitech.wikimedia.org/wiki/HAProxy [17:20:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:23:35] not sure why haproxy is not happy after the reboot [17:25:22] "pool-wikireplicas-a" on dbproxy1019 also didn't work, trying with confctl [17:25:25] !log fnegri@cumin1001 conftool action : set/pooled=yes; selector: name=dbproxy1019.eqiad.wmnet,service=wikireplicas-a [17:26:20] RECOVERY - PyBal IPVS diff check on lvs1018 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:26:38] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:28:46] dhinus: it also needs a change to its "weight" property to be non-zero I think, 
ideally [17:30:35] is it normal that the weights appear to be all at 0 for this pool? (not just for wikireplicas-a but also for -b that I didn't touch) [17:31:24] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [17:32:25] haproxy just needed a gentle prod (i.e. systemctl restart) :P [17:32:38] I think I can repool the host now [17:33:01] (03PS1) 10Andrew Bogott: add cinder_seed.sh.erb [puppet] - 10https://gerrit.wikimedia.org/r/853381 [17:33:53] (03CR) 10CI reject: [V: 04-1] add cinder_seed.sh.erb [puppet] - 10https://gerrit.wikimedia.org/r/853381 (owner: 10Andrew Bogott) [17:33:59] dhinus: no idea if it's "normal" in the specific case of wikireplicas, but in general weight should be non-zero [17:34:02] (03PS5) 10Jbond: 2.5.0: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852837 [17:34:04] (03PS1) 10Jbond: controller: Add option for basic pcc run [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/853382 [17:34:05] (for a pooled and live thing) [17:34:20] bblack: yeah I wonder if that's just a misconfiguration that we didn't notice before [17:34:56] https://wikitech.wikimedia.org/wiki/Conftool#Pooling/depooling_a_server_from_all_the_related_services [17:35:05] ^ one mention of the expected semantics, anyways, in that list of operations [17:35:47] (03PS2) 10Andrew Bogott: add cinder_seed.sh.erb [puppet] - 10https://gerrit.wikimedia.org/r/853381 [17:36:25] (03CR) 10CI reject: [V: 04-1] 2.5.0: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852837 (owner: 10Jbond) [17:36:37] (03CR) 10CI reject: [V: 04-1] add cinder_seed.sh.erb [puppet] - 10https://gerrit.wikimedia.org/r/853381 (owner: 10Andrew Bogott) [17:36:47] weight=0 there is defined as a "drain" state (if you transition weight=1 -> weight=0, the lower-level ipvs backend definition would still exist to allow existing connections to continue working, but the 
weight=0 removes it from the decision set for new fresh connections) [17:37:11] but if the whole pool has weight=0, I donno, maybe that functions ok and you just don't have the ability to drain? [17:37:56] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 172 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:38:39] bblack: interesting, thanks. it definitely looks like weight=1 would make more sense. these servers are a bit uncommon I think because all connections arrive from another proxy layer, see https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas [17:39:45] FWIW, in the current live example, your singular pooled backend with weight=0 becomes weight=1 at the underlying LVS layer [17:39:49] -> RemoteAddress:Port Forward Weight ActiveConn InActConn [17:39:52] TCP 208.80.154.242:3311 wrr [17:39:54] -> 10.64.37.27:3311 Route 1 0 0 [17:39:54] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:40:12] but I'm not sure exactly why :) [17:40:21] (03PS2) 10Jbond: controller: Add option for basic pcc run [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/853382 (https://phabricator.wikimedia.org/T289666) [17:40:34] (03PS1) 10Andrew Bogott: Cinder: add cinder-api-uwsgi.ini from X-version cinder package [puppet] - 10https://gerrit.wikimedia.org/r/853383 (https://phabricator.wikimedia.org/T305828) [17:40:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:41:18] (03CR) 10CI reject: [V: 04-1] Cinder: add cinder-api-uwsgi.ini from X-version cinder package [puppet] - 10https://gerrit.wikimedia.org/r/853383 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [17:42:20] (03CR) 10CI reject: [V: 04-1] controller: Add option for basic pcc run [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/853382 (https://phabricator.wikimedia.org/T289666) (owner: 10Jbond) [17:42:43] https://github.com/wikimedia/PyBal/blob/b331a4a4cd62b2ec519b07a69a3cc8dd7b6711d5/pybal/ipvs.py#L134 I suspect 0 gets evaluated as falsey and ipvsadm defaults weight to 0 [17:42:48] defaults to 1* [17:42:50] (probably at some layer, either pybal's etcd code, or pybal internals, or even LVS itself, if there are no non-zero weights they get set to 1 for sanity?) [17:42:59] oh look there's the answer just as I was pressing enter :) [17:43:30] thanks taavi! [17:43:47] yep, confirmed by the ipvsadm(8) man page, default weight is 1 [17:44:56] I wonder if conftool should reject trying to pool a server with weight=0 [17:44:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster [17:45:16] (03PS3) 10Andrew Bogott: add cinder_seed.sh.erb [puppet] - 10https://gerrit.wikimedia.org/r/853381 [17:45:18] (03PS2) 10Andrew Bogott: Cinder: add cinder-api-uwsgi.ini from X-version cinder package [puppet] - 10https://gerrit.wikimedia.org/r/853383 (https://phabricator.wikimedia.org/T305828) [17:45:47] taavi: maybe, but it's tricky. technically things other than conftool can edit etcd, so consumers still have to handle it sanely. 
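[editor's note] The "0 gets evaluated as falsey" behaviour described at 17:42:43 can be illustrated with a minimal Python sketch. This is not PyBal's actual code: both helper names and the shape of the command string are hypothetical, chosen only to show how a truthiness check drops an explicit weight of 0. The ipvsadm default of 1 when no -w is passed is taken from the man page confirmation at 17:43:47.

```python
def build_add_server_buggy(service, server, weight=None):
    """Hypothetical helper: build ipvsadm add-server arguments.

    Uses a truthiness check, so weight=0 is treated like "no weight given".
    """
    cmd = "-a -t %s -r %s" % (service, server)
    if weight:  # 0 is falsy in Python, so "-w 0" is never appended
        cmd += " -w %d" % weight
    return cmd


def build_add_server_fixed(service, server, weight=None):
    """Hypothetical variant: only omit -w when no weight was supplied at all."""
    cmd = "-a -t %s -r %s" % (service, server)
    if weight is not None:  # an explicit 0 is preserved
        cmd += " -w %d" % weight
    return cmd


if __name__ == "__main__":
    service, server = "208.80.154.242:3311", "10.64.37.27:3311"
    # Without -w, ipvsadm falls back to its own default weight.
    print(build_add_server_buggy(service, server, 0))
    print(build_add_server_fixed(service, server, 0))
```

With the truthiness check, a configured weight of 0 never reaches ipvsadm, which would be consistent with the backend showing up with weight 1 in the ipvsadm output pasted at 17:39:49.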
[17:46:11] it might be a nice safety-net though [17:46:17] (03CR) 10CI reject: [V: 04-1] add cinder_seed.sh.erb [puppet] - 10https://gerrit.wikimedia.org/r/853381 (owner: 10Andrew Bogott) [17:47:39] hmm although server.weight in pybal is given a type of "int" in a class definition [17:48:01] oh nevermind, I ran off the road somewhere in my brain [17:48:12] it's not because "0" is true, it's because 0 is false and thus never set [17:48:17] (03PS1) 10Zabe: Remove outdated TODO comment in wmnet template [dns] - 10https://gerrit.wikimedia.org/r/853384 [17:48:18] ok [17:48:31] (03PS4) 10Andrew Bogott: add cinder_seed.sh.erb [puppet] - 10https://gerrit.wikimedia.org/r/853381 [17:48:33] (03PS3) 10Andrew Bogott: Cinder: add cinder-api-uwsgi.ini from X-version cinder package [puppet] - 10https://gerrit.wikimedia.org/r/853383 (https://phabricator.wikimedia.org/T305828) [17:48:45] arguably, pybal should set that weight to zero if asked, otherwise draining doesn't really work right [17:49:06] it probably seems to mostly work right, if a pool full of servers with weight=100 has one drop to weight=1, it won't get very *many* new connections [17:49:25] but it still gets some [17:49:56] haha sounds one of those cases where "it's broken, but in a way that somehow still does what it should" [17:49:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:50:12] (03CR) 10Andrew Bogott: [C: 03+2] add cinder_seed.sh.erb [puppet] - 10https://gerrit.wikimedia.org/r/853381 (owner: 10Andrew Bogott) [17:50:26] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: add cinder-api-uwsgi.ini from X-version cinder package [puppet] - 10https://gerrit.wikimedia.org/r/853383 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [17:51:07] but another 
related bit of the various puzzles: in the current lvs implementation (pybal -> ipvs), for any services that use "sh" as their scheduler (source IP hashing, as opposed to e.g. wrr) [17:51:40] for "sh", the weights aren't just arbitrary and relative. Large weights have problems, which is why our "sh"-scheduled cache frontend stuff uses weight=1 as its normal value. [17:51:54] so draining definitely doesn't work there, thanks to that [17:52:15] hmm [17:52:22] (03PS1) 10Andrew Bogott: Cinder: cinder-api-uwsgi.ini is a file, not a template [puppet] - 10https://gerrit.wikimedia.org/r/853385 (https://phabricator.wikimedia.org/T305828) [17:52:43] for pooled=no, isn't the intended behavior to have it in pybal but with weight set to 0? [17:52:49] (TL;DR is "sh" uses an extremely simple and efficient traffic-hashing scheme with only 255 slots in it, and it fills the array by going "server1 gets the first weight/255 slots, server2 gets the next weight/255 slots", so if the total weights for the pool add up to more than 255, some don't get slots at all [17:52:54] ) [17:53:38] taavi: I think pooled=no is supposed to remove the entry entirely at the IPVS layer, rather than re-weight it. [17:54:03] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: cinder-api-uwsgi.ini is a file, not a template [puppet] - 10https://gerrit.wikimedia.org/r/853385 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [17:54:03] but that also sounds like what pooled=inactive would do, too, so I'm not sure what the distinction is. a lot of this got invented after I stopped paying very close attention [17:55:05] the wikitech page also says pooled=no is 'means the server is not pooled but (only in pybal) present in the config' which I interpret as being in there with zero weight [17:56:02] looking at hieradata though, "sh" is not widely used. 
just the main public traffic edge stuff, and one other service (kibana7), so that whole thing isn't a practical problem for most services [17:56:04] the reason why I'm asking is that if I'm reading the code correctly, it's actually impossible for a server to end up in ipvs with zero weight [17:56:14] to add more confusion, I'm pretty sure yesterday the non-pooled services were set to "inactive", but now that I've run "depool-wikireplicas-", they're set to "no" [17:56:21] taavi: yeah I'm pretty sure it is impossible [17:56:24] so I'm not sure which is the "correct" value [17:57:09] with the conftool+pybal stuff we do for Traffic, we do use pooled=no and do see the backend server get fully-removed at the ipvs level [17:57:20] (03PS1) 10BCornwall: prometheus: Alter node_ats_config class to include [puppet] - 10https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) [17:57:29] (its IP isn't there anymore, as opposed to being there with some 0 or 1 weight) [17:58:15] there's some additional confusion in all of this because pybal has its own internal state-tracking of the set of servers for a service, which is independent of ipvs and etcd, and has some of its own logic [17:58:49] (like depool thresholds, where it will stop depooling things at the IPVS level for etcd and/or healthcheck reasons, because too many servers are missing already) [17:59:17] (03CR) 10CI reject: [V: 04-1] prometheus: Alter node_ats_config class to include [puppet] - 10https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [17:59:57] ok, the source code suggests that indeed pooled=no gets entirely removed from ipvs [18:00:08] (03CR) 10Herron: dispatch: sync user role and info from LDAP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [18:00:16] (we've long suspected there are some bugs there too, where once a service enters the 
thresholded set, if more depools try to happen, it loses track of how to recover the state later) [18:00:16] * taavi clearly has fallen into a rather deep rabbit hole [18:00:50] it's a recursive one, so you'll eventually see some familiar territory, but the trip never ends :) [18:01:52] dhinus: looks like for this case those two are pretty much the same, so it probably doesn't matter [18:02:02] yeah [18:02:23] for mediawiki, there's a difference since pooled=no still gets code updates and pooled=inactive doesn't, but that doesn't apply here [18:02:23] cool [18:03:14] thanks both for your help, I need to log off now, but I will probably explore this a bit more next week [18:03:22] left some notes in https://phabricator.wikimedia.org/T316195 [18:04:40] IIRC the nature of the state bug at the pybal layer was something along the lines of this example: you have an 8-server set and a depool_threshold of 0.5 (meaning there have to be 4 in ipvs at all times). 4 of them auto-depool due to failing pybal healthchecks (now we're at the limit). Someone does a manual depool of a 5th server, and pybal basically negates that action (can't go below depool [18:04:46] threshold). Then the other 4 servers become healthy and get repooled (healthchecks now succeeding), but the manually-depooled one remains pooled with no apparent/current reason it should be so. 
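[editor's note] The 8-server scenario recalled at 18:04:40 can be walked through with a toy model. This is a sketch under stated assumptions, not PyBal's implementation: the class ToyPool, its attributes and the reevaluate() method are all hypothetical, and the "ignored depool is never revisited" behaviour is modelled only from the recollection above (which is itself prefixed with IIRC).

```python
import math


class ToyPool:
    """Toy model of a pool with a depool threshold (not PyBal's real logic)."""

    def __init__(self, names, depool_threshold):
        # Desired state per server (health checks and operator requests).
        self.want_pooled = {name: True for name in names}
        # Servers actually present at the (simulated) IPVS layer.
        self.in_ipvs = set(names)
        # Minimum number of servers that must stay in IPVS.
        self.min_pooled = math.ceil(len(names) * depool_threshold)

    def depool(self, name):
        """Record the request; only act on it if the threshold allows."""
        self.want_pooled[name] = False
        if len(self.in_ipvs) > self.min_pooled:
            self.in_ipvs.discard(name)
        # Otherwise the request is skipped and never revisited.

    def repool(self, name):
        self.want_pooled[name] = True
        self.in_ipvs.add(name)

    def reevaluate(self):
        """One possible remedy (an assumption): retry deferred depools."""
        for name, wanted in self.want_pooled.items():
            blocked = len(self.in_ipvs) <= self.min_pooled
            if not wanted and name in self.in_ipvs and not blocked:
                self.in_ipvs.discard(name)


if __name__ == "__main__":
    pool = ToyPool(["s%d" % i for i in range(1, 9)], depool_threshold=0.5)
    for name in ("s1", "s2", "s3", "s4"):   # health checks fail
        pool.depool(name)
    pool.depool("s5")                        # manual depool, blocked at the limit
    for name in ("s1", "s2", "s3", "s4"):   # health checks recover
        pool.repool(name)
    print("s5 still serving:", "s5" in pool.in_ipvs)
    pool.reevaluate()
    print("s5 after re-evaluation:", "s5" in pool.in_ipvs)
```

In this toy model the manually depooled server keeps serving after the other four recover, because nothing re-checks the deferred request; a global re-evaluation pass after each state change is one way the stale state could be cleared.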
[18:05:49] basically it's not globally re-evaluating the state of the pool after such changes, and/or not tracking its own past temporary state overrides, whichever way you want to look at it [18:07:53] (03PS2) 10BCornwall: prometheus: Alter node_ats_config class to include [puppet] - 10https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) [18:08:28] (03CR) 10CI reject: [V: 04-1] prometheus: Alter node_ats_config class to include [puppet] - 10https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [18:09:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [18:12:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [18:14:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:21:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:31:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS buster [18:43:48] (03CR) 10Andrew Bogott: Add upgrade_openstack_node.py (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [18:45:31] (03PS3) 10BCornwall: prometheus: Alter node_ats_config class to include [puppet] - 10https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) [18:46:18] (03CR) 10CI reject: [V: 04-1] prometheus: 
Alter node_ats_config class to include [puppet] - 10https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [18:46:25] win 16 [18:46:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:47:31] (03PS3) 10Andrew Bogott: Rename live_upgrade_ussuri_to_victoria.py to remove version-specific name [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852312 [18:47:33] (03PS4) 10Andrew Bogott: Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 [18:50:49] (03CR) 10CI reject: [V: 04-1] Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [18:53:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:54:47] (03PS4) 10Ssingh: prometheus: Alter node_ats_config class to include [puppet] - 10https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [18:58:37] (03CR) 10BCornwall: "Valentin, it doesn't seem right to me to turn off linting - what is the expectation for properly including this while adhering to WMF styl" [puppet] - 10https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [19:06:48] (03PS5) 10Andrew Bogott: Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 [19:10:13] (03CR) 10CI reject: [V: 04-1] Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 
10Andrew Bogott) [19:16:29] (03PS6) 10Andrew Bogott: Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 [19:18:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:19:51] (03CR) 10CI reject: [V: 04-1] Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [19:21:54] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:23:53] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:23:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:24:42] (03CR) 10Dzahn: [C: 03+2] gitlab_runner: run cleanup of docker cache twice daily [puppet] - 10https://gerrit.wikimedia.org/r/853312 (https://phabricator.wikimedia.org/T310593) (owner: 10Jelto) [19:25:55] (03CR) 10Vgutierrez: [C: 04-1] prometheus: Alter node_ats_config class to include (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [19:27:37] (03CR) 10Dzahn: prometheus: Alter 
node_ats_config class to include (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [19:29:32] (03CR) 10Vgutierrez: [C: 04-1] prometheus: Alter node_ats_config class to include (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [19:34:20] (03CR) 10Dzahn: "ACK, disregard my comment. was just trying to help with the lint. good then" [puppet] - 10https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [19:35:57] (03PS1) 10Vgutierrez: swift: Ramp up ms-be08 rebalance [puppet] - 10https://gerrit.wikimedia.org/r/853401 (https://phabricator.wikimedia.org/T322231) [19:36:50] (03PS1) 10Vlad.shapik: Add ability to specify a DPI value for PDF [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 [19:37:17] (03CR) 10Dzahn: [C: 03+2] "Trigger: Sat 2022-11-05 05:00:00 UTC; 9h left - deployed on runner-1021" [puppet] - 10https://gerrit.wikimedia.org/r/853312 (https://phabricator.wikimedia.org/T310593) (owner: 10Jelto) [19:38:04] (03PS2) 10Vlad.shapik: Add ability to specify a DPI value for PDF [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T256959) [19:38:35] (03CR) 10Vgutierrez: [C: 03+2] swift: Ramp up ms-be08 rebalance [puppet] - 10https://gerrit.wikimedia.org/r/853401 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez) [19:39:38] (03PS7) 10Andrew Bogott: Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 [19:40:41] (03PS3) 10Vlad.shapik: WP:Add ability to specify a DPI value for PDF [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T256959) [19:42:03] 10SRE, 10Infrastructure-Foundations: Initial production deployment of the IDM - https://phabricator.wikimedia.org/T320797 (10Dzahn) [19:42:37] 10SRE, 
10Infrastructure-Foundations: Implement a staging setup for the IDM - https://phabricator.wikimedia.org/T320795 (10Dzahn) [19:43:40] (03PS1) 10Jbond: differ: change PuppetCatalog paramter to dict from file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/853403 (https://phabricator.wikimedia.org/T322437) [19:43:42] (03CR) 10CI reject: [V: 04-1] Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [19:43:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:46:02] 10SRE, 10Infrastructure-Foundations, 10Mail: 2022-05-09 Exim BDAT Errors incident - https://phabricator.wikimedia.org/T309238 (10Krinkle) [19:46:36] (03CR) 10CI reject: [V: 04-1] differ: change PuppetCatalog paramter to dict from file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/853403 (https://phabricator.wikimedia.org/T322437) (owner: 10Jbond) [19:46:37] 10SRE, 10Wikimedia-Incident: text-https:443 has failed probes (retrospective task) - https://phabricator.wikimedia.org/T309178 (10Krinkle) 05Open→03Resolved a:03jbond [19:46:39] (03CR) 10Dzahn: "ACK, I see it. Your description of the deploy sequence tells me it should NOT be merged yet because other changes in the other repo need t" [puppet] - 10https://gerrit.wikimedia.org/r/853061 (https://phabricator.wikimedia.org/T319378) (owner: 10Hashar) [19:48:49] (03CR) 10Ssingh: prometheus: Alter node_ats_config class to include (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [19:50:58] (03CR) 10Andrew Bogott: "The linter has now told me that my closing docstring triple-quote both should and should not be on its own line. I give up!" 
[cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852313 (owner: Andrew Bogott) [19:50:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:52:43] (PS2) Jbond: differ: change PuppetCatalog paramter to dict from file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/853403 (https://phabricator.wikimedia.org/T322437) [19:53:59] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [19:54:22] (CR) Cwhite: [C: +2] beta-logs: enable gc_log on collector nodes to match production [puppet] - https://gerrit.wikimedia.org/r/844007 (https://phabricator.wikimedia.org/T304440) (owner: Cwhite) [19:54:39] (CR) Dzahn: [V: -1] "Class[Profile::Dumps::Distribution::Ferm]: parameter 'rsync_mirrors' expects a Hash value, got Tuple" [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [19:54:47] (PS2) Cwhite: beta-logs: add new hosts [puppet] - https://gerrit.wikimedia.org/r/844563 (https://phabricator.wikimedia.org/T321410) [19:55:09] (CR) CI reject: [V: -1] differ: change PuppetCatalog paramter to dict from file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/853403 (https://phabricator.wikimedia.org/T322437) (owner: Jbond) [19:55:59] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:59:27] (PS3) Dzahn: dumps/distribution: add more data types to parameters [puppet] - https://gerrit.wikimedia.org/r/852260 [20:00:04] (CR) CI reject: [V: -1] dumps/distribution: add more data types to parameters [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [20:00:16] (PS4) Vlad.shapik: WP:Add ability to specify a DPI value for PDF [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T256959) [20:00:22] (PS5) Ssingh: prometheus: Alter node_ats_config class to include [puppet] - https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall) [20:00:28] (CR) Cwhite: [C: +2] beta-logs: add new hosts [puppet] - https://gerrit.wikimedia.org/r/844563 (https://phabricator.wikimedia.org/T321410) (owner: Cwhite) [20:00:32] (PS4) Dzahn: dumps/distribution: add more data types to parameters [puppet] - https://gerrit.wikimedia.org/r/852260 [20:04:57] (CR) Ssingh: prometheus: Alter node_ats_config class to include (1 comment) [puppet] - https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall) [20:06:15] (PS7) Andrew Bogott: Remove obsolete files for OpenStack version Victoria [puppet] - https://gerrit.wikimedia.org/r/851123 [20:06:53] (CR) CI reject: [V: -1] Remove obsolete files for OpenStack version Victoria [puppet] - https://gerrit.wikimedia.org/r/851123 (owner: Andrew Bogott) [20:09:21] (PS3) Jbond: differ: change PuppetCatalog paramter to dict from file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/853403 (https://phabricator.wikimedia.org/T322437) [20:10:16] (PS5)
Andrew Bogott: wmf spec tests: Update to test Bullseye/Xena [puppet] - https://gerrit.wikimedia.org/r/851126 [20:10:18] (PS8) Andrew Bogott: Remove obsolete files for OpenStack version Victoria [puppet] - https://gerrit.wikimedia.org/r/851123 [20:10:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:11:00] (CR) CI reject: [V: -1] wmf spec tests: Update to test Bullseye/Xena [puppet] - https://gerrit.wikimedia.org/r/851126 (owner: Andrew Bogott) [20:12:21] (PS6) Andrew Bogott: wmf spec tests: Update to test Bullseye/Xena [puppet] - https://gerrit.wikimedia.org/r/851126 [20:12:23] (PS9) Andrew Bogott: Remove obsolete files for OpenStack version Victoria [puppet] - https://gerrit.wikimedia.org/r/851123 [20:12:35] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37973/console" [puppet] - https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall) [20:13:02] (CR) Vgutierrez: [V: +1 C: +1] prometheus: Alter node_ats_config class to include [puppet] - https://gerrit.wikimedia.org/r/853387 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall) [20:16:25] (CR) Andrew Bogott: wmf spec tests: Update to test Bullseye/Xena (1 comment) [puppet] - https://gerrit.wikimedia.org/r/851126 (owner: Andrew Bogott) [20:17:08] (CR) Andrew Bogott: wmf spec tests: Update to test Bullseye/Xena (1 comment) [puppet] - https://gerrit.wikimedia.org/r/851126 (owner: Andrew Bogott) [20:18:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues -
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:26:21] PROBLEM - SSH on mw1312.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:39:51] (PS4) Jbond: differ: change PuppetCatalog paramter to dict from file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/853403 (https://phabricator.wikimedia.org/T322437) [20:40:13] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST namespaces) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:41:29] (CR) CI reject: [V: -1] differ: change PuppetCatalog paramter to dict from file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/853403 (https://phabricator.wikimedia.org/T322437) (owner: Jbond) [20:43:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:48:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:50:23] (PS5) Jbond: differ: change PuppetCatalog paramter to dict from file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/853403 (https://phabricator.wikimedia.org/T322437) [20:54:58] (PS6) Jbond: differ: change PuppetCatalog paramter to dict from file [software/puppet-compiler] -
https://gerrit.wikimedia.org/r/853403 (https://phabricator.wikimedia.org/T322437) [20:56:37] (CR) CI reject: [V: -1] differ: change PuppetCatalog paramter to dict from file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/853403 (https://phabricator.wikimedia.org/T322437) (owner: Jbond) [21:08:49] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [21:08:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:10:49] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [21:15:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:23:25] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:23:45] PROBLEM - SSH on an-coord1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:25:13] (PS7) Jbond: differ: change PuppetCatalog paramter to dict from file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/853403 (https://phabricator.wikimedia.org/T322437) [21:27:07] (CR) Hashar: "Exactly. This removes gerrit-theme.js from Puppet in favor of a version in operations/software/gerrit which is deployed with scap.
I furth" [puppet] - https://gerrit.wikimedia.org/r/853061 (https://phabricator.wikimedia.org/T319378) (owner: Hashar) [21:27:21] (CR) CI reject: [V: -1] differ: change PuppetCatalog paramter to dict from file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/853403 (https://phabricator.wikimedia.org/T322437) (owner: Jbond) [21:34:14] (CR) Dzahn: [V: -1] "parameter 'rsync_mirrors' expects size to be 2, got 14" [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [21:35:08] (PS5) Dzahn: dumps/distribution: add more data types to parameters [puppet] - https://gerrit.wikimedia.org/r/852260 [21:35:46] (CR) CI reject: [V: -1] dumps/distribution: add more data types to parameters [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [21:36:29] (CR) Dzahn: "Ok, or we can do it together when it's still early in US day. np" [puppet] - https://gerrit.wikimedia.org/r/853061 (https://phabricator.wikimedia.org/T319378) (owner: Hashar) [21:36:32] (CR) Hashar: Add upgrade_openstack_node.py (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852313 (owner: Andrew Bogott) [21:36:59] mutante: it is 11pm here ;) too late for such a change even if I tested it locally :-) [21:37:13] (PS6) Dzahn: dumps/distribution: add more data types to parameters [puppet] - https://gerrit.wikimedia.org/r/852260 [21:37:57] hashar: that is precisely why I said "when it's still earlier in the US day". I tried hard to avoid the IRC ping.
have a good weekend [21:40:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:47:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:52:33] mutante: happy week-end :-] [21:53:53] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 111 probes of 695 (alerts on 90) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:59:53] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 695 (alerts on 90) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:01:08] SRE, ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (Papaul) [22:05:03] SRE, ops-ulsfo: ulsfo: cp4052 repro whole provisioning process - https://phabricator.wikimedia.org/T322238 (Papaul) After working on cp4052 i have some thoughts why the provision cookbook failed in the first place on those R450's. I will be 100% sure after i do more testing on the R650 to double check so... [22:05:58] SRE, Traffic: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 (Krinkle) >>!
In T316338#8205843, @Vgutierrez wrote: > As a direct result cache hitrate shows up to a 100% increase in the text cluster at the ats layer […] Images for future reference, as from... [22:07:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:25:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:28:14] RECOVERY - SSH on mw1312.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:35:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:44:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:52:00] SRE, Discovery-Search, Observability-Alerting, Traffic: Use DNS name instead of IP in PyBal alerts - https://phabricator.wikimedia.org/T322377 (Dzahn) The check is defined in: ` modules/pybal/manifests/monitoring.pp: nrpe::plugin { 'check_pybal_ipvs_diff': ` so it runs a command via NRPE o...
[22:52:56] (CR) Dzahn: [V: -1 C: -1] "parameter 'stats_hosts' expects an Array value, got String" [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [23:05:48] (CR) Dzahn: [C: +1] Set profile::contacts::role_contacts for contint* to ServiceOps-Collab [puppet] - https://gerrit.wikimedia.org/r/852832 (owner: Muehlenhoff) [23:09:21] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:09:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:11:10] (PS1) Dzahn: roles: add/update role contacts for aphlict,miscweb,planet,rt [puppet] - https://gerrit.wikimedia.org/r/853454 [23:12:21] (CR) Dzahn: "Let's discuss this - WIP" [puppet] - https://gerrit.wikimedia.org/r/853454 (owner: Dzahn) [23:17:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:25:13] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:32:28] SRE, Traffic, Patch-For-Review, Performance-Team (Radar), Upstream: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (Krinkle) For future reference, some additional graphs captured over a slightly wider ra...
[23:37:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:46:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown